---
license: apache-2.0
language:
- en
- sr
- hr
- bs
---

# Prodigy SM Base v0.1
|
|
<img src="https://cdn-uploads.huggingface.co/production/uploads/617bbeec14572ebe9e6ea83f/4p2zaOWu6kTS3fcbevHef.png" width="70%" height="70%">
|
|
|
In our latest endeavour, we performed continued pre-training of a large language model (Mistral-7B-v0.1) to understand and generate text in new languages, including **Serbian**, **Bosnian** and **Croatian**, using an innovative approach.
|
|
|
Rather than relying solely on extensive datasets in the target language, our method uses a more compact set of synthetic and human-curated data, along with a mixture of Common Crawl web data, applied in two strategic phases:
|
|
|
1. Establishing a comprehensive demonstration of all grammatical and orthographic rules pertinent to the language.
2. Supplying a diverse array of examples that not only reinforce these rules but also integrate a wide range of linguistic nuances.
|
|
|
While our approach is uniquely tailored to our objectives, we have drawn some inspiration from recent advancements in language model training. Specifically, the conceptual strategies discussed in the paper [Adapting Large Language Models via Reading Comprehension](https://arxiv.org/pdf/2309.09530.pdf) provided valuable insights, though our methods diverge significantly in practice. With this approach, we aim to teach the model new languages efficiently, with a balanced blend of accuracy and linguistic diversity.
|
|
|
So... Did it work?!

# **Yes!**

See the benchmark results, or even better, download the model and try it yourself. As you know by now, there's no better benchmark than a quick "try it yourself" vibe check. :)

<img src="https://cdn-uploads.huggingface.co/production/uploads/617bbeec14572ebe9e6ea83f/C9m_OjnYEpQo43VCrwz4A.png" width="100%" height="100%">
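If you'd like to run that vibe check locally, here is a minimal sketch using the Hugging Face `transformers` library; the repo id below is a placeholder for illustration, so substitute this model's actual Hub id:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repo id -- substitute the actual Hugging Face Hub id of this model.
model_id = "your-org/prodigy-sm-base-v0.1"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",  # load weights in the checkpoint's native precision
    device_map="auto",   # requires `accelerate`; spreads layers across available GPUs
)

# This is a base model, so we use a plain completion prompt (no chat template).
prompt = "Nikola Tesla je bio"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```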
|
|
|
Here we show the results of a benchmark that is not frequently performed, yet is equally important: how adapting the model to a new language affected its original English-only performance.

<img src="https://cdn-uploads.huggingface.co/production/uploads/617bbeec14572ebe9e6ea83f/IPY0myfQI-Ne5x6b11glz.png" width="100%" height="100%">
|
|
|
*All evals are performed in a zero-shot manner.

*Also bear in mind that the llama-2-7b, llama-3-8b and mistral-7b models compared against Prodigy SM Base weren't trained on extensive Serbian-language datasets; these benchmarks demonstrate that primarily English models can be adapted to other languages.
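For reference, here is a minimal sketch of how such zero-shot evals can be run with recent versions of lm-evaluation-harness's Python API. The repo id and task names are placeholders (the actual Serbian tasks come from the serbian-llm-eval adaptation credited below):

```python
import lm_eval

# Zero-shot evaluation sketch; repo id and task names are placeholders.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=your-org/prodigy-sm-base-v0.1,dtype=bfloat16",
    tasks=["hellaswag", "arc_easy"],  # substitute the serbian-llm-eval task names
    num_fewshot=0,                    # all evals reported above are zero-shot
)
print(results["results"])
```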
|
|
|
So, as you can see, we successfully improved the original model's performance on Serbian-language use cases while retaining, and even slightly improving, its performance on English.
|
|
|
### Training results

Training results of the continued pre-training of [mistral-7b-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1):

<img src="https://cdn-uploads.huggingface.co/production/uploads/617bbeec14572ebe9e6ea83f/5xeJ-vfWk4RhJNC7t5I0g.png" width="70%" height="70%">

<img src="https://cdn-uploads.huggingface.co/production/uploads/617bbeec14572ebe9e6ea83f/R4R8ai8LaN3WlYCOenUyb.png" width="70%" height="70%">
|
|
|
As a final experimental step, we merged the resulting model with **Mistral-7B-v0.1** and two earlier checkpoints of **prodigy-sm-base** using the [Model Stock](https://arxiv.org/abs/2403.19522) method.
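For illustration, a merge like this can be expressed as a [mergekit](https://github.com/arcee-ai/mergekit) config using its `model_stock` merge method. The checkpoint paths below are placeholders, not our actual checkpoints:

```python
# Sketch: write a mergekit config for a Model Stock merge, then run it with
# mergekit's CLI. Checkpoint paths are placeholders for illustration only.
config = """\
merge_method: model_stock
base_model: mistralai/Mistral-7B-v0.1
models:
  - model: ./prodigy-sm-base/checkpoint-early  # earlier checkpoint (placeholder)
  - model: ./prodigy-sm-base/checkpoint-mid    # earlier checkpoint (placeholder)
  - model: ./prodigy-sm-base/checkpoint-final  # final continued-pretrain model (placeholder)
dtype: bfloat16
"""

with open("model_stock.yml", "w") as f:
    f.write(config)

# Then, from the shell:
#   mergekit-yaml model_stock.yml ./prodigy-sm-base-merged
```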
|
|
|
# Notes

As this is a base model, it has no chat template or instruction-following capabilities. That makes it a good candidate either for further pre-training on Serbian-language data, where there is still plenty of room for improvement (you can hit a sweet spot), or for the next step in the pipeline, such as some form of chat or instruct tuning; a sketch of the first route follows below.
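If you pursue further pre-training, a minimal sketch with the Hugging Face `Trainer` could look like this; the repo id, corpus file and hyperparameters are placeholders, and in practice you would tune them carefully:

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# Placeholder repo id and corpus path -- substitute your own.
model_id = "your-org/prodigy-sm-base-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Plain-text Serbian corpus, tokenized for causal language modeling.
ds = load_dataset("text", data_files={"train": "serbian_corpus.txt"})
ds = ds.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=2048),
            batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="prodigy-sm-continued", bf16=True,
                           per_device_train_batch_size=1, num_train_epochs=1),
    train_dataset=ds["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```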
|
|
|
If you want a model that is already instruction-tuned, we did that too; check out **Prodigy SM Instruct v0.1**:

# Prodigy SM Instruct v0.1

[prodigy-sm-instruct]() **COMING SOON**
|
|
|
And stay tuned for:

[prodigy-sm-base (llama-3)]() **COMING SOON**

[prodigy-sm-instruct (llama-3)]() **COMING SOON**
|
|
|
Also, we are excited to announce that [iskon.ai](https://Iskon.ai) will soon launch an API platform featuring the **Prodigy** series of models, advanced AI tools and much more!
|
|
|
|
|
# Thanks

- [gordicaleksa/serbian-llm-eval](https://github.com/gordicaleksa/serbian-llm-eval) and his community for curating the translation and adaptation of [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) that we used to perform the benchmarks
- [jondurbin](https://huggingface.co/jondurbin) for the amazing airoboros framework
- [teknium](https://huggingface.co/teknium) for various insights shared on Discord and Twitter (aka x.com)
- [Eric](https://twitter.com/erhartford) for various insights shared on Discord and Twitter (aka x.com)
- [mergekit](https://github.com/arcee-ai/mergekit) for model merging tools
|
|
|
*Huge thanks to [Redmond.ai](https://redmond.ai) for generous DGX cloud credits*
|
|