- [Model description](#model-description)
- [Intended uses and limitations](#intended-uses-and-limitations)
- [Run the model](#run-the-model)
- [Training](#training)
- [Evaluation](#evaluation)
- [Citation](#citation)

In recent years, Large Language Models (LLMs) have demonstrated exceptional proficiency across a broad spectrum of Natural Language Processing (NLP) tasks, including Machine Translation. However, previous methodologies predominantly relied on iterative processes such as instruction fine-tuning or continual pre-training, leaving unexplored the challenges of training LLMs solely on parallel data. In this work, we introduce Plume (**P**arallel **L**ang**u**age **M**od**e**l), a collection of three 2B LLMs featuring varying vocabulary sizes (32k, 128k, and 256k) trained exclusively on Catalan-centric parallel examples. These models perform comparably to previous encoder-decoder architectures on 16 supervised translation directions and 56 zero-shot ones.

For more details regarding the model architecture, the dataset, and model interpretability, take a look at the paper, which is available on [arXiv]().

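Plume is released as three checkpoints that differ only in vocabulary size (32k, 128k, and 256k). As a minimal sketch, the snippet below compares the tokenizers of the three variants; the `Plume128k` and `Plume256k` repository IDs are an assumption based on the naming of `projecte-aina/Plume32k`.

```python
from transformers import AutoTokenizer

# Hypothetical repository IDs: the 128k and 256k names assume the same
# naming pattern as the "projecte-aina/Plume32k" checkpoint shown below.
for model_id in ["projecte-aina/Plume32k", "projecte-aina/Plume128k", "projecte-aina/Plume256k"]:
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    print(model_id, "->", len(tokenizer), "tokens in the vocabulary")
```
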
## Intended Uses and Limitations
The model is proficient in 16 supervised translation directions that include Catalan and is capable of translating in 56 other zero-shot directions as well.

At the time of submission, no measures have been taken to estimate the bias and added toxicity embedded in the model. However, we are aware that our models may be biased, since the corpora have been collected using crawling techniques on multiple web sources. We intend to conduct research in these areas in the future; if completed, this model card will be updated.

## Run the model
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the checkpoint with the 32k vocabulary
model_id = "projecte-aina/Plume32k"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Source and target languages, given as an ISO 639-3 code plus script
src_lang_code = 'spa_Latn'
tgt_lang_code = 'cat_Latn'
sentence = 'Ayer se fue, tomó sus cosas y se puso a navegar.'

# Build the translation prompt: "<s> [src_lang] source sentence \n[tgt_lang]"
prompt = '<s> [{}] {} \n[{}]'.format(src_lang_code, sentence, tgt_lang_code)

input_ids = tokenizer(prompt, return_tensors='pt').input_ids
output_ids = model.generate(input_ids, max_length=200, num_beams=5)

# Keep only the tokens generated after the prompt, i.e. the translation
input_length = input_ids.shape[1]
generated_text = tokenizer.decode(output_ids[0, input_length:], skip_special_tokens=True).strip()
print(generated_text)
```
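The same prompt format also covers the zero-shot directions, in which neither the source nor the target language is Catalan. The sketch below reuses the `tokenizer` and `model` loaded above; `eng_Latn` is an assumed target code here, so replace it with any language code actually covered by the model.

```python
# Zero-shot direction sketch: Spanish -> English (neither side is Catalan).
# 'eng_Latn' is an assumed language code; swap in any code the model was trained on.
src_lang_code = 'spa_Latn'
tgt_lang_code = 'eng_Latn'
sentence = 'Ayer se fue, tomó sus cosas y se puso a navegar.'

prompt = '<s> [{}] {} \n[{}]'.format(src_lang_code, sentence, tgt_lang_code)
input_ids = tokenizer(prompt, return_tensors='pt').input_ids
output_ids = model.generate(input_ids, max_length=200, num_beams=5)
print(tokenizer.decode(output_ids[0, input_ids.shape[1]:], skip_special_tokens=True).strip())
```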