Update README.md

README.md (CHANGED)

@@ -54,7 +54,7 @@
This model is fine-tuned for paraphrasing tasks and is fine-tuned at the sentence level only. It is not intended to be used in any other case and cannot be fine-tuned for any other task with the full performance of the base model. It is also not guaranteed that this model will work without the specified prompts.

### Training Procedure
Pre-trained for 8 days on a total of 84B tokens. Fine-tuned for 25 epochs.

#### Hardware
- **GPUs**: 8 x Nvidia A100-80 GB

#### Software

@@ -65,17 +65,18 @@
- **Training objective**: Sentence permutation and span masking (mask lengths sampled from a Poisson distribution with λ = 3.5, masking 30% of tokens); a rough sketch of this noising scheme is given after this list
- **Optimizer**: Adam (β1 = 0.9, β2 = 0.98, ε = 1e-6)
- **Scheduler**: Custom scheduler from the original Transformers paper (20,000 warm-up steps)
- **Weight Initialization**: Model Enlargement from VBART-Large. See the related section in the [paper](https://arxiv.org/abs/2403.01308) for details.
- **Dropout**: 0.1 (dropped to 0.05 and then to 0 in the last 80K and 80K steps, respectively)
- **Initial learning rate**: 5e-6
- **Training tokens**: 84B
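
A rough sketch of the noising scheme above (BART-style sentence permutation plus Poisson span masking) is shown below. It is purely illustrative: it works on pre-tokenized sentences, and the mask token name and helper functions are assumptions rather than the actual VBART preprocessing code.

```python
import random
import numpy as np

MASK_TOKEN = "<mask>"   # assumed mask-token symbol; the real vocabulary may differ
POISSON_LAMBDA = 3.5    # span lengths sampled from Poisson(λ = 3.5)
MASK_RATIO = 0.30       # mask roughly 30% of the tokens

def permute_sentences(doc):
    """Sentence permutation: shuffle the order of sentences in a document."""
    shuffled = list(doc)
    random.shuffle(shuffled)
    return shuffled

def span_mask(tokens):
    """Span masking: replace Poisson-length spans with a single mask token
    until roughly MASK_RATIO of the original tokens are covered."""
    budget = int(len(tokens) * MASK_RATIO)
    out = list(tokens)
    while budget > 0 and len(out) > 1:
        length = int(min(np.random.poisson(POISSON_LAMBDA), budget, len(out) - 1))
        start = random.randrange(len(out) - length)
        out = out[:start] + [MASK_TOKEN] + out[start + length:]
        budget -= max(length, 1)  # zero-length spans still consume budget, so the loop ends
    return out

# Toy example: noise a two-sentence "document" of whitespace tokens.
doc = [["bu", "bir", "örnek", "cümledir"], ["model", "bunu", "yeniden", "üretir"]]
noisy = [span_mask(sentence) for sentence in permute_sentences(doc)]
```

In the actual pipeline the noising operates on subword IDs and interacts with sequence packing; treat the sketch only as a reading aid for the bullet list.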

##### Fine-tuning
- **Training regime**: fp16 mixed precision (a configuration sketch is given after this list)
- **Optimizer**: Adam (β1 = 0.9, β2 = 0.98, ε = 1e-6)
- **Scheduler**: Linear decay scheduler
- **Dropout**: 0.1
- **Learning rate**: 5e-6
- **Fine-tune epochs**: 55
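
For readers who want to reproduce the fine-tuning stage with Hugging Face `transformers`, the list above maps roughly onto the following `Seq2SeqTrainingArguments`. This is an illustrative sketch, not the authors' training script: the `output_dir` is a placeholder, and the batch size and data pipeline are not specified in this card.

```python
from transformers import Seq2SeqTrainingArguments

# Hyperparameters mirror the Fine-tuning bullet list; everything else is assumed.
training_args = Seq2SeqTrainingArguments(
    output_dir="vbart-paraphrasing-finetune",  # placeholder path
    fp16=True,                   # fp16 mixed precision
    learning_rate=5e-6,          # learning rate
    adam_beta1=0.9,              # Adam β1
    adam_beta2=0.98,             # Adam β2
    adam_epsilon=1e-6,           # Adam ε
    lr_scheduler_type="linear",  # linear decay scheduler
    num_train_epochs=55,         # fine-tune epochs
)
```

Dropout (0.1) is a model-configuration setting rather than a training argument (e.g. `model.config.dropout = 0.1` for BART-style configs).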

#### Metrics
![image/png](https://cdn-uploads.huggingface.co/production/uploads/62f8b3c84588fe31f435a92b/nrM_FA3bGk9NAYW_044HW.png)