Update README.md

README.md (CHANGED)

@@ -54,7 +54,7 @@
This model is fine-tuned for paraphrasing tasks and is fine-tuned at the sentence level only. It is not intended to be used in any other case and cannot be fine-tuned for any other task with the full performance of the base model. It is also not guaranteed that this model will work without the specified prompts.

### Training Procedure
Pre-trained for 8 days on a total of 84B tokens. Fine-tuned for 25 epochs.

#### Hardware
- **GPUs**: 8 x Nvidia A100-80 GB

#### Software

@@ -65,17 +65,18 @@
- **Training objective**: Sentence permutation and span masking (mask lengths sampled from a Poisson distribution with λ = 3.5, masking 30% of tokens); a rough sketch of this noising scheme is given after this list
- **Optimizer**: Adam (β1 = 0.9, β2 = 0.98, ε = 1e-6)
- **Scheduler**: Custom scheduler from the original Transformers paper (20,000 warm-up steps)
- **Weight Initialization**: Model Enlargement from VBART-Large. See the related section in the [paper](https://arxiv.org/abs/2403.01308) for details.
- **Dropout**: 0.1 (dropped to 0.05 and then to 0 in the last 80K and 80K steps, respectively)
- **Initial learning rate**: 5e-6
- **Training tokens**: 84B
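
A rough sketch of the noising scheme above (BART-style sentence permutation plus Poisson span masking) is shown below. It is purely illustrative: it works on pre-tokenized sentences, and the mask token name and helper functions are assumptions rather than the actual VBART preprocessing code.

```python
import random
import numpy as np

MASK_TOKEN = "<mask>"   # assumed mask-token symbol; the real vocabulary may differ
POISSON_LAMBDA = 3.5    # span lengths sampled from Poisson(λ = 3.5)
MASK_RATIO = 0.30       # mask roughly 30% of the tokens

def permute_sentences(doc):
    """Sentence permutation: shuffle the order of sentences in a document."""
    shuffled = list(doc)
    random.shuffle(shuffled)
    return shuffled

def span_mask(tokens):
    """Span masking: replace Poisson-length spans with a single mask token
    until roughly MASK_RATIO of the original tokens are covered."""
    budget = int(len(tokens) * MASK_RATIO)
    out = list(tokens)
    while budget > 0 and len(out) > 1:
        length = int(min(np.random.poisson(POISSON_LAMBDA), budget, len(out) - 1))
        start = random.randrange(len(out) - length)
        out = out[:start] + [MASK_TOKEN] + out[start + length:]
        budget -= max(length, 1)  # zero-length spans still consume budget, so the loop ends
    return out

# Toy example: noise a two-sentence "document" of whitespace tokens.
doc = [["bu", "bir", "örnek", "cümledir"], ["model", "bunu", "yeniden", "üretir"]]
noisy = [span_mask(sentence) for sentence in permute_sentences(doc)]
```

In the actual pipeline the noising operates on subword IDs and interacts with sequence packing; treat the sketch only as a reading aid for the bullet list.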

##### Fine-tuning
- **Training regime**: fp16 mixed precision (a configuration sketch is given after this list)
- **Optimizer**: Adam (β1 = 0.9, β2 = 0.98, ε = 1e-6)
- **Scheduler**: Linear decay scheduler
- **Dropout**: 0.1
- **Learning rate**: 5e-6
- **Fine-tune epochs**: 55
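
For readers who want to reproduce the fine-tuning stage with Hugging Face `transformers`, the list above maps roughly onto the following `Seq2SeqTrainingArguments`. This is an illustrative sketch, not the authors' training script: the `output_dir` is a placeholder, and the batch size and data pipeline are not specified in this card.

```python
from transformers import Seq2SeqTrainingArguments

# Hyperparameters mirror the Fine-tuning bullet list; everything else is assumed.
training_args = Seq2SeqTrainingArguments(
    output_dir="vbart-paraphrasing-finetune",  # placeholder path
    fp16=True,                   # fp16 mixed precision
    learning_rate=5e-6,          # learning rate
    adam_beta1=0.9,              # Adam β1
    adam_beta2=0.98,             # Adam β2
    adam_epsilon=1e-6,           # Adam ε
    lr_scheduler_type="linear",  # linear decay scheduler
    num_train_epochs=55,         # fine-tune epochs
)
```

Dropout (0.1) is a model-configuration setting rather than a training argument (e.g. `model.config.dropout = 0.1` for BART-style configs).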

#### Metrics
![image/png](https://cdn-uploads.huggingface.co/production/uploads/62f8b3c84588fe31f435a92b/nrM_FA3bGk9NAYW_044HW.png)