yhavinga committed
Commit
02d8a7b
1 Parent(s): fa6f0d1

Update README.md

Files changed (1)
  1. README.md +0 -60
README.md CHANGED
@@ -80,64 +80,4 @@ and getting an idea what sensible hyper-parameters are for training gpt2 from sc
  * [Gsarti's Pretrain and Fine-tune a T5 model with Flax on GCP](https://github.com/gsarti/t5-flax-gcp)
  * [Flax/Jax Community week t5-base-dutch](https://huggingface.co/flax-community/t5-base-dutch)
 
- Created by [Yeb Havinga](https://www.linkedin.com/in/yeb-havinga-86530825/)
- ## Tokenizer
-
- * SentencePiece tokenizer trained from scratch for Dutch on the cleaned Dutch mC4 corpus, using scripts from the Hugging Face
- Transformers [Flax examples](https://github.com/huggingface/transformers/tree/master/examples/flax/language-modeling).
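
The tokenizer-training scripts referenced above live in the linked Flax examples. As a rough, hypothetical illustration of the same idea (not the script that was actually used), a Dutch SentencePiece unigram model could be trained directly with the `sentencepiece` package; the input file name and vocabulary size below are assumptions:

```python
import sentencepiece as spm

# Illustrative sketch only: train a unigram SentencePiece model on a
# plain-text dump of the Dutch corpus. The input path and vocab_size are
# assumptions, not values taken from the original training run.
spm.SentencePieceTrainer.train(
    input="mc4_nl_cleaned.txt",      # one document or sentence per line
    model_prefix="t5-dutch",         # writes t5-dutch.model / t5-dutch.vocab
    vocab_size=32000,
    model_type="unigram",
    character_coverage=1.0,          # keep all characters (diacritics etc.)
)
```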
-
- ## Dataset
-
- All models listed below are trained on the `full` configuration (39B tokens) of
- [cleaned Dutch mC4](https://huggingface.co/datasets/yhavinga/mc4_nl_cleaned),
- which is the original mC4, except that:
-
- * Documents that contained words from a selection of the Dutch and English [List of Dirty Naughty Obscene and Otherwise Bad Words](https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words) are removed
- * Sentences with fewer than 3 words are removed
- * Sentences with a word of more than 1000 characters are removed
- * Documents with fewer than 5 sentences are removed
- * Documents with "javascript", "lorum ipsum", "terms of use", "privacy policy", "cookie policy", "uses cookies",
- "use of cookies", "use cookies", "elementen ontbreken", "deze printversie" are removed.
-
- ## Models
-
- TL;DR: [yhavinga/t5-v1.1-base-dutch-cased](https://huggingface.co/yhavinga/t5-v1.1-base-dutch-cased) is the best model.
-
- * `yhavinga/t5-base-dutch` is a re-training of the Dutch T5 base v1.0 model trained during the summer 2021
- Flax/Jax community week. Accuracy was improved from 0.64 to 0.70.
- * The two T5 v1.1 base models are an uncased and a cased version of `t5-v1.1-base`, again pre-trained from scratch on Dutch,
- with a tokenizer also trained from scratch. The t5 v1.1 models are slightly different from the t5 models, and the
- base models are trained with a dropout of 0.0. For fine-tuning, dropout is intended to be set back to 0.1 (see the
- sketch after this list).
- * The large cased model is a pre-trained Dutch version of `t5-v1.1-large`. Training of t5-v1.1-large proved difficult.
- Without dropout regularization, training would diverge at a certain point. With dropout, training went better,
- albeit much slower than training the t5 model. At some point convergence was too slow to warrant further training.
- The latest checkpoint, training scripts and metrics are available for reference. For actual fine-tuning, the cased
- base model is probably the better choice.
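
As a minimal usage sketch (not an official recipe from this repository), loading the cased v1.1 base checkpoint for fine-tuning with 🤗 Transformers while resetting dropout to 0.1 could look like this:

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

model_name = "yhavinga/t5-v1.1-base-dutch-cased"

# The v1.1 base checkpoints were pre-trained with dropout 0.0; for
# fine-tuning the value is intended to go back to 0.1. Extra keyword
# arguments such as dropout_rate override the stored config value.
# (Add from_flax=True if only Flax weights are available for the checkpoint.)
model = T5ForConditionalGeneration.from_pretrained(model_name, dropout_rate=0.1)
tokenizer = T5Tokenizer.from_pretrained(model_name)
```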
-
- | model | type | train seq len | acc | loss | batch size | epochs | steps | dropout | optim | lr | duration |
- |-------|------|---------------|-----|------|------------|--------|-------|---------|-------|----|----------|
- | [yhavinga/t5-base-dutch](https://huggingface.co/yhavinga/t5-base-dutch) | T5 | 512 | 0.70 | 1.38 | 128 | 1 | 528481 | 0.1 | adafactor | 5e-3 | 2d 9h |
- | [yhavinga/t5-v1.1-base-dutch-uncased](https://huggingface.co/yhavinga/t5-v1.1-base-dutch-uncased) | t5-v1.1 | 1024 | 0.73 | 1.20 | 64 | 2 | 1014525 | 0.0 | adafactor | 5e-3 | 5d 5h |
- | [yhavinga/t5-v1.1-base-dutch-cased](https://huggingface.co/yhavinga/t5-v1.1-base-dutch-cased) | t5-v1.1 | 1024 | **0.78** | **0.96** | 64 | 2 | 1210000 | 0.0 | adafactor | 5e-3 | 6d 6h |
- | [yhavinga/t5-v1.1-large-dutch-cased](https://huggingface.co/yhavinga/t5-v1.1-large-dutch-cased) | t5-v1.1 | 512 | 0.76 | 1.07 | 64 | 1 | 1120000 | 0.1 | adafactor | 5e-3 | 8d 13h |
-
- The cased t5-v1.1 Dutch models were fine-tuned on summarizing the CNN Daily Mail dataset.
-
- | model | type | input len | target len | Rouge1 | Rouge2 | RougeL | RougeLsum | Test Gen Len | epochs | batch size | steps | duration |
- |-------|------|-----------|------------|--------|--------|--------|-----------|--------------|--------|------------|-------|----------|
- | [yhavinga/t5-v1.1-base-dutch-cnn-test](https://huggingface.co/yhavinga/t5-v1.1-base-dutch-cnn-test) | t5-v1.1 | 1024 | 96 | 34.8 | 13.6 | 25.2 | 32.1 | 79 | 6 | 64 | 26916 | 2h 40m |
- | [yhavinga/t5-v1.1-large-dutch-cnn-test](https://huggingface.co/yhavinga/t5-v1.1-large-dutch-cnn-test) | t5-v1.1 | 1024 | 96 | 34.4 | 13.6 | 25.3 | 31.7 | 81 | 5 | 16 | 89720 | 11h |
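
As a usage sketch (not part of the original README), running the fine-tuned base summarization checkpoint through the 🤗 `pipeline` API might look as follows; the example text and generation lengths are assumptions:

```python
from transformers import pipeline

# Illustrative inference with the fine-tuned Dutch summarization model.
# max_length / min_length are assumptions, not the evaluation settings.
summarizer = pipeline("summarization", model="yhavinga/t5-v1.1-base-dutch-cnn-test")

article = "Hier staat een lang Nederlands nieuwsartikel ..."  # placeholder text
print(summarizer(article, max_length=96, min_length=20)[0]["summary_text"])
```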
-
-
- ## Acknowledgements
-
- This project would not have been possible without compute generously provided by Google through the
- [TPU Research Cloud](https://sites.research.google/trc/). The HuggingFace 🤗 ecosystem was also
- instrumental in many, if not all, parts of the training. The following repositories were helpful in setting up the TPU-VM
- and training the models:
-
- * [Gsarti's Pretrain and Fine-tune a T5 model with Flax on GCP](https://github.com/gsarti/t5-flax-gcp)
- * [HuggingFace Flax MLM examples](https://github.com/huggingface/transformers/tree/master/examples/flax/language-modeling)
- * [Flax/Jax Community week t5-base-dutch](https://huggingface.co/flax-community/t5-base-dutch)
-
  Created by [Yeb Havinga](https://www.linkedin.com/in/yeb-havinga-86530825/)
 