---
language:
- nl
datasets:
- yhavinga/mc4_nl_cleaned
tags:
- seq2seq
- lm-head
license: apache-2.0
inference: false
---

# T5-base pre-trained on cleaned Dutch mC4 🇳🇱

A [T5](https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html) v1.0 base model pre-trained from scratch on [Dutch mC4](https://huggingface.co/datasets/yhavinga/mc4_nl_cleaned).

* Note: consider using [yhavinga/t5-v1.1-base-dutch-cased](https://huggingface.co/yhavinga/t5-v1.1-base-dutch-cased), which has an accuracy of 0.78 and a loss of 0.96 on the validation split.
* This model and the [flax-community/t5-base-dutch model](https://huggingface.co/flax-community/t5-base-dutch) now share the same latest checkpoint, with an accuracy of 0.70 and a loss of 1.38 on the validation split.
* Pre-trained T5 models need to be fine-tuned before they can be used for downstream tasks, so the inference widget on the right has been turned off.
* For a fine-tuned version for summarization, see [yhavinga/t5-v1.1-base-dutch-cnn-test](https://huggingface.co/yhavinga/t5-v1.1-base-dutch-cnn-test).
* For a demo of the Dutch CNN summarization models, head over to the Hugging Face Spaces for the **[Netherformer 📰](https://huggingface.co/spaces/flax-community/netherformer)** example application!
* T5 paper: [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/pdf/1910.10683.pdf)

![model image](https://camo.githubusercontent.com/623b4dea0b653f2ad3f36c71ebfe749a677ac0a1/68747470733a2f2f6d69726f2e6d656469756d2e636f6d2f6d61782f343030362f312a44304a31674e51663876727255704b657944387750412e706e67)

## Tokenizer

* The tokenizer was trained from scratch for Dutch on mC4 nl cleaned, using scripts from the Hugging Face
  Transformers [Flax examples](https://github.com/huggingface/transformers/tree/master/examples/flax/language-modeling).

## Dataset

All models listed below are trained on the `full` configuration (39B tokens) of
[cleaned Dutch mC4](https://huggingface.co/datasets/yhavinga/mc4_nl_cleaned),
which is the original mC4 with the following filters applied:

* Documents that contain words from a selection of the Dutch and English [List of Dirty Naughty Obscene and Otherwise Bad Words](https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words) are removed.
* Sentences with fewer than 3 words are removed.
* Sentences with a word of more than 1000 characters are removed.
* Documents with fewer than 5 sentences are removed.
* Documents with "javascript", "lorum ipsum", "terms of use", "privacy policy", "cookie policy", "uses cookies",
  "use of cookies", "use cookies", "elementen ontbreken", "deze printversie" are removed.

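
The filters above can be approximated in a few lines of Python. This is a minimal sketch, not the script that produced the dataset: the naive sentence splitter, the placeholder `BAD_WORDS` set, the case-insensitive matching, the order of the steps and all function names are assumptions made for illustration.

```python
import re

# Hypothetical stand-in for the selected Dutch and English LDNOOBW entries.
BAD_WORDS = {"badword"}

# Phrases from the list above; a document containing any of them is dropped.
BLOCKED_PHRASES = (
    "javascript", "lorum ipsum", "terms of use", "privacy policy",
    "cookie policy", "uses cookies", "use of cookies", "use cookies",
    "elementen ontbreken", "deze printversie",
)

MIN_WORDS_PER_SENTENCE = 3
MAX_WORD_LENGTH = 1000
MIN_SENTENCES_PER_DOC = 5


def split_sentences(text):
    """Naive splitter (assumption): break after '.', '!' or '?' plus whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def clean_document(text):
    """Return the cleaned document, or None if the whole document is dropped."""
    lowered = text.lower()
    # Document-level filters: blocked phrases and bad words (case-insensitive
    # matching is an assumption).
    if any(phrase in lowered for phrase in BLOCKED_PHRASES):
        return None
    if any(word in BAD_WORDS for word in re.findall(r"\w+", lowered)):
        return None
    # Sentence-level filters: too few words, or an overly long word.
    kept = []
    for sentence in split_sentences(text):
        words = sentence.split()
        if len(words) < MIN_WORDS_PER_SENTENCE:
            continue
        if any(len(word) > MAX_WORD_LENGTH for word in words):
            continue
        kept.append(sentence)
    # Counting sentences after the sentence filters is an assumption.
    if len(kept) < MIN_SENTENCES_PER_DOC:
        return None
    return " ".join(kept)
```

Calling `clean_document` on a raw mC4 document either returns the text with the short or malformed sentences removed, or `None` when the document as a whole should be discarded.
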
## Models

* The first model, `t5-base-dutch`, is a re-training of the Dutch T5 base v1.0 model trained during the Flax/Jax community
  week. With training complete, accuracy improved from 0.64 to 0.70.
* The next two models are uncased and cased versions of `t5-v1.1-base`, again pre-trained from scratch on Dutch,
  with a tokenizer also trained from scratch. The t5 v1.1 models are slightly different from the t5 models, and the
  base models are trained with a dropout of 0.0. For fine-tuning, the dropout is intended to be set back to 0.1.
* The large cased model is a pre-trained Dutch version of `t5-v1.1-large`. Training of t5-v1.1-large proved difficult.
  Without dropout regularization, the training would diverge at a certain point. With dropout, training went better,
  albeit much more slowly than training the t5-model. At some point convergence was too slow to warrant further training.
  The latest checkpoint, training scripts and metrics are available for reference. For actual fine-tuning, the cased
  base model is probably the better choice.

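
The dropout note above maps onto the `dropout_rate` field of the T5 configuration in Hugging Face Transformers. The following is a minimal sketch, assuming the checkpoint is loaded with `transformers` and that PyTorch weights are available for it; the helper name `load_for_fine_tuning` is hypothetical, and the default Hub id is simply the cased base model from this card.

```python
FINE_TUNE_DROPOUT = 0.1  # value the text above recommends for fine-tuning


def load_for_fine_tuning(model_name="yhavinga/t5-v1.1-base-dutch-cased",
                         dropout_rate=FINE_TUNE_DROPOUT):
    """Load a pre-trained checkpoint with dropout re-enabled (hypothetical helper)."""
    # Imported lazily so this module can be read and imported without transformers.
    from transformers import AutoConfig, AutoTokenizer, T5ForConditionalGeneration

    # Keyword arguments that match config attributes override the stored values,
    # so this replaces the pre-training dropout of 0.0.
    config = AutoConfig.from_pretrained(model_name, dropout_rate=dropout_rate)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = T5ForConditionalGeneration.from_pretrained(model_name, config=config)
    return tokenizer, model
```

Passing the override through the config keeps the stored checkpoint untouched and makes the fine-tuning dropout explicit in the training script.
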
| model                      | type    | train seq len | acc      | loss     | batch size | epochs | steps   | dropout | optim     | lr   | duration |
|----------------------------|---------|---------------|----------|----------|------------|--------|---------|---------|-----------|------|----------|
| t5-base-dutch              | T5      | 512           | 0.70     | 1.38     | 128        | 1      | 528481  | 0.1     | adafactor | 5e-3 | 2d 9h    |
| t5-v1.1-base-dutch-uncased | t5-v1.1 | 1024          | 0.73     | 1.20     | 64         | 2      | 1014525 | 0.0     | adafactor | 5e-3 | 5d 5h    |
| t5-v1.1-base-dutch-cased   | t5-v1.1 | 1024          | **0.78** | **0.96** | 64         | 2      | 1210000 | 0.0     | adafactor | 5e-3 | 6d 6h    |
| t5-v1.1-large-dutch-cased  | t5-v1.1 | 512           | 0.76     | 1.07     | 64         | 1      | 1120000 | 0.1     | adafactor | 5e-3 | 8d 13h   |

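
As a rough sanity check on the pre-training table, the steps, batch size and sequence length can be multiplied to estimate how many token positions each run processed. This sketch assumes that the `steps` column counts all optimizer steps over all epochs and that every sequence is filled to the full training length, so the result is an upper bound rather than an exact token count; the function name is hypothetical.

```python
# (steps, batch size, train seq len, epochs) copied from the pre-training table.
RUNS = {
    "t5-base-dutch": (528_481, 128, 512, 1),
    "t5-v1.1-base-dutch-uncased": (1_014_525, 64, 1024, 2),
    "t5-v1.1-base-dutch-cased": (1_210_000, 64, 1024, 2),
    "t5-v1.1-large-dutch-cased": (1_120_000, 64, 512, 1),
}


def token_positions(steps, batch_size, seq_len):
    """Upper bound on tokens seen: every step consumes batch_size full sequences."""
    return steps * batch_size * seq_len


for name, (steps, batch_size, seq_len, epochs) in RUNS.items():
    total = token_positions(steps, batch_size, seq_len)
    print(f"{name}: {total / 1e9:.1f}B total, {total / epochs / 1e9:.1f}B per epoch")
```

Under these assumptions the runs come out at roughly 33B to 40B token positions per epoch, the same order of magnitude as the 39B tokens of the `full` dataset configuration.
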
The cased t5-v1.1 Dutch models were fine-tuned on summarizing the CNN Daily Mail dataset.

| model                        | type    | input len | target len | Rouge1 | Rouge2 | RougeL | RougeLsum | Test Gen Len | epochs | batch size | steps | duration |
|------------------------------|---------|-----------|------------|--------|--------|--------|-----------|--------------|--------|------------|-------|----------|
| t5-v1.1-base-dutch-cnn-test  | t5-v1.1 | 1024      | 96         | 34.8   | 13.6   | 25.2   | 32.1      | 79           | 6      | 64         | 26916 | 2h 40m   |
| t5-v1.1-large-dutch-cnn-test | t5-v1.1 | 1024      | 96         | 34.4   | 13.6   | 25.3   | 31.7      | 81           | 5      | 16         | 89720 | 11h      |

## Acknowledgements

This project would not have been possible without compute generously provided by Google through the
[TPU Research Cloud](https://sites.research.google/trc/). The Hugging Face 🤗 ecosystem was also
instrumental in many, if not all, parts of the training. The following repositories were helpful in setting up the TPU-VM
and in getting an idea of what sensible hyper-parameters are for training gpt2 from scratch.

* [Gsarti's Pretrain and Fine-tune a T5 model with Flax on GCP](https://github.com/gsarti/t5-flax-gcp)
* [Flax/Jax Community week t5-base-dutch](https://huggingface.co/flax-community/t5-base-dutch)

Created by [Yeb Havinga](https://www.linkedin.com/in/yeb-havinga-86530825/)