---
language:
- nl
datasets:
- yhavinga/mc4_nl_cleaned
tags:
- seq2seq
- lm-head
license: apache-2.0
inference: false
---

# Work in progress. Dec 2021.

# A collection of Dutch T5 models

* Many thanks to the [Google TPU Research Cloud](https://sites.research.google/trc/about/) for providing access to a TPU cluster!
* This is a continuation of work started during the [Hugging Face community week](https://discuss.huggingface.co/t/open-to-the-community-community-week-using-jax-flax-for-nlp-cv/7104), organized by [Hugging Face](https://huggingface.co/) with TPU usage sponsored by Google, for the project [Pre-train T5 from scratch in Dutch](https://discuss.huggingface.co/t/pretrain-t5-from-scratch-in-dutch/8109).
* The models use an improved training script: training no longer raises exceptions, so no restarts are required.
* All models are trained with TensorFlow metrics.
* Thanks to @gsarti for creating the [t5-flax-gcp repository](https://github.com/gsarti/t5-flax-gcp)!


|                          | `t5-base-dutch`           | `t5-v1.1-base-dutch`      | `t5-v1.1-large-dutch-cased` | `t5-v1.1-base-dutch-uncased` |
|--------------------------|---------------------------|---------------------------|-----------------------------|------------------------------|
| `tokenizer`              | `cased`                   | `uncased`                 | `cased`                     | `uncased`                    |
| `source model config`    | `google/t5-base`          | `google/t5-v1_1-base`     | `google/t5-v1_1-large`      | `google/t5-v1_1-base`        |
| `dataset`                | `yhavinga/mc4_nl_cleaned` | `yhavinga/mc4_nl_cleaned` | `yhavinga/mc4_nl_cleaned`   | `yhavinga/mc4_nl_cleaned`    |
| `tpu vm`                 | two                       | one                       | three                       | one                          |
| `finished`               |                           | YES                       |                             |                              |
| Hyperparameters          |                           |                           |                             |                              |
| `epochs`                 | 1                         | 1                         | 4                           | 2                            |
| `per-device batch size`  | 16                        | 16                        | 2                           | 8                            |
| `total batch size`       | 128                       | 128                       | 16                          | ?                            |
| `steps`                  | 508 976                   | 508 976                   | 8 428 012                   | ?                            |
| `max seq. length`        | 512                       | 512                       | 1024                        | 1024                         |
| `total tokens trained on`| 33B                       | 33B                       | 138B                        | ?                            |
| `optimizer`              | adafactor                 | adafactor                 | adafactor                   | adafactor                    |
| `warmup steps`           | 10000                     | 10000                     | 10000                       | 10000                        |
| `learning rate`          | 0.005                     | 0.005                     | 0.005                       | 0.005                        |
| `weight decay`           | 0.01                      | 0.01                      | 0.01                        | 0.001                        |
| `tie embeds`             | `false`                   | `false`                   | `false`                     | `false`                      |
| `validation split size`  | 15K examples              | 15K examples              | 15K examples                | 15K examples                 |
| Model config             |                           |                           |                             |                              |
| `d_ff`                   | 3072                      | 2048                      | 2816                        | 2048                         |
| `d_kv`                   | 64                        | 64                        | 64                          | 64                           |
| `d_model`                | 768                       | 768                       | 1024                        | 768                          |
| `dropout rate`           | 0.1                       | 0.1                       | 0.1 (0.0 during pre-training) | 0.1 (0.0 during pre-training) |
| `ff projection`          | `relu`                    | `gated-gelu`              | `gated-gelu`                | `gated-relu`                 |
| `num decoder layers`     | 12                        | 12                        | 24                          | 12                           |
| `num heads`              | 12                        | 12                        | 16                          | 12                           |
| `num layers`             | 12                        | 12                        | 24                          | 12                           |
| `rel. attn. buckets`     | 32                        | 32                        | 32                          | 32                           |
| `vocab size`             | 32103                     | 32103                     | 32103                       | 32103                        |
| Training time            | ~ 100 hours               | 101 hours                 | ~ 370 hours                 | ?                            |
| Evaluation               |                           |                           |                             |                              |
| `accuracy`               |                           | 0.6976                    |                             |                              |
| `loss`                   |                           | 1.379                     |                             |                              |
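
The token totals in the table appear to follow from steps × total batch size × max sequence length. The sketch below checks that arithmetic for the three runs that have complete rows. It assumes every sequence is packed to the full maximum length, which is not stated here, so read the result as an upper bound. The function name and the `RUNS` dictionary are illustrative only and are not part of the training script.

```python
# Values copied from the table:
# (steps, per-device batch size, total batch size, max sequence length).
RUNS = {
    "t5-base-dutch": (508_976, 16, 128, 512),
    "t5-v1.1-base-dutch": (508_976, 16, 128, 512),
    "t5-v1.1-large-dutch-cased": (8_428_012, 2, 16, 1024),
}


def tokens_trained_on(steps: int, total_batch_size: int, seq_len: int) -> int:
    """Upper bound on tokens seen, assuming every sequence is full length."""
    return steps * total_batch_size * seq_len


for name, (steps, per_device, total, seq_len) in RUNS.items():
    n_tokens = tokens_trained_on(steps, total, seq_len)
    # Total batch size divided by per-device batch size gives the implied
    # number of devices (an inference from the table, not a stated fact).
    n_devices = total // per_device
    print(f"{name}: {n_tokens / 1e9:.1f}B tokens, {n_devices} devices implied")
```

For these rows the product rounds to 33B, 33B, and 138B tokens, matching the table, and the ratio of total to per-device batch size is 8 in each case.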
