hmByT5 - Language Models

Historical Multilingual and Monolingual ByT5 Models. The following languages are currently covered:

English (British Library Corpus - Books)
German (Europeana Newspaper)
French (Europeana Newspaper)
Finnish (Europeana Newspaper)
Swedish (Europeana Newspaper)
Dutch (Delpher Corpus)
Norwegian (NCC)

More details can be found in our GitHub repository.
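
ByT5 models read raw UTF-8 bytes instead of a learned subword vocabulary, which is one reason they are a natural fit for historical spellings and characters. The sketch below illustrates that idea in plain Python. It assumes the id layout of the public ByT5 tokenizer (ids 0 to 2 reserved for padding, end-of-sequence and unknown, so each byte maps to its value plus 3). The function name `byte_level_ids` is hypothetical, and the snippet is an illustration only, not a replacement for the tokenizer shipped with the models.

```python
def byte_level_ids(text, offset=3, eos_id=1):
    """Map text to ByT5-style ids: one id per UTF-8 byte, plus a final EOS id.

    Assumption: offset=3 and eos_id=1 mirror the public ByT5 tokenizer layout.
    """
    return [byte + offset for byte in text.encode("utf-8")] + [eos_id]


# The historical long s (U+017F) takes two UTF-8 bytes, so it yields two ids.
print(byte_level_ids("a"))
print(byte_level_ids("ſ"))
```

Because every character decomposes into bytes, no input character is ever out of vocabulary; the cost is longer sequences for non-ASCII text.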

Leaderboard

We evaluate our pretrained language models on various datasets from HIPE-2020, HIPE-2022 and Europeana. The following table gives an overview of the datasets used.

Language	Dataset	Additional Dataset
English	AjMC	-
German	AjMC	-
French	AjMC	ICDAR-Europeana
Finnish	NewsEye	-
Swedish	NewsEye	-
Dutch	ICDAR-Europeana	-

Current best models:

Model	English AjMC	German AjMC	French AjMC	Finnish NewsEye	Swedish NewsEye	Dutch ICDAR-Europeana	French ICDAR-Europeana
`hmbyt5/byt5-small-english`	85.65 ± 1.21	87.27 ± 0.50	84.44 ± 0.79	-	-	-	-
`hmbyt5-preliminary/byt5-small-english-german`	85.74 ± 0.72	87.45 ± 0.67	84.23 ± 0.65	-	-	-	-
`hmbyt5-preliminary/byt5-small-english-german-french`	85.61 ± 0.96	87.24 ± 0.76	84.39 ± 0.68	-	-	-	-
`hmbyt5-preliminary/byt5-small-english-german-french-finnish`	85.30 ± 1.14	87.37 ± 0.53	84.12 ± 0.42	-	-	-	-
`hmbyt5-preliminary/byt5-small-english-german-french-finnish-swedish`	85.40 ± 0.78	87.12 ± 0.19	84.41 ± 0.34	-	-	-	-
`hmbyt5-preliminary/byt5-small-english-german-french-finnish-swedish-dutch`	85.51 ± 0.68	87.58 ± 0.39	84.39 ± 0.83	55.46 ± 1.99	73.38 ± 2.45	84.80 ± 0.44	75.97 ± 0.55
`hmbyt5-preliminary/byt5-small-multilingual-4g`	83.49 ± 0.96	87.65 ± 0.63	84.16 ± 0.90	-	-	-	-
`hmbyt5-preliminary/byt5-small-multilingual-4g-2e`	83.86 ± 0.61	87.54 ± 0.19	84.29 ± 0.41	-	-	-	-
`hmbyt5-preliminary/byt5-small-multilingual-4g-3e`	83.49 ± 0.99	87.38 ± 0.53	84.30 ± 0.51	-	-	-	-
`hmbyt5-preliminary/byt5-small-historic-multilingual-flax`	83.28 ± 1.67	86.98 ± 0.71	83.49 ± 1.06	76.96 ± 1.58	78.80 ± 1.89	86.47 ± 0.79	77.43 ± 0.51
`hmbyt5-preliminary/byt5-small-historic-multilingual-span20-flax`	84.91 ± 0.86	88.02 ± 0.35	84.78 ± 0.75	77.77 ± 1.83	79.94 ± 0.60	86.85 ± 0.91	77.45 ± 0.54

More recent results on additional datasets can be found in the hmLeaderboard.
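
To compare rows of the leaderboard programmatically, the score cells can be parsed into numbers. The following is a minimal sketch: the helper names `parse_score` and `best_per_column` are hypothetical, the two rows are copied from the table above (first three columns only), and the value after `±` is treated simply as a spread, since the table does not say how it was computed.

```python
import re

# Matches cells such as "85.65 ± 1.21"; anything else counts as an empty cell.
SCORE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*±\s*(\d+(?:\.\d+)?)\s*$")


def parse_score(cell):
    """Return (score, spread) for a leaderboard cell, or None if it is empty."""
    match = SCORE.match(cell)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


def best_per_column(rows, columns):
    """For each column, return the model name with the highest score."""
    best = {}
    for index, column in enumerate(columns):
        scored = []
        for name, cells in rows.items():
            parsed = parse_score(cells[index]) if index < len(cells) else None
            if parsed is not None:
                scored.append((parsed[0], name))
        if scored:
            best[column] = max(scored)[1]
    return best


columns = ["English AjMC", "German AjMC", "French AjMC"]
rows = {
    "hmbyt5/byt5-small-english": [
        "85.65 ± 1.21", "87.27 ± 0.50", "84.44 ± 0.79",
    ],
    "hmbyt5-preliminary/byt5-small-historic-multilingual-span20-flax": [
        "84.91 ± 0.86", "88.02 ± 0.35", "84.78 ± 0.75",
    ],
}

for column, model in best_per_column(rows, columns).items():
    print(f"{column}: {model}")
```

Ranking by the score alone ignores the spread; when two models differ by less than their spreads, as several rows in the table do, the ordering should be read with caution.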

Acknowledgements

We thank Luisa März, Katharina Schmid and Erion Çano for their fruitful discussions about Historical Language Models.

This research was supported with Cloud TPUs from Google's TPU Research Cloud (TRC). Many thanks for providing access to the TPUs ❤️

Models

hmbyt5/byt5-small-english

hmbyt5/byt5-small-historic-dutch-span20

hmbyt5/byt5-small-historic-dutch

hmbyt5/byt5-base-historic-english-span20

hmbyt5/byt5-base-historic-english-span3

hmbyt5/byt5-base-historic-dutch

hmbyt5/byt5-small-historic-english-span20
