Extending BLOOM to Dutch - tips for hyperparameters
Hi there!
High level: starting from BLOOM, we want to create an LLM for Dutch and are looking for tips on hyperparameters and the training paradigm.
Context: We’re setting up the code and infrastructure to fine-tune all BLOOM models on a Dutch corpus. In general, we don’t wish to particularly optimize for multilingual capabilities; we aim for a SOTA Dutch LLM. We have collected around 150 GB of Dutch data, fitted a new tokenizer with a vocabulary size of 40,000 tokens, and prepared the language-model training dataset. We have access to an HPC infrastructure that would allow us to train for 50-100k GPU hours, and we have set up DeepSpeed's ZeRO framework for parallel training.
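For context, the tokenizer was fitted roughly along the lines of the sketch below (the corpus path and batching are illustrative, not our exact pipeline):

```python
# Rough sketch of how we fitted the 40k-vocabulary Dutch tokenizer.
# The data path and batch size are placeholders, not our exact setup.
from datasets import load_dataset
from transformers import AutoTokenizer

# Hypothetical location of the plain-text Dutch corpus.
dutch = load_dataset("text", data_files={"train": "dutch_corpus/*.txt"}, split="train")

def batch_iterator(batch_size=1000):
    for i in range(0, len(dutch), batch_size):
        yield dutch[i : i + batch_size]["text"]

# Re-use BLOOM's BPE tokenization algorithm, but learn Dutch vocabulary/merges.
base_tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom")
dutch_tokenizer = base_tokenizer.train_new_from_iterator(batch_iterator(), vocab_size=40_000)
dutch_tokenizer.save_pretrained("bloom-dutch-tokenizer")
```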
Questions:
We are currently considering two training paradigms. Any general or specific advice would be much appreciated.
1. Continued pretraining. To our knowledge, this seems to be the easiest way forward. We currently plan to train with all weights unfrozen, for 1 epoch, with an effective batch size of 256 (a rough sketch of this setup follows the questions below). Which hyperparameters would you recommend? In particular:
1.1 Would you recommend freezing a certain set of weights?
1.2 Which learning rate (paradigm) would you propose?
1.3 How would these decisions scale across different model sizes?
2. Alternatively, we are considering using MAD-X adapters, as in Yong et al. (https://arxiv.org/pdf/2204.04873.pdf).
2.1 Would you recommend this approach over continued pretraining? If so, where and in what capacity would you recommend plugging in the adapters (again, as a function of the BLOOM model size)?
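For reference, here is a minimal sketch of the continued-pretraining setup we currently have in mind for questions 1.1 and 1.2. The frozen modules, peak learning rate, warmup ratio, and step count below are placeholders we are asking about, not settled choices:

```python
# Minimal sketch of our tentative continued-pretraining setup (questions 1.1/1.2).
# Frozen modules, peak LR, warmup, and step count are placeholders, not recommendations.
import torch
from transformers import AutoModelForCausalLM, get_cosine_schedule_with_warmup

model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")  # smallest size, for illustration

# 1.1: one way to freeze a subset of weights -- keep the word embeddings trainable
# (new Dutch vocabulary) while freezing the first N transformer blocks.
# N_FROZEN_BLOCKS is a hypothetical knob.
N_FROZEN_BLOCKS = 6
for block in model.transformer.h[:N_FROZEN_BLOCKS]:
    for p in block.parameters():
        p.requires_grad = False

# 1.2: one candidate LR paradigm -- linear warmup followed by cosine decay,
# with a peak LR below BLOOM's original pretraining LR (value is a placeholder).
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad),
    lr=1e-5, weight_decay=0.1, betas=(0.9, 0.95),
)
total_steps = 10_000  # placeholder for ~1 epoch over our corpus at effective batch size 256
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=int(0.01 * total_steps), num_training_steps=total_steps
)
```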
Please let us know if you would need any more context to answer our questions. Many thanks in advance.