
How large is the corpus used for pretraining the finance LLaMA?

#2
by dhkong - opened

I just want to know what an adequate data size is for pre-training a domain-specific LLaMA.
Any guidance would be appreciated.

1B tokens in total. We ran the model for 4k steps with a batch size of 0.25M tokens (as in Table 10 of our paper).
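
A quick back-of-the-envelope check of those numbers (a minimal Python sketch; the variable names are only illustrative, the step count and batch size are taken from the reply above):

```python
# Sanity check: 4k optimizer steps at ~0.25M tokens per batch ≈ 1B tokens total.
steps = 4_000              # optimizer steps reported above
tokens_per_step = 250_000  # 0.25M tokens per batch

total_tokens = steps * tokens_per_step
print(f"{total_tokens:,} tokens")  # -> 1,000,000,000 (i.e. 1B tokens)
```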
