How large is the corpus size used for pretraining the finance LLaMA?
#2 opened by dhkong
I just want to know what counts as an adequate data size for pre-training a domain-specific LLaMA. Much appreciated.
1B tokens in total. We ran the model for 4k steps with a batch size of 0.25M tokens (as reported in Table 10 of our paper).
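
For a quick sanity check, those numbers multiply out to the reported total. A minimal sketch (variable names are illustrative, assuming the batch size is measured in tokens):

```python
# Back-of-the-envelope check of the total pretraining token count,
# assuming the batch size is counted directly in tokens per step.
steps = 4_000                 # optimizer steps reported above
batch_size_tokens = 250_000   # 0.25M tokens per step

total_tokens = steps * batch_size_tokens
print(f"{total_tokens:,} tokens")  # 1,000,000,000 -> ~1B tokens
```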