Dataset Size

#3
by aslawliet - opened

Can you tell me about the dataset size and sampling methods?

Keynote Technology org

One of the datasets used to train this model, PLANE-2K, has 2,000 rows (1.8 MB). You can sample from it with any method you like, as long as you have the tooling for it.
https://huggingface.co/datasets/Keynote-Technology/PLANE-2K

I meant the size of the data you took from RedPajama-Data-v2.

Keynote Technology org
edited Dec 2, 2023

I trained this model on close to 900,000 rows, which comes to about 4.41 GB.
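
As a rough sanity check on those two figures, the average row size can be worked out directly. This is only a back-of-the-envelope sketch: the reply does not say whether "GB" means decimal gigabytes (10^9 bytes) or binary gibibytes (2^30 bytes), so both readings are shown, and the constant names are made up for illustration.

```python
# Figures quoted in the reply above; both are approximate.
ROWS = 900_000
SIZE_GB = 4.41

# Assumption: the unit is not specified, so compute both interpretations.
bytes_decimal = SIZE_GB * 10**9   # GB as 10^9 bytes
bytes_binary = SIZE_GB * 2**30    # GB read as GiB (2^30 bytes)

avg_decimal = bytes_decimal / ROWS
avg_binary = bytes_binary / ROWS

print(f"average row size (decimal GB): {avg_decimal:.0f} bytes")
print(f"average row size (binary GiB): {avg_binary:.0f} bytes")
```

Either way the average works out to roughly 5 KB per row, which is plausible for web-text documents.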

Keynote Technology org

I sampled randomly in no particular order.
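
The reply does not say how the random sample was drawn, so the following is only one possible approach, not the method actually used. It is a minimal sketch of reservoir sampling (Algorithm R), which picks k rows uniformly at random from a stream too large to hold in memory. The function name `reservoir_sample` and its parameters are hypothetical.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(rows: Iterable[T], k: int, seed: int = 0) -> List[T]:
    """Return k rows chosen uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    reservoir: List[T] = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            reservoir.append(row)
        else:
            # Keep the new row with probability k / (i + 1) by drawing
            # a slot in [0, i] and replacing only if it falls inside the reservoir.
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = row
    # Shuffle so the output carries no trace of the stream order.
    rng.shuffle(reservoir)
    return reservoir


# Usage on a stand-in stream; a real run would iterate over dataset rows
# and use k=900_000.
sample = reservoir_sample(range(10_000), k=900, seed=42)
print(len(sample))
```

With a streamed dataset, the same function could be fed the row iterator directly, so only the k sampled rows are ever kept in memory.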

PlanetDOGE changed discussion status to closed
