Introduction to Dataset Creation - What Makes a Good Dataset?
Making a dataset is a complex task, whether you want a model to master a programming language or to solve quadratic equations. You might have attempted to make a dataset in the past but faced challenges because the dataset wasn't good enough. Making a dataset good is VERY important; in fact, I think it matters more than anything else when building an LLM. So how do you make a good dataset that actually makes the model better in an efficient way? In this post we will get into some important aspects of a good dataset.
Note that the following are general principles rather than concrete recipes; let me know if you want a blog post on ways to build datasets, popular approaches/techniques, ways to make a dataset follow these aspects, or ways of finetuning.
Quality
The quality of the dataset is certainly one of, if not the, most important aspects of a dataset. Without quality, the dataset is, well, low quality. If a dataset is low quality, the model finetuned on it is not going to produce outputs better than the quality of the dataset ("garbage in, garbage out", as data scientists like to say). However, what does quality really mean, and what determines whether a dataset's quality is high enough?
Diversity
Diversity is one of the most important quality-related aspects of a dataset. You need a diverse dataset so that when a user asks the model something, the model actually knows it. Users ask diverse questions, even within a seemingly narrow topic like coding. In coding there can be a lot of diversity: the programming language, the libraries, and the sub-topics (for example, building mathematics libraries or interactive GUIs). Therefore, you want to include as many of them as possible in your dataset.
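To make "diversity" a bit more concrete, here is a minimal sketch that tallies how well different programming languages and sub-topics are covered. The `language` and `topic` fields are assumptions I made up for this example, not a standard dataset schema:

```python
from collections import Counter

# Hypothetical rows: the "language" and "topic" fields are assumptions made
# for this sketch, not a standard dataset schema.
dataset = [
    {"language": "python", "topic": "math library", "prompt": "...", "response": "..."},
    {"language": "python", "topic": "gui", "prompt": "...", "response": "..."},
    {"language": "rust", "topic": "math library", "prompt": "...", "response": "..."},
]

def coverage_report(rows, field):
    """Print how many rows fall under each value of a field."""
    counts = Counter(row[field] for row in rows)
    total = len(rows)
    for value, count in counts.most_common():
        print(f"{field}={value}: {count} rows ({count / total:.0%})")

coverage_report(dataset, "language")
coverage_report(dataset, "topic")
```

If one language or sub-topic dominates the report, that's a hint to go collect more data for the underrepresented ones.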
Educational Value
The educational value of the dataset is also very important. It is a measure of how effective and useful the data is for an LLM to learn from. The solving process of a very easy math equation that the model could already solve is not very useful. A row that is just plain code without annotation is not very useful for an LLM to learn from either. However, the solving process of a complex math equation, or richly annotated code, is. The model will actually "learn" from that data instead of just memorizing it, which raises the probability that it will make connections to other code and scenarios too. The data is supposed to effectively improve the model's performance by providing useful and relevant information, rather than feeding the model just any text.
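As a toy illustration of the idea (not a real scoring method), you could filter out code rows that carry no explanation at all. The `response` field, the annotation check, and the word-count cutoff are all assumptions for this sketch; in practice, educational value is usually scored with an LLM judge or a trained classifier:

```python
def has_annotation(code: str) -> bool:
    """Crude check: does the code contain any comments or docstrings?"""
    return any(
        marker in line
        for line in code.splitlines()
        for marker in ("#", '"""', "'''")
    )

def keep_row(row: dict) -> bool:
    """Keep rows whose response is either annotated code or a real explanation.

    The "response" field and the 30-word cutoff are assumptions for this
    sketch, not a standard; real pipelines score educational value with an
    LLM judge or a trained classifier.
    """
    response = row["response"]
    looks_like_code = any(
        line.lstrip().startswith(("def ", "class ", "import "))
        for line in response.splitlines()
    )
    if looks_like_code:
        return has_annotation(response)
    return len(response.split()) > 30  # arbitrary cutoff for "explains something"

rows = [
    {"prompt": "Write a GCD function.",
     "response": "def gcd(a, b):\n    return a if b == 0 else gcd(b, a % b)"},
    {"prompt": "Write a GCD function.",
     "response": "# Euclid's algorithm: gcd(a, b) == gcd(b, a % b)\ndef gcd(a, b):\n    return a if b == 0 else gcd(b, a % b)"},
]
kept = [row for row in rows if keep_row(row)]  # keeps only the annotated version
```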
Detailness
For each data point in the dataset, you want it to be as detailed as possible. That way there is a higher probability that the model will pick it up and make connections to other similar prompts too. For example, "To solve 4 + 3 * 5, we first multiply 3 by 5 (according to the order of operations), which gives us 15; adding 4 gives 19." is much more detailed than just "4 + 3 * 5 = 19". The detailed version also teaches the model the answer to 3 * 5 along the way. Additionally, it will generate more user engagement, too, as a lot of users want, need, or like detailed answers.
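To show what that difference looks like as actual data, here are two hypothetical rows for the same prompt (the field names are just for illustration):

```python
# A terse row: correct, but the model only sees the final answer.
terse_row = {
    "prompt": "What is 4 + 3 * 5?",
    "response": "4 + 3 * 5 = 19",
}

# A detailed row: the same answer, but with the reasoning spelled out.
detailed_row = {
    "prompt": "What is 4 + 3 * 5?",
    "response": (
        "Following the order of operations, multiplication comes before "
        "addition, so we first compute 3 * 5 = 15. "
        "Then we add 4, giving 4 + 15 = 19."
    ),
}
```

The second row takes more effort to write or generate, but it carries the reasoning you actually want the model to pick up.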
Correctness
This is probably THE BIGGEST quality-related aspect of a dataset. You want the data to be correct, because if it is incorrect, the model will produce incorrect answers. As mentioned earlier, this illustrates "garbage in, garbage out": if the dataset contains incorrect information, the model is going to reproduce that incorrect output. I think correctness is what most good datasets aim for now. It is probably the first thing most people think of when they think about dataset quality, so I don't think I need to explain much more here.
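For domains where answers can be checked mechanically, like the arithmetic example above, you can verify rows automatically instead of trusting them. This sketch assumes each row stores the expression and the claimed answer under field names I made up for the example:

```python
import ast
import operator

# Safe evaluator for simple arithmetic expressions (avoids eval on raw strings).
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}

def safe_eval(expr: str) -> float:
    """Evaluate +, -, *, / expressions over numeric literals."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        raise ValueError(f"Unsupported expression: {expr}")
    return walk(ast.parse(expr, mode="eval"))

# "expression" and "claimed_answer" are hypothetical field names for this sketch.
row = {"expression": "4 + 3 * 5", "claimed_answer": 19}
assert safe_eval(row["expression"]) == row["claimed_answer"]
```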
Censorship
This is probably not as important as the others, but if your dataset is censored, you might want to consider removing the censorship, or at least avoid purposely censoring the dataset. Censorship can make the LLM mistakenly identify permissible questions as off-limits, so it can refuse to answer even when the question is perfectly fine. You don't want the LLM to refuse a perfectly fine question, as this degrades the model's answers to those questions. Additionally, an uncensored dataset will generate more user engagement, too, as some people use LLMs for purposes you might not have anticipated.
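A common rough way to spot such rows is a keyword scan for refusal phrases. The marker list below is just an example, not an exhaustive or standard one, and real pipelines often pair a list like this with an LLM-based classifier:

```python
# Hypothetical refusal markers for this sketch; extend or replace as needed.
REFUSAL_MARKERS = (
    "i'm sorry, but i can't",
    "i cannot assist with",
    "as an ai language model, i can't",
)

def looks_like_refusal(response: str) -> bool:
    """Flag responses that start refusing instead of answering."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

rows = [
    {"prompt": "How do locks work?", "response": "A pin tumbler lock uses spring-loaded pins..."},
    {"prompt": "How do locks work?", "response": "I'm sorry, but I can't help with that."},
]
kept = [r for r in rows if not looks_like_refusal(r["response"])]  # drops the refusal
```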
Quantity
Quantity is an important aspect apart from quality. A dataset with everything we have talked about so far but only 1k rows is not going to do that much; a dataset with all of that and 50k rows is going to do a lot better. Note that bigger is not always better, though, as you also want to avoid overfitting. I'd say you need at least 10k rows so that the model learns enough to actually get better at a specific subject. However, I wouldn't go beyond 100k, as that is very hard to do well and might overfit the LLM. Finally, make sure the dataset has a reasonable quantity while also having good quality; if you can't maximize both, strike a balance between them.
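One small, practical step when growing a dataset is to drop exact duplicates so that extra quantity doesn't quietly turn into redundancy. The 10k-100k targets below just mirror the rough numbers above, so treat them as a starting point rather than a rule:

```python
def dedupe(rows, target_min=10_000, target_max=100_000):
    """Drop rows with duplicate prompts and report whether the size target is met.

    The 10k-100k range mirrors the rough numbers discussed above; treat it as
    a starting point, not a hard rule.
    """
    seen = set()
    unique = []
    for row in rows:
        key = row["prompt"].strip().lower()
        if key not in seen:
            seen.add(key)
            unique.append(row)
    size = len(unique)
    if size < target_min:
        print(f"Only {size} unique rows; consider collecting more data.")
    elif size > target_max:
        print(f"{size} unique rows; consider curating down to the best examples.")
    else:
        print(f"{size} unique rows; within the target range.")
    return unique

rows = [
    {"prompt": "What is 4 + 3 * 5?", "response": "..."},
    {"prompt": "what is 4 + 3 * 5? ", "response": "..."},  # near-duplicate, dropped
]
unique_rows = dedupe(rows)
```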
Final Words
So yeah, these are what make a dataset good. We talked about quality, specifically diversity, educational value, detailness, correctness, and censorship, and we also talked briefly about quantity. Both are very important to dataset creation, and I suggest following them whenever possible when making a dataset. Again, let me know if you want to see more blog posts from me. Don't believe that these hard-to-follow "rules" make an LLM better? Just think about how much Alpaca sucks compared to current LLMs. Finally, I hope you took something from this blog that helps you!