Update README.md
README.md
CHANGED
@@ -76,6 +76,24 @@ We utilized OpenCSG's enterprise-grade large language model, csg-wukong-enterpri
We recorded 100,000 data samples along with their scores, creating the dataset `fineweb_edu_classifier_chinese_data`. Using the scores from this dataset as labels, we trained a Chinese BERT model, `fineweb_edu_classifier_chinese`, which can assign a score of 0-5 to each input text. We plan to further optimize this scoring model; in the future, the OpenCSG algorithm team will open-source the `fineweb_edu_classifier_chinese_data` dataset and the `fineweb_edu_classifier_chinese` scoring model to further promote community development and collaboration. The dataset contains meticulously annotated and scored educational text, providing high-quality training data for researchers and developers.
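
Once released, calling the scorer should look roughly like the sketch below, assuming the model ships as a standard Hugging Face text-classification checkpoint. The model id `opencsg/fineweb_edu_classifier_chinese` and the shape of the output head are assumptions for illustration, not confirmed details of the release.

```python
# Hypothetical usage sketch; the model id and the output head are assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "opencsg/fineweb_edu_classifier_chinese"  # assumed, not yet published
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

text = "勾股定理描述了直角三角形三条边之间的数量关系。"
inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# A single output unit would mean a regression head over the 0-5 labels;
# six units would mean a 6-way classifier, in which case take the argmax.
if logits.shape[-1] == 1:
    score = logits.squeeze().item()
else:
    score = logits.argmax(dim=-1).item()
print(f"educational score (0-5): {score}")
```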

## Ablation Experiments

In a carefully designed ablation study, we set out to contrast the effect of the Chinese-fineweb-edu dataset with that of traditional Chinese pre-training corpora. To this end, we randomly sampled from five source datasets (CCI2-Data, SkyPile-150B, TeleChat-PTD, IndustryCorpus, and MAP-CC) in the same proportions as the Chinese-fineweb-edu dataset, constructing a comparison dataset named chinese-random-select; a sketch of this proportional sampling follows below.
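
The comparison set described above amounts to proportional random sampling from each source corpus. Below is a minimal sketch of that idea over in-memory document lists; the per-source shares and the helper itself are illustrative assumptions, not the team's actual pipeline.

```python
# Illustrative only: build chinese-random-select by drawing a fixed share of
# documents from each source corpus. The shares below are placeholders.
import random

SHARES = {
    "CCI2-Data": 0.2,
    "SkyPile-150B": 0.2,
    "TeleChat-PTD": 0.2,
    "IndustryCorpus": 0.2,
    "MAP-CC": 0.2,
}

def random_select(corpora: dict, total: int, seed: int = 42) -> list:
    """Sample `total` documents, each source contributing its fixed share."""
    rng = random.Random(seed)
    picked = []
    for name, share in SHARES.items():
        k = round(total * share)  # per-source quota; rounding may shift total slightly
        picked.extend(rng.sample(corpora[name], k))
    rng.shuffle(picked)
    return picked
```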

In our experiments, we trained a model with 2.1 billion parameters for 65k steps on each of the two datasets. Throughout training, we periodically saved model checkpoints and validated them on the Chinese evaluation benchmarks CEval and CMMLU, following the protocol sketched below.
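
Schematically, the training-and-evaluation protocol is the loop below. The checkpoint interval and the `train_step`, `save_checkpoint`, and `evaluate` callables are placeholders for the actual training stack, which the README does not describe.

```python
# Schematic of the checkpoint-and-evaluate protocol; the three callables are
# placeholders (assumptions), not a real training framework.
TOTAL_STEPS = 65_000
CKPT_EVERY = 5_000  # assumed interval; the README does not state it

def run(train_step, save_checkpoint, evaluate):
    """train_step(s) performs one update; evaluate(path, bench) returns accuracy."""
    for step in range(1, TOTAL_STEPS + 1):
        train_step(step)
        if step % CKPT_EVERY == 0 or step == TOTAL_STEPS:
            path = save_checkpoint(step)
            for bench in ("CEval", "CMMLU"):
                print(f"step {step:>6} {bench}: {evaluate(path, bench):.3f}")
```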

The graph below displays the performance trends of the two resulting models on these evaluation tasks.

The results clearly show that the model trained on Chinese-fineweb-edu significantly outperforms the model trained on chinese-random-select in both benchmarks, with a particularly large advantage in the later stages of training. This underscores the effectiveness and suitability of Chinese-fineweb-edu for Chinese language tasks, and it highlights the critical impact that dataset selection and construction have on a model's final performance.
<p align="center">
<img width="900px" alt="experiment" src="./chinese-fineweb-benchmark.png">
</p>

The experimental results show that late in training, as the run enters its second epoch and the learning rate decays rapidly, the accuracy of the model trained on chinese-fineweb-edu rises markedly, whereas the model trained on the randomly sampled data stays at a lower level. This demonstrates that the high-quality chinese-fineweb-edu data substantially improves training effectiveness: within the same training time, it raises model capability faster and saves training resources. The outcome also closely mirrors HuggingFace's data ablation experiments on fineweb-edu.
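
For intuition about the learning-rate effect described above, the snippet below evaluates a generic warmup-plus-cosine-decay schedule across the 65k-step run. The schedule shape and every constant in it are assumptions for illustration; the README does not specify the actual schedule used.

```python
# Illustrative warmup + cosine decay over 65k steps; all constants are assumed.
import math

TOTAL_STEPS = 65_000
WARMUP = 1_000                 # assumed warmup length
PEAK_LR, MIN_LR = 3e-4, 3e-5   # assumed learning rates

def lr_at(step: int) -> float:
    if step < WARMUP:
        return PEAK_LR * step / WARMUP
    t = (step - WARMUP) / (TOTAL_STEPS - WARMUP)
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1 + math.cos(math.pi * t))

# The LR falls fastest in the final stretch of training, which is where the
# reported gap between the two corpora widens.
for s in (0, 20_000, 50_000, 63_000, 65_000):
    print(f"step {s:>6}: lr = {lr_at(s):.2e}")
```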
**We warmly invite developers and researchers interested in this field to follow and engage with the community, working together to advance the technology. Stay tuned for the open-source release of the dataset!**
## License Agreement
@@ -158,6 +176,20 @@ The raw data of the Chinese Fineweb Edu dataset comes from a wide range of sources, covering multiple domestic
We recorded 100k data samples and their scores, forming `fineweb_edu_classifier_chinese_data`. Using the scores in this dataset as labels, we trained a Chinese BERT model, `fineweb_edu_classifier_chinese`, which assigns each input text a score of 0-5. We will continue to optimize this scoring model; in the future, the OpenCSG algorithm team will open-source the `fineweb_edu_classifier_chinese_data` dataset and the `fineweb_edu_classifier_chinese` scoring model to further promote community development and exchange. The dataset contains meticulously annotated and scored educational text, providing high-quality training data for researchers and developers.

## Ablation Experiments

In a carefully designed ablation study, we set out to contrast the effect of the Chinese-fineweb-edu dataset with that of traditional Chinese pre-training corpora. To this end, we randomly sampled from five datasets (CCI2-Data, SkyPile-150B, TeleChat-PTD, IndustryCorpus, and MAP-CC) in the same proportions as the Chinese-fineweb-edu dataset, constructing a comparison dataset named chinese-random-select.

In our experiments, we trained a model with 2.1B parameters for 65k steps on each of the two datasets. During training, we periodically saved model checkpoints and validated them on the Chinese evaluation benchmarks CEval and CMMLU. The figure below shows the performance trends of the two models on these evaluation tasks.

The results clearly show that the model trained on Chinese-fineweb-edu significantly outperforms the one trained on chinese-random-select in both evaluation tasks, especially in the later stages of training, demonstrating the effectiveness and suitability of Chinese-fineweb-edu for Chinese language tasks. This result further indicates that dataset selection and construction have a decisive impact on a model's final performance.
<p align="center">
<img width="900px" alt="experiment" src="./chinese-fineweb-benchmark.png">
</p>

The experimental results show that in the later stages of training, as the run enters its second epoch and the learning rate enters its phase of rapid decay, the accuracy of the model trained on chinese-fineweb-edu rises markedly, whereas the model trained on the randomly sampled data stays at a lower level.

This demonstrates that the high-quality chinese-fineweb-edu data substantially improves training effectiveness: within the same training time, it raises model capability faster and saves training resources. This result also closely mirrors HuggingFace's data ablation experiments on fineweb-edu.

**We warmly invite developers and researchers interested in this field to follow and engage with the community, working together to advance the technology. Stay tuned for the open-source release of the dataset!**
## License Agreement