TomPei committed on
Commit
7c92146
1 Parent(s): 8d355a6

Update README.md

Files changed (1)
  1. README.md +32 -0
README.md CHANGED
@@ -76,6 +76,24 @@ We utilized OpenCSG's enterprise-grade large language model, csg-wukong-enterpri
 
We recorded 100,000 data samples along with their scores, creating the dataset `fineweb_edu_classifier_chinese_data`. Using the scores in this dataset as labels, we trained a Chinese BERT model, `fineweb_edu_classifier_chinese`, which assigns each input text a score from 0 to 5. We plan to optimize this scoring model further, and in the future the OpenCSG algorithm team will open-source the `fineweb_edu_classifier_chinese_data` dataset and the `fineweb_edu_classifier_chinese` scoring model to further promote community development and collaboration. The dataset contains meticulously annotated and scored educational text, providing high-quality training data for researchers and developers.
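Although the scoring model has not yet been released, a minimal sketch of how such a classifier is typically invoked may be useful. The hub ID `opencsg/fineweb_edu_classifier_chinese` and the single-logit regression head are assumptions modeled on HuggingFace's own fineweb-edu classifier, not the confirmed interface:

```python
# Hypothetical usage sketch: score a Chinese passage from 0-5 with the classifier.
# The model ID and regression-style head are assumptions, not the released API.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "opencsg/fineweb_edu_classifier_chinese"  # hypothetical hub ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

text = "光合作用是植物将光能转化为化学能的过程。"  # any Chinese passage to score
inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
score = logits.squeeze().item()  # assumed: a single logit carrying the raw 0-5 score
print(f"educational score: {score:.2f}")
```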
+
+ ## Ablation Experiments
+ We designed ablation studies to compare the effect of the Chinese-fineweb-edu dataset against traditional Chinese pre-training corpora.
+ For this purpose, we randomly sampled from five datasets (CCI2-Data, SkyPile-150B, TeleChat-PTD, IndustryCorpus, and MAP-CC) in proportions matching the Chinese-fineweb-edu dataset, constructing a comparison dataset named chinese-random-select.
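As a rough illustration of that construction, the sketch below draws a fixed number of rows from each source with the `datasets` library. The hub paths, per-source proportions, and total row budget are placeholders, since the exact recipe is not published:

```python
# Illustrative sketch of building chinese-random-select by proportional random sampling.
# Hub paths, fractions, and the total row budget are placeholders, not the real recipe.
from datasets import load_dataset, concatenate_datasets

SOURCES = {  # placeholder hub paths and sampling fractions
    "org/CCI2-Data": 0.2,
    "org/SkyPile-150B": 0.2,
    "org/TeleChat-PTD": 0.2,
    "org/IndustryCorpus": 0.2,
    "org/MAP-CC": 0.2,
}
TOTAL_ROWS = 1_000_000  # placeholder for the Chinese-fineweb-edu sample count

parts = []
for path, frac in SOURCES.items():
    ds = load_dataset(path, split="train")
    n_rows = int(TOTAL_ROWS * frac)
    parts.append(ds.shuffle(seed=42).select(range(n_rows)))  # uniform random subset

chinese_random_select = concatenate_datasets(parts)
```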
+ For the experiments, we trained a 2.1-billion-parameter model for 65k steps on each of the two datasets.
+ Throughout training, we periodically saved model checkpoints and evaluated them on the Chinese benchmarks CEval and CMMLU, following the cadence sketched below.
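The cadence can be summarized in a few lines. The 5k-step interval and the helper functions below are assumptions, since the text only says checkpoints were saved and evaluated periodically:

```python
# Sketch of the periodic checkpoint-and-evaluate loop; all three helpers are stubs,
# and the 5k-step cadence is an assumption (the source only says "periodically").
TOTAL_STEPS = 65_000
EVAL_EVERY = 5_000  # assumed interval

def train_one_step(step: int) -> None:
    pass  # stand-in for one optimizer step of the 2.1B-parameter model

def save_checkpoint(step: int) -> str:
    return f"checkpoints/step_{step}"  # stand-in for persisting model weights

def evaluate(ckpt: str, benchmark: str) -> float:
    return 0.0  # stand-in for running a CEval or CMMLU harness on the checkpoint

for step in range(1, TOTAL_STEPS + 1):
    train_one_step(step)
    if step % EVAL_EVERY == 0:
        ckpt = save_checkpoint(step)
        for bench in ("CEval", "CMMLU"):
            print(step, bench, evaluate(ckpt, bench))
```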
+ The graph below shows how the two training runs trend on these evaluation tasks.
+ The results clearly show that the model trained on Chinese-fineweb-edu significantly outperforms the one trained on chinese-random-select on both benchmarks, with the gap widening in the later stages of training. This underscores the effectiveness and suitability of Chinese-fineweb-edu for Chinese language tasks, and highlights how strongly dataset selection and construction affect a model's final performance.
+ <p align="center">
+ <img width="900px" alt="experiment" src="./chinese-fineweb-benchmark.png">
+ </p>
+
+ The results also show that late in training, once the run enters its second epoch and the learning rate begins to drop rapidly, the accuracy of the model trained on chinese-fineweb-edu rises markedly, while the model trained on the randomly sampled data stays at a lower level. This demonstrates that the high-quality chinese-fineweb-edu data substantially improves training effectiveness: for the same training budget, it raises model capability faster and saves training resources.
+ This outcome closely mirrors HuggingFace's data-ablation experiments on fineweb-edu.
+
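The schedule itself is not stated here. Purely as an illustration of a learning rate that "drops rapidly" late in the run, the sketch below uses a stable-then-decay shape; the peak/floor values and the 20% decay window are made-up numbers, not the actual configuration:

```python
# Illustration only: a stable-then-decay schedule that holds the learning rate flat,
# then drops it quickly over the final steps. All constants here are assumptions.
TOTAL_STEPS = 65_000

def lr_at(step: int, peak: float = 3e-4, floor: float = 3e-5, decay_frac: float = 0.2) -> float:
    decay_start = int(TOTAL_STEPS * (1 - decay_frac))
    if step < decay_start:
        return peak  # stable phase
    t = (step - decay_start) / (TOTAL_STEPS - decay_start)
    return peak + (floor - peak) * t  # linear drop over the final 20% of steps

for s in (0, 40_000, 52_000, 58_500, 65_000):
    print(s, f"{lr_at(s):.2e}")
```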
  **We warmly invite developers and researchers interested in this field to follow and engage with the community, working together to advance the technology. Stay tuned for the open-source release of the dataset!**
 
  ## License Agreement
 
@@ -158,6 +176,20 @@ The Chinese Fineweb Edu dataset draws its raw data from a wide range of sources, covering multiple domestic
We recorded 100k data samples and their scores to form `fineweb_edu_classifier_chinese_data`. Using the scores in this dataset as labels for text scoring, we trained a Chinese BERT model, `fineweb_edu_classifier_chinese`, which assigns each input text a score from 0 to 5. We will further optimize this scoring model; in the future, the OpenCSG algorithm team will open-source the `fineweb_edu_classifier_chinese_data` dataset and the `fineweb_edu_classifier_chinese` scoring model to further promote community development and exchange. The dataset contains finely annotated and scored educational text, providing high-quality training data for researchers and developers.
 
+ ## Ablation Experiments
+
+ We designed ablation studies to compare the effect of the Chinese-fineweb-edu dataset against traditional Chinese pre-training corpora. For this purpose, we randomly sampled from five datasets (CCI2-Data, SkyPile-150B, TeleChat-PTD, IndustryCorpus, and MAP-CC) in proportions matching the Chinese-fineweb-edu dataset, constructing a comparison dataset named chinese-random-select.
+ For the experiments, we trained a 2.1B-parameter model for 65k steps on each of the two datasets. During training we periodically saved model checkpoints and evaluated them on the Chinese benchmarks CEval and CMMLU. The graph below shows how the two training runs trend on these evaluation tasks.
+ The results clearly show that the model trained on Chinese-fineweb-edu significantly outperforms the one trained on chinese-random-select on both benchmarks, with the gap widening in the later stages of training, demonstrating the effectiveness and suitability of Chinese-fineweb-edu for Chinese language tasks. This result further shows that dataset selection and construction have a decisive influence on a model's final performance.
+ <p align="center">
+ <img width="900px" alt="experiment" src="./chinese-fineweb-benchmark.png">
+ </p>
+
+
+ The results also show that late in training, once the run enters its second epoch and the learning rate enters its rapid-decay phase, the accuracy of the model trained on chinese-fineweb-edu rises markedly, while the model trained on the randomly sampled data stays at a lower level.
+ This demonstrates that the high-quality chinese-fineweb-edu data substantially improves training effectiveness: for the same training time, it raises model capability faster and saves training resources. This outcome closely mirrors HuggingFace's data-ablation experiments on fineweb-edu.
+
**We warmly invite developers and researchers interested in this field to follow and engage with the community, jointly advancing the technology. Stay tuned for the open-source release of the dataset!**
 
## License Agreement