Jinkin committed on
Commit bf14719
1 Parent(s): d1d71a7

Update README.md

Files changed (1)
  1. README.md +5 -0
README.md CHANGED
@@ -1057,6 +1057,11 @@ model-index:
 
 ## piccolo-base-zh
 
+piccolo is a general-purpose embedding model trained by the General Model Group at SenseTime. Borrowing from the training pipelines of E5 and GTE, piccolo is trained in two stages.
+In the first stage, we collected and crawled 400 million Chinese text pairs (which can be regarded as weakly supervised text-pair data) and optimized the model with a pairwise softmax contrastive loss.
+In the second stage, we collected 20 million human-labeled Chinese text pairs (finely labeled data) from the Internet and used a triplet softmax contrastive loss with hard negatives to help the model optimize further.
+Currently, we provide two models: piccolo-base-zh and piccolo-large-zh.
+
 piccolo is a general text embedding model, powered by General Model Group from SenseTime Research.
 Based on the BERT framework, piccolo is trained using a two-stage pipeline. In the first stage, we collect and crawl 400 million weakly supervised Chinese text pairs from the Internet,
 and train the model with the pair (text and positive text) softmax contrastive loss.
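
The two losses named in the README text above can be illustrated with one small function. This is a minimal sketch and not the authors' training code: the function name `softmax_contrastive_loss`, the use of in-batch negatives, the cosine similarity, and the temperature value `0.05` are all assumptions made for illustration, since the README does not specify them. Without `hard_negatives` it computes a pairwise softmax contrastive loss. With `hard_negatives` it adds one mined negative per query to the candidate pool, which corresponds to the triplet variant.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a non-zero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def softmax_contrastive_loss(queries, positives, hard_negatives=None, temperature=0.05):
    """Mean softmax contrastive loss over a batch (hypothetical helper).

    positives[i] is the positive text embedding for queries[i]; every other
    candidate in the batch is treated as a negative (an assumption).
    hard_negatives[i], if given, is an extra mined negative for queries[i].
    """
    q = [normalize(v) for v in queries]
    candidates = [normalize(v) for v in positives]
    if hard_negatives is not None:
        candidates += [normalize(v) for v in hard_negatives]

    total = 0.0
    for i, qi in enumerate(q):
        logits = [dot(qi, c) / temperature for c in candidates]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        # Cross-entropy with the positive at index i as the target.
        total += log_z - logits[i]
    return total / len(q)


# Toy 2-dimensional embeddings, invented for this example.
queries = [[1.0, 0.0], [0.0, 1.0]]
positives = [[0.9, 0.1], [0.1, 0.9]]
hard_negs = [[0.6, 0.4], [0.4, 0.6]]

pair_loss = softmax_contrastive_loss(queries, positives)
triplet_loss = softmax_contrastive_loss(queries, positives, hard_negs)
```

Because the hard negatives only add terms to the softmax denominator, `triplet_loss` is always larger than `pair_loss` for the same batch, which is what makes the second-stage objective harder to satisfy.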