sensenova/piccolo-base-zh

Jinkin committed on Sep 5, 2023

Commit bf14719 • 1 parent: d1d71a7

Update README.md

Files changed (1)

README.md +5 -0


@@ -1057,6 +1057,11 @@ model-index:
 ## piccolo-base-zh
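+piccolo is a general-purpose embedding model trained by the General Model Group at SenseTime. Following the training recipes of E5 and GTE, piccolo uses a two-stage training pipeline.
+In the first stage, we collected and crawled 400 million Chinese text pairs (which can be regarded as weakly supervised text-pair data) and optimized the model with a softmax contrastive loss over pairs.
+In the second stage, we collected 20 million human-annotated Chinese text pairs (finely labeled data) from the Internet and optimized the model further with a softmax contrastive loss over triplets that include hard negatives.
+At present, we provide two models: piccolo-base-zh and piccolo-large-zh.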
 piccolo is a general text embedding model, powered by General Model Group from SenseTime Research.
 Based on BERT framework, piccolo is trained using a two stage pipeline. On the first stage, we collect and crawl 400 million weakly supervised Chinese text pairs from the Internet,
 and train the model with the pair(text and text pos) softmax contrastive loss.
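
For readers unfamiliar with the loss named in this README change, the following is a minimal sketch of a softmax contrastive loss over text pairs, with optional hard negatives for the triplet case. It is not the authors' training code. Several details are assumptions that the README does not state: the use of in-batch negatives, cosine similarity, the temperature value `0.05`, and every function and variable name. Embeddings are plain Python lists so the example runs without any dependencies.

```python
import math


def normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax_contrastive_loss(queries, positives, hard_negatives=None, temperature=0.05):
    """Mean cross-entropy of picking each query's own positive among all candidates.

    Candidates are every positive in the batch (in-batch negatives, an assumption)
    plus, optionally, every hard negative in the batch (the triplet case).
    The temperature default is illustrative, not taken from the README.
    """
    q = [normalize(v) for v in queries]
    candidates = [normalize(v) for v in positives]
    if hard_negatives is not None:
        candidates += [normalize(v) for v in hard_negatives]

    total = 0.0
    for i, query in enumerate(q):
        logits = [dot(query, c) / temperature for c in candidates]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(z - peak) for z in logits))
        # The matching positive for query i sits at index i.
        total += log_denominator - logits[i]
    return total / len(q)


# Toy 2-dimensional embeddings, invented purely for illustration.
queries = [[1.0, 0.0], [0.0, 1.0]]
positives = [[0.9, 0.1], [0.1, 0.9]]
hard_negatives = [[0.7, 0.7], [0.7, 0.7]]

print("pair loss:   ", softmax_contrastive_loss(queries, positives))
print("triplet loss:", softmax_contrastive_loss(queries, positives, hard_negatives))
```

Adding hard negatives only adds terms to the softmax denominator, so for the same batch the triplet loss is never smaller than the pair loss. This is one way to see why hard negatives give the model a stricter training signal.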