update reference
README.md
CHANGED
@@ -1137,6 +1137,22 @@ some useful tricks:
2. Dataset sampler. We use M3E's dataset sampler to ensure that all samples in a batch come from the same dataset, which makes the negative samples more valuable.
3. Instruction. Instructions greatly improved retrieval performance in our experiments. We prepend instructions such as 'query: ' and 'result: ' to each training sample.

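The two tricks above can be illustrated with a minimal sketch. It is not M3E's sampler or Piccolo's actual training code: the function names `add_instruction` and `single_dataset_batches`, the `(query, passage)` pair layout, and the `batch_size` and `seed` parameters are assumptions made for illustration. Only the 'query: ' and 'result: ' prefixes come from the text.

```python
import random
from typing import Dict, Iterator, List, Tuple

Pair = Tuple[str, str]  # hypothetical layout: (query text, passage text)

QUERY_PREFIX = "query: "    # instruction prefixes quoted in the README
RESULT_PREFIX = "result: "


def add_instruction(query: str, passage: str) -> Pair:
    """Prepend the instruction prefixes to one training pair."""
    return QUERY_PREFIX + query, RESULT_PREFIX + passage


def single_dataset_batches(
    datasets: Dict[str, List[Pair]], batch_size: int, seed: int = 0
) -> Iterator[Tuple[str, List[Pair]]]:
    """Yield (dataset name, batch) so that every batch draws from one dataset."""
    rng = random.Random(seed)
    batches: List[Tuple[str, List[Pair]]] = []
    for name, pairs in datasets.items():
        shuffled = list(pairs)
        rng.shuffle(shuffled)  # shuffle within the dataset
        for start in range(0, len(shuffled), batch_size):
            chunk = shuffled[start:start + batch_size]
            batches.append((name, [add_instruction(q, p) for q, p in chunk]))
    rng.shuffle(batches)  # interleave datasets at the batch level
    yield from batches
```

Because each batch holds pairs from a single dataset, the other passages in the batch come from the same domain as the query, which is the effect the dataset sampler item describes.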
+## Reference
+
+Here we list the embedding projects and papers we referenced:
+1. [M3E](https://github.com/wangyuxinwhy/uniem). An excellent open-source Chinese embedding project that collects and organizes many high-quality Chinese datasets. uniem is also a good framework.
+2. [Text2vec](https://github.com/shibing624/text2vec). Another excellent open-source Chinese embedding project.
+3. [FlagEmbedding](https://github.com/FlagOpen/FlagEmbedding). Zhiyuan AI's open-source embedding model. They collected and organized the CMTEB benchmark, filling the gap in systematic evaluation of Chinese embeddings.
+4. [E5](https://github.com/microsoft/unilm/tree/master/e5). A paper from Microsoft with very detailed ablation experiments and details on data processing and filtering.
+5. [GTE](https://arxiv.org/abs/2308.03281). An embedding paper from Alibaba DAMO.
+

## License
Piccolo is released under the MIT License and is free for commercial use.