Commit 31a500a1 • Parent: f20d0dd
Update README.md (#6)
- Update README.md (1661329696483c2d43fae5f241c862deaa8b9dc6)
Co-authored-by: namespace-Pt <namespace-Pt@users.noreply.huggingface.co>

README.md CHANGED
@@ -1,10 +1,18 @@
----
-license: mit
----
-
-
 <h1 align="center">FlagEmbedding</h1>
-
 
 <h4 align="center">
 <p>
|
@@ -19,18 +27,18 @@ license: mit
 <p>
 </h4>
 
-More details please refer to our Github: [FlagEmbedding](https://github.com/FlagOpen/FlagEmbedding).
-
 
 [English](README.md) | [中文](https://github.com/FlagOpen/FlagEmbedding/blob/master/README_zh.md)
 
-
-
 
 ************* 🌟**Updates**🌟 *************
-- 10/12/2023: Release [LLM-Embedder](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/llm_embedder), a unified embedding model to support diverse retrieval augmentation needs for LLMs. [Paper](https://arxiv.org/pdf/2310.07554.pdf)
 - 09/15/2023: The [technical report](https://arxiv.org/pdf/2309.07597.pdf) of BGE has been released
-- 09/15/2023: The [
 - 09/12/2023: New models:
   - **New reranker model**: release cross-encoder models `BAAI/bge-reranker-base` and `BAAI/bge-reranker-large`, which are more powerful than embedding model. We recommend to use/fine-tune them to re-rank top-k documents returned by embedding models.
   - **update embedding model**: release `bge-*-v1.5` embedding model to alleviate the issue of the similarity distribution, and enhance its retrieval ability without instruction.
@@ -72,29 +80,27 @@ And it also can be used in vector databases for LLMs.
 | [BAAI/bge-small-zh](https://huggingface.co/BAAI/bge-small-zh) | Chinese | [Inference](#usage-for-embedding-model) [Fine-tune](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune) | a small-scale model but with competitive performance | `为这个句子生成表示以用于检索相关文章:` |
 
 
-[1\]: If you need to search the relevant passages
 
-[2\]: Different from embedding model, reranker uses question and document as input and directly output similarity instead of embedding. To balance the accuracy and time cost, cross-encoder is widely used to re-rank top-k documents retrieved by other simple models.
-For
 
 All models have been uploaded to Huggingface Hub, and you can see them at https://huggingface.co/BAAI.
-If you cannot open the Huggingface Hub, you also
 
 
 ## Frequently asked questions
 
-
-<summary>1. How to fine-tune bge embedding model?</summary>
 
-<!-- ### How to fine-tune bge embedding model? -->
 Following this [example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune) to prepare data and fine-tune your model.
 Some suggestions:
 - Mine hard negatives following this [example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune#hard-negatives), which can improve the retrieval performance.
 - If you pre-train bge on your data, the pre-trained model cannot be directly used to calculate similarity, and it must be fine-tuned with contrastive learning before computing similarity.
 - If the accuracy of the fine-tuned model is still not high, it is recommended to use/fine-tune the cross-encoder model (bge-reranker) to re-rank top-k results. Hard negatives also are needed to fine-tune reranker.
 
-
-</details>
 
 <details>
 <summary>2. The similarity score between two dissimilar sentences is higher than 0.5</summary>
@@ -134,7 +140,7 @@ In all cases, the documents/passages do not need to add the instruction.
 
 ### Usage for Embedding Model
 
-Here are some examples
 [FlagEmbedding](#using-flagembedding), [Sentence-Transformers](#using-sentence-transformers), [Langchain](#using-langchain), or [Huggingface Transformers](#using-huggingface-transformers).
 
 #### Using FlagEmbedding
@@ -366,11 +372,11 @@ See [C_MTEB](https://github.com/FlagOpen/FlagEmbedding/blob/master/C_MTEB/) for
 
 ### BAAI Embedding
 
-We pre-train the models using [retromae](https://github.com/staoxiao/RetroMAE) and train them on large-scale
 **You can fine-tune the embedding model on your data following our [examples](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune).**
 We also provide a [pre-train example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/pretrain).
 Note that the goal of pre-training is to reconstruct the text, and the pre-trained model cannot be used for similarity calculation directly, it needs to be fine-tuned.
-
 
 
 
@@ -381,8 +387,14 @@ which is more accurate than embedding model (i.e., bi-encoder) but more time-con
 Therefore, it can be used to re-rank the top-k documents returned by embedding model.
 We train the cross-encoder on a multilingual pair data,
 The data format is the same as embedding model, so you can fine-tune it easily following our [example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/reranker).
-
 
 
 ## Contact
 If you have any question or suggestion related to this project, feel free to open an issue or pull request.
@@ -402,6 +414,15 @@ If you find this repository useful, please consider giving a star :star: and cit
 archivePrefix={arXiv},
 primaryClass={cs.CL}
 }
 ```
 
 ## License

README.md (after this commit):

 <h1 align="center">FlagEmbedding</h1>
+<p align="center">
+<a href="https://github.com/FlagOpen/FlagEmbedding">
+<img alt="Build" src="https://img.shields.io/badge/Contribution-Welcome-blue">
+</a>
+<a href="https://github.com/FlagOpen/FlagEmbedding/blob/master/LICENSE">
+<img alt="License" src="https://img.shields.io/badge/LICENSE-MIT-green">
+</a>
+<a href="https://huggingface.co/C-MTEB">
+<img alt="Build" src="https://img.shields.io/badge/C_MTEB-🤗-yellow">
+</a>
+<a href="https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/baai_general_embedding">
+<img alt="Build" src="https://img.shields.io/badge/FlagEmbedding-1.1-red">
+</a>
+</p>
 
 <h4 align="center">
 <p>
…
 <p>
 </h4>
 
 
 [English](README.md) | [中文](https://github.com/FlagOpen/FlagEmbedding/blob/master/README_zh.md)
 
+<span style="#FF69B4;"> **Hiring:** We're seeking experienced NLP researchers and intern students focusing on dense retrieval and retrieval-augmented LLMs. If you're interested, please feel free to reach out to us via email at zhengliu1026@gmail.com.</span>
+
+FlagEmbedding can map any text to a low-dimensional dense vector, which can be used for tasks like retrieval, classification, clustering, and semantic search.
+And it can also be used in vector databases for LLMs.
 
 ************* 🌟**Updates**🌟 *************
+- 10/12/2023: Release [LLM-Embedder](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/llm_embedder), a unified embedding model to support diverse retrieval augmentation needs for LLMs. [Paper](https://arxiv.org/pdf/2310.07554.pdf) :fire:
 - 09/15/2023: The [technical report](https://arxiv.org/pdf/2309.07597.pdf) of BGE has been released
+- 09/15/2023: The [massive training data](https://data.baai.ac.cn/details/BAAI-MTP) of BGE has been released
 - 09/12/2023: New models:
   - **New reranker model**: release cross-encoder models `BAAI/bge-reranker-base` and `BAAI/bge-reranker-large`, which are more powerful than embedding model. We recommend to use/fine-tune them to re-rank top-k documents returned by embedding models.
   - **update embedding model**: release `bge-*-v1.5` embedding model to alleviate the issue of the similarity distribution, and enhance its retrieval ability without instruction.
…
 | [BAAI/bge-small-zh](https://huggingface.co/BAAI/bge-small-zh) | Chinese | [Inference](#usage-for-embedding-model) [Fine-tune](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune) | a small-scale model but with competitive performance | `为这个句子生成表示以用于检索相关文章:` |
 
 
+[1\]: If you need to search the relevant passages in a query, we suggest to add the instruction to the query; in other cases, no instruction is needed, just use the original query directly. In all cases, **no instruction** needs to be added to passages.
 
+[2\]: Different from the embedding model, reranker uses question and document as input and directly output similarity instead of embedding. To balance the accuracy and time cost, cross-encoder is widely used to re-rank top-k documents retrieved by other simple models.
+For example, use bge embedding model to retrieve top 100 relevant documents, and then use bge reranker to re-rank the top 100 documents to get the final top-3 results.
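Editor's note: to make the two-stage retrieve-then-rerank flow above concrete, here is a runnable sketch in plain Python. The bag-of-words `embed` and token-overlap `rerank_score` below are toy stand-ins for the bge embedding model and bge reranker; only the control flow (cheap dense retrieval over the whole corpus, expensive rescoring of a short list) reflects the pipeline described:

```python
import numpy as np

def embed(texts):
    """Toy bag-of-words 'bi-encoder' (stand-in for a bge embedding model)."""
    vocab = sorted({tok for t in texts for tok in t.lower().split()})
    idx = {tok: i for i, tok in enumerate(vocab)}
    vecs = np.zeros((len(texts), len(vocab)))
    for r, t in enumerate(texts):
        for tok in t.lower().split():
            vecs[r, idx[tok]] += 1.0
    # L2-normalize so inner product behaves like cosine similarity.
    return vecs / (np.linalg.norm(vecs, axis=1, keepdims=True) + 1e-9)

def rerank_score(query, doc):
    """Toy 'cross-encoder': fraction of query tokens present in the doc
    (stand-in for a bge-reranker scoring the (query, doc) pair jointly)."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q)

docs = [
    "bge models map text to dense vectors",
    "the weather is sunny today",
    "dense vectors support semantic retrieval",
    "cook pasta with garlic and olive oil",
]
query = "dense vectors for retrieval"

# Stage 1 (cheap): embed everything once, rank docs by inner product, keep top-3.
mat = embed([query] + docs)
sims = mat[1:] @ mat[0]
topk = np.argsort(-sims)[:3]

# Stage 2 (expensive): re-score only the shortlisted docs with the reranker.
reranked = sorted(topk, key=lambda i: rerank_score(query, docs[i]), reverse=True)
best = reranked[0]
print(docs[best])
```

In a real deployment both stages would call the actual models; the point is that the reranker only ever sees the k candidates the embedding model surfaced.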
 
 All models have been uploaded to Huggingface Hub, and you can see them at https://huggingface.co/BAAI.
+If you cannot open the Huggingface Hub, you can also download the models at https://model.baai.ac.cn/models .
 
 
 ## Frequently asked questions
 
+**1. How to fine-tune bge embedding model?**
 
 Following this [example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune) to prepare data and fine-tune your model.
 Some suggestions:
 - Mine hard negatives following this [example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune#hard-negatives), which can improve the retrieval performance.
+- In general, larger hyper-parameter `per_device_train_batch_size` brings better performance. You can expand it by enabling `--fp16`, `--deepspeed df_config.json` (df_config.json can refer to [ds_config.json](https://github.com/FlagOpen/FlagEmbedding/blob/master/examples/finetune/ds_config.json)), `--gradient_checkpointing`, etc.
 - If you pre-train bge on your data, the pre-trained model cannot be directly used to calculate similarity, and it must be fine-tuned with contrastive learning before computing similarity.
 - If the accuracy of the fine-tuned model is still not high, it is recommended to use/fine-tune the cross-encoder model (bge-reranker) to re-rank top-k results. Hard negatives also are needed to fine-tune reranker.
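Editor's note: a quick arithmetic illustration of the batch-size suggestion above. With in-batch (and, in distributed runs, cross-device) negative sharing, the pool of negatives each query is contrasted against scales with the total batch, so memory savers like `--fp16` and gradient checkpointing matter mainly because they let `per_device_train_batch_size` grow. The GPU counts and batch sizes below are made up for the example:

```python
def total_batch(per_device_batch, n_gpus):
    """Examples in one synchronized forward pass across all devices."""
    return per_device_batch * n_gpus

# Hypothetical runs on 8 GPUs: freeing memory lets the per-device batch
# quadruple, and the in-batch negative pool grows with it.
baseline = total_batch(per_device_batch=8, n_gpus=8)      # → 64
with_savers = total_batch(per_device_batch=32, n_gpus=8)  # → 256
print(baseline, with_savers)
```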
 
+
 
 <details>
 <summary>2. The similarity score between two dissimilar sentences is higher than 0.5</summary>
…
 
 ### Usage for Embedding Model
 
+Here are some examples of using `bge` models with
 [FlagEmbedding](#using-flagembedding), [Sentence-Transformers](#using-sentence-transformers), [Langchain](#using-langchain), or [Huggingface Transformers](#using-huggingface-transformers).
 
 #### Using FlagEmbedding
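Editor's note: the library-specific snippets are elided from this excerpt, but all four options follow the same pattern: encode sentences to L2-normalized vectors, then score query-passage pairs by inner product. A library-agnostic sketch of that scoring step, with random unit vectors standing in for real model outputs:

```python
import numpy as np

rng = np.random.default_rng(42)

def fake_encode(sentences, dim=384):
    """Stand-in for a model's encode(): one L2-normalized vector per sentence.
    bge embeddings are likewise normalized, so inner product equals cosine."""
    vecs = rng.standard_normal((len(sentences), dim))
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

queries = ["query 1", "query 2"]
passages = ["passage A", "passage B", "passage C"]

q_emb = fake_encode(queries)
p_emb = fake_encode(passages)
scores = q_emb @ p_emb.T   # similarity matrix: one row per query
print(scores.shape)        # → (2, 3)
```

Swapping `fake_encode` for a real encoder call leaves the scoring code unchanged, which is why the four libraries listed above are interchangeable here.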
…
 
 ### BAAI Embedding
 
+We pre-train the models using [retromae](https://github.com/staoxiao/RetroMAE) and train them on large-scale pair data using contrastive learning.
 **You can fine-tune the embedding model on your data following our [examples](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune).**
 We also provide a [pre-train example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/pretrain).
 Note that the goal of pre-training is to reconstruct the text, and the pre-trained model cannot be used for similarity calculation directly, it needs to be fine-tuned.
+For more training details for bge see [baai_general_embedding](https://github.com/FlagOpen/FlagEmbedding/blob/master/FlagEmbedding/baai_general_embedding/README.md).
 
 
 
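Editor's note: the contrastive objective referred to above is typically InfoNCE: for each query, softmax its similarity to the positive passage against negatives and minimize the negative log-likelihood. A minimal numpy illustration with toy vectors; the 0.05 temperature is a representative value for dense retrieval, not necessarily the project's exact setting:

```python
import numpy as np

def info_nce(query, passages, pos_idx, temperature=0.05):
    """Cross-entropy of the softmax over query-passage cosine similarities;
    the passage at pos_idx is the positive, the rest act as negatives."""
    q = query / np.linalg.norm(query)
    p = passages / np.linalg.norm(passages, axis=1, keepdims=True)
    logits = (p @ q) / temperature
    logits -= logits.max()                       # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[pos_idx])

rng = np.random.default_rng(0)
query = rng.standard_normal(16)
positive = query + 0.1 * rng.standard_normal(16)   # embedding near the query
negatives = rng.standard_normal((7, 16))           # unrelated passages

batch = np.vstack([positive, negatives])
loss_aligned = info_nce(query, batch, pos_idx=0)     # true positive: small loss
loss_mislabeled = info_nce(query, batch, pos_idx=1)  # negative as "positive": large loss
```

Minimizing this loss pulls positives toward their queries and pushes negatives away, which is exactly the property the reconstruction-only pre-trained model lacks until it is fine-tuned.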
…
 Therefore, it can be used to re-rank the top-k documents returned by embedding model.
 We train the cross-encoder on a multilingual pair data,
 The data format is the same as embedding model, so you can fine-tune it easily following our [example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/reranker).
+For more details please refer to [./FlagEmbedding/reranker/README.md](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/reranker)
+
 
+### Our Contributors:
+
+<a href="https://github.com/FlagOpen/FlagEmbedding/graphs/contributors">
+<img src="https://contrib.rocks/image?repo=FlagOpen/FlagEmbedding" />
+</a>
 
 ## Contact
 If you have any question or suggestion related to this project, feel free to open an issue or pull request.
…
 archivePrefix={arXiv},
 primaryClass={cs.CL}
 }
+
+@misc{llm_embedder,
+title={Retrieve Anything To Augment Large Language Models},
+author={Peitian Zhang and Shitao Xiao and Zheng Liu and Zhicheng Dou and Jian-Yun Nie},
+year={2023},
+eprint={2310.07554},
+archivePrefix={arXiv},
+primaryClass={cs.IR}
+}
 ```
 
 ## License