---
license: cc-by-nc-4.0
language:
- multilingual
- af
- am
- ar
- as
- az
- be
- bg
- bn
- br
- bs
- ca
- cs
- cy
- da
- de
- el
- en
- eo
- es
- et
- eu
- fa
- fi
- fr
- fy
- ga
- gd
- gl
- gu
- ha
- he
- hi
- hr
- hu
- hy
- id
- is
- it
- ja
- jv
- ka
- kk
- km
- kn
- ko
- ku
- ky
- la
- lo
- lt
- lv
- mg
- mk
- ml
- mn
- mr
- ms
- my
- ne
- nl
- 'no'
- om
- or
- pa
- pl
- ps
- pt
- ro
- ru
- sa
- sd
- si
- sk
- sl
- so
- sq
- sr
- su
- sv
- sw
- ta
- te
- th
- tl
- tr
- ug
- uk
- ur
- uz
- vi
- xh
- yi
- zh
inference: false
tags:
- ColBERT
- passage-retrieval
---
<br><br>
<p align="center">
<img src="https://aeiljuispo.cloudimg.io/v7/https://cdn-uploads.huggingface.co/production/uploads/603763514de52ff951d89793/AFoybzd5lpBQXEBrQHuTt.png?w=200&h=200&f=face" alt="Finetuner logo: Finetuner helps you to create experiments in order to improve embeddings on search tasks. It accompanies you to deliver the last mile of performance-tuning for neural search applications." width="150px">
</p>
<p align="center">
<b>Trained by <a href="https://jina.ai/"><b>Jina AI</b></a>.</b>
</p>
<p align="center">
<b>JinaColBERT V2: your multilingual late interaction retriever!</b>
</p>
JinaColBERT V2 (`jina-colbert-v2`) is a new model that expands on the capabilities and performance of its predecessor, [`jina-colbert-v1-en`](https://huggingface.co/jinaai/jina-colbert-v1-en). Like the previous release, it has Jina AI's 8192-token input context and the [improved efficiency and performance](https://jina.ai/news/what-is-colbert-and-late-interaction-and-why-they-matter-in-search/) and [explainability](https://jina.ai/news/ai-explainability-made-easy-how-late-interaction-makes-jina-colbert-transparent/) of token-level embeddings and late interaction.

This release adds new functionality and performance improvements:
- Multilingual support for dozens of languages, with strong performance on major global languages.
- [Matryoshka embeddings](https://huggingface.co/blog/matryoshka), which allow users to trade between efficiency and precision flexibly.
- Superior retrieval performance when compared to the English-only [`jina-colbert-v1-en`](https://huggingface.co/jinaai/jina-colbert-v1-en).
JinaColBERT V2 offers three versions with different embedding dimensions:

- [`jinaai/jina-colbert-v2`](https://huggingface.co/jinaai/jina-colbert-v2): 128-dimension embeddings
- [`jinaai/jina-colbert-v2-96`](https://huggingface.co/jinaai/jina-colbert-v2-96): 96-dimension embeddings
- [`jinaai/jina-colbert-v2-64`](https://huggingface.co/jinaai/jina-colbert-v2-64): 64-dimension embeddings
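Because the model is trained with Matryoshka representation learning, the lower-dimension variants behave like truncations of the full 128-dimension embeddings. As a minimal illustrative sketch (not an official API; `truncate_matryoshka` is a hypothetical helper), truncating and re-normalizing the token embeddings yourself looks like this:

```python
import torch
import torch.nn.functional as F

def truncate_matryoshka(token_embeddings: torch.Tensor, dim: int = 64) -> torch.Tensor:
    """Keep the first `dim` components of each token embedding and re-normalize.

    Assumes `token_embeddings` is a (num_tokens, 128) tensor of L2-normalized
    token vectors produced by jina-colbert-v2.
    """
    truncated = token_embeddings[:, :dim]       # drop the trailing dimensions
    return F.normalize(truncated, p=2, dim=-1)  # restore unit norm for cosine MaxSim
```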
## Usage
### Installation
`jina-colbert-v2` is trained with flash attention and therefore requires `einops` and `flash_attn` to be installed.
To use the model, you can either use the Stanford ColBERT library or the `pylate` or `ragatouille` packages.
```bash
pip install -U einops flash_attn
pip install -U ragatouille # or
pip install -U colbert-ai # or
pip install -U pylate
```
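Note that `flash_attn` requires a CUDA-capable GPU, and in many environments it must be built with `pip install flash-attn --no-build-isolation`.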
### PyLate
```python
# Please refer to PyLate (https://github.com/lightonai/pylate) for detailed usage.
from pylate import indexes, models, retrieve

model = models.ColBERT(
    model_name_or_path="jinaai/jina-colbert-v2",
    query_prefix="[QueryMarker]",
    document_prefix="[DocumentMarker]",
    attend_to_expansion_tokens=True,
    trust_remote_code=True,
)
```
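Building on this, here is a minimal reranking sketch based on PyLate's documented `encode` and `rank.rerank` APIs (the query, documents, and IDs are placeholders):

```python
from pylate import rank

queries = ["What does ColBERT do?"]
# One candidate list per query.
documents = [[
    "ColBERT is a novel ranking model that adapts deep LMs for efficient retrieval.",
    "Paris is the capital of France.",
]]

# Token-level embeddings; `is_query` toggles the query/document markers configured above.
queries_embeddings = model.encode(queries, is_query=True)
documents_embeddings = model.encode(documents, is_query=False)

# Score each candidate list against its query with MaxSim late interaction.
reranked = rank.rerank(
    documents_ids=[["doc-0", "doc-1"]],
    queries_embeddings=queries_embeddings,
    documents_embeddings=documents_embeddings,
)
```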
### RAGatouille
```python
from ragatouille import RAGPretrainedModel

# Load the model from the Hugging Face Hub.
RAG = RAGPretrainedModel.from_pretrained("jinaai/jina-colbert-v2")
docs = [
    "ColBERT is a novel ranking model that adapts deep LMs for efficient retrieval.",
    "Jina-ColBERT is a ColBERT-style model based on JinaBERT, supporting 8k context length together with fast and accurate retrieval.",
]

# Build an index over the documents, then search it.
RAG.index(docs, index_name="demo")
query = "What does ColBERT do?"
results = RAG.search(query)
```
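If you already have candidate documents, RAGatouille can also rerank them directly without building an index; a short sketch using its `rerank` method:

```python
# Rerank the candidate documents for the query, returning the top 2.
reranked = RAG.rerank(query="What does ColBERT do?", documents=docs, k=2)
```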
### Stanford ColBERT
```python
from colbert.infra import ColBERTConfig
from colbert.modeling.checkpoint import Checkpoint

ckpt = Checkpoint("jinaai/jina-colbert-v2", colbert_config=ColBERTConfig())
docs = [
    "ColBERT is a novel ranking model that adapts deep LMs for efficient retrieval.",
    "Jina-ColBERT is a ColBERT-style model based on JinaBERT, supporting 8k context length together with fast and accurate retrieval.",
]

# Encode queries and documents separately; each text yields a matrix of token embeddings.
query_vectors = ckpt.queryFromText(["What does ColBERT do?"])
doc_vectors = ckpt.docFromText(docs, bsize=2)
```
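To make the late-interaction step concrete, here is a minimal MaxSim sketch (an illustration of the scoring rule, not part of the ColBERT API): each query token keeps its maximum similarity over the document's tokens, and those maxima are summed into the relevance score.

```python
import torch

def maxsim_score(query_vectors: torch.Tensor, doc_vectors: torch.Tensor) -> torch.Tensor:
    """Late-interaction (MaxSim) relevance of one document to one query.

    query_vectors: (num_query_tokens, dim); doc_vectors: (num_doc_tokens, dim).
    Both are assumed L2-normalized, so dot products are cosine similarities.
    """
    similarities = query_vectors @ doc_vectors.T  # (q_tokens, d_tokens)
    return similarities.max(dim=1).values.sum()   # max over doc tokens, sum over query tokens
```

For the batched outputs above, pass a single item's matrix, e.g. `query_vectors[0]` and `doc_vectors[0]`.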
## Evaluation Results
### Retrieval Benchmarks
#### BEIR
| **NDCG@10** | **jina-colbert-v2** | **jina-colbert-v1** | **ColBERTv2.0** | **BM25** |
|--------------------|---------------------|---------------------|-----------------|----------|
| **avg** | 0.531 | 0.502 | 0.496 | 0.440 |
| **nfcorpus** | 0.346 | 0.338 | 0.337 | 0.325 |
| **fiqa** | 0.408 | 0.368 | 0.354 | 0.236 |
| **trec-covid** | 0.834 | 0.750 | 0.726 | 0.656 |
| **arguana** | 0.366 | 0.494 | 0.465 | 0.315 |
| **quora** | 0.887 | 0.823 | 0.855 | 0.789 |
| **scidocs** | 0.186 | 0.169 | 0.154 | 0.158 |
| **scifact** | 0.678 | 0.701 | 0.689 | 0.665 |
| **webis-touche** | 0.274 | 0.270 | 0.260 | 0.367 |
| **dbpedia-entity** | 0.471 | 0.413 | 0.452 | 0.313 |
| **fever** | 0.805 | 0.795 | 0.785 | 0.753 |
| **climate-fever** | 0.239 | 0.196 | 0.176 | 0.213 |
| **hotpotqa** | 0.766 | 0.656 | 0.675 | 0.603 |
| **nq** | 0.640 | 0.549 | 0.524 | 0.329 |
#### MS MARCO Passage Retrieval
| **MRR@10** | **jina-colbert-v2** | **jina-colbert-v1** | **ColBERTv2.0** | **BM25** |
|-------------|---------------------|---------------------|-----------------|----------|
| **MSMARCO** | 0.396 | 0.390 | 0.397 | 0.187 |
### Multilingual Benchmarks
#### MIRACL
| **NDCG@10** | **jina-colbert-v2** | **mDPR (zero shot)** |
|---------|---------------------|----------------------|
| **avg** | 0.627 | 0.427 |
| **ar** | 0.753 | 0.499 |
| **bn** | 0.750 | 0.443 |
| **de** | 0.504 | 0.490 |
| **es** | 0.538 | 0.478 |
| **en** | 0.570 | 0.394 |
| **fa** | 0.563 | 0.480 |
| **fi** | 0.740 | 0.472 |
| **fr** | 0.541 | 0.435 |
| **hi** | 0.600 | 0.383 |
| **id** | 0.547 | 0.272 |
| **ja** | 0.632 | 0.439 |
| **ko** | 0.671 | 0.419 |
| **ru** | 0.643 | 0.407 |
| **sw** | 0.499 | 0.299 |
| **te** | 0.742 | 0.356 |
| **th** | 0.772 | 0.358 |
| **yo** | 0.623 | 0.396 |
| **zh** | 0.523 | 0.512 |
#### mMARCO
| **MRR@10** | **jina-colbert-v2** | **BM25** | **ColBERT-XM** |
|------------|---------------------|-----------|----------------|
| **avg** | 0.313 | 0.141 | 0.254 |
| **ar** | 0.272 | 0.111 | 0.195 |
| **de** | 0.331 | 0.136 | 0.270 |
| **nl** | 0.330 | 0.140 | 0.275 |
| **es** | 0.341 | 0.158 | 0.285 |
| **fr** | 0.335 | 0.155 | 0.269 |
| **hi** | 0.309 | 0.134 | 0.238 |
| **id** | 0.319 | 0.149 | 0.263 |
| **it** | 0.337 | 0.153 | 0.265 |
| **ja** | 0.276 | 0.141 | 0.241 |
| **pt** | 0.337 | 0.152 | 0.276 |
| **ru** | 0.298 | 0.124 | 0.251 |
| **vi** | 0.287 | 0.136 | 0.226 |
| **zh** | 0.302 | 0.116 | 0.246 |
### Matryoshka Representation Benchmarks
#### BEIR
| **NDCG@10** | **dim=128** | **dim=96** | **dim=64** |
|----------------|-------------|------------|------------|
| **avg** | 0.599 | 0.591 | 0.589 |
| **nfcorpus** | 0.346 | 0.340 | 0.347 |
| **fiqa** | 0.408 | 0.404 | 0.404 |
| **trec-covid** | 0.834 | 0.808 | 0.805 |
| **hotpotqa** | 0.766 | 0.764 | 0.756 |
| **nq** | 0.640 | 0.640 | 0.635 |
#### MS MARCO
| **MRR@10** | **dim=128** | **dim=96** | **dim=64** |
|----------------|-------------|------------|------------|
| **msmarco** | 0.396 | 0.391 | 0.388 |
## Other Models
Additionally, we provide the following embedding models, which you can also use for retrieval:
- [`jina-embeddings-v2-base-en`](https://huggingface.co/jinaai/jina-embeddings-v2-base-en): English model with 137 million parameters.
- [`jina-embeddings-v2-base-zh`](https://huggingface.co/jinaai/jina-embeddings-v2-base-zh): Chinese-English bilingual model with 161 million parameters.
- [`jina-embeddings-v2-base-de`](https://huggingface.co/jinaai/jina-embeddings-v2-base-de): German-English bilingual model with 161 million parameters.
- [`jina-embeddings-v2-base-es`](https://huggingface.co/jinaai/jina-embeddings-v2-base-es): Spanish-English bilingual model with 161 million parameters.
- [`jina-reranker-v2`](https://huggingface.co/jinaai/jina-reranker-v2-base-multilingual): multilingual reranker model.
- [`jina-clip-v1`](https://huggingface.co/jinaai/jina-clip-v1): English multimodal (text-image) embedding model.
## Contact
Join our [Discord community](https://discord.jina.ai) and chat with other community members about ideas.
## Citation
If you find Jina-ColBERT useful in your research, please cite our paper:
```bibtex
@misc{jha2024jinacolbertv2generalpurposemultilinguallate,
title={Jina-ColBERT-v2: A General-Purpose Multilingual Late Interaction Retriever},
author={Rohan Jha and Bo Wang and Michael Günther and Saba Sturua and Mohammad Kalim Akram and Han Xiao},
year={2024},
eprint={2408.16672},
archivePrefix={arXiv},
primaryClass={cs.IR},
url={https://arxiv.org/abs/2408.16672},
}
```