feat: update README
README.md CHANGED
---
license: cc-by-4.0
language:
- multilingual
- af
- am
- ar
- as
- az
- be
- bg
- bn
- br
- bs
- ca
- cs
- cy
- da
- de
- el
- en
- eo
- es
- et
- eu
- fa
- fi
- fr
- fy
- ga
- gd
- gl
- gu
- ha
- he
- hi
- hr
- hu
- hy
- id
- is
- it
- ja
- jv
- ka
- kk
- km
- kn
- ko
- ku
- ky
- la
- lo
- lt
- lv
- mg
- mk
- ml
- mn
- mr
- ms
- my
- ne
- nl
- 'no'
- om
- or
- pa
- pl
- ps
- pt
- ro
- ru
- sa
- sd
- si
- sk
- sl
- so
- sq
- sr
- su
- sv
- sw
- ta
- te
- th
- tl
- tr
- ug
- uk
- ur
- uz
- vi
- xh
- yi
- zh
tags:
- ColBERT
- passage-retrieval
# Jina-ColBERT-v2

## Usage

### Installation

`jina-colbert-v2` is trained with flash attention, so the `einops` and `flash_attn` packages must be installed.

To use the model, you can either use the Stanford ColBERT library directly or go through the `ragatouille` package.

```bash
pip install -U einops flash_attn
pip install -U ragatouille
pip install -U colbert-ai
```
### RAGatouille

```python
from ragatouille import RAGPretrainedModel

RAG = RAGPretrainedModel.from_pretrained("jinaai/jina-colbert-v2")

docs = [
    "ColBERT is a novel ranking model that adapts deep LMs for efficient retrieval.",
    "Jina-ColBERT is a ColBERT-style model based on JinaBERT that supports an 8k context length along with fast and accurate retrieval.",
]

RAG.index(docs, index_name="demo")

query = "What does ColBERT do?"

results = RAG.search(query)
```
### Stanford ColBERT

Typically, you would run the following code on a GPU machine to build an index with the Stanford ColBERT library. See the [Stanford ColBERT](https://github.com/stanford-futuredata/ColBERT?tab=readme-ov-file#installation) documentation for more details.
#### Indexing

```python
from colbert import Indexer
from colbert.infra import ColBERTConfig

if __name__ == "__main__":
    config = ColBERTConfig(
        doc_maxlen=512,
        nbits=2,
    )
    indexer = Indexer(
        checkpoint="jinaai/jina-colbert-v2",
        config=config,
    )
    docs = [
        "ColBERT is a novel ranking model that adapts deep LMs for efficient retrieval.",
        "Jina-ColBERT is a ColBERT-style model based on JinaBERT that supports an 8k context length along with fast and accurate retrieval.",
    ]
    indexer.index(name="demo", collection=docs)
```
#### Searching

```python
from colbert import Searcher
from colbert.infra import ColBERTConfig

k = 10

if __name__ == "__main__":
    config = ColBERTConfig(
        query_maxlen=128,
    )
    searcher = Searcher(
        index="demo",
        config=config,
    )
    query = "What does ColBERT do?"
    results = searcher.search(query, k=k)
```
#### Creating vectors

```python
from colbert.infra import ColBERTConfig
from colbert.modeling.checkpoint import Checkpoint

ckpt = Checkpoint("jinaai/jina-colbert-v2", colbert_config=ColBERTConfig())
docs = [
    "ColBERT is a novel ranking model that adapts deep LMs for efficient retrieval.",
    "Jina-ColBERT is a ColBERT-style model based on JinaBERT that supports an 8k context length along with fast and accurate retrieval.",
]
query_vectors = ckpt.queryFromText(docs, bsize=2)
print(query_vectors)
```
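Those per-token vectors are compared with ColBERT's late-interaction MaxSim operator: each query token is matched to its most similar document token, and the per-token maxima are summed into the relevance score. A minimal NumPy sketch of that scoring (an illustration only, not the library's implementation; arrays are assumed to have shape `(num_tokens, dim)`):

```python
import numpy as np

def maxsim_score(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """Late-interaction score: sum over query tokens of the max
    cosine similarity against any document token."""
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sim = q @ d.T  # (num_query_tokens, num_doc_tokens)
    return float(sim.max(axis=1).sum())

q = np.array([[1.0, 0.0], [0.0, 1.0]])              # two query token vectors
d = np.array([[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]])  # three doc token vectors
print(maxsim_score(q, d))  # -> 2.0 (each query token has an exact match)
```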
## Evaluation Results

### Retrieval Benchmarks

#### BEIR

| **Dataset** | **jina-colbert-v2** | **jina-colbert-v1** | **ColBERTv2.0** | **BM25** |
|--------------------|---------------------|---------------------|-----------------|----------|
| **avg** | 0.531 | 0.502 | 0.496 | 0.440 |
| **nfcorpus** | 0.346 | 0.338 | 0.337 | 0.325 |
| **fiqa** | 0.408 | 0.368 | 0.354 | 0.236 |
| **trec-covid** | 0.834 | 0.750 | 0.726 | 0.656 |
| **arguana** | 0.366 | 0.494 | 0.465 | 0.315 |
| **quora** | 0.887 | 0.823 | 0.855 | 0.789 |
| **scidocs** | 0.186 | 0.169 | 0.154 | 0.158 |
| **scifact** | 0.678 | 0.701 | 0.689 | 0.665 |
| **webis-touche** | 0.274 | 0.270 | 0.260 | 0.367 |
| **dbpedia-entity** | 0.471 | 0.413 | 0.452 | 0.313 |
| **fever** | 0.805 | 0.795 | 0.785 | 0.753 |
| **climate-fever** | 0.239 | 0.196 | 0.176 | 0.213 |
| **hotpotqa** | 0.766 | 0.656 | 0.675 | 0.603 |
| **nq** | 0.640 | 0.549 | 0.524 | 0.329 |
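BEIR results like those above are conventionally reported as nDCG@10. As an illustration of that metric (our own sketch, using the linear-gain variant of DCG; not the evaluation code used for the table):

```python
import math

def ndcg_at_k(ranked_rels, k=10):
    """nDCG@k for one query: DCG of the ranking divided by the DCG
    of the ideal (descending-relevance) ranking of the same items."""
    def dcg(rels):
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(ranked_rels, reverse=True))
    return dcg(ranked_rels) / ideal if ideal > 0 else 0.0

# Relevance grades of the returned passages, in ranked order.
print(ndcg_at_k([3, 2, 1, 0]))  # 1.0 for an already-ideal ranking
```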
#### MS MARCO Passage Retrieval

| **Model** | **MRR@10** |
|---------------------|------------|
| **jina-colbert-v2** | 0.396 |
| **jina-colbert-v1** | 0.390 |
| **ColBERTv2.0** | **0.397** |
| **BM25** | 0.187 |
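MRR@10 above is the mean over queries of the reciprocal rank of the first relevant passage within the top 10 results. A small self-contained sketch of the metric (function and variable names are ours, for illustration):

```python
def mrr_at_k(rankings, relevant, k=10):
    """Mean Reciprocal Rank@k: for each query, take 1/rank of the
    first relevant passage within the top k (0 if none), then average."""
    total = 0.0
    for ranked_pids, relevant_pids in zip(rankings, relevant):
        for rank, pid in enumerate(ranked_pids[:k], start=1):
            if pid in relevant_pids:
                total += 1.0 / rank
                break
    return total / len(rankings)

# Query 1: first relevant hit at rank 1; query 2: at rank 4.
print(mrr_at_k([[7, 2, 5], [9, 1, 3, 8]], [{7}, {8}]))  # (1/1 + 1/4) / 2 = 0.625
```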
### Multilingual Benchmarks

#### MIRACL

We report our model's performance on MIRACL, a multilingual retrieval benchmark.

| **Language** | **jina-colbert-v2** | **mDPR (zero shot)** |
|---------|---------------------|----------------------|
| **avg** | 0.627 | 0.427 |
| **ar** | 0.753 | 0.499 |
| **bn** | 0.750 | 0.443 |
| **de** | 0.504 | 0.490 |
| **es** | 0.538 | 0.478 |
| **en** | 0.570 | 0.394 |
| **fa** | 0.563 | 0.480 |
| **fi** | 0.740 | 0.472 |
| **fr** | 0.541 | 0.435 |
| **hi** | 0.600 | 0.383 |
| **id** | 0.547 | 0.272 |
| **ja** | 0.632 | 0.439 |
| **ko** | 0.671 | 0.419 |
| **ru** | 0.643 | 0.407 |
| **sw** | 0.499 | 0.299 |
| **te** | 0.742 | 0.356 |
| **th** | 0.772 | 0.358 |
| **yo** | 0.623 | 0.396 |
| **zh** | 0.523 | 0.512 |
#### mMARCO

| **Language** | **jina-colbert-v2** | **BM25** |
|--------|---------------------|-----------|
| **ar** | 0.272 | 0.111 |
| **de** | 0.331 | 0.136 |
| **nl** | 0.330 | 0.140 |
| **es** | 0.341 | 0.158 |
| **fr** | 0.335 | 0.155 |
| **hi** | 0.309 | 0.134 |
| **id** | 0.319 | 0.149 |
| **it** | 0.337 | 0.153 |
| **ja** | 0.276 | 0.141 |
| **pt** | 0.337 | 0.152 |
| **ru** | 0.298 | 0.124 |
| **vi** | 0.287 | 0.136 |
| **zh** | 0.302 | |
### Matryoshka Representation Benchmarks

#### BEIR

| **dim** | **Average** | **nfcorpus** | **fiqa** | **trec-covid** | **hotpotqa** | **nq** |
|---------|-------------|--------------|----------|----------------|--------------|--------|
| **128** | 0.565 | 0.346 | 0.408 | 0.834 | 0.766 | 0.640 |
| **96** | 0.558 | 0.340 | 0.404 | 0.808 | 0.764 | 0.640 |
| **64** | 0.556 | 0.347 | 0.404 | 0.805 | 0.756 | 0.635 |

#### MS MARCO

| **dim** | **MRR@10** |
|---------|-------------|
| **128** | 0.396 |
| **96** | 0.391 |
| **64** | 0.388 |
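The reduced-dimension rows come from Matryoshka-style truncation: keeping only the leading `dim` components of each 128-dim token vector and re-normalizing before scoring. A sketch of that truncation step (our illustration; the exact evaluation pipeline is not shown here):

```python
import numpy as np

def truncate_embeddings(token_vecs: np.ndarray, dim: int) -> np.ndarray:
    """Keep the leading `dim` components of each token vector and
    re-normalize each row to unit length."""
    truncated = token_vecs[:, :dim]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / norms

vecs = np.random.default_rng(0).normal(size=(5, 128))  # 5 token vectors
small = truncate_embeddings(vecs, 64)
print(small.shape)  # (5, 64)
```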
## Other Models