Describe CrossEncoder integration with Sentence Transformers (#8)
- Describe usage via sentence-transformers CrossEncoder (4d6f3df72ac9f4654a73070da3adc511f84fc8eb)
- Add convert_to_tensor=True to fix predict/rank until the next ST release (61b3d1697c127ec8e51ddfee14150f7eb22e4353)
README.md
CHANGED
@@ -209,6 +209,66 @@ Inside the `result` object, you will find the reranked documents along with thei
The `rerank()` function will automatically chunk the input documents into smaller pieces if they exceed the model's maximum input length. This allows you to rerank long documents without running into memory issues.
Specifically, the `rerank()` function will split the documents into chunks of size `max_length` and rerank each chunk separately. The scores from all the chunks are then combined to produce the final reranking results. You can control the query length and document length in each chunk by setting the `max_query_length` and `max_length` parameters. The `rerank()` function also supports the `overlap` parameter (default is `80`) which determines how much overlap there is between adjacent chunks. This can be useful when reranking long documents to ensure that the model has enough context to make accurate predictions.
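
The chunking described above can be pictured with a minimal sketch. This is not the model's actual `rerank()` implementation: the helper names `chunk_tokens` and `combine_scores` are hypothetical, the sketch assumes `max_length` and `overlap` are measured in tokens, and taking the maximum chunk score is only one possible way to combine scores (the README does not say which one is used).

```python
from typing import List


def chunk_tokens(tokens: List[int], max_length: int, overlap: int = 80) -> List[List[int]]:
    """Split a token sequence into windows of at most `max_length` tokens.

    Adjacent windows share `overlap` tokens (hypothetical helper, for illustration).
    """
    if max_length <= overlap:
        raise ValueError("max_length must be larger than overlap")
    if len(tokens) <= max_length:
        return [list(tokens)]

    step = max_length - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + max_length]))
        if start + max_length >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


def combine_scores(chunk_scores: List[float]) -> float:
    """Assumption: a document is as relevant as its best-scoring chunk."""
    return max(chunk_scores)


# Tiny example: 10 tokens, windows of 4 tokens that overlap by 1 token
example_chunks = chunk_tokens(list(range(10)), max_length=4, overlap=1)
print(example_chunks)
```

A larger `overlap` repeats more context across chunk boundaries at the cost of producing more chunks to score.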

3. Alternatively, `jina-reranker-v2-base-multilingual` has been integrated with `CrossEncoder` from the `sentence-transformers` library.

Before you start, install the `sentence-transformers` library:

```bash
pip install sentence-transformers
```

The [`CrossEncoder`](https://sbert.net/docs/package_reference/cross_encoder/cross_encoder.html) class supports a [`predict`](https://sbert.net/docs/package_reference/cross_encoder/cross_encoder.html#sentence_transformers.cross_encoder.CrossEncoder.predict) method to get query-document relevance scores, and a [`rank`](https://sbert.net/docs/package_reference/cross_encoder/cross_encoder.html#sentence_transformers.cross_encoder.CrossEncoder.rank) method to rank all documents given your query.

```python
from sentence_transformers import CrossEncoder

model = CrossEncoder(
    "jinaai/jina-reranker-v2-base-multilingual",
    automodel_args={"torch_dtype": "auto"},
    trust_remote_code=True,
)

# Example query and documents
query = "Organic skincare products for sensitive skin"
documents = [
    "Organic skincare for sensitive skin with aloe vera and chamomile.",
    "New makeup trends focus on bold colors and innovative techniques",
    "Bio-Hautpflege für empfindliche Haut mit Aloe Vera und Kamille",
    "Neue Make-up-Trends setzen auf kräftige Farben und innovative Techniken",
    "Cuidado de la piel orgánico para piel sensible con aloe vera y manzanilla",
    "Las nuevas tendencias de maquillaje se centran en colores vivos y técnicas innovadoras",
    "针对敏感肌专门设计的天然有机护肤产品",
    "新的化妆趋势注重鲜艳的颜色和创新的技巧",
    "敏感肌のために特別に設計された天然有機スキンケア製品",
    "新しいメイクのトレンドは鮮やかな色と革新的な技術に焦点を当てています",
]

# Construct [query, document] sentence pairs
sentence_pairs = [[query, doc] for doc in documents]

# convert_to_tensor=True is needed for predict/rank until the next sentence-transformers release
scores = model.predict(sentence_pairs, convert_to_tensor=True).tolist()
"""
[0.828125, 0.0927734375, 0.6328125, 0.08251953125, 0.76171875, 0.099609375, 0.92578125, 0.058349609375, 0.84375, 0.111328125]
"""

rankings = model.rank(query, documents, return_documents=True, convert_to_tensor=True)
print(f"Query: {query}")
for ranking in rankings:
    print(f"ID: {ranking['corpus_id']}, Score: {ranking['score']:.4f}, Text: {ranking['text']}")
"""
Query: Organic skincare products for sensitive skin
ID: 6, Score: 0.9258, Text: 针对敏感肌专门设计的天然有机护肤产品
ID: 8, Score: 0.8438, Text: 敏感肌のために特別に設計された天然有機スキンケア製品
ID: 0, Score: 0.8281, Text: Organic skincare for sensitive skin with aloe vera and chamomile.
ID: 4, Score: 0.7617, Text: Cuidado de la piel orgánico para piel sensible con aloe vera y manzanilla
ID: 2, Score: 0.6328, Text: Bio-Hautpflege für empfindliche Haut mit Aloe Vera und Kamille
ID: 9, Score: 0.1113, Text: 新しいメイクのトレンドは鮮やかな色と革新的な技術に焦点を当てています
ID: 5, Score: 0.0996, Text: Las nuevas tendencias de maquillaje se centran en colores vivos y técnicas innovadoras
ID: 1, Score: 0.0928, Text: New makeup trends focus on bold colors and innovative techniques
ID: 3, Score: 0.0825, Text: Neue Make-up-Trends setzen auf kräftige Farben und innovative Techniken
ID: 7, Score: 0.0583, Text: 新的化妆趋势注重鲜艳的颜色和创新的技巧
"""
```
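
The ranking printed by `rank` is consistent with sorting the `predict` scores in descending order. The following is a minimal sketch of that relationship, using only the scores listed in the example output, so it runs without downloading the model. The helper `rank_from_scores` is hypothetical and is not part of `sentence-transformers`.

```python
from typing import Dict, List, Union

# Scores copied from the predict() output shown in the example
scores = [
    0.828125, 0.0927734375, 0.6328125, 0.08251953125, 0.76171875,
    0.099609375, 0.92578125, 0.058349609375, 0.84375, 0.111328125,
]


def rank_from_scores(scores: List[float]) -> List[Dict[str, Union[int, float]]]:
    """Hypothetical helper: pair each score with its document index, best first."""
    ranked = [{"corpus_id": idx, "score": score} for idx, score in enumerate(scores)]
    return sorted(ranked, key=lambda item: item["score"], reverse=True)


rankings = rank_from_scores(scores)
for item in rankings[:3]:  # keep only the three most relevant documents
    print(f"ID: {item['corpus_id']}, Score: {item['score']:.4f}")
```

Slicing the sorted list is a simple way to keep only the best matches when the candidate set is large.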
# Evaluation