---
library_name: transformers
tags:
  - cross-encoder
datasets:
  - lightonai/ms-marco-en-bge
  - juanluisdb/triviaqa-bge-m3-logits
  - juanluisdb/nq-bge-m3-logits
language:
  - en
base_model:
  - cross-encoder/ms-marco-MiniLM-L-6-v2
---

Model Card for MiniLM-L-6-rerank-m3

This model is fine-tuned from the well-known cross-encoder/ms-marco-MiniLM-L-6-v2 using KL-divergence distillation, as described here, with bge-reranker-v2-m3 as the teacher.
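The distillation objective can be illustrated with a minimal, pure-Python sketch. It assumes a listwise setup. For each query, the teacher's and the student's scores over the same group of candidate passages are turned into softmax distributions, and the student is trained to minimize the KL divergence from the teacher's distribution. The helper names and the `temperature` parameter are hypothetical, and the exact loss configuration used for this model is not stated in this card.

```python
import math
from typing import Sequence


def log_softmax(logits: Sequence[float], temperature: float = 1.0) -> list[float]:
    """Numerically stable log-softmax over one query's candidate scores."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_z for x in scaled]


def kl_distillation_loss(
    student_logits: Sequence[Sequence[float]],
    teacher_logits: Sequence[Sequence[float]],
    temperature: float = 1.0,  # hypothetical knob; not specified in the model card
) -> float:
    """Mean KL(teacher || student) over a batch of queries.

    Each inner sequence holds one query's scores for the same candidate group
    (e.g. 32 passages), in the same order for student and teacher.
    """
    if len(student_logits) != len(teacher_logits):
        raise ValueError("student and teacher batches must have the same size")
    total = 0.0
    for s_row, t_row in zip(student_logits, teacher_logits):
        log_p_s = log_softmax(s_row, temperature)
        log_p_t = log_softmax(t_row, temperature)
        # KL(P_t || P_s) = sum_i P_t(i) * (log P_t(i) - log P_s(i))
        total += sum(math.exp(lt) * (lt - ls) for lt, ls in zip(log_p_t, log_p_s))
    return total / len(student_logits)


# Toy batch: two queries with 3 candidates each (a 32-way group would have 32 scores per row).
student = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]]
teacher = [[3.0, 0.0, -2.0], [0.1, 0.2, 0.3]]
print(kl_distillation_loss(student, teacher))
```

In a PyTorch training loop, the same quantity is usually computed with `torch.nn.functional.kl_div(F.log_softmax(student, dim=-1), F.log_softmax(teacher, dim=-1), log_target=True, reduction="batchmean")`.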

Usage

Usage with Transformers

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("juanluisdb/MiniLM-L-6-rerank-m3")
tokenizer = AutoTokenizer.from_pretrained("juanluisdb/MiniLM-L-6-rerank-m3")

# Queries and passages are two parallel lists; index i forms the i-th (query, passage) pair.
features = tokenizer(
    ["How many people live in Berlin?", "How many people live in Berlin?"],
    ["Berlin has a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.",
     "New York City is famous for the Metropolitan Museum of Art."],
    padding=True, truncation=True, return_tensors="pt",
)

model.eval()
with torch.no_grad():
    scores = model(**features).logits  # one relevance logit per pair; higher means more relevant
    print(scores)
```

Usage with SentenceTransformers

```python
from sentence_transformers import CrossEncoder

model = CrossEncoder("juanluisdb/MiniLM-L-6-rerank-m3", max_length=512)
# Returns one relevance score per (query, passage) pair; higher means more relevant.
scores = model.predict([("Query", "Paragraph1"), ("Query", "Paragraph2"), ("Query", "Paragraph3")])
```
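Reranking a candidate list, such as the top BM25 hits, means scoring each (query, passage) pair and sorting by score. The sketch below accepts any callable that maps a list of pairs to scores, for example `CrossEncoder.predict`. The `rerank` helper and the `bm25_candidates` list are hypothetical and are not part of either library.

```python
from typing import Callable, Sequence


def rerank(
    query: str,
    passages: Sequence[str],
    score_pairs: Callable[[list[tuple[str, str]]], Sequence[float]],
    top_k: int = 10,
) -> list[tuple[str, float]]:
    """Score each (query, passage) pair and return the top_k passages, best first."""
    pairs = [(query, passage) for passage in passages]
    scores = score_pairs(pairs)
    ranked = sorted(zip(passages, scores), key=lambda item: item[1], reverse=True)
    return [(passage, float(score)) for passage, score in ranked[:top_k]]


# Example with the CrossEncoder above (bm25_candidates is a hypothetical list of passages):
# top10 = rerank("How many people live in Berlin?", bm25_candidates, model.predict, top_k=10)
```

Passing the scorer as a callable keeps the helper independent of the backend, so it works with either of the two usage paths shown above.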

Evaluation

BEIR (NDCG@10)

I evaluated the models on several BEIR datasets. Each cross-encoder reranks the top 100 BM25 results.

| Dataset | BM25 | jina-reranker-v1-turbo-en | bge-reranker-v2-m3 | mxbai-rerank-base-v1 | ms-marco-MiniLM-L-6-v2 | MiniLM-L-6-rerank-m3 |
|---|---|---|---|---|---|---|
| nq* | 0.305 | 0.533 | 0.597 | 0.535 | 0.523 | 0.580 |
| fever* | 0.638 | 0.852 | 0.857 | 0.767 | 0.801 | 0.867 |
| fiqa | 0.238 | 0.336 | 0.397 | 0.382 | 0.349 | 0.364 |
| trec-covid | 0.589 | 0.774 | 0.784 | 0.830 | 0.741 | 0.738 |
| scidocs | 0.15 | 0.166 | 0.169 | 0.171 | 0.164 | 0.165 |
| scifact | 0.676 | 0.739 | 0.731 | 0.719 | 0.688 | 0.750 |
| nfcorpus | 0.318 | 0.353 | 0.336 | 0.353 | 0.349 | 0.350 |
| hotpotqa | 0.629 | 0.745 | 0.794 | 0.668 | 0.724 | 0.775 |
| dbpedia-entity | 0.319 | 0.421 | 0.445 | 0.416 | 0.445 | 0.444 |
| quora | 0.787 | 0.858 | 0.858 | 0.747 | 0.825 | 0.871 |
| climate-fever | 0.163 | 0.233 | 0.314 | 0.253 | 0.244 | 0.309 |

* The training splits of NQ and FEVER were part of the training data, so these two results are not zero-shot.
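For reference, NDCG@10 for a single query can be computed as shown below. This is a minimal sketch that uses linear gains, where the gain equals the graded relevance label. BEIR's own evaluation relies on pytrec_eval, so edge cases such as score ties may be handled differently. The function name and the `qrels` dictionary format are illustrative assumptions. A dataset-level number like those in the table is the mean over all queries.

```python
import math
from typing import Mapping, Sequence


def ndcg_at_k(ranked_doc_ids: Sequence[str], qrels: Mapping[str, int], k: int = 10) -> float:
    """NDCG@k for one query; qrels maps doc id -> graded relevance (missing = 0)."""
    # DCG: position p (1-based) is discounted by log2(p + 1); enumerate is 0-based, hence rank + 2.
    dcg = sum(
        qrels.get(doc_id, 0) / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked_doc_ids[:k])
    )
    # Ideal DCG: all judged-relevant docs sorted by relevance, truncated at k.
    ideal_gains = sorted((g for g in qrels.values() if g > 0), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(ideal_gains))
    return dcg / idcg if idcg > 0 else 0.0


# Example: the only relevant doc "a" was reranked to position 2.
print(ndcg_at_k(["b", "a", "c"], {"a": 1}))
```

In the setup described above, `ranked_doc_ids` would be the top 100 BM25 documents after the cross-encoder has reordered them.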

Comparison with an ablated model trained only on MS MARCO (NDCG@10):

| Dataset | ms-marco-MiniLM-L-6-v2 | MiniLM-L-6-rerank-m3-ablated |
|---|---|---|
| nq | 0.5234 | 0.5412 |
| fever | 0.8007 | 0.8221 |
| fiqa | 0.349 | 0.3598 |
| trec-covid | 0.741 | 0.7331 |
| scidocs | 0.1638 | 0.163 |
| scifact | 0.688 | 0.7376 |
| nfcorpus | 0.3493 | 0.3495 |
| hotpotqa | 0.7235 | 0.7583 |
| dbpedia-entity | 0.4445 | 0.4382 |
| quora | 0.8251 | 0.8619 |
| climate-fever | 0.2438 | 0.2449 |
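The ablation can be summarized by averaging the per-dataset numbers. The sketch below recomputes an unweighted mean and a win count from the values in the table above. The unweighted mean across datasets is just one convention, used here for illustration, and the `summarize` helper is hypothetical.

```python
from statistics import mean

# NDCG@10 from the ablation table: (ms-marco-MiniLM-L-6-v2, MiniLM-L-6-rerank-m3-ablated)
ABLATION = {
    "nq": (0.5234, 0.5412),
    "fever": (0.8007, 0.8221),
    "fiqa": (0.349, 0.3598),
    "trec-covid": (0.741, 0.7331),
    "scidocs": (0.1638, 0.163),
    "scifact": (0.688, 0.7376),
    "nfcorpus": (0.3493, 0.3495),
    "hotpotqa": (0.7235, 0.7583),
    "dbpedia-entity": (0.4445, 0.4382),
    "quora": (0.8251, 0.8619),
    "climate-fever": (0.2438, 0.2449),
}


def summarize(results: dict[str, tuple[float, float]]) -> tuple[float, float, list[str]]:
    """Return (baseline mean, ablated mean, datasets where the ablated model scores higher)."""
    base_mean = mean(base for base, _ in results.values())
    ablated_mean = mean(ablated for _, ablated in results.values())
    wins = [name for name, (base, ablated) in results.items() if ablated > base]
    return base_mean, ablated_mean, wins


base_mean, ablated_mean, wins = summarize(ABLATION)
print(f"baseline={base_mean:.4f} ablated={ablated_mean:.4f} wins={len(wins)}/{len(ABLATION)}")
```

With these values, the mean is roughly 0.532 for ms-marco-MiniLM-L-6-v2 and 0.546 for the ablated model. The ablated model scores higher on 8 of the 11 datasets.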

Datasets Used

~900k queries with 32-way triplets were used from these datasets:

  • MSMarco
  • TriviaQA
  • Natural Questions
  • FEVER