---
library_name: transformers
tags:
- cross-encoder
datasets:
- lightonai/ms-marco-en-bge
- juanluisdb/triviaqa-bge-m3-logits
- juanluisdb/nq-bge-m3-logits
language:
- en
base_model:
- cross-encoder/ms-marco-MiniLM-L-6-v2
---

# MiniLM-L-6-rerank-m3

This model is fine-tuned from the well-known [ms-marco-MiniLM-L-6-v2](https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-6-v2) cross-encoder using the KL-distillation recipe described [here](https://www.answer.ai/posts/2024-08-13-small-but-mighty-colbert.html), with [bge-reranker-v2-m3](https://huggingface.co/BAAI/bge-reranker-v2-m3) as the teacher (a sketch of the distillation loss is given at the end of this card).

# Usage

## Usage with Transformers

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model = AutoModelForSequenceClassification.from_pretrained("juanluisdb/MiniLM-L-6-rerank-m3")
tokenizer = AutoTokenizer.from_pretrained("juanluisdb/MiniLM-L-6-rerank-m3")

# Tokenize (query, passage) pairs
features = tokenizer(
    ['How many people live in Berlin?', 'How many people live in Berlin?'],
    ['Berlin has a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.',
     'New York City is famous for the Metropolitan Museum of Art.'],
    padding=True, truncation=True, return_tensors="pt"
)

# One relevance score per pair (higher = more relevant)
model.eval()
with torch.no_grad():
    scores = model(**features).logits
    print(scores)
```

## Usage with SentenceTransformers

```python
from sentence_transformers import CrossEncoder

model = CrossEncoder("juanluisdb/MiniLM-L-6-rerank-m3", max_length=512)
scores = model.predict([
    ("Query", "Paragraph1"),
    ("Query", "Paragraph2"),
    ("Query", "Paragraph3"),
])
```

# Evaluation

### BEIR (NDCG@10)

I ran evaluations on several BEIR datasets. Each cross-encoder reranks the top-100 BM25 results.

|                | bm25  | jina-reranker-v1-turbo-en | bge-reranker-v2-m3 | mxbai-rerank-base-v1 | ms-marco-MiniLM-L-6-v2 | MiniLM-L-6-rerank-m3 |
|:---------------|------:|--------------------------:|-------------------:|---------------------:|-----------------------:|---------------------:|
| nq*            | 0.305 | 0.533 | **0.597** | 0.535     | 0.523 | 0.580     |
| fever*         | 0.638 | 0.852 | 0.857     | 0.767     | 0.801 | **0.867** |
| fiqa           | 0.238 | 0.336 | **0.397** | 0.382     | 0.349 | 0.364     |
| trec-covid     | 0.589 | 0.774 | 0.784     | **0.830** | 0.741 | 0.738     |
| scidocs        | 0.15  | 0.166 | 0.169     | **0.171** | 0.164 | 0.165     |
| scifact        | 0.676 | 0.739 | 0.731     | 0.719     | 0.688 | **0.750** |
| nfcorpus       | 0.318 | 0.353 | 0.336     | **0.353** | 0.349 | 0.350     |
| hotpotqa       | 0.629 | 0.745 | **0.794** | 0.668     | 0.724 | 0.775     |
| dbpedia-entity | 0.319 | 0.421 | **0.445** | 0.416     | 0.445 | 0.444     |
| quora          | 0.787 | 0.858 | 0.858     | 0.747     | 0.825 | **0.871** |
| climate-fever  | 0.163 | 0.233 | **0.314** | 0.253     | 0.244 | 0.309     |

\* The training splits of NQ and FEVER were used as part of the training data.

Comparison with an [ablated model](https://huggingface.co/juanluisdb/MiniLM-L-6-rerank-m3-ablated) trained only on MS MARCO:

|                | ms-marco-MiniLM-L-6-v2 | MiniLM-L-6-rerank-m3-ablated |
|:---------------|-----------------------:|-----------------------------:|
| nq             | 0.5234     | **0.5412** |
| fever          | 0.8007     | **0.8221** |
| fiqa           | 0.349      | **0.3598** |
| trec-covid     | **0.741**  | 0.7331     |
| scidocs        | **0.1638** | 0.163      |
| scifact        | 0.688      | **0.7376** |
| nfcorpus       | 0.3493     | **0.3495** |
| hotpotqa       | 0.7235     | **0.7583** |
| dbpedia-entity | **0.4445** | 0.4382     |
| quora          | 0.8251     | **0.8619** |
| climate-fever  | 0.2438     | **0.2449** |

# Datasets Used

~900k queries with 32-way triplets were used, drawn from these datasets:

* MS MARCO
* TriviaQA
* Natural Questions
* FEVER
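
# Distillation Loss (sketch)

For reference, below is a minimal sketch of the KL-distillation objective referenced above: the student's scores over the candidate passages of each query are normalized with a softmax and pushed toward the teacher's (bge-reranker-v2-m3) score distribution via KL divergence. The function, variable names, and temperature parameter are illustrative assumptions, not the actual training code.

```python
import torch
import torch.nn.functional as F

def kl_distillation_loss(student_logits: torch.Tensor,
                         teacher_logits: torch.Tensor,
                         temperature: float = 1.0) -> torch.Tensor:
    """KL divergence between the teacher's and the student's score distributions.

    Both tensors have shape (batch_size, n_way); in this card's setup n_way
    would be 32 candidate passages per query, and teacher_logits would be
    precomputed bge-reranker-v2-m3 scores (shapes/names are assumptions).
    """
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # "batchmean" averages the per-query KL over the batch
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")

# Toy usage: 2 queries with 4 candidates each (the real setup uses 32-way triplets)
student = torch.randn(2, 4, requires_grad=True)
teacher = torch.randn(2, 4)
loss = kl_distillation_loss(student, teacher)
loss.backward()
```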