score mteb french
Hello,
Thanks for the great open-source model. However, there seems to be some confusion about the model's score on MTEB-French: when I run the evaluation locally, the average score is 59.92, which differs from the 66.6 reported on the leaderboard.
Could you please share your evaluation results, or at least the per-dataset scores, so we can compare them dataset by dataset? Please note that this model is instruction-tuned, so when encoding text the instruction must be concatenated on the query side.
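For illustration, here is a minimal sketch of query-side instruction concatenation with sentence-transformers. The task description and the E5-style "Instruct: ...\nQuery: ..." template are assumptions borrowed from similar instruct models, so please check the exact prompt against the model card:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Alibaba-NLP/gte-Qwen2-1.5B-instruct", trust_remote_code=True)

# Assumed task instruction; adjust per task (retrieval, STS, ...).
task_description = "Given a web search query, retrieve relevant passages that answer the query"

def build_query(text: str) -> str:
    # The instruction is concatenated on the query side only; documents stay bare.
    return f"Instruct: {task_description}\nQuery: {text}"

queries = [build_query("Quelle est la capitale de la France ?")]
documents = ["Paris est la capitale de la France."]

# If the model config defines a "query" prompt, model.encode(..., prompt_name="query")
# can be used instead of building the string manually.
query_emb = model.encode(queries, normalize_embeddings=True)
doc_emb = model.encode(documents, normalize_embeddings=True)
print(query_emb @ doc_emb.T)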
"""Example script for benchmarking all datasets constituting the MTEB French leaderboard & average scores"""
from __future__ import annotations
import os
import logging
import torch
import gc
from sentence_transformers import SentenceTransformer
device = torch.device('cuda:0')
torch.cuda.set_device(device)
from mteb import MTEB
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("main")
TASK_LIST_CLASSIFICATION = [
    "AmazonReviewsClassification",
    "MasakhaNEWSClassification",
    "MassiveIntentClassification",
    "MassiveScenarioClassification",
    "MTOPDomainClassification",
    "MTOPIntentClassification",
]
TASK_LIST_CLUSTERING = [
    "AlloProfClusteringP2P",
    "AlloProfClusteringS2S",
    "HALClusteringS2S",
    "MasakhaNEWSClusteringP2P",
    "MasakhaNEWSClusteringS2S",
    "MLSUMClusteringP2P",
    "MLSUMClusteringS2S",
]
TASK_LIST_PAIR_CLASSIFICATION = [
    "OpusparcusPC",
    "PawsX",
]
TASK_LIST_RERANKING = ["SyntecReranking", "AlloprofReranking"]
TASK_LIST_RETRIEVAL = [
    "AlloprofRetrieval",
    "BSARDRetrieval",
    "SyntecRetrieval",
    "XPQARetrieval",
    "MintakaRetrieval",
]
TASK_LIST_STS = ["SummEvalFr", "STSBenchmarkMultilingualSTS", "STS22", "SICKFr"]
TASK_LIST = (
    TASK_LIST_CLASSIFICATION
    + TASK_LIST_CLUSTERING
    + TASK_LIST_PAIR_CLASSIFICATION
    + TASK_LIST_RERANKING
    + TASK_LIST_RETRIEVAL
    + TASK_LIST_STS
)
model_name = "Alibaba-NLP/gte-Qwen2-1.5B-instruct"
model = SentenceTransformer(model_name, trust_remote_code=True)
logger.info(f"Task list : {TASK_LIST}")
for task in TASK_LIST:
    logger.info(f"Running task: {task}")
    evaluation = MTEB(
        tasks=[task], task_langs=["fr"]
    )  # Remove "fr" for running all languages
    evaluation.run(model, batch_size=1, output_folder=f"results/{model_name}")
These are the results after running the above code; there are 26 resulting JSON files:
https://www.dropbox.com/scl/fi/7is59edlapzdnhacp2ysf/Alibaba-NLP__gte-Qwen2-1.5B-instruct.zip?rlkey=pv0hppw7dvdbb25e7rftybd2c&st=867jjbh0&dl=0
@abhamadi
In TASK_LIST_PAIR_CLASSIFICATION, "PawsX" is not FRA (it's CMN). See the MTEB Tasks list; you can also check each task's languages programmatically, as sketched below.
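A rough sketch of such a check, assuming the mteb release used here still exposes each task's languages through a description dict as older releases did (newer releases expose task.metadata.eval_langs instead):
from mteb import MTEB

# Assumption: t.description is a dict containing an "eval_langs" entry.
evaluation = MTEB(tasks=["OpusparcusPC", "PawsX"])
for t in evaluation.tasks:
    print(type(t).__name__, t.description.get("eval_langs"))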
And how did you get the average score of 59.92? Was it like this?
evaluation = MTEB(tasks=[task], task_langs=["fr"])
results = evaluation.run(model, batch_size=1, output_folder=f"results/{model_name}")
# Calculate the average score across all tasks
average_score = sum(results.values()) / len(results)
print(f"Average Score: {average_score}")