score mteb french
Hello,
Thanks for the great open-source model. However, there seems to be some confusion about the model's score on MTEB-French: when I run the evaluation locally, the average score is 59.92, which differs from the 66.6 reported on the leaderboard.
Could you please share your evaluation results, or at least the per-dataset scores, so we can compare them dataset by dataset? Please note that this model is instruction-tuned, so when encoding text the instruction must be concatenated on the query side.
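For illustration, here is a minimal sketch of query-side instruction concatenation with sentence-transformers. The task description and the E5-style "Instruct: ...\nQuery: ..." template are assumptions borrowed from similar instruct models, so please check the exact prompt against the model card:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Alibaba-NLP/gte-Qwen2-1.5B-instruct", trust_remote_code=True)

# Assumed task instruction; adjust per task (retrieval, STS, ...).
task_description = "Given a web search query, retrieve relevant passages that answer the query"

def build_query(text: str) -> str:
    # The instruction is concatenated on the query side only; documents stay bare.
    return f"Instruct: {task_description}\nQuery: {text}"

queries = [build_query("Quelle est la capitale de la France ?")]
documents = ["Paris est la capitale de la France."]

# If the model config defines a "query" prompt, model.encode(..., prompt_name="query")
# can be used instead of building the string manually.
query_emb = model.encode(queries, normalize_embeddings=True)
doc_emb = model.encode(documents, normalize_embeddings=True)
print(query_emb @ doc_emb.T)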
"""Example script for benchmarking all datasets constituting the MTEB French leaderboard & average scores"""
from __future__ import annotations
import os
import logging
import torch
import gc
from sentence_transformers import SentenceTransformer
device = torch.device('cuda:0')
torch.cuda.set_device(device)
from mteb import MTEB
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("main")
TASK_LIST_CLASSIFICATION = [
    "AmazonReviewsClassification",
    "MasakhaNEWSClassification",
    "MassiveIntentClassification",
    "MassiveScenarioClassification",
    "MTOPDomainClassification",
    "MTOPIntentClassification",
]
TASK_LIST_CLUSTERING = [
    "AlloProfClusteringP2P",
    "AlloProfClusteringS2S",
    "HALClusteringS2S",
    "MasakhaNEWSClusteringP2P",
    "MasakhaNEWSClusteringS2S",
    "MLSUMClusteringP2P",
    "MLSUMClusteringS2S",
]
TASK_LIST_PAIR_CLASSIFICATION = [
    "OpusparcusPC",
    "PawsX",
]
TASK_LIST_RERANKING = ["SyntecReranking", "AlloprofReranking"]
TASK_LIST_RETRIEVAL = [
    "AlloprofRetrieval",
    "BSARDRetrieval",
    "SyntecRetrieval",
    "XPQARetrieval",
    "MintakaRetrieval",
]
TASK_LIST_STS = ["SummEvalFr", "STSBenchmarkMultilingualSTS", "STS22", "SICKFr"]
TASK_LIST = (
    TASK_LIST_CLASSIFICATION
    + TASK_LIST_CLUSTERING
    + TASK_LIST_PAIR_CLASSIFICATION
    + TASK_LIST_RERANKING
    + TASK_LIST_RETRIEVAL
    + TASK_LIST_STS
)
model_name = "Alibaba-NLP/gte-Qwen2-1.5B-instruct"
model = SentenceTransformer(model_name, trust_remote_code=True)
logger.info(f"Task list : {TASK_LIST}")
for task in TASK_LIST:
    logger.info(f"Running task: {task}")
    evaluation = MTEB(
        tasks=[task], task_langs=["fr"]
    )  # Remove "fr" for running all languages
    evaluation.run(model, batch_size=1, output_folder=f"results/{model_name}")
These are the results after running the above code; there are 26 resulting JSON files:
https://www.dropbox.com/scl/fi/7is59edlapzdnhacp2ysf/Alibaba-NLP__gte-Qwen2-1.5B-instruct.zip?rlkey=pv0hppw7dvdbb25e7rftybd2c&st=867jjbh0&dl=0
@abhamadi
In TASK_LIST_PAIR_CLASSIFICATION, "PawsX" is not FRA (it's CMN). See the MTEB Tasks list; you can also check each task's languages programmatically, as sketched below.
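A rough sketch of such a check, assuming the mteb release used here still exposes each task's languages through a description dict as older releases did (newer releases expose task.metadata.eval_langs instead):
from mteb import MTEB

# Assumption: t.description is a dict containing an "eval_langs" entry.
evaluation = MTEB(tasks=["OpusparcusPC", "PawsX"])
for t in evaluation.tasks:
    print(type(t).__name__, t.description.get("eval_langs"))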
And how did you get the average score of 59.92? Was it like this?
evaluation = MTEB(tasks=[task], task_langs=["fr"])
results = evaluation.run(model, batch_size=1, output_folder=f"results/{model_name}")
# Calculate the average score across all tasks
average_score = sum(results.values()) / len(results)
print(f"Average Score: {average_score}")