MMLW-e5-base

MMLW (muszę mieć lepszą wiadomość) are neural text encoders for Polish. This is a distilled model that can be used to generate embeddings applicable to many tasks such as semantic similarity, clustering, information retrieval. The model can also serve as a base for further fine-tuning. It transforms texts to 768 dimensional vectors. The model was initialized with multilingual E5 checkpoint, and then trained with multilingual knowledge distillation method on a diverse corpus of 60 million Polish-English text pairs. We utilised English FlagEmbeddings (BGE) as teacher models for distillation.

Usage (Sentence-Transformers)

⚠️ Our embedding models require the use of specific prefixes and suffixes when encoding texts. For this model, queries should be prefixed with "query: " and passages with "passage: " ⚠️

You can use the model like this with sentence-transformers:

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

query_prefix = "query: "
answer_prefix = "passage: "
queries = [query_prefix + "Jak dożyć 100 lat?"]
answers = [
    answer_prefix + "Trzeba zdrowo się odżywiać i uprawiać sport.",
    answer_prefix + "Trzeba pić alkohol, imprezować i jeździć szybkimi autami.",
    answer_prefix + "Gdy trwała kampania politycy zapewniali, że rozprawią się z zakazem niedzielnego handlu."
]
model = SentenceTransformer("sdadas/mmlw-e5-base")
queries_emb = model.encode(queries, convert_to_tensor=True, show_progress_bar=False)
answers_emb = model.encode(answers, convert_to_tensor=True, show_progress_bar=False)

best_answer = cos_sim(queries_emb, answers_emb).argmax().item()
print(answers[best_answer])
# Trzeba zdrowo się odżywiać i uprawiać sport.

Evaluation Results

The model achieves an Average Score of 59.71 on the Polish Massive Text Embedding Benchmark (MTEB). See MTEB Leaderboard for detailed results.
The model achieves NDCG@10 of 53.56 on the Polish Information Retrieval Benchmark. See PIRB Leaderboard for detailed results.

Acknowledgements

This model was trained with the A100 GPU cluster support delivered by the Gdansk University of Technology within the TASK center initiative.

Citation

@article{dadas2024pirb,
  title={{PIRB}: A Comprehensive Benchmark of Polish Dense and Hybrid Text Retrieval Methods}, 
  author={Sławomir Dadas and Michał Perełkiewicz and Rafał Poświata},
  year={2024},
  eprint={2402.13350},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}

sdadas
/

mmlw-e5-base

MMLW-e5-base

Usage (Sentence-Transformers)

Evaluation Results

Acknowledgements

Citation

Spaces using sdadas/mmlw-e5-base 2

Collection including sdadas/mmlw-e5-base

Polish Embedding Models

Evaluation results