Model Card for ru-en-RoSBERTa
ru-en-RoSBERTa is a general-purpose text embedding model for Russian. It is based on ruRoBERTa and fine-tuned on ~4M pairs of supervised, synthetic, and unsupervised data in Russian and English. The tokenizer supports some English tokens from the RoBERTa tokenizer.
For more model details please refer to our article.
Usage
The model can be used as is with prefixes. It is recommended to use CLS pooling. The choice of prefix and pooling depends on the task.
We use the following basic rules to choose a prefix:
- "search_query: " and "search_document: " prefixes are for answer or relevant paragraph retrieval
- "classification: " prefix is for symmetric paraphrasing related tasks (STS, NLI, Bitext Mining)
- "clustering: " prefix is for any tasks that rely on thematic features (topic classification, title-body retrieval)
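The prefix rules can be captured in a small lookup so that the right string is always prepended before encoding. This is a minimal sketch: `TASK_PREFIXES` and `add_prefix` are hypothetical helper names, not part of the model or of any library; only the four prefix strings come from this card.

```python
# Hypothetical helper (not part of the model's API): maps a task name to the
# prefix string listed in this model card.
TASK_PREFIXES = {
    "search_query": "search_query: ",
    "search_document": "search_document: ",
    "classification": "classification: ",
    "clustering": "clustering: ",
}


def add_prefix(text: str, task: str) -> str:
    """Prepend the prefix for `task` to `text`; reject unknown task names."""
    if task not in TASK_PREFIXES:
        raise ValueError(
            f"unknown task {task!r}; expected one of {sorted(TASK_PREFIXES)}"
        )
    return TASK_PREFIXES[task] + text


print(add_prefix("What a time to be alive!", "classification"))
# classification: What a time to be alive!
```

Both texts of a symmetric pair get the same prefix, while retrieval uses "search_query: " for the question and "search_document: " for the candidate passage.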
To better tailor the model to your needs, you can fine-tune it with relevant high-quality Russian and English datasets.
Below are examples of encoding texts with the Transformers and SentenceTransformers libraries.
Transformers
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel
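def pool(hidden_state, mask, pooling_method="cls"):
    if pooling_method == "mean":
        # average the token vectors, ignoring padding positions
        s = torch.sum(hidden_state * mask.unsqueeze(-1).float(), dim=1)
        d = mask.sum(dim=1, keepdim=True).float()
        return s / d
    elif pooling_method == "cls":
        # take the vector of the first token
        return hidden_state[:, 0]
    raise ValueError(f"Unknown pooling_method: {pooling_method!r}")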
inputs = [
    # first text of each pair
    "classification: Он нам и <unk> не нужон ваш Интернет!",
    "clustering: В Ярославской области разрешили работу бань, но без посетителей",
    "search_query: Сколько программистов нужно, чтобы вкрутить лампочку?",
    # second text of each pair, in the same order
    "classification: What a time to be alive!",
    "clustering: Ярославским баням разрешили работать без посетителей",
    "search_document: Чтобы вкрутить лампочку, требуется три программиста: один напишет программу извлечения лампочки, другой — вкручивания лампочки, а третий проведет тестирование.",
]
tokenizer = AutoTokenizer.from_pretrained("ai-forever/ru-en-RoSBERTa")
model = AutoModel.from_pretrained("ai-forever/ru-en-RoSBERTa")
tokenized_inputs = tokenizer(inputs, max_length=512, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**tokenized_inputs)
embeddings = pool(
    outputs.last_hidden_state,
    tokenized_inputs["attention_mask"],
    pooling_method="cls"  # or try "mean"
)
embeddings = F.normalize(embeddings, p=2, dim=1)
sim_scores = embeddings[:3] @ embeddings[3:].T
print(sim_scores.diag().tolist())
# [0.4796873927116394, 0.9409002065658569, 0.7761015892028809]
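Because the embeddings are L2-normalized, the matrix product yields cosine similarities, and the diagonal picks out the scores of the matched pairs (first text i against second text i). The following pure-Python sketch shows the same arithmetic on made-up 2-dimensional vectors; the numbers are not model outputs, and `l2_normalize` and `dot` are illustrative helper names.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


# Toy "embeddings" (made-up numbers, not produced by the model).
left = [[3.0, 4.0], [1.0, 0.0]]
right = [[6.0, 8.0], [0.0, 2.0]]

left_n = [l2_normalize(v) for v in left]
right_n = [l2_normalize(v) for v in right]

# Full similarity matrix: entry [i][j] compares left[i] with right[j].
sim = [[dot(a, b) for b in right_n] for a in left_n]

# The diagonal keeps only the matched pairs (left[i], right[i]).
paired = [sim[i][i] for i in range(len(sim))]
print(paired)
```

The first pair points in the same direction, so its score is close to 1; the second pair is orthogonal, so its score is close to 0.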
SentenceTransformers
from sentence_transformers import SentenceTransformer
inputs = [
    # first text of each pair
    "classification: Он нам и <unk> не нужон ваш Интернет!",
    "clustering: В Ярославской области разрешили работу бань, но без посетителей",
    "search_query: Сколько программистов нужно, чтобы вкрутить лампочку?",
    # second text of each pair, in the same order
    "classification: What a time to be alive!",
    "clustering: Ярославским баням разрешили работать без посетителей",
    "search_document: Чтобы вкрутить лампочку, требуется три программиста: один напишет программу извлечения лампочки, другой — вкручивания лампочки, а третий проведет тестирование.",
]
# loads model with CLS pooling
model = SentenceTransformer("ai-forever/ru-en-RoSBERTa")
# embeddings are normalized by default
embeddings = model.encode(inputs, convert_to_tensor=True)
sim_scores = embeddings[:3] @ embeddings[3:].T
print(sim_scores.diag().tolist())
# [0.47968706488609314, 0.940900444984436, 0.7761018872261047]
Alternatively, use prompts (requires sentence-transformers>=2.4.0):
from sentence_transformers import SentenceTransformer
# loads model with CLS pooling
model = SentenceTransformer("ai-forever/ru-en-RoSBERTa")
classification = model.encode(["Он нам и <unk> не нужон ваш Интернет!", "What a time to be alive!"], prompt_name="classification")
print(classification[0] @ classification[1].T) # 0.47968706488609314
clustering = model.encode(["В Ярославской области разрешили работу бань, но без посетителей", "Ярославским баням разрешили работать без посетителей"], prompt_name="clustering")
print(clustering[0] @ clustering[1].T) # 0.940900444984436
query_embedding = model.encode("Сколько программистов нужно, чтобы вкрутить лампочку?", prompt_name="search_query")
document_embedding = model.encode("Чтобы вкрутить лампочку, требуется три программиста: один напишет программу извлечения лампочки, другой — вкручивания лампочки, а третий проведет тестирование.", prompt_name="search_document")
print(query_embedding @ document_embedding.T) # 0.7761018872261047
Citation
@misc{snegirev2024russianfocusedembeddersexplorationrumteb,
      title={The Russian-focused embedders' exploration: ruMTEB benchmark and Russian embedding model design},
      author={Artem Snegirev and Maria Tikhonova and Anna Maksimova and Alena Fenogenova and Alexander Abramov},
      year={2024},
      eprint={2408.12503},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2408.12503},
}
Limitations
The model is designed to process texts in Russian; its quality on English texts is unknown. The maximum input length is 512 tokens.
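With `truncation=True`, as in the Transformers example, anything beyond the 512-token limit is dropped. One possible workaround is to split a long token sequence into windows that fit the limit and encode each window separately. The sketch below is an illustration, not a recommendation from the model authors: `split_into_windows` is a hypothetical helper, it works on token ids that have already been produced without special tokens, and it assumes two positions must be reserved for RoBERTa-style start and end tokens. How to combine the per-window embeddings afterwards is left open.

```python
def split_into_windows(token_ids, max_len=512, num_special=2, overlap=0):
    """Split token ids into windows that fit the model's length limit.

    `num_special` positions are reserved for special tokens (assumed to be 2),
    and consecutive windows share `overlap` tokens.
    """
    window = max_len - num_special
    if window <= 0:
        raise ValueError("max_len must be greater than num_special")
    if not 0 <= overlap < window:
        raise ValueError("overlap must be in [0, window)")
    if not token_ids:
        return []

    step = window - overlap
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + window])
        # Stop once the current window reaches the end of the sequence.
        if start + window >= len(token_ids):
            break
        start += step
    return windows


# Placeholder ids standing in for a tokenized long document.
ids = list(range(1200))
chunks = split_into_windows(ids)
print([len(c) for c in chunks])
# [510, 510, 180]
```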
Model tree for ai-forever/ru-en-RoSBERTa
Base model: ai-forever/ruRoberta-large
Evaluation results (self-reported, test sets)
- MTEB CEDRClassification (default): accuracy 44.687, f1 40.760, lrap 70.696, main_score 44.687
- MTEB GeoreviewClassification (default): accuracy 49.697, f1 47.793, f1_weighted 47.791, main_score 49.697
- MTEB GeoreviewClusteringP2P (default): main_score 65.422, v_measure 65.422