T5 English, Russian and Chinese sentence similarity model

This is a sentence-transformers model: it maps sentences and paragraphs to a 768-dimensional dense vector space. The model works well for sentence similarity tasks but performs less well on semantic search tasks.

The model can be used to search for parallel texts in Russian, English and Chinese.

Sentence similarity is computed using only the encoder of the T5-based model.
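Sentence embeddings of this kind are commonly compared with cosine similarity. The card does not spell out the measure, so treat the following as a minimal sketch under that assumption. The function name and the toy 3-dimensional vectors are illustrative only and are not model output.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors (assumption: stand-ins for 768-dimensional embeddings)
u = [1.0, 2.0, 2.0]
v = [2.0, 4.0, 4.0]   # same direction as u
w = [2.0, -1.0, 0.0]  # orthogonal to u

print(cosine_similarity(u, v))  # close to 1: same direction
print(cosine_similarity(u, w))  # close to 0: unrelated direction
```

If the embeddings are already unit-normalized, the cosine reduces to a plain dot product.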

Usage (Sentence-Transformers)

Using this model is straightforward once sentence-transformers is installed:

pip install -U sentence-transformers

Then you can use the model like this:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('utrobinmv/t5_translate_en_ru_zh_base_200_sent')

sentences_1 = ["The purpose of the development is to provide users with a personal simultaneous interpreter.",
            "Съешь ещё этих мягких французских булок.",
            "再吃这些法国的甜蜜的面包。"]

sentences_2 = ["Цель разработки — предоставить пользователям личного синхронного переводчика.",
            "Have some more of these soft French rolls.",
            "开发的目的就是向用户提供个性化的同步翻译。"]

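# Encode both lists in one batch, then split the embeddings back apart
embeddings = model.encode(sentences_1 + sentences_2)
embeddings_1 = embeddings[:len(sentences_1)]
embeddings_2 = embeddings[len(sentences_1):]

# Pairwise dot products: rows are sentences_1, columns are sentences_2.
# A dot product equals cosine similarity only for unit-normalized embeddings.
similarity = embeddings_1 @ embeddings_2.T
print(similarity)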
#[[ 0.8956245  -0.0390042   0.8493222 ]
# [ 0.00778637  0.85185283 -0.010229  ]
# [ 0.01991986  0.72560245  0.02547248]]
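The matrix above can be turned into a list of candidate parallel pairs by keeping every score above a cutoff. The sketch below reuses the printed values verbatim. The threshold of 0.5 and the helper name `parallel_pairs` are assumptions for illustration and are not part of the model or the library.

```python
# Similarity matrix copied from the printed output above
# (rows: sentences_1, columns: sentences_2)
similarity = [
    [0.8956245, -0.0390042, 0.8493222],
    [0.00778637, 0.85185283, -0.010229],
    [0.01991986, 0.72560245, 0.02547248],
]

THRESHOLD = 0.5  # hypothetical cutoff; tune it on your own data

def parallel_pairs(matrix, threshold):
    """Return (row, column, score) for every score at or above the threshold."""
    return [
        (i, j, score)
        for i, row in enumerate(matrix)
        for j, score in enumerate(row)
        if score >= threshold
    ]

pairs = parallel_pairs(similarity, THRESHOLD)
for i, j, score in pairs:
    print(f"sentences_1[{i}] <-> sentences_2[{j}]: {score:.3f}")
```

With this cutoff the English sentence matches both its Russian and its Chinese translation, while the Russian and Chinese "French rolls" sentences both match the English version.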

Example: translate Russian to Chinese

from transformers import T5ForConditionalGeneration, T5Tokenizer

device = 'cuda'  # or 'cpu' to translate on the CPU

model_name = 'utrobinmv/t5_translate_en_ru_zh_base_200_sent'
model = T5ForConditionalGeneration.from_pretrained(model_name)
model.to(device)
tokenizer = T5Tokenizer.from_pretrained(model_name)

prefix = 'translate to zh: '
src_text = prefix + "Съешь ещё этих мягких французских булок."

# translate Russian to Chinese
input_ids = tokenizer(src_text, return_tensors="pt")

generated_tokens = model.generate(**input_ids.to(device))

result = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
print(result)
# 再吃这些法国的甜蜜的面包。

Example: translate Chinese to Russian

from transformers import T5ForConditionalGeneration, T5Tokenizer

device = 'cuda'  # or 'cpu' to translate on the CPU

model_name = 'utrobinmv/t5_translate_en_ru_zh_base_200_sent'
model = T5ForConditionalGeneration.from_pretrained(model_name)
model.to(device)
tokenizer = T5Tokenizer.from_pretrained(model_name)

prefix = 'translate to ru: '
src_text = prefix + "再吃这些法国的甜蜜的面包。"

# translate Chinese to Russian
input_ids = tokenizer(src_text, return_tensors="pt")

generated_tokens = model.generate(**input_ids.to(device))

result = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
print(result)
# Съешьте этот сладкий хлеб из Франции.
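Both translation examples select the target language through a text prefix. A small helper can make that explicit. This is a sketch only: `build_source_text` is a hypothetical name, and the `'translate to en: '` prefix is assumed by analogy with the two prefixes shown above and is not confirmed by this card.

```python
# 'translate to zh: ' and 'translate to ru: ' appear in the examples above.
# 'translate to en: ' is an assumption made by analogy.
PREFIXES = {
    "zh": "translate to zh: ",
    "ru": "translate to ru: ",
    "en": "translate to en: ",  # assumption, not confirmed by the card
}

def build_source_text(text, target_lang):
    """Prepend the target-language prefix to the input text (hypothetical helper)."""
    if target_lang not in PREFIXES:
        raise ValueError(f"unsupported target language: {target_lang!r}")
    return PREFIXES[target_lang] + text

src_text = build_source_text("Съешь ещё этих мягких французских булок.", "zh")
print(src_text)
```

The resulting string can be passed to the tokenizer in place of the hand-built `src_text` in the examples above.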

Languages covered

Russian (ru_RU), Chinese (zh_CN), English (en_US)

Model size: 298M params (Safetensors, tensor type F32)