T5 English, Russian and Chinese sentence similarity model
This is a sentence-transformers model: it maps sentences and paragraphs to a 768-dimensional dense vector space. The model works well for sentence similarity tasks but performs less well on semantic search tasks.
The model can be used to search for parallel texts in Russian, English and Chinese.
Sentence similarity is computed using only the encoder of the T5-based model.
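As a rough illustration of how per-token encoder outputs can be turned into one comparable sentence vector, here is a minimal sketch. It assumes masked mean pooling followed by L2 normalization, which is a common recipe for encoder-based sentence embeddings; the pooling this model actually applies is not stated here and may differ. The helper names (`mean_pool`, `l2_normalize`, `dot`) and the toy 2-dimensional vectors are hypothetical.

```python
import math

def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1 (padding is skipped)."""
    dim = len(token_vectors[0])
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    """Dot product; equals cosine similarity for unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))

# Toy stand-ins for encoder outputs (assumption: real vectors have 768 dims).
tokens_a = [[1.0, 0.0], [3.0, 0.0], [9.0, 9.0]]  # last token is padding
mask_a = [1, 1, 0]
tokens_b = [[0.0, 2.0], [0.0, 4.0]]
mask_b = [1, 1]

emb_a = l2_normalize(mean_pool(tokens_a, mask_a))
emb_b = l2_normalize(mean_pool(tokens_b, mask_b))
print(dot(emb_a, emb_b))
```

Once both vectors have unit length, a plain dot product is their cosine similarity, so no separate normalization is needed at comparison time.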
Usage (Sentence-Transformers)
Using this model becomes easy when you have sentence-transformers installed:
pip install -U sentence-transformers
Then you can use the model like this:
from sentence_transformers import SentenceTransformer
import torch.nn.functional as F
model = SentenceTransformer('utrobinmv/t5_translate_en_ru_zh_base_200_sent')
sentences_1 = ["The purpose of the development is to provide users with a personal simultaneous interpreter.",
"Съешь ещё этих мягких французских булок.",
"再吃这些法国的甜蜜的面包。"]
sentences_2 = ["Цель разработки — предоставить пользователям личного синхронного переводчика.",
"Have some more of these soft French rolls.",
"开发的目的就是向用户提供个性化的同步翻译。"]
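# Encode both lists in one batch, then split the embeddings back apart
embeddings = model.encode(sentences_1+sentences_2)
embeddings_1 = embeddings[:len(sentences_1)]
embeddings_2 = embeddings[len(sentences_1):]
# Rows index sentences_1, columns index sentences_2
similarity = embeddings_1 @ embeddings_2.T
print(similarity)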
#[[ 0.8956245 -0.0390042 0.8493222 ]
# [ 0.00778637 0.85185283 -0.010229 ]
# [ 0.01991986 0.72560245 0.02547248]]
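To use such a matrix for finding parallel texts, one option is to take the highest-scoring column in each row and keep it only if the score clears a cutoff. The sketch below reuses the scores printed above; the `best_matches` helper and the 0.5 threshold are hypothetical choices for illustration, not part of the model.

```python
# Similarity scores copied from the printed output above
# (rows: sentences_1, columns: sentences_2).
similarity = [
    [0.8956245, -0.0390042, 0.8493222],
    [0.00778637, 0.85185283, -0.010229],
    [0.01991986, 0.72560245, 0.02547248],
]

def best_matches(sim, threshold=0.5):
    """Return (row, column, score) for each row whose top score passes the threshold."""
    matches = []
    for i, row in enumerate(sim):
        j = max(range(len(row)), key=row.__getitem__)  # index of the best column
        if row[j] >= threshold:  # threshold is an assumed cutoff, tune per dataset
            matches.append((i, j, row[j]))
    return matches

for i, j, score in best_matches(similarity):
    print(f"sentences_1[{i}] <-> sentences_2[{j}] (score {score:.2f})")
```

Taking only the top column drops other valid translations: the first English sentence also scores about 0.85 against its Chinese counterpart. To collect every translation, keep all columns above the threshold instead of just the maximum.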
Example: translating Russian to Chinese
from transformers import T5ForConditionalGeneration, T5Tokenizer
device = 'cuda'  # or 'cpu' to translate on the CPU
model_name = 'utrobinmv/t5_translate_en_ru_zh_base_200_sent'
model = T5ForConditionalGeneration.from_pretrained(model_name)
model.to(device)
tokenizer = T5Tokenizer.from_pretrained(model_name)
prefix = 'translate to zh: '
src_text = prefix + "Съешь ещё этих мягких французских булок."
# translate Russian to Chinese
input_ids = tokenizer(src_text, return_tensors="pt")
generated_tokens = model.generate(**input_ids.to(device))
result = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
print(result)
# 再吃这些法国的甜蜜的面包。
Example: translating Chinese to Russian
from transformers import T5ForConditionalGeneration, T5Tokenizer
device = 'cuda'  # or 'cpu' to translate on the CPU
model_name = 'utrobinmv/t5_translate_en_ru_zh_base_200_sent'
model = T5ForConditionalGeneration.from_pretrained(model_name)
model.to(device)
tokenizer = T5Tokenizer.from_pretrained(model_name)
prefix = 'translate to ru: '
src_text = prefix + "再吃这些法国的甜蜜的面包。"
# translate Chinese to Russian
input_ids = tokenizer(src_text, return_tensors="pt")
generated_tokens = model.generate(**input_ids.to(device))
result = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
print(result)
# Съешьте этот сладкий хлеб из Франции.
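Both translation examples select the target language with a text prefix of the form `translate to <code>: `. A small helper can build that prefix and reject unsupported codes, as in the sketch below. `build_prompt` and `SUPPORTED_TARGETS` are hypothetical names, and the `en` code is an assumption made by analogy with the `zh` and `ru` prefixes shown above.

```python
# Target codes: 'zh' and 'ru' appear in the examples above;
# 'en' is assumed by analogy and should be verified against the model.
SUPPORTED_TARGETS = ("en", "ru", "zh")

def build_prompt(text, target_lang):
    """Prepend the 'translate to <lang>: ' prefix used in the examples above."""
    if target_lang not in SUPPORTED_TARGETS:
        raise ValueError(
            f"unsupported target language {target_lang!r}; "
            f"expected one of {SUPPORTED_TARGETS}"
        )
    return f"translate to {target_lang}: {text}"

src_text = build_prompt("再吃这些法国的甜蜜的面包。", "ru")
print(src_text)
```

The resulting `src_text` can be passed to the tokenizer exactly as in the examples above.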
Languages covered
Russian (ru_RU), Chinese (zh_CN), English (en_US)
Base model
utrobinmv/t5_translate_en_ru_zh_base_200