---
language: 
  - ru
tags:
- sentence-similarity
- text-classification
datasets:
- merionum/ru_paraphraser
- RuPAWS
---


This is a cross-encoder model trained to predict semantic equivalence of two Russian sentences. 

It classifies text pairs as paraphrases (class 1) or non-paraphrases (class 0). Its scores can be used as a metric of content preservation for paraphrasing or text style transfer.

It is a [sberbank-ai/ruRoberta-large](https://huggingface.co/sberbank-ai/ruRoberta-large) model fine-tuned on a union of 3 datasets: 
1. `RuPAWS`: https://github.com/ivkrotova/rupaws_dataset, which has a QQP part and a Wikipedia part;
2. `ru_paraphraser`: https://huggingface.co/merionum/ru_paraphraser;
3. Results of the manual content-preservation check performed while collecting the [RUSSE-2022](https://www.dialog-21.ru/media/5755/dementievadplusetal105.pdf) text detoxification dataset (`content_5.tsv`).

The task was formulated as binary classification: whether the two sentences have the same meaning (1) or different meanings (0).

The table shows the training set size after duplication, where each pair is included in both orders (`text1 + text2` and `text2 + text1`):

source \ label | 0 | 1
-- | -- | --
detox | 1412 | 3843
paraphraser | 5539 | 1688
rupaws_qqp | 1112 | 792
rupaws_wiki | 3526 | 2166
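
The duplication step can be illustrated with a minimal sketch. The `(text1, text2, label)` tuple layout and the helper name `add_reversed_pairs` are assumptions made for illustration, not the authors' preprocessing code, and the two labelled pairs are toy data.

```python
from typing import List, Tuple

# Assumed layout of one training example: (text1, text2, label).
Pair = Tuple[str, str, int]


def add_reversed_pairs(pairs: List[Pair]) -> List[Pair]:
    """Return every pair followed by a copy with text1 and text2 swapped."""
    augmented = []
    for text1, text2, label in pairs:
        augmented.append((text1, text2, label))
        # Paraphrase is a symmetric relation, so the label is unchanged.
        augmented.append((text2, text1, label))
    return augmented


# Toy examples with illustrative labels.
pairs = [
    ("Я тебя люблю", "Ты мне нравишься", 1),
    ("Я тебя люблю", "Я тебя ненавижу", 0),
]
augmented = add_reversed_pairs(pairs)
print(len(augmented))  # 4
```

Seeing both orders during training encourages the cross-encoder to give similar scores regardless of which sentence comes first.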

The model was trained with the Adam optimizer and the following hyperparameters:

```
learning_rate = 1e-5
batch_size = 8
gradient_accumulation_steps = 4
n_epochs = 3
max_grad_norm = 1.0
```
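
As a rough worked example, these settings imply an effective batch size of 32 pairs per optimizer update. The sketch below also estimates the number of updates, assuming that the counts in the training-size table are the complete training set and that a trailing partial batch still triggers an update; neither assumption is confirmed by the authors.

```python
import math

batch_size = 8
gradient_accumulation_steps = 4
n_epochs = 3

# Rows of the training-size table: (label 0, label 1) counts per source.
train_counts = {
    "detox": (1412, 3843),
    "paraphraser": (5539, 1688),
    "rupaws_qqp": (1112, 792),
    "rupaws_wiki": (3526, 2166),
}

# Gradients are accumulated over several mini-batches before each update.
effective_batch_size = batch_size * gradient_accumulation_steps
n_examples = sum(neg + pos for neg, pos in train_counts.values())

# Assumption: the last, partial batch of an epoch still produces an update.
updates_per_epoch = math.ceil(n_examples / effective_batch_size)
total_updates = updates_per_epoch * n_epochs

print(effective_batch_size)  # 32
print(n_examples)            # 20078
print(total_updates)         # 1884
```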

After training, the model had the following ROC AUC scores on the test sets:

set | ROC AUC
-- | --
detox | 0.857112
paraphraser | 0.858465
rupaws_qqp | 0.859195
rupaws_wiki | 0.906121
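
For readers unfamiliar with the metric, ROC AUC equals the probability that a randomly chosen paraphrase pair gets a higher score than a randomly chosen non-paraphrase pair. The sketch below computes it directly from that definition on toy data; it is not the authors' evaluation script, and in practice a library routine such as scikit-learn's `roc_auc_score` is the usual choice.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC AUC needs at least one example of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy labels and scores, invented for illustration.
labels = [1, 0, 1, 0]
scores = [0.9, 0.2, 0.4, 0.6]
print(roc_auc(labels, scores))  # 0.75
```

The pairwise loop is quadratic in the number of examples, which is fine for a sanity check but slow for large test sets.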

Example usage:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained('SkolkovoInstitute/ruRoberta-large-paraphrase-v1')
tokenizer = AutoTokenizer.from_pretrained('SkolkovoInstitute/ruRoberta-large-paraphrase-v1')

def get_similarity(text1, text2):
    """ Predict the probability that two Russian sentences are paraphrases of each other. """
    with torch.inference_mode():
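        # RoBERTa-style models reserve two position indices for the padding offset,
        # so the longest usable input is max_position_embeddings - 2 tokens.
        batch = tokenizer(
            text1, text2,
            truncation=True, max_length=model.config.max_position_embeddings - 2, return_tensors='pt',
        ).to(model.device)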
        proba = torch.softmax(model(**batch).logits, -1)
    return proba[0][1].item()

print(get_similarity('Я тебя люблю', 'Ты мне нравишься'))  # 0.9798
print(get_similarity('Я тебя люблю', 'Я тебя ненавижу'))   # 0.0008
```
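
To use the scores as a content preservation metric over a whole corpus, one simple option is to average them across aligned source and rewrite pairs. The sketch below is one possible way to do this, not a prescribed procedure: `content_preservation` and `dummy_score` are hypothetical names, and `dummy_score` is a stand-in so the example runs without downloading the model. In real use, pass `get_similarity` from the example above as `score_fn`.

```python
from statistics import mean
from typing import Callable, Sequence


def content_preservation(
    sources: Sequence[str],
    rewrites: Sequence[str],
    score_fn: Callable[[str, str], float],
) -> float:
    """Average paraphrase probability over aligned (source, rewrite) pairs."""
    if len(sources) != len(rewrites):
        raise ValueError("sources and rewrites must have the same length")
    if not sources:
        raise ValueError("need at least one pair")
    return mean(score_fn(s, r) for s, r in zip(sources, rewrites))


def dummy_score(text1: str, text2: str) -> float:
    """Stand-in scorer (exact match); replace with get_similarity in real use."""
    return 1.0 if text1 == text2 else 0.0


print(content_preservation(["a", "b"], ["a", "c"], dummy_score))  # 0.5
```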