---
license: apache-2.0
language:
  - en
  - az
base_model:
  - sentence-transformers/LaBSE
pipeline_tag: sentence-similarity
---

# Small LaBSE for English-Azerbaijani

This is an optimized version of [sentence-transformers/LaBSE](https://huggingface.co/sentence-transformers/LaBSE) for English-Azerbaijani sentence similarity.

## Benchmark

| STSBenchmark | biosses-sts | sickr-sts | sts12-sts | sts13-sts | sts15-sts | sts16-sts | Average Pearson | Model |
|---|---|---|---|---|---|---|---|---|
| 0.7363 | 0.8148 | 0.7067 | 0.7050 | 0.6535 | 0.7514 | 0.7070 | 0.7250 | sentence-transformers/LaBSE |
| 0.7400 | 0.8216 | 0.6946 | 0.7098 | 0.6781 | 0.7637 | 0.7222 | 0.7329 | LocalDoc/LaBSE-small-AZ |
| 0.5830 | 0.2486 | 0.5921 | 0.5593 | 0.5559 | 0.5404 | 0.5289 | 0.5155 | antoinelouis/colbert-xm |
| 0.7572 | 0.8139 | 0.7328 | 0.7646 | 0.6318 | 0.7542 | 0.7092 | 0.7377 | intfloat/multilingual-e5-large-instruct |
| 0.7485 | 0.7714 | 0.7271 | 0.7170 | 0.6496 | 0.7570 | 0.7255 | 0.7280 | intfloat/multilingual-e5-large |
| 0.6960 | 0.8185 | 0.6950 | 0.6752 | 0.5899 | 0.7186 | 0.6790 | 0.6960 | intfloat/multilingual-e5-base |
| 0.7376 | 0.7917 | 0.7190 | 0.7441 | 0.6286 | 0.7461 | 0.7026 | 0.7242 | intfloat/multilingual-e5-small |
| 0.7927 | 0.6672 | 0.7758 | 0.8122 | 0.7312 | 0.7831 | 0.7416 | 0.7577 | BAAI/bge-m3 |

*STS-Benchmark comparison (Pearson correlation; higher is better).*
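
Each score is the Pearson correlation between the model's cosine similarities and human-annotated gold scores over a dataset's sentence pairs, and the "Average Pearson" column is the mean across datasets. The sketch below shows how such a per-dataset score can be computed; the sentence pairs and gold scores are illustrative placeholders rather than real benchmark data, and `scipy` is assumed to be installed.

```python
import torch
from scipy.stats import pearsonr
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("LocalDoc/LaBSE-small-AZ")
model = AutoModel.from_pretrained("LocalDoc/LaBSE-small-AZ")

# Illustrative sentence pairs with gold similarity scores (0-5 scale, as in STS data)
pairs = [
    ("A man is playing a guitar.", "A person plays a guitar.", 4.8),
    ("A woman is cooking dinner.", "Someone is preparing food.", 3.6),
    ("A woman is cooking dinner.", "A man is driving a car.", 0.5),
]

def embed(texts):
    encoded = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        return model(**encoded).pooler_output

predicted, gold = [], []
for s1, s2, score in pairs:
    emb = embed([s1, s2])
    sim = torch.nn.functional.cosine_similarity(emb[0], emb[1], dim=0).item()
    predicted.append(sim)
    gold.append(score)

# Pearson correlation between model similarities and gold scores
r, _ = pearsonr(predicted, gold)
print(f"Pearson r: {r:.4f}")
```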

## How to Use

```python
from transformers import AutoTokenizer, AutoModel
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("LocalDoc/LaBSE-small-AZ")
model = AutoModel.from_pretrained("LocalDoc/LaBSE-small-AZ")

# Prepare texts
texts = [
    "Hello world",
    "Salam dünya"
]

# Tokenize and generate embeddings
encoded = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    embeddings = model(**encoded).pooler_output

# Compute cosine similarity between the two sentence embeddings
similarity = torch.nn.functional.cosine_similarity(embeddings[0], embeddings[1], dim=0)
print(f"Cosine similarity: {similarity.item():.4f}")
```
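
For comparing more than two texts, it is often more convenient to L2-normalize the embeddings and compute a full similarity matrix in one step. The following is a minimal sketch building on the snippet above; the normalization step and the example sentences are additions for illustration, not something prescribed by the model card.

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("LocalDoc/LaBSE-small-AZ")
model = AutoModel.from_pretrained("LocalDoc/LaBSE-small-AZ")

texts = [
    "Hello world",
    "Salam dünya",
    "The weather is nice today",
    "Bu gün hava gözəldir",
]

encoded = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    embeddings = model(**encoded).pooler_output

# L2-normalize so that the dot product equals cosine similarity
embeddings = F.normalize(embeddings, p=2, dim=1)
similarity_matrix = embeddings @ embeddings.T
print(similarity_matrix)
```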