---
license: apache-2.0
language:
  - en
  - az
base_model:
  - sentence-transformers/LaBSE
pipeline_tag: sentence-similarity
---

# Small LaBSE for English-Azerbaijani

This is an optimized version of [sentence-transformers/LaBSE](https://huggingface.co/sentence-transformers/LaBSE) for English-Azerbaijani sentence similarity.

## Benchmark

| STSBenchmark | biosses-sts | sickr-sts | sts12-sts | sts13-sts | sts15-sts | sts16-sts | Average Pearson | Model |
|---|---|---|---|---|---|---|---|---|
| 0.7363 | 0.8148 | 0.7067 | 0.7050 | 0.6535 | 0.7514 | 0.7070 | 0.7250 | sentence-transformers/LaBSE |
| 0.7400 | 0.8216 | 0.6946 | 0.7098 | 0.6781 | 0.7637 | 0.7222 | 0.7329 | LocalDoc/LaBSE-small-AZ |
| 0.5830 | 0.2486 | 0.5921 | 0.5593 | 0.5559 | 0.5404 | 0.5289 | 0.5155 | antoinelouis/colbert-xm |
| 0.7572 | 0.8139 | 0.7328 | 0.7646 | 0.6318 | 0.7542 | 0.7092 | 0.7377 | intfloat/multilingual-e5-large-instruct |
| 0.7485 | 0.7714 | 0.7271 | 0.7170 | 0.6496 | 0.7570 | 0.7255 | 0.7280 | intfloat/multilingual-e5-large |
| 0.6960 | 0.8185 | 0.6950 | 0.6752 | 0.5899 | 0.7186 | 0.6790 | 0.6960 | intfloat/multilingual-e5-base |
| 0.7376 | 0.7917 | 0.7190 | 0.7441 | 0.6286 | 0.7461 | 0.7026 | 0.7242 | intfloat/multilingual-e5-small |
| 0.7927 | 0.6672 | 0.7758 | 0.8122 | 0.7312 | 0.7831 | 0.7416 | 0.7577 | BAAI/bge-m3 |

*STS-Benchmark comparison (Pearson correlation; higher is better).*
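
Each score is the Pearson correlation between the model's cosine similarities and human-annotated gold scores over a dataset's sentence pairs, and the "Average Pearson" column is the mean across datasets. The sketch below shows how such a per-dataset score can be computed; the sentence pairs and gold scores are illustrative placeholders rather than real benchmark data, and `scipy` is assumed to be installed.

```python
import torch
from scipy.stats import pearsonr
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("LocalDoc/LaBSE-small-AZ")
model = AutoModel.from_pretrained("LocalDoc/LaBSE-small-AZ")

# Illustrative sentence pairs with gold similarity scores (0-5 scale, as in STS data)
pairs = [
    ("A man is playing a guitar.", "A person plays a guitar.", 4.8),
    ("A woman is cooking dinner.", "Someone is preparing food.", 3.6),
    ("A woman is cooking dinner.", "A man is driving a car.", 0.5),
]

def embed(texts):
    encoded = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        return model(**encoded).pooler_output

predicted, gold = [], []
for s1, s2, score in pairs:
    emb = embed([s1, s2])
    sim = torch.nn.functional.cosine_similarity(emb[0], emb[1], dim=0).item()
    predicted.append(sim)
    gold.append(score)

# Pearson correlation between model similarities and gold scores
r, _ = pearsonr(predicted, gold)
print(f"Pearson r: {r:.4f}")
```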

## How to Use

```python
from transformers import AutoTokenizer, AutoModel
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("LocalDoc/LaBSE-small-AZ")
model = AutoModel.from_pretrained("LocalDoc/LaBSE-small-AZ")

# Prepare texts
texts = [
    "Hello world",
    "Salam dünya"
]

# Tokenize and generate embeddings
encoded = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    embeddings = model(**encoded).pooler_output

# Compute cosine similarity between the two sentence embeddings
similarity = torch.nn.functional.cosine_similarity(embeddings[0], embeddings[1], dim=0)
print(f"Cosine similarity: {similarity.item():.4f}")
```
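
For comparing more than two texts, it is often more convenient to L2-normalize the embeddings and compute a full similarity matrix in one step. The following is a minimal sketch building on the snippet above; the normalization step and the example sentences are additions for illustration, not something prescribed by the model card.

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("LocalDoc/LaBSE-small-AZ")
model = AutoModel.from_pretrained("LocalDoc/LaBSE-small-AZ")

texts = [
    "Hello world",
    "Salam dünya",
    "The weather is nice today",
    "Bu gün hava gözəldir",
]

encoded = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    embeddings = model(**encoded).pooler_output

# L2-normalize so that the dot product equals cosine similarity
embeddings = F.normalize(embeddings, p=2, dim=1)
similarity_matrix = embeddings @ embeddings.T
print(similarity_matrix)
```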