
	---
	license: apache-2.0
	language:
	- en
	- az
	base_model:
	- sentence-transformers/LaBSE
	pipeline_tag: sentence-similarity
	---




	# Small LaBSE for English-Azerbaijani

	This is a smaller, optimized version of [LaBSE](https://huggingface.co/sentence-transformers/LaBSE) for English and Azerbaijani sentence similarity.





	## Benchmark

	| STSBenchmark | biosses-sts | sickr-sts | sts12-sts | sts13-sts | sts15-sts | sts16-sts | Average Pearson | Model |
	|--------------|-------------|-----------|-----------|-----------|-----------|-----------|-----------------|--------------------------------------|
	| 0.7363 | 0.8148 | 0.7067 | 0.7050 | 0.6535 | 0.7514 | 0.7070 | 0.7250 | sentence-transformers/LaBSE |
	| 0.7400 | 0.8216 | 0.6946 | 0.7098 | 0.6781 | 0.7637 | 0.7222 | 0.7329 | LocalDoc/LaBSE-small-AZ |
	| 0.5830 | 0.2486 | 0.5921 | 0.5593 | 0.5559 | 0.5404 | 0.5289 | 0.5155 | antoinelouis/colbert-xm |
	| 0.7572 | 0.8139 | 0.7328 | 0.7646 | 0.6318 | 0.7542 | 0.7092 | 0.7377 | intfloat/multilingual-e5-large-instruct |
	| 0.7485 | 0.7714 | 0.7271 | 0.7170 | 0.6496 | 0.7570 | 0.7255 | 0.7280 | intfloat/multilingual-e5-large |
	| 0.6960 | 0.8185 | 0.6950 | 0.6752 | 0.5899 | 0.7186 | 0.6790 | 0.6960 | intfloat/multilingual-e5-base |
	| 0.7376 | 0.7917 | 0.7190 | 0.7441 | 0.6286 | 0.7461 | 0.7026 | 0.7242 | intfloat/multilingual-e5-small |
	| 0.7927 | 0.6672 | 0.7758 | 0.8122 | 0.7312 | 0.7831 | 0.7416 | 0.7577 | BAAI/bge-m3 |

	Benchmark source: [STS-Benchmark](https://github.com/LocalDoc-Azerbaijan/STS-Benchmark)
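The "Average Pearson" column in the benchmark table appears to be the unweighted mean of the seven per-dataset scores, rounded to four decimals. The card does not state this explicitly, so treat it as an assumption. A minimal sketch that checks it against the reported values (the names `rows` and `checks` are illustrative only):

```python
import math

# Per-dataset scores and the reported average, copied from the benchmark table.
rows = {
    "sentence-transformers/LaBSE": ([0.7363, 0.8148, 0.7067, 0.7050, 0.6535, 0.7514, 0.7070], 0.7250),
    "LocalDoc/LaBSE-small-AZ": ([0.7400, 0.8216, 0.6946, 0.7098, 0.6781, 0.7637, 0.7222], 0.7329),
    "antoinelouis/colbert-xm": ([0.5830, 0.2486, 0.5921, 0.5593, 0.5559, 0.5404, 0.5289], 0.5155),
    "intfloat/multilingual-e5-large-instruct": ([0.7572, 0.8139, 0.7328, 0.7646, 0.6318, 0.7542, 0.7092], 0.7377),
    "intfloat/multilingual-e5-large": ([0.7485, 0.7714, 0.7271, 0.7170, 0.6496, 0.7570, 0.7255], 0.7280),
    "intfloat/multilingual-e5-base": ([0.6960, 0.8185, 0.6950, 0.6752, 0.5899, 0.7186, 0.6790], 0.6960),
    "intfloat/multilingual-e5-small": ([0.7376, 0.7917, 0.7190, 0.7441, 0.6286, 0.7461, 0.7026], 0.7242),
    "BAAI/bge-m3": ([0.7927, 0.6672, 0.7758, 0.8122, 0.7312, 0.7831, 0.7416], 0.7577),
}

# Assumption: the reported average is the plain mean rounded to 4 decimals.
checks = {}
for model_name, (scores, reported) in rows.items():
    mean = sum(scores) / len(scores)
    checks[model_name] = math.isclose(round(mean, 4), reported, abs_tol=1e-9)

for model_name, matches in checks.items():
    print(f"{model_name}: {'matches' if matches else 'differs'}")
```

Under this reading, LocalDoc/LaBSE-small-AZ averages slightly above the base LaBSE model (0.7329 vs. 0.7250) and below BAAI/bge-m3 (0.7577).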





	## How to Use

	```python
	from transformers import AutoTokenizer, AutoModel
	import torch

	# Load model and tokenizer
	tokenizer = AutoTokenizer.from_pretrained("LocalDoc/LaBSE-small-AZ")
	model = AutoModel.from_pretrained("LocalDoc/LaBSE-small-AZ")

	# Prepare texts: an English sentence and its Azerbaijani translation
	texts = [
	    "Hello world",
	    "Salam dünya",
	]

	# Tokenize and generate embeddings (no gradients needed for inference)
	encoded = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
	with torch.no_grad():
	    embeddings = model(**encoded).pooler_output

	# Compute cosine similarity between the two sentence embeddings
	similarity = torch.nn.functional.cosine_similarity(embeddings[0], embeddings[1], dim=0)
	print(f"Similarity: {similarity.item():.4f}")
	```
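For more than two sentences, it can be convenient to compute all pairwise similarities at once. The sketch below is not part of the original card: `pairwise_cosine` is a hypothetical helper, and `dummy_embeddings` is a small hand-made tensor standing in for `model(**encoded).pooler_output` so that the example runs without downloading the model.

```python
import torch
import torch.nn.functional as F


def pairwise_cosine(embeddings: torch.Tensor) -> torch.Tensor:
    """Return an (n, n) matrix of cosine similarities between the rows."""
    # L2-normalize each row so that a dot product equals cosine similarity.
    normalized = F.normalize(embeddings, p=2, dim=1)
    return normalized @ normalized.T


# Stand-in for real sentence embeddings (assumption: shape is (n_sentences, dim)).
dummy_embeddings = torch.tensor([
    [1.0, 0.0],
    [0.0, 2.0],
    [3.0, 3.0],
])

similarity_matrix = pairwise_cosine(dummy_embeddings)
print(similarity_matrix)
```

Entry `[i, j]` of the result is the similarity between sentence `i` and sentence `j`; the diagonal is 1 because every sentence is identical to itself.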