Fix broken SentenceTransformer snippet; format code blocks with Python highlighting
Hello!
## Pull Request overview
* Fix broken SentenceTransformer snippet: it said `'` instead of `model_name_or_path` (the bug was introduced in https://huggingface.co/Alibaba-NLP/gte-multilingual-base/commit/167d84dcd20a1f29c277626db5175f7442274cee)
* Give the code snippets Python syntax highlighting by adding `python` right after the opening triple backticks
* Use more SentenceTransformer functionality: `normalize_embeddings=True` in `model.encode`, and `model.similarity` to compute the cosine similarities (see the sketch under Details below).
## Details
This should fix the snippet and make the README code blocks easier to read.
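
For context on the third bullet, here is a minimal sketch (mine, not part of the diff) checking that the new `normalize_embeddings=True` + `model.similarity` path gives the same scores as the manual numpy normalization it replaces. The example texts are placeholders, not the README's `input_texts`:

```python
# Minimal sketch, assuming sentence-transformers>=3.0.0; the texts below are
# placeholders, not the four input_texts from the README.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Alibaba-NLP/gte-multilingual-base", trust_remote_code=True)
texts = ["what is the capital of China?", "Beijing", "how to implement quick sort?"]

# New path (this PR): encode with L2-normalized outputs, then use the built-in
# similarity method (cosine by default), so no manual numpy math is needed.
emb = model.encode(texts, normalize_embeddings=True)
new_scores = model.similarity(emb[:1], emb[1:])  # torch.Tensor of shape (1, 2)

# Old path (removed by this PR): normalize by hand, then take dot products.
raw = model.encode(texts)
norms = np.linalg.norm(raw, ord=2, axis=1, keepdims=True)
norms[norms == 0] = 1
raw = raw / norms
old_scores = raw[:1] @ raw[1:].T

# Both routes agree up to float32 noise.
assert np.allclose(new_scores.numpy(), old_scores, atol=1e-5)
```

Since the embeddings are unit-length, cosine similarity reduces to a plain dot product, which is why the two routes match.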
cc @thenlper
- Tom Aarsen
README.md (changed):
````diff
@@ -4660,7 +4660,7 @@ refer to [enable-unpadding-and-xformers](https://huggingface.co/Alibaba-NLP/new-
 
 
 ### Get Dense Embeddings with Transformers
-```
+```python
 # Requires transformers>=4.36.0
 
 import torch.nn.functional as F
@@ -4693,12 +4693,10 @@ print(scores.tolist())
 ```
 
 ### Use with sentence-transformers
-```
+```python
 # Requires sentences-transformers>=3.0.0
 
 from sentence_transformers import SentenceTransformer
-from sentence_transformers.util import cos_sim
-import numpy as np
 
 input_texts = [
     "what is the capital of China?",
@@ -4708,24 +4706,18 @@ input_texts = [
 ]
 
 model_name_or_path="Alibaba-NLP/gte-multilingual-base"
-model = SentenceTransformer(', trust_remote_code=True)
-embeddings = model.encode(input_texts) # embeddings.shape (4, 768)
-
-# normalized embeddings
-norms = np.linalg.norm(embeddings, ord=2, axis=1, keepdims=True)
-norms[norms == 0] = 1
-embeddings = embeddings / norms
+model = SentenceTransformer(model_name_or_path, trust_remote_code=True)
+embeddings = model.encode(input_texts, normalize_embeddings=True) # embeddings.shape (4, 768)
 
 # sim scores
-scores = (embeddings[:1] @ embeddings[1:].T)
+scores = model.similarity(embeddings[:1], embeddings[1:])
 
 print(scores.tolist())
 # [[0.301699697971344, 0.7503870129585266, 0.32030850648880005]]
-
 ```
 
 ### Use with custom code to get dense embeddigns and sparse token weights
-```
+```python
 # You can find the script gte_embedding.py in https://huggingface.co/Alibaba-NLP/gte-multilingual-base/blob/main/scripts/gte_embedding.py
 
 from gte_embedding import GTEEmbeddidng
````