Use of `layer_norm` in examples
In this example from the model card, I'm having trouble working out why `F.layer_norm` is being used:
```python
import torch.nn.functional as F
from sentence_transformers import SentenceTransformer

matryoshka_dim = 512
model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)
sentences = ['search_query: What is TSNE?', 'search_query: Who is Laurens van der Maaten?']
embeddings = model.encode(sentences, convert_to_tensor=True)
embeddings = F.layer_norm(embeddings, normalized_shape=(embeddings.shape[1],))
embeddings = embeddings[:, :matryoshka_dim]
embeddings = F.normalize(embeddings, p=2, dim=1)
```
It seems unusual. Is this a mistake, or is there something I'm not understanding?
Or in other words, what's wrong with this:
```python
embeddings = model.encode(sentences, convert_to_tensor=True)
embeddings = F.normalize(embeddings[:, :matryoshka_dim])  # limit dims and normalize
```
You're right that it's non-standard! We used it to train our model to be binary-aware, inspired by this tweet/paper. We messed around with this during a hack week and found it worked fairly well and was simpler than using an STE (straight-through estimator).
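To make "binary-aware" concrete, here's a minimal sketch of what a downstream binary use could look like, assuming binarization just means thresholding the layer-normed embeddings at zero; the threshold and the Hamming-distance scoring below are illustrative, not something prescribed by the model card:

```python
import torch
import torch.nn.functional as F
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)
sentences = ['search_query: What is TSNE?', 'search_query: Who is Laurens van der Maaten?']

embeddings = model.encode(sentences, convert_to_tensor=True)
# layer_norm centers each embedding at zero mean, so thresholding at 0
# yields roughly balanced binary codes
embeddings = F.layer_norm(embeddings, normalized_shape=(embeddings.shape[1],))
binary = (embeddings > 0).to(torch.uint8)  # illustrative 1-bit-per-dimension code
# Hamming distance between the two codes (lower = more similar)
hamming = (binary[0] ^ binary[1]).sum().item()
print(hamming)
```

The zero-centering is presumably what makes a plain sign threshold behave well enough to skip the STE.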
Ah, so this is specific to the binary case. If I just want to use (and truncate) the embeddings for similarity search, I assume I don't need the `layer_norm` step.
I compared the distributions of values with and without the `layer_norm` step and they're close to identical (since the values coming out of the model already have a mean close to 0 and a std near 1).
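For reference, a rough sketch of one way to run that comparison (the exact check isn't shown above): look at the mean/std of the raw outputs and compare cosine similarities of the truncated embeddings with and without `layer_norm`.

```python
import torch.nn.functional as F
from sentence_transformers import SentenceTransformer

matryoshka_dim = 512
model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)
sentences = ['search_query: What is TSNE?', 'search_query: Who is Laurens van der Maaten?']

raw = model.encode(sentences, convert_to_tensor=True)
normed = F.layer_norm(raw, normalized_shape=(raw.shape[1],))

# Per-embedding mean/std of the raw outputs: already near 0 / 1,
# which is why layer_norm barely changes them
print(raw.mean(dim=1), raw.std(dim=1))

# Cosine similarities of the truncated embeddings, with and without layer_norm
a = F.normalize(raw[:, :matryoshka_dim], p=2, dim=1)
b = F.normalize(normed[:, :matryoshka_dim], p=2, dim=1)
print(a @ a.T)  # without layer_norm
print(b @ b.T)  # with layer_norm
```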