Accept tokens instead of strings, and a question regarding tokenizer behaviour

#35
by MH1P - opened

Hi all,

I usually prefer passing tokens directly to an embedding model (it makes more sense to me, since the maximum sequence length is expressed in tokens, not in characters). It looks to me that this is not possible with the .encode() method you provide.
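
As a side note on why this matters, here is a minimal sketch (the sample string is just an illustration): string length is a poor proxy for token count, so knowing whether an input will be truncated requires tokenizing first.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("jinaai/jina-embeddings-v3")
text = "Hello, world!"
ids = tokenizer(text)["input_ids"]
# The limit that matters is the number of tokens, not characters:
# equally long strings can produce very different token counts.
print(len(text), len(ids))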

Attempting to do so, I noticed that the tokenizer calls .strip() and .lower(). Is the model oblivious to capital letters? Did you quantify the impact of lowercasing when capitalization carries meaning, e.g. 'NY' (New York) vs. 'ny'? Did you see an impact on the ability to distinguish proper names from other words?
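
If anyone wants to quantify this themselves, here is a rough sketch using the manual tokenize-then-forward path (so it bypasses whatever preprocessing .encode() applies). The example sentence is mine; a real estimate would of course need a proper corpus and evaluation.

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("jinaai/jina-embeddings-v3")
model = AutoModel.from_pretrained("jinaai/jina-embeddings-v3", trust_remote_code=True)

def embed(texts):
    batch = tokenizer(texts, return_tensors="pt", padding=True)
    token_embeddings = model(**batch)[0]
    mask = batch["attention_mask"].unsqueeze(-1).float()
    return (token_embeddings * mask).sum(1) / mask.sum(1).clamp(min=1e-9)

with torch.no_grad():
    cased = embed(["I flew to NY last week."])
    lowered = embed(["i flew to ny last week."])  # what custom_st.py would produce

# A cosine similarity near 1.0 suggests lowercasing loses little for this pair.
print(torch.nn.functional.cosine_similarity(cased, lowered))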

Anyway, if anyone else is looking to feed tokens instead of strings, here is some sample code.

Cheers!

from transformers import AutoModel, AutoTokenizer
import torch


def mean_pooling(token_embeddings: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # Average token embeddings over the sequence, masking out padding positions.
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

tokenizer = AutoTokenizer.from_pretrained("jinaai/jina-embeddings-v3")
print(type(tokenizer))

input = ["Hello, world!", "How are you today?"]
# Follow transformation applied in custom_st.py
input = [s.strip() for s in input]
input = [s.lower() for s in input]

batch_tokenized = tokenizer(input, return_tensors='pt', padding=True, truncation="longest_first",)
print(batch_tokenized)
print(type(batch_tokenized))

model = AutoModel.from_pretrained("jinaai/jina-embeddings-v3", trust_remote_code=True)
print(type(model))

# Forward pass on the pre-tokenized batch, then mean-pool with the attention mask.
embs = model(**batch_tokenized)[0]
embs = mean_pooling(embs, batch_tokenized["attention_mask"])
print(type(embs))
print(embs)

print("----")

embs2 = model.encode(texts, normalize_embeddings=False)
print(type(embs2))
print(embs2)
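
One last sanity check I'd add: if .encode() really is just tokenize -> forward -> mean-pool, the two paths should produce (near-)identical rows; a cosine noticeably below 1.0 would mean .encode() does something extra. The np.asarray call is there only because I am not certain of embs2's exact return type.

import numpy as np

manual = embs.detach().cpu().numpy()
auto = np.asarray(embs2)
cos = (manual * auto).sum(axis=1) / (np.linalg.norm(manual, axis=1) * np.linalg.norm(auto, axis=1))
print(cos)  # expect values close to 1.0 per input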
