Latency observed in Embedding computation
#4 opened by RajaRamKankipati
Hi Team,
I am implementing MPNet embedding code for long documents (more than 512 tokens) with the following approach:
- Get all the tokens from the tokenizer without truncation
- Split the tokens into chunks of 512
- Pass the chunks to the model as a single batch
```python
encoded_input = tokenizer(
    document,
    max_length=None,
    padding=True,
    truncation=False,
    return_tensors="pt",
).to(device)

# Split the un-truncated sequence into 512-token chunks
encoded_input = pre_processing_encoded_input(encoded_input, size=512)

# Compute token embeddings
with torch.no_grad():
    model_output = self.model(**encoded_input)
```
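For context, this is roughly what the chunking step does (a simplified sketch of the idea described above; the actual `pre_processing_encoded_input` differs in details, and other keys such as `token_type_ids` would need the same treatment if present):

```python
import torch
import torch.nn.functional as F

def pre_processing_encoded_input(encoded_input, size=512, pad_token_id=1):
    # Sketch: reshape one un-truncated sequence into a (num_chunks, size) batch,
    # padding the last chunk. pad_token_id should be tokenizer.pad_token_id.
    input_ids = encoded_input["input_ids"][0]
    attention_mask = encoded_input["attention_mask"][0]

    chunk_ids, chunk_mask = [], []
    for start in range(0, input_ids.shape[0], size):
        ids = input_ids[start:start + size]
        mask = attention_mask[start:start + size]
        pad_len = size - ids.shape[0]
        if pad_len > 0:
            ids = F.pad(ids, (0, pad_len), value=pad_token_id)
            mask = F.pad(mask, (0, pad_len), value=0)
        chunk_ids.append(ids)
        chunk_mask.append(mask)

    encoded_input["input_ids"] = torch.stack(chunk_ids)
    encoded_input["attention_mask"] = torch.stack(chunk_mask)
    return encoded_input
```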
With a simple encoded_input of 512 tokens, the model takes around 230 ms to compute the embedding; with an input of shape (2, 512) it takes about 2000 ms, and the latency grows much faster than linearly as the number of chunks increases. Is there any way I can achieve low latency with this model for long documents?
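For reference, the latencies are measured with a loop along these lines (`time_forward` is an illustrative helper, not part of the actual code), synchronizing the GPU so the timing reflects the full forward pass:

```python
import time
import torch

def time_forward(model, encoded_input, n_runs=10):
    # Average forward-pass latency in milliseconds.
    with torch.no_grad():
        model(**encoded_input)  # warm-up run
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(n_runs):
            model(**encoded_input)
        if torch.cuda.is_available():
            torch.cuda.synchronize()
    return (time.perf_counter() - start) * 1000 / n_runs
```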