New tokens generated with FP16 inference are only exclamation marks "!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!"

#89
by rasyosef - opened

The model was working fine until a couple of hours ago; then it started generating a bunch of "!!!!!!!!!!!!!!!!!!!!!" no matter the input text. To my knowledge, this issue is only present with FP16 inference, but even the sample code in your model card reproduces the problem, since torch_dtype="auto" defaults to torch.float16.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

torch.set_default_device("cuda")

model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2", torch_dtype="auto", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", trust_remote_code=True)

inputs = tokenizer('''def print_prime(n):
   """
   Print all primes between 1 and n
   """''', return_tensors="pt", return_attention_mask=False)

outputs = model.generate(**inputs, max_length=200)
text = tokenizer.batch_decode(outputs)[0]
print(text)

Output:

def print_prime(n):
   """
   Print all primes between 1 and n
   """!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Here's another example:

inputs = tokenizer(
    '''Write a detailed analogy between mathematics and a lighthouse.\n''', 
    return_tensors="pt", 
    return_attention_mask=False
)

outputs = model.generate(**inputs, max_length=200)
text = tokenizer.batch_decode(outputs)[0]
print(text)

Output:

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

All of the newly generated tokens are just a bunch of "!!!!!!!!!!!!!!!!!!!!!!!..."

Same here. When using it via a pipeline for text completion, it still answers, btw.
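For reference, here's a rough sketch of the pipeline usage I mean (assuming a single CUDA device; the pipeline presumably avoids the issue because the weights load in FP32 unless an explicit torch_dtype is passed):

from transformers import pipeline

# Sketch of the text-completion path mentioned above; with no explicit
# torch_dtype the weights load in FP32, so the FP16 overflow never happens.
pipe = pipeline(
    "text-generation",
    model="microsoft/phi-2",
    trust_remote_code=True,
    device=0,  # assumes a single CUDA device
)

result = pipe(
    "Write a detailed analogy between mathematics and a lighthouse.\n",
    max_length=200,
)
print(result[0]["generated_text"])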

Same here. Does anyone have sample code where it doesn't print out only "!"?

Thanks!

@eschmitt88, if you already have accelerate installed, you only need to change torch_dtype="auto" to device_map="auto" when loading the model, like so:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

torch.set_default_device("cuda")

# changed torch_dtype="auto" to device_map="auto" in the following line
model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2", device_map="auto", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", trust_remote_code=True)

inputs = tokenizer('''def print_prime(n):
   """
   Print all primes between 1 and n
   """''', return_tensors="pt", return_attention_mask=False)

outputs = model.generate(**inputs, max_length=200)
text = tokenizer.batch_decode(outputs)[0]
print(text)

@rasyosef thank you!

Same problem here. Can someone explain the origin of this issue a bit?

I think it was a patch to prevent an "attention overflow issue (with FP16)" that requires autocast to be disabled, as per this change record.


Microsoft org

Could you please re-try with the latest commit?

Unfortunately, for Phi-2 to work across all use cases, we need to upcast the queries and keys to FP32 and disable autocast in the attention's forward pass.
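In other words, something along these lines (a simplified sketch for illustration only, not the actual modeling_phi.py code; the function name is made up):

import math
import torch

def attention_fp32_softmax(query, key, value):
    # Illustrative sketch: disable autocast and upcast queries/keys to FP32 so
    # the attention scores and softmax cannot overflow in FP16.
    with torch.autocast(device_type=query.device.type, enabled=False):
        scale = 1.0 / math.sqrt(query.size(-1))
        scores = torch.matmul(query.float(), key.float().transpose(-1, -2)) * scale
        probs = torch.softmax(scores, dim=-1)
    # Cast the probabilities back before multiplying with the (FP16) values.
    return torch.matmul(probs.to(value.dtype), value)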

@gugarosa do you think it is necessary to update the readme as well, mainly to prevent people unaware of the new behaviour from running into issues, and to adjust the provided sample code (if needed)?

Microsoft org

I don't think we need to update the readme.

The goal is to ensure that the model works with any use case (as it was working prior to the integration with transformers' source code).

@gugarosa

I thought we would have to change the torch_dtype="auto" argument to device_map="auto" in the model definition line, as per the @rasyosef post above. In fact, when I tried that yesterday, it solved the "!!!!!!" response issue for me. In that case, the readme sample code would indeed be outdated.

However, I tested again today with the new modeling_phi.py, and that is no longer the case. The readme sample code, with torch_dtype="auto", is working fine again now.

@gugarosa FP16 inference is functioning correctly now, including the sample code from the model card. Closing this issue.

rasyosef changed discussion status to closed
Microsoft org
edited Jan 18

Removing torch_dtype="auto" loads the model's weights in FP32, which does not produce an overflow.
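For anyone who wants to confirm the loaded precision, a quick sketch (the output comments are what I'd expect, not verified logs):

from transformers import AutoModelForCausalLM

# Without torch_dtype the weights stay in FP32 ...
model_fp32 = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2", device_map="auto", trust_remote_code=True
)
print(model_fp32.dtype)  # expected: torch.float32

# ... while torch_dtype="auto" follows the checkpoint config (FP16 for Phi-2).
model_fp16 = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2", torch_dtype="auto", device_map="auto", trust_remote_code=True
)
print(model_fp16.dtype)  # expected: torch.float16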
