transformers text gen: probability tensor contains either `inf`, `nan` or element < 0, or gibberish output
Trying to run the model results in the following error:
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
Example Code:
import torch
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
print(transformers.__version__)
# Load the 8-bit GPTQ quant onto the first GPU
model = AutoModelForCausalLM.from_pretrained("astronomer/Llama-3-8B-Instruct-GPTQ-8-Bit", device_map="cuda:0")
tokenizer = AutoTokenizer.from_pretrained("astronomer/Llama-3-8B-Instruct-GPTQ-8-Bit")
# Sampling through the text-generation pipeline raises the RuntimeError
llama_pipe = pipeline('text-generation', model=model, tokenizer=tokenizer)
llama_pipe("What is the meaning of life?", max_length=100, do_sample=True, temperature=0.7)
Output:
4.40.1
c:\Users\User\miniconda3\envs\Lucid\Lib\site-packages\transformers\modeling_utils.py:4371: FutureWarning: `_is_quantized_training_enabled` is going to be deprecated in transformers 4.39.0. Please use `model.hf_quantizer.is_trainable` instead
warnings.warn(
The cos_cached attribute will be removed in 4.39. Bear in mind that its contents changed in v4.38. Use the forward method of RoPE from now on instead. It is not used in the `LlamaAttention` class
The sin_cached attribute will be removed in 4.39. Bear in mind that its contents changed in v4.38. Use the forward method of RoPE from now on instead. It is not used in the `LlamaAttention` class
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
c:\Users\User\miniconda3\envs\Lucid\Lib\site-packages\transformers\models\llama\modeling_llama.py:671: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at ..\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:263.)
attn_output = torch.nn.functional.scaled_dot_product_attention(
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
Cell In[1], line 9
7 tokenizer = AutoTokenizer.from_pretrained("astronomer/Llama-3-8B-Instruct-GPTQ-8-Bit")
8 llama_pipe = pipeline('text-generation', model=model, tokenizer=tokenizer)
----> 9 llama_pipe("What is the meaning of life?", max_length=100, do_sample=True, temperature=0.7)
File c:\Users\User\miniconda3\envs\Lucid\Lib\site-packages\transformers\pipelines\text_generation.py:240, in TextGenerationPipeline.__call__(self, text_inputs, **kwargs)
238 return super().__call__(chats, **kwargs)
239 else:
--> 240 return super().__call__(text_inputs, **kwargs)
File c:\Users\User\miniconda3\envs\Lucid\Lib\site-packages\transformers\pipelines\base.py:1242, in Pipeline.__call__(self, inputs, num_workers, batch_size, *args, **kwargs)
1234 return next(
1235 iter(
1236 self.get_iterator(
(...)
1239 )
1240 )
1241 else:
-> 1242 return self.run_single(inputs, preprocess_params, forward_params, postprocess_params)
File c:\Users\User\miniconda3\envs\Lucid\Lib\site-packages\transformers\pipelines\base.py:1249, in Pipeline.run_single(self, inputs, preprocess_params, forward_params, postprocess_params)
1247 def run_single(self, inputs, preprocess_params, forward_params, postprocess_params):
1248 model_inputs = self.preprocess(inputs, **preprocess_params)
-> 1249 model_outputs = self.forward(model_inputs, **forward_params)
1250 outputs = self.postprocess(model_outputs, **postprocess_params)
1251 return outputs
File c:\Users\User\miniconda3\envs\Lucid\Lib\site-packages\transformers\pipelines\base.py:1149, in Pipeline.forward(self, model_inputs, **forward_params)
1147 with inference_context():
1148 model_inputs = self._ensure_tensor_on_device(model_inputs, device=self.device)
-> 1149 model_outputs = self._forward(model_inputs, **forward_params)
1150 model_outputs = self._ensure_tensor_on_device(model_outputs, device=torch.device("cpu"))
1151 else:
File c:\Users\User\miniconda3\envs\Lucid\Lib\site-packages\transformers\pipelines\text_generation.py:327, in TextGenerationPipeline._forward(self, model_inputs, **generate_kwargs)
324 generate_kwargs["min_length"] += prefix_length
326 # BS x SL
--> 327 generated_sequence = self.model.generate(input_ids=input_ids, attention_mask=attention_mask, **generate_kwargs)
328 out_b = generated_sequence.shape[0]
329 if self.framework == "pt":
File c:\Users\User\miniconda3\envs\Lucid\Lib\site-packages\torch\utils\_contextlib.py:115, in context_decorator.<locals>.decorate_context(*args, **kwargs)
112 @functools.wraps(func)
113 def decorate_context(*args, **kwargs):
114 with ctx_factory():
--> 115 return func(*args, **kwargs)
File c:\Users\User\miniconda3\envs\Lucid\Lib\site-packages\transformers\generation\utils.py:1622, in GenerationMixin.generate(self, inputs, generation_config, logits_processor, stopping_criteria, prefix_allowed_tokens_fn, synced_gpus, assistant_model, streamer, negative_prompt_ids, negative_prompt_attention_mask, **kwargs)
1614 input_ids, model_kwargs = self._expand_inputs_for_generation(
1615 input_ids=input_ids,
1616 expand_size=generation_config.num_return_sequences,
1617 is_encoder_decoder=self.config.is_encoder_decoder,
1618 **model_kwargs,
1619 )
1621 # 13. run sample
-> 1622 result = self._sample(
1623 input_ids,
1624 logits_processor=prepared_logits_processor,
1625 logits_warper=logits_warper,
1626 stopping_criteria=prepared_stopping_criteria,
1627 pad_token_id=generation_config.pad_token_id,
1628 output_scores=generation_config.output_scores,
1629 output_logits=generation_config.output_logits,
1630 return_dict_in_generate=generation_config.return_dict_in_generate,
1631 synced_gpus=synced_gpus,
1632 streamer=streamer,
1633 **model_kwargs,
1634 )
1636 elif generation_mode == GenerationMode.BEAM_SEARCH:
1637 # 11. prepare beam search scorer
1638 beam_scorer = BeamSearchScorer(
1639 batch_size=batch_size,
1640 num_beams=generation_config.num_beams,
(...)
1645 max_length=generation_config.max_length,
1646 )
File c:\Users\User\miniconda3\envs\Lucid\Lib\site-packages\transformers\generation\utils.py:2829, in GenerationMixin._sample(self, input_ids, logits_processor, stopping_criteria, logits_warper, max_length, pad_token_id, eos_token_id, output_attentions, output_hidden_states, output_scores, output_logits, return_dict_in_generate, synced_gpus, streamer, **model_kwargs)
2827 # sample
2828 probs = nn.functional.softmax(next_token_scores, dim=-1)
-> 2829 next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
2831 # finished sentences should have their next token be a padding token
2832 if eos_token_id is not None:
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
Environment:
Windows 11
Python 3.11.5
Transformers 4.40.1
Thanks for raising the error. I am looking into this.
So far I have been able to reproduce the error on a Linux server. I can get around it by loading the model in bf16, or by turning off do_sample so it uses greedy decoding instead; however, I then run into a different problem with the model spitting out gibberish...
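A minimal sketch of those two workarounds, using the same checkpoint as your repro (the torch_dtype argument is how I am interpreting "loading it in bf16"; adjust to your setup):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_id = "astronomer/Llama-3-8B-Instruct-GPTQ-8-Bit"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Option A: load with bfloat16 compute dtype instead of the default fp16
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="cuda:0", torch_dtype=torch.bfloat16
)

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

# Option B: disable sampling so decoding is greedy and no multinomial draw is
# made from the bad probability tensor; either option alone avoided the error
pipe("What is the meaning of life?", max_new_tokens=100, do_sample=False)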
This is weird, as a piece of code very similar to yours, using the text-generation pipeline, was tested when this quant was made. I also just remade the quant, and the problem persists.
But the good news is that I tested this model again serving on vLLM (any vLLM-like engine such as Aphrodite should behave the same), and it works fine; see the screenshot below. My suspicion is that something is off with how transformers integrates with the AutoGPTQ library, or with the decoding strategies used.
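For reference, roughly what I ran with vLLM, as a sketch (the quantization argument may be redundant, since vLLM can usually pick up GPTQ from the model config):

from vllm import LLM, SamplingParams

# Offline inference; the OpenAI-compatible API server should behave the same
llm = LLM(model="astronomer/Llama-3-8B-Instruct-GPTQ-8-Bit", quantization="gptq")
params = SamplingParams(temperature=0.7, max_tokens=100)
outputs = llm.generate(["What is the meaning of life?"], params)
print(outputs[0].outputs[0].text)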
I will keep looking into this and get back to you. In the meantime, I recommend serving this model directly with vLLM. If you want to fine-tune something, it may be better to fine-tune the full-precision model directly, or alternatively use QLoRA and load in 8-bit or 4-bit to save VRAM, then re-quantize afterwards.
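If you go the QLoRA route, a minimal sketch of what I mean (the full-precision base repo id and the LoRA hyperparameters below are illustrative placeholders, not a recipe):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Quantize the full-precision base to 4-bit NF4 on the fly with bitsandbytes
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",  # full-precision base, not the GPTQ quant
    quantization_config=bnb_config,
    device_map="auto",
)

# Attach LoRA adapters; after training, merge and re-quantize to GPTQ if needed
lora_config = LoraConfig(
    r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)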
This is still under investigation. The 4-bit model seems to work fine (https://huggingface.co/astronomer/Llama-3-8B-Instruct-GPTQ-4-Bit). Let me get back to you once I have more definitive answers, but at the moment, serving this model in production with vLLM is tested to be OK. The issue is specific to the 8-bit quant and the Hugging Face transformers library.
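If you want to drop the 4-bit quant into your original repro, it should just be a matter of swapping the repo id, e.g. something like:

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Same pipeline call as the original repro, pointed at the 4-bit quant
model_id = "astronomer/Llama-3-8B-Instruct-GPTQ-4-Bit"
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="cuda:0")
tokenizer = AutoTokenizer.from_pretrained(model_id)
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
pipe("What is the meaning of life?", max_new_tokens=100, do_sample=True, temperature=0.7)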
Understood! Thank you for taking the time to respond! I will simply use the 4-bit model in the meantime.
Oddly enough, the model works (albeit it feels a lot slower than it probably should be) when loading it with transformers through the oobabooga web UI? When loading it with AutoGPTQ through the same web UI, it reports some sort of mismatch in the expected sizes of some layers? (These observations were made last night and I forgot to comment them here; I will double-check the results when I get back tonight.)
Oobabooga would be faster if it could run the model on ExLlamaV2 and/or with injected fused attention (which has been broken for GPTQ quants since the Llama 2 release; the issue is universal for models that use GQA together with fused attention). That is why inference is slower when loading it in oobabooga. I think more people use the 4-bit quant (which works with ExLlamaV2) in oobabooga, and it may be slightly better.
At the end of the day, there are just issues with GPTQ 8-bit quants in more hobbyist frameworks like oobabooga, since 8-bit tends to show up more in mass-served production LLMs than in locally run ones... support on this front has been lagging behind. However, I think people are working on integrating the Marlin kernel, and things should get a lot better for GPTQ soon.