transformers text gen: probability tensor contains either `inf`, `nan` or element < 0, or gibberish output
Trying to run the model results in the following error:
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
Example Code:
import torch
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
print(transformers.__version__)
# Load the 8-bit GPTQ quant onto the first GPU
model = AutoModelForCausalLM.from_pretrained("astronomer/Llama-3-8B-Instruct-GPTQ-8-Bit", device_map="cuda:0")
tokenizer = AutoTokenizer.from_pretrained("astronomer/Llama-3-8B-Instruct-GPTQ-8-Bit")
# Sampling through the text-generation pipeline raises the RuntimeError
llama_pipe = pipeline('text-generation', model=model, tokenizer=tokenizer)
llama_pipe("What is the meaning of life?", max_length=100, do_sample=True, temperature=0.7)
Output:
4.40.1
c:\Users\User\miniconda3\envs\Lucid\Lib\site-packages\transformers\modeling_utils.py:4371: FutureWarning: `_is_quantized_training_enabled` is going to be deprecated in transformers 4.39.0. Please use `model.hf_quantizer.is_trainable` instead
warnings.warn(
The cos_cached attribute will be removed in 4.39. Bear in mind that its contents changed in v4.38. Use the forward method of RoPE from now on instead. It is not used in the `LlamaAttention` class
The sin_cached attribute will be removed in 4.39. Bear in mind that its contents changed in v4.38. Use the forward method of RoPE from now on instead. It is not used in the `LlamaAttention` class
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
c:\Users\User\miniconda3\envs\Lucid\Lib\site-packages\transformers\models\llama\modeling_llama.py:671: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at ..\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:263.)
attn_output = torch.nn.functional.scaled_dot_product_attention(
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
Cell In[1], line 9
7 tokenizer = AutoTokenizer.from_pretrained("astronomer/Llama-3-8B-Instruct-GPTQ-8-Bit")
8 llama_pipe = pipeline('text-generation', model=model, tokenizer=tokenizer)
----> 9 llama_pipe("What is the meaning of life?", max_length=100, do_sample=True, temperature=0.7)
File c:\Users\User\miniconda3\envs\Lucid\Lib\site-packages\transformers\pipelines\text_generation.py:240, in TextGenerationPipeline.__call__(self, text_inputs, **kwargs)
238 return super().__call__(chats, **kwargs)
239 else:
--> 240 return super().__call__(text_inputs, **kwargs)
File c:\Users\User\miniconda3\envs\Lucid\Lib\site-packages\transformers\pipelines\base.py:1242, in Pipeline.__call__(self, inputs, num_workers, batch_size, *args, **kwargs)
1234 return next(
1235 iter(
1236 self.get_iterator(
(...)
1239 )
1240 )
1241 else:
-> 1242 return self.run_single(inputs, preprocess_params, forward_params, postprocess_params)
File c:\Users\User\miniconda3\envs\Lucid\Lib\site-packages\transformers\pipelines\base.py:1249, in Pipeline.run_single(self, inputs, preprocess_params, forward_params, postprocess_params)
1247 def run_single(self, inputs, preprocess_params, forward_params, postprocess_params):
1248 model_inputs = self.preprocess(inputs, **preprocess_params)
-> 1249 model_outputs = self.forward(model_inputs, **forward_params)
1250 outputs = self.postprocess(model_outputs, **postprocess_params)
1251 return outputs
File c:\Users\User\miniconda3\envs\Lucid\Lib\site-packages\transformers\pipelines\base.py:1149, in Pipeline.forward(self, model_inputs, **forward_params)
1147 with inference_context():
1148 model_inputs = self._ensure_tensor_on_device(model_inputs, device=self.device)
-> 1149 model_outputs = self._forward(model_inputs, **forward_params)
1150 model_outputs = self._ensure_tensor_on_device(model_outputs, device=torch.device("cpu"))
1151 else:
File c:\Users\User\miniconda3\envs\Lucid\Lib\site-packages\transformers\pipelines\text_generation.py:327, in TextGenerationPipeline._forward(self, model_inputs, **generate_kwargs)
324 generate_kwargs["min_length"] += prefix_length
326 # BS x SL
--> 327 generated_sequence = self.model.generate(input_ids=input_ids, attention_mask=attention_mask, **generate_kwargs)
328 out_b = generated_sequence.shape[0]
329 if self.framework == "pt":
File c:\Users\User\miniconda3\envs\Lucid\Lib\site-packages\torch\utils\_contextlib.py:115, in context_decorator.<locals>.decorate_context(*args, **kwargs)
112 @functools.wraps(func)
113 def decorate_context(*args, **kwargs):
114 with ctx_factory():
--> 115 return func(*args, **kwargs)
File c:\Users\User\miniconda3\envs\Lucid\Lib\site-packages\transformers\generation\utils.py:1622, in GenerationMixin.generate(self, inputs, generation_config, logits_processor, stopping_criteria, prefix_allowed_tokens_fn, synced_gpus, assistant_model, streamer, negative_prompt_ids, negative_prompt_attention_mask, **kwargs)
1614 input_ids, model_kwargs = self._expand_inputs_for_generation(
1615 input_ids=input_ids,
1616 expand_size=generation_config.num_return_sequences,
1617 is_encoder_decoder=self.config.is_encoder_decoder,
1618 **model_kwargs,
1619 )
1621 # 13. run sample
-> 1622 result = self._sample(
1623 input_ids,
1624 logits_processor=prepared_logits_processor,
1625 logits_warper=logits_warper,
1626 stopping_criteria=prepared_stopping_criteria,
1627 pad_token_id=generation_config.pad_token_id,
1628 output_scores=generation_config.output_scores,
1629 output_logits=generation_config.output_logits,
1630 return_dict_in_generate=generation_config.return_dict_in_generate,
1631 synced_gpus=synced_gpus,
1632 streamer=streamer,
1633 **model_kwargs,
1634 )
1636 elif generation_mode == GenerationMode.BEAM_SEARCH:
1637 # 11. prepare beam search scorer
1638 beam_scorer = BeamSearchScorer(
1639 batch_size=batch_size,
1640 num_beams=generation_config.num_beams,
(...)
1645 max_length=generation_config.max_length,
1646 )
File c:\Users\User\miniconda3\envs\Lucid\Lib\site-packages\transformers\generation\utils.py:2829, in GenerationMixin._sample(self, input_ids, logits_processor, stopping_criteria, logits_warper, max_length, pad_token_id, eos_token_id, output_attentions, output_hidden_states, output_scores, output_logits, return_dict_in_generate, synced_gpus, streamer, **model_kwargs)
2827 # sample
2828 probs = nn.functional.softmax(next_token_scores, dim=-1)
-> 2829 next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
2831 # finished sentences should have their next token be a padding token
2832 if eos_token_id is not None:
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
Environment:
Windows 11
Python 3.11.5
Transformers 4.40.1
Thanks for raising the error. I am looking into this.
So far I have been able to reproduce the error on a Linux server. I can get around it by loading the model in bf16, or by turning off do_sample so it uses greedy decoding instead; however, I then run into a different problem with the model spitting out gibberish...
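A minimal sketch of those two workarounds, using the same checkpoint as your repro (the torch_dtype argument is how I am interpreting "loading it in bf16"; adjust to your setup):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_id = "astronomer/Llama-3-8B-Instruct-GPTQ-8-Bit"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Option A: load with bfloat16 compute dtype instead of the default fp16
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="cuda:0", torch_dtype=torch.bfloat16
)

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

# Option B: disable sampling so decoding is greedy and no multinomial draw is
# made from the bad probability tensor; either option alone avoided the error
pipe("What is the meaning of life?", max_new_tokens=100, do_sample=False)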
This is weird, as a piece of code very similar to yours, using the text-generation pipeline, was tested when this quant was made. I also just remade the quant, and the problem persists.
But the good news is that I tested this model again serving on vLLM (any vLLM-like engine such as Aphrodite should behave the same), and it works fine; see the screenshot below. My suspicion is that something is off with how transformers integrates with the AutoGPTQ library, or with the decoding strategies used.
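For reference, roughly what I ran with vLLM, as a sketch (the quantization argument may be redundant, since vLLM can usually pick up GPTQ from the model config):

from vllm import LLM, SamplingParams

# Offline inference; the OpenAI-compatible API server should behave the same
llm = LLM(model="astronomer/Llama-3-8B-Instruct-GPTQ-8-Bit", quantization="gptq")
params = SamplingParams(temperature=0.7, max_tokens=100)
outputs = llm.generate(["What is the meaning of life?"], params)
print(outputs[0].outputs[0].text)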
I will keep looking into this and get back to you. In the meantime, I recommend serving this model directly with vLLM. If you want to fine-tune something, it may be better to fine-tune the full-precision model directly, or alternatively use QLoRA and load in 8-bit or 4-bit to save VRAM, then re-quantize afterwards.
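If you go the QLoRA route, a minimal sketch of what I mean (the full-precision base repo id and the LoRA hyperparameters below are illustrative placeholders, not a recipe):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Quantize the full-precision base to 4-bit NF4 on the fly with bitsandbytes
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",  # full-precision base, not the GPTQ quant
    quantization_config=bnb_config,
    device_map="auto",
)

# Attach LoRA adapters; after training, merge and re-quantize to GPTQ if needed
lora_config = LoraConfig(
    r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)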
This is still under investigation. The 4-bit model seems to work fine (https://huggingface.co/astronomer/Llama-3-8B-Instruct-GPTQ-4-Bit). Let me get back to you once I have more definitive answers, but at the moment, serving this model in production with vLLM is tested to be OK. The issue is specific to the 8-bit quant and the Hugging Face transformers library.
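If you want to drop the 4-bit quant into your original repro, it should just be a matter of swapping the repo id, e.g. something like:

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Same pipeline call as the original repro, pointed at the 4-bit quant
model_id = "astronomer/Llama-3-8B-Instruct-GPTQ-4-Bit"
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="cuda:0")
tokenizer = AutoTokenizer.from_pretrained(model_id)
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
pipe("What is the meaning of life?", max_new_tokens=100, do_sample=True, temperature=0.7)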
Understood! Thank you for taking the time to respond! I will simply use the 4-bit model in the meantime.
Oddly enough, the model works (albeit it feels a lot slower than it probably should be) when loading it with transformers through the oobabooga web UI? When loading it with AutoGPTQ through the same web UI, it reports some sort of mismatch in the expected sizes of some layers? (These observations were made last night and I forgot to comment them here; I will double-check the results when I get back tonight.)
Oobabooga would be faster if it could run the model on ExLlamaV2 and/or with injected fused attention (which has been broken for GPTQ quants since the Llama 2 release; the issue is universal for models that use GQA together with fused attention). That is why inference is slower when loading it in oobabooga. I think more people use the 4-bit quant (which works with ExLlamaV2) in oobabooga, and it may be slightly better.
At the end of the day, there are just issues with GPTQ 8-bit quants in more hobbyist frameworks like oobabooga, since 8-bit tends to show up more in mass-served production LLMs than in locally run ones... support on this front has been lagging behind. However, I think people are working on integrating the Marlin kernel, and things should get a lot better for GPTQ soon.