Nvidia H100 Finetuning Error on BitsandBytes
#82
by ashmitbhattarai - opened
I am trying to fine-tune the model on an H100 80GB GPU. The same code runs on an A100 40GB. It is a 4-bit quantized model:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import prepare_model_for_kbit_training

# 4-bit NF4 quantization config with bfloat16 compute dtype
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=False,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quant_config,
    device_map="cuda:0",
    trust_remote_code=True,
    # max_seq_len=8192
    # use_safetensors=True
)

tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    trust_remote_code=True,
    use_fast=True,
)

# Prepare the model for k-bit training
base_model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(base_model)

....
trainer...
trainer.train()
```
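The elided part between `prepare_model_for_kbit_training` and `trainer.train()` follows the usual QLoRA notebook flow. A hypothetical sketch of that setup, for context only; the LoRA hyperparameters, `target_modules`, and `train_dataset` below are placeholders, not my actual values:

```python
from peft import LoraConfig, get_peft_model
from transformers import Trainer, TrainingArguments

# Hypothetical LoRA config; r, alpha, and target_modules are placeholders.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # placeholder module names
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

training_args = TrainingArguments(
    output_dir="./outputs",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    bf16=True,                 # matches bnb_4bit_compute_dtype above
    num_train_epochs=1,
    logging_steps=10,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,  # placeholder dataset
)
trainer.train()
```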
When training starts, I get `CUDA error: an illegal instruction was encountered`. Again, the same code runs fine on the A100, just not on the H100. FYI: inference with the base model works fine; only the fine-tuning run fails.
I am using the notebook as a reference.
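In case it helps narrow this down: the A100 is compute capability sm_80 (Ampere) while the H100 is sm_90 (Hopper), and older bitsandbytes builds reportedly did not ship sm_90 kernels, which can show up as illegal-instruction errors only on Hopper cards. A minimal diagnostic sketch, assuming only that PyTorch and bitsandbytes are importable (the upgrade suggestion is an assumption, not a confirmed fix in this thread):

```python
import torch
import bitsandbytes as bnb

# Report the GPU architecture: A100 -> (8, 0), H100 -> (9, 0)
major, minor = torch.cuda.get_device_capability(0)
print(f"GPU: {torch.cuda.get_device_name(0)}, compute capability sm_{major}{minor}")

# Report the installed bitsandbytes version; if it predates Hopper support,
# upgrading (e.g. `pip install -U bitsandbytes`) is a reasonable first step.
print(f"bitsandbytes version: {bnb.__version__}")

# Optional: re-run the failing step with synchronous kernel launches so the
# illegal-instruction error points at the offending op instead of a later,
# unrelated CUDA call:
#   CUDA_LAUNCH_BLOCKING=1 python train.py
```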
I have the same issue...
I have the same issue as well. How do I solve this?