OOM when quantizing for 32k context length

#3
by harshilp - opened

Hey @TheBloke !

I was wondering how you quantized the model using a 32k sequence length. I have a Llama 7B model that I extended to a 32k context length using rope scaling and then trained on private data. When I tried to quantize it to 4 bits on a calibration dataset with 32k sequence length, I ran into OOM errors on an 80GB A100. I have more GPUs available, but I'm not sure how to use them, since quantization seems to only use one GPU. I'm using the auto-gptq integration with Hugging Face transformers for quantization.
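For reference, here is a minimal sketch of the setup I mean (not TheBloke's actual recipe): 4-bit GPTQ quantization through the transformers/auto-gptq integration, with `device_map="auto"` and `max_memory` to spread the fp16 weights over several GPUs. The model path, memory limits, and calibration settings below are assumptions, and I'm not certain multi-GPU placement alone avoids the OOM, since the calibration forward passes at 32k tokens are what seem to blow up memory.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "path/to/your-llama-7b-32k"  # hypothetical local path to the rope-scaled model

tokenizer = AutoTokenizer.from_pretrained(model_id)

gptq_config = GPTQConfig(
    bits=4,
    dataset="c4",          # or a list of your own calibration strings
    tokenizer=tokenizer,
    group_size=128,
    desc_act=True,
    model_seqlen=32768,    # calibration sequence length; lowering this (e.g. to 4096)
                           # sharply reduces activation memory during quantization
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=gptq_config,
    torch_dtype=torch.float16,
    device_map="auto",                                     # shard weights across all visible GPUs
    max_memory={0: "70GiB", 1: "70GiB", "cpu": "200GiB"},  # leave headroom for calibration activations
)

model.save_pretrained("llama-7b-32k-gptq-4bit")
tokenizer.save_pretrained("llama-7b-32k-gptq-4bit")
```

My understanding (please correct me if wrong) is that the rope-scaled 32k context is a property of the model config, so the calibration sequence length doesn't have to match it, and a shorter `model_seqlen` or fewer calibration samples is the first thing to try before adding GPUs.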
