This is an HQQ 4-bit quantized Llama2-7B-chat model without grouping, combined with a low-rank adapter to improve performance (referred to as HQQ+).

Running quantized models efficiently for inference requires fused matrix-vector multiplication kernels, and the kernels currently available place constraints on the choice of the group-size and the axis along which quantization is performed. This model doesn't use grouping, which makes it compatible with the fast Marlin inference kernel and, more broadly, with all kernels that operate along axis=1.
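For reference, below is a minimal sketch of the kernel-friendly quantization settings this corresponds to in HQQ (the same configuration appears in the usage example further down; group_size=None disables grouping and axis=1 selects the quantization axis these kernels expect):

from hqq.core.quantize import BaseQuantizeConfig
#4-bit quantization, no grouping (group_size=None), quantized along axis=1
#so the weights remain compatible with Marlin and other axis=1 kernels
quant_config = BaseQuantizeConfig(nbits=4, group_size=None, axis=1)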
Performance
Models | Llama2-7B-chat (fp16) | Llama2-7B-chat (HQQ+ 4-bit/no-gs) |
---|---|---|
ARC (25-shot) | 53.67 | 48.46 |
HellaSwag (10-shot) | 78.56 | 73.33 |
MMLU (5-shot) | 48.16 | 44.87 |
TruthfulQA-MC2 | 45.32 | 43.27 |
Winogrande (5-shot) | 72.53 | 71.67 |
GSM8K (5-shot) | 23.12 | 27.82 |
Average | 53.56 | 51.57 |
Usage
First, install the latest version of HQQ:
pip install git+https://github.com/mobiusml/hqq.git
pip install git+https://github.com/IST-DASLab/marlin.git #to use the marlin backend
Also make sure to use transformers version 4.39.0:
pip install transformers==4.39.0
Then you can use the sample code below:
import torch, os
os.environ["TOKENIZERS_PARALLELISM"] = "1"
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer
from hqq.core.quantize import *
from hqq.utils.patching import *
#Load the model
model_id = 'mobiuslabsgmbh/Llama-2-7b-chat-hf_4bitnogs_hqq'
model = HQQModelForCausalLM.from_quantized(model_id, cache_dir='.', compute_dtype=torch.float16, adapter='adapter_v0.1.lora')
tokenizer = AutoTokenizer.from_pretrained(model_id)
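#Set the quantization configuration used by the linear layers (4-bit, no grouping, axis=1)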
patch_linearlayers(model, patch_add_quant_config,
BaseQuantizeConfig(nbits=4, group_size=None, quant_scale=False, quant_zero=False, axis=1))
HQQLinear.set_backend(HQQBackend.PYTORCH)
model.eval();
#Use optimized inference kernels
from hqq.utils.patching import prepare_for_inference
#prepare_for_inference(model) #default
#prepare_for_inference(model, backend="torchao_int4") #use bfloat16
prepare_for_inference(model, backend="marlin", allow_merge=True) #use float16
#Generate
from hqq.utils.generation_hf import HFGenerator
#For longer context, make sure to allocate enough cache via the cache_size= parameter
gen = HFGenerator(model, tokenizer, max_new_tokens=1000, do_sample=True, compile="partial")
gen.generate("Write an essay about large language models", print_tokens=True)
gen.generate("Tell me a funny joke!", print_tokens=True)
gen.generate("How to make a yummy chocolate cake?", print_tokens=True)