---
license: llama2
train: false
inference: false
pipeline_tag: text-generation
---

This is an experimental HQQ 1-bit quantized (binary weights) Llama2-7B-chat model that uses a LoRA adapter to improve the performance (referred to as HQQ+). Quantizing small models at such extreme low bit-widths is a challenging task. The purpose of this model is to show the community what to expect when fine-tuning such models.

## Datasets
The adapter was trained via SFT on random subsets of the following datasets:

### Base Model
* wikitext-2-raw-v1 (full)

### Chat Model
* timdettmers/openassistant-guanaco (full)
* microsoft/orca-math-word-problems-200k (25K)
* meta-math/MetaMathQA (25K)
* HuggingFaceH4/ultrafeedback_binarized (25K - chosen answers only)

## Performance
| Models             | Llama2-7B (fp16) | Llama2-7B (HQQ 1-bit) | Llama2-7B (HQQ+ 1-bit) | Quip# (2-bit) |
|--------------------|------------------|-----------------------|------------------------|---------------|
| Wiki Perplexity    | 5.18             | 9866                  | 8.53                   | 8.54          |
| VRAM (GB)          | 13.5             | 1.76                  | 1.85                   | 2.72          |
| Forward time (sec) | 0.1              | 0.231                 | 0.257                  | 0.353         |

| Models              | Llama2-7B-chat (fp16) | Llama2-7B-chat (HQQ 1-bit) | Llama2-7B-chat (HQQ+ 1-bit) |
|---------------------|-----------------------|----------------------------|-----------------------------|
| ARC (25-shot)       | 53.67                 | 21.59                      | 31.14                       |
| HellaSwag (10-shot) | 78.56                 | 25.66                      | 52.96                       |
| MMLU (5-shot)       | 48.16                 |                            | 26.54                       |
| TruthfulQA-MC2      | 45.32                 | 47.81                      | 43.16                       |
| Winogrande (5-shot) | 72.53                 | 49.72                      | 60.54                       |
| GSM8K (5-shot)      | 23.12                 |                            | 11                          |
| Average             | 53.56                 |                            | 37.56                       |

## Usage
First, install the latest version of HQQ:
```
pip install git+https://github.com/mobiusml/hqq.git
```

Then you can use the sample code below:
``` Python
import torch
import transformers
from threading import Thread
from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer

#Load the quantized model and its LoRA adapter
model_id  = 'mobiuslabsgmbh/Llama-2-7b-chat-hf_1bitgs8_hqq'
model     = HQQModelForCausalLM.from_quantized(model_id, adapter='adapter_v0.1.lora')
tokenizer = AutoTokenizer.from_pretrained(model_id)
device    = 'cuda' #device on which the quantized model is loaded

#Setup Inference Mode
tokenizer.add_bos_token = False
tokenizer.add_eos_token = False
if not tokenizer.pad_token:
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})
model.config.use_cache = True
model.eval();

# Optional: torch compile for faster inference
# model = torch.compile(model)

#Streaming Inference
def chat_processor(chat, max_new_tokens=100, do_sample=True):
    tokenizer.use_default_system_prompt = False
    streamer = transformers.TextIteratorStreamer(tokenizer, timeout=10.0, skip_prompt=True, skip_special_tokens=True)

    generate_params = dict(
        tokenizer(" [INST] " + chat + " [/INST] ", return_tensors="pt").to(device),
        streamer=streamer,
        max_new_tokens=max_new_tokens,
        do_sample=do_sample,
        pad_token_id=tokenizer.pad_token_id,
        top_p=0.90 if do_sample else None,
        top_k=50 if do_sample else None,
        temperature=0.6 if do_sample else None,
        num_beams=1,
        repetition_penalty=1.2,
    )

    #Run generation in a background thread and stream the tokens as they arrive
    t = Thread(target=model.generate, kwargs=generate_params)
    t.start()

    print("User: ", chat)
    print("Assistant: ")
    outputs = ""
    for text in streamer:
        outputs += text
        print(text, end="", flush=True)

    torch.cuda.empty_cache()
    return outputs
```

### Example
``` Python
outputs = chat_processor("What is the solution to x^2 - 1 = 0", max_new_tokens=1000, do_sample=False)
```
```
User:  What is the solution to x^2 - 1 = 0
Assistant:
The equation $x^2 - 1 = 0$ can be factored as $(x-1)(x+1) = 0$. You want to find a value of $x$ that makes this true for all values of $x$. This means that either $x=1$ or $-1$, or $x=-1$. So, there are two solutions: $x=\boxed{1}$ and $x=\boxed{-1}$.
The answer is: 1
```
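
### Reproducing the Wiki perplexity (sketch)
For reference, the Wiki perplexity figures in the Performance table can be approximated with a short evaluation loop. The sketch below is a minimal example, not the exact protocol behind the reported numbers: it assumes `model` and `tokenizer` are already loaded as in the Usage section, and the 1024-token context with non-overlapping windows is an assumption.
``` Python
# Minimal sketch: estimating wikitext-2 test perplexity for the quantized model.
# Assumes `model` and `tokenizer` are already loaded as in the Usage section above.
# The 1024-token context and non-overlapping windows are assumptions, not the exact
# settings used to produce the numbers in the Performance table.
import torch
from datasets import load_dataset

@torch.no_grad()
def wiki_perplexity(model, tokenizer, ctx_len=1024, device='cuda'):
    data = load_dataset('wikitext', 'wikitext-2-raw-v1', split='test')
    ids  = tokenizer("\n\n".join(data['text']), return_tensors='pt').input_ids.to(device)

    nlls, n_tokens = [], 0
    for start in range(0, ids.shape[1] - ctx_len + 1, ctx_len):
        chunk = ids[:, start:start + ctx_len]
        out   = model(chunk, labels=chunk)  #labels are shifted internally; loss is the mean NLL
        nlls.append(out.loss * chunk.shape[1])
        n_tokens += chunk.shape[1]

    return torch.exp(torch.stack(nlls).sum() / n_tokens).item()

print(wiki_perplexity(model, tokenizer))
```
The same loop can be run against the fp16 base model to sanity-check the relative gap shown in the table.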