qlora llama 70b openorca
- README.md +118 -0
- adapter_config.json +22 -0
- adapter_model.bin +3 -0
README.md
CHANGED
@@ -1,3 +1,121 @@
---
license: cc-by-nc-4.0
---

# QLoRA Instruction Tuned Models

| [Paper](https://arxiv.org/abs/2305.14314) | [Code](https://github.com/artidoro/qlora) |

**The `LLaMA-2 QLoRA OpenOrca` models are open-source models obtained through 4-bit QLoRA tuning of LLaMA-2 base models on 240k examples of the OpenOrca dataset.**

⚠️ These models are purely intended for research purposes and could produce problematic outputs.

## What are QLoRA Instruction Tuned Models and why use them?
- **Strong performance on MMLU** following the QLoRA instruction tuning.
- **Replicable and efficient instruction tuning procedure** that can be extended to new use cases. QLoRA training scripts are available in the [QLoRA repo](https://github.com/artidoro/qlora).
- **Rigorous comparison to 16-bit methods** (both 16-bit full-finetuning and LoRA) in [our paper](https://arxiv.org/abs/2305.14314) demonstrates the effectiveness of 4-bit QLoRA finetuning.
- **Lightweight** checkpoints which only contain adapter weights.

## License and Intended Use
Note that use of these adapter weights requires access to the LLaMA-2 model weights, and they should therefore be used according to the LLaMA-2 license.

## Usage
Here is an example of how you would load the model in 4 bits:
```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "meta-llama/Llama-2-70b-hf"
adapters_name = 'uwnlp/llama-2-70b-qlora-openorca'

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type='nf4'
    ),
)
model = PeftModel.from_pretrained(model, adapters_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
```
Inference can then be performed as usual with HF models as follows:
```python
prompt = "Introduce yourself"
formatted_prompt = (
    f"A chat between a curious human and an artificial intelligence assistant. "
    f"The assistant gives helpful, detailed, and polite answers to the user's questions.\n"
    f"### Human: {prompt} ### Assistant:"
)
inputs = tokenizer(formatted_prompt, return_tensors="pt").to("cuda:0")
outputs = model.generate(inputs=inputs.input_ids, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Expected output similar to the following:
```
A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.
### Human: Introduce yourself ### Assistant: I am an artificial intelligence assistant. I am here to help you with any questions you may have.
```
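The `### Human:` / `### Assistant:` template used above can be factored into a small helper for multi-turn exchanges. The sketch below is illustrative only; the function name and the way earlier turns are concatenated are assumptions, not an officially specified chat template:

```python
SYSTEM = (
    "A chat between a curious human and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions.\n"
)

def format_prompt(message, history=()):
    # Concatenate earlier (human, assistant) turns, then append the new message
    # so the model continues generating after the final "### Assistant:" marker.
    turns = "".join(
        f"### Human: {user} ### Assistant: {assistant} " for user, assistant in history
    )
    return f"{SYSTEM}{turns}### Human: {message} ### Assistant:"

formatted_prompt = format_prompt("Introduce yourself")
```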

## Model Card
**Architecture**: The models released here are LoRA adapters to be used on top of LLaMA-2 models. They are added to all linear layers. For all model sizes, we use $r=64$.
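
For reference, the adapter configuration released in this repository (see `adapter_config.json` below) corresponds to roughly the following PEFT `LoraConfig`; this is a minimal sketch rather than the exact training code:

```python
from peft import LoraConfig, TaskType

# Mirrors adapter_config.json: rank-64 adapters with alpha 16 and dropout 0.05
# on every linear projection of the LLaMA-2 architecture.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=64,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
)
```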

**Base Model**: These models use LLaMA-2 as the base model. LLaMA-2 is a causal language model pretrained on a large corpus of text. See the [LLaMA-2 paper](https://arxiv.org/abs/2307.09288) for more details. Note that these models can inherit the biases and limitations of the base model.

**Finetuning Data**: These models are finetuned on 240k examples of the [OpenOrca](https://huggingface.co/datasets/Open-Orca/OpenOrca) dataset.
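
If you want to inspect the finetuning data, it can be loaded from the Hub with 🤗 `datasets`. The subsampling below (shuffling and taking 240k rows) only illustrates how one might select a 240k-example subset; it is not a claim about the exact subset used for these checkpoints:

```python
from datasets import load_dataset

# OpenOrca rows contain "system_prompt", "question", and "response" fields.
openorca = load_dataset("Open-Orca/OpenOrca", split="train")

# Illustrative 240k-example subset (seed and selection strategy are assumptions).
subset = openorca.shuffle(seed=0).select(range(240_000))
print(subset[0]["question"])
```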

**Languages**: The different finetuning datasets cover different languages. We refer to the papers and resources describing each dataset for more details.

Next, we describe Training and Evaluation details.

### Training
QLoRA Instruction Tuned Models are the result of 4-bit QLoRA supervised finetuning on different instruction tuning datasets.

All models use the NormalFloat4 (NF4) datatype for the base model and LoRA adapters on all linear layers, with BFloat16 as the computation datatype. We set LoRA $r=64$, $\alpha=16$. We also use an Adam beta2 of 0.999, a max grad norm of 0.3, and a LoRA dropout of 0.1 for models up to 13B and of 0.05 for the 33B and 65B/70B models.
For the finetuning process, we use a constant learning rate schedule and the paged AdamW optimizer.

### Training hyperparameters
| Parameters | Dataset | Batch size | LR   | Steps | Source Length | Target Length |
|------------|---------|------------|------|-------|---------------|---------------|
| 7B         | All     | 16         | 2e-4 | 10000 | 384           | 128           |
| 13B        | All     | 16         | 2e-4 | 10000 | 384           | 128           |
| 70B        | All     | 64         | 1e-4 | 2500  | 384           | 128           |
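
As a rough guide, the 70B row above, combined with the optimizer settings described in the Training section, maps onto Hugging Face `TrainingArguments` roughly as sketched below. This is a hypothetical mapping for illustration; the checkpoints were trained with the scripts in the [QLoRA repo](https://github.com/artidoro/qlora), and the per-device batch size / gradient accumulation split is assumed:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="llama-2-70b-qlora-openorca",
    max_steps=2500,                    # Steps (70B row)
    learning_rate=1e-4,                # LR (70B row)
    per_device_train_batch_size=4,     # assumed split: 4 per device with
    gradient_accumulation_steps=16,    # 16-step accumulation = batch size 64
    lr_scheduler_type="constant",      # constant learning rate schedule
    optim="paged_adamw_32bit",         # paged AdamW optimizer
    adam_beta2=0.999,
    max_grad_norm=0.3,
    bf16=True,                         # BFloat16 computation datatype
)
# Source/target lengths (384/128 tokens) are enforced when tokenizing and
# collating the data, not through TrainingArguments.
```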

### Evaluation
We use the MMLU benchmark to measure performance on a range of language understanding tasks. This is a multiple-choice benchmark covering 57 tasks including elementary mathematics, US history, computer science, law, and more. We report 5-shot test accuracy.
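
One way to reproduce a 5-shot MMLU run is with the EleutherAI lm-evaluation-harness. The sketch below assumes the `lm-eval` v0.4-style Python API and is not necessarily the harness or configuration used for the numbers reported here:

```python
# pip install lm-eval  (EleutherAI lm-evaluation-harness, v0.4 API assumed)
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args=(
        "pretrained=meta-llama/Llama-2-70b-hf,"
        "peft=uwnlp/llama-2-70b-qlora-openorca,"
        "load_in_4bit=True"
    ),
    tasks=["mmlu"],
    num_fewshot=5,
)
print(results["results"])
```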

| Dataset            | 7B   | 13B  | 33B  | 65B  |
|--------------------|------|------|------|------|
| LLaMA-1 no tuning  | 35.1 | 46.9 | 57.8 | 63.4 |
| Self-Instruct      | 36.4 | 33.3 | 53.0 | 56.7 |
| Longform           | 32.1 | 43.2 | 56.6 | 59.7 |
| Chip2              | 34.5 | 41.6 | 53.6 | 59.8 |
| HH-RLHF            | 34.9 | 44.6 | 55.8 | 60.1 |
| Unnatural Instruct | 41.9 | 48.1 | 57.3 | 61.3 |
| OASST1 (Guanaco)   | 36.6 | 46.4 | 57.0 | 62.2 |
| Alpaca             | 38.8 | 47.8 | 57.3 | 62.5 |
| FLAN v2            | 44.5 | 51.4 | 59.2 | 63.9 |

| Dataset            | 7B   | 13B  | 34B  | 70B  |
|--------------------|------|------|------|------|
| LLaMA-2 no tuning  | 45.3 | 54.8 | 62.6 | 68.9 |
| OpenOrca           | 45.0 |      |      | 69.0 |

## Citation

```bibtex
@article{dettmers2023qlora,
  title={QLoRA: Efficient Finetuning of Quantized LLMs},
  author={Dettmers, Tim and Pagnoni, Artidoro and Holtzman, Ari and Zettlemoyer, Luke},
  journal={arXiv preprint arXiv:2305.14314},
  year={2023}
}
```

adapter_config.json
ADDED
@@ -0,0 +1,22 @@
{
  "base_model_name_or_path": "meta-llama/Llama-2-70b-hf",
  "bias": "none",
  "fan_in_fan_out": false,
  "inference_mode": true,
  "init_lora_weights": true,
  "lora_alpha": 16.0,
  "lora_dropout": 0.05,
  "modules_to_save": null,
  "peft_type": "LORA",
  "r": 64,
  "target_modules": [
    "v_proj",
    "k_proj",
    "down_proj",
    "o_proj",
    "q_proj",
    "up_proj",
    "gate_proj"
  ],
  "task_type": "CAUSAL_LM"
}

adapter_model.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:f21d5abca6f23a6a2a8c554dd68ed596361ab8c7a2c60f721ed5765f36df9a1d
size 1657155077