artidoro committed
Commit 24a4437
1 Parent(s): a35f3d0

qlora llama 70b openorca

Files changed (3)
  1. README.md +118 -0
  2. adapter_config.json +22 -0
  3. adapter_model.bin +3 -0
README.md CHANGED
@@ -1,3 +1,121 @@
  ---
  license: cc-by-nc-4.0
  ---
+
+ # QLoRA Instruction Tuned Models
+
+ | [Paper](https://arxiv.org/abs/2305.14314) | [Code](https://github.com/artidoro/qlora) |
+
+ **The `LLaMA-2 QLoRA OpenOrca` models are open-source models obtained through 4-bit QLoRA tuning of LLaMA-2 base models on 240k examples of the OpenOrca dataset.**
+
+ ⚠️ These models are purely intended for research purposes and could produce problematic outputs.
+
+ ## What are QLoRA Instruction Tuned Models and why use them?
+ - **Strong performance on MMLU** following the QLoRA instruction tuning.
+ - **Replicable and efficient instruction tuning procedure** that can be extended to new use cases. QLoRA training scripts are available in the [QLoRA repo](https://github.com/artidoro/qlora).
+ - **Rigorous comparison to 16-bit methods** (both 16-bit full-finetuning and LoRA) in [our paper](https://arxiv.org/abs/2305.14314) demonstrates the effectiveness of 4-bit QLoRA finetuning.
+ - **Lightweight** checkpoints which contain only adapter weights (see the sketch after this list).
+
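+ As a quick illustration of that last point, here is a minimal sketch (assuming the `adapter_model.bin` file from this repo has been downloaded locally) showing that the checkpoint is a plain state dict of LoRA tensors rather than full model weights:
+
+ ```python
+ import torch
+
+ # The adapter checkpoint contains only LoRA tensors, so it loads quickly on CPU.
+ state_dict = torch.load("adapter_model.bin", map_location="cpu")
+ print(f"{len(state_dict)} adapter tensors")
+ print(next(iter(state_dict)))  # e.g. a lora_A / lora_B weight name
+ ```
+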
+ ## License and Intended Use
+ Note that using these adapter weights requires access to the LLaMA-2 model weights; they should therefore be used in accordance with the LLaMA-2 license.
+
+ ## Usage
+ Here is an example of how you would load the model in 4 bits:
+ ```python
+ import torch
+ from peft import PeftModel
+ from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
+
+ model_name = "meta-llama/Llama-2-70b-hf"
+ adapters_name = 'uwnlp/llama-2-70b-qlora-openorca'
+
+ # Load the base model in 4-bit NF4 with BFloat16 compute.
+ model = AutoModelForCausalLM.from_pretrained(
+     model_name,
+     torch_dtype=torch.bfloat16,
+     device_map="auto",
+     quantization_config=BitsAndBytesConfig(
+         load_in_4bit=True,
+         bnb_4bit_compute_dtype=torch.bfloat16,
+         bnb_4bit_use_double_quant=True,
+         bnb_4bit_quant_type='nf4'
+     ),
+ )
+ # Attach the QLoRA adapter weights on top of the quantized base model.
+ model = PeftModel.from_pretrained(model, adapters_name)
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ ```
+ Inference can then be performed as usual with HF models as follows:
+ ```python
+ prompt = "Introduce yourself"
+ formatted_prompt = (
+     f"A chat between a curious human and an artificial intelligence assistant. "
+     f"The assistant gives helpful, detailed, and polite answers to the user's questions.\n"
+     f"### Human: {prompt} ### Assistant:"
+ )
+ inputs = tokenizer(formatted_prompt, return_tensors="pt").to("cuda:0")
+ outputs = model.generate(inputs=inputs.input_ids, max_new_tokens=20)
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+ ```
+ The expected output is similar to the following:
+ ```
+ A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.
+ ### Human: Introduce yourself ### Assistant: I am an artificial intelligence assistant. I am here to help you with any questions you may have.
+ ```
+
+ ## Model Card
+ **Architecture**: The models released here are LoRA adapters to be used on top of LLaMA-2 models. They are added to all layers. For all model sizes, we use $r=64$.
+
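+ To inspect these adapter settings programmatically, here is a minimal sketch (assuming `peft` is installed; the repo id is the one used in the Usage section) that reads the shipped `adapter_config.json`:
+
+ ```python
+ from peft import PeftConfig
+
+ # Downloads adapter_config.json from the Hub and parses it into a LoraConfig.
+ config = PeftConfig.from_pretrained("uwnlp/llama-2-70b-qlora-openorca")
+ print(config.r, config.lora_alpha, config.target_modules)
+ ```
+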
+ **Base Model**: These models use LLaMA-2 as their base model. LLaMA-2 is a causal language model pretrained on a large corpus of text. See the [LLaMA-2 paper](https://arxiv.org/abs/2307.09288) for more details. Note that these models can inherit the biases and limitations of the base model.
+
+ **Finetuning Data**: These models are finetuned on 240k examples from the [OpenOrca](https://huggingface.co/datasets/Open-Orca/OpenOrca) dataset.
+
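+ For reference, the finetuning data can be pulled from the Hub with the `datasets` library; here is a minimal sketch (the `train` split name and streaming mode are assumptions):
+
+ ```python
+ from datasets import load_dataset
+
+ # Stream the dataset instead of downloading it in full.
+ openorca = load_dataset("Open-Orca/OpenOrca", split="train", streaming=True)
+ print(next(iter(openorca)))  # one instruction-tuning example
+ ```
+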
73
+
74
+ **Languages**: The different datasets cover different languages. We direct to the various papers and resources describing the datasets for more details.
75
+
76
+ Next, we describe Training and Evaluation details.
77
+
+ ### Training
+ QLoRA Instruction Tuned Models are the result of 4-bit QLoRA supervised finetuning on different instruction tuning datasets.
+
+ All models use the NormalFloat4 datatype for the base model and LoRA adapters on all linear layers, with BFloat16 as the computation datatype. We set LoRA $r=64$, $\alpha=16$. We also use an Adam $\beta_2$ of 0.999, a max grad norm of 0.3, and a LoRA dropout of 0.1 for models up to 13B and of 0.05 for the 33B and 65B/70B models.
+ For the finetuning process, we use a constant learning rate schedule and the paged AdamW optimizer (see the configuration sketch after the hyperparameter table below).
+
+ ### Training hyperparameters
+ | Parameters | Dataset | Batch size | LR | Steps | Source Length | Target Length |
+ |------------|----------|------------|------|-------|---------------|---------------|
+ | 7B | All | 16 | 2e-4 | 10000 | 384 | 128 |
+ | 13B | All | 16 | 2e-4 | 10000 | 384 | 128 |
+ | 70B | All | 64 | 1e-4 | 2500 | 384 | 128 |
+
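+ These settings can be expressed as `transformers` and `peft` configuration objects. The following is a hedged sketch of the 70B recipe above, not the exact training code (which lives in the [QLoRA repo](https://github.com/artidoro/qlora)); the output directory and the per-device/accumulation split of the batch size of 64 are assumptions:
+
+ ```python
+ import torch
+ from peft import LoraConfig
+ from transformers import BitsAndBytesConfig, TrainingArguments
+
+ # 4-bit NF4 base weights, double quantization, BFloat16 compute.
+ bnb_config = BitsAndBytesConfig(
+     load_in_4bit=True,
+     bnb_4bit_quant_type="nf4",
+     bnb_4bit_use_double_quant=True,
+     bnb_4bit_compute_dtype=torch.bfloat16,
+ )
+
+ # LoRA on all linear layers with r=64, alpha=16, dropout 0.05 (70B model).
+ lora_config = LoraConfig(
+     r=64,
+     lora_alpha=16,
+     lora_dropout=0.05,
+     target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
+                     "gate_proj", "up_proj", "down_proj"],
+     task_type="CAUSAL_LM",
+ )
+
+ # 70B row of the table: batch size 64, LR 1e-4, 2500 steps, constant
+ # schedule, paged AdamW, max grad norm 0.3, Adam beta2 0.999.
+ training_args = TrainingArguments(
+     output_dir="./qlora-llama2-70b-openorca",  # assumption
+     per_device_train_batch_size=1,             # assumption: 64 = 1 x 64 accumulation
+     gradient_accumulation_steps=64,
+     learning_rate=1e-4,
+     lr_scheduler_type="constant",
+     optim="paged_adamw_32bit",
+     max_grad_norm=0.3,
+     adam_beta2=0.999,
+     max_steps=2500,
+     bf16=True,
+ )
+ ```
+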
+ ### Evaluation
+ We use the MMLU benchmark to measure performance on a range of language understanding tasks. This is a multiple-choice benchmark covering 57 tasks including elementary mathematics, US history, computer science, law, and more. We report 5-shot test accuracy.
+
+ | Dataset | 7B | 13B | 33B | 65B |
+ |---|---|---|---|---|
+ | LLaMA-1 no tuning | 35.1 | 46.9 | 57.8 | 63.4 |
+ | Self-Instruct | 36.4 | 33.3 | 53.0 | 56.7 |
+ | Longform | 32.1 | 43.2 | 56.6 | 59.7 |
+ | Chip2 | 34.5 | 41.6 | 53.6 | 59.8 |
+ | HH-RLHF | 34.9 | 44.6 | 55.8 | 60.1 |
+ | Unnatural Instruct | 41.9 | 48.1 | 57.3 | 61.3 |
+ | OASST1 (Guanaco) | 36.6 | 46.4 | 57.0 | 62.2 |
+ | Alpaca | 38.8 | 47.8 | 57.3 | 62.5 |
+ | FLAN v2 | 44.5 | 51.4 | 59.2 | 63.9 |
+
+ | Dataset | 7B | 13B | 34B | 70B |
+ |---|---|---|---|---|
+ | LLaMA-2 no tuning | 45.3 | 54.8 | 62.6 | 68.9 |
+ | OpenOrca | 45.0 | | | 69.0 |
+
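+ For readers who want to reproduce a 5-shot MMLU score, here is a hedged sketch using EleutherAI's lm-evaluation-harness (an assumption on our part, not necessarily the harness behind the numbers above; the `peft` and `load_in_4bit` model_args rely on its Hugging Face backend):
+
+ ```python
+ import lm_eval  # pip install lm-eval
+
+ # Evaluate the quantized base model with this adapter attached.
+ results = lm_eval.simple_evaluate(
+     model="hf",
+     model_args=(
+         "pretrained=meta-llama/Llama-2-70b-hf,"
+         "peft=uwnlp/llama-2-70b-qlora-openorca,"
+         "load_in_4bit=True"
+     ),
+     tasks=["mmlu"],
+     num_fewshot=5,
+ )
+ print(results["results"]["mmlu"])
+ ```
+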
+ ## Citation
+
+ ```bibtex
+ @article{dettmers2023qlora,
+   title={QLoRA: Efficient Finetuning of Quantized LLMs},
+   author={Dettmers, Tim and Pagnoni, Artidoro and Holtzman, Ari and Zettlemoyer, Luke},
+   journal={arXiv preprint arXiv:2305.14314},
+   year={2023}
+ }
+ ```
adapter_config.json ADDED
@@ -0,0 +1,22 @@
+ {
+   "base_model_name_or_path": "meta-llama/Llama-2-70b-hf",
+   "bias": "none",
+   "fan_in_fan_out": false,
+   "inference_mode": true,
+   "init_lora_weights": true,
+   "lora_alpha": 16.0,
+   "lora_dropout": 0.05,
+   "modules_to_save": null,
+   "peft_type": "LORA",
+   "r": 64,
+   "target_modules": [
+     "v_proj",
+     "k_proj",
+     "down_proj",
+     "o_proj",
+     "q_proj",
+     "up_proj",
+     "gate_proj"
+   ],
+   "task_type": "CAUSAL_LM"
+ }
adapter_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f21d5abca6f23a6a2a8c554dd68ed596361ab8c7a2c60f721ed5765f36df9a1d
+ size 1657155077