Xenova (HF staff) committed
Commit e4beb03
1 Parent(s): 3ce3d98

Improve installation + code snippets

Files changed (1):
  1. README.md +29 -57
README.md CHANGED
@@ -33,20 +33,12 @@ This repository contains [`meta-llama/Meta-Llama-3.1-70B-Instruct`](https://hugg
 
  In order to use the current quantized model, support is offered for different solutions as `transformers`, `autoawq`, or `text-generation-inference`.
 
- ### 🤗 transformers
+ ### 🤗 Transformers
 
- In order to run the inference with Llama 3.1 70B Instruct AWQ in INT4, both `torch` and `autoawq` need to be installed as:
+ In order to run the inference with Llama 3.1 70B Instruct AWQ in INT4, you need to install the following packages:
 
  ```bash
- pip install "torch>=2.2.0,<2.3.0" autoawq --upgrade
- ```
-
- Otherwise, running the inference may fail, since the AutoAWQ kernels are built with PyTorch 2.2.1, meaning that those will break with PyTorch 2.3.0.
-
- Then, the latest version of `transformers` need to be installed, being 4.43.0 or higher, as:
-
- ```bash
- pip install "transformers[accelerate]>=4.43.0" --upgrade
+ pip install -q --upgrade transformers autoawq accelerate
  ```
 
  To run the inference on top of Llama 3.1 70B Instruct AWQ in INT4 precision, the AWQ model can be instantiated as any other causal language modeling model via `AutoModelForCausalLM` and run the inference normally.
@@ -56,13 +48,18 @@ import torch
  from transformers import AutoModelForCausalLM, AutoTokenizer
 
  model_id = "hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4"
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ model = AutoModelForCausalLM.from_pretrained(
+     model_id,
+     torch_dtype=torch.float16,
+     low_cpu_mem_usage=True,
+     device_map="auto",
+ )
+
  prompt = [
      {"role": "system", "content": "You are a helpful assistant, that responds as a pirate."},
      {"role": "user", "content": "What's Deep Learning?"},
  ]
-
- tokenizer = AutoTokenizer.from_pretrained(model_id)
-
  inputs = tokenizer.apply_chat_template(
      prompt,
      tokenize=True,
@@ -71,48 +68,38 @@ inputs = tokenizer.apply_chat_template(
      return_dict=True,
  ).to("cuda")
 
- model = AutoModelForCausalLM.from_pretrained(
-     model_id,
-     torch_dtype=torch.float16,
-     low_cpu_mem_usage=True,
-     device_map="auto",
- )
-
  outputs = model.generate(**inputs, do_sample=True, max_new_tokens=256)
- print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
+ print(tokenizer.batch_decode(outputs[:, inputs['input_ids'].shape[1]:], skip_special_tokens=True)[0])
  ```
 
  ### AutoAWQ
 
- In order to run the inference with Llama 3.1 70B Instruct AWQ in INT4, both `torch` and `autoawq` need to be installed as:
-
- ```bash
- pip install "torch>=2.2.0,<2.3.0" autoawq --upgrade
- ```
-
- Otherwise, running the inference may fail, since the AutoAWQ kernels are built with PyTorch 2.2.1, meaning that those will break with PyTorch 2.3.0.
-
- Then, the latest version of `transformers` need to be installed, being 4.43.0 or higher, as:
+ In order to run the inference with Llama 3.1 70B Instruct AWQ in INT4, you need to install the following packages:
 
  ```bash
- pip install "transformers[accelerate]>=4.43.0" --upgrade
+ pip install -q --upgrade transformers autoawq accelerate
  ```
 
  Alternatively, one may want to run that via `AutoAWQ` even though it's built on top of 🤗 `transformers`, which is the recommended approach instead as described above.
 
  ```python
  import torch
- from autoawq import AutoAWQForCausalLM
+ from awq import AutoAWQForCausalLM
  from transformers import AutoModelForCausalLM, AutoTokenizer
 
  model_id = "hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4"
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ model = AutoAWQForCausalLM.from_pretrained(
+     model_id,
+     torch_dtype=torch.float16,
+     low_cpu_mem_usage=True,
+     device_map="auto",
+ )
+
  prompt = [
      {"role": "system", "content": "You are a helpful assistant, that responds as a pirate."},
      {"role": "user", "content": "What's Deep Learning?"},
  ]
-
- tokenizer = AutoTokenizer.from_pretrained(model_id)
-
  inputs = tokenizer.apply_chat_template(
      prompt,
      tokenize=True,
@@ -121,15 +108,8 @@ inputs = tokenizer.apply_chat_template(
      return_dict=True,
  ).to("cuda")
 
- model = AutoAWQForCausalLM.from_pretrained(
-     model_id,
-     torch_dtype=torch.float16,
-     low_cpu_mem_usage=True,
-     device_map="auto",
- )
-
  outputs = model.generate(**inputs, do_sample=True, max_new_tokens=256)
- print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
+ print(tokenizer.batch_decode(outputs[:, inputs['input_ids'].shape[1]:], skip_special_tokens=True)[0])
  ```
 
  The AutoAWQ script has been adapted from [AutoAWQ/examples/generate.py](https://github.com/casper-hansen/AutoAWQ/blob/main/examples/generate.py).
@@ -143,21 +123,13 @@ Coming soon!
  > [!NOTE]
  > In order to quantize Llama 3.1 70B Instruct using AutoAWQ, you will need to use an instance with at least enough CPU RAM to fit the whole model i.e. ~140GiB, and an NVIDIA GPU with 40GiB of VRAM to quantize it.
 
- In order to quantize Llama 3.1 70B Instruct, first install `torch` and `autoawq` as follows:
-
- ```bash
- pip install "torch>=2.2.0,<2.3.0" autoawq --upgrade
- ```
-
- Otherwise the quantization may fail, since the AutoAWQ kernels are built with PyTorch 2.2.1, meaning that those will break with PyTorch 2.3.0.
-
- Then install the latest version of `transformers` as follows:
+ In order to quantize Llama 3.1 70B Instruct, first install the following packages:
 
  ```bash
- pip install "transformers>=4.43.0" --upgrade
+ pip install -q --upgrade transformers autoawq accelerate
  ```
 
- And then, run the following script, adapted from [`AutoAWQ/examples/quantize.py`](https://github.com/casper-hansen/AutoAWQ/blob/main/examples/quantize.py) as follows:
+ Then run the following script, adapted from [`AutoAWQ/examples/quantize.py`](https://github.com/casper-hansen/AutoAWQ/blob/main/examples/quantize.py):
 
  ```python
  from awq import AutoAWQForCausalLM
@@ -174,9 +146,9 @@ quant_config = {
 
  # Load model
  model = AutoAWQForCausalLM.from_pretrained(
-     model_path, **{"low_cpu_mem_usage": True, "use_cache": False}
+     model_path, low_cpu_mem_usage=True, use_cache=False,
  )
- tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
+ tokenizer = AutoTokenizer.from_pretrained(model_path)
 
  # Quantize
  model.quantize(tokenizer, quant_config=quant_config)
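
For completeness: the quantization hunk above stops at `model.quantize(...)`, and the diff does not show the unchanged tail of the script. Following the pattern of the linked `AutoAWQ/examples/quantize.py`, the script would typically continue by saving the quantized weights and tokenizer, roughly as sketched below; `quant_path` is a hypothetical output directory, not taken from this commit.

```python
# Minimal sketch (not part of this diff): persist the INT4 checkpoint produced by
# `model.quantize(...)` above so it can be reloaded later.
quant_path = "Meta-Llama-3.1-70B-Instruct-AWQ-INT4"  # hypothetical output directory

model.save_quantized(quant_path)       # AutoAWQ writes the quantized weights and config
tokenizer.save_pretrained(quant_path)  # keep the tokenizer alongside the weights
```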
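As a side note on the updated `print(...)` line in both inference snippets above: `model.generate` returns the prompt tokens followed by the newly generated ones, so slicing at `inputs['input_ids'].shape[1]` decodes only the completion. A tiny self-contained illustration of that slicing, using toy tensors rather than a real model:

```python
import torch

# Stand-in shapes: a prompt of 5 tokens followed by 7 generated tokens.
prompt_len = 5                                          # plays the role of inputs['input_ids'].shape[1]
outputs = torch.arange(prompt_len + 7).reshape(1, -1)   # plays the role of model.generate(...) output

completion_only = outputs[:, prompt_len:]               # drop the echoed prompt tokens
print(completion_only.shape)                            # torch.Size([1, 7])
```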