yangapku committed on
Commit cde901c
1 Parent(s): 9dd262c

update content of vllm gptq model

Files changed (1): README.md +12 -4
README.md CHANGED
@@ -134,9 +134,9 @@ print(response)
 # They are an asset to the team, and their efforts do not go unnoticed. Keep up the great work!
 ```
 
-注意:vLLM暂不支持gptq量化方案,我们将近期给出解决方案。
+注意:使用vLLM运行量化模型需安装我们[vLLM分支仓库](https://github.com/QwenLM/vllm-gptq)。暂不支持int8模型,近期将更新。
 
-Note: vLLM does not currently support the `gptq` quantization, and we will provide a solution in the near future.
+Note: To run quantized models with vLLM, you need to install our [vLLM repo](https://github.com/qwenlm/vllm-gptq). The int8 model is not supported for the time being; we will add support soon.
 
 关于更多的使用说明,请参考我们的[GitHub repo](https://github.com/QwenLM/Qwen)获取更多信息。
 
@@ -190,12 +190,20 @@ We measured the average inference speed and GPU memory usage of generating 2048
 | Int4 | HF + FlashAttn-v2 | 1 | 1 | 2048 | 11.67 | 48.86GB |
 | Int4 | HF + FlashAttn-v1 | 1 | 1 | 2048 | 11.27 | 48.86GB |
 | Int4 | HF + No FlashAttn | 1 | 1 | 2048 | 11.32 | 48.86GB |
+| Int4 | vLLM | 1 | 1 | 2048 | 14.63 | Pre-Allocated* |
+| Int4 | vLLM | 2 | 1 | 2048 | 20.76 | Pre-Allocated* |
+| Int4 | vLLM | 4 | 1 | 2048 | 27.19 | Pre-Allocated* |
 | Int4 | HF + FlashAttn-v2 | 2 | 6144 | 2048 | 6.75 | 85.99GB |
 | Int4 | HF + FlashAttn-v1 | 2 | 6144 | 2048 | 6.32 | 85.99GB |
 | Int4 | HF + No FlashAttn | 2 | 6144 | 2048 | 5.97 | 88.30GB |
+| Int4 | vLLM | 2 | 6144 | 2048 | 18.07 | Pre-Allocated* |
+| Int4 | vLLM | 4 | 6144 | 2048 | 24.56 | Pre-Allocated* |
-| Int4 | HF + FlashAttn-v2 | 3 | 14336 | 2048 | 4.18 | 85.99GB |
-| Int4 | HF + FlashAttn-v1 | 3 | 14336 | 2048 | 3.72 | 85.99GB |
+| Int4 | HF + FlashAttn-v2 | 3 | 14336 | 2048 | 4.18 | 148.73GB |
+| Int4 | HF + FlashAttn-v1 | 3 | 14336 | 2048 | 3.72 | 148.73GB |
 | Int4 | HF + No FlashAttn | 3 | 14336 | 2048 | OOM | OOM |
+| Int4 | vLLM | 2 | 14336 | 2048 | 14.51 | Pre-Allocated* |
+| Int4 | vLLM | 4 | 14336 | 2048 | 19.28 | Pre-Allocated* |
+| Int4 | vLLM | 4 | 30720 | 2048 | 16.93 | Pre-Allocated* |
 
 \* vLLM会提前预分配显存,因此无法探测最大显存使用情况。HF是指使用Huggingface Transformers库进行推理。
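The added note says GPTQ-quantized models need the QwenLM vLLM fork installed. A minimal usage sketch, assuming the fork keeps upstream vLLM's `LLM`/`SamplingParams` Python API; the model name and the `quantization` flag below are illustrative assumptions, not verified against the fork:

```python
# Hypothetical sketch of serving the Int4 GPTQ model with the QwenLM
# vllm-gptq fork. Assumes the fork keeps upstream vLLM's Python API;
# model name and `quantization` flag are assumptions for illustration.

def build_llm_kwargs(model: str, quantization: str = "gptq") -> dict:
    """Collect constructor arguments for vllm.LLM.

    The note above says int8 models are not supported by the fork yet,
    so reject them up front.
    """
    if quantization == "int8":
        raise ValueError("int8 models are not supported by the vLLM fork yet")
    return {"model": model, "quantization": quantization, "trust_remote_code": True}

try:
    # Importable only once the vllm-gptq fork is installed.
    from vllm import LLM, SamplingParams
except ImportError:
    LLM = None

if LLM is not None:
    llm = LLM(**build_llm_kwargs("Qwen/Qwen-72B-Chat-Int4"))
    params = SamplingParams(temperature=0.7, max_tokens=2048)
    outputs = llm.generate(["Give me a short introduction to large language models."], params)
    print(outputs[0].outputs[0].text)
```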
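The benchmark rows report average decode speed in tokens/s for generating 2048 tokens. A quick helper (illustrative; the two speeds are copied from the context-length-6144 rows of the table) turns those rates into wall-clock estimates and a relative speedup:

```python
# Convert the table's average decode speeds (tokens/s) into wall-clock
# time for the 2048 generated tokens the benchmark measures.
GEN_TOKENS = 2048

def gen_seconds(tokens_per_s: float, gen_tokens: int = GEN_TOKENS) -> float:
    """Seconds to decode `gen_tokens` at the given average speed."""
    return gen_tokens / tokens_per_s

# Context length 6144: HF + FlashAttn-v2 (6.75 tok/s) vs vLLM (18.07 tok/s)
hf_s = gen_seconds(6.75)      # ~303.4 s
vllm_s = gen_seconds(18.07)   # ~113.3 s
print(f"HF: {hf_s:.1f}s  vLLM: {vllm_s:.1f}s  speedup: {hf_s / vllm_s:.2f}x")
```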