update content of vllm gptq model
README.md CHANGED
@@ -134,9 +134,9 @@ print(response)
 # They are an asset to the team, and their efforts do not go unnoticed. Keep up the great work!
 ```
 
-
+Note: Running quantized models with vLLM requires our [vLLM fork repository](https://github.com/QwenLM/vllm-gptq). Int8 models are not supported yet; support will be added soon.
 
-Note: vLLM
+Note: You need to install our [vLLM repo](https://github.com/qwenlm/vllm-gptq) to run GPTQ quantized models with vLLM. The int8 model is not supported for the time being, and we will add support soon.
 
 For more usage instructions, please refer to our [GitHub repo](https://github.com/QwenLM/Qwen).
 
@@ -190,12 +190,20 @@ We measured the average inference speed and GPU memory usage of generating 2048
 | Int4 | HF + FlashAttn-v2 | 1 | 1 | 2048 | 11.67 | 48.86GB |
 | Int4 | HF + FlashAttn-v1 | 1 | 1 | 2048 | 11.27 | 48.86GB |
 | Int4 | HF + No FlashAttn | 1 | 1 | 2048 | 11.32 | 48.86GB |
+| Int4 | vLLM | 1 | 1 | 2048 | 14.63 | Pre-Allocated* |
+| Int4 | vLLM | 2 | 1 | 2048 | 20.76 | Pre-Allocated* |
+| Int4 | vLLM | 4 | 1 | 2048 | 27.19 | Pre-Allocated* |
 | Int4 | HF + FlashAttn-v2 | 2 | 6144 | 2048 | 6.75 | 85.99GB |
 | Int4 | HF + FlashAttn-v1 | 2 | 6144 | 2048 | 6.32 | 85.99GB |
 | Int4 | HF + No FlashAttn | 2 | 6144 | 2048 | 5.97 | 88.30GB |
-| Int4 |
-| Int4 |
+| Int4 | vLLM | 2 | 6144 | 2048 | 18.07 | Pre-Allocated* |
+| Int4 | vLLM | 4 | 6144 | 2048 | 24.56 | Pre-Allocated* |
+| Int4 | HF + FlashAttn-v2 | 3 | 14336 | 2048 | 4.18 | 148.73GB |
+| Int4 | HF + FlashAttn-v1 | 3 | 14336 | 2048 | 3.72 | 148.73GB |
 | Int4 | HF + No FlashAttn | 3 | 14336 | 2048 | OOM | OOM |
+| Int4 | vLLM | 2 | 14336 | 2048 | 14.51 | Pre-Allocated* |
+| Int4 | vLLM | 4 | 14336 | 2048 | 19.28 | Pre-Allocated* |
+| Int4 | vLLM | 4 | 30720 | 2048 | 16.93 | Pre-Allocated* |
 
 \* vLLM pre-allocates GPU memory, so the actual peak memory usage cannot be measured. HF refers to inference with the Huggingface Transformers library.
 
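The note added in the first hunk only says which fork to install. As a minimal sketch of what running a GPTQ Int4 checkpoint through that vLLM build could look like, assuming the hypothetical model id `Qwen/Qwen-72B-Chat-Int4` and that the fork exposes upstream vLLM's `quantization`, `tensor_parallel_size`, and `gpu_memory_utilization` parameters (none of which this diff confirms):

```python
# A rough sketch, not from this commit: running a GPTQ Int4 checkpoint
# through vLLM once the QwenLM/vllm-gptq fork is installed.
# The model id and all parameter values below are assumptions.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen-72B-Chat-Int4",  # hypothetical GPTQ Int4 checkpoint
    quantization="gptq",              # declare the weights as GPTQ-quantized
    tensor_parallel_size=2,           # shard the model across 2 GPUs
    gpu_memory_utilization=0.90,      # fraction of VRAM vLLM reserves up front
    trust_remote_code=True,           # Qwen repos ship custom modeling code
)

sampling = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=2048)
outputs = llm.generate(["Write a short note praising a teammate."], sampling)
print(outputs[0].outputs[0].text)
```

The up-front reservation controlled by `gpu_memory_utilization` is also why the vLLM rows in the table above report `Pre-Allocated*` rather than a peak memory figure: the engine grabs its memory budget for weights and KV cache before generating anything.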
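For the HF rows in the speed table, a sketch of how an "average speed of generating 2048 tokens" figure and a peak memory figure could be measured with the Transformers path; the model id, the dummy prompt, and the timing protocol are assumptions, not the authors' actual benchmark script:

```python
# Sketch: average decode speed (tokens/s) and peak memory for a fixed
# 2048-token generation, in the spirit of the HF rows above.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen-72B-Chat-Int4"  # hypothetical GPTQ Int4 checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", trust_remote_code=True
).eval()

# Long dummy context as a rough stand-in for the 6144/14336-token settings.
inputs = tokenizer("hello " * 6144, return_tensors="pt").to(model.device)

torch.cuda.synchronize()
start = time.time()
# min_new_tokens forces the full 2048 tokens even if EOS is sampled early.
out = model.generate(**inputs, min_new_tokens=2048, max_new_tokens=2048)
torch.cuda.synchronize()
elapsed = time.time() - start

new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
print(f"speed: {new_tokens / elapsed:.2f} tokens/s")
print(f"peak memory: {torch.cuda.max_memory_allocated() / 2**30:.2f} GiB")
```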