Update README.md
README.md
CHANGED
@@ -33,15 +33,6 @@ This repository provides the **AWQ 4-bit quantized** version of the **QwQ-32B-Pr
 
 ---
 
-## Requirements
-
-Ensure you are using the latest version of Hugging Face Transformers, as the code for Qwen2.5 is integrated there. Using a version earlier than **4.37.0** may result in the following error:
-
-```plaintext
-KeyError: 'qwen2'
-```
----
-
 ## Steps to deploying the solution to Inference Endpoints (dedicated)
 Use this approach if you want to try out the approach from my <a href="https://www.kaggle.com/code/mbmmurad/lb-20-qwq-32b-preview-optimized-inference">Kaggle notebook</a>, but you don't feel comfortable with coding.
 
@@ -67,6 +58,15 @@ Other values should be left at default values.
 
 ---
 
+## Requirements
+
+Ensure you are using the latest version of Hugging Face Transformers, as the code for Qwen2.5 is integrated there. Using a version earlier than **4.37.0** may result in the following error:
+
+```plaintext
+KeyError: 'qwen2'
+```
+---
+
 ## Quickstart
 
 Here's how to load the tokenizer and model, and generate content using the quantized model:
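
For reference (not part of the diff itself): a minimal sketch of what the Requirements check and the Quickstart flow covered by this hunk can look like with plain Transformers. The repo id below is a placeholder, and the README's actual snippet is not visible in the diff.

```python
# Illustrative sketch only; the README's own Quickstart code is not shown in this diff.
# The repo id is a placeholder -- replace it with the actual AWQ checkpoint id.
import transformers
from packaging import version
from transformers import AutoModelForCausalLM, AutoTokenizer

# Qwen2 support requires transformers >= 4.37.0; older versions raise KeyError: 'qwen2'.
assert version.parse(transformers.__version__) >= version.parse("4.37.0"), "pip install -U transformers"

model_id = "your-username/QwQ-32B-Preview-AWQ"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # AWQ weights load in their quantized form
    device_map="auto",    # spread layers across available GPUs
)

messages = [{"role": "user", "content": "How many positive integers divide 2024?"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(**model_inputs, max_new_tokens=512)
response = tokenizer.decode(
    generated_ids[0][model_inputs.input_ids.shape[1]:], skip_special_tokens=True
)
print(response)
```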
@@ -112,7 +112,7 @@ print(response)
 
 ---
 
-Parameter values to get a better response and reproducible values similar to the Kaggle
+Parameter values to get a better response and reproducible values similar to the Kaggle notebook:
 
 `model.generate(
 **model_inputs,
@@ -124,9 +124,9 @@ Use the sampling method here setting the following parameters:
 * `temperature = 1`
 * `top_k = 50`
 
-Setting the `max_new_tokens` to `4096*8` would increase the performance, but it will take a lot of time for inference. Using faster inference engines (e.g.
+Setting the `max_new_tokens` to `4096*8` would increase the performance, but it will take a lot of time for inference. Using faster inference engines (e.g. vLLM, TGI) would make the inference faster.
 
-To get the most optimal performance, it is suggested to use the Kaggle
+To get the most optimal performance, it is suggested to use the <a href="https://www.kaggle.com/code/mbmmurad/lb-20-qwq-32b-preview-optimized-inference">Kaggle notebook</a> mentioned above.
 
 
 ## Original Model
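
Spelled out, the sampling setup this hunk documents could look like the call below. `do_sample=True` is an assumption (the README says to use the sampling method); `temperature`, `top_k`, and `max_new_tokens` are the values from the surrounding lines, and any parameters in the elided lines 119-123 are not reproduced here.

```python
# Sketch of a generate call using only the parameters discussed in this hunk.
generated_ids = model.generate(
    **model_inputs,
    do_sample=True,           # sampling, as the README's parameter list implies (assumption)
    temperature=1.0,          # `temperature = 1`
    top_k=50,                 # `top_k = 50`
    max_new_tokens=4096 * 8,  # improves results per the README, at the cost of much slower inference
)
```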
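
As a usage note on the faster engines mentioned in the added line: a hedged sketch of loading the same AWQ checkpoint with vLLM (the repo id is again a placeholder; TGI is configured through its launcher rather than Python code).

```python
# Illustrative only; not from the README. Requires `pip install vllm`.
from vllm import LLM, SamplingParams

llm = LLM(model="your-username/QwQ-32B-Preview-AWQ", quantization="awq")  # placeholder id
params = SamplingParams(temperature=1.0, top_k=50, max_tokens=4096)
outputs = llm.generate(["How many positive integers divide 2024?"], params)
print(outputs[0].outputs[0].text)
```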
@@ -140,7 +140,7 @@ https://huggingface.co/Qwen/QwQ-32B-Preview
 
 ## Citation
 
-If you find the original model helpful, please consider citing the original authors as well as the Kaggle notebook on which this model is based on:
+If you find the original model helpful, please consider citing the original authors as well as the <a href="https://www.kaggle.com/code/mbmmurad/lb-20-qwq-32b-preview-optimized-inference">Kaggle notebook</a> on which this model is based on:
 
 ```bibtext
 @misc{qwq-32b-preview,