Update README.md
README.md
CHANGED
@@ -33,15 +33,6 @@ This repository provides the **AWQ 4-bit quantized** version of the **QwQ-32B-Pr
 
 ---
 
-## Requirements
-
-Ensure you are using the latest version of Hugging Face Transformers, as the code for Qwen2.5 is integrated there. Using a version earlier than **4.37.0** may result in the following error:
-
-```plaintext
-KeyError: 'qwen2'
-```
----
-
 ## Steps to deploying the solution to Inference Endpoints (dedicated)
 Use this approach if you want to try out the approach from my <a href="https://www.kaggle.com/code/mbmmurad/lb-20-qwq-32b-preview-optimized-inference">Kaggle notebook</a>, but you don't feel comfortable with coding.
 
@@ -67,6 +58,15 @@ Other values should be left at default values.
 
 ---
 
+## Requirements
+
+Ensure you are using the latest version of Hugging Face Transformers, as the code for Qwen2.5 is integrated there. Using a version earlier than **4.37.0** may result in the following error:
+
+```plaintext
+KeyError: 'qwen2'
+```
+---
+
 ## Quickstart
 
 Here's how to load the tokenizer and model, and generate content using the quantized model:
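
For reference (not part of the diff itself): a minimal sketch of what the Requirements check and the Quickstart flow covered by this hunk can look like with plain Transformers. The repo id below is a placeholder, and the README's actual snippet is not visible in the diff.

```python
# Illustrative sketch only; the README's own Quickstart code is not shown in this diff.
# The repo id is a placeholder -- replace it with the actual AWQ checkpoint id.
import transformers
from packaging import version
from transformers import AutoModelForCausalLM, AutoTokenizer

# Qwen2 support requires transformers >= 4.37.0; older versions raise KeyError: 'qwen2'.
assert version.parse(transformers.__version__) >= version.parse("4.37.0"), "pip install -U transformers"

model_id = "your-username/QwQ-32B-Preview-AWQ"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # AWQ weights load in their quantized form
    device_map="auto",    # spread layers across available GPUs
)

messages = [{"role": "user", "content": "How many positive integers divide 2024?"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(**model_inputs, max_new_tokens=512)
response = tokenizer.decode(
    generated_ids[0][model_inputs.input_ids.shape[1]:], skip_special_tokens=True
)
print(response)
```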
@@ -112,7 +112,7 @@ print(response)
 
 ---
 
-Parameter values to get a better response and reproducible values similar to the Kaggle
+Parameter values to get a better response and reproducible values similar to the Kaggle notebook:
 
 `model.generate(
 **model_inputs,
@@ -124,9 +124,9 @@ Use the sampling method here setting the following parameters:
 * `temperature = 1`
 * `top_k = 50`
 
-Setting the `max_new_tokens` to `4096*8` would increase the performance, but it will take a lot of time for inference. Using faster inference engines (e.g.
+Setting the `max_new_tokens` to `4096*8` would increase the performance, but it will take a lot of time for inference. Using faster inference engines (e.g. vLLM, TGI) would make the inference faster.
 
-To get the most optimal performance, it is suggested to use the Kaggle
+To get the most optimal performance, it is suggested to use the <a href="https://www.kaggle.com/code/mbmmurad/lb-20-qwq-32b-preview-optimized-inference">Kaggle notebook</a> mentioned above.
 
 
 ## Original Model
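
Spelled out, the sampling setup this hunk documents could look like the call below. `do_sample=True` is an assumption (the README says to use the sampling method); `temperature`, `top_k`, and `max_new_tokens` are the values from the surrounding lines, and any parameters in the elided lines 119-123 are not reproduced here.

```python
# Sketch of a generate call using only the parameters discussed in this hunk.
generated_ids = model.generate(
    **model_inputs,
    do_sample=True,           # sampling, as the README's parameter list implies (assumption)
    temperature=1.0,          # `temperature = 1`
    top_k=50,                 # `top_k = 50`
    max_new_tokens=4096 * 8,  # improves results per the README, at the cost of much slower inference
)
```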
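
As a usage note on the faster engines mentioned in the added line: a hedged sketch of loading the same AWQ checkpoint with vLLM (the repo id is again a placeholder; TGI is configured through its launcher rather than Python code).

```python
# Illustrative only; not from the README. Requires `pip install vllm`.
from vllm import LLM, SamplingParams

llm = LLM(model="your-username/QwQ-32B-Preview-AWQ", quantization="awq")  # placeholder id
params = SamplingParams(temperature=1.0, top_k=50, max_tokens=4096)
outputs = llm.generate(["How many positive integers divide 2024?"], params)
print(outputs[0].outputs[0].text)
```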
@@ -140,7 +140,7 @@ https://huggingface.co/Qwen/QwQ-32B-Preview
 
 ## Citation
 
-If you find the original model helpful, please consider citing the original authors as well as the Kaggle notebook on which this model is based on:
+If you find the original model helpful, please consider citing the original authors as well as the <a href="https://www.kaggle.com/code/mbmmurad/lb-20-qwq-32b-preview-optimized-inference">Kaggle notebook</a> on which this model is based on:
 
 ```bibtext
 @misc{qwq-32b-preview,