What is the context window size of this model? I mean, what are the input and output token limits of this model?
Can we change the input and output tokens of the model?
The model's context size is 2048 tokens, but I don't understand what you mean by "change the input and output tokens of the model."
Sorry for the spelling mistake. What I meant was: can I change the input tokens and output tokens of the model?
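In case it helps, here's a minimal sketch of where those two knobs usually live, assuming the model is a GGUF file loaded with llama-cpp-python (the model path is a placeholder, not the exact file from this thread):

```python
from llama_cpp import Llama

# Hypothetical path -- replace with your own GGUF file.
llm = Llama(
    model_path="./Meta-Llama-3.1-8B-Instruct.Q2_K.gguf",
    n_ctx=2048,  # input context window: prompt + generated tokens must fit here
)

out = llm(
    "Explain quantization in one sentence.",
    max_tokens=128,  # upper bound on the number of output tokens generated
)
print(out["choices"][0]["text"])
```

You can set `n_ctx` to anything up to the context length the model was trained with; pushing it beyond that generally degrades quality unless the runtime applies some form of RoPE scaling, so the 2048-token limit mentioned above is ultimately a property of the model itself.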
If you apply 2-bit quantization to a model, the model size should theoretically decrease, since the number of bits per parameter is reduced. For example, going from 32-bit floating point to a 2-bit representation should result in a much smaller model. However, I saw one case where 2-bit quantization was applied to a model, and the size remained 3 GB, the same as the original.
As far as I know, quantizing a model reduces the precision (bits per parameter), which should lead to a smaller file. But in this case, even after using 2-bit quantization, the model size didn't shrink. Why?
If you look at the sizes of the Llama 3.1 quantizations, they keep decreasing as you use fewer bits: the Q4_1 quant is 5 GB and Q2_K is 3.2 GB.
So if you noticed no decrease in size after quantization, I wouldn't know how to explain it. Maybe you should use a different quantization library.
https://huggingface.co/QuantFactory/Meta-Llama-3.1-8B-Instruct-GGUF/tree/main
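As a rough sanity check, you can estimate the expected file size from bits per weight. This is only a back-of-the-envelope sketch: the 8B parameter count and the bits-per-weight figures below are assumptions, and real GGUF files come out somewhat larger because some tensors (embeddings, output layer) stay at higher precision and the K-quant blocks store extra scale metadata.

```python
# Back-of-the-envelope size estimate: parameters * bits_per_weight / 8 bytes.
# Assumes an 8B-parameter model; bits-per-weight values are approximate.
params = 8e9

for name, bits in [("FP16", 16), ("Q8_0", 8.5), ("Q4_1", 5.0), ("Q2_K", 2.6)]:
    size_gb = params * bits / 8 / 1e9
    print(f"{name}: ~{size_gb:.1f} GB")
```

The point of the arithmetic is that file size should scale roughly with bits per weight. So if a "2-bit quantized" file comes out exactly the same size as the original, the quantization step most likely didn't actually run, or the output was saved back in the original precision.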