	---
	license: apache-2.0
	---

	This repository contains alternative quantized models of Mixtral-instruct-8x7B (https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1) in GGUF format for use with `llama.cpp`.
	The models are fully compatible with the official `llama.cpp` release and can be used out of the box.

	I'm careful to say "alternative" rather than "better" or "improved" because I have not put any effort into evaluating performance
	differences in actual usage. Perplexity is lower than with the "official" `llama.cpp` quantization, but perplexity is not
	necessarily a good measure of real-world performance. Nevertheless, perplexity does measure quantization error, so the table below
	compares the perplexities of these quantized models with those of the current `llama.cpp` quantization approach on Wikitext at a context length of 512 tokens.
	The "Quantization Error" columns in the table are defined as `(PPL(quantized model) - PPL(fp16))/PPL(fp16)`.

	| Quantization | Model file | PPL (llama.cpp) | Quantization Error (llama.cpp) | PPL (new quants) | Quantization Error (new quants) |
	|--:|--:|--:|--:|--:|--:|
	| Q2_K | mixtral-instruct-8x7b-q2k.gguf | 6.8953 | 56.4% | 5.2679 | 19.5% |
	| Q3_K_S | mixtral-instruct-8x7b-q3k-small.gguf | 4.7038 | 6.68% | 4.6401 | 5.24% |
	| Q3_K_M | mixtral-instruct-8x7b-q3k-medium.gguf | 4.6663 | 5.83% | 4.5608 | 3.44% |
	| Q4_K_S | mixtral-instruct-8x7b-q4k-small.gguf | 4.5105 | 2.30% | 4.4630 | 1.22% |
	| Q4_K_M | mixtral-instruct-8x7b-q4k-medium.gguf | 4.5105 | 2.30% | 4.4568 | 1.08% |
	| Q5_K_S | mixtral-instruct-8x7b-q5k-small.gguf | 4.4402 | 0.71% | 4.4277 | 0.42% |
	| Q4_0 | mixtral-instruct-8x7b-q40.gguf | 4.5102 | 2.29% | 4.4908 | 1.85% |
	| Q4_1 | mixtral-instruct-8x7b-q41.gguf | 4.5415 | 3.00% | 4.4612 | 1.18% |
	| Q5_0 | mixtral-instruct-8x7b-q50.gguf | 4.4361 | 0.61% | 4.4297 | 0.47% |
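
For readers who want to check the "Quantization Error" columns, here is a minimal Python sketch of the formula above. The fp16 perplexity is not listed in this README, so the sketch back-computes an estimate from one table row. The function names and that estimate are illustrative assumptions, and the estimate is only as precise as the rounded percentages allow.

```python
def quantization_error(ppl_quantized: float, ppl_fp16: float) -> float:
    """Relative perplexity increase over fp16, returned as a fraction."""
    if ppl_fp16 <= 0:
        raise ValueError("fp16 perplexity must be positive")
    return (ppl_quantized - ppl_fp16) / ppl_fp16


def implied_fp16_ppl(ppl_quantized: float, error_fraction: float) -> float:
    """Invert the formula to estimate the fp16 perplexity from one table row."""
    return ppl_quantized / (1.0 + error_fraction)


# Q5_0 row, llama.cpp columns: PPL 4.4361 with a reported error of 0.61%.
# ASSUMPTION: this back-computed value is an estimate, not a figure from the README.
fp16_estimate = implied_fp16_ppl(4.4361, 0.0061)

# Q5_0 row, new-quants column: PPL 4.4297.
new_quant_error = quantization_error(4.4297, fp16_estimate)

print(f"estimated fp16 PPL: {fp16_estimate:.4f}")
print(f"Q5_0 new-quant error: {new_quant_error:.2%}")
```

Because the table percentages are rounded, the recomputed error can differ from the listed value in the last digit.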
