---
license: apache-2.0
---

This repository contains improved Mixtral-8x7B quantized models in GGUF format for use with `llama.cpp`. The models are fully compatible with the official `llama.cpp` release and can be used out of the box.
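
For reference, here is a minimal sketch of loading one of these files through the `llama-cpp-python` bindings. These bindings are a separate project not covered by this README (the README targets the `llama.cpp` CLI directly), and the prompt and context size below are just placeholders:

```python
# Minimal sketch, assuming the llama-cpp-python bindings are installed
# (pip install llama-cpp-python); this README itself only requires llama.cpp.
from llama_cpp import Llama

# Hypothetical local path to one of the quantized files from this repo.
llm = Llama(model_path="mixtral-8x7b-q4k-small.gguf", n_ctx=512)

# Run a short completion to verify the model loads and generates.
out = llm("Mixture-of-experts models are", max_tokens=64)
print(out["choices"][0]["text"])
```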
The table below compares these models with the current `llama.cpp` quantization approach, using Wikitext perplexities (PPL) for a context length of 512 tokens. The "Quantization Error" columns are defined as `(PPL(quantized model) - PPL(int8)) / PPL(int8)`. Running the full `fp16` Mixtral-8x7B model on the systems I have available takes too long, so I'm comparing against the 8-bit quantized model, where I get `PPL = 4.1049`. From past experience, 8-bit quantization should be essentially equivalent to `fp16`.

| Quantization | Model file | PPL (llama.cpp) | Quantization Error | PPL (new quants) | Quantization Error |
|--:|--:|--:|--:|--:|--:|
| Q2_K | mixtral-8x7b-q2k.gguf | 7.4660 | 81.9% | 5.0576 | 23.2% |
| Q3_K_S | mixtral-8x7b-q3k-small.gguf | 4.4601 | 8.65% | 4.3848 | 6.82% |
| Q3_K_M | mixtral-8x7b-q3k-medium.gguf | 4.4194 | 7.66% | 4.2884 | 4.47% |
| Q4_K_S | mixtral-8x7b-q4k-small.gguf | 4.2523 | 3.59% | 4.1764 | 1.74% |
| Q4_K_M | mistral-8x7b-q4k-medium.gguf | 4.2523 | 3.59% | 4.1652 | 1.47% |
| Q5_K_S | mixtral-7b-q5k-small.gguf | 4.1395 | 0.84% | 4.1278 | 0.56% |
| Q4_0 | mixtral-8x7b-q40.gguf | 4.2232 | 2.88% | 4.2001 | 2.32% |
| Q4_1 | mistral-8x7b-q41.gguf | 4.2547 | 3.65% | 4.1713 | 1.62% |
| Q5_0 | mistral-8x7b-q50.gguf | 4.1426 | 0.92% | 4.1335 | 0.70% |
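
As a sanity check, the short sketch below recomputes the "Quantization Error" column from the PPL numbers above, using the int8 reference `PPL = 4.1049` quoted earlier and the Q4_K_M "new quants" perplexity from the table as the example input:

```python
# Relative quantization error as defined above:
# (PPL(quantized) - PPL(int8)) / PPL(int8)
def quantization_error(ppl_quantized: float, ppl_int8: float = 4.1049) -> float:
    return (ppl_quantized - ppl_int8) / ppl_int8

# Example: Q4_K_M with the new quants has PPL = 4.1652 in the table above.
print(f"{quantization_error(4.1652):.2%}")  # -> 1.47%
```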