---
license: llama3
base_model: meta-llama/Meta-Llama-3-8B-Instruct
---

# 32K GGUF of LLAMA3-8B-INSTRUCT

### *THIS IS NOT A FINETUNE. IT JUST WORKS GREAT VIA YARN SCALING.*

## imatrix custom edge-quants tested OK at 4, 3 and 2 bit

> [!TIP]
> You have to set the context size with ***-c 32000*** in llama.cpp to take advantage of the 32K window when you run it.

|
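
A 32000-token context also needs memory for the KV cache on top of the model weights. The sketch below gives a rough estimate only. It assumes the published Meta-Llama-3-8B configuration (32 layers, 8 KV heads, head dimension 128) and an f16 KV cache; the function name `kv_cache_bytes` is hypothetical, and the amount llama.cpp actually allocates may differ.

```python
def kv_cache_bytes(n_ctx, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_value=2):
    """Rough KV cache size in bytes for a given context length.

    Defaults are assumptions based on the Meta-Llama-3-8B config
    and a 16-bit cache; adjust them for other setups.
    """
    # Factor 2: one K tensor and one V tensor are cached per layer.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * n_ctx


if __name__ == "__main__":
    size = kv_cache_bytes(32000)
    print(f"KV cache at -c 32000: about {size / 2**30:.2f} GiB")
```

Under these assumptions the cache costs 128 KiB per token, so a smaller `-c` value is the first thing to try if the model does not fit in memory.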
## How to run the model in interactive mode using llama.cpp, with a long prompt loaded from a text file via -f

```bash
# Build llama.cpp from source
git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp && make -j

# Interactive run (-i): prompt read from a file (-f), 33 layers offloaded to the GPU (-ngl),
# up to 2000 generated tokens (-n), 32000-token context (-c)
./main -m llama3ins-8b-32k-q4ns.gguf --temp 0.3 --color -f mylongprompt.txt -ngl 33 -n 2000 -i -c 32000
```

## Prompt format - paste a prompt of up to 32000 tokens inside the user{} brackets

> [!TIP]
> Put this inside your ***mylongprompt.txt*** file,
> or copy it from below and add it to the command above like this: -p "<|im_start....."

```xml
<|im_start|>system{You are a hyperintelligent hilarious raccoon that solves everything via first-principles based reasoning.}<|im_end|>
<|im_start|>user{How to build a city on mars via aldrin cycler orbits DUMP THE BIG LONG PROMPT HERE.}
<|im_end|>assistant
```
|
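
For long inputs it can be easier to generate the prompt file than to paste into it by hand. This is a minimal sketch that fills the template above and writes `mylongprompt.txt`; the helper `build_prompt` and the input file name `my_document.txt` are hypothetical, and the template text is copied as-is from this README.

```python
from pathlib import Path


def build_prompt(system: str, user: str) -> str:
    """Fill the README's prompt template with a system and a user message."""
    # Triple braces in an f-string emit a literal brace around the value.
    return (
        f"<|im_start|>system{{{system}}}<|im_end|>\n"
        f"<|im_start|>user{{{user}}}\n"
        "<|im_end|>assistant\n"
    )


if __name__ == "__main__":
    # Hypothetical input file holding the long text to ask about.
    long_text = Path("my_document.txt").read_text(encoding="utf-8")
    prompt = build_prompt(
        "You are a hyperintelligent hilarious raccoon that solves everything "
        "via first-principles based reasoning.",
        long_text,
    )
    Path("mylongprompt.txt").write_text(prompt, encoding="utf-8")
```

The resulting file can then be passed to `./main` with `-f mylongprompt.txt`.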
## Perplexity Benchmarks

```bash
./perplexity -m ../llama3ins-8b-32k-f16.gguf -ngl 99 -f wiki.test.raw --chunks 16
perplexity: 2.10 seconds per pass - ETA 0.13 minutes
[1]6.1736,[2]6.8769,[3]7.4226,[4]8.0199,[5]8.4531,[6]8.7808,[7]9.3213,[8]10.0461,[9]10.7468,[10]11.0909,[11]11.2691,[12]11.4318,[13]11.9160,[14]11.4038,[15]11.2641,[16]10.9073,
Final estimate: PPL = 10.9073 +/- 0.50026

./perplexity -m ../llama3ins-8b-32k-q8.gguf -ngl 99 -f wiki.test.raw --chunks 16   # YES, 8BIT IS BETTER THAN THE BF16 -> F16 CONVERSION
perplexity: 2.38 seconds per pass - ETA 0.15 minutes
[1]6.1454,[2]6.8672,[3]7.4109,[4]8.0148,[5]8.4472,[6]8.7771,[7]9.3182,[8]10.0466,[9]10.7509,[10]11.0836,[11]11.2563,[12]11.4218,[13]11.9095,[14]11.4000,[15]11.2587,[16]10.9028,
Final estimate: PPL = 10.9028 +/- 0.49958

./perplexity -m ../llama3ins-8b-32k-q6.gguf -ngl 99 -f wiki.test.raw --chunks 16
perplexity: 2.36 seconds per pass - ETA 0.15 minutes
[1]6.0654,[2]6.7806,[3]7.3319,[4]7.9600,[5]8.3961,[6]8.7512,[7]9.2932,[8]10.0314,[9]10.7402,[10]11.0786,[11]11.2597,[12]11.4410,[13]11.9342,[14]11.4223,[15]11.2818,[16]10.9354,
Final estimate: PPL = 10.9354 +/- 0.50190

./perplexity -m ../llama3ins-8b-32k-q5km.gguf -ngl 99 -f wiki.test.raw --chunks 16
perplexity: 2.40 seconds per pass - ETA 0.15 minutes
[1]6.0044,[2]6.8263,[3]7.3989,[4]8.0044,[5]8.4508,[6]8.7716,[7]9.3220,[8]10.0606,[9]10.7709,[10]11.1098,[11]11.2956,[12]11.4743,[13]11.9661,[14]11.4569,[15]11.3028,[16]10.9474,
Final estimate: PPL = 10.9474 +/- 0.50185

./perplexity -m ../llama3ins-8b-32k-q4ns.gguf -ngl 99 -f wiki.test.raw --chunks 16
perplexity: 2.40 seconds per pass - ETA 0.15 minutes
[1]6.5618,[2]7.1233,[3]7.5647,[4]8.1198,[5]8.5365,[6]8.8386,[7]9.4233,[8]10.1359,[9]10.8601,[10]11.1981,[11]11.3705,[12]11.5619,[13]12.0492,[14]11.5287,[15]11.3823,[16]11.0269,
Final estimate: PPL = 11.0269 +/- 0.50623

# IQ4_XS, NON IMATRIX, FOR REFERENCE: quite a bit worse than my imat one
perplexity: 7.41 seconds per pass - ETA 0.48 minutes
[1]6.9103,[2]7.4907,[3]7.9577,[4]8.3949,[5]8.8029,[6]9.0275,[7]9.6252,[8]10.2914,[9]10.9833,[10]11.3498,[11]11.5059,[12]11.7275,[13]12.1804,[14]11.6848,[15]11.5226,[16]11.1761,
Final estimate: PPL = 11.1761 +/- 0.51803

./perplexity -m ../llama3ins-8b-32k-q3ns.gguf -ngl 99 -f wiki.test.raw --chunks 16
perplexity: 2.43 seconds per pass - ETA 0.15 minutes
[1]6.6955,[2]7.2732,[3]7.9483,[4]8.5310,[5]9.0020,[6]9.3664,[7]9.9324,[8]10.7019,[9]11.4163,[10]11.6981,[11]11.8420,[12]12.1191,[13]12.6709,[14]12.1222,[15]11.9778,[16]11.5624,
Final estimate: PPL = 11.5624 +/- 0.53444

./perplexity -m ../llama3ins-8b-32k-q2ns.gguf -ngl 99 -f wiki.test.raw --chunks 16   # SURPRISINGLY USABLE
perplexity: 2.48 seconds per pass - ETA 0.15 minutes
[1]7.0861,[2]7.8057,[3]8.5360,[4]9.1910,[5]9.6240,[6]10.0848,[7]10.7928,[8]11.4729,[9]12.3032,[10]12.5115,[11]12.7422,[12]13.1224,[13]13.7716,[14]13.1772,[15]13.0020,[16]12.5578,
Final estimate: PPL = 12.5578 +/- 0.57323

./perplexity -m ../llama3ins-8b-32k-q1ns.gguf -ngl 99 -f wiki.test.raw --chunks 16   # ONE BIT TURNS TO JUNK
perplexity: 2.41 seconds per pass - ETA 0.15 minutes
[1]15.1640,[2]16.2585,[3]17.8912,[4]18.2226,[5]18.4974,[6]19.2407,[7]20.0085,[8]21.6465,[9]22.7656,[10]22.7903,[11]23.2208,[12]24.2318,[13]25.7172,[14]24.5111,[15]23.8096,[16]22.7933,
Final estimate: PPL = 22.7933 +/- 1.05192
```
|
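
To make the runs above easier to compare, the final estimates can be expressed as a percentage change against the f16 baseline. This is a small sketch: the numbers are copied from the `Final estimate` lines in this README, while the helper names are hypothetical. Keep in mind that each estimate carries an error of roughly +/- 0.5, so small differences are not significant.

```python
# Final perplexity estimates copied from the benchmark runs in this README.
FINAL_PPL = {
    "f16": 10.9073,
    "q8": 10.9028,
    "q6": 10.9354,
    "q5km": 10.9474,
    "q4ns": 11.0269,
    "IQ4_XS (no imatrix)": 11.1761,
    "q3ns": 11.5624,
    "q2ns": 12.5578,
    "q1ns": 22.7933,
}


def relative_increase(ppl: float, baseline: float) -> float:
    """Percentage change of ppl relative to baseline (positive = worse)."""
    return (ppl / baseline - 1.0) * 100.0


def summarize(results: dict, baseline_key: str = "f16") -> dict:
    """Map each quant name to its percentage change against the baseline."""
    base = results[baseline_key]
    return {name: relative_increase(ppl, base) for name, ppl in results.items()}


if __name__ == "__main__":
    for name, pct in summarize(FINAL_PPL).items():
        print(f"{name:<22} PPL {FINAL_PPL[name]:>8.4f}  {pct:+7.2f}% vs f16")
```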
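> [!TIP]
> Yes, 8-bit q8_0 scores slightly better than f16 here (PPL 10.9028 vs 10.9073, well inside the error margin). A likely cause is the bf16 to f16 conversion: it does not cut mantissa bits (f16 has 10, bf16 has 7), but f16 has only 5 exponent bits against 8 in bf16, so it loses dynamic range.
>
> The ns quants are custom nisten quants and work well down to 2 bit.
>
> A 1.75-bit quant is included for reference; however, its perplexity tanks and its output is incoherent.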
|
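
The bf16 to f16 point can be checked with the standard library alone. bf16 keeps 8 exponent bits and 7 mantissa bits, while f16 keeps 5 and 10, so the conversion preserves precision but narrows the representable range. The sketch below emulates bf16 by truncating a float32 to its top 16 bits; the helper names are hypothetical, and whether this effect really explains the q8_0 result above is an assumption, not something measured here.

```python
import struct


def to_bf16(x: float) -> float:
    """Emulate bfloat16 by keeping only the top 16 bits of a float32."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


def to_f16(x: float) -> float:
    """Round-trip a value through IEEE 754 half precision."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


if __name__ == "__main__":
    tiny = to_bf16(1e-8)            # representable in bf16 thanks to 8 exponent bits
    print(tiny, to_f16(tiny))       # f16 cannot hold it: it underflows to zero

    precise = to_bf16(1.0078125)    # 1 + 2**-7, the finest bf16 step above 1.0
    print(precise, to_f16(precise)) # f16 has more mantissa bits, so nothing is lost
```

Small-magnitude values are the ones at risk in this conversion; values that fit inside the f16 range pass through unchanged.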

# Built with Meta Llama 3