Initial GGML model commit
README.md
CHANGED
@@ -35,7 +35,7 @@ tags:

This repo contains GGML format model files for [Mikael110's Llama2 70b Guanaco QLoRA](https://huggingface.co/Mikael110/llama-2-70b-guanaco-qlora).

-These 70B Llama 2 GGML files currently only support CPU inference. They are known to work with:
+CUDA GPU acceleration is now available for Llama 2 70B GGML files. Metal acceleration (macOS) is not yet available. I haven't tested AMD acceleration - let me know if it works. The following clients/libraries are known to work with these files, including with CUDA GPU acceleration:
* [llama.cpp](https://github.com/ggerganov/llama.cpp), commit `e76d630` and later.
* [text-generation-webui](https://github.com/oobabooga/text-generation-webui), the most popular web UI.
* [KoboldCpp](https://github.com/LostRuins/koboldcpp), version 1.37 and later. A powerful GGML web UI, especially good for story telling.
@@ -62,8 +62,6 @@ These 70B Llama 2 GGML files currently only support CPU inference. They are kno

Or one of the other tools and libraries listed above.

-There is currently no GPU acceleration; only CPU can be used.
-
To use in llama.cpp, you must add the `-gqa 8` argument.

For other UIs and libraries, please check the docs.
@@ -107,10 +105,12 @@ Refer to the Provided Files table below to see what files use which methods, and
I use the following command line; adjust for your tastes and needs:

```
-./main -t 10 -gqa 8 -m llama-2-70b-guanaco-qlora.ggmlv3.q4_K_M.bin --color -c 4096 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "### Human: Write a story about llamas\n### Assistant:"
+./main -t 10 -ngl 40 -gqa 8 -m llama-2-70b-guanaco-qlora.ggmlv3.q4_K_M.bin --color -c 4096 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "### Human: Write a story about llamas\n### Assistant:"
```
Change `-t 10` to the number of physical CPU cores you have. For example if your system has 8 cores/16 threads, use `-t 8`.

+Change `-ngl 40` to the number of GPU layers you have VRAM for. Use `-ngl 100` to offload all layers to VRAM if you have a 48GB card, 2 x 24GB, or similar. Otherwise you can partially offload as many layers as you have VRAM for, on one or more GPUs.
+
Remember the `-gqa 8` argument, required for Llama 70B models.

If you want to have a chat-style conversation, replace the `-p <PROMPT>` argument with `-i -ins`.
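Putting the notes above together, a chat-style run might look like the sketch below. It is only an illustration of the documented flags: `-t 8` assumes an 8-core machine, `-ngl 40` is a placeholder for however many layers fit in your VRAM, `-gqa 8` stays mandatory for 70B, and `-i -ins` replaces the `-p` prompt for an interactive conversation:

```
# Sketch only: adjust -t to your physical core count and -ngl to your VRAM.
# -i -ins starts an interactive instruct-style chat instead of a one-shot prompt.
./main -t 8 -ngl 40 -gqa 8 -m llama-2-70b-guanaco-qlora.ggmlv3.q4_K_M.bin --color -c 4096 --temp 0.7 --repeat_penalty 1.1 -n -1 -i -ins
```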