Initial GGML model commit
README.md
CHANGED
@@ -35,7 +35,7 @@ tags:

This repo contains GGML format model files for [Mikael110's Llama2 70b Guanaco QLoRA](https://huggingface.co/Mikael110/llama-2-70b-guanaco-qlora).

-These 70B Llama 2 GGML files currently only support CPU inference. They are known to work with:
+CUDA GPU acceleration is now available for Llama 2 70B GGML files. Metal acceleration (macOS) is not yet available. I haven't tested AMD acceleration - let me know if it works. The following clients/libraries are known to work with these files, including with CUDA GPU acceleration:
* [llama.cpp](https://github.com/ggerganov/llama.cpp), commit `e76d630` and later.
* [text-generation-webui](https://github.com/oobabooga/text-generation-webui), the most popular web UI.
* [KoboldCpp](https://github.com/LostRuins/koboldcpp), version 1.37 and later. A powerful GGML web UI, especially good for story telling.
@@ -62,8 +62,6 @@ These 70B Llama 2 GGML files currently only support CPU inference. They are kno

Or one of the other tools and libraries listed above.

-There is currently no GPU acceleration; only CPU can be used.
-
To use in llama.cpp, you must add the `-gqa 8` argument.

For other UIs and libraries, please check the docs.
@@ -107,10 +105,12 @@ Refer to the Provided Files table below to see what files use which methods, and
I use the following command line; adjust for your tastes and needs:

```
-./main -t 10 -gqa 8 -m llama-2-70b-guanaco-qlora.ggmlv3.q4_K_M.bin --color -c 4096 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "### Human: Write a story about llamas\n### Assistant:"
+./main -t 10 -ngl 40 -gqa 8 -m llama-2-70b-guanaco-qlora.ggmlv3.q4_K_M.bin --color -c 4096 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "### Human: Write a story about llamas\n### Assistant:"
```
Change `-t 10` to the number of physical CPU cores you have. For example if your system has 8 cores/16 threads, use `-t 8`.

+Change `-ngl 40` to the number of GPU layers you have VRAM for. Use `-ngl 100` to offload all layers to VRAM if you have a 48GB card, 2 x 24GB, or similar. Otherwise you can partially offload as many layers as you have VRAM for, on one or more GPUs.
+
Remember the `-gqa 8` argument, required for Llama 70B models.

If you want to have a chat-style conversation, replace the `-p <PROMPT>` argument with `-i -ins`.
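Putting the notes above together, a chat-style run might look like the sketch below. It is only an illustration of the documented flags: `-t 8` assumes an 8-core machine, `-ngl 40` is a placeholder for however many layers fit in your VRAM, `-gqa 8` stays mandatory for 70B, and `-i -ins` replaces the `-p` prompt for an interactive conversation:

```
# Sketch only: adjust -t to your physical core count and -ngl to your VRAM.
# -i -ins starts an interactive instruct-style chat instead of a one-shot prompt.
./main -t 8 -ngl 40 -gqa 8 -m llama-2-70b-guanaco-qlora.ggmlv3.q4_K_M.bin --color -c 4096 --temp 0.7 --repeat_penalty 1.1 -n -1 -i -ins
```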