jartine committed
Commit 59f4894
1 Parent(s): a5894ec

Add README.md to repo

Files changed (1)
  1. README.md +44 -54
README.md CHANGED
@@ -45,40 +45,38 @@ widget:
45
  <!-- header start -->
46
  <!-- 200823 -->
47
  <div style="width: auto; margin-left: auto; margin-right: auto">
48
- <img src="https://i.imgur.com/EBdldam.jpg" alt="TheBlokeAI" style="width: 100%; min-width: 400px; display: block; margin: auto;">
49
  </div>
50
  <div style="display: flex; justify-content: space-between; width: 100%;">
51
  <div style="display: flex; flex-direction: column; align-items: flex-start;">
52
- <p style="margin-top: 0.5em; margin-bottom: 0em;"><a href="https://discord.gg/theblokeai">Chat & support: TheBloke's Discord server</a></p>
53
  </div>
54
  <div style="display: flex; flex-direction: column; align-items: flex-end;">
55
- <p style="margin-top: 0.5em; margin-bottom: 0em;"><a href="https://www.patreon.com/TheBlokeAI">Want to contribute? TheBloke's Patreon page</a></p>
56
  </div>
57
  </div>
58
- <div style="text-align:center; margin-top: 0em; margin-bottom: 0em"><p style="margin-top: 0.25em; margin-bottom: 0em;">TheBloke's LLM work is generously supported by a grant from <a href="https://a16z.com">andreessen horowitz (a16z)</a></p></div>
59
  <hr style="margin-top: 1.0em; margin-bottom: 1.0em;">
60
  <!-- header end -->
61
 
62
- # Mixtral 8X7B Instruct v0.1 - GGUF
63
  - Model creator: [Mistral AI_](https://huggingface.co/mistralai)
64
  - Original model: [Mixtral 8X7B Instruct v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1)
65
 
66
  <!-- description start -->
67
  ## Description
68
 
69
- This repo contains GGUF format model files for [Mistral AI_'s Mixtral 8X7B Instruct v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1).
70
 
71
- <!-- description end -->
72
- <!-- README_GGUF.md-about-gguf start -->
73
- ### About GGUF
74
 
75
- GGUF is a new format introduced by the llama.cpp team on August 21st 2023. It is a replacement for GGML, which is no longer supported by llama.cpp.
76
 
77
- ### Mixtral GGUF
78
 
79
  Support for Mixtral was merged into Llama.cpp on December 13th.
80
 
81
- These Mixtral GGUFs are known to work in:
82
 
83
  * llama.cpp as of December 13th
84
  * KoboldCpp 1.52 and later
@@ -87,13 +85,13 @@ These Mixtral GGUFs are known to work in:
87
 
88
  Other clients/libraries, not listed above, may not yet work.
89
 
90
- <!-- README_GGUF.md-about-gguf end -->
91
  <!-- repositories-available start -->
92
  ## Repositories available
93
 
94
- * [AWQ model(s) for GPU inference.](https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ)
95
- * [GPTQ models for GPU inference, with multiple quantisation parameter options.](https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ)
96
- * [2, 3, 4, 5, 6 and 8-bit GGUF models for CPU+GPU inference](https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF)
97
  * [Mistral AI_'s original unquantised fp16 model in pytorch format, for GPU inference and for further conversions](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1)
98
  <!-- repositories-available end -->
99
 
@@ -108,10 +106,10 @@ Other clients/libraries, not listed above, may not yet work.
108
  <!-- prompt-template end -->
109
 
110
 
111
- <!-- compatibility_gguf start -->
112
  ## Compatibility
113
 
114
- These Mixtral GGUFs are compatible with llama.cpp from December 13th onwards. Other clients/libraries may not work yet.
115
 
116
  ## Explanation of quantisation methods
117
 
@@ -128,30 +126,30 @@ The new methods available are:
128
 
129
  Refer to the Provided Files table below to see what files use which methods, and how.
130
  </details>
131
- <!-- compatibility_gguf end -->
132
 
133
- <!-- README_GGUF.md-provided-files start -->
134
  ## Provided files
135
 
136
  | Name | Quant method | Bits | Size | Max RAM required | Use case |
137
  | ---- | ---- | ---- | ---- | ---- | ----- |
138
- | [mixtral-8x7b-instruct-v0.1.Q2_K.gguf](https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF/blob/main/mixtral-8x7b-instruct-v0.1.Q2_K.gguf) | Q2_K | 2 | 15.64 GB| 18.14 GB | smallest, significant quality loss - not recommended for most purposes |
139
- | [mixtral-8x7b-instruct-v0.1.Q3_K_M.gguf](https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF/blob/main/mixtral-8x7b-instruct-v0.1.Q3_K_M.gguf) | Q3_K_M | 3 | 20.36 GB| 22.86 GB | very small, high quality loss |
140
- | [mixtral-8x7b-instruct-v0.1.Q4_0.gguf](https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF/blob/main/mixtral-8x7b-instruct-v0.1.Q4_0.gguf) | Q4_0 | 4 | 26.44 GB| 28.94 GB | legacy; small, very high quality loss - prefer using Q3_K_M |
141
- | [mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf](https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF/blob/main/mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf) | Q4_K_M | 4 | 26.44 GB| 28.94 GB | medium, balanced quality - recommended |
142
- | [mixtral-8x7b-instruct-v0.1.Q5_0.gguf](https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF/blob/main/mixtral-8x7b-instruct-v0.1.Q5_0.gguf) | Q5_0 | 5 | 32.23 GB| 34.73 GB | legacy; medium, balanced quality - prefer using Q4_K_M |
143
- | [mixtral-8x7b-instruct-v0.1.Q5_K_M.gguf](https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF/blob/main/mixtral-8x7b-instruct-v0.1.Q5_K_M.gguf) | Q5_K_M | 5 | 32.23 GB| 34.73 GB | large, very low quality loss - recommended |
144
- | [mixtral-8x7b-instruct-v0.1.Q6_K.gguf](https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF/blob/main/mixtral-8x7b-instruct-v0.1.Q6_K.gguf) | Q6_K | 6 | 38.38 GB| 40.88 GB | very large, extremely low quality loss |
145
- | [mixtral-8x7b-instruct-v0.1.Q8_0.gguf](https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF/blob/main/mixtral-8x7b-instruct-v0.1.Q8_0.gguf) | Q8_0 | 8 | 49.62 GB| 52.12 GB | very large, extremely low quality loss - not recommended |
146
 
147
  **Note**: the above RAM figures assume no GPU offloading. If layers are offloaded to the GPU, this will reduce RAM usage and use VRAM instead.
148
 
149
 
150
 
151
- <!-- README_GGUF.md-provided-files end -->
152
 
153
- <!-- README_GGUF.md-how-to-download start -->
154
- ## How to download GGUF files
155
 
156
  **Note for manual downloaders:** You almost never want to clone the entire repo! Multiple different quantisation formats are provided, and most users only want to pick and download a single file.
157
 
@@ -163,7 +161,7 @@ The following clients/libraries will automatically download models for you, prov
163
 
164
  ### In `text-generation-webui`
165
 
166
- Under Download Model, you can enter the model repo: TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF and below it, a specific filename to download, such as: mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf.
167
 
168
  Then click Download.
169
 
@@ -178,7 +176,7 @@ pip3 install huggingface-hub
178
  Then you can download any individual model file to the current directory, at high speed, with a command like this:
179
 
180
  ```shell
181
- huggingface-cli download TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf --local-dir . --local-dir-use-symlinks False
182
  ```
183
 
184
  <details>
@@ -187,7 +185,7 @@ huggingface-cli download TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF mixtral-8x7b-i
187
  You can also download multiple files at once with a pattern:
188
 
189
  ```shell
190
- huggingface-cli download TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF --local-dir . --local-dir-use-symlinks False --include='*Q4_K*gguf'
191
  ```
192
 
193
  For more documentation on downloading with `huggingface-cli`, please see: [HF -> Hub Python Library -> Download files -> Download from the CLI](https://huggingface.co/docs/huggingface_hub/guides/download#download-from-the-cli).
@@ -201,25 +199,25 @@ pip3 install hf_transfer
201
  And set environment variable `HF_HUB_ENABLE_HF_TRANSFER` to `1`:
202
 
203
  ```shell
204
- HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf --local-dir . --local-dir-use-symlinks False
205
  ```
206
 
207
  Windows Command Line users: You can set the environment variable by running `set HF_HUB_ENABLE_HF_TRANSFER=1` before the download command.
208
  </details>
209
- <!-- README_GGUF.md-how-to-download end -->
210
 
211
- <!-- README_GGUF.md-how-to-run start -->
212
  ## Example `llama.cpp` command
213
 
214
  Make sure you are using `llama.cpp` from commit [d0cee0d](https://github.com/ggerganov/llama.cpp/commit/d0cee0d36d5be95a0d9088b674dbb27354107221) or later.
215
 
216
  ```shell
217
- ./main -ngl 35 -m mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf --color -c 2048 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "[INST] {prompt} [/INST]"
218
  ```
219
 
220
  Change `-ngl 35` to the number of layers to offload to GPU. Remove it if you don't have GPU acceleration.
221
 
222
- Change `-c 2048` to the desired sequence length. For extended sequence models - eg 8K, 16K, 32K - the necessary RoPE scaling parameters are read from the GGUF file and set by llama.cpp automatically. Note that longer sequence lengths require much more resources, so you may need to reduce this value.
223
 
224
  If you want to have a chat-style conversation, replace the `-p <PROMPT>` argument with `-i -ins`
225
 
@@ -227,13 +225,13 @@ For other parameters and how to use them, please refer to [the llama.cpp documen
227
 
228
  ## How to run in `text-generation-webui`
229
 
230
- Note that text-generation-webui may not yet be compatible with Mixtral GGUFs. Please check compatibility first.
231
 
232
  Further instructions can be found in the text-generation-webui documentation, here: [text-generation-webui/docs/04 ‐ Model Tab.md](https://github.com/oobabooga/text-generation-webui/blob/main/docs/04%20%E2%80%90%20Model%20Tab.md#llamacpp).
233
 
234
  ## How to run from Python code
235
 
236
- You can use GGUF models from Python using the [llama-cpp-python](https://github.com/abetlen/llama-cpp-python) version 0.2.23 and later.
237
 
238
  ### How to load this model in Python code, using llama-cpp-python
239
 
@@ -269,7 +267,7 @@ from llama_cpp import Llama
269
 
270
  # Set gpu_layers to the number of layers to offload to GPU. Set to 0 if no GPU acceleration is available on your system.
271
  llm = Llama(
272
- model_path="./mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf", # Download the model file first
273
  n_ctx=2048, # The max sequence length to use - note that longer sequence lengths require much more resources
274
  n_threads=8, # The number of CPU threads to use, tailor to your system and the resulting performance
275
  n_gpu_layers=35 # The number of layers to offload to GPU, if you have GPU acceleration available
@@ -285,7 +283,7 @@ output = llm(
285
 
286
  # Chat Completion API
287
 
288
- llm = Llama(model_path="./mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf", chat_format="llama-2") # Set chat_format according to the model you are using
289
  llm.create_chat_completion(
290
  messages = [
291
  {"role": "system", "content": "You are a story writing assistant."},
@@ -303,7 +301,7 @@ Here are guides on using llama-cpp-python and ctransformers with LangChain:
303
 
304
  * [LangChain + llama-cpp-python](https://python.langchain.com/docs/integrations/llms/llamacpp)
305
 
306
- <!-- README_GGUF.md-how-to-run end -->
307
 
308
  <!-- footer start -->
309
  <!-- 200823 -->
@@ -311,31 +309,23 @@ Here are guides on using llama-cpp-python and ctransformers with LangChain:
311
 
312
  For further support, and discussions on these models and AI in general, join us at:
313
 
314
- [TheBloke AI's Discord server](https://discord.gg/theblokeai)
315
 
316
  ## Thanks, and how to contribute
317
 
318
- Thanks to the [chirper.ai](https://chirper.ai) team!
319
 
320
- Thanks to Clay from [gpus.llm-utils.org](llm-utils)!
321
 
322
  I've had a lot of people ask if they can contribute. I enjoy providing models and helping people, and would love to be able to spend even more time doing it, as well as expanding into new projects like fine tuning/training.
323
 
324
  If you're able and willing to contribute, it will be most gratefully received and will help me to keep providing more models, and to start work on new AI projects.
325
 
326
- Donaters will get priority support on any and all AI/LLM/model questions and requests, access to a private Discord room, plus other benefits.
327
 
328
- * Patreon: https://patreon.com/TheBlokeAI
329
- * Ko-Fi: https://ko-fi.com/TheBlokeAI
330
 
331
- **Special thanks to**: Aemon Algiz.
332
 
333
- **Patreon special mentions**: Michael Levine, 阿明, Trailburnt, Nikolai Manek, John Detwiler, Randy H, Will Dee, Sebastain Graf, NimbleBox.ai, Eugene Pentland, Emad Mostaque, Ai Maven, Jim Angel, Jeff Scroggin, Michael Davis, Manuel Alberto Morcote, Stephen Murray, Robert, Justin Joy, Luke @flexchar, Brandon Frisco, Elijah Stavena, S_X, Dan Guido, Undi ., Komninos Chatzipapas, Shadi, theTransient, Lone Striker, Raven Klaugh, jjj, Cap'n Zoog, Michel-Marie MAUDET (LINAGORA), Matthew Berman, David, Fen Risland, Omer Bin Jawed, Luke Pendergrass, Kalila, OG, Erik Bjäreholt, Rooh Singh, Joseph William Delisle, Dan Lewis, TL, John Villwock, AzureBlack, Brad, Pedro Madruga, Caitlyn Gatomon, K, jinyuan sun, Mano Prime, Alex, Jeffrey Morgan, Alicia Loh, Illia Dulskyi, Chadd, transmissions 11, fincy, Rainer Wilmers, ReadyPlayerEmma, knownsqashed, Mandus, biorpg, Deo Leter, Brandon Phillips, SuperWojo, Sean Connelly, Iucharbius, Jack West, Harry Royden McLaughlin, Nicholas, terasurfer, Vitor Caleffi, Duane Dunston, Johann-Peter Hartmann, David Ziegler, Olakabola, Ken Nordquist, Trenton Dambrowitz, Tom X Nguyen, Vadim, Ajan Kanaga, Leonard Tan, Clay Pascal, Alexandros Triantafyllidis, JM33133, Xule, vamX, ya boyyy, subjectnull, Talal Aujan, Alps Aficionado, wassieverse, Ari Malik, James Bentley, Woland, Spencer Kim, Michael Dempsey, Fred von Graf, Elle, zynix, William Richards, Stanislav Ovsiannikov, Edmond Seymore, Jonathan Leane, Martin Kemka, usrbinkat, Enrico Ros
334
 
335
 
336
- Thank you to all my generous patrons and donaters!
337
 
338
- And thank you again to a16z for their generous grant.
339
 
340
  <!-- footer end -->
341
 
 
45
  <!-- header start -->
46
  <!-- 200823 -->
47
  <div style="width: auto; margin-left: auto; margin-right: auto">
 
48
  </div>
49
  <div style="display: flex; justify-content: space-between; width: 100%;">
50
  <div style="display: flex; flex-direction: column; align-items: flex-start;">
51
+ <p style="margin-top: 0.5em; margin-bottom: 0em;"><a href="https://discord.gg/FwAVVu7eJ4">Chat & support: jartine's Discord server</a></p>
52
  </div>
53
  <div style="display: flex; flex-direction: column; align-items: flex-end;">
 
54
  </div>
55
  </div>
56
+ <div style="text-align:center; margin-top: 0em; margin-bottom: 0em"><p style="margin-top: 0.25em; margin-bottom: 0em;">jartine's LLM work is generously supported by a grant from <a href="https://mozilla.org">mozilla</a></p></div>
57
  <hr style="margin-top: 1.0em; margin-bottom: 1.0em;">
58
  <!-- header end -->
59
 
60
+ # Mixtral 8X7B Instruct v0.1 - llamafile
61
  - Model creator: [Mistral AI_](https://huggingface.co/mistralai)
62
  - Original model: [Mixtral 8X7B Instruct v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1)
63
 
64
  <!-- description start -->
65
  ## Description
66
 
67
+ This repo contains llamafile format model files for [Mistral AI_'s Mixtral 8X7B Instruct v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1).
68
 
69
+ WARNING: This README may contain inaccuracies. It was generated automatically by forking <a href="/TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF">TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF</a> and piping the README through sed. Errors should be reported to jartine, and do not reflect on TheBloke. You can also support his work on [Patreon](https://www.patreon.com/TheBlokeAI).
70
+ <!-- README_llamafile.md-about-llamafile start -->
71
+ ### About llamafile
72
 
73
+ llamafile is a new format introduced by Mozilla Ocho on Nov 20th 2023. It uses Cosmopolitan Libc to turn LLM weights into runnable llama.cpp binaries that run on the stock installs of six OSes for both ARM64 and AMD64.
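+ As a rough sketch of what that means in practice (assuming the file you download is a self-contained llamafile executable), on macOS or Linux you can mark it executable and launch it directly:
+
+ ```shell
+ # minimal sketch; the filename is one of the files listed later in this README
+ chmod +x mixtral-8x7b-instruct-v0.1.Q4_K_M.llamafile   # mark the llamafile as executable
+ ./mixtral-8x7b-instruct-v0.1.Q4_K_M.llamafile --help   # show the llama.cpp-style options it accepts
+ ```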
74
 
75
+ ### Mixtral llamafile
76
 
77
  Support for Mixtral was merged into Llama.cpp on December 13th.
78
 
79
+ These Mixtral llamafiles are known to work in:
80
 
81
  * llama.cpp as of December 13th
82
  * KoboldCpp 1.52 and later
 
85
 
86
  Other clients/libraries, not listed above, may not yet work.
87
 
88
+ <!-- README_llamafile.md-about-llamafile end -->
89
  <!-- repositories-available start -->
90
  ## Repositories available
91
 
92
+ * [AWQ model(s) for GPU inference.](https://huggingface.co/jartine/Mixtral-8x7B-Instruct-v0.1-AWQ)
93
+ * [GPTQ models for GPU inference, with multiple quantisation parameter options.](https://huggingface.co/jartine/Mixtral-8x7B-Instruct-v0.1-GPTQ)
94
+ * [2, 3, 4, 5, 6 and 8-bit llamafile models for CPU+GPU inference](https://huggingface.co/jartine/Mixtral-8x7B-Instruct-v0.1-llamafile)
95
  * [Mistral AI_'s original unquantised fp16 model in pytorch format, for GPU inference and for further conversions](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1)
96
  <!-- repositories-available end -->
97
 
 
106
  <!-- prompt-template end -->
107
 
108
 
109
+ <!-- compatibility_llamafile start -->
110
  ## Compatibility
111
 
112
+ These Mixtral llamafiles are compatible with llama.cpp from December 13th onwards. Other clients/libraries may not work yet.
113
 
114
  ## Explanation of quantisation methods
115
 
 
126
 
127
  Refer to the Provided Files table below to see what files use which methods, and how.
128
  </details>
129
+ <!-- compatibility_llamafile end -->
130
 
131
+ <!-- README_llamafile.md-provided-files start -->
132
  ## Provided files
133
 
134
  | Name | Quant method | Bits | Size | Max RAM required | Use case |
135
  | ---- | ---- | ---- | ---- | ---- | ----- |
136
+ | [mixtral-8x7b-instruct-v0.1.Q2_K.llamafile](https://huggingface.co/jartine/Mixtral-8x7B-Instruct-v0.1-llamafile/blob/main/mixtral-8x7b-instruct-v0.1.Q2_K.llamafile) | Q2_K | 2 | 15.64 GB| 18.14 GB | smallest, significant quality loss - not recommended for most purposes |
137
+ | [mixtral-8x7b-instruct-v0.1.Q3_K_M.llamafile](https://huggingface.co/jartine/Mixtral-8x7B-Instruct-v0.1-llamafile/blob/main/mixtral-8x7b-instruct-v0.1.Q3_K_M.llamafile) | Q3_K_M | 3 | 20.36 GB| 22.86 GB | very small, high quality loss |
138
+ | [mixtral-8x7b-instruct-v0.1.Q4_0.llamafile](https://huggingface.co/jartine/Mixtral-8x7B-Instruct-v0.1-llamafile/blob/main/mixtral-8x7b-instruct-v0.1.Q4_0.llamafile) | Q4_0 | 4 | 26.44 GB| 28.94 GB | legacy; small, very high quality loss - prefer using Q3_K_M |
139
+ | [mixtral-8x7b-instruct-v0.1.Q4_K_M.llamafile](https://huggingface.co/jartine/Mixtral-8x7B-Instruct-v0.1-llamafile/blob/main/mixtral-8x7b-instruct-v0.1.Q4_K_M.llamafile) | Q4_K_M | 4 | 26.44 GB| 28.94 GB | medium, balanced quality - recommended |
140
+ | [mixtral-8x7b-instruct-v0.1.Q5_0.llamafile](https://huggingface.co/jartine/Mixtral-8x7B-Instruct-v0.1-llamafile/blob/main/mixtral-8x7b-instruct-v0.1.Q5_0.llamafile) | Q5_0 | 5 | 32.23 GB| 34.73 GB | legacy; medium, balanced quality - prefer using Q4_K_M |
141
+ | [mixtral-8x7b-instruct-v0.1.Q5_K_M.llamafile](https://huggingface.co/jartine/Mixtral-8x7B-Instruct-v0.1-llamafile/blob/main/mixtral-8x7b-instruct-v0.1.Q5_K_M.llamafile) | Q5_K_M | 5 | 32.23 GB| 34.73 GB | large, very low quality loss - recommended |
142
+ | [mixtral-8x7b-instruct-v0.1.Q6_K.llamafile](https://huggingface.co/jartine/Mixtral-8x7B-Instruct-v0.1-llamafile/blob/main/mixtral-8x7b-instruct-v0.1.Q6_K.llamafile) | Q6_K | 6 | 38.38 GB| 40.88 GB | very large, extremely low quality loss |
143
+ | [mixtral-8x7b-instruct-v0.1.Q8_0.llamafile](https://huggingface.co/jartine/Mixtral-8x7B-Instruct-v0.1-llamafile/blob/main/mixtral-8x7b-instruct-v0.1.Q8_0.llamafile) | Q8_0 | 8 | 49.62 GB| 52.12 GB | very large, extremely low quality loss - not recommended |
144
 
145
  **Note**: the above RAM figures assume no GPU offloading. If layers are offloaded to the GPU, this will reduce RAM usage and use VRAM instead.
146
 
147
 
148
 
149
+ <!-- README_llamafile.md-provided-files end -->
150
 
151
+ <!-- README_llamafile.md-how-to-download start -->
152
+ ## How to download llamafiles
153
 
154
  **Note for manual downloaders:** You almost never want to clone the entire repo! Multiple different quantisation formats are provided, and most users only want to pick and download a single file.
155
 
 
161
 
162
  ### In `text-generation-webui`
163
 
164
+ Under Download Model, you can enter the model repo: jartine/Mixtral-8x7B-Instruct-v0.1-llamafile and below it, a specific filename to download, such as: mixtral-8x7b-instruct-v0.1.Q4_K_M.llamafile.
165
 
166
  Then click Download.
167
 
 
176
  Then you can download any individual model file to the current directory, at high speed, with a command like this:
177
 
178
  ```shell
179
+ huggingface-cli download jartine/Mixtral-8x7B-Instruct-v0.1-llamafile mixtral-8x7b-instruct-v0.1.Q4_K_M.llamafile --local-dir . --local-dir-use-symlinks False
180
  ```
181
 
182
  <details>
 
185
  You can also download multiple files at once with a pattern:
186
 
187
  ```shell
188
+ huggingface-cli download jartine/Mixtral-8x7B-Instruct-v0.1-llamafile --local-dir . --local-dir-use-symlinks False --include='*Q4_K*llamafile'
189
  ```
190
 
191
  For more documentation on downloading with `huggingface-cli`, please see: [HF -> Hub Python Library -> Download files -> Download from the CLI](https://huggingface.co/docs/huggingface_hub/guides/download#download-from-the-cli).
 
199
  And set environment variable `HF_HUB_ENABLE_HF_TRANSFER` to `1`:
200
 
201
  ```shell
202
+ HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download jartine/Mixtral-8x7B-Instruct-v0.1-llamafile mixtral-8x7b-instruct-v0.1.Q4_K_M.llamafile --local-dir . --local-dir-use-symlinks False
203
  ```
204
 
205
  Windows Command Line users: You can set the environment variable by running `set HF_HUB_ENABLE_HF_TRANSFER=1` before the download command.
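+ Putting those two steps together, a sketch of the same accelerated download in a Windows Command Prompt (same repo and filename as above):
+
+ ```shell
+ set HF_HUB_ENABLE_HF_TRANSFER=1
+ huggingface-cli download jartine/Mixtral-8x7B-Instruct-v0.1-llamafile mixtral-8x7b-instruct-v0.1.Q4_K_M.llamafile --local-dir . --local-dir-use-symlinks False
+ ```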
206
  </details>
207
+ <!-- README_llamafile.md-how-to-download end -->
208
 
209
+ <!-- README_llamafile.md-how-to-run start -->
210
  ## Example `llama.cpp` command
211
 
212
  Make sure you are using `llama.cpp` from commit [d0cee0d](https://github.com/ggerganov/llama.cpp/commit/d0cee0d36d5be95a0d9088b674dbb27354107221) or later.
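+ If you are building `llama.cpp` from source, a minimal sketch of getting onto that commit (assuming the make-based build llama.cpp used at the time; the CUDA flag is optional):
+
+ ```shell
+ git clone https://github.com/ggerganov/llama.cpp
+ cd llama.cpp
+ git checkout d0cee0d                # or any later commit
+ make                                # e.g. `make LLAMA_CUBLAS=1` for NVIDIA GPU offload
+ ```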
213
 
214
  ```shell
215
+ ./main -ngl 35 -m mixtral-8x7b-instruct-v0.1.Q4_K_M.llamafile --color -c 2048 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "[INST] {prompt} [/INST]"
216
  ```
217
 
218
  Change `-ngl 35` to the number of layers to offload to GPU. Remove it if you don't have GPU acceleration.
219
 
220
+ Change `-c 2048` to the desired sequence length. For extended sequence models - e.g. 8K, 16K, 32K - the necessary RoPE scaling parameters are read from the llamafile and set by llama.cpp automatically. Note that longer sequence lengths require much more resources, so you may need to reduce this value.
221
 
222
  If you want to have a chat-style conversation, replace the `-p <PROMPT>` argument with `-i -ins`
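+ For example, the command above becomes (a sketch with the same model file and sampling flags):
+
+ ```shell
+ ./main -ngl 35 -m mixtral-8x7b-instruct-v0.1.Q4_K_M.llamafile --color -c 2048 --temp 0.7 --repeat_penalty 1.1 -n -1 -i -ins
+ ```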
223
 
 
225
 
226
  ## How to run in `text-generation-webui`
227
 
228
+ Note that text-generation-webui may not yet be compatible with Mixtral llamafiles. Please check compatibility first.
229
 
230
  Further instructions can be found in the text-generation-webui documentation, here: [text-generation-webui/docs/04 ‐ Model Tab.md](https://github.com/oobabooga/text-generation-webui/blob/main/docs/04%20%E2%80%90%20Model%20Tab.md#llamacpp).
231
 
232
  ## How to run from Python code
233
 
234
+ You can use llamafile models from Python using [llama-cpp-python](https://github.com/abetlen/llama-cpp-python) version 0.2.23 or later.
235
 
236
  ### How to load this model in Python code, using llama-cpp-python
237
 
 
267
 
268
  # Set gpu_layers to the number of layers to offload to GPU. Set to 0 if no GPU acceleration is available on your system.
269
  llm = Llama(
270
+ model_path="./mixtral-8x7b-instruct-v0.1.Q4_K_M.llamafile", # Download the model file first
271
  n_ctx=2048, # The max sequence length to use - note that longer sequence lengths require much more resources
272
  n_threads=8, # The number of CPU threads to use, tailor to your system and the resulting performance
273
  n_gpu_layers=35 # The number of layers to offload to GPU, if you have GPU acceleration available
 
283
 
284
  # Chat Completion API
285
 
286
+ llm = Llama(model_path="./mixtral-8x7b-instruct-v0.1.Q4_K_M.llamafile", chat_format="llama-2") # Set chat_format according to the model you are using
287
  llm.create_chat_completion(
288
  messages = [
289
  {"role": "system", "content": "You are a story writing assistant."},
 
301
 
302
  * [LangChain + llama-cpp-python](https://python.langchain.com/docs/integrations/llms/llamacpp)
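+ As a rough sketch of the LangChain route (assuming `langchain-community` and `llama-cpp-python` are installed; the `LlamaCpp` wrapper's parameters can differ between versions):
+
+ ```python
+ from langchain_community.llms import LlamaCpp
+
+ # Parameters mirror the llama-cpp-python example above; adjust to your hardware.
+ llm = LlamaCpp(
+     model_path="./mixtral-8x7b-instruct-v0.1.Q4_K_M.llamafile",  # download the model file first
+     n_ctx=2048,        # max sequence length
+     n_gpu_layers=35,   # set to 0 if you have no GPU acceleration
+     temperature=0.7,
+ )
+
+ print(llm.invoke("[INST] Explain what a llamafile is in one paragraph. [/INST]"))
+ ```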
303
 
304
+ <!-- README_llamafile.md-how-to-run end -->
305
 
306
  <!-- footer start -->
307
  <!-- 200823 -->
 
309
 
310
  For further support, and discussions on these models and AI in general, join us at:
311
 
312
+ [jartine's Discord server](https://discord.gg/FwAVVu7eJ4)
313
 
314
  ## Thanks, and how to contribute
315
 
 
316
 
 
317
 
318
  I've had a lot of people ask if they can contribute. I enjoy providing models and helping people, and would love to be able to spend even more time doing it, as well as expanding into new projects like fine tuning/training.
319
 
320
  If you're able and willing to contribute, it will be most gratefully received and will help me to keep providing more models, and to start work on new AI projects.
321
 
 
322
 
 
 
323
 
 
324
 
 
325
 
326
 
 
327
 
328
+ And thank you again to mozilla for their generous grant.
329
 
330
  <!-- footer end -->
331