For more details, please refer to [our blog post](https://note.com/elyza/n/n360b6084fdbd).

## Quantization

We have prepared two quantized model options, GGUF and AWQ. This is the GGUF (Q4_K_M) model, converted using [llama.cpp](https://github.com/ggerganov/llama.cpp).

Here is a table showing the performance degradation due to quantization.

| Model | ELYZA-tasks-100 GPT4 score |
| :-------------------------------- | ---: |
Install llama.cpp through brew (works on Mac and Linux)

```bash
brew install llama.cpp
```
Invoke the llama.cpp server.

```bash
# NOTE: the server command itself was elided in this excerpt; the invocation
# below is an assumption based on llama.cpp's llama-server CLI, not the source.
llama-server \
  --hf-repo elyza/Llama-3-ELYZA-JP-8B-GGUF \
  --hf-file Llama-3-ELYZA-JP-8B-q4_k_m.gguf
```

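The excerpt elides the accompanying Python client snippet (the hunk header shows only `client.chat.completions.create(`). As a non-authoritative sketch, the running server speaks the OpenAI chat-completions protocol; the port (llama-server's default 8080) and the `build_chat_request` helper below are assumptions, not from the original README:

```python
import json
from urllib.request import Request, urlopen

def build_chat_request(base_url: str, prompt: str) -> Request:
    """Build a request for the OpenAI-compatible /v1/chat/completions route.

    Hypothetical helper: the model name and endpoint path follow OpenAI
    chat-completions conventions, not text taken from the original README.
    """
    payload = {
        "model": "elyza/Llama-3-ELYZA-JP-8B-GGUF",
        "messages": [{"role": "user", "content": prompt}],
    }
    return Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request("http://localhost:8080", "日本の首都はどこですか？")
# With the server running, send it like this:
# with urlopen(req) as resp:
#     print(json.loads(resp.read())["choices"][0]["message"]["content"])
```

The stdlib `urllib` is used only to keep the sketch dependency-free; the elided original presumably used the `openai` client pointed at the same base URL.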
## Use with Desktop App

There are various desktop applications that can handle GGUF models, but here we will introduce how to use the model in the no-code environment LM Studio.

- **Installation**: Download and install [LM Studio](https://lmstudio.ai/).
- **Downloading the Model**: Search for `elyza/Llama-3-ELYZA-JP-8B-GGUF` in the search bar on the home page 🏠, and download `Llama-3-ELYZA-JP-8B-q4_k_m.gguf`.
- **Start Chatting**: Click on 💬 in the sidebar, select `Llama-3-ELYZA-JP-8B-GGUF` from "Select a Model to load" in the header, and load the model. You can now freely chat with the local LLM.
- **Setting Options**: You can set options from the sidebar on the right. Faster inference can be achieved by setting Quick GPU Offload to Max in the GPU Settings.
- **For Developers: Starting the API Server**: Click `<->` in the left sidebar and move to the Local Server tab. Select the model and click Start Server to launch an OpenAI API-compatible API server.

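Once the Local Server is running, any OpenAI-compatible client can talk to it. A minimal stdlib-only sketch, assuming LM Studio's default address `http://localhost:1234` (the `chat` helper is hypothetical, not part of the original README):

```python
import json
from urllib.request import Request, urlopen

# Assumption: LM Studio's Local Server is on its default http://localhost:1234
# and serves the OpenAI-compatible /v1/chat/completions route.
BASE_URL = "http://localhost:1234"

def chat(prompt: str, base_url: str = BASE_URL) -> str:
    """Send a single user message and return the assistant's reply text."""
    body = json.dumps({
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    req = Request(f"{base_url}/v1/chat/completions", data=body,
                  headers={"Content-Type": "application/json"})
    with urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]

# chat("仕事の熱意を取り戻すためのアイデアを5つ挙げてください。")  # requires the server to be running
```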
## Developers