---
library_name: transformers
license: llama3
language:
- ja
- en
tags:
- llama-cpp
---

# Llama-3-ELYZA-JP-8B-GGUF

![Llama-3-ELYZA-JP-8B-image](./key_visual.png)

## Model Description

**Llama-3-ELYZA-JP-8B** is a large language model trained by [ELYZA, Inc](https://elyza.ai/).
Based on [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct), it has been enhanced for Japanese usage through additional pre-training and instruction tuning.

For more details, please refer to [our blog post](https://note.com/elyza/n/n360b6084fdbd).

## Quantization

We have prepared two quantized model options, GGUF and AWQ. This is the GGUF (Q4_K_M) model, converted using [llama.cpp](https://github.com/ggerganov/llama.cpp).

The following table shows the performance degradation due to quantization:

| Model | ELYZA-tasks-100 GPT4 score |
| :-------------------------------- | ---: |
| [Llama-3-ELYZA-JP-8B](https://huggingface.co/elyza/Llama-3-ELYZA-JP-8B)               | 3.655 |
| [Llama-3-ELYZA-JP-8B-GGUF (Q4_K_M)](https://huggingface.co/elyza/Llama-3-ELYZA-JP-8B-GGUF) | 3.57  |
| [Llama-3-ELYZA-JP-8B-AWQ](https://huggingface.co/elyza/Llama-3-ELYZA-JP-8B-AWQ)           | 3.39  |
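
If you prefer to fetch the quantized weights programmatically (for example, to pass a local file path to another GGUF runtime), `huggingface_hub` can download the file directly. A minimal sketch:

```python
# Sketch: download the Q4_K_M GGUF file into the local Hugging Face cache.
# Requires: pip install huggingface_hub
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="elyza/Llama-3-ELYZA-JP-8B-GGUF",
    filename="Llama-3-ELYZA-JP-8B-q4_k_m.gguf",
)
print(model_path)  # local path to the downloaded model file
```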

## Use with llama.cpp

Install llama.cpp through brew (works on Mac and Linux):
```bash
brew install llama.cpp
```

Invoke the llama.cpp server:
```bash
llama-server \
  --hf-repo elyza/Llama-3-ELYZA-JP-8B-GGUF \
  --hf-file Llama-3-ELYZA-JP-8B-q4_k_m.gguf \
  --port 8080
```
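
Before sending chat requests, you can wait for the server to finish loading the model. Recent llama.cpp builds expose a `/health` endpoint; a minimal sketch, assuming that endpoint is available in your build:

```python
# Sketch: poll the llama.cpp server's /health endpoint until the model is ready.
import time
import urllib.request

while True:
    try:
        with urllib.request.urlopen("http://localhost:8080/health") as resp:
            if resp.status == 200:
                print("server is ready")
                break
    except OSError:
        pass  # server not up yet, or model still loading
    time.sleep(1)
```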

Call the API using curl:
```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      { "role": "system", "content": "あなたは誠実で優秀な日本人のアシスタントです。特に指示が無い場合は、常に日本語で回答してください。" },
      { "role": "user", "content": "古代ギリシャを学ぶ上で知っておくべきポイントは?" }
    ],
    "temperature": 0.6,
    "max_tokens": -1,
    "stream": false
  }'
```

Call the API using Python:
```python
import openai

client = openai.OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="dummy_api_key"  # the server does not validate the key unless started with --api-key
)

completion = client.chat.completions.create(
    model="dummy_model_name",  # llama.cpp serves whichever model it loaded; the name is not checked
    messages=[
        {"role": "system", "content": "あなたは誠実で優秀な日本人のアシスタントです。特に指示が無い場合は、常に日本語で回答してください。"},
        {"role": "user", "content": "古代ギリシャを学ぶ上で知っておくべきポイントは?"}
    ]
)

print(completion.choices[0].message.content)
```
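
Because the server speaks the OpenAI protocol, token-by-token streaming works with the same client; a minimal sketch (the base URL, model name, and API key are placeholders, as above):

```python
# Sketch: stream tokens as they are generated instead of waiting for the full reply.
import openai

client = openai.OpenAI(base_url="http://localhost:8080/v1", api_key="dummy_api_key")

stream = client.chat.completions.create(
    model="dummy_model_name",
    messages=[{"role": "user", "content": "古代ギリシャを学ぶ上で知っておくべきポイントは?"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta is not None:
        print(delta, end="", flush=True)
print()
```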

## Use with Desktop App

Many desktop applications can load GGUF models; here we show how to run this model in [LM Studio](https://lmstudio.ai/), a no-code environment.

- **Installation**: Download and install [LM Studio](https://lmstudio.ai/).
- **Downloading the Model**: Search for `elyza/Llama-3-ELYZA-JP-8B-GGUF` in the search bar on the home page 🏠, and download `Llama-3-ELYZA-JP-8B-q4_k_m.gguf`.
- **Start Chatting**: Click on 💬 in the sidebar, select `Llama-3-ELYZA-JP-8B-GGUF` from "Select a Model to load" in the header, and load the model. You can now freely chat with the local LLM.
- **Setting Options**: You can set options from the sidebar on the right. Faster inference can be achieved by setting Quick GPU Offload to Max in the GPU Settings.
- **(For Developers) Starting an API Server**: Click `<->` in the left sidebar and move to the Local Server tab. Select the model and click Start Server to launch an OpenAI API-compatible API server.
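
Once the Local Server is running, it accepts the same OpenAI-style requests shown above. A minimal sketch, assuming LM Studio's default port of 1234 (adjust `base_url` to match what the Local Server tab shows):

```python
# Sketch: query the LM Studio local server; the port is its default and may differ on your setup.
import openai

client = openai.OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

completion = client.chat.completions.create(
    model="local-model",  # LM Studio serves the currently loaded model; the name is a placeholder
    messages=[{"role": "user", "content": "古代ギリシャを学ぶ上で知っておくべきポイントは?"}],
)
print(completion.choices[0].message.content)
```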

![lmstudio-demo](./lmstudio-demo.gif)

This demo showcases Llama-3-ELYZA-JP-8B-GGUF running smoothly on a MacBook Pro (M1 Pro), achieving an inference speed of approximately 20 tokens per second.

## Developers

Listed in alphabetical order.

- [Masato Hirakawa](https://huggingface.co/m-hirakawa)
- [Shintaro Horie](https://huggingface.co/e-mon)
- [Tomoaki Nakamura](https://huggingface.co/tyoyo)
- [Daisuke Oba](https://huggingface.co/daisuk30ba)
- [Sam Passaglia](https://huggingface.co/passaglia)
- [Akira Sasaki](https://huggingface.co/akirasasaki)

## License

[Meta Llama 3 Community License](https://llama.meta.com/llama3/license/)

## How to Cite

```tex
@misc{elyzallama2024,
      title={elyza/Llama-3-ELYZA-JP-8B},
      url={https://huggingface.co/elyza/Llama-3-ELYZA-JP-8B},
      author={Masato Hirakawa and Shintaro Horie and Tomoaki Nakamura and Daisuke Oba and Sam Passaglia and Akira Sasaki},
      year={2024},
}
```

## Citations

```tex
@article{llama3modelcard,
    title={Llama 3 Model Card},
    author={AI@Meta},
    year={2024},
    url = {https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md}
}
```