---
library_name: transformers
license: llama3
language:
- ja
- en
tags:
- llama-cpp
---

# Llama-3-ELYZA-JP-8B-GGUF

![Llama-3-ELYZA-JP-8B-image](./key_visual.png)

## Model Description

**Llama-3-ELYZA-JP-8B** is a large language model trained by [ELYZA, Inc](https://elyza.ai/). Based on [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct), it has been enhanced for Japanese usage through additional pre-training and instruction tuning. For more details, please refer to [our blog post](https://note.com/elyza/n/n360b6084fdbd).

## Quantization

We performed quantization using [llama.cpp](https://github.com/ggerganov/llama.cpp) and converted the model to GGUF format. We have prepared two quantized model options, GGUF (this repository) and AWQ; the GGUF model is currently offered only in the Q4_K_M format.

The table below measures the performance degradation due to quantization.

| Model                             | ELYZA-tasks-100 GPT4 score |
| :-------------------------------- | -------------------------: |
| Llama-3-ELYZA-JP-8B               | 3.655 |
| Llama-3-ELYZA-JP-8B-GGUF (Q4_K_M) | 3.57  |
| Llama-3-ELYZA-JP-8B-AWQ           | 3.39  |

## Use with llama.cpp

Install llama.cpp through brew (works on Mac and Linux):

```bash
brew install llama.cpp
```

Start the llama.cpp server:

```bash
llama-server \
  --hf-repo elyza/Llama-3-ELYZA-JP-8B-GGUF \
  --hf-file Llama-3-ELYZA-JP-8B-q4_k_m.gguf \
  --port 8080
```

Call the API using curl. The system prompt tells the model it is a sincere, capable Japanese assistant that should always answer in Japanese unless instructed otherwise; the user asks what key points one should know when studying ancient Greece.

```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      { "role": "system", "content": "あなたは誠実で優秀な日本人のアシスタントです。特に指示が無い場合は、常に日本語で回答してください。" },
      { "role": "user", "content": "古代ギリシャを学ぶ上で知っておくべきポイントは?" }
    ],
    "temperature": 0.6,
    "max_tokens": -1,
    "stream": false
  }'
```

Call the API using Python:

```python
import openai

client = openai.OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="dummy_api_key"  # placeholder; the local server does not require a real key by default
)

completion = client.chat.completions.create(
    model="dummy_model_name",  # placeholder; the server serves the single loaded model
    messages=[
        {"role": "system", "content": "あなたは誠実で優秀な日本人のアシスタントです。特に指示が無い場合は、常に日本語で回答してください。"},
        {"role": "user", "content": "古代ギリシャを学ぶ上で知っておくべきポイントは?"}
    ]
)

print(completion.choices[0].message.content)
```

## Use with Desktop App

Various desktop applications can handle GGUF models; here we introduce how to use the model in a local environment, without any coding, using LM Studio.

- **Installation**: Download and install [LM Studio](https://lmstudio.ai/).
- **Downloading the Model**: Search for `elyza/Llama-3-ELYZA-JP-8B-GGUF` in the search bar on the home page 🏠, and download `Llama-3-ELYZA-JP-8B-q4_k_m.gguf`.
- **Start Chatting**: Click 💬 in the sidebar, select `Llama-3-ELYZA-JP-8B-GGUF` from "Select a Model to load" in the header, and load the model. You can now chat freely with the local LLM.
- **Setting Options**: You can set options from the sidebar on the right. Faster inference can be achieved by setting Quick GPU Offload Settings to Max in the GPU Settings.
- **For Developers, Starting the API Server**: Click `<->` in the left sidebar and move to the Local Server tab. Select the model and click Start Server to launch an OpenAI API-compatible API server (see the sketch below).
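Once the Local Server is running, any OpenAI-compatible client can talk to it. Below is a minimal sketch in Python, assuming LM Studio's default endpoint `http://localhost:1234/v1`; the actual address, port, and model identifier are shown in LM Studio's Local Server tab, so adjust them to match your setup.

```python
import openai

# A minimal sketch of calling LM Studio's OpenAI-compatible local server.
# http://localhost:1234/v1 is assumed to be the default endpoint; check
# the Local Server tab for the actual address and port.
client = openai.OpenAI(
    base_url="http://localhost:1234/v1",
    api_key="lm-studio",  # placeholder; the local server does not validate the key
)

completion = client.chat.completions.create(
    # Hypothetical identifier; use the model name LM Studio displays for
    # the loaded Llama-3-ELYZA-JP-8B-q4_k_m.gguf.
    model="elyza/Llama-3-ELYZA-JP-8B-GGUF",
    messages=[
        {"role": "user", "content": "古代ギリシャを学ぶ上で知っておくべきポイントは?"}
    ],
)

print(completion.choices[0].message.content)
```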
## Developers

Listed in alphabetical order.

- [Masato Hirakawa](https://huggingface.co/m-hirakawa)
- [Shintaro Horie](https://huggingface.co/e-mon)
- [Tomoaki Nakamura](https://huggingface.co/tyoyo)
- [Daisuke Oba](https://huggingface.co/daisuk30ba)
- [Sam Passaglia](https://huggingface.co/passaglia)
- [Akira Sasaki](https://huggingface.co/akirasasaki)

## License

[Meta Llama 3 Community License](https://llama.meta.com/llama3/license/)

## How to Cite

```tex
@misc{elyzallama2024,
    title={elyza/Llama-3-ELYZA-JP-8B},
    url={https://huggingface.co/elyza/Llama-3-ELYZA-JP-8B},
    author={Masato Hirakawa and Shintaro Horie and Tomoaki Nakamura and Daisuke Oba and Sam Passaglia and Akira Sasaki},
    year={2024},
}
```

## Citations

```tex
@article{llama3modelcard,
    title={Llama 3 Model Card},
    author={AI@Meta},
    year={2024},
    url={https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md}
}
```