---
library_name: transformers
license: llama3
language:
- ja
- en
tags:
- llama-cpp
---
# Llama-3-ELYZA-JP-8B-GGUF
## Model Description
Llama-3-ELYZA-JP-8B is a large language model trained by ELYZA, Inc. Based on `meta-llama/Meta-Llama-3-8B-Instruct`, it has been enhanced for Japanese usage through additional pre-training and instruction tuning.
For more details, please refer to our blog post.
## Quantization
We performed quantization using llama.cpp and converted the model to GGUF format. We have prepared two quantized model options, GGUF and AWQ. The table below shows the performance degradation due to quantization:
| Model | ELYZA-tasks-100 GPT4 score |
| --- | --- |
| Llama-3-ELYZA-JP-8B | 3.655 |
| Llama-3-ELYZA-JP-8B-GGUF (Q4_K_M) | 3.57 |
| Llama-3-ELYZA-JP-8B-AWQ | 3.39 |
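If you want the quantized file itself, for example to use it with another GGUF-compatible runtime, one way to fetch it is with `huggingface_hub`. This is a minimal sketch of our own, not part of the official instructions:

```python
from huggingface_hub import hf_hub_download

# Download the Q4_K_M GGUF file from the Hub; returns the local file path.
gguf_path = hf_hub_download(
    repo_id="elyza/Llama-3-ELYZA-JP-8B-GGUF",
    filename="Llama-3-ELYZA-JP-8B-q4_k_m.gguf",
)
print(gguf_path)
```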
## Use with llama.cpp
Install llama.cpp through brew (works on Mac and Linux):

```bash
brew install llama.cpp
```
Invoke the llama.cpp server:

```bash
$ llama-server \
  --hf-repo elyza/Llama-3-ELYZA-JP-8B-GGUF \
  --hf-file Llama-3-ELYZA-JP-8B-q4_k_m.gguf \
  --port 8080
```
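The server may take a moment to load the model. llama.cpp's server exposes a `/health` endpoint, so you can optionally wait until it is ready; the polling loop below is a minimal sketch of our own:

```python
import time
import urllib.request

# Poll llama-server's /health endpoint until the model has finished loading.
while True:
    try:
        with urllib.request.urlopen("http://localhost:8080/health") as resp:
            if resp.status == 200:
                break
    except OSError:
        pass  # server not reachable or still loading
    time.sleep(1)

print("server is ready")
```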
Call the API using curl (the Japanese system prompt instructs the model to act as a sincere, capable assistant and answer in Japanese; the user asks what to know when studying ancient Greece):

```bash
$ curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      { "role": "system", "content": "あなたは誠実で優秀な日本人のアシスタントです。特に指示が無い場合は、常に日本語で回答してください。" },
      { "role": "user", "content": "古代ギリシャを学ぶ上で知っておくべきポイントは?" }
    ],
    "temperature": 0.6,
    "max_tokens": -1,
    "stream": false
  }'
```
Call the API using Python:

```python
import openai

client = openai.OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="dummy_api_key",  # llama-server does not check the key, but the client requires one
)

completion = client.chat.completions.create(
    model="dummy_model_name",  # the server ignores this and uses the loaded GGUF model
    messages=[
        {"role": "system", "content": "あなたは誠実で優秀な日本人のアシスタントです。特に指示が無い場合は、常に日本語で回答してください。"},
        {"role": "user", "content": "古代ギリシャを学ぶ上で知っておくべきポイントは?"},
    ],
)

print(completion.choices[0].message.content)
```
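The curl example above sets `"stream": false`; the same endpoint also supports streaming. A minimal streaming sketch of our own, reusing the `client` from the snippet above:

```python
# Stream tokens as they are generated instead of waiting for the full reply.
stream = client.chat.completions.create(
    model="dummy_model_name",
    messages=[{"role": "user", "content": "古代ギリシャを学ぶ上で知っておくべきポイントは?"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:  # a chunk's content can be None (e.g., the final chunk)
        print(delta, end="", flush=True)
print()
```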
## Use with Desktop App
Various desktop applications can handle GGUF models; here we introduce LM Studio, which lets you use the model in a local environment without writing any code.
- **Installation**: Download and install [LM Studio](https://lmstudio.ai/).
- **Downloading the Model**: Search for `elyza/Llama-3-ELYZA-JP-8B-GGUF` in the search bar on the home page 🏠, and download `Llama-3-ELYZA-JP-8B-q4_k_m.gguf`.
- **Start Chatting**: Click 💬 in the sidebar, select `Llama-3-ELYZA-JP-8B-GGUF` from "Select a Model to load" in the header, and load the model. Now you can freely chat with the local LLM.
- **Setting Options**: You can set options from the sidebar on the right. Faster inference can be achieved by setting Quick GPU Offload Settings to Max in the GPU Settings.
- **For Developers, Starting the API Server**: Click `<->` in the left sidebar and move to the Local Server tab. Select the model and click **Start Server** to launch an OpenAI API-compatible server; a sketch of calling it from Python follows this list.
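As a minimal sketch of our own, the LM Studio server can be called with the same OpenAI client as the llama.cpp server. We assume the default address `http://localhost:1234/v1` here; check the Local Server tab for the actual port:

```python
import openai

# LM Studio's Local Server speaks the OpenAI API. The base URL below assumes
# LM Studio's default port 1234; adjust it to what the Local Server tab shows.
client = openai.OpenAI(base_url="http://localhost:1234/v1", api_key="dummy_api_key")

completion = client.chat.completions.create(
    model="dummy_model_name",  # LM Studio serves whichever model you loaded
    messages=[{"role": "user", "content": "古代ギリシャを学ぶ上で知っておくべきポイントは?"}],
)
print(completion.choices[0].message.content)
```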
## Quantization Options
Currently, we only offer quantized models in the Q4_K_M format.
## Developers
Listed in alphabetical order.

- Masato Hirakawa
- Shintaro Horie
- Tomoaki Nakamura
- Daisuke Oba
- Sam Passaglia
- Akira Sasaki
## License
Meta Llama 3 Community License
## How to Cite

```tex
@misc{elyzallama2024,
      title={elyza/Llama-3-ELYZA-JP-8B},
      url={https://huggingface.co/elyza/Llama-3-ELYZA-JP-8B},
      author={Masato Hirakawa and Shintaro Horie and Tomoaki Nakamura and Daisuke Oba and Sam Passaglia and Akira Sasaki},
      year={2024},
}
```
## Citations

```tex
@article{llama3modelcard,
    title={Llama 3 Model Card},
    author={AI@Meta},
    year={2024},
    url={https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md}
}
```