
Model Card for Meta-Llama-3.1-8B-Instruct-ONNX-DirectML-GenAI-INT4

Model Details

Model Description

meta-llama/Meta-Llama-3.1-8B-Instruct quantized to INT4 in the ONNX Runtime GenAI format, optimized for Microsoft DirectML:
https://onnxruntime.ai/docs/genai/howto/install.html#directml

Created using ONNX Runtime GenAI's builder.py
https://raw.githubusercontent.com/microsoft/onnxruntime-genai/main/src/python/py/models/builder.py

Build options:
  • INT4 accuracy level: FP32 (float32)
  • 8-bit quantization for MoE layers
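
For reference, a comparable build can be reproduced with a builder.py invocation along the following lines. This is a sketch: the output folder name is a placeholder, and the available flags and extra options depend on the installed onnxruntime-genai version (int4_accuracy_level=1 is assumed here to map to float32, following ONNX Runtime's MatMulNBits accuracy levels), so check python builder.py --help first.

python builder.py -m meta-llama/Meta-Llama-3.1-8B-Instruct -o .\llama31-8b-int4-dml -p int4 -e dml --extra_options int4_accuracy_level=1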

  • Developed by: Mochamad Aris Zamroni

Model Sources

https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct

Direct Use

This model is optimized for Microsoft Windows DirectML.
It may not work with ONNX Runtime execution providers other than DmlExecutionProvider.
The Python scripts needed to run it are included in this repository.
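
The execution provider is selected through the genai_config.json file that ships next to model.onnx, not through code. A quick way to inspect it from the model folder (a sketch; the key path shown follows typical ONNX Runtime GenAI exports and may differ between versions):

import json

# genai_config.json sits next to model.onnx in this repository
with open("genai_config.json") as f:
    cfg = json.load(f)

# For a DirectML build this should print something like [{'dml': {}}]
# (key path assumed from typical ONNX Runtime GenAI exports)
print(cfg["model"]["decoder"]["session_options"]["provider_options"])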

Prerequisites:

  1. Install Python 3.10 from Windows Store:
    https://apps.microsoft.com/detail/9pjpw5ldxlz5?hl=en-us&gl=US

  2. Open a command prompt (cmd.exe)

  3. Create a Python virtual environment, activate it, then install onnxruntime-genai-directml:
    mkdir c:\temp
    cd c:\temp
    python -m venv dmlgenai
    dmlgenai\Scripts\activate.bat
    pip install onnxruntime-genai-directml
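
    To verify the install, this one-liner should print the package version (assuming the package exposes __version__, as recent releases do):
    python -c "import onnxruntime_genai as og; print(og.__version__)"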

  4. Use onnxgenairun.py to get a chat interface.
    It is a modified version of https://github.com/microsoft/onnxruntime-genai/blob/main/examples/python/phi3-qa.py.
    The modification breaks the output to a new line after ".", ":" and ";" so long responses are easier to read.

python onnxgenairun.py --help
python onnxgenairun.py -m . -v -g
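
For orientation, the core of onnxgenairun.py follows the upstream phi3-qa.py pattern. A minimal sketch of that loop, including the line-breaking modification described above (this assumes the onnxruntime-genai 0.x Python API, where names such as compute_logits have since changed between releases, and abbreviates the prompt to a single user turn):

import onnxruntime_genai as og

model = og.Model(".")          # folder containing model.onnx and genai_config.json
tokenizer = og.Tokenizer(model)
stream = tokenizer.create_stream()

# Llama 3.1 chat template (abbreviated)
prompt = ("<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
          "Hello<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n")

params = og.GeneratorParams(model)
params.set_search_options(max_length=2048)
params.input_ids = tokenizer.encode(prompt)
generator = og.Generator(model, params)

while not generator.is_done():
    generator.compute_logits()
    generator.generate_next_token()
    piece = stream.decode(generator.get_next_tokens()[0])
    # Break the line after ".", ":" and ";" so the output is easier to read
    print(piece, end="\n" if piece.endswith((".", ":", ";")) else "", flush=True)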

  5. (Optional) Device-specific optimization (a sketch of the technique follows this list).
    a. Open "dml-device-specific-optim.py" in a text editor and change the file paths accordingly.
    b. Run the script: python dml-device-specific-optim.py
    c. Rename the original model.onnx to another file name, then rename the optimized ONNX file from step 5.b to model.onnx.
    d. Rerun step 4.
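
The script in this repository is authoritative; as a sketch of the underlying technique, ONNX Runtime can serialize the graph to disk after the DirectML provider's optimizations have been applied. The paths below are placeholders, onnxruntime-directml is assumed to be installed (pip install onnxruntime-directml), and the external-initializers entry is only needed for models whose weights exceed the 2 GB protobuf limit:

import onnxruntime as ort

sess_options = ort.SessionOptions()
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
# Write the optimized graph to disk so it can replace model.onnx (step 5.c)
sess_options.optimized_model_filepath = r"c:\temp\model_optimized.onnx"
# Large models keep their weights in an external data file
sess_options.add_session_config_entry(
    "session.optimized_model_external_initializers_file_name",
    "model_optimized.onnx.data",
)

# Creating the session runs the optimizer and saves the result
ort.InferenceSession(
    r"c:\temp\model.onnx",
    sess_options,
    providers=["DmlExecutionProvider"],
)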

Speeds, Sizes, Times

15 tokens/s on a Radeon 780M with 8 GB of dedicated VRAM, increasing to 16 tokens/s with the device-specific optimized model.onnx.
For comparison, LM Studio running a GGUF INT4 model with Vulkan GPU acceleration achieves 13 tokens/s.

Hardware

AMD Ryzen 7 7840U (Zen 4) with integrated Radeon 780M GPU
32 GB RAM
8 GB pre-allocated iGPU VRAM

Software

Microsoft DirectML on Windows 10

Model Card Authors

Mochamad Aris Zamroni

Model Card Contact

https://www.linkedin.com/in/zamroni/
