need gguf

#4 opened by windkkk

need gguf

GGUF would be nice. Hopefully they'll add Phi 3.5 MoE support to llama.cpp.

https://github.com/ggerganov/llama.cpp/issues/9119

I need GGUF too.

ChatLLM.cpp supports this:

    ________          __  __    __    __  ___ (Φ)
   / ____/ /_  ____ _/ /_/ /   / /   /  |/  /_________  ____  
  / /   / __ \/ __ `/ __/ /   / /   / /|_/ // ___/ __ \/ __ \ 
 / /___/ / / / /_/ / /_/ /___/ /___/ /  / // /__/ /_/ / /_/ / 
 \____/_/ /_/\__,_/\__/_____/_____/_/  /_(_)___/ .___/ .___/  
You are served by Phi-3.5 MoE,                /_/   /_/       
with 41873153344 (6.6B effect.) parameters.

You  > write a python program to calculate 10!
A.I. > Certainly! Below is a Python program that calculates the factorial of 10 (10!):

def factorial(n):
    if n == 0 or n == 1:
        return 1
    else:
        return n * factorial(n - 1)

result = factorial(10)
print("The factorial of 10 is:", result)

Here's an alternative using a loop for better performance:
...

How do I use ChatLLM.cpp to convert microsoft/Phi-3.5-MoE-instruct to GGUF?

@goodasdgood Sorry, ChatLLM.cpp uses its own format (something like the old GGML file format).

Mistral.rs supports this now: https://github.com/EricLBuehler/mistral.rs/blob/master/docs/PHI3.5MOE.md!

You can quantize in-place with GGUF and HQQ quantization, and there is a model topology feature for per-layer device mapping and ISQ parameterization. CUDA and Metal are supported, as well as CPU SIMD acceleration.

Built on Candle: https://github.com/huggingface/candle!

@goodasdgood as of writing, llama.cpp doesn't support Phi 3.5 MoE models, so GGUF models for that wouldn't really make sense.

Mistral.rs uses, among other methods, a technique called ISQ (in-situ quantization), which enables you to quantize the model quickly, locally, and in-place. You can then use our OpenAI-compatible server, Python API, or Rust API to interface with your application.
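
For illustration, here is a minimal sketch of calling a locally running OpenAI-compatible server (such as the one Mistral.rs exposes) with the official openai Python client. The port, API key, and model name are placeholders, not values taken from the Mistral.rs docs:

from openai import OpenAI

# Point the client at the local OpenAI-compatible endpoint (placeholder port).
client = OpenAI(base_url="http://localhost:1234/v1", api_key="EMPTY")

# Send one chat request and print the reply.
resp = client.chat.completions.create(
    model="microsoft/Phi-3.5-MoE-instruct",
    messages=[{"role": "user", "content": "write a python program to calculate 10!"}],
)
print(resp.choices[0].message.content)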

Microsoft org

For your information, it is best to quantize the expert weights only, not the gating or the attention weights. Expert weights take most of the memory, so expert-only quantization gives very good compression.
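
As a rough illustration of why that works, here is a hypothetical PyTorch sketch that measures how much of an MoE model sits in the expert weights; the name filters ("experts", "gate", "self_attn") are assumptions, not the actual Phi-3.5-MoE module names:

import torch

def expert_fraction(model: torch.nn.Module) -> float:
    # Count parameters in expert weights versus everything else (gating,
    # attention, embeddings). Experts dominate in an MoE model, which is why
    # expert-only quantization already gives most of the memory savings.
    expert, total = 0, 0
    for name, p in model.named_parameters():
        n = p.numel()
        total += n
        if "experts" in name and "gate" not in name and "self_attn" not in name:
            expert += n
    return expert / total

# Example: print(f"{expert_fraction(model):.1%} of parameters are expert weights")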

@ykim362 thank you for your input. We actually already only quantize the expert & attention weights, as we discovered that quantizing the gating layer drastically reduced performance. But I'll try only quantizing the attention weights!

Microsoft org

@EricB Thanks. Just to make sure, I think you meant "I'll try only quantizing the expert weights!", not attention.

@ykim362 yes, sorry my mistake!

Not this model, but we have some studies on the quantization of MoE models. We could easily quantize the models down to 3 bits (experts only) with PTQ (plain absolute min-max). With QAT (experts only), we could push it down to 2 bits. But quantizing the other parts hurts performance significantly. https://arxiv.org/abs/2310.02410
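
For intuition, here is an illustrative sketch of that "plain absolute min-max" round trip applied symmetrically per tensor; it is a generic fake-quantization helper, not code from the paper:

import torch

def fake_quantize_minmax(w: torch.Tensor, bits: int = 3) -> torch.Tensor:
    # Symmetric min-max quantization: scale by the largest absolute value,
    # round to the nearest integer level, then dequantize back to floats.
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max().clamp(min=1e-8) / qmax
    q = torch.round(w / scale).clamp(-qmax - 1, qmax)
    return q * scale

# Applying this to the expert weights only (skipping gating and attention)
# is the kind of expert-only PTQ described above.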

@ykim362 thanks for the link. That seems super interesting; I quickly implemented it here so it can be used with Phi 3.5 MoE!

Microsoft org

@EricB that's real quick! Awesome and thank you!!

Bump, I'm also interested in the GGUF format and want to watch this thread.
