bf16_vs_fp8 / docs /mlx_integration.md
zjasper666's picture
Upload folder using huggingface_hub
8655a4b verified

A newer version of the Gradio SDK is available: 5.7.1

Upgrade

Apple MLX Integration

You can use Apple MLX as an optimized worker implementation in FastChat.

It runs models efficiently on Apple Silicon

See the supported models here.

Note that for Apple Silicon Macs with less memory, smaller models (or quantized models) are recommended.

Instructions

  1. Install MLX.

    pip install "mlx-lm>=0.0.6"
    
  2. When you launch a model worker, replace the normal worker (fastchat.serve.model_worker) with the MLX worker (fastchat.serve.mlx_worker). Remember to launch a model worker after you have launched the controller (instructions)

    python3 -m fastchat.serve.mlx_worker --model-path TinyLlama/TinyLlama-1.1B-Chat-v1.0