
# xFasterTransformer Inference Framework

FastChat integrates the customized xFasterTransformer framework to provide faster inference speed on Intel CPUs.

## Install xFasterTransformer

Set up the environment (please refer to the xFasterTransformer installation instructions for more details):

```bash
pip install xfastertransformer
```
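
As a quick sanity check (not part of the original instructions), you can verify that the Python package imports:

```bash
# Sanity check: confirm the xfastertransformer module is importable.
python -c "import xfastertransformer; print('xfastertransformer is installed')"
```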

## Prepare models

Prepare the model (please refer to the xFasterTransformer model preparation guide for more details):

```bash
python ./tools/chatglm_convert.py -i ${HF_DATASET_DIR} -o ${OUTPUT_DIR}
```
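
For illustration, a conversion run might look like the following; the checkpoint and output paths are placeholders, not paths from the original document:

```bash
# Illustrative paths only; point them at your own checkpoint and output directory.
export HF_DATASET_DIR=/data/models/chatglm2-6b      # Hugging Face checkpoint
export OUTPUT_DIR=/data/models/chatglm2_6b_cpu      # converted xFasterTransformer model
python ./tools/chatglm_convert.py -i ${HF_DATASET_DIR} -o ${OUTPUT_DIR}
```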

## Parameters of xFasterTransformer

- `--enable-xft`: enable xFasterTransformer in FastChat.
- `--xft-max-seq-len`: set the maximum token length the model can process; this limit includes the input token length.
- `--xft-dtype`: set the data type used by xFasterTransformer for computation. Supported types are `fp32`, `fp16`, `int8`, `bf16`, and hybrid data types such as `bf16_fp16` and `bf16_int8`. For data type details, please refer to the xFasterTransformer documentation.
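
For example, the flags can be combined as follows; the sequence length and model path are illustrative values, not taken from the original document:

```bash
# Illustrative: enable xFT with hybrid bf16_fp16 compute and a 2048-token limit.
python3 -m fastchat.serve.cli \
    --model-path /path/to/models \
    --enable-xft \
    --xft-max-seq-len 2048 \
    --xft-dtype bf16_fp16
```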

Chat with the CLI:

```bash
# Run inference on all CPUs using float16.
python3 -m fastchat.serve.cli \
    --model-path /path/to/models \
    --enable-xft \
    --xft-dtype fp16
```

Or run with numactl on a multi-socket server for better performance:

```bash
# Run inference on NUMA node 0 with data type bf16_fp16
# (the first token uses bfloat16, the remaining tokens use float16).
numactl -N 0 --localalloc \
python3 -m fastchat.serve.cli \
    --model-path /path/to/models/chatglm2_6b_cpu/ \
    --enable-xft \
    --xft-dtype bf16_fp16
```

Or use MPI to run inference on two sockets for better performance:

```bash
# Run inference on NUMA nodes 0 and 1 with data type bf16_fp16
# (the first token uses bfloat16, the remaining tokens use float16).
OMP_NUM_THREADS=$CORE_NUM_PER_SOCKET LD_PRELOAD=libiomp5.so mpirun \
-n 1 numactl -N 0 --localalloc \
python -m fastchat.serve.cli \
    --model-path /path/to/models/chatglm2_6b_cpu/ \
    --enable-xft \
    --xft-dtype bf16_fp16 : \
-n 1 numactl -N 1 --localalloc \
python -m fastchat.serve.cli \
    --model-path /path/to/models/chatglm2_6b_cpu/ \
    --enable-xft \
    --xft-dtype bf16_fp16
```
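
`CORE_NUM_PER_SOCKET` above is assumed to be the number of physical cores per CPU socket; one way to look it up (an illustrative snippet, not from the original document):

```bash
# Show the core/socket topology, then export the value for the mpirun command above.
lscpu | grep -E "Core\(s\) per socket|Socket\(s\)"
export CORE_NUM_PER_SOCKET=48   # example value; use your machine's cores per socket
```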

Start the model worker:

```bash
# Load the model with the default configuration (max sequence length 4096).
python3 -m fastchat.serve.model_worker \
    --model-path /path/to/models \
    --enable-xft \
    --xft-dtype bf16_fp16
```

Or run with numactl on a multi-socket server for better performance:

```bash
# Run inference on NUMA node 0 with data type bf16_fp16
# (the first token uses bfloat16, the remaining tokens use float16).
numactl -N 0 --localalloc python3 -m fastchat.serve.model_worker \
    --model-path /path/to/models \
    --enable-xft \
    --xft-dtype bf16_fp16
```

Or use MPI to run inference on two sockets for better performance:

```bash
# Run inference on NUMA nodes 0 and 1 with data type bf16_fp16
# (the first token uses bfloat16, the remaining tokens use float16).
OMP_NUM_THREADS=$CORE_NUM_PER_SOCKET LD_PRELOAD=libiomp5.so mpirun \
-n 1 numactl -N 0 --localalloc python -m fastchat.serve.model_worker \
    --model-path /path/to/models \
    --enable-xft \
    --xft-dtype bf16_fp16 : \
-n 1 numactl -N 1 --localalloc python -m fastchat.serve.model_worker \
    --model-path /path/to/models \
    --enable-xft \
    --xft-dtype bf16_fp16
```
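
The worker is one piece of the usual FastChat serving stack; a minimal end-to-end sketch, assuming the standard FastChat controller and OpenAI-compatible API server with their default ports (not covered in the original document), might look like:

```bash
# Terminal 1: start the controller (listens on port 21001 by default).
python3 -m fastchat.serve.controller

# Terminal 2: start the xFT-backed model worker as shown above.
python3 -m fastchat.serve.model_worker \
    --model-path /path/to/models \
    --enable-xft \
    --xft-dtype bf16_fp16

# Terminal 3: expose an OpenAI-compatible API on port 8000.
python3 -m fastchat.serve.openai_api_server --host 0.0.0.0 --port 8000

# Check that the model is being served.
curl http://localhost:8000/v1/models
```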

For more details, please refer to the xFasterTransformer documentation.