---
license: mit
language: en
tags:
- LLM
- LLaMA
- Baichuan
- Baichuan2
- XVERSE
---
# Model Card for lyraLLMs

## Introduction
We have released lyraLLMs, a highly optimized and easy-to-use inference engine for LLMs.
lyraLLMs runs on the following NVIDIA GPU architectures:
- Volta (V100)
- Turing (T4)
- Ampere (A100/A10)
- Ada Lovelace (RTX 4090, etc.)
lyraLLMs supports many popular HuggingFace models, including LLaMA, Baichuan/Baichuan2, XVERSE, ChatGLM, and Yi (see Convert Models below).
lyraLLMs is fast, memory-efficient & easy to use with:
- State-of-the-art throughput (up to 7K tokens/s for LLaMA 13B)
- Efficient memory usage of attention with FlashAttention2
- Quantization: MEMOPT mode (W8A16, W4A16), KVCache Int8 (see the sketch after this list)
- Easy-to-use Python API to serve LLMs
- Streaming outputs
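For intuition on the W8A16 scheme used by MEMOPT mode (int8 weights, fp16 activations, weights dequantized at matmul time), here is a minimal NumPy sketch of the numerics. It is an illustration under our own assumptions, not lyraLLMs kernel code:

```python
import numpy as np

# Conceptual W8A16: weights quantized to int8 with per-output-channel fp16
# scales; activations stay in fp16 and weights are dequantized on the fly.
# Illustration only, not lyraLLMs' actual CUDA kernel.
def quantize_w8(w_fp16):
    scale = np.abs(w_fp16).max(axis=1, keepdims=True) / 127.0  # one scale per output channel
    w_int8 = np.clip(np.round(w_fp16 / scale), -127, 127).astype(np.int8)
    return w_int8, scale.astype(np.float16)

def w8a16_matmul(x_fp16, w_int8, scale):
    w_deq = w_int8.astype(np.float16) * scale  # dequantize at GEMM time
    return x_fp16 @ w_deq.T

w = np.random.randn(16, 32).astype(np.float16)  # [out_features, in_features]
x = np.random.randn(4, 32).astype(np.float16)   # [batch, in_features]
w_q, s = quantize_w8(w)
print(np.max(np.abs(w8a16_matmul(x, w_q, s) - x @ w.T)))  # small quantization error
```

The memory saving is the point: int8 weights take half the bytes of fp16 (a quarter for W4A16), which shrinks VRAM usage and speeds up the memory-bound decode phase.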
If you like our work and are interested in joining us, feel free to drop us a line at benbinwu@tencent.com.
## Speed

### Settings
- Throughput measured in tokens/s (input + output tokens; see the measurement sketch below)
- Tested on A100 40GB, CUDA 12.0
- MEMOPT mode and KVCache Int8 enabled
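For reference, the tokens/s numbers below can be collected with a timing loop along these lines. This is a sketch: we assume `generate` returns one output string per prompt and that the HuggingFace tokenizer shipped with the converted model is used for counting; neither is prescribed by lyraLLMs itself.

```python
import time

# Sketch of the throughput metric: (input tokens + output tokens) /
# wall-clock seconds, summed over the batch. `model` is a loaded lyraLLMs
# model as in the Python Demo below; `tokenizer` is assumed to be the
# HuggingFace tokenizer shipped with the converted weights.
def measure_throughput(model, tokenizer, prompt, batch_size, output_length=150):
    prompts = [prompt] * batch_size
    start = time.time()
    outputs = model.generate(prompts, output_length=output_length, do_sample=False)
    elapsed = time.time() - start
    n_in = sum(len(tokenizer.encode(p)) for p in prompts)
    n_out = sum(len(tokenizer.encode(o)) for o in outputs)
    return (n_in + n_out) / elapsed
```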
### Throughputs

#### XVERSE-13B-Chat

Input: `北京的景点:故宫、天坛、万里长城等。\n深圳的景点:`
| Version | Batch Size 1 | Batch Size 64 | Batch Size 128 | Batch Size 256 | Batch Size 512 |
|---|---|---|---|---|---|
| Torch 2.1.0 | 52.9 | 2308.1 | OOM | OOM | OOM |
| lyraXVERSE | 200.4 | 4624.8 | 5759.7 | 6075.6 | 5733.0 |
#### Baichuan2-7B-Base

Input: `登鹳雀楼->王之涣\n夜雨寄北->`
| Version | Batch Size 1 | Batch Size 8 | Batch Size 16 | Batch Size 32 | Batch Size 64 |
|---|---|---|---|---|---|
| Torch 2.0.1 | 41.2 | 323.2 | 640.0 | 1256.8 | 2231.0 |
| lyraBaichuan | 125.9 | 948.1 | 1749.3 | 2974.0 | 4370.1 |
#### Baichuan2-13B-Base

Input: `登鹳雀楼->王之涣\n夜雨寄北->`
| Version | Batch Size 1 | Batch Size 8 | Batch Size 16 | Batch Size 32 | Batch Size 64 |
|---|---|---|---|---|---|
| Torch 2.0.1 | 40.9 | 307.9 | 555.6 | 1010.4 | 1601.0 |
| lyraBaichuan | 80.0 | 568.2 | 1124.4 | 1942.6 | 2828.0 |
#### Yi-6B

Input: `# write the quick sort algorithm`
| Version | Batch Size 1 | Batch Size 8 | Batch Size 16 | Batch Size 32 | Batch Size 64 |
|---|---|---|---|---|---|
| Torch 2.1.0 | 31.4 | 247.5 | 490.4 | 987.2 | 1796.3 |
| lyraLLaMA | 93.8 | 735.6 | 2339.8 | 3020.9 | 4630.8 |
#### Yi-34B

Due to VRAM limitations, we cannot profile the throughput of Yi-34B on an A100 40GB using Torch.

Input: `Let me tell you an interesting story about cat Tom and mouse Jerry,`
| Version | Batch Size 1 | Batch Size 8 | Batch Size 16 | Batch Size 32 | Batch Size 64 |
|---|---|---|---|---|---|
| lyraLLaMA | 52.5 | 399.4 | 753.0 | 1138.2 | 1926.2 |
## Usage

### Environment (Docker recommended)

- For CUDA 11.X we recommend `nvcr.io/nvidia/pytorch:22.12-py3`
- For CUDA 12.0 we recommend `nvcr.io/nvidia/pytorch:23.02-py3`

```bash
docker pull nvcr.io/nvidia/pytorch:23.02-py3
docker run --rm -it --gpus all -v ./:/lyraLLMs nvcr.io/nvidia/pytorch:23.02-py3
# inside the container, install the Python dependencies
pip install -r requirements.txt
```
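Once inside the container, a quick sanity check that PyTorch sees the GPU can save debugging time later:

```python
# Confirm the PyTorch build, its CUDA version, and GPU visibility
# before running lyraLLMs.
import torch
print(torch.__version__, torch.version.cuda, torch.cuda.get_device_name(0))
```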
### Convert Models
We have released multiple optimized models converted from original HuggingFace ones:
- ChatGLM-6B
- XVERSE-13B-Chat
- LLaMA-Ziya-13B
- Baichuan-7B, Baichuan-13B-Base, Baichuan-13B-Chat, Baichuan2-7B-Base, Baichuan2-7B-Chat, Baichuan2-13B-Base and Baichuan2-13B-Chat
- Yi-6B, Yi-34B
Feel free to contact us if you would like us to convert a finetuned LLM.
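The checkpoints above are already converted. For intuition only, conversion conceptually amounts to loading the HuggingFace checkpoint and re-serializing each tensor in the layout the engine expects; the sketch below is hypothetical (the function name and one-file-per-tensor format are ours, not the real lyraLLMs converter):

```python
import os
from transformers import AutoModelForCausalLM

# Hypothetical sketch of checkpoint conversion: load the HuggingFace model
# and dump each tensor as an fp16 binary. The real lyraLLMs converter and
# its on-disk format differ.
def convert_hf_checkpoint(hf_path, out_dir):
    model = AutoModelForCausalLM.from_pretrained(hf_path, torch_dtype="auto")
    os.makedirs(out_dir, exist_ok=True)
    for name, tensor in model.state_dict().items():
        tensor.cpu().half().numpy().tofile(os.path.join(out_dir, f"{name}.bin"))
```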
### Inference

Refer to README.md for how to run inference on converted models with lyraLLMs.

#### Python Demo
```python
from lyra_llama import lyraLlama

model_path = 'XXX'  # directory containing the converted model weights, config and tokenizer files
data_type = 'fp16'
memopt_mode = 0     # set memopt_mode=1 to run inference in MEMOPT mode

model = lyraLlama(model_path, data_type, memopt_mode)

# "List 3 different machine learning algorithms and explain where each applies."
prompts = '列出3个不同的机器学习算法,并说明它们的适用范围.'
prompts = [prompts,] * 64

output_texts = model.generate(prompts, output_length=150, do_sample=False, top_k=30, top_p=0.85, temperature=1.0, repetition_penalty=1.0)
print(output_texts)
```
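To try MEMOPT mode with the same demo, only the third constructor argument changes, as noted in the comment above; the `generate` call is unchanged:

```python
# Same demo in MEMOPT mode: weights are served in the quantized MEMOPT
# format described in the Introduction.
model = lyraLlama(model_path, data_type, 1)  # memopt_mode=1
output_texts = model.generate(prompts, output_length=150, do_sample=False, top_k=30, top_p=0.85, temperature=1.0, repetition_penalty=1.0)
```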
## Citation

```bibtex
@Misc{lyraLLMs2024,
  author       = {Kangjian Wu and Zhengtao Wang and Yibo Lu and Haoxiong Su and Bin Wu},
  title        = {lyraLLMs: A highly optimized and easy-to-use inference engine for LLMs},
  howpublished = {\url{https://huggingface.co/TMElyralab/lyraLLMs}},
  year         = {2024}
}
```
## Report Bugs

- Start a discussion to report any bugs: https://huggingface.co/TMElyralab/lyraLLMs/discussions
- Mark the title with `[bug]`.