	---
	license: openrail++
	base_model: stabilityai/stable-diffusion-xl-base-1.0
	language:
	- en
	tags:
	- stable-diffusion
	- stable-diffusion-xl
	- stable-diffusion-xl-lcm
	- stable-diffusion-xl-lcmlora
	- tensorrt
	- text-to-image
	---

	# Stable Diffusion XL 1.0 TensorRT

	## Introduction

	This repository hosts the TensorRT versions (sdxl, sdxl-lcm, sdxl-lcmlora) of Stable Diffusion XL 1.0, created in collaboration with [NVIDIA](https://huggingface.co/nvidia). In the 30-step, 1024x1024 benchmarks listed under Performance Comparison, the optimized versions cut the measured time by roughly 13% (A10) to 41% (H100).

	See the [usage instructions](#usage-example) for how to run the SDXL pipeline with the ONNX files hosted in this repository.


	![examples](./examples.jpg)

	## Model Description

	- Developed by: Stability AI
	- Model type: Diffusion-based text-to-image generative model
	- License: [CreativeML Open RAIL++-M License](https://huggingface.co/stabilityai/stable-diffusion-xl-refiner-1.0/blob/main/LICENSE.md)
	- Model Description: This is a conversion of the [SDXL base 1.0](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0) and [SDXL refiner 1.0](https://huggingface.co/stabilityai/stable-diffusion-xl-refiner-1.0) models for [NVIDIA TensorRT](https://developer.nvidia.com/tensorrt) optimized inference.


	## Performance Comparison

	#### Timings for 30 steps at 1024x1024

	| Accelerator | Baseline (non-optimized) | NVIDIA TensorRT (optimized) | Percentage improvement |
	|-------------|--------------------------|-----------------------------|------------------------|
	| A10 | 9399 ms | 8160 ms | ~13% |
	| A100 | 3704 ms | 2742 ms | ~26% |
	| H100 | 2496 ms | 1471 ms | ~41% |
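
The percentage column appears to follow the usual relative-reduction formula, (baseline - optimized) / baseline. The sketch below recomputes it from the timings in the table. `improvement_pct` is a hypothetical helper written for this illustration and is not part of the TensorRT demo.

```shell
#!/usr/bin/env bash
# Relative latency reduction in percent: (baseline - optimized) / baseline * 100.
# improvement_pct is a hypothetical helper, not part of the TensorRT demo.
improvement_pct() {
  awk -v base="$1" -v opt="$2" 'BEGIN { printf "%.1f\n", (base - opt) / base * 100 }'
}

improvement_pct 9399 8160   # A10  -> 13.2
improvement_pct 3704 2742   # A100 -> 26.0
improvement_pct 2496 1471   # H100 -> 41.1
```

The results round to the ~13%, ~26%, and ~41% figures shown in the table.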

	#### Image throughput for 30 steps at 1024x1024

	| Accelerator | Baseline (non-optimized) | NVIDIA TensorRT (optimized) | Percentage improvement |
	|-------------|--------------------------|-----------------------------|------------------------|
	| A10 | 0.10 images/sec | 0.12 images/sec | ~20% |
	| A100 | 0.27 images/sec | 0.36 images/sec | ~33% |
	| H100 | 0.40 images/sec | 0.68 images/sec | ~70% |
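
The throughput figures are close to the reciprocal of the 30-step timings, which is what one would expect if each run produces a single image. That is an assumption, since the card does not state the batch size. A minimal sketch of the conversion, using a hypothetical `images_per_sec` helper:

```shell
#!/usr/bin/env bash
# Convert a per-image latency in milliseconds to images per second.
# Assumes one image per pipeline run; the model card does not state the batch size.
# images_per_sec is a hypothetical helper, not part of the TensorRT demo.
images_per_sec() {
  awk -v ms="$1" 'BEGIN { printf "%.2f\n", 1000 / ms }'
}

images_per_sec 3704   # A100 baseline  -> 0.27
images_per_sec 2742   # A100 optimized -> 0.36
images_per_sec 2496   # H100 baseline  -> 0.40
images_per_sec 1471   # H100 optimized -> 0.68
```

The A10 baseline works out to 0.11 images/sec by this formula, against 0.10 in the table, so the published values may have been truncated or measured separately.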

	#### Timings for the Latent Consistency Model (LCM) version for 4 steps at 1024x1024

	| Accelerator | CLIP | UNet | VAE | Total |
	|-------------|------|------|-----|-------|
	| A100 | 1.08 ms | 192.02 ms | 228.34 ms | 426.16 ms |
	| H100 | 0.78 ms | 102.8 ms | 126.95 ms | 234.22 ms |
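
The three stage timings do not add up exactly to the Total column. The sketch below computes the remainder for each accelerator. The table does not say what that remainder covers, so treating it as time spent outside the three stages is an assumption, and `unaccounted_ms` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Time in the Total column not covered by the CLIP, UNet, and VAE columns.
# unaccounted_ms is a hypothetical helper; what the remainder represents is not
# stated in the model card.
unaccounted_ms() {
  awk -v clip="$1" -v unet="$2" -v vae="$3" -v total="$4" \
    'BEGIN { printf "%.2f\n", total - (clip + unet + vae) }'
}

unaccounted_ms 1.08 192.02 228.34 426.16   # A100 -> 4.72
unaccounted_ms 0.78 102.8 126.95 234.22    # H100 -> 3.69
```

In both rows the remainder is under 5 ms, so the UNet and VAE stages account for nearly all of the reported time.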


	## Usage Example

	1. Follow the [setup instructions](https://github.com/rajeevsrao/TensorRT/blob/release/9.2/demo/Diffusion/README.md) to launch a TensorRT NGC container.
	```shell
	git clone https://github.com/rajeevsrao/TensorRT.git
	cd TensorRT
	git checkout release/9.2
	docker run --rm -it --gpus all -v $PWD:/workspace nvcr.io/nvidia/pytorch:23.11-py3 /bin/bash
	```

	2. Download the SDXL TensorRT files from this repository.
	```shell
	git lfs install
	git clone https://huggingface.co/stabilityai/stable-diffusion-xl-1.0-tensorrt
	cd stable-diffusion-xl-1.0-tensorrt
	git lfs pull
	cd ..
	```
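
If `git lfs pull` was skipped or interrupted, the large files remain small text stubs (Git LFS pointer files, which begin with the line `version https://git-lfs.github.com/spec/v1`). The following sketch flags any such stubs. It assumes the model files use the `.onnx` extension and live under the cloned directory, and `is_lfs_pointer` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Succeeds if FILE is a Git LFS pointer stub rather than the real payload.
# is_lfs_pointer is a hypothetical helper written for this illustration.
is_lfs_pointer() {
  head -c 64 "$1" | LC_ALL=C grep -q '^version https://git-lfs.github.com/spec/v1'
}

# Assumed layout: ONNX files somewhere under the cloned repository directory.
repo_dir="stable-diffusion-xl-1.0-tensorrt"
if [ -d "$repo_dir" ]; then
  find "$repo_dir" -type f -name '*.onnx' | while IFS= read -r f; do
    if is_lfs_pointer "$f"; then
      echo "pointer only, rerun 'git lfs pull': $f"
    fi
  done
fi
```

No output means every `.onnx` file found holds real data rather than a pointer.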

	3. Install libraries and requirements.
	```shell
	cd demo/Diffusion
	python3 -m pip install --upgrade pip
	pip3 install -r requirements.txt
	python3 -m pip install --pre --upgrade --extra-index-url https://pypi.nvidia.com tensorrt
	```

	4. Perform TensorRT optimized inference:

	- SDXL

	The first invocation builds plan files in `engine_xl_base` and `engine_xl_refiner`. They are specific to the accelerator the demo runs on, and later invocations reuse them.

	```shell
	python3 demo_txt2img_xl.py \
	"Astronaut in a jungle, cold color palette, muted colors, detailed, 8k" \
	--build-static-batch \
	--use-cuda-graph \
	--num-warmup-runs 1 \
	--width 1024 \
	--height 1024 \
	--denoising-steps 30 \
	--onnx-base-dir /workspace/stable-diffusion-xl-1.0-tensorrt/sdxl-1.0-base \
	--onnx-refiner-dir /workspace/stable-diffusion-xl-1.0-tensorrt/sdxl-1.0-refiner
	```
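
To see whether a later invocation will reuse existing engines or rebuild them, one option is to count the plan files in each engine directory. The sketch assumes the plan files carry a `.plan` extension, which the text above does not confirm, and `count_plans` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Count plan files in an engine directory; prints 0 if the directory is missing.
# Assumes plan files end in .plan. count_plans is a hypothetical helper.
count_plans() {
  if [ -d "$1" ]; then
    find "$1" -type f -name '*.plan' | wc -l | tr -d ' '
  else
    echo 0
  fi
}

for d in engine_xl_base engine_xl_refiner; do
  echo "$d: $(count_plans "$d") plan file(s)"
done
```

A count of 0 suggests the next run will build the engines from the ONNX files, which takes noticeably longer than a run that reuses them.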

	- SDXL-LCM

	The first invocation builds plan files in the directory given by `--engine-dir`. They are specific to the accelerator the demo runs on, and later invocations reuse them.

	```shell
	python3 demo_txt2img_xl.py \
	"Astronaut in a jungle, cold color palette, muted colors, detailed, 8k" \
	--version=xl-1.0 \
	--onnx-dir /workspace/stable-diffusion-xl-1.0-tensorrt/lcm \
	--engine-dir /workspace/stable-diffusion-xl-1.0-tensorrt/lcm/engine-sdxl-lcm-nocfg \
	--scheduler LCM \
	--denoising-steps 4 \
	--guidance-scale 0.0 \
	--seed 42
	```

	- SDXL-LCMLORA

	The first invocation builds plan files in the directory given by `--engine-dir`. They are specific to the accelerator the demo runs on, and later invocations reuse them.

	```shell
	python3 demo_txt2img_xl.py \
	"Astronaut in a jungle, cold color palette, muted colors, detailed, 8k" \
	--version=xl-1.0 \
	--onnx-dir /workspace/stable-diffusion-xl-1.0-tensorrt/lcmlora \
	--engine-dir /workspace/stable-diffusion-xl-1.0-tensorrt/lcm/engine-sdxl-lcmlora-nocfg \
	--scheduler LCM \
	--lora-path latent-consistency/lcm-lora-sdxl \
	--lora-scale 1.0 \
	--denoising-steps 4 \
	--guidance-scale 0.0 \
	--seed 42
	```