🤗 Serve any model with Inference Endpoints + Custom Handlers
TL;DR Inference Endpoints provide a secure production solution to easily deploy any Transformers, Sentence-Transformers, and Diffusers models from the Hugging Face Hub on dedicated and autoscaling infrastructure managed by Hugging Face. Inference Endpoints support running custom code via a handler, allowing for tailored pre-processing, inference, or post-processing based on your specific needs. This article explains how to serve any model on Inference Endpoints with Custom Handlers and walks through real-use case examples that anyone can reproduce.
What are Inference Endpoints?
Inference Endpoints provide a secure production solution to easily deploy any Transformers, Sentence-Transformers, and Diffusers models from the Hub on dedicated and autoscaling infrastructure managed by Hugging Face.
Inference Endpoints can be deployed via the Inference Endpoints UI as dedicated endpoints for any model available in the Hugging Face Hub with the Inference Endpoints tag. Alternatively, they can be used (not deployed) via the Serverless Inference API for any model with either "Warm" or "Cold" Inference Status.
If you're not yet familiar with Inference Endpoints, we recommend checking the documentation first.
What are Custom Handlers?
Custom Handlers are custom classes in Python that define the pre-processing, inference, and post-processing steps required to run inference on top of a model. These custom classes are used internally by the Inference Endpoints backend when using the default container, i.e. the PyTorch container, which comes with support for most of the model architectures and tasks defined within the Hugging Face Hub and supported by Transformers, Sentence-Transformers, and Diffusers.
Custom Handlers extend the functionality of Inference Endpoints beyond native support, offering more flexibility and control over the inference process. They enable users to tweak steps such as pre-processing, inference, and post-processing, incorporate additional dependencies, or implement features like custom metrics or logging, among others. This means users are not stuck with a one-size-fits-all solution, but rather get something they can control and modify to fit their specific needs or requirements, if the default solution does not already cover them.
The custom handlers are shipped as a `handler.py` file within a model repository, together with an optional `requirements.txt` file if needed. These are automatically detected and used by the Inference Endpoints backend on startup, if available.
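For reference, a repository with a custom handler could be laid out as follows (a minimal sketch; the repository name is hypothetical, and the weights file is only present when the repository also hosts the model):

```
my-user/my-model-with-custom-handler/
├── README.md          # model card, with a valid `pipeline_tag` in its metadata
├── handler.py         # the custom EndpointHandler implementation
├── requirements.txt   # optional, only if extra dependencies are needed
└── model.safetensors  # model weights (optional, they can live in another repository)
```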
Getting started!
To get started with custom handlers on the Hugging Face Hub, there are multiple alternatives:
- Duplicate the repository with the model weights to include the `handler.py` and `requirements.txt` (if applicable) files under a separate repository.
- Open a PR (or commit to `main` if you're the only owner) to include the `handler.py` and `requirements.txt` (if applicable) files in the existing repository.
- Create a brand new model repository that just contains the `handler.py` and the `requirements.txt` (if applicable) files.
Note: to enable the `Deploy` button within the model repository, the `README.md` metadata should contain `pipeline_tag: ...` with a valid pipeline task supported by Inference Endpoints, even if the repository doesn't contain the model weights.
Once the repository, with or without the model weights, is set up, you should create a `handler.py` file within the root directory of the repository that implements the following interface:
from typing import Any, Dict
class EndpointHandler:
def __init__(self, model_dir: str, **kwargs: Any) -> None:
...
def __call__(self, data: Dict[str, Any]) -> Any:
...
Note that you can include any other functionality within the `handler.py` file, but the class to be implemented needs to be named `EndpointHandler` and must implement both the `__init__` and `__call__` methods. You are free to include any other method within the class, or any function outside of it, and use those within either of the two class methods.
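For reference, a minimal implementation of that interface could rely on `transformers.pipeline`, as in the sketch below; the task and payload handling are illustrative assumptions rather than requirements of the interface:

```python
from typing import Any, Dict

from transformers import pipeline


class EndpointHandler:
    def __init__(self, model_dir: str, **kwargs: Any) -> None:
        # `model_dir` points to the local snapshot of the repository on startup,
        # so the pipeline can be loaded directly from it
        self.pipeline = pipeline("text-classification", model=model_dir)

    def __call__(self, data: Dict[str, Any]) -> Any:
        # Expect the standard `{"inputs": ...}` payload and run the pipeline on it
        return self.pipeline(data["inputs"])
```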
Finally, once created you can debug it locally by running the following snippet:
if __name__ == "__main__":
handler = EndpointHandler(model_dir=...)
assert handler(data=...) == ...
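For example, with the minimal sketch above, a local debugging run could look as follows (the model identifier, payload, and expected output are hypothetical):

```python
if __name__ == "__main__":
    # Locally, a Hub model id also works as `model_dir`, since `pipeline` resolves it
    handler = EndpointHandler(model_dir="distilbert-base-uncased-finetuned-sst-2-english")
    output = handler(data={"inputs": "I love using Inference Endpoints!"})
    print(output)  # e.g. [{"label": "POSITIVE", "score": 0.99...}]
```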
Additionally, if your pipeline requires any specific dependency version, or even a dependency that doesn't come with the default PyTorch container, you can include that in the `requirements.txt` file as:
diffusers>=0.31.0
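If you created these files locally, one way to push them to the Hub is via the `huggingface_hub` client; a rough sketch, where the repository name and local paths are hypothetical:

```python
from huggingface_hub import HfApi

api = HfApi()  # assumes you are already authenticated, e.g. via `huggingface-cli login`

# Create the (hypothetical) model repository if it doesn't exist yet
api.create_repo(repo_id="my-user/my-custom-handler", repo_type="model", exist_ok=True)

# Upload the custom handler and its optional requirements file
api.upload_file(path_or_fileobj="handler.py", path_in_repo="handler.py", repo_id="my-user/my-custom-handler")
api.upload_file(path_or_fileobj="requirements.txt", path_in_repo="requirements.txt", repo_id="my-user/my-custom-handler")
```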
Then you are all set! When clicking on `Deploy` and then selecting `Inference Endpoints (dedicated)`, you should be able to deploy your Custom Handler on Inference Endpoints! Alternatively, you can also go directly to the Inference Endpoints UI and search for the model repository with the custom handler on the Hub.
Tips and Tricks
- To duplicate the model weights from one repository to another, the most convenient approach is to use the Repo Duplicator - Hugging Face Space, which copies everything within Hugging Face without having to pull and push all the LFS files locally.
- Duplicating an existing repository is usually the best approach, since the hardware recommendation shown when creating the Inference Endpoint still works (except for LoRA adapter weights not hosted along with the base model weights); otherwise, the hardware recommendation is ignored when using a custom handler that just pulls the model within the `EndpointHandler.__init__` method.
- Since the main engine powering custom handlers is the `huggingface-inference-toolkit`, you can make use of some of the utilities it defines, such as the logging via `from huggingface_inference_toolkit.logging import logger`; then just use the imported `logger` normally, e.g. `logger.info`, `logger.debug`, etc., and all those logs will be displayed within the Inference Endpoints logs.
- When selecting a task for the default (i.e. PyTorch) container in the Inference Endpoints UI, make sure to set it to the same task the model would have (unless unsupported), so that the playground UI works normally. Note that the playground won't work with modified input payloads or unsupported tasks; in those cases, select the "Custom" task instead, otherwise the playground UI will be useless.
- If the model weights are not within the current repository but under a gated repository, you will need to manually set a secret variable within the Inference Endpoint configuration so that the gated model weights can be downloaded. To achieve that, the best approach is to add the following snippet within the `EndpointHandler.__init__` method before running any other initialization step:

      if os.getenv("HF_TOKEN") is None:
          raise ValueError(
              "Since the model weights are gated, you will need to provide a valid `HF_TOKEN` with read-access"
              " to the repository where the weights are hosted."
          )

  Note that if the model weights are hosted within the current repository, the token is not required.
- When deploying an Inference Endpoint from either a duplicated or an existing repository, not all the files within that repository may be required, as it may contain weights in different formats such as `safetensors`, `bin`, etc., and, since all of those will be downloaded on startup, you may want to delete the unused files first. That wouldn't happen if the repository just contained the `handler.py` and `requirements.txt` (if applicable) and the `handler.py` pointed to another repository via e.g. `transformers.pipeline(task=..., model=...)`, where just the required files would be downloaded instead of all the files in the repository (see the sketch after this list).
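As a hedged sketch of that last pattern, a `handler.py` hosted in a repository without weights could point to another Hub repository and use the toolkit logger; the task and model below are illustrative assumptions, not part of the original examples:

```python
from typing import Any, Dict

from huggingface_inference_toolkit.logging import logger
from transformers import pipeline


class EndpointHandler:
    def __init__(self, model_dir: str, **kwargs: Any) -> None:
        # This repository only ships the handler, so `model_dir` is ignored and the
        # weights are pulled from another Hub repository instead; only the files
        # required by the pipeline get downloaded
        self.pipeline = pipeline(
            "text-classification",
            model="distilbert-base-uncased-finetuned-sst-2-english",
        )

    def __call__(self, data: Dict[str, Any]) -> Any:
        # Logs emitted through the toolkit logger show up in the Inference Endpoints logs
        logger.info(f"Received incoming request with {data=}")
        return self.pipeline(data["inputs"])
```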
Use Cases
Below, you'll find several use cases demonstrating why custom handlers can be valuable, along with simple code snippets showcasing how to reproduce and adapt these to your needs.
Serving LoRA Adapters for Diffusion Models
Say that you want to serve a fine-tuned LoRA adapter for a Diffusers model, such as `alvarobartt/ghibli-characters-flux-lora`, which is a LoRA fine-tune of `black-forest-labs/FLUX.1-dev`. When trying to deploy it on Inference Endpoints, the following error shows up:
As the error says, you need to make sure that the model repository with the LoRA adapter contains a `handler.py` file that will load the model first and then the adapter, as explained in the Diffusers Documentation on How to load adapters.
Note that since the base model here (i.e. not the adapter within the repository) is gated, you need to make sure that you create and set the `HF_TOKEN` environment variable with a valid Hugging Face Hub token that has read access over the gated model, in this case `black-forest-labs/FLUX.1-dev`.
import os
from typing import Any, Dict
from diffusers import DiffusionPipeline # type: ignore
from PIL.Image import Image
import torch
from huggingface_inference_toolkit.logging import logger
class EndpointHandler:
def __init__(self, model_dir: str, **kwargs: Any) -> None: # type: ignore
"""The current `EndpointHandler` works with any FLUX.1-dev LoRA Adapter."""
if os.getenv("HF_TOKEN") is None:
raise ValueError(
"Since `black-forest-labs/FLUX.1-dev` is a gated model, you will need to provide a valid "
"`HF_TOKEN` as an environment variable for the handler to work properly."
)
self.pipeline = DiffusionPipeline.from_pretrained(
"black-forest-labs/FLUX.1-dev",
torch_dtype=torch.bfloat16,
token=os.getenv("HF_TOKEN"),
)
self.pipeline.load_lora_weights(model_dir)
self.pipeline.to("cuda")
def __call__(self, data: Dict[str, Any]) -> Image:
logger.info(f"Received incoming request with {data=}")
if "inputs" in data and isinstance(data["inputs"], str):
prompt = data.pop("inputs")
elif "prompt" in data and isinstance(data["prompt"], str):
prompt = data.pop("prompt")
else:
raise ValueError(
"Provided input body must contain either the key `inputs` or `prompt` with the"
" prompt to use for the image generation, and it needs to be a non-empty string."
)
parameters = data.pop("parameters", {})
num_inference_steps = parameters.get("num_inference_steps", 30)
width = parameters.get("width", 1024)
height = parameters.get("height", 768)
guidance_scale = parameters.get("guidance_scale", 3.5)
# seed generator (seed cannot be provided as is but via a generator)
seed = parameters.get("seed", 0)
generator = torch.manual_seed(seed)
return self.pipeline( # type: ignore
prompt,
height=height,
width=width,
guidance_scale=guidance_scale,
num_inference_steps=num_inference_steps,
generator=generator,
).images[0]
The code above can be reused and included as a `handler.py` file within any available LoRA adapter for `black-forest-labs/FLUX.1-dev` without any code modification required, and with minimal modifications when changing the base model to e.g. `stabilityai/stable-diffusion-3.5-large`, as most of the code is shared among the different `text-to-image` use cases.
Which when deployed on Inference Endpoints looks like this:
Find the custom handler at `alvarobartt/ghibli-characters-flux-lora`.
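Once the endpoint is running, it can be queried with a payload matching what the `__call__` method above expects. Below is a hedged sketch using the `requests` library; the endpoint URL is a placeholder for your own deployment, the prompt is illustrative, and the returned PIL image is assumed to be serialized as raw image bytes by the inference toolkit:

```python
import os

import requests

# Placeholder URL; copy the actual one from the Inference Endpoints UI
ENDPOINT_URL = "https://<endpoint-name>.<region>.aws.endpoints.huggingface.cloud"

response = requests.post(
    ENDPOINT_URL,
    headers={
        "Authorization": f"Bearer {os.getenv('HF_TOKEN')}",
        "Content-Type": "application/json",
    },
    json={
        "inputs": "Ghibli style character portrait",  # the `prompt` key also works
        "parameters": {"num_inference_steps": 30, "width": 1024, "height": 768, "seed": 42},
    },
)

# Write the returned image bytes to disk
with open("generation.png", "wb") as f:
    f.write(response.content)
```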
Serving Models from the Hub Not Supported Natively
Say that you want to serve `nvidia/NVLM-D-72B`, which is an `image-text-to-text` model, i.e. a Visual Language Model (VLM), that's not supported on Text Generation Inference (TGI), nor on the default PyTorch container (since `image-text-to-text` doesn't have a pre-defined `AutoPipeline` implementation for that task yet, but should soon have it as per https://github.com/huggingface/transformers/pull/34170).
Then you would need to define a custom handler that runs the pre-processing, inference, and post-processing for that task in the `handler.py` file, including any other requirement in `requirements.txt`; this shouldn't be needed in most cases, since the default PyTorch container already comes with most of the Hugging Face dependencies for Transformers, Sentence-Transformers, and Diffusers installed, as well as some commonly used extra dependencies.
Note that using custom handlers, in this case, is not just to cover an unsupported model but also to define a custom device mapping, add custom pre-processing code, and add some custom logging messages, among other things.
import math
from typing import Any, Dict, List
import torch
import torchvision.transforms as T
from torchvision.transforms.functional import InterpolationMode
import requests
from io import BytesIO
from PIL import Image
from transformers import AutoTokenizer, AutoModel
from huggingface_inference_toolkit.logging import logger
def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
best_ratio_diff = float("inf")
best_ratio = (1, 1)
area = width * height
for ratio in target_ratios:
target_aspect_ratio = ratio[0] / ratio[1]
ratio_diff = abs(aspect_ratio - target_aspect_ratio)
if ratio_diff < best_ratio_diff:
best_ratio_diff = ratio_diff
best_ratio = ratio
elif ratio_diff == best_ratio_diff:
if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
best_ratio = ratio
return best_ratio
def dynamic_preprocess(
image, min_num=1, max_num=12, image_size=448, use_thumbnail=False
):
orig_width, orig_height = image.size
aspect_ratio = orig_width / orig_height
# calculate the existing image aspect ratio
target_ratios = set(
(i, j)
for n in range(min_num, max_num + 1)
for i in range(1, n + 1)
for j in range(1, n + 1)
if i * j <= max_num and i * j >= min_num
)
target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])
# find the closest aspect ratio to the target
target_aspect_ratio = find_closest_aspect_ratio(
aspect_ratio,
target_ratios,
orig_width,
orig_height,
image_size,
)
# calculate the target width and height
target_width = image_size * target_aspect_ratio[0]
target_height = image_size * target_aspect_ratio[1]
blocks = target_aspect_ratio[0] * target_aspect_ratio[1]
# resize the image
resized_img = image.resize((target_width, target_height))
processed_images = []
for i in range(blocks):
box = (
(i % (target_width // image_size)) * image_size,
(i // (target_width // image_size)) * image_size,
((i % (target_width // image_size)) + 1) * image_size,
((i // (target_width // image_size)) + 1) * image_size,
)
# split the image
split_img = resized_img.crop(box)
processed_images.append(split_img)
assert len(processed_images) == blocks
if use_thumbnail and len(processed_images) != 1:
thumbnail_img = image.resize((image_size, image_size))
processed_images.append(thumbnail_img)
return processed_images
def load_image(image_url, input_size=448, max_num=12):
response = requests.get(image_url)
image = Image.open(BytesIO(response.content)).convert("RGB")
transform = build_transform(input_size=input_size)
images = dynamic_preprocess(
image, image_size=input_size, use_thumbnail=True, max_num=max_num
)
pixel_values = [transform(image) for image in images]
pixel_values = torch.stack(pixel_values)
return pixel_values
def split_model():
device_map = {}
world_size = torch.cuda.device_count()
num_layers = 80
# Since the first GPU will be used for ViT, treat it as half a GPU.
num_layers_per_gpu = math.ceil(num_layers / (world_size - 0.5))
num_layers_per_gpu = [num_layers_per_gpu] * world_size
num_layers_per_gpu[0] = math.ceil(num_layers_per_gpu[0] * 0.5)
layer_cnt = 0
for i, num_layer in enumerate(num_layers_per_gpu):
for j in range(num_layer):
device_map[f"language_model.model.layers.{layer_cnt}"] = i
layer_cnt += 1
device_map["vision_model"] = 0
device_map["mlp1"] = 0
device_map["language_model.model.tok_embeddings"] = 0
device_map["language_model.model.embed_tokens"] = 0
device_map["language_model.output"] = 0
device_map["language_model.model.norm"] = 0
device_map["language_model.lm_head"] = 0
device_map[f"language_model.model.layers.{num_layers - 1}"] = 0
return device_map
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)
def build_transform(input_size):
MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
transform = T.Compose(
[
T.Lambda(lambda img: img.convert("RGB") if img.mode != "RGB" else img),
T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
T.ToTensor(),
T.Normalize(mean=MEAN, std=STD),
]
)
return transform
class EndpointHandler:
def __init__(self, model_dir: str, **kwargs: Any) -> None:
self.model = AutoModel.from_pretrained(
model_dir,
torch_dtype=torch.bfloat16,
low_cpu_mem_usage=True,
use_flash_attn=False,
trust_remote_code=True,
device_map=split_model(),
).eval()
self.tokenizer = AutoTokenizer.from_pretrained(
model_dir, trust_remote_code=True, use_fast=False
)
def __call__(self, data: Dict[str, Any]) -> Dict[str, List[Any]]:
logger.info(f"Received incoming request with {data=}")
if "instances" in data:
logger.warning("Using `instances` instead of `inputs` is deprecated.")
data["inputs"] = data.pop("instances")
if "inputs" not in data:
raise ValueError(
"The request body must contain a key 'inputs' with a list of inputs."
)
if not isinstance(data["inputs"], list):
raise ValueError(
"The request inputs must be a list of dictionaries with either the key"
" 'prompt' or 'prompt' + 'image_url', and optionally including the key"
" 'generation_config'."
)
if not all(isinstance(input, dict) and "prompt" in input.keys() for input in data["inputs"]):
raise ValueError(
"The request inputs must be a list of dictionaries with either the key"
" 'prompt' or 'prompt' + 'image_url', and optionally including the key"
" 'generation_config'."
)
predictions = []
for input in data["inputs"]:
if "prompt" not in input:
raise ValueError(
"The request input body must contain at least the key 'prompt' with the prompt to use."
)
generation_config = input.get("generation_config", dict(max_new_tokens=1024, do_sample=False))
if "image_url" not in input:
# pure-text conversation
response, history = self.model.chat(
self.tokenizer,
None,
input["prompt"],
generation_config,
history=None,
return_history=True,
)
else:
# single-image single-round conversation
pixel_values = load_image(input["image_url"], max_num=6).to(
torch.bfloat16
)
response = self.model.chat(
self.tokenizer,
pixel_values,
f"<image>\n{input['prompt']}",
generation_config,
)
predictions.append(response)
return {"predictions": predictions}
Which when deployed on Inference Endpoints looks like this:
Find the custom handler at `alvarobartt/NVLM-D-72B-IE-compatible`.
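As before, the deployed endpoint can be queried with a payload matching what the `__call__` method above expects; a hedged sketch where the endpoint URL, prompts, and image URL are placeholders:

```python
import os

import requests

ENDPOINT_URL = "https://<endpoint-name>.<region>.aws.endpoints.huggingface.cloud"  # placeholder

response = requests.post(
    ENDPOINT_URL,
    headers={
        "Authorization": f"Bearer {os.getenv('HF_TOKEN')}",
        "Content-Type": "application/json",
    },
    json={
        "inputs": [
            # pure-text conversation
            {"prompt": "Write a short poem about autumn."},
            # single-image single-round conversation (placeholder image URL)
            {
                "prompt": "Describe this image.",
                "image_url": "https://example.com/image.png",
                "generation_config": {"max_new_tokens": 256, "do_sample": False},
            },
        ]
    },
)

print(response.json())  # {"predictions": ["...", "..."]}
```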
Defining Custom Specifications for I/O Payloads
Note that when using custom specifications for the I/O payloads, the "Task" of the default container within the Inference Endpoint needs to be set to "Custom"; otherwise, the playground in the UI will be created for the original task and will fail due to its pre-defined output parsing, whilst the "Custom" task will print out the raw response in JSON format.
Say that you have a UI or SDK that expects an API to receive or produce a given payload, but the default Inference Endpoints payload formatting for the input, the output, or both is not compliant with it, and you still want to leverage Hugging Face Inference Endpoints to use those models within your application seamlessly.
Then you would need to implement a custom handler. For a task such as `zero-shot-classification`, the default expected input is:
{"inputs": "I have a problem with my iphone that needs to be resolved asap!", "parameters": {"candidate_labels": ["urgent", "not urgent", "phone", "tablet", "computer"]}}
But you want it to expect the following:
{"sequence": "I have a problem with my iphone that needs to be resolved asap!", "labels": ["urgent", "not urgent", "phone", "tablet", "computer"]}
And, by default, it produces the following output:
{"sequence": "I have a problem with my iphone that needs to be resolved asap!!", "labels": ["urgent", "phone", "computer", "not urgent", "tablet"], "scores": [0.504, 0.479, 0.013, 0.003, 0.002]}
But you want it to produce:
{"sequence": "I have a problem with my iphone that needs to be resolved asap!!", "label": "urgent", "timestamp": 1732028280}
Then the custom handler would look similar to the following:
import os
from typing import Any, Dict
import time
from transformers import pipeline
import torch
from huggingface_inference_toolkit.logging import logger
class EndpointHandler:
def __init__(self, model_dir: str, **kwargs: Any) -> None:
"""Initialize the EndpointHandler for zero-shot classification."""
self.classifier = pipeline(
"zero-shot-classification",
model=model_dir,
device=0 if torch.cuda.is_available() else -1
)
def __call__(self, data: Dict[str, Any]) -> Dict[str, Any]:
logger.info(f"Received incoming request with {data=}")
if "sequence" not in data or not isinstance(data["sequence"], str):
raise ValueError(
"Provided input body must contain the key `sequence` with the text to classify, "
"and it needs to be a non-empty string."
)
if "labels" not in data or not isinstance(data["labels"], list):
raise ValueError(
"Provided input body must contain the key `labels` with a list of classification labels."
)
sequence = data["sequence"]
labels = data["labels"]
output = self.classifier(sequence, candidate_labels=labels)
return {
"sequence": sequence,
"label": output["labels"][0],
"timestamp": int(time.time())
}
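For completeness, a hedged sketch of how an application could call the resulting endpoint with the custom payload defined above (the endpoint URL is a placeholder):

```python
import os

import requests

ENDPOINT_URL = "https://<endpoint-name>.<region>.aws.endpoints.huggingface.cloud"  # placeholder

response = requests.post(
    ENDPOINT_URL,
    headers={
        "Authorization": f"Bearer {os.getenv('HF_TOKEN')}",
        "Content-Type": "application/json",
    },
    json={
        "sequence": "I have a problem with my iphone that needs to be resolved asap!",
        "labels": ["urgent", "not urgent", "phone", "tablet", "computer"],
    },
)

print(response.json())  # e.g. {"sequence": "...", "label": "urgent", "timestamp": 1732028280}
```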
These are just some of the many use cases for custom handlers; others include downloading model weights from private storage outside the Hub, e.g. Google Cloud Storage (GCS), adding custom metric reporting or logging, among others.
Conclusion
While the default implementation of Inference Endpoints should cover most use cases for Text Generation Inference (TGI), Text Embeddings Inference (TEI), or PyTorch-compatible models hosted on the Hugging Face Hub, there may be instances where these implementations have limitations or don't suit your specific needs or specifications. As explained above, these challenges can be easily addressed by using custom code via the Endpoint Handlers.
Custom Handlers provide Inference Endpoints with substantial flexibility, enabling them to serve virtually any model while managed securely and reliably by Hugging Face. This solution is hosted on major cloud provider infrastructures such as Amazon Web Services (AWS), Google Cloud Platform (GCP), or Microsoft Azure, ensuring robust and scalable deployment options.