🤗 Serve any model with Inference Endpoints + Custom Handlers

Community Article · Published November 22, 2024

TL;DR Inference Endpoints provide a secure production solution to easily deploy any Transformers, Sentence-Transformers, or Diffusers model from the Hugging Face Hub on dedicated and autoscaling infrastructure managed by Hugging Face. Inference Endpoints support running custom code via a handler, allowing for tailored pre-processing, inference, or post-processing based on your specific needs. This article explains how to serve any model on Inference Endpoints with Custom Handlers and walks through real use-case examples that anyone can reproduce.

What are Inference Endpoints?

Inference Endpoints provide a secure production solution to easily deploy any Transformers, Sentence-Transformers, and Diffusers models from the Hub on dedicated and autoscaling infrastructure managed by Hugging Face.


Inference Endpoints can be deployed via the Inference Endpoints UI as dedicated endpoints for any model available on the Hugging Face Hub with the Inference Endpoints tag. Alternatively, models with either a "Warm" or "Cold" Inference Status can be used (without deploying a dedicated endpoint) via the Serverless Inference API.

If you're not yet familiar with Inference Endpoints, we recommend checking the documentation first.

What are Custom Handlers?

Custom Handlers are custom Python classes that define the pre-processing, inference, and post-processing steps required to run inference on top of a model. These classes are used internally by the Inference Endpoints backend when using the default container, i.e. the PyTorch container, which comes with support for most of the model architectures and tasks defined on the Hugging Face Hub and supported by Transformers, Sentence-Transformers, and Diffusers.

Custom Handlers extend the functionality of Inference Endpoints beyond native support, offering more flexibility and control over the inference process. They enable users to tweak steps such as pre-processing, inference, and post-processing, incorporate additional dependencies, or implement features like custom metrics or logging, among others. This means users are not stuck with a one-size-fits-all solution, but rather have something they can control and modify to fit their specific needs or requirements whenever the default solution does not already cover them.

Custom handlers are shipped as a handler.py file within a model repository, along with an optional requirements.txt file if extra dependencies are needed. Both are automatically detected and used by the Inference Endpoints backend on startup, if available.
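For illustration, a repository shipping a custom handler could be laid out as follows (the file names besides handler.py and requirements.txt are just examples):

my-model-repository/
├── README.md           # model card, including the `pipeline_tag` metadata
├── config.json
├── model.safetensors
├── handler.py          # custom handler implementing `EndpointHandler`
└── requirements.txt    # optional, extra dependencies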

Getting started!

To get started with custom handlers on the Hugging Face Hub, there are multiple alternatives:

  • Duplicate the repository that contains the model weights and add the handler.py and requirements.txt (if applicable) files to the duplicated repository.
  • Open a PR (or commit to main if you're the only owner) to include the handler.py and requirements.txt (if applicable) files in the existing repository.
  • Create a brand new model repository that just contains the handler.py and the requirements.txt (if applicable).

Note that to enable the Deploy button within the model repository, the README.md metadata should contain pipeline_tag: ... set to a valid pipeline supported by Inference Endpoints, even if the repository doesn't contain the model weights.
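For reference, the README.md metadata lives in a YAML block at the top of the file; a minimal example could look as follows (the text-to-image pipeline tag below is just an illustrative choice):

---
pipeline_tag: text-to-image
---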

Once the repository is set up (with or without the model weights), you should create a handler.py file in the root directory of the repository that implements the following interface:

from typing import Any, Dict

class EndpointHandler:
    def __init__(self, model_dir: str, **kwargs: Any) -> None:
        ...

    def __call__(self, data: Dict[str, Any]) -> Any:
        ...

Note that you can include any other functionality within handler.py, but the class to be implemented needs to be named EndpointHandler and must implement both the __init__ and __call__ methods. You are free to add any other methods to the class, or functions outside of it, and use those within either of these two methods.
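For illustration, a minimal handler that wraps a Transformers pipeline could look like the following sketch (the text-classification task and the "inputs" payload key are just example choices, not requirements of the interface):

from typing import Any, Dict

from transformers import pipeline


class EndpointHandler:
    def __init__(self, model_dir: str, **kwargs: Any) -> None:
        # `model_dir` points to the local snapshot of the repository the endpoint
        # was created from, so it can be passed to `pipeline` as the model
        self.pipeline = pipeline("text-classification", model=model_dir)

    def __call__(self, data: Dict[str, Any]) -> Any:
        # expects a payload such as {"inputs": "I love this product!"}
        return self.pipeline(data["inputs"])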

Finally, once created you can debug it locally by running the following snippet:

if __name__ == "__main__":
    handler = EndpointHandler(model_dir=...)
    assert handler(data=...) == ...

Additionally, if your pipeline requires a specific dependency version, or a dependency that doesn't come with the default PyTorch container, you can include it in the requirements.txt file as:

diffusers>=0.31.0

Then you are all set! When clicking on Deploy and selecting Inference Endpoints (dedicated), you should be able to deploy your Custom Handler on Inference Endpoints. Alternatively, you can go directly to the Inference Endpoints UI and search for the model repository with the custom handler on the Hub.

Tips and Tricks

  • To duplicate the model weights from one repository to another, the most convenient approach is to use the Repo Duplicator - Hugging Face Space, which copies everything within Hugging Face without having to pull and push all the LFS files locally first.

  • Duplicating an existing repository is usually the best approach, since the hardware recommendation shown when creating the Inference Endpoint still applies (except for LoRA adapter weights not hosted alongside the base model weights); otherwise, the hardware recommendation would be ignored when using a custom handler that just pulls the model within the EndpointHandler.__init__ method.

  • Since the main engine powering these containers is the huggingface-inference-toolkit, you can make use of some of the utilities defined in it, such as logging: import the logger via from huggingface_inference_toolkit.logging import logger and then use it as you normally would, e.g. logger.info, logger.debug, etc., and all those logs will be displayed within the Inference Endpoints logs.

  • When selecting a task for the default (i.e. PyTorch) container in the Inference Endpoints UI, make sure to set the task to the same one the model would have (unless it's not supported), so that the playground UI works normally. Note that the playground won't work with modified input payloads or unsupported tasks; in those cases, select the "Custom" task instead, otherwise the playground UI will be useless.

  • If the model weights are not within the current repository and are hosted under a gated repository, you will need to manually set a secret variable within the Inference Endpoint configuration so that the gated model weights can be downloaded. To achieve that, the best approach is to add the following snippet within the EndpointHandler.__init__ method, before running any other step on initialization:

    import os  # this import can also live at the top of `handler.py`

    if os.getenv("HF_TOKEN") is None:
        raise ValueError(
            "Since the model weights are gated, you will need to provide a valid `HF_TOKEN` with read access"
            " to the repository where the weights are hosted."
        )
    

    Note that if the model weights are hosted within the current repository, the token is not required.

  • When deploying an Inference Endpoint from either a duplicated or an existing repository, not all the files within that repository may be required, as it may contain weights in different formats such as safetensors, bin, etc. Since all of those files are downloaded on startup, you may want to delete the unused ones first. This wouldn't happen if the repository just contained the handler.py and requirements.txt (if applicable) and the handler.py pointed to another repository via e.g. transformers.pipeline(task=..., model=...), in which case only the required files would be downloaded instead of all the files in the repository, as shown in the sketch after this list.
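As a rough sketch of that last tip (which also shows the logger utility mentioned above), a lightweight handler.py that doesn't rely on the model weights being present in its own repository could look like this; the task and model ID below are just placeholders:

from typing import Any, Dict

from transformers import pipeline

from huggingface_inference_toolkit.logging import logger


class EndpointHandler:
    def __init__(self, model_dir: str, **kwargs: Any) -> None:
        # instead of relying on `model_dir`, point the pipeline to another repository
        # on the Hub (placeholder model ID); only the files required by the pipeline
        # are downloaded, rather than every file in the repository
        self.pipeline = pipeline(
            task="text-classification",
            model="distilbert/distilbert-base-uncased-finetuned-sst-2-english",
        )

    def __call__(self, data: Dict[str, Any]) -> Any:
        logger.info(f"Received incoming request with {data=}")
        return self.pipeline(data["inputs"])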

Use Cases

Below, you'll find several use cases demonstrating why custom handlers can be valuable, along with simple code snippets showcasing how to reproduce and adapt these to your needs.

Serving LoRA Adapters for Diffusion Models


Say that you want to serve a fine-tuned LoRA adapter for a Diffusers model, such as alvarobartt/ghibli-characters-flux-lora, which is a LoRA fine-tune of black-forest-labs/FLUX.1-dev. When trying to deploy it on Inference Endpoints, the following error shows up:

(Screenshot of the error displayed in the Inference Endpoints UI when trying to deploy the LoRA adapter repository as-is.)

As the error says, you need to make sure that the model repository with the LoRA adapter contains a handler.py file that loads the base model first and then the adapter, as explained in the Diffusers Documentation on How to load adapters.

Note that since the base model (i.e. not the adapter within the repository) is gated, you need to create and set the HF_TOKEN environment variable with a valid Hugging Face Hub token with read access over the gated model, in this case black-forest-labs/FLUX.1-dev.

import os
from typing import Any, Dict

from diffusers import DiffusionPipeline  # type: ignore
from PIL.Image import Image
import torch

from huggingface_inference_toolkit.logging import logger


class EndpointHandler:
    def __init__(self, model_dir: str, **kwargs: Any) -> None:  # type: ignore
        """The current `EndpointHandler` works with any FLUX.1-dev LoRA Adapter."""
        if os.getenv("HF_TOKEN") is None:
            raise ValueError(
                "Since `black-forest-labs/FLUX.1-dev` is a gated model, you will need to provide a valid "
                "`HF_TOKEN` as an environment variable for the handler to work properly."
            )

        self.pipeline = DiffusionPipeline.from_pretrained(
            "black-forest-labs/FLUX.1-dev",
            torch_dtype=torch.bfloat16,
            token=os.getenv("HF_TOKEN"),
        )
        self.pipeline.load_lora_weights(model_dir)
        self.pipeline.to("cuda")

    def __call__(self, data: Dict[str, Any]) -> Image:
        logger.info(f"Received incoming request with {data=}")

        if "inputs" in data and isinstance(data["inputs"], str):
            prompt = data.pop("inputs")
        elif "prompt" in data and isinstance(data["prompt"], str):
            prompt = data.pop("prompt")
        else:
            raise ValueError(
                "Provided input body must contain either the key `inputs` or `prompt` with the"
                " prompt to use for the image generation, and it needs to be a non-empty string."
            )

        parameters = data.pop("parameters", {})

        num_inference_steps = parameters.get("num_inference_steps", 30)
        width = parameters.get("width", 1024)
        height = parameters.get("height", 768)
        guidance_scale = parameters.get("guidance_scale", 3.5)

        # seed generator (seed cannot be provided as is but via a generator)
        seed = parameters.get("seed", 0)
        generator = torch.manual_seed(seed)

        return self.pipeline(  # type: ignore
            prompt,
            height=height,
            width=width,
            guidance_scale=guidance_scale,
            num_inference_steps=num_inference_steps,
            generator=generator,
        ).images[0]

The code above can be reused as a handler.py file within any available LoRA adapter for black-forest-labs/FLUX.1-dev without any code modifications, and with minimal modifications when changing the base model to e.g. stabilityai/stable-diffusion-3.5-large, as most of the code is shared among the different text-to-image use cases.

When deployed on Inference Endpoints, it looks like this:

(Screenshot of the deployed Inference Endpoint.)

Find the custom handler at alvarobartt/ghibli-characters-flux-lora.
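For reference, once deployed, a request to this handler could be sent as in the following sketch; the endpoint URL and token are placeholders, and the payload keys match the __call__ method above:

import requests

# placeholder endpoint URL and token, replace with your own values
API_URL = "https://<endpoint-name>.<region>.aws.endpoints.huggingface.cloud"
headers = {
    "Authorization": "Bearer <HF_TOKEN>",
    "Content-Type": "application/json",
    "Accept": "image/png",
}

payload = {
    "inputs": "Ghibli style portrait of a young adventurer in a forest",
    "parameters": {
        "num_inference_steps": 30,
        "width": 1024,
        "height": 768,
        "guidance_scale": 3.5,
        "seed": 42,
    },
}

response = requests.post(API_URL, headers=headers, json=payload)
response.raise_for_status()
# assuming the PIL image returned by the handler is serialized as raw image bytes
with open("generation.png", "wb") as f:
    f.write(response.content)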

Serving Models from the Hub Not Supported Natively


Say that you want to serve nvidia/NVLM-D-72B, an image-text-to-text model, i.e. a Vision Language Model (VLM), that's supported neither by Text Generation Inference (TGI) nor by the default PyTorch container (since image-text-to-text doesn't have a pre-defined AutoPipeline implementation for that task yet, but should soon, as per https://github.com/huggingface/transformers/pull/34170).

Then you would need to define a custom handler that runs the pre-processing, inference, and post-processing for that task in the handler.py file, including any other requirement in requirements.txt. The latter shouldn't be needed in most cases, since the default PyTorch container already comes with most of the Hugging Face dependencies installed for Transformers, Sentence-Transformers, and Diffusers, as well as some of their commonly used extra dependencies.

Note that using a custom handler in this case is not just about covering an unsupported model, but also about defining a custom device mapping, adding custom pre-processing code, and adding custom logging messages, among other things.

import math
from typing import Any, Dict, List

import torch
import torchvision.transforms as T
from torchvision.transforms.functional import InterpolationMode

import requests
from io import BytesIO
from PIL import Image

from transformers import AutoTokenizer, AutoModel

from huggingface_inference_toolkit.logging import logger


def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    best_ratio_diff = float("inf")
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio


def dynamic_preprocess(
    image, min_num=1, max_num=12, image_size=448, use_thumbnail=False
):
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height

    # calculate the existing image aspect ratio
    target_ratios = set(
        (i, j)
        for n in range(min_num, max_num + 1)
        for i in range(1, n + 1)
        for j in range(1, n + 1)
        if i * j <= max_num and i * j >= min_num
    )
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])

    # find the closest aspect ratio to the target
    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio,
        target_ratios,
        orig_width,
        orig_height,
        image_size,
    )

    # calculate the target width and height
    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]

    # resize the image
    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size,
        )
        # split the image
        split_img = resized_img.crop(box)
        processed_images.append(split_img)
    assert len(processed_images) == blocks
    if use_thumbnail and len(processed_images) != 1:
        thumbnail_img = image.resize((image_size, image_size))
        processed_images.append(thumbnail_img)
    return processed_images


def load_image(image_url, input_size=448, max_num=12):
    response = requests.get(image_url)
    image = Image.open(BytesIO(response.content)).convert("RGB")
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(
        image, image_size=input_size, use_thumbnail=True, max_num=max_num
    )
    pixel_values = [transform(image) for image in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values


def split_model():
    device_map = {}
    world_size = torch.cuda.device_count()
    num_layers = 80
    # Since the first GPU will be used for ViT, treat it as half a GPU.
    num_layers_per_gpu = math.ceil(num_layers / (world_size - 0.5))
    num_layers_per_gpu = [num_layers_per_gpu] * world_size
    num_layers_per_gpu[0] = math.ceil(num_layers_per_gpu[0] * 0.5)
    layer_cnt = 0
    for i, num_layer in enumerate(num_layers_per_gpu):
        for j in range(num_layer):
            device_map[f"language_model.model.layers.{layer_cnt}"] = i
            layer_cnt += 1
    device_map["vision_model"] = 0
    device_map["mlp1"] = 0
    device_map["language_model.model.tok_embeddings"] = 0
    device_map["language_model.model.embed_tokens"] = 0
    device_map["language_model.output"] = 0
    device_map["language_model.model.norm"] = 0
    device_map["language_model.lm_head"] = 0
    device_map[f"language_model.model.layers.{num_layers - 1}"] = 0

    return device_map


IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def build_transform(input_size):
    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    transform = T.Compose(
        [
            T.Lambda(lambda img: img.convert("RGB") if img.mode != "RGB" else img),
            T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
            T.ToTensor(),
            T.Normalize(mean=MEAN, std=STD),
        ]
    )
    return transform


class EndpointHandler:
    def __init__(self, model_dir: str, **kwargs: Any) -> None:
        self.model = AutoModel.from_pretrained(
            model_dir,
            torch_dtype=torch.bfloat16,
            low_cpu_mem_usage=True,
            use_flash_attn=False,
            trust_remote_code=True,
            device_map=split_model(),
        ).eval()

        self.tokenizer = AutoTokenizer.from_pretrained(
            model_dir, trust_remote_code=True, use_fast=False
        )

    def __call__(self, data: Dict[str, Any]) -> Dict[str, List[Any]]:
        logger.info(f"Received incoming request with {data=}")
        
        if "instances" in data:
            logger.warning("Using `instances` instead of `inputs` is deprecated.")
            data["inputs"] = data.pop("instances")

        if "inputs" not in data:
            raise ValueError(
                "The request body must contain a key 'inputs' with a list of inputs."
            )

        if not isinstance(data["inputs"], list):
            raise ValueError(
                "The request inputs must be a list of dictionaries with either the key"
                " 'prompt' or 'prompt' + 'image_url', and optionally including the key"
                " 'generation_config'."
            )

        if not all(isinstance(input, dict) and "prompt" in input.keys() for input in data["inputs"]):
            raise ValueError(
                "The request inputs must be a list of dictionaries with either the key"
                " 'prompt' or 'prompt' + 'image_url', and optionally including the key"
                " 'generation_config'."
            )

        predictions = []
        for input in data["inputs"]:
            if "prompt" not in input:
                raise ValueError(
                    "The request input body must contain at least the key 'prompt' with the prompt to use."
                )

            generation_config = input.get("generation_config", dict(max_new_tokens=1024, do_sample=False))

            if "image_url" not in input:
                # pure-text conversation
                response, history = self.model.chat(
                    self.tokenizer,
                    None,
                    input["prompt"],
                    generation_config,
                    history=None,
                    return_history=True,
                )
            else:
                # single-image single-round conversation
                pixel_values = load_image(input["image_url"], max_num=6).to(
                    torch.bfloat16
                )
                response = self.model.chat(
                    self.tokenizer,
                    pixel_values,
                    f"<image>\n{input['prompt']}",
                    generation_config,
                )

            predictions.append(response)
        return {"predictions": predictions}

When deployed on Inference Endpoints, it looks like this:

(Screenshot of the deployed Inference Endpoint.)

Find the custom handler at alvarobartt/NVLM-D-72B-IE-compatible.
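For reference, the handler above expects a payload along the following lines (the image URL is just a placeholder) and returns a JSON object with a "predictions" list:

{"inputs": [{"prompt": "Describe this image in detail.", "image_url": "https://example.com/image.png", "generation_config": {"max_new_tokens": 512, "do_sample": false}}]}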

Defining Custom Specifications for I/O Payloads


Note that when using custom specifications for the I/O payloads, the "Task" that runs within the "Default" container of the Inference Endpoint needs to be set to "Custom"; otherwise, the playground in the UI will be created for the original task and will fail due to its pre-defined output parsing, whilst the "Custom" task prints out the raw response in JSON format.

Say that you have a UI or SDK expecting an API to receive or produce a given payload, but the default Inference Endpoints payload format for either the input, the output, or both is not compliant with it, and you still want to leverage Hugging Face Inference Endpoints to use the model within your application seamlessly.

Then you would need to implement a custom handler that, for a task such as zero-shot-classification, expects an input different from the default one:

{"inputs": "I have a problem with my iphone that needs to be resolved asap!", "parameters": {"candidate_labels": ["urgent", "not urgent", "phone", "tablet", "computer"]}}

But you want it to expect the following:

{"sequence": "I have a problem with my iphone that needs to be resolved asap!", "labels": ["urgent", "not urgent", "phone", "tablet", "computer"]}

And by default producing the output:

{"sequence": "I have a problem with my iphone that needs to be resolved asap!!", "labels": ["urgent", "phone", "computer", "not urgent", "tablet"], "scores": [0.504, 0.479, 0.013, 0.003, 0.002]}

But you want it to produce:

{"sequence": "I have a problem with my iphone that needs to be resolved asap!!", "label": "urgent", "timestamp": 1732028280}

Then the custom handler would look similar to the following:

from typing import Any, Dict
import time

from transformers import pipeline
import torch

from huggingface_inference_toolkit.logging import logger


class EndpointHandler:
    def __init__(self, model_dir: str, **kwargs: Any) -> None:
        """Initialize the EndpointHandler for zero-shot classification."""
        self.classifier = pipeline(
            "zero-shot-classification",
            model=model_dir,
            device=0 if torch.cuda.is_available() else -1
        )

    def __call__(self, data: Dict[str, Any]) -> Dict[str, Any]:
        logger.info(f"Received incoming request with {data=}")

        if "sequence" not in data or not isinstance(data["sequence"], str):
            raise ValueError(
                "Provided input body must contain the key `sequence` with the text to classify, "
                "and it needs to be a non-empty string."
            )

        if "labels" not in data or not isinstance(data["labels"], list):
            raise ValueError(
                "Provided input body must contain the key `labels` with a list of classification labels."
            )

        sequence = data["sequence"]
        labels = data["labels"]

        output = self.classifier(sequence, candidate_labels=labels)

        return {
            "sequence": sequence,
            "label": output["labels"][0],
            "timestamp": int(time.time())
        }
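Once deployed, an application could then call the endpoint with the custom payload as in the sketch below; the endpoint URL and token are placeholders:

import requests

# placeholder endpoint URL and token, replace with your own values
API_URL = "https://<endpoint-name>.<region>.aws.endpoints.huggingface.cloud"
headers = {"Authorization": "Bearer <HF_TOKEN>", "Content-Type": "application/json"}

payload = {
    "sequence": "I have a problem with my iphone that needs to be resolved asap!",
    "labels": ["urgent", "not urgent", "phone", "tablet", "computer"],
}

response = requests.post(API_URL, headers=headers, json=payload)
print(response.json())
# e.g. {"sequence": "...", "label": "urgent", "timestamp": 1732028280}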

These are just some of the many use cases of custom handlers, which can also include downloading model weights from outside the Hub, e.g. from private storage such as Google Cloud Storage (GCS), adding custom metric reporting or logging, and many others, as sketched below.
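As an illustration of that first point, a handler's __init__ method could pull weights from a private GCS bucket before loading them. The sketch below assumes the google-cloud-storage package is listed in requirements.txt, that credentials are exposed to the endpoint (e.g. via a secret), and that the bucket, blob, and local paths are placeholders:

from google.cloud import storage  # requires `google-cloud-storage` in requirements.txt


def download_weights_from_gcs(bucket_name: str, blob_name: str, local_path: str) -> None:
    # assumes credentials are available, e.g. via a GOOGLE_APPLICATION_CREDENTIALS secret
    client = storage.Client()
    bucket = client.bucket(bucket_name)
    bucket.blob(blob_name).download_to_filename(local_path)


# e.g. within `EndpointHandler.__init__`, before loading the model:
# download_weights_from_gcs("my-private-bucket", "models/model.safetensors", "/tmp/model.safetensors")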

Conclusion

While the default implementation of Inference Endpoints should cover most use cases for Text Generation Inference (TGI), Text Embeddings Inference (TEI), or PyTorch-compatible models hosted on the Hugging Face Hub, there may be instances where these implementations have limitations or don't suit your specific needs or specifications. As explained above, these challenges can be easily addressed by using custom code via the Endpoint Handlers.

Custom Handlers provide Inference Endpoints with substantial flexibility, enabling them to serve virtually any model while managed securely and reliably by Hugging Face. This solution is hosted on major cloud provider infrastructures such as Amazon Web Services (AWS), Google Cloud Platform (GCP), or Microsoft Azure, ensuring robust and scalable deployment options.