Optimized and Quantized sentence-transformers/all-MiniLM-L6-v2 with a custom pipeline.py

This repository implements a custom task for sentence-embeddings for 🤗 Inference Endpoints for accelerated inference using 🤗 Optimum. The code for the customized pipeline is in the pipeline.py.

In the how to create your own optimized and quantized model you will learn how the model was converted & optimized, it is based on the Accelerate Sentence Transformers with Hugging Face Optimum blog post. It also includes how to create your custom pipeline and test it. There is also a notebook included.

To use deploy this model a an Inference Endpoint you have to select Custom as task to use the pipeline.py file. -> double check if it is selected

expected Request payload

{
  "inputs": "The sky is a blue today and not gray", 
}

below is an example on how to run a request using Python and requests.

Run Request

import json
from typing import List
import requests as r
import base64

ENDPOINT_URL = ""
HF_TOKEN = ""


def predict(document_string:str=None):

    payload = {"inputs": document_string}
    response = r.post(
        ENDPOINT_URL, headers={"Authorization": f"Bearer {HF_TOKEN}"}, json=payload
    )
    return response.json()


prediction = predict(
    path_to_image="The sky is a blue today and not gray"
)

expected output

{'embeddings': [[-0.021580450236797333,
   0.021715054288506508,
   0.00979710929095745,
   -0.0005379787762649357,
   0.04682469740509987,
   -0.013600599952042103,
   ...
}

How to create your own optimized and quantized model

Steps: 1. Convert model to ONNX
2. Optimize & quantize model with Optimum
3. Create Custom Handler for Inference Endpoints

Helpful links:

Setup & Installation

%%writefile requirements.txt
optimum[onnxruntime]==1.3.0
mkl-include
mkl

install requirements

!pip install -r requirements.txt

1. Convert model to ONNX

from optimum.onnxruntime import ORTModelForFeatureExtraction
from transformers import AutoTokenizer
from pathlib import Path


model_id="sentence-transformers/all-MiniLM-L6-v2"
onnx_path = Path(".")

# load vanilla transformers and convert to onnx
model = ORTModelForFeatureExtraction.from_pretrained(model_id, from_transformers=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# save onnx checkpoint and tokenizer
model.save_pretrained(onnx_path)
tokenizer.save_pretrained(onnx_path)

2. Optimize & quantize model with Optimum

from optimum.onnxruntime import ORTOptimizer, ORTQuantizer
from optimum.onnxruntime.configuration import OptimizationConfig, AutoQuantizationConfig

# create ORTOptimizer and define optimization configuration
optimizer = ORTOptimizer.from_pretrained(model_id, feature=model.pipeline_task)
optimization_config = OptimizationConfig(optimization_level=99) # enable all optimizations

# apply the optimization configuration to the model
optimizer.export(
    onnx_model_path=onnx_path / "model.onnx",
    onnx_optimized_model_output_path=onnx_path / "model-optimized.onnx",
    optimization_config=optimization_config,
)


# create ORTQuantizer and define quantization configuration
dynamic_quantizer = ORTQuantizer.from_pretrained(model_id, feature=model.pipeline_task)
dqconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)

# apply the quantization configuration to the model
model_quantized_path = dynamic_quantizer.export(
    onnx_model_path=onnx_path / "model-optimized.onnx",
    onnx_quantized_model_output_path=onnx_path / "model-quantized.onnx",
    quantization_config=dqconfig,
)

3. Create Custom Handler for Inference Endpoints

%%writefile pipeline.py
from typing import  Dict, List, Any
from optimum.onnxruntime import ORTModelForFeatureExtraction
from transformers import AutoTokenizer
import torch.nn.functional as F
import torch

# copied from the model card
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0] #First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


class PreTrainedPipeline():
    def __init__(self, path=""):
        # load the optimized model
        self.model = ORTModelForFeatureExtraction.from_pretrained(path, file_name="model-quantized.onnx")
        self.tokenizer = AutoTokenizer.from_pretrained(path)

    def __call__(self, data: Any) -> List[List[Dict[str, float]]]:
        """
        Args:
            data (:obj:):
                includes the input data and the parameters for the inference.
        Return:
            A :obj:`list`:. The list contains the embeddings of the inference inputs
        """
        inputs = data.get("inputs", data)

        # tokenize the input
        encoded_inputs = self.tokenizer(inputs, padding=True, truncation=True, return_tensors='pt')
        # run the model
        outputs = self.model(**encoded_inputs)
        # Perform pooling
        sentence_embeddings = mean_pooling(outputs, encoded_inputs['attention_mask'])
        # Normalize embeddings
        sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)
        # postprocess the prediction
        return {"embeddings": sentence_embeddings.tolist()}

test custom pipeline

from pipeline import PreTrainedPipeline

# init handler
my_handler = PreTrainedPipeline(path=".")

# prepare sample payload
request = {"inputs": "I am quite excited how this will turn out"}

# test the handler
%timeit my_handler(request)

results

1.55 ms ± 2.04 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)