---
license: mit
tags:
- endpoints-template
- optimum
library_name: generic
---
# Optimized and Quantized deepset/roberta-base-squad2 with a custom handler.py
This repository implements a custom handler for question-answering for 🤗 Inference Endpoints, enabling accelerated inference with 🤗 Optimum. The code for the customized handler is in the handler.py.

Below we also describe how we converted & optimized the model, based on the Accelerate Transformers with Hugging Face Optimum blog post. You can also check out the notebook.
### Expected Request payload
```json
{
  "inputs": {
    "question": "As what is Philipp working?",
    "context": "Hello, my name is Philipp and I live in Nuremberg, Germany. Currently I am working as a Technical Lead at Hugging Face to democratize artificial intelligence through open source and open science. In the past I designed and implemented cloud-native machine learning architectures for fin-tech and insurance companies. I found my passion for cloud concepts and machine learning 5 years ago. Since then I never stopped learning. Currently, I am focusing myself in the area NLP and how to leverage models like BERT, Roberta, T5, ViT, and GPT2 to generate business value."
  }
}
```
Below is an example of how to run a request using Python and `requests`.
### Run Request
```python
import requests as r

ENDPOINT_URL = ""
HF_TOKEN = ""


def predict(question: str = None, context: str = None):
    payload = {"inputs": {"question": question, "context": context}}
    response = r.post(
        ENDPOINT_URL, headers={"Authorization": f"Bearer {HF_TOKEN}"}, json=payload
    )
    return response.json()


prediction = predict(
    question="As what is Philipp working?",
    context="Hello, my name is Philipp and I live in Nuremberg, Germany. Currently I am working as a Technical Lead at Hugging Face to democratize artificial intelligence through open source and open science."
)
```
### Expected output
```python
{
  'score': 0.4749588668346405,
  'start': 88,
  'end': 102,
  'answer': 'Technical Lead'
}
```
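The `start` and `end` values are character offsets into the submitted context, so the answer can be recovered by slicing the context string. A quick illustration with the example request above:

```python
# 'start'/'end' are character offsets into the context string
context = "Hello, my name is Philipp and I live in Nuremberg, Germany. Currently I am working as a Technical Lead at Hugging Face to democratize artificial intelligence through open source and open science."
print(context[88:102])  # -> 'Technical Lead'
```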
## Convert & Optimize model with Optimum
Steps:
- Convert model to ONNX
- Optimize & quantize model with Optimum
- Create Custom Handler for Inference Endpoints
- Test Custom Handler Locally
- Push to repository and create Inference Endpoint
Helpful links:
- Accelerate Transformers with Hugging Face Optimum
- Optimizing Transformers for GPUs with Optimum
- Optimum Documentation
- Create Custom Handler Endpoints
### Setup & Installation
```python
%%writefile requirements.txt
optimum[onnxruntime]==1.4.0
mkl-include
mkl
```
```python
!pip install -r requirements.txt
```
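Optionally, you can check that the pinned versions were actually installed; a minimal sketch using only the standard library:

```python
# optional sanity check: print installed versions of the pinned packages
from importlib.metadata import version

print(version("optimum"))      # expected: 1.4.0
print(version("onnxruntime"))  # installed via the optimum[onnxruntime] extra
```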
### 0. Baseline Performance
```python
from transformers import pipeline

qa = pipeline("question-answering", model="deepset/roberta-base-squad2")
```
Okay, let's test the performance (latency) with a sequence length of 128.
context="Hello, my name is Philipp and I live in Nuremberg, Germany. Currently I am working as a Technical Lead at Hugging Face to democratize artificial intelligence through open source and open science. In the past I designed and implemented cloud-native machine learning architectures for fin-tech and insurance companies. I found my passion for cloud concepts and machine learning 5 years ago. Since then I never stopped learning. Currently, I am focusing myself in the area NLP and how to leverage models like BERT, Roberta, T5, ViT, and GPT2 to generate business value."
question="As what is Philipp working?"
payload = {"inputs": {"question": question, "context": context}}
```python
from time import perf_counter

import numpy as np


def measure_latency(pipe, payload):
    latencies = []
    # warm up
    for _ in range(10):
        _ = pipe(question=payload["inputs"]["question"], context=payload["inputs"]["context"])
    # timed run
    for _ in range(50):
        start_time = perf_counter()
        _ = pipe(question=payload["inputs"]["question"], context=payload["inputs"]["context"])
        latency = perf_counter() - start_time
        latencies.append(latency)
    # compute run statistics
    time_avg_ms = 1000 * np.mean(latencies)
    time_std_ms = 1000 * np.std(latencies)
    return f"Average latency (ms) - {time_avg_ms:.2f} +/- {time_std_ms:.2f}"


print(f"Vanilla model {measure_latency(qa, payload)}")
# Vanilla model Average latency (ms) - 64.15 +/- 2.44
```
### 1. Convert model to ONNX
```python
from pathlib import Path

from optimum.onnxruntime import ORTModelForQuestionAnswering
from transformers import AutoTokenizer

model_id = "deepset/roberta-base-squad2"
onnx_path = Path(".")

# load vanilla transformers model and convert to onnx
model = ORTModelForQuestionAnswering.from_pretrained(model_id, from_transformers=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# save onnx checkpoint and tokenizer
model.save_pretrained(onnx_path)
tokenizer.save_pretrained(onnx_path)
```
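To double-check the export, you can list the files written to the working directory; a small sketch (file names may vary slightly across Optimum versions):

```python
# optional: confirm the ONNX checkpoint, config and tokenizer files were written
import os

print(sorted(f for f in os.listdir(onnx_path) if f.endswith((".onnx", ".json", ".txt"))))
```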
### 2. Optimize & quantize model with Optimum
```python
from optimum.onnxruntime import ORTOptimizer, ORTQuantizer
from optimum.onnxruntime.configuration import OptimizationConfig, AutoQuantizationConfig

# create the optimizer
optimizer = ORTOptimizer.from_pretrained(model)

# define the optimization strategy by creating the appropriate configuration
optimization_config = OptimizationConfig(optimization_level=99)  # enable all optimizations

# optimize the model
optimizer.optimize(save_dir=onnx_path, optimization_config=optimization_config)

# create ORTQuantizer and define quantization configuration
dynamic_quantizer = ORTQuantizer.from_pretrained(onnx_path, file_name="model_optimized.onnx")
dqconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)

# apply the quantization configuration to the model
model_quantized_path = dynamic_quantizer.quantize(
    save_dir=onnx_path,
    quantization_config=dqconfig,
)
```
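Dynamic quantization stores the weights in int8, so the quantized model should be noticeably smaller on disk. A quick sketch to compare file sizes (assumes the default output names used above, `model_optimized.onnx` and `model_optimized_quantized.onnx`):

```python
# optional: compare on-disk size of the optimized model before and after quantization
import os

size_fp32 = os.path.getsize(onnx_path / "model_optimized.onnx") / (1024 * 1024)
size_int8 = os.path.getsize(onnx_path / "model_optimized_quantized.onnx") / (1024 * 1024)
print(f"Optimized model: {size_fp32:.2f} MB")
print(f"Quantized model: {size_int8:.2f} MB")
```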
### 3. Create Custom Handler for Inference Endpoints
```python
%%writefile handler.py
from typing import Any, Dict

from optimum.onnxruntime import ORTModelForQuestionAnswering
from transformers import AutoTokenizer, pipeline


class EndpointHandler:
    def __init__(self, path=""):
        # load the optimized and quantized model
        self.model = ORTModelForQuestionAnswering.from_pretrained(path, file_name="model_optimized_quantized.onnx")
        self.tokenizer = AutoTokenizer.from_pretrained(path)
        # create inference pipeline
        self.pipeline = pipeline("question-answering", model=self.model, tokenizer=self.tokenizer)

    def __call__(self, data: Dict[str, Any]) -> Dict[str, Any]:
        """
        Args:
            data (:obj:`dict`):
                includes the input data and the parameters for the inference.
        Return:
            A :obj:`dict` containing the answer, score, and character span for the input question/context pair.
        """
        inputs = data.get("inputs", data)
        # run the model
        prediction = self.pipeline(**inputs)
        # return prediction
        return prediction
```
### 4. Test Custom Handler Locally
```python
from handler import EndpointHandler

# init handler
my_handler = EndpointHandler(path=".")

# prepare sample payload
context = "Hello, my name is Philipp and I live in Nuremberg, Germany. Currently I am working as a Technical Lead at Hugging Face to democratize artificial intelligence through open source and open science. In the past I designed and implemented cloud-native machine learning architectures for fin-tech and insurance companies. I found my passion for cloud concepts and machine learning 5 years ago. Since then I never stopped learning. Currently, I am focusing myself in the area NLP and how to leverage models like BERT, Roberta, T5, ViT, and GPT2 to generate business value."
question = "As what is Philipp working?"
payload = {"inputs": {"question": question, "context": context}}

# test the handler
my_handler(payload)
```
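Before measuring latency, a quick structural check that the local handler returns the same fields as the expected output above (a sketch; assumes a single question/context pair, which yields a single answer dict):

```python
# sketch: verify the handler output contains the expected fields
result = my_handler(payload)
assert {"score", "start", "end", "answer"} <= result.keys()
print(result["answer"])
```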
```python
from time import perf_counter

import numpy as np


def measure_latency(handler, payload):
    latencies = []
    # warm up
    for _ in range(10):
        _ = handler(payload)
    # timed run
    for _ in range(50):
        start_time = perf_counter()
        _ = handler(payload)
        latency = perf_counter() - start_time
        latencies.append(latency)
    # compute run statistics
    time_avg_ms = 1000 * np.mean(latencies)
    time_std_ms = 1000 * np.std(latencies)
    return f"Average latency (ms) - {time_avg_ms:.2f} +/- {time_std_ms:.2f}"


print(f"Optimized & Quantized model {measure_latency(my_handler, payload)}")
# Optimized & Quantized model Average latency (ms) - 29.90 +/- 0.53
# Vanilla model Average latency (ms) - 64.15 +/- 2.44
```

The optimized & quantized handler is roughly 2.1x faster than the vanilla pipeline (29.90 ms vs. 64.15 ms average latency).
### 5. Push to repository and create Inference Endpoint
```python
# add all our new files
!git add *
# commit our files
!git commit -m "add custom handler"
# push the files to the hub
!git push
```
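If you prefer the `huggingface_hub` library over plain git, a roughly equivalent sketch (the repository id is a placeholder; assumes you are authenticated, e.g. via `huggingface-cli login`):

```python
from huggingface_hub import HfApi

api = HfApi()
# upload the working directory (ONNX model, tokenizer, handler.py, requirements.txt)
api.upload_folder(
    folder_path=".",
    repo_id="<user>/<repo>",  # placeholder: replace with your repository on the Hub
    repo_type="model",
)
```

Once the files are on the Hub, you can create the Inference Endpoint from this repository.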