---
license: mit
tags:
- endpoints-template
- optimum
library_name: generic
---
# Optimized and Quantized [deepset/roberta-base-squad2](https://huggingface.co/deepset/roberta-base-squad2) with a custom handler.py
This repository implements a `custom` handler for `question-answering` for 🤗 Inference Endpoints, enabling accelerated inference with [🤗 Optimum](https://huggingface.co/docs/optimum/index). The code for the customized handler is in [handler.py](https://huggingface.co/philschmid/roberta-base-squad2-optimized/blob/main/handler.py).

Below we also describe how the model was converted & optimized, based on the [Accelerate Transformers with Hugging Face Optimum](https://huggingface.co/blog/optimum-inference) blog post. You can also check out the [notebook](https://huggingface.co/philschmid/roberta-base-squad2-optimized/blob/main/optimize_model.ipynb).
### Expected Request payload
```json
{
  "inputs": {
    "question": "As what is Philipp working?",
    "context": "Hello, my name is Philipp and I live in Nuremberg, Germany. Currently I am working as a Technical Lead at Hugging Face to democratize artificial intelligence through open source and open science. In the past I designed and implemented cloud-native machine learning architectures for fin-tech and insurance companies. I found my passion for cloud concepts and machine learning 5 years ago. Since then I never stopped learning. Currently, I am focusing myself in the area NLP and how to leverage models like BERT, Roberta, T5, ViT, and GPT2 to generate business value."
  }
}
```
Below is an example of how to run a request using Python and `requests`.

## Run Request
```python
import requests as r

ENDPOINT_URL = ""  # url of your Inference Endpoint
HF_TOKEN = ""      # your Hugging Face token


def predict(question: str = None, context: str = None):
    payload = {"inputs": {"question": question, "context": context}}
    response = r.post(
        ENDPOINT_URL, headers={"Authorization": f"Bearer {HF_TOKEN}"}, json=payload
    )
    return response.json()


prediction = predict(
    question="As what is Philipp working?",
    context="Hello, my name is Philipp and I live in Nuremberg, Germany. Currently I am working as a Technical Lead at Hugging Face to democratize artificial intelligence through open source and open science."
)
```
Expected output:

```python
{
  'score': 0.4749588668346405,
  'start': 88,
  'end': 102,
  'answer': 'Technical Lead'
}
```
# Convert & Optimize model with Optimum

Steps:
1. [Convert model to ONNX](#1-convert-model-to-onnx)
2. [Optimize & quantize model with Optimum](#2-optimize--quantize-model-with-optimum)
3. [Create Custom Handler for Inference Endpoints](#3-create-custom-handler-for-inference-endpoints)
4. [Test Custom Handler Locally](#4-test-custom-handler-locally)
5. [Push to repository and create Inference Endpoint](#5-push-to-repository-and-create-inference-endpoint)

Helpful links:
* [Accelerate Transformers with Hugging Face Optimum](https://huggingface.co/blog/optimum-inference)
* [Optimizing Transformers for GPUs with Optimum](https://www.philschmid.de/optimizing-transformers-with-optimum-gpu)
* [Optimum Documentation](https://huggingface.co/docs/optimum/onnxruntime/modeling_ort)
* [Create Custom Handler Endpoints](https://link-to-docs)
## Setup & Installation

```python
%%writefile requirements.txt
optimum[onnxruntime]==1.4.0
mkl-include
mkl
```

```python
!pip install -r requirements.txt
```
## 0. Baseline Performance

```python
from transformers import pipeline

qa = pipeline("question-answering", model="deepset/roberta-base-squad2")
```
Okay, let's test the performance (latency) with a sequence length of 128.

```python
context = "Hello, my name is Philipp and I live in Nuremberg, Germany. Currently I am working as a Technical Lead at Hugging Face to democratize artificial intelligence through open source and open science. In the past I designed and implemented cloud-native machine learning architectures for fin-tech and insurance companies. I found my passion for cloud concepts and machine learning 5 years ago. Since then I never stopped learning. Currently, I am focusing myself in the area NLP and how to leverage models like BERT, Roberta, T5, ViT, and GPT2 to generate business value."
question = "As what is Philipp working?"

payload = {"inputs": {"question": question, "context": context}}
```
```python
from time import perf_counter
import numpy as np

def measure_latency(pipe, payload):
    latencies = []
    # warm up
    for _ in range(10):
        _ = pipe(question=payload["inputs"]["question"], context=payload["inputs"]["context"])
    # timed run
    for _ in range(50):
        start_time = perf_counter()
        _ = pipe(question=payload["inputs"]["question"], context=payload["inputs"]["context"])
        latency = perf_counter() - start_time
        latencies.append(latency)
    # compute run statistics
    time_avg_ms = 1000 * np.mean(latencies)
    time_std_ms = 1000 * np.std(latencies)
    return f"Average latency (ms) - {time_avg_ms:.2f} +/- {time_std_ms:.2f}"

print(f"Vanilla model {measure_latency(qa, payload)}")
# Vanilla model Average latency (ms) - 64.15 +/- 2.44
```
## 1. Convert model to ONNX

```python
from optimum.onnxruntime import ORTModelForQuestionAnswering
from transformers import AutoTokenizer
from pathlib import Path

model_id = "deepset/roberta-base-squad2"
onnx_path = Path(".")

# load vanilla transformers model and convert it to onnx
model = ORTModelForQuestionAnswering.from_pretrained(model_id, from_transformers=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# save onnx checkpoint and tokenizer
model.save_pretrained(onnx_path)
tokenizer.save_pretrained(onnx_path)
```
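Before optimizing, it can be useful to sanity-check the export: the `ORTModelForQuestionAnswering` is a drop-in replacement for the vanilla model in a `transformers` pipeline. A minimal check, reusing the `question` and `context` from the baseline section above:

```python
# quick sanity check: run the exported ONNX model through a transformers pipeline
from transformers import pipeline

onnx_qa = pipeline("question-answering", model=model, tokenizer=tokenizer)
print(onnx_qa(question=question, context=context))
```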
## 2. Optimize & quantize model with Optimum

```python
from optimum.onnxruntime import ORTOptimizer, ORTQuantizer
from optimum.onnxruntime.configuration import OptimizationConfig, AutoQuantizationConfig

# Create the optimizer
optimizer = ORTOptimizer.from_pretrained(model)

# Define the optimization strategy by creating the appropriate configuration
optimization_config = OptimizationConfig(optimization_level=99)  # enable all optimizations

# Optimize the model
optimizer.optimize(save_dir=onnx_path, optimization_config=optimization_config)
```
```python
# create ORTQuantizer and define quantization configuration
dynamic_quantizer = ORTQuantizer.from_pretrained(onnx_path, file_name="model_optimized.onnx")
dqconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)

# apply the quantization configuration to the model
model_quantized_path = dynamic_quantizer.quantize(
    save_dir=onnx_path,
    quantization_config=dqconfig,
)
```
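To get a feel for what quantization saves on disk, you can compare the file sizes of the exported and the quantized ONNX files. A small sketch, assuming the default file names produced above (`model.onnx` and `model_optimized_quantized.onnx`):

```python
import os

# compare the on-disk size of the vanilla ONNX export and the optimized + quantized model
size_fp32 = os.path.getsize(onnx_path / "model.onnx") / (1024 * 1024)
size_quant = os.path.getsize(onnx_path / "model_optimized_quantized.onnx") / (1024 * 1024)

print(f"Vanilla ONNX model file size:   {size_fp32:.2f} MB")
print(f"Quantized ONNX model file size: {size_quant:.2f} MB")
```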
## 3. Create Custom Handler for Inference Endpoints

```python
%%writefile handler.py
from typing import Dict, Any
from optimum.onnxruntime import ORTModelForQuestionAnswering
from transformers import AutoTokenizer, pipeline


class EndpointHandler():
    def __init__(self, path=""):
        # load the optimized and quantized model
        self.model = ORTModelForQuestionAnswering.from_pretrained(path, file_name="model_optimized_quantized.onnx")
        self.tokenizer = AutoTokenizer.from_pretrained(path)
        # create inference pipeline
        self.pipeline = pipeline("question-answering", model=self.model, tokenizer=self.tokenizer)

    def __call__(self, data: Dict[str, Any]) -> Dict[str, Any]:
        """
        Args:
            data (:obj:`dict`):
                includes the input data and the parameters for the inference.
        Return:
            A :obj:`dict` containing the answer, score, and start/end position for the given question & context.
        """
        inputs = data.get("inputs", data)
        # run the question-answering pipeline
        prediction = self.pipeline(**inputs)
        # return the prediction
        return prediction
```
## 4. Test Custom Handler Locally

```python
from handler import EndpointHandler

# init handler
my_handler = EndpointHandler(path=".")

# prepare sample payload
context = "Hello, my name is Philipp and I live in Nuremberg, Germany. Currently I am working as a Technical Lead at Hugging Face to democratize artificial intelligence through open source and open science. In the past I designed and implemented cloud-native machine learning architectures for fin-tech and insurance companies. I found my passion for cloud concepts and machine learning 5 years ago. Since then I never stopped learning. Currently, I am focusing myself in the area NLP and how to leverage models like BERT, Roberta, T5, ViT, and GPT2 to generate business value."
question = "As what is Philipp working?"

payload = {"inputs": {"question": question, "context": context}}

# test the handler
my_handler(payload)
```
```python
from time import perf_counter
import numpy as np

def measure_latency(handler, payload):
    latencies = []
    # warm up
    for _ in range(10):
        _ = handler(payload)
    # timed run
    for _ in range(50):
        start_time = perf_counter()
        _ = handler(payload)
        latency = perf_counter() - start_time
        latencies.append(latency)
    # compute run statistics
    time_avg_ms = 1000 * np.mean(latencies)
    time_std_ms = 1000 * np.std(latencies)
    return f"Average latency (ms) - {time_avg_ms:.2f} +/- {time_std_ms:.2f}"

print(f"Optimized & Quantized model {measure_latency(my_handler, payload)}")
# Optimized & Quantized model Average latency (ms) - 29.90 +/- 0.53
```

`Optimized & Quantized model Average latency (ms) - 29.90 +/- 0.53`

`Vanilla model Average latency (ms) - 64.15 +/- 2.44`

That is roughly a 2.1x latency improvement over the vanilla model.
## 5. Push to repository and create Inference Endpoint

```python
# add all our new files
!git add *
# commit our files
!git commit -m "add custom handler"
# push the files to the hub
!git push
```
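If you prefer not to use git directly, the same files can also be uploaded with the `huggingface_hub` library. A minimal sketch, assuming you are logged in (e.g. via `huggingface-cli login`) and replace the placeholder repository id with your own:

```python
from huggingface_hub import HfApi

api = HfApi()
# upload the ONNX files, tokenizer files and handler.py from the current directory
api.upload_folder(
    folder_path=".",
    repo_id="<your-username>/roberta-base-squad2-optimized",  # placeholder, replace with your repo
    repo_type="model",
)
```

Once the files are on the Hub, you can create the Inference Endpoint from this repository in the Inference Endpoints UI; the custom `handler.py` should be picked up automatically.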