---
license: mit
tags:
- sentence-embeddings
- endpoints-template
- optimum
library_name: generic
---

# Optimized and Quantized [deepset/roberta-base-squad2](https://huggingface.co/deepset/roberta-base-squad2) with a custom handler.py

This repository implements a `custom` handler for `question-answering` for 🤗 Inference Endpoints, using [🤗 Optimum](https://huggingface.co/docs/optimum/index) for accelerated inference. The code for the customized handler is in the [handler.py](https://huggingface.co/philschmid/roberta-base-squad2-optimized/blob/main/handler.py).

Below we also describe how we converted & optimized the model, based on the [Accelerate Transformers with Hugging Face Optimum](https://huggingface.co/blog/optimum-inference) blog post. You can also check out the [notebook](https://huggingface.co/philschmid/roberta-base-squad2-optimized/blob/main/optimize_model.ipynb).

### Expected Request payload

```json
{
  "inputs": {
    "question": "As what is Philipp working?",
    "context": "Hello, my name is Philipp and I live in Nuremberg, Germany. Currently I am working as a Technical Lead at Hugging Face to democratize artificial intelligence through open source and open science. In the past I designed and implemented cloud-native machine learning architectures for fin-tech and insurance companies. I found my passion for cloud concepts and machine learning 5 years ago. Since then I never stopped learning. Currently, I am focusing myself in the area NLP and how to leverage models like BERT, Roberta, T5, ViT, and GPT2 to generate business value."
  }
}
```

Below is an example of how to run a request using Python and `requests`.

## Run Request

```python
import requests as r

ENDPOINT_URL = ""
HF_TOKEN = ""


def predict(question: str = None, context: str = None):
    payload = {"inputs": {"question": question, "context": context}}
    response = r.post(
        ENDPOINT_URL, headers={"Authorization": f"Bearer {HF_TOKEN}"}, json=payload
    )
    return response.json()


prediction = predict(
    question="As what is Philipp working?",
    context="Hello, my name is Philipp and I live in Nuremberg, Germany. Currently I am working as a Technical Lead at Hugging Face to democratize artificial intelligence through open source and open science."
)
```

Expected output:

```python
{
  'score': 0.4749588668346405,
  'start': 88,
  'end': 102,
  'answer': 'Technical Lead'
}
```

# Convert & Optimize model with Optimum

Steps:
1. [Convert model to ONNX](#1-convert-model-to-onnx)
2. [Optimize & quantize model with Optimum](#2-optimize--quantize-model-with-optimum)
3. [Create Custom Handler for Inference Endpoints](#3-create-custom-handler-for-inference-endpoints)
4. [Test Custom Handler Locally](#4-test-custom-handler-locally)
5. [Push to repository and create Inference Endpoint](#5-push-to-repository-and-create-inference-endpoint)

Helpful links:
* [Accelerate Transformers with Hugging Face Optimum](https://huggingface.co/blog/optimum-inference)
* [Optimizing Transformers for GPUs with Optimum](https://www.philschmid.de/optimizing-transformers-with-optimum-gpu)
* [Optimum Documentation](https://huggingface.co/docs/optimum/onnxruntime/modeling_ort)
* [Create Custom Handler Endpoints](https://link-to-docs)

## Setup & Installation

```python
%%writefile requirements.txt
optimum[onnxruntime]==1.4.0
mkl-include
mkl
```

```python
!pip install -r requirements.txt
```
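If you want to double-check that the pinned packages were picked up, a quick version check helps. The snippet below is a minimal sketch using only the standard library; `onnxruntime` and `transformers` are installed as part of `optimum[onnxruntime]`.

```python
from importlib.metadata import version

# print the installed versions of the packages used in the following steps
for package in ["optimum", "onnxruntime", "transformers"]:
    print(package, version(package))
```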
## 0. Baseline Performance

```python
from transformers import pipeline

qa = pipeline("question-answering", model="deepset/roberta-base-squad2")
```

Okay, let's test the performance (latency) with a sequence length of 128.

```python
context = "Hello, my name is Philipp and I live in Nuremberg, Germany. Currently I am working as a Technical Lead at Hugging Face to democratize artificial intelligence through open source and open science. In the past I designed and implemented cloud-native machine learning architectures for fin-tech and insurance companies. I found my passion for cloud concepts and machine learning 5 years ago. Since then I never stopped learning. Currently, I am focusing myself in the area NLP and how to leverage models like BERT, Roberta, T5, ViT, and GPT2 to generate business value."
question = "As what is Philipp working?"

payload = {"inputs": {"question": question, "context": context}}
```

```python
from time import perf_counter

import numpy as np


def measure_latency(pipe, payload):
    latencies = []
    # warm up
    for _ in range(10):
        _ = pipe(question=payload["inputs"]["question"], context=payload["inputs"]["context"])
    # timed run
    for _ in range(50):
        start_time = perf_counter()
        _ = pipe(question=payload["inputs"]["question"], context=payload["inputs"]["context"])
        latency = perf_counter() - start_time
        latencies.append(latency)
    # compute run statistics
    time_avg_ms = 1000 * np.mean(latencies)
    time_std_ms = 1000 * np.std(latencies)
    return f"Average latency (ms) - {time_avg_ms:.2f} +\- {time_std_ms:.2f}"


print(f"Vanilla model {measure_latency(qa, payload)}")
# Vanilla model Average latency (ms) - 64.15 +\- 2.44
```

## 1. Convert model to ONNX

```python
from pathlib import Path

from optimum.onnxruntime import ORTModelForQuestionAnswering
from transformers import AutoTokenizer

model_id = "deepset/roberta-base-squad2"
onnx_path = Path(".")

# load vanilla transformers and convert to onnx
model = ORTModelForQuestionAnswering.from_pretrained(model_id, from_transformers=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# save onnx checkpoint and tokenizer
model.save_pretrained(onnx_path)
tokenizer.save_pretrained(onnx_path)
```

## 2. Optimize & quantize model with Optimum

```python
from optimum.onnxruntime import ORTOptimizer, ORTQuantizer
from optimum.onnxruntime.configuration import OptimizationConfig, AutoQuantizationConfig

# create the optimizer
optimizer = ORTOptimizer.from_pretrained(model)

# define the optimization strategy by creating the appropriate configuration
optimization_config = OptimizationConfig(optimization_level=99)  # enable all optimizations

# optimize the model
optimizer.optimize(save_dir=onnx_path, optimization_config=optimization_config)
```

```python
# create ORTQuantizer and define quantization configuration
dynamic_quantizer = ORTQuantizer.from_pretrained(onnx_path, file_name="model_optimized.onnx")
dqconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)

# apply the quantization configuration to the model
model_quantized_path = dynamic_quantizer.quantize(
    save_dir=onnx_path,
    quantization_config=dqconfig,
)
```
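Before wiring the quantized model into a custom handler, it can be worth a quick sanity check that the quantized artifact loads and still answers correctly. The snippet below is a minimal sketch; it assumes the quantizer wrote `model_optimized_quantized.onnx` next to the exported `model.onnx`, which is the same file name the handler in the next step loads.

```python
import os

from optimum.onnxruntime import ORTModelForQuestionAnswering
from transformers import AutoTokenizer, pipeline

# compare the size of the exported ONNX model with the optimized & quantized one
for file_name in ["model.onnx", "model_optimized_quantized.onnx"]:
    size_mb = os.path.getsize(file_name) / (1024 * 1024)
    print(f"{file_name}: {size_mb:.2f} MB")

# load the quantized model and run a single test prediction
quantized_model = ORTModelForQuestionAnswering.from_pretrained(".", file_name="model_optimized_quantized.onnx")
tokenizer = AutoTokenizer.from_pretrained(".")
quantized_qa = pipeline("question-answering", model=quantized_model, tokenizer=tokenizer)

print(quantized_qa(
    question="As what is Philipp working?",
    context="Hello, my name is Philipp and I live in Nuremberg, Germany. Currently I am working as a Technical Lead at Hugging Face to democratize artificial intelligence through open source and open science.",
))
```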
## 3. Create Custom Handler for Inference Endpoints

```python
%%writefile handler.py
from typing import Dict, List, Any

from optimum.onnxruntime import ORTModelForQuestionAnswering
from transformers import AutoTokenizer, pipeline


class EndpointHandler():
    def __init__(self, path=""):
        # load the optimized model
        self.model = ORTModelForQuestionAnswering.from_pretrained(path, file_name="model_optimized_quantized.onnx")
        self.tokenizer = AutoTokenizer.from_pretrained(path)
        # create pipeline
        self.pipeline = pipeline("question-answering", model=self.model, tokenizer=self.tokenizer)

    def __call__(self, data: Any) -> List[List[Dict[str, float]]]:
        """
        Args:
            data (:obj:): includes the input data and the parameters for the inference.
        Return:
            A :obj:`list`. The list contains the answer and score for the inference inputs.
        """
        inputs = data.get("inputs", data)
        # run the model
        prediction = self.pipeline(**inputs)
        # return prediction
        return prediction
```

## 4. Test Custom Handler Locally

```python
from handler import EndpointHandler

# init handler
my_handler = EndpointHandler(path=".")

# prepare sample payload
context = "Hello, my name is Philipp and I live in Nuremberg, Germany. Currently I am working as a Technical Lead at Hugging Face to democratize artificial intelligence through open source and open science. In the past I designed and implemented cloud-native machine learning architectures for fin-tech and insurance companies. I found my passion for cloud concepts and machine learning 5 years ago. Since then I never stopped learning. Currently, I am focusing myself in the area NLP and how to leverage models like BERT, Roberta, T5, ViT, and GPT2 to generate business value."
question = "As what is Philipp working?"
payload = {"inputs": {"question": question, "context": context}}

# test the handler
my_handler(payload)
```

```python
from time import perf_counter

import numpy as np


def measure_latency(handler, payload):
    latencies = []
    # warm up
    for _ in range(10):
        _ = handler(payload)
    # timed run
    for _ in range(50):
        start_time = perf_counter()
        _ = handler(payload)
        latency = perf_counter() - start_time
        latencies.append(latency)
    # compute run statistics
    time_avg_ms = 1000 * np.mean(latencies)
    time_std_ms = 1000 * np.std(latencies)
    return f"Average latency (ms) - {time_avg_ms:.2f} +\- {time_std_ms:.2f}"


print(f"Optimized & Quantized model {measure_latency(my_handler, payload)}")
# Optimized & Quantized model Average latency (ms) - 29.90 +\- 0.53
```

`Vanilla model Average latency (ms) - 64.15 +\- 2.44`

The optimized & quantized handler is roughly 2.1x faster than the vanilla pipeline (29.90 ms vs. 64.15 ms).

## 5. Push to repository and create Inference Endpoint

```python
# add all our new files
!git add *
# commit our files
!git commit -m "add custom handler"
# push the files to the hub
!git push
```
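As an alternative to the git CLI, the same files can be uploaded programmatically with the `huggingface_hub` library. This is only a sketch; the `repo_id` below is a placeholder you would replace with your own repository before creating the Inference Endpoint from it.

```python
from huggingface_hub import HfApi

api = HfApi()

# upload the current folder (handler.py, ONNX files, tokenizer) to the hub
# "your-username/roberta-base-squad2-optimized" is a placeholder repo id
api.upload_folder(
    folder_path=".",
    repo_id="your-username/roberta-base-squad2-optimized",
    commit_message="add custom handler",
)
```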