|
|
|
# Benchmarks
|
|
|
Hugging Face's benchmarking tools are deprecated, and it is advised to use external benchmarking libraries to measure the speed
and memory complexity of Transformer models.
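
As an illustration of that advice, the sketch below times a single forward pass with PyTorch's built-in `torch.utils.benchmark` utility. This is only a minimal sketch: the model name, batch size, and sequence length are illustrative placeholders chosen to mirror the examples in this document, not a prescribed setup.

```py
# Minimal sketch of timing a forward pass with an external tool
# (PyTorch's torch.utils.benchmark). All sizes below are illustrative placeholders.
import torch
from torch.utils import benchmark
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
model = AutoModel.from_pretrained("google-bert/bert-base-uncased").eval()

# batch of 8 sequences padded to length 128, as in the examples below
inputs = tokenizer(["Hello world!"] * 8, padding="max_length", max_length=128, return_tensors="pt")

timer = benchmark.Timer(stmt="model(**inputs)", globals={"model": model, "inputs": inputs})
print(timer.timeit(number=10))  # summary statistics over 10 forward passes
```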
|
|
|
[[open-in-colab]] |
|
Let's take a look at how 🤗 Transformers models can be benchmarked, best practices, and already available benchmarks. |
|
A notebook explaining in more detail how to benchmark 🤗 Transformers models can be found here. |
|
## How to benchmark 🤗 Transformers models
|
The classes [PyTorchBenchmark] and [TensorFlowBenchmark] allow you to flexibly benchmark 🤗 Transformers models. The benchmark classes allow us to measure the peak memory usage and required time for both inference and training.
|
|
|
Here, inference is defined as a single forward pass, and training is defined as a single forward pass followed by a backward pass.
|
|
|
The benchmark classes [PyTorchBenchmark] and [TensorFlowBenchmark] expect an object of type [PyTorchBenchmarkArguments] and
[TensorFlowBenchmarkArguments], respectively, for instantiation. [PyTorchBenchmarkArguments] and [TensorFlowBenchmarkArguments] are data classes and contain all relevant configurations for their corresponding benchmark class. The following example shows how a BERT model of type bert-base-uncased can be benchmarked.
|
|
|
<frameworkcontent>
<pt>
```py
from transformers import PyTorchBenchmark, PyTorchBenchmarkArguments

args = PyTorchBenchmarkArguments(models=["google-bert/bert-base-uncased"], batch_sizes=[8], sequence_lengths=[8, 32, 128, 512])
benchmark = PyTorchBenchmark(args)
```
</pt>
<tf>
```py
from transformers import TensorFlowBenchmark, TensorFlowBenchmarkArguments

args = TensorFlowBenchmarkArguments(
    models=["google-bert/bert-base-uncased"], batch_sizes=[8], sequence_lengths=[8, 32, 128, 512]
)
benchmark = TensorFlowBenchmark(args)
```
</tf>
</frameworkcontent>
|
|
|
Here, three arguments are given to the benchmark argument data classes, namely models, batch_sizes, and
sequence_lengths. The argument models is required and expects a list of model identifiers from the
model hub. The list arguments batch_sizes and sequence_lengths define the size of the input_ids on which the model is benchmarked. There are many more parameters that can be configured via the benchmark argument data classes. For more detail on these, one can directly consult the files
src/transformers/benchmark/benchmark_args_utils.py, src/transformers/benchmark/benchmark_args.py (for PyTorch)
and src/transformers/benchmark/benchmark_args_tf.py (for TensorFlow). Alternatively, running the following shell
commands from the repository root will print out a descriptive list of all configurable parameters for PyTorch and TensorFlow
respectively.
|
|
|
<frameworkcontent>
<pt>
```bash
python examples/pytorch/benchmarking/run_benchmark.py --help
```

An instantiated benchmark object can then simply be run by calling benchmark.run().

```py
results = benchmark.run()
print(results)
==================== INFERENCE - SPEED - RESULT ====================
Model Name                        Batch Size     Seq Length     Time in s
google-bert/bert-base-uncased          8              8             0.006
google-bert/bert-base-uncased          8              32            0.006
google-bert/bert-base-uncased          8              128           0.018
google-bert/bert-base-uncased          8              512           0.088
==================== INFERENCE - MEMORY - RESULT ====================
Model Name                        Batch Size     Seq Length     Memory in MB
google-bert/bert-base-uncased          8              8             1227
google-bert/bert-base-uncased          8              32            1281
google-bert/bert-base-uncased          8              128           1307
google-bert/bert-base-uncased          8              512           1539
==================== ENVIRONMENT INFORMATION ====================
transformers_version: 2.11.0
framework: PyTorch
use_torchscript: False
framework_version: 1.4.0
python_version: 3.6.10
system: Linux
cpu: x86_64
architecture: 64bit
date: 2020-06-29
time: 08:58:43.371351
fp16: False
use_multiprocessing: True
only_pretrain_model: False
cpu_ram_mb: 32088
use_gpu: True
num_gpus: 1
gpu: TITAN RTX
gpu_ram_mb: 24217
gpu_power_watts: 280.0
gpu_performance_state: 2
use_tpu: False
```
|
</pt> |
|
<tf>
```bash
python examples/tensorflow/benchmarking/run_benchmark_tf.py --help
```

An instantiated benchmark object can then simply be run by calling benchmark.run().

```py
results = benchmark.run()
print(results)
==================== INFERENCE - SPEED - RESULT ====================
Model Name                        Batch Size     Seq Length     Time in s
google-bert/bert-base-uncased          8              8             0.005
google-bert/bert-base-uncased          8              32            0.008
google-bert/bert-base-uncased          8              128           0.022
google-bert/bert-base-uncased          8              512           0.105
==================== INFERENCE - MEMORY - RESULT ====================
Model Name                        Batch Size     Seq Length     Memory in MB
google-bert/bert-base-uncased          8              8             1330
google-bert/bert-base-uncased          8              32            1330
google-bert/bert-base-uncased          8              128           1330
google-bert/bert-base-uncased          8              512           1770
==================== ENVIRONMENT INFORMATION ====================
transformers_version: 2.11.0
framework: Tensorflow
use_xla: False
framework_version: 2.2.0
python_version: 3.6.10
system: Linux
cpu: x86_64
architecture: 64bit
date: 2020-06-29
time: 09:26:35.617317
fp16: False
use_multiprocessing: True
only_pretrain_model: False
cpu_ram_mb: 32088
use_gpu: True
num_gpus: 1
gpu: TITAN RTX
gpu_ram_mb: 24217
gpu_power_watts: 280.0
gpu_performance_state: 2
use_tpu: False
```
</tf>
</frameworkcontent>
|
|
|
By default, the time and the required memory for inference are benchmarked. In the example output above, the first
two sections show the results for inference time and inference memory. In addition, all relevant
information about the computing environment, e.g. the GPU type, the system, the library versions, etc., is printed
out in the third section under ENVIRONMENT INFORMATION. This information can optionally be saved in a .csv file
when adding the argument save_to_csv=True to [PyTorchBenchmarkArguments] and
[TensorFlowBenchmarkArguments] respectively. In this case, every section is saved in a separate
.csv file. The path to each .csv file can optionally be defined via the argument data classes.
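
As a rough sketch of what this could look like with the PyTorch benchmark: save_to_csv is described above, but the CSV file-path argument names below are assumptions and should be checked against the --help output shown above.

```py
# Sketch: save each result section to its own .csv file.
# The *_csv_file argument names are assumed; verify them via
# `python examples/pytorch/benchmarking/run_benchmark.py --help`.
from transformers import PyTorchBenchmark, PyTorchBenchmarkArguments

args = PyTorchBenchmarkArguments(
    models=["google-bert/bert-base-uncased"],
    batch_sizes=[8],
    sequence_lengths=[8, 32, 128, 512],
    save_to_csv=True,
    inference_time_csv_file="inference_time.csv",      # assumed argument name
    inference_memory_csv_file="inference_memory.csv",  # assumed argument name
    env_info_csv_file="env_info.csv",                  # assumed argument name
)
benchmark = PyTorchBenchmark(args)
results = benchmark.run()
```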
|
Instead of benchmarking pre-trained models via their model identifier, e.g. google-bert/bert-base-uncased, the user can
alternatively benchmark an arbitrary configuration of any available model class. In this case, a list of
configurations must be passed in together with the benchmark args as follows.
|
|
|
<frameworkcontent>
<pt>
```py
from transformers import PyTorchBenchmark, PyTorchBenchmarkArguments, BertConfig

args = PyTorchBenchmarkArguments(
    models=["bert-base", "bert-384-hid", "bert-6-lay"], batch_sizes=[8], sequence_lengths=[8, 32, 128, 512]
)
config_base = BertConfig()
config_384_hid = BertConfig(hidden_size=384)
config_6_lay = BertConfig(num_hidden_layers=6)

benchmark = PyTorchBenchmark(args, configs=[config_base, config_384_hid, config_6_lay])
benchmark.run()
==================== INFERENCE - SPEED - RESULT ====================
Model Name              Batch Size     Seq Length     Time in s
bert-base                    8              8             0.006
bert-base                    8              32            0.006
bert-base                    8              128           0.018
bert-base                    8              512           0.088
bert-384-hid                 8              8             0.006
bert-384-hid                 8              32            0.006
bert-384-hid                 8              128           0.011
bert-384-hid                 8              512           0.054
bert-6-lay                   8              8             0.003
bert-6-lay                   8              32            0.004
bert-6-lay                   8              128           0.009
bert-6-lay                   8              512           0.044
==================== INFERENCE - MEMORY - RESULT ====================
Model Name              Batch Size     Seq Length     Memory in MB
bert-base                    8              8             1277
bert-base                    8              32            1281
bert-base                    8              128           1307
bert-base                    8              512           1539
bert-384-hid                 8              8             1005
bert-384-hid                 8              32            1027
bert-384-hid                 8              128           1035
bert-384-hid                 8              512           1255
bert-6-lay                   8              8             1097
bert-6-lay                   8              32            1101
bert-6-lay                   8              128           1127
bert-6-lay                   8              512           1359
==================== ENVIRONMENT INFORMATION ====================
transformers_version: 2.11.0
framework: PyTorch
use_torchscript: False
framework_version: 1.4.0
python_version: 3.6.10
system: Linux
cpu: x86_64
architecture: 64bit
date: 2020-06-29
time: 09:35:25.143267
fp16: False
use_multiprocessing: True
only_pretrain_model: False
cpu_ram_mb: 32088
use_gpu: True
num_gpus: 1
gpu: TITAN RTX
gpu_ram_mb: 24217
gpu_power_watts: 280.0
gpu_performance_state: 2
use_tpu: False
```
|
</pt> |
|
<tf>
```py
from transformers import TensorFlowBenchmark, TensorFlowBenchmarkArguments, BertConfig

args = TensorFlowBenchmarkArguments(
    models=["bert-base", "bert-384-hid", "bert-6-lay"], batch_sizes=[8], sequence_lengths=[8, 32, 128, 512]
)
config_base = BertConfig()
config_384_hid = BertConfig(hidden_size=384)
config_6_lay = BertConfig(num_hidden_layers=6)

benchmark = TensorFlowBenchmark(args, configs=[config_base, config_384_hid, config_6_lay])
benchmark.run()
==================== INFERENCE - SPEED - RESULT ====================
Model Name              Batch Size     Seq Length     Time in s
bert-base                    8              8             0.005
bert-base                    8              32            0.008
bert-base                    8              128           0.022
bert-base                    8              512           0.106
bert-384-hid                 8              8             0.005
bert-384-hid                 8              32            0.007
bert-384-hid                 8              128           0.018
bert-384-hid                 8              512           0.064
bert-6-lay                   8              8             0.002
bert-6-lay                   8              32            0.003
bert-6-lay                   8              128           0.0011
bert-6-lay                   8              512           0.074
==================== INFERENCE - MEMORY - RESULT ====================
Model Name              Batch Size     Seq Length     Memory in MB
bert-base                    8              8             1330
bert-base                    8              32            1330
bert-base                    8              128           1330
bert-base                    8              512           1770
bert-384-hid                 8              8             1330
bert-384-hid                 8              32            1330
bert-384-hid                 8              128           1330
bert-384-hid                 8              512           1540
bert-6-lay                   8              8             1330
bert-6-lay                   8              32            1330
bert-6-lay                   8              128           1330
bert-6-lay                   8              512           1540
==================== ENVIRONMENT INFORMATION ====================
transformers_version: 2.11.0
framework: Tensorflow
use_xla: False
framework_version: 2.2.0
python_version: 3.6.10
system: Linux
cpu: x86_64
architecture: 64bit
date: 2020-06-29
time: 09:38:15.487125
fp16: False
use_multiprocessing: True
only_pretrain_model: False
cpu_ram_mb: 32088
use_gpu: True
num_gpus: 1
gpu: TITAN RTX
gpu_ram_mb: 24217
gpu_power_watts: 280.0
gpu_performance_state: 2
use_tpu: False
```
</tf>
</frameworkcontent>
|
|
|
Again, inference time and required memory for inference are measured, but this time for customized configurations
of the BertModel class. This feature can be especially helpful when deciding for which configuration the model
should be trained.
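
Because such a comparison is usually made to decide which configuration to train, it can also be worth benchmarking training itself. Below is a minimal sketch, assuming the corresponding argument is called training; the exact name can be checked with the --help command shown earlier.

```py
# Sketch: also benchmark training (forward + backward pass) for custom configurations.
# The `training=True` argument name is an assumption; confirm it via --help.
from transformers import PyTorchBenchmark, PyTorchBenchmarkArguments, BertConfig

args = PyTorchBenchmarkArguments(
    models=["bert-base", "bert-6-lay"],
    batch_sizes=[8],
    sequence_lengths=[128],
    training=True,  # assumed flag for measuring a forward + backward pass
)
benchmark = PyTorchBenchmark(args, configs=[BertConfig(), BertConfig(num_hidden_layers=6)])
results = benchmark.run()
```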
|
## Benchmark best practices
|
This section lists a couple of best practices one should be aware of when benchmarking a model. |
|
|
|
- Currently, only single device benchmarking is supported. When benchmarking on GPU, it is recommended that the user specifies on which device the code should be run by setting the CUDA_VISIBLE_DEVICES environment variable in the shell, e.g. export CUDA_VISIBLE_DEVICES=0, before running the code (see the sketch after this list).
- The option no_multi_processing should only be set to True for testing and debugging. To ensure accurate memory measurement, each memory benchmark should run in its own separate process, which is the case as long as no_multi_processing is not set to True.
- One should always state the environment information when sharing the results of a model benchmark. Results can vary heavily between different GPU devices, library versions, etc., so benchmark results on their own are not very useful for the community.
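
For example, pinning a run to a single GPU from the shell could look as follows. The command-line flags in the sketch are hypothetical placeholders mirroring the benchmark argument data classes, so list the real ones with --help first.

```bash
# Hypothetical invocation: pin the benchmark to GPU 0 before launching it.
# Flag names are placeholders; check `python examples/pytorch/benchmarking/run_benchmark.py --help`.
export CUDA_VISIBLE_DEVICES=0
python examples/pytorch/benchmarking/run_benchmark.py --models google-bert/bert-base-uncased --batch_sizes 8 --sequence_lengths 128
```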
|
|
|
## Sharing your benchmark
|
Previously, all available core models (10 at the time) were benchmarked for inference time across many different
settings: using PyTorch, with and without TorchScript, and using TensorFlow, with and without XLA. All of those tests were
done across CPUs (except for TensorFlow XLA) and GPUs.
|
The approach is detailed in the following blogpost and the results are |
|
available here. |
|
With the new benchmark tools, it is easier than ever to share your benchmark results with the community:

- PyTorch Benchmarking Results.
- TensorFlow Benchmarking Results.
|
|