mirror of
https://github.com/saymrwulf/onnxruntime.git
synced 2026-05-27 22:45:57 +00:00
Update Transformer Optimizer documents (#4591)
(1) Add bert-base-cased and gpt2 benchmark results on V100 (2) Update list of supported models. (3) Add comments to gpt2_helper. (4) Use IO Binding in test parity by default.
parent
03ebe33850
commit
ea87c0d028
3 changed files with 197 additions and 43 deletions
|
|
@ -11,21 +11,34 @@ This tool can help in the following scenarios:
|
|||
## Installation
|
||||
First, install the onnxruntime or onnxruntime-gpu package for CPU or GPU inference. To use onnxruntime-gpu, you must install CUDA and cuDNN and add their bin directories to the PATH environment variable.
|
||||
|
||||
This tool can be installed using pip as follows:
|
||||
This tool can be installed using pip:
|
||||
```console
|
||||
pip install onnxruntime-tools
|
||||
pip install --upgrade onnxruntime-tools
|
||||
```
|
||||
|
||||
## Export a transformer model to ONNX
|
||||
|
||||
PyTorch can export a model to ONNX. The tf2onnx and keras2onnx tools can be used to convert models trained with TensorFlow.
|
||||
Huggingface transformers has a [notebook](https://github.com/huggingface/transformers/blob/master/notebooks/04-onnx-export.ipynb) that shows an example of exporting a pretrained model to ONNX.
|
||||
For Keras2onnx, please refer to its [example script](https://github.com/onnx/keras-onnx/blob/master/applications/nightly_build/test_transformers.py).
|
||||
For tf2onnx, please refer to its [BERT tutorial](https://github.com/onnx/tensorflow-onnx/blob/master/tutorials/BertTutorial.ipynb).
|
||||
|
||||
### GPT-2 Model conversion
|
||||
|
||||
Converting a GPT-2 model from PyTorch to ONNX is not straightforward when past state is used. We added a tool [convert_to_onnx](https://github.com/microsoft/onnxruntime/blob/master/onnxruntime/python/tools/transformers/convert_to_onnx.py) to help with this.
|
||||
|
||||
You can use commands like the following to convert a pre-trained PyTorch GPT-2 model to ONNX for a given precision (float32, float16 or int8):
|
||||
```
|
||||
python -m onnxruntime_tools.transformers.convert_to_onnx -m gpt2 --model_class GPT2LMHeadModel --output gpt2.onnx -p fp32
|
||||
python -m onnxruntime_tools.transformers.convert_to_onnx -m distilgpt2 --model_class GPT2LMHeadModel --output distilgpt2.onnx -p fp16 --use_gpu --optimize_onnx
|
||||
python -m onnxruntime_tools.transformers.convert_to_onnx -m [path_to_gpt2_pytorch_model_directory] --output quantized.onnx -p int8 --optimize_onnx
|
||||
```
|
||||
|
||||
The tool will also verify whether the ONNX model and the corresponding PyTorch model generate the same outputs given the same random inputs.
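
The verification step described above can be sketched as an element-wise tolerance check. This is a minimal illustration, not the tool's actual code: the function name `outputs_are_close` is hypothetical, the arrays are random stand-ins for real model outputs, and the default tolerance of 5e-4 is borrowed from the float32 defaults that appear in gpt2_helper in this commit.

```python
import numpy as np


def outputs_are_close(reference, candidate, rtol=5e-4, atol=5e-4):
    """Return True if candidate matches reference within the given tolerances."""
    return bool(np.allclose(candidate, reference, rtol=rtol, atol=atol))


rng = np.random.default_rng(0)
# Stand-in for a PyTorch output tensor converted to numpy.
reference = rng.standard_normal((2, 3)).astype(np.float32)
# Stand-in for an ONNX Runtime output with a tiny numeric drift.
candidate = reference + 1e-5

print(outputs_are_close(reference, candidate))        # drift well below tolerance
print(outputs_are_close(reference, reference + 1.0))  # drift far above tolerance
```

Looser tolerances are usually needed for float16 or int8 models, since reduced precision increases the expected numeric drift.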
|
||||
|
||||
## Model Optimizer
|
||||
|
||||
In your python code, you can use it like the following:
|
||||
In your python code, you can use the optimizer like the following:
|
||||
|
||||
```python
|
||||
from onnxruntime_tools import optimizer
|
||||
|
|
@ -44,7 +57,7 @@ You can also download the latest script files from [here](https://github.com/mic
|
|||
python optimizer.py --input gpt2.onnx --output gpt2_opt.onnx --model_type gpt2
|
||||
```
|
||||
|
||||
### Options
|
||||
### Optimizer Options
|
||||
|
||||
See below for description of some options of optimizer.py:
|
||||
|
||||
|
|
@ -69,28 +82,76 @@ See below for description of some options of optimizer.py:
|
|||
|
||||
### Supported Models
|
||||
|
||||
Right now, this tool assumes the input model has 3 inputs for input IDs, segment IDs, and attention mask. A model with fewer or additional inputs might not be fully optimized.
|
||||
Here is a list of PyTorch models from [Huggingface Transformers](https://github.com/huggingface/transformers/) that have been tested using the optimizer:
|
||||
- BERT
|
||||
- DistilBERT
|
||||
- DistilGPT2
|
||||
- RoBERTa
|
||||
- ALBERT
|
||||
- GPT-2 (**GPT2Model**, **GPT2LMHeadModel**)
|
||||
|
||||
Most optimizations require an exact match of a subgraph. Any layout change in the subgraph might prevent some optimizations from being applied. Note that different versions of the training or export tool might lead to different graph layouts.
|
||||
For TensorFlow models, we have only tested BERT so far.
|
||||
|
||||
Here is a list of models from [Huggingface Transformers](https://github.com/huggingface/transformers/) that have been tested using this tool:
|
||||
- **BertForSequenceClassification** as in [transformers example](https://github.com/huggingface/transformers/blob/master/examples/run_glue.py) exported by PyTorch 1.2-1.4 using opset version 10 or 11.
|
||||
- **BertForQuestionAnswering** as in [transformers example](https://github.com/huggingface/transformers/blob/master/examples/run_squad.py) exported by PyTorch 1.2-1.4 using opset version 10 or 11.
|
||||
- **TFBertForSequenceClassification** as in [transformers example](https://github.com/huggingface/transformers/blob/master/examples/run_tf_glue.py) exported by keras2onnx installed from its master source.
|
||||
- **TFBertForQuestionAnswering** exported by keras2onnx installed from its master source.
|
||||
- **GPT2Model** exported by PyTorch 1.4 using opset version 10 or 11.
|
||||
- **GPT2LMHeadModel** exported by PyTorch 1.4 using opset version 10 or 11.
|
||||
If your model is not in the list, the optimized model might not work. You are welcome to update the scripts to support new models.
|
||||
Most optimizations require an exact match of a subgraph. Any layout change in the subgraph might prevent some optimizations from being applied. Note that different versions of the training or export tool might lead to different graph layouts. It is recommended to use the latest released versions of PyTorch and Transformers.
|
||||
|
||||
If your model is not in the list, it might be only partially optimized or not optimized at all.
|
||||
|
||||
For GPT2 models, the current optimization does not support past state (both inputs and outputs). You need to disable it in transformers by setting use_cache=False during export.
|
||||
|
||||
## Benchmark
|
||||
There is a bash script [run_benchmark.sh](https://github.com/microsoft/onnxruntime/blob/master/onnxruntime/python/tools/transformers/run_benchmark.sh) for running benchmark. You can modify the bash script to choose your options (like models to test, batch sizes, sequence lengths, target device etc) before running.
|
||||
|
||||
There is a benchmark script that measures inference performance of OnnxRuntime, PyTorch or PyTorch+TorchScript on pretrained models of Huggingface Transformers.
|
||||
The bash script will call benchmark.py script to measure inference performance of OnnxRuntime, PyTorch or PyTorch+TorchScript on pretrained models of Huggingface Transformers.
|
||||
|
||||
The benchmark script requires PyTorch to be installed.
|
||||
### Benchmark Results on V100
|
||||
|
||||
Here is an example to run benchmark on pretrained model bert-base-cased on GPU.
|
||||
In the following benchmark results, ONNX Runtime uses optimizer for model optimization, and IO binding is enabled.
|
||||
|
||||
We tested on a Tesla V100-PCIE-16GB GPU (CPU is Intel Xeon(R) E5-2690 v4) for different batch sizes (**b**) and sequence lengths (**s**). The results below are the average latency per inference in milliseconds.
|
||||
|
||||
#### bert-base-cased (BertModel)
|
||||
|
||||
The model has 12 layers and 768 hidden, with input_ids as input.
|
||||
|
||||
| engine | version | precision | b | s=8 | s=16 | s=32 | s=64 | s=128 | s=256 | s=512 |
|
||||
|-------------|---------|-----------|---|------|------|------|------|-------|-------|-------|
|
||||
| torchscript | 1.5.1 | fp32 | 1 | 7.92 | 8.78 | 8.91 | 9.18 | 9.56 | 9.39 | 12.83 |
|
||||
| onnxruntime | 1.4.0 | fp32 | 1 | 1.38 | 1.42 | 1.67 | 2.15 | 3.11 | 5.37 | 10.74 |
|
||||
| onnxruntime | 1.4.0 | fp16 | 1 | 1.30 | 1.29 | 1.31 | 1.33 | 1.45 | 1.95 | 3.36 |
|
||||
| onnxruntime | 1.4.0 | fp32 | 4 | 1.51 | 1.93 | 2.98 | 5.01 | 9.13 | 17.95 | 38.15 |
|
||||
| onnxruntime | 1.4.0 | fp16 | 4 | 1.27 | 1.35 | 1.43 | 1.83 | 2.66 | 4.40 | 9.76 |
|
||||
|
||||
[run_benchmark.sh](https://github.com/microsoft/onnxruntime/blob/master/onnxruntime/python/tools/transformers/run_benchmark.sh) is used to get the results.
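
To make the table easier to read, the latencies can be turned into speedup ratios. The sketch below only does arithmetic on three values copied from the batch size 1, sequence length 512 column above; the helper name `speedup` is introduced here for illustration.

```python
# Latencies in milliseconds copied from the table above (b=1, s=512).
torchscript_fp32 = 12.83
onnxruntime_fp32 = 10.74
onnxruntime_fp16 = 3.36


def speedup(baseline_ms, candidate_ms):
    """Return how many times faster the candidate is than the baseline."""
    return baseline_ms / candidate_ms


print(f"ORT fp32 vs TorchScript fp32: {speedup(torchscript_fp32, onnxruntime_fp32):.2f}x")
print(f"ORT fp16 vs ORT fp32:         {speedup(onnxruntime_fp32, onnxruntime_fp16):.2f}x")
```

At this sequence length, most of the gain comes from float16 rather than from switching engines, while at short sequence lengths the fp32 rows already differ by several times.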
|
||||
|
||||
#### gpt2 (GPT2LMHeadModel)
|
||||
|
||||
The model has 12 layers and 768 hidden, with input_ids, position_ids, attention_mask and past state as inputs.
|
||||
|
||||
| engine | version | precision | b | s=4 | s=8 | s=32 | s=128 |
|
||||
|-------------|---------|-----------|---|------|------|------|------|
|
||||
| torchscript | 1.5.1 | fp32 | 1 | 5.80 | 5.77 | 5.82 | 5.78 |
|
||||
| onnxruntime | 1.4.0 | fp32 | 1 | 1.42 | 1.42 | 1.43 | 1.47 |
|
||||
| onnxruntime | 1.4.0 | fp16 | 1 | 1.54 | 1.54 | 1.58 | 1.64 |
|
||||
| onnxruntime | 1.4.0 | fp32 | 8 | 1.83 | 1.84 | 1.90 | 2.13 |
|
||||
| onnxruntime | 1.4.0 | fp16 | 8 | 1.74 | 1.75 | 1.81 | 2.09 |
|
||||
| onnxruntime | 1.4.0 | fp32 | 32 | 2.19 | 2.21 | 2.45 | 3.34 |
|
||||
| onnxruntime | 1.4.0 | fp16 | 32 | 1.66 | 1.71 | 1.85 | 2.73 |
|
||||
| onnxruntime | 1.4.0 | fp32 | 128 | 4.15 | 4.37 | 5.15 | 8.61 |
|
||||
| onnxruntime | 1.4.0 | fp16 | 128 | 2.47 | 2.58 | 3.26 | 6.16 |
|
||||
|
||||
Since past state is used, sequence length in input_ids is 1. For example, s=4 means the past sequence length is 4 and the total sequence length is 5.
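
The relationship between past sequence length and total sequence length can be sketched by listing the input shapes. This is an illustration under assumptions: the function `gpt2_input_shapes` is hypothetical, 12 attention heads is assumed for the gpt2 model, and the attention mask is assumed to cover the total sequence length. The past state shape and the `past_{i}` input names follow gpt2_helper in this commit.

```python
def gpt2_input_shapes(batch_size,
                      past_sequence_length,
                      sequence_length=1,
                      num_layer=12,
                      num_attention_heads=12,
                      hidden_size=768):
    """Return a dictionary mapping input name to shape for GPT-2 with past state."""
    head_size = hidden_size // num_attention_heads
    total_sequence_length = past_sequence_length + sequence_length
    shapes = {
        "input_ids": [batch_size, sequence_length],
        "position_ids": [batch_size, sequence_length],
        # Assumption: the mask spans past tokens plus the current tokens.
        "attention_mask": [batch_size, total_sequence_length],
    }
    # One past tensor per layer; the leading 2 holds key and value.
    for i in range(num_layer):
        shapes[f"past_{i}"] = [2, batch_size, num_attention_heads, past_sequence_length, head_size]
    return shapes


shapes = gpt2_input_shapes(batch_size=1, past_sequence_length=4)
print(shapes["input_ids"])       # one new token per step
print(shapes["attention_mask"])  # past length 4 + current length 1
print(shapes["past_0"])
```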
|
||||
|
||||
[benchmark_gpt2.py](https://github.com/microsoft/onnxruntime/blob/master/onnxruntime/python/tools/transformers/benchmark_gpt2.py) was used to get the results, with commands like the following:
|
||||
|
||||
```console
|
||||
python -m onnxruntime_tools.transformers.benchmark_gpt2 --use_gpu -m gpt2 -o -v -b 1 8 32 128 -s 4 8 32 128 -p fp32
|
||||
python -m onnxruntime_tools.transformers.benchmark_gpt2 --use_gpu -m gpt2 -o -v -b 1 8 32 128 -s 4 8 32 128 -p fp16
|
||||
```
|
||||
|
||||
### Benchmark.py
|
||||
|
||||
If you use run_benchmark.sh, you need not use benchmark.py directly. You can skip this section if you do not want to know the details.
|
||||
|
||||
Below is an example of running benchmark.py on the pretrained model bert-base-cased on GPU.
|
||||
|
||||
```console
|
||||
python -m onnxruntime_tools.transformers.benchmark -g -m bert-base-cased -o -v -b 0
|
||||
|
|
@ -102,7 +163,7 @@ The first command will generate ONNX models (both before and after optimizations
|
|||
|
||||
If you remove the -o parameter, the optimizer script is not used in the benchmark.
|
||||
|
||||
If your GPU (like V100 or T4) has TensorCore, you can append --fp16 to the above commands to enable mixed precision using float16.
|
||||
If your GPU (like V100 or T4) has TensorCore, you can append `-p fp16` to the above commands to enable mixed precision.
|
||||
|
||||
If you want to benchmark on CPU, you can remove the -g option from the commands.
|
||||
|
||||
|
|
@ -110,9 +171,9 @@ Note that our current benchmark on GPT2 and DistilGPT2 models has disabled past
|
|||
|
||||
By default, the ONNX model has only one input (input_ids). You can use the -i parameter to test models with multiple inputs. For example, add "-i 3" to the command line to test a BERT model with 3 inputs (input_ids, token_type_ids and attention_mask). This option only supports OnnxRuntime right now.
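
As a rough picture of what one versus three inputs means, the sketch below builds random feeds with NumPy. It is not taken from benchmark.py: the function `make_bert_inputs`, the placeholder `vocab_size`, and the order in which optional inputs are added are all assumptions; only the three input names come from the text above.

```python
import numpy as np

# Assumed ordering of inputs when fewer than three are requested.
INPUT_NAMES = ["input_ids", "attention_mask", "token_type_ids"]


def make_bert_inputs(batch_size, sequence_length, input_count=3, vocab_size=1000, seed=0):
    """Build random int64 feeds for the first `input_count` inputs."""
    if not 1 <= input_count <= len(INPUT_NAMES):
        raise ValueError("input_count must be 1, 2 or 3")
    rng = np.random.default_rng(seed)
    shape = (batch_size, sequence_length)
    feeds = {
        "input_ids": rng.integers(0, vocab_size, size=shape, dtype=np.int64),
        "attention_mask": np.ones(shape, dtype=np.int64),   # attend to every token
        "token_type_ids": np.zeros(shape, dtype=np.int64),  # single segment
    }
    return {name: feeds[name] for name in INPUT_NAMES[:input_count]}


inputs = make_bert_inputs(batch_size=2, sequence_length=8)
for name, value in inputs.items():
    print(name, value.shape, value.dtype)
```

A dictionary like this can be passed as the input feed to an ONNX Runtime `InferenceSession.run` call, provided the names match the inputs of the exported model.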
|
||||
|
||||
## Model Verification
|
||||
## BERT Model Verification
|
||||
|
||||
If your model has three inputs (like input_ids, token_type_ids and attention_mask), a script compare_bert_results.py can be used to do a quick verification. The tool will generate some fake input data, and compare results from both the original and optimized models. If outputs are all close, it is safe to use the optimized model.
|
||||
If your BERT model has three inputs (like input_ids, token_type_ids and attention_mask), a script compare_bert_results.py can be used to do a quick verification. The tool will generate some fake input data, and compare results from both the original and optimized models. If outputs are all close, it is safe to use the optimized model.
|
||||
|
||||
Example of verifying models optimized for CPU:
|
||||
|
||||
|
|
@ -124,7 +185,7 @@ For GPU, please append --use_gpu to the command.
|
|||
|
||||
## Performance Test
|
||||
|
||||
bert_perf_test.py can be used to check the model inference performance. Below are examples:
|
||||
bert_perf_test.py can be used to check the BERT model inference performance. Below are examples:
|
||||
|
||||
```console
|
||||
python -m onnxruntime_tools.transformers.bert_perf_test --model optimized_model_cpu.onnx --batch_size 1 --sequence_length 128 --samples 100 --test_times 10 --inclusive
|
||||
|
|
|
|||
|
|
@ -141,7 +141,8 @@ def main():
|
|||
device,
|
||||
args.precision == Precision.FLOAT16,
|
||||
rtol=args.tolerance,
|
||||
atol=args.tolerance)
|
||||
atol=args.tolerance,
|
||||
model_class=args.model_class)
|
||||
|
||||
logger.info(f"Done. Output model: {output_path}")
|
||||
|
||||
|
|
|
|||
|
|
@ -19,8 +19,9 @@ logger = logging.getLogger(__name__)
|
|||
DEFAULT_TOLERANCE = {Precision.FLOAT32: 0.0005, Precision.FLOAT16: 0.2, Precision.INT8: 3.0}
|
||||
|
||||
|
||||
# Here we wrap a class to disable past state output.
|
||||
class GPT2ModelNoPastState(GPT2Model):
|
||||
""" Here we wrap a class to disable past state output.
|
||||
"""
|
||||
def __init__(self, config):
|
||||
super().__init__(config)
|
||||
|
||||
|
|
@ -28,8 +29,9 @@ class GPT2ModelNoPastState(GPT2Model):
|
|||
return super().forward(input_ids, use_cache=False)
|
||||
|
||||
|
||||
# Wrap a class for Onnx model conversion for GPT2 model with past state
|
||||
class MyGPT2Model(GPT2Model):
|
||||
""" Here we wrap a class for Onnx model conversion for GPT2Model with past state.
|
||||
"""
|
||||
def __init__(self, config):
|
||||
super().__init__(config)
|
||||
|
||||
|
|
@ -38,6 +40,8 @@ class MyGPT2Model(GPT2Model):
|
|||
|
||||
|
||||
class MyGPT2LMHeadModel(GPT2LMHeadModel):
|
||||
""" Here we wrap a class for Onnx model conversion for GPT2LMHeadModel with past state.
|
||||
"""
|
||||
def __init__(self, config):
|
||||
super().__init__(config)
|
||||
|
||||
|
|
@ -45,13 +49,26 @@ class MyGPT2LMHeadModel(GPT2LMHeadModel):
|
|||
return super().forward(input_ids, position_ids=position_ids, attention_mask=attention_mask, past=past)
|
||||
|
||||
|
||||
# Maps model class name to a tuple of model class and name of first output
|
||||
MODEL_CLASSES = {'GPT2LMHeadModel': (MyGPT2LMHeadModel, 'logits'), 'GPT2Model': (MyGPT2Model, 'last_state')}
|
||||
|
||||
|
||||
class Gpt2Helper:
|
||||
""" A helper class for Gpt2 model conversion, inference and verification.
|
||||
"""
|
||||
@staticmethod
|
||||
def get_dummy_inputs(batch_size, past_sequence_length, sequence_length, num_attention_heads, hidden_size, num_layer,
|
||||
vocab_size, device, float16):
|
||||
def get_dummy_inputs(batch_size,
|
||||
past_sequence_length,
|
||||
sequence_length,
|
||||
num_attention_heads,
|
||||
hidden_size,
|
||||
num_layer,
|
||||
vocab_size,
|
||||
device,
|
||||
float16=False):
|
||||
""" Create random inputs for GPT2 model.
|
||||
Returns torch tensors of input_ids, position_ids, attention_mask and a list of past state tensors.
|
||||
"""
|
||||
float_type = torch.float16 if float16 else torch.float32
|
||||
past_shape = [2, batch_size, num_attention_heads, past_sequence_length, int(hidden_size / num_attention_heads)]
|
||||
|
||||
|
|
@ -76,7 +93,9 @@ class Gpt2Helper:
|
|||
return input_ids, position_ids, attention_mask, past
|
||||
|
||||
@staticmethod
|
||||
def get_output_shapes(batch_size, past_sequence_length, sequence_length, config, model_class):
|
||||
def get_output_shapes(batch_size, past_sequence_length, sequence_length, config, model_class="GPT2LMHeadModel"):
|
||||
""" Returns a dictionary with output name as key, and shape as value.
|
||||
"""
|
||||
num_attention_heads = config.num_attention_heads
|
||||
hidden_size = config.hidden_size
|
||||
num_layer = config.num_hidden_layers
|
||||
|
|
@ -99,7 +118,9 @@ class Gpt2Helper:
|
|||
return output_shapes
|
||||
|
||||
@staticmethod
|
||||
def get_output_buffers(output_shapes, device, is_float16):
|
||||
def get_output_buffers(output_shapes, device, is_float16=False):
|
||||
""" Returns a dictionary of output name as key, and 1D tensor as value. The tensor has enough space for given shape.
|
||||
"""
|
||||
data_type = torch.float16 if is_float16 else torch.float32
|
||||
|
||||
output_buffers = {}
|
||||
|
|
@ -109,6 +130,8 @@ class Gpt2Helper:
|
|||
|
||||
@staticmethod
|
||||
def diff_outputs(torch_outputs, ort_outputs, relative=False):
|
||||
""" Returns the maximum difference between PyTorch and OnnxRuntime outputs.
|
||||
"""
|
||||
expected_outputs = torch_outputs[0].cpu().numpy()
|
||||
diff = numpy.abs(expected_outputs - ort_outputs[0])
|
||||
if relative:
|
||||
|
|
@ -118,6 +141,8 @@ class Gpt2Helper:
|
|||
|
||||
@staticmethod
|
||||
def compare_outputs(torch_outputs, ort_outputs, rtol=1e-03, atol=1e-03):
|
||||
""" Returns True if torch and ORT outputs are close for given thresholds, and False otherwise.
|
||||
"""
|
||||
is_close = numpy.allclose(ort_outputs[0], torch_outputs[0].cpu(), rtol=rtol, atol=atol)
|
||||
logger.debug(f'PyTorch and OnnxRuntime output 0 (last_state) are close: {is_close}')
|
||||
|
||||
|
|
@ -196,6 +221,8 @@ class Gpt2Helper:
|
|||
verbose=verbose)
|
||||
|
||||
def optimize_onnx(onnx_model_path, optimized_model_path, is_float16, num_attention_heads, hidden_size):
|
||||
""" Optimize ONNX model with an option to convert it to use mixed precision.
|
||||
"""
|
||||
from optimizer import optimize_model
|
||||
m = optimize_model(onnx_model_path,
|
||||
model_type='gpt2',
|
||||
|
|
@ -211,6 +238,8 @@ class Gpt2Helper:
|
|||
|
||||
@staticmethod
|
||||
def pytorch_inference(model, inputs, total_runs=0):
|
||||
""" Run inference of PyTorch model, and returns average latency in ms when total_runs > 0 besides outputs.
|
||||
"""
|
||||
logger.debug(f"start pytorch_inference")
|
||||
input_ids, position_ids, attention_mask, past = inputs
|
||||
|
||||
|
|
@ -241,6 +270,8 @@ class Gpt2Helper:
|
|||
|
||||
@staticmethod
|
||||
def onnxruntime_inference(ort_session, inputs, total_runs=0):
|
||||
""" Run inference of ONNX model, and returns average latency in ms when total_runs > 0 besides outputs.
|
||||
"""
|
||||
logger.debug(f"start onnxruntime_inference")
|
||||
input_ids, position_ids, attention_mask, past = inputs
|
||||
|
||||
|
|
@ -248,7 +279,7 @@ class Gpt2Helper:
|
|||
|
||||
if past is not None:
|
||||
for i, past_i in enumerate(past):
|
||||
ort_inputs[f'past_{i}'] = numpy.ascontiguousarray(past[i].cpu().numpy())
|
||||
ort_inputs[f'past_{i}'] = numpy.ascontiguousarray(past_i.cpu().numpy())
|
||||
|
||||
if attention_mask is not None:
|
||||
ort_inputs['attention_mask'] = numpy.ascontiguousarray(attention_mask.cpu().numpy())
|
||||
|
|
@ -272,14 +303,15 @@ class Gpt2Helper:
|
|||
return ort_outputs, average_latency
|
||||
|
||||
@staticmethod
|
||||
def onnxruntime_inference_with_binded_io(ort_session, inputs, output_buffers, output_shapes, total_runs=0):
|
||||
logger.debug(f"start onnxruntime_inference_with_binded_io")
|
||||
input_ids, position_ids, attention_mask, past = inputs
|
||||
def prepare_io_binding(ort_session, input_ids, position_ids, attention_mask, past, output_buffers, output_shapes):
|
||||
""" Returns IO binding object for a session.
|
||||
"""
|
||||
|
||||
# Bind inputs and outputs to onnxruntime session
|
||||
io_binding = ort_session.io_binding()
|
||||
|
||||
# Bind inputs
|
||||
assert input_ids.is_contiguous()
|
||||
io_binding.bind_input('input_ids', input_ids.device.type, 0, numpy.longlong, list(input_ids.size()),
|
||||
input_ids.data_ptr())
|
||||
|
||||
|
|
@ -288,14 +320,17 @@ class Gpt2Helper:
|
|||
|
||||
if past is not None:
|
||||
for i, past_i in enumerate(past):
|
||||
io_binding.bind_input(f'past_{i}', past[i].device.type, 0, float_type, list(past[i].size()),
|
||||
past[i].data_ptr())
|
||||
assert past_i.is_contiguous()
|
||||
io_binding.bind_input(f'past_{i}', past_i.device.type, 0, float_type, list(past_i.size()),
|
||||
past_i.data_ptr())
|
||||
|
||||
if attention_mask is not None:
|
||||
assert attention_mask.is_contiguous()
|
||||
io_binding.bind_input('attention_mask', attention_mask.device.type, 0, float_type,
|
||||
list(attention_mask.size()), attention_mask.data_ptr())
|
||||
|
||||
if position_ids is not None:
|
||||
assert position_ids.is_contiguous()
|
||||
io_binding.bind_input('position_ids', position_ids.device.type, 0, numpy.longlong,
|
||||
list(position_ids.size()), position_ids.data_ptr())
|
||||
|
||||
|
|
@ -307,13 +342,36 @@ class Gpt2Helper:
|
|||
io_binding.bind_output(output_name, output_buffer.device.type, 0, float_type, output_shapes[output_name],
|
||||
output_buffer.data_ptr())
|
||||
|
||||
# Copy results to cpu for verification
|
||||
return io_binding
|
||||
|
||||
@staticmethod
|
||||
def get_outputs_from_io_binding_buffer(ort_session, output_buffers, output_shapes):
|
||||
""" Copy results to cpu. Returns a list of numpy array.
|
||||
"""
|
||||
ort_outputs = []
|
||||
for output in ort_session.get_outputs():
|
||||
output_name = output.name
|
||||
buffer = output_buffers[output_name]
|
||||
shape = output_shapes[output_name]
|
||||
ort_outputs.append(buffer[0:numpy.prod(shape)].reshape(shape).cpu())
|
||||
ort_outputs.append(buffer[0:numpy.prod(shape)].reshape(shape).cpu().numpy())
|
||||
return ort_outputs
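
The buffer handling in get_output_buffers and get_outputs_from_io_binding_buffer can be illustrated with a small stand-in. This is a sketch under assumptions: it uses NumPy arrays instead of the torch tensors in the helper, the shapes are hypothetical, and `view_output` is a name introduced here. The idea is to allocate one flat buffer large enough for the biggest expected output, then slice and reshape it for each actual output shape.

```python
import numpy as np

# Hypothetical largest output shape the buffer must hold.
max_shape = (8, 2, 16)
buffer = np.zeros(int(np.prod(max_shape)), dtype=np.float32)  # flat 1D buffer


def view_output(flat_buffer, shape):
    """Return the leading elements of flat_buffer viewed with the given shape."""
    count = int(np.prod(shape))
    return flat_buffer[:count].reshape(shape)


# Pretend a run wrote 48 values for a smaller output shape.
shape = (3, 1, 16)
buffer[:48] = np.arange(48, dtype=np.float32)
out = view_output(buffer, shape)
print(out.shape)
print(np.shares_memory(out, buffer))  # the view reuses the buffer, no copy
```

Reusing one buffer across runs avoids reallocating output memory for every input shape, which is why the parity test sizes its buffers from the maximum batch size and sequence lengths.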
|
||||
|
||||
@staticmethod
|
||||
def onnxruntime_inference_with_binded_io(ort_session, inputs, output_buffers, output_shapes, total_runs=0):
|
||||
""" Inference with IO binding. Returns outputs, and optional latency when total_runs > 0.
|
||||
"""
|
||||
logger.debug(f"start onnxruntime_inference_with_binded_io")
|
||||
input_ids, position_ids, attention_mask, past = inputs
|
||||
|
||||
# Bind inputs and outputs to onnxruntime session
|
||||
io_binding = Gpt2Helper.prepare_io_binding(ort_session, input_ids, position_ids, attention_mask, past,
|
||||
output_buffers, output_shapes)
|
||||
|
||||
# Run onnxruntime with io binding
|
||||
ort_session.run_with_iobinding(io_binding)
|
||||
|
||||
# Copy results to cpu for verification
|
||||
ort_outputs = Gpt2Helper.get_outputs_from_io_binding_buffer(ort_session, output_buffers, output_shapes)
|
||||
|
||||
if total_runs == 0:
|
||||
return ort_outputs
|
||||
|
|
@ -331,16 +389,39 @@ class Gpt2Helper:
|
|||
return ort_outputs, average_latency
|
||||
|
||||
@staticmethod
|
||||
def test_parity(ort_session, model, device, is_float16=False, rtol=5e-4, atol=5e-4, total_test_cases=100):
|
||||
def test_parity(ort_session,
|
||||
model,
|
||||
device,
|
||||
is_float16=False,
|
||||
rtol=5e-4,
|
||||
atol=5e-4,
|
||||
total_test_cases=100,
|
||||
use_io_binding=True,
|
||||
model_class="GPT2LMHeadModel"):
|
||||
""" Generate random inputs and compare the results of PyTorch and Onnx Runtime.
|
||||
"""
|
||||
|
||||
config: GPT2Config = model.config
|
||||
|
||||
logger.info(f"Running parity test (rtol={rtol}, atol={atol}, test_cases={total_test_cases}) ...")
|
||||
logger.info(
|
||||
f"Running parity test (rtol={rtol}, atol={atol}, test_cases={total_test_cases}, use_io_binding={use_io_binding} model_class={model_class} is_float16={is_float16}) ..."
|
||||
)
|
||||
|
||||
max_batch_size = 8
|
||||
max_past_seq_len = 4 # Do not use large number here for higher chance of hitting empty past (past_seq_len=0)
|
||||
max_seq_len = 2
|
||||
|
||||
output_buffers = None
|
||||
if use_io_binding:
|
||||
max_output_shapes = Gpt2Helper.get_output_shapes(max_batch_size, max_past_seq_len, max_seq_len, config,
|
||||
model_class)
|
||||
output_buffers = Gpt2Helper.get_output_buffers(max_output_shapes, device, is_float16)
|
||||
|
||||
passed_test_cases = 0
|
||||
for _ in range(total_test_cases):
|
||||
sequence_length = random.randint(1, 32)
|
||||
past_sequence_length = random.randint(0, 128)
|
||||
batch_size = random.randint(1, 16)
|
||||
sequence_length = random.randint(1, max_seq_len)
|
||||
past_sequence_length = random.randint(0, max_past_seq_len)
|
||||
batch_size = random.randint(1, max_batch_size)
|
||||
|
||||
logger.debug(
|
||||
f"Running parity test for batch_size={batch_size} past_sequence_length={past_sequence_length}...")
|
||||
|
|
@ -348,7 +429,14 @@ class Gpt2Helper:
|
|||
config.num_attention_heads, config.hidden_size, config.n_layer,
|
||||
config.vocab_size, device, is_float16)
|
||||
outputs = Gpt2Helper.pytorch_inference(model, dummy_inputs)
|
||||
ort_outputs = Gpt2Helper.onnxruntime_inference(ort_session, dummy_inputs)
|
||||
if not use_io_binding:
|
||||
ort_outputs = Gpt2Helper.onnxruntime_inference(ort_session, dummy_inputs)
|
||||
else:
|
||||
output_shapes = Gpt2Helper.get_output_shapes(batch_size, past_sequence_length, sequence_length, config,
|
||||
model_class)
|
||||
ort_outputs = Gpt2Helper.onnxruntime_inference_with_binded_io(ort_session, dummy_inputs, output_buffers,
|
||||
output_shapes)
|
||||
|
||||
is_all_close = Gpt2Helper.compare_outputs(outputs, ort_outputs, rtol=rtol, atol=atol)
|
||||
if is_all_close:
|
||||
passed_test_cases += 1
|
||||
|
|
@ -359,6 +447,8 @@ class Gpt2Helper:
|
|||
|
||||
@staticmethod
|
||||
def torchscript(model, config, device):
|
||||
""" JIT trace for TorchScript.
|
||||
"""
|
||||
dummy_inputs = Gpt2Helper.get_dummy_inputs(batch_size=1,
|
||||
past_sequence_length=1,
|
||||
sequence_length=1,
|
||||
|
|
@ -373,6 +463,8 @@ class Gpt2Helper:
|
|||
|
||||
@staticmethod
|
||||
def get_onnx_paths(output_dir, model_name_or_path, model_class: str = 'GPT2LMHeadModel', has_past=True):
|
||||
""" Build a path name for given model based on given attributes.
|
||||
"""
|
||||
model_name = model_name_or_path if model_name_or_path.isalnum() else os.path.dirname(model_name_or_path)
|
||||
|
||||
if model_class != 'GPT2LMHeadModel':
|
||||
|
|
|
|||