Transformer Model Optimization Tool Dev Guide

The transformer model optimization tool applies to BERT, GPT-2 and some variations (such as RoBERTa and DistilBERT). However, it cannot cover every case, especially new models coming out of academia. This guide gives an overall introduction to how the graph transformation works and how to optimize your custom transformer-based model with limited code changes to the graph fusion logic and kernel implementations.

The objective of the Dev Guide is to enable more transformer-based models to take advantage of ONNXRuntime optimized kernels.

Contributions are welcome!

Prerequisite

The developer is expected to have basic knowledge of C++, CUDA and Python programming.
Transformer Model Optimization Tool Overview
This guide assumes that a valid ONNX model exported from the original framework is ready. If there are any issues with model exporting, fp16 conversion, profiling or benchmarking, please refer to the overview linked above.
Netron is an excellent graph visualization tool; a web version is also available.
Optional: In case kernel changes are needed, see the instructions for building ONNXRuntime with packages for different APIs and language bindings.

Rule Of Thumb

Graph fusion transforms a certain graph structure into a single fused node. The kernel wrapped by the fused node is the strict computational equivalent of that graph structure and is executed by the runtime engine. This means that the candidate graph must have exactly the same logic as the fused node's kernel implementation. It is suggested to become familiar with the targeted optimized kernel implementation first, and then work on the fusion logic.

Kernel Implementation

ONNXRuntime supports optimized kernels as contrib operators in both the CPU and CUDA Execution Providers.

The definition of the optimized kernels can be found in onnxruntime/core/graph/contrib_ops/contrib_defs.cc.
The CPU implementation of the optimized kernels can be found under onnxruntime/contrib_ops/cpu/bert.
The CUDA implementation of the optimized kernels can be found under onnxruntime/contrib_ops/cuda/bert.
Contrib ops tests can be found here

For instance, the entry point of Attention CPU kernel is the Compute() function. Similarly, for the EmbedLayerNorm CUDA kernel, the entry point is the ComputeInternal() function.

Graph Fusion

The main part of the transformer optimizer is graph fusion. The current implementation for BERT optimization supports a number of fusions that are executed in order. Each graph fusion is a subclass of Fusion that implements the fuse() method. For instance, see the fuse() method in attention fusion.

The onnx_model class provides many useful functions for modifying an ONNX graph, including but not limited to:

Retrieve all graph nodes with self.nodes()
A mapping of edge names to nodes.
Basic operations of input/output, node, initializer.
Match graph patterns up-streaming and down-streaming.
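
To make the edge mapping and up-stream matching ideas more concrete, here is a minimal sketch on a toy graph. It does not use the onnx_model class: the Node dataclass and the helpers build_producer_map, build_consumer_map and match_upstream are hypothetical names invented for this illustration, and the toy graph only mimics the shape of an unfused Gelu subgraph.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical stand-in for a graph node (not an ONNX class)."""
    name: str
    op_type: str
    inputs: list
    outputs: list


def build_producer_map(nodes):
    """Map each edge (tensor) name to the node that produces it."""
    return {out: node for node in nodes for out in node.outputs}


def build_consumer_map(nodes):
    """Map each edge name to the list of nodes that consume it."""
    consumers = {}
    for node in nodes:
        for name in node.inputs:
            consumers.setdefault(name, []).append(node)
    return consumers


def match_upstream(node, op_types, producers):
    """Walk up-stream from `node`, expecting parents with the given op types
    in order. Returns the matched parent nodes, or None if the path breaks."""
    path = []
    current = node
    for op_type in op_types:
        parent = None
        for name in current.inputs:
            candidate = producers.get(name)
            if candidate is not None and candidate.op_type == op_type:
                parent = candidate
                break
        if parent is None:
            return None
        path.append(parent)
        current = parent
    return path


# Toy graph shaped like an unfused Gelu: x * 0.5 * (1 + erf(x / sqrt(2)))
nodes = [
    Node("div", "Div", ["x", "sqrt2"], ["div_out"]),
    Node("erf", "Erf", ["div_out"], ["erf_out"]),
    Node("add", "Add", ["erf_out", "one"], ["add_out"]),
    Node("mul1", "Mul", ["x", "add_out"], ["mul1_out"]),
    Node("mul2", "Mul", ["mul1_out", "half"], ["y"]),
]
producers = build_producer_map(nodes)
consumers = build_consumer_map(nodes)
path = match_upstream(nodes[-1], ["Mul", "Add", "Erf", "Div"], producers)
print([n.name for n in path])
```

Starting the match from the last node of the pattern and walking towards the inputs keeps the search cheap, because each edge has exactly one producer while it may have many consumers.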

Fusion process

Match the candidate graph with expected connection pattern. Example: Gelu fusion, Attention fusion
Construct the fused node with inputs, outputs and the weights obtained from the original graph. Example: Gelu fusion, Attention fusion
Remove the candidate graph. Example: Gelu fusion, Attention fusion
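
The three steps above can be sketched on a toy graph as follows. This is an illustration only, not the tool's Gelu fusion: the Node dataclass, find_parent and fuse_gelu are hypothetical names, and the sketch assumes the subgraph has the Div, Erf, Add, Mul, Mul shape. A real fusion would also have to verify the constant values and confirm that the intermediate outputs have no other consumers, since the fused kernel must be strictly equivalent to the candidate graph.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical stand-in for a graph node (not an ONNX class)."""
    name: str
    op_type: str
    inputs: list
    outputs: list


def find_parent(node, op_type, producers):
    """Return the first parent of `node` with the given op type, or None."""
    for name in node.inputs:
        parent = producers.get(name)
        if parent is not None and parent.op_type == op_type:
            return parent
    return None


def fuse_gelu(nodes):
    """Replace each Div -> Erf -> Add -> Mul -> Mul chain with one Gelu node."""
    producers = {out: n for n in nodes for out in n.outputs}
    fused = list(nodes)
    for last_mul in nodes:
        if last_mul.op_type != "Mul":
            continue
        # Step 1: match the candidate graph, walking up-stream from the last Mul.
        path = []
        current = last_mul
        for op_type in ("Mul", "Add", "Erf", "Div"):
            current = find_parent(current, op_type, producers)
            if current is None:
                break
            path.append(current)
        if len(path) != 4:
            continue
        first_mul, _add, _erf, div = path
        root_input = div.inputs[0]
        if root_input not in first_mul.inputs:
            continue  # both branches must start from the same tensor
        # Step 2: construct the fused node from the original input and output.
        gelu = Node(f"Gelu_{last_mul.name}", "Gelu",
                    [root_input], list(last_mul.outputs))
        # Step 3: remove the candidate graph, keeping the fused node in place
        # of the last Mul so the node order stays valid.
        matched_ids = {id(n) for n in path}
        fused = [gelu if n is last_mul else n
                 for n in fused if id(n) not in matched_ids]
    return fused


graph = [
    Node("matmul", "MatMul", ["a", "w"], ["x"]),
    Node("div", "Div", ["x", "sqrt2"], ["div_out"]),
    Node("erf", "Erf", ["div_out"], ["erf_out"]),
    Node("add", "Add", ["erf_out", "one"], ["add_out"]),
    Node("mul1", "Mul", ["x", "add_out"], ["mul1_out"]),
    Node("mul2", "Mul", ["mul1_out", "half"], ["y"]),
    Node("relu", "Relu", ["y"], ["z"]),
]
optimized = fuse_gelu(graph)
print([n.op_type for n in optimized])
```

The fused node reuses the original input and output edge names, so the surrounding nodes do not need to be rewired.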

After fusing the graph, check the parity between optimized onnx model and original one by feeding the same inputs to both models and comparing outputs.
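
A parity check of this kind can be sketched as below. The function compare_outputs is a hypothetical helper, the tolerance values are assumptions chosen for illustration, and the two output lists are made-up stand-ins for the results that would come from running the original and optimized models on the same inputs. The acceptance rule follows the same formula as numpy.allclose: |optimized - reference| <= atol + rtol * |reference|.

```python
def compare_outputs(reference, optimized, rtol=1e-3, atol=1e-4):
    """Compare two flat output sequences element by element.

    Returns (ok, max_abs_diff). The tolerances are illustrative defaults,
    not values prescribed by the optimization tool.
    """
    if len(reference) != len(optimized):
        raise ValueError("output sizes differ")
    ok = True
    max_diff = 0.0
    for ref, opt in zip(reference, optimized):
        diff = abs(opt - ref)
        max_diff = max(max_diff, diff)
        if diff > atol + rtol * abs(ref):
            ok = False
    return ok, max_diff


# Made-up stand-ins for the outputs of the original and optimized models.
reference_output = [0.5, -1.25, 3.0]
optimized_output = [0.50001, -1.25002, 3.00003]

ok, max_diff = compare_outputs(reference_output, optimized_output)
print(f"parity ok: {ok}, max abs diff: {max_diff:.2e}")
```

Reporting the maximum absolute difference alongside the pass/fail result helps when choosing tolerances, for example after fp16 conversion, where larger differences are expected.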

A Concrete Case

The Attention Op and EmbedLayerNorm Op were not fused (see the EmbedLayerNorm graph and Attention graph in Netron) after running the optimization script on a custom transformer-based ONNX model.
The two candidate graphs were checked and confirmed to have logic identical to the current CPU/CUDA kernel implementation.
Some code changes were applied to the Attention fusion and EmbedLayerNorm fusion.
After re-running the script, the two Ops were fused (see the EmbedLayerNorm Op and Attention Op in Netron).
The parity check passed.

Contribution

Coding Conventions and Standards
