onnxruntime/include
Maximilian Müller 7c17e33c07
Make CUDA a NHWC EP (#17200)
### Description

CUDA inference speed relies heavily on Tensor Cores. To achieve optimal
throughput, Tensor Cores require the data layout to be NHWC rather than
NCHW.
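
To make the layout difference concrete, here is a minimal sketch in plain Python that computes the flat memory offset of one logical element under both layouts. The helper names `nchw_offset` and `nhwc_offset` are illustrative only and are not onnxruntime APIs. It shows that in NHWC the channel values of a single pixel sit next to each other in memory, whereas in NCHW they are a full `H * W` plane apart.

```python
def nchw_offset(n, c, h, w, C, H, W):
    """Flat offset of element (n, c, h, w) in a contiguous NCHW tensor."""
    return ((n * C + c) * H + h) * W + w


def nhwc_offset(n, c, h, w, C, H, W):
    """Flat offset of the same logical element in a contiguous NHWC tensor."""
    return ((n * H + h) * W + w) * C + c


# Dimensions of the dummy input used in the example network of this description.
C, H, W = 8, 512, 512

# Distance in memory between channel 0 and channel 1 of the same pixel.
nchw_stride = nchw_offset(0, 1, 0, 0, C, H, W) - nchw_offset(0, 0, 0, 0, C, H, W)
nhwc_stride = nhwc_offset(0, 1, 0, 0, C, H, W) - nhwc_offset(0, 0, 0, 0, C, H, W)
print(nchw_stride, nhwc_stride)
```

The channel stride is `H * W` elements in NCHW and a single element in NHWC.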

### Motivation and Context


This is especially important for convolutional networks. I will
illustrate it using a very simple network:
```python
import torch
import torch.nn as nn

class Net1(nn.Module):

    def __init__(self):
        super(Net1, self).__init__()
        # five stacked convolutions without padding:
        # 8 input channels, widening to 128 output channels
        self.m = nn.ModuleList([
            nn.Conv2d(in_channels=8, out_channels=32, kernel_size=5, stride=1),
            nn.Conv2d(in_channels=32, out_channels=64, kernel_size=3, stride=1),
            nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3, stride=1),
            nn.Conv2d(in_channels=128, out_channels=128, kernel_size=3, stride=1, bias=False),
            nn.Conv2d(in_channels=128, out_channels=128, kernel_size=3, stride=1, bias=False),
        ])
    def forward(self, x):
        for module in self.m:
            x = module(x)
        return x


if __name__ == "__main__":
    dtype = torch.half
    device = "cuda"

    dummy_input = torch.randn(8, 8, 512, 512, dtype=dtype, device=device)
    model = Net1().to(dtype=dtype, device=device)
    input_names = ["input1"]
    output_names = ["output1"]
    torch.onnx.export(model, dummy_input, "test.onnx",
                      input_names=input_names, output_names=output_names)
```
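
As a quick sanity check on the test network, the following sketch computes the spatial size after each convolution using the standard output-size formula for a 2D convolution. The helper `conv_out_size` is a hypothetical name introduced here; the kernel sizes, the batch size of 8, and the 128 output channels are taken from the script above, which uses no padding and stride 1 throughout.

```python
def conv_out_size(size, kernel_size, stride=1, padding=0, dilation=1):
    """Output size of a convolution along one spatial dimension."""
    return (size + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


kernel_sizes = [5, 3, 3, 3, 3]  # the five Conv2d layers of Net1
size = 512                      # height and width of the dummy input
sizes = [size]
for k in kernel_sizes:
    size = conv_out_size(size, k)
    sizes.append(size)

# Batch size 8 and 128 output channels come from the script above.
output_shape = (8, 128, size, size)
print(sizes)
print(output_shape)
```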

I profiled a run of `./build/RelWithDebInfo/onnxruntime_perf_test
-e cuda -I -q -t 5 test.onnx` using Nsight Systems (`nsys`) and NVTX ranges.
Current master launches the following kernels:

![image](https://github.com/microsoft/onnxruntime/assets/44298237/81655fce-0f8e-4f78-9335-b858a8c8977b)

With the newly introduced `-l` flag, the following kernels are launched instead:

![image](https://github.com/microsoft/onnxruntime/assets/44298237/fceb5d6f-c12d-442b-b15a-948797630008)

Note that the per-operation NCHW<>NHWC conversion kernels are gone. The
layout optimizer now inserts a single transpose op at the start and at
the end of the whole network. The remaining `op_generic_tensor_kernel`
launches are the bias additions, which should be optimized out next.
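
The effect described above can be sketched with a toy model. This is only an illustration of the idea and not onnxruntime's actual layout transformation code; the node names and both functions are hypothetical. Every op is first wrapped in its own pair of layout transposes, and then each back-to-back NHWC->NCHW / NCHW->NHWC pair is dropped because it is a no-op.

```python
TO_NHWC = "Transpose(NCHW->NHWC)"
TO_NCHW = "Transpose(NHWC->NCHW)"


def wrap_per_op(ops):
    """Naive conversion: every op gets its own pair of layout transposes."""
    graph = []
    for op in ops:
        graph += [TO_NHWC, op + "_nhwc", TO_NCHW]
    return graph


def cancel_inverse_transposes(graph):
    """Drop each adjacent NHWC->NCHW / NCHW->NHWC pair, which is a no-op."""
    out = []
    for node in graph:
        if out and out[-1] == TO_NCHW and node == TO_NHWC:
            out.pop()  # remove the earlier transpose and skip the current one
        else:
            out.append(node)
    return out


convs = ["Conv%d" % i for i in range(5)]
optimized = cancel_inverse_transposes(wrap_per_op(convs))
print(optimized)
```

Only the first and the last transpose survive, which matches the profile above: one layout conversion on entry, one on exit, and none between the convolutions.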

Measured across some very basic models:

| CUDA EP | **NCHW** [ms] | **NHWC** [ms] | Speedup |
|:------------------------|--------------------------------------:|-----------------------------------------:|------------------:|
| | `-e cuda -t 5 -q` | `-e cuda -t 5 -q -l` | |
| resnet101-v2-7_bs8_fp16 | 18.33 | 13.07 | 1.40 |
| resnet101-v2-7_bs8 | 21.80 | 12.06 | 1.81 |
| test | 102.07 | 73.62 | 1.39 |

Average speedup: 1.53
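
For reference, the speedup column and the average can be reproduced from the measured timings with a few lines of Python. The numbers are copied from the table; the variable names are illustrative, and the average is taken as the arithmetic mean of the per-model speedups.

```python
# (NCHW ms, NHWC ms) per model, copied from the table above.
timings_ms = {
    "resnet101-v2-7_bs8_fp16": (18.33, 13.07),
    "resnet101-v2-7_bs8": (21.8, 12.06),
    "test": (102.07, 73.62),
}

# Speedup is the NCHW time divided by the NHWC time.
speedups = {name: nchw / nhwc for name, (nchw, nhwc) in timings_ms.items()}
average = sum(speedups.values()) / len(speedups)

for name, speedup in speedups.items():
    print(f"{name}: {speedup:.2f}x")
print(f"average: {average:.2f}x")
```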

## Outlook

The next step is to write a templated unit test that checks NHWC ops
for correctness against their NCHW counterparts. After that, more ops
have to be transitioned to NHWC so that perf improvements can be
measured on a broader range of models. This is currently not easily
possible because not all ops are supported in the NHWC domain.

---------

Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
2023-10-16 10:16:37 -07:00
onnxruntime/core Make CUDA a NHWC EP (#17200) 2023-10-16 10:16:37 -07:00