onnxruntime/orttraining
liqunfu af3988198c
Liqun/e2e transformer test (#3540)
2020-04-30 12:26:38 -07:00
orttraining Liqun/e2e transformer test (#3540) 2020-04-30 12:26:38 -07:00
pytorch_frontend_examples Liqun/e2e transformer test (#3540) 2020-04-30 12:26:38 -07:00
tools Add pipeline transformer for wait/record node (#3513) 2020-04-22 23:28:01 -07:00
README.md Clean up docs. (#3579) 2020-04-17 22:13:11 -07:00

Introduction

ONNX Runtime Trainer is a test feature of the ONNX Runtime engine. It can be used to accelerate the computation of the ops used to train transformer-class models.

The ONNX Runtime trainer can be used with your existing PyTorch training code to accelerate execution on NVIDIA GPU clusters.

Build on Linux

Build the ONNX Runtime Training engine from source to accelerate computation on NVIDIA GPUs.

Dependencies

The default NVIDIA GPU build requires the following libraries to be installed on the system:

  • The GPU-accelerated CUDA libraries CUDA 10.1

  • The GPU-accelerated library of primitives for deep neural networks cuDNN 7.6.2

  • The NVIDIA Collective Communications Library (NCCL) of multi-GPU and multi-node communication primitives, NCCL v2.4.8 (download v2.4.8 from the Legacy downloads page)

  • OpenMPI 4.0.0, which can be built from source as follows:

wget https://download.open-mpi.org/release/open-mpi/v4.0/openmpi-4.0.0.tar.gz
tar zxf openmpi-4.0.0.tar.gz
cd openmpi-4.0.0
./configure --enable-orterun-prefix-by-default
make -j $(nproc) all
sudo make install
sudo ldconfig
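
Before moving on, it can help to confirm that the command-line tools implied by the dependency list are reachable. The following is a minimal sketch, not part of the ONNX Runtime build: it assumes that `nvcc` (shipped with the CUDA toolkit) and `mpirun` (installed by the OpenMPI build above) should be on `PATH`. The helper names `find_tools` and `missing_tools` are hypothetical. cuDNN and NCCL are libraries rather than executables, so this check does not cover them.

```python
import shutil

# Executables implied by the dependency list (an assumption of this sketch):
# nvcc ships with the CUDA toolkit, mpirun is installed by the OpenMPI build.
REQUIRED_TOOLS = ("nvcc", "mpirun")


def find_tools(names=REQUIRED_TOOLS):
    """Map each executable name to its resolved path, or None if not on PATH."""
    return {name: shutil.which(name) for name in names}


def missing_tools(names=REQUIRED_TOOLS):
    """Return the names that could not be found on PATH, in the given order."""
    found = find_tools(names)
    return [name for name in names if found[name] is None]


if __name__ == "__main__":
    for name, path in find_tools().items():
        print(f"{name}: {path or 'NOT FOUND'}")
```

If a tool is reported as not found, adjusting `PATH` as described in the next section is usually the first thing to try.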

Get the code and set up the environment

  • Check out this code repo with git clone https://github.com/microsoft/onnxruntime

  • Set the environment variables, adjusting the paths for your build machine:

export CUDA_HOME=<location for CUDA libs> # e.g. /usr/local/cuda
export CUDNN_HOME=<location for cuDNN libs> # e.g. /usr/local/cuda
export CUDACXX=<location for NVCC> # e.g. /usr/local/cuda/bin/nvcc
export PATH=<location for openmpi/bin/>:$PATH
export LD_LIBRARY_PATH=<location for openmpi/lib/>:$LD_LIBRARY_PATH
export MPI_CXX_INCLUDE_PATH=<location for openmpi/include/>
source <location of the mpivars script> # e.g. /data/intel/impi/2018.3.222/intel64/bin/mpivars.sh
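
A wrong or missing path here tends to surface only later as a confusing build error, so a quick sanity check may be worthwhile. The sketch below is illustrative and not part of the build scripts: it assumes that `CUDA_HOME`, `CUDNN_HOME`, `CUDACXX` and `MPI_CXX_INCLUDE_PATH` should each point at an existing path, and the helper name `check_env` is hypothetical.

```python
import os

# Variables from the export list above that should each point at an existing path.
PATH_VARS = ("CUDA_HOME", "CUDNN_HOME", "CUDACXX", "MPI_CXX_INCLUDE_PATH")


def check_env(env=None, names=PATH_VARS):
    """Return a list of human-readable problems; an empty list means no problems."""
    env = os.environ if env is None else env
    problems = []
    for name in names:
        value = env.get(name)
        if not value:
            problems.append(f"{name} is not set")
        elif not os.path.exists(value):
            problems.append(f"{name} points to a missing path: {value}")
    return problems


if __name__ == "__main__":
    for problem in check_env():
        print(problem)
```

Run it in the same shell session where the variables were exported; no output means every variable is set and its path exists.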

Create the ONNX Runtime wheel

Change to the ONNX Runtime repo base folder: cd onnxruntime

Run ./build.sh --enable_training --use_cuda --config=RelWithDebInfo --build_wheel

This will produce the .whl file in ./build/Linux/RelWithDebInfo/dist for ONNX Runtime Trainer.
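
The wheel's exact file name depends on the build, so it is easier to look it up than to guess it. The following sketch assumes it is run from the repo base folder and that the output directory is the one named above. The helper name `find_wheels` is hypothetical, and installing the result with `pip install` is the usual way to use a wheel rather than a step this document prescribes.

```python
from pathlib import Path


def find_wheels(dist_dir="build/Linux/RelWithDebInfo/dist"):
    """Return the .whl files in dist_dir, newest first (empty list if none)."""
    directory = Path(dist_dir)
    if not directory.is_dir():
        return []
    return sorted(directory.glob("*.whl"),
                  key=lambda p: p.stat().st_mtime, reverse=True)


if __name__ == "__main__":
    wheels = find_wheels()
    if wheels:
        # Print the install command instead of running it.
        print(f"pip install {wheels[0]}")
    else:
        print("No wheel found; check that the build finished successfully.")
```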

Use with PyTorch training

You can use the ONNX Runtime Training wheel as the trainer in your PyTorch pre-training script. Here is a high-level code fragment to include in your pre-training code:

import torch
import onnxruntime.training.pytorch as ort

# Model definition
class Net(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        ...
    def forward(self, x): 
        ...

model = Net(D_in, H, D_out)
criterion = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)
trainer = ort.trainer(model, criterion, optimizer, ...)

# Training Loop
for t in range(1000):
    # forward + backward + weight update 
    loss, y_pred = trainer.step(x, y)
    ...
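
To make the contract of `trainer.step(x, y)` concrete, here is a toy, pure-Python stand-in that bundles the same three phases (forward, backward, weight update) into one call and returns `(loss, y_pred)`. It is only an illustration of the calling pattern under simplifying assumptions: `ToyTrainer` is a hypothetical class fitting a scalar model `y = w * x + b` with plain SGD. It is not how the ONNX Runtime trainer is implemented and uses none of its API.

```python
class ToyTrainer:
    """Pure-Python stand-in: fits y = w * x + b with SGD on a squared-error loss."""

    def __init__(self, lr=0.1):
        self.w, self.b, self.lr = 0.0, 0.0, lr

    def step(self, x, y):
        # Forward pass: prediction and loss.
        y_pred = self.w * x + self.b
        error = y_pred - y
        loss = error ** 2
        # Backward pass: gradients of the loss with respect to w and b.
        grad_w, grad_b = 2 * error * x, 2 * error
        # Weight update: one SGD step.
        self.w -= self.lr * grad_w
        self.b -= self.lr * grad_b
        return loss, y_pred


if __name__ == "__main__":
    trainer = ToyTrainer()
    for t in range(100):
        # Same shape as the loop above: one call does all three phases.
        loss, y_pred = trainer.step(1.0, 3.0)
    print(loss, y_pred)
```

The point of the single-call design is that the training loop never touches gradients or the optimizer directly, which leaves the trainer free to run all three phases however it chooses.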

A sample for end-to-end training with the ONNX Runtime trainer is coming soon.