Commit graph

745 commits

Author SHA1 Message Date
Suffian Khan
225439193e
Optimize Concat and Split on CUDA to eliminate host-to-device copies when sizes are all the same (#8833)
* special case concat and split when sizes are equal

* add tests for 16 and 32 inputs with same dim

* add tests for 16/64 inputs on concat or 16/64 outputs on split

* try eliminate windows warning

* outter => outer
2021-09-01 15:25:45 -07:00
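A rough illustration of why equal sizes help in the Concat commit above. This is a sketch under assumptions, not the CUDA kernel from the PR: the function names are hypothetical and the inputs are plain 1-D Python lists. In the general case each output element needs a lookup in a per-input offset table (the kind of table that would have to be copied to the device), whereas with equal sizes the owning input follows from a single division.

```python
from bisect import bisect_right
from itertools import accumulate


def concat_general(inputs):
    """Concatenate 1-D lists using a per-input offset table (hypothetical helper)."""
    # offsets[i] is the output index just past the last element of input i.
    offsets = list(accumulate(len(x) for x in inputs))
    total = offsets[-1] if offsets else 0
    out = []
    for j in range(total):
        i = bisect_right(offsets, j)            # which input owns output element j
        start = offsets[i - 1] if i > 0 else 0  # where that input begins in the output
        out.append(inputs[i][j - start])
    return out


def concat_equal(inputs):
    """Same result when every input has the same length: no offset table needed."""
    size = len(inputs[0])
    assert all(len(x) == size for x in inputs)
    out = []
    for j in range(size * len(inputs)):
        i, k = divmod(j, size)  # input index and position inside that input
        out.append(inputs[i][k])
    return out
```

Split is the mirror image: with equal output sizes, the destination output and position for each input element come from the same `divmod`.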
Suffian Khan
00b0a9c127
Add hugging-face models loss curve and performance guards to ROCm CI pipeline. (#8915)
* test running hf bert-large

* try again

* try again

* include other models

* correct names

* disable deberta-v2-xxlarge

* avoid torch.distributed

* add compare json loss and perf for bert-large to test

* fix sed expression

* remove pytest

* add more models

* move unit tests u

* display samples/sec
2021-09-01 09:03:10 -07:00
Tang, Cheng
4dc0ddf606
support register external ep lib information (#8897)
* support register external ep lib information; make eager mode share the same ep pools with training workloads

* fix inference code

* fix build break

* fix the message
2021-08-31 20:51:22 -07:00
pengwa
3eb08d4dc7
custom autograd func memory (#8901)
* remove PythonOpGrad control dependency && avoid segmentation fault

* comment alignment

* fix bugs
2021-09-01 09:29:26 +08:00
baijumeswani
70ca03d491
Correctly set the skip check flags for ORTModule (#8891) 2021-08-31 15:28:04 -07:00
George Nash
dc75a135c8
Add elementwise operators to DNNL execution provider (#8899)
The following ops have been added to the DNNL execution provider
Abs, Elu, Exp, Log, *Relu, Round, Sigmoid, Softplus, Sqrt, and Tanh

*Relu op was moved from its individual file to the elementwise operators

The error tolerance for the LogGrad unit test had to be decreased slightly
when using OneDNN. Still investigating why a different tolerance value is
needed.

DnnlSubgraph::AddKernels() member function was moved to the top of the file.
Since it is edited every time a new operator is added to the execution
provider, placing the code at the top means less scrolling when
adding new kernels.

Signed-off-by: George Nash <george.nash@intel.com>
2021-08-31 12:20:49 -07:00
satyajandhyala
84f9271a8d
Enable registering external custom op schemas on Linux (#8889)
* Use manylinux instead of Ubuntu to run external custom ops build pipeline.
2021-08-30 10:13:47 -07:00
pengwa
36fa0de8b7
fix regression and enable custom autograd func tests in CIs (#8868)
* fix regression and enable tests in CIs

* Update orttraining/orttraining/python/training/ortmodule/_custom_autograd_function.py

Co-authored-by: Wei-Sheng Chin <wschin@outlook.com>

* fix

Co-authored-by: Wei-Sheng Chin <wschin@outlook.com>
2021-08-30 09:34:18 +08:00
Sherlock
6e20eb7eb3
Stop gradient for Multinomial, RandomNormalLike, RandomUniformLike and EyeLike (#8836) 2021-08-28 16:21:34 -07:00
baijumeswani
df9438192a
Re-introduce saving of optimized onnx model (#8860)
* Re-introduce saving of optimized onnx model
2021-08-28 14:27:25 -07:00
satyajandhyala
31926176ac
Support external custom operator schemas on Ubuntu (#8807)
* Expose symbols in onnx and protobuf namespaces in python when building with --enable_external_custom_op_schemas

* Add external onnx and protobuf files to wheel

* Added an example to demonstrate external custom ops use-case

* Added a Linux build pipeline to test external custom ops
2021-08-28 11:05:21 -07:00
Tang, Cheng
ae7f2d824d
Share the execution provider instance for training (#8719)
* separate the training python module; share the execution provider instance

* fix build break

* fix cuda test crash; reorg the python module code base

* se correct env

* use provider customized hash func

* fix build break

* fix rocm break

* use const ref in argument

* rename the file

* move hash func to training module
2021-08-27 16:23:35 -07:00
Sherlock
c325207f7a
Optimize MatmulGrad (#8846)
Optimize two special cases of MatmulGrad using FusedMatMul.
2021-08-25 23:36:40 -07:00
Hariharan Seshadri
cee79526fd
Add opset 15 kernels for Pow, BatchNorm, and Shape (#8442) 2021-08-25 12:04:20 -07:00
Sherlock
73fe7bfa0f
Add ATenOp at::diagonal (#8838)
* Register at::diagonal for ATenOp
2021-08-25 09:45:53 -07:00
Chandru Ramakrishnan
98ed235fc7
Removed MSNPU code from eager. (#8832) 2021-08-25 09:40:25 -04:00
ashari4
4251e04eae
Removed assert (#8779) 2021-08-24 20:26:08 -07:00
ashari4
7f1e880649
Reorder ORT eager headers (#8813) 2021-08-24 14:48:43 -07:00
Changming Sun
4bfff45859
Downgrade Eigen (#8817) 2021-08-23 18:06:23 -07:00
Chandru Ramakrishnan
2693af9799
Ported changes / bug fixes from torch/ort. (#8784)
* Ported changes / bug fixes from torch/ort.

* Fixed formatting

* Renamed function

* Renamed module_ to module.

* Revert "Renamed module_ to module."

This reverts commit b17fc114b3db20d174283811d90592b5b8154c19.

* Include pybind common header to fix linker errors on windows debug.

* Fix to generation of > 1 custom op.

Co-authored-by: Ashwin Hari <ashari@microsoft.com>
2021-08-23 17:45:40 -04:00
George Nash
d4a88cfe3f
Add Gemm op to DNNL Execution provider (#8799)
* Implement Gemm op for DNNL execution provider

Signed-off-by: George Nash <george.nash@intel.com>

* Remove KernelRegistry and Gemm op for dnnl ep

The KernelRegistry for the dnnl execution provider only
registered a Gemm op that as best we can tell was never
actually used and also was not using the dnnl library.

We have implemented a Gemm op in the DNNL execution provider
subgraph code and thus are removing the unused Gemm op that
was in the dnnl KernelRegistry.

Signed-off-by: George Nash <george.nash@intel.com>

* Fix duplicated output and kernelshape inference

fix getcapability to make sure subgraph outputs do not have duplicates

fix kernelshape inference in pool

Signed-off-by: Wang <zhaoyang.wang@intel.com>

* Removed most dnnl specialized ifdefs from gradient_ops_test code

Re-enable GlobalAveragePoolGrad test for dnnl ep

The bugs that were exposed by the GlobalAveragePoolGrad test have
been fixed and this test no longer needs to be disabled for DNNL.

Removed the ReluGradDnnl test. We are getting the testing from the
already existing ReluGrad test.

MaxPoolGrad test no longer has specialized execution provider
enabling for DNNL execution provider. It will now run without
the extra enabling.

ConvGrad is the only test that still has dnnl specialized ifdefs
However, the ConvGrad code was not being executed by the code
unless it was listed first in the list of execution providers.

Signed-off-by: George Nash <george.nash@intel.com>

* Fix transpose issue on Gemm

On transposing square matrices, getmemoryandreshape will fail to reshape

fix by adding a bool

Signed-off-by: Wang <zhaoyang.wang@intel.com>

* Save memory space by reusing internal tensor for output

The intermediate matmul output tensor can be used as the output
tensor for the binary calculation.

Remove the unused IsAttributeSupported from the
DnnlGemmNodeCapability class since we now support all of the
Gemm attributes in our implementation.

Signed-off-by: George Nash <george.nash@intel.com>

Co-authored-by: Wang <zhaoyang.wang@intel.com>
2021-08-23 08:45:34 -07:00
Suffian Khan
9fa0d8392a
Extend node debugging utilities to push tensors and node placement to SQL database (#8672)
* adding support for tracing to sqldb instead of files

* use compiled statements

* script to pull tensors from db

* link sqlite3

* remove node info redundant with onnx graph

* addressing PR comments

* address PR comments and include program counter

* third party notice

* use find_package

* add to cgmanifests.json

* address thread safety and add pid suffix

* build fi

* python script to select on devicetype

* remove unpopulated and redundant Shape and Type fields

* comment

* comment

* PR comments

* add graph execution counter to session state

* move increment to inference session

* std::endl to \n

* ifdef on graph execution counter

* add ifdef to inference session

* move DEBUG_NODE_INPUTS_OUTPUTS to CMakeLists.txt
2021-08-21 00:40:12 -07:00
Sherlock
81889a1cf6
Invertible ReluGrad (#8773)
* Invertible Relu Grad
2021-08-19 11:29:05 -07:00
Aaron Bockover
b2813656f5
eager: fix build against latest PyTorch master (#8745)
Improve README as well.
2021-08-18 14:27:21 -04:00
pengwa
0983d61969
refine glue code and tests (#8510) 2021-08-18 11:38:00 +08:00
ashbhandare
cc275e7529
Gradient Accumulation optimization verified for correctness (#8273)
* Fetching frontier tensors to frontend

* Move before session initialize call

* Fetch tensor and add to cache

* Rest of the changes for using cache

* Review comments

* Review changes

* Review comments

* switch to shared_ptr

* Fix bug after rebase

* FE docstring change
2021-08-17 16:24:44 -07:00
baijumeswani
871eeb4dbd
Support dicts as inputs to ORTModule (#8718) 2021-08-17 13:40:55 -07:00
Thiago Crepaldi
ed254c283f
Add support for experimental json config for fallback (#8759) 2021-08-17 13:35:42 -07:00
Thiago Crepaldi
419834d285
Add PyTorch fallback for ORTModule forward exceptions (#8346) 2021-08-17 10:41:15 -07:00
M. Zeeshan Siddiqui
0fb82f0f8a
Memory aware gradient builder. (#8582) 2021-08-16 19:01:22 -07:00
Nat Kershaw (MSFT)
aa12d68c37
Update ORTModule API docstrings (#8309) 2021-08-16 16:53:01 -07:00
George Nash
e695cd304a
Dnnl refactor (#8627)
* dnnl ep rework

    rework DnnlTensor,DnnlNode,DnnlSubgraph to support arbitrary graph topology and tensor data types

    rework GetCapability to claim nodes in graph greedily from node topological ordering and delay creation of DnnlSubgraph until Compile

    rework compile to have DnnlSubgraphPrimitive as the object to handle primitive creation and execution
        instead of thread local primitive pool which duplicates intermediate memory allocated by the EP across threads

    DnnlSubgraphPrimitive provides helpers to handle many common functions for each dnnl primitive builder and becomes the centralized place to store input, output, intermediate, and initializer memories, etc.
        it provides functions to obtain input memories with automatic reordering/reshaping and moving between engines
        it provides interfaces to add a primitive, set output memory for a single node, etc.

    add CONCURRENT_EXEC compile flag for dnnl library as without it, convolution primitive cannot be created and executed on different threads

    enable unit tests to run on dnnl ep as well if built with dnnl ep

    add dnnl ep support for Matmulinteger

* Add Relu to the DNNL refactor

Signed-off-by: George Nash <george.nash@intel.com>

* Add Convolution op to the DNNL rework

Signed-off-by: George Nash <george.nash@intel.com>

* Add Pooling ops to the DNNL rework

This adds the following ops:
    - AveragePool
    - GlobalAveragePool
    - GlobalMaxPool
    - MaxPool

Note: Pooling with dilation is not yet supported.
Note: GlobalLpPool, LpPool, MaxRoiPool, and MaxUnpool are not supported yet.

Signed-off-by: George Nash <george.nash@intel.com>

* Add Sum op to the DNNL rework

Signed-off-by: George Nash <george.nash@intel.com>

* Add ConvGrad op to the DNNL rework

Signed-off-by: George Nash <george.nash@intel.com>

* Add MaxPoolGrad and AveragePoolGrad ops to DNNL rework

Signed-off-by: George Nash <george.nash@intel.com>

* Added lrn operator to the refactored code

Signed-off by chethan.palangoutu.keshava@intel.com

* Added ReduceMean DNNL op to the refactor code

Signed-off-by: Chethan Palangotu Keshava <chethan.palangotu.keshava@intel.com>

* Added Softmax DNNL op for the refactored code

Signed-off-by: Chethan Palangotu Keshava <chethan.palangotu.keshava@intel.com>

* Added BatchNorm DNNL op inference-only for refactored code

Signed-off-by: Chethan Palangotu Keshava <chethan.palangotu.keshava@intel.com>

* Added Binary Ops to DNNL rework

Signed-off-by: Wang <zhaoyang.wang@intel.com>

* Added ReluGrad to DNNL Rework

Signed-off-by: Wang <zhaoyang.wang@intel.com>

* Update OneDNN tag to v2.3

Signed-off-by: Wang <zhaoyang.wang@intel.com>

* Added support for memory up to dim size 12

this is to fix the CI test cases that contain binary ops of input dim
size > 5

Signed-off-by: Wang <zhaoyang.wang@intel.com>

* Prevent claiming support for float16 and bfloat16 when only float is supported

The string.find call used was causing the code to claim support
for float16 and bfloat16 when we only supported float. We now explicitly
check for the data type itself or for the data type with a 7-letter prefix,
i.e. prefixed with "tensor("

Signed-off-by: George Nash <george.nash@intel.com>
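The substring pitfall described in this item can be reproduced in a few lines. The sketch below is in Python rather than the provider's C++, and the function names are hypothetical; it only assumes type strings shaped like "float" or "tensor(float)", as the message implies.

```python
TENSOR_PREFIX = "tensor("  # the 7-letter prefix mentioned in the commit message


def supports_float_buggy(type_str):
    # Substring search: "float" also occurs inside "float16" and "bfloat16".
    return type_str.find("float") != -1


def supports_float_fixed(type_str):
    # Strip the optional "tensor(" wrapper, then require an exact match.
    if type_str.startswith(TENSOR_PREFIX):
        type_str = type_str[len(TENSOR_PREFIX):].rstrip(")")
    return type_str == "float"
```

With the exact comparison, "tensor(float16)" and "tensor(bfloat16)" are rejected while "float" and "tensor(float)" are still accepted.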

* Disable uint8 mul and div, improve type conversion

Disable mul_uint8 and div_uint8 test cases as they use modulo for
overflow handling while onednn uses saturation

improve type conversion using enum instead of string comparison as well
as adding more types

Signed-off-by: Wang <zhaoyang.wang@intel.com>
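The mismatch behind the disabled uint8 tests comes down to two overflow conventions. A minimal sketch in plain Python integers, with hypothetical helper names standing in for the two behaviors the message describes (modulo in the test cases, saturation in oneDNN):

```python
UINT8_MAX = 255


def mul_uint8_wrap(a, b):
    # Modulo overflow handling: the product wraps around at 256.
    return (a * b) % (UINT8_MAX + 1)


def mul_uint8_saturate(a, b):
    # Saturating overflow handling: the product is clamped to 255.
    return min(a * b, UINT8_MAX)
```

For operands whose product fits in 8 bits the two agree; once the product exceeds 255 they diverge, which is why a test written against one convention fails under the other.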

Co-authored-by: Wang <zhaoyang.wang@intel.com>
Co-authored-by: Chethan Palangotu Keshava <chethan.palangotu.keshava@intel.com>
2021-08-13 14:15:43 -07:00
Changming Sun
436ac6dd5f
Rename ml_value.h to ort_value.h (#8726) 2021-08-13 07:04:56 -07:00
baijumeswani
217b2c9f93
Removing filelock import from ORTModule (#8722) 2021-08-12 21:19:49 -07:00
Tang, Cheng
de2a53e46d
[eager mode] fix build and support customize shared provider entry point (#8680)
* fix build break

* support customize the name of shared provide lib's entry point

* fix non training build

* check error code

* check return code
2021-08-11 15:10:35 -07:00
harshithapv
c24335246b
Support bool type for Pad Op and fix Unsqueeze in Tile grad for Opset 13 (#8602)
* changes

* tile grad unsqueeze fix for opset 13

* clean up

* remove bool support for opset 2 to 12 for Pad as it is not supported.

* Copy OperatorKernels.md from artifacts of Windows CI build.
2021-08-11 11:21:02 -07:00
mindest
a56e325eb8
constrain inputs for min/max grad UT (#8632)
* fix inputs for min/max grad UT

* use random inputs (truncated)
2021-08-07 18:29:06 +08:00
Tang, Cheng
6d3c2c85ef
Integrate eager mode source code into onnxruntime repo (#8584)
* integrate eager mode source code; build with cmake and integrate the python test

* Adding the python path for importing libraries in the Eager mode

* fix clang break;check if training and python enabled

* handling the linking of torch libraries across multiple platforms

* merge and fix the naming

* add build instruction

Co-authored-by: Abhishek Jindal <abjindal@OrtTrainingDev0.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
Co-authored-by: ajindal1 <abjindal@microsoft.com>
2021-08-06 08:30:27 -07:00
Ashwini Khade
96eb9810ba
Update onnx (#8458)
* updates for picking onnx commit

* add tests filter to c# tests

* plus test fixes

* fix versioning for contrib ops

* fix tests

* test filter for optional ops

* more versioning related updates

* fix test

* fix layernorm spec

* more updates

* update docs

* add more test filters

* more filters

* update binary size threshold

* update docs

* plus more fixes

* updates per review

* update to release commit

* add filters for optional type tests

* plus updates
2021-08-05 09:21:44 -07:00
Changming Sun
0510688411
Update compliance tasks in python packaging pipeline and fix some compile warnings (#8471)
1. Update SDLNativeRules from v2 to v3. The new one allows us to set excluded paths.
2. Update TSAUpload from v1 to v2. And add a config file ".gdn/.gdntsa" for it.
3. Fix some parentheses warnings
4. Update cmake to the latest.
5. Remove "--x86" build option from pipeline yaml files. Now we can auto-detect the cpu architecture from python, so we don't need to ask the user to specify it.
2021-07-30 17:16:37 -07:00
baijumeswani
816ad86d14
Configuring ORTModule - Internal Options (#8537) 2021-07-30 13:05:32 -07:00
satyajandhyala
5e2f4263db
Enable cast propagation in the frontend. (#8517) 2021-07-28 17:06:49 -07:00
baijumeswani
2e28cbaa64
Configuring ORTModule - End User Facing Options (#8470) 2021-07-28 10:51:43 -07:00
Sherlock
1370cbe256
[ORTModule] Extract output schema in module's true train/eval mode (#8516)
* Extract output schema in module's true train/eval mode
2021-07-28 09:55:07 -07:00
mindest
a71dab691d
Implement BatchNormInternal for cuda (#8172)
* correct batchnorm replacement output order;

remove bn replacement in grad graph builder

* update op defs and kernel class

* implement batch norm internal and grad.

* change saved_var into saved_inv_std

* cuda test case: bn internal

* remove redundant include

* fix comment; add support and UT for 1d input.

* exclude batch_norm_internal in amd_hipify

* run BNInternal UT for CUDA only

* fix CI error

* fix comment errors

* fix error

* add comment for inconsistency with cudnnBN doc

* additional comments for cudnnBN inconsistency
2021-07-28 16:04:49 +08:00
Vincent Wang
1798698545
avgpool2d atenop (#8507) 2021-07-28 14:04:55 +08:00
Sherlock
686f9b530b
ORTModule set_seed in int (#8511) 2021-07-27 15:43:13 -07:00
Oliver Rausch
1685ab8138
Implement Concat with Strided copy (#8336)
Adds a StridedCopy function that implements a copy from strided tensor to another.

This parallelizes the Concat operator, and can also be used in the future to parallelize many other data movement operators (e.g. Transpose, Split, etc.).
This operation is also required for the proposed data layout extensions to ORT.
2021-07-27 18:27:56 +02:00
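To make the idea in the StridedCopy commit above concrete, here is a small sketch of a strided copy between flat buffers and a Concat built from it. It is an illustration only: the signature, the pure-Python loops, and the `concat_axis1` helper are assumptions, not the function added by the PR, which is described as parallelized.

```python
from itertools import product


def strided_copy(dst, dst_offset, dst_strides, src, src_offset, src_strides, shape):
    """Copy a block of the given shape from src to dst (hypothetical signature).

    Each buffer is a flat list; strides are in elements, one per dimension.
    """
    for idx in product(*(range(n) for n in shape)):
        d = dst_offset + sum(i * s for i, s in zip(idx, dst_strides))
        s = src_offset + sum(i * t for i, t in zip(idx, src_strides))
        dst[d] = src[s]


def concat_axis1(a, b, rows, a_cols, b_cols):
    """Concatenate two row-major matrices along axis 1 with two strided copies."""
    out_cols = a_cols + b_cols
    out = [None] * (rows * out_cols)
    # Each input lands in a column window of the output, so the destination
    # row stride is out_cols while the source row stride is the input's width.
    strided_copy(out, 0, (out_cols, 1), a, 0, (a_cols, 1), (rows, a_cols))
    strided_copy(out, a_cols, (out_cols, 1), b, 0, (b_cols, 1), (rows, b_cols))
    return out
```

Every element copy is independent of the others, which is what makes this formulation a natural fit for parallel execution and for reuse in other data movement operators.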
ytaous
1ae32655b3
fix t5 assert error (#8501)
Co-authored-by: Ethan Tao <ettao@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2021-07-27 09:04:01 -07:00
ytaous
ab5289f109
Performance: enable faster training with skip checks config (#8411)
* freeze/fastpath support

* more comments on _fast_path

* per comments

* minor fix

* IntFlag improve

* address comments

Co-authored-by: Ethan Tao <ettao@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2021-07-23 10:23:13 -07:00
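The "IntFlag improve" item in the skip-checks commit above suggests the skip configuration is modeled as bit flags. A minimal sketch of that pattern with Python's standard `enum.IntFlag`; the class name, member names, and values are hypothetical and are not taken from ORTModule.

```python
from enum import IntFlag


class SkipCheck(IntFlag):
    """Hypothetical set of per-step checks that can be skipped."""
    NONE = 0
    DEVICE = 1
    BUILD_GRADIENT = 2
    EXECUTION_AGENT = 4


def should_run(check, skipped):
    # A check runs unless its bit is set in the skipped mask.
    return check not in skipped


# Flags combine with | and can be stored or passed around as a plain integer.
skipped = SkipCheck.DEVICE | SkipCheck.EXECUTION_AGENT
```

Storing the whole configuration as one integer keeps the per-iteration test to a single bitwise operation, which suits a fast path meant to avoid repeated checks.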