Commit graph

5046 commits

Author SHA1 Message Date
Vincent Wang
2f2aaf2cf6
Fix Memory Leak from DlpackToOrtValue (#8029) 2021-06-11 15:48:13 +08:00
Jeff Daily
d02de9c1bc
[ROCm] dockerfile updates (#7955)
* do not remove onnxruntime build directory in Dockerfile.rocm4.1.pytorch

* restore ONNX Runtime Training Examples to rocm 4.2 dockerfile
2021-06-10 23:50:19 -07:00
Scott McKay
00d48d9c30
Add enhanced partitioning utils for use by compiling EPs (#7991)
* Add enhanced partitioning utils and convert internal testing EP to use it. Will convert NNAPI EP once checked in.

Background:
Currently most EPs do their partitioning by iterating the model in the topologically sorted order. Whilst this works, it doesn’t ensure that all nodes which could possibly be added to the current group are, as the group is closed as when the first unsupported node is seen.

Changes:
- Ask EP for all nodes it supports first
- Do partitioning aware topological sort
  - Groups nodes and flips between processing supported and unsupported nodes to maximise inputs that will be available for each partition
- Create groups of nodes for the partition using the new order of nodes
- Create ComputeCapability for each group

There’s also an additional ability to specify operators to stop at. The processing will find all downstream nodes from ‘stop at’ operators and exclude them. If NonMaxSuppression is specified we can prevent the post-processing from SSD Mobilenet and MobileDet attempting to use NNAPI (so easy way to have parity with the TF Lite behavior). I don’t think there’s an automated way to determine what if any ‘stop at’ operators are required for a model, so this will need to be a configuration parameter for the EP and we’ll need to document recommended values for popular models.
2021-06-11 15:23:21 +10:00
Suffian Khan
35ca3c99d1
Fix ROCm wheels pipeline after changes to manylinux scripts (#8026)
* update

* try fix rocm pipeline

* avoid already isntalled error

* ignore python3.10 since build fails

* fix

* try setting user

* try again

* try again

* try again

* fix script

* disable inference docs generation

* try print device id

* fix name qual

* try again

* try again

* try again

* provider_options

* add device verify

* rty again

* try again

* try aggain

* print video/render gid

* try again

* run as root

* try again with uid, gid

* cleanup

* run as root

* temp fix

* add /bin/bash

Co-authored-by: Changming Sun <chasun@microsoft.com>
2021-06-10 21:01:28 -07:00
Scott McKay
20579595c8
Make logic in InsertCastTransformer around forcing a node to fp32 more precise. (#8018)
* Address #7981

Reworked the logic around forcing a node to run on fp32 even if it was supported on fp16.

The github issue had multiple factors. In ORT 1.8 we remove Identity nodes that produce graph outputs as they're not needed. That resulted in a Loop node no longer having output nodes (it produces graph outputs instead), which meant the check in IsSingleInputNodeFloat16Node returned true as there was no longer a downstream Identity node processing fp16 data.

We shouldn't only force a node to fp32 in very specific circumstances, and the changes hopefully check for those more precisely.
2021-06-11 13:54:40 +10:00
Nat Kershaw (MSFT)
0237225117
Add @file annotation to support doxygen generation of C API docs (#7458) 2021-06-10 16:10:32 -07:00
baijumeswani
b2ed4fb0a4
Merge orttraining and ortmodule single gpu ci pipelines (#8022)
* Merge orttraining and ortmodule single gpu ci pipelines

* Remove Debug from orttrainer build config
2021-06-10 15:58:23 -07:00
Rachel Guo
4d1b48632c
[CoreML EP] Add ArgMax op support and modify OpBuilder interface (#7924)
* add argmax skip cast op support [initial]

* modify some op support related logic

* fix typo

* add cast node into the partition

* update cast op builder and add int32 graph output

* modify op_builder interface

* exclude unused header file

* clean minor update

* minor change

* address comments

* address more comments

* add UT test for the case

* address more comments

* minor change

* refine

* update

* refine

* make UT test run on CPU

* minor formatting

* update

* switch UT test implementation

* minor refine

Co-authored-by: rachguo <rachguo@rachguos-Mac-mini.local>
2021-06-10 14:39:15 -07:00
Changming Sun
b313c4581c
Remove CC/CXX env settings from C API packaging pipeline (#8014) 2021-06-10 11:36:52 -07:00
Sherlock
2a74f5e85b
Save module output for backward if needed (#8010)
* Save module output for backward if needed
2021-06-10 09:56:35 -07:00
Changming Sun
c74265667e
Remove CUDA architectures 35 and 86 from GPU packages (#8004)
Because our python packages are oversize.
2021-06-09 17:47:34 -07:00
Ryan Hill
b03383f6d5
Add cuda provides files (#8002) 2021-06-09 15:31:24 -07:00
Guoyu Wang
f013b0c0eb
[NNAPI EP] Add support of Elu, merge in NNAPI updates for API level 30 (#8001)
* Add elu, integrate new Android NNAPI API changes

* add slice check

* update previous typo

* Move sdk level check to nnapi feature level check

* update readme
2021-06-09 12:39:02 -07:00
Changming Sun
aa45545af7
Update orttraining-linux-gpu-perf-test-ci-pipeline.yml (#8005)
I changed the OS version. It's is Ubuntu 20.04 + python 3.8 now. So I need to update the python command.
2021-06-09 10:22:14 -07:00
pengwa
cb5f411da3
Fix Python Packaging Pipeline && Build Clean Up (#7993)
* remove link to python

* revert orttraining-linux-ci build env change introduced by pr
https://github.com/microsoft/onnxruntime/pull/7993.

* fix builds

* fix builds

* clean up

* fix builds

* Fix unused params

* fix some comments.
2021-06-09 17:35:17 +08:00
Ye Wang
d433aa2459
Add transformers tool test to pipeline (#7959)
* checkin transformers pipeline

* add docker requirements

* only trigger linux cpu

* temp remove tf instalation due to numpy version conflicts

* test numpy>=1.7

* revert numpy and disable transformers

* add coloredlogs

* enable shape_infer_helper and install transformers when needed

* pip3?

* testtest

* enable more tets

* line too long

* remove pytorch1.4 test and added back some onnx  files

* add tests

* copy dir

* disable 2 teests

* trim lines

* add missing onnx

* fix type

* fix  version conflicts

* install psutil

* change file path

* mfix path

* remove cached files

* add back attention fusion test

* labeled the shape infer test as slow

* fix

* enable tf2onnx test and enable pytest

* refactor path

* fix typo

* add cwd
2021-06-08 19:43:59 -07:00
Vincent Wang
f0f3012666
Add SoftmaxCrossEntropyLossInternal to Support Dynamic ignore_index Input (#7899)
* add SoftmaxCrossEntropyLossInternal

* bugfix and ut

* fix ut

* fix ut

* support torch1.8.1

* function body for nll_loss_internal
2021-06-09 10:29:46 +08:00
baijumeswani
f38200e209
Override ORTModule named_modules to support extra arg (#7954) 2021-06-08 17:32:09 -07:00
Yulong Wang
1cc896c8ae
optimize js package folder structure (#7989)
* put generated .js files into dist/ folder

* format code
2021-06-08 16:49:06 -07:00
George Wu
47d8977741
add missing provider_options.h in packages (#7995)
* consolidate copy binary script for gpu/trt tarball package

* add provider_options.h

* add provider_options.h
2021-06-08 16:37:05 -07:00
Changming Sun
275796a165
Update googletest to latest commit to fix build issues with GCC11 (#7984) 2021-06-08 16:06:53 -07:00
Olivia Jain
861cd0fb24
Increase Python Mac Job Timeout (#7998)
* Increase mac wheel timeout from default 60 to 90
2021-06-08 15:57:21 -07:00
Yufeng Li
500f18badb
fix bug that bias can not be shared across Convs (#7982) 2021-06-08 14:01:06 -07:00
iperov
66170bfcef
Python with DmlExecutionProvider : choose device_id in SessionOptions (#7964) 2021-06-08 13:02:12 -07:00
Changming Sun
4ecbae43b2
Use GCC 10 in Linux CPU CI pipeline (#7985) 2021-06-08 11:53:29 -07:00
Bowen Bao
a776b57160
Add shape inference to custom symbolic functions (#7937)
**Description**: As title.

**Motivation and Context**
- PyTorch ONNX exporter heavily depends on ONNX shape inference to export accurate and efficient model. Custom symbolic function exports the op as contrib ops, thus exporter is unable to perform standard onnx shape inference. Models with dynamic shape inputs are affected.
2021-06-08 10:43:06 -07:00
Weixing Zhang
dce76c15e7
add dockfile for ROCm 4.2 (#7749)
* add dockfile for ROCm 4.2
* using rocm/pytorch:rocm4.2_ubuntu18.04_py3.6_pytorch_1.8.1
2021-06-08 08:02:27 -07:00
Du Li
b50e9d9d74
Adding webgl shape kernel (#7971) 2021-06-08 06:22:45 -07:00
Gao, Chun
0f01de3b0b
[js/web] Add wasm SIMD backend to onnxruntime-web (#7896)
* [js/web] Add wasm SIMD backend to onnxruntime-web

* Import SIMD wasm artifacts enabled by PR #7839

* Detect SIMD capability of web engine

* Use SIMD wasm backend in both single-thread and multi-thread cases

* update optimized SIMD loading from ort web

* code lint and format

* fix WasmFileName in CI

* replace deprecated wasm SIMD functions

* fix unittest for simd

* optimize CI pipeline to merge build matrix

* make clean build for each config

* fix simd wasm to enable it.

* update script/pull-prebuilt-wasm-artifacts.ts

Co-authored-by: Yulong Wang <yulongw@microsoft.com>
Co-authored-by: Lei Zhang <zhang.huanning@hotmail.com>
2021-06-07 23:24:27 -07:00
Vincent Wang
71c4f5ddb2
ATenOp Enhancement (#7725)
* config parser, default argument values

* ut

* win build

* maxpool2d

* fix win build

* fix build

* unfold atenop
2021-06-08 11:01:17 +08:00
Edward Chen
0696e2f0d4
[Objective-C API] Add script to assemble pod package files. (#7958)
Add a helper script for creating the Objective-C API pod package. It puts the necessary files and generates a podspec in a staging directory.
2021-06-07 19:16:39 -07:00
Tiago Koji Castro Shibata
fc331cbd5d
Accept success codes other than S_OK in RoInitialize (#7979)
* Accept success codes other than S_OK in RoInitialize

* Use multithreaded apartment in raw WinML tests
2021-06-07 18:06:31 -07:00
Changming Sun
eb354853d3
Update CMakeLists.txt for openvino EP (#7980) 2021-06-07 15:52:25 -07:00
RandySheriffH
1a5ee11dbd
Implement Sequence Ops GPU (#7863) 2021-06-07 15:30:26 -07:00
pengwa
9e4dc08483
training with custom autograd Functions (#7513)
* Register Torch Custom autograd.Function

* Add flag to supress pybind11 warning

* Avoid unnecessary include in cmake

* Add missing reference

* Add getter for registerred functions

* Format for making subsquent changes cleaner

* Fix interop feature build failure

* Forward pass, run PyOP on CPU EP

* clean up the code

* Fix build

* Define new ops

* refactor pyop - extract PyOpLibProxy class

* Hacks to run example

* implement the kernel compute func

* add back PyOP for comparision experiments

* debug info - thread id

* refine the kernels

* Polish code

(cherry picked from commit 4ed606f9a0)

* Fix a the Tensor address mismatch in C++ side

* PythonOpGrad compute

* add distributed test case

* refine test cases

* get dist.get_rank() in Autograd forward pass

* Add CUDA kernels

* Store float, int, and tuple of them as PythonOp's attributes

* Populate local changes

* Fix bugs

* PythonOp/PythonOpGrad CUDA kernels

* Support non-tensor inputs

* Single GPU FP16 Run Pass

(cherry picked from commit e539989e91e18ee997900292d3493b97d3eafa8a)

* Fix segement

* add basic test cases

* Save progress

* fix gradient builder for a Add op who have same inputs

* add test cases for auto grad fallback feature

* fix ref cnt issue. add thread id for debugging

* POC: remove interface class

* Remove interface classes

* Clean a bit

* Coarse-grained clean up after rebase master

* reset pyop and language_interop_ops to latest master

* Fix missing part during merge

* re-structure torch related language interop files

* Fix build

* Fix tests and build

* Fix build and basic unit tests

* Fix most of uts

* remove unnecessary import

* clean up and fix build when enabling language_interop_ops

* Fix single-GPU UTs

* Move runner register into ORT package

* Update dist UTs to new style

* Also fix distributed UTs and leaf gradient problem

* Static generation for constant args

* Move arg_positions_ to static field

* Rename some functions

* Move arg ceration into a function

* Clean output logic in PythonOp

* Move PythonOp's ctor

* Revise PythonOpGrad

* Fix "ORT only supports contiguous tensor for now" for inputs

* Fix evaulation mode error, add test & clean up

* clean up codes

* Fix issues introduced by recent master change (enabled symbolic shape infer)

* automatically register forward/backward function pointers && clean up

* Fix multi-output case

* Add a test back

* fix build and clean up

* RAII for function params PyObject

* Use new exporter

* Clean full name in new exporter

* Fix UTs

* Format a file

* Add "inplace" back

Remove a legacy comment

* Refine TorchProxy
1. Make TorchProxy a formal singleton class.
2. Remove unused Scope class.
3. Simplify the call to Forward and Backward. The two functions now
   automatically acquire and release GIL state, so user doesn't need
   any GIL-related calls.

* Format

* Add lock to avoid racing condition when registering Python objs

* Fix Python call param ref issues && Add RefcountTracker for debug build && Clean up

* clean up print

* Resolve part of comments && clean up

* Fix a potential bug

* track pyobject consistently

* move kernels to cpu provider as base class

* Refactor - 1. Extract PythonOpBase/PythonOpGradBase 2. Implement CPU kernels 3. Test coverage for CPU kernels

* Refine register code

* Add a missing macro

* Release python call result objects with PythonObjectPtr && Add UnRegisterContext && Track PyObject for Debugging && Clena up

* Fix random segfault issue - relasing a wrong ctx pointer for inplace cases

* put ref count in debug macro

* Move GIL out

* Refine tests

* Fix memory leak issue && forward output lifecycle issue:
1. Unregister the OrtValue PythonObject. Currently, the OrtValue shared same buffer with PythonOp/PythonOpGrad's output. So after those kernels outputs are released, the "leaked" OrtValue caused the shared buffer cannot be released.
2. According PyTorch forward+backward execution. The forward outputs (e.g. torch tensors) maintains the context/saved variables/dirty inputs, etc, which are used for backward execution, so its life should be after the backward runs. This change added such a depencencies between PythonOpGrad on PythonOp.

* Move dlpack->ortvalue into C++ to avoid temp object registration

* Fix the over released Py_False/Py_True && refine tests

* Clean up unused functions

* Always assume the first forward output is context so we don't need to test unused cases.

* Fix a memory leak

* move-copy unique_ptr & avoid C-style casting

* Use inplace attribute to determine if input tensors are copied

* Move DlpackCapsuleDestructor's to a common place

* Thread-safe TorchProxy

* Use OrtValue instead of OrtValue*

* Only keep checks for Debug build

* Wrap some long line per comment

* onnx_export_type --> kwargs

* Use requires_grads to create PythonOpGrad's inputs

* add missing files during master merge

* Fix build issue after merge

* Address two comments.
1. Internalize DlpackCapsuleDestructor
2. Change "(" to "]" for describing closed interval.

* Address some comments.
1. "override" -> "overwrite" to avoid using reserved keyword.
2. Call DLPack's helper to create OrtValue for avoiding repeated code.

* Address comments.
1. Pass std::mutex to registeration helpers so their callers don't
   have to lock the mutex expclicitly.
2. Rename "func_context_pool_mutex_" to "mutex_". This mutex is the global mutex for OrtTorchFunctionPool.

* Add bridging code to make cuda kernels work with merged master

* put debue macro check within RefCountTracker && use default logger for debug info && remove useless ortvalue_ptr interface && typos && revert unncessary blank line changes

* fix some comments

* Resolve more comments

* Capitalize a word

* use unique_ptr instead of ObjectPointer for PyObject management && add converntion

* Support symbolic shape

* Remove unused variable

* fix build

* Enable function registration for training only && rectify ToDlpack/FromDlpack merge with master.

* Don't add context for non-PythonOp opeartors (for example AtenOp)

* Fix build error

* Polish frontend part.
1. Avoid adding kwargs to ORTModule's ctor
2. Use onnx_export_type rather than kwargs for type safty
3. Fix some build bugs.

* Resolve simpler comments

* Resolve export related comments

* sync master && fix tests && fix non-training build error

* Fix build errors

* add target link lib

* windows build error

* Fix orttraining-linux-ci build

* disable autograd test && clean up

* fix linux orttraining ci build

* try fixing win build error

* Revise append calls in runner

* Enable custom function using a function

* Rename to avoid using reservied keyword

* Use list comprehension

* Set ORT random seed in tests

* Remove print code and fix ctx shape

* [] -> list()

* Move autograd.Function and nn.Module into corresponding functions

* Move test helpers

* Polish dist test a bit. Tried move helpers to helper file but it causes a deadlock.

* trying fix undefined reference

* Context is not managed by global pool

* Polish dist test

* Polish dist test

* Add enable_custom_autograd_function

* Remove enable_custom_autograd_function from ctors

* Add doc strings

* Shorter code

* Address comments

* Add one empty line

* revert a minor and not needed change

* Address comments

* Back to reference

* Fix windows builds

* Fix windows debug build fail to find "'python39_d.lib'"

* fix mac build error

* revert _to_contiguous change

* add debugging tag for orttraining-cpu-ci

* Fix the wrong PYTHON_LIBRARIES which is affected by PYTHON_LIBRARY given in build command

* add debugging info

* Fix the build in this case: PYTHON_LIBDIR: /opt/_internal/cpython-3.7.10/lib, PYTHON_EXECUTABLE: /opt/python/cp37-cp37m/bin/python3, PYTHON_MULTIARCH: x86_64-linux-gnu
PYTHON_LIBRARY_PATH python3.7m

* fix build error due to python lib not found

* Fixes
1. Release PyObject's
2. Not useing deepcopy because we assume autograd.Function's
   non-tensor inputs are static (constants) so there should
   be no side effect after calling any autograd.Function
   multiple times.

* Revert dtoc for decreasing refcnt

* add debugging log

* add debugging tag

* Fix a small leak

* Remove ONNX_FALLTHROUGH flag

* debug tag

* debug tag

* fix builds

* remove debug tag

* fix build

* fix builds

* fix build

* install python3 in centos, in case there is no libpython3.xm.so

* build python so for redhat

* add training cpu specific docker, build python so inside

* revert build-cpython change

* try fixing numpy include issue

* install_deps after re-installing cpython

* fix build && remove debug tag

* install openssl before cpython

* let's say: builds pass!

* add build flag for torch iterop, only enable it when training+Python is enabled

* skip ComputeBroadcastBackwardAxesDynamic for the shared inputs

* fix build

* add debug info for padgrad test

* Fix builds

* Split dlpack_converter into C++ and Python interfaces respecitively. Then different build use them as needed.

* clean up the changes

* fix addsubgradient builder

* Fix builds

* clean up

* clean up

* Address some comments.
1. Use pointer wraper to avoid calling Py_DECREF
2. Remove unregister_* functions
3. Allow repeated registration by skipping those with existing keys
4. Unregister context in PythonOpGrad

* Fix over-released Py_Boolean

Co-authored-by: Wei-Sheng Chin <wschin@outlook.com>
2021-06-07 13:01:21 -07:00
Changming Sun
4cb3c5e3e2
Update auto-generated csharp files (#7950) 2021-06-07 12:48:44 -07:00
Olivia Jain
e23529f313
Update Python Wheel File Path to fix python packaging pipeline (#7978) 2021-06-07 12:10:03 -07:00
Yulong Wang
bfa996b5fa
add emsdk to component detection ignore dir (#7932)
* add emsdk to component detection ignore dir

* only ignore ws 0.8.0
2021-06-07 10:20:07 -07:00
Ankur Verma
429df40f1d
Suppress warnings in GTest (Fixes Gcc11 build errors) (#7957)
Co-authored-by: Ankur Verma <ankurv@microsoft.com>
2021-06-07 10:07:36 -07:00
Yulong Wang
9b5f749176
[wasm] emsdk: allow to install emscripten only (#7961) 2021-06-07 09:45:02 -07:00
Scott McKay
f352d54743
Refactor Resize/Upsample implementation to reduce binary size. (#7650)
* Refactor Resize/Upsample to split type agnostic code out from templatized code.

Saving is ~17KB for a Windows Release build.
2021-06-07 15:48:06 +10:00
Guoyu Wang
fd23b8caad
Update mobilenetv2 quantization notebook (#7941)
* Update MobileNetV2 notebook for mobile

* Remove outputs of the notebook

* minor update

* Address CR comments

* update comments of the notebook
2021-06-04 18:15:10 -07:00
Yulong Wang
291453dac9
[wasm] fix test report generation (#7953) 2021-06-04 18:05:17 -07:00
Wenbing Li
09ab895563
Fix a resolving issue on the quantization node transforming. (#7952) 2021-06-04 16:07:52 -07:00
Changming Sun
b856e7ae3c
Update build.py: change default cmake generator for Windows to VS2019 (#7945)
VS 2019 is well tested. VS 2017 is not. 

We should make "Visual Studio 16 2019" as the default to not confuse people.
2021-06-04 15:56:53 -07:00
Zhang Lei
0975e7c9a7
Add and correct multichannel test for qlinear pool test. (#7864) 2021-06-04 12:03:41 -07:00
Changming Sun
5a7f65b831
Fix training e2e pipeline (#7942)
1. Fix training e2e pipeline. The failure was caused by my recent change #7632. The fix is adding "--cmake_extra_defines CMAKE_CUDA_ARCHITECTURES=70" to the build parameters because the machines are with V100 GPUs.
2. Simplify Nuphar pipeline. It doesn't need to install a separated ONNX version(1.5.0)
3. Fix a problem that run_dockerbuild.sh ignored OS version parameter. Now because it starts to take effect, I also set python version to the system default one(3.8 for ubuntu 20.04)
2021-06-04 09:37:09 -07:00
Vincent Wang
2b3f953701
Run PadGrad UT for CUDA Only (#7947) 2021-06-04 22:32:08 +08:00
Yulong Wang
650314c926
[wasm] bugfix for emscripten '--preload-file' under node.js (#7944) 2021-06-04 00:18:01 -07:00
Yulong Wang
0723d16436
[wasm] allows to specify MALLOC setting for wasm build (#7934) 2021-06-03 23:08:56 -07:00