Commit graph

291 commits

Author SHA1 Message Date
Suffian Khan
02c78a8aa8
test migration to rocm4.2 (#7800) 2021-05-24 11:48:44 -07:00
Changming Sun
ee29330cab
Delete unused file: Dockerfile.ubuntu_gpu (#7797) 2021-05-21 17:05:35 -07:00
liqunfu
f6eb0f76ae
to used cudnn7 to build onnxruntime-training wheel with Cuda 10.2 support (#7760) 2021-05-20 09:18:41 -07:00
Ryan Hill
c99aa3a3f3
Ryanunderhill/cuda shared (#7626)
* First iteration of making cuda a shared provider.
Separated out shared OpKernel change, so doing this to merge with that change.

* More cuda shared library refactoring

* More cuda shared library refactoring

* More build options tested, converted the training ops over.

* Fix merge breaks

* Fix submodules

* Fix submodules

* Fix submodules

* Fix python

* Fix compile errors

* Duplicate symbol fix

* Test fix for ROCM provider

* Another ROCM test workaround

* ROCM Build Test

* ROCM build fix

* ROCM

* ROCM

* ROCM

* ROCM

* ROCM

* ROCM test

* Reduce header dependencies

* Remove redundant namespace

* Test fix for linux

* Fix linux build

* Fix Eigen build error

* Fix unused parameter warning

* Test link error

* Another linker test

* Linker test

* Linker test

* Another test

* Another build test

* Fix linux link error

* Build test

* Fix control flow ops to use common base class with core code

* Remove extra qualifiers

* Fix template syntax for linux

* Fix cuda memory leak

* Fix pybind

* Test disabling cast

* Cleanup

* Restore cuda in test

* Remove more header dependencies

* Test not adding cuda provider to session

* Make GetProviderInfo_CUDA throw

* No-op cuda provider creation

* Fix some setup issues

* Fix memory cleanup on unload

* Diagnostics

* Don't unload library

* Add diagnostics

* Fix deleting registry at right time.

* Test disabling profiler

* Fix merge break

* Revert profiler change

* Move unloading of shared providers into Environment

* Free more global allocations before library unloads

* Add more diagnostics

* Move unloading back to the OrtEnv as there are multiple Environments created during a session.

Remove some library dependencies for tests.

* Fix more cmake files

* ERROR -> WARNING

* Fix python shutdown

* Test not using dml in pipeline

* Change python version and disable dml

* Update python version

* Test adding unload method for shared providers

* Disable DLL test

* Python test

* Revert "Python test"

This reverts commit c7ec2cfe98.

* Revert "Disable DLL test"

This reverts commit e901cb93aa.

* Revert "Test adding unload method for shared providers"

This reverts commit c427b78799.

* Point to RyanWinGPU

* Revert python version

* Fix id_to_allocator_map

* Another python exit test

* Remove extra debug messages
Try a more clean python shutdown through DllMain

* Revert DllMain idea, it didn't work

* Merge conflicts

* Fix merge with master issues.

* Comments

* Undo edit to file

* Cleanup + new training ops

* Revert yml changes

* Fix another merge error

* ROCM fix

* ROCM fix v2

* Put back Linux hack, it is necessary

* Stupid fixes

* Fix submodule out of sync

* ROCM fix 3

* ROCM 4

* Test java fix

* Fix typos

* Java test on my VM

* Fix build error

* Spotless fix

* Leave temp file around to load properly

* Fix cleanup on exit

* Fix break

* Java comments

* Remove LongformerAttentionBase workaround

* Spotless fix

* Switch yml back to regular build pool

* Revert "Switch yml back to regular build pool"

This reverts commit be35fc2a5a.

* Code review feedback

* Fix errors due to merge

* Spotless fix

* Fix minimal build

* Java fix for non cuda case

* Java fix for CPU build

* Fix Nuphar?

* Fix nuphar 2

* Fix formatting

* Revert "Remove LongformerAttentionBase workaround"

This reverts commit 648679b370.

* Training fix

* Another java fix

* Formatting

* Formatting

* For orttraining

* Last orttraining build fix...

* training fixes

* Fix test provider error

* Missing pass command

* Removed in wrong spot

* Python typo

* Python typos

* Python crash on exit, possibly due to unloading of libraries.

* Remove test_execution_provider from training build
Only enable python atexit on windows
Remove assert on provider library exit

* Still can't unload providers in python, alas.

* Disable Nvtx temporarily

* MPI Kernels for Training

* MPI Kernels part 2

* Patch through INcclService

* Oops, wrong CMakeLists

* Missing namespace

* Fix missing ()

* Move INcclService::GetInstance around to link nicer

* Missing }

* Missing MPI libraries for Cuda

* Add extra GetType functions used by MPI

* Missing Nccl library

* Remove LOGS statements as a test

* Add in a couple more missing GetType methods

* Update comments

* Missed a logging reference in mpi_context.h

* Convert aten_op to shared (due to marge with master)

* Test moving DistributedRunContext instance into shared provider layer
(with purpose error to verify it's being built properly)

* Test passed, now with fix

* Missing static

* Oops, scope DistributedRunContext to just NCCL

* Merge related issues and code review feedback.

* Merge error

* Bump to rel-1.9.1 (#7684)

* Formatting

* Code review feedback for Java build on non Windows

* Remove cupti library dependency from core library

* Test Java pipeline fix

* Linux build fix

* Revert "Linux build fix"

This reverts commit a73a811516.

* Revert "Remove cupti library dependency from core library"

This reverts commit 6a889ee8bf.

* Packaging pipeline fixes to copy cuda shared provider for tensorrt & standard packages

* Add cuda to Tensorrt nuget package

* onnxruntime_common still has a cuda header dependency

Co-authored-by: ashbhandare <ash.bhandare@gmail.com>
2021-05-20 07:53:47 -07:00
Changming Sun
3a68c389d9
Add version lock to manylinux build scripts (#7755) 2021-05-19 09:28:40 -07:00
Changming Sun
38d90b0f15
Cleanup install_deps.sh (#7734) 2021-05-17 19:27:47 -07:00
liqunfu
d604281a86
Liqun/training pkg to run tests (#7662) 2021-05-16 09:10:57 -07:00
liqunfu
3ead2f2f39
update pt lightning version (#7711)
Co-authored-by: liqun <liqun@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2021-05-15 21:46:16 -07:00
liqunfu
359fe1d197
Liqun/ort training version (#7620) 2021-05-14 09:54:19 -07:00
ashbhandare
56e993a434
Bump to rel-1.9.1 (#7684) 2021-05-13 18:41:28 -07:00
Hariharan Seshadri
4b691a5c0d
Add ability for memory arenas to "shrink" periodically (#7284) 2021-05-08 07:53:21 -07:00
Changming Sun
41e370c2b3
Update protobuf to 3.16 (#7616) 2021-05-07 14:09:23 -07:00
baijumeswani
f3a70f1aec
Ignore invalid input argument to install_os_deps.sh (#7566) 2021-05-05 14:33:31 -07:00
Changming Sun
a284eede64
Fix Linux CPU pipeline (#7584) 2021-05-05 13:26:10 -07:00
Scott McKay
594dde2647
Validate that the conversion script from the python package can be used to convert models. (#7517) 2021-05-04 16:25:04 +10:00
George Wu
faea7a222d
linux trt package pipeline (#7537) 2021-05-03 19:14:20 -07:00
baijumeswani
cab84d902e
Install and use conda on ortmodule CI pipelines (#7530)
* Install and use conda on ortmodule CI pipelines

* Update build script to install onnxruntime wheel before running unit tests

* Remove python 3.5 from install_python_deps

* Pinning deepspeed version to 0.3.15
2021-05-03 15:52:22 -07:00
liqunfu
196e6702ad
to support multiple cuda versions in published onnxruntime-training package (#7468)
to support multiple CUDA versions in published onnxruntime-training package
2021-04-27 17:15:33 -07:00
Suffian Khan
7a3c1787af
Add CI pipeline to publish Python training package targeting Rocm (#7417)
* first attempt rocm training wheel

* modifications needed to python packaging pipeline for Rocm 4.1

* changges to not conflict with cuda

missed stage1 changes

remove package push

add option r to getopt

try again without python install

try again without python install

try again without python install

split pipelines and add back push to remote storage

try on cuda gpu pool

try again

try again

try running without az subscription set

try again on original pipeline

change pool

passing AMD Rocm whl on AMD-GPU pool

split rocm pipeline from cuda pipeline

remove comments

* try adding Rocm tests as well

* try with tests in place

* fix trailing ws

* add training data

* try again as root for tests

* use python3

* typo

* try to map video, render group into container

* try again

* try again

* try to avoid yum error code

* make UID 1001

* try without yum downgrade

* define rocm_version=None

* remove CUDA related comments for Rocm Dockerfile

* Dont pin nightly torch torchvision torchtext versions as they expire (for now nightly is required for Rocm 4.1)

* missed requirements-rocm.txt from last commit

* fix whitespace
2021-04-23 17:22:31 -07:00
Ashwini Khade
75e054cd33
pick onnx release candidate (#7177)
* pick onnx release candidate

* fix typo

* filter batchnorm tests

* add implementation for reshape 14

* add identity op kernel for opset 14

* fix typo

* update onnx commit

* update commit to latest master

* add hashes for new kernel registrations and update 1

* TEST commit

* update onnx back to right commit

* Update onnx to latest in rel-1.9.0

* temp fix

* remove nonzeroshapesetter transformer

* pick rel branch latest commit

* fix build failures

* fix build failures

* fix build failures

* update the commit to latest in release branch

* add test filters for not impemented op14 ops in c# tests

* plus review comments
2021-04-22 23:57:09 -07:00
Changming Sun
6822ae95ec
Reduce the number of TensorRT tests needed to run (#7419) 2021-04-22 19:14:39 -07:00
Changming Sun
65b2b87f83
Update CI build docker images (#7386)
Update CI build docker images: delete ubuntu 16.04 support.
2021-04-21 13:18:34 -07:00
Changming Sun
b4cfa88bf7
Update protobuf to the latest version (#7396) 2021-04-21 10:30:06 -07:00
Guoyu Wang
fce67e2b9b
Create Android Package pipeline (#7295)
* Create Android Package pipeline

* adress CR comments

* Switch to jdk11
2021-04-12 17:56:25 -07:00
Edward Chen
0ebeaf529d
Check kernel def hashes (#7120)
Add unit test for verifying kernel def hashes.
Add way to add new types to kernel definition without changing hash.
2021-04-01 17:42:58 -07:00
sfatimar
52bcef4d4f
Openvino ep 2021.3 (#7180)
* Integrate openvino-ep-2021.3

* operators type

* changed the myriad as it is case sensitive

* logging information for openvino-ep-2021.3

* Unit test fix

* Resize operator added for myriad

* Fixed python tests for CPU and GPU

* data commit for loop tile and gatherelements failure

* adding checks for Where

* fixing gatherelements and loop tests

* disabling instance normalization test for now as there seems to be a
myriad bug, putting loop in ops supported only because all the tests
fail

* gather elements op test taking care of warning message

* condition needs to be an intializers

* Disabled python test for Myriad

* Disable compilation warning for MSVC windows compiler

* softmax_test, threedimaxis0 and 1 test give accuracy mismatch
tensoroptest disables test gives accuracy mismatch
gather test gives accuracy mismatch

* Updated with ov version 2021.3

* Updated with ov version 2021.3

* Updated README

* Disabling python tests for cpu

* Disabling python tests with accuracy mismatch on cpu

* Added fix for Linux CI Pipeline failure

-> Disabled tests that were throwing segfault

Co-authored-by: sfatimar <sahar.fatima@intel/com>
Co-authored-by: MaajidKhan <n.maajidkhan@gmail.com>
Co-authored-by: Aravind <aravindx.gunda@intel.com>
2021-04-01 11:28:54 -07:00
baijumeswani
249a2c14ef
Pin version of pytorch to 1.8.1 for ORTModule CI pipeline (#7167)
* Pin version of pytorch to 1.8.1 for ORTModule CI pipeline
* Use pytorch-lightning stable version 1.2.5
* Revert to cuda 10.1
2021-04-01 09:37:47 -07:00
liqunfu
e545604499
. (#7165) 2021-03-30 13:58:30 -07:00
Ashwini Khade
b22e60bd44
pull onnx latest commit (#7102)
* update onnx commit

* fix test scripts to remove deprecated call

* update filters

* add registration for relu and cumsum ver 14

* add promote trilu to onnx domain

* update onnx-tensorrt submodule

* update flag

* update flag

* update dependencies

* fix android ci failure
2021-03-29 11:00:38 -07:00
harshithapv
540eac253e
Deepspeed pipeline parallel and fairscale sharded optimizer test samples with ORTModule (#7078)
* adding samples for Deepspeed pipeline parallel and fairscale sharded optimizer with ortmodule

* fixed typo in args

* addressed Thiago's comments

* Update orttraining/orttraining/test/python/orttraining_test_ortmodule_deepspeed_pipeline_parallel.py

Co-authored-by: Thiago Crepaldi <thiago.crepaldi@microsoft.com>
2021-03-24 09:43:05 -07:00
baijumeswani
a7a2a16edd
Pass arguments to azure_scale_set_vm_mount_test_data from perf test ci pipeline (#7094) 2021-03-22 21:48:32 -07:00
Thiago Crepaldi
867804bea1
Add auto doc gen for ORTModule API during CI build (#7046)
In addition to ORTModule auto documentation during packaging, this PR also update golden numbers to fix CI
2021-03-22 10:20:33 -07:00
Changming Sun
701e73b5b8
Move Linux minimal build CI pipeline to the new Linux machine pool (#7050) 2021-03-18 12:09:12 -07:00
Thiago Crepaldi
335edaa2c4
Merge pull request #6973 from microsoft/thiagofc/merge-ortmodule-into-master
Introduce ORTModule training API to ONNX Runtime
2021-03-17 10:30:06 -07:00
Changming Sun
ed2d441a2e
Update ORT server build pipeline (#7030)
1. Migrated it to Ed's new docker build script
2. Use python 3.6 instead, because it is the default one in ubuntu 18.04
3. Move the "pip install" command to the docker image build stage(instead of when running the image)
2021-03-16 18:02:09 -07:00
Changming Sun
4161758058
Remove openmp related packaging pipeline (#6991)
1. Remove openmp related packaging pipelines and build jobs.
2. Set continueOnError to true for the TSAUpload tasks. Their service is unstable recently.
3. Update Ubuntu 16 docker images to Ubuntu 18, in prepare for getting C++17 support
4. Cherry-pick the changes in 1.7.1 to the master: updating CFLAGS/CXXFLAGS to strip out debug symbols
2021-03-12 10:02:59 -08:00
baijumeswani
79f832c682
Separate requirements.txt file for ORTModule pipelines (#6879)
* Move all ORTModule dependency installations to ortmodule subfolder
2021-03-05 14:12:11 -08:00
Baiju Meswani
d5667554e6 Merge branch 'master' of github.com:microsoft/onnxruntime into bmeswani/merge_master_onto_ortmodule 2021-03-03 20:37:29 -08:00
Guoyu Wang
36a44d55ed
Only report Android Baseline binary size for master branch (#6844)
* Only report binary size from master

* update script

* Correct the typo
2021-03-01 15:57:18 -08:00
Sherlock
12edf22f11
Merge pull request #6838 from microsoft/mzs/ortmodule-api-sync-from-master-210226
Sync from master
2021-02-27 12:32:36 -08:00
Thiago Crepaldi
f71d93ea2b
Enable PyTorch Lightning basic test on CI (#6809) 2021-02-27 09:35:42 -08:00
M. Zeeshan Siddiqui
ca48310d6d Merge branch 'master' of https://github.com/microsoft/onnxruntime into mzs/ortmodule-api-sync-from-master-210226 2021-02-27 04:25:23 +00:00
Maajid khan
7465673e33
[OpenVINO-EP] Find package changes (#6801)
* Find package changes to cmake

* Removing unwanted code from cmake

Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>

Co-authored-by: suryasidd <surya.siddharth.pemmaraju@intel.com>
2021-02-25 05:12:57 -08:00
Edward Chen
5db0c9c648
Enable CI to cover globally allowed types (#6778)
Add test to CI build to cover type reduction with globally allowed types.
2021-02-23 10:24:12 -08:00
stevenlix
53eb948f4c
Upgrade TensorRT to v7.2.2 (#6452)
* upgrade to TensorRT 7.2.2

* extend GPU tensorrt CI timeout to 150 minutes

* update docker image name

* disable user interaction to avoid tensorrt container stuck when install tzdata

* upgrade to libssl1.1 for ubuntu20.04

* remove libicu60 from ubuntu20.04

* add libicu66 for ubuntu20.04

* debug

* llvm

* llvm

* disable ReverseSequenceTest.InvalidInput

* disable ReverseSequenceTest.InvalidInput

* fix issues

* fix issues

* Update linux-gpu-tensorrt-ci-pipeline.yml

* disable warning 4458 for TensorRT parser

* update onnx-tensorrt submodule

* disable warnings for TensorRT parser

* update onnx-tensorrt submodule to include latest bug fixes

* update setup_env_trt

* update pool for win trt ci pipeline'

Co-authored-by: George Wu <jywu@microsoft.com>
2021-02-18 04:30:47 -08:00
M. Zeeshan Siddiqui
40dda452cf Merge branch 'master' of https://github.com/microsoft/onnxruntime into mzs/sync-from-master 2021-02-18 03:03:01 +00:00
liqunfu
dd8ef4409a
Liqun/migrate perf test (#6733)
move ort training perf tests to azure devops
2021-02-17 17:48:47 -08:00
Thiago Crepaldi
3184c47ad1 Merge branch 'master' into thiagofc/merge-from-master 2021-02-17 11:49:52 -08:00
baijumeswani
01dfa8e125
Support non tuple return values from torch.nn.module (#6660)
* Support dictionary, namedtuples and huffingface ModelOutput type for model return values
2021-02-16 20:48:32 -08:00
Scott McKay
02c7873b0e
Update ORT model conversion script to support custom ops (#6701)
* Add support for custom ops library to the ORT model conversion script
Simplify model conversion now that we read ops from the ORT format model.
Enable custom ops in the python bindings if custom ops are turned on in a minimal build.
* Add test of model conversion involving custom ops.
2021-02-17 12:52:39 +10:00