Commit graph

59 commits

Author SHA1 Message Date
msftlincoln
9cf6912bba
Fix ORT Eager Mode to work with Pytorch 1.12 (#12323) 2022-07-27 16:24:46 -04:00
pengwa
2b2367efbf
Fix orttraining-linux-gpu-ci-pipeline (fairscale dependency) (#12320)
authored by: @pengwa
2022-07-26 15:11:04 -07:00
LironKesem
9647a3be40
Add tests for all unary aten ops supported in eager mode (#12087)
* Add tests for all unary aten ops supported in eager mode

* fixing the PR draft

* fixing the merge

* changing eval to be at compile time

* adding requirements for eager

* 1. adding function to {ops}_out
  2. cleaning the code and adding comments

* editing the code according to code review

Co-authored-by: root <root@AHA-LIRONKESE-1>
2022-07-12 08:53:19 -04:00
PeixuanZuo
1c39d22f4e
[ADD] Rocm5.2 for Rocm python packaging pipeline (#12129)
[ADD] rocm5.2
2022-07-11 11:10:45 +08:00
Wil Brady
fdf12a5c35
Fix windows eager build break by pinning to torch version 1.11.0 (#12033)
Fix windows and linux eager build to torch 1.11.0.
2022-06-30 07:01:13 -04:00
PeixuanZuo
c556f5f22f
Add AMD python package ROCm5.1.1+torch1.11 (#11516)
* [FIX] fix name error

* [ADD] add rocm5.1.1 python package

* [ADD] torch1.10.0 rocm requirements

* [UPDATE] update docker Repository name
2022-05-16 08:14:11 +08:00
Justin Chu
a1f9847b23
[Fix] Add the extra param to match gelu in PyTorch in the contrib symbolic function (#11318)
Description:

Add the extra param to match gelu in PyTorch in the contrib symbolic function

Motivation and Context

Why is this change required? What problem does it solve?
The symbolic function in /onnxruntime/python/tools/pytorch_export_contrib_ops.py is missing the recently added parameter "approximate". We add this parameter and use the exporter-defined gelu if approximate is "tanh".
2022-05-04 10:36:38 -07:00
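To illustrate what the "approximate" parameter mentioned in #11318 selects, here is a minimal sketch of the two GELU variants in plain Python. It is not the symbolic function from pytorch_export_contrib_ops.py; the function name gelu and its scalar signature are assumptions made for this example, and only the two formulas (exact erf form and tanh approximation) are taken from the standard GELU definition.

```python
import math


def gelu(x: float, approximate: str = "none") -> float:
    """Scalar GELU, illustrative only (hypothetical helper, not ORT code)."""
    if approximate == "none":
        # Exact form: x * Phi(x), with Phi the standard normal CDF.
        return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))
    if approximate == "tanh":
        # Tanh approximation of the same function.
        inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
        return 0.5 * x * (1.0 + math.tanh(inner))
    raise ValueError(f"unsupported approximate mode: {approximate!r}")


exact = gelu(1.0)
approx = gelu(1.0, approximate="tanh")
print(exact, approx, abs(exact - approx))
```

The two variants agree closely but not exactly, which is presumably why an exporter has to pick a different implementation depending on the value of the parameter.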
ytaous
eec5187801
Remove Rocm 4.2 from CI Build (#11130)
* remove rocm42 CI

* update torch to v1.11.0

Co-authored-by: Ethan Tao <ettao@microsoft.com@orttrainingdev7.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
2022-04-07 11:42:09 -07:00
Baiju Meswani
f9940f17b1
Remove extra-index-url to avoid nuget security analysis vulnerability (#11082) 2022-04-01 18:30:55 -07:00
Baiju Meswani
249c4dec7f
Update orttraining release pipelines to use torch 1.11.0 (#11018)
* Update orttraining release pipelines to use torch 1.11.0

* Change requirements_torch...txt to requirements.txt

* Update cuda cmake architectures and clean up old files
2022-03-31 21:51:06 -07:00
dependabot[bot]
79e4ed8064 Bump pytorch-lightning
Bumps [pytorch-lightning](https://github.com/PyTorchLightning/pytorch-lightning) from 1.5.10 to 1.6.0.
- [Release notes](https://github.com/PyTorchLightning/pytorch-lightning/releases)
- [Changelog](https://github.com/PyTorchLightning/pytorch-lightning/blob/master/CHANGELOG.md)
- [Commits](https://github.com/PyTorchLightning/pytorch-lightning/compare/1.5.10...1.6.0)

---
updated-dependencies:
- dependency-name: pytorch-lightning
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
2022-03-31 16:51:24 -07:00
raviskolli
480c793125
Update training packages to Pytorch 1.11.0 (#10851)
* Update ortmodule training packages to Pytorch 1.11.0

Co-authored-by: Harshitha Venkata <havenka@microsoft.com>
Co-authored-by: Baiju Meswani <bmeswani@microsoft.com>
2022-03-22 16:45:51 -07:00
Changming Sun
6260733533
Fix eager mode pipeline (#10802)
It was still using python 3.6
2022-03-08 09:26:20 -08:00
dependabot[bot]
e3c85d4262 Bump numpy
Bumps [numpy](https://github.com/numpy/numpy) from 1.19.5 to 1.21.0.
- [Release notes](https://github.com/numpy/numpy/releases)
- [Changelog](https://github.com/numpy/numpy/blob/main/doc/HOWTO_RELEASE.rst.txt)
- [Commits](https://github.com/numpy/numpy/compare/v1.19.5...v1.21.0)

---
updated-dependencies:
- dependency-name: numpy
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
2022-03-04 09:51:32 -08:00
dependabot[bot]
b780a3784e Bump numpy in /tools/ci_build/github/linux/docker/scripts/training
Bumps [numpy](https://github.com/numpy/numpy) from 1.19.5 to 1.21.0.
- [Release notes](https://github.com/numpy/numpy/releases)
- [Changelog](https://github.com/numpy/numpy/blob/main/doc/HOWTO_RELEASE.rst.txt)
- [Commits](https://github.com/numpy/numpy/compare/v1.19.5...v1.21.0)

---
updated-dependencies:
- dependency-name: numpy
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
2022-03-04 09:38:38 -08:00
dependabot[bot]
0b0e8ccf92 Bump numpy
Bumps [numpy](https://github.com/numpy/numpy) from 1.19.5 to 1.21.0.
- [Release notes](https://github.com/numpy/numpy/releases)
- [Changelog](https://github.com/numpy/numpy/blob/main/doc/HOWTO_RELEASE.rst.txt)
- [Commits](https://github.com/numpy/numpy/compare/v1.19.5...v1.21.0)

---
updated-dependencies:
- dependency-name: numpy
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
2022-03-04 09:34:58 -08:00
Changming Sun
feae842a7c
Update pytorch-lightning (#10421) 2022-01-27 21:15:00 -08:00
Thiago Crepaldi
6a7d3deb22
Update pytorch-lightning (#10276) 2022-01-14 16:49:10 -05:00
Abhishek Jindal
d5742f3a43
moving from torch nightly build to stable build (#10150)
* moving from torch nightly build to stable build

* using torch cpu version

* using torch cpu version from link
2021-12-29 19:35:10 -08:00
Suffian Khan
7e55a942cd
Add torch 1.10 requirements for rocm (#10028) 2021-12-13 20:39:58 -08:00
Xavier Dupré
42c176b60c
Update default opset to 14 in ORTModule (#9743)
* update to torch 1.10
* update torchvision version
* update torchtext version
* remove deprecated option enable_onnx_checker
* add unit test to test gradient of GatherElements
* add ORTMODULE_ONNX_OPSET_VERSION in a docker file
2021-12-09 12:45:35 +01:00
Tang, Cheng
8db49e3d0f
add ortmodule and eager mode test (#9888)
* add ortmodule and eager mode test

* add ortmodule dependency

* fix eager pipeline

* skip the ortmodule test for windows due to win ci issue

* remove useless win ci change

* add torch

Co-authored-by: Abhishek Jindal <abjindal@microsoft.com>
2021-12-02 19:49:18 -08:00
raviskolli
9f4e8cf6a0
Update training pipelines to pytorch 1.10 (#9709)
* Update training pipelines to pytorch 1.10

* Fixed a typo in cuda version.

* Downgraded gcc to 8 for cuda 10.2
2021-11-15 11:21:55 -08:00
baijumeswani
1422a9ba6b
Remove previous temporary fixes and address TODOs (#9020) 2021-09-13 10:10:07 -07:00
liqun Fu
a7f5bd226b
retarget torch181 to torch182 (#8947)
Co-authored-by: liqun <liqun@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2021-09-03 09:44:42 -07:00
Abhishek Jindal
868c8af9ac
Abjindal/eager mode pipeline (#8870)
* Adding pipeline file for eager mode

* adding the build eager mode flag

* adding torch wheel files for installation

* Changing pytorch version for change in wheel files

* updating requirements file path

* Removing Java and NodeJS from the build

* removing import torch for testing build of eager mode

* changing the build command

* import torch

* building eager mode separately

* removing Java tests

* python path issues

* changing python path location

* changing the build path file loc

* installing torch before build

* setting environment for building eager mode

* Copying the build file and getting rid of flags

* changing python path

* adding missing packages

* moving build eager mode code

* changing python path to python3

* adding amd_hipify

* adding logger file

* install torch before build

* change requirements file location

* install torch before build eager

* modifying eager mode build

* modifying build location

* adding new docker image

* handling gradle move issue

* Typo fix

* changing deps file

* adding java and nodejs

* changing repo name for docker image

* removing pybind

* building only eager mode

* changing the image name

* removing install wheel package

* build complete onnxruntime with eager mode

* building wheel

* enabling pybind

* adding build eager mode flag in unit tests

* removing build java nodejs

* adding build command

* removing java tests

* moving Debug tests before Release

* building Debug only case

* changing debug test code

* running the build eager mode with tests

* adding build dir

* adding build dir path

* changing build dir path

* changing build command for eager mode

* building eager mode and running tests simultaneously

* adding more flags to the pipeline

* changing flag

* adding Debug and Release

* changing torch to nightly build

* changing torch version for nightly build

* changing torch version

* move to Ubuntu image

* adding pool

* adding dockerfile for eager mode

* adding python deps file for eager

* modifying python deps file for eager

* changing deps file

* changing deps file statements

* changing python path

* REMOVING ECHO line

* going to original docker file

* changing docker file

* changing to eager requirements file

* changing python deps file

* changing paths

* changing cmake path

* changing build script

* changing python installation

* running debug mode only

* changing pipeline file

* test name

* test name

* test name2

* changing requirements file

* final flags for eager mode

* previous pipeline

* moving to ubuntu image and including some deps

* adding cmake path

* returning to manylinux image

* removing unnecessary files for pipeline
2021-08-30 18:24:39 -07:00
liqun Fu
2beb873c6b
move training CI agent pools to 1ES hosted (#8775) 2021-08-18 18:36:19 -07:00
liqun Fu
bec24ca4c1
create packaging pipeline to support cuda11.4 (#8663) 2021-08-11 17:44:57 -07:00
Thiago Crepaldi
9073c094d4 Update torch lightning and re-enable test 2021-07-22 14:18:07 -07:00
baijumeswani
090bae21ab
Pinning pillow version to 8.2.0 to circumvent regression introduced by 8.3.0 (#8303) 2021-07-06 13:02:39 -07:00
Suffian Khan
008c5f7640
Use single builder image across Python versions for ROCm wheels (#8302)
* first attempt to share docker image across python and torch versions

* set dependency between jobs

* fix yaml grammar

* remove python version from first stage

* clean deepspeed directory

* split into two images according torch version

* fix yaml syntax

* invalidate cache

* remove DS to prevent torch 1.9.0 upgrade
2021-07-06 11:56:00 -07:00
baijumeswani
2bda2a62fd
Pin version of Pillow to 8.2.0 to circumvent incompatibility with numpy (#8278) 2021-07-02 09:05:49 -07:00
Thiago Crepaldi
83be3759bc
Add post-install command to build PyTorch CPP extensions from within onnxruntime package (#8027)
ORTModule requires two PyTorch CPP extensions that are currently JIT-compiled. The runtime compilation can cause issues in environments that lack some build requirements or that run multiple instances of ORTModule in parallel.

This PR adds a custom command that compiles these extensions; it must be run manually before ORTModule is used for the first time. If users try to use ORTModule before the extensions are compiled, an error with instructions is raised.

PyTorch CPP Extensions for ORTModule can be compiled by running:
python -m onnxruntime.training.ortmodule.torch_cpp_extensions.install

A full build environment is needed for this.
2021-06-28 18:11:58 -07:00
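The post-install step described in #8027 could also be scripted, for example in an environment setup script. The sketch below only wraps the command quoted in the commit message; the helper names build_install_command and install_extensions are hypothetical, and the snippet assumes that the onnxruntime training package and a full build environment are already present when install_extensions is actually called.

```python
import subprocess
import sys
from typing import List

# Module path quoted in the commit message for #8027.
INSTALL_MODULE = "onnxruntime.training.ortmodule.torch_cpp_extensions.install"


def build_install_command(python: str = sys.executable) -> List[str]:
    """Return the argv list for the extension build step (hypothetical helper)."""
    return [python, "-m", INSTALL_MODULE]


def install_extensions() -> None:
    """Run the build step; raises CalledProcessError if the build fails."""
    subprocess.run(build_install_command(), check=True)


# Only print the command here; call install_extensions() to actually build.
print(" ".join(build_install_command("python")))
```

Using the interpreter from sys.executable by default keeps the build tied to the same Python environment that will later import ORTModule.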
liqunfu
9366114028
make pipelines to support torch1.8.1 and torch1.9.0 (#8084) 2021-06-25 14:55:49 -07:00
baijumeswani
7701c8703e
Add module attribute to ORTModule to support HuggingFace Trainer save_model (#8088) 2021-06-18 13:13:45 -07:00
Thiago Crepaldi
c45ac166d3
Add graphviz into Dockerfile images for Python API documentation (#7819) 2021-06-02 16:12:54 -07:00
liqunfu
3ead2f2f39
update pt lightning version (#7711)
Co-authored-by: liqun <liqun@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2021-05-15 21:46:16 -07:00
liqunfu
359fe1d197
Liqun/ort training version (#7620) 2021-05-14 09:54:19 -07:00
baijumeswani
cab84d902e
Install and use conda on ortmodule CI pipelines (#7530)
* Install and use conda on ortmodule CI pipelines

* Update build script to install onnxruntime wheel before running unit tests

* Remove python 3.5 from install_python_deps

* Pinning deepspeed version to 0.3.15
2021-05-03 15:52:22 -07:00
liqunfu
196e6702ad
to support multiple cuda versions in published onnxruntime-training package (#7468)
to support multiple CUDA versions in published onnxruntime-training package
2021-04-27 17:15:33 -07:00
Suffian Khan
7a3c1787af
Add CI pipeline to publish Python training package targeting Rocm (#7417)
* first attempt rocm training wheel

* modifications needed to python packaging pipeline for Rocm 4.1

* changes to not conflict with cuda

missed stage1 changes

remove package push

add option r to getopt

try again without python install

try again without python install

try again without python install

split pipelines and add back push to remote storage

try on cuda gpu pool

try again

try again

try running without az subscription set

try again on original pipeline

change pool

passing AMD Rocm whl on AMD-GPU pool

split rocm pipeline from cuda pipeline

remove comments

* try adding Rocm tests as well

* try with tests in place

* fix trailing ws

* add training data

* try again as root for tests

* use python3

* typo

* try to map video, render group into container

* try again

* try again

* try to avoid yum error code

* make UID 1001

* try without yum downgrade

* define rocm_version=None

* remove CUDA related comments for Rocm Dockerfile

* Don't pin nightly torch, torchvision, torchtext versions as they expire (for now nightly is required for Rocm 4.1)

* missed requirements-rocm.txt from last commit

* fix whitespace
2021-04-23 17:22:31 -07:00
baijumeswani
249a2c14ef
Pin version of pytorch to 1.8.1 for ORTModule CI pipeline (#7167)
* Pin version of pytorch to 1.8.1 for ORTModule CI pipeline
* Use pytorch-lightning stable version 1.2.5
* Revert to cuda 10.1
2021-04-01 09:37:47 -07:00
harshithapv
540eac253e
Deepspeed pipeline parallel and fairscale sharded optimizer test samples with ORTModule (#7078)
* adding samples for Deepspeed pipeline parallel and fairscale sharded optimizer with ortmodule

* fixed typo in args

* addressed Thiago's comments

* Update orttraining/orttraining/test/python/orttraining_test_ortmodule_deepspeed_pipeline_parallel.py

Co-authored-by: Thiago Crepaldi <thiago.crepaldi@microsoft.com>
2021-03-24 09:43:05 -07:00
baijumeswani
a7a2a16edd
Pass arguments to azure_scale_set_vm_mount_test_data from perf test ci pipeline (#7094) 2021-03-22 21:48:32 -07:00
baijumeswani
79f832c682
Separate requirements.txt file for ORTModule pipelines (#6879)
* Move all ORTModule dependency installations to ortmodule subfolder
2021-03-05 14:12:11 -08:00
Thiago Crepaldi
f71d93ea2b
Enable PyTorch Lightning basic test on CI (#6809) 2021-02-27 09:35:42 -08:00
M. Zeeshan Siddiqui
40dda452cf Merge branch 'master' of https://github.com/microsoft/onnxruntime into mzs/sync-from-master 2021-02-18 03:03:01 +00:00
liqunfu
dd8ef4409a
Liqun/migrate perf test (#6733)
move ort training perf tests to azure devops
2021-02-17 17:48:47 -08:00
baijumeswani
01dfa8e125
Support non tuple return values from torch.nn.module (#6660)
* Support dictionary, namedtuple and HuggingFace ModelOutput types for model return values
2021-02-16 20:48:32 -08:00
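One common way to support non-tuple return values such as those named in #6660 is to flatten the nested output into a flat list of leaves plus a schema, and to rebuild the original structure afterwards. The sketch below shows that general technique in plain Python; it is not ORTModule's actual implementation, the names flatten and unflatten are hypothetical, and it covers only dicts, namedtuples, lists and tuples (a HuggingFace ModelOutput would need its own handling).

```python
from collections import namedtuple
from typing import Any, List


def flatten(output: Any, leaves: List[Any]) -> Any:
    """Append leaf values to `leaves` and return a schema of the structure."""
    if isinstance(output, dict):
        return ("dict", type(output),
                [(k, flatten(v, leaves)) for k, v in output.items()])
    if isinstance(output, tuple) and hasattr(output, "_fields"):
        return ("namedtuple", type(output), [flatten(v, leaves) for v in output])
    if isinstance(output, (list, tuple)):
        return ("seq", type(output), [flatten(v, leaves) for v in output])
    leaves.append(output)
    return ("leaf", len(leaves) - 1)


def unflatten(schema: Any, leaves: List[Any]) -> Any:
    """Rebuild the original structure from a schema and a flat leaf list."""
    kind = schema[0]
    if kind == "leaf":
        return leaves[schema[1]]
    _, cls, children = schema
    if kind == "dict":
        return cls((k, unflatten(c, leaves)) for k, c in children)
    if kind == "namedtuple":
        return cls(*(unflatten(c, leaves) for c in children))
    return cls(unflatten(c, leaves) for c in children)


# Usage: numbers stand in for tensors so the example needs no torch install.
Out = namedtuple("Out", ["loss", "logits"])
nested = {"a": Out(1.0, [2, 3]), "b": (4,)}
flat: List[Any] = []
schema = flatten(nested, flat)
restored = unflatten(schema, flat)
print(flat, restored == nested)
```

The flat list is what a graph-based backend could consume and produce, while the schema lets the wrapper hand the caller the same container types the original module returned.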
baijumeswani
62ac164279
Cache datasets on CI machines (#6525) 2021-02-02 21:11:35 -08:00