Commit graph

392 commits

Author SHA1 Message Date
Edward Chen
d761571afc
Deprecate Python global configuration functions [Part 2] (#6171)
Update Python API to allow more flexibility for setting providers and provider options.

The providers argument (InferenceSession/TrainingSession constructors, InferenceSession.set_providers()) now also accepts a tuple of (name, options dict).
Fix get_available_providers() API (and the corresponding function in the C API) to return the providers in default priority order. Now it can be used as a starting point for the providers argument and maintain the default priority order.
Convert some usages of the deprecated global configuration functions to use EP-specific options instead.

Update some EP-specific option parsing to fail on unknown options.

Other clean up.
2021-01-07 10:10:55 -08:00
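To illustrate the providers argument described in the commit above, here is a minimal sketch of how a mixed list of provider names and (name, options dict) tuples could be split into names and options. The helper `normalize_providers` is hypothetical and is not part of onnxruntime; the `device_id` option is used only as an example key.

```python
from typing import Dict, List, Sequence, Tuple, Union

# An entry is either a bare provider name or a (name, options dict) tuple.
ProviderEntry = Union[str, Tuple[str, Dict[str, str]]]


def normalize_providers(
    providers: Sequence[ProviderEntry],
) -> Tuple[List[str], List[Dict[str, str]]]:
    """Split a mixed providers list into parallel name and option lists.

    Hypothetical helper for illustration only; not onnxruntime's code.
    """
    names: List[str] = []
    options: List[Dict[str, str]] = []
    for entry in providers:
        if isinstance(entry, str):
            # A bare name means "use this provider with default options".
            names.append(entry)
            options.append({})
        elif (
            isinstance(entry, tuple)
            and len(entry) == 2
            and isinstance(entry[0], str)
            and isinstance(entry[1], dict)
        ):
            names.append(entry[0])
            options.append(dict(entry[1]))  # copy so callers keep ownership
        else:
            raise ValueError(
                f"expected a name or a (name, options dict) tuple, got {entry!r}"
            )
    return names, options


# List order expresses priority: earlier providers are preferred.
names, options = normalize_providers(
    [("CUDAExecutionProvider", {"device_id": "0"}), "CPUExecutionProvider"]
)
```

Because the list order carries the priority, a list returned in default priority order can be filtered or annotated with options without reordering it.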
Tang, Cheng
431604ef89
add bfloat16 to gathergrad type constraints (#6267)
Co-authored-by: Cheng Tang <chenta@microsoft.com>
2021-01-06 15:04:14 -08:00
pengwa
eea3806db1
model parallel refinement (#6244)
* Megatron Transformation as a separate step

* remove useless header

* clang formatting

* Re-Structure megatron transformer for subsequent changes

* fix comments
2021-01-06 10:30:22 +08:00
ashbhandare
493bf931c5
Add the Concat Slice Elimination transform, fix constant_folding transform (#5457)
* Add concat slice transform + test

* Cosmetic improvements in concat slice transform

* Remove unrelated file, fix comment, fix constant folding bug

* Add test onnx graph

* fix windows build

* Review comments

* review comment

Co-authored-by: Aishwarya <aibhanda@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2021-01-04 16:18:33 -08:00
baijumeswani
93bf7c4d52
Documentation for distributed CI tests pipeline (#6140) 2021-01-04 10:09:39 -08:00
Suffian Khan
46e0e4e69f
Tune BiasGeluGradDx kernel in approximation mode to avoid tanh(...) on Rocm (#6239)
* bias gelu grad use exp(...) instead

* update cuda to rocm

* missing semicolon

* comment

* remove dockerfile

* missing factor of two
2021-01-02 08:54:16 -08:00
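The commit above does not show the kernel's formula, so the following is only a sketch of the standard identity that lets tanh be evaluated with a single exp call: tanh(x) = (1 - e) / (1 + e), where e = exp(-2|x|), with the sign of x restored afterwards. The function name `tanh_via_exp` is hypothetical. Note the factor of two in the exponent, which is easy to drop.

```python
import math


def tanh_via_exp(x: float) -> float:
    """Compute tanh(x) with one exp call instead of calling tanh directly.

    Uses tanh(|x|) = (1 - e) / (1 + e) with e = exp(-2|x|).
    """
    # e is always in [0, 1], so this form cannot overflow for large |x|.
    e = math.exp(-2.0 * abs(x))
    magnitude = (1.0 - e) / (1.0 + e)
    # tanh is an odd function, so copy the sign of x onto the result.
    return math.copysign(magnitude, x)


# Compare against the library implementation on a few sample points.
max_error = max(
    abs(tanh_via_exp(x) - math.tanh(x)) for x in (-5.0, -1.0, 0.0, 0.5, 3.0)
)
```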
Jesse Benson
7ccdfed1a6 Remove most ROCm-specific element-wise code and reuse CUDA element-wise code. 2020-12-27 10:30:29 -08:00
Jesse Benson
52228a703c Use TArray in AMD element-wise kernels, rather than manually copying memory to device. 2020-12-27 10:30:29 -08:00
Jesse Benson
c562952750 Dockerfile to build onnxruntime with ROCm 4.0 2020-12-22 10:21:12 -08:00
baijumeswani
a8b482681a
Clean up checkpoint tests to use the new checkpoint functions (#6188)
* add deprecation warning for old checkpoint functions

* update all the distributed checkpoint tests to use new checkpoint functions
2020-12-22 09:15:57 -08:00
Weixing Zhang
53307a5f2e
improve perf for softmax (#6128)
* improve perf for both gathergrad and softmax

* revert the change in gathergrad and will be done in another PR.

* address comments from code review.
2020-12-21 14:15:54 -08:00
jingyanwangms
f874260b9e
Backend APIs for checkpointing (#5803)
* Add backend API GetOptimizerState and GetModelState

* add GetPartitionInfoMap
2020-12-21 08:21:29 -08:00
Derek Murray
11b0a5401e
Fix typo in BERT pretraining script (#6175)
A misplaced `}` meant that the `'enable_adasum'` option was interpreted incorrectly, causing the test to fail.
2020-12-18 16:38:14 -08:00
baijumeswani
39aedbc97f
aggregate model states only for the case when mixed precision was true (#6176) 2020-12-18 14:09:32 -08:00
Sergii Dymchenko
824ef9a1de
Don't try to bind unused inputs in the Training frontend (#6166) 2020-12-17 21:41:28 -08:00
baijumeswani
adc2071043
save_checkpoint, load_checkpoint and aggregate_checkpoints (#6136)
* save_checkpoint and load_checkpoint implementations

* checkpoint aggregation logic

* unit tests for save_checkpoint, load_checkpoint and aggregate_checkpoints
2020-12-17 21:01:36 -08:00
Tixxx
32c67c2944
Deprecating Horovod and refactored Adasum computations (#5468)
deprecated horovod submodule
refactored adasum logic to be ort-native
added tests for native kernel and e2e tests
2020-12-17 16:21:33 -08:00
Juliana Franco
36c03b32e9
Using a map of ops to stages as input of partition function. (#5940)
* New partition algorithm running before AD

* Convert cut_group_info into device map. Work in progress -- works for  bert-tiny with pp=2

* Removing code for partition of bwd graphs

* Remove old code

* Adding some verification code

* Handle Shared Initializer

* Renaming rank with stage

* Added first unit test

* new test

* redundant check

* undo change in bert

* Moved cut-based partition to testing utils file

Co-authored-by: xzhu1900
Co-authored-by: wschin

* New conversion function and tests

* minor

* remove test that is not needed2

* improve GetDeviceAssignment and PR comments

* minor changes

* PR comments

* improving documentation and variable naming

* add documentation

* Variable naming and docs

* more doc improvements

* more doc improvements

* missing static cast

* Fix test file for windows

* Fix test file for windows

* Fix test file for windows

* stage id is not the same as rank id

* PR comments

* PR comments

* More comments

* More comments
2020-12-17 09:03:33 -08:00
ashbhandare
82690486c1
Partition initial optimizer state for Zero-1 (#6093)
* Initial changes

* Working changes

* Working changes

* Cleanup

* fix windows CI

* Review comments

* review comments
2020-12-16 15:27:42 -05:00
Derek Murray
8fd085801a
Add gradient registration for Abs. (#6139) 2020-12-16 08:32:10 -08:00
George Nash
939cc9b410
Enable running the mnist_training sample without cuda (#6085)
Signed-off-by: George Nash <george.nash@intel.com>
2020-12-15 17:06:54 -08:00
Edward Chen
64709b1335
Deprecate Python global configuration functions [Part 1] (#5923)
Enable options to be set via execution provider (EP)-specific options and log deprecation warning from current global configuration functions.
2020-12-15 11:32:43 -08:00
Edward Chen
9810b9e02b
Reduce amount of compiled CUDA device code (#6118)
Move CudaKernel from cuda_common.h to a new separate header, cuda_kernel.h. Update include sites to use cuda_kernel.h instead if they need CudaKernel. Inclusions of cuda_common.h are now more lightweight.

Make corresponding changes for ROCM execution provider code.

Other minor cleanup.
2020-12-14 15:27:40 -08:00
liqunfu
cde723a136
Liqun/move nightly pl to linux multi gpu v100 (#6024)
* move e2e nightly pipeline to azure devop
Co-authored-by: liqun <liqun@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2020-12-14 12:43:41 -08:00
baijumeswani
dd2e5a1a05
state_dict and load_state_dict for ORTTrainer (#6095)
* add functions state_dict and load_state_dict to ORTTrainer

* unit tests for state_dict and load_state_dict for ORTTrainer
2020-12-14 11:55:52 -08:00
Suffian Khan
6cb5d3ac09
Fix multi-tensor LAMB reduction to be deterministic (#6028)
* define ordering of reduction across blocks

* save state

* remove debug code

* remove debug code

* review comments

* significant correction for reduction only over blocks on same tensor

* addressing comments

* update rocm/lamb.cc to build as well

* remove times 2048*size in multitensor test until threshold error in rocm resolved

* convert tuple => struct as per recommendation

* update comment

* apply perfect forwarding for launch_multitensor to permit passing ref rather than pointer

* remove excess template arguments from rocm lamb.cc launch_multitensor as well

* fixes for AMD build

* pr comments

* run formatter from vscode

* formatter on cuda files
2020-12-11 13:13:05 -08:00
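Why a defined ordering matters for the reduction in the commit above: floating-point addition is not associative, so summing per-block partial results in whatever order they arrive can change the total from run to run. The sketch below is not the LAMB kernel; the function names and the (block index, value) representation are assumptions made for illustration.

```python
from typing import List, Tuple

# A partial result is (block_index, value); hypothetical representation.
Partial = Tuple[int, float]


def reduce_in_arrival_order(partials: List[Partial]) -> float:
    """Sum partial results in the order given (depends on scheduling)."""
    total = 0.0
    for _, value in partials:
        total += value
    return total


def reduce_in_block_order(partials: List[Partial]) -> float:
    """Sum partial results in ascending block index order (deterministic)."""
    total = 0.0
    for _, value in sorted(partials, key=lambda p: p[0]):
        total += value
    return total


# The same four partial results, arriving in two different orders.
run_a = [(0, 1e16), (1, 1.0), (2, -1e16), (3, 1.0)]
run_b = [(0, 1e16), (2, -1e16), (1, 1.0), (3, 1.0)]

arrival_results_differ = (
    reduce_in_arrival_order(run_a) != reduce_in_arrival_order(run_b)
)
block_results_match = (
    reduce_in_block_order(run_a) == reduce_in_block_order(run_b)
)
```

Fixing the order does not make the sum more accurate; it only guarantees that every run produces the same value.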
Sherlock
a53f4dd379
Introduce VariadicAlias, remove hardcoded alias limits (#6106)
* Introduce VariadicAlias, remove hardcoded alias limits

* Include optional-lite in winml build

Co-authored-by: Sherlock Huang <bahuang@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2020-12-11 10:47:08 -08:00
Jesse Benson
38c49c2483 Make ROCM and CUDA reduction_all code more similar. 2020-12-11 09:35:07 -08:00
Vincent Wang
7ddeafdfcc
Add ReduceL2Grad and ClipGrad (#5970)
* ReduceL2Grad and ClipGrad.

* fix win build and amd ci pipeline

* resolve comments.

Co-authored-by: Vincent Wang <weicwang@AiFramework2080ti2.corp.microsoft.com>
2020-12-10 11:03:26 +08:00
Sergii Dymchenko
9e26e59a37
Deprecate opsets <12 for training. (#6027) 2020-12-09 00:15:27 -08:00
Weixing Zhang
d95fc5e849
clean up unused code. (#6059)
Co-authored-by: Weixing Zhang <wezhan@microsoft.com>
2020-12-08 23:15:30 -08:00
Weixing Zhang
2705115732
add dockerfile for ROCm3.10 and update BUILD.md for ROCm EP (#5821)
* add HSA_NO_SCRATCH_RECLAIM=1 to dockerfile

It is to work around an issue in AMD compiler which generates poor GPU ISA when the type of kernel parameter is a structure and “pass-by-value” is used

* update BUILD.md

* add dockerfile for rocm3.10
2020-12-08 23:14:56 -08:00
ashbhandare
b1a75d0e98
Enable passing initial optimizer state while creating training session (#5869)
* Support to pass initial optimizer states to optimizer graph builder

* Changes for passing init optim state to training session config

* Pass optimizer state through cpp and python frontend

* Cleanup

* Review comments

* Fix windows and mac CI

* Review comments

* review comments

* Review comments

* Frontend review changes

* Fix CI
2020-12-08 21:20:51 -05:00
Sherlock
7a43fa0028
Fix AllReduce kernel for contiguous buffer (#6064) 2020-12-08 15:55:13 -08:00
baijumeswani
523d187193
save data to and load data from an hdf5 file for checkpointing (#5975)
* save python dictionary to hdf5 representation and load an hdf5 file into a python dictionary

* unit tests for saving data to and loading data from hdf5 file
2020-12-08 11:40:57 -08:00
ashbhandare
7cebf76a46
Improve checkpointing for Zero stage 1 (#5478)
* Initial running changes

* Checkpointing aggregation changes

* compare with older version

* initial cleanup

* Add zero test, minor fix

* Fix zero test, transform, formatting

* Review comments

* add more unit tests

* review comments

* Try fix CI

* Add additional check on just aggregation code

* Try fix ckpt gen

* Add pregenerated ckpt for CI, enable zero test in e2e

* Moving test to nightly, removing ckpt files

* Add tests to dist GPU CI

* Fix dist test

* Review comments

* Fix test
2020-12-07 09:16:01 -08:00
Jesse Benson
14f6eb14b1 Use __launch_bounds__ workaround, rather than limiting threads to 256 on AMD. 2020-12-03 13:06:34 -08:00
Jesse Benson
245d43615d Fix AMD multi-tensor implementation. 2020-12-03 13:06:34 -08:00
Sherlock
c86a1e5c13
Fix Flaky orttraining tests (#5977)
* Fix Flaky orttraining tests
2020-12-03 10:24:25 -08:00
Alberto Magni
fb310fba0c
Avoid adding non-existent inputs to new Event nodes (#5915)
During graph resolve non-existent nodes cause shape-inference failures.
2020-12-01 08:21:05 -08:00
Jesse Benson
45966d878a Code review feedback 2020-11-30 09:24:22 -08:00
Jesse Benson
86e30a2db6 Update CUDA IsAllFinite kernel 2020-11-30 09:24:22 -08:00
Jesse Benson
bd96f60888 Use CUDA's IsAllFinite kernel for ROCm 2020-11-30 09:24:22 -08:00
baijumeswani
69b9368c93
Add unit tests to identify configuration migration scenarios for checkpointing (#5678) 2020-11-25 09:40:26 -08:00
baijumeswani
208f4c1d3c
Azure ci pipeline for distributed environment tests (#5881) 2020-11-23 14:01:00 -08:00
Vincent Wang
47185b9513
reducealll2 cpu kernel (#5833)
Co-authored-by: Vincent Wang <weicwang@AiFramework2080ti2.corp.microsoft.com>
2020-11-19 10:20:05 +08:00
Tracy Sharpe
f964bb94ba
Add QLinearConv NHWC transformer (#5824)
The implementation of QLinearConv internally does a transpose(NHWC)->im2col+GEMM->transpose(NCHW). This adds a graph transformer to change a model to use a com.microsoft.QLinearConv that supports NHWC natively to avoid unnecessary transposes.
2020-11-17 20:51:02 -08:00
Edward Chen
71e7c2b423
Cache build docker images in container registry. (#5811)
This PR adds infrastructure to automatically cache docker images used in CI builds in a container registry.

Currently, build images are pulled from a container registry for some builds and built every time for others. The container registry requires maintenance to keep the images up to date and building images every time wastes build agent resources.

With this change, a given build image can be looked up in a cache container registry and if present, pulled, and otherwise, built and pushed. The uniqueness of a build image is determined by a hash digest of the dockerfile, docker build context directory, and certain "docker build" options. This digest is part of the image tag in the cache container repository.

The cache container registry will need to be cleaned up periodically. This is not automated yet.
2020-11-17 17:02:24 -08:00
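The lookup key described in the commit above can be sketched as a hash over the dockerfile, the build context directory, and the build options. This is a minimal illustration under assumptions: the function name `image_cache_digest`, the choice of SHA-256, and the field encoding are hypothetical, not the repository's actual scheme.

```python
import hashlib
from pathlib import Path
from typing import Sequence


def image_cache_digest(
    dockerfile: Path, context_dir: Path, build_options: Sequence[str]
) -> str:
    """Return a hex digest that changes when any image input changes."""
    hasher = hashlib.sha256()

    def add(label: str, data: bytes) -> None:
        # Length-prefix each part so adjacent fields cannot be confused.
        for part in (label.encode("utf-8"), data):
            hasher.update(len(part).to_bytes(8, "big"))
            hasher.update(part)

    add("dockerfile", dockerfile.read_bytes())
    # Visit context files in sorted order so the digest is stable across runs.
    for path in sorted(p for p in context_dir.rglob("*") if p.is_file()):
        add(path.relative_to(context_dir).as_posix(), path.read_bytes())
    for option in build_options:
        add("option", option.encode("utf-8"))
    return hasher.hexdigest()
```

A caller could embed the digest in the image tag, try to pull that tag from the cache registry, and build and push only when the pull fails.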
zhijxu
89e5b3a24f resolve review comments 2020-11-16 11:23:01 +08:00
zhijxu
89902c2519 fix frontend bug.
An old ORT session may already exist when a new ORT session is created, which may cause an OOM error.
2020-11-16 11:23:01 +08:00