Update Python API to allow more flexibility for setting providers and provider options.
The providers argument (InferenceSession/TrainingSession constructors, InferenceSession.set_providers()) now also accepts a tuple of (name, options dict).
Fix get_available_providers() API (and the corresponding function in the C API) to return the providers in default priority order. The returned list can now be used as a starting point for the providers argument while preserving the default priority order.
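As a rough illustration of the accepted input shapes, the sketch below normalizes a providers list that mixes plain names with (name, options dict) tuples. It is not the onnxruntime implementation: normalize_providers is a hypothetical helper, and the provider names and the device_id option are used only as example values.

```python
from typing import Dict, List, Sequence, Tuple, Union

# A provider entry is either a bare name or a (name, options dict) tuple.
ProviderSpec = Union[str, Tuple[str, Dict[str, str]]]


def normalize_providers(
    providers: Sequence[ProviderSpec], available: Sequence[str]
) -> List[Tuple[str, Dict[str, str]]]:
    """Hypothetical helper: turn every entry into a (name, options) pair.

    The order of `providers` is preserved, since it expresses priority.
    """
    normalized = []
    for entry in providers:
        if isinstance(entry, str):
            name, options = entry, {}
        elif (
            isinstance(entry, tuple)
            and len(entry) == 2
            and isinstance(entry[0], str)
            and isinstance(entry[1], dict)
        ):
            name, options = entry[0], dict(entry[1])
        else:
            raise ValueError(f"Expected a name or a (name, options) tuple: {entry!r}")
        if name not in available:
            raise ValueError(f"Provider is not available: {name}")
        normalized.append((name, options))
    return normalized


# Assumed example values; in practice the list of available providers would
# come from get_available_providers(), already in default priority order.
available = ["CUDAExecutionProvider", "CPUExecutionProvider"]
requested = [("CUDAExecutionProvider", {"device_id": "0"}), "CPUExecutionProvider"]
result = normalize_providers(requested, available)
```

Keeping the list ordered means a caller can start from the available providers, attach options to the entries that need them, and leave the priority untouched.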
Convert some usages of the deprecated global configuration functions to use EP-specific options instead.
Update some EP-specific option parsing to fail on unknown options.
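A minimal sketch of what failing on unknown options could look like, assuming options arrive as a string-to-string dict. The parse_provider_options helper and the KNOWN_KEYS tuple are hypothetical and do not reflect the actual parsing code of any execution provider.

```python
from typing import Dict, Iterable


def parse_provider_options(
    options: Dict[str, str], known_keys: Iterable[str]
) -> Dict[str, str]:
    """Hypothetical strict parser: reject any key that is not recognized."""
    unknown = sorted(set(options) - set(known_keys))
    if unknown:
        raise ValueError(f"Unknown provider option(s): {', '.join(unknown)}")
    return dict(options)


# Assumed option names, for illustration only.
KNOWN_KEYS = ("device_id", "arena_extend_strategy")

parsed = parse_provider_options({"device_id": "1"}, KNOWN_KEYS)

# A misspelled key is reported instead of being silently ignored.
try:
    parse_provider_options({"devce_id": "1"}, KNOWN_KEYS)
    rejected = False
except ValueError:
    rejected = True
```

Failing early surfaces typos in option names that would otherwise leave a provider running with its defaults.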
Other cleanup.
* save_checkpoint and load_checkpoint implementations
* checkpoint aggregation logic
* unit tests for save_checkpoint, load_checkpoint and aggregate_checkpoints
* New partition algorithm running before AD
* Convert cut_group_info into device map. Work in progress -- works for bert-tiny with pp=2
* Removing code for partition of bwd graphs
* Remove old code
* Adding some verification code
* Handle Shared Initializer
* Renaming rank with stage
* Added first unit test
* new test
* redundant check
* undo change in bert
* Moved cut-based partition to testing utils file
Co-authored-by: xzhu1900
Co-authored-by: wschin
* New conversion function and tests
* minor
* remove test that is not needed
* improve GetDeviceAssignment and PR comments
* minor changes
* PR comments
* improving documentation and variable naming
* add documentation
* Variable naming and docs
* more doc improvements
* more doc improvements
* missing static cast
* Fix test file for windows
* Fix test file for windows
* Fix test file for windows
* stage id is not the same as rank id
* PR comments
* PR comments
* More comments
* More comments
Move CudaKernel from cuda_common.h to a new separate header, cuda_kernel.h. Update include sites to use cuda_kernel.h instead if they need CudaKernel. Inclusions of cuda_common.h are now more lightweight.
Make corresponding changes for ROCM execution provider code.
Other minor cleanup.
* define ordering of reduction across blocks
* save state
* remove debug code
* remove debug code
* review comments
* significant correction for reduction only over blocks on same tensor
* addressing comments
* update rocm/lamb.cc to build as well
* remove times 2048*size in multitensor test until threshold error in rocm resolved
* convert tuple => struct as per recommendation
* update comment
* apply perfect forwarding for launch_multitensor to permit passing ref rather than pointer
* remove excess template arguments from rocm lamb.cc launch_multitensor as well
* fixes for AMD build
* pr comments
* run formatter from vscode
* formatter on cuda files
* Introduce VariadicAlias, remove hardcoded alias limits
* Include optional-lite in winml build
Co-authored-by: Sherlock Huang <bahuang@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
* ReduceL2Grad and ClipGrad.
* fix win build and amd ci pipeline
* resolve comments.
Co-authored-by: Vincent Wang <weicwang@AiFramework2080ti2.corp.microsoft.com>
* add HSA_NO_SCRATCH_RECLAIM=1 to dockerfile
This works around an issue in the AMD compiler, which generates poor GPU ISA when a kernel parameter is a structure passed by value.
* update BUILD.md
* add dockerfile for rocm3.10
* Support to pass initial optimizer states to optimizer graph builder
* Changes for passing init optim state to training session config
* Pass optimizer state through cpp and python frontend
* Cleanup
* Review comments
* Fix windows and mac CI
* Review comments
* review comments
* Review comments
* Frontend review changes
* Fix CI
* save python dictionary to hdf5 representation and load an hdf5 file into a python dictionary
* unit tests for saving data to and loading data from hdf5 file
* Initial running changes
* Checkpointing aggregation changes
* compare with older version
* initial cleanup
* Add zero test, minor fix
* Fix zero test, transform, formatting
* Review comments
* add more unit tests
* review comments
* Try fix CI
* Add additional check on just aggregation code
* Try fix ckpt gen
* Add pregenerated ckpt for CI, enable zero test in e2e
* Moving test to nightly, removing ckpt files
* Add tests to dist GPU CI
* Fix dist test
* Review comments
* Fix test
The implementation of QLinearConv internally does a transpose(NHWC)->im2col+GEMM->transpose(NCHW). This adds a graph transformer to change a model to use a com.microsoft.QLinearConv that supports NHWC natively to avoid unnecessary transposes.
This PR adds infrastructure to automatically cache docker images used in CI builds in a container registry.
Currently, build images are pulled from a container registry for some builds and built every time for others. The container registry requires maintenance to keep the images up to date, and building images every time wastes build agent resources.
With this change, a given build image is looked up in a cache container registry: if it is present, it is pulled; otherwise, it is built and pushed. The uniqueness of a build image is determined by a hash digest of the dockerfile, docker build context directory, and certain "docker build" options. This digest is part of the image tag in the cache container repository.
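The digest-based tagging can be sketched roughly as follows. This is an illustration only, not the script added by this PR: the image_cache_digest function, the choice of SHA-256, the way files are walked, and the example-build-image repository name are all assumptions.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Iterable


def image_cache_digest(
    dockerfile: Path, context_dir: Path, build_options: Iterable[str]
) -> str:
    """Hypothetical digest over the dockerfile, build context, and options."""
    hasher = hashlib.sha256()

    def add(label: str, data: bytes) -> None:
        # Prefix each item with a label and its length so that adjacent
        # items cannot be confused with one another.
        hasher.update(label.encode())
        hasher.update(len(data).to_bytes(8, "big"))
        hasher.update(data)

    add("dockerfile", dockerfile.read_bytes())
    # Visit context files in sorted order so the digest is deterministic.
    for path in sorted(p for p in context_dir.rglob("*") if p.is_file()):
        add("path", path.relative_to(context_dir).as_posix().encode())
        add("content", path.read_bytes())
    for option in build_options:
        add("option", option.encode())
    return hasher.hexdigest()


# Small demonstration with a throwaway build context.
with tempfile.TemporaryDirectory() as tmp:
    context = Path(tmp)
    dockerfile = context / "Dockerfile"
    dockerfile.write_text("FROM ubuntu:18.04\n")
    digest_a = image_cache_digest(dockerfile, context, ["--build-arg", "PYTHON_VERSION=3.6"])
    digest_b = image_cache_digest(dockerfile, context, ["--build-arg", "PYTHON_VERSION=3.7"])
    # Assumed tag layout: repository name plus a shortened digest.
    image_tag = f"example-build-image:{digest_a[:16]}"
```

Because any change to the dockerfile, the context files, or the build options yields a different digest, a stale cached image is never matched; the cost is that superseded tags accumulate until the registry is cleaned up.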
The cache container registry will need to be cleaned up periodically. This is not automated yet.