* Enable qlinearconv per-channel quantization
* Fix the android CI test failure
* Add Android Version Check for Per-Channel Quant
* Address PR comments
* Fix some minor issues
* Add verification of per-channel zero points
* Make the error tolerance configurable
* save_checkpoint and load_checkpoint implementations
* checkpoint aggregation logic
* unit tests for save_checkpoint, load_checkpoint and aggregate_checkpoints
* fix the issue that std::numeric_limits cannot handle half type
* adding a test
Co-authored-by: Du Li <duli@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
* New partition algorithm running before AD
* Convert cut_group_info into device map. Work in progress -- works for bert-tiny with pp=2
* Removing code for partition of bwd graphs
* Remove old code
* Adding some verification code
* Handle Shared Initializer
* Renaming rank with stage
* Added first unit test
* new test
* redundant check
* undo change in bert
* Moved cut-based partition to testing utils file
Co-authored-by: xzhu1900
Co-authored-by: wschin
* New conversion function and tests
* minor
* remove test that is not needed
* improve GetDeviceAssignment and PR comments
* minor changes
* PR comments
* improving documentation and variable naming
* add documentation
* Variable naming and docs
* more doc improvements
* missing static cast
* Fix test file for windows
* stage id is not the same as rank id
* PR comments
* More comments
Fix clean_docker_image_cache.py detection of image pushes. Pushes were being ignored because the script expected the wrong HTTP status code: a successful push returns 201, not 200.
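The status-code distinction behind this fix can be illustrated with a minimal sketch. It is not the script's actual code: the `RegistryLogEntry` record, its fields, and the helper names are hypothetical, and it assumes pulls appear as GET requests answered with 200 and pushes as PUT requests answered with 201 (Created).

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set


@dataclass(frozen=True)
class RegistryLogEntry:
    """Hypothetical log record; the real script's log schema is not shown here."""
    method: str  # HTTP method, e.g. "GET" or "PUT"
    status: int  # HTTP response status code
    image: str   # image reference the request touched


# Assumed success codes: a pull succeeds with 200 (OK),
# a push succeeds with 201 (Created).
PULL_OK = 200
PUSH_OK = 201


def classify_event(entry: RegistryLogEntry) -> Optional[str]:
    """Return "pull", "push", or None for a single log entry."""
    if entry.method == "GET" and entry.status == PULL_OK:
        return "pull"
    if entry.method == "PUT" and entry.status == PUSH_OK:
        return "push"
    return None


def pushed_images(entries: Iterable[RegistryLogEntry]) -> Set[str]:
    """Collect the images that were successfully pushed."""
    return {e.image for e in entries if classify_event(e) == "push"}
```

With a check that expects 200 for every request type, a PUT answered with 201 never matches, so recently pushed images look unused to the cleanup logic.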
* Remove Provider_IExecutionProvider and make the internal IExecutionProvider usable by shared providers
* Change Provider_IExecutionProviderFactory to be the core version.
Move CudaKernel from cuda_common.h to a new separate header, cuda_kernel.h. Update include sites to use cuda_kernel.h instead if they need CudaKernel. Inclusions of cuda_common.h are now more lightweight.
Make corresponding changes for ROCM execution provider code.
Other minor cleanup.
* build for .net5
* only reference cswinrt for .net5
* remove netstandard2.0 references
* upgrade language version
* net5
* remove extra comment closure
* add targetframework
* set target framework
* remove net*
* pep8 errors
* make test project build with .net windows SDK projection
* disable c# builds for non-x64 builds
* fix pep8 errors
* disable for store build
* fix tests
* remove cswinrt and sdk references from package
* bump cswinrt down to 1.0.1
* fix bin path
Co-authored-by: Sheil Kumar <sheilk@microsoft.com>
* define ordering of reduction across blocks
* save state
* remove debug code
* review comments
* significant correction for reduction only over blocks on same tensor
* addressing comments
* update rocm/lamb.cc to build as well
* remove times 2048*size in multitensor test until threshold error in rocm resolved
* convert tuple => struct as per recommendation
* update comment
* apply perfect forwarding for launch_multitensor to permit passing ref rather than pointer
* remove excess template arguments from rocm lamb.cc launch_multitensor as well
* fixes for AMD build
* pr comments
* run formatter from vscode
* formatter on cuda files
Move the DEBUG_NODE_INPUTS_OUTPUTS test into its own process. The implementation uses static variables which do not interact well with other tests.
Clean up old test_main.cc files which are no longer used.
* Introduce VariadicAlias, remove hardcoded alias limits
* Include optional-lite in winml build
Co-authored-by: Sherlock Huang <bahuang@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
* adding fp16 support for topk.
* disable fp16 tests for cpu ep
Co-authored-by: Du Li <duli@OrtTrainingDev0.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
* ReduceL2Grad and ClipGrad.
* fix win build and amd ci pipeline
* resolve comments.
Co-authored-by: Vincent Wang <weicwang@AiFramework2080ti2.corp.microsoft.com>
* allow custom op taking varied types
* refactor test case
* add test model
* refactor test case
* enable copy elision
* update test case
* fix issue in ToString function
Optimize reduction kernel code by moving loads from global memory before computation.
Add CMake option to build CUDA code with --generate-line-info option.