Although the forward pass works, this approach does not work for the
backward pass, because the intermediate tensors needed to compute
gradients are not available.
The next step is to export a training graph and split it manually.
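
To illustrate why a forward-only graph is not enough for the backward pass, here is a minimal sketch. It assumes a hypothetical two-parameter model (h = tanh(w1 * x), y = w2 * h); the names `SavedForBackward`, `Forward`, and `GradW1` are made up for illustration and are not part of this change.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical two-parameter model: h = tanh(w1 * x), y = w2 * h.
// None of these names come from the change itself.
struct SavedForBackward {
  double h;  // intermediate activation needed by the gradient
};

// Forward pass. Pass nullptr for inference-only use, in which case the
// intermediate is dropped.
double Forward(double w1, double w2, double x, SavedForBackward* saved) {
  const double h = std::tanh(w1 * x);
  if (saved != nullptr) saved->h = h;
  return w2 * h;
}

// Gradient of y with respect to w1: dy/dw1 = w2 * (1 - h^2) * x.
// It cannot be evaluated without h from forward (short of recomputing it).
double GradW1(const SavedForBackward& saved, double w2, double x, double upstream) {
  assert(std::fabs(saved.h) <= 1.0);  // tanh output range
  return upstream * w2 * (1.0 - saved.h * saved.h) * x;
}
```

An inference-style call (`saved == nullptr`) returns the same output but leaves nothing for `GradW1` to consume, which is the limitation described above.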
Move CudaKernel from cuda_common.h to a new separate header, cuda_kernel.h. Update include sites to use cuda_kernel.h instead if they need CudaKernel. Inclusions of cuda_common.h are now more lightweight.
Make corresponding changes for the ROCm execution provider code.
Other minor cleanup.
* build for .net5
* only reference cswinrt for .net5
* remove netstandard2.0 references
* upgrade language version
* net5
* remove extra comment closure
* add targetframework
* set target framework
* remove net*
* pep8 errors
* make test project build with .net windows SDK projection
* disable c# builds for non-x64 builds
* fix pep8 errors
* disable for store build
* fix tests
* remove cswinrt and sdk references from package
* bump cswinrt down to 1.0.1
* fix bin path
Co-authored-by: Sheil Kumar <sheilk@microsoft.com>
* define ordering of reduction across blocks
* save state
* remove debug code
* remove debug code
* review comments
* significant correction for reduction only over blocks on same tensor
* addressing comments
* update rocm/lamb.cc to build as well
* remove times 2048*size in multitensor test until threshold error in rocm resolved
* convert tuple => struct as per recommendation
* update comment
* apply perfect forwarding for launch_multitensor to permit passing ref rather than pointer
* remove excess template arguments from rocm lamb.cc launch_multitensor as well
* fixes for AMD build
* pr comments
* run formatter from vscode
* formatter on cuda files
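
The tuple => struct and perfect-forwarding items above can be sketched roughly as follows. This is not the actual `launch_multitensor` signature: `ReductionArgs`, `AccumulateFunctor`, and `LaunchSketch` are hypothetical names, and the real launcher dispatches GPU kernels rather than calling a host functor.

```cpp
#include <cassert>
#include <utility>

// Hypothetical per-launch arguments. Named fields replace positional
// std::get<N> access on a tuple.
struct ReductionArgs {
  int tensor_count;
  float scale;
};

// Hypothetical functor that accumulates into state passed by reference
// instead of through a pointer.
struct AccumulateFunctor {
  void operator()(const ReductionArgs& args, float& total) const {
    assert(args.tensor_count >= 0);
    total += args.scale * static_cast<float>(args.tensor_count);
  }
};

// Sketch of a launcher. Forwarding references plus std::forward keep
// lvalue arguments as lvalue references, so the functor can update
// caller-owned state directly.
template <typename Functor, typename... Args>
void LaunchSketch(Functor&& functor, Args&&... args) {
  std::forward<Functor>(functor)(std::forward<Args>(args)...);
}
```

Calling `LaunchSketch(AccumulateFunctor{}, args, total)` with an lvalue `total` updates the caller's variable in place. If the launcher took its arguments by value, the functor would modify a copy instead.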
Move the DEBUG_NODE_INPUTS_OUTPUTS test into its own process. The implementation uses static variables which do not interact well with other tests.
Clean up old test_main.cc files which are no longer used.
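
As a sketch of why static state and shared test processes mix poorly, consider a hypothetical option that is cached in a function-local static. The actual DEBUG_NODE_INPUTS_OUTPUTS implementation may use statics differently; `DumpEnabledCached` is an invented name used only to show the general hazard.

```cpp
#include <cassert>
#include <string>

// Hypothetical debug option, evaluated once and cached for the lifetime
// of the process. Later calls ignore their argument.
bool DumpEnabledCached(const char* value_at_first_call) {
  static const bool enabled =
      value_at_first_call != nullptr && std::string(value_at_first_call) == "1";
  return enabled;
}
```

Whichever test calls the function first fixes the value for every later test in the same process. Running such tests in their own process gives each run fresh static state.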
* Introduce VariadicAlias, remove hardcoded alias limits
* Include optional-lite in winml build
Co-authored-by: Sherlock Huang <bahuang@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
* adding fp16 support for topk.
* disable fp16 tests for cpu ep
Co-authored-by: Du Li <duli@OrtTrainingDev0.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
* ReduceL2Grad and ClipGrad.
* fix win build and amd ci pipeline
* resolve comments.
Co-authored-by: Vincent Wang <weicwang@AiFramework2080ti2.corp.microsoft.com>
* allow custom op taking varied types
* refactor test case
* add test model
* refactor test case
* enable copy elision
* update test case
* fix issue in ToString function
Optimize reduction kernel code by moving loads from global memory before computation.
Add CMake option to build CUDA code with --generate-line-info option.
* add HSA_NO_SCRATCH_RECLAIM=1 to dockerfile
This works around an issue in the AMD compiler, which generates poor GPU ISA when a kernel parameter is a structure that is passed by value.
* update BUILD.md
* add dockerfile for rocm3.10
* Support passing initial optimizer states to the optimizer graph builder
* Changes for passing init optim state to training session config
* Pass optimizer state through cpp and python frontend
* Cleanup
* Review comments
* Fix windows and mac CI
* Review comments
* review comments
* Review comments
* Frontend review changes
* Fix CI
Fix a typo in tools/ci_build/github/azure-pipelines/templates/get-docker-image-steps.yml.
Add logging to tools/ci_build/get_docker_image.py for easier debugging.
* save python dictionary to hdf5 representation and load an hdf5 file into a python dictionary
* unit tests for saving data to and loading data from hdf5 file