Going forward, a single unified Docker image will be published to
MCR. The hardware accelerator target must then be chosen in the
application through the OpenVINO EP's runtime config options.
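A minimal sketch of what choosing the target in the application might look like from Python. The helper only builds the `providers` and `provider_options` arguments that `onnxruntime.InferenceSession` accepts. The helper name is hypothetical, and the `device_type` key and the `CPU_FP32` value are assumptions taken from the OpenVINO EP option naming, which may differ between releases.

```python
def openvino_session_args(device_type):
    """Build provider arguments that select an OpenVINO hardware target at runtime.

    `device_type` is passed through unchecked; valid values depend on the
    installed OpenVINO EP release (assumption: e.g. "CPU_FP32").
    """
    providers = ["OpenVINOExecutionProvider"]
    provider_options = [{"device_type": device_type}]
    return providers, provider_options


providers, provider_options = openvino_session_args("CPU_FP32")
# With an OpenVINO-enabled onnxruntime build, these would be passed as:
#   onnxruntime.InferenceSession("model.onnx", providers=providers,
#                                provider_options=provider_options)
```

Keeping the target in a runtime option is what lets one image serve every accelerator: the same container runs unchanged and only the application-side string differs.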
* Create a helper for generating unique ids that can be used by an EP that creates compiled nodes and needs ids to be deterministic for a model when used in multiple sessions.
Added to IExecutionProvider as this can potentially be used by all compiling EPs and is more robust than a simplistic counter (although an EP implementer is free to choose either approach).
* Restructure the helper so it can be called across the EP bridge.
Add ability to call id generation helper from EP bridge
- convert DNNL EP to use helper to validate
Address issue where a new Model may be loaded into the same address as a previous one.
- hash the bytes in the Graph instance (1728 bytes currently) to use as the key to the full hash for the model
Add lock around id generation to ensure no issues if multiple sessions partition graphs at exactly the same time.
- Extremely unlikely but would be hard to debug and the locking cost is not an issue as it's only incurred during graph partitioning and not execution.
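The idea described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the helper added to IExecutionProvider: the class and method names are hypothetical, and it hashes the full model bytes directly instead of caching the full hash behind a cheaper key derived from the Graph instance.

```python
import hashlib
import threading


class ModelIdGenerator:
    """Hands out ids keyed by a content hash of the model (illustrative only)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._next_id = {}  # model hash -> next id to hand out

    def generate(self, model_bytes):
        # Key on the content hash, not the object address, so a new model that
        # happens to be loaded at a previously used address is not confused
        # with the old one.
        model_hash = hashlib.sha256(model_bytes).hexdigest()
        # Serialize updates so sessions partitioning concurrently cannot race.
        with self._lock:
            node_id = self._next_id.get(model_hash, 0)
            self._next_id[model_hash] = node_id + 1
        return model_hash, node_id


gen = ModelIdGenerator()
hash_a, first = gen.generate(b"model-a")
_, second = gen.generate(b"model-a")
_, other = gen.generate(b"model-b")
```

The lock is held only while the counter is updated, which matches the reasoning above: the cost is paid during partitioning, never on the execution path.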
* Enable qlinearconv per-channel quantization
* Fix the android CI test failure
* Add Android Version Check for Per-Channel Quant
* Address PR comments
* Fix some minor issues
* Add verification of per-channel zero points
* Make the error tolerance configurable
* save_checkpoint and load_checkpoint implementations
* checkpoint aggregation logic
* unit tests for save_checkpoint, load_checkpoint and aggregate_checkpoints
* fix the issue that std::numeric_limits cannot handle half type
* adding a test
Co-authored-by: Du Li <duli@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
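For the QLinearConv per-channel quantization change listed above, a small sketch of what per-channel quantization means arithmetically. It assumes a symmetric int8 scheme with one scale per output channel and a zero point of 0 for every channel; the commits do not state the exact scheme, and the function names are hypothetical.

```python
def quantize_per_channel(weights, qmin=-127, qmax=127):
    """Symmetric int8 quantization with one scale per output channel.

    `weights` is a list of channels, each a flat list of floats.
    Returns (quantized, scales, zero_points).
    """
    quantized, scales, zero_points = [], [], []
    for channel in weights:
        max_abs = max(abs(w) for w in channel)
        # Guard against an all-zero channel, which would give a zero scale.
        scale = max_abs / qmax if max_abs > 0 else 1.0
        q = [max(qmin, min(qmax, round(w / scale))) for w in channel]
        quantized.append(q)
        scales.append(scale)
        zero_points.append(0)  # assumed symmetric scheme: zero point is 0
    return quantized, scales, zero_points


def check_zero_points(zero_points, expected=0):
    """Return True when every per-channel zero point equals `expected`."""
    return all(zp == expected for zp in zero_points)


# Two channels whose magnitudes differ by a factor of 20.
weights = [[0.4, -1.0, 0.2], [8.0, -20.0, 4.0]]
q, scales, zps = quantize_per_channel(weights)
```

Because each channel gets its own scale, both channels use the full int8 range even though their magnitudes differ widely; a single per-tensor scale would squeeze the smaller channel into a few quantization levels.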
* New partition algorithm running before AD
* Convert cut_group_info into device map. Work in progress -- works for bert-tiny with pp=2
* Removing code for partition of bwd graphs
* Remove old code
* Adding some verification code
* Handle Shared Initializer
* Renaming rank with stage
* Added first unit test
* new test
* redundant check
* undo change in bert
* Moved cut-based partition to testing utils file
Co-authored-by: xzhu1900
Co-authored-by: wschin
* New conversion function and tests
* minor
* remove test that is not needed
* improve GetDeviceAssignment and PR comments
* minor changes
* PR comments
* improving documentation and variable naming
* add documentation
* Variable naming and docs
* more doc improvements
* missing static cast
* Fix test file for windows
* stage id is not the same as rank id
* PR comments
* More comments
Fix clean_docker_image_cache.py detection of image pushes. They were being ignored because the expected HTTP status code was wrong. For pushes, it's 201 instead of 200.
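A hedged sketch of the kind of check involved, not the actual code in clean_docker_image_cache.py. The record layout and function name are hypothetical, and the 200 status for pulls is an assumption inferred from the description above.

```python
# Assumption: a successful pull is logged with status 200 and a successful
# push with status 201 (Created).
EXPECTED_STATUS = {"pull": 200, "push": 201}


def successful_operations(records):
    """Keep only records whose status matches the code expected for their operation.

    Each record is a hypothetical dict: {"image": str, "operation": str, "status": int}.
    """
    return [
        r for r in records
        if EXPECTED_STATUS.get(r["operation"]) == r["status"]
    ]


records = [
    {"image": "a", "operation": "pull", "status": 200},
    {"image": "b", "operation": "push", "status": 201},
    {"image": "c", "operation": "push", "status": 200},
]
kept = [r["image"] for r in successful_operations(records)]
```

Looking up the expected code per operation avoids the original failure mode, where applying one status code to every operation silently dropped all pushes.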
* Remove Provider_IExecutionProvider and make the internal IExecutionProvider usable by shared providers
* Change Provider_IExecutionProviderFactory to be the core version.
Move CudaKernel from cuda_common.h to a new separate header, cuda_kernel.h. Update include sites to use cuda_kernel.h instead if they need CudaKernel. Inclusions of cuda_common.h are now more lightweight.
Make corresponding changes for ROCM execution provider code.
Other minor cleanup.
* build for .net5
* only reference cswinrt for .net5
* remove netstandard2.0 references
* upgrade language version
* net5
* remove extra comment closure
* add targetframework
* set target framework
* remove net*
* pep8 errors
* make test project build with .net windows SDK projection
* disable c# builds for non-x64 builds
* fix pep8 errors
* disable for store build
* fix tests
* remove cswinrt and sdk references from package
* bump cswinrt down to 1.0.1
* fix bin path
Co-authored-by: Sheil Kumar <sheilk@microsoft.com>
* define ordering of reduction across blocks
* save state
* remove debug code
* review comments
* significant correction for reduction only over blocks on same tensor
* addressing comments
* update rocm/lamb.cc to build as well
* remove times 2048*size in multitensor test until threshold error in rocm resolved
* convert tuple => struct as per recommendation
* update comment
* apply perfect forwarding for launch_multitensor to permit passing ref rather than pointer
* remove excess template arguments from rocm lamb.cc launch_multitensor as well
* fixes for AMD build
* pr comments
* run formatter from vscode
* formatter on cuda files
Move the DEBUG_NODE_INPUTS_OUTPUTS test into its own process. The implementation uses static variables which do not interact well with other tests.
Clean up old test_main.cc files which are no longer used.
* Introduce VariadicAlias, remove hardcoded alias limits
* Include optional-lite in winml build
Co-authored-by: Sherlock Huang <bahuang@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>