* Template datatype for SoftmaxWithRawMaskSmallKernel in ROCm EP
* Remove valid_items usage from SoftmaxWithRawMaskSmallKernel for ROCm EP
The kernel already masks off invalid items and this gives a much
faster implementation in hipCUB.
* Update accumulator type in ROCm EP for SoftmaxWithRawMaskSmallKernel
Hard-code the accumulator to fp32 for hipCUB in this kernel.
* Reset casting to old behavior
* Document steps to optimize SoftMax kernel on ROCm EP
Using the hipCUB valid_items interface on reduction operations
has a significant performance impact. Mask all thread data instead,
so that the valid_items interface to hipCUB is not needed.
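
The masking idea above can be illustrated with a small, language-neutral sketch. This is not the ROCm kernel: `block_max`, `block_sum`, `masked_softmax`, and the choice of `-inf` as the fill value are all assumptions made for illustration. It only shows why a full-width reduction over masked data gives the same result as a reduction limited to the valid items.

```python
import math

NEG_INF = float("-inf")


def block_max(values):
    # Stand-in for a full-width block reduction (no valid_items argument).
    return max(values)


def block_sum(values):
    # Stand-in for a full-width block reduction (no valid_items argument).
    return sum(values)


def masked_softmax(scores, mask, block_size):
    """Softmax over the unmasked scores using only full-width reductions.

    Every lane of the (hypothetical) block gets a value: lanes that are out
    of range or masked off are filled with -inf, so they contribute nothing
    to the max and exp() turns them into 0.0 for the sum.
    """
    padded = [
        scores[i] if i < len(scores) and mask[i] else NEG_INF
        for i in range(block_size)
    ]
    row_max = block_max(padded)
    if row_max == NEG_INF:
        # Every lane is masked; return zeros rather than dividing by zero.
        return [0.0] * len(scores)
    exps = [math.exp(v - row_max) for v in padded]
    total = block_sum(exps)
    return [e / total for e in exps[: len(scores)]]


result = masked_softmax([1.0, 2.0, 3.0], [True, True, False], block_size=8)
```

Because the padded lanes are neutral for both reductions, the reduction never needs to be told how many items are valid.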
* Fix bug in pybind get_all_operator_schema due to premature reference dropping
* Add updated operator kernels markdown table
* Update build.py to include documentation generation for DML operators too
* Update GPU pipeline to include DML in the build so that operators can be generated.
* Use a separate pipeline stage, feedback from Changming and Scott
* Appease annoying Python linter
* Add onnxruntime_BUILD_UNIT_TESTS=OFF and remove stale --use_dml in cuda stage
* Use CUDA callback to release deferred-release buffers
Polish
* Minor improvements.
1. Reorder an if-else so that frequent cases are checked first.
2. More documentation.
* Fix tests.
Previously, in CUDAExecutionProvider::OnRunStart, we called
GetPerThreadContext in
auto& current_deferred_release_event = GetPerThreadContext().GetCurrentDeferredReleaseEvent();
so that a CUDAExecutionProvider always owned an active PerThreadContext
and the ReleasePerThreadContext in CUDAExecutionProvider::OnRunEnd
was always valid. However, this is no longer true after dropping the
event-based deferred-release code, so we need to check whether
CUDAExecutionProvider really owns a PerThreadContext and call
ReleasePerThreadContext only if it does.
* Follow up for AMD GPU and improve CUDA part's return value.
Currently, CUDA hardware is not available to the build during
`docker build`. Because of that, the resulting build would not have
CUDA support even on CUDA-capable hardware.
This PR adds an env var, ONNXRUNTIME_FORCE_CUDA, that allows CUDA
extensions to be compiled even when CUDA support is not detected.
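
A minimal sketch of how such an override could be checked in a build script follows. Only the variable name ONNXRUNTIME_FORCE_CUDA comes from this PR; the helper name `should_build_cuda_extensions` and the convention that the value `"1"` enables the override are assumptions for illustration.

```python
import os


def should_build_cuda_extensions(cuda_detected, environ=None):
    """Decide whether CUDA extensions should be compiled.

    `cuda_detected` is whatever the build's own detection reported.
    The "1" convention for the env var is an assumption, not taken
    from the actual implementation.
    """
    env = os.environ if environ is None else environ
    forced = env.get("ONNXRUNTIME_FORCE_CUDA", "") == "1"
    return cuda_detected or forced
```

Under these assumptions, a Dockerfile could set `ENV ONNXRUNTIME_FORCE_CUDA=1` before the build step so that the extensions are compiled even though no GPU is visible during `docker build`.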
* drop nuphar code and configs
* refactor test case
* format python
* remove nuphar from training test
* remove commented-out nuphar logic
* restore llvm setting
* drop nuphar ci
* fix compile err
Co-authored-by: Randy Shuai <rashuai@microsoft.com>
* Add support for initial_growth_chunk_size_bytes setting in OrtArenaCfg pybind
* Add overloaded constructor for KVP, UT still in progress
* Fix class member access in pybind, fix unit test
* Resolve linter warnings
* Improve formatting
* Simplify UT
* Fix linter formatting
Co-authored-by: Peter Mcaughan <petermca@microsoft.com>
Fix a few obvious issues:
(1) bert_perf_test.py creates a session without a provider on line 65.
(2) compare_bert_results.py is missing a parameter in the create_session call on line 37.
(3) onnx_exporter.py has mismatched return values on lines 667 and 690.
(4) Remove unused imports from the scripts.
(5) fusion_utils need not print "Removed 0 cast nodes" or "Removed 0 Identity nodes"...
(6) Update the numpy version requirement, since the gpt2 parity tool uses equal_nan from numpy v1.19+.
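
For item (1), creating a session with an explicit provider list could look roughly like the sketch below. The helper names `choose_providers` and `create_session` here are hypothetical and are not the functions in the scripts above; `onnxruntime.InferenceSession(..., providers=...)` is the public API being illustrated.

```python
def choose_providers(use_gpu):
    """Return an explicit execution provider list (hypothetical helper)."""
    if use_gpu:
        # Keep the CPU provider as a fallback after the CUDA provider.
        return ["CUDAExecutionProvider", "CPUExecutionProvider"]
    return ["CPUExecutionProvider"]


def create_session(model_path, use_gpu=False):
    """Create an InferenceSession with providers passed explicitly."""
    # Imported lazily so choose_providers can be used without onnxruntime.
    import onnxruntime

    return onnxruntime.InferenceSession(
        model_path, providers=choose_providers(use_gpu)
    )
```

Passing the list explicitly avoids depending on whatever default provider selection the installed onnxruntime package applies.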
* Adding Split Fusion
* Make changes to comments
* Format files and change typo
* Format file
* Format files
* Update stale.yml
Change the number of days of inactivity before an issue becomes stale from 60 to 5 and the number of days of inactivity before a stale issue is closed from 7 to 5. Update the exempt labels based on the redefined set of GH labels.
* Implement stale.yml feedback.
**Description**: Create codeql.yml to replace LGTM
**Motivation and Context**
LGTM.com is shutting down and moving to GitHub code scanning. This PR enables GitHub code scanning.
C++ and C# support will be added in a separate PR.