* Add infrastructure so that a kernel definition has the full list of supported types and a list of types enabled in this build. We need to use the full list when calculating the kernel hash so that the hash value in an ORT format model is stable across builds with and without type reduction enabled.
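A minimal sketch of the idea, not the actual ONNX Runtime implementation: the hash is derived from the full supported type list only, so a build that enables fewer types still produces the same value. `KernelDefSketch`, `KernelHash`, `IsTypeEnabled`, and the choice of FNV-1a are all hypothetical and used only for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for a kernel definition (not the real KernelDef class).
struct KernelDefSketch {
  std::string op_name;
  std::vector<std::string> supported_types;  // full list, identical in every build
  std::vector<std::string> enabled_types;    // subset compiled into this build
};

// Mix one string into an FNV-1a style 64-bit hash, followed by a separator byte
// so that {"ab", "c"} and {"a", "bc"} do not feed identical byte sequences.
inline void HashString(std::uint64_t& h, const std::string& s) {
  const std::uint64_t kPrime = 1099511628211ULL;
  for (unsigned char c : s) {
    h ^= c;
    h *= kPrime;
  }
  h ^= 0xFFu;
  h *= kPrime;
}

// The hash depends on the op name and the FULL supported list only.
// enabled_types is deliberately ignored, so type reduction cannot change it.
inline std::uint64_t KernelHash(const KernelDefSketch& def) {
  std::uint64_t h = 14695981039346656037ULL;  // FNV-1a 64-bit offset basis
  HashString(h, def.op_name);
  for (const auto& t : def.supported_types) {
    HashString(h, t);
  }
  return h;
}

// Runtime dispatch would consult the per-build enabled list instead.
inline bool IsTypeEnabled(const KernelDefSketch& def, const std::string& type) {
  return std::find(def.enabled_types.begin(), def.enabled_types.end(), type) !=
         def.enabled_types.end();
}
```

With this layout, a build that enables only `float` hashes a kernel identically to a full build, provided both declare the same supported list.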
* Generate an error when an explicit stream argument is not provided in the <<<...>>> kernel launch syntax
Co-authored-by: Weixing Zhang <wezhan@microsoft.com>
Remove condition from ORT_RETURN_IF[_NOT] macro output as repeating the condition doesn't add much value compared to the explicit error message, and the error message includes the file and line anyway so it's easy enough to find the condition if needed.
Update the few places where the macros were used without an explicit error message to provide an explicit error message.
Saves 12.5KB in a minimal MinSizeRel build with all DNN ops, and 16KB in a full release build.
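A simplified sketch of the macro shape described above, assuming a hypothetical `StatusSketch` type and `SKETCH_RETURN_IF` macro rather than the real `ORT_RETURN_IF` definition. The condition text is never stringified into the binary; only the file, line, and explicit message are reported.

```cpp
#include <cassert>
#include <string>

// Hypothetical minimal status type (not onnxruntime::common::Status).
struct StatusSketch {
  bool ok;
  std::string message;
};

// Returns a failing status when `condition` is true.
// Note there is no #condition here: the condition is not repeated in the
// output, which avoids storing one extra string literal per use site.
#define SKETCH_RETURN_IF(condition, msg)                                    \
  do {                                                                      \
    if (condition) {                                                        \
      return StatusSketch{false, std::string(__FILE__) + ":" +              \
                                     std::to_string(__LINE__) + " " + (msg)}; \
    }                                                                       \
  } while (false)

// Example caller: every use supplies an explicit error message.
inline StatusSketch CheckRank(int rank) {
  SKETCH_RETURN_IF(rank < 1, "Input rank must be at least 1.");
  return StatusSketch{true, ""};
}
```

Calling `CheckRank(0)` yields a failing status whose message carries the file, the line, and the explicit text, but not the string `rank < 1`.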
Update GPU packaging pipelines to CUDA 11
The next release will use CUDA 11. Our CUDA 11 build recently broke because CentOS 7 published a glibc update, changing the version from 2.17-317.el7 to 2.17-322.el7_9. The newer version isn't compatible with CUDA 11, so we have to downgrade glibc.
* Allow users to specify a compute stream per session
Create the CUDA compute stream explicitly rather than using the legacy default stream or the per-thread default stream.
remove some redundant cudaStreamSynchronize calls
fix gpt2 model test failures
don't use the default stream in NCCL either.
add stream synchronization in OnRunEnd()
use cub::DeviceScan::InclusiveSum, which can be called with a stream specified.
fix topK failure due to latest rebase
fix TensorRT
support user specified stream
add user_stream support in TensorRT EP
use the same stream for both TensorRT and CUDA EP.
fix ScatterND
specify stream for adasum and p2p kernels.
fix loop
fix CApiTest.custom_op_handler
fix CApiTest.varied_input_custom_op_handler
change for cudaMemcpyFromSymbol
improve provider options for user specified compute stream
* add changes for ROCM EP
* fix GatherGrad UT for ROCM EP
* clean code and fix NonMaxSuppression
* use default stream for ROCM now
* fix CApiTest.custom_op_handler:OrtFormatCustomOpTests.ConvertOnnxModelToOrt
* fix TensorRT UT: CApiTest.io_binding_cuda
Co-authored-by: Weixing Zhang <wezhan@microsoft.com>
1. Merge Nuget CPU pipeline, Java CPU pipeline, C-API pipeline into a single one.
2. Enable compile warnings for CUDA files (*.cu) on Windows.
3. Enable static code analysis for the Windows builds in these jobs. For example, this is the first time we scan the JNI code.
4. Fix some warnings in the training code.
5. Enable code signing for Java, which was previously missing.
6. Update TPN.txt to remove Jemalloc.