### Description
Change the hipify step to remove the -roc option to hipify-perl. This
will prefer hipblas over rocblas. rocblas can still be called directly
such as in TunableOp.
### Motivation and Context
hip interfaces are preferred over roc for porting from cuda to hip.
Calling roc interfaces is meant for ROCm-specific enhancements or
extensions.
### Description
* Add new ROCm CI pipeline (`Linux ROCm CI Pipeline`) focusing on
inference.
* Resolve test errors; disable flaky tests.
based on test PR #21614.
### Description
- Update existing rocBLAS get_solutions API using
`*_get_solutions_by_type` (supported from ROCm5.6); remove the original
nested TunableOp logic.
- Update kernel_explorer.
### Description
In some scenarios, the triton written kernels are more performant than
CK or other handwritten kernels, so we implement a framework that
onnxruntime can use these triton written kernels.
This PR is to integrate triton into ort, so that ort can use kernels
that written and compiled by triton.
The main change focus on two part:
1. a build part to compile triton written kernel and combine these
kernels into libonnxruntime_providers_rocm.so
2. a loader and launcher in c++, for loading and launch triton written
kernels.
#### Build
To compile triton written kernel, add a script
`tools/ci_build/compile_triton.py`. This script will dynamic load all
kernel files, compile them, and generate `triton_kernel_infos.a` and
`triton_kernel_infos.h`.
`triton_kernel_infos.a` contains all compiled kernel instructions, this
file will be combined into libonnxruntime_providers_rocm.so, using
--whole-archive flag.
`triton_kernel_infos.h` defines a const array that contains all the
metadata for each compiled kernel. These metadata will be used for load
and launch. So this header file is included by 'triton_kernel.cu' which
defines load and launch functions.
Add a build flag in build.py and CMakeList.txt, when building rocm
provider, it will call triton_kernel build command, and generate all
necessary files.
#### C++ Load and Launch
On c++ part, we implement load and launch functions in triton_kernel.cu
and triton_kernel.h.
These two files located in `providers/cuda`, and when compiling rocm,
they will be hipified. so this part supports both cuda and rocm. But
currently we only call triton kernel in rocm.
We also implement a softmax triton op for example. Because there will
generate many kernels for different input shape of softmax, we use
TunableOp to select the best one.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
<!-- Describe your changes. -->
Add GemmFastGelu CK implementation.
TODO
1. The performance of CK GemmFastGelu in ORT is not good as using CK
directly, still need to investigate the reason and improve the CK in
ORT.
`GemmFastGeluUnfused float16 NN m=49152 n=3072 k=768 2298.8064 us 100.89
tflops`
`withbias DeviceGemmMultipleD_Xdl_CShuffle<256, 256, 128, 32, 8, 8,
Default> LoopScheduler: Default, PipelineVersion: v1 float16 NN m=49152
n=3072 k=768 2401.9799 us 96.56 tflops`
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Co-authored-by: peixuanzuo <peixuanzuo@linmif39a000004.zvflicr54joexhdgnhvmxrxygg.phxx.internal.cloudapp.net>
1. Update CK to its latest develop branch
2. `-mllvm -amdgpu-early-inline-all=true` is critical to CK's
performance, ensure it is properly configured.
- The flags are propagated from target `hip-lang::device`'s
`INTERFACE_COMPILE_OPTIONS`, we must not manually add the flags.
- Instead, we must ensure this target is properly configured by checking
_CMAKE_HIP_DEVICE_RUNTIME_TARGET is set.
TL,DR
`hip-lang::device` sometime will be not be properly configured if our
`CMAKE_PREFIX_PATH` is not configured carefully. In the CI docker, the
configuration is in good state, but on dev machine it is not, which then
silently result poor performance for kernels. We fixed it in this PR and
add a guard to avoid unsuccessful future editing and to prevent
convoluted debugging process.
`_CMAKE_HIP_DEVICE_RUNTIME_TARGET ` is shared in
`/opt/rocm/lib/cmake/hip-lang/hip-lang-config.cmake` and it is internal
to
[CMake](https://gitlab.kitware.com/cmake/cmake/-/merge_requests/6121/diffs),
the variable name will not be changed in the foreseeable future.
Fix warnings and enable dev mode for ROCm CI:
* Fix ROCm headers complaining "This file is deprecated. Use the header file from ..."
* Disable warning signed and unsigned compare for kernel explorer
* Fix unused and nondiscard warnings
* Enable dev mode for ROCm CI
* Walkaround error "unknown warning option '-Wno-nonnull-compare'" in kernel explorer by using '-Wno-unknown-warning-option' to ignore the unknown option
* Fix error "unused parameter 'mask'"
* Fix warning "instantiation of variable 'onnxruntime::rocm::Consts<float>::One' required here, but no definition is available", etc. Fixed by using C++17's inline (implied by constexpr) static initialization.
* Remove unused variable
* Add the missing `override` specifier
Update for ROCm CI before reland tunable GEMM #12853. This PR also update
composable kernel to use CMakes's HIP language support so that we can
mix C/C++ compiler with HIP compiler instead of locking to hip-clang
Some Ops in EP directory instead of contrib_ops directory will
require TunableOp. We will also need to add EP level session tuning
options for it. So move those code all at once.
Also remove duplicated utility functions.
* Split GemmBase RocBlasGemm
* Add composable kernel GEMM baseline
* Make linter happy
* Address review comment
* Update bert cases with batchsize
* Adjust includes to fix IWYU lint
* Only builds and links used ck kernels to improve building time
* Remove warmup run on SelectImpl
* Add comment to utility function
* Mute cpplint
* Make RocBlasGemm<T>::SelectImpl semantically correct
* Add reduced basic test cases for ck gemm
* More robust gemm testing
* Fix warnings
* Fix grammar