onnxruntime/cmake
kailums f62f722c70
integrate triton into ort (#15862)
### Description
In some scenarios, the triton written kernels are more performant than
CK or other handwritten kernels, so we implement a framework that
onnxruntime can use these triton written kernels.

This PR is to integrate triton into ort, so that ort can use kernels
that written and compiled by triton.

The main change focus on two part:
1. a build part to compile triton written kernel and combine these
kernels into libonnxruntime_providers_rocm.so
2. a loader and launcher in c++, for loading and launch triton written
kernels.

#### Build

To compile triton written kernel, add a script
`tools/ci_build/compile_triton.py`. This script will dynamic load all
kernel files, compile them, and generate `triton_kernel_infos.a` and
`triton_kernel_infos.h`.

`triton_kernel_infos.a` contains all compiled kernel instructions, this
file will be combined into libonnxruntime_providers_rocm.so, using
--whole-archive flag.

`triton_kernel_infos.h` defines a const array that contains all the
metadata for each compiled kernel. These metadata will be used for load
and launch. So this header file is included by 'triton_kernel.cu' which
defines load and launch functions.

Add a build flag in build.py and CMakeList.txt, when building rocm
provider, it will call triton_kernel build command, and generate all
necessary files.

#### C++ Load and Launch

On c++ part, we implement load and launch functions in triton_kernel.cu
and triton_kernel.h.

These two files located in `providers/cuda`, and when compiling rocm,
they will be hipified. so this part supports both cuda and rocm. But
currently we only call triton kernel in rocm.

We also implement a softmax triton op for example. Because there will
generate many kernels for different input shape of softmax, we use
TunableOp to select the best one.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-05-17 09:35:28 +08:00
..
external Implement openAI endpoint invoker for nuget (#15797) 2023-05-11 22:04:02 -07:00
patches FlatBuffers fails to compile with gcc13. (#15787) 2023-05-11 11:20:19 -07:00
tensorboard Improve dependency management (#13523) 2022-12-01 09:51:59 -08:00
adjust_global_compile_flags.cmake upgrade emsdk to 3.1.37 (#15817) 2023-05-08 16:49:47 -07:00
CMakeLists.txt integrate triton into ort (#15862) 2023-05-17 09:35:28 +08:00
CMakeSettings.json
codeconv.runsettings
deps.txt Implement openAI endpoint invoker for nuget (#15797) 2023-05-11 22:04:02 -07:00
EnableVisualStudioCodeAnalysis.props
gdk_toolchain.cmake
Info.plist.in
libonnxruntime.pc.cmake.in
nuget_helpers.cmake
onnxruntime.cmake Basic CSharp packaging support for ROCm EP (#15535) 2023-05-16 07:27:38 +08:00
onnxruntime_codegen_tvm.cmake Use target name for flatbuffers (#13991) 2022-12-20 11:44:02 -08:00
onnxruntime_common.cmake Add clog back to onnxruntime_EXTERNAL_LIBRARIES. (#15363) 2023-04-05 09:11:19 -07:00
onnxruntime_compile_triton_kernel.cmake integrate triton into ort (#15862) 2023-05-17 09:35:28 +08:00
onnxruntime_config.h.in Adust GetVersionString() GetBuildInfoString() signatures and move them to OrtApi (#15921) 2023-05-13 13:45:07 -07:00
onnxruntime_csharp.cmake Refactor training build options (#13964) 2023-01-03 13:28:16 -08:00
onnxruntime_flatbuffers.cmake Rework some external targets to ease building with -DFETCHCONTENT_FULLY_DISCONNECTED=ON (#15323) 2023-04-03 17:45:12 -07:00
onnxruntime_framework.cmake Implement openAI endpoint invoker for nuget (#15797) 2023-05-11 22:04:02 -07:00
onnxruntime_fuzz_test.cmake Fix fuzz test (#14385) 2023-01-22 22:17:43 -08:00
onnxruntime_graph.cmake Create dedicated build for training api (#14136) 2023-01-10 20:58:04 -08:00
onnxruntime_ios.toolchain.cmake
onnxruntime_java.cmake Update build option for training in java to enable_training_api (#15638) 2023-04-24 11:53:08 -07:00
onnxruntime_java_unittests.cmake Update build option for training in java to enable_training_api (#15638) 2023-04-24 11:53:08 -07:00
onnxruntime_kernel_explorer.cmake integrate triton into ort (#15862) 2023-05-17 09:35:28 +08:00
onnxruntime_language_interop_ops.cmake Use target name for flatbuffers (#13991) 2022-12-20 11:44:02 -08:00
onnxruntime_mlas.cmake [AMX] Update assembler check (#15501) 2023-04-19 14:16:26 -07:00
onnxruntime_nodejs.cmake [js] upgrade dependencies and enable strict mode (#14930) 2023-03-22 15:05:04 -07:00
onnxruntime_objectivec.cmake Add iOS Swift Package Manager support (#15297) 2023-04-20 16:18:35 +10:00
onnxruntime_opschema_lib.cmake Use target name for flatbuffers (#13991) 2022-12-20 11:44:02 -08:00
onnxruntime_optimizer.cmake Optimize SCE loss compute (#15401) 2023-04-13 13:02:12 +08:00
onnxruntime_providers.cmake integrate triton into ort (#15862) 2023-05-17 09:35:28 +08:00
onnxruntime_pyop.cmake Use target name for flatbuffers (#13991) 2022-12-20 11:44:02 -08:00
onnxruntime_python.cmake Remove onnxruntime_PYBIND_EXPORT_OPSCHEMA definition from onnxruntime (#15776) 2023-05-03 13:08:35 -07:00
onnxruntime_rocm_hipify.cmake [ROCm] add beam search support (#15625) 2023-04-26 17:53:33 +08:00
onnxruntime_session.cmake fix headers for training apis (#14350) 2023-01-19 10:26:53 -08:00
onnxruntime_snpe_provider.cmake Use target name for flatbuffers (#13991) 2022-12-20 11:44:02 -08:00
onnxruntime_training.cmake Create dedicated build for training api (#14136) 2023-01-10 20:58:04 -08:00
onnxruntime_unittests.cmake upgrade emsdk to 3.1.37 (#15817) 2023-05-08 16:49:47 -07:00
onnxruntime_util.cmake Improve dependency management (#13523) 2022-12-01 09:51:59 -08:00
onnxruntime_webassembly.cmake Support WebNN EP (#15698) 2023-05-08 21:25:10 -07:00
precompiled_header.cmake
Sdl.ruleset Add a Github workflow for Prefast (#15763) 2023-05-03 11:42:51 -07:00
set_winapi_family_desktop.h
target_delayload.cmake
uwp_stubs.h Run clang-format in CI (#15524) 2023-04-18 09:26:58 -07:00
wcos_rules_override.cmake
winml.cmake Use target name for flatbuffers (#13991) 2022-12-20 11:44:02 -08:00
winml_cppwinrt.cmake
winml_sdk_helpers.cmake
winml_unittests.cmake Use target name for flatbuffers (#13991) 2022-12-20 11:44:02 -08:00