### Description
Add additonal gfx targets for AMD GPU support
### Motivation and Context
Required to integrate mainline onnxruntime support for AMD GPUs
---------
Co-authored-by: Stefan Sokolovic <stsokolo@amd.com>
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
### Description
<!-- Describe your changes. -->
Pick up onnx/onnx#6010 to support EinSum shape inference
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
This change allows EinSum operator's output shape to be inferenced so
that it can run on accelerators.
### Description
This change introduces the WebGPU EP into ONNX Runtime.
To make the PR as simple as possible, this PR excluded the following:
- C API changes for WebGPU EP
- actual implementation of WebGPU EP. Currently in this PR, WebGPU is a
stub implementation that does not register any kernel.
- Python IO Binding update
- Node.js IO Binding update
This PR now contains only 43 file changes (while the working branch
contains 130+) and hopefully this makes it easier to review.
There is going to be separated PRs for each mentioned above.
Current working branch: #21904
### Description
<!-- Describe your changes. -->
NS is not developed anymore and ORT doesn't use it for int4 inference
either. Remove it to clean up the code
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Composable Kernel build fails under ROCm 6.2.
This PR patches Composable Kernel the same way as
https://github.com/ROCm/composable_kernel/pull/1346
* fix buffer resource to match "s" constraint
* add missing memory clobber
### Description
Update XNNPack to latest version (Sep 4)
- Some op outputs are changed, channel or stride paras are moved into
reshape func.
e.g.
96962a602d
- input params of xnnpack's resize related function are changed a lot
- KleidiAI is added as a dependency in ARM64
- The latest XNNPACK includes 2 static libs microkernels-prod and
xnnpack.
Without microkernels-prod, it throws the exception of Undefined symbols.
- Add ORT_TARGET_PROCESSOR to get the real processor target in CMake
### Description
Implement softcap for gqa.
### Motivation and Context
Fixes certain models like Gemma-2 which need softcap to work so they
don't output nan's.
### Description
Enabling python binding and gcc support for AIX.
### Motivation and Context
Code changes in this PR contains:
1. python binding enablement
2. gcc building support
Below are list of files and the description.
1. cmake/CMakeLists.txt
[gcc building support] -no-unused-function compiler flag addition for
IBMClang
2. cmake/external/eigen.cmake
[gcc building support] AIX check for applying the AIX patch
3. cmake/onnxruntime_python.cmake
[python binding ] putting NOT AIX check for -Xlinker
4. cmake/onnxruntime_unittests.cmake
[gcc building support] Fix for gtest behavior. Check the comment .
[python binding ] using -Wl,-brtl for linking
onnxruntime_providers_shared in test_execution_provider
5. cmake/patches/eigen/eigen-aix.patch
[gcc building support] In AIX gcc, we are hitting
__builtin_cpu_supports("mma") which is not supported yet. So patching
code for this method . Patched code will check for P10 Processor at
run-time and based on that routine will be set.
6. onnxruntime/python/onnxruntime_validation.py
[python binding ] Adding AIX check in check_distro_info()
7. onnxruntime/test/providers/cpu/generator/random_test.cc
[gcc building support] updating previous check for AIX , along with
clang. So in case of gcc, else block will hit.
8. onnxruntime/test/python/onnxruntime_test_python.py
[python binding ] powerpc check on platform.processor()
9. setup.py
[python binding ] Adding AIX check for list of libs.
### Description
This change disables Abseil's symbolize functionality in Windows
non-debug builds.
### Motivation and Context
To solve #21826. Avoid having a dependency on dbghelp.dll.
### Description
<!-- Describe your changes. -->
refer to https://github.com/microsoft/onnxruntime/pull/21867
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
---------
Co-authored-by: Your Name <you@example.com>
1. Update google benchmark from 1.8.3 to 1.8.5
2. Update google test from commit in main branch to tag 1.15.0
3. Update pybind11 from 2.12.0 to 2.13.1
4. Update pytorch cpuinfo to include the support for Arm Neoverse V2,
Cortex X4, A720 and A520.
5. Update re2 from 2024-05-01 to 2024-07-02
6. Update cmake to 3.30.1
7. Update Linux docker images
8. Fix a warning in test/perftest/ort_test_session.cc:826:37: error:
implicit conversion loses integer precision: 'streamoff' (aka 'long
long') to 'const std::streamsize' (aka 'const long')
[-Werror,-Wshorten-64-to-32]
### Description
Enablement of onnxruntime for AIX and fixing issues related to
big-endian platform.
### Motivation and Context
changes in this PR contains:
1. Enablement code for building onnxruntime on AIX operating system.
2. while testing the build on AIX, we found issues related to big endian
platform . More details about few of those issues can be found in [Big
endian issue: Graph Transformation Attention Fusion tests are failing
#12921](https://github.com/microsoft/onnxruntime/issues/12921)
Below are list of files and the description about the change.
1. cmake/CMakeLists.txt
[BUILDING on AIX issue] check for "IBMClang" is added for handling
-Wno-unused-parameter
2. cmake/external/onnxruntime_external_deps.cmake
[BUILDING on AIX issue]Enabling gtest_disable_pthreads for AIX
3. cmake/onnxruntime.cmake
[BUILDING on AIX issue]
o Blocking codes for AIX which generates generated_source.c and further
requires some symbol files.
o Putting NO AIX check for non-supported linker flags like --Xlinker
o iconv linking
4. cmake/onnxruntime_framework.cmake
[BUILDING on AIX issue]Putting NO AIX check for -Wl,-rpath='$ORIGIN'
5. cmake/onnxruntime_mlas.cmake
[BUILDING on AIX issue]POWER10 releated macro/function definition .
6. cmake/onnxruntime_providers_cpu.cmake
[BUILDING on AIX issue]Putting NO AIX check for non-supported linker
flags like --Xlinker
7. cmake/onnxruntime_unittests.cmake
[BUILDING on AIX issue]
o Putting NO AIX check for non-supported linker flags like --Xlinker
o Adding required libraries for AIX linker under applicatiion like
onnxruntime_shared_lib_test ,onnxruntime_logging_apis etc
8. cmake/patches/flatbuffers/flatbuffers.patch
[BUILDING on AIX issue] Handling of TypeCode in
include/flatbuffers/flatbuffers.h under AIX + clang
9. onnxruntime/contrib_ops/cpu/murmur_hash3.cc
[Big endian issue] Byte-Conversion handlling in compute() and getblock()
routines
10. onnxruntime/contrib_ops/cpu/quantization/matmul_nbits_impl.cc
[Big endian issue] Handling of test failures . Byte swapping for
quant_value.
11. onnxruntime/core/framework/tensorprotoutils.cc
[Big endian issue]
Implementation of SetRawDataInTensorProto , ConvertRawDataInTensorProto
.
o SetRawDataInTensorProto : Wrapper for set_raw_data(). Calling
ConvertRawDataInTensorProto() in big-endian system
o ConvertRawDataInTensorProto : function used mainly on big-endian
system for byte-swapping of tensor raw_data
12. onnxruntime/core/framework/tensorprotoutils.h
[Big endian issue]
Declaration of SetRawDataInTensorProto, ConvertRawDataInTensorProto
13. onnxruntime/core/graph/graph.cc
[Big endian issue]
o Call ConvertRawDataInTensorProto for SPARSE_TENSOR type
o Call ConvertRawDataInTensorProto for SaveToOrtFormat
14. onnxruntime/core/mlas/lib/platform.cpp
[BUILDING on AIX issue] POWER10 released enablement for AIX
15. onnxruntime/core/mlas/lib/power/qgemm_kernel_power10.cpp
[BUILDING on AIX issue]Handling of __vector under AIX+clang
16. onnxruntime/core/mlas/lib/qgemm.h
[BUILDING on AIX issue] Adding _AIX flag
17. onnxruntime/core/mlas/lib/qlmul.cpp
[BUILDING on AIX issue] Handling of __vector under AIX+clang
18. onnxruntime/core/optimizer/attention_fusion.cc
[Big endian issue] Use util function SetRawDataInTensorProto, instead of
set_raw_data
19. onnxruntime/core/optimizer/compute_optimizer/shared_utils.cc
[Big endian issue] Use util function SetRawDataInTensorProto, instead of
set_raw_data
20. onnxruntime/core/optimizer/constant_folding.cc
[Big endian issue] Use util function SetRawDataInTensorProto, instead of
set_raw_data
21. onnxruntime/core/optimizer/embed_layer_norm_fusion.cc
[Big endian issue] Use util function SetRawDataInTensorProto, instead of
set_raw_data
22. onnxruntime/core/optimizer/nchwc_transformer.cc
[Big endian issue] Use util function SetRawDataInTensorProto, instead of
set_raw_data
23. onnxruntime/core/optimizer/qdq_transformer/avx2_weight_s8_to_u8.cc
[Big endian issue] Use util function SetRawDataInTensorProto, instead of
set_raw_data
24. onnxruntime/core/optimizer/qdq_transformer/qdq_s8_to_u8.cc
[Big endian issue] Use util function SetRawDataInTensorProto, instead of
set_raw_data
25. onnxruntime/core/optimizer/qdq_transformer/s8_to_u8.h
[Big endian issue] Use util function SetRawDataInTensorProto, instead of
set_raw_data
26.
onnxruntime/core/optimizer/qdq_transformer/selectors_actions/qdq_actions.cc
[Big endian issue] Use util function SetRawDataInTensorProto, instead of
set_raw_data
27. onnxruntime/core/optimizer/reshape_fusion.cc
[Big endian issue] Use util function SetRawDataInTensorProto, instead of
set_raw_data
28. onnxruntime/core/optimizer/stft_decomposition.cc
[Big endian issue] Use util function SetRawDataInTensorProto, instead of
set_raw_data
29.
onnxruntime/core/optimizer/transpose_optimization/ort_optimizer_api_impl.cc
[Big endian issue] Use util function SetRawDataInTensorProto, instead of
set_raw_data
30. onnxruntime/core/platform/path_lib.h
[BUILDING on AIX issue] Moving to normal function call, instead of
template
31. onnxruntime/core/platform/posix/env.cc
[BUILDING on AIX issue]Blocking syscall.h in AIX
32. onnxruntime/core/session/inference_session.cc
[Big endian issue] Removing ORT_RETURN_IF_NOT, FLATBUFFERS_LITTLEENDIAN
33. onnxruntime/test/flatbuffers/flatbuffer_utils_test.cc
[Big endian issue] Call ConvertRawDataInTensorProto in CreateInitializer
and ExternalWriteReadWithLoadInitializers
34. onnxruntime/test/framework/sparse_kernels_test.cc
[Big endian issue] Use util function SetRawDataInTensorProto, instead of
set_raw_data
35. onnxruntime/test/framework/tensorutils_test.cc
[Big endian issue] Helper method ConvertEndianessForVector and call this
from required place.
36. onnxruntime/test/framework/test_tensor_loader.cc
o. [BUILDING on AIX issue] Handling of getcwd for AIX
o. [Big endian issue] Bytes Swapping in run_external_data_test
37. onnxruntime/test/onnx/main.cc
[Big endian issue] including <thread> for AIX
38. onnxruntime/test/onnx/tensorprotoutils.cc
[Big endian issue] Bytes swapping in UnpackTensorWithRawData
39. onnxruntime/test/optimizer/graph_transform_test.cc
[Big endian issue] Use util function SetRawDataInTensorProto, instead of
set_raw_data
40. onnxruntime/test/optimizer/graph_transform_test_builder.cc
[Big endian issue] Use util function SetRawDataInTensorProto, instead of
set_raw_data
41. onnxruntime/test/optimizer/graph_transform_test_builder.h
[Big endian issue] Use util function SetRawDataInTensorProto, instead of
set_raw_data
42. onnxruntime/test/optimizer/initializer_test.cc
[Big endian issue] Use util function SetRawDataInTensorProto, instead of
set_raw_data
43. onnxruntime/test/optimizer/nchwc_optimizer_test.cc
[Big endian issue] Use util function SetRawDataInTensorProto, instead of
set_raw_data
44. onnxruntime/test/providers/base_tester.cc
[Big endian issue] Use util function SetRawDataInTensorProto, instead of
set_raw_data
45. onnxruntime/test/providers/cpu/generator/random_test.cc
[BUILDING on AIX issue] Adding AIX check in MultinomialGoodCase
---------
Co-authored-by: Vamshikrishna Thatikonda <vamshikrishna@in.ibm.com>
### Description
Our macOS pipeline are failing because of a build error in absl.
However, the bug fix we need is not available in the latest ABSL
release.
Here is the issue: https://github.com/abseil/abseil-cpp/pull/1536
And here is the fix:
779a3565ac
GTests uses ABSL. But this ABSL target also depends on GTest. So, it is
a circular dependency. We should be able to avoid that by avoid building
tests for ABSL. However, the version we are using has a problem with
that: it has cmake target that still depends on GTest even when testing
is disabled.
It's strange that we suddenly hit this problem and it only happens on macOS.
### Description
Update protobuf_cmake.patch to allow extra disablements. ORT repo
already patches protobuf to not disable the warning 4996.
### Motivation and Context
To meet SDL requirements, Microsoft repos have to fail build if there is
warning 4996
Binskim also gives errors if warning 4996 is disabled.
We can suppress the Binskim issues, but we need a way to disable the
warnings for the minimal set of code that has them.
Right now, WindowsAI disables 4996 for entirety of ORT, but it should
only be disabled for protobuf.
### Description
Upgrade cutlass to 3.5 to fix build errors using CUDA 12.4 or 12.5 in
Windows
- [x] Upgrade cutlass to 3.5.0.
- [x] Fix flash attention build error with latest cutlass header files
and APIs. This fix is provided by @wangyems.
- [x] Update efficient attention to use new cutlass fmha interface.
- [x] Patch cutlass to fix `hrsqrt` not found error for sm < 53.
- [x] Disable TF32 Staged Accumulation to fix blkq4_fp16_gemm_sm80_test
build error for cuda 11.8 to 12.3.
- [x] Disable TRT 10 deprecate warnings.
The following are not included in this PR:
* TRT provider replaces the deprecated APIs.
* Fix blkq4_fp16_gemm_sm80_test build error for cuda 12.4 or 12.5. This
test is not built by default unless you add `--cmake_extra_defines
onnxruntime_ENABLE_CUDA_EP_INTERNAL_TESTS=ON` in build command.
To integrate to rel-1.18.1: Either bring in other changes (like onnx
1.16.1), or generate manifest and upload a new ONNX Runtime Build Time
Deps artifact based on rel-1.18.1.
### Motivation and Context
https://github.com/microsoft/onnxruntime/issues/19891https://github.com/microsoft/onnxruntime/issues/20924https://github.com/microsoft/onnxruntime/issues/20953
### Description
- Adds a patch that fixes a shape inference bug that caused a segfault:
https://github.com/onnx/onnx/pull/6080
- Fix documentation describing why QLinearMatMul tests are currently
being skipped.
### Motivation and Context
The [PR for integrating with ONNX
1.16.0](https://github.com/microsoft/onnxruntime/pull/19745) disabled
various python quantization tests due to a shape inference bug. This PR
applies the ONNX fix as a patch. We still can't enable the tests because
some of our CIs pip install onnx-1.16.0, which doesn't include the fix.
### Description
update with ONNX 1.16.0 branch according to
https://github.com/microsoft/onnxruntime/blob/main/docs/How_To_Update_ONNX_Dev_Notes.md
ONNX 1.16.0 release notes:
https://github.com/onnx/onnx/releases/tag/v1.16.0
#### Updated ops for CPU EP:
- DequantizeLinear(21)
- Added int16 and uint16 support + various optimizer tests
- Missing int4 and uint4 support
- Missing block dequantization support
- QuantizeLinear(21)
- Added int16 and uint16 support + various optimizer tests
- Missing int4 and uint4 support
- Missing block quantization support
- Cast(21)
- Missing int4 and uint4 support
- CastLike(21)
- Missing int4 and uint4 support
- ConstantOfShape(21)
- Missing int4 and uint4 support
- Identity(21)
- Missing int4 and uint4 support
- If(21)
- Missing int4 and uint4 support
- Loop(21)
- Missing int4 and uint4 support
- Reshape(21)
- Missing int4 and uint4 support
- Scan(21)
- Missing int4 and uint4 support
- Shape(21)
- Missing int4 and uint4 support
- Size(21)
- Missing int4 and uint4 support
- Flatten(21)
- Missing float8e4m3fnuz, float8e5m2, float8e5m2fnuz, int4, and uint4
support
- Pad(21)
- Missing float8e4m3fnuz, float8e5m2, float8e5m2fnuz, int4, and uint4
support
- Squeeze(21)
- Missing float8e4m3fnuz, float8e5m2, float8e5m2fnuz, int4, and uint4
support
- Transpose(21)
- Missing float8e4m3fnuz, float8e5m2, float8e5m2fnuz, int4, and uint4
support
- Unsqueeze(21)
- Missing float8e4m3fnuz, float8e5m2, float8e5m2fnuz, int4, and uint4
support
#### Unimplemented opset 21 features/ops
- int4 and uint4 data type
- QLinearMatMul(21)
- GroupNormalization(21)
- ai.onnx.ml.TreeEnsemble(5)
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Disabled tests
#### ORT Training
orttraining/orttraining/test/python/orttraining_test_ort_apis_py_bindings.py
- test_ort_custom_ops: Potential shape inference bug for custom ops
#### Python quantization unit tests
test/onnx/python/quantization (shape inference bug)
- test_op_conv_transpose.py: test_quantize_conv_transpose_u8u8_fp16
- test_op_conv_transpose.py: test_quantize_conv_transpose_s8s8_fp16
- test_op_gemm.py: test_quantize_qop_gemm_s8s8
- test_op_gemm.py: test_quantize_qop_gemm_e4m3fn_same
- test_op_gemm.py: test_quantize_qop_gemm_e4m3fn_p3
- test_op_matmul.py: test_quantize_matmul_u8u8_f16
- test_op_matmul.py: test_quantize_matmul_s8s8_f16
- test_op_matmul.py: test_quantize_matmul_s8s8_f16_entropy
- test_op_matmul.py: test_quantize_matmul_s8s8_f16_percentile
- test_op_matmul.py: test_quantize_matmul_s8s8_f16_distribution
- test_op_relu.py: test_quantize_qop_relu_s8s8
#### ONNX tests
- test_maxpool_2d_ceil_output_size_reduce_by_one: ONNX 1.16.0 fixed a
maxpool output size bug and added this test. Enable this test when [ORT
PR](https://github.com/microsoft/onnxruntime/pull/18377) is merged.
Refer to original [ONNX PR](https://github.com/onnx/onnx/pull/5741).
- test_ai_onnx_ml_tree_ensemble_set_membership_cpu: new unimplemented op
ai.onnx.ml.TreeEnsemble
- test_ai_onnx_ml_tree_ensemble_single_tree_cpu: same
- test_ai_onnx_ml_tree_ensemble_set_membership_cuda: same
- test_ai_onnx_ml_tree_ensemble_single_tree_cuda: same
- test_cast_INT4_to_FLOAT_cpu: ORT Cast(21) impl doesn't support int4
yet
- test_cast_INT4_to_INT8_cpu: same
- test_cast_UINT4_to_FLOAT_cpu: same
- test_cast_UINT4_to_UINT8_cpu: same
- test_cast_INT4_to_FLOAT_cuda
- test_cast_INT4_to_INT8_cuda
- test_cast_UINT4_to_FLOAT_cuda
- test_cast_UINT4_to_UINT8_cuda
- test_constantofshape_float_ones_cuda: ConstantOfShape(21) not
implemented for cuda
- test_constantofshape_int_shape_zero_cuda: same
- test_constantofshape_int_zeros_cuda: same
- test_flatten_axis0_cuda: Flatten(21) not implemented for cuda
- test_flatten_axis1_cuda: same
- test_flatten_axis2_cuda: same
- test_flatten_axis3_cuda: same
- test_flatten_default_axis_cuda: same
- test_flatten_negative_axis1_cuda: same
- test_flatten_negative_axis2_cuda: same
- test_flatten_negative_axis3_cuda: same
- test_flatten_negative_axis4_cuda: same
- test_qlinearmatmul_2D_int8_float16_cpu: QLinearMatMul(21) for onnx not
implemented in ORT yet
- test_qlinearmatmul_2D_int8_float32_cpu: same
- test_qlinearmatmul_2D_uint8_float16_cpu: same
- test_qlinearmatmul_2D_uint8_float32_cpu: same
- test_qlinearmatmul_3D_int8_float16_cpu: same
- test_qlinearmatmul_3D_int8_float32_cpu: same
- test_qlinearmatmul_3D_uint8_float16_cpu: same
- test_qlinearmatmul_3D_uint8_float32_cpu: same
- test_qlinearmatmul_2D_int8_float16_cuda: same
- test_qlinearmatmul_2D_int8_float32_cuda: same
- test_qlinearmatmul_2D_uint8_float16_cuda: same
- test_qlinearmatmul_2D_uint8_float32_cuda: same
- test_qlinearmatmul_3D_int8_float16_cuda: same
- test_qlinearmatmul_3D_int8_float32_cuda: same
- test_qlinearmatmul_3D_uint8_float16_cuda: same
- test_qlinearmatmul_3D_uint8_float32_cuda: same
- test_size_cuda: Size(21) not implemented for cuda
- test_size_example_cuda: same
- test_dequantizelinear_blocked: Missing implementation for block
dequant for DequantizeLinear(21)
- test_quantizelinear_blocked_asymmetric: Missing implementation for
block quant for QuantizeLinear(21)
- test_quantizelinear_blocked_symmetric: Missing implementation for
block quant for QuantizeLinear(21)
---------
Signed-off-by: liqunfu <liqun.fu@microsoft.com>
Signed-off-by: Ganesan Ramalingam <grama@microsoft.com>
Co-authored-by: Ganesan Ramalingam <grama@microsoft.com>
Co-authored-by: George Wu <jywu@microsoft.com>
Co-authored-by: adrianlizarraga <adlizarraga@microsoft.com>
### Description
<!-- Describe your changes. -->
Initialize Symbol engine as needed with no duplicate calls.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Currently absel library may call SymInitialize more than once
when shared libraries are involved. However, this can only be
called only once per process. Our debug_alloc also may call it
when enabled. This change enables intialization to proceed
only when needed with no duplicate effort.
### Description
<!-- Describe your changes. -->
the crash caused by the neural_speed turns out to be a very corn case.
Turn it on by default.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Modifications to support 2GB+ checkpoint & Upgrading Flatbuffers
### Motivation and Context
This PR includes changes that will make ort handle 2GB+ checkpoints.
To do that we need to upgrade flatbuffers to 23.5.9 -
https://github.com/google/flatbuffers/pull/7945
- Modified the commitHash and the hash for the new version
- Removed the patch for rust generator's unused variable warning as it
is no longer producing this - [Check it out
here](d121e09d89/src/idl_gen_rust.cpp)
- Updated the VerifyField calls with alignment values that were
introduced in the new version.
---------
Co-authored-by: Sumit Agarwal <sumitagarwal@microsoft.com>
### Description
Add a patch for Windows ARM64EC
### Motivation and Context
Will need more changes in onnxruntime/core/common/cpuid_arch_definition.h and onnxruntime/core/common/cpuid_info.cc
### Description
<!-- Describe your changes. -->
Setup usage of coremltools via dependencies instead of copying files.
Pull in some changes from
https://github.com/microsoft/onnxruntime/pull/19347 in preparation for
supporting ML Program and enabling building the ML Model on all
platforms to make development and testing of CoreML EP code easier.
- Update to coremltools 7.1
- Add patch for changes required for cross platform build of ML Program
related code
- Generate coreml proto files on all platforms
- mainly to test these changes work everywhere, as the proto files will
be used on all platforms when #19347 is checked in
- rename onnxruntime_coreml_proto target to coreml_proto as it contains
purely coreml protobuf code with no ORT related chagnes
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Improve setup.
### Description
Building on g++ 13.2.0 results in -Wstringop-overread errors on Linux.
This commit addresses the flatbuffer build issue with the following
changes:
1. Remove the Werror flag in the flarbuffer patch.
2. Add a compilation option to suppress the 'stringop-overflow' error in
the Flatbuffers within the xnnpack provider.
### Motivation and Context
https://github.com/google/flatbuffers/issues/8119https://github.com/microsoft/onnxruntime/pull/19239
Signed-off-by: Phoebe Chen <phoebe.chen@sifive.com>
### Description
Update absl and gtest to fix an ARM64EC build error
### Motivation and Context
We need to get an important fix into ORT.
The fix is:
8028a87c96
### Description
<!-- Describe your changes. -->
1. Introduce MoE CUDA op to ORT based on FT implementation.
2. Upgrade cutlass to 3.1.0 to avoid some build failures on Windows.
Remove patch file for cutlass 3.0.0.
3. Sharded MoE implementation will come with another PR
limitation: __CUDA_ARCH__ >= 700
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
<!-- Describe your changes. -->
Use different march flag to workaround what appears to be a clang issue.
See https://github.com/tensorflow/tensorflow/issues/59970 for links to
various relevant pieces of info/discussions.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
<!-- Describe your changes. -->
Add 32-bit patch binary and infra to fallback to it. The Azure devops
Windows CIs are missing patch.exe from their git install for some reason
so the default `find_package(Patch)` fails as that is where it expects
to find it.
Remove Eigen patch. Underlying issue was fixed in source 3 years ago by
c6c84ed961
and the patch command is invalid (args are for git apply not patch).
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Make usage of patch consistent across all CIs
Fix https://github.com/microsoft/onnxruntime/issues/15248
### Description
<!-- Describe your changes. -->
Update XNNPACK to latest version
- adds fp16 kernels and various other improvements
- requires pthreadpool update as well
Most code updates in the XNNPACK EP are to adjust to the new XNNPACK API
- 'setup' is split into 'reshape' and 'setup'
- some ops use a workspace buffer
- copied workspace allocation from XNNPACK unit test code
- some suffixes changed
Added wrapper for XNNPACK caches to base XNNPACK EP kernel
- simplifies usage
- XNNPACK split out the code and weights caches, but the code cache
isn't currently usable via the public API
- we could use the internal types if we think it's required for
performance reasons. non-trivial though as we'd need to propagate ifdef
values from the XNNPACK build up to the ORT build.
- using XNNPACK internals would also mean we would not be able to
support using a pre-build XNNPACK package
- not an issue currently
Fixed opset registration for internal NHWC domain
- was not being tied to the ONNX version, so nodes inserted by layout
transformation had the incorrect opset
- a number of other places needed updating once this issue was fixed
Remove support for NCHW Resize from XNNPACK EP so it's NHWC only
- we only supported NCHW for fp32,
- doing so adds complexity in multiple places (XNNPACK EP kernel
implementation, layout transformation and transpose optimization)
- unclear if that complexity provides any benefit. can add back if
required by production scenario
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
We're looking at enabling fp16 support for CoreML and NNAPI. If we do
that we need a good fallback story if the CPU EP will be used. The
XNNPACK fp16 kernels will hopefully provide that.
NOTE: This PR doesn't add fp16 support to the XNNPACK EP kernels. That
can be done as required in separate EPs and should be relatively simple
to do.
### Description
this is for ORT 1.17.0 - make ORT to use ONNX release 1.15.0 branch. Eventually will update to the release tag once ONNX 1.15.0 is released
### Motivation and Context
Prepare for ORT 1.17.0 release. People can start work on new and updated ONNX ops in ORT.
---------
Signed-off-by: Liqun Fu <liqfu@microsoft.com>
### Description
The new cpuinfo library doesn't use clog on Android. Newer XNNPack
versions have removed the dependency on clog, but the one we use still
has it. So I cherry-pick the XNNPack to our patch file.
### Description
This PR fixes build break for WebAssembly introduced in
6986981482
(435ad2b1d8).
This change updates onnx.patch in onnxruntime repo. the corresponding PR
in onnx repo is: https://github.com/onnx/onnx/pull/5495.
It may takes a while for the next onnx version bump.
- Fix flatbuffers flatc warning, unused-but-set-variable.
- Address `-Wshorten-64-to-32` warnings (fix in our code, allow in dependencies' code).
- Update CI builds to use Xcode 14.3.
- Update minimum iOS version to 12.0.
- Update Mac hosted agents to MacOS 13 where possible.
When building the FlatBuffers dependencies, gcc13 emits a
stringop-overflow warning. All warnings being turned into errors, that
fails the compilation of FlatBuffers, and as a consequence also fails
the build of onnxruntime.
This commit adds the application of a patch to FlatBuffers's
CMakeList.txt, to add -Wno-error=stringop-overflow to the
CMAKE_CXX_FLAGS.
### Description
<!-- Describe your changes. -->
1. upgrade cutlass to 3.0 that containing attn_bias support.
2. extend Attention/MHA to use memory efficient attention when
rel_pos_bias with [1, num_head, s, s*] and 1d mask with [2 * batch_size
+ 1] are present.
new mask format introduction:
MASK_1D_KEY_SEQ_LEN_START,
[3 * batch_size + 2] with [key_len[0], ..., key_len[batch_size - 1],
query_start[0], ..., query_start[batch_size - 1], query_end[batch_size -
1], key_start[0], ..., key_start[batch_size - 1], key_end[batch_size -
1]]
e.g
2D mask with [[1, 1, 1, 0, 0, 0], [1, 1, 1, 1, 1, 0]] converts to this
1D mask is [3, 5, 0, 6, 12, 0, 6, 12]
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
It potentially benefits tnlrv6 and t5(encoder)
---------
Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
Co-authored-by: Kunal Vaishnavi <kvaishnavi@microsoft.com>
Co-authored-by: Kunal Vaishnavi <kvaishnavi@microsoft.com@orttrainingdev7.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>