Commit graph

11312 commits

Author SHA1 Message Date
Chen Feiyue
fffd430091
[VSINPU]Code improvement && Slice/Dropout OP support (#21217)
### Description
- Refactor codes to meet line length limit and guard missing warning
- Add slice/dropout op support
- Move vsinpu ep's cmake settings from onnxruntime_providers.cmake to a
separate file
- Modify apis with param onnxruntime::Path because this kind is replaced
by std:filesystem::path by #20920
2024-07-09 20:14:46 -07:00
Maximilian Müller
cc0de0d526
[Build] Propagate build option for CUDA minimal to TRT (#20695)
### Description

Extend cuda minimal option to TRT provider, as with TRT 10 no linking to
cuDNN is required anymore
.
Besides that with the new engine dump feature it is also possible to
embed an engine in to an ONNX and not ship a builder lib.
In addition to that this has roughly the same deserialization
time/session setup time that using TRT standalone has.

### Motivation and Context

```
exe_builder_lib\onnxruntime_perf_test.exe -I -e tensorrt -r 5 -i 'trt_engine_cache_enable|1 trt_timing_cache_enable|1 trt_dump_ep_context_model|1 trt_weightless_engine_enable|1' model.onnx


exe_no_builder_lib\onnxruntime_perf_test.exe -I -e tensorrt -r 5 -i 'trt_engine_cache_enable|1 trt_timing_cache_enable|1 trt_dump_ep_context_model|1 trt_weightless_engine_enable|1' model_ctx.onnx
```
2024-07-09 14:40:04 -07:00
Edward Chen
307b34a820
[NNAPI EP] Track skipped initializer usage (#21286)
Track skipped initializer usage in NNAPI EP to account for usage by other nodes.
2024-07-09 13:43:22 -07:00
Xiang Zhang
1ab162fbca
Fix ETW Sink Initialize unproperly locking (#21226)
### Description
ETW trace logger is fakely registered as initialized_ is marked as true
before the registration is done, causing crashing issue for Lenovo
camera application.

[Bug
42610244](https://microsoft.visualstudio.com/OS/_workitems/edit/42610244):
[Watson Failure] caused by
SVCHOSTGROUP_Camera_INVALID_POINTER_READ_c0000005_onnxruntime.dll!onnxruntime::logging::Logger::Log
2024-07-09 10:55:41 -07:00
Jian Chen
d1c19e79ea
Update OpenVino CI Ubuntu to 22.04 (#21127)
### Description
[Update OpenVino CI Ubuntu to
22.04](312fab5b3f)



### Motivation and Context
Ubuntu 22.04 is needed for linux C++20
2024-07-09 09:56:44 -07:00
Wanming Lin
eeb8fc0931
[WebNN EP] Release WebNN MLGraphBuilder after Compile to free memory (#21200)
This would help release the constants bound by the MLGraphBuilder.
2024-07-09 08:49:58 -07:00
Changming Sun
2c53b4a534
Remove core/common/gsl.h (#20894)
### Description
It might be easier if we just directly include the original gsl headers.
"core/common/gsl.h" is an indirection that doesn't provide extra help.
2024-07-08 18:09:39 -07:00
Enrico Galli
4c3c809bdb
[js/webnn] Enable user-supplied MLContext (#20600)
### Description
This PR enables the API added in #20816 as well as moving context
creation to JS.

### Motivation and Context
In order to enable I/O Binding with the upcoming
[MLBuffer](https://github.com/webmachinelearning/webnn/issues/542) API
in the WebNN specification, we need to share the same `MLContext` across
multiple sessions. This is because `MLBuffer`s are restricted to the
`MLContext` where they were created. This PR enables developers to use
the same `MLContext` across multiple sessions.
2024-07-08 10:19:39 -07:00
Wanming Lin
cd516a1677
[WebNN EP] Remove constraint for conv ops on CPU backend (#21237)
Currently WebNN TFLite backend allows the filter of
conv2d/convTranspose2d be an input. Remove the constraint and operate
necessary transpose/reshape operations for the filter input.
2024-07-08 10:14:43 -07:00
zz002
4a7eaff1d9
[vitisai] Fix build failure introduced by #20920 (#21247)
### Description

Fix build failure introduced by #20920
2024-07-08 05:44:30 -07:00
Jing Fang
83e0c6b96e
Add MatMulNBits shape infer to SymbolicShapeInference (#21246)
### Description
Support MatMulNBits shape infer in SymbolicShapeInference

MatMulNBits's B input is rank-2, so implicit merge does not apply.

### Motivation and Context
[Issue with performing shape inference using symbolic_shape_infer.py
with Phi-3 ONNX Models · Issue #21194 · microsoft/onnxruntime
(github.com)](https://github.com/microsoft/onnxruntime/issues/21194)
2024-07-05 16:24:57 -07:00
KnightYao
9ef28f092f
[Fix Bug] Fp8*Fp8 Run Error (#20911)
Fix fp8*fp8 when input A is e5m2, input B is e4m3 will run error

### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-07-05 17:11:59 +02:00
pengwa
3f6b7430d6
Use cuda memset async (#21216)
### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-07-05 17:27:45 +08:00
Baiju Meswani
0bbd061a54
Exclude azure ep from gen_def.cc (#21250)
Addresses python packaging pipeline failure.
2024-07-04 10:50:27 -07:00
Changming Sun
07c429191e
Delete path.h (#21211)
### Description
Delete path.h and replace all occurrences of onnxruntime::Path with
std::filesystem::path.
Previously we couldn't use C++17's std::filesystem because it was not
supported in iOS 12(which was released in 2018). Now we dropped the
support for iOS 12.

### Motivation and Context
To simplify code. For example, if an EP wants to use the Path class, now
it can directly use it without going through a wrapper. And the standard
implementation can handle various path types better. (We didn't take
much consideration on UNC path, "/" as a path separator on Windows,
etc).
2024-07-04 15:54:13 +08:00
kailums
40d4b2ec75
exclude split3inner kernel on rocm ep (#21238)
### Description
There is an issue when using split3inner kernel on rocm-6.0.3, exclude
these code from rocm EP.
2024-07-04 14:32:28 +08:00
Tianlei Wu
7d9b12a2e3
[CPU] SparseAttention op (#21110)
Add SparseAttention cpu implementation.
- [x] Refactoring GQAAttentionBase
- [x] Add SparseAttention implementation
- [x] Add test cases

This is unfused version. Flash attention version will be added later.
2024-07-03 21:51:57 -07:00
Yi Zhang
30b6e82e7d
Make ROCm packaging stages to a single workflow (#21235)
### Description
Make current ROCm packaging stages to a single workflow.
Reduce the possibility of all nightly packages can't be generated by one
failed stage



### Motivation and Context
Our plan is to reduce the complexity of the current zip-nuget pipeline
to improve the stability and performance of nightly packages generation.
ROCm packaging stages has no dependencies with other packaging jobs and
it's the most time-consuming route.
After this change, the most used CPU/CUDA/Mobile packaging workflow
duration can be reduced roughly from 3h20m to 2h30m.
2024-07-04 11:07:04 +08:00
cloudhan
f39ee14b46
Add GQA support for ROCm (#21032) 2024-07-03 14:55:31 +08:00
pengwa
4932e04053
ORTModule GraphTransitionManager (#19007)
### Problem

Currently, the codebase contains some logics pertaining to model
re-export checks and graph_builder reinitialization checks. Ideally,
these operations should function akin to a state machine. However, upon
inspecting the implementation, it becomes apparent that certain states
are checked or set in various scattered locations. This fragmentation
makes it challenging to comprehend when a re-export or re-initialization
will be triggered. For optimal clarity and maintainability, it is
advisable to consolidate these states into a cohesive component, rather
than dispersing them within the current graph execution manager.

Furthermore, the process of model exports and post-export processing for
stage 3 support or memory-efficient gradient management introduces
considerable complexity. To enhance the codebase's structure, it would
be beneficial to extract these intricate functionalities into a
dedicated component, divorcing them from the current graph execution
manager.

As part of the effort to improve the codebase, it's essential to address
inconsistencies in handling input/output flatten/unflatten operations.
Currently, there are several functions performing these operations
recursively, each with slightly different implementations. This
inconsistency leads to varying support for input/output data types and
structures in different parts of the code. To rectify this, the proposed
pull request simplifies these operations into a set of primitive
functions, ensuring uniformity. This not only streamlines the code but
also facilitates the maintenance of consistency when introducing bug
fixes or supporting new data types. One thing to mention here: input
output handling is deeply bound to the graph transition mentioned above,
so it is difficult to make this change separately.

While acknowledging the complexity of these logics, it is reassuring
that the codebase benefits from an extensive suite of unit tests that
cover all possible branches. Despite the intricacies, ensuring the
passage of all tests has been a time-intensive but necessary aspect of
this development effort.

### Design


Introduce `GraphTransitionManager` and put all model export and
post-export processing logics in it.
1. Re-export check
2. Do export
3. Re-post-export process check
4. Do post-export process
5. Return `PostExportProcessedModelInfo`, which contains all the
information we need, to pass to ORT to build gradient graph (currently
we do the same for training or evaluating, but ideally we should not do
it for evaluating, let's keep this behavior as it is now, and make the
change later).
    ```
          # Input names for the pre-gradient-build graph.
# This may be different with the one in ExportedGraph since we may
modify the graph inputs as needed
# for example when memory efficient gradient management is enabled.
self.onnx_graph_input_names: list[str] = onnx_graph_input_names
  
          # A subset of onnx_graph_input_names.
# Input names that require gradients for the pre-gradient-build graph.
self.onnx_graph_input_names_require_grad: list[str] =
onnx_graph_input_names_require_grad
  
# Create symbolic names for each dimension of the graph input (e.g.
onnx_graph_input_names).
# The key is the input name, the value is a dict of {dim_index:
symbolic_dim_name}
# e.g. {"input1": {0: "input1_dim0", 1: "input1_dim1"}, "input2": {0:
"input2_dim0"}}
self.onnx_graph_input_dynamic_axes_map: dict[str, dict[int, str]] =
onnx_graph_input_dynamic_axes_map
  
self.buffer_for_ort_runs: dict[str, torch.Tensor] = OrderedDict()
          self.onnx_graph_input_names_user_defined = (
onnx_graph_input_names_user_defined # The ONNX graph input names
excluding the parameters, buffers.
          )
  
# The ONNX graph input names excluding the parameters, buffers.
self.onnx_graph_input_names_require_grad_user_defined =
onnx_graph_input_names_require_grad_user_defined
  
self._post_export_processed_model: onnx.ModelProto | None =
post_export_processed_model
  
# A function to access the input data from the args and kwargs.
# If it is not None, the length is same as onnx_graph_input_names.
# For i-th input name, we can use the i-th function to get the input
data from args and kwargs.
          self.data_accessor: list[callable] | None = data_accessor
  
          # Used for unflattening the outputs from the ORT forward run.
self.module_forward_output_schema: ORTModelInputOutputSchemaType | None
= module_forward_output_schema```




The `GraphTransitionManager` instance is a property of
`GraphExecutionManager` (e.g. `TrainingManager` or ``InferenceManager),
1. Use
'self._graph_transition_manager.use_cache_or_reconstruct_post_processed_model(inputs,
kwargs)' to check whether the PyTorch module need a re-export or
re-post-export-process.
2. Use
`self._graph_transition_manager._post_export_processed_model_info.construct_inputs`
to construct the list of inputs used for ORT runs.
3. Use
`self._graph_transition_manager._post_export_processed_model_info.restore_outputs(user_outputs)`
to restore the outputs in original PyTorch output structure.



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-07-03 10:53:31 +08:00
Baiju Meswani
116398c1a4
onnxruntime shared lib inside python package (#21223) 2024-07-02 15:37:50 -07:00
Tianlei Wu
7df97f1987
Add debugging helper to dump string, vector and thread id (#21224)
### Description

Add some macro to help print data to console for debugging purpose.

Example usage:
```
int input_id;
vector<int> some_vector;

DUMP_CPU_TENSOR_INIT()
DUMP_CPU_TENSOR("some vector", some_vector);
DUMP_STRING("input_id=", input_id);
```

- To enable dump thread id, set environment variable
`ORT_DUMP_THREAD_ID=0`.
- User can disable dumping by environment variable
`ORT_ENABLE_CPU_DUMP=0`.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-07-02 11:24:04 -07:00
Yifan Li
7be1d4aad3
[TensorRT EP] Update TRT10.0 deprecated api (#20989)
### Description
<!-- Describe your changes. -->

Note:
* This PR would remove C4996 suppression in
tensorrt_execution_provider.cc only (according to Nvidia, places with
nvinfer.h included need C4996 suppression, when /Zc:__cplusplus is
enabled in ORT win build)
* A follow-up PR will be raised to update deprecated TRT Plugin api
usage.

Here are deprecated apis to be updated in this PR:
| deprecated api | Update |
| ------------------------------------------------------------ |
------------------------------------------------------------ |
|
[kCUBLAS](https://docs.nvidia.com/deeplearning/tensorrt/api/c_api/namespacenvinfer1.html#a9e1d81e5a8bfeb38b86e22a66d5f836a)
| / |
|
[kCUBLAS_LT](https://docs.nvidia.com/deeplearning/tensorrt/api/c_api/namespacenvinfer1.html#a9e1d81e5a8bfeb38b86e22a66d5f836a)
| / |
|
[kCUDNN](https://docs.nvidia.com/deeplearning/tensorrt/api/c_api/namespacenvinfer1.html#a9e1d81e5a8bfeb38b86e22a66d5f836a)
| / |
|
[reallocateOutput](https://docs.nvidia.com/deeplearning/tensorrt/api/c_api/classnvinfer1_1_1v__1__0_1_1_i_output_allocator.html#acae6441d4029584cc1c6550917518691)
| Superseded by
[reallocateOutputAsync](https://docs.nvidia.com/deeplearning/tensorrt/api/c_api/classnvinfer1_1_1v__1__0_1_1_i_output_allocator.html#aa40eeb891c1dfe4c1bbf1eabe8c705ab)
with cudaStream_t argument |
|
[createExecutionContextWithoutDeviceMemory](https://docs.nvidia.com/deeplearning/tensorrt/api/c_api/classnvinfer1_1_1_i_cuda_engine.html#adc86bcc42b098204997396ef2b1093fb)
| Superseded by
[createExecutionContext()](https://docs.nvidia.com/deeplearning/tensorrt/api/c_api/classnvinfer1_1_1_i_cuda_engine.html#a35de29aa6134165a5b14a537e6d99e82)
with parameter.<br />Check
[ExecutionContextAllocationStrategy::kUSER_MANAGED](https://docs.nvidia.com/deeplearning/tensorrt/api/c_api/namespacenvinfer1.html#ac6251a050df629edfc0ce037fa366503)
for more detail |




### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
TRT deprecated api list:
https://docs.nvidia.com/deeplearning/tensorrt/api/c_api/deprecated.html
2024-07-01 22:55:20 -07:00
Yi Zhang
beb2496748
Templatize publishing nuget package (#21199)
### Description
It's the prerequisite step of reducing complexity of current zip-nuget
pipeline.
Some packaging tasks could be cut from the most complex nuget pipline
and easily be published

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-07-02 09:24:19 +08:00
Scott McKay
8c2689877f
CoreML: Disable 1D ML Program matmul due to bug in coreml (#21186)
### Description
Disable using CoreML ML Program for a matmul where one of the inputs is
1D as the CoreML implementation appears to be broken. See
https://github.com/apple/coremltools/issues/2263

Add some debugging notes. 

### Motivation and Context
Fix failing test on macos-14.
2024-06-29 12:19:51 -07:00
Chen Feiyue
56b36a58ba
Initial PR for VSINPU execution provider (#20903)
### Description
<!-- Describe your changes. -->
-It is an initial PR for VSINPU execution provider



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
- For support VeriSilicon hardware
- TIM-VX(Tensor Interface Module)
(https://github.com/VeriSilicon/TIM-VX) is an integrated software
solution by Verisilicon for our hardware(A311D/i.MX 8M Plus etc.)
design, it is easy to use Verisilicon’s hardware by simply connecting
onnxruntime with the TIM-VX API by this VSINPU execution provider.
2024-06-28 21:48:34 -07:00
Jian Chen
9007ede102
Update upstream packaging pipeline name to make it more meaningful. (#21154)
### Description
Update upstream packaging pipeline name to make it more meaningful.



### Motivation and Context
The upstream pipeline used to only building Nuget packages, but now it
also builds Zip and Java. So change the name will make it more
meaningful.
2024-06-28 21:40:09 -07:00
Changming Sun
3a83f8b317
Update the functions in tensorprotoutils.h to use std::filesystem::path instead (#20920)
### Description
1. Update the functions in tensorprotoutils.h to use
std::filesystem::path instead of onnxruntime::Path. Eventually we can
remove the whole onnxruntime::Path class, but to this PR small I am not
doing that.
2. Remove the _SILENCE_EXPERIMENTAL_FILESYSTEM_DEPRECATION_WARNING macro
def when TensorRT EP is enabled.
2024-06-28 20:03:57 -07:00
Jian Chen
0cbe7eec5e
Uppdate nuget to Use Nuget 6.10.x (#21209)
### Description
Uppdate nuget to Use Nuget 6.10.x
2024-06-28 19:49:54 -07:00
mingyueliuh
7e93cd7f8b
[VitisAI] Align TensorProto_DataType with onnx1.16 (#21067)
### Description
Vitis AI EP synchronously supports the TensorProto data types supported
by ONNX 1.16.
Add error message show when graph resolve fail for troubleshooting.


### Motivation and Context
ONNX 1.15 & 1.16 add support some new TensorProto DataType , such as 
- FLOAT8E4M3FN
- FLOAT8E4M3FNUZ
- FLOAT8E5M2
- FLOAT8E5M2FNUZ
- UINT4
- INT4

---------

Co-authored-by: liumingyue <mingyue@xilinx.com>
2024-06-28 17:19:20 -07:00
Preetha Veeramalai
6baaaf5165
OVEP options to disable CPU fallback at compile time (#21166)
### Description
Provide user level options to control the fallback on CPU for models not
supported on Intel's NPU hardware.


### Motivation and Context
- Current workflow of OVEP allows safe fallback from OV NPU to OV CPU on
compilation failures. Also supports MLAS CPU fallback in presence of
unsupported custom ops.
- The PR provides a build-time option to disable fallback from OV NPU to
OV CPU.
- The session Option "kOrtSessionOptionsDisableCPUEPFallback" disables
OV CPU and MLAS CPU fallback.
- Also has bug fix for proto creation.

---------

Co-authored-by: jatinwadhwa921 <jatin.wadhwa@intel.com>
Co-authored-by: ankitm3k <ankit.maheshkar@intel.com>
2024-06-28 08:31:02 -07:00
Hector Li
21ad004237
Add QNN UTs for QNN Pad Op with FP16 data on HTP backend (#21142)
### Description
1. Add QNN UTs for QNN Pad Op with FP16 data on HTP backend
2. Improve Pad op builder to handle invalid optional input
3. Add UT for ReduceSum for FP16 precision with 5D for issue reproduce
2024-06-27 22:09:13 -07:00
Yi Zhang
587e92c279
Add FP32 and INT4 test in Llama2 (#21187)
### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-06-28 06:18:26 +08:00
Changming Sun
d1ab94c2b0
Add compatibility for NumPy 2.0 (#21085)
### Description

As suggested by SciPy's doc, we will
`Build against NumPy 2.0.0, then it will work for all NumPy versions
with the same major version number (NumPy does maintain backwards ABI
compatibility), and as far back as NumPy 1.19 series at the time of
writing`

I think it works because in
[numpyconfig.h#L64](https://github.com/numpy/numpy/blob/main/numpy/_core/include/numpy/numpyconfig.h#L64)
there is a macro NPY_FEATURE_VERSION. By default it is set to
NPY_1_19_API_VERSION. And the NPY_FEATURE_VERSION macro controls ABI.

This PR only upgrade the build time dependency; When a user installs
ONNX Runtime, they still can use numpy 1.x.

### Motivation and Context
Recently numpy published a new version, 2.0.0, which is incompatible with the latest ONNX Runtime release.
2024-06-27 13:50:53 -07:00
Wanming Lin
78316c8cbe
[WebNN EP] Remove useless variable unpacked_tensors_ (#21189) 2024-06-27 11:56:56 -07:00
Guenther Schmuelling
9eb1c2a7a3
support for layernorm in webgpu pre opset-17 (#21121)
handled the same way cpu does
2024-06-27 10:20:48 -07:00
Yi Zhang
8f738d8e9f
[Fix] Throwes one excepiton while Llama2 parity_check fails (#21160)
### Description



### Motivation and Context
The pipeline is green even Llama2 parity_check fails.

The PR should be merged after the below exception is solved.

'''
2024-06-25 03:49:43.621298481 [E:onnxruntime:,
sequential_executor.cc:514 ExecuteKernel] Non-zero status code returned
while running Expand node. Name:'/model/Expand' Status Message:
/model/Expand: left operand cannot broadcast on dim 3 LeftShape:
{1,1,9,9}, RightShape: {2,1,9,17}
An error occurred while verifying parity: Error in execution: Non-zero
status code returned while running Expand node. Name:'/model/Expand'
Status Message: /model/Expand: left operand cannot broadcast on dim 3
LeftShape: {1,1,9,9}, RightShape: {2,1,9,17}
Traceback (most recent call last):
File
"/workspace/onnxruntime/python/tools/transformers/models/llama/convert_to_onnx.py",
line 1043, in main
    parity_check(parity_cmd)
File
"/workspace/onnxruntime/python/tools/transformers/models/llama/llama_parity.py",
line 298, in main
verify_parity(args, location, use_auth_token, kv_cache_ortvalues,
pytorch_model=llama, config=config)
File
"/workspace/onnxruntime/python/tools/transformers/models/llama/llama_parity.py",
line 137, in verify_parity
    ort_model.run_with_iobinding(io_binding)
File
"/home/onnxruntimedev/.local/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py",
line 331, in run_with_iobinding
    self._sess.run_with_iobinding(iobinding._iobinding, run_options)
RuntimeError: Error in execution: Non-zero status code returned while
running Expand node. Name:'/model/Expand' Status Message: /model/Expand:
left operand cannot broadcast on dim 3 LeftShape: {1,1,9,9}, RightShape:
{2,1,9,17}
'''

The exception looks caused by #19832
2024-06-27 23:49:32 +08:00
Wanming Lin
b49788e68b
[WebNN EP] Fixed bug in Expand implementation (#21163)
ONNX's Expand supports bidirectionally broadcast, while WebNN's expand
op only supports unidirectionally broadcast. Thus we should calculate
the output shape for 'newShape' input of WebNN's expand op.
2024-06-27 08:09:13 -07:00
kailums
a1bbfeb306
add split3inner (#19886)
### Description
<!-- Describe your changes. -->
The split op is using pin_memory when split on different sizes.

But pin_memory is not capable for using cudagraph. 

Add a new implementation for only transformer scenarios, it split the
qkv_proj into q, k, v, not using pin_memory.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-06-27 18:53:12 +08:00
PeixuanZuo
446aa986a1
[ROCm] Extend the Pipeline restriction time (#21158)
ROCm EP builds are taking longer.
2024-06-27 15:36:04 +08:00
mindest
eecc11afc7
[ROCm] Disable ck_tile in Debug build (#21178)
### Description
tmp fix: disable ck_tile for Debug build.



### Motivation and Context
Release build works fine for ck_tile, while Debug build fails.
<details>
<summary> Typical error log to revisit
</summary>

```
[880/1797] Building HIP object CMakeFiles/onnxruntime_composable_kernel_fmha.dir/_deps/composable_kernel-build/fmha_fwd_d32_fp16_batch_b128x64x16x32x32x32_r2x1x1_w32x32x16_qr_async_vc_psddv.cpp.o
FAILED: CMakeFiles/onnxruntime_composable_kernel_fmha.dir/_deps/composable_kernel-build/fmha_fwd_d32_fp16_batch_b128x64x16x32x32x32_r2x1x1_w32x32x16_qr_async_vc_psddv.cpp.o 
/opt/rocm/llvm/bin/clang++ -DEIGEN_MPL2_ONLY -DENABLE_ROCM_PROFILING -DENABLE_STRIDED_TENSORS -DENABLE_TRAINING -DENABLE_TRAINING_APIS -DENABLE_TRAINING_CORE -DENABLE_TRAINING_OPS -DENABLE_TRAINING_TORCH_INTEROP -DMIOPEN_VERSION=30100 -DORT_ENABLE_STREAM -DROCM_VERSION=60100 -DUSE_ROCM=1 -D_GNU_SOURCE -D__HIP_ROCclr__=1 -D__bf16__ -D__fp16__ -D__fp32__ -I/build/Debug/_deps/utf8_range-src -I/ws/onnxruntime/include/onnxruntime -I/ws/onnxruntime/include/onnxruntime/core/session -I/ws/onnxruntime/orttraining/orttraining/training_api/include -I/build/Debug/_deps/composable_kernel-src/example/ck_tile/01_fmha -I/build/Debug/_deps/composable_kernel-src/include -I/build/Debug/_deps/composable_kernel-build/include -I/build/Debug/_deps/composable_kernel-src/library/include -isystem /opt/rocm-6.1.0/include -g -O -std=gnu++17 --offload-arch=gfx90a -fPIC -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -MD -MT CMakeFiles/onnxruntime_composable_kernel_fmha.dir/_deps/composable_kernel-build/fmha_fwd_d32_fp16_batch_b128x64x16x32x32x32_r2x1x1_w32x32x16_qr_async_vc_psddv.cpp.o -MF CMakeFiles/onnxruntime_composable_kernel_fmha.dir/_deps/composable_kernel-build/fmha_fwd_d32_fp16_batch_b128x64x16x32x32x32_r2x1x1_w32x32x16_qr_async_vc_psddv.cpp.o.d -o CMakeFiles/onnxruntime_composable_kernel_fmha.dir/_deps/composable_kernel-build/fmha_fwd_d32_fp16_batch_b128x64x16x32x32x32_r2x1x1_w32x32x16_qr_async_vc_psddv.cpp.o -x hip -c /build/Debug/_deps/composable_kernel-build/fmha_fwd_d32_fp16_batch_b128x64x16x32x32x32_r2x1x1_w32x32x16_qr_async_vc_psddv.cpp
In file included from /build/Debug/_deps/composable_kernel-build/fmha_fwd_d32_fp16_batch_b128x64x16x32x32x32_r2x1x1_w32x32x16_qr_async_vc_psddv.cpp:5:
In file included from /build/Debug/_deps/composable_kernel-src/example/ck_tile/01_fmha/fmha_fwd.hpp:6:
In file included from /build/Debug/_deps/composable_kernel-src/include/ck_tile/core.hpp:11:
/build/Debug/_deps/composable_kernel-src/include/ck_tile/core/arch/utility.hpp:27:18: error: constraint 'n' expects an integer constant expression
   27 |     asm volatile("s_add_u32 m0, %0, m0" : : "n"(v) : "memory");
      |                  ^
/build/Debug/_deps/composable_kernel-src/include/ck_tile/core/arch/utility.hpp:27:18: error: constraint 'n' expects an integer constant expression
/build/Debug/_deps/composable_kernel-src/include/ck_tile/core/arch/utility.hpp:27:18: error: constraint 'n' expects an integer constant expression
/build/Debug/_deps/composable_kernel-src/include/ck_tile/core/arch/utility.hpp:27:18: error: constraint 'n' expects an integer constant expression
/build/Debug/_deps/composable_kernel-src/include/ck_tile/core/arch/utility.hpp:27:18: error: constraint 'n' expects an integer constant expression
/build/Debug/_deps/composable_kernel-src/include/ck_tile/core/arch/utility.hpp:27:18: error: constraint 'n' expects an integer constant expression
/build/Debug/_deps/composable_kernel-src/include/ck_tile/core/arch/utility.hpp:27:18: error: constraint 'n' expects an integer constant expression
/build/Debug/_deps/composable_kernel-src/include/ck_tile/core/arch/utility.hpp:27:18: error: constraint 'n' expects an integer constant expression
/build/Debug/_deps/composable_kernel-src/include/ck_tile/core/arch/utility.hpp:27:18: error: constraint 'n' expects an integer constant expression
/build/Debug/_deps/composable_kernel-src/include/ck_tile/core/arch/utility.hpp:27:18: error: constraint 'n' expects an integer constant expression
/build/Debug/_deps/composable_kernel-src/include/ck_tile/core/arch/utility.hpp:27:18: error: constraint 'n' expects an integer constant expression
/build/Debug/_deps/composable_kernel-src/include/ck_tile/core/arch/utility.hpp:27:18: error: constraint 'n' expects an integer constant expression
/build/Debug/_deps/composable_kernel-src/include/ck_tile/core/arch/utility.hpp:27:18: error: constraint 'n' expects an integer constant expression
/build/Debug/_deps/composable_kernel-src/include/ck_tile/core/arch/utility.hpp:27:18: error: constraint 'n' expects an integer constant expression
/build/Debug/_deps/composable_kernel-src/include/ck_tile/core/arch/utility.hpp:27:18: error: constraint 'n' expects an integer constant expression
/build/Debug/_deps/composable_kernel-src/include/ck_tile/core/arch/utility.hpp:27:18: error: constraint 'n' expects an integer constant expression
/build/Debug/_deps/composable_kernel-src/include/ck_tile/core/arch/utility.hpp:27:18: error: constraint 'n' expects an integer constant expression
/build/Debug/_deps/composable_kernel-src/include/ck_tile/core/arch/utility.hpp:27:18: error: constraint 'n' expects an integer constant expression
/build/Debug/_deps/composable_kernel-src/include/ck_tile/core/arch/utility.hpp:27:18: error: constraint 'n' expects an integer constant expression
fatal error: too many errors emitted, stopping now [-ferror-limit=]
20 errors generated when compiling for gfx90a.
...
```
</details>
2024-06-27 12:04:17 +08:00
Scott McKay
887a818aa7
Check for unit test log severity override earlier (#21177)
### Description
<!-- Describe your changes. -->
Setting the log level after environment creation is too late in some
cases.

If the DML EP is enabled, it will create a composite sink with the
original logger using the creation time log severity, as well as
additional ETW sink. As it saves the current severity levels for each
sink inside the composite sink that prevents being able to get verbose
log output to stdout even if you set that at the session level.

I don't know enough about the setup that combines ETW with the original
sink to say whether we should also be updating the severity of
individual sinks in the combined sink, so this change is limited to
making the unit tests behave in the expected manner when the default log
severity is set in the background and not directly controlled.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Make it possible to get verbose output to stdout when the DML EP is
enabled.
2024-06-27 12:51:13 +10:00
Vincent Wang
3c0b407709
Rollback 19832, Remove shape_input_merge Fusion (#21179)
The PR caused Big Models pipeline failure for running Llama2. After the
rollback, the pipeline is back to normal.
2024-06-26 10:00:45 -07:00
Scott McKay
337cc56d6f
Convert scalars to 1D to satisfy ML Program requirements. (#21159)
### Description
<!-- Describe your changes. -->
Convert scalars to 1D to satisfy ML Program requirements.


https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=1418617&view=logs&j=f7cc61a9-cc70-56e7-b06c-4668ca17e426&t=16d281b5-1bfd-5309-f274-36d0dffd9cb1&l=27167

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Fixes test failure in #17361
2024-06-26 09:54:36 -07:00
mindest
e2abba18ea
Skip softmax BF16 test for ROCm (#21162)
### Description

Skip softmax BF16 test for ROCm, because BFloat16 is unsupported by
MIOpen, and `torch.cuda.is_available()` also returns `True` for ROCm.
2024-06-26 11:15:50 +08:00
Wanming Lin
41ad83fb00
[WebNN EP] Support rest Reduction ops for TFLite backend (#21135)
- reduceLogSum, reduceLogSumExp and reduceSumSquare have been landed in
https://chromium-review.googlesource.com/c/chromium/src/+/5575815
- reduceL1 and reduceL2 have been landed in
https://chromium-review.googlesource.com/c/chromium/src/+/5606091
2024-06-25 18:30:55 -07:00
Wanming Lin
4743803944
[WebNN EP] Support more Normalization ops for TFLite backend (#21151)
Following Normalization ops have been supported in Chromium for TFLite
backend:
- batchNormalization:
https://chromium-review.googlesource.com/c/chromium/src/+/5532745
- layerNormalization:
https://chromium-review.googlesource.com/c/chromium/src/+/5573326
- instanceNormalization:
https://chromium-review.googlesource.com/c/chromium/src/+/5532750
2024-06-24 19:04:23 -07:00
Jian Chen
f81c0ec32a
Remove warning suppression from Java Packaging pipeline. (#21010)
### Description
Remove warning suppression from Java Packaging pipeline.


### Motivation and Context
We want the CI step not to produce warning.
2024-06-24 16:46:21 -07:00
mindest
adaf0e8116
[Fix] USE_NCCL -> ORT_USE_NCCL (#21136)
### Description
Correct the macro used when NCCL enabled.
2024-06-24 11:33:17 -07:00
Wanming Lin
3a917e49fb
[WebNN EP] Support 4 more ops for TFLite backend (#21134)
Recently WebNN TFLite backend supports gelu, expand, softsign,
reciprocal.
2024-06-24 09:52:12 -07:00