onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-07-04 04:07:22 +00:00

Author	SHA1	Message	Date
Chen Feiyue	fffd430091	[VSINPU]Code improvement && Slice/Dropout OP support (#21217) ### Description - Refactor code to meet the line length limit and guard missing warnings - Add Slice/Dropout op support - Move the VSINPU EP's CMake settings from onnxruntime_providers.cmake to a separate file - Modify APIs that take an onnxruntime::Path parameter, because that type was replaced by std::filesystem::path in #20920	2024-07-09 20:14:46 -07:00
Maximilian Müller	cc0de0d526	[Build] Propagate build option for CUDA minimal to TRT (#20695) ### Description Extend the CUDA minimal option to the TRT provider, since TRT 10 no longer requires linking to cuDNN. In addition, the new engine dump feature makes it possible to embed an engine into an ONNX model and not ship a builder lib. This also gives roughly the same deserialization/session setup time as using TRT standalone. ### Motivation and Context
```
exe_builder_lib\onnxruntime_perf_test.exe -I -e tensorrt -r 5 -i 'trt_engine_cache_enable|1 trt_timing_cache_enable|1 trt_dump_ep_context_model|1 trt_weightless_engine_enable|1' model.onnx
exe_no_builder_lib\onnxruntime_perf_test.exe -I -e tensorrt -r 5 -i 'trt_engine_cache_enable|1 trt_timing_cache_enable|1 trt_dump_ep_context_model|1 trt_weightless_engine_enable|1' model_ctx.onnx
```	2024-07-09 14:40:04 -07:00
Edward Chen	307b34a820	[NNAPI EP] Track skipped initializer usage (#21286) Track skipped initializer usage in the NNAPI EP to account for usage by other nodes.	2024-07-09 13:43:22 -07:00
Xiang Zhang	1ab162fbca	Fix ETW Sink Initialize improperly locking (#21226) ### Description The ETW trace logger was falsely reported as registered: initialized_ was set to true before registration completed, causing a crash in the Lenovo camera application. [Bug 42610244](https://microsoft.visualstudio.com/OS/_workitems/edit/42610244): [Watson Failure] caused by SVCHOSTGROUP_Camera_INVALID_POINTER_READ_c0000005_onnxruntime.dll!onnxruntime::logging::Logger::Log	2024-07-09 10:55:41 -07:00
Jian Chen	d1c19e79ea	Update OpenVino CI Ubuntu to 22.04 (#21127) ### Description [Update OpenVino CI Ubuntu to 22.04](`312fab5b3f`) ### Motivation and Context Ubuntu 22.04 is needed for C++20 on Linux.	2024-07-09 09:56:44 -07:00
Wanming Lin	eeb8fc0931	[WebNN EP] Release WebNN MLGraphBuilder after Compile to free memory (#21200) This helps release the constants bound by the MLGraphBuilder.	2024-07-09 08:49:58 -07:00
Changming Sun	2c53b4a534	Remove core/common/gsl.h (#20894) ### Description It might be easier if we just include the original gsl headers directly. "core/common/gsl.h" is an indirection that doesn't provide extra help.	2024-07-08 18:09:39 -07:00
Enrico Galli	4c3c809bdb	[js/webnn] Enable user-supplied MLContext (#20600) ### Description This PR enables the API added in #20816 and moves context creation to JS. ### Motivation and Context To enable I/O Binding with the upcoming [MLBuffer](https://github.com/webmachinelearning/webnn/issues/542) API in the WebNN specification, we need to share the same `MLContext` across multiple sessions. This is because `MLBuffer`s are restricted to the `MLContext` where they were created. This PR enables developers to use the same `MLContext` across multiple sessions.	2024-07-08 10:19:39 -07:00
Wanming Lin	cd516a1677	[WebNN EP] Remove constraint for conv ops on CPU backend (#21237) The WebNN TFLite backend now allows the filter of conv2d/convTranspose2d to be an input. Remove the constraint and perform the necessary transpose/reshape operations on the filter input.	2024-07-08 10:14:43 -07:00
zz002	4a7eaff1d9	[vitisai] Fix build failure introduced by #20920 (#21247) ### Description Fix build failure introduced by #20920	2024-07-08 05:44:30 -07:00
Jing Fang	83e0c6b96e	Add MatMulNBits shape infer to SymbolicShapeInference (#21246) ### Description Support MatMulNBits shape inference in SymbolicShapeInference. MatMulNBits's B input is rank-2, so implicit merge does not apply. ### Motivation and Context [Issue with performing shape inference using symbolic_shape_infer.py with Phi-3 ONNX Models · Issue #21194 · microsoft/onnxruntime (github.com)](https://github.com/microsoft/onnxruntime/issues/21194)	2024-07-05 16:24:57 -07:00
KnightYao	9ef28f092f	[Fix Bug] Fp8Fp8 Run Error (#20911) Fix an Fp8Fp8 run error when input A is e5m2 and input B is e4m3.	2024-07-05 17:11:59 +02:00
pengwa	3f6b7430d6	Use cuda memset async (#21216)	2024-07-05 17:27:45 +08:00
Baiju Meswani	0bbd061a54	Exclude azure ep from gen_def.cc (#21250) Addresses a Python packaging pipeline failure.	2024-07-04 10:50:27 -07:00
Changming Sun	07c429191e	Delete path.h (#21211) ### Description Delete path.h and replace all occurrences of onnxruntime::Path with std::filesystem::path. Previously we couldn't use C++17's std::filesystem because it was not supported in iOS 12 (released in 2018). We have now dropped support for iOS 12. ### Motivation and Context To simplify code. For example, if an EP wants to use the path class, it can now use it directly without going through a wrapper. The standard implementation also handles various path types better (we didn't give much consideration to UNC paths, "/" as a path separator on Windows, etc.).	2024-07-04 15:54:13 +08:00
kailums	40d4b2ec75	exclude split3inner kernel on rocm ep (#21238) ### Description There is an issue with the split3inner kernel on ROCm 6.0.3, so exclude this code from the ROCm EP.	2024-07-04 14:32:28 +08:00
Tianlei Wu	7d9b12a2e3	[CPU] SparseAttention op (#21110) Add the SparseAttention CPU implementation. - [x] Refactor GQAAttentionBase - [x] Add SparseAttention implementation - [x] Add test cases This is the unfused version; a flash attention version will be added later.	2024-07-03 21:51:57 -07:00
Yi Zhang	30b6e82e7d	Make ROCm packaging stages to a single workflow (#21235) ### Description Move the current ROCm packaging stages into a single workflow. This reduces the chance that one failed stage prevents all nightly packages from being generated. ### Motivation and Context Our plan is to reduce the complexity of the current zip-nuget pipeline to improve the stability and performance of nightly package generation. The ROCm packaging stages have no dependencies on other packaging jobs and are the most time-consuming route. After this change, the duration of the most used CPU/CUDA/Mobile packaging workflow can be reduced from roughly 3h20m to 2h30m.	2024-07-04 11:07:04 +08:00
cloudhan	f39ee14b46	Add GQA support for ROCm (#21032)	2024-07-03 14:55:31 +08:00
pengwa	4932e04053	ORTModule GraphTransitionManager (#19007) ### Problem Currently, the codebase contains some logic pertaining to model re-export checks and graph_builder reinitialization checks. Ideally, these operations should function like a state machine. However, the implementation checks or sets certain states in various scattered locations. This fragmentation makes it hard to tell when a re-export or re-initialization will be triggered. For clarity and maintainability, these states should be consolidated into a cohesive component rather than dispersed within the current graph execution manager. Furthermore, model export and post-export processing for stage 3 support or memory-efficient gradient management introduce considerable complexity. To improve the codebase's structure, these intricate functionalities should be extracted into a dedicated component, separate from the current graph execution manager. As part of this effort, it is essential to address inconsistencies in handling input/output flatten/unflatten operations. Currently, several functions perform these operations recursively, each with a slightly different implementation. This inconsistency leads to varying support for input/output data types and structures in different parts of the code. To rectify this, this pull request simplifies these operations into a set of primitive functions, ensuring uniformity. This not only streamlines the code but also makes it easier to stay consistent when introducing bug fixes or supporting new data types. One thing to mention here: input/output handling is deeply bound to the graph transition described above, so it is difficult to make this change separately.
While acknowledging the complexity of this logic, it is reassuring that the codebase has an extensive suite of unit tests that cover all possible branches. Despite the intricacies, making all tests pass has been time-intensive but necessary. ### Design Introduce `GraphTransitionManager` and put all model export and post-export processing logic in it. 1. Re-export check 2. Do export 3. Re-post-export process check 4. Do post-export process 5. Return `PostExportProcessedModelInfo`, which contains all the information we need to pass to ORT to build the gradient graph (currently we do the same for training or evaluating; ideally we should not do it for evaluating, but let's keep this behavior as it is now and make the change later).
```
# Input names for the pre-gradient-build graph.
# This may be different from the one in ExportedGraph since we may modify the graph inputs as needed,
# for example when memory efficient gradient management is enabled.
self.onnx_graph_input_names: list[str] = onnx_graph_input_names

# A subset of onnx_graph_input_names.
# Input names that require gradients for the pre-gradient-build graph.
self.onnx_graph_input_names_require_grad: list[str] = onnx_graph_input_names_require_grad

# Create symbolic names for each dimension of the graph input (e.g. onnx_graph_input_names).
# The key is the input name, the value is a dict of {dim_index: symbolic_dim_name}
# e.g. {"input1": {0: "input1_dim0", 1: "input1_dim1"}, "input2": {0: "input2_dim0"}}
self.onnx_graph_input_dynamic_axes_map: dict[str, dict[int, str]] = onnx_graph_input_dynamic_axes_map

self.buffer_for_ort_runs: dict[str, torch.Tensor] = OrderedDict()
self.onnx_graph_input_names_user_defined = (
    onnx_graph_input_names_user_defined  # The ONNX graph input names excluding the parameters, buffers.
)

# The user-defined ONNX graph input names (excluding parameters and buffers) that require gradients.
self.onnx_graph_input_names_require_grad_user_defined = onnx_graph_input_names_require_grad_user_defined

self._post_export_processed_model: onnx.ModelProto | None = post_export_processed_model

# A function to access the input data from the args and kwargs.
# If it is not None, the length is the same as onnx_graph_input_names.
# For the i-th input name, we can use the i-th function to get the input data from args and kwargs.
self.data_accessor: list[callable] | None = data_accessor

# Used for unflattening the outputs from the ORT forward run.
self.module_forward_output_schema: ORTModelInputOutputSchemaType | None = module_forward_output_schema
```
The `GraphTransitionManager` instance is a property of `GraphExecutionManager` (e.g. `TrainingManager` or `InferenceManager`): 1. Use `self._graph_transition_manager.use_cache_or_reconstruct_post_processed_model(inputs, kwargs)` to check whether the PyTorch module needs a re-export or re-post-export-process. 2. Use `self._graph_transition_manager._post_export_processed_model_info.construct_inputs` to construct the list of inputs used for ORT runs. 3. Use `self._graph_transition_manager._post_export_processed_model_info.restore_outputs(user_outputs)` to restore the outputs in the original PyTorch output structure.	2024-07-03 10:53:31 +08:00
Baiju Meswani	116398c1a4	onnxruntime shared lib inside python package (#21223)	2024-07-02 15:37:50 -07:00
Tianlei Wu	7df97f1987	Add debugging helper to dump string, vector and thread id (#21224) ### Description Add some macros to help print data to the console for debugging purposes. Example usage:
```
int input_id;
vector<int> some_vector;
DUMP_CPU_TENSOR_INIT()
DUMP_CPU_TENSOR("some vector", some_vector);
DUMP_STRING("input_id=", input_id);
```
- To enable dumping the thread id, set the environment variable `ORT_DUMP_THREAD_ID=0`. - Users can disable dumping with the environment variable `ORT_ENABLE_CPU_DUMP=0`.	2024-07-02 11:24:04 -07:00
Yifan Li	7be1d4aad3	[TensorRT EP] Update TRT10.0 deprecated api (#20989) ### Description Note: * This PR removes the C4996 suppression only in tensorrt_execution_provider.cc (according to NVIDIA, places that include nvinfer.h need C4996 suppression when /Zc:__cplusplus is enabled in the ORT Windows build). * A follow-up PR will update deprecated TRT Plugin API usage. Deprecated APIs updated in this PR:

| Deprecated API | Update |
| --- | --- |
| [kCUBLAS](https://docs.nvidia.com/deeplearning/tensorrt/api/c_api/namespacenvinfer1.html#a9e1d81e5a8bfeb38b86e22a66d5f836a) | / |
| [kCUBLAS_LT](https://docs.nvidia.com/deeplearning/tensorrt/api/c_api/namespacenvinfer1.html#a9e1d81e5a8bfeb38b86e22a66d5f836a) | / |
| [kCUDNN](https://docs.nvidia.com/deeplearning/tensorrt/api/c_api/namespacenvinfer1.html#a9e1d81e5a8bfeb38b86e22a66d5f836a) | / |
| [reallocateOutput](https://docs.nvidia.com/deeplearning/tensorrt/api/c_api/classnvinfer1_1_1v__1__0_1_1_i_output_allocator.html#acae6441d4029584cc1c6550917518691) | Superseded by [reallocateOutputAsync](https://docs.nvidia.com/deeplearning/tensorrt/api/c_api/classnvinfer1_1_1v__1__0_1_1_i_output_allocator.html#aa40eeb891c1dfe4c1bbf1eabe8c705ab) with a cudaStream_t argument |
| [createExecutionContextWithoutDeviceMemory](https://docs.nvidia.com/deeplearning/tensorrt/api/c_api/classnvinfer1_1_1_i_cuda_engine.html#adc86bcc42b098204997396ef2b1093fb) | Superseded by [createExecutionContext()](https://docs.nvidia.com/deeplearning/tensorrt/api/c_api/classnvinfer1_1_1_i_cuda_engine.html#a35de29aa6134165a5b14a537e6d99e82) with a parameter.<br />Check [ExecutionContextAllocationStrategy::kUSER_MANAGED](https://docs.nvidia.com/deeplearning/tensorrt/api/c_api/namespacenvinfer1.html#ac6251a050df629edfc0ce037fa366503) for more detail |

### Motivation and Context TRT deprecated API list: https://docs.nvidia.com/deeplearning/tensorrt/api/c_api/deprecated.html	2024-07-01 22:55:20 -07:00
Yi Zhang	beb2496748	Templatize publishing nuget package (#21199) ### Description This is a prerequisite step toward reducing the complexity of the current zip-nuget pipeline. Some packaging tasks can then be cut from the most complex NuGet pipeline and published easily.	2024-07-02 09:24:19 +08:00
Scott McKay	8c2689877f	CoreML: Disable 1D ML Program matmul due to bug in coreml (#21186) ### Description Disable using CoreML ML Program for a matmul where one of the inputs is 1D, as the CoreML implementation appears to be broken. See https://github.com/apple/coremltools/issues/2263 Add some debugging notes. ### Motivation and Context Fix failing test on macos-14.	2024-06-29 12:19:51 -07:00
Chen Feiyue	56b36a58ba	Initial PR for VSINPU execution provider (#20903) ### Description - This is the initial PR for the VSINPU execution provider. ### Motivation and Context - To support VeriSilicon hardware. - TIM-VX (Tensor Interface Module) (https://github.com/VeriSilicon/TIM-VX) is VeriSilicon's integrated software solution for our hardware designs (A311D, i.MX 8M Plus, etc.). This VSINPU execution provider makes VeriSilicon's hardware easy to use by simply connecting onnxruntime with the TIM-VX API.	2024-06-28 21:48:34 -07:00
Jian Chen	9007ede102	Update upstream packaging pipeline name to make it more meaningful. (#21154) ### Description Update upstream packaging pipeline name to make it more meaningful. ### Motivation and Context The upstream pipeline used to build only NuGet packages, but now it also builds Zip and Java packages, so renaming it makes it more meaningful.	2024-06-28 21:40:09 -07:00
Changming Sun	3a83f8b317	Update the functions in tensorprotoutils.h to use std::filesystem::path instead (#20920) ### Description 1. Update the functions in tensorprotoutils.h to use std::filesystem::path instead of onnxruntime::Path. Eventually we can remove the whole onnxruntime::Path class, but to keep this PR small I am not doing that. 2. Remove the _SILENCE_EXPERIMENTAL_FILESYSTEM_DEPRECATION_WARNING macro definition when the TensorRT EP is enabled.	2024-06-28 20:03:57 -07:00
Jian Chen	0cbe7eec5e	Update nuget to Use Nuget 6.10.x (#21209) ### Description Update NuGet to use NuGet 6.10.x.	2024-06-28 19:49:54 -07:00
mingyueliuh	7e93cd7f8b	[VitisAI] Align TensorProto_DataType with onnx1.16 (#21067) ### Description The Vitis AI EP now supports the same TensorProto data types as ONNX 1.16. Show an error message when graph resolution fails, to aid troubleshooting. ### Motivation and Context ONNX 1.15 and 1.16 add support for some new TensorProto data types, such as - FLOAT8E4M3FN - FLOAT8E4M3FNUZ - FLOAT8E5M2 - FLOAT8E5M2FNUZ - UINT4 - INT4 --------- Co-authored-by: liumingyue <mingyue@xilinx.com>	2024-06-28 17:19:20 -07:00
Preetha Veeramalai	6baaaf5165	OVEP options to disable CPU fallback at compile time (#21166) ### Description Provide user-level options to control fallback to CPU for models not supported on Intel's NPU hardware. ### Motivation and Context - The current OVEP workflow allows safe fallback from OV NPU to OV CPU on compilation failures. It also supports MLAS CPU fallback in the presence of unsupported custom ops. - The PR provides a build-time option to disable fallback from OV NPU to OV CPU. - The session option "kOrtSessionOptionsDisableCPUEPFallback" disables OV CPU and MLAS CPU fallback. - Also includes a bug fix for proto creation. --------- Co-authored-by: jatinwadhwa921 <jatin.wadhwa@intel.com> Co-authored-by: ankitm3k <ankit.maheshkar@intel.com>	2024-06-28 08:31:02 -07:00
Hector Li	21ad004237	Add QNN UTs for QNN Pad Op with FP16 data on HTP backend (#21142) ### Description 1. Add QNN UTs for QNN Pad Op with FP16 data on HTP backend 2. Improve Pad op builder to handle invalid optional input 3. Add a UT for 5D ReduceSum with FP16 precision to reproduce an issue	2024-06-27 22:09:13 -07:00
Yi Zhang	587e92c279	Add FP32 and INT4 test in Llama2 (#21187)	2024-06-28 06:18:26 +08:00
Changming Sun	d1ab94c2b0	Add compatibility for NumPy 2.0 (#21085) ### Description As suggested by SciPy's doc, we will `Build against NumPy 2.0.0, then it will work for all NumPy versions with the same major version number (NumPy does maintain backwards ABI compatibility), and as far back as NumPy 1.19 series at the time of writing`. I think it works because [numpyconfig.h#L64](https://github.com/numpy/numpy/blob/main/numpy/_core/include/numpy/numpyconfig.h#L64) defines a macro NPY_FEATURE_VERSION, which is set to NPY_1_19_API_VERSION by default, and the NPY_FEATURE_VERSION macro controls the ABI. This PR only upgrades the build-time dependency; when a user installs ONNX Runtime, they can still use NumPy 1.x. ### Motivation and Context NumPy recently published a new version, 2.0.0, which is incompatible with the latest ONNX Runtime release.	2024-06-27 13:50:53 -07:00
Wanming Lin	78316c8cbe	[WebNN EP] Remove useless variable unpacked_tensors_ (#21189)	2024-06-27 11:56:56 -07:00
Guenther Schmuelling	9eb1c2a7a3	support for layernorm in webgpu pre opset-17 (#21121) Handled the same way the CPU EP does.	2024-06-27 10:20:48 -07:00
Yi Zhang	8f738d8e9f	[Fix] Throw an exception when Llama2 parity_check fails (#21160) ### Motivation and Context The pipeline was green even when the Llama2 parity_check failed. This PR should be merged after the exception below is resolved.
```
2024-06-25 03:49:43.621298481 [E:onnxruntime:, sequential_executor.cc:514 ExecuteKernel] Non-zero status code returned while running Expand node. Name:'/model/Expand' Status Message: /model/Expand: left operand cannot broadcast on dim 3 LeftShape: {1,1,9,9}, RightShape: {2,1,9,17}
An error occurred while verifying parity: Error in execution: Non-zero status code returned while running Expand node. Name:'/model/Expand' Status Message: /model/Expand: left operand cannot broadcast on dim 3 LeftShape: {1,1,9,9}, RightShape: {2,1,9,17}
Traceback (most recent call last):
  File "/workspace/onnxruntime/python/tools/transformers/models/llama/convert_to_onnx.py", line 1043, in main
    parity_check(parity_cmd)
  File "/workspace/onnxruntime/python/tools/transformers/models/llama/llama_parity.py", line 298, in main
    verify_parity(args, location, use_auth_token, kv_cache_ortvalues, pytorch_model=llama, config=config)
  File "/workspace/onnxruntime/python/tools/transformers/models/llama/llama_parity.py", line 137, in verify_parity
    ort_model.run_with_iobinding(io_binding)
  File "/home/onnxruntimedev/.local/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 331, in run_with_iobinding
    self._sess.run_with_iobinding(iobinding._iobinding, run_options)
RuntimeError: Error in execution: Non-zero status code returned while running Expand node. Name:'/model/Expand' Status Message: /model/Expand: left operand cannot broadcast on dim 3 LeftShape: {1,1,9,9}, RightShape: {2,1,9,17}
```
The exception appears to be caused by #19832.	2024-06-27 23:49:32 +08:00
Wanming Lin	b49788e68b	[WebNN EP] Fixed bug in Expand implementation (#21163) ONNX's Expand supports bidirectional broadcasting, while WebNN's expand op supports only unidirectional broadcasting. Thus we should calculate the output shape and pass it as the 'newShape' input of WebNN's expand op.	2024-06-27 08:09:13 -07:00
kailums	a1bbfeb306	add split3inner (#19886) ### Description The split op uses pin_memory when splitting into different sizes, but pin_memory is not compatible with CUDA graphs. Add a new implementation for transformer scenarios only: it splits qkv_proj into q, k, v without using pin_memory.	2024-06-27 18:53:12 +08:00
PeixuanZuo	446aa986a1	[ROCm] Extend the Pipeline restriction time (#21158) ROCm EP builds are taking longer.	2024-06-27 15:36:04 +08:00
mindest	eecc11afc7	[ROCm] Disable ck_tile in Debug build (#21178) ### Description Temporary fix: disable ck_tile for Debug builds. ### Motivation and Context Release builds work fine with ck_tile, while Debug builds fail. <details> <summary> Typical error log to revisit </summary> ``` [880/1797] Building HIP object CMakeFiles/onnxruntime_composable_kernel_fmha.dir/_deps/composable_kernel-build/fmha_fwd_d32_fp16_batch_b128x64x16x32x32x32_r2x1x1_w32x32x16_qr_async_vc_psddv.cpp.o FAILED: CMakeFiles/onnxruntime_composable_kernel_fmha.dir/_deps/composable_kernel-build/fmha_fwd_d32_fp16_batch_b128x64x16x32x32x32_r2x1x1_w32x32x16_qr_async_vc_psddv.cpp.o /opt/rocm/llvm/bin/clang++ -DEIGEN_MPL2_ONLY -DENABLE_ROCM_PROFILING -DENABLE_STRIDED_TENSORS -DENABLE_TRAINING -DENABLE_TRAINING_APIS -DENABLE_TRAINING_CORE -DENABLE_TRAINING_OPS -DENABLE_TRAINING_TORCH_INTEROP -DMIOPEN_VERSION=30100 -DORT_ENABLE_STREAM -DROCM_VERSION=60100 -DUSE_ROCM=1 -D_GNU_SOURCE -D__HIP_ROCclr__=1 -D__bf16__ -D__fp16__ -D__fp32__ -I/build/Debug/_deps/utf8_range-src -I/ws/onnxruntime/include/onnxruntime -I/ws/onnxruntime/include/onnxruntime/core/session -I/ws/onnxruntime/orttraining/orttraining/training_api/include -I/build/Debug/_deps/composable_kernel-src/example/ck_tile/01_fmha -I/build/Debug/_deps/composable_kernel-src/include -I/build/Debug/_deps/composable_kernel-build/include -I/build/Debug/_deps/composable_kernel-src/library/include -isystem /opt/rocm-6.1.0/include -g -O -std=gnu++17 --offload-arch=gfx90a -fPIC -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -MD -MT CMakeFiles/onnxruntime_composable_kernel_fmha.dir/_deps/composable_kernel-build/fmha_fwd_d32_fp16_batch_b128x64x16x32x32x32_r2x1x1_w32x32x16_qr_async_vc_psddv.cpp.o -MF CMakeFiles/onnxruntime_composable_kernel_fmha.dir/_deps/composable_kernel-build/fmha_fwd_d32_fp16_batch_b128x64x16x32x32x32_r2x1x1_w32x32x16_qr_async_vc_psddv.cpp.o.d -o
CMakeFiles/onnxruntime_composable_kernel_fmha.dir/_deps/composable_kernel-build/fmha_fwd_d32_fp16_batch_b128x64x16x32x32x32_r2x1x1_w32x32x16_qr_async_vc_psddv.cpp.o -x hip -c /build/Debug/_deps/composable_kernel-build/fmha_fwd_d32_fp16_batch_b128x64x16x32x32x32_r2x1x1_w32x32x16_qr_async_vc_psddv.cpp In file included from /build/Debug/_deps/composable_kernel-build/fmha_fwd_d32_fp16_batch_b128x64x16x32x32x32_r2x1x1_w32x32x16_qr_async_vc_psddv.cpp:5: In file included from /build/Debug/_deps/composable_kernel-src/example/ck_tile/01_fmha/fmha_fwd.hpp:6: In file included from /build/Debug/_deps/composable_kernel-src/include/ck_tile/core.hpp:11: /build/Debug/_deps/composable_kernel-src/include/ck_tile/core/arch/utility.hpp:27:18: error: constraint 'n' expects an integer constant expression 27 \| asm volatile("s_add_u32 m0, %0, m0" : : "n"(v) : "memory"); \| ^ /build/Debug/_deps/composable_kernel-src/include/ck_tile/core/arch/utility.hpp:27:18: error: constraint 'n' expects an integer constant expression /build/Debug/_deps/composable_kernel-src/include/ck_tile/core/arch/utility.hpp:27:18: error: constraint 'n' expects an integer constant expression /build/Debug/_deps/composable_kernel-src/include/ck_tile/core/arch/utility.hpp:27:18: error: constraint 'n' expects an integer constant expression /build/Debug/_deps/composable_kernel-src/include/ck_tile/core/arch/utility.hpp:27:18: error: constraint 'n' expects an integer constant expression /build/Debug/_deps/composable_kernel-src/include/ck_tile/core/arch/utility.hpp:27:18: error: constraint 'n' expects an integer constant expression /build/Debug/_deps/composable_kernel-src/include/ck_tile/core/arch/utility.hpp:27:18: error: constraint 'n' expects an integer constant expression /build/Debug/_deps/composable_kernel-src/include/ck_tile/core/arch/utility.hpp:27:18: error: constraint 'n' expects an integer constant expression /build/Debug/_deps/composable_kernel-src/include/ck_tile/core/arch/utility.hpp:27:18: error: constraint 
'n' expects an integer constant expression /build/Debug/_deps/composable_kernel-src/include/ck_tile/core/arch/utility.hpp:27:18: error: constraint 'n' expects an integer constant expression /build/Debug/_deps/composable_kernel-src/include/ck_tile/core/arch/utility.hpp:27:18: error: constraint 'n' expects an integer constant expression /build/Debug/_deps/composable_kernel-src/include/ck_tile/core/arch/utility.hpp:27:18: error: constraint 'n' expects an integer constant expression /build/Debug/_deps/composable_kernel-src/include/ck_tile/core/arch/utility.hpp:27:18: error: constraint 'n' expects an integer constant expression /build/Debug/_deps/composable_kernel-src/include/ck_tile/core/arch/utility.hpp:27:18: error: constraint 'n' expects an integer constant expression /build/Debug/_deps/composable_kernel-src/include/ck_tile/core/arch/utility.hpp:27:18: error: constraint 'n' expects an integer constant expression /build/Debug/_deps/composable_kernel-src/include/ck_tile/core/arch/utility.hpp:27:18: error: constraint 'n' expects an integer constant expression /build/Debug/_deps/composable_kernel-src/include/ck_tile/core/arch/utility.hpp:27:18: error: constraint 'n' expects an integer constant expression /build/Debug/_deps/composable_kernel-src/include/ck_tile/core/arch/utility.hpp:27:18: error: constraint 'n' expects an integer constant expression /build/Debug/_deps/composable_kernel-src/include/ck_tile/core/arch/utility.hpp:27:18: error: constraint 'n' expects an integer constant expression fatal error: too many errors emitted, stopping now [-ferror-limit=] 20 errors generated when compiling for gfx90a. ... ``` </details>	2024-06-27 12:04:17 +08:00
Scott McKay	887a818aa7	Check for unit test log severity override earlier (#21177) ### Description Setting the log level after environment creation is too late in some cases. If the DML EP is enabled, it creates a composite sink containing the original logger (using the log severity at creation time) as well as an additional ETW sink. Because the composite sink saves the current severity level of each sink it contains, it is not possible to get verbose log output to stdout even if you set that at the session level. I don't know enough about the setup that combines ETW with the original sink to say whether we should also be updating the severity of individual sinks in the combined sink, so this change is limited to making the unit tests behave as expected when the default log severity is set in the background and not directly controlled. ### Motivation and Context Make it possible to get verbose output to stdout when the DML EP is enabled.	2024-06-27 12:51:13 +10:00
Vincent Wang	3c0b407709	Rollback 19832, Remove shape_input_merge Fusion (#21179) PR #19832 caused a Big Models pipeline failure when running Llama2. After the rollback, the pipeline is back to normal.	2024-06-26 10:00:45 -07:00
Scott McKay	337cc56d6f	Convert scalars to 1D to satisfy ML Program requirements. (#21159) ### Description Convert scalars to 1D to satisfy ML Program requirements. https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=1418617&view=logs&j=f7cc61a9-cc70-56e7-b06c-4668ca17e426&t=16d281b5-1bfd-5309-f274-36d0dffd9cb1&l=27167 ### Motivation and Context Fixes test failure in #17361	2024-06-26 09:54:36 -07:00
mindest	e2abba18ea	Skip softmax BF16 test for ROCm (#21162) ### Description Skip softmax BF16 test for ROCm, because BFloat16 is unsupported by MIOpen, and `torch.cuda.is_available()` also returns `True` for ROCm.	2024-06-26 11:15:50 +08:00
Wanming Lin	41ad83fb00	[WebNN EP] Support rest Reduction ops for TFLite backend (#21135) - reduceLogSum, reduceLogSumExp and reduceSumSquare have landed in https://chromium-review.googlesource.com/c/chromium/src/+/5575815 - reduceL1 and reduceL2 have landed in https://chromium-review.googlesource.com/c/chromium/src/+/5606091	2024-06-25 18:30:55 -07:00
Wanming Lin	4743803944	[WebNN EP] Support more Normalization ops for TFLite backend (#21151) The following Normalization ops are now supported in Chromium for the TFLite backend: - batchNormalization: https://chromium-review.googlesource.com/c/chromium/src/+/5532745 - layerNormalization: https://chromium-review.googlesource.com/c/chromium/src/+/5573326 - instanceNormalization: https://chromium-review.googlesource.com/c/chromium/src/+/5532750	2024-06-24 19:04:23 -07:00
Jian Chen	f81c0ec32a	Remove warning suppression from Java Packaging pipeline. (#21010) ### Description Remove warning suppression from Java Packaging pipeline. ### Motivation and Context We want the CI step not to produce warnings.	2024-06-24 16:46:21 -07:00
mindest	adaf0e8116	[Fix] USE_NCCL -> ORT_USE_NCCL (#21136) ### Description Correct the macro used when NCCL is enabled.	2024-06-24 11:33:17 -07:00
Wanming Lin	3a917e49fb	[WebNN EP] Support 4 more ops for TFLite backend (#21134) The WebNN TFLite backend recently added support for gelu, expand, softsign, and reciprocal.	2024-06-24 09:52:12 -07:00
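Two entries in the log above depend on ONNX's bidirectional broadcasting: the Expand fix in #21163 and the Llama2 parity failure in #21160. Here is a minimal sketch, assuming standard NumPy broadcasting rules, which ONNX Expand follows. It shows that the `newShape` #21163 says must be computed for WebNN's unidirectional `expand` is the broadcast of the input shape with the ONNX shape operand. `bidirectional_broadcast_shape` is a hypothetical helper written for illustration, not code from the WebNN EP.

```python
def bidirectional_broadcast_shape(a: list[int], b: list[int]) -> list[int]:
    """Return the NumPy-style broadcast shape of two static shapes.

    Shapes are right-aligned; at each axis the two sizes must be equal or
    one of them must be 1. Raises ValueError otherwise.
    """
    rank = max(len(a), len(b))
    # Left-pad the shorter shape with 1s so both shapes have the same rank.
    a_padded = [1] * (rank - len(a)) + list(a)
    b_padded = [1] * (rank - len(b)) + list(b)
    result = []
    for axis, (da, db) in enumerate(zip(a_padded, b_padded)):
        if da == db or db == 1:
            result.append(da)
        elif da == 1:
            result.append(db)
        else:
            raise ValueError(f"cannot broadcast on dim {axis}: {da} vs {db}")
    return result


# ONNX Expand broadcasts the input and the target shape in both directions,
# so the shape handed to a unidirectional expand must be the combined result.
input_shape = [3, 1]
target_shape = [2, 1, 4]
new_shape = bidirectional_broadcast_shape(input_shape, target_shape)
print(new_shape)  # [2, 3, 4]
```

With these shapes, a unidirectional expand would fail if given the raw ONNX operand `[2, 1, 4]`, because the input's size-3 axis cannot shrink to 1. The same rules explain the #21160 log: `{1,1,9,9}` against `{2,1,9,17}` pairs 9 with 17 on dim 3, so this helper raises `ValueError`.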
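For the MatMulNBits shape-inference entry (#21246), here is a rough sketch of the rule a symbolic shape inferencer could apply. It rests on our reading of the `com.microsoft` MatMulNBits contrib op, treated here as an assumption: input A has shape `[..., K]`, `K` and `N` are node attributes, and the output is `[..., N]`. Per the commit, B is rank-2, so it contributes no leading dimensions to merge with A's. `matmulnbits_output_shape` is a hypothetical name, not the method added to `symbolic_shape_infer.py`.

```python
from typing import Union

# A dimension is either a concrete size or a symbolic name such as "batch".
Dim = Union[int, str]


def matmulnbits_output_shape(a_shape: list[Dim], k: int, n: int) -> list[Dim]:
    """Hypothetical helper: infer the output shape of a MatMulNBits node.

    Assumes input A has shape [..., K] and the output has shape [..., N],
    with K and N taken from the node's attributes. B is a packed constant
    weight, so its dims are not merged with A's leading dims.
    """
    if not a_shape:
        raise ValueError("input A must have rank >= 1")
    last = a_shape[-1]
    # Only a concrete last dimension can be checked against K.
    if isinstance(last, int) and last != k:
        raise ValueError(f"last dim of A is {last}, expected K={k}")
    # Leading dims (possibly symbolic) pass through unchanged.
    return list(a_shape[:-1]) + [n]


print(matmulnbits_output_shape(["batch", "seq", 4096], k=4096, n=11008))
# ['batch', 'seq', 11008]
```

Keeping symbolic names such as `"batch"` as strings reflects the general idea of symbolic inference: leading dimensions stay symbolic while only the last axis is replaced.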
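#19007 describes replacing several slightly different recursive flatten/unflatten routines with a small set of primitive functions. The pair below illustrates that idea with a single schema-based implementation. It is not ORTModule's code, and `flatten` and `unflatten` are hypothetical names. Assumptions:

- dicts, lists, and tuples are the only containers;
- anything else, for example a `torch.Tensor`, is a leaf;
- namedtuples would come back as plain tuples.

```python
from typing import Any

_LEAF = object()  # marks a position in the schema that held a leaf value
_END = object()   # sentinel for detecting missing or leftover leaves


def flatten(data: Any) -> tuple[list[Any], Any]:
    """Split a nested dict/list/tuple structure into (leaves, schema).

    Every value that is not a dict, list, or tuple is treated as a leaf
    (in ORTModule these would typically be torch.Tensor objects).
    """
    leaves: list[Any] = []

    def walk(node: Any) -> Any:
        if isinstance(node, dict):
            return {key: walk(value) for key, value in node.items()}
        if isinstance(node, list):
            return [walk(item) for item in node]
        if isinstance(node, tuple):
            return tuple(walk(item) for item in node)
        leaves.append(node)
        return _LEAF

    schema = walk(data)
    return leaves, schema


def unflatten(leaves: list[Any], schema: Any) -> Any:
    """Rebuild the nested structure described by `schema` from `leaves`."""
    it = iter(leaves)

    def build(node: Any) -> Any:
        if node is _LEAF:
            value = next(it, _END)
            if value is _END:
                raise ValueError("fewer leaves than schema positions")
            return value
        if isinstance(node, dict):
            return {key: build(value) for key, value in node.items()}
        if isinstance(node, list):
            return [build(item) for item in node]
        if isinstance(node, tuple):
            return tuple(build(item) for item in node)
        raise TypeError(f"unexpected schema node: {node!r}")

    result = build(schema)
    if next(it, _END) is not _END:
        raise ValueError("more leaves than schema positions")
    return result


outputs = {"loss": 0.5, "logits": [1.0, 2.0], "extra": (3, {"mask": 4})}
leaves, schema = flatten(outputs)
print(leaves)  # [0.5, 1.0, 2.0, 3, 4]
restored = unflatten(leaves, schema)
print(restored == outputs)  # True
```

Keeping the schema separate from the leaves lets one pair serve both directions the commit mentions. It flattens user inputs into a positional list and restores ORT outputs into the module's original output structure.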


11312 commits