onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-07-19 19:00:47 +00:00

Author	SHA1	Message	Date
Scott McKay	68b6064be6	Fix reporting of unused initializers in subgraphs (#19341 ) ### Description Increment num_resolves_ inside the graph resolve finalization function so the subgraphs have the same value. This prevents incorrect output regarding removing unused initializers. ### Motivation and Context #19141	2024-02-01 08:02:12 +10:00
Yi-Hong Lyu	55b60d8fe0	Turn off Neural Speed to avoid slowdowns (#19265 ) Disable Neural Speed to prevent the operation following MatMulNBits from significantly slowing down.	2024-01-31 13:40:25 -08:00
Adrian Lizarraga	ca8d4459d4	Add contrib Q/DQ ops to symbolic shape inference tool (#19340 ) ### Description Adds type/shape inferencing support for MSFT domain QuantizeLinear and DequantizeLinear operators to symbolic_shape_infer.py ### Motivation and Context Need a way to infer the types and shapes of Q/DQ ops in models that use the MSFT domain versions (e.g., int16 quantization).	2024-01-31 10:38:01 -08:00
Phoebe Chen	2b361c04d6	Fix Flatbuffer build issue. (#19296 ) ### Description Building on g++ 13.2.0 results in -Wstringop-overread errors on Linux. This commit addresses the flatbuffer build issue with the following changes: 1. Remove the Werror flag in the flatbuffer patch. 2. Add a compilation option to suppress the 'stringop-overflow' error in the Flatbuffers within the xnnpack provider. ### Motivation and Context https://github.com/google/flatbuffers/issues/8119 https://github.com/microsoft/onnxruntime/pull/19239 Signed-off-by: Phoebe Chen <phoebe.chen@sifive.com>	2024-01-31 10:12:43 -08:00
zesongw	d87f73ab44	[WebNN EP] Use GetVecUint32FromVecInt64 to simplify the code (#19324 ) - Use the function `GetVecUint32FromVecInt64` in helper.h to replace `transform`. - Change some `int32_t` to `uint32_t`. - Remove a useless `temp`.	2024-01-31 00:20:07 -08:00
Baiju Meswani	3262e8df2f	Introduce a Nominal Checkpoint for On-Device Training (#19232 )	2024-01-30 22:11:25 -08:00
petermcaughan	4562c910fe	Whisper Crash Fix (#19345 ) ### Description There is a current bug in the BeamSearch implementation of T5, GPT, and Whisper due to an interaction between two PRs merged in the past 7 months. First PR/code change is the addition of BeamSearchScorer GPU implementation. This PR accelerates some operations by executing them in the GPU and not the CPU. The approach for this code change didn't utilize a cudaStream when copying one particular variable from GPU to CPU (see nullptr value here: [[link](`b65d3d0a53/onnxruntime/contrib_ops/cpu/transformers/beam_search_impl_t5.h (L213)`)]). The second PR/code change was the alteration to utilize a cudaStream to initialize various memory buffers in BeamSearch (see `stream` included as the last argument in these allocations [[link](`d1431e1b78/onnxruntime/contrib_ops/cpu/transformers/beam_search_impl_base.h (L25)`)]). During the in-between period of these two PRs, I believe neither allocation utilized a stream and were thus synchronized. Once the latter PR was merged, the copy became desynchronized with the initialization due to different streams. The fix for this is to reintroduce the same stream into the copy operation added in the first PR. ### Motivation and Context This does not happen reliably on every hardware with every script due to the race condition nature, but the bug completely breaks ORT execution with a BeamSearch model. --------- Co-authored-by: Peter McAughan <petermca@microsoft.com>	2024-01-30 21:53:18 -08:00
Yulong Wang	dd1f6ccc45	[js/webgpu] resolve codescan alert (#19343 ) ### Description resolve codescan alert: https://github.com/microsoft/onnxruntime/security/code-scanning/17687	2024-01-30 21:06:21 -08:00
Xu Xing	d73131cf0f	[js/webgpu] Use DataType as uniform cpu type (#19281 ) This saves turning data type to string by tensorDataTypeEnumToString.	2024-01-30 21:05:08 -08:00
Jiajia Qin	85cef0af8c	[js/webgpu] Support capture and replay for jsep (#18989 ) ### Description This PR expands the graph capture capability to JS EP, which is similar to #16081. But for JS EP, we don't use the CUDA Graph; instead, we record all GPU commands and replay them, which removes most of the CPU overhead and avoids the situation where the GPU waits for the CPU. mobilenetv2-12 goes from 6ms to 3.7ms on NV 3090 and from 4.58ms to 3.38ms on Intel A770. All limitations are similar to those of the CUDA EP: 1. Models with control-flow ops (i.e. If, Loop and Scan ops) are not supported. 2. Usage of graph capture is limited to models where-in all ops in the model can be partitioned to the JS EP or CPU EP and no memory copy between them. 3. Shapes of inputs/outputs cannot change across inference calls. 4. IObinding is required. The usage is like below: Method 1: specify output buffers explicitly.
```
const sessionOptions = {
  executionProviders: [
    {
      name: "webgpu",
    },
  ],
  enableGraphCapture: true,
};
const session = await ort.InferenceSession.create('./models/mobilenetv2-12.onnx', sessionOptions);

// prepare the inputBuffer/outputBuffer
// ...

const feeds = {
  'input': ort.Tensor.fromGpuBuffer(inputBuffer, { dataType: 'float32', dims })
};

const fetches = {
  'output': ort.Tensor.fromGpuBuffer(outputBuffer, { dataType: 'float32', dims: [1, 1000] })
};

let results = await session.run(feeds, fetches);  // The first run will begin to capture the graph.

// update inputBuffer content
// ...

results = await session.run(feeds, fetches);  // The 2nd run and after will directly call replay to execute the graph.

// ...
session.release();
```
Method 2: Don't specify output buffers explicitly. Internally, when graph capture is enabled, it will set all outputs location to 'gpu-buffer'.
```
const sessionOptions = {
  executionProviders: [
    {
      name: "webgpu",
    },
  ],
  enableGraphCapture: true,
};
const session = await ort.InferenceSession.create('./models/mobilenetv2-12.onnx', sessionOptions);

// prepare the inputBuffer
// ...

const feeds = {
  'input': ort.Tensor.fromGpuBuffer(inputBuffer, { dataType: 'float32', dims })
};

let results = await session.run(feeds);  // The first run will begin to capture the graph.

// update inputBuffer content
// ...

results = await session.run(feeds);  // The 2nd run and after will directly call replay to execute the graph.

// ...
session.release();
```
	2024-01-30 18:28:03 -08:00
Scott McKay	6dd0079d13	Exclude more code from custom_ops.cc when not required in minimal build (#19142 ) ### Description - Split out the code that implements the OrtKernelContext API (used by compiled nodes and custom ops) and the code that implements the custom ops API. - Exclude based on minimal build settings using helpers - the main change is to simply wrap the implementation into a lambda so it can be easily enabled/disabled - actual implementation of all functions is unchanged - Re-organize so the related implementations are together - most diffs are from this, but without the reorg it would be much harder to know which helper to use - General cleanup of lines that were too long. ### Motivation and Context Saves ~10KB in a minimal build. Build command used for comparison ``` ./build --android --android_api=29 --android_sdk="d:\Android" --android_abi=arm64-v8a --parallel --android_ndk_path="D:\Android\ndk\26.0.10792818\" --build_shared_lib --cmake_generator Ninja --skip_tests --minimal_build --disable_rtti --disable_ml_ops --disable_exceptions --cmake_extra_defines=onnxruntime_BUILD_UNIT_TESTS=OFF --include_ops_by_config .\no_ops.config --config MinSizeRel ``` Main: 1,218,480 bytes With changes: 1,208,320 bytes	2024-01-31 12:25:34 +10:00
Wanming Lin	1e936bfd63	[WebNN] Ignore empty optional input tensor (#19235 ) Empty optional input tensors are indicated by an empty name, which are allowed and we should just ignore them.	2024-01-30 18:09:16 -08:00
Yi Zhang	e74f141338	Save stablediffusion and open-clip in pipeline cache (#19314 ) ### Description 1. save the model to pipeline cache 2. lower the similarity bar to 97 3. publish the generated image so that we can check it when the test fails ### Motivation and Context Reduce model downloads	2024-01-31 09:39:27 +08:00
Adrian Lizarraga	0c38e96bb5	[Quant tool] Ensure MSFT opset for Q/DQ models (#19335 ) ### Description Updates qdq quantization to ensure the final model has the `com.microsoft` opset import if the model uses Q/DQ ops with the `com.microsoft` domain (e.g., for int16 quantization) ### Motivation and Context Need to ensure the MSFT domain is correctly set for all relevant cases. Otherwise, shape inferencing tools will raise an exception.	2024-01-30 17:19:08 -08:00
Jiajia Qin	90883a366a	[js/webgpu] Add hardSigmoid activation for fusedConv (#19233 ) ### Description Add hardSigmoid activation for fusedConv. It will be used by mobilenetv3-small-100 model.	2024-01-30 16:28:53 -08:00
Changming Sun	8dad9d92f4	Move einsum's test data to constexpr variables (#19320 ) ### Description emscripten's C++ compiler has difficulty on compiling einsum_test.cc because the file has too many local variables. So I moved them to constexpr.	2024-01-30 15:59:37 -08:00
Edward Chen	c379a89bcb	[MLAS AArch64] SQNBitGemm optimization (#19272 ) 1. Add support for packing 4-bit values 32 at a time for CompInt8. 32 4-bit values can fit into a single 128-bit NEON register. For CompInt8, this enables a more efficient path for block sizes greater than or equal to 32. CompFp32 seems to do better with handling 16 elements at a time, so this 32-value packing is not used there. Pack differently based on compute type. Adjust APIs to handle this. 2. Introduce template argument for whether to handle zero-point. This results in less code for the no zero-point (symmetric) case. However, there is a binary size increase due to the additional template instantiations.	2024-01-30 14:29:12 -08:00
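The packing arithmetic in the SQNBitGemm entry above (32 values of 4 bits each fill exactly 128 bits) can be illustrated with a small sketch. The function names and the byte layout (two adjacent values per byte, low nibble first) are assumptions made for illustration; the actual MLAS layout may order or interleave the values differently.

```python
def pack_4bit_block(values):
    """Pack 32 unsigned 4-bit values into 16 bytes (128 bits).

    Assumed layout: values[2*i] goes in the low nibble of byte i and
    values[2*i + 1] in the high nibble. This is illustrative only.
    """
    if len(values) != 32:
        raise ValueError("expected exactly 32 values")
    if any(not 0 <= v <= 15 for v in values):
        raise ValueError("each value must fit in 4 bits")
    return bytes(values[2 * i] | (values[2 * i + 1] << 4) for i in range(16))


def unpack_4bit_block(packed):
    """Inverse of pack_4bit_block: recover the 32 4-bit values."""
    out = []
    for byte in packed:
        out.append(byte & 0x0F)  # low nibble
        out.append(byte >> 4)    # high nibble
    return out


values = [i % 16 for i in range(32)]
packed = pack_4bit_block(values)
print(len(packed) * 8, "bits")
```

A 16-byte block is the width of one 128-bit NEON register, which is the property the commit message relies on.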
Changming Sun	04afe77305	Update ThirdPartyNotices.txt: Add Intel neural-speed (#19332 ) Add Intel neural-speed to ThirdPartyNotices.txt because it will be shipped in the default build in most of our packages.	2024-01-30 12:40:30 -08:00
kunal-vaishnavi	febec1c586	Update Whisper export with beam search (#19322 ) ### Description This PR updates the Whisper export with beam search by adding the following. - Fixes a bug when running `DecoderMaskedMultiHeadAttention` in the Whisper with beam search model - Sets the default PyTorch attention implementation to `eager` to allow existing attention fusions to continue working - Re-uses the cache directory when loading the PyTorch model to reduce memory used on disk - Adds `--disable_auto_mixed_precision` to the example FP16 export command ### Motivation and Context - [This PR](https://github.com/microsoft/onnxruntime/pull/19112) added the `is_unidirectional` parameter to `CheckInputs`, but it was not provided when checking the inputs in `DecoderMaskedMultiHeadAttention`. - [This PR](https://github.com/microsoft/onnxruntime/pull/19200) explains the reasoning behind why `eager` is used to load the `WhisperAttention` class. - By re-using the cache directory for loading the PyTorch model, only one copy of the PyTorch model is saved on disk instead of two copies. - By providing this flag, there will be less Cast nodes in the Whisper with beam search model to switch between FP16 and FP32 precision.	2024-01-30 11:59:15 -08:00
ivberg	3454f86e70	Windows - Only set thread affinity on Server with auto affinity (#19318 ) ### Description Only set thread affinity on Server with auto affinity. Auto affinity = when the API user does not specify thread settings or affinity themselves. ### Motivation and Context On client it is best to let the OS scheduler handle this. On big (P-Core) / little (E-Core) CPU designs, affinity overrides win32 Quality of Service (QoS) and has high power usage. Specifically, on background workloads whose process is tagged QoS Utility (Background), this affinity setting overrides the OS scheduler, which only wants to schedule on the E-Cores. Thus P-Cores waking up uses more energy than intended on client, and users get less battery life. Foreground AI workloads would be tagged QoS High and would run the ORT threads on all cores.	2024-01-30 10:53:10 -08:00
liqun Fu	b84cb247e3	io_binding to handle optional input of sequence type_proto (#19273 )	2024-01-30 10:25:14 -08:00
Wei-Sheng Chin	ffc3431a66	Update ScatterElements to Support Opset 13, 15, 18 (#19198 ) `ScatterElements` in opset 18 has been around for a while. However, the highest opset supporting `ScatterElements` in ORT is 13. This PR implements this op in CUDA EP by replacing `assignment` in the current CUDA kernel with `atomic reduction` (e.g., atomic add, atomic max). A series of fundamental atomic functions (e.g., atomic max for int8_t and half) are implemented in `common.cuh`; the implementation is general enough to cover old CUDA and new CUDA versions. - The core changes are in `cuda/atomic/common.cuh` with very detailed documentation including `bit-wise operation's visualization`. They are also copied to `rocm/atomic/common.cuh` to support AMD GPU. - `/cuda/tensor/gather_elements_impl.cu` contains small changes to call the new atomic functions to support new `reduction` behavior in new `ScatterElements`. - New `ScatterElements` are defined in `rocm_execution_provider.cc` and `cuda_execution_provider.cc`.	2024-01-30 09:18:50 -08:00
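To show what the `reduction` attribute changes in `ScatterElements`, here is a minimal sequential reference for the 1-D case. It is a sketch, not the CUDA kernel from the commit: the function name is hypothetical, and a GPU implementation needs atomic operations because several threads may target the same output index at once.

```python
def scatter_elements_1d(data, indices, updates, reduction="none"):
    """Sequential 1-D reference for ScatterElements with a reduction mode.

    With reduction="none" a duplicate index is simply overwritten; with
    "add", "mul", "max" or "min" the update is combined with the value
    already stored at that index.
    """
    combine = {
        "none": lambda old, new: new,
        "add": lambda old, new: old + new,
        "mul": lambda old, new: old * new,
        "max": max,
        "min": min,
    }
    if reduction not in combine:
        raise ValueError(f"unsupported reduction: {reduction}")
    out = list(data)  # leave the input untouched
    for idx, upd in zip(indices, updates):
        out[idx] = combine[reduction](out[idx], upd)
    return out


data = [1, 2, 3, 4]
indices = [1, 1, 3]      # index 1 appears twice
updates = [10, 20, 30]
print(scatter_elements_1d(data, indices, updates, "add"))
print(scatter_elements_1d(data, indices, updates, "max"))
```

With duplicate indices, plain assignment depends on write order, whereas `add`, `max`, and `min` give the same result in any order, which is why an atomic reduction is a natural fit on a GPU.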
Rachel Guo	3e17ca3dab	Fix iOS artifacts issue in Microsoft.ML.OnnxRuntime Nuget Package (#19311 ) ### Description Updates to only include ios archs framework in artifacts included in Nuget Package. ### Motivation and Context Related issue: https://github.com/microsoft/onnxruntime/issues/19295#issuecomment-1914143256 --------- Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net> Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>	2024-01-30 08:44:20 -08:00
Changming Sun	a92802f940	Disable a few tests for wasm build (#19316 )	2024-01-30 08:16:57 -08:00
Vincent Wang	9f68a27c7a	[ORTModule] Handle Cast on Constant Number on Triton Code-gen (#19321 ) When using scaled_dot_product_attention on float16 type, the exported graph has Sqrt(float16(constant)), which cannot be constant-folded in ORT because the Sqrt CPU kernel doesn't support float16. This causes Triton code-gen to generate code like: result = 128.0.to(tl.float32) This code cannot be compiled because .to() cannot be applied to a constant. This PR handles this case so that the Cast is not applied to a constant number.	2024-01-30 17:04:01 +08:00
Xu Xing	624b4e2063	[js/webgpu] Remove enableShapesUniforms (#19279 )	2024-01-29 17:49:06 -08:00
Chi Lo	00d048121b	[TensorRT EP] Fix InferenceSession::Run() not thread-safe issue (#19301 ) Given that InferenceSession::Run() is guaranteed to be thread-safe, meaning multiple threads can call this function concurrently, TRT EP needs to handle concurrency carefully here; otherwise, the following concurrency issue might happen: - To perform inference concurrently in multiple streams, it is suggested to use one trt execution context per stream. In the design of TRT EP (which does not apply a per-thread context implementation), if multiple threads are calling InferenceSession::Run() concurrently, the trt execution context instance is shared by all the threads and each thread acquires a different stream from ORT. So TRT EP will end up having one trt execution context using multiple streams, which is not suggested. But, since the whole compute_func() is protected by the lock, if cudaStreamSynchronize() is enforced here, one trt execution context per stream is guaranteed. Therefore, TRT EP needs to call cudaStreamSynchronize() in compute_func(), which means waiting until the stream has completed all operations, to prevent the concurrency problem reported in this GitHub issue: https://github.com/microsoft/onnxruntime/issues/19275	2024-01-29 17:36:27 -08:00
Baiju Meswani	465540d29b	Update training api python documentation (#19287 )	2024-01-29 14:14:15 -08:00
Changming Sun	e91d91ae4f	Fix a build issue: /MP was not enabled correctly (#19190 ) ### Description In PR #19073 I misunderstood the value of "--parallel". Instead of testing whether args.parallel is None, I should test the returned value of the number_of_parallel_jobs function. If build.py was invoked without --parallel, then args.parallel equals 1, because that is the default value. Then we should not add "/MP". However, the current code adds it, because `if args.parallel` is evaluated as `if 1`, which is True. If build.py was invoked with --parallel without an additional number, then args.parallel equals 0, because the number is unspecified. Then we should add "/MP". However, the current code does not add it, because `if args.parallel` is evaluated as `if 0`, which is False. This also adds a new build flag: use_binskim_compliant_compile_flags, which is intended to be used only in the ONNX Runtime team's build pipelines for compliance reasons.	2024-01-29 12:45:38 -08:00
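The truthiness pitfall in the /MP entry above can be reproduced with a small `argparse` sketch. The parser below only mimics the behaviour the message describes (1 when `--parallel` is absent, 0 when it is given without a number); the helper `number_of_parallel_jobs` here is a simplified stand-in with an assumed `cpu_count` parameter, not the real function from build.py.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser()
    # Absent flag -> default 1; bare "--parallel" -> const 0; "--parallel N" -> N.
    parser.add_argument("--parallel", nargs="?", const=0, default=1, type=int)
    return parser


def number_of_parallel_jobs(args, cpu_count=8):
    # Simplified stand-in: 0 is taken to mean "use every available core".
    return cpu_count if args.parallel == 0 else args.parallel


def buggy_should_add_mp(args):
    # Truthiness test: wrong for both the default (1) and the bare flag (0).
    return bool(args.parallel)


def fixed_should_add_mp(args):
    # Test the resolved job count instead of the raw argument value.
    return number_of_parallel_jobs(args) > 1


for argv in ([], ["--parallel"], ["--parallel", "4"]):
    args = build_parser().parse_args(argv)
    print(argv, args.parallel, buggy_should_add_mp(args), fixed_should_add_mp(args))
```

The buggy check gives the wrong answer in the first two cases: it enables the flag for a serial build and disables it for a bare `--parallel`.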
Changming Sun	4ee222413f	Update OneBranch.Nuget-WindowsAI-Pipeline.Official.yml for Azure Pipelines (#19293 ) To fix a pipeline issue.	2024-01-29 12:00:42 -08:00
Guenther Schmuelling	9e69606360	fix f16 for attention, enable slice and flatten for more types (#19262 )	2024-01-29 10:13:46 -08:00
Yi Zhang	e96a038f01	Add VP test in Stable diffusion pipeline (#19300 ) ### Description 1. Add visual parity test based on openai clip model 2. Add trigger rules ### Motivation and Context 1. check generated image is expected 2. reduce unnecessary triggers	2024-01-29 09:33:58 -08:00
PeixuanZuo	82c1cb416b	[CUDA] Refactor GroupNorm and add common vectorize implementation (#19158 ) Co-authored-by: Peixuan Zuo <peixuanzuo@microsoft.com@orttrainingdev7.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>	2024-01-29 09:15:10 +08:00
Adrian Lizarraga	6d7ac9c93a	Support general session config entries in perf test tool (#19289 ) ### Description Adds the ability to specify general session configuration entries via the `-C` command-line option. Example: `-C "session.disable_cpu_ep_fallback\|1 ep.context_enable\|1"` Some session config entries can already be set via dedicated command-line options. If the user uses multiple command-line options to set the same session config entry, we'll print a warning. Note that the dedicated command-line options will take precedence. ### Motivation and Context Allows setting session configurations when testing EPs. QNN EP, for example, uses the `session.disable_cpu_ep_fallback` and `ep.context_*` options.	2024-01-26 19:51:48 -08:00
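The `-C` value in the entry above is a space-separated list of key/value pairs. A minimal parsing sketch follows, assuming the separator between key and value is a plain `|` (the backslash in the example is taken to be an escaping artifact of the rendered page); `parse_session_config` is a hypothetical helper, not the perf test tool's own code.

```python
def parse_session_config(arg):
    """Parse 'key1|value1 key2|value2' into a dict of session config entries."""
    entries = {}
    for token in arg.split():
        key, sep, value = token.partition("|")
        if not sep or not key:
            raise ValueError(f"expected key|value, got: {token!r}")
        entries[key] = value
    return entries


config = parse_session_config("session.disable_cpu_ep_fallback|1 ep.context_enable|1")
for key, value in config.items():
    print(key, "=", value)
```

Values are kept as strings here, since session config entries are string key/value pairs.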
Tianlei Wu	d7ff81dfb7	[CUDA] support user_compute_stream in python API (#19229 ) ### Description It is an important feature to pass user cuda stream to avoid synchronization in python API. Here we allow user to pass cuda stream for CUDA provider. Note that TRT or ROCm provider need similar change, which are not included in this pull request. Note that we will set `has_user_compute_stream` automatically based on whether there is cuda stream passed, so setting `has_user_compute_stream` through python API has no effect. ### Motivation and Context https://github.com/microsoft/onnxruntime/issues/19094	2024-01-26 10:34:43 -08:00
cao lei	7d4dc66846	ExecutionProvider API refactor - make GenerateMetaDefId a standalone function, decouple it from EP (#18977 ) ### Description Make EP's member function, GenerateMetaDefId, a standalone function which decouples from EP ### Motivation and Context This change is for ExecutionProvider API refactoring, we will make a clean ExecutionProvider API first for later EPv2 work	2024-01-26 07:39:08 -08:00
Baiju Meswani	fc44f96ad5	Add support for a collection of OrtValue as inputs and outputs to C# TrainingSession (#19048 )	2024-01-25 21:55:36 -08:00
Tianlei Wu	358650d441	Fix BigModel stable diffusion pipeline (#19277 ) ### Description Fix two issues: (1) We can only use single quote inside `bash -c "..."`. Current pipeline job stopped at `python3 demo_txt2img.py astronaut` and skip the following commands. In this change, we remove the remaining commands to get same effect (otherwise, the pipeline runtime might be 2 hours instead of 15 minutes). (2) Fix a typo of Stable.	2024-01-25 17:19:04 -08:00
Xu Xing	a3f0e2422b	[js/webgpu] Support f16 uniform (#19098 )	2024-01-25 16:58:22 -08:00
Tianlei Wu	8b4517218b	Remove USE_CUTLASS flag (#19271 ) ### Description Since Cutlass can be built with CUDA 11.4 (The minimum CUDA version for onnxruntime CUDA build), there is no need to have a flag to disable cutlass. Changes: (1) Reverted https://github.com/microsoft/onnxruntime/pull/18761 (2) remove the condition to build cutlass. (3) Fix a few build errors or warnings during testing CUDA 11.4 build. Note that SM 89 and 90 (including fp8) require CUDA 11.8 or later. Flash attention and cutlass fused multihead attention will not be built for CUDA < 11.6. It is recommended to use CUDA 11.8 or above to build if you want to support latest GPUs. It is better to include it in 1.17.0 (otherwise, the release branch might encounter build failure with CUDA 11.4). Tests: (1) Build with flash attention and efficient attention off: passed (2) Build with CUDA 11.4: passed Example build command used in Ubuntu 20.04: ``` export CUDA_HOME=/usr/local/cuda-11.4 export CUDNN_HOME=/usr/lib/x86_64-linux-gnu/ export CUDACXX=/usr/local/cuda-11.4/bin/nvcc sh build.sh --config Release --build_shared_lib --parallel --use_cuda --cuda_version 11.4 \ --cuda_home $CUDA_HOME --cudnn_home $CUDNN_HOME --build_wheel --skip_tests \ --cmake_extra_defines CMAKE_CUDA_ARCHITECTURES=80 \ --disable_types float8 ```	2024-01-25 16:57:58 -08:00
Xu Xing	656ca66186	[js/webgpu] Support uniforms for conv, conv transpose, conv grouped (#18753 )	2024-01-25 15:37:05 -08:00
Chi Lo	a2867b911e	[TensorRT EP] Fix mem leak for TRT plugins custom ops (#19248 ) TRT EP's GetTensorRTCustomOpDomainList() will create a vector of OrtCustomOpDomain objects and release the ownership of those objects. But those objects are never released. At session level, we need to make TRT EP remember which OrtCustomOpDomain objects it created and release them at EP destruction time.	2024-01-25 11:51:39 -08:00
Tianlei Wu	2b285cd78a	[CUDA] Add functions to dump bfloat16 tensors (#19266 ) ### Description GroupQueryAttention added BFloat16 support in https://github.com/microsoft/onnxruntime/pull/19095, and there is a build error when dumping is enabled. This change supports printing bfloat16 tensors to the console.	2024-01-25 09:30:15 -08:00
Jiajie Hu	5b06505073	[js/webgpu] Fix Tanh explosion (#19201 ) ### Description ```math \tanh(x)=\frac{e^x-e^{-x}}{e^x+e^{-x}}= \left\{ \begin{array}{cc} -\frac{1-e^{-2\cdot(-x)}}{1+e^{-2\cdot(-x)}}, & x<0 \\ 0, & x=0 \\ \frac{1-e^{-2x}}{1+e^{-2x}}, & x>0 \end{array} \right. ``` ### Motivation and Context On some platforms, $$\tanh(1000)=\frac{e^{1000}-e^{-1000}}{e^{1000}+e^{-1000}}$$ would produce NaN instead of 0.999... or 1 (imagine $e^{1000}=\infty$ and $\frac{\infty}{\infty}$ explodes).	2024-01-25 08:25:35 -08:00
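The piecewise form in the Tanh entry above only ever evaluates an exponential with a non-positive argument, so the exponential cannot overflow. Below is a minimal Python sketch of that idea; it is not the WGSL shader code from the commit, and `stable_tanh` is a name chosen for illustration.

```python
import math


def stable_tanh(x):
    """tanh(x) via (1 - e^{-2|x|}) / (1 + e^{-2|x|}), with the sign of x restored.

    The exponent -2|x| is never positive, so exp() stays in (0, 1] and
    underflows harmlessly to 0.0 for large |x| instead of overflowing.
    """
    if x == 0:
        return 0.0
    e = math.exp(-2.0 * abs(x))
    return math.copysign((1.0 - e) / (1.0 + e), x)


# The naive quotient degenerates to inf / inf, which is NaN in IEEE arithmetic.
print(math.inf / math.inf)
print(stable_tanh(1000.0), stable_tanh(-1000.0), stable_tanh(0.5))
```

Note that Python's `math.exp(1000)` raises `OverflowError` rather than returning infinity, so the NaN case is shown with `math.inf` directly.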
PeixuanZuo	1c92e56dc0	[Cuda] Refactor GroupNorm (#19146 ) Split the GroupNorm implementation into multiple files so that the ROCm EP can reuse the CUDA code. Related PR: https://github.com/microsoft/onnxruntime/pull/19158 --------- Co-authored-by: Peixuan Zuo <peixuanzuo@microsoft.com@orttrainingdev7.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>	2024-01-25 22:28:47 +08:00
Vincent Wang	2b87dd373a	[ORTModule] Remove Mod from Hash to Avoid Conflict for Triton Code-gen (#19256 ) Remove mod (10**8) from hash to avoid conflict for Triton code-gen.	2024-01-25 10:16:41 +08:00
Dmitri Smirnov	7dd1f4b8e2	Pad-18 Cuda implementation (#19211 ) ### Description Implement Pad-18 for Cuda. ### Motivation and Context Latest models converted by Dynamo fall back on CPU for Pad with performance degradation. This contributes to https://github.com/microsoft/onnx-rewriter/issues/126	2024-01-24 18:12:04 -08:00
Phoebe Chen	4477f57ee3	Enable RISC-V 64-bit Cross-Compiling Support for ONNX Runtime on Linux (#19238 ) ### Description This pull request introduces the necessary changes to enable RISC-V 64-bit cross-compiling support for the ONNX Runtime on Linux. The RISC-V architecture has gained popularity as an open standard instruction set architecture, and this contribution aims to extend ONNX Runtime's compatibility to include RISC-V, thereby broadening the reach of ONNX models to a wider range of devices. ### Motivation and Context RISC-V is a free and open-source instruction set architecture (ISA) based on established RISC principles. It is provided under open licenses without fees. Due to its extensibility and freedom in both software and hardware, RISC-V is poised for widespread adoption in the future, especially in applications related to AI, parallel computing, and data centers. ### Example Build Command ``` ./build.sh --parallel --config Debug --rv64 --riscv_toolchain_root=/path/to/toolchain/root --skip_tests ``` ### Documentation Updates Relevant sections of the documentation will be updated to reflect the newly supported RISC-V 64-bit cross-compilation feature. https://github.com/microsoft/onnxruntime/pull/19239 --------- Signed-off-by: Phoebe Chen <phoebe.chen@sifive.com>	2024-01-24 16:27:05 -08:00
Wanming Lin	0c2f0ba90d	[WebNN EP] Support conv1d by reshaping with prepended 1's (#18857 ) WebNN only supports 4-D inputs for conv2d and convTranspose2d, this PR supports 3-D inputs (i.e. conv1d) by prepending a 1 size dimension and several reshape operations.	2024-01-24 15:53:10 -08:00
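The reshape trick in the conv1d entry above rests on the fact that a 1-D convolution equals a 2-D convolution over an input of height 1. The sketch below checks that on plain lists, for a single batch and channel with no padding or stride. Where exactly the WebNN EP inserts the size-1 dimension is not stated in the message, so adding it as the height is an assumption, and all function names are illustrative.

```python
def conv1d(x, w):
    """Valid (no padding) 1-D cross-correlation of list x with kernel w."""
    k = len(w)
    return [sum(x[i + j] * w[j] for j in range(k)) for i in range(len(x) - k + 1)]


def conv2d(x, w):
    """Valid 2-D cross-correlation of an H x W input with a KH x KW kernel."""
    kh, kw = len(w), len(w[0])
    return [
        [
            sum(x[i + a][j + b] * w[a][b] for a in range(kh) for b in range(kw))
            for j in range(len(x[0]) - kw + 1)
        ]
        for i in range(len(x) - kh + 1)
    ]


def conv1d_via_conv2d(x, w):
    # Reshape [L] -> [1, L] and [K] -> [1, K], run the 2-D op, then drop the 1.
    out2d = conv2d([x], [w])
    return out2d[0]


x = [1, 2, 3, 4]
w = [1, 0, -1]
print(conv1d(x, w), conv1d_via_conv2d(x, w))
```

Both calls produce the same values, so a backend that only offers the 2-D operator can serve 1-D inputs at the cost of the extra reshape operations.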
Wanming Lin	7252c6e747	[WebNN EP] Support WebNN async API with Asyncify (#19145 )	2024-01-24 15:37:35 -08:00


10478 commits