onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-07-05 04:17:53 +00:00

Author	SHA1	Message	Date
Jian Chen	29e40987e3	Update batch file to set PATH for Cuda with TRT (#18182 ) ### Description Update batch file to set PATH for Cuda with TRT	2023-10-31 10:22:40 -07:00
Vincent Wang	1c25fe5580	Fix PoliCheck (#18180 ) Fix PoliCheck by changing some words, which were from Triton flash attention's original code.	2023-10-31 13:53:11 +08:00
cloudhan	08dce54266	Improve tunable verbose log (#17328 )	2023-10-31 13:10:21 +08:00
Jian Chen	8a574b874c	Update setup_env_cuda.bat (#18176 )	2023-10-30 21:28:02 -07:00
Patrice Vignola	8ed9bd6eca	Add one more MHA mask pattern (#18164 ) Add an MHA mask pattern for the scenario where the mask has already been broadcasted via an Expand node.	2023-10-30 21:21:51 -07:00
PeixuanZuo	efef6407bc	[ROCm] update rocm package exclude libs (#18130 ) update rocm package exclude libs. - change librocblas.so.0 to librocblas.so.3 which is used on ROCm5.6 and ROCm5.7 - add librocfft.so.0, libhipfft.so.0, libhiprtc.so.5 and sort the list.	2023-10-31 08:41:01 +08:00
Jiajia Qin	785e2b1eae	[js/webgpu] Optimize softmax by vector (#18153 ) ### Description This PR enables `softmax` to output the maximum supported number of components per thread instead of a scalar. Softmax with input[0]: [12,4096,4096] drops from 55.11 ms to 47.86 ms.	2023-10-30 16:05:35 -07:00
Yufeng Li	90d1f537cb	optimize SLN with large dimension (#18138 ) ### Description Optimize SkipLayerNorm for large dimensions (>=2048) by handling 8 elements in one thread. This avoids re-writing and re-loading the sum of input, skip and bias to and from main memory. It reduces the latency for dimension 4096 with a small batch size from ~18us to ~3.8us on A100.	2023-10-30 14:12:17 -07:00
Patrice Vignola	348a963238	[DML EP] Handle non-raw data in dynamic graph compilation (#18160 )	2023-10-30 13:48:34 -07:00
Chen Fu	4819fbf31c	Augment blockwise quantization (#18101 ) ### Description Augment block wise 4b quantization -- plain CPU impl ### Motivation and Context Allow column wise or row wise blocks. Experiments show row wise quantization in LLM weight matrices achieves better precision. Added tests for quantization and dequantization code.	2023-10-30 09:14:37 -07:00
Hector Li	be2f72a315	[QNN EP] Disable early termination in GetCapability (#18140 ) [QNN EP] Disable early termination in GetCapability if there are multiple partitions and context binary is enabled ### Description The QNN EP context binary cache feature only supports a single partition for now. We have early termination in GetCapability. After PR https://github.com/microsoft/onnxruntime/pull/17764, there is no Level 1 optimization any more for the 1st GetCapability call, so the graph transformer EnsureUniqueDQForNodeUnit is not applied. As a result, if an initializer -> DQ is shared by multiple node units, the node is not identified as a node unit group, and QNN EP reports many unsupported nodes in the 1st GetCapability call. The 2nd GetCapability call still works normally. This change disables the early termination in GetCapability and delays the decision to Compile.	2023-10-30 08:34:49 -07:00
Yulong Wang	9bba990871	[js/web] fix a few package consuming problems (#18109 ) ### Description This PR tries to fix a part of the NPM package consuming problems for onnxruntime-web (ES module) as described in #10913: - reduce the package size to fit the 150MB restriction in jsdelivr, by removing dev build targets for uncommon exports - add default export to support `import ort from 'onnxruntime-web';` (currently only `import * as ort from 'onnxruntime-web';` is supported)	2023-10-30 08:11:43 -07:00
Yi Zhang	436056dcd7	Revert "Disable dml stage in windows GPU pipeline temporarily. (#18034 )" (#18150 ) This reverts commit `99b8dcaae2`. ### Motivation and Context Restore the dml stage in windows GPU pipeline. Agent issue is solved by adding Feature.DisableGpuDriver in pool properties.	2023-10-30 15:41:07 +08:00
Hariharan Seshadri	8ebdd3bbca	Fix regression in perf test runner (#18139 )	2023-10-29 19:26:12 -07:00
snadampal	0e34100484	create memory descriptors based on the tensor dimensions (#15848 ) ### Description The Arm Compute Library (ACL) backend requires explicit memory format tag initialization to decide whether the tensor can be computed with the ACL kernels. Hence, the src, weights and dst memory descriptor format is set based on the tensor dimensions instead of using the format::any tag. ### Motivation and Context The change enables ACL kernels for DNNL matmul ops on aarch64 platform.	2023-10-29 09:43:12 -07:00
Wei-Sheng Chin	24f9c1afe3	Distributed Expand (#18126 ) This PR implements DistributedExpand for llama 2. Representative Examples of DistributedExpand: - [shard on non-expanded axis] `input tensor (shape=[8, 1], spec=S[0]R, device_mesh=[0,1]) -> Expand(target_shape=[8, 2]) -> output tensor (shape=[8, 2], spec=S[0]R, device_mesh=[0,1])` - [sharding expanded axis is invalid since it must have dim=1 and axis with dim=1 cannot be sharded] `input tensor (shape=[1, 8], spec=S[0]R, device_mesh=[0,1]) -> Expand(target_shape=[2, 8]) -> output tensor (shape=[2, 8], spec=S[0]R, device_mesh=[0,1])` From those examples, we observe a few important behaviors. - The output sharding spec is always the same as the input sharding spec. - Expanding always happens on an axis with dimension=1. Otherwise, it will violate the broadcasting rule. - No communication is needed since all computation can happen locally. Let's consider the first example again. If you put the first half tensor (shape: [4, 1]) on device 0 and the second half (shape: [4, 1]) on device 1, then `Expand` it with target shape [4, 2], these two local tensors (shape: [4, 2]) are exactly the same as the one described by output sharding spec. Algorithm: - Compute logical (i.e., unsharded) shapes of input and output. - Compute sharded output shape from logical output. - Call Expand to broadcast local input to sharded output shape. How to review? - Start with [changes in onnxruntime_test_distributed.py](`ea33392f37`). Those tests are good examples for using this op. - [Read expand.h/expand.cc](`e4c49987f5`). These changes are for exposing functionalities in Expand to DistributedExpand. - Read distributed_expand.h/distributed_expand.cc. It follows the algorithm described above. The commit `68ac301bba` first sketches the definition of DistributedExpand. The next commit `0eb9330c3b` adds real implementation.	2023-10-28 00:44:02 -07:00
RandySheriffH	8daabf3f15	Tune min version supporting custom op ComputeV2 (#18134 ) Set min version supporting custom_op::ComputeV2 to 16, since the feature has been available since ort 1.16. Co-authored-by: Randy Shuai <rashuai@microsoft.com>	2023-10-27 16:09:07 -07:00
zesongw	d9695dea6d	[WebNN EP] Remove Conv initializer constraint for GPU (#18129 ) ### Description WebNN can now handle Conv with filter as input. ### Motivation and Context Support more models with WebNN.	2023-10-27 13:57:01 -07:00
sophies927	28ad3ff799	Fix stale bot issue (#18064 ) ### Description The previously used GitHub stale app is now deprecated, so I deleted that file and added a new GitHub Actions workflow to automatically apply the stale label to inactive issues.	2023-10-27 10:57:28 -07:00
Maximilian Müller	2eeafc37bc	Enable global TRT timing cache (#17865 ) I am adding a new `trt_timing_cache_path` option. Internally it is handled as `global_cache_path_` and will be set via a fall through approach: 1. no path provided => workdir 2. `trt_engine_cache_path` provided but no `trt_timing_cache_path` => `trt_engine_cache_path` 3. `trt_timing_cache_path` provided => `trt_timing_cache_path` (if not provided `trt_engine_cache_path` will still be workdir) ### Motivation and Context A TRT timing cache can be reused across multiple models as it only holds kernel timings and it is common that network "patterns" are reused. This can accelerate build times a lot. --------- Co-authored-by: Carson M <carson@pyke.io>	2023-10-27 09:23:19 -07:00
guyang3532	58f1d15d19	Replace Transpose with Reshape if they are equivalent (#18096 ) ### Description Transpose is equivalent to a Reshape if only empty dimensions (size 1) change place, and the non-empty dimensions keep the same order in the permuted tensor. Example: Shape=(1,1,1024,4096) -> perm=(2,0,3,1). This PR adds a graph transformer which replaces Transpose with Reshape if they are equivalent. Because Transpose needs a memory copy while Reshape doesn't, this replacement saves the memory copy overhead.	2023-10-27 23:50:18 +08:00
Xavier Dupré	b5f242e978	GemmFloat8 as a contrib ops (#16051 ) ### Description Add support for Gemm with float 8 as a contrib op. --------- Co-authored-by: Randy Shuai <rashuai@microsoft.com> Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com> Co-authored-by: Scott McKay <Scott.McKay@microsoft.com> Co-authored-by: Xavier Dupre <xadupre@microsoft.com@orttrainingdev9.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>	2023-10-27 14:33:55 +02:00
Xavier Dupré	c10b83eb68	Update python cryptography version to 41.0.4 (#18056 ) ### Description Version 41.0.0 currently used has vulnerabilities. ### Motivation and Context See [Vulnerable OpenSSL included in cryptography wheels](https://github.com/advisories/GHSA-v8gr-m533-ghj9)	2023-10-27 12:06:38 +02:00
Wei-Sheng Chin	9c32310673	Distributed Reshape Implementation (#18068 ) This DistributedReshape aims at supporting all sharding patterns encountered in llama 2. All patterns found are tested in `TestDistributedReshape` in `onnxruntime_test_distributed.py`. This PR implements algorithms to compute the categories below. - All inputs and outputs are replica, so it's computed like a normal Reshape. - Two-axis fusion (if any of the inputs and outputs are sharded). This category covers, e.g., `[batch, seq, hidden] -> [batch x seq, hidden]`. - Two-axis decomposition (if any of the inputs and outputs are sharded). This category covers, e.g., `[batch x seq, hidden] -> [batch, seq, hidden]`. Review guideline: - Ignore the changes in sharding_spec.h and sharding_spec.cc since they come from another PR #18025. - First, read onnxruntime_test_distributed.py to get familiar with the input/output of DistributedReshape. - Second, check the new APIs in reshape.h/reshape.cc to expose CUDA Reshape kernel to DistributedReshape. - For DistributedReshape, check its `ComputeInternal` for the 3 categories mentioned above.	2023-10-26 22:33:42 -07:00
kunal-vaishnavi	b79ea74819	Add updates to LLaMA scripts (#18076 ) ### Description This PR adds a few updates to scripts in the LLaMA folder: - Fixes the precision re-naming in the LLaMA export - Adds a "prerequisites" section in the README - Adds IO binding synchronizations during benchmarking for other EPs ### Motivation and Context - With precision re-naming, the LLaMA parity check does not produce errors when creating the FP32 CPU model - The "prerequisites" section shows that there are specific package versions needed - This allows for benchmarking with other EPs besides CPU and CUDA	2023-10-26 21:54:23 -07:00
mindest	0f3a067d3a	[FIX] reorder initializer (#18097 ) ### Description Fix build error when building with collective ops: an error is thrown because `device_mesh_axis` will be initialized after `cond`.	2023-10-27 11:29:55 +08:00
Vincent Wang	b7408f7389	[ORTModule] ATen Efficient Attention and Triton Flash Attention (#17959 ) This PR is to support efficient attention and flash attention in ORTModule, including: - Use ATen to call efficient attention, which requires PyTorch 2.2.0 dev or newer. Set ORTMODULE_USE_EFFICIENT_ATTENTION=1 to enable. - Integrate Triton Flash attention, which requires triton==2.0.0.dev20221202 and an A100 or H100. Set ORTMODULE_USE_FLASH_ATTENTION=1 to enable. - A python transformer tool to match sub-graph by config and write transformer quickly. The current transformers support attention mask for both efficient attn and flash attn, and dropout for efficient attn only. To support more training scenarios (such as causal mask in GPT2), more transformers need to be added. The feature is guarded by system environment variables; it won't affect any current behavior if not enabled. Since it requires specific PyTorch/Triton versions, related tests are not added for now.	2023-10-27 10:29:27 +08:00
Tang, Cheng	37873be86d	enable reduce ops on opset18 (#18053 ) ### Description Opset 18 applies the "axes as input" change from ReduceSum to all the other reduce ops. Our CUDA kernel actually supports it, but we didn't enable it for opset 18. This PR updates the reduce ops' kernel registration to enable the "axes as input" behavior for opset 18. As part of the fix, I also simplified the reduce op kernel registration. ORT doesn't require the kernel definition to be exactly the same as the ONNX op definition. In our case, where we share the same kernel for all the reduce ops (from version 1 to version 18), we don't need to maintain different versions of the kernel definitions; we can simplify by using a single kernel definition for multiple versions. In some cases we might register more types for legacy versions, but that is harmless, because the framework uses the schema, not the kernel definition, to validate the graph. --------- Co-authored-by: Cheng Tang <chenta@a100.crj0ad2y1kku1j4yxl4sj10o4e.gx.internal.cloudapp.net> Co-authored-by: Cheng Tang <chenta@microsoft.com>	2023-10-26 16:57:21 -07:00
Yang Gu	52f4968359	[js/webgpu] Change timestamp-query-in-passes to timestamp-query (#18108 ) Timestamp-query has broader support than timestamp-query-in-passes across platforms, including macOS. Note that to enable timestamp-query, you still need to add switch "--enable-dawn-features=allow_unsafe_apis" to Chrome. By default, the lowest 16 bits are masked with 0 (at a granularity about 0.1ms) for privacy. To get the highest precision, you need to add another switch "--enable-webgpu-developer-features".	2023-10-26 16:33:03 -07:00
Maximilian Müller	b7bee621cd	[CUDA] Remove shape warnings in NHWC <> NCHW unit tests (#17992 ) There were some warnings in https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=1170770 e.g. ``` [ RUN ] CudaNhwcTypedTest/1.AveragePoolNhwcPad @ /home/administrator/onnxruntime/onnxruntime/test/providers/cuda/nhwc/pool_test.cc:84 [W:onnxruntime:Default, graph.cc:108 MergeShapeInfo] Error merging shape info for output. 'Y' source:{1,16,66,66} target:{1,16,67,67}. Falling back to lenient merge. ``` These warnings were not specific to NHWC or NCHW but were just a miscalculation of the output shape in some tests.	2023-10-26 16:32:01 -07:00
Jian Chen	7c18c60bc2	Change cuda image for tensorRT to the one with cudnn8 (#18102 )	2023-10-26 16:28:57 -07:00
Ashwini Khade	f2e19a8ccf	Updates to training pipelines to reduce CI time (#18116 ) ### Description Motivation for this PR is reducing CI test time by removing unnecessary tests from the pipelines. The following changes reduce test time in pipelines: - Skip CPU model tests in GPU builds. Training CIs run these tests as a sanity check. There is no direct training code being tested in these pipelines; furthermore, CPU tests are already run in CPU pipelines, so there is no need to run them again in GPU builds and block the GPU VM. This change reduces testing time by 20-25 mins in all training GPU pipelines. - Delete debug package building pipeline for linux training packages. This was required by the compiler team at some point but there have been 0 downloads of these packages.	2023-10-26 14:58:57 -07:00
Wei-Sheng Chin	a514a68770	Support per-tensor device mesh at op level (#18025 ) Since Reshape may change device mesh from, e.g., [0, 1] to [0, 1, 0, 1], we can't assume same device mesh per op. At each operator, we replace a single operator-level device mesh - `device_mesh_shapes` - `device_mesh_elements` with per-tensor device meshes - `input_device_mesh_shapes` (input_device_mesh_shapes[i] is the device mesh's shape for the i-th input, e.g., "[3]" for 1-D mesh with 3 devices) - `input_device_mesh_elements` (input_device_mesh_elements[i] is the flattened device mesh elements for the i-th input; e.g., "[0, 1, 2]" if you have 3 devices in that mesh) - `output_device_mesh_shapes` - `output_device_mesh_elements` Check out the change in onnxruntime_test_distributed.py for examples. It's also heavily used in #18068's `onnxruntime_test_distributed.py` change.	2023-10-26 14:47:16 -07:00
Chi Lo	455a9ce614	[TensorRT EP] Use latest onnx-tensorrt parser (#18067 ) Use latest onnx-tensorrt to fix compile error. Please see the issue https://github.com/microsoft/onnxruntime/issues/18029	2023-10-26 13:55:12 -07:00
Jian Chen	b023de0bfc	Redo #18044 Install CUDA 12.2 on Windows (#18093 )	2023-10-26 10:12:46 -07:00
Caroline Zhu	64de71c5e2	[js/web/training] Add CreateTrainingSession (#17891 ) ### Description * Adds TrainingSession.create() functionality following the web bindings for training design doc * Added 2 new training APIs to wasm/api.h: * OrtTrainingGetInputOutputName * OrtTrainingGetInputOutputCount * Moved isOrtEnvInitialized boolean to the wasm-core-impl and added a method that references it ### Motivation and Context * Adding web bindings for training #### Related work * #16521 allowed for training artifacts to be built * #17333 added interfaces for training * #17474 allows for training package to be built + adds training backend to web package [MUST BE MERGED IN BEFORE THIS ONE] --------- Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com> Co-authored-by: Ashwini Khade <askhade@microsoft.com>	2023-10-26 09:22:10 -07:00
Changming Sun	0f72739b6d	Disable ccache for WinML build (#18104 ) ### Description It seems this would resolve the timeout issue.	2023-10-26 19:03:01 +08:00
Patrice Vignola	538e97cbda	[DML EP] Add dynamic graph compilation (#17876 ) Historically, DML was only able to fuse partitions when all sizes are known in advance or when we were overriding them at session creation time. But in practice, it should be possible to compile partitions at compute time if the caller knows that the dimensions won't be changed for every inference (e.g. resizing a webcam window, or padding the input to powers of 2). This graph will be cached and reused until the sizes change. This is an opt-in option gated under the `enable_dynamic_graph_fusion` option, which means that it will only be enabled when the caller requests it since they have more context on how their model will be called between inferences. This PR also adds the option to disable metacommands from the python API, which is an option for the C API but was lacking for python.	2023-10-25 19:56:16 -07:00
Jambay Kinley	d30d4d372a	Add MatMul FP4 and NF4 Support (#18066 ) ### Description Add a contrib op MatMulBnb4 (FP4 and NF4) and related toolchain to support quantization on weight. This PR adds: - schema for contrib op MatMulBnb4 which can support FP4 (4-bit floating point) and NF4 (4-bit NormalFloat) quantization on weight. - a naive implementation for MatMulBnb4 on CPU and GPU, i.e., implemented like MatMul(A, Dequantize(B)). - a special implementation for GemV for MatMulBnb4 and related benchmark tool. - tool to quantize model to FP4 or NF4.	2023-10-25 15:34:58 -07:00
snadampal	d88d52eead	[aarch64] Remove mmla kernel support from apple (#18082 ) ### Description The mmla kernels require additional ISA flags and are currently supported only on Linux ### Motivation and Context more context is in https://github.com/microsoft/onnxruntime/pull/15270 cc: @skottmckay , @chenfucn , @snnn	2023-10-25 11:34:57 -07:00
liqun Fu	706e13e0c9	implement affinegrid cpu kernel (#17777 )	2023-10-25 10:46:04 -07:00
pengwa	2c6b31c5aa	FP16 optimizer automatically detect DeepSpeed compatibility (#18084 ) ### FP16 optimizer automatically detect DeepSpeed compatibility Optimum/Transformers use the accelerate lib to prepare models, so our FP16 optimizer wrapper has not worked for a long time. This is because the namespace is `accelerate.utils.deepspeed.DeepSpeedOptimizerWrapper`, which underneath still calls into the DeepSpeed stage 1 and 2 optimizer. This PR includes the following changes: 1. Add `accelerate.utils.deepspeed.DeepSpeedOptimizerWrapper` to the modifier registry, plus a check that its contained `optimizer` property MUST be a DeepSpeed stage 1 and 2 optimizer. (Let's cover the Stage 3 optimizer later.) 2. For DeepSpeed version > 0.9.1, we will store the source code in a version list. As long as the related function in DeepSpeed remains unchanged in a new release, we won't need to manually upgrade the version check any more. If some day the source code does not match, a warning will be raised to users to add a new version of the source code to the list. With the above changes, our FP16 Optimizer works again in Optimum. ![image](https://github.com/microsoft/onnxruntime/assets/10530022/d35b4aa9-b371-46f1-98ae-73114f91179b)	2023-10-25 15:11:02 +08:00
Sumit Agarwal	ae8561979f	Introduce new optimizer MatMul + BatchNormalization (#17915 ) ### Description Introduce a new ORT L1 optimizer under the RewriteRule category to fuse MatMul + BatchNormalization nodes. This optimizer looks for a specific pattern observed in one of the impacting customer models and fuses the MatMul and BatchNormalization nodes into a Gemm node. For details on the pattern matching and fusion please refer to the comment section of `matmul_bn_fusion.cc`. To visualize, this optimizer will replace the following subgraph with a Gemm node: MatMul -> Reshape ^ -> Transpose ^ -> BatchNormalization ---> GEMM -> Reshape ^ -> Transpose ^. Note: ^ means there can be >=0 occurrence(s) of that node. A few examples of fusable patterns: * - MatMul -> Reshape -> Transpose -> BatchNormalization ---> GEMM -> Reshape -> Transpose * - MatMul -> Reshape -> BatchNormalization ---> GEMM -> Reshape * - MatMul -> Transpose -> BatchNormalization ---> GEMM -> Transpose * - MatMul -> Reshape -> Reshape -> BatchNormalization ---> GEMM -> Reshape -> Reshape * - MatMul -> Reshape -> Transpose -> Reshape -> BatchNormalization ---> GEMM -> Reshape -> Transpose -> Reshape * - MatMul -> BatchNormalization ---> GEMM Note: This optimizer may evolve in the future to be more generic in terms of the pattern matching. ### Motivation and Context - Why is this change required? What problem does it solve? One of the users of the ORT+DML EP needs this to better target the model to DML. But this transformation applies more broadly, so it was added as an L1 optimizer.	2023-10-24 19:41:10 -07:00
Jian Chen	76e275baf4	Merge Cuda docker files into a single one (#18020 )	2023-10-24 15:17:36 -07:00
Changming Sun	6ec45f2ba5	Merge aiinfra-linux-ARM64-CPU-2019 and onnxruntime-linux-ARM64-CPU-2019 (#18069 ) ### Description Merge aiinfra-linux-ARM64-CPU-2019 and onnxruntime-linux-ARM64-CPU-2019 machines to a single one to ease management.	2023-10-24 13:04:08 -07:00
liqun Fu	efa0cc2562	implement isinf20 and isnan20 (#17874 )	2023-10-24 10:58:54 -07:00
Changming Sun	abb329179a	Update win-wasm-ci.yml: increase the timeout value (#18023 )	2023-10-24 10:50:12 -07:00
Jian Chen	e63ccd3cbb	Install CUDA 12.2 on Windows (#18044 )	2023-10-24 10:47:23 -07:00
Jiajia Qin	eb47008049	[js/webgpu] FP16 Cast, Resize (#18035 ) ### Description Cast/Resize with f16 are missing in vae-decoder-f16. With this change, vae-decoder-f16 drops from over 1 second to 315 ms.	2023-10-23 22:56:56 -07:00
Tianlei Wu	688524a9ab	[CUDA EP] Add warning logs when adding memcpy nodes (#18032 ) Memcpy nodes can have a negative impact on performance, and they also make ORT unable to run CUDA graph. Here we add a warning log for CUDA EP when this happens. It can help troubleshooting. For example, when CUDA graph cannot run, we can check the logs to find out where the Memcpy nodes are inserted (this is also possible by saving the optimized model, but that needs more time and disk space). Note that the warning is per graph. When there are subgraphs, we might see multiple warnings if the issue happens in multiple graphs. Example logs: ``` 2023-10-19 20:58:10.678176531 [I:onnxruntime:, transformer_memcpy.cc:329 AddCopyNode] Add MemcpyFromHost after input_ids for CUDAExecutionProvider 2023-10-19 20:58:10.678198702 [I:onnxruntime:, transformer_memcpy.cc:329 AddCopyNode] Add MemcpyFromHost after /text_model/ArgMax_output_0 for CUDAExecutionProvider 2023-10-19 20:58:10.678211727 [I:onnxruntime:, transformer_memcpy.cc:329 AddCopyNode] Add MemcpyFromHost after /text_model/Gather_3_output_0 for CUDAExecutionProvider 2023-10-19 20:58:10.678257903 [W:onnxruntime:, transformer_memcpy.cc:74 ApplyImpl] 3 Memcpy nodes are added to the graph main_graph for CUDAExecutionProvider. It might have negative impact on performance (including unable to run CUDA graph). Set session_options.log_severity_level=1 to see the detail logs before this message. ```	2023-10-23 22:00:02 -07:00
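
The Transpose/Reshape equivalence rule stated in commit 58f1d15d19 (#18096) can be illustrated with a small standalone check. This is only a sketch of the rule as worded in that commit message, not the graph transformer's actual code; the function names `transpose_is_reshape` and `transposed_shape` are hypothetical, and "empty dimension" is assumed to mean a dimension of size 1.

```python
def transpose_is_reshape(shape, perm):
    """Hypothetical helper: True if Transpose(perm) only moves size-1 axes."""
    if sorted(perm) != list(range(len(shape))):
        raise ValueError("perm must be a permutation of the input axes")
    # Input axes with size != 1, listed in the order they appear in the output.
    moved = [axis for axis in perm if shape[axis] != 1]
    # If these axes keep their original relative order, the row-major element
    # order is unchanged, so a Reshape produces the same data without a copy.
    return moved == sorted(moved)


def transposed_shape(shape, perm):
    """Shape of the Transpose output, which is also the Reshape target shape."""
    return [shape[axis] for axis in perm]


if __name__ == "__main__":
    # Example from the commit message: only the size-1 axes change place.
    print(transpose_is_reshape([1, 1, 1024, 4096], [2, 0, 3, 1]))  # True
    print(transposed_shape([1, 1, 1024, 4096], [2, 0, 3, 1]))  # [1024, 1, 4096, 1]
    # Swapping two non-unit axes reorders elements, so Reshape is not enough.
    print(transpose_is_reshape([2, 3], [1, 0]))  # False
```

The check is deliberately conservative: any axis whose size is not 1 (including size 0) is treated as non-empty, so it may decline some cases that would also be safe.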

9868 commits