onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-07-11 17:48:34 +00:00

Author	SHA1	Message	Date
Jian Chen	b023de0bfc	Redo #18044 Install CUDA 12.2 on Windows (#18093 )	2023-10-26 10:12:46 -07:00
Caroline Zhu	64de71c5e2	[js/web/training] Add CreateTrainingSession (#17891 ) ### Description * Adds TrainingSession.create() functionality following the web bindings for training design doc * Added 2 new training APIs to wasm/api.h: * OrtTrainingGetInputOutputName * OrtTrainingGetInputOutputCount * Moved isOrtEnvInitialized boolean to the wasm-core-impl and added a method that references it ### Motivation and Context * Adding web bindings for training #### Related work * #16521 allowed for training artifacts to be built * #17333 added interfaces for training * #17474 allows for training package to be built + adds training backend to web package [MUST BE MERGED IN BEFORE THIS ONE] --------- Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com> Co-authored-by: Ashwini Khade <askhade@microsoft.com>	2023-10-26 09:22:10 -07:00
Changming Sun	0f72739b6d	Disable ccache for WinML build (#18104 ) ### Description It seems this would resolve the timeout issue.	2023-10-26 19:03:01 +08:00
Patrice Vignola	538e97cbda	[DML EP] Add dynamic graph compilation (#17876 ) Historically, DML was only able to fuse partitions when all sizes are known in advance or when we were overriding them at session creation time. But in practice, it should be possible to compile partitions at compute time if the caller knows that the dimensions won't be changed for every inference (e.g. resizing a webcam window, or padding the input to powers of 2). This graph will be cached and reused until the sizes change. This is an opt-in option gated under the `enable_dynamic_graph_fusion` option, which means that it will only be enabled when the caller requests it since they have more context on how their model will be called between inferences. This PR also adds the option to disable metacommands from the python API, which is an option for the C API but was lacking for python.	2023-10-25 19:56:16 -07:00
Jambay Kinley	d30d4d372a	Add MatMul FP4 and NF4 Support (#18066 ) ### Description Add a contrib op MatMulBnb4 (FP4 and NF4) and related toolchain to support quantization on weight. This PR adds: - schema for contrib op MatMulBnb4 which can support FP4 (4-bit floating point) and NF4 (4-bit NormalFloat) quantization on weight. - a naive implementation for MatMulBnb4 on CPU and GPU, i.e., implemented like MatMul(A, Dequantize(B)). - a special implementation for GemV for MatMulBnb4 and related benchmark tool. - tool to quantize model to FP4 or NF4.	2023-10-25 15:34:58 -07:00
snadampal	d88d52eead	[aarch64] Remove mmla kernel support from apple (#18082 ) ### Description The mmla kernels require additional ISA flags and are currently supported only on Linux. ### Motivation and Context More context is in https://github.com/microsoft/onnxruntime/pull/15270 cc: @skottmckay , @chenfucn , @snnn	2023-10-25 11:34:57 -07:00
liqun Fu	706e13e0c9	implement affinegrid cpu kernel (#17777 )	2023-10-25 10:46:04 -07:00
pengwa	2c6b31c5aa	FP16 optimizer automatically detect DeepSpeed compatibility (#18084 ) ### FP16 optimizer automatically detect DeepSpeed compatibility Optimum/Transformers use the accelerate library to prepare models, so our FP16 optimizer wrapper has not worked for a long time. The reason is that the optimizer's namespace is `accelerate.utils.deepspeed.DeepSpeedOptimizerWrapper`, which underneath still calls into the DeepSpeed stage 1 and 2 optimizer. This PR includes the following changes: 1. Add `accelerate.utils.deepspeed.DeepSpeedOptimizerWrapper` to the modifier registry, plus a check that its contained `optimizer` property MUST be a DeepSpeed stage 1 and 2 optimizer (the Stage 3 optimizer will be covered later). 2. For DeepSpeed versions > 0.9.1, we store the source code in a version list. As long as the related function in DeepSpeed remains unchanged in new releases, we no longer need to manually upgrade the version check. If the source code someday does not match, a warning is raised asking users to add the new version of the source code to the list. With the above changes, our FP16 Optimizer works again in Optimum. ![image](https://github.com/microsoft/onnxruntime/assets/10530022/d35b4aa9-b371-46f1-98ae-73114f91179b)	2023-10-25 15:11:02 +08:00
Sumit Agarwal	ae8561979f	Introduce new optimizer MatMul + BatchNormalization (#17915 ) ### Description Introduce a new ORT L1 optimizer under the RewriteRule category to fuse MatMul + BatchNormalization nodes. This optimizer looks for a specific pattern observed in one of the impacting customer models and fuses the MatMul and BatchNormalization nodes into a Gemm node. For details on the pattern matching and fusion, please refer to the comment section of `matmul_bn_fusion.cc`. To visualize, this optimizer replaces the following subgraph with a Gemm node. <pre> MatMul -> Reshape ^ -> Transpose ^ -> BatchNormalization ---> GEMM -> Reshape ^ -> Transpose ^ Note: ^ means there can be >=0 occurrence(s) of that node. A few example fusable patterns: * - MatMul -> Reshape -> Transpose -> BatchNormalization ---> GEMM -> Reshape -> Transpose * - MatMul -> Reshape -> BatchNormalization ---> GEMM -> Reshape * - MatMul -> Transpose -> BatchNormalization ---> GEMM -> Transpose * - MatMul -> Reshape -> Reshape -> BatchNormalization ---> GEMM -> Reshape -> Reshape * - MatMul -> Reshape -> Transpose -> Reshape -> BatchNormalization ---> GEMM -> Reshape -> Transpose -> Reshape * - MatMul -> BatchNormalization ---> GEMM </pre> Note: This optimizer may evolve in the future to be more generic in terms of the pattern matching. ### Motivation and Context One of the users of the ORT+DML EP needs this to better target the model to DML. But this transformation applies more broadly, so it was added as an L1 optimizer.	2023-10-24 19:41:10 -07:00
Jian Chen	76e275baf4	Merge Cuda docker files into a single one (#18020 )	2023-10-24 15:17:36 -07:00
Changming Sun	6ec45f2ba5	Merge aiinfra-linux-ARM64-CPU-2019 and onnxruntime-linux-ARM64-CPU-2019 (#18069 ) ### Description Merge aiinfra-linux-ARM64-CPU-2019 and onnxruntime-linux-ARM64-CPU-2019 machines to a single one to ease management.	2023-10-24 13:04:08 -07:00
liqun Fu	efa0cc2562	implement isinf20 and isnan20 (#17874 )	2023-10-24 10:58:54 -07:00
Changming Sun	abb329179a	Update win-wasm-ci.yml: increase the timeout value (#18023 )	2023-10-24 10:50:12 -07:00
Jian Chen	e63ccd3cbb	Install CUDA 12.2 on Windows (#18044 )	2023-10-24 10:47:23 -07:00
Jiajia Qin	eb47008049	[js/webgpu] FP16 Cast, Resize (#18035 ) ### Description Cast and Resize with f16 were missing for vae-decoder-f16. With this change, vae-decoder-f16 latency drops from over 1 second to 315 ms.	2023-10-23 22:56:56 -07:00
Tianlei Wu	688524a9ab	[CUDA EP] Add warning logs when adding memcpy nodes (#18032 ) Memcpy nodes can have a negative impact on performance, and they also prevent ORT from running CUDA graph. This PR adds a warning log for the CUDA EP when this happens, which can help with troubleshooting. For example, when CUDA graph cannot run, the logs show where the Memcpy nodes were inserted (this is also possible by saving the optimized model, but that needs more time and disk space). Note that the warning is per graph. When there are subgraphs, multiple warnings may appear if the issue happens in multiple graphs. Example logs: ``` 2023-10-19 20:58:10.678176531 [I:onnxruntime:, transformer_memcpy.cc:329 AddCopyNode] Add MemcpyFromHost after input_ids for CUDAExecutionProvider 2023-10-19 20:58:10.678198702 [I:onnxruntime:, transformer_memcpy.cc:329 AddCopyNode] Add MemcpyFromHost after /text_model/ArgMax_output_0 for CUDAExecutionProvider 2023-10-19 20:58:10.678211727 [I:onnxruntime:, transformer_memcpy.cc:329 AddCopyNode] Add MemcpyFromHost after /text_model/Gather_3_output_0 for CUDAExecutionProvider 2023-10-19 20:58:10.678257903 [W:onnxruntime:, transformer_memcpy.cc:74 ApplyImpl] 3 Memcpy nodes are added to the graph main_graph for CUDAExecutionProvider. It might have negative impact on performance (including unable to run CUDA graph). Set session_options.log_severity_level=1 to see the detail logs before this message. ```	2023-10-23 22:00:02 -07:00
Chi Lo	555b2af7d6	[TensorRT EP] Add unit test for user provided cuda stream (#17974 ) Add a unit test for testing user provided CUDA stream	2023-10-23 19:41:15 -07:00
Chi Lo	4ffd022b0b	[TensorRT EP] Refactor of TRT plugins support (#17946 ) Make sure the "trt.plugins" custom op domain is registered only once. The bottom line is that the "trt.plugins" custom op domain needs to be registered before model load. `CreateTensorRTCustomOpDomainList()` is the TRT EP's function to create the "trt.plugins" custom op domain. The following are the places where this function is called. (This function only fetches all the TRT plugins from the TRT plugin registry; it does not yet register them to the ORT custom op registry. The real registration happens in AddCustomOpDomains().) C/C++ APIs: - `OrtApis::SessionOptionsAppendExecutionProvider_TensorRT_XX`: This function makes the session option object contain the "trt.plugins" custom op domain for ORT to register, so that the session creation API can later register the custom op domain accordingly and won't complain about an invalid ONNX node. - `InferenceSession::RegisterExecutionProvider`: In some cases, users might create the session object first and later call session_object.RegisterExecutionProvider(). This function calls p_exec_provider->GetCustomOpDomainList(), which returns the "trt.plugins" custom op domain. Otherwise, session_object.Load(model) will complain. Python APIs: - `RegisterTensorRTPluginsAsCustomOps`: This function needs to be called so that the session option object contains the "trt.plugins" custom op domain for ORT to register. Different language bindings have slightly different workflows for initializing the session. This might cause a duplicate custom op domain in `session_option.custom_op_domains_` or `CreateTensorRTCustomOpDomainList()` being called more than once, but we added checks to make sure the EP's custom op domain won't be registered twice.	2023-10-23 17:46:38 -07:00
Dmitri Smirnov	2c50b75a26	Functions Ahead Of Time inlininng (#17764 ) ### Description Inline functions in an EP-aware fashion. The result of this PR is that models that have been inlined by the ONNX inliner and then optimized, and models that have been AOT inlined, appear to be visually identical. For tests I used two models. The only difference is the resulting size, because the ONNX inliner removes local function definitions and AOT does not. The difference in sizes for the `HF Mobile` model was 2.5 MB, and for `HF Bart` it was ~500K. It seems that the resulting model size affects the load time more than the actual optimizations. In general, the inlined models grow in size very fast and can easily exceed the 2 GB limit. Q. Should we make AOT optional? `If` constant folding and the removal of local inlined models will be coming in other PRs. Some stats: ![image](https://github.com/microsoft/onnxruntime/assets/11303988/fcb4c815-2e06-4574-8d96-5a0a727d1ecf)	2023-10-23 17:42:20 -07:00
satyajandhyala	f3cfe08c42	[JS/Web] Enabled 1d spacial input to GlobalAveragePool (#17973 ) ### Description Enable one-dimensional spatial input for GlobalAveragePool. ### Motivation and Context Currently only 2D input is supported.	2023-10-23 16:02:50 -07:00
snadampal	780ee186d7	[aarch64] Implement QGEMM kernels with UMMLA/SMMLA instructions (#17160 ) ### Description This PR adds UMMLA and SMMLA based QGEMM kernels for aarch64. This covers (i) symmetric quantization (zero point is zero), (ii) asymmetric quantization (zero point is non-zero), (iii) per-channel as well as per-tensor quantization, (iv) signed weights (U8S8 Gemm), (v) unsigned weights (U8U8 Gemm), and (vi) signed activations and weights (S8S8 Gemm) scenarios. The ummla/smmla kernels are enabled based on a cpuinfo check for `I8MM` support, so the MMLA QGEMM kernels are enabled for all devices that support I8MM instructions. ### Motivation and Context This is to improve INT8 quantized MatMul performance on the aarch64 platform. I have run the below benchmarking script (bert, roberta and gpt2 model inference) on an AWS Graviton3 based c7g.4xl instance and observed up to 1.33x performance improvement compared to the optimized UDOT qgemm kernel performance. ``` cd onnxruntime/python/tools/transformers python3 benchmark.py ``` I have also run the unit tests and made sure all are passing. ``` ./build.sh --config RelWithDebInfo --build_shared_lib --parallel --compile_no_warning_as_error --skip_submodule_sync ```	2023-10-24 07:49:04 +10:00
kunal-vaishnavi	2a17d5cf32	LLaMA Model Optimization (#18021 ) ### Description This PR contains fusion-level and kernel-level optimizations for [Meta's LLaMA-2](https://blogs.microsoft.com/blog/2023/07/18/microsoft-and-meta-expand-their-ai-partnership-with-llama-2-on-azure-and-windows/). Some of the added optimizations include: - SimplifiedLayerNorm changes - Fusions for multiple variants - SkipSimplifiedLayerNorm changes - Kernel support for CPU - Rotary embeddings (previously did not exist) - Fusions for multiple variants - CPU and CUDA kernels - Supports interleaving and non-interleaving in the same kernels - Optimized cache that requires half of its originally exported sizes - Reduced from `(max_sequence_length, head_size)` to `(max_sequence_length, head_size / 2)` - Multi-head attention - Support for 2D and 3D attention masks - Group query attention (for FP16 CUDA and INT4 CUDA) - Integration with flash attention v2 and past-present buffer sharing - Removes need for `attention_mask` input as it is supported in the kernel - 4 bit quantization - `block_size` parameter is available for customizing - Support the new changes for [Microsoft version](https://github.com/microsoft/Llama-2-Onnx) - Support combinations of the below variants (ex: export ORT version and run with Optimum) Supported variants of LLaMA-2 include: - [ORT version](https://github.com/microsoft/onnxruntime/tree/main/onnxruntime/python/tools/transformers/models/llama) - Produces one ONNX file that is already optimized (and quantized if requested) - Integrates with Optimum - [Another Microsoft version](https://github.com/microsoft/Llama-2-Onnx) - Already exported and available off-the-shelf - Faster versions of those models will be uploaded there soon - [Hugging Face version](https://huggingface.co/meta-llama) - Models that end with `-hf` - Some older and current versions of [`transformers`](https://github.com/huggingface/transformers) and [`optimum`](https://github.com/huggingface/optimum) that 
export the model to ONNX differently - Note that while some older versions are supported, it is recommended to use the latest package versions. ### Usage To use the optimizations, please see `README.md` for details. Please note the various `requirements.txt` files for the package versions recommended in order to use these changes. To run the ORT transformer optimizer separately, run the script as follows: ``` $ cd onnxruntime/onnxruntime/python/tools/transformers/ $ python3 optimizer.py --input <filename>.onnx --output <filename>.onnx --model_type gpt2 --num_heads <number of attention heads> --hidden_size <attention hidden size> --use_external_data_format --opt_level 0 ``` ### Motivation and Context This PR helps the following issues: - https://github.com/microsoft/onnxruntime/issues/14997 - https://github.com/microsoft/onnxruntime/issues/16254 - https://github.com/microsoft/onnxruntime/issues/17681 - https://github.com/microsoft/onnxruntime/issues/17925 - https://github.com/microsoft/onnxruntime-inference-examples/issues/320 This PR uses changes from the following PRs: - https://github.com/pytorch/pytorch/pull/104468 - https://github.com/pytorch/pytorch/pull/109759 - https://github.com/microsoft/onnxruntime/pull/17020 - https://github.com/microsoft/onnxruntime/pull/17674 - https://github.com/microsoft/onnxruntime/pull/17890 - https://github.com/microsoft/onnxruntime/pull/17920 - https://github.com/huggingface/transformers/pull/26162 - https://github.com/huggingface/optimum/pull/1257 - https://github.com/huggingface/optimum/pull/1289 - https://github.com/huggingface/optimum/pull/1462 ### New TorchDynamo Exporter (experimental stage) This PR uses changes from the following issues and PRs to begin supporting the [new TorchDynamo exporter](https://pytorch.org/docs/stable/onnx.html#torchdynamo-based-onnx-exporter): - https://github.com/huggingface/transformers/pull/26307 - https://github.com/pytorch/pytorch/issues/104903 - https://github.com/pytorch/pytorch/pull/105040 
- https://github.com/microsoft/onnxscript/pull/847 - https://github.com/microsoft/onnxscript/pull/862 - https://github.com/microsoft/onnxscript/issues/493	2023-10-23 13:00:56 -07:00
Jiajia Qin	8a12b2cea6	[js/webgpu] Fix the transpose error when dims > 4D (#18027 ) ### Description Currently, the uniform support has bugs when dims rank is larger than 4. See https://github.com/microsoft/onnxruntime/issues/17860 item 1. So this PR only enables shapes uniforms when shape rank is <= 4 for transpose. Otherwise, below compilation errors are thrown: ``` 1 error(s) generated while compiling the shader: :3:50 error: uniform storage requires that array elements are aligned to 16 bytes, but array element of type 'u32' has a stride of 4 bytes. Consider using a vector or struct as the element type instead. struct Uniforms { output_size:u32, a_shape:array<u32, 5>, a_strides:array<u32, 5>, output_shape:array<u32, 5>, output_strides:array<u32, 5> }; ^^^^^^^^^^^^^ :3:7 note: see layout of struct: /* align(4) size(84) */ struct Uniforms { /* offset( 0) align(4) size( 4) */ output_size : u32; /* offset( 4) align(4) size(20) */ a_shape : array<u32, 5>; /* offset(24) align(4) size(20) */ a_strides : array<u32, 5>; /* offset(44) align(4) size(20) */ output_shape : array<u32, 5>; /* offset(64) align(4) size(20) */ output_strides : array<u32, 5>; /* */ }; struct Uniforms { output_size:u32, a_shape:array<u32, 5>, a_strides:array<u32, 5>, output_shape:array<u32, 5>, output_strides:array<u32, 5> }; ^^^^^^ :4:42 note: 'Uniforms' used in address space 'uniform' here @group(0) @binding(2) var<uniform> uniforms: Uniforms; ^^^^^^^^ ```	2023-10-23 11:02:19 -07:00
Hector Li	f0d5ea5930	[QNN EP] Disable flaky test QnnCPUBackendTests.MatMulOp_Broadcast (#18033 ) Disable flaky test QnnCPUBackendTests.MatMulOp_Broadcast. The test failed on Linux randomly.	2023-10-23 09:01:29 -07:00
JiCheng	b7ae293be0	Support large model export using multi-gpu (#17990 ) ### Description This PR implements an exporter that works for large language models (LLMs). It works for models like Llama2-70b or gpt-175. The main idea is to utilize multiple GPUs and dispatch different layers to different GPUs; in short, it simply implements automatic pipeline parallelism. For example, to export Llama2-70b, you need 8x V100-32GB or 4x A100-80G or more GPU memory. It is expected to export decoder-only models. Encoder-decoder architecture models have not been tested yet. --------- Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>	2023-10-22 23:33:29 +08:00
pengwa	444a0eda30	Avoid one time clone to save memory peak (#17934 ) ### Avoid one more time clone to save memory peak	2023-10-21 19:45:45 +08:00
RandySheriffH	009cd4ea2e	Allow cuda custom ops allocate deferred cpu mem (#17893 ) Expose a new allocator from the CUDA stream. The allocator manages deferred CPU memory, which only gets recycled before stream destruction. --------- Co-authored-by: Randy Shuai <rashuai@microsoft.com>	2023-10-20 16:12:21 -07:00
Chi Lo	2f57625cb0	[TensorRT EP] Add stream sync after enqueue (#18026 ) If the model is partitioned into TRT subgraphs and CUDA EP nodes, we observed a CUDA stream synchronization issue when multithreading. Calling the stream sync API after enqueue solves this issue without adding much performance overhead.	2023-10-20 15:09:46 -07:00
liqun Fu	020824ed50	Update ONNX to 1.15.0rc1 (#17914 )	2023-10-20 15:08:25 -07:00
Baiju Meswani	a43c57f59d	ResizeGrad CUDA/ROCM kernel implementation (#17772 )	2023-10-20 11:39:57 -07:00
Changming Sun	cc7e8cc21f	Update dockerfiles/Dockerfile.source to avoid installing onnx (#17975 ) ### Description Update dockerfiles/Dockerfile.source to avoid installing onnx python package. ONNX is not listed in https://github.com/microsoft/onnxruntime/blob/main/requirements.txt.in. We do not have to install it. Especially when we do not run tests, the package provides no help when building onnxruntime from source. ### Motivation and Context Resolve #17781	2023-10-20 09:24:21 -07:00
Yi Zhang	99b8dcaae2	Disable dml stage in windows GPU pipeline temporarily. (#18034 )	2023-10-20 08:41:40 -07:00
Hector Li	35ecce4549	[QNN EP] Reduce overhead of QNN context binary loading (#17965 ) ### Description Reduce the overhead of QNN context binary loading by avoiding a memory copy. ### Motivation and Context Reduce the session initialization time and memory usage when loading from a QNN context binary.	2023-10-18 15:30:35 -07:00
Jian Chen	cbb0e0f83c	Create a new Dockerfile for cuda 12 and trt 8.6.1.6-1.cuda12.0 (#18000 )	2023-10-18 14:46:02 -07:00
aciddelgado	a2c6283274	Fix Packed MultiHead Attention (#17996 ) ### Description Initialize previously uninitialized parameters that were causing the op to crash. ### Motivation and Context Solves the CUDA memory misalignment / illegal memory access error when FlashAttention was used in Packed Multi-Head Attention.	2023-10-18 10:52:14 -07:00
Arthur Islamov	22947109f2	[js/web] FP16 LayerNorm, InstanceNorm, SkipLayerNorm (#17630 ) ### Description This PR includes fixes for Norm operations to support FP16 and also some optimizations to use vec2/vec4 if possible	2023-10-18 10:47:41 -07:00
dependabot[bot]	f9694c5b97	Bump @babel/traverse from 7.18.5 to 7.23.2 in /js/react_native/e2e (#17963 )	2023-10-18 05:25:27 +00:00
Tianlei Wu	59ae3fdfdc	[CUDA] StableDiffusion XL demo with CUDA EP (#17997 ) Add CUDA EP to the StableDiffusion XL Demo including: (1) Add fp16 VAE support for CUDA EP. (2) Configuration for each model separately (For example, some models can run with CUDA graph but some models cannot). Some remaining works will boost performance further later: (1) Enable CUDA Graph for Clip2 and UNet. Currently, some part of graph is partitioned to CPU, which blocks CUDA graph. (2) Update GroupNorm CUDA kernel for refiner. Currently, the cuda kernel only supports limited number of channels in refiner so we shall see some gain there if we remove the limitation. Some extra works that are nice to have (thus lower priority): (3) Support denoising_end to ensemble base and refiner. (4) Support classifier free guidance (The idea is from https://www.baseten.co/blog/sdxl-inference-in-under-2-seconds-the-ultimate-guide-to-stable-diffusion-optimiza/). #### Performance on A100-SXM4-80GB Example commands to test an engine built with static shape or dynamic shape: ``` engine_name=ORT_CUDA python demo_txt2img_xl.py --engine $engine_name "some prompt" python demo_txt2img_xl.py --engine $engine_name --disable-cuda-graph --build-dynamic-batch --build-dynamic-shape "some prompt" ``` Engine built with dynamic shape could support different batch size (1 to 4 for TRT; 1 to 16 for CUDA) and image size (256x256 to 1024x1024). Engine built with static shape could only support fixed batch size (1) and image size (1024x1024). 
The latency (ms) of generating an image of size 1024x1024 (sorted by total latency):

| Engine | Base (30 Steps)* | Refiner (9 Steps) | Total Latency (ms) |
| -- | -- | -- | -- |
| ORT_TRT (static shape) | 2467 | 1033 | 3501 |
| TRT (static shape) | 2507 | 1048 | 3555 |
| ORT_CUDA (static shape) | 2630 | 1015 | 3645 |
| ORT_CUDA (dynamic shape) | 2639 | 1016 | 3654 |
| TRT (dynamic shape) | 2777 | 1099 | 3876 |
| ORT_TRT (dynamic shape) | 2890 | 1166 | 4057 |

\* VAE decoder is not used in Base since the output from base is latent, which is consumed by refiner to output image. We can see that ORT_CUDA is faster on dynamic shape, while slower in static shape (The cause is Clip2 and UNet cannot run with CUDA Graph right now, and we will address the issue later). ### Motivation and Context Follow up of https://github.com/microsoft/onnxruntime/pull/17536	2023-10-17 21:30:04 -07:00
Patrice Vignola	61f1a16265	Fix MHA shape inference (#18009 ) The previous shape inference never had the chance to infer the past_key and past_value outputs because we were returning early.	2023-10-17 21:19:57 -07:00
Patrice Vignola	65575389f4	[DML EP] Enable more MHA masks (#17882 ) Those masks are used for MHA in LLaMA.	2023-10-17 17:31:51 -07:00
Guenther Schmuelling	2ef7abf1ff	support for fp16 in GetClipMinMax() (#17967 ) add support for fp16 in GetClipMinMax()	2023-10-17 16:43:30 -07:00
Adam Pocock	3456831413	[java] Make the backing byte buffer in an OrtValue accessible (#16578 ) ### Description Adds a method to access the backing direct byte buffer from a Java `OnnxTensor` object, assuming it is backed by a direct byte buffer (tensors created by ORT's run call or ones created in Java from multidimensional arrays are not). Also adds a method to check if the backing byte buffer was copied from the user's buffer supplied on creation (this could be tested via a pointer comparison from the output of `getBufferRef` and the user's input buffer, so I'm not sure if it's necessary). ### Motivation and Context This is the first part of changes necessary to support output pinning in Java OrtSession.run/OrtTrainingSession.run calls. I split it out from the rest of the work as it's useful by itself (e.g. to allow users to keep a single input tensor and rewrite it each time with new inputs rather than allocate a fresh one) and the other change will be much more involved so splitting it makes it easier to review. cc @yuslepukhin	2023-10-17 10:03:49 -07:00
Changming Sun	57c8736596	Move a nodejs test to a different machine pool (#17970 ) ### Description This is a temporary fix for the failing "Zip-Nuget-Java-Nodejs Packaging Pipeline". The pipeline is failing because I removed NodeJS from the build machine pool's image, to reduce the number of dependencies we need to maintain in VMs. So this PR temporarily moves the test to a different machine pool to get the test passing. Then I will move the test to Docker. Docker images are relatively easier to update and maintain. We now run almost all Linux tests in Docker, except for this one. Moving it to Docker is needed for enabling GPU support in nodejs, because none of our Linux VMs have CUDA.	2023-10-17 09:30:14 -07:00
Hariharan Seshadri	9356986730	Fix AMD builds and enable testing NHWC CUDA ops in one GPU CI (#17972 ) ### Description This PR: (1) Fixes AMD builds after #17200 broke them (Need to remember to run AMD builds while trying to merge external CUDA PRs next time) (2) Turn on the NHWC CUDA feature in the Linux GPU CI. The extra time spent in building a few more files and running a few more tests will not be much. Test Linux GPU CI run : https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=1170770 ### Motivation and Context Keep the NHWC CUDA ops tested (https://github.com/microsoft/onnxruntime/pull/17200) and guard against regressions	2023-10-17 09:23:52 -07:00
Scott McKay	6832b688a2	Fix missing attribute on C# DOrtGetResizedStringTensorElementBuffer delegate (#17901 ) ### Description Fix a missing attribute, which causes a build error on the release Xamarin iOS build. Fix some long lines as well. ### Motivation and Context #16463 - once the dummy extensions nuget package is used this problem shows up.	2023-10-17 17:48:36 +10:00
Tianlei Wu	83547d3067	[CUDA] Fix SkipLayerNorm vectorized kernel out-of-bounds read (#17943 ) Fix a bug in https://github.com/microsoft/onnxruntime/pull/11803: When hidden size is not exactly same as next size (for example ld=320 in stable diffusion) current vectorized kernel might read out-of-bounds, and might cause CUDA failure. Also resolved another issue: for the first and last size, current macro will cause some dead code (some branch will never run). Here we change it to avoid those branches in boundary sizes. Performance tests with stable diffusion shows that the performance is on-par before/after this fix.	2023-10-16 13:58:37 -07:00
dependabot[bot]	cf974f0905	Bump @babel/traverse from 7.18.2 to 7.23.2 in /js/react_native (#17962 )	2023-10-16 18:24:01 +00:00
Maximilian Müller	7c17e33c07	Make CUDA a NHWC EP (#17200 ) ### Description CUDA inference speed heavily relies on Tensor Cores. For Tensor Cores to achieve optimal throughput, they require the data layout to be NHWC rather than NCHW. ### Motivation and Context This is especially important for convolutional networks. I will illustrate this using a very simple network:
```
import torch
import torch.nn as nn


class Net1(nn.Module):
    def __init__(self):
        super(Net1, self).__init__()
        # Five stacked convolutions: 8 input channels, a 5x5 kernel first,
        # then 3x3 kernels up to 128 output channels
        self.m = nn.ModuleList([
            nn.Conv2d(in_channels=8, out_channels=32, kernel_size=5, stride=1),
            nn.Conv2d(in_channels=32, out_channels=64, kernel_size=3, stride=1),
            nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3, stride=1),
            nn.Conv2d(in_channels=128, out_channels=128, kernel_size=3, stride=1, bias=False),
            nn.Conv2d(in_channels=128, out_channels=128, kernel_size=3, stride=1, bias=False),
        ])

    def forward(self, x):
        for module in self.m:
            x = module(x)
        return x


if __name__ == "__main__":
    dtype = torch.half
    device = "cuda"
    dummy_input = torch.randn(8, 8, 512, 512, dtype=dtype, device=device)
    model = Net1().to(dtype=dtype, device=device)
    input_names = ["input1"]
    output_names = ["output1"]
    torch.onnx.export(model, dummy_input, "test.onnx", input_names=input_names, output_names=output_names)
```
I profiled the launch of `./build/RelWithDebInfo/onnxruntime_perf_test -e cuda -I -q -t 5 test.onnx` using sys and nvtx ranges. Current master launches the kernels below: ![image](https://github.com/microsoft/onnxruntime/assets/44298237/81655fce-0f8e-4f78-9335-b858a8c8977b) If I add the newly introduced `-l` flag, we see the kernels below: ![image](https://github.com/microsoft/onnxruntime/assets/44298237/fceb5d6f-c12d-442b-b15a-948797630008) Notice the missing NCHW<>NHWC kernels per operation. The layout optimizer introduced a transpose op as first and last op of the whole network. 
The `op_generic_tensor_kernel` shows the bias used, which should also be optimized out next. Measured across some very basic models:

| CUDA EP | NCHW [ms] | NHWC [ms] | Speedup |
|:--|--:|--:|--:|
| | -e cuda -t 5 -q | -e cuda -t 5 -q -l | |
| resnet101-v2-7_bs8_fp16 | 18.33 | 13.07 | 1.4 |
| resnet101-v2-7_bs8 | 21.8 | 12.06 | 1.81 |
| test | 102.07 | 73.62 | 1.39 |

Average speedup: 1.53 ## Outlook The next step is to write a templated unit test to check the correctness of NHWC vs NCHW ops. After that, we have to transition more ops to measure perf improvements on a broader range of models. Currently this is not easily possible, as we do not support all ops in the NHWC domain. --------- Co-authored-by: Tianlei Wu <tlwu@microsoft.com>	2023-10-16 10:16:37 -07:00
Chi Lo	8abaa7b753	[TensorRT EP] Fix cmake install (#17923 ) We removed tensorrt_provider_factory.h in the [PR](https://github.com/microsoft/onnxruntime/pull/17617). Need to remove the copy of this file when cmake install.	2023-10-16 09:16:24 -07:00
Yulong Wang	ad817d0efa	[js/web] optimize tsc for web: split out "npm prepare" (#17955 ) ### Description optimize tsc for web: split out "npm prepare"	2023-10-16 09:04:54 -07:00
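
The MatMul + BatchNormalization fusion described in commit ae8561979f (#17915) rests on a simple algebraic identity: at inference time, BatchNormalization is a per-channel affine transform, so it can be folded into the weights and bias of a Gemm. The sketch below illustrates only that arithmetic in plain Python. It is not the code in `matmul_bn_fusion.cc`; the helper names (`matmul`, `batch_norm`, `fold_bn_into_gemm`, `gemm`) are hypothetical, the channel axis is assumed to be the last axis of the MatMul output, and the intermediate Reshape/Transpose handling from the commit message is ignored.

```python
import math


def matmul(a, b):
    """Plain matrix product of two lists-of-lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def batch_norm(x, scale, bias, mean, var, eps=1e-5):
    """Inference-mode batch normalization over the last (channel) axis."""
    return [[scale[j] * (row[j] - mean[j]) / math.sqrt(var[j] + eps) + bias[j]
             for j in range(len(row))] for row in x]


def fold_bn_into_gemm(w, scale, bias, mean, var, eps=1e-5):
    """Fold BN parameters into MatMul weights; returns (fused_w, fused_b)."""
    factor = [scale[j] / math.sqrt(var[j] + eps) for j in range(len(scale))]
    fused_w = [[w[k][j] * factor[j] for j in range(len(factor))]
               for k in range(len(w))]
    fused_b = [bias[j] - mean[j] * factor[j] for j in range(len(factor))]
    return fused_w, fused_b


def gemm(a, w, b):
    """Y = A @ W + b, with b broadcast over rows."""
    y = matmul(a, w)
    return [[y[i][j] + b[j] for j in range(len(b))] for i in range(len(y))]


# Illustrative values only (assumed, not taken from any model).
a = [[1.0, 2.0], [3.0, 4.0]]
w = [[0.5, -1.0, 2.0], [1.5, 0.25, -0.5]]
scale, bias = [1.0, 0.5, 2.0], [0.1, -0.2, 0.3]
mean, var = [0.0, 1.0, -1.0], [1.0, 4.0, 0.25]

unfused = batch_norm(matmul(a, w), scale, bias, mean, var)
fused_w, fused_b = fold_bn_into_gemm(w, scale, bias, mean, var)
fused = gemm(a, fused_w, fused_b)

# Largest element-wise difference between the two paths.
max_diff = max(abs(u - f) for ur, fr in zip(unfused, fused)
               for u, f in zip(ur, fr))
print("max abs difference:", max_diff)
```

Both paths should agree up to floating-point rounding, which is why a single Gemm node can stand in for the MatMul and BatchNormalization pair.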

9834 commits