onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-06-06 00:03:22 +00:00

Author	SHA1	Message	Date
Jian Chen	a49f31b670	Remove drop-nuget artifact from all pipelines (#18592 ) ### Description Currently, the `drop-nuget` artifact only contains protoc.exe which is also part of the `drop-extra` artifact.	2023-11-28 13:23:01 -08:00
Mike Guo	e24733cfe9	fix the Olive CI pipeline failure on Windows (#18464 ) Fix the https://aiinfra.visualstudio.com/Lotus/_build?definitionId=1046 failure for Windows	2023-11-28 11:42:39 -08:00
Rachel Guo	288b80d363	Add MacOS build to ORT C Pod (#18550 ) ### Description As title. 1. Add macos build as an optionally enabled arch for pod and changes to existing build_ios_framework/assemble_c_pod scripts. 2. Enable macos build arch in ios packaging pipeline (currently for variants other than Mobile) and check the output artifacts are correct. 3. Write MacOS Test Target scheme in the test app and integrate into ios packaging CI testing pipeline. Currently the changes only apply to the onnxruntime-c pod, as the original request was from ORT SPM, which consumes only the onnxruntime-c pod as the binary target. TODO: could look into adding macos platform to objc pod as well. ### Motivation and Context Enable macos platform support in cocoapods, and also potentially produce a binary target for enabling the macos platform in SPM as well. Replace https://github.com/microsoft/onnxruntime/pull/18334 --------- Co-authored-by: rachguo <rachguo@rachguos-Mac-mini.local> Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net> Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>	2023-11-28 10:11:53 -08:00
Chen Fu	05046e5452	Adding unit test for sm80 prepack (#18514 ) ### Description Prepacking code for block q4 x fp16 GEMM cuda kernel, for SM80 hardware ### Motivation and Context Preparing for addition of Q4 x FP16 GEMM kernel on Nvidia Ampere GPUs. This kernel requires sophisticated quantized weight rearrangement to speed up loading data to tensor cores. To facilitate the addition, this change includes the following: 1. matrix_layout.h A new layout lib that facilitates iterating over matrix elements and tiles while balancing memory safety and performance. 2. prepack_sm80.h Code for rearranging quantized weights, scales and offsets (aka prepacking) 3. blkq4_fp16_sm80_prepack_test.cc Unit tests that explicitly test the memory safety and correctness of the prepacking code. Currently the prepacking code runs on CPU with single-threaded code. We run this on CPU in order to minimize GPU memory fragmentation. On the other hand, hopefully we will get around to parallelizing this part of the code. It should be straightforward with the unit tests in place.	2023-11-28 10:01:09 -08:00
Adrian Lizarraga	8d5ecc4dae	[Quantization] Fix scale/zero-point for 16-bit QDQ Softmax (#18589 ) ### Description Sets the appropriate scale and zero-point values for 16-bit QDQ Softmax. Previously, the scale/zp were set to fixed values that were specific to 8-bit quantization. ### Motivation and Context Generate more accurate 16-bit QDQ models that contain Softmax.	2023-11-28 09:46:47 -08:00
Sheil Kumar	0b7048e7d6	Update winml to use #cores - #soc cores by Default as the number of intraopthreads (#18384 ) Update winml to use #cores - #soc cores by Default as the number of intraopthreads --------- Co-authored-by: Sheil Kumar <sheilk@microsoft.com>	2023-11-28 09:26:48 -08:00
Yi Zhang	a6d8726407	Update ADO windows image to custom image (#18598 ) ### Description Update Azure-Pipelines-EO-Windows2022-aiinfra to onnxruntime-win-CPU-2022 in Nuget_Package_CPU. To make debugging easier, use flex-downloadPipelineArtifact ### Motivation and Context Azure-Pipelines-EO-Windows2022-aiinfra is using 1ES window-latest image. The pipeline might fail because of an unexpected upgrade. Verified: https://dev.azure.com/aiinfra/Lotus/_build/results?buildId=384425&view=results ### P.S. I think we should replace all Azure-Pipelines-EO-Windows2022-aiinfra.	2023-11-28 09:04:25 -08:00
Jian Chen	3ea27c2925	Create a new Nuget Package pipeline for CUDA 12 (#18135 )	2023-11-28 09:03:46 -08:00
Xavier Dupré	94a6020a7f	Improve parallelization of TfIdfVectorizer, Reduce memory consumption (#18539 ) ### Description TfIdfVectorizer has two steps: first, search for n-grams in the input; second, weight the results. The second step was not parallelized. The PR addresses that issue. Before, two vectors of the size of the output were allocated to compute the results. The first one, frequencies, was used as an intermediate vector between the two steps. This vector is now broken into multiple small vectors, one per thread. The memory consumption is then reduced for batches with a number of rows > the number of threads. ### Motivation and Context Performance and memory consumption. For one model, the improvement is +15% faster (4 cores, model size is ~6Mb, batch size is 100). Here is another benchmark on a machine with 32 cores with different sizes of vocabularies and batch sizes. The tested TfIdfVectorizer only deals with unigrams and processes sequences of 10 tokens (integers). ![image](https://github.com/microsoft/onnxruntime/assets/22452781/0bb9abe9-ed81-44da-b5c4-ad2a12f129bd)	2023-11-28 12:56:00 +01:00
Ran Gal	3f42fbad2e	deleted the unused random_device variables because they caused a warning that was treated like an error. (#18543 ) deleted the unused random_device variables because they caused a warning that was treated like an error. _Please check if the declaration is required for the random number generation. If so, there needs to be a dummy reference to the variable, or the warning-as-error behavior needs to be turned off._	2023-11-28 08:54:38 +01:00
Jiajia Qin	fc8631e2f1	[js/web] Fix conv2dMatmul errors due to #18452 (#18562 ) ### Description Currently, all conv2dMatmul with inChannels = 3 and outChannels % 4 = 0 will report compilation errors. Models that include this kind of shape, like mobilenetv2-12 and resnet50, will be impacted. The error was introduced by #18452 https://github.com/microsoft/onnxruntime/pull/18452/files#diff-8b24ea43aa11b1346c0c9e327f9bce6b37a93bd8f2bf8a6392b2b263972b7ea2R200, which accidentally passes `components` to `x`. But `x`'s components is `innerElementSize`, not `components`. And when `innerElementSize` is 3, we should use `1` in the current design.	2023-11-27 21:21:47 -08:00
cao lei	b9fd9c5665	remove dead code in openvino EP (#18457 ) ### Description Remove dead code in openvino EP ### Motivation and Context Remove dead code in openvino EP	2023-11-27 13:41:12 -08:00
Caroline Zhu	dd355e39a0	[js/web/training] Added parameters methods (#18250 ) ### Description * Implemented: `getParametersSize`, `getContiguousParameters` (equivalent to copyParametersToBuffer), and `loadParametersBuffer` (equivalent to copyParametersFromBuffer) * as part of these changes, getParametersSize was added to the TrainingSession interface so that users know what size buffer to create for loadParametersBuffer * The parameters methods in the interface were modified to take in a Float32Array instead ### Motivation and Context * part of the work for implementing web bindings for training * enables federated learning in the web * previous PR: #18006 --------- Co-authored-by: Ashwini Khade <askhade@microsoft.com>	2023-11-27 10:30:13 -08:00
Hector Li	a2fd8a6fc0	[QNN EP] Return INVALID_GRAPH if failed to load from context binary (#18485 ) ### Description [QNN EP] Return INVALID_GRAPH if failed to load from context binary ### Motivation and Context Make sure QNN EP return INVALID_GRAPH if error encountered with the context binary file	2023-11-24 20:41:27 -08:00
cloudhan	2f608338cb	Setup default python formatter for new python plugin (#18563 )	2023-11-24 18:04:48 +08:00
Ted Themistokleous	7b2aefa856	undo hipify of __half to rocblas_half (#18573 ) Fixes build issue seen with newer ROCm releases Co-authored-by: Jeff Daily <jeff.daily@amd.com>	2023-11-24 18:04:23 +08:00
mindest	b9c935f605	[ROCm] Some fixes in tunable (#18575 ) ### Description * Fix workspace size for hipBLASLt algos at 32M * Update according to API changes	2023-11-24 17:22:00 +08:00
Rachel Guo	62f00ad8e7	[CoreML] Add Softmax and Split op support (#18358 ) ### Description As title. ### Motivation and Context Added for yolov8 model missing operator support. https://github.com/microsoft/onnxruntime/issues/17654 Now the model support info looks like: _CoreMLExecutionProvider::GetCapability, number of partitions supported by CoreML: 3 number of nodes in the graph: 233 number of nodes supported by CoreML: 230_ (only missing support for 3 concat ops, because 3d input shape is not currently supported in CoreML EP Concat). --------- Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net> Co-authored-by: rachguo <rachguo@rachguos-Mac-mini.local> Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>	2023-11-23 14:26:57 -08:00
cloudhan	6f3c1f9dc9	[ROCm] Update ck for GemmFloat8 (#18487 )	2023-11-23 12:06:19 +08:00
Adrian Lizarraga	1c79897c90	[QNN EP] Support LpNormalization (#18561 ) ### Description Add support for the ONNX LpNormalization operator (p == 2). This is translated to QNN's L2Norm operator. ### Motivation and Context Support more models with QNN EP	2023-11-22 19:40:33 -08:00
pengwa	43a5147e01	Memory optimization refactor and refinement (#17481 ) ### Memory optimization refactor and refinement Currently the memory optimizer runs graph transformations and prints recompute opportunities at INFO level, but the ORT backend emits so many INFO-level logs that users have a hard time finding that information. So we are looking for a Python binding API to retrieve the memory optimization opportunities instead of depending on the MemoryOptimizer's default logging. Then we can print ORTModule feature statistics using this information. Also, with such an API, we can run the analysis after an ORT session is created, where the allocation plan is done, so the analysis considers buffer reuse as well. This can avoid suggesting recomputation subgraphs that are reusing other subgraphs' output buffers. Check https://github.com/microsoft/onnxruntime/blob/pengwa/add_devinfo_level/docs/Memory_Optimizer.md for the new flow using `MemoryOptimizer`. This pull request makes the following refactorings: 1. Print the log in the ORTModule Python script, along with ORTModule feature enabling stats. This is implemented by exposing an API `get_serialized_ortmodule_memory_stat` to retrieve the memory optimization opportunities. 2. Analyze memory optimization opportunities taking ORT memory planning into account. This is done by first creating the execution graph without enabling MemoryOptimizer, then calling `execution_agent.get_serialized_ortmodule_memory_stat`, which internally considers the session memory allocation planner when analyzing memory optimization opportunities. As a direct result, the memory optimization opportunities can show those stashed activations that are reusing other buffers. 3. Move recompute analysis logic from memory_optimizer.h/cc to recompute_analysis.h/cc. 4. Abstract optimization strategies for their own implementation. This will make introducing new strategies (for example compression and decompression) easier.
New logging matrix (INFO Level), in WARNING level, the details will NOT show. ``` 2023-09-13 13:25:09,249 orttraining.rank-0 [WARNING] - *** ONNX Runtime Training (ORTModule) is accelerating your model *** ORTModule is enabled with following features ON/OFF for [training] mode: ATen Executor : ON : Dispatch ATen operators to ORT's ATen executor Cast Propagation : ON : Level 1 enabled Custom Function : ON : Support custom torch.autograd.Function export and execution Memory Optimizer : ON : RecomputeConfig: Reshape+Where+BiasSoftmax+:1:-1,Cast+:1:-1, ProbeLevel: 1, available configs: Config Freq Saving(B) Saving Symbolic(Bytes) - Plan 1 : ON : Reshape+Where+BiasSoftmax+:1:-1 5 671,088,640 640.0inputs_input_ids_dim0inputs_input_ids_dim1*2 - Plan 2 : ON : Cast+:1:-1 6 402,587,648 inputs_input_ids_dim0inputs_input_ids_dim1(384.0inputs_input_ids_dim1 - 64.0) - Plan 3 : OFF : Reshape+Where+:1:-1 1 134,217,728 128.0inputs_input_ids_dim0inputs_input_ids_dim1*2 - Plan 4 : OFF : BiasSoftmax+:1:-1 1 134,086,656 128.0inputs_input_ids_dim0inputs_input_ids_dim1(inputs_input_ids_dim1 - 1) - Plan 5 : OFF : BiasGelu+:1:-1 6 125,808,640 inputs_input_ids_dim0(122880.0inputs_input_ids_dim1 - 20480.0) - Plan 6 : OFF : FusedMatMul+:1:-1 6 125,808,640 inputs_input_ids_dim0(122880.0inputs_input_ids_dim1 - 20480.0) - Plan 7 : OFF : FusedMatMul+Add+FusedMatMul+Add+Add+Add+:1:-1 5 26,214,400 25600.0inputs_input_ids_dim0inputs_input_ids_dim1 - Plan 8 : OFF : Add+:1:-1 1 5,237,760 5120.0inputs_input_ids_dim0(inputs_input_ids_dim1 - 1) - Plan 9 : OFF : Reshape+Unsqueeze+Unsqueeze+Cast+Sub+Mul+Cast+:1:-1 1 4,096 4.0inputs_input_ids_dim0inputs_input_ids_dim1 - Plan 10 : OFF : Cast+:2:-1 1 2,048 2.0inputs_input_ids_dim0inputs_input_ids_dim1 Compute Optimizer : ON : Enable/Disable with env ORTMODULE_ENABLE_COMPUTE_OPTIMIZER=1/0 - FLOPReduction : ON : Reduce FLOPs by upstreaming shrinking-sized ops Auto Fallback : ON : Fallback to PyTorch when encountering unsupported ops TritonOp Enabled : OFF : ORT 
will switch to Triton for executing some ops to further accelerate training. ZeRO Stage3 Support : OFF : Enable/Disable with env ORTMODULE_ENABLE_ZERO_STAGE3=1/0 Total ORT initialization overhead is 10.73s where export takes 8.39s. Other overhead details: graph builder init takes 0.06s, runtime detection takes 0.01s, graph building takes 0.31s, session creation takes 1.96s Versions: ONNX Runtime - 1.16.0+cu118, ONNX - 1.11.0 Note 1: use comma to enable multiple plans at the same time. export ORTMODULE_MEMORY_OPT_CONFIG=<plan1 config>,<plan2 config>,... Note 2: saving is calculated based on the 1st batch symbolic dim values: inputs_input_ids_dim0=1, inputs_input_ids_dim1=1024, inputs_attention_mask_dim0=1, inputs_attention_mask_dim1=1024, inputs_labels_dim0=1, inputs_labels_dim1=1024, ************************************************************************ ``` If DEVINFO level is enabled, then more details about the memory optimizations are printed. ``` MemoryInsight Summary - User config: BiasGelu+:1:-1,Cast+:2:-1 ========================================================================================================================================== \|Freq \| Memory Optimization Opportunities (Clustered by node-level activation patterns) \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|3 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph FusedMatMul+Add+Reshape+ \| \| \| Status : Disabled. 
Enable with export ORTMODULE_MEMORY_OPT_CONFIG=FusedMatMul+Add+Reshape+:1:-1 \| \| \| Stashed Activations: \| \| \| - ReuseFreq : Output 0(3), \| \| \| - Output 0 : [inputs_input_ids_dim0 x inputs_input_ids_dim1 x 32 x 240 x ], byte/elem: 2, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|2 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph Reshape+ \| \| \| Status : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Reshape+:1:-1 \| \| \| Stashed Activations: \| \| \| - ReuseFreq : Output 0(2), \| \| \| - Output 0 : [ x 2560 x ], byte/elem: 2, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|2 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph FusedMatMul+ \| \| \| Status : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=FusedMatMul+:1:-1 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0 x inputs_input_ids_dim1 x 10240 x ], byte/elem: 2, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|2 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph Cast+ \| \| \| Status : Disabled. 
Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Cast+:1:-1 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0 x 32 x inputs_input_ids_dim1 x inputs_input_ids_dim1 x ], byte/elem: 2, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|2 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph Reshape+Where+BiasSoftmax+ \| \| \| Status : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Reshape+Where+BiasSoftmax+:1:-1 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0 x 32 x inputs_input_ids_dim1 x inputs_input_ids_dim1 x ], byte/elem: 4, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|2 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph BiasGelu+ \| \| \| Status : Enabled, requested count=-1, actual applied count=2 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0 x inputs_input_ids_dim1 x 10240 x ], byte/elem: 2, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|2 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph FusedMatMul+Add+FusedMatMul+Add+Add+Add+ \| \| \| Status : Disabled. 
Enable with export ORTMODULE_MEMORY_OPT_CONFIG=FusedMatMul+Add+FusedMatMul+Add+Add+Add+:1:-1 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0 x inputs_input_ids_dim1 x 2560 x ], byte/elem: 2, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|1 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph Reshape+Where+ \| \| \| Status : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Reshape+Where+:1:-1 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0 x 32 x inputs_input_ids_dim1 x inputs_input_ids_dim1 x ], byte/elem: 4, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|1 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph FusedMatMul+ \| \| \| Status : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=FusedMatMul+:1:-1 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0(inputs_input_ids_dim1 - 1) x 10240 x ], byte/elem: 2, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|1 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph Cast+ \| \| \| Status : Disabled. 
Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Cast+:1:-1 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0 x 32 x inputs_input_ids_dim1 - 1 x inputs_input_ids_dim1 x ], byte/elem: 2, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|1 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph Reshape+Unsqueeze+Unsqueeze+Cast+Sub+Mul+Cast+ \| \| \| Status : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Reshape+Unsqueeze+Unsqueeze+Cast+Sub+Mul+Cast+:1:-1 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0 x 1 x 1 x inputs_input_ids_dim1 x ], byte/elem: 4, 100% saved \| \| \| \| \| \|>>Option 2 : RecomputeWithCompromise subgraph Cast+ \| \| \| Status : Enabled, requested count=-1, actual applied count=1 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0 x 1 x 1 x inputs_input_ids_dim1 x ], byte/elem: 4, 50% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|1 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph BiasSoftmax+ \| \| \| Status : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=BiasSoftmax+:1:-1 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0 x 32 x inputs_input_ids_dim1 - 1 x inputs_input_ids_dim1 x ], byte/elem: 4, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|1 \|For each row options are mutually exclusive, only one of them can be enabled. 
\| \| \| \| \| \|>>Option 1 : Recompute subgraph BiasGelu+ \| \| \| Status : Enabled, requested count=-1, actual applied count=1 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0(inputs_input_ids_dim1 - 1) x 10240 x ], byte/elem: 2, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|1 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph Add+ \| \| \| Status : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Add+:1:-1 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0(inputs_input_ids_dim1 - 1) x 2560 x ], byte/elem: 2, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| ========================================================================================================================================== Note: use comma as a separator for enabling more than one subgraphs. *********************************************************************** ``` ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2023-11-23 11:39:00 +08:00
Jiajia Qin	64dacc2892	[js/webgpu] Add BatchNormalization Op (#18468 ) ### Description This PR adds `BatchNormalization` with `float` support. Some TODOs: 1. Inputs that do not all have the same data type. For example, x/y is float16, but bias/scale is float32 or double. 2. Training mode support. We see many models using `BatchNormalization` ops. However, because the op was missing in jsep, all of them ran on CPU, which resulted in very poor performance. With this PR's support, the densenet-9 model goes from 250.69 ms to 20.29 ms.	2023-11-22 15:58:06 -08:00
Xu Xing	fa106942a7	[js/webgpu] Refactor matmul conv to support uniforms for matmul (#18452 ) This change refactored matmul/conv related programs to support shape uniforms. Currently only matmul shape uniforms are fully enabled. TODOs: add input dependencies for conv related programs, turn clipMax and clipMin to uniforms.	2023-11-22 14:42:55 -08:00
Scott McKay	42c6799c59	Update transpose optimization to be more QDQ aware (#18444 ) ### Description <!-- Describe your changes. --> Rework some aspects of the transpose optimizer to ensure we have valid QDQ node units when it is done. Conceptually we need to let individual Transpose nodes move through the graph when optimizing. That can invalidate existing QDQ node units or require new ones. We can fix this after inserting new nodes, or when transpose optimization finishes moving Transpose nodes. Fix when inserting new node - TransposeInputs can add an Unsqueeze (to broadcast) and Transpose to a node's inputs - if there was a DQ node providing the input, add a Q -> DQ after inserting the Unsqueeze/Transpose to make a QDQ node unit for the new node. - Unsqueeze/Transpose don't change data, so we can copy the type/scale/zero point from the existing DQ Fixes when transpose optimization completes moving Transpose nodes - Remove empty DQ -> Q pairs if the type/scale/zero point match - Pushing a Transpose through may have resulted in an existing Transpose/Reshape being cancelled and removed leaving an empty QDQ node unit - the Transpose being moved may have started in a QDQ node unit - Transpose that got blocked inside existing QDQ node unit - e.g. if we hit a DQ -> MatMul -> Q node unit the Transpose gets blocked after the DQ - insert a Q -> DQ after the Transpose to put it in a QDQ node unit and repair the original QDQ node unit - Transpose moves past a DQ providing a graph output - insert a Q -> DQ so the Transpose is in a QDQ node unit This replaces the existing phase 2 logic which flipped a DQ -> Transpose to fix a broken QDQ node unit. The new approach should handle more scenarios and hopefully produce a better graph. Additionally the logic to handle updates to shared initializers that feed DQ nodes was simplified (i.e. largely removed). 
When we update the shared initializer, a Squeeze (if broadcast) and Transpose are added between the initializer and the DQ for other usages of it. We only need to check for this pattern in EstimateTransposeValueCost by looking past a DQ node. We do not need to track the individual DQ nodes leading to an updated shared initializer. ### Motivation and Context Initially to fix a QNN issue with non-const input being transposed and the QDQ node units being broken.	2023-11-23 08:27:47 +10:00
satyajandhyala	841f7ed3e0	[JS/Web] Added uniform to Expand op. (#18558 ) ### Description Added Uniforms to Expand operator kernel ### Motivation and Context Improve performance	2023-11-22 14:14:24 -08:00
Arthur Islamov	1c555c5fc1	[JS/Web] Resize & BiasSplitGelu fp16 support (#18536 ) ### Description Resize and BiasSplitGelu fp16 support on WebGPU	2023-11-22 12:12:07 -08:00
Xavier Dupré	3f0ebd6736	Fix opset import in GemmFloat8 python unit tests (#18489 ) ### Description The unit tests fail if a development version of onnx is used. The opsets are set to 19.	2023-11-22 09:15:24 -08:00
Xavier Dupré	32fabb5555	Fix opset version of the optimizer in function generate_artifacts (#18300 ) ### Description `generate_artifacts` generates 4 graphs for training. All graphs should share the same opset version, the one coming from the model to train, but the optimizer is left undefined. onnxruntime is using the latest version defined by onnx but onnxruntime does not necessarily support it. ### Motivation and Context The code does not let the user change it.	2023-11-22 09:15:11 -08:00
Wanming Lin	89723c8612	[WebNN EP] Mark and fallback unsupported op for WebNN CPU backend (#18472 ) The current WebNN CPU (XNNPack) backend supports a limited op list, so unsupported ops fall back directly for the WebNN "cpu" deviceType. This is a workaround because an op may be included in MLGraphBuilder for the DirectML backend but lack an XNNPack implementation in Chromium.	2023-11-22 09:05:30 -08:00
Vincent Wang	3bc9efc7b2	[ORTModule] Adjust Attention Patterns for Efficient Attention ATen Fallback (#18471 ) Adjust attention patterns to match latest Whisper+exporter. Also add some condition check and add docs.	2023-11-22 15:24:05 +08:00
Adrian Lizarraga	7c573054b6	[QDQ Optimizer] Fix logic that drops Q/DQ ops from QDQ split node groups (#18394 ) ### Description - Fix QDQ optimizer logic that drops Q/DQ ops from Split node groups so that it only occurs when all input/output quantization parameters are equal. - Currently, the selector used for this optimization does not ensure that all quantization parameters are equal. - Support dropping Q/DQ ops from Split node groups with optional split inputs (introduced opset 13). This was not working previously. ### Motivation and Context Fix bugs in handling of QDQ Split node groups. --------- Signed-off-by: adrianlizarraga <adlizarraga@microsoft.com>	2023-11-21 21:31:31 -08:00
Tianlei Wu	62da3b1ca4	SDXL Latent Consistency Model (LCM) optimization (#18526 ) Add support of LCM model (https://huggingface.co/latent-consistency/lcm-sdxl) in SDXL demo. Since the LCM model does not need classifier-free guidance, there is no need to use a negative prompt. The input and output shapes are different from the original SDXL model: no need to double the batch dimension. We also save metadata to the image, and update the image filename to include scheduler and steps. #### Latency (milliseconds) of generating 1024x1024 images on A100-SXM4-80GB GPU Engines are built with static input shape, and CUDA graph is enabled. For dynamic shape input, the latency could be higher. Batch Size \| Pipeline \| Steps \| ORT_CUDA \| ORT_TRT \| TRT 8.6 -- \| -- \| -- \| -- \| -- \| -- 1 \| LCM SDXL \| 4 \| 275 \| 249 \| 258 1 \| LCM SDXL \| 8 \| 460 \| 423 \| 430 1 \| SDXL Base \| 30 \| 2566 \| 2535 \| 2569 4 \| LCM SDXL \| 4 \| 925 \| 887 \| 1032 4 \| LCM SDXL \| 8 \| 1539 \| 1493 \| 1662 4 \| SDXL Base \| 30 \| 9227 \| 9408 \| 9678	2023-11-21 21:27:49 -08:00
Yulong Wang	d455b0f8fd	[js/web] use Chrome in CI for npm tests (#18522 ) ### Description use Chrome in CI for npm tests. Previously we use Edge, however it sometimes crashes with reasons not yet identified.	2023-11-21 18:03:57 -08:00
Jiajia Qin	ac8598a837	[js/webgpu] enable f16 for concat (#18528 ) ### Description With this PR, the `realesrgan-t64-f16` model goes from 1052.55 ms to 32.8 ms. Now the whole model runs on jsep.	2023-11-21 14:26:00 -08:00
Dmitri Smirnov	81a763a9eb	Make TensorShapeVector to use InlinedVector<Int64_t> to reduce on template instantiations (#18519 ) ### Description Use InlinedVector<int64> instead of <int64_t,5> to reduce the number of template instantiations. ### Motivation and Context The reported size reduction is small, just a few KB. Just trying it out.	2023-11-21 14:13:50 -08:00
Abhishek Jindal	680a526e73	Training packaging pipeline for cuda12 (#18524 ) ### Description Build ORT-training packaging pipeline for CUDA 12.2 ### Motivation and Context This will help any customer using CUDA 12, who will no longer need to build ORT-training from source. Test run: https://dev.azure.com/aiinfra/Lotus/_build/results?buildId=382993&view=logs&s=130be951-c2f3-5601-5709-434b5e50ddb0	2023-11-21 13:19:21 -08:00
Sheil Kumar	2a01622536	Hide NPU Adapter selection behind macro (#18515 ) Hide NPU Adapter selection behind macro --------- Co-authored-by: Sheil Kumar <sheilk@microsoft.com>	2023-11-21 08:47:56 -08:00
Xavier Dupré	29a409acaa	Add missing flags DISABLE_FLOAT8_TYPES in GemmFloat8 custom operator for CUDA < 11.8 (#18162 ) ### Description PR #16051 introduced operator GemmFloat8 but the flag DISABLE_FLOAT8_TYPES was missing in a couple of places. This PR addresses that issue, which allows compilation with CUDA < 11.8.	2023-11-21 14:37:48 +01:00
JiCheng	a608c002a3	fix past-kv in general LLM exporter (#18529 ) ### Description <!-- Describe your changes. --> For some models, we need to re-run model.forward to get past-kv. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2023-11-21 19:04:55 +08:00
Yulong Wang	c7fd930330	[js/web] unify resolve rules for "Clip" (#18527 ) ### Description It was a mistake to use 2 different names for the Clip operator in op-resolve-rules.ts for different opsets. An optimized implementation can handle both cases (opset < 11 and opset >= 11). Remove "ClipV10" as an entry from the table.	2023-11-20 23:18:06 -08:00
Jiajia Qin	abdf8b7c3f	[js/webgpu] Optimize broadcast binary. (#18185 ) ### Description Currently, the binary algorithms are divided into a vectorized one (efficient) and a non-vectorized one (less efficient). The following situations use the vectorized one: 1) A or B's shape length is 1. 2) The length of the shared dimensions of A and B is divisible by 4. 3) A and B have the same shape. This PR adds another situation that uses the vectorized algorithm: 4) A or B's last dimension is divisible by 4. With this change, the aggregate time of Add in sam-b-encoder drops from 409.12 ms to 309.65 ms on Intel ADL.	2023-11-20 16:52:17 -08:00
Dmitri Smirnov	cc542024ce	Create edges with arg positions correctly accounting for non-existing args (#18462 ) ### Description Truncate trailing non-existing arguments. Make sure we do not skip non-existing arguments in the middle, because shape inference relies on their proper position. This also affects the argument position in the Edges, which must be properly rebuilt each time an If node branch is inlined. Make sure that when we rename Defs in subgraphs, the new renamed defs are created in those subgraphs instead of pointing to outer scope defs. Add unit test. ### Motivation and Context This is a follow-up to https://github.com/microsoft/onnxruntime/pull/18105 Currently, the non-trailing arguments are simply ignored and the edges are created with potentially incorrect positions.	2023-11-20 14:49:09 -08:00
Yulong Wang	247ce21859	[js] optimize eslint config (#18460 ) ### Description optimize eslint config to: - set parserOptions.project to `true` to allow @typescript-eslint/parser to find the tsconfig.json file nearest to each source file. This helps to avoid parsing extra files, which may help with: - reducing the possibility of seeing OOM or stack overflow with "npm run lint" - faster processing - enforce rule "no-underscore-dangle" with a list of exceptions.	2023-11-20 12:00:56 -08:00
Jian Chen	1dd9bf5340	Remove setup_env_azure.bat (#18482 ) ### Description <!-- Describe your changes. --> ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2023-11-20 09:58:15 -08:00
Jambay Kinley	1af0681554	Bfloat16 support for MatMulBnb4, Training support bitsandbytes>=0.41.2 (#18484 ) ### Description <!-- Describe your changes. --> Add bfloat16 support for `MatMulBnb4` contrib op. This is useful for QLoRA fine-tuning. - On GPUs with SM80+ (A100, etc.), it uses the native CUDA bfloat16 dtype, `nv_bfloat16`. On other GPUs, it uses the onnxruntime `BFloat16` type, which uses float for compute. - I have validated the op in a llama2-7b training scenario. The losses match PyTorch training and the training throughput is better. - Cannot add a bfloat16 case in the op unit test since casting BFloat16 to and from float multiple times during the test causes the required tolerances to be unachievable. The custom autograd function exporter in onnxruntime-training is updated to support the latest version of bitsandbytes. They changed how the `quant_state` is stored. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> Enable QLoRA fine-tuning with bfloat16.	2023-11-20 09:52:58 -08:00
Jian Chen	d97fc1824f	Create a new Python Package pipeline for CUDA 12 (#18348 ) ### Description <!-- Describe your changes. --> ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2023-11-20 09:48:28 -08:00
Wei-Sheng Chin	3bcc137eb4	Tiny change to trigger the update of DORT's CI image (#18507 ) Recent PyTorch breaks DORT CI and [a patch](https://github.com/pytorch/pytorch/pull/113697) has been merged into PyTorch main. In order to update DORT's CI, we made a dummy change in this PR.	2023-11-19 22:09:11 -08:00
Changming Sun	dc9ab4f821	Update setup.py: replace libcudart.so.12.0 with libcudart.so.12 (#18501 )	2023-11-19 22:06:32 -08:00
Akshay Sonawane	97cc40d75a	Add fusion patterns for conformer-transducer model (#18461 ) ### Description Add conformer-transducer model type to optimizer. This PR adds pattern matches for attention shown below: Unfused attention: ![ct_unfused](https://github.com/microsoft/onnxruntime/assets/111780983/46c71ed8-67e0-4607-85b1-bcadba5a2956) Fused attention: ![ct_fused](https://github.com/microsoft/onnxruntime/assets/111780983/fbb91c96-0d4b-4f0b-8674-1ae3b9b9a92e)	2023-11-18 23:39:04 -08:00
RandySheriffH	53917a3353	Move up members in Lite Custom Op hierarchy for possible memleaks. (#18478 ) Move data member in LiteOpFunc to its parent to avoid possible mem leaks. --------- Co-authored-by: Randy Shuai <rashuai@microsoft.com>	2023-11-18 15:00:54 -08:00

10058 commits