onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-07-16 18:31:27 +00:00

Author	SHA1	Message	Date
Wanming Lin	73a2eb82eb	Fixed bug in Flatten's axis (#18645 ) Flatten's axis is in the range [-r, r] rather than [-r, r-1].	2023-11-30 16:19:22 -08:00
Jiajia Qin	6781b6cf3d	[js/webgpu] add bool type for Expand/Gather (#18615 ) ### Description In [detr-resnet-50](https://huggingface.co/Xenova/detr-resnet-50) model, it uses expand with bool type running on cpu ep. \| Kernel \| Shape \| Provider \| \| -------- \| ------- \| ------- \| \| Expand \| "input_type_shape" : [{"bool":[1,1,1,625]},{"int64":[4]}],"activation_size" : "657","output_type_shape" : [{"bool":[1,1,625,625]}] \| CPUExecutionProvider \| After this change, it will run on jsep. \| Kernel \| Shape \| Provider \| \| -------- \| ------- \| ------- \| \| Expand \| "input_type_shape" : [{"bool":[1,1,1,625]},{"int64":[4]}],"activation_size" : "657","output_type_shape" : [{"bool":[1,1,625,625]}] \| JsExecutionProvider \|	2023-11-30 15:47:08 -08:00
Yi Zhang	efee9abdb7	Reduce downloads in Nuget-Java pipeline to reduce connection exception (#18635 ) ### Description 1. Add a new stage to download java tools from https://oss.sonatype.org and publish them to pipeline artifact 2. Remove downloads in other jobs, they get the java tools from pipeline artifact 3. consolidate final_java_testing stages. ### Motivation and Context Reduce downloads to reduce the connection error like below. ``` --2023-11-28 07:16:31-- https://oss.sonatype.org/service/local/repositories/releases/content/org/junit/platform/junit-platform-console-standalone/1.6.2/junit-platform-console-standalone-1.6.2.jar Resolving oss.sonatype.org (oss.sonatype.org)... 3.227.40.198, 3.229.50.23 Connecting to oss.sonatype.org (oss.sonatype.org)\|3.227.40.198\|:443... connected. HTTP request sent, awaiting response... 502 Bad Gateway 2023-11-28 07:16:32 ERROR 502: Bad Gateway. ```	2023-12-01 07:44:44 +08:00
zesongw	4025bd8ebd	[WebNN EP] Fix bug of padding in Op ConvTranspose (#18577 ) Get the dimensions of H and W according to the layout.	2023-11-30 12:59:36 -08:00
Jiajia Qin	b1e749e3be	[js/webgpu] Add program name into webgpuProfiling info (#18640 ) ### Description Currently, we only print the kernelName, which is hard to distinguish which shader we actually used. For example, GroupedConv/Conv2DMatMul both belong to Conv kernel. It's not intuitive for profiling.	2023-11-30 12:57:29 -08:00
Dmitri Smirnov	c5ea1547c6	Eliminate intermediate string conversion buffer. (#18608 ) ### Description Make use of unsafe string constructor that is able to convert native UTF-8 string straight into the string instance buffer. ### Motivation and Context Reduce garbage,	2023-11-30 10:50:24 -08:00
Yulong Wang	e7f64f4510	[js/web] fix ESLint by excluding generated .js from tsconfig.json (#18634 ) ### Description ESLint will went into error sometimes. The root cause is because some large generated JavaScript file in the tsconfig's include path will cause TypeScript parser fail in a line of `string.match()` with a regex on a huge string (~8MB), causing the following error: ``` RangeError: Maximum call stack size exceeded ``` The solution is to remove the large files from the tsconfig's include path. Previously I excluded the `web/dist/` folder and this PR excludes `web/test/ort.test[.min].js`.	2023-11-30 09:50:47 -08:00
Changming Sun	23a91c8ba8	Fix warning C4003 in ORT python binding code (#18612 ) ### Description Fix warning C4003 in ORT python binding code. ### Motivation and Context It's better to fix the warning instead of suppressing it.	2023-11-30 08:07:47 -08:00
Changming Sun	1b5675ff0f	Update post-merge-jobs.yml: increase timeout value for the Ios job (#18602 )	2023-11-30 08:07:13 -08:00
Vincent Wang	148495ebc5	[ORTModule] Use Default Topo-order for GraphViewer (#18410 ) ORT's default topo-order is a reversed DFS algorithm, while the priority-based topo-order is a forward BFS algorithm. It's likely that the default order is better than priority-based order on memory because tensor memory is more likely to be released right after it's consumed. Currently ORTModule uses priority-based order, for some models, it sorts lots of small Ops to the beginning, this introduces big CPU overhead at the beginning (see below screenshot), this PR is to use default order for training. The priority-based order is heavily used for some recompute optimization, so if there is recompute enabled, we will still use priority-based order. This PR also adds an optimization to the default order, which is to move all Shape/Size Ops to right after their parent nodes. This is to make sure the shape and size nodes are executed right after their parents so it's possible the input tensor memory can be released as soon as possible. This is especially important for non-CPU devices or for training case where some gradient graphs use only shape/size of tensors from forward. Profiling result: Before <img width="910" alt="截屏2023-11-13 12 09 02" src="https://github.com/microsoft/onnxruntime/assets/11661208/e54d5ead-274f-4725-923e-521bbcfce752"> After <img width="910" alt="截屏2023-11-13 12 10 44" src="https://github.com/microsoft/onnxruntime/assets/11661208/f50d196d-11ac-43a2-9493-517e4552ffab">	2023-11-30 20:17:22 +08:00
Vincent Wang	e1d1033131	[ORTModule] Remove Unused Arguments from Generated Triton Code (#18636 ) This PR: - Remove unused arguments from generated triton code, - Remove unnecessary mask for symbolic shape case from generated triton code. - Add doc for usage of ORTMODULE_TRITON_CONFIG_FILE.	2023-11-30 18:32:36 +08:00
George Wu	5c67a00d8e	Revert "remove full protobuf requirement for tensorrt ep" (#18626 ) Reverts microsoft/onnxruntime#18413 there's a timing issue here. we eventually want to get this change merged in but we need to update OSS onnx-tensorrt first.	2023-11-29 22:27:51 -08:00
Jambay Kinley	c20488ced7	skip_infer for SkipGroupNorm in SymbolicShapeInference (#18630 ) ### Description <!-- Describe your changes. --> https://github.com/microsoft/onnxruntime/pull/18273 added `SkipGroupNorm` contrib op but it did not skip onnx shape inference for this op in `SymbolicShapeInference`. This leads to failed shape inference of the transformers optimized model with `enable_skip_group_norm=True`. Also results in an invalid float16 model for the SD CUDA example. This PR adds `SkipGroupNorm` to `skip_infer` so that it skips onnx shape inference for this op and instead uses the relevant dispatcher. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> Fix shape inference failure for models with `SkipGroupNorm` nodes.	2023-11-29 18:27:04 -08:00
Yang Gu	227dcb3a88	[js/webgpu] Log the key and program info for artifact (#18365 ) With uniform support, ideally we may just keep one artifact for each program to save the compilation time. This PR just logs the related info, including key and program name, so that we may understand better the situation.	2023-11-29 18:01:12 -08:00
satyajandhyala	7335760424	[JS/Web] Add uniforms to Einsum (#18531 ) ### Description Add uinforms to Einsum ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> Improve performance.	2023-11-29 15:30:33 -08:00
Edward Chen	483c490ec4	Refine error checks in onnxruntime/core/providers/coreml/model/model.mm. (#18620 ) #18606 updated the original error checks to check that the returned object != nil to appease the static analyzer. However, per the API docs, checking `error != nil` is the way to determine whether an error occurred. This change adds back the `error != nil` check to be safe.	2023-11-29 14:38:44 -08:00
Dmitri Smirnov	d2dfbf4179	Add float16 type support to SplitToSequence and make code type independent (#18594 ) ### Description Add support for `float16` type to address the below issue. Re-work the code to make it type independent. This reduces binary size by ~11 K. ![image](https://github.com/microsoft/onnxruntime/assets/11303988/1a77c7bc-34a8-478c-a16a-abd94062c6c6) ### Motivation and Context This PR addresses https://github.com/microsoft/onnxruntime/issues/18481	2023-11-29 10:44:59 -08:00
Yi Zhang	68209307da	Replace all Azure-Pipelines-EO-Windows2022-aiinfrat to Onnxruntime-Win-CPU-2022 (#18614 ) ### Description Replace all Azure-Pipelines-EO-Windows2022-aiinfrat to Onnxruntime-Win-CPU-2022 ### Motivation and Context Reduce the maintenance cost	2023-11-29 10:32:42 -08:00
Wanming Lin	38b640c797	[WebNN EP] Re-implement Unsqueeze, Squeeze, Flatten with WebNN's reshape (#18585 ) WebNN will not provide `unsqueeze`, `squeeze`, `flatten2d` ops, as it can be easily implemented by reshape.	2023-11-29 08:00:23 -08:00
Edward Chen	14a343441d	Fix Objective-C static analysis build (#18606 ) - Patch abseil to fix a compile error about not finding `cxxabi.h`. - Fix some static analysis warnings.	2023-11-28 17:14:20 -08:00
ivberg	e833d22f14	Change QNN EP Profiling logs to output to CSV (#18201 ) ### Description Change QNN EP Profiling logs to output to CSV. Output is in a similar format to QNN SDK Tools (instead of to ORT logs) https://onnxruntime.ai/docs/execution-providers/QNN-ExecutionProvider.html#configuration-options (profiling_level) ### Motivation and Context It is hard to read and interpret QNN profiling logs in the ORT logs. --------- Co-authored-by: Hector Li <hecli@microsoft.com>	2023-11-28 16:58:51 -08:00
Tianlei Wu	f13380f3d8	Support LoRA and Control Net in Stable Diffusion demo (#18593 ) ### Description (1) Export onnx model with LoRA weights for both SD 1.5 and SDXL (2) Export onnx model with Control Net for both SD 1.5 and SDXL. For SD 1.5, it is allowed to use multiple control nets. For SDXL, at most one control net is supported right now. (3) Add demo of LCM LoRA (3) Add demo of control net.	2023-11-28 15:46:42 -08:00
Yulong Wang	50e6235af1	[js/web] allow ShaderHelper to use internal (non-I/O) variables (#18525 ) ### Description This PR includes a change that inspired from #18452 to resolve a requirement: a shader may depend on an instance of `IndicesHelper` to generate WGSL code snippet, but the IndicesHelper instance is not necessarily an input/output of the program. So the existing `declareVariables()` function does not work with this scenario. In order to support this requirement, I added this "use" function to `interface ShaderHelper`, which takes a helper-like object as parameter. The hidden implementation `ShaderHelperImpl` class will iterate the helpers and call `impl()` for each. @axinging @qjia7	2023-11-28 15:15:59 -08:00
Jian Chen	a49f31b670	Remove drop-nuget artifact from all pipelines (#18592 ) ### Description Currently, the `drop-nuget` artifact only contains protoc.exe which is also part of the `drop-extra` artifact. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2023-11-28 13:23:01 -08:00
Mike Guo	e24733cfe9	fix the Olive CI pipeline failure on Windows (#18464 ) Fix the https://aiinfra.visualstudio.com/Lotus/_build?definitionId=1046 failure for Windows	2023-11-28 11:42:39 -08:00
Rachel Guo	288b80d363	Add MacOS build to ORT C Pod (#18550 ) ### Description <!-- Describe your changes. --> As title. 1. Add macos build as an optionally enabled arch for pod and changes to exsiting build_ios_framework/assemble_c_pod scripts. 2. Enable macos build arch in ios packaging pipeline (currently for variants other than Mobile) and check the output artifacts are correct. 3. Write MacOS Test Target scheme in the test app and integrate into ios packaging CI testing pipeline. Currently the changes only apply to onnxruntime-c pod. as the original request was from ORT SPM which consumes the onnxruntime-c pod only as the binary target. TODO: could look into adding macos platform to objc pod as well. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> Enable macos platform support in cocoapods. and also potentially produce binary target for enabling macos platform in SPM as well. Replace https://github.com/microsoft/onnxruntime/pull/18334 --------- Co-authored-by: rachguo <rachguo@rachguos-Mac-mini.local> Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net> Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>	2023-11-28 10:11:53 -08:00
Chen Fu	05046e5452	Adding unit test for sm80 prepack (#18514 ) ### Description Prepacking code for block q4 x fp16 GEMM cuda kernel, for SM80 hardware ### Motivation and Context Preparing for addition of Q4 x FP16 GEMM kernel on Nvidia Ampere GPUs. This kernel requires sophisticated quantized weight rearrangement to speedup loading data to tensor-core. To facilitate the addition, this change includes the following: 1. matrix_layout.h A new layout lib that facilitate iterating matrix elements and tiles that balance memory safety and performance. 2. prepack_sm80.h Code for rearranging quantized weight, scales and offsets (aka. prepacking) 3. blkq4_fp16_sm80_prepack_test.cc Unit tests that explicitly test the memory safety and correctness of the prepacking code. Currently the prepacking code runs on CPU with single threaded code. We run this on CPU in order to minimize GPU memory fragmentation. On the other hand, hopefully we get around to parallelize this part of the code. Should be straight forward with the unit tests in place.	2023-11-28 10:01:09 -08:00
Adrian Lizarraga	8d5ecc4dae	[Quantization] Fix scale/zero-point for 16-bit QDQ Softmax (#18589 ) ### Description Sets the appropriate scale and zero-point values for 16-bit QDQ Softmax. Previously, the scale/zp were set to fixed values that were specific to 8-bit quantization. ### Motivation and Context Generate more accurate 16-bit QDQ models that contain Softmax.	2023-11-28 09:46:47 -08:00
Sheil Kumar	0b7048e7d6	Update winml to use #cores - #soc cores by Default as the number of intraopthreads (#18384 ) Update winml to use #cores - #soc cores by Default as the number of intraopthreads --------- Co-authored-by: Sheil Kumar <sheilk@microsoft.com>	2023-11-28 09:26:48 -08:00
Yi Zhang	a6d8726407	Update ADO windows image to custom image (#18598 ) ### Description Update Azure-Pipelines-EO-Windows2022-aiinfra to onnxruntime-win-CPU-2022 in Nuget_Package_CPU. To make the debugging easier, use flex-downloadPipelineArtifact ### Motivation and Context Azure-Pipelines-EO-Windows2022-aiinfra is using 1ES window-latest image. The pipeline might be failed by unexpected upgrade. Verified: https://dev.azure.com/aiinfra/Lotus/_build/results?buildId=384425&view=results ### P.S. I think we should replace all Azure-Pipelines-EO-Windows2022-aiinfra.	2023-11-28 09:04:25 -08:00
Jian Chen	3ea27c2925	Create a new Nuget Package pipeline for CUDA 12 (#18135 )	2023-11-28 09:03:46 -08:00
Xavier Dupré	94a6020a7f	Improve parallelization of TfIdfVectorizer, Reduce memory consumption (#18539 ) ### Description TfIdfVectorizer has two steps: first search for n-grams in the input, second, weight the results. The second step was not parallelized. The PR adresses that issue. Before two vectors were of the size of the output were allocated to compute the results. The first one, frequencies, was used as an intermediate vector between the two steps. This vector is now broken into multiple small vectors, one per thread. The memory consumption is then reduced for batches with a number of rows > the number of threads. ### Motivation and Context Performance and memory consumption. For one model, the improvment is +15% faster (4 cores, model size is ~6Mb, batch size is 100). Here is another benchmark on a machine with 32 cores with different size of vocabularies and batch sizes. The tested TfIdfVectorizer only deals with unigram and processes sequences of 10 tokens (integers). ![image](https://github.com/microsoft/onnxruntime/assets/22452781/0bb9abe9-ed81-44da-b5c4-ad2a12f129bd)	2023-11-28 12:56:00 +01:00
Ran Gal	3f42fbad2e	deleted the unused random_device variables because they caused a warning that was treated like an error. (#18543 ) deleted the unused random_device variables because they caused a warning that was treated like an error. _Please check if the declaration is required for the random number generation. if so, there need to be a dummy reference to the variable or turning off the warning as error behavior._ ### Description <!-- Describe your changes. --> ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2023-11-28 08:54:38 +01:00
Jiajia Qin	fc8631e2f1	[js/web] Fix conv2dMatmul errors due to #18452 (#18562 ) ### Description Currently, all conv2dMatmul with inChannels = 3 and outChannels % 4 = 0 will report compilation errors. Models, which include this kind of shape will be impacted, like mobilenetv2-12, resnet50 . The errors is introduced by #18452 https://github.com/microsoft/onnxruntime/pull/18452/files#diff-8b24ea43aa11b1346c0c9e327f9bce6b37a93bd8f2bf8a6392b2b263972b7ea2R200, which accidentally pass `components` to `x`. But `x`'s components is `innerElementSize` not `components `. And when `innerElementSize` is 3, we should use `1` in current design.	2023-11-27 21:21:47 -08:00
cao lei	b9fd9c5665	remove dead code in openvino EP (#18457 ) ### Description <!-- Describe your changes. --> Remove dead code in openvino EP ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> Remove dead code in openvino EP	2023-11-27 13:41:12 -08:00
Caroline Zhu	dd355e39a0	[js/web/training] Added parameters methods (#18250 ) ### Description * Implemented: `getParametersSize`, `getContiguousParameters` (equivalent to copyParametersToBuffer), and `loadParametersBuffer` (equivalent to copyParametersFromBuffer) * as part of these changes, getParametersSize was added to the TrainingSession interface so that users know what size buffer to create for loadParametersBuffer * The parameters methods in the interface were modified to take in a Float32Array instead ### Motivation and Context * part of the work for implementing web bindings for training * enables federated learning in the web * previous PR: #18006 --------- Co-authored-by: Ashwini Khade <askhade@microsoft.com>	2023-11-27 10:30:13 -08:00
Hector Li	a2fd8a6fc0	[QNN EP] Return INVALID_GRAPH if failed to load from context binary (#18485 ) ### Description [QNN EP] Return INVALID_GRAPH if failed to load from context binary ### Motivation and Context Make sure QNN EP return INVALID_GRAPH if error encountered with the context binary file	2023-11-24 20:41:27 -08:00
cloudhan	2f608338cb	Setup default python formatter for new python plugin (#18563 )	2023-11-24 18:04:48 +08:00
Ted Themistokleous	7b2aefa856	undo hipify of __half to rocblas_half (#18573 ) Fixes build issue seen with newer ROCm releases Co-authored-by: Jeff Daily <jeff.daily@amd.com>	2023-11-24 18:04:23 +08:00
mindest	b9c935f605	[ROCm] Some fixes in tunable (#18575 ) ### Description * Fix workspace size for hipBLASLt algos at 32M * Update according to API changes	2023-11-24 17:22:00 +08:00
Rachel Guo	62f00ad8e7	[CoreML] Add Softmax and Split op support (#18358 ) ### Description <!-- Describe your changes. --> As title. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> Added for yolov8 model missing operator support. https://github.com/microsoft/onnxruntime/issues/17654 Now the model support info looks like: _CoreMLExecutionProvider::GetCapability, number of partitions supported by CoreML: 3 number of nodes in the graph: 233 number of nodes supported by CoreML: 230_ (only missing 3 concat op support due to input 3d shape is not currently support in CoreML EP Concat). --------- Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net> Co-authored-by: rachguo <rachguo@rachguos-Mac-mini.local> Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>	2023-11-23 14:26:57 -08:00
cloudhan	6f3c1f9dc9	[ROCm] Update ck for GemmFloat8 (#18487 )	2023-11-23 12:06:19 +08:00
Adrian Lizarraga	1c79897c90	[QNN EP] Support LpNormalization (#18561 ) ### Description Add support for the ONNX LpNormalization operator (p == 2). This is translated to QNN's L2Norm operator. ### Motivation and Context Support more models with QNN EP	2023-11-22 19:40:33 -08:00
pengwa	43a5147e01	Memory optimization refactor and refinement (#17481 ) ### Memory optimization refactor and refinement Currently memory optimizer runs graph transformations and print recompute opportunities in INFO level, while ORT backend has many many INFO level logs making users hard to find those information. So we are looking for a Python binding API to retrieve the memory optimization opportunities instead of depending on the MemoryOptimizer's default logging. Then we can print ORTModule feature statistics using this information. Also, with such an API, we can create an ORT session created, where allocation plan is done, the analysis will consider buffer reuse as well. This can void giving some recomputation subgraphs that are reusing other subgraphs' output buffers. Check https://github.com/microsoft/onnxruntime/blob/pengwa/add_devinfo_level/docs/Memory_Optimizer.md for the new flow using `MemoryOptimizer`. This pull requests made following refactoring: 1. Print the log in ORTModule Python script, along with ORTModule feature enabling stats. This is implemented by exposing an API `get_serialized_ortmodule_memory_stat` to retrieve the memory optimization opportunities. 2. We are analyzing memory optimization opportunities considering ORT memory planning. This is done by firstly creating the execution graph without enabling MemoryOptimizer, then we call `execution_agent.get_serialized_ortmodule_memory_stat` which internally will consider the session memory allocation planner when analyzing memory optimization opportunity. As a direct result, the memory optimization opportunities can show those stashed activations that are reusing other buffers. 3. Move recompute analysis logic from memory_optimizer.h/cc to recompute_analysis.h/cc. 4. Abstract optimization strategies for their own implementation. This will make introducing new strategies (for example compression and decompression ) easier. New logging matrix (INFO Level), in WARNING level, the details will NOT show. ``` 2023-09-13 13:25:09,249 orttraining.rank-0 [WARNING] - *** ONNX Runtime Training (ORTModule) is accelerating your model *** ORTModule is enabled with following features ON/OFF for [training] mode: ATen Executor : ON : Dispatch ATen operators to ORT's ATen executor Cast Propagation : ON : Level 1 enabled Custom Function : ON : Support custom torch.autograd.Function export and execution Memory Optimizer : ON : RecomputeConfig: Reshape+Where+BiasSoftmax+:1:-1,Cast+:1:-1, ProbeLevel: 1, available configs: Config Freq Saving(B) Saving Symbolic(Bytes) - Plan 1 : ON : Reshape+Where+BiasSoftmax+:1:-1 5 671,088,640 640.0inputs_input_ids_dim0inputs_input_ids_dim1*2 - Plan 2 : ON : Cast+:1:-1 6 402,587,648 inputs_input_ids_dim0inputs_input_ids_dim1(384.0inputs_input_ids_dim1 - 64.0) - Plan 3 : OFF : Reshape+Where+:1:-1 1 134,217,728 128.0inputs_input_ids_dim0inputs_input_ids_dim1*2 - Plan 4 : OFF : BiasSoftmax+:1:-1 1 134,086,656 128.0inputs_input_ids_dim0inputs_input_ids_dim1(inputs_input_ids_dim1 - 1) - Plan 5 : OFF : BiasGelu+:1:-1 6 125,808,640 inputs_input_ids_dim0(122880.0inputs_input_ids_dim1 - 20480.0) - Plan 6 : OFF : FusedMatMul+:1:-1 6 125,808,640 inputs_input_ids_dim0(122880.0inputs_input_ids_dim1 - 20480.0) - Plan 7 : OFF : FusedMatMul+Add+FusedMatMul+Add+Add+Add+:1:-1 5 26,214,400 25600.0inputs_input_ids_dim0inputs_input_ids_dim1 - Plan 8 : OFF : Add+:1:-1 1 5,237,760 5120.0inputs_input_ids_dim0(inputs_input_ids_dim1 - 1) - Plan 9 : OFF : Reshape+Unsqueeze+Unsqueeze+Cast+Sub+Mul+Cast+:1:-1 1 4,096 4.0inputs_input_ids_dim0inputs_input_ids_dim1 - Plan 10 : OFF : Cast+:2:-1 1 2,048 2.0inputs_input_ids_dim0inputs_input_ids_dim1 Compute Optimizer : ON : Enable/Disable with env ORTMODULE_ENABLE_COMPUTE_OPTIMIZER=1/0 - FLOPReduction : ON : Reduce FLOPs by upstreaming shrinking-sized ops Auto Fallback : ON : Fallback to PyTorch when encountering unsupported ops TritonOp Enabled : OFF : ORT will switch to Triton for executing some ops to further accelerate training. ZeRO Stage3 Support : OFF : Enable/Disable with env ORTMODULE_ENABLE_ZERO_STAGE3=1/0 Total ORT initialization overhead is 10.73s where export takes 8.39s. Other overhead details: graph builder init takes 0.06s, runtime detection takes 0.01s, graph building takes 0.31s, session creation takes 1.96s Versions: ONNX Runtime - 1.16.0+cu118, ONNX - 1.11.0 Note 1: use comma to enable multiple plans at the same time. export ORTMODULE_MEMORY_OPT_CONFIG=<plan1 config>,<plan2 config>,... Note 2: saving is calculated based on the 1st batch symbolic dim values: inputs_input_ids_dim0=1, inputs_input_ids_dim1=1024, inputs_attention_mask_dim0=1, inputs_attention_mask_dim1=1024, inputs_labels_dim0=1, inputs_labels_dim1=1024, ************************************************************************ ``` If DEVINFO level is enabled, then more details about the memory optimizations are printed. ``` MemoryInsight Summary - User config: BiasGelu+:1:-1,Cast+:2:-1 ========================================================================================================================================== \|Freq \| Memory Optimization Opportunities (Clustered by node-level activation patterns) \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|3 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph FusedMatMul+Add+Reshape+ \| \| \| Status : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=FusedMatMul+Add+Reshape+:1:-1 \| \| \| Stashed Activations: \| \| \| - ReuseFreq : Output 0(3), \| \| \| - Output 0 : [inputs_input_ids_dim0 x inputs_input_ids_dim1 x 32 x 240 x ], byte/elem: 2, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|2 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph Reshape+ \| \| \| Status : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Reshape+:1:-1 \| \| \| Stashed Activations: \| \| \| - ReuseFreq : Output 0(2), \| \| \| - Output 0 : [ x 2560 x ], byte/elem: 2, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|2 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph FusedMatMul+ \| \| \| Status : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=FusedMatMul+:1:-1 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0 x inputs_input_ids_dim1 x 10240 x ], byte/elem: 2, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|2 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph Cast+ \| \| \| Status : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Cast+:1:-1 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0 x 32 x inputs_input_ids_dim1 x inputs_input_ids_dim1 x ], byte/elem: 2, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|2 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph Reshape+Where+BiasSoftmax+ \| \| \| Status : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Reshape+Where+BiasSoftmax+:1:-1 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0 x 32 x inputs_input_ids_dim1 x inputs_input_ids_dim1 x ], byte/elem: 4, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|2 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph BiasGelu+ \| \| \| Status : Enabled, requested count=-1, actual applied count=2 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0 x inputs_input_ids_dim1 x 10240 x ], byte/elem: 2, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|2 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph FusedMatMul+Add+FusedMatMul+Add+Add+Add+ \| \| \| Status : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=FusedMatMul+Add+FusedMatMul+Add+Add+Add+:1:-1 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0 x inputs_input_ids_dim1 x 2560 x ], byte/elem: 2, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|1 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph Reshape+Where+ \| \| \| Status : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Reshape+Where+:1:-1 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0 x 32 x inputs_input_ids_dim1 x inputs_input_ids_dim1 x ], byte/elem: 4, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|1 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph FusedMatMul+ \| \| \| Status : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=FusedMatMul+:1:-1 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0(inputs_input_ids_dim1 - 1) x 10240 x ], byte/elem: 2, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|1 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph Cast+ \| \| \| Status : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Cast+:1:-1 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0 x 32 x inputs_input_ids_dim1 - 1 x inputs_input_ids_dim1 x ], byte/elem: 2, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|1 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph Reshape+Unsqueeze+Unsqueeze+Cast+Sub+Mul+Cast+ \| \| \| Status : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Reshape+Unsqueeze+Unsqueeze+Cast+Sub+Mul+Cast+:1:-1 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0 x 1 x 1 x inputs_input_ids_dim1 x ], byte/elem: 4, 100% saved \| \| \| \| \| \|>>Option 2 : RecomputeWithCompromise subgraph Cast+ \| \| \| Status : Enabled, requested count=-1, actual applied count=1 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0 x 1 x 1 x inputs_input_ids_dim1 x ], byte/elem: 4, 50% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|1 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph BiasSoftmax+ \| \| \| Status : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=BiasSoftmax+:1:-1 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0 x 32 x inputs_input_ids_dim1 - 1 x inputs_input_ids_dim1 x ], byte/elem: 4, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|1 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph BiasGelu+ \| \| \| Status : Enabled, requested count=-1, actual applied count=1 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0(inputs_input_ids_dim1 - 1) x 10240 x ], byte/elem: 2, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|1 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph Add+ \| \| \| Status : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Add+:1:-1 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0(inputs_input_ids_dim1 - 1) x 2560 x ], byte/elem: 2, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| ========================================================================================================================================== Note: use comma as a separator for enabling more than one subgraphs. *********************************************************************** ``` ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2023-11-23 11:39:00 +08:00
Jiajia Qin	64dacc2892	[js/webgpu] Add BatchNormalization Op (#18468 ) ### Description This PR adds `BatchNormalization` with `float` support. Some Todos: 1. all inputs don't have same data type. For example, x/y is float16, but bias/scale is float32 or double. 2. training mode support. We see many models are using `BatchNormalization` ops. However, due to the missing in jsep, all of them run on cpu, which result very poor performance. With this PR's support, densenet-9 model becomes 20.29 ms from 250.69 ms.	2023-11-22 15:58:06 -08:00
Xu Xing	fa106942a7	[js/webgpu] Refactor matmul conv to support uniforms for matmul (#18452 ) This change refactored matmul/conv related programs to support shape uniforms. Currently only matmul shape uniforms are fully enabled. TODOs: add input dependencies for conv related programs, turn clipMax and clipMin to uniforms.	2023-11-22 14:42:55 -08:00
Scott McKay	42c6799c59	Update transpose optimization to be more QDQ aware (#18444 ) ### Description <!-- Describe your changes. --> Rework some aspects of the transpose optimizer to ensure we have valid QDQ node units when it is done. Conceptually we need to let individual Transpose nodes move through the graph when optimizing. That can invalidate existing QDQ node units or require new ones. We can fix this after inserting new nodes, or when transpose optimization finishes moving Transpose nodes. Fix when inserting new node - TransposeInputs can add an Unsqueeze (to broadcast) and Transpose to a node's inputs - if there was a DQ node providing the input, add a Q -> DQ after inserting the Unsqueeze/Transpose to make a QDQ node unit for the new node. - Unsqueeze/Transpose don't change data, so we can copy the type/scale/zero point from the existing DQ Fixes when transpose optimization completes moving Transpose nodes - Remove empty DQ -> Q pairs if the type/scale/zero point match - Pushing a Transpose through may have resulted in an existing Transpose/Reshape being cancelled and removed leaving an empty QDQ node unit - the Transpose being moved may have started in a QDQ node unit - Transpose that got blocked inside existing QDQ node unit - e.g. if we hit a DQ -> MatMul -> Q node unit the Transpose gets blocked after the DQ - insert a Q -> DQ after the Transpose to put it in a QDQ node unit and repair the original QDQ node unit - Transpose moves past a DQ providing a graph output - insert a Q -> DQ so the Transpose is in a QDQ node unit This replaces the existing phase 2 logic which flipped a DQ -> Transpose to fix a broken QDQ node unit. The new approach should handle more scenarios and hopefully produce a better graph. Additionally the logic to handle updates to shared initializers that feed DQ nodes was simplified (i.e. largely removed). When we update the shared initializer a Squeeze (if broadcast) and Transpose is added between the initializer and the DQ for other usages of it. We only need to check for this pattern in EstimateTransposeValueCost by looking past a DQ node. We do not need to track the individual DQ nodes leading to an updated shared initializer. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> Initially to fix QNN issue with non-const input being transpose and the QDQ node units being broken.	2023-11-23 08:27:47 +10:00
satyajandhyala	841f7ed3e0	[[JS/Web]Added uniform to Expand op. (#18558 ) ### Description <!-- Describe your changes. --> Added Uniforms to Expand operator kernel ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> Improve performance	2023-11-22 14:14:24 -08:00
Arthur Islamov	1c555c5fc1	[JS/Web] Resize & BiasSplitGelu fp16 support (#18536 ) ### Description Resize and BiasSplitGelu fp16 support on WebGPU	2023-11-22 12:12:07 -08:00
Xavier Dupré	3f0ebd6736	Fix opset import in GemmFloat8 python unit tests (#18489 ) ### Description The unit test are failing if a development version of onnx is used. The opset are set to 19.	2023-11-22 09:15:24 -08:00

1 2 3 4 5 ...

10081 commits