Commit graph

10081 commits

Author SHA1 Message Date
Wanming Lin
73a2eb82eb
Fixed bug in Flatten's axis (#18645)
Flatten's axis is in the range [-r, r] rather than [-r, r-1].
2023-11-30 16:19:22 -08:00
Jiajia Qin
6781b6cf3d
[js/webgpu] add bool type for Expand/Gather (#18615)
### Description
In [detr-resnet-50](https://huggingface.co/Xenova/detr-resnet-50) model,
it uses expand with bool type running on cpu ep.




| Kernel    | Shape | Provider |
| -------- | ------- | ------- |
| Expand | "input_type_shape" :
[{"bool":[1,1,1,625]},{"int64":[4]}],"activation_size" :
"657","output_type_shape" : [{"bool":[1,1,625,625]}] |
CPUExecutionProvider |

After this change, it will run on jsep.
| Kernel    | Shape | Provider |
| -------- | ------- | ------- |
| Expand | "input_type_shape" :
[{"bool":[1,1,1,625]},{"int64":[4]}],"activation_size" :
"657","output_type_shape" : [{"bool":[1,1,625,625]}] |
JsExecutionProvider |
2023-11-30 15:47:08 -08:00
Yi Zhang
efee9abdb7
Reduce downloads in Nuget-Java pipeline to reduce connection exception (#18635)
### Description
1. Add a new stage to download java tools from https://oss.sonatype.org
and publish them to pipeline artifact
2. Remove downloads in other jobs, they get the java tools from pipeline
artifact
3. consolidate final_java_testing stages.


### Motivation and Context
Reduce downloads to reduce the connection error like below.

```
--2023-11-28 07:16:31--  https://oss.sonatype.org/service/local/repositories/releases/content/org/junit/platform/junit-platform-console-standalone/1.6.2/junit-platform-console-standalone-1.6.2.jar
Resolving oss.sonatype.org (oss.sonatype.org)... 3.227.40.198, 3.229.50.23
Connecting to oss.sonatype.org (oss.sonatype.org)|3.227.40.198|:443... connected.
HTTP request sent, awaiting response... 502 Bad Gateway
2023-11-28 07:16:32 ERROR 502: Bad Gateway.
```
2023-12-01 07:44:44 +08:00
zesongw
4025bd8ebd
[WebNN EP] Fix bug of padding in Op ConvTranspose (#18577)
Get the dimensions of H and W according to the layout.
2023-11-30 12:59:36 -08:00
Jiajia Qin
b1e749e3be
[js/webgpu] Add program name into webgpuProfiling info (#18640)
### Description
Currently, we only print the kernelName, which is hard to distinguish
which shader we actually used. For example, GroupedConv/Conv2DMatMul
both belong to Conv kernel. It's not intuitive for profiling.
2023-11-30 12:57:29 -08:00
Dmitri Smirnov
c5ea1547c6
Eliminate intermediate string conversion buffer. (#18608)
### Description
  Make use of unsafe string constructor that is able to convert native
  UTF-8 string straight into the string instance buffer.

### Motivation and Context
Reduce garbage,
2023-11-30 10:50:24 -08:00
Yulong Wang
e7f64f4510
[js/web] fix ESLint by excluding generated .js from tsconfig.json (#18634)
### Description
ESLint will went into error sometimes.

The root cause is because some large generated JavaScript file in the
tsconfig's include path will cause TypeScript parser fail in a line of
`string.match()` with a regex on a huge string (~8MB), causing the
following error:
```
RangeError: Maximum call stack size exceeded
```

The solution is to remove the large files from the tsconfig's include
path. Previously I excluded the `web/dist/` folder and this PR excludes
`web/test/ort.test[.min].js`.
2023-11-30 09:50:47 -08:00
Changming Sun
23a91c8ba8
Fix warning C4003 in ORT python binding code (#18612)
### Description
Fix warning C4003 in ORT python binding code.

### Motivation and Context
It's better to fix the warning instead of suppressing it.
2023-11-30 08:07:47 -08:00
Changming Sun
1b5675ff0f
Update post-merge-jobs.yml: increase timeout value for the Ios job (#18602) 2023-11-30 08:07:13 -08:00
Vincent Wang
148495ebc5
[ORTModule] Use Default Topo-order for GraphViewer (#18410)
ORT's default topo-order is a reversed DFS algorithm, while the
priority-based topo-order is a forward BFS algorithm. It's likely that
the default order is better than priority-based order on memory because
tensor memory is more likely to be released right after it's consumed.

Currently ORTModule uses priority-based order, for some models, it sorts
lots of small Ops to the beginning, this introduces big CPU overhead at
the beginning (see below screenshot), this PR is to use default order
for training. The priority-based order is heavily used for some
recompute optimization, so if there is recompute enabled, we will still
use priority-based order.

This PR also adds an optimization to the default order, which is to move
all Shape/Size Ops to right after their parent nodes. This is to make
sure the shape and size nodes are executed right after their parents so
it's possible the input tensor memory can be released as soon as
possible. This is especially important for non-CPU devices or for
training case where some gradient graphs use only shape/size of tensors
from forward.

Profiling result:
Before
<img width="910" alt="截屏2023-11-13 12 09 02"
src="https://github.com/microsoft/onnxruntime/assets/11661208/e54d5ead-274f-4725-923e-521bbcfce752">

After
<img width="910" alt="截屏2023-11-13 12 10 44"
src="https://github.com/microsoft/onnxruntime/assets/11661208/f50d196d-11ac-43a2-9493-517e4552ffab">
2023-11-30 20:17:22 +08:00
Vincent Wang
e1d1033131
[ORTModule] Remove Unused Arguments from Generated Triton Code (#18636)
This PR:
- Remove unused arguments from generated triton code,
- Remove unnecessary mask for symbolic shape case from generated triton
code.
- Add doc for usage of ORTMODULE_TRITON_CONFIG_FILE.
2023-11-30 18:32:36 +08:00
George Wu
5c67a00d8e
Revert "remove full protobuf requirement for tensorrt ep" (#18626)
Reverts microsoft/onnxruntime#18413

there's a timing issue here. we eventually want to get this change
merged in but we need to update OSS onnx-tensorrt first.
2023-11-29 22:27:51 -08:00
Jambay Kinley
c20488ced7
skip_infer for SkipGroupNorm in SymbolicShapeInference (#18630)
### Description
<!-- Describe your changes. -->
https://github.com/microsoft/onnxruntime/pull/18273 added
`SkipGroupNorm` contrib op but it did not skip onnx shape inference for
this op in `SymbolicShapeInference`.

This leads to failed shape inference of the transformers optimized model
with `enable_skip_group_norm=True`. Also results in an invalid float16
model for the SD CUDA example.

This PR adds `SkipGroupNorm` to `skip_infer` so that it skips onnx shape
inference for this op and instead uses the relevant dispatcher.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Fix shape inference failure for models with `SkipGroupNorm` nodes.
2023-11-29 18:27:04 -08:00
Yang Gu
227dcb3a88
[js/webgpu] Log the key and program info for artifact (#18365)
With uniform support, ideally we may just keep one artifact for each
program to save the compilation time. This PR just logs the related
info, including key and program name, so that we may understand better
the situation.
2023-11-29 18:01:12 -08:00
satyajandhyala
7335760424
[JS/Web] Add uniforms to Einsum (#18531)
### Description
Add uinforms to Einsum



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Improve performance.
2023-11-29 15:30:33 -08:00
Edward Chen
483c490ec4
Refine error checks in onnxruntime/core/providers/coreml/model/model.mm. (#18620)
#18606 updated the original error checks to check that the returned object != nil to appease the static analyzer. However, per the API docs, checking `error != nil` is the way to determine whether an error occurred. This change adds back the `error != nil` check to be safe.
2023-11-29 14:38:44 -08:00
Dmitri Smirnov
d2dfbf4179
Add float16 type support to SplitToSequence and make code type independent (#18594)
### Description
Add support for `float16` type to address the below issue.
Re-work the code to make it type independent.
This reduces binary size by ~11 K.


![image](https://github.com/microsoft/onnxruntime/assets/11303988/1a77c7bc-34a8-478c-a16a-abd94062c6c6)


### Motivation and Context
This PR addresses https://github.com/microsoft/onnxruntime/issues/18481
2023-11-29 10:44:59 -08:00
Yi Zhang
68209307da
Replace all Azure-Pipelines-EO-Windows2022-aiinfrat to Onnxruntime-Win-CPU-2022 (#18614)
### Description
Replace all Azure-Pipelines-EO-Windows2022-aiinfrat to
Onnxruntime-Win-CPU-2022


### Motivation and Context
Reduce the maintenance cost
2023-11-29 10:32:42 -08:00
Wanming Lin
38b640c797
[WebNN EP] Re-implement Unsqueeze, Squeeze, Flatten with WebNN's reshape (#18585)
WebNN will not provide `unsqueeze`, `squeeze`, `flatten2d` ops, as it
can be easily implemented by reshape.
2023-11-29 08:00:23 -08:00
Edward Chen
14a343441d
Fix Objective-C static analysis build (#18606)
- Patch abseil to fix a compile error about not finding `cxxabi.h`.
- Fix some static analysis warnings.
2023-11-28 17:14:20 -08:00
ivberg
e833d22f14
Change QNN EP Profiling logs to output to CSV (#18201)
### Description
Change QNN EP Profiling logs to output to CSV. Output is in a similar
format to QNN SDK Tools (instead of to ORT logs)

https://onnxruntime.ai/docs/execution-providers/QNN-ExecutionProvider.html#configuration-options
(profiling_level)

### Motivation and Context
It is hard to read and interpret QNN profiling logs in the ORT logs.

---------

Co-authored-by: Hector Li <hecli@microsoft.com>
2023-11-28 16:58:51 -08:00
Tianlei Wu
f13380f3d8
Support LoRA and Control Net in Stable Diffusion demo (#18593)
### Description
(1) Export onnx model with LoRA weights for both SD 1.5 and SDXL
(2) Export onnx model with Control Net for both SD 1.5 and SDXL. For SD
1.5, it is allowed to use multiple control nets. For SDXL, at most one
control net is supported right now.
(3) Add demo of LCM LoRA
(3) Add demo of control net.
2023-11-28 15:46:42 -08:00
Yulong Wang
50e6235af1
[js/web] allow ShaderHelper to use internal (non-I/O) variables (#18525)
### Description
This PR includes a change that inspired from #18452 to resolve a
requirement: a shader may depend on an instance of `IndicesHelper` to
generate WGSL code snippet, but the IndicesHelper instance is not
necessarily an input/output of the program. So the existing
`declareVariables()` function does not work with this scenario.

In order to support this requirement, I added this "use" function to
`interface ShaderHelper`, which takes a helper-like object as parameter.
The hidden implementation `ShaderHelperImpl` class will iterate the
helpers and call `impl()` for each.

@axinging @qjia7
2023-11-28 15:15:59 -08:00
Jian Chen
a49f31b670
Remove drop-nuget artifact from all pipelines (#18592)
### Description
Currently, the `drop-nuget` artifact only contains protoc.exe which is
also part of the `drop-extra` artifact.



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-11-28 13:23:01 -08:00
Mike Guo
e24733cfe9
fix the Olive CI pipeline failure on Windows (#18464)
Fix the https://aiinfra.visualstudio.com/Lotus/_build?definitionId=1046
failure for Windows
2023-11-28 11:42:39 -08:00
Rachel Guo
288b80d363
Add MacOS build to ORT C Pod (#18550)
### Description
<!-- Describe your changes. -->

As title.

1. Add macos build as an optionally enabled arch for pod and changes to
exsiting build_ios_framework/assemble_c_pod scripts.
2. Enable macos build arch in ios packaging pipeline (currently for
variants other than Mobile) and check the output artifacts are correct.
3. Write MacOS Test Target scheme in the test app and integrate into ios
packaging CI testing pipeline.
Currently the changes only apply to onnxruntime-c pod. as the original
request was from ORT SPM which consumes the onnxruntime-c pod only as
the binary target. TODO: could look into adding macos platform to objc
pod as well.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Enable macos platform support in cocoapods. and also potentially produce
binary target for enabling macos platform in SPM as well.

Replace https://github.com/microsoft/onnxruntime/pull/18334

---------

Co-authored-by: rachguo <rachguo@rachguos-Mac-mini.local>
Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
2023-11-28 10:11:53 -08:00
Chen Fu
05046e5452
Adding unit test for sm80 prepack (#18514)
### Description
Prepacking code for block q4 x fp16 GEMM cuda kernel, for SM80 hardware


### Motivation and Context
Preparing for addition of Q4 x FP16 GEMM kernel on Nvidia Ampere GPUs.
This kernel requires sophisticated quantized weight rearrangement to
speedup loading data to tensor-core. To facilitate the addition, this
change includes the following:

1. matrix_layout.h A new layout lib that facilitate iterating matrix
elements and tiles that balance memory safety and performance.
2. prepack_sm80.h Code for rearranging quantized weight, scales and
offsets (aka. prepacking)
3. blkq4_fp16_sm80_prepack_test.cc Unit tests that explicitly test the
memory safety and correctness of the prepacking code.

Currently the prepacking code runs on CPU with single threaded code. We
run this on CPU in order to minimize GPU memory fragmentation. On the
other hand, hopefully we get around to parallelize this part of the
code. Should be straight forward with the unit tests in place.
2023-11-28 10:01:09 -08:00
Adrian Lizarraga
8d5ecc4dae
[Quantization] Fix scale/zero-point for 16-bit QDQ Softmax (#18589)
### Description
Sets the appropriate scale and zero-point values for 16-bit QDQ Softmax.
Previously, the scale/zp were set to fixed values that were specific to 8-bit quantization.

### Motivation and Context
Generate more accurate 16-bit QDQ models that contain Softmax.
2023-11-28 09:46:47 -08:00
Sheil Kumar
0b7048e7d6
Update winml to use #cores - #soc cores by Default as the number of intraopthreads (#18384)
Update winml to use #cores - #soc cores by Default as the number of
intraopthreads

---------

Co-authored-by: Sheil Kumar <sheilk@microsoft.com>
2023-11-28 09:26:48 -08:00
Yi Zhang
a6d8726407
Update ADO windows image to custom image (#18598)
### Description
Update Azure-Pipelines-EO-Windows2022-aiinfra to
onnxruntime-win-CPU-2022 in Nuget_Package_CPU.
To make the debugging easier, use flex-downloadPipelineArtifact

### Motivation and Context
Azure-Pipelines-EO-Windows2022-aiinfra is using 1ES window-latest image.
The pipeline might be failed by unexpected upgrade.
Verified:
https://dev.azure.com/aiinfra/Lotus/_build/results?buildId=384425&view=results

### P.S.
I think we should replace all Azure-Pipelines-EO-Windows2022-aiinfra.
2023-11-28 09:04:25 -08:00
Jian Chen
3ea27c2925
Create a new Nuget Package pipeline for CUDA 12 (#18135) 2023-11-28 09:03:46 -08:00
Xavier Dupré
94a6020a7f
Improve parallelization of TfIdfVectorizer, Reduce memory consumption (#18539)
### Description

TfIdfVectorizer has two steps: first search for n-grams in the input,
second, weight the results. The second step was not parallelized. The PR
adresses that issue. Before two vectors were of the size of the output
were allocated to compute the results. The first one, frequencies, was
used as an intermediate vector between the two steps. This vector is now
broken into multiple small vectors, one per thread. The memory
consumption is then reduced for batches with a number of rows > the
number of threads.

### Motivation and Context
Performance and memory consumption.

For one model, the improvment is +15% faster (4 cores, model size is
~6Mb, batch size is 100). Here is another benchmark on
a machine with 32 cores with different size of vocabularies and batch
sizes. The tested TfIdfVectorizer only deals with unigram and processes
sequences of 10 tokens (integers).


![image](https://github.com/microsoft/onnxruntime/assets/22452781/0bb9abe9-ed81-44da-b5c4-ad2a12f129bd)
2023-11-28 12:56:00 +01:00
Ran Gal
3f42fbad2e
deleted the unused random_device variables because they caused a warning that was treated like an error. (#18543)
deleted the unused random_device variables because they caused a warning
that was treated like an error.

**_Please check if the declaration is required for the random number
generation. if so, there need to be a dummy reference to the variable or
turning off the warning as error behavior._**

### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-11-28 08:54:38 +01:00
Jiajia Qin
fc8631e2f1
[js/web] Fix conv2dMatmul errors due to #18452 (#18562)
### Description
Currently, all conv2dMatmul with inChannels = 3 and outChannels % 4 = 0
will report compilation errors. Models, which include this kind of shape
will be impacted, like mobilenetv2-12, resnet50 .

The errors is introduced by #18452
https://github.com/microsoft/onnxruntime/pull/18452/files#diff-8b24ea43aa11b1346c0c9e327f9bce6b37a93bd8f2bf8a6392b2b263972b7ea2R200,
which accidentally pass `components` to `x`. But `x`'s components is
`innerElementSize` not `components `. And when `innerElementSize` is 3,
we should use `1` in current design.
2023-11-27 21:21:47 -08:00
cao lei
b9fd9c5665
remove dead code in openvino EP (#18457)
### Description
<!-- Describe your changes. -->
Remove dead code in openvino EP


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Remove dead code in openvino EP
2023-11-27 13:41:12 -08:00
Caroline Zhu
dd355e39a0
[js/web/training] Added parameters methods (#18250)
### Description
* Implemented: `getParametersSize`, `getContiguousParameters`
(equivalent to copyParametersToBuffer), and `loadParametersBuffer`
(equivalent to copyParametersFromBuffer)
* as part of these changes, getParametersSize was added to the
TrainingSession interface so that users know what size buffer to create
for loadParametersBuffer
* The parameters methods in the interface were modified to take in a
Float32Array instead


### Motivation and Context
* part of the work for implementing web bindings for training
* enables federated learning in the web
* previous  PR: #18006

---------

Co-authored-by: Ashwini Khade <askhade@microsoft.com>
2023-11-27 10:30:13 -08:00
Hector Li
a2fd8a6fc0
[QNN EP] Return INVALID_GRAPH if failed to load from context binary (#18485)
### Description
[QNN EP] Return INVALID_GRAPH if failed to load from context binary

### Motivation and Context
Make sure QNN EP return INVALID_GRAPH if error encountered with the
context binary file
2023-11-24 20:41:27 -08:00
cloudhan
2f608338cb
Setup default python formatter for new python plugin (#18563) 2023-11-24 18:04:48 +08:00
Ted Themistokleous
7b2aefa856
undo hipify of __half to rocblas_half (#18573)
Fixes build issue seen with newer ROCm releases

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2023-11-24 18:04:23 +08:00
mindest
b9c935f605
[ROCm] Some fixes in tunable (#18575)
### Description

* Fix workspace size for hipBLASLt algos at 32M
* Update according to API changes
2023-11-24 17:22:00 +08:00
Rachel Guo
62f00ad8e7
[CoreML] Add Softmax and Split op support (#18358)
### Description
<!-- Describe your changes. -->

As title.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

Added for yolov8 model missing operator support.
https://github.com/microsoft/onnxruntime/issues/17654

Now the model support info looks like:
 
_CoreMLExecutionProvider::GetCapability, number of partitions supported
by CoreML: 3 number of nodes in the graph: 233 number of nodes supported
by CoreML: 230_

(only missing 3 concat op support due to input 3d shape is not currently
support in CoreML EP Concat).

---------

Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net>
Co-authored-by: rachguo <rachguo@rachguos-Mac-mini.local>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
2023-11-23 14:26:57 -08:00
cloudhan
6f3c1f9dc9
[ROCm] Update ck for GemmFloat8 (#18487) 2023-11-23 12:06:19 +08:00
Adrian Lizarraga
1c79897c90
[QNN EP] Support LpNormalization (#18561)
### Description
Add support for the ONNX LpNormalization operator (p == 2). This is
translated to QNN's L2Norm operator.



### Motivation and Context
Support more models with QNN EP
2023-11-22 19:40:33 -08:00
pengwa
43a5147e01
Memory optimization refactor and refinement (#17481)
### Memory optimization refactor and refinement

Currently memory optimizer runs graph transformations and print
recompute opportunities in INFO level, while ORT backend has many many
INFO level logs making users hard to find those information. So we are
looking for a Python binding API to retrieve the memory optimization
opportunities instead of depending on the MemoryOptimizer's default
logging.
Then we can print ORTModule feature statistics using this information. 
Also, with such an API, we can create an ORT session created, where
allocation plan is done, the analysis will consider buffer reuse as
well. This can void giving some recomputation subgraphs that are reusing
other subgraphs' output buffers.

Check
https://github.com/microsoft/onnxruntime/blob/pengwa/add_devinfo_level/docs/Memory_Optimizer.md
for the new flow using `MemoryOptimizer`.

This pull requests made following refactoring:
1. Print the log in ORTModule Python script, along with ORTModule
feature enabling stats. This is implemented by exposing an API
`get_serialized_ortmodule_memory_stat` to retrieve the memory
optimization opportunities.
2. We are analyzing memory optimization opportunities considering ORT
memory planning. This is done by firstly creating the execution graph
without enabling MemoryOptimizer, then we call
`execution_agent.get_serialized_ortmodule_memory_stat` which internally
will consider the session memory allocation planner when analyzing
memory optimization opportunity. As a direct result, the memory
optimization opportunities can show those stashed activations that are
reusing other buffers.
3. Move recompute analysis logic from memory_optimizer.h/cc to
recompute_analysis.h/cc.
4. Abstract optimization strategies for their own implementation. This
will make introducing new strategies (for example compression and
decompression ) easier.

New logging matrix (INFO Level), in WARNING level, the details will NOT
show.
```
2023-09-13 13:25:09,249 orttraining.rank-0 [WARNING] -
***** ONNX Runtime Training (ORTModule) is accelerating your model *****

ORTModule is enabled with following features ON/OFF for [training] mode:

  ATen Executor         :   ON    :   Dispatch ATen operators to ORT's ATen executor
  Cast Propagation      :   ON    :   Level 1 enabled
  Custom Function       :   ON    :   Support custom torch.autograd.Function export and execution
  Memory Optimizer      :   ON    :   RecomputeConfig: Reshape+Where+BiasSoftmax+:1:-1,Cast+:1:-1, ProbeLevel: 1, available configs:
                                      Config                                                      Freq    Saving(B)       Saving Symbolic(Bytes)
   - Plan 1             :   ON    :   Reshape+Where+BiasSoftmax+:1:-1                             5       671,088,640     640.0*inputs_input_ids_dim0*inputs_input_ids_dim1**2
   - Plan 2             :   ON    :   Cast+:1:-1                                                  6       402,587,648     inputs_input_ids_dim0*inputs_input_ids_dim1*(384.0*inputs_input_ids_dim1 - 64.0)
   - Plan 3             :   OFF   :   Reshape+Where+:1:-1                                         1       134,217,728     128.0*inputs_input_ids_dim0*inputs_input_ids_dim1**2
   - Plan 4             :   OFF   :   BiasSoftmax+:1:-1                                           1       134,086,656     128.0*inputs_input_ids_dim0*inputs_input_ids_dim1*(inputs_input_ids_dim1 - 1)
   - Plan 5             :   OFF   :   BiasGelu+:1:-1                                              6       125,808,640     inputs_input_ids_dim0*(122880.0*inputs_input_ids_dim1 - 20480.0)
   - Plan 6             :   OFF   :   FusedMatMul+:1:-1                                           6       125,808,640     inputs_input_ids_dim0*(122880.0*inputs_input_ids_dim1 - 20480.0)
   - Plan 7             :   OFF   :   FusedMatMul+Add+FusedMatMul+Add+Add+Add+:1:-1               5       26,214,400      25600.0*inputs_input_ids_dim0*inputs_input_ids_dim1
   - Plan 8             :   OFF   :   Add+:1:-1                                                   1       5,237,760       5120.0*inputs_input_ids_dim0*(inputs_input_ids_dim1 - 1)
   - Plan 9             :   OFF   :   Reshape+Unsqueeze+Unsqueeze+Cast+Sub+Mul+Cast+:1:-1         1       4,096           4.0*inputs_input_ids_dim0*inputs_input_ids_dim1
   - Plan 10            :   OFF   :   Cast+:2:-1                                                  1       2,048           2.0*inputs_input_ids_dim0*inputs_input_ids_dim1
  Compute Optimizer     :   ON    :   Enable/Disable with env ORTMODULE_ENABLE_COMPUTE_OPTIMIZER=1/0
   - FLOPReduction      :   ON    :   Reduce FLOPs by upstreaming shrinking-sized ops
  Auto Fallback         :   ON    :   Fallback to PyTorch when encountering unsupported ops
  TritonOp Enabled      :   OFF   :   ORT will switch to Triton for executing some ops to further accelerate training.
  ZeRO Stage3 Support   :   OFF   :   Enable/Disable with env ORTMODULE_ENABLE_ZERO_STAGE3=1/0

Total ORT initialization overhead is 10.73s where export takes 8.39s.
Other overhead details:  graph builder init takes 0.06s, runtime detection takes 0.01s, graph building takes 0.31s, session creation takes 1.96s

Versions: ONNX Runtime - 1.16.0+cu118, ONNX - 1.11.0

Note 1: use comma to enable multiple plans at the same time.
  export ORTMODULE_MEMORY_OPT_CONFIG=<plan1 config>,<plan2 config>,...
Note 2: saving is calculated based on the 1st batch symbolic dim values:
  inputs_input_ids_dim0=1,
  inputs_input_ids_dim1=1024,
  inputs_attention_mask_dim0=1,
  inputs_attention_mask_dim1=1024,
  inputs_labels_dim0=1,
  inputs_labels_dim1=1024,

************************************************************************
```

If DEVINFO level is enabled, then more details about the memory
optimizations are printed.
```

MemoryInsight Summary - User config: BiasGelu+:1:-1,Cast+:2:-1
==========================================================================================================================================
|Freq   | Memory Optimization Opportunities (Clustered by node-level activation patterns)                                                |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
|3      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
|       |                                                                                                                                |
|       |>>Option 1     : Recompute subgraph FusedMatMul+Add+Reshape+                                                                    |
|       |  Status       : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=FusedMatMul+Add+Reshape+:1:-1                         |
|       |  Stashed Activations:                                                                                                          |
|       |   - ReuseFreq :  Output 0(3),                                                                                                  |
|       |   - Output 0  : [inputs_input_ids_dim0 x inputs_input_ids_dim1 x 32 x 240 x ], byte/elem: 2, 100% saved                        |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
|2      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
|       |                                                                                                                                |
|       |>>Option 1     : Recompute subgraph Reshape+                                                                                    |
|       |  Status       : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Reshape+:1:-1                                         |
|       |  Stashed Activations:                                                                                                          |
|       |   - ReuseFreq :  Output 0(2),                                                                                                  |
|       |   - Output 0  : [ x 2560 x ], byte/elem: 2, 100% saved                                                                         |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
|2      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
|       |                                                                                                                                |
|       |>>Option 1     : Recompute subgraph FusedMatMul+                                                                                |
|       |  Status       : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=FusedMatMul+:1:-1                                     |
|       |  Stashed Activations:                                                                                                          |
|       |   - Output 0  : [inputs_input_ids_dim0 x inputs_input_ids_dim1 x 10240 x ], byte/elem: 2, 100% saved                           |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
|2      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
|       |                                                                                                                                |
|       |>>Option 1     : Recompute subgraph Cast+                                                                                       |
|       |  Status       : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Cast+:1:-1                                            |
|       |  Stashed Activations:                                                                                                          |
|       |   - Output 0  : [inputs_input_ids_dim0 x 32 x inputs_input_ids_dim1 x inputs_input_ids_dim1 x ], byte/elem: 2, 100% saved      |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
|2      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
|       |                                                                                                                                |
|       |>>Option 1     : Recompute subgraph Reshape+Where+BiasSoftmax+                                                                  |
|       |  Status       : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Reshape+Where+BiasSoftmax+:1:-1                       |
|       |  Stashed Activations:                                                                                                          |
|       |   - Output 0  : [inputs_input_ids_dim0 x 32 x inputs_input_ids_dim1 x inputs_input_ids_dim1 x ], byte/elem: 4, 100% saved      |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
|2      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
|       |                                                                                                                                |
|       |>>Option 1     : Recompute subgraph BiasGelu+                                                                                   |
|       |  Status       : Enabled, requested count=-1, actual applied count=2                                                            |
|       |  Stashed Activations:                                                                                                          |
|       |   - Output 0  : [inputs_input_ids_dim0 x inputs_input_ids_dim1 x 10240 x ], byte/elem: 2, 100% saved                           |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
|2      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
|       |                                                                                                                                |
|       |>>Option 1     : Recompute subgraph FusedMatMul+Add+FusedMatMul+Add+Add+Add+                                                    |
|       |  Status       : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=FusedMatMul+Add+FusedMatMul+Add+Add+Add+:1:-1         |
|       |  Stashed Activations:                                                                                                          |
|       |   - Output 0  : [inputs_input_ids_dim0 x inputs_input_ids_dim1 x 2560 x ], byte/elem: 2, 100% saved                            |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
|1      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
|       |                                                                                                                                |
|       |>>Option 1     : Recompute subgraph Reshape+Where+                                                                              |
|       |  Status       : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Reshape+Where+:1:-1                                   |
|       |  Stashed Activations:                                                                                                          |
|       |   - Output 0  : [inputs_input_ids_dim0 x 32 x inputs_input_ids_dim1 x inputs_input_ids_dim1 x ], byte/elem: 4, 100% saved      |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
|1      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
|       |                                                                                                                                |
|       |>>Option 1     : Recompute subgraph FusedMatMul+                                                                                |
|       |  Status       : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=FusedMatMul+:1:-1                                     |
|       |  Stashed Activations:                                                                                                          |
|       |   - Output 0  : [inputs_input_ids_dim0*(inputs_input_ids_dim1 - 1) x 10240 x ], byte/elem: 2, 100% saved                       |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
|1      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
|       |                                                                                                                                |
|       |>>Option 1     : Recompute subgraph Cast+                                                                                       |
|       |  Status       : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Cast+:1:-1                                            |
|       |  Stashed Activations:                                                                                                          |
|       |   - Output 0  : [inputs_input_ids_dim0 x 32 x inputs_input_ids_dim1 - 1 x inputs_input_ids_dim1 x ], byte/elem: 2, 100% saved  |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
|1      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
|       |                                                                                                                                |
|       |>>Option 1     : Recompute subgraph Reshape+Unsqueeze+Unsqueeze+Cast+Sub+Mul+Cast+                                              |
|       |  Status       : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Reshape+Unsqueeze+Unsqueeze+Cast+Sub+Mul+Cast+:1:-1   |
|       |  Stashed Activations:                                                                                                          |
|       |   - Output 0  : [inputs_input_ids_dim0 x 1 x 1 x inputs_input_ids_dim1 x ], byte/elem: 4, 100% saved                           |
|       |                                                                                                                                |
|       |>>Option 2     : RecomputeWithCompromise subgraph Cast+                                                                         |
|       |  Status       : Enabled, requested count=-1, actual applied count=1                                                            |
|       |  Stashed Activations:                                                                                                          |
|       |   - Output 0  : [inputs_input_ids_dim0 x 1 x 1 x inputs_input_ids_dim1 x ], byte/elem: 4, 50% saved                            |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
|1      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
|       |                                                                                                                                |
|       |>>Option 1     : Recompute subgraph BiasSoftmax+                                                                                |
|       |  Status       : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=BiasSoftmax+:1:-1                                     |
|       |  Stashed Activations:                                                                                                          |
|       |   - Output 0  : [inputs_input_ids_dim0 x 32 x inputs_input_ids_dim1 - 1 x inputs_input_ids_dim1 x ], byte/elem: 4, 100% saved  |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
|1      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
|       |                                                                                                                                |
|       |>>Option 1     : Recompute subgraph BiasGelu+                                                                                   |
|       |  Status       : Enabled, requested count=-1, actual applied count=1                                                            |
|       |  Stashed Activations:                                                                                                          |
|       |   - Output 0  : [inputs_input_ids_dim0*(inputs_input_ids_dim1 - 1) x 10240 x ], byte/elem: 2, 100% saved                       |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
|1      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
|       |                                                                                                                                |
|       |>>Option 1     : Recompute subgraph Add+                                                                                        |
|       |  Status       : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Add+:1:-1                                             |
|       |  Stashed Activations:                                                                                                          |
|       |   - Output 0  : [inputs_input_ids_dim0*(inputs_input_ids_dim1 - 1) x 2560 x ], byte/elem: 2, 100% saved                        |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
==========================================================================================================================================
Note: use comma as a separator for enabling more than one subgraphs.

************************************************************************

```


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-11-23 11:39:00 +08:00
Jiajia Qin
64dacc2892
[js/webgpu] Add BatchNormalization Op (#18468)
### Description
This PR adds `BatchNormalization` with `float` support.

Some Todos:
1. all inputs don't have same data type. For example, x/y is float16,
but bias/scale is float32 or double.
2. training mode support.

We see many models are using `BatchNormalization` ops. However, due to
the missing in jsep, all of them run on cpu, which result very poor
performance. With this PR's support, densenet-9 model becomes 20.29 ms
from 250.69 ms.
2023-11-22 15:58:06 -08:00
Xu Xing
fa106942a7
[js/webgpu] Refactor matmul conv to support uniforms for matmul (#18452)
This change refactored matmul/conv related programs to support shape
uniforms. Currently only matmul shape uniforms are fully enabled.
TODOs: add input dependencies for conv related programs, turn clipMax
and clipMin to uniforms.
2023-11-22 14:42:55 -08:00
Scott McKay
42c6799c59
Update transpose optimization to be more QDQ aware (#18444)
### Description
<!-- Describe your changes. -->
Rework some aspects of the transpose optimizer to ensure we have valid
QDQ node units when it is done.

Conceptually we need to let individual Transpose nodes move through the
graph when optimizing. That can invalidate existing QDQ node units or
require new ones. We can fix this after inserting new nodes, or when
transpose optimization finishes moving Transpose nodes.

Fix when inserting new node
- TransposeInputs can add an Unsqueeze (to broadcast) and Transpose to a
node's inputs
- if there was a DQ node providing the input, add a Q -> DQ after
inserting the Unsqueeze/Transpose to make a QDQ node unit for the new
node.
- Unsqueeze/Transpose don't change data, so we can copy the
type/scale/zero point from the existing DQ

Fixes when transpose optimization completes moving Transpose nodes
- Remove empty DQ -> Q pairs if the type/scale/zero point match
- Pushing a Transpose through may have resulted in an existing
Transpose/Reshape being cancelled and removed leaving an empty QDQ node
unit
  - the Transpose being moved may have started in a QDQ node unit
- Transpose that got blocked inside existing QDQ node unit
- e.g. if we hit a DQ -> MatMul -> Q node unit the Transpose gets
blocked after the DQ
- insert a Q -> DQ after the Transpose to put it in a QDQ node unit and
repair the original QDQ node unit
- Transpose moves past a DQ providing a graph output
  - insert a Q -> DQ so the Transpose is in a QDQ node unit

This replaces the existing phase 2 logic which flipped a DQ -> Transpose
to fix a broken QDQ node unit. The new approach should handle more
scenarios and hopefully produce a better graph.

Additionally the logic to handle updates to shared initializers that
feed DQ nodes was simplified (i.e. largely removed). When we update the
shared initializer a Squeeze (if broadcast) and Transpose is added
between the initializer and the DQ for other usages of it. We only need
to check for this pattern in EstimateTransposeValueCost by looking past
a DQ node. We do not need to track the individual DQ nodes leading to an
updated shared initializer.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Initially to fix QNN issue with non-const input being transpose and the
QDQ node units being broken.
2023-11-23 08:27:47 +10:00
satyajandhyala
841f7ed3e0
[[JS/Web]Added uniform to Expand op. (#18558)
### Description
<!-- Describe your changes. -->
Added Uniforms to Expand operator kernel


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Improve performance
2023-11-22 14:14:24 -08:00
Arthur Islamov
1c555c5fc1
[JS/Web] Resize & BiasSplitGelu fp16 support (#18536)
### Description
Resize and BiasSplitGelu fp16 support on WebGPU
2023-11-22 12:12:07 -08:00
Xavier Dupré
3f0ebd6736
Fix opset import in GemmFloat8 python unit tests (#18489)
### Description
The unit test are failing if a development version of onnx is used. The
opset are set to 19.
2023-11-22 09:15:24 -08:00