### Description
This change reduces the number of calls to globby functions, which cuts
the initialization time of `npm test` with suite0/1 tests from ~14 sec
to <2 sec.
Fix Orttraining Linux Lazy Tensor CI
The Orttraining Linux Lazy Tensor CI is broken.
The error message is:
`AttributeError: 'OnnxRegistry' object has no attribute 'register'`
### Description
Added Expand operator support.
### Motivation and Context
### Description
Add double quotes around the path so that paths containing spaces are
handled correctly.
### Motivation and Context
If the path contains spaces, the bat file cannot run and an error occurs.
Wrapping the path in double quotes fixes this problem.
---------
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
### Log ORTModule initialization overhead
When profiling a model, for example:
```
torchrun --nproc_per_node=1 examples/onnxruntime/training/language-modeling/run_mlm.py --model_name_or_path microsoft/deberta-v3-large --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --num_train_epochs 10 --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --do_train --overwrite_output_dir --output_dir ./outputs/ --seed 1137 --fp16 --report_to none --optim adamw_ort_fused --max_steps 200 --logging_steps 1 --use_module_with_loss
{'train_runtime': 303.8711, 'train_samples_per_second': 0.658, 'train_steps_per_second': 0.658, 'train_loss': 6.569518616199494, 'epoch': 0.09}
100%|200/200 [05:03<00:00, 1.52s/it]
***** train metrics *****
epoch = 0.09
train_loss = 6.5695
train_runtime = 0:05:03.87
train_samples = 2223
train_samples_per_second = 0.658
train_steps_per_second = 0.658
```
The end-to-end time is 303s (train_runtime=0:05:03.87), but the
ORTModule first-step initialization (including export, graph build, etc.)
takes about 255s. So when we compare the end-to-end time of a baseline
ORT with that of an improved version of ORT, there appears to be almost
no perf gain, since an x% gain over the (303-255)s of real compute is
diluted across the overall 303s. This is misleading!
So this PR logs the ORTModule initialization overhead, so that we can
manually compute the real compute time and get the real perf gain.
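
The dilution can be illustrated with a small calculation. This is only a sketch: the 303s and 255s figures come from the run above, while the 20% compute improvement is a hypothetical number chosen for illustration.

```python
def steady_state_time(end_to_end_s: float, init_overhead_s: float) -> float:
    """Time spent on actual training steps, excluding one-off initialization."""
    return end_to_end_s - init_overhead_s


def speedup(baseline_s: float, improved_s: float) -> float:
    """How many times faster the improved run is than the baseline."""
    return baseline_s / improved_s


# Numbers from the run above: 303 s end to end, about 255 s of initialization.
baseline_total = 303.0
init_overhead = 255.0
baseline_compute = steady_state_time(baseline_total, init_overhead)

# Hypothetical improved build whose compute part is 1.2x faster.
improved_compute = baseline_compute / 1.2
improved_total = init_overhead + improved_compute

print(f"end-to-end speedup:   {speedup(baseline_total, improved_total):.3f}x")
print(f"compute-only speedup: {speedup(baseline_compute, improved_compute):.3f}x")
```

With these inputs the end-to-end comparison shows less than a 3% gain even though the compute part is 20% faster, which is why the initialization overhead has to be subtracted first.
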
If the log level is >= WARNING, only the total end-to-end time and the
export time are logged; otherwise, a more detailed breakdown is logged:


### Description
Add support for Op InstanceNormalization and GroupNormalization via MeanVarianceNormalization.
### Motivation and Context
Enable more models like Olive'ified SD unet to run on WebNN EP.
### Give users warnings if nondeterministic kernels are called when the deterministic flag is set
When we investigate accuracy (for example, when debugging a training
convergence issue), we usually set `use_deterministic_compute` to true.
```
SessionOptions sess_options;
sess_options.use_deterministic_compute = true;
```
However, a recent investigation found that the GatherElementsGrad kernel
(which uses atomic add) generates non-deterministic results, making a
deberta model output a noticeably different loss curve on every run,
even when we fix the seed, remove the dropout ratio, and set
use_deterministic_compute to true. This turned out to be expected: the
CUDA threads perform the additions in different orders from run to run,
and that order cannot be guaranteed.
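
The underlying reason is that floating-point addition is not associative, so the same contributions can sum to different totals depending on arrival order. The sketch below illustrates this in plain Python with float64 values. It is not the CUDA kernel itself: the `accumulate` helper and the example values are made up for illustration.

```python
def accumulate(contributions, order):
    """Add contributions one at a time in the given order, the way
    atomic adds from different threads land on one output element."""
    total = 0.0
    for index in order:
        total += contributions[index]
    return total


# Three contributions to the same output element.
contributions = [1e16, 1.0, -1e16]

# Two runs in which the "threads" happen to arrive in a different order.
run_a = accumulate(contributions, [0, 2, 1])  # (1e16 + -1e16) + 1.0
run_b = accumulate(contributions, [0, 1, 2])  # (1e16 + 1.0) + -1e16

print(run_a, run_b)
print("identical:", run_a == run_b)
```

In `run_b` the `1.0` is absorbed by rounding when it is added to `1e16`, so the two runs disagree even though they add exactly the same numbers.
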
So this PR gives warnings when users set `use_deterministic_compute`
but some kernels have no deterministic implementation and have to run
with non-deterministic ones. This at least lets users know that the
results are not deterministic even though the flag is set to true.

The message is printed only once so that it does not flood training logs.
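
A warn-once guard of this kind can be sketched as follows. This is an illustration only, not the code in this PR: the function name, the logger name, the message text, and the choice to deduplicate per kernel name are all assumptions.

```python
import logging

logger = logging.getLogger("determinism_sketch")  # hypothetical logger name
_already_warned = set()


def warn_nondeterministic_once(kernel_name: str) -> bool:
    """Log a warning the first time a kernel is reported.

    Returns True if the warning was logged, False if it was suppressed.
    """
    if kernel_name in _already_warned:
        return False
    _already_warned.add(kernel_name)
    logger.warning(
        "use_deterministic_compute is set, but %s has no deterministic "
        "implementation; results may differ between runs.",
        kernel_name,
    )
    return True


# Simulate the kernel being launched on every training step.
emitted = [warn_nondeterministic_once("GatherElementsGrad") for _ in range(3)]
print(emitted)
```

Only the first call logs; later calls for the same kernel return without logging, so a long training run produces a single line per kernel.
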
### Description
This PR includes documentation updates, providing step-by-step
instructions on how to implement the ModuleWithLoss wrapper in a
different codebase.
The documentation outlines the necessary code changes and offers
customization options based on specific requirements.
---------
Co-authored-by: Adam Louly <adamlouly@microsoft.com@orttrainingdev9.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
### Description
Add Stable Diffusion Text2Image pipelines for TensorRT EP and CUDA EP.
They can automatically export and optimize the ONNX model, and create an
ONNXRuntime session that uses the TensorRT EP or the CUDA EP.
Add support for benchmarking TensorRT.
Add support for cuda graph. The feature is only supported in the nightly
package right now.

Engine/Provider to test | Command line
---- | ---
CUDA EP | `python benchmark.py -v 1.5`
CUDA EP with cuda graph | `python benchmark.py -v 1.5 --enable_cuda_graph`
TensorRT EP | `python benchmark.py -v 1.5 -r tensorrt`
TensorRT EP with cuda graph | `python benchmark.py -v 1.5 -r tensorrt --enable_cuda_graph`
TensorRT | `python benchmark.py -v 1.5 -e tensorrt`

Add benchmark numbers of T4 GPU using CUDA 11.7, cuDNN 8.5, PyTorch
1.13.1+cu11.7, TensorRT 8.6.1, onnxruntime-gpu 1.15.1 (or
ort-nightly-gpu 1.16 for cuda graph).
TODO: add benchmark numbers of A100-80GB
### Motivation and Context
Kernel explorer has lots of tests that need numpy to verify the results
of GPU kernels, which makes CPU utilization very high. This PR uses
`cupy` in place of `numpy` so that the computation runs on the GPU and
CPU utilization is reduced.
Set `KERNEL_EXPLORER_TEST_USE_CUPY=1` to enable cupy.
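
One way to switch array libraries on an environment variable is sketched below. Only the variable name `KERNEL_EXPLORER_TEST_USE_CUPY` comes from this PR. The helper names are hypothetical, and how the kernel explorer tests actually read the variable is an assumption.

```python
import importlib
import os


def array_module_name(environ=os.environ) -> str:
    """Pick the array library name based on KERNEL_EXPLORER_TEST_USE_CUPY."""
    use_cupy = environ.get("KERNEL_EXPLORER_TEST_USE_CUPY") == "1"
    return "cupy" if use_cupy else "numpy"


def load_array_module(environ=os.environ):
    """Import and return the selected module (it must be installed)."""
    return importlib.import_module(array_module_name(environ))


print(array_module_name({"KERNEL_EXPLORER_TEST_USE_CUPY": "1"}))
print(array_module_name({}))
```

A test would then call `xp = load_array_module()` once and use `xp` wherever it previously used `numpy`, relying on the large part of the numpy API that cupy mirrors.
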
- Move the ROCm build step to a CPU-only machine
- Add the performance data of the huggingface bert-large model on the
MI200
- At the beginning of the test step, check the agent's GPU usage and
kill the threads occupying the GPU, which may be left over from previous
tasks that exited abnormally.
- Use different docker images during the build and test steps. The
difference is the `uid` and `user` used when building the docker image
and creating the docker container.
### Description
Add ConvTranspose support for WebGPU
### Motivation and Context
### Description
Clean up unused parameters in `ORT_UNUSED_PARAMETER`.
### Motivation and Context
Clean up unused parameters in `ORT_UNUSED_PARAMETER` that were introduced
in #15833.
### Description
Added WebGPU/JSEP Split operator support.
### Motivation and Context
### Description
- Adds support for 1D (rank 3) convolutions to QNN EP
- Implements 1D convolutions as 2D convolutions with height == 1.
Reshape nodes are added at the inputs and outputs as necessary.
- Adds more unit tests for Conv and ConvTranspose (2D and 1D).
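
The equivalence behind the 1D-as-2D approach can be checked with a small pure-Python sketch. It is a simplified illustration with a single channel, no batch dimension, no padding and stride 1. It does not use the QNN EP, and all function names are made up for this example.

```python
def conv1d(x, w):
    """Valid 1D cross-correlation of a length-N signal with a length-K kernel."""
    n, k = len(x), len(w)
    return [sum(x[i + j] * w[j] for j in range(k)) for i in range(n - k + 1)]


def conv2d(x, w):
    """Valid 2D cross-correlation of an H x W image with a KH x KW kernel."""
    height, width = len(x), len(x[0])
    kh, kw = len(w), len(w[0])
    return [
        [
            sum(x[r + a][c + b] * w[a][b] for a in range(kh) for b in range(kw))
            for c in range(width - kw + 1)
        ]
        for r in range(height - kh + 1)
    ]


def conv1d_via_conv2d(x, w):
    """Run a 1D convolution as a 2D convolution with height == 1."""
    x2d = [x]                # reshape input  [N] -> [1, N]
    w2d = [w]                # reshape weight [K] -> [1, K]
    y2d = conv2d(x2d, w2d)   # output has shape [1, N - K + 1]
    return y2d[0]            # reshape output back to 1D


signal = [1.0, 2.0, 3.0, 4.0, 5.0]
kernel = [1.0, 0.0, -1.0]
print(conv1d(signal, kernel))
print(conv1d_via_conv2d(signal, kernel))
```

The two reshapes on the way in and the one on the way out play the role of the Reshape nodes described above.
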
### Motivation and Context
Allow more models to run on QNN EP.
### Description
Add missing L1Reduce and L2Reduce operator kernels.
### Motivation and Context
The `GemmSoftmaxGemmPermuteTunableOp<HipT>` is expensive to construct.
Avoiding the ctor invocation substantially improves the launch time and
gives better performance during decoding. This gives a <7% e2e time
reduction for whisper large.
### Description
Windows GPU Reduced Ops CI Pipeline is broken due to the introduction of
a second template type in registered kernels. The python code checking
the registration is broken due to that. This PR addresses this issue on
the python side by keeping only one type equal to the concatenation of
the two types.
- Fix some warnings from Xcode build (`-Wshorten-64-to-32`).
- Enable `-Wshorten-64-to-32` warning if available. Currently it's not fully enabled for `onnxruntime_test_all` and `onnxruntime_providers_xnnpack` yet.
- Some clean up in build.py including setting CMake generator more consistently.
- Update some documentation comments.
- Use onnxruntime_training.h as the umbrella header so training API docs are included in generated docs.
- Fix static analysis build.
The PR optimizes BiasGelu/BiasGeluGrad CUDA kernel by 3 changes:
- Use Erf instead of Normcdf for half compute
- Change CUDA thread organization for BiasGelu kernel instead of using
binary elementwise template
- Add vectorized support

Using BiasGelu(A[256, 128, 768] + B[768]) on a V100 as an example, the
perf numbers below are in us:
- Before change, FW: 246.37, BW: 292.77
- Use Erf, FW: 152.86, BW: 238.98
- All above changes, FW: 132.45, BW: 199.14

For Huggingface's bertweet-base model, with the changes, the step time
(FW+BW) reduces from 324.71766 ms to 316.42552 ms, which is 1.026x
faster.
Erf is used for half data only; evaluation shows that for float on
CUDA, Normcdf is faster. I didn't check the perf for BFloat16 or on AMD,
so those are kept unchanged.
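
For reference, the Erf and Normcdf formulations compute the same function, since the standard normal CDF is Φ(x) = 0.5 · (1 + erf(x / √2)). The sketch below checks this numerically in plain Python. It illustrates the math only, not the CUDA kernel, and the function names are made up for the example.

```python
import math
from statistics import NormalDist

_STANDARD_NORMAL = NormalDist(mu=0.0, sigma=1.0)


def gelu_normcdf(x: float) -> float:
    """Gelu written with the standard normal CDF: x * Phi(x)."""
    return x * _STANDARD_NORMAL.cdf(x)


def gelu_erf(x: float) -> float:
    """Gelu written with erf: 0.5 * x * (1 + erf(x / sqrt(2)))."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def bias_gelu(row, bias):
    """BiasGelu over one row: add the bias elementwise, then apply Gelu."""
    return [gelu_erf(value + b) for value, b in zip(row, bias)]


samples = [-3.0, -1.0, -0.5, 0.0, 0.5, 1.0, 3.0]
max_diff = max(abs(gelu_erf(x) - gelu_normcdf(x)) for x in samples)
print(f"max |erf form - normcdf form| = {max_diff:.2e}")
print(bias_gelu([0.5, -1.0], [0.5, 1.0]))
```

Because both forms agree mathematically, choosing between them on the GPU is purely a question of which intrinsic is faster for a given data type.
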
### Description
Split out the more basic changes from #15552 for easier review.
Re-organize to clarify the structure
- Separate out generic base functionality from ORT specific components
- pass in handlers for internal ORT ops to Optimize
- Split out layout transformation from transpose optimization
- Separate out level 1 transpose optimizer
- Cleanup some naming to try and clarify things like an optimizer vs.
general optimization code
Most of the changes are from this movement of code.
Two implementation changes:
- the extended handlers are queried first in GetHandler
  - allows the extended handlers to override the default behaviour for an
    ONNX operator
- simplify the Optimize function to remove OptimizerMode.
  - `can_modify_node` is used instead of `mode` and
    `ignore_assigned_nodes` and a long description of the current usage is
    added. I don't _think_ that changes the current behavior and hopefully
    clarifies what happens and when, and makes the base transpose optimizer
    implementation more generic.
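
The "extended handlers first" lookup order can be sketched as below. This is a Python illustration of the idea only. The real `GetHandler` is C++, and the dictionary-based registry, the function signature, and the handler names here are assumptions.

```python
def get_handler(op_type, extended_handlers, default_handlers):
    """Return the handler for op_type, preferring an extended handler.

    Checking the extended map first lets it override the default
    behaviour for a standard ONNX operator.
    """
    handler = extended_handlers.get(op_type)
    if handler is not None:
        return handler
    return default_handlers.get(op_type)


# Hypothetical registries: values are just labels standing in for handlers.
default_handlers = {"Resize": "default_resize", "Transpose": "default_transpose"}
extended_handlers = {"Resize": "ep_specific_resize"}

print(get_handler("Resize", extended_handlers, default_handlers))
print(get_handler("Transpose", extended_handlers, default_handlers))
print(get_handler("Unknown", extended_handlers, default_handlers))
```

An operator without an extended entry falls through to the default map, so existing behaviour is unchanged unless an override is registered.
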
### Motivation and Context
Create a cleaner separation to support adding EP-specific logic next, to
cleanly handle cases where an EP requires additional layout-sensitive
behaviour (e.g. its Resize implementation only handles one layout).
### Description
Introduce an API that allows users to gain access to a string tensor
element buffer of a requested length in bytes,
so they can quickly load any utf8 data.
### Motivation and Context
Useful for testing and otherwise.
### Description
C API for custom ops does not support float 8 types. This PR changes
that.
### Motivation and Context
The list of operators supporting float 8 is very limited. It should be
extended to custom ops to let developers add customized operators for
these specific types.
In #16339, the `ORT_ENFORCE(cuda_device_arch_ >= 530` check (which
throws) was changed to `ORT_RETURN_IF` (which returns a Status), but the
condition ended up negated. This PR fixes the problem.
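
The pitfall can be modelled with two stand-in helpers. This sketch assumes that an enforce-style check fails when its condition is false, while a return-if-style check reports an error when its condition is true. The Python helpers and the assumption that the condition was carried over unchanged are illustrative only and are not the actual C++ macros or diff.

```python
from typing import Optional

MIN_ARCH = 530  # threshold taken from the condition quoted above


def enforce(condition: bool, message: str) -> None:
    """Stand-in for an enforce-style check: raises when the condition is False."""
    if not condition:
        raise RuntimeError(message)


def return_if(condition: bool, message: str) -> Optional[str]:
    """Stand-in for a return-if-style check: yields an error when the condition is True."""
    return message if condition else None


def check_enforce(arch: int) -> None:
    enforce(arch >= MIN_ARCH, "device arch too old")


def check_return_if_inverted(arch: int) -> Optional[str]:
    # Condition copied verbatim: supported devices now get the error.
    return return_if(arch >= MIN_ARCH, "device arch too old")


def check_return_if_fixed(arch: int) -> Optional[str]:
    # Negated condition: matches the behaviour of check_enforce.
    return return_if(arch < MIN_ARCH, "device arch too old")


check_enforce(700)                    # passes silently
print(check_return_if_inverted(700))  # error on a supported device
print(check_return_if_fixed(700))     # no error
print(check_return_if_fixed(520))     # error on an unsupported device
```
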
### Description
Enable support for building iOS packages/CocoaPods with training API
- Add `Training` Package variant and config files in current iOS
packaging utilities to enable creation of training packages
### Motivation and Context
This PR introduces new `Training` variant in
`build_and_assemble_ios_pods.py` script which allows creating pods for
iOS with training API enabled.
The sample script to build training pods:
```
python3 tools/ci_build/github/apple/build_and_assemble_ios_pods.py --variant Training \
--build-settings-file tools/ci_build/github/apple/default_full_ios_training_framework_build_settings.json \
-b=-- path_to_protoc_exe=<path/to/protoc>
```
Note: build settings file should have `--enable_training` as a build
parameter.
Simply adding training packaging increases the duration of the Azure
pipeline for packaging by 70 minutes. To address this issue, we need to
parallelize pod creation. In order not to further strain the pipeline,
the changes for training packaging will be added in another PR, which
optimizes the packaging pipeline.
---------
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
### Description
Adds support for adding external initializers or overriding initializers
in session options from Java.
### Motivation and Context
We want to instantiate large models from Java without filesystem access.
cc @yuslepukhin
### Description
Always use 'typescript' from the /js/ folder. This allows all NPM packages
to use the same typescript version.
- remove 'typescript' from /js/react_native/package.json. use the one
from /js/package.json
- remove unused '@types/fs-extra'
### Description
Add Concat operator
### Motivation and Context
Fix a memory leak caused by the TRT EP's allocator object not being
released upon destruction.
The following is the log from valgrind:
```
==1911860== 100,272 (56 direct, 100,216 indirect) bytes in 1 blocks are definitely lost in loss record 1,751 of 1,832
==1911860== at 0x483CFA3: operator new(unsigned long) (vg_replace_malloc.c:472)
==1911860== by 0x315DC2: std::_MakeUniq<onnxruntime::OrtAllocatorImplWrappingIAllocator>::__single_object std::make_unique<onnxruntime::OrtAllocatorImplWrappingIAllocator, std::shared_ptr<onnxruntime::IAllocator> >(std::shared_ptr<onnxruntime::IAllocator>&&) (unique_ptr.h:857)
==1911860== by 0x30EE7B: OrtApis::KernelContext_GetAllocator(OrtKernelContext const*, OrtMemoryInfo const*, OrtAllocator**) (custom_ops.cc:121)
==1911860== by 0x660D115: onnxruntime::TensorrtExecutionProvider::Compile(std::vector<onnxruntime::IExecutionProvider::FusedNodeAndGraph, std::allocator<onnxruntime::IExecutionProvider::FusedNodeAndGraph> > const&, std::vector<onnxruntime::NodeComputeInfo, std::allocator<onnxruntime::NodeComputeInfo> >&)::{lambda(void*, OrtApi const*, OrtKernelContext*)#3}::operator()(void*, OrtApi const*, OrtKernelContext*) const (tensorrt_execution_provider.cc:2223)
```
This issue happens after this [EP allocator refactor](https://github.com/microsoft/onnxruntime/pull/15833).