### Description
* Adds `TrainingSession.create()` functionality, following the web
bindings for training design doc.
* Adds 2 new training APIs to wasm/api.h:
  * OrtTrainingGetInputOutputName
  * OrtTrainingGetInputOutputCount
* Moves the isOrtEnvInitialized boolean to wasm-core-impl and adds a
method that references it.
### Motivation and Context
* Adding web bindings for training
#### Related work
* #16521 allowed for training artifacts to be built
* #17333 added interfaces for training
* #17474 allows for training package to be built + adds training backend
to web package **[MUST BE MERGED IN BEFORE THIS ONE]**
---------
Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
Co-authored-by: Ashwini Khade <askhade@microsoft.com>
Historically, DML could only fuse partitions when all sizes were known
in advance or were overridden at session creation time. In practice, it
should be possible to compile partitions at compute time if the caller
knows that the dimensions won't change on every inference (e.g. they
change only when a webcam window is resized, or the input is padded to
powers of 2). The compiled graph is cached and reused until the sizes
change.
This is an opt-in feature gated behind the `enable_dynamic_graph_fusion`
option. It is enabled only when the caller requests it, since the caller
has more context on how the model will be called between inferences.
This PR also adds the option to disable metacommands from the Python
API; the option already existed in the C API but was missing for Python.
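The caching behaviour described above can be pictured as a single-entry cache keyed on the concrete input shapes. The following is a conceptual Python sketch only, not the DML EP implementation; `FusedGraphCache` and `compile_partition` are hypothetical names introduced for illustration.

```python
from typing import Callable, Optional, Tuple

Shape = Tuple[int, ...]


class FusedGraphCache:
    """Recompile only when the concrete input shapes change (conceptual sketch)."""

    def __init__(self, compile_partition: Callable[[Tuple[Shape, ...]], object]):
        # compile_partition is a hypothetical stand-in for the expensive
        # compile-at-compute-time step.
        self._compile = compile_partition
        self._key: Optional[Tuple[Shape, ...]] = None
        self._graph: object = None
        self.compile_count = 0

    def get(self, input_shapes: Tuple[Shape, ...]) -> object:
        if input_shapes != self._key:
            self._graph = self._compile(input_shapes)
            self._key = input_shapes
            self.compile_count += 1
        return self._graph


cache = FusedGraphCache(lambda shapes: ("compiled", shapes))
cache.get(((1, 3, 224, 224),))
cache.get(((1, 3, 224, 224),))  # same sizes: the cached graph is reused
cache.get(((1, 3, 256, 256),))  # sizes changed: compile again
print(cache.compile_count)
```

A cache like this only pays off when shapes are stable across most inferences, which is why the feature is left for the caller to opt into.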
### Description
Add a contrib op MatMulBnb4 (FP4 and NF4) and the related toolchain to
support weight quantization.
This PR adds:
- schema for contrib op MatMulBnb4 which can support FP4 (4-bit floating
point) and NF4 (4-bit NormalFloat) quantization on weight.
- a naive implementation for MatMulBnb4 on CPU and GPU, i.e.,
implemented like MatMul(A, Dequantize(B)).
- a special GemV implementation for MatMulBnb4 and a related benchmark
tool.
- a tool to quantize a model to FP4 or NF4.
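As a rough illustration of the naive `MatMul(A, Dequantize(B))` formulation, here is a minimal pure-Python sketch of blockwise 4-bit quantization with one absmax scale per block. Everything in it is an assumption made for illustration: the codebook is an evenly spaced stand-in (not the real FP4 or NF4 tables), and the function names, block layout, and row-major storage of B are hypothetical rather than the op's actual schema.

```python
# Hypothetical 16-entry codebook, evenly spaced in [-1, 1]. It stands in for
# the real FP4/NF4 tables, which are NOT reproduced here.
CODEBOOK = [i / 7.5 - 1.0 for i in range(16)]


def quantize_block(block):
    """Return (absmax, codes): one scale plus a 4-bit index per weight."""
    absmax = max(abs(w) for w in block) or 1.0
    codes = [
        min(range(16), key=lambda c: abs(CODEBOOK[c] - w / absmax))
        for w in block
    ]
    return absmax, codes


def dequantize(absmaxes, codes, block_size):
    """Map each 4-bit code back to a float using its block's scale."""
    return [CODEBOOK[c] * absmaxes[i // block_size] for i, c in enumerate(codes)]


def matmul_bnb4_naive(a, absmaxes, codes, k, n, block_size):
    """Compute A @ Dequantize(B) for a row-major K x N weight matrix B."""
    b = dequantize(absmaxes, codes, block_size)
    return [[sum(row[p] * b[p * n + j] for p in range(k)) for j in range(n)]
            for row in a]


b_flat = [1.0, -0.5, 0.25, 0.0]  # 2 x 2 weight matrix, row-major
block_size = 2
absmaxes, codes = [], []
for start in range(0, len(b_flat), block_size):
    scale, block_codes = quantize_block(b_flat[start:start + block_size])
    absmaxes.append(scale)
    codes.extend(block_codes)

out = matmul_bnb4_naive([[1.0, 2.0]], absmaxes, codes, k=2, n=2,
                        block_size=block_size)
print(out)  # close to the exact product [[1.5, -0.5]], up to quantization error
```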
### Description
The MMLA kernels require additional ISA flags and are currently
supported only on Linux.
### Motivation and Context
More context is in https://github.com/microsoft/onnxruntime/pull/15270
cc: @skottmckay , @chenfucn , @snnn
### FP16 optimizer automatically detects DeepSpeed compatibility
Optimum/Transformers use the accelerate library to prepare models, so
our FP16 optimizer wrapper has not worked for a long time. The reason is
that the optimizer type is
`accelerate.utils.deepspeed.DeepSpeedOptimizerWrapper`, which under the
hood still calls into the DeepSpeed stage 1 and 2 optimizer.
This PR includes the following changes:
1. Add `accelerate.utils.deepspeed.DeepSpeedOptimizerWrapper` to the
modifier registry, plus a check that its contained `optimizer` property
is a DeepSpeed stage 1 and 2 optimizer. (Stage 3 optimizer support can
be covered later.)
2. For DeepSpeed versions > 0.9.1, store the source code of the related
function in a version list. As long as that function remains unchanged
in new DeepSpeed releases, the version check no longer needs to be
upgraded manually. If the source code ever stops matching, a warning
asks users to add the new version of the source code to the list.
With the above change, we will have our FP16 Optimizer working again in
Optimum.
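The source-matching idea in item 2 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: `source_fingerprint` and `is_known_implementation` are hypothetical helpers, the hashing and whitespace normalisation are choices made here, and in practice the current source text would typically be obtained with `inspect.getsource` on the relevant DeepSpeed function.

```python
import hashlib
import textwrap
import warnings


def source_fingerprint(source: str) -> str:
    """Hash function source after normalising indentation and trailing spaces."""
    lines = [line.rstrip() for line in textwrap.dedent(source).strip().splitlines()]
    return hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()


def is_known_implementation(current_source: str, known_sources) -> bool:
    """Return True if the current source matches any stored known version."""
    known = {source_fingerprint(s) for s in known_sources}
    if source_fingerprint(current_source) in known:
        return True
    warnings.warn("Source does not match any known version; "
                  "add the new source code to the version list.")
    return False


# Hypothetical stored source for one supported release.
KNOWN_SOURCES = ["def has_overflow(params):\n    return False\n"]

same = is_known_implementation("def has_overflow(params):   \n    return False",
                               KNOWN_SOURCES)
changed = is_known_implementation("def has_overflow(params):\n    return True\n",
                                  KNOWN_SOURCES)
print(same, changed)
```

Comparing source text rather than version numbers means a new release needs no change here unless the function body actually changed.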

### Description
Introduce a new ORT L1 optimizer under the RewriteRule category to fuse
MatMul + BatchNormalization nodes. This optimizer looks for a specific
pattern observed in one of the impacting customer models and fuses the
MatMul and BatchNormalization nodes into a Gemm node. For details on
the pattern matching and fusion, please refer to the comment section of
`matmul_bn_fusion.cc`.
To visualize, this optimizer will replace the following subgraph with a
Gemm node.
<pre>
             MatMul                  GEMM
               |                       |
            Reshape ^     --->      Reshape ^
               |                       |
           Transpose ^             Transpose ^
               |
      BatchNormalization
Note: ^ means there can be >=0 occurrence(s) of that node.
A few example fusable patterns:
- MatMul -> Reshape -> Transpose -> BatchNormalization ---> GEMM -> Reshape -> Transpose
- MatMul -> Reshape -> BatchNormalization ---> GEMM -> Reshape
- MatMul -> Transpose -> BatchNormalization ---> GEMM -> Transpose
- MatMul -> Reshape -> Reshape -> BatchNormalization ---> GEMM -> Reshape -> Reshape
- MatMul -> Reshape -> Transpose -> Reshape -> BatchNormalization ---> GEMM -> Reshape -> Transpose -> Reshape
- MatMul -> BatchNormalization ---> GEMM
</pre>
Note: This optimizer may evolve in the future to be more generic in
terms of the pattern matching.
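For the simplest case (MatMul directly followed by BatchNormalization), the arithmetic behind such a fusion can be sketched as below. This is an illustrative sketch, not the code in `matmul_bn_fusion.cc`: it assumes inference-mode BatchNormalization applied per output column of the MatMul, and the helper names are hypothetical. With s_j = scale_j / sqrt(var_j + eps), the fused Gemm uses B' = W * s (column-wise) and C = bias - mean * s.

```python
import math


def fold_bn_into_gemm(w, scale, bias, mean, var, eps=1e-5):
    """Fold per-column BatchNormalization parameters into Gemm's B and C."""
    s = [g / math.sqrt(v + eps) for g, v in zip(scale, var)]
    b_fused = [[w_ij * s_j for w_ij, s_j in zip(row, s)] for row in w]
    c_fused = [b_j - m_j * s_j for b_j, m_j, s_j in zip(bias, mean, s)]
    return b_fused, c_fused


def matmul(a, b):
    """Plain matrix product of two nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def batchnorm(y, scale, bias, mean, var, eps=1e-5):
    """Inference-mode BatchNormalization applied per column of y."""
    return [
        [g * (v - m) / math.sqrt(s2 + eps) + b
         for v, g, b, m, s2 in zip(row, scale, bias, mean, var)]
        for row in y
    ]


a = [[1.0, 2.0], [3.0, 4.0]]
w = [[0.5, -1.0], [2.0, 0.25]]
scale, bias, mean, var = [1.5, 0.5], [0.1, -0.2], [0.3, 0.7], [4.0, 0.25]

# Unfused reference: BatchNormalization(MatMul(A, W)).
reference = batchnorm(matmul(a, w), scale, bias, mean, var)

# Fused form: Gemm(A, B', C) = A @ B' + C.
b_fused, c_fused = fold_bn_into_gemm(w, scale, bias, mean, var)
fused = [[v + c for v, c in zip(row, c_fused)] for row in matmul(a, b_fused)]
```

Both paths produce the same values up to floating-point rounding, which is what makes the rewrite safe when the BatchNormalization parameters are constant.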
### Motivation and Context
- Why is this change required? What problem does it solve?
One of the users of the ORT+DML EP needs this to better target their
model to DML. The transformation applies more broadly, though, so it
was added as an L1 optimizer.
### Description
Cast and Resize lacked f16 support, which vae-decoder-f16 needs. With
this change, vae-decoder-f16 time drops from over 1 second to 315 ms.
Memcpy nodes can hurt performance, and they also prevent ORT from
running CUDA graph.
Here we add a warning log for the CUDA EP when this happens, which
helps troubleshooting. For example, when CUDA graph cannot run, we can
check the logs to find out where the Memcpy nodes were inserted (this is
also possible by saving the optimized model, but that needs more time
and disk space).
Note that the warning is per graph. When there are subgraphs, we might
see multiple warnings if the issue happens in multiple graphs.
Example logs:
```
2023-10-19 20:58:10.678176531 [I:onnxruntime:, transformer_memcpy.cc:329 AddCopyNode] Add MemcpyFromHost after input_ids for CUDAExecutionProvider
2023-10-19 20:58:10.678198702 [I:onnxruntime:, transformer_memcpy.cc:329 AddCopyNode] Add MemcpyFromHost after /text_model/ArgMax_output_0 for CUDAExecutionProvider
2023-10-19 20:58:10.678211727 [I:onnxruntime:, transformer_memcpy.cc:329 AddCopyNode] Add MemcpyFromHost after /text_model/Gather_3_output_0 for CUDAExecutionProvider
2023-10-19 20:58:10.678257903 [W:onnxruntime:, transformer_memcpy.cc:74 ApplyImpl] 3 Memcpy nodes are added to the graph main_graph for CUDAExecutionProvider. It might have negative impact on performance (including unable to run CUDA graph). Set session_options.log_severity_level=1 to see the detail logs before this message.
```
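When many such lines are logged, a small script can summarize where the Memcpy nodes were inserted. The following is a sketch that assumes the log format matches the example lines above (the format may differ between ORT versions); `memcpy_insertions` is a hypothetical helper name.

```python
import re
from collections import defaultdict

# Matches the informational lines emitted by AddCopyNode in the example logs.
MEMCPY_LINE = re.compile(
    r"AddCopyNode\] Add (?P<kind>Memcpy\w+) after (?P<tensor>\S+) for (?P<ep>\w+)"
)


def memcpy_insertions(log_text):
    """Group (memcpy kind, tensor name) pairs by execution provider."""
    found = defaultdict(list)
    for match in MEMCPY_LINE.finditer(log_text):
        found[match.group("ep")].append((match.group("kind"), match.group("tensor")))
    return dict(found)


sample_log = """\
2023-10-19 20:58:10.678176531 [I:onnxruntime:, transformer_memcpy.cc:329 AddCopyNode] Add MemcpyFromHost after input_ids for CUDAExecutionProvider
2023-10-19 20:58:10.678198702 [I:onnxruntime:, transformer_memcpy.cc:329 AddCopyNode] Add MemcpyFromHost after /text_model/ArgMax_output_0 for CUDAExecutionProvider
"""

summary = memcpy_insertions(sample_log)
for ep, nodes in summary.items():
    print(ep, nodes)
```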
Make sure the "trt.plugins" custom op domain is registered only once.
The bottom line is that the "trt.plugins" custom op domain needs to be
registered before the model is loaded.
`CreateTensorRTCustomOpDomainList()` is the TRT EP's function to create
the "trt.plugins" custom op domain. The following are the places where
this function is called. (This function only fetches all the TRT
plugins from the TRT plugin registry; it does not register them with
the ORT custom op registry. The actual registration happens in
AddCustomOpDomains().)
C/C++ APIs:
- `OrtApis::SessionOptionsAppendExecutionProvider_TensorRT_XX`: This
function will make session option object contain the "trt.plugins"
custom op domain for ORT to register. So that later the session creation
api can register the custom op domain accordingly and won't complain
about invalid onnx node.
- `InferenceSession::RegisterExecutionProvider`: In some cases, users
might create the session object first and later call
session_object.RegisterExecutionProvider(). This function will call
p_exec_provider->GetCustomOpDomainList() which returns "trt.plugins"
custom op domain. Otherwise, session_object.Load(model) will complain.
Python APIs:
- `RegisterTensorRTPluginsAsCustomOps`: Need to call this function so
that session option object contains the "trt.plugins" custom op domain
for ORT to register.
Different language bindings have slightly different workflows for
initializing the session. This might cause duplicate custom op domains
in `session_option.custom_op_domains_`, or
`CreateTensorRTCustomOpDomainList()` being called more than once, but we
added checks to make sure the EP's custom op domain won't be registered
twice.
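The "register only once" guard can be pictured with a small conceptual sketch. It is written in Python purely for illustration (the real checks live in ORT's C++ code), and `CustomOpRegistry` and `add_domains` are hypothetical names, not ORT APIs.

```python
class CustomOpRegistry:
    """Conceptual registry that ignores domains it has already seen."""

    def __init__(self):
        self._domains = {}

    def add_domains(self, domains):
        """Register (name, ops) pairs; return the names actually added."""
        added = []
        for name, ops in domains:
            if name in self._domains:
                continue  # already registered by an earlier code path
            self._domains[name] = ops
            added.append(name)
        return added


registry = CustomOpRegistry()
trt_domains = [("trt.plugins", ["PluginA", "PluginB"])]  # hypothetical plugin names

first = registry.add_domains(trt_domains)   # e.g. via the session options path
second = registry.add_domains(trt_domains)  # e.g. via RegisterExecutionProvider
print(first, second)
```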
### Description
Inline functions in an EP-aware fashion.
The result of this PR is that models that have been inlined by the ONNX
inliner and then optimized, and models that have been AOT inlined,
appear to be visually identical.
For tests I used two models. The only difference is the resulting size,
because the ONNX inliner removes local function definitions and AOT
does not. The difference in size for the `HF Mobile` model was 2.5 MB,
and for `HF Bart` it was ~500K. It seems that the resulting model size
affects the load time more than the actual optimizations.
In general, inlined models grow in size very fast and can easily exceed
the 2 GB limit.
Q. Should we make AOT optional?
`If` constant folding and the removal of local inlined models will be
coming in other PRs.

### Description
Enable one-dimensional spatial input for GlobalAveragePool.
### Motivation and Context
Currently only 2D input is supported.
### Description
This PR adds UMMLA and SMMLA based QGEMM kernels for aarch64. This
covers the following scenarios:
(i) symmetric quantization (zero point is zero)
(ii) asymmetric quantization (zero point is non-zero)
(iii) per-channel as well as per-tensor quantization
(iv) signed weights (U8S8 Gemm)
(v) unsigned weights (U8U8 Gemm)
(vi) signed activations and weights (S8S8 Gemm)
The UMMLA/SMMLA kernels are enabled based on a cpuinfo check for `I8MM`
support, so the MMLA QGEMM kernels are enabled on all devices that
support I8MM instructions.
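To check quickly whether a given Linux aarch64 machine would take the MMLA path, one can look for the `i8mm` feature flag. The sketch below is an assumption-laden illustration: it parses `/proc/cpuinfo`-style text for a `Features` line, which is not how ORT itself performs the check (ORT goes through its cpuinfo-based detection), and `has_i8mm` is a hypothetical helper.

```python
def has_i8mm(cpuinfo_text: str) -> bool:
    """Return True if any 'Features' line lists the i8mm flag."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip().lower() == "features" and "i8mm" in value.split():
            return True
    return False


# Abbreviated, illustrative cpuinfo excerpts (not captured from real hardware).
with_i8mm = "processor\t: 0\nFeatures\t: fp asimd asimddp sve i8mm bf16\n"
without_i8mm = "processor\t: 0\nFeatures\t: fp asimd asimddp\n"

print(has_i8mm(with_i8mm), has_i8mm(without_i8mm))
```

On a real machine the text would come from reading `/proc/cpuinfo`.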
### Motivation and Context
This is to improve INT8 quantized MatMul performance on aarch64
platform.
I ran the benchmarking script below (bert, roberta and gpt2 model
inference) on an AWS Graviton3 based c7g.4xl instance and observed up
to a 1.33x performance improvement over the optimized UDOT QGEMM
kernel.
```
cd onnxruntime/python/tools/transformers
python3 benchmark.py
```
I have also run the unit tests and made sure they all pass:
```
./build.sh --config RelWithDebInfo --build_shared_lib --parallel --compile_no_warning_as_error --skip_submodule_sync
```
### Description
Currently, uniform support is buggy when the shape rank is larger than
4. See https://github.com/microsoft/onnxruntime/issues/17860 item 1.
So this PR enables shape uniforms for transpose only when the shape
rank is <= 4. Otherwise, the compilation errors below are thrown:
```
1 error(s) generated while compiling the shader:
:3:50 error: uniform storage requires that array elements are aligned to 16 bytes, but array element of type 'u32' has a stride of 4 bytes. Consider using a vector or struct as the element type instead.
struct Uniforms { output_size:u32, a_shape:array<u32, 5>, a_strides:array<u32, 5>, output_shape:array<u32, 5>, output_strides:array<u32, 5> };
^^^^^^^^^^^^^
:3:7 note: see layout of struct:
/* align(4) size(84) */ struct Uniforms {
/* offset( 0) align(4) size( 4) */ output_size : u32;
/* offset( 4) align(4) size(20) */ a_shape : array<u32, 5>;
/* offset(24) align(4) size(20) */ a_strides : array<u32, 5>;
/* offset(44) align(4) size(20) */ output_shape : array<u32, 5>;
/* offset(64) align(4) size(20) */ output_strides : array<u32, 5>;
/* */ };
struct Uniforms { output_size:u32, a_shape:array<u32, 5>, a_strides:array<u32, 5>, output_shape:array<u32, 5>, output_strides:array<u32, 5> };
^^^^^^
:4:42 note: 'Uniforms' used in address space 'uniform' here
@group(0) @binding(2) var<uniform> uniforms: Uniforms;
^^^^^^^^
```
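The error arises because arrays in the `uniform` address space need a 16-byte element stride, while `array<u32, 5>` has a 4-byte stride. This PR avoids the problem by not using shape uniforms above rank 4. As the compiler message hints, one possible alternative is to pack the values into `vec4<u32>` elements; the sketch below only illustrates that packing arithmetic, and both helper names are hypothetical.

```python
def pack_as_vec4(values, pad=0):
    """Group values into 4-tuples, padding the last group with `pad`."""
    padded = list(values) + [pad] * (-len(values) % 4)
    return [tuple(padded[i:i + 4]) for i in range(0, len(padded), 4)]


def wgsl_uniform_array_type(rank):
    """WGSL type whose element stride (16 bytes) is valid in uniform storage."""
    return f"array<vec4<u32>, {(rank + 3) // 4}>"


shape = [2, 3, 4, 5, 6]  # a rank-5 shape
packed = pack_as_vec4(shape)
print(wgsl_uniform_array_type(len(shape)), packed)
```

With this layout, element `i` of the logical array is read as component `i % 4` of vector `i // 4`.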
### Description
This PR implements an exporter that works for large language models
(LLMs).
It works for models like Llama2-70b or gpt-175.
The main idea is to use multiple GPUs and dispatch different layers to
different GPUs; in short, it simply implements automatic pipeline
parallelism.
For example, to export Llama2-70b you need 8x V100-32GB, 4x A100-80G,
or more GPU memory.
It expects decoder-only models. We have not yet tested it on
encoder-decoder style models.
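The layer-dispatch idea can be sketched as a contiguous, near-even assignment of decoder layers to devices. This is an illustrative sketch under stated assumptions, not the exporter's actual placement logic: `assign_layers_to_gpus` is a hypothetical helper, and it balances by layer count only, ignoring embeddings, per-layer memory differences, and activation sizes.

```python
def assign_layers_to_gpus(num_layers, num_gpus):
    """Map each layer index to a device, keeping consecutive layers together."""
    base, extra = divmod(num_layers, num_gpus)
    mapping = {}
    layer = 0
    for gpu in range(num_gpus):
        # The first `extra` devices take one additional layer each.
        count = base + (1 if gpu < extra else 0)
        for _ in range(count):
            mapping[layer] = f"cuda:{gpu}"
            layer += 1
    return mapping


# Illustrative numbers: 80 decoder layers spread over 8 GPUs.
placement = assign_layers_to_gpus(80, 8)
print(placement[0], placement[9], placement[10], placement[79])
```

Keeping consecutive layers on the same device means activations cross a device boundary only once per pipeline stage.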
---------
Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
Expose a new allocator from the CUDA stream.
The allocator manages deferred CPU memory, which only gets recycled
before stream destruction.
---------
Co-authored-by: Randy Shuai <rashuai@microsoft.com>
If the model is partitioned into TRT subgraphs and CUDA EP nodes, we
observed a CUDA stream synchronization issue when multithreading.
Calling the stream sync API after enqueue solves this issue without
adding much performance overhead.
### Description
Update dockerfiles/Dockerfile.source to avoid installing onnx python
package. ONNX is not listed in
https://github.com/microsoft/onnxruntime/blob/main/requirements.txt.in.
We do not have to install it. Especially when we do not run tests, the
package provides no help when building onnxruntime from source.
### Motivation and Context
Resolves #17781
### Description
Reduce the overhead of QNN context binary loading by avoiding a memory
copy.
### Motivation and Context
Reduce the session initialization time and memory usage when loading
from a QNN context binary.
### Description
Initialize previously uninitialized parameters that were causing the op
to crash.
### Motivation and Context
Solves the CUDA memory misalignment / illegal memory access error when
FlashAttention was used in Packed Multi-Head Attention.
Add CUDA EP to the Stable Diffusion XL demo, including:
(1) Add fp16 VAE support for CUDA EP.
(2) Configure each model separately (for example, some models can run
with CUDA graph but others cannot).
Remaining work that will boost performance further:
(1) Enable CUDA Graph for Clip2 and UNet. Currently, part of the graph
is partitioned to CPU, which blocks CUDA graph.
(2) Update the GroupNorm CUDA kernel for the refiner. Currently, the
CUDA kernel supports only a limited number of channels in the refiner,
so we should see some gain once that limitation is removed.
Extra work that is nice to have (thus lower priority):
(3) Support denoising_end to ensemble base and refiner.
(4) Support classifier free guidance (the idea is from
https://www.baseten.co/blog/sdxl-inference-in-under-2-seconds-the-ultimate-guide-to-stable-diffusion-optimiza/).
#### Performance on A100-SXM4-80GB
Example commands to test an engine built with static shape or dynamic
shape:
```
engine_name=ORT_CUDA
python demo_txt2img_xl.py --engine $engine_name "some prompt"
python demo_txt2img_xl.py --engine $engine_name --disable-cuda-graph --build-dynamic-batch --build-dynamic-shape "some prompt"
```
An engine built with dynamic shape supports different batch sizes (1 to
4 for TRT; 1 to 16 for CUDA) and image sizes (256x256 to 1024x1024).
An engine built with static shape supports only a fixed batch size (1)
and image size (1024x1024).
The latency (ms) of generating an image of size 1024x1024 (sorted by
total latency):
Engine | Base (30 Steps)* | Refiner (9 Steps) | Total Latency (ms)
-- | -- | -- | --
ORT_TRT (static shape) | 2467 | 1033 | 3501
TRT (static shape) | 2507 | 1048 | 3555
ORT_CUDA (static shape) | 2630 | 1015 | 3645
ORT_CUDA (dynamic shape) | 2639 | 1016 | 3654
TRT (dynamic shape) | 2777 | 1099 | 3876
ORT_TRT (dynamic shape) | 2890 | 1166 | 4057
\* The VAE decoder is not used in Base since the output from base is a
latent, which is consumed by the refiner to output the image.
ORT_CUDA is faster than the TRT-based engines with dynamic shape but
slower with static shape (the cause is that Clip2 and UNet cannot run
with CUDA Graph right now; we will address the issue later).
### Motivation and Context
Follow up of https://github.com/microsoft/onnxruntime/pull/17536
### Description
Adds a method to access the backing direct byte buffer from a Java
`OnnxTensor` object, assuming it is backed by a direct byte buffer
(tensors created by ORT's run call or ones created in Java from
multidimensional arrays are not). Also adds a method to check if the
backing byte buffer was copied from the user's buffer supplied on
creation (this could be tested via a pointer comparison from the output
of `getBufferRef` and the user's input buffer, so I'm not sure if it's
necessary).
### Motivation and Context
This is the first part of changes necessary to support output pinning in
Java OrtSession.run/OrtTrainingSession.run calls. I split it out from
the rest of the work as it's useful by itself (e.g. to allow users to
keep a single input tensor and rewrite it each time with new inputs
rather than allocate a fresh one) and the other change will be much more
involved so splitting it makes it easier to review.
cc @yuslepukhin
### Description
This is a temporary fix for the failing "Zip-Nuget-Java-Nodejs
Packaging Pipeline". The pipeline is failing because I removed NodeJS
from the build machine pool's image to reduce the number of
dependencies we need to maintain in VMs.
This PR temporarily moves the test to a different machine pool so that
it passes. Then I will move the test to Docker. Docker images are
relatively easy to update and maintain. We now run almost all Linux
tests in Docker, except for this one. Moving it to Docker is needed for
enabling GPU support in Node.js, because none of our Linux VMs have
CUDA.
### Description
This PR:
(1) Fixes AMD builds after #17200 broke them (we need to remember to
run AMD builds when merging external CUDA PRs next time).
(2) Turns on the NHWC CUDA feature in the Linux GPU CI. The extra time
spent building a few more files and running a few more tests will not
be much.
Test Linux GPU CI run :
https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=1170770
### Motivation and Context
Keep the NHWC CUDA ops tested
(https://github.com/microsoft/onnxruntime/pull/17200) and guard against
regressions
### Description
Fix a missing attribute that causes a build error on the release
Xamarin iOS build.
Fix some long lines as well.
### Motivation and Context
#16463 - once the dummy extensions NuGet package is used, this problem
shows up.
Fix a bug in https://github.com/microsoft/onnxruntime/pull/11803:
when the hidden size is not exactly the same as the next size (for
example ld=320 in stable diffusion), the current vectorized kernel
might read out of bounds and might cause a CUDA failure.
Also resolves another issue: for the first and last sizes, the current
macro produces some dead code (branches that never run). Here we change
it to avoid those branches for the boundary sizes.
Performance tests with stable diffusion show that performance is on par
before and after this fix.
### Description
CUDA inference speed relies heavily on Tensor Cores. To achieve optimal
throughput, Tensor Cores require the data layout to be NHWC rather than
NCHW.
### Motivation and Context
This is especially important for convolutional networks. I will
illustrate this using a very simple network:
```
import torch
import torch.nn as nn


class Net1(nn.Module):
    def __init__(self):
        super(Net1, self).__init__()
        # Five stacked 2D convolutions: 8 input channels growing to 128
        # output channels; the first kernel is 5x5, the rest are 3x3.
        self.m = nn.ModuleList([
            nn.Conv2d(in_channels=8, out_channels=32, kernel_size=5, stride=1),
            nn.Conv2d(in_channels=32, out_channels=64, kernel_size=3, stride=1),
            nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3, stride=1),
            nn.Conv2d(in_channels=128, out_channels=128, kernel_size=3, stride=1, bias=False),
            nn.Conv2d(in_channels=128, out_channels=128, kernel_size=3, stride=1, bias=False),
        ])

    def forward(self, x):
        for module in self.m:
            x = module(x)
        return x


if __name__ == "__main__":
    # Export the model in fp16 on the GPU with a batch of 8 inputs.
    dtype = torch.half
    device = "cuda"
    dummy_input = torch.randn(8, 8, 512, 512, dtype=dtype, device=device)
    model = Net1().to(dtype=dtype, device=device)
    input_names = ["input1"]
    output_names = ["output1"]
    torch.onnx.export(model, dummy_input, "test.onnx",
                      input_names=input_names, output_names=output_names)
```
I profiled the launch of
`./build/RelWithDebInfo/onnxruntime_perf_test -e cuda -I -q -t 5 test.onnx`
using nsys (Nsight Systems) and NVTX ranges.
Current master launches below kernels:

If I add the introduced `-l` flag we see below kernels:

Notice the missing NCHW<>NHWC kernels per operation. The layout
optimizer introduced a transpose op as first and last op of the whole
network. The `op_generic_tensor_kernel` shows the bias used which should
also be optimized out next.
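To make the layout difference concrete, here is a small pure-Python sketch of what a single NCHW to NHWC transpose does to a flat buffer. It only illustrates the index mapping and is not how the layout optimizer or the CUDA kernels are implemented; the function names are hypothetical.

```python
def nchw_to_nhwc(data, n, c, h, w):
    """Reorder a flat NCHW buffer into NHWC order."""
    out = [0] * len(data)
    for ni in range(n):
        for ci in range(c):
            for hi in range(h):
                for wi in range(w):
                    src = ((ni * c + ci) * h + hi) * w + wi
                    dst = ((ni * h + hi) * w + wi) * c + ci
                    out[dst] = data[src]
    return out


def nhwc_to_nchw(data, n, c, h, w):
    """Inverse of nchw_to_nhwc."""
    out = [0] * len(data)
    for ni in range(n):
        for hi in range(h):
            for wi in range(w):
                for ci in range(c):
                    src = ((ni * h + hi) * w + wi) * c + ci
                    dst = ((ni * c + ci) * h + hi) * w + wi
                    out[dst] = data[src]
    return out


# One image, 2 channels, 1 row, 3 columns; channel 1 values are offset by 10.
nchw = [0, 1, 2, 10, 11, 12]
nhwc = nchw_to_nhwc(nchw, n=1, c=2, h=1, w=3)
print(nhwc)  # [0, 10, 1, 11, 2, 12]
```

In NHWC, all channels of one pixel are contiguous. Doing the conversion once at the network input and once at the output avoids paying for it around every operation.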
Measured across some very basic models:
| CUDA EP | **NCHW** [ms] | **NHWC** [ms] | Speedup |
|:------------------------|--------------------------------------:|-----------------------------------------:|------------------:|
| | -e cuda -t 5 -q | -e cuda -t 5 -q -l | |
| resnet101-v2-7_bs8_fp16 | 18.33 | 13.07 | 1.4 |
| resnet101-v2-7_bs8 | 21.8 | 12.06 | 1.81 |
| test | 102.07 | 73.62 | 1.39 |
Average speedup: 1.53
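The speedup column and the average can be reproduced from the measured latencies. The short check below copies its numbers from the table above and assumes the average is the arithmetic mean of the rounded per-model speedups.

```python
# (NCHW ms, NHWC ms) per model, copied from the table above.
latencies = {
    "resnet101-v2-7_bs8_fp16": (18.33, 13.07),
    "resnet101-v2-7_bs8": (21.8, 12.06),
    "test": (102.07, 73.62),
}

speedups = {name: round(nchw / nhwc, 2) for name, (nchw, nhwc) in latencies.items()}
average = round(sum(speedups.values()) / len(speedups), 2)
print(speedups, average)
```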
## Outlook
Next, the mission will be to first write a templated unit test to check
the correctness of NHWC vs NCHW ops. After that we have to transition
more ops to measure perf improvements on a broader range of models.
Currently this is not easily possible as we do not support all ops in
the NHWC domain.
---------
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>