Being able to leverage I/O binding for DML and registering `If` for the
DML EP allows us to avoid copying the past/present key/values back and
forth between the CPU and the GPU after every token.
This gives us a 25% performance increase for Dolly V2 with 128 tokens on
an RTX 4090.
### Description
Set `canvas` dimensions to the `ImageBitmap` dimensions, thus fixing a
malformed Tensor creation.
### Motivation and Context
According to the [HTMLCanvasElement.drawImage()
spec](https://html.spec.whatwg.org/multipage/canvas.html#drawing-images):
> When the destination rectangle is outside the destination image (the
output bitmap), the pixels that land outside the output bitmap are
discarded, as if the destination was an infinite canvas whose rendering
was clipped to the dimensions of the output bitmap.
meaning that `ImageBitmap` pixels exceeding the canvas dimensions will
be discarded. Since no canvas dimensions are set for
`Tensor.fromImage(ImageBitmap)` if-case, the default 300x150px canvas
dimensions are used leading to the creation of malformed Tensors where
all the exceeding pixels are discarded and equal to `0, 0, 0, 0` during
the subsequent `pixels2DContext.getImageData()` call.
### Description
<!-- Describe your changes. -->
This PR moves checking profilingMode to each run instead of the
initialization stage. In this way, users can start/stop profiling at any
time. Otherwise, profiling only take effects at the very beginning and
can't be stopped.
### Description
This PR adds support for saving model optimizations after loading a
model that contains external data into an `InferenceSession`.
### Motivation and Context
This PR is a follow-up to a [previous
PR](https://github.com/microsoft/onnxruntime/pull/16716) for saving a
model optimized by an `InferenceSession`.
### Description
1. As a follow-up of #16761, this PR allows build ORT on iOS/Android
without the need to explicitly specify a protoc path. #16761 is for
WASM. This one is for iOS/Android
2. Update the MacOS/Linux build scripts that build/install protobuf from
source. Make them be more flexible. Add the support for
RedHatEnterprise(ubi), which will needed for upgrading the base image
from centos:7 to ubi:8.
3. Update tools/ci_build/github/pai/rocm-ci-pipeline-env.Dockerfile :
the docker file's base image has preinstalled protobuf in /usr/local, we
should uninstall them to avoid conflicts.
### Description
The `%AGENT_TEMPDIRECTORY%\v11.8` is created in azcopy step.
So, the set env step should be after the azcopy step.
### Motivation and Context
Correct the previous logic
Unify the step since multiple jobs are using it.
### Description
Implemented Resize operator support in JSEP
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Improve graph transformer DoubleQDQPairsRemover
### Description
Improve DoubleQDQPairsRemover to not reset the scale & zero point if
existing value are same on the target DQ & Q nodes.
### Motivation and Context
Fix a bug that DoubleQDQPairsRemover reset the scale value while
removing unnecessary DQ & Q nodes.
### Description
Added Gelu operator to JSEP
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Python Package Pipeline failed since there is exception raised in
test_smooth_quant (from #16288):
```
File "/home/cloudtest/.local/lib/python3.8/site-packages/onnxruntime/quantization/quantize.py", line 384, in quantize_static
importlib.import_module("neural_compressor.adaptor.ox_utils.smooth_quant")
File "/home/cloudtest/.local/lib/python3.8/site-packages/neural_compressor/__init__.py", line 24, in <module>
from .contrib import *
File "/home/cloudtest/.local/lib/python3.8/site-packages/neural_compressor/contrib/__init__.py", line 19, in <module>
from .strategy import *
File "/home/cloudtest/.local/lib/python3.8/site-packages/neural_compressor/contrib/strategy/__init__.py", line 26, in <module>
__import__(basename(f)[:-3], globals(), locals(), level=1)
File "/home/cloudtest/.local/lib/python3.8/site-packages/neural_compressor/contrib/strategy/sigopt.py", line 22, in <module>
from neural_compressor.strategy.strategy import strategy_registry, TuneStrategy
File "/home/cloudtest/.local/lib/python3.8/site-packages/neural_compressor/strategy/__init__.py", line 20, in <module>
from .strategy import STRATEGIES
File "/home/cloudtest/.local/lib/python3.8/site-packages/neural_compressor/strategy/strategy.py", line 41, in <module>
from ..algorithm import AlgorithmScheduler, ALGORITHMS
File "/home/cloudtest/.local/lib/python3.8/site-packages/neural_compressor/algorithm/__init__.py", line 20, in <module>
from .algorithm import ALGORITHMS, Algorithm, AlgorithmScheduler, algorithm_registry
File "/home/cloudtest/.local/lib/python3.8/site-packages/neural_compressor/algorithm/algorithm.py", line 21, in <module>
from neural_compressor.utils.create_obj_from_config import get_algorithm
File "/home/cloudtest/.local/lib/python3.8/site-packages/neural_compressor/utils/create_obj_from_config.py", line 20, in <module>
from neural_compressor.metric import METRICS
File "/home/cloudtest/.local/lib/python3.8/site-packages/neural_compressor/metric/__init__.py", line 30, in <module>
__import__(basename(f)[:-3], globals(), locals(), level=1)
File "/home/cloudtest/.local/lib/python3.8/site-packages/neural_compressor/metric/coco_tools.py", line 54, in <module>
from pycocotools import coco
File "/usr/local/lib/python3.8/dist-packages/pycocotools/coco.py", line 52, in <module>
from . import mask as maskUtils
File "/usr/local/lib/python3.8/dist-packages/pycocotools/mask.py", line 3, in <module>
import pycocotools._mask as _mask
File "pycocotools/_mask.pyx", line 1, in init pycocotools._mask
ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject
```
The cause is pycocotools package uses "oldest-supported-numpy", which
might cause older version numpy in build pycocotools:
9e9164f979/PythonAPI/pyproject.toml (L4)
Related issue: https://github.com/cocodataset/cocoapi/issues/248
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Fixes the issue with IRFFT output dimension calculation as described in
#13236
### Motivation and Context
Please refer to #13236 for detailed description.
Specifically, [this code](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/contrib_ops/cuda/math/fft_ops.cc#L103) computes the output dimension as:
```
out_dim = in_dim * 2 - 1
```
while it should be this instead:
```
out_dim = 2 * (in_dim - 1)
```
(assuming the original signal has even number of samples, of course).
For example, if the original signal has 4 samples, then the round trip should look something like:
```
4 -> (one-sided RFFT) -> 3 (complex) -> (one-sided IRFFT) -> 4
```
with the current code the output will be a signal with 5 points.
---------
Co-authored-by: Alexey Kamenev <akamenev@nvidia.com>
Co-authored-by: Nick Geneva <nicholasgeneva@gmail.com>
### Description
<!-- Describe your changes. -->
Split stages for CPU and CPU+NNAPI builds as CodeQL is enabled at the
stage level.
We run it for CPU+NNAPI as that covers all the Android code.
We don't want to run it for both as duplicate issues would be created
for a problem in code included in both builds.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Remove VS 2019 code.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Allow creating ConstantScalarNode for double type
Allow create ConstantScalarNode for double type. Looks double type is
not respected when creating constant. So fix it.
```
onnxruntime::python::addObjectMethodsForTraining(pybind11::module&, onnxruntime::python::ExecutionProviderRegistrationFn)::<lambda(onnxruntime::training::OrtModuleGraphBuilder*, const onnxruntime::training::TrainingGraphTransformerConfiguration&)> [ONNXRuntimeError] : 1 : FAIL : Type Error: Type parameter (T) of Optype (Sub) bound to different types (tensor(double) and tensor(float) in node (/_original_module/_original_model/gpt_neox/layers.0/input_layernorm/Pow_Grad/Sub_1).
```
### Description
These yaml files and docker files are not used by any pipeline. If I
were wrong, feel free to submit a PR to get the wrongly deleted file
back from git history (git keeps everything forever).
### Description
Add some safety check for conv op.
It is to validate if the attributes coming from a conv op are in a valid
range. (shouldn't be too large or too small).
### Description
Added Flatten operator support to JSEP.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Check the axis attribute for LayerNorm for HTP
### Description
Add code to check the axis value for LayerNorm for HTP explicitly to
make sure only the last dimension is allowed.
### Description
<!-- Describe your changes. -->
This PR adds support to cache the exported training/evaluation ONNX
model in `ORTModule`. On future runs, instead of exporting the model
again, we can pick up the model from a location on disc and run
`ORTModule` training/evaluation.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
ORT Training DRI Contribution
---------
Co-authored-by: root <root@orttrainingdev8.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
Co-authored-by: Prathik Rao <prathikrao@microsoft.com@orttrainingdev8.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
Co-authored-by: Baiju Meswani <bmeswani@microsoft.com>
Co-authored-by: pengwa <pengwa@microsoft.com>
### Description
### Motivation and Context
It's also used to upgrade visual studio to VS2022.
onnxruntime-gpu-winbuild-T4 and onnxruntime-gpu-tensorrt8-winbuild-t4
are using the image based on one dev branch and VS2019
To avoid breaking the current CIs, we move jobs running on
onnxruntime-gpu-winbuild-T4/onnxruntime-gpu-tensorrt8-winbuild-t4 to
onnxruntime-Win2022-GPU-T4.
### Description
Adds javadoc for all protected and public members, methods and classes.
### Motivation and Context
The javadoc warnings were annoying me when running the builds. Also,
those types should have been documented.
---------
Co-authored-by: Scott McKay <Scott.McKay@microsoft.com>
### Description
Support SmoothQuant for ORT static quantization via intel neural
compressor
> Note:
Please use neural-compressor==2.2 to try SmoothQuant function.
### Motivation and Context
For large language models (LLMs) with gigantic parameters, the
systematic outliers make quantification of activations difficult. As a
training free post-training quantization (PTQ) solution, SmoothQuant
offline migrates this difficulty from activations to weights with a
mathematically equivalent transformation. Integrating SmoothQuant into
ORT quantization can benefit the accuracy of INT8 LLMs.
---------
Signed-off-by: Mengni Wang <mengni.wang@intel.com>
winml/ was previously excluded from lintrunner config. This change
includes the directory and adds the clang-format config file specific to
winml/ that fits existing style.
---------
Signed-off-by: Justin Chu <justinchu@microsoft.com>
This will remove transposes that are non needed in the DML kernel. To
keep backward compatiblity, the default behavior is to set NHWC when no
attribute is set.
### ORTModule log clean up
ORTModule log level - WARNING(Default) is for end users; INFO and
VERBOSE is for internal ORT training developers.
Few issues:
1. ONNX export will output lots of WARNING error message like "The shape
inference of
com.microsoft::SoftmaxCrossEntropyLossInternal/ATen/PythonOp type is
missing", which is useless for us or end users.

3. ORT also print some information like
""CleanUnusedInitializersAndNodeArgs] Removing
initializer","ReverseBFSWithStopGradient] Skip building gradient for",
which is also useless for us or end users most of the time.

5. Different ranks output logs and making ORT developers or end users
feels there are too many logs but usually not useful until we need
investigate.
Few improvements for the issues:
1. For ONNX export logs, there are two kinds of logs: a. export verbose
log; b. other logs printed by torch C++ backend. So this PR make
following change:
# VERBOSE -> FULL export verbose log + FULL torch other logs from stdout
and stderr (C++ backend)
# INFO -> FULL export verbose log + FILTERED torch other logs from
stdout and stderr (C++ backend)
# WARNING/ERROR -> [Rank 0] NO export verbose log + FILTERED torch other
logs from stdout and stderr (C++ backend)
e.g. for verbose level, print all logs as usually; for info level, print
verbose export log, and filtered logs from torch C++ backend (removing
messages like this "The shape inference of
com.microsoft::SoftmaxCrossEntropyLossInternal/ATen/PythonOp type is
missing") . For higher level, only log the info on rank 0.
2. For ORT gradient graph build and session creation, also suppress the
message and filtered out the message when log level >=INFO.
3. log level > INFO, then only logs on rank 0 is logged, to have a
cleaner user experience
This is the log for a BLOOM model training after the change: there are
limited of warnings.

### Description
Disable two PERF* rules in ruff to allow better readability. Rational
commented inline. This change also removes the unused noqa directives
because of the rule change.
### Motivation and Context
Readability
### Description
"onnxruntime-common" starts to get more and more complicated, so it's a
good idea to add unit tests for it.
Includes the following changes:
- move `mocha` from each subfolder (js/web/, js/node/) to root (js/), so
that it will be installed once and all subfolder can use.
- add folder `test` in js/common/ as root folder for ort-common tests.
- add sub folder `type-tests`. this folder contains a few typescript
source code, which are excluded from the tsconfig.json. they are not
compiled by default. instead, file `type-tests.ts` calls typescript
compiler (tsc) to check for the files under this folder whether the
compilation result is as expected. If tsc compiles a file successfully
when a failure is expected, this is considered an failed test.
- add sub folder `unit-tests`. files under this folder will be compiled
by default. we use default mode of mocha (using `describe()` and `it()`)
to setup test groups and cases.
- update eslint rules accordingly.
### Description
Added Slice operator support to JSEP.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
This PR splits out the FP16 conversions into a separate package we can
override in the android build with a version which works on old versions
of Android.
I'm not sure the android build system changes are correct as I haven't
got an android build environment configured on my workstation.
@YUNQIUGUO if the CI build fails we should follow up offline to get my
environment configured so I can iterate on it.
### Motivation and Context
Fixes the CI failure after #16703.
Update upload_pod_archive_and_update_podspec.sh to take a pod archive path glob pattern. The actual pod archive path has a version suffix which changes.
### Description
This PR improves `TreeNodeElementId` hash function by employing [Elegant
Pairing function](http://szudzik.com/ElegantPairing.pdf). In few works,
Elegant Pairing function maps two non−negative integers to a
non−negative integer that is uniquely associated with that pair. This
drastically reduces the collision and therefore reduces the time
required to create a session in order to use a large tree ensemble
model.
### Motivation and Context
We use ONNX runtime to serve our models as part of Triton backend. We
noticed that it was taking around 2 minutes to load a model which is a
large tree ensemble model (around 5k trees with around 3 millions nodes
in total). After investigating the issue, it was clear that the
`TreeNodeElementId` hash function wasn't being able to map keys to
buckets of C++ `unordered_map` without a significant amount of
collisions (in same cases 700 items per bucket).
The following picture shows graphically the improvement obtained by the
proposed change. We used the `onnx_test_runner` command.

#### Before
```
$> time ./onnx_test_runner -v ~/folder_with_model
result:
Models: 1
Total test cases: 0
Succeeded: 0
Not implemented: 0
Failed: 0
Stats by Operator type:
Not implemented(0):
Failed:
Failed Test Cases:
real 0m55.695s
user 0m52.919s
sys 0m0.760s
```
#### After
```
$> time ./onnx_test_runner -v ~/folder_with_model
result:
Models: 1
Total test cases: 0
Succeeded: 0
Not implemented: 0
Failed: 0
Stats by Operator type:
Not implemented(0):
Failed:
Failed Test Cases:
real 0m17.152s
user 0m14.318s
sys 0m0.619s
```
### Description
<!-- Describe your changes. -->
Updating README.md to add vitisai for onnxruntime_perftest
### Motivation and Context
<!-- - Why is this change required? What problem does it solve? -->
The perftest tool does support vitisai whereas the README.md does not
list it. This creates some confusions internally about if vitisai is
supported. See https://github.com/microsoft/onnxruntime/pull/15673 for
context.
### Fix slice upstream - (MatMul) [ShapeInferenceError] Incompatible
dimensions
```
2023-07-22 14:58:16.918478478 [I:onnxruntime:Default, constant_sharing.cc:256 ApplyImpl] Total shared scalar initializer count: 10
2023-07-22 14:58:16.919494252 [W:onnxruntime:Default, graph.cc:108 MergeShapeInfo] Error merging shape info for output. 'onnx::Cast_424' source:{-1,31,-1,-1} target:{-1,32,-1,-1}. Falling back to lenient merge.
2023-07-22 14:58:16.921014114 [W:onnxruntime:Default, graph.cc:108 MergeShapeInfo] Error merging shape info for output. 'onnx::MatMul_425' source:{-1,31,-1,-1} target:{-1,32,-1,-1}. Falling back to lenient merge.
Traceback (most recent call last):
File "examples/onnxruntime/training/language-modeling/run_clm.py", line 594, in <module>
main()
File "examples/onnxruntime/training/language-modeling/run_clm.py", line 542, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/bert_ort/pengwa/optimum/optimum/onnxruntime/trainer.py", line 454, in train
return inner_training_loop(
File "/bert_ort/pengwa/optimum/optimum/onnxruntime/trainer.py", line 755, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/bert_ort/pengwa/py38/lib/python3.8/site-packages/transformers/trainer.py", line 2735, in training_step
loss = self.compute_loss(model, inputs)
File "/bert_ort/pengwa/optimum/optimum/onnxruntime/trainer.py", line 363, in compute_loss
return model_with_loss(dict_inputs, return_outputs)
File "/bert_ort/pengwa/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/bert_ort/pengwa/py38/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/bert_ort/pengwa/py38/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1724, in forward
loss = self.module(*inputs, **kwargs)
File "/bert_ort/pengwa/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/bert_ort/pengwa/py38/lib/python3.8/site-packages/onnxruntime/training/ortmodule/_utils.py", line 384, in _forward
return ortmodule._torch_module.forward(*inputs, **kwargs)
File "/bert_ort/pengwa/py38/lib/python3.8/site-packages/onnxruntime/training/ortmodule/_utils.py", line 364, in _forward
return torch_module_ort._execution_manager(torch_module_ort.is_training()).forward(*inputs, **kwargs)
File "/bert_ort/pengwa/py38/lib/python3.8/site-packages/onnxruntime/training/ortmodule/_training_manager.py", line 345, in forward
self._fallback_manager.handle_exception(
File "/bert_ort/pengwa/py38/lib/python3.8/site-packages/onnxruntime/training/ortmodule/_fallback.py", line 157, in handle_exception
raise exception
File "/bert_ort/pengwa/py38/lib/python3.8/site-packages/onnxruntime/training/ortmodule/_training_manager.py", line 280, in forward
self._build_graph(graph_transformer_config)
File "/bert_ort/pengwa/py38/lib/python3.8/site-packages/onnxruntime/training/ortmodule/_logger.py", line 218, in wrapper
result = func(graph_execution_manager, *args, **kwargs)
File "/bert_ort/pengwa/py38/lib/python3.8/site-packages/onnxruntime/training/ortmodule/_training_manager.py", line 360, in _build_graph
super()._build_graph(graph_transformer_config)
File "/bert_ort/pengwa/py38/lib/python3.8/site-packages/onnxruntime/training/ortmodule/_graph_execution_manager.py", line 186, in _build_graph
self._graph_builder.build(config)
RuntimeError: /bert_ort/pengwa/onnxruntime/orttraining/orttraining/python/orttraining_pybind_state.cc:823 onnxruntime::python::addObjectMethodsForTraining(pybind11::module&, onnxruntime::python::ExecutionProviderRegistrationFn)::<lambda(onnxruntime::training::OrtModuleGraphBuilder*, const onnxruntime::training::TrainingGraphTransformerConfiguration&)> [ONNXRuntimeError] : 1 : FAIL : Node (MatMul_403) Op (MatMul) [ShapeInferenceError] Incompatible dimensions
```
Missed using `axis` attribute for `Slice` op, so change to use `axes`
inputs instead.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Sometimes, ONNX exporter generates rank- or shape-dependent sub-graphs.
Thus, error could occur when running the ONNX model with different
inputs. This PR
([78e736d](78e736d857))
addresses this problem by
- if needed, exporting multiple ONNX models with different inputs for
the same GraphModule.
- implementing a naive mechanism to determine of existing ONNX models
(and the associated InferenceSession) can be reused.
On the other hand, in the second commit
[b5a9b5f](b5a9b5f849),
this PR also enables dynamic shapes in DORT by
- passing dynamic_shapes = True to exporter (see how
DEFAULT_DYNAMIC_BACKEND is created)
- calling torch._dynamo.optimize(dynamic_ort_aot, dynamic=True) (see how
dynamic_ort_aot is created).
/builds/devtechproviz/dl/ort-builder/onnxruntime/onnxruntime/python/onnxruntime_pybind_state.cc:388:14:
error: missing initializer for member
'OrtTensorRTProviderOptionsV2::trt_cuda_graph_enable'
[-Werror=missing-field-initializers]
388 | 0};
|
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->