Remove half of the compiled shaders for gridsample with unused types
Shader compilations for Bool and Double datatypes are not needed for GPU
evaluation and are causing binary bloat.
Removing them.
Co-authored-by: Sheil Kumar <sheilk@microsoft.com>
In SLN strict mode, current code (#16510) does not handle skip broadcast
nicely . There are two issues:
(1) skip related parameters is not passed to cuda kernel in strict mode
(2) Strict mode kernel also has bug in handling skip broadcasting (like
cuWelfordMuSigma2 does not handle skip broadcasting).
Here we remove the support of skip broadcasting in strict mode, and
operator will return error message that strict mode only support same
shape of input and skip.
Other changes:
* skip_size is misleading when there is no broadcasting. Change to
correct value.
* Refactor the code to be more efficient: (1) no need to check whether
there is broadcasting in kernel. (2) remove one local buffer (load input
to sum_v directly to save a local buffer copy).
* compute input + bias + skip instead of input + skip + bias. The order
is followed common pattern in transformers model (Here assume graph
fusion will distinguish input and skip correctly, need double check
fusion code later).
* update unit test so that strict mode is triggered in each test case
(unless skip broadcasting) to have higher test coverage.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
SLN strict mode does not support skip broadcast but current code will
silently run (kernel might fail)
### Description
We need to ensure that tensors are first created and validated by their
producers. If we don't, then builders that need to modify their outputs
may not be able to do so if consumers are processed first (due to
caching of tensors). For example, the Tanh builder may need to override
its output quant param for 16-bit QDQ. I've encountered a scenario
(while working on a partner model) where the override was not being
correctly applied due to the graph traversal order.
I tried to fix this bug in a previous
[PR](https://github.com/microsoft/onnxruntime/pull/17877#discussion_r1353676802),
but my fix was incorrect.
### Description
The QNN HTP backend does not support rank 3 InstanceNorm if the batch
size is not 1. To work around this limitation, QNN EP can wrap a rank 4
QNN InstanceNorm op with Reshapes (with the H dim set to 1).
### Motivation and Context
Enable support for more models.
Migrate most CUDA EP improvements and changes to ROCM EP. The process
involves using hipify against all CUDA EP files (i.e. do not exclude any
files from onnxruntime_rocm_hipify.cmake) then vimdiff compare them
against the ROCM EP files that are under source control and pull in most
changes. These changes include functional as well as formatting and
makes comparing CUDA EP and ROCM EP easier, though it makes the PR diff
somewhat less obvious due to formatting changes.
- hipify audit of onnxruntime/core/providers/rocm, enable ops
- Loop
- Scan
- hipify audit of onnxruntime/contrib_ops/rocm
- fix contrib ops search implementation
- enable more contrib ops
- Affine
- ComplexMul
- ConvTransposeWithDynamicPads
- Crop
- DynamicSlice
- FFT [Rfft, Irfft]
- GreedySearch
- ImageScaler
- ParametricSoftplus
- ScaledTanh
- ThresholdRelu
---------
Co-authored-by: cloudhan <cloudhan@outlook.com>
### Description
<!-- Describe your changes. -->
Use cpuinfo value when checking to dot product is available. Reading the
ID_AA64ISAR0_EL1 register is unsafe.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
#17647#17541#17851
### Description
Support DistributedSlice kernel in Cuda EP.
mainly support following cases:
1. input data is sharded or replica for all axes (including slice axes)
2. slice axes is sharded across different devices.
starts / ends / steps sharded across different devices are not supported
yet.
---------
Co-authored-by: Wei-Sheng Chin <wschin@outlook.com>
Co-authored-by: Cheng Tang <chenta@microsoft.com@orttrainingdev9.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
Co-authored-by: Cheng Tang <chenta@microsoft.com>
### Description
Update NDK to 26.0.10792818 which is included in every macOS build
machine so that we do not need to download a different version every
time in every build.
### Motivation and Context
Downloading NDK on-the-fly is a main contributor of Android related
build failures.
### Description
<!-- Describe your changes. -->
Improve readability by fixing misplaced comments and utilizing
std::rotate.
### Motivation and Context
Resolve some comments in
https://github.com/microsoft/onnxruntime/pull/17714
### Description
* follows the packaging approach according to the design document
* adds `ENABLE_TRAINING` boolean flag to `BUILD_DEFS`
* modifies `package.json` to include training submodule
* modifies build script to handle, validate, and minimize training WASM
artifacts
* adds the binding for the new backend with training enabled & the new
training artifacts
* adds training backend
* edits `index.ts` to use training backend depending on `BUILD_DEFS`
* edits `wasm-factory.ts` to use the training artifacts if necessary
### Motivation and Context
* we are in the process of adding web bindings to enable training.
* Adding the "glue" to allow onnxruntime-web to use the training WASM
artifacts is required for this work.
* Since BUILD_DEFS is defined and used at build time, I thought that it
made sense to bundle the changes to building in the same PR.
#### Related work
* #16521 allowed for training artifacts to be built
* #17333 must be merged in before this one
---------
Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
### Description
<!-- Describe your changes. -->
### Motivation and Context
Compliance check would fail randomly but the stage couldn't be rerun if
the pipeline artifacts are already published.
There's the error like `Artifact xxxx already exists`.
We had to restart the whole pipeline if there's a random error in
compliance check.
Without doing this CMake gives a miscellaneous error on windows when
checking if NVCC is functional. It will be missing a number after
`--threads`.
Currently it is only possible to configure through the python build scripts and not CMake
only configure - which is what I am usually doing through CLion.
Right now, GroupNorm only support limited number of channels (320, 640,
960, 1280, 1920, 2560, 128, 256, 512). Skip the fusion if number of
channels are not supported.
### Motivation and Context
SD XL refiner model uses number of channels 384, 768, 1152, 2304 and
3072 in GroupNorm.
### Description
allow gpu IO binding tests to fail temporarily.
when the root cause is still in investigation, use `continueOnError:
true` to allow the test to fail without blocking PRs.
### Description
"NPM packaging pipeline" needs to download an artifact from
"Zip-Nuget-Java-Nodejs Packaging Pipeline".
It has been a long-time issue that they two pipelines often use
different commit ids.
This change declares 'Zip-Nuget-Java-Nodejs Packaging Pipeline' as a
resource, so that "NPM packaging pipeline" will always fetch from the
pipeline run that triggers this NPM pipeline.
Their official document says:
"When you define a resource trigger, if its pipeline resource is from
the same repo as the current pipeline, triggering follows the same
branch and commit on which the event is raised."
### Description
1. If the model should be skipped, don't load it.
2. print loaded tests and skipped tests
3. add more same filters as of the onnxruntime_test_all.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
The QNN HTP backend only supports Softmax/LogSoftmax operators with an
axis attribute set to `input_rank - 1` (i.e., the last dimension). This
PR adds support for any axis by wrapping the QNN operator in transposes.
### Motivation and Context
Support more models.
### Support inplace update for PythonOp/Grad
This PR is based on another PR
https://github.com/microsoft/onnxruntime/pull/17685's branch, to make it
easier to review.
With PR: PR https://github.com/microsoft/onnxruntime/pull/17685, By
default all PythonOp inputs/outputs are assumed to not be inplaced, if
during run, we found some inplace update happens (by checking output
data address with all inputs data address), we add clone before set it
as PythonOp/Grad's outputs. In this case, results are correct, but
implicit copies overheads are introduced.
This PR allow users to define output input reuse map, to let ORT know
how to do the reuse map, avoid such unnecessary copies.
### Description
<!-- Describe your changes. -->
Patch for All gather fn for Deepspeed Stage 3 changes
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Previously, BiasAdd only supports hidden dimensions of 32, 640 and 1280
for stable diffusion. This adds a kernel that could support any number
of channels.
### Motivation and Context
Stable Diffusion XL refiner model uses hidden dimensions of 768 or 1536,
which was not supported in BiasAdd.
### Description
upgrade JS shared dev dependencies.
- webpack: removed
- eslint: upgrade to latest.
- eslint config upgraded to compatible with latest version
- typescript upgrade to v5
- update module "CommonJS" to "Node16" in tsconfig
- update deprecated config "importsNotUsedAsValues" to
"verbatimModuleSyntax"
- remove webpack bundles in onnxruntime-common
### Description
flags `--enable_wasm_api_exception_catching --disable_rtti` are used in
release build, so fix the build_jsep.bat script to make it more
consistent with CI.
SD XL Refiner model has new hidden dimension sizes not supported by BiasSplitGelu. This update the kernel to support them.
### Motivation and Context
Current BiasSplitGelu does not support optimization for SD XL refiner model.
### Description
support using uniform buffer.
This PR allows to use uniform buffer in shader program, so that some
runtime information (eg. input/output shape) is no longer need to be
hardcoded into shader code.
There are 2 commits in this PR:
-
[667f31c](667f31c83d):
framework changes to support uniform buffer, as well as updates in
program manager, gpu data manager and indices helper.
-
[09e1d2a](09e1d2ad1d):
an example change for operator `Transpose` to use input's rank-only
instead of dims as shader key. With this change, model mobilenetv2-12
shader compile times dropped from 71 to 52.
### Description
<!-- Describe your changes. -->
Include CoreML EP in python package.
I've added to the base package as CoreML comes from the OS so there are
no additional libraries to distribute.
Updated the CPU-based provider list to add the AzureEP, which is also
included in the base package, to fix some test failures. Without this
the infrastructure thinks a device copy implementation is required
between AzureEP and CoreML nodes, which is not the case as the AzureEP
is CPU based.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
#16989
Fix#17760. Upstream exporter creates empty string as Pad's 3rd input
and the quantization tool 1) considers that as a valid tensor name and
2) adds corresponding invalid quantization nodes. This PR adds a
condition check to make quantization tool working.
- Update ROCm and MIGraphX CI to ROCm5.7
- Simplify test exculde file. Some tests will output `registered
execution providers ROCMExecutionProvider were unable to run the model.`
if they cannot run.
- Add `enable_training` build argument for MIGraphX pipeline.
Python package pipeline fails due to "tokenizers" compilation. Since
"tokenizers" is a dep of "transformers", we update its version and hope
a new solution had been there.
```
error: casting `&T` to `&mut T` is undefined behavior, even if the reference is unused, consider instead using an `UnsafeCell`
--> tokenizers-lib/src/models/bpe/trainer.rs:517:47
```
Improve CUDA EP's GetCapability: Add layout transformer support.
Currently the code detects if a node is already assigned to some EP, if
yes, it will directly return.
```c++
if (!node.GetExecutionProviderType().empty()) {
return;
}
```
So, if you call the GetCapability function twice,
```c++
auto caps = GetCapability();
assign_nodes_to_eps(..., caps, ...);
auto caps2 = GetCapability();
```
The second GetCapability() call will return fewer results than the first
one. Layout transformer needs to call GetCapability twice as above. So
the current GetCapability() implementation is incompatible with the
Layout transformer. It is not an issue right now because the CUDA EP
doesn't need to do layout transform. But we might want to support a
different layout.
- we will publish the onnxruntime-training-rocm package on ADO feeds.
The onnxruntime-training package will solely be for cuda.
- Add new pipeline for onnxruntime-training-rocm ADO feeds
https://aiinfra.visualstudio.com/Lotus/_build?definitionId=1278. Only
package with latest rocm version is publish to ADO.
### Fix convergence for dolly+stage3 training
In
[ZeROOffloadSubscriber](216214b7d3/orttraining/orttraining/python/training/utils/hooks/_zero_offload_subscriber.py (L359C7-L359C28)),
we defined some PythonOp, taking input and returning it inplace, for
example:
216214b7d3/orttraining/orttraining/python/training/utils/hooks/_zero_offload_subscriber.py (L223C20-L223C20).
While it is possible, when ORT runs such a PythonOp, once it completes,
it will release the input OrtValue, triggered the data erasing or
overridden. But the PythonOp's returned value OrtValue are still
pointing to that address, reading or writting on that may introduce a
wrong result or even undefined behaviors.
```
/bert_ort/pengwa/py38/lib/python3.8/site-packages/onnxruntime/training/ortmodule/_custom_autograd_function_runner.py:28: UserWarning: .rank-0: onnxruntime.training.utils.hooks._zero_offload_subscriber.ORTZeROOffloadPreForwardFunction->Backward: ONNX Op attribute 'tensor_reuse_map' doesn't indicate 8-th output is reusing any input, but detected inplace_map indicates it is reusing some input index. A clone will be done before returning to ORT, to align with ORT's NO Buffer reuse plan. Please update inplace_map explicitly to avoid such a copy.
warnings.warn(f".rank-{get_rank()}: {message}")
0%|▏ | 1/1000 [00:04<1:15:08, 4.51s/it][WARNING|trainer_pt_utils.py:849] 2023-09-25 08:30:44,023 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0
{'loss': 14.1406, 'learning_rate': 0, 'epoch': 0.0}
0%|▏ | 1/1000 [00:04<1:15:08, 4.51s/it]Invalidate trace cache @ step 5: expected module 6, but got module 7
0%|▍ | 2/1000 [00:04<31:53, 1.92s/it][WARNING|trainer_pt_utils.py:849] 2023-09-25 08:30:44,124 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0
{'loss': 0.0, 'learning_rate': 0, 'epoch': 0.0}
0%|▋ | 3/1000 [00:04<18:05, 1.09s/it][WARNING|trainer_pt_utils.py:849] 2023-09-25 08:30:44,227 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0
{'loss': 0.0, 'learning_rate': 0, 'epoch': 0.0}
0%|▋ | 3/1000 [00:04<18:05, 1.09s/it][WARNING|trainer_pt_utils.py:849] 2023-09-25 08:30:44,326 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0
{'loss': 0.0, 'learning_rate': 0, 'epoch': 0.0}
0%|█▏ | 5/1000 [00:04<08:44, 1.90it/s][WARNING|trainer_pt_utils.py:849] 2023-09-25 08:30:44,419 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0
{'loss': 0.0, 'learning_rate': 0, 'epoch': 0.0}
0%|█▏ | 5/1000 [00:04<08:44, 1.90it/s][WARNING|trainer_pt_utils.py:849] 2023-09-25 08:30:44,505 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0
{'loss': 0.0, 'learning_rate': 0, 'epoch': 0.0}
1%|█▋ | 7/1000 [00:05<05:28, 3.02it/s][WARNING|trainer_pt_utils.py:849] 2023-09-25 08:30:44,597 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0
{'loss': 0.0, 'learning_rate': 0, 'epoch': 0.0}
1%|█▋ | 7/1000 [00:05<05:28, 3.02it/s][WARNING|trainer_pt_utils.py:849] 2023-09-25 08:30:44,690 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0
{'loss': 0.0, 'learning_rate': 0, 'epoch': 0.0}
1%|██▏ | 9/1000 [00:05<03:57, 4.17it/s][WARNING|trainer_pt_utils.py:849] 2023-09-25 08:30:44,791 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0
{'loss': 0.0, 'learning_rate': 0, 'epoch': 0.0}
1%|██▏ | 9/1000 [00:05<03:57, 4.17it/s][WARNING|trainer_pt_utils.py:849] 2023-09-25 08:30:44,889 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0
{'loss': 0.0, 'learning_rate': 0, 'epoch': 0.0}
1%|██▋ | 11/1000 [00:05<03:06, 5.32it/s][WARNING|trainer_pt_utils.py:849] 2023-09-25 08:30:44,981 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0
{'loss': 0.0, 'learning_rate': 0, 'epoch': 0.0}
1%|██▋ | 11/1000 [00:05<03:06, 5.32it/s][WARNING|trainer_pt_utils.py:849] 2023-09-25 08:30:45,073 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0
{'loss': 0.0, 'learning_rate': 0, 'epoch': 0.01}
1%|███▏ | 13/1000 [00:05<02:33, 6.42it/s][WARNING|trainer_pt_utils.py:849] 2023-09-25 08:30:45,166 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0
{'loss': 0.0, 'learning_rate': 0, 'epoch': 0.01}
1%|███▏ | 13/1000 [00:05<02:33, 6.42it/s][WARNING|trainer_pt_utils.py:849] 2023-09-25 08:30:45,256 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0
{'loss': 0.0, 'learning_rate': 0, 'epoch': 0.01}
2%|███▌ | 15/1000 [00:05<02:12, 7.43it/s][WARNING|trainer_pt_utils.py:849] 2023-09-25 08:30:45,348 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0
{'loss': 0.0, 'learning_rate': 0, 'epoch': 0.01}
2%|███▌ | 15/1000 [00:05<02:12, 7.43it/s][WARNING|trainer_pt_utils.py:849] 2023-09-25 08:30:45,439 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0
{'loss': 0.0, 'learning_rate': 0, 'epoch': 0.01}
2%|████ | 17/1000 [00:06<01:59, 8.22it/s][WARNING|trainer_pt_utils.py:849] 2023-09-25 08:30:45,535 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0
{'loss': 0.0, 'learning_rate': 0, 'epoch': 0.01}
2%|████ | 17/1000 [00:06<01:59, 8.22it/s]Traceback (most recent call last):
File "examples/onnxruntime/training/language-modeling/run_clm.py", line 600, in <module>
main()
File "examples/onnxruntime/training/language-modeling/run_clm.py", line 548, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/bert_ort/pengwa/optimum/optimum/onnxruntime/trainer.py", line 457, in train
return inner_training_loop(
File "/bert_ort/pengwa/optimum/optimum/onnxruntime/trainer.py", line 781, in _inner_training_loop
self.deepspeed.step()
File "/bert_ort/pengwa/deepspeed/deepspeed/runtime/engine.py", line 2084, in step
self._take_model_step(lr_kwargs)
File "/bert_ort/pengwa/deepspeed/deepspeed/runtime/engine.py", line 1990, in _take_model_step
self.optimizer.step()
File "/bert_ort/pengwa/deepspeed/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/bert_ort/pengwa/deepspeed/deepspeed/runtime/zero/stage3.py", line 1854, in step
if self._overflow_check_and_loss_scale_update():
File "/bert_ort/pengwa/deepspeed/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/bert_ort/pengwa/deepspeed/deepspeed/runtime/zero/stage3.py", line 1788, in _overflow_check_and_loss_scale_update
self._update_scale(self.overflow)
File "/bert_ort/pengwa/deepspeed/deepspeed/runtime/zero/stage3.py", line 2132, in _update_scale
self.loss_scaler.update_scale(has_overflow)
File "/bert_ort/pengwa/deepspeed/deepspeed/runtime/fp16/loss_scaler.py", line 175, in update_scale
raise Exception(
Exception: Current loss scale already at minimum - cannot decrease scale anymore. Exiting run.
2%|████ | 17/1000 [00:06<06:07, 2.67it/s]
[2023-09-25 08:30:51,075] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 1065120) of binary: /bert_ort/pengwa/py38/bin/python
Traceback (most recent call last):
File "/bert_ort/pengwa/py38/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/bert_ort/pengwa/py38/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/bert_ort/pengwa/py38/lib/python3.8/site-packages/torch/distributed/run.py", line 806, in main
run(args)
File "/bert_ort/pengwa/py38/lib/python3.8/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/bert_ort/pengwa/py38/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/bert_ort/pengwa/py38/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
examples/onnxruntime/training/language-modeling/run_clm.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-09-25_08:30:51
host : orttrainingdev10.internal.cloudapp.net
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 1065120)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
(/bert_ort/pengwa/py38) pengwa@microsoft.com@orttrainingdev10:/bert_ort/pengwa/optim
```
## The Fix
For those output that are reusing input, but ORT is not aware of, we
detected on the fly (the first iteration, by checking the output tensor
addresses with input tensor addresses) , then do implicit copy before
set it as PythonOp's output tensors.
With this fix: (left: PyTorch, right: ORT)

### Description
Adds logging to `GraphTransformer::Apply` whether modification has taken
place or not.
### Motivation and Context
A general high level info logging to track which optimization occurred
for a given model. To help improve dynamo exported model performance by
monitoring the difference of triggered transformations between that of
torchscript exported model.
### Description
Improve the QNN context binary cache feature to reduce the memory
overhead and initialization time overhead.
Instead of dumping a Qnn context binary file with metadata as header, we
dump a Onnx format file with metadata inside Onnx node.
### Motivation and Context
reduce the memory overhead and initialization time overhead