### Description
<!-- Describe your changes. -->
Minor changes to allow CoreML EP to handle more nodes and models.
- Remove graph input dynamic shape check from
coreml::GetSupportedNodes(). Each node input is still checked.
- Add check for optional input in coreml::IsInputSupported(). If an
input does not exist it should not be considered unsupported.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Some CoreML EP checks seem too strict now.
Here's the motivating issue:
https://github.com/microsoft/azure-pipelines-tasks/issues/10331
Noticed some problems in other repos so also updating usages in ORT.
We may be fine now without it, but this change adds some safeguard against future additions of 'set -x' for debugging.
### Description
Change CUDA pipelines to download CUDA SDK in every build job
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
<!-- Describe your changes. -->
Add dml registration for bitwise and, or, xor and not added in opset 18.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
---------
Co-authored-by: Linnea May <linneamay@microsoft.com>
### Description
Add maybe_unused attribute to variables that are only used for logging
### Motivation and Context
Building ORT with training using Xcode 14.3 causes`
-Wunused-but-set-variable` error as some variables are created and
exclusively used for debug logging. Adding maybe_unused suppresses
warnings on unused variables when logging is disabled and fixes the
local build.
### Description
1. Set gtest output while ctest is set to empty.
2. onnx_src in _deps shouldn't be removed because
onnx_test_pytorch_converted and onnx_test_pytorch_converted need to read
data from onnx/backend/test/data/..
### Motivation and Context
Test result report is important to find the flaky tests.
### To do
Tests are not inconsistent.
If ctest_path is empty, onnx_test_pytorch_converted and
onnx_test_pytorch_converted will not be executed, if it's not,
onnxruntime_mlas_test will not be executed.
270c09a37f/tools/ci_build/build.py (L1743-L1753)
### Description
After this PR there are following pool need to be updated.
old|new|note
---|---|---
onnxruntime-Win2019-GPU-dml-A10|tbd|
onnxruntime-Win2019-GPU-T4|onnxruntime-Win2022-GPU-T4|
onnxruntime-Win2019-GPU-training-T4|onnxruntime-Win2022-GPU-T4|ame as
the above because we do not have many T4 GPUs
onnxruntime-tensorrt8-winbuild-T4|tbd|
aiinfra-dml-winbuild|tbd|
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
This reduces peak nonlocal memory consumption when uploading large
weights for big models (e.g. LLMs), while at the same time trying to
keep the GPU as busy as possible. This change could be more
sophisticated, but at this stage it is the most minimal and least risky
change required to support LLMs.
### Description
In some scenarios, the triton written kernels are more performant than
CK or other handwritten kernels, so we implement a framework that
onnxruntime can use these triton written kernels.
This PR is to integrate triton into ort, so that ort can use kernels
that written and compiled by triton.
The main change focus on two part:
1. a build part to compile triton written kernel and combine these
kernels into libonnxruntime_providers_rocm.so
2. a loader and launcher in c++, for loading and launch triton written
kernels.
#### Build
To compile triton written kernel, add a script
`tools/ci_build/compile_triton.py`. This script will dynamic load all
kernel files, compile them, and generate `triton_kernel_infos.a` and
`triton_kernel_infos.h`.
`triton_kernel_infos.a` contains all compiled kernel instructions, this
file will be combined into libonnxruntime_providers_rocm.so, using
--whole-archive flag.
`triton_kernel_infos.h` defines a const array that contains all the
metadata for each compiled kernel. These metadata will be used for load
and launch. So this header file is included by 'triton_kernel.cu' which
defines load and launch functions.
Add a build flag in build.py and CMakeList.txt, when building rocm
provider, it will call triton_kernel build command, and generate all
necessary files.
#### C++ Load and Launch
On c++ part, we implement load and launch functions in triton_kernel.cu
and triton_kernel.h.
These two files located in `providers/cuda`, and when compiling rocm,
they will be hipified. so this part supports both cuda and rocm. But
currently we only call triton kernel in rocm.
We also implement a softmax triton op for example. Because there will
generate many kernels for different input shape of softmax, we use
TunableOp to select the best one.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Register Split18 for DirectML
Split13 was previously implemented. Split18 adds a new attribute called
"num_outputs" that must be used mutually exclusively with the "split"
input.
The "num_outputs" attribute wil split the tensor evenly (and handles odd
uneven splits). To implement, the DML split tensor just needs to be
overridden in the presence of the num_output attribute.
---------
Co-authored-by: Dwayne Robinson <dwayner@microsoft.com>
### Description
<!-- Describe your changes. -->
Old pool | New pool | Notes
-- | -- | --
onnxruntime-Win-CPU-2019 | onnxruntime-Win-CPU-2022 |
onnxruntime-Win2019-CPU-training | onnxruntime-Win2022-CPU-training-AMD
|
onnxruntime-Win2019-CPU-training-AMD |
onnxruntime-Win2022-CPU-training-AMD | Same as the above
onnxruntime-Win2019-GPU-dml-A10 | Need be created | You need to create a
new image for it first
onnxruntime-Win2019-GPU-T4 | onnxruntime-Win2022-GPU-T4 |
onnxruntime-Win2019-GPU-training-T4 | onnxruntime-Win2022-GPU-T4 | Same
as the above because we do not have many T4 GPUs
onnxruntime-tensorrt8-winbuild-T4| TBD|TBD
Win-CPU-2021|onnxruntime-Win-CPU-2022| will do it in next PR
Win-CPU-2019|onnxruntime-Win2022-Intel-CPU'| Intel CPU needed for
win-ci-pipeline.yml -> `stage: x64_release_dnnl`
<br class="Apple-interchange-newline">
### Motivation and Context
With vs2022 we can take the advantage of 64bit compiler. It also with
better c++20 support
Set default value for parameters in nuget-zip pipeline, and only apply
the configurations when they are not "NONE".
---------
Co-authored-by: Randy Shuai <rashuai@microsoft.com>
### Description
<!-- Describe your changes. -->
Adds option to pass in pretrained weights file during T5 inference onnx
export. Mimics the changes made to whisper:
https://github.com/microsoft/onnxruntime/pull/15759
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Required for ONNX Runtime demo being presented at BUILD.
ROCm CI batch size test occasionally fail. Try reduce batch size to fix
it.
error log:
Non-zero status code returned while running FusedMatMul node.
Name:'MatMul_2914_Grad/FusedMatMul_0' Status Message: HIP error
hipErrorNotFound:named symbol not found
Non-zero status code returned while running Gemm node.
Name:'MatMul_2891_Grad/Gemm_5' Status Message: HIP error
hipErrorNotFound:named symbol not found
### Description
Added MaxPool tests to show the issues with MaxPool and also provide
test coverage
The following tests are currently Failing:
./onnxruntime_test_all --gtest_filter=*.TestMaxPool*
[ FAILED ] 5 tests, listed below:
[ FAILED ] QnnCPUBackendTests.TestMaxPool_Ceil
[ FAILED ] QnnCPUBackendTests.TestMaxPool_Large_Input2_Ceil
[ FAILED ] QnnHTPBackendTests.TestMaxPool_Large_Input_HTP_u8
[ FAILED ] QnnHTPBackendTests.TestMaxPool_Large_Input2_HTP_u8
[ FAILED ] QnnHTPBackendTests.TestMaxPool_Large_Input2_Ceil_HTP_u8
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
Provide test coverage for MaxPool and debug model related issues.
This PR mainly fixes building errors when trying to build nupkg for ROCm EP.
It also slighly improve the packaging logic so that devlopers can
produce the nupkg on linux natively.
### Description
<!-- Describe your changes. -->
This should produced fused Resnet50.fp16.onnx
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
This PR enables Whisper's multitask format and allows a user to use
Whisper for multiple tasks (e.g. transcription, translation) and for
multilingual purposes (e.g. English, Spanish). This PR also removes
`attention_mask` as a required input for Whisper with beam search.
### Usage
Here is an example of how you can use Whisper for English transcription.
```
import numpy as np
import onnxruntime as ort
from datasets import load_dataset
from transformers import AutoConfig, AutoProcessor
model = "openai/whisper-tiny"
config = AutoConfig.from_pretrained(model)
processor = AutoProcessor.from_pretrained(model)
forced_decoder_ids = processor.get_decoder_prompt_ids(language="english", task="transcribe")
# forced_decoder_ids is of the format [(1, 50259), (2, 50359), (3, 50363)] and needs to be
# of the format [50258, 50259, 50359, 50363] where 50258 is the start token id
forced_decoder_ids = [config.decoder_start_token_id] + list(map(lambda token: token[1], forced_decoder_ids))
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
input_features = processor(ds[0]["audio"]["array"], return_tensors="np").input_features
inputs = {
"input_features": np.float32(input_features),
"max_length": np.array([26], dtype=np.int32),
"min_length": np.array([1], dtype=np.int32),
"num_beams": np.array([2], dtype=np.int32),
"num_return_sequences": np.array([1], dtype=np.int32),
"length_penalty": np.array([1.0], dtype=np.float32),
"repetition_penalty": np.array([1.0], dtype=np.float32),
"decoder_input_ids": np.array([forced_decoder_ids], dtype=np.int32),
}
sess = ort.InferenceSession("whisper-tiny_beamsearch.onnx", providers=["CPUExecutionProvider"])
outputs = sess.run(None, inputs)
# Print tokens and decoded output
print(outputs[0][0][0])
print(processor.decode(outputs[0][0][0]))
```
If you don't want to provide specific decoder input ids or you want
Whisper to predict the output language and task, you can set
`forced_decoder_ids = [config.decoder_start_token_id]` instead.
### Motivation and Context
As seen in the figure below from the [OpenAI Whisper
paper](https://cdn.openai.com/papers/whisper.pdf), Whisper can be used
for multiple tasks and languages.

### Description
<!-- Describe your changes. -->
V100, b_4_s_128, max_output_len=64, beam=4
before:
t5_small: 101.28ms
t5_base: 200.07ms
after:
t5_small: 87.65ms
t5_base: 174.44ms
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
---------
Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
### Description
To prevent people from accidentally changing OrtApiBase, these
static_asserts will fire if anyone adds a method, rearranges the method
ordering, or changes the signature of any of the methods.
### Motivation and Context
People have submitted changes that have done these things and there was
no mechanism to stop them besides someone noticing in a code review. An
automated way to let people know is needed.
Register CPU OptionalGetElement, OptionalHasElement on DirectML
Graphs with OptionalGetElement and OptionalHasElement should work in a
DML graph without extra memcpy operation on and off the GPU.
CopyCpuTensor is swapped with DataTransferManager.CopyTensor() to make
the CPU operator usable by other providers.
---------
Co-authored-by: Dwayne Robinson <dwayner@microsoft.com>
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Scope down a unit test case where the condition should only apply when:
1. Two streams, one GPU one CPU;
2. If node is on CPU;
3. There is a wait step before the If node.
Co-authored-by: Randy Shuai <rashuai@microsoft.com>
### Description
Updates the default QNN SDK version to 2.10 for the QNN NuGet pipeline.
### Motivation and Context
Ensures that the daily QNN NuGet pipeline builds ORT using the latest
QNN SDK by default.
### Description
<!-- Describe your changes. -->
change the EP device to default OrtDevice() for memoryType equals
CPUInput for cuda, rocm, migraph
x and tensorRT EP
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
My previous PR (https://github.com/microsoft/onnxruntime/pull/15618)
caused random failures on cuda training test
GradientCheckerTest.TileGrad (see build
https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=986784&view=logs&j=5076e696-f193-5f12-2d8a-703dda41a79b&t=a3824a7c-2162-5e3d-3fdd-8cf808834fbb)
and rocm test:
root@a59558217e53:/workspace# pytest
orttraining/orttraining/test/python/orttraining_test_ortmodule_api.py::test_gradient_correctness_minmax
...
E RuntimeError: Error in backward pass execution: Non-zero status code
returned while running ATen node.
Name:'/_original_module/ATen_Grad/ATen_1' Status Message: Storage size
calculation overflowed with sizes=[72340172838076673, 72340172838076673,
128]
Potential reason is that if the memType of cuda/tensorRT/rocm/migraphx
EP is CPUInput, previously the corresponding device in the IAllocator's
memoryInfo is default OrtDevice(), while after my change, it becomes
OrtDevice(CPU, xx_PINNED, 0);
Changing it back fixed GradientCheckerTest.TileGrad in Win GPU training
build.
update ROCm/MIGraphX CI to ROC5.5.
TODO:
two PR to fix failure on
orttraining/orttraining/test/python/orttraining_test_ortmodule_api.py
-
test_gradient_correctness_minmax/test_gradient_correctness_argmax_unfold/test_gradient_correctness_argmax_diagonal
(https://github.com/microsoft/onnxruntime/pull/15903)
- test_ortmodule_attribute_name_collision_warning
(https://github.com/microsoft/onnxruntime/pull/15884)
### Description
Change BeamHypotheses to not use a stl::priority_queue and instead all
BeamHypotheses use a single buffer that they each get a small slice of.
As the beam count is really small (typically 4,8, max of 32) and the
array size fixed, the BeamHypotheses just does a sorted insert into an
array.
This also allows for the BeamHypotheses inside of the BeamSearchScorer
to be a single fixed allocation vs an onnxruntime::FastAllocVector.
### Motivation and Context
The goal is to simplify the memory usage and make the code more easily
ported to CUDA.
### Description
This PR partially reverts changes introduced in
https://github.com/microsoft/onnxruntime/pull/15643
We make two API return std::string always in UTF-8.
We also move the entry points from OrtApiBase to OrtApi to make them
versioned.
### Motivation and Context
`GetVersionString` always returns x.y.z numbers that are not subject to
internationalization.
`GetBuildInfoString` can hold international chars, but UTF-8 should be
fine to contain those.
We prefix them with u8"" in case the compiler default charset is not
UTF-8.
Furthermore, creating platform dependent APIs is discouraged.
`ORTCHAR_T` is platform dependent and was created for paths only.
On non-unix platforms would still produce `std::string` that can only
contain UTF-8
The API was introduced after the latest release, and can still be
adjusted.
Remove serialization for execution plans, will follow up with another PR
along with proper unit tests.
Co-authored-by: Randy Shuai <rashuai@microsoft.com>
### Description
<!-- Describe your changes. -->
Use `__name__` detection in `optimize_pipeline.py`.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
It prevents unwanted execution of `main` when importing the file.
ConstantOfShape TypeTests were previously broken due to a bug where the
case for the uint64 test was being passed an int64_data_size. Changing
the data type to uint64_data_size fixes the bug.
TensorProto Int8 and Int16 tests are reenabled since they are now
passing.
### Description
Remove attention_mask from unnecessary code paths in the whisper export
process.
### Motivation and Context
Current export script frequently hits OOM error when export
whisper-large. Memory profiling shows that this is a result of
generating dummy inputs for the `encoder_attention_mask` input for a
model pass during exporting - in whisper-large, this dummy tensor can be
around 20GB in size.
`encoder_attention_mask` is ultimately a dummy input - it's just there
to satisfy certain BeamSearch requirements. Thus, we're currently
creating a 20GB tensor and passing it to the model, which then discards
the input anyways. By removing the code path to generate a dummy
encoder_mask tensor, we can reduce the memory requirements to export
whisper substantially, while keeping the BeamSearch checks satisfied.
---------
Co-authored-by: Peter McAughan <petermca@microsoft.com>
ImageScalar is an experimental operator added in ONNX 1.2.1 and removed
in ONNX 1.5 so it's no longer in use.
Changing the comment to won't fix.
---------
Co-authored-by: Dwayne Robinson <dwayner@microsoft.com>