Commit graph

10733 commits

Author SHA1 Message Date
Hector Li
d5c6a2cecf
Enable code in QNN UT to verify the fix for partition issue (#19939)
### Description
Enable code in QNN UT to verify the fix for partition issue relate to
QDQ model.
https://github.com/microsoft/onnxruntime/pull/19723
2024-03-15 17:02:01 -07:00
enximi
7b46b31558
fix: "UserWarning: Unsupported Windows version (11). ONNX Runtime sup… (#19845)
fix: "UserWarning: Unsupported Windows version (11). ONNX Runtime
supports Windows 10 and above, only."

### Description

Include Windows 11 in the version check. Now, you will not see the
warning “Unsupported Windows version (11). ONNX Runtime supports Windows
10 and above, only.”

### Motivation and Context

Warning on Windows 11: Only supports systems above Windows 10, which is
somewhat strange.
2024-03-15 12:41:44 -07:00
Yulong Wang
79e50aeef3
[js/web] rewrite backend resolve to allow multiple EPs (#19735)
### Description

This PR rewrite the backend resolve logic to support specifying multiple
EPs.

#### Backend

The first version of ONNX Runtime Web actually carried some existing
code from [ONNX.js](https://github.com/microsoft/onnxjs), which includes
the "backend" concept. The original "backend" in ONNX.js is designed in
a way assuming there is only one backend from user's backend hint list
will be used. For example, in ONNX.js, if user specify a backend hint as
`['webgl', 'wasm']`, ONNX.js will first try to use WebGL backend - if it
loads successfully (the browser supports webgl), then "webgl" backend
will be used and "wasm" will be ignored; otherwise, "webgl" will be
ignored and try to load "wasm" backend.

In short: only one backend will be used when initializing a session.

#### Execution Provider

Execution Provider, or EP, in ONNX Runtime is a different concept. One
of the differences is that users are allow to specify multiple EPs, and
if one does not support a particular kernel, it can fallback to other
EP. This is a very common case when using a GPU EP in ONNX Runtime.

#### Current Status: Backend v.s. EP

Because of the history reasons mentioned above, the current status is
quite confusing. There are **real backend**s, which means it's different
implementation in code; and there are **backend hint**s, which are used
as string names for backend hint; and there are **EP**s of the ONNX
Runtime concepts.

currently there are only 2 **backend**s in our code base: The "onnxjs
backend", and the "wasm backend". The "onnxjs backend" currently only
powers backend hint "webgl", which go into the old onnx.js code path.
All other backend hints including "wasm", "cpu"(alias to wasm), "webgpu"
and "webnn" are all powered by "wasm backend".

And because ORT Web treat "backend" as an internal concept and want to
align with ONNX Runtime, so those names of backend hints are becoming EP
names.

The following table shows today's status:

| Execution Provider Name (public) / Backend Hint (internal) | Backend |
EP in ORT
| -------- | ------- | ------- |
| "wasm"/"cpu" | WasmBackend | CPU EP
| "webgl" | OnnxjsBackend | \* technically not an EP
| "webgpu" | WasmBackend | JSEP
| "webnn" | WasmBackend | WebNN EP

#### Problem

While the API allows to specify multiple EPs, the backend resolving only
allows one backend. This causes issues when user specify multiple EP
names in session options, the backend resolve behavior and EP
registration behavior is inconsistent. Specifically, in this issue:
https://github.com/microsoft/onnxruntime/issues/15796#issuecomment-1925363908:

EP list `['webgpu', 'wasm']` on a browser without WebGPU support
resolves to 'wasm' backend, but the full EP list is passed in session
options, so JSEP is still enabled, causing the runtime error.


#### Solution

Since we still need WebGL backend, we cannot totally remove the backend
register/resolve system. In this PR I made the following changes:
- initialize every backend from the EP list, instead of only do that for
the first successful one.
- for the first resolved backend, filter all EP using the exact same
backend. Remove all EPs not using this backend from session options
- for every explicitly specified EP, if it's removed, show a warning
message in console
2024-03-15 11:47:45 -07:00
Yifan Li
0b2a75b274
[EP Perf] Add concurrency test (#19804)
### Description
<!-- Describe your changes. -->
* Add concurrency test to EP Perf CI panel (impl. by onnx_test_runner)
  * Model: FasterRCNN-10 model within CI image
  * `-c` param configurable via CI panel when kicking off CI tasks
  * Auto-replicate test input/outputs according to `-c` param
* By default, the model test will be executed in 100 iterations (~2min
added to T4 CI task load overall)

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
To monitor potential concurrency issues of ORT-TRT
2024-03-15 07:41:21 -07:00
Hariharan Seshadri
42399dfd2b
Fix a potential race in the CUDA TopK kernel (#19917)
### Description
If the `K` value is flowing through as a tensor, we are updating a
mutable member of the `TopK` class and basing the compute off that -
which is likely to cause data race issues with concurrent Run() calls
and `K` value changes.


### Motivation and Context
Fix potential race in CUDA TopK kernel
2024-03-14 18:13:47 -07:00
Justin Chu
bcf47d3546
Update install_deps_lort.sh to fix onnxscript installation (#19922)
Install onnxscript correctly with `pip install`. Dev dependencies are
not required.

### Motivation and Context

Fix build breaks.
2024-03-14 17:05:50 -07:00
Adam Louly
32558134a9
[On-Device-Training] Upgrade Flatbuffers to Support 2GB+ Checkpoints. (#19770)
### Description
Modifications to support 2GB+ checkpoint & Upgrading Flatbuffers


### Motivation and Context
This PR includes changes that will make ort handle 2GB+ checkpoints.
To do that we need to upgrade flatbuffers to 23.5.9 -
https://github.com/google/flatbuffers/pull/7945

- Modified the commitHash and the hash for the new version
- Removed the patch for rust generator's unused variable warning as it
is no longer producing this - [Check it out
here](d121e09d89/src/idl_gen_rust.cpp)
- Updated the VerifyField calls with alignment values that were
introduced in the new version.

---------

Co-authored-by: Sumit Agarwal <sumitagarwal@microsoft.com>
2024-03-14 16:36:24 -07:00
Yi Zhang
87a9f77c56
Refactor Python Packaing Pipeline (Training Cuda 11.8) (#19910)
### Description
1. Use stage to organize the pipeline and split building and testing
2. Move compilation on CPU machine
3. test stage can leverage existing artifacts
4. check wheel size, it gives warning if the size above 300M
5. docker image name wasn't change even the argument changed, which
caused the docker image was always rebuilt. So update the docker image
name according to the argument can save the docker build time.

Pipeline duration reduced by 60% (2 hours ->  50 minutes)
Compilation time reduced by 75% (1.5hours -> 20 minutes)
GPU time reduced by 87% ( 8 hours to 1 hours)
for debugging, the GPU time could be reduced by above 95%, because we
can choose run only one test stage and skip building.

### Motivation and Context
Make the pipeline efficient.
Optimized

https://dev.azure.com/aiinfra/Lotus/_build/results?buildId=424177&view=results
Curent

https://dev.azure.com/aiinfra/Lotus/_build/results?buildId=422393&view=results

---------
2024-03-15 06:47:41 +08:00
Changming Sun
8b766bd24e
Change nuget pipeline's "Windows_Packaging_combined_GPU" job to download TRT binaries in every build (#19919)
### Description
Change nuget pipeline's "Final_Jar_Testing_Windows_GPU" job to download
TRT binaries in every build. Now all the other build jobs are already
doing this. This is the only one left.

Similar to #19909

### Motivation and Context

As a follow up of #19118
2024-03-14 15:07:56 -07:00
Tianlei Wu
a2ffc3740b
[Cuda] Demo multiple cuda graphs and user compute stream (#19883)
Update stable diffusion demo to add options `--max-cuda-graphs` and
`--user-compute-stream`.

* Add python class GpuBindingManager to manage IO Binding based on input
shape and max number of cuda graphs setting. The benefit is that one
inference session could enable or disable cuda graph in different runs.
* When `--user-compute-stream`, the demo will use custom compute stream.
2024-03-14 13:48:37 -07:00
Edward Chen
0b90363acb
[MLAS][AArch64] SQ4BitGemm CompInt8 multi-block implementation (#19826)
Update SQ4BitGemm CompInt8 implementation to process multiple blocks along a single column instead of processing single blocks from multiple columns.
2024-03-14 13:05:42 -07:00
Baiju Meswani
226f60f2f1
Add support for SGD optimizer in minimal build (#19901) 2024-03-14 11:31:20 -07:00
Changming Sun
1fb6cbddee
Add a build patch for Windows ARM64EC (#19898)
### Description
Add a patch for Windows ARM64EC


### Motivation and Context
Will need more changes in onnxruntime/core/common/cpuid_arch_definition.h and onnxruntime/core/common/cpuid_info.cc
2024-03-14 08:50:42 -07:00
Changming Sun
ea4a5eea18
Change nuget pipeline's "Final_Jar_Testing_Windows_GPU" job to download TRT binaries in every build (#19909)
### Description
Change nuget pipeline's "Final_Jar_Testing_Windows_GPU" job to download
TRT binaries in every build. Now all the other build jobs are already
doing this. This is the only one left.


### Motivation and Context

As a follow up of #19118
2024-03-14 07:55:00 -07:00
cao lei
966fa74597
Add 2 C API for ort extension (#19808)
### Description
<!-- Describe your changes. -->
Add 2 C API for ORT extension:
- KernelInfo_GetAllocator
- OrtCustomOp::GetMayInplace


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Add 2 C API for ORT extension project, which will leverage these 2 APIs
for GroupQueryAttention custom op.
2024-03-14 06:00:41 -07:00
pengwa
409b811325
Refine logging for execution plan print (#19777)
### Refine logging for execution plan print

Printing NodeIndex only is not enough for us to debug the execution
order.

keep original behaviour for ORT_MINIMAL_BUILD build in case of any CPU
memory concerns.



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-03-14 16:31:32 +08:00
Scott McKay
0be0791fcc
Update MAUI model tester tool to .net8 (#19907)
### Description
<!-- Describe your changes. -->
Update to .net8. Didn't want to build with the latest VS2022 using net6
(which was EOL last year).



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-03-14 15:19:19 +10:00
Jeff Daily
9443366009
[ROCm] fix build failure when nccl is enabled (#19900)
Building onnxruntime ROCm EP with --enable_nccl --use_mpi fails due to
inclusion of MOE source files but MOE is not supported. The error
observed is

`error: contrib_ops/rocm/moe/ft_moe/moe_kernel.h: No such file or
directory`

The fix is to exclude collective/sharded_moe.* files when nccl is
requested.
2024-03-13 21:16:54 -07:00
Adrian Lizarraga
9c3242ab70
[QNN EP] Copy security catalog file for HtpV73Skel.so from QNN SDK (#19903)
### Description
Copies the `QNN_HOME/lib/hexagon-v73/unsigned/libqnnhtpv73.cat` file
from QNN SDK to the unittest build directory. This is necessary in order
to be able to load the `libQnnHtpV73Skel.so` file on Windows for modern
versions of QNN SDK.

### Motivation and Context
A [digitally-signed catalog
file](https://learn.microsoft.com/en-us/windows-hardware/drivers/install/catalog-files)
(.cat) can be used as a digital signature for an arbitrary collection of
files.
2024-03-13 20:52:59 -07:00
cao lei
2c525a79b1
Add new API KernelContext_GetScratchBuffer (#19809)
### Description
<!-- Describe your changes. -->
add new API KernelContext_GetScratchBuffer to get scratch buffer from
kernel context


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
add new API KernelContext_GetScratchBuffer to get scratch buffer from
kernel context which will be used in ORT extension project for
GroupQueryAttention custom op
2024-03-13 19:41:15 -07:00
Jake Mathern
18ad8587a6
[CP] Fix for xfgcheck and Fix WAI ARM64 build (#19634) (#19644)
### Description
Fix WAI build by only conditionally copying linker flags



### Motivation and Context
I broke the WAI build that contains ORT on ARM64
2024-03-13 17:54:06 -07:00
Markus Tavenrath
f42e6ad61e
Add support for LRN NHWC OPs (#19866)
Support LRN NHWC in the CUDA EP.

### Motivation and Context
Add support for all NHWC OPs to avoid NHWC/NCHW Layout transformation
2024-03-13 17:52:07 -07:00
raoanag
9f08f8d5b2
Set seed for DynamicQuantizeMatMul tests (#19896)
Seed for DynamicQuantizeMatMul tests to avoid pipeline failures with marginal mismatches.
2024-03-13 17:49:55 -07:00
kunal-vaishnavi
4ac98d6d65
Update replacing MultiHeadAttention with GroupQueryAttention (#19882)
### Description
This PR updates the replacement of MultiHeadAttention (MHA) with
GroupQueryAttention (GQA). It is related to the changes in [this
PR](https://github.com/microsoft/onnxruntime/pull/18906).

### Motivation and Context
The updated replacement of MHA with GQA includes the following fusion
changes.
- Apply sliding window within GQA
- Fuse the rotary embeddings within GQA
- Fuse the 3 MatMuls into 1 packed MatMul if possible
- Fuse the 3 Adds into 1 packed Add if possible
2024-03-13 14:10:52 -07:00
aciddelgado
8eb49c5f00
fix gqa rotary dim 1 (#19874)
### Description
GQA Rotary Dimension 1 incorrectly assumed to be based on head size.



### Motivation and Context
This change should enable us to run phi-2 with GQA and Rotary Embedding
fused.
2024-03-13 14:09:54 -07:00
Yulong Wang
e771a763c3
[js/test] align web test runner flags with ort.env (#19790)
### Description
the `npm test` flags are difficult to memorize, because they are
different to the `ort.env` flags. This change makes those flags align
with ort JS API. eg. `--wasm-enable-proxy` became `--wasm.proxy`.

Old flags are marked as deprecated except `-x` (as a shortcut of
`--wasm.numThreads`)
2024-03-13 12:00:36 -07:00
Yi Zhang
d5d9dbd51d
reuse T4 on Linux GPU (#19879)
### Description

### Motivation and Context
Linux GPU test on A10 isn't very stable
2024-03-13 10:41:36 -07:00
Satya Kumar Jandhyala
ed250b88c3
[JS/WebGPU] Optimize MatMulNBits (#19852)
### Description
Use vec<2> or vec<4>, operands in MatMulNBits


### Motivation and Context
Improve performance
2024-03-13 10:33:14 -07:00
Hariharan Seshadri
ed306b4f97
Fix Android CI pipeline (#19877) 2024-03-13 10:09:43 -07:00
Justin Chu
faea42af95
Bump ruff to 0.3.2 and black to 24 (#19878)
### Motivation and Context

Routing updates
2024-03-13 10:00:32 -07:00
Yi Zhang
9e0a0f0f32
Check whether required tests are executed. (#19884)
### Description
Check the onnx node tests and model tests worked

### Motivation and Context
onnx node test data and model data are mount in one dir.
And onnxruntime_test_all search the dir and load the data.
If the dir does exist or there's some change in onnxruntime_test_all,
those tests may not be executed.
For example, all onnx node test data is 32M. It's hardly for us aware of
the regression.
So I add the simple check to ensure those tests are executed.

---------

Co-authored-by: Yi Zhang <your@email.com>
2024-03-13 09:59:57 -07:00
Yi Zhang
7313aa4efe
Remove --extra-index-url (#19885)
### Description
<!-- Describe your changes. -->



### Motivation and Context
--extra-index-url is not allowed by injected Secure Supply Chain Step in
packaging pipelines.
```
> Starting Multifeed Python Security Analysis:
##[warning]tools/ci_build/github/azure-pipelines/bigmodels-ci-pipeline.yml - Found "extra-index-url". (https://aka.ms/cfs/pypi)
```
And those 2 packages can be installed from PyPI as well now.

Co-authored-by: Yi Zhang <your@email.com>
2024-03-13 09:45:22 -07:00
Hector Li
60ad6c6409
Enable float32 model with FP16 precision for QNN HTP backend (#19863)
### Description
Enable float32 model with FP16 precision for QNN HTP backend
2024-03-13 08:35:21 -07:00
George Wu
6579f74af0
skip onnx node_tests for tensorrt ep (#19880)
fix build break caused by image update. tensorrt isn't expected to pass
all onnx node tests.
2024-03-12 23:35:05 -07:00
Yang Gu
53de2d8cb0
[js/webgpu] Enable GroupedConvVectorize path (#19791)
Vectorize met 2 failed cases in a CI bot with NVIDIA GPU, but we
couldn't repro with all the GPUs at hand, including NVIDIA GPUs. This PR
introduces GPUAdapterInfo and enables this opt on non-NVIDIA GPUs to
make the bots happy.
No obivous perf gain can be seen if we enable vectorize on NVIDIA.
However, it shows big perf improvement on Intel. On my Gen12 Intel GPU,
mobilenetv2-12 perf was improved from 11.14ms to 7.1ms.
2024-03-12 22:25:07 -07:00
Yulong Wang
4538d31a8b
[js/webgpu] expose a few properties in WebGPU API (#19857)
### Description
This change exposes a few properties in `ort.env.webgpu` to resolve
feature requirement mentioned in properties in
https://github.com/microsoft/onnxruntime/pull/14579#discussion_r1519612619.

- Add `powerPreference` and `forceFallbackAdapter` in `ort.env.webgpu`,
to allow users to set the value of the properties before the first
inference session is created.
- Add readonly property `adapter` in `ort.env.webgpu` to allow users to
get the adapter instance. Now users can access `ort.env.webgpu.device`
and `ort.env.webgpu.adapter`.

@xenova @beaufortfrancois
2024-03-12 19:50:51 -07:00
wejoncy
22ad629cf7
[bug fix] dequantize 4bit (#19793)
### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-03-12 18:27:46 -07:00
Edward Chen
860eb762c2
[Apple framework] Fix minimal build with training enabled. (#19858)
Fix some linker errors that come up when integrating the onnxruntime-training-c pod into another Xcode project. The problematic configuration is a minimal build with training APIs enabled.
- training_op_defs.o had some unresolved references to ONNX functions. It should not be included at all in a minimal build.
- tree_ensemble_helper.o also had unresolved references to ONNX ParseData. The containing function is unused in a minimal build.

Added a test to cover this configuration.
2024-03-12 11:33:30 -07:00
Adrian Lizarraga
00c3cd497e
[QDQ Quantization] Refactor shared functionality into a base quantizer (#19817)
### Description
This PR does not add or remove any functionality. It refactors common
functionality shared by the `ONNXQuantizer` and `QDQQuantizer` classes
into a new `BaseQuantizer` class.

This change helps decouple the QDQ quantizer from other quantization
modes and makes it easier to determine if a change to one quantization
mode will impact another.

### Motivation and Context
An upcoming PR aims to add mixed-precision support to QDQ models (e.g.,
one part of the graph uses u8 activations and another uses u16
activations). This change makes the upcoming PR smaller and should
presumably make determining the impact on existing features more
straightforward.
2024-03-12 10:47:09 -07:00
Ye Wang
7f0520cdf9
bug fix to multi-cudagraph (#19856)
### Description
<!-- Describe your changes. -->

run_count_before_capture_ is graph_id aware, fix the bug by adding a map
to retrieve the run_count_ for each graph_id.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-03-12 10:33:37 -07:00
zz002
319159b7bd
[VitisAI]set-data_loaction-as-default-when-load-external-data (#19712)
### Description
<!-- Describe your changes. -->

set-data_loaction-as-default-when-load-external-data
fix vitis ai ep can not get CutomOps by session_option register

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

VitisAI bug daily fixes
when use pass: fuse_qdq_GEMM or fuse_qdq_MATMUL, get error like : Error
Data of TensorProto ( tensor name: xxx) is stored externally and should
not have data field.raw_data

---------

Co-authored-by: Zhenze Wang <zhenzew@xilinx.com>
2024-03-12 10:27:14 -07:00
Bowen Bao
742595b885
Speedup Llama2 cpu throughput in bench by 1.69x with iobinding (#19853)
### Description
Always set `use_io_binding=True` when using optimum.onnxruntime unless
there is a special case.


### Motivation and Context
By default, `ORTModel` under optimum.onnxruntime will choose the
appropriate `use_io_binding` value based on provider and use cases.

>         use_io_binding (`Optional[bool]`, defaults to `None`):
> Whether to use IOBinding during inference to avoid memory copy between
the host and device, or between numpy/torch tensors and ONNX Runtime
ORTValue. Defaults to
> `True` if the execution provider is CUDAExecutionProvider. For
[~onnxruntime.ORTModelForCausalLM], defaults to `True` on
CPUExecutionProvider,
 >           in all other cases defaults to `False`.

For Llama token benchmark, using iobinding yields almost 2x speedup,
even on CPU. This is because this particular model yields a large number
of outputs (>60). Without iobinding, a copy is performed for each output
from ortvalue to numpy array. This adds significant overhead to the
overall run time.

```
Evaluating Llama2 `model(inputs)` step with past_key_values

Before, w/o iobinding on cpu

Batch Size: 1
Sequence Length: 512
Latency: 0.4518657898902893 s
Throughput: 2.2130464894073856 tps

After, w/ iobinding on cpu

Batch Size: 1
Sequence Length: 512
Latency: 0.2662619352340698 s
Throughput: 3.7557001871893703 tps
```
2024-03-12 09:41:11 -07:00
Yi Zhang
d4fa4f0276
Remove FFmpeg to meet compliance (#19859) 2024-03-12 09:06:59 -07:00
pengwa
3fb8905393
Fix torch cpp extension build warnings (#19842)
### Fix torch cpp extension build warnings

For the warnings shown as below: 

```
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
[4/5] c++ -MMD -MF /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/build/temp.linux-x86_64-cpython-38/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils/custom_function_bw.o.d -pthread -B /opt/conda/envs/ptca/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/ptca/include/python3.8 -c -c /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils/custom_function_bw.cc -o /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/build/temp.linux-x86_64-cpython-38/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils/custom_function_bw.o -O3 -std=c++17 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=torch_interop_utils -D_GLIBCXX_USE_CXX11_ABI=0
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
In file included from /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/utils/python_arg_parser.h:65,
                 from /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/utils/tensor_new.h:4,
                 from /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils/custom_function_bw.cc:9:
/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/utils/python_strings.h:104:19: warning: ‘pybind11::object PyObject_FastGetAttrString(PyObject*, const char*)’ defined but not used [-Wunused-function]
  104 | static py::object PyObject_FastGetAttrString(PyObject* obj, const char* name) {
      |                   ^~~~~~~~~~~~~~~~~~~~~~~~~~
[5/5] c++ -MMD -MF /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/build/temp.linux-x86_64-cpython-38/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils/custom_function_fw.o.d -pthread -B /opt/conda/envs/ptca/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/ptca/include/python3.8 -c -c /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils/custom_function_fw.cc -o /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/build/temp.linux-x86_64-cpython-38/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils/custom_function_fw.o -O3 -std=c++17 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=torch_interop_utils -D_GLIBCXX_USE_CXX11_ABI=0
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
In file included from /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/utils/python_arg_parser.h:65,
                 from /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/utils/tensor_new.h:4,
                 from /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils/custom_function_fw.cc:13:
/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/utils/python_strings.h:104:19: warning: ‘pybind11::object PyObject_FastGetAttrString(PyObject*, const char*)’ defined but not used [-Wunused-function]
  104 | static py::object PyObject_FastGetAttrString(PyObject* obj, const char* name) {
      |                   ^~~~~~~~~~~~~~~~~~~~~~~~~~
g++ -pthread -B /opt/conda/envs/ptca/compiler_compat -Wl,--sysroot=/ -pthread -shared -B /opt/conda/envs/ptca/compiler_compat -L/opt/conda/envs/ptca/lib -Wl,-rpath=/opt/conda/envs/ptca/lib -Wl,--no-as-needed -Wl,--sysroot=/ /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/build/temp.linux-x86_64-cpython-38/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils/ctx_pool.o /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/build/temp.linux-x86_64-cpython-38/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils/custom_function_bw.o /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/build/temp.linux-x86_64-cpython-38/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils/custom_function_fw.o /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/build/temp.linux-x86_64-cpython-38/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils/custom_function_shared.o /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/build/temp.linux-x86_64-cpython-38/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils/torch_interop_utils.o -L/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/lib -lc10 -ltorch -ltorch_cpu -ltorch_python -o build/lib.linux-x86_64-cpython-38/torch_interop_utils.cpython-38-x86_64-linux-gnu.so
Installing /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/build/lib.linux-x86_64-cpython-38/fused_ops.cpython-38-x86_64-linux-gnu.so -> /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/fused_ops.cpython-38-x86_64-linux-gnu.so
Installing /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/build/lib.linux-x86_64-cpython-38/aten_op_executor.cpython-38-x86_64-linux-gnu.so -> /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/aten_op_executor.cpython-38-x86_64-linux-gnu.so
Installing /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/build/lib.linux-x86_64-cpython-38/torch_gpu_allocator.cpython-38-x86_64-linux-gnu.so -> /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/torch_gpu_allocator.cpython-38-x86_64-linux-gnu.so
Installing /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/build/lib.linux-x86_64-cpython-38/torch_interop_utils.cpython-38-x86_64-linux-gnu.so -> /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/torch_interop_utils.cpython-38-x86_64-linux-gnu.so

```

Fix by replacing eixsting `PyObject_GetAttrString` with
`PyObject_FastGetAttrString` which claims to be faster in its
implementation comment.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-03-12 10:51:30 +08:00
pengwa
3e954da3e6
Fix and enable few ORTModule Unit Tests (#19847)
### Fix and enable few ORTModule Unit Tests

Fix 'test_bert_inputs_with_dynamic_shape' and
'test_bert_result_with_layerwise_recompute' generate Nan loss in ORT
run.

The root cause is, the logic to generatic attention mask test data is
not correct, only 0 or 1 is allowed in the dataset, but we see lots of
other numbers. ( The reason we don't have this using old version of
transformers for example v4.4.2 or 4.16.2 is because they don't contains
such
d3cb28886a,
which increase the scaling to a bigger number, causing a overflow to
inf)

Another improvement during the investigation using convergence tools:
Don't dump the activations during model export phase, otherwise, the
dumped data might contains some PyTorch run's result making us confused
during comparing with stock PyTorch run results.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-03-12 10:49:19 +08:00
Vincent Wang
0c078dfc8b
Some Shape Related Fusions (#19832)
This PR adds below shape related fusions, which is helpful for some
transformer models:
- ShapeInputMerge is to merge all Shape nodes' input NodeArg to a single
one (the 1st one on topo order) if they have the same shape value. This
helps CSE fusion to merge more nodes.
- CSE fusion to support scalar tensor as attribute value. This is mainly
to support ConstantOfShape node.
2024-03-12 10:29:27 +08:00
Scott McKay
978c40d853
Make partitioning utils QDQ aware so it does not break up QDQ node units (#19723)
### Description
<!-- Describe your changes. -->
If the EP handles QDQ node units, we need to make sure we do not split
those into different partitions.

Update the partitioning utils to be QDQ aware. If there are node units
we process the logical nodes they represent instead of individual nodes.
This ensure we process all nodes in a QDQ node unit at the same time so
that they are always in the same partition.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Fix one of the issues in #19590

---------

Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
2024-03-12 10:55:49 +10:00
Hector Li
cba605e845
Fix Clip op builder for FP16 support (#19825)
### Description
Fix Clip op builder for FP16 support.

### Motivation and Context
Enables mobilenet v2 FP16 model inference on HTP
2024-03-11 16:39:41 -07:00
raoanag
89aa4697b1
[DML] QAttention (#19766)
### Description
DML Implementation for
[com.microsoft.QAttention](https://github.com/microsoft/onnxruntime/blob/main/docs/ContribOperators.md#com.microsoft.QAttention)



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

---------

Co-authored-by: Xiang Zhang <xianz@microsoft.com>
2024-03-11 10:44:34 -07:00
Changming Sun
5479124834
Remove remaining Windows ARM32 build jobs (#19840)
### Description
As a follow up of #19788, remove more remaining Windows ARM32 build
jobs.


### Motivation and Context
Our nuget packaging pipeline is failing because it could not find an
artifact for Win ARM32.
```
##[error]Artifact onnxruntime-training-win-arm was not found for build 421397.
```

Deprecation of Win ARM32 was announced by Windows team in January 2023.
We should follow it.
2024-03-11 11:25:11 +08:00