Commit graph

8141 commits

Author SHA1 Message Date
Nat Kershaw (MSFT)
9a9d45fefb
Add instructions for previewing docs changes (#12528) 2023-02-09 16:25:46 -08:00
Baiju Meswani
94bc0fe029
Skip all training opset model tests (#14636) 2023-02-09 14:56:50 -08:00
Ryan Hill
02bba3e268
Switch to a static local variable to avoid global constexpr warning (#14638)
### Description
Switch to a static local variable to fix the warning

Comments in the code so it's clear that it's intentional.

### Motivation and Context
Prefast warning: [prefast:Warning]: C26426 (in
'onnxruntime::cuda::`dynamic initializer for 'castOpTypeConstraints''')
Global initializer calls a non-constexpr function
'onnxruntime::DataTypeImpl::GetTensorType<onnxruntime::MLFloat16>'
(i.22).
2023-02-09 14:25:02 -08:00
Justin Stoecker
23f0e44265
Fix SAL annotation in private DML EP interface (#14639)
In #14461 I added a private interface to MLOperatorAuthorPrivate.h to
pipe ORT node names through to the debug name of DML operators/graphs.
The wrong SAL annotation was used on the `Get*Name` methods, which
confused static analysis tools into thinking there is a potential buffer
overrun.
2023-02-09 10:27:20 -08:00
JiCheng
c5b485d25f
[prefast:Warning]: C26451 (#14628)
2023-02-09 17:04:52 +08:00
PeixuanZuo
b53038b6a0
Fix softmax block forward with small element size (#14475)
### Description
1. ALIGN_BYTES was previously fixed at 16 because float4 is used for
vectorization by default. This PR computes ALIGN_BYTES from the vectorize
size.
2. Fix wrong data access when using a small element size (e.g., 1, 33).
Small cases may be used for SoftmaxTunableOp.
3. Fix the bug that data may be written first and then read in the
BlockReduce function on the ROCm EP. There is also a slight performance
improvement because all threads in warp-0 work.

BlockReduce method before this PR:
One block has N(warps_per_block) warps, one warp has M(WARP_SIZE)
threads.
step1. All the threads in one block read data into shared memory.
step2. Reduce all data to the first warp. Only the first N threads of
warp-0 are used. thread-0 computes data in warp-0 and writes the result
into the location of data0, thread-1 computes data in warp-1 and writes
the result into the location of data1.
__syncwarp(mask) is necessary here to make sure thread-1,...N will delay
writing data into warp-0 until thread-0 has finished reading data from
warp-0.
step3. Thread-0 reduces all valid data (only the first N data) in warp-0
and writes the result into the location of data0, then returns data0.

Issue: ROCm doesn't support __syncwarp() yet, so we need another
implementation to make sure reads happen before writes in warp-0.

BlockReduce method in this PR:
step2. Reduce all data to the first warp. Only the threads of warp-0 are
used. Each thread in warp-0 reads data from the same location of every
warp and computes the result. For example, thread-0 combines the first
element of every warp and writes the result into the location of data0.
step3. Thread-0 reduces all data in warp-0 and writes the result into
the location of data0, then returns data0.
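
The column-wise reduction described in step2 and step3 can be modelled sequentially in Python. This is only a sketch of the access pattern, not the kernel itself: `block_reduce` and `lane_footprint` are hypothetical names, shared memory is assumed to be a flat list of `warps_per_block * warp_size` elements, and the lanes run one after another instead of concurrently.

```python
from functools import reduce
import operator


def lane_footprint(lane, warps_per_block, warp_size):
    """Shared-memory indices that lane `lane` of warp-0 reads or writes in step2."""
    return {warp * warp_size + lane for warp in range(warps_per_block)}


def block_reduce(shared, warps_per_block, warp_size, op=operator.add):
    """Sequential model of the column-wise BlockReduce (assumed flat layout)."""
    assert len(shared) == warps_per_block * warp_size
    shared = list(shared)  # work on a copy, like a private shared-memory buffer

    # step2: lane j of warp-0 combines element j of every warp and
    # writes the result back to location j.
    for lane in range(warp_size):
        acc = shared[lane]
        for warp in range(1, warps_per_block):
            acc = op(acc, shared[warp * warp_size + lane])
        shared[lane] = acc

    # step3: thread-0 reduces the warp_size partial results held by warp-0.
    return reduce(op, shared[:warp_size])
```

Each lane only touches its own column (`lane_footprint` sets are pairwise disjoint), so no lane writes a location that another lane still has to read, which is why this scheme does not depend on `__syncwarp()`.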

Shared memory

![image](https://user-images.githubusercontent.com/94887879/216281207-8b332af5-bb9f-443a-8e2d-5d40c2231629.png)

Test: kernel explorer will use small element to test.
(https://github.com/microsoft/onnxruntime/pull/14541)
2023-02-09 13:55:21 +08:00
Wei-Sheng Chin
875a7791bf
[DORT] Update import path (#14605)
Follow up changes from
https://github.com/pytorch/pytorch/pull/93409/files for fixing DORT CI
failures.
2023-02-08 19:54:06 -08:00
Boyd Johnson
96b95a24ee
Add rust bindings (#12606)
This adds updated Rust bindings that have been located at
[nbigaouette/onnxruntime-rs](https://github.com/nbigaouette/onnxruntime-rs).

Check out the build instructions included in this PR at /rust/BUILD.md.

Changes to the bindings included in this PR:
- The bindings are generated with the build script on each build
- The onnxruntime shared library is built with ORT_RUST_STRATEGY=compile
which is now the default.
- A memory leak was fixed where free was never called
- Several small memory errors were fixed
- Session is Send but not Sync, Environment is Send + Sync
- Inputs and Outputs can be ndarray::Arrays of many different types.

Some commits can be squashed, if wanted, but were left unsquashed to
show differences between old bindings and new bindings.

This PR does not cover packaging nor does it include the Rust bindings
within the build system.

For those of you who have previous Rust code based on the bindings,
these new bindings
can be used as a `path` dependency or a `git` dependency (though I have
not tested this out).

The work addressed in this PR was discussed in #11992
2023-02-08 14:57:15 -08:00
Xavier Dupré
30ec8b038f
Test and fix optimizers LayerNormFusion, BiasSoftmaxFusion, Transpose for opset 18 (#14542)
### Description

Due to the changes introduced in opset 18 on Reduce operators (axes is
an input and not an attribute), the following optimizers are not
catching the pattern they are supposed to optimize. This PR addresses
that.

* layer_norm_fusion.cc: the optimizer was not detecting the pattern it
was supposed to optimize
* bias_softmax_fusion.cc: the optimizer was not detecting the pattern it
was supposed to optimize
* transpose_optimizer.cc: the optimizer was not optimizing Reduce
operators other than ReduceSum
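
To illustrate why a pattern matcher written for older opsets can miss these nodes, here is a minimal Python sketch. It is not the ONNX Runtime optimizer code: `Node` and `get_reduce_axes` are hypothetical names, and the example assumes an operator such as ReduceMean, whose `axes` is an attribute before opset 18 and an optional second input from opset 18 on.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a graph node (not an ONNX Runtime type)."""
    op_type: str
    inputs: list
    attributes: dict = field(default_factory=dict)


def get_reduce_axes(node, opset, initializers):
    """Return the reduction axes, or None if they are absent or not constant.

    Assumes an operator like ReduceMean: attribute before opset 18,
    optional second input from opset 18 on.
    """
    if opset >= 18:
        if len(node.inputs) < 2 or node.inputs[1] == "":
            return None  # axes input omitted
        # Only usable by a fusion if the axes come from a constant initializer.
        return initializers.get(node.inputs[1])
    return node.attributes.get("axes")
```

A matcher that only inspects `node.attributes` finds nothing on an opset 18 node and silently skips the fusion.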

### Motivation and Context
Better performance.

---------

Signed-off-by: xadupre <xadupre@microsoft.com>
2023-02-08 14:11:31 -08:00
Tianlei Wu
cfda876a3f
Remove torch package from requirements.txt of stable diffusion models (#14630)
### Description
Remove torch package from requirements to unblock nuget windowsai
pipeline which does not allow --extra-index-url

2023-02-08 12:18:17 -08:00
Kevin Chen
0a6b22018f
Move TRT include_directories to outside scope (#14622)
Signed-off-by: Kevin Chen <kevinch@nvidia.com>

### Description
Previously `include_directories(${TENSORRT_INCLUDE_DIR})` was only done
if `onnxruntime_USE_TENSORRT_BUILTIN_PARSER` was false. This would cause
a build failure when the switch was true as the include directory was
not added.

### Motivation and Context
Fixes TRT build when `onnxruntime_USE_TENSORRT_BUILTIN_PARSER` is true.

---------

Signed-off-by: Kevin Chen <kevinch@nvidia.com>
2023-02-08 10:19:55 -08:00
Dmitri Smirnov
767619cf3b
Rework C API to remove new/delete warnings (#14572)
### Description
Re-work code so it does not require GSL_SUPPRESS

### Motivation and Context
Do things right.
2023-02-08 10:05:53 -08:00
Alex Kogan
10ab252982
Enable parallel output reordering in MlasReorderOutputNchw() (#13643)
### Description
This PR speeds up the output reordering operation (as implemented in
[MlasReorderOutputNchw](9954454c65/onnxruntime/core/mlas/lib/reorder.cpp (L400)))
by replacing the sequential implementation with a parallelized one. The
parallelization is achieved through the use of the existing
[TryBatchParallelFor](9954454c65/include/onnxruntime/core/platform/threadpool.h (L284))
construct.



### Motivation and Context
The output reordering operation is frequently executed in image
processing models.
Its implementation can be easily parallelized and therefore sped up when
executed on a multi-core machine.
The amount of speedup achieved by this PR varies and depends on the
actual input.

The table below summarizes the results of some of the experiments I have
conducted on a 16-core VM running on an AMD EPYC 7742 64-core processor.
The experiment is based on the existing [unit
test](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/test/mlas/unittest/test_reorder_output.cpp)
for the output reordering operation. The first column represents the
shape of the output as BatchCount:Channels:Height:Width, and the numbers
in other columns represent the latency (in us, on average out of 100
runs) for the tested variants. Specifically, I compare the (sequential)
baseline (in second column) with the (parallelized) variants, each using
a number of worker threads equal to 1, 2, 4, 8 or 16 (as specified in
[the constructor to the threadpool
object](9954454c65/onnxruntime/test/mlas/unittest/test_main.cpp (L12))).
The numbers in () represent the speedup over the baseline.

| Input | baseline | 1 Thread | 2 Threads | 4 Threads | 8 Threads | 16 Threads |
| --- | --- | --- | --- | --- | --- | --- |
| 1:1:112:112 | 20.8 | 21.5 (x0.97) | 21.9 (x0.95) | 22.2 (x0.94) | 22.5 (x0.92) | 23.0 (x0.90) |
| 1:128:160:84 | 540.4 | 712.5 (x0.76) | 404.0 (x1.34) | 327.8 (x1.65) | 377.9 (x1.43) | 371.8 (x1.45) |
| 13:240:4:314 | 1484.0 | 1851.1 (x0.80) | 1080.9 (x1.37) | 570.2 (x2.60) | 531.8 (x2.79) | 511.2 (x2.90) |
| 13:96:4:314 | 471.0 | 679.9 (x0.69) | 427.2 (x1.10) | 372.1 (x1.27) | 445.5 (x1.06) | 428.5 (x1.10) |
| 1:64:320:168 | 1215.1 | 1497.8 (x0.81) | 863.8 (x1.41) | 456.7 (x2.66) | 435.7 (x2.79) | 462.5 (x2.63) |
| 30:240:4:140 | 1711.5 | 2181.4 (x0.78) | 1182.6 (x1.45) | 657.4 (x2.60) | 592.5 (x2.89) | 578.0 (x2.96) |
| 30:336:4:140 | 2432.5 | 3039.2 (x0.80) | 1695.6 (x1.43) | 920.7 (x2.64) | 817.1 (x2.98) | 819.2 (x2.97) |

The initial drop between the baseline and the variant using just one
worker thread can be attributed to the overhead of invoking the
reordering loop as a functor in TryBatchParallelFor. This overhead is
compensated by the speedup of parallel processing when the number of
worker threads is increased.
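
The idea of handing the reordering loop to a thread pool in batches can be sketched in Python. This is not the MLAS code: `partition` and `reorder_blocked_to_nchw` are hypothetical names, the source is assumed to use a channel-blocked layout `[N][C/block][H*W][block]`, and Python threads will not reproduce the speedups above; the sketch only shows how the work is split into independent chunks.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(total, batches):
    """Split range(total) into `batches` contiguous chunks of near-equal size."""
    base, extra = divmod(total, batches)
    start = 0
    for i in range(batches):
        size = base + (1 if i < extra else 0)
        yield start, start + size
        start += size


def reorder_blocked_to_nchw(src, n, c, hw, block, workers=4):
    """Copy an assumed [N][C/block][H*W][block] buffer into NCHW order."""
    assert c % block == 0
    dst = [None] * (n * c * hw)

    def work(bounds):
        lo, hi = bounds
        for task in range(lo, hi):  # one task = one (batch, channel) plane
            b, ch = divmod(task, c)
            blk, lane = divmod(ch, block)
            for p in range(hw):
                s = ((b * (c // block) + blk) * hw + p) * block + lane
                dst[task * hw + p] = src[s]

    # Each worker writes a disjoint slice of dst, so no locking is needed.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(work, partition(n * c, workers)))
    return dst
```

With a single worker the loop still pays the cost of being dispatched through the pool, which mirrors the one-thread slowdown reported in the table.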
2023-02-08 10:02:54 -08:00
Valery Chernov
ba8a00f62f
[TVM EP] Support zero copying TVM EP output tensor to ONNX Runtime output tensor (#12593)
**Description**:
Support the new TVM Virtual Machine feature (method `set_outputs`) on the TVM
Execution Provider side. It avoids excess copying from the TVM EP
output tensor to the ONNX Runtime one.

**Motivation and Context**
Tests with multiple-output topologies and big output tensors show that
there is overhead spent on copying from TVM EP to ONNX Runtime.
Returning output(s) on preallocated memory for VirtualMachine was
implemented on TVM side.

**Details**
`set_output_zero_copy` provider option for TVM EP switches on/off this
feature. It is true by default.
The feature works for both GraphExecutor and VirtualMachine from TVM.

---------

Co-authored-by: Valery Chernov <valery.chernov@deelvin.com>
2023-02-08 10:02:20 -08:00
Faith Xu
0b52a887b6
[Readme] Update table for build pipelines (#14618)
### Description
Update list of pipelines to remove obsolete pipelines and reformat.
Optional pipelines are not included except for Android and iOS.


![image](https://user-images.githubusercontent.com/20780999/217395702-f08f1252-e1aa-4fec-ac34-1c0b9859ec20.png)
2023-02-08 09:44:20 -08:00
Yi Zhang
8ed3dfe063
Revert "try VS 2022 in windowsAI pipeline (#14608)" (#14619)
This reverts commit f88a4646cd.

### Motivation and Context
For the release, the WinAI packaging pipeline's container image is reverted
to the old image, so we should revert VS to 2019.
2023-02-08 12:45:37 +08:00
Maximilian Müller
e9ab56fa64
Adding RunOptions synchronization behaviour to C/C++ API (#14088)
### Description
This exposes the already existing interface for asynchronous work of
all CUDA-based EPs (CUDA + TensorRT).


### Motivation and Context
This is something requested in #12216. It will enable users to build an
efficient data pipeline with ONNX Runtime and CUDA pre-/post-processing.
PCI traffic to the CUDA device can run during inference as soon as
the postprocessing has consumed the input buffer and it can be overwritten.
To do this, work has to be submitted asynchronously to the device. See the
screenshots below, captured with Nsight Systems, for an illustration.

Async: 
<img width="1401" alt="image"
src="https://user-images.githubusercontent.com/44298237/209894303-706460ed-cbdb-4be2-a2e4-0c111ec875dd.png">

Synchronous:
<img width="1302" alt="image"
src="https://user-images.githubusercontent.com/44298237/209894630-1ce40925-bbd5-470d-b888-46553ab75fb9.png">

Note the gap in between the 2 inference runs due to issuing PCI traffic
in between and to the CPU overhead the active synchronization has.

---------

Co-authored-by: Chi Lo <chi.lo@microsoft.com>
2023-02-07 19:59:28 -08:00
Hector Li
cd7098fdf4
fix snpe build (#14616)
### Description
Fix SNPE build issue caused by cmake dependency refactor

### Motivation and Context
Fix issue: https://github.com/microsoft/onnxruntime/pull/14547
2023-02-07 15:33:05 -08:00
Tang, Cheng
8f34c8c8ed
Introduce collective ops to ort inference build (#14399)
### Description
Introduce collective ops into onnxruntime inference build, including
1) AllReduce and AllGather schema in contrib op, controlled by USE_MPI
flag
2) AllReduce and AllGather kernel in cuda EP, controlled by ORT_USE_NCCL
flag


### Motivation and Context
Enable the collective ops in onnxruntime inference build so we have the
ability to run distributed inference with multiple GPUs.
The original ncclAllReduce ops in the training build require quite complex
configurations, which is not suitable for the inference case, and they are
already broken, so we introduce a new implementation.

---------

Co-authored-by: Cheng Tang <chenta@microsoft.com@orttrainingdev9.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
2023-02-07 13:47:48 -08:00
Ye Wang
b539c364ee
Some kernel changes for TULR (#14517)
### Description
1. fix a bug in relative position bias kernel where seq_len > 32
2. rename extra_add_qk to relative_position_bias
3. support relative_position_bias in multihead attention (B, N, S, S*)
4. gru_gate support by Lei



---------

Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
Co-authored-by: Lei Zhang <zhang.huanning@hotmail.com>
2023-02-07 11:51:06 -08:00
RandySheriffH
b6bec54341
Revert mimalloc from v2.0.9 to v2.0.3 (#14603)
Revert mimalloc from v2.0.9 to v2.0.3 to silence build error in
[post-merge
](https://aiinfra.visualstudio.com/Lotus/_build/results?buildId=273075&view=logs&j=f019f681-ae8f-5ee4-d119-02530df66a84&t=6c90c65c-2ab2-56af-633f-b5631256a8e1&l=351)
pipeline.
New dependency version was generated
[here](https://aiinfra.visualstudio.com/Lotus/_artifacts/feed/Lotus/UPack/onnxruntime_build_dependencies/overview/1.0.29).

Co-authored-by: Randy Shuai <rashuai@microsoft.com>
Co-authored-by: rui-ren <ruiren1225@gmail.com>
2023-02-07 09:58:25 -08:00
Jian Chen
585f43e31d
Remove Identical Children Consolidation from default transformer util. (#14602)

Co-authored-by: Scott McKay <skottmckay@gmail.com>
2023-02-07 09:22:30 -08:00
Yufeng Li
8de885fdb1
reduce cuda library binary size (#14555)
### Description
Reduce the cuda library size by:
1. refactoring beam_search_top_k to reduce template instantiation. It
saves ~56MB
2. opt out TopK for type uint*, int8_t and int16_t. It saves ~50MB.


2023-02-07 09:03:14 -08:00
Tianlei Wu
742658d171
Stable Diffusion CUDA optimizations Part 2 (#14597)
### Description
This is a follow-up of
https://github.com/microsoft/onnxruntime/pull/14428 for Stable Diffusion
CUDA optimizations:
(1) use NhwcConv to replace Conv in onnx graph and add Transpose nodes
accordingly
(2) reduce sequential Transpose nodes to at most one.
(3) symbolic shape infer of NhwcConv
(4) fix add bias transpose which causes CUDA error (launching more than
1024 threads per block) in inferencing fp32 model.
(5) add models (bert, bart, stable_diffusion subdirectories) to package;
(6) remove option --disable_channels_last

Note that 
(1) We can add a few graph transformations to reduce Transpose nodes
further. It is not done in this PR due to time limit.
(2) Stable diffusion 2.1 model outputs black images. It seems that
forcing Attention to float32 could avoid the issue. However, float32
Attention is much slower.

2023-02-07 07:49:15 -08:00
Yi Zhang
f88a4646cd
try VS 2022 in windowsAI pipeline (#14608)
### Description
update VS2019 to VS 2022 in
onnxruntime-Nuget-WindowsAI-Pipeline-Official


2023-02-07 17:53:53 +08:00
Chun-Wei Chen
cf8bad7f19
Fix CI failure: temporarily disable real model tests from onnx repo (#14606)
### Description
To unblock the pipeline failures globally as quickly as possible, disable
these real model tests from the onnx repo for now. Meanwhile, we are trying
to move these models to Azure.


### Motivation and Context
See https://github.com/onnx/onnx/issues/4857 for details: these models in
the onnx repo are broken. They were set up 4 years ago and the owner of
these AWS instances cannot be found.
2023-02-07 13:44:04 +08:00
ytaous
d632f9a3fa
[ROCm] Enable Sampling Op UT on AMD (#14581)
Making basic porting effort to run Sampling UT on ROCm ep, based on the
commits:

https://github.com/microsoft/onnxruntime/pull/13426
https://github.com/microsoft/onnxruntime/pull/14218

1. enabling EmbedLayerNorm op
2. enabling Sampling op
3. enabling helpers to copy data from CPU->GPU for subgraph

This task is the first checkpoint. There could be other missing ops when
testing a real model.
We will migrate more code onto ROCm as needed.

Co-authored-by: Ubuntu <ettao@ettao-amd-dev1.zvflicr54joexhdgnhvmxrxygg.phxx.internal.cloudapp.net>
2023-02-06 20:52:06 -08:00
dependabot[bot]
a5dab850b8
Bump jszip from 3.7.1 to 3.8.0 in /js/web (#14536) 2023-02-07 01:38:00 +00:00
Jian Chen
6f2dd10d52
IdentityBuilder should add Delimit for each input (#14592)
…("####") should append for each input_def, not only on the last one
else branch of this if should return ignore_identity

3d7518762a/onnxruntime/core/optimizer/identical_children_consolidation.cc (L66)
identity.append("####") should append for each input_def, not only on
the last one
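
A small Python sketch of why the delimiter has to follow every input. The function names are hypothetical and this is not the code in identical_children_consolidation.cc; it only assumes that the identity string is built by concatenating input names.

```python
DELIMITER = "####"


def identity_last_only(input_names):
    """Delimiter appended once, after the last input (the buggy variant)."""
    return "".join(input_names) + DELIMITER


def identity_per_input(input_names):
    """Delimiter appended after each input (the intended behaviour)."""
    return "".join(name + DELIMITER for name in input_names)
```

With a single trailing delimiter, the inputs `["ab", "c"]` and `["a", "bc"]` produce the same identity string `abc####`, so two different nodes would be treated as identical. Appending the delimiter per input keeps them apart.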
2023-02-06 15:36:42 -08:00
Patrice Vignola
b8fb9320ac
[DML EP] Fix ScatterElements registration (#14560) 2023-02-06 10:01:02 -08:00
cao lei
20684021da
do not use raw pointer for CpuBuffersInfo::buffers (#14574)
### Description
Do not use raw pointer for CpuBuffersInfo::buffers object



### Motivation and Context
This PR is to fix the bug 11159:
https://dev.azure.com/aiinfra/ONNX%20Runtime/_workitems/edit/11159/
2023-02-06 09:53:48 -08:00
PeixuanZuo
4bb95d7690
Change the return type of softmax function to Status (#14559)
### Description
Change the return type of the Softmax
functions (`dispatch_warpwise_softmax_forward` and
`dispatch_blockwise_softmax_forward`) from `void` to `Status`.

### Motivation and Context
The Softmax functions call TunableOp, which returns `Status`. It is necessary
to pass the `Status` from the inner function to the outer function.
2023-02-06 13:40:26 +08:00
Vincent Wang
3d7518762a
[ORTModule] ATen Support for upsample_bilinear (#14519)
It's required by model MobileViT.
2023-02-04 15:20:18 +08:00
Ted Themistokleous
c1a0fc55e7
[ROCm][MIGraphX EP]Add back in support for gfx1030 (#14565)
Adds back in proper build support for the Navi gen cards (gfx1030) 

Co-authored-by: Ted Themistokleous <tthemist@amd.com>
2023-02-04 11:35:45 +08:00
Ye Wang
999e5bf45e
Add SLN support for t5 model with beam search (#14429)

---------

Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
2023-02-03 11:38:18 -08:00
Nat Kershaw (MSFT)
638f21b969
Upgrade doxygen to fix C API docs build issue (#13950) 2023-02-03 09:43:29 -08:00
Baiju Meswani
a5eb616819
Enable ability to control whether or not to quantize the bias (#14549) 2023-02-03 09:01:30 -08:00
pengwa
7eca42484c
link mpi when either use_mpi or use_nccl enabled (#14467)
### Only link mpi when either use_mpi or use_nccl enabled

To fix the issue https://github.com/microsoft/onnxruntime/issues/14278. 

Talked with @askhade; we think that if users want to enable NCCL/MPI but MPI
is not found, it should be a failure instead of a warning.
So this PR made the change. As a result, to make CIs pass, we would need to
disable NCCL/MPI explicitly in the build command. This PR takes an
alternative approach: since NCCL and MPI are not used by
customers, disable NCCL by default if "--disable_nccl" is not specified, and
disable MPI by default if "--use_mpi" is not specified.

2023-02-03 20:11:50 +08:00
pengwa
c6c11039d7
Fix sharing scalar bug (#14544)
If an initializer is used as a graph output, we should keep its name
instead of renaming it, as the constant sharing transformer currently does.

To fix https://github.com/microsoft/onnxruntime/issues/14488
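
The rule can be sketched in Python as follows. This is an illustration under stated assumptions, not the transformer's implementation: `share_scalar_initializers` and the `shared_scalar_N` naming are hypothetical, and initializers are modelled as a plain name-to-value mapping.

```python
def share_scalar_initializers(initializers, graph_outputs):
    """Map each initializer name to the name it should have after sharing.

    Initializers that are also graph outputs keep their original name,
    because callers look outputs up by name.
    """
    shared = {}  # scalar value -> shared name
    rename = {}
    for name, value in initializers.items():
        if name in graph_outputs:
            rename[name] = name  # never rename a graph output
            continue
        rename[name] = shared.setdefault(value, f"shared_scalar_{len(shared)}")
    return rename
```

For example, with initializers `a`, `b` and `out` all equal to `1.0` and `out` listed as a graph output, `a` and `b` are merged while `out` is left untouched.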
2023-02-03 16:59:11 +08:00
Tianlei Wu
a6c5ba0185
Stable Diffusion CUDA Optimizations (#14428)
### Description

Add stable diffusion CUDA kernel optimizations.

The following are included:
(1) GroupNorm operator. This kernel is from TensorRT 8.5.
(2) BiasSplitGelu operator. This kernel is modified from SplitGelu of
TensorRT 8.5. We added bias to the SplitGelu.
(3) NhwcConv operator. This adds support of NHWC format (ONNX Conv
operator uses NCHW format).
(4) Update MultiHeadAttention (packed kv and no bias) for cross
attention. This could avoid transpose of kv for TRT fused cross
attention kernel.
(5) Optimization and benchmark script

Not included:
(1) Script to convert Conv to NhwcConv in onnx graph.
(2) Update symbolic shape inference for NhwcConv.
(3) Add SeqLen2Spatial operator
(4) Documents

Limitations: GroupNorm, BiasSplitGelu and NhwcConv kernels are
implemented based on stable diffusion usage. They might not be
applicable to any input size or dimensions. For example, BiasSplitGelu
requires hidden size to be 2560 | 5120 | 10240, and NhwcConv assumes 4D
input/weight.

There is a minor increase in binary size. For SM=75 only, the python
package wheel size grows by (33757K - 33640K) = 117 KB. It is possible to
move NHWC from template parameter to constructor to reduce binary size
(at a slight cost in performance).

Note: for RTX 4090/4080/4070 Ti, build with CUDA 11.8 and the latest
cuDNN to get the best performance.
2023-02-02 23:43:51 -08:00
PeixuanZuo
1059cf6d98
[ROCm] Fix ROCm build issue caused by REMOVE_ITEM incorrect path (#14534)
### Description
Fix the REMOVE_ITEM call that was not working.

`onnxruntime/contrib_ops/rocm/aten_ops/aten_op.cc` is hipified from
`onnxruntime/contrib_ops/cuda/aten_ops/aten_op.cc`.
The file's correct path is
`${CMAKE_CURRENT_BINARY_DIR}/amdgpu/onnxruntime/contrib_ops/rocm/aten_ops/aten_op.cc`
and it exists in the hipified source file list
`onnxruntime_rocm_generated_contrib_ops_cc_srcs`.

A better way to fix it: if we don't want to build a file, add it to
the hipify excluded files so that it is not hipified.
2023-02-03 13:34:59 +08:00
dependabot[bot]
7b75ebdb31
Bump http-cache-semantics from 4.1.0 to 4.1.1 in /js/web (#14535) 2023-02-03 03:16:37 +00:00
JiCheng
78280bf565
fix build err inbuild with minimal_build conjuncting disable_exceptions flags (#14524)
### Description
If we set the flag 'disable_exceptions' to build ORT,
`onnxruntime/contrib_ops/cpu/quantization/qlinear_global_average_pool.cc.o`
doesn't generate the following symbols, which are used by qlinear_pool.c:
```
0000000000000000 W _ZN11onnxruntime7contrib27ComputeQLinearGlobalAvgPoolIaEENS_6common6StatusEPKT_fS4_PS4_fS4_lllbPNS_11concurrency10ThreadPoolE
0000000000000000 W _ZN11onnxruntime7contrib27ComputeQLinearGlobalAvgPoolIhEENS_6common6StatusEPKT_fS4_PS4_fS4_lllbPNS_11concurrency10ThreadPoolE
```
so we get an undefined-symbol error for
ComputeQLinearGlobalAvgPool<uint8_t> and
ComputeQLinearGlobalAvgPool<int8_t>.


2023-02-03 11:01:20 +08:00
Pavel Grunt
20b5a75cfa
Add missing semicolon (#14143)
Fix compilation issue when DISABLE_SPARSE_TENSORS is defined

### Description
There is a missing semicolon when DISABLE_SPARSE_TENSORS is defined.


### Motivation and Context
Avoid a compilation failure when cmake option
`onnxruntime_DISABLE_SPARSE_TENSORS` is turned on
2023-02-02 14:08:33 -08:00
Scott McKay
549cbc7e69
Fix issue with schema lookup where there are custom ops using the ONNX domain (#14492)
### Description
Fix issue with schema lookup where there are custom ops using the ONNX
domain.

Update testing infrastructure to use an explicit domain for custom ops.
Using an empty string clashes with the ONNX domain and can cause
unexpected issues. It's also a bad example for external users as our
docs point to the unit tests.

Fix a couple of places using exact matching of the node since version to
be slightly more flexible and use a range (which aligns with how the
kernel lookup works).
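
The difference between exact and range matching of the since version can be shown with a short Python sketch. Both function names are hypothetical and the sketch does not reproduce the actual lookup code; it only assumes a version range with an optional upper bound.

```python
def matches_exact(node_since_version, expected_version):
    """Strict check: only one since version is accepted."""
    return node_since_version == expected_version


def matches_range(node_since_version, start_version, end_version=None):
    """Range check: accept any since version in [start, end], open-ended if end is None."""
    if node_since_version < start_version:
        return False
    return end_version is None or node_since_version <= end_version
```

A node whose schema was bumped from version 1 to version 2 without a behaviour change fails `matches_exact(2, 1)` but passes `matches_range(2, 1)`.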

### Motivation and Context
Fixes a problem that came up when adding support for standalone custom
ops in an ORT format model. Separating these changes out to simplify
review.
2023-02-03 08:05:18 +10:00
Yulong Wang
cfb6e528c8
[js/web] remove 'module' field from package.json (#14532)
### Description
This is a workaround for
[#14529](https://github.com/microsoft/onnxruntime/issues/14504) when
consuming onnxruntime-web as an ES module.
2023-02-02 13:46:57 -08:00
Justin Stoecker
03cfb7d73e
Use ORT node names in DML graphs/ops (#14461)
### Description
Applies ORT node names to corresponding compiled operators or DML graph
nodes.

### Motivation and Context
This makes it easier to correlate ONNX nodes to events in PIX GPU
captures when using the DML EP. Names set in the DML graph nodes require
additional modifications to the DML runtime library (available in a
future NuGet package).
2023-02-02 13:42:15 -08:00
Xavier Dupré
0bcca7ad45
Fix Gather to Split optimizer (#14478)
### Description
The Gather to Split optimizer fails if opset == 18. This PR fixes one bug
and extends the unit tests.



### Motivation and Context
The model produced by the optimizer does not follow onnx specifications
with opset 18.
2023-02-02 13:29:44 -08:00
Baiju Meswani
3d8fa4d77b
GetTrainingApi to not print to stderr when not an ort training build (#14515) 2023-02-02 13:28:32 -08:00
Baiju Meswani
68a402e739
Add support for python 3.10 for onnxruntime-training cuda and cpu (#14100) 2023-02-02 11:32:41 -08:00