Commit graph

9936 commits

Author SHA1 Message Date
Hector Li
55c19d6ab5
[QNN EP] Enable option to set QNN context priority (#18315)
### Description
Enable option qnn_context_priority to set QNN context priority, options:
"low", "normal", "normal_high", "high".

This feature lets model inference run with higher priority. Tested
with the onnxruntime_perf_test tool using the same model.
1. Run the model on the NPU with a single instance: the latency is 300ms.
2. Run the same model on the NPU with 2 instances at the same time.
   Case 1:
   both with the same priority (high) -- latency is 600ms
   Case 2:
   1 with low priority -- latency is 30,000ms
   1 with high priority -- latency is 300ms
   Case 3:
   1 with normal priority -- latency is 15,000ms
   1 with high priority -- latency is 300ms
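
A minimal sketch of how the option might be supplied, assuming it is passed as a QNN provider option next to `backend_path`. The helper name `qnn_provider_options` and the `QnnHtp.dll` path are illustrative assumptions and are not part of this change.

```python
# Allowed values, as listed in this commit message.
QNN_CONTEXT_PRIORITIES = ("low", "normal", "normal_high", "high")


def qnn_provider_options(priority: str, backend_path: str = "QnnHtp.dll") -> dict:
    """Build a provider-options dict for the QNN EP (hypothetical helper)."""
    if priority not in QNN_CONTEXT_PRIORITIES:
        raise ValueError(f"qnn_context_priority must be one of {QNN_CONTEXT_PRIORITIES}")
    return {"backend_path": backend_path, "qnn_context_priority": priority}


options = qnn_provider_options("high")
# With onnxruntime installed, these options would typically be passed as:
#   onnxruntime.InferenceSession(model_path,
#       providers=["QNNExecutionProvider"], provider_options=[options])
```

Validating the value up front keeps a typo such as `"hgih"` from silently reaching the backend.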
2023-11-08 20:56:36 -08:00
Prathik Rao
7a3da4526f
add bfloat16 support for CUDA Neg kernel (#18306)
### Description

Registers BFloat16 datatype as valid input type for CUDA Neg Kernel.

### Motivation and Context

Enabling `meta-llama/Llama-2-70b` to be finetuned with ONNX Runtime
training.

---------

Co-authored-by: Prathik Rao <prathikrao@microsoft.com@orttrainingdev8.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
2023-11-08 18:32:12 -08:00
guyang3532
4dc63692f8
Add FlattenAndUnpad Op (#17845)
### Description
Add an op named `FlattenAndUnpad`.
This op does the following:
1. Flattens the first two dims of the input tensor.
2. Gathers valid values from the input tensor using an index tensor.
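
The two steps can be illustrated with a pure-Python sketch on nested lists. This is only an illustration of the described behaviour, not the actual kernel; the function name and the sample data are assumptions.

```python
def flatten_and_unpad(tensor, indices):
    """Flatten the first two dims of a nested list, then gather rows by index."""
    flat = [row for batch in tensor for row in batch]  # [B, S, ...] -> [B*S, ...]
    return [flat[i] for i in indices]


# Two sequences padded to length 3; 0 marks padding.
padded = [[11, 12, 0], [21, 0, 0]]
valid_positions = [0, 1, 3]  # flat positions of the non-padding tokens
unpadded = flatten_and_unpad(padded, valid_positions)
```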


### Motivation and Context
The grad op of `PadAndUnflatten` was `GatherGrad`, which performs
poorly.
I implemented `FlattenAndUnpad` to replace `GatherGrad` as the
grad of `PadAndUnflatten`.
With this op, we can also simplify the "Reshape + ShrunkenGather"
pattern to `PadAndUnflatten` in the padding elimination optimizer, which
will also improve performance.
2023-11-09 09:52:48 +08:00
Scott McKay
885bf3561d
Add tool to fix lines > 120 chars. (#18293)
### Description
Helper to run clang-format on lines that are > 120 chars.

We disable clang-format enforcing 120 chars by default because its
formatting can negatively impact readability. If a developer has not
manually kept a line within the 120 char limit, this tool will fix it. It
will leave all other lines alone to honor the formatting the developer
chose.
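
A rough sketch of the idea, assuming the approach is to find over-long lines and restrict clang-format to them with its `--lines=<start>:<end>` option. The function names and the sample input are hypothetical and this is not the tool added by this commit.

```python
def long_line_ranges(lines, limit=120):
    """Return 1-based (start, end) ranges of consecutive lines longer than `limit`."""
    ranges = []
    for number, line in enumerate(lines, start=1):
        if len(line.rstrip("\n")) <= limit:
            continue
        if ranges and ranges[-1][1] == number - 1:
            ranges[-1] = (ranges[-1][0], number)  # extend the current run
        else:
            ranges.append((number, number))
    return ranges


def clang_format_command(path, ranges):
    """Build a clang-format invocation restricted to the given line ranges."""
    args = ["clang-format", "-i"]
    args += [f"--lines={start}:{end}" for start, end in ranges]
    return args + [path]


source = ["int a;\n", "x" * 121 + "\n", "y" * 130 + "\n", "int b;\n"]
ranges = long_line_ranges(source)
command = clang_format_command("example.cc", ranges)
```

Passing only the offending ranges is what leaves every other line untouched.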

### Motivation and Context
Help developers fix lint errors. 

The preferred approach is to use a vertical ruler/guideline in your editor
when actually writing the code.
2023-11-09 10:12:57 +10:00
Justin Chu
c250540722
Bump linter versions (#18341)
Bump linter versions and run format.
2023-11-08 13:04:40 -08:00
Changming Sun
812532592e
Add a build validation for Linux ARM64 cross-compile (#18200)
### Description
1. Add a build validation for Linux ARM64/ARM32 cross-compile to catch
issues listed in #18195.
2. Revert Eigen's commit id back to what we had before.


### Motivation and Context
To catch cross-compile issues.
Added a TODO item for fixing the compile warnings in Linux ARM32 build: AB#21639
2023-11-08 13:03:18 -08:00
sophies927
68fab24c22
Update stale.yml (#18304)
Exempt all issues w/ assignees from stale bot, increase days before
issue close, + add start date to address issue w/ GH API rate limiting

2023-11-08 11:56:35 -08:00
Dmitri Smirnov
a37e6a503b
Update Abseil raw_flat_hash visualization (#18329)
### Description
Fix the broken pieces due to the latest Abseil update.

### Motivation and Context
Make the debugging bearable.
2023-11-08 11:19:45 -08:00
Adrian Lizarraga
a0eeeafa80
[QNN EP] Session option for graph optimization (#18262)
### Description
Adds the QNN session option `htp_graph_finalization_optimization_mode`
to enable QNN graph optimizations at the expense of longer preparation
time.

### Motivation and Context
Allow enabling QNN graph optimizations per app/model.
2023-11-08 10:06:15 -08:00
kunal-vaishnavi
c8def0cc51
Add LLaMA GQA ragged batching (#18337)
This PR updates the replacement of MHA with GQA and updates the LLaMA
scripts for the modified GQA op. It is related to the changes in [this
PR](https://github.com/microsoft/onnxruntime/pull/18283).

### Motivation and Context
This PR allows us to run LLaMA with the GQA op end-to-end using ragged
batching (i.e. batched inputs of different lengths).
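
As a rough illustration of what "batched inputs of different lengths" means for the inputs, the sketch below pads token-id lists to a common length and builds a matching attention mask. The padding side, the pad id, and the helper name are assumptions and may differ from what the LLaMA scripts do.

```python
def pad_ragged_batch(sequences, pad_id=0):
    """Right-pad token-id lists to a common length and build an attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return input_ids, attention_mask


input_ids, attention_mask = pad_ragged_batch([[5, 6, 7], [8]])
```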
2023-11-08 09:36:28 -08:00
Prathik Rao
34f77eaa24
bfloat16 support for quickgelugrad (#18336)
### Description

Registers BFloat16 datatype as valid input type for CUDA QuickGeluGrad
Kernel.

### Motivation and Context

Enabling `meta-llama/Llama-2-70b` to be finetuned with ONNX Runtime
training.

---------

Co-authored-by: Prathik Rao <prathikrao@microsoft.com@orttrainingdev8.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
2023-11-08 08:40:02 -08:00
pengwa
2151c79bf1
Tune ORTModule logging experience a bit (#18298)
### Tune logging experience a bit

After the last update to the ORTModule logging experience, we found a few
issues:
1. The `INFO` level outputs too much, including PyTorch exporter
verbose logs (tracing graphs) on every rank. At this level, we only
want to
    - Output a little more information to users than the `WARNING` level,
    for example the memory recomputation recommendations or other
    not-fully-ready features.
    - Output a little more information for a quick diagnostic, collected
    on rank-0 only.
2. ONNX Runtime log filtering during graph build and session init
sometimes hides issues (for example a segmentation fault), so there is no
useful information in `WARNING`/`INFO` for users to report to us. This
is not good!
3. Some of our devs like using `pdb` to debug Python code, but adding
`import pdb; pdb.set_trace()` to a model's code might hang when they use
`INFO` or `WARNING`, where the export happens and all output is
redirected due to log filtering. The only workaround is to switch to
VERBOSE, which outputs far too many logs.

The corresponding changes proposed here are:
1. For `INFO` logging,
    - We only log on rank-0.
    - We restrict the ORT backend logging level to WARNING in this
    case, because ORT backend code outputs far too many logs that should be
    under verbose, and we cannot guarantee they get cleaned up
    immediately once they are added.
    - We output the PyTorch exporter verbose log (including the tracing graph),
    which is useful for a quick diagnostic when an issue happens.
2. Remove all logging filtering on the ORT backend, so the segmentation fault
details will not be hidden if one happens again.
3. Introduce a `DEVINFO` logging level,
    - Logs on all ranks
    - Sets the ORT backend logging level to INFO
    - Turns OFF all PyTorch exporter logging filtering (to unblock
    pdb debugging).
4. Currently, using Memory Optimizer requires DEVINFO (which
outputs ORT backend INFO logs), so the memory optimizer document is updated to
reflect this. https://github.com/microsoft/onnxruntime/pull/17481 will
update the requirement back to INFO for showing memory optimization info.
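
The per-level behaviour listed above can be summarised as a small policy table. This is a sketch only: the class, field, and function names are hypothetical, and only the two levels described in detail here are modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogPolicy:
    all_ranks: bool          # False means only rank 0 logs
    ort_backend_level: str   # logging level applied to the ORT backend


# Only the two levels this message describes in detail are modelled.
POLICIES = {
    "INFO": LogPolicy(all_ranks=False, ort_backend_level="WARNING"),
    "DEVINFO": LogPolicy(all_ranks=True, ort_backend_level="INFO"),
}


def should_log(level: str, rank: int) -> bool:
    """Return True if a process with this rank emits logs at `level`."""
    policy = POLICIES[level]
    return policy.all_ranks or rank == 0
```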

You can check
https://github.com/microsoft/onnxruntime/blob/pengwa/devinfo_level/docs/ORTModule_Training_Guidelines.md#log-level-explanations
for a better view of different log levels.

This PR also extracts some changes from a bigger one,
https://github.com/microsoft/onnxruntime/pull/17481, to reduce its
complexity for review.


---------

Co-authored-by: mindest <30493312+mindest@users.noreply.github.com>
2023-11-08 17:42:50 +08:00
Tianlei Wu
8044e5f603
SDXL: Update demo with dynamic shape serving with CUDA EP (#18340)
Update the SDXL demo with dynamic shape serving with CUDA EP.
2023-11-08 00:42:55 -08:00
aciddelgado
3dece27f51
GQA Flash Attention with Attention Mask (#18283)
### Description
GQA now only works with Flash Attention with Attention Mask input,
allowing for batched input. Note: This PR Disables Memory Efficient
Attention, only allowing Flash Attention kernel to be used.



### Motivation and Context
Allows GQA to work with batched input.

---------

Co-authored-by: Yufeng Li <liyufeng1987@gmail.com>
2023-11-07 17:47:51 -08:00
Yulong Wang
10df847baf
[js] fix linter out-of-memory issue (#18307)
### Description
fix linter out-of-memory issue by ignoring file pattern 'test/data/'.
2023-11-07 17:12:22 -08:00
Yulong Wang
d117a8010f
fix typo (node)->(browser) in linux-wasm-ci.yml (#18309)
### Description
fix display name `'Build and test (node) (simd + threads)'` to `'Build
and test (browser) (simd + threads)'`
2023-11-07 17:07:40 -08:00
Dmitri Smirnov
096307c64b
Do not run AOT function inlining when the model does not define any local functions (#18302)
### Description
Check whether the model defines any local functions.
If not, skip AOT inlining, including any schema-based functions.
The latter would be inlined during partitioning.

### Motivation and Context
This prevents calls to GetCapability() on EPs and enhances compatibility.

---------

Co-authored-by: Pranav Sharma <prs@microsoft.com>
2023-11-07 13:46:42 -08:00
Jiajia Qin
606356d0b1
[js/webgpu] Simplify the Resize shader when noScale is true (#18321)
### Description
For Resize, when `noScale` is true, the shader can become very simple
and no longer depends on `attributes.mode`. So we should remove
those parts of the shader code for simplification.

This PR can also fix #18311 since `noScale` is always true in that
model.

However, #18311 also exposes a bug in the Resize implementation for `linear`
mode. It seems that the current implementation always treats
the input as either a 2d or 4d tensor, but the actual input is a 3d
tensor, which is why the shader compilation fails. We may need to fix
it in a separate PR.
2023-11-07 12:54:20 -08:00
liqun Fu
6127dd1d2d
implement gridsample 20 (#17744)
2023-11-07 10:42:41 -08:00
Prathik Rao
83c0275354
add bfloat16 support for ConcatTraining and SplitTraining ops (#18280)
### Description

Updates input/output type constraints on training operators
ConcatTraining and SplitTraining to include bfloat16 which was
introduced in IR version 4.

### Motivation and Context

Enabling `meta-llama/Llama-2-70b` to be finetuned with ONNX Runtime
training.

Co-authored-by: Prathik Rao <prathikrao@microsoft.com@orttrainingdev8.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
2023-11-07 10:10:01 -08:00
satyajandhyala
a16d528399
[JS/Web] Added Uniforms support to binary ops. (#18260)
### Description
Added Uniform support to binary ops



### Motivation and Context
To improve performance
2023-11-07 08:41:52 -08:00
Patrice Vignola
800ae7742c
[DML EP] Add RotaryEmbedding (#18158)
This is a graph implementation of RotaryEmbedding since there's no time
to add it to DML before 1.16.2, but it eventually should move into
DirectML since we're bandwidth-bound.
2023-11-07 08:26:11 -08:00
Yi Zhang
9868a71373
[Fix] Stages to Run couldn't be selected (#18310)
### Description
Add the pool definition in 2 stages even though the pool is a
Microsoft-hosted pool.



### Motivation and Context
Recently, in the NuGet pipeline, when we click Stages to Run

![image](https://github.com/microsoft/onnxruntime/assets/16190118/45af295e-fa75-402a-a7de-803c6a2ab7cd)
the following error always pops up:
```
Encountered error(s) while parsing pipeline YAML:
Could not find a pool with ID 5206. The pool does not exist or has not been authorized for use. For authorization details, refer to https://aka.ms/yamlauthz.
Could not find a pool with ID 5206. The pool does not exist or has not been authorized for use. For authorization details, refer to https://aka.ms/yamlauthz.
```
2023-11-07 17:52:47 +08:00
pengwa
4f15b42728
Customize _get_tensor_rank for model export in stage3 (#18294)
### Customize _get_tensor_rank for model export in stage3

Weight/Params sizes are all (0), so exporter logic that depends on the input
shape will fail.

This PR overrides the `_get_tensor_rank` function by retrieving the shape for
weights differently.


2023-11-07 16:37:11 +08:00
zhijiang
630c877b43
Zhijxu/improve ortmodule python perf a little bit (#13716)
Improve 2 Python functions a little bit.

According to a profiling result from a real user case, we found that 2
Python functions can be improved. The first image is the result before the
improvement and the second is the result after it; the improvement
saves 8ms.

![image](https://user-images.githubusercontent.com/43435212/202961725-b88d679e-993b-4910-a339-253f3ed5dcde.png)

![image](https://user-images.githubusercontent.com/43435212/202961732-6c6deebf-962f-4392-90d7-03705433e3ee.png)
2023-11-07 15:24:57 +08:00
Tianlei Wu
00c2bf39bd
SkipGroupNorm fusion and SDXL Pipeline Update (#18273)
Update a few optimizations for Stable Diffusion XL:
(1) Add SkipGroupNorm fusion
(2) Remove GroupNorm fusion limits. Previously, we only fused GroupNorm
when the channel count was one of `320, 640, 960, 1280, 1920, 2560, 128, 256, 512`,
so some GroupNorm nodes in the refiner were not fused.
(3) Tune SkipLayerNormalization to use vectorized kernel for hidden size
320, 640 and 1280.

Pipeline Improvements:
(4) Enable cuda graph for unetxl.
(5) Change optimization to generate optimized fp32 model with ORT, then
convert to fp16. Otherwise, fp16 model might be invalid.
(6) Add option to enable-vae-slicing.

Bug fixes:
(a) Fix vae decode in SD demo.
(b) Fix UniPC add_noise missing a parameter.
(c) EulerA raises an exception in the SDXL demo. Disable it for now.
(d) Batch size > 4 causes an error in VAE without slicing. Force-enable VAE
slicing when batch size > 4.

#### Performance Test on A100-SXM4-80GB

Description of the experiment configurations in the results:
*Baseline*: removed GroupNorm fusion limits; CUDA graph is enabled in
Clip and VAE, but not in Clip2 and UNet.
*UNetCG*: Enable Cuda Graph on UNet
*SLN*: Tune SkipLayerNormalization
*SGN*: Add SkipGroupNorm fusion

The latency (ms) of generating an image of size 1024x1024 with 30 steps
of the base model and 9 steps of the refiner model:

Component | Baseline | UNetCG | UNetCG+SLN | UNetCG+SLN+SGN
-- | -- | -- | -- | --
Base Clip | 3.74 | 3.70 | 3.88 | 3.81
Base Unet x30 | 2567.73 | 2510.69 | 2505.09 | 2499.99
Refiner Clip | 7.59 | 7.42 | 7.41 | 7.58
Refiner Unet x9 | 814.43 | 803.03 | 802.20 | 799.06
Refiner VAE Decoder | 84.62 | 85.18 | 85.24 | 87.43
E2E | 3480.56 | 3412.05 | 3405.77 | 3400.23

We can see that enabling CUDA graph brought the major gain (around 68ms). SLN
tuning has about a 6ms gain. SkipGroupNorm fusion has about a 5ms gain.
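
The per-step gains can be recomputed from the E2E row of the table. The snippet below is just that arithmetic; the variable names are illustrative.

```python
# End-to-end latency (ms) per configuration, copied from the E2E row of the table.
e2e_ms = {
    "Baseline": 3480.56,
    "UNetCG": 3412.05,
    "UNetCG+SLN": 3405.77,
    "UNetCG+SLN+SGN": 3400.23,
}

names = list(e2e_ms)
# Gain of each configuration over the one before it, rounded to 2 decimals.
gains_ms = {
    later: round(e2e_ms[earlier] - e2e_ms[later], 2)
    for earlier, later in zip(names, names[1:])
}
```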

SkipGroupNorm fusion doesn't reduce latency much, but it also
reduces memory usage, so enabling it is recommended.

### Motivation and Context
Additional optimizations upon previous work in
https://github.com/microsoft/onnxruntime/pull/17536.
2023-11-06 22:02:33 -08:00
Patrice Vignola
276918d93b
Allow SkipLayerNorm and LayerNorm in rotary attention fusion (#18288)
Although SimplifiedLayerNorm is faster than LayerNorm, DML doesn't have
an optimized implementation for the former yet and LayerNorm ends up
being faster.
2023-11-06 22:01:17 -08:00
Wei-Sheng Chin
fb6737e893
Distributed Squeeze and Distributed Unsqueeze (#18269)
Implement DistributedSqueeze & DistributedUnsqueeze for llama 2.
2023-11-06 20:11:35 -08:00
Hector Li
ad34c67a44
[QNN EP] Enable Expand op (#18234)
### Description
Enable the Expand op.
There is no direct mapping from the ONNX Expand op to QNN, so we need to use
ElementWiseMultiply to do the data broadcast. Basically, create the 2nd
input with value 1.0, using the shape data from the Expand op.
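
The broadcast trick can be sketched in pure Python on flat, row-major lists: multiplying the input by a tensor of ones with the target shape yields the expanded data. The helper names are hypothetical and this is only an illustration of the idea, not the QNN EP code.

```python
from itertools import product
from math import prod


def broadcast_shape(a_shape, b_shape):
    """Broadcast two shapes, where a dim of size 1 stretches to match the other."""
    rank = max(len(a_shape), len(b_shape))
    a = (1,) * (rank - len(a_shape)) + tuple(a_shape)
    b = (1,) * (rank - len(b_shape)) + tuple(b_shape)
    out = []
    for da, db in zip(a, b):
        if da != db and 1 not in (da, db):
            raise ValueError(f"cannot broadcast {a_shape} with {b_shape}")
        out.append(db if da == 1 else da)
    return a, b, tuple(out)


def flat_index(index, shape):
    """Row-major offset into a tensor; dims of size 1 always read element 0."""
    offset = 0
    for i, d in zip(index, shape):
        offset = offset * d + (i if d != 1 else 0)
    return offset


def expand_via_multiply(data, in_shape, target_shape):
    """Expand `data` by multiplying it with a ones tensor of `target_shape`."""
    ones = [1.0] * prod(target_shape)
    a, b, out_shape = broadcast_shape(in_shape, target_shape)
    out = [
        data[flat_index(idx, a)] * ones[flat_index(idx, b)]
        for idx in product(*(range(d) for d in out_shape))
    ]
    return out, out_shape


expanded, expanded_shape = expand_via_multiply([1.0, 2.0, 3.0], (3, 1), (1, 2))
```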
2023-11-06 16:28:11 -08:00
Xavier Dupré
3b63d85c25
Fix unit test when TVM EP is enabled (#18189)
### Description

TestInlinedLocalFunctionNotRemoved checks that local functions are not
removed but TVM EP optimizes the whole graph after it is inlined.
2023-11-06 19:32:26 +01:00
Changming Sun
398ef677ba
Update protobuf python package's version (#18203)
1. Now we use a released version of ONNX, so we can directly download a
prebuilt package from pypi.org. We do not need to build one from source.
2. Update protobuf python package's version to match the C/C++ version
we are using.
3. Update the tensorboard Python package because the current one is
incompatible with the newer protobuf version.
2023-11-06 09:22:54 -08:00
Yi Zhang
b7b8b5b2ce
Fix Eigen-3.4.0 URL and hash (#18290)
### Description
Add CI changes for #18287

Install onnx explicitly to pass windows GPU+dml stage.


### Motivation and Context
'eigen-3.4' was referring to a branch, not to a tag. There is now an
Eigen 3.4.1 on that branch, and thus the hash has changed.
See
https://github.com/microsoft/onnxruntime/issues/18286#issuecomment-1793683416
2023-11-06 09:19:51 -08:00
BoarQing
d652b1fe48
[VitisAI] fix tensor has multi data type (#18188)
### Description
When taking a tensor's data as raw, clear data of other types within the
tensor.


### Motivation and Context
One model's graph transformation produced a node with multiple data types.
This change would make the model valid.
2023-11-06 07:16:17 -08:00
Chi Lo
dfafcb58aa
[TensorRT EP] Properly set CUDA_INCLUDE_DIR for onnx-tensorrt (#18274)
https://github.com/microsoft/onnxruntime/pull/17468
The above PR didn't fully fix the issue for some environments.
This PR fixes this.
2023-11-03 20:04:10 -07:00
kunal-vaishnavi
08eaa1c55d
Remove internal enforce for IO binding inputs (#18266)
### Description
This PR removes an internal `ORT_ENFORCE` when binding `torch.tensor`
inputs using IO binding for end-to-end scripts.



### Motivation and Context
In merged exports of PyTorch models to ONNX, each past key and past
value in the past KV cache has an input shape of `(batch_size,
num_heads, past_sequence_length, head_size)`. In the first pass through
the model to process the prompt, `past_sequence_length = 0`. Therefore,
each of these inputs is of shape `(batch_size, num_heads, 0,
head_size)`. In subsequent passes, `past_sequence_length > 0`.

When binding a `torch.tensor` of shape `(batch_size, num_heads, 0,
head_size)` with `io_binding.bind_input`, the tensor's `data_ptr()` must
be passed. For a `torch.tensor` of this shape, its `data_ptr()` returns
0. Because it returns 0, the existing `ORT_ENFORCE` is therefore false
and an error is raised. By removing the internal `ORT_ENFORCE`, no error
is raised and the model runs successfully.
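
To make the empty-tensor case concrete, the sketch below computes the shape and element count of one past key/value input for the prompt pass and for a later pass. The helper name is hypothetical and the sizes are taken from the LLaMA-2 example in this message.

```python
from math import prod


def past_kv_shape(batch_size, num_heads, past_sequence_length, head_size):
    """Shape of one past key or past value input, as described above."""
    return (batch_size, num_heads, past_sequence_length, head_size)


prompt_pass = past_kv_shape(1, 32, 0, 128)      # first pass: nothing cached yet
later_pass = past_kv_shape(1, 32, 11, 128)      # after an 11-token prompt
element_counts = (prod(prompt_pass), prod(later_pass))
```

A tensor with zero elements has no data to point at, which is consistent with `data_ptr()` being 0 for the past key/value rows in the example.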

LLaMA-2 Example:
Input Name | Input Size | Device | Device ID | Torch Dtype | data_ptr()
------------- | ----------- | ------- | ----------- | ------------- | -----------
input_ids | torch.Size([1, 11]) | cuda | 7 | torch.int64 | 140639561842688
attention_mask | torch.Size([1, 11]) | cuda | 7 | torch.int64 | 140639561843200
position_ids | torch.Size([1, 11]) | cuda | 7 | torch.int64 | 140639561844224
past_key_values.0.key | torch.Size([1, 32, 0, 128]) | cuda | 7 | torch.float32 | 0
past_key_values.0.value | torch.Size([1, 32, 0, 128]) | cuda | 7 | torch.float32 | 0
... | ... | ... | ... | ... | ...
2023-11-03 16:12:32 -07:00
Chi Lo
84bdf04b25
[TensorRT EP] Fix bug for shape tensor input (#18253)
When the model has "shape tensor" as one of the inputs and user provides
explicit profile shapes for it, TRT EP doesn't correctly set the "shape
tensor" input.
Also, there is a bug for applying explicit profile shapes for the shape
tensor input.

Note: A model with a shape tensor input seems to be a rare case. In most
cases, the inputs are all execution tensors.
2023-11-03 16:07:50 -07:00
Chen Fu
26b396418d
Block-wise 4b quantization matmul operator change (#18172)
### Description
Replace block-wise 4b quantization implementation


### Motivation and Context
In https://github.com/microsoft/onnxruntime/pull/18101 we have an
augmented block-wise 4b quantization interface and implementation. Here
we use this new implementation in onnxruntime contrib ops

---------

Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
2023-11-03 15:29:43 -07:00
Edward Chen
2ec1f94bfd
Make MlasTestFixture::mlas_tester an inline variable. (#18263)
Make MlasTestFixture::mlas_tester an inline variable. With this change we no longer need to define `MlasTestFixture<T>::mlas_tester` outside of the class definition.
2023-11-03 10:50:21 -07:00
Changming Sun
4c4d79a612
Change a bitwise logical xor to logical wise (#18246)
### Description
Change a bitwise XOR to a logical XOR.

### Motivation and Context
For Boolean values we should not use bitwise operations.
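
A small illustration of why the two can differ, written in Python with integers standing in for truthy values (the actual change is in C++ and is not reproduced here). The helper name is an assumption.

```python
def logical_xor(a, b) -> bool:
    """True when exactly one operand is truthy."""
    return bool(a) != bool(b)


a, b = 2, 1                          # both operands are truthy
bitwise_result = a ^ b               # 0b10 ^ 0b01 == 0b11, which is truthy
logical_result = logical_xor(a, b)   # both truthy, so the logical XOR is False
```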
2023-11-03 10:42:51 -07:00
Numfor Tiapo
192caee81f
Fix Signed Mismatch (#18258)
This PR fixes the signed mismatch warning in
DmlRuntimeFusedGraphKernel. This warning is treated as an error on the
x86 versions of our internal builds preventing us from updating to
latest ORT.
2023-11-03 10:16:37 -07:00
satyajandhyala
e207060ac9
[JS/Web] Added Uniforms support to unary ops. (#18223)
### Description
Added uniforms support to unary ops.


### Motivation and Context
Improve performance
2023-11-03 09:30:54 -07:00
winskuo-quic
90f205e79c
[QNN EP] Fix Pad UT (#17982)
### Description

QNN EP has 2 unit tests failing:

TEST_F(QnnHTPBackendTests, DISABLED_PadReflectMode)
TEST_F(QnnHTPBackendTests, DISABLED_Pad4dOutOfRangePadConstantValue)

For the first unit test, in QNN's master definition, it is stated that
when using MIRROR_REFLECT, the before and after pad amounts must not be
greater than shape(in[0])[i] - 1. Therefore, we need to change the pad
amount from {0,2,0,0} to {0,1,0,0}.

For the second unit test, QNN does not have limitations stating that the pad
constant should be smaller than input[0]. The test fails because it did
not take the pad constant into consideration when doing quantization.
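
The MIRROR_REFLECT limit described for the first test can be seen in a one-dimensional sketch: reflection without repeating the edge element only has `len - 1` elements to mirror on each side. The function below is an illustration with a hypothetical name, not QNN or ONNX Runtime code.

```python
def reflect_pad_1d(values, before, after):
    """Reflect-pad a list without repeating the edge element."""
    limit = len(values) - 1
    if before > limit or after > limit:
        raise ValueError(f"pad amounts must not exceed {limit} for length {len(values)}")
    left = [values[i] for i in range(before, 0, -1)]
    right = [values[len(values) - 1 - i] for i in range(1, after + 1)]
    return left + list(values) + right


padded = reflect_pad_1d([1, 2, 3], before=2, after=1)
```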

### Motivation and Context
Fix the 2 unit tests mentioned in description.
2023-11-03 09:21:33 -07:00
Scott McKay
c352e9b1f9
Rework/cleanup the C# build infrastructure for nuget packages. (#18127)
### Description
Update the C# nuget build infrastructure to make building a test nuget
package more user friendly and to simplify the infrastructure.
- Remove usage of dotnet and msbuild in CIs
  - was a temporary requirement until .net 6 MAUI was added to the released
    Visual Studio
  - remove SelectedTargets property and its usage
- Add property for excluding mobile targets
  - generally we exclude based on the nuget package name
  - can now specify `/p:IncludeMobileTargets=false` on the command line to
    force exclusion
- Support building test package using build.py `--build_nuget` better
  - limit inclusion of xamarin targets as building with them requires a
    lot more infrastructure
  - use msbuild directly if xamarin targets are included. use dotnet
    otherwise.
  - remove quoting of property values as it doesn't appear to be necessary
    and breaks when msbuild is being used
  - add infrastructure to be able to pack the nuget package on linux with
    `dotnet pack`
    - `nuget pack` is not user friendly as per comments in changes
    - requires stub csproj to provide the nuspec path
- Remove netstandard1.0 targets from nuspec
  - we removed support from the actual bindings previously
- Remove usage of nuget-staging directory when creating nuget package on
  linux
  - the nuspec file element has a fully qualified path for a source file
    so there is no obvious benefit to copying to a staging directory prior
    to packing

### Motivation and Context
Address issues with 1P users trying to create test nuget packages
locally.
Long overdue cleanup of CI complexity.
2023-11-03 09:05:17 -07:00
Scott McKay
4f2096be38
Update XNNPACK to latest version (#18038)
### Description
Update XNNPACK to latest version
- adds fp16 kernels and various other improvements
- requires pthreadpool update as well

Most code updates in the XNNPACK EP are to adjust to the new XNNPACK API
- 'setup' is split into 'reshape' and 'setup'
-  some ops use a workspace buffer
   -  copied workspace allocation from XNNPACK unit test code
- some suffixes changed 

Added wrapper for XNNPACK caches to base XNNPACK EP kernel
- simplifies usage
- XNNPACK split out the code and weights caches, but the code cache
  isn't currently usable via the public API
  - we could use the internal types if we think it's required for
    performance reasons. non-trivial though as we'd need to propagate ifdef
    values from the XNNPACK build up to the ORT build.
  - using XNNPACK internals would also mean we would not be able to
    support using a pre-built XNNPACK package
    - not an issue currently
  
Fixed opset registration for internal NHWC domain
- was not being tied to the ONNX version, so nodes inserted by layout
transformation had the incorrect opset
- a number of other places needed updating once this issue was fixed

Remove support for NCHW Resize from XNNPACK EP so it's NHWC only
- we only supported NCHW for fp32
- doing so adds complexity in multiple places (XNNPACK EP kernel
  implementation, layout transformation and transpose optimization)
- unclear if that complexity provides any benefit. can add back if
  required by production scenario

### Motivation and Context
We're looking at enabling fp16 support for CoreML and NNAPI. If we do
that we need a good fallback story if the CPU EP will be used. The
XNNPACK fp16 kernels will hopefully provide that.

NOTE: This PR doesn't add fp16 support to the XNNPACK EP kernels. That
can be done as required in separate PRs and should be relatively simple
to do.
2023-11-03 09:04:28 -07:00
Sumit Agarwal
e36d003765
Introduce new optimizer Pad + Conv/MaxPool (#18136)
### Description
Introducing a new L1 optimizer to fuse Pad into its child node if the child
node is Conv or MaxPool.

Pad -> Conv = Conv
Pad -> MaxPool = MaxPool

Major Conditions:
- It will only fuse for the `Constant` mode of padding.
- Conv/MaxPool should not have the optional `indices` output tensor.
- For the `Pad` operator, padding values for non-spatial dimensions should be
zero and padding values for spatial dimensions should be positive.

For other conditions please see `SatisfyCondition()` in `pad_fusion.cc`.
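
A simplified sketch of the pads check and merge, assuming the ONNX layouts where `Pad` lists begins for all dims followed by ends, and `Conv` pads cover spatial dims only. The function name is hypothetical, the `indices` output check and other conditions in `SatisfyCondition()` are omitted, and treating "positive" as non-negative is an assumption.

```python
def fuse_pad_into_conv(pad_mode, pad_pads, conv_pads):
    """Return merged Conv pads, or None when the sketch's conditions fail.

    pad_pads uses the ONNX Pad layout [begin_0..begin_n, end_0..end_n] over all
    dims of an N,C,spatial... input; conv_pads covers spatial dims only.
    """
    rank = len(pad_pads) // 2
    begins, ends = pad_pads[:rank], pad_pads[rank:]
    if pad_mode != "constant":
        return None
    if any(begins[:2]) or any(ends[:2]):      # batch and channel dims must be unpadded
        return None
    spatial = begins[2:] + ends[2:]
    if any(p < 0 for p in spatial):           # assumption: "positive" means non-negative
        return None
    return [c + p for c, p in zip(conv_pads, spatial)]


merged = fuse_pad_into_conv("constant", [0, 0, 1, 2, 0, 0, 3, 4], [1, 1, 1, 1])
```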

2023-11-03 07:17:02 -07:00
Scott McKay
016b75260b
Pre-link when creating static library for apple framework (#18241)
### Description
Pre-link with `ld -r` to apply symbol visibility when the static library
is created to replicate Xcode's Single Object Pre-link.

Current builds set the visibility flags but that doesn't get applied
until the static library is linked into something else, which can be too
late. Pre-linking fixes this.

The pre-link uses the .o files from the ORT static libraries and the .a
files from external libraries. This combination limits the symbols
included from the .a files to things required by the ORT .o files.

In order to minimize changes elsewhere in the build we extract the .o
files from the ORT static libraries using `ar -x`.

Re-ordered the pieces used to build the Apple framework to make it a
little more readable.
Fixed a couple of misc issues with missing symbols from the minimal
build that show up when pre-linking is applied.

### Motivation and Context
Will hopefully address #17722
2023-11-03 23:38:29 +10:00
Xavier Dupré
1439da36fe
Partially disable QGemm tests for float 8 types (#18196)
### Description
The quantization tool assumes QGemm is implemented for float 8 types but
it is not yet supported. The condition partially disabling the test was
not robust enough. This is changed by this PR.
2023-11-03 10:17:50 +01:00
Yi Zhang
9f5a6856fe
Rerun the flaky ort-web tests automatically (#18187)
### Description
Retry 3 times at most if the web test fails.


### Motivation and Context
Web GPU tests are not stable.

The pipeline analytics linked below show that these ort-web tests are all
among the top 10 failing tasks.

https://dev.azure.com/onnxruntime/onnxruntime/_pipeline/analytics/stageawareoutcome?definitionId=161&contextType=build.

Generally, the tests pass when rerun manually.
So, enable them to rerun automatically.

The duration of these test steps isn't long, so retrying won't take
too long.
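
The retry policy can be sketched generically as below. This only illustrates "retry at most 3 times" and does not reflect how the pipeline configures it; the function name and the simulated flaky task are assumptions.

```python
def run_with_retries(task, max_retries=3):
    """Run `task` (a callable returning True on success), retrying on failure.

    Returns the number of attempts used, or raises RuntimeError if all fail.
    """
    for attempt in range(1, max_retries + 2):   # first run plus up to max_retries retries
        if task():
            return attempt
    raise RuntimeError(f"task failed after {max_retries} retries")


outcomes = iter([False, False, True])            # a flaky test: fails twice, then passes
attempts_used = run_with_retries(lambda: next(outcomes))
```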
2023-11-03 16:34:56 +08:00
Changming Sun
d8d79521ca
Disable ccache for DML (#18230)
### Description
Disable ccache for DML. This change is similar to #18104. Now the DML
build job is having the same timeout issue. I don't know why. But
disabling ccache probably would help.
2023-11-02 16:00:55 -07:00
xhcao
8d48d3e9cc
[js/web] optimize reduce related operators (#17957)
2023-11-02 12:51:48 -07:00