Commit graph

9318 commits

Author SHA1 Message Date
cloudhan
a4902ee65b
[CUDA][ROCm] Allow allocating ScratchBuffer from TuningContext (#17028)
By switching to ort native stream, we can allocate scratch buffer
directly from tuning context.
2023-08-10 00:05:10 +08:00
pengwa
6e6f582e08
Use fully qualified name for PythonOp export (#17021)
### Use fully qualified name for PythonOp export

Originally, when `torch.autograd.Function` classes with the same name
existed in different modules, for example:

`a.b.c.Gelu` vs. `d.e.func.<locals>.Gelu`

we threw an exception by default so that the user knew we could not
distinguish the two `Gelu` classes, because the module path was not
recorded during model export. As a workaround, we introduced
`ORTMODULE_SKIPPED_AUTOGRAD_FUNCTIONS` to ignore the duplicate-named
`Gelu` that is not used by the model run. This has an obvious limitation:
it does not help when, for example, both `Gelu` classes are used in training.



This PR finds a way to construct a fully qualified name.

`def _export_pt_1_10(g, n, *args, **kwargs):`

1. In the exporter function, `kwargs` contains `name` and `module`. In the
above example:
   `a.b.c.Gelu`  --> name: `Gelu`, module: `a.b.c`
   `d.e.func.<locals>.Gelu` --> name: `Gelu`, module: `d.e`
   
 
Using name and module is not enough to get a fully qualified name. In
the second case, `d.e` is the module path, the module contains a
function called `func`, and inside this function there is a local
torch.autograd.Function named `Gelu` (many of our UTs look like this). We
can only get `d.e.Gelu`, which is not the correct fully qualified name.

The reason: `kwargs[name]` or `n.name` only returns the class's
name, not the class's fully qualified name (note that `kwargs[module]` is
correct).

2. `n` is a torch.Node, so we can access `pyobj` to get the
torch.autograd.Function's apply method instance, then use `.__self__` to
get the torch.autograd.Function class. From the class we can get the
`module` and the class's qualified name; joined together, they give the
fully qualified name.

With the above change, we no longer need to use `kwargs[name]` and
`kwargs[module]`, and no longer need to check for naming conflicts or the
`ORTMODULE_SKIPPED_AUTOGRAD_FUNCTIONS` env var.
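
The naming problem can be reproduced without PyTorch. The sketch below uses plain Python classes as stand-ins for `torch.autograd.Function` subclasses; `qualified_name` is a hypothetical helper for illustration, not the function added by this PR.

```python
def qualified_name(cls) -> str:
    """Join the defining module and the class's qualified name."""
    return f"{cls.__module__}.{cls.__qualname__}"


class Gelu:  # stands in for a module-level autograd Function
    pass


def func():
    class Gelu:  # a class local to func(), as in many unit tests
        pass

    return Gelu


LocalGelu = func()

# __name__ is identical for both classes, so it cannot tell them apart.
assert Gelu.__name__ == LocalGelu.__name__ == "Gelu"

# __qualname__ keeps the enclosing function, so the joined names differ.
print(qualified_name(Gelu))
print(qualified_name(LocalGelu))
```

The second name ends in `func.<locals>.Gelu`, which is the part that `name` and `module` alone cannot recover.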
2023-08-09 10:58:33 +08:00
Dmitri Smirnov
c424e42594
[C++] Correctly handle scalar inputs in reduction ops, enforce Transpose perm attribute matches input rank. (#17041)
### Description

This PR addresses the following issues related to the use of the
functions in ORT.

- https://github.com/microsoft/onnxruntime/issues/16492
- https://github.com/microsoft/onnxruntime/issues/16997
- https://github.com/microsoft/onnxruntime/issues/14678
- Partially addresses
https://github.com/microsoft/onnxruntime/issues/16813

The optimization case for a scalar input did not correctly recognize it
as such.
Transpose kernel assumed that `perm` attribute would always match input
tensor rank.

### Motivation and Context
These issues cause crashes and erratic behavior.
2023-08-08 14:47:01 -07:00
Tianlei Wu
fb11c67368
Fix SkipLayerNorm for 2D input (#17014)
Fix an obvious bug:
(1) In packing mode, the input for SLN has two dimensions (introduced by
#15283): [token_count, hidden_size]. The current code, `element_count =
input_dims[0] * sequence_length * hidden_size`, computes element_count =
token_count * hidden_size * hidden_size, which causes an invalid memory
write in the CUDA kernel and crashes ORT.

and two minor issues:
(2) potential integer overflow in `static_cast<int>(element_count)`
(3) some dead code after `return LaunchSkipLayerNormKernel` that never
has a chance to run.
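
The arithmetic behind (1) and (2) can be checked with a small sketch. It assumes that, for a 2D input, `sequence_length` was read from `input_dims[1]` (which is consistent with the product quoted above). The function names are illustrative only and are not taken from the kernel source.

```python
from math import prod

INT32_MAX = 2**31 - 1


def buggy_element_count(input_dims, hidden_size):
    # Assumption: sequence_length comes from input_dims[1]. For a 2D
    # [token_count, hidden_size] input that value is hidden_size itself.
    sequence_length = input_dims[1]
    return input_dims[0] * sequence_length * hidden_size


def element_count(input_dims):
    # Rank-agnostic: multiply every dimension exactly once.
    return prod(input_dims)


def fits_in_int32(count):
    # Guards the narrowing cast mentioned in issue (2).
    return 0 <= count <= INT32_MAX


packed = [10, 768]  # [token_count, hidden_size]
print(buggy_element_count(packed, 768), element_count(packed))
```

For a 3D `[batch, sequence_length, hidden_size]` input both functions agree; only the 2D packed layout over-counts, by a factor of `hidden_size`.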
2023-08-08 14:04:03 -07:00
Chi Lo
73037978f8
Add PerThreadContext for TRT EP (#16599)
Maintaining one execution context per thread is suggested by the
TRT
[doc](https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#threading)
to avoid synchronization issues.
With the previous TRT EP, we did see synchronization issues when running
multithreaded on some models, for example, FasterRCNN.

This PR leverages the per-thread context implementation from the CUDA EP.
The modifications are as follows:

- Move CUDA graph and IExecutionContext objects to per thread context.
- Remove the lock_guard that was previously placed around the whole
compute_func() and put lock_guard in the blocks where multiple threads may
update kernel function state, access one builder, create/serialize/save engine,
save profile and serialize/save timing cache.
- On CentOS, don't unload TRT EP shared library and leave it around, so
that destructor of thread local data is still accessible upon thread
exits.

Note: Tested this PR with onnxruntime_perf_test and the overhead of
PerThreadContext is small.
2023-08-08 13:02:34 -07:00
Yulong Wang
56bced0581
[js/web] enable webgpu in browser unit test (#16310)
### Description
enable webgpu in browser unit test.

The CI pipeline uses Edge v113+ which enables WebGPU.

===

**UPDATE on 08/07/2023:**
- add flags to Edge browser launch commandline so that Edge on CI agents
can initialize WebGPU correctly.
- ONLY enable webgpu on web release build. Other pipelines are using
flag `-b=wasm,webgl,xnnpack` to specify the other 3 backends explicitly.
- disable "Resize" related test failures. Once they are fixed the tests
can be re-enabled.

---------

Co-authored-by: Satya Jandhyala <satya.k.jandhyala@gmail.com>
2023-08-08 11:45:04 -07:00
Arthur Islamov
c3f04251c7
[js/web] JSEP LayerNormalization and InstanceNormalizations kernels (#16830)
### Description
Added two kernels for Layer and Instance norm

Also request the maximum `maxBufferSize` limit when requesting the GPU
device, because by default it is limited to 256 MB, and allocating a 600 MB
buffer fails when running fp32 StableDiffusion weights.


### Motivation and Context
These two are used in StableDiffusion and many other networks
2023-08-08 09:09:37 -07:00
Chi Lo
5b9bf8b663
[TensorRT EP] Fix bug for using correct device id for EP allocator (#17036)
The code always uses device id 0. Fix to use provider option
`device_id_`
2023-08-08 09:06:44 -07:00
Edward Chen
50719d2f8e
[iOS] Add script to get simulator device info. (#17012)
Add script to get iOS simulator device info so we don't need to use hardcoded specifiers which may or may not refer to a valid simulator device.

Add use-xcode-version step to a packaging pipeline so it uses a consistent version of Xcode.
2023-08-08 09:04:06 -07:00
Ti-Tai Wang
45ea907f53
Fix orttraining_test_dort.py (#17034)
Converter has moved `opset_version` out from `torch.onnx.ExportOptions`,
and put it into `torch.onnx.OnnxRegistry`.
This PR fixes the usage in DORT.
2023-08-08 08:11:48 -07:00
Xavier Dupré
d0316ee768
Updating QDQ to support Float8E4M3FN (#16550)
### Description
Naive update of the quantization tools to support Float8E4M3FN for Gemm.
2023-08-08 12:18:48 +02:00
RandySheriffH
063e9054b8
RunAsync in C# (#16890)
Implement C# binding for RunAsync.

---------

Co-authored-by: Randy Shuai <rashuai@microsoft.com>
2023-08-07 22:19:38 -07:00
Baiju Meswani
249917a093
Add mac and windows python packages for onnxruntime-training (#16993)
2023-08-07 20:32:55 -07:00
Yi-Hong Lyu
e48dc3b281
Parallelize Transpose (#16854)
It gives up to 5.6% improvement for prompt and 2.3% improvement for token generation in LLaMA 7B case.
2023-08-07 14:25:53 -07:00
Chen Fu
3c10f027de
4b quantization for weights of LLMs (#16833)
### Description
Blockwise 4b quantization for LLMs.
1. Introduces 4b block-wise quantization for linear layer weights.
2. Implements a matrix multiplication kernel for fp32 x int4.
3. Implements a special operator, MatMulFpQ4.
4. Implements a quantization tool that converts a MatMul operator to
MatMulFpQ4 when the right-hand side is a 2D const tensor.
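
To make the idea of block-wise 4-bit quantization concrete, here is a minimal sketch of a symmetric scheme with one scale per block. This is not the packing or layout used by the kernel or by MatMulFpQ4; the block size of 32, the [-8, 7] integer range and all function names are assumptions for illustration.

```python
def quantize_block_sym(block):
    """Quantize one block of floats to 4-bit signed integers with one scale."""
    max_abs = max(abs(v) for v in block)
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    # Clamp to the signed 4-bit range [-8, 7].
    q = [max(-8, min(7, round(v / scale))) for v in block]
    return scale, q


def quantize_blockwise(weights, block_size=32):
    """Split a flat weight list into blocks and quantize each one."""
    return [
        quantize_block_sym(weights[start:start + block_size])
        for start in range(0, len(weights), block_size)
    ]


def dequantize_blockwise(blocks):
    """Rebuild approximate float weights from (scale, ints) pairs."""
    out = []
    for scale, q in blocks:
        out.extend(scale * v for v in q)
    return out


weights = [i / 10 - 3.2 for i in range(64)]
blocks = quantize_blockwise(weights)
restored = dequantize_blockwise(blocks)
print(max(abs(a - b) for a, b in zip(weights, restored)))
```

Each block keeps its own scale, so a large outlier only degrades the precision of the block that contains it.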


### Motivation and Context
Compress and accelerate LLMs

|Benchmark | Time(ns)|
|-------------|----------|
|Q4GEMM/Q4Sym/M:1/N:4096/K:4096/Threads:8| 218054|
|Q4GEMM/Q4Sym/M:1024/N:4096/K:4096/Threads:8| 35830155|
|Q4GEMM/Q4Sym/M:2048/N:4096/K:4096/Threads:8| 73479790|
|Q4GEMM/Q4Zp8/M:1/N:4096/K:4096/Threads:8| 270152|
|Q4GEMM/Q4Zp8/M:1024/N:4096/K:4096/Threads:8| 35826721|
|Q4GEMM/Q4Zp8/M:2048/N:4096/K:4096/Threads:8| 73021200|
|Q4GEMM/Q4Sym128/M:1/N:4096/K:4096/Threads:8| 213832|
|Q4GEMM/Q4Sym128/M:1024/N:4096/K:4096/Threads:8| 36749874|
|Q4GEMM/Q4Sym128/M:2048/N:4096/K:4096/Threads:8| 72618120|


|Benchmark | Time(ns)|
|-------------|----------|
|SGEMM/LLM/M:1/N:4096/K:4096/Threads:8|   522610|
|SGEMM/LLM/M:1024/N:4096/K:4096/Threads:8| 39237689|
|SGEMM/LLM/M:2048/N:4096/K:4096/Threads:8| 75983467|

---------

Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
2023-08-07 12:23:55 -07:00
Ti-Tai Wang
8a335b8347
Update torch.onnx.OnnxRegistry usage in DORT tests (#17009)
Update the usage of torch.onnx.OnnxRegistry, as it's officially
published in PyTorch: https://github.com/pytorch/pytorch/pull/106140.

---------

Co-authored-by: Wei-Sheng Chin <wechi@microsoft.com>
2023-08-07 10:15:51 -07:00
Khalia Spear
4e6ea730d6
Broadcasting for SLN for CPU and CUDA (#16510)
### Description
Enhanced SkipLayerNorm by implementing broadcasting for both CPU and
CUDA



### Motivation and Context
The input and skip tensors no longer have to be the same size: the skip
shape can match the input shape, be {1, sequence_length, hidden_size},
or be {sequence_length, hidden_size}.
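
The accepted skip shapes can be written down as a small check. This is a sketch that assumes a 3D input of shape {batch, sequence_length, hidden_size}; `skip_shape_is_compatible` is a hypothetical helper, not a function from the operator.

```python
def skip_shape_is_compatible(input_shape, skip_shape):
    """Return True if skip can be broadcast to input per the rules above."""
    batch, sequence_length, hidden_size = input_shape
    allowed = (
        (batch, sequence_length, hidden_size),  # same shape as input
        (1, sequence_length, hidden_size),      # broadcast over batch
        (sequence_length, hidden_size),         # batch dimension omitted
    )
    return tuple(skip_shape) in allowed


print(skip_shape_is_compatible((4, 128, 768), (128, 768)))
```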

---------

Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
2023-08-07 09:55:42 -07:00
pengwa
3649376f09
Fix a few small bugs (#17019)
### Fix a few bugs

1. Symbolic shape inference: there was no None check before getting the length.
2. Rename PythonOp/PythonOpGrad's attribute `name` to `func_name`;
otherwise, when we use onnx.helper.make_node to create the node, `name`
conflicts with the node name.
3. Filter shape inference warnings for PythonOp for torch 2.0 or newer.
4. Close the file descriptors used for log suppression. Without the fix, two
extra fds are left open after the log suppression exits its context.
Before entering log suppression (left), before exiting log suppression (right):

![image](https://github.com/microsoft/onnxruntime/assets/10530022/3cd3057a-59f9-4c89-8359-d9b32c49a17e)
   With the fix, no fd is added after the context exits.

![image](https://github.com/microsoft/onnxruntime/assets/10530022/03454a8f-ab48-4552-bb9b-293a4f51be67)
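
The file descriptor leak in item 4 is a general pattern with fd-level output suppression. The following is a sketch of a leak-free version, not the ORTModule implementation; `suppress_fd` is a hypothetical name.

```python
import os
from contextlib import contextmanager


@contextmanager
def suppress_fd(fd):
    """Redirect writes on fd to the null device for the duration of the block."""
    saved = os.dup(fd)                          # extra fd no. 1
    devnull = os.open(os.devnull, os.O_WRONLY)  # extra fd no. 2
    try:
        os.dup2(devnull, fd)
        yield
    finally:
        os.dup2(saved, fd)  # restore the original target
        os.close(devnull)   # without these two closes, both extra
        os.close(saved)     # descriptors stay open after the block


with suppress_fd(1):
    os.write(1, b"this line is discarded\n")
os.write(1, b"this line is visible\n")
```

Both helper descriptors are closed in the `finally` branch, so the process has the same set of open fds before and after the block, even if the body raises.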
2023-08-07 14:01:36 +08:00
Chi Lo
a451318820
Refactor TRT EP error message with details (#17007)
If users set `trt_profile_min_shapes`, `trt_profile_max_shapes` and
`trt_profile_opt_shapes`, they need to provide shape profiles for all
dynamic shape inputs.
When the main graph is partitioned into TRT/CUDA subgraphs and an
input of a subgraph also has a dynamic shape, users need to provide
its shape profiles as well. Users might not notice this, so TRT EP now tells
them which input shape profiles need to be provided.

The new message is:
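
The check can be pictured as follows. This is a sketch in plain Python, not the TRT EP code; it assumes the profile options are strings of the form `name:d1xd2,...`, and both function names are hypothetical.

```python
def parse_profile(spec):
    """Parse 'x0:1x3x224,x1:1x8' into {'x0': [1, 3, 224], 'x1': [1, 8]}."""
    profile = {}
    for item in filter(None, spec.split(",")):
        # rpartition keeps any ':' inside the tensor name intact.
        name, _, dims = item.rpartition(":")
        profile[name] = [int(d) for d in dims.split("x")]
    return profile


def inputs_missing_profiles(dynamic_inputs, min_spec, max_spec, opt_spec):
    """List dynamic inputs that lack a min, max or opt shape profile."""
    profiles = [parse_profile(s) for s in (min_spec, max_spec, opt_spec)]
    return [
        name for name in dynamic_inputs
        if not all(name in p for p in profiles)
    ]


print(inputs_missing_profiles(["x0", "x1"], "x0:1x3", "x0:4x3", "x0:2x3"))
```

An input counts as covered only when it appears in all three of the min, max and opt specifications.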

```
  Traceback (most recent call last):
    File "/home/azureuser/disk2/debug/optional_inputs.py", line 218, in <module>
      test_optional_input_dynamic(trt_profile=True, optional=True)
    File "/home/azureuser/disk2/debug/optional_inputs.py", line 195, in test_optional_input_dynamic
      session = ort.InferenceSession(
    File "/home/azureuser/anaconda3/lib/python3.9/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 
  419, in __init__
      self._create_inference_session(providers, provider_options, disabled_optimizers)
    File "/home/azureuser/anaconda3/lib/python3.9/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 
  471, in _create_inference_session
      sess.initialize_session(providers, provider_options, disabled_optimizers)
  onnxruntime.capi.onnxruntime_pybind11_state.EPFail: [ONNXRuntimeError] : 11 : EP_FAIL : User needs to provide all the 
  dynamic shape inputs with associated profiles if they want to explicitly set profiles through provider options.
  Please note that main graph could be partitioned into TRT/CUDA/CPU subgraphs, in this case, user also needs to provide 
  shape profiles for the TRT subgraph's input if it's dynamic shape input.
  Following input(s) has no associated shape profiles provided: x1
```

Please see this github issue:
https://github.com/microsoft/onnxruntime/issues/16600
2023-08-06 09:04:21 -07:00
Dmitri Smirnov
d5e4bdbe7d
Fix protobuf TaggedStringPtr display (#17008)
### Description
Adjust natvis to display tagged strings.

### Motivation and Context
Hard to debug without seeing names.
2023-08-04 17:51:01 -07:00
Sheil Kumar
78a5f049f4
[DML] Model corrupter during layernorm fusion and DmlNonZeroOperator crashes (#16918)
Two issues fixed in this PR:
1) Changes to layernorm fusion regressed DirectML. The fusion has been
disabled for DML to unblock models.
2) DmlNonZero needs to create an operator call that needs to know the
number of non-zero elements (size in bytes). Therefore this needs to be
allocated during compute, but is being allocated during initialization.
This causes the output tensor size to mismatch with the operator's
expectations.

---------

Co-authored-by: Sheil Kumar <sheilk@microsoft.com>
2023-08-04 17:44:54 -07:00
Yifan Li
d6ce43db5e
[EP Perf] MemTest: Add Valgrind and fix addressSanitizer (#16930)
### Description
1. Add valgrind to existing ep_perf CI MemTest and parse ORT-TRT memLeak
   details
   1. General Valgrind logs and logs related to ORT-TRT will be parsed in
      [CI artifacts](https://aiinfra.visualstudio.com/Lotus/_build/results?buildId=334122&view=artifacts&pathAsName=false&type=publishedArtifacts)
      1. Logic:
         1. Run valgrind with `onnxruntime_perf_test -e tensorrt` and export
            the log to `valgrind.log`
         2. Identify whether any `definitely lost` memleak happened
            1. For log paragraphs which show `definitely lost`, check whether
               they contain the keyword `TensorrtExecutionProvider`.
            2. If so, extract these details to `ort_trt_memleak_detail.log`,
               and return `build failure` to EP Perf CI
2. Fix existing addressSanitizer and sync the squeezenet testcase with the
   latest update from
   [ort-inference-example](https://github.com/microsoft/onnxruntime-inference-examples/blob/main/c_cxx/squeezenet/main.cpp)
   1. Updates in short: upgrade main.cpp to use
      OrtTensorRTProviderOptionsV2
3. Reorder the 7-min-MemTest to be ahead of the 9-hr-model-tests, and enable
   MemTest by default
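
The log-parsing logic above can be sketched in a few lines. This is not the CI script; it assumes Valgrind's usual `==PID==` line prefix with separator lines between leak records, and the function names are hypothetical.

```python
import re

# Valgrind prefixes every line with "==<pid>== "; separator lines carry
# only the prefix. Strip it so that separators become empty lines.
_PID_PREFIX = re.compile(r"^==\d+==[ \t]?", re.MULTILINE)


def split_paragraphs(log_text):
    """Split a Valgrind log into paragraphs, one per leak record."""
    body = _PID_PREFIX.sub("", log_text)
    return [p.strip() for p in re.split(r"\n[ \t]*\n", body) if p.strip()]


def find_leaks(log_text, keyword="TensorrtExecutionProvider"):
    """Return (all 'definitely lost' paragraphs, those mentioning keyword)."""
    lost = [p for p in split_paragraphs(log_text) if "definitely lost" in p]
    return lost, [p for p in lost if keyword in p]
```

A CI step could then write the second list to a detail log and fail the build when it is non-empty.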
2023-08-04 16:58:57 -07:00
Yulong Wang
5af8774a0b
[build] do init and precheck first (#16961)
### Description
This change allows Web CI to run some checks as the first step, so that if
there are errors it won't launch the task to build WebAssembly, which
is heavy.

Checks include:
- "npm ci" in /js, /js/common and /js/web. This implicitly includes:
    - typescript compiler in /js
    - typescript compiler in /js/common
    - webpack build in /js/common
    - typescript compiler in /js/web
- ESLint on typescripts
- clang-format formatter (.js, .ts, .cc, .h, .mm)
- Prettier formatter (.json, .jsonc, .md)

---------

Co-authored-by: Caroline Zhu <carolinezhu@microsoft.com@orttrainingdev7.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
2023-08-04 16:44:45 -07:00
Chi Lo
fc8003349e
Add API for updating TRT EP provider option user compute stream (#16965)
Add a generic `UpdateTensorRTProviderOptionsWithValue()` C API to update
TensorRT provider options whose data type is a pointer that can't be
represented by a string.
2023-08-04 15:14:43 -07:00
Jiajia Qin
9ea0a3129b
[js/webgpu] Make sure only storage buffers are reused (#16893)
### Description
This PR makes sure that only storage buffers are reused. Previously, the
query buffer could also be taken from the freeBuffers list if the list
contained a buffer of matching size. But the two kinds of buffer have
different usages, which resulted in errors.
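
The reuse rule can be illustrated with a toy pool. The actual code is TypeScript in the WebGPU backend; this sketch only models the policy, with dictionaries standing in for GPU buffers and all names chosen for illustration.

```python
STORAGE = "storage"
QUERY = "query-resolve"


class BufferPool:
    """Toy free list that only ever recycles storage buffers."""

    def __init__(self):
        self._free = {}  # size -> list of released storage buffers

    def acquire(self, usage, size):
        if usage == STORAGE:
            bucket = self._free.get(size)
            if bucket:
                return bucket.pop()
        # Any other usage always gets a fresh buffer.
        return {"usage": usage, "size": size}

    def release(self, buffer):
        # Non-storage buffers are never added to the free list.
        if buffer["usage"] == STORAGE:
            self._free.setdefault(buffer["size"], []).append(buffer)


pool = BufferPool()
first = pool.acquire(STORAGE, 64)
pool.release(first)
query = pool.acquire(QUERY, 64)  # same size, but must not reuse `first`
print(query is first)
```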
2023-08-04 13:40:52 -07:00
satyajandhyala
7ad43d9564
[JS/Web] Fixed ArgMin and ArgMax and refactored (#17002)
Fixed ArgMin and ArgMax and refactored using functionality from Reduce
operator code.

### Description
Removed code/functionality duplication and fixed some issues.



2023-08-04 12:59:36 -07:00
Adrian Lizarraga
191f98a00e
[QNN EP] Improve QDQ model accuracy tests (#16916)
### Description
- Improves how unit tests measure the accuracy of QDQ models on QNN EP.
- Adds tests for ops: Add, Mul, Abs<sup>1</sup>, And<sup>1</sup>,
Or<sup>1</sup>, Ceil<sup>1</sup>, Cos<sup>1</sup>

<sup>1</sup>: Not previously supported due to missing node unit
handling.

### Motivation and Context
The new approach for testing QDQ operator accuracy requires running 3
inferences:

1. float model on CPU EP (baseline)
2. qdq model on CPU EP
3. qdq model on QNN EP

The unit tests check that running the QDQ model on QNN EP (3) is at
least as accurate (+- small tolerance) as running the QDQ model on CPU
EP (2). We measure accuracy by comparing to the baseline (1).

This is essentially what we care about: is QNN EP as accurate as CPU EP?
If not, it is worth investigating as a potential bug.
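
The comparison described above can be sketched as follows. The error metric (maximum absolute error) and the tolerance value are assumptions for illustration; the unit tests may measure accuracy differently.

```python
def max_abs_error(outputs, baseline):
    """Largest element-wise deviation from the float baseline."""
    return max(abs(o - b) for o, b in zip(outputs, baseline))


def qnn_is_accurate_enough(baseline, cpu_qdq, qnn_qdq, tolerance=1e-4):
    """QNN EP passes if its error is no worse than CPU EP's, within tolerance."""
    cpu_err = max_abs_error(cpu_qdq, baseline)   # inference 2 vs. 1
    qnn_err = max_abs_error(qnn_qdq, baseline)   # inference 3 vs. 1
    return qnn_err <= cpu_err + tolerance


baseline = [0.10, 0.50, 0.90]
cpu_qdq = [0.11, 0.49, 0.90]
qnn_qdq = [0.10, 0.505, 0.895]
print(qnn_is_accurate_enough(baseline, cpu_qdq, qnn_qdq))
```

Comparing both quantized runs against the float baseline avoids penalizing QNN EP for error that comes from quantization itself rather than from the EP.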
2023-08-04 12:15:27 -07:00
Baiju Meswani
e5bb7aba50
Add Gradient for Reciprocal (#16945)
2023-08-04 09:38:09 -07:00
Yi Zhang
555414f1aa
Set PR trigger rules (#16987)
### Description
Add a script to insert the trigger rules into workflow yamls.
As a first step, skip the windows gpu and linux gpu workflows when there
are only doc changes.

### Motivation and Context
Make it easy to skip workflows for doc-only changes.

[AB#18201](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/18201)
2023-08-04 08:21:07 -07:00
pengwa
a6887f171f
Refactor schema extraction and output unflattening (#16894)
### Motivation and Context

When we handle PyTorch models' inputs in different places (ORTModule or
others), it's common for us to flatten structured data into a 1-D
tensor list (required by libs such as torch.onnx.export,
torch.autograd.Function.forward or the ORT inference session), do the
subsequent work, then unflatten the result back to the original hierarchy
as return values.

The DeepSpeed stage 3 hooks support work also needs such a lib to do similar
things, so I propose extracting this pair of APIs into training/utils/,
where they can be used more generally. A comprehensive set of test
data is also used for testing unflatten/flatten in unit tests.
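
The flatten/unflatten round trip can be sketched with plain Python containers. This is an illustration of the idea only; `flatten_with_schema`, `unflatten_with_schema` and `Leaf` are hypothetical names and do not reflect the signatures of the helpers in `torch_io_helper.py`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Leaf:
    """Placeholder in the schema that points at a slot in the flat list."""
    index: int


def flatten_with_schema(data):
    """Return (schema, flat list) for nested dicts, lists and tuples."""
    flat = []

    def walk(node):
        if isinstance(node, dict):
            return {k: walk(v) for k, v in node.items()}
        if isinstance(node, (list, tuple)):
            return type(node)(walk(v) for v in node)
        flat.append(node)
        return Leaf(len(flat) - 1)

    return walk(data), flat


def unflatten_with_schema(schema, flat):
    """Rebuild the original hierarchy from a schema and a flat list."""
    if isinstance(schema, Leaf):
        return flat[schema.index]
    if isinstance(schema, dict):
        return {k: unflatten_with_schema(v, flat) for k, v in schema.items()}
    return type(schema)(unflatten_with_schema(v, flat) for v in schema)


schema, flat = flatten_with_schema({"a": [1, (2, 3)], "b": 4})
doubled = [x * 2 for x in flat]  # stand-in for the "subsequent work"
print(unflatten_with_schema(schema, doubled))
```

Because the schema stores only positions, the flat list can be replaced by any list of the same length before unflattening.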

Let me know if you have any other suggestions. 


### Refactor schema extraction and output unflattening

Move `_extract_schema` and `unflatten_user_output` in
`orttraining/orttraining/python/training/ortmodule/_io.py` to
`extract_data_and_schema` and `unflatten_data_using_schema` in
`orttraining/orttraining/python/training/utils/torch_io_helper.py` as
shared libs, which can be used later by other features (DeepSpeed stage
3 hook rewrite).

There is still some duplicated logic that flattens data for different
tasks by recursively looping over the data structure; it will be changed
step by step to avoid heavy review effort.
2023-08-04 13:58:21 +08:00
Edward Chen
f98d3f8a23
[CoreML EP] Enable inputs with dynamic shape (#16915)
Enable node inputs with dynamic shape to be handled by the CoreML EP.
2023-08-03 18:15:00 -07:00
Jeff Daily
1629a6fa75
[ROCm] add gfx1100 and gfx1101 to CMAKE_HIP_ARCHITECTURES (#16972)
### Description
Support additional AMD GPU architectures.

### Motivation and Context
AMD announced expanding support for additional GPUs.

https://community.amd.com/t5/rocm/new-rocm-5-6-release-brings-enhancements-and-optimizations-for/ba-p/614745

This PR is how we will deliver that expanded support to onnxruntime.
2023-08-04 08:38:42 +08:00
satyajandhyala
cc4b64f646
[JS/Web] Modify Reduce, Expand and Slice to pass op and node tests. (#16979)
### Description
Make the CacheHint mechanism work by adding input dims. The mechanism is
designed to avoid running the same test multiple times by saving the
result mapped against a key.



2023-08-03 15:48:47 -07:00
Tianlei Wu
a25d0d296b
Add --mask_type option to generate different format of attention mask in bert_perf_test.py (#16976)
### Description
Add an option to generate different formats of attention_mask for
testing transformers models:
1 - 1D mask index, actual sequence length excluding padding
2 - 2D attention mask. Value 0 means padding, 1 otherwise.
3 - 1D, key lengths and cumulated sequence lengths of query and key
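
The three formats can be illustrated for a small batch. This is a sketch, not the code in bert_perf_test.py: the exact element order for type 3 and the assumption that query and key lengths are equal are mine, and `make_mask` is a hypothetical name.

```python
from itertools import accumulate


def make_mask(mask_type, lengths, max_len):
    """Build an attention mask for per-sequence lengths (padding excluded)."""
    if mask_type == 1:
        # 1D mask index: actual sequence length of each batch entry.
        return list(lengths)
    if mask_type == 2:
        # 2D mask: 1 for real tokens, 0 for padding.
        return [[1] * n + [0] * (max_len - n) for n in lengths]
    if mask_type == 3:
        # 1D: key lengths, then cumulated query and key lengths.
        # Assumed layout; query and key lengths are taken to be equal.
        cumulated = [0] + list(accumulate(lengths))
        return list(lengths) + cumulated + cumulated
    raise ValueError(f"unsupported mask_type: {mask_type}")


for mask_type in (1, 2, 3):
    print(mask_type, make_mask(mask_type, [2, 3], max_len=4))
```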

2023-08-03 15:24:20 -07:00
Tianlei Wu
bda012a4b2
Scripts to convert model with MultiHeadAttention to packing mode (#16925)
### Description

Update scripts for converting a model with MultiHeadAttention to packing
mode.
- [x] Update symbolic shape inference for PackedMultiHeadAttention and
GatedRelativePositionBias
- [x] Update convert_to_packing_mode to handle models with
MultiHeadAttention


2023-08-03 15:23:55 -07:00
Edward Chen
06096fcb31
Hardcode xcodebuild destination iOS simulator OS to 16.4. (#16982)
2023-08-03 14:49:54 -07:00
Yulong Wang
641c3a4a37
[js/web] update op test schema (#16921)
### Description
Update op test schema.

This change fixes several problems with operator tests for web:
- `opsets` -> `opset`: an operator uses exactly one opset instead of
multiple
- `condition` -> `platformCondition`: make it less confusing
- `inputShapeDefinitions`: allows testing ORT behavior when it gets
no/partial/full shape info.

Added a JSON schema file and also an example file.
2023-08-03 14:20:20 -07:00
Arthur Islamov
ea55700e1c
[js/web] JSEP Gather OP (#16855)
### Description
Added Gather op that works with both i32 and i64 indices, assuming that
values fall into i32 limit. The assumption is safe because it's not
possible to allocate more than 2gb buffer for inputs.

It treats all data from input tensor as u32, copying 1 or 2 elements for
i64, u64 and double.
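
Treating every element as one or two `u32` words can be demonstrated with `struct`. This sketch assumes little-endian layout and is only a model of the idea; the real kernel is a WGSL shader, and the function names here are hypothetical.

```python
import struct


def i64_to_u32_words(values):
    """Reinterpret a list of int64 values as little-endian u32 words."""
    raw = struct.pack(f"<{len(values)}q", *values)
    return list(struct.unpack(f"<{len(values) * 2}I", raw))


def gather_words(words, indices, words_per_element):
    """Gather along axis 0, copying words_per_element u32 words per index."""
    out = []
    for i in indices:
        start = i * words_per_element
        out.extend(words[start:start + words_per_element])
    return out


data = [10, -1, 2**40]
words = i64_to_u32_words(data)
picked = gather_words(words, [2, 0], words_per_element=2)
restored = struct.unpack("<2q", struct.pack("<4I", *picked))
print(restored)

# A non-negative i64 index that fits in i32 has its value in the low word.
print(i64_to_u32_words([5]))
```

Copying raw words means the kernel never has to interpret 64-bit values, which is why the same path can serve i64, u64 and double data.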

---------

Co-authored-by: Guenther Schmuelling <guschmue@microsoft.com>
2023-08-03 14:09:37 -07:00
Arthur Islamov
acb9e56164
[js/web] JSEP Expand fix for inputs with rank < 2 (#16829)
### Description
If the Expand input has rank < 2, `inputIndicesHelper` and
`outputIndicesHelper` create indices as `u32` instead of `array<u32>`, and
`calculateInputIndex` throws an error.



### Motivation and Context
I've encountered this error while making StableDiffusion work with JSEP
2023-08-03 11:38:04 -07:00
Rachel Guo
757c42cea7
[rn] Update expo/config-plugins to 7.2.4 due to security warning with current version (#16977)
### Description

As title.

And manually validated it in the
https://github.com/fs-eire/ort-rn-hello-world test app with the
dev/updated version of onnxruntime-react-native package:


https://www.npmjs.com/package/onnxruntime-react-native/v/1.16.0-dev.20230712-a396a15fa6

### Motivation and Context

Resolve security warning issues. cc @skottmckay thanks author for the
changes.

Co-authored-by: Scott McKay <skottmckay@gmail.com>
2023-08-03 10:13:43 -07:00
Arthur Islamov
c11cffb565
[js/web] Fix typo in JSEP ConvTranspose (#16884)
### Description
A typo fix in JSEP ConvTranspose. It used $12 as the output shape pointer,
but it should be $13, because $12 holds the shape size.
2023-08-03 09:46:18 -07:00
Wei-Sheng Chin
e6c9ed0606
More element types in AllGather and AllToAll (#16941)
Three things are done in this PR.
- [2nd commit] More tensor element types are supported because in
distributed computation, we need to re-shard tensors of many different
types.
- [1st commit] We now specify the opset version in test models. Without this
change, those models will have opset=20 with the latest ONNX, which results
in test errors.
- [3rd commit] Tests are modified to test `AllGather` and `AllToAll` for
boolean tensors. Several graph patterns are tried for tests. We found
that `int64_tensor -> Cast -> bool_tensor -> AllToAll -> bool_tensor ->
Cast -> int64_tensor` always generates random results. My guess is that
`AllToAll` needs to synchronize all GPUs before calling `ncclSend` and
`ncclRecv`, since `AllGather` doesn't hit this problem. To reproduce
the error, search for `TODO` in this PR. Note that this PR doesn't fix
it.
2023-08-03 09:31:55 -07:00
BoarQing
b8bbc898c6
fix errors for node with empty name for vitis ai (#16949)
### Description
Fixed the issue of finding nodes with empty name for vitis ai.



### Motivation and Context
It is required because we encountered this error when testing newly
created models.
2023-08-02 19:08:49 -07:00
Dmitri Smirnov
246cb3a197
Simplify Shrink, replace Eigen in Sign implementation (#16975)
### Description
Simplify Shrink.
Replace Eigen code with the one that does not require fp16 conversion in
Sign.


2023-08-02 18:24:38 -07:00
Guenther Schmuelling
0df2e14038
js/webgpu: argmax,argmin,softmax support (#16882)
argmax and argmin are similar to reduce. Eventually we need to add
optimized flavors of the shader.

softmax is optimized but only works on the last axis for now which
should be the common use case.

todo: enable more ut for argmax/argmin
2023-08-02 18:16:19 -07:00
Hariharan Seshadri
506ddb3d5d
[js/WebGPU] Support int32 Transpose in WebGPU (#16952)
2023-08-02 16:27:24 -07:00
BoarQing
6361b22103
vitis ai support generic data type (#16902)
### Description
Support more data types for vitis ai.


### Motivation and Context
It is required because the models we are testing now have the uint8 data
type. To solve this once and for all, we changed the code to support generic
data types.
2023-08-02 15:56:39 -07:00
satyajandhyala
d399648869
[JS/Web] Added Resize kMSInternalNHWCDomain domain registration. (#16946)
### Description
Added Resize NHWC domain kernel registration.



2023-08-02 14:16:21 -07:00
Michael Klimenko
07e6648e12
Enable Intel oneAPI DPC++/C++ compiler build (#16587)
Last week I fixed error #16484, found when trying to build onnxruntime
with the icpx compiler. Another thing I found out is that icpx uses the
-ffast-math flag by default. You can check this by running the compiler
with the -v flag as follows:

```bash
# Setup the environment
. /opt/intel/oneapi/setvars.sh
# Compile any file to see all the implicit flags
icpx -v main.cpp
```

This leads to a bunch of warnings during the build like:

```bash
In file included from /mnt/f/wsl_home/onnxruntime/onnxruntime/test/providers/cpu/tensor/upsample_op_test.cc:5:
In file included from /mnt/f/wsl_home/onnxruntime/onnxruntime/test/providers/provider_test_utils.h:6:
In file included from /mnt/f/wsl_home/onnxruntime/onnxruntime/test/providers/checkers.h:10:
In file included from /mnt/f/wsl_home/onnxruntime/onnxruntime/core/util/math_cpuonly.h:68:
In file included from /mnt/f/wsl_home/onnxruntime/build/Linux/RelWithDebInfo/_deps/eigen-src/Eigen/Core:172:
/mnt/f/wsl_home/onnxruntime/build/Linux/RelWithDebInfo/_deps/eigen-src/Eigen/src/Core/MathFunctions.h:1019:12: warning: comparison with NaN always evaluates to false in fast floating point modes [-Wtautological-constant-compare]
    return isnan EIGEN_NOT_A_MACRO (x);
           ^~~~~~~~~~~~~~~~~~~~~~~~~~~
```
		   
And some tests are failing as well, usually with infinities involved. To
list a few:

```bash
# ...
1: [  FAILED  ] IsInfTest.test_isinf_float
1: [  FAILED  ] IsInfTest.test_isinf_double
1: [  FAILED  ] IsInfTest.test_isinf_positive_float
1: [  FAILED  ] IsInfTest.test_isinf_positive_double
1: [  FAILED  ] IsInfTest.test_isinf_negative_float
1: [  FAILED  ] IsInfTest.test_isinf_negative_double
1: [  FAILED  ] IsNaNOpTest.IsNaNFloat
1: [  FAILED  ] IsNaNOpTest.IsNaNDouble
# ...
```

This PR adds a quick global check for the IntelLLVM compiler (the
name under which CMake reports it) and then, depending on the compiler
driver, sets either the MSVC-like or the GCC-like switch to disable fast math.

A somewhat cleaner solution would probably be to use
```target_compile_options(${TARGET} PRIVATE MEOW)``` instead of a
global ```set(CMAKE_CXX_FLAGS MEOW)```, but then we would have
to add it to all the individual targets and execution providers, which
would lead to a lot of code duplication.
2023-08-02 12:50:35 -07:00
Tianlei Wu
76aff63f37
Update bert_perf_test to test inputs with different padding ratio (#16963)
Add --average_sequence_length and --random_sequence_length so that we
can test the performance of a model at different padding ratios.
2023-08-02 10:28:39 -07:00