Commit graph

8271 commits

Author SHA1 Message Date
mindest
bf2cc808a1
[ROCm] SkipLayerNorm: add more configs for block size; loosen constraints (#14900)
### Description
* add more configs for `threads_per_block` in SkipLayerNorm, also in
kernel explorer.
* loosen constraints for hidden_size, so that `SkipLayerNormSmallOp` can
be selected for larger hidden sizes.
* add flag for optional output in kernel_explorer


2023-03-09 22:27:01 +08:00
Yi Zhang
d55ae490e1
detach patch manylinux from get_docker_image (#14958)
### Description
Make patch manylinux one single step.


### Motivation and Context
If we want to use hash of docker-related files as the cache key, the
files should keep consistent before and after docker build.
And changes in generated build_scripts should trigger rebuilding the
image as well.
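
The cache-key idea above can be sketched as follows. This is only an illustration, not the pipeline's actual implementation: the function name `cache_key` and the choice of SHA-256 are assumptions. It shows why the hashed files must not change during the docker build: any change to their content changes the key.

```python
import hashlib
from pathlib import Path


def cache_key(paths):
    """Return a SHA-256 hex digest over the names and contents of the given files.

    Hypothetical helper: files are visited in sorted order so the key does not
    depend on the order in which the paths are passed in.
    """
    digest = hashlib.sha256()
    for path in sorted(Path(p) for p in paths):
        digest.update(path.name.encode("utf-8"))
        digest.update(b"\0")  # separator between file name and content
        digest.update(path.read_bytes())
        digest.update(b"\0")  # separator between files
    return digest.hexdigest()
```

If a build step patched one of the hashed files in place, the key computed before the build would no longer match the key computed after it, which is the inconsistency this change avoids.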
2023-03-09 15:40:58 +08:00
zhijiang
80e25ad6ac
fix cg issue (#14372)
### Description
tensorboard depends on rsa>=3.1.4, and rsa 4.5 has a known vulnerability,
so pin rsa to a higher version as suggested.

Fixed
[AB#7352](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/7352)



2023-03-09 15:28:11 +08:00
Yulong Wang
3c4efd2e77
[js/common] allows polyfill for bigint (#14921)
### Description
This change delays the execution of checking whether bigint is available
in the context. This allows polyfill for
`BigInt64Array`/`BigUint64Array` (if there is any)
2023-03-08 15:29:04 -08:00
Yulong Wang
8844474083
[js] remove 'npm bin' (#14943)
### Description
'npm bin' is deprecated in the latest version. Use 'npx' instead.

This PR resolves #14934
2023-03-08 15:03:27 -08:00
Ye Wang
d8d96f0788
Fix a build issue (#14944)
### Motivation and Context

https://github.com/microsoft/onnxruntime/issues/14940
2023-03-08 13:05:49 -08:00
Edward Chen
c46c7ccba5
Update Gradle version (#14862)
- Update Gradle version used in most places from 6.8.3 to 8.0.1. Update Android Gradle Plugin version where applicable.
  Not updated in this change: React Native Android projects (under `js/react_native/`). That can be done later along with updating the React Native projects.

- Add Gradle wrapper in `java/` to make it easier to consistently use a specific Gradle version.
2023-03-08 12:22:06 -08:00
Changming Sun
d9436407b6
Use safe allocator for JNI code (#13999)
### Description
Use a customized allocarray function to replace the original malloc
calls to avoid integer overflow.
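
The overflow-checked size computation behind such an allocator can be sketched as below. This is a Python simulation under stated assumptions, not the JNI code from this change: `checked_array_size` and `SIZE_MAX` are hypothetical names, and a 64-bit `size_t` is assumed.

```python
SIZE_MAX = 2**64 - 1  # assumption: size_t is 64 bits wide


def checked_array_size(count, element_size):
    """Return count * element_size, or raise if it would not fit in size_t.

    In C, malloc(count * element_size) silently wraps when the product
    exceeds SIZE_MAX (2**61 elements of 8 bytes wrap to 0), so the check
    must happen before the multiplication.
    """
    if count < 0 or element_size < 0:
        raise ValueError("count and element_size must be non-negative")
    if element_size != 0 and count > SIZE_MAX // element_size:
        raise OverflowError("count * element_size overflows size_t")
    return count * element_size
```

The division-based check is used because it never overflows itself, unlike computing the product first and inspecting the result.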

### Motivation and Context
Fix Prefast warnings. 

Fixed
[AB#8990](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/8990)
Fixed
[AB#8991](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/8991)
Fixed
[AB#9016](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/9016)
2023-03-08 11:40:55 -08:00
Adam Pocock
47f00b5d49
[Java] Initial on device training support (#14027)
contributor: @Craigacp
2023-03-08 10:01:08 -08:00
Ashwini Khade
f14ab63c19
fix prefast warnings (#14931)
### Description
Fixes prefast warnings

Fixed
[AB#11328](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/11328)
Fixed
[AB#11329](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/11329)
2023-03-08 09:49:15 -08:00
Hariharan Seshadri
112a4d215a
[CUDA] Support decoding multihead self-attention implementation (#14848) 2023-03-08 09:17:54 -08:00
Kyushick Lee
c696392f0c
Support external output tensors for DORT (#14516)
### Description
Support externally-managed output tensors (torch Tensors) for dort. 
Add `preallocate_output` option to OrtBackend to rely on
externally-managed output tensors for dort.


### Motivation and Context
DORT currently allocates and returns output ortvalues and converts them
to torch Tensors. The dlpack-based conversion does not support torch
Tensors for custom Aten backends, and it is not yet possible to transfer
ownership from an ortvalue to an external handle (torch Tensor).

To avoid this issue, this PR provides an option
(`preallocate_output`) to allocate output tensors externally in pytorch,
which creates torch Tensors for an Aten backend, and lets dort take
pointers from those torch Tensors to construct output ortvalues instead of
allocating them inside InferenceSession.
2023-03-07 21:32:23 -08:00
edgchen1
2ef25a2200 Update CODEOWNERS file. 2023-03-07 17:56:37 -08:00
edgchen1
5b3f79a11a Add gradle wrapper validation workflow. 2023-03-07 17:56:37 -08:00
Ashwini Khade
f71ac9859e
Update acpt image in the training pipeline (#14855)
### Description
Current pipeline refers to an old image which is causing test failures.
Updating the image to the latest one.



### Motivation and Context
Fixes pipeline failure:
https://dev.azure.com/onnxruntime/onnxruntime/_build?definitionId=198
2023-03-07 14:10:32 -08:00
pengwa
5d8ce817cb
Fix simplified layer norm fusion for training (#14866)
### Fix simplified layer norm fusion for training

Co-author with @prathikr.

Fix bug identified by @prathikr.
https://github.com/microsoft/onnxruntime/issues/14822.

When running the T5 model with deepspeed enabled, simplified layer norm is
not fused because the device check did not pass

b7fde84341/onnxruntime/core/optimizer/layer_norm_fusion.cc (L568).
Because there is no device placement during the pre-training optimization
pass, it is expected that the device check is not fulfilled.

On the other hand, the device check is still needed for simplified
layer norm fusion to work correctly for CPU runs. As a mitigation, this
change adds a flag to indicate whether the fusion is triggered by
pre-training optimization. There is still a risk when we run ORTModule
training with the CPU EP, but I feel the risk can be much reduced if we
check that CUDA/ROCM is enabled for the build.

```
CUDA_VISIBLE_DEVICES=0 python examples/onnxruntime/training/summarization/run_summarization.py --model_name_or_path t5-small --do_train --dataset_name cnn_dailymail --dataset_config "3.0.0" --source_prefix "summarize: " --predict_with_generate --overwrite_output_dir --output_dir /bert_ort/pengwa/output --fp16 --max_steps 1 --logging_steps 1 --deepspeed aml_ds_config_zero_1.json
```

2023-03-07 13:59:20 -08:00
Patrice Vignola
65f1f840f6
[DML EP] Fix Attention regression caused by removing transposes (#14908)
By removing the transposes and using strides instead, the metacommands
are not able to be reached anymore since it's not using NCHW layout.
2023-03-07 11:17:28 -08:00
Xavier Dupré
6b604521a6
Fix tree implementation when left, right node have lower index (#14839)
### Description
The previous implementation did not support a node whose left or right
child has an index lower than the node itself. This condition prevented
the tree traversal from entering an infinite loop, but LightGBM does not
follow that rule. The changes do not alter the algorithm; they only remove
the test enforcing that condition.



### Motivation and Context
It fixes a regression introduced by #14670.
2023-03-07 19:47:12 +01:00
Hitesh Shah
66101c02a2 Implement AllToAll collective op 2023-03-07 10:17:07 -08:00
Adam Pocock
150043f74f
Adds a Java accessor for GetVersionString (#14876)
### Description
Java part of #14873.
2023-03-07 09:46:56 -08:00
Xavier Dupré
5930e7e22f
Introduce RemovableAttributes (#14868)
### Description
TreeEnsemble* kernels fully copy all the parameters from the ONNX
graph. Even if they are no longer needed or are unused (hitrates), they
remain in memory. For big models (>= 200 trees, max_depth > 10), the model
usually weighs more than 10 Mb. This change offers a kernel the
possibility to remove all unneeded attributes after they were used to
create the session. Attributes are deleted after the model was possibly
saved, at the end of the session creation.

The current design is up for debate:
* it stores the list of removable attributes in class
`onnxruntime::Node`,
* the node is marked as `const` every time this implementation needs to
register the name of a removable attribute or to remove them.

The current implementation is just a POC as it needs to cast
`onnxruntime::Node*` into `const onnxruntime::Node*`.

Should we keep the list of removable attributes in `onnxruntime::Node`?

### Motivation and Context
Motivation is mostly to reduce memory consumption.

---------

Signed-off-by: xadupre <xadupre@microsoft.com>
2023-03-07 12:37:12 +01:00
JiCheng
be1416d032
prefast C26451 (#14933)
Fixed
[AB#13290](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/13290)
2023-03-07 15:16:50 +08:00
Changming Sun
3e08a67dd6
Add Linux ARM64 CI pipeline (#14904) 2023-03-06 21:47:10 -08:00
Adrian Lizarraga
d45b47945c
Linux QNN Pipeline: fix build error reporting (#14922)
### Description
Split up the ORT build step in the Linux QNN CI Pipeline.


### Motivation and Context
Build errors were not being immediately reported at the end of the build
step. The build step currently concatenates multiple shell commands, and
the return code for the last (mkdir) was being reported. This PR ensures
that the return code of the `python build.py ...` command is reported
for the build step.
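
The failure mode described above can be reproduced with a small sketch. It assumes a POSIX `sh` is available and that the commands were joined with `;`; the helper names are hypothetical and this is not the pipeline's actual script.

```python
import subprocess


def run_chained(commands):
    """Run all commands in one shell, joined by ';'.

    The shell reports only the exit code of the last command, so an
    earlier failure is hidden if the last command succeeds.
    """
    return subprocess.run(["sh", "-c", "; ".join(commands)]).returncode


def run_separately(commands):
    """Run each command as its own step and stop at the first failure."""
    for command in commands:
        code = subprocess.run(["sh", "-c", command]).returncode
        if code != 0:
            return code
    return 0
```

With a failing first command followed by a succeeding one, `run_chained` returns 0 while `run_separately` returns the failing command's non-zero code, which is why splitting the build step surfaces the build error.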
2023-03-06 17:49:35 -08:00
Sheil Kumar
f88b97ede2
Cast to int32_t->size_t to avoid prefast overflow warning (#14902)
Cast to int32_t->size_t to avoid prefast overflow warning
2023-03-05 06:21:46 -08:00
Tianlei Wu
6c8538f086
Fix prefast warning (#14895)
Fix a prefast warning: `The const variable 'spatial_dim_start' can be
computed at compile-time. Consider using constexpr (con.5).`
2023-03-03 12:54:28 -08:00
Changming Sun
c1155b70c5
Remove 37 and 50 from CUDA compute archs (#14874)
### Description
This reduces the CUDA package's size a little. 37 is for Tesla K80. Azure's
NC-series uses it, but in most cases CUDA can dynamically generate device
code.
2023-03-03 12:24:21 -08:00
George Wu
289f7dbcdd
enable pybind for qnn ep (#14897)
enable python bindings for QNN EP.
tested on Windows Dev Kit 2023 (ARM64) with python 3.11 (ARM64) from 
https://www.python.org/ftp/python/3.11.1/python-3.11.1-arm64.exe
2023-03-03 07:26:53 -08:00
pengwa
f6c81d8aca
Introduce padding inspector in ORTModule (#14652)
### Introduce padding inspector in ORTModule

In some Transformer-based LLM training recipes, high data sparsity is
observed due to 1) token padding (to max sequence length) and 2) labels
that contain many ignore_index values for the loss calculation.

This PR introduces a switch to enable data sparsity inspection, which
1). in the short term, can inform training users to use techniques like
dynamic batching to mitigate the issue.
2). in the medium and longer term, also helps us (the training team) to
better understand what our training customers' models look like from
the perspective of data sparsity (and potentially motivates us to improve
the runtime).

Here is an example of different data sparsity with the same training model
arch and the same training input, but with different user models.

**Low Embed Density, High Label Density Case - Sentence Classification**
`
python -m torch.distributed.launch --nproc_per_node=4
examples/onnxruntime/training/text-classification/run_glue.py
--model_name_or_path roberta-large-openai-detector --task_name mnli
--do_train --do_eval --max_seq_length 128 --per_device_train_batch_size
32 --learning_rate 2e-5 --num_train_epochs 3 --overwrite_output_dir
--output_dir ./outputs/ --per_device_eval_batch_size 32 --seed 1137
--fp16 True --ignore_mismatched_sizes True --optim adamw_ort_fused
`
```
>>>Valid token/label density (e.g. valid/total) in passing 10 steps:
        | STEP       | INPUT TYPE |  INPUT NAME     | PAD IDX    | DENSITY    | VALID TOKENS    | TOTAL TOKENS    | VALID TOKENS/BATCH |
        | 60         | EMBED      | input_ids       | 1          | 35.21    % | 1442            | 4096            | [50, 81, 35, 11, 29, 36, 66, 19, 40, 22, 21, 42, 17, 37, 40, 41, 26, 58, 38, 54, 41, 73, 48, 57, 50, 51, 49, 85, 48, 36, 79, 62] |
        | 61         | LABEL      | labels          | -100       | 100.00   % | 32              | 32              | N/A             |
        | 62         | EMBED      | input_ids       | 1          | 30.00    % | 1229            | 4096            | [36, 73, 13, 47, 27, 33, 53, 25, 51, 28, 36, 42, 42, 32, 39, 52, 27, 13, 31, 66, 42, 45, 52, 45, 58, 42, 37, 66, 12, 18, 29, 17] |
        | 63         | LABEL      | labels          | -100       | 100.00   % | 32              | 32              | N/A             |
        | 64         | EMBED      | input_ids       | 1          | 26.73    % | 1095            | 4096            | [37, 28, 20, 53, 16, 20, 44, 52, 27, 28, 16, 19, 16, 24, 63, 31, 24, 42, 33, 41, 44, 60, 44, 67, 54, 30, 20, 19, 33, 23, 24, 43] |
        | 65         | LABEL      | labels          | -100       | 100.00   % | 32              | 32              | N/A             |
        | 66         | EMBED      | input_ids       | 1          | 30.03    % | 1230            | 4096            | [22, 46, 36, 41, 46, 43, 26, 50, 60, 16, 24, 42, 56, 35, 35, 59, 29, 39, 34, 20, 66, 23, 47, 53, 19, 35, 44, 23, 34, 81, 21, 25] |
        | 67         | LABEL      | labels          | -100       | 100.00   % | 32              | 32              | N/A             |
        | 68         | EMBED      | input_ids       | 1          | 31.62    % | 1295            | 4096            | [75, 36, 48, 20, 38, 21, 49, 54, 38, 41, 26, 28, 80, 45, 48, 16, 22, 41, 34, 28, 37, 16, 74, 63, 62, 34, 22, 45, 23, 27, 37, 67] |
        | 69         | LABEL      | labels          | -100       | 100.00   % | 32              | 32              | N/A             |
<<<
```

**High Embed Density, Low Label Density Case - masked language model** 
`
python -m torch.distributed.launch --nproc_per_node=4
examples/onnxruntime/training/language-modeling/run_mlm.py
--model_name_or_path bert-base-uncased --dataset_name wikitext
--dataset_config_name wikitext-2-raw-v1 --num_train_epochs 10
--per_device_train_batch_size 8 --per_device_eval_batch_size 8
--do_train --do_eval --overwrite_output_dir --output_dir ./outputs/
--seed 1137 --fp16 --report_to none --optim adamw_ort_fused
`
```
>>>Valid token/label density (e.g. valid/total) in passing 10 steps:
        | STEP       | INPUT TYPE |  INPUT NAME     | PAD IDX    | DENSITY    | VALID TOKENS    | TOTAL TOKENS    | VALID TOKENS/BATCH |
        | 710        | EMBED      | input_ids       | 0          | 100.00   % | 4096            | 4096            | [512, 512, 512, 512, 512, 512, 512, 512] |
        | 711        | LABEL      | labels          | -100       | 13.77    % | 564             | 4096            | N/A             |
        | 712        | EMBED      | input_ids       | 0          | 100.00   % | 4096            | 4096            | [512, 512, 512, 512, 512, 512, 512, 512] |
        | 713        | LABEL      | labels          | -100       | 14.48    % | 593             | 4096            | N/A             |
        | 714        | EMBED      | input_ids       | 0          | 100.00   % | 4096            | 4096            | [512, 512, 512, 512, 512, 512, 512, 512] |
        | 715        | LABEL      | labels          | -100       | 14.18    % | 581             | 4096            | N/A             |
        | 716        | EMBED      | input_ids       | 0          | 100.00   % | 4096            | 4096            | [512, 512, 512, 512, 512, 512, 512, 512] |
        | 717        | LABEL      | labels          | -100       | 14.53    % | 595             | 4096            | N/A             |
        | 718        | EMBED      | input_ids       | 0          | 100.00   % | 4096            | 4096            | [512, 512, 512, 512, 512, 512, 512, 512] |
        | 719        | LABEL      | labels          | -100       | 15.31    % | 627             | 4096            | N/A             |
<<<
```
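
The DENSITY column in the tables above is the ratio of valid to total entries, for example 1442 / 4096 ≈ 35.21 %. A minimal sketch of that calculation follows; the function name `token_density` is hypothetical and this is not the inspector's actual code.

```python
def token_density(batch, pad_idx):
    """Compute the share of non-padding entries in a batch of sequences.

    batch:   list of equal- or variable-length lists of token/label ids
    pad_idx: id treated as padding (e.g. 1 or 0 for input_ids, -100 for labels)

    Returns (density in percent, valid count, total count, valid count per row).
    """
    per_row = [sum(1 for token in row if token != pad_idx) for row in batch]
    valid = sum(per_row)
    total = sum(len(row) for row in batch)
    density = 100.0 * valid / total if total else 0.0
    return density, valid, total, per_row
```

For input_ids the per-row counts correspond to the VALID TOKENS/BATCH column; for labels only the aggregate density is reported in the tables.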

#### Next Step

Let's see how we leverage the data sparsity for improvement.
Optimizations on the way around compute optimizer wave 2:
> Loss compute flops reduction.
> Flatten/Unflatten embedding tokens to save compute flops.
2023-03-03 18:36:08 +08:00
Yi Zhang
8c454a76e0
Check Mac silicon package name (#14898)
### Description
1. add comments 
2. check Mac silicon package name 

### Motivation and Context
There is no Mac silicon agent in ADO, so we cannot add a smoke test to
verify that the wheel can be installed.
However, we can check whether the package name is correct, to avoid
repeating the mistake in the 1.14 release.

Test run

https://dev.azure.com/aiinfra/Lotus/_build/results?buildId=283100&view=logs&j=fe710151-df7c-5aa4-0cea-cf5331faa499&t=3182cefe-2612-53c6-4445-e5b3e0c4ac57
2023-03-03 18:27:54 +08:00
cloudhan
a997bb46b6
Refactor rocm attention (#14688)
Extract QKV projection and attention computation into pipelines (composed from gemms and kernel launch). 

This will allow us to introduce ck flash attention in next PR
2023-03-03 12:16:11 +08:00
Changming Sun
f3b6664384
Remove Python 3.7 from the python packaging pipeline (#14887)
### Description
1. Remove Python 3.7 from the python packaging pipeline. It is planned
for the next release and approved by the PMs. Also we will add 3.11, but
it will be addressed in another PR.
2. Stop generating python packages based on Ubuntu 18.04 which will
reach EOL next month. We will either replace them with Ubuntu 20.04 or a
CentOS 8 variant.
2023-03-02 19:44:49 -08:00
guyang3532
c49f250a14
Del ort_model._modules to forward its accessing to torch_model._modules (#14563)
A missing '_modules' attribute in ORTModule causes load_state_dict for
wrapped_ortmodule to fail.

Reference: https://github.com/microsoft/onnxruntime/pull/7847
2023-03-03 10:12:37 +08:00
Dmitri Smirnov
8d87fdcfa1
Add GetVersionString API for C++, C# and Python (#14873)
### Description
Added APIs.

### Motivation and Context
Addresses https://github.com/microsoft/onnxruntime/issues/14584

Cc: @Craigacp
2023-03-02 17:11:07 -08:00
Zachary Streeter
6e2ca15140
added miopenGetConvolutionSpatialDim if ROCm5.5 (#14772)
The API should be `miopenGetConvolutionSpatialDim(cdesc, &spatial_dim)`, NOT `miopenGetConvolutionDescriptorSize(cdesc, &spatial_dim)`
2023-03-03 08:02:32 +08:00
Yuriy Chernyshov
0ebe8e34f8
Do not use ADL to invoke std algos (#14851)
This is a follow up for #14716
2023-03-02 15:38:33 -08:00
Chun-Wei Chen
70a31e047a
Consume ONNX 1.13.1 in ONNX Runtime (#14812)
### Description
Consume ONNX 1.13.1 in ONNX Runtime. (ONNX 1.13.0 to ONNX 1.13.1)


### Motivation and Context
ONNX 1.13.1 patch was just released yesterday. This PR is making ORT's
ONNX submodule consistent with the latest released ONNX. Not sure
whether this PR is really needed, but let me make it ready. Previous PR
for testing ONNX 1.13.1rc2 :
https://github.com/microsoft/onnxruntime/pull/14634.

Fixed
[AB#13174](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/13174)
.
2023-03-02 14:57:35 -08:00
G. Ramalingam
2facc5efe6
Fix function inliner bug re. outer-scope names (#14734)
### Description

Fix the function inliner logic for renaming variables. Typically, a
FunctionProto does not contain references to outer-scope names. However,
special cases, such as the function-expansion of SequenceMap, can
generate such FunctionProtos. Extend the renaming logic to ensure that
references to outer-scope names are not renamed.

### Motivation and Context
Fixes https://github.com/onnx/onnx/issues/4892

Signed-off-by: Ganesan Ramalingam <grama@microsoft.com>
2023-03-02 12:08:37 -08:00
Chen Fu
603026fb84
Transpose for 16b tensors (#14877)
### Description
Matrix transpose for 16b tensors (shorts, and half precision floats)


### Motivation and Context

Need it for fp16 operations
2023-03-02 11:32:05 -08:00
Rachel Guo
7cd4b334a9
[CoreML EP] Add Flatten Op and LRN Op support (#14857)
### Description

As title.

CoreML Spec for reference: 


https://apple.github.io/coremltools/mlmodel/Format/NeuralNetwork.html#flattento2dlayerparams


https://apple.github.io/coremltools/mlmodel/Format/NeuralNetwork.html#lrnlayerparams

### Motivation and Context

Fill CoreML Clipchamp usage gaps.

---------

Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net>
2023-03-02 09:43:15 -08:00
Hector Li
bf35ad2aa3
[Qnn EP] Call Qnn deviceCreate during backend setup and deviceFree during shutdown (#14875)
### Description
Call Qnn deviceCreate during backend setup and call deviceFree during
shutdown

### Motivation and Context
Align with the formal Qnn setup and shutdown procedure.
2023-03-02 08:54:13 -08:00
cao lei
d69823f764
Do not create Barrier and triggerDownstream steps if the corresponding nodes are split by yield Op in training scenario (#14570)
### Description
Do not create Barrier and triggerDownstream steps during execution plan
creation if the corresponding nodes are split by yield Op in training
scenario.



### Motivation and Context
In the training scenario, the forward and backward passes each run a
different partial set of a graph's nodes. If two nodes fall in different
partial graphs and are placed in separate streams, triggerDownstream/barrier
steps are still created between them. These steps behave quite differently
from the inference case, because one of the steps will not be executed when
it is outside the range being run. To make this work, there is a hacky way
to trigger the barrier step explicitly for training.
This PR adds a check and does not create Barrier and
triggerDownstream steps if the corresponding nodes are split by a yield Op
in the training scenario, so the hacky way is no longer needed.
2023-03-02 07:08:29 -08:00
Tianlei Wu
c66af46fc1
Doc for Stable Diffusion CUDA Optimizations (#14830)
Add document for stable diffusion optimizations and benchmark.
2023-03-01 19:29:30 -08:00
Hector Li
c6074f3a4b
OnnxRuntime QNN EP (#14791)
### Description
Integrate Qualcomm QNN SDK to enable inference on QC hexagon NPU devices

### Motivation and Context
Enable Ort inference on QC hexagon NPU devices.

---------

Co-authored-by: Satya Jandhyala <sajandhy@microsoft.com>
Co-authored-by: Adrian Lizarraga <adlizarraga@microsoft.com>
Co-authored-by: Adrian Lizarraga <adrianlm2@gmail.com>
2023-03-01 13:48:20 -08:00
George Wu
6044312a43
fix TRT dockerfile documentation https://github.com/microsoft/onnxruntime/issues/14556 (#14600)
address https://github.com/microsoft/onnxruntime/issues/14556
2023-03-01 07:02:42 -08:00
Scott McKay
b7fde84341
Changes to support standalone custom ops in a minimal build. (#14497)
### Description
Changes to support standalone custom ops in a minimal build. Also
incorporates changes from #14492 (needed to test builds prior to that
being checked in).

We first need to save the schema info from the operators used by the
standalone op invoker in the ORT format model. Add mechanism for that.

Merge the kernel lookup logic so the same is used in full and minimal
build. NOTE: the version matching is now consistent with all other
kernel lookups, and the call to CreateOp MUST use the exact version for
the operator. Previously matching wasn't as strict, but this can lead to
the incorrect kernel being chosen.

Add tests.

NOTE: There is currently no way to detect the ops/types/opsets used
inside these custom ops as they don't exist until we create kernels,
which is after model loading completes (which is the point the ORT
format model is saved). Due to that they have to be manually added to
the configuration used to do the reduced ops build. That shouldn't be
too hard for the custom op author to add given the custom op
implementation is specifying the op, opset and type constraints (i.e.
they have the info and it's just a case of capturing/formatting it
correctly).


### Motivation and Context
Enable usage of the standalone op invoker by custom ops in a minimal
build.

---------

Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
2023-03-01 11:22:54 +10:00
Chen Fu
acc2ac627f
Fp16 Activations (#14722)
### Description

NEON fp16 SIMD implementation of Activation functions


### Motivation and Context
Step 2 of fp16 SIMD support.

---------

Co-authored-by: Chen Fu <fuchen@microsoft.com>
2023-02-28 17:20:40 -08:00
Yulong Wang
69c5edb11b
[wasm] upgrade emsdk from 3.1.19 to 3.1.32 (#14818)
### Description
upgrade emsdk from 3.1.19 to 3.1.32

also add explicit config for stack size (1MB).
2023-02-28 11:06:09 -08:00
Nat Kershaw (MSFT)
95c777b745
Fix link to High Level Design (#11786)
Address #11661
2023-02-28 11:05:54 -08:00
Yi Zhang
6320decf04
increase Test GPU Job's timeout to 8 hours (#14850)

### Motivation and Context
In practice, 6 hours is not enough to finish the job.
2023-02-28 18:52:03 +08:00