Commit graph

8301 commits

Author SHA1 Message Date
Rachel Guo
db4e664f7c
Re-enable react native e2e android unit test for CI and upgrade targetSDK level for test project (#14989)
### Description

Re-enable the React Native e2e Android unit test in the React Native CI,
because the recent change to specify `default` instead of `google-apis` in
the Android emulator CI tests has produced stable results so far.

Upgrade the `targetSdkVersion` of the Gradle test project in
react-native/android to meet the minimum target API level requirement for
Google Play apps.


https://support.google.com/googleplay/android-developer/answer/11926878?hl=en

### Motivation and Context

React Native CI issue.
2023-03-14 13:35:38 -07:00
Alex Kogan
8b09702b88
Enable parallel computation in Clip ops (#14925)
### Description
This PR speeds up Clip operations by replacing their sequential
implementation with a parallelized one. The parallelization is achieved
by dividing the input data into chunks of size N and using a thread pool
to process the chunks in parallel. The chunk size N is set to 16K based
on performance evaluation on input tensors of 10^i elements for i in [1
.. 6].
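
The chunking scheme described above can be sketched in a few lines. This is an illustrative Python sketch, not the C++ kernel from the PR: `parallel_clip`, `clip_chunk`, and the `workers` parameter are hypothetical names, and only the 16K chunk size comes from the description. Python threads will not reproduce the reported speedup; the sketch only shows how the input is partitioned.

```python
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 16 * 1024  # 16K elements per chunk, as stated in the PR description


def clip_chunk(data, out, start, end, lo, hi):
    """Clip data[start:end] into out[start:end]; chunks never overlap."""
    for i in range(start, end):
        out[i] = min(max(data[i], lo), hi)


def parallel_clip(data, lo, hi, chunk_size=CHUNK_SIZE, workers=4):
    """Split the input into fixed-size chunks and clip each chunk on a thread pool."""
    n = len(data)
    out = [0] * n
    ranges = [(s, min(s + chunk_size, n)) for s in range(0, n, chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(clip_chunk, data, out, s, e, lo, hi) for s, e in ranges]
        for f in futures:
            f.result()  # re-raise any exception from a worker
    return out
```

Each chunk writes to a disjoint slice of the output, so no locking is needed between workers.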


### Motivation and Context
The Clip operation is frequently executed in image processing models.
Its implementation can be easily parallelized and therefore sped up when
executed on a multi-core machine. On long inputs (>= 100K elements) this
PR achieves speedup of over 2x. On shorter inputs, this PR does not
introduce any substantial performance change.
2023-03-14 09:41:44 -07:00
PeixuanZuo
2ff7f3e93a
[ROCm] support optimized Stable Diffusion model (#14980)
Add BiasSplitGelu/BiasAdd/GroupNorm/NhwcConv operator for ROCm EP.

1. BiasSplitGelu and BiasAdd operators can be automatically hipified
from CUDA EP.
2. GroupNorm was hipified from CUDA EP and modified to build.
3. NhwcConv is similar to NhwcConv in CUDA EP, but the MIOpen API and
cuDNN API are different. `miopenConvolutionForwardbias` and
`miopenOpTensor` of MIOpen don't support NHWC layout now, so use
BinaryElementwise to replace miopenConvolutionForwardbias (NHWC layout).
2023-03-14 23:15:37 +08:00
PeixuanZuo
ff2850029b
[ROCm] Refactor long SkipLayernorm if-elseif statements (#14795)
Refactor the long if-elseif statements in SkipLayernorm.
2023-03-14 23:04:55 +08:00
Ye Wang
0fa00429d5
[T5 optimization] script fusions and fixes (#14967)
### Description

1. added script for t5 encoder self attention and t5 decoder self/cross
attention fusions.
2. added simplified layernorm fusion for the --external_data_format scenario
(otherwise relying on the ORT optimizer).
3. added rel_pos_bias shape inference code, modified attention/mha shape
inference script.
4. reworked graph_topologic_sort() because the current implementation
is not functioning correctly. also added an option to topo-sort the
graph in a deterministic way to let tests pass.
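
A deterministic topological sort, as mentioned in item 4, can be sketched as follows. This is not the reworked `graph_topologic_sort()` itself, which is not shown here; it is Kahn's algorithm with a min-heap for tie-breaking, one common way to make the order independent of input ordering. The function name `deterministic_topo_sort` and the `(nodes, edges)` representation are assumptions made for illustration.

```python
import heapq


def deterministic_topo_sort(nodes, edges):
    """Kahn's algorithm; ties between ready nodes are broken by smallest name."""
    indegree = {n: 0 for n in nodes}
    successors = {n: [] for n in nodes}
    for src, dst in edges:
        successors[src].append(dst)
        indegree[dst] += 1

    ready = [n for n in nodes if indegree[n] == 0]
    heapq.heapify(ready)  # min-heap makes the pop order deterministic

    order = []
    while ready:
        node = heapq.heappop(ready)
        order.append(node)
        for nxt in successors[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                heapq.heappush(ready, nxt)

    if len(order) != len(nodes):
        raise ValueError("graph contains a cycle")
    return order
```

Because the heap always yields the smallest ready node, two graphs with the same nodes and edges produce the same order regardless of how the inputs were listed.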

note:
1. the t5-beamsearch export code is slightly modified. specifically,
encoder_hidden_states(ehs) is no longer an input to the t5 decoder since
the ehs is not actually used in the graph execution.
2. recent PRs do not add optimizations to t5 on cpu. 
3. the fp32 model(encoder and decoder) for t5-small, t5-base and
t5-large can get a parity of e-5 and the corresponding beam search
models generate same results as pytorch.
4. fp16 (mixed-precision) models, however, get a parity around 3e-2 and
some have a maximum diff a bit over 3e-2. But the beam search models still
generate the same results as pytorch (based on limited input data).
5. mt-5 model has a parity issue at the moment, even before any
optimization. will investigate later.

### Motivation and Context

---------

Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
2023-03-13 23:35:56 -07:00
Christian Veenhuis
59dfcfdce7
Fix typos in sources: operater, tranform, neccessary, trainig (#14907)
### Description
While browsing the sources I found several typos here and there.
I collected them into a single PR and fixed them.
The typos are: operater, tranform, neccessary, trainig.
After the fix, none of them are found anymore:

$ git grep "operater"
$ git grep "tranform"
$ git grep "neccessary"
$ git grep "trainig"
$ 

### Motivation and Context
Since some of the typos are in example notebooks and markdown files,
users can see them.
2023-03-13 22:45:04 -07:00
Ye Wang
538d64891a
[t5 optimization] kernel changes to t5 (#14928)
### Description

1. support optional bias in Attention op (used in T5 encoder)
2. support broadcasting rel_pos_bias in attention_softmax.h
3. add scale in MHA op's attributes
4. support past_key/past_value and present_key/present_value in MHA
5. UT and parity tests are added
6. fix an issue: https://github.com/microsoft/onnxruntime/issues/14920

note: the fusions will be in another PR since mt5 needs to be tested and
an issue from github will be investigated.

Future works:
1. support shared buffer for past/present
2. enable trt kernels when possible and investigate (trt/cutlass) kernels
with rel_pos_bias
3. support KV/QKV packing with past/present

### Motivation and Context

---------

Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
2023-03-13 14:29:16 -07:00
Dmitri Smirnov
b34e570ad0
Enable LeakyRelu latest and refactor fast_gelu_fusion to enable the script (#15003)
### Description
Enable LeakyRelu latest since the last version differs only in type
support.
Refactor `fast_gelu_fusion` to enable the script, because our script is
unable to check if any of the optimizers are outdated and no longer in
effect.

### Motivation and Context
We do not want to lose performance.

The next step is to file improvement issues if any are required.
2023-03-13 14:20:11 -07:00
Nat Kershaw (MSFT)
a5d814008c
Fix API docs deploy so that a PR is not required (#15011)
Fixes this
[issue](https://github.com/microsoft/onnxruntime/actions/runs/4387534694/jobs/7682945415#step:12:534)
and removes the extra PR step in the workflow.

Also logs the commit of the main branch that the docs were generated
from to a file called version.txt at the root of the API docs tree.

Tested for Java API docs and results staged here:
https://natke.github.io/onnxruntime/docs/api/java/index.html

If approved, I can migrate all of the other API docs generation
workflows to use this scheme.
2023-03-13 09:36:08 -07:00
pengwa
44dda08b51
Renaming files (#15015)
### Renaming files for compute optimizer

### Motivation and Context

A follow up for https://github.com/microsoft/onnxruntime/pull/14832
2023-03-13 17:07:59 +08:00
PeixuanZuo
c55f347689
[ROCm] change miopen_conv_use_max_workspace=true (#14982)
Change miopen_conv_use_max_workspace=true to get best algorithm during
`miopenFindConvolutionForwardAlgorithm` process.
2023-03-13 16:19:23 +08:00
pengwa
448e989df8
Op slicing upstream refactor (#14832)
### Slice op upstream refactor

A refactor work for https://github.com/microsoft/onnxruntime/pull/13672.

### Motivation and Context

There is a similar optimization opportunity for other operator
upstreaming, to reduce compute flops. So refactor the existing code base
for making it easier to support other ops.

The changes in this PR are mainly about renaming and moving. 
- Move common logic (from compute_optimizer.h/cc) into
upstream_transformer_base.h/cc and shared_utils.h/cc.
   - Upstream common logic is moved into
upstream_transformer_base.h/cc.
   - Shared utilities are moved to shared_utils.h/cc.
- After the move, compute_optimizer.h/cc mainly contains the upstreaming gather
implementation (inheriting upstream_transformer_base.h/cc). Ideally it
should be renamed, but for easier review this time, I keep its name.
2023-03-13 08:19:32 +08:00
Yi-Hong Lyu
cce9e0eaad
Add float32 hardsigmoid tests (#14948)
2023-03-12 10:56:29 -07:00
G. Ramalingam
930e009567
[WIP] Update call to GetFunction (#14949)
### Description

OpSchema::GetFunction() changed in ONNX to support
opset-version-dependent function-body. Update the call to GetFunction
appropriately.

### Motivation and Context

Motivated by https://github.com/microsoft/onnxruntime/issues/14810

---------

Signed-off-by: Ganesan Ramalingam <grama@microsoft.com>
2023-03-11 07:04:17 -08:00
Yi Zhang
ca315b9148
Use ADO cache to cache docker image instead of ACR (#14496)
### Description
For now, the image cache is enabled in the pipeline cache only for the
Linux Aten Pipeline.
It will be enabled in other Linux pipelines gradually.

### Motivation and Context

Fixed
[AB#13143](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/13143)


### Verification
1. No Image Cache in Pipeline

https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=904531&view=results
2. Use Cached Image in Pipeline

https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=904533&view=results
2023-03-11 10:32:02 +08:00
Vincent Wang
7950189920
[CUDA] Optimize Perf for AtomicAdd of Half Type (#14992)
2023-03-11 08:52:01 +08:00
Changming Sun
a8ad0edbeb
BUG FIX: the if...else in telemetry-steps.yml does not really work (#14972)
### Description
BUG FIX: the if...else in telemetry-steps.yml does not really work. It
always says "Telemetry is disabled." even though the pipeline doesn't
have the pipeline variable.

### Motivation and Context
For example, I recently set up a new pipeline in
https://dev.azure.com/onnxruntime/onnxruntime/_build without setting the
ADO variable, but the PowerShell code still thinks that we have enabled
telemetry.

See:

https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=910107&view=results

It didn't work because, when the pipeline
variable ("TELEMETRYGUID") doesn't exist, occurrences of
"$(TELEMETRYGUID)" are not replaced with anything. They remain as
they are.
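
The pitfall described above can be reproduced with a small simulation. This is a Python sketch of the macro behavior, not the PowerShell from telemetry-steps.yml and not the fix applied in the PR; `expand_macros`, `naive_is_set`, and `is_set` are hypothetical helpers.

```python
import re


def expand_macros(text, variables):
    """Replace $(NAME) with its value; leave undefined macros untouched."""
    def replace(match):
        return variables.get(match.group(1), match.group(0))
    return re.sub(r"\$\((\w+)\)", replace, text)


def naive_is_set(expanded):
    """Buggy check: an unexpanded macro is a non-empty string, so it looks set."""
    return expanded != ""


def is_set(expanded):
    """Safer check: treat a leftover $(...) macro as 'variable not defined'."""
    return expanded != "" and not expanded.startswith("$(")
```

With no `TELEMETRYGUID` defined, the expanded text is still the literal `$(TELEMETRYGUID)`, so a plain non-empty test reports the variable as present while the stricter test does not.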
2023-03-10 15:39:07 -08:00
Adrian Lizarraga
d8ddd25272
Add InstanceNormalization operator to QNN EP (#14867)
### Description

QNN EP:
- Adds the
[InstanceNormalization](https://onnx.ai/onnx/operators/onnx__InstanceNormalization.html)
operator to QNN EP.
- Fixes graph composition bug when Transpose node is the last node in a
graph.
- Adds check for input shape when GetCapability is called (before and
after layout transformation)
- Should add similar checks for other layout sensitive ops (conv, pool,
...) in a separate PR
- Adds initial QNN op tests for QDQ conv and QDQ  InstanceNormalization
  - Should add tests for other ops in a separate PR

Optimizer:
- Makes InstanceNormalization a layout sensitive operator.
- Adds a custom QDQ group selector for InstanceNormalization.

Quantization tool:
- Adds QDQ support for InstanceNormalization operator.
- Adds python unit test for InstanceNormalization quantization.

### Motivation and Context
Needed to support stable diffusion models with QNN.

---------

Co-authored-by: Hector Li <hecli@microsoft.com>
2023-03-10 14:42:41 -08:00
Ryan Hill
a5c436e148
Fix prefast warnings (#14975)
### Description

In transpose.cc:
Arithmetic overflow: Using operator '-' on a 4 byte value and then
casting the result to a 8 byte value. Cast the value to the wider type
before calling operator '-' to avoid overflow (io.2).

In cuda_provider_factory.h:
The type 'struct onnxruntime::CUDA_Provider' with a virtual function
needs either public virtual or protected non-virtual destructor (c.35).
2023-03-10 14:31:55 -08:00
Dmitri Smirnov
0d7855ea5a
Re-work global objects dependencies in pybind layer. (#14941)
### Description
Re-work handling of static objects in pybind.
Make sure we ref-count Environment from Sessions.

The following has been done:

- Make global objects function static. This ensures that the objects are
constructed on demand. The first object constructed is destructed last.
This is platform independent.
- Make global objects ownership shared as suggested by pybind since they
are not surfaced at Python level, and they cannot be referred to by
dependent python objects. Verified that all python objects are GCed
before globals are destroyed. This takes care of inference session
dependency on environment and its default logger and this is also
platform independent.
- Utilize pybind atexit mechanism to clear execution providers and
unload CUDA libraries (as suggested by
https://github.com/microsoft/onnxruntime/pull/14903). Since this is
registered for module exit, it takes place before any other globals are
destroyed and clears shared objects state or even unloads the libraries.
This should also work in a platform independent way.

### Motivation and Context

- Global object destruction order is managed manually and that becomes
a source of trouble. We want to make it deterministic and platform
independent.
- Frequent hangs in Python layer due to the static object's destruction
order. Some of the Python session objects are being garbage collected
after main exits and they require ORT environment to be alive. (Use
after free)
2023-03-10 13:55:31 -08:00
Adrian Lizarraga
e2febe87f6
[QNN EP] Update QNN SDK to 2.8 (#14978)
### Description
- Add QNN 2.8 SDK
- Make QNN SDK version a pipeline template parameter for QNN pipelines.

### Motivation and Context
Updates to latest QNN SDK version, and allows testing different QNN SDK
versions without modifying yaml files.
2023-03-10 13:21:19 -08:00
Edward Chen
bd142bfb04
Gradle clean up (#14973)
- Use java/gradlew directly in .github/workflows/publish-java-apidocs.yml.
- Remove use of deleted step from tools/ci_build/github/azure-pipelines/android-arm64-v8a-QNN-crosscompile-ci-pipeline.yml.
- Remove Gradle installations and PATH updates from Dockerfiles and scripts. Now Gradle wrapper is used so a system Gradle installation is not needed.
2023-03-10 10:50:32 -08:00
Baiju Meswani
748758c135
Address issue with uninitialized variable (#14988)
2023-03-10 09:24:04 -08:00
Maximilian Müller
ad4db12699
TensorRT EP - timing cache (#14767)
### Description

This enables a user to use a TensorRT timing cache, based on #10297,
to accelerate build times on a device with the same compute capability.
It works across models, as it simply stores kernel runtimes for
specific configurations. These files are usually very small (only a few
MB), which makes them easy to ship with an application to accelerate
the build time on the user's end.

### Motivation and Context
Especially for workstation use cases, TRT build times can be a roadblock.
With a few models from the ONNX model zoo, I evaluated speedups when a timing
cache is present.
`./build/onnxruntime_perf_test -e tensorrt -I -t 5 -i
"trt_timing_cache_enable|true" <onnx_path>`

|Model | no Cache | with Cache|
| ------------- | ------------- | ------------- |
|efficientnet-lite4-11 | 34.6 s | 7.7 s|
|yolov4 | 108.62 s | 9.4 s|

To capture this I had to modify onnxruntime_perf_test. The time is
sometimes not captured within "Session creation time cost:", which is why
I introduced "First inference time cost:".

---------

Co-authored-by: Chi Lo <Chi.Lo@microsoft.com>
2023-03-10 09:02:27 -08:00
Yi Zhang
acbb7ad453
enable cache in orttraining-mac-ci (#14979)
### Description
enable compilation cache  in orttraining-mac-ci

### Motivation and Context
At best, the workflow duration drops from about 100 minutes to 12
minutes.

https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=911536&view=results
2023-03-10 07:34:25 +08:00
Yulong Wang
1187d4ade6
[wasm] extend build timeout for static lib (#14952)
### Description
extend build timeout for web assembly static lib.
2023-03-09 15:03:34 -08:00
Preetha Veeramalai
79d47c1530
Enable sorting of initializers (#14631)
Add initializers to model proto in sorted order.



### Motivation and Context
The ONNX Runtime OpenVINO Execution Provider interacts with the OpenVINO API by
passing a serialized ONNX model proto.
In the current flow, the serialized ONNX model proto is passed into the
Read_model() API of OpenVINO, which creates an OpenVINO execution network
that is passed to the compile_model() API.

As part of optimizations, we have combined the APIs (Read_model and
Compile_model) into a single compile_model() API that directly accepts the
serialized ONNX model proto. A hash is computed on this
serialized input for internal OpenVINO optimizations. This requires the
model_proto to be deterministic across inference requests.

With the current flow, the [initializers are added to
model_proto](c1ff4b468d/onnxruntime/core/graph/graph_proto_serializer.cc (L48))
from an [unordered_map data
structure](8ed3dfe063/onnxruntime/core/providers/shared_library/provider_interfaces.h (L93)),
which gives these initializers a random ordering across inference runs.


The proposed solution is to add these initializers by iterating through
a sorted [vector consisting of the initializer
names](2c7146cef8/onnxruntime/core/graph/graph_proto_serializer.cc (L49)).
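
The effect of sorting on the hash can be illustrated with a toy serializer. This is a Python sketch under simplifying assumptions, not the code in graph_proto_serializer.cc: `serialize` and `digest` are hypothetical helpers, initializers are modeled as a name-to-bytes mapping, and dict insertion order stands in for the unordered_map iteration order.

```python
import hashlib


def serialize(initializers, sort_names):
    """Concatenate name=value records, optionally in sorted name order."""
    names = sorted(initializers) if sort_names else list(initializers)
    return b"".join(
        name.encode("utf-8") + b"=" + bytes(initializers[name]) + b";"
        for name in names
    )


def digest(initializers, sort_names):
    """Hash of the serialized form, standing in for the hash computed on the model proto."""
    return hashlib.sha256(serialize(initializers, sort_names)).hexdigest()
```

Two maps with the same contents but a different iteration order hash differently without sorting and identically with it.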
2023-03-09 12:12:46 -08:00
Jian Chen
b4fe98ac2e
Update to MacOS-12 (#14924)
### Description


Update to MacOS-12
### Motivation and Context

Fixed
[AB#13233](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/13233)
2023-03-09 10:18:14 -08:00
cloudhan
51b67fa15c
Make ROCm Attention biased+masked and biased+nomask scaling logic consistent (#14976)
The biased+masked and biased+nomask cases have different scaling logic in the current ROCm implementation.

Currently,

biased + masked:  (QK'+ bias) * scale + convert(mask)
biased + nomask:   QK' * scale + bias

which is not correct. What we want is

  QK' * scale [+ bias]

That is, bias should not be scaled.
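
The difference between the two orderings is easy to see on small numbers. The following Python sketch uses plain lists in place of the QK' score tensor; `scale_then_bias` and `bias_then_scale` are hypothetical names, and the mask term is omitted.

```python
def scale_then_bias(qk, bias, scale):
    """Intended form: QK' * scale + bias, so the bias is not scaled."""
    return [s * scale + b for s, b in zip(qk, bias)]


def bias_then_scale(qk, bias, scale):
    """Inconsistent form: (QK' + bias) * scale, which scales the bias too."""
    return [(s + b) * scale for s, b in zip(qk, bias)]
```

For scores `[2.0, 4.0]`, bias `[1.0, 1.0]`, and scale `0.5`, the first form gives `[2.0, 3.0]` while the second gives `[1.5, 2.5]`.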

This effectively follows
https://github.com/microsoft/onnxruntime/pull/14517/files?w=1#diff-e4768ce15a73499f584f9cd7d71adcb1ff2ed8d68ad7e496723a4775cbc35e33
2023-03-09 23:37:50 +08:00
mindest
f83923d5df
fix rocBLAS extensions API issue; add batched- and strided_batched- cases (#14883)
### Description
For rocBLAS extensions API:
* fix `alpha`/`beta` dtype mismatch in `rocblas_gemm_ex()`, which should
be the same as `compute_type`.
* add support for `BatchedGemm` and `StridedBatchedGemm` cases.
2023-03-09 23:23:35 +08:00
mindest
bf2cc808a1
[ROCm] SkipLayerNorm: add more configs for block size; loosen constraints (#14900)
### Description
* add more configs for `threads_per_block` in SkipLayerNorm, also in
kernel explorer.
* loosen constraints for hidden_size, so that `SkipLayerNormSmallOp` can
be selected for larger hidden sizes.
* add flag for optional output in kernel_explorer


### Motivation and Context
2023-03-09 22:27:01 +08:00
Yi Zhang
d55ae490e1
detach patch manylinux from get_docker_image (#14958)
### Description
Make patch manylinux one single step.


### Motivation and Context
If we want to use hash of docker-related files as the cache key, the
files should keep consistent before and after docker build.
And changes in generated build_scripts should trigger rebuilding the
image as well.
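
A content hash used as a cache key, as described above, might look like the following. This is a Python sketch, not the pipeline's actual key computation: `cache_key` is a hypothetical helper, and the files are modeled as a mapping from path to contents so the example needs no filesystem.

```python
import hashlib


def cache_key(files):
    """Hash paths and contents in sorted path order so the key is stable."""
    digest = hashlib.sha256()
    for path in sorted(files):
        digest.update(path.encode("utf-8"))
        digest.update(b"\0")  # separator between path and contents
        digest.update(files[path])
        digest.update(b"\0")  # separator between records
    return digest.hexdigest()
```

The key only stays valid if the hashed files are identical before and after the docker build, which is why a step that rewrites them has to run separately from the build.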
2023-03-09 15:40:58 +08:00
zhijiang
80e25ad6ac
fix cg issue (#14372)
### Description
tensorboard depends on rsa>=3.1.4, while rsa 4.5 has a vulnerability, so pin
it to a higher version as suggested.

Fixed
[AB#7352](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/7352)



### Motivation and Context
2023-03-09 15:28:11 +08:00
Yulong Wang
3c4efd2e77
[js/common] allows polyfill for bigint (#14921)
### Description
This change delays the check for whether bigint is available
in the context. This allows a polyfill for
`BigInt64Array`/`BigUint64Array` (if there is one).
2023-03-08 15:29:04 -08:00
Yulong Wang
8844474083
[js] remove 'npm bin' (#14943)
### Description
'npm bin' is deprecated in the latest version. Use 'npx' instead.

This PR resolves #14934
2023-03-08 15:03:27 -08:00
Ye Wang
d8d96f0788
Fix a build issue (#14944)
### Description



### Motivation and Context

https://github.com/microsoft/onnxruntime/issues/14940
2023-03-08 13:05:49 -08:00
Edward Chen
c46c7ccba5
Update Gradle version (#14862)
- Update Gradle version used in most places from 6.8.3 to 8.0.1. Update Android Gradle Plugin version where applicable.
  Not updated in this change: React Native Android projects (under `js/react_native/`). That can be done later along with updating the React Native projects.

- Add Gradle wrapper in `java/` to make it easier to consistently use a specific Gradle version.
2023-03-08 12:22:06 -08:00
Changming Sun
d9436407b6
Use safe allocator for JNI code (#13999)
### Description
Use a customized allocarray function to replace the original malloc
calls to avoid integer overflow.
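
The overflow that such an allocator guards against can be modeled in a few lines. The PR's allocarray function is not shown here, so this Python sketch only illustrates the usual check; `checked_alloc_size`, `wrapped_size`, and the 64-bit `SIZE_MAX` are assumptions.

```python
SIZE_MAX = 2**64 - 1  # assumption: size_t is 64 bits wide


def wrapped_size(count, elem_size):
    """What an unchecked count * elem_size yields in size_t arithmetic."""
    return (count * elem_size) & SIZE_MAX


def checked_alloc_size(count, elem_size):
    """Return the byte count, or None if count * elem_size would overflow."""
    if count < 0 or elem_size < 0:
        raise ValueError("sizes must be non-negative")
    if elem_size != 0 and count > SIZE_MAX // elem_size:
        return None  # the caller should treat this as an allocation failure
    return count * elem_size
```

Without the check, a very large element count wraps around to a small byte count, and the resulting buffer is far smaller than the code that fills it expects.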

### Motivation and Context
Fix Prefast warnings. 

Fixed
[AB#8990](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/8990)
Fixed
[AB#8991](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/8991)
Fixed
[AB#9016](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/9016)
2023-03-08 11:40:55 -08:00
Adam Pocock
47f00b5d49
[Java] Initial on device training support (#14027)
contributor: @Craigacp
2023-03-08 10:01:08 -08:00
Ashwini Khade
f14ab63c19
fix prefast warnings (#14931)
### Description
Fixes prefast warnings

Fixed
[AB#11328](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/11328)
Fixed
[AB#11329](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/11329)
2023-03-08 09:49:15 -08:00
Hariharan Seshadri
112a4d215a
[CUDA] Support decoding multihead self-attention implementation (#14848)
2023-03-08 09:17:54 -08:00
Kyushick Lee
c696392f0c
Support external output tensors for DORT (#14516)
### Description
Support externally-managed output tensors (torch Tensors) for dort. 
Add `preallocate_output` option to OrtBackend to rely on
externally-managed output tensors for dort.


### Motivation and Context
DORT currently allocates and returns output ortvalues and converts them
to torch Tensors. The dlpack-based conversion does not support torch
Tensors for custom Aten backends, and it is not yet possible to transfer
ownership from an ortvalue to an external handle (torch Tensor).

To avoid this issue, this PR provides an option
(`preallocate_output`) to allocate output tensors externally in pytorch,
which creates torch Tensors for an Aten backend, and lets dort take
pointers from the torch Tensors to construct output ortvalues instead of
allocating them inside InferenceSession.
2023-03-07 21:32:23 -08:00
edgchen1
2ef25a2200
Update CODEOWNERS file.
2023-03-07 17:56:37 -08:00
edgchen1
5b3f79a11a
Add gradle wrapper validation workflow.
2023-03-07 17:56:37 -08:00
Ashwini Khade
f71ac9859e
Update acpt image in the training pipeline (#14855)
### Description
The current pipeline refers to an old image, which is causing test failures.
Updating the image to the latest one.



### Motivation and Context
Fixes pipeline failure:
https://dev.azure.com/onnxruntime/onnxruntime/_build?definitionId=198
2023-03-07 14:10:32 -08:00
pengwa
5d8ce817cb
Fix simplified layer norm fusion for training (#14866)
### Fix simplified layer norm fusion for training

Co-author with @prathikr.

Fix bug identified by @prathikr.
https://github.com/microsoft/onnxruntime/issues/14822.

When running the T5 model with deepspeed enabled, simplified layer norm is not
fused because the device check did not pass:

b7fde84341/onnxruntime/core/optimizer/layer_norm_fusion.cc (L568).
During the pre-training optimization pass there is no device
placement yet, so it is expected that the device check is not fulfilled.

On the other hand, the device check is still needed so that simplified
layer norm fusion works correctly for CPU runs. As a mitigation, added a
flag to indicate whether the fusion is triggered by pre-training
optimization or not. There is still a risk when we run ORTModule
training with the CPU EP, but I feel the risk can be much reduced if we
check that CUDA/ROCM is enabled for the build.

```
CUDA_VISIBLE_DEVICES=0 python examples/onnxruntime/training/summarization/run_summarization.py --model_name_or_path t5-small --do_train --dataset_name cnn_dailymail --dataset_config "3.0.0" --source_prefix "summarize: " --predict_with_generate --overwrite_output_dir --output_dir /bert_ort/pengwa/output --fp16 --max_steps 1 --logging_steps 1 --deepspeed aml_ds_config_zero_1.json
```

### Motivation and Context
2023-03-07 13:59:20 -08:00
Patrice Vignola
65f1f840f6
[DML EP] Fix Attention regression caused by removing transposes (#14908)
After removing the transposes and using strides instead, the metacommands
can no longer be reached, since the layout is no longer NCHW.
2023-03-07 11:17:28 -08:00
Xavier Dupré
6b604521a6
Fix tree implementation when left, right node have lower index (#14839)
### Description
The previous implementation did not support a node whose left or right child
has an index lower than the node itself. This condition prevented
the tree traversal from entering an infinite loop. LightGBM does not follow that rule.
The changes do not alter the algorithm but remove the check enforcing
that condition.
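
A traversal that accepts children stored at lower indices can be sketched as follows. This is a Python illustration, not the ONNX Runtime tree ensemble code: `predict` and the node layout are hypothetical, and the step counter is a guard added for this sketch (the change described above simply removes the index check).

```python
def predict(nodes, root, features):
    """Walk from the root to a leaf; child indices may be lower than the parent's."""
    index = root
    steps = 0
    while True:
        node = nodes[index]
        if "value" in node:  # leaf node
            return node["value"]
        steps += 1
        if steps > len(nodes):  # more hops than nodes means the links form a cycle
            raise ValueError("cycle detected in tree")
        go_left = features[node["feature"]] <= node["threshold"]
        index = node["left"] if go_left else node["right"]
```

For example, a root stored at index 2 with leaves at indices 0 and 1 is traversed correctly even though both children precede their parent.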



### Motivation and Context
It fixes a regression introduced by #14670.
2023-03-07 19:47:12 +01:00
Hitesh Shah
66101c02a2
Implement AllToAll collective op
2023-03-07 10:17:07 -08:00
Adam Pocock
150043f74f
Adds a Java accessor for GetVersionString (#14876)
### Description
Java part of #14873.
2023-03-07 09:46:56 -08:00