Commit graph

7955 commits

Author SHA1 Message Date
RandySheriffH
d152452d4b
Tune test case for hybrid cpu (#14204)
Tune test case for hybrid cpu architecture.

Co-authored-by: Randy Shuai <rashuai@microsoft.com>
2023-01-10 12:54:02 -08:00
Chen Fu
90142899bd
Supporting Intel AMX instructions in quantized GEMM (#14042)
### Description
Using Intel AMX int8 instructions to accelerate quantized GEMM


### Motivation and Context
AMX instructions accelerate quantized GEMM significantly:

Prepacked B perf numbers (latency in ns)

GEMM Config | AVX512Vnni | AMX
-- | --: | --:
M:384/N:1024/K:1024/Batch:1/Threads:4 | 1057511 | 285393
M:384/N:1024/K:3072/Batch:1/Threads:4 | 2643929 | 700397
M:384/N:1024/K:4096/Batch:1/Threads:4 | 3784750 | 890701
M:384/N:4096/K:1024/Batch:1/Threads:4 | 2378139 | 887251
M:384/N:1024/K:1024/Batch:1/Threads:16 | 307137 | 138481
M:384/N:1024/K:3072/Batch:1/Threads:16 | 855730 | 295027
M:384/N:1024/K:4096/Batch:1/Threads:16 | 1126878 | 317395
M:384/N:4096/K:1024/Batch:1/Threads:16 | 781963 | 237014
M:1536/N:1024/K:1024/Batch:1/Threads:16 | 538864 | 181459
M:1536/N:1024/K:3072/Batch:1/Threads:16 | 1681002 | 561600
M:1536/N:1024/K:4096/Batch:1/Threads:16 | 2158127 | 717470
M:1536/N:4096/K:1024/Batch:1/Threads:16 | 2428622 | 896140
M:3072/N:1024/K:1024/Batch:1/Threads:16 | 1058029 | 357031
M:3072/N:1024/K:3072/Batch:1/Threads:16 | 3138504 | 1095857
M:3072/N:1024/K:4096/Batch:1/Threads:16 | 4155640 | 1386183
M:3072/N:4096/K:1024/Batch:1/Threads:16 | 4679030 | 1778624
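
As a reading aid for the table above, the AMX speedup for a given configuration is the AVX512Vnni latency divided by the AMX latency. The sketch below copies three rows from the table; the dictionary layout and the `speedup` helper are illustrative names and are not part of the PR.

```python
# Latencies in ns, copied from three rows of the table above.
# Keys are (M, N, K, threads); values are (AVX512Vnni, AMX).
latencies_ns = {
    (384, 1024, 1024, 4): (1057511, 285393),
    (384, 1024, 1024, 16): (307137, 138481),
    (3072, 4096, 1024, 16): (4679030, 1778624),
}


def speedup(baseline_ns: int, amx_ns: int) -> float:
    """Return how many times faster the AMX kernel is than the baseline."""
    return baseline_ns / amx_ns


for config, (vnni_ns, amx_ns) in latencies_ns.items():
    print(config, f"{speedup(vnni_ns, amx_ns):.2f}x")
```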

Co-authored-by: Yi-Hong Lyu <yilyu@microsoft.com>
Co-authored-by: Chen Fu <fuchen@microsoft.com>
2023-01-10 12:16:27 -08:00
Ye Wang
a01bf8dbb1
rename CrossAttention to MultiHeadAttention (#14201)
### Description
Rename CrossAttention to MultiHeadAttention, since this op can also
be used for self-attention.

Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
2023-01-10 10:18:39 -08:00
Guenther Schmuelling
6b8c72cfa6
pin ort-ext to 81e7799c69044c745239202085eb0a98f102937b (#14044)
Pin onnxruntime-extensions to 81e7799c69044c745239202085eb0a98f102937b in
preparation for enabling extensions in the wasm build.
2023-01-10 10:10:17 -08:00
Numfor Tiapo
f4ea781b81
DML EP Register Identity-16 (#14053)
This PR registers Identity-16 with the DML EP.

ONNX Backend tests and optional type tests were skipped pending future
additions.

Co-authored-by: Numfor Mbiziwo-Tiapo <numform@microsoft.com>
2023-01-10 09:16:09 -08:00
Tianlei Wu
05e26f302a
Hot fix for prefast failure to unblock python package pipeline (#14206)
### Description
Hot fix for Python packaging pipeline failures: disable an attention op
test that causes cl crashes in the prefast build.

Verified that python package is good with this hot fix:

https://aiinfra.visualstudio.com/Lotus/_build/results?buildId=263786&view=results

### Motivation and Context
The prefast build failed because the linker crashed:
```
 cl : command line error D8040: error creating or communicating with child process
```
The cause is high stack usage in an attention op unit test introduced in
https://github.com/microsoft/onnxruntime/pull/13953.
2023-01-10 07:57:32 -08:00
Adrian Lizarraga
3d8b596cb9
Use a local copy of murmurhash3 in TensorRT shared library (#14207)
### Description
Uses a local copy of murmurhash3 in TensorRT.



### Motivation and Context
The current murmurhash3 implementation is located in core/framework,
which is not linked into the provider shared library. This causes a
segfault when the TensorRT shared library is used standalone.
2023-01-10 07:24:06 -08:00
Ryan Hill
da57c0a701
Add protected destructor to Provider structure (#14152)
### Description
Add protected destructor so that any inherited classes can't
accidentally be deleted through a pointer to the base.

Fixes this prefast warning:
The type 'struct onnxruntime::CUDA_Provider' with a virtual function
needs either public virtual or protected non-virtual destructor (c.35).

Internal bug 8999
2023-01-09 23:04:04 -08:00
Ryan Hill
f8117b6f87
Add catch-all exception handler to API_IMPL_END (#14194)
### Description
Fairly self explanatory. Someone pointed out we could miss some
exceptions, and we never want to throw exceptions through the C API.

### Motivation and Context
This doesn't fix any known issue, it's just a good idea to have.
2023-01-09 21:58:46 -08:00
PeixuanZuo
33367fa2dc
[MIGraphX] update the MIGraphX version used in ORT to rocm-5.4.0 (#14184)
### Description
Update the MIGraphX version used in ORT to rocm-5.4.0

### Motivation and Context
The previous branch, migraphx_for_ort, is no longer updated and has fallen
too far behind the latest MIGraphX release branch. More discussion here:
https://github.com/microsoft/onnxruntime/issues/14126#issuecomment-1373201049

Co-authored-by: peixuanzuo <peixuanzuo@linmif39a000004.zvflicr54joexhdgnhvmxrxygg.phxx.internal.cloudapp.net>
2023-01-10 13:40:25 +08:00
Yi Zhang
6463f4383b
make WITHCACHE as an option in MacOS workflow (#14188)
### Description
1. Set the WithCache default value to false in the macOS CI workflow too.
2. Add today's date to the cache key to keep the cache size from growing
indefinitely.

With WithCache enabled, the pipeline duration is reduced from over 70
minutes to just over 10 minutes.
2023-01-10 10:54:19 +08:00
Tianlei Wu
7e751ac6e6
update convert_generation for Attention op change (#14191)
The key and value inputs were removed in
https://github.com/microsoft/onnxruntime/pull/14146, so
convert_generation needs to be updated as well.
2023-01-09 18:04:44 -08:00
Patrice Vignola
c151afec71
[DML EP] Fix unconnected node removal logic (#14193)
### Description
Fix unconnected node removal logic



### Motivation and Context
The edges need to be removed before the nodes themselves, otherwise the
indices will reference the wrong nodes.
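
The ordering issue can be illustrated with a toy graph whose edges refer to nodes by list index. This is a minimal sketch under assumed data structures (`nodes` as a list of names, `edges` as index pairs); it is not the DML EP's actual code.

```python
def remove_unconnected(nodes, edges, dead):
    """Drop the nodes whose indices are in `dead`, plus any edge touching them.

    `nodes` is a list of names and `edges` is a list of (src, dst) index
    pairs. Both are hypothetical stand-ins for a real graph description.
    """
    # 1. Remove edges first, while every index still refers to the original list.
    kept_edges = [(s, d) for s, d in edges if s not in dead and d not in dead]

    # 2. Remove the nodes and record where each surviving node moved to.
    remap = {}
    kept_nodes = []
    for old_index, name in enumerate(nodes):
        if old_index not in dead:
            remap[old_index] = len(kept_nodes)
            kept_nodes.append(name)

    # 3. Rewrite the surviving edges in terms of the new indices.
    return kept_nodes, [(remap[s], remap[d]) for s, d in kept_edges]


nodes = ["a", "orphan", "b", "c"]
edges = [(0, 2), (2, 3)]
print(remove_unconnected(nodes, edges, dead={1}))
```

Had the node list been compacted first, the edge `(2, 3)` would have referred to index 3, which no longer exists in the three-element list.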
2023-01-09 15:40:09 -08:00
Sumit Agarwal
906f578be8
[DML EP] Update DML_FEATURE_LEVEL 5.0 (#14172)
### Description
The DML EP was using a very old feature level (2.0), which may cause
execution failures for models that use the latest operators when they run
against an old DirectML.dll.



2023-01-09 13:00:56 -08:00
liqun Fu
1be36913cc
to work with onnx 1.13 rc, implement ver 18 reduce and optional ops, … (#13765) 2023-01-09 10:26:16 -08:00
Xavier Dupré
79dc39600f
Replace distutils by setuptools to import build_ext (#14108)
### Description
Uses setuptools instead of distutils.



### Motivation and Context
Fixes #14107.
2023-01-09 11:48:01 +01:00
Patrice Vignola
64541a587d
[DML EP] Remove unconnected nodes from the graph (#14155)
### Description
Remove unconnected nodes from the DML EP graph.



### Motivation and Context
Some operators like `EmbedLayerNorm` have many outputs, and some of the
outputs are non-optional. In practice, they act like optional
outputs because they can have a value of 0, which means that the rest of
the model doesn't need to depend on them. The problem is that
DML will implicitly remove those outputs from the graph, but the nodes
that feed into them will stay and become unconnected from the
rest of the graph, which is illegal in DML. Removing unconnected nodes
as a last pass makes sure that those nodes are removed and
simplifies the logic of individual operators, which no longer have to
account for these special cases.
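
One way to picture such a last pass is a backward walk from the graph outputs that keeps only the nodes feeding them. The sketch below is an assumption-laden illustration: the function name, the node names, and the edge-list representation are hypothetical and do not come from the DML EP.

```python
from collections import defaultdict


def prune_unconnected(nodes, edges, outputs):
    """Keep only the nodes that (transitively) feed one of `outputs`."""
    producers = defaultdict(list)  # consumer -> list of its producers
    for src, dst in edges:
        producers[dst].append(src)

    # Walk backward from the outputs and mark everything reachable as live.
    live = set()
    stack = list(outputs)
    while stack:
        node = stack.pop()
        if node in live:
            continue
        live.add(node)
        stack.extend(producers[node])

    kept_nodes = [n for n in nodes if n in live]
    kept_edges = [(s, d) for s, d in edges if s in live and d in live]
    return kept_nodes, kept_edges


# "mask_reduce" only fed an output that was dropped, so it is pruned.
nodes = ["embed", "norm", "mask_reduce"]
edges = [("embed", "norm"), ("embed", "mask_reduce")]
print(prune_unconnected(nodes, edges, outputs=["norm"]))
```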
2023-01-08 15:20:52 -08:00
Zhang Lei
74fe45bf09
activate past_present_share_buffer for sampling node (#14166) 2023-01-07 19:36:39 -08:00
cloudhan
be879c11ee
Add batched and strided batched gemm as TunableOp (#13841) 2023-01-07 19:11:40 +08:00
Ye Wang
5eac2c1f41
relational attention bias cuda op (#14149)
### Description

This cuda op implements the compute_bias() method in T5 Attention
including the permutation.

Notes:
1. bias_table must be stored in column-major order; be careful when
implementing the fusion script.
2. The second input (sequence length) is placed on the CPU (using a Shape
node's output should work).
3. The first dimension of the output is 1, so extra_add_qk in attention
should support broadcasting.
4. compute_bias() is only used in self-attention in T5.

TODO: docs change will be applied later

### Motivation and Context
It's part of the process of optimizing t5 attention as well as t5 based
generation model

Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
2023-01-06 17:32:58 -08:00
cloudhan
8e2163018d
Ignore more build directories and clangd files (#14154)
Ignore all `build_*` directories in repo root. Ignore `.cache` and
`compile_commands.json` which are related to clangd cache and
configuration.
2023-01-07 06:58:57 +08:00
Tianlei Wu
2cacb24cb0
Add CrossAttention operator (#14146)
Move separated Q, K and V (without input projection) from Attention to a
new operator CrossAttention.

The Attention operator is hard to maintain when one class needs to support
both with and without input projection. A new operator is added according
to feedback.

Some changes might be needed in the future, but not in this PR:
(1) bias could be optional (we will not take that route unless
experiments show that fusing Add bias with MatMul instead of this op
could improve performance).
(2) support packed KV. There are two ways to support it: when key and
value are the same Tensor, they are packed; or we can make value
optional, and use packed mode when value is empty and the key has packed
K/V.
(3) support cached key and value, and others (like relative position
bias), or more attention mask formats. They can be added easily without
breaking backward compatibility.
(4) ROCm/CPU implementation of this op.
2023-01-06 14:27:40 -08:00
Baiju Meswani
c6ff5bac9d
Update torch in eager mode CI pipeline (#14094) 2023-01-06 11:46:44 -08:00
we1559
c65a03699a
add ThreadingOptions, wraps OrtThreadingOptions (#13711)
…threadpools' options of The Env.

### Description
Add a C++ class ThreadingOptions that wraps OrtThreadingOptions,
as described in issue #13710.


### Motivation and Context
close #13710

Co-authored-by: zengxiangneng <zengxiangneng@360.cn>
2023-01-06 11:21:10 -08:00
Jian Chen
babc1323e3
Consolidate Identical Children Nodes (#14026)
### Description
When a Q node has multiple identical DQ children, we want to keep only one
DQ. The output of the remaining DQ replaces the outputs of the deleted DQ
children.

Example:
Q->N(DQ)    =>    Q->DQ
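
The rewiring can be sketched on a simplified graph in which each node is a dict with `op`, `inputs`, and `outputs` keys. This representation and the function name are hypothetical, and the sketch assumes the DQ children really are identical, which an actual transformer would have to verify.

```python
def consolidate_dq_children(q_output, nodes):
    """Keep one DQ consumer of `q_output` and rewire consumers of the others.

    Assumes every DQ child of `q_output` is identical (same scale and zero
    point); this sketch does not check that.
    """
    dq_children = [n for n in nodes if n["op"] == "DQ" and n["inputs"][0] == q_output]
    if len(dq_children) < 2:
        return nodes

    keeper, duplicates = dq_children[0], dq_children[1:]
    # Map each deleted DQ's output name to the surviving DQ's output name.
    renamed = {dup["outputs"][0]: keeper["outputs"][0] for dup in duplicates}

    result = []
    for node in nodes:
        if any(node is dup for dup in duplicates):
            continue  # drop the redundant DQ
        # Consumers of a deleted DQ now read the surviving DQ's output.
        result.append(dict(node, inputs=[renamed.get(i, i) for i in node["inputs"]]))
    return result


graph = [
    {"op": "Q", "inputs": ["x"], "outputs": ["q"]},
    {"op": "DQ", "inputs": ["q"], "outputs": ["dq0"]},
    {"op": "DQ", "inputs": ["q"], "outputs": ["dq1"]},
    {"op": "Relu", "inputs": ["dq0"], "outputs": ["y0"]},
    {"op": "Sigmoid", "inputs": ["dq1"], "outputs": ["y1"]},
]
consolidated = consolidate_dq_children("q", graph)
```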


2023-01-06 09:03:10 -08:00
Hariharan Seshadri
d0c5ffd5f7
Misc transformer fixes - 2 (#14156)
### Description
1. The graph pattern search introduced in
https://github.com/microsoft/onnxruntime/pull/13914/ needs to be
enhanced so that SkipLayerNormalization is supported

2. Fix fp32 parity for GPT-2 while using `SkipLayerNormalization`
fusion. The optional output of SLN needs to also include the bias (if
present) and the added output should be a sum of `input + skip + (bias)`
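
For reference, the intended relationship between the two outputs can be written as a small pure-Python model over a single hidden vector. This is a sketch, not the ORT kernel: the function name and the `epsilon` default are assumptions made for illustration.

```python
import math


def skip_layer_norm(x, skip, gamma, beta, bias=None, epsilon=1e-12):
    """Reference SkipLayerNormalization over one hidden vector.

    Returns (normalized, added), where `added` is input + skip (+ bias),
    i.e. the optional output described above.
    """
    if bias is None:
        bias = [0.0] * len(x)
    added = [xi + si + bi for xi, si, bi in zip(x, skip, bias)]
    mean = sum(added) / len(added)
    variance = sum((a - mean) ** 2 for a in added) / len(added)
    inv_std = 1.0 / math.sqrt(variance + epsilon)
    normalized = [(a - mean) * inv_std * g + b for a, g, b in zip(added, gamma, beta)]
    return normalized, added


normalized, added = skip_layer_norm(
    [1.0, 2.0], [0.5, 0.5], gamma=[1.0, 1.0], beta=[0.0, 0.0], bias=[0.25, 0.25]
)
```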

### Motivation and Context
Fix some breaking tests
2023-01-06 07:27:10 -08:00
PeixuanZuo
3702806653
[ROCm] add softmax, topk, layernorm to microbench (#13997)
### Description

Add softmax, layernorm, topk benchmark to microbench.

Co-authored-by: peixuanzuo <peixuanzuo@linmif39a000004.zvflicr54joexhdgnhvmxrxygg.phxx.internal.cloudapp.net>
2023-01-06 18:06:24 +08:00
PeixuanZuo
b222a8e01b
[Fix] build error with MIGraphX tag rocm-5.4.0 (#14141)
### Description
Fix the error https://github.com/microsoft/onnxruntime/issues/14126


Co-authored-by: peixuanzuo <peixuanzuo@linmif39a000004.zvflicr54joexhdgnhvmxrxygg.phxx.internal.cloudapp.net>
2023-01-06 15:51:25 +08:00
zhijiang
0ed7277bbe
fix training compilation option (#14151)
Fix the pipeline failure caused by a compilation option error.
2023-01-06 14:25:03 +08:00
Yi Zhang
2ce7b1c1dc
Enable cache for msbuild (#14085)
### Description
Enable ccache for Windows CPU compilation.
Windows compilation time in CI can be reduced to as little as just over 1 minute.

![image](https://user-images.githubusercontent.com/16190118/210294061-86742cf4-65c7-4cc2-9725-e102c3c64abd.png)
2023-01-06 11:19:57 +08:00
Abhishek Udupa
d460c01b8c
Fix skew between GPU/CPU timestamps in ORT profiler (#14004)
### Description
This PR fixes the skew between GPU/CPU timestamps with a more reliable
algorithm.

### Motivation and Context
An earlier implementation attempted to guess the right correction to
apply, but this led to misleading profile outputs. This PR fixes this
problem by utilizing a more reliable technique to normalize GPU
timestamps. Attached are sample profile outputs and visualization
screen-grabs from a run of a transformer-based model before and after
the fix.

Before Fix:

![profile_visualization_cuda_without_fix](https://user-images.githubusercontent.com/17418420/208197234-7390d8e3-4354-4e67-93cf-958c319146ee.png)

After Fix:

![profile_visualization_cuda_with_fix](https://user-images.githubusercontent.com/17418420/208197230-3e108b82-8dfa-476b-9277-7895639a3785.png)

Profiler outputs that are rendered in the visualizations above:

[sample_outputs.zip](https://github.com/microsoft/onnxruntime/files/10249689/sample_outputs.zip)

Co-authored-by: Abhishek Udupa <abhishek.udupa@microsoft.com>
2023-01-05 11:07:26 -08:00
Tang, Cheng
90cff21fa7
Avoid the lock for device stream impact the cpu build (#14131)
### Description
Introduce a runtime flag in SessionState that records whether any EP in the
current session uses the stream feature. If none does, avoid taking the
lock. This avoids any impact on the CPU build.

### Motivation and Context
Currently we take a lock in SessionState when retrieving the device stream
collection. This is mainly for reusing device streams for EPs such as the
GPU EPs, so it shouldn't affect builds that don't use the stream
feature, like the CPU build. Instead of playing with build flags, this PR
introduces a runtime flag in SessionState that indicates whether the current
session has any EP that uses the stream feature. If not, we don't need
to take the lock.
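
The idea can be modeled in a few lines: decide once, at session setup, whether any EP uses streams, and only touch the lock when that flag is set. The class, attribute, and method names below are hypothetical, and returning an empty list for the stream-less case is a simplification chosen for this sketch.

```python
import threading


class SessionStateSketch:
    """Toy model of guarding a shared pool only when streams are in use."""

    def __init__(self, providers_use_streams):
        # Computed once at session setup: does any EP use the stream feature?
        self.has_stream_enabled_ep = any(providers_use_streams)
        self._lock = threading.Lock()
        self._pool = []
        self.lock_acquisitions = 0  # instrumentation for this example only

    def acquire_device_stream_collection(self):
        if not self.has_stream_enabled_ep:
            # CPU-only session: nothing to reuse, so skip the lock entirely.
            return []
        with self._lock:
            self.lock_acquisitions += 1
            return self._pool.pop() if self._pool else []


cpu_session = SessionStateSketch([False])
cpu_session.acquire_device_stream_collection()
gpu_session = SessionStateSketch([False, True])
gpu_session.acquire_device_stream_collection()
```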

Co-authored-by: Cheng Tang <chenta@microsoft.com@orttrainingdev9.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
2023-01-05 09:01:33 -08:00
PeixuanZuo
4eac0db3af
[ROCm] Add GemmFastGelu CK implementation (#13759)
### Description
Add GemmFastGelu CK implementation.

TODO
1. The performance of CK GemmFastGelu in ORT is not as good as using CK
directly; we still need to investigate the reason and improve CK in
ORT.
`GemmFastGeluUnfused float16 NN m=49152 n=3072 k=768 2298.8064 us 100.89
tflops`
`withbias DeviceGemmMultipleD_Xdl_CShuffle<256, 256, 128, 32, 8, 8,
Default> LoopScheduler: Default, PipelineVersion: v1 float16 NN m=49152
n=3072 k=768 2401.9799 us 96.56 tflops`
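
The tflops figures quoted above are consistent with counting 2·m·n·k floating-point operations per GEMM. That the benchmark counts operations this way is an assumption; the helper below is only an arithmetic check and is not part of the microbench code.

```python
def gemm_tflops(m, n, k, time_us):
    """TFLOPS of an m x n x k GEMM, counting 2*m*n*k floating-point operations."""
    flops = 2 * m * n * k
    seconds = time_us * 1e-6
    return flops / seconds / 1e12


print(round(gemm_tflops(49152, 3072, 768, 2298.8064), 2))
print(round(gemm_tflops(49152, 3072, 768, 2401.9799), 2))
```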

Co-authored-by: peixuanzuo <peixuanzuo@linmif39a000004.zvflicr54joexhdgnhvmxrxygg.phxx.internal.cloudapp.net>
2023-01-05 17:53:30 +08:00
Adrian Lizarraga
2b45410e52
Fix Prefast warning in CUDA contrib op (#14074)
### Description
Fixes Prefast C26814

```shell
onnxruntime::contrib::cuda::QAttention<onnxruntime::MLFloat16,signed char>::ComputeInternal
onnxruntime/contrib_ops/cuda/quantization/attention_quantization.cc
The const variable 'element_size' can be computed at compile-time. Consider using constexpr (con.5).
```
2023-01-04 19:32:06 -08:00
Adrian Lizarraga
68794d0ac1
Improve custom op library handle cleanup (#14099)
### Description
- Adds a new C API `OrtApi::RegisterCustomOpsLibrary_V2` that manages
the lifetime of dynamic library handles (i.e., calls `dlclose` or
`FreeLibrary`).
- Deprecates C API `OrtApi::RegisterCustomOpsLibrary`.
- Adds C++ API wrapper for convenient registering of custom op
libraries.
- `PySessionOptions` is now an alias of `OrtSessionOptions`

### Motivation and Context
The current API for registering custom op libraries loads dynamic
libraries but requires users to handle the release of the corresponding
library handles. Additionally, the user has to make sure to release the
library handle _after_ the session has been destroyed (or the program
segfaults).

The new API automatically cleans up the library and allows the user to
write more straightforward code.
2023-01-04 17:56:29 -08:00
cloudhan
dc997af695
Use RegisterOp to add Op instead of directly manipulate base class field (#14123)
Add API `RegisterOp` to TunableOp.
2023-01-05 09:02:46 +08:00
Nat Kershaw (MSFT)
b313055ad6
Updated issue router to migrated project (#14114) 2023-01-04 14:47:43 -08:00
Ye Wang
ae148ebc05
T5 skip_layer_norm cuda op (#14093)
### Description

T5 uses a layer_norm which only scales and doesn't shift, which is also
known as Root Mean Square Layer Normalization.
ORT already has simplified_layer_norm, which is the RMS layer_norm.
This PR extends this T5 layer_norm with support for skip/bias and the
residual output.
The new op is named SkipSimplifiedLayerNorm and has an interface similar
to SkipLayerNorm, but without beta as an input.
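
A pure-Python reference for what such an op computes on a single hidden vector might look as follows. This is a sketch of RMS layer norm with skip and bias, not the CUDA kernel; the function name and the `epsilon` default are assumptions.

```python
import math


def skip_simplified_layer_norm(x, skip, gamma, bias=None, epsilon=1e-6):
    """RMS layer norm over input + skip (+ bias); scales but does not shift.

    Returns (normalized, residual), where `residual` is the pre-norm sum.
    There is no beta input because nothing is shifted.
    """
    if bias is None:
        bias = [0.0] * len(x)
    residual = [xi + si + bi for xi, si, bi in zip(x, skip, bias)]
    mean_square = sum(r * r for r in residual) / len(residual)
    inv_rms = 1.0 / math.sqrt(mean_square + epsilon)
    normalized = [r * inv_rms * g for r, g in zip(residual, gamma)]
    return normalized, residual


normalized, residual = skip_simplified_layer_norm(
    [1.0, 2.0], [2.0, 2.0], gamma=[1.0, 1.0], epsilon=0.0
)
```

Unlike regular layer normalization, the mean is not subtracted, so only the root mean square of the summed vector is needed.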


Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
2023-01-04 13:31:53 -08:00
Patrice Vignola
b6ea60436d
[DML EP] Decouple the bucketized allocator from the individual block allocation logic (#14056)
### Description
Decouple the DML bucketized allocator from the individual block
allocation logic



### Motivation and Context
This is the first step toward using tiled/placed resources instead of
committed resources. Given the potential impact of changing the
allocation logic and the large number of edge cases, I decided to take a
step-by-step approach. It will also reduce the size of the PRs to a
reasonable length, while making sure each PR has a single
responsibility.

Decoupling the logic this way will make it easier in the future to
plug in different kinds of "suballocators" if we want to play
around with the allocation logic. Currently, the only suballocator is a
committed resource, but placed resources are the next step and will come
in a future PR.
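
The decoupling can be pictured as a pooling layer that knows about buckets and delegates the creation of new blocks to a pluggable object. The classes below are a hypothetical sketch: the names, the power-of-two bucket sizes, and the use of `bytearray` as a stand-in for a GPU resource are all assumptions, not the DML EP's implementation.

```python
class CommittedResourceSubAllocator:
    """Hypothetical block allocator: hands out one bytearray per request."""

    def __init__(self):
        self.allocations = 0

    def alloc(self, size_in_bytes):
        self.allocations += 1
        return bytearray(size_in_bytes)


class BucketizedAllocator:
    """Pools freed blocks in power-of-two buckets; delegates new blocks."""

    def __init__(self, sub_allocator):
        self._sub_allocator = sub_allocator
        self._buckets = {}  # bucket size -> list of free blocks

    @staticmethod
    def _bucket_size(size_in_bytes):
        # Round up to the next power of two (minimum 1 byte).
        return 1 << max(size_in_bytes - 1, 0).bit_length()

    def alloc(self, size_in_bytes):
        bucket = self._bucket_size(size_in_bytes)
        free_blocks = self._buckets.setdefault(bucket, [])
        if free_blocks:
            return free_blocks.pop()  # reuse a pooled block
        return self._sub_allocator.alloc(bucket)

    def free(self, block):
        # Return the block to the bucket that matches its size.
        self._buckets.setdefault(len(block), []).append(block)


sub_allocator = CommittedResourceSubAllocator()
allocator = BucketizedAllocator(sub_allocator)
first = allocator.alloc(100)
allocator.free(first)
second = allocator.alloc(120)
```

Swapping in a different block allocator only requires another object with an `alloc` method; the bucket logic stays untouched.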
2023-01-04 13:13:54 -08:00
Nat Kershaw (MSFT)
f344d4b3d1
Label issues with mobile when android or ios are present (#14033) 2023-01-04 13:03:25 -08:00
Ye Wang
821baa5b83
Support generation script with custom eos/pad token id (#14113)
### Description
When a custom decoder ONNX model is passed in, the user can specify the
eos/pad token ids instead of having them populated from the torch config.


2023-01-04 10:51:53 -08:00
Ashwini Khade
e5e3570ac5
fix cg issue (#14112)
### Description
Update torch version to 1.13.1 to fix CG issue:
https://dev.azure.com/aiinfra/ONNX%20Runtime/_workitems/edit/10666/
2023-01-04 09:07:13 -08:00
JiCheng
d279ce2f67
bug fix, NNAPI filter out un-supported graph (#14040)
### Description
"Constant Folding" needs to be enhanced to support "function" in the ONNX spec.
If those nodes are inlined into a sub-graph and captured by an EP,
especially one that doesn't support them, an error occurs.

There are many test failures in ONNX 1.13 against NNAPI; these are listed
below:
```
 prelu_broadcast_expanded
 selu_example_expanded_ver18
 layer_normalization_2d_axis0
 shrink_hard_expanded_ver18
 elu_expanded_ver18
 softsign_example_expanded_ver18
 leakyrelu_example_expanded
 hardsigmoid_example_expanded_ver18
 thresholdedrelu_default_expanded_ver18
 split_variable_parts_2d_opset18
efault_expanded
 prelu_example_expanded
 thresholdedrelu_example_expanded_ver18
 selu_default_expanded_ver18
 elu_example_expanded_ver18
 hardsigmoid_default_expanded_ver18
 softsign_expanded_ver18
 hardsigmoid_expanded_ver18
 leakyrelu_expanded
 scatter_with_axis
 selu_expanded_ver18
 shrink_soft_expanded_ver18
 relu_expanded_ver18
 thresholdedrelu_expanded_ver18
 elu_default_expanded_ver18
```


Solution: Prevent NNAPI from capturing these nodes for now; we can revert
this once a better CF (Constant Folding) is implemented.


2023-01-04 20:20:00 +08:00
Vincent Wang
15c1157ef2
New Pattern Support for LayerNormFusion (#14118)
The latest torch exporter changed the LayerNorm exporting code to add two
more Cast nodes (to make it logically correct in compute), but our
current LayerNormFusion doesn't support the new pattern. This PR adds
support for it.
2023-01-04 17:51:14 +08:00
Yi Zhang
f864b54393
Use today's cache only (#14120)
### Description
Add today's date to the cache key.

### Motivation and Context
Microsoft-hosted agents have only 10GB for the build.
To limit the cache size, the pipeline only uses the cache generated today.
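
The effect of a dated key can be shown with a small helper: a key built on one day never matches a key built on another day, so older caches are simply never restored. The function name and the key layout below are hypothetical and do not reflect the pipeline's actual key format.

```python
from datetime import date


def cache_key(prefix, today=None):
    """Build a cache key that changes every day."""
    today = today or date.today()
    return f"{prefix} | {today.isoformat()}"


key_monday = cache_key("ccache", date(2023, 1, 2))
key_tuesday = cache_key("ccache", date(2023, 1, 3))
```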
2023-01-04 17:48:52 +08:00
dependabot[bot]
bdeba4e31c
Bump json5 from 1.0.1 to 1.0.2 in /js (#14109) 2023-01-04 08:54:59 +00:00
Baiju Meswani
0ff61f7b97
Update torch to 1.13.1 in CI and packaging pipelines for ort training (#14055) 2023-01-03 20:03:33 -08:00
cao lei
b29a1c7348
Address follow-up comments on multistream pr #13495 (#13992)
### Description
This PR is to address follow-up comments for the multi-stream pr
https://github.com/microsoft/onnxruntime/pull/13495

Changes including:

- Make StreamAwareArena transparent to minimal build
- Make DeviceStreamCollection transparent to minimal build
- Replace ORT_MUST_USE_RESULT with [[nodiscard]]
- Remove unnecessary shared_ptr


### Motivation and Context
This PR is to address follow-up comments for the multi-stream pr
https://github.com/microsoft/onnxruntime/pull/13495

Co-authored-by: Lei Cao <leca@microsoft.com>
2023-01-03 16:33:36 -08:00
Nat Kershaw (MSFT)
a78ab4fbef
Add labeled to docs workflow trigger (#13979)
Capture the case where issue is manually labeled
2023-01-03 15:19:22 -08:00
Ashwini Khade
68b5b2d7d3
Refactor training build options (#13964)
### Description
1. Renames all references to "on device training" to "training apis". This
keeps the naming general; nothing really prevents us from using the
same apis on servers/non-edge devices.
2. Update the ENABLE_TRAINING option: with this PR, when this option is
enabled, training apis and torch interop are also enabled.
3. Refactoring for the onnxruntime_ENABLE_TRAINING_TORCH_INTEROP option:
   - Removed the user-facing option.
   - onnxruntime_ENABLE_TRAINING_TORCH_INTEROP is set to ON when
     onnxruntime_ENABLE_TRAINING is ON, as we always build with torch interop.

Once this PR is merged, selecting --enable_training will produce a
"FULL Build" for training (with all the training entry points and
features).
Training entry points include:
1. ORTModule
2. Training APIs

Features include:
1. ATen Fallback
2. All Training OPs includes communication and collectives
3. Strided Tensor Support
4. Python Op (torch interop)
5. ONNXBlock (front-end tools for preparing training artifacts when using
training apis)

### Motivation and Context
The intention is to simplify the options for building training-enabled builds.
This is part of the larger work item to create a dedicated build for
learning-on-the-edge scenarios with just training apis enabled.
2023-01-03 13:28:16 -08:00