Commit graph

7650 commits

Author SHA1 Message Date
Ye Wang
df796bbb62
cast logits to half when T=MLFloat16 (#13454)
2022-11-03 16:40:19 -07:00
Edward Chen
b4a1ae8350
Use narrow instead of gsl::narrow. (#13555) 2022-11-03 16:24:11 -07:00
cloudhan
2de883c592
Update CK and fix performance issue on dev machine (#13531)
1. Update CK to its latest develop branch
2. `-mllvm -amdgpu-early-inline-all=true` is critical to CK's
performance, ensure it is properly configured.
- The flags are propagated from target `hip-lang::device`'s
`INTERFACE_COMPILE_OPTIONS`, we must not manually add the flags.
- Instead, we must ensure this target is properly configured by checking
_CMAKE_HIP_DEVICE_RUNTIME_TARGET is set.

TL;DR

`hip-lang::device` is sometimes not properly configured if our
`CMAKE_PREFIX_PATH` is not configured carefully. In the CI docker, the
configuration is in a good state, but on a dev machine it is not, which
silently results in poor kernel performance. We fixed it in this PR and
added a guard to keep future edits from breaking it again and to avoid a
convoluted debugging process.

`_CMAKE_HIP_DEVICE_RUNTIME_TARGET` is shared in
`/opt/rocm/lib/cmake/hip-lang/hip-lang-config.cmake` and is internal to
[CMake](https://gitlab.kitware.com/cmake/cmake/-/merge_requests/6121/diffs);
the variable name will not be changed in the foreseeable future.
2022-11-03 19:32:30 +08:00
Yi Zhang
7c3a23c186
extend some timeout value (#13552)
### Motivation and Context
these workflows are prone to timeout.
2022-11-03 15:11:41 +08:00
pengwa
a3e7da60e7
Trade subgraph recompute for memory (#12852)
**Description**: Subgraph-level recompute

This PR adds an optional capability that trades additional recomputation
for better memory efficiency. Specifically, a predefined operator list is
used to traverse the graph and find subgraphs to recompute, reducing the
stashed activations whose lifetimes span the forward and backward
passes.

When training with ORTModule, by default, the graph transformer scans
the execution graph to find all subgraphs eligible for recompute, along
with the sizes that can be saved. An example is shown below.
To enable recompute for some of them, define the environment variable
like this:
`export
ORTMODULE_ENABLE_MEMORY_ALLEVIATION="Mul+FusedMatMul+Cast+Unsqueeze+Unsqueeze+Cast+Sub+Mul+Add+BiasSoftmaxDropout+Cast+:1:-1,BiasGelu+:1:-1,BitmaskDropout+Cast+:1:-1,FusedMatMul+:1:-1,Cast+:1:-1,Mul+Add+:1:-1,Mul+Sub+:1:-1"`
```

[1,0]<stderr>:2,022-10-12 14:47:39.302,954,530 [W:onnxruntime:, memory_alleviation.cc:595 PrintSummary]
[1,0]<stderr>:MemoryAlleviation Summary:
[1,0]<stderr>:  User config:
[1,0]<stderr>:  Mul+FusedMatMul+Cast+Unsqueeze+Unsqueeze+Cast+Sub+Mul+Add+BiasSoftmaxDropout+Cast+:1,BiasGelu+:1,BitmaskDropout+Cast+:1,FusedMatMul+:1,Cast+:1,Mul+Add+:1,Mul+Sub+:1
[1,0]<stderr>:  =================================
[1,0]<stderr>:  Subgraph: BitmaskDropout+
[1,0]<stderr>:          AlleviationType: Disabled
[1,0]<stderr>:          Patterns:
[1,0]<stderr>:                  PatternShape:input_ids_dim0 x 1,024 x   Frequency:1
[1,0]<stderr>:  --------------------------------
[1,0]<stderr>:  Subgraph: BiasGelu+
[1,0]<stderr>:          AlleviationType: Recompute
[1,0]<stderr>:          Patterns:
[1,0]<stderr>:                  PatternShape:input_ids_dim0 x input_ids_dim1 x 4,096 x  Frequency:24
[1,0]<stderr>:  --------------------------------
[1,0]<stderr>:  Subgraph: Reshape[1,0]<stderr>:+
[1,0]<stderr>:          AlleviationType: Disabled
[1,0]<stderr>:          Patterns:
[1,0]<stderr>:                  PatternShape:labels_dim0 x      Frequency:1
[1,0]<stderr>:  --------------------------------
[1,0]<stderr>:  Subgraph: Unsqueeze+Unsqueeze+Cast+Sub+Mul+Mul+FusedMatMul+Cast+Add+BiasSoftmaxDropout+Cast+
[1,0]<stderr>:          AlleviationType: Disabled
[1,0]<stderr>:          Patterns:
[1,0]<stderr>:                  PatternShape:input_ids_dim0 x 16 x input_ids_dim1 x input_ids_dim1 x    Frequency:23
[1,0]<stderr>:  --------------------------------
[1,0]<stderr>:  Subgraph: Mul+FusedMatMul+Cast+Unsqueeze+Unsqueeze+Cast+Sub+Mul+Add+BiasSoftmaxDropout+Cast+
[1,0]<stderr>:          AlleviationType: Recompute
[1,0]<stderr>:          Patterns:
[1,0]<stderr>:                  PatternShape:input_ids_dim0 x 16 x input_ids_dim1 x input_ids_dim1 x    Frequency:1
[1,0]<stderr>:  --------------------------------
[1,0]<stderr>:  Subgraph: Mul+Add+
[1,0]<stderr>:          AlleviationType: Recompute
[1,0]<stderr>:          Patterns:
[1,0]<stderr>:                  PatternShape:input_ids_dim0 x 16 x input_ids_dim1 x 1 x         Frequency:24
[1,0]<stderr>:  --------------------------------
[1,0]<stderr>:  Subgraph: FusedMatMul+Cast+Add+Reshape+Cast+
[1,0]<stderr>:          AlleviationType: Disabled
[1,0]<stderr>:          Patterns:
[1,0]<stderr>:                  PatternShape:input_ids_dim0 x 16 x input_ids_dim1 x 2 x 4 x     Frequency:24
[1,0]<stderr>:  --------------------------------
[1,0]<stderr>:  Subgraph: Mul+Sub+
[1,0]<stderr>:          AlleviationType: Recompute
[1,0]<stderr>:          Patterns:
[1,0]<stderr>:                  PatternShape:input_ids_dim0 x 16 x input_ids_dim1 x 1 x         Frequency:24
[1,0]<stderr>:  --------------------------------
[1,0]<stderr>:  Subgraph: Cast+
[1,0]<stderr>:          AlleviationType: Recompute
[1,0]<stderr>:          Patterns:
[1,0]<stderr>:                  PatternShape:1,024 x 1,024 x    Frequency:97
[1,0]<stderr>:                  PatternShape:3 x 1,024 x        Frequency:1
[1,0]<stderr>:                  PatternShape:8 x 64 x   Frequency:24
[1,0]<stderr>:                  PatternShape:1,024 x 4,096 x    Frequency:24
[1,0]<stderr>:                  PatternShape:4,096 x    Frequency:24
[1,0]<stderr>:                  PatternShape:4,096 x 1,024 x    Frequency:24
[1,0]<stderr>:  --------------------------------
[1,0]<stderr>:  Subgraph: FusedMatMul+
[1,0]<stderr>:          AlleviationType: Recompute
[1,0]<stderr>:          Patterns:
[1,0]<stderr>:                  PatternShape:input_ids_dim0 x input_ids_dim1 x 4,096 x  Frequency:24
[1,0]<stderr>:  --------------------------------
[1,0]<stderr>:  =================================
```


"Type config:" indicates whether recompute is enabled by the user: 0 to
disable, 1 to enable.
"Subgraph" names the kind of subgraph that will be recomputed; in this
case, it is a single node "Gelu", and it will be "Recompute".
"Shape && Frequency" means that, for this recompute, one tensor of size
(batch size, 500) will be saved because it will be recomputed.
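
To make the structure of the `ORTMODULE_ENABLE_MEMORY_ALLEVIATION` value easier to see, here is a minimal sketch that splits such a string into entries. It is not the parser used by ORT. The function name `parse_alleviation_config` and the dictionary keys are hypothetical, the second field is assumed to be the 0/1 enable flag described above, and the meaning of the third field (`-1` in the example) is not stated here, so it is kept as raw text. The `FusedMatMul+:0:-1` entry in the sample is made up to show a disabled pattern.

```python
from __future__ import annotations


def parse_alleviation_config(config: str) -> list[dict]:
    """Split a comma-separated config string into one entry per subgraph pattern."""
    entries = []
    for item in config.split(","):
        item = item.strip()
        if not item:
            continue
        # The pattern contains '+' but no ':', so ':' separates the fields.
        pattern, *fields = item.split(":")
        entries.append({
            "pattern": pattern,
            # Operator names that make up the subgraph, in order.
            "ops": [op for op in pattern.split("+") if op],
            # Assumption: the first field after the pattern is the 0/1 flag.
            "enabled": bool(fields) and fields[0] == "1",
            # Remaining fields are kept verbatim; their meaning is not given.
            "extra": fields[1:],
        })
    return entries


config = "BiasGelu+:1:-1,BitmaskDropout+Cast+:1:-1,FusedMatMul+:0:-1"
entries = parse_alleviation_config(config)
for entry in entries:
    print(entry["pattern"], entry["enabled"], entry["extra"])
```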

**Baseline**

On a 1P model (DEBERTA V2) with sequence length 256, training on 16 A100
GPUs. With the latest main branch, we can run batch size 16, and the
maximum batch size is < 32, so 16 is usually chosen by data scientists.
65% of the 40GB memory is used during training. SamplesPerSec=479.2543353561354.


![image](https://user-images.githubusercontent.com/10530022/188320941-13dde5e7-c32b-4399-a64b-6803fbb9dcda.png)

**With this PR**

Gelu is recomputed to reduce peak memory, so batch size 32 can be run.
97% of the 40GB A100 memory is used, and SamplesPerSec=562.041593991271
(**1.17X** of baseline).


![image](https://user-images.githubusercontent.com/10530022/188321081-f64811bf-9637-4873-8095-349de8d498cc.png)


2022-11-03 13:49:41 +08:00
George Nash
77be22f379
[oneDNN ep] Update from oneDNN v2.7.0 to oneDNN v2.7.1 (#13536)
The oneDNN 2.7.1 release includes multiple functional and performance
improvements.

Signed-off-by: George Nash <george.nash@intel.com>

### Description
Update the oneDNN library from 2.7.0 to 2.7.1. This contains multiple
functional and performance improvements.



### Motivation and Context
This is a minor point release from the oneDNN library that gives
performance and functional fixes that were found in the oneDNN 2.7
library shortly after release.

Signed-off-by: George Nash <george.nash@intel.com>
2022-11-02 15:57:49 -07:00
Changming Sun
b1e1b25e04
Delete CUB (#13534)
### Description
Delete CUB

### Motivation and Context
Because it is already in CUDA SDK.
2022-11-02 13:06:22 -07:00
Changming Sun
5914a7e0ae
Fix an error in the python packaging pipeline (#13538)
### Description
It missed a space there.

### Motivation and Context
Right now the pipeline is failing because GSL was just converted from a
submodule to a cmake external project.
2022-11-02 07:55:20 -07:00
Wei-Sheng Chin
b5904c40dd
Enable ORT in TorchDynamo (#13259)
This PR enables ORT to execute graphs captured by TorchDynamo. Major compilation code is in `OrtBackend.compile` in ort_backend.py. `register_backend.py` is for plugging `OrtBackend` into TorchDynamo as a compiler.
2022-11-01 11:19:29 -07:00
PeixuanZuo
6740528b98 [ROCm] Fix bug for rocm ep build using MS GSL 4.0.0 (#13525) 2022-11-01 13:05:55 +08:00
PeixuanZuo
c8886c5b4c Revert "Update CK and fix performance due to lacking -amdgpu-early-inline-all=true (#13493)"
This reverts commit 4dd053cc15.
2022-11-01 13:05:55 +08:00
Baiju Meswani
c557a55816
Fix on-device training ExportModelForInferencing api (#13510) 2022-10-31 21:29:06 -07:00
Vincent Wang
17f0ffd1c8
Support More Cases in NoOpElimination (#13460)
The current NoOpElimination supports only the Add node. This PR adds
support for x-0, x*1, 1*x and x/1, in addition to x+0 and 0+x.

With this PR, all Div(x,1) nodes and their gradients (also Div(x,1)) in
Huggingface's diffusers model can be removed; they previously took ~1% of
total compute time.
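
The identities listed above can be illustrated on a toy expression tree. This is only a sketch of the idea, not the ORT graph transformer: the `Node` class and `eliminate_noop` function are hypothetical, and a real graph pass would also have to consider tensor shapes, broadcasting, and data types, which are ignored here.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Node:
    op: str              # one of "Add", "Sub", "Mul", "Div"
    lhs: "Expr"
    rhs: "Expr"


Expr = Union[Node, str, float]   # a str stands for a named input such as "x"


def is_const(value, target):
    """True when value is a numeric constant equal to target."""
    return isinstance(value, (int, float)) and value == target


def eliminate_noop(expr):
    """Remove x+0, 0+x, x-0, x*1, 1*x and x/1, bottom-up."""
    if not isinstance(expr, Node):
        return expr
    lhs = eliminate_noop(expr.lhs)
    rhs = eliminate_noop(expr.rhs)
    if expr.op == "Add":
        if is_const(rhs, 0):
            return lhs
        if is_const(lhs, 0):
            return rhs
    elif expr.op == "Sub" and is_const(rhs, 0):
        return lhs               # 0 - x is NOT a no-op, so only the rhs is checked
    elif expr.op == "Mul":
        if is_const(rhs, 1):
            return lhs
        if is_const(lhs, 1):
            return rhs
    elif expr.op == "Div" and is_const(rhs, 1):
        return lhs               # 1 / x is NOT a no-op
    return Node(expr.op, lhs, rhs)


simplified = eliminate_noop(Node("Div", Node("Add", "x", 0.0), 1.0))
kept = eliminate_noop(Node("Sub", 0.0, "x"))
print(simplified, kept)
```

Subtraction and division are only simplified when the constant is on the right-hand side, which matches the cases named in the description (x-0 and x/1).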
2022-11-01 10:39:52 +08:00
Patrice Vignola
3d0db47c17
[DML EP] Fix variable shadowing in EinSum (#13520)
### Description
Fix variable shadowing in the DML EP's implementation of EinSum



### Motivation and Context
An SDL bug was opened because of shadowing of the variable `i` in a
nested loop of the EinSum operator.
2022-10-31 19:27:43 -07:00
Patrice Vignola
74f905b237
DML EP enable the provider in the op tests (#13441)
### Description
Enables the DML provider in the op tests to allow for better CI
coverage.



### Motivation and Context
Some of the CI tests for DML were actually running on the CPU because
there was no default DML provider, so it was returning a `nullptr`. This
should add better coverage, and it already uncovered some failures and
asserts hitting in a few tests, which need to be investigated
separately.
2022-10-31 15:49:03 -07:00
Adrian Lizarraga
9d867a07c0
Fix regression in CustomOpApi::GetTensorData (#13450)
- Reverts change to CustomOpApi::GetTensorData introduced by commit 5dae0c477d,
which causes infinite recursion.
- Moves EndsProfilingAllocated to non-const session implementation
(C++ API header).
2022-10-31 12:20:49 -07:00
Edward Chen
2ecd1d6622
Switch GSL to MS GSL 4.0.0 (#13416) 2022-10-29 04:15:20 -07:00
Edward Chen
7fbfbf789f
Increase timeout for binary-size-checks-pipeline. (#13498) 2022-10-28 23:15:56 -07:00
zhangyaobit
33b8778a46
Minor improvement for the documentation of kernel explorer (#13490)
### Description
Fix the input shape of FastGelu
Minor improvement for the documentation of kernel explorer

2022-10-28 22:57:53 -07:00
Fei Hu
943e156f4c
Allow custom ops to set input memory type (#10879) 2022-10-28 21:45:26 -07:00
Hector Li
1b494daffa
Add yml file for Snpe EP build (#13494)
Add yml file for Snpe EP build
2022-10-28 19:47:50 -07:00
Changming Sun
689e524c58
Move DML packaging pipelines to aiinfra-dml-winbuild machine pool (#13487)
1. Move DML packaging pipelines to aiinfra-dml-winbuild machine pool
2. Delete
tools/ci_build/github/azure-pipelines/templates/windowsai-nuget-build.yml
because the pipeline has been migrated to Onebranch. I monitored it for
months, it worked well.
2022-10-28 10:30:16 -07:00
Numfor Tiapo
49e5a11ccd
Fix SDL and Prefast Errors (#13465)
Fixes Errors 1978844, 1978870, 1978850, 1978855, and 9245

Co-authored-by: Numfor Mbiziwo-Tiapo <numform@microsoft.com>
2022-10-28 09:41:18 -07:00
zhangyaobit
0a524cfe1c
Fix the input shape of FastGelu (#13488)
### Description
Fix the input shape of FastGelu


2022-10-28 09:36:31 -07:00
cloudhan
4dd053cc15
Update CK and fix performance due to lacking -amdgpu-early-inline-all=true (#13493)
1. Update CK to its latest develop branch
2. `-mllvm -amdgpu-early-inline-all=true` is critical to CK's
performance, add it.
2022-10-28 09:36:00 -07:00
Vincent Wang
8b0669bf63
QuickGelu Fusion (#12417)
Some models have QuickGelu(x)=x*sigmoid(1.702x), which takes 3 Ops for
forward and 5 Ops for backward. This PR fuses it into a single Op
named QuickGelu, with gradient QuickGeluGrad.
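
For reference, the formula and its derivative can be written out in scalar form. This is a sketch in plain Python, not the fused CUDA or CPU kernel; the function names are illustrative. The gradient follows from the product rule: d/dx [x*s(ax)] = s(ax) + a*x*s(ax)*(1 - s(ax)), where s is the sigmoid and a = 1.702.

```python
import math

ALPHA = 1.702


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def quick_gelu(x, alpha=ALPHA):
    """Forward: x * sigmoid(alpha * x)."""
    return x * sigmoid(alpha * x)


def quick_gelu_grad(dy, x, alpha=ALPHA):
    """Backward: upstream gradient dy times d(quick_gelu)/dx."""
    s = sigmoid(alpha * x)
    return dy * (s + alpha * x * s * (1.0 - s))


# Compare the analytic gradient with a central finite difference.
x, eps = 0.7, 1e-6
numeric = (quick_gelu(x + eps) - quick_gelu(x - eps)) / (2 * eps)
analytic = quick_gelu_grad(1.0, x)
print(abs(numeric - analytic) < 1e-6)
```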

For CUDA, tested in V100 using input tensor with shape [64,128,2048] and
float16 type:
Before, FW takes 335us, BW takes 614us

![image](https://user-images.githubusercontent.com/11661208/182291335-15188709-ffe7-44d1-9d14-0b544cbe5e55.png)

After, FW takes 115us, BW takes 139us, which is much faster.

![image](https://user-images.githubusercontent.com/11661208/182291502-f0b5161c-b95c-45fc-90f8-ad0c592d2433.png)

For the CPU kernel, using the same shape and float type:
Before, FW takes 10ms, BW takes 49ms
Mul: 3480[µs]
Sigmoid: 1996[µs]
Mul: 4789[µs]
Mul: 4642[µs]
Mul: 4195[µs]
SigmoidGrad: 18328[µs]
Mul: 2988[µs]
Sum: 18576[µs]

After, FW takes 4ms, BW takes 5ms, which is also much faster.
QuickGelu: 3939[µs]
QuickGeluGrad: 5089[µs]

Co-authored-by: Vincent Wang <weicwang@microsoft.com>
2022-10-28 18:12:07 +08:00
JiCheng
20c3c35c33
[XNNPACK] support building xnnpack EP for IOS (#13461)
### Description
support building xnnpack for IOS


2022-10-28 15:03:04 +08:00
Changming Sun
07271b6c8a
Update docs/OperatorKernels.md (#13485) 2022-10-27 20:11:49 -07:00
Jian Chen
f9378c5cca
Cjian/c4244 round 2 (#13473)
### Description
Round 2 of fixing C4244



2022-10-27 18:50:26 -04:00
Changming Sun
4a20c0d98b
Delete zlib.cmake (#13467)
Delete the file because it is not included by any other file.
2022-10-27 15:36:04 -07:00
Yi Zhang
67074851a3
Skip failed models on training ci and openvino ci (#13477) 2022-10-27 15:22:47 -07:00
Changming Sun
35659d9021
Increase the timeout value for linux-gpu-tensorrt-ci-pipeline.yml (#13481)
Now it takes about 55-60 minutes. It is on the edge so it often fails.
2022-10-27 14:26:22 -07:00
Scott McKay
ab71c4bbc0
Document generation CI is broken (#13308)
### Description
Fix document generation CI. It's not currently updating the docs because
we're skipping the tests, and the tests are the invocation of build.py
that would have generated the documentation.

Set up a specific task to generate documentation, for greater clarity.

### Motivation and Context
Operator kernel documentation is not getting updated and is now out of
date.
2022-10-28 07:20:48 +10:00
Patrice Vignola
0b29f64dba
[DML EP] Enable all datatypes for Abs and Sign (#13470)
### Description
Enables all datatypes supported for DML for `Abs` and `Sign`.



### Motivation and Context
`Abs` and `Sign` haven't been updated since DML started to support all
datatypes for them. These ops are used in some transformer models and
were forcing unnecessary copies between the CPU and the GPU.
2022-10-27 11:36:11 -07:00
Dmitri Smirnov
0e2087acff
Add extension method to compensate for Contains() absence (#13466)
### Description
The targeted framework does not contain `Contains(string, ordinal)`.
Add an extension method to compensate, following the suggestion
[here](https://learn.microsoft.com/en-us/dotnet/api/system.string.contains?view=net-7.0).


### Motivation and Context
Packaging pipeline fails.
2022-10-27 10:00:47 -07:00
Baiju Meswani
a46c599a40
Training API to export the eval model to an inference model (#13345) 2022-10-27 09:34:01 -07:00
Jian Chen
8827c4bdbc
First round of fixes. (#13452)
### Description
First round of fixes for C4244 error.



2022-10-26 23:05:45 -04:00
Edward Chen
601b74b904
Add '$schema' entry to cgmanifest.json files. (#13444) 2022-10-26 16:15:05 -07:00
Changming Sun
7d58332298
Update tsaoptions.json: update the email alias (#13448) 2022-10-26 15:56:16 -07:00
Vincent Wang
805ec459a0
Fix a PoliCheck finding in _hierarchical_ortmodule.py(#13462) 2022-10-26 15:45:18 -07:00
sumitsays
490e4ddea5
[DML EP] Don't fuse a capability outside the compile call (#13468)
### Description
DML EP was a special EP w.r.t. capability fusion. It used to fuse a
capability outside the IExecutionProvider::Compile() call. But after
recent re-architecture #13131, it is no longer a special case.



### Motivation and Context
To make DML EP consistent with the ORT design.

Co-authored-by: Sumit Agarwal <sumitagarwal@microsoft.com>
2022-10-26 15:21:33 -07:00
Dmitri Smirnov
1c8a22ec68
Improve logging and default affinity mask generation (#13338)
### Description
Fix logging for affinity failures on Linux.

Make `GetCpuCores()` consistently return the number of physical cores.
Use `CpuInfo` library to correctly set affinities for Linux where
supported.
Make Windows generate affinity masks as ordinals and convert them to
masks at the setting site.
Allow setting multiple logical processors affinity masks per thread.
We continue to set all logical processors as thread affinity per
physical core.

### Motivation and Context
Error logging on Linux uses `pthread_self()`, which does not return the
thread ID.
Fix default affinity mask generation on Windows. The following are the
issues with Windows:

- `GetThreadAffinityMasks()` returns bitmasks, but on other platforms it
returns ordinals generated for the hardware concurrency.
- The maximum number of supported processors requires a 64-bit mask, but
the `size_t` type used is not always 64-bit.
- The masks returned per physical core may have multiple bits set,
because the mask applies to several logical cores hosted by the physical
core. In the past, customers complained that their threads jump from one
core to another, which adversely affects performance. The decision was
made to stay this way.
- 64-bit masks do not allow for logical processors with IDs that are
outside of the 0-63 range.
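
The difference between ordinals and bitmasks, and the 0-63 limit mentioned in the last bullet, can be sketched as follows. The helper names `ordinals_to_mask` and `mask_to_ordinals` are hypothetical and are not part of the ORT code; the sketch only shows the conversion that a "convert at the setting site" design implies.

```python
def ordinals_to_mask(ordinals, width=64):
    """Build a bitmask with one bit set per logical processor ordinal."""
    mask = 0
    for cpu in ordinals:
        if not 0 <= cpu < width:
            # A width-bit mask cannot represent processor IDs beyond width - 1.
            raise ValueError(f"processor {cpu} does not fit in a {width}-bit mask")
        mask |= 1 << cpu
    return mask


def mask_to_ordinals(mask):
    """List the positions of the set bits, lowest first."""
    return [bit for bit in range(mask.bit_length()) if (mask >> bit) & 1]


# Two logical processors hosted by one physical core give a mask with two bits set.
core_mask = ordinals_to_mask([2, 3])
print(bin(core_mask), mask_to_ordinals(core_mask))
```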
2022-10-26 13:30:27 -07:00
Rui Ren
136e15bfaf
revert cmake external file (#13459) 2022-10-26 11:38:15 -07:00
Adrian Lizarraga
8770201e96
[EP-Perf-Dashboard] Decouple docker image name from branch name (#13449)
### Description
Updates naming scheme for docker images built by the EP Perf pipeline.
Specifically, the docker image name is no longer based on the branch
name.

### Motivation and Context
The docker image name used by EP Perf pipeline is built from the branch
name. This makes the pipeline fail for branches with uppercase letters
because docker image names can only contain lower-case letters.
2022-10-26 10:27:22 -07:00
Juan Villamizar
48b2ec944c
Fix warnings preventing Onnx build (#13447) 2022-10-26 07:53:55 -07:00
Abhishek Udupa
8fbdc6cc46
Add a script for quick profile analysis (#13423)
### Description
Implements a Python script for quick analysis of a generated JSON
profile from ORT.


### Motivation and Context
This PR implements a script that lists kernels that take up the most
time in a JSON profile, from both the CPU and GPU points-of-view. The
script also supports various options for CSV output, grouping of kernels
wrt shape of input tensors and wrt kernel dimensions.
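
The core of such an analysis can be sketched in a few lines. This is not the script added by the PR. It assumes the profile is a JSON list of trace events in which per-kernel entries have `"cat": "Node"`, a name ending in `_kernel_time`, a `dur` field in microseconds, and an `args.op_name` field; those field names should be checked against an actual profile. The sample events are made up.

```python
import json
from collections import defaultdict


def load_events(path):
    """Read a profile file that contains a JSON list of events."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def top_kernels(events, limit=10):
    """Sum durations per operator type and return the largest totals first."""
    totals = defaultdict(int)
    for event in events:
        # Assumption: kernel timings are "Node" events named "*_kernel_time".
        if event.get("cat") != "Node":
            continue
        if not event.get("name", "").endswith("_kernel_time"):
            continue
        op_name = event.get("args", {}).get("op_name", "unknown")
        totals[op_name] += event.get("dur", 0)
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:limit]


# Made-up events standing in for a real profile.
sample = [
    {"cat": "Session", "name": "model_run", "dur": 1000},
    {"cat": "Node", "name": "MatMul_0_kernel_time", "dur": 120, "args": {"op_name": "MatMul"}},
    {"cat": "Node", "name": "MatMul_1_kernel_time", "dur": 180, "args": {"op_name": "MatMul"}},
    {"cat": "Node", "name": "Gelu_0_kernel_time", "dur": 50, "args": {"op_name": "Gelu"}},
]
ranking = top_kernels(sample)
for op_name, total_us in ranking:
    print(f"{op_name}: {total_us} us")
```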

Co-authored-by: Abhishek Udupa <abhishek.udupa@microsoft.com>
2022-10-26 07:43:03 -07:00
PeixuanZuo
a0cc289be6
Update SkipLayerNorm fusion rules (#13350)
### Description

The subgraph below matches the SkipLayerNorm fusion pattern, but the
fusion rules also require every input dimension to have a concrete
value, so the subgraph cannot be fused into SkipLayerNorm.

subgraph we want to fuse

![image](https://user-images.githubusercontent.com/94887879/196386821-3e678a4c-83e4-4bca-8900-5ef4ea996868.png)

     
fusion pattern 3
     [Sub1]   [Sub2]
         \       /
          \     /
           \   /
            Add1
             |
     LayerNormalization

This change allows the inputs of the FirstAdd operator to have
dimensions that only have a dim_param.


Co-authored-by: peixuanzuo <peixuanzuo@linmif39a000004.zvflicr54joexhdgnhvmxrxygg.phxx.internal.cloudapp.net>
2022-10-26 16:15:27 +08:00
Patrice Vignola
ac48bdec89
DML EP add einsum MatMul NHCW ops (#13440)
### Description
This adds the "NHCW" format support for einsum MatMul. The logic is
basically a merge of the existing Transpose and MatMul Einsum
implementations.



### Motivation and Context
Some transformer models that I'm tracking use Einsum quite often during
a single inference, and about half of those were "NHCW" MatMul Einsums.
Supporting them will reduce the number of copies to the CPU.
2022-10-25 23:09:07 -07:00
Patrice Vignola
d5e8d59243
DML EP register all data types for Where operator (#13443)
### Description
Register all datatypes for DML's `Where` operator since DML now supports
everything.



### Motivation and Context
Some transformer models use the `Where` operator on int64 data, but
since DML wasn't supporting it, it needed to fall back to the CPU.
2022-10-25 22:47:55 -07:00
PeixuanZuo
70b73afd36
[ADD] fuse Matmul + fastgalu -> gemmfastgelu (#11699)
**Description**: Describe your changes.

fuse MatMul + FastGelu -> GemmFastGelu
prepare for AMD optimized fused operator GemmFastGelu

usage:
python benchmark.py -g -m bert-base-cased --sequence_length 384
--batch_sizes 128 --provider=rocm -p fp16 --disable_embed_layer_norm
--enable_gemm_fast_gelu

2022-10-26 09:33:58 +08:00