Commit graph

1093 commits

Author SHA1 Message Date
Severin Simmler
90754fc077
Fix invalid escape sequence (#17145)
### Description
- Removed one unused import
- Escaped a backslash in a path

### Motivation and Context
I see this `DeprecationWarning` when I import `onnxruntime`:

```
onnxruntime/capi/_pybind_state.py:28: DeprecationWarning: invalid escape sequence '\S'
    "(other than %SystemRoot%\System32), "
```

A future version of Python (maybe 3.13?) will raise a `SyntaxError` for
invalid escape sequences.
2023-08-15 10:29:54 -07:00
Tianlei Wu
3aba736ee2
Refactoring of Stable Diffusion scripts (#17138)
Reduce duplicated code in two stable diffusion pipelines (CUDA and TensorRT). Move the common code to models.py
2023-08-15 09:36:31 -07:00
Tianlei Wu
412b0d0831
Update BERT and GPT-2 optimization notebooks for CPU EP (#17057)
The notebooks are not up to update.
(1) Update BERT and GPT-2 optimization notebooks for CPU EP with latest
PyTorch and ONNX Runtime.
(2) Add links to quantization example

### Motivation and Context
https://github.com/microsoft/onnxruntime/issues/16515
2023-08-15 00:55:03 -07:00
pengwa
cd7b3f54da
Allow defining customized PythonOp shape inferer (#17093)
### Allow defining customized PythonOp shape inferer

For `torch.autograd.Function`, we converted it to PythonOp in MSDomain,
there are two places to do shape inferencing for it:

1. in SymbolicShapeInfer, there is one. 
2. in PythonOp op definition. 

For common PythonOp, since we don't know the relation ship between
inputs and outputs, so we only infer the rank from output ranks, and
generate symbolic dimensions for each dim. While this will introduce
many meaningless symbolic dimensions, sometimes blocking our graph
transformers to do op fusion.

This PR provide a way to define custom shape inferencing for
`torch.autograd.Function` we defined, to propagate the original
dimensions across the PythonOp at the best efforts.

But the 2rd one is not covered yet, we could refine that later. Fixing
1st one is enough for ORTModule training/evaluation.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-08-14 09:13:32 +08:00
Chen Fu
f2e1b91634
add int4 quantization code in python (#17077)
### Description
Adding int4 quantization code in python


### Motivation and Context
Python quantization tool no-longer needs to invoke shell to call a
native exe
2023-08-11 15:17:58 -07:00
cloudhan
b4e0fc87ea
[ROCm] Make KE reports with better format (#17049) 2023-08-10 17:44:32 +08:00
PeixuanZuo
7c7c991417
[ROCm] Workaround type conversion issue (#17074) 2023-08-10 11:04:11 +08:00
sfatimar
2c5d4dce77
Openvino ep ort 5.1 (#17042)
OpenVINO EP ORT 5.1 Branch
Changes for the new API to take in OpenVINO Provider Options
and compatibility with OV 2023.1


### Motivation and Context
The change is required for the new API to take in OpenVINO Provider
Options
and make it seamless.

---------

Signed-off-by: MaajidKhan <n.maajid.khan@intel.com>
Co-authored-by: saurabhintel0 <saurabh1.kale@intel.com>
Co-authored-by: MaajidKhan <n.maajid.khan@intel.com>
Co-authored-by: Suryaprakash Shanmugam <suryaprakash.shanmugam@intel.com>
Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com>
2023-08-09 11:50:10 -07:00
cloudhan
a4902ee65b
[CUDA][ROCm] Allow allocating ScratchBuffer from TuningContext (#17028)
By switching to ort native stream, we can allocate scratch buffer
directly from tuning context.
2023-08-10 00:05:10 +08:00
pengwa
6e6f582e08
Use full qualified name for PythonOp export (#17021)
### Use full qualified name for PythonOp export

Originally, when there are duplicate named torch.autograd.Function in
different module, for example:

`a.b.c.Gelu` v.s. `d.e.func.<locals>.Gelu`

We by default will throw exception to let user be aware we cannot
distinguish the two Gelu because during model export, we did not module
path. The workaround is we introduced
`ORTMODULE_SKIPPED_AUTOGRAD_FUNCTIONS` to ignore those duplicated named
Gelu that is not used by model run. This has limitations obviously for
example if two Gelus are both used in training.



This PR finds a way to construct a full qualified name.

`def _export_pt_1_10(g, n, *args, **kwargs):`

1. in exporter function, kwargs contains `name` and `module`, in the
above example:
   `a.b.c.Gelu`  --> name: `Gelu`, module: `a.b.c`
   `d.e.func.<locals>.Gelu` --> name: `Gelu`, module: `d.e`
   
 
Using name and module is not enough to get a full qualified name, for
the second case, where `d.e` is the module path, then there is a
function called `func`, in this function, there is a local
auto.grad.Function named `Gelu`. (Many of our UT looks like this). We
can only get `d.e.Gelu`, but this is not the correct full qual name.

The reason for this: `kwargs[name]` or `n.name` only return the class's
name, not the class's full qual name. (be noted kwargs[module]` is
correct).

2. `n` is torch.Node, we can access `pyobj` to get the
torch.autograd.Function's apply method instance, then use `._self` to
get the torch.autograd.Function class. Then we can get the `module` and
`class`'s ful qual name, added together, we get the full qual name.

With the above change, we don't need use `kwargs[name]` and
`kwargs[module]` , and don't need check naming conflicting or
`ORTMODULE_SKIPPED_AUTOGRAD_FUNCTIONS` env var any more.
2023-08-09 10:58:33 +08:00
Xavier Dupré
d0316ee768
Updating QDQ to support Float8E4M3FN (#16550)
### Description
Naive update quantization tools to support Float8E4M3FN for Gemm.
2023-08-08 12:18:48 +02:00
Chen Fu
3c10f027de
4b quantization for weights of LLMs (#16833)
### Description
Blockwise 4b quantization for LLMs. 
1. Introduce 4b block-wise quantization for linear layer weights.
2. Implements matrix multiplication kernel for fp32 x int4
3. Implements special operator MatMulFpQ4
4. Implements quantization tool, that convert MatMul operator to
MatMulFpQ4, when the right hand side is 2D const tensor.


### Motivation and Context
Compress and accelerate LLMs

|Benchmark | Time(ns)|
|-------------|----------|
|Q4GEMM/Q4Sym/M:1/N:4096/K:4096/Threads:8| 218054|
|Q4GEMM/Q4Sym/M:1024/N:4096/K:4096/Threads:8| 35830155|
|Q4GEMM/Q4Sym/M:2048/N:4096/K:4096/Threads:8| 73479790|
|Q4GEMM/Q4Zp8/M:1/N:4096/K:4096/Threads:8| 270152|
|Q4GEMM/Q4Zp8/M:1024/N:4096/K:4096/Threads:8| 35826721|
|Q4GEMM/Q4Zp8/M:2048/N:4096/K:4096/Threads:8| 73021200|
|Q4GEMM/Q4Sym128/M:1/N:4096/K:4096/Threads:8| 213832|
|Q4GEMM/Q4Sym128/M:1024/N:4096/K:4096/Threads:8| 36749874|
|Q4GEMM/Q4Sym128/M:2048/N:4096/K:4096/Threads:8| 72618120|


|Benchmark | Time(ns)|
|-------------|----------|
|SGEMM/LLM/M:1/N:4096/K:4096/Threads:8|   522610|
|SGEMM/LLM/M:1024/N:4096/K:4096/Threads:8| 39237689|
|SGEMM/LLM/M:2048/N:4096/K:4096/Threads:8| 75983467|

---------

Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
2023-08-07 12:23:55 -07:00
pengwa
3649376f09
Fix few small bugs (#17019)
### Fix few bugs

1. symbolic shape infer, there is no None check before get length. 
2. Rename PythonOp/PythonOpGrad's attribute `name` to `func_name`,
otherwise, when we use onnx.helper.make_node to create node, `name`
conflicts with node name.
3. Filter shape inference warnings for PythonOp for torch 2.0 or newer. 
4. Close file descriptor for log suppression. Without the fix, two extra
fd is left after the log suppression exit its context.
Before enter log suppression (left), Before exit log suppression (right)

![image](https://github.com/microsoft/onnxruntime/assets/10530022/3cd3057a-59f9-4c89-8359-d9b32c49a17e)
   With the fix, no fd added after context exit.

![image](https://github.com/microsoft/onnxruntime/assets/10530022/03454a8f-ab48-4552-bb9b-293a4f51be67)
2023-08-07 14:01:36 +08:00
Yifan Li
d6ce43db5e
[EP Perf] MemTest: Add Valgrind and fix addressSanitizer (#16930)
### Description
1. Add valgrind to existing ep_perf CI MemTest and parse ORT-TRT memLeak
details
1. General Valgrind logs and logs related to ORT-TRT will be parsed in
[CI
artifacts](https://aiinfra.visualstudio.com/Lotus/_build/results?buildId=334122&view=artifacts&pathAsName=false&type=publishedArtifacts)
      1. Logic:
1. Run valgrind with `onnxruntime-perf-test -e tensorrt` and export log
to `valgrind.log`
         2. Identify if any `definitely lost` memleak happened
1. For log paragraphs which show `definitely lost`, parse if they have
keyword `TensorrtExecutionProvider`.
2. If so, extract these details to `ort_trt_memleak_detail.log`, and
return `build failure` to EP Perf CI
3. Fix existing addressSanitizer and sync the squeezenet testcase with
latest update from
[ort-inference-example](https://github.com/microsoft/onnxruntime-inference-examples/blob/main/c_cxx/squeezenet/main.cpp)
1. Updates in short: Upgrade main.cpp to be using
OrtTensorRTProviderOptionsV2
4. Reorder the 7-min-MemTest to be ahead of 9-hr-model-tests, and enable
MemTest by default
2023-08-04 16:58:57 -07:00
Tianlei Wu
a25d0d296b
Add --mask_type option to generate different format of attention mask in bert_perf_test.py (#16976)
### Description
Add an option to generate different formats of attention_mask for
testing transformers models:
1 - 1D mask index, actual sequence length excluding padding
2 - 2D attention mask. Value 0 means padding, 1 otherwise.
3 - 1D, key lengths and cumulated sequence lengths of query and key

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-08-03 15:24:20 -07:00
Tianlei Wu
bda012a4b2
Scripts to convert model with MulitHeadAttention to packing mode (#16925)
### Description

Update scripts for converting model with MulitHeadAttention to packing
mode.
- [x] Update symbolic shape inference for PackedMultiHeadAttention and
GatedRelativePositionBias
- [x] Update convert_to_packing_mode to handle model with
MulitHeadAttention


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-08-03 15:23:55 -07:00
Tianlei Wu
76aff63f37
Update bert_perf_test to test inputs with different padding ratio (#16963)
Add --average_sequence_length and --random_sequence_length so that we
can test the performance of model on different padding ratio.
2023-08-02 10:28:39 -07:00
RandySheriffH
c392fdeb1b
RunAsync Python API (#16760)
Implement python binding for RunAsync API.

---------

Co-authored-by: Randy Shuai <rashuai@microsoft.com>
2023-08-02 10:15:34 -07:00
Patrice Vignola
49512e558a
[DML EP] Add I/O binding and If operator (#16859)
Being able to leverage I/O binding for DML and registering `If` for the
DML EP allows us to avoid copying the past/present key/values back and
forth between the CPU and the GPU after every token.

This gives us a 25% performance increase for Dolly V2 with 128 tokens on
an RTX 4090.
2023-07-31 19:45:59 -07:00
kunal-vaishnavi
3c72f43f78
Extend saving models optimized by inference session (#16912)
### Description
This PR adds support for saving model optimizations after loading a
model that contains external data into an `InferenceSession`.



### Motivation and Context
This PR is a follow-up to a [previous
PR](https://github.com/microsoft/onnxruntime/pull/16716) for saving a
model optimized by an `InferenceSession`.
2023-07-31 16:39:35 -07:00
Wang, Mengni
fe463d4957
Support SmoothQuant for ORT static quantization (#16288)
### Description

Support SmoothQuant for ORT static quantization via intel neural
compressor

> Note:
Please use neural-compressor==2.2 to try SmoothQuant function.

### Motivation and Context
For large language models (LLMs) with gigantic parameters, the
systematic outliers make quantification of activations difficult. As a
training free post-training quantization (PTQ) solution, SmoothQuant
offline migrates this difficulty from activations to weights with a
mathematically equivalent transformation. Integrating SmoothQuant into
ORT quantization can benefit the accuracy of INT8 LLMs.

---------

Signed-off-by: Mengni Wang <mengni.wang@intel.com>
2023-07-26 18:56:45 -07:00
Patrice Vignola
649930142f
[DML EP] Add NCHW and float16 gamma/beta support for GroupNorm (#16814)
This will remove transposes that are non needed in the DML kernel. To
keep backward compatiblity, the default behavior is to set NHWC when no
attribute is set.
2023-07-25 21:43:29 -07:00
Justin Chu
0c1a5098dc
Disable PERF* rules in ruff to allow better readability (#16834)
### Description

Disable two PERF* rules in ruff to allow better readability. Rational
commented inline. This change also removes the unused noqa directives
because of the rule change.

### Motivation and Context

Readability
2023-07-25 15:38:22 -07:00
Maximilian Müller
d8d8349a1b
fix: add missing nullptr of SessionOptions V2 (#16794)
/builds/devtechproviz/dl/ort-builder/onnxruntime/onnxruntime/python/onnxruntime_pybind_state.cc:388:14:
error: missing initializer for member
'OrtTensorRTProviderOptionsV2::trt_cuda_graph_enable'
[-Werror=missing-field-initializers]
  388 |             0};
      |

### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-07-24 15:17:11 -07:00
Justin Chu
d79515041c
[Better Engineering] Bump ruff to 0.0.278 and fix new lint errors (#16789)
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at
bottom):
* __->__ #16789

Bump ruff to 0.0.278 and fix new lint errors. I added noqa to all
existing RUF012 errors which requires mutable class variables to be
annotated with `ClassVar`, as well as all PERF issues.

Signed-off-by: Justin Chu <justinchu@microsoft.com>
2023-07-21 12:53:41 -07:00
Justin Chu
d3295f4329
[Better Engineering] Fix N802 lint errors in tests (#16788)
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at
bottom):
* #16789
* __->__ #16788

This change fixes the N802 lint errors by renaming the test case to use
snake case.
2023-07-21 09:17:34 -07:00
kunal-vaishnavi
b7176f9826
Fix bug with saving model optimized by inference session (#16716)
### Description
A [previous PR](https://github.com/microsoft/onnxruntime/pull/16531)
added a temporary directory to save the model optimizations after
loading a model into an `InferenceSession`. Many models that have an
external data file, however, require the data file to be in the same
directory as the ONNX model file. Because the model is saved in a
temporary directory and the data is saved in another directory, this
causes a `FileNotFoundError` error when trying to load the model in the
temporary directory.

This PR fixes this error by saving the external data file in the same
directory that the optimized model is located in.

### Motivation and Context
This PR fixes a bug with using a temporary directory while running the
optimizer for models that have an external data file.
2023-07-20 18:44:28 -07:00
Xavier Dupré
2bc9fbb621
Fix url in the code documentation (graph optimizations) (#16770)
### Description
Fix a wrong url in the documentation as mentioned in issue #16678.



### Motivation and Context
Better documentation.
2023-07-20 07:02:22 -07:00
Tianlei Wu
95f9628776
Fix transformers optimizations for GPT-NeoX (#16743)
### Description
Fix some issues found in GPT-NeoX graph fusion:
(1) GPT-NeoX uses float16 weights. The step of using onnxruntime with
opt_level==1 uses CPU provider. Since most operators does not have fp16
in CPU EP, so extra Cast nodes are added to up cast to fp32.
(2) Add is shared by two LayerNormalization children, and
SkipLayerNormalization might cause invalid graph.
(3) Reshape fusion might miss since some part only check initializer but
not Constant.

This PR adds a check whether model uses FP16, and output a warning when
use_gpu is not True, and use GPU provider for graph optimization when use_gpu=True.
2023-07-18 10:29:08 -07:00
PeixuanZuo
9b549c646c
[ROCm] fix kernel explorer GemmSoftmaxGemm test (#16735)
GemmSoftmaxGemmTunble occasionally broken with large numerical error.
The root cause of this error is CK's Strided Batched Gemm has larger
error under a specific initialization distribution
`(multinormal_distribution)`.

Generic(Gemm1 + Softmax + Gemm2) implementation is one instance of
GemmSoftmaxGemmTunble. Gemm1 and Gemm2 in Generic implementation are
TunableOps when tuning enabled. In some case GemmSoftmaxGemmTunble
select Generic implentation, while Gemm1 or Gemm2 select ck
implementation, the result of GemmSoftmaxGemmTunble affect by CK.

- Make tolerance more loosen.
- Add `GemmSoftmaxGemmPermuteGenericNestedTunable` to test Generic
implementation with tuning enabled.
2023-07-18 16:47:39 +08:00
cloudhan
0cab7e1a37
[ROCm] Generalize FastGeLU (#16623)
Allow the whole pipeline to be parameterized with unary elementwise
functor.
2023-07-18 11:23:12 +08:00
Tianlei Wu
77b45c6503
Add Stable Diffusion Benchmark on A100-PCIE-80GB (#16702)
0(1) Fix a bug in https://github.com/microsoft/onnxruntime/pull/16560
that UNet shall be set fp16 flag.
(2) Remove wget in requirements since it is no longer needed.
(3) Add benchmark numbers in A100-PCIE-80GB. Note that CUDA EP have
issue to run in batch size 4 so the number is not added.
2023-07-14 10:37:00 -07:00
mindest
810512c658
[ROCm] TunableOp: add hipBLASLt tuning logic (#16338)
### Description
- Add hipBLASLt tuning logic in place of default hipBLASLt
implementation;
- add kernel explorer for hipBLASLt.

related operators: Gemm, StridedBatchedGemm, and GemmFastGelu.

Temporarily mark algos that require extra workspace as unsupported.
Will add workspace support in later PR, which will change Gemm Params
def and affect multiple files.
2023-07-14 08:20:58 +08:00
Vincent Wang
c07a3b869c
Triton Codegen for ORTModule (#15831)
Fuse connected elementwise and reduce Ops to TritonOp and codegen triton
code to run the kernel.

This PR is co-edited by @wejoncy and @er3x3
2023-07-13 18:17:58 +08:00
mindest
b7fd5af48b
[ROCm] TunableOp: Update rocBLAS get_solutions API (since ROCm5.6) (#16657)
### Description
- Update existing rocBLAS get_solutions API using
`*_get_solutions_by_type` (supported from ROCm5.6); remove the original
nested TunableOp logic.
- Update kernel_explorer.
2023-07-13 11:20:26 +08:00
PeixuanZuo
ebc311365b
[ROCm] Optimize ROCm CI to reduce time (#16620)
This PR mainly optimize ROCm CI test to reduce time and CPU utilization.

- use smaller batch size on strided_batched_gemm/batched_gemm test
- disable cpu training test
- fix test_e2e_padding_elimination Occasional failures on ROCm.
2023-07-13 10:58:03 +08:00
Ye Wang
dd7d721f3c
support rotary embeddings in decoder masked self-attention (#16556)
### Description
<!-- Describe your changes. -->

This PR adds support for rotary embeddings in decoder masked
self-attention

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

---------

Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
2023-07-12 13:48:48 -07:00
Tianlei Wu
2de5807703
Attention fusion for UNet onnx model export from PyTorch 2.* (#16629)
### Description
Tested with stable diffusion unet models exported by pytorch nightly.

Example to run:
```
cd onnxruntime/python/tools/transformers/
python optimizer.py --input unet.onnx  --output unet_fp16.onnx --model_type unet --float16 --opt_level 0
```
2023-07-11 14:35:48 -07:00
Justin Chu
ad994565ae
Fix type annotation for InferenceSession (#16632)
The Sequence should have been annotated to take a Union type; otherwise
the annotation would be invalid.
2023-07-11 06:34:22 -07:00
mindest
347c963d5c
[ROCm] Add ROCm Triton TunableOp for GroupNorm (#16196)
### Description
- Refactor existing Triton TunableOp-related code (based on work in
#15862)
- Add GroupNorm Triton implementation
2023-07-11 13:55:30 +08:00
Tianlei Wu
b8f6235f11
Update stable diffusion benchmark for TensorRT EP (#16560)
### Description

Add Stable Diffusion Text2Image pipelines of TensorRT EP and CUDA EP.
They can automatically export and optimize ONNX model, and create
ONNXRuntime session to use TensorRT EP or CUDA execution provider.

Add support for benchmarking TensorRT.

Add support of cuda graph. The feature is only supported in nightly
package right now.

Engine/Provider to test | command line
---- | ---
CUDA EP | `python benchmark.py -v 1.5`
CUDA EP with cuda graph | `python benchmark.py -v 1.5
--enable_cuda_graph`
TensorRT EP | `python benchmark.py -v 1.5 -r tensorrt`
TensorRT EP with cuda graph | `python benchmark.py -v 1.5 -r tensorrt
--enable_cuda_graph`
TensorRT | `python benchmark.py -v 1.5 -e tensorrt`

Add benchmark numbers of T4 GPU using CUDA 11.7, cuDNN 8.5, PyTorch
1.13.1+cu11.7, TensorRT 8.6.1, onnxruntime-gpu 1.15.1 (or
ort-nightly-gpu 1.16 for cuda graph).

TODO: add benchmark numbers of A100-80GB

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-07-10 09:51:03 -07:00
PeixuanZuo
3b729e5d2f
[ROCm] use cupy for GPU-accelerated computing (#16611)
kernel explorer has lots of tests and need numpy to verify the results
of GPU kernels, it will make CPU utilization very high. This PR use
`cupy ` to replace `numpy` to do compute on GPU to reduce CPU
utilization.

set `KERNEL_EXPLORER_TEST_USE_CUPY=1` to enable cupy.
2023-07-10 17:17:39 +08:00
PeixuanZuo
2f56815344
[Whisper] Fix provider in export procress (#16545)
type of parameter `provider` is string.
2023-07-10 11:55:36 +08:00
cloudhan
01c5d05712
Avoid repeated GemmSoftmaxGemmPermuteTunableOp<HipT> ctor invocation (#16518)
The `GemmSoftmaxGemmPermuteTunableOp<HipT>` is expensive to construct,
avoid the ctor invocation will substantially improve the launch time and
get better performance during the decoding. This get <7% e2e time
reduction of whisper large.
2023-07-08 00:25:07 +08:00
Edward Chen
6be7b03e53
Enable -Wshorten-64-to-32 warning if available. (#16524)
- Fix some warnings from Xcode build (`-Wshorten-64-to-32`).
- Enable `-Wshorten-64-to-32` warning if available. Currently it's not fully enabled for `onnxruntime_test_all` and `onnxruntime_providers_xnnpack` yet.
- Some clean up in build.py including setting CMake generator more consistently.
2023-07-07 08:11:44 -07:00
petermcaughan
47f136e2d3
Speed Up Whisper Export (#16504)
### Description
Add a greedy option to the initializer deduplication process in the
Whisper export.

Currently to detect shared initializers, ORT compares every initializer
against every other initializer (n^2). In the comparison operator, if
the two initializers have different data types (e.g. raw_data and
int_64), both initializers are converted to a numpy array and the cast
result is compared. This cast happens in every comparison, and
exponentially affects the runtime of finding shared initializers. This
cast operation is the bottleneck for the current Whisper export script.

The conversion to the numpy array is useful for detecting equal
initializer values across nodes of different data types (e.g.
recognizing a bias value of 0.0 is the same as a slice index of 0) but
isn't triggered when comparing initializers of the same data type (e.g.
weight value of 0.6 == weight value of 0.6). The latter case is where
the majority of utility is for Whisper, and so by eliminating our path
for comparing numpy arrays for initializers we save a lot of time for
minimal cost.

In other words, this PR adds an option to remove the ability to detect
shared initializers of different types (e.g. Slice Index and MatMul
Constant) while retaining the ability to deduplicate weights.



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- Current time to export Whisper-large is prohibitive.

---------

Co-authored-by: Peter McAughan <petermca@microsoft.com>
2023-06-30 12:22:30 -07:00
Mike Guo
aeaa1d650f
make optimized_model_path be in temp folder instead of source model folder for transformer optimization (#16531)
### The optimize_model will generate a temporary model in current model
folder. Most of time, it is fine.

However, the scenario will break when the function run against input
model mount from AzureML. In that case, the mounted folder is read-only.
We have to copy the model to another temp folder to call optimize_model
to workaround this issue. Otherwise, the optimize_model will fail when
creating the optimized model in the read-only folder. However, the model
copy is painful, especially when model is huge.

This PR just expose the optimized_model_path at optimize_model level so
that the caller could decide where to save the temp model.
2023-06-30 20:19:51 +08:00
cao lei
0c5f492493
remove AllocatorMgr class (#16509)
### Description
Remove AllocatorManager class


### Motivation and Context
After the refactor PR #15833 is in, AllocatorManager class is not
referenced anymore.
2023-06-28 15:43:19 -07:00
Tianlei Wu
9407c3270c
GPT-2 attention fusion for transformers >= 4.27 (#16461)
### Description
Before transformers 4.27, the causal mask uses uint8 data type, so there
is extra Cast node to convert it to bool. This adds a pattern that
without Cast node to support attention fusion for GPT-2 models exported
with transformers >= 4.27.

### Motivation and Context

https://github.com/microsoft/onnxruntime/issues/16453
2023-06-23 15:38:35 -07:00
Hector Li
a8c313dec4
[QNN EP] Python script to modify Onnx model to make it aligned with converted QNN model (#16423)
Python script to modify Onnx model to make it aligned with converted QNN
model

### Description
Onnxruntime QNN EP can support context binary file generated by QNN tool chain. However QNN generated context binary file uses channel last and 8 bits or 16 bits for input and output. This script get the QNN model input & output information from QNN converted model_net.json file, and insert Cast, Transpose nodes to Onnx model if required.
2023-06-23 11:00:51 -07:00