Commit graph

7774 commits

Author SHA1 Message Date
mindest
f34ebbc8ff
fix a wrong assert condition in benchmark_helper (#13821)
### Description
fix a wrong assert condition in benchmark_helper.py (introduced in
#13455)
2022-12-03 18:50:47 +08:00
Pranav Sharma
335b62bde6
Fix invocation of GetInputMemoryType. (#13828)
### Description
GetInputMemoryType was introduced in ver 13 in [this
PR](https://github.com/microsoft/onnxruntime/pull/10879). The ver check
introduced in this PR allows custom ops compiled using older versions to
work with newer versions (> 12) of the ORT binary.
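
The version gate described above can be sketched roughly as follows. This is only an illustration in Python, not the ORT implementation: the `CustomOp` class, its fields, and `DEFAULT_MEMORY_TYPE` are hypothetical names, and only the "introduced in ver 13" threshold comes from the description.

```python
from dataclasses import dataclass
from typing import Callable, Optional

DEFAULT_MEMORY_TYPE = "default"  # hypothetical placeholder value


@dataclass
class CustomOp:
    """Hypothetical stand-in for a custom op compiled against some API version."""
    version: int
    get_input_memory_type: Optional[Callable[[int], str]] = None


def input_memory_type(op, index):
    """Only call the newer callback when the op was built with ver 13 or later."""
    if op.version > 12 and op.get_input_memory_type is not None:
        return op.get_input_memory_type(index)
    # Ops built against ver 12 or older predate the callback, so fall back.
    return DEFAULT_MEMORY_TYPE
```

An op built against version 12 never has the callback touched, which is the binary-compatibility property the fix is after.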

### Motivation and Context
Fixes binary compatibility.
2022-12-02 18:42:14 -08:00
Patrice Vignola
b53bbe7370
[DML EP] Add an implementation for NonZero (#13768)
### Description
Add the NonZero op for DML



### Motivation and Context
NonZero is used in a few transformer models, so having a DML
implementation will stop large tensors from being transferred to the CPU
and back to the GPU
2022-12-02 18:39:21 -08:00
Gaz Iqbal
b9702587df
[oneDNN] Implemented Concat Op (#13646)
### Description
This PR implements the **Concat Operator** for the **OneDNN Execution
Provider**.

### Motivation and Context
- As part of evaluating ORT performance on ARM based targets such as
Graviton3, we discovered that the OneDNN EP had some gaps on operator
coverage.
- The Concat Operator is fairly common and used in models such as
Yolov5, MobileNet, DistillBert and GPT2
- For Yolov5 specifically, this improves average inference time over 100
runs on Graviton3 from 180.2ms to 115.5ms when using OneDNN + ARM
Compute Library.

Co-authored-by: Gaz Iqbal <giqbal@octoml.ai>
2022-12-02 13:30:37 -08:00
Patrice Vignola
c2d08fd73a
[DML EP] Add support for LayerNorm (scale == nullptr) != (bias == nullptr) (#13818)
### Description
Add support for LayerNorm when exactly one of scale and bias is null, i.e. (scale == nullptr) != (bias == nullptr).
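
As a rough reference for what "scale or bias may be absent independently" means numerically, here is a minimal pure-Python sketch. It is not the DML kernel; the function name, the `epsilon` default, and the single-row layout are assumptions made for illustration.

```python
import math


def layer_norm(x, scale=None, bias=None, epsilon=1e-5):
    """Normalize one row; scale and bias are each optional, independently."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    inv_std = 1.0 / math.sqrt(var + epsilon)
    out = [(v - mean) * inv_std for v in x]
    if scale is not None:
        # Scale present: multiply element-wise.
        out = [o * s for o, s in zip(out, scale)]
    if bias is not None:
        # Bias present: add element-wise, whether or not scale was given.
        out = [o + b for o, b in zip(out, bias)]
    return out
```

Calling `layer_norm(row, bias=b)` or `layer_norm(row, scale=s)` covers the two mixed cases named in the title.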
2022-12-02 13:19:53 -08:00
Patrice Vignola
a0b470bc35
[DML EP] Add mixed datatype support for DML's LayerNorm contrib op (#13734)
### Description
Add mixed datatype support for DML's LayerNorm contrib op.



### Motivation and Context
The fusion logic removes casts around LayerNorm in the graph because the
contrib version of the op supports mixed datatypes. Scale, Bias and
Output's datatypes must match, but input's datatype can be different.
2022-12-01 14:08:18 -08:00
JiCheng
82d123b6c9
[quick fix] Build onnxruntime under DISABLE_ABSEIL (#13799) 2022-12-01 10:00:31 -08:00
Changming Sun
04900f96c1
Improve dependency management (#13523)
## Description
1. Convert some git submodules to cmake external projects
2. Update nsync from
[1.23.0](https://github.com/google/nsync/releases/tag/1.23.0) to
[1.25.0](https://github.com/google/nsync/releases/tag/1.25.0)
3. Update re2 from 2021-06-01 to 2022-06-01
4. Update wil from an old commit to 1.0.220914.1 tag
5. Update gtest to a newer commit so that it can optionally leverage
absl/re2 for parsing command line flags.

The following git submodules are deleted:

1. FP16
2. safeint
3. XNNPACK
4. cxxopts
5. dlpack
6. flatbuffers
7. googlebenchmark
8. json
9. mimalloc
10. mp11
11. pthreadpool

More will come.

## Motivation and Context
There are 3 ways of integrating 3rd party C/C++ libraries into ONNX
Runtime:
1. Install them to a system location, then use cmake's find_package
module to locate them.
2. Use git submodules.
3. Use cmake's external projects (ExternalProject_Add).

At first when this project was just started, we considered both option 2
and option 3. We preferred option 2 because:

1. It's easier to handle authentication. At first this project was not
open source, and it had some other non-public dependencies. If we use
git submodule, ADO will handle authentication smoothly. Otherwise we
need to manually pass tokens around and be very careful on not exposing
them in build logs.
2. At that time, cmake fetched dependencies after "cmake" finished
generating vcprojects/makefiles. So it was very difficult to make cflags
consistent. Since cmake 3.11, it has a new command: FetchContent, which
fetches dependencies when it generates vcprojects/makefiles just before
add_subdirectories, so the parent project's variables/settings can be
easily passed to the child projects.

As the project went on, we had some new concerns:
1. As we started to have more and more EPs and build configs, the number
of submodules grew quickly. For most developers, most ORT submodules are
not relevant. They shouldn't need to download all of them.
2. It is impossible to let two different build configs use two different
versions of the same dependency. For example, right now we have protobuf
3.18.3 in the submodules, so every EP must use the same version.
Whenever we need to upgrade protobuf, we need to coordinate
across the whole team and many external developers. I can't manage it
anymore.
3. Some projects want to manage the dependencies in a different way,
either because of their preference or because of compliance
requirements. For example, some Microsoft teams want to use vcpkg, but
we don't want to force every user of onnxruntime to use vcpkg.
4. Some users want to dynamically link to protobuf, but our build script
only does static linking.
5. It is hard to handle security vulnerabilities. For example, whenever
protobuf has a security patch, we have a lot of things to do. But if we
allowed people to build ORT with a different version of protobuf without
changing ORT's source code, customers who build ORT from source would
be able to act on such things more quickly. They would not need to
wait for ORT to have a patch release.
6. Every time we do a release, GitHub also publishes a source
zip file and a source tarball for us. But they are not usable,
because they are missing the submodules.
 
### New features

After this change, users will be able to:
1. Build the dependencies in the way they want, then install them
somewhere (for example, /usr or a temp folder).
2. Or download the dependencies by using cmake commands from these
dependencies' official websites.
3. Similar to the above, but use private mirrors to mitigate supply
chain risks.
4. Use different versions of the dependencies, as long as our source
code is compatible with them. For example, you can't use
protobuf 3.20.x, as it needs code changes in ONNX Runtime.
5. Only download the things the current build needs.
6. Avoid building external dependencies again and again in every build.

### Breaking change
The onnxruntime_PREFER_SYSTEM_LIB build option is removed; you can think of it as
being ON by default from now on. If you don't like the new behavior, you can set FETCHCONTENT_TRY_FIND_PACKAGE_MODE to NEVER.
Besides, for those who relied on the onnxruntime_PREFER_SYSTEM_LIB build
option, please be aware that this PR changes find_package calls from
Module mode to Config mode. For example, in the past, if you had
installed protobuf with apt-get from Ubuntu 20.04's official repo,
find_package could find it and use it. After this PR, it won't. This
is because the protobuf version provided by Ubuntu 20.04 is too old to
support Config mode. This can be resolved by getting a newer version
of protobuf from somewhere.
2022-12-01 09:51:59 -08:00
Patrice Vignola
e9b92fdf33
[DML EP] Add DML implementation for BiasGelu (#13795)
### Description
Add DML implementation for BiasGelu
2022-12-01 09:23:19 -08:00
Numfor Tiapo
e0dcbc3832
Fix C26436 prefast errors (#13774)
Fixes errors 9196, 9214, 9255, and 9314.

Co-authored-by: Numfor Mbiziwo-Tiapo <numform@microsoft.com>
2022-12-01 09:07:44 -08:00
Patrice Vignola
4128e44b4f
[DML EP] Upgrade DML to 1.10.0 (#13796)
### Description
Upgrade DML to 1.10.0
2022-11-30 21:32:14 -08:00
Yi Zhang
777c474f61
skip quantized model C# tests on GPU (#13782)
### Description
Skip quantized model C# tests on GPU too.

### Motivation and Context
It looks like the current test results aren't reasonable.
https://github.com/onnx/models/issues/581

Once we update the image, the quantized model [test data will be
generated with
VNNI](ba629906dd),
and the CI would break.
2022-12-01 12:33:20 +08:00
Wei-Sheng Chin
7df8f84228
Improve DORT document (#13790)
1. Refine words based on PyTorch changes.
2. Make the need of inference mode clearer. A test is added.
2022-11-30 16:55:25 -08:00
Yulong Wang
77c97b6f16
[js/rn] support load model from buffer on Android (#12676)
**Description**: [js/React Native] Add android implementation for
creating session from buffer. #12500

Co-authored-by: Rachel Guo <guorachel@microsoft.com>
2022-11-30 10:55:55 -08:00
Wei-Sheng Chin
639d285670
[DORT] Catch up with yesterday's PyTorch change (#13779)
Fix recent CI failures.
2022-11-30 09:23:44 -08:00
Xavier Dupré
441b30b2d2
Move a function call outside a loop in ORTModule (#13771)
### Description
The proposed change is useful for ORTModule when the output graph has
multiple outputs.



### Motivation and Context
performance

Signed-off-by: xadupre <xadupre@microsoft.com>
2022-11-30 12:49:41 +01:00
Patrice Vignola
08ed09d20b
Add DML support to the transformers benchmark.py script (#13776)
### Description
Add DML support to the transformers benchmark.py script



### Motivation and Context
Before this change, running the `benchmark.py` script when the
`onnxruntime-directml` package is installed resulted in an error because
it expects a CUDA or ROCM framework.
2022-11-29 18:57:52 -08:00
Changming Sun
29ed8811e5
Move C/C++ deps' URLs to deps.txt (#13769)
### Description
1. Move C/C++ deps' URLs to deps.txt, and download the dependencies from
Azure Devops Artifacts instead of github.
2. Add "EXCLUDE_FROM_ALL" keyword to the cmake external projects, so
that we only build the parts we need and avoid installing the 3rd-party
dependencies when people run `make install` in ORT's build directory.
However, at this moment cmake itself doesn't have the feature. So I
copied their code to cmake/external/helper_functions.cmake and modified
it.

This PR is split from #13523, to make that one smaller. 

### Motivation and Context
1. Secure the supply chain
2. Make it possible to automatically detect whether ORT has an old
dependency that hasn't been updated for a long time.
2022-11-29 18:06:35 -08:00
Jeff Bloomfield
571dc5a1f1
Support external weights in DML execution provider (#13740)
### Description
This enables support for external weights in the DML execution provider
when its graph optimization logic is reached.

### Motivation and Context
External weights are encountered after optimization is applied to
transformer models.
2022-11-29 15:47:16 -08:00
stevenlix
ce0025d3f2
Fallback Pow op in layer norm to FP32 in TRT to avoid overflow (#13639)
Accuracy loss is observed when transformer models such as BERT, DeBERTa,
ViT are running in TRT FP16 mode. The cause is that overflow happens at
Pow op in layer norm.
This PR provides the option to force Pow to run in TRT FP32 precision if
overflow occurs.
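
To see why Pow is the overflow point, recall that the largest finite FP16 value is 65504, so squaring any input with magnitude above roughly 255.9 no longer fits. The sketch below illustrates that arithmetic and a fallback decision; the function names are hypothetical and this is not how the TRT EP detects overflow.

```python
FP16_MAX = 65504.0  # largest finite IEEE 754 half-precision value


def pow_overflows_fp16(x, exponent=2.0):
    """True if |x| ** exponent exceeds the FP16 range.

    Rounding right at the boundary is ignored; this is only an estimate.
    """
    return abs(x) ** exponent > FP16_MAX


def choose_pow_precision(inputs, exponent=2.0):
    """Pick FP32 for Pow if any input would overflow in FP16 (illustrative)."""
    if any(pow_overflows_fp16(v, exponent) for v in inputs):
        return "fp32"
    return "fp16"
```

For example, an activation of 300 squared is 90000, which is already beyond 65504.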

Co-authored-by: Ubuntu <azureuser@orteplinuxdev.bxgbzpva45kedp3rhbsbit4phb.jx.internal.cloudapp.net>
2022-11-29 13:37:31 -08:00
Chi Lo
0327606d2d
Revert TRT EP Linux CI to run unit tests in container (#13766)
Revert the TRT EP Linux CI to the old behavior, in which both the code build
and the unit tests run in a container, so that we don't have to update the
native Ubuntu VM image to include the latest TRT libraries every time a newer
version of TRT is introduced.
2022-11-29 13:15:27 -08:00
Tianlei Wu
abe1642a0c
Update fusion for distilbert accuracy test on SQuAD (#13748)
(1) Embed layer fusion to work with --use_mask_index.
(2) Parse num_heads and hidden_size from a pattern of Concat shape node.
(3) Fix a typo (CUDAExcecutionProvider=> CUDAExecutionProvider) in eval_squad.py
(4) Update example comments in eval_squad.py to use optimized fp16 model.
(5) Update tests in test_optimizer.py
2022-11-29 13:06:39 -08:00
FFrog
181628ced1
[CANN] add more operators (#13578)
### Description
Adds new operators and enhances existing ones.

### Motivation and Context
The operators of the CANN EP are modified as follows:

The list of enhanced operators is as follows:

- Add
- Sub
- Mul
- Div
- Gemm
- MatMul
- AveragePool
- GlobalAveragePool
- MaxPool
- GlobalMaxPool
- Dropout

The new operators are as follows:

- Abs
- Neg
- Floor
- Ceil
- Reciprocal
- Sqrt
- Log
- Exp
- Erf
- Round
- Sin
- Cos
- Cast
- Reshape
- Transpose

The remaining operators will be supported in the next PRs.
2022-11-29 12:08:36 -08:00
Baiju Meswani
2c29938846
[QAT] Introduce FakeQuant op (#13649) 2022-11-29 08:43:37 -08:00
sfatimar
49c3768985
Enabled ops for DeBERTa model (#13690)
### Description
Enabled GatherElements Ops to enable DeBERTA Model



### Motivation and Context
- This change is required to enable DeBerta Model which is relevant to
MSFT

Co-authored-by: mayavijx <mayax.vijayan@intel.com>
2022-11-28 22:39:32 -08:00
pengwa
7c53b6eee8
Skip the tests of saving tensor in backward (#13767)
### skip the tests of saving tensor in backward

The test fails randomly; let's skip it until the issue is fixed, to
unblock the CIs.
2022-11-29 13:02:26 +08:00
Vincent Wang
3c258c878c
[CUDA] Optimize Slice Kernel (#13641)
The PR optimizes the Slice CUDA kernel in two ways:
- Coalesce dimensions so less divmod during the kernel compute
- Split data load and write for better memory throughput
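
The first point can be illustrated with a small sketch: adjacent dimensions that are not sliced can be merged into one, so a flat index needs fewer div/mod steps to be decomposed. The function below is an illustration written for this log, assuming step 1 and treating a dimension as unsliced when its range is the full [0, dim); it is not the kernel's actual code.

```python
def coalesce_slice_dims(shape, starts, ends):
    """Merge runs of adjacent unsliced dimensions.

    shape, starts and ends have one entry per dimension.
    Returns (new_shape, new_starts, new_ends).
    """
    new_shape, new_starts, new_ends = [], [], []
    prev_full = False
    for dim, start, end in zip(shape, starts, ends):
        full = start == 0 and end == dim
        if full and prev_full:
            # Fold this dimension into the previous unsliced one.
            new_shape[-1] *= dim
            new_ends[-1] *= dim
        else:
            new_shape.append(dim)
            new_starts.append(start)
            new_ends.append(end)
        prev_full = full
    return new_shape, new_starts, new_ends
```

Under these assumptions, slicing [8,12,2048,1024] on axis=2 becomes a slice of a 3-D tensor [96,2048,1024] on axis=1.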

Below shows some perf results (cycles number from Nsight Compute) in
V100 using real cases from Huggingface's XLNet model:

Case | Old | New
-- | -- | --
[8,12,2048,1024], axis=2, start=1, end=2048 | 1838687 | 1539846
[8,12,1024,2047], axis=3, start=0, end=1024 | 951383 | 722203
2022-11-29 09:18:03 +08:00
JiCheng
47780b7f3b
[XNNPACK] add more computation heavy ops (#13270)
### Description
This is the first PR adding the remaining ops for the XNNPACK EP.
I am going to add:
- [x] ConvTranspose f32 qu8 qs8
- [x] ~~UnMaxpool f32 qu8 qs8~~
- [x] Resize f32 qu8 qs8
- [ ] GEMM, see https://github.com/microsoft/onnxruntime/pull/13126

Support for the remaining operations will be separated into another PR.

### Motivation and Context
2022-11-29 09:09:26 +08:00
Dmitri Smirnov
4fbe16e493
Ifdef cpuinfo code on platforms we do not set affinity (#13486)
### Description
Remove code that invokes cpuinfo library on platforms we do not set
affinity.

### Motivation and Context
`cpuinfo` library increases binary size.
2022-11-28 13:44:16 -08:00
Guenther Schmuelling
2d523c507e
for wasm catch exceptions at top level api (#13644)
fix for https://github.com/microsoft/onnxruntime/issues/13383,
https://github.com/microsoft/onnxruntime/issues/13408

Currently ort-web doesn't catch exceptions because turning on exception
catching increases the binary size by 3MB (~30%).
But ORT can throw (i.e. onnx errors or ORT_ENFORCE), and then there is no
usable error message.

Turning on exception catching just for the top-level API file will
fix the error messages with a minimal increase in binary size.
2022-11-28 10:24:34 -08:00
Faith Xu
b7c3862330
Update resource section in readme (#13724)
### Description
- adds link to release plans page
- adds link to youtube channel
2022-11-28 09:42:31 -08:00
Jicheng Tang
b4a4fa5aac
Fix compile error with protobuf RepeatedIterator (#13731)
### Description
There are some compile errors with
google::protobuf::internal::RepeatedIterator.
Replace reinterpret_cast with &(*iter), where iter is of type
RepeatedIterator.


### Motivation and Context
My protobuf version is:
- libprotoc 3.21.5
- g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0

when I use build command:
```
./build.sh --use_cuda --cudnn_home /usr --cuda_home /usr/local/cuda --config Debug --build_shared_lib --parallel 
```

There are some compile errors like this:

- error 1
onnxruntime/test/util/test_utils.cc:186:105: error: no matching function
for call to ‘make_span(google::protobuf::RepeatedField<long
int>::const_iterator, google::protobuf::RepeatedField<long
int>::const_iterator)’
186 | ind_span = gsl::make_span(indices_proto.int64_data().cbegin(),
indices_proto.int64_data().cend());

- error 2
onnxruntime/test/onnx/tensorprotoutils.cc:101:56: error: invalid cast
from type ‘google::protobuf::internal::RepeatedIterator<const long
unsigned int>’ to type ‘const uint32_t*’ {aka ‘const unsigned int*’}
  101 |       *p_data++ = *reinterpret_cast<const T*>(data_iter);
2022-11-28 09:33:53 -08:00
Numfor Tiapo
aa1390e963
Fix Prefast Errors (#13675)
Fixes all C28204, C6031, and C26814 prefast errors.

Co-authored-by: Numfor Mbiziwo-Tiapo <numform@microsoft.com>
2022-11-28 09:16:22 -08:00
Ted Themistokleous
c6bea4f02f
Modify MIGraphX EP for Accuracy tests (#13455)
Allows MIGraphX EP to run the following additional tests. Also adds support to get MIGraphX to run eval_squad.py

Reference to the Rocm EP changes: https://github.com/microsoft/onnxruntime/pull/13306

Co-authored-by: Joseph Groenenboom <joseph.groenenboom@amd.com>
Co-authored-by: Ted Themistokleous <tthemist@amd.com>
2022-11-27 18:26:49 +08:00
Yufeng Li
4ca62b9ee8
fix build break in test/beam_search_topk.cc (#13739) 2022-11-23 21:20:51 -08:00
Vincent Wang
47e7630378
[CUDA] Transpose3DImpl Supporting more Cases (#13611)
CUDA's Transpose3DImpl transposes [batch, m, n] to [batch, n, m].
Currently it requires both m and n to be divisible by 32 or 16. Otherwise,
the computation falls back to the general implementation,
which is slow. This PR removes that limitation.
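
The idea of handling sizes that are not a multiple of the tile width can be sketched on the CPU as below. This is a plain Python illustration, not the CUDA kernel: `TILE = 32` and the function name are assumptions, and the `min()` bounds stand in for the boundary checks a tiled kernel needs on partial tiles.

```python
TILE = 32  # assumed tile edge, mirroring a typical shared-memory tile


def transpose_3d(data, batch, m, n):
    """Transpose a flat row-major [batch, m, n] buffer to [batch, n, m]."""
    out = [None] * (batch * m * n)
    for b in range(batch):
        base = b * m * n
        for i0 in range(0, m, TILE):
            for j0 in range(0, n, TILE):
                # Partial tiles at the edges are clipped by min().
                for i in range(i0, min(i0 + TILE, m)):
                    for j in range(j0, min(j0 + TILE, n)):
                        out[base + j * m + i] = data[base + i * n + j]
    return out
```

With the clipping in place, shapes such as m=2048, n=12 go through the same tiled path instead of a separate fallback.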

Profiling on V100 with the tensor sizes below gave the following cycle
counts from Nsight Compute:

Shape | Old | New
-- | -- | --
[3072,64,512] | 760793 | 727140
[3072,16,2048] | 854303 | 851146
[3072,2048,12] | 986924 | 737884
[3072,1024,24] | 1212427 | 495117

This shows that even though we added extra IF statements to the kernel
implementation, there is nearly no impact on the cases the old version
already handled (cases 1 and 2). Cases 3 and 4, which previously fell back
to the general implementation, are much faster.

The data above was collected using FP16 tensors; similar results were
observed for float tensors.

This PR improves the performance of ORT training of Huggingface's XLNet
model, which has [8,1024,1024,12].permute(0,3,1,2).
2022-11-24 09:40:48 +08:00
Yi Zhang
87d5703b14
skip TestCUDAProviderOptions in End2EndTest (#13737)
### Description
Skip the test with --filter in runtest.sh

### Motivation and Context
Recently, the Zip-Nuget-Java-Nodejs Packaging Pipeline has always failed in
Nuget_Test_Linux_GPU.
To unblock the packaging workflow, skip the test in Nuget_Test_Linux_GPU
temporarily.
The exception message is below.
```
[xUnit.net 00:07:26.28]     TestCUDAProviderOptions [FAIL]
  Failed TestCUDAProviderOptions [1 m 19 s]
  Error Message:
   Microsoft.ML.OnnxRuntime.OnnxRuntimeException : [ErrorCode:RuntimeException] Non-zero status code returned while running FusedConv node. Name:'' Status Message: /onnxruntime_src/onnxruntime/core/framework/bfc_arena.cc:342 void* onnxruntime::BFCArena::AllocateRawInternal(size_t, bool) Available memory of 11416064 is smaller than requested bytes of 134217728

  Stack Trace:
     at Microsoft.ML.OnnxRuntime.NativeApiStatus.VerifySuccess(IntPtr nativeStatus)
   at Microsoft.ML.OnnxRuntime.InferenceSession.RunImpl(RunOptions options, IntPtr[] inputNames, IntPtr[] inputValues, IntPtr[] outputNames, DisposableList`1 cleanupList)
   at Microsoft.ML.OnnxRuntime.InferenceSession.Run(IReadOnlyCollection`1 inputs, IReadOnlyCollection`1 outputNames, RunOptions options)
   at Microsoft.ML.OnnxRuntime.InferenceSession.Run(IReadOnlyCollection`1 inputs, IReadOnlyCollection`1 outputNames)
   at Microsoft.ML.OnnxRuntime.InferenceSession.Run(IReadOnlyCollection`1 inputs)
   at Microsoft.ML.OnnxRuntime.Tests.CUDATest.TestCUDAProviderOptions() in /mnt/vss/_work/1/s/csharp/test/Microsoft.ML.OnnxRuntime.Tests.NetCoreApp/InferenceTest.netcore.cs:line 93

Failed!  - Failed:     1, Passed:     0, Skipped:     0, Total:     1, Duration: < 1 ms - /mnt/vss/_work/1/s/csharp/test/Microsoft.ML.OnnxRuntime.EndToEndTests/bin/Debug/netcoreapp3.1/Microsoft.ML.OnnxRuntime.EndToEndTests.dll (netcoreapp3.1)
       Done executing task "Microsoft.TestPlatform.Build.Tasks.VSTestTask" -- FAILED.
     1>Done building target "VSTest" in project "Microsoft.ML.OnnxRuntime.EndToEndTests.csproj" -- FAILED.
     1>Done Building Project "/mnt/vss/_work/1/s/csharp/test/Microsoft.ML.OnnxRuntime.EndToEndTests/Microsoft.ML.OnnxRuntime.EndToEndTests.csproj" (VSTest target(s)) -- FAILED.
```
2022-11-23 14:56:04 -08:00
Ye Wang
c1bda4c1cc
fix buffer overuse in addtofeed() (#13733)
### Description



### Motivation and Context
2022-11-23 10:53:53 -08:00
Tianlei Wu
e306b44e98
Improve coverage of fused MHA in Attention (#13732)
Previously, fused attention was applied to a limited set of sequence lengths (64,
96, 128, 256, 384, 512). This change expands support to all sequence lengths
<= 384 for V100 and T4, or <= 512 for A100.

Previously, fused attention only worked for batch_size=1. After this
change, fused MHA has no limit on batch_size.
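
The new sequence-length rule can be written down as a small lookup. This sketch models only the limits stated above; the table and function names are hypothetical, and any other conditions the real kernel selection applies are not modeled.

```python
# Maximum sequence length for fused attention per GPU, as stated above.
MAX_FUSED_SEQUENCE_LENGTH = {"V100": 384, "T4": 384, "A100": 512}


def can_use_fused_attention(gpu, sequence_length):
    """Apply only the sequence-length rule; batch size is no longer a limit."""
    limit = MAX_FUSED_SEQUENCE_LENGTH.get(gpu)
    return limit is not None and 0 < sequence_length <= limit
```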

## Accuracy Tests on SQuAD

Using optimized fp16 onnx model of
distilbert-base-cased-distilled-squad, we test the CUDA EP with IO
Binding using eval_squad.py:

disable_fused_attention | batch_size | sequence_length | exact | f1 | samples_per_second | latency_in_ms
-- | -- | -- | -- | -- | -- | --
TRUE | 1 | 384 | 79.6 | 86.8 | 283.5 | 3.5
TRUE | 2 | 384 | 79.6 | 86.8 | 308.3 | 3.2
FALSE | 1 | 384 | 79.6 | 86.8 | 313.2 | 3.2
FALSE | 2 | 384 | 79.6 | 86.8 | 340.9 | 2.9
TRUE | 1 | 300 | 79.3 | 86.6 | 278.5 | 3.6
TRUE | 2 | 300 | 79.4 | 86.6 | 301.8 | 3.3
FALSE | 1 | 300 | 79.4 | 86.6 | 305.8 | 3.3
FALSE | 2 | 300 | 79.4 | 86.6 | 335.9 | 3.0

This shows that the same accuracy is achieved with and without fused attention.

Note that latency number here is just for reference (eval_squad.py has
not been optimized for speed). We can see that it is about 10% faster
with fused attention than without fused attention.

Package versions used: onnx 1.12.0, torch 1.13.0, transformers 4.24.0,
optimum 1.5.0, datasets 2.7.0, evaluate 0.3.0

## Performance Test of bert-base-cased on T4 GPU
```
sudo nvidia-smi -rgc
export ORT_DISABLE_FUSED_ATTENTION=0
python benchmark.py -m bert-base-cased -e onnxruntime -g -p fp16 -o by_script -i 3 -t 1000 -b 1 8  -s 8 16 32 64 80 96 120 128 --use_mask_index --overwrite
```

Disable_Fused_Attention | b1_s8 | b1_s16 | b1_s32 | b1_s64 | b1_s80 | b1_s96 | b1_s120 | b1_s128 | b8_s8 | b8_s16 | b8_s32 | b8_s64 | b8_s80 | b8_s96 | b8_s120 | b8_s128
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
FALSE | 1.32 | 1.28 | 1.33 | 1.51 | 1.71 | 1.79 | 1.99 | 2.04 | 1.56 | 1.99 | 2.85 | 4.88 | 6.03 | 7.03 | 9.2 | 9.34
TRUE | 1.37 | 1.34 | 1.44 | 1.68 | 1.89 | 1.99 | 2.15 | 2.21 | 1.63 | 2.31 | 3.19 | 5.48 | 6.98 | 8.14 | 10.54 | 10.66
Latency Reduction | 3.6% | 4.5% | 7.6% | 10.1% | 9.5% | 10.1% | 7.4% | 7.7% | 4.3% | 13.9% | 10.7% | 10.9% | 13.6% | 13.6% | 12.7% | 12.4%

Perf gain is observed in all sequence lengths tested.
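
The "Latency Reduction" row appears to be consistent with the relative difference between the two latency rows, taking the run with fused attention disabled as the baseline. That formula is inferred from the numbers, not stated in the PR; the helper name below is made up for illustration.

```python
def latency_reduction(disabled_ms, enabled_ms):
    """Percent latency saved by fused attention relative to the disabled run."""
    return (disabled_ms - enabled_ms) / disabled_ms * 100.0


# b1_s8: 1.37 ms with fused attention disabled, 1.32 ms with it enabled.
b1_s8 = round(latency_reduction(1.37, 1.32), 1)
# b8_s16: 2.31 ms disabled, 1.99 ms enabled.
b8_s16 = round(latency_reduction(2.31, 1.99), 1)
```

Both values match the corresponding table entries (3.6% and 13.9%).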
2022-11-23 10:19:04 -08:00
Changming Sun
87e6a26c5d
Enforce Prefast check in Windows CPU CI pipeline (#13735)
Right now we fix the warnings in an ad-hoc way. We run static analysis
in nightly builds, then create work items for the findings. Our
CI build pipelines run the same scan but do not break the build. So
this PR fixes the remaining findings in the CPU EP (including the
training part) and enforces the check. Later on we can continue to expand
the scope.

We still have some warnings left in the JNI part. I will try to address
them next month.
2022-11-23 09:25:02 -08:00
Ted Themistokleous
9168e25738
Patch eval_squad.py script for Python < 3.8 and multiple Execution Providers (#13524)
Need this for benchmarks to function correctly with older containers

This fixes import errors when attempting to run eval_squad.py to
evaluate bert distilled models

Adds a change to the previously merged #12947 which fails when using
Python version < 3.8 to run this script.

Co-authored-by: Ted Themistokleous <tthemist@amd.com>
2022-11-23 15:37:39 +08:00
PeixuanZuo
977da6635b
[ROCm] Remove tuning options on transformerOptions (#13689)
### Description

Remove tuning options on transformerOptions, use IsTunableOpEnabled from
provider in the future.

### Motivation and Context

Co-authored-by: peixuanzuo <peixuanzuo@linmif39a000004.zvflicr54joexhdgnhvmxrxygg.phxx.internal.cloudapp.net>
2022-11-23 15:36:09 +08:00
Yufeng Li
c43ce64795
Beam search TopK improvement (#13594)
### Description

TopK in BeamSearch retrieves top 2*beam next tokens based on logit
score, specifically computing top [batch, 2*beam] tokens based on score
[batch, beam, vocab_size].

### Motivation and Context
The current implementation uses batch as the grid, and each thread block
computes the top 2*beam from [beam, vocab_size]. It is inefficient because:
1. batch size is usually small (< 32) and cannot fully leverage the GPU's
SMs; 2. vocab_size is usually more than 50k, and it is inefficient to
compute 50k * beam in one thread block.

This PR splits the TopK computation into multiple stages:
- for small beam sizes, split [batch, beam, vocab_size] into [batch, beam,
parts_of_vocab, vocab_size_per_part]
  - 1st stage: each thread block computes the top 2*beam from
vocab_size_per_part and gets [batch, beam, parts_of_vocab, 2*beam]
  - 2nd stage: each thread block computes the top 2*beam from parts_of_vocab
* (2*beam) and gets [batch, beam, 2*beam]
  - last stage: compute [batch, 2*beam] from [batch, beam, 2*beam]
- for large beam sizes, the 1st stage computes [batch, beam, 2*beam] from
[batch, beam, vocab_size] and the 2nd stage computes [batch, 2*beam] from
[batch, beam, 2*beam].
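
The small-beam staging can be sketched sequentially for a single batch entry as below. This is a CPU illustration using `heapq`, not the CUDA kernel; the function names and the (score, beam, token) tuple layout are choices made for the example.

```python
import heapq


def top_k(scored, k):
    """Return the k highest (score, beam, token) tuples, best first."""
    return heapq.nlargest(k, scored)


def staged_beam_top_k(scores, beam, parts):
    """scores: [beam][vocab] for one batch entry -> top 2*beam candidates."""
    k = 2 * beam
    vocab = len(scores[0])
    part_size = -(-vocab // parts)  # ceiling division
    per_beam = []
    for b, row in enumerate(scores):
        # Stage 1: top-k inside each vocabulary part.
        stage1 = []
        for start in range(0, vocab, part_size):
            stop = min(start + part_size, vocab)
            chunk = [(row[t], b, t) for t in range(start, stop)]
            stage1.extend(top_k(chunk, k))
        # Stage 2: reduce the per-part candidates to k for this beam.
        per_beam.extend(top_k(stage1, k))
    # Last stage: reduce beam * k candidates to k for the batch entry.
    return top_k(per_beam, k)
```

Because every global top-k element is also in the top k of its own part, the staged result equals a single top-k over the whole [beam, vocab_size] slice, while each stage works on a much smaller input.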

With this change, performance improves considerably: it saves ~100us out of 2ms
for batch:4, beam:4, vocab_size:~50k.
2022-11-22 21:24:27 -08:00
apsonawane
7857f59d2b
Use sequences to create initial feeds for decoder subgraph (#13719)
Use sequences to create initial feeds for decoder subgraph instead of
beam_next_tokens

### Description
For TuLG models, the decoder is exported differently from the bart model.
Passing beam_next_tokens to the decoder during ORT inference produced
results that differed from PyTorch inference.
This change uses sequences as inputs for the first iteration as well.


### Motivation and Context
PyTorch and ORT inference results for TuLG models did not match; treating
the PyTorch result as correct, we modified ORT to match it.
2022-11-22 18:00:58 -08:00
Baiju Meswani
fb85b31fac
Remove protobuf pin from training requirements (#13695) 2022-11-22 12:27:18 -08:00
Yulong Wang
2bebe6189a
set node schema when apply NHWC transformer (#13660)
### Description
Set the node schema when applying the NHWC transformer.

### Motivation and Context
The implementation in `IExecutionProvider::GetCapability()` checks node
schema to determine the capability of the current EP. If NHWC graph
transformer created a new channel last `Conv` node to replace the
channel first `Conv` node, we need to assign the schema to the replaced
node.
2022-11-22 12:26:52 -08:00
Patrice Vignola
ce460f9cdb
[DML EP] Return device removal reason when D3D12 device gets removed (#13727)
### Description
Before this change, when the D3D12 device was getting removed, we were
returning a generic device removed error, which can be harder to
investigate.



### Motivation and Context
It makes it easier to debug and investigate device removal failures.
2022-11-22 10:38:56 -08:00
Patrice Vignola
6c5333e1a7
[DML EP] Enable more DML tests (#13726)
### Description
Enables more DML tests.



### Motivation and Context
It increases test coverage that was missing for the DML EP
2022-11-22 10:35:16 -08:00
Adam Pocock
dd2c031d95
[java] Sparse tensor support (#10653)
**Description**:

Adds support for creating and receiving sparse tensors in the ORT Java
API.

CSRC and COO tensors as inputs are tested, but there is no op which
accepts a block sparse tensor to test. COO tensors are tested as
outputs, but there is no op which emits a CSRC or block sparse tensor to
test.

**Motivation and Context**
- Why is this change required? What problem does it solve? Request to
expose ORT sparse tensor support in Java.

cc @yuslepukhin
2022-11-22 10:29:24 -08:00
Tianlei Wu
8b0e0f4927
Add RemovePadding and RestorePadding for BERT model (#13701)
Add two operators, RemovePadding and RestorePadding, based on the idea of
effective transformer (https://github.com/bytedance/effective_transformer), to improve large
batch size inference for the BERT model.
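
The underlying idea can be sketched as packing only the real tokens before the heavy computation and scattering results back afterwards. The sketch below is a plain Python illustration under assumed conventions (a 0/1 mask, list-of-lists tensors, hypothetical function names); it does not reflect the operators' actual inputs and outputs.

```python
def remove_padding(tokens, mask):
    """tokens, mask: [batch][seq]; mask is 1 for real tokens, 0 for padding.

    Returns the packed real tokens and each one's (batch, position).
    """
    packed, index = [], []
    for b, (row, mask_row) in enumerate(zip(tokens, mask)):
        for s, (tok, keep) in enumerate(zip(row, mask_row)):
            if keep:
                packed.append(tok)
                index.append((b, s))
    return packed, index


def restore_padding(packed, index, batch, seq, pad_value=0):
    """Scatter packed values back into a padded [batch][seq] layout."""
    out = [[pad_value] * seq for _ in range(batch)]
    for tok, (b, s) in zip(packed, index):
        out[b][s] = tok
    return out
```

With large batches of mostly short sequences, the packed form is much smaller than batch * seq, which is where the savings would come from.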
2022-11-22 10:00:23 -08:00