### Description
<!-- Describe your changes. -->
Accept the command line option --symmetric and its optional value
correctly. If the optional value matches 'True' case-insensitively, set
symmetric to True; otherwise set symmetric to False. Asymmetric quantization
generates a zero_point input.
```
usage: matmul_4bits_quantizer.py [-h] --input_model INPUT_MODEL --output_model OUTPUT_MODEL [--block_size BLOCK_SIZE] [--symmetric [{True,False}]] [--accuracy_level ACCURACY_LEVEL] [-v]
[--nodes_to_exclude NODES_TO_EXCLUDE [NODES_TO_EXCLUDE ...]]
```
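The option handling described above can be sketched with `argparse` as follows. This is a minimal illustration, not the code from `matmul_4bits_quantizer.py`: the helper names `parse_bool` and `build_parser`, and the choice of `True` for both `const` and `default`, are assumptions.

```python
import argparse


def parse_bool(value: str) -> bool:
    # Case-insensitive match: only "true" (in any casing) maps to True.
    return value.lower() == "true"


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="matmul_4bits_quantizer.py")
    parser.add_argument(
        "--symmetric",
        nargs="?",              # the value after the flag is optional
        const=True,             # assumed: a bare --symmetric means True
        default=True,           # assumed default when the flag is absent
        type=parse_bool,        # applied only when a value is supplied
        choices=[True, False],  # rendered as [{True,False}] in the usage text
    )
    return parser


args = build_parser().parse_args(["--symmetric", "false"])
print(args.symmetric)
```

With `nargs="?"`, argparse uses `const` when the flag appears without a value and runs `type` only on an explicit value, so any string other than a casing of "true" ends up as `False`.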
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Added dummy return values to functions which specify a return value, but
do not return a value.
### Motivation and Context
Fix compiler errors with 'warnings as errors' enabled.
### Description
<!-- Describe your changes. -->
build.py sets a few parallelization parameters when building. Invoking
msbuild directly does not set them.
7a5860e490/tools/ci_build/build.py (L1665-L1669)
Changed to use build.py. If there's a concern with that we _could_ set
the parameters in the yaml, but that will be uglier due to duplicating
logic in multiple places.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Fixes cmake function definition in winml.cmake to copy link flags.
### Motivation and Context
XFGCheck reports errors in WindowsAI because this function does not transfer
linker flags.
### Description
<!-- Describe your changes. -->
Provide specific xcodebuild flags instead of depending on cmake to do
the right thing.
This built in just over an hour with a ccache miss. Previous CIs with a
ccache miss were timing out after 150 minutes.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Disable the __cpuid check on arm64 builds, as the intrinsic is not available there.
Motivation
The check was breaking the arm64 build.
Co-authored-by: Sheil Kumar <sheilk@microsoft.com>
### Description
This is required to make shape uniforms really work.
### Motivation and Context
The bug was unveiled in a model with multiple Split nodes. The later
nodes would try to reuse a previous pipeline cache, while the old shapes
were hardcoded as constants in cache.
### Description
The current quantization tool relies on shape inference to provide the
type of every intermediate tensor, then the tool knows which type it
must dequantize into (float32, float16). However, this information is
not available if shape inference fails. That happens every time the
model includes an operator from a custom domain such as com.microsoft.
This PR introduces an extra option `DefaultTensorType` as a fallback
when the quantizer cannot find the type it needs.
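The fallback behavior can be sketched as below. This is only an illustration of the lookup order described above, not the quantizer's code: `resolve_tensor_type`, `inferred_types`, and the error message are hypothetical, and the integer constants mirror the `onnx.TensorProto` enum values so the sketch runs without `onnx` installed.

```python
from typing import Dict

FLOAT = 1     # same value as onnx.TensorProto.FLOAT
FLOAT16 = 10  # same value as onnx.TensorProto.FLOAT16


def resolve_tensor_type(
    name: str,
    inferred_types: Dict[str, int],
    extra_options: Dict[str, int],
) -> int:
    """Return the element type to dequantize into for tensor `name`."""
    # Preferred source: the type found by shape inference.
    if name in inferred_types:
        return inferred_types[name]
    # Fallback: the user-supplied DefaultTensorType, if any.
    default = extra_options.get("DefaultTensorType")
    if default is None:
        raise RuntimeError(
            f"Unable to find the type of tensor {name!r}; "
            "consider setting the DefaultTensorType extra option."
        )
    return default


# Shape inference failed for "custom_out", so the default is used.
print(resolve_tensor_type("custom_out", {}, {"DefaultTensorType": FLOAT}))
```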
### Motivation and Context
This fixes issue #19409.
### Description
I've replaced all occurrences of C++ designated initializers in the CUDA
NHWC Tests with member initialization.
### Motivation and Context
C++ designated initializers were introduced in C++20. GCC nevertheless
accepts designated initializers in C++17, which is the standard used to
compile onnxruntime. MSVC conforms to the standard and accepts this
feature only from C++20 onward, which leads to compile failures on Windows
without this change.
### Minor fix for cmake
When building on Linux, the following warning is shown: "
CMake Warning at CMakeLists.txt:1603 (message):
MPI and NCCL disabled on Win build.
"
This message is not correct on Linux, so this change fixes it to avoid
misleading users.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Add MatMulNBits to support MatMul using 4-bit quantized weights
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Updates the default QNN SDK version to 2.19.2.240210.
### Motivation and Context
Build and test the latest version of QNN SDK in our pipelines.
### Description
This PR updates exporting and running the Whisper model with beam search
by adding the following.
- Adds temperature as a graph input to the exported model
- Fixes the token ids by adding them as attributes to
`WhisperBeamSearch`
- Fixes the timestamps test cases so they pass now
- Fixes a bug with invoking `torch.onnx.export`
- Cleans up the Whisper scripts and groups the arguments in
`convert_to_onnx.py`
- Adds a `requirements.txt` file to specify package dependencies
- Adds `whisper-large-v3` to list of pretrained models
- Fixes a bug with missing cross-attention KV cache inputs in the
decoder subgraph
### Motivation and Context
- This is a follow-up to [this
PR](https://github.com/microsoft/onnxruntime/pull/19188).
- The incorrect token ids in the timestamps processor were first noticed
during [this PR
review](https://github.com/microsoft/onnxruntime/pull/17500#discussion_r1333520007).
When they were originally added in [this
PR](https://github.com/microsoft/onnxruntime/pull/15853), the offsets
were constant across the Whisper model sizes. When comparing
the new `whisper-large-v3` variant, the English-only variants (e.g.
`whisper-tiny.en`), and the original variants (e.g. `whisper-tiny`),
both the values and the offsets differ. Therefore, it is easier to set
the token ids as attributes to `WhisperBeamSearch` when exporting to
ensure the right values are used in the timestamps processor.
- The Hugging Face API for returning timestamps and the expected outputs
from the PyTorch model have both changed.
- The fix for `torch.onnx.export` is a follow-up to [this PR
review](https://github.com/microsoft/onnxruntime/pull/17179#issuecomment-1683001470).
- The argument grouping is a follow-up to [this PR
review](https://github.com/microsoft/onnxruntime/pull/17500#discussion_r1333521721).
- Specific package versions are needed to run the Whisper scripts and
the `requirements.txt` file ensures that these versions are installed.
- The `whisper-large-v3` variant is released and should be in the list
of official pretrained models.
- After the changes from [this
PR](https://github.com/microsoft/onnxruntime/pull/17316), the exported
model is not loading in an ORT inference session because the
cross-attention KV cache inputs are missing in the decoder subgraph.
### Description
Some test thresholds that previously worked on the T4 GPU no longer
work. The reason is that the current pipeline uses A10 GPUs, where TF32 is
enabled by default.
Disable TF32 in the Linux GPU CI pipeline during testing to avoid such random
test failures.
### Motivation and Context
The Linux tests fail randomly in the following tests:
ProviderOptionsTest > testCUDAOptions() FAILED
org.opentest4j.AssertionFailedError: array contents differ at index
[446], expected: <0.0419757> but was: <0.041948937>
at
app//org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
at
app//org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
at
app//org.junit.jupiter.api.AssertArrayEquals.failArraysNotEqual(AssertArrayEquals.java:440)
at
app//org.junit.jupiter.api.AssertArrayEquals.assertArrayEquals(AssertArrayEquals.java:290)
at
app//org.junit.jupiter.api.AssertArrayEquals.assertArrayEquals(AssertArrayEquals.java:123)
at
app//org.junit.jupiter.api.AssertArrayEquals.assertArrayEquals(AssertArrayEquals.java:119)
at
app//org.junit.jupiter.api.Assertions.assertArrayEquals(Assertions.java:1360)
at
app//ai.onnxruntime.providers.ProviderOptionsTest.runProvider(ProviderOptionsTest.java:99)
at
app//ai.onnxruntime.providers.ProviderOptionsTest.testCUDAOptions(ProviderOptionsTest.java:43)
org.opentest4j.AssertionFailedError: array contents differ at index [6],
expected: <0.0225981> but was: <0.022587791>
at
app//org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
at
app//org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
at
app//org.junit.jupiter.api.AssertArrayEquals.failArraysNotEqual(AssertArrayEquals.java:440)
at
app//org.junit.jupiter.api.AssertArrayEquals.assertArrayEquals(AssertArrayEquals.java:290)
at
app//org.junit.jupiter.api.AssertArrayEquals.assertArrayEquals(AssertArrayEquals.java:123)
at
app//org.junit.jupiter.api.AssertArrayEquals.assertArrayEquals(AssertArrayEquals.java:119)
at
app//org.junit.jupiter.api.Assertions.assertArrayEquals(Assertions.java:1360)
at app//ai.onnxruntime.InferenceTest.runProvider(InferenceTest.java:676)
at app//ai.onnxruntime.InferenceTest.testCUDA(InferenceTest.java:615)
### Description
Fuses DQ -> Q sequences into a QNN Convert operator if:
- Converting from one qtype to another. Ex: Dequantize(uint8 to float)
-> Quantize(float to uint16)
- The DQ and Q operators are not part of another node unit (i.e.,
standalone)
- The Q operator is the only consumer for the DQ operator.
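The three conditions above can be sketched as a single predicate. This is an illustrative model only, not the QNN EP code: the `DQQPair` record, its field names, and `can_fuse_to_convert` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DQQPair:
    """Hypothetical summary of a DQ -> Q sequence found in the graph."""
    dq_input_type: str      # quantized type consumed by DQ, e.g. "uint8"
    q_output_type: str      # quantized type produced by Q, e.g. "uint16"
    dq_is_standalone: bool  # DQ is not part of another node unit
    q_is_standalone: bool   # Q is not part of another node unit
    dq_consumer_count: int  # number of nodes consuming the DQ output


def can_fuse_to_convert(pair: DQQPair) -> bool:
    # 1) The sequence converts from one quantized type to another.
    converts_type = pair.dq_input_type != pair.q_output_type
    # 2) Both operators are standalone.
    standalone = pair.dq_is_standalone and pair.q_is_standalone
    # 3) The Q operator is the only consumer of the DQ operator.
    single_consumer = pair.dq_consumer_count == 1
    return converts_type and standalone and single_consumer


print(can_fuse_to_convert(DQQPair("uint8", "uint16", True, True, 1)))
```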
### Motivation and Context
Allows faster execution of QDQ models with mixed activation types by
leveraging the QNN Convert operator, which converts between quantization
types. For certain models, this results in inference latency speed-ups
of up to 2x (depends on the number of DQ -> Q sequences).
#### Example for Add node unit with 16-bit I/O:
Original:
```
u8 ----> DQ ---> Q ---u16--> Add ---u16-->
^
|
u16 --------------------------+
```
After fusing DQ -> Q:
```
u8 ----> Convert ---u16--> Add ---u16-->
^
|
u16 ------------------------+
```
**Description**
1) During SessionInitialization, KahnsTopologicalSort is a major cause
of perf degradation.
The main cause of slow down is that the TopologicalSort needs to keep
track of nodes to visit in order, and reorder them based on priority (as
informed by a comparator). The existing implementation uses a
priority_queue that is backed by a std::vector container. However,
vectors are not good for insertion and reordering. The appropriate data
type for this operation is a linked list. However, linked lists like
std::list are not usable as a container for std::priority_queue. This is
because std::priority_queue requires random access, which linked lists
do not have. However, for this simple implementation, we can use a
std::list under the hood and perform insertions manually using
std::upper_bound. This drastically reduces the time taken by the method,
which currently causes numerous copies and a lot of element movement
inside the list of graph nodes to visit.
2) In the comparator, I hide forward and backward attribute checking
behind the #ifdef ENABLE_TRAINING macro, as I believe it should only be
valid in the training scenario.
3) In the NoopElimination transformer, I avoid creating an Initializer
(which unpacks TensorProto data) for every node and only create
initializers when Add/Sub/Mul/Div op nodes are detected.
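The ordering logic from item 1 can be sketched in Python as follows. It is a rough analog under stated assumptions, not the C++ implementation: `kahn_sort` and the integer `priority` function are hypothetical, `bisect.bisect_right` stands in for `std::upper_bound`, and a Python list is array-backed, so the sketch shows only the sorted-insertion idea and not the container-level performance difference.

```python
import bisect
from collections import defaultdict


def kahn_sort(nodes, edges, priority):
    """Topologically sort `nodes`; among ready nodes, lowest priority value first."""
    in_degree = {n: 0 for n in nodes}
    successors = defaultdict(list)
    for src, dst in edges:
        successors[src].append(dst)
        in_degree[dst] += 1

    # Ready nodes are kept sorted by priority; `ready_keys` mirrors `ready`.
    ready = sorted((n for n in nodes if in_degree[n] == 0), key=priority)
    ready_keys = [priority(n) for n in ready]

    order = []
    while ready:
        node = ready.pop(0)
        ready_keys.pop(0)
        order.append(node)
        for nxt in successors[node]:
            in_degree[nxt] -= 1
            if in_degree[nxt] == 0:
                # Insert after any existing entries with an equal key,
                # which is what an upper-bound search gives.
                key = priority(nxt)
                pos = bisect.bisect_right(ready_keys, key)
                ready_keys.insert(pos, key)
                ready.insert(pos, nxt)

    if len(order) != len(nodes):
        raise ValueError("graph contains a cycle")
    return order


prio = {"a": 2, "b": 1, "c": 0, "d": 0}
print(kahn_sort(["a", "b", "c", "d"], [("a", "c"), ("b", "c"), ("c", "d")], prio.get))
```

Each newly ready node is placed directly at its final position, so the ready list never needs to be re-sorted or re-heapified as a whole.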
**Motivation and Context**
Session creation time of many models is quite slow.
---------
Co-authored-by: Sheil Kumar <sheilk@microsoft.com>
### Description
The unit tests take 19 minutes to run (in a debug build) because of too
many combinations. I reduced the number of combinations while retaining
good test coverage. After the change, the tests finish in 51 seconds.
Before:
[----------] 2 tests from DecoderMaskedSelfAttentionTest
[ RUN ] DecoderMaskedSelfAttentionTest.Test_fp32
[ OK ] DecoderMaskedSelfAttentionTest.Test_fp32 (394086 ms)
[ RUN ] DecoderMaskedSelfAttentionTest.Test_fp16
[ OK ] DecoderMaskedSelfAttentionTest.Test_fp16 (747035 ms)
[----------] 2 tests from DecoderMaskedSelfAttentionTest (1141122 ms
total)
After:
[----------] 2 tests from DecoderMaskedSelfAttentionTest
[ RUN ] DecoderMaskedSelfAttentionTest.Test_fp32
[ OK ] DecoderMaskedSelfAttentionTest.Test_fp32 (21057 ms)
[ RUN ] DecoderMaskedSelfAttentionTest.Test_fp16
[ OK ] DecoderMaskedSelfAttentionTest.Test_fp16 (30653 ms)
[----------] 2 tests from DecoderMaskedSelfAttentionTest (51710 ms
total)
### Motivation and Context
Reduce test time, and improve build pipeline efficiency.
### Description
Changed the actions/stale version back to v8 from v9.
### Motivation and Context
There is a well-documented issue w/ the new actions/stale version
(v9.0.0) that causes the following error: "Error delete _state: [403]
Resource not accessible by integration". See
https://github.com/actions/stale/issues/1133 for more context.
This issue is preventing the stale bot from labeling stale issues since
the version was updated b/c the action can no longer access the cache
and cannot apply labels to all issues due to GH API rate limiting.
There are two potential fixes if we continue to use the new version: (1)
run the action on all PRs/issues to avoid using the cache or (2) give
write access to the endpoints listed in
https://docs.github.com/en/rest/authentication/permissions-required-for-fine-grained-personal-access-tokens?apiVersion=2022-11-28#repository-permissions-for-actions.
Neither of these options is preferable, so I am going to wait until the
bug is fixed.
Note: The old version (v8.0.0) uses Node 16, which will be deprecated in
Spring 2024, instead of Node 20, so we should keep an eye on [this
issue](https://github.com/actions/stale/issues/1133) to see when they
make the fix and we can switch back to the new version.
### Description
<!-- Describe your changes. -->
The ROCm CI pipeline fails with the following error:
```
Downloading and preparing dataset wikitext/wikitext-2-raw-v1 (download: 4.50 MiB, generated: 12.91 MiB, post-processed: Unknown size, total: 17.41 MiB) to /home/onnxruntimedev/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/aa5e094000ec7afeb74c3be92c88313cd6f132d564c7effd961c10fd47c76f20...
main()
File "/stage/huggingface-transformers/examples/pytorch/language-modeling/run_mlm.py", line 242, in main
datasets = load_dataset(data_args.dataset_name, data_args.dataset_config_name, cache_dir=model_args.cache_dir)
File "/opt/miniconda/envs/rocm-ci/lib/python3.9/site-packages/datasets/load.py", line 856, in load_dataset
builder_instance.download_and_prepare(
File "/opt/miniconda/envs/rocm-ci/lib/python3.9/site-packages/datasets/builder.py", line 583, in download_and_prepare
self._download_and_prepare(
File "/opt/miniconda/envs/rocm-ci/lib/python3.9/site-packages/datasets/builder.py", line 639, in _download_and_prepare
split_generators = self._split_generators(dl_manager, **split_generators_kwargs)
File "/home/onnxruntimedev/.cache/huggingface/modules/datasets_modules/datasets/wikitext/aa5e094000ec7afeb74c3be92c88313cd6f132d564c7effd961c10fd47c76f20/wikitext.py", line 138, in _split_generators
data_file = dl_manager.download_and_extract(self.config.data_url)
File "/opt/miniconda/envs/rocm-ci/lib/python3.9/site-packages/datasets/utils/download_manager.py", line 289, in download_and_extract
return self.extract(self.download(url_or_urls))
File "/opt/miniconda/envs/rocm-ci/lib/python3.9/site-packages/datasets/utils/download_manager.py", line 197, in download
downloaded_path_or_paths = map_nested(
File "/opt/miniconda/envs/rocm-ci/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 195, in map_nested
return function(data_struct)
File "/opt/miniconda/envs/rocm-ci/lib/python3.9/site-packages/datasets/utils/download_manager.py", line 220, in _download
return cached_path(url_or_filename, download_config=download_config)
File "/opt/miniconda/envs/rocm-ci/lib/python3.9/site-packages/datasets/utils/file_utils.py", line 281, in cached_path
output_path = get_from_cache(
File "/opt/miniconda/envs/rocm-ci/lib/python3.9/site-packages/datasets/utils/file_utils.py", line 634, in get_from_cache
raise ConnectionError("Couldn't reach {}".format(url))
ConnectionError: Couldn't reach https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-raw-v1.zip
```
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Update `datasets` in the pipeline to the latest version, `2.17.0`.
### Description
See the comments inside of the changed files for more detailed
information.
The files onnxruntime/core/platform/windows/hardware_core_enumerator.cc
and onnxruntime/core/platform/windows/hardware_core_enumerator.h were
copied from the WinML source folder in this repo, with minor coding style
changes.
I had an offline discussion with Sheil. We agreed that, given the lack of
a future-proof solution, we can check in this temporary fix first and rework
it later. I will meet with @ivberg to discuss the issue in depth
and look for a long-term solution. Thanks for offering help,
@ivberg!
### Motivation and Context
With this change, we will see about 2x perf improvement on some Intel
CPUs.
### Description
Sqrt does not have BF16 support yet. This PR adds it.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Multi Query Attention Optimization
Multi-query attention splits the fused QKV tensor as follows:
```
batch_size, seq_length, three_times_hidden_size = fused_qkv.shape
fused_qkv = fused_qkv.view(batch_size, seq_length, self.num_heads + 2, self.head_dim)
return fused_qkv[..., :-2, :], fused_qkv[..., [-2], :], fused_qkv[..., [-1], :]
```
This can be optimized to:
```
batch_size, seq_length, three_times_hidden_size = fused_qkv.shape
fused_qkv = fused_qkv.view(batch_size, seq_length, self.num_heads + 2, self.head_dim)
(query, key, value) = fused_qkv.split([self.num_heads, 1, 1], dim=2)
return query, key, value
```
This optimization can be validated with Nsight profiling and perf
benchmarking.
<img width="545" alt="image"
src="https://github.com/microsoft/onnxruntime/assets/15321482/cefcd061-4a01-4aaf-a008-8e265f7f63e9">
As such, this PR fuses the `Gather/Gather/Slice` ops into a single `Split`
kernel.
### Optimization Target
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Because 2 `Gather` kernels and 1 `Slice` kernel are time-consuming in backward
propagation, it is more efficient to use 1 `Split` kernel.
### Example
- Before Fusion
<img width="419" alt="image"
src="https://github.com/microsoft/onnxruntime/assets/15321482/17410319-57ea-4176-afd4-1efdcd3fdbae">
- After Fusion
<img width="424" alt="image"
src="https://github.com/microsoft/onnxruntime/assets/15321482/f1ee1582-96d4-45f4-8778-49d1f3fd370a">
### Perf Gain
After the optimization, there is a **~7%** perf gain.
> The `Transpose` kernel can be fused too; this will be updated in the next PR.
However, after testing Transpose op fusion on the Falcon model, there is
no perf gain, so no new PR will be created.
---------
Co-authored-by: ruiren <ruiren@microsoft.com>
### Description
<!-- Describe your changes. -->
Adds infrastructure to create an ML Package containing the Model using
ML Program. Updated coremltools files to v7.1 to bring in new protobuf
definitions along with the tools to write the weight.bin file and create
an ML Package correctly.
Enables building a CoreML Model on all platforms which means all the
operator builder code can be debugged anywhere. Execution of the
generated CoreML model is obviously limited to Apple platforms.
The Conv operator builder has been updated to be able to generate an ML
Program Operation.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
NeuralNetwork is no longer being developed and ML Program is the
replacement going forward.
### Description
<!-- Describe your changes. -->
This PR upgrades ORTModule's default opset from 15 to 17. Opset 17 is
the final opset supported by torchscript exporter
(https://github.com/pytorch/pytorch/pull/107829)
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Engineering excellence contribution for ORT Training DRI.
---------
Co-authored-by: Prathik Rao <prathikrao@microsoft.com@orttrainingdev8.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
### Description
Limit SoC core detection via 2 level cache core logic to Intel and
Hybrid processors.
### Motivation and Context
The following code was added to add support for a new class of CPU cores
present in Intel’s next generation Intel Core Ultra mobile processors.
This code is essential to avoid placing threads on low performing SoC
cores that don’t have L3 cache. SoC cores are meant to specialize in
system bringup and help improve responsiveness and power usage, in other
words they are not meant to run compute heavy AI workloads. In order to
avoid broad exposure of this logic, it is currently designed to be
restricted to Intel platforms that have hybrid enabled.
---------
Co-authored-by: Sheil Kumar <sheilk@microsoft.com>
### Description
Increase the threshold to 1e-5 to avoid test failures in CUDA when the
difference is slightly larger than 1e-6.
This may be because TF32 is used in those CUDA tests.
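As a worked check of the numbers involved, the sketch below applies both thresholds to one expected/actual pair reported in the failure logs of this PR. It assumes the comparison is a plain absolute-difference check; `within_tolerance` is a hypothetical helper, not the Java test code.

```python
def within_tolerance(expected: float, actual: float, tol: float) -> bool:
    # Absolute-difference comparison (assumed form of the test's check).
    return abs(expected - actual) <= tol


expected, actual = 0.0102678, 0.010266338  # values taken from the failure log
diff = abs(expected - actual)              # roughly 1.46e-6

print(f"difference: {diff:.3e}")
print("passes at 1e-6:", within_tolerance(expected, actual, 1e-6))
print("passes at 1e-5:", within_tolerance(expected, actual, 1e-5))
```

The difference sits just above 1e-6 and well below 1e-5, which is why the looser threshold lets this case pass.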
### Motivation and Context
https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=1291322&view=logs&j=f2f63060-d9d6-52d0-adee-b97db5a9ab91&t=28e21ca6-87a4-5e1e-0441-72b5e8326f2d
ProviderOptionsTest > testCUDAOptions() FAILED
org.opentest4j.AssertionFailedError: array contents differ at index
[103], expected: <0.0102678> but was: <0.010266338>
at
app//org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
at
app//org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
at
app//org.junit.jupiter.api.AssertArrayEquals.failArraysNotEqual(AssertArrayEquals.java:440)
at
app//org.junit.jupiter.api.AssertArrayEquals.assertArrayEquals(AssertArrayEquals.java:290)
at
app//org.junit.jupiter.api.AssertArrayEquals.assertArrayEquals(AssertArrayEquals.java:123)
at
app//org.junit.jupiter.api.AssertArrayEquals.assertArrayEquals(AssertArrayEquals.java:119)
at
app//org.junit.jupiter.api.Assertions.assertArrayEquals(Assertions.java:1360)
at
app//ai.onnxruntime.providers.ProviderOptionsTest.runProvider(ProviderOptionsTest.java:99)
at
app//ai.onnxruntime.providers.ProviderOptionsTest.testCUDAOptions(ProviderOptionsTest.java:43)
https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=1293200&view=logs&jobId=f2f63060-d9d6-52d0-adee-b97db5a9ab91&j=f2f63060-d9d6-52d0-adee-b97db5a9ab91&t=28e21ca6-87a4-5e1e-0441-72b5e8326f2d
InferenceTest > testCUDA() FAILED
org.opentest4j.AssertionFailedError: array contents differ at index
[103], expected: <0.0102678> but was: <0.010266337>
at
app//org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
at
app//org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
at
app//org.junit.jupiter.api.AssertArrayEquals.failArraysNotEqual(AssertArrayEquals.java:440)
at
app//org.junit.jupiter.api.AssertArrayEquals.assertArrayEquals(AssertArrayEquals.java:290)
at
app//org.junit.jupiter.api.AssertArrayEquals.assertArrayEquals(AssertArrayEquals.java:123)
at
app//org.junit.jupiter.api.AssertArrayEquals.assertArrayEquals(AssertArrayEquals.java:119)
at
app//org.junit.jupiter.api.Assertions.assertArrayEquals(Assertions.java:1360)
at app//ai.onnxruntime.InferenceTest.runProvider(InferenceTest.java:676)
at app//ai.onnxruntime.InferenceTest.testCUDA(InferenceTest.java:615)
### Description
<!-- Describe your changes. -->
This PR is intended to support Phi2 passes in Olive.
Merge it before https://github.com/microsoft/Olive/pull/938
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
<!-- Describe your changes. -->
Adds bfloat16 as a supported dtype for SimplifiedLayerNormFusion which
will provide speedup for Llama-v2 on A100 using bfloat16 numerical
format.
_layernorm_optimized_training.onnx exported in bfloat16 vs. float16:_

### Repro Instructions
```python
from torch import nn
from onnxruntime.training.ortmodule import ORTModule, DebugOptions, LogLevel
import torch

dtype = torch.bfloat16
# dtype = torch.float16


class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(784, 10, dtype=dtype)
        self.layernorm = nn.LayerNorm([784], dtype=dtype)

    def forward(self, x):
        x = x.view(x.shape[0], -1)
        x = self.layernorm(x)
        x = self.fc(x)
        return x


model = Net()
model = ORTModule(model, DebugOptions(save_onnx=True, onnx_prefix='layernorm', log_level=LogLevel.INFO))
model.to("cuda")
images = torch.randn((8, 28, 28), dtype=dtype).to("cuda")
output = model(images)
```
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
ONNX Runtime integration with Llama-v2 family of LLMs.
---------
Co-authored-by: Prathik Rao <prathikrao@microsoft.com@orttrainingdev8.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
As per title, fixes
https://github.com/microsoft/onnxruntime/issues/19418
ONNX Runtime 1.17 broke the quantization of ONNX models with subgraphs
where initializers are placed on the top-level graph, while different
subgraphs use the same initializer.
Allow protobuf-lite builds with the TensorRT EP as long as it is built with
the TRT built-in parser and not the OSS parser.
This is because the TRT built-in parser statically links protobuf, so there
aren't any conflicts with protobuf-lite.
### Description
Adds a job to the python packaging pipeline that builds x64 python
wheels for QNN EP.
### Motivation and Context
Necessary to create a cached QNN model on Windows x64, which is done by
creating a properly configured onnxruntime session with QNN EP.
Add an option to exit after session creation so that users can measure session creation time and assess the impact of any initialization optimizations.
Add arm64 bfloat16 fastmath mode option for transformers benchmarking script.
### Motivation and Context
onnxruntime now supports bfloat16 fastmath gemm kernels for arm64 platforms with bfloat16 instruction support. This PR updates benchmark scripts to test that mode.
### Description
Fix bugs affecting API backward compatibility.
Pass the ONNX model path, rather than the serialized ONNX model,
to the OV compile_model API.
### Description
Disable the CPU EP allocator's arena when address sanitizer is enabled,
because it masks problems. For example, the code in
onnxruntime/test/quantization/quantization_test.cc has a memory leak:
it allocates a buffer but never frees it, yet most memory leak
checking tools cannot detect that because the buffer comes from an arena and
the arena is eventually freed.
### Motivation and Context
Provide better memory leak check coverage.