### Description
1. Publish debug symbols for Windows python packages. This PR will
publish them to ADO. Later on I will also replicate them to Microsoft
Symbol Server.
2. Build the packages in Release mode instead of RelWithDebInfo, to be
consistent with the other platforms (Linux/macOS/...).
### Motivation and Context
To help debug things. Sometimes we found an issue, but we couldn't debug
it because we didn't have symbols, and once we rebuilt the package
locally the issue was gone. This change would be helpful for such
scenarios.
Build log:
https://aiinfra.visualstudio.com/Lotus/_build?definitionId=841
### Description
Enable HardSigmoid for QNN EP using the SDK's direct support instead
of decomposing it into its constituent ops, so that it can support
quantized models.
### Description
Length checking is even more strict for packed batching input.
There are two cases for a batch of input_ids.
- padded seq with equal length of inputs.
```
|----********|
|------------|
|--------****|
|-***********|
```
- packed seqs with different length of input_ids
`|----|---------|----|-|`
The max_seq_length comes from either graph_inputs or position_ids.
In most cases, however, we cache the max_seq_length of rotary_cache
in the model and share it among all layers.
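For the packed case, a minimal sketch of how the required length could be derived from position_ids and checked against the cached value. The function names are hypothetical and this is not the code in this PR; it assumes positions restart at 0 for each packed sequence.

```python
from typing import Sequence


def max_seq_length_from_position_ids(position_ids: Sequence[int]) -> int:
    """Packed layout: positions restart at 0 for every sequence, so the
    longest sequence has length max(position_ids) + 1."""
    return max(position_ids) + 1 if len(position_ids) > 0 else 0


def check_against_rotary_cache(position_ids: Sequence[int],
                               cached_max_seq_length: int) -> int:
    """Raise if the packed batch needs more positions than the cache holds."""
    needed = max_seq_length_from_position_ids(position_ids)
    if needed > cached_max_seq_length:
        raise ValueError(
            f"sequence length {needed} exceeds cached max_seq_length "
            f"{cached_max_seq_length}"
        )
    return needed


# The packed example above, |----|---------|----|-|, has lengths 4, 9, 4, 1.
lengths = [4, 9, 4, 1]
packed_position_ids = [p for n in lengths for p in range(n)]
```

Note that the total token count of the packed batch (18 here) is not the value to compare against the cache; the longest individual sequence (9) is.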
### Motivation and Context
---------
Co-authored-by: kailums <kalu@microsoft.com>
# Description
This PR removes the building of the ORT "mobile" packages and much of the associated infrastructure which is no longer needed.
Not removed yet - tools/ci_build/github/android/mobile_package.required_operators.config and the helper scripts that depend on it.
# Motivation and Context
The mobile packages were deprecated in 1.18. Users should use the full packages (Android - onnxruntime-android, iOS - onnxruntime-c/onnxruntime-objc) instead or do a custom build.
### Description
Update c-api-noopenmp-packaging-pipelines.yml: remove the CUDA version
parameter to reduce confusion. This pipeline generates CUDA 11 packages
only, not CUDA 12.
### Motivation and Context
In the last release we accidentally published CUDA 12 (instead of CUDA
11) packages to nuget.org.
We also tried to publish CUDA 12 packages to
https://aiinfra.visualstudio.com/PublicPackages/_artifacts/feed/ORT-Nightly.
Luckily it didn't go through because a package with the same version
number already existed there. Every time someone runs this pipeline
with the CUDA version set to 12, the built packages are published to
https://aiinfra.visualstudio.com/PublicPackages/_artifacts/feed/ORT-Nightly.
The GenAI team's build pipelines are based on the nightly packages, so
the GenAI team sometimes builds their packages with CUDA 12 and sometimes
with CUDA 11, unpredictably.
Therefore, please limit the use of pipeline parameters. Most Azure
DevOps yml files are template files, and they should use parameters. But
the top-level yml files should be more careful about that.
To replace a deprecated API.
Should verify with the `Gradle cmakeCheck` step from
`Windows_Packaging_CPU_x64_default` stage from the Zip-Nuge-...
pipeline.
### Description
Windows - Fully dynamic ETW controlled logging for ORT and QNN logs
The logging support is documented here
- https://onnxruntime.ai/docs/performance/tune-performance/logging_tracing.html#tracing---windows
- https://onnxruntime.ai/docs/performance/tune-performance/profiling-tools.html#tracelogging-etw-windows-profiling
Also add support for logging ORT SessionCreation on ETW CaptureState
### Motivation and Context
The previous ETW support only worked if you enabled ETW before the
session started. There can commonly be long-lived AI inference processes
that need to be traced & debugged. This enables logging fully on the
fly.
Without this support a dev would have to end up killing a process or
stopping a service in order to get tracing. We had to do this for a
recent issue with QNN, and it was a bit painful to get the logs and it
ruined the repro.
### Testing
I tested with the following cases
- Leaving default ORT run
- Enabling ETW prior to start and leaving running for entire session +
inferences, then stopping
- Starting ORT session + inf, then enabling and stopping ETW
- Start ORT session w/ long-running inferences
  - wpr -start
  [ort.wprp](e6228575e4/ort.wprp (L4))
  -start
  [etw_provider.wprp](e6228575e4/onnxruntime/test/platform/windows/logging/etw_provider.wprp)
  - Wait a few seconds
  - wpr -stop ort.etl
  - Inferences are still running
- Verify ONNXRuntimeLogEvent provider events are present and new
SessionCreation_CaptureState event under Microsoft.ML.ONNXRuntime
provider
Related:
#18882, #19428
Some dev environments come with a preinstalled abseil; this is common
for conda users, for example. If the preinstalled abseil version is
incompatible with what we have in cmake/deps.txt, it can result in a
hard-to-understand build error. This PR adds a version check so that the
mismatch is reported clearly.
### Description
Uses C-style casting for Power vector instructions in
`MlasQuantizeLinearInt4Kernel`.
### Motivation and Context
Vector commands (e.g., vec_xst) need C-style casting to support various
compiler versions.
ONNX Runtime CI pipelines do not build with all compiler versions. The
recent INT4 PR broke the powerpc build for certain compiler versions
because it uses C++-style `static_cast<>`.
See:
https://github.com/microsoft/onnxruntime/pull/20362#discussion_r1630106164
Signed-off-by: adrianlizarraga <adlizarraga@microsoft.com>
### Description
Support loading from a model with multiple QNN context binaries.
### Motivation and Context
A context binary model generated by QNN EP has only a single QNN context.
Because of the QNN PD memory limitation, a large model (>3.5GB) has to be split into 2 smaller models, and a context binary model is then generated for each. Users can load the smaller models with context binaries, but doing so requires 2 Ort sessions. Users want to glue the split models into 1 model (with multiple EPContext nodes) so that they can use 1 Ort session to do the work.
QNN EP has a limitation in that it only supports loading from a single QNN context binary. This PR removes that limitation to unblock this user scenario.
---------
Co-authored-by: Adrian Lizarraga <adlizarraga@microsoft.com>
The following constraints are now supported by the WebNN TFLite backend:
- Concat: supports up to 4 inputs
- Matmul: supports broadcasting
- Resize: supports nearest mode
- Split: supports up to 4 outputs
### Description
Add conditional check in Get/Set current GPU device id
### Motivation and Context
Currently with ROCm build, calling `GetCurrentGpuDeviceId` will still
try to find CUDA libraries and log the following error message:
```text
[E:onnxruntime:, provider_bridge_ort.cc:1836 TryGetProviderInfo_CUDA] /onnxruntime_src/onnxruntime/core/session/provider_bridge_ort.cc:1511 onnxruntime::Provider& onnxruntime::ProviderLibrary::Get() [ONNXRuntimeError] : 1 : FAIL : Failed to load library libonnxruntime_providers_cuda.so with error: libonnxruntime_providers_cuda.so: cannot open shared object file: No such file or directory
```
This is unnecessary and confusing.
### Description
Trilu<bool> is used by phi-3 when exported with torch.onnx.export.
### Motivation and Context
### Description
Initialize `device_id` with `-1` in `cuda_call` and `rocm_call`.
### Motivation and Context
From PyTorch code:
bb2de3b101/c10/cuda/CUDAFunctions.cpp (L217-L324)
If `cudaGetDevice` or `hipGetDevice` failed, an uninitialized `int`
would produce a random number that changes during each run:
```text
[with ERRTYPE = hipError_t; bool THRW = true; std::conditional_t<THRW, void, common::Status> = void] HIP failure 101: invalid device ordinal ; GPU=32741 ; hostname=e6724be2a31a ; file=/onnxruntime_src/onnxruntime/core/providers/rocm/rocm_common.h ; line=66 ; expr=hipGetDeviceProperties(&deviceProp, 0);
```
Notice the `GPU` value above. Using `-1` would clearly indicate such
failure and avoid confusion.
### Description
- Updates pipelines to use QNN SDK 2.22 by default.
- Linux QNN pipeline now uses an Ubuntu 22.04 image (required by QNN
SDK)
- Android QNN pipeline still uses the current Ubuntu 20.04 image. Will
update in a separate PR.
- Disables QDQ LayerNorm test that triggers QNN's graph finalization
error on QNN 2.22
- Increases accuracy tolerance for various HTP tests so that they pass
on Windows arm64.
### Motivation and Context
Test QNN EP with latest QNN SDK version by default.
---------
Signed-off-by: adrianlizarraga <adlizarraga@microsoft.com>
### Description
- Uses our own quantization functions instead of the ONNX reference
implementation of QuantizeLinear when quantizing weights to int4.
- Uses a custom function that packs bytes into 4-bit elements.
### Motivation and Context
Running the quantization tool to create QDQ models with int4 weights
could take up to 7x longer. This PR uses our own quantization and byte
packing utilities to improve performance.
#### Measurements
Model with ~5M parameters to quantize to int4.
- Current implementation: **84.5s**
- Only replace ONNX QuantizeLinear implementation: **50.3s** (1.68x
speedup)
- This PR (replace onnx Q impl, custom packing func): **13.5s** (6.26x
speedup)
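
For reference, a minimal pure-Python sketch of the two steps being replaced: linear quantization clamped to the signed int4 range, followed by packing two 4-bit values per byte. The function names are hypothetical and this is not the implementation in this PR; it assumes the ONNX int4 layout, where the first element of each pair goes into the low nibble.

```python
from typing import List, Sequence


def quantize_to_int4(values: Sequence[float], scale: float,
                     zero_point: int = 0) -> List[int]:
    """Linear quantization clamped to the signed int4 range [-8, 7].
    Python's round() rounds halves to even, like ONNX QuantizeLinear."""
    out = []
    for v in values:
        q = round(v / scale) + zero_point
        out.append(max(-8, min(7, q)))
    return out


def pack_int4(values: Sequence[int]) -> bytes:
    """Pack signed int4 values two per byte: even indices go into the low
    nibble, odd indices into the high nibble. An odd-length input leaves
    the final high nibble as zero padding."""
    packed = bytearray((len(values) + 1) // 2)
    for i, v in enumerate(values):
        nibble = v & 0x0F  # two's complement representation in 4 bits
        if i % 2 == 0:
            packed[i // 2] |= nibble
        else:
            packed[i // 2] |= nibble << 4
    return bytes(packed)
```

For example, `pack_int4([1, -1, 7])` yields the two bytes `0xF1 0x07`.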
---------
Signed-off-by: adrianlizarraga <adlizarraga@microsoft.com>
These are changes to improve GEMM portion of the code for Power.
There are 3 main code changes:
1) Changing a function to a template parameter so that operations that
add/sub zero are eliminated at compile time. Also reuse a vector that
holds the mask instead of rebuilding it each time.
2) Add processing 16 columns at a time in MlasGemmQuantCopyPackB8x8 -
this should reduce potential page faults by a factor of 4 and also be
faster.
3) Unroll MlasQgemmStoreVectorMMA and vectorize other variables.
### Description
The offset used in attention has data type int, which can overflow for
large sequence lengths.
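A rough illustration of the arithmetic, assuming a buffer laid out as [batch, num_heads, seq_len, total_seq_len]. The layout and the function names are assumptions made for this sketch, not taken from the kernel code.

```python
INT32_MAX = 2**31 - 1


def attention_offset(batch_index: int, head_index: int, num_heads: int,
                     seq_len: int, total_seq_len: int) -> int:
    """Flat element offset of one (batch, head) slice in a buffer with the
    assumed layout [batch, num_heads, seq_len, total_seq_len]."""
    return (batch_index * num_heads + head_index) * seq_len * total_seq_len


def wrap_int32(value: int) -> int:
    """Emulate storing value in a signed 32-bit int (two's complement)."""
    value &= 0xFFFFFFFF
    return value - 2**32 if value >= 2**31 else value


# With seq_len = total_seq_len = 8192, each slice holds 2**26 elements,
# so the slice for head index 32 starts exactly at 2**31.
offset = attention_offset(0, 32, 40, 8192, 8192)
overflows_int32 = offset > INT32_MAX
wrapped = wrap_int32(offset)
```

Here `offset` is 2147483648, one past the largest signed 32-bit value, and the wrapped value is negative. Computing the offset in a 64-bit or `size_t`-like type avoids this.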
### Motivation and Context
### Description
Add new APIs.
### Motivation and Context
This change is required to satisfy a requirement from Microsoft.
---------
Co-authored-by: Zhenze Wang <zhenzew@xilinx.com>
To align with Office and other MS products.
Office's support policy is:
"Office for iPad and iPhone is supported on the two most recent versions
of iOS and iPadOS. When a new version of iOS or iPadOS is released, the
Office Operating System requirement becomes the two most recent
versions: the new version of iOS or iPadOS and the previous version."
(from https://products.office.com/office-system-requirements)
The latest iOS version is 17. So they support both 17 and 16. Here I set
our min iOS version to 13 so that it will be a superset of what Office
supports.
This change would allow us to use C++17's std::filesystem feature in the
core framework. The modifications were generated by running
```bash
find . -type f -exec sed -i "s/apple_deploy_target[ =]12.0/apple_deploy_target=13.0/g" {} \;
```
Cannot use 15.0 because otherwise iOS packaging would fail with:
```
/Users/runner/work/1/b/apple_framework/intermediates/iphoneos_arm64/Release/_deps/coremltools-src/mlmodel/src/MILBlob/Util/Span.hpp:288:9: error: cannot use 'throw' with exceptions disabled
MILVerifyIsTrue(index < Size(), std::range_error, "index out of bounds");
```
The Google OSS libraries we use only officially support iOS 15+.
### Description
1. Add one image to the whitelist; if that image is hit, the pipeline
status is set to warning.
2. Adjust the image parity test tolerance.
### Motivation and Context
Improve pipeline stability.
### Description
Update the initializer that's added in GatherSliceToSplitFusion to use
the GenerateNodeArgName function, rather than the GenerateNodeName
function.
GenerateNodeName goes through all the nodes in the graph to see if the
given name is already used and generates a unique one if it has been
used. GenerateNodeArgName iterates through all the node args in the
graph to see if the given name is already used.
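The difference can be illustrated with a toy model of the two name spaces. This is a simplified sketch, not ORT's Graph class: `ToyGraph`, its method names, and the `_1` suffix scheme are all made up for illustration.

```python
class ToyGraph:
    """Toy graph that tracks node names and node-arg names separately.
    Initializers live in the node-arg name space."""

    def __init__(self) -> None:
        self.node_names = set()
        self.node_arg_names = set()

    @staticmethod
    def _unique(base: str, used: set) -> str:
        """Return base if unused, else the first free base_<n>."""
        if base not in used:
            return base
        i = 1
        while f"{base}_{i}" in used:
            i += 1
        return f"{base}_{i}"

    def generate_node_name(self, base: str) -> str:
        # Only consults node names, so existing initializers are invisible.
        return self._unique(base, self.node_names)

    def generate_node_arg_name(self, base: str) -> str:
        # Consults node-arg names, so an existing initializer is detected.
        return self._unique(base, self.node_arg_names)


graph = ToyGraph()
graph.node_arg_names.add("splits")  # initializer from a first pass
clashing = graph.generate_node_name("splits")
distinct = graph.generate_node_arg_name("splits")
```

In this toy, `clashing` is `"splits"` again, colliding with the existing initializer, while `distinct` is `"splits_1"`.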
### Motivation and Context
* On-device training goes through a generate-artifacts step, where
optimizations are applied; then, when the training artifact is loaded,
additional optimizations are applied. In the first round of
optimizations, a "splits" initializer is added for phi-3. In the
second round of optimizations, another "splits" initializer with
different dimensions and data is added. Since we call the GenerateNodeName
function, the first splits initializer isn't found, causing a type error
that claims the shape of splits does not match the TensorProto
shape.
### Description
This PR allows `./gradlew cmakeCheck` to fail in the
Windows_Packaging_(CUDA|TensorRT) job. This way, the job still generates
all the necessary jar and pom files for later stages to consume, while
`./gradlew cmakeCheck` is run again in the
Windows_Packaging_(CUDA|TensorRT)_Testing stage.
### Motivation and Context
Reduce the time of all Java packaging stages by 30+ minutes.
### Description
This PR allows building ORT web as `ort{.all|.webgpu}.bundle.min.mjs`,
which does not contain any dynamic imports. This makes it possible to use
ORT web via static import in a service worker.
Fixes #20876
### Description
This PR upgrades CUDA 11 build pipelines' GCC version from 8 to 11.
### Motivation and Context
GCC 8 has an experimental std::filesystem implementation which is not ABI
compatible with the formal one in later GCC releases. It didn't cause
trouble for us, but the ONNX community has run into this issue often.
For example, see https://github.com/onnx/onnx/issues/6047. So this PR
increases the minimum supported GCC version from 8 to 9, and removes the
references to GCC's "stdc++fs" library. Please note we compile our code
on RHEL8 and RHEL8's libstdc++ doesn't have the fs library, which means
the binaries in ONNX Runtime's official packages always static link to
the fs library. It is just a matter of which version of the library, an
experimental one or a more mature one. And it is an implementation
detail that is not visible from outside. Anyway, a newer GCC is better.
It will give us the chance to use many C++20 features.
#### Why were we using GCC 8?
It is because all our Linux packages were built on RHEL8 or its
equivalents. The default GCC version in RHEL8 is 8. RHEL also provides
additional GCC versions from RH devtoolset. UBI8 is the abbreviation of
Red Hat Universal Base Image 8, which is the containerized RHEL8. UBI8
is free, which means it doesn't require a subscription (while RHEL does).
The only devtoolset that UBI8 provides is GCC 12, which is too new to
be used with CUDA 11.8. And our CUDA 11.8 build env is a docker
image from Nvidia that is based on UBI8.
#### How the problem is solved
Almalinux is an alternative to RHEL. Almalinux 8 provides GCC 11. And
the CUDA 11.8 docker image from Nvidia is open source, which means we
can rebuild the image based on Almalinux 8 to get GCC 11. I've done
this, but I cannot republish the new image due to various complicated
license restrictions. Therefore I put it at an internal location in
onnxruntimebuildcache.azurecr.io.
### Description
The recent [PR for int4
support](https://github.com/microsoft/onnxruntime/pull/20362) breaks
builds with the onnxruntime_DEBUG_NODE_INPUTS_OUTPUTS option enabled.
This PR adds utility functions for debug printing of int4 tensor
statistics and data.
### Motivation and Context
Phi-3 vision loads 3 models in memory, which means that we have 3
different sessions, 3 different execution providers and 3 different
allocators all loaded at the same time. Since the DML EP uses a
bucketized allocator, this results in a lot of memory fragmentation
across all 3 models, because each allocator's cached memory can only be
reused by its own model.
To fix that, we can disable the memory arena (the term for any kind of
allocator that reuses memory in ORT) as an opt-in option. In the case of
LLMs, we essentially never need to reallocate memory after the initial
graphs have been captured, which means that we gain nothing by using the
bucketized allocator, and it causes unnecessary fragmentation.
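A toy model of the effect, assuming power-of-two buckets and a per-allocator free list. Both the bucketing rule and the class below are assumptions made for illustration; this is not the DML EP allocator.

```python
def bucket_size(num_bytes: int) -> int:
    """Round a request up to the next power of two (assumed bucketing rule)."""
    if num_bytes <= 1:
        return 1
    return 1 << (num_bytes - 1).bit_length()


class BucketizedAllocator:
    """Toy allocator that caches freed blocks per bucket size and never
    returns them to the device, so other allocators cannot reuse them."""

    def __init__(self) -> None:
        self.free_buckets = {}  # bucket size -> number of cached free blocks
        self.reserved_bytes = 0

    def allocate(self, num_bytes: int) -> int:
        size = bucket_size(num_bytes)
        if self.free_buckets.get(size, 0) > 0:
            self.free_buckets[size] -= 1  # reuse a cached block
        else:
            self.reserved_bytes += size   # reserve new device memory
        return size

    def free(self, size: int) -> None:
        self.free_buckets[size] = self.free_buckets.get(size, 0) + 1


# One allocator per model: memory cached by model A is invisible to model B.
model_a, model_b = BucketizedAllocator(), BucketizedAllocator()
block = model_a.allocate(1000)   # reserves a 1024-byte bucket
model_a.free(block)              # stays cached inside model A's allocator
model_b.allocate(1000)           # model B must reserve its own 1024 bytes
total_reserved = model_a.reserved_bytes + model_b.reserved_bytes
```

In this toy, 2048 bytes end up reserved while only 1000 bytes are live, which is the kind of waste that grows with the number of models loaded side by side.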
---------
Co-authored-by: Patrice Vignola <pavignol@microsoft.com>
### Description
Similar to #20786 . The last PR was able to update all pipelines and all
docker files. This is a follow-up to that PR.
### Motivation and Context
1. To extract the common part as a reusable build infra among different
ONNX Runtime projects.
2. Avoid hitting docker hub's limit: 429 Too Many Requests - Server
message: toomanyrequests: You have reached your pull rate limit. You may
increase the limit by authenticating and upgrading:
https://www.docker.com/increase-rate-limit
### Description
- 4-bit QuantizeLinear(21). **Blocked quantization still missing (i.e.,
does not support the new `block_size` attribute)**
- 4-bit DequantizeLinear(21). **Blocked dequantization still missing
(i.e., does not support the new `block_size` attribute)**
- 4-bit Transpose(21).
- Update quantization tool with int4 types.
- Disable QDQ fusions for 4-bit types. See:
https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/optimizer/qdq_transformer/selectors_actions/qdq_selector_action_transformer.cc
- MLAS 4-bit quantization kernels for intel, neon, powerpc.
##### Notes
To calculate a tensor's storage size, we normally get the number of
elements from the shape (i.e., `tensor_shape.Size()`) and multiply by
the size of a single element. This does not directly work for sub-byte
elements like int4 as each element in a `Tensor<Int4x2>` stores **two**
packed int4 elements in a byte. The `Tensor::CalculateTensorStorageSize`
function should be called to perform the correct calculation for any
tensor element type.
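The arithmetic can be sketched as follows. This only illustrates the rounding involved and is not the implementation of `Tensor::CalculateTensorStorageSize`; the helper name is hypothetical.

```python
def storage_size_bytes(num_elements: int, bits_per_element: int) -> int:
    """Bytes needed to store num_elements tightly packed elements,
    rounding up to a whole byte."""
    return (num_elements * bits_per_element + 7) // 8


# A tensor with 5 int4 elements: the naive "elements * sizeof(storage type)"
# gives 5 bytes, but two int4 values share a byte, so only 3 are needed.
naive_bytes = 5 * 1
packed_bytes = storage_size_bytes(5, 4)
```

For byte-sized and larger types the formula reduces to the usual element count times element size.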
### Motivation and Context
ONNX 1.16 added the int4 and uint4 types. This initial PR adds the int4
type to ORT and adds int4 implementations for the Quant, Dequant, and
Transpose ops on CPU EP. We still need to add int4 support for many ops
and execution providers. See the ONNX 1.16 release notes:
https://github.com/onnx/onnx/releases.
mac-react-native-ci-pipeline.yml:
- We don't need to run component detection for PR builds so just disable it there.
npm-packaging-pipeline.yml:
- Manually added component detection task was being added twice - removed one.
- Increased timeout of stage where component detection is run since the existing timeout was close for some builds.