### Description
Setting the log level after environment creation is too late in some
cases. If the DML EP is enabled, it creates a composite sink that
combines the original logger (using the creation-time log severity) with
an additional ETW sink. Because the composite sink saves the current
severity level for each sink it wraps, it is impossible to get verbose
log output to stdout even if you set verbose severity at the session
level.
I don't know enough about the setup that combines ETW with the original
sink to say whether we should also update the severity of the
individual sinks inside the composite sink, so this change is limited to
making the unit tests behave in the expected manner when the default log
severity is set in the background and not directly controlled.
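To illustrate the problem shape, here is a minimal Python sketch (not the actual C++ sink code; all names here are hypothetical) of a composite sink that snapshots per-sink severities at construction and therefore ignores later severity changes:

```python
class CompositeSinkSketch:
    """Hypothetical illustration of the issue: each wrapped sink's severity
    is captured at environment-creation time, so raising verbosity later
    (e.g. at the session level) never reaches the wrapped sinks."""

    def __init__(self, sinks_with_severity):
        # (sink, min_severity) pairs frozen at creation time
        self._sinks = list(sinks_with_severity)

    def send(self, severity, message):
        for sink, min_severity in self._sinks:
            if severity >= min_severity:  # compares against the frozen
                sink(message)             # severity, not the session's
```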
### Motivation and Context
Make it possible to get verbose output to stdout when the DML EP is
enabled.
### Description
Skip softmax BF16 test for ROCm, because BFloat16 is unsupported by
MIOpen, and `torch.cuda.is_available()` also returns `True` for ROCm.
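A minimal sketch of the kind of guard this implies (the test's actual skip condition may differ); `torch.version.hip` is non-None only on ROCm builds of PyTorch:

```python
import unittest

import torch

# torch.cuda.is_available() returns True on both CUDA and ROCm builds of
# PyTorch; torch.version.hip distinguishes them (non-None only on ROCm).
IS_ROCM = torch.version.hip is not None


class SoftmaxBF16Test(unittest.TestCase):
    @unittest.skipIf(IS_ROCM, "BFloat16 is unsupported by MIOpen")
    def test_softmax_bf16(self):
        x = torch.rand(4, 8, device="cuda", dtype=torch.bfloat16)
        y = torch.softmax(x, dim=-1)
        # rows of a softmax output sum to 1
        sums = y.sum(dim=-1).float().cpu()
        self.assertTrue(torch.allclose(sums, torch.ones(4), atol=1e-2))
```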
### Problem
Newer models using more novel equations (e.g. `bhwc,hkc->bhwk` in
Segment Anything's encoder or `bqc,bchw->bqhw`) cause fallback from DML
to CPU, yielding performance issues. The EP had some pattern matching to
map the more common equations to existing DML operators, but the number
of permutations was prohibitive, and the pattern table could not catch
them all.
### Solution
So, ditch the static mapping, and instead handle any 1-input or 2-input
cases via remapped strides and a mini-graph of elementwise
multiplication & sum reduction (as if DML had a
`DML_OPERATOR_DOT_PRODUCT` that took `axes`). A subset of mappings still
exists for performance (GEMM, pure reduction, transpose...), but these
are identified generally rather than via a pattern table (see the numpy
sketch after this list). Also...
- Diagonals are supported now (e.g. `iji->i`).
- Removes any remaining DML-specific EinSum `GTEST_SKIP` statements.
- Handles any cases up to 8 unique labels (DML dimension limit is 8D).
- \>= 3 inputs and arbitrary-size inputs via ellipsis are not handled,
but we have yet to come across a model that needs them.
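To make the mini-graph decomposition concrete, here is an illustrative numpy sketch (not the EP's code): a 2-input equation becomes a broadcast elementwise multiply over the union of labels, followed by a sum over the labels absent from the output.

```python
import numpy as np

# Evaluate "bhwc,hkc->bhwk" as broadcast multiply + sum reduction.
# Union of labels: b h w k c. Give each input a view over all five axes
# (size-1 where a label is absent), multiply, then sum over the labels
# missing from the output (here: c).
a = np.random.rand(2, 3, 4, 5)                # b h w c
b = np.random.rand(3, 6, 5)                   # h k c

a_view = a[:, :, :, np.newaxis, :]            # b h w k c (k broadcast)
b_view = b[np.newaxis, :, np.newaxis, :, :]   # b h w k c (b, w broadcast)
out = (a_view * b_view).sum(axis=-1)          # reduce over c -> b h w k

assert np.allclose(out, np.einsum("bhwc,hkc->bhwk", a, b))
```

The views are pure stride remappings (no data copies), mirroring the remapped-strides approach described above.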
### Description
Alternative design from #20942.
Allow users to pass in a model path to the `generate_artifacts` API.
### Motivation and Context
- ONNX API calls such as the ONNX checker and shape inference fail when
given an in-memory model >2GB, but work if a path to a model >2GB is
passed in (see the sketch below).
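A sketch of the intended usage under this change (the keyword arguments follow the existing `generate_artifacts` API, but treat the exact names and values here as illustrative):

```python
from onnxruntime.training import artifacts

# With this change, a file path can be passed instead of an in-memory
# onnx.ModelProto, so >2GB models (with external data on disk) work:
artifacts.generate_artifacts(
    "large_model.onnx",                        # path instead of ModelProto
    requires_grad=["fc.weight", "fc.bias"],    # illustrative parameter names
    loss=artifacts.LossType.CrossEntropyLoss,
    optimizer=artifacts.OptimType.AdamW,
    artifact_directory="training_artifacts",
)
```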
### Motivation and Context
When quantizing large transformer models, we faced OOM issues as the
number of calibration samples went up. To resolve this, in this PR we
want to add support for reading quantization data in chunks, calculating
ranges for intermediate tensors per chunk, then accumulating the results
into the final ranges.
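A minimal sketch of the accumulation idea for a MinMax-style calibrator (the names and data layout are illustrative, not the actual calibration API):

```python
import numpy as np

def chunked_minmax_ranges(chunks):
    """Accumulate per-tensor (min, max) ranges over chunks of calibration
    data instead of holding every sample in memory at once.

    `chunks` yields dicts mapping tensor name -> ndarray of activations
    collected for that chunk (illustrative layout).
    """
    ranges = {}
    for chunk in chunks:
        for name, values in chunk.items():
            lo, hi = float(values.min()), float(values.max())
            if name in ranges:
                old_lo, old_hi = ranges[name]
                ranges[name] = (min(old_lo, lo), max(old_hi, hi))
            else:
                ranges[name] = (lo, hi)
    return ranges
```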
### Description
This reverts commit 1d7bf56947 because it
broke the AMD GPU CI pipeline. Sorry, when I reviewed the PR I forgot to
run the AMD GPU CI pipeline.
We will revert the PR first and then ask the author to fix the issue.
### Description
Update protobuf_cmake.patch to allow passing extra warning disablements.
The ORT repo already patches protobuf so that it does not disable
warning 4996.
### Motivation and Context
To meet SDL requirements, Microsoft repos have to fail the build if
warning 4996 is present. Binskim also gives errors if warning 4996 is
disabled.
We can suppress the Binskim issues, but we need a way to disable the
warning for only the minimal set of code that triggers it.
Right now, WindowsAI disables 4996 for the entirety of ORT, but it
should only be disabled for protobuf.
### Description
Under certain conditions, with ETW being enabled and disabled
continuously, we got a crash report.
This change allows ETW callbacks to be de-registered in the class
destructor.
Related to #20537
### Motivation and Context
Fixes crash
### Callstack
We see it crash in:
```
[0x0] onnxruntime!<lambda_967a738fca8512372f170fcaf2d094d4>::operator()+0x34  0x12941ff570  0x7ffa994f0a04
[0x1] onnxruntime!std::_Func_class<void,_GUID const *,unsigned long,unsigned char,unsigned __int64,unsigned __int64,_EVENT_FILTER_DESCRIPTOR *,void *>::operator()+0x54  0x12941ff7b0  0x7ffa994f0d64
[0x2] onnxruntime!onnxruntime::logging::EtwRegistrationManager::InvokeCallbacks+0xcc  0x12941ff7b0  0x7ffa994f0d64
[0x3] onnxruntime!onnxruntime::logging::EtwRegistrationManager::ORT_TL_EtwEnableCallback+0x94  0x12941ff860  0x7ffa98d19628
```
It seems to us that the `this` pointer captured in
```
etwRegistrationManager.RegisterInternalCallback(
    [&etwRegistrationManager, this](
    ...
```
is no longer valid when the callback is called.
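The fix pattern, sketched in Python for clarity (the real code is C++ and the names here are hypothetical): callbacks are keyed by their owner, so the owner can remove them in its destructor and no registered callback can outlive the object it captures.

```python
class EtwRegistrationManagerSketch:
    """Hypothetical illustration of the fix: register callbacks under an
    owner key so the owner can de-register them on teardown, preventing a
    callback from running against a dangling captured pointer."""

    def __init__(self):
        self._callbacks = {}

    def register_internal_callback(self, owner, callback):
        self._callbacks[id(owner)] = callback

    def unregister_internal_callback(self, owner):
        # Called from the owner's destructor/teardown path.
        self._callbacks.pop(id(owner), None)

    def invoke_callbacks(self, *args):
        for callback in list(self._callbacks.values()):
            callback(*args)
```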
### Description
The machine has multiple Python installations and none of them is in
PATH. Therefore we should explicitly set the Python version via this
task to avoid surprises.
### Motivation and Context
Similar to #21095
### Description
Delete RoslynAnalyzers. Use CodeQL instead.
### Motivation and Context
We already have CodeQL, which is modern and also covers C# code. The
RoslynAnalyzers check is not in our pull request pipelines, and the
"RoslynAnalyzers@2" task is outdated and would need to be upgraded. I
will delete it for now since we already have CodeQL.
### Description
1. Added a kernel to quantize the MatMul B tensor to q4 and store it in
the same shape as the original tensor. Scales and zero points are
calculated as well, and they share the same shape (see the sketch after
this list).
2. Added a kernel to transpose the q4 B tensor into the B tensor layout
of MatMulNBits. Scales and zero points are transposed as well.
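A rough sketch of what the per-block quantization computes (illustrative numpy, not the actual kernel; asymmetric 4-bit with one scale and zero point per quant block):

```python
import numpy as np

def quantize_block_q4(block):
    """Illustrative asymmetric 4-bit quantization of one quant block of B
    (e.g. 64 values); not the actual kernel. Returns uint8 values in
    [0, 15] plus the block's scale and zero point."""
    lo = min(float(block.min()), 0.0)   # keep 0 exactly representable
    hi = max(float(block.max()), 0.0)
    scale = (hi - lo) / 15.0 or 1.0     # avoid div-by-zero for flat blocks
    zero_point = int(round(-lo / scale))
    q = np.clip(np.round(block / scale) + zero_point, 0, 15).astype(np.uint8)
    return q, scale, zero_point

# Dequantize with: (q.astype(np.float32) - zero_point) * scale
```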
#### Benchmark
1024 x 4096 input, 64 quant block, 8 threads:
- quantize: 23035923 ns
- transpose: 718635 ns

1024 x 4095 input, 64 quant block, 8 threads:
- quantize: 26759319 ns
- transpose: 1279064 ns
### Motivation and Context
The MatMulNBits tool chain currently only supports converting a MatMul
op directly to a MatMulNBits op. MatMulNBits is not an ONNX standard op.
Therefore, we need the tool chain to support converting MatMul to Q/DQ
format, and later, in the transform step, convert DQ + MatMul to
MatMulNBits. The tensors stored in the DQ node are the quantized
constants that will be stored in the MatMulNBits node.
### Description
Change `is_pod` to `is_trivial`.
### Motivation and Context
This is commonly needed for both the Linux and Windows C++20 upgrades:
`std::is_pod` is deprecated in C++20, while `std::is_trivial` was
introduced back in C++11.
### Description
Remove the "--enable_language_interop_ops" build flag, because the code
is incompatible with the latest numpy, and the build flag is not used
anywhere except in a macOS CI pipeline. The feature does not seem to
have a ship plan.
### Motivation and Context
The build error was:
```
onnxruntime/core/language_interop_ops/pyop/pyop.cc:122:85: error: no member named 'elsize' in '_PyArray_Descr'
static_cast<int64_t>(PyArray_DescrFromType(type)->elsize),
~~~~~~~~~~~~~~~~~~~~~~~~~~~ ^
```
### Description
Due to a security setting change, we now need to set the permissions
explicitly. I forgot to do that when bringing the old change back.
### Motivation and Context
Currently the pipeline cannot publish scanning results to GitHub.
### Description
- Updates CI pipelines to use QNN SDK 2.23.0 by default.
- QNN SDK adds support for int64 Cast. This allows QNN EP to support
ONNX ArgMax/ArgMin/TopK operators that generate an int64 graph output.
Example translation of ArgMax:
- **ONNX**: input --> ArgMax --> output (int64)
- **QNN**: input --> ArgMax --> Cast (int32 to int64) --> output (int64)
### Motivation and Context
Update onnxruntime to use the latest QNN SDK.
### Description
Refactoring of the MultiHeadAttention op:
- [x] Add checks for the cross-attention `pass_past_in_kv` path to make
sure there is no KV cache and no bias.
- [x] Update the interface of `PackVIntoRotaryQKV` so that it can be
used by SparseAttention later.
- [x] Add test cases
### Motivation and Context
To prepare for the pull request adding the SparseAttention CPU op.
### Description
Remove Xamarin-related entries.
Update MAUI entries to net8.
Remove macOS entries (not required by MAUI).
### Motivation and Context
Updates missed from #21062
### Description
The WebNN CPU implementation has been migrated from XNNPack to TFLite,
which supports more ops. As a first step, turn on the partially
supported `cpu` ops that just need the flag changed from `false` to
`true`.
### Description
Xamarin is EOL, so remove support for it.
The MAUI targets are EOL and need updating.
https://dotnet.microsoft.com/en-us/platform/support/policy/maui
Other cleanups:
- netcoreapp3.1 is EOL.
- The net6 macOS target was added in the mistaken belief that it was
needed for MAUI macOS support, but that support actually comes via the
mac-catalyst target, which we recently added.
- Some CIs were still using the old build setup that split out pre-net6
targets. The ORT C# bindings csproj was updated last year, and the
`PreNet6` and `SelectedTargets` properties no longer exist; they were
replaced by the simpler `IncludeMobileTargets` property.
### Motivation and Context
Remove EOL components.
#21058
### Description
Fix regression caused by
https://github.com/microsoft/onnxruntime/pull/20995
### Motivation and Context
### Description
Skip the default `locateFile()` when dynamic import is disabled. This
allows the file to work with bundlers and load the WebAssembly file
correctly when `env.wasm.wasmPaths` is not set.