Distribute writing-to-output work over all threads in MatMulNBits.
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Previously, Prelu in QNN failed when the input was fp16 and alpha was fp32,
because QNN requires alpha to be fp16 when the input is fp16.
This is resolved by casting alpha to fp16 before passing it to QNN.
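The cast described above can be sketched as follows. This is only an illustration of converting fp32 alpha values to fp16 bit patterns with round-to-nearest-even; `FloatToHalfBits` and `CastAlphaToHalf` are hypothetical helper names, not the ONNX Runtime or QNN API, and the actual change may perform the cast differently.

```c++
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper: convert one IEEE-754 binary32 value to binary16 bits,
// rounding to nearest even. Not an ONNX Runtime or QNN function.
inline std::uint16_t FloatToHalfBits(float value) {
  std::uint32_t bits;
  std::memcpy(&bits, &value, sizeof(bits));
  const std::uint32_t sign = (bits >> 16) & 0x8000u;
  const std::uint32_t biased_exp = (bits >> 23) & 0xFFu;
  std::uint32_t mantissa = bits & 0x007FFFFFu;

  if (biased_exp == 0xFFu) {  // Inf or NaN
    return static_cast<std::uint16_t>(sign | 0x7C00u | (mantissa ? 0x0200u : 0u));
  }
  const std::int32_t half_exp = static_cast<std::int32_t>(biased_exp) - 127 + 15;
  if (half_exp >= 31) {  // too large for fp16: becomes infinity
    return static_cast<std::uint16_t>(sign | 0x7C00u);
  }
  if (half_exp <= 0) {  // result is an fp16 subnormal or zero
    if (half_exp < -10) return static_cast<std::uint16_t>(sign);
    mantissa |= 0x00800000u;  // restore the implicit leading 1
    const std::uint32_t shift = static_cast<std::uint32_t>(14 - half_exp);
    std::uint32_t half = mantissa >> shift;
    const std::uint32_t remainder = mantissa & ((1u << shift) - 1u);
    const std::uint32_t halfway = 1u << (shift - 1u);
    if (remainder > halfway || (remainder == halfway && (half & 1u))) ++half;
    return static_cast<std::uint16_t>(sign | half);
  }
  std::uint32_t half = (static_cast<std::uint32_t>(half_exp) << 10) | (mantissa >> 13);
  const std::uint32_t remainder = mantissa & 0x1FFFu;
  // A rounding carry may propagate into the exponent, which is the desired result.
  if (remainder > 0x1000u || (remainder == 0x1000u && (half & 1u))) ++half;
  return static_cast<std::uint16_t>(sign | half);
}

// Hypothetical helper: cast an fp32 alpha tensor to fp16 bit patterns.
inline std::vector<std::uint16_t> CastAlphaToHalf(const std::vector<float>& alpha) {
  std::vector<std::uint16_t> out;
  out.reserve(alpha.size());
  for (float a : alpha) out.push_back(FloatToHalfBits(a));
  return out;
}
```

For example, `CastAlphaToHalf({0.25f, 1.0f})` yields the fp16 bit patterns `0x3400` and `0x3C00`.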
### Motivation and Context
Makes QNN Prelu support the fp16 case.
---------
Co-authored-by: Hector Li <hecli@microsoft.com>
In CMakeLists.txt:set_msvc_c_cpp_compiler_warning_level(), the regex should match the value that gets added by the function. The latter got updated, so this change updates the former to match.
### Description
Update order of steps
### Motivation and Context
Fix CI
### Description
[VitisAI] Fix the problem that gsl cannot be found when compiling
on Linux.
### Motivation and Context
Co-authored-by: Zhenze Wang <zhenzew@xilinx.com>
### Description
### Motivation and Context
### Description
```
tvm_execution_provider.cc
denormal.cc
D:\a\onnxruntime\onnxruntime\onnxruntime\core\providers\tvm\tvm_execution_provider.cc(122,5): error C2660: 'onnxruntime::GraphViewerToProto': function does not take 4 arguments [D:\a\onnxruntime\onnxruntime\build\Release\onnxruntime_providers_tvm.vcxproj]
D:\a\onnxruntime\onnxruntime\onnxruntime\core\graph\graph_proto_serializer.h(10,6):
see declaration of 'onnxruntime::GraphViewerToProto'
D:\a\onnxruntime\onnxruntime\onnxruntime\core\providers\tvm\tvm_execution_provider.cc(122,5):
while trying to match the argument list '(const onnxruntime::GraphViewer, onnx::GraphProto, bool, bool)'
cpuid_uarch.cc
get_execution_providers.cc
abi_session_options.cc
bias_dropout_fusion.cc
if.cc
```
### Motivation and Context
### Description
Perform the computation in fp32 and convert the final result to fp16.
### Motivation and Context
### Description
Error:
**Artifact name input: e2e_test_logs_1364625_$(Date:yyyyMMddHHmmss)
##[error]Artifact name is not valid:
e2e_test_logs_1364625_$(Date:yyyyMMddHHmmss). It cannot contain '\', /',
"', ':', '<', '>', '|', '*', and '?'**
The date is not expanded correctly in the artifact name. Use the
predefined pipeline variable BuildNumber instead, which serves a similar
purpose to a timestamp.
### Motivation and Context
RN CI failure
---------
Co-authored-by: rachguo <rachguo@rachguos-Mac-mini.local>
Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net>
### Description
flatbuffers::String::c_str returns a pointer that may not be
null-terminated.
This causes a warning when building on an A100 with gcc 11. It is not
clear why other builds with gcc 11 (e.g. Ubuntu 22.04 WSL) don't generate
the warning. Either way, it's safer to use str(), as that constructs a
std::string from data() and size().
It is unclear whether this is an issue in practice: the value is read
from the flatbuffer, and an empty string was most likely not written out,
to save space. There's no perf need to use c_str instead of str, and in
LOAD_STR_FROM_ORT_FORMAT we need to convert the return value to a
std::string anyway.
```c++
struct String : public Vector<char> {
const char *c_str() const { return reinterpret_cast<const char *>(Data()); }
std::string str() const { return std::string(c_str(), size()); }
```
```
inlined from ‘onnxruntime::common::Status onnxruntime::fbs::utils::LoadAttributeOrtFormat(const onnxruntime::fbs::Attribute&, onnx::AttributeProto&, std::unique_ptr<onnxruntime::Graph>&, onnxruntime::Graph&, onnxruntime::Node&, const onnxruntime::OrtFormatLoadOptions&, const onnxruntime::logging::Logger&)’ at /frdong_data/onnxruntime/onnxruntime/core/graph/graph_flatbuffers_utils.cc:385:3:
/usr/include/c++/11/bits/char_traits.h:399:32: error: ‘long unsigned int __builtin_strlen(const char*)’ reading 1 or more bytes from a region of size 0 [-Werror=stringop-overread]
```
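As a minimal sketch of why the length-based construction is safer: building a `std::string` from a pointer plus an explicit size copies exactly that many bytes and never calls `strlen`. `SizedChars` and `ToStdString` below are hypothetical stand-ins for illustration, not flatbuffers types.

```c++
#include <cstddef>
#include <string>

// Hypothetical stand-in for a length-prefixed string such as flatbuffers::String:
// a pointer plus an explicit byte count, with no guarantee of a trailing '\0'.
struct SizedChars {
  const char* data;
  std::size_t size;
};

// Copies exactly `size` bytes. Unlike std::string(const char*), this overload
// does not scan for a terminator, so it cannot read past the end of the buffer.
inline std::string ToStdString(const SizedChars& s) {
  return std::string(s.data, s.size);
}
```

With a buffer holding `{'a', 'b', 'c', 'X'}` and a size of 3, `ToStdString` returns `"abc"` even though no null terminator follows the third byte.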
### Motivation and Context
Fix build error on A100
The defines for these tests have to be checked in the same order
everywhere. If we check for TRT -> CUDA -> DML, we cannot reverse that
order in later defines, as we might want to build for multiple EPs.
+@PatriceVignola
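The ordering constraint can be illustrated with a small sketch. It assumes build macros named `USE_TENSORRT`, `USE_CUDA` and `USE_DML`; the two functions are hypothetical and only show that both preprocessor chains must test the macros in the same order, so that they agree when several EPs are enabled in one build.

```c++
#include <string>

// First chain: decides which provider the test registers.
// Priority order is TRT -> CUDA -> DML (macro names are assumptions).
inline std::string ProviderToRegister() {
#if defined(USE_TENSORRT)
  return "TensorRT";
#elif defined(USE_CUDA)
  return "CUDA";
#elif defined(USE_DML)
  return "DML";
#else
  return "CPU";
#endif
}

// Later chain: decides which provider the test expects to have run.
// It must use the same TRT -> CUDA -> DML order. If it checked DML first,
// a build with both USE_CUDA and USE_DML defined would register CUDA
// but expect DML.
inline std::string ProviderToExpect() {
#if defined(USE_TENSORRT)
  return "TensorRT";
#elif defined(USE_CUDA)
  return "CUDA";
#elif defined(USE_DML)
  return "DML";
#else
  return "CPU";
#endif
}
```

Whatever combination of macros is defined, both functions return the same provider name.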
### Description
The MLAS MatMulNBits implementation requires packed B, so add a condition
for this.
This logic needs to be updated if that requirement changes.
### Motivation and Context
---------
Signed-off-by: Liqun Fu <liqfu@microsoft.com>
### Description
Following issue #19223, introduce the `per_channel` attribute in
`MinMaxCalibrater` to enable per-channel calibration.
If required, this new functionality should also be implemented in the
other _Calibraters_ (`HistogramCalibrater`, `EntropyCalibrater`, ...).
### Motivation and Context
- This is the first step towards implementing the proposal in #19223.
- If per-channel calibration were allowed, the quantization algorithm
could be updated to improve quantization performance, i.e. quantizing
weights per channel and not per tensor. That is why it would be
interesting to have a 'per_channel' option in any 'Calibrater' class to
produce a set of calibration vectors instead of a single scalar.
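The difference between per-tensor and per-channel ranges can be sketched as below. This is an illustration only, not the `MinMaxCalibrater` implementation (which is Python); the function names and the assumed `[channels, elements_per_channel]` row-major layout are hypothetical.

```c++
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

using Range = std::pair<float, float>;  // (min, max)

// Per-tensor calibration: a single (min, max) pair for the whole tensor.
// Returns (0, 0) for an empty tensor.
inline Range TensorMinMax(const std::vector<float>& data) {
  if (data.empty()) return {0.0f, 0.0f};
  const auto mm = std::minmax_element(data.begin(), data.end());
  return {*mm.first, *mm.second};
}

// Per-channel calibration: one (min, max) pair per channel.
// Assumes `data` is laid out row-major as [channels, elements_per_channel].
// Returns an empty vector if the size is not divisible by `channels`.
inline std::vector<Range> PerChannelMinMax(const std::vector<float>& data,
                                           std::size_t channels) {
  std::vector<Range> ranges;
  if (channels == 0 || data.empty() || data.size() % channels != 0) return ranges;
  const std::size_t per_channel = data.size() / channels;
  ranges.reserve(channels);
  for (std::size_t c = 0; c < channels; ++c) {
    const auto first = data.begin() + static_cast<std::ptrdiff_t>(c * per_channel);
    const auto last = first + static_cast<std::ptrdiff_t>(per_channel);
    const auto mm = std::minmax_element(first, last);
    ranges.emplace_back(*mm.first, *mm.second);
  }
  return ranges;
}
```

For a two-channel tensor `{1, -2, 3, 10, 20, 30}`, the per-tensor range is `(-2, 30)`, while the per-channel ranges are `(-2, 3)` and `(10, 30)`, which is much tighter for the first channel.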
### Description
As title.
### Motivation and Context
### Description
Fix some misc build warnings from x86 Windows build
### Motivation and Context
### Description
Update to more generic url
### Motivation and Context
### Description
Fix the build error for Win ARM64 Release build.
graph_transform_test.cc(1,1): error C1128: number of sections exceeded
object file format limit: compile with /bigobj
[D:\build\Windows\Release\onnxruntime_test_all.vcxproj]
### Motivation and Context
Fix issue: https://github.com/microsoft/onnxruntime/issues/20406
### Description
The Key and Value inputs can be 4-dimensional.
### Motivation and Context
### Description
Add bf16 support for the following ops:
ConstantOfShape
Exp
Erf
convolution
PythonOp
### Motivation and Context
The phimm model runs in bf16, so ORT needs to support bf16 for the ops
listed above in order to work with phimm in bf16.
The prefetch instruction (_mm_prefetch) is used to anticipate memory
accesses by prefetching the next row of the input buffer. This
optimization is designed to reduce the impact of memory latency, thereby
improving the performance of the MlasComputeSoftmax function. As a
result, the worst-case latency of the OCR model has improved by
approximately 50 ms, which equates to a 3% improvement.
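The idea can be sketched as a row-wise softmax that issues a prefetch hint for the next row before processing the current one. This is a simplified illustration, not the MlasComputeSoftmax code: `SoftmaxRows` and the `SKETCH_PREFETCH` macro are hypothetical, only the first cache line of the next row is hinted, and on non-x86 targets the hint compiles to a no-op.

```c++
#include <algorithm>
#include <cmath>
#include <cstddef>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
// Hint the CPU to pull the cache line at `ptr` into all cache levels.
#define SKETCH_PREFETCH(ptr) _mm_prefetch(reinterpret_cast<const char*>(ptr), _MM_HINT_T0)
#else
// No-op fallback for targets without SSE.
#define SKETCH_PREFETCH(ptr) ((void)(ptr))
#endif

// Row-wise softmax over a row-major [rows, cols] buffer.
inline void SoftmaxRows(const float* input, float* output,
                        std::size_t rows, std::size_t cols) {
  for (std::size_t r = 0; r < rows; ++r) {
    const float* in_row = input + r * cols;
    float* out_row = output + r * cols;
    if (r + 1 < rows) {
      // Start fetching the next row while this row is being computed.
      SKETCH_PREFETCH(in_row + cols);
    }
    if (cols == 0) continue;
    // Subtract the row maximum for numerical stability.
    const float row_max = *std::max_element(in_row, in_row + cols);
    float sum = 0.0f;
    for (std::size_t c = 0; c < cols; ++c) {
      out_row[c] = std::exp(in_row[c] - row_max);
      sum += out_row[c];
    }
    for (std::size_t c = 0; c < cols; ++c) {
      out_row[c] /= sum;
    }
  }
}
```

The prefetch is purely a performance hint: the results are identical with or without it, so any gain has to be confirmed by measurement on the target hardware.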
From TensorRT 10 GA onwards, the TensorRT libraries have the major
version appended to their names on Windows, for example nvinfer_10.dll,
nvinfer_plugin_10.dll, nvonnxparser_10.dll ...
Update the CMake file accordingly.
### Description
1. Update the image name so that the docker image is not overwritten.
2. There was a mistake: variables.CUDA_VERSION_MAJOR is always empty
14fcf0a52d/tools/ci_build/github/azure-pipelines/stages/nuget-linux-cuda-packaging-stage.yml (L120)
3. Set one artifact name as a variable to make the job rerunnable.
### Motivation and Context
### Description
The current code calls a method with a missing argument.
### Motivation and Context
It breaks Olive's unittests.
---------
Co-authored-by: Xavier Dupré <xavier.dupre@gmail.com>
### Description
Originally, we only used compute queues for compute-only devices; this
change sets the default for DX12 devices to use compute queues as well.
### Motivation and Context
TDRs have occurred when using the current default queues; they do not
occur on compute queues.
### Description
Fixed the concatenation condition for pastkey/key and pastvalue/value,
and fixed an index error. Added new test cases.
### Motivation and Context
### Description
This PR supports building onnxruntime.xcframework for xros/xrsimulator
(visionOS) via the build command
`python3 tools/ci_build/github/apple/build_apple_framework.py --config
Release/Debug
tools/ci_build/github/apple/default_vision_os_framework_build_settings.json`.
Officially including visionOS in the iOS CocoaPods package and testing it
in CI would require separate work to upgrade the Xcode version and to
upgrade the macOS CI agent to macos-13-arm64 or higher.
### Motivation and Context
visionos support:
https://github.com/microsoft/onnxruntime/discussions/19313
---------
Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net>
Co-authored-by: rachguo <rachguo@rachguos-Mac-mini.local>
### Description
In DeepSpeed's pipeline parallel implementation, there is a class
(LayerSpec) used to instantiate an object only after it has been moved to
the device and assigned to a stage.
This approach helps reduce peak memory usage.
In this PR, we're adding support to ORT for wrapping this LayerSpec.
### Description
As described in the latest discussion in #19915, Parcel v2 without the
[new resolver](https://parceljs.org/blog/v2-9-0/#new-resolver) does not
work correctly with onnxruntime-web. There are still users who use
Parcel with the default resolver, so this PR adds the deprecated
"browser" field back for backward compatibility. It also corrects the
"main" field, which is used by the old resolver for Node.js.
### Description
Enabled more use cases.
### Motivation and Context
### Description
Fix comparison that was not updated when the threshold was converted to
bytes.
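As a hedged illustration of this class of bug (the actual threshold and call site are not shown here): once a threshold is expressed in bytes, the value compared against it must also be in bytes, not in elements. `ExceedsByteThreshold` is a hypothetical helper written for this sketch.

```c++
#include <cstddef>

// Returns true when a buffer of `num_elements` items, each `element_size`
// bytes wide, is strictly larger than `threshold_bytes`.
// Both sides of the comparison are in bytes. The division form is equivalent
// to num_elements * element_size > threshold_bytes for unsigned integers,
// but cannot overflow.
inline bool ExceedsByteThreshold(std::size_t num_elements,
                                 std::size_t element_size,
                                 std::size_t threshold_bytes) {
  if (element_size == 0) return false;
  return num_elements > threshold_bytes / element_size;
}
```

For example, 100 four-byte elements occupy 400 bytes, which exceeds a 256-byte threshold, whereas comparing the raw element count (100) against 256 would wrongly report that the buffer is under the limit.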
### Motivation and Context
Fix CI failure