### Description
<!-- Describe your changes. -->
We have use cases where multiple sessions are created concurrently.
Minimizing the usage of the default logger is important for these
scenarios.
Wire through the session logger to as many places as possible. The EP
logger can also be used once the session is created (can't be used
during EP construction/kernel registration but can be used in
GetCapability and Compile).
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Improve logging when there are concurrent sessions.
### Description
When the graph is quantized to qdq format, the DQ + MatMul is
transformed to MatMulNBits in the level 2 optimizer when the model is
initialized in an inference session.
In the transformation step, tensors are transposed and new tensor protos
are created. Instead of using protobuf arena allocated memory, the PR
sets the tensor proto to use external buffer, and point the external
location to memory location which contains the tensor buffer allocated
by CPU.
Then, in the step that creates OrtValue using the tensor proto, the
memory buffers in the tensor proto are directly assigned to the tensors
which were originally allocated by Ort Arena.
With these two steps, the peak memory usage of QDQ format model is the
same as usage of QOperator model. Besides, the model initialization time
is significantly reduced. Take
[Phi-3-mini-4k-instruct](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct)
for example:
|| QOperator Model (MatMulNBits) | QDQ Model (DQ + MatMul, original
code) | QDQ Model (this PR) |
|---|---|---|---|
| peak memory consumption | 2.8 GB | ~4.8 GB | 2.8 GB |
| initialization time | 3 sec | 9 sec | 5 sec |
### Motivation and Context
When the graph is quantized to qdq format, the DQ + MatMul is converted
to MatMulNBits in the level 2 optimizer.
Originally, the newly created tensor proto use memory allocated by
protobuf arena. These memory usage cannot be fully released when the
tensor protos are deleted.
Then, in the tensor proto to OrtValue step, tensors are created using
ORT arena. Later, in the pre-pack step for MatMulNBits, new OrtValues
are created. The tensors in the ORT arena are not fully released as
well.
The two arena memory allocation steps in the DQ + MatMul -> MatMulNBits
transformation will result in almost 2x memory consumption in the model
initialization.
### Description
This is a partial change from
[fajin/qdqmatmulnbitstoolchain](https://github.com/microsoft/onnxruntime/pull/21180).
The original PR is blocked by Web CI failures.
MatMulNBits is a heavily optimized matmul operation. Currently a MatMul
can be converted to MatMulNBits to speed up the model inference.
However, MatMulNBits is an ORT only op. To make the graph compatible
with ONNX ops and utilize MatMulNBits at the same time, we introduce
Q/DQ support for MatMulNBits.
To convert MatMul ops in a model to MatMulNBits:
1. use matmul_4bits_quantizer.py to convert MatMul to DQ + MatMul using
QDQ mode.
2. In ORT session, DQ + MatMul is fused to MatMulNBits
#### Note
MatMulNBits assume B weight is uint4. When no zp is provided, zp
defaults to 8, which is different from DQ. DQ defaults zp to 0 when no
zp provided. And DQ supports int4. Therefore some conversions are
introduced during DQ + MatMul --> MatMulNBits step.
#### Perf
Using QDQ format will increase the model initialization time and memory
consumption. With current implement, model init time increased from ~4s
to ~9s, and memory consumption increased from ~2.8GB to ~4.8GB.
The memory increase is due to
1. in optimizer, after transpose the B weight, a in-memory tensor proto
is created using protobuf's arena.
2. in finalize step, when saving initializer and prepacking, ORT arena
is used to create buffers for initializers.
The memory allocated by arenas cannot be fully deallocated.
If disable ORT arena memory allocation, the memory consumptions of both
QDQ format and original format are ~2.2GB.
The time increase is mainly due to multiple memory copy, but can be
further optimized.
### Motivation and Context
Please see description for details.
### Description
There are so many typos reported by the review dog, [Optional Lint]
actions (example:
https://github.com/microsoft/onnxruntime/actions/runs/9864564489/job/27239732367),
this PR is to fix some of them.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
---------
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
### Description
Augment nhwc graph optimizer to accommodate fp16 operators.
### Motivation and Context
With new fp16 conv operator added. This operator prefers NHWC data
layout. We need to augment existing graph optimizers to better utilize
the new operator.
### Description
Run clang-format in CI. Formatted all c/c++, objective-c/c++ files.
Excluded
```
'onnxruntime/core/mlas/**',
'onnxruntime/contrib_ops/cuda/bert/tensorrt_fused_multihead_attention/**',
```
because they contain assembly or is data heavy
### Motivation and Context
Coding style consistency
### Description
<!-- Describe your changes. -->
Add required graph transformer to duplicate DQ nodes to ensure that QDQ
node units have unique DQ nodes. This condition is necessary for QDQ
node unit processing.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
There is an existing Python utility that does this:
c7ced7a5e9/tools/python/util/qdq_helpers/qdq_model_utils.py (L77)
This PR implements it as a graph transformer so it is integrated into
ORT and does not require a separate step to update the model. There are
also tests to ensure that its effects are not undone by basic level
graph optimizations.
### Description
To pass session_options to Xnnpack EP via
`XnnpackProviderFactoryCreator` for Initializing xnnpack's threadpool.
If you want to use different threadpool size or even disable xnnpack's
threadpool, just setting intra_threadpool to 1 by xnnpack EP's
provider_options.
### Motivation and Context
Co-authored-by: Guangyun Han <guangyunhan@microsoft.com>
Co-authored-by: Jicheng Wen <jicwen@microsoft.com>
Work on minimizing memory management calls by
reducing number of allocations and copies.
Replace std::unordered_set to InlinedHashSet
and add usage of InlinedVector.
Employ std::move() to minimize copying and memory allocations.
Remove copying of the const shared data into each of the
PropagateCast transformer instances.
Move inlined_containers.h header to include/common
Adjust AsSpan imlementation for C++ < 17
Add support for saving graph runtime optimizations in an ORT format model. The idea is to allow some optimizations to be "replayed" at runtime in a minimal build. The replaying part will be in a future change.
* Moved GraphTransformerConfiguration to a separate file and added strategy option to PropagateCastOps transformation.
* Added testing both FloodFill and InsertAndReduce stratigies for cast propagation.
* Added AddConsumer and RemoveConsumer functions to in graph.h for efficient graph editing.
* Added PropagateCastOps code documentation
* Added GraphTransformationConfiguration class hierarchy information
* Added RemoveInputOutputUpDownCasts
* Allow specific optimizers to be disabled.
- replace unused ability to specify just the optimizers to run
- never used so not needed
Allow the disabled list to be specified via the python bindings
- expected usage is internal, so using kwargs for that so as not to pollute the documentation with stuff no user is likely to need
Update the ORT format model conversion script to disable NCHWc transformer when level is 'all'
- currently there aren't any known use cases where we'd want the NCHWc transformations to run as they create a device specific model and aren't used on ARM
- the ORT format model is not expected to be generated on the target device (e.g. generate on Windows/Linux/macOS to deploy to Android/iOS so there's a good chance we'd generate a useless/invalid model
- default to 'all' as ARM and MLAS prefer NHWC and the NHWC transformer runs at that level
* Add matching changes to optimizer generation in training code
* Expose recompute configs to the frontend
* Add frontend test
* Ensure recompute graph transformer is only applied once
Co-authored-by: Sherlock Huang <bahuang@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
* Add support for sharing allocators
* Incremental update
* Address some PR comments, add unit tests, add documentation.
* Address PR comments, add tests and some documentation.
* Fix build and test issues
* Remove RegisterAllocator API restoring the OrtAllocator interface changes. Changed docs to reflect this.
Also fixed the orttraining segfault. The segfault was because in the case of training session,
the CPU exec prov is not available at the time the transformers are applied. Changed it to create
a new one.
Update TransposeMatMul to support scaling of the matrix product by a constant scalar value (analogous to the GEMM alpha parameter). Rename TransposeMatMul to TransposeScaleMatMul.
Fuse MatMul with surrounding Mul/Div with constant scalar into TransposeScaleMatMul.
Remove gsl subodule and replace with a local copy of gsl-lite
Refactor for onnxruntime::make_unique
gsl::span size and index are now size_t
Remove lambda auto argument type detection.
Remove constexpr from fail_fast in gsl due to Linux not being happy.
Comment out std::stream support due to MacOS std lib broken.
Move make_unique into include/core/common so it is accessible for server builds.
Relax requirements for onnxruntime/test/providers/cpu/ml/write_scores_test.cc
due to x86 build.
Add ONNXRUNTIME_ROOT to Server Lib includes so gsl is recognized
This change integrates the NCHWc support recently added to MLAS into ONNX Runtime. When using "-o 3" optimizations, then the runtime will do a NCHWc layout optimization pass to convert standard ONNX operators such as Conv/MaxPool to the com.microsoft.nchwc domain with weights and biases reordered for speed.
Introduce a quick pre-filtering of rules based on the node op types they are targeting.
The goal is to avoid evaluating all rules for all nodes. Instead, for each node, we will only be evaluating the rules associated with its op type.
* fix graph transformers and refactor tests
* fix merge master
* Set default optimization level to Level1
* fix build warnings for Linux
* try root cause tensorrt test failures
* try root cause tensorrt test failure
* Test level2 transformers with all CI builds
* remove ConvActivation fusion transformer
* change default level back to level1
* remove providers from apply api
* more changes
* Convert unsqueeze elimination to rewrite rule
* Simplify the way we register predefined transformers and rules in the inference session (all details are now moved to the graph transformer utils)
* Some reorganization and renaming of methods in graph_utils
* Updates in graph transformers test
* Update in edge removal to not perform unnecessary check of node args that led to race conditions when updating the graph
* Improve documentation for rewrite rules
* Remove top-down rule-based transformer (given we currently have only one type of rule-based transformer)
* graph transformers update
* some updates
* plus changes
* more updates
* fixes per review comments
* enable tests
* adding more tests
* more changes
* update api in inference sesion
* changes per review
* Linux CI fix
* fix linux CI failure
* fix MAC CI failure
* more updates
* add more documentation and add level param to register transformer
* Create a project for graph optimizer.
Move optimizer related code to the folder optimizer.
* Fix build failures.
* rebase and fix build failures.
* fix build failure.
* fix build failure with cuda path.
* fix python build failure.
* Move two transformers(memcpy and insert_cast) from framework to optimizer.
* rebase.
* SessionState should not depend on optimizer.