### Description
Augment nhwc graph optimizer to accommodate fp16 operators.
### Motivation and Context
A new fp16 conv operator has been added, and it prefers the NHWC data
layout. We need to augment the existing graph optimizers to better
utilize the new operator.
### Description
Run clang-format in CI. Formatted all c/c++, objective-c/c++ files.
Excluded
```
'onnxruntime/core/mlas/**',
'onnxruntime/contrib_ops/cuda/bert/tensorrt_fused_multihead_attention/**',
```
because they contain assembly or are data heavy.
### Motivation and Context
Coding style consistency
### Description
Add required graph transformer to duplicate DQ nodes to ensure that QDQ
node units have unique DQ nodes. This condition is necessary for QDQ
node unit processing.
### Motivation and Context
There is an existing Python utility that does this:
c7ced7a5e9/tools/python/util/qdq_helpers/qdq_model_utils.py (L77)
This PR implements it as a graph transformer so it is integrated into
ORT and does not require a separate step to update the model. There are
also tests to ensure that its effects are not undone by basic level
graph optimizations.
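
The duplication step described above can be illustrated with a toy model. This is only a minimal sketch, not the ORT transformer: the `Node` class, the `ensure_unique_dq_nodes` helper, and the `_dup<N>` naming are hypothetical, and a DQ output that is also a graph output is not handled.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical, minimal stand-in for a graph node."""
    name: str
    op_type: str
    inputs: list
    outputs: list


def ensure_unique_dq_nodes(nodes):
    """Give every consumer edge of a DequantizeLinear output its own DQ node.

    The first consumer keeps the original DQ node; each additional consumer
    is rewired to a fresh copy. Consumer nodes are modified in place, and the
    copies are appended at the end, so the result is not topologically sorted.
    """
    dq_by_output = {n.outputs[0]: n for n in nodes if n.op_type == "DequantizeLinear"}
    edges_seen = {}   # DQ output name -> number of consumer edges seen so far
    duplicates = []
    for node in nodes:
        for i, name in enumerate(node.inputs):
            dq = dq_by_output.get(name)
            if dq is None:
                continue
            count = edges_seen.get(name, 0)
            edges_seen[name] = count + 1
            if count == 0:
                continue  # first consumer keeps the original DQ node
            new_output = f"{name}_dup{count}"
            duplicates.append(
                Node(f"{dq.name}_dup{count}", dq.op_type, list(dq.inputs), [new_output])
            )
            node.inputs[i] = new_output
    return nodes + duplicates


graph = [
    Node("dq", "DequantizeLinear", ["x_q", "scale", "zp"], ["x"]),
    Node("conv", "Conv", ["x", "w"], ["y"]),
    Node("relu", "Relu", ["x"], ["z"]),
]
for n in ensure_unique_dq_nodes(graph):
    print(n.name, n.inputs, "->", n.outputs)
```

After the pass, `conv` and `relu` each read from a separate DQ node, so each one can form its own QDQ node unit.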
### Description
Pass session_options to the Xnnpack EP via
`XnnpackProviderFactoryCreator` to initialize xnnpack's threadpool.
To use a different threadpool size, or even to disable xnnpack's
threadpool, set intra_threadpool (for example, to 1) in the xnnpack
EP's provider_options.
### Motivation and Context
Co-authored-by: Guangyun Han <guangyunhan@microsoft.com>
Co-authored-by: Jicheng Wen <jicwen@microsoft.com>
Work on minimizing memory management calls by
reducing number of allocations and copies.
Replace std::unordered_set with InlinedHashSet
and add usage of InlinedVector.
Employ std::move() to minimize copying and memory allocations.
Remove copying of the const shared data into each of the
PropagateCast transformer instances.
Move inlined_containers.h header to include/common
Adjust AsSpan implementation for C++ < 17
Add support for saving graph runtime optimizations in an ORT format model. The idea is to allow some optimizations to be "replayed" at runtime in a minimal build. The replaying part will be in a future change.
* Moved GraphTransformerConfiguration to a separate file and added strategy option to PropagateCastOps transformation.
* Added testing of both FloodFill and InsertAndReduce strategies for cast propagation.
* Added AddConsumer and RemoveConsumer functions to graph.h for efficient graph editing.
* Added PropagateCastOps code documentation
* Added GraphTransformationConfiguration class hierarchy information
* Added RemoveInputOutputUpDownCasts
* Allow specific optimizers to be disabled.
- replace unused ability to specify just the optimizers to run
- never used so not needed
Allow the disabled list to be specified via the python bindings
- expected usage is internal, so using kwargs for that so as not to pollute the documentation with stuff no user is likely to need
Update the ORT format model conversion script to disable NCHWc transformer when level is 'all'
- currently there aren't any known use cases where we'd want the NCHWc transformations to run as they create a device specific model and aren't used on ARM
- the ORT format model is not expected to be generated on the target device (e.g. generate on Windows/Linux/macOS to deploy to Android/iOS), so there's a good chance we'd generate a useless/invalid model
- default to 'all' as ARM and MLAS prefer NHWC and the NHWC transformer runs at that level
* Add matching changes to optimizer generation in training code
* Expose recompute configs to the frontend
* Add frontend test
* Ensure recompute graph transformer is only applied once
Co-authored-by: Sherlock Huang <bahuang@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
* Add support for sharing allocators
* Incremental update
* Address some PR comments, add unit tests, add documentation.
* Address PR comments, add tests and some documentation.
* Fix build and test issues
* Remove RegisterAllocator API restoring the OrtAllocator interface changes. Changed docs to reflect this.
Also fixed the orttraining segfault. The segfault was because in the case of training session,
the CPU exec prov is not available at the time the transformers are applied. Changed it to create
a new one.
Update TransposeMatMul to support scaling of the matrix product by a constant scalar value (analogous to the GEMM alpha parameter). Rename TransposeMatMul to TransposeScaleMatMul.
Fuse MatMul with surrounding Mul/Div with constant scalar into TransposeScaleMatMul.
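
The fusion relies on scalar factors commuting with the matrix product: multiplying an input by `s` and dividing the output by `d` equals scaling the product by `alpha = s / d`. Below is a minimal sketch of that identity on nested lists. The function names are hypothetical, and the transpose handling of the real operator is omitted.

```python
def matmul(a, b):
    """Plain matrix product on nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def scale(m, s):
    """Multiply every element of m by the scalar s."""
    return [[s * v for v in row] for row in m]


def scaled_matmul(a, b, alpha=1.0):
    """alpha * (a @ b): toy stand-in for the fused operator."""
    return scale(matmul(a, b), alpha)


def unfused(a, b, mul_by, div_by):
    """Mul on the left input, then MatMul, then Div on the output."""
    product = matmul(scale(a, mul_by), b)
    return scale(product, 1.0 / div_by)


A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
# Mul by 2 before and Div by 4 after collapse into alpha = 2 / 4 = 0.5.
print(unfused(A, B, 2.0, 4.0))
print(scaled_matmul(A, B, alpha=0.5))
```

The scalars here are powers of two, so both paths give bit-identical results; with arbitrary scalars the two paths can differ by floating-point rounding.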
Remove gsl submodule and replace with a local copy of gsl-lite
Refactor for onnxruntime::make_unique
gsl::span size and index are now size_t
Remove lambda auto argument type detection.
Remove constexpr from fail_fast in gsl due to Linux not being happy.
Comment out std::stream support because the macOS std lib is broken.
Move make_unique into include/core/common so it is accessible for server builds.
Relax requirements for onnxruntime/test/providers/cpu/ml/write_scores_test.cc
due to x86 build.
Add ONNXRUNTIME_ROOT to Server Lib includes so gsl is recognized
This change integrates the NCHWc support recently added to MLAS into ONNX Runtime. When "-o 3" optimizations are used, the runtime performs an NCHWc layout optimization pass that converts standard ONNX operators such as Conv/MaxPool to the com.microsoft.nchwc domain, with weights and biases reordered for speed.
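
For readers unfamiliar with the layout, NCHWc splits the channel dimension into fixed-size blocks and stores each block contiguously, giving the shape `[N, C/c, H, W, c]`. The sketch below shows only the index mapping on a flat list. The function name is hypothetical, the block size is chosen for illustration (the block size MLAS actually uses is not stated here), and zero-padding a partial last block is an assumption of this sketch.

```python
def nchw_to_nchwc(data, n, c, h, w, block):
    """Reorder flat NCHW data into [N, ceil(C/block), H, W, block].

    Channels that do not fill the last block are left as zeros
    (an assumption made for this sketch).
    """
    c_blocks = (c + block - 1) // block
    out = [0.0] * (n * c_blocks * h * w * block)
    for ni in range(n):
        for ci in range(c):
            cb, cr = divmod(ci, block)  # block index, offset inside the block
            for hi in range(h):
                for wi in range(w):
                    src = ((ni * c + ci) * h + hi) * w + wi
                    dst = (((ni * c_blocks + cb) * h + hi) * w + wi) * block + cr
                    out[dst] = data[src]
    return out


# 1 image, 4 channels, 1x2 spatial, channel block of 2.
nchw = [0, 1, 2, 3, 4, 5, 6, 7]  # channel k holds values [2k, 2k + 1]
print(nchw_to_nchwc(nchw, n=1, c=4, h=1, w=2, block=2))
```

In the blocked layout, the values of adjacent channels at the same spatial position sit next to each other in memory.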
Introduce a quick pre-filtering of rules based on the node op types they are targeting.
The goal is to avoid evaluating all rules for all nodes. Instead, for each node, we will only be evaluating the rules associated with its op type.
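
The pre-filtering idea can be sketched as an index from op type to rules, so each node looks up only the rules that target it. This is an illustrative sketch rather than the ORT implementation: `RuleRegistry` and its methods are hypothetical names, and treating a rule with no target op types as applying to every node is an assumption.

```python
from collections import defaultdict


class RuleRegistry:
    """Hypothetical index of rewrite rules keyed by the op type they target."""

    def __init__(self):
        self._by_op_type = defaultdict(list)
        self._any_op = []  # rules with no declared target (assumed to apply to all nodes)

    def register(self, rule, target_op_types=()):
        if not target_op_types:
            self._any_op.append(rule)
        for op_type in target_op_types:
            self._by_op_type[op_type].append(rule)

    def rules_for(self, op_type):
        """Return only the rules worth evaluating for a node of this op type."""
        return self._by_op_type.get(op_type, []) + self._any_op


registry = RuleRegistry()
registry.register("EliminateIdentity", ["Identity"])
registry.register("UnsqueezeElimination", ["Unsqueeze"])
registry.register("GenericRule")

for op_type in ["Identity", "Conv"]:
    print(op_type, registry.rules_for(op_type))
```

With this index, the work per node is proportional to the number of rules registered for its op type rather than to the total number of rules.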
* fix graph transformers and refactor tests
* fix merge master
* Set default optimization level to Level1
* fix build warnings for Linux
* try root cause tensorrt test failures
* try root cause tensorrt test failure
* Test level2 transformers with all CI builds
* remove ConvActivation fusion transformer
* change default level back to level1
* remove providers from apply api
* more changes
* Convert unsqueeze elimination to rewrite rule
* Simplify the way we register predefined transformers and rules in the inference session (all details are now moved to the graph transformer utils)
* Some reorganization and renaming of methods in graph_utils
* Updates in graph transformers test
* Update in edge removal to not perform unnecessary check of node args that led to race conditions when updating the graph
* Improve documentation for rewrite rules
* Remove top-down rule-based transformer (given we currently have only one type of rule-based transformer)
* graph transformers update
* some updates
* plus changes
* more updates
* fixes per review comments
* enable tests
* adding more tests
* more changes
* update api in inference session
* changes per review
* Linux CI fix
* fix linux CI failure
* fix MAC CI failure
* more updates
* add more documentation and add level param to register transformer
* Create a project for graph optimizer.
Move optimizer related code to the folder optimizer.
* Fix build failures.
* rebase and fix build failures.
* fix build failure.
* fix build failure with cuda path.
* fix python build failure.
* Move two transformers (memcpy and insert_cast) from framework to optimizer.
* rebase.
* SessionState should not depend on optimizer.