* Add helper to check if node provides a graph output. The current approach unnecessarily creates a vector when most of the optimizers only care about a true/false response.
* Undo accidental change
* Fix a couple of issues due to copying from a larger set of changes.
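The idea behind the graph-output helper in the first item above can be sketched as follows. This is a minimal illustration in Python, not the actual ONNX Runtime code (which is C++): the `Node` class and both function names are hypothetical. It contrasts building a full list of outputs with a boolean check that stops at the first match.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Node:
    """Hypothetical stand-in for a graph node; not the ONNX Runtime type."""
    name: str
    outputs: List[str] = field(default_factory=list)


def graph_outputs_of(node: Node, graph_outputs: Set[str]) -> List[str]:
    """List-building approach: allocates a list even if the caller
    only wants to know whether it is empty."""
    return [out for out in node.outputs if out in graph_outputs]


def provides_graph_output(node: Node, graph_outputs: Set[str]) -> bool:
    """Boolean helper: no allocation, stops at the first matching output."""
    return any(out in graph_outputs for out in node.outputs)
```

A caller that only needs a yes/no answer would use `provides_graph_output(node, graph_outputs)` instead of testing whether the returned list is non-empty.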
ORTModule requires two PyTorch CPP extensions that are currently JIT compiled. The runtime compilation can cause issues in environments that lack some of the build requirements, or in environments with multiple instances of ORTModule running in parallel.
This PR creates a custom command that compiles these extensions and must be run manually before ORTModule is used for the first time. When users try to use ORTModule before the extensions are compiled, an error with instructions is raised.
PyTorch CPP Extensions for ORTModule can be compiled by running:
python -m onnxruntime.training.ortmodule.torch_cpp_extensions.install
A full build environment is needed for this.
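The "fail early with instructions" behavior described above could look roughly like the sketch below. It is an assumption-laden illustration, not ORTModule's implementation: the function name, the directory layout, and the rule that an extension counts as compiled when a file starting with its name exists are all hypothetical. Only the install command is taken from this message.

```python
from pathlib import Path
from typing import Sequence

# Install command quoted from the commit message above.
INSTALL_COMMAND = "python -m onnxruntime.training.ortmodule.torch_cpp_extensions.install"


def ensure_extensions_compiled(extension_dir: Path, required: Sequence[str]) -> None:
    """Raise RuntimeError with instructions if a pre-built extension is missing.

    Hypothetical rule: an extension is considered compiled when a file whose
    name starts with the extension name exists in extension_dir.
    """
    missing = [
        name
        for name in required
        if not extension_dir.is_dir() or not any(extension_dir.glob(f"{name}*"))
    ]
    if missing:
        raise RuntimeError(
            f"PyTorch CPP extensions not compiled: {', '.join(missing)}. "
            f"Run '{INSTALL_COMMAND}' in a full build environment first."
        )
```

Checking up front, rather than compiling on demand, avoids the parallel-compilation and missing-toolchain problems at the cost of one manual setup step.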
* POWER10: Optimized SGEMM in MLAS
This patch introduces a new optimized version of SGEMM in MLAS
using the POWER10 Matrix-Multiply Assist (MMA) feature introduced in
POWER ISA v3.1. This patch makes use of the new POWER10 compute instructions
for matrix multiplication operations.
* Adjust tabs in cmake
Changing tabs to spaces as per review comment.
* Adjust tabs in new sgemm file
Changing tabs to spaces in SgemmKernelPOWER10.cpp.
* Reusing functions using common header
Co-authored-by: Rajalakshmi Srinivasaraghavan <rajis@linux.ibm.com>
The warning is:
"Potential comparison of a constant with another constant. at D:\a_work\1\s\onnxruntime\core\providers\cuda\nn\pool.cc@167,21".
It was found by the VS static code analyzer in our CUDA EP.
* Drop std::count_if() in *EmbedLayerNorm Ops.
Profiling has shown that summing up the vector using the std function
can be 2x slower than a plain loop over the vector.
* try and revert submodule commits
* ensure mask is 1.
Switched the code to C++17. To build ONNX Runtime on old distros like CentOS 7, you need to install a newer GCC from additional repos. If you build onnxruntime with the newer GCC, the resulting binary typically can't be distributed elsewhere, because it depends on the new GCC's runtime libraries, which the stock OS doesn't have. On RHEL/CentOS the situation is better. We build our code on CentOS 7 with Red Hat devtoolset 8/9/10. New library features (like std::filesystem) that do not exist in the old C++ runtime are statically linked into the applications, with some restrictions:
1. GCC has a dual ABI, but we can only use the old one. This means std::string is still copy-on-write and std::list::size() is still O(n). Also, if you build onnxruntime on CentOS 7 and link it with binaries that were built on CentOS 8 or Ubuntu with the new ABI and that export C++ symbols directly (instead of using a C API), then it won't work.
2. We still can't use std::optional. This limitation comes from macOS. We will solve it when we get macOS 11 build machines. It won't be too long.
3. Please avoid using C++17 in CUDA files (*.cu) and in the *.h files that they include (like core/framework/float16.h), because CUDA 10.2 doesn't support C++17. You are welcome to use the new features in any *.cc files.
* prepare for C# to configure provider options
* add c# code
* revert modification
* Add update provider info configuration in trt ep side
* fix bugs
* fix bug for compiler error C2259
* Add c# test
* fix bug
* fix bug
* Properly deal with string
* Add c# api for accepting trt provider options
* fix bug
* Modify C# test
* add shared lib test
* Add get provider options functionality
* clean up
* clean up
* fix bug
* fix bugs for CI
* Fix bugs for CI and documentation
* Move TRT EP provider options related functions out of C API
* revert
* fix bug
* refactor
* add check for provider options string
* code refactor
* fix CI bug
* Fix CI bugs
* clean up
* fix bug
* Fix bug for Post Analysis
* fix accidental bug
* Add API_IMPL_BEGIN/API_IMPL_END
* clean up
* code refactor
* code refactor
* fix CI fail
* fix bug
* use string append
* Change the code to better handle strncpy and string append
* changes to fuse attention node and create varied dimensions
* added an option to optimizer to only do offline fusion
* fixing a typo
* merge with master
* removing extra changes
* added new unit test - test_attention_fusion_for_varied_qkv_dimensions()
* Unit test successful for q,k,v paths with varied dimensions
* adding test model for unit test case
* optimizing attention tests
* removing debugs
* minor change
* addressing comments
* addressing comments
* changed the new option to disable_onnxruntime
* replacing asserts with debugs
* make attn fusion backward compatible for head_size, hidden_size
* preserving behavior for shape_modified_tensor
* adding new option as the last parameter
* cleaning up
* line breaks and spaces
* formatting according to python
* making the changes to fuse attention node without user input
* changes to fusion_attention.py updated
* bringing the code up to python standard
NumPy has binary compatibility, which means "binaries compiled against a given version of NumPy will still run correctly with newer NumPy versions, but not with older versions." So, if an ONNX Runtime package was built with NumPy version A, then at run time it requires NumPy version >=A. In this change, we read the NumPy version from the installed packages at build time, so the build-time and runtime versions no longer have to be kept consistent by hand.
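One way to derive the runtime requirement from the build environment is sketched below. This is an illustration under assumptions, not the code in this change: the function names are hypothetical, and the fallback to an unpinned `numpy` requirement when NumPy is not installed is a choice made for the sketch.

```python
from importlib import metadata


def numpy_requirement(build_version: str) -> str:
    """Turn the NumPy version seen at build time into a '>=' requirement."""
    return f"numpy>={build_version}"


def detect_numpy_requirement() -> str:
    """Read the installed NumPy version and build the requirement string.

    Falls back to an unpinned 'numpy' if NumPy is not installed
    (an assumption of this sketch, not taken from the change).
    """
    try:
        version = metadata.version("numpy")
    except metadata.PackageNotFoundError:
        return "numpy"
    return numpy_requirement(version)
```

The resulting string could then be placed in the package's install requirements, so the lower bound always matches the NumPy version the package was built against.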
This is an update to https://github.com/microsoft/onnxruntime/pull/8079
The sample application motivating the original update changed to use an updated version of the model. Now, fewer ops are required. This change removes the previously added ops which are no longer needed.
1. Performance of the no-padding branch is improved 8 times.
2. The symmetric padding branch is generalized to the asymmetric padding case (padding symmetry was not actually used) and further optimized by eliminating integer multiplications.