* Enable selecting custom ops in onnxruntime-extensions.
* Move cmake_helper.py.
* Remove over-indented spaces.
* Add doc.
* Remove onnxruntime-extensions from git submodules; the user should pass the path of onnxruntime-extensions for the build.
* Modify doc.
* Remove argument --enable_onnxruntime_extensions and use --onnxruntime_extensions_path.
* Fix build error.
* Fix build error.
* Use onnxruntime_extensions_path.
* support both submodule and external source folders
* refinement
* Update cgmanifest.json
* Support building onnxruntime-extensions from either git submodule or pre-pulled path.
* Update doc.
* more standard name
* update docs
* add the copyright header
Co-authored-by: Zuwei Zhao <zuzhao@microsoft.com>
Co-authored-by: Wenbing Li <wenbingl@outlook.com>
Co-authored-by: Wenbing Li <10278425+wenbingl@users.noreply.github.com>
* dnnl ep rework
rework DnnlTensor,DnnlNode,DnnlSubgraph to support arbitrary graph topology and tensor data types
rework GetCapability to claim nodes in graph greedily from node topological ordering and delay creation of DnnlSubgraph until Compile
rework compile to have DnnlSubgraphPrimitive as the object to handle primitive creation and execution
instead of thread local primitive pool which duplicates intermediate memory allocated by the EP across threads
DnnlSubgraphPrimitive provides helpers for the functions common to each dnnl primitive builder and becomes the central place to store input, output, intermediate, and initializer memories
it provides functions to obtain input memories with automatic reordering/reshaping and moving between engines
it provides interfaces to add a primitive, set the output memory for a single node, etc.
add CONCURRENT_EXEC compile flag for dnnl library as without it, convolution primitive cannot be created and executed on different threads
enable unit tests to run on dnnl ep as well if built with dnnl ep
add dnnl ep support for MatMulInteger
* Add Relu to the DNNL refactor
Signed-off-by: George Nash <george.nash@intel.com>
* Add Convolution op to the DNNL rework
Signed-off-by: George Nash <george.nash@intel.com>
* Add Pooling ops to the DNNL rework
This adds the following ops:
- AveragePool
- GlobalAveragePool
- GlobalMaxPool
- MaxPool
Note: Pooling with dilation is not yet supported.
Note: GlobalLpPool, LpPool, MaxRoiPool, and MaxUnpool are not supported yet.
Signed-off-by: George Nash <george.nash@intel.com>
* Add Sum op to the DNNL rework
Signed-off-by: George Nash <george.nash@intel.com>
* Add ConvGrad op to the DNNL rework
Signed-off-by: George Nash <george.nash@intel.com>
* Add MaxPoolGrad and AveragePoolGrad ops to DNNL rework
Signed-off-by: George Nash <george.nash@intel.com>
* Added lrn operator to the refactored code
Signed-off-by: Chethan Palangotu Keshava <chethan.palangotu.keshava@intel.com>
* Added ReduceMean DNNL op to the refactor code
Signed-off-by: Chethan Palangotu Keshava <chethan.palangotu.keshava@intel.com>
* Added Softmax DNNL op for the refactored code
Signed-off-by: Chethan Palangotu Keshava <chethan.palangotu.keshava@intel.com>
* Added BatchNorm DNNL op inference-only for refactored code
Signed-off-by: Chethan Palangotu Keshava <chethan.palangotu.keshava@intel.com>
* Added Binary Ops to DNNL rework
Signed-off-by: Wang <zhaoyang.wang@intel.com>
* Added ReluGrad to DNNL Rework
Signed-off-by: Wang <zhaoyang.wang@intel.com>
* Update OneDNN tag to v2.3
Signed-off-by: Wang <zhaoyang.wang@intel.com>
* Added support for memory up to dim size 12
this is to fix the CI test cases that contain binary ops with input dim
size > 5
Signed-off-by: Wang <zhaoyang.wang@intel.com>
* Prevent claiming support for float16 and bfloat16 when only float is supported
The string.find call that was used caused the code to claim support
for float16 and bfloat16 when we only supported float. We now explicitly
check for the data type, or for the data type with the 7-character prefix
"tensor(".
Signed-off-by: George Nash <george.nash@intel.com>
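The substring problem described above can be illustrated with a minimal Python sketch. It is not the DNNL execution provider code: the function names and the `SUPPORTED` set are hypothetical, and only the "tensor(" prefix and the float/float16/bfloat16 type names come from the commit message.

```python
# Assumption for illustration: only "float" is supported.
SUPPORTED = {"float"}


def claims_by_substring(type_str: str) -> bool:
    # Buggy pattern: "float" is a substring of "float16" and "bfloat16",
    # so a find-style check claims those types as well.
    return any(type_str.find(t) != -1 for t in SUPPORTED)


def claims_by_exact_match(type_str: str) -> bool:
    # Accept the bare element type or the same type wrapped in "tensor(...)".
    return any(type_str == t or type_str == f"tensor({t})" for t in SUPPORTED)
```

With these helpers, `claims_by_substring("tensor(float16)")` is true while `claims_by_exact_match("tensor(float16)")` is false, which is the behavior change the commit describes.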
* Disable uint8 mul and div, improve type conversion
Disable mul_uint8 and div_uint8 test cases as they use modulo for
overflow handling while onednn uses saturation
improve type conversion using enum instead of string comparison as well
as adding more types
Signed-off-by: Wang <zhaoyang.wang@intel.com>
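The difference between modulo and saturating overflow that motivates disabling the uint8 tests can be sketched in plain Python. The helper names are hypothetical and the sketch only models the arithmetic, not either library's implementation.

```python
UINT8_MAX = 255


def mul_uint8_wrap(a: int, b: int) -> int:
    # Modulo overflow handling: keep only the low 8 bits of the product.
    return (a * b) % (UINT8_MAX + 1)


def mul_uint8_saturate(a: int, b: int) -> int:
    # Saturating overflow handling: clamp the product to the largest uint8.
    return min(a * b, UINT8_MAX)
```

For example, 16 * 20 = 320 wraps to 64 under modulo handling but saturates to 255, so a test that expects one result cannot pass against the other.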
Co-authored-by: Wang <zhaoyang.wang@intel.com>
Co-authored-by: Chethan Palangotu Keshava <chethan.palangotu.keshava@intel.com>
* updates for picking pnnx commit
* add tests filter to c# tests
* plus test fixes
* fix versioning for contrib ops
* fix tests
* test filter for optional ops
* more versioning related updates
* fix test
* fix layernorm spec
* more updates
* update docs
* add more test filters
* more filters
* update binary size threshold
* update docs
* plus more fixes
* updates per review
* update to release commit
* add filters for optional type tests
* plus updates
* update onnx-tensorrt parser to master
* disable unsupported tests
* add cuda sm 75 for T4
* update tensorrt pipeline
* update trt pipelines
* update trt pipelines
* Update linux-gpu-tensorrt-ci-pipeline.yml
* update trt cid pipeline
* Update linux-gpu-tensorrt-ci-pipeline.yml
* Update Tensorrt Windows build pool and TensorRT/CUDA/CuDNN version
* update to cuda11.4 in trt ci pipeline
* update base image to cuda11.4
* update packaging pipeline to cuda11.4
* clean up
* remove cuda11.1 and cuda11.3 docker file
* disable unsupported tensorrt tests at runtime
* Update linux-multi-gpu-tensorrt-ci-pipeline.yml
* Update submodule onnxruntime-extensions to latest.
* Add document for onnxruntime-extensions.
* Update cgmanifest.json for onnxruntime-extensions.
* Add example in JavaScript.
Co-authored-by: Zuwei Zhao <zuzhao@microsoft.com>
The PyTorch cpuinfo library allows us to query the current CPU's features, micro-architecture, cache size, etc. This information is needed for targeted performance optimizations.
Unfortunately it does not work under Windows/ARM. We need to develop our own later.
Switched the code to C++17. To build ONNX Runtime on old distros like CentOS 7, you need to install a newer GCC from additional repos. If you build onnxruntime with the newer GCC, the resulting binary typically can't be distributed elsewhere because it depends on the new GCC's runtime libraries, which the stock OS doesn't have. On RHEL/CentOS the situation is better: we build our code on CentOS 7 with Red Hat devtoolset 8/9/10. New library features (like std::filesystem) that do not exist in the old C++ runtime are statically linked into the applications, with some restrictions:
1. GCC has a dual ABI, but we can only use the old one. This means std::string is still copy-on-write and std::list::size() is still O(n). Also, if you build onnxruntime on CentOS 7 and link it with binaries that were built on CentOS 8 or Ubuntu with the new ABI and that export C++ symbols directly (instead of using a C API), then it won't work.
2. We still can't use std::optional. This limitation comes from macOS. We will solve it when we get macOS 11 build machines, which shouldn't take long.
3. Please avoid using C++17 in CUDA files (*.cu) and in the *.h files they include (like core/framework/float16.h), because CUDA 10.2 doesn't support C++17. You are welcome to use the new features in any *.cc files.
Co-authored-by: Chen Fu <fuchen@microsoft.com>
Description:
This change adds the Google Benchmark git repo as a submodule in the onnxruntime repo.
Motivation and Context
Currently we have benchmarking code that depends on google benchmark. The version we are using has cross compilation issues for ARM CPUs. Recent changes in Google benchmark fixed these issues.
Another problem is that we now rely on ONNX to pull in Google benchmark, an indirect dependency. Updating ONNX involves complex steps and rightly so. However, updating Google benchmark dependency should not be hindered by these processes.
* Simplified version of WebAssembly support to keep most of existing data structures and add cmake using Ninja and emcmake
* Clean up CMakeLists.txt and add an example to create and compute a kernel
* Load a model from bytes and remove graph building steps
* Add all cpu and contrib ops with mlas library
* WebAssembly build with Onnxruntime C/CXX API
* Use protobuf cmakefile directory instead of adding every necessary source file
* Fix invalid output at example
* add missing files
* Change an example to use Teams model and support ort mobile format
* add API for javascript
* fix input releasing in _ort_run()
* update API
* Let onnxruntime cmake build WebAssembly with option '--wasm'
* allow one-step building for wasm
* Make build script work on Linux and macOS
* Fix broken build from Windows command
* Enable unit test on building WebAssembly
* Resolve comments
* update build flags
* wasm conv improvement from: 1) GemmV; 2) Depthwise direct convolution 3x3; 3) Direct convolution 3x3
* Cleaned mlas unittest.
* use glob
* update comments
* Update baseline due to loss scale fix (#6948)
* fix stream sync issue (#6954)
* Enable type reduction in EyeLike, Mod, random.cc CPU kernels. (#6960)
* Update EyeLike CPU kernel.
* Update Mod CPU kernel.
* Update Multinomial CPU kernel.
* Slight improvement to Pad CPU kernel binary size.
* Update RandomNormal[Like], RandomUniform[Like] CPU kernels.
* Fix warning from setting multiple MSVC warning level options. (#6917)
Fix warning from setting multiple MSVC warning level options. Replace an existing /Wn flag instead of always appending a new one.
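The "replace instead of append" idea can be sketched in Python as follows. The actual change lives in the build configuration; `set_msvc_warning_level` is a hypothetical helper, and the sketch assumes the flags are held in a single space-separated string.

```python
import re

# Matches an MSVC warning level option /W0 through /W4 as a whole token.
_WARNING_LEVEL = re.compile(r"/W[0-4]\b")


def set_msvc_warning_level(flags: str, level: int) -> str:
    """Return flags with exactly one /Wn option set to the requested level."""
    new_flag = f"/W{level}"
    if _WARNING_LEVEL.search(flags):
        # Replace the existing level so the compiler never sees two of them.
        return _WARNING_LEVEL.sub(new_flag, flags)
    # No level present yet: appending is safe.
    return f"{flags} {new_flag}".strip()
```

Appending unconditionally would turn `/DWIN32 /W3 /EHsc` into a command line with both `/W3` and `/W4`, which is what triggers the warning.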
* MLAS: quantized GEMM update (#6916)
Various updates to the int8_t GEMMs:
1) Add ARM64 udot kernel to take advantage of dot product instructions available in newer cores. Some models run 4x faster than the stock implementation we used before.
2) Refactor the x64 kernels to share common code for AVX2(u8u8/u8s8/avxvnni) vs AVX512(u8u8/u8s8/avx512vnni) to reduce binary size.
3) Extend kernels to support per-column zero points for matrix B. This is not currently wired to an operator.
* Implement QLinearAveragePool with unit tests. (#6896)
Implement QLinearAveragePool with unit tests.
* Attention fusion detect num_heads and hidden_size automatically (#6920)
* fixed type to experimental session constructor (#6950)
* fixed type to experimental session constructor
Co-authored-by: David Medine <david.medine@brainproducts.com>
* Update onnxruntime_perf_test.exe to accept free dimension overrides (#6962)
Co-authored-by: Ori Levari <orlevari@microsoft.com>
* Fix possible fd leak in NNAPI (#6966)
* Release buffers for prepacked tensors (#6820)
Unsolved problems:
1. One test failure was caused by a bug in the Cudnn rnn kernels: they can allocate a buffer and only partially initialize it, and the garbage data near the tail of the buffer caused problems on some hardware. To attack this problem in a broader sense, should we add code in our allocators that, during a memory fuzzing test, fills an allocated buffer with garbage before returning it to the caller?
2. Prepacking is used more widely than we know. For instance, Cudnn rnn kernels also cache their weights. They mix several weight tensors together into a single buffer and never touch the original weight tensors again. This is the same idea as pre-packing, but they didn't override the virtual function and never tried to release those weight tensors, leading to wasted memory. It also seems that some other kernels behave similarly. I wonder how much memory we could save if we cleaned those up too.
3. Turning off memory pattern planning does increase memory fragmentation, leading to out-of-memory errors in some training test cases. Perhaps we can revisit the idea of moving the kernel-creation stage earlier, so that during initializer deserialization we only avoid tracing those that will be prepacked.
* Enable type reduction for Range, ReverseSequence, ScatterND, Split, and Unique CPU kernels. (#6963)
* add CI
* fix test in ci
* fix flags for nsync in wasm build
* add copyright banner
* fix wasm source glob
* add missing exports
* resolve comments
* Perf gain by narrowing packb width from 16 to 4 on GEMM for WASM.
Remove the direct conv from previous perf tuning that is no longer needed.
* fix buildbreak introduced from latest master merge
* fix buildbreak in mlasi.h
* resolve all comments except MLAS
* rewrite the 3 packb-related functions for WASM_SCALAR separately rather than using #ifdef in each,
and make other changes according to PR feedback in mlas.
* More complete scalar path in sgemm from Tracy.
* Fix edge case handling in the depthwise conv2d 3x3 kernel:
*) support input W==1 and H==1
*) recalculate accurate pad_right and pad_bottom
*) support hidden pad_right == 2 or pad_bottom == 2 when W == 1 or H==1 and no pad left/top
* Add more test coverage for conv depthwise from Tracy.
Fix one typo according to PR.
* resolve comments
* replace typedef by using
* do not use throw in OrtRun()
* output error message
Co-authored-by: Sunghoon <35605090+hanbitmyths@users.noreply.github.com>
Co-authored-by: Lei Zhang <zhang.huanning@hotmail.com>
Co-authored-by: Wei-Sheng Chin <wschin@outlook.com>
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
Co-authored-by: Tracy Sharpe <42477615+tracysh@users.noreply.github.com>
Co-authored-by: David Medine <david.eric.medine@gmail.com>
Co-authored-by: David Medine <david.medine@brainproducts.com>
Co-authored-by: Ori Levari <ori.levari@microsoft.com>
Co-authored-by: Ori Levari <orlevari@microsoft.com>
Co-authored-by: Guoyu Wang <62914304+gwang-msft@users.noreply.github.com>
Co-authored-by: Chen Fu <chenfucs@gmail.com>
Changes include:
* Revert Event Pool changes
* Add copyright and revert unrelated changes
* Add DLPack as submodule and remove to_dlpack and from_dlpack from public API
* Update golden numbers for DHP Parallel tests
* Update ORTTrainer unit test numbers
* Rollback to DLPack v0.3
* Disable flaky test
* Update third party notices and CG manifest file
* Minor refactoring of ORTValue API
* Added code for Relugrad with GPU support.
Signed-off-by: Chethan Palangotu Keshava <chethan.palangotu.keshava@intel.com>
* Add GPU support for DNNL ConvGrad
Signed-off-by: George Nash <george.nash@intel.com>
* Add GPU support for DNNL MaxPoolGrad
Updates to MaxPool for training with GPU
Update oneDNN to version 1.8.1
Signed-off-by: George Nash <george.nash@intel.com>
* Fixed issues found during code review
- error in code comment
- using auto when the direct type would have been better
- removed ternary operators that were returning bool values
Signed-off-by: George Nash <george.nash@intel.com>
Co-authored-by: Chethan Palangotu Keshava <chethan.palangotu.keshava@intel.com>
* Add ReluGrad and ConvGrad ops for the dnnl provider
* the mnist sample is updated to add the --use_dnnl option that
will cause the sample to use the dnnl execution provider for
nodes that exist in dnnl provider.
* Added the ability to find forward ops. Dnnl backward gradient
ops require the forward primitive description and workspace
from the forward operation.
* Enable specifying the execution provider for Gradient Checker Tests
* Prevent memory leak when running dnnl_provider in training mode
Prevent creating a SubgraphPrimitivePool when the code is built with the
ENABLE_TRAINING build flag. Instead create a SubgraphPrimitive directly.
The SubgraphPrimitivePool was causing a pool of SubgraphPrimitives to be
stashed in a map for reuse. Due to the way the Training Loop uses threads,
the pool of SubgraphPrimitives was not being reused; instead, a new pool of
SubgraphPrimitives was created each run. The old pool was not instantly
freed. This behavior could be a language error when using thread_local
memory.
Signed-off-by: George Nash <george.nash@intel.com>
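Why a per-thread pool stops being reused can be sketched with Python's `threading.local` as a stand-in for C++ `thread_local` storage. This is a sketch under one assumption not stated in the commit: that each run executes on a freshly started thread. All names are hypothetical, and the sketch shows only the missed reuse, not when the old pools are freed.

```python
import threading

_tls = threading.local()      # stand-in for C++ thread_local storage
stats = {"pools_created": 0}  # counts how many times a pool had to be built


def get_pool() -> dict:
    """Return this thread's pool, building it on first use in the thread."""
    if not hasattr(_tls, "pool"):
        _tls.pool = {}
        stats["pools_created"] += 1
    return _tls.pool


def run_once() -> None:
    # Each run stores one hypothetical primitive in the calling thread's pool.
    get_pool()["subgraph_0"] = object()


def train(num_runs: int) -> int:
    """Run each step on a fresh thread and report how many pools were built."""
    for _ in range(num_runs):
        worker = threading.Thread(target=run_once)
        worker.start()
        worker.join()
    return stats["pools_created"]
```

Under that assumption every run builds its own pool, so the cache never pays off; constructing the primitive directly avoids keeping the unused pools around.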
* Added fixes to maxpoolgrad and memory leak.
Maxpoolgrad will now pass all unit tests.
With the conv and convgrad disabled for dnnl, mnist is able to train till 95%
Signed-off-by: Chethan Palangotu Keshava <chethan.palangotu.keshava@intel.com>
* Fixed misc issues when testing training code with dnnl provider
* fix conv_grad dnnl tests with dilation to run dnnl execution provider
* update mnist training sample to accept convolution type models
convolution models require the input shape to be {1, 28, 28}
instead of the flat {784} image that is used for the gemm models
this will enable models that require the different shape by adding
`--model_type conv` to the command line when running the mnist sample.
(while testing a workaround was used see #4762)
* Disable weight caching in dnnl conv operator when using training
When training we can not use cached weights because the weight
will be updated each run. This re-enables dnnl Conv and ConvGrad Ops.
The weight caching was the source of the error from Conv when training.
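The weight-caching problem in the last item can be sketched with a toy operator in Python. `ConvLikeOp` is hypothetical and uses a dot product in place of a real convolution; it only shows how a cached copy of the weights goes stale once training updates them.

```python
class ConvLikeOp:
    """Toy operator that optionally caches a prepared copy of its weights."""

    def __init__(self, cache_weights: bool):
        self.cache_weights = cache_weights
        self._cached = None

    @staticmethod
    def _prepare(weights):
        # Stand-in for an expensive step such as reordering weight memory.
        return tuple(weights)

    def run(self, x, weights):
        if self.cache_weights:
            if self._cached is None:
                self._cached = self._prepare(weights)
            prepared = self._cached  # later weight updates are ignored
        else:
            prepared = self._prepare(weights)
        return sum(xi * wi for xi, wi in zip(x, prepared))
```

With caching on, a second call with updated weights still returns the result for the first weights; with caching off, each call reflects the current weights, which is what training needs.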
* Fix issues found when building grad ops on Linux
* The dnnl_convgrad code was overusing the scope operator,
causing a compilation problem.
* The dnnl_maxpoolgrad code had a logic error: it was
comparing with the source description when it should have
been comparing with the destination description.
* Update BUILD.md so it shows DNNL for training
* Updated the table of contents. Since the same providers
are listed twice, once for Inference and again for Training,
an HTML anchor was added to distinguish the second header
from the first for the TOC.
* Fix build failure when not using --enable-training build option
* reorganize the gradient operators so they are grouped together
* Fix issues found when running onnx_backend_test_series.py
* Pooling code only supports 2 outputs when built with --enable-training
* Address code review feedback
* class member variables end in underscore_
* use dst instead of dist to match pattern use elsewhere in DNNL code.
* Remove workaround that was introduced to handle problems running
convolution based training models. See issue #4762
Signed-off-by: George Nash <george.nash@intel.com>
* Isolate training code and code cleanup
* Do not build with dnnl_gpu_runtime if enable_training is set; the training code
does not support dnnl_gpu_runtime yet.
* Isolated training code inside ifdefs so that it won't affect
the project if built without training enabled
* Inadvertent changes in whitespace were removed to make code review simpler
* Undid some code reordering that was not needed
* comments added to closing #endif statements to simplify reading complex ifdefs
* Modified the GetPrimitiveDesc functions to return shared_ptr instead of raw
pointer. This matches what was done in Pool code and is safer memory code.
Signed-off-by: George Nash <george.nash@intel.com>
* Address code review issues
- whitespace changes caused by running clang-format on the code
- Several spelling errors fixed
- Removed/changed some ifdefs to improve readability
- other misc. changes in response to code review.
Signed-off-by: George Nash <george.nash@intel.com>
* Code changes to address code review
- Simplify iteration code using `auto` keyword
- remove C style cast that was not needed
- remove instance variable that was not needed [relugrad.h]
- added the execution providers to `ComputeGradientErrorInternal()`
and `ComputeTheoreticalJacobianTranspose()` instead of using
a pointer to an instance variable [gradient_checker.h/.cc]
Signed-off-by: George Nash <george.nash@intel.com>
* Combined the default gradient ops test and dnnl gradient ops test for ConvGrad and MaxPoolGrad into one function with the help of a helper function.
This will reduce repeated code.
Signed-off-by: Chethan Palangotu Keshava <chethan.palangotu.keshava@intel.com>
* Replaced the stack used by convgrad with a vector so that the vector (used as a stack) can be easily cleared every time the graph is created.
This will prevent a memory leak from convolution kernels being pushed constantly onto the stack.
Signed-off-by: chethan.palangotu.keshava@intel.com
* Code clean up and formatting updates
- Removed empty else statement
- updated indentation of code that was causing double curly brackets to look unusual
- Changed check for NumDimensions to Size in Relu and ReluGrad error checking code.
- isolated training code
Signed-off-by: George Nash <george.nash@intel.com>
* Restore inadvertently removed ConvGrad tests
When combining the DNNL and CPU versions of the ConvGrad
tests, two tests were inadvertently excluded. This adds
back the Conv3d and Conv3d with strides test cases.
Signed-off-by: George Nash <george.nash@intel.com>
* Add validation to ConvGrad
This validates the dimensions of the ConvGrad match the
passed in Convolution forward primitive description.
The current code for DNNL ConvGrad makes the assumption that the ConvGrad
nodes will be visited in the reverse order from the corresponding Conv nodes
The added validation will return an error if this assumption is not true.
Signed-off-by: George Nash <george.nash@intel.com>
* Do not create new execution providers in provider_test_utils
This removes the code that generated new execution providers in the
OpTester::Run function. This was added because the std::move was
leaving the `entry` value empty so subsequent calls would cause a
segfault.
Problem is this potentially changed the execution_provider because it
would create the default provider dropping any custom arguments.
When the now removed code was originally added the std::move was causing
crashes when the GradientChecker unit tests were run. However, it is no
longer causing problems even with the code removed.
Signed-off-by: George Nash <george.nash@intel.com>
* Change the forward conv stack to a forward conv map
This changes how the forward conv kernel is mapped to the bwd ConvGrad
kernel; the problematic stack is no longer used.
The convolution stack made the assumption that the corresponding
ConvGrad operator would be visited in reverse order of the forward
Conv operators. This was always problematic and was unlikely to
work for inception models.
Important changes:
- The weight_name is added to the ConvGrad dnnl_node making it
possible to use the weight_name as a lookup key to find the
Conv forward Kernel
- the `std::vector fwd_conv_stack_` has been replaced with a
`std::map fwd_conv_kernel_map_`
- Although they are not needed, lock_guards were added when writing
to and reading from the fwd_conv_kernel_map_ as well as the
fwd_kernel_map_. These should always be accessed by a single
thread when preparing the dnnl subgraphs, so the guard should not
be needed, but it's added just in case.
- Updated the comments in the ConvGrad.h code to no longer mention the
stack. The error check is not removed. It will be good to verify
there are no errors as we continue to test against more models.
Signed-off-by: George Nash <george.nash@intel.com>
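The difference between the stack-based and map-based pairing can be sketched in Python. This is an illustration only: the kernel and weight names are made up, and plain lists and dicts stand in for `fwd_conv_stack_` and `fwd_conv_kernel_map_`.

```python
def pair_with_stack(forward, grad_weights):
    """Pair each ConvGrad with whatever forward kernel is on top of the stack.

    forward: list of (kernel_name, weight_name) in forward visit order.
    grad_weights: weight names of the ConvGrad nodes in their visit order.
    """
    stack = list(forward)
    # Only correct if ConvGrad nodes arrive in exact reverse forward order.
    return [(weight, stack.pop()[0]) for weight in grad_weights]


def pair_with_map(forward, grad_weights):
    """Pair each ConvGrad with its forward kernel by looking up the weight name."""
    kernel_by_weight = {weight: kernel for kernel, weight in forward}
    return [(weight, kernel_by_weight[weight]) for weight in grad_weights]
```

With `forward = [("conv1", "W1"), ("conv2", "W2"), ("conv3", "W3")]`, both functions agree when the gradients arrive as `["W3", "W2", "W1"]`. For an order such as `["W2", "W3", "W1"]`, the stack pairs `W2` with `conv3`, while the map still pairs it with `conv2`.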
Co-authored-by: Chethan Palangotu Keshava <chethan.palangotu.keshava@intel.com>
Co-authored-by: unknown <63478620+jeyblu@users.noreply.github.com>
* assert sequence tensor and remove skips
* update testdata json
* use ONNX 1.8 in cgmanifest.json
* use previous commit to workaround
* update ONNX commit ID in docker
* skip test_maxpool_2d_dilations test for now
* update function name