* Update to flatbuffers v2.0.0 (#10866)
* Fix Reduced ops pipeline (#10861)
* Fix a couple of issues with the python package tools (#10858)
* Tweaks to the model utils
* Add handling for a dim_value of -1 when replacing the entire input shape. This occurs in models exported from PaddlePaddle
* make pytorch helpers accessible in package
* make QDQ helpers accessible in package
* Fix wrong percentile values returned during calibration (#10847)
* Use numpy.percentile to get the lookup value.
* Use 1.0 as float value rather than integer.
* Add missing cdf parameter for `np.percentile`.
* Use 100. instead of 1.0
* Remove print.
* Update from @yufenglee
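The percentile fix above depends on `numpy.percentile` taking `q` on a 0 to 100 scale rather than 0 to 1. A minimal sketch of the difference, using made-up data (the `values` array is an assumption, not the calibrator's actual histogram):

```python
import numpy as np

# Hypothetical calibration data: 1000 evenly spaced values from 0 to 999.
values = np.arange(1000, dtype=np.float64)

# q is read on a 0-100 scale, so 99.9 requests the 99.9th percentile.
p_correct = np.percentile(values, 99.9)

# Passing 0.999 requests the 0.999th percentile, which lies near the minimum.
p_wrong = np.percentile(values, 0.999)

print(p_correct, p_wrong)
```

Passing a fraction such as 0.999 is accepted without error, which is why a scale mistake of this kind can go unnoticed.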
* Add support for opset 16 to transpose optimizer. (#10841)
* Add support for opset 16 to transpose optimizer.
Only change required is for GridSample to be added to the layout sensitive ops. The existing handling for layout transpose works with that as the first input and first output are layout sensitive.
Update the optimizer to be able to return an error message if it fails.
* Use separate build directories for full and mobile iOS packages. (#10835)
* Address performance issue with abseil flat_hash_table. (#10819)
When a hash table is returned by value from a cross-DLL call, lookups
fail for at least some entries even though the table still contains
every entry that was originally inserted. Reverting to std::unordered_set
pending further investigation.
* Mark end of version 11 C API. (#10803)
* Mark end of version 11 C API
* Add static_assert
* avoid using LocalFree on FormatMessageW buffer (#10796)
* remove local free
* Remove local free from onnxruntime
* don't allocate
* Change to use constexpr to satisfy CPU build warning
* Integrate C-API tests into Pipelines for release packages (#10794)
* add c-api test for package
* fix bug for running c-api test for package
* refine run application script
* remove redundant code
* include CUDA test
* Remove testing CUDA EP temporarily
* fix bug
* Code refactor
* try to fix YAML bug
* try to fix YAML bug
* try to fix YAML bug
* fix bug for multiple directories in Pipelines
* fix bug
* add comments and fix bug
* Update c-api-noopenmp-packaging-pipelines.yml
* Remove failOnStandardError flag in Pipelines
* Detect runtime CUDA JIT and warn the user (#10781)
* Use cudaMalloc vs cudaDeviceSynchronize and show the total time
* Update convert_onnx_models_to_ort.py to support runtime optimizations. (#10765)
Add runtime optimization support to ONNX -> ORT format conversion script.
Replace `--optimization_level`, `--use_nnapi`, and `--use_coreml` with a new `--optimization_style` option.
* Add multithreading test and put a lock on nvinfer1::createInferRuntime() for TRT EP (#10714)
* Add multithread unit test and put lock on library call
* update code
* remove debug code
* add comment
* add one session multi-threads inference
* Put lock for build engine all the time
* Update naming and comment
* remove unnecessary lock
* Revert "remove unnecessary lock"
This reverts commit 9c2317b1d2273dec0ebdeb52160bc757839e5edc.
* Fix handling of nodes inserted by NHWC transformer. (#10904) (#10925)
* Revert "Upsample support NHWC (#10554)" (#10917)
This reverts commit bd08f11a58.
Co-authored-by: Yufeng Li <liyufeng1987@gmail.com>
* [python API] Change raise import error when `C:\Windows\System32\vcruntime140_1.dll` is not found to warning (#10927)
* remove throw if C:\\Windows\\System32\\vcruntime140_1.dll cannot be found
* Add comments and update warning message
* adding back accidentally removed line
Co-authored-by: gwang0000 <62914304+gwang0000@users.noreply.github.com>
* [js] Create npm packaging pipeline (#10886)
* create npm packaging pipeline
* fix indentations
* Update npm-packaging-pipeline.yml for Azure Pipelines
* Update npm-packaging-pipeline.yml for Azure Pipelines
* Update npm-packaging-pipeline.yml for Azure Pipelines
* react-native-ci as a template
* fix typos
* fix template paths
* add a dependency
* change a stage name
* set different artifact name for each package
* fix typo
* Update npm-packaging-pipeline.yml for Azure Pipelines
Set a build Id for node npm package as a parameter
* Update npm-packaging-pipeline.yml for Azure Pipelines
Set a build Id for node npm package as a parameter
* Update npm-packaging-pipeline.yml for Azure Pipelines
* Follow up update for python API checking if `vcruntime140_1.dll` is available (#10927) (#10933)
Co-authored-by: Hariharan Seshadri <hasesh@microsoft.com>
Co-authored-by: Scott McKay <skottmckay@gmail.com>
Co-authored-by: Funtowicz Morgan <mfuntowicz@users.noreply.github.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
Co-authored-by: Dmitri Smirnov <yuslepukhin@users.noreply.github.com>
Co-authored-by: Pranav Sharma <prs@microsoft.com>
Co-authored-by: Ryan Lai <rylai@microsoft.com>
Co-authored-by: Ryan Hill <38674843+RyanUnderhill@users.noreply.github.com>
Co-authored-by: Yi-Hong Lyu <yilyu@microsoft.com>
Co-authored-by: Yufeng Li <liyufeng1987@gmail.com>
Co-authored-by: Guoyu Wang <62914304+gwang-msft@users.noreply.github.com>
Co-authored-by: gwang0000 <62914304+gwang0000@users.noreply.github.com>
Co-authored-by: Sunghoon <35605090+hanbitmyths@users.noreply.github.com>
The ARM a55 micro-architecture (with dot product instructions), similar to a53, is widely used for the little cores in big.LITTLE configurations. The a55 has narrower memory load/store hardware: a 128b load instruction blocks the pipeline for 2 whole cycles, during which no other instructions can be executed. A 64b load instruction, on the other hand, can be dual-issued with many other instructions.
This change adds a Symmetric QGEMM kernel for a55 micro-architecture, where we replace
    ldr q4,[x1],#16
with
    ldr d4,[x1],#8
    ldr x11,[x1],#8
    ins v4.d[1],x11
so that we can try to hide the memory load cycles behind computing cycles in the kernel.
Co-authored-by: Chen Fu <fuchen@microsoft.com>
This code is valid only when -mcpu is set to utilize POWER9 technology
or above. A compatible code for POWER8 was created as well, but it
was not tuned for performance.
* POWER10: QGEMM optimization
This patch makes use of POWER10 MMA feature for QGEMM function.
This optimization includes signed and unsigned cases. Tested with
gcc11 and clang-14; there are no new failures.
* Changes as per review comments
Co-authored-by: Rajalakshmi Srinivasaraghavan <rajis@linux.ibm.com>
* add executor option (vm or graph) and support virtual machine methods
* nullptr check for compile and run methods (see also PR#10211 from microsoft:onnxruntime)
* get output shapes for VM
* remove run_with_benchmark. remove run methods from python api, get it from native side
* get outputs method for VM was implemented
* support multiple input for VM
* update python logging and exception
* small fix
* update tvm with patch for VM API
* update nhwc transformations for TVM EP
* add data alignment check and support set_input_zero_copy for GE in TVM EP
* fix logger name
* return back to apache/tvm with VM fixes instead of local dev branch
* hide customized tvm logger while issue is not resolved. fix tvm warning related to target_host
* flake8 fix
Co-authored-by: Valery Chernov <valery.chernov@deelvin.com>
Work on minimizing memory management calls by
reducing number of allocations and copies.
Replace std::unordered_set with InlinedHashSet
and add usage of InlinedVector.
Employ std::move() to minimize copying and memory allocations.
Remove copying of the const shared data into each of the
PropagateCast transformer instances.
Move inlined_containers.h header to include/common
Adjust AsSpan implementation for C++ < 17
* add support for bool type
* add TVM EP support for tests
* include TVM EP in python test pool
* fix pylint
* moved technical imports to a separate file
* clean up post build actions & move _ld_preload.py extension to CMake level
* add files to include TVM EP in CI
* implement custom logger for TVM
* replace TVM logging with ONNX RT logging
* update link for TVM EP tutorial
* clean up TVM EP cmake
* add pybind auto enabling for TVM EP
* fix blank spaces
* code review fixes
* replace print with comment
* add list of EP without TVM EP
* enable onnx tests
* disable contrib ops and ml ops
* reuse Dockerfile.ubuntu
* Move install_tvm_test_dependencies.sh out of Docker context dir, update build definition.
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
Disable warning about padding for abseil-cpp flat_hash_map.
Disable some warnings from compiling the test proto. This also required removing a line in CMakeLists.txt where we move a level 4 warning to level 3. That ends up later on the command line and overrides the `/wd4800`. Couldn't find a way to handle that nicely. As we compile with `/W4`, the value of moving 4800 to level 3 in dev mode is unclear, so the simplest fix was to remove that line. Open to suggestions if there's a better way.
* Fix incorrect type constraint registration for RoiAlign. This led to the input type not actually being checked when matching a kernel as the invalid constraint name is treated as a missing optional input.
* fix missing dependency for the unit test exe. Whilst it doesn't link against the CUDA providers lib, without the dependency VS doesn't know it needs to rebuild the library if there are changes.
* Add check for invalid type constraints.
* Fix invalid registrations for other kernels.
* Add hash replacement logic to provide backwards compatibility in ORT format models when the registration is fixed.
* Add tests
* Add layout transformer for NNAPI
* plus merge fixes
* plus some more merge fixes
* test fixes
* comments + cleanup
* plus updates
* post merge changes
* enable layout transformer in extended minimal build
* plus more comments
* more tests + fix CI
* plus updates per review
* more updates per review
* fix file name
* fix qdq tests
* plus more updates
* plus updates
* typo fix
* fix qdq selection in 2nd optimization pass
* fix typo
* fix a test
* update dependency structure for layout transformer
* plus updates
* more updates
* plus change
* more updates to fix linker error in minimal build
* remove unnecessary headers
Update QDQ propagation transformer to insert new QDQ nodes instead of moving the existing one. This creates a more consistent `DQ -> op -> Q` pattern for other components to recognize.
Upgrade this transformer to a basic level optimization as it yields a valid ONNX graph.
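The difference between moving a DQ node and inserting a new Q/DQ pair can be sketched on a toy linear chain of op names. This is a simplified illustration, not the transformer's actual graph API: the `PASS_THROUGH` set and the function name are assumptions, and only forward propagation of DQ is modelled.

```python
# Hypothetical set of ops that only move data and can run on quantized values.
PASS_THROUGH = {"Transpose", "Reshape", "MaxPool"}


def insert_qdq_after_passthrough(ops):
    """Keep each existing DQ in place and add a Q -> DQ pair after every
    pass-through op fed by a DQ, so the op sits in a DQ -> op -> Q pattern."""
    out = []
    for i, op in enumerate(ops):
        fed_by_dq = bool(out) and out[-1] == "DQ"
        out.append(op)
        followed_by_q = i + 1 < len(ops) and ops[i + 1] == "Q"
        if op in PASS_THROUGH and fed_by_dq and not followed_by_q:
            out.extend(["Q", "DQ"])
    return out


chain = ["DQ", "Transpose", "Reshape", "Conv"]
print(insert_qdq_after_passthrough(chain))
```

In this toy model every pass-through op ends up wrapped as `DQ -> op -> Q`, whereas simply moving the original DQ past the ops would leave them with no surrounding Q/DQ nodes to match on.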
* expand model tests name
* skip cpu/cuda for trt when running onnxruntime_test_all
* only run trt ep for c++ unit test
* Update CMAKE_CUDA_ARCHITECTURES for T4
* Use new t4 agent pool
* Update YAML for run T4 on Windows
* revert code
* Update CMAKE_CUDA_ARCHITECTURES
* fix wrong value
* Remove cpu/cuda directly in model tests
* add only CMAKE_CUDA_ARCHITECTURES=75
* remove expanding model test name to see difference
* revert code
* Add fallback execution provider for unit test
* Add fallback execution provider for unit test (cont)
* add conditional to add fallback cuda ep
* Reduction op takes much longer time for TRT 8.2, so we test smaller range of inputs
* use M60
* revert code
* revert code
* add comments
* Modify code and add comment
* modify comment
* update comment
* add comment
Adding S8S8 kernels for symmetric quantized indirect conv and depthwise conv.
Perf number with single thread:
                     Nokia G10 (baseline / new) in ms    Pixel 4 (baseline / new) in ms
mobilenet_edgetpu    220 / 213                           18.5 / 17.6
cartoongan           8537 / 8521                         967 / 928
Co-authored-by: Chen Fu <fuchen@microsoft.com>
* add qdqgroup as input for NodeUnit
* minor update
* hookup nnapi_ep
* minor update
* update compiler setting
* Add a simple UT
* Pipeline change to add build minimal extended with NNAPI for Android
* move GetAllNodeUnits to node_unit.h, add UT for NodeUnits, minor updates
* minor updates
* address CR comments
Co-authored-by: gwang0000 <62914304+gwang0000@users.noreply.github.com>
Adding code for symmetric quantized matrix multiplication. Used in quantized convolution, achieving significant perf gain.
TODO, use Symmetric Quantized GEMM in other operators!
TODO address activation buffer overread in custom allocators and tensors supplied by users.
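One reason a symmetric quantized GEMM can be cheaper is that a zero point of 0 on one operand removes most of the zero-point correction terms. A minimal NumPy sketch of that arithmetic, assuming uint8 activations with zero point 128 and int8 weights with zero point 0 (the shapes, types, and zero point are assumptions for illustration, not taken from the kernel):

```python
import numpy as np

rng = np.random.default_rng(0)
za = 128                                                  # assumed activation zero point
A = rng.integers(0, 256, size=(4, 8), dtype=np.uint8)     # activations (M x K)
B = rng.integers(-127, 128, size=(8, 3), dtype=np.int8)   # symmetric weights (K x N)

A32 = A.astype(np.int32)
B32 = B.astype(np.int32)

# Reference: subtract the activation zero point before multiplying.
reference = (A32 - za) @ B32

# Symmetric form: plain integer GEMM plus one correction per output column.
# The column sums depend only on the weights, so they can be precomputed.
col_sums = B32.sum(axis=0)
fast = A32 @ B32 - za * col_sums

print(np.array_equal(reference, fast))
```

Because `col_sums` depends only on the weights, the correction could be computed once ahead of time, leaving a plain integer matrix multiply in the inner loop.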
DOT kernel perf test:
Pixel 5a:
Cartoongan 513.539 ms 471.786 ms
Efficient 57.5169 ms 56.4174 ms
Edgetpu 14.6673 ms 13.5959 ms
NEON kernel perf test
Pixel 3a
Cartoongan 1423.53 ms 1069.92 ms
Efficient 114.086 ms 107.968 ms
Edgetpu 39.2632 ms 36.9839 ms
Co-authored-by: Chen Fu <fuchen@microsoft.com>
Add abseil and inlined containers typedefs
Introduce TensorShapeVector for shape building.
Use gsl::span<const T> to make interfaces accept different types of vector like args.
Introduce InineShapeVectorT for shape capacity typed instantiations
Refactor cuda slice along with provider shared interfaces
Refactor Concat, Conv, Pad
Build with Conv Einsum and ConvTranspose refactored.
Remove TensorShape::GetDimsAsVector()
Refactor SliceIterator and SliceIteratorBase
Refactor broadcast
Refactor Pads for twice as long
Remove memory planner intermediate shapes vector
Refactor orttraining
Fix passing TensorShapeVector to tests
Remove abseil copy and submodule, use FetchContent_Declare/Fetch
Patch with separate command
Make RocmAsyncBuffer accept anything convertible to span. Adjust Linux GPU pipeline.
* clearing map for eager mode backends
* clearing map for eager mode backends manager
* making OrtBackendsManager an extern variable and trying to delete it
* cleaning backends manager when the python interpreter exits
* adding ifdef for eager mode code
* disabling warning for pybind state file
* disabling warning for python module file
* running clang auto format and reducing redundancy
* remove new line
* moving declaration to a new header file
* adding the header file for eager mode for python module
* removing source files for eager mode
* add source file for python module in eager mode
* Update orttraining/orttraining/python/orttraining_python_module_eager.h
Co-authored-by: Thiago Crepaldi <thiago.crepaldi@microsoft.com>