Updates the thread pool implementation to make work distribution over the Eigen thread pool more closely resemble techniques used in OpenMP. In particular:
(1) A thread entering a parallel loop from outside the thread pool works on the iterations itself, rather than requiring a thread switch to and from a thread in the pool.
(2) To support this, work items pushed to the thread pool run a loop to claim iterations from a shared counter via atomic-fetch-and-add, as opposed to having work items themselves represent individual batches of iterations. This means that any thread working on the loop can execute any batch of iterations, including having the main thread run through all of the batches itself if the loop turns out to be short-running.
(3) As with OpenMP active scheduling, the worker loop spins waiting for work prior to blocking. This avoids OS blocking / wake-up paths in workloads with series of short-running parallel sections.
* Added GetAvailableProviders to C API
* Fix API version and Windows build error
* Changed function name
* Changed ORT_API_VERSION to 4
* Moved all_providers array to constants.h
* Move check for providers to constants.h
* Changed name of array to avoid warning
* Address review comment
* Added unit test
- Update IAllocator setup to move the OrtMemoryInfo to the base class instead of requiring derived classes to have that as a member and override a virtual method to return it.
- Cleanup CreateAllocator setup to take an argument as to whether to wrap the device allocator in an arena allocator. The choice to do that isn't a property of the underlying device allocator.
- Minor cleanups in the various EPs to adjust to the change to IAllocator and CreateAllocator, and to use the create_arena flag consistently when available.
* Add build option to disable traditional ML ops from the binary.
* Fix python tests by splitting tests for ML ops to a separate file. Exclude ML tests from onnx_test_runner and C# tests. Exclude ML op sources.
* Update Edge pkg pipelines with new MLops env variable and fix C# packaging pipeline tests to skip ML ops.
According to profiling in #4267, getting the allocator can account for a large fraction of overhead when accessing a kernel output, due to STL container operations. The allocator isn't used when (i) we're not creating a fence, and (ii) we have a memory pattern and a pre-allocated buffer, so we can avoid this overhead.
* support position_ids input
* support fp16 conversion for gpt2 past state
* output results to csv file
* Remove the unnecessary check that the output of MatMul is on CUDA
* expose ACL/ARMNN providers to python
* add -acl / -armnn to package name when use_acl / use_armnn is specified
* build python wheel for ARMNN EP
* link ACL/ARMNN EPs into onnxruntime_pybind11_state
* Fix wrong argument order in build_python_wheel for wheel_name_suffix
* Fix a bug and add code to profile memory
1. Compile Send/Recv again (currently broken because of
HOROVOD refactor).
2. Add code to print out initializer allocation size and
activation memory size.
* Address comments
* Split memory counts per location
* Fix a metric
Update PyTorch BERT SQuAD notebooks to use onnxruntime-tools and update the usage of intra_op_num_threads.
Rename Python files according to the coding style.
Fix change_input_to_int32.
Update Keras notebook to copy the script from the rel-1.3.0 branch (will update later).
* Add ONNX postpasses
* add flag + add bert test from onnx file
* address PR comments
* fix typo
* fix rebase
* address comments
* Fix test failures
* add new pass for expand for new pt version, add comments
* fix rebase
Co-authored-by: lahaidar <lahaidar@OrtDevTest2v100.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
* ORT on CUDA 11
1. Separate HOROVOD and MPI
2. Separate NCCL from HOROVOD in CMakeLists.txt
3. Remove dependency on external cub
4. cudnnSetRNNDescriptor is changed in cuDNN 8.0
* polish the code about MPI/NCCL in CMakeLists.txt and build.py
* check CUDA version
* ${MPI_INCLUDE_DIRS} should be PUBLIC
* sm_30 and sm_50 are deprecated in the CUDA 11 Toolkit
* update change based on code review feedback.
* add sm_52
* improve MPI/NCCL build path
Co-authored-by: Weixing Zhang <wezhan@microsoft.com>
* fix links to Engineering Design and API in CONTRIBUTING.md
* fix additional links in CONTRIBUTING.md
* correct the link to the public API in CONTRIBUTING.md
Co-authored-by: Emad El-Haraty <emad.elharaty@limebike.com>
Search/replace of the pattern "const auto foo = tensor.Shape()" to "const auto& foo = tensor.Shape()" to avoid unneeded copies at runtime and reduce code size (8KB drop for onnxruntime.dll). Remove some unnecessary header includes.
* Enable static memory planning for pipeline.
1. We fix a bug when resolving symbolic shape for scalars.
2. We pass the original inputs to all pipeline stages so that
the symbolic shapes can be resolved.
* Further Improvements
1. Address comments.
2. Further reduce activation size by ~50% when pipeline is on.
This is done by removing all but one gradient tensor from the last
RecordEvent in the backward pass.
* Address a comment
* Fix Windows build
Add more variants of MlasGemm that do a u8x8 GEMM with the output type as float. This fuses the common sequence of MatMulInteger + Cast + Mul(OutputScale) + optional Add(BiasVector).