Description: This PR makes two changes identified while looking at a PGAN model.
First, it uses ThreadPool::TryParallelFor for the main parallel loops in the Expand operator. This lets the thread pool decide on the granularity at which to distribute work (unlike TrySimpleParallelFor). Profiling showed high costs when running "simple" loops with 4M iterations each of which copied only 4 bytes.
Second, it updates the sharded loop counter in the thread pool so that the number of shards is capped by the number of threads. This helps make the performance of any other high-contention "simple" loops more robust at low thread counts by letting each thread work on its own "home" shard for longer.
Motivation and Context
Profiling showed a PGAN model taking 2x+ longer with the non-OpenMP build. The root cause was that the OpenMP build uses simple static scheduling of loop iterations, while the non-OpenMP build uses dynamic scheduling. The combination of large numbers of tiny iterations is less significant with static scheduling --- although still desirable to avoid, given that each iteration incurs a std::function invocation.
Support pad operator in quantization tool.
Support pad operator in quantized nhwc transformer.
Fix pad() operator bug when pad input's inner(right) most axis value is zero for Edge and Reflect mode, it copied wrong value to the cells to be padded. Note the Constant mode will not trigger this bug, as Edge/Reflect need copy value from the already copied array while Constant mode only fill specified value.
Add more test cases to cover pad() operator bug fixed here.
Fix quantization tools uint8/int8 value overflow issue when quantize weights in python.
* Add ability to generate configuration that includes required types for individual operators, to allow build size reduction based on that.
- Add python bindings for ORT format models
- Add script to update bindings and help info
- Add parsing of ORT format models
- Add ability to enable type reduction to config generation
- Update build.py to only allow operator/type reduction via config
- simpler to require config to be generated first
- can't mix a type aware (ORT format model only) and non-type aware config as that may result in insufficient types being enabled
- Add script to create reduced build config
- Update CIs
* Add macos coreml CI and coreml_flags
* Move save debuggubg model to use environment var
* Move pipeline off from macos CI template
* Fix an issue building using unix make, add parallel to build script
* Fixed build break for shared_lib and cmpile warning
* Fix a compile warning
* test
* Revert the accidental push from another branch
This reverts commit 472029ba25d50f9508474c9eeceb3454cead7877.
* Updates for Gradle 7.
* Adding support for OrtThreadingOptions into the Java API.
* Fixing a typo in the JNI code.
* Adding a test for the environment's thread pool.
* Fix cuda test, add comment to failure.
* Updating build.gradle
* update load library code to have the fullly qualified path
* make it work for syswow32
* git Revert "make it work for syswow32"
This reverts commit b9f594341b7cf07241b18d0c376af905edcabae3.
Co-authored-by: Sheil Kumar <sheilk@microsoft.com>
* Share allocator between CUDA EP & TRT EP.
limitation:
1. Does not cover the per-thread allocator created by CUDA EP, still need to figure out the way to remove it
2. Need to have more identifiers to make it able to share CPU allocator across all EPs
* expose optimized_model_filepath in SessionOptions as `debug.graph_save_paths.model_with_training_graph_after_optimization_path` in `ORTTrainerOptions`
Description: Builds and installs libusb without UDEV support, which is used for communicating with the VPU device.
Motivation and Context
This enables the resulting docker container to be run without '--privileged' and '--network host' options which may not be suitable in deployment environments.