* add profile caching to improve engine caching feature
* Add comments
* fix typo
* add decryption for engine caching
* Update tensorrt_execution_provider.cc
* Update tensorrt_execution_provider.cc
* Update tensorrt_execution_provider.cc
* Update tensorrt_execution_provider.cc
* Update tensorrt_execution_provider.cc
* update onnx-tensorrt submodule
* set opt profile to max value of the range
* add hash to engine/profile name
* Add calibration based INT8 quantization
* add an option to enable both FP16 and INT8
* Update tensorrt_execution_provider.cc
* add env variable to specify calibration file name
* clean up code
* Add comments and update TRT document
* enable tensorrt basic test and add EngineCachingTest
* clean up
* update environment variable in the test
* clean up
* Introduce PassThrough op to wait for all gradients to be ready before weight update
* Compute gradient norm for fp32 runs
* Update FE UT expected value
* Respect enable_grad_norm_clip
* Large model export and run ORT Python support
* Megatron change
refine a bit
workaround self attention issue
use partitioned name for weights when megatron model parallel is enabled
Fix Megatron Transformer Issue (caused by the renaming)
Add UTs for T5 model parallel
Fix megatron seed issue
fix log a bit
checkpointing changes + rebase
Unintended reshape transform change
t5 layer norm changes
add t5 layer norm kernel
use template for t5 layer norm
template definition changes
no build error
add CPU cuda kernel
first unit test
other forward unit tests
add T5LayerNormGrad
Add c++ transform and test for T5 LN
minor fix
BART MLP Megatron transform
Add concat slice transform + test
Cosmetic improvements in concat slice transform
Constant folding bug fix + megatron attention transform for BART
Undo unnecessary changes
* Cleanup
* Remove unnecessary changes
* Cleanup megatron
* Windows build
* Add self attention test graph
* Correcting transforms + cleanup
* review comments
* review comments
* fix build and test failures
* Fix CI
* fix windows CI
Co-authored-by: Peng Wang <pengwa@microsoft.com>
Co-authored-by: Aishwarya <aibhanda@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
* Enabling Multi Device support for UEP
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Minor fix added
* Added a simple fix to determine the OpenVINO version for Arm builds as well
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Move GetCapability independent of ModelBuilder
* minor code style fix
* Move ort_enforce for same number of op_builders and op_support_checkers
* minor code fix
* cpu send/recv
* clean up send/recv
* remove unused code
* assert and nccl option for mnist
* add a build option to enable CPU-only builds. Without it, NCCL is always enabled, which breaks the build on machines that only have a CPU
* Add USE_MPI distinct from USE_NCCL/USE_HOROVOD
* fix
* fix
* exclude cpu send/recv for machines without mpi
Co-authored-by: Tim Harris <tiharr@microsoft.com>
* Create an Azure Pipeline to merge cpp and python e2e pipelines into one. Still keep cpp e2e pipeline until this new pipeline is stable.
Co-authored-by: liqun <liqun@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
* Implement a Scale function for quantization
Quantized GEMM is always followed by scaling (PerTensor or PerColumn), and the result often needs to be accumulated into an existing matrix. This PR implements a post-processor that scales the quantized GEMM result and accumulates it into another matrix.
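A minimal Python sketch of the post-processing step described above, for illustration only. The function name `scale_and_accumulate`, its parameters, and the list-of-lists matrix layout are assumptions made for this example; they are not the API added by this PR.

```python
from typing import List, Sequence


def scale_and_accumulate(
    acc: Sequence[Sequence[int]],
    scale: Sequence[float],
    out: List[List[float]],
    accumulate: bool = True,
) -> List[List[float]]:
    """Scale an integer GEMM result and write (or add) it into `out`.

    acc:   M x N integer accumulator produced by a quantized GEMM.
    scale: one value (per-tensor) or N values (per-column).
    out:   M x N float matrix, updated in place.
    """
    n_cols = len(acc[0]) if acc else 0
    if len(scale) not in (1, n_cols):
        raise ValueError("scale must have 1 entry or one entry per column")

    per_tensor = len(scale) == 1
    for i, row in enumerate(acc):
        for j, value in enumerate(row):
            # Pick the single tensor-wide scale or the scale of column j.
            s = scale[0] if per_tensor else scale[j]
            scaled = value * s
            # Either add to the existing matrix or overwrite it.
            out[i][j] = out[i][j] + scaled if accumulate else scaled
    return out
```

Passing a one-element `scale` gives per-tensor scaling, and passing one value per column gives per-column scaling. Setting `accumulate=False` overwrites `out` instead of adding to it.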
Conditionally enable NCCL depending on CUDA and ROCM
Before this change NCCL support was enabled unconditionally, even
when building without CUDA or ROCM support.
This caused the command:
$ ./build.sh --enable_training
to trigger the following cmake warning:
-- Could NOT find NCCL (missing: NCCL_INCLUDE_DIR NCCL_LIBRARY)
CMake Warning at CMakeLists.txt:1282 (message):
NCCL is not found. Please use --nccl_home to specify the path of NCCL.
Otherwise, NCCL is disabled.
This is a spurious warning because the user did not ask to search for NCCL.
This is a small perf / clean-up change. It removes the Env::Task abstraction, which wraps a single std::function field and adds the overhead of at least one virtual method call both when creating a Task and when executing it. The POSIX and Windows implementations are now identical.
This PR updates the ThreadPool API to support multi-loop parallel sections. As with the OpenMP "parallel" construct, this allows per-loop work to be amortized over a series of loops. For ORT, it also promotes locality between successive loops in the sense that iteration X of one loop will tend to run on the same worker thread as iteration X of preceding loops.
The change was developed while optimizing the implementation of a model that performed better with OpenMP. Profiling indicated that OpenMP was providing lower loop entry/exit costs and that, via OpenMP's static scheduling, it was leading to a lower L2 miss rate in the series of parallel loops used in GRU.
The main changes are:
- Addition of ThreadPool::ParallelSection and underlying support in the modified Eigen thread pool.
- In EigenNonBlockingThreadPool.h, refactoring the RunInParallel method to support two variants: one that takes an existing parallel section object created by the caller, and another (used by default) that creates its own parallel section.
- Simplify ThreadPool::LoopCounter (used by worker threads to claim loop iterations), basing it on an ID supplied by the underlying Eigen thread pool for affinity in a series of loops.
- Fix a possible perf issue where a loop with iterations scheduled in batches would have more threads than batches available.
- Use of parallel sections in the GRU operator.
- Additional test cases in threadpool_test.h.
- Additional comments at the top of threadpool.h and EigenNonBlockingThreadPool.h.
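The locality and batching points above can be illustrated with a small Python model. This is a sketch under stated assumptions, not the ThreadPool code: `static_schedule` is a hypothetical helper that assigns batches of iterations to workers in a fixed order, whereas the real pool lets workers claim iterations at run time.

```python
from typing import Dict, List


def static_schedule(
    num_iterations: int, num_workers: int, batch_size: int = 1
) -> Dict[int, List[int]]:
    """Map loop iterations to workers in a fixed, repeatable order.

    Iterations are grouped into batches of `batch_size`, and batches are
    handed out round-robin. The number of active workers is capped at the
    number of batches, so no worker is started without work to do.
    """
    if num_iterations < 0 or num_workers < 1 or batch_size < 1:
        raise ValueError("invalid loop or pool configuration")

    num_batches = -(-num_iterations // batch_size)  # ceiling division
    active_workers = min(num_workers, num_batches)

    schedule: Dict[int, List[int]] = {w: [] for w in range(active_workers)}
    for batch in range(num_batches):
        worker = batch % active_workers
        start = batch * batch_size
        end = min(start + batch_size, num_iterations)
        schedule[worker].extend(range(start, end))
    return schedule
```

Because the mapping depends only on the loop shape, running a series of loops with the same shape sends iteration X to the same worker each time, which is the locality effect described for the GRU loops. With 8 iterations, 4 workers, and a batch size of 4, only 2 workers are used.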
Part of the code for the reduction kernels was changed in 858040fa,
which causes failures in the ROCm build because the ROCm EP shares some code
with the CUDA EP. This PR is a quick fix that stops sharing two files
for now, to unblock enabling CI on the ROCm EP. Another PR that leverages
858040fa for the ROCm EP will follow later.
Add tag types for Ort::Float16_t and Ort::Bfloat16_t structs
that contain uint16_t values for float16 and bfloat16.
These will serve as type dispatching types for C++ API.
They are of uint16_t size and arrays of these types can be used
to create Tensors of the corresponding types.
Make documentation Doxygen compliant.
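To illustrate what "uint16_t values for float16 and bfloat16" means, here is a minimal Python sketch that produces the 16-bit patterns such arrays would hold. The helper names are hypothetical, and the bfloat16 conversion truncates rather than rounds, which is an assumption of this example and not a statement about the C++ API.

```python
import struct


def float_to_float16_bits(value: float) -> int:
    """Return the IEEE 754 half-precision bit pattern as a 16-bit integer."""
    return struct.unpack("<H", struct.pack("<e", value))[0]


def float16_bits_to_float(bits: int) -> float:
    """Decode a 16-bit half-precision pattern back to a Python float."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]


def float_to_bfloat16_bits(value: float) -> int:
    """Return the bfloat16 pattern: the upper 16 bits of the float32 pattern.

    This truncates the low mantissa bits instead of rounding.
    """
    return struct.unpack("<I", struct.pack("<f", value))[0] >> 16


def bfloat16_bits_to_float(bits: int) -> float:
    """Decode a bfloat16 pattern by placing it in the top half of a float32."""
    return struct.unpack("<f", struct.pack("<I", (bits & 0xFFFF) << 16))[0]
```

For example, 1.0 is 0x3C00 in float16 and 0x3F80 in bfloat16. Both fit in a 16-bit unsigned integer, which is why an array of such values can back a tensor of either type.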
* Add kernels for AMD GPU.
This PR is mostly about GPU kernels for the ROCm EP. Because CUDA and HIP are similar GPU programming languages with similar math library calls, one principle of the ROCm EP design is to share CUDA kernels with ROCm as much as possible. The script amd_hipify.py was therefore created to convert CUDA kernels to ROCm HIP kernels automatically during the compilation phase. However, for reasons such as perf issues and syntax differences, some converted kernels need manual intervention. These kernels are checked into the repo for now. To avoid manual intervention, the plan is to refactor the CUDA kernels so that they are portable between the CUDA EP and the ROCm EP as much as possible.
Please refer to "HIP Porting Guide" for details.
* As with Lamb, multi-tensor-apply needs to be disabled for IsAllFiniteOp and ReduceAllL2: the current AMD GPU compiler has a perf issue with kernel parameters that are structures passed by value.
* Use hipMemsetAsync and add checks on HIP calls.
* move the generated files to build folder.
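As a rough illustration of the CUDA-to-HIP conversion idea described above, the sketch below renames a handful of CUDA runtime identifiers to their HIP counterparts. It is a toy under stated assumptions: the mapping table and the `hipify_source` function are hypothetical and do not reproduce the rules in amd_hipify.py.

```python
import re

# Hypothetical, deliberately tiny subset of CUDA -> HIP renames.
CUDA_TO_HIP = {
    "cudaMalloc": "hipMalloc",
    "cudaFree": "hipFree",
    "cudaMemcpyAsync": "hipMemcpyAsync",
    "cudaMemsetAsync": "hipMemsetAsync",
    "cudaStream_t": "hipStream_t",
    "cudaError_t": "hipError_t",
    "cudaSuccess": "hipSuccess",
}

# Match only whole identifiers so that longer names are left untouched.
_PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, CUDA_TO_HIP)) + r")\b")


def hipify_source(cuda_source: str) -> str:
    """Return `cuda_source` with the known CUDA identifiers renamed to HIP."""
    return _PATTERN.sub(lambda match: CUDA_TO_HIP[match.group(1)], cuda_source)
```

Identifiers that are not in the table pass through unchanged, which mirrors why some converted kernels still need manual intervention.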
Co-authored-by: Jesse Benson <jesseb@microsoft.com>
* Add YAML file for pipeline
* Modify typo
* Add working directory
* Modify and test
* Modify and test
* Modify and test
* Modify and test
* Modify
* Modify
* Modify
* Modify
* Make sure to copy all the result files
* Add clean up
* Modify
* Modify agent pool name
* Upload only specific artifacts
* Modify
* Integrate CI pipeline for running TRT perf, and add the “large amount of models” set to the perf model targets
* Fix bug
* Fix bug
* Add reading the information regarding previously known failing models
and then skip testing them during benchmark/validation
* Modify the script file for CI
* Replace print with logger.info
* Fix bug
* Fix bug
* Refine the code
* Modify the script so that it can capture segmentation faults that occur while
running ORT
* Fix bug
* fix bug
* fix bug
* Add debug info
* fix bug
* Refine perf code
* Refine the code
* fix bug
* Code refactoring
* change many-models path
* remove metadata after validation/benchmark are done
* Update README.md
* Fix bug so that metadata doesn't hold stale value
* Remove hardcode and update README
* Add arguments to the script to make it run correctly
* Update linux-gpu-tensorrt-ci-perf-pipeline.yml for Azure Pipelines
* Update linux-gpu-tensorrt-ci-perf-pipeline.yml for Azure Pipelines
* Fix bug so that metadata doesn't hold stale value
* Fix small bug in finding the test dataset directory for FP16 test data, and
modify some output information
* use -i random for perf test of TRT changes
Co-authored-by: Olivia Jain <oljain@microsoft.com>
* create new nuget packaging pipeline without openmp
* rename package
* update image name
* rename package name
* rename managed package
* reset project attribute
* merge master
* set package name
* set NoOpenMP as cpu build
* shorten line length
Co-authored-by: Randy Shuai <rashuai@microsoft.com>