Provide a tool to convert Loop to Scan for Nuphar performance
Fix Nuphar CI pipeline failures.
Co-authored-by: liqun <liqun@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
* merge master, keep postprocess status commit
* download float16.py every time
* using variables to reference eps
* adding ACL EP to ep perf tool
* accuracy with absolute tolerance configurable
* add acl to dict + remove commented line
* Initial implementation of generating calibration dynamic range table
* Initialize validation support for Quantization
* Initialize validation support for Quantization (cont.)
* Improve validation support for Quantization
* Improve validation support for Quantization
* Rewrite/Refine for calibration and validation
* Rewrite/Refine for calibration and validation (cont.)
* Refine code
* Refine code
* Add data reader for BERT
* Add flatbuffers to serialize calibration table
* Refine code and add BERT evaluation
* Refine the code
* minor modification
* Add preprocess/postprocess of vision team yolov3 and refine the code
* Update annotation
* Make bbox coordinates more accurate
* Fix bug
* Add support of batch processing
* Batch processing for model zoo yolov3
* Add batch inference for evaluation
* Refine the code
* Add README
* Add comments
* Refine the code for PR
* Remove batch support checking in data_reader and refine the code
* Refine the code for PR
* Refine the code for PR review
Co-authored-by: Olivia Jain <oljain@microsoft.com>
* Support to pass initial optimizer states to optimizer graph builder
* Changes for passing init optim state to training session config
* Pass optimizer state through cpp and python frontend
* Cleanup
* Review comments
* Fix windows and mac CI
* Review comments
* review comments
* Review comments
* Frontend review changes
* Fix CI
* support gpt2 and longformer in profiler tool
* rename bert_profiler to profiler
* Add --basic_optimization to allow user to use basic level of graph optimization
* Add --kernel_time_only to filter kernel time and exclude fence time
* Add --threshold to filter out nodes with a low run time percentage.
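A minimal sketch of what a run-time percentage threshold filter could look like. The function name `filter_by_threshold` and the `(name, duration)` tuples are hypothetical and are not taken from the profiler tool itself.

```python
def filter_by_threshold(node_times, threshold):
    """Keep nodes whose share of total run time is at least `threshold` (0.0 to 1.0)."""
    total = sum(duration for _, duration in node_times)
    if total <= 0:
        return []
    return [
        (name, duration, duration / total)
        for name, duration in node_times
        if duration / total >= threshold
    ]


# Hypothetical per-node durations in microseconds.
node_times = [("MatMul_1", 700.0), ("Add_2", 250.0), ("Cast_3", 50.0)]
kept = filter_by_threshold(node_times, threshold=0.1)
```

With these made-up numbers, `Cast_3` accounts for 5% of the total and falls below the 10% threshold, so it is dropped from the report.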
* initial implementation of longformer tools for onnx conversion and benchmark
* Support ONNX conversion for transformers 4.0
Add an option to optimize onnx model, and export fp16 model
* optimize a bert model converted using tf2onnx
* add test data
* update
* remove comments
* format
* Revert "format"
This reverts commit f8ae88cb564bce5caf4780e56561403f3ba3d524.
* Revert "remove comments"
This reverts commit 59d8a693581a731fd0291b70fe2c9cec6c4950fe.
* add a squeeze node to convert a 3-d mask to 2-d
* update
* update
* verify and add comments
* build off a specific commit and archive wheel file
* rename to fp32, prefix results w/ commit, add CPU col
* rename 99th to 90th percentile
* get symbolic_shape from master each time
* add install archive wheel, parallel build
* shortening hash
Update ORT model conversion script
- add args for specifying optimization level and whether to use NNAPI
- add logic to create a list of required ops and ORT format model that can be used with NNAPI
* Remove nGraph Execution Provider
Pursuant to nGraph deprecation notice: https://github.com/microsoft/onnxruntime/blob/master/docs/execution_providers/nGraph-ExecutionProvider.md#deprecation-notice
**Deprecation Notice**
| | |
| --- | --- |
| Deprecation Begins | June 1, 2020 |
| Removal Date | December 1, 2020 |
Starting with the OpenVINO™ toolkit 2020.2 release, all of the features
previously available through nGraph have been merged into the OpenVINO™
toolkit. As a result, all the features previously available through
ONNX RT Execution Provider for nGraph have been merged with ONNX RT
Execution Provider for OpenVINO™ toolkit.
Therefore, ONNX RT Execution Provider for **nGraph** will be deprecated
starting June 1, 2020 and will be completely removed on December 1,
2020. Users are recommended to migrate to the ONNX RT Execution Provider
for OpenVINO™ toolkit as the unified solution for all AI inferencing on
Intel® hardware.
* Remove nGraph Licence info from ThirdPartyNotices.txt
* Use simple Test.Run() for tests without EP exclusions
To be consistent with rest of test code.
* Remove nGraph EP functions from Java code
Quantize LSTM:
1. Dynamically quantizes the MatMul inside the LSTM. It doesn't quantize the activation functions.
2. Supports per-channel quantization of the input weight and recurrent weight.
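To illustrate what per-channel quantization of a weight matrix means, here is a minimal sketch in plain Python. It assumes symmetric int8 quantization with one scale per row; the actual scheme, data type, and axis used by the quantization tool may differ, and `quantize_per_channel` is a hypothetical name.

```python
def quantize_per_channel(weights):
    """Symmetric int8 quantization with one scale per row (channel)."""
    quantized, scales = [], []
    for row in weights:
        max_abs = max(abs(v) for v in row)
        # Assumption: map the largest magnitude in the row to 127.
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        quantized.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return quantized, scales


# Two channels with very different value ranges.
weights = [[0.5, -1.0, 0.25], [10.0, 5.0, -2.5]]
q, scales = quantize_per_channel(weights)
```

Because each row gets its own scale, the small-valued first channel keeps its resolution instead of being squeezed by the larger values in the second channel, which is the usual motivation for per-channel over per-tensor quantization.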
* Enabling Multi Device support for UEP
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Minor fix added
*Added a simple fix to determine OpenVINO
version for Arm build as well
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Add YAML file for pipeline
* Modify typo
* Add working directory
* Modify and test
* Modify and test
* Modify and test
* Modify and test
* Modify
* Modify
* Modify
* Modify
* Make sure to copy all the result files
* Add cleanup
* Modify
* Modify agent pool name
* Upload only specific artifacts
* Modify
* Integrated CI Pipeline for running TRT perf as well as added the “large amount of models” into perf model target
* Fix bug
* Fix bug
* Add reading the information regarding previously known failing models
and then skip testing them during benchmark/validation
* Modify the script file for CI
* Replace print with logger.info
* Fix bug
* Fix bug
* Refine the code
* Modify the script so that it can capture segmentation faults while
running ORT
* Fix bug
* fix bug
* fix bug
* Add debug info
* fix bug
* Refine perf code
* Refine the code
* fix bug
* Code refactoring
* change many-models path
* remove metadata after validation/benchmark are done
* Update README.md
* Fix bug so that metadata doesn't hold stale value
* Remove hardcode and update README
* Add arguments to the script to make it run correctly
* Update linux-gpu-tensorrt-ci-perf-pipeline.yml for Azure Pipelines
* Update linux-gpu-tensorrt-ci-perf-pipeline.yml for Azure Pipelines
* Fix bug so that metadata doesn't hold stale value
* Fix small bug of finding test dataset directory for FP16 test data, as
well as modification of some output information
* use -i random for perf test of TRT changes
Co-authored-by: Olivia Jain <oljain@microsoft.com>
* Implement Hetero in UEP
* Added security checks to take valid Hetero combinations
as device type
* Integrating Hetero features
* Get the statistics Report in Debug Mode
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Passing right device type for vadm_backend
Added simple fix to pick the right device type
when using vadm_backend with Hetero as well.
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Fixed batching logic for 2020.4 and above
* Fixed flake8 PEP8 errors
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Minor Fixes Added
*Added security checks for device_type passed
in for Hetero build during run time
*code cleanup
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Minor changes Added
*Fixed batch_size bug in vadm_backend
*code cleanup
*Documentation updated for Hetero
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
Co-authored-by: suryasidd <surya.siddharth.pemmaraju@intel.com>
* Some fixes to symbolic shape inference
1. Topological sort before iteration in graph
2. Fix a case in slice: start=100000, end=-100000, step=-1, dim=2
3. Fix Nuphar Gemm test's random seed
4. Slice opset 1 axes is optional
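The slice case in item 2 can be checked by hand. The sketch below assumes numpy-style clamping of out-of-range bounds, as described in the ONNX Slice operator documentation; it is not the code from the symbolic shape inference script, and `sliced_dim` is a hypothetical helper.

```python
def sliced_dim(dim, start, end, step):
    """Length of one axis after slicing, with out-of-range bounds clamped."""
    if step == 0:
        raise ValueError("step must be non-zero")
    # slice.indices clamps start/end into the valid range for the sign of step.
    return len(range(*slice(start, end, step).indices(dim)))


# The case from the commit message: a reversed slice over an axis of size 2.
result = sliced_dim(dim=2, start=100000, end=-100000, step=-1)
```

Here `start` clamps to the last index (1) and `end` clamps to just before index 0, so the slice visits indices 1 and 0 and the output dimension stays 2.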
The ROCm EP is designed and implemented on top of ROCm, the AMD GPU software stack. Details about ROCm: https://rocmdocs.amd.com/en/latest/
The ROCm EP builds on the following components:
1. AMD GPU programming language: HIP
2. AMD GPU HIP language runtime: amdhip64
3. BLAS: rocBLAS, hipBLAS
4. DNN: MIOpen
5. Collective Communication library: RCCL
6. cub: hipCUB
7. …
Current status:
BERT-L and GPT2 training can be run on AMD GPUs with data parallelism.
Next:
1. Share more GPU code between the ROCm EP and the CUDA EP, since the HIP language and HIP runtime API are very close to CUDA.
2. Continue improving the implementation.
3. Continue GPU kernel optimization.
4. Support model parallelism on ROCm EP.
……
The ROCm kernels have been removed from this commit and will be in a separate PR. Because the original PR was too big (~180 files), it was suggested to split it into two parts: the ROCm kernels and everything else.
Co-authored-by: Weixing Zhang <wezhan@microsoft.com>
Co-authored-by: sabreshao <sabre.shao@amd.com>
Co-authored-by: anghostcici <11013544+anghostcici@users.noreply.github.com>
Co-authored-by: Suffian Khan <sukha@microsoft.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
* Enabled multi-threading for OpenVINO EP
->Enabled support for concurrent_session_runs
*Run UEP using concurrent_session_runs > 1
*Enabled support for ORT_PARALLEL ExecutionMode
->Documentation Added for Enabling MultiThreading
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Minor Fixes added
*Configure the value of nireq during Runtime
*Documentation typos rectified and details
added for Multi_Threaded Inference
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Some checks added for this fix
*Added checks to reject an invalid nireq value
and fall back to the default value of 8
*Added new config options for enable_vpu_fast_compile
which were changed w.r.t OpenVINO_2021.1 Release
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
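The nireq check described in the commit above can be sketched as follows. This is an illustrative Python version only: the function name `resolve_nireq` is hypothetical, and treating only positive integers as valid is an assumption, since the commit message does not say which values count as wrong.

```python
DEFAULT_NIREQ = 8  # default value stated in the commit message


def resolve_nireq(raw_value):
    """Return a usable request count, falling back to the default on bad input."""
    try:
        value = int(raw_value)
    except (TypeError, ValueError):
        return DEFAULT_NIREQ
    # Assumption: only positive counts are treated as valid.
    return value if value > 0 else DEFAULT_NIREQ
```

For example, `resolve_nireq("4")` keeps the requested value, while `resolve_nireq("abc")` or `resolve_nireq(0)` falls back to 8.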