* build off a specific commit and archive wheel file
* rename to fp32, prefix results w/ commit, add CPU col
* rename 99th to 90 percentile
* get symbolic_shape from master each time
* add install archive wheel, parallel build
* shortening hash
Update ORT model conversion script
- add args for specifying optimization level and whether to use NNAPI
- add logic to create a list of required ops and ORT format model that can be used with NNAPI
* Remove nGraph Execution Provider
Pursuant to nGraph deprecation notice: https://github.com/microsoft/onnxruntime/blob/master/docs/execution_providers/nGraph-ExecutionProvider.md#deprecation-notice
**Deprecation Notice**
| | |
| --- | --- |
| Deprecation Begins | June 1, 2020 |
| Removal Date | December 1, 2020 |
Starting with the OpenVINO™ toolkit 2020.2 release, all of the features
previously available through nGraph have been merged into the OpenVINO™
toolkit. As a result, all the features previously available through
ONNX RT Execution Provider for nGraph have been merged with ONNX RT
Execution Provider for OpenVINO™ toolkit.
Therefore, ONNX RT Execution Provider for **nGraph** will be deprecated
starting June 1, 2020 and will be completely removed on December 1,
2020. Users are recommended to migrate to the ONNX RT Execution Provider
for OpenVINO™ toolkit as the unified solution for all AI inferencing on
Intel® hardware.
* Remove nGraph Licence info from ThirdPartyNotices.txt
* Use simple Test.Run() for tests without EP exclusions
To be consistent with rest of test code.
* Remove nGraph EP functions from Java code
Quantize LSTM:
1. dynamically quantizes MatMul inside the LSTM. It doesn't quantize activation function.
2. support per-channel on the input weight and recurrent weight.
* Enabling Multi Device support for UEP
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Minor fix added
*Added a simple fix to determine OpenVINO
version for Arm build as well
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Add YAML file for pipeline
* Modify typo
* Add working directory
* Modify and test
* Modfiy and test
* Modify and test
* Modify and test
* Modify
* Modify
* Modify
* Modify
* Make sure to copy all the result files
* Add clearn up
* Modify
* Modify agent pool name
* Upload only specific artifacts
* Modify
* Integrated CI Pipeline for running TRT perf as well as added the “large amount of models” into perf model target
* Fix bug
* Fix bug
* Add reading the information regarding previously known failing models
and then skip testing them during benchmark/validation
* Modify the script file for CI
* Replace print with logger.info
* Fix bug
* Fix bug
* Refine the code
* Modify the script so that it can capture script segmentation fault while
running ORT
* Fix bug
* fix bug
* fix bug
* Add debug info
* fix bug
* Refine perf code
* Refine the code
* fix bug
* Code refactoring
* change many-models path
* remove metadata after validation/benchmark are done
* Update README.md
* Fix bug so that metadata doesn't hold stale value
* Remove hardcode and update README
* Add arguments to the script to make it run correctly
* Update linux-gpu-tensorrt-ci-perf-pipeline.yml for Azure Pipelines
* Update linux-gpu-tensorrt-ci-perf-pipeline.yml for Azure Pipelines
* Fix bug so that metadata doesn't hold stale value
* Fix small bug of finding test dataset directory for FP16 test data, as
well as modification of some output information
* use -i random for perf test of TRT changes
Co-authored-by: Olivia Jain <oljain@microsoft.com>
* Implement Hetero in UEP
* Added security checks to take valid Hetero combinations
as device type
* Integrating Hetero features
* Get the statistics Report in Debug Mode
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Passing right device type for vadm_baackend
Added simple fix to pick the right device type
when using vadm_backend with Hetero as well.
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Fixed batching logic for 2020.4 and above
* Fixed flake8 PEP8 errors
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Minor Fixes Added
*Added security checks for device_type passed
in for Hetero build during run time
*code cleanup
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Minor changes Added
*Fixed batch_size bug in vadm_backend
*code cleanup
*Documentation updated for Hetero
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
Co-authored-by: suryasidd <surya.siddharth.pemmaraju@intel.com>
* Some fixes to symbolic shape inference
1. Topological sort before iteration in graph
2. Fix a case in slice: start=100000, end=-100000, step=-1, dim=2
3. Fix Nuphar Gemm test's random seed
4. Slice opset 1 axes is optional
The ROCm EP is designed and implemented based on AMD GPU software stack named ROCm. Here is the link for the details about ROCm: https://rocmdocs.amd.com/en/latest/
ROCm EP was created based on the following things:
1. AMD GPU programming language: HIP
2. AMD GPU HIP language runtime: amdhip64
3. BLAS: rocBLAS, hipBLAS
4. DNN: miOpen
5. Collective Communication library: RCCL
6. cub: hipCub
7. …
Current status:
BERT-L and GPT2 training can be ran on AMD GPU with data parallel.
Next:
1. Make more GPU code be sharable between ROCm EP and CUDA EP since HIP language and HIP runtime API are very close to CUDA.
2. Continue improving the implementation.
3. Continue GPU kernel optimization.
4. Support model parallelism on ROCm EP.
……
The rocm kernels have been removed from this commit and will be in a separate PR. Since the original PR was too big(~180 files), it was suggested to split the PR into two parts, one is rocm-kernels, the other is non rocm kernels.
Co-authored-by: Weixing Zhang <wezhan@microsoft.com>
Co-authored-by: sabreshao <sabre.shao@amd.com>
Co-authored-by: anghostcici <11013544+anghostcici@users.noreply.github.com>
Co-authored-by: Suffian Khan <sukha@microsoft.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
* Enabled multi-threading for OpenVino EP
->Enabled support for concurrent_session_runs
*Run UEP using concurrent_session_runs > 1
*Enabled support for ORT_PARALLEL ExecutionMode
->Documentation Added for Enabling MultiThreading
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Minor Fixes added
*Configure the value of nireq during Runtime
*Documentation typos rectified and details
added for Multi_Threaded Inference
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Some checks added for this fix
*Added checks to invalidate wrong nireq value
and assigned it to default value of 8
*Added new config options for enable_vpu_fast_compile
which were changed w.r.t OpenVINO_2021.1 Release
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* add Python API for getProfilingStartTime
* debug for using Python API
* add in C# api
* use uint intead of uint64_t to prevent warning
* typo for GetProfilingStartTimeNs
* remove const
* Update onnxruntime/python/session.py
Co-authored-by: Pranav Sharma <emailpranav@gmail.com>
* remove unnecessary return
* Add Python unit test
* Add C# unit test and refactor Python test
* use ulong in C# for uint64_t in C++
* remove time.monotonic_ns
* syntax: remove public for inner function
* correct the API's order
* getprofilingstarttime after run
* Correct the right order in NativeMethod.cs
* update order
* nit: remove spaces
* Update csharp/src/Microsoft.ML.OnnxRuntime/InferenceSession.cs
Co-authored-by: Guoyu Wang <62914304+gwang-msft@users.noreply.github.com>
* use the updated function
* add comment about the precision
* add more comments
* add session.py back
* fix flake8
* remove session.py
* Add comments in C, C#, Python APIs about precision
Co-authored-by: Pranav Sharma <emailpranav@gmail.com>
Co-authored-by: Guoyu Wang <62914304+gwang-msft@users.noreply.github.com>
* Add CUDA option to run copy in default stream
This change fixes#4829. Thanks @maherzog for providing the repro!
The bug is caused by memory reuse in BFC arena, where copy and
compute stream in CUDA has a racing condition.
BFC arena is an arena allocator on top of cudaMalloc/Free to
reduce the cost in syncing CPU and GPU when alloc/free. It means
when CPU alloc/free the memory, GPU might not finished previous
work on the memory, so that CPU and GPU could run asynchronously.
This is OK if there's only one stream, where the execution order
in CPU and GPU are consistent. For example, if we have two kernels
A and B, CPU runs allocA->computeA->freeA->allocB->computeB->freeB,
A and B could shares the same memory since computeA and computeB
will not have racing as long as they run in the same GPU compute
stream.
However, if CPU runs allocA->CopyA->freeA->allocB->computeB->freeB,
the order of execution in GPU could have copyA happen after computeB,
if copy and compute happens in different GPU streams.
This change makes copy to run in default compute stream, while adding
an option to fall back to previous behavior if there's perf hit. This
is a short term fix before BFC arena could support multiple streams.
User may use following options to revert to previous behavior:
C API:
struct OrtCUDAProviderOptions cudaProviderOpt;
cudaProviderOpt.do_copy_in_default_stream = false;
C++ API:
CUDAExecutionProviderInfo cudaEPInfo;
cudaEPInfo.do_copy_in_default_stream = false;
C# API:
pending...
Python:
import onnxruntime
onnxruntime.capi._pybind_state.set_do_copy_in_default_stream(False)
* Confirmed the test failes in CI when doing copy in separate stream
Revert the test to get CI pass now
* Fix Windows test
* Address CR
* ONNX GPU runtime fails to fallback to CPU when GPU is not available OR busy
https://github.com/microsoft/onnxruntime/issues/5299
* comments
* Init _fallback_providers before C.InferenceSession
* As per review: Fallback providers order supersedes user's providers order, IF they are included into providers list.
* Code convention fix
* pep8
* Expose recompute configs to the frontend
* Add frontend test
* Ensure recompute graph transformer is only applied once
Co-authored-by: Sherlock Huang <bahuang@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
The bug happens when merging following shapes:
input0: [1, 1, 'Min(1024, input1_dynamic_axes_3)', 'Min(1024, input1_dynamic_axes_3)']
input1: ['input1_dynamic_axes_1*input1_dynamic_axes_2', 12, 'input1_dynamic_axes_3', 'input1_dynamic_axes_3']
input2: []
The fix is to avoid broadcasting merge on input2