* add Python API for getProfilingStartTime
* debug for using Python API
* add in C# api
* use uint instead of uint64_t to prevent warning
* typo for GetProfilingStartTimeNs
* remove const
* Update onnxruntime/python/session.py
Co-authored-by: Pranav Sharma <emailpranav@gmail.com>
* remove unnecessary return
* Add Python unit test
* Add C# unit test and refactor Python test
* use ulong in C# for uint64_t in C++
* remove time.monotonic_ns
* syntax: remove public for inner function
* correct the API's order
* getprofilingstarttime after run
* Correct the right order in NativeMethod.cs
* update order
* nit: remove spaces
* Update csharp/src/Microsoft.ML.OnnxRuntime/InferenceSession.cs
Co-authored-by: Guoyu Wang <62914304+gwang-msft@users.noreply.github.com>
* use the updated function
* add comment about the precision
* add more comments
* add session.py back
* fix flake8
* remove session.py
* Add comments in C, C#, Python APIs about precision
Co-authored-by: Pranav Sharma <emailpranav@gmail.com>
Co-authored-by: Guoyu Wang <62914304+gwang-msft@users.noreply.github.com>
* Add CUDA option to run copy in default stream
This change fixes #4829. Thanks @maherzog for providing the repro!
The bug is caused by memory reuse in the BFC arena, where the copy and
compute streams in CUDA have a race condition.
The BFC arena is an arena allocator on top of cudaMalloc/cudaFree that
reduces the cost of syncing CPU and GPU on alloc/free. This means that
when the CPU allocates or frees memory, the GPU might not have finished
its previous work on that memory, so the CPU and GPU can run asynchronously.
This is OK if there's only one stream, where the execution order
on CPU and GPU is consistent. For example, if we have two kernels
A and B, and the CPU runs allocA->computeA->freeA->allocB->computeB->freeB,
A and B can share the same memory, since computeA and computeB
will not race as long as they run in the same GPU compute
stream.
However, if the CPU runs allocA->copyA->freeA->allocB->computeB->freeB,
the GPU could execute copyA after computeB
if copy and compute happen in different GPU streams.
This change makes copy run in the default compute stream, while adding
an option to fall back to the previous behavior if there's a perf hit. This
is a short-term fix until the BFC arena can support multiple streams.
Users may use the following options to revert to the previous behavior:
C API:
struct OrtCUDAProviderOptions cudaProviderOpt;
cudaProviderOpt.do_copy_in_default_stream = false;
C++ API:
CUDAExecutionProviderInfo cudaEPInfo;
cudaEPInfo.do_copy_in_default_stream = false;
C# API:
pending...
Python:
import onnxruntime
onnxruntime.capi._pybind_state.set_do_copy_in_default_stream(False)
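The race described above can be illustrated with a toy model that uses no CUDA at all. This is only a sketch under stated assumptions: each stream runs its own work in FIFO order, work on different streams may interleave freely, and copyA and computeB touch the same arena memory. The names `gpu_orders`, `can_race`, `same_stream` and `two_streams` are hypothetical and not part of onnxruntime.

```python
from itertools import permutations


def gpu_orders(ops):
    """Yield every GPU execution order that keeps per-stream FIFO order.

    ops is a list of (name, stream) pairs in CPU issue order.
    """
    n = len(ops)
    for perm in permutations(range(n)):
        # perm[k] is the op executed at position k; an order is legal only
        # if ops issued to the same stream keep their issue order.
        legal = all(
            perm.index(i) < perm.index(j)
            for i in range(n)
            for j in range(i + 1, n)
            if ops[i][1] == ops[j][1]
        )
        if legal:
            yield [ops[k][0] for k in perm]


def can_race(ops):
    """True if some legal GPU order runs computeB before copyA."""
    return any(
        order.index("computeB") < order.index("copyA")
        for order in gpu_orders(ops)
    )


# Toy assumption: copyA and computeB use the same reused arena memory.
same_stream = [("copyA", "compute"), ("computeB", "compute")]
two_streams = [("copyA", "copy"), ("computeB", "compute")]

print(can_race(same_stream))
print(can_race(two_streams))
```

In this model only the two-stream case admits an order where computeB runs before copyA, which is the situation that running copy in the default stream avoids.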
* Confirmed the test fails in CI when doing copy in a separate stream
Revert the test to get CI passing for now
* Fix Windows test
* Address CR
* ONNX GPU runtime fails to fallback to CPU when GPU is not available OR busy
https://github.com/microsoft/onnxruntime/issues/5299
* comments
* Init _fallback_providers before C.InferenceSession
* As per review: fallback providers order supersedes the user's providers order, if they are included in the providers list.
* Code convention fix
* pep8
* Expose recompute configs to the frontend
* Add frontend test
* Ensure recompute graph transformer is only applied once
Co-authored-by: Sherlock Huang <bahuang@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
The bug happens when merging the following shapes:
input0: [1, 1, 'Min(1024, input1_dynamic_axes_3)', 'Min(1024, input1_dynamic_axes_3)']
input1: ['input1_dynamic_axes_1*input1_dynamic_axes_2', 12, 'input1_dynamic_axes_3', 'input1_dynamic_axes_3']
input2: []
The fix is to avoid broadcasting merge on input2
* Build Recomputation Graph
* Make topological sort run FW nodes first
* Pattern match start and end of transformer layer
* Topological sort with Priority
* Add logger to Gradient Graph Builder
* Use Logger
* Introduce Execution Order
* Symbolic shape inference: fix a case when concat requires merge multiple dims
* Fix a bug triggered in newer version of sympy
Fix a bug in output data type guessing
* remove shape inference and fix save large model problem
* remove unnecessary import
* refine code and add external format for quantize_qat
* remove initializers in tensors_to_calibrate
* small refine
Co-authored-by: t-yguo <t-yguo@microsoft.com>
* Initialize tensorrt perf script
* Add bert-squad dependencies
* Modified code to make ort inference with CUDA/Tensorrt
* Add get CUDA/TRT version
* uncomment bert-squad
* Add BERT-SQUAD inputs.json
* Add FastRCNN
* Make preprocess/validation into common functions
* Add MaskRCNN and SSD and consolidate the code
* Add dependencies for MaskRCNN
* The following modifications are made:
- create common fetch function to get inputs/outputs of model from ONNX model zoo.
- create common validation function to compare inference outputs with reference outputs from ONNX model zoo.
- move run/repeat time to argument list. (still working on other arguments, like fp16 or fp32, latency percentile).
- generate table in csv file to show the latency comparison (TRT vs CUDA) side by side.
* Add approach to analyze profiling file and also update model-related
settings
* Add models
* Add most of the models from ONNX model zoo
* Add model input name and print all the model names at the end of run
* Add system info
* Add TRT fp16 support
* Refine the code
* Handle TRT fall back and modify the way to get input data
* Refine code
* Modify code
* Add more precise approach to measure inference
* Add io-binding
* Add YoLoV4
* Refine the code
* Refine the code
* Add models
* Add yolov4 notebook for jetson device
* Update notebook
* Update notebook
* Add CVS models
* Add missing model
* Add support of float16
* Add new way to get trt version
* Add "validate" and "benchmark" mode
* Add randomly generated input
* Refine perf script
* Refine the code.
* Add README
* Refine the code
* Update README.md
* Refine code
* Update README.md
* Remove all the model-related Python code and instead use model_list.json as
the models configuration.
Refine benchmark.py
* Refine the code
Co-authored-by: Chi Lo <lochi@microsoft.com>
* Added config flags for VPU Fast Recompile
* clean-up ifdefs
* Add VPU Fast compile config option
Adds an option that enables fast compilation of models to the VPU
hardware-specific format.
* Add config option to choose specific device id for inference
Inference of all subgraphs will be scheduled only on this device
even if other devices of the same type are available.
* Add Python API to list available device IDs
* code cleanup
* Add second C/C++ API with settings string parameter
Adds an additional C/C++ API that allows passing multiple
key-value pairs for settings as a single string. Multiple
settings are delimited by '\n' while the key and value
within a setting are delimited by '|'.
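The settings string format described above can be sketched as a small parser. This is an illustration only, not the provider's actual parsing code: `parse_settings` is a hypothetical helper, and the keys `option_a` and `option_b` are made-up placeholders rather than real config options.

```python
def parse_settings(settings: str) -> dict:
    """Parse 'key|value' settings separated by newlines into a dict."""
    result = {}
    for line in settings.split("\n"):
        if not line:
            # Skip empty entries, e.g. from a trailing newline.
            continue
        key, sep, value = line.partition("|")
        if not sep:
            raise ValueError(f"setting {line!r} has no '|' delimiter")
        result[key] = value
    return result


# Hypothetical keys, for illustration only.
example = "option_a|1\noption_b|on"
print(parse_settings(example))
```

Values are kept as strings here; converting them to numbers or booleans would be up to whichever code consumes each setting.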
* Append 'Ex' to the extended C/C++ API
* Use set_providers Py API to set config options.
Uses Session.set_providers Python API to set EP runtime config
options as key/val pairs
Deprecated older module function definitions for config settings.
Updates documentation.
* avoid globals for py config options where possible
Co-authored-by: intel <you@example.com>
(1) Save gpt2 test data during test generation.
(2) Use torch fp32 model as baseline when onnx model is fp16.
(3) Refine logic to compose onnx model path
* Add SetLanguageProjection C Api and use it in four projections
* static cast enum languageprojection to uint32_t
* resolve comments
* fix typo and line added unintentionally
* revert unnecessary change
* reorder c# api
* add TensorAt and CreateAndRegisterAllocator in Csharp to keep the same order as C apis