GPT2_LM_HEAD is a new ONNX model zoo model that OpenVino doesn't support.
Error message:1: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Non-zero status code returned while running OpenVINO-EP-subgraph_1162 node. Name:'OpenVINOExecutionProvider_OpenVINO-EP-subgraph_1162_1' Status Message: _Map_base::at
* Merged PR 5195856: Fix broken cases of zero size tensors in Cast/Reduce
MaskRCNN failed when `Cast` tried to execute `Xor` with emptiness (zero in dimensions). This is perfectly legal and should be treated as a nop.
Ultimately DML itself should treat this case as a nop, just like how C's `memcpy` treats 0 count as a nop, but I'm just addressing it in ORT now, as enabling it in DML would impact more operators to be consistent (probably should incrementally add a flag to tensor validation so operators can be opted in gradually).
Corresponding WindowsAI PR: https://microsoft.visualstudio.com/WindowsAI/_git/WindowsAI/pullrequest/5195850
Related work items: #27469839, #28761382
* Merged PR 5201369: Remove copy of initializers added in DMLXP refactor
When used in ORT, a common method shouldn't copy and return initializer data
Related work items: #29514403
Co-authored-by: Justin Stoecker <justoeck@microsoft.com>
Co-authored-by: Jeff Bloomfield <jeffbloo@microsoft.com>
* [java] Fixing the buffer semantics.
* Renaming bufferCapacity to bufferRemaining.
* Adding a cast to char* so the pointer arithmetic works on Windows.
* remove shape inference and fix save large model problem
* remove unnecessary import
* refine code and add external format for quantize_qat
* remove initializers in tensors_to_calibrate
* small refine
Co-authored-by: t-yguo <t-yguo@microsoft.com>
This updates the NCHWc transformer to not interfere with quantized convolution models, based on observations from internal models. The tensor type for MaxPool must be float. The input to GlobalAveragePool/GlobalMaxPool must be in NCHWc format.
* Refactor TensorAt
locations* must be const and int64_t since our dims are int64_t
Remove unnecessary copy of locations.
Remove unnecesary casting and C-casting. Simplify implementation.
Add a check for string type.
Make CXX api return T& to fully expose C API in C++, const std::vector& by value as it
covers more ground and eliminate redundant copy.
Eliminate inner loop, compute strides first.
* add GetStartTime() for profiler
* add function in inference_session
* remove qualified name
* add the api in cxx_api.h
* rename starttime to StartTimeNs, expost profiling object
* rename GetProfilingStartTime
* move Ortapis to the right place
* move to the end
* add const for session
* const the right place
* use const auto instead of const auto* for session
* remove const for auto getstarttime
* remove const for auto getstarttime
add unit tests
* nit: update test name and add comments
* Initialize tensorrt perf script
* Add bert-squad dependencies
* Modified code to make ort inference with CUDA/Tensorrt
* Add get CUDA/TRT version
* uncomment bert-squad
* Add BERT-SQUAD inputs.json
* Add FastRCNN
* Make preprocess/validation in to common functions
* Add MaskRCNN and SSD and consolidate the code
* Add dependencies for MaskRCNN
* following modifications are made:
- create common fetch function to get inputs/outputs of model from ONNX model zoo.
- create common validation function to compare inference outputs with reference outputs from ONNX model zoo.
- move run/repeat time to argument list. (still working on other arguments, like fp16 or fp32, latency percentile).
- generate table in csv file to show the latency comparison (TRT vs CUDA) side by side.
* Add approache to analyze profling file and also update model related
settings
* Add models
* Add most of models from ONNX model zoo
* Add model input name and print all the model names at the end of run
* Add system info
* Add TRT fp16 support
* Refine the code
* Handle TRT fall back and modify the way to get input data
* Refine code
* Modify code
* Add more precise approach to measure inference
* Add io-binding
* Add YoLoV4
* Refine the code
* Refine the code
* Add models
* Add yolov4 notebook for jetson device
* Update notebook
* Update notebook
* Add CVS models
* Add missing model
* Add support of float16
* Add new way to get trt version
* Add "validate" and "benchmark" mode
* Add randomly generated input
* Refine perf script
* Refine the code.
* Add README
* Refine the code
* Update README.md
* Refine code
* Update README.md
* Remove all the model related python and instead using model_list.json as
models configuration.
Refine the benchmark.py
* Refine the code
Co-authored-by: Chi Lo <lochi@microsoft.com>