* Introduce DynamicQuantizeMatMul
It fuses DynamicQuantizeLinear, MatMul, and the Cast and multiplication that follow them. It takes float input and produces float output for a quantized matmul. The implementation of this op uses an MLAS kernel.
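A minimal pure-Python sketch of what the fused op computes, assuming uint8 quantization of the A input as in the ONNX DynamicQuantizeLinear definition and a B matrix that is already quantized with a single scale and zero point. The function names, the list-of-lists layout, and the zero-scale guard are illustrative assumptions, not the MLAS kernel.

```python
def dynamic_quantize_linear(x):
    """Quantize a flat list of floats to uint8 (ONNX DynamicQuantizeLinear style)."""
    qmin, qmax = 0, 255
    lo = min(0.0, min(x))  # the quantization range must include zero
    hi = max(0.0, max(x))
    scale = (hi - lo) / (qmax - qmin)
    if scale == 0.0:
        scale = 1.0  # assumption for this sketch: guard against all-zero input
    zero_point = int(min(qmax, max(qmin, round(qmin - lo / scale))))
    q = [int(min(qmax, max(qmin, round(v / scale) + zero_point))) for v in x]
    return q, scale, zero_point


def dynamic_quantize_matmul(a, b_q, b_scale, b_zero_point):
    """Float in, float out: quantize A, integer matmul, cast, multiply by scales."""
    k = len(a[0])
    flat = [v for row in a for v in row]
    a_q_flat, a_scale, a_zp = dynamic_quantize_linear(flat)
    a_q = [a_q_flat[i * k:(i + 1) * k] for i in range(len(a))]
    n = len(b_q[0])
    out = []
    for row in a_q:
        out_row = []
        for j in range(n):
            # Integer accumulation on zero-point-corrected values.
            acc = sum((row[p] - a_zp) * (b_q[p][j] - b_zero_point) for p in range(k))
            # Cast to float and apply the combined scale (the fused multiplier).
            out_row.append(float(acc) * (a_scale * b_scale))
        out.append(out_row)
    return out
```

The result approximates the float product of A and the dequantized B, with an error bounded by the quantization step of A.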
Modify the Gradle build so the artifactID includes _gpu for GPU builds.
Pass the USE_CUDA flag on CUDA builds.
Adjust the publishing pipelines to extract the POM from the correct path.
Co-Authored-By: @Craigacp
Disable the nuphar large model test because it takes too long (40+ minutes), while the default CPU provider takes about 5 minutes. After this change we still keep many other nuphar model tests, which I think should be enough.
1. Enlarge the read buffer size further, so that our code can run even faster. TODO: apply similar changes to the Python and other language bindings.
2. Add coreml_VGG16_ImageNet to the test exclusion set of x86_32. It is not a new model but previously we didn't run the test against x86_32.
* try mac pipeline
* fix path separator
* copy prebuilds folder
* split esrp yaml for win/mac
* disable mac signing temporarily
* add linux
* fix indent
* add nodetool in linux
* add nodetool in win-ci-2019
* replace linux build by custom docker scripts
* use manylinux as node 12.16 not working on centos6
* try ubuntu
* loosen timeout for test case - multiple runs calls
Fix a memory leak when a Python list is passed as a feed.
Create a custom allocator that can take ownership of Python
arrays that are created inside pybind.
Allow direct memory use if the contiguous array is a copy, because
the allocator can now take ownership of it.
* bug fix for models not using wrapper
* add test case for no wrapper case
* update test case to use internal learning rate
* fix bug with frozen weight update
* Fixes from investigating an issue running the BERT-Squad model with larger batch sizes. When the batch size gets large enough, the initial run succeeds (no memory pattern in use) but the second run fails to allocate the memory pattern block.
The cause of this failure is that we still have the smaller blocks from the first run allocated, as BFCArena has no logic to free those. This essentially results in 2x the memory being required to run the model.
There was an inconsistency in BFCArena::Extend, which on one path threw an exception if it couldn't do the allocation and on another just returned false (resulting in Alloc returning a nullptr). Make the behavior consistent by always throwing if BFCArena fails to find a buffer to return. There are a huge number of places in the code where we assume Alloc returns a valid pointer, so throwing will result in more correct behavior as a whole. It's also consistent with what happens when CUDA or the standard library fails to allocate memory.
Next, update ExecutionFrame to check for this failure and not insert a memory block entry if it happens. With the existing code if BFCArena Alloc returned a nullptr we happily inserted that in the blocks, delaying detection of the failure to when we attempted to use the block in AllocateMLValueTensorSelfOwnBufferHelper.
Finally, update AllocateMLValueTensorSelfOwnBufferHelper to expect that a location may not have a block. A log message is emitted when the block allocation fails, so it's not necessary to log again on each individual allocation that would have used the block. In that case the helper falls through to the default behavior of doing a normal allocation.
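The control flow described above can be sketched roughly as follows. This is a simplified Python model, not the C++ code: ToyArena, reserve_pattern_block and allocate_tensor are hypothetical names standing in for BFCArena, the ExecutionFrame block setup and AllocateMLValueTensorSelfOwnBufferHelper, and the fixed-capacity arena is an assumption made to keep the example small.

```python
import logging

logger = logging.getLogger("arena_sketch")


class ToyArena:
    """Hypothetical stand-in for an arena allocator with a fixed capacity."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.used = 0

    def alloc(self, size):
        # Consistent behavior: always raise on failure, never return None.
        if self.used + size > self.capacity:
            raise MemoryError(f"arena cannot allocate {size} bytes")
        self.used += size
        return bytearray(size)


def reserve_pattern_block(arena, location, size, blocks):
    """Try to reserve the memory pattern block; record it only on success."""
    try:
        blocks[location] = arena.alloc(size)
        return True
    except MemoryError as exc:
        # Log once here so individual tensor allocations need not log again.
        logger.warning("memory pattern block for %s not allocated: %s", location, exc)
        return False


def allocate_tensor(arena, location, offset, size, blocks):
    """Use the pattern block if the location has one, else allocate normally."""
    block = blocks.get(location)
    if block is not None:
        return memoryview(block)[offset:offset + size]
    return memoryview(arena.alloc(size))
```

With this shape, a failed block reservation never leaves an invalid entry in the block map, and later tensor allocations degrade to ordinary allocations.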
Dropout op was recently changed to accept a new input named
'training_mode', which is passed in to DropoutGrad automatically.
This PR updates the DropoutGrad schema to accommodate the new input.
Tests were also updated to reflect the API change.
Co-authored-by: Thiago Crepaldi <thiag.crepaldi@microsoft.com>
* General enhancements/cleanups to test exes
- Support running onnxruntime_perf_test with no output file
- if you're profiling, the output file is often unused and can be very large
- Allow failure to override early success if doing multiple runs of a test using onnx_test_runner
- e.g. if the second run fails that's more important as a final status
- Clarify ownership semantics
- Cleanup naming, line lengths, usage of references for required parameters etc.
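One way to read the "failure overrides early success" rule for multiple runs is sketched below. The function name final_status and the boolean-per-run representation are assumptions for illustration, as is the choice that a failure stays sticky even if a later run succeeds; the actual onnx_test_runner logic may differ.

```python
def final_status(run_results):
    """Combine per-run results (True = success) into one final status.

    Returns None when there were no runs.
    """
    status = None
    for ok in run_results:
        if not ok:
            status = False  # a failure overrides any earlier success
        elif status is None:
            status = True  # first success; later failures can still override it
    return status
```

For example, final_status([True, False]) evaluates to False, since the failing second run determines the final status.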
* Add ArmNN Execution Provider
Add a new execution provider targeting Arm architecture based on ArmNN.
Validated on NXP i.MX8QM CPU with ResNet50, MobileNetv2 and VGG models.
reviewed-by: mike.caraman@nxp.com
* Minor fixes
- renamed onnxruntime_ARMNN_RELU_USECPU to onnxruntime_ARMNN_RELU_USE_CPU
- fixed acl typo
* remove extra includes. added exception for ArmNN in test
* fix indentation
* Separated the activation implementation from the cpu and fixed the blockage from the endif
Co-authored-by: Andrei-Alexandru <andrei-alexandru.avram@nxp.com>
Needed to change the MissingTrack enum naming due to ort_mutex.h including Windows.h which #defines TRUE and FALSE (via inclusion of fdi_fci_types.h), breaking usage of MissingTrack::TRUE and MissingTrack::FALSE.
* Symbolic shape inference exits on models that do not use an onnx opset
* Temporary fix for ConvTranspose with symbolic input dims
Co-authored-by: Changming Sun <me@sunchangming.com>