QGemm takes quantized A, B, and C, plus the quantization parameters of the output Y; C and the quantization parameters of Y are optional. The output is either quantized or full precision, depending on whether the quantization parameters of Y are provided: if they are, the output is requantized; otherwise it is full precision.
Compared with QLinearMatMul and MatMulInteger, QGemm supports the transpose, alpha, and beta attributes.
The formula for quantized GEMM is:
Y = alpha * scale_a * scale_b * ((A_int8 - zp_a) * (B_int8 - zp_b) + C_int32), in which,
C_int32 is quantized with formula: C_int32 = (beta * C) / (alpha * scale_a * scale_b)
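The formula can be checked with a small pure-Python reference. This is a minimal sketch, not the kernel: the names `quantize_bias` and `qgemm_reference` are hypothetical, the transpose attributes are omitted, C is assumed to be a 1-D bias broadcast across rows, and the requantized output is assumed to be int8 (clamped to [-128, 127]).

```python
def quantize_bias(c, alpha, beta, scale_a, scale_b):
    """C_int32 = (beta * C) / (alpha * scale_a * scale_b), rounded to an integer."""
    real_scale = alpha * scale_a * scale_b
    return [round(beta * v / real_scale) for v in c]


def qgemm_reference(a_q, scale_a, zp_a, b_q, scale_b, zp_b,
                    c_int32=None, alpha=1.0, y_scale=None, y_zp=0):
    """Evaluate Y = alpha * scale_a * scale_b * ((A - zp_a) * (B - zp_b) + C_int32).

    Returns full-precision values when y_scale is None; otherwise requantizes
    to int8 (assumption) using y_scale and y_zp.
    """
    rows, inner, cols = len(a_q), len(b_q), len(b_q[0])
    real_scale = alpha * scale_a * scale_b
    out = []
    for i in range(rows):
        row = []
        for j in range(cols):
            # Integer accumulation over the shared dimension.
            acc = sum((a_q[i][k] - zp_a) * (b_q[k][j] - zp_b) for k in range(inner))
            if c_int32 is not None:
                acc += c_int32[j]  # assumed 1-D bias, broadcast across rows
            y = real_scale * acc
            if y_scale is not None:
                # Requantize and clamp to the assumed int8 range.
                y = max(-128, min(127, round(y / y_scale) + y_zp))
            row.append(y)
        out.append(row)
    return out
```

Calling `qgemm_reference` without `y_scale` gives the full-precision path; passing `y_scale` (and optionally `y_zp`) gives the requantized path.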
SparseTensor support
Implement Builder pattern
Fix support for 1-D and 2-D COO indices
Implement and test CSR support.
Handle shape inference for SparseTensors
Implement conversion for COO, CSR and tests.
Address the case where constant sparse initializer is the output.
Implement test infra for SparseTensors
Implement and test SparseDenseMatMul for CSR and COO.
Add hash for SparseToDenseMatMul
Finish shared provider refactor
Refactor GetOrCreate to Create
Working on py interface
Expose OrtDevice and use it in allocate_numpy
Adjust Sparse interfaces, add support for string SparseTensor. Add tests.
Add and test to_cuda()
Add accessors to format specific indices
Test values and indices views, read-only flag, after GC access
Add sparse related methods to OrtValue
Re-work SparseTensor wrapper, add OrtValue methods
Rework numpy_array_to_cuda/to_cpu
Add run_with_ort_values
Add models and test sparse_mat_mul with run_with_ort_values
Refactor sparse tensor to use a single buffer
Ifdef x86 Eigen CSR sparse matmul implementation
Exclude broken test, check for string type when copying cross device
Split pybind schema, regenerate docs, add exclusion
Conditionally exclude schema module
Update docs fix cuda build
Add test to a filter and regenerate JS docs
Add conversion and test string support for sparse tensors
Exclude conversion utils from minimal build
Add CUDA Memcpy and adjust provider interfaces
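For readers unfamiliar with the two formats named in the list above, the following is a minimal sketch of what dense-to-COO and dense-to-CSR conversion produces. It does not use the SparseTensor API; the function names are hypothetical, the input is a list of rows, and the `linear` flag only illustrates the difference between 1-D (flattened) and 2-D (row, column) COO indices.

```python
def dense_to_coo(dense, linear=True):
    """Return (values, indices) for the non-zero entries of a 2-D list.

    linear=True yields 1-D flattened indices; linear=False yields
    2-D [row, col] index pairs.
    """
    cols = len(dense[0]) if dense else 0
    values, indices = [], []
    for r, row in enumerate(dense):
        for c, v in enumerate(row):
            if v != 0:
                values.append(v)
                indices.append(r * cols + c if linear else [r, c])
    return values, indices


def dense_to_csr(dense):
    """Return (values, inner, outer): column indices and row offsets."""
    values, inner, outer = [], [], [0]
    for row in dense:
        for c, v in enumerate(row):
            if v != 0:
                values.append(v)
                inner.append(c)
        outer.append(len(values))  # offset where the next row starts
    return values, inner, outer
```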
* add Gridsampler contrib op
* fix gridsampler_paddingmode_border test
* disable the tests until the kernel is added
* fix CI failure
* change GridSampler to GridSample
* changes working to convert qkv nodes
* changes to replace nodes
* changes to accommodate qkv hidden sizes as attributes
* kernel to accept qkv_hidden_size attributes
* Working till compute for varied dimension, todo applyattention()
* changes to make all regression tests work
* inference running successfully without prepack
* success inference with pre-pack weights
* add test for diff sizes
* bias shape need not be a mul of 3
* get the output_hidden_size from input
* infer output shape from input
* merge with master
* cleaning up files that got merged wrong
* accuracy at accepted level
* added unit test case for different dimensions
* all unit tests passing
* packed weights working for attention
* prepacked weights working
* added test case for newly added extra qk input
* updated unit test to test only extra add qk
* fixing build error
* removing few debugs
* reverting test changes
* all python test passing
* cleaning up
* new unit test added, major clean up of code
* removed extra code
* minor
* minor fix to tests
* prepack weights code cleaned up
* compacted compute() in attention.cc
* reformat compute()
* making a parameter T
* adding 3 q,k,v buffers in all cases
* fixing build
* running tests only on cpu
* Updating docs
* trigger ci builds
* Addressing comments in PR
* addressing some more comments
* get add_qk_str from add_qk node directly
* updating docs, added extra check to verify attn inputs
* Optimized the extra add by parallelizing
* added attention_shape to symbolic_shape_infer.py
* minor refactoring to address comments
* Update submodule onnxruntime-extensions to latest.
* Add document for onnxruntime-extensions.
* Update cgmanifest.json for onnxruntime-extensions.
* Add example in JavaScript.
Co-authored-by: Zuwei Zhao <zuzhao@microsoft.com>
**Description**:
Enforce no repetition of n-grams. Scores are set to `-inf` for tokens that would form a repeated n-gram when appended to `input_ids`.
**Motivation and Context**
Needed by transformer models in sequence generation algorithms (greedy search and beam search). This module has a heavy impact on performance and can be highly parallelized.
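The rule in the description can be illustrated with a sequential pure-Python sketch for a single sequence. The function name `ban_repeated_ngrams` and its argument layout are assumptions for illustration only; the operator itself works on batches and is intended to run in parallel.

```python
def ban_repeated_ngrams(input_ids, scores, ngram_size):
    """Return a copy of scores with -inf for tokens that would repeat an n-gram.

    scores is indexed by token id; input_ids is the sequence generated so far.
    """
    if ngram_size <= 0 or len(input_ids) < ngram_size:
        return list(scores)
    # The last (ngram_size - 1) tokens are the prefix of the n-gram that the
    # next token would complete. Length-based slicing keeps ngram_size == 1 correct.
    prefix = tuple(input_ids[len(input_ids) - ngram_size + 1:])
    banned = set()
    for start in range(len(input_ids) - ngram_size + 1):
        ngram = input_ids[start:start + ngram_size]
        if tuple(ngram[:-1]) == prefix:
            banned.add(ngram[-1])
    out = list(scores)
    for token in banned:
        out[token] = float("-inf")
    return out
```

With `input_ids = [1, 2, 3, 1, 2]` and `ngram_size = 3`, the trailing prefix is `(1, 2)`, which was previously followed by token 3, so only token 3 is masked.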
* Update the operator documentation generation
- Make layout a little nicer
- Update to latest supported operators including training
- Fix some links that are broken when the docs content is copied to github-pages
- Fix incorrect usage of 'onnx.ai.ml' as the default domain
- ML ops are now separated from the real default domain of 'onnx.ai'
- Include CPU, CUDA and training kernels
- exclude DNNL as it's not an EP we own
* There are separate paths for CUDA and CUDNN as they are not guaranteed to be in the same location on a Windows machine. Use the CUDNN path when looking for the CUDNN library.
* Enable validation of both contrib ops and operator kernels in build
Filter generation so it's deterministic
Add ability for CI to publish the md files as build artifacts if they differ so a developer can download and add to their PR to resolve any diffs.
Remove workarounds for github-pages as that will now link to the github docs which display correctly
* checkin
* add 4dmask support in attention cuda op
* trim
* add comments
* fix build/test error
* review comments and add tests
* sync doc
* review comments
* minor change
* Include ORT format model conversion scripts and infrastructure in ORT python package.
- tweak existing script setup so it can be easily run directly and from the ORT python package
Add config file and readme for Android minimal build package
Update ORT Mobile doco
Disable warning if 'all' optimizations are enabled but NCHWc transformer is excluded (device specific optimizations don't apply in this scenario so the warning is moot).
* Address PR comments
Implement various improvements related to reordering a tensor for use by NCHWc operations:
Relax the requirement that the input channel count must be a multiple of the NCHWc block size (either 8 or 16 depending on ISA). The requirement now is that the channel count must be a multiple of 4. The implementation of MlasReorderInputNchw would need further work to support relaxing this further, but I don't have any models where I've observed this to be necessary yet.
Support fusing a Transpose(NHWC->NCHW) into a following ReorderInput. ReorderInput now has a channels_last attribute as was done in the past for ReorderOutput. This helps with models converted from TF where the converter is unable to remove all Transpose operations.
Add threading support to ReorderInput to accelerate performance (ReorderOutput will come later).
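The layout change that ReorderInput performs can be sketched in plain Python. This is only an illustration, not the MLAS implementation: the function name `reorder_input`, the flat-list tensor representation, and the zero padding of channels up to a full block are assumptions. The `channels_last` flag mirrors the attribute described above by reading the source as NHWC instead of NCHW.

```python
def reorder_input(data, shape, block_size, channels_last=False):
    """Reorder a flat NCHW (or NHWC) tensor into blocked NCHWc layout.

    shape is the logical (N, C, H, W). The output has shape
    (N, ceil(C / block_size), H, W, block_size), flattened, with unused
    channel slots left at zero (assumption).
    """
    n, c, h, w = shape
    if c % 4 != 0:
        raise ValueError("channel count must be a multiple of 4")
    c_blocks = (c + block_size - 1) // block_size
    out = [0] * (n * c_blocks * h * w * block_size)
    for ni in range(n):
        for ci in range(c):
            for hi in range(h):
                for wi in range(w):
                    if channels_last:
                        src = ((ni * h + hi) * w + wi) * c + ci   # NHWC offset
                    else:
                        src = ((ni * c + ci) * h + hi) * w + wi   # NCHW offset
                    dst = ((((ni * c_blocks + ci // block_size) * h + hi) * w + wi)
                           * block_size + ci % block_size)
                    out[dst] = data[src]
    return out
```

Reading NHWC directly is what makes the separate Transpose(NHWC->NCHW) unnecessary: both source layouts map to the same blocked output.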
* Implement qlinear concat and unit test.
Add quantization tools for QLinearConcat and its quantization tests.
* Add kernel def hash for QLinearConcat.
* Change according to PR. Add qdq transformer support for QLinearConcat.
* Add QDQ Transformer unittest. Fix typo on domain.
* remove dup logic of no use.
* fix x86 build error.
* Update operator docs.
* Add support for custom ops library to the ORT model conversion script
Simplify model conversion now that we read ops from the ORT format model.
Enable custom ops in the python bindings if custom ops are turned on in a minimal build.
* Add test of model conversion involving custom ops.
* Add ability to generate configuration that includes required types for individual operators, to allow build size reduction based on that.
- Add python bindings for ORT format models
- Add script to update bindings and help info
- Add parsing of ORT format models
- Add ability to enable type reduction to config generation
- Update build.py to only allow operator/type reduction via config
- simpler to require config to be generated first
- can't mix a type aware (ORT format model only) and non-type aware config as that may result in insufficient types being enabled
- Add script to create reduced build config
- Update CIs
Update Python API to allow more flexibility for setting providers and provider options.
The providers argument (InferenceSession/TrainingSession constructors, InferenceSession.set_providers()) now also accepts a tuple of (name, options dict).
Fix get_available_providers() API (and the corresponding function in the C API) to return the providers in default priority order. Now it can be used as a starting point for the providers argument and maintain the default priority order.
Convert some usages of the deprecated global configuration functions to use EP-specific options instead.
Update some EP-specific option parsing to fail on unknown options.
Other clean up.
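A minimal sketch of how the new `providers` forms might be used together with `get_available_providers()`. The helper `with_provider_options` is hypothetical and exists only to show the mixed list of names and (name, options dict) tuples; `device_id` is used as an example CUDA option, and the model path is a placeholder.

```python
def with_provider_options(available, options_by_name):
    """Attach option dicts to selected providers, keeping the order of `available`.

    Providers without options stay as plain names; the others become
    (name, options dict) tuples.
    """
    return [
        (name, options_by_name[name]) if name in options_by_name else name
        for name in available
    ]


def create_session(model_path, options_by_name):
    """Create an InferenceSession using the default provider priority order."""
    import onnxruntime as ort  # imported lazily; requires onnxruntime to be installed

    providers = with_provider_options(ort.get_available_providers(), options_by_name)
    return ort.InferenceSession(model_path, providers=providers)
```

For example, `create_session("model.onnx", {"CUDAExecutionProvider": {"device_id": 0}})` would keep the default priority order and pass options only to the CUDA provider, if it is available.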
* Enabling fasterrcnn variant and vehicle detector
* changes for 2021_2 branch
* yolov3_pytorch commit
* fixed braces in basic_backend.cc
* ci information added
* faster rcnn variant and vehicle detector changes were made in 2021.1 and not in 2021.2
* some changes to support unit tests
* disable some tests which are failing
* fix myriad tests for vehicle detector
* Did some cleanup
*cleaned up comments
*Disabled Add_Broadcast_0x1 and Add_Broadcast_1x0
tests on MYRIAD_FP16 backend due to a bug
*cleaned up capability_2021_2.cc file
*Removed extra conditions which were added
for some validation in backend_utils
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* yolov3 pytorch workaround to ensure that the output names are matched
* gemmoptest fixed on myriad
* Fixed MYRIADX CPP Test Failures
*Expand,GatherND,Range,Round op's
are only supported in model
*where op with float input data
types are not supported and fixed
*Scatter and ScatterElements op's with
negative axis are fixed
*Reshape op with 0 dim value are not
supported and fixed
*Disabled InstanceNorm_2 test on MYRIADX
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* make changes to yolov3 pytorch
* Fixed python unit tests
*Fixed failing python tests on vpu,
GPU and CPU
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Fixes POW op failures on GPU_FP16
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Clean up capability_2021_2.cc
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Updated docx for MultiThreading option
*Added extra info on setting the num_of_threads
option using the API and it's actual usage
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* fixed slice and removed extra prints
* Disabled failing python tests
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Minor changes added in capability_2021_2
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* made changes to slice to avoid failures
* Disabling FP16 support for GPU_FP32
->Inferencing an FP16 model on GPU_FP32
leads to accuracy mismatches. so, we would
rather use GPU_FP16 to infer an FP16 model
on GPU Device
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* Updated docx for Inferencing a FP16 Model
Signed-off-by: MaajidKhan <n.maajidkhan@gmail.com>
* fix for mask rcnn
* Script for installing openvino from source
* Updated with openvino 2021.2 online installation
* code comment fixes
fixed accuracy mismatch for div
* Update OpenvinoEP-ExecutionProvider.md
updated for 2021.2 branch
* Update README.md
updated dockerfile documentation
* Update BUILD.md
build.md update documentation
* permission change for install_openvino.sh
* made changes to align with microsoft onnxruntime changes
* Updated with ov 2021.2.200
Co-authored-by: suryasidd <surya.siddharth.pemmaraju@intel.com>
Co-authored-by: sfatimar <sahar.fatima@intel/com>
Co-authored-by: MaajidKhan <n.maajidkhan@gmail.com>
Co-authored-by: mohdansx <mohdx.ansari@intel.com>