### Description
Sampling op for CPU and CUDA.
Supports the HuggingFace case and a custom case.
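The description doesn't spell out the sampling strategy; as a rough illustration of the kind of computation such an op performs, here is a minimal top-p (nucleus) sampling sketch in NumPy (the function name and parameters are hypothetical, not the operator's actual schema):

```python
import numpy as np

def sample_top_p(logits, top_p=0.9, temperature=1.0, rng=None):
    # Nucleus (top-p) sampling: keep the smallest set of tokens whose
    # cumulative probability reaches top_p, then sample from that set.
    rng = rng or np.random.default_rng()
    z = logits / temperature
    probs = np.exp(z - z.max())
    probs /= probs.sum()
    order = np.argsort(-probs)                        # tokens by descending prob
    cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
    keep = order[:cutoff]
    p = probs[keep] / probs[keep].sum()               # renormalize kept mass
    return int(rng.choice(keep, p=p))
```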
### Motivation and Context
Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
Implement reuse of the kv_cache past and present tensors in the Attention ops.
Add a unit test for the above feature.
Utilize the reused kv_cache past and present tensors in GreedySearch.
Add a correctness test for it.
Co-authored-by: Zhang Lei <phill.zhang@gmail.com>
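A minimal sketch of the buffer-reuse idea, assuming a cache preallocated for the maximum sequence length that both past and present alias into (shapes and helper names are illustrative, not the actual kernel code):

```python
import numpy as np

# Hypothetical layout: one buffer sized for the maximum sequence length.
# Each decoding step writes the new K/V slice in place, so `present` is a
# view over the same memory as `past` instead of a fresh copy.
batch, num_heads, head_size, max_len = 1, 12, 64, 128
kv_cache = np.zeros((2, batch, num_heads, max_len, head_size), np.float32)

def append_kv(kv_cache, step, k_new, v_new):
    # k_new, v_new: [batch, num_heads, 1, head_size] for the current token
    kv_cache[0, :, :, step, :] = k_new[:, :, 0, :]
    kv_cache[1, :, :, step, :] = v_new[:, :, 0, :]
    return kv_cache[:, :, :, :step + 1, :]  # `present` is a view, no copy
```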
### Description
### Motivation and Context
Fixes https://github.com/microsoft/onnxruntime/issues/13508
Some models use QuickGelu(x) = x * sigmoid(1.702 * x), which takes 3 ops for
the forward pass and 5 ops for the backward pass. This PR fuses them into a
single op named QuickGelu and its gradient QuickGeluGrad.
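For reference, a NumPy sketch of the fused computation and its gradient, derived directly from QuickGelu(x) = x * sigmoid(1.702 * x) (names are illustrative):

```python
import numpy as np

def quick_gelu(x, alpha=1.702):
    # QuickGelu(x) = x * sigmoid(alpha * x)
    s = 1.0 / (1.0 + np.exp(-alpha * x))
    return x * s

def quick_gelu_grad(dy, x, alpha=1.702):
    # d/dx [x * sigmoid(alpha*x)] = s * (1 + alpha*x*(1 - s)), s = sigmoid(alpha*x)
    s = 1.0 / (1.0 + np.exp(-alpha * x))
    return dy * s * (1.0 + alpha * x * (1.0 - s))
```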
For CUDA, tested on V100 using an input tensor with shape [64, 128, 2048] and
float16 type:
Before, FW takes 335us and BW takes 614us.

After, FW takes 115us and BW takes 139us, which is much faster.

For the CPU kernel, using the same shape and float type:
Before, FW takes ~10ms and BW takes ~49ms:
FW ops: Mul: 3480[µs], Sigmoid: 1996[µs], Mul: 4789[µs]
BW ops: Mul: 4642[µs], Mul: 4195[µs], SigmoidGrad: 18328[µs], Mul: 2988[µs], Sum: 18576[µs]
After, FW takes ~4ms and BW takes ~5ms, which is also much faster:
QuickGelu: 3939[µs]
QuickGeluGrad: 5089[µs]
Co-authored-by: Vincent Wang <weicwang@microsoft.com>
### Description
Allow separated Q, K and V inputs to support cross attention:
* Q: [batch_size, sequence_length, hidden_size]
* K: [batch_size, kv_sequence_length, hidden_size]
* V: [batch_size, kv_sequence_length, v_hidden_size]
* Output: [batch_size, sequence_length, v_hidden_size]
With separated Q/K/V inputs, the input tensor is the query, and two optional
inputs are added for key and value. Weights for the input projection are not
included for now, so the MatMul of the input projection must be done outside
the Attention operator, but the bias add is included for performance reasons.
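A NumPy sketch of the resulting contract, assuming the input-projection MatMuls were applied upstream and only the bias add happens inside the op (helper names are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def split_heads(x, num_heads):
    b, s, h = x.shape
    return x.reshape(b, s, num_heads, h // num_heads).transpose(0, 2, 1, 3)

def cross_attention(query, key, value, bias_q, bias_k, bias_v, num_heads):
    # query: [b, seq_len, hidden], key: [b, kv_seq_len, hidden],
    # value: [b, kv_seq_len, v_hidden]; projections done outside the op,
    # only the bias add is fused in.
    q = split_heads(query + bias_q, num_heads)    # [b, n, s_q, d]
    k = split_heads(key + bias_k, num_heads)      # [b, n, s_kv, d]
    v = split_heads(value + bias_v, num_heads)    # [b, n, s_kv, d_v]
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(q.shape[-1])
    ctx = softmax(scores) @ v                     # [b, n, s_q, d_v]
    b, n, s, d_v = ctx.shape
    return ctx.transpose(0, 2, 1, 3).reshape(b, s, n * d_v)  # [b, s_q, v_hidden]
```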
### Description
Fix a bug in the GreedySearch op when batch size > 1.
Support a custom attention mask in GreedySearch and BeamSearch with GPT-2.
### Motivation and Context
### Description
Fix some typos in docs.
### Motivation and Context
singed vs signed
succeding vs succeeding
fileter vs filter
kernal vs kernel
libary vs library
* Change block dimension type to Int from Ints.
* In response to feedback that the block dimension corresponds to the
reduction dimension of the consuming matrix multiplication. There is
always only 1 reduction dimension.
* `QuantizeBFP` and `DequantizeBFP` schemas - similar to
`QuantizeLinear` and `DequantizeLinear`.
* The BFP datatype is represented as a `uint8` tensor with shape and stride
metadata. This is preferable to adding a new datatype for BFP, which is
more disruptive and [discouraged by
PyTorch](https://discuss.pytorch.org/t/training-with-custom-quantized-datatype/152132/2).
Context:
The Microsoft Floating Point (BFP) datatype shares an exponent across every
group of n numbers, called a “bounding box.” Each number still has its own
mantissa and sign bits. BFP has been shown to incur 3-4x lower cost
(energy and area) than BFloat16 and INT8 counterparts without reductions
in accuracy on the ImageNet benchmark, as described in [Rouhani
2020](https://proceedings.neurips.cc/paper/2020/file/747e32ab0fea7fbd2ad9ec03daa3f840-Paper.pdf).
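As a rough illustration of the shared-exponent idea, here is a NumPy sketch of quantizing one bounding box. This is one of many possible BFP variants, not the exact layout any hardware uses (names and bit widths are assumptions):

```python
import numpy as np

def quantize_bfp_box(x, mantissa_bits=7):
    # One bounding box -> one shared exponent (max exponent in the box);
    # each element keeps its own signed mantissa against that exponent.
    shared_exp = int(np.floor(np.log2(np.abs(x).max() + 1e-38)))
    scale = 2.0 ** (shared_exp - (mantissa_bits - 1))
    lo, hi = -(2 ** mantissa_bits), 2 ** mantissa_bits - 1
    mantissas = np.clip(np.round(x / scale), lo, hi).astype(np.int16)
    return shared_exp, mantissas

def dequantize_bfp_box(shared_exp, mantissas, mantissa_bits=7):
    return mantissas.astype(np.float32) * 2.0 ** (shared_exp - (mantissa_bits - 1))
```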
Requirements:
* There are many variants of BFP (number of mantissa bits, number of
shared exponent bits, size of bounding box, custom bit fields, etc.).
* The size and layout of a BFP variant vary across hardware.
* Bounding boxes can span arbitrary dimensions; for example, the
channel "C" dimension in an N x C x H x W tensor for convolution.
Goals of this PR:
* Add initial versions of QuantizeBFP and DequantizeBFP operators to
enable QDQ-style quantization with BFP. Once the schemas stabilize, we
can consider upstreaming to ONNX.
* Add some basic type and shape inferencing tests; tests that run on an
EP will be a follow-up.
* Make ORT a PyTorch JIT backend
LORT likely doesn't work with ATen fallback, so we only test LORT in its own CI.
* Revert changes to enable external CUDA allocator. Will add it later.
Revert "Revert changes to enable external CUDA allocator. Will add it later."
This reverts commit d5487f2e193014c805505afae8fb577c53667658.
Fix external allocator
* Relax tolerance and remove commented code
* Print more information in CI
* Fix pointer
* Address comments.
1. Reuse ORT-eager mode's environment.
2. Remove unused ctor.
* Use PyTorch master branch as all PRs are merged
Fix
* Refine based on cpplint feedbacks
* Revert changes to allow custom CUDA allocator in public APIs
* Use torch.testing.assert_close
* Use unittest framework
* Switch docker repo
* Rename *.cpp to *.cc
* Address comments
* Add comment
* Use same pipeline file for eager and lort pipelines
* Address comments
* Add yaml comment
* Fix cmake files
* Address comments
* Rename flags, remove printing code, remove dead comment
(1) Support T5 in the BeamSearch operator, and add both CPU and CUDA implementations.
(2) Change BeamSearch op: rename the encoder_decoder_init attribute to encoder, and add a decoder_start_token_id attribute.
(3) Update convert_to_onnx for T5 to use int32 instead of int64 inputs as default.
(4) Add more tests in best_beam_search.py
(5) fix ORT_ENFORCE of hypothesis_buffer_offset_
(6) Improve ONNX conversion:
(a) Change some encoder dynamic axes to fixed dim values
(b) Add --separate_encoder_and_decoder_init
(c) Correct names t5-3B => t5-3b, t5-11B => t5-11b
(d) Add --use_int32_inputs when converting T5 to ONNX
(e) Allow T5 beam search conversion in one step
* Initiate ORT SNPE EP
* fix the SNPE EP Windows build, which was broken by the utility method (ToUTF8String) name change on master
* correct the source path for libonnxruntime.so while building the Android package
* add AdditionalDependencies for arm64
* On MS-Windows, the patchfile must be a text file, i.e. CR-LF must be used as line endings. A file with LF may give the error: "Assertion failed, hunk, file patch.c, line 343," unless the option '--binary' is given.
* fix build failure if snpe is not enabled
* update doc for contrib op
* separate out snpe ep settings to onnxruntime_snpe_provider.cmake
* rename according to review comments
* update according to review comments
* Implement BitmaskDropout and associated unit tests.
* Implement BitmaskDropoutGrad and associated unit tests.
* Implement Dropout -> BitmaskDropout rewrite rule and associated unit tests.
* Implement (Dropout,DropoutGrad) -> (BitmaskDropout,BitmaskDropoutGrad) rewrite rule.
This commit does not yet include unit tests for this rewrite rule.
This commit also introduces improved documentation for all changes that will
be grouped into this PR.
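A NumPy sketch of the idea behind BitmaskDropout: the keep-mask is stored as packed bits, one bit per element, instead of the one byte per element a bool mask costs (function names are illustrative, not the kernel API):

```python
import numpy as np

def bitmask_dropout(x, ratio, rng):
    # Forward: scale kept elements by 1/(1-ratio), pack the mask to bits.
    keep = rng.random(x.shape) >= ratio
    y = np.where(keep, x / (1.0 - ratio), 0.0)
    return y, np.packbits(keep)            # bitmask is 8x smaller than bool mask

def bitmask_dropout_grad(dy, bitmask, ratio):
    # Backward: unpack the bitmask and apply the same scaling to gradients.
    keep = np.unpackbits(bitmask, count=dy.size).astype(bool).reshape(dy.shape)
    return np.where(keep, dy / (1.0 - ratio), 0.0)
```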
* bitmask dropout
* fix win build
* bugfix for rocm
* bugfix
* fix code format
* fix ut
* fix build break
* fix ut in win
* resolve comments
* fix ut in trt
* resolve comments
* fix rocm build error
* fix typo
Co-authored-by: Aidan Beggs <aidanbeggs@microsoft.com>
* change BeamSearch op to support encoder-decoder models
* check model_type and decoder attribute
* fix
* update comments
* warn shape inference issue with onnx v1.11 or T5
* skip parity test when temperature != 1.0
* fix build
* Changes to fuse embed layer for gpt2, kernel changes pending
* verified add output and regular add match
* Test added for additional output embedlayernorm, working on CUDA
* Test passing on CPU
* updated convert_to_onnx tool to check parity correctly
* removed some debugs
* a couple of TODOs left in optimizer.py
* removed changes to optimizer.py
* fixing build
* fixing build
* updated order of initialization
* added a test case for float16
* updating the docs
* updating tests failing due to embed layer fusion
* update unit tests
* updating CUDA documentation in OperatorKernels.md
* addressing comments
* OperatorKernels.md updated with CUDA
* adding TODO to qembed_layer
* minor edit
* updated docs
* addressing comments
* adding position ids to embed layer gpt2
* updating fused gpt2 model
* added extra test
* remove comments
* addressing comments
* contrib_defs.cc updated
* all tests passing
* fixing a typo
* minor edit
* trigger build
* qembedlayernorm checkinputs updated
* fixing build error
* fixing build error
* fixing build error
```
Component for aggressive decoding. Find the bifurcation index between
source tokens and predicted tokens, starting from the previous suffix
match index. Concat predicted tokens, starting from the bifurcation
index, to the back of the current tokens. This forms the output tokens.
Detect the suffix match index in source tokens, between source tokens
and output tokens. Detection is based on finding appearances of the
last n-gram of the output tokens in the source tokens.
A match is considered found if the source tokens contain a single
matching n-gram; return the index of the start of the n-gram in the
source tokens.
No match is found if the source tokens contain multiple or zero
matching n-grams; return -1.
```
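A plain-Python sketch of the two helpers described above (names and signatures are hypothetical, not the component's actual interface):

```python
def find_bifurcation_index(source_tokens, predicted_tokens, start):
    # First offset at which the predicted tokens diverge from the source
    # tokens, scanning the source from `start`.
    i = 0
    while (start + i < len(source_tokens) and i < len(predicted_tokens)
           and source_tokens[start + i] == predicted_tokens[i]):
        i += 1
    return i

def find_suffix_match_index(source_tokens, output_tokens, n):
    # Locate the last n-gram of output_tokens inside source_tokens; a match
    # counts only if it occurs exactly once, otherwise return -1.
    ngram = output_tokens[-n:]
    hits = [i for i in range(len(source_tokens) - n + 1)
            if source_tokens[i:i + n] == ngram]
    return hits[0] if len(hits) == 1 else -1
```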
* bias dropout improvement
* add transform case for same shape case
* combine kernel
* merge with vectorized kernel
* use "has_same_shape_bias"
* minor: a "N % 4 != 0" case
* add op UT for has_same_shape_bias
* address comments; add param case for 1d bias;
add param case tests for 1d and same-shape bias
* rewrite logic condition
Co-authored-by: Peng Wang <pengwa@microsoft.com>
QGemm takes in quantized A, B, C, and the quantization parameters of output Y, in which C and the quantization parameters of Y are optional. Its output can be quantized or full precision, depending on whether the quantization parameters of Y are provided: if they are, the output is requantized; otherwise it is full precision.
Compared with QLinearMatMul and MatMulInteger, QGemm supports the transpose, alpha and beta attributes.
The formula for quantized GEMM is:
Y = alpha * scale_a * scale_b * ((A_int8 - zp_a) * (B_int8 - zp_b) + C_int32), in which
C_int32 is quantized with the formula: C_int32 = (beta * C) / (alpha * scale_a * scale_b)
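A NumPy reference sketch of that formula, omitting the transpose attributes for brevity (the function name and parameter order are illustrative):

```python
import numpy as np

def qgemm_reference(a_int8, b_int8, scale_a, zp_a, scale_b, zp_b,
                    c=None, y_scale=None, y_zp=0, alpha=1.0, beta=1.0):
    # Bias C enters the int32 accumulator domain:
    # C_int32 = (beta * C) / (alpha * scale_a * scale_b)
    c_int32 = (np.round((beta * c) / (alpha * scale_a * scale_b)).astype(np.int32)
               if c is not None else 0)
    acc = (a_int8.astype(np.int32) - zp_a) @ (b_int8.astype(np.int32) - zp_b) + c_int32
    y = alpha * scale_a * scale_b * acc
    if y_scale is None:
        return y.astype(np.float32)      # no Y quant params: full-precision output
    q = np.round(y / y_scale) + y_zp     # Y quant params given: requantize
    return np.clip(q, -128, 127).astype(np.int8)
```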
SparseTensor support
Implement Builder pattern
Fix support for 1-D and 2-D COO indices
Implement and test CSR support.
Handle shape inference for SparseTensors
Implement conversion for COO, CSR and tests.
Address the case where constant sparse initializer is the output.
Implement test infra for SparseTensors
Implement and test SparseDenseMatMul for CSR and COO.
Add hash for SparseToDenseMatMul
Finish shared provider refactor
Refactor GetOrCreate to Create
Working on py interface
Expose OrtDevice and use it in allocate_numpy
Adjust Sparse interfaces, add support for string SparseTensor. Add tests.
Add and test to_cuda()
Add accessors to format specific indices
Test values and indices views, read-only flag, after GC access
Add sparse related methods to OrtValue
Re-work SparseTensor wrapper, add OrtValue methods
Rework numpy_array_to_cuda/to_cpu
Add run_with_ort_values
Add models and test sparse_mat_mul with run_with_ort_values
Refactor sparse tensor to use a single buffer
Ifdef x86 Eigen CSR sparse matmul implementation
Exclude broken test, check for string type when copying cross device
Split pybind schema, regenerate docs, add exclusion
Conditionally exclude schema module
Update docs fix cuda build
Add test to a filter and regenerate JS docs
Add conversion and test string support for sparse tensors
Exclude conversion utils from minimal build
Add CUDA Memcpy and adjust provider interfaces
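For context on the index formats mentioned above (1-D vs 2-D COO indices, and CSR), a small NumPy illustration of how the same sparse matrix is laid out in each:

```python
import numpy as np

# A 2x3 dense matrix with three non-zeros:
dense = np.array([[0, 1, 0],
                  [2, 0, 3]], dtype=np.float32)
values = np.array([1, 2, 3], dtype=np.float32)

# COO indices can be given two ways:
coo_2d = np.array([[0, 1], [1, 0], [1, 2]])            # [nnz, 2] (row, col) pairs
coo_1d = coo_2d[:, 0] * dense.shape[1] + coo_2d[:, 1]  # flattened linear offsets

# CSR stores per-value column indices plus row-start offsets:
csr_inner = np.array([1, 0, 2])  # column index of each value
csr_outer = np.array([0, 1, 3])  # row i owns values[csr_outer[i]:csr_outer[i+1]]
```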
* add GridSampler contrib op
* fix gridsampler_paddingmode_border test
* disable the tests until the kernel is added
* fix CI failure
* change GridSampler to GridSample
* changes working to convert qkv nodes
* changes to replace nodes
* changes to accommodate qkv hidden sizes as attributes
* kernel to accept qkv_hidden_size attributes
* working up to Compute() for varied dimensions; TODO: ApplyAttention()
* changes to make all regression tests work
* inference running successfully without prepack
* successful inference with pre-packed weights
* add test for diff sizes
* bias shape need not be a multiple of 3
* get the output_hidden_size from input
* infer output shape from input
* merge with master
* cleaning up files that got merged wrong
* accuracy at an accepted level
* added unit test case for different dimensions
* all unit tests passing
* packed weights working for attention
* prepacked weights working
* added test case for newly added extra qk input
* updated unit test to test only extra add qk
* fixing build error
* removing few debugs
* reverting test changes
* all python test passing
* cleaning up
* new unit test added, major clean up of code
* removed extra code
* minor
* minor fix to tests
* prepack weights code cleaned up
* compacted compute() in attention.cc
* reformat compute()
* making a parameter T
* adding 3 q,k,v buffers in all cases
* fixing build
* running tests only on cpu
* Updating docs
* trigger ci builds
* Addressing comments in PR
* addressing some more comments
* get add_qk_str from add_qk node directly
* updating docs, added extra check to verify attn inputs
* Optimized the extra add by parallelizing
* added attention_shape to symbolic_shape_infer.py
* minor refactoring to address comments
**Description**:
Enforce no repetition of n-grams. Scores are set to `-inf` for tokens that form a repeated n-gram if added to the back of the input_ids.
**Motivation and Context**
Needed by transformer models in sequence generation algorithms (greedy search and beam search). This module has a heavy impact on performance and can be highly parallelized.
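A plain-Python sketch of the banning logic described above, assuming ngram_size >= 2 (the function name and layout are illustrative):

```python
import numpy as np

def ban_repeated_ngrams(input_ids, scores, ngram_size):
    # For each sequence, set the score to -inf for any token that would
    # complete an n-gram already present in input_ids (ngram_size >= 2).
    for b, seq in enumerate(input_ids):
        seq = list(seq)
        if len(seq) < ngram_size:
            continue
        prefix = tuple(seq[-(ngram_size - 1):])  # last (n-1) generated tokens
        for i in range(len(seq) - ngram_size + 1):
            if tuple(seq[i:i + ngram_size - 1]) == prefix:
                scores[b, seq[i + ngram_size - 1]] = -np.inf
    return scores
```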
* Update the operator documentation generation
- Make layout a little nicer
- Update to latest supported operators including training
- Fix some links that are broken when the docs content is copied to github-pages
- Fix incorrect usage of 'ai.onnx.ml' as the default domain
- ML ops are now separated from the real default domain of 'ai.onnx'
- Include CPU, CUDA and training kernels
- exclude DNNL as it's not an EP we own
* There are separate paths for CUDA and CUDNN as they are not guaranteed to be in the same location on a Windows machine. Use the CUDNN path when looking for the CUDNN library.
* Enable validation of both contrib ops and operator kernels in build
Filter generation so it's deterministic
Add ability for CI to publish the md files as build artifacts if they differ so a developer can download and add to their PR to resolve any diffs.
Remove workarounds for github-pages as that will now link to the github docs which display correctly