Commit graph

171 commits

Author SHA1 Message Date
Patrice Vignola
96d8d2c278
[DML EP] Add SkipLayerNormalization (#13849)
### Description

Add SkipLayerNormalization for the DML EP
2022-12-07 01:49:14 -08:00
Patrice Vignola
b53bbe7370
[DML EP] Add an implementation for NonZero (#13768)
### Description
Add the NonZero op for DML



### Motivation and Context
NonZero is used in a few transformer models, so having a DML
implementation will stop large tensors from being transferred to the CPU
and back to the GPU
2022-12-02 18:39:21 -08:00
Patrice Vignola
a0b470bc35
[DML EP] Add mixed datatype support for DML's LayerNorm contrib op (#13734)
### Description
Add mixed datatype support for DML's LayerNorm contrib op.



### Motivation and Context
The fusion logic removes casts around LayerNorm in the graph because the
contrib version of the op supports mixed datatypes. Scale, Bias and
Output's datatypes must match, but input's datatype can be different.
2022-12-01 14:08:18 -08:00
Patrice Vignola
e9b92fdf33
[DML EP] Add DML implementation for BiasGelu (#13795)
### Description
Add DML implementation for BiasGelu
2022-12-01 09:23:19 -08:00
Tianlei Wu
8b0e0f4927
Add RemovePadding and RestorePadding for BERT model (#13701)
Add two operators RemovePadding and RestorePadding based on ideal of
effective transformer (https://github.com/bytedance/effective_transformer) to improve large
batch size inference for BERT model.
2022-11-22 10:00:23 -08:00
Patrice Vignola
3482180ec2
DML EP add a registration for Shape and Size (#13442)
### Description
Add a DML registration for Shape to avoid copying back to the CPU just
to get the shape of a GPU tensor.



### Motivation and Context
When using free dimensions, many Transformers models extensively use the
`Shape` operator. This causes hundreds of GPU->CPU copy that should be
completely avoidable. Note that this change also uses the same
heuristics as other providers (e.g. CUDA) to force some tensors on the
CPU in certain situations.

Co-authored-by: Patrice Vignola <pavignol@microsoft.com>
2022-11-08 19:29:37 -08:00
Vincent Wang
8b0669bf63
QuickGelu Fusion (#12417)
Some models have QuickGelu(x)=x*sigmoid(1.702x), which has 3 Ops for
forward and 5 Ops for backward. The PR is to fuse this to a single Op
named QuickGelu and its gradient QuickGeluGrad.

For CUDA, tested in V100 using input tensor with shape [64,128,2048] and
float16 type:
Before, FW takes 335us, BW takes 614us

![image](https://user-images.githubusercontent.com/11661208/182291335-15188709-ffe7-44d1-9d14-0b544cbe5e55.png)

After, FW takes 115us, BW takes 139us, which is much faster.

![image](https://user-images.githubusercontent.com/11661208/182291502-f0b5161c-b95c-45fc-90f8-ad0c592d2433.png)

For CPU kernel, using same shape and float type:
Before, FW takes 10us, BW takes 49us
Mul: 3480[µs]
Sigmoid: 1996[µs]
Mul: 4789[µs]
Mul: 4642[µs]
Mul: 4195[µs]
SigmoidGrad: 18328[µs]
Mul: 2988[µs]
Sum: 18576[µs]

After, FW takes 4us, BW takes 5us, which is also much faster.
QuickGelu: 3939[µs]
QuickGeluGrad: 5089[µs]

Co-authored-by: Vincent Wang <weicwang@microsoft.com>
2022-10-28 18:12:07 +08:00
Changming Sun
07271b6c8a
Update docs/OperatorKernels.md (#13485) 2022-10-27 20:11:49 -07:00
Scott McKay
ab71c4bbc0
Document generation CI is broken (#13308)
### Description
<!-- Describe your changes. -->
Fix document generation CI. It's not currently updating the docs as
we're skipping the tests, which is the invocation of build.py that would
have generated the documentation.

Setup specific task to generate documentation for greater clarity. 

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Operator kernel documentation is not getting updated and is now out of
date.
2022-10-28 07:20:48 +10:00
Tianlei Wu
7aafd86229
Update Attention operator to support separated Q/K/V inputs (#13410)
### Description
Allow separated Q, K and V inputs to support cross attention:
* Q: [batch_size, sequence_length, hidden_size]
* K: [batch_size, kv_sequence_length, hidden_size]
* V: [batch_size, kv_sequence_length, v_hidden_size]
* Output: [batch_size, sequence_length, v_hidden_size]

To use separated Q/K/V inputs, the input tensor is for query, and two
optional inputs are added for key and value. Weights for input
projection is not included for now, so the MatMul of input projection
shall be done out of Attention operator, but Add bias is included for
performance consideration.
2022-10-25 11:51:06 -07:00
Ye Wang
928c9889a3
A few fixes for generative model ops (#13363)
### Description
<!-- Describe your changes. -->

Fix a bug in GreedySearch Op when batch > 1
Support custom attention mask in GreedySearch and BeamSearch with GPT2 


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2022-10-21 15:00:18 -07:00
Edward Chen
454f77cd94
Update kernel matching logic: decouple from op schemas and remove kernel def hashes (#12791)
# Motivation
Currently, ORT minimal builds use kernel def hashes to map from nodes to
kernels to execute when loading the model. As the kernel def hashes must
be known ahead of time, this works for statically registered kernels.
This works well for the CPU EP.
For this approach to work, the kernel def hashes must also be known at
ORT format model conversion time, which means the EP with statically
registered kernels must also be enabled then. This is not an issue for
the always-available CPU EP. However, we do not want to require that any
EP which statically registers kernels is always available too.
Consequently, we explore another approach to match nodes to kernels that
does not rely on kernel def hashes. An added benefit of this is the
possibility of moving away from kernel def hashes completely, which
would eliminate the maintenance burden of keeping the hashes stable.

# Approach
In a full build, ORT uses some information from the ONNX op schema to
match a node to a kernel. We want to avoid including the ONNX op schema
in a minimal build to reduce binary size. Essentially, we take the
necessary information from the ONNX op schema and make it available in a
minimal build.
We decouple the ONNX op schema from the kernel matching logic. The
kernel matching logic instead relies on per-op information which can
either be obtained from the ONNX op schema or another source.
This per-op information must be available in a minimal build when there
are no ONNX op schemas. We put it in the ORT format model.
Existing uses of kernel def hashes to look up kernels are replaced
with the updated kernel matching logic. We no longer store
kernel def hashes in the ORT format model’s session state and runtime
optimization representations. We no longer keep the logic to
generate and ensure stability of kernel def hashes.
2022-09-20 14:24:59 -07:00
Dwayne Robinson
8e4eb24648
Update operator kernel table to include DML operators (#12887)
* Fix bug in pybind get_all_operator_schema due to premature reference dropping
* Add updated operator kernels markdown table
* Update build.py to include documentation generation for DML operators too
* Update GPU pipeline to include DML in the build to so operators can be generated.
* Use a separate pipeline stage, feedback from Changming and Scott
* Appease annoying Python linter
* Add onnxruntime_BUILD_UNIT_TESTS=OFF and remove stale --use_dml in cuda stage
2022-09-09 10:21:25 -07:00
Hariharan Seshadri
ad69aac491
Introduce ordered quantization ops for the CUDA EP [1/n] (#12582)
Initial core small set for the ordered quantization ops for cuda EP.
2022-09-07 11:58:15 -07:00
Yulong Wang
c144acc534
Replace 'master' branch ref to 'main' in the code (#12547) 2022-08-22 10:48:12 -07:00
Cheng
64e991a9fc
[Qlinearsoftmax] contrib cpu (#12177)
* [Qlinearsoftmax] contrib cpu

* int8 implementation

* contrib operator md

* qdq transformer test

* new attribute: opset

* doc

* quantized tool

* remove template to reduce Binary size

* doc of contribe operators

* enforce x_shape is valid

* fix reduce_size if input-shape is dynamic

* add UT

* register one op for reducing binarysize

* kernel hash update

* docs/ContribOperators.md
2022-08-10 10:52:02 +08:00
Vincent Wang
cfa09d16d9
[CUDA] Mod Op Kernel (#12499)
* mod for cuda and rocm

* fix bfloat16 ut

* change bf16 ut number

* fix opset version

* fix op kernel doc
2022-08-09 13:05:40 +08:00
Ye Wang
b622e5fa9b
Support vocab_mask/prefix_vocab_mask/no_repeat_number in greedysearch op (#12327)
* support more inputs for greedy search

* fix docs

* refactor test

* lint

* review comments
2022-08-03 10:10:08 -07:00
Ye Wang
89ac61f4d4
support gpt2 model with greedy search (#12068)
* greedy search gpt2 cpu checkin

* add cuda support

* add test

* provider

* update

* fix some bugs

* refactor impl class

* refactor test

* remove unused func

* refactor parameters class

* simplify padding

* fix lint warnings

* python format

* Revert "python format"

This reverts commit f25fe1017fa33d960b2418ebbb5dba6a4bd043cf.

* python format

* fix pipelines

* fix pipeline

* move bufferallocater to generate_impl_base

* review comments(alignment, filename/namespace change)

* rebase2

* python reformat

* reformat

* fix rocm build

* review comment

* review comments

* review comments

* fix a bug

* rebase test files

* python format

* format import order

* review comments

* fix build
2022-07-22 15:45:16 -07:00
Gary Miguel
dc5d6b9515
register signal ops for opset 17 (#11778)
* Register signal ops for op set 17

Note code is mostly being moved, not added. These ops were previously
only registered as Microsoft contrib ops and only built if
`BUILD_MS_EXPERIMENTAL_OPS=1`. They've been added to the ai.onnx
standard op set in version 17.

Main components of this change:

* Move the kernels from the conrib_ops directory to the
  core directory.
* Add function bodies for ms experimental ops. This will allow
  old models that use the contrib ops to continue to function.
  All the function bodies consist of a single op (the
  new standard op), so performance overhead should be minimal.

Minor clean-up also in this change:

* De-duplicate get_scalar_value_from_tensor: put it in a new utils.h.
* Fix some bugs that caused compilation errors with the experimental
  ops. Tested with `build.sh --ms_experimental`
* Fix some spelling errors and lint violations.
* Replace a couple of switch statements with `MLTypeCallDispatcher`.
* Use `InlineVector` instead of `std::vector`.

Unblocks https://github.com/microsoft/onnxruntime/issues/11640
2022-06-27 10:26:55 +10:00
Gary Miguel
4bf22e2a40
Update ONNX to 1.12 (#11924)
Follow-ups that need to happen after this and before the next ORT release:
* Support SequenceMap with https://github.com/microsoft/onnxruntime/pull/11731
* Support signal ops with https://github.com/microsoft/onnxruntime/pull/11778

Follow-ups that need to happen after this but don't necessarily need to happen before the release:
* Implement LayerNormalization kernel for opset version 17: https://github.com/microsoft/onnxruntime/issues/11916

Fixes #11640
2022-06-21 17:19:52 -07:00
Ye Wang
859ef277a0
apply zcode changes to the beam search op (#11880)
* apply zcode  changes to the beam search op

* fix pipeline failure

* add doc

* workaround for C#

* update

* update

* use name zcode

* review comment

* review comments

* fix cpplint

* review coments
2022-06-20 18:39:07 -07:00
Tianlei Wu
6ee2c1b5fc
Remove temperature input from BeamSearch operator (#11896)
* remove temperature input
* update index of remaining inputs
2022-06-20 09:50:45 -07:00
Vincent Wang
02724c54ff
[CUDA] Implement BitmaskDropout, BitmaskBiasDropout and BitmaskDropoutGrad (#11534)
* Implement BitmaskDropout and associated unit tests.

* Implement BitmaskDropoutGrad and associated unit tests.

* Implement Dropout -> BitmaskDropout rewrite rule and associated unit tests.

* Implement (Dropout,DropoutGrad) -> (BitmaskDropout,BitmaskDropoutGrad) rewrite rule.

This commit does not yet include unit tests for this rewrite rule.

This commit also introduces improved documentation for all changes which will be grouped
into this PR.

* bitmask dropout

* fix win build

* bugfix for rocm

* bugfix

* fix code format

* fix ut

* fix build break

* fix ut in win

* resolve comments

* fix ut in trt

* resolve comments

* fix rocm build error

* fix typo

Co-authored-by: Aidan Beggs <aidanbeggs@microsoft.com>
2022-05-27 17:24:47 +08:00
Xavier Dupré
c37d2728bf
Implement TreeEnsemble for opset(ai.onnx.ml)==3 (#10821)
* Implement TreeEnsemble for opset(ai.onnx.ml)==3
* use of InlineVector
* refactoring
* improve attributes retrieval
* avoid creating a temporary buffer
* modifies onnx.ml.cpu.json
* use unordered_map
* update docs/OperatorKernels.md
* address PR comments (TH -> ThresholdType, ORT_RETURN...)
* add a python unit test to load a TreeEnsembleRegressor following ai.onnx.ml==3 specifications
2022-03-30 12:53:12 +02:00
Vincent Wang
6a6840d5c6
Fuse LayerNormalization for Apex O2 (#10233) 2022-03-29 21:22:04 +08:00
pengwa
89ef987ab1
Improve NonZero on CUDA/ROCM (#10307)
* improve NonZero

* fix megatron_fp16 optimzier, fix the doc

* multi_tensor_applier

* resolve comment

* fix building warning

* fix build error when enabling training and use tensorrt
2022-03-25 07:35:45 +08:00
Hariharan Seshadri
a9d9c6b486
Register CPU, CUDA and ROCM opset-16 kernels for some operators (#10643) 2022-03-08 09:18:39 -08:00
liqun Fu
da885a72e8
update with onnx 1.11 release (#10441) 2022-03-07 21:10:55 -08:00
Tianlei Wu
36c3271546
BeamSearch op cuda (#10556)
Add BeamSearch cuda implementation with support of fp16 GPT-2 subgraph
2022-02-25 13:08:55 -08:00
Scott McKay
df841ee87d
Fix incorrect type constraint registration for operator kernels. (#10489)
* Fix incorrect type constraint registration for RoiAlign. This led to the input type not actually being checked when matching a kernel as the invalid constraint name is treated as a missing optional input.
  * fix missing dependency for the unit test exe. Whilst it doesn't link against the CUDA providers lib, without the dependency VS doesn't know it needs to rebuild the library if there are changes.
* Add check for invalid type constraints.
* Fix invalid registrations for other kernels.
* Add hash replacement logic to provide backwards compatibility in ORT format models when the registration is fixed.
* Add tests
2022-02-18 16:55:32 +10:00
Viswanath Boga
ad9d2e2e89
Prefix match in first iteration of beam search OP (#10231)
* Add BeamSearch op schema

* Add ONNX conversion for beams search

* remove attention_mask and change input order

* add option to run baseline

* add check data type NULL

* applies VerifyNodeAndOpMatch to subgraph

* update input_ids shape

* Add node name for Cast node

* expose API for topk

* parse parameters

* Add beam search scorer

* output results

* fix typo

* use c++ template and format python

* fix build pipeline errors

* symbolic shape infer of input onnx

* output scores

* add kernel def hash

* Handle vocab_mask; move CheckSubgraph

* undo insert_cast_transformer.cc and fusion_utils.py

* fix typo

* fix merge

* update doc

* add repetition penalty

* refactoring: add GptSubgraph class

* move BeamSearchState from .h to .cc file

* adjust logits processor order

* add batch generation example

* fix repetition penalty for dup words in sequence

* Add test

* Add no repeat ngram processor

* refactoring: move logits processor to classes

* fix build warning

* show latency

* use allocator in beam state

* use allocator in sequences

* fix build error

* move next_positions to beam state

* Changes for prefix matching

* removing debugs

* removing more debugs

* clean up

* clean up

* cpu doc updated

* Updated docs

* updated prefix_vocab_mask dimension in convert script

* changes to support bxs prefix_vocab_mask in beamsearchop kernel

* doc update

* OperatorKernels.md updated

* matching docs from artifacts

* minor change in logits processor

* Addressing comments

* Updated the prefix vocab mask usage properly

Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
2022-02-03 00:14:39 +05:30
Yufeng Li
1aa0789691
add qdq support for QGemm (#10414)
* add qgemm in quantization tool

* add qdq support for QGemm

* fix build break

* fix OperatorKernels.md
2022-02-02 10:35:29 -08:00
Yi-Hong Lyu
e27f2dc932
int8/uint8 support for Argmax for opset 1, 11, 12 (#10296) 2022-01-18 14:37:34 -08:00
Vincent Wang
44e2db9397
CUDA BFloat16 Refactor (#10085) 2022-01-14 19:38:56 +08:00
Yi-Hong Lyu
499f1d5fd7
Quantization of Argmax (#10213)
This patch includes:
* int8/uint8 support for Argmax
* Quantization tool support for Argmax
2022-01-12 14:12:56 -08:00
Yufeng Li
12ee2e942f
add int8_t for Resize (#10067)
As we support quantization for format s8s8, we need Resize to support int8_t.
2021-12-17 15:36:09 -08:00
Tianlei Wu
ef36488df0
Add BeamSearch operator for GPT-2 decoding (#9680)
* Add BeamSearch operator and CPU implementation
* Add ONNX conversion script
2021-12-16 16:08:05 -08:00
Yufeng Li
ffdafb2012
add fallback of s8s8 support on x64 (#9995)
* add fallback of s8s8 support on x64
2021-12-10 11:33:19 -08:00
Yufeng Li
a0afd7303d
add int8_t support for pool operators (#9852)
* add int8_t support for pool operators
2021-11-29 18:43:43 -08:00
Ye Wang
6856619b18
Decoder Attention CUDA Op (#9792)
* add kernel interface

* register kernel

* add self/cross qkv projection without cache

* add LaunchTransQkv2 for (S,B,X,N,H) -> (X,B,N,S,H)

* refactor ConcatPastToPresent

* DecoderQkvToContext interface

* q,k,v buffer and cache as output

* qk, pv and transctx

* fix compiler error on linux machine

* key_padding_mask

* add test_parity file. However not runnable

* add partial unittest

* made partial attributes to inputs

* --gen_doc

* change kernel interface, add more tests

* morre parity tests

* fix test

* fix typo

* transpose optimizer has bug. remove it temporarily

* add input shape checks

* add type/shape inference

* fix cache shape check

* fix rocm build failure

* fix rocm build error

* review comments

* review comments
2021-11-19 19:25:36 -08:00
Vincent Wang
f390347c11
Add CUDA Kernels of RandomNormal[Like], RandomUniform[Like] (#9761) 2021-11-19 08:18:34 +08:00
satyajandhyala
229c9a4e1c
Added Trilu CUDA kernel. (#9633)
* Added Trilu CUDA kernel.

* Added TriluGrad.

* Added a training testcase for Trilu.

* Added Trilu gradient checker test.
2021-11-09 11:20:17 -08:00
Hariharan Seshadri
bbeceb7541
Support optional type in ORT (#8339) 2021-11-04 15:01:42 -07:00
Viswanath Boga
85874bb315
embed layer fusion gpt2 (#9336)
* Changes to fuse embed layer for gpt2, kernal changes pending

* verified add output and regular add match

* Test added for additional output embedlayernorm, working on CUDA

* Test passing on CPU

* updated convert_to_onnx toll to check parity correctly

* removed some debugs

* couple of TODO left as in optimizer.py

* removed changes to optimizer.py

* fixing build

* fixing build

* updated order of initilization

* added a test case for float16

* updating the docs

* updating tests failing due to embed layer fusion

* update unit tests

* updating CUDA documentation in operatorkernels.md

* addressing comments

* OperatorKernels.md updated with CUDA

* adding TODO to qembed_layer

* minor edit

* updated docs

* addressing comments

* adding position ids to embed layer gpt2

* updating fused gpt2 model

* added extra test

* remove comments

* addressing comments

* contrib_defs.cc updated

* all tests passing

* fixing a typo

* minor edit

* trigger build

* qembedlayernorm checkinputs updated

* fixing build error

* fixing build error

* fixing build error
2021-10-28 11:06:26 -07:00
Bowen Bao
e983f37121
Bifurcation detector for aggressive decoding (#9432)
```
Component for aggressive decoding. Find the bifurcation index of predicted tokens, between source tokens,
starting from previous suffix match index, and predicted tokens.
Concat predicted tokens, starting from bifurcation index, to the back
of current tokens. This forms the output tokens.
Detect suffix match index in source tokens, between source tokens and output tokens.
Detection is based on finding the appearances of last n-gram in output tokens
in source tokens.
A match is considered found if source tokens contain a single matching n-gram.
Return the index of the start of the n-gram in source tokens.
No matching if found if src tokens contain multiple or zero matching n-grams. Return -1.
```
2021-10-19 19:53:56 -07:00
ashbhandare
35c2102cfa
Fixes for GatherND, Multinomial (#9143)
* register gathernd kernel, aten multinomial

* fix CI, add test

* review comments
2021-10-05 14:51:58 -07:00
ytaous
0193490cbf
ReduceMin - add int64 cuda kernel support for opset12/13 (#8966)
* ReduceMin - int64 support

* fix doc

Co-authored-by: Ethan Tao <ettao@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2021-09-07 17:01:26 -07:00
Hariharan Seshadri
cee79526fd
Add opset 15 kernels for Pow, BatchNorm, and Shape (#8442) 2021-08-25 12:04:20 -07:00
Hariharan Seshadri
17b0664e34
Optimize sequence type usage on CUDA [2/n] (#8720) 2021-08-24 10:40:28 -07:00
XiyinOSS
19b82b438b
GridSample OP implementation for CPU and CUDA (#8551)
* GridSample OP implementation for CPU and CUDA

**Description**: This change contains implementation for torch grid_sample OP.
Cuda implementation contains contribution from Muscle Wu.

* Use interpolation for out-of-bound points in zero padding mode

Out-of-bound points in zeros padding mode changed from constant 0 to
interpolation of surrounding pixels. This aligns with Pytorch implementation.

A bug in CUDA batch offset calculation is fixed.

Custom op exporter type is added.

* Fix nearest bug in CPU

* Update per CI build finding and review comments

* Force float to avoid potential integer T issue

* Style update

* PR update

* Remove c++17 feature from cuda code
2021-08-20 12:37:38 -07:00
harshithapv
c24335246b
Support bool type for Pad Op and fix Unsqueeze in Tile grad for Opset 13 (#8602)
* changes

* tile grad unsqueeze fix for opset 13

* clean up

* remove bool support for opset 2 to 12 for Pad as it is not supported.

* Copy OperatorKernels.md from artifacts of Windows CI build.
2021-08-11 11:21:02 -07:00
Xavier Dupré
064a385b59
Support int8 for operator Split (#8615)
* Support int8 for operator Split
2021-08-10 23:04:16 +02:00
Ashwini Khade
96eb9810ba
Update onnx (#8458)
* updates for picking pnnx commit

* add tests filter to c# tests

* plus test fixes

* fix versioning for contrib ops

* fix tests

* test filter for optional ops

* more versioning related updates

* fix test

* fix layernorm spec

* more updates

* update docs

* add more test filters

* more filters

* update binary size threshold

* update docs

* plus more fixes

* updates per review

* update to release commit

* add filters for optional type tests

* plus updates
2021-08-05 09:21:44 -07:00
Yufeng Li
ceeb1a65d6
Add quantization support of GEMM directly with QGemm (#8447)
QGemm takes in quantized A, B, C, and quantization parameters of output Y, in which C and quantization parameters of Y are optional. Its output can be quantized or full precision, which depends on whether quantization parameters of Y exists or not. If quant params of Y are provided, the output will be requantized or is full precision.

Comparing with QLinearMatMul and MatMulInteger, QGemm supports transpose, apha and beta attribute.

The formula for quantized GEMM is:
Y = alpha * scale_a * scale_b * ((A_int8 - zp_a) * (B_int8 - zp_b) + C_int32), in which,
C_int32 is quantized with formula: C_int32 = (beta * C) / (alpha * scale_a * scale_b)
2021-07-27 21:21:49 -07:00
Dmitri Smirnov
950fe5e28b
Implement SparseTensor and infrastructure suppport and advance ONNX commit (#8038)
SparseTensor support
  Implement Builder pattern
  Fix support for 1-D and 2-D COO indices
  Implement and test CSR support.
  Handle shape inference for SparseTensors
  Implement conversion for COO, CSR and tests.
  Address the case where constant sparse initializer is the output.
  Implement test infra for SparseTensors
  Implement SparseDenseMatMul for Csr and COO and tested it.
  Add hash for SparseToDenseMatMul
  Finish shared provider refactor
  Refactor GetOrCreate to Create
  Working on py interface
  Expose OrtDevice and use it in allocate_numpy
	Adjust Sparse interfaces, add support for string SparseTensor. Add tests.
	Add and test to_cuda()
	Add accessors to format specific indices
	Test values and indices views, read-only flag, after GC access
	Add sparse related methods to OrtValue
	Re-work SparseTensor wrapper, add OrtValue methods
	Rework numpy_array_to_cuda/to_cpu
	Add run_with_ort_values
	Add models and test sparse_mat_mul with run_with_ort_values
	Refactor sparse tensor to use a single buffer
        Ifdef x86 Eigen CSR sparse matmul implementation
        Exclude broken test, check for string type when copying cross device
       Split pybind schema, regenerate docs, add exclusion
       Conditionally exclude schema module
       Update docs fix cuda build
       Add test to a filter and renerate JS docs
      Add conversion and test string support for sparse tensors
      Exclude conversion utils from minimal build
      Add CUDA Memcpy and adjust provider interfaces
2021-07-22 15:24:36 -07:00
Viswanath Boga
afce0e2543
Attention kernel update to handle different Q,K,V hidden sizes (#8039)
* changes working to convert akv nodes

* changes to replace nodes

* changes to accomodate qkv hidden sizes as attributes

* kernel to accept qkv_hidden_size attributes

* Working till compute for varied dimension, todo applyattention()

* changes to make all regression tests work

* inference running successfully without prepack

* success inference with pre-pack weights

* add test for diff sizes

* bias shape need not be a mul of 3

* get the output_hidden_size from input

* infer output shape from input

* merge with master

* cleaning up files that got merged wrong

* accurancy at accepted level

* added unit test case for different dimensions

* all unit tests passing

* packed weights working for attention

* prepacked weights working

* added test case for newly added extra qk input

* updated unit test to test only extra add qk

* fixing build error

* removing few debugs

* reverting test changes

* all python test passing

* cleaning up

* new unit test added, major clean up of code

* removed extra code

* minor

* minor fix to tests

* prepack weights code cleaned up

* compacted compute() in attention.cc

* reformat compute()

* making a parameter T

* adding 3 q,k,v buffers in all cases

* fixing build

* running tests only on cpu

* Updating docs

* trigger ci builds

* Addressing comments in PR

* addressing some more comments

* get add_qk_str from add_qk node directly

* updating docs, added extra check to verify attn inputs

* Optimized the extra add by parallelizing

* added attention_shape to symbolic_shape_infer.py

* minor refactoring to address comments
2021-07-19 12:21:33 -07:00
Ye Wang
04297110c3
Support int64 in ReduceMin cuda op for Opset 14 (#8307)
* reducemin int64_t support

* fix xxcuda.so load error

* testtest

* refactor

* update doc

* propagate types to opset14

* re-generate doc

* rename macro
2021-07-13 16:18:06 -07:00
Hariharan Seshadri
5369821ad6
Support SpaceDepth ops in the CUDA and ROCM EPs (#7960) 2021-07-09 01:00:22 -07:00
Nick Kreeger
800b62a139
Create a quantized EmbedLayerNorm for ORT. (#8124)
Create a quantized EmbedLayerNorm Op for ORT
2021-06-25 17:51:43 -05:00
Bowen Bao
51c12a715b
Add NGramRepeatBlock contrib op (#8078)
**Description**: 
Enforce no repetition of n-grams. Scores are set to `-inf` for tokens that form a repeated n-gram if added to the back of the input_ids.

**Motivation and Context**
Needed by transformer models in sequence generation algorithms (greedy search and beam search). This module has heavy impact on performance, and can be highly parallelized.
2021-06-21 10:21:48 -07:00
RandySheriffH
1a5ee11dbd
Implement Sequence Ops GPU (#7863) 2021-06-07 15:30:26 -07:00
Scott McKay
0fbec1b9c1
Update the operator documentation generation (#7787)
* Update the operator documentation generation
  - Make layout a little nicer
  - Update to latest supported operators including training
  - Fix some links that are broken when the docs content is copied to github-pages
  - Fix incorrect usage of 'onnx.ai.ml' as the default domain
    - ML ops are now separated from the real default domain of 'onnx.ai'
  - Include CPU, CUDA and training kernels
    - exclude DNNL as it's not an EP we own

* There are separate paths for CUDA and CUDNN as they are not guaranteed to be in the same location on a Windows machine. Use the CUDNN path when looking for the CUDNN library.

* Enable validation of both contrib ops and operator kernels in build
Filter generation so it's deterministic
Add ability for CI to publish the md files as build artifacts if they differ so a developer can download and add to their PR to resolve any diffs.
Remove workarounds for github-pages as that will now link to the github docs which display correctly
2021-06-02 17:47:40 +10:00
Zhang Lei
50c5edcf13
Add nhwc support for QLinearAveragePool operator (#7656)
* Add nhwc support for QLinearAveragePool operator

* Update ContribOperators.md

* Update OperatorKernels.md with cpu,dnnl and cuda enabled.
2021-05-13 22:05:30 -07:00
Ye Wang
803837df63
Add 4dmask support for attention cuda kernel (#7591)
* checkin

* add 4dmask support in attention cuda op

* trim

* add comments

* fix build/test error

* review comments and add tests

* sync doc

* review comments

* minor change
2021-05-07 20:17:29 -07:00
Zhang Lei
ada0fbbd2d
Implement qlinear concat and unit test. (#7341)
* Implement qlinear concat and unit test.
Add quantization tools for QLinearConcat and it quantization tests.

* Add kernel def hash for QLinearConcat.

* Change according to PR. Add qdq transformer support for QLinearConcat.

* Add QDQ Transformer unittest. Fix typo on domain.

* remove dup logic of no use.

* fix x86 build error.

* Update operator docs.
2021-04-26 13:38:40 -07:00
Nat Kershaw (MSFT)
8a03b6e5c7
Render Operator documentation as compliant markdown (#3658) 2020-09-02 15:07:50 -07:00
Hariharan Seshadri
1599562016 Fix BatchNorm CUDA kernel definition 2020-04-18 17:21:29 -07:00
Hariharan Seshadri
b4457ecb7a
Fix gen_doc build option and refresh documentation (#3545)
* Support listing keys in custom metadata map via C/C++ API

* nit

* PR feedback

* Nit

* Initial commit

* More changes

* Support listing keys in custom metadata map via C/C++ API

* nit

* PR feedback

* Nit

* Initial commit

* More changes

* Add md files

* Doc changes

* Update

* revert cmake changes

* Update

* Doc change

* Update

* Update
2020-04-17 14:41:04 -07:00
Sreekanth Yalachigere
31ea11a696 Renaming MKL-DNN as DNNL (#2515)
* DNNL: Moving Files to rename file names

* DNNL name change

* azure pipeline updated

* disable ceil/dialation and enable Opset10

* disable ceil/dialation tests in Python

* mlperf_ssd_resnet34_1200 disabled
2019-12-03 07:34:23 -08:00
shahasad
0c5d2c998b
Generate documentation from the registered operator kernels (#1395)
- Added python script for generating markdown doc from the registered opkernels. 
- Made some conditional changes in the pybind to expose necessary python API
- Added some missing type-constraints in the op kernel registrations
2019-08-14 18:12:24 -07:00