Commit graph

635 commits

Author SHA1 Message Date
Nat Kershaw (MSFT)
638f21b969
Upgrade doxygen to fix C API docs build issue (#13950) 2023-02-03 09:43:29 -08:00
Tianlei Wu
a6c5ba0185
Stable Diffusion CUDA Optimizations (#14428)
### Description

Add stable diffusion CUDA kernel optimizations.

The following are included:
(1) GroupNorm operator. This kernel is from TensorRT 8.5.
(2) BiasSplitGelu operator. This kernel is modified from SplitGelu of
TensorRT 8.5. We added bias to the SplitGelu.
(3) NhwcConv operator. This adds support of NHWC format (ONNX Conv
operator uses NCHW format).
(4) Update MultiHeadAttention (packed kv and no bias) for cross
attention. This could avoid transpose of kv for TRT fused cross
attention kernel.
(5) Optimization and benchmark script

Not included:
(1) Script to convert Conv to NhwcConv in onnx graph.
(2) Update symbolic shape inference for NhwcConv.
(3) Add SeqLen2Spatial operator
(4) Documents

Limitations: GroupNorm, BiasSplitGelu and NhwcConv kernels are
implemented based on stable diffusion usage. They might not work for
arbitrary input sizes or dimensions. For example, BiasSplitGelu
requires the hidden size to be 2560, 5120 or 10240, and NhwcConv assumes 4D
input/weight.
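
As a rough illustration of what a bias-plus-split-GELU fusion computes, here is a minimal pure-Python sketch. It assumes the op adds the bias, splits the last dimension into two halves, and multiplies the left half by the GELU of the right half; that ordering and the function name `bias_split_gelu` are assumptions for illustration, not taken from this commit.

```python
import math


def gelu(x):
    # Exact (erf-based) GELU.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def bias_split_gelu(row, bias):
    """Add bias, split the last dimension in half, return left * gelu(right).

    Hypothetical reference for one row; the real kernel works on whole tensors.
    """
    if len(row) != len(bias) or len(row) % 2 != 0:
        raise ValueError("row and bias must have the same even length")
    biased = [x + b for x, b in zip(row, bias)]
    half = len(biased) // 2
    left, right = biased[:half], biased[half:]
    return [l * gelu(r) for l, r in zip(left, right)]


out = bias_split_gelu([1.0, 2.0, 0.0, 3.0], [0.5, -1.0, 0.0, 0.0])
```

The output has half the hidden size of the input, which is why the kernel is tied to specific hidden sizes.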

There is a minor increase in binary size. For SM=75 only, the python
package wheel grows by (33757K - 33640K) = 117 KB. It is possible to
move NHWC from a template parameter to a constructor argument to reduce
binary size (at a slight cost in performance).

Note: for RTX 4090/4080/4070 Ti, build with CUDA 11.8 and the latest
cuDNN to get the best performance.
2023-02-02 23:43:51 -08:00
Numfor Tiapo
3cc81460e0
Register ScatterElements-16 (#14425)
This PR registers ScatterElements-16 to the DML EP
- CPU fallback is added if the reduction attribute is in use, as this is
not yet supported by DML.

---------

Co-authored-by: Numfor Mbiziwo-Tiapo <numform@microsoft.com>
2023-02-01 09:46:37 -08:00
Rui Ren
eacd829d23
Bump ORT version number (#14226)
### Description
Bump ort version after the creation of release candidate of 1.14

Co-authored-by: ruiren <ruiren@microsoft.com>
2023-01-26 12:33:47 -08:00
liqun Fu
2b1a59f01a
cpu support of LpPool(18) (#14205)
### Description
To support LpPool (18)



### Motivation and Context
for Ort 1.14 release

Signed-off-by: Liqun Fu <liqfu@microsoft.com>
2023-01-25 23:14:56 -08:00
Thiago Crepaldi
32c05fcdd1
Add Col2Im CPU op (#12311)
**Description**
This PR implements N-dimensional Col2Im as a contrib CPU Op as specified
by ONNX's https://github.com/onnx/onnx/pull/3948

**Motivation and Context**
- Col2Im enables models such as:
  - [SS-DCNet](https://github.com/xhp-hust-2018-2011/SS-DCNet)
  - [DSTT](https://github.com/ruiliu-ai/DSTT)
- It also serves to document ORT's obscure `math::Col2ImNd` utility

Signed-off-by: Liqun Fu <liqfu@microsoft.com>
Co-authored-by: Liqun Fu <liqfu@microsoft.com>
2023-01-25 12:23:00 -08:00
Edward Chen
3bc092b1ea
Update ORT format v5 change docs to cover limited backwards compatibility in 1.14. (#14413) 2023-01-25 08:23:12 -08:00
liqun Fu
7b6d880b28
cpu to support bitwise ops (#14197) 2023-01-23 16:42:18 -08:00
Scott McKay
c252a7f992
Remove exclusions for ONNX model tests that now pass. (#14337)
### Description
Remove exclusions for ONNX model tests that now pass due to kernels
being implemented.
Update ONNX update doc to point to correct location for tests.

### Motivation and Context
Run as many tests as possible.

Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
2023-01-24 08:04:27 +10:00
liqun Fu
05915d8393
support Pad(18) (#14219) 2023-01-23 12:14:35 -08:00
Nat Kershaw (MSFT)
abaed6f474
Add link to Python API examples (#14345) 2023-01-21 16:23:16 -08:00
Nat Kershaw (MSFT)
e57c312f9d
Pin sphinx to avoid broken link (#14383) 2023-01-21 09:50:56 -08:00
Ye Wang
de7a868d5f
Update quantization_defs.cc (#14380)
2023-01-20 15:03:50 -08:00
Ye Wang
668586e8f8
Support muP in Attention (#14348)

Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
2023-01-19 20:36:55 -08:00
liqun Fu
5d6a049141
support ScatterND(18) and ScatterElement(18) (#14224) 2023-01-19 13:54:20 -08:00
Tianlei Wu
477cad3051
[CUDA] Add trt cross attention kernels (#14328)
Add TRT cross attention kernels for stable diffusion optimization.
2023-01-17 17:55:45 -08:00
Ye Wang
2db57a53a3
Add mask_filter in Attention related ops' attribute (#14274)


### Motivation and Context

https://github.com/microsoft/onnxruntime/issues/12843

Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
2023-01-17 12:28:11 -08:00
Zhang Lei
15141a40b4
Add present_past_share_buff to QAttention Defs to enable QAttention related tests. (#14297) 2023-01-14 09:19:06 -08:00
Ye Wang
c9a53c9255
Some changes to Sampling Op (#14218)
### Description
1. Add an optional input to pass in a seed.
2. Add two unit tests: one for top_p=0.5, another for top_p=0.01 (which
creates the greedy search result, in convert_generation.py).
3. Fix a bug in the CPU kernel.
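
To see why a very small top_p behaves like greedy search, here is a minimal sketch of the usual nucleus (top-p) filtering rule: keep the highest-probability tokens until their cumulative probability reaches top_p. The helper name `top_p_filter` is hypothetical, and the actual Sampling kernel may differ in details such as tie handling or temperature.

```python
def top_p_filter(probs, top_p):
    """Return the token indices kept by nucleus (top-p) filtering."""
    # Visit tokens from most to least probable.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return kept


probs = [0.1, 0.6, 0.25, 0.05]
greedy_like = top_p_filter(probs, 0.01)  # only the argmax survives
nucleus = top_p_filter(probs, 0.5)
wider = top_p_filter(probs, 0.8)
```

With top_p=0.01 the first (most likely) token already exceeds the threshold, so sampling from the kept set is deterministic.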


Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
2023-01-12 14:15:26 -08:00
Numfor Tiapo
dee36f8ade
DML EP Register ScatterND-16 (#14240)
This PR registers ScatterND-16 to the DML EP

- CPU fallback is added if the reduction attribute is in use, as this is
not yet supported by DML.

Co-authored-by: Numfor Mbiziwo-Tiapo <numform@microsoft.com>
2023-01-12 10:39:25 -08:00
sfatimar
7654cd50e8
Openvino ep 2022.3 v4.3 (#14210)
### Description
Changes to incorporate OpenVINO EP 2022.3


### Motivation and Context
This change is required to incorporate OpenVINO EP 2022.3.

Co-authored-by: mohsinmx <mohsinx.mohammad@intel.com>
Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com>
Co-authored-by: Aravind <aravindx.gunda@intel.com>
Co-authored-by: mayavijx <mayax.vijayan@intel.com>
Co-authored-by: flexci <mohsinmx>
2023-01-11 16:31:26 -08:00
Scott McKay
dd2df460b3
Split(18) (#14015)
### Description
Opset 18 Split changes. Adds ability to specify num_outputs which also
allows uneven splitting.
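
A minimal sketch of how uneven splitting with num_outputs can work, assuming the chunk size is the ceiling of the axis length divided by num_outputs and the last chunk takes whatever remains. This follows my reading of the ONNX Split-18 wording and is not the ORT kernel itself; `split_sizes` is a hypothetical helper.

```python
def split_sizes(dim, num_outputs):
    """Sizes of the chunks when an axis of length dim is split num_outputs ways."""
    chunk = -(-dim // num_outputs)  # integer ceiling division
    sizes, remaining = [], dim
    for _ in range(num_outputs):
        size = min(chunk, remaining)
        sizes.append(size)
        remaining -= size
    return sizes


even = split_sizes(6, 3)
uneven = split_sizes(7, 3)
```

When the axis length is not divisible by num_outputs, only the final chunk is smaller.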

https://github.com/onnx/onnx/releases/tag/v1.13.0

### Motivation and Context
Support ONNX opset 18.
2023-01-12 08:14:10 +10:00
Ye Wang
a01bf8dbb1
rename CrossAttention to MultiHeadAttention (#14201)
### Description
Rename CrossAttention to MultiHeadAttention, since this op can also
be used for self-attention.


Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
2023-01-10 10:18:39 -08:00
Numfor Tiapo
f4ea781b81
DML EP Register Identity-16 (#14053)
This PR Registers Identity-16 to the DML EP.

ONNX Backend tests and optional type tests were skipped pending future
additions.

Co-authored-by: Numfor Mbiziwo-Tiapo <numform@microsoft.com>
2023-01-10 09:16:09 -08:00
liqun Fu
1be36913cc
to work with onnx 1.13 rc, implement ver 18 reduce and optional ops, … (#13765) 2023-01-09 10:26:16 -08:00
Ye Wang
5eac2c1f41
relational attention bias cuda op (#14149)
### Description

This CUDA op implements the compute_bias() method in T5 Attention,
including the permutation.

Notes:
1. bias_table needs to be saved in column-major order. Be careful when
implementing the fusion script.
2. The second input (sequence length) is placed on the CPU (using a Shape
node's output should be fine).
3. The first dimension of the output is 1, so extra_add_qk in Attention
should support broadcasting.
4. compute_bias() is only used in self-attention in T5.

TODO: docs changes will be applied later.

### Motivation and Context
This is part of the process of optimizing T5 attention as well as T5-based
generation models.

Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
2023-01-06 17:32:58 -08:00
Tianlei Wu
2cacb24cb0
Add CrossAttention operator (#14146)
Move separated Q, K and V (without input projection) from Attention to a
new operator CrossAttention.

The Attention operator is hard to maintain when we need to support both
with and without input projection in one class. Add a new operator
according to feedback.

Some changes might be needed in the future, but not in this PR:
(1) bias could be optional (We will not take that route unless
experiments show that fusing Add bias with MatMul instead of this op
could improve performance).
(2) support packed KV. There are two ways to support it: when key and
value are the same Tensor, they are packed; or we can make value
optional, and use packed mode when value is empty and the key has packed
K/V.
(3) support cached key and value, and other inputs (like relative position
bias), or more attention mask formats. They can be added easily without
breaking backward compatibility.
(4) ROCm/CPU implementation of this op.
2023-01-06 14:27:40 -08:00
Hariharan Seshadri
d0c5ffd5f7
Misc transformer fixes - 2 (#14156)
### Description
1. The graph pattern search introduced in
https://github.com/microsoft/onnxruntime/pull/13914/ needs to be
enhanced so that SkipLayerNormalization is supported

2. Fix fp32 parity for GPT-2 while using `SkipLayerNormalization`
fusion. The optional output of SLN needs to also include the bias (if
present) and the added output should be a sum of `input + skip + (bias)`

### Motivation and Context
Fix some breaking tests
2023-01-06 07:27:10 -08:00
Ye Wang
ae148ebc05
T5 skip_layer_norm cuda op (#14093)
### Description

T5 uses a layer_norm which only scales and doesn't shift, which is also
known as Root Mean Square Layer Normalization.
ORT already has simplified_layer_norm, which is the RMS layer_norm.
This PR extends this T5 layer_norm with support for skip/bias and the
residual output.
The new op is named SkipSimplifiedLayerNorm and has a similar interface
to SkipLayerNorm, but removes beta as an input.
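
The description above can be sketched for a single row as follows. This is a minimal reference under stated assumptions: the residual is input + skip + bias, the normalization divides by the root mean square of that residual, and only a scale (gamma) is applied. The function name, argument order and epsilon default are hypothetical and not taken from the operator schema.

```python
import math


def skip_simplified_layer_norm(x, skip, gamma, bias=None, epsilon=1e-6):
    """RMS-normalize (x + skip + bias) and scale by gamma; no beta shift.

    Returns (normalized_row, residual_row).
    """
    if bias is None:
        bias = [0.0] * len(x)
    residual = [a + s + b for a, s, b in zip(x, skip, bias)]
    mean_square = sum(v * v for v in residual) / len(residual)
    inv_rms = 1.0 / math.sqrt(mean_square + epsilon)
    normalized = [v * inv_rms * g for v, g in zip(residual, gamma)]
    return normalized, residual


normalized, residual = skip_simplified_layer_norm(
    [3.0, 0.0], [0.0, 4.0], gamma=[1.0, 1.0], epsilon=0.0
)
```

With gamma set to ones and epsilon set to zero, the mean square of the normalized row is 1, which is the defining property of RMS normalization.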



Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
2023-01-04 13:31:53 -08:00
Ashwini Khade
68b5b2d7d3
Refactor training build options (#13964)
### Description
1. Renames all references of on-device training to training APIs. This
keeps the naming general; nothing really prevents us from using the
same APIs on servers or other non-edge devices.
2. Updates the ENABLE_TRAINING option: with this PR, when this option is
enabled, the training APIs and torch interop are also enabled.
3. Refactors the onnxruntime_ENABLE_TRAINING_TORCH_INTEROP option:
   - Removed the user-facing option.
   - onnxruntime_ENABLE_TRAINING_TORCH_INTEROP is set to ON when
     onnxruntime_ENABLE_TRAINING is ON, as we always build with torch interop.

Once this PR is merged when --enable_training is selected we will do a
"FULL Build" for training (with all the training entry points and
features).
Training entry points include:
1. ORTModule
2. Training APIs

Features include:
1. ATen Fallback
2. All Training OPs includes communication and collectives
3. Strided Tensor Support
4. Python Op (torch interop)
5. ONNXBlock (front-end tools for training artifacts prep when using the
training APIs)

### Motivation and Context
The intention is to simplify the options for building training-enabled builds.
This is part of the larger work item to create a dedicated build for
learning-on-the-edge scenarios with just the training APIs enabled.
2023-01-03 13:28:16 -08:00
Ye Wang
68518a1b72
Sampling op (#13426)
### Description

Sampling op for cpu and cuda
support huggingface case and custom case
            



Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
2022-12-22 17:34:12 -08:00
pengwa
2f5bf75e51
Optimize computation orders (#13672)
### Optimize computation orders

In `Roberta/Electra`, when `ClassificationHead` is used, there is a
slicing operation on the features along the sequence_length dimension, and
the loss calculation depends only on this sliced data. This is a slice at
axis 1. Before slicing the shape is [batch, sequence_length, hidden]; after
slicing, it becomes [batch, hidden_stage].

We have the opportunity to move this slicing as early as possible,
by passing it through simple elementwise ops (like Add/Div),
Layernorm/Softmax (if their reduce axis is after the slicing axis), or
even MatMul's left operand (as long as it does not affect the last
dims).

Operators like Reshape/Transpose are special, since they either have
a shape specified as data (which must be updated after slicing) or have
perm specified, which requires the input rank to remain unchanged. For
those kinds of operators, we keep the original rank and leave the
sliced dim as 1; after the computation completes, we do a Squeeze.

```
import torch
from torch import nn


class RobertaClassificationHead(nn.Module):
    """Head for sentence-level classification tasks."""

    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        classifier_dropout = (
            config.classifier_dropout if config.classifier_dropout is not None else config.hidden_dropout_prob
        )
        self.dropout = nn.Dropout(classifier_dropout)
        self.out_proj = nn.Linear(config.hidden_size, config.num_labels)

    def forward(self, features, **kwargs):
        x = features[:, 0, :]  # take <s> token (equiv. to [CLS])
        x = self.dropout(x)
        x = self.dense(x)
        x = torch.tanh(x)
        x = self.dropout(x)
        x = self.out_proj(x)
        return x
```

src\transformers\models\roberta\modeling_roberta.py
src\transformers\models\electra\modeling_electra.py
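
The reordering is valid because slicing along axis 1 commutes with ops that act independently on each token. A minimal pure-Python sketch for the MatMul left-operand case, with made-up shapes and values chosen only for illustration:

```python
def matmul(a, b):
    """Multiply a [rows][inner] matrix by an [inner][cols] matrix."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


# features: [batch=2][sequence_length=3][hidden=2]
features = [
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[7.0, 8.0], [9.0, 10.0], [11.0, 12.0]],
]
weight = [[1.0, 0.0, 2.0], [0.0, 1.0, 3.0]]  # [hidden=2][out=3]

# Original order: MatMul on every token, then slice token 0 (axis 1).
late = [matmul(sample, weight)[0] for sample in features]

# Reordered: slice token 0 first, then MatMul on one row per sample.
early = [matmul([sample[0]], weight)[0] for sample in features]
```

Both orders give the same result, but the reordered version multiplies one row per sample instead of sequence_length rows.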

#### Benchmark

A simple benchmark shows Roberta training latency dropped from 208ms to
199ms, a 4.5+% reduction.
More comprehensive tests are on the way.

2022-12-22 15:12:52 +08:00
Hariharan Seshadri
7ed8bd4f95
Support (Bias)SkipLayerNormalization fusion in GPT2 (#13988) 2022-12-21 23:04:44 -08:00
Edward Chen
df8ff34f25
Update CUDA ArgMin/ArgMax op kernels to have end version 11 since opset 12+ is not supported yet. (#13983)
### Description

Update CUDA ArgMin/ArgMax op kernels to have end version 11 since opset
12+ is not supported yet.
With the way these kernels are currently registered, the documentation
shows support for opset 11+. This is not accurate.

### Motivation and Context

Fix #13781
2022-12-21 19:01:00 -05:00
Numfor Tiapo
8943d623a4
DML EP Register operators for Opset 16 (#14034)
This PR Registers the following operators for opset 16 to the DML EP:

- LeakyRelu-16
- PRelu-16
- Where-16
- GreaterOrEqual-16
- LessOrEqual-16

Identity-16 was not added in this PR due to pipeline failures

Co-authored-by: Numfor Mbiziwo-Tiapo <numform@microsoft.com>
2022-12-21 09:05:12 -08:00
Zhang Lei
fba09faf5b
Implement reuse past and present tensor in Attention Ops. (#13791)
Implement reuse of the kv_cache past and present tensors in Attention Ops.
Add a unit test for the above feature.
Use the kv_cache reuse for past and present tensors in Greedy Search.
Add a correctness test for it.

Co-authored-by: Zhang Lei <phill.zhang@gmail.com>
2022-12-18 10:03:53 -08:00
Jakub Bachurski
3b17ab7c65
Add float64 kernels for Floor, Ceil, IsNaN (#13906)
### Description
This PR adds support for `float64` kernels in the latest versions of
operators: Floor, Ceil and IsNaN.

### Motivation and Context
The lack of these kernels is non-trivial to work around and can easily lead
to performance losses when a workaround is attempted. When equivalence with
an existing implementation is required, precision is easily lost when
casting to `float32` instead.

IsNaN is common when cleaning up data in an ML pipeline. Floor and Ceil
are used for discretising values, and single-precision floats are
insufficient to round well when values get larger than a few million.
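
The precision issue can be reproduced with the Python standard library alone. The sketch below rounds a float64 value to float32 via `struct` and compares the floors; the helper name `to_float32` and the sample value are chosen only for illustration.

```python
import math
import struct


def to_float32(value):
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", value))[0]


x = 16777217.5  # 2**24 + 1.5, exactly representable in float64
floor64 = math.floor(x)
floor32 = math.floor(to_float32(x))
```

Above 2**24 adjacent float32 values are 2 apart, so the cast already moves `x` to an even integer before Floor is applied.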

According to my measurement this only increases the binary size by a few
kilobytes (on the Python wheel of RelWithDebInfo).

Closes #13673 (Round already has float64 support)
Partially solves #8791 (It looks like there are parallel issues/PRs open for
Split, but it is also hard to work around and hence useful)

Signed-off-by: jbachurski <kbachurski@gmail.com>
2022-12-14 14:57:14 -08:00
Hariharan Seshadri
abc5c25a85
Updates to GreedySearch/BeamSearch (#13943) 2022-12-13 20:25:26 -08:00
Patrice Vignola
8246ff015a
[DML EP] Add EmbedLayerNorm (#13868)
### Description
Add EmbedLayerNorm to the DML EP
2022-12-13 13:23:53 -08:00
Jian Chen
d7d932c1c2
Cjian/where python operator (#12795)
**Description**: 
This PR will enable the python tool to run QWhere and QDQWhere operation

**Limitation**:
s8s8 Where is still not supported.
2022-12-12 13:27:47 -08:00
Edward Chen
8cfbc4fe91
Add support for other data types to Split CPU kernel. (#13900)
Split copies data - we can add support for all data types without too much binary size impact by using data type size-based implementations. The DispatchStridedCopy() function used here does this.
2022-12-12 09:29:15 -08:00
Nat Kershaw (MSFT)
21dd341e52
Add Google Analytics to python apidocs (#13901) 2022-12-09 15:44:12 -08:00
Patrice Vignola
96d8d2c278
[DML EP] Add SkipLayerNormalization (#13849)
### Description

Add SkipLayerNormalization for the DML EP
2022-12-07 01:49:14 -08:00
Hariharan Seshadri
004a1538d3
Extend vocab padding for logits MatMul for fp16 GPT2 GreedySearch (#13842) 2022-12-06 19:39:20 -08:00
Patrice Vignola
b53bbe7370
[DML EP] Add an implementation for NonZero (#13768)
### Description
Add the NonZero op for DML



### Motivation and Context
NonZero is used in a few transformer models, so having a DML
implementation will stop large tensors from being transferred to the CPU
and back to the GPU
2022-12-02 18:39:21 -08:00
Patrice Vignola
a0b470bc35
[DML EP] Add mixed datatype support for DML's LayerNorm contrib op (#13734)
### Description
Add mixed datatype support for DML's LayerNorm contrib op.



### Motivation and Context
The fusion logic removes casts around LayerNorm in the graph because the
contrib version of the op supports mixed datatypes. Scale, Bias and
Output's datatypes must match, but input's datatype can be different.
2022-12-01 14:08:18 -08:00
Patrice Vignola
e9b92fdf33
[DML EP] Add DML implementation for BiasGelu (#13795)
### Description
Add DML implementation for BiasGelu
2022-12-01 09:23:19 -08:00
Tianlei Wu
8b0e0f4927
Add RemovePadding and RestorePadding for BERT model (#13701)
Add two operators RemovePadding and RestorePadding based on the idea of
effective transformer (https://github.com/bytedance/effective_transformer) to improve large
batch size inference for BERT model.
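
The idea can be sketched as a pack/unpack pair: drop padded positions before the expensive layers, then scatter the results back into the padded layout. The function names, the `lengths` input and the returned offsets are hypothetical; the real operators' inputs and outputs are not described in this commit.

```python
def remove_padding(batch, lengths):
    """Pack the valid tokens of each sequence into one flat list.

    Returns the packed tokens and, for each one, its (batch_index, position).
    """
    packed, offsets = [], []
    for b, (sequence, length) in enumerate(zip(batch, lengths)):
        for t in range(length):
            packed.append(sequence[t])
            offsets.append((b, t))
    return packed, offsets


def restore_padding(packed, offsets, batch_size, max_length, pad_value=0):
    """Scatter packed tokens back into a padded [batch][max_length] layout."""
    restored = [[pad_value] * max_length for _ in range(batch_size)]
    for value, (b, t) in zip(packed, offsets):
        restored[b][t] = value
    return restored


batch = [[11, 12, 0, 0], [21, 22, 23, 0]]
lengths = [2, 3]
packed, offsets = remove_padding(batch, lengths)
restored = restore_padding(packed, offsets, batch_size=2, max_length=4)
```

Here 5 of the 8 positions carry real tokens, so the layers between the two calls would process 5 rows instead of 8.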
2022-11-22 10:00:23 -08:00
Hariharan Seshadri
c7329e004d
Improve fp16 performance of GPT-2's logits MatMul while using BeamSearch (#13686) 2022-11-18 18:50:19 -08:00
Ye Wang
38a74af45d
Support position_ids broadcasting in EmbedLayerNorm (#13677)


### Motivation and Context

fix https://github.com/microsoft/onnxruntime/issues/13508
2022-11-17 17:56:27 -08:00
pengwa
d5721b3464
Fix wrong import path in docs (#13680)
### Fix wrong import path in docs
2022-11-17 18:15:02 +08:00
Patrice Vignola
3482180ec2
DML EP add a registration for Shape and Size (#13442)
### Description
Add a DML registration for Shape to avoid copying back to the CPU just
to get the shape of a GPU tensor.



### Motivation and Context
When using free dimensions, many Transformers models extensively use the
`Shape` operator. This causes hundreds of GPU->CPU copies that should be
completely avoidable. Note that this change also uses the same
heuristics as other providers (e.g. CUDA) to force some tensors on the
CPU in certain situations.

Co-authored-by: Patrice Vignola <pavignol@microsoft.com>
2022-11-08 19:29:37 -08:00
pengwa
ab9ac2acc4
Add guidelines for ORTModule (#13553)
### Add guidelines for ORTModule

As title.

Feel free to let me know if I missed something. 

2022-11-04 19:42:10 +08:00
pengwa
a3e7da60e7
Trade subgraph recompute for memory (#12852)
**Description**: Subgraph-level recompute

This PR adds an optional capability that trades additional re-computation
for better memory efficiency. Specifically, a pre-defined operator list
is used to iterate over the graph and find subgraphs to recompute,
reducing the stashed activations whose lifetimes span the forward and
backward passes.

When training with ORTModule, by default, the graph transformer scans
the execution graph to find all subgraphs eligible for recompute,
along with the sizes that can be saved. To enable recompute for some
of them, define the environment variable this way:
`export ORTMODULE_ENABLE_MEMORY_ALLEVIATION="Mul+FusedMatMul+Cast+Unsqueeze+Unsqueeze+Cast+Sub+Mul+Add+BiasSoftmaxDropout+Cast+:1:-1,BiasGelu+:1:-1,BitmaskDropout+Cast+:1:-1,FusedMatMul+:1:-1,Cast+:1:-1,Mul+Add+:1:-1,Mul+Sub+:1:-1"`
An example summary looks like the following:
```

[1,0]<stderr>:2,022-10-12 14:47:39.302,954,530 [W:onnxruntime:, memory_alleviation.cc:595 PrintSummary]
[1,0]<stderr>:MemoryAlleviation Summary:
[1,0]<stderr>:  User config:
[1,0]<stderr>:  Mul+FusedMatMul+Cast+Unsqueeze+Unsqueeze+Cast+Sub+Mul+Add+BiasSoftmaxDropout+Cast+:1,BiasGelu+:1,BitmaskDropout+Cast+:1,FusedMatMul+:1,Cast+:1,Mul+Add+:1,Mul+Sub+:1
[1,0]<stderr>:  =================================
[1,0]<stderr>:  Subgraph: BitmaskDropout+
[1,0]<stderr>:          AlleviationType: Disabled
[1,0]<stderr>:          Patterns:
[1,0]<stderr>:                  PatternShape:input_ids_dim0 x 1,024 x   Frequency:1
[1,0]<stderr>:  --------------------------------
[1,0]<stderr>:  Subgraph: BiasGelu+
[1,0]<stderr>:          AlleviationType: Recompute
[1,0]<stderr>:          Patterns:
[1,0]<stderr>:                  PatternShape:input_ids_dim0 x input_ids_dim1 x 4,096 x  Frequency:24
[1,0]<stderr>:  --------------------------------
[1,0]<stderr>:  Subgraph: Reshape[1,0]<stderr>:+
[1,0]<stderr>:          AlleviationType: Disabled
[1,0]<stderr>:          Patterns:
[1,0]<stderr>:                  PatternShape:labels_dim0 x      Frequency:1
[1,0]<stderr>:  --------------------------------
[1,0]<stderr>:  Subgraph: Unsqueeze+Unsqueeze+Cast+Sub+Mul+Mul+FusedMatMul+Cast+Add+BiasSoftmaxDropout+Cast+
[1,0]<stderr>:          AlleviationType: Disabled
[1,0]<stderr>:          Patterns:
[1,0]<stderr>:                  PatternShape:input_ids_dim0 x 16 x input_ids_dim1 x input_ids_dim1 x    Frequency:23
[1,0]<stderr>:  --------------------------------
[1,0]<stderr>:  Subgraph: Mul+FusedMatMul+Cast+Unsqueeze+Unsqueeze+Cast+Sub+Mul+Add+BiasSoftmaxDropout+Cast+
[1,0]<stderr>:          AlleviationType: Recompute
[1,0]<stderr>:          Patterns:
[1,0]<stderr>:                  PatternShape:input_ids_dim0 x 16 x input_ids_dim1 x input_ids_dim1 x    Frequency:1
[1,0]<stderr>:  --------------------------------
[1,0]<stderr>:  Subgraph: Mul+Add+
[1,0]<stderr>:          AlleviationType: Recompute
[1,0]<stderr>:          Patterns:
[1,0]<stderr>:                  PatternShape:input_ids_dim0 x 16 x input_ids_dim1 x 1 x         Frequency:24
[1,0]<stderr>:  --------------------------------
[1,0]<stderr>:  Subgraph: FusedMatMul+Cast+Add+Reshape+Cast+
[1,0]<stderr>:          AlleviationType: Disabled
[1,0]<stderr>:          Patterns:
[1,0]<stderr>:                  PatternShape:input_ids_dim0 x 16 x input_ids_dim1 x 2 x 4 x     Frequency:24
[1,0]<stderr>:  --------------------------------
[1,0]<stderr>:  Subgraph: Mul+Sub+
[1,0]<stderr>:          AlleviationType: Recompute
[1,0]<stderr>:          Patterns:
[1,0]<stderr>:                  PatternShape:input_ids_dim0 x 16 x input_ids_dim1 x 1 x         Frequency:24
[1,0]<stderr>:  --------------------------------
[1,0]<stderr>:  Subgraph: Cast+
[1,0]<stderr>:          AlleviationType: Recompute
[1,0]<stderr>:          Patterns:
[1,0]<stderr>:                  PatternShape:1,024 x 1,024 x    Frequency:97
[1,0]<stderr>:                  PatternShape:3 x 1,024 x        Frequency:1
[1,0]<stderr>:                  PatternShape:8 x 64 x   Frequency:24
[1,0]<stderr>:                  PatternShape:1,024 x 4,096 x    Frequency:24
[1,0]<stderr>:                  PatternShape:4,096 x    Frequency:24
[1,0]<stderr>:                  PatternShape:4,096 x 1,024 x    Frequency:24
[1,0]<stderr>:  --------------------------------
[1,0]<stderr>:  Subgraph: FusedMatMul+
[1,0]<stderr>:          AlleviationType: Recompute
[1,0]<stderr>:          Patterns:
[1,0]<stderr>:                  PatternShape:input_ids_dim0 x input_ids_dim1 x 4,096 x  Frequency:24
[1,0]<stderr>:  --------------------------------
[1,0]<stderr>:  =================================
```


"Type config:" whether recompute is enabled by users. 0 - disable, 1-
enable.
"Subgraph" indicates which kind of subgraph will be recomputed; in this
case, it is a single node "Gelu", and it will be "Recompute".
"Shape && Frequency" means that, for this recompute, one tensor of size
(batch size, 500) will be saved because it will be recomputed.

**Baseline**

On a 1P model (DEBERTA V2) with sequence length 256, training with 16 A100
GPUs on the latest main branch, we can run batch size 16, and the maximum
batch size is < 32, so 16 is usually chosen by data scientists. 65% of the
40GB memory is used during training. SamplesPerSec=479.2543353561354.


![image](https://user-images.githubusercontent.com/10530022/188320941-13dde5e7-c32b-4399-a64b-6803fbb9dcda.png)

**With this PR**

Gelu is recomputed to reduce peak memory, so batch size 32 can be run.
97% of the 40GB A100 memory is used, and SamplesPerSec=562.041593991271
(**1.17X** of baseline).


![image](https://user-images.githubusercontent.com/10530022/188321081-f64811bf-9637-4873-8095-349de8d498cc.png)


2022-11-03 13:49:41 +08:00
Vincent Wang
8b0669bf63
QuickGelu Fusion (#12417)
Some models have QuickGelu(x)=x*sigmoid(1.702x), which has 3 Ops for
forward and 5 Ops for backward. The PR is to fuse this to a single Op
named QuickGelu and its gradient QuickGeluGrad.
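
For reference, the fused forward and backward computations can be written per element as below. This is a minimal pure-Python sketch derived directly from the formula above; the function names are illustrative and are not the kernel entry points.

```python
import math

ALPHA = 1.702


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def quick_gelu(x, alpha=ALPHA):
    # Forward: x * sigmoid(alpha * x).
    return x * sigmoid(alpha * x)


def quick_gelu_grad(dy, x, alpha=ALPHA):
    # d/dx [x * s(alpha x)] = s + alpha * x * s * (1 - s), with s = sigmoid(alpha x).
    s = sigmoid(alpha * x)
    return dy * (s + alpha * x * s * (1.0 - s))


y = quick_gelu(0.7)
dx = quick_gelu_grad(1.0, 0.7)
```

Computing the sigmoid once and reusing it in the gradient is what lets the several backward ops collapse into one.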

For CUDA, tested in V100 using input tensor with shape [64,128,2048] and
float16 type:
Before, FW takes 335us, BW takes 614us

![image](https://user-images.githubusercontent.com/11661208/182291335-15188709-ffe7-44d1-9d14-0b544cbe5e55.png)

After, FW takes 115us, BW takes 139us, which is much faster.

![image](https://user-images.githubusercontent.com/11661208/182291502-f0b5161c-b95c-45fc-90f8-ad0c592d2433.png)

For the CPU kernel, using the same shape and float type:
Before, FW takes 10ms, BW takes 49ms
Mul: 3480[µs]
Sigmoid: 1996[µs]
Mul: 4789[µs]
Mul: 4642[µs]
Mul: 4195[µs]
SigmoidGrad: 18328[µs]
Mul: 2988[µs]
Sum: 18576[µs]

After, FW takes 4ms, BW takes 5ms, which is also much faster.
QuickGelu: 3939[µs]
QuickGeluGrad: 5089[µs]

Co-authored-by: Vincent Wang <weicwang@microsoft.com>
2022-10-28 18:12:07 +08:00
Changming Sun
07271b6c8a
Update docs/OperatorKernels.md (#13485) 2022-10-27 20:11:49 -07:00
Scott McKay
ab71c4bbc0
Document generation CI is broken (#13308)
### Description
Fix document generation CI. It's not currently updating the docs as
we're skipping the tests, which is the invocation of build.py that would
have generated the documentation.

Setup specific task to generate documentation for greater clarity. 

### Motivation and Context
Operator kernel documentation is not getting updated and is now out of
date.
2022-10-28 07:20:48 +10:00
Tianlei Wu
7aafd86229
Update Attention operator to support separated Q/K/V inputs (#13410)
### Description
Allow separated Q, K and V inputs to support cross attention:
* Q: [batch_size, sequence_length, hidden_size]
* K: [batch_size, kv_sequence_length, hidden_size]
* V: [batch_size, kv_sequence_length, v_hidden_size]
* Output: [batch_size, sequence_length, v_hidden_size]

To use separated Q/K/V inputs, the input tensor is used for query, and two
optional inputs are added for key and value. Weights for input
projection are not included for now, so the MatMul of input projection
shall be done outside the Attention operator, but Add bias is included for
performance reasons.
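
The shape contract listed above can be checked with a small single-head sketch of scaled dot-product attention over nested lists. It is a simplification under stated assumptions: no bias, no mask, and one head (the real operator splits hidden_size across num_heads); `cross_attention` is a hypothetical name.

```python
import math


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(q, k, v):
    """q: [batch][seq][hidden], k: [batch][kv_seq][hidden], v: [batch][kv_seq][v_hidden]."""
    out = []
    for qb, kb, vb in zip(q, k, v):
        scale = 1.0 / math.sqrt(len(qb[0]))
        rows = []
        for q_row in qb:
            scores = [scale * sum(a * b for a, b in zip(q_row, k_row)) for k_row in kb]
            probs = softmax(scores)
            rows.append([sum(p * v_row[j] for p, v_row in zip(probs, vb))
                         for j in range(len(vb[0]))])
        out.append(rows)
    return out


q = [[[0.1 * (i + j) for j in range(4)] for i in range(3)] for _ in range(2)]  # [2][3][4]
k = [[[0.2 * (i - j) for j in range(4)] for i in range(5)] for _ in range(2)]  # [2][5][4]
v = [[[float(j) for j in range(6)] for _ in range(5)] for _ in range(2)]       # [2][5][6]
output = cross_attention(q, k, v)  # [2][3][6]
```

The output keeps the query's sequence length and takes the value's hidden size, matching the Output shape in the list.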
2022-10-25 11:51:06 -07:00
Jian Chen
397edf9918
Bumping up version number to 1.14.0 on main branch (#13401)
### Description
Bumping up version number to 1.14.0



2022-10-21 19:16:44 -04:00
Ye Wang
928c9889a3
A few fixes for generative model ops (#13363)
### Description

Fix a bug in GreedySearch Op when batch > 1
Support custom attention mask in GreedySearch and BeamSearch with GPT2 


2022-10-21 15:00:18 -07:00
Yi Zhang
ea128cdb18
skip windows GPU check if changes only in doc (#13248)
### Description
Use Path filter and fake workflow to skip windows GPU check if there's
only changes in doc.
Refs:

https://docs.github.com/en/repositories/configuring-branches-and-merges-in-your-repository/defining-the-mergeability-of-pull-requests/troubleshooting-required-status-checks#handling-skipped-but-required-checks

The fake github yaml is generated by code.

### Verifications
In this PR:
Since win-gpu-ci-pipeline.yml and .github are updated, the real
Windows GPU workflows are always triggered.

In #13256:
To avoid updating win-gpu-ci-pipeline.yml, I added the path filter in
the devops page. The fake Windows GPU workflows were triggered, and the
real workflows were skipped.
2022-10-11 13:51:44 +08:00
garanews
38906625a3
fix some typo in docs (#13212)
### Description
fix some typo in docs


### Motivation and Context
singed vs signed
succeding vs succeeding 
fileter vs filter
kernal vs kernel
libary vs library
2022-10-07 15:58:18 -07:00
ashari4
b09dd11ece
BFP schemas: Change block dimension type to Int (#13169)
* Change block dimension type to Int from Ints.
* In response to feedback that the block dimension corresponds to the
reduction dimension of the consuming matrix multiplication. There is
always only 1 reduction dimension.
2022-10-06 11:11:43 -07:00
Tony Xia
c7522e547a
Fixed a minor typo (#13194)
### Description
binraries ==> binaries



2022-10-05 12:10:14 -07:00
Tony Xia
962fee5fe5
Fix typo enviroment => environment (#13195) 2022-10-03 17:02:26 -07:00
Changming Sun
dd2aec170d
Update Coding_Conventions_and_Standards.md (#7705) 2022-09-29 23:32:37 -07:00
ashari4
c4a7e88fc8
QuantizeBFP and DequantizeBFP (#12833)
* `QuantizeBFP` and `DequantizeBFP` schemas - similar to
`QuantizeLinear` and `DeQuantizeLinear`.
* BFP datatype is represented as a `uint8` tensor with shape and stride
metadata. This is preferable to adding a new datatype for BFP, which is
more disruptive and [discouraged by
PyTorch](https://discuss.pytorch.org/t/training-with-custom-quantized-datatype/152132/2).

Context: 

The Microsoft Floating Point (BFP) datatype shares an exponent for every
n numbers called a “bounding box.” Each number still has its own
mantissa and sign bits. BFP has been shown to incur 3-4x less cost
(energy and area) than BFloat16 and INT8 counterparts without reductions
in accuracy for the ImageNet benchmark as described in [Rouhani
2020](https://proceedings.neurips.cc/paper/2020/file/747e32ab0fea7fbd2ad9ec03daa3f840-Paper.pdf).
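
The shared-exponent idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `quantize_bfp` and `dequantize_bfp`, the `box_size` and `mantissa_bits` parameters, and the tuple layout are hypothetical, and the sketch does not reproduce the `uint8` tensor layout or the actual operator schemas from this PR.

```python
import math


def quantize_bfp(values, box_size=4, mantissa_bits=4):
    """Split values into bounding boxes that each share one exponent.

    Returns a list of (shared_exponent, signed_integer_mantissas) tuples.
    Hypothetical layout, chosen for readability only.
    """
    boxes = []
    for start in range(0, len(values), box_size):
        box = values[start:start + box_size]
        max_abs = max(abs(v) for v in box)
        # frexp gives max_abs == m * 2**e with 0.5 <= m < 1; e is shared by the box.
        shared_exp = math.frexp(max_abs)[1] if max_abs > 0 else 0
        # Size of one mantissa step at this shared exponent.
        scale = 2.0 ** (shared_exp - mantissa_bits)
        limit = 2 ** mantissa_bits - 1
        # Each number keeps its own sign and mantissa, clamped to the bit budget.
        mantissas = [max(-limit, min(limit, round(v / scale))) for v in box]
        boxes.append((shared_exp, mantissas))
    return boxes


def dequantize_bfp(boxes, mantissa_bits=4):
    """Rebuild approximate floats from (shared_exponent, mantissas) tuples."""
    out = []
    for shared_exp, mantissas in boxes:
        scale = 2.0 ** (shared_exp - mantissa_bits)
        out.extend(m * scale for m in mantissas)
    return out
```

Values much smaller than the largest value in their bounding box lose precision, because every element of the box is expressed in steps of the same shared scale.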

Requirements:

* There are many variants of BFP (number of mantissa bits, number of
shared exponent bits, size of bounding box, custom bit fields, etc.)
* The size and layout of a BFP variant varies across hardware
* The bounding box can be over arbitrary dimensions; for example, for the
channel "C" dimension in a N x C x H x W tensor for convolution

Goals of this PR:

* Add initial versions of QuantizeBFP and DequantizeBFP operators to
enable QDQ-style quantization with BFP. Once the schemas stabilize, we
can consider upstreaming to ONNX.
* Add some basic type and shape inferencing tests; tests that run on an
EP will be a follow-up.
2022-09-22 14:02:55 -07:00
Edward Chen
454f77cd94
Update kernel matching logic: decouple from op schemas and remove kernel def hashes (#12791)
# Motivation
Currently, ORT minimal builds use kernel def hashes to map from nodes to
kernels to execute when loading the model. As the kernel def hashes must
be known ahead of time, this works for statically registered kernels.
This works well for the CPU EP.
For this approach to work, the kernel def hashes must also be known at
ORT format model conversion time, which means the EP with statically
registered kernels must also be enabled then. This is not an issue for
the always-available CPU EP. However, we do not want to require that any
EP which statically registers kernels is always available too.
Consequently, we explore another approach to match nodes to kernels that
does not rely on kernel def hashes. An added benefit of this is the
possibility of moving away from kernel def hashes completely, which
would eliminate the maintenance burden of keeping the hashes stable.

# Approach
In a full build, ORT uses some information from the ONNX op schema to
match a node to a kernel. We want to avoid including the ONNX op schema
in a minimal build to reduce binary size. Essentially, we take the
necessary information from the ONNX op schema and make it available in a
minimal build.
We decouple the ONNX op schema from the kernel matching logic. The
kernel matching logic instead relies on per-op information which can
either be obtained from the ONNX op schema or another source.
This per-op information must be available in a minimal build when there
are no ONNX op schemas. We put it in the ORT format model.
Existing uses of kernel def hashes to look up kernels are replaced
with the updated kernel matching logic. We no longer store
kernel def hashes in the ORT format model’s session state and runtime
optimization representations. We no longer keep the logic to
generate and ensure stability of kernel def hashes.
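
A rough sketch of the approach described above, matching a node to a kernel from per-op information instead of a kernel def hash. All names here (`KernelDef`, `Node`, `find_kernel`, `op_type_info`) are hypothetical and chosen for illustration; they are not the ORT classes or the ORT format model layout.

```python
from dataclasses import dataclass, field


@dataclass
class KernelDef:
    name: str
    domain: str
    op_type: str
    since_version: int
    end_version: int  # inclusive upper bound of supported op versions
    # Type constraint name -> set of allowed element types (hypothetical shape).
    type_constraints: dict = field(default_factory=dict)


@dataclass
class Node:
    domain: str
    op_type: str
    since_version: int
    input_types: list  # element type of each input, in order


def find_kernel(node, kernels, op_type_info):
    """Return the first kernel whose version range and type constraints fit.

    op_type_info stands in for the per-op information: it maps
    (domain, op_type, since_version) to the constraint name of each input.
    It could be built from an op schema or loaded from the model file.
    """
    constraint_names = op_type_info[(node.domain, node.op_type, node.since_version)]
    for kernel in kernels:
        if (kernel.domain, kernel.op_type) != (node.domain, node.op_type):
            continue
        if not kernel.since_version <= node.since_version <= kernel.end_version:
            continue
        # A constraint the kernel does not list is treated as unconstrained.
        if all(
            name not in kernel.type_constraints or t in kernel.type_constraints[name]
            for name, t in zip(constraint_names, node.input_types)
        ):
            return kernel
    return None
```

The point of the sketch is that nothing in the lookup depends on a precomputed hash, so the kernel registrations can change without invalidating previously converted models.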
2022-09-20 14:24:59 -07:00
Alexey Gladyshev
2b5b11d373
[C#][TVM EP] Fix issues related to using TVM EP in C# front-end (#12958)
Changes in this PR:
* Update building of Nuget package for TVM EP
* Update of documentation for using TVM EP in C#
2022-09-16 16:04:59 +02:00
RandySheriffH
64466c2d62
Remove nuphar provider folder (#12939) 2022-09-13 09:10:52 -07:00
Dwayne Robinson
8e4eb24648
Update operator kernel table to include DML operators (#12887)
* Fix bug in pybind get_all_operator_schema due to premature reference dropping
* Add updated operator kernels markdown table
* Update build.py to include documentation generation for DML operators too
* Update GPU pipeline to include DML in the build so operators can be generated.
* Use a separate pipeline stage, feedback from Changming and Scott
* Appease annoying Python linter
* Add onnxruntime_BUILD_UNIT_TESTS=OFF and remove stale --use_dml in cuda stage
2022-09-09 10:21:25 -07:00
Hariharan Seshadri
ad69aac491
Introduce ordered quantization ops for the CUDA EP [1/n] (#12582)
Initial small core set of ordered quantization ops for the CUDA EP.
2022-09-07 11:58:15 -07:00
Yulong Wang
1a402a3f25
replace 'master' branch ref to 'main' for onnx repo (#12678) 2022-08-30 13:41:42 -07:00
Yulong Wang
c144acc534
Replace 'master' branch ref to 'main' in the code (#12547) 2022-08-22 10:48:12 -07:00
Wei-Sheng Chin
dc486d146b
Make ORT callable from various Pytorch compilers (LazyTensor, TorchDynamo, etc) (#10460)
* Make ORT as Pytorch JIT backend

LORT likely doesn't work with aten fallback so we only test LORT in its own CI.

* Revert changes to enable external CUDA allocator. Will add it later.

Revert "Revert changes to enable external CUDA allocator. Will add it later."

This reverts commit d5487f2e193014c805505afae8fb577c53667658.

Fix external allocator

* Relax tolerance and remove commented code

* Print more information in CI

* Fix pointer

* Address comments.
1. Reuse ORT-eager mode's environment.
2. Remove unused ctor.

* Use Pytorch master branch as all PRs are merged

Fix

* Refine based on cpplint feedbacks

* Revert changes to allow custom CUDA allocator in public APIs

* Use torch.testing.assert_close

* Use unittest framework

* Switch docker repo

* Rename *.cpp to *.cc

* Address comments

* Add comment

* Use same pipeline file for eager and lort pipelines

* Address comments

* Add yaml comment

* Fix cmake files

* Address comments

* Rename flags, remove printing code, remove dead comment
2022-08-22 09:40:40 -07:00
Cheng
64e991a9fc
[Qlinearsoftmax] contrib cpu (#12177)
* [Qlinearsoftmax] contrib cpu

* int8 implementation

* contrib operator md

* qdq transformer test

* new attribute: opset

* doc

* quantized tool

* remove template to reduce Binary size

* doc of contrib operators

* enforce x_shape is valid

* fix reduce_size if input-shape is dynamic

* add UT

* register one op for reducing binarysize

* kernel hash update

* docs/ContribOperators.md
2022-08-10 10:52:02 +08:00
Vincent Wang
cfa09d16d9
[CUDA] Mod Op Kernel (#12499)
* mod for cuda and rocm

* fix bfloat16 ut

* change bf16 ut number

* fix opset version

* fix op kernel doc
2022-08-09 13:05:40 +08:00
Vincent Wang
37995a7245
[CUDA] BiasSoftmax Supporting New Pattern (#12361) 2022-08-05 06:59:24 +08:00
Scott McKay
a3de1bbf7d
Update script to find optimizers that potentially need supported opset updates (#12330)
* Update to handle multiline declarations for the kernels which are typical these days.
* Update to new path for the cpu contrib_op kernel registrations.
* Update tools/python/find_optimizer_opset_version_updates_required.py

Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
2022-08-04 07:37:27 +10:00
Dmitri Smirnov
dc984a03d5
Container and memory allocation guidelines (#12387)
Container and memory allocation guidelines
  Re-org and add code samples
  Clarify the wording on returning gsl::span
2022-08-03 10:31:59 -07:00
Changming Sun
44ec2cf088
Update publish-python-apidocs.yml (#12433) 2022-08-03 10:17:00 -07:00
Ye Wang
b622e5fa9b
Support vocab_mask/prefix_vocab_mask/no_repeat_number in greedysearch op (#12327)
* support more inputs for greedy search

* fix docs

* refactor test

* lint

* review comments
2022-08-03 10:10:08 -07:00
Valery Chernov
e2423bb55c
[TVM EP] Build on Windows with ipp-crypto support (#12336)
* update TVM EP docs for ipp-crypto build conditions

* add ipp-crypto by ExternalProject_Add

Co-authored-by: Valery Chernov <valery.chernov@deelvin.com>
2022-07-28 15:40:19 +02:00
Ye Wang
89ac61f4d4
support gpt2 model with greedy search (#12068)
* greedy search gpt2 cpu checkin

* add cuda support

* add test

* provider

* update

* fix some bugs

* refactor impl class

* refactor test

* remove unused func

* refactor parameters class

* simplify padding

* fix lint warnings

* python format

* Revert "python format"

This reverts commit f25fe1017fa33d960b2418ebbb5dba6a4bd043cf.

* python format

* fix pipelines

* fix pipeline

* move bufferallocater to generate_impl_base

* review comments(alignment, filename/namespace change)

* rebase2

* python reformat

* reformat

* fix rocm build

* review comment

* review comments

* review comments

* fix a bug

* rebase test files

* python format

* format import order

* review comments

* fix build
2022-07-22 15:45:16 -07:00
RandySheriffH
0264a9c29b
Bump ort version number (#11948)
* bump ort version number

* update link and note url

* update version to silence assert

Co-authored-by: Randy Shuai <rashuai@microsoft.com>
2022-07-22 12:55:53 -07:00
Valery Chernov
3b0aaa9e0e
[TVM EP] support build on Windows (#11851)
* add description of build ORT+TVM EP on Windows

* fix cmake error related to symlink creation on Windows

* add llvm config path to build flags for correct build on Windows

* update TVM_EP.md for llvm_config build arg

* fix warnings skipping during build on Windows

* fix using string or wstring for model path to correct build on Windows (MSVC error)

* fix error in custom logger for correct build on Windows

* implement glob algorithm for Windows

* additional build fixes

* update TVM with export of VM symbols for dll

* description of nasm issue and workaround

* update TVM with export of Executable from VM symbols for dll

* description of installation of ipp-crypto dependencies on Windows

* cmake key for ipp-crypto build

* fix wstring for TVMso EP

* fix ipp-crypto build

* cmake key onnxruntime_TVM_USE_HASH switch off not specific methods, but full hash functionality

* fix absolute path to compiled lib

* update TVM_EP.md, fix lint warnings

* update TVM_EP.md

* small fixes after review

* switch on handshake functionality for Linux workflow

Co-authored-by: Valery Chernov <valery.chernov@deelvin.com>
Co-authored-by: KJlaccHoeUM9l <wotpricol@mail.ru>
2022-07-13 10:48:42 +02:00
PeixuanZuo
5579d81fc8
[add] Add operator gemmfastgelu for ROCM (#12101)
* [ADD] add gemm fast gelu

* [UPDATE] refunction matmul_impl

* [Update] delete tuning_ in this pr

* [FIX] code format

* [FIX] compiler warning

* [Update] update doc
2022-07-13 15:40:16 +08:00
Preetha Veeramalai
99a370dd02
Update readme for OVEP (#12122)
* Add changes for training module in Readme

* Update ReadMeOV.rst
2022-07-11 10:54:12 -07:00
Valery Chernov
8ba8146650
[TVM] handshake mechanism for support of TVMso EP (#11437)
* infrastructure for handshake mechanism was implemented. sha256 was selected as first hash algorithm

* check hash during compile in TVMso EP

* add IPP-CRYPTO to external dependencies for TVM EP

* made checkHash method constant

* removed the public implementation of the SHA-256 algorithm so as not to cause a license conflict

* implemented SHA-256 calculation using ipp-crypto library

* fix dependency for ipp-crypto

* add provider options for hash check

* update documentation for added provider options

* add hash check condition

* fix docs

* fix lint

* fix ORT_THROW

Co-authored-by: Valery Chernov <valery.chernov@deelvin.com>
Co-authored-by: KJlaccHoeUM9l <wotpricol@mail.ru>
2022-06-29 14:57:18 +02:00
Gary Miguel
dc5d6b9515
register signal ops for opset 17 (#11778)
* Register signal ops for op set 17

Note code is mostly being moved, not added. These ops were previously
only registered as Microsoft contrib ops and only built if
`BUILD_MS_EXPERIMENTAL_OPS=1`. They've been added to the ai.onnx
standard op set in version 17.

Main components of this change:

* Move the kernels from the contrib_ops directory to the
  core directory.
* Add function bodies for ms experimental ops. This will allow
  old models that use the contrib ops to continue to function.
  All the function bodies consist of a single op (the
  new standard op), so performance overhead should be minimal.

Minor clean-up also in this change:

* De-duplicate get_scalar_value_from_tensor: put it in a new utils.h.
* Fix some bugs that caused compilation errors with the experimental
  ops. Tested with `build.sh --ms_experimental`
* Fix some spelling errors and lint violations.
* Replace a couple of switch statements with `MLTypeCallDispatcher`.
* Use `InlineVector` instead of `std::vector`.

Unblocks https://github.com/microsoft/onnxruntime/issues/11640
2022-06-27 10:26:55 +10:00
Gary Miguel
4bf22e2a40
Update ONNX to 1.12 (#11924)
Follow-ups that need to happen after this and before the next ORT release:
* Support SequenceMap with https://github.com/microsoft/onnxruntime/pull/11731
* Support signal ops with https://github.com/microsoft/onnxruntime/pull/11778

Follow-ups that need to happen after this but don't necessarily need to happen before the release:
* Implement LayerNormalization kernel for opset version 17: https://github.com/microsoft/onnxruntime/issues/11916

Fixes #11640
2022-06-21 17:19:52 -07:00
Ye Wang
859ef277a0
apply zcode changes to the beam search op (#11880)
* apply zcode  changes to the beam search op

* fix pipeline failure

* add doc

* workaround for C#

* update

* update

* use name zcode

* review comment

* review comments

* fix cpplint

* review comments
2022-06-20 18:39:07 -07:00
Tianlei Wu
6ee2c1b5fc
Remove temperature input from BeamSearch operator (#11896)
* remove temperature input
* update index of remaining inputs
2022-06-20 09:50:45 -07:00
sfatimar
f97bd38c4f
UEP 4.1 release (#11834)
* Add pypi build changes to latest Master

* Add ORT training part of OV build

* Disabling SqueezeOpTest.BadAxes

* Add ONNXruntime branch ARG to Docker build

* Changes to include file details versions

* Commit File Version Updates

* Change naming for linux build

* Add fix for pylint format errors

* Fix pylint warnings.

* Fix pylint errors - stage 2

Signed-off-by: Preetha Veeramalai <preetha.veeramalai@intel.com>

* Fix pylint errors - stage 3

* Fix pylint format - stage4

Signed-off-by: Preetha Veeramalai <preetha.veeramalai@intel.com>

* Commit for Wheel Release >0.35.1

Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com>
Co-authored-by: mayavijx <mayax.vijayan@intel.com>
Co-authored-by: Sahar Fatima <sfatima.3001@gmail.com>
Co-authored-by: nmaajidk <n.maajid.khan@intel.com>
2022-06-17 14:49:04 -07:00
Gary Miguel
e8b0d24071
Support per-test tolerances for ONNX tests (#11775)
Prior to this every test shared the same tolerances. This meant
that if an ONNX test failed due to a small but acceptable difference in
output, the only alternative was to disable the test entirely.

In op set 17, the DFT operator is being added. Without this change, the
tests for that operator fail because the output is off by about 5e-5.
It's better to keep test coverage for this new op rather than disable
the test entirely.
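
A minimal sketch of the per-test tolerance idea described above. The default values, the `test_dft` override, and the helper names are illustrative assumptions, not the actual tolerances or file format used by the ORT test runners.

```python
import math

# Hypothetical global defaults shared by every test.
DEFAULT_TOLERANCES = {"atol": 1e-5, "rtol": 1e-4}

# Hypothetical per-test overrides; only the listed keys replace the defaults.
PER_TEST_OVERRIDES = {"test_dft": {"atol": 1e-4}}


def tolerances_for(test_name):
    """Merge the global defaults with any override registered for this test."""
    return {**DEFAULT_TOLERANCES, **PER_TEST_OVERRIDES.get(test_name, {})}


def outputs_match(test_name, expected, actual):
    """Compare two flat sequences of floats using the test's own tolerances."""
    tol = tolerances_for(test_name)
    return len(expected) == len(actual) and all(
        math.isclose(e, a, rel_tol=tol["rtol"], abs_tol=tol["atol"])
        for e, a in zip(expected, actual)
    )
```

With a scheme like this, an output that is off by about 5e-5 can pass for the one test that needs a looser bound while every other test keeps the stricter default.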

Also prior to this change, the global tolerances were not shared between
C++, JavaScript, and Python tests. Now they are.

Also fix various minor issues raised by linters.

Unblocks https://github.com/microsoft/onnxruntime/issues/11640.
2022-06-14 15:12:23 -07:00
Chun-Wei Chen
63c483a998
1.12.0 is the right TBD instead of released 1.11.0 (#11817) 2022-06-13 14:27:59 -07:00
Tianlei Wu
def78a1b81
Support T5 in BeamSearch operator (#11450)
(1) Support T5 in BeamSearch operator, and add both CPU and CUDA implementation.
(2) Change BeamSearch op: rename encoder_decoder_init attribute to encoder, and add decoder_start_token_id attribute
(3) Update convert_to_onnx for T5 to use int32 instead of int64 inputs as default.
(4) Add more tests in best_beam_search.py
(5) fix ORT_ENFORCE of hypothesis_buffer_offset_
(6) Improve ONNX conversion:
   (a) Change encoder some dynamic axes to fixed dim value
   (b) add --separate_encoder_and_decoder_init
   (c) correct name t5-3B => t5-3b, t5-11B => t5-11b
   (d) Add --use_int32_inputs in convert t5 to onnx
   (e) Allow t5 beam search conversion in one step
2022-06-10 15:06:57 -07:00
Alexey Gladyshev
331c387f4a
[TVM EP][DOC] Documentation update for TVM EP due to the addition of precompiled model support. (#11743)
* update description of TVM EP options in docs

* update sample notebook

* update TVM EP documentation

* add link to description of options

Co-authored-by: Valery Chernov <valery.chernov@deelvin.com>
2022-06-08 14:56:01 +02:00
Hector Li
95a16c1ffe
Snpe ep (#11665)
* Initiate Ort SNPE EP
* fix snpe ep windows build which is caused by the utility method (ToUTF8String) name change on master
* correct the source path for libonnxruntime.so while building for android package
* add AdditionalDependencies for arm64
* On MS-Windows, the patchfile must be a text file, i.e. CR-LF must be used as line endings. A file with LF may give the error: "Assertion failed, hunk, file patch.c, line 343," unless the option '--binary' is given.
* fix build failure if snpe is not enabled
* update doc for contrib op
* separate out snpe ep settings to onnxruntime_snpe_provider.cmake
* renaming according to review comments
* update according to review comments
2022-06-03 14:10:02 -07:00
Gary Miguel
74bc4c07f6
Fix C# and numbering (#11643)
* C# protocol buffer code can be updated on Linux. Link to the relevant instructions.
* Fix numbering.
2022-05-31 11:33:36 -07:00
Vincent Wang
02724c54ff
[CUDA] Implement BitmaskDropout, BitmaskBiasDropout and BitmaskDropoutGrad (#11534)
* Implement BitmaskDropout and associated unit tests.

* Implement BitmaskDropoutGrad and associated unit tests.

* Implement Dropout -> BitmaskDropout rewrite rule and associated unit tests.

* Implement (Dropout,DropoutGrad) -> (BitmaskDropout,BitmaskDropoutGrad) rewrite rule.

This commit does not yet include unit tests for this rewrite rule.

This commit also introduces improved documentation for all changes which will be grouped
into this PR.

* bitmask dropout

* fix win build

* bugfix for rocm

* bugfix

* fix code format

* fix ut

* fix build break

* fix ut in win

* resolve comments

* fix ut in trt

* resolve comments

* fix rocm build error

* fix typo

Co-authored-by: Aidan Beggs <aidanbeggs@microsoft.com>
2022-05-27 17:24:47 +08:00
Justin Chu
c541063245
Format coding conventions documentation (#11405)
Add proper formatting to code blocks to make the doc more readable.

- Wrap code blocks with `
- Fix typos
2022-05-09 10:19:15 -07:00
Justin Chu
fdce4fa6af
Format all python files under onnxruntime with black and isort (#11324)
Description: Format all python files under onnxruntime with black and isort.

After checking in, we can use .git-blame-ignore-revs to ignore the formatting PR in git blame.

#11315, #11316
2022-04-26 09:35:16 -07:00
Justin Chu
6fb29f5b9a
Add python docstring linting in vscode settings (#11316)
Add python docstring linting in vscode settings
Use black and isort for python code formatting in VScode. Import sorting enabled on save. Code formatting available in VSCode with manual trigger.
Adopted from pytorch https://github.com/pytorch/pytorch/blob/master/.vscode/settings_recommended.json
2022-04-23 06:23:04 -07:00
Chun-Wei Chen
b9279f637d
update How_To_Update_ONNX_Dev_Notes with right paths (#11074) 2022-04-01 08:05:31 -07:00
Xavier Dupré
c37d2728bf
Implement TreeEnsemble for opset(ai.onnx.ml)==3 (#10821)
* Implement TreeEnsemble for opset(ai.onnx.ml)==3
* use of InlineVector
* refactoring
* improve attributes retrieval
* avoid creating a temporary buffer
* modifies onnx.ml.cpu.json
* use unordered_map
* update docs/OperatorKernels.md
* address PR comments (TH -> ThresholdType, ORT_RETURN...)
* add a python unit test to load a TreeEnsembleRegressor following ai.onnx.ml==3 specifications
2022-03-30 12:53:12 +02:00
Vincent Wang
6a6840d5c6
Fuse LayerNormalization for Apex O2 (#10233) 2022-03-29 21:22:04 +08:00
Chi Lo
8ba52b0a05
Bump master version to 1.12 (#10797)
* bump master version to 1.11

* bump master version to 1.12
2022-03-28 12:30:11 -07:00
pengwa
89ef987ab1
Improve NonZero on CUDA/ROCM (#10307)
* improve NonZero

* fix megatron_fp16 optimizer, fix the doc

* multi_tensor_applier

* resolve comment

* fix building warning

* fix build error when enabling training and use tensorrt
2022-03-25 07:35:45 +08:00
Nat Kershaw (MSFT)
2d961604b1
Refactor Python API docs to better explain IO binding scenarios (#10651) 2022-03-15 09:40:59 -07:00
Hariharan Seshadri
a9d9c6b486
Register CPU, CUDA and ROCM opset-16 kernels for some operators (#10643) 2022-03-08 09:18:39 -08:00
liqun Fu
da885a72e8
update with onnx 1.11 release (#10441) 2022-03-07 21:10:55 -08:00
Tianlei Wu
0e335aba37
Update BeamSearch operator spec to support t5 (#10777)
* change BeamSearch op to support encoder decoder model

* check model_type and decoder attribute

* fix

* update comments

* warn shape inference issue with onnx v1.11 or T5

* skip parity test when temperature != 1.0

* fix build
2022-03-04 21:52:45 -08:00
Tianlei Wu
36c3271546
BeamSearch op cuda (#10556)
Add BeamSearch cuda implementation with support of fp16 GPT-2 subgraph
2022-02-25 13:08:55 -08:00
Dmitri Smirnov
2679711bee
Refactor transformers and other code to reduce memory allocation calls (#10523)
Work on minimizing memory management calls by
  reducing number of allocations and copies.
  Replace std::unordered_set to InlinedHashSet
  and add usage of InlinedVector.
  Employ std::move() to minimize copying and memory allocations.
  Remove copying of the const shared data into each of the
  PropagateCast transformer instances.
  Move inlined_containers.h header to include/common
  Adjust AsSpan implementation for C++ < 17
2022-02-24 16:17:14 -08:00
Alexey Gladyshev
7dc7529ec8
[TVM EP] Integrate tests for TVM EP into public onnxruntime CI (#10505)
* add support for bool type

* add TVM EP support for tests

* include TVM EP in python test pool

* fix pylint

* moved technical imports to a separate file

* clean up post build actions & move _ld_preload.py extension to CMake level

* add files for include TVM EP into CI

* implement custom logger for TVM

* replace TVM logging with ONNX RT logging

* update link for TVM EP tutorial

* clean up TVM EP cmake

* add pybind auto enabling for TVM EP

* fix blank spaces

* code review fixes

* replace print with comment

* add list of EP without TVM EP

* enable onnx tests

* disable contrib ops and ml ops

* reuse Dockerfile.ubuntu

* Move install_tvm_test_dependencies.sh out of Docker context dir, update build definition.

Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
2022-02-24 16:24:23 +01:00
Scott McKay
df841ee87d
Fix incorrect type constraint registration for operator kernels. (#10489)
* Fix incorrect type constraint registration for RoiAlign. This led to the input type not actually being checked when matching a kernel as the invalid constraint name is treated as a missing optional input.
  * fix missing dependency for the unit test exe. Whilst it doesn't link against the CUDA providers lib, without the dependency VS doesn't know it needs to rebuild the library if there are changes.
* Add check for invalid type constraints.
* Fix invalid registrations for other kernels.
* Add hash replacement logic to provide backwards compatibility in ORT format models when the registration is fixed.
* Add tests
2022-02-18 16:55:32 +10:00
Valery Chernov
1cdc23aba4
[TVM EP] Rename Standalone TVM (STVM) Execution Provider to TVM EP (#10260)
* update java API for STVM EP. Issue is from PR#10019

* use_stvm -> use_tvm

* rename stvm worktree

* STVMAllocator -> TVMAllocator

* StvmExecutionProviderInfo -> TvmExecutionProviderInfo

* stvm -> tvm for cpu_targets. resolve onnxruntime::tvm and origin tvm namespaces conflict

* STVMRunner -> TVMRunner

* StvmExecutionProvider -> TvmExecutionProvider

* tvm::env_vars

* StvmProviderFactory -> TvmProviderFactory

* rename factory funcs

* StvmCPUDataTransfer -> TvmCPUDataTransfer

* small clean

* STVMFuncState -> TVMFuncState

* USE_TVM -> NUPHAR_USE_TVM

* USE_STVM -> USE_TVM

* python API: providers.stvm -> providers.tvm. clean TVM_EP.md

* clean build scripts #1

* clean build scripts, java frontend and others #2

* once more clean #3

* fix build of nuphar tvm test

* final transfer stvm namespace to onnxruntime::tvm

* rename stvm->tvm

* NUPHAR_USE_TVM -> USE_NUPHAR_TVM

* small fixes for correct CI tests

* clean after rebase. Last renaming stvm to tvm, separate TVM and Nuphar in cmake and build files

* update CUDA support for TVM EP

* roll back CudaNN home check

* ERROR for not positive input shape dimension instead of WARNING

* update documentation for CUDA

* small corrections after review

* update GPU description

* update GPU description

* misprints were fixed

* cleaned up error msgs

Co-authored-by: Valery Chernov <valery.chernov@deelvin.com>
Co-authored-by: KJlaccHoeUM9l <wotpricol@mail.ru>
Co-authored-by: Thierry Moreau <tmoreau@octoml.ai>
2022-02-15 10:21:02 +01:00
Changming Sun
3185680b6c
Add NHWC CONV contrib op (#10506) 2022-02-10 15:47:49 -08:00
Viswanath Boga
ad9d2e2e89
Prefix match in first iteration of beam search OP (#10231)
* Add BeamSearch op schema

* Add ONNX conversion for beams search

* remove attention_mask and change input order

* add option to run baseline

* add check data type NULL

* applies VerifyNodeAndOpMatch to subgraph

* update input_ids shape

* Add node name for Cast node

* expose API for topk

* parse parameters

* Add beam search scorer

* output results

* fix typo

* use c++ template and format python

* fix build pipeline errors

* symbolic shape infer of input onnx

* output scores

* add kernel def hash

* Handle vocab_mask; move CheckSubgraph

* undo insert_cast_transformer.cc and fusion_utils.py

* fix typo

* fix merge

* update doc

* add repetition penalty

* refactoring: add GptSubgraph class

* move BeamSearchState from .h to .cc file

* adjust logits processor order

* add batch generation example

* fix repetition penalty for dup words in sequence

* Add test

* Add no repeat ngram processor

* refactoring: move logits processor to classes

* fix build warning

* show latency

* use allocator in beam state

* use allocator in sequences

* fix build error

* move next_positions to beam state

* Changes for prefix matching

* removing debugs

* removing more debugs

* clean up

* clean up

* cpu doc updated

* Updated docs

* updated prefix_vocab_mask dimension in convert script

* changes to support bxs prefix_vocab_mask in beamsearchop kernel

* doc update

* OperatorKernels.md updated

* matching docs from artifacts

* minor change in logits processor

* Addressing comments

* Updated the prefix vocab mask usage properly

Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
2022-02-03 00:14:39 +05:30
Yufeng Li
1aa0789691
add qdq support for QGemm (#10414)
* add qgemm in quantization tool

* add qdq support for QGemm

* fix build break

* fix OperatorKernels.md
2022-02-02 10:35:29 -08:00
Xavier Dupré
481b96d32a
STVM, NUPHAR, remove tvm from submodules list, checks pointers are not null. (#10211)
* STVM, checks pointers are not null.
* removes submodules tvm
* add missing include(FetchContent)
* add target tvm
* fix stvm test
* extend cgmanifest with dependencies of tvm
2022-01-27 20:31:13 +01:00
Edward Chen
66acf50488
Document C/C++ API documentation version info conventions. (#10396) 2022-01-27 10:20:13 -08:00
Dmitri Smirnov
3367ddc5ba
Add abseil cgmanifest declaration. Update coding standards. (#10374)
Add abseil cgmanifest declaration. Update coding standards for InlinedContainers
  Adjust coding guidelines. Add default N calculation for InlinedVector<T, N> for general use.
  Rename T from InlinedShapeVectorT. Fix Eager build
  Add LLVM Copyright with modified derived code notice.
2022-01-27 08:32:05 -08:00
Yi-Hong Lyu
e27f2dc932
int8/uint8 support for Argmax for opset 1, 11, 12 (#10296) 2022-01-18 14:37:34 -08:00
Vincent Wang
44e2db9397
CUDA BFloat16 Refactor (#10085) 2022-01-14 19:38:56 +08:00
Yi-Hong Lyu
499f1d5fd7
Quantization of Argmax (#10213)
This patch includes:
* int8/uint8 support for Argmax
* Quantization tool support for Argmax
2022-01-12 14:12:56 -08:00
Nat Kershaw (MSFT)
d52d3c0052
Update C/C++ API docs automation to create a PR (instead of push to publish branch) (#10093) 2022-01-07 16:16:47 -08:00
Edward Chen
3bc91c2151
Move reduced ops files into build directory (#10030)
In a reduced ops build, some source files get updated. This change moves the updated files into the build directory. This way, it is easier to simultaneously manage different build directories (with possibly different reduced ops configurations) based on a single source directory.
2021-12-28 19:04:20 -08:00
Vincent Wang
ceb17f82ff
Use FusedMatMul When Transpose is Between First Dim and Contiguous Batch Dims (#9734)
* fusedmatmul support transpose batches

* fix win build

* fix contrib op md

* more comments
2021-12-27 10:49:46 +08:00
Yufeng Li
12ee2e942f
add int8_t for Resize (#10067)
As we support quantization for format s8s8, we need Resize to support int8_t.
2021-12-17 15:36:09 -08:00
Tianlei Wu
ef36488df0
Add BeamSearch operator for GPT-2 decoding (#9680)
* Add BeamSearch operator and CPU implementation
* Add ONNX conversion script
2021-12-16 16:08:05 -08:00
Valery Chernov
b327e89efa
Standalone TVM Executor Provider (#10019)
* squashed commit for standalone tvm execution provider

* critical fix for correct python build with stvm ep

* get tuning log file from ep options. It has priority over AUTOTVM_TUNING_LOG

* updates and fixes

* update parsing of stvm provider options

* add support of external data for onnx model

* add conditional dump of subgraphs

* remove unused code

* get input tensor shapes through provider options. get output shapes for fixed input ones by TVM API

* support AUTO_TVM tuning log file inside ORT. Selector for Ansor and Auto_TVM is provider option (tuning_type)

* add fp16

* add functionality of conversion of model layout to NHWC if needed. Necessary parameter was added to STVM provider options

* fix license text in header. fix log format

* small fixes

* fix issues from flake8

* remove model proto construction from GetCapability

* reserve memory for vector of DLTensors

* add simple tutorial for STVM EP

* STVM docs

* jroesch/tvm -> apache/tvm

* remove dead code, unnecessary logs and comments

* fix in readme

* improve tutorial notebook

* tvm update

* update STVM_EP.md

* fix default value

* update STVM_EP.md

* some TODOs for the future development

* shorten long lines

* add hyperlink to STVM_EP.md

* fix Linux CI error

* fix error in csharp test

Co-authored-by: Jared Roesch <jroesch@octoml.ai>
Co-authored-by: Valery Chernov <valery.chernov@deelvin.com>
Co-authored-by: KJlaccHoeUM9l <wotpricol@mail.ru>
2021-12-15 16:59:20 -08:00
jingyanwangms
8043a9facc
Bump master version to 1.11 (#9957)
* Bump master version to 1.11

* Update Windows AI version

* update version in onnxruntime_c_api.cc
2021-12-14 23:32:06 -08:00
Nat Kershaw (MSFT)
b4434c7694
Automate generation of C/C++ API docs (#9997) 2021-12-10 17:45:50 -08:00
Yufeng Li
ffdafb2012
add fallback of s8s8 support on x64 (#9995)
* add fallback of s8s8 support on x64
2021-12-10 11:33:19 -08:00
Scott McKay
00c979db4d
Update doc for operators/opsets supported by mobile package (#9899) 2021-12-02 13:51:22 +10:00
Sherlock
6de79d82c8
Fix Training Packaging pipeline (#9885)
* Fix Training Packaging pipeline
2021-11-30 15:26:10 -08:00
Yufeng Li
a0afd7303d
add int8_t support for pool operators (#9852)
* add int8_t support for pool operators
2021-11-29 18:43:43 -08:00
Ye Wang
6856619b18
Decoder Attention CUDA Op (#9792)
* add kernel interface

* register kernel

* add self/cross qkv projection without cache

* add LaunchTransQkv2 for (S,B,X,N,H) -> (X,B,N,S,H)

* refactor ConcatPastToPresent

* DecoderQkvToContext interface

* q,k,v buffer and cache as output

* qk, pv and transctx

* fix compiler error on linux machine

* key_padding_mask

* add test_parity file. However not runnable

* add partial unittest

* made partial attributes to inputs

* --gen_doc

* change kernel interface, add more tests

* more parity tests

* fix test

* fix typo

* transpose optimizer has bug. remove it temporarily

* add input shape checks

* add type/shape inference

* fix cache shape check

* fix rocm build failure

* fix rocm build error

* review comments

* review comments
2021-11-19 19:25:36 -08:00
Vincent Wang
f390347c11
Add CUDA Kernels of RandomNormal[Like], RandomUniform[Like] (#9761) 2021-11-19 08:18:34 +08:00
Viswanath Boga
9d84811fb6
fixing pypi pipeline for release (#9716)
* fixing pypi pipeline for release

* updated the script and correct python version

* updated the version correctly with script changes

* Remove 1.9.1
2021-11-10 17:33:51 -08:00
satyajandhyala
229c9a4e1c
Added Trilu CUDA kernel. (#9633)
* Added Trilu CUDA kernel.

* Added TriluGrad.

* Added a training testcase for Trilu.

* Added Trilu gradient checker test.
2021-11-09 11:20:17 -08:00
Hariharan Seshadri
bbeceb7541
Support optional type in ORT (#8339) 2021-11-04 15:01:42 -07:00
Viswanath Boga
85874bb315
embed layer fusion gpt2 (#9336)
* Changes to fuse embed layer for gpt2, kernel changes pending

* verified add output and regular add match

* Test added for additional output embedlayernorm, working on CUDA

* Test passing on CPU

* updated convert_to_onnx tool to check parity correctly

* removed some debugs

* couple of TODO left as in optimizer.py

* removed changes to optimizer.py

* fixing build

* fixing build

* updated order of initialization

* added a test case for float16

* updating the docs

* updating tests failing due to embed layer fusion

* update unit tests

* updating CUDA documentation in operatorkernels.md

* addressing comments

* OperatorKernels.md updated with CUDA

* adding TODO to qembed_layer

* minor edit

* updated docs

* addressing comments

* adding position ids to embed layer gpt2

* updating fused gpt2 model

* added extra test

* remove comments

* addressing comments

* contrib_defs.cc updated

* all tests passing

* fixing a typo

* minor edit

* trigger build

* qembedlayernorm checkinputs updated

* fixing build error

* fixing build error

* fixing build error
2021-10-28 11:06:26 -07:00
Bowen Bao
e983f37121
Bifurcation detector for aggressive decoding (#9432)
```
Component for aggressive decoding. Find the bifurcation index of predicted tokens, between source tokens,
starting from previous suffix match index, and predicted tokens.
Concat predicted tokens, starting from bifurcation index, to the back
of current tokens. This forms the output tokens.
Detect suffix match index in source tokens, between source tokens and output tokens.
Detection is based on finding the appearances of last n-gram in output tokens
in source tokens.
A match is considered found if source tokens contain a single matching n-gram.
Return the index of the start of the n-gram in source tokens.
No match is found if source tokens contain multiple or zero matching n-grams. Return -1.
```
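The two detection steps described above can be sketched in plain Python. This is only a literal reading of the commit description, not the operator's actual kernel: the function names and the single `ngram_size` parameter are hypothetical, and the concatenation step is left out because the description does not pin down exactly which slice of the predicted tokens is appended.

```python
def find_bifurcation_index(src_tokens, pred_tokens, prev_suffix_match_idx):
    """Return the first position where pred_tokens stops agreeing with
    src_tokens read from prev_suffix_match_idx onward (hypothetical helper)."""
    idx = 0
    while (
        idx < len(pred_tokens)
        and prev_suffix_match_idx + idx < len(src_tokens)
        and pred_tokens[idx] == src_tokens[prev_suffix_match_idx + idx]
    ):
        idx += 1
    return idx


def find_suffix_match_index(src_tokens, output_tokens, ngram_size):
    """Look for the last n-gram of output_tokens inside src_tokens.

    Returns the start index of the n-gram in src_tokens when it occurs
    exactly once, and -1 when it occurs zero or multiple times.
    ngram_size is an assumed parameter, not taken from the commit.
    """
    if ngram_size <= 0 or len(output_tokens) < ngram_size:
        return -1
    ngram = output_tokens[-ngram_size:]
    matches = [
        start
        for start in range(len(src_tokens) - ngram_size + 1)
        if src_tokens[start:start + ngram_size] == ngram
    ]
    return matches[0] if len(matches) == 1 else -1
```

For example, with source tokens `[1, 2, 3, 4, 5]`, predicted tokens `[3, 4, 9]` and a previous match index of 2, the prediction agrees with the source for two tokens before diverging.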
2021-10-19 19:53:56 -07:00
Hariharan Seshadri
4698b73725
Fix output shape description of Attention op's schema (#9406) 2021-10-19 15:56:35 -07:00
Xavier Dupré
11f0081c1e
Remove tensorflow, tf2onnx from the list of dependencies for the documentation (#9221)
* Remove tensorflow, tf2onnx from the list of dependencies for the documentation
* improve documentation
* update API
2021-10-14 18:07:35 +02:00
mindest
f9cf62912a
Add same_shape case for BiasDropout (#9188)
* bias dropout improvement

* add transform case for same shape case

* combine kernel

* merge with vectorized kernel

* use "has_same_shape_bias"

* minor: a "N % 4 != 0" case

* add op UT for has_same_shape_bias

* address comments; add param case for 1d bias;
add param case tests for 1d and same-shape bias

* rewrite logic condition

Co-authored-by: Peng Wang <pengwa@microsoft.com>
2021-10-12 19:57:38 +08:00
ashbhandare
35c2102cfa
Fixes for GatherND, Multinomial (#9143)
* register gathernd kernel, aten multinomial

* fix CI, add test

* review comments
2021-10-05 14:51:58 -07:00
Ye Wang
4934455ab6
Bumping up to 1.10 (#9006)
* bump to 1.10

* Update Versioning.md

* Update README.rst

* Change opset version to 15
2021-09-22 16:34:28 -07:00
Jason
4e5bc8365b
Add Paddle2ONNX to Versioning.md (#9067)
* Add Paddle2ONNX to Versioning.md
2021-09-22 13:38:14 -07:00
Pranav Sharma
dae37dc946
Fix S360 issue by using "use strict" for javascript code. (#9128) 2021-09-20 20:32:44 -07:00
Ryan Hill
6ae5f7a244
C API Docs - Add build instructions (#9106)
* Update Doxyfile, add build instructions to header
* Update paths in README.md
2021-09-17 18:40:27 -07:00
Ryan Hill
280e79463a
Fill in more documentation (#9088)
Fix plural values with %s
Fix more symbol links
Add custom header for web metrics
2021-09-16 17:08:27 -07:00
Zuwei Zhao
ff66cfdfa6
Enable linking in exception throwing support library when build onnxruntime wasm. (#8973)
* Enable linking in exception throwing support library when build onnxruntime webassembly containing onnxruntime-extensions.

* Add flag in build.py to enable linking exceptions throwing library.

* Update onnxruntime-extensions document and bind custom_ops build flag with use_extensions.

* Update doc.

* Update cgmanifest.json.

Co-authored-by: Zuwei Zhao <zuzhao@microsoft.com>
2021-09-10 22:09:16 +08:00
Ryan Hill
2439ced3ec
API Documentation (#8948)
* Make help information compile properly
2021-09-09 22:04:51 -07:00
ytaous
0193490cbf
ReduceMin - add int64 cuda kernel support for opset12/13 (#8966)
* ReduceMin - int64 support

* fix doc

Co-authored-by: Ethan Tao <ettao@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2021-09-07 17:01:26 -07:00
Ye Wang
e2194797a7
bumping up to version 1.9 (#8982)
* bump up version

* make the Windows AI column align with the ORT version

* update the hardcoded version string

* fix a typo
2021-09-07 14:30:55 -07:00
Zuwei Zhao
89e8bff121
Enable selecting custom ops in onnxruntime-extensions. (#8826)
* Enable selecting custom ops in onnxruntime-extensions.

* Move cmake_helper.py.

* Remove over-indented spaces.

* Add doc.

* Remove onnxruntime-extensions from git submodules, and user should pass path of onnxruntime-extensions for build.

* Modify doc.

* Remove argument --enable_onnxruntime_extensions and use --onnxruntime_extensions_path.

* Fix build error.

* Fix build error.

* Use onnxruntime_extensions_path.

* support both submodule and external source folders

* refinement

* Update cgmanifest.json

* Support building onnxruntime-extensions from either git submodule or pre-pulled path.

* Update doc.

* more standard name

* update docs

* add the copyright header

Co-authored-by: Zuwei Zhao <zuzhao@microsoft.com>
Co-authored-by: Wenbing Li <wenbingl@outlook.com>
Co-authored-by: Wenbing Li <10278425+wenbingl@users.noreply.github.com>
2021-08-27 21:45:52 -07:00
Hariharan Seshadri
cee79526fd
Add opset 15 kernels for Pow, BatchNorm, and Shape (#8442) 2021-08-25 12:04:20 -07:00
Hariharan Seshadri
17b0664e34
Optimize sequence type usage on CUDA [2/n] (#8720) 2021-08-24 10:40:28 -07:00
XiyinOSS
19b82b438b
GridSample OP implementation for CPU and CUDA (#8551)
* GridSample OP implementation for CPU and CUDA

**Description**: This change contains implementation for torch grid_sample OP.
CUDA implementation contains contribution from Muscle Wu.

* Use interpolation for out-of-bound points in zero padding mode

Out-of-bound points in zero padding mode changed from constant 0 to
interpolation of surrounding pixels. This aligns with the PyTorch implementation.

A bug in CUDA batch offset calculation is fixed.

Custom op exporter type is added.

* Fix nearest bug in CPU

* Update per CI build finding and review comments

* Force float to avoid potential integer T issue

* Style update

* PR update

* Remove c++17 feature from cuda code
2021-08-20 12:37:38 -07:00
harshithapv
c24335246b
Support bool type for Pad Op and fix Unsqueeze in Tile grad for Opset 13 (#8602)
* changes

* tile grad unsqueeze fix for opset 13

* clean up

* remove bool support for opset 2 to 12 for Pad as it is not supported.

* Copy OperatorKernels.md from artifacts of Windows CI build.
2021-08-11 11:21:02 -07:00
Xavier Dupré
064a385b59
Support int8 for operator Split (#8615)
* Support int8 for operator Split
2021-08-10 23:04:16 +02:00
Changming Sun
ed17ca3595
Remove onnxruntime/core/protobuf (#8617)
* remove onnxruntime/core/protobuf

* Update How_To_Update_ONNX_Dev_Notes.md
2021-08-10 09:36:27 -07:00
Guoyu Wang
52a212e4f1
Bump ORT master version to 1.8.2 (#8646) 2021-08-09 11:10:29 -07:00
Yulong Wang
1b902d0227
doc: add ort-web related instructions to update onnx doc (#8500)
* doc: update instructions for ort web docs

* revise readme
2021-08-06 15:09:11 -07:00
Ashwini Khade
96eb9810ba
Update onnx (#8458)
* updates for picking onnx commit

* add tests filter to c# tests

* plus test fixes

* fix versioning for contrib ops

* fix tests

* test filter for optional ops

* more versioning related updates

* fix test

* fix layernorm spec

* more updates

* update docs

* add more test filters

* more filters

* update binary size threshold

* update docs

* plus more fixes

* updates per review

* update to release commit

* add filters for optional type tests

* plus updates
2021-08-05 09:21:44 -07:00
Chun-Wei Chen
9d88b1de78
correct supported ONNX version (#8590) 2021-08-05 06:49:50 -07:00
Yufeng Li
ceeb1a65d6
Add quantization support of GEMM directly with QGemm (#8447)
QGemm takes quantized A, B, and C, plus the quantization parameters of output Y; C and the quantization parameters of Y are optional. Its output is either quantized or full precision, depending on whether the quantization parameters of Y exist: if they are provided, the output is requantized; otherwise it is full precision.

Compared with QLinearMatMul and MatMulInteger, QGemm supports the transpose, alpha, and beta attributes.

The formula for quantized GEMM is:
Y = alpha * scale_a * scale_b * ((A_int8 - zp_a) * (B_int8 - zp_b) + C_int32), in which,
C_int32 is quantized with formula: C_int32 = (beta * C) / (alpha * scale_a * scale_b)
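
As a rough illustration of the formula above, here is a minimal pure-Python sketch of the full-precision output path. It is not the ORT kernel: the function name `qgemm_reference` is hypothetical, and it assumes per-tensor scales and zero points, no transposes, a C with the same shape as Y, and Python's built-in `round` for quantizing C.

```python
def qgemm_reference(a_q, scale_a, zp_a, b_q, scale_b, zp_b,
                    c=None, alpha=1.0, beta=1.0):
    """Evaluate Y = alpha * scale_a * scale_b *
    ((A_int8 - zp_a) * (B_int8 - zp_b) + C_int32) on nested lists.

    Assumptions (not from the commit): per-tensor quantization,
    no transpose, C has the same shape as Y.
    """
    scale = alpha * scale_a * scale_b
    rows, inner, cols = len(a_q), len(b_q), len(b_q[0])
    y = []
    for i in range(rows):
        row = []
        for j in range(cols):
            # Integer accumulation over zero-point-adjusted inputs.
            acc = sum((a_q[i][k] - zp_a) * (b_q[k][j] - zp_b)
                      for k in range(inner))
            if c is not None:
                # C_int32 = (beta * C) / (alpha * scale_a * scale_b)
                acc += round(beta * c[i][j] / scale)
            row.append(scale * acc)
        y.append(row)
    return y
```

Calling `qgemm_reference([[3, 5]], 0.5, 1, [[2], [4]], 0.25, 0, c=[[1.0]])` dequantizes A to `[1.0, 2.0]` and B to `[0.5, 1.0]`, so the result should match the float computation `1.0 * 0.5 + 2.0 * 1.0 + 1.0`.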
2021-07-27 21:21:49 -07:00
Xavier Dupré
a9fc3c448c
Improves documentation, show InferenceSession constructor attributes (#8494)
* include constructor parameters in the python documentation
* expose more classes into the documentation
2021-07-26 15:58:47 +02:00
Dmitri Smirnov
950fe5e28b
Implement SparseTensor and infrastructure support and advance ONNX commit (#8038)
SparseTensor support
  Implement Builder pattern
  Fix support for 1-D and 2-D COO indices
  Implement and test CSR support.
  Handle shape inference for SparseTensors
  Implement conversion for COO, CSR and tests.
  Address the case where constant sparse initializer is the output.
  Implement test infra for SparseTensors
  Implement SparseDenseMatMul for Csr and COO and tested it.
  Add hash for SparseToDenseMatMul
  Finish shared provider refactor
  Refactor GetOrCreate to Create
  Working on py interface
  Expose OrtDevice and use it in allocate_numpy
  Adjust Sparse interfaces, add support for string SparseTensor. Add tests.
  Add and test to_cuda()
  Add accessors to format specific indices
  Test values and indices views, read-only flag, after GC access
  Add sparse related methods to OrtValue
  Re-work SparseTensor wrapper, add OrtValue methods
  Rework numpy_array_to_cuda/to_cpu
  Add run_with_ort_values
  Add models and test sparse_mat_mul with run_with_ort_values
  Refactor sparse tensor to use a single buffer
  Ifdef x86 Eigen CSR sparse matmul implementation
  Exclude broken test, check for string type when copying cross device
  Split pybind schema, regenerate docs, add exclusion
  Conditionally exclude schema module
  Update docs fix cuda build
  Add test to a filter and regenerate JS docs
  Add conversion and test string support for sparse tensors
  Exclude conversion utils from minimal build
  Add CUDA Memcpy and adjust provider interfaces
2021-07-22 15:24:36 -07:00
DeyuHuang
4275055868
Add Gridsampler contrib op (#8372)
* add Gridsampler contrib op

* fix gridsampler_paddingmode_border test

* disable the tests until the kernel added

* fix CI failure

* change GridSampler to GridSample
2021-07-22 15:39:28 +08:00
harshithapv
0f989c6162
bumping onnxruntime version to 1.8.1 (#8429) 2021-07-19 16:48:56 -07:00
Viswanath Boga
afce0e2543
Attention kernel update to handle different Q,K,V hidden sizes (#8039)
* changes working to convert akv nodes

* changes to replace nodes

* changes to accommodate qkv hidden sizes as attributes

* kernel to accept qkv_hidden_size attributes

* Working till compute for varied dimension, todo applyattention()

* changes to make all regression tests work

* inference running successfully without prepack

* success inference with pre-pack weights

* add test for diff sizes

* bias shape need not be a multiple of 3

* get the output_hidden_size from input

* infer output shape from input

* merge with master

* cleaning up files that got merged wrong

* accuracy at accepted level

* added unit test case for different dimensions

* all unit tests passing

* packed weights working for attention

* prepacked weights working

* added test case for newly added extra qk input

* updated unit test to test only extra add qk

* fixing build error

* removing few debugs

* reverting test changes

* all python test passing

* cleaning up

* new unit test added, major clean up of code

* removed extra code

* minor

* minor fix to tests

* prepack weights code cleaned up

* compacted compute() in attention.cc

* reformat compute()

* making a parameter T

* adding 3 q,k,v buffers in all cases

* fixing build

* running tests only on cpu

* Updating docs

* trigger ci builds

* Addressing comments in PR

* addressing some more comments

* get add_qk_str from add_qk node directly

* updating docs, added extra check to verify attn inputs

* Optimized the extra add by parallelizing

* added attention_shape to symbolic_shape_infer.py

* minor refactoring to address comments
2021-07-19 12:21:33 -07:00
Ye Wang
04297110c3
Support int64 in ReduceMin cuda op for Opset 14 (#8307)
* reducemin int64_t support

* fix xxcuda.so load error

* testtest

* refactor

* update doc

* propagate types to opset14

* re-generate doc

* rename macro
2021-07-13 16:18:06 -07:00
Zuwei Zhao
0a5b75f5cd
Update submodule onnxruntime-extensions. (#8282)
* Update submodule onnxruntime-extensions to latest.

* Add document for onnxruntime-extensions.

* Update cgmanifest.json for onnxruntime-extensions.

* Add example in JavaScript.

Co-authored-by: Zuwei Zhao <zuzhao@microsoft.com>
2021-07-13 10:21:11 +08:00
Hariharan Seshadri
5369821ad6
Support SpaceDepth ops in the CUDA and ROCM EPs (#7960) 2021-07-09 01:00:22 -07:00
Nick Kreeger
800b62a139
Create a quantized EmbedLayerNorm for ORT. (#8124)
Create a quantized EmbedLayerNorm Op for ORT
2021-06-25 17:51:43 -05:00
Negin Raoof
80b7b134bf
Adding optional ops in contrib ops (#7946)
* Added optional const spec
2021-06-24 13:16:31 -07:00
Bowen Bao
51c12a715b
Add NGramRepeatBlock contrib op (#8078)
**Description**: 
Enforce no repetition of n-grams. Scores are set to `-inf` for tokens that form a repeated n-gram if added to the back of the input_ids.

**Motivation and Context**
Needed by transformer models in sequence generation algorithms (greedy search and beam search). This module has heavy impact on performance, and can be highly parallelized.
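
The blocking rule described above can be sketched for a single sequence in plain Python. This is an illustrative reading of the description, not the contrib op's kernel: the function name `block_repeated_ngrams` is hypothetical, and it assumes `scores` is indexed by token id and that one sequence is processed at a time.

```python
import math


def block_repeated_ngrams(input_ids, scores, ngram_size):
    """Return a copy of scores where every token that would complete an
    n-gram already present in input_ids is set to -inf.

    Assumptions (not from the commit): a single sequence, scores indexed
    by token id, plain Python lists.
    """
    blocked = list(scores)
    if ngram_size <= 0 or len(input_ids) < ngram_size:
        return blocked
    # The last (n - 1) tokens are the prefix of the next candidate n-gram.
    prefix = input_ids[len(input_ids) - ngram_size + 1:]
    for start in range(len(input_ids) - ngram_size + 1):
        ngram = input_ids[start:start + ngram_size]
        if ngram[:-1] == prefix:
            # Appending ngram[-1] would repeat an existing n-gram.
            blocked[ngram[-1]] = -math.inf
    return blocked
```

With `input_ids = [1, 2, 3, 1, 2]` and `ngram_size = 3`, the trailing prefix is `[1, 2]`, which already appeared followed by token 3, so only token 3 is blocked. Each sequence in a batch is independent, which is why the commit notes the work can be highly parallelized.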
2021-06-21 10:21:48 -07:00
Olivia Jain
c72a8c7ff4
Upgrade tf 2.4.1 to 2.4.2 for component governance (#8036)
* Upgrade tf 2.4.1 to 2.4.2 for component governance

* Trial run with tf 2.5.0
2021-06-14 09:30:58 -07:00
Xavier Dupré
6d7461795f
Update Version.md (#8021)
Correct the supported opset for 1.8.0.
2021-06-13 18:52:40 +02:00
RandySheriffH
1a5ee11dbd
Implement Sequence Ops GPU (#7863) 2021-06-07 15:30:26 -07:00
Thiago Crepaldi
c45ac166d3
Add graphviz into Dockerfile images for Python API documentation (#7819) 2021-06-02 16:12:54 -07:00
Scott McKay
0fbec1b9c1
Update the operator documentation generation (#7787)
* Update the operator documentation generation
  - Make layout a little nicer
  - Update to latest supported operators including training
  - Fix some links that are broken when the docs content is copied to github-pages
  - Fix incorrect usage of 'onnx.ai.ml' as the default domain
    - ML ops are now separated from the real default domain of 'onnx.ai'
  - Include CPU, CUDA and training kernels
    - exclude DNNL as it's not an EP we own

* There are separate paths for CUDA and CUDNN as they are not guaranteed to be in the same location on a Windows machine. Use the CUDNN path when looking for the CUDNN library.

* Enable validation of both contrib ops and operator kernels in build
Filter generation so it's deterministic
Add ability for CI to publish the md files as build artifacts if they differ so a developer can download and add to their PR to resolve any diffs.
Remove workarounds for github-pages as that will now link to the github docs which display correctly
2021-06-02 17:47:40 +10:00
Siva Popuri
c08bb4eee3
Update docs/ONNX_Runtime_Server_Usage.md (#7818)
Making it clear in the documentation to proactively inform users.
2021-05-26 16:17:20 -07:00
Scott McKay
57782b3463
Add supported operators/types documentation for the ORT Mobile package (#7807)
* Add ability to generate documentation for the ORT Mobile package using the build configuration as input.
2021-05-26 15:57:40 +10:00
Xueyun Zhu
e92b3c1394
bumping up version number to 1.8 (#7733)
* bump to 1.8

* fix windows AI
2021-05-18 09:03:37 -07:00
Thiago Crepaldi
4fe2ffae16
Fix ORTModule python doc generation (#7704)
* Fix ORTModule python doc generation

* Address comment
2021-05-17 09:55:49 -07:00
Yufeng Li
a74e41e47d
Add non-zero zp support for quant matmul and attention (#7570)
* add non-zero zp support
* support A and B scale with any dimensions
2021-05-14 16:50:31 -07:00
Zhang Lei
50c5edcf13
Add nhwc support for QLinearAveragePool operator (#7656)
* Add nhwc support for QLinearAveragePool operator

* Update ContribOperators.md

* Update OperatorKernels.md with cpu,dnnl and cuda enabled.
2021-05-13 22:05:30 -07:00
Faith Xu
7cb9077043
Fix readme page (#7659)
* Delete mobile page

Moved to: https://www.onnxruntime.ai/docs/how-to/deploy-on-mobile.html

* Delete ONNX_Runtime_Mobile_NNAPI_perf_considerations.md

Moved to: https://www.onnxruntime.ai/docs/reference/execution-providers/NNAPI-ExecutionProvider.html#performance-tuning

* Fix links to website docs

* Update some summary text

* Add space
2021-05-12 14:30:23 -07:00
Tracy Sharpe
16297a8e61
Implement NCHWc Upsample linear mode (#7623)
Extend the existing NCHWc Upsample operator to support linear modes too.
2021-05-10 12:16:16 -07:00
Ye Wang
803837df63
Add 4dmask support for attention cuda kernel (#7591)
* checkin

* add 4dmask support in attention cuda op

* trim

* add comments

* fix build/test error

* review comments and add tests

* sync doc

* review comments

* minor change
2021-05-07 20:17:29 -07:00
Scott McKay
d6df5764d7
Android package infrastructure (#7430)
* Include ORT format model conversion scripts and infrastructure in ORT python package.
  - tweak existing script setup so it can be easily run directly and from the ORT python package
Add config file and readme for Android minimal build package
Update ORT Mobile doco
Disable warning if 'all' optimizations are enabled but NCHWc transformer is excluded (device specific optimizations don't apply in this scenario so the warning is moot).

* Address PR comments
2021-04-30 14:23:54 +10:00
Changming Sun
1012535dab
Change onnxruntime::make_unique to std::make_unique (#7502)
1. Change onnxruntime::make_unique to std::make_unique
2. Add "-std=c++14" to ROCM EP's build flags.
2021-04-29 17:04:53 -07:00
KeDengMS
8e21329206
Update nuphar notebook model download url (#7475) 2021-04-27 21:18:06 -07:00
Edward Chen
d21304ceb0
Initial Objective-C API (#7366)
Initial implementation of an Objective-C API.
2021-04-27 10:06:30 -07:00