Commit graph

97 commits

Author SHA1 Message Date
kunal-vaishnavi
39d6d7050d
Change EmbedLayerNormalization mask index output to optional (#15526)
### Description
This PR changes an EmbedLayerNormalization node's mask index output to
be an optional output if a mask input is not provided.



### Motivation and Context
The documentation for EmbedLayerNormalization states:
```
The last input mask is optional. If mask is provided, mask index (that is position of first 0 in mask, or number of words) will be calculated.
```
However, if the mask input is not provided, the mask index output is
still calculated and required.
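
As a rough illustration of the quoted rule (position of the first 0 in the mask, or the number of words), the mask index for each row could be computed as below. This is a minimal sketch that assumes a right-padded 0/1 mask; `mask_index` is a hypothetical helper name, not the ORT kernel.

```python
def mask_index(mask_row):
    """Return the position of the first 0 in a 0/1 mask row.

    If the row contains no 0, the row length (number of words) is returned.
    Hypothetical helper for illustration only.
    """
    for position, value in enumerate(mask_row):
        if value == 0:
            return position
    return len(mask_row)


# Two sequences padded to length 5: three real tokens and five real tokens.
batch_mask = [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
indices = [mask_index(row) for row in batch_mask]
```

When no mask is given there is nothing to scan, which is why the output only makes sense as an optional one.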
2023-04-27 16:32:42 -07:00
Patrice Vignola
3be5bfe363
[DML EP] Add MatMul + SoftMax fusion (#15240) 2023-04-11 08:31:04 -07:00
stevenlix
6d126f8996
Add FP16 support for Whisper model (#15427)
ORT can currently run inference only for the FP32 Whisper model. This PR adds
FP16 support.
2023-04-08 21:36:10 -07:00
Chen Fu
8dce83a818
Fuse 'Add' operator into FP16 Conv (#15213)
### Description
Add 'Add' functionality to the FP16 Conv operator. It takes a tensor that
has the same shape as the output tensor and adds it to the result
tensor.


### Motivation and Context
Needed to run Resnet 50
2023-04-07 09:51:03 -07:00
petermcaughan
1251964f96
Petermca/beamsearch whisper (#15339)
### Description
Adjust various code paths to allow Whisper model to function with
BeamSearch op.

Approach: Add a new kModelType enum value in IGenerationParameters as
follows:
#### Old: 0 = GPT2, 1 = T5
#### New: 0 = GPT2, 1 = T5, 2 = Whisper

When the user assigns this attribute value to 2, various shape and type
checks are changed to accommodate Whisper inputs.


### Motivation and Context
BeamSearch is currently designed to function with BERT-based models with
inputs as vocab tokens, and needs changes to function with Whisper
inputs (3-D float values processed from audio data).

---------

Co-authored-by: Peter McAughan <petermca@microsoft.com>
2023-04-04 09:09:10 -07:00
Ye Wang
fbfe92f66a
DecoderMaskedMultiHeadAttention enhancement (#15292) 2023-04-02 21:53:03 -07:00
Yufeng Li
c08d6b42e8
Add tool to support packing mode for BERT model (#15283)
### Description
Add a tool to convert a fused BERT-like model to packing mode.


### Motivation and Context
2023-03-31 08:46:47 -07:00
Ye Wang
44ba23e0f5
Rename DecoderMaskedMHA to DecoderMaskedSelfAttn (#15166)
### Description

As discussed offline, rename this op. Another op will be created for MHA
that supports both self and cross attention.

### Motivation and Context

---------

Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
2023-03-23 12:31:38 -07:00
Ye Wang
2ee822d483
Extend memory efficient attention coverage in Attention/MHA cuda op (#15064)
### Description

1. upgrade cutlass to 3.0, which contains attn_bias support.
2. extend Attention/MHA to use memory efficient attention when
rel_pos_bias with [1, num_head, s, s*] and 1d mask with [2 * batch_size
+ 1] are present.

New mask format:
MASK_1D_KEY_SEQ_LEN_START,
[3 * batch_size + 2] with [key_len[0], ..., key_len[batch_size - 1],
query_start[0], ..., query_start[batch_size - 1], query_end[batch_size -
1], key_start[0], ..., key_start[batch_size - 1], key_end[batch_size -
1]]

e.g.
The 2D mask [[1, 1, 1, 0, 0, 0], [1, 1, 1, 1, 1, 0]] converts to the
1D mask [3, 5, 0, 6, 12, 0, 6, 12].
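
The conversion in the example can be sketched as follows. This is only an illustration under stated assumptions: the 2D mask is right-padded, query and key share the same padded sequence length (self-attention), and start offsets are counted in the padded layout. `to_mask_1d_key_seq_len_start` is a hypothetical helper name, not a function from the ORT code base.

```python
def to_mask_1d_key_seq_len_start(mask_2d):
    """Build the [3 * batch_size + 2] mask from a right-padded 2D 0/1 mask.

    Layout: key lengths, query start offsets, final query end,
    key start offsets, final key end.
    """
    batch_size = len(mask_2d)
    padded_len = len(mask_2d[0])
    # Number of valid (non-padding) key tokens per sequence.
    key_len = [sum(row) for row in mask_2d]
    # Each sequence starts at a multiple of the padded length.
    starts = [i * padded_len for i in range(batch_size)]
    end = batch_size * padded_len
    return key_len + starts + [end] + starts + [end]


mask_1d = to_mask_1d_key_seq_len_start(
    [[1, 1, 1, 0, 0, 0], [1, 1, 1, 1, 1, 0]]
)
```

For batch_size = 2 the result has 3 * 2 + 2 = 8 entries, matching the stated size.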


### Motivation and Context

It potentially benefits tnlrv6 and t5(encoder)

---------

Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
Co-authored-by: Kunal Vaishnavi <kvaishnavi@microsoft.com>
Co-authored-by: Kunal Vaishnavi <kvaishnavi@microsoft.com@orttrainingdev7.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
2023-03-23 11:05:17 -07:00
Hariharan Seshadri
7033346605
Support mask_filter_value attribute in DecoderMaskedMultiheadAttention (#15158) 2023-03-23 11:00:09 -07:00
Yufeng Li
c7ced7a5e9
Add PackedAttention for packing mode (#14858)
### Description
Transformer models can handle a batch of inputs at once. However,
sequences in a batch usually have different lengths, so the shorter ones
must be padded to the length of the longest. This is inefficient,
especially for large batches with high length variance.

This PR introduces a PackedAttention operator which can take in packed
sequences (no padding) and also produces output in packing mode.

There will be another PR to use the PackedAttention to implement the
encoder in packing mode.
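
To make the idea of packing concrete, here is a minimal sketch of removing and restoring padding on plain Python lists. It does not reflect the PackedAttention input format; the helper names and the cumulative-length representation are assumptions chosen for illustration.

```python
def pack_sequences(padded_batch, lengths):
    """Drop padding and return (packed tokens, cumulative sequence lengths)."""
    packed = []
    cumulative = [0]
    for row, length in zip(padded_batch, lengths):
        packed.extend(row[:length])
        cumulative.append(cumulative[-1] + length)
    return packed, cumulative


def unpack_sequences(packed, cumulative, max_len, pad_value=0):
    """Restore the padded [batch, max_len] layout from the packed form."""
    batch = []
    for start, end in zip(cumulative[:-1], cumulative[1:]):
        row = packed[start:end]
        batch.append(row + [pad_value] * (max_len - len(row)))
    return batch


padded = [[5, 6, 7, 0], [8, 9, 0, 0]]
packed, cumulative = pack_sequences(padded, lengths=[3, 2])
restored = unpack_sequences(packed, cumulative, max_len=4)
```

Here 8 padded slots shrink to 5 packed tokens; the saving grows with batch size and length variance.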

### Motivation and Context
2023-03-21 12:59:29 -07:00
Hariharan Seshadri
ed7ab1660d
[CUDA] Add option to use DecoderMaskedMultiheadAttention in BeamSearch (#14990) 2023-03-15 17:16:32 -07:00
Ye Wang
538d64891a
[t5 optimization] kernel changes to t5 (#14928)
### Description

1. support optional bias in Attention op (used in T5 encoder)
2. support broadcasting rel_pos_bias in attention_softmax.h
3. add scale in MHA op's attributes
4. support past_key/past_value and present_key/present_value in MHA
5. UT and parity tests are added
6. fix an issue: https://github.com/microsoft/onnxruntime/issues/14920

note: the fusions will be in another PR since mt5 needs to be tested and
an issue from github will be investigated.

Future work:
1. support shared buffer for past/present
2. enable trt kernels when possible and investigate (trt/cutlass) kernels
with rel_pos_bias
3. support KV/QKV packing with past/present

### Motivation and Context

---------

Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
2023-03-13 14:29:16 -07:00
Hariharan Seshadri
112a4d215a
[CUDA] Support decoding multihead self-attention implementation (#14848) 2023-03-08 09:17:54 -08:00
Ye Wang
58da3cacdf
support NeoX-style rotary embedding (#14785)
### Description



### Motivation and Context

Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
2023-02-22 18:21:34 -08:00
Tianlei Wu
eb2ac72fa9
Stable Diffusion CUDA Optimizations Part 4 (#14680)
(1) Support packed QKV format in MultiHeadAttention. This format can
avoid the add-bias transpose when the TRT fused kernel is used.
(2) Add a cache for cumulated sequence length computation. For SD, it only
needs to be computed once since the sequence length is fixed.
(3) Do not allocate qkv workspace for packed KV or QKV, to save memory.
(4) Add unit tests for packed kv and packed qkv format in
MultiHeadAttention
(5) Mark some fusion options for SD only

Performance tests show a slight improvement on T4. Average latency is reduced by
0.15 seconds (from 5.25s to 5.10s) for 512x512 in 50 steps for SD 1.5
models. Memory usage drops from 5.1GB to 4.8GB.
2023-02-15 14:55:42 -08:00
Tianlei Wu
f638c5a2ae
Stable Diffusion CUDA Optimizations Part 3 (#14646)
The third part for stable diffusion CUDA optimizations
(1) Add BiasAdd operator to replace two Add (bias and residual); Add
fusion for BiasAdd
(2) Add Attention fusion for VAE decoder.
(3) Update float16 conversion to handle Resize and GroupNorm. This could
reduce two Cast nodes for each Resize op in fp16 model.
(4) Force inputs and outputs to be float16 to avoid data casts in the
pipeline.
(5) Add options --force_fp32_ops, --inspect, etc. to the optimize script so that
users can force some operators to run in float32 to potentially get
better image quality (at the cost of performance).

Performance tests show a slight improvement on T4. Average latency is reduced by
0.1 seconds (from 5.35s to 5.25s) for 512x512 in 50 steps.
2023-02-14 12:46:50 -08:00
Ye Wang
b539c364ee
Some kernel changes for TULR (#14517)
### Description
1. fix a bug in relative position bias kernel where seq_len > 32
2. rename extra_add_qk to relative_position_bias
3. support relative_position_bias in multihead attention (B, N, S, S*)
4. gru_gate support by Lei


### Motivation and Context

---------

Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
Co-authored-by: Lei Zhang <zhang.huanning@hotmail.com>
2023-02-07 11:51:06 -08:00
Tianlei Wu
a6c5ba0185
Stable Diffusion CUDA Optimizations (#14428)
### Description

Add stable diffusion CUDA kernel optimizations.

The following are included:
(1) GroupNorm operator. This kernel is from TensorRT 8.5.
(2) BiasSplitGelu operator. This kernel is modified from SplitGelu of
TensorRT 8.5. We added bias to the SplitGelu.
(3) NhwcConv operator. This adds support of NHWC format (ONNX Conv
operator uses NCHW format).
(4) Update MultiHeadAttention (packed kv and no bias) for cross
attention. This could avoid transpose of kv for TRT fused cross
attention kernel.
(5) Optimization and benchmark script

Not included:
(1) Script to convert Conv to NhwcConv in onnx graph.
(2) Update symbolic shape inference for NhwcConv.
(3) Add SeqLen2Spatial operator
(4) Documents

Limitations: GroupNorm, BiasSplitGelu and NhwcConv kernels are
implemented based on stable diffusion usage. They might not be
applicable to any input size or dimensions. For example, BiasSplitGelu
requires hidden size to be 2560 | 5120 | 10240, and NhwcConv assumes 4D
input/weight.

There is a minor increase in binary size. For SM=75 only, the Python
package wheel size grows by (33757K - 33640K) = 117 KB. It is possible to
move NHWC from a template parameter to the constructor to reduce binary size
(at a slight cost in performance).

Note: for RTX 4090/4080/4070 Ti, build with CUDA 11.8 and the latest
cuDNN to get the best performance.
2023-02-02 23:43:51 -08:00
Ye Wang
de7a868d5f
Update quantization_defs.cc (#14380)
### Description



### Motivation and Context
2023-01-20 15:03:50 -08:00
Ye Wang
668586e8f8
Support muP in Attention (#14348)
### Description



### Motivation and Context

Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
2023-01-19 20:36:55 -08:00
Tianlei Wu
477cad3051
[CUDA] Add trt cross attention kernels (#14328)
Add TRT cross attention kernels for stable diffusion optimization.
2023-01-17 17:55:45 -08:00
Ye Wang
2db57a53a3
Add mask_filter in Attention related ops' attribute (#14274)
### Description


### Motivation and Context

https://github.com/microsoft/onnxruntime/issues/12843

Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
2023-01-17 12:28:11 -08:00
Zhang Lei
15141a40b4
Add present_past_share_buff to QAttention Defs to enable QAttention related tests. (#14297) 2023-01-14 09:19:06 -08:00
Ye Wang
c9a53c9255
Some changes to Sampling Op (#14218)
### Description
1. add an optional input to pass in a seed
2. two UTs: one for top_p=0.5, another for top_p=0.01 (create greedy
search result, in convert_generation.py)
3. fix a bug in the cpu kernel
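
Why a very small top_p reproduces a greedy result, and why a seed input makes sampling testable, can be seen in a small sketch of nucleus (top-p) sampling. This is a generic illustration in plain Python, not the Sampling op's implementation; `top_p_filter` and `sample` are hypothetical names.

```python
import random


def top_p_filter(probs, top_p):
    """Return indices of the smallest high-probability set whose mass >= top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = []
    cumulative = 0.0
    for index in order:
        kept.append(index)
        cumulative += probs[index]
        if cumulative >= top_p:
            break
    return kept


def sample(probs, top_p, seed):
    """Draw one token index from the top-p set using a seeded generator."""
    rng = random.Random(seed)
    kept = top_p_filter(probs, top_p)
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]


probs = [0.5, 0.3, 0.2]
greedy_like = top_p_filter(probs, 0.01)  # only the most likely token survives
wider = top_p_filter(probs, 0.6)         # the two most likely tokens survive
```

With top_p=0.01 the filtered set always contains just the argmax token, so every draw matches greedy search regardless of the seed.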

### Motivation and Context

Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
2023-01-12 14:15:26 -08:00
Ye Wang
a01bf8dbb1
rename CrossAttention to MultiHeadAttention (#14201)
### Description

Rename CrossAttention to MultiHeadAttention since this op can also
be used for self attention.

### Motivation and Context

Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
2023-01-10 10:18:39 -08:00
Ye Wang
5eac2c1f41
relational attention bias cuda op (#14149)
### Description

This cuda op implements the compute_bias() method in T5 Attention
including the permutation.

Notes:
1. bias_table needs to be saved in column-major order. Be careful when
implementing the fusion script.
2. The second input (sequence length) is placed on CPU (using a Shape node's
output should work).
3. The first dimension of the output is 1, so extra_add_qk in Attention
should support broadcasting.
4. compute_bias() is only used in self-attention in T5.

TODO: docs change will be applied later

### Motivation and Context
It's part of the process of optimizing t5 attention as well as t5 based
generation model

Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
2023-01-06 17:32:58 -08:00
Tianlei Wu
2cacb24cb0
Add CrossAttention operator (#14146)
Move separated Q, K and V (without input projection) from Attention to a
new operator CrossAttention.

The Attention operator is hard to maintain when we need to support both with and
without input projection in one class. Add a new operator according to
feedback.

Some changes might be needed in the future, but not in this PR:
(1) bias could be optional (We will not proceed that route unless
experiments show that fusing Add bias with MatMul instead of this op
could improve performance).
(2) support packed KV. There are two ways to support it: when key and
value are same Tensor, they are packed; or we can make value as
optional, and use packed mode when value is empty and the key has packed
K/V.
(3) support cached key and value, and others (like relative position
bias), or more attention mask formats. They can be added easily without
breaking backward compatibility.
(4) ROCm/CPU implementation of this op.
2023-01-06 14:27:40 -08:00
Hariharan Seshadri
d0c5ffd5f7
Misc transformer fixes - 2 (#14156)
### Description
1. The graph pattern search introduced in
https://github.com/microsoft/onnxruntime/pull/13914/ needs to be
enhanced so that SkipLayerNormalization is supported

2. Fix fp32 parity for GPT-2 while using `SkipLayerNormalization`
fusion. The optional output of SLN needs to also include the bias (if
present) and the added output should be a sum of `input + skip + (bias)`
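
A minimal sketch of the intended behaviour, assuming a single hidden vector and an arbitrary epsilon: the optional second output is the pre-normalization sum `input + skip + (bias)`, and the normalized output is computed from that same sum. The function name and signature are hypothetical and do not mirror the ORT kernel.

```python
import math


def skip_layer_norm(x, skip, gamma, beta, bias=None, eps=1e-5):
    """Return (normalized output, summed output) for one hidden vector."""
    if bias is None:
        bias = [0.0] * len(x)
    # The optional output must include the bias when it is present.
    summed = [a + s + b for a, s, b in zip(x, skip, bias)]
    mean = sum(summed) / len(summed)
    variance = sum((v - mean) ** 2 for v in summed) / len(summed)
    inv_std = 1.0 / math.sqrt(variance + eps)
    normalized = [
        (v - mean) * inv_std * g + b for v, g, b in zip(summed, gamma, beta)
    ]
    return normalized, summed


normalized, summed = skip_layer_norm(
    [1.0, 2.0, 3.0],
    [0.5, 0.5, 0.5],
    gamma=[1.0, 1.0, 1.0],
    beta=[0.0, 0.0, 0.0],
    bias=[0.1, 0.2, 0.3],
)
```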

### Motivation and Context
Fix some breaking tests
2023-01-06 07:27:10 -08:00
Ye Wang
ae148ebc05
T5 skip_layer_norm cuda op (#14093)
### Description

T5 uses a layer_norm which only scales and doesn't shift, which is also
known as Root Mean Square Layer Normalization.
ORT already has simplified_layer_norm, which is the RMS layer_norm.
This PR extends this T5 layer_norm with support for skip/bias and the
residual output.
The new op is named SkipSimplifiedLayerNorm and has an interface similar
to SkipLayerNorm, but without beta as an input.
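
The computation described above can be sketched in plain Python for a single hidden vector. This is an illustration only: the function name, argument order, and epsilon value are assumptions, not the operator's actual signature.

```python
import math


def skip_simplified_layer_norm(x, skip, gamma, bias=None, eps=1e-6):
    """RMS layer norm over (x + skip + bias): scale only, no mean shift, no beta.

    Returns (normalized output, residual sum).
    """
    if bias is None:
        bias = [0.0] * len(x)
    summed = [a + s + b for a, s, b in zip(x, skip, bias)]
    mean_square = sum(v * v for v in summed) / len(summed)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    output = [v * inv_rms * g for v, g in zip(summed, gamma)]
    return output, summed


output, residual = skip_simplified_layer_norm(
    [3.0, 4.0], [0.0, 0.0], gamma=[1.0, 1.0]
)
```

Unlike standard layer normalization, the mean is not subtracted, so both outputs here stay positive.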


### Motivation and Context

Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
2023-01-04 13:31:53 -08:00
Ye Wang
68518a1b72
Sampling op (#13426)
### Description

Sampling op for CPU and CUDA.
Supports the Hugging Face case and a custom case.


### Motivation and Context

Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
2022-12-22 17:34:12 -08:00
Hariharan Seshadri
7ed8bd4f95
Support (Bias)SkipLayerNormalization fusion in GPT2 (#13988) 2022-12-21 23:04:44 -08:00
Zhang Lei
fba09faf5b
Implement reuse past and present tensor in Attention Ops. (#13791)
Implement reuse of the kv_cache past and present tensors in Attention ops.
Unit tests for the above feature.
Use the reused kv_cache for past and present tensors in Greedy Search.
Correctness test for it.

Co-authored-by: Zhang Lei <phill.zhang@gmail.com>
2022-12-18 10:03:53 -08:00
Hariharan Seshadri
abc5c25a85
Updates to GreedySearch/BeamSearch (#13943) 2022-12-13 20:25:26 -08:00
Jian Chen
d7d932c1c2
Cjian/where python operator (#12795)
**Description**: 
This PR will enable the python tool to run QWhere and QDQWhere operation

**Limitation**:
s8s8 Where is still not supported.
2022-12-12 13:27:47 -08:00
Hariharan Seshadri
004a1538d3
Extend vocab padding for logits MatMul for fp16 GPT2 GreedySearch (#13842) 2022-12-06 19:39:20 -08:00
Tianlei Wu
8b0e0f4927
Add RemovePadding and RestorePadding for BERT model (#13701)
Add two operators RemovePadding and RestorePadding based on the idea of
effective transformer (https://github.com/bytedance/effective_transformer) to improve large
batch size inference for BERT model.
2022-11-22 10:00:23 -08:00
Hariharan Seshadri
c7329e004d
Improve fp16 performance of GPT-2's logits MatMul while using BeamSearch (#13686) 2022-11-18 18:50:19 -08:00
Ye Wang
38a74af45d
Support position_ids broadcasting in EmbedLayerNorm (#13677)
### Description


### Motivation and Context

fix https://github.com/microsoft/onnxruntime/issues/13508
2022-11-17 17:56:27 -08:00
Vincent Wang
8b0669bf63
QuickGelu Fusion (#12417)
Some models have QuickGelu(x)=x*sigmoid(1.702x), which has 3 Ops for
forward and 5 Ops for backward. The PR is to fuse this to a single Op
named QuickGelu and its gradient QuickGeluGrad.
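
For reference, the fused function and its gradient can be written out as a scalar sketch. This is plain Python for illustration, not the CUDA or CPU kernel; the function names and the `alpha` parameter name are assumptions, and the gradient follows from differentiating x*sigmoid(1.702x).

```python
import math

ALPHA = 1.702


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def quick_gelu(x, alpha=ALPHA):
    """Fused forward: replaces the Mul, Sigmoid, Mul sequence."""
    return x * sigmoid(alpha * x)


def quick_gelu_grad(dy, x, alpha=ALPHA):
    """Fused backward: d/dx [x * s(ax)] = s + a * x * s * (1 - s)."""
    s = sigmoid(alpha * x)
    return dy * (s + alpha * x * s * (1.0 - s))
```

A finite-difference check such as `(quick_gelu(x + h) - quick_gelu(x - h)) / (2 * h)` is a quick way to confirm the gradient formula.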

For CUDA, tested in V100 using input tensor with shape [64,128,2048] and
float16 type:
Before, FW takes 335us, BW takes 614us

![image](https://user-images.githubusercontent.com/11661208/182291335-15188709-ffe7-44d1-9d14-0b544cbe5e55.png)

After, FW takes 115us, BW takes 139us, which is much faster.

![image](https://user-images.githubusercontent.com/11661208/182291502-f0b5161c-b95c-45fc-90f8-ad0c592d2433.png)

For CPU kernel, using same shape and float type:
Before, FW takes 10ms, BW takes 49ms
Mul: 3480[µs]
Sigmoid: 1996[µs]
Mul: 4789[µs]
Mul: 4642[µs]
Mul: 4195[µs]
SigmoidGrad: 18328[µs]
Mul: 2988[µs]
Sum: 18576[µs]

After, FW takes 4ms, BW takes 5ms, which is also much faster.
QuickGelu: 3939[µs]
QuickGeluGrad: 5089[µs]

Co-authored-by: Vincent Wang <weicwang@microsoft.com>
2022-10-28 18:12:07 +08:00
Tianlei Wu
7aafd86229
Update Attention operator to support separated Q/K/V inputs (#13410)
### Description
Allow separated Q, K and V inputs to support cross attention:
* Q: [batch_size, sequence_length, hidden_size]
* K: [batch_size, kv_sequence_length, hidden_size]
* V: [batch_size, kv_sequence_length, v_hidden_size]
* Output: [batch_size, sequence_length, v_hidden_size]

To use separated Q/K/V inputs, the input tensor is for query, and two
optional inputs are added for key and value. Weights for input
projection are not included for now, so the MatMul of input projection
must be done outside the Attention operator, but Add bias is included for
performance reasons.
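
The shape relationships listed above can be checked with a small single-head sketch for one batch item. It is an illustration only: it uses plain Python lists, assumes Q and K are already projected, and omits bias, masking, and the multi-head split. `cross_attention` is a hypothetical helper, not the operator itself.

```python
import math


def cross_attention(q, k, v):
    """q: [S, H], k: [L, H], v: [L, Hv] -> output: [S, Hv]."""
    scale = 1.0 / math.sqrt(len(q[0]))
    output = []
    for q_row in q:
        scores = [
            scale * sum(a * b for a, b in zip(q_row, k_row)) for k_row in k
        ]
        # Numerically stable softmax over the kv_sequence_length axis.
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        output.append(
            [
                sum(w * v_row[j] for w, v_row in zip(weights, v)) / total
                for j in range(len(v[0]))
            ]
        )
    return output


# sequence_length=2, kv_sequence_length=3, hidden_size=4, v_hidden_size=5
q = [[0.0] * 4 for _ in range(2)]
k = [[1.0] * 4 for _ in range(3)]
v = [[float(i)] * 5 for i in range(3)]
out = cross_attention(q, k, v)
```

The output has sequence_length rows and v_hidden_size columns, independent of kv_sequence_length.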
2022-10-25 11:51:06 -07:00
Ye Wang
928c9889a3
A few fixes for generative model ops (#13363)
### Description

Fix a bug in GreedySearch Op when batch > 1
Support custom attention mask in GreedySearch and BeamSearch with GPT2 


### Motivation and Context
2022-10-21 15:00:18 -07:00
garanews
38906625a3
fix some typo in docs (#13212)
### Description
fix some typo in docs


### Motivation and Context
singed vs signed
succeding vs succeeding 
fileter vs filter
kernal vs kernel
libary vs library
2022-10-07 15:58:18 -07:00
ashari4
b09dd11ece
BFP schemas: Change block dimension type to Int (#13169)
* Change block dimension type to Int from Ints.
* In response to feedback that the block dimension corresponds to the
reduction dimension of the consuming matrix multiplication. There is
always only 1 reduction dimension.
2022-10-06 11:11:43 -07:00
ashari4
c4a7e88fc8
QuantizeBFP and DequantizeBFP (#12833)
* `QuantizeBFP` and `DequantizeBFP` schemas - similar to
`QuantizeLinear` and `DequantizeLinear`.
* BFP datatype is represented as a `uint8` tensor with shape and stride
metadata. This is preferable to adding a new datatype for BFP, which is
more disruptive and [discouraged by
PyTorch](https://discuss.pytorch.org/t/training-with-custom-quantized-datatype/152132/2).

Context: 

The Microsoft Floating Point (BFP) datatype shares one exponent across every
group of n numbers, called a “bounding box.” Each number still has its own
mantissa and sign bits. BFP has been shown to incur 3-4x less cost
(energy and area) than BFloat16 and INT8 counterparts without reductions
in accuracy for the ImageNet benchmark as described in [Rouhani
2020](https://proceedings.neurips.cc/paper/2020/file/747e32ab0fea7fbd2ad9ec03daa3f840-Paper.pdf).
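
The shared-exponent idea can be sketched for a single bounding box as below. This is a generic block floating point illustration, not the QuantizeBFP schema or any hardware layout: the function names, the signed-integer mantissa representation, and the rounding and clamping choices are all assumptions.

```python
import math


def quantize_block(values, mantissa_bits):
    """Quantize one bounding box to (shared exponent, signed integer mantissas)."""
    # Shared exponent: the largest binary exponent among nonzero values.
    shared_exponent = max(
        (math.frexp(v)[1] for v in values if v != 0.0), default=0
    )
    scale = 2.0 ** (shared_exponent - mantissa_bits)
    limit = 2 ** mantissa_bits - 1
    mantissas = [max(-limit, min(limit, round(v / scale))) for v in values]
    return shared_exponent, mantissas


def dequantize_block(shared_exponent, mantissas, mantissa_bits):
    """Rebuild approximate values from the shared exponent and mantissas."""
    scale = 2.0 ** (shared_exponent - mantissa_bits)
    return [m * scale for m in mantissas]


exponent, mantissas = quantize_block([1.0, 0.5, -0.25, 0.0], mantissa_bits=4)
restored = dequantize_block(exponent, mantissas, mantissa_bits=4)
```

Values much smaller than the largest one in the box lose precision, which is why the bounding box size and dimension matter.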

Requirements:

* There are many variants of BFP (number of mantissa bits, number of
shared exponent bits, size of bounding box, custom bit fields, etc.)
* The size and layout of a BFP variant varies across hardware
* The bounding box can be over arbitrary dimensions; for example, for the
channel "C" dimension in a N x C x H x W tensor for convolution

Goals of this PR:

* Add initial versions of QuantizeBFP and DequantizeBFP operators to
enable QDQ-style quantization with BFP. Once the schemas stabilize, we
can consider upstreaming to ONNX.
* Add some basic type and shape inferencing tests; tests that run on an
EP will be a follow-up.
2022-09-22 14:02:55 -07:00
Hariharan Seshadri
ad69aac491
Introduce ordered quantization ops for the CUDA EP [1/n] (#12582)
Initial small core set of ordered quantization ops for the CUDA EP.
2022-09-07 11:58:15 -07:00
Yulong Wang
c144acc534
Replace 'master' branch ref to 'main' in the code (#12547) 2022-08-22 10:48:12 -07:00
Wei-Sheng Chin
dc486d146b
Make ORT callable from various Pytorch compilers (LazyTensor, TorchDynamo, etc) (#10460)
* Make ORT as Pytorch JIT backend

LORT likely doesn't work with aten fallback so we only test LORT in its own CI.

* Revert changes to enable external CUDA allocator. Will add it later.

Revert "Revert changes to enable external CUDA allocator. Will add it later."

This reverts commit d5487f2e193014c805505afae8fb577c53667658.

Fix external allocator

* Relax tolerance and remove commented code

* Print more information in CI

* Fix pointer

* Address comments.
1. Reuse ORT-eager mode's environment.
2. Remove unused ctor.

* Use Pytorch master branch as all PRs are merged

Fix

* Refine based on cpplint feedbacks

* Revert changes to allow custom CUDA allocator in public APIs

* Use torch.testing.assert_close

* Use unittest framework

* Switch docker repo

* Rename *.cpp to *.cc

* Address comments

* Add comment

* Use same pipeline file for eager and lort pipelines

* Address comments

* Add yaml comment

* Fix cmake files

* Address comments

* Rename flags, remove printing code, remove dead comment
2022-08-22 09:40:40 -07:00
Cheng
64e991a9fc
[Qlinearsoftmax] contrib cpu (#12177)
* [Qlinearsoftmax] contrib cpu

* int8 implementation

* contrib operator md

* qdq transformer test

* new attribute: opset

* doc

* quantized tool

* remove template to reduce Binary size

* doc of contrib operators

* enforce x_shape is valid

* fix reduce_size if input-shape is dynamic

* add UT

* register one op for reducing binary size

* kernel hash update

* docs/ContribOperators.md
2022-08-10 10:52:02 +08:00
Vincent Wang
37995a7245
[CUDA] BiasSoftmax Supporting New Pattern (#12361) 2022-08-05 06:59:24 +08:00