Commit graph

539 commits

Author SHA1 Message Date
Linnea May
95a4607dcf
User/linneamay/roi align 16 (#15812)
### Description
<!-- Describe your changes. -->
Add registration for DML RoiAlign-16 and tests for new
coordinate_transform_mode attribute. PR
[7354](https://github.com/microsoft/onnxruntime/pull/7354) is still open
to fix the CPU EP version, which is why there are skipped tests right
now. That will be completed separately so that, for now, we can
officially support opset16 with the next release.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

---------

Co-authored-by: Linnea May <linneamay@microsoft.com>
Co-authored-by: Dwayne Robinson <dwayner@microsoft.com>
2023-05-09 21:56:41 -07:00
Sumit Agarwal
b473e3f3c6
[DML EP] Update DirectML version to 1.11.0 (#15858)
### Description
- Update DML version to 1.11.0
- Disable Gemm+Softmax fusion



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-05-09 12:48:15 -07:00
Sheil Kumar
2b7f26af7c
Add GridSample implementation to DirectML (#15788)
Add GridSample implementation to DirectML EP.

Temporary add HLSL shader in the DirectML EP to handle GridSample until
officially added to DirectML.
2023-05-05 15:59:33 -07:00
Changming Sun
1fb2f2605b
Update VERSION_NUMBER (#15773)
### Description

1. Update VERSION_NUMBER for preparing the upcoming release. This PR's
commit will not be included in the 1.15 release branch
2. Delete package/rpm/onnxruntime.spec since it was not used in past
years.

### Motivation and Context
Preparing the release.

Fixed
[AB#15311](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/15311)
2023-05-03 15:07:34 -07:00
Baiju Meswani
2d519d21af
Python documentation for onnxruntime-training (#15765) 2023-05-02 16:58:16 -07:00
Nat Kershaw (MSFT)
9219615471
Fix python AP docs generation (#15760)
Docs are failing on the operator generation step. Remove this
temporarily so that we can publish.
2023-05-01 18:31:59 -07:00
liqun Fu
62fc6ed5a8
[Feature Request] Support Resize opset 19 (#15633) 2023-05-01 10:49:17 -07:00
Linnea May
2c3697be00
User/linneamay/reduce 18 (#15701)
### Description
<!-- Describe your changes. -->
Add registration for DML reduce functions in opset 18. 


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

---------

Co-authored-by: Linnea May <linneamay@microsoft.com>
2023-04-27 20:32:11 -07:00
kunal-vaishnavi
39d6d7050d
Change EmbedLayerNormalization mask index output to optional (#15526)
### Description
This PR changes an EmbedLayerNormalization node's mask index output to
be an optional output if a mask input is not provided.



### Motivation and Context
The documentation for EmbedLayerNormalization states 
```
The last input mask is optional. If mask is provided, mask index (that is position of first 0 in mask, or number of words) will be calculated.
```
However, if the mask input is not provided, the mask index output is
still calculated and required.
2023-04-27 16:32:42 -07:00
Justin Chu
76ddc92fbd
Enable RUFF as a formatter (#15699)
### Description

RUFF can now format since lintrunner-adapters v0.8. Removed the RUFF-FIX
linter.



### Motivation and Context

Better engineering
2023-04-26 14:04:07 -07:00
sfatimar
ebaafac3f5
Openvino ep ort 5.0 (#15626)
### Description
The PR adds VPU support to OpenVINO Execution Provider
Bug fixes for GPU, CPU. 
Changes to OpenVINO Backend in Serialized Model API for faster First
Inference Latency.
Deprecation to HDDL-VADM and MYRIAD, removed code
Support OpenVINO 2023.0 
Dynamic Shapes Support for iGPU

### Motivation and Context
- VPU is an upcoming hardware that can provide AI Acceleration for
Client Systems through OpenVINO
- If it fixes an open issue, please link to the issue here. -->

---------

Signed-off-by: MaajidKhan <n.maajid.khan@intel.com>
Co-authored-by: Suryaprakash Shanmugam <suryaprakash.shanmugam@intel.com>
Co-authored-by: MaajidKhan <n.maajid.khan@intel.com>
Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com>
2023-04-25 20:59:42 -07:00
Baiju Meswani
5885abfb35
Training Documentation (#15612) 2023-04-25 11:44:12 -07:00
Baiju Meswani
11b0a18de6
Add support for cuda 11.8 and python 3.11 for training (#15548) 2023-04-20 12:56:45 -07:00
kunal-vaishnavi
901c2bc384
Whisper Model Optimization (#15473)
### Description
This PR contains fusion-level and kernel-level optimizations for
[OpenAI's Whisper](https://github.com/openai/whisper).

Some of the added optimizations include:

- Pruning of duplicate/unnecessary inputs and outputs
- Fusion support for Whisper models with or without these inputs/outputs
(e.g. with these inputs/outputs if exporting with an older official
Optimum version, without these inputs/outputs if exporting with Optimum
from source)
- Attention fusions
   - For Whisper's encoder and decoder
- Modified symbolic shape inference for present output when no past
input exists (for decoder)
- Multi-head attention fusions
   - For Whisper's decoder and decoder with past
- Packed MatMul for the 3 MatMuls excluded in multi-head attention
fusion
- Attention kernel changes
   - CPU:
      - Different Q and KV sequence lengths
      - Parallel memset for large sequence lengths
- Convert broadcast add after MatMul of Q and K (add_qk) to element-wise
add
- Separate present key-value output into present key and present value
(for multi-head attention spec)
   - CUDA:
- Use memory efficient attention compute kernel with present state (for
decoder)
- Multi-head attention kernel changes
   - CPU:
- Introduction of multi-head attention CPU kernel (previously did not
exist)
- Use AddBiasReshape instead of AddBiasTranspose when sequence length =
1 (for decoder with past)
      - Different Q, K, V input shapes
      - Pass past key and past value directly as key and value
   - CUDA:
- Use memory efficient attention compute kernel with past and/or present
state (for decoder with past)

### Usage
To use the optimizations, run the ORT transformer optimizer script as
follows:
```
$ cd onnxruntime/onnxruntime/python/tools/transformers/
$ python3 optimizer.py --input <filename>.onnx --output <filename>.onnx --model_type bart --num_heads <number of attention heads, depends on the size of the whisper model used> --hidden_size <attention hidden size, depends on the size of the whisper model used> --use_external_data_format --use_multi_head_attention
```

Once optimized, here's an example of how to run Whisper with [Hugging
Face's Optimum](https://github.com/huggingface/optimum):
```
from transformers.onnx.utils import get_preprocessor
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq
from optimum.pipelines import pipeline as ort_pipeline

import whisper # Installed from OpenAI's repo - setup instructions at https://github.com/openai/whisper/

directory = './whisper_opt' # Where the optimized ONNX models are located
model_name = 'openai/whisper-tiny'
device = 'cpu'

# Get pipeline
processor = get_preprocessor(model_name)
model = ORTModelForSpeechSeq2Seq.from_pretrained(
    directory,
    use_io_binding=(device == 'cuda'),
    provider='CPUExecutionProvider',
).to(device)
pipe = ort_pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    device=(-1 if device == 'cpu' else 0),
)

# Load audio file and run pipeline
audio = whisper.load_audio('tests/jfk.flac')
audio = whisper.pad_or_trim(audio)
outputs = pipe([audio])
print(outputs)
```

Note: In order to use these changes with Optimum, it is recommended to
use Optimum from source to have the following changes:
- https://github.com/huggingface/optimum/pull/872
- https://github.com/huggingface/optimum/pull/920

### Motivation and Context
This PR helps the following issues:
- https://github.com/microsoft/onnxruntime/issues/15100
- https://github.com/microsoft/onnxruntime/issues/15235
- https://github.com/huggingface/optimum/issues/869 (work in progress)

This PR can be used with the other currently merged Whisper PRs:
- https://github.com/microsoft/onnxruntime/pull/15247
- https://github.com/microsoft/onnxruntime/pull/15339
- https://github.com/microsoft/onnxruntime/pull/15362
- https://github.com/microsoft/onnxruntime/pull/15365
- https://github.com/microsoft/onnxruntime/pull/15427

This PR uses changes from the following merged PRs:
- https://github.com/microsoft/onnxruntime/pull/14198
- https://github.com/microsoft/onnxruntime/pull/14146
- https://github.com/microsoft/onnxruntime/pull/14201
- https://github.com/microsoft/onnxruntime/pull/14928 (this introduced
the new multi-head attention spec)
2023-04-18 17:13:54 -07:00
liqun Fu
919d8f2660
update with onnx main (#14929) 2023-04-18 08:42:51 -07:00
Justin Chu
a36caba073
Bump ruff in CI (#15533)
### Description

Bump ruff version in CI and fixed new lint errors. 

- This change enables the flake8-implicit-str-concat rules which helps
detect unintended string concatenations:
https://beta.ruff.rs/docs/rules/#flake8-implicit-str-concat-isc
- Update gitignore to include common python files that we want to
exclude.


### Motivation and Context

Code quality
2023-04-17 10:11:44 -07:00
pengwa
516c8e95fa
Optimize SCE loss compute (#15401)
### Optimize SCE loss compute

Compute optimization based on label data sparsity:
- Insert ShrunkenGather before SCELoss node, to filter out invalid
labels for compute.
- Support ShrunkenGather upstream.
- Added test for the above.
- Added flag to enable label sparsity optimization with env var, by
default disabled now. Will enable after comprehensive benchmarking
later.
- Extract common logic into test_optimizer_utils.h/cc from
core/optimizer/compute_optimzier_test.cc, then the common functions can
be shared by both core/optimizer/compute_optimzier_test.cc and
orttraining/core/optimizer/compute_optimzier_test.cc
- Extract common logic into shared_utils.h/cc: `GetONNXOpSetVersion` and
`Create1DInitializerFromVector`


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-04-13 13:02:12 +08:00
Patrice Vignola
3be5bfe363
[DML EP] Add MatMul + SoftMax fusion (#15240) 2023-04-11 08:31:04 -07:00
Patrice Vignola
7c927bb95c
[DML EP] Add BiasSplitGelu (#15197) 2023-04-11 08:30:37 -07:00
Patrice Vignola
c5b6ee1a99
[DML EP] Add NhwcConv (#15194) 2023-04-10 23:16:09 -07:00
Patrice Vignola
4a676b011a
[DML EP] Add BiasAdd (#15211) 2023-04-10 14:46:33 -07:00
stevenlix
6d126f8996
Add FP16 support for Whisper model (#15427)
Current ORT can only run inference for Whisper FP32 model. This PR adds
FP16 support.
2023-04-08 21:36:10 -07:00
Chen Fu
8dce83a818
Fuse 'Add' operator into FP16 Conv (#15213)
### Description
Adding 'Add' functionality to FP16 Conv operator. It takes a tensor that
has the same shape of the output tensor, and add it to the result
tensor.


### Motivation and Context
Needed to run Resnet 50
2023-04-07 09:51:03 -07:00
Patrice Vignola
9191e04259
[DML EP] Add QuickGelu (#15220) 2023-04-05 10:49:34 -07:00
Aditya Goel
a4e9a48345
Reduce operators support for int64 type (#15358) 2023-04-05 09:19:43 -07:00
Aditya Goel
1c1d386561
Adds int32_t and uint32_t clip kernels (#15306) 2023-04-04 13:44:50 -07:00
petermcaughan
1251964f96
Petermca/beamsearch whisper (#15339)
### Description
Adjust various code paths to allow Whisper model to function with
BeamSearch op.

Approach: Add a new kModelType enum value in IGenerationParameters as
so:
#### Old: 0 = GPT2, 1 = T5
#### New: 0 = GPT2, 1 = T5, 2 = Whisper

When the user assigns this attribute value to 2, various shape and type
checks are changed to accommodate Whisper inputs.


### Motivation and Context
BeamSearch is currently designed to function with BERT-based models with
inputs as vocab tokens, and needs changes to function with Whisper
inputs (3-D float values processed from audio data).

---------

Co-authored-by: Peter McAughan <petermca@microsoft.com>
2023-04-04 09:09:10 -07:00
pengwa
5baf5f506b
log level control + fix typos (#15302)
### log level control + fix typos
2023-04-04 20:19:13 +08:00
Ye Wang
fbfe92f66a
DecoderMaskedMultiHeadAttention enhancement (#15292) 2023-04-02 21:53:03 -07:00
Yufeng Li
c08d6b42e8
Add tool to support packing mode for BERT model (#15283)
### Description
<!-- Describe your changes. -->
Add a tool to convert fused BERT like model to packing mode


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-03-31 08:46:47 -07:00
Xavier Dupré
786f8b98f7
Add a page in the documentation for every operator in onnxruntime (#14340) 2023-03-30 14:39:16 -07:00
Jian Chen
792d411135
Update python 3.11 and remove 3.7 for Linux (#15214)
### Description
Update python 3.11 and remove 3.7



### Motivation and Context
Update python 3.11 and remove 3.7

---------

Co-authored-by: Ubuntu <chasun@chasunlinux.lw3b1xzoyrkuzm34swpscft0ff.dx.internal.cloudapp.net>
2023-03-27 14:46:30 -07:00
Patrice Vignola
67a6022c03
[DML EP] Add GroupNorm (#15189)
Comparison between the different normalization operations:
![](https://user-images.githubusercontent.com/1041752/106491728-73d40680-64b7-11eb-8769-3f758996e959.png)
2023-03-27 12:52:53 -07:00
Justin Chu
d834ec895a
Adopt linrtunner as the linting tool - take 2 (#15085)
### Description

`lintrunner` is a linter runner successfully used by pytorch, onnx and
onnx-script. It provides a uniform experience running linters locally
and in CI. It supports all major dev systems: Windows, Linux and MacOs.
The checks are enforced by the `Python format` workflow.

This PR adopts `lintrunner` to onnxruntime and fixed ~2000 flake8 errors
in Python code. `lintrunner` now runs all required python lints
including `ruff`(replacing `flake8`), `black` and `isort`. Future lints
like `clang-format` can be added.

Most errors are auto-fixed by `ruff` and the fixes should be considered
robust.

Lints that are more complicated to fix are applied `# noqa` for now and
should be fixed in follow up PRs.

### Notable changes

1. This PR **removed some suboptimal patterns**:

	- `not xxx in` -> `xxx not in` membership checks
	- bare excepts (`except:` -> `except Exception`)
	- unused imports
	
	The follow up PR will remove:
	
	- `import *`
	- mutable values as default in function definitions (`def func(a=[])`)
	- more unused imports
	- unused local variables

2. Use `ruff` to replace `flake8`. `ruff` is much (40x) faster than
flake8 and is more robust. We are using it successfully in onnx and
onnx-script. It also supports auto-fixing many flake8 errors.

3. Removed the legacy flake8 ci flow and updated docs.

4. The added workflow supports SARIF code scanning reports on github,
example snapshot:
	

![image](https://user-images.githubusercontent.com/11205048/212598953-d60ce8a9-f242-4fa8-8674-8696b704604a.png)

5. Removed `onnxruntime-python-checks-ci-pipeline` as redundant

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

Unified linting experience in CI and local.

Replacing https://github.com/microsoft/onnxruntime/pull/14306

---------

Signed-off-by: Justin Chu <justinchu@microsoft.com>
2023-03-24 15:29:03 -07:00
Nat Kershaw (MSFT)
28f64066de
Auto deploy API docs (#15088) 2023-03-23 15:08:49 -07:00
Ye Wang
44ba23e0f5
Rename DecoderMaskedMHA to DecoderMaskedSelfAttn (#15166)
### Description
<!-- Describe your changes. -->

As synced offline, rename this op and will create another op for mha
that supports both self and cross attention.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

---------

Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
2023-03-23 12:31:38 -07:00
Ye Wang
2ee822d483
Extend memory efficient attention coverage in Attention/MHA cuda op (#15064)
### Description
<!-- Describe your changes. -->

1. upgrade cutlass to 3.0 that containing attn_bias support.
2. extend Attention/MHA to use memory efficient attention when
rel_pos_bias with [1, num_head, s, s*] and 1d mask with [2 * batch_size
+ 1] are present.

new mask format introduction:
MASK_1D_KEY_SEQ_LEN_START,  
[3 * batch_size + 2] with [key_len[0], ..., key_len[batch_size - 1],
query_start[0], ..., query_start[batch_size - 1], query_end[batch_size -
1], key_start[0], ..., key_start[batch_size - 1], key_end[batch_size -
1]]

e.g
2D mask with [[1, 1, 1, 0, 0, 0], [1, 1, 1, 1, 1, 0]] converts to this
1D mask is [3, 5, 0, 6, 12, 0, 6, 12]


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

It potentially benefits tnlrv6 and t5(encoder)

---------

Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
Co-authored-by: Kunal Vaishnavi <kvaishnavi@microsoft.com>
Co-authored-by: Kunal Vaishnavi <kvaishnavi@microsoft.com@orttrainingdev7.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
2023-03-23 11:05:17 -07:00
Hariharan Seshadri
7033346605 Support mask_filter_value attribute in DecoderMaskedMultiheadAttention (#15158) 2023-03-23 11:00:09 -07:00
pengwa
1d32285536
Statistics tool for ORTModule convergence parity (#15020)
### Statistics tool for ORTModule convergence parity

As ORTModule get more and more validated, it is pretty fast to
intergrade PyTorch based model with ORT.

The same time, we need make sure once there is convergence issue, we
don't spend months of time to investigate. As part of this efforts, this
PR is introducing a tool to dump activation statistics without much
involvement from users. The dumping results contains only some statistic
numbers plus sampled data, which is not big, compared with dumping all
the tensors, it is much faster and space efficient.

For us to use it, two single lines are needed before wrapping ORTModule.
For baseline run, need also apply the same trick.

```
+	from onnxruntime.training.utils.hooks import SubscriberManager, StatisticsSubscriber
+	SubscriberManager.subscribe(model, [StatisticsSubscriber("pt_out", override_output_dir=True)])
```

Once you run the steps, following command can be used to merge result
into per-step-summary respectively for ORT and baseline runs.
 
```bash
python -m onnxruntime.training.utils.hooks.merge_activation_summary --pt_dir pt_out --ort_dir ort_out --output_dir /tmp/output
```

Docs is added here as part of this PR [convergence investigation
notes](https://github.com/microsoft/onnxruntime/blob/pengwa/conv_tool/docs/ORTModule_Convergence_Notes.md)

Based on the generated merged files, we can compare them with tools. 


![image](https://user-images.githubusercontent.com/10530022/224653929-4e4480bd-bb02-4bbe-bd44-2672bdf91a87.png)

### Design and Implementation

This PR introduced a common mechanism registering custom logic for
nn.Module's post forward hooks. And statistics for activation
(StatisticsSubscriber) is one of the implementations. If there is other
needs, we can define another XXSubscriber to do the customized things.
2023-03-23 20:34:24 +08:00
Yufeng Li
c7ced7a5e9
Add PackedAttention for packing mode (#14858)
### Description
<!-- Describe your changes. -->
Transformer models can handle batch of inputs at once. However,
sequences in a batch usually have different length. Then we have to pad
the short one to have same length as the longest. This is not efficient
especially for large batch with high variance.

This PR introduces a PackedAttention operator which can take in packed
sequences (no padding) and also produces output in packing mode.

There will be another PR to use the PackedAttention to implement the
encoder in packing mode.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-03-21 12:59:29 -07:00
Hariharan Seshadri
ed7ab1660d
[CUDA] Add option to use DecoderMaskedMultiheadAttention in BeamSearch (#14990) 2023-03-15 17:16:32 -07:00
Ye Wang
538d64891a
[t5 optimization] kernel changes to t5 (#14928)
### Description
<!-- Describe your changes. -->

1. support optional bias in Attention op (used in T5 encoder)
2. support broadcasting rel_pos_bias in attention_softmax.h
3. add scale in
MHA op's attributes
4. support past_key/past_value and present_key/present_value in MHA
5. UT and parity tests are added
6. fix an issue: https://github.com/microsoft/onnxruntime/issues/14920

note: the fusions will be in another PR since mt5 needs to be tested and
an issue from github will be investigated.

Future works:
1. support shared buffer for past/present
2. enable trt kernels when possible and investigate (trt/cutlass)kernels
with rel_pos_bias)
3. support KV/QKV packing with past/present

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

---------

Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
2023-03-13 14:29:16 -07:00
Hariharan Seshadri
112a4d215a
[CUDA] Support decoding multihead self-attention implementation (#14848) 2023-03-08 09:17:54 -08:00
pengwa
f6c81d8aca
Introduce padding inspector in ORTModule (#14652)
### Introduce padding inspector in ORTModule

In some Transformer-based LLM training recipes, high data sparsity is
observed due to 1). token padding (to max sequence length), 2). labels
contains many ignore_index for calculate loss.

This PR introduces a switch to enable data sparsity inspection, which 
1). in short term, can inform training users to use techniques like
dynamic batching to amortize the issue.
2). in medium and longer term, also helps us (training team) to have
better understanding what our training customers' models looks like from
perspective of data sparsity (and potentially motivate us to improve
with runtime).

Here is an example of different data sparsity with same training model
arch, same training input, but with different user models.

**Low Embed Density, High Label Density Case - Sentence Classification**
`
python -m torch.distributed.launch --nproc_per_node=4
examples/onnxruntime/training/text-classification/run_glue.py
--model_name_or_path roberta-large-openai-detector --task_name mnli
--do_train --do_eval --max_seq_length 128 --per_device_train_batch_size
32 --learning_rate 2e-5 --num_train_epochs 3 --overwrite_output_dir
--output_dir ./outputs/ --per_device_eval_batch_size 32 --seed 1137
--fp16 True --ignore_mismatched_sizes True --optim adamw_ort_fused
`
```
>>>Valid token/label density (e.g. valid/total) in passing 10 steps:
        | STEP       | INPUT TYPE |  INPUT NAME     | PAD IDX    | DENSITY    | VALID TOKENS    | TOTAL TOKENS    | VALID TOKENS/BATCH |
        | 60         | EMBED      | input_ids       | 1          | 35.21    % | 1442            | 4096            | [50, 81, 35, 11, 29, 36, 66, 19, 40, 22, 21, 42, 17, 37, 40, 41, 26, 58, 38, 54, 41, 73, 48, 57, 50, 51, 49, 85, 48, 36, 79, 62] |
        | 61         | LABEL      | labels          | -100       | 100.00   % | 32              | 32              | N/A             |
        | 62         | EMBED      | input_ids       | 1          | 30.00    % | 1229            | 4096            | [36, 73, 13, 47, 27, 33, 53, 25, 51, 28, 36, 42, 42, 32, 39, 52, 27, 13, 31, 66, 42, 45, 52, 45, 58, 42, 37, 66, 12, 18, 29, 17] |
        | 63         | LABEL      | labels          | -100       | 100.00   % | 32              | 32              | N/A             |
        | 64         | EMBED      | input_ids       | 1          | 26.73    % | 1095            | 4096            | [37, 28, 20, 53, 16, 20, 44, 52, 27, 28, 16, 19, 16, 24, 63, 31, 24, 42, 33, 41, 44, 60, 44, 67, 54, 30, 20, 19, 33, 23, 24, 43] |
        | 65         | LABEL      | labels          | -100       | 100.00   % | 32              | 32              | N/A             |
        | 66         | EMBED      | input_ids       | 1          | 30.03    % | 1230            | 4096            | [22, 46, 36, 41, 46, 43, 26, 50, 60, 16, 24, 42, 56, 35, 35, 59, 29, 39, 34, 20, 66, 23, 47, 53, 19, 35, 44, 23, 34, 81, 21, 25] |
        | 67         | LABEL      | labels          | -100       | 100.00   % | 32              | 32              | N/A             |
        | 68         | EMBED      | input_ids       | 1          | 31.62    % | 1295            | 4096            | [75, 36, 48, 20, 38, 21, 49, 54, 38, 41, 26, 28, 80, 45, 48, 16, 22, 41, 34, 28, 37, 16, 74, 63, 62, 34, 22, 45, 23, 27, 37, 67] |
        | 69         | LABEL      | labels          | -100       | 100.00   % | 32              | 32              | N/A             |
<<<
```

**High Embed Density, Low Label Density Case - masked language model** 
`
python -m torch.distributed.launch --nproc_per_node=4
examples/onnxruntime/training/language-modeling/run_mlm.py
--model_name_or_path bert-base-uncased --dataset_name wikitext
--dataset_config_name wikitext-2-raw-v1 --num_train_epochs 10
--per_device_train_batch_size 8 --per_device_eval_batch_size 8
--do_train --do_eval --overwrite_output_dir --output_dir ./outputs/
--seed 1137 --fp16 --report_to none --optim adamw_ort_fused
`
```
>>>Valid token/label density (e.g. valid/total) in passing 10 steps:
        | STEP       | INPUT TYPE |  INPUT NAME     | PAD IDX    | DENSITY    | VALID TOKENS    | TOTAL TOKENS    | VALID TOKENS/BATCH |
        | 710        | EMBED      | input_ids       | 0          | 100.00   % | 4096            | 4096            | [512, 512, 512, 512, 512, 512, 512, 512] |
        | 711        | LABEL      | labels          | -100       | 13.77    % | 564             | 4096            | N/A             |
        | 712        | EMBED      | input_ids       | 0          | 100.00   % | 4096            | 4096            | [512, 512, 512, 512, 512, 512, 512, 512] |
        | 713        | LABEL      | labels          | -100       | 14.48    % | 593             | 4096            | N/A             |
        | 714        | EMBED      | input_ids       | 0          | 100.00   % | 4096            | 4096            | [512, 512, 512, 512, 512, 512, 512, 512] |
        | 715        | LABEL      | labels          | -100       | 14.18    % | 581             | 4096            | N/A             |
        | 716        | EMBED      | input_ids       | 0          | 100.00   % | 4096            | 4096            | [512, 512, 512, 512, 512, 512, 512, 512] |
        | 717        | LABEL      | labels          | -100       | 14.53    % | 595             | 4096            | N/A             |
        | 718        | EMBED      | input_ids       | 0          | 100.00   % | 4096            | 4096            | [512, 512, 512, 512, 512, 512, 512, 512] |
        | 719        | LABEL      | labels          | -100       | 15.31    % | 627             | 4096            | N/A             |
<<<
```

#### Next Step

Let's see how we leverage the data sparsity for improvement.
Optimizations on the way around compute optimizer wave 2:
> Loss compute flops reduction.
> Flatten/Unflatten embedding tokens to save compute flops.
2023-03-03 18:36:08 +08:00
Justin Stoecker
928289c414
STFT for DML EP (#14736)
### Description
Implements the STFT operator for the DirectML execution provider. This
is implemented as a custom op, just like the DFT kernel, because it's
implemented as a composite of two operators (DML Mul/Identity + DFT). As
such, this inherits the same restrictions as the existing DFT kernel
(requires power-of-two window sizes for now).

This change also adds a native FP16 shader to DFT so that both DFT/STFT
kernels support float16 tensors. There is no typed UAV fallback or
emulation path, so the HW _needs_ to support native float16. It also
appears the stockham shader was compiled with all optimizations disabled
and debug symbols (tsk tsk, Sheil), and this has been fixed.

This is passing all existing STFT tests (i.e. all of 1). I'm adding some
additional collateral in the Windows AI conformance tests in parallel to
check some extra cases.

---------

Co-authored-by: Patrice Vignola <vignola.patrice@gmail.com>
2023-02-23 21:12:22 -08:00
James Yuzawa
d925055a3e
Fix broken and outdated links in documentation (#14092)
### Description
<!-- Describe your changes. -->

I fixed some broken links in the C API documentation, but then did a
quick pass over all of the links I could find and then fixed those.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

I got some 404's when exploring the documentation and wanted to fix it.
2023-02-23 10:48:04 -08:00
Ye Wang
58da3cacdf
support NeoX-style rotary embedding (#14785)
### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
2023-02-22 18:21:34 -08:00
Sheil Kumar
1b7f65437e
Enable Opset11 Sequence Ops on DirectML, and make the CPU implementations agnostic to backend EP (#14442)
Enable Opset11 Sequence Ops on DirectML, and make the CPU
implementations agnostic to backend EP

Opset 11 introduced the following sequence related operators:
    - SequenceAt
    - SequenceConstruct
    - SequenceEmpty
    - SequenceLength
    - SequenceErase
    - SequenceInsert 
    - ConcatFromSequence

With the exception of ConcatFromSequence, all of the above operators
were implemented with CPU kernels that a) required all of the contained
tensors to also be on CPU, and b) would clone each tensor into a new
sequence as a side effect of each operator. The implementation of
sequences are backend agnostic, as they dont affect actual tensor layout
or manipulate the contents of the tensors. In addition, with the
exception of SequenceAt, the other operators need not make copies of the
underlying referenced tensors.

Consequently, this change does the following:
1) Sequence* operators (except SequenceAt) no longer copies the contents
of a sequence of tensors on every kernel execution.
2) SequenceAt uses the DataTransferManager to copy tensors agnostic to
backend.
3) The internal container implemented by TensorSeq has changed from
onnxruntime::Tensor to OrtValue. This is because onnxruntime::Tensor
does not support copy or assignment construction, so it must have a
singular owner. However, is same tensor participates in multiple
containers it would have multiple container "owners" and this would not
be possible.
4) Other code that accessed values from TensorSeq have associated
changes to extract Tensors from OrtValues now.

In addition, DirectML execution was very slow when the above Sequence
operators were added to a graph, as this caused MemcpyToHost and
MemcpyFromHost kernels to be inserted between the graph and the sequence
operators. To optimize DirectML,
1) The CPU implementations for the Sequence* ops were registered as DML
implementations. Since the above changes also includes making the CPU
kernel implementations EP agnostic, the CPU kernels can be added as is.
2) The ConcatFromSequence operator needed to be implemented on DirectML.
However, there was little DirectML EP operator framework support for
operators that accept/output sequences of tensors. This change has
modified the internal COM interfaces to include new apis to interrogate
for sequence shapes, and extract the needed tensors from TensorSeq.

---------

Co-authored-by: Patrice Vignola <vignola.patrice@gmail.com>
2023-02-21 18:08:28 -08:00
Ryan Hill
892f59b31a
Add string support to tile op (#14686)
### Description
Add std::string tensor type support to Tile operator


### Motivation and Context
Multiple users are hitting this missing feature:
https://github.com/microsoft/onnxruntime/issues/14511
2023-02-16 14:59:44 -08:00
Tianlei Wu
eb2ac72fa9
Stable Diffusion CUDA Optimizations Part 4 (#14680)
(1) Support packed QKV format in MultiHeadAttention. This format could
avoid add bias transpose when TRT fused kernel is used.
(2) Add cache for cumulated sequence length computation. For SD, it only
need computed once since sequence length is fixed.
(3) Do not allocate qkv workspace to save memory for packed KV or QKV.
(4) Add unit tests for packed kv and packed qkv format in
MultiHeadAttention
(5) Mark some fusion options for SD only

Performance tests show slight improvement in T4. Average latency reduced
0.15 seconds (from 5.25s to 5.10s) for 512x512 in 50 steps for SD 1.5
models. Memory usage drops from 5.1GB to 4.8GB.
2023-02-15 14:55:42 -08:00