Commit graph

1210 commits

Author SHA1 Message Date
Christian Veenhuis
59dfcfdce7
Fix typos in sources: operater, tranform, neccessary, trainig (#14907)
### Description
While browsing the sources I found several typos here and there.
I collected them to a single PR and fixed them.
Namely these typos are: operater, tranform, neccessary, trainig.
After fixing none of them was found anymore:

$ git grep "operater"
$ git grep "tranform"
$ git grep "neccessary"
$ git grep "trainig"
$ 

### Motivation and Context
Since some of the typos are in example notebooks and markdown files,
users can see them.
2023-03-13 22:45:04 -07:00
pengwa
44dda08b51
Renaming files (#15015)
### Renaming files for compute optimizer

### Motivation and Context

A follow up for https://github.com/microsoft/onnxruntime/pull/14832
2023-03-13 17:07:59 +08:00
pengwa
448e989df8
Op slicing upstream refactor (#14832)
### Slice op upstream refactor

A refactor work for https://github.com/microsoft/onnxruntime/pull/13672.

### Motivation and Context

There is a similar optimization opportunity for other operator
upstreaming, to reduce compute flops. So refactor the existing code base
for making it easier to support other ops.

The changes in this PR are mainly about renaming and moving. 
- Move common logic (from compute_optimizer.h/cc) into
upstream_transformer_base.h/cc and shared_utils.h/cc.
   - Upstream common logic is moved into
upstream_transformer_base.h/cc.
   - Shared utilities are moved to shared_utils.h/cc.
- After the move, compute_optimizer.h/cc is mainly for the upstreaming
gather implementation (inheriting upstream_transformer_base.h/cc).
Ideally it should be renamed, but for easier review this time, I keep
its name.
2023-03-13 08:19:32 +08:00
Dmitri Smirnov
0d7855ea5a
Re-work global objects dependencies in pybind layer. (#14941)
### Description
Re-work handling of static objects in pybind.
Make sure we ref-count Environment from Sessions.

The following has been done:

- Make global objects function static. This ensures that the objects are
constructed on demand. The first object constructed is destructed last.
This is platform independent.
- Make global objects ownership shared as suggested by pybind since they
are not surfaced at Python level, and they cannot be referred to by
dependent python objects. Verified that all python objects are GCed
before globals are destroyed. This takes care of inference session
dependency on environment and its default logger and this is also
platform independent.
- Utilize pybind atexit mechanism to clear execution providers and
unload CUDA libraries (as suggested by
https://github.com/microsoft/onnxruntime/pull/14903). Since this is
registered for module exit, it takes place before any other globals are
destroyed and clears shared object state or even unloads the libraries.
This should also work in a platform independent way.

### Motivation and Context

- Global object destruction order is managed manually, and that has
become a source of trouble. We want to make it deterministic and
platform independent.
- Frequent hangs in the Python layer are caused by the static objects'
destruction order. Some of the Python session objects are garbage
collected after main exits, yet they require the ORT environment to be
alive (use after free).
2023-03-10 13:55:31 -08:00
Baiju Meswani
748758c135
Address issue with uninitialized variable (#14988) 2023-03-10 09:24:04 -08:00
Adam Pocock
47f00b5d49
[Java] Initial on device training support (#14027)
contributor: @Craigacp
2023-03-08 10:01:08 -08:00
Ashwini Khade
f14ab63c19
fix prefast warnings (#14931)
### Description
Fixes prefast warnings

Fixed
[AB#11328](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/11328)
Fixed
[AB#11329](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/11329)
2023-03-08 09:49:15 -08:00
Kyushick Lee
c696392f0c
Support external output tensors for DORT (#14516)
### Description
Support externally-managed output tensors (torch Tensors) for dort. 
Add `preallocate_output` option to OrtBackend to rely on
externally-managed output tensors for dort.


### Motivation and Context
DORT currently allocates and returns output ortvalues and converts them
to torch Tensors. The dlpack-based conversion does not support torch
Tensors for custom Aten backends, and it is not yet possible to transfer
ownership from an ortvalue to an external handle (torch Tensor).

To avoid this issue, this PR provides an option
(`preallocate_output`) to allocate output tensors externally in pytorch,
which creates a torch Tensor for an Aten backend, and lets dort take
pointers from the torch Tensors to construct output ortvalues instead of
allocating them inside InferenceSession.
2023-03-07 21:32:23 -08:00
pengwa
5d8ce817cb
Fix simplified layer norm fusion for training (#14866)
### Fix simplified layer norm fusion for training

Co-author with @prathikr.

Fix bug identified by @prathikr.
https://github.com/microsoft/onnxruntime/issues/14822.

Running a T5 model with deepspeed enabled, we see that simplified layer
norm is not fused because the device check did not pass

b7fde84341/onnxruntime/core/optimizer/layer_norm_fusion.cc (L568).
During the pre-training optimization pass there is no device placement
yet, so the failed device check is expected.

On the other hand, the device check is still needed so that simplified
layer norm fusion works correctly for CPU runs. As a mitigation, this PR
adds a flag to indicate whether the fusion is triggered by pre-training
optimization or not. There is still a risk when we run ORTModule
training with the CPU EP, but I feel the risk can be much reduced if we
check that CUDA/ROCM is enabled for the build.

```
CUDA_VISIBLE_DEVICES=0 python examples/onnxruntime/training/summarization/run_summarization.py --model_name_or_path t5-small --do_train --dataset_name cnn_dailymail --dataset_config "3.0.0" --source_prefix "summarize: " --predict_with_generate --overwrite_output_dir --output_dir /bert_ort/pengwa/output --fp16 --max_steps 1 --logging_steps 1 --deepspeed aml_ds_config_zero_1.json
```

### Motivation and Context
2023-03-07 13:59:20 -08:00
pengwa
f6c81d8aca
Introduce padding inspector in ORTModule (#14652)
### Introduce padding inspector in ORTModule

In some Transformer-based LLM training recipes, high data sparsity is
observed due to 1) token padding (to the max sequence length), and
2) labels that contain many ignore_index entries, which are skipped when
calculating the loss.

This PR introduces a switch to enable data sparsity inspection, which
1) in the short term, can prompt training users to use techniques like
dynamic batching to mitigate the issue;
2) in the medium and longer term, helps us (the training team) better
understand what our training customers' models look like from the
perspective of data sparsity (and potentially motivates us to make
improvements in the runtime).

Here are examples of different data sparsity with the same training
model arch and the same training input, but with different user models.

**Low Embed Density, High Label Density Case - Sentence Classification**
`
python -m torch.distributed.launch --nproc_per_node=4
examples/onnxruntime/training/text-classification/run_glue.py
--model_name_or_path roberta-large-openai-detector --task_name mnli
--do_train --do_eval --max_seq_length 128 --per_device_train_batch_size
32 --learning_rate 2e-5 --num_train_epochs 3 --overwrite_output_dir
--output_dir ./outputs/ --per_device_eval_batch_size 32 --seed 1137
--fp16 True --ignore_mismatched_sizes True --optim adamw_ort_fused
`
```
>>>Valid token/label density (e.g. valid/total) in passing 10 steps:
        | STEP       | INPUT TYPE |  INPUT NAME     | PAD IDX    | DENSITY    | VALID TOKENS    | TOTAL TOKENS    | VALID TOKENS/BATCH |
        | 60         | EMBED      | input_ids       | 1          | 35.21    % | 1442            | 4096            | [50, 81, 35, 11, 29, 36, 66, 19, 40, 22, 21, 42, 17, 37, 40, 41, 26, 58, 38, 54, 41, 73, 48, 57, 50, 51, 49, 85, 48, 36, 79, 62] |
        | 61         | LABEL      | labels          | -100       | 100.00   % | 32              | 32              | N/A             |
        | 62         | EMBED      | input_ids       | 1          | 30.00    % | 1229            | 4096            | [36, 73, 13, 47, 27, 33, 53, 25, 51, 28, 36, 42, 42, 32, 39, 52, 27, 13, 31, 66, 42, 45, 52, 45, 58, 42, 37, 66, 12, 18, 29, 17] |
        | 63         | LABEL      | labels          | -100       | 100.00   % | 32              | 32              | N/A             |
        | 64         | EMBED      | input_ids       | 1          | 26.73    % | 1095            | 4096            | [37, 28, 20, 53, 16, 20, 44, 52, 27, 28, 16, 19, 16, 24, 63, 31, 24, 42, 33, 41, 44, 60, 44, 67, 54, 30, 20, 19, 33, 23, 24, 43] |
        | 65         | LABEL      | labels          | -100       | 100.00   % | 32              | 32              | N/A             |
        | 66         | EMBED      | input_ids       | 1          | 30.03    % | 1230            | 4096            | [22, 46, 36, 41, 46, 43, 26, 50, 60, 16, 24, 42, 56, 35, 35, 59, 29, 39, 34, 20, 66, 23, 47, 53, 19, 35, 44, 23, 34, 81, 21, 25] |
        | 67         | LABEL      | labels          | -100       | 100.00   % | 32              | 32              | N/A             |
        | 68         | EMBED      | input_ids       | 1          | 31.62    % | 1295            | 4096            | [75, 36, 48, 20, 38, 21, 49, 54, 38, 41, 26, 28, 80, 45, 48, 16, 22, 41, 34, 28, 37, 16, 74, 63, 62, 34, 22, 45, 23, 27, 37, 67] |
        | 69         | LABEL      | labels          | -100       | 100.00   % | 32              | 32              | N/A             |
<<<
```

**High Embed Density, Low Label Density Case - masked language model** 
`
python -m torch.distributed.launch --nproc_per_node=4
examples/onnxruntime/training/language-modeling/run_mlm.py
--model_name_or_path bert-base-uncased --dataset_name wikitext
--dataset_config_name wikitext-2-raw-v1 --num_train_epochs 10
--per_device_train_batch_size 8 --per_device_eval_batch_size 8
--do_train --do_eval --overwrite_output_dir --output_dir ./outputs/
--seed 1137 --fp16 --report_to none --optim adamw_ort_fused
`
```
>>>Valid token/label density (e.g. valid/total) in passing 10 steps:
        | STEP       | INPUT TYPE |  INPUT NAME     | PAD IDX    | DENSITY    | VALID TOKENS    | TOTAL TOKENS    | VALID TOKENS/BATCH |
        | 710        | EMBED      | input_ids       | 0          | 100.00   % | 4096            | 4096            | [512, 512, 512, 512, 512, 512, 512, 512] |
        | 711        | LABEL      | labels          | -100       | 13.77    % | 564             | 4096            | N/A             |
        | 712        | EMBED      | input_ids       | 0          | 100.00   % | 4096            | 4096            | [512, 512, 512, 512, 512, 512, 512, 512] |
        | 713        | LABEL      | labels          | -100       | 14.48    % | 593             | 4096            | N/A             |
        | 714        | EMBED      | input_ids       | 0          | 100.00   % | 4096            | 4096            | [512, 512, 512, 512, 512, 512, 512, 512] |
        | 715        | LABEL      | labels          | -100       | 14.18    % | 581             | 4096            | N/A             |
        | 716        | EMBED      | input_ids       | 0          | 100.00   % | 4096            | 4096            | [512, 512, 512, 512, 512, 512, 512, 512] |
        | 717        | LABEL      | labels          | -100       | 14.53    % | 595             | 4096            | N/A             |
        | 718        | EMBED      | input_ids       | 0          | 100.00   % | 4096            | 4096            | [512, 512, 512, 512, 512, 512, 512, 512] |
        | 719        | LABEL      | labels          | -100       | 15.31    % | 627             | 4096            | N/A             |
<<<
```
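The DENSITY column in the tables above is the ratio of valid entries to total entries. A minimal sketch of that arithmetic in plain Python follows; the function name `token_density` and the list-of-lists input are illustrative assumptions, not the inspector's actual implementation.

```python
def token_density(batch, pad_idx):
    """Return (density_percent, valid, total, valid_per_row) for a batch.

    `batch` is a list of equally padded rows of token ids (hypothetical
    input format); an entry counts as valid when it differs from `pad_idx`
    (the padding id for embeddings, or ignore_index such as -100 for labels).
    """
    valid_per_row = [sum(1 for token in row if token != pad_idx) for row in batch]
    valid = sum(valid_per_row)
    total = sum(len(row) for row in batch)
    density = 100.0 * valid / total if total else 0.0
    return density, valid, total, valid_per_row


# Two rows padded to length 4 with pad id 1.
density, valid, total, per_row = token_density([[5, 6, 1, 1], [7, 1, 1, 1]], pad_idx=1)
print(f"{density:.2f} % | {valid} | {total} | {per_row}")
```

The same ratio reproduces the first table row: 1442 valid tokens out of 4096 is about 35.21 %.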

#### Next Step

Let's see how we can leverage the data sparsity for improvement.
Optimizations under way as part of compute optimizer wave 2:
> Loss compute flops reduction.
> Flatten/Unflatten embedding tokens to save compute flops.
2023-03-03 18:36:08 +08:00
guyang3532
c49f250a14
Delete ort_model._modules to forward its access to torch_model._modules (#14563)
A missing '_modules' attribute in ORTModule causes load_state_dict for
wrapped_ortmodule to fail.

Reference: https://github.com/microsoft/onnxruntime/pull/7847
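The forwarding idea in the title can be sketched in plain Python: once the wrapper's own attribute is deleted, normal lookup fails and `__getattr__` can hand the request to the wrapped object. The class names `Inner` and `Wrapper` are hypothetical stand-ins, and this only illustrates the language mechanism; it is not the ORTModule code.

```python
class Inner:
    """Stand-in for the wrapped torch model (hypothetical)."""

    def __init__(self):
        self._modules = {"fc1": "linear"}


class Wrapper:
    """Stand-in for a wrapper that forwards attribute access (hypothetical)."""

    def __init__(self, inner):
        self._inner = inner
        # Assume a base-class initializer created an empty shadow attribute.
        self._modules = {}
        # Deleting it makes later lookups fall through to __getattr__.
        del self._modules

    def __getattr__(self, name):
        # __getattr__ is only called when normal attribute lookup fails.
        if name == "_inner":
            raise AttributeError(name)
        return getattr(self._inner, name)


wrapped = Wrapper(Inner())
print(wrapped._modules is wrapped._inner._modules)
```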
2023-03-03 10:12:37 +08:00
Dmitri Smirnov
8d87fdcfa1
Add GetVersionString API for C++, C# and Python (#14873)
### Description
Added APIs.

### Motivation and Context
Addresses https://github.com/microsoft/onnxruntime/issues/14584

Cc: @Craigacp
2023-03-02 17:11:07 -08:00
cao lei
d69823f764
Do not create Barrier and triggerDownstream steps if the corresponding nodes are split by yield Op in training scenario (#14570)
### Description
Do not create Barrier and triggerDownstream steps during execution plan
creation if the corresponding nodes are split by yield Op in training
scenario.



### Motivation and Context
In the training scenario, the forward and backward passes run two
different partial node ranges of a graph. If two nodes fall into
different partial graphs and are placed on two separate streams,
triggerDownstream/barrier steps are still created between them. These
behave quite differently from the inference case, because one of the
steps will not be executed: it is not in the range being run. To make
this work, there was a hacky way to trigger the barrier step explicitly
for training.
This PR adds a check and does not create Barrier and triggerDownstream
steps if the corresponding nodes are split by a yield Op in the training
scenario, so the hack is no longer needed.
2023-03-02 07:08:29 -08:00
pengwa
79aa0acdd0
SCELoss(SCELossGrad) support half(float) input float(half) output (#13972)
### Description

A follow up change for
https://github.com/microsoft/onnxruntime/pull/13616.

SoftmaxCrossEntropyLossInternal/SoftmaxCrossEntropyLossInternalGrad
support different type for input and output.

Add SCELoss(SCELossGrad) support half(float) input float(half) output

### Test Note

#### Add tests for variant input and output types

To add such tests, we have to refactor the existing testing code for
sce loss and scelossinternal gradient.

Originally:

For FP32 input and output, the CPU kernels run as the baseline;
CUDA/ROCM then runs with the same data, and we use CompareTester to
compare against the CPU run results.

For FP16 input and output, the CPU (which has no half kernels) runs
Cast_to_float->CPU kernel->cast_to_half as the baseline; CUDA/ROCM
then runs with the same data using the Half implementation, and we use
CompareTester to compare against the CPU run results.

Now we want to support runs with different input and output types. The
proposed change is to always run the CPU kernels with float input and
output as the baseline (because the CPU only has float kernel
implementations); this step is the very first thing in every test.

Then we run the CUDA/ROCM kernels using half_input_half_output,
float_input_float_output, half_input_float_output, and
float_input_half_output, wherever a corresponding kernel is registered.

Afterwards, we compare the CUDA/ROCM run results with the CPU float
baselines.

One thing deserves a special note:
CompareOpTester's result comparison can be looser than OpTester's.
Roughly speaking, the former tolerates diff <= atol +
rtol*expected_value, while the latter tolerates diff < atol && diff <
rtol*expected_value. When the expected value is very small, as in many
of our test cases, the former can pass while the latter fails. So the
refactoring also moves the check outside of OpTester and explicitly
checks the values the way CompareOpTester did (to match the previous
behaviour).
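The difference between the two tolerance rules can be made concrete with a small sketch. The two functions below encode the "roughly speaking" formulas from the note; their names are hypothetical, and taking the absolute value of the expected value is an assumption added here so that negative expectations behave sensibly.

```python
def combined_tolerance_ok(actual, expected, atol, rtol):
    """Looser rule: diff <= atol + rtol * |expected|."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


def separate_tolerance_ok(actual, expected, atol, rtol):
    """Stricter rule: diff < atol and diff < rtol * |expected|."""
    diff = abs(actual - expected)
    return diff < atol and diff < rtol * abs(expected)


# A tiny expected value: the relative term is almost zero, so the
# stricter rule rejects a diff that the absolute tolerance would allow.
expected = 1e-8
actual = expected + 5e-7
print(combined_tolerance_ok(actual, expected, atol=1e-6, rtol=1e-5))
print(separate_tolerance_ok(actual, expected, atol=1e-6, rtol=1e-5))
```

With a very small expected value, `rtol * |expected|` shrinks toward zero, so the stricter rule effectively demands an almost exact match.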

### Motivation and Context
2023-02-28 18:02:08 +08:00
Sheil Kumar
1b7f65437e
Enable Opset11 Sequence Ops on DirectML, and make the CPU implementations agnostic to backend EP (#14442)
Enable Opset11 Sequence Ops on DirectML, and make the CPU
implementations agnostic to backend EP

Opset 11 introduced the following sequence related operators:
    - SequenceAt
    - SequenceConstruct
    - SequenceEmpty
    - SequenceLength
    - SequenceErase
    - SequenceInsert 
    - ConcatFromSequence

With the exception of ConcatFromSequence, all of the above operators
were implemented with CPU kernels that a) required all of the contained
tensors to also be on CPU, and b) would clone each tensor into a new
sequence as a side effect of each operator. The implementation of
sequences can be backend agnostic, as sequences don't affect actual
tensor layout or manipulate the contents of the tensors. In addition,
with the exception of SequenceAt, the other operators need not make
copies of the underlying referenced tensors.

Consequently, this change does the following:
1) Sequence* operators (except SequenceAt) no longer copy the contents
of a sequence of tensors on every kernel execution.
2) SequenceAt uses the DataTransferManager to copy tensors in a
backend-agnostic way.
3) The internal container implemented by TensorSeq has changed from
onnxruntime::Tensor to OrtValue. This is because onnxruntime::Tensor
does not support copy or assignment construction, so it must have a
single owner. However, if the same tensor participates in multiple
containers, it would have multiple container "owners", which would not
be possible.
4) Other code that accessed values from TensorSeq has associated
changes to extract Tensors from OrtValues now.

In addition, DirectML execution was very slow when the above Sequence
operators were added to a graph, as this caused MemcpyToHost and
MemcpyFromHost kernels to be inserted between the graph and the sequence
operators. To optimize DirectML,
1) The CPU implementations for the Sequence* ops were registered as DML
implementations. Since the above changes also include making the CPU
kernel implementations EP agnostic, the CPU kernels can be added as is.
2) The ConcatFromSequence operator needed to be implemented on DirectML.
However, there was little DirectML EP operator framework support for
operators that accept/output sequences of tensors. This change has
modified the internal COM interfaces to include new APIs to interrogate
for sequence shapes, and extract the needed tensors from TensorSeq.

---------

Co-authored-by: Patrice Vignola <vignola.patrice@gmail.com>
2023-02-21 18:08:28 -08:00
Vincent Wang
e9ec4c098b
[CUDA] Fix FP16 Precision for Sigmoid Op (#14727)
The current Sigmoid CUDA kernel uses the target data type for all
computation. For some small negative numbers, if using FP16, it will
lose precision. For example, for input [-7.8477, 7.3320, -7.8008,
6.6016], the expected output is [3.9047e-04, 9.9935e-01, 4.0919e-04,
9.9864e-01], but the current kernel generates [0.0000, 0.9990, 0.0000,
0.9990]. If some sub-graph contains Sigmoid, such as
BinaryCrossEntropyWithLogits, it's likely to produce NaN as the compute
result.

The PR fixes this by using FP32 for the kernel's internal computation.
Note that the fix does not cause a perf regression: CUDA's _Exp already
does float to half casting, so the fix doesn't introduce an extra cast.
We move the casts to the very beginning and end of the whole kernel so
that the other parts of the computation are also in FP32 (instead of
only Exp).
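The precision loss can be imitated on the CPU with Python's `struct` half-precision format. This is only a sketch under loud assumptions: it guesses that negative inputs are evaluated as `1 - 1 / (1 + exp(x))`, a formulation that happens to reproduce the reported zeros but is not confirmed by the commit message, and it uses Python doubles as a stand-in for FP32. The function names are hypothetical.

```python
import math
import struct


def to_fp16(value):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


def sigmoid_all_fp16(x):
    """Sigmoid with every intermediate result rounded to FP16.

    ASSUMPTION: negative inputs use 1 - 1 / (1 + exp(x)); this is a guess,
    not the confirmed kernel formulation.
    """
    x = to_fp16(x)
    e = to_fp16(math.exp(-abs(x)))
    inv = to_fp16(1.0 / to_fp16(1.0 + e))
    return inv if x > 0 else to_fp16(1.0 - inv)


def sigmoid_wide_internal(x):
    """Sigmoid computed in wider precision and rounded to FP16 once."""
    x = to_fp16(x)
    return to_fp16(1.0 / (1.0 + math.exp(-x)))


for value in (-7.8477, 7.3320):
    print(value, sigmoid_all_fp16(value), sigmoid_wide_internal(value))
```

In the all-FP16 path, `1 + exp(x)` rounds to exactly 1 for the negative input, so the small result is lost before the final subtraction; rounding only once at the end keeps it.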
2023-02-22 09:16:22 +08:00
pengwa
fbf5d09a0c
Fix random failure of ortmodule_api.py::test_unused_parameters (#14729)
### Fix random failure of ortmodule_api.py::test_unused_parameters

Fix FAILED
orttraining_test_ortmodule_api.py::test_unused_parameters[model1-none_pt_params1]
for orttraining-linux-gpu-ci-pipeline CI pipeline

```
=================================== FAILURES ===================================
________________ test_unused_parameters[model1-none_pt_params1] ________________

model = UnusedMiddleParameterNet(
  (fc1): Linear(in_features=784, out_features=500, bias=True)
  (relu): ReLU()
  (fc2): Linear(in_features=500, out_features=400, bias=True)
  (fc3): Linear(in_features=500, out_features=10, bias=True)
)
none_pt_params = ['fc2.weight', 'fc2.bias']

    @pytest.mark.parametrize(
        "model, none_pt_params",
        [
            (UnusedBeginParameterNet(784, 500, 400, 10), ["fc1.weight", "fc1.bias"]),
            (UnusedMiddleParameterNet(784, 500, 400, 10), ["fc2.weight", "fc2.bias"]),
            (UnusedEndParameterNet(784, 500, 400, 10), ["fc2.weight", "fc2.bias"]),
        ],
    )
    def test_unused_parameters(model, none_pt_params):
        device = "cuda"
    
        N, D_in, H1, H2, D_out = 64, 784, 500, 400, 10
        model = model.to(device)
        ort_model = ORTModule(copy.deepcopy(model))
    
        # Make sure model runs without any exception
        for _ in range(5):
            x = torch.randn(N, D_in, device=device)
            y = copy.deepcopy(x)
    
            out_pt = model(x)
            out_ort = ort_model(y)
            loss_pt = out_pt.sum()
            loss_pt.backward()
            loss_ort = out_ort.sum()
            loss_ort.backward()
            _test_helpers.assert_values_are_close(out_ort, out_pt)
>           _test_helpers.assert_gradients_match_and_reset_gradient(ort_model, model, none_pt_params=none_pt_params)

orttraining_test_ortmodule_api.py:4050: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
_test_helpers.py:216: in assert_gradients_match_and_reset_gradient
    assert_values_are_close(ort_param.grad, pt_param.grad, rtol=rtol, atol=atol)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

```

Initially the test ran fine. As more and more tests were inserted
before ortmodule_api.py::test_unused_parameters, the randomly
generated data changed, and it is now more likely to generate input
data that produces a result that breaks the existing rtol and atol.

For the example value 0.1041, there is only a very minor diff: abs_diff
is 2.2649765014648438e-06.
> torch.allclose judges the values not equal because abs_diff > 0.1041 *
rtol + atol = 1.041e-1 * 1e-5 + 1e-6 = 2.041e-6.
> Additionally, according to the math
[here](7b31bcda2e/orttraining/orttraining/test/python/_test_helpers.py (L230))
the maximum atol is 1.2238311910550692e-06 > current atol (1e-6), and
the maximum rtol is 1.2149855137977283e-05 > current rtol (1e-5).

This PR loosens the atol to 1e-5 and the rtol to 1e-4.
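The arithmetic above can be replayed in a few lines of plain Python, using the `|a - b| <= atol + rtol * |b|` rule that torch.allclose documents. The helper name `within_tolerance` is hypothetical; the numbers are the ones quoted in this commit message.

```python
def within_tolerance(abs_diff, expected, rtol, atol):
    """Apply the allclose-style rule: abs_diff <= atol + rtol * |expected|."""
    return abs_diff <= atol + rtol * abs(expected)


abs_diff = 2.2649765014648438e-06
expected = 0.1041

# Old tolerances: the allowed diff is about 2.041e-6, so the check fails.
print(within_tolerance(abs_diff, expected, rtol=1e-5, atol=1e-6))
# Loosened tolerances: the allowed diff is about 2.041e-5, so it passes.
print(within_tolerance(abs_diff, expected, rtol=1e-4, atol=1e-5))
```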
2023-02-20 18:09:53 +08:00
Baiju Meswani
ae205a7924
QAT POC tutorial (#14577) 2023-02-16 14:38:18 -08:00
Baiju Meswani
4e686a9a7d
Support building a QAT onnx model using onnxblock (#14551) 2023-02-16 14:38:01 -08:00
Edward Chen
5605c3d454
Make some variables constexpr in orttraining/orttraining/training_ops/cuda/optimizer/lamb.cc. (#14698) 2023-02-15 14:10:59 -08:00
cao lei
50fa151298
Remove device_id parameter from ExecutionProvider::GetAllocator() (#14580)
### Description
Remove the device_id parameter from the
ExecutionProvider::GetAllocator() function



### Motivation and Context
The parameter device_id is not necessary. We can fully rely on the
second parameter OrtMemType mem_type to determine the device_id when
getting allocator from executionProvider.
2023-02-13 10:01:07 -08:00
Baiju Meswani
22de2798f2
Update typing hints to support python 3.8 for training apis (#14649) 2023-02-13 09:52:05 -08:00
guyang3532
ba00f3a134
fix problem of reduplicate input names (#14163)
Contributor: @guyang3532
2023-02-10 12:57:51 -08:00
Wei-Sheng Chin
875a7791bf
[DORT] Update import path (#14605)
Follow up changes from
https://github.com/pytorch/pytorch/pull/93409/files for fixing DORT CI
failures.
2023-02-08 19:54:06 -08:00
Maximilian Müller
e9ab56fa64
Adding RunOptions synchronization behaviour to C/C++ API (#14088)
### Description
This exposes the existing interface for asynchronous work in all
CUDA-based EPs (CUDA + TensorRT).


### Motivation and Context
This was requested in #12216. It will enable users to build an
efficient data pipeline with ONNXRuntime and CUDA pre-/post-processing.
PCI traffic to the CUDA device can run during inference as soon as
the postprocessing has consumed the input buffer and it can be
overwritten. To do this, work has to be submitted asynchronously to the
device. The screenshots below illustrate this using NSight Systems.

Async: 
<img width="1401" alt="image"
src="https://user-images.githubusercontent.com/44298237/209894303-706460ed-cbdb-4be2-a2e4-0c111ec875dd.png">

Synchronous:
<img width="1302" alt="image"
src="https://user-images.githubusercontent.com/44298237/209894630-1ce40925-bbd5-470d-b888-46553ab75fb9.png">

Note the gap between the 2 inference runs, caused by issuing PCI traffic
in between and by the CPU overhead of the active synchronization.

---------

Co-authored-by: Chi Lo <chi.lo@microsoft.com>
2023-02-07 19:59:28 -08:00
Tang, Cheng
8f34c8c8ed
Introduce collective ops to ort inference build (#14399)
### Description
Introduce collective ops into onnxruntime inference build, including
1) AllReduce and AllGather schema in contrib op, controlled by USE_MPI
flag
2) AllReduce and AllGather kernel in cuda EP, controlled by ORT_USE_NCCL
flag


### Motivation and Context
Enable the collective ops in onnxruntime inference build so we have the
ability to run distributed inference with multiple GPUs.
The original ncclAllReduce ops in the training build require quite
complex configurations, which are not suitable for the inference case,
and they are already broken, so we introduce a new implementation.

---------

Co-authored-by: Cheng Tang <chenta@microsoft.com@orttrainingdev9.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
2023-02-07 13:47:48 -08:00
Vincent Wang
3d7518762a
[ORTModule] ATen Support for upsample_bilinear (#14519)
It's required by model MobileViT.
2023-02-04 15:20:18 +08:00
pengwa
62442c3d27
Enable multiple step run for adamw tests (on device training) (#14520)
(cherry picked from commit 414b73a02123b672e496326664cd2dc3bd6c6d24)

### Rework for PR https://github.com/microsoft/onnxruntime/pull/14068:
Enable multiple step run for adamw tests (on device training)
### Removed duplicated MACRO checks for training.


### Motivation and Context
2023-02-02 18:40:30 +08:00
Abhishek Jindal
3d388a1aea
change deepspeed version in warning from 0.7.3 to 0.8.0 (#14527)
### Description
change deepspeed version in warning from 0.7.3 to 0.8.0



### Motivation and Context
The version was updated for Deepspeed support in ORT from 0.7.3 to 0.8.0
but wasn't updated in the warnings message and this PR is to fix that.
2023-02-01 12:00:43 -08:00
Abhishek Jindal
6fa4555a06
Including support for Deepspeed 0.8.0 (#14506)
### Description
Including Support for Deepspeed 0.8.0.



### Motivation and Context
Deepspeed 0.8.0 has a bug fix and mlflow integration.
2023-02-01 06:19:41 -08:00
Erick Muñoz
d1533c27eb
[oneDNN] Improved thread handling (#13618)
* Added the OrtDnnlProviderOptions structure to expose configuration
options to the user

* The number of threads can be defined by the user with the -i flag on
the perftest

* Number of threads can also be configured via the OMP_NUM_THREADS
environment variable

* The number of threads defined in the OrtDnnlProviderOptions is
prioritized over the environment variable

### Description
Avoids thread oversubscription caused by OpenMP allocating the maximum
number of threads possible for the oneDNN EP. Adds support for
OrtDnnlProviderOptions, which allows more EP customization and a
user-defined number of threads.
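The priority rule from the bullet list (provider option first, then the OMP_NUM_THREADS environment variable) can be sketched as follows. The helper `resolve_num_threads` and its parameters are hypothetical illustrations, not part of the oneDNN EP; returning `None` to mean "let the runtime decide" is also an assumption.

```python
import os


def resolve_num_threads(option_threads=None, environ=None):
    """Pick a thread count: explicit option first, then OMP_NUM_THREADS.

    Returns None when neither source gives a positive integer
    (assumed here to mean the runtime default applies).
    """
    environ = os.environ if environ is None else environ
    if option_threads is not None and option_threads > 0:
        return option_threads
    env_value = environ.get("OMP_NUM_THREADS", "").strip()
    if env_value.isdigit() and int(env_value) > 0:
        return int(env_value)
    return None


# The provider option wins over the environment variable.
print(resolve_num_threads(4, {"OMP_NUM_THREADS": "8"}))
# Without an option, the environment variable is used.
print(resolve_num_threads(None, {"OMP_NUM_THREADS": "8"}))
```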



### Motivation and Context
- Improves performance and allows users to fine tune the number of
threads
2023-01-31 14:37:13 -08:00
Ashwini Khade
764202d740
fix prefast warning (#14446)
### Description
Fixes a prefast warning:
https://aiinfra.visualstudio.com/ONNX%20Runtime/_workitems/edit/11113



### Motivation and Context
2023-01-30 09:13:39 -08:00
Kyushick Lee
cd24f0794a
Extend ort_backend.py for another ep (#14349)
### Description

This PR extends OrtBackend to allow configuring an EP by name, and
falls back to the existing mechanism that infers the EP based on
tensor affinity if nothing is provided.

### Motivation and Context

Currently OrtBackend needs `get_ort_device()` with the device tag
inferred from torch.Tensor, but the ort device is not yet supported for
dort. The change allows running dort with a supported EP, by configuring
dort with a desired EP and letting dort (the ort InferenceSession) take
CPU-affined pytorch Tensors as inputs and then inject data transfer
nodes internally.
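The "configured name first, inferred from tensor affinity otherwise" behaviour can be sketched like this. The function `choose_execution_provider` and the device-type mapping are hypothetical; only the two EP name strings are standard ONNX Runtime provider names, and the real OrtBackend logic may differ.

```python
def choose_execution_provider(configured_ep, tensor_device_type):
    """Return the EP name: explicit configuration wins, else infer it.

    `tensor_device_type` is a device type string such as "cpu" or "cuda"
    (an assumed stand-in for the device tag taken from a torch.Tensor).
    """
    if configured_ep:
        return configured_ep
    # Hypothetical mapping from device type to an ORT EP name.
    inferred = {
        "cpu": "CPUExecutionProvider",
        "cuda": "CUDAExecutionProvider",
    }
    if tensor_device_type not in inferred:
        raise ValueError(f"cannot infer an EP for device {tensor_device_type!r}")
    return inferred[tensor_device_type]


# An explicitly configured EP is used even though the inputs live on the CPU.
print(choose_execution_provider("MyCustomExecutionProvider", "cpu"))
# With nothing configured, the EP follows the tensor device.
print(choose_execution_provider(None, "cuda"))
```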
2023-01-20 07:30:00 -08:00
Wei-Sheng Chin
432a9912a3
Fix LORT CI failure due to PyTorch change (#14367)
As the title says. The fuser in LORT doesn't like "scalar". With a recent
PyTorch change, a scalar is introduced somewhere it wasn't before. For
now, a simple fix is to check whether all inputs are tensors or certain
specially allowed cases before sending ops to ORT.
2023-01-19 16:02:40 -08:00
Ashwini Khade
ea7bbd667d
fix headers for training apis (#14350)
### Description
Minor refactor PR for fixing header placement for training apis
2023-01-19 10:26:53 -08:00
Adam Louly
f0555eb437
Improved test cases by using parameters (#14246)
### Description
Completing some missing parts of some test cases for python bindings

### Motivation and Context
Some test cases like test_training_module_checkpoint and test_optimizer
step were not complete before because we had no access to the parameters
to check whether they change after the optimizer step or whether the
parameters saved in the checkpoint remain the same.
Now that we have access to the vector of parameters through the exposed
get_contiguous_parameters() method, we can complete the tests.
2023-01-13 12:54:23 -08:00
Ashwini Khade
cc7799835e
Enable a single build with optimized inference and on device training (#14241)
### Description
Right now prepacking code is not compiled when training is enabled. Our
partners want a single build of ort which can do both optimized
inference + training on device. This PR enables prepacking code in a
training build and controls whether it is enabled or not using already
existing session option - kOrtSessionOptionsConfigDisablePrepacking

For Inference scenarios - prepacking will be turned on by default and
this behavior remains the same after this PR too.
For training scenarios - prepacking will be disabled by default and if
user explicitly enables it then an error will be thrown.



### Motivation and Context
Enable both optimized inference as well as on device training in a
single build. For on device training use flag --enable_training_apis.
2023-01-12 21:36:43 -08:00
Vincent Wang
fb3c1221e4
Fix Prefast Warning (#14250)
Fix two prefast:Warning related to constexpr.
2023-01-13 10:16:35 +08:00
Scott McKay
dd2df460b3
Split(18) (#14015)
### Description
Opset 18 Split changes. Adds ability to specify num_outputs which also
allows uneven splitting.

https://github.com/onnx/onnx/releases/tag/v1.13.0
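To illustrate what uneven splitting with num_outputs means, here is a small sketch that computes chunk sizes along one axis. It assumes the convention that all chunks are equal except a smaller last one, which is how the ONNX Split-18 description reads; the helper `split_sizes` is hypothetical and rejects combinations that would leave the last chunk empty.

```python
import math


def split_sizes(dim_size, num_outputs):
    """Chunk sizes for splitting `dim_size` elements into `num_outputs` parts.

    Assumed convention: every chunk has ceil(dim_size / num_outputs)
    elements except the last one, which takes the remainder.
    """
    if num_outputs <= 0:
        raise ValueError("num_outputs must be positive")
    chunk = math.ceil(dim_size / num_outputs)
    last = dim_size - chunk * (num_outputs - 1)
    if last <= 0:
        raise ValueError("dim_size is too small for this many outputs")
    return [chunk] * (num_outputs - 1) + [last]


print(split_sizes(6, 3))  # even split
print(split_sizes(7, 3))  # uneven split, smaller last chunk
```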

### Motivation and Context
Support ONNX opset 18.
2023-01-12 08:14:10 +10:00
pengwa
a4180d79c5
Multi-tensor SGDOptimizer (on device training) (#14083)
Implement SGDOptimizerV2, which takes a sequence of weights and a sequence
of gradients as inputs.

For CPU EP and CUDA EP only.

Added tests.
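
A minimal sketch of the multi-tensor idea, assuming plain SGD (w = w - lr * g) applied to every tensor in one call. The function `sgd_step` and the list-of-floats tensor representation are hypothetical. The actual SGDOptimizerV2 kernel and its exact inputs are not described in this message.

```python
from typing import List, Sequence

Tensor = List[float]  # stand-in for a flat tensor


def sgd_step(weights: Sequence[Tensor], gradients: Sequence[Tensor],
             lr: float) -> List[Tensor]:
    """Apply one plain SGD update to a whole sequence of tensors."""
    if len(weights) != len(gradients):
        raise ValueError("weights and gradients must have the same length")
    updated = []
    for w, g in zip(weights, gradients):
        if len(w) != len(g):
            raise ValueError("each weight needs a gradient of the same size")
        # Element-wise update: w_i <- w_i - lr * g_i
        updated.append([wi - lr * gi for wi, gi in zip(w, g)])
    return updated
```

Passing all tensors together lets one call handle the whole parameter set instead of issuing one optimizer call per weight.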
2023-01-11 10:15:53 -08:00
Ashwini Khade
d92c663f28
Create dedicated build for training api (#14136)
### Description
Enable creating a dedicated build for on device training. With this PR we
can build a lean binary for on device training using the flag
--enable_training_apis. This binary includes only the essentials, like
training ops and optimizers, and NOT features like ATen fallback,
strided tensors, or gradient builders. This binary also removes all
the deprecated components, such as training::TrainingSession and
OrtTrainer.

### Motivation and Context
This enables our partners to create a lean binary for on device
training.
2023-01-10 20:58:04 -08:00
Xavier Dupré
79dc39600f
Replace distutils by setuptools to import build_ext (#14108)
### Description
Uses setuptools instead of distutils.



### Motivation and Context
Fixes #14107.
2023-01-09 11:48:01 +01:00
Baiju Meswani
c6ff5bac9d
Update torch in eager mode CI pipeline (#14094) 2023-01-06 11:46:44 -08:00
Adrian Lizarraga
68794d0ac1
Improve custom op library handle cleanup (#14099)
### Description
- Adds a new C API `OrtApi::RegisterCustomOpsLibrary_V2` that manages
the lifetime of dynamic library handles (i.e., calls `dlclose` or
`FreeLibrary`).
- Deprecates C API `OrtApi::RegisterCustomOpsLibrary`.
- Adds C++ API wrapper for convenient registering of custom op
libraries.
- `PySessionOptions` is now an alias of `OrtSessionOptions`.

### Motivation and Context
The current API for registering custom op libraries loads dynamic
libraries but requires users to handle the release of the corresponding
library handles. Additionally, the user has to make sure to release the
library handle _after_ the session has been destroyed (or the program
segfaults).

The new API automatically cleans up the library and allows the user to
write more straightforward code.
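
The ordering constraint described above (unload the library only after the session is gone) can be illustrated with a small ownership sketch. Everything here is hypothetical: `FakeLibrary` and `FakeSession` are stand-ins, not ORT types, and the sketch only shows the release order an owner has to guarantee, not how `RegisterCustomOpsLibrary_V2` is implemented.

```python
from contextlib import ExitStack
from typing import List


class FakeLibrary:
    """Stand-in for a dynamic library handle (not an ORT type)."""

    def __init__(self, path: str, log: List[str]) -> None:
        self.path, self.log = path, log

    def close(self) -> None:
        self.log.append(f"unload {self.path}")


class FakeSession:
    """Stand-in for a session that uses ops from the library."""

    def __init__(self, log: List[str]) -> None:
        self.log = log

    def close(self) -> None:
        self.log.append("destroy session")


def run_with_managed_library(path: str) -> List[str]:
    log: List[str] = []
    # ExitStack unwinds in reverse order of registration, so the session
    # is destroyed first and the library is unloaded last.
    with ExitStack() as stack:
        library = FakeLibrary(path, log)
        stack.callback(library.close)
        session = FakeSession(log)
        stack.callback(session.close)
        log.append("run inference")
    return log
```

The caller never touches the handle; the owner of both objects fixes the release order once.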
2023-01-04 17:56:29 -08:00
Baiju Meswani
0ff61f7b97
Update torch to 1.13.1 in CI and packaging pipelines for ort training (#14055) 2023-01-03 20:03:33 -08:00
cao lei
b29a1c7348
Address follow-up comments on multistream pr #13495 (#13992)
### Description
This PR is to address follow-up comments for the multi-stream pr
https://github.com/microsoft/onnxruntime/pull/13495

Changes include:

- Make StreamAwareArena transparent to minimal build
- Make DeviceStreamCollection transparent to minimal build
- Replace ORT_MUST_USE_RESULT with [[nodiscard]]
- Remove unnecessary shared_ptr


### Motivation and Context
This PR is to address follow-up comments for the multi-stream pr
https://github.com/microsoft/onnxruntime/pull/13495

Co-authored-by: Lei Cao <leca@microsoft.com>
2023-01-03 16:33:36 -08:00
Ashwini Khade
68b5b2d7d3
Refactor training build options (#13964)
### Description
1. Renames all references of on device training to training apis. This
is to keep the naming general. Nothing really prevents us from using the
same apis on servers or other non-edge devices.
2. Update ENABLE_TRAINING option: With this PR, when this option is
enabled, training apis and torch interop are also enabled.
3. Refactoring for onnxruntime_ENABLE_TRAINING_TORCH_INTEROP option:
   - Removed the user facing option.
   - Setting onnxruntime_ENABLE_TRAINING_TORCH_INTEROP to ON when
     onnxruntime_ENABLE_TRAINING is ON, as we always build with torch
     interop.

Once this PR is merged, selecting --enable_training will do a
"FULL Build" for training (with all the training entry points and
features).
Training entry points include:
1. ORTModule
2. Training APIs

Features include:
1. ATen Fallback
2. All training ops, including communication and collectives
3. Strided Tensor Support
4. Python Op (torch interop)
5. ONNXBlock (front end tools for training artifacts prep when using
training apis)

### Motivation and Context
The intention is to simplify the options for building training enabled
builds. This is part of the larger work item to create a dedicated build
for learning on the edge scenarios with just training apis enabled.
2023-01-03 13:28:16 -08:00
Dmitri Smirnov
5d729839b5
Support loading widechar paths on windows (#14066)
### Description
Make GetRuntimePath() and LoadDynamicLibrary() operate on platform
specific paths

### Motivation and Context
This addresses https://github.com/microsoft/onnxruntime/issues/14063
2022-12-30 16:30:11 -08:00
Vincent Wang
0c3480e565
[ORTModule] ATen upsample_nearest Gradient Bugfix (#14069)
PyTorch removed the upsample_nearest related backward functions with the
"vec" overload name in 1.13. The functions without an overload name are
available in all versions, though they are not as convenient to use.
This PR changes the gradient builder code to use the functions without
an overload name for ATen upsample_nearest nodes.

This PR also fixes a bug in an ORTModule corner case introduced by the
multi-stream PR. There is some code to execute the barrier step for a
triggered downstream if the barrier is out of range. But this should be
applied to a triggered downstream only. If it's a normal run whose start
step is a barrier step but is out of range, we should not apply the logic.
For example, for ORTModule, if the barrier is the 1st step of the whole CPU
plan and the forward part is empty, then the forward normal run will
run steps from start-0 to end-0 (actually nothing), and step-0 is the
barrier, so we should not execute the barrier in such a case.
2022-12-27 10:18:30 +08:00
Adam Louly
e49f358686
expose lr scheduler python bindings for on device training. (#13882)
### Description
Exposing LR Scheduler python bindings for on device training.

Co-authored-by: Baiju Meswani <bmeswani@microsoft.com>
2022-12-22 18:44:04 -08:00