Commit graph

1331 commits

Author SHA1 Message Date
pengwa
cd7b3f54da
Allow defining customized PythonOp shape inferer (#17093)
### Allow defining customized PythonOp shape inferer

For `torch.autograd.Function`, we converted it to PythonOp in MSDomain,
there are two places to do shape inferencing for it:

1. in SymbolicShapeInfer, there is one. 
2. in PythonOp op definition. 

For common PythonOp, since we don't know the relationship between
inputs and outputs, we only infer the rank from output ranks and
generate symbolic dimensions for each dim. This introduces
many meaningless symbolic dimensions, sometimes blocking our graph
transformers from doing op fusion.

This PR provides a way to define custom shape inferencing for
the `torch.autograd.Function`s we define, to propagate the original
dimensions across the PythonOp on a best-effort basis.

The 2nd one is not covered yet; we could refine that later. Fixing
the 1st one is enough for ORTModule training/evaluation.
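
To illustrate the difference, here is a minimal sketch in plain Python. It is not the ORT API: `default_infer`, `passthrough_infer` and the `PythonOp_dim_N` naming are hypothetical, and shapes are modelled as lists of ints or symbolic names.

```python
from itertools import count

_symbol_ids = count()  # hypothetical counter used to name fresh symbolic dims


def default_infer(output_ranks):
    """Fallback: only ranks are known, so every dim gets a fresh symbol."""
    return [
        [f"PythonOp_dim_{next(_symbol_ids)}" for _ in range(rank)]
        for rank in output_ranks
    ]


def passthrough_infer(input_shapes):
    """Hypothetical custom inferer for an elementwise function:
    output 0 simply has the shape of input 0."""
    return [list(input_shapes[0])]


input_shapes = [["batch_size", "seq_len", 1024]]
fallback = default_infer([3])             # three unrelated symbolic dims
custom = passthrough_infer(input_shapes)  # original dims are preserved
```

With the fallback, a downstream transformer cannot tell that the output dims match `batch_size` and `seq_len`; with the custom inferer it can.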

2023-08-14 09:13:32 +08:00
Baiju Meswani
3e7f70bf88
LeakyRelu Gradient (#17039) 2023-08-10 20:45:34 -07:00
Changming Sun
6dffd1a890
Update model_tests.cc: avoid auto adding new tests from new opsets (#17084)
### Description
1. Update model_tests.cc: avoid auto adding new tests from new opsets. 
2. Simplify the "ConcatPathComponent" function. It does not need to be a
template.

### Motivation and Context
All our Windows/Linux CI build machines are preloaded with some test
data. In model_tests.cc, we auto add all of them to
onnxruntime_test_all.exe's unit tests. However, this causes problems when
we update the CI build machine images: new data could cause pipelines
to fail suddenly.
Therefore, instead of auto discovering test data and adding all of them
to tests, this PR changes it to explicitly specify the opset names.

This change doesn't impact how Web CI pipeline runs its tests.

Going forward, the workflow would be like:
Step 1: update the onnx version in deps.txt
Step 2: Update js/scripts/prepare-onnx-node-tests.ts. Like #16943 .
Better to put step 1 and step 2 in the same PR.
Step 3: onnxruntime-es team regenerates VM images, tests them, and deploys
them.
Step 4: Enable the new opset test data for EPs. 


[AB#18340](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/18340)
2023-08-10 11:11:26 -07:00
PeixuanZuo
12837ba5c7
[ROCm] Update CI based on ubuntu 22.04 (#17076)
- Update ROCm version to ROCm5.6
- Update CI based on ubuntu 22.04
2023-08-10 09:51:29 -07:00
guyang3532
ef6f4a4aa1
support broadcast shape for elementwise node in padding elimination (#16710)
With PaddingElimination optimizer, input1 of element-wise op may be
flattened like:

```
  input1 (shape:[batch_size, seq_len, ...])        input1 (shape:[valid_tokens, ...])
        \                                               \
         \               input2                          \               input2
          \                /              ----->          \               /
           \              /                                \             /
            Element-wise Op                                Element-wise Op
```
So, the shape of input2 should be processed accordingly:
1. If input2.shape.dim_size <= input1.shape.dim_size-2, i.e. input2 has
no [batch_size, seq_len] at the beginning,
we don't need to process the shape of input2 because it's compatible with
the flattened shape of input1 (shape:[valid_tokens, ...]).

2. If the shape of input2 has the same dim_size as the shape of input1 and
has [batch_size, seq_len] at the beginning,
then to be compatible with the flattened shape of input1, we also need to insert
the flatten pattern for input2,
which flattens the shape of input2 from [batch_size, seq_len, ...] to
[valid_tokens, ...].


3. (Done in this PR.) In other cases for the shape of input2, like [1,
seq_len, ...] or [batch_size, 1, ...], we first need to expand it
to [batch_size, seq_len, ...], which is convenient to flatten, and then
insert the flatten pattern.
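
The three cases can be sketched as a small decision function. This is only an illustration in plain Python, not the optimizer's code: `plan_for_input2` is a hypothetical name, shapes are lists of ints or symbolic names, and treating every remaining shape as case 3 is an assumption of the sketch.

```python
def plan_for_input2(input1_shape, input2_shape):
    """Return the steps to apply to input2 before the element-wise op.

    input1_shape is the original (unflattened) shape [batch_size, seq_len, ...].
    """
    batch_size, seq_len = input1_shape[0], input1_shape[1]

    # Case 1: input2 has no leading [batch_size, seq_len]; it already
    # broadcasts against the flattened [valid_tokens, ...] shape.
    if len(input2_shape) <= len(input1_shape) - 2:
        return []

    # Case 2: same rank and the same leading [batch_size, seq_len].
    if (len(input2_shape) == len(input1_shape)
            and input2_shape[:2] == [batch_size, seq_len]):
        return ["flatten"]

    # Case 3 (assumed to cover everything else): expand to
    # [batch_size, seq_len, ...] first, then flatten.
    return ["expand", "flatten"]


case1 = plan_for_input2(["B", "S", 768], [768])
case2 = plan_for_input2(["B", "S", 768], ["B", "S", 768])
case3 = plan_for_input2(["B", "S", 768], [1, "S", 768])
```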
2023-08-10 19:07:22 +08:00
pengwa
0471f6fbb3
Check type for building gradient graph (#17046)
### Check type for building gradient graph

**Bug1**: 

To fix the error when running the model with ORTModule + Stage 3:

```
Exception happens when running  <bound method Function.apply of <class 'onnxruntime.training.utils.hooks._zero_offload_subscriber.ORTZeROOffloadPreForwardFunction'>>
Traceback (most recent call last):
  File "/bert_ort/pengwa/py38/lib/python3.8/site-packages/onnxruntime/training/ortmodule/_custom_autograd_function_runner.py", line 207, in call_python_forward_function
    wrapped_arg.requires_grad = is_training_mode and grad_flag
RuntimeError: only Tensors of floating point and complex dtype can require gradients

```

This is because when running PythonOpA, the 3rd input is int64. During
the check in the gradient builder we find that it requires gradient, so we set its
requires_grad = True, but PyTorch considers this incorrect and throws the
exception. So we need to understand why the ORT gradient builder thinks the 3rd
input needs gradients.


`ReverseBFSWithStopGradient` does a reverse BFS from the graph
outputs and collects all nodes that are needed for computing the graph
outputs. It defines a queue that initially holds all
nodes that generate graph outputs, then iterates over the nodes one by one,
checking each node's inputs: if an input did not hit a stop edge and its
node arg type is an allowed type (float, etc.), the input node is
appended to the queue for a later iteration.

PythonOpA is such a node, needed to compute graph outputs, so
IsReachable(PythonOpA) returns True.


![image](https://github.com/microsoft/onnxruntime/assets/10530022/c4c53fb9-15f7-4e8d-9aa2-7fc20555a001)

In the above code snippet, when node is PythonOpB and next_node is
PythonOpA, we did not check the node_arg type on the connection between
PythonOpA's 3rd input and PythonOpB's outputs. So we
appended the int64-typed node args to the sets that require gradient.


**Fix1**: add the node arg type check before appending it to the require
grad lists.
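
The traversal with the type check can be sketched as follows. This is a simplified Python model, not the C++ implementation: the graph is described with plain dicts, and `reverse_bfs`, `ALLOWED_GRAD_TYPES` and the node names are hypothetical.

```python
from collections import deque

# Assumption: only floating-point node args may carry gradients.
ALLOWED_GRAD_TYPES = {"float16", "float32", "float64"}


def reverse_bfs(graph_outputs, producers, inputs_of, dtype_of, stop_edges=frozenset()):
    """Collect the nodes needed to compute graph_outputs.

    producers: node arg name -> name of the node that produces it
    inputs_of: node name -> list of input node arg names
    dtype_of:  node arg name -> element type
    """
    queue = deque(producers[out] for out in graph_outputs if out in producers)
    reachable = set(queue)
    while queue:
        node = queue.popleft()
        for arg in inputs_of.get(node, []):
            if (node, arg) in stop_edges:
                continue  # the edge is marked as stop-gradient
            if dtype_of[arg] not in ALLOWED_GRAD_TYPES:
                continue  # the type check: int inputs never need gradients
            producer = producers.get(arg)
            if producer is not None and producer not in reachable:
                reachable.add(producer)
                queue.append(producer)
    return reachable


producers = {"y": "PythonOpA", "x": "Linear", "idx": "Cast"}
inputs_of = {"PythonOpA": ["x", "idx"], "Linear": [], "Cast": []}
dtype_of = {"y": "float32", "x": "float32", "idx": "int64"}
reachable = reverse_bfs(["y"], producers, inputs_of, dtype_of)
```

In this toy graph the int64 input `idx` is skipped, so its producer is never marked as requiring gradient.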


After the fix, a unit test failed:
"orttraining_test_ortmodule_api.py::test_gradient_correctness_minmax[data_type0-True-0-min]
Fatal Python error: Segmentation fault". After investigation, it turned out to be
another bug.

**Bug2**: 

Without the above Fix1, the execution graph looks like this:


![image](https://github.com/microsoft/onnxruntime/assets/10530022/b2fd4b03-95c7-414a-b268-2ba6a7300105)

As you can see, the int64 type has a gradient edge built, although it is not
used by any consumer, and the execution runs fine. On reflection, though,
an int type should not have a grad edge built.

With Fix1, the execution graph looks like this:


![image](https://github.com/microsoft/onnxruntime/assets/10530022/1870d3cc-2fe5-4aa7-ad6b-0d88dcc40f8a)

So the int type node arg no longer has a gradient edge built. **Fix1**
fixes this problem.

But another bug appears with the initial "y_node_arg_names": in this
case, ATen's two outputs, the 1st a float and the 2nd an int. When we check
the y_node
(6e6f582e08/orttraining/orttraining/core/framework/gradient_graph_builder.cc (L60C16-L60C16)),
we did not check the data type before adding it to `y_node_args_`, which is
the list of graph output node args that require gradient. As a result,
`non_differentiable_y_node_arg_names_` did not contain the int type graph
output.

Then
6e6f582e08/orttraining/orttraining/core/framework/ortmodule_graph_builder.cc (L312C18-L312C18)
tries to get the grad node arg into `yield_output_node_args`, BUT the
grad node arg is not built for the int type node arg (with **Fix1**). So
we inserted a nullptr, and later, when using it, we got a segmentation fault.

**Fix2** 

Again, we add the type check when handling y_node_args, and also add a null
check when getting the gradient node arg and appending it to
yield_output_node_args.
2023-08-10 14:24:42 +08:00
Baiju Meswani
31cbd63af7
GRU Training and GRU Gradient Kernels (#16929) 2023-08-09 21:24:47 -07:00
Baiju Meswani
f17efb5c7b
Copy to buffer for both trainable as well as non trainable parameters (#17070) 2023-08-09 17:23:24 -07:00
cloudhan
a4902ee65b
[CUDA][ROCm] Allow allocating ScratchBuffer from TuningContext (#17028)
By switching to ort native stream, we can allocate scratch buffer
directly from tuning context.
2023-08-10 00:05:10 +08:00
pengwa
6e6f582e08
Use full qualified name for PythonOp export (#17021)
### Use full qualified name for PythonOp export

Originally, when there were identically named torch.autograd.Function classes in
different modules, for example:

`a.b.c.Gelu` vs. `d.e.func.<locals>.Gelu`

By default we threw an exception to make the user aware that we could not
distinguish the two Gelu classes, because during model export we did not have the module
path. As a workaround we introduced
`ORTMODULE_SKIPPED_AUTOGRAD_FUNCTIONS` to ignore the duplicate-named
Gelu that is not used by the model run. This obviously has limitations, for
example if both Gelus are used in training.



This PR finds a way to construct a fully qualified name.

`def _export_pt_1_10(g, n, *args, **kwargs):`

1. In the exporter function, kwargs contains `name` and `module`. In the
above example:
   `a.b.c.Gelu`  --> name: `Gelu`, module: `a.b.c`
   `d.e.func.<locals>.Gelu` --> name: `Gelu`, module: `d.e`
   
 
Using name and module is not enough to get a fully qualified name. In
the second case, `d.e` is the module path, and it contains a
function called `func` that defines a local
autograd.Function named `Gelu`. (Many of our UTs look like this.) We
can only get `d.e.Gelu`, which is not the correct fully qualified name.

The reason for this: `kwargs[name]` or `n.name` only returns the class's
name, not the class's fully qualified name. (Note that `kwargs[module]` is
correct.)

2. `n` is a torch.Node; we can access `pyobj` to get the
torch.autograd.Function's apply method instance, then use `._self` to
get the torch.autograd.Function class. Then we can get the `module` and the
class's fully qualified name; joined together, they give the fully qualified name.

With the above change, we don't need to use `kwargs[name]` and
`kwargs[module]`, and don't need to check for naming conflicts or the
`ORTMODULE_SKIPPED_AUTOGRAD_FUNCTIONS` env var any more.
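
The difference between a class's name and its qualified name can be seen in plain Python, without torch. In this sketch `func`, `Gelu` and `full_qualified_name` are illustrative names only; the helper is an assumption about how the two parts could be joined, not the exporter's code.

```python
def func():
    # A class defined inside a function, like many unit-test Functions.
    class Gelu:
        pass

    return Gelu


def full_qualified_name(cls):
    """Join the module path and the qualified class name."""
    return f"{cls.__module__}.{cls.__qualname__}"


LocalGelu = func()
short_name = LocalGelu.__name__      # only the class name
qual_name = LocalGelu.__qualname__   # includes the enclosing function
full_name = full_qualified_name(LocalGelu)
```

`__name__` loses the enclosing `func.<locals>` part, which is why two such classes in the same module could not be told apart by name and module alone.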
2023-08-09 10:58:33 +08:00
Ti-Tai Wang
45ea907f53
Fix orttraining_test_dort.py (#17034)
Converter has moved `opset_version` out from `torch.onnx.ExportOptions`,
and put it into `torch.onnx.OnnxRegistry`.
This PR fixes the usage in DORT.
2023-08-08 08:11:48 -07:00
Baiju Meswani
249917a093
Add mac and windows python packages for onnxruntime-training (#16993) 2023-08-07 20:32:55 -07:00
Ti-Tai Wang
8a335b8347
Update torch.onnx.OnnxRegistry usage in DORT tests (#17009)
Update the usage of torch.onnx.OnnxRegistry, as it's officially
published in PyTorch: https://github.com/pytorch/pytorch/pull/106140.

---------

Co-authored-by: Wei-Sheng Chin <wechi@microsoft.com>
2023-08-07 10:15:51 -07:00
pengwa
3649376f09
Fix few small bugs (#17019)
### Fix few bugs

1. In symbolic shape inference, there was no None check before getting the length.
2. Rename PythonOp/PythonOpGrad's attribute `name` to `func_name`,
otherwise, when we use onnx.helper.make_node to create node, `name`
conflicts with node name.
3. Filter shape inference warnings for PythonOp for torch 2.0 or newer. 
4. Close file descriptors for log suppression. Without the fix, two extra
fds are left open after the log suppression exits its context.
Before entering log suppression (left), before exiting log suppression (right):

![image](https://github.com/microsoft/onnxruntime/assets/10530022/3cd3057a-59f9-4c89-8359-d9b32c49a17e)
   With the fix, no fd added after context exit.

![image](https://github.com/microsoft/onnxruntime/assets/10530022/03454a8f-ab48-4552-bb9b-293a4f51be67)
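
For illustration, a suppression context that leaves no descriptors behind could look like the sketch below. It is not the ORTModule implementation; `suppress_stdout_stderr` and `lowest_free_fd` are hypothetical helpers, and the sketch assumes fds 1 and 2 are open.

```python
import os
from contextlib import contextmanager


@contextmanager
def suppress_stdout_stderr():
    """Redirect fds 1 and 2 to the null device, then restore and close everything."""
    saved = [os.dup(1), os.dup(2)]  # two extra fds that must be closed on exit
    devnull = os.open(os.devnull, os.O_WRONLY)
    try:
        os.dup2(devnull, 1)
        os.dup2(devnull, 2)
        yield
    finally:
        os.dup2(saved[0], 1)
        os.dup2(saved[1], 2)
        for fd in (devnull, *saved):
            os.close(fd)  # without this, the duplicated fds would leak


def lowest_free_fd():
    """Return the lowest unused fd number; it grows if descriptors leak."""
    fd = os.dup(1)
    os.close(fd)
    return fd


before = lowest_free_fd()
with suppress_stdout_stderr():
    os.write(1, b"this line is suppressed\n")
after = lowest_free_fd()
```

Comparing `before` and `after` is a cheap way to check that the context did not leave descriptors open.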
2023-08-07 14:01:36 +08:00
Baiju Meswani
e5bb7aba50
Add Gradient for Reciprocal (#16945) 2023-08-04 09:38:09 -07:00
pengwa
a6887f171f
Refactor schema extraction and output unflattening (#16894)
### Motivation and Context

When we handle PyTorch models' inputs in different places (ORTModule or
others), it's common to flatten structured data into a 1-D
tensor list (required by libraries such as torch.onnx.export,
torch.autograd.Function.forward or an ORT inference session), do the
subsequent work, and then unflatten the results back to the original hierarchy as return
values.

The DeepSpeed stage 3 hook support work also needs such a library to do similar things,
so I propose extracting this pair of APIs into training/utils/,
where they can be used more generally. A comprehensive set of test
data is also used for testing unflatten/flatten in unit tests.

Let me know if you have any other suggestions. 


### Refactor schema extraction and output unflattening

Move `_extract_schema` and `unflatten_user_output` in
`orttraining/orttraining/python/training/ortmodule/_io.py` to
`extract_data_and_schema` and `unflatten_data_using_schema` in
`orttraining/orttraining/python/training/utils/torch_io_helper.py` as
shared libraries, which can be used later by other features (DeepSpeed stage
3 hook rewrite).

There is still some duplicated logic that handles flattening for
different tasks by recursively looping over the data structure; it will be changed
step by step to avoid heavy review effort.
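
The flatten/unflatten idea can be sketched in plain Python. This is not the code in `torch_io_helper.py`: `flatten_with_schema`, `unflatten_with_schema` and `LeafIndex` are hypothetical names, and leaves are ordinary values rather than tensors.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LeafIndex:
    """Placeholder in the schema: position of a leaf in the flat list."""
    index: int


def flatten_with_schema(data):
    """Return (leaves, schema); schema mirrors the nesting of data."""
    leaves = []

    def visit(obj):
        if isinstance(obj, dict):
            return {key: visit(value) for key, value in obj.items()}
        if isinstance(obj, (list, tuple)):
            return type(obj)(visit(item) for item in obj)
        leaves.append(obj)
        return LeafIndex(len(leaves) - 1)

    schema = visit(data)
    return leaves, schema


def unflatten_with_schema(leaves, schema):
    """Rebuild the original hierarchy from a flat list and its schema."""
    if isinstance(schema, LeafIndex):
        return leaves[schema.index]
    if isinstance(schema, dict):
        return {key: unflatten_with_schema(leaves, value) for key, value in schema.items()}
    return type(schema)(unflatten_with_schema(leaves, item) for item in schema)


data = {"a": [1, (2, 3)], "b": 4}
leaves, schema = flatten_with_schema(data)
restored = unflatten_with_schema(leaves, schema)
scaled = unflatten_with_schema([x * 10 for x in leaves], schema)
```

Because the schema stores only positions, the flat list can be processed (here, scaled) and still be restored into the original structure.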
2023-08-04 13:58:21 +08:00
Edward Chen
f98d3f8a23
[CoreML EP] Enable inputs with dynamic shape (#16915)
Enable node inputs with dynamic shape to be handled by the CoreML EP.
2023-08-03 18:15:00 -07:00
pengwa
b9d80131a7
Save optimized pre_grad graph once ready (#16816)
### Save optimized pre_grad graph once it's ready

`graph_builder.build()` does two things for training: 1. optimizes the
forward graph, e.g. pre_grad graph optimization; 2. builds the gradient
graph.

Originally, the pre_grad graph was saved only after `graph_builder.build()`
completed. So if pre_grad graph optimization completed but the
gradient graph build failed, we could not get the pre_grad graph to investigate.

With this PR, once the pre_grad graph is ready, we save it (if
save_model is enabled) in the C++ backend.
2023-08-02 14:05:26 +08:00
pengwa
a021cb1b6e
Allow creating ConstantScalarNode for double type (#16797)
### Allow creating ConstantScalarNode for double type

Allow creating ConstantScalarNode for double type. It looks like the double type was
not respected when creating the constant, so fix it.

```
onnxruntime::python::addObjectMethodsForTraining(pybind11::module&, onnxruntime::python::ExecutionProviderRegistrationFn)::<lambda(onnxruntime::training::OrtModuleGraphBuilder*, const onnxruntime::training::TrainingGraphTransformerConfiguration&)> [ONNXRuntimeError] : 1 : FAIL : Type Error: Type parameter (T) of Optype (Sub) bound to different types (tensor(double) and tensor(float) in node (/_original_module/_original_model/gpt_neox/layers.0/input_layernorm/Pow_Grad/Sub_1).
```
2023-07-28 12:41:22 +08:00
Prathik Rao
779fba1666
ORT Cache (#16744)
### Description

This PR adds support for caching the exported training/evaluation ONNX
model in `ORTModule`. On future runs, instead of exporting the model
again, we can pick up the model from a location on disk and run
`ORTModule` training/evaluation.

### Motivation and Context

ORT Training DRI Contribution

---------

Co-authored-by: root <root@orttrainingdev8.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
Co-authored-by: Prathik Rao <prathikrao@microsoft.com@orttrainingdev8.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
Co-authored-by: Baiju Meswani <bmeswani@microsoft.com>
Co-authored-by: pengwa <pengwa@microsoft.com>
2023-07-27 09:00:43 -07:00
pengwa
39fca225ea
ORTModule log clean up (#16795)
### ORTModule log clean up

ORTModule log level - WARNING (default) is for end users; INFO and
VERBOSE are for internal ORT training developers.

A few issues:
1. ONNX export outputs lots of WARNING messages like "The shape
inference of
com.microsoft::SoftmaxCrossEntropyLossInternal/ATen/PythonOp type is
missing", which are useless for us and for end users.

![image](https://github.com/microsoft/onnxruntime/assets/10530022/f2409480-32e1-483d-bd18-f14149f0588d)

2. ORT also prints some information like
"CleanUnusedInitializersAndNodeArgs] Removing
initializer", "ReverseBFSWithStopGradient] Skip building gradient for",
which is also useless for us and for end users most of the time.

![image](https://github.com/microsoft/onnxruntime/assets/10530022/ff3feaf1-3cb2-4392-b087-86b30b72994c)


3. Different ranks all output logs, making ORT developers and end users
feel there are too many logs, which are usually not useful until we need
to investigate.

A few improvements for the issues:
1. For ONNX export logs, there are two kinds of logs: a. the export verbose
log; b. other logs printed by the torch C++ backend. So this PR makes the
following changes:
- VERBOSE -> FULL export verbose log + FULL torch other logs from stdout
and stderr (C++ backend)
- INFO -> FULL export verbose log + FILTERED torch other logs from
stdout and stderr (C++ backend)
- WARNING/ERROR -> [Rank 0] NO export verbose log + FILTERED torch other
logs from stdout and stderr (C++ backend)

e.g. for verbose level, print all logs as usual; for info level, print the
verbose export log and filtered logs from the torch C++ backend (removing
messages like "The shape inference of
com.microsoft::SoftmaxCrossEntropyLossInternal/ATen/PythonOp type is
missing"). For higher levels, only log the info on rank 0.

2. For ORT gradient graph build and session creation, also suppress and
filter out the messages when log level >= INFO.

3. When log level > INFO, only logs on rank 0 are logged, for a
cleaner user experience.
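
The export-log rules above can be summarised as a small policy function. This is only a sketch: the `LogLevel` enum values and `export_log_policy` are hypothetical and do not mirror the ORTModule code.

```python
from enum import IntEnum


class LogLevel(IntEnum):
    # Assumed ordering: a higher value means less output.
    VERBOSE = 0
    INFO = 1
    WARNING = 2
    ERROR = 3


def export_log_policy(level, rank):
    """Decide what a given rank prints during ONNX export."""
    return {
        # The exporter's verbose log is shown for VERBOSE and INFO only.
        "export_verbose": level <= LogLevel.INFO,
        # torch C++ backend output is filtered from INFO upwards.
        "filter_torch_backend_logs": level >= LogLevel.INFO,
        # Above INFO, only rank 0 logs anything.
        "emit": level <= LogLevel.INFO or rank == 0,
    }


verbose_rank1 = export_log_policy(LogLevel.VERBOSE, rank=1)
info_rank1 = export_log_policy(LogLevel.INFO, rank=1)
warning_rank0 = export_log_policy(LogLevel.WARNING, rank=0)
warning_rank1 = export_log_policy(LogLevel.WARNING, rank=1)
```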


This is the log for a BLOOM model training run after the change: there are
only a limited number of warnings.


![image](https://github.com/microsoft/onnxruntime/assets/10530022/f270b8d5-2944-49d2-a253-c07057d641a0)
2023-07-26 12:42:50 +08:00
Justin Chu
0c1a5098dc
Disable PERF* rules in ruff to allow better readability (#16834)
### Description

Disable two PERF* rules in ruff to allow better readability. The rationale
is commented inline. This change also removes the noqa directives
that are unused after the rule change.

### Motivation and Context

Readability
2023-07-25 15:38:22 -07:00
Wei-Sheng Chin
b0279b14d8
[DORT] Enable Dynamic Shape in DORT and Use Different InferenceSession's when Inputs Are Not Compatible (#16753)
Sometimes, the ONNX exporter generates rank- or shape-dependent sub-graphs.
Thus, errors could occur when running the ONNX model with different
inputs. This PR
([78e736d](78e736d857))
addresses this problem by
- if needed, exporting multiple ONNX models with different inputs for
the same GraphModule.
- implementing a naive mechanism to determine if existing ONNX models
(and the associated InferenceSession) can be reused.
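
One possible shape of such a reuse check is sketched below. The PR's actual criterion is not described here, so this sketch assumes that two input sets are compatible when their dtypes and ranks match; `SessionCache` and its `export_fn` callback are hypothetical.

```python
class SessionCache:
    """Keep one exported session per input signature (dtype and rank)."""

    def __init__(self, export_fn):
        self._export_fn = export_fn  # called only when no session can be reused
        self._sessions = {}

    @staticmethod
    def signature(inputs):
        # inputs: list of (dtype, shape) pairs; only dtype and rank are compared.
        return tuple((dtype, len(shape)) for dtype, shape in inputs)

    def get(self, inputs):
        key = self.signature(inputs)
        if key not in self._sessions:
            self._sessions[key] = self._export_fn(inputs)
        return self._sessions[key]


exports = []


def fake_export(inputs):
    # Stand-in for exporting a model and creating an InferenceSession.
    exports.append(inputs)
    return f"session_{len(exports)}"


cache = SessionCache(fake_export)
first = cache.get([("float32", (2, 3))])
second = cache.get([("float32", (4, 5))])       # same rank and dtype: reused
third = cache.get([("float32", (2, 3, 4))])     # different rank: new export
```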
 
On the other hand, in the second commit
[b5a9b5f](b5a9b5f849),
this PR also enables dynamic shapes in DORT by
- passing dynamic_shapes = True to exporter (see how
DEFAULT_DYNAMIC_BACKEND is created)
- calling torch._dynamo.optimize(dynamic_ort_aot, dynamic=True) (see how
dynamic_ort_aot is created).
2023-07-24 16:54:01 -07:00
pengwa
40277b7f37
Fix orttraining-linux-gpu-ci-pipeline - LargeSizeTensorUInt64Index tests (#16820)
### Disable large index tests due to limited GPU mem

Recently the following two tests have been failing because there is not enough GPU
memory; it is unclear which other programs are using the GPU at the same time.
So disable them for now to unblock the required CI.

```
1: [  FAILED  ] 2 tests, listed below:
1: [  FAILED  ] CrossEntropyTest.SoftmaxCrossEntropyLossInternal_LargeSizeTensorUInt64Index
1: [  FAILED  ] CrossEntropyTest.SoftmaxCrossEntropyLossInternalGrad_LargeSizeTensorUInt64Index


2023-07-23T02:15:39.7559251Z 1: [ RUN      ] CrossEntropyTest.SoftmaxCrossEntropyLossInternal_LargeSizeTensorUInt64Index
2023-07-23T02:16:53.0904576Z 1: 2023-07-23 02:16:53.089586592 [E:onnxruntime:SoftmaxCrossEntropyLossInternal, sequential_executor.cc:514 ExecuteKernel] Non-zero status code returned while running SoftmaxCrossEntropyLossInternal node. Name:'node1' Status Message: /onnxruntime_src/onnxruntime/core/framework/bfc_arena.cc:376 void* **onnxruntime::BFCArena::AllocateRawInternal(size_t, bool, onnxruntime::Stream*, bool, onnxruntime::WaitNotificationFn) Failed to allocate memory for requested buffer of size 4294973440**
2023-07-23T02:16:53.0905775Z 1: 
2023-07-23T02:16:53.0906087Z 1: /onnxruntime_src/onnxruntime/test/providers/base_tester.cc:323: Failure
2023-07-23T02:16:53.0906698Z 1: Expected equality of these values:
2023-07-23T02:16:53.0907086Z 1:   expect_result
2023-07-23T02:16:53.0907564Z 1:     Which is: 4-byte object <00-00 00-00>
2023-07-23T02:16:53.0973055Z 1:   ExpectResult::kExpectFailure
2023-07-23T02:16:53.0973984Z 1:     Which is: 4-byte object <01-00 00-00>
2023-07-23T02:16:53.0975375Z 1: Run failed but expected success: Non-zero status code returned while running SoftmaxCrossEntropyLossInternal node. Name:'node1' Status Message: /onnxruntime_src/onnxruntime/core/framework/bfc_arena.cc:376 void* onnxruntime::BFCArena::AllocateRawInternal(size_t, bool, onnxruntime::Stream*, bool, onnxruntime::WaitNotificationFn) Failed to allocate memory for requested buffer of size 4294973440
2023-07-23T02:16:53.0976198Z 1: 
2023-07-23T02:16:53.0976483Z 1: Google Test trace:
2023-07-23T02:16:53.0976818Z 1: /onnxruntime_src/onnxruntime/test/common/random_generator.h:49: ORT test random seed: 8910
2023-07-23T02:16:53.0977229Z 1: /onnxruntime_src/onnxruntime/test/common/random_generator.h:49: ORT test random seed: 8910
2023-07-23T02:16:53.0977639Z 1: /onnxruntime_src/onnxruntime/test/common/random_generator.h:49: ORT test random seed: 2345
2023-07-23T02:16:53.0978035Z 1: /onnxruntime_src/onnxruntime/test/common/random_generator.h:49: ORT test random seed: 5678
2023-07-23T02:16:53.0978441Z 1: /onnxruntime_src/onnxruntime/test/common/random_generator.h:49: ORT test random seed: 1234
2023-07-23T02:16:53.1303810Z 1: /onnxruntime_src/orttraining/orttraining/test/training_ops/cuda/cross_entropy_test.cc:443: Failure
2023-07-23T02:16:53.1304644Z 1: Expected equality of these values:
2023-07-23T02:16:53.1304974Z 1:   ret.first
2023-07-23T02:16:53.1305685Z 1:     Which is: 4-byte object <04-00 00-00>
2023-07-23T02:16:53.1306030Z 1:   COMPARE_RESULT::SUCCESS
2023-07-23T02:16:53.1306414Z 1:     Which is: 4-byte object <00-00 00-00>
2023-07-23T02:16:53.1306754Z 1: Unsupported compare with CompareOrtValueNumerals.
2023-07-23T02:16:53.1307487Z 1: Google Test trace:
2023-07-23T02:16:53.1307848Z 1: /onnxruntime_src/onnxruntime/test/common/random_generator.h:49: ORT test random seed: 8910
2023-07-23T02:16:53.1308252Z 1: /onnxruntime_src/onnxruntime/test/common/random_generator.h:49: ORT test random seed: 8910
2023-07-23T02:16:53.1308652Z 1: /onnxruntime_src/onnxruntime/test/common/random_generator.h:49: ORT test random seed: 2345
2023-07-23T02:16:53.1309068Z 1: /onnxruntime_src/onnxruntime/test/common/random_generator.h:49: ORT test random seed: 5678
2023-07-23T02:16:53.1309460Z 1: /onnxruntime_src/onnxruntime/test/common/random_generator.h:49: ORT test random seed: 1234
2023-07-23T02:16:53.1309889Z 1: /onnxruntime_src/orttraining/orttraining/test/training_ops/cuda/cross_entropy_test.cc:443: Failure
2023-07-23T02:16:53.1310239Z 1: Expected equality of these values:
2023-07-23T02:16:53.1310527Z 1:   ret.first
2023-07-23T02:16:53.1310893Z 1:     Which is: 4-byte object <04-00 00-00>
2023-07-23T02:16:53.1311208Z 1:   COMPARE_RESULT::SUCCESS
2023-07-23T02:16:53.1311600Z 1:     Which is: 4-byte object <00-00 00-00>
2023-07-23T02:16:53.1311921Z 1: Unsupported compare with CompareOrtValueNumerals.
2023-07-23T02:16:53.1312229Z 1: Google Test trace:
2023-07-23T02:16:53.1312556Z 1: /onnxruntime_src/onnxruntime/test/common/random_generator.h:49: ORT test random seed: 8910
2023-07-23T02:16:53.1312951Z 1: /onnxruntime_src/onnxruntime/test/common/random_generator.h:49: ORT test random seed: 8910
2023-07-23T02:16:53.1313362Z 1: /onnxruntime_src/onnxruntime/test/common/random_generator.h:49: ORT test random seed: 2345
2023-07-23T02:16:53.1313749Z 1: /onnxruntime_src/onnxruntime/test/common/random_generator.h:49: ORT test random seed: 5678
2023-07-23T02:16:53.1314156Z 1: /onnxruntime_src/onnxruntime/test/common/random_generator.h:49: ORT test random seed: 1234
2023-07-23T02:16:53.4476437Z 1: [  FAILED  ] CrossEntropyTest.SoftmaxCrossEntropyLossInternal_LargeSizeTensorUInt64Index (73692 ms)

```



2023-07-23 15:02:09 +08:00
Justin Chu
d79515041c
[Better Engineering] Bump ruff to 0.0.278 and fix new lint errors (#16789)
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at
bottom):
* __->__ #16789

Bump ruff to 0.0.278 and fix new lint errors. I added noqa to all
existing RUF012 errors which requires mutable class variables to be
annotated with `ClassVar`, as well as all PERF issues.

Signed-off-by: Justin Chu <justinchu@microsoft.com>
2023-07-21 12:53:41 -07:00
Xavier Dupré
b508c7236f
Replace call to deprecated torch.norm (#16758)
### Description
torch.norm is deprecated as mentioned in issue #16751. This PR replaces
the call to torch.norm by the options suggested by torch documentation.
2023-07-20 19:52:19 -07:00
Baiju Meswani
538d2412ef
Objective-C Add Support to Create and Query String ORTValues (#16764)
This pull request contains a few changes:

1. Adds support for string ort values.
2. Fixes the training minimal build (that was broken with #16601) by
putting custom op registration behind #ifdefs
3. Fixes the iOS pod package generation (that was again broken with
#16601) by explicitly providing paths to be copied during pod creation.
2023-07-20 17:39:29 -07:00
Wei-Sheng Chin
b71ebf91a5
[DORT] Reduce global configs to make enabling dynamic shape easier (#16720)
There are several global configs used by DORT.
```py
DEFAULT_ONNX_EXPORTER_OPTIONS = torch.onnx._internal.exporter.ResolvedExportOptions(
    torch.onnx._internal.exporter.ExportOptions()
)

# TODO(wechi): This line must generate result identical to the call of
# _create_onnx_supports_op_overload_table(...) inside
# create_onnx_friendly_decomposition_table(...) in
# torch/onnx/_internal/fx/decomposition_table.py.
_SUPPORT_DICT = torch.onnx._internal.fx.decomposition_table._create_onnx_supports_op_overload_table(
    DEFAULT_ONNX_EXPORTER_OPTIONS.onnx_registry
)  # type: ignore

_EXTRA_SUPPORT_DICT: Dict[str, Any] = {
    "getattr": None,
    "_operator.getitem": None,
}

DORT_DECOMPOSITION_TABLE = DEFAULT_ONNX_EXPORTER_OPTIONS.decomposition_table
```

We can see that all but `_EXTRA_SUPPORT_DICT` are deduced from the
ONNX exporter's options. As there are many ways to configure the ONNX
exporter's options, we decided to move these variables to `OrtBackend`'s
`__init__` so that the construction of `OrtBackend` becomes more
flexible (especially for enabling dynamic shape or not).
2023-07-18 09:06:58 -07:00
Wei-Sheng Chin
44fd98ebfe
[DORT] Enable aten::full by implementing extra logics to select EP (#16699)
DORT only selects devices from input arguments (type: torch.Tensor).
However, it errors out when a graph doesn't have any inputs (e.g., a
single aten::full graph). This PR addresses this problem by changing the
EP selection to

1. First, inspect graph inputs. If there are some valid devices, use them
plus a default one (`OrtBackend.ep: str`).
2. Otherwise, inspect graph outputs carried by `torch.fx.GraphModule` and
use all valid devices plus the default `OrtBackend.ep`.
3. When both (1) and (2) fail, use the default EP specified by
`OrtBackend.ep`.
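
The fallback order can be sketched as follows. This is a simplified model, not DORT's code: devices are plain strings instead of `torch.device` objects, and `select_eps` and `DEVICE_TO_EP` are hypothetical names.

```python
# Assumed mapping from a device type to an execution provider name.
DEVICE_TO_EP = {
    "cuda": "CUDAExecutionProvider",
    "cpu": "CPUExecutionProvider",
}


def select_eps(input_devices, output_devices, default_ep):
    """Pick EPs from graph inputs, then from graph outputs, then the default."""
    for devices in (input_devices, output_devices):
        eps = []
        for device in devices:
            ep = DEVICE_TO_EP.get(device)
            if ep is not None and ep not in eps:
                eps.append(ep)
        if eps:
            # Valid devices found: use them plus the default EP.
            if default_ep not in eps:
                eps.append(default_ep)
            return eps
    # Neither inputs nor outputs gave a valid device.
    return [default_ep]


from_inputs = select_eps(["cuda", "cuda"], [], "CPUExecutionProvider")
from_outputs = select_eps([], ["cuda"], "CPUExecutionProvider")
default_only = select_eps([], [], "CPUExecutionProvider")
```

The second call models a graph with no inputs, such as a single aten::full, where only the outputs reveal the device.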
2023-07-14 15:42:25 -07:00
Baiju Meswani
9889f0f507
Add support for training apis to support custom ops (#16601) 2023-07-14 11:15:51 -07:00
Dmitri Smirnov
853c4ff0a5
[C#, CPP] Introduce Float16/BFloat16 support and tests for C#, C++ (#16506)
### Description
Introduce `Float16/BFloat16` support for C# and C++ APIs.
Users should be able to perform conversions between `float` and
`Float16/BFloat16`, compare values, and test for `NaN`, `Infinity`, and
whether the number is denormalized.

### Motivation and Context
User filed issues such as:
https://github.com/microsoft/onnxruntime/issues/14303
2023-07-14 10:46:52 -07:00
Vincent Wang
c07a3b869c
Triton Codegen for ORTModule (#15831)
Fuse connected elementwise and reduce Ops to TritonOp and codegen triton
code to run the kernel.

This PR is co-edited by @wejoncy and @er3x3
2023-07-13 18:17:58 +08:00
PeixuanZuo
ebc311365b
[ROCm] Optimize ROCm CI to reduce time (#16620)
This PR mainly optimizes the ROCm CI tests to reduce time and CPU utilization.

- use smaller batch size on strided_batched_gemm/batched_gemm test
- disable cpu training test
- fix occasional test_e2e_padding_elimination failures on ROCm.
2023-07-13 10:58:03 +08:00
pengwa
2449ded20f
Use autograd_inlining for model export (#16665)
### Use autograd_inlining for model export

Starting from some versions of PyTorch, there is an issue related to custom
autograd.Function inlining, even though we register a custom export
function for the autograd.Function (e.g. when custom autograd function
is enabled).

As an option, the PyTorch exporter adds a new flag during export that lets us
disable the inlining. https://github.com/pytorch/pytorch/pull/104067

Currently the PyTorch change is only in nightly builds, so this PR dynamically
checks torch.onnx.export's signature and uses
`autograd_inlining` when it exists.
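
A signature check of this kind can be written with the standard `inspect` module. The sketch below uses two stand-in functions instead of `torch.onnx.export` so that it runs without torch; `supported_kwargs`, `old_export` and `new_export` are hypothetical names.

```python
import inspect


def supported_kwargs(func, **candidates):
    """Keep only the keyword arguments that func declares in its signature."""
    params = inspect.signature(func).parameters
    return {name: value for name, value in candidates.items() if name in params}


def old_export(model, args, f):
    # Stand-in for an exporter without the new flag.
    return "exported without flag"


def new_export(model, args, f, autograd_inlining=True):
    # Stand-in for an exporter that accepts the new flag.
    return f"exported with autograd_inlining={autograd_inlining}"


old_kwargs = supported_kwargs(old_export, autograd_inlining=False)
new_kwargs = supported_kwargs(new_export, autograd_inlining=False)
result_old = old_export("model", (), "out.onnx", **old_kwargs)
result_new = new_export("model", (), "out.onnx", **new_kwargs)
```

Note that this check only sees explicitly declared parameters; a function that accepts the flag through `**kwargs` would not be detected.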


2023-07-12 20:57:24 +08:00
Edward Chen
1b8d5c43c2
Fix builds (#16646)
- Fix some more `shorten-64-to-32` warnings
- Move minimum build.py Python version back to 3.6
2023-07-11 19:21:25 -07:00
Ti-Tai Wang
72076e5320
Update converter registry usage in orttraining_test_dort_custom_ops.py (#16663)
Fix Orttraining Linux Lazy Tensor CI       

Orttraining Linux Lazy Tensor CI is broken.
The error message is
AttributeError: 'OnnxRegistry' object has no attribute 'register'
2023-07-11 12:03:12 -07:00
pengwa
1ebc5d3879
Log ORTModule initialization overhead (#16529)
### Log ORTModule initialization overhead

When profiling a model, for example:

```
 torchrun --nproc_per_node=1 examples/onnxruntime/training/language-modeling/run_mlm.py  --model_name_or_path microsoft/deberta-v3-large --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1  --num_train_epochs 10 --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --do_train  --overwrite_output_dir --output_dir ./outputs/ --seed 1137 --fp16 --report_to none --optim adamw_ort_fused  --max_steps 200 --logging_steps 1 --use_module_with_loss

{'train_runtime': 303.8711, 'train_samples_per_second': 0.658, 'train_steps_per_second': 0.658, 'train_loss': 6.569518616199494, 'epoch': 0.09}
100%|200/200 [05:03<00:00,  1.52s/it]
***** train metrics *****
  epoch                    =       0.09
  train_loss               =     6.5695
  train_runtime            = 0:05:03.87
  train_samples            =       2223
  train_samples_per_second =      0.658
  train_steps_per_second   =      0.658


```



The end-to-end time is 303s (train_runtime=0:05:03.87), but the
ORTModule first-step initialization (including export, graph build, etc.)
takes about 255s. So when we compare the end-to-end time of a baseline
ORT with an improved version of ORT, there appear to be no perf gains, since the
x% gain over (303-255) is diluted across the overall 303s. This is
misleading!

So this PR outputs the ORTModule initialization overhead in the output, so that
we can manually compute the real compute time and get the perf
gains.
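
As a worked example of the dilution, using the two timings above and an assumed (purely illustrative) 1.2x speedup of the compute part:

```python
total_s = 303.0  # end-to-end time reported by the trainer
init_s = 255.0   # ORTModule first-step initialization (export, graph build, ...)
compute_s = total_s - init_s  # time actually spent on training steps

assumed_compute_speedup = 1.2  # hypothetical gain of an improved ORT build
new_compute_s = compute_s / assumed_compute_speedup
new_total_s = init_s + new_compute_s

# Speedup seen when comparing end-to-end times vs. compute-only times.
end_to_end_speedup = total_s / new_total_s
compute_only_speedup = compute_s / new_compute_s

print(f"compute time: {compute_s:.0f}s -> {new_compute_s:.0f}s")
print(f"end-to-end speedup:   {end_to_end_speedup:.3f}x")
print(f"compute-only speedup: {compute_only_speedup:.3f}x")
```

Under this assumption, a 20% compute gain shows up as less than 3% end to end, which is why the initialization overhead needs to be reported separately.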


If the log level is >= WARNING, only the total end-to-end time +
export time is logged; otherwise, a more detailed breakdown is logged:


![image](https://github.com/microsoft/onnxruntime/assets/10530022/8e34283d-4868-4f22-b65b-9f00d10d8fb7)



![image](https://github.com/microsoft/onnxruntime/assets/10530022/c13bcfad-0d79-483d-a886-e238efcbe657)
2023-07-11 14:11:29 +08:00
pengwa
15cb2f5a8a
Warn the user when nondet kernels are invoked in det mode (#16571)
### Warn users if nondeterministic kernels are called when the deterministic flag is set

When we investigate accuracy issues (for example, debugging a training
convergence issue), we usually set `use_deterministic_compute` to
true.

```
 SessionOptions sess_options;
 sess_options.use_deterministic_compute = true;
```

In a recent investigation, however, we found that the GatherElementsGrad
kernel (which uses atomic add) generates non-deterministic results, making a
deberta model output a noticeably different loss curve every time we run it,
even when we fix the seed, remove the dropout ratio, and set
use_deterministic_compute to true. This turned out to be expected
behavior: the CUDA threads perform the adds in different orders, and the
order cannot be guaranteed.
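
The underlying reason is that floating-point addition is not associative, so accumulating the same values in a different order can give a different sum. A minimal CPU-side illustration in Python (this is not the CUDA kernel, only the arithmetic effect):

```python
# Floating-point addition is not associative: the grouping changes the result.
a, b, c = 0.1, 0.2, 0.3
grouped_left = (a + b) + c
grouped_right = a + (b + c)
print(grouped_left == grouped_right)  # False

# The effect is larger when magnitudes differ a lot, as with accumulated gradients.
values = [1e16, 1.0, -1e16]
sum_in_order = (values[0] + values[1]) + values[2]   # the 1.0 is absorbed by 1e16
sum_reordered = (values[0] + values[2]) + values[1]  # the large terms cancel first
print(sum_in_order, sum_reordered)  # 0.0 1.0
```

With atomic adds, the accumulation order depends on thread scheduling, so either ordering can occur from run to run.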

So this PR emits warnings when users set `use_deterministic_compute`
but some kernels have no deterministic implementation and have to run
with non-deterministic ones. This at least lets users know that the
results are not deterministic even though the flag is set to true.


![image](https://github.com/microsoft/onnxruntime/assets/10530022/99ff60f5-21a4-44cf-bf5b-323d698b7147)

The message is printed only once so that it does not flood the training logs.
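
The warn-once idea can be sketched in Python as follows. This is only an illustration of the pattern, not the ORT implementation: the logger name and the helper `warn_nondeterministic_once` are hypothetical, and the sketch assumes deduplication per kernel name, which the description above does not specify.

```python
import logging

logger = logging.getLogger("nondet_demo")  # hypothetical logger name
_warned_kernels = set()                    # kernels that have already been reported


def warn_nondeterministic_once(kernel_name: str) -> bool:
    """Log a warning the first time kernel_name is seen.

    Returns True if a warning was logged, False if it was suppressed.
    """
    if kernel_name in _warned_kernels:
        return False
    _warned_kernels.add(kernel_name)
    logger.warning(
        "%s has no deterministic implementation; results may differ between runs.",
        kernel_name,
    )
    return True


# Simulate a training loop calling the same kernel on every step.
logged = [warn_nondeterministic_once("GatherElementsGrad") for _ in range(100)]
print(sum(logged))  # 1
```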
2023-07-11 11:45:47 +08:00
PeixuanZuo
cb4bf4f5c8
[ROCm] Move ROCm build step on CPU only machine (#16596)
- Move the ROCm build step to a CPU-only machine
- Add the performance data of the huggingface bert-large model on the
MI200
- At the beginning of the test step, check the agent's GPU usage and
kill the threads occupying the GPU, which may be left over from previous
tasks that exited abnormally.
- Use different docker images during the build and test steps. The
difference is the `uid` and `user` used when building the docker image
and creating the docker container.
2023-07-10 11:55:10 +08:00
cao lei
329e8156d4
clean unused parameter in ORT_UNUSED_PARAMETER (#16538)
### Description
clean unused parameter in ORT_UNUSED_PARAMETER


### Motivation and Context
Clean up the unused parameters in ORT_UNUSED_PARAMETER that were
introduced by #15833.
2023-07-07 13:20:36 -07:00
Edward Chen
6be7b03e53
Enable -Wshorten-64-to-32 warning if available. (#16524)
- Fix some warnings from Xcode build (`-Wshorten-64-to-32`).
- Enable `-Wshorten-64-to-32` warning if available. Currently it's not fully enabled for `onnxruntime_test_all` and `onnxruntime_providers_xnnpack` yet.
- Some clean up in build.py including setting CMake generator more consistently.
2023-07-07 08:11:44 -07:00
Vincent Wang
2a11f29eaa
[CUDA] Optimize BiasGelu/BiasGeluGrad Kernel (#16608)
The PR optimizes the BiasGelu/BiasGeluGrad CUDA kernels with three changes:
- Use Erf instead of Normcdf for half compute
- Change CUDA thread organization for BiasGelu kernel instead of using
binary elementwise template
- Add vectorized support

Using BiasGelu(A[256, 128, 768] + B[768]) on a V100 as an example, the perf
numbers below are in us:
Before change, FW: 246.37, BW: 292.77
Use Erf, FW: 152.86, BW: 238.98
All above changes, FW: 132.45, BW: 199.14

For Huggingface's bertweet-base model, with the changes, the step time
(FW+BW) reduces from 324.71766 ms to 316.42552 ms, which is 1.026x
faster.
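
For reference, the speedup ratios implied by the numbers above can be computed directly. Only the timings come from the description; the variable names are illustrative:

```python
# Kernel timings in us (from the table above) and step times in ms.
fw_before, bw_before = 246.37, 292.77   # before change
fw_after, bw_after = 132.45, 199.14     # all changes applied
step_before_ms, step_after_ms = 324.71766, 316.42552

fw_speedup = fw_before / fw_after
bw_speedup = bw_before / bw_after
step_speedup = step_before_ms / step_after_ms

print(f"FW kernel speedup: {fw_speedup:.3f}x")
print(f"BW kernel speedup: {bw_speedup:.3f}x")
print(f"bertweet-base step speedup: {step_speedup:.3f}x")
```

The kernel-level gains are much larger than the end-to-end gain because BiasGelu accounts for only part of each training step.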

Erf is used for half data only; evaluation shows that for float on
CUDA, Normcdf is faster. I didn't check the perf for BFloat16 or on AMD,
so those paths are kept unchanged.
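
The Erf and Normcdf formulations compute the same function, since the standard normal CDF is Φ(x) = 0.5 * (1 + erf(x / sqrt(2))). A small Python sketch of the equivalence (scalar reference code, not the CUDA kernel; the function names are illustrative):

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def gelu_normcdf(x: float) -> float:
    """Gelu written with the standard normal CDF: x * Phi(x)."""
    return x * _STD_NORMAL.cdf(x)


def gelu_erf(x: float) -> float:
    """Gelu written with erf: 0.5 * x * (1 + erf(x / sqrt(2)))."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def bias_gelu(x: float, bias: float) -> float:
    """BiasGelu applies Gelu to the input plus a bias."""
    return gelu_erf(x + bias)


samples = [-3.0, -1.0, -0.5, 0.0, 0.5, 1.0, 3.0]
max_diff = max(abs(gelu_normcdf(x) - gelu_erf(x)) for x in samples)
print(max_diff < 1e-12)  # True
```

Because the two forms are mathematically identical, choosing between them is purely a question of which intrinsic is faster for a given data type and device.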
2023-07-07 08:28:38 +08:00
Wei-Sheng Chin
a0a5f57581
[DORT] Use new FX-to-ONNX exporter (#16450)
The ONNX exporter in DORT has been moved to PyTorch as a formal
feature. We therefore switch to consuming the exporter from PyTorch
instead of maintaining a duplicate copy.
2023-07-04 13:13:04 -07:00
pengwa
8fc3037ff4
Support SCELossInternal/SCELossInternalGrad run with larger sized input (#16363)
### Support SCELoss/SCELossGrad run with larger sized input

#### Motivation and Context: Run bigger batch size for Bloom model. 
For the Bloom560M model, ORT has the potential to run a bigger batch size,
from initially 6 to now 10. The input size of SCELoss/SCELossGrad is
Bsz * 1023 * 250680. When Bsz is bigger than 8, the total element count
cannot be represented by int32_t, which those kernels use to pass the
total element count. The overflow is silent, causing other indirect
exceptions or wrong results without any error.
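
The threshold can be checked with a few lines of arithmetic. In this sketch the names `DIM_1` and `DIM_2` are just labels for the two fixed factors given above, and `wrap_int32` emulates two's-complement truncation for illustration:

```python
INT32_MAX = 2**31 - 1
DIM_1, DIM_2 = 1023, 250680  # fixed factors of the input size from the text above


def element_count(bsz: int) -> int:
    """Total number of elements for a given batch size."""
    return bsz * DIM_1 * DIM_2


def wrap_int32(n: int) -> int:
    """Emulate storing n in a 32-bit signed integer (two's-complement wrap)."""
    return (n + 2**31) % 2**32 - 2**31


for bsz in range(6, 11):
    count = element_count(bsz)
    fits = count <= INT32_MAX
    print(f"Bsz={bsz}: {count} elements, fits in int32: {fits}, as int32: {wrap_int32(count)}")
```

Batch size 8 gives 2,051,565,120 elements, which still fits, while batch size 9 gives 2,308,010,760, which exceeds 2,147,483,647 and wraps to a negative value.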


#### Changes in this PR

- For the SCELossInternal/SCELossGradInternal CUDA kernels, use uint64_t
to pass the element count and element index when the total element count
is bigger than int32::max().
- For the SCELossInternal/SCELossGradInternal CPU kernels:
   - Always use uint64_t to pass the element count.
   - Update the Eigen functions involved in the two kernels'
implementations to use `ptrdiff_t` instead of the original `int` to pass
the element count.
- Parallelize the SCELossInternal/SCELossGradInternal CPU kernels;
otherwise, they are very slow when handling so many elements.

- Other changes needed:
   - Add `CompareOrtValueNumerals` to compare two OrtValues with different
data types (float or float16), without the caller explicitly converting to
the lower-precision data type. The comparison is also done in parallel,
which reduces the comparison time for the large UT case from 22s to
~1.6s.
   - The check in `IsResultCloselyMatch` is buggy for nan/inf cases, so fix
the bugs.
   - The cross entropy tests run a CPU baseline in float, and the result
is then compared with the float16 results of the CUDA runs. But there is
a precision issue when we check the results: the randomized input data is
represented in float, the CPU uses it directly, but CUDA uses a float16
version of it, so there is a precision difference between the inputs. As
the test data count increases, this makes the results fail even at 1e-2.
The fix is: generate the data in float16, convert it to float for the CPU
run, and use the float16 data directly for the CUDA runs. When comparing
the outputs, cast the CPU float output back to float16 and then compare
it with the CUDA outputs.
   - `RandomValueGenerator` takes about 20 seconds for the large size, so
`ParallelRandomValueGenerator` is added to generate the random input in
parallel, which takes less than 2s to prepare the input data.
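
The input-precision part of the test fix can be illustrated with the Python standard library alone. This is a sketch of the idea, not the C++ test code: `to_float16` and `to_float32` are illustrative helpers that round a value through the IEEE half and single formats using `struct`.

```python
import random
import struct


def to_float16(x: float) -> float:
    """Round x to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_float32(x: float) -> float:
    """Round x to the nearest IEEE single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


rng = random.Random(0)
raw = [rng.uniform(-1.0, 1.0) for _ in range(1000)]

# Old scheme: the CPU baseline consumes float data, CUDA consumes a float16 copy,
# so the two runs start from slightly different inputs.
old_mismatches = sum(1 for x in raw if to_float32(x) != to_float16(x))

# New scheme: generate in float16, then widen to float for the CPU run.
# Every float16 value is exactly representable in float, so the inputs are identical.
half_inputs = [to_float16(x) for x in raw]
new_mismatches = sum(1 for h in half_inputs if to_float32(h) != h)

print(old_mismatches > 0, new_mismatches)  # True 0
```

With identical inputs on both sides, any remaining difference comes from the computation itself rather than from input rounding.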

#### Non-goals

`SoftmaxCrossEntropyLoss` and `SoftmaxCrossEntropyLossGrad` are not
covered in this PR.
2023-06-30 08:36:06 +08:00
Baiju Meswani
efeb6672d6
Temporary optimizer support for ort format models in non minimal build (#16485) 2023-06-28 11:35:57 -07:00
Vrajang Parikh
960e320dff
Objective C Training API: TrainingSession (#16374)
### Description
- Implement Objective-C binding for `ORTTrainingSession`
- Add `ORTUtils` utility class to handle conversion between C++ and
Objective-C types
- Add test case for saving checkpoint
- Add unit test cases for `ORTTrainingSession`

### Motivation and Context
This PR is part of implementing Objective-C bindings for the training API.
It implements the Objective-C binding for the training session. The
Objective-C API closely resembles the C++ API.

---------

Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
2023-06-28 09:13:56 -07:00
cao lei
e5270e3b4f
shared allocator for on device training (#16432)
### Description
New logic to share allocators among the module, optimizer, and eval
sessions for the training scenario.



### Motivation and Context
Previously, on-device training shared allocators by sharing the EP. With
the new mechanism for sharing allocators, we need to explicitly register
the allocator in the environment.

---------

Co-authored-by: Lei Cao <leca@microsoft.com>
2023-06-27 15:10:42 -07:00
pengwa
a49bb85cfe
Manage ORTModule configurations consistently (#16396)
### Manage ORTModule options

Move all env vars that are used to turn features ON/OFF into runtime
options for consistent management.

Note: the feature switches are assigned in two phases: default values
first, which are then overwritten by env vars (if specified by users). So
env vars take the highest priority when both phases explicitly give a
value for the same feature.
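
The two-phase resolution can be sketched as follows. The option names, the `DEMO_` prefix, and the `"1"` convention for "on" are all hypothetical and chosen only for illustration; they are not ORTModule's actual options or env vars.

```python
import os

# Phase 1: default values (hypothetical feature switches).
DEFAULTS = {
    "DEMO_ENABLE_FEATURE_A": True,
    "DEMO_ENABLE_FEATURE_B": False,
}


def resolve_options(env=None):
    """Start from the defaults, then let env vars overwrite them when present."""
    env = os.environ if env is None else env
    options = dict(DEFAULTS)
    # Phase 2: an env var, if the user specified one, takes priority.
    for name in options:
        if name in env:
            options[name] = env[name] == "1"  # assumed convention: "1" means on
    return options


print(resolve_options({}))
print(resolve_options({"DEMO_ENABLE_FEATURE_A": "0", "DEMO_ENABLE_FEATURE_B": "1"}))
```

Keeping the resolution in one place means every feature reads its final value from the same options object rather than querying the environment separately.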



### Motivation and Context
2023-06-27 19:19:36 +08:00
pengwa
403bebfb51
Use PadAndUnflatten to replace GatherGrad for restore (#16429)
### Use PadAndUnflatten to replace GatherGrad for restore




### Motivation and Context
2023-06-27 15:07:20 +08:00
guyang3532
eb4e6d2062
Support Mul and Sub in padding elimination (#16478)
### Description
Support Mul and Sub in padding elimination
2023-06-27 07:43:29 +08:00