Commit graph

1093 commits

Author SHA1 Message Date
ashari4
c4a7e88fc8
QuantizeBFP and DequantizeBFP (#12833)
* `QuantizeBFP` and `DequantizeBFP` schemas - similar to
`QuantizeLinear` and `DeQuantizeLinear`.
* BFP datatype is represented as a `uint8` tensor with shape and stride
metadata. This is preferable to adding a new datatype for BFP, which is
more disruptive and [discouraged by
PyTorch](https://discuss.pytorch.org/t/training-with-custom-quantized-datatype/152132/2).

Context: 

The Microsoft Floating Point (BFP) datatype shares an exponent for every
n numbers called a “bounding box.” Each number still has its own
mantissa and sign bits. BFP has been shown to incur 3-4x lower cost
(energy and area) than BFloat16 and INT8 counterparts without reductions
in accuracy for the ImageNet benchmark as described in [Rouhani
2020](https://proceedings.neurips.cc/paper/2020/file/747e32ab0fea7fbd2ad9ec03daa3f840-Paper.pdf).
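
To make the shared-exponent idea concrete, here is a minimal pure-Python sketch of block floating point quantization. The function names, the 4-value bounding box, and the 8-bit signed mantissa are illustrative assumptions; they are not the layout used by the `QuantizeBFP` schema or by any particular hardware.

```python
import math


def quantize_bfp(values, box_size=4, mantissa_bits=8):
    """Quantize a flat list of floats into (shared_exponent, mantissas) blocks.

    Hypothetical layout: one exponent per bounding box and one signed
    integer mantissa per value.
    """
    max_mag = (1 << (mantissa_bits - 1)) - 1  # largest mantissa magnitude, e.g. 127
    blocks = []
    for start in range(0, len(values), box_size):
        box = values[start:start + box_size]
        largest = max(abs(v) for v in box)
        # frexp returns (m, e) with largest == m * 2**e and 0.5 <= m < 1,
        # so every value in the box has magnitude below 2**e.
        shared_exp = math.frexp(largest)[1] if largest else 0
        scale = 2.0 ** (shared_exp - (mantissa_bits - 1))
        # Round to the nearest mantissa and clamp to the signed range.
        mantissas = [max(-max_mag, min(max_mag, round(v / scale))) for v in box]
        blocks.append((shared_exp, mantissas))
    return blocks


def dequantize_bfp(blocks, mantissa_bits=8):
    """Rebuild floats from the blocks produced by quantize_bfp."""
    out = []
    for shared_exp, mantissas in blocks:
        scale = 2.0 ** (shared_exp - (mantissa_bits - 1))
        out.extend(m * scale for m in mantissas)
    return out
```

Values that are small relative to the largest value in their bounding box lose precision, because they share its exponent; this is the trade-off that the bounding box size controls.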

Requirements:

* There are many variants of BFP (number of mantissa bits, number of
shared exponent bits, size of bounding box, custom bit fields, etc.)
* The size and layout of a BFP variant varies across hardware
* The bounding box can span arbitrary dimensions; for example, the
channel "C" dimension in an N x C x H x W tensor for convolution

Goals of this PR:

* Add initial versions of QuantizeBFP and DequantizeBFP operators to
enable QDQ-style quantization with BFP. Once the schemas stabilize, we
can consider upstreaming to ONNX.
* Add some basic type and shape inferencing tests; tests that run on an
EP will be a follow-up.
2022-09-22 14:02:55 -07:00
Weixing Zhang
4113df0e21
use constexpr (#12953) 2022-09-20 14:34:33 -07:00
Edward Chen
454f77cd94
Update kernel matching logic: decouple from op schemas and remove kernel def hashes (#12791)
# Motivation
Currently, ORT minimal builds use kernel def hashes to map from nodes to
kernels to execute when loading the model. As the kernel def hashes must
be known ahead of time, this works for statically registered kernels.
This works well for the CPU EP.
For this approach to work, the kernel def hashes must also be known at
ORT format model conversion time, which means the EP with statically
registered kernels must also be enabled then. This is not an issue for
the always-available CPU EP. However, we do not want to require that any
EP which statically registers kernels is always available too.
Consequently, we explore another approach to match nodes to kernels that
does not rely on kernel def hashes. An added benefit of this is the
possibility of moving away from kernel def hashes completely, which
would eliminate the maintenance burden of keeping the hashes stable.

# Approach
In a full build, ORT uses some information from the ONNX op schema to
match a node to a kernel. We want to avoid including the ONNX op schema
in a minimal build to reduce binary size. Essentially, we take the
necessary information from the ONNX op schema and make it available in a
minimal build.
We decouple the ONNX op schema from the kernel matching logic. The
kernel matching logic instead relies on per-op information which can
either be obtained from the ONNX op schema or another source.
This per-op information must be available in a minimal build when there
are no ONNX op schemas. We put it in the ORT format model.
Existing uses of kernel def hashes to look up kernels are replaced
with the updated kernel matching logic. We no longer store
kernel def hashes in the ORT format model’s session state and runtime
optimization representations. We no longer keep the logic to
generate and ensure stability of kernel def hashes.
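
The matching described above can be sketched roughly as follows. This is a simplified illustration, not ORT's implementation: `KernelDef`, `Node`, `find_kernel`, and the `type_param_to_inputs` mapping are hypothetical names, and matching on the node's opset version is an assumption made to keep the example short.

```python
from dataclasses import dataclass, field


@dataclass
class KernelDef:
    """Hypothetical stand-in for a registered kernel definition."""
    domain: str
    op_type: str
    since_version: int
    end_version: int  # inclusive
    type_constraints: dict = field(default_factory=dict)  # e.g. {"T": {"float"}}


@dataclass
class Node:
    """Hypothetical stand-in for a graph node."""
    domain: str
    op_type: str
    opset_version: int
    input_types: dict  # input name -> element type name


def find_kernel(node, kernels, type_param_to_inputs):
    """Return the first kernel matching the node, or None.

    `type_param_to_inputs` is the per-op information: it maps a type
    parameter such as "T" to the input names it constrains. In a full build
    it could be derived from the op schema; in a minimal build it would be
    read from the model file instead.
    """
    for kernel in kernels:
        if (kernel.domain, kernel.op_type) != (node.domain, node.op_type):
            continue
        if not kernel.since_version <= node.opset_version <= kernel.end_version:
            continue
        if all(
            node.input_types.get(name) in allowed
            for param, allowed in kernel.type_constraints.items()
            for name in type_param_to_inputs.get(param, ())
        ):
            return kernel
    return None
```

Because the lookup depends only on the node and the per-op mapping, nothing in it requires a precomputed hash to stay stable across releases.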
2022-09-20 14:24:59 -07:00
Pranav Sharma
a8b0f57d1a
Fix eager mode pipeline to accommodate recent allocator change. (#13000) 2022-09-20 12:53:46 +08:00
cloudhan
0ddf4efbd9
Make PythonOp report dtype mismatch by name, instead of by using enum index (#13007) 2022-09-20 12:29:30 +08:00
Adam Louly
268bfe2a5d
python training api bindings (#12610)
**Description**: **Python API bindings for on-device training.**
**Motivation and Context**
- This PR contains API bindings so Python users can run a complete
training loop.

Co-authored-by: Adam Louly <adamlouly@microsoft.com@orttrainingdev7.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
Co-authored-by: Baiju Meswani <bmeswani@microsoft.com>
2022-09-16 09:38:24 -07:00
Vincent Wang
da07c83948
SoftmaxCrossEntropyLossInternalGrad and Sum Fusion (#12746)
* fuse scegrad and sum

* add yield output shapes to value_info

* resolve comments

* fix merge main
2022-09-14 14:45:51 +08:00
pengwa
b5327595f3
Fix [prefast:Warning]: C26814 (#12897)
fix C26814
2022-09-09 08:26:48 +08:00
Thiago Crepaldi
55c745eefd
Add support for ORTModule Torch cpp CUDA extension build within docker (#12868)
Currently, CUDA hardware is not available to the build during
`docker build`. Because of that, extensions built this way for
CUDA-capable hardware would not have CUDA support.

This PR adds an environment variable, ONNXRUNTIME_FORCE_CUDA, which allows
CUDA extensions to be compiled even when CUDA support is not detected.
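
A minimal sketch of how a build script might honor such a variable is shown below. The function name and the rule that any non-empty value other than "0" counts as set are assumptions for illustration; the actual build logic may interpret the variable differently.

```python
import os


def should_build_cuda_extensions(cuda_detected, env=None):
    """Decide whether CUDA extensions should be compiled.

    Assumption: any non-empty value other than "0" enables the override.
    """
    env = os.environ if env is None else env
    forced = env.get("ONNXRUNTIME_FORCE_CUDA", "") not in ("", "0")
    return cuda_detected or forced
```

Inside a Dockerfile, the variable would typically be set with an `ENV` instruction before the step that builds the extensions.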
2022-09-08 15:30:44 -04:00
guyang3532
4765e5c382
Using ORTModule to wrap an evaluation model should not change the mode (#12747)
Using ORTModule to wrap an evaluation model should not change the mode of the model
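
The intended behavior can be illustrated without PyTorch. In the sketch below, `TinyModule` is a hypothetical stand-in for the train/eval flag of `torch.nn.Module`, and `Wrapper` is a hypothetical stand-in for a wrapping module; neither reflects ORTModule's actual code.

```python
class TinyModule:
    """Hypothetical stand-in for a module with a train/eval flag."""

    def __init__(self):
        self.training = True  # modules start in training mode

    def train(self, mode=True):
        self.training = mode
        return self

    def eval(self):
        return self.train(False)


class Wrapper(TinyModule):
    """Wraps a module and adopts its mode instead of resetting it."""

    def __init__(self, module):
        super().__init__()  # sets training=True by default
        self.module = module
        self.training = module.training  # preserve the wrapped model's mode

    def train(self, mode=True):
        # Keep the wrapper and the wrapped module in the same mode.
        self.module.train(mode)
        return super().train(mode)
```

Wrapping a model that was put in evaluation mode then leaves both the wrapper and the model in evaluation mode until `train()` is called explicitly.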
2022-09-08 10:54:59 +08:00
RandySheriffH
d3b684cd9e
Drop nuphar (#11555)
* drop nuphar code and configs

* refactor test case

* format python

* remove nuphar from training test

* remove commented nuphar logics

* restore llvm setting

* drop nuphar ci

* fix compile err

* fix compile err

Co-authored-by: Randy Shuai <rashuai@microsoft.com>
2022-09-07 15:11:18 -07:00
Baiju Meswani
9e47eb68e0
Remove unused orttraining amd dockerfiles and scripts (#12707) 2022-09-02 18:43:21 -07:00
Baiju Meswani
295bd26980
Remove orttraining-distributed CI pipeline (#12738) 2022-09-02 14:34:26 -07:00
ashbhandare
27dde0b51f
Csharp bindings for on-device training APIs (#12404) 2022-09-02 13:13:48 -07:00
Baiju Meswani
56bae3b196
Use InplaceClipGradNorm for offline processing for on-device training (#12603) 2022-09-02 07:47:17 -07:00
ashbhandare
349469c381
Enable way to extract all parameters to and from a contiguous buffer. (#12674)
* implementation

* review comments

* review comment

* lint error
2022-09-01 15:23:30 -07:00
George Nash
0125e15281
Fix include order build failure training build (#12425)
Signed-off-by: George Nash <george.nash@intel.com>
2022-09-01 10:48:40 -07:00
Cheng
5dd9afe75a
python lint (#12825) 2022-09-01 22:38:25 +08:00
PeixuanZuo
adbc0757ad
[UPDATE] update ROCm ci pipeline to ROCm5.2.3 (#12799)
* [Update] update to rocm5.2.3

* [Fix] cmake version

* [Fix] disable ortmodule tests

* [revert] revert performance number
2022-09-01 10:32:24 +08:00
Vincent Wang
262a597e2a
[CUDA] BiasSoftmax and Dropout Fusion (#12667)
* bias softmax dropout fusion

* fix rocm build

* move some files
2022-09-01 10:01:44 +08:00
Justin Chu
a48b115540
Remove reference to the deprecated variable in torch.onnx.symbolic_helper (#12452)
**Description**: Remove reference to the deprecated variable in `torch.onnx.symbolic_helper` pytorch/pytorch#81953

- Removed unused imports
- Changed BANNED_AUTOGRAD_FUNCTION_NAMES to a frozenset

**Motivation and Context**

The cast_pytorch_to_onnx variable is deprecated and removed in `torch.onnx.symbolic_helper`. Since there is still a need for converting scalar types to onnx type, I copied the mapping to `_CAST_PYTORCH_TO_ONNX` in the module.
2022-08-31 11:55:56 -07:00
Yulong Wang
1a402a3f25
replace 'master' branch ref to 'main' for onnx repo (#12678) 2022-08-30 13:41:42 -07:00
cloudhan
9907b59a1e
Change cuda and rocm error checking helpers to return Status (#12699)
* CudaCall returns Status in non-throw and void in throw

* RocmCall returns Status in non-throw and void in throw
2022-08-30 13:18:47 +08:00
pengwa
a0c25e5c2f
Fix segmentation fault for alltoall (#12701)
* fix segmentation fault

* formatting
2022-08-30 11:27:14 +08:00
Baiju Meswani
b83ea3c2ff
Address prefast static analysis warnings (#12756) 2022-08-29 10:09:32 -07:00
Adam Louly
ee543a47f6
upgrade cuda version on ci pipelines (training CI pipelines) (#12708)
* upgrade cuda version on ci pipelines

* keeping folder name same

* keeping folder name same

* setting manual seed for primitive test case

* resolving comments

* changing atol and rtol only for test case

Co-authored-by: Adam Louly <adamlouly@microsoft.com@orttrainingdev7.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
2022-08-26 16:51:19 -07:00
edgchen1
64e8806148 Address some static analysis warnings. 2022-08-26 15:05:53 -07:00
abhi-ort
ebff15d743
Pinning manual seed (#12714) 2022-08-25 10:09:02 -07:00
Vincent Wang
5104c7dbd3
Fix Prefast Warnings (#12717)
fix prefast warnings
2022-08-25 17:09:37 +08:00
Vincent Wang
53ecb9e635
Update Supporting DS Version to 0.7.1 for ORTModule (#12696)
update ds version support for fp16_optimizer
2022-08-24 14:56:12 +08:00
abhi-ort
73e5741a9a
Enabling softmax grad and logsoftmax grad on ORT (#12614)
* Enabling softmax grad and logsoftmax grad on ORT

* formatting changes

* formatting changes

* reverting changes

* Changing the OpType
2022-08-23 15:49:02 -07:00
Yulong Wang
c144acc534
Replace 'master' branch ref to 'main' in the code (#12547) 2022-08-22 10:48:12 -07:00
Wei-Sheng Chin
dc486d146b
Make ORT callable from various Pytorch compilers (LazyTensor, TorchDynamo, etc) (#10460)
* Make ORT as Pytorch JIT backend

LORT likely doesn't work with aten fallback so we only test LORT in its own CI.

* Revert changes to enable external CUDA allocator. Will add it later.

Revert "Revert changes to enable external CUDA allocator. Will add it later."

This reverts commit d5487f2e193014c805505afae8fb577c53667658.

Fix external allocator

* Relax tolerance and remove commented code

* Print more information in CI

* Fix pointer

* Address comments.
1. Reuse ORT-eager mode's environment.
2. Remove unused ctor.

* Use Pytorch master branch as all PRs are merged

Fix

* Refine based on cpplint feedback

* Revert changes to allow custom CUDA allocator in public APIs

* Use torch.testing.assert_close

* Use unittest framework

* Switch docker repo

* Rename *.cpp to *.cc

* Address comments

* Add comment

* Use same pipeline file for eager and lort pipelines

* Address comments

* Add yaml comment

* Fix cmake files

* Address comments

* Rename flags, remove printing code, remove dead comment
2022-08-22 09:40:40 -07:00
Vincent Wang
a078c8d99b
Update Supporting Deepspeed Version of ORTModule's FP16_Optimizer (#12668) 2022-08-22 22:22:53 +08:00
Scott McKay
2102b8f67c
Avoid duplicate symbol error between ONNX and ORT for ostream operator<< with TensorShapeProto (#12651)
* Remove ostream operator<< definitions for TensorShapeProto and TensorProto as they clash with ONNX definitions in onnx/defs/printer.h/cc.

Currently printer.h (unnecessarily) pulls in a number of other ONNX headers which causes naming clashes with parts of ORT. It is also excluded in a minimal build.

Instead convert the onnx::TensorShapeProto to onnxruntime::TensorShape so we use the existing ostream operator<< for TensorShape.

Make GetTensorShapeFromTensorProto consistent with GetTensorShapeFromTensorShapeProto so both return a TensorShape (as the name implies).
2022-08-22 17:20:52 +10:00
pengwa
7df2e8c5cc
Refactor with std::variant (on device training) (#12383)
* use std::variant for synthetic data storage.

* use std::variant to replace TypedCheckpointProperty

* Remove shared ptr for checkpoint property

* fix tests

* refine std::variant usage a bit

* remove CheckpointProperty data abstraction

* use InlinedVector and InlinedHashMap if possible

* fix comments

* fix build and test

* fix some comments

* use gsl::span

* fix tests

* refine based on comments

* fix win build

* fix build
2022-08-17 08:31:23 +08:00
Baiju Meswani
f5e3517c39
Add Learning Rate Scheduler C API (#11957) 2022-08-15 09:10:25 -07:00
Wil Brady
3d009cdde3
Updating binary ops in eager mode to support broadcasting. (#12560)
* Updating binary ops in eager mode to support broadcasting.
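
For reference, the broadcasting rule for binary ops can be sketched as a shape computation. This follows the NumPy/PyTorch convention and is not taken from the eager-mode code in this commit; `broadcast_shape` is a hypothetical helper name.

```python
def broadcast_shape(a, b):
    """Return the broadcast shape of two shapes, NumPy/PyTorch style."""
    result = []
    # Align shapes at the trailing dimension; missing leading dims count as 1.
    for i in range(1, max(len(a), len(b)) + 1):
        da = a[-i] if i <= len(a) else 1
        db = b[-i] if i <= len(b) else 1
        if da != db and da != 1 and db != 1:
            raise ValueError(f"shapes {a} and {b} are not broadcastable")
        # A dimension of size 1 stretches to match the other operand.
        result.append(db if da == 1 else da)
    return tuple(reversed(result))
```

For example, shapes `(2, 1, 4)` and `(3, 1)` broadcast to `(2, 3, 4)`, while `(2, 3)` and `(4,)` are rejected.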
2022-08-11 17:00:12 -04:00
pengwa
24eab921be
Enable PythonOp for --enable_training_torch_interop build (#12539)
* enable PythonOp by default when --enable_training_torch_interop is enabled during build

* clean up

* fix

* fix comment

* fix

* fix tests

* fix fallback test

* pylint format

* refine based on comments
2022-08-12 00:49:30 +08:00
Baiju Meswani
3e78f3cf1f
Add win-ci pipeline for on-device training (#12513) 2022-08-10 14:45:39 -07:00
msftlincoln
0d9a02e647
Eager Mode - Support Concatenation via aten::cat.out (#12527)
* support concatenation via aten::cat.out

* wrap dims

* rename vars in tests, test wrapped dims
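
"Wrapping" a dimension means mapping a negative index such as `-1` to its non-negative equivalent before concatenating. The sketch below shows the idea on shapes only; `wrap_dim` and `cat_shape` are hypothetical helper names, not functions from this commit.

```python
def wrap_dim(dim, rank):
    """Map a possibly negative dimension index into the range [0, rank)."""
    if not -rank <= dim < rank:
        raise IndexError(f"dim {dim} is out of range for rank {rank}")
    return dim + rank if dim < 0 else dim


def cat_shape(shapes, dim):
    """Return the output shape of concatenating tensors along `dim`."""
    rank = len(shapes[0])
    axis = wrap_dim(dim, rank)
    for shape in shapes:
        # All dimensions except the concatenation axis must agree.
        if len(shape) != rank or any(
            shape[i] != shapes[0][i] for i in range(rank) if i != axis
        ):
            raise ValueError(f"shape {shape} does not match {shapes[0]}")
    out = list(shapes[0])
    out[axis] = sum(shape[axis] for shape in shapes)
    return tuple(out)
```

With this helper, concatenating shapes `(2, 3)` and `(2, 5)` along `dim=-1` gives the same result as `dim=1`.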
2022-08-09 17:16:18 -04:00
Adam Louly
2681648f5b
Load checkpoint in cpp (#12352)
* Load checkpoint in cpp

* removed unused imports

* throw error on invalid name and change function name

* inplace model assignment, change name and other comments resolved

* name change on import

* Added unit test, resolved comments

* remove unused imports

* resolved comments

* refactoring to reduce memory allocation

* resolved extra comments

* changed files hierarchy and force added onnx model

* solved order of function argument

* used gtest macros on test cases

Co-authored-by: Adam Louly <adamlouly@microsoft.com@orttrainingdev7.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
2022-08-09 12:30:50 -07:00
Vincent Wang
2bed0d4abb
[CUDA] SoftmaxCrossEntropy Kernels Refactor (#12482)
* sce refactor

* refactor

* remove unnecessary memset
2022-08-09 16:48:44 +08:00
pengwa
a2dc3e9eac
Improve the compilation speed when compiling for multiple architectures. (#12490)
* improve the compilation speed when compiling for multiple architectures.

* formatting

* fix

* use 0 by default

* fix comments
2022-08-09 11:52:26 +08:00
Vincent Wang
e85e31ee80
Update ORTModule Default Opset Version to 15 (#12419)
* update ortmodule opset to 15

* update torch version

* fix ut

* fix ut

* rollback

* rollback for orttrainer
2022-08-05 16:55:04 +08:00
Baiju Meswani
a7d6290774
CUDA kernel for ClipGradNorm for TensorSeq gradients (#12412) 2022-08-04 22:28:28 -07:00
LironKesem
d452462b5e
Lironkesem/unsqueeze_and_squeeze (#12421) 2022-08-04 15:12:34 -04:00
Baiju Meswani
7f58bd7236
Perform graph transformations during offline tooling (#12422) 2022-08-03 11:27:12 -07:00
Vincent Wang
99d2a63e1a
Set Fixed Seed For SoftmaxCrossEntropyLoss Related UTs (#12432)
add seed
2022-08-03 13:29:30 +08:00
smrkatte
54d5e86981
Add cast before copy for dissimilar scalar type (#12391)
* Add proper cast/copy callflow for ORT and non-ORT devices
2022-08-02 18:32:58 -07:00