onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-07-17 18:40:28 +00:00

Author	SHA1	Message	Date
pengwa	a3e7da60e7	Trade subgraph recompute for memory (#12852 ) Description: Subgraph-level recompute This PR adds an optional capability that trades additional re-computation for better memory efficiency. Specifically, a pre-defined operator list is used to iterate over the graph and find subgraphs to recompute, reducing the stashed activations whose lifetime spans the forward and backward passes. When training with ORTModule, by default, the graph transformer scans the execution graph to find all subgraphs eligible for recompute, along with the sizes that can be saved. An example is shown below. To enable recompute for some of them, define the env variable this way: `export ORTMODULE_ENABLE_MEMORY_ALLEVIATION="Mul+FusedMatMul+Cast+Unsqueeze+Unsqueeze+Cast+Sub+Mul+Add+BiasSoftmaxDropout+Cast+:1:-1,BiasGelu+:1:-1,BitmaskDropout+Cast+:1:-1,FusedMatMul+:1:-1,Cast+:1:-1,Mul+Add+:1:-1,Mul+Sub+:1:-1"` ``` [1,0]<stderr>:2,022-10-12 14:47:39.302,954,530 [W:onnxruntime:, memory_alleviation.cc:595 PrintSummary] [1,0]<stderr>:MemoryAlleviation Summary: [1,0]<stderr>: User config: [1,0]<stderr>: Mul+FusedMatMul+Cast+Unsqueeze+Unsqueeze+Cast+Sub+Mul+Add+BiasSoftmaxDropout+Cast+:1,BiasGelu+:1,BitmaskDropout+Cast+:1,FusedMatMul+:1,Cast+:1,Mul+Add+:1,Mul+Sub+:1 [1,0]<stderr>: ================================= [1,0]<stderr>: Subgraph: BitmaskDropout+ [1,0]<stderr>: AlleviationType: Disabled [1,0]<stderr>: Patterns: [1,0]<stderr>: PatternShape:input_ids_dim0 x 1,024 x Frequency:1 [1,0]<stderr>: -------------------------------- [1,0]<stderr>: Subgraph: BiasGelu+ [1,0]<stderr>: AlleviationType: Recompute [1,0]<stderr>: Patterns: [1,0]<stderr>: PatternShape:input_ids_dim0 x input_ids_dim1 x 4,096 x Frequency:24 [1,0]<stderr>: -------------------------------- [1,0]<stderr>: Subgraph: Reshape[1,0]<stderr>:+ [1,0]<stderr>: AlleviationType: Disabled [1,0]<stderr>: Patterns: [1,0]<stderr>: PatternShape:labels_dim0 x Frequency:1 [1,0]<stderr>: -------------------------------- [1,0]<stderr>: 
Subgraph: Unsqueeze+Unsqueeze+Cast+Sub+Mul+Mul+FusedMatMul+Cast+Add+BiasSoftmaxDropout+Cast+ [1,0]<stderr>: AlleviationType: Disabled [1,0]<stderr>: Patterns: [1,0]<stderr>: PatternShape:input_ids_dim0 x 16 x input_ids_dim1 x input_ids_dim1 x Frequency:23 [1,0]<stderr>: -------------------------------- [1,0]<stderr>: Subgraph: Mul+FusedMatMul+Cast+Unsqueeze+Unsqueeze+Cast+Sub+Mul+Add+BiasSoftmaxDropout+Cast+ [1,0]<stderr>: AlleviationType: Recompute [1,0]<stderr>: Patterns: [1,0]<stderr>: PatternShape:input_ids_dim0 x 16 x input_ids_dim1 x input_ids_dim1 x Frequency:1 [1,0]<stderr>: -------------------------------- [1,0]<stderr>: Subgraph: Mul+Add+ [1,0]<stderr>: AlleviationType: Recompute [1,0]<stderr>: Patterns: [1,0]<stderr>: PatternShape:input_ids_dim0 x 16 x input_ids_dim1 x 1 x Frequency:24 [1,0]<stderr>: -------------------------------- [1,0]<stderr>: Subgraph: FusedMatMul+Cast+Add+Reshape+Cast+ [1,0]<stderr>: AlleviationType: Disabled [1,0]<stderr>: Patterns: [1,0]<stderr>: PatternShape:input_ids_dim0 x 16 x input_ids_dim1 x 2 x 4 x Frequency:24 [1,0]<stderr>: -------------------------------- [1,0]<stderr>: Subgraph: Mul+Sub+ [1,0]<stderr>: AlleviationType: Recompute [1,0]<stderr>: Patterns: [1,0]<stderr>: PatternShape:input_ids_dim0 x 16 x input_ids_dim1 x 1 x Frequency:24 [1,0]<stderr>: -------------------------------- [1,0]<stderr>: Subgraph: Cast+ [1,0]<stderr>: AlleviationType: Recompute [1,0]<stderr>: Patterns: [1,0]<stderr>: PatternShape:1,024 x 1,024 x Frequency:97 [1,0]<stderr>: PatternShape:3 x 1,024 x Frequency:1 [1,0]<stderr>: PatternShape:8 x 64 x Frequency:24 [1,0]<stderr>: PatternShape:1,024 x 4,096 x Frequency:24 [1,0]<stderr>: PatternShape:4,096 x Frequency:24 [1,0]<stderr>: PatternShape:4,096 x 1,024 x Frequency:24 [1,0]<stderr>: -------------------------------- [1,0]<stderr>: Subgraph: FusedMatMul+ [1,0]<stderr>: AlleviationType: Recompute [1,0]<stderr>: Patterns: [1,0]<stderr>: PatternShape:input_ids_dim0 x input_ids_dim1 x 4,096 x 
Frequency:24 [1,0]<stderr>: -------------------------------- [1,0]<stderr>: ================================= ``` "Type config:" indicates whether recompute is enabled by the user: 0 disables it, 1 enables it. "Subgraph" names the kind of subgraph to be recomputed; in this case it is a single "Gelu" node, and it will be "Recompute". "Shape && Frequency" means that, for this recompute, the memory of one tensor of size (batch size, 500) is saved because the tensor will be recomputed. Baseline: on a 1P model (DEBERTA V2) with sequence length 256, training with 16 A100 GPUs on the latest main branch, we can run batch size 16, and the maximum batch size is below 32, so data scientists usually choose 16. 65% of the 40GB memory is used during training, and SamplesPerSec=479.2543353561354. ![image](https://user-images.githubusercontent.com/10530022/188320941-13dde5e7-c32b-4399-a64b-6803fbb9dcda.png) With this PR: Gelu is recomputed to lower the memory peak, so batch size 32 can be run. 97% of the 40GB A100 memory is used, and SamplesPerSec=562.041593991271 (1.17X of baseline). ![image](https://user-images.githubusercontent.com/10530022/188321081-f64811bf-9637-4873-8095-349de8d498cc.png)	2022-11-03 13:49:41 +08:00
Wei-Sheng Chin	b5904c40dd	Enable ORT in TorchDynamo (#13259 ) This PR enables ORT to execute graphs captured by TorchDynamo. Major compilation code is in `OrtBackend.compile` in ort_backend.py. `register_backend.py` is for plugging `OrtBackend` into TorchDynamo as a compiler.	2022-11-01 11:19:29 -07:00
PeixuanZuo	6740528b98	[ROCm] Fix bug for rocm ep build using MS GSL 4.0.0 (#13525 )	2022-11-01 13:05:55 +08:00
Baiju Meswani	c557a55816	Fix on-device training ExportModelForInferencing api (#13510 )	2022-10-31 21:29:06 -07:00
Edward Chen	2ecd1d6622	Switch GSL to MS GSL 4.0.0 (#13416 )	2022-10-29 04:15:20 -07:00
Vincent Wang	8b0669bf63	QuickGelu Fusion (#12417 ) Some models have QuickGelu(x)=x*sigmoid(1.702x), which takes 3 Ops for forward and 5 Ops for backward. This PR fuses it into a single Op named QuickGelu, with gradient QuickGeluGrad. For CUDA, tested on a V100 using an input tensor with shape [64,128,2048] and float16 type: Before, FW takes 335us, BW takes 614us ![image](https://user-images.githubusercontent.com/11661208/182291335-15188709-ffe7-44d1-9d14-0b544cbe5e55.png) After, FW takes 115us, BW takes 139us, which is much faster. ![image](https://user-images.githubusercontent.com/11661208/182291502-f0b5161c-b95c-45fc-90f8-ad0c592d2433.png) For the CPU kernel, using the same shape and float type: Before, FW takes 10ms, BW takes 49ms Mul: 3480[µs] Sigmoid: 1996[µs] Mul: 4789[µs] Mul: 4642[µs] Mul: 4195[µs] SigmoidGrad: 18328[µs] Mul: 2988[µs] Sum: 18576[µs] After, FW takes 4ms, BW takes 5ms, which is also much faster. QuickGelu: 3939[µs] QuickGeluGrad: 5089[µs] Co-authored-by: Vincent Wang <weicwang@microsoft.com>	2022-10-28 18:12:07 +08:00
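The fused forward and backward computations described in the QuickGelu commit can be sketched in plain Python. This is a scalar illustration of the math only, not ORT's kernel: the function names `quick_gelu` and `quick_gelu_grad` and the `alpha` parameter are assumptions made for this sketch, and the gradient is derived by hand from QuickGelu(x)=x*sigmoid(1.702x).

```python
import math

ALPHA = 1.702  # constant from the commit message: QuickGelu(x) = x * sigmoid(1.702x)


def sigmoid(x):
    """Logistic function (adequate for the moderate inputs used here)."""
    return 1.0 / (1.0 + math.exp(-x))


def quick_gelu(x, alpha=ALPHA):
    """Forward pass: one fused expression instead of Mul, Sigmoid, Mul."""
    return x * sigmoid(alpha * x)


def quick_gelu_grad(dy, x, alpha=ALPHA):
    """Backward pass: dy * d/dx [x * sigmoid(alpha * x)].

    With s = sigmoid(alpha * x), the derivative is
    s + alpha * x * s * (1 - s).
    """
    s = sigmoid(alpha * x)
    return dy * (s + alpha * x * s * (1.0 - s))
```

Comparing `quick_gelu_grad(1.0, x)` against a central finite difference of `quick_gelu` at a few points is a quick way to check the derivation.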
Baiju Meswani	a46c599a40	Training API to export the eval model to an inference model (#13345 )	2022-10-27 09:34:01 -07:00
Vincent Wang	805ec459a0	Fix a PoliCheck finding in _hierarchical_ortmodule.py(#13462 )	2022-10-26 15:45:18 -07:00
Vincent Wang	b6a3562ffb	[ORTModule] Add Env Variable to Control Disabling Custom AutoGrad Function Support (#13430 ) Add an env variable to control disabling custom autograd function support. When using ORTModule, if the torch model has a torch.autograd.Function and the user confirms that it can be exported to ONNX (for example, by inlining the PythonOp) and that the backward implementation matches the forward implementation, the user can export "ORTMODULE_DISABLE_CUSTOM_AUTOGRAD_SUPPORT=1" to disable the custom autograd support so that ORT's PythonOp is not used to fall back to PyTorch. Exporting to ONNX can sometimes leverage graph optimizations in ORT, which improves perf.	2022-10-25 16:58:04 +08:00
cloudhan	2748f38362	Drop hip_add_library (#13406 ) Switching to use CMake's builtin hip language support.	2022-10-25 12:57:48 +08:00
Adam Louly	bed169192d	Windows build fix for on device training. (#13354 ) ### Description This is a fix for the on-device training wheel build. ### Motivation and Context When building the Linux wheel, PathString is treated the same as std::string, but building the wheel on Windows fails because the std::string needs to be cast to a PathString. This error was found manually because no pipeline uses --enable_training_on_device for Windows. Co-authored-by: Adam Louly <adamlouly@microsoft.com@orttrainingdev7.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>	2022-10-20 09:58:02 -07:00
cloudhan	fc12abf6b1	Enable/Disable tunable GEMM by using tunable switch in provider options and env var (#13116 ) Related PRs #12853 This allows the user to enable/disable tunable GEMM on demand.	2022-10-19 22:35:08 -07:00
PeixuanZuo	4b2b588895	[ROCm] Fix azcopy issue on ROCm ci pipeline (#13365 ) ### Description Use a SAS Token to fix the error `failed to perform copy command due to error: no SAS token or OAuth token is present and the resource is not public`. Generate a SAS Token for the target data, add it to Key Vault, and use it as a Pipeline Variable. Co-authored-by: peixuanzuo <peixuanzuo@linmif39a000004.zvflicr54joexhdgnhvmxrxygg.phxx.internal.cloudapp.net>	2022-10-20 12:08:57 +08:00
Vincent Wang	67150baa8d	[ORTModule] ATen Support for aten::upsample_nearest (#13364 ) ATen support for aten::upsample_nearest, which is required for Huggingface's diffusers model training using ORTModule.	2022-10-20 08:30:04 +08:00
Vincent Wang	b6b3f41636	Fixes of Hierarchical ORTModule and ORTModule PythonOp (#13347 ) The PR applies some fixes to Hierarchical ORTModule and ORTModule PythonOp. For Hierarchical ORTModule: - Don't wrap a module if the caller calls a function other than forward() - Support a single module instance being called multiple times with different types of inputs - Check whether a module can be wrapped from top to bottom instead of from bottom to top For ORTModule PythonOp: - Add env variable control to allow using torch.utils.checkpoint.CheckpointFunction - Add env variable control to skip registering some autograd functions so that there is no conflict for some models.	2022-10-20 08:16:03 +08:00
Adam Louly	61ee5585b2	update the nightly build to use the latest ptca image. (#13309 ) ### Description updating the ptca image used in the nightly pipeline Co-authored-by: Adam Louly <adamlouly@microsoft.com@orttrainingdev7.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>	2022-10-17 14:12:03 -07:00
Adam Louly	68eff69ab1	Add Utils for federated learning scenarios (#13014 ) Description: utils for federated learning. Motivation and Context - This PR includes utils that will be used on federated learning scenarios. - Exposing python bindings to some utils, and added a util to calculate the difference between two buffers. Co-authored-by: Adam Louly <adamlouly@microsoft.com@orttrainingdev7.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net> Co-authored-by: Baiju Meswani <bmeswani@microsoft.com>	2022-10-17 12:39:43 -07:00
Jeff Daily	65c67764ae	remove line "ADD model ${WORKSPACE_DIR}/model" in the amdgpu Dockerfile (#12914 ) Follow-up to #12707. docker build is broken otherwise; model dir is gone.	2022-10-14 13:17:28 -07:00
Wei-Sheng Chin	dc324b1d90	[LazyTensor] Make LORT Build Again with Latest PyTorch (#13303 ) `python setup.py develop` doesn't install PyTorch as a normal package in site-packages anymore, and the user must stay in PyTorch's root directory to call `import torch`. This breaks LORT tests because they contain `import torch` and are called outside the PyTorch root directory. To make PyTorch a normal package again, this PR builds PyTorch with `python setup.py install`.	2022-10-13 13:56:17 -07:00
Vincent Wang	807b2f4dd5	[ORTModule] Use Env Variable to Set Provider Option cudnn_conv_algo_search (#13296 ) This PR is to add support of using env variable to set provider option cudnn_conv_algo_search so that user can choose better conv algo search method to run model. This is a quick fix to unblock the test of MoE model. Will have another PR to design and implement the ORTModule config so that we can config ORTModule using Python script or config file instead of env variable.	2022-10-13 15:36:21 +08:00
Vincent Wang	6fb70a82df	[ORTModule] Update Supported DeepSpeed Version for FP16_Optimizer (#13305 ) Update supported deepspeed highest version from 0.7.1 to 0.7.3 for FP16_Optimizer. Also add version info to warning log.	2022-10-13 13:03:01 +08:00
Vincent Wang	afb5f76770	[ORTModule] ATen Support for torch.nn.GroupNorm (#13293 ) Models from [huggingface's diffusers library](https://github.com/huggingface/diffusers) use torch.nn.GroupNorm, which is exported to a sub-graph containing ONNX's InstanceNormalization, which lacks a gradient. The implementation of ORT's InstanceNormalization calls cuDNN's BatchNorm for part of the computation, which is not efficient compared to PyTorch's implementation. This PR uses ATen fallback to support this torch module, including its forward and backward.	2022-10-13 11:59:03 +08:00
PeixuanZuo	6895918b1c	[ROCm] Revert CI pipeline to ROCm5.2.3 (#13297 ) ### Description Unit tests with ROCm5.3 are slower than with ROCm5.2.3, so revert to ROCm5.2.3. We will update to ROCm5.3 when the issue is resolved by AMD.	2022-10-12 10:47:33 -07:00
Vincent Wang	a2658f0784	[ORTModule] Fix Graph Builder for Eval Mode (#13255 ) Current graph builder for ORTModule will apply the training's graph optimizations for both training and eval mode. Take BatchNorm as example, one of training's graph optimizations will replace BatchNormalization Op to BatchNormInternal which is for training only. This PR is to fix this, for eval mode, we will not apply the training's graph optimizations. The inference's graph optimizations will be applied when InferenceSession initialization.	2022-10-12 14:39:54 +08:00
Prathik Rao	93e0a15117	implement cos gradient as a function op (#13227 ) ### Description Implemented gradient of cos as per the function below. ![image](https://user-images.githubusercontent.com/31260940/193900310-b62a3e77-06d5-45af-ad28-a1d41920bad0.png) ### Motivation and Context Cos gradient required for [huggingface's diffusers library](https://github.com/huggingface/diffusers) ### Testing built ORT from source: `./build.sh --config RelWithDebInfo --enable_training --use_cuda --cuda_home /usr/local/cuda --cudnn_home /usr/local/cuda --build_wheel --parallel --skip_tests` tested CosGrad implementation: `cd build/Linux/RelWithDebInfo/ && ./onnxruntime_test_all --gtest_filter=GradientCheckerTest.CosGrad` Co-authored-by: Prathik Rao <prathikrao@microsoft.com>	2022-10-11 10:11:19 -07:00
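The formula image from the cos gradient commit is not reproduced in this listing. Assuming it is the standard derivative d/dx cos(x) = -sin(x), the gradient amounts to the scalar sketch below. The helper names `cos_grad` and `numeric_grad` are hypothetical and are not ORT's function-op definition.

```python
import math


def cos_grad(dy, x):
    """Gradient of y = cos(x) w.r.t. x, scaled by the upstream gradient dy.

    Assumes the standard calculus identity d/dx cos(x) = -sin(x).
    """
    return -dy * math.sin(x)


def numeric_grad(f, x, eps=1e-6):
    """Central finite difference, used only to sanity-check the analytic form."""
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)
```

A gradient checker such as the `GradientCheckerTest.CosGrad` test named in the commit performs the same kind of comparison between analytic and numeric gradients.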
Prathik Rao	05acd20a88	convert singrad to function op and remove cpu kernel (#13263 ) ### Description Implemented gradient of sin as a function op. ### Motivation and Context Sin gradient currently implemented as cpu op which could hurt performance. ### Testing built ORT from source: `./build.sh --config RelWithDebInfo --enable_training --use_cuda --cuda_home /usr/local/cuda --cudnn_home /usr/local/cuda --build_wheel --parallel --skip_tests` tested SinGrad implementation: `cd build/Linux/RelWithDebInfo/ && ./onnxruntime_test_all --gtest_filter=GradientCheckerTest.SinGrad` Co-authored-by: Prathik Rao <prathikrao@microsoft.com> Co-authored-by: Baiju Meswani <bmeswani@microsoft.com>	2022-10-11 10:11:08 -07:00
Vincent Wang	b9e23bd086	[ORTModule] Fix Custom Op Registry for Torch 1.13+ (#13250 ) This PR has two fixes: - https://github.com/pytorch/pytorch/pull/85636 changes the behavior of register_custom_op_symbolic to register the symbolic function at a single version only. For ORTModule we need to pass the op_set version when calling it. - Since torch_1.13 the signature of einsum has a new argument, so our custom op symbolic registry code needs to change accordingly. Without the fixes, ORTModule will not work with the nightly torch or with the new torch version that will be released.	2022-10-11 15:20:51 +08:00
PeixuanZuo	4d25b9c8f0	[ROCm] Update ROCm and MIGraphX CI pipeline to ROCm5.3 (#13257 ) ### Description 1. Update ROCm pipeline and MIGraphX pipeline to ROCm5.3. The ROCm pipeline runs the ortmodule test one time and disables it: https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=777794&view=logs&j=48b14a85-ff1a-5ca4-53fa-8ea420d27feb&t=9c199f35-fc50-565d-6c65-5162c9bb1b04 2. Add `workspace: clean: all`.	2022-10-11 13:47:22 +08:00
Baiju Meswani	04ba8a7e6e	Introduce Training C++ Apis (#12994 )	2022-10-06 20:13:37 -07:00
cloudhan	72076b1eb2	Update ROCm CI to use HIP LANGUAGE (#13214 ) Update for ROCm CI before reland tunable GEMM #12853. This PR also update composable kernel to use CMakes's HIP language support so that we can mix C/C++ compiler with HIP compiler instead of locking to hip-clang	2022-10-05 16:15:16 +08:00
Ashwini Khade	4fc8f7139a	Bug Fix - C# API order incompatible with C API (#13191 ) ### Description Training C# bindings (ReleaseTrainingSession and ReleaseCheckpointState) broke after an API order change in the Training C API. This PR fixes this issue. ### Motivation and Context Bug fix for Training C# bindings	2022-10-04 09:29:20 -07:00
Ashwini Khade	c780c4a2b9	Fix two prefast warnings (#13211 )	2022-10-03 20:00:57 -07:00
Tony Xia	962fee5fe5	Fix typo enviroment => environment (#13195 )	2022-10-03 17:02:26 -07:00
Vincent Wang	6c63c1c9ee	Multiple Gather to Split Fusion (#13095 ) For the code below, found in some transformers models: ``` fused_qkv = fused_qkv.view(batch_size, seq_length, self.num_heads, 3, self.head_dim) return fused_qkv[..., 0, :], fused_qkv[..., 1, :], fused_qkv[..., 2, :] ``` The exported graph will contain 3 Gather nodes, and ORT's current GatherGrad CUDA implementation is slow. This pattern can be fused into one Split, so that fewer kernels are launched for the compute; the perf of Split/Concat (for grad) is also better than that of Gather/GatherGrad. In a real example, one GatherGrad takes 15ms and there are 3 for each layer in the graph; after the fusion, one Concat takes only 35us. The total time of a step improves from 1.5s to 0.4s.	2022-09-29 11:09:57 +08:00
Vincent Wang	94e34ace15	Bugfix for SimplifiedLayerNormalization (#12975 ) This PR fixes https://github.com/microsoft/onnxruntime/issues/12930 and https://github.com/microsoft/onnxruntime/issues/12579. In detail: - For CPU EP, the current impl of SimplifiedLayerNormalization doesn't support input and scale having different data types, so if the sub-graph contains a Cast Op, the sub-graph will not be fused; this guarantees that the input and output data types are the same - For CUDA EP, add (fp16, float) support to the (T,V) type constraints so that all combinations of fp16 and float are supported in the impl With the fix, the original model can be run with SimplifiedLayerNormalization, which also helps to improve the perf.	2022-09-27 14:24:16 +08:00
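The equivalence behind the Gather-to-Split fusion can be illustrated with nested Python lists standing in for the last two tensor dimensions (3, head_dim). This is a sketch of the idea only, not ORT's graph transformer; `gather_three` and `split_once` are hypothetical names introduced here.

```python
def gather_three(fused):
    """Mimics the three Gather nodes: index 0, 1 and 2 on the size-3 axis.

    `fused` is a list of blocks, each block being [q_row, k_row, v_row].
    """
    q = [block[0] for block in fused]
    k = [block[1] for block in fused]
    v = [block[2] for block in fused]
    return q, k, v


def split_once(fused):
    """Mimics one Split along the size-3 axis (followed by a squeeze).

    zip(*fused) walks the size-3 axis once and yields all three parts.
    """
    q, k, v = (list(part) for part in zip(*fused))
    return q, k, v


# Two heads, head_dim = 2: each block holds the q, k and v rows of one head.
example = [
    [[1, 2], [3, 4], [5, 6]],
    [[7, 8], [9, 10], [11, 12]],
]
```

Both functions return the same three outputs; the fused form reads the input in a single pass, which mirrors the commit's point that one Split launches fewer kernels than three Gathers.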
Baiju Meswani	bcc93ab17c	Deprecate ORTTrainer (#13022 )	2022-09-23 18:10:09 -07:00
ashari4	c4a7e88fc8	QuantizeBFP and DequantizeBFP (#12833 ) * `QuantizeBFP` and `DequantizeBFP` schemas - similar to `QuantizeLinear` and `DeQuantizeLinear`. * BFP datatype is represented as a `uint8` tensor with shape and stride metadata. This is preferable to adding a new datatype for BFP, which is more disruptive and [discouraged by PyTorch](https://discuss.pytorch.org/t/training-with-custom-quantized-datatype/152132/2). Context: The Microsoft Floating Point (BFP) datatype shares an exponent for every n numbers called a “bounding box.” Each number still has its own mantissa and sign bits. BFP has been shown to incur 3-4 less cost (energy and area) than BFloat16 and INT8 counterparts without reductions in accuracy for the ImageNet benchmark as described in [Rouhani 2020](https://proceedings.neurips.cc/paper/2020/file/747e32ab0fea7fbd2ad9ec03daa3f840-Paper.pdf). Requirements: * There are many variants of BFP (number of mantissa bits, number of shared exponent bits, size of bounding box, custom bit fields, etc.) * The size and layout of a BFP variant vary across hardware * The bounding box can be over arbitrary dimensions; for example, over the channel "C" dimension in an N x C x H x W tensor for convolution Goals of this PR: * Add initial versions of QuantizeBFP and DequantizeBFP operators to enable QDQ-style quantization with BFP. Once the schemas stabilize, we can consider upstreaming to ONNX. * Add some basic type and shape inferencing tests; tests that run on an EP will be a follow-up.	2022-09-22 14:02:55 -07:00
Weixing Zhang	4113df0e21	use constexpr (#12953 )	2022-09-20 14:34:33 -07:00
Edward Chen	454f77cd94	Update kernel matching logic: decouple from op schemas and remove kernel def hashes (#12791 ) # Motivation Currently, ORT minimal builds use kernel def hashes to map from nodes to kernels to execute when loading the model. As the kernel def hashes must be known ahead of time, this works for statically registered kernels. This works well for the CPU EP. For this approach to work, the kernel def hashes must also be known at ORT format model conversion time, which means the EP with statically registered kernels must also be enabled then. This is not an issue for the always-available CPU EP. However, we do not want to require that any EP which statically registers kernels is always available too. Consequently, we explore another approach to match nodes to kernels that does not rely on kernel def hashes. An added benefit of this is the possibility of moving away from kernel def hashes completely, which would eliminate the maintenance burden of keeping the hashes stable. # Approach In a full build, ORT uses some information from the ONNX op schema to match a node to a kernel. We want to avoid including the ONNX op schema in a minimal build to reduce binary size. Essentially, we take the necessary information from the ONNX op schema and make it available in a minimal build. We decouple the ONNX op schema from the kernel matching logic. The kernel matching logic instead relies on per-op information which can either be obtained from the ONNX op schema or another source. This per-op information must be available in a minimal build when there are no ONNX op schemas. We put it in the ORT format model. Existing uses of kernel def hashes to look up kernels are replaced with the updated kernel matching logic. We no longer store kernel def hashes in the ORT format model’s session state and runtime optimization representations. We no longer keep the logic to generate and ensure stability of kernel def hashes.	2022-09-20 14:24:59 -07:00
Pranav Sharma	a8b0f57d1a	Fix eager mode pipeline to accommodate recent allocator change. (#13000 )	2022-09-20 12:53:46 +08:00
cloudhan	0ddf4efbd9	Make PythonOp report dtype mismatch by name, instead of by using enum index (#13007 )	2022-09-20 12:29:30 +08:00
Adam Louly	268bfe2a5d	python training api bindings (#12610 ) Description: Python API Bindings for on device training. Motivation and Context - This PR contains api bindings so python users can perform a whole training loop. Co-authored-by: Adam Louly <adamlouly@microsoft.com@orttrainingdev7.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net> Co-authored-by: Baiju Meswani <bmeswani@microsoft.com>	2022-09-16 09:38:24 -07:00
Vincent Wang	da07c83948	SoftmaxCrossEntropyLossInternalGrad and Sum Fusion (#12746 ) * fuse scegrad and sum * add yield output shapes to value_info * resolve comments * fix merge main	2022-09-14 14:45:51 +08:00
pengwa	b5327595f3	Fix [prefast:Warning]: C26814 (#12897 ) fix C26814	2022-09-09 08:26:48 +08:00
Thiago Crepaldi	55c745eefd	Add support for ORTModule Torch cpp CUDA extension build within docker (#12868 ) Currently, CUDA hardware is not available to the build during `docker build`. Because of that, the resulting build would not have CUDA support even on CUDA-capable hardware. This PR adds an env var, ONNXRUNTIME_FORCE_CUDA, which allows CUDA extensions to be compiled even when CUDA support is not detected.	2022-09-08 15:30:44 -04:00
guyang3532	4765e5c382	Using ORTModule to wrap an evaluation model should not change the mode (#12747 ) Using ORTModule to wrap an evaluation model should not change the mode of the model	2022-09-08 10:54:59 +08:00
RandySheriffH	d3b684cd9e	Drop nuphar (#11555 ) * drop nuphar code and configs * refactor test case * format python * remove nuphar from training test * remove commented nuphar logics * restore llvm setting * drop nuphar ci * fix compile err * fix compile err Co-authored-by: Randy Shuai <rashuai@microsoft.com>	2022-09-07 15:11:18 -07:00
Baiju Meswani	9e47eb68e0	Remove unused orttraining amd dockerfiles and scripts (#12707 )	2022-09-02 18:43:21 -07:00
Baiju Meswani	295bd26980	Remove orttraining-distributed CI pipeline (#12738 )	2022-09-02 14:34:26 -07:00
ashbhandare	27dde0b51f	Csharp bindings for on-device training APIs (#12404 )	2022-09-02 13:13:48 -07:00


1129 commits