onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-06-24 02:47:54 +00:00

Author	SHA1	Message	Date
guyang3532	cfe830b248	Generalize label input sparsity check and refactor (#20636 ) ### Description The InsertGatherBeforeSceLoss optimization is enabled when the density of label padding is less than 90%, so we need to check that density to decide whether to enable the optimization. Before this PR, we only checked the graph inputs and correlated one of them with the SCE node by iterating the graph from the SCE node back to a graph input. This is hard to generalize because there may be complicated patterns between the graph input and the SCE node. This PR checks the padding density on the direct input of the SCE module, rather than on the graph input, at the first graph execution when exporting the ONNX graph. If the density is < 90%, a flag PythonOp is inserted after the SCE node as: ``` SoftmaxCrossEntropy \| PythonOp (func_name: FlagAndPrintDensity) (insert if density < 90%) \| Following graph ``` When InsertGatherBeforeSceLoss is invoked, it checks whether the flag PythonOp (func_name: FlagAndPrintDensity) follows the SCE node; if so, it removes the flag and performs the padding elimination optimization. If the environment variable ORTMODULE_PRINT_INPUT_DENSITY is 1, the input density is printed at each step by the PythonOp (func_name: FlagAndPrintDensity). In this case the PythonOp is not removed.	2024-05-10 21:55:43 +08:00
pengwa	56f7035521	Improve perf for mem efficient grad mgmt (#20480 ) ### Improve perf for mem efficient grad mgmt When the memory efficient gradient management feature is enabled, the weight retrieval PythonOp for every layer is launched at the beginning of the forward pass, which leaves the GPU stream idle for a few milliseconds. The reason is that the ReversedDFS ordering cannot ALWAYS handle such input branching well, so we introduce a distance-to-input_leaf concept when doing the ReversedDFS, which moves not only the problematic PythonOp to the place where it is needed, but also the Cast ops following the weight retrieval. Main branch: 102.19s - 26.35s = 75.84s for 260 steps (4627 samples), 61.04 samples/second This PR: 100.28s - 25.10s = 75.18s for 260 steps, 61.54 samples/second (+0.8% gain) Main branch: ![image](https://github.com/microsoft/onnxruntime/assets/10530022/75c4131e-dade-49b0-aa8b-ee1c637ad9a8) This PR: ![image](https://github.com/microsoft/onnxruntime/assets/10530022/e590a536-3b80-4f51-b89f-f25a55ddd7e2)	2024-05-10 08:09:17 +08:00
Dmitri Smirnov	08ecf30e0b	Implement numpy array over CPU OrtValues on return values (#20539 ) ### Description Create numpy arrays based on the native buffers of returned OrtValues. Hold on to the OrtValue until the numpy array is garbage collected. ### Motivation and Context This saves cpu on tensor copies and addresses customer concerns.	2024-05-08 10:56:36 -07:00
guyang3532	3e4db2c686	Fuse Cast + SoftmaxCrossEntropyLossInternal (#20334 ) ### Description Fuse Cast + SoftmaxCrossEntropyLossInternal to SoftmaxCrossEntropyLossInternal.	2024-04-29 14:12:10 +08:00
pengwa	f31486c8b7	Disable test_aten_conv_bf16 to unblock amd ci (#20499 )	2024-04-29 11:38:40 +08:00
Scott McKay	b842effa29	Fix some x86 build warnings in training code (#20451 ) ### Description Fix some misc build warnings from x86 Windows build	2024-04-26 20:29:21 +10:00
Frank Dong	227c4419fc	add bf16 support for few ops (#20385 ) ### Description Add bf16 support for the following ops: ConstantOfShape, Exp, Erf, convolution, PythonOp ### Motivation and Context The phimm model works on bf16, so ORT needs to support bf16 for the ops above to work with phimm on bf16	2024-04-25 11:28:34 -07:00
Adam Louly	4ce7bbf6f1	Add LayerSpec Support to ORTPipelineModule (#20410 ) ### Description In Deepspeed's Pipeline Parallel Implementation, there is a class used to instantiate the object after it's moved to the device and assigned in a stage. This approach helps reduce peak memory usage. In this PR, we're adding support to ORT for wrapping this LayerSpec.	2024-04-23 17:57:08 -07:00
guyang3532	ffb9c8d598	fix embedding sparsity log bug of -1% density (#20420 ) ### Description When no valid embedding sparsity has been checked, the log prints a misleading "-1% density" message. This PR fixes it.	2024-04-23 20:37:50 +08:00
Scott McKay	ed6f1adcb8	Fix overflow causing test failure on x86 (#20425 ) ### Description Fix comparison that was not updated when the threshold was converted to bytes. ### Motivation and Context Fix CI failure	2024-04-23 21:33:59 +10:00
pengwa	a7787a0bad	Introduce memory efficient topological sort (#20258 ) ### Introduce memory efficient topo sort (for training) ~~and lazily initialize Priority-Based and Memory-Efficient topo sort. Because in most cases, they are not needed, so we free the overheads of GraphViewer construction for most use cases.~~	2024-04-23 08:00:23 +08:00
Scott McKay	9372e9a0a3	Support >2GB of Tensor data in training checkpoint (#20077 ) ### Description Add ability to store initializer data in an external file. Update training checkpoint code to use external file if data > ~2GB. I don't see a way for the flatbuffers 64-bit offsets to be used, as they don't support storing 'table' types with 64-bit offsets (and our Tensor is a 'table' type not a simple struct). `0cfb7eb80b/tests/64bit/test_64bit.fbs (L38-L39)` Allowing a Tensor to have its raw_data in an external file should hopefully work with the least friction. As it's an extra field it's backwards compatible. Please feel free to suggest alternative approaches. Side note: the diffs in the generated *.fbs.h files are unexpectedly large. Maybe they weren't re-generated when the new flatbuffers version was checked in. I updated by running: `python .\compile_schema.py -f <build output dir>\_deps\flatbuffers-build\Debug\flatc.exe` from onnxruntime\core\flatbuffers\schema which I thought was the correct way but maybe that's out of date. I think you can ignore all the diffs in the generated files and just worry about the changes to the .fbs files in onnxruntime/core/flatbuffers/schema. Basically start at the bottom of the files changed and work up as all the 'real' diffs are there. --------- Co-authored-by: carzh <wolfivyaura@gmail.com>	2024-04-22 15:17:43 -07:00
Adam Louly	ee74fb6908	Introducing ORTPipelineModule - DeepSpeed Parallel Pipeline Support. (#20287 ) ### Description Introducing a new class ORTPipelineModule to handle wrapping layers in DeepSpeed pipeline parallel. ### Motivation and Context To support pipeline parallelism on ORTModule. This PR will include an initial support of deepspeed Pipeline parallelism. - [x] Support Pipeline parallel where layers are nn Modules in Sequential. - [ ] Support LayerSpec and TiedLayerSpec - [ ] Enable partitioning to accept List - [ ] Full-GPU Graph Consolidation - [ ] Subgraph Merging for Inference	2024-04-18 11:30:15 -07:00
Vincent Wang	c47f446f25	Support BFloat16 for Triton Codegen (#20353 ) The previous implementation used numpy arrays and numpy data types to store constant values and data types, which do not support BFloat16 natively. This PR switches to torch tensors, which support BFloat16.	2024-04-18 17:15:11 +08:00
Hector Li	5daeb5e0b0	enable model with external data be loaded from memory buffer (#19089 ) ### Description Background: A user saves a large model with initializer data in an external file, e.g.: onnx.save_model(onnx_model, "path/to/save/the/model.onnx", save_as_external_data=True, all_tensors_to_one_file=True, location="filename", size_threshold=1024). In that case, Ort loads the model, gets the external initializer information (external file name, offset, length), uses the model path to find the external file, and locates the tensor data via the offset and length. But this doesn't work if the user loads the model from memory, since Ort loses track of the model path. This PR adds an API/session option to let the user provide a table with the external initializer file name as the key, and the pointer to the loaded external file in memory and the buffer length as the value. So that 1. the user can load the model from a memory buffer with external initializers in a memory buffer too. 2. the initializers can be shared across sessions, for different EPs. 3. the user can load the file in any way they want, e.g. mmap. Internally, 1. at session creation time, Ort goes through the external initializers in the graph and gets the file name, offset, and data length of the external initializers from the TensorProto. 2. With the file name, Ort gets the in-memory file buffer and buffer length from the table the user provided. 3. Ort locates the tensor buffer within the in-memory file buffer (user provided) using the offset and data length (from the TensorProto). 4. Ort creates the Tensor and replaces the existing Tensor in the graph. ### Motivation and Context https://github.com/onnx/onnx/blob/main/docs/ExternalData.md For a model with external data, the TensorProto may have initializer data in a separate file. The external file location is set via the file path relative to the model path. The API that loads a model from a memory buffer loses track of the model path, so it causes an error if the model has external data. 
By adding a session option to set the external data buffer, Ort can find the external data correctly if the model is loaded from a memory buffer.	2024-04-17 19:01:01 -07:00
Adrian Lizarraga	0a1902525f	Add patch for ONNX 1.16.0 shape inference bug (#20316 ) ### Description - Adds a patch that fixes a shape inference bug that caused a segfault: https://github.com/onnx/onnx/pull/6080 - Fix documentation describing why QLinearMatMul tests are currently being skipped. ### Motivation and Context The [PR for integrating with ONNX 1.16.0](https://github.com/microsoft/onnxruntime/pull/19745) disabled various python quantization tests due to a shape inference bug. This PR applies the ONNX fix as a patch. We still can't enable the tests because some of our CIs pip install onnx-1.16.0, which doesn't include the fix.	2024-04-17 10:23:22 -07:00
liqun Fu	cd7112f800	Integration with ONNX 1.16.0 (#19745 ) ### Description update with ONNX 1.16.0 branch according to https://github.com/microsoft/onnxruntime/blob/main/docs/How_To_Update_ONNX_Dev_Notes.md ONNX 1.16.0 release notes: https://github.com/onnx/onnx/releases/tag/v1.16.0 #### Updated ops for CPU EP: - DequantizeLinear(21) - Added int16 and uint16 support + various optimizer tests - Missing int4 and uint4 support - Missing block dequantization support - QuantizeLinear(21) - Added int16 and uint16 support + various optimizer tests - Missing int4 and uint4 support - Missing block quantization support - Cast(21) - Missing int4 and uint4 support - CastLike(21) - Missing int4 and uint4 support - ConstantOfShape(21) - Missing int4 and uint4 support - Identity(21) - Missing int4 and uint4 support - If(21) - Missing int4 and uint4 support - Loop(21) - Missing int4 and uint4 support - Reshape(21) - Missing int4 and uint4 support - Scan(21) - Missing int4 and uint4 support - Shape(21) - Missing int4 and uint4 support - Size(21) - Missing int4 and uint4 support - Flatten(21) - Missing float8e4m3fnuz, float8e5m2, float8e5m2fnuz, int4, and uint4 support - Pad(21) - Missing float8e4m3fnuz, float8e5m2, float8e5m2fnuz, int4, and uint4 support - Squeeze(21) - Missing float8e4m3fnuz, float8e5m2, float8e5m2fnuz, int4, and uint4 support - Transpose(21) - Missing float8e4m3fnuz, float8e5m2, float8e5m2fnuz, int4, and uint4 support - Unsqueeze(21) - Missing float8e4m3fnuz, float8e5m2, float8e5m2fnuz, int4, and uint4 support #### Unimplemented opset 21 features/ops - int4 and uint4 data type - QLinearMatMul(21) - GroupNormalization(21) - ai.onnx.ml.TreeEnsemble(5) 
### Disabled tests #### ORT Training orttraining/orttraining/test/python/orttraining_test_ort_apis_py_bindings.py - test_ort_custom_ops: Potential shape inference bug for custom ops #### Python quantization unit tests test/onnx/python/quantization (shape inference bug) - test_op_conv_transpose.py: test_quantize_conv_transpose_u8u8_fp16 - test_op_conv_transpose.py: test_quantize_conv_transpose_s8s8_fp16 - test_op_gemm.py: test_quantize_qop_gemm_s8s8 - test_op_gemm.py: test_quantize_qop_gemm_e4m3fn_same - test_op_gemm.py: test_quantize_qop_gemm_e4m3fn_p3 - test_op_matmul.py: test_quantize_matmul_u8u8_f16 - test_op_matmul.py: test_quantize_matmul_s8s8_f16 - test_op_matmul.py: test_quantize_matmul_s8s8_f16_entropy - test_op_matmul.py: test_quantize_matmul_s8s8_f16_percentile - test_op_matmul.py: test_quantize_matmul_s8s8_f16_distribution - test_op_relu.py: test_quantize_qop_relu_s8s8 #### ONNX tests - test_maxpool_2d_ceil_output_size_reduce_by_one: ONNX 1.16.0 fixed a maxpool output size bug and added this test. Enable this test when [ORT PR](https://github.com/microsoft/onnxruntime/pull/18377) is merged. Refer to original [ONNX PR](https://github.com/onnx/onnx/pull/5741). 
- test_ai_onnx_ml_tree_ensemble_set_membership_cpu: new unimplemented op ai.onnx.ml.TreeEnsemble - test_ai_onnx_ml_tree_ensemble_single_tree_cpu: same - test_ai_onnx_ml_tree_ensemble_set_membership_cuda: same - test_ai_onnx_ml_tree_ensemble_single_tree_cuda: same - test_cast_INT4_to_FLOAT_cpu: ORT Cast(21) impl doesn't support int4 yet - test_cast_INT4_to_INT8_cpu: same - test_cast_UINT4_to_FLOAT_cpu: same - test_cast_UINT4_to_UINT8_cpu: same - test_cast_INT4_to_FLOAT_cuda - test_cast_INT4_to_INT8_cuda - test_cast_UINT4_to_FLOAT_cuda - test_cast_UINT4_to_UINT8_cuda - test_constantofshape_float_ones_cuda: ConstantOfShape(21) not implemented for cuda - test_constantofshape_int_shape_zero_cuda: same - test_constantofshape_int_zeros_cuda: same - test_flatten_axis0_cuda: Flatten(21) not implemented for cuda - test_flatten_axis1_cuda: same - test_flatten_axis2_cuda: same - test_flatten_axis3_cuda: same - test_flatten_default_axis_cuda: same - test_flatten_negative_axis1_cuda: same - test_flatten_negative_axis2_cuda: same - test_flatten_negative_axis3_cuda: same - test_flatten_negative_axis4_cuda: same - test_qlinearmatmul_2D_int8_float16_cpu: QLinearMatMul(21) for onnx not implemented in ORT yet - test_qlinearmatmul_2D_int8_float32_cpu: same - test_qlinearmatmul_2D_uint8_float16_cpu: same - test_qlinearmatmul_2D_uint8_float32_cpu: same - test_qlinearmatmul_3D_int8_float16_cpu: same - test_qlinearmatmul_3D_int8_float32_cpu: same - test_qlinearmatmul_3D_uint8_float16_cpu: same - test_qlinearmatmul_3D_uint8_float32_cpu: same - test_qlinearmatmul_2D_int8_float16_cuda: same - test_qlinearmatmul_2D_int8_float32_cuda: same - test_qlinearmatmul_2D_uint8_float16_cuda: same - test_qlinearmatmul_2D_uint8_float32_cuda: same - test_qlinearmatmul_3D_int8_float16_cuda: same - test_qlinearmatmul_3D_int8_float32_cuda: same - test_qlinearmatmul_3D_uint8_float16_cuda: same - test_qlinearmatmul_3D_uint8_float32_cuda: same - test_size_cuda: Size(21) not implemented for cuda - 
test_size_example_cuda: same - test_dequantizelinear_blocked: Missing implementation for block dequant for DequantizeLinear(21) - test_quantizelinear_blocked_asymmetric: Missing implementation for block quant for QuantizeLinear(21) - test_quantizelinear_blocked_symmetric: Missing implementation for block quant for QuantizeLinear(21) --------- Signed-off-by: liqunfu <liqun.fu@microsoft.com> Signed-off-by: Ganesan Ramalingam <grama@microsoft.com> Co-authored-by: Ganesan Ramalingam <grama@microsoft.com> Co-authored-by: George Wu <jywu@microsoft.com> Co-authored-by: adrianlizarraga <adlizarraga@microsoft.com>	2024-04-12 09:46:49 -07:00
guyang3532	471e969e2f	Check padding density by input of embedding module (#19821 ) ### Description The PaddingElimination optimization is enabled when the density of embedding padding is less than 90%, so we need to check that density to decide whether to enable the optimization. Before this PR, we only checked the graph inputs and correlated one of them with the embedding node by iterating the graph from the embedding node back to a graph input. This is hard to generalize because there may be complicated patterns between the graph input and the embedding node. This PR checks the padding density on the direct input of the embedding module, rather than on the graph input, at the first graph execution when exporting the ONNX graph. If the density is < 90%, a flag PythonOp is inserted after the embedding node as: ``` Embedding \| PythonOp (func_name:_FlagPaddingElimination) (insert if density < 90%) \| Following graph ``` When PaddingElimination is invoked, it checks whether the flag PythonOp (func_name:_FlagPaddingElimination) follows the Embedding node; if so, it removes the flag and performs the padding elimination optimization.	2024-04-10 18:45:51 +08:00
pengwa	280b2634c5	Prompt layer-wise recompute when applicable (#20126 ) ### Prompt layer-wise when applicable Give users explicit prompts on export failures to enable layer-wise memory optimization if we find that the checkpoint function is used. - Using the checkpoint function is a strong indicator that the model is too large to fit in GPU memory. - If we don't override the checkpoint function here, ONNX export will mostly fail. 1. For old versions of PyTorch, we just throw an exception when handling the gradient checkpoint feature. 2. For new versions of PyTorch, an export failure happens. - But neither failure told users explicitly "HOW" to mitigate. This PR does that. ![image](https://github.com/microsoft/onnxruntime/assets/10530022/c0476748-5818-4cc8-b2d6-88c7580fe4da)	2024-04-10 11:50:28 +08:00
pengwa	41acd8c543	Support more ops for recompute (#20234 ) ### Support more ops for recompute To cover Mistral model, and support padding elimination ops.	2024-04-09 09:24:48 +08:00
pengwa	2092bebc78	Fix transformer layer detection for recompute (#20106 ) ### Fix transformer layer detection for recompute The original logic failed to detect the layer boundary node in the Mistral model. This PR simplifies the search by using stronger pattern matching, to make sure it is flexible enough to cover different transformer variants. Also add a UT. Add a warning when the user enables layerwise recompute but no layer boundary nodes are found.	2024-03-29 17:44:38 +08:00
pengwa	55f63a48ca	Keep original name during fusion (#20097 ) ### Keep original name during fusion This could be helpful to know where a fused node came from; I feel this is very useful when debugging execution order issues between different transformer layers. For example: - A node named `/_original_module/model/layers.1/self_attn/MatMul/MatmulTransposeFusion//MatMulScaleFusion/` goes through two fusion paths in the 1st transformer layer - e.g. `MatmulTransposeFusion` and `MatMulScaleFusion`. - `/_original_module/model/layers.2/post_attention_layernorm/Mul_1/SimplifiedLayerNormFusion/` node is a fused node by `SimplifiedLayerNormFusion`.	2024-03-28 08:40:34 +08:00
guyang3532	4aa84003ca	support Pow/Div/Sqrt in PaddingElimination (#20083 )	2024-03-27 16:10:07 +08:00
zhijiang	b14d3f1d52	Zhijxu/fix softmax fp16 (#20059 ) With fp16 input, softmax returns NaN in some cases. The reason is that for the float16 dtype, std::numeric_limits<float16>::infinity() returns 0 instead of inf.	2024-03-27 11:37:10 +08:00
pengwa	dfa891a2d8	Fix memory stats printing (#20061 ) ### Fix memory stats printing Memory stats printing fails when the module is in eval mode while being wrapped by ORTModule. At that time, the runtime inspector for the training manager should have its training mode set to true, but got false (because the existing logic gets the boolean from module.training). The runtime inspector, as part of the training manager or inference manager, should know explicitly whether it is serving training or not, so we cannot depend on the state of module.training during ORTModule initialization.	2024-03-26 21:25:59 +08:00
pengwa	1a0ba3f69f	Fix softmax export (#20057 ) ### Description Why do we need to define softmax export logic here? For the usage `nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32)` in the model, `76a33a1092/src/transformers/models/mistral/modeling_mistral.py (L302)` If dtype is specified, the input tensor is cast to dtype before the operation is performed. This is useful for preventing data type overflows. The existing ONNX exporter, however, does the cast after the operation, which is not correct. (`cf06189a2d/torch/onnx/symbolic_opset13.py (L27)`). This override can be a workaround until PyTorch fixes the issue in coming releases. (TODO: pengwa - add PyTorch versions when the issue is fixed). @thiagocrepaldi We may need a fix in PyTorch repo as well.	2024-03-26 13:09:20 +08:00
Vincent Wang	d30c81d270	Add Symbolic Shape Hint to Triton Codegen Config (#20056 ) Add symbolic shape hint to Triton codegen config so that we can avoid unnecessary recompilation when input shapes keep changing. The screenshot below shows that with proper configuration, we can speed up training a lot by reducing unnecessary recompilation. ![image](https://github.com/microsoft/onnxruntime/assets/11661208/699944d2-81cd-4c22-84e7-73a4fa0d2a28)	2024-03-25 15:05:02 +08:00
Baiju Meswani	2bc29244b4	Support model with multiple SCE loss nodes (#20016 )	2024-03-22 10:28:44 -07:00
Prathik Rao	0b958bb421	add random seed to layernorm tests (#19998 ) Adds random seed to layernorm tests to prevent random failure. ### Motivation and Context Fixes https://github.com/microsoft/onnxruntime/issues/19983	2024-03-20 21:00:25 -07:00
mindest	3dfe4a5e6d	[ROCm] Remove MPI dependency and collectives to use NCCL (#19830 ) ### Description * Remove MPI dependency to use NCCL AllReduce, etc. * Exclude unsupported collectives in hipify	2024-03-19 17:35:18 -07:00
Tianlei Wu	597e828aae	Adjust test tolerance (#19947 ) ### Description Improve the precision of tests. Changes include: (1) Update checkers.cc to use consistent default tolerance. (2) Allow different default tolerances for different providers at runtime (previously, the threshold of a test was decided at compile time). (3) Explicitly set absolute and relative error tolerances for tests that failed to pass the new default threshold. #### Default Thresholds Change Note that the test formula is `abs(expected - value) < absolute + relative * expected` Default test thresholds when both absolute and relative tolerance are not set: type \| provider \| absolute (before) \| absolute (after) \| relative (before) \| relative (after) -- \| -- \| -- \| -- \| -- \| -- double \| CPU \| 0.001 \| 0.00001 \| 0 \| 0.00001 double \| CUDA \| 0.005 \| 0.00001 \| 0 \| 0.00001 double \| TRT \| 0.005 \| 0.00001 \| 0 \| 0.00001 double \| ROCM \| 0.005 \| 0.00001 \| 0 \| 0.00001 double \| DML \| 0.005 \| 0.00001 \| 0 \| 0.00001 \| \| \| \| \| float \| CPU \| 0.0001 \| 0.00001 \| 0 \| 0.0001 float \| CUDA \| 0.005 \| 0.00001 \| 0 \| 0.0001 float \| TRT \| 0.005 \| 0.00001 \| 0 \| 0.0001 float \| ROCM \| 0.005 \| 0.00001 \| 0 \| 0.0001 float \| DML \| 0.005 \| 0.00001 \| 0 \| 0.0001 float \| Training* \| 0.005 \| 0.001 \| 0 \| 0.0001 \| \| \| \| \| half \| CPU \| 0.001 \| 0.0025 \| 0 \| 0.001 half \| CUDA \| 0.005 \| 0.0025 \| 0 \| 0.001 half \| TRT \| 0.005 \| 0.0025 \| 0 \| 0.001 half \| ROCM \| 0.005 \| 0.0025 \| 0 \| 0.001 half \| DML \| 0.02 \| 0.005 \| 0 \| 0.001 half \| Training* \| 0.005 \| 0.005 \| 0 \| 0.001 \| \| \| \| \| bfloat16 \| CPU \| 0.0001 \| 0.02 \| 0 \| 0.01 bfloat16 \| CUDA \| 0.0001 \| 0.02 \| 0.05 \| 0.01 bfloat16 \| TRT \| 0.0001 \| 0.02 \| 0.05 \| 0.01 bfloat16 \| ROCM \| 0.0001 \| 0.02 \| 0.05 \| 0.01 bfloat16 \| DML \| 0.0001 \| 0.02 \| 0.05 \| 0.01 bfloat16 \| Training* \| 0.0001 \| 0.02 \| 0.05 \| 0.01 *Training means the build flag ENABLE_TRAINING_CORE is defined. 
The provider can be any one. #### Threshold for provider Previously, the threshold might change according to build flags: ``` #if defined(USE_CUDA) \|\| defined(USE_ROCM) \|\| defined(USE_DML) constexpr float threshold = 0.005f; #else constexpr float threshold = 0.0001f; #endif ``` For a cpu only build, the threshold is 0.0001. For a cuda build, the threshold for CPU provider (some tests in cuda build actually run with CPU provider) is changed to 0.005. After this change, the threshold depends only on the data type and the provider used in the test. It does not change with build flags for non-training builds. Default thresholds for training might be different from inference (please refer to the above table). There are a few factors there: training has gradient outputs; TF32 is not disabled in training; some training tests have iterations, and error might accumulate. How to set different thresholds based on these factors could be a future task.	2024-03-19 15:50:13 -07:00
Prathik Rao	26cd3c1fb0	add kernel tests for ops that changed in opset18 (#19767 ) ### Description - [x] Pad operator has introduced a new input called "axes" which specifies which axis to pad. But it defaults to input_rank if axes is not provided which was the behavior before the opset upgrade. - [x] ReduceMean - [x] ReduceL2 - [x] ReduceLogSumExp - [x] ReduceSum - Reduction ops all had the axes attribute switched to an input and a new attribute called "noop_with_empty_axes" was added to define what to do when axes is not specified. - [x] Resize has had two new attributes introduced: antialias and keep_aspect_ratio_policy. From Operators.md I've gathered: "Antialiasing is achieved by stretching the resampling filter by a factor max(1, 1 / scale), which means that when downsampling, more input pixels contribute to an output pixel." keep_aspect_ratio_policy "describes how to interpret the `sizes` input with regard to keeping the original aspect ratio of the input." there are a couple enum-type options that specify different policies and what to do in each case. - NOTE: Baiju already included opset18 tests in https://github.com/microsoft/onnxruntime/pull/17772 - [x] ScatterElements/ScatterND has had a new attribute introduced called "reduction." This specifies the type of reduction to apply: none (default), add, mul, max, min. - [x] Split introduced a new attribute called "num_outputs" which specifies how many outputs to split the input tensor into. This is in contrast to the previous, default behavior of specifying a "split" input which defines the size of each resultant tensor of the output.	2024-03-19 09:33:06 -07:00
Baiju Meswani	226f60f2f1	Add support for SGD optimizer in minimal build (#19901 )	2024-03-14 11:31:20 -07:00
Justin Chu	faea42af95	Bump ruff to 0.3.2 and black to 24 (#19878 ) ### Motivation and Context Routine updates	2024-03-13 10:00:32 -07:00
pengwa	3fb8905393	Fix torch cpp extension build warnings (#19842 ) ### Fix torch cpp extension build warnings For the warnings shown as below: ``` cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++ [4/5] c++ -MMD -MF /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/build/temp.linux-x86_64-cpython-38/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils/custom_function_bw.o.d -pthread -B /opt/conda/envs/ptca/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/ptca/include/python3.8 -c -c /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils/custom_function_bw.cc -o /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/build/temp.linux-x86_64-cpython-38/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils/custom_function_bw.o -O3 -std=c++17 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=torch_interop_utils -D_GLIBCXX_USE_CXX11_ABI=0 cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++ In file included from 
/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/utils/python_arg_parser.h:65, from /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/utils/tensor_new.h:4, from /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils/custom_function_bw.cc:9: /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/utils/python_strings.h:104:19: warning: ‘pybind11::object PyObject_FastGetAttrString(PyObject, const char)’ defined but not used [-Wunused-function] 104 \| static py::object PyObject_FastGetAttrString(PyObject* obj, const char* name) { \| ^~~~~~~~~~~~~~~~~~~~~~~~~~ [5/5] c++ -MMD -MF /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/build/temp.linux-x86_64-cpython-38/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils/custom_function_fw.o.d -pthread -B /opt/conda/envs/ptca/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/ptca/include/python3.8 -c -c /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils/custom_function_fw.cc -o 
/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/build/temp.linux-x86_64-cpython-38/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils/custom_function_fw.o -O3 -std=c++17 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=torch_interop_utils -D_GLIBCXX_USE_CXX11_ABI=0 cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++ In file included from /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/utils/python_arg_parser.h:65, from /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/utils/tensor_new.h:4, from /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils/custom_function_fw.cc:13: /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/utils/python_strings.h:104:19: warning: ‘pybind11::object PyObject_FastGetAttrString(PyObject, const char)’ defined but not used [-Wunused-function] 104 \| static py::object PyObject_FastGetAttrString(PyObject* obj, const char* name) { \| ^~~~~~~~~~~~~~~~~~~~~~~~~~ g++ -pthread -B /opt/conda/envs/ptca/compiler_compat -Wl,--sysroot=/ -pthread -shared -B /opt/conda/envs/ptca/compiler_compat -L/opt/conda/envs/ptca/lib -Wl,-rpath=/opt/conda/envs/ptca/lib -Wl,--no-as-needed -Wl,--sysroot=/ /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/build/temp.linux-x86_64-cpython-38/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils/ctx_pool.o 
/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/build/temp.linux-x86_64-cpython-38/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils/custom_function_bw.o /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/build/temp.linux-x86_64-cpython-38/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils/custom_function_fw.o /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/build/temp.linux-x86_64-cpython-38/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils/custom_function_shared.o /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/build/temp.linux-x86_64-cpython-38/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/torch_interop_utils/torch_interop_utils.o -L/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/lib -lc10 -ltorch -ltorch_cpu -ltorch_python -o build/lib.linux-x86_64-cpython-38/torch_interop_utils.cpython-38-x86_64-linux-gnu.so Installing /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/build/lib.linux-x86_64-cpython-38/fused_ops.cpython-38-x86_64-linux-gnu.so -> /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/fused_ops.cpython-38-x86_64-linux-gnu.so Installing /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/build/lib.linux-x86_64-cpython-38/aten_op_executor.cpython-38-x86_64-linux-gnu.so -> /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/aten_op_executor.cpython-38-x86_64-linux-gnu.so 
Installing /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/build/lib.linux-x86_64-cpython-38/torch_gpu_allocator.cpython-38-x86_64-linux-gnu.so -> /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/torch_gpu_allocator.cpython-38-x86_64-linux-gnu.so Installing /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/build/lib.linux-x86_64-cpython-38/torch_interop_utils.cpython-38-x86_64-linux-gnu.so -> /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/torch_interop_utils.cpython-38-x86_64-linux-gnu.so ``` Fixed by replacing the existing `PyObject_GetAttrString` calls with `PyObject_FastGetAttrString`, whose implementation comment claims it is faster.	2024-03-12 10:51:30 +08:00
pengwa	3e954da3e6	Fix and enable few ORTModule Unit Tests (#19847 ) ### Fix and enable few ORTModule Unit Tests Fix 'test_bert_inputs_with_dynamic_shape' and 'test_bert_result_with_layerwise_recompute', which generated NaN loss in the ORT run. The root cause is that the logic generating the attention mask test data was incorrect: only 0 or 1 is allowed in the dataset, but many other values appeared. (This did not show up with older versions of transformers, for example v4.4.2 or 4.16.2, because they do not contain the change `d3cb28886a`, which increases the scaling to a bigger number and causes an overflow to inf.) Another improvement made during the investigation with the convergence tools: don't dump activations during the model export phase; otherwise the dumped data might contain results from the PyTorch run, which is confusing when comparing against stock PyTorch run results.	2024-03-12 10:49:19 +08:00
Vincent Wang	0c078dfc8b	Some Shape Related Fusions (#19832 ) This PR adds the shape-related fusions below, which are helpful for some transformer models: - ShapeInputMerge merges the input NodeArgs of all Shape nodes into a single one (the first in topological order) if they have the same shape value. This helps CSE fusion merge more nodes. - CSE fusion now supports a scalar tensor as an attribute value. This is mainly to support the ConstantOfShape node.	2024-03-12 10:29:27 +08:00
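The ShapeInputMerge idea in the commit above can be illustrated with a small sketch. This is not the ORT implementation: the function name `merge_shape_inputs`, the tuple-based node representation, and the `inferred_shapes` mapping are all assumptions made for illustration.

```python
def merge_shape_inputs(shape_nodes, inferred_shapes):
    """Point Shape nodes with equal input shapes at one shared input.

    shape_nodes: list of (node_name, input_name) in topological order.
    inferred_shapes: input_name -> tuple of dims (ints or symbolic names).
    Both structures are hypothetical stand-ins for graph objects.
    """
    first_for_shape = {}
    merged = []
    for node_name, input_name in shape_nodes:
        shape = inferred_shapes.get(input_name)
        if shape is None:
            # Unknown shape: leave the node untouched.
            merged.append((node_name, input_name))
            continue
        # The first input seen with this shape becomes the canonical one.
        canonical = first_for_shape.setdefault(shape, input_name)
        merged.append((node_name, canonical))
    return merged


nodes = [("shape_0", "hidden_a"), ("shape_1", "hidden_b"), ("shape_2", "logits")]
shapes = {
    "hidden_a": ("batch", "seq", 768),
    "hidden_b": ("batch", "seq", 768),
    "logits": ("batch", "seq", 30522),
}
result = merge_shape_inputs(nodes, shapes)
```

After the merge, `shape_0` and `shape_1` read the same input, so a later common-subexpression pass can treat them as identical nodes.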
pengwa	5c5d6e99ce	Define recomputable op list with domain/opset (#19722 ) ### Define recomputable op list with domain/opset Originally, we only checked the OpType to decide whether an op is recomputable. In this PR, a few improvements are made: 1. [Op type search] Domain + OpType are used to check whether the op is supported for recompute. 2. [Opset search] Then, node.SinceVersion() is looked up in the supported opsets. 3. During subgraph detection, if the node's opset is supported, get its ignorable input indices, i.e. the inputs that are not considered in the bottom-up search. This saves time during subgraph detection.	2024-03-07 09:12:12 +08:00
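The two-step lookup described in the commit above (domain + op type, then opset version) might be sketched as below. The table contents, the name `RECOMPUTABLE_OPS`, and the function `ignorable_inputs` are illustrative assumptions, not the actual list shipped in ORT.

```python
# Hypothetical table: (domain, op_type) -> {since_version: ignorable input indices}.
# The entries are examples only, not ORT's real recomputable op list.
RECOMPUTABLE_OPS = {
    ("", "Add"): {13: (), 14: ()},
    # Assumption: the shape input (index 1) need not be searched upward.
    ("", "Reshape"): {13: (1,), 14: (1,)},
    ("com.microsoft", "BiasGelu"): {1: ()},
}


def ignorable_inputs(domain, op_type, since_version):
    """Return the ignorable input indices, or None if not recomputable."""
    opsets = RECOMPUTABLE_OPS.get((domain, op_type))  # step 1: domain + op type
    if opsets is None:
        return None
    return opsets.get(since_version)  # step 2: opset search
```

A `None` result means the node is either an unsupported op or a supported op at an unsupported opset version; an empty tuple means the node is recomputable and all of its inputs take part in the bottom-up search.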
pengwa	d9bf85613d	Adapt memory optimizer to fit PHI2 (#19757 ) ### Adapt memory optimizer to fit PHI2 A few improvements and bug fixes: 1. Fix a bug related to transformer layer detection. 2. Use the default reversed topological order to create recompute nodes, so that leaf nodes are not handled too late and given the lowest execution priority. 3. Add an early stop when an activation's element count is constant and the total element count is < 1M. This avoids the overhead of searching subgraphs. With `export ORTMODULE_MEMORY_OPT_LEVEL=1` to enable layerwise recompute, memory consumption on the given recipe dropped from ~22GB to ~13GB.	2024-03-06 21:54:16 +08:00
Scott McKay	db59cec82f	Don't reduce warning level for CUDA build on Windows (#19663 ) ### Description Address warnings so all the ORT projects build with /W4 on Windows. Mainly - unused parameters - variables shadowing other ones ### Motivation and Context #19588 started on this.	2024-03-06 15:03:55 +10:00
Vincent Wang	1bfc26685b	ATen Op Supports Int Return Type and CPU Tensor Arguments (#19773 ) This PR: - add support for int as return type, will create a CPU scalar tensor for it. - add attributes to specify which arguments or returns are CPU tensors. - adjust ATen efficient attn to match latest PyTorch native function. - a Triton codegen bugfix by the way.	2024-03-06 10:11:46 +08:00
pengwa	d102569755	Fix seed for recomputed Dropout (#19715 ) ### Fix seed for recomputed Dropout If a Dropout node is recomputed in the backward pass, we should make sure its execution is the same as the run in the forward pass. If we don't set the seed attribute, this cannot be guaranteed. Add `export ORTMODULE_MEMORY_OPT_LEVEL=2` to enable per-layer recompute with compromised recomputable subgraphs.	2024-03-06 10:06:25 +08:00
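The reason a recomputed Dropout needs a fixed seed can be shown with a toy, pure-Python dropout. This is only a sketch of the principle; the `dropout` helper below is hypothetical and unrelated to ORT's kernel.

```python
import random


def dropout(values, ratio, seed=None):
    """Toy dropout: zero each element with probability `ratio`, rescale the rest."""
    rng = random.Random(seed)  # a fixed seed makes the mask reproducible
    keep = 1.0 - ratio
    return [v / keep if rng.random() < keep else 0.0 for v in values]


x = [1.0] * 8
forward = dropout(x, 0.5, seed=1234)
recomputed = dropout(x, 0.5, seed=1234)

# With the same seed, the recomputed output matches the forward output,
# so the backward pass sees the same mask that the forward pass used.
assert forward == recomputed
```

Calling `dropout` with `seed=None` draws a fresh mask on every call, which is the situation the commit guards against.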
guyang3532	cd56ea4a74	enable embedding sparse optimization by default (#19714 )	2024-03-05 13:15:30 +08:00
zhijiang	2a5c9b86eb	Zhijxu/fix conv1d replacement (#19758 ) remove the constraint - "group number should be less than 3"; add more condition to make sure the conv1d replacement only happens on conv1d instead of conv2d/conv3d; add more tests;	2024-03-05 10:11:19 +08:00
Adam Louly	d5606cd7ee	Introducing customizable input names for loss in generate_artifacts. (#19705 ) # loss function extra inputs. Currently, the loss functions in onnxblock expect exactly two inputs in their build method. Occasionally, models may pass additional inputs, causing the build function to fail. To solve this issue, we can let users pass a list of loss input names to be used in the loss function.	2024-02-29 13:40:56 -08:00
Vincent Wang	937cdd651e	[ORTMODULE] Support Register Custom Triton Kernel (#19690 ) Add support for registering custom Triton kernel function.	2024-02-29 23:03:57 +08:00
Vincent Wang	d2e6dd25ea	Merge GatherToSplitFusion and #19218 to a General Fusion (#19600 ) #19218 tried to fuse Gather/Slice to Split, but the logic has a problem. A scalar value and a 1-dim value of indices in a Gather node produce different results: a scalar value produces a result tensor with the axis dim removed, while a 1-dim indices value keeps that dim, even when the dim value is 1. For example, Node \|-> Gather(indices=[0], axis=axis) \|-> Gather(indices=[1], axis=axis) \|-> Slice(index=2, axis=axis) is same as Node \|-> Split(axis=axis) But Node \|-> Gather(indices=0, axis=axis) \|-> Gather(indices=1, axis=axis) \|-> Slice(index=2, axis=axis) is same as Node \|-> Split(axis=axis) \|\|-> Squeeze(axis=axis) \|\|-> Squeeze(axis=axis) \|\|-> The previous PR doesn't take such cases related to Squeeze/Unsqueeze into account. This PR merges #19218 and GatherToSplitFusion into a general fusion, which relaxes the limit on the number of Gather and Slice nodes and checks all Gather and Slice consumers: if the indices of Gather and the start/end of Slice cover the specific dim of the input tensor, they can be fused to a Split, adding Squeeze where necessary according to the dim count of the indices tensor in Gather. @rui-ren, please check if the fix can still be applied to your model.	2024-02-29 13:45:58 +08:00
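The scalar versus 1-dim indices distinction in the commit above can be checked with pure-Python stand-ins for Gather, Split and Squeeze along axis 0 of a 2-D tensor. The helpers `gather`, `split` and `squeeze` are hypothetical simplifications, not ONNX or ORT APIs.

```python
def gather(data, indices):
    """Scalar index drops the axis; 1-dim indices keep it."""
    if isinstance(indices, int):
        return data[indices]
    return [data[i] for i in indices]


def split(data):
    """Split into pieces of size 1 along axis 0 (the axis is kept)."""
    return [[row] for row in data]


def squeeze(piece):
    """Remove the leading axis, which must have size 1."""
    assert len(piece) == 1
    return piece[0]


x = [[1, 2], [3, 4], [5, 6]]  # shape [3, 2]
pieces = split(x)

# 1-dim indices: the Gather output equals the Split output directly.
assert gather(x, [0]) == pieces[0]
# Scalar index: the Gather output equals Split followed by Squeeze.
assert gather(x, 1) == squeeze(pieces[1])
# Slice keeps the axis, like a 1-dim Gather, so no Squeeze is needed.
assert x[2:3] == pieces[2]
```

This is why the fused pattern only needs a Squeeze on the branches that replace a scalar-index Gather.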
pengwa	026e3178ae	Improve memory matrix for ORTModule (#19620 ) ### Memory matrix for ORTModule Collect parameter/gradient/buffers sizes also. Exposed as a function, can be used externally for debugging purpose. ``` 2024-02-27 07:18:55,283 orttraining.rank-0 [INFO] - rank-0 step 1 memory (MiB) \| phase: pre_forward \| allocated: 5331 \| max allocated: 9039 \| cached: 9382 \| max cached: 9382 \| inactive: 219 \| max inactive: 816 \| param: 5314 \| grad: 0 \| buffer: 8 2024-02-27 07:18:55,322 orttraining.rank-0 [INFO] - rank-0 step 1 memory (MiB) \| phase: post_forward \| allocated: 8162 \| max allocated: 9039 \| cached: 9382 \| max cached: 9382 \| inactive: 400 \| max inactive: 816 \| param: 5314 \| grad: 0 \| buffer: 8 2024-02-27 07:18:55,358 orttraining.rank-0 [INFO] - rank-0 step 1 memory (MiB) \| phase: pre_backward \| allocated: 8926 \| max allocated: 9039 \| cached: 9382 \| max cached: 9382 \| inactive: 400 \| max inactive: 816 \| param: 5314 \| grad: 0 \| buffer: 8 2024-02-27 07:18:55,438 orttraining.rank-0 [INFO] - rank-0 step 1 memory (MiB) \| phase: post_backward \| allocated: 6098 \| max allocated: 9039 \| cached: 9382 \| max cached: 9382 \| inactive: 218 \| max inactive: 831 \| param: 5314 \| grad: 12 \| buffer: 8 0%\|▏ \| 2/3200 [01:27<32:05:11, 36.12s/it]2024-02-27 07:18:55,498 orttraining.rank-0 [INFO] - rank-0 step 2 memory (MiB) \| phase: pre_forward \| allocated: 5331 \| max allocated: 9039 \| cached: 9382 \| max cached: 9382 \| inactive: 219 \| max inactive: 831 \| param: 5314 \| grad: 0 \| buffer: 8 2024-02-27 07:18:55,537 orttraining.rank-0 [INFO] - rank-0 step 2 memory (MiB) \| phase: post_forward \| allocated: 8162 \| max allocated: 9039 \| cached: 9382 \| max cached: 9382 \| inactive: 400 \| max inactive: 831 \| param: 5314 \| grad: 0 \| buffer: 8 2024-02-27 07:18:55,576 orttraining.rank-0 [INFO] - rank-0 step 2 memory (MiB) \| phase: pre_backward \| allocated: 8926 \| max allocated: 9039 \| cached: 9382 \| max cached: 9382 \| inactive: 400 
\| max inactive: 831 \| param: 5314 \| grad: 0 \| buffer: 8 2024-02-27 07:18:55,657 orttraining.rank-0 [INFO] - rank-0 step 2 memory (MiB) \| phase: post_backward \| allocated: 6098 \| max allocated: 9039 \| cached: 9382 \| max cached: 9382 \| inactive: 218 \| max inactive: 831 \| param: 5314 \| grad: 12 \| buffer: 8 0%\|▏ \| 3/3200 [01:27<17:30:57, 19.72s/it]2024-02-27 07:18:55,711 orttraining.rank-0 [INFO] - rank-0 step 3 memory (MiB) \| phase: pre_forward \| allocated: 5331 \| max allocated: 9039 \| cached: 9382 \| max cached: 9382 \| inactive: 219 \| max inactive: 831 \| param: 5314 \| grad: 0 \| buffer: 8 2024-02-27 07:18:55,750 orttraining.rank-0 [INFO] - rank-0 step 3 memory (MiB) \| phase: post_forward \| allocated: 8162 \| max allocated: 9039 \| cached: 9382 \| max cached: 9382 \| inactive: 400 \| max inactive: 831 \| param: 5314 \| grad: 0 \| buffer: 8 2024-02-27 07:18:55,786 orttraining.rank-0 [INFO] - rank-0 step 3 memory (MiB) \| phase: pre_backward \| allocated: 8926 \| max allocated: 9039 \| cached: 9382 \| max cached: 9382 \| inactive: 400 \| max inactive: 831 \| param: 5314 \| grad: 0 \| buffer: 8 2024-02-27 07:18:55,867 orttraining.rank-0 [INFO] - rank-0 step 3 memory (MiB) \| phase: post_backward \| allocated: 6098 \| max allocated: 9039 \| cached: 9382 \| max cached: 9382 \| inactive: 218 \| max inactive: 831 \| param: 5314 \| grad: 12 \| buffer: 8 [2024-02-27 07:18:55,886] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. 
Reducing hysteresis to 1 0%\|▎ \| 4/3200 [01:28<10:39:52, 12.01s/it]2024-02-27 07:18:55,902 orttraining.rank-0 [INFO] - rank-0 step 4 memory (MiB) \| phase: pre_forward \| allocated: 5331 \| max allocated: 9039 \| cached: 9382 \| max cached: 9382 \| inactive: 219 \| max inactive: 831 \| param: 5314 \| grad: 0 \| buffer: 8 2024-02-27 07:18:55,944 orttraining.rank-0 [INFO] - rank-0 step 4 memory (MiB) \| phase: post_forward \| allocated: 8162 \| max allocated: 9039 \| cached: 9382 \| max cached: 9382 \| inactive: 400 \| max inactive: 831 \| param: 5314 \| grad: 0 \| buffer: 8 2024-02-27 07:18:55,979 orttraining.rank-0 [INFO] - rank-0 step 4 memory (MiB) \| phase: pre_backward \| allocated: 8926 \| max allocated: 9039 \| cached: 9382 \| max cached: 9382 \| inactive: 400 \| max inactive: 831 \| param: 5314 \| grad: 0 \| buffer: 8 2024-02-27 07:18:56,060 orttraining.rank-0 [INFO] - rank-0 step 4 memory (MiB) \| phase: post_backward \| allocated: 6098 \| max allocated: 9039 \| cached: 9382 \| max cached: 9382 \| inactive: 218 \| max inactive: 831 \| param: 5314 \| grad: 12 \| buffer: 8 0%\|▍ \| 5/3200 [01:28<6:53:04, 7.76s/it]2024-02-27 07:18:56,115 orttraining.rank-0 [INFO] - rank-0 step 5 memory (MiB) \| phase: pre_forward \| allocated: 5331 \| max allocated: 9039 \| cached: 9382 \| max cached: 9382 \| inactive: 219 \| max inactive: 831 \| param: 5314 \| grad: 0 \| buffer: 8 2024-02-27 07:18:56,154 orttraining.rank-0 [INFO] - rank-0 step 5 memory (MiB) \| phase: post_forward \| allocated: 8162 \| max allocated: 9039 \| cached: 9382 \| max cached: 9382 \| inactive: 400 \| max inactive: 831 \| param: 5314 \| grad: 0 \| buffer: 8 2024-02-27 07:18:56,190 orttraining.rank-0 [INFO] - rank-0 step 5 memory (MiB) \| phase: pre_backward \| allocated: 8926 \| max allocated: 9039 \| cached: 9382 \| max cached: 9382 \| inactive: 400 \| max inactive: 831 \| param: 5314 \| grad: 0 \| buffer: 8 2024-02-27 07:18:56,270 orttraining.rank-0 [INFO] - rank-0 step 5 memory (MiB) \| phase: 
post_backward \| allocated: 6098 \| max allocated: 9039 \| cached: 9382 \| max cached: 9382 \| inactive: 218 \| max inactive: 831 \| param: 5314 \| grad: 12 \| buffer: 8 0%\|▍ \| 6/3200 [01:28<4:36:19, 5.19s/it]2024-02-27 07:18:56,323 orttraining.rank-0 [INFO] - rank-0 step 6 memory (MiB) \| phase: pre_forward \| allocated: 5331 \| max allocated: 9039 \| cached: 9382 \| max cached: 9382 \| inactive: 219 \| max inactive: 831 \| param: 5314 \| grad: 0 \| buffer: 8 2024-02-27 07:18:56,365 orttraining.rank-0 [INFO] - rank-0 step 6 memory (MiB) \| phase: post_forward \| allocated: 8162 \| max allocated: 9039 \| cached: 9382 \| max cached: 9382 \| inactive: 400 \| max inactive: 831 \| param: 5314 \| grad: 0 \| buffer: 8 2024-02-27 07:18:56,398 orttraining.rank-0 [INFO] - rank-0 step 6 memory (MiB) \| phase: pre_backward \| allocated: 8926 \| max allocated: 9039 \| cached: 9382 \| max cached: 9382 \| inactive: 400 \| max inactive: 831 \| param: 5314 \| grad: 0 \| buffer: 8 2024-02-27 07:18:56,478 orttraining.rank-0 [INFO] - rank-0 step 6 memory (MiB) \| phase: post_backward \| allocated: 6098 \| max allocated: 9039 \| cached: 9382 \| max cached: 9382 \| inactive: 218 \| max inactive: 831 \| param: 5314 \| grad: 12 \| buffer: 8 0%\|▌ \| 7/3200 [01:28<3:09:33, 3.56s/it]2024-02-27 07:18:56,533 orttraining.rank-0 [INFO] - rank-0 step 7 memory (MiB) \| phase: pre_forward \| allocated: 5331 \| max allocated: 9039 \| cached: 9382 \| max cached: 9382 \| inactive: 219 \| max inactive: 831 \| param: 5314 \| grad: 0 \| buffer: 8 2024-02-27 07:18:56,572 orttraining.rank-0 [INFO] - rank-0 step 7 memory (MiB) \| phase: post_forward \| allocated: 8162 \| max allocated: 9039 \| cached: 9382 \| max cached: 9382 \| inactive: 400 \| max inactive: 831 \| param: 5314 \| grad: 0 \| buffer: 8 2024-02-27 07:18:56,608 orttraining.rank-0 [INFO] - rank-0 step 7 memory (MiB) \| phase: pre_backward \| allocated: 8926 \| max allocated: 9039 \| cached: 9382 \| max cached: 9382 \| inactive: 400 \| max 
inactive: 831 \| param: 5314 \| grad: 0 \| buffer: 8 2024-02-27 07:18:56,727 orttraining.rank-0 [INFO] - rank-0 step 7 memory (MiB) \| phase: post_backward \| allocated: 6098 \| max allocated: 9039 \| cached: 9382 \| max cached: 9382 \| inactive: 218 \| max inactive: 831 \| param: 5314 \| grad: 12 \| buffer: 8 0%\|▌ \| 8/3200 [01:28<2:13:48, 2.52s/it]2024-02-27 07:18:56,806 orttraining.rank-0 [INFO] - rank-0 step 8 memory (MiB) \| phase: pre_forward \| allocated: 5331 \| max allocated: 9039 \| cached: 9382 \| max cached: 9382 \| inactive: 219 \| max inactive: 831 \| param: 5314 \| grad: 0 \| buffer: 8 2024-02-27 07:18:56,846 orttraining.rank-0 [INFO] - rank-0 step 8 memory (MiB) \| phase: post_forward \| allocated: 8162 \| max allocated: 9039 \| cached: 9382 \| max cached: 9382 \| inactive: 400 \| max inactive: 831 \| param: 5314 \| grad: 0 \| buffer: 8 2024-02-27 07:18:56,882 orttraining.rank-0 [INFO] - rank-0 step 8 memory (MiB) \| phase: pre_backward \| allocated: 8926 \| max allocated: 9039 \| cached: 9382 \| max cached: 9382 \| inactive: 400 \| max inactive: 831 \| param: 5314 \| grad: 0 \| buffer: 8 2024-02-27 07:18:56,962 orttraining.rank-0 [INFO] - rank-0 step 8 memory (MiB) \| phase: post_backward \| allocated: 6098 \| max allocated: 9039 \| cached: 9382 \| max cached: 9382 \| inactive: 218 \| max inactive: 831 \| param: 5314 \| grad: 12 \| buffer: 8 0%\|▋ \| 9/3200 [01:29<1:36:03, 1.81s/it]2024-02-27 07:18:57,053 orttraining.rank-0 [INFO] - rank-0 step 9 memory (MiB) \| phase: pre_forward \| allocated: 5331 \| max allocated: 9039 \| cached: 9382 \| max cached: 9382 \| inactive: 219 \| max inactive: 831 \| param: 5314 \| grad: 0 \| buffer: 8 2024-02-27 07:18:57,094 orttraining.rank-0 [INFO] - rank-0 step 9 memory (MiB) \| phase: post_forward \| allocated: 8162 \| max allocated: 9039 \| cached: 9382 \| max cached: 9382 \| inactive: 400 \| max inactive: 831 \| param: 5314 \| grad: 0 \| buffer: 8 ```	2024-02-28 15:57:05 +08:00
jingyanwangms	3bdb10d5ca	Move import to when needed to avoid circular dependency error (#19579 ) ### Description Move import to when needed to avoid circular dependency error ### Motivation and Context Fixes dependency error described here: https://github.com/microsoft/DeepSpeed/issues/5140 --------- Co-authored-by: Thiago Crepaldi <thiago.crepaldi@microsoft.com>	2024-02-22 10:56:25 -08:00
Vincent Wang	3d88487c96	Minor Triton Fix (#19589 ) Includes removing an unnecessary assert and adding support for passing string attributes from ONNX node attributes to Python function kwargs (mainly for passing debug info from graph to Python for now).	2024-02-22 10:35:26 +08:00


1470 commits