Check type for building gradient graph (#17046)
### Check type for building gradient graph
**Bug1**:
Fix the following error when running a model with ORTModule + ZeRO Stage 3:
```
Exception happens when running <bound method Function.apply of <class 'onnxruntime.training.utils.hooks._zero_offload_subscriber.ORTZeROOffloadPreForwardFunction'>>
Traceback (most recent call last):
File "/bert_ort/pengwa/py38/lib/python3.8/site-packages/onnxruntime/training/ortmodule/_custom_autograd_function_runner.py", line 207, in call_python_forward_function
wrapped_arg.requires_grad = is_training_mode and grad_flag
RuntimeError: only Tensors of floating point and complex dtype can require gradients
```
This happens because, when running PythonOpA, the 3rd input is int64; the check in the gradient builder decides the input requires a gradient, so we set its requires_grad = True, but PyTorch rejects that for an integer tensor and throws the exception above. So we need to understand why the ORT gradient builder thinks the 3rd input needs a gradient.
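This restriction comes from PyTorch itself, independent of ORT: `requires_grad` can only be set on floating-point or complex tensors. A minimal standalone repro of the error:
```python
import torch

# PyTorch rejects requires_grad on integer tensors; this is the same
# RuntimeError that surfaces through ORTModule above.
x = torch.zeros(4, dtype=torch.int64)
try:
    x.requires_grad = True
except RuntimeError as e:
    print(e)  # only Tensors of floating point and complex dtype can require gradients
```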
`ReverseBFSWithStopGradient` does a reverse BFS from the graph outputs and collects all nodes that are needed to compute those outputs. It maintains a queue, initially holding all nodes that produce graph outputs, then processes the nodes one by one. For each input of the current node, if the input does not hit a stop-gradient edge and its node arg type is an allowed type (float, etc.), the node producing that input is appended to the queue for the next iteration.
PythonOpA is such a node needed to compute the graph outputs, so
IsReachable(PythonOpA) returns True.
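To make the traversal concrete, here is a simplified Python sketch of the logic described above; the graph representation and the helpers `inputs_of`, `elem_type`, and `producer_of` are illustrative stand-ins, not the actual C++ implementation:
```python
from collections import deque

# Element types for which a gradient may be built (mirrors GRAD_ALLOWED_TYPES).
GRAD_ALLOWED_TYPES = {"float", "float16", "double", "bfloat16"}

def reverse_bfs_with_stop_gradient(graph, output_producing_nodes, stop_gradient_edges):
    """Collect every node needed to compute the graph outputs, skipping
    stop-gradient edges and inputs whose element type cannot carry a gradient."""
    visited = set(output_producing_nodes)
    queue = deque(output_producing_nodes)
    while queue:
        node = queue.popleft()
        for input_index, node_arg in enumerate(graph.inputs_of(node)):
            if (node, input_index) in stop_gradient_edges:
                continue  # hit a stop-gradient edge
            if graph.elem_type(node_arg) not in GRAD_ALLOWED_TYPES:
                continue  # e.g. int64: no gradient edge should be built
            producer = graph.producer_of(node_arg)
            if producer is not None and producer not in visited:
                visited.add(producer)
                queue.append(producer)
    return visited
```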

In the edge traversal of `GradientGraphBuilder::Build`, when node is PythonOpB and next_node is PythonOpA, we did not check the node arg type on the connection from PythonOpB's output to PythonOpA's 3rd input. So the int64-typed node arg was appended to the set of node args that require gradient.
**Fix1**: add the node arg type check before appending a node arg to the require-grad lists.
After that fix, a unit test failed:
"orttraining_test_ortmodule_api.py::test_gradient_correctness_minmax[data_type0-True-0-min]
Fatal Python error: Segmentation fault". After investigation, this turned out
to be another bug.
**Bug2**:
Without Fix1, the execution graph has a gradient edge built for the
int64-typed node arg, even though that gradient is not used by any consumer,
and execution happens to run fine. On reflection, though, an int-typed node
arg should not have a gradient edge built at all.
With Fix1, the int-typed node arg no longer has a gradient edge built; that
is exactly the problem **Fix1** addresses.
But another bug shows up when the initial "y_node_arg_names" contain mixed
types, e.g. in this case the ATen op's two outputs: the 1st is float, the 2nd
is int. When we check the y_node
(6e6f582e08/orttraining/orttraining/core/framework/gradient_graph_builder.cc (L60C16-L60C16)),
we did not check the data type before adding it into `y_node_args_`, the
list of graph output node args that require gradient. As a result,
`non_differentiable_y_node_arg_names_` did not contain the int-typed graph
output.
Then
6e6f582e08/orttraining/orttraining/core/framework/ortmodule_graph_builder.cc (L312C18-L312C18)
tries to fetch the grad node arg into `yield_output_node_args`, BUT with
**Fix1** that grad node arg is never built for the int-typed node arg. So we
insert a nullptr, and later, when we use it, we get a segmentation fault.
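For concreteness, the failing minmax test exercises exactly this mixed-dtype output pattern: reductions like `torch.min` with a `dim` argument return a floating-point `values` tensor alongside an int64 `indices` tensor:
```python
import torch

x = torch.randn(3, 4, requires_grad=True)
# torch.min with dim returns (values, indices); only values is differentiable.
values, indices = torch.min(x, dim=0)
print(values.dtype)   # torch.float32 -> gets a gradient node arg
print(indices.dtype)  # torch.int64   -> no gradient node arg is built with Fix1
```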
**Fix2**:
Again, we add the type check when handling y_node_args, and also add a null
check when fetching the gradient node arg and appending it to
yield_output_node_args.
This commit is contained in:
parent
31cbd63af7
commit
0471f6fbb3
3 changed files with 71 additions and 20 deletions
**orttraining/orttraining/core/framework/gradient_graph_builder.cc**
```diff
@@ -18,6 +18,44 @@ using namespace ONNX_NAMESPACE;
 namespace onnxruntime {
 namespace training {
 
+namespace {
+
+const std::unordered_set<int64_t> GRAD_ALLOWED_TYPES{
+    ONNX_NAMESPACE::TensorProto_DataType_FLOAT,
+    ONNX_NAMESPACE::TensorProto_DataType_FLOAT16,
+    ONNX_NAMESPACE::TensorProto_DataType_DOUBLE,
+    ONNX_NAMESPACE::TensorProto_DataType_BFLOAT16,
+};
+
+std::tuple<bool, bool, int> IsAllowedForGradient(const Graph* graph, const NodeArg* node_arg) {
+  // This is a temporary workaround for Send nodes, who output a bool. If we don't allow building gradient for
+  // it, the PipelineTransformer tests will fail. We should revisit the PipelineTransformer and fix it in a
+  // better way. (While it might not be a priority for us now.)
+  bool skip_elem_type_check = false;
+  const Node* node = graph->GetProducerNode(node_arg->Name());
+  if (node && (node->OpType() == "Send")) {
+    skip_elem_type_check = true;
+  }
+
+  bool is_tensor_type = false;
+  bool is_allowed_type_for_grad = false;
+  const auto* type_proto = node_arg->TypeAsProto();
+  int32_t type = -1;
+  if (nullptr != type_proto && type_proto->value_case() == ONNX_NAMESPACE::TypeProto::kTensorType) {
+    is_tensor_type = true;
+    type = type_proto->tensor_type().elem_type();
+
+    if (skip_elem_type_check) {
+      is_allowed_type_for_grad = true;
+    } else {
+      is_allowed_type_for_grad = GRAD_ALLOWED_TYPES.find(type) != GRAD_ALLOWED_TYPES.end();
+    }
+  }
+
+  return {is_tensor_type, is_allowed_type_for_grad, type};
+}
+}  // namespace
+
 using namespace common;
 
 GradientGraphBuilder::GradientGraphBuilder(Graph* graph,
@@ -57,10 +95,13 @@ GradientGraphBuilder::GradientGraphBuilder(Graph* graph,
 
     const Node* node = graph_->GetProducerNode(name);
     if (node) {
-      if (forward_reachable_nodes.find(node) == forward_reachable_nodes.end()) {
+      const auto rets = IsAllowedForGradient(graph_, node_arg);
+      bool is_allowed_type_for_grad = std::get<1>(rets);
+      if (forward_reachable_nodes.find(node) == forward_reachable_nodes.end() || !is_allowed_type_for_grad) {
        non_differentiable_y_node_arg_names_.insert(name);
        LOGS(logger_, INFO) << "The model weights and inputs are non-differentiable from " << name << ". "
-                            << "ORT will assume no gradient will be provided for " << name << ".";
+                            << "ORT will assume no gradient will be provided for " << name
+                            << ", is_allowed_type_for_grad: " << is_allowed_type_for_grad;
      } else {
        y_node_args_.insert(node_arg);
        y_nodes_.insert(node);
@@ -169,17 +210,16 @@ NodeSet GradientGraphBuilder::ReverseBFSWithStopGradient(const NodeSet& nodes) const
        continue;
      }
      const NodeArg* node_arg = n->InputDefs()[edge_it->GetDstArgIndex()];
-      const auto* type_proto = node_arg->TypeAsProto();
-      if (nullptr != type_proto && type_proto->value_case() == ONNX_NAMESPACE::TypeProto::kTensorType) {
-        const int32_t type = type_proto->tensor_type().elem_type();
-        if (GRAD_ALLOWED_TYPES.find(type) == GRAD_ALLOWED_TYPES.end()) {
+      const auto [is_tensor_type, is_allowed_type_for_grad, type] = IsAllowedForGradient(graph_, node_arg);
+      if (is_tensor_type) {
+        if (!is_allowed_type_for_grad) {
          LOGS(logger_, VERBOSE) << "Skip building gradient for input_" << edge_it->GetDstArgIndex()
-                                 << " of node: " << n->Name() << "because element type is: " << type;
+                                 << " of node: " << n->Name() << " because element type is: " << type;
          continue;
        }
      } else {
        LOGS(logger_, VERBOSE) << "Skip building gradient for input_" << edge_it->GetDstArgIndex()
-                               << " of node: " << n->Name() << "because it is not a Tensor type";
+                               << " of node: " << n->Name() << " because it is not a Tensor type";
        continue;
      }
 
@@ -286,6 +326,21 @@ Status GradientGraphBuilder::Build(const std::unordered_set<std::string>* p_init
      }
 
      const NodeArg* node_arg = node->OutputDefs()[edge_it->GetSrcArgIndex()];
+
+      // Make sure node_arg as input of next_node, has the data type allowed to compute gradient.
+      const auto [is_tensor_type, is_allowed_type_for_grad, type] = IsAllowedForGradient(graph_, node_arg);
+      if (is_tensor_type) {
+        if (!is_allowed_type_for_grad) {
+          LOGS(logger_, VERBOSE) << "Skip building gradient for input_" << edge_it->GetDstArgIndex()
+                                 << " of node: " << next_node.Name() << " because element type is: " << type;
+          continue;
+        }
+      } else {
+        LOGS(logger_, VERBOSE) << "Skip building gradient for input_" << edge_it->GetDstArgIndex()
+                               << " of node: " << next_node.Name() << " because it is not a Tensor type";
+        continue;
+      }
+
      std::string grad_node_arg_name = GradientBuilderBase::GradientName(node_arg->Name());
 
      pending_[grad_node_arg_name] += 1;
```
**orttraining/orttraining/core/framework/gradient_graph_builder.h**
```diff
@@ -150,19 +150,13 @@ class GradientGraphBuilder {
  // The 1st and 3rd inputs are not differentiable.
  std::unordered_map<std::string, std::vector<int64_t>> python_op_input_require_grad_info_;
 
-  const std::unordered_set<int64_t> GRAD_ALLOWED_TYPES{
-      ONNX_NAMESPACE::TensorProto_DataType_FLOAT,
-      ONNX_NAMESPACE::TensorProto_DataType_FLOAT16,
-      ONNX_NAMESPACE::TensorProto_DataType_DOUBLE,
-      ONNX_NAMESPACE::TensorProto_DataType_BFLOAT16,
-  };
  const std::unordered_set<size_t>* GetStopGradientEdges(const Node& node) const;
 
  /**
  Performs a BFS on the graph with STOP_GRADIENT_EDGES constrain
  It will skip traversing over the edges defined in STOP_GRADIENT_EDGES map.
  The resulting node set contains all the nodes that are differentiable wrt the x_node_args
-  @param Starting nodes arg name for BFS
+  @param x_node_arg_names Starting nodes arg name for BFS
  @returns All the nodes visited during BFS
  */
  NodeSet BFSWithStopGradient(const std::unordered_set<std::string>& x_node_arg_names) const;
@@ -171,14 +165,13 @@ class GradientGraphBuilder {
  Performs a ReverseBFS on the graph with STOP_GRADIENT_EDGES constrain
  It will skip traversing over the edges defined in STOP_GRADIENT_EDGES map.
  The resulting node set contains all the nodes that are differentiable wrt the input nodes
-  @param Starting nodes for ReverseBFS
+  @param nodes Starting nodes for ReverseBFS
  @returns All the nodes visited during ReverseBFS
  */
  NodeSet ReverseBFSWithStopGradient(const NodeSet& nodes) const;
 
  /**
-  Check if 'x_node_args_' are reachable from 'y_node_args_' for computing the partial derivative
-  @param reachable_nodes All the nodes reachable from the 'y_node_args_'
+  Check if 'x_node_args_' are reachable from 'y_node_args_' for computing the partial derivative.
  @returns OK if all 'x_node_args_' are reachable, else an ONNXRUNTIME INVALID_ARGUMENT status
  */
  Status CheckNodeArgsReachable() const;
```
**orttraining/orttraining/core/framework/ortmodule_graph_builder.cc**
```diff
@@ -309,7 +309,9 @@ void OrtModuleGraphBuilder::HandleOutputsAndGrads() {
 
    if (std::find(non_differentiable_indices.begin(), non_differentiable_indices.end(), i) ==
        non_differentiable_indices.end()) {
-      yield_output_node_args.emplace_back(gradient_graph.GetNodeArg(grad_name));
+      NodeArg* grad_node_arg = gradient_graph.GetNodeArg(grad_name);
+      ORT_ENFORCE(grad_node_arg != nullptr, "Differentiable param grad node arg should exist.");
+      yield_output_node_args.emplace_back(grad_node_arg);
      graph_info_.module_output_gradient_name.emplace_back(grad_name);
    }
  }
@@ -335,9 +337,10 @@ void OrtModuleGraphBuilder::HandleOutputsAndGrads() {
 
  attributes.insert({full_shape_outputs_name, full_shape_outputs});
 
-  // Handle potential duplciated output_gradient names
+  // Handle potential duplicated output_gradient names
  std::unordered_map<std::string, std::vector<size_t>> name_to_idx;
  for (size_t i = 0; i < yield_output_node_args.size(); ++i) {
+    ORT_ENFORCE(yield_output_node_args[i] != nullptr);
    const std::string& name = yield_output_node_args[i]->Name();
    auto it = name_to_idx.find(name);
    if (it == name_to_idx.end()) {
```