[fx] move DCE rand check to import time (#145118)

Mitigates the deterministic benchmark regression reported in https://github.com/pytorch/pytorch/issues/144775#issuecomment-2593411844, and possibly the dashboard issue as well.

fx.Node.is_impure is unexpectedly a hot spot. It gets called for every node in the graph whenever we invoke DCE, which should be okay, except that we invoke DCE on the full graph ~10 times at various stages of torch.compile, and an insane number of times (more than O(parameters)) for the subgraphs traced by the pattern matcher.
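
For illustration, here is a minimal sketch (not code from this PR) of why is_impure sits on the DCE path: fx.Graph.eliminate_dead_code has to ask each unused node whether it is impure, and ops tagged nondeterministic_seeded (e.g. aten.rand_like) must count as impure because they advance global RNG state:

```python
import torch
import torch.fx as fx

g = fx.Graph()
x = g.placeholder("x")
# aten.rand_like is tagged nondeterministic_seeded in native_functions.yml.
# Its result is unused below, so only the impurity check keeps it alive.
g.call_function(torch.ops.aten.rand_like.default, (x,))
out = g.call_function(torch.ops.aten.add.Tensor, (x, 1))
g.output(out)

g.eliminate_dead_code()  # consults Node.is_impure() for nodes without users
print(g)  # rand_like survives DCE: dropping it would change RNG behavior
```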

I considered addressing this by reducing the number of times DCE is called, but I think we can only trim the calls from the pattern matcher, and that would require a refactor/caching solution that I leave out of this PR.

torch.Tag.nondeterministic_seeded is provided by native_functions.yml, and an op's tags are stored as a list. Most of the time the list has <= 2 elements, so it's not worth turning it into a set for fast lookup. Instead, this PR performs the membership check once, when the OpOverload is constructed at import time, and caches the result as a boolean for is_impure to read.
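
Concretely, the shape of the change (an illustrative sketch distilled from the diff below, not the verbatim PyTorch code):

```python
import torch

# Before: every Node.is_impure() call re-scanned the target's tag list.
def nondeterministic_before(target) -> bool:
    tags = getattr(target, "_tags", None)
    return tags is not None and torch.Tag.nondeterministic_seeded in tags

# After: the list scan runs once per OpOverload, when the op is constructed
# at import time; the hot path becomes a single cached-attribute lookup.
def nondeterministic_after(target) -> bool:
    return getattr(target, "_nondeterministic_seeded", False)
```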

Using the deterministic instruction count benchmarks:
```
# before
aotdispatcher_partitioner_cpu,compile_time_instruction_count,8914894946
aotdispatcher_partitioner_cpu,compile_time_instruction_count,8866669058
# after
aotdispatcher_partitioner_cpu,compile_time_instruction_count,8770562314
aotdispatcher_partitioner_cpu,compile_time_instruction_count,8779547794
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145118
Approved by: https://github.com/ezyang, https://github.com/zou3519
Simon Fan, 2025-01-21 14:03:04 -08:00, committed by PyTorch MergeBot
parent f2cfe8b59f
commit 27598cd154
3 changed files with 8 additions and 7 deletions

```diff
@@ -6,7 +6,7 @@ add_loop_eager_dynamic,compile_time_instruction_count,5703000000,0.025
-add_loop_inductor,compile_time_instruction_count,32440000000,0.015
+add_loop_inductor,compile_time_instruction_count,32120000000,0.015
@@ -14,7 +14,7 @@ add_loop_inductor_dynamic_gpu,compile_time_instruction_count,45210000000,0.025
-add_loop_inductor_gpu,compile_time_instruction_count,27740000000,0.015
+add_loop_inductor_gpu,compile_time_instruction_count,27360000000,0.015
@@ -22,11 +22,11 @@ basic_modules_ListOfLinears_eager,compile_time_instruction_count,928600000,0.015
-basic_modules_ListOfLinears_inductor,compile_time_instruction_count,21760000000,0.015
+basic_modules_ListOfLinears_inductor,compile_time_instruction_count,21310000000,0.015
-basic_modules_ListOfLinears_inductor_gpu_force_shape_pad,compile_time_instruction_count,17810000000,0.015
+basic_modules_ListOfLinears_inductor_gpu_force_shape_pad,compile_time_instruction_count,17600000000,0.015
@@ -54,7 +54,7 @@ aotdispatcher_inference_subclass_cpu,compile_time_instruction_count,5764000000,0
-aotdispatcher_partitioner_cpu,compile_time_instruction_count,9203000000,0.015
+aotdispatcher_partitioner_cpu,compile_time_instruction_count,9103000000,0.015
```

```diff
@@ -688,6 +688,8 @@ class OpOverload(OperatorBase):
         self._overloadname = (
             "default" if schema.overload_name == "" else schema.overload_name
         )
+        if tags:
+            self._nondeterministic_seeded = torch.Tag.nondeterministic_seeded in tags
         self._name = self._schema.name
         if schema.overload_name:
             self._name += "." + schema.overload_name
```

```diff
@@ -764,8 +764,7 @@ class Node(_NodeBase):
             # impure since it mutates inputs
             return True
-        tags: Optional[list[torch.Tag]] = getattr(self.target, "_tags", None)
-        if tags is not None and torch.Tag.nondeterministic_seeded in tags:
+        if getattr(self.target, "_nondeterministic_seeded", False):
             # impure since it mutates RNG state
             return True
```
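
Note that reading the cached flag via getattr with a False default preserves the old behavior for targets that never carried _tags (e.g. plain Python functions) and for ops with an empty tag list, since both cases previously evaluated to False as well.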