* This has a couple of new features, but mostly has a lot of bugfixes for the prior releases
* This is the last Hopper-focused release of CUTLASS before blackwell drops, so let's upgrade to it.
* Most of the remaining diff noise is copyright year updates on the CUTLASS submodule
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145172
Approved by: https://github.com/eqy, https://github.com/henrylhtsang
If the command is too long, the linter fails with
```
Failed due to OSError:
[Errno 7] Argument list too long: 'grep'
```
Fix this by batching the command so it is shorter
Limit of 750k was chosen due to `getconf ARG_MAX` returns ~1M on my mac. My guess is that most people shouldn't hit this unless they run --all-files and the directory length is long.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145950
Approved by: https://github.com/wdvr
This is an attempt to fix flaky mypy errors in CI that look like:
```
dmypy status --verbose
connection_name : /var/folders/rf/qrn1jkgj0b9_tcznwp8ck46w0000gn/T/tmpjoqsid7_/dmypy.sock
pid : 32233
error : timed out
Daemon is stuck; consider /Users/zainr/pytorch/venv/bin/dmypy kill
```
"Fix" it by not using the daemon at all, since it doesn't actually provide any perf benefits in CI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145961
Approved by: https://github.com/malfet
### Summary
`RECORD_FUNCTION` wasn't present in codegened Inductor-CPU Flex Attention C++ kernels, so flex attention kernels weren't present in the PyTorch profiler profiling data.
Fixes#145825 by adding `RECORD_FUNCTION` calls in the codegened flex-attention kernels.
### Caveat
#### _Before_
No corresponding results in PyTorch profiler profiling data
#### _After_
| Inductor config settings | What kernel name looks like in profiling data | Comments|
|-------------------|------------------------------------|--------------------|
| Env variable `TORCHINDUCTOR_CPP_WRAPPER=1` OR `inductor.config.cpp_wrapper=1` in python code | `graph_x_cpp_fused_y` | No way to tell from the profiling results if the kernel is a GEMM kernel or an attention kernel |
| `inductor.config.cpp.descriptive_names = "inductor_node"` but not CPP wrapper | `graph_x_kernel` | No way to tell from the profiling results if the kernel is a GEMM kernel or an attention kernel |
| Both `inductor_config.cpp.descriptive_names = "inductor_node"` & Inductor CPP Wrapper | `graph_x_cpp_fused_flex_attention_y`| Easy to interpret data |
| Neither of the two configs | `graph_x_kernel`| No way to tell from the profiling results if the kernel is a GEMM kernel or an attention kernel |
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145894
Approved by: https://github.com/jansel, https://github.com/leslie-fang-intel
Adds feature for #98925
Tests pass for both existing reflectionpad2d and the new one I inserted.
**Summary of the work:**
Simple conditional check for deterministic mode that will dispatch to a different kernel. This kernel does not use any atomic operations, and will lead to deterministic results as instead of going from the output to input(1:1) relationship, I am doing the opposite. I am going from input -> all outputs, which is 1 to many. These operations are done in the same order every execution as I simply traverse the data set with a grid stride loop and use simple linearized indexing into the input tensor.
So each thread will compute the 4 conditionals, which are then used to see if the input has an output in the 8 regions. These 8 regions are top left, top, top right, left, right, bottom left, bottom, bottom right`.
I did not focus on performance for this PR as that would expand the scope heavily. If there are any performance questions though i can answer.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136241
Approved by: https://github.com/eqy, https://github.com/albanD
Instead of bumping symint counters when we process unbacked bindings during deserialization, it's better to bump them at the beginning based on what the symbols in the original shape env before serialization were. This allows symbols in unbacked bindings to have "gaps" that bumping alone would not be able to match.
Why is bumping counters important at all? It is because when the shape env coming out of deserialization is used later for propagating symints, say in run_decompositions, we don't want new names to clash with existing names (bad things happen).
Differential Revision: [D68798191](https://our.internmc.facebook.com/intern/diff/D68798191/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145882
Approved by: https://github.com/pianpwk
This PR introduces two new methods to the LazyGraphExecutor class:
- ClearComputationCache(): Allows clearing the entire computation cache.
- RemoveFromComputationCache(hash): Enables removal of specific cache entries based on their hash.
The main objective is to expose cache management functionality for debugging cache hits and misses across different computations. For instance:
- Reset the cache state in tests, allowing reuse of the same computation client to evaluate cache logic consistently.
- Selectively remove cache entries to analyze the impact on subsequent computations.
- Improve observability into the cache behavior, aiding in the investigation of cache-related issues or optimizations.
On the XLA lazy graph executor, we want to run a series of tests that modify some parts of the HLO module proto of the computation, and we need a means to ensure that the hash is agnostic to some elements (OpMetadata in the XLA proto data). Hence, it would be easy to parameterize the test, clear the cache and validate that the resulting hash is the same between runs. Otherwise, we'd need to hardcode the resulting serialized hash.
Simultaneously, **another motivation**, is that users could also clear some computation hashes for an added flexibility in their applications, by introducing their own custom strategies for maintaining the cache (without relying on the default LRU).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144489
Approved by: https://github.com/wconstab
Fixes https://github.com/pytorch/pytorch/issues/143876
Open to other suggestions - we have an invariant that all nodes in our ATen graphs should have a `meta['val']` field, but I don't think this is actually true in all cases, so I just hardcoded the invariant to ignore `_assert_scalar()` (which is a "special" op used in dynamic shapes for runtime asserts, and doesn't have a meta['val'] field)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143877
Approved by: https://github.com/zou3519
# Motivation
for https://github.com/pytorch/pytorch/issues/143914
On Windows, there are two separate SYCL platforms for iGPU and dGPU. To simplify the logic, we will exclude iGPUs when a dGPU is present. This ensures that all XPU devices enumerated by PyTorch share the same SYCL context.
Now I generalize the logic as below:
1. We find the first L0 platform containing at least one dGPU and enumerate all dGPUs of that platform.
2. If no dGPU is found, we find the first L0 platform containing iGPU and enumerate all iGPUs of that platform.
3. No GPU is found (neither iGPU nor dGPU).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144378
Approved by: https://github.com/EikanWang, https://github.com/gujinghui
Prior to this PR, constexprs were appearing in signatures as `{.. "XBLOCK : tl.constexpr": "constexpr"}` when they really should appear as `{.. "XBLOCK": "constexpr"}`.
This PR represents the argument names as ArgName objects, which can optionally be marked as constexpr.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145583
Approved by: https://github.com/jansel
Example failing test:
`pytest -s test_torchinductor_opinfo.py -k test_comprehensive_special_polygamma_special_polygamma_n_0_cpu_float32` when using triton CPU.
Failure:
```shell
triton.compiler.errors.CompilationError: at 10:11:
def triton_poi_fused_polygamma_0(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
xnumel = 25
xoffset = tl.program_id(0) * XBLOCK
xindex = xoffset + tl.arange(0, XBLOCK)[:]
xmask = xindex < xnumel
x0 = xindex
tmp0 = tl.load(in_ptr0 + (x0), xmask)
tmp1 = 1.0
tl.static_assert(tmp1.dtype == tl.float32)
tmp2 = ops.polygamma(tmp1, tmp0)
^
NameError('ops is not defined')
```
This occurs because the registered triton fallbacks are not used during the lowering to inductor IR.
Marked the problematic code in the excerpt below from 6bc17b0725/torch/_inductor/lowering.py (L572)
```python
def make_pointwise(
fn,
override_return_dtype=None,
override_device=None,
override_fn_when_input_bool=None,
override_fn_when_gpu_float64=None,
allow_alpha=False,
triton_fallback=None,
):
def inner(*inputs: TensorBox, alpha=None):
if triton_fallback is not None and any(
isinstance(inp, IRNode) and is_triton(inp) for inp in inputs <--- is_triton should return True when using triton CPU
):
assert not allow_alpha # not implemented
return triton_fallback(*inputs)
inputs = promote_constants(inputs, override_return_dtype)
if allow_alpha:
if alpha is not None and alpha != 1:
inputs = list(inputs)
```
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144389
Approved by: https://github.com/jansel
This PR implements a small UI improvement over #133603.
It prepares a NCCL memory allocator in torch cpp and then pybind's it out, so that user can directly use it.
UI:
```
pool = torch.cuda.MemPool(backend.mem_allocator)
with torch.cuda.use_mem_pool(pool):
tensor = torch.arange(1024 * 1024 * 2, device=device)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145675
Approved by: https://github.com/syed-ahmed, https://github.com/wconstab
Summary:
Thanks Microve for discovering that recGPT has some repeated similar kernels that might be optimized through optimus. After investigation, I designed a pattern in the aten level to remove such excessive kernels.
trace: https://fburl.com/perfdoctor/82fauil7
tlparse: https://fburl.com/98q6tadx
Test Plan:
# unit test
```
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:split_cat_fx_aten_passes -- test_split_cat_post_grad
```
Buck UI: https://www.internalfb.com/buck2/e8458d63-b8ca-498b-a731-77a83fb4d1cb
Test UI: https://www.internalfb.com/intern/testinfra/testrun/16325548715106567
Network: Up: 341KiB Down: 359KiB (reSessionID-7d3de666-7fc1-4988-8d11-d75ba958016d)
Executing actions. Remaining 0/3
Command: test. Finished 2 local
Time elapsed: 3:04.8s
Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0
# local run
```
buck2 run @//mode/opt aps_models/ads/recgpt_exp:recgpt_launcher -- mode=local_recgpt_ranking_30x_v0_unified_seq_1115
```
https://www.internalfb.com/mlhub/pipeline/1630903954173593
# E2E
```
buck2 run @//mode/opt aps_models/ads/recgpt_exp:recgpt_launcher -- mode=mast_recgpt_ranking_30x_v0_unified_seq_1115 launcher.oncall=ads_model_platform launcher.data_project=ai_large_scale launcher.fbl_entitlement=ads_global_tc_training_efficiency launcher.tags=[ads_ranking_taxonomy_mc_qps_optimization] launcher.hardware=SMC_T20 launcher.job_name=recgpt_ranking_1115_pt2_with_optimus data_loader.dataset.table_ds=[2024-12-13,2024-12-14,2024-12-15,2024-12-16,2024-12-17,2024-12-18]
```
### how to add the config
Add the following patterns to the dynamo config
```
post_grad_fusion_options: {
"normalization_aten_pass": {},
"split_cat_aten_pass": {},
}
```
{F1974700331}
baseline:
aps-recgpt_ranking_1115_pt2_5-8cb4905c7d
{F1974700216}
proposal:
Differential Revision: D68695717
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145721
Approved by: https://github.com/Yuzhen11
Fixes#140092
Here's what this PR does:
Case 1: no `eps` is passed to python frontend:
Use `eps` associated with opmath_t instead of than `eps` associated with`scalar_t` for intermediate computation
Case 2: `eps` is passed to python frontend
Avoid downcasting `eps` to `scalar_t` and then upcasting it again implicitly in the `rqrst_input` computation
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142848
Approved by: https://github.com/albanD