Commit graph

76779 commits

Author SHA1 Message Date
Matthew Hoffman
258f47fc0b Add padding_side to pad_sequence with "left" and "right" options ("right" as default) (#131884)
Fixes #10536

Reattempt of #61467. Thank you so much to @mskoh52 for your excellent work!

As I was trying to create a more efficient LLM data collator, I realized that `pad_sequence` only supports right padding, even though left padding is a very common format for LLMs, like Llama and Mistral.

The proposed alternative implementation was to use multiple flips, which tends to be 1.5x-2x slower. Instead we can add a [`padding_side` parameter as there is for for Hugging Face tokenizers](9d6c0641c4/src/transformers/tokenization_utils_base.py (L1565)), which requires only a very small change in the C++ code.

Here are the benchmarks of the new implementation!

`float32`:

![eaaa95ef-9384-45d2-be56-6898bc1d3514](https://github.com/user-attachments/assets/3b0eb309-e5a0-4a4d-97bb-4e3298783dbb)

`bool`:

![892f32da-8d9a-492b-9507-18d3f0a41e8e](https://github.com/user-attachments/assets/6824ea15-7d4e-4b89-95f0-8546635f0c2e)

Code:

```python
from __future__ import annotations

import random
import time
from typing import Literal

import numpy as np
import torch

def pad_sequence_with_flips(
    sequences: list[torch.Tensor],
    batch_first: bool = False,
    padding_value: int | float | bool = 0.0,
    padding_side: Literal["left", "right"] | str = "left",
) -> torch.Tensor:
    if padding_side == 'right':
        padded_sequence = torch._C._nn.pad_sequence([t.flatten() for t in sequences], batch_first=batch_first, padding_value=padding_value)
    elif padding_side=='left':
        padded_sequence = torch._C._nn.pad_sequence([t.flatten().flip(0) for t in sequences], batch_first=batch_first, padding_value=padding_value)  # pyright: ignore[reportArgumentType]
        padded_sequence = padded_sequence.flip(int(batch_first))
    else:
        raise ValueError(f"padding_side should be either 'right' or 'left', but got {padding_side}")

    return padded_sequence

sequence_lengths: list[int] = []

flip_left_pad_times: list[float] = []
flip_left_pad_times_std: list[float] = []

left_pad_times: list[float] = []
left_pad_times_std: list[float] = []

RUNS_PER_LOOP: int = 100

for i in range(1, 7):
    sequence_length = i * int(1e6) // 6
    sequence_lengths.append(sequence_length)

    sequences = [torch.randint(0, 2, (random.randint(1, sequence_length),), dtype=torch.bool) for _ in range(64)]

    inner_left_pad_times: list[float] = []
    inner_right_pad_times: list[float] = []

    inner_flip_left_pad_times: list[float] = []
    inner_flip_right_pad_times: list[float] = []

    for _ in range(RUNS_PER_LOOP):

        start = time.perf_counter()
        torch._C._nn.pad_sequence(sequences, batch_first=True, padding_value=False, padding_side="left")
        end = time.perf_counter()
        inner_left_pad_times.append(end - start)

        start = time.perf_counter()
        pad_sequence_with_flips(sequences, batch_first=True, padding_value=False, padding_side="left")
        end = time.perf_counter()
        inner_flip_left_pad_times.append(end - start)

    left_pad_times.append(sum(inner_left_pad_times) / len(inner_left_pad_times))
    left_pad_times_std.append(np.std(inner_left_pad_times))

    flip_left_pad_times.append(sum(inner_flip_left_pad_times) / len(inner_flip_left_pad_times))
    flip_left_pad_times_std.append(np.std(inner_flip_left_pad_times))

    print(f"Sequence Length: {sequence_length}, Left Pad Time: {left_pad_times[-1]}, Left with Flips Pad Time: {flip_left_pad_times[-1]}")

import matplotlib.pyplot as plt

plt.plot(sequence_lengths, left_pad_times, label="new pad_sequence left")
plt.scatter(sequence_lengths, left_pad_times)
plt.errorbar(sequence_lengths, left_pad_times, yerr=left_pad_times_std, linestyle='None', marker='^')

plt.plot(sequence_lengths, flip_left_pad_times, label="old pad_sequence left (2 flips)")
plt.scatter(sequence_lengths, flip_left_pad_times)
plt.errorbar(sequence_lengths, flip_left_pad_times, yerr=flip_left_pad_times_std, linestyle='None', marker='^')

plt.xlabel("Sequence Length")
plt.ylabel("Time (s)")
plt.legend(loc="upper right")

# Sequence Length: 166666, Left Pad Time: 0.06147645162009212, Left with Flips Pad Time: 0.09842291727001794
# Sequence Length: 333333, Left Pad Time: 0.08933195920990329, Left with Flips Pad Time: 0.15597836187991562
# Sequence Length: 500000, Left Pad Time: 0.08863158334006585, Left with Flips Pad Time: 0.15224887342999863
# Sequence Length: 666666, Left Pad Time: 0.10524682551997103, Left with Flips Pad Time: 0.18177212480995877
# Sequence Length: 833333, Left Pad Time: 0.11801802741003485, Left with Flips Pad Time: 0.20821274195001024
# Sequence Length: 1000000, Left Pad Time: 0.131894061660023, Left with Flips Pad Time: 0.23223503091008751
```

Co-authored-by: mskoh52 <mskoh52@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131884
Approved by: https://github.com/ezyang
2024-08-07 15:53:07 +00:00
PyTorch MergeBot
780310fed7 Revert "Only thunkify proxies in some situations (#132421)"
This reverts commit bb99008c9e.

Reverted https://github.com/pytorch/pytorch/pull/132421 on behalf of https://github.com/clee2000 due to I think this broke dynamo/test_subclasses.py::TestNestedTensor::test_in_graph_construction_from_input [GH job link](https://github.com/pytorch/pytorch/actions/runs/10283744685/job/28459340678) [HUD commit link](bb99008c9e).  Test got added in f50621989b which is before your merge base ([comment](https://github.com/pytorch/pytorch/pull/132421#issuecomment-2273742960))
2024-08-07 15:29:54 +00:00
PyTorch MergeBot
de9b8a42c1 Revert "Add support for other backends in get_preferred_device (#132118)"
This reverts commit c184ac0f6b.

Reverted https://github.com/pytorch/pytorch/pull/132118 on behalf of https://github.com/clee2000 due to I think this broke distributed/checkpoint/test_file_system_checkpoint_cpu.py::TestDistributedReshardOnLoad::test_load_rowwise_to_colwise_thread_count_1 [GH job link](https://github.com/pytorch/pytorch/actions/runs/10279901233/job/28456599072) [HUD commit link](c184ac0f6b).  Dr CI classification is wrong, the failure is not flaky ([comment](https://github.com/pytorch/pytorch/pull/132118#issuecomment-2273729288))
2024-08-07 15:22:42 +00:00
cyy
13fa59580e Enable clang-tidy on aten/src/ATen/cpu (#132830)
Expands code coverage of clang-tidy to aten/src/ATen/cpu

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132830
Approved by: https://github.com/Skylion007
2024-08-07 14:44:17 +00:00
Antoni Viros
ed97fb77f9 Conversions between strided and jagged layouts for Nested Tensors (#115749)
This PR does 3 things:
1. Adds a copy-free strided->jagged layout conversion for NT
2. Adds a copy-free jagged->strided layout conversion for NT
3. Modifies and expands the .to() API to support the layout argument for the specific case of NT layout conversion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115749
Approved by: https://github.com/jbschlosser
2024-08-07 14:18:53 +00:00
Joel Schlosser
fb146fc3c6 Only store necessary tensor_dict fields in node meta (#132805)
Fixes #132290

This PR attempts a more invasive / complete solution than the one from #132338, which removes immediate tensor fields from the `tensor_dict` copy stored in node meta. The approach taken here is to store only those fields of the `tensor_dict` which are absolutely utilized somewhere else.

So far, this appears to be limited to:
* `_dynamo_static_input_type`
* `tag` (at least in the tests). Discussion at #94080 appears to indicate this is depended on for export

(CI may point out more)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132805
Approved by: https://github.com/mlazos
2024-08-07 13:35:16 +00:00
Edward Z. Yang
7c79e89bc5 Stop using clear_frame as decorator (#132778)
See https://github.com/pytorch/pytorch/pull/132073 for motivation

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132778
Approved by: https://github.com/albanD
ghstack dependencies: #132774
2024-08-07 11:53:18 +00:00
Edward Z. Yang
bb99008c9e Only thunkify proxies in some situations (#132421)
The goal of this PR is to avoid stack overflow when we create extremely long chains of thunks, and then evaluate them (e.g., as occurs if you sum(long list of symint)). The basic idea behind this PR is to only thunkify proxies if they're being created in places where they may or may not be used--crucially, symint operations that occur in user code we are tracing are eagerly placed into the graph, even if they may eventually be dead.

I annotated the PR with explanation of changes.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132421
Approved by: https://github.com/Skylion007, https://github.com/zou3519
ghstack dependencies: #132674, #132675
2024-08-07 11:51:17 +00:00
Danielmic
32f9a809c7 Replace [[unlikely]] with unlikely(x) (#130816)
Do not use `[[unlikely]]` as its c++20 language features, see https://en.cppreference.com/w/cpp/language/attributes/likely

Fixes https://github.com/pytorch/pytorch/issues/130815

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130816
Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/malfet
2024-08-07 10:38:13 +00:00
zengxian
8c8eb9670a [CI] Enable inductor UT test on avx512 (#132645)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132645
Approved by: https://github.com/desertfire
2024-08-07 10:22:40 +00:00
Syed Tousif Ahmed
37ab0f3385 Loads .pyd instead of .so in MemPool test for windows (#132749)
Fixes #132650

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132749
Approved by: https://github.com/albanD
2024-08-07 09:58:52 +00:00
xinyu-intel
8333ecf085 Support hasattr tracing for more PythonModuleVariable (#132731)
Fixes #132237

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132731
Approved by: https://github.com/EikanWang, https://github.com/yanboliang
2024-08-07 09:15:17 +00:00
Nicolas Macchioni
c8c964f950 [inductor] check best templates first for fusions (#132829)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132829
Approved by: https://github.com/eellison
2024-08-07 07:48:00 +00:00
Jeeja
c184ac0f6b Add support for other backends in get_preferred_device (#132118)
Currenlty get_preferred_device supports only cuda and cpu. Add support for other backends using backend config.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132118
Approved by: https://github.com/awgu
2024-08-07 07:19:20 +00:00
wz337
87053132ea [DeviceMesh] Remove parent mesh concept from _MeshEnv and replace by root mesh (#132339)
Previously, when we slice out a submesh from a mesh, we assign the mesh as the parent mesh of the submesh. In this case, when we have a 3D mesh topology, the parent mesh of a 1D mesh sliced out from the 3D mesh is different from the parent mesh of the same 1D mesh sliced out from the 2D submesh of the 3D mesh. For example:
```
mesh_3d = init_device_mesh("cuda", (2,2,2), ("dim0", "dim1", "dim2"))
mesh_dim0 = mesh_3d["dim0"]

mesh_2d = mesh_2d["dim0", "dim1"]
mesh_dim0_2 =  mesh_2d["dim0_2"]

# This would evaluate to be True
print(_mesh_resources.get_parent_mesh(mesh_dim0) != _mesh_resources.get_parent_mesh(mesh_dim0))
```

We can always reconstruct the mesh needed from the mesh dim names, as long as two dims come from the same root. For simplicity, we do not see the necessity of building a tree structure to represent child-parent relationship. Therefore, we are replacing the parent mesh concept with a root mesh concept in `_MeshEnv` so we would have:

```
mesh_3d = init_device_mesh("cuda", (2,2,2), ("dim0", "dim1", "dim2"))
mesh_dim0 = mesh_3d["dim0"]

mesh_2d = mesh_2d["dim0", "dim1"]
mesh_dim0_2 =  mesh_2d["dim0_2"]

# This would evaluate to be True
print(_mesh_resources.get_root_mesh(mesh_dim0) == _mesh_resources.get_root_mesh(mesh_dim0))
```
With this change, we will have two types of meshes in an environment.
1. `device_mesh != _mesh_resources.get_root_mesh(device_mesh)` means that the device_mesh is created by slicing.
2. `device_mesh == _mesh_resources.get_root_mesh(device_mesh)` means that the device_mesh is a root mesh not created through slicing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132339
Approved by: https://github.com/wanchaol
ghstack dependencies: #132310, #132311
2024-08-07 07:01:12 +00:00
leslie-fang-intel
dc00eeb0f4 [Dynamo] fix incorrect kwargs in create_proxy (#132723)
## Summary
Fix https://github.com/pytorch/pytorch/issues/132642, the implementation of `create_proxy` requires to pass-in `kwargs` explicitly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132723
Approved by: https://github.com/aorenste
2024-08-07 06:26:24 +00:00
Nikita Shulga
2206a3de00 [Compile] Speedup int8-to-float conversion on aarch64 (#132676)
With this change following snippet:
```cpp
#include <ATen/cpu/vec/vec.h>

void int8tofloat(int8_t* in, float* out) {
        auto tmp0 = at::vec::Vectorized<int8_t>::loadu(in, 8);
        auto tmp1 = at::vec::convert<float>(tmp0);
        tmp1.store(out);
}
```, which is core of the algorithm generated by cpu_inductor for the following compiled function:
```python
@torch.compile
def to_float(x):
  return x.to(torch.float)
```

changes from
```assembly
int8tofloat(signed char*, float*):
0000000000000000	stp	x29, x30, [sp, #-0x10]!
0000000000000004	mov	x29, sp
0000000000000008	sub	x9, sp, #0x30
000000000000000c	and	sp, x9, #0xffffffffffffffe0
0000000000000010	adrp	x8, 0 ; 0x0
0000000000000014	ldr	x8, [x8]
0000000000000018	ldr	x8, [x8]
000000000000001c	str	x8, [sp, #0x28]
0000000000000020	ldr	s0, [x0]
0000000000000024	sshll.8h	v0, v0, #0x0
0000000000000028	sshll.4s	v0, v0, #0x0
000000000000002c	scvtf.4s	v0, v0
0000000000000030	str	q0, [sp]
0000000000000034	ldr	s0, [x0, #0x4]
0000000000000038	sshll.8h	v0, v0, #0x0
000000000000003c	sshll.4s	v0, v0, #0x0
0000000000000040	scvtf.4s	v0, v0
0000000000000044	str	q0, [sp, #0x10]
0000000000000048	mov	x8, sp
000000000000004c	ld1.4s	{ v0, v1 }, [x8]
0000000000000050	st1.4s	{ v0, v1 }, [x1]
0000000000000054	ldr	x8, [sp, #0x28]
0000000000000058	adrp	x9, 0 ; 0x0
000000000000005c	ldr	x9, [x9]
0000000000000060	ldr	x9, [x9]
0000000000000064	cmp	x9, x8
0000000000000068	b.ne	0x78
000000000000006c	mov	sp, x29
0000000000000070	ldp	x29, x30, [sp], #0x10
0000000000000074	ret
0000000000000078	bl	0x78
```
to
```assembly
0000000000000000	ldr	d0, [x0]
0000000000000004	sshll.8h	v0, v0, #0x0
0000000000000008	sshll.4s	v1, v0, #0x0
000000000000000c	scvtf.4s	v1, v1
0000000000000010	sshll2.4s	v0, v0, #0x0
0000000000000014	scvtf.4s	v2, v0
0000000000000018	st1.4s	{ v1, v2 }, [x1]
000000000000001c	ret
```

and improves perf of `python3 torchchat.py generate stories110M --num-samples 3 --quantize '{"linear:int8" : {"groupsize" : 0}}' --compile --device cpu` from 56 to 98 tokens per sec on MacBook M1 Pro

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132676
Approved by: https://github.com/desertfire
2024-08-07 06:26:05 +00:00
Sun, Jiayi
4faa0e3efb [Inductor] support masked vectorization for the tail_loop (#126526)
Currently the tail_loop always uses the scalar kernel. This PR supports masked vectorization for the tail_loop to improve the performance.

Example:
```
import torch
import torch.nn as nn

class GN(nn.Module):
    def __init__(self, num_groups, num_channels):
        super(GN, self).__init__()
        self.gn = nn.GroupNorm(num_groups, num_channels)

    def forward(self, x):
        return self.gn(x)

input = torch.randn(2, 960, 96, 96).to(memory_format=torch.channels_last)
m = GN(32, 960).eval()
compiled_m = torch.compile(m)

with torch.no_grad():
    for _ in range(3):
        compiled_m(input)

```

Generated code:
- Before:
```
cpp_fused_native_group_norm_0 = async_compile.cpp_pybinding(['const float*', 'const float*', 'const float*', 'float*', 'float*', 'float*'], '''
#include "/tmp/torchinductor_jiayisun/ky/cky2bufythacofebk7ujv36e4pxyqcqbpsy5r4vojoprjiwcwfxf.h"
extern "C"  void kernel(const float* in_ptr0,
                       const float* in_ptr1,
                       const float* in_ptr2,
                       float* out_ptr0,
                       float* out_ptr1,
                       float* out_ptr2)
{
    #pragma omp parallel num_threads(112)
    {
        int tid = omp_get_thread_num();
        {
            #pragma omp for collapse(2)
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(2L); x0+=static_cast<long>(1L))
            {
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(32L); x1+=static_cast<long>(1L))
                {
                    {
                        Welford<float> tmp_acc0 = Welford<float>();
                        Welford<at::vec::Vectorized<float>> tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
                        static WeightRecp<at::vec::Vectorized<float>> weight_recps(static_cast<long>(17280L));
                        for(long x2=static_cast<long>(0L); x2<static_cast<long>(9216L); x2+=static_cast<long>(1L))
                        {
                            for(long x3=static_cast<long>(0L); x3<static_cast<long>(16L); x3+=static_cast<long>(16L))
                            {
                                auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x3 + (30L*x1) + (960L*x2) + (8847360L*x0)), 16);
                                tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp0, &weight_recps);
                            }
                            #pragma omp simd simdlen(8)
                            for(long x3=static_cast<long>(16L); x3<static_cast<long>(30L); x3+=static_cast<long>(1L))
                            {
                                auto tmp0 = in_ptr0[static_cast<long>(x3 + (30L*x1) + (960L*x2) + (8847360L*x0))];
                                tmp_acc0 = welford_combine(tmp_acc0, tmp0);
                            }
                        }
                        tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(tmp_acc0_vec));
                        out_ptr0[static_cast<long>(x1 + (32L*x0))] = static_cast<float>(tmp_acc0.mean);
                        out_ptr1[static_cast<long>(x1 + (32L*x0))] = static_cast<float>(tmp_acc0.m2);
                    }
                }
            }
        }
        {
            #pragma omp for collapse(2)
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(2L); x0+=static_cast<long>(1L))
            {
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(9216L); x1+=static_cast<long>(1L))
                {
                    for(long x2=static_cast<long>(0L); x2<static_cast<long>(960L); x2+=static_cast<long>(16L))
                    {
                        auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x2 + (960L*x1) + (8847360L*x0)), 16);
                        auto tmp1 =
                        [&]
                        {
                            __at_align__ std::array<float, 16> tmpbuf;
                            #pragma GCC unroll 16
                            for (long x2_inner = 0; x2_inner < 16; x2_inner++)
                            {
                                tmpbuf[x2_inner] = out_ptr0[static_cast<long>((32L*x0) + (c10::div_floor_integer((x2 + x2_inner), 30L)))];
                            }
                            return at::vec::Vectorized<float>::loadu(tmpbuf.data(), 16);
                        }
                        ()
                        ;
                        auto tmp3 =
                        [&]
                        {
                            __at_align__ std::array<float, 16> tmpbuf;
                            #pragma GCC unroll 16
                            for (long x2_inner = 0; x2_inner < 16; x2_inner++)
                            {
                                tmpbuf[x2_inner] = out_ptr1[static_cast<long>((32L*x0) + (c10::div_floor_integer((x2 + x2_inner), 30L)))];
                            }
                            return at::vec::Vectorized<float>::loadu(tmpbuf.data(), 16);
                        }
                        ()
                        ;
                        auto tmp12 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<long>(x2), 16);
                        auto tmp14 = at::vec::Vectorized<float>::loadu(in_ptr2 + static_cast<long>(x2), 16);
                        auto tmp2 = tmp0 - tmp1;
                        auto tmp4 = static_cast<float>(276480.0);
                        auto tmp5 = at::vec::Vectorized<float>(tmp4);
                        auto tmp6 = tmp3 / tmp5;
                        auto tmp7 = static_cast<float>(1e-05);
                        auto tmp8 = at::vec::Vectorized<float>(tmp7);
                        auto tmp9 = tmp6 + tmp8;
                        auto tmp10 = tmp9.rsqrt();
                        auto tmp11 = tmp2 * tmp10;
                        auto tmp13 = tmp11 * tmp12;
                        auto tmp15 = tmp13 + tmp14;
                        tmp15.store(out_ptr2 + static_cast<long>(x2 + (960L*x1) + (8847360L*x0)));
                    }
                }
            }
        }
    }
}
''')

async_compile.wait(globals())
del async_compile

def call(args):
    arg0_1, arg1_1, arg2_1 = args
    args.clear()
    assert_size_stride(arg0_1, (960, ), (1, ))
    assert_size_stride(arg1_1, (960, ), (1, ))
    assert_size_stride(arg2_1, (2, 960, 96, 96), (8847360, 1, 92160, 960))
    buf0 = empty_strided_cpu((2, 32, 1, 1), (32, 1, 64, 64), torch.float32)
    buf1 = empty_strided_cpu((2, 32, 1, 1), (32, 1, 64, 64), torch.float32)
    buf3 = empty_strided_cpu((2, 960, 96, 96), (8847360, 1, 92160, 960), torch.float32)
    cpp_fused_native_group_norm_0(arg2_1, arg0_1, arg1_1, buf0, buf1, buf3)
    del arg0_1
    del arg1_1
    del arg2_1
    return (buf3, )
```

- After:
```
cpp_fused_native_group_norm_0 = async_compile.cpp_pybinding(['const float*', 'const float*', 'const float*', 'float*', 'float*', 'float*'], '''
#include "/tmp/torchinductor_jiayisun/em/cemtujj65j5txpqlxc7w4pcunpmvz3qtiudkc5ocxxhcmdlknw2m.h"
extern "C"  void kernel(const float* in_ptr0,
                       const float* in_ptr1,
                       const float* in_ptr2,
                       float* out_ptr0,
                       float* out_ptr1,
                       float* out_ptr2)
{
    #pragma omp parallel num_threads(112)
    {
        int tid = omp_get_thread_num();
        {
            #pragma omp for collapse(2)
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(2L); x0+=static_cast<long>(1L))
            {
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(32L); x1+=static_cast<long>(1L))
                {
                    {
                        Welford<float> tmp_acc0 = Welford<float>();
                        Welford<at::vec::Vectorized<float>> tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
                        Welford<at::vec::Vectorized<float>> masked_tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
                        static WeightRecp<at::vec::Vectorized<float>> wrecps0(static_cast<long>(17280L));
                        for(long x2=static_cast<long>(0L); x2<static_cast<long>(9216L); x2+=static_cast<long>(1L))
                        {
                            for(long x3=static_cast<long>(0L); x3<static_cast<long>(16L); x3+=static_cast<long>(16L))
                            {
                                auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x3 + (30L*x1) + (960L*x2) + (8847360L*x0)), 16);
                                tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp0, &wrecps0);
                            }
                            for(long x3=static_cast<long>(16L); x3<static_cast<long>(30L); x3+=static_cast<long>(14L))
                            {
                                auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x3 + (30L*x1) + (960L*x2) + (8847360L*x0)), 14);
                                masked_tmp_acc0_vec = welford_combine(masked_tmp_acc0_vec, tmp0, 14, &wrecps0);
                            }
                        }
                        tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(masked_tmp_acc0_vec));
                        tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(tmp_acc0_vec));
                        out_ptr0[static_cast<long>(x1 + (32L*x0))] = static_cast<float>(tmp_acc0.mean);
                        out_ptr1[static_cast<long>(x1 + (32L*x0))] = static_cast<float>(tmp_acc0.m2);
                    }
                }
            }
        }
        {
            #pragma omp for collapse(2)
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(2L); x0+=static_cast<long>(1L))
            {
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(9216L); x1+=static_cast<long>(1L))
                {
                    for(long x2=static_cast<long>(0L); x2<static_cast<long>(960L); x2+=static_cast<long>(16L))
                    {
                        auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x2 + (960L*x1) + (8847360L*x0)), 16);
                        auto tmp1 =
                        [&]
                        {
                            __at_align__ std::array<float, 16> tmpbuf;
                            #pragma GCC unroll 16
                            for (long x2_inner = 0; x2_inner < 16; x2_inner++)
                            {
                                tmpbuf[x2_inner] = out_ptr0[static_cast<long>((32L*x0) + (c10::div_floor_integer((x2 + x2_inner), 30L)))];
                            }
                            return at::vec::Vectorized<float>::loadu(tmpbuf.data(), 16);
                        }
                        ()
                        ;
                        auto tmp3 =
                        [&]
                        {
                            __at_align__ std::array<float, 16> tmpbuf;
                            #pragma GCC unroll 16
                            for (long x2_inner = 0; x2_inner < 16; x2_inner++)
                            {
                                tmpbuf[x2_inner] = out_ptr1[static_cast<long>((32L*x0) + (c10::div_floor_integer((x2 + x2_inner), 30L)))];
                            }
                            return at::vec::Vectorized<float>::loadu(tmpbuf.data(), 16);
                        }
                        ()
                        ;
                        auto tmp12 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<long>(x2), 16);
                        auto tmp14 = at::vec::Vectorized<float>::loadu(in_ptr2 + static_cast<long>(x2), 16);
                        auto tmp2 = tmp0 - tmp1;
                        auto tmp4 = static_cast<float>(276480.0);
                        auto tmp5 = at::vec::Vectorized<float>(tmp4);
                        auto tmp6 = tmp3 / tmp5;
                        auto tmp7 = static_cast<float>(1e-05);
                        auto tmp8 = at::vec::Vectorized<float>(tmp7);
                        auto tmp9 = tmp6 + tmp8;
                        auto tmp10 = tmp9.rsqrt();
                        auto tmp11 = tmp2 * tmp10;
                        auto tmp13 = tmp11 * tmp12;
                        auto tmp15 = tmp13 + tmp14;
                        tmp15.store(out_ptr2 + static_cast<long>(x2 + (960L*x1) + (8847360L*x0)));
                    }
                }
            }
        }
    }
}
''')

async_compile.wait(globals())
del async_compile

def call(args):
    arg0_1, arg1_1, arg2_1 = args
    args.clear()
    assert_size_stride(arg0_1, (960, ), (1, ))
    assert_size_stride(arg1_1, (960, ), (1, ))
    assert_size_stride(arg2_1, (2, 960, 96, 96), (8847360, 1, 92160, 960))
    buf0 = empty_strided_cpu((2, 32, 1, 1), (32, 1, 64, 64), torch.float32)
    buf1 = empty_strided_cpu((2, 32, 1, 1), (32, 1, 64, 64), torch.float32)
    buf3 = empty_strided_cpu((2, 960, 96, 96), (8847360, 1, 92160, 960), torch.float32)
    cpp_fused_native_group_norm_0(arg2_1, arg0_1, arg1_1, buf0, buf1, buf3)
    del arg0_1
    del arg1_1
    del arg2_1
    return (buf3, )
```

Co-authored-by: CaoE <e.cao@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126526
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
2024-08-07 06:00:12 +00:00
Apurva Jain
8bc5ef563e Grouped Query Attention (#132689)
### Approach: Using the current function declaration

**Constraint:** Q_Heads % KV_Heads == 0

**Major change:**
- Added a new argument enable_gqa: bool to sdpa function call
- It adds a meaning to the last third dimension.

Sample use cases this would enable:
LLama3

```
# LLama3 8b call to SDPA
query = torch.rand(batch, 32, seq_len_q, D)
key = torch.rand(batch, 8, seq_len_kv, D)
value = torch.rand(batch, 8, seq_len_kv, D)

output = scaled_dot_product_attention(query, key, value, is_causal=True, enable_gqa=True)

# Output Shape
(batch, 32, seq_len_q, D)
```

### Design Choice:

- Check if Query.size(-3) == Key.size(-3) == Value.size(-3) or, Query.size(-3) % Key.size(-3) == 0
- The function adjusts the key and value tensors to match the query tensor's head dimension by using repeat_interleave if their number of heads are not equal, facilitating correct and efficient computation in attention mechanisms.
- By default the enable_gqa flag is set to False, which ensures that regular sdpa functionality remains unchanged.

### Benchmarks:

- **sdpa.py: #130634**
For different batch sizes enable_gqa=True shows a substansial improvement in the run_time of sdpa

 | batch_size | q_num_heads | kv_num_heads | q_seq_len | kv_seq_len | embed_dim | forward_time when enable_gqa=True   |   forward_time when enable_gqa=False    |
| ------------ | ------------- | -------------- | ----------- | ------------ | ----------- | ----------- | ---------------- |
|     1      |     32      |      8       |   2048    |    2048    |   2048    |   100.71  |  119.70  |
|     8      |     32      |      8       |   2048    |    2048    |   2048    |   539.78  |  628.83  |
|     16     |     32      |      8       |   2048    |    2048    |   2048    |   1056.81  |  1225.48  |
|     32      |     32      |      8       |   2048    |    2048    |   2048    |   2099.54  |  2440.45  |

![Screenshot 2024-07-25 at 9 07 40 PM](https://github.com/user-attachments/assets/a3e5f716-c39f-4096-9e6c-82a735e57b7b)

- **TorchTitan: https://github.com/pytorch/torchtitan/pull/458**

Differential Revision: D60772086

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132689
Approved by: https://github.com/drisspg
2024-08-07 05:35:36 +00:00
Nicolas Macchioni
527f104a69 add L2 cache size to device properties (#132819)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132819
Approved by: https://github.com/eellison
2024-08-07 04:55:06 +00:00
cyy
bfeb45e46b [17/N] Fix clang-tidy warnings in jit (#132753)
Follows #132604
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132753
Approved by: https://github.com/Skylion007
2024-08-07 03:47:54 +00:00
cyy
03480213de [8/N] Fix clang-tidy warnings in aten/src/ATen (#132728)
Follows  #132727
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132728
Approved by: https://github.com/ezyang
2024-08-07 02:44:17 +00:00
Menglu Yu
919e384247 [PT2][Optimus] Add unbind_stack_to_cat_pass (#132542)
Summary: We observe the stack mpde can be transformed to cat node to elimiate split nodes, which could further enable the unbind cat optimization, thus we add a more advanced pattern to do the graph transformation

Test Plan:
# unit test

```
CUDA_VISIBLE_DEVICES=3 OC_CAUSE=1 buck2 test //caffe2/test/inductor:split_cat_fx_passes
```
Buck UI: https://www.internalfb.com/buck2/de6c1cda-3d74-4a30-8980-7b209b6fe5dc
Test UI: https://www.internalfb.com/intern/testinfra/testrun/12103424042268125
Network: Up: 485KiB  Down: 728KiB  (reSessionID-2f2c01c3-79bb-4e37-b5be-fb77ec09b264)
Jobs completed: 29. Time elapsed: 5:19.8s.
Cache hits: 0%. Commands: 4 (cached: 0, remote: 0, local: 4)
Tests finished: Pass 9. Fail 0. Fatal 0. Skip 1. Build failure 0

# benchmark

```
CUDA_VISIBLE_DEVICES=3 OC_CAUSE=1 buck2 run mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode batch-split --model_type "ig_ctr" --flow_id 584880697
```
P1503698962

before and after graph transformation
https://www.internalfb.com/intern/diffing/?paste_number=1504050718

Differential Revision: D60411560

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132542
Approved by: https://github.com/jackiexu1992
2024-08-07 02:26:40 +00:00
Xuehai Pan
063a45ed27 Fix infinite recursion while walking to submodules (#132763)
Fixes https://github.com/pytorch/pytorch/pull/132216#issuecomment-2271555873

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132763
Approved by: https://github.com/ezyang
2024-08-07 02:20:17 +00:00
leslie-fang-intel
73c083e02c [Inductor][CPP] Turns on inline_inbuilt_nn_modules for CPP GEMM template testing (#132487)
**Summary**
The CPP GEMM template testing has been skipped with turning on `inline_inbuilt_nn_modules ` as in https://github.com/pytorch/pytorch/issues/131929.  Since https://github.com/pytorch/pytorch/pull/132334 has landed to fix the issues. Turn on this flag back since it's default.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132487
Approved by: https://github.com/anijain2305, https://github.com/jgong5
2024-08-07 02:18:51 +00:00
Edward Z. Yang
ed224554eb [BE] Don't unnecessarily suggest -k for rerunning tests locally (#132807)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132807
Approved by: https://github.com/malfet
2024-08-07 02:15:18 +00:00
Edward Z. Yang
837898d9c8 Stop using preserve_rng_state as decorator (#132774)
See https://github.com/pytorch/pytorch/pull/132073 for motivation

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132774
Approved by: https://github.com/albanD
2024-08-07 01:07:12 +00:00
cyy
b01402b0a4 [7/N] Fix clang-tidy warnings in aten/src/ATen (#132727)
Follows  #132620
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132727
Approved by: https://github.com/Skylion007
2024-08-07 00:29:03 +00:00
drisspg
178dc0c9c7 various doc fixes (#132803)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132803
Approved by: https://github.com/Chillee, https://github.com/joydddd, https://github.com/BoyuanFeng
ghstack dependencies: #132799
2024-08-07 00:19:42 +00:00
drisspg
cb4d1bfb71 Clean up some tflop calc and add option for saving (#132799)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132799
Approved by: https://github.com/BoyuanFeng
2024-08-07 00:19:42 +00:00
PyTorch MergeBot
cbee9c1fd2 Revert "Deprecate torch._utils.is_compiling() and torch._dynamo.external_utils.is_compiling() (#127690)"
This reverts commit 0e7e61f7ce.

Reverted https://github.com/pytorch/pytorch/pull/127690 on behalf of https://github.com/kit1980 due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/127690#issuecomment-2272370386))
2024-08-07 00:05:20 +00:00
Henry Tsang
e98eac76b3 [inductor] switch AotCodeCompiler to new cpp_builder. (take 3) (#132766)
Summary: This is basically https://github.com/pytorch/pytorch/pull/131304 together with https://github.com/pytorch/pytorch/pull/132594 and absolute path fix for fbcode.

Test Plan: ci

Differential Revision: D60773405

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132766
Approved by: https://github.com/xuhancn, https://github.com/chenyang78, https://github.com/desertfire
2024-08-06 23:56:34 +00:00
PyTorch MergeBot
c7113a6186 Revert "[DeviceMesh] Create new group for 1D mesh when default backend is 'gloo' and 'cuda' is available (#132709)"
This reverts commit 1a23ef2ece.

Reverted https://github.com/pytorch/pytorch/pull/132709 on behalf of https://github.com/clee2000 due to I think this broke distributed/test_distributed_spawn.py::TestDistBackendWithSpawn::test_ddp_device_mesh_initialization [GH job link](https://github.com/pytorch/pytorch/actions/runs/10274519791/job/28432469987) [HUD commit link](1a23ef2ece).  Test not run due to bad TD ([comment](https://github.com/pytorch/pytorch/pull/132709#issuecomment-2272350923))
2024-08-06 23:47:53 +00:00
rzou
0d6caeb259 Add logging + counter for missed reinplacing opportunities (#132758)
Summary:
- We add Inductor logs for what tensors we tried to reinplace, what
  tensors we were unable to reinplace, and of those tensors, which of
  those might be bugs (the "missed reinplacing opportunities"). You can
  tell this by reading the Inductor output graph but the logs make it
  easier to figure out.
- Add a dynamo_compile counter for missed reinplacing opportunities. The
  goal is to see how widespread existing problems (if any) are. We've had
  trouble getting all of the edge cases for the reinplacing pass; the
  counter will help us hunt down issues.

Test Plan:
- tested locally

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132758
Approved by: https://github.com/eellison
2024-08-06 23:44:24 +00:00
mori360
cd7f527c59 [3/3] 3D Composability - move tp dp tests (#129802)
pytorch (fsdp, tp, pp) -> pytorch (composable)
Move (fsdp, tp, pp) tests under pytorch into a composable folder

FSDP:
test/distributed/_composable/fsdp/test_fully_shard_trainin.py
-TestFullyShard2DTraining
**DP:
test/distributed/tensor/parallel/test_ddp_2d_parallel.py
TP:
test/distributed/tensor/parallel/test_fsdp_2d_parallel.py**
PP:
test/distributed/pipelining/test_composability.py

=>
**distributed/_composable/test_composability/test_2d_composability.py**
distributed/_composable/test_composability/test_pp_composability.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129802
Approved by: https://github.com/fduwjj
ghstack dependencies: #129801
2024-08-06 23:07:07 +00:00
mori360
179b572fd9 [2/3] 3D Composability - move pp tests (#129801)
pytorch (fsdp, tp, pp) -> pytorch (composable)
Move (fsdp, tp, pp) tests under pytorch into a composable folder

FSDP:
test/distributed/_composable/fsdp/test_fully_shard_trainin.py
-TestFullyShard2DTraining
DP:
test/distributed/tensor/parallel/test_ddp_2d_parallel.py
TP:
test/distributed/tensor/parallel/test_fsdp_2d_parallel.py
**PP:
test/distributed/pipelining/test_composability.py**

=>
distributed/_composable/test_composability/test_2d_composability.py
**distributed/_composable/test_composability/test_pp_composability.py**

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129801
Approved by: https://github.com/wconstab, https://github.com/atalman
2024-08-06 23:07:07 +00:00
Shangdi Yu
825002c9c6 [export][fx] More robust DCE pass (#132764)
Summary:
- make default DCE pass check schema,
- need to rebase onto https://github.com/pytorch/pytorch/pull/131651 after it's in phabricator (for now the change is manually added).

- mark Proxy dump as NotImplemented for better error msg

- Remove Proxy from tensors when dumping models, as Proxy cannot be dumped.

More details in https://docs.google.com/document/d/1G5vmTXjzxoyVGRI2kpA1gQukK_Glyg2NrE0Oh6Nlg9A/edit?usp=sharing.

Test Plan:
CI
```
- buck2 run 'fbcode//mode/dev-nosan'  fbcode//caffe2/test/quantization:test_quantization -- -r  qat_conv2d
- test_export.py
- buck2 run 'fbcode//mode/dev-nosan' fbcode//modai/test:test_modai -- -r test_qat_stinson_htp_export
- buck2 run 'fbcode//mode/dev-nosan' fbcode//vizard_projects/ml_depth/tests:test_model -- -r test_qat_model_et
- buck2 run 'fbcode//mode/dev-nosan'  fbcode//caffe2/test:fx -- -r dce
- buck2 run 'fbcode//mode/dev-nosan' fbcode//bolt/nn/executorch/backends/tests:qnn_test -- -r test_qat_bias=False,use_3d_input=False
- buck2 run 'fbcode//mode/dev-nosan' fbcode//bolt/nn/executorch/backends/tests:qnn_test -- -r test_qat_bias=True,use_3d_input=False
- buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r  test_fold_bn_erases_bn_node
```

Reviewed By: angelayi

Differential Revision: D60319175

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132764
Approved by: https://github.com/angelayi
2024-08-06 22:27:22 +00:00
wz337
073cee531c [Test][Easy] Remove print in test_device_mesh.py (#132780)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132780
Approved by: https://github.com/XilunWu
2024-08-06 22:04:39 +00:00
wz337
1a23ef2ece [DeviceMesh] Create new group for 1D mesh when default backend is 'gloo' and 'cuda' is available (#132709)
More context in [#132471](https://github.com/pytorch/pytorch/issues/132471) and https://github.com/pytorch/pytorch/issues/132366.

TLDR:
When cuda is available and users move tensors to cuda, we cannot really reuse the default pg if default pg is gloo, as lots of collectives are not supported on gloo for cuda tensors. For example, `dtensor.full_tensor()` would result in a mysterious SIGTERM when all_gather a cuda tensor using gloo. Without the change in this PR, users would have to know the context and explicitly move the cuda tensor to cpu before invoking most collectives, which I think is not so ideal UX.

Therefore, given most collectives are not supported on gloo for cuda tensors, we should init a new pg if the default pg is gloo when torch.cuda.is_available() and device_type is cuda.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132709
Approved by: https://github.com/awgu, https://github.com/wanchaol
2024-08-06 22:00:09 +00:00
eellison
18b678082e [Easy] log output code path on cache hit (#132718)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132718
Approved by: https://github.com/oulgen, https://github.com/masnesral
2024-08-06 21:59:30 +00:00
Edward Z. Yang
3c1033eeb0 Don't auto request review for reopened PRs (#132681)
This will clobber previous approves.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132681
Approved by: https://github.com/albanD, https://github.com/malfet
2024-08-06 21:36:18 +00:00
rzou
2073ddfd1c Actually report the HOP and subclass/mode when there isn't a registration (#132550)
Test Plan:
- tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132550
Approved by: https://github.com/ydwu4
2024-08-06 21:33:10 +00:00
yuqingj
623d0204f0 [NJT] Support Chunk backward for simple cases (#132193)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132193
Approved by: https://github.com/soulitzer
2024-08-06 21:20:09 +00:00
Aart Bik
2f908ffa4a [traced-graph][sparse] sparsity propagation for all current tests (#132690)
This PR makes sure all current tests in the sparsity export test suite pass. Note that there will probably be anecdotal cases that need fixing after this, but the general idea of preserving sparsity metadata has been completed.

Fixes: https://github.com/pytorch/pytorch/issues/117188

```
$ PYTORCH_TEST_WITH_DYNAMO=0 python test/export/test_sparse.py ........................................................................................................................................................
 ----------------------------------------------------------------------
Ran 152 tests
OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132690
Approved by: https://github.com/ezyang
2024-08-06 21:18:13 +00:00
dependabot[bot]
029f8fc701 Bump rexml from 3.2.8 to 3.3.3 in /ios/TestApp (#132469)
Bumps [rexml](https://github.com/ruby/rexml) from 3.2.8 to 3.3.3.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a href="https://github.com/ruby/rexml/releases">rexml's releases</a>.</em></p>
<blockquote>
<h2>REXML 3.3.3 - 2024-08-01</h2>
<h3>Improvements</h3>
<ul>
<li>
<p>Added support for detecting invalid XML that has unsupported
content before root element</p>
<ul>
<li><a href="https://redirect.github.com/ruby/rexml/issues/184">GH-184</a></li>
<li>Patch by NAITOH Jun.</li>
</ul>
</li>
<li>
<p>Added support for <code>REXML::Security.entity_expansion_limit=</code> and
<code>REXML::Security.entity_expansion_text_limit=</code> in SAX2 and pull
parsers</p>
<ul>
<li><a href="https://redirect.github.com/ruby/rexml/issues/187">GH-187</a></li>
<li>Patch by NAITOH Jun.</li>
</ul>
</li>
<li>
<p>Added more tests for invalid XMLs.</p>
<ul>
<li><a href="https://redirect.github.com/ruby/rexml/issues/183">GH-183</a></li>
<li>Patch by Watson.</li>
</ul>
</li>
<li>
<p>Added more performance tests.</p>
<ul>
<li>Patch by Watson.</li>
</ul>
</li>
<li>
<p>Improved parse performance.</p>
<ul>
<li><a href="https://redirect.github.com/ruby/rexml/issues/186">GH-186</a></li>
<li>Patch by tomoya ishida.</li>
</ul>
</li>
</ul>
<h3>Thanks</h3>
<ul>
<li>
<p>NAITOH Jun</p>
</li>
<li>
<p>Watson</p>
</li>
<li>
<p>tomoya ishida</p>
</li>
</ul>
<h2>REXML 3.3.2 - 2024-07-16</h2>
<h3>Improvements</h3>
<ul>
<li>
<p>Improved parse performance.</p>
<ul>
<li><a href="https://redirect.github.com/ruby/rexml/issues/160">GH-160</a></li>
<li>Patch by NAITOH Jun.</li>
</ul>
</li>
<li>
<p>Improved parse performance.</p>
<ul>
<li><a href="https://redirect.github.com/ruby/rexml/issues/169">GH-169</a></li>
<li><a href="https://redirect.github.com/ruby/rexml/issues/170">GH-170</a></li>
<li><a href="https://redirect.github.com/ruby/rexml/issues/171">GH-171</a></li>
<li><a href="https://redirect.github.com/ruby/rexml/issues/172">GH-172</a></li>
<li><a href="https://redirect.github.com/ruby/rexml/issues/173">GH-173</a></li>
<li><a href="https://redirect.github.com/ruby/rexml/issues/174">GH-174</a></li>
<li><a href="https://redirect.github.com/ruby/rexml/issues/175">GH-175</a></li>
<li><a href="https://redirect.github.com/ruby/rexml/issues/176">GH-176</a></li>
</ul>
</li>
</ul>
<!-- raw HTML omitted -->
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a href="https://github.com/ruby/rexml/blob/master/NEWS.md">rexml's changelog</a>.</em></p>
<blockquote>
<h2>3.3.3 - 2024-08-01 {#version-3-3-3}</h2>
<h3>Improvements</h3>
<ul>
<li>
<p>Added support for detecting invalid XML that has unsupported
content before root element</p>
<ul>
<li><a href="https://redirect.github.com/ruby/rexml/issues/184">GH-184</a></li>
<li>Patch by NAITOH Jun.</li>
</ul>
</li>
<li>
<p>Added support for <code>REXML::Security.entity_expansion_limit=</code> and
<code>REXML::Security.entity_expansion_text_limit=</code> in SAX2 and pull
parsers</p>
<ul>
<li><a href="https://redirect.github.com/ruby/rexml/issues/187">GH-187</a></li>
<li>Patch by NAITOH Jun.</li>
</ul>
</li>
<li>
<p>Added more tests for invalid XMLs.</p>
<ul>
<li><a href="https://redirect.github.com/ruby/rexml/issues/183">GH-183</a></li>
<li>Patch by Watson.</li>
</ul>
</li>
<li>
<p>Added more performance tests.</p>
<ul>
<li>Patch by Watson.</li>
</ul>
</li>
<li>
<p>Improved parse performance.</p>
<ul>
<li><a href="https://redirect.github.com/ruby/rexml/issues/186">GH-186</a></li>
<li>Patch by tomoya ishida.</li>
</ul>
</li>
</ul>
<h3>Thanks</h3>
<ul>
<li>
<p>NAITOH Jun</p>
</li>
<li>
<p>Watson</p>
</li>
<li>
<p>tomoya ishida</p>
</li>
</ul>
<h2>3.3.2 - 2024-07-16 {#version-3-3-2}</h2>
<h3>Improvements</h3>
<ul>
<li>
<p>Improved parse performance.</p>
<ul>
<li><a href="https://redirect.github.com/ruby/rexml/issues/160">GH-160</a></li>
<li>Patch by NAITOH Jun.</li>
</ul>
</li>
<li>
<p>Improved parse performance.</p>
<ul>
<li><a href="https://redirect.github.com/ruby/rexml/issues/169">GH-169</a></li>
<li><a href="https://redirect.github.com/ruby/rexml/issues/170">GH-170</a></li>
<li><a href="https://redirect.github.com/ruby/rexml/issues/171">GH-171</a></li>
<li><a href="https://redirect.github.com/ruby/rexml/issues/172">GH-172</a></li>
<li><a href="https://redirect.github.com/ruby/rexml/issues/173">GH-173</a></li>
<li><a href="https://redirect.github.com/ruby/rexml/issues/174">GH-174</a></li>
<li><a href="https://redirect.github.com/ruby/rexml/issues/175">GH-175</a></li>
</ul>
</li>
</ul>
<!-- raw HTML omitted -->
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a href="e4a067e112"><code>e4a067e</code></a> Add 3.3.3 entry</li>
<li><a href="17ff3e7874"><code>17ff3e7</code></a> test: add a performance test for attribute list declaration</li>
<li><a href="be86b3de0a"><code>be86b3d</code></a> test: fix wrong test name</li>
<li><a href="b93d790b36"><code>b93d790</code></a> test: use double quote for string literal</li>
<li><a href="0fbe7d5a0e"><code>0fbe7d5</code></a> test: don't use abbreviated name</li>
<li><a href="1599e8785f"><code>1599e87</code></a> test: add a performance test for PI with many tabs</li>
<li><a href="e2546e6eca"><code>e2546e6</code></a> parse pi: improve invalid case detection</li>
<li><a href="73661ef281"><code>73661ef</code></a> test: fix a typo</li>
<li><a href="850488abf2"><code>850488a</code></a> test: use double quote for string literal</li>
<li><a href="46c6397d5c"><code>46c6397</code></a> test: add performance tests for entity declaration</li>
<li>Additional commits viewable in <a href="https://github.com/ruby/rexml/compare/v3.2.8...v3.3.3">compare view</a></li>
</ul>
</details>
<br />

[![Dependabot compatibility score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=rexml&package-manager=bundler&previous-version=3.2.8&new-version=3.3.3)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`.

[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)

---

<details>
<summary>Dependabot commands and options</summary>
<br />

You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
You can disable automated security fix PRs for this repo from the [Security Alerts page](https://github.com/pytorch/pytorch/network/alerts).

</details>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132469
Approved by: https://github.com/ezyang
2024-08-06 21:17:24 +00:00
PyTorch MergeBot
e47b684c33 Revert "Temp disable MKL in DistributionKernels.cpp (#132532)"
This reverts commit 7b2664ece6.

Reverted https://github.com/pytorch/pytorch/pull/132532 on behalf of https://github.com/PaliC due to causing numerical instability issues internally ([comment](https://github.com/pytorch/pytorch/pull/132532#issuecomment-2272136210))
2024-08-06 20:57:09 +00:00
Li Yu (ads)
94155ce31b [Torch] Support meta device in checkpoint (#132684)
Summary:
## Why
utils.checkpoint doesn't support meta device:

```
  File "/Users/lyu1/torchdev/lib/python3.9/site-packages/torch/utils/checkpoint.py", line 490, in checkpoint
    next(gen)
  File "/Users/lyu1/torchdev/lib/python3.9/site-packages/torch/utils/checkpoint.py", line 1359, in _checkpoint_without_reentrant_generator
    device_module = _get_device_module(device)
  File "/Users/lyu1/torchdev/lib/python3.9/site-packages/torch/utils/checkpoint.py", line 98, in _get_device_module
    device_module = getattr(torch, device)
  File "/Users/lyu1/torchdev/lib/python3.9/site-packages/torch/__init__.py", line 1938, in __getattr__
    raise AttributeError(f"module '{__name__}' has no attribute '{name}'")
AttributeError: module 'torch' has no attribute 'meta'
```

This blocks us from running model with checkpoint enabled in meta mode.

## What
This diff handles the case of meta device in checkpoint.py.

(in checkpoint.py, device module is manily used when preserve_rng_state=true, which doesn't apply to meta case. So a more elgant fix might be set preserve_rng_state=false when detecting args are on meta device. But I didn't find where to do this check in the minimum way. Let me know if you have ideas.)

Test Plan: Tested with toy model which has checkpoint on its module: P1513716944

Differential Revision: D60749427

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132684
Approved by: https://github.com/kit1980
2024-08-06 20:45:50 +00:00
Animesh Jain
de00c79583 [dynamo][inline_inbuilt_nn_modules] Mark nn module tensor static for cudagraphs (#132736)
Fixes https://github.com/pytorch/pytorch/issues/132714

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132736
Approved by: https://github.com/mlazos
ghstack dependencies: #132538
2024-08-06 20:13:28 +00:00
Shuo Ding
1954bfacda [Inductor] Small performance, precision, and dependency updates to B2B-GEMM (#132354)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132354
Approved by: https://github.com/masnesral
2024-08-06 20:01:27 +00:00
Tugsbayasgalan Manlaibaatar
775c310c0c Preserve source_fn_stack in the training IR decomp (#132033)
Title

Differential Revision: [D60377712](https://our.internmc.facebook.com/intern/diff/D60377712/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132033
Approved by: https://github.com/angelayi
ghstack dependencies: #131988, #131995, #131999
2024-08-06 19:45:40 +00:00