pytorch

mirror of https://github.com/saymrwulf/pytorch.git synced 2026-05-14 20:57:59 +00:00

Author	SHA1	Message	Date
Matthew Hoffman	258f47fc0b	Add `padding_side` to `pad_sequence` with `"left"` and `"right"` options (`"right"` as default) (#131884 )
Fixes #10536
Reattempt of #61467. Thank you so much to @mskoh52 for your excellent work!
As I was trying to create a more efficient LLM data collator, I realized that `pad_sequence` only supports right padding, even though left padding is a very common format for LLMs, like Llama and Mistral.
The proposed alternative implementation was to use multiple flips, which tends to be 1.5x-2x slower. Instead we can add a [`padding_side` parameter as there is for Hugging Face tokenizers](`9d6c0641c4/src/transformers/tokenization_utils_base.py (L1565)`), which requires only a very small change in the C++ code.
Here are the benchmarks of the new implementation!
`float32`: ![eaaa95ef-9384-45d2-be56-6898bc1d3514](https://github.com/user-attachments/assets/3b0eb309-e5a0-4a4d-97bb-4e3298783dbb)
`bool`: ![892f32da-8d9a-492b-9507-18d3f0a41e8e](https://github.com/user-attachments/assets/6824ea15-7d4e-4b89-95f0-8546635f0c2e)
Code:
```python
from __future__ import annotations

import random
import time
from typing import Literal

import numpy as np
import torch


def pad_sequence_with_flips(
    sequences: list[torch.Tensor],
    batch_first: bool = False,
    padding_value: int | float | bool = 0.0,
    padding_side: Literal["left", "right"] | str = "left",
) -> torch.Tensor:
    # Baseline: emulate left padding by flipping, right-padding, and flipping back.
    if padding_side == 'right':
        padded_sequence = torch._C._nn.pad_sequence([t.flatten() for t in sequences], batch_first=batch_first, padding_value=padding_value)
    elif padding_side=='left':
        padded_sequence = torch._C._nn.pad_sequence([t.flatten().flip(0) for t in sequences], batch_first=batch_first, padding_value=padding_value)  # pyright: ignore[reportArgumentType]
        # The sequence dimension is 1 when batch_first is True, otherwise 0.
        padded_sequence = padded_sequence.flip(int(batch_first))
    else:
        raise ValueError(f"padding_side should be either 'right' or 'left', but got {padding_side}")

    return padded_sequence


sequence_lengths: list[int] = []
flip_left_pad_times: list[float] = []
flip_left_pad_times_std: list[float] = []
left_pad_times: list[float] = []
left_pad_times_std: list[float] = []
RUNS_PER_LOOP: int = 100

for i in range(1, 7):
    sequence_length = i * int(1e6) // 6
    sequence_lengths.append(sequence_length)
    sequences = [torch.randint(0, 2, (random.randint(1, sequence_length),), dtype=torch.bool) for _ in range(64)]
    inner_left_pad_times: list[float] = []
    inner_right_pad_times: list[float] = []
    inner_flip_left_pad_times: list[float] = []
    inner_flip_right_pad_times: list[float] = []
    for _ in range(RUNS_PER_LOOP):
        # New implementation: native left padding.
        start = time.perf_counter()
        torch._C._nn.pad_sequence(sequences, batch_first=True, padding_value=False, padding_side="left")
        end = time.perf_counter()
        inner_left_pad_times.append(end - start)

        # Old workaround: left padding via two flips.
        start = time.perf_counter()
        pad_sequence_with_flips(sequences, batch_first=True, padding_value=False, padding_side="left")
        end = time.perf_counter()
        inner_flip_left_pad_times.append(end - start)

    left_pad_times.append(sum(inner_left_pad_times) / len(inner_left_pad_times))
    left_pad_times_std.append(np.std(inner_left_pad_times))
    flip_left_pad_times.append(sum(inner_flip_left_pad_times) / len(inner_flip_left_pad_times))
    flip_left_pad_times_std.append(np.std(inner_flip_left_pad_times))

    print(f"Sequence Length: {sequence_length}, Left Pad Time: {left_pad_times[-1]}, Left with Flips Pad Time: {flip_left_pad_times[-1]}")

import matplotlib.pyplot as plt

plt.plot(sequence_lengths, left_pad_times, label="new pad_sequence left")
plt.scatter(sequence_lengths, left_pad_times)
plt.errorbar(sequence_lengths, left_pad_times, yerr=left_pad_times_std, linestyle='None', marker='^')
plt.plot(sequence_lengths, flip_left_pad_times, label="old pad_sequence left (2 flips)")
plt.scatter(sequence_lengths, flip_left_pad_times)
plt.errorbar(sequence_lengths, flip_left_pad_times, yerr=flip_left_pad_times_std, linestyle='None', marker='^')
plt.xlabel("Sequence Length")
plt.ylabel("Time (s)")
plt.legend(loc="upper right")

# Sequence Length: 166666, Left Pad Time: 0.06147645162009212, Left with Flips Pad Time: 0.09842291727001794
# Sequence Length: 333333, Left Pad Time: 0.08933195920990329, Left with Flips Pad Time: 0.15597836187991562
# Sequence Length: 500000, Left Pad Time: 0.08863158334006585, Left with Flips Pad Time: 0.15224887342999863
# Sequence Length: 666666, Left Pad Time: 0.10524682551997103, Left with Flips Pad Time: 0.18177212480995877
# Sequence Length: 833333, Left Pad Time: 0.11801802741003485, Left with Flips Pad Time: 0.20821274195001024
# Sequence Length: 1000000, Left Pad Time: 0.131894061660023, Left with Flips Pad Time: 0.23223503091008751
```
Co-authored-by: mskoh52 <mskoh52@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/131884 Approved by: https://github.com/ezyang	2024-08-07 15:53:07 +00:00
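To make the difference between the two `padding_side` options concrete, here is a minimal pure-Python sketch of batch-first padding on plain lists. The helper `pad_lists` is hypothetical and only mirrors the semantics described in the commit message above; it is not the `pad_sequence` implementation and does not touch tensors.

```python
def pad_lists(sequences, padding_value=0, padding_side="right"):
    """Pad variable-length lists to a common length (batch-first layout).

    Hypothetical illustration of left vs. right padding; not a PyTorch API.
    """
    if padding_side not in ("left", "right"):
        raise ValueError(
            f"padding_side should be either 'right' or 'left', but got {padding_side}"
        )
    max_len = max((len(seq) for seq in sequences), default=0)
    padded = []
    for seq in sequences:
        fill = [padding_value] * (max_len - len(seq))
        if padding_side == "left":
            # Padding goes in front, so the real items are aligned at the end.
            padded.append(fill + list(seq))
        else:
            # Padding goes at the back, so the real items are aligned at the start.
            padded.append(list(seq) + fill)
    return padded


if __name__ == "__main__":
    batch = [[1, 2, 3], [4]]
    print(pad_lists(batch, padding_side="right"))
    print(pad_lists(batch, padding_side="left"))
```

With left padding the real items of every row end at the last position, which is why the commit message describes it as a common layout for LLM inputs.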
PyTorch MergeBot	780310fed7	Revert "Only thunkify proxies in some situations (#132421 )" This reverts commit `bb99008c9e`. Reverted https://github.com/pytorch/pytorch/pull/132421 on behalf of https://github.com/clee2000 due to I think this broke dynamo/test_subclasses.py::TestNestedTensor::test_in_graph_construction_from_input [GH job link](https://github.com/pytorch/pytorch/actions/runs/10283744685/job/28459340678) [HUD commit link](`bb99008c9e`). Test got added in `f50621989b` which is before your merge base ([comment](https://github.com/pytorch/pytorch/pull/132421#issuecomment-2273742960))	2024-08-07 15:29:54 +00:00
PyTorch MergeBot	de9b8a42c1	Revert "Add support for other backends in get_preferred_device (#132118 )" This reverts commit `c184ac0f6b`. Reverted https://github.com/pytorch/pytorch/pull/132118 on behalf of https://github.com/clee2000 due to I think this broke distributed/checkpoint/test_file_system_checkpoint_cpu.py::TestDistributedReshardOnLoad::test_load_rowwise_to_colwise_thread_count_1 [GH job link](https://github.com/pytorch/pytorch/actions/runs/10279901233/job/28456599072) [HUD commit link](`c184ac0f6b`). Dr CI classification is wrong, the failure is not flaky ([comment](https://github.com/pytorch/pytorch/pull/132118#issuecomment-2273729288))	2024-08-07 15:22:42 +00:00
cyy	13fa59580e	Enable clang-tidy on aten/src/ATen/cpu (#132830 ) Expands code coverage of clang-tidy to aten/src/ATen/cpu Pull Request resolved: https://github.com/pytorch/pytorch/pull/132830 Approved by: https://github.com/Skylion007	2024-08-07 14:44:17 +00:00
Antoni Viros	ed97fb77f9	Conversions between strided and jagged layouts for Nested Tensors (#115749 ) This PR does 3 things: 1. Adds a copy-free strided->jagged layout conversion for NT 2. Adds a copy-free jagged->strided layout conversion for NT 3. Modifies and expands the .to() API to support the layout argument for the specific case of NT layout conversion. Pull Request resolved: https://github.com/pytorch/pytorch/pull/115749 Approved by: https://github.com/jbschlosser	2024-08-07 14:18:53 +00:00
Joel Schlosser	fb146fc3c6	Only store necessary tensor_dict fields in node meta (#132805 ) Fixes #132290 This PR attempts a more invasive / complete solution than the one from #132338, which removes immediate tensor fields from the `tensor_dict` copy stored in node meta. The approach taken here is to store only those fields of the `tensor_dict` which are absolutely utilized somewhere else. So far, this appears to be limited to: * `_dynamo_static_input_type` * `tag` (at least in the tests). Discussion at #94080 appears to indicate this is depended on for export (CI may point out more) Pull Request resolved: https://github.com/pytorch/pytorch/pull/132805 Approved by: https://github.com/mlazos	2024-08-07 13:35:16 +00:00
Edward Z. Yang	7c79e89bc5	Stop using clear_frame as decorator (#132778 ) See https://github.com/pytorch/pytorch/pull/132073 for motivation Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/132778 Approved by: https://github.com/albanD ghstack dependencies: #132774	2024-08-07 11:53:18 +00:00
Edward Z. Yang	bb99008c9e	Only thunkify proxies in some situations (#132421 ) The goal of this PR is to avoid stack overflow when we create extremely long chains of thunks, and then evaluate them (e.g., as occurs if you sum(long list of symint)). The basic idea behind this PR is to only thunkify proxies if they're being created in places where they may or may not be used--crucially, symint operations that occur in user code we are tracing are eagerly placed into the graph, even if they may eventually be dead. I annotated the PR with explanation of changes. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/132421 Approved by: https://github.com/Skylion007, https://github.com/zou3519 ghstack dependencies: #132674, #132675	2024-08-07 11:51:17 +00:00
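The stack-overflow concern in the commit above can be illustrated with a small pure-Python sketch. It does not use any Dynamo or FX internals; `lazy_total`, `eager_total`, and `overflows` are hypothetical names, and the sketch only assumes that forcing a chain of nested thunks recurses once per link.

```python
import sys

# Deeper than the interpreter's recursion limit, so forcing the chain must fail.
DEPTH = sys.getrecursionlimit() * 2


def lazy_total(n):
    """Build a chain of n nested thunks, then force the last one."""
    thunk = lambda: 0
    for _ in range(n):
        # Each thunk captures the previous one and evaluates it only on demand.
        thunk = lambda prev=thunk: prev() + 1
    return thunk()  # forcing the chain recurses n levels deep


def eager_total(n):
    """Compute the same value eagerly, one step at a time, with no nesting."""
    total = 0
    for _ in range(n):
        total += 1
    return total


def overflows(n):
    """Return True if forcing an n-link thunk chain hits the recursion limit."""
    try:
        lazy_total(n)
    except RecursionError:
        return True
    return False


if __name__ == "__main__":
    print(overflows(DEPTH), eager_total(DEPTH) == DEPTH)
```

The eager variant does the same work without deep recursion, which is the trade-off the commit message describes for symint operations that occur in traced user code.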
Danielmic	32f9a809c7	Replace [[unlikely]] with unlikely(x) (#130816 ) Do not use `[[unlikely]]`, as it is a C++20 language feature; see https://en.cppreference.com/w/cpp/language/attributes/likely Fixes https://github.com/pytorch/pytorch/issues/130815 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130816 Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/malfet	2024-08-07 10:38:13 +00:00
zengxian	8c8eb9670a	[CI] Enable inductor UT test on avx512 (#132645 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/132645 Approved by: https://github.com/desertfire	2024-08-07 10:22:40 +00:00
Syed Tousif Ahmed	37ab0f3385	Loads .pyd instead of .so in MemPool test for windows (#132749 ) Fixes #132650 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132749 Approved by: https://github.com/albanD	2024-08-07 09:58:52 +00:00
xinyu-intel	8333ecf085	Support hasattr tracing for more PythonModuleVariable (#132731 ) Fixes #132237 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132731 Approved by: https://github.com/EikanWang, https://github.com/yanboliang	2024-08-07 09:15:17 +00:00
Nicolas Macchioni	c8c964f950	[inductor] check best templates first for fusions (#132829 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/132829 Approved by: https://github.com/eellison	2024-08-07 07:48:00 +00:00
Jeeja	c184ac0f6b	Add support for other backends in get_preferred_device (#132118 ) Currently get_preferred_device supports only cuda and cpu. Add support for other backends using backend config. Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/132118 Approved by: https://github.com/awgu	2024-08-07 07:19:20 +00:00
wz337	87053132ea	[DeviceMesh] Remove parent mesh concept from _MeshEnv and replace by root mesh (#132339 )
Previously, when we sliced out a submesh from a mesh, we assigned the mesh as the parent mesh of the submesh. In this case, when we have a 3D mesh topology, the parent mesh of a 1D mesh sliced out from the 3D mesh is different from the parent mesh of the same 1D mesh sliced out from the 2D submesh of the 3D mesh. For example:
```
mesh_3d = init_device_mesh("cuda", (2,2,2), ("dim0", "dim1", "dim2"))
mesh_dim0 = mesh_3d["dim0"]
mesh_2d = mesh_3d["dim0", "dim1"]
mesh_dim0_2 = mesh_2d["dim0"]
# This would evaluate to be True
print(_mesh_resources.get_parent_mesh(mesh_dim0) != _mesh_resources.get_parent_mesh(mesh_dim0_2))
```
We can always reconstruct the mesh needed from the mesh dim names, as long as two dims come from the same root. For simplicity, we do not see the necessity of building a tree structure to represent the child-parent relationship. Therefore, we are replacing the parent mesh concept with a root mesh concept in `_MeshEnv` so we would have:
```
mesh_3d = init_device_mesh("cuda", (2,2,2), ("dim0", "dim1", "dim2"))
mesh_dim0 = mesh_3d["dim0"]
mesh_2d = mesh_3d["dim0", "dim1"]
mesh_dim0_2 = mesh_2d["dim0"]
# This would evaluate to be True
print(_mesh_resources.get_root_mesh(mesh_dim0) == _mesh_resources.get_root_mesh(mesh_dim0_2))
```
With this change, we will have two types of meshes in an environment.
1. `device_mesh != _mesh_resources.get_root_mesh(device_mesh)` means that the device_mesh is created by slicing.
2. `device_mesh == _mesh_resources.get_root_mesh(device_mesh)` means that the device_mesh is a root mesh not created through slicing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132339 Approved by: https://github.com/wanchaol ghstack dependencies: #132310, #132311	2024-08-07 07:01:12 +00:00
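The root-mesh bookkeeping described in the commit above can be sketched without any distributed setup. The `Mesh` and `MeshEnv` classes below are hypothetical stand-ins, not the real `DeviceMesh` or `_MeshEnv`; the sketch only assumes that every sliced mesh records the root of the mesh it was sliced from, instead of its immediate parent.

```python
class Mesh:
    """Toy mesh that only tracks its dimension names (hypothetical)."""

    def __init__(self, dim_names):
        self.dim_names = tuple(dim_names)


class MeshEnv:
    """Toy registry mapping each sliced mesh to its root mesh (hypothetical)."""

    def __init__(self):
        self._child_to_root = {}

    def get_root_mesh(self, mesh):
        # A mesh that was never produced by slicing is its own root.
        return self._child_to_root.get(mesh, mesh)

    def slice(self, mesh, dim_names):
        missing = [name for name in dim_names if name not in mesh.dim_names]
        if missing:
            raise KeyError(f"unknown mesh dims: {missing}")
        submesh = Mesh(dim_names)
        # Record the root of the source mesh, not the source mesh itself,
        # so the result is independent of the slicing path.
        self._child_to_root[submesh] = self.get_root_mesh(mesh)
        return submesh


if __name__ == "__main__":
    env = MeshEnv()
    mesh_3d = Mesh(("dim0", "dim1", "dim2"))
    mesh_dim0 = env.slice(mesh_3d, ("dim0",))
    mesh_2d = env.slice(mesh_3d, ("dim0", "dim1"))
    mesh_dim0_2 = env.slice(mesh_2d, ("dim0",))
    print(env.get_root_mesh(mesh_dim0) is env.get_root_mesh(mesh_dim0_2))
```

Because only the root is stored, no tree of parent links is needed, which matches the simplification the commit message argues for.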
leslie-fang-intel	dc00eeb0f4	[Dynamo] fix incorrect kwargs in create_proxy (#132723 ) ## Summary Fix https://github.com/pytorch/pytorch/issues/132642: the implementation of `create_proxy` requires `kwargs` to be passed in explicitly. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132723 Approved by: https://github.com/aorenste	2024-08-07 06:26:24 +00:00
Nikita Shulga	2206a3de00	[Compile] Speedup int8-to-float conversion on aarch64 (#132676 )
With this change, the following snippet:
```cpp
#include <ATen/cpu/vec/vec.h>

void int8tofloat(int8_t* in, float* out) {
  auto tmp0 = at::vec::Vectorized<int8_t>::loadu(in, 8);
  auto tmp1 = at::vec::convert<float>(tmp0);
  tmp1.store(out);
}
```
which is the core of the algorithm generated by cpu_inductor for the following compiled function:
```python
@torch.compile
def to_float(x):
    return x.to(torch.float)
```
changes from
```assembly
int8tofloat(signed char, float):
0000000000000000 stp x29, x30, [sp, #-0x10]!
0000000000000004 mov x29, sp
0000000000000008 sub x9, sp, #0x30
000000000000000c and sp, x9, #0xffffffffffffffe0
0000000000000010 adrp x8, 0 ; 0x0
0000000000000014 ldr x8, [x8]
0000000000000018 ldr x8, [x8]
000000000000001c str x8, [sp, #0x28]
0000000000000020 ldr s0, [x0]
0000000000000024 sshll.8h v0, v0, #0x0
0000000000000028 sshll.4s v0, v0, #0x0
000000000000002c scvtf.4s v0, v0
0000000000000030 str q0, [sp]
0000000000000034 ldr s0, [x0, #0x4]
0000000000000038 sshll.8h v0, v0, #0x0
000000000000003c sshll.4s v0, v0, #0x0
0000000000000040 scvtf.4s v0, v0
0000000000000044 str q0, [sp, #0x10]
0000000000000048 mov x8, sp
000000000000004c ld1.4s { v0, v1 }, [x8]
0000000000000050 st1.4s { v0, v1 }, [x1]
0000000000000054 ldr x8, [sp, #0x28]
0000000000000058 adrp x9, 0 ; 0x0
000000000000005c ldr x9, [x9]
0000000000000060 ldr x9, [x9]
0000000000000064 cmp x9, x8
0000000000000068 b.ne 0x78
000000000000006c mov sp, x29
0000000000000070 ldp x29, x30, [sp], #0x10
0000000000000074 ret
0000000000000078 bl 0x78
```
to
```assembly
0000000000000000 ldr d0, [x0]
0000000000000004 sshll.8h v0, v0, #0x0
0000000000000008 sshll.4s v1, v0, #0x0
000000000000000c scvtf.4s v1, v1
0000000000000010 sshll2.4s v0, v0, #0x0
0000000000000014 scvtf.4s v2, v0
0000000000000018 st1.4s { v1, v2 }, [x1]
000000000000001c ret
```
and improves perf of `python3 torchchat.py generate stories110M --num-samples 3 --quantize '{"linear:int8" : {"groupsize" : 0}}' --compile --device cpu` from 56 to 98 tokens per sec on MacBook M1 Pro
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132676 Approved by: https://github.com/desertfire	2024-08-07 06:26:05 +00:00
Sun, Jiayi	4faa0e3efb	[Inductor] support masked vectorization for the tail_loop (#126526 ) Currently the tail_loop always uses the scalar kernel. This PR supports masked vectorization for the tail_loop to improve the performance. Example: ``` import torch import torch.nn as nn class GN(nn.Module): def __init__(self, num_groups, num_channels): super(GN, self).__init__() self.gn = nn.GroupNorm(num_groups, num_channels) def forward(self, x): return self.gn(x) input = torch.randn(2, 960, 96, 96).to(memory_format=torch.channels_last) m = GN(32, 960).eval() compiled_m = torch.compile(m) with torch.no_grad(): for _ in range(3): compiled_m(input) ``` Generated code: - Before: ``` cpp_fused_native_group_norm_0 = async_compile.cpp_pybinding(['const float', 'const float', 'const float', 'float', 'float', 'float'], ''' #include "/tmp/torchinductor_jiayisun/ky/cky2bufythacofebk7ujv36e4pxyqcqbpsy5r4vojoprjiwcwfxf.h" extern "C" void kernel(const float* in_ptr0, const float* in_ptr1, const float* in_ptr2, float* out_ptr0, float* out_ptr1, float* out_ptr2) { #pragma omp parallel num_threads(112) { int tid = omp_get_thread_num(); { #pragma omp for collapse(2) for(long x0=static_cast<long>(0L); x0<static_cast<long>(2L); x0+=static_cast<long>(1L)) { for(long x1=static_cast<long>(0L); x1<static_cast<long>(32L); x1+=static_cast<long>(1L)) { { Welford<float> tmp_acc0 = Welford<float>(); Welford<at::vec::Vectorized<float>> tmp_acc0_vec = Welford<at::vec::Vectorized<float>>(); static WeightRecp<at::vec::Vectorized<float>> weight_recps(static_cast<long>(17280L)); for(long x2=static_cast<long>(0L); x2<static_cast<long>(9216L); x2+=static_cast<long>(1L)) { for(long x3=static_cast<long>(0L); x3<static_cast<long>(16L); x3+=static_cast<long>(16L)) { auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x3 + (30Lx1) + (960Lx2) + (8847360Lx0)), 16); tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp0, &weight_recps); } #pragma omp simd simdlen(8) for(long 
x3=static_cast<long>(16L); x3<static_cast<long>(30L); x3+=static_cast<long>(1L)) { auto tmp0 = in_ptr0[static_cast<long>(x3 + (30Lx1) + (960Lx2) + (8847360Lx0))]; tmp_acc0 = welford_combine(tmp_acc0, tmp0); } } tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(tmp_acc0_vec)); out_ptr0[static_cast<long>(x1 + (32Lx0))] = static_cast<float>(tmp_acc0.mean); out_ptr1[static_cast<long>(x1 + (32Lx0))] = static_cast<float>(tmp_acc0.m2); } } } } { #pragma omp for collapse(2) for(long x0=static_cast<long>(0L); x0<static_cast<long>(2L); x0+=static_cast<long>(1L)) { for(long x1=static_cast<long>(0L); x1<static_cast<long>(9216L); x1+=static_cast<long>(1L)) { for(long x2=static_cast<long>(0L); x2<static_cast<long>(960L); x2+=static_cast<long>(16L)) { auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x2 + (960Lx1) + (8847360Lx0)), 16); auto tmp1 = [&] { __at_align__ std::array<float, 16> tmpbuf; #pragma GCC unroll 16 for (long x2_inner = 0; x2_inner < 16; x2_inner++) { tmpbuf[x2_inner] = out_ptr0[static_cast<long>((32Lx0) + (c10::div_floor_integer((x2 + x2_inner), 30L)))]; } return at::vec::Vectorized<float>::loadu(tmpbuf.data(), 16); } () ; auto tmp3 = [&] { __at_align__ std::array<float, 16> tmpbuf; #pragma GCC unroll 16 for (long x2_inner = 0; x2_inner < 16; x2_inner++) { tmpbuf[x2_inner] = out_ptr1[static_cast<long>((32Lx0) + (c10::div_floor_integer((x2 + x2_inner), 30L)))]; } return at::vec::Vectorized<float>::loadu(tmpbuf.data(), 16); } () ; auto tmp12 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<long>(x2), 16); auto tmp14 = at::vec::Vectorized<float>::loadu(in_ptr2 + static_cast<long>(x2), 16); auto tmp2 = tmp0 - tmp1; auto tmp4 = static_cast<float>(276480.0); auto tmp5 = at::vec::Vectorized<float>(tmp4); auto tmp6 = tmp3 / tmp5; auto tmp7 = static_cast<float>(1e-05); auto tmp8 = at::vec::Vectorized<float>(tmp7); auto tmp9 = tmp6 + tmp8; auto tmp10 = tmp9.rsqrt(); auto tmp11 = tmp2 * tmp10; auto tmp13 = tmp11 * tmp12; auto 
tmp15 = tmp13 + tmp14; tmp15.store(out_ptr2 + static_cast<long>(x2 + (960Lx1) + (8847360Lx0))); } } } } } } ''') async_compile.wait(globals()) del async_compile def call(args): arg0_1, arg1_1, arg2_1 = args args.clear() assert_size_stride(arg0_1, (960, ), (1, )) assert_size_stride(arg1_1, (960, ), (1, )) assert_size_stride(arg2_1, (2, 960, 96, 96), (8847360, 1, 92160, 960)) buf0 = empty_strided_cpu((2, 32, 1, 1), (32, 1, 64, 64), torch.float32) buf1 = empty_strided_cpu((2, 32, 1, 1), (32, 1, 64, 64), torch.float32) buf3 = empty_strided_cpu((2, 960, 96, 96), (8847360, 1, 92160, 960), torch.float32) cpp_fused_native_group_norm_0(arg2_1, arg0_1, arg1_1, buf0, buf1, buf3) del arg0_1 del arg1_1 del arg2_1 return (buf3, ) ``` - After: ``` cpp_fused_native_group_norm_0 = async_compile.cpp_pybinding(['const float', 'const float', 'const float', 'float', 'float', 'float'], ''' #include "/tmp/torchinductor_jiayisun/em/cemtujj65j5txpqlxc7w4pcunpmvz3qtiudkc5ocxxhcmdlknw2m.h" extern "C" void kernel(const float* in_ptr0, const float* in_ptr1, const float* in_ptr2, float* out_ptr0, float* out_ptr1, float* out_ptr2) { #pragma omp parallel num_threads(112) { int tid = omp_get_thread_num(); { #pragma omp for collapse(2) for(long x0=static_cast<long>(0L); x0<static_cast<long>(2L); x0+=static_cast<long>(1L)) { for(long x1=static_cast<long>(0L); x1<static_cast<long>(32L); x1+=static_cast<long>(1L)) { { Welford<float> tmp_acc0 = Welford<float>(); Welford<at::vec::Vectorized<float>> tmp_acc0_vec = Welford<at::vec::Vectorized<float>>(); Welford<at::vec::Vectorized<float>> masked_tmp_acc0_vec = Welford<at::vec::Vectorized<float>>(); static WeightRecp<at::vec::Vectorized<float>> wrecps0(static_cast<long>(17280L)); for(long x2=static_cast<long>(0L); x2<static_cast<long>(9216L); x2+=static_cast<long>(1L)) { for(long x3=static_cast<long>(0L); x3<static_cast<long>(16L); x3+=static_cast<long>(16L)) { auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x3 + (30Lx1) + 
(960Lx2) + (8847360Lx0)), 16); tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp0, &wrecps0); } for(long x3=static_cast<long>(16L); x3<static_cast<long>(30L); x3+=static_cast<long>(14L)) { auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x3 + (30Lx1) + (960Lx2) + (8847360Lx0)), 14); masked_tmp_acc0_vec = welford_combine(masked_tmp_acc0_vec, tmp0, 14, &wrecps0); } } tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(masked_tmp_acc0_vec)); tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(tmp_acc0_vec)); out_ptr0[static_cast<long>(x1 + (32Lx0))] = static_cast<float>(tmp_acc0.mean); out_ptr1[static_cast<long>(x1 + (32Lx0))] = static_cast<float>(tmp_acc0.m2); } } } } { #pragma omp for collapse(2) for(long x0=static_cast<long>(0L); x0<static_cast<long>(2L); x0+=static_cast<long>(1L)) { for(long x1=static_cast<long>(0L); x1<static_cast<long>(9216L); x1+=static_cast<long>(1L)) { for(long x2=static_cast<long>(0L); x2<static_cast<long>(960L); x2+=static_cast<long>(16L)) { auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x2 + (960Lx1) + (8847360Lx0)), 16); auto tmp1 = [&] { __at_align__ std::array<float, 16> tmpbuf; #pragma GCC unroll 16 for (long x2_inner = 0; x2_inner < 16; x2_inner++) { tmpbuf[x2_inner] = out_ptr0[static_cast<long>((32Lx0) + (c10::div_floor_integer((x2 + x2_inner), 30L)))]; } return at::vec::Vectorized<float>::loadu(tmpbuf.data(), 16); } () ; auto tmp3 = [&] { __at_align__ std::array<float, 16> tmpbuf; #pragma GCC unroll 16 for (long x2_inner = 0; x2_inner < 16; x2_inner++) { tmpbuf[x2_inner] = out_ptr1[static_cast<long>((32Lx0) + (c10::div_floor_integer((x2 + x2_inner), 30L)))]; } return at::vec::Vectorized<float>::loadu(tmpbuf.data(), 16); } () ; auto tmp12 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<long>(x2), 16); auto tmp14 = at::vec::Vectorized<float>::loadu(in_ptr2 + static_cast<long>(x2), 16); auto tmp2 = tmp0 - tmp1; auto tmp4 = static_cast<float>(276480.0); auto 
tmp5 = at::vec::Vectorized<float>(tmp4); auto tmp6 = tmp3 / tmp5; auto tmp7 = static_cast<float>(1e-05); auto tmp8 = at::vec::Vectorized<float>(tmp7); auto tmp9 = tmp6 + tmp8; auto tmp10 = tmp9.rsqrt(); auto tmp11 = tmp2 * tmp10; auto tmp13 = tmp11 * tmp12; auto tmp15 = tmp13 + tmp14; tmp15.store(out_ptr2 + static_cast<long>(x2 + (960Lx1) + (8847360Lx0))); } } } } } } ''') async_compile.wait(globals()) del async_compile def call(args): arg0_1, arg1_1, arg2_1 = args args.clear() assert_size_stride(arg0_1, (960, ), (1, )) assert_size_stride(arg1_1, (960, ), (1, )) assert_size_stride(arg2_1, (2, 960, 96, 96), (8847360, 1, 92160, 960)) buf0 = empty_strided_cpu((2, 32, 1, 1), (32, 1, 64, 64), torch.float32) buf1 = empty_strided_cpu((2, 32, 1, 1), (32, 1, 64, 64), torch.float32) buf3 = empty_strided_cpu((2, 960, 96, 96), (8847360, 1, 92160, 960), torch.float32) cpp_fused_native_group_norm_0(arg2_1, arg0_1, arg1_1, buf0, buf1, buf3) del arg0_1 del arg1_1 del arg2_1 return (buf3, ) ``` Co-authored-by: CaoE <e.cao@intel.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/126526 Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel	2024-08-07 06:00:12 +00:00
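The loop structure in the generated code above (a full 16-lane step followed by a single 14-lane masked step for a 30-element range) can be sketched in plain Python. The helpers `split_loop` and `masked_sum` are hypothetical and only model how the iteration space is partitioned; they do not reproduce Inductor's code generation or any real SIMD masking.

```python
def split_loop(n, vec_width):
    """Partition range(n) into (start, active_lanes) chunks.

    Full chunks use all vec_width lanes; a final partial chunk, if any,
    models the masked tail that replaces the scalar tail loop.
    """
    if n < 0 or vec_width <= 0:
        raise ValueError("n must be >= 0 and vec_width must be > 0")
    main_end = n - n % vec_width
    chunks = [(start, vec_width) for start in range(0, main_end, vec_width)]
    tail_lanes = n - main_end
    if tail_lanes:
        # One masked step covers the whole remainder instead of
        # tail_lanes scalar iterations.
        chunks.append((main_end, tail_lanes))
    return chunks


def masked_sum(values, vec_width):
    """Sum values chunk by chunk, touching only the active lanes of each chunk."""
    total = 0.0
    for start, lanes in split_loop(len(values), vec_width):
        total += sum(values[start:start + lanes])
    return total


if __name__ == "__main__":
    print(split_loop(30, 16))
    print(masked_sum(list(range(30)), 16))
```

For the sizes in the example above, the tail is handled in one step of 14 active lanes rather than 14 scalar iterations.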
Apurva Jain	8bc5ef563e	Grouped Query Attention (#132689 )
### Approach: Using the current function declaration
Constraint: Q_Heads % KV_Heads == 0
Major change:
- Added a new argument enable_gqa: bool to sdpa function call
- It adds a meaning to the third-to-last dimension.
Sample use cases this would enable: LLama3
```
# LLama3 8b call to SDPA
query = torch.rand(batch, 32, seq_len_q, D)
key = torch.rand(batch, 8, seq_len_kv, D)
value = torch.rand(batch, 8, seq_len_kv, D)
output = scaled_dot_product_attention(query, key, value, is_causal=True, enable_gqa=True)
# Output Shape (batch, 32, seq_len_q, D)
```
### Design Choice:
- Check if Query.size(-3) == Key.size(-3) == Value.size(-3) or, Query.size(-3) % Key.size(-3) == 0
- The function adjusts the key and value tensors to match the query tensor's head dimension by using repeat_interleave if their number of heads are not equal, facilitating correct and efficient computation in attention mechanisms.
- By default the enable_gqa flag is set to False, which ensures that regular sdpa functionality remains unchanged.
### Benchmarks:
- sdpa.py: #130634 For different batch sizes, enable_gqa=True shows a substantial improvement in the run time of sdpa
| batch_size | q_num_heads | kv_num_heads | q_seq_len | kv_seq_len | embed_dim | forward_time when enable_gqa=True | forward_time when enable_gqa=False |
| ------------ | ------------- | -------------- | ----------- | ------------ | ----------- | ----------- | ---------------- |
| 1 | 32 | 8 | 2048 | 2048 | 2048 | 100.71 | 119.70 |
| 8 | 32 | 8 | 2048 | 2048 | 2048 | 539.78 | 628.83 |
| 16 | 32 | 8 | 2048 | 2048 | 2048 | 1056.81 | 1225.48 |
| 32 | 32 | 8 | 2048 | 2048 | 2048 | 2099.54 | 2440.45 |
![Screenshot 2024-07-25 at 9 07 40 PM](https://github.com/user-attachments/assets/a3e5f716-c39f-4096-9e6c-82a735e57b7b)
- TorchTitan: https://github.com/pytorch/torchtitan/pull/458
Differential Revision: D60772086 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132689 Approved by: https://github.com/drisspg	2024-08-07 05:35:36 +00:00
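The head-sharing rule behind the `Q_Heads % KV_Heads == 0` constraint above can be sketched in plain Python. The helper `kv_head_for_each_query_head` is hypothetical; it only assumes the `repeat_interleave`-style expansion described in the design notes, where each key/value head is repeated for a consecutive group of query heads.

```python
def kv_head_for_each_query_head(q_heads, kv_heads):
    """Return, for every query head index, the key/value head it attends with.

    Mirrors repeat_interleave semantics: each KV head is repeated
    q_heads // kv_heads times in a row. Hypothetical helper, not a PyTorch API.
    """
    if q_heads <= 0 or kv_heads <= 0:
        raise ValueError("head counts must be positive")
    if q_heads % kv_heads != 0:
        raise ValueError(
            f"q_heads ({q_heads}) must be divisible by kv_heads ({kv_heads})"
        )
    group_size = q_heads // kv_heads
    return [kv for kv in range(kv_heads) for _ in range(group_size)]


if __name__ == "__main__":
    # Head counts from the Llama 3 8B example above: 32 query heads, 8 KV heads.
    mapping = kv_head_for_each_query_head(32, 8)
    print(mapping[:8])
```

With equal head counts the mapping is the identity, which corresponds to regular attention where no expansion is needed.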
Nicolas Macchioni	527f104a69	add L2 cache size to device properties (#132819 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/132819 Approved by: https://github.com/eellison	2024-08-07 04:55:06 +00:00
cyy	bfeb45e46b	[17/N] Fix clang-tidy warnings in jit (#132753 ) Follows #132604 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132753 Approved by: https://github.com/Skylion007	2024-08-07 03:47:54 +00:00
cyy	03480213de	[8/N] Fix clang-tidy warnings in aten/src/ATen (#132728 ) Follows #132727 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132728 Approved by: https://github.com/ezyang	2024-08-07 02:44:17 +00:00
Menglu Yu	919e384247	[PT2][Optimus] Add unbind_stack_to_cat_pass (#132542 ) Summary: We observe the stack node can be transformed to cat node to eliminate split nodes, which could further enable the unbind cat optimization, thus we add a more advanced pattern to do the graph transformation Test Plan: # unit test ``` CUDA_VISIBLE_DEVICES=3 OC_CAUSE=1 buck2 test //caffe2/test/inductor:split_cat_fx_passes ``` Buck UI: https://www.internalfb.com/buck2/de6c1cda-3d74-4a30-8980-7b209b6fe5dc Test UI: https://www.internalfb.com/intern/testinfra/testrun/12103424042268125 Network: Up: 485KiB Down: 728KiB (reSessionID-2f2c01c3-79bb-4e37-b5be-fb77ec09b264) Jobs completed: 29. Time elapsed: 5:19.8s. Cache hits: 0%. Commands: 4 (cached: 0, remote: 0, local: 4) Tests finished: Pass 9. Fail 0. Fatal 0. Skip 1. Build failure 0 # benchmark ``` CUDA_VISIBLE_DEVICES=3 OC_CAUSE=1 buck2 run mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode batch-split --model_type "ig_ctr" --flow_id 584880697 ``` P1503698962 before and after graph transformation https://www.internalfb.com/intern/diffing/?paste_number=1504050718 Differential Revision: D60411560 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132542 Approved by: https://github.com/jackiexu1992	2024-08-07 02:26:40 +00:00
Xuehai Pan	063a45ed27	Fix infinite recursion while walking to submodules (#132763 ) Fixes https://github.com/pytorch/pytorch/pull/132216#issuecomment-2271555873 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132763 Approved by: https://github.com/ezyang	2024-08-07 02:20:17 +00:00
leslie-fang-intel	73c083e02c	[Inductor][CPP] Turns on inline_inbuilt_nn_modules for CPP GEMM template testing (#132487 ) Summary: The CPP GEMM template testing was skipped when `inline_inbuilt_nn_modules` was turned on, as described in https://github.com/pytorch/pytorch/issues/131929. Now that https://github.com/pytorch/pytorch/pull/132334 has landed to fix the issues, turn this flag back on, since it is the default. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132487 Approved by: https://github.com/anijain2305, https://github.com/jgong5	2024-08-07 02:18:51 +00:00
Edward Z. Yang	ed224554eb	[BE] Don't unnecessarily suggest -k for rerunning tests locally (#132807 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/132807 Approved by: https://github.com/malfet	2024-08-07 02:15:18 +00:00
Edward Z. Yang	837898d9c8	Stop using preserve_rng_state as decorator (#132774 ) See https://github.com/pytorch/pytorch/pull/132073 for motivation Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/132774 Approved by: https://github.com/albanD	2024-08-07 01:07:12 +00:00
cyy	b01402b0a4	[7/N] Fix clang-tidy warnings in aten/src/ATen (#132727 ) Follows #132620 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132727 Approved by: https://github.com/Skylion007	2024-08-07 00:29:03 +00:00
drisspg	178dc0c9c7	various doc fixes (#132803 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/132803 Approved by: https://github.com/Chillee, https://github.com/joydddd, https://github.com/BoyuanFeng ghstack dependencies: #132799	2024-08-07 00:19:42 +00:00
drisspg	cb4d1bfb71	Clean up some tflop calc and add option for saving (#132799 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/132799 Approved by: https://github.com/BoyuanFeng	2024-08-07 00:19:42 +00:00
PyTorch MergeBot	cbee9c1fd2	Revert "Deprecate `torch._utils.is_compiling()` and `torch._dynamo.external_utils.is_compiling()` (#127690 )" This reverts commit `0e7e61f7ce`. Reverted https://github.com/pytorch/pytorch/pull/127690 on behalf of https://github.com/kit1980 due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/127690#issuecomment-2272370386))	2024-08-07 00:05:20 +00:00
Henry Tsang	e98eac76b3	[inductor] switch AotCodeCompiler to new cpp_builder. (take 3) (#132766 ) Summary: This is basically https://github.com/pytorch/pytorch/pull/131304 together with https://github.com/pytorch/pytorch/pull/132594 and absolute path fix for fbcode. Test Plan: ci Differential Revision: D60773405 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132766 Approved by: https://github.com/xuhancn, https://github.com/chenyang78, https://github.com/desertfire	2024-08-06 23:56:34 +00:00
PyTorch MergeBot	c7113a6186	Revert "[DeviceMesh] Create new group for 1D mesh when default backend is 'gloo' and 'cuda' is available (#132709 )" This reverts commit `1a23ef2ece`. Reverted https://github.com/pytorch/pytorch/pull/132709 on behalf of https://github.com/clee2000 due to I think this broke distributed/test_distributed_spawn.py::TestDistBackendWithSpawn::test_ddp_device_mesh_initialization [GH job link](https://github.com/pytorch/pytorch/actions/runs/10274519791/job/28432469987) [HUD commit link](`1a23ef2ece`). Test not run due to bad TD ([comment](https://github.com/pytorch/pytorch/pull/132709#issuecomment-2272350923))	2024-08-06 23:47:53 +00:00
rzou	0d6caeb259	Add logging + counter for missed reinplacing opportunities (#132758 ) Summary: - We add Inductor logs for what tensors we tried to reinplace, what tensors we were unable to reinplace, and of those tensors, which of those might be bugs (the "missed reinplacing opportunities"). You can tell this by reading the Inductor output graph but the logs make it easier to figure out. - Add a dynamo_compile counter for missed reinplacing opportunities. The goal is to see how widespread existing problems (if any) are. We've had trouble getting all of the edge cases for the reinplacing pass; the counter will help us hunt down issues. Test Plan: - tested locally Pull Request resolved: https://github.com/pytorch/pytorch/pull/132758 Approved by: https://github.com/eellison	2024-08-06 23:44:24 +00:00
mori360	cd7f527c59	[3/3] 3D Composability - move tp dp tests (#129802 ) pytorch (fsdp, tp, pp) -> pytorch (composable) Move (fsdp, tp, pp) tests under pytorch into a composable folder FSDP: test/distributed/_composable/fsdp/test_fully_shard_trainin.py -TestFullyShard2DTraining DP: test/distributed/tensor/parallel/test_ddp_2d_parallel.py TP: test/distributed/tensor/parallel/test_fsdp_2d_parallel.py PP: test/distributed/pipelining/test_composability.py => distributed/_composable/test_composability/test_2d_composability.py distributed/_composable/test_composability/test_pp_composability.py Pull Request resolved: https://github.com/pytorch/pytorch/pull/129802 Approved by: https://github.com/fduwjj ghstack dependencies: #129801	2024-08-06 23:07:07 +00:00
mori360	179b572fd9	[2/3] 3D Composability - move pp tests (#129801 ) pytorch (fsdp, tp, pp) -> pytorch (composable) Move (fsdp, tp, pp) tests under pytorch into a composable folder FSDP: test/distributed/_composable/fsdp/test_fully_shard_trainin.py -TestFullyShard2DTraining DP: test/distributed/tensor/parallel/test_ddp_2d_parallel.py TP: test/distributed/tensor/parallel/test_fsdp_2d_parallel.py PP: test/distributed/pipelining/test_composability.py => distributed/_composable/test_composability/test_2d_composability.py distributed/_composable/test_composability/test_pp_composability.py Pull Request resolved: https://github.com/pytorch/pytorch/pull/129801 Approved by: https://github.com/wconstab, https://github.com/atalman	2024-08-06 23:07:07 +00:00
Shangdi Yu	825002c9c6	[export][fx] More robust DCE pass (#132764 ) Summary: - make default DCE pass check schema, - need to rebase onto https://github.com/pytorch/pytorch/pull/131651 after it's in phabricator (for now the change is manually added). - mark Proxy dump as NotImplemented for better error msg - Remove Proxy from tensors when dumping models, as Proxy cannot be dumped. More details in https://docs.google.com/document/d/1G5vmTXjzxoyVGRI2kpA1gQukK_Glyg2NrE0Oh6Nlg9A/edit?usp=sharing. Test Plan: CI ``` - buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r qat_conv2d - test_export.py - buck2 run 'fbcode//mode/dev-nosan' fbcode//modai/test:test_modai -- -r test_qat_stinson_htp_export - buck2 run 'fbcode//mode/dev-nosan' fbcode//vizard_projects/ml_depth/tests:test_model -- -r test_qat_model_et - buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:fx -- -r dce - buck2 run 'fbcode//mode/dev-nosan' fbcode//bolt/nn/executorch/backends/tests:qnn_test -- -r test_qat_bias=False,use_3d_input=False - buck2 run 'fbcode//mode/dev-nosan' fbcode//bolt/nn/executorch/backends/tests:qnn_test -- -r test_qat_bias=True,use_3d_input=False - buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_fold_bn_erases_bn_node ``` Reviewed By: angelayi Differential Revision: D60319175 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132764 Approved by: https://github.com/angelayi	2024-08-06 22:27:22 +00:00
wz337	073cee531c	[Test][Easy] Remove print in test_device_mesh.py (#132780 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/132780 Approved by: https://github.com/XilunWu	2024-08-06 22:04:39 +00:00
wz337	1a23ef2ece	[DeviceMesh] Create new group for 1D mesh when default backend is 'gloo' and 'cuda' is available (#132709 ) More context in [#132471](https://github.com/pytorch/pytorch/issues/132471) and https://github.com/pytorch/pytorch/issues/132366. TLDR: When cuda is available and users move tensors to cuda, we cannot really reuse the default pg if default pg is gloo, as lots of collectives are not supported on gloo for cuda tensors. For example, `dtensor.full_tensor()` would result in a mysterious SIGTERM when all_gather a cuda tensor using gloo. Without the change in this PR, users would have to know the context and explicitly move the cuda tensor to cpu before invoking most collectives, which I think is not so ideal UX. Therefore, given most collectives are not supported on gloo for cuda tensors, we should init a new pg if the default pg is gloo when torch.cuda.is_available() and device_type is cuda. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132709 Approved by: https://github.com/awgu, https://github.com/wanchaol	2024-08-06 22:00:09 +00:00
eellison	18b678082e	[Easy] log output code path on cache hit (#132718 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/132718 Approved by: https://github.com/oulgen, https://github.com/masnesral	2024-08-06 21:59:30 +00:00
Edward Z. Yang	3c1033eeb0	Don't auto request review for reopened PRs (#132681 ) This will clobber previous approves. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/132681 Approved by: https://github.com/albanD, https://github.com/malfet	2024-08-06 21:36:18 +00:00
rzou	2073ddfd1c	Actually report the HOP and subclass/mode when there isn't a registration (#132550 ) Test Plan: - tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/132550 Approved by: https://github.com/ydwu4	2024-08-06 21:33:10 +00:00
yuqingj	623d0204f0	[NJT] Support Chunk backward for simple cases (#132193 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/132193 Approved by: https://github.com/soulitzer	2024-08-06 21:20:09 +00:00
Aart Bik	2f908ffa4a	[traced-graph][sparse] sparsity propagation for all current tests (#132690 ) This PR makes sure all current tests in the sparsity export test suite pass. Note that there will probably be anecdotal cases that need fixing after this, but the general idea of preserving sparsity metadata has been completed. Fixes: https://github.com/pytorch/pytorch/issues/117188 ``` $ PYTORCH_TEST_WITH_DYNAMO=0 python test/export/test_sparse.py ........................................................................................................................................................ ---------------------------------------------------------------------- Ran 152 tests OK ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/132690 Approved by: https://github.com/ezyang	2024-08-06 21:18:13 +00:00
dependabot[bot]	029f8fc701	Bump rexml from 3.2.8 to 3.3.3 in /ios/TestApp (#132469 ) Bumps [rexml](https://github.com/ruby/rexml) from 3.2.8 to 3.3.3. Release notes (sourced from [rexml's releases](https://github.com/ruby/rexml/releases)): REXML 3.3.3 - 2024-08-01, Improvements: - Added support for detecting invalid XML that has unsupported content before root element ([GH-184](https://redirect.github.com/ruby/rexml/issues/184), patch by NAITOH Jun) - Added support for `REXML::Security.entity_expansion_limit=` and `REXML::Security.entity_expansion_text_limit=` in SAX2 and pull parsers ([GH-187](https://redirect.github.com/ruby/rexml/issues/187), patch by NAITOH Jun) - Added more tests for invalid XMLs ([GH-183](https://redirect.github.com/ruby/rexml/issues/183), patch by Watson) - Added more performance tests (patch by Watson) - Improved parse performance ([GH-186](https://redirect.github.com/ruby/rexml/issues/186), patch by tomoya ishida). Thanks: NAITOH Jun, Watson, tomoya ishida. REXML 3.3.2 - 2024-07-16, Improvements: - Improved parse performance ([GH-160](https://redirect.github.com/ruby/rexml/issues/160), patch by NAITOH Jun) - Improved parse performance (GH-169, GH-170, GH-171, GH-172, GH-173, GH-174, GH-175, GH-176) ... (truncated) Commits: - `e4a067e` Add 3.3.3 entry - `17ff3e7` test: add a performance test for attribute list declaration - `be86b3d` test: fix wrong test name - `b93d790` test: use double quote for string literal - `0fbe7d5` test: don't use abbreviated name - `1599e87` test: add a performance test for PI with many tabs - `e2546e6` parse pi: improve invalid case detection - `73661ef` test: fix a typo - `850488a` test: use double quote for string literal - `46c6397` test: add performance tests for entity declaration - Additional commits viewable in [compare view](https://github.com/ruby/rexml/compare/v3.2.8...v3.3.3) Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/132469 Approved by: https://github.com/ezyang	2024-08-06 21:17:24 +00:00
PyTorch MergeBot	e47b684c33	Revert "Temp disable MKL in DistributionKernels.cpp (#132532 )" This reverts commit `7b2664ece6`. Reverted https://github.com/pytorch/pytorch/pull/132532 on behalf of https://github.com/PaliC due to causing numerical instability issues internally ([comment](https://github.com/pytorch/pytorch/pull/132532#issuecomment-2272136210))	2024-08-06 20:57:09 +00:00
Li Yu (ads)	94155ce31b	[Torch] Support meta device in checkpoint (#132684 ) Summary: ## Why utils.checkpoint doesn't support meta device: ``` File "/Users/lyu1/torchdev/lib/python3.9/site-packages/torch/utils/checkpoint.py", line 490, in checkpoint next(gen) File "/Users/lyu1/torchdev/lib/python3.9/site-packages/torch/utils/checkpoint.py", line 1359, in _checkpoint_without_reentrant_generator device_module = _get_device_module(device) File "/Users/lyu1/torchdev/lib/python3.9/site-packages/torch/utils/checkpoint.py", line 98, in _get_device_module device_module = getattr(torch, device) File "/Users/lyu1/torchdev/lib/python3.9/site-packages/torch/__init__.py", line 1938, in __getattr__ raise AttributeError(f"module '{__name__}' has no attribute '{name}'") AttributeError: module 'torch' has no attribute 'meta' ``` This blocks us from running a model with checkpoint enabled in meta mode. ## What This diff handles the case of meta device in checkpoint.py. (In checkpoint.py, the device module is mainly used when preserve_rng_state=true, which doesn't apply to the meta case. So a more elegant fix might be to set preserve_rng_state=false when detecting that args are on the meta device. But I didn't find where to do this check in the minimum way. Let me know if you have ideas.) Test Plan: Tested with toy model which has checkpoint on its module: P1513716944 Differential Revision: D60749427 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132684 Approved by: https://github.com/kit1980	2024-08-06 20:45:50 +00:00
Animesh Jain	de00c79583	[dynamo][inline_inbuilt_nn_modules] Mark nn module tensor static for cudagraphs (#132736 ) Fixes https://github.com/pytorch/pytorch/issues/132714 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132736 Approved by: https://github.com/mlazos ghstack dependencies: #132538	2024-08-06 20:13:28 +00:00
Shuo Ding	1954bfacda	[Inductor] Small performance, precision, and dependency updates to B2B-GEMM (#132354 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/132354 Approved by: https://github.com/masnesral	2024-08-06 20:01:27 +00:00
Tugsbayasgalan Manlaibaatar	775c310c0c	Preserve source_fn_stack in the training IR decomp (#132033 ) Title Differential Revision: [D60377712](https://our.internmc.facebook.com/intern/diff/D60377712/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/132033 Approved by: https://github.com/angelayi ghstack dependencies: #131988, #131995, #131999	2024-08-06 19:45:40 +00:00
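
The traceback quoted in #132684 (commit `94155ce31b` above) comes down to a `getattr(torch, device)` lookup failing because `torch` has no `meta` attribute. The following is a minimal sketch of that failure mode and of one possible guard. It uses a stand-in namespace instead of the real `torch` module so that it runs without PyTorch installed; `fake_torch`, `naive_device_module`, and `guarded_device_module` are hypothetical names, and returning `None` for `"meta"` is an assumption made for illustration, not necessarily what the merged change does.

```python
from types import SimpleNamespace

# Stand-in for the `torch` module (assumption): it exposes `cuda` and `cpu`
# submodules but, like the real module in the traceback, no `meta` attribute.
fake_torch = SimpleNamespace(
    cuda=SimpleNamespace(name="cuda"),
    cpu=SimpleNamespace(name="cpu"),
)


def naive_device_module(device: str):
    """Plain attribute lookup; raises AttributeError for 'meta'."""
    return getattr(fake_torch, device)


def guarded_device_module(device: str):
    """Hypothetical guard: skip the lookup for 'meta'.

    Meta tensors hold no data, so this sketch assumes there is no
    device-specific module to fetch and returns None instead.
    """
    if device == "meta":
        return None
    return getattr(fake_torch, device)


def lookup_fails(device: str) -> bool:
    """Report whether the naive lookup raises AttributeError."""
    try:
        naive_device_module(device)
    except AttributeError:
        return True
    return False
```

A caller using the guarded version would need to handle the `None` case itself, for example by skipping any RNG-state bookkeeping, which the commit message notes does not apply to the meta case.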

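The rule described in #132709 (commit `1a23ef2ece`, later reverted by `c7113a6186`) is a three-way condition: create a new process group when the default backend is gloo, CUDA is available, and the mesh device type is cuda. A minimal sketch of that condition as a pure function follows; `should_create_new_group` and its parameters are hypothetical names chosen for illustration and are not part of the DeviceMesh API.

```python
def should_create_new_group(
    default_backend: str,
    cuda_available: bool,
    device_type: str,
) -> bool:
    """Decide whether a 1D mesh should get its own process group.

    Mirrors the condition stated in the commit message: the default group
    cannot be reused when it is gloo and tensors will live on CUDA, because
    many collectives are not supported on gloo for CUDA tensors.
    """
    return (
        default_backend == "gloo"
        and cuda_available
        and device_type == "cuda"
    )
```

In real code the three inputs would presumably come from the default process group's backend, `torch.cuda.is_available()`, and the mesh's device type; they are passed in explicitly here so the rule can be checked in isolation.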
76779 commits