Commit graph

2707 commits

Author SHA1 Message Date
Davide Italiano
d4ad7b91ad [mps] Move zeta() to special_math.h. (#146231)
In preparation for implementing digamma/polygamma

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146231
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-02-01 19:22:59 +00:00
PyTorch MergeBot
c39c679813 Revert "Tensor .cuda() very slow with specific array sizes (#138964)"
This reverts commit 98f87edd23.

Reverted https://github.com/pytorch/pytorch/pull/138964 on behalf of https://github.com/huydhn due to Sorry for reverting your PR but some slow tests started failing after this landed ([comment](https://github.com/pytorch/pytorch/pull/138964#issuecomment-2628455198))
2025-01-31 21:48:51 +00:00
Donald Tolley
98f87edd23 Tensor .cuda() very slow with specific array sizes (#138964)
### **Pull Request: Optimized Non-Contiguous Tensor Copy for CPU to GPU in PyTorch**

#### **Summary**
This PR addresses the performance issue identified in [#111570](https://github.com/pytorch/pytorch/issues/111570), where non-contiguous tensors took significantly longer to transfer from CPU to GPU. Through detailed tracing of the call flow, we identified that PyTorch was creating temporary contiguous buffers for non-contiguous tensor transfers, which introduced unnecessary overhead.

#### **Tracing the Issue**
To pinpoint the cause of the slowdown, we followed the call flow from Python’s `tensor.cuda()` method through PyTorch’s backend, ultimately identifying `copy_kernel_cuda` as the key function responsible for CPU-to-GPU tensor transfers. Here’s a summary of the tracing process:

1. **Python Call: `tensor.cuda()`**
   - Starting from Python, the `cuda()` method initiates the tensor transfer to the GPU.

2. **`TensorBody.h: cuda()`**
   - The `cuda()` method calls `to()`, specifying the target device as CUDA.

3. **`Tensor.cpp: TensorBase::to()`**
   - The `to()` function prepares device and data type options before invoking `_ops::to_dtype_layout::call()`.

4. **Operator Call: `_ops::to_dtype_layout::call()`**
   - This operator dispatches the request to the backend-specific function responsible for managing the transfer.

5. **`Copy.cpp: copy_()`**
   - The `copy_()` function performs preliminary checks (e.g., zero-tensor immutability) and proceeds to call `copy_impl()`.

6. **`Copy.cpp: copy_impl()`**
   - This function sets up a tensor iterator and dispatches the copy operation to the appropriate backend through `copy_stub`.

7. **Dispatch to CUDA: `copy_stub`**
   - The dispatch mechanism routes the call to the CUDA-specific function, `copy_kernel_cuda`.

8. **`Copy.cu: copy_kernel_cuda()`**
   - Here, we identified that PyTorch was creating temporary contiguous buffers for 1D and 2D non-contiguous tensors, which slowed down the copy process. This behavior is managed by the `copy_requires_temporaries()` function.

#### **Solution**
To address this, we modified `copy_kernel_cuda` to handle non-contiguous 1D and 2D tensors directly by using `cudaMemcpy2DAsync`, which allows efficient, stride-aware memory transfers without temporary buffers. Here’s why this approach improves performance:

- **Efficiency of `cudaMemcpy2DAsync`**: This CUDA function is optimized for pitched (stride-based) memory transfers, allowing it to handle non-contiguous data layouts effectively by specifying memory strides for source and destination tensors.
- **Reduction of Overhead**: By directly copying non-contiguous tensors without intermediate buffers, we eliminate extra memory allocation and achieve faster CPU-to-GPU transfers.
- **Asynchronous Execution**: `cudaMemcpy2DAsync` enables asynchronous transfer on the CUDA stream, further improving performance by taking advantage of CUDA's optimized memory handling for non-contiguous layouts.
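As a concrete illustration of the stride-to-pitch mapping, here is a minimal hedged sketch (a hypothetical helper, not the PR's actual code) of a strided 2D host-to-device copy expressed through `cudaMemcpy2DAsync`:

```cpp
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical helper: copy `rows` logical rows of `cols` floats from a
// non-contiguous host buffer (row starts `src_stride` elements apart,
// src_stride >= cols) into a packed device buffer, without staging through
// a temporary contiguous buffer.
cudaError_t copy2d_strided_h2d(float* dst, const float* src, size_t rows,
                               size_t cols, size_t src_stride,
                               cudaStream_t stream) {
  const size_t width_bytes = cols * sizeof(float);   // payload bytes per row
  const size_t spitch = src_stride * sizeof(float);  // bytes between row starts
  // dst is contiguous, so its pitch equals the row width.
  return cudaMemcpy2DAsync(dst, width_bytes, src, spitch, width_bytes, rows,
                           cudaMemcpyHostToDevice, stream);
}
```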

#### **Performance Results**

In my testing, I created tensors of size `327680 x 2000` and used slices for transfer performance measurements. The tests show that the average time for transferring a non-contiguous slice (e.g., rows 10,000 to 50,000) from CPU to GPU now closely matches the contiguous case. This improvement indicates that the updated implementation effectively addresses the performance discrepancy. Below are the measured times and validation checks:

```plaintext
Average time for contiguous slice (rows 10,000-50,000): 66 ms
Average time for non-contiguous slice (rows 10,000-50,000): 66 ms

Validation of contiguous and non-contiguous tensor copies:
 PASS: Tensor shapes match.
 PASS: Tensor contiguity matches.
 PASS: Tensor contents match.
 PASS: Tensor data types match.

 Success: Both contiguous and non-contiguous tensors were copied correctly to the GPU.
```

#### **Conclusion**
This PR resolves the identified performance issue by eliminating the need for temporary buffers in non-contiguous 1D and 2D tensor transfers, ensuring faster and more efficient copies from CPU to GPU. Future optimizations could further enhance performance for higher-dimensional non-contiguous tensors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138964
Approved by: https://github.com/jeffdaily

Co-authored-by: Natalia Gimelshein <ngimel@gmail.com>
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-01-31 17:05:02 +00:00
Natalia Gimelshein
08ff11e9d0 initialize device when pinning memory on this device, short circuit is_pinned if device is not initialized (#145752)
Do not land
RFC
potential fix for #144687

Now `.is_pinned(device="cuda")` does not initialize the device and thus doesn't poison the fork (though it complains about the `device` arg being deprecated). To avoid needing the `device=` arg, we'd need to fix get_accelerator to not initialize the device.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145752
Approved by: https://github.com/albanD

Co-authored-by: albanD <albandes@fb.com>
2025-01-30 21:37:29 +00:00
cyyever
8a6e9a88e9 Let PYTORCH_NO_CUDA_MEMORY_CACHING have an effect only when its value is 1 (#145905)
Fixes #145661
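A minimal sketch of the gating described in the title (an assumed helper, not the actual allocator code): the environment variable now disables caching only when its value is exactly `1`.

```cpp
#include <cstdlib>
#include <cstring>

// Assumed helper: PYTORCH_NO_CUDA_MEMORY_CACHING takes effect only when
// set to exactly "1", not merely when it is defined.
bool caching_allocator_disabled() {
  const char* env = std::getenv("PYTORCH_NO_CUDA_MEMORY_CACHING");
  return env != nullptr && std::strcmp(env, "1") == 0;
}
```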

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145905
Approved by: https://github.com/eqy, https://github.com/janeyx99

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
2025-01-30 05:11:10 +00:00
cyy
116af809eb Use std::string_view (#145906)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145906
Approved by: https://github.com/albanD
2025-01-30 03:14:27 +00:00
Yu, Guangye
3b3aac0cde Filter out iGPU if dGPU is found on XPU (#144378)
# Motivation
for https://github.com/pytorch/pytorch/issues/143914
On Windows, there are two separate SYCL platforms for iGPU and dGPU. To simplify the logic, we will exclude iGPUs when a dGPU is present. This ensures that all XPU devices enumerated by PyTorch share the same SYCL context.

Now I generalize the logic as below (see the sketch after this list):
1. We find the first L0 platform containing at least one dGPU and enumerate all dGPUs of that platform.
2. If no dGPU is found, we find the first L0 platform containing an iGPU and enumerate all iGPUs of that platform.
3. If no GPU is found (neither iGPU nor dGPU), nothing is enumerated.
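A hedged sketch of that selection policy (the `Platform` type and device lists are stand-ins, not the actual SYCL/Level Zero API):

```cpp
#include <vector>

// Stand-in for a SYCL platform and the GPUs it exposes.
struct Platform {
  std::vector<int> dgpus;  // discrete GPU indices on this platform
  std::vector<int> igpus;  // integrated GPU indices on this platform
};

std::vector<int> enumerate_xpu_devices(const std::vector<Platform>& platforms) {
  // 1. Prefer the first platform that contains at least one dGPU.
  for (const auto& p : platforms)
    if (!p.dgpus.empty()) return p.dgpus;
  // 2. Otherwise fall back to the first platform with an iGPU.
  for (const auto& p : platforms)
    if (!p.igpus.empty()) return p.igpus;
  // 3. No GPU found at all.
  return {};
}
```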
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144378
Approved by: https://github.com/EikanWang, https://github.com/gujinghui
2025-01-29 15:53:16 +00:00
Nikita Shulga
3fd4691908 [MPS] Add op_math_t (#145808)
Similar to `at::opmath_t`, to be used for reductions (and int matmuls)
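A toy sketch of the trait's shape (the names and the `Half` stand-in are illustrative, not the actual header): widen low-precision types so reductions accumulate at full precision.

```cpp
#include <type_traits>

struct Half { unsigned short bits; };  // stand-in for a 16-bit float type

// Map each type to the type the math should be performed in.
template <typename T> struct OpMath { using type = T; };
template <> struct OpMath<Half> { using type = float; };  // accumulate in float

template <typename T> using op_math_t = typename OpMath<T>::type;

static_assert(std::is_same_v<op_math_t<Half>, float>);
static_assert(std::is_same_v<op_math_t<double>, double>);
```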
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145808
Approved by: https://github.com/dcci
2025-01-28 18:03:52 +00:00
cyy
c751541e79 Fix cppcoreguidelines-init-variables ignorance (#141795)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141795
Approved by: https://github.com/albanD
2025-01-28 17:11:37 +00:00
cyy
67fcc7cf02 [3/N] Remove unnecessary once flag usage (#145672)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145672
Approved by: https://github.com/albanD
2025-01-28 04:28:18 +00:00
Nikita Shulga
3a23d75b37 [MPS] Fix c10::metal::log_gamma correctness on M4 (#145740)
To work around a bug where the `abs` method call seems to be ignored before calling log; this can be reproduced by running the following code (submitted as FB16415011):
```swift
import Metal

func run_shader<T: BinaryFloatingPoint> (library: MTLLibrary, kernel_name: String, type: T.Type, nelem: Int = 16) {
  guard let mfunc = library.makeFunction(name: kernel_name) else { fatalError("Can't find function") }
  let device = library.device
  guard let queue = device.makeCommandQueue() else { fatalError("Can't make queue") }
  guard let cmdBuffer = queue.makeCommandBuffer() else { fatalError("Can't make command buffer") }
  guard let computeEncoder = cmdBuffer.makeComputeCommandEncoder() else { fatalError("Can't make compute encoder") }
  guard let ibuf = device.makeBuffer(length:nelem * MemoryLayout<T>.size, options: [.storageModeShared]) else { fatalError("Can't alloc") }
  let ibuf_data = ibuf.contents().assumingMemoryBound(to: T.self)
  for i in 0..<nelem {
    ibuf_data[i] = T(sin(Float(2 + i)))
  }
  guard let obuf = device.makeBuffer(length:nelem * MemoryLayout<T>.size, options: [.storageModeShared]) else { fatalError("Can't alloc") }
  let obuf_data = obuf.contents().assumingMemoryBound(to: T.self)

  computeEncoder.setComputePipelineState(try! device.makeComputePipelineState(function: mfunc))
  computeEncoder.setBuffer(obuf, offset:0, index: 0)
  computeEncoder.setBuffer(ibuf, offset:0, index: 1)
  computeEncoder.dispatchThreads(MTLSizeMake(nelem, 1, 1), threadsPerThreadgroup:MTLSizeMake(nelem, 1, 1))
  computeEncoder.endEncoding()
  cmdBuffer.commit()
  cmdBuffer.waitUntilCompleted()

  print("Results for \(String(describing: T.self)):", terminator: " ")
  for i in 0..<nelem {
    print(obuf_data[i], terminator: " ")
  }
  print()
}

let shader_source = """
#include <metal_stdlib>

template<typename T>
float foo(T x) {
  const auto abs_x = ::metal::abs(static_cast<float>(x));
  auto rc = ::metal::log(abs_x);

  return rc - ::metal::log(::metal::abs(abs_x * ::metal::sinpi(abs_x)));
}

kernel void half_kernel(
    device half* out_ptr0,
    constant half* in_ptr0,
    uint xindex [[thread_position_in_grid]]
) {
  auto inp = in_ptr0[xindex];
  auto out = foo(inp);
  out_ptr0[xindex] = static_cast<half>(out);
}

kernel void float_kernel(
    device float* out_ptr0,
    constant float* in_ptr0,
    uint xindex [[thread_position_in_grid]]
) {
  auto inp = in_ptr0[xindex];
  auto out = foo(inp);
  out_ptr0[xindex] = static_cast<float>(out);
}
"""
let options = MTLCompileOptions()
options.mathMode = .safe
options.mathFloatingPointFunctions = .precise

guard let device = MTLCopyAllDevices().first else { fatalError("No Metal device found") }
let library = try! device.makeLibrary(source:shader_source, options:options)
run_shader(library:library, kernel_name:"half_kernel", type: Float16.self)
run_shader(library:library, kernel_name:"float_kernel", type: Float.self)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145740
Approved by: https://github.com/dcci
2025-01-27 21:24:22 +00:00
Nikita Shulga
71caac2b30 [MPSInductor] Add rand support (#145705)
Using Philox4 as PRNG

Test plan (other than CI)
Run
```python
import torch
from torch._inductor.utils import run_and_get_code
from contextlib import nullcontext

def foo(x):
   return x * torch.randn_like(x)

foo_c = torch.compile(foo)

x = torch.ones(100, 100, device="mps")

y = foo_c(x)

print(y.mean().item(), y.std().item())
for i in range(25):
  print(y[i].mean(), y[i].std())
```
And observe that the printed values are close to 0 and 1.

TODO: Better `randint` algorithm for large ranges

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145705
Approved by: https://github.com/dcci, https://github.com/jansel
2025-01-27 06:07:36 +00:00
Yichen Yan
ed015143ef Set RUNPATH on CUDA and XPU tests (#144305)
#136627 almost fixed the issue of test binaries' runpath not being set correctly, with a few cases left.

This PR fixes the rest.

The binaries are found by `auditwheel repair` a wheel built with `BUILD_TEST=1`.

@malfet

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144305
Approved by: https://github.com/malfet
2025-01-26 08:40:22 +00:00
Simon Mahns
6939a56e13 [autocast][pytorch] Support autocast for MTIA (#145627)
Summary: Add autocast support to MTIA

Reviewed By: egienvalue

Differential Revision: D68572548

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145627
Approved by: https://github.com/egienvalue
2025-01-25 03:24:59 +00:00
Davide Italiano
57591edca1 [mps/inductor] Add support for erfinv. (#145643)
After several rounds of refactoring, this seems to be done now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145643
Approved by: https://github.com/malfet, https://github.com/jansel
2025-01-24 22:55:44 +00:00
Davide Italiano
f56c638849 [c10/metal] Add a vectype variant for short/int/long (#145430)
Some of the kernels (exp_complex/atan_complex) need the specialization.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145430
Approved by: https://github.com/malfet, https://github.com/jansel
2025-01-23 04:52:56 +00:00
Nikita Shulga
70ccbade83 [MPSInductor] Add gamma op (#145341)
By moving `gamma` and `log_gamma` implementation from `Gamma.metal` to `c10/metal/special_math.h`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145341
Approved by: https://github.com/Skylion007, https://github.com/dcci
ghstack dependencies: #145309
2025-01-22 19:37:45 +00:00
Isuru Fernando
4b77ff9784 Fix PythonMod printing for C++ (#143385)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143385
Approved by: https://github.com/leslie-fang-intel, https://github.com/anijain2305
2025-01-22 14:58:35 +00:00
Nikita Shulga
1908116ace [MPS][BE] Move vectypes from Quantized to utils (#145312)
That allows one to get appropriate vectorized types for templates using `c10::metal::vec2type_t<>` or `c10::metal::vec4type_t<>`
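A toy C++ sketch of what such a trait looks like (illustrative; Metal's built-in `float2`/`float4` play this role in the real header):

```cpp
// Stand-ins for Metal's built-in vector types.
struct float2 { float x, y; };
struct float4 { float x, y, z, w; };

// Map a scalar type to its 2-wide vector counterpart.
template <typename T> struct Vec2Type;
template <> struct Vec2Type<float> { using type = float2; };

template <typename T> using vec2type_t = typename Vec2Type<T>::type;

static_assert(sizeof(vec2type_t<float>) == 2 * sizeof(float));
```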
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145312
Approved by: https://github.com/dcci
2025-01-22 00:37:28 +00:00
Davide Italiano
8cc415774f [mps/inductor] Introduce a metal approx for erf() and use it. (#145161)
Probably we can do better, but this is a start.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145161
Approved by: https://github.com/malfet
2025-01-19 02:29:05 +00:00
Nikita Shulga
cede43e06b [MPSInductor][BE] NaN-propagating min/max to header (#145157)
May later be reused from the eager ops as well.

Also, I didn't know that Metal already has type_traits.
We use `metal::isunordered(a, b)` instead of `metal::isnan(a + b)`: it is defined as a function equivalent to `a != a || b != b`, but I suspect it might have the best native implementation for the specific architecture.
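A plain C++ analogue of a NaN-propagating min built on `isunordered` (illustrative; the commit's version lives in a Metal header):

```cpp
#include <cmath>
#include <limits>

template <typename T>
T nan_propagating_min(T a, T b) {
  // std::isunordered(a, b) is true iff either operand is NaN, i.e.
  // (a != a) || (b != b), mirroring metal::isunordered.
  if (std::isunordered(a, b)) {
    return std::numeric_limits<T>::quiet_NaN();
  }
  return a < b ? a : b;
}
```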

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145157
Approved by: https://github.com/dcci
2025-01-18 22:52:44 +00:00
Nikita Shulga
dc9b77cc55 [MPS] Support includes in metal objects (#145087)
Useful for code reuse for Metal shaders built both in eager mode and in MPSInductor, but it requires implementing a `_cpp_embed_headers` tool that, as the name suggests, preprocesses and embeds the headers into the shader to be used in dynamic compilation.
Test using:
 -  `TestMetalLibrary.test_metal_include`
 - Moving `i0`/`i1` implementation to `c10/util/metal_special_math.h` and calling it from the `SpecialOps.metal` shader, which now looks much more compact:
 ```metal
template <typename T, typename Tout = T>
void kernel
i0(constant T* input,
   device Tout* output,
   uint index [[thread_position_in_grid]]) {
  output[index] = c10::i0(static_cast<Tout>(input[index]));
}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145087
Approved by: https://github.com/dcci
ghstack dependencies: #145023
2025-01-18 05:35:22 +00:00
Scott Wolchok
69b883d7ac Remove C10_EMBEDDED (#144808)
I added this to support code sharing with ExecuTorch, but the operator<< overrides are load-bearing for builds: we have other code that attempts to pretty-print Half/BFloat16, and implicit conversions can't be used to make that work because there are *multiple* implicit conversions from Half/BFloat16 to primitive types, so which one to select is ambiguous. Also, we don't actually seem to need it in ExecuTorch core now, because we have `#include <ostream>` in there at the moment anyway.
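A toy reproduction of the ambiguity (not c10's actual Half):

```cpp
#include <iostream>

// Two implicit conversions to primitive types make `os << h` ambiguous:
// both ostream::operator<<(int) and ostream::operator<<(float) are viable.
struct ToyHalf {
  operator float() const { return 1.0f; }
  operator int() const { return 1; }
};

// A dedicated overload resolves it by choosing the conversion explicitly.
std::ostream& operator<<(std::ostream& os, const ToyHalf& h) {
  return os << static_cast<float>(h);
}

int main() {
  std::cout << ToyHalf{} << '\n';  // prints 1 via the float conversion
}
```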

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144808
Approved by: https://github.com/janeyx99, https://github.com/malfet
2025-01-15 06:08:53 +00:00
fan.mo
64829b356a [PrivateUse1] Support parseDispatchKey with modified PrivateUse1 (#144325)
PyTorch now supports many PrivateUse1 backend names like `AutogradPrivateUse1` or `QuantizedPrivateUse1`, not to mention the original `PrivateUse1` backend.

However, users who implement `PrivateUse1` functionality may rename the backend by calling `torch.utils.rename_privateuse1_backend("my_backend")`; in that case, the `PrivateUse1` backend string will not be found when we call other functions related to it. For example, if we use `torch.library` to register some custom functions for our new backend, we would use "my_backend" as the backend name instead of "PrivateUse1", and the following error would be thrown:
```
could not parse dispatch key 'my_backend'
```

So, this PR changes the function `c10::DispatchKey parseDispatchKey(const std::string& k)`: it double-checks whether the `PrivateUse1` backend name has been modified, and if so, rewrites `k` to the canonical name and looks it up again.
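A hedged sketch of the retry idea (the helper's shape is hypothetical, not the actual c10 code):

```cpp
#include <functional>
#include <optional>
#include <string>

// `raw_lookup` stands in for the real key-table lookup; `backend` is the
// user-chosen PrivateUse1 name, e.g. "my_backend".
std::optional<int> parse_dispatch_key(
    std::string k, const std::string& backend,
    const std::function<std::optional<int>(const std::string&)>& raw_lookup) {
  if (auto key = raw_lookup(k)) return key;
  // Restore the canonical name and retry: e.g. "my_backend" -> "PrivateUse1".
  if (auto pos = k.find(backend); pos != std::string::npos) {
    k.replace(pos, backend.size(), "PrivateUse1");
    return raw_lookup(k);
  }
  return std::nullopt;
}
```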

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144325
Approved by: https://github.com/albanD
2025-01-14 21:21:29 +00:00
PyTorch MergeBot
bdd942efd7 Revert "Increase C10_COMPILE_TIME_MAX_GPUS to 128 (#144138)"
This reverts commit 6cfc081675.

Reverted https://github.com/pytorch/pytorch/pull/144138 on behalf of https://github.com/albanD due to This seems to impact the caffe2 code ([comment](https://github.com/pytorch/pytorch/pull/144138#issuecomment-2590891200))
2025-01-14 19:04:12 +00:00
Richard Barnes
1dab79470d c10::string_view -> std::string_view in pytorch (#143591)
Test Plan: Sandcastle

Differential Revision: D67312322

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143591
Approved by: https://github.com/malfet
2025-01-13 21:44:05 +00:00
Nikita Shulga
9ae35b8bb1 [BE] Introduce c10::SyntaxError (#144647)
Which will be translated into Python's SyntaxError
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144647
Approved by: https://github.com/Skylion007
2025-01-12 23:23:54 +00:00
cyy
6cfc081675 Increase C10_COMPILE_TIME_MAX_GPUS to 128 (#144138)
To facilitate further possible changes of DeviceIndex to int16_t.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144138
Approved by: https://github.com/albanD
2025-01-10 23:53:19 +00:00
Scott Wolchok
0529908f13 Remove is_reduced_floating_point from namespace std (#144502)
Partial fix for #144495. Avoiding BC-break using existing practice of removing only if FBCODE_CAFFE2 and C10_NODEPRECATED are not defined.

Differential Revision: [D67992342](https://our.internmc.facebook.com/intern/diff/D67992342/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144502
Approved by: https://github.com/malfet
2025-01-10 03:24:10 +00:00
cyy
9a841f9321 Enable bugprone-unchecked-optional-access (#144226)
We can actually enable bugprone-unchecked-optional-access without the risk of a hang.
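For reference, the kind of code this clang-tidy check flags (illustrative):

```cpp
#include <optional>

int bad(const std::optional<int>& o) {
  return *o;  // flagged: dereference without checking has_value()
}

int good(const std::optional<int>& o) {
  return o.has_value() ? *o : 0;  // checked access passes the lint
}
```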

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144226
Approved by: https://github.com/albanD
2025-01-10 03:16:56 +00:00
cyy
d0070ca07e [18/N] Fix extra warnings brought by clang-tidy-17 (#144014)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144014
Approved by: https://github.com/Skylion007, https://github.com/albanD
2025-01-08 17:21:55 +00:00
Michal Gallus
93633d0e80 [ROCm][Windows] Fix export macros (#144098)
For correct import and export of functions when dynamic linkage is used for HIP libraries on Windows, the appropriate export/import macros need to be put in place. This pull request reuses the existing CUDA import/export macros by converting them to the corresponding HIP macros during the hipification process.
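The generic shape of such export/import macros (the macro names here are illustrative, not PyTorch's actual ones):

```cpp
// On Windows, symbols must be explicitly exported while building the DLL
// and imported while consuming it; on other platforms visibility suffices.
#ifdef _WIN32
#  ifdef MYLIB_BUILD_SHARED
#    define MYLIB_API __declspec(dllexport)
#  else
#    define MYLIB_API __declspec(dllimport)
#  endif
#else
#  define MYLIB_API __attribute__((visibility("default")))
#endif

MYLIB_API void some_hip_entry_point();
```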

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144098
Approved by: https://github.com/jeffdaily
2025-01-04 17:12:46 +00:00
Yu, Guangye
09e47ab7ab Refine CUDA Stream priority (#143849)
# Motivation
As mentioned in https://github.com/pytorch/pytorch/pull/141119#discussion_r1897480515, we now properly handle priority values that fall outside of the valid priority range.

# Additional Context
If the value falls outside of the allowed priority range, it will automatically be mapped to the nearest valid priority (either lowest or highest).
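A sketch of that clamping with the CUDA runtime API (the helper is illustrative; the actual change lives in c10's stream code). Note that in CUDA a numerically lower value means a higher priority, so the valid range is `[greatest, least]`:

```cpp
#include <cuda_runtime.h>
#include <algorithm>

cudaStream_t create_stream_with_clamped_priority(int requested) {
  int least = 0, greatest = 0;
  cudaDeviceGetStreamPriorityRange(&least, &greatest);
  // Map an out-of-range request to the nearest valid priority.
  const int priority = std::clamp(requested, greatest, least);
  cudaStream_t stream = nullptr;
  cudaStreamCreateWithPriority(&stream, cudaStreamNonBlocking, priority);
  return stream;
}
```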

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143849
Approved by: https://github.com/albanD, https://github.com/EikanWang
ghstack dependencies: #142347, #141119, #141123, #143799
2024-12-31 11:15:59 +00:00
Yu, Guangye
a68c0ca497 Add low priority XPU Stream (#141119)
# Motivation
Because an external SYCL queue may have a low priority, we need to support low-priority SYCL queues for native XPU Streams to maintain consistency.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141119
Approved by: https://github.com/gujinghui, https://github.com/albanD
ghstack dependencies: #142347
2024-12-31 11:15:45 +00:00
Yu, Guangye
39450ae655 Refine XPU external Stream (#142347)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142347
Approved by: https://github.com/gujinghui, https://github.com/albanD
2024-12-31 11:15:38 +00:00
cyy
af629a8146 Enable readability-redundant-declaration (#143982)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143982
Approved by: https://github.com/Skylion007
2024-12-31 00:20:10 +00:00
cyy
dca443835e Enable more readability-redundant checks (#143963)
They are helpful to simplifying code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143963
Approved by: https://github.com/albanD
2024-12-30 14:49:33 +00:00
Yu, Guangye
07fa6e2c8b Fix torch.accelerator api abort when passing invaild device (#143550)
# Motivation
Fix https://github.com/pytorch/pytorch/issues/143543

# Solution
We should raise a Python exception instead of aborting.

# Additional Context
without this PR:
```python
>>> import torch
>>> torch.accelerator.current_stream(torch.accelerator.device_count())
terminate called after throwing an instance of 'c10::Error'
  what():  device is out of range, device is 2, total number of device is 2.
Exception raised from check_device_index at /home/dvrogozh/git/pytorch/pytorch/c10/xpu/XPUFunctions.h:36 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xac (0x7f30707eb95c in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xf3 (0x7f307078fc57 in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10.so)
frame #2: <unknown function> + 0x19a3e (0x7f3070c2ba3e in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10_xpu.so)
frame #3: c10::xpu::getCurrentXPUStream(signed char) + 0x2f (0x7f3070c2c83f in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10_xpu.so)
frame #4: <unknown function> + 0x1ca35 (0x7f3070c2ea35 in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10_xpu.so)
frame #5: <unknown function> + 0x653f15 (0x7f3083391f15 in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x39e5f2 (0x7f30830dc5f2 in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libtorch_python.so)
<omitting python frames>
frame #20: <unknown function> + 0x29d90 (0x7f308b19bd90 in /lib/x86_64-linux-gnu/libc.so.6)
frame #21: __libc_start_main + 0x80 (0x7f308b19be40 in /lib/x86_64-linux-gnu/libc.so.6)

Aborted (core dumped)
```
with this PR:
```python
>>> import torch
>>> torch.accelerator.current_stream(torch.accelerator.device_count())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/pt-gpu/4T-4652/guangyey/stock-pytorch/torch/accelerator/__init__.py", line 123, in current_stream
    return torch._C._accelerator_getStream(device_index)
RuntimeError: The device index is out of range. It must be in [0, 2), but got 2.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143550
Approved by: https://github.com/EikanWang, https://github.com/dvrogozh, https://github.com/albanD
2024-12-23 03:44:22 +00:00
Richard Barnes
e2d47a133b Disable c10::optional macros (#138912)
Test Plan: Sandcastle

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138912
Approved by: https://github.com/Skylion007, https://github.com/malfet
2024-12-17 09:22:47 +00:00
Richard Barnes
82ce888273 c10::string_view -> std::string_view in more places (#142517)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142517
Approved by: https://github.com/malfet
2024-12-12 19:45:59 +00:00
cyyever
a108b282ff [4/N] Avoid copy in std::get (#142285)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142285
Approved by: https://github.com/Skylion007

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2024-12-09 07:59:35 +00:00
xinan.lin
3d227ae315 [Intel GPU] Support getStreamFromExternal for XPU. (#140268)
In the AOT Inductor scenario, the GPU stream can be created outside of the pool of `XPUStream`, and we need to create an `XPUStream` that refers to this stream for the common logic of AOTI; for example, a stream guard is a guard for `XPUStream`. So we add `getStreamFromExternal`, following the design of CUDAStream.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140268
Approved by: https://github.com/desertfire, https://github.com/jansel, https://github.com/EikanWang
2024-12-07 19:22:04 +00:00
Nikita Shulga
3a3638be50 [BE] Enable Scalar.h compilation on 32-bit system (#142235)
By hiding the ambiguous `Scalar(long long)` constructor behind `std::enable_if_t<sizeof(void *) == 8>`

Followup after https://github.com/pytorch/pytorch/pull/141244

Test Plan: Run `printf "#include <c10/core/Scalar.h>\n c10::Scalar x(3);" | gcc -x c++ -std=c++17 -I. -Ibuild - -c` on ARMv7 system.
Before this change it failed with:
```
In file included from <stdin>:1:
./c10/core/Scalar.h:83:3: error: ‘c10::Scalar::Scalar(long long int)’ cannot be overloaded with ‘c10::Scalar::Scalar(int64_t)’
   83 |   Scalar(long long vv) : Scalar(vv, true) {}
      |   ^~~~~~
./c10/core/Scalar.h:50:3: note: previous declaration ‘c10::Scalar::Scalar(int64_t)’
   50 |   Scalar(type vv) : Scalar(vv, true) {}
      |   ^~~~~~
./c10/core/ScalarType.h:288:3: note: in expansion of macro ‘DEFINE_IMPLICIT_CTOR’
  288 |   _(int64_t, Long)                                \
      |   ^
./c10/core/Scalar.h:52:3: note: in expansion of macro ‘AT_FORALL_SCALAR_TYPES_AND7’
   52 |   AT_FORALL_SCALAR_TYPES_AND7(
      |   ^~~~~~~~~~~~~~~~~~~~~~~~~~~
```
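A minimal sketch of the SFINAE trick on a toy `Scalar` (not c10's actual class): the `long long` constructor exists only where it is a distinct type from `int64_t`.

```cpp
#include <cstdint>
#include <type_traits>

struct Scalar {
  Scalar(int64_t) {}
  // On LP64 systems int64_t is `long`, so a separate `long long` overload
  // is useful; on 32-bit systems int64_t *is* `long long` and the overload
  // would collide, so SFINAE compiles it out there.
  template <typename T,
            std::enable_if_t<std::is_same_v<T, long long> &&
                                 sizeof(void*) == 8,
                             bool> = true>
  Scalar(T) {}
};

int main() {
  Scalar a(int64_t{3});
  Scalar b(3LL);  // exact match via the templated overload on 64-bit
}
```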

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142235
Approved by: https://github.com/Skylion007
2024-12-07 01:05:55 +00:00
Brian Hirsh
20912ba582 fix incorrect c10::SymFloat::sqrt (#141728)
Fixes the silent correctness for SDPA in https://github.com/pytorch/pytorch/issues/141710

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141728
Approved by: https://github.com/Skylion007, https://github.com/ezyang, https://github.com/drisspg
ghstack dependencies: #141725
2024-12-03 23:34:16 +00:00
cyy
7c1d5db1f3 [2/N] Enable UBSAN tests (#141740)
Apply c10::load in more places. The function was introduced to cast a byte to a valid boolean value, thus fixing the UBSAN errors.
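A sketch of the idea behind such a load helper (illustrative, not the actual `c10::load`):

```cpp
#include <cstring>

template <typename T>
T load(const void* src) {
  T value;
  std::memcpy(&value, src, sizeof(T));
  return value;
}

// Reading a stored byte through a bool lvalue is UB when the byte is
// neither 0 nor 1, so load it raw and normalize instead.
template <>
inline bool load<bool>(const void* src) {
  unsigned char byte = 0;
  std::memcpy(&byte, src, sizeof(byte));
  return byte != 0;  // any nonzero byte becomes a valid `true`
}
```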

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141740
Approved by: https://github.com/ezyang
2024-12-03 20:52:26 +00:00
Yu, Guangye
77748ed8ec fix c10::Event UT failure on XPU backend (#141800)
# Motivation
Fix the UT failure introduced by https://github.com/pytorch/pytorch/pull/140865; an unrelated failure had been suppressing it, so it only started showing up once https://github.com/pytorch/pytorch/pull/141546 landed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141800
Approved by: https://github.com/EikanWang
2024-12-03 01:34:42 +00:00
Benjamin Glass
4959784dac Add API query for available per-process CUDA memory (#140620)
Certain `cpp_wrapper`-enabled tests were OOM-ing in the CI pipeline, with error messages suggesting that sufficient memory was accessible.  This ultimately resulted from an internal memory limitation that was not queryable in the API.  This PR adds querying for that limit.

Additionally, the failing tests had incorrect memory availability checks, and are updated with measured memory requirements.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140620
Approved by: https://github.com/malfet, https://github.com/eqy
ghstack dependencies: #141367
2024-12-03 00:24:03 +00:00
Edward Z. Yang
5212ec3879 Add admonition about as_float_unchecked() (#141742)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141742
Approved by: https://github.com/bdhirsh
2024-11-28 06:25:18 +00:00
Yu, Guangye
d905f1350a Friendly catch exception when fail to initialize XPU devices (#141658)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141658
Approved by: https://github.com/EikanWang
2024-11-28 05:17:08 +00:00
Umberto Valleriani
3becdaf8a7 [c10] Fix static_assert for 32-bit systems (#141244)
The `__ANDROID__` macro was used as a proxy to check whether compilation targets a 32- or 64-bit system, causing build failures on non-Android 32-bit Linux targets like ARMv7.

This modification adjusts the check to fail if and only if `int64_t` and `long` are not the same type on 64-bit systems, i.e. systems where `sizeof(void*) == 8`.
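An illustrative form of such an assertion (not the exact check in the patch; it also admits `long long` so the sketch stays portable to 64-bit platforms where `int64_t` is `long long`):

```cpp
#include <cstdint>
#include <type_traits>

// Key the constraint off pointer width rather than __ANDROID__.
static_assert(sizeof(void*) != 8 || std::is_same_v<int64_t, long> ||
                  std::is_same_v<int64_t, long long>,
              "unexpected int64_t definition on a 64-bit system");
```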

Like I said in issue #141043, I'm not sure whether a different `Scalar` constructor should be defined in the 32-bit case. My code does not break, but I'm not sure other people's code won't.

Fixes #141043

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141244
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-11-28 03:11:52 +00:00