Commit graph

2733 commits

Author SHA1 Message Date
Davide Italiano
91c4bf39d3 [mps] Add a shader for spherical_bessel_j0. (#146771)
In preparation for adding the operation to inductor/eager.
Adapted from the CUDA version of the shader.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146771
Approved by: https://github.com/malfet
2025-02-09 05:11:17 +00:00
Scott Wolchok
ade8fee512 Use c10 version of half/bfloat16 in executorch (#144111)
Summary:
X-link: https://github.com/pytorch/executorch/pull/7040

Accomplished by importing relevant files from c10 into
executorch/runtime/core/portable_type/c10, and then using `using` in
the top-level ExecuTorch headers. This approach should keep the
ExecuTorch build hermetic for embedded use cases. In the future, we
should add a CI job to ensure the c10 files stay identical to the
PyTorch ones.
ghstack-source-id: 260047850
exported-using-ghexport

Test Plan: builds

Differential Revision: D66106969

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144111
Approved by: https://github.com/malfet
2025-02-08 22:40:14 +00:00
Nikita Shulga
2328dcccb9 [MPSInductor] Implement Welford reduction (#146703)
Still work in progress, though fallback works as expected, but custom shader is not

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146703
Approved by: https://github.com/jansel, https://github.com/dcci
2025-02-08 05:00:00 +00:00
Aaron Gokaslan
bc40ccf6aa [BE]: Inline special functions for MPS (#146627)
These header functions should be inlined for consistency and to avoid translation unit / symbol issues.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146627
Approved by: https://github.com/dcci
2025-02-07 05:15:15 +00:00
Nikita Shulga
624d94bdb8 [MPS] Extend torch.special.sinc to complex (#146648)
And to integral data types as well

Was too lazy to deduce the formula myself(or write a sympy script), but ChatGPT did a decent job of doing it, though it forgot that input must be multiplied by $$\pi$$:
```math
\text{Re}\left(\text{sinc}(x + i y)\right) = \frac{\sin(x)\cosh(y) x - \cos(x)\sinh(y) y}{x^2 + y^2}
```
```math
\text{Im}\left(\text{sinc}(x + i y)\right) = \frac{\cos(x)\sinh(y) x + \sin(x)\cosh(y) y}{x^2 + y^2}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146648
Approved by: https://github.com/dcci
2025-02-07 01:12:37 +00:00
Michal Gallus
3379c65de6 [ROCm][Windows] Fix unrecognized _BitScanReverse intrinsic (#146606)
Since PyTorch with ROCm on Windows is built with clang-cl and not MSVC, the intrinsics used are different and hence an attempt to compile with `_BitScanReverse` fails. However, a call to `__builtin_clz` which follows in the subsequent preprocessor branch is correctly recognized by the clang-cl compiler.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146606
Approved by: https://github.com/jeffdaily
2025-02-06 23:47:18 +00:00
Michal Gallus
3f5ed05688 [Windows][ROCm] Fix c10 hip tests (#146599)
- Solves a problem related to .hip source files being ignored by the build system when HIP language is not enabled in CMake.
- Also ensures that the test executables link to an appropriate CRT Runtime Library and hence have access to all the necessary symbols. Previously, there were many problems related to linkage errors.
- Moves part of Linux-related hipBLASLt changes in `LoadHIP.cmake` under the UNIX conditional branch, as these aren't supported on Windows yet.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146599
Approved by: https://github.com/jeffdaily
2025-02-06 23:41:25 +00:00
Isalia20
7725d0ba12 [METAL] inline bfloat min/max (#146588)
After a recent commit 36c6e09528 , building from source with `python setup.py develop` leads to an error due to multiple symbols for min/max:
```
FAILED: caffe2/aten/src/ATen/kernels_bfloat.metallib /Users/Irakli_Salia/Desktop/pytorch/build/caffe2/aten/src/ATen/kernels_bfloat.metallib
cd /Users/Irakli_Salia/Desktop/pytorch/build/caffe2/aten/src/ATen && xcrun metallib -o kernels_bfloat.metallib BinaryKernel_31.air Bucketization_31.air CrossKernel_31.air FusedOptimizerOps_31.air Gamma_31.air HistogramKernel_31.air Im2Col_31.air Indexing_31.air LinearAlgebra_31.air Quantized_31.air RMSNorm_31.air RenormKernel_31.air Repeat_31.air SpecialOps_31.air TriangularOps_31.air UnaryKernel_31.air UnfoldBackward_31.air UpSample_31.air
LLVM ERROR: multiple symbols ('_ZN3c105metal3minIDF16bEEN5metal9enable_ifIXgssr5metalE19is_floating_point_vIT_EES4_E4typeES4_S4_')!
```

This PR fixes that.

@malfet
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146588
Approved by: https://github.com/FFFrog, https://github.com/Skylion007, https://github.com/malfet
2025-02-06 17:57:31 +00:00
Davide Italiano
46390e9a37 [mps] Implement support for sinc() operator (inductor and eager). (#146539)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146539
Approved by: https://github.com/malfet, https://github.com/jansel

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-02-06 16:37:27 +00:00
FEI
8a4dd763b8 [CCA] remove TODO for hardware_destructive_interference_size (#145591)
@zyan0 @albanD  @houseroad

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145591
Approved by: https://github.com/albanD
2025-02-06 14:41:25 +00:00
Nikita Shulga
36c6e09528 [MPSInductor] Fix min/max for bfloat16 (#146552)
By introducing a full specialization that upcasts everything to float, as bfloat does not have a native min/max

Test by runing `test_min_max_reduction`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146552
Approved by: https://github.com/dcci
2025-02-06 05:15:00 +00:00
Nikita Shulga
495049860b [BE][Metal] Fix signed unsigned comparison warning (#146549)
I wish I knew how to extract Metal warnings during JIT compilation but https://developer.apple.com/documentation/metal/mtldevice/makelibrary(source:options:)?changes=_7&language=objc is a lie as `error:` stays `nil` unless shader compilation fails. But when it does following warnings are thrown
```
program_source:666:26: warning: comparison of integers of different signs: 'int' and 'unsigned int' [-Wsign-compare]
  for (auto idx = 1; idx < size; ++idx) {
                     ~~~ ^ ~~~~
program_source:677:26: warning: comparison of integers of different signs: 'int' and 'unsigned int' [-Wsign-compare]
  for (auto idx = 1; idx < size; ++idx) {
                     ~~~ ^ ~~~~
program_source:688:26: warning: comparison of integers of different signs: 'int' and 'unsigned int' [-Wsign-compare]
  for (auto idx = 1; idx < size; ++idx) {
                     ~~~ ^ ~~~~
program_source:699:26: warning: comparison of integers of different signs: 'int' and 'unsigned int' [-Wsign-compare]
  for (auto idx = 1; idx < size; ++idx) {
                     ~~~ ^ ~~~~
program_source:710:26: warning: comparison of integers of different signs: 'int' and 'unsigned int' [-Wsign-compare]
  for (auto idx = 1; idx < size; ++idx) {
                     ~~~ ^ ~~~~
program_source:723:26: warning: comparison of integers of different signs: 'int' and 'unsigned int' [-Wsign-compare]
  for (auto idx = 1; idx < size; ++idx) {

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146549
Approved by: https://github.com/dcci
2025-02-06 00:40:17 +00:00
Davide Italiano
09b0dfdc90 [metal] Add a missing cast to make the call to copysign unambiguous. (#146422)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146422
Approved by: https://github.com/Skylion007, https://github.com/Samkm0084
2025-02-04 22:04:25 +00:00
Nikita Shulga
3525b834f0 [MPSInductor] Implement argmax/argmin (#146429)
TODOs:
 - Find test with NaN
 - Report internal compiler error when running `test_argmax_argmin1` (which is actually not enough shared memory)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146429
Approved by: https://github.com/dcci
ghstack dependencies: #146423, #146428
2025-02-04 19:16:06 +00:00
Nikita Shulga
7d60235aa6 [Metal] Small speedup for sum/prod (#146428)
As they can not really be invoked over empty arrays
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146428
Approved by: https://github.com/Skylion007, https://github.com/dcci
ghstack dependencies: #146423
2025-02-04 19:10:33 +00:00
Nikita Shulga
b1663b31e1 [Metal][BE] Add #pragma once to all headers (#146423)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146423
Approved by: https://github.com/Skylion007, https://github.com/dcci
2025-02-04 19:10:33 +00:00
Nikita Shulga
5d81bc3696 [MPSInductor] Implement prod reduction (#146396)
Mostly reusing `sum` reduction logic

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146396
Approved by: https://github.com/dcci
ghstack dependencies: #146369, #146370, #146380, #146389
2025-02-04 14:08:04 +00:00
Nikita Shulga
bbe95341d9 [MPSInductor] Implement min and max reductions (#146389)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146389
Approved by: https://github.com/jansel, https://github.com/dcci
ghstack dependencies: #146369, #146370, #146380
2025-02-04 14:04:10 +00:00
Davide Italiano
bb4bd5f00b [Metal][BE] Fix the arguments of polygamma (#146382)
In the public API, order comes before input, while here they're
 reversed. Match for consistency (and make this less error prone).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146382
Approved by: https://github.com/jansel, https://github.com/malfet
2025-02-04 06:40:34 +00:00
Nikita Shulga
54ceb7c565 [MPSInductor] Add support for sum reduction (#146380)
- Add `threadgroup_sum` template to `c10/metal/reduction_utils.h` that so far uses barrier to compute the reductions

TODOs:
 - Implement efficient reduction using cooperative functions such as `simd_shuffle_down`
 - Figure out how to merge several sum reduction together
 - Implement `reduction_store` that will only write results from the first thread

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146380
Approved by: https://github.com/jansel, https://github.com/dcci
ghstack dependencies: #146369, #146370
2025-02-04 06:23:44 +00:00
Natalia Gimelshein
0bc036a9e9 use copy2d in h2d/d2h copy when possible (#146256)
A rewrite of #138964
In addition to rewriting the conditions for using copy2d, this PR fixes a few other problems with #138964:
1) gpu-gpu copies when peer access is disabled shouldn't rely on copy2d
2) copy2d should record even for the host pinned memory, like the regular copy does
3) copy2d shouldn't pretend that it's synchronizing (for the purposes of cuda sanitizer tracer) when it's non-blocking

In this PR copy2d behaves in exactly the same way as copy does wrt to those additional syncs, except it calls a different underlying cuda call.

Tests for multiple cases going through copy2d and avoiding copy2d pattern due to unsatisfied conditions are added.
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146256
Approved by: https://github.com/eqy, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-02-03 23:07:54 +00:00
Isalia20
f237172768 Fix not inlining functions used in metal files (#146316)
Fixes issue when building PyTorch with Xcode installed after https://github.com/pytorch/pytorch/pull/146231
```
FAILED: caffe2/aten/src/ATen/kernels_basic.metallib /Users/Irakli_Salia/Desktop/pytorch/build/caffe2/aten/src/ATen/kernels_basic.metallib
cd /Users/Irakli_Salia/Desktop/pytorch/build/caffe2/aten/src/ATen && xcrun metallib -o kernels_basic.metallib BinaryKernel_30.air Bucketization_30.air CrossKernel_30.air FusedOptimizerOps_30.air Gamma_30.air HistogramKernel_30.air Im2Col_30.air Indexing_30.air LinearAlgebra_30.air Quantized_30.air RMSNorm_30.air RenormKernel_30.air Repeat_30.air SpecialOps_30.air TriangularOps_30.air UnaryKernel_30.air UnfoldBackward_30.air UpSample_30.air
LLVM ERROR: multiple symbols ('_ZN3c105metal4zetaEff')!
[3835/5420] Building CXX object c10/test/CMakeFiles/c10_small_vector_test.dir/util/small_vector_test.cpp.o
ninja: build stopped: subcommand failed.
```

AI to @malfet: Add linter that ensures that `c10/metal/` headers do not have any functions there, only templates
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146316
Approved by: https://github.com/malfet, https://github.com/atalman
2025-02-03 19:33:52 +00:00
Davide Italiano
d28fe3ed47 [metal] Move digamma to special_math.h (#146284)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146284
Approved by: https://github.com/jansel
2025-02-03 01:29:14 +00:00
Davide Italiano
7854299b27 [mps/inductor] Implement support for polygamma(). (#146259)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146259
Approved by: https://github.com/jansel
2025-02-02 01:54:23 +00:00
Davide Italiano
dca5cc0255 [mps] Move polygamma to special_math.h. (#146253)
In preparation to implement it in inductor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146253
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-02-01 21:45:23 +00:00
Aaron Gokaslan
07dbd539b4 [BE][Ez]: Make c10/special arrays constexpr (#146246)
No reason to have array creation overhead for these constexpr arrays. This is better because it guarantees the array is not duplicated across templates or translation units unless necessary and allows the compiler to do static compile time bounds checking (even in loop based accesses)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146246
Approved by: https://github.com/dcci, https://github.com/malfet
2025-02-01 21:03:18 +00:00
Davide Italiano
d4ad7b91ad [mps] Move zeta() to special_math.h. (#146231)
In preparation for implementing digamma/polygamma

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146231
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-02-01 19:22:59 +00:00
PyTorch MergeBot
c39c679813 Revert "Tensor .cuda() very slow with specific array sizes (#138964)"
This reverts commit 98f87edd23.

Reverted https://github.com/pytorch/pytorch/pull/138964 on behalf of https://github.com/huydhn due to Sorry for reverting your PR but some slow test start failing after this lands ([comment](https://github.com/pytorch/pytorch/pull/138964#issuecomment-2628455198))
2025-01-31 21:48:51 +00:00
Donald Tolley
98f87edd23 Tensor .cuda() very slow with specific array sizes (#138964)
### **Pull Request: Optimized Non-Contiguous Tensor Copy for CPU to GPU in PyTorch**

#### **Summary**
This PR addresses the performance issue identified in [#111570](https://github.com/pytorch/pytorch/issues/111570), where non-contiguous tensors took significantly longer to transfer from CPU to GPU. Through detailed tracing of the call flow, we identified that PyTorch was creating temporary contiguous buffers for non-contiguous tensor transfers, which introduced unnecessary overhead.

#### **Tracing the Issue**
To pinpoint the cause of the slowdown, we followed the call flow from Python’s `tensor.cuda()` method through PyTorch’s backend, ultimately identifying `copy_kernel_cuda` as the key function responsible for CPU-to-GPU tensor transfers. Here’s a summary of the tracing process:

1. **Python Call: `tensor.cuda()`**
   - Starting from Python, the `cuda()` method initiates the tensor transfer to the GPU.

2. **`TensorBody.h: cuda()`**
   - The `cuda()` method calls `to()`, specifying the target device as CUDA.

3. **`Tensor.cpp: TensorBase::to()`**
   - The `to()` function prepares device and data type options before invoking `_ops::to_dtype_layout::call()`.

4. **Operator Call: `_ops::to_dtype_layout::call()`**
   - This operator dispatches the request to the backend-specific function responsible for managing the transfer.

5. **`Copy.cpp: copy_()`**
   - The `copy_()` function performs preliminary checks (e.g., zero-tensor immutability) and proceeds to call `copy_impl()`.

6. **`Copy.cpp: copy_impl()`**
   - This function sets up a tensor iterator and dispatches the copy operation to the appropriate backend through `copy_stub`.

7. **Dispatch to CUDA: `copy_stub`**
   - The dispatch mechanism routes the call to the CUDA-specific function, `copy_kernel_cuda`.

8. **`Copy.cu: copy_kernel_cuda()`**
   - Here, we identified that PyTorch was creating temporary contiguous buffers for 1D and 2D non-contiguous tensors, which slowed down the copy process. This behavior is managed by the `copy_requires_temporaries()` function.

#### **Solution**
To address this, we modified `copy_kernel_cuda` to handle non-contiguous 1D and 2D tensors directly by using `cudaMemcpy2DAsync`, which allows efficient, stride-aware memory transfers without temporary buffers. Here’s why this approach improves performance:

- **Efficiency of `cudaMemcpy2DAsync`**: This CUDA function is optimized for pitched (stride-based) memory transfers, allowing it to handle non-contiguous data layouts effectively by specifying memory strides for source and destination tensors.
- **Reduction of Overhead**: By directly copying non-contiguous tensors without intermediate buffers, we eliminate extra memory allocation and achieve faster CPU-to-GPU transfers.
- **Asynchronous Execution**: `cudaMemcpy2DAsync` enables asynchronous transfer on the CUDA stream, further improving performance by taking advantage of CUDA's optimized memory handling for non-contiguous layouts.

#### **Performance Results**

In my testing, I created tensors of size `327680 x 2000` and used slices for transfer performance measurements. The tests show that the average time for transferring a non-contiguous slice (e.g., rows 10,000 to 50,000) from CPU to GPU now closely matches the contiguous case. This improvement indicates that the updated implementation effectively addresses the performance discrepancy. Below are the measured times and validation checks:

```plaintext
Average time for contiguous slice (rows 10,000-50,000): 66 ms
Average time for non-contiguous slice (rows 10,000-50,000): 66 ms

Validation of contiguous and non-contiguous tensor copies:
 PASS: Tensor shapes match.
 PASS: Tensor contiguity matches.
 PASS: Tensor contents match.
 PASS: Tensor data types match.

 Success: Both contiguous and non-contiguous tensors were copied correctly to the GPU.
```

#### **Conclusion**
This PR resolves the identified performance issue by eliminating the need for temporary buffers in non-contiguous 1D and 2D tensor transfers, ensuring faster and more efficient copies from CPU to GPU. Future optimizations could further enhance performance for higher-dimensional non-contiguous tensors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138964
Approved by: https://github.com/jeffdaily

Co-authored-by: Natalia Gimelshein <ngimel@gmail.com>
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-01-31 17:05:02 +00:00
Natalia Gimelshein
08ff11e9d0 initialize device when pinning memory on this device, short circuit i… (#145752)
…s_pinned if device is not initialized
Do not land
RFC
potential fix for #144687

Now `.is_pinned(device="cuda")` does not initialize device and thus doesn't poison the fork (but it complains about `device` arg being deprecated). To not need `device=` arg we'd need to fix get_accelerator to not initialize device.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145752
Approved by: https://github.com/albanD

Co-authored-by: albanD <albandes@fb.com>
2025-01-30 21:37:29 +00:00
cyyever
8a6e9a88e9 Let PYTORCH_NO_CUDA_MEMORY_CACHING has effect only when value is 1 (#145905)
Fixes #145661

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145905
Approved by: https://github.com/eqy, https://github.com/janeyx99

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
2025-01-30 05:11:10 +00:00
cyy
116af809eb Use std::string_view (#145906)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145906
Approved by: https://github.com/albanD
2025-01-30 03:14:27 +00:00
Yu, Guangye
3b3aac0cde Filter out iGPU if dGPU is found on XPU (#144378)
# Motivation
for https://github.com/pytorch/pytorch/issues/143914
On Windows, there are two separate SYCL platforms for iGPU and dGPU. To simplify the logic, we will exclude iGPUs when a dGPU is present. This ensures that all XPU devices enumerated by PyTorch share the same SYCL context.

Now I generalize the logic as below:
1. We find the first L0 platform containing at least one dGPU and enumerate all dGPUs of that platform.
2. If no dGPU is found, we find the first L0 platform containing iGPU and enumerate all iGPUs of that platform.
3. No GPU is found (neither iGPU nor dGPU).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144378
Approved by: https://github.com/EikanWang, https://github.com/gujinghui
2025-01-29 15:53:16 +00:00
Nikita Shulga
3fd4691908 [MPS] Add op_math_t (#145808)
Similar to `at::opmath_t` to be used for reduction (and int mms)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145808
Approved by: https://github.com/dcci
2025-01-28 18:03:52 +00:00
cyy
c751541e79 Fix cppcoreguidelines-init-variables ignorance (#141795)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141795
Approved by: https://github.com/albanD
2025-01-28 17:11:37 +00:00
cyy
67fcc7cf02 [3/N] Remove unnecessary once flag usage (#145672)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145672
Approved by: https://github.com/albanD
2025-01-28 04:28:18 +00:00
Nikita Shulga
3a23d75b37 [MPS] Fix c0:🤘:log_gamma correctness on M4 (#145740)
To workaround a bug where `abs` method call seems to be ignored before calling log, which could be reproduced by running the following code (submitted as FB16415011 )
```swift
import Metal

func run_shader<T: BinaryFloatingPoint> (library: MTLLibrary, kernel_name: String, type: T.Type, nelem: Int = 16) {
  guard let mfunc = library.makeFunction(name: kernel_name) else { fatalError("Can't find function") }
  let device = library.device
  guard let queue = device.makeCommandQueue() else { fatalError("Can't make queue") }
  guard let cmdBuffer = queue.makeCommandBuffer() else { fatalError("Can't make command buffer") }
  guard let computeEncoder = cmdBuffer.makeComputeCommandEncoder() else { fatalError("Can't make compute encoder") }
  guard let ibuf = device.makeBuffer(length:nelem * MemoryLayout<T>.size, options: [.storageModeShared]) else { fatalError("Can't alloc") }
  let ibuf_data = ibuf.contents().assumingMemoryBound(to: T.self)
  for i in 0..<nelem {
    ibuf_data[i] = T(sin(Float(2 + i)))
  }
  guard let obuf = device.makeBuffer(length:nelem * MemoryLayout<T>.size, options: [.storageModeShared]) else { fatalError("Can't alloc") }
  let obuf_data = obuf.contents().assumingMemoryBound(to: T.self)

  computeEncoder.setComputePipelineState(try! device.makeComputePipelineState(function: mfunc))
  computeEncoder.setBuffer(obuf, offset:0, index: 0)
  computeEncoder.setBuffer(ibuf, offset:0, index: 1)
  computeEncoder.dispatchThreads(MTLSizeMake(nelem, 1, 1), threadsPerThreadgroup:MTLSizeMake(nelem, 1, 1))
  computeEncoder.endEncoding()
  cmdBuffer.commit()
  cmdBuffer.waitUntilCompleted()

  print("Results for \(String(describing: T.self)):", terminator: " ")
  for i in 0..<nelem {
    print(obuf_data[i], terminator: " ")
  }
  print()
}

let shader_source = """
#include <metal_stdlib>

template<typename T>
float foo(T x) {
  const auto abs_x = :🤘:abs(static_cast<float>(x));
  auto rc = :🤘:log(abs_x);

  return rc - :🤘:log(:🤘:abs(abs_x * :🤘:sinpi(abs_x)));
}

kernel void half_kernel(
    device half* out_ptr0,
    constant half* in_ptr0,
    uint xindex [[thread_position_in_grid]]
) {
  auto inp = in_ptr0[xindex];
  auto out = foo(inp);
  out_ptr0[xindex] = static_cast<half>(out);
}

kernel void float_kernel(
    device float* out_ptr0,
    constant float* in_ptr0,
    uint xindex [[thread_position_in_grid]]
) {
  auto inp = in_ptr0[xindex];
  auto out = foo(inp);
  out_ptr0[xindex] = static_cast<float>(out);
}
"""
let options = MTLCompileOptions()
options.mathMode = .safe
options.mathFloatingPointFunctions = .precise

guard let device = MTLCopyAllDevices().first else { fatalError("Not Metal device found") }
let library = try! device.makeLibrary(source:shader_source, options:options)
run_shader(library:library, kernel_name:"half_kernel", type: Float16.self)
run_shader(library:library, kernel_name:"float_kernel", type: Float.self)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145740
Approved by: https://github.com/dcci
2025-01-27 21:24:22 +00:00
Nikita Shulga
71caac2b30 [MPSInductor] Add rand support (#145705)
Using Philox4 as PRNG

Test plan (other that CI)
Run
```python
mport torch
from torch._inductor.utils import run_and_get_code
from contextlib import nullcontext

def foo(x):
   return x * torch.randn_like(x)

foo_c = torch.compile(foo)

x = torch.ones(100, 100, device="mps")

y = foo_c(x)

print(y.mean().item(), y.std().item())
for i in range(25):
  print(y[i].mean(), y[i].std())
```
And observe that printed values are close to 0 and 1

TODO: Better `randint` algorithm for large ranges

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145705
Approved by: https://github.com/dcci, https://github.com/jansel
2025-01-27 06:07:36 +00:00
Yichen Yan
ed015143ef Set RUNPATH on CUDA and XPU tests (#144305)
#136627 has almost fixed the issue that test binaries' runpath has not been set correctly, with few cases left.

This PR fixes the rest.

The binaries are found by `auditwheel repair` a wheel built with `BUILD_TEST=1`.

@malfet

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144305
Approved by: https://github.com/malfet
2025-01-26 08:40:22 +00:00
Simon Mahns
6939a56e13 [autocast][pytorch] Support autocast for MTIA (#145627)
Summary: Add autocast support to MTIA

Reviewed By: egienvalue

Differential Revision: D68572548

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145627
Approved by: https://github.com/egienvalue
2025-01-25 03:24:59 +00:00
Davide Italiano
57591edca1 [mps/inductor] Add support for erfinv. (#145643)
After several rounds of refactoring, this seems to be done now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145643
Approved by: https://github.com/malfet, https://github.com/jansel
2025-01-24 22:55:44 +00:00
Davide Italiano
f56c638849 [c10/metal] Add a vectype variant for short/int/long (#145430)
Some of the kernels (exp_complex/atan_complex) need the specialization.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145430
Approved by: https://github.com/malfet, https://github.com/jansel
2025-01-23 04:52:56 +00:00
Nikita Shulga
70ccbade83 [MPSInductor] Add gamma op (#145341)
By moving `gamma` and `log_gamma` implementation from `Gamma.metal` to `c10/metal/special_math.h`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145341
Approved by: https://github.com/Skylion007, https://github.com/dcci
ghstack dependencies: #145309
2025-01-22 19:37:45 +00:00
Isuru Fernando
4b77ff9784 Fix PythonMod printing for C++ (#143385)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143385
Approved by: https://github.com/leslie-fang-intel, https://github.com/anijain2305
2025-01-22 14:58:35 +00:00
Nikita Shulga
1908116ace [MPS][BE] Move vectypes from Quantized to utils (#145312)
That allows one to get appropriate vectorized types for templates using `c10:🤘:vec2type_t<>` or `c10:🤘:vec4type_t<>`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145312
Approved by: https://github.com/dcci
2025-01-22 00:37:28 +00:00
Davide Italiano
8cc415774f [mps/inductor] Introduce a metal approx for erf() and use it. (#145161)
Probably we can do better, but this is a start.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145161
Approved by: https://github.com/malfet
2025-01-19 02:29:05 +00:00
Nikita Shulga
cede43e06b [MPSInductor][BE] NaN-propagating min/max to header (#145157)
May be to be later reused from eager op as well

Also, didn't know that Metal already have type_traits
And use `metal::isunorderder(a, b)` instead of `metal::isnan(a + b)` is it is defined as function that is equivalent  `a != a || b != b`, but I suspect it might have a best native implementation for the specific architecture

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145157
Approved by: https://github.com/dcci
2025-01-18 22:52:44 +00:00
Nikita Shulga
dc9b77cc55 [MPS] Support includes in metal objects (#145087)
Useful for code reuse for Metal shader build both for eager mode and MPSInductor, but it requires one to implement `_cpp_embed_headers` tool that, as name suggests, would preprocess and embeds the for shader to be used in dynamic compilation.
Test using:
 -  `TestMetalLibrary.test_metal_include`
 - Moving `i0`/`i1` implementation to `c10/util/metal_special_math.h` and call it from `SpecialOps.metal` shader, which now looks much more compact:
 ```metal
template <typename T, typename Tout = T>
void kernel
i0(constant T* input,
   device Tout* output,
   uint index [[thread_position_in_grid]]) {
  output[index] = c10::i0(static_cast<Tout>(input[index]));
}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145087
Approved by: https://github.com/dcci
ghstack dependencies: #145023
2025-01-18 05:35:22 +00:00
Scott Wolchok
69b883d7ac Remove C10_EMBEDDED (#144808)
I added this to support code sharing with ExecuTorch, but the operator<< overrides are load-bearing for builds -- we have other code that attempts to pretty-print Half/BFloat16, and implicit conversions can't be used to make that work because there are *multiple* implicit conversions from Half/BFloat16 to primitive types, so which one to select is ambiguous. Also, we don't actually seem to need it now in ExecuTorch core because we have `include <ostream>` in there at the moment anyway.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144808
Approved by: https://github.com/janeyx99, https://github.com/malfet
2025-01-15 06:08:53 +00:00
fan.mo
64829b356a [PrivateUse1] Support parseDispatchKey with modified PrivateUse1 (#144325)
PyTorch now support many private1 backend names like `AutogradPrivateUse1` or `QuantizedPrivateUse1`, not mentioned the original `PrivateUse1` backend.

However, users that implement `PrivateUse1` funtionalities would modified the backend name by calling  `torch.utils.rename_privateuse1_backend("my_backend")`, in that case, all `PrivateUse1` backend string would not be found when we call other functions related to it. For example, we utilize `torch.library` to register some customize functions to our new backend, we would use "my_backend" as the backend name instead of "PrivateUse1", in which the error will be throw:
```
could not parse dispatch key 'my_backend'
```

So, this PR changed the function `c10::DispatchKey parseDispatchKey(const std::string& k)`, it would double check if the `PrivateUse1` has been modified, and if so, we would change `k` to adapt new backend name then find it again.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144325
Approved by: https://github.com/albanD
2025-01-14 21:21:29 +00:00