pytorch

mirror of https://github.com/saymrwulf/pytorch.git synced 2026-05-14 20:57:59 +00:00

Author	SHA1	Message	Date
Davide Italiano	91c4bf39d3	[mps] Add a shader for spherical_bessel_j0. (#146771 ) In preparation for adding the operation to inductor/eager. Adapted from the CUDA version of the shader. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146771 Approved by: https://github.com/malfet	2025-02-09 05:11:17 +00:00
Scott Wolchok	ade8fee512	Use c10 version of half/bfloat16 in executorch (#144111 ) Summary: X-link: https://github.com/pytorch/executorch/pull/7040 Accomplished by importing relevant files from c10 into executorch/runtime/core/portable_type/c10, and then using `using` in the top-level ExecuTorch headers. This approach should keep the ExecuTorch build hermetic for embedded use cases. In the future, we should add a CI job to ensure the c10 files stay identical to the PyTorch ones. ghstack-source-id: 260047850 exported-using-ghexport Test Plan: builds Differential Revision: D66106969 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144111 Approved by: https://github.com/malfet	2025-02-08 22:40:14 +00:00
Nikita Shulga	2328dcccb9	[MPSInductor] Implement Welford reduction (#146703 ) Still a work in progress: the fallback works as expected, but the custom shader does not. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146703 Approved by: https://github.com/jansel, https://github.com/dcci	2025-02-08 05:00:00 +00:00
Aaron Gokaslan	bc40ccf6aa	[BE]: Inline special functions for MPS (#146627 ) These header functions should be inlined for consistency and to avoid translation unit / symbol issues. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146627 Approved by: https://github.com/dcci	2025-02-07 05:15:15 +00:00
Nikita Shulga	624d94bdb8	[MPS] Extend `torch.special.sinc` to complex (#146648 ) And to integral data types as well. Was too lazy to deduce the formula myself (or write a sympy script), but ChatGPT did a decent job of doing it, though it forgot that input must be multiplied by $$\pi$$: ```math \text{Re}\left(\text{sinc}(x + i y)\right) = \frac{\sin(x)\cosh(y) x + \cos(x)\sinh(y) y}{x^2 + y^2} ``` ```math \text{Im}\left(\text{sinc}(x + i y)\right) = \frac{\cos(x)\sinh(y) x - \sin(x)\cosh(y) y}{x^2 + y^2} ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146648 Approved by: https://github.com/dcci	2025-02-07 01:12:37 +00:00
Michal Gallus	3379c65de6	[ROCm][Windows] Fix unrecognized _BitScanReverse intrinsic (#146606 ) Since PyTorch with ROCm on Windows is built with clang-cl and not MSVC, the intrinsics used are different and hence an attempt to compile with `_BitScanReverse` fails. However, a call to `__builtin_clz` which follows in the subsequent preprocessor branch is correctly recognized by the clang-cl compiler. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146606 Approved by: https://github.com/jeffdaily	2025-02-06 23:47:18 +00:00
Michal Gallus	3f5ed05688	[Windows][ROCm] Fix c10 hip tests (#146599 ) - Solves a problem related to .hip source files being ignored by the build system when HIP language is not enabled in CMake. - Also ensures that the test executables link to an appropriate CRT Runtime Library and hence have access to all the necessary symbols. Previously, there were many problems related to linkage errors. - Moves part of Linux-related hipBLASLt changes in `LoadHIP.cmake` under the UNIX conditional branch, as these aren't supported on Windows yet. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146599 Approved by: https://github.com/jeffdaily	2025-02-06 23:41:25 +00:00
Isalia20	7725d0ba12	[METAL] inline bfloat min/max (#146588 ) After a recent commit `36c6e09528` , building from source with `python setup.py develop` leads to an error due to multiple symbols for min/max: ``` FAILED: caffe2/aten/src/ATen/kernels_bfloat.metallib /Users/Irakli_Salia/Desktop/pytorch/build/caffe2/aten/src/ATen/kernels_bfloat.metallib cd /Users/Irakli_Salia/Desktop/pytorch/build/caffe2/aten/src/ATen && xcrun metallib -o kernels_bfloat.metallib BinaryKernel_31.air Bucketization_31.air CrossKernel_31.air FusedOptimizerOps_31.air Gamma_31.air HistogramKernel_31.air Im2Col_31.air Indexing_31.air LinearAlgebra_31.air Quantized_31.air RMSNorm_31.air RenormKernel_31.air Repeat_31.air SpecialOps_31.air TriangularOps_31.air UnaryKernel_31.air UnfoldBackward_31.air UpSample_31.air LLVM ERROR: multiple symbols ('_ZN3c105metal3minIDF16bEEN5metal9enable_ifIXgssr5metalE19is_floating_point_vIT_EES4_E4typeES4_S4_')! ``` This PR fixes that. @malfet Pull Request resolved: https://github.com/pytorch/pytorch/pull/146588 Approved by: https://github.com/FFFrog, https://github.com/Skylion007, https://github.com/malfet	2025-02-06 17:57:31 +00:00
Davide Italiano	46390e9a37	[mps] Implement support for sinc() operator (inductor and eager). (#146539 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146539 Approved by: https://github.com/malfet, https://github.com/jansel Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-02-06 16:37:27 +00:00
FEI	8a4dd763b8	[CCA] remove TODO for hardware_destructive_interference_size (#145591 ) @zyan0 @albanD @houseroad Pull Request resolved: https://github.com/pytorch/pytorch/pull/145591 Approved by: https://github.com/albanD	2025-02-06 14:41:25 +00:00
Nikita Shulga	36c6e09528	[MPSInductor] Fix min/max for bfloat16 (#146552 ) By introducing a full specialization that upcasts everything to float, as bfloat does not have a native min/max. Test by running `test_min_max_reduction` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146552 Approved by: https://github.com/dcci	2025-02-06 05:15:00 +00:00
Nikita Shulga	495049860b	[BE][Metal] Fix signed unsigned comparison warning (#146549 ) I wish I knew how to extract Metal warnings during JIT compilation but https://developer.apple.com/documentation/metal/mtldevice/makelibrary(source:options:)?changes=_7&language=objc is a lie as `error:` stays `nil` unless shader compilation fails. But when it does following warnings are thrown ``` program_source:666:26: warning: comparison of integers of different signs: 'int' and 'unsigned int' [-Wsign-compare] for (auto idx = 1; idx < size; ++idx) { ~~~ ^ ~~~~ program_source:677:26: warning: comparison of integers of different signs: 'int' and 'unsigned int' [-Wsign-compare] for (auto idx = 1; idx < size; ++idx) { ~~~ ^ ~~~~ program_source:688:26: warning: comparison of integers of different signs: 'int' and 'unsigned int' [-Wsign-compare] for (auto idx = 1; idx < size; ++idx) { ~~~ ^ ~~~~ program_source:699:26: warning: comparison of integers of different signs: 'int' and 'unsigned int' [-Wsign-compare] for (auto idx = 1; idx < size; ++idx) { ~~~ ^ ~~~~ program_source:710:26: warning: comparison of integers of different signs: 'int' and 'unsigned int' [-Wsign-compare] for (auto idx = 1; idx < size; ++idx) { ~~~ ^ ~~~~ program_source:723:26: warning: comparison of integers of different signs: 'int' and 'unsigned int' [-Wsign-compare] for (auto idx = 1; idx < size; ++idx) { ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146549 Approved by: https://github.com/dcci	2025-02-06 00:40:17 +00:00
Davide Italiano	09b0dfdc90	[metal] Add a missing cast to make the call to copysign unambiguous. (#146422 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146422 Approved by: https://github.com/Skylion007, https://github.com/Samkm0084	2025-02-04 22:04:25 +00:00
Nikita Shulga	3525b834f0	[MPSInductor] Implement `argmax`/`argmin` (#146429 ) TODOs: - Find test with NaN - Report internal compiler error when running `test_argmax_argmin1` (which is actually not enough shared memory) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146429 Approved by: https://github.com/dcci ghstack dependencies: #146423, #146428	2025-02-04 19:16:06 +00:00
Nikita Shulga	7d60235aa6	[Metal] Small speedup for `sum`/`prod` (#146428 ) As they can not really be invoked over empty arrays Pull Request resolved: https://github.com/pytorch/pytorch/pull/146428 Approved by: https://github.com/Skylion007, https://github.com/dcci ghstack dependencies: #146423	2025-02-04 19:10:33 +00:00
Nikita Shulga	b1663b31e1	[Metal][BE] Add `#pragma once` to all headers (#146423 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146423 Approved by: https://github.com/Skylion007, https://github.com/dcci	2025-02-04 19:10:33 +00:00
Nikita Shulga	5d81bc3696	[MPSInductor] Implement `prod` reduction (#146396 ) Mostly reusing `sum` reduction logic Pull Request resolved: https://github.com/pytorch/pytorch/pull/146396 Approved by: https://github.com/dcci ghstack dependencies: #146369, #146370, #146380, #146389	2025-02-04 14:08:04 +00:00
Nikita Shulga	bbe95341d9	[MPSInductor] Implement `min` and `max` reductions (#146389 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146389 Approved by: https://github.com/jansel, https://github.com/dcci ghstack dependencies: #146369, #146370, #146380	2025-02-04 14:04:10 +00:00
Davide Italiano	bb4bd5f00b	[Metal][BE] Fix the arguments of `polygamma` (#146382 ) In the public API, order comes before input, while here they're reversed. Match for consistency (and make this less error prone). Pull Request resolved: https://github.com/pytorch/pytorch/pull/146382 Approved by: https://github.com/jansel, https://github.com/malfet	2025-02-04 06:40:34 +00:00
Nikita Shulga	54ceb7c565	[MPSInductor] Add support for `sum` reduction (#146380 ) - Add `threadgroup_sum` template to `c10/metal/reduction_utils.h` that so far uses barrier to compute the reductions TODOs: - Implement efficient reduction using cooperative functions such as `simd_shuffle_down` - Figure out how to merge several sum reduction together - Implement `reduction_store` that will only write results from the first thread Pull Request resolved: https://github.com/pytorch/pytorch/pull/146380 Approved by: https://github.com/jansel, https://github.com/dcci ghstack dependencies: #146369, #146370	2025-02-04 06:23:44 +00:00
Natalia Gimelshein	0bc036a9e9	use copy2d in h2d/d2h copy when possible (#146256 ) A rewrite of #138964 In addition to rewriting the conditions for using copy2d, this PR fixes a few other problems with #138964: 1) gpu-gpu copies when peer access is disabled shouldn't rely on copy2d 2) copy2d should record even for the host pinned memory, like the regular copy does 3) copy2d shouldn't pretend that it's synchronizing (for the purposes of cuda sanitizer tracer) when it's non-blocking In this PR copy2d behaves in exactly the same way as copy does wrt to those additional syncs, except it calls a different underlying cuda call. Tests for multiple cases going through copy2d and avoiding copy2d pattern due to unsatisfied conditions are added. Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/146256 Approved by: https://github.com/eqy, https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-02-03 23:07:54 +00:00
Isalia20	f237172768	Fix not inlining functions used in metal files (#146316 ) Fixes issue when building PyTorch with Xcode installed after https://github.com/pytorch/pytorch/pull/146231 ``` FAILED: caffe2/aten/src/ATen/kernels_basic.metallib /Users/Irakli_Salia/Desktop/pytorch/build/caffe2/aten/src/ATen/kernels_basic.metallib cd /Users/Irakli_Salia/Desktop/pytorch/build/caffe2/aten/src/ATen && xcrun metallib -o kernels_basic.metallib BinaryKernel_30.air Bucketization_30.air CrossKernel_30.air FusedOptimizerOps_30.air Gamma_30.air HistogramKernel_30.air Im2Col_30.air Indexing_30.air LinearAlgebra_30.air Quantized_30.air RMSNorm_30.air RenormKernel_30.air Repeat_30.air SpecialOps_30.air TriangularOps_30.air UnaryKernel_30.air UnfoldBackward_30.air UpSample_30.air LLVM ERROR: multiple symbols ('_ZN3c105metal4zetaEff')! [3835/5420] Building CXX object c10/test/CMakeFiles/c10_small_vector_test.dir/util/small_vector_test.cpp.o ninja: build stopped: subcommand failed. ``` AI to @malfet: Add linter that ensures that `c10/metal/` headers do not have any functions there, only templates Pull Request resolved: https://github.com/pytorch/pytorch/pull/146316 Approved by: https://github.com/malfet, https://github.com/atalman	2025-02-03 19:33:52 +00:00
Davide Italiano	d28fe3ed47	[metal] Move digamma to special_math.h (#146284 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146284 Approved by: https://github.com/jansel	2025-02-03 01:29:14 +00:00
Davide Italiano	7854299b27	[mps/inductor] Implement support for polygamma(). (#146259 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146259 Approved by: https://github.com/jansel	2025-02-02 01:54:23 +00:00
Davide Italiano	dca5cc0255	[mps] Move polygamma to special_math.h. (#146253 ) In preparation to implement it in inductor. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146253 Approved by: https://github.com/Skylion007, https://github.com/malfet	2025-02-01 21:45:23 +00:00
Aaron Gokaslan	07dbd539b4	[BE][Ez]: Make c10/special arrays constexpr (#146246 ) No reason to have array creation overhead for these constexpr arrays. This is better because it guarantees the array is not duplicated across templates or translation units unless necessary and allows the compiler to do static compile time bounds checking (even in loop based accesses) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146246 Approved by: https://github.com/dcci, https://github.com/malfet	2025-02-01 21:03:18 +00:00
Davide Italiano	d4ad7b91ad	[mps] Move zeta() to special_math.h. (#146231 ) In preparation for implementing digamma/polygamma Pull Request resolved: https://github.com/pytorch/pytorch/pull/146231 Approved by: https://github.com/Skylion007, https://github.com/malfet	2025-02-01 19:22:59 +00:00
PyTorch MergeBot	c39c679813	Revert "Tensor .cuda() very slow with specific array sizes (#138964 )" This reverts commit `98f87edd23`. Reverted https://github.com/pytorch/pytorch/pull/138964 on behalf of https://github.com/huydhn due to Sorry for reverting your PR but some slow test start failing after this lands ([comment](https://github.com/pytorch/pytorch/pull/138964#issuecomment-2628455198))	2025-01-31 21:48:51 +00:00
Donald Tolley	98f87edd23	Tensor .cuda() very slow with specific array sizes (#138964 ) ### Pull Request: Optimized Non-Contiguous Tensor Copy for CPU to GPU in PyTorch #### Summary This PR addresses the performance issue identified in [#111570](https://github.com/pytorch/pytorch/issues/111570), where non-contiguous tensors took significantly longer to transfer from CPU to GPU. Through detailed tracing of the call flow, we identified that PyTorch was creating temporary contiguous buffers for non-contiguous tensor transfers, which introduced unnecessary overhead. #### Tracing the Issue To pinpoint the cause of the slowdown, we followed the call flow from Python’s `tensor.cuda()` method through PyTorch’s backend, ultimately identifying `copy_kernel_cuda` as the key function responsible for CPU-to-GPU tensor transfers. Here’s a summary of the tracing process: 1. Python Call: `tensor.cuda()` - Starting from Python, the `cuda()` method initiates the tensor transfer to the GPU. 2. `TensorBody.h: cuda()` - The `cuda()` method calls `to()`, specifying the target device as CUDA. 3. `Tensor.cpp: TensorBase::to()` - The `to()` function prepares device and data type options before invoking `_ops::to_dtype_layout::call()`. 4. Operator Call: `_ops::to_dtype_layout::call()` - This operator dispatches the request to the backend-specific function responsible for managing the transfer. 5. `Copy.cpp: copy_()` - The `copy_()` function performs preliminary checks (e.g., zero-tensor immutability) and proceeds to call `copy_impl()`. 6. `Copy.cpp: copy_impl()` - This function sets up a tensor iterator and dispatches the copy operation to the appropriate backend through `copy_stub`. 7. Dispatch to CUDA: `copy_stub` - The dispatch mechanism routes the call to the CUDA-specific function, `copy_kernel_cuda`. 8. `Copy.cu: copy_kernel_cuda()` - Here, we identified that PyTorch was creating temporary contiguous buffers for 1D and 2D non-contiguous tensors, which slowed down the copy process. 
This behavior is managed by the `copy_requires_temporaries()` function. #### Solution To address this, we modified `copy_kernel_cuda` to handle non-contiguous 1D and 2D tensors directly by using `cudaMemcpy2DAsync`, which allows efficient, stride-aware memory transfers without temporary buffers. Here’s why this approach improves performance: - Efficiency of `cudaMemcpy2DAsync`: This CUDA function is optimized for pitched (stride-based) memory transfers, allowing it to handle non-contiguous data layouts effectively by specifying memory strides for source and destination tensors. - Reduction of Overhead: By directly copying non-contiguous tensors without intermediate buffers, we eliminate extra memory allocation and achieve faster CPU-to-GPU transfers. - Asynchronous Execution: `cudaMemcpy2DAsync` enables asynchronous transfer on the CUDA stream, further improving performance by taking advantage of CUDA's optimized memory handling for non-contiguous layouts. #### Performance Results In my testing, I created tensors of size `327680 x 2000` and used slices for transfer performance measurements. The tests show that the average time for transferring a non-contiguous slice (e.g., rows 10,000 to 50,000) from CPU to GPU now closely matches the contiguous case. This improvement indicates that the updated implementation effectively addresses the performance discrepancy. Below are the measured times and validation checks: ```plaintext Average time for contiguous slice (rows 10,000-50,000): 66 ms Average time for non-contiguous slice (rows 10,000-50,000): 66 ms Validation of contiguous and non-contiguous tensor copies: ✅ PASS: Tensor shapes match. ✅ PASS: Tensor contiguity matches. ✅ PASS: Tensor contents match. ✅ PASS: Tensor data types match. ✅ Success: Both contiguous and non-contiguous tensors were copied correctly to the GPU. 
``` #### Conclusion This PR resolves the identified performance issue by eliminating the need for temporary buffers in non-contiguous 1D and 2D tensor transfers, ensuring faster and more efficient copies from CPU to GPU. Future optimizations could further enhance performance for higher-dimensional non-contiguous tensors. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138964 Approved by: https://github.com/jeffdaily Co-authored-by: Natalia Gimelshein <ngimel@gmail.com> Co-authored-by: Jeff Daily <jeff.daily@amd.com>	2025-01-31 17:05:02 +00:00
Natalia Gimelshein	08ff11e9d0	initialize device when pinning memory on this device, short circuit i… (#145752 ) …s_pinned if device is not initialized Do not land RFC potential fix for #144687 Now `.is_pinned(device="cuda")` does not initialize device and thus doesn't poison the fork (but it complains about `device` arg being deprecated). To not need `device=` arg we'd need to fix get_accelerator to not initialize device. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145752 Approved by: https://github.com/albanD Co-authored-by: albanD <albandes@fb.com>	2025-01-30 21:37:29 +00:00
cyyever	8a6e9a88e9	Let PYTORCH_NO_CUDA_MEMORY_CACHING has effect only when value is 1 (#145905 ) Fixes #145661 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145905 Approved by: https://github.com/eqy, https://github.com/janeyx99 Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>	2025-01-30 05:11:10 +00:00
cyy	116af809eb	Use std::string_view (#145906 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/145906 Approved by: https://github.com/albanD	2025-01-30 03:14:27 +00:00
Yu, Guangye	3b3aac0cde	Filter out iGPU if dGPU is found on XPU (#144378 ) # Motivation for https://github.com/pytorch/pytorch/issues/143914 On Windows, there are two separate SYCL platforms for iGPU and dGPU. To simplify the logic, we will exclude iGPUs when a dGPU is present. This ensures that all XPU devices enumerated by PyTorch share the same SYCL context. Now I generalize the logic as below: 1. We find the first L0 platform containing at least one dGPU and enumerate all dGPUs of that platform. 2. If no dGPU is found, we find the first L0 platform containing iGPU and enumerate all iGPUs of that platform. 3. No GPU is found (neither iGPU nor dGPU). Pull Request resolved: https://github.com/pytorch/pytorch/pull/144378 Approved by: https://github.com/EikanWang, https://github.com/gujinghui	2025-01-29 15:53:16 +00:00
Nikita Shulga	3fd4691908	[MPS] Add `op_math_t` (#145808 ) Similar to `at::opmath_t` to be used for reduction (and int mms) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145808 Approved by: https://github.com/dcci	2025-01-28 18:03:52 +00:00
cyy	c751541e79	Fix cppcoreguidelines-init-variables ignorance (#141795 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/141795 Approved by: https://github.com/albanD	2025-01-28 17:11:37 +00:00
cyy	67fcc7cf02	[3/N] Remove unnecessary once flag usage (#145672 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/145672 Approved by: https://github.com/albanD	2025-01-28 04:28:18 +00:00
Nikita Shulga	3a23d75b37	[MPS] Fix `c10::metal::log_gamma` correctness on M4 (#145740 ) To workaround a bug where `abs` method call seems to be ignored before calling log, which could be reproduced by running the following code (submitted as FB16415011 ) ```swift import Metal func run_shader<T: BinaryFloatingPoint> (library: MTLLibrary, kernel_name: String, type: T.Type, nelem: Int = 16) { guard let mfunc = library.makeFunction(name: kernel_name) else { fatalError("Can't find function") } let device = library.device guard let queue = device.makeCommandQueue() else { fatalError("Can't make queue") } guard let cmdBuffer = queue.makeCommandBuffer() else { fatalError("Can't make command buffer") } guard let computeEncoder = cmdBuffer.makeComputeCommandEncoder() else { fatalError("Can't make compute encoder") } guard let ibuf = device.makeBuffer(length:nelem * MemoryLayout<T>.size, options: [.storageModeShared]) else { fatalError("Can't alloc") } let ibuf_data = ibuf.contents().assumingMemoryBound(to: T.self) for i in 0..<nelem { ibuf_data[i] = T(sin(Float(2 + i))) } guard let obuf = device.makeBuffer(length:nelem * MemoryLayout<T>.size, options: [.storageModeShared]) else { fatalError("Can't alloc") } let obuf_data = obuf.contents().assumingMemoryBound(to: T.self) computeEncoder.setComputePipelineState(try! device.makeComputePipelineState(function: mfunc)) computeEncoder.setBuffer(obuf, offset:0, index: 0) computeEncoder.setBuffer(ibuf, offset:0, index: 1) computeEncoder.dispatchThreads(MTLSizeMake(nelem, 1, 1), threadsPerThreadgroup:MTLSizeMake(nelem, 1, 1)) computeEncoder.endEncoding() cmdBuffer.commit() cmdBuffer.waitUntilCompleted() print("Results for \(String(describing: T.self)):", terminator: " ") for i in 0..<nelem { print(obuf_data[i], terminator: " ") } print() } let shader_source = """ #include <metal_stdlib> template<typename T> float foo(T x) { const auto abs_x = ::metal::abs(static_cast<float>(x)); auto rc = ::metal::log(abs_x); return rc - ::metal::log(::metal::abs(abs_x * ::metal::sinpi(abs_x))); } kernel void half_kernel( device half* out_ptr0, constant half* in_ptr0, uint xindex [[thread_position_in_grid]] ) { auto inp = in_ptr0[xindex]; auto out = foo(inp); out_ptr0[xindex] = static_cast<half>(out); } kernel void float_kernel( device float* out_ptr0, constant float* in_ptr0, uint xindex [[thread_position_in_grid]] ) { auto inp = in_ptr0[xindex]; auto out = foo(inp); out_ptr0[xindex] = static_cast<float>(out); } """ let options = MTLCompileOptions() options.mathMode = .safe options.mathFloatingPointFunctions = .precise guard let device = MTLCopyAllDevices().first else { fatalError("Not Metal device found") } let library = try! device.makeLibrary(source:shader_source, options:options) run_shader(library:library, kernel_name:"half_kernel", type: Float16.self) run_shader(library:library, kernel_name:"float_kernel", type: Float.self) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/145740 Approved by: https://github.com/dcci	2025-01-27 21:24:22 +00:00
Nikita Shulga	71caac2b30	[MPSInductor] Add rand support (#145705 ) Using Philox4 as PRNG Test plan (other than CI) Run ```python import torch from torch._inductor.utils import run_and_get_code from contextlib import nullcontext def foo(x): return x * torch.randn_like(x) foo_c = torch.compile(foo) x = torch.ones(100, 100, device="mps") y = foo_c(x) print(y.mean().item(), y.std().item()) for i in range(25): print(y[i].mean(), y[i].std()) ``` And observe that printed values are close to 0 and 1 TODO: Better `randint` algorithm for large ranges Pull Request resolved: https://github.com/pytorch/pytorch/pull/145705 Approved by: https://github.com/dcci, https://github.com/jansel	2025-01-27 06:07:36 +00:00
Yichen Yan	ed015143ef	Set RUNPATH on CUDA and XPU tests (#144305 ) #136627 has almost fixed the issue that test binaries' runpath has not been set correctly, with few cases left. This PR fixes the rest. The binaries are found by `auditwheel repair` a wheel built with `BUILD_TEST=1`. @malfet Pull Request resolved: https://github.com/pytorch/pytorch/pull/144305 Approved by: https://github.com/malfet	2025-01-26 08:40:22 +00:00
Simon Mahns	6939a56e13	[autocast][pytorch] Support autocast for MTIA (#145627 ) Summary: Add autocast support to MTIA Reviewed By: egienvalue Differential Revision: D68572548 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145627 Approved by: https://github.com/egienvalue	2025-01-25 03:24:59 +00:00
Davide Italiano	57591edca1	[mps/inductor] Add support for `erfinv`. (#145643 ) After several rounds of refactoring, this seems to be done now. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145643 Approved by: https://github.com/malfet, https://github.com/jansel	2025-01-24 22:55:44 +00:00
Davide Italiano	f56c638849	[c10/metal] Add a vectype variant for `short`/`int`/`long` (#145430 ) Some of the kernels (exp_complex/atan_complex) need the specialization. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145430 Approved by: https://github.com/malfet, https://github.com/jansel	2025-01-23 04:52:56 +00:00
Nikita Shulga	70ccbade83	[MPSInductor] Add `gamma` op (#145341 ) By moving `gamma` and `log_gamma` implementation from `Gamma.metal` to `c10/metal/special_math.h` Pull Request resolved: https://github.com/pytorch/pytorch/pull/145341 Approved by: https://github.com/Skylion007, https://github.com/dcci ghstack dependencies: #145309	2025-01-22 19:37:45 +00:00
Isuru Fernando	4b77ff9784	Fix PythonMod printing for C++ (#143385 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143385 Approved by: https://github.com/leslie-fang-intel, https://github.com/anijain2305	2025-01-22 14:58:35 +00:00
Nikita Shulga	1908116ace	[MPS][BE] Move vectypes from Quantized to utils (#145312 ) That allows one to get appropriate vectorized types for templates using `c10::metal::vec2type_t<>` or `c10::metal::vec4type_t<>` Pull Request resolved: https://github.com/pytorch/pytorch/pull/145312 Approved by: https://github.com/dcci	2025-01-22 00:37:28 +00:00
Davide Italiano	8cc415774f	[mps/inductor] Introduce a metal approx for erf() and use it. (#145161 ) Probably we can do better, but this is a start. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145161 Approved by: https://github.com/malfet	2025-01-19 02:29:05 +00:00
Nikita Shulga	cede43e06b	[MPSInductor][BE] NaN-propagating min/max to header (#145157 ) May later be reused from the eager op as well. Also, didn't know that Metal already has type_traits. And use `metal::isunordered(a, b)` instead of `metal::isnan(a + b)`, as it is defined as a function equivalent to `a != a || b != b`, but I suspect it might have a better native implementation for the specific architecture Pull Request resolved: https://github.com/pytorch/pytorch/pull/145157 Approved by: https://github.com/dcci	2025-01-18 22:52:44 +00:00
Nikita Shulga	dc9b77cc55	[MPS] Support includes in metal objects (#145087 ) Useful for reusing Metal shader code in both eager mode and MPSInductor builds, but it requires implementing a `_cpp_embed_headers` tool that, as the name suggests, preprocesses the shader and embeds the included headers so it can be used in dynamic compilation. Test using: - `TestMetalLibrary.test_metal_include` - Moving the `i0`/`i1` implementation to `c10/util/metal_special_math.h` and calling it from the `SpecialOps.metal` shader, which now looks much more compact: ```metal template <typename T, typename Tout = T> void kernel i0(constant T* input, device Tout* output, uint index [[thread_position_in_grid]]) { output[index] = c10::i0(static_cast<Tout>(input[index])); } ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/145087 Approved by: https://github.com/dcci ghstack dependencies: #145023	2025-01-18 05:35:22 +00:00
Scott Wolchok	69b883d7ac	Remove C10_EMBEDDED (#144808 ) I added this to support code sharing with ExecuTorch, but the operator<< overrides are load-bearing for builds -- we have other code that attempts to pretty-print Half/BFloat16, and implicit conversions can't be used to make that work because there are multiple implicit conversions from Half/BFloat16 to primitive types, so which one to select is ambiguous. Also, we don't actually seem to need it now in ExecuTorch core because we have `include <ostream>` in there at the moment anyway. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144808 Approved by: https://github.com/janeyx99, https://github.com/malfet	2025-01-15 06:08:53 +00:00
fan.mo	64829b356a	[PrivateUse1] Support parseDispatchKey with modified PrivateUse1 (#144325 ) PyTorch now supports many PrivateUse1 backend names such as `AutogradPrivateUse1` or `QuantizedPrivateUse1`, not to mention the original `PrivateUse1` backend. However, users who implement `PrivateUse1` functionality may rename the backend by calling `torch.utils.rename_privateuse1_backend("my_backend")`; in that case, no `PrivateUse1` backend string is found when we call other functions related to it. For example, when we use `torch.library` to register custom functions for our new backend, we use "my_backend" as the backend name instead of "PrivateUse1", and the following error is thrown: ``` could not parse dispatch key 'my_backend' ``` So this PR changes the function `c10::DispatchKey parseDispatchKey(const std::string& k)`: it double-checks whether `PrivateUse1` has been renamed and, if so, adapts `k` to the new backend name and looks it up again. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144325 Approved by: https://github.com/albanD	2025-01-14 21:21:29 +00:00
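
The complex `sinc` expressions in the #146648 entry can be checked by hand from the angle-addition identity for sine. A short derivation follows, writing $z = x + i y$ for the input after it has been multiplied by $\pi$ (this notation is an assumption made for the note, not taken from the PR):

```math
\sin(x + i y) = \sin(x)\cosh(y) + i\,\cos(x)\sinh(y)
```

```math
\text{sinc}(z) = \frac{\sin(z)}{z} = \frac{\left(\sin(x)\cosh(y) + i\,\cos(x)\sinh(y)\right)(x - i y)}{x^2 + y^2}
```

```math
\text{Re}\left(\text{sinc}(z)\right) = \frac{x\,\sin(x)\cosh(y) + y\,\cos(x)\sinh(y)}{x^2 + y^2}, \qquad \text{Im}\left(\text{sinc}(z)\right) = \frac{x\,\cos(x)\sinh(y) - y\,\sin(x)\cosh(y)}{x^2 + y^2}
```

As a sanity check, at $x = 0$, $y = 1$ the real part is $\sinh(1)$ and the imaginary part is $0$, which matches $\sin(i)/i = \sinh(1)$.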

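The #145157 entry relies on the fact that NaN is the only floating-point value that compares unequal to itself, so "either operand is NaN" can be written as `a != a || b != b`. Below is a minimal Python sketch of that idea. The function names are hypothetical and this is not the Metal header code, only an illustration of the NaN-propagating behaviour the entry describes.

```python
import math


def is_unordered(a: float, b: float) -> bool:
    """Return True when at least one operand is NaN.

    NaN is the only float that is not equal to itself, so the
    self-comparison detects it without calling an isnan helper.
    """
    return a != a or b != b


def nan_propagating_max(a: float, b: float) -> float:
    """Maximum that returns NaN whenever either input is NaN."""
    if is_unordered(a, b):
        return math.nan
    return a if a > b else b


def nan_propagating_min(a: float, b: float) -> float:
    """Minimum that returns NaN whenever either input is NaN."""
    if is_unordered(a, b):
        return math.nan
    return a if a < b else b
```

A plain comparison-based maximum can silently drop a NaN depending on argument order, which is why the explicit unordered check comes first.
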

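The three-step device enumeration described in the #144378 entry (prefer the first platform that has a dGPU, otherwise fall back to the first platform that has an iGPU, otherwise report nothing) can be sketched in plain Python. The `Device` class and `select_devices` function are hypothetical stand-ins for illustration. The actual change operates on SYCL/Level Zero platforms in C++, and this sketch only mirrors the selection order stated in the commit message.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    """Hypothetical stand-in for a GPU reported by a platform."""

    name: str
    is_integrated: bool


def select_devices(platforms: list[list[Device]]) -> list[Device]:
    """Pick the devices to expose, following the three steps in the entry."""
    # Step 1: first platform with at least one dGPU -> all dGPUs of that platform.
    for devices in platforms:
        dgpus = [d for d in devices if not d.is_integrated]
        if dgpus:
            return dgpus
    # Step 2: no dGPU anywhere -> all iGPUs of the first platform that has one.
    for devices in platforms:
        igpus = [d for d in devices if d.is_integrated]
        if igpus:
            return igpus
    # Step 3: neither kind of GPU was found.
    return []


example = [
    [Device("igpu0", True)],
    [Device("dgpu0", False), Device("dgpu1", False)],
]
selected = select_devices(example)
```

Because every returned device comes from a single platform, the result is consistent with the entry's stated goal that all enumerated devices share one context.
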
2733 commits