Commit graph

2707 commits

Author SHA1 Message Date
Davide Italiano
d4ad7b91ad [mps] Move zeta() to special_math.h. (#146231)
In preparation for implementing digamma/polygamma

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146231
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-02-01 19:22:59 +00:00
PyTorch MergeBot
c39c679813 Revert "Tensor .cuda() very slow with specific array sizes (#138964)"
This reverts commit 98f87edd23.

Reverted https://github.com/pytorch/pytorch/pull/138964 on behalf of https://github.com/huydhn due to Sorry for reverting your PR but some slow tests started failing after this landed ([comment](https://github.com/pytorch/pytorch/pull/138964#issuecomment-2628455198))
2025-01-31 21:48:51 +00:00
Donald Tolley
98f87edd23 Tensor .cuda() very slow with specific array sizes (#138964)
### **Pull Request: Optimized Non-Contiguous Tensor Copy for CPU to GPU in PyTorch**

#### **Summary**
This PR addresses the performance issue identified in [#111570](https://github.com/pytorch/pytorch/issues/111570), where non-contiguous tensors took significantly longer to transfer from CPU to GPU. Through detailed tracing of the call flow, we identified that PyTorch was creating temporary contiguous buffers for non-contiguous tensor transfers, which introduced unnecessary overhead.

#### **Tracing the Issue**
To pinpoint the cause of the slowdown, we followed the call flow from Python’s `tensor.cuda()` method through PyTorch’s backend, ultimately identifying `copy_kernel_cuda` as the key function responsible for CPU-to-GPU tensor transfers. Here’s a summary of the tracing process:

1. **Python Call: `tensor.cuda()`**
   - Starting from Python, the `cuda()` method initiates the tensor transfer to the GPU.

2. **`TensorBody.h: cuda()`**
   - The `cuda()` method calls `to()`, specifying the target device as CUDA.

3. **`Tensor.cpp: TensorBase::to()`**
   - The `to()` function prepares device and data type options before invoking `_ops::to_dtype_layout::call()`.

4. **Operator Call: `_ops::to_dtype_layout::call()`**
   - This operator dispatches the request to the backend-specific function responsible for managing the transfer.

5. **`Copy.cpp: copy_()`**
   - The `copy_()` function performs preliminary checks (e.g., zero-tensor immutability) and proceeds to call `copy_impl()`.

6. **`Copy.cpp: copy_impl()`**
   - This function sets up a tensor iterator and dispatches the copy operation to the appropriate backend through `copy_stub`.

7. **Dispatch to CUDA: `copy_stub`**
   - The dispatch mechanism routes the call to the CUDA-specific function, `copy_kernel_cuda`.

8. **`Copy.cu: copy_kernel_cuda()`**
   - Here, we identified that PyTorch was creating temporary contiguous buffers for 1D and 2D non-contiguous tensors, which slowed down the copy process. This behavior is managed by the `copy_requires_temporaries()` function.

#### **Solution**
To address this, we modified `copy_kernel_cuda` to handle non-contiguous 1D and 2D tensors directly by using `cudaMemcpy2DAsync`, which allows efficient, stride-aware memory transfers without temporary buffers. Here’s why this approach improves performance:

- **Efficiency of `cudaMemcpy2DAsync`**: This CUDA function is optimized for pitched (stride-based) memory transfers, allowing it to handle non-contiguous data layouts effectively by specifying memory strides for source and destination tensors.
- **Reduction of Overhead**: By directly copying non-contiguous tensors without intermediate buffers, we eliminate extra memory allocation and achieve faster CPU-to-GPU transfers.
- **Asynchronous Execution**: `cudaMemcpy2DAsync` enables asynchronous transfer on the CUDA stream, further improving performance by taking advantage of CUDA's optimized memory handling for non-contiguous layouts.
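As a concrete illustration of the stride-to-pitch mapping, here is a minimal hedged sketch (a hypothetical helper, not the PR's actual code) of a strided 2D host-to-device copy expressed through `cudaMemcpy2DAsync`:

```cpp
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical helper: copy `rows` logical rows of `cols` floats from a
// non-contiguous host buffer (row starts `src_stride` elements apart,
// src_stride >= cols) into a packed device buffer, without staging through
// a temporary contiguous buffer.
cudaError_t copy2d_strided_h2d(float* dst, const float* src, size_t rows,
                               size_t cols, size_t src_stride,
                               cudaStream_t stream) {
  const size_t width_bytes = cols * sizeof(float);   // payload bytes per row
  const size_t spitch = src_stride * sizeof(float);  // bytes between row starts
  // dst is contiguous, so its pitch equals the row width.
  return cudaMemcpy2DAsync(dst, width_bytes, src, spitch, width_bytes, rows,
                           cudaMemcpyHostToDevice, stream);
}
```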

#### **Performance Results**

In my testing, I created tensors of size `327680 x 2000` and used slices for transfer performance measurements. The tests show that the average time for transferring a non-contiguous slice (e.g., rows 10,000 to 50,000) from CPU to GPU now closely matches the contiguous case. This improvement indicates that the updated implementation effectively addresses the performance discrepancy. Below are the measured times and validation checks:

```plaintext
Average time for contiguous slice (rows 10,000-50,000): 66 ms
Average time for non-contiguous slice (rows 10,000-50,000): 66 ms

Validation of contiguous and non-contiguous tensor copies:
 PASS: Tensor shapes match.
 PASS: Tensor contiguity matches.
 PASS: Tensor contents match.
 PASS: Tensor data types match.

 Success: Both contiguous and non-contiguous tensors were copied correctly to the GPU.
```

#### **Conclusion**
This PR resolves the identified performance issue by eliminating the need for temporary buffers in non-contiguous 1D and 2D tensor transfers, ensuring faster and more efficient copies from CPU to GPU. Future optimizations could further enhance performance for higher-dimensional non-contiguous tensors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138964
Approved by: https://github.com/jeffdaily

Co-authored-by: Natalia Gimelshein <ngimel@gmail.com>
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-01-31 17:05:02 +00:00
Natalia Gimelshein
08ff11e9d0 initialize device when pinning memory on this device, short circuit is_pinned if device is not initialized (#145752)
Do not land
RFC
potential fix for #144687

Now `.is_pinned(device="cuda")` does not initialize the device and thus doesn't poison the fork (though it complains about the `device` arg being deprecated). To avoid needing the `device=` arg, we'd need to fix get_accelerator to not initialize the device.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145752
Approved by: https://github.com/albanD

Co-authored-by: albanD <albandes@fb.com>
2025-01-30 21:37:29 +00:00
cyyever
8a6e9a88e9 Let PYTORCH_NO_CUDA_MEMORY_CACHING have an effect only when its value is 1 (#145905)
Fixes #145661
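A minimal sketch of the gating described in the title (an assumed helper, not the actual allocator code): the environment variable now disables caching only when its value is exactly `1`.

```cpp
#include <cstdlib>
#include <cstring>

// Assumed helper: PYTORCH_NO_CUDA_MEMORY_CACHING takes effect only when
// set to exactly "1", not merely when it is defined.
bool caching_allocator_disabled() {
  const char* env = std::getenv("PYTORCH_NO_CUDA_MEMORY_CACHING");
  return env != nullptr && std::strcmp(env, "1") == 0;
}
```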

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145905
Approved by: https://github.com/eqy, https://github.com/janeyx99

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
2025-01-30 05:11:10 +00:00
cyy
116af809eb Use std::string_view (#145906)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145906
Approved by: https://github.com/albanD
2025-01-30 03:14:27 +00:00
Yu, Guangye
3b3aac0cde Filter out iGPU if dGPU is found on XPU (#144378)
# Motivation
for https://github.com/pytorch/pytorch/issues/143914
On Windows, there are two separate SYCL platforms for iGPU and dGPU. To simplify the logic, we will exclude iGPUs when a dGPU is present. This ensures that all XPU devices enumerated by PyTorch share the same SYCL context.

Now I generalize the logic as below (see the sketch after this list):
1. We find the first L0 platform containing at least one dGPU and enumerate all dGPUs of that platform.
2. If no dGPU is found, we find the first L0 platform containing an iGPU and enumerate all iGPUs of that platform.
3. If no GPU is found (neither iGPU nor dGPU), nothing is enumerated.
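A hedged sketch of that selection policy (the `Platform` type and device lists are stand-ins, not the actual SYCL/Level Zero API):

```cpp
#include <vector>

// Stand-in for a SYCL platform and the GPUs it exposes.
struct Platform {
  std::vector<int> dgpus;  // discrete GPU indices on this platform
  std::vector<int> igpus;  // integrated GPU indices on this platform
};

std::vector<int> enumerate_xpu_devices(const std::vector<Platform>& platforms) {
  // 1. Prefer the first platform that contains at least one dGPU.
  for (const auto& p : platforms)
    if (!p.dgpus.empty()) return p.dgpus;
  // 2. Otherwise fall back to the first platform with an iGPU.
  for (const auto& p : platforms)
    if (!p.igpus.empty()) return p.igpus;
  // 3. No GPU found at all.
  return {};
}
```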
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144378
Approved by: https://github.com/EikanWang, https://github.com/gujinghui
2025-01-29 15:53:16 +00:00
Nikita Shulga
3fd4691908 [MPS] Add op_math_t (#145808)
Similar to `at::opmath_t`, to be used for reductions (and int matmuls)
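A toy sketch of the trait's shape (the names and the `Half` stand-in are illustrative, not the actual header): widen low-precision types so reductions accumulate at full precision.

```cpp
#include <type_traits>

struct Half { unsigned short bits; };  // stand-in for a 16-bit float type

// Map each type to the type the math should be performed in.
template <typename T> struct OpMath { using type = T; };
template <> struct OpMath<Half> { using type = float; };  // accumulate in float

template <typename T> using op_math_t = typename OpMath<T>::type;

static_assert(std::is_same_v<op_math_t<Half>, float>);
static_assert(std::is_same_v<op_math_t<double>, double>);
```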
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145808
Approved by: https://github.com/dcci
2025-01-28 18:03:52 +00:00
cyy
c751541e79 Fix cppcoreguidelines-init-variables ignorance (#141795)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141795
Approved by: https://github.com/albanD
2025-01-28 17:11:37 +00:00
cyy
67fcc7cf02 [3/N] Remove unnecessary once flag usage (#145672)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145672
Approved by: https://github.com/albanD
2025-01-28 04:28:18 +00:00
Nikita Shulga
3a23d75b37 [MPS] Fix c10::metal::log_gamma correctness on M4 (#145740)
To work around a bug where the `abs` method call seems to be ignored before calling log; this can be reproduced by running the following code (submitted as FB16415011):
```swift
import Metal

func run_shader<T: BinaryFloatingPoint> (library: MTLLibrary, kernel_name: String, type: T.Type, nelem: Int = 16) {
  guard let mfunc = library.makeFunction(name: kernel_name) else { fatalError("Can't find function") }
  let device = library.device
  guard let queue = device.makeCommandQueue() else { fatalError("Can't make queue") }
  guard let cmdBuffer = queue.makeCommandBuffer() else { fatalError("Can't make command buffer") }
  guard let computeEncoder = cmdBuffer.makeComputeCommandEncoder() else { fatalError("Can't make compute encoder") }
  guard let ibuf = device.makeBuffer(length:nelem * MemoryLayout<T>.size, options: [.storageModeShared]) else { fatalError("Can't alloc") }
  let ibuf_data = ibuf.contents().assumingMemoryBound(to: T.self)
  for i in 0..<nelem {
    ibuf_data[i] = T(sin(Float(2 + i)))
  }
  guard let obuf = device.makeBuffer(length:nelem * MemoryLayout<T>.size, options: [.storageModeShared]) else { fatalError("Can't alloc") }
  let obuf_data = obuf.contents().assumingMemoryBound(to: T.self)

  computeEncoder.setComputePipelineState(try! device.makeComputePipelineState(function: mfunc))
  computeEncoder.setBuffer(obuf, offset:0, index: 0)
  computeEncoder.setBuffer(ibuf, offset:0, index: 1)
  computeEncoder.dispatchThreads(MTLSizeMake(nelem, 1, 1), threadsPerThreadgroup:MTLSizeMake(nelem, 1, 1))
  computeEncoder.endEncoding()
  cmdBuffer.commit()
  cmdBuffer.waitUntilCompleted()

  print("Results for \(String(describing: T.self)):", terminator: " ")
  for i in 0..<nelem {
    print(obuf_data[i], terminator: " ")
  }
  print()
}

let shader_source = """
#include <metal_stdlib>

template<typename T>
float foo(T x) {
  const auto abs_x = ::metal::abs(static_cast<float>(x));
  auto rc = ::metal::log(abs_x);

  return rc - ::metal::log(::metal::abs(abs_x * ::metal::sinpi(abs_x)));
}

kernel void half_kernel(
    device half* out_ptr0,
    constant half* in_ptr0,
    uint xindex [[thread_position_in_grid]]
) {
  auto inp = in_ptr0[xindex];
  auto out = foo(inp);
  out_ptr0[xindex] = static_cast<half>(out);
}

kernel void float_kernel(
    device float* out_ptr0,
    constant float* in_ptr0,
    uint xindex [[thread_position_in_grid]]
) {
  auto inp = in_ptr0[xindex];
  auto out = foo(inp);
  out_ptr0[xindex] = static_cast<float>(out);
}
"""
let options = MTLCompileOptions()
options.mathMode = .safe
options.mathFloatingPointFunctions = .precise

guard let device = MTLCopyAllDevices().first else { fatalError("No Metal device found") }
let library = try! device.makeLibrary(source:shader_source, options:options)
run_shader(library:library, kernel_name:"half_kernel", type: Float16.self)
run_shader(library:library, kernel_name:"float_kernel", type: Float.self)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145740
Approved by: https://github.com/dcci
2025-01-27 21:24:22 +00:00
Nikita Shulga
71caac2b30 [MPSInductor] Add rand support (#145705)
Using Philox4 as PRNG

Test plan (other than CI)
Run
```python
import torch
from torch._inductor.utils import run_and_get_code
from contextlib import nullcontext

def foo(x):
   return x * torch.randn_like(x)

foo_c = torch.compile(foo)

x = torch.ones(100, 100, device="mps")

y = foo_c(x)

print(y.mean().item(), y.std().item())
for i in range(25):
  print(y[i].mean(), y[i].std())
```
And observe that the printed values are close to 0 and 1.

TODO: Better `randint` algorithm for large ranges

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145705
Approved by: https://github.com/dcci, https://github.com/jansel
2025-01-27 06:07:36 +00:00
Yichen Yan
ed015143ef Set RUNPATH on CUDA and XPU tests (#144305)
#136627 almost fixed the issue of test binaries' runpath not being set correctly, with a few cases left.

This PR fixes the rest.

The binaries are found by `auditwheel repair` a wheel built with `BUILD_TEST=1`.

@malfet

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144305
Approved by: https://github.com/malfet
2025-01-26 08:40:22 +00:00
Simon Mahns
6939a56e13 [autocast][pytorch] Support autocast for MTIA (#145627)
Summary: Add autocast support to MTIA

Reviewed By: egienvalue

Differential Revision: D68572548

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145627
Approved by: https://github.com/egienvalue
2025-01-25 03:24:59 +00:00
Davide Italiano
57591edca1 [mps/inductor] Add support for erfinv. (#145643)
After several rounds of refactoring, this seems to be done now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145643
Approved by: https://github.com/malfet, https://github.com/jansel
2025-01-24 22:55:44 +00:00
Davide Italiano
f56c638849 [c10/metal] Add a vectype variant for short/int/long (#145430)
Some of the kernels (exp_complex/atan_complex) need the specialization.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145430
Approved by: https://github.com/malfet, https://github.com/jansel
2025-01-23 04:52:56 +00:00
Nikita Shulga
70ccbade83 [MPSInductor] Add gamma op (#145341)
By moving `gamma` and `log_gamma` implementation from `Gamma.metal` to `c10/metal/special_math.h`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145341
Approved by: https://github.com/Skylion007, https://github.com/dcci
ghstack dependencies: #145309
2025-01-22 19:37:45 +00:00
Isuru Fernando
4b77ff9784 Fix PythonMod printing for C++ (#143385)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143385
Approved by: https://github.com/leslie-fang-intel, https://github.com/anijain2305
2025-01-22 14:58:35 +00:00
Nikita Shulga
1908116ace [MPS][BE] Move vectypes from Quantized to utils (#145312)
That allows one to get appropriate vectorized types for templates using `c10::metal::vec2type_t<>` or `c10::metal::vec4type_t<>`
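A toy C++ sketch of what such a trait looks like (illustrative; Metal's built-in `float2`/`float4` play this role in the real header):

```cpp
// Stand-ins for Metal's built-in vector types.
struct float2 { float x, y; };
struct float4 { float x, y, z, w; };

// Map a scalar type to its 2-wide vector counterpart.
template <typename T> struct Vec2Type;
template <> struct Vec2Type<float> { using type = float2; };

template <typename T> using vec2type_t = typename Vec2Type<T>::type;

static_assert(sizeof(vec2type_t<float>) == 2 * sizeof(float));
```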
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145312
Approved by: https://github.com/dcci
2025-01-22 00:37:28 +00:00
Davide Italiano
8cc415774f [mps/inductor] Introduce a metal approx for erf() and use it. (#145161)
Probably we can do better, but this is a start.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145161
Approved by: https://github.com/malfet
2025-01-19 02:29:05 +00:00
Nikita Shulga
cede43e06b [MPSInductor][BE] NaN-propagating min/max to header (#145157)
May later be reused from the eager ops as well.

Also, I didn't know that Metal already has type_traits.
We use `metal::isunordered(a, b)` instead of `metal::isnan(a + b)`: it is defined as a function equivalent to `a != a || b != b`, but I suspect it might have the best native implementation for the specific architecture.
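A plain C++ analogue of a NaN-propagating min built on `isunordered` (illustrative; the commit's version lives in a Metal header):

```cpp
#include <cmath>
#include <limits>

template <typename T>
T nan_propagating_min(T a, T b) {
  // std::isunordered(a, b) is true iff either operand is NaN, i.e.
  // (a != a) || (b != b), mirroring metal::isunordered.
  if (std::isunordered(a, b)) {
    return std::numeric_limits<T>::quiet_NaN();
  }
  return a < b ? a : b;
}
```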

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145157
Approved by: https://github.com/dcci
2025-01-18 22:52:44 +00:00
Nikita Shulga
dc9b77cc55 [MPS] Support includes in metal objects (#145087)
Useful for code reuse for Metal shaders built both in eager mode and in MPSInductor, but it requires implementing a `_cpp_embed_headers` tool that, as the name suggests, preprocesses and embeds the headers into the shader to be used in dynamic compilation.
Test using:
 -  `TestMetalLibrary.test_metal_include`
 - Moving `i0`/`i1` implementation to `c10/util/metal_special_math.h` and calling it from the `SpecialOps.metal` shader, which now looks much more compact:
 ```metal
template <typename T, typename Tout = T>
void kernel
i0(constant T* input,
   device Tout* output,
   uint index [[thread_position_in_grid]]) {
  output[index] = c10::i0(static_cast<Tout>(input[index]));
}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145087
Approved by: https://github.com/dcci
ghstack dependencies: #145023
2025-01-18 05:35:22 +00:00
Scott Wolchok
69b883d7ac Remove C10_EMBEDDED (#144808)
I added this to support code sharing with ExecuTorch, but the operator<< overrides are load-bearing for builds: we have other code that attempts to pretty-print Half/BFloat16, and implicit conversions can't be used to make that work because there are *multiple* implicit conversions from Half/BFloat16 to primitive types, so which one to select is ambiguous. Also, we don't actually seem to need it in ExecuTorch core now, because we have `#include <ostream>` in there at the moment anyway.
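A toy reproduction of the ambiguity (not c10's actual Half):

```cpp
#include <iostream>

// Two implicit conversions to primitive types make `os << h` ambiguous:
// both ostream::operator<<(int) and ostream::operator<<(float) are viable.
struct ToyHalf {
  operator float() const { return 1.0f; }
  operator int() const { return 1; }
};

// A dedicated overload resolves it by choosing the conversion explicitly.
std::ostream& operator<<(std::ostream& os, const ToyHalf& h) {
  return os << static_cast<float>(h);
}

int main() {
  std::cout << ToyHalf{} << '\n';  // prints 1 via the float conversion
}
```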

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144808
Approved by: https://github.com/janeyx99, https://github.com/malfet
2025-01-15 06:08:53 +00:00
fan.mo
64829b356a [PrivateUse1] Support parseDispatchKey with modified PrivateUse1 (#144325)
PyTorch now supports many PrivateUse1 backend names like `AutogradPrivateUse1` or `QuantizedPrivateUse1`, not to mention the original `PrivateUse1` backend.

However, users who implement `PrivateUse1` functionality may rename the backend by calling `torch.utils.rename_privateuse1_backend("my_backend")`; in that case, the `PrivateUse1` backend string will not be found when we call other functions related to it. For example, if we use `torch.library` to register some custom functions for our new backend, we would use "my_backend" as the backend name instead of "PrivateUse1", and the following error would be thrown:
```
could not parse dispatch key 'my_backend'
```

So, this PR changes the function `c10::DispatchKey parseDispatchKey(const std::string& k)`: it double-checks whether the `PrivateUse1` backend name has been modified, and if so, rewrites `k` to the canonical name and looks it up again.
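A hedged sketch of the retry idea (the helper's shape is hypothetical, not the actual c10 code):

```cpp
#include <functional>
#include <optional>
#include <string>

// `raw_lookup` stands in for the real key-table lookup; `backend` is the
// user-chosen PrivateUse1 name, e.g. "my_backend".
std::optional<int> parse_dispatch_key(
    std::string k, const std::string& backend,
    const std::function<std::optional<int>(const std::string&)>& raw_lookup) {
  if (auto key = raw_lookup(k)) return key;
  // Restore the canonical name and retry: e.g. "my_backend" -> "PrivateUse1".
  if (auto pos = k.find(backend); pos != std::string::npos) {
    k.replace(pos, backend.size(), "PrivateUse1");
    return raw_lookup(k);
  }
  return std::nullopt;
}
```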

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144325
Approved by: https://github.com/albanD
2025-01-14 21:21:29 +00:00
PyTorch MergeBot
bdd942efd7 Revert "Increase C10_COMPILE_TIME_MAX_GPUS to 128 (#144138)"
This reverts commit 6cfc081675.

Reverted https://github.com/pytorch/pytorch/pull/144138 on behalf of https://github.com/albanD due to This seems to impact the caffe2 code ([comment](https://github.com/pytorch/pytorch/pull/144138#issuecomment-2590891200))
2025-01-14 19:04:12 +00:00
Richard Barnes
1dab79470d c10::string_view -> std::string_view in pytorch (#143591)
Test Plan: Sandcastle

Differential Revision: D67312322

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143591
Approved by: https://github.com/malfet
2025-01-13 21:44:05 +00:00
Nikita Shulga
9ae35b8bb1 [BE] Introduce c10::SyntaxError (#144647)
Which will be translated into Python's SyntaxError
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144647
Approved by: https://github.com/Skylion007
2025-01-12 23:23:54 +00:00
cyy
6cfc081675 Increase C10_COMPILE_TIME_MAX_GPUS to 128 (#144138)
To facilitate further possible changes of DeviceIndex to int16_t.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144138
Approved by: https://github.com/albanD
2025-01-10 23:53:19 +00:00
Scott Wolchok
0529908f13 Remove is_reduced_floating_point from namespace std (#144502)
Partial fix for #144495. Avoiding BC-break using existing practice of removing only if FBCODE_CAFFE2 and C10_NODEPRECATED are not defined.

Differential Revision: [D67992342](https://our.internmc.facebook.com/intern/diff/D67992342/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144502
Approved by: https://github.com/malfet
2025-01-10 03:24:10 +00:00
cyy
9a841f9321 Enable bugprone-unchecked-optional-access (#144226)
We can actually enable bugprone-unchecked-optional-access without the risk of a hang.
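For reference, the kind of code this clang-tidy check flags (illustrative):

```cpp
#include <optional>

int bad(const std::optional<int>& o) {
  return *o;  // flagged: dereference without checking has_value()
}

int good(const std::optional<int>& o) {
  return o.has_value() ? *o : 0;  // checked access passes the lint
}
```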

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144226
Approved by: https://github.com/albanD
2025-01-10 03:16:56 +00:00
cyy
d0070ca07e [18/N] Fix extra warnings brought by clang-tidy-17 (#144014)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144014
Approved by: https://github.com/Skylion007, https://github.com/albanD
2025-01-08 17:21:55 +00:00
Michal Gallus
93633d0e80 [ROCm][Windows] Fix export macros (#144098)
For correct import and export of functions when dynamic linkage is used for HIP libraries on Windows, the appropriate export/import macros need to be put in place. This pull request reuses the existing CUDA import/export macros by converting them to the corresponding HIP macros during the hipification process.
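The generic shape of such export/import macros (the macro names here are illustrative, not PyTorch's actual ones):

```cpp
// On Windows, symbols must be explicitly exported while building the DLL
// and imported while consuming it; on other platforms visibility suffices.
#ifdef _WIN32
#  ifdef MYLIB_BUILD_SHARED
#    define MYLIB_API __declspec(dllexport)
#  else
#    define MYLIB_API __declspec(dllimport)
#  endif
#else
#  define MYLIB_API __attribute__((visibility("default")))
#endif

MYLIB_API void some_hip_entry_point();
```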

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144098
Approved by: https://github.com/jeffdaily
2025-01-04 17:12:46 +00:00
Yu, Guangye
09e47ab7ab Refine CUDA Stream priority (#143849)
# Motivation
As mentioned in https://github.com/pytorch/pytorch/pull/141119#discussion_r1897480515, we now properly handle priority values that fall outside of the valid priority range.

# Additional Context
If the value falls outside of the allowed priority range, it will automatically be mapped to the nearest valid priority (either lowest or highest).
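A sketch of that clamping with the CUDA runtime API (the helper is illustrative; the actual change lives in c10's stream code). Note that in CUDA a numerically lower value means a higher priority, so the valid range is `[greatest, least]`:

```cpp
#include <cuda_runtime.h>
#include <algorithm>

cudaStream_t create_stream_with_clamped_priority(int requested) {
  int least = 0, greatest = 0;
  cudaDeviceGetStreamPriorityRange(&least, &greatest);
  // Map an out-of-range request to the nearest valid priority.
  const int priority = std::clamp(requested, greatest, least);
  cudaStream_t stream = nullptr;
  cudaStreamCreateWithPriority(&stream, cudaStreamNonBlocking, priority);
  return stream;
}
```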

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143849
Approved by: https://github.com/albanD, https://github.com/EikanWang
ghstack dependencies: #142347, #141119, #141123, #143799
2024-12-31 11:15:59 +00:00
Yu, Guangye
a68c0ca497 Add low priority XPU Stream (#141119)
# Motivation
Because an external SYCL queue may have a low priority, we need to support low-priority SYCL queues for native XPU Streams to maintain consistency.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141119
Approved by: https://github.com/gujinghui, https://github.com/albanD
ghstack dependencies: #142347
2024-12-31 11:15:45 +00:00
Yu, Guangye
39450ae655 Refine XPU external Stream (#142347)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142347
Approved by: https://github.com/gujinghui, https://github.com/albanD
2024-12-31 11:15:38 +00:00
cyy
af629a8146 Enable readability-redundant-declaration (#143982)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143982
Approved by: https://github.com/Skylion007
2024-12-31 00:20:10 +00:00
cyy
dca443835e Enable more readability-redundant checks (#143963)
They are helpful to simplifying code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143963
Approved by: https://github.com/albanD
2024-12-30 14:49:33 +00:00
Yu, Guangye
07fa6e2c8b Fix torch.accelerator api abort when passing invaild device (#143550)
# Motivation
Fix https://github.com/pytorch/pytorch/issues/143543

# Solution
We should raise a Python exception instead of aborting.

# Additional Context
without this PR:
```python
>>> import torch
>>> torch.accelerator.current_stream(torch.accelerator.device_count())
terminate called after throwing an instance of 'c10::Error'
  what():  device is out of range, device is 2, total number of device is 2.
Exception raised from check_device_index at /home/dvrogozh/git/pytorch/pytorch/c10/xpu/XPUFunctions.h:36 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xac (0x7f30707eb95c in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xf3 (0x7f307078fc57 in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10.so)
frame #2: <unknown function> + 0x19a3e (0x7f3070c2ba3e in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10_xpu.so)
frame #3: c10::xpu::getCurrentXPUStream(signed char) + 0x2f (0x7f3070c2c83f in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10_xpu.so)
frame #4: <unknown function> + 0x1ca35 (0x7f3070c2ea35 in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10_xpu.so)
frame #5: <unknown function> + 0x653f15 (0x7f3083391f15 in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x39e5f2 (0x7f30830dc5f2 in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libtorch_python.so)
<omitting python frames>
frame #20: <unknown function> + 0x29d90 (0x7f308b19bd90 in /lib/x86_64-linux-gnu/libc.so.6)
frame #21: __libc_start_main + 0x80 (0x7f308b19be40 in /lib/x86_64-linux-gnu/libc.so.6)

Aborted (core dumped)
```
with this PR:
```python
>>> import torch
>>> torch.accelerator.current_stream(torch.accelerator.device_count())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/pt-gpu/4T-4652/guangyey/stock-pytorch/torch/accelerator/__init__.py", line 123, in current_stream
    return torch._C._accelerator_getStream(device_index)
RuntimeError: The device index is out of range. It must be in [0, 2), but got 2.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143550
Approved by: https://github.com/EikanWang, https://github.com/dvrogozh, https://github.com/albanD
2024-12-23 03:44:22 +00:00
Richard Barnes
e2d47a133b Disable c10::optional macros (#138912)
Test Plan: Sandcastle

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138912
Approved by: https://github.com/Skylion007, https://github.com/malfet
2024-12-17 09:22:47 +00:00
Richard Barnes
82ce888273 c10::string_view -> std::string_view in more places (#142517)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142517
Approved by: https://github.com/malfet
2024-12-12 19:45:59 +00:00
cyyever
a108b282ff [4/N] Avoid copy in std::get (#142285)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142285
Approved by: https://github.com/Skylion007

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2024-12-09 07:59:35 +00:00
xinan.lin
3d227ae315 [Intel GPU] Support getStreamFromExternal for XPU. (#140268)
In the AOT Inductor scenario, the GPU stream can be created outside of the pool of `XPUStream`, and we need to create an `XPUStream` that refers to this stream for the common logic of AOTI; for example, a stream guard is a guard for `XPUStream`. So we add `getStreamFromExternal`, following the design of CUDAStream.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140268
Approved by: https://github.com/desertfire, https://github.com/jansel, https://github.com/EikanWang
2024-12-07 19:22:04 +00:00
Nikita Shulga
3a3638be50 [BE] Enable Scalar.h compilation on 32-bit system (#142235)
By hiding the ambiguous `Scalar(long long)` constructor behind `std::enable_if_t<sizeof(void *) == 8>`

Followup after https://github.com/pytorch/pytorch/pull/141244

Test Plan: Run `printf "#include <c10/core/Scalar.h>\n c10::Scalar x(3);" | gcc -x c++ -std=c++17 -I. -Ibuild - -c` on ARMv7 system.
Before this change it failed with:
```
In file included from <stdin>:1:
./c10/core/Scalar.h:83:3: error: ‘c10::Scalar::Scalar(long long int)’ cannot be overloaded with ‘c10::Scalar::Scalar(int64_t)’
   83 |   Scalar(long long vv) : Scalar(vv, true) {}
      |   ^~~~~~
./c10/core/Scalar.h:50:3: note: previous declaration ‘c10::Scalar::Scalar(int64_t)’
   50 |   Scalar(type vv) : Scalar(vv, true) {}
      |   ^~~~~~
./c10/core/ScalarType.h:288:3: note: in expansion of macro ‘DEFINE_IMPLICIT_CTOR’
  288 |   _(int64_t, Long)                                \
      |   ^
./c10/core/Scalar.h:52:3: note: in expansion of macro ‘AT_FORALL_SCALAR_TYPES_AND7’
   52 |   AT_FORALL_SCALAR_TYPES_AND7(
      |   ^~~~~~~~~~~~~~~~~~~~~~~~~~~
```
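A minimal sketch of the SFINAE trick on a toy `Scalar` (not c10's actual class): the `long long` constructor exists only where it is a distinct type from `int64_t`.

```cpp
#include <cstdint>
#include <type_traits>

struct Scalar {
  Scalar(int64_t) {}
  // On LP64 systems int64_t is `long`, so a separate `long long` overload
  // is useful; on 32-bit systems int64_t *is* `long long` and the overload
  // would collide, so SFINAE compiles it out there.
  template <typename T,
            std::enable_if_t<std::is_same_v<T, long long> &&
                                 sizeof(void*) == 8,
                             bool> = true>
  Scalar(T) {}
};

int main() {
  Scalar a(int64_t{3});
  Scalar b(3LL);  // exact match via the templated overload on 64-bit
}
```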

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142235
Approved by: https://github.com/Skylion007
2024-12-07 01:05:55 +00:00
Brian Hirsh
20912ba582 fix incorrect c10::SymFloat::sqrt (#141728)
Fixes the silent correctness for SDPA in https://github.com/pytorch/pytorch/issues/141710

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141728
Approved by: https://github.com/Skylion007, https://github.com/ezyang, https://github.com/drisspg
ghstack dependencies: #141725
2024-12-03 23:34:16 +00:00
cyy
7c1d5db1f3 [2/N] Enable UBSAN tests (#141740)
Apply c10::load in more places. The function was introduced to cast a byte to a valid boolean value, thus fixing the UBSAN errors.
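A sketch of the idea behind such a load helper (illustrative, not the actual `c10::load`):

```cpp
#include <cstring>

template <typename T>
T load(const void* src) {
  T value;
  std::memcpy(&value, src, sizeof(T));
  return value;
}

// Reading a stored byte through a bool lvalue is UB when the byte is
// neither 0 nor 1, so load it raw and normalize instead.
template <>
inline bool load<bool>(const void* src) {
  unsigned char byte = 0;
  std::memcpy(&byte, src, sizeof(byte));
  return byte != 0;  // any nonzero byte becomes a valid `true`
}
```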

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141740
Approved by: https://github.com/ezyang
2024-12-03 20:52:26 +00:00
Yu, Guangye
77748ed8ec fix c10::Event UT failure on XPU backend (#141800)
# Motivation
Fix the UT failure introduced by https://github.com/pytorch/pytorch/pull/140865; an unrelated failure had been suppressing it, so it only started showing up once https://github.com/pytorch/pytorch/pull/141546 landed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141800
Approved by: https://github.com/EikanWang
2024-12-03 01:34:42 +00:00
Benjamin Glass
4959784dac Add API query for available per-process CUDA memory (#140620)
Certain `cpp_wrapper`-enabled tests were OOM-ing in the CI pipeline, with error messages suggesting that sufficient memory was accessible.  This ultimately resulted from an internal memory limitation that was not queryable in the API.  This PR adds querying for that limit.

Additionally, the failing tests had incorrect memory availability checks, and are updated with measured memory requirements.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140620
Approved by: https://github.com/malfet, https://github.com/eqy
ghstack dependencies: #141367
2024-12-03 00:24:03 +00:00
Edward Z. Yang
5212ec3879 Add admonition about as_float_unchecked() (#141742)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141742
Approved by: https://github.com/bdhirsh
2024-11-28 06:25:18 +00:00
Yu, Guangye
d905f1350a Friendly catch exception when fail to initialize XPU devices (#141658)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141658
Approved by: https://github.com/EikanWang
2024-11-28 05:17:08 +00:00
Umberto Valleriani
3becdaf8a7 [c10] Fix static_assert for 32-bit systems (#141244)
The `__ANDROID__` macro was used as a proxy to check whether compilation targets a 32- or 64-bit system, causing build failures on non-Android 32-bit Linux targets like ARMv7.

This modification adjusts the check to fail if and only if `int64_t` and `long` are not the same type on 64-bit systems, i.e. systems where `sizeof(void*) == 8`.
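An illustrative form of such an assertion (not the exact check in the patch; it also admits `long long` so the sketch stays portable to 64-bit platforms where `int64_t` is `long long`):

```cpp
#include <cstdint>
#include <type_traits>

// Key the constraint off pointer width rather than __ANDROID__.
static_assert(sizeof(void*) != 8 || std::is_same_v<int64_t, long> ||
                  std::is_same_v<int64_t, long long>,
              "unexpected int64_t definition on a 64-bit system");
```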

Like I said in issue #141043, I'm not sure whether a different `Scalar` constructor should be defined in the 32-bit case. My code does not break, but I'm not sure other people's code won't.

Fixes #141043

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141244
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-11-28 03:11:52 +00:00