pytorch

mirror of https://github.com/saymrwulf/pytorch.git synced 2026-05-14 20:57:59 +00:00

Author	SHA1	Message	Date
Chien-Chin Huang	c0e5cca4f8	[DDP] Change the --no-optimize-ddp flag to reflect the latest usage (#119437 ) Compiled DDP now has 4 different optimization modes. This PR changes the Dynamo benchmark flag to reflect that change. Pull Request resolved: https://github.com/pytorch/pytorch/pull/119437 Approved by: https://github.com/wconstab, https://github.com/xmfan	2024-02-13 16:53:56 +00:00
chuanqiw	074f2bb5ce	Fix dynamo benchmark runner for torchbench skip sets (#118615 ) Fix dynamo benchmark runner for torchbench skip sets, which was introduced by PR #118032. This runner.py script is still used in the [Inductor CPU Performance Dashboard](https://github.com/pytorch/pytorch/issues/93531) regular test. Pull Request resolved: https://github.com/pytorch/pytorch/pull/118615 Approved by: https://github.com/jgong5, https://github.com/ysiraichi, https://github.com/ezyang	2024-02-06 02:06:54 +00:00
PyTorch MergeBot	966db82c9d	Revert "Remove extra graph breaks (#118987 )" This reverts commit `9a8e3b07d7`. Reverted https://github.com/pytorch/pytorch/pull/118987 on behalf of https://github.com/eellison due to reverting because it causes regression ([comment](https://github.com/pytorch/pytorch/pull/118987#issuecomment-1928224447))	2024-02-05 22:19:37 +00:00
Michael Lazos	9a8e3b07d7	Remove extra graph breaks (#118987 ) Fixes https://github.com/pytorch/pytorch/issues/104053 Pull Request resolved: https://github.com/pytorch/pytorch/pull/118987 Approved by: https://github.com/janeyx99	2024-02-03 05:55:09 +00:00
BowenBao	30f43e3d89	[ONNX][bench] Deepcopy model to another device before export to avoid OOM (#118710 ) Prior to onnx export, the model is deepcopied to avoid modifications that may affect later performance profiling. However this increases the memory requirement on the device. This PR modifies the script to deepcopy and export the model on another device when possible. Pull Request resolved: https://github.com/pytorch/pytorch/pull/118710 Approved by: https://github.com/thiagocrepaldi	2024-01-31 23:03:39 +00:00
Aaron Gokaslan	1562dae62c	[BE]: Apply RUF025 dict.fromkeys preview rule (#118637 ) Simplifies and optimizes dict construction using the `fromkeys` classmethod ctor. This also makes it really obvious when all the keys will have the same static value, which could be a bug if unintentional. It is also significantly faster than using a dict comprehension. The rule is in preview, but I am adding a forward fix for when it becomes stable. Pull Request resolved: https://github.com/pytorch/pytorch/pull/118637 Approved by: https://github.com/albanD	2024-01-30 20:46:54 +00:00
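The RUF025 change above can be illustrated with a minimal sketch. The key names below are hypothetical and are not taken from the PyTorch sources; the sketch only shows the general pattern the rule targets, plus the shared-value pitfall the commit message alludes to.

```python
# Hypothetical keys, used only for illustration.
keys = ["accuracy", "speedup", "latency"]

# Dict comprehension that assigns the same constant to every key.
by_comprehension = {k: 0 for k in keys}

# Equivalent construction with the classmethod constructor.
by_fromkeys = dict.fromkeys(keys, 0)

# Pitfall: fromkeys stores the *same* object under every key, so a
# mutable default is shared between all entries.
shared = dict.fromkeys(keys, [])
shared["accuracy"].append(1)

# A comprehension evaluates the value expression once per key,
# so each entry gets its own list.
separate = {k: [] for k in keys}
separate["accuracy"].append(1)
```

With an immutable value such as `0` the two forms are interchangeable; with a mutable value the comprehension is usually what was intended.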
Edward Z. Yang	119b66ba16	Use strict to toggle strict options in MYPYSTRICT (#118479 ) As we force a specific version of mypy, it's OK to use the agglomerated flag. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/118479 Approved by: https://github.com/Skylion007, https://github.com/albanD ghstack dependencies: #118414, #118418, #118432, #118467, #118468, #118469, #118475	2024-01-28 19:22:22 +00:00
Yukio Siraichi	2f6fc33c20	Move skip sets into a new file. (#118032 ) This PR moves the skip sets that lived in benchmarks/dynamo/torchbench.py into a more readable YAML file, so that it is consumable from other projects (e.g. XLA). Pull Request resolved: https://github.com/pytorch/pytorch/pull/118032 Approved by: https://github.com/lezcano, https://github.com/ezyang	2024-01-24 19:22:01 +00:00
Jason Ansel	c5702a0891	[dynamo] Optimize BACKEND_MATCH guard (#118065 ) As measured by `benchmarks/dynamo/microbenchmarks/overheads.py`: - Before `22.5us` - After `18.1us` Pull Request resolved: https://github.com/pytorch/pytorch/pull/118065 Approved by: https://github.com/ydwu4	2024-01-24 07:47:52 +00:00
Simon Fan	ed0ec2e0be	Remove dynamo runner's dependency on distributed build (#117903 ) So that we can bisect faster without needing to rebuild distributed module. We remove the annotation to avoid flake8 undefined name lint Pull Request resolved: https://github.com/pytorch/pytorch/pull/117903 Approved by: https://github.com/xuzhao9	2024-01-24 06:51:14 +00:00
Jane Xu	13d2cdffa2	Remove optimizer.step patching for profiler hook (#115772 ) 1. I'd like to remove the patching that avoids the profiler hook, but it adds an additional graph break due to nested wrappers. #117767 if interested, see (internal only) paste for [before](P996529232) and [after](P997507449) this PR. ``` I've locally run perf benchmarks for yolov3: Before the speedup is 4.183x, and after it is 4.208x. I've also run it for resnet50: before, speedup is 3.706x and now it is 3.924x. ``` 2. @mlazos I now unwrap twice in the dynamo and inductor tests. This feels like we're testing deficiently--should we add tests to test that tracing through the profiler hook and the use_grad hook are functioning according to expectations (I know there's at least one graph break in one). 3. There's a strange memory thing going on...what is happening? This has been resolved with @voznesenskym's [change](https://github.com/pytorch/pytorch/pull/116169). (for details see below) <details> This PR will fail the test_static_address_finalizer test due to a mysterious thing that is happening (idk what, but maybe the dynamo cache or a frame _expecting_ the patching to have been done). There is no Python refcycle, as the backrefs for `p_ref()` look like: ![image](https://github.com/pytorch/pytorch/assets/31798555/4d6cbf50-3924-4efe-b578-d93389eebec8) (so 5 backrefs but none of them python) And the refs: ![image](https://github.com/pytorch/pytorch/assets/31798555/25e01105-bcb9-44ca-997a-2cf1670a6d42) </details> Pull Request resolved: https://github.com/pytorch/pytorch/pull/115772 Approved by: https://github.com/jansel, https://github.com/mlazos	2024-01-23 20:15:41 +00:00
Bin Bao	4d625c1c92	[AOTI] Fix a bug in the torch._export.aot_load API (#118039 ) Summary: tree_flatten_spec should use args instead of *args. Clone of https://github.com/pytorch/pytorch/pull/117948 but with some fbcode specific changes. Test Plan: CI Differential Revision: D52982401 Pull Request resolved: https://github.com/pytorch/pytorch/pull/118039 Approved by: https://github.com/angelayi	2024-01-23 14:54:02 +00:00
Michael Lazos	f302a0d380	Re-enable SGD (#117434 ) Re-enables the SGD optimizer now that compile times are more reasonable. [Benchmark run](https://github.com/pytorch/pytorch/actions/runs/7511073761) Pull Request resolved: https://github.com/pytorch/pytorch/pull/117434 Approved by: https://github.com/anijain2305, https://github.com/janeyx99	2024-01-19 04:28:50 +00:00
PyTorch MergeBot	2f84a9d37c	Revert "[CUDNN][SDPA] Experimental cuDNN Flash Attention v2 Inference (#115663 )" This reverts commit `5aa92b5090`. Reverted https://github.com/pytorch/pytorch/pull/115663 on behalf of https://github.com/PaliC due to Unfortunately, this pr breaks cuda builds internally ([comment](https://github.com/pytorch/pytorch/pull/115663#issuecomment-1899388813))	2024-01-18 23:40:30 +00:00
Jason Ansel	a669319450	[inductor] Faster C++ kernel python bindings (#117500 ) Calling C++ from Python via ctypes is notoriously slow. This switches to generating our own C++ bindings directly, which is a >5x speedup on this kernel-launch-bound microbenchmark: ```python from ctypes import c_void_p import torch from torch import empty from torch._inductor.codecache import AsyncCompile from torch._dynamo.testing import rand_strided from torch._inductor.utils import print_performance from torch._inductor.wrapper_benchmark import compiled_module_main async_compile = AsyncCompile() src = ''' #include "/tmp/torchinductor_jansel/gb/cgbau5vlj6cetmcjbjbtw6x4rrivaln6f45s5d72gy2bfx5foz3k.h" extern "C" void kernel(const float* in_ptr0, float* out_ptr0) { { auto tmp0 = in_ptr0[static_cast<long>(0L)]; auto tmp1 = static_cast<float>(1.0); auto tmp2 = decltype(tmp0)(tmp0 + tmp1); out_ptr0[static_cast<long>(0L)] = tmp2; } } ''' cpp_fused_add_ctypes = async_compile.cpp(src) cpp_fused_add_cpython = async_compile.cpp_pybinding(["const float", "float"], src) async_compile.wait(globals()) del async_compile def call(arg0_1): buf0 = empty((1,), device='cpu', dtype=torch.float32) if use_ctypes: for _ in range(100): cpp_fused_add_ctypes(c_void_p(arg0_1.data_ptr()), c_void_p(buf0.data_ptr())) else: for _ in range(100): cpp_fused_add_cpython(arg0_1, buf0) del arg0_1 return (buf0,) def benchmark_compiled_module(times=1000, repeat=100): arg0_1 = rand_strided((1,), (1,), device='cpu', dtype=torch.float32) return print_performance(lambda: call(arg0_1), times=times, repeat=repeat) print("old ctypes bindings: ", end='') use_ctypes = True compiled_module_main('None', benchmark_compiled_module) print("new bindings: ", end='') use_ctypes = False compiled_module_main('None', benchmark_compiled_module) ``` Output: ``` old ctypes bindings: 0.000073 new bindings: 0.000013 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/117500 Approved by: https://github.com/desertfire	2024-01-18 16:20:12 +00:00
Animesh Jain	6e4e81a9ef	[dynamo] Extend LazyVariableTracker to tuples (#117426 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/117426 Approved by: https://github.com/lezcano, https://github.com/jansel	2024-01-18 15:51:28 +00:00
Bin Bao	26956980c6	[AOTI] Add torch._export.aot_load (#117610 ) Summary: Add a torch._export.aot_load API that can load an AOTInductor-compiled model.so into a python executable. Test Plan: CI Differential Revision: D52825456 Pull Request resolved: https://github.com/pytorch/pytorch/pull/117610 Approved by: https://github.com/angelayi, https://github.com/khabinov, https://github.com/chenyang78	2024-01-18 15:02:16 +00:00
PyTorch MergeBot	b0084be114	Revert "Re-enable SGD (#117434 )" This reverts commit `e7fac72be7`. Reverted https://github.com/pytorch/pytorch/pull/117434 on behalf of https://github.com/lezcano due to breaks test_profiler.py when run with dynamo ([comment](https://github.com/pytorch/pytorch/pull/117434#issuecomment-1898311961))	2024-01-18 11:37:36 +00:00
Michael Lazos	e7fac72be7	Re-enable SGD (#117434 ) Re-enables the SGD optimizer now that compile times are more reasonable. [Benchmark run](https://github.com/pytorch/pytorch/actions/runs/7511073761) Pull Request resolved: https://github.com/pytorch/pytorch/pull/117434 Approved by: https://github.com/anijain2305, https://github.com/janeyx99	2024-01-18 06:47:15 +00:00
Eddie Yan	5aa92b5090	[CUDNN][SDPA] Experimental cuDNN Flash Attention v2 Inference (#115663 ) #113713 Going to clean up some of the checks and will remove draft status after. Can be tested on SM80+ with `TORCH_CUDNN_MHA_ENABLED=1`. CC @drisspg @ptrblck Pull Request resolved: https://github.com/pytorch/pytorch/pull/115663 Approved by: https://github.com/drisspg	2024-01-18 01:20:36 +00:00
Nikita Shulga	a1afd1b195	Revert "[inductor] Faster C++ kernel python bindings (#117500 )" It should have never been landed, but was landed again, thanks to ghstack grafting/ungrafting see discussion on https://github.com/pytorch/pytorch/pull/116910 This reverts commit `e457b6fb18`.	2024-01-17 17:06:32 -08:00
titaiwangms	e457b6fb18	[inductor] Faster C++ kernel python bindings (#117500 ) Calling C++ from Python via ctypes is notoriously slow. This switches to generating our own C++ bindings directly, which is a >5x speedup on this kernel-launch-bound microbenchmark: ```python from ctypes import c_void_p import torch from torch import empty from torch._inductor.codecache import AsyncCompile from torch._dynamo.testing import rand_strided from torch._inductor.utils import print_performance from torch._inductor.wrapper_benchmark import compiled_module_main async_compile = AsyncCompile() src = ''' #include "/tmp/torchinductor_jansel/gb/cgbau5vlj6cetmcjbjbtw6x4rrivaln6f45s5d72gy2bfx5foz3k.h" extern "C" void kernel(const float* in_ptr0, float* out_ptr0) { { auto tmp0 = in_ptr0[static_cast<long>(0L)]; auto tmp1 = static_cast<float>(1.0); auto tmp2 = decltype(tmp0)(tmp0 + tmp1); out_ptr0[static_cast<long>(0L)] = tmp2; } } ''' cpp_fused_add_ctypes = async_compile.cpp(src) cpp_fused_add_cpython = async_compile.cpp_pybinding(["const float", "float"], src) async_compile.wait(globals()) del async_compile def call(arg0_1): buf0 = empty((1,), device='cpu', dtype=torch.float32) if use_ctypes: for _ in range(100): cpp_fused_add_ctypes(c_void_p(arg0_1.data_ptr()), c_void_p(buf0.data_ptr())) else: for _ in range(100): cpp_fused_add_cpython(arg0_1, buf0) del arg0_1 return (buf0,) def benchmark_compiled_module(times=1000, repeat=100): arg0_1 = rand_strided((1,), (1,), device='cpu', dtype=torch.float32) return print_performance(lambda: call(arg0_1), times=times, repeat=repeat) print("old ctypes bindings: ", end='') use_ctypes = True compiled_module_main('None', benchmark_compiled_module) print("new bindings: ", end='') use_ctypes = False compiled_module_main('None', benchmark_compiled_module) ``` Output: ``` old ctypes bindings: 0.000073 new bindings: 0.000013 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/117500 Approved by: https://github.com/desertfire ghstack dependencies: #117409, #116667, #117591	2024-01-17 23:03:15 +00:00
PyTorch MergeBot	da6abaeeac	Revert "[inductor] Faster C++ kernel python bindings (#117500 )" This reverts commit `bb0fd1bd3c`. Reverted https://github.com/pytorch/pytorch/pull/117500 on behalf of https://github.com/PaliC due to breaking internal discussed with author offline ([comment](https://github.com/pytorch/pytorch/pull/117500#issuecomment-1896516512))	2024-01-17 19:34:26 +00:00
titaiwangms	bb0fd1bd3c	[inductor] Faster C++ kernel python bindings (#117500 ) Calling C++ from Python via ctypes is notoriously slow. This switches to generating our own C++ bindings directly, which is a >5x speedup on this kernel-launch-bound microbenchmark: ```python from ctypes import c_void_p import torch from torch import empty from torch._inductor.codecache import AsyncCompile from torch._dynamo.testing import rand_strided from torch._inductor.utils import print_performance from torch._inductor.wrapper_benchmark import compiled_module_main async_compile = AsyncCompile() src = ''' #include "/tmp/torchinductor_jansel/gb/cgbau5vlj6cetmcjbjbtw6x4rrivaln6f45s5d72gy2bfx5foz3k.h" extern "C" void kernel(const float* in_ptr0, float* out_ptr0) { { auto tmp0 = in_ptr0[static_cast<long>(0L)]; auto tmp1 = static_cast<float>(1.0); auto tmp2 = decltype(tmp0)(tmp0 + tmp1); out_ptr0[static_cast<long>(0L)] = tmp2; } } ''' cpp_fused_add_ctypes = async_compile.cpp(src) cpp_fused_add_cpython = async_compile.cpp_pybinding(["const float", "float"], src) async_compile.wait(globals()) del async_compile def call(arg0_1): buf0 = empty((1,), device='cpu', dtype=torch.float32) if use_ctypes: for _ in range(100): cpp_fused_add_ctypes(c_void_p(arg0_1.data_ptr()), c_void_p(buf0.data_ptr())) else: for _ in range(100): cpp_fused_add_cpython(arg0_1, buf0) del arg0_1 return (buf0,) def benchmark_compiled_module(times=1000, repeat=100): arg0_1 = rand_strided((1,), (1,), device='cpu', dtype=torch.float32) return print_performance(lambda: call(arg0_1), times=times, repeat=repeat) print("old ctypes bindings: ", end='') use_ctypes = True compiled_module_main('None', benchmark_compiled_module) print("new bindings: ", end='') use_ctypes = False compiled_module_main('None', benchmark_compiled_module) ``` Output: ``` old ctypes bindings: 0.000073 new bindings: 0.000013 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/117500 Approved by: https://github.com/desertfire ghstack dependencies: #117409, #116667, #117591	2024-01-17 19:12:24 +00:00
PyTorch MergeBot	9da01affd3	Revert "[inductor] Faster C++ kernel python bindings (#117500 )" This reverts commit `3a52147cc5`. Reverted https://github.com/pytorch/pytorch/pull/117500 on behalf of https://github.com/PaliC due to breaking internal discussed with author offline ([comment](https://github.com/pytorch/pytorch/pull/117500#issuecomment-1896426304))	2024-01-17 18:42:39 +00:00
sanchitintel	8852bb561c	More efficient multi-threading in Softmax & LogSoftmax CPU kernels (#116367 ) ### Summary In #85398, while fixing a bug (which was _not caused by, but was exposed by_ AVX512 implementation) in `_vec_logsoftmax_lastdim`, I had made some revisions to use more threads in some cases, but was asked to roll back [those changes](https://github.com/pytorch/pytorch/pull/85398#discussion_r1087680237) during the PR's review. At the time, landing that PR asap seemed essential, so I agreed to roll back that change. In some cases, more threads can be used than are being used with the current approach. <strike>In this PR, I'm reintroducing those changes, which are geared towards more efficient multi-threading.</strike>. On second thought, even for other softmax kernels besides `_vec_log_softmax_lastdim` and `_vec_softmax_lastdim`, we could simply use `grain_size` of 0 or 1, instead of complicating code because `CHUNK_SIZE` for each thread is already being computed as per some heuristic, and if `grain_size` would be `0`, then work among the OpenMP threads (which, BTW, stay constant in number, unless explicitly changed, since we don't use the OpenMP `num_threads` clause in PyTorch) would be distributed equitably, thus yielding a similar speedup as the approach in the first commit of this PR. I've also added op-level benchmarks pertaining to example input shapes in this PR. ### Benchmarks Machine - Intel(R) Xeon(R) Platinum 8468H (Xeon 4th gen, formerly codenamed Sapphire Rapids) One socket of 48 physical cores was used, with & without HyperThreading. Intel OpenMP & tcmalloc were preloaded. Softmax benchmarks can be run with the following command, but the relevant benchmarks are the last dim ones - `KMP_AFFINITY=granularity=fine,compact,1,0 KMP_BLOCKTIME=1 KMP_SETTINGS=1 OMP_NUM_THREADS=48 MKL_NUM_THREADS=48 numactl --membind=0 --cpunodebind=0 python -m pt.softmax_test --tag-filter all` #### Already existing benchmarks \|Benchmark name (dim is 1, by default) \| Previous implementation's latency (in ms) \| This implementation's latency (in ms)\|Speedup Percentage = (old-new)*100/old \| Speedup ratio (old/new)\| \|-------------\|--------\|-------\|----------------------------\|----------\| \|Softmax_N1_C3_H256_W256_cpu\|31.364\|11.594\|63.03% \|2.705\| \|Softmax_N4_C3_H256_W256_cpu\|34.475\|24.966\| 27.58%\|1.380\| \|Softmax_N8_C3_H512_W256_cpu\|94.044\|78.372\|16.66%\|1.199\| \|Softmax2d_N8_C3_H512_W256_cpu\|100.195\|79.529\|20.62%\|1.259\| #### Some of the following benchmarks are being added in this PR \|Benchmark name\| Previous implementation's latency (in ms) \| This implementation's latency (in ms)\|Speedup percentage = (old-new)*100/old\| Speedup ratio (old/new) \| \|-------------\|--------\|-------\|----------------------------\|--------------------\| \|LogSoftmax_M128_N128_dim1_cpu\|7.629\|6.475\|15.12%\| 1.178\| \|LogSoftmax_M48_N128_dim1_cpu\|6.848\|5.969\|12.83%\| 1.147\| \|LogSoftmax_M16_N1024_dim1_cpu\|7.004\|6.322\|9.73%\| 1.107\| \|LogSoftmax_M32_N1024_dim1_cpu\|7.037\|6.558\|6.80%\| 1.073\| \|LogSoftmax_M48_N1024_dim1_cpu\|7.155\|6.773\|5.33%\|1.056\| \|LogSoftmax_M16_N512_dim1_cpu\|6.797\|5.862\|13.75%\|1.159\| \|LogSoftmax_M32_N512_dim1_cpu\|7.223\|6.202\|14.13%\|1.164\| \|LogSoftmax_M48_N512_dim1_cpu\|7.159\|6.301\|11.98%\|1.136\| \|LogSoftmax_M16_N256_dim1_cpu\|6.842\|5.682\|16.95%\|1.204\| \|LogSoftmax_M32_N256_dim1_cpu\|6.840\|6.086\|11.02%\|1.123\| \|LogSoftmax_M48_N256_dim1_cpu\|7.005\|6.031\|13.94%\|1.161\| Pull Request resolved: https://github.com/pytorch/pytorch/pull/116367 Approved by: https://github.com/jgong5, https://github.com/ezyang	2024-01-17 02:26:29 +00:00
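The two speedup columns in the Softmax tables above follow directly from the latency columns. A small sketch of that arithmetic, checked against the first table row; the function names are illustrative and are not part of the benchmark suite.

```python
def speedup_percentage(old_ms: float, new_ms: float) -> float:
    """Latency reduction as a percentage of the old latency."""
    return (old_ms - new_ms) * 100 / old_ms


def speedup_ratio(old_ms: float, new_ms: float) -> float:
    """How many times faster the new implementation is."""
    return old_ms / new_ms


# Softmax_N1_C3_H256_W256_cpu row: 31.364 ms before, 11.594 ms after.
pct = round(speedup_percentage(31.364, 11.594), 2)
ratio = round(speedup_ratio(31.364, 11.594), 3)
print(f"{pct}% faster, ratio {ratio}")
```

The two metrics carry the same information: a ratio r corresponds to a percentage of (1 - 1/r) * 100.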
Jason Ansel	3a52147cc5	[inductor] Faster C++ kernel python bindings (#117500 ) Calling C++ from Python via ctypes is notoriously slow. This switches to generating our own C++ bindings directly, which is a >5x speedup on this kernel-launch-bound microbenchmark: ```python from ctypes import c_void_p import torch from torch import empty from torch._inductor.codecache import AsyncCompile from torch._dynamo.testing import rand_strided from torch._inductor.utils import print_performance from torch._inductor.wrapper_benchmark import compiled_module_main async_compile = AsyncCompile() src = ''' #include "/tmp/torchinductor_jansel/gb/cgbau5vlj6cetmcjbjbtw6x4rrivaln6f45s5d72gy2bfx5foz3k.h" extern "C" void kernel(const float* in_ptr0, float* out_ptr0) { { auto tmp0 = in_ptr0[static_cast<long>(0L)]; auto tmp1 = static_cast<float>(1.0); auto tmp2 = decltype(tmp0)(tmp0 + tmp1); out_ptr0[static_cast<long>(0L)] = tmp2; } } ''' cpp_fused_add_ctypes = async_compile.cpp(src) cpp_fused_add_cpython = async_compile.cpp_pybinding(["const float", "float"], src) async_compile.wait(globals()) del async_compile def call(arg0_1): buf0 = empty((1,), device='cpu', dtype=torch.float32) if use_ctypes: for _ in range(100): cpp_fused_add_ctypes(c_void_p(arg0_1.data_ptr()), c_void_p(buf0.data_ptr())) else: for _ in range(100): cpp_fused_add_cpython(arg0_1, buf0) del arg0_1 return (buf0,) def benchmark_compiled_module(times=1000, repeat=100): arg0_1 = rand_strided((1,), (1,), device='cpu', dtype=torch.float32) return print_performance(lambda: call(arg0_1), times=times, repeat=repeat) print("old ctypes bindings: ", end='') use_ctypes = True compiled_module_main('None', benchmark_compiled_module) print("new bindings: ", end='') use_ctypes = False compiled_module_main('None', benchmark_compiled_module) ``` Output: ``` old ctypes bindings: 0.000073 new bindings: 0.000013 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/117500 Approved by: https://github.com/desertfire	2024-01-16 22:30:04 +00:00
Simon Fan	4b25948ee6	Torchbench Dynamo Runner: Enable DDP for perf test and traces (#113332 ) - Removes an outdated assert that prevents perf tests from running DDP, we now have single node --multiprocess and perf tests are already wrapping the model using `deepcopy_and_maybe_ddp` - Append rank name to traces to avoid all ranks trying to create the same file - Renames `deepcopy_and_maybe_ddp` to `deepcopy_and_maybe_parallelize` to include FSDP Pull Request resolved: https://github.com/pytorch/pytorch/pull/113332 Approved by: https://github.com/H-Huang, https://github.com/wconstab	2024-01-12 22:41:09 +00:00
Simon Fan	88bf84f106	[benchmark] add --compile-autograd to dynamo benchmarks (#117196 ) Adds `--compile-autograd` flag to benchmark suite to run accuracy and performance tests. Also adds autograd_captures and autograd_compiles to dynamo stats e.g. accuracy_inductor.csv ``` dev,name,batch_size,accuracy,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks,autograd_captures,autograd_compiles cuda,BERT_pytorch,4,pass,2655,2,8,7,1,1 cuda,Background_Matting,4,pass_due_to_skip,0,0,0,0,0,0 cuda,DALLE2_pytorch,0,eager_fail_to_run,0,0,0,0,0,0 cuda,LearningToPaint,4,pass,639,2,8,7,1,1 ... ``` e.g. speedup_inductor.csv ``` dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks,autograd_captures,autograd_compiles cuda,hf_T5,8,1.214311,136.236793,88.350570,0.751322,18.754706,24.962275,3298,2,8,8,1,1 cuda,hf_T5,8,1.226645,135.431856,52.461461,1.040973,18.754706,18.016508,795,1,7,7,0,0 ... ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/117196 Approved by: https://github.com/jansel	2024-01-11 20:12:58 +00:00
Bin Bao	7e9cbc6834	[CI] Catch more exception types when running eager in PT2 tests (#117120 ) Summary: https://github.com/pytorch/pytorch/actions/runs/7467073391/job/20320251143#step:16:1332 shows a case where model loading fails with KeyError but the error is not logged in the report csv file, which can cause an eager model failure to be silently ignored in the PT2 integration test. Pull Request resolved: https://github.com/pytorch/pytorch/pull/117120 Approved by: https://github.com/huydhn	2024-01-11 17:46:11 +00:00
Huy Do	3b2ddb6f71	Update TorchBench pinned commit (#117073 ) ~~To match their recent v4.36.2 release https://github.com/huggingface/transformers/commits/v4.36.2. This is to fix the KeyError showing on release branch https://github.com/pytorch/pytorch/actions/runs/7451512288/job/20279117324#step:16:1336. I think this can be updated in main too because the current pinned commit is already 4-month old.~~ Check with @desertfire, trying to update TorchBench pinned commit instead. The test is also failing in main https://github.com/pytorch/pytorch/actions/runs/7467073391/job/20320251143#step:16:1120, but for some reason, it doesn't surface as a failure. Pull Request resolved: https://github.com/pytorch/pytorch/pull/117073 Approved by: https://github.com/atalman, https://github.com/thiagocrepaldi, https://github.com/desertfire	2024-01-11 08:35:00 +00:00
Bin Bao	b8374314cc	[AOTI] Update AOTI runner util (#116971 ) Summary: Update the runner used in integration tests after https://github.com/pytorch/torchrec/pull/1604 Pull Request resolved: https://github.com/pytorch/pytorch/pull/116971 Approved by: https://github.com/chenyang78	2024-01-09 19:07:54 +00:00
Huy Do	3c7f358c91	Update the expected accuracy value for demucs (#116944 ) Update the expected value with `python benchmarks/dynamo/ci_expected_accuracy/update_expected.py b847290ddd9c6a5a598c70f8b660ee2b1e71dc95` as this is now failing in trunk after `95041829c8` Pull Request resolved: https://github.com/pytorch/pytorch/pull/116944 Approved by: https://github.com/voznesenskym	2024-01-07 13:34:51 +00:00
Bert Maher	521dbbfaff	Remove cpp/tensorexpr benchmarks (#116868 ) Summary: These refer to a deprecated backend of torchscript which is no longer built in releases, and require llvm to be built. Test Plan: ``` python setup.py develop ``` Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/116868 Approved by: https://github.com/hl475, https://github.com/chenyang78, https://github.com/eellison, https://github.com/mikekgfb	2024-01-05 21:23:30 +00:00
Bin Bao	640d46f823	[inductor] Control the cpp_wrapper mode with an env variable (#116615 ) Summary: also add one model test for the cpp_wrapper mode on CI Pull Request resolved: https://github.com/pytorch/pytorch/pull/116615 Approved by: https://github.com/angelayi	2024-01-02 21:50:25 +00:00
Aaron Gokaslan	bd10fea79a	[BE]: Enable F821 and fix bugs (#116579 ) Fixes #112371 I tried to fix as many of the bugs as I could; for a few I could not figure out the proper fix, so I left them with noqas. Pull Request resolved: https://github.com/pytorch/pytorch/pull/116579 Approved by: https://github.com/ezyang	2024-01-01 08:40:46 +00:00
baocheny	e01e00fba8	fix code spell (#116530 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/116530 Approved by: https://github.com/albanD	2023-12-29 12:58:38 +00:00
Isuru Fernando	a254fbfd61	Initialize variable for all codepaths in dynamo benchmarks (#116260 ) Sometimes, the first statement that sets this variable in the try block fails due to out of memory issues and the finally block tries to delete this variable, but it was not written to in the first place. Pull Request resolved: https://github.com/pytorch/pytorch/pull/116260 Approved by: https://github.com/lezcano	2023-12-26 05:15:39 +00:00
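The failure mode described in the commit above can be reproduced in a few lines. This is a generic sketch, not the benchmark code itself: `load`, `without_init`, and `with_init` are hypothetical names, and a raised `MemoryError` stands in for the out-of-memory failure.

```python
def without_init(load):
    try:
        model = load()  # if this raises, `model` is never bound
    finally:
        del model  # then fails with UnboundLocalError, masking the real error


def with_init(load):
    model = None  # bound on every code path before the try block
    try:
        model = load()
    finally:
        del model  # always safe; the original exception propagates


def failing_load():
    raise MemoryError("simulated out of memory")


def raised_type(fn, load):
    """Return the name of the exception type that escapes fn(load), if any."""
    try:
        fn(load)
    except BaseException as exc:
        return type(exc).__name__
    return None


print(raised_type(without_init, failing_load))
print(raised_type(with_init, failing_load))
```

Initializing the variable up front keeps the cleanup unconditional while letting the original error surface.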
BowenBao	259b0af367	[ONNX] Add copy before export for perf bench to avoid mutating base model (#115945 ) Otherwise base model might be mutated and affects the performance measured. Pull Request resolved: https://github.com/pytorch/pytorch/pull/115945 Approved by: https://github.com/justinchuby, https://github.com/titaiwangms	2023-12-21 01:20:46 +00:00
Michael Lazos	be90b757d9	Enable compiled Adam in the benchmarks (#116093 ) Commit b697bcc583 of mlazos/compiled-adam2 at https://hud.pytorch.org/benchmark/compilers is an initial benchmark run. Increases compile time by 20s for torchbench and HF, and 30s for TIMM. I expect the compile time to come down significantly with fake tensor prop caching. Pull Request resolved: https://github.com/pytorch/pytorch/pull/116093 Approved by: https://github.com/janeyx99	2023-12-21 00:17:36 +00:00
Mikayla Gawarecki	19207b9183	Allow more backend worker threads with each using a separate cuda stream (#116190 ) Added a `--num_workers` option to `server.py` that allows more than 1 worker in the `ThreadPoolWorker` used for model predictions. Each worker uses its own `cuda.Stream()` that is created when the worker thread is initialized. Ran benchmark for 2-4 workers with `compile=False` (since compile is not thread-safe) Pull Request resolved: https://github.com/pytorch/pytorch/pull/116190 Approved by: https://github.com/albanD ghstack dependencies: #115286, #116187, #116188, #116189	2023-12-20 22:08:29 +00:00
Mikayla Gawarecki	0dd64174bd	Do H2D/D2H of input/result on separate threads/cuda.Streams (#116189 ) Added two `ThreadPoolExecutor`s with 1 worker each for D2H and H2D copies. Each uses its own `cuda.Stream`. The purpose is to try to overlap D2H and H2D with compute and allow the worker handling prediction to launch compute kernels without being blocked by D2H/H2D. Pull Request resolved: https://github.com/pytorch/pytorch/pull/116189 Approved by: https://github.com/albanD ghstack dependencies: #115286, #116187, #116188	2023-12-20 22:08:29 +00:00
Mikayla Gawarecki	3793ad6a7e	Fix bugs in metrics calculation in inference benchmark and rerun baseline (#116188 ) Before this PR, each `request_time` was separated by the time for a `torch.randn(...)` to create the fake `data` tensor on CPU. This meant that the gap between `request_times` scaled with the batch_size. So the latency comparisons across batch sizes were inaccurate. In this PR we generate all the fake data outside the loop to avoid this. Other bug fixes: - Only start polling GPU utilization after warmup event is complete - Correct calculation of throughput: previously `(num_batches * batch_size) / sum(response_times)`, should have been `(num_batches * batch_size) / (last_response_time - first_request_time)` - Make sure that response sent back to frontend is on CPU - Use a lock to ensure writing to `metrics_dict` in `metrics_thread` and `gpu_utilization_thread` in a thread-safe manner Pull Request resolved: https://github.com/pytorch/pytorch/pull/116188 Approved by: https://github.com/albanD ghstack dependencies: #115286, #116187	2023-12-20 22:08:22 +00:00
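The throughput correction in the commit above can be made concrete with a toy calculation. This sketch assumes `response_times` holds per-request latencies in seconds and that requests overlap in flight; all numbers and function names are made up for illustration.

```python
def throughput_from_latency_sum(num_batches, batch_size, response_times):
    """Previous formula: divides by the sum of per-request latencies."""
    return (num_batches * batch_size) / sum(response_times)


def throughput_from_wall_clock(num_batches, batch_size,
                               first_request_time, last_response_time):
    """Corrected formula: divides by the elapsed wall-clock window."""
    return (num_batches * batch_size) / (last_response_time - first_request_time)


# Four batches of 8 samples, sent at t = 0, 1, 2, 3 s, each answered 2 s later.
request_times = [0.0, 1.0, 2.0, 3.0]
response_done = [2.0, 3.0, 4.0, 5.0]
latencies = [done - sent for sent, done in zip(request_times, response_done)]

old = throughput_from_latency_sum(4, 8, latencies)
new = throughput_from_wall_clock(4, 8, request_times[0], response_done[-1])
print(old, new)
```

Because the requests overlap, summing latencies counts the same seconds more than once and understates throughput; the wall-clock window does not.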
Mikayla Gawarecki	75a4b10d56	[easy] Add option for profiling backend in inference benchmark (#116187 ) Some misc fixes, also added option for experiment name to add to result table Pull Request resolved: https://github.com/pytorch/pytorch/pull/116187 Approved by: https://github.com/albanD ghstack dependencies: #115286	2023-12-20 22:08:11 +00:00
Mikayla Gawarecki	31f21e033e	Run inference in an Executor (#115286 ) Experiment: run model predictions in the backend in a ThreadPoolExecutor so that each model prediction does not block reading requests from the queue Baseline is reset in above PR that bugfixes a lot of the metrics calculations but I kept the metrics here anyway Pull Request resolved: https://github.com/pytorch/pytorch/pull/115286 Approved by: https://github.com/albanD	2023-12-20 22:08:02 +00:00
Aaron Gokaslan	6de28e92d2	[BE]: Apply FURB118 (prev): replaces unnecessary lambdas with operator. (#116027 ) This replaces a bunch of unnecessary lambdas with the operator package. This is semantically equivalent, but the operator package is faster, and arguably more readable. When the FURB rules are taken out of preview, I will enable it as a ruff check. Pull Request resolved: https://github.com/pytorch/pytorch/pull/116027 Approved by: https://github.com/malfet	2023-12-20 19:35:08 +00:00
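As a rough illustration of the FURB118 rewrite described above, the following sketch swaps lambdas for their `operator` equivalents. The data is hypothetical and the snippet is not taken from the PR.

```python
import operator
from functools import reduce

# Hypothetical (name, latency) pairs.
rows = [("b", 3.0), ("a", 1.5), ("c", 2.25)]

# Before: lambdas that only index or multiply.
sorted_lambda = sorted(rows, key=lambda row: row[1])
product_lambda = reduce(lambda x, y: x * y, [1, 2, 3, 4])

# After: the same operations from the operator module.
sorted_operator = sorted(rows, key=operator.itemgetter(1))
product_operator = reduce(operator.mul, [1, 2, 3, 4])
```

Both versions produce the same results; the `operator` forms avoid defining a Python-level function for each call site.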
drisspg	6b120c6cf9	Update the sdpa benchmark to measure forward backward time in isolation (#115986 ) # Summary The benchmarks were getting a little stale and I think it makes more sense to measure in isolation now rather than E2E in a mha component. This is a pre-req for getting the data for https://github.com/pytorch/pytorch/pull/115357 Output from run: ``` Shell +------------+-----------+-----------+------------+-----------+-----------+----------------+--------------------+--------------------+ \| batch_size \| num_heads \| q_seq_len \| kv_seq_len \| embed_dim \| is_causal \| dtype \| forward_time \| backward_time \| +------------+-----------+-----------+------------+-----------+-----------+----------------+--------------------+--------------------+ \| 1 \| 16 \| 128 \| 128 \| 2048 \| True \| torch.bfloat16 \| 23.86634959839284 \| 66.21150835417211 \| \| 1 \| 16 \| 128 \| 128 \| 2048 \| False \| torch.bfloat16 \| 23.452017060481012 \| 66.90612225793302 \| \| 1 \| 16 \| 256 \| 256 \| 2048 \| True \| torch.bfloat16 \| 24.478124547749758 \| 76.4232068322599 \| \| 1 \| 16 \| 256 \| 256 \| 2048 \| False \| torch.bfloat16 \| 24.6928428998217 \| 75.76151192188263 \| \| 1 \| 16 \| 512 \| 512 \| 2048 \| True \| torch.bfloat16 \| 28.69622849393636 \| 114.73898496478796 \| \| 1 \| 16 \| 512 \| 512 \| 2048 \| False \| torch.bfloat16 \| 34.399422979913645 \| 112.96746158041059 \| \| 1 \| 16 \| 1024 \| 1024 \| 2048 \| True \| torch.bfloat16 \| 65.4690912924707 \| 216.26344555988908 \| \| 1 \| 16 \| 1024 \| 1024 \| 2048 \| False \| torch.bfloat16 \| 88.57532404363155 \| 212.07790216431025 \| \| 8 \| 16 \| 128 \| 128 \| 2048 \| True \| torch.bfloat16 \| 11.582905380055308 \| 70.09557797573505 \| \| 8 \| 16 \| 128 \| 128 \| 2048 \| False \| torch.bfloat16 \| 12.068384909071026 \| 70.01491216942668 \| \| 8 \| 16 \| 256 \| 256 \| 2048 \| True \| torch.bfloat16 \| 31.671419646590945 \| 203.54910241439939 \| \| 8 \| 16 \| 256 \| 256 \| 2048 \| False \| torch.bfloat16 \| 33.0585768679157 \| 209.45609430782497 \| \| 8 \| 16 \| 512 \| 512 \| 2048 \| True \| torch.bfloat16 \| 87.43969700299202 \| 469.8729298543185 \| \| 8 \| 16 \| 512 \| 512 \| 2048 \| False \| torch.bfloat16 \| 123.9265550393611 \| 580.1084265112877 \| \| 8 \| 16 \| 1024 \| 1024 \| 2048 \| True \| torch.bfloat16 \| 561.1918237991632 \| 1181.655174586922 \| \| 8 \| 16 \| 1024 \| 1024 \| 2048 \| False \| torch.bfloat16 \| 884.2707145959139 \| 1662.4679416418073 \| +------------+-----------+-----------+------------+-----------+-----------+----------------+--------------------+--------------------+ ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/115986 Approved by: https://github.com/mikaylagawarecki	2023-12-18 22:40:47 +00:00
Michael Lazos	80b1ecc308	Run eager adam optimizer in benchmarks where possible (#115445 ) Runs eager Adam (instead of SGD) on all models that don't fail accuracy. Pull Request resolved: https://github.com/pytorch/pytorch/pull/115445 Approved by: https://github.com/desertfire	2023-12-18 18:28:23 +00:00
BowenBao	7e6ec8d3db	[ONNX] Add proper iobinding synchronize for ONNX cuda bench (#115773 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/115773 Approved by: https://github.com/thiagocrepaldi ghstack dependencies: #115670, #115673	2023-12-15 00:37:32 +00:00
BowenBao	823523acc0	[ONNX] Dump sarif diagnostics for failed onnx exports in benchmark (#115673 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/115673 Approved by: https://github.com/thiagocrepaldi ghstack dependencies: #115670	2023-12-15 00:37:32 +00:00

1470 commits