mirror of https://github.com/saymrwulf/pytorch.git synced 2026-05-14 20:57:59 +00:00


sanchitintel 8852bb561c More efficient multi-threading in Softmax & LogSoftmax CPU kernels (#116367)

### Summary

In #85398, while fixing a bug (which was _not caused by, but was exposed by_ the AVX512 implementation) in `_vec_logsoftmax_lastdim`, I had made some revisions to use more threads in some cases, but was asked to roll back [those changes](https://github.com/pytorch/pytorch/pull/85398#discussion_r1087680237) during the PR's review. At the time, landing that PR ASAP seemed essential, so I agreed to roll back that change. In some cases, more threads can be used than are being used with the current approach.

<strike>In this PR, I'm reintroducing those changes, which are geared towards more efficient multi-threading.</strike>

On second thought, even for other softmax kernels besides `_vec_log_softmax_lastdim` and `_vec_softmax_lastdim`, we could simply use a `grain_size` of 0 or 1 instead of complicating the code, because `CHUNK_SIZE` for each thread is already being computed as per some heuristic. If `grain_size` is `0`, work is distributed equitably among the OpenMP threads (which, BTW, stay constant in number unless explicitly changed, since we don't use the OpenMP `num_threads` clause in PyTorch), thus yielding a speedup similar to that of the approach in the first commit of this PR.

I've also added op-level benchmarks pertaining to example input shapes in this PR.

### Benchmarks

Machine - Intel(R) Xeon(R) Platinum 8468H (Xeon 4th gen, formerly codenamed Sapphire Rapids)

One socket of 48 physical cores was used, with & without HyperThreading. Intel OpenMP & tcmalloc were preloaded.

Softmax benchmarks can be run with the following command, but the relevant benchmarks are the last dim ones -
`KMP_AFFINITY=granularity=fine,compact,1,0 KMP_BLOCKTIME=1 KMP_SETTINGS=1 OMP_NUM_THREADS=48 MKL_NUM_THREADS=48 numactl --membind=0 --cpunodebind=0 python -m pt.softmax_test --tag-filter all`

#### Already existing benchmarks

| Benchmark name (dim is 1, by default) | Previous implementation's latency (in ms) | This implementation's latency (in ms) | Speedup percentage = (old - new) * 100 / old | Speedup ratio (old/new) |
|-------------|--------|-------|----------------------------|----------|
| Softmax_N1_C3_H256_W256_cpu | 31.364 | 11.594 | 63.03% | 2.705 |
| Softmax_N4_C3_H256_W256_cpu | 34.475 | 24.966 | 27.58% | 1.380 |
| Softmax_N8_C3_H512_W256_cpu | 94.044 | 78.372 | 16.66% | 1.199 |
| Softmax2d_N8_C3_H512_W256_cpu | 100.195 | 79.529 | 20.62% | 1.259 |

#### Some of the following benchmarks are being added in this PR

| Benchmark name | Previous implementation's latency (in ms) | This implementation's latency (in ms) | Speedup percentage = (old - new) * 100 / old | Speedup ratio (old/new) |
|-------------|--------|-------|----------------------------|--------------------|
| LogSoftmax_M128_N128_dim1_cpu | 7.629 | 6.475 | 15.12% | 1.178 |
| LogSoftmax_M48_N128_dim1_cpu | 6.848 | 5.969 | 12.83% | 1.147 |
| LogSoftmax_M16_N1024_dim1_cpu | 7.004 | 6.322 | 9.73% | 1.107 |
| LogSoftmax_M32_N1024_dim1_cpu | 7.037 | 6.558 | 6.80% | 1.073 |
| LogSoftmax_M48_N1024_dim1_cpu | 7.155 | 6.773 | 5.33% | 1.056 |
| LogSoftmax_M16_N512_dim1_cpu | 6.797 | 5.862 | 13.75% | 1.159 |
| LogSoftmax_M32_N512_dim1_cpu | 7.223 | 6.202 | 14.13% | 1.164 |
| LogSoftmax_M48_N512_dim1_cpu | 7.159 | 6.301 | 11.98% | 1.136 |
| LogSoftmax_M16_N256_dim1_cpu | 6.842 | 5.682 | 16.95% | 1.204 |
| LogSoftmax_M32_N256_dim1_cpu | 6.840 | 6.086 | 11.02% | 1.123 |
| LogSoftmax_M48_N256_dim1_cpu | 7.005 | 6.031 | 13.94% | 1.161 |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116367
Approved by: https://github.com/jgong5, https://github.com/ezyang

2024-01-17 02:26:29 +00:00
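The two speedup columns in the benchmark tables above are derived from the old and new latencies. As a minimal sketch (the function name `speedup` is hypothetical and not part of the PR or the benchmark suite), the two formulas can be reproduced as follows. Because the published latencies are already rounded, recomputed values may differ from the tables in the last digit for some rows.

```python
def speedup(old_ms: float, new_ms: float) -> tuple[float, float]:
    """Return (speedup percentage, speedup ratio) for two latencies in ms.

    percentage = (old - new) * 100 / old
    ratio      = old / new
    """
    if old_ms <= 0 or new_ms <= 0:
        raise ValueError("latencies must be positive")
    percentage = (old_ms - new_ms) * 100 / old_ms
    ratio = old_ms / new_ms
    return percentage, ratio


if __name__ == "__main__":
    # Latencies taken from the Softmax_N1_C3_H256_W256_cpu row above.
    pct, ratio = speedup(31.364, 11.594)
    print(f"speedup: {pct:.2f}% ({ratio:.3f}x)")
```

Reporting both numbers is useful because the percentage saturates at 100% while the ratio keeps growing, so large improvements are easier to compare as ratios.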
distributed	[BE]: Apply FURB118 (prev): replaces unnecessary lambdas with operator. (#116027 )	2023-12-20 19:35:08 +00:00
dynamo	[inductor] Faster C++ kernel python bindings (#117500 )	2024-01-16 22:30:04 +00:00
fastrnns
framework_overhead_benchmark	[BE]: Enable F821 and fix bugs (#116579 )	2024-01-01 08:40:46 +00:00
functional_autograd_benchmark	[BE]: Enable F821 and fix bugs (#116579 )	2024-01-01 08:40:46 +00:00
fuser
inference	Allow more backend worker threads with each using a separate cuda stream (#116190 )	2023-12-20 22:08:29 +00:00
instruction_counts
nested
operator_benchmark	More efficient multi-threading in Softmax & LogSoftmax CPU kernels (#116367 )	2024-01-17 02:26:29 +00:00
overrides_benchmark
profiler_benchmark
record_function_benchmark
serialization
sparse	[BE]: Enable F821 and fix bugs (#116579 )	2024-01-01 08:40:46 +00:00
static_runtime
tensorexpr	[BE]: Enable F821 and fix bugs (#116579 )	2024-01-01 08:40:46 +00:00
transformer	Update the sdpa benchmark to measure forward backward time in isolation (#115986 )	2023-12-18 22:40:47 +00:00
compare-fastrnn-results.py
compare.sh
README.md
upload_scribe.py

README.md

PyTorch Benchmarks

This folder contains scripts that produce reproducible timings of various PyTorch features.

It also provides mechanisms to compare PyTorch with other frameworks.

Setup environment

Make sure you're on a machine with CUDA available, then install torchvision and PyTorch in the following order:

# Install torchvision together with the PyTorch stable release binary.
conda install pytorch torchvision -c pytorch

# Install the latest pytorch master from source.
# It should supersede the installation from the release binary.
cd $PYTORCH_HOME
python setup.py build develop

# Check the pytorch installation version
python -c "import torch; print(torch.__version__)"
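
Beyond printing the version, it can help to confirm that torchvision and CUDA are also visible before running any benchmark. The following is a minimal sketch, not part of the benchmark suite; the helper name `describe_environment` is hypothetical, and it relies only on `importlib` plus the `torch.cuda.is_available()` call.

```python
import importlib
import importlib.util


def describe_environment(packages=("torch", "torchvision")):
    """Report the version of each package (None if missing) and CUDA visibility."""
    report = {}
    for name in packages:
        # find_spec returns None when a top-level package is not installed.
        if importlib.util.find_spec(name) is None:
            report[name] = None
            continue
        module = importlib.import_module(name)
        report[name] = getattr(module, "__version__", "unknown")

    cuda_available = None  # stays None when torch itself was not checked or found
    if report.get("torch") is not None:
        import torch

        cuda_available = torch.cuda.is_available()
    report["cuda_available"] = cuda_available
    return report


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

If the printed torch version is still the release binary's version rather than a development version, the source build most likely did not supersede the conda installation.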

Benchmark List

Please refer to each subfolder to discover each benchmark suite. Links are provided where descriptions exist: