mirror of
https://github.com/saymrwulf/pytorch.git
synced 2026-05-14 20:57:59 +00:00
### Summary

In #85398, while fixing a bug in `_vec_logsoftmax_lastdim` (which was _not caused by, but was exposed by_, the AVX512 implementation), I had made some revisions to use more threads in some cases, but was asked to roll back [those changes](https://github.com/pytorch/pytorch/pull/85398#discussion_r1087680237) during the PR's review. At the time, landing that PR ASAP seemed essential, so I agreed to roll back that change. In some cases, more threads can be used than the current approach uses.

<strike>In this PR, I'm reintroducing those changes, which are geared towards more efficient multi-threading.</strike>

On second thought, even for softmax kernels other than `_vec_log_softmax_lastdim` and `_vec_softmax_lastdim`, we could simply use a `grain_size` of 0 or 1 instead of complicating the code, because `CHUNK_SIZE` for each thread is already being computed according to a heuristic. With a `grain_size` of 0, work would be distributed equitably among the OpenMP threads (which, incidentally, stay constant in number unless explicitly changed, since we don't use the OpenMP `num_threads` clause in PyTorch), yielding a speedup similar to the approach in the first commit of this PR.

I've also added op-level benchmarks pertaining to example input shapes in this PR.

### Benchmarks

Machine: Intel(R) Xeon(R) Platinum 8468H (4th-gen Xeon, formerly codenamed Sapphire Rapids). One socket of 48 physical cores was used, with & without HyperThreading. Intel OpenMP & tcmalloc were preloaded.
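For context, the last-dim kernels discussed above compute a (log-)softmax along each row independently, which is what makes row-wise multi-threading possible in the first place. The following is an illustrative plain-Python reference, not the vectorized ATen kernel:

```python
import math

def log_softmax_lastdim(rows):
    """Numerically stable log-softmax over the last dimension.

    Each output row depends only on its own input row, so different rows
    can be handed to different threads independently. Illustrative sketch
    only; the ATen kernel vectorizes this with SIMD intrinsics.
    """
    out = []
    for row in rows:
        m = max(row)  # subtract the row max before exponentiating, for stability
        log_sum = math.log(sum(math.exp(x - m) for x in row))
        out.append([x - m - log_sum for x in row])
    return out

# exp(log_softmax) sums to 1 along the last dim
row = log_softmax_lastdim([[1.0, 2.0, 3.0]])[0]
print(round(sum(math.exp(v) for v in row), 6))  # -> 1.0
```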
Softmax benchmarks can be run with the following command, but the relevant benchmarks are the last-dim ones:

```
KMP_AFFINITY=granularity=fine,compact,1,0 KMP_BLOCKTIME=1 KMP_SETTINGS=1 OMP_NUM_THREADS=48 MKL_NUM_THREADS=48 numactl --membind=0 --cpunodebind=0 python -m pt.softmax_test --tag-filter all
```

#### Already existing benchmarks

|Benchmark name (dim is 1, by default)|Previous implementation's latency (in ms)|This implementation's latency (in ms)|Speedup percentage = (old-new)*100/old|Speedup ratio (old/new)|
|---|---|---|---|---|
|Softmax_N1_C3_H256_W256_cpu|31.364|11.594|63.03%|2.705|
|Softmax_N4_C3_H256_W256_cpu|34.475|24.966|27.58%|1.380|
|Softmax_N8_C3_H512_W256_cpu|94.044|78.372|16.66%|1.199|
|Softmax2d_N8_C3_H512_W256_cpu|100.195|79.529|20.62%|1.259|

#### Some of the following benchmarks are being added in this PR

|Benchmark name|Previous implementation's latency (in ms)|This implementation's latency (in ms)|Speedup percentage = (old-new)*100/old|Speedup ratio (old/new)|
|---|---|---|---|---|
|LogSoftmax_M128_N128_dim1_cpu|7.629|6.475|15.12%|1.178|
|LogSoftmax_M48_N128_dim1_cpu|6.848|5.969|12.83%|1.147|
|LogSoftmax_M16_N1024_dim1_cpu|7.004|6.322|9.73%|1.107|
|LogSoftmax_M32_N1024_dim1_cpu|7.037|6.558|6.80%|1.073|
|LogSoftmax_M48_N1024_dim1_cpu|7.155|6.773|5.33%|1.056|
|LogSoftmax_M16_N512_dim1_cpu|6.797|5.862|13.75%|1.159|
|LogSoftmax_M32_N512_dim1_cpu|7.223|6.202|14.13%|1.164|
|LogSoftmax_M48_N512_dim1_cpu|7.159|6.301|11.98%|1.136|
|LogSoftmax_M16_N256_dim1_cpu|6.842|5.682|16.95%|1.204|
|LogSoftmax_M32_N256_dim1_cpu|6.840|6.086|11.02%|1.123|
|LogSoftmax_M48_N256_dim1_cpu|7.005|6.031|13.94%|1.161|

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116367
Approved by: https://github.com/jgong5, https://github.com/ezyang
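The `grain_size` point from the summary can be illustrated with a simplified model of how a `parallel_for`-style scheduler splits an index range. This is an assumption-laden sketch (the `partition` helper is hypothetical, not ATen's actual scheduling code), but it shows why a grain size of 0 or 1 spreads work across all threads while a large grain size can leave threads idle:

```python
import math

def partition(begin, end, grain_size, num_threads):
    """Hypothetical model of splitting [begin, end) across threads.

    With a grain_size of 0 or 1 the range is divided evenly across all
    available threads; a large grain_size caps how many threads receive
    work. Not ATen's actual scheduling code.
    """
    n = end - begin
    if n == 0:
        return []
    # The chunk count is limited both by the thread count and by the grain size.
    chunks = min(num_threads, max(1, n // max(grain_size, 1)))
    step = math.ceil(n / chunks)
    return [(b, min(b + step, end)) for b in range(begin, end, step)]

# A large grain size leaves all 6 rows on a single thread ...
print(partition(0, 6, grain_size=8, num_threads=4))  # -> [(0, 6)]
# ... while grain_size=1 spreads them across threads.
print(partition(0, 6, grain_size=1, num_threads=4))  # -> [(0, 2), (2, 4), (4, 6)]
```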
# PyTorch Benchmarks
This folder contains scripts that produce reproducible timings of various PyTorch features.
It also provides mechanisms to compare PyTorch with other frameworks.
## Setup environment
Make sure you're on a machine with CUDA, torchvision, and pytorch installed. Install in the following order:
```bash
# Install torchvision. It comes with the pytorch stable release binary
conda install pytorch torchvision -c pytorch

# Install the latest pytorch master from source.
# It should supersede the installation from the release binary.
cd $PYTORCH_HOME
python setup.py build develop

# Check the pytorch installation version
python -c "import torch; print(torch.__version__)"
```
## Benchmark List
Please refer to each subfolder to discover each benchmark suite. Links are provided where descriptions exist: