pytorch/benchmarks
sanchitintel 8852bb561c More efficient multi-threading in Softmax & LogSoftmax CPU kernels (#116367)
### Summary
In #85398, while fixing a bug (which was _not caused by, but was exposed by_ AVX512 implementation) in `_vec_logsoftmax_lastdim`, I had made some revisions to use more threads in some cases, but was asked to roll back [those changes](https://github.com/pytorch/pytorch/pull/85398#discussion_r1087680237) during the PR's review.
At the time, landing that PR asap seemed essential, so I agreed to roll-back that change,

In some cases, more threads can be used than are being used with the current approach.
<strike>In this PR, I'm reintroducing those changes, which are geared towards more efficient multi-threading.</strike>.
On second thought, even for other softmax kernels besides `_vec_log_softmax_lastdim` and `_vec_softmax_lastdim`, we could simply use `grain_size` of 0 or 1, instead of complicating code because `CHUNK_SIZE` for each thread is already being computed as per some heuristic, and if `grain_size` would be `0`, then work among the OpenMP threads (which, BTW, stay constant in number, unless explicitly changed, since we don't use the OpenMP `num_threads` clause in PyTorch) would be distributed equitably, thus yielding the similar speedup as the approach in the first commit of this PR.
I've also added op-level benchmarks pertaining to example input shapes in this PR.

### Benchmarks

Machine - Intel(R) Xeon(R) Platinum 8468H (Xeon 4th gen, formerly codenamed Sapphire Rapids)
One socket of 48 physical cores was used, with & without HyperThreading.
Intel OpenMP & tcmalloc were preloaded.

Softmax benchmarks can be run with the following command, but the relevant benchmarks are the last dim ones -
`KMP_AFFINITY=granularity=fine,compact,1,0 KMP_BLOCKTIME=1 KMP_SETTINGS=1 OMP_NUM_THREADS=48 MKL_NUM_THREADS=48 numactl --membind=0 --cpunodebind=0 python -m pt.softmax_test --tag-filter all`

#### Already existing benchmarks
|Benchmark name (dim is 1, by default) | Previous implementation's latency (in ms) | This implementation's latency (in ms)|Speedup Percentage = (old-new)*100/old | Speedup ratio (old/new)|
|-------------|--------|-------|----------------------------|----------|
|Softmax_N1_C3_H256_W256_cpu|31.364|11.594|63.03%  |2.705|
|Softmax_N4_C3_H256_W256_cpu|34.475|24.966| 27.58%|1.380|
|Softmax_N8_C3_H512_W256_cpu|94.044|78.372|16.66%|1.199|
|Softmax2d_N8_C3_H512_W256_cpu|100.195|79.529|20.62%|1.259|

#### Some of the following benchmarks are being added in this PR
|Benchmark name| Previous implementation's latency (in ms) | This implementation's latency (in ms)|Speedup percentage = (old-new)*100/old| Speedup ratio  (old/new) |
|-------------|--------|-------|----------------------------|--------------------|
|LogSoftmax_M128_N128_dim1_cpu|7.629|6.475|15.12%| 1.178|
|LogSoftmax_M48_N128_dim1_cpu|6.848|5.969|12.83%| 1.147|
|LogSoftmax_M16_N1024_dim1_cpu|7.004|6.322|9.73%| 1.107|
|LogSoftmax_M32_N1024_dim1_cpu|7.037|6.558|6.80%| 1.073|
|LogSoftmax_M48_N1024_dim1_cpu|7.155|6.773|5.33%|1.056|
|LogSoftmax_M16_N512_dim1_cpu|6.797|5.862|13.75%|1.159|
|LogSoftmax_M32_N512_dim1_cpu|7.223|6.202|14.13%|1.164|
|LogSoftmax_M48_N512_dim1_cpu|7.159|6.301|11.98%|1.136|
|LogSoftmax_M16_N256_dim1_cpu|6.842|5.682|16.95%|1.204|
|LogSoftmax_M32_N256_dim1_cpu|6.840|6.086|11.02%|1.123|
|LogSoftmax_M48_N256_dim1_cpu|7.005|6.031|13.94%|1.161|

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116367
Approved by: https://github.com/jgong5, https://github.com/ezyang
2024-01-17 02:26:29 +00:00
..
distributed [BE]: Apply FURB118 (prev): replaces unnecessary lambdas with operator. (#116027) 2023-12-20 19:35:08 +00:00
dynamo [inductor] Faster C++ kernel python bindings (#117500) 2024-01-16 22:30:04 +00:00
fastrnns
framework_overhead_benchmark [BE]: Enable F821 and fix bugs (#116579) 2024-01-01 08:40:46 +00:00
functional_autograd_benchmark [BE]: Enable F821 and fix bugs (#116579) 2024-01-01 08:40:46 +00:00
fuser
inference Allow more backend worker threads with each using a separate cuda stream (#116190) 2023-12-20 22:08:29 +00:00
instruction_counts
nested
operator_benchmark More efficient multi-threading in Softmax & LogSoftmax CPU kernels (#116367) 2024-01-17 02:26:29 +00:00
overrides_benchmark
profiler_benchmark
record_function_benchmark
serialization
sparse [BE]: Enable F821 and fix bugs (#116579) 2024-01-01 08:40:46 +00:00
static_runtime
tensorexpr [BE]: Enable F821 and fix bugs (#116579) 2024-01-01 08:40:46 +00:00
transformer Update the sdpa benchmark to measure forward backward time in isolation (#115986) 2023-12-18 22:40:47 +00:00
compare-fastrnn-results.py
compare.sh
README.md
upload_scribe.py

PyTorch Benchmarks

This folder contains scripts that produce reproducible timings of various PyTorch features.

It also provides mechanisms to compare PyTorch with other frameworks.

Setup environment

Make sure you're on a machine with CUDA, torchvision, and pytorch installed. Install in the following order:

# Install torchvision. It comes with the pytorch stable release binary
conda install pytorch torchvision -c pytorch

# Install the latest pytorch master from source.
# It should supersede the installation from the release binary.
cd $PYTORCH_HOME
python setup.py build develop

# Check the pytorch installation version
python -c "import torch; print(torch.__version__)"

Benchmark List

Please refer to each subfolder to discover each benchmark suite. Links are provided where descriptions exist: