mirror of
https://github.com/saymrwulf/pytorch.git
synced 2026-05-14 20:57:59 +00:00
### Description

This PR adds NNC post-op fusion support in ideep for further NNC development. It includes:

- element-wise post-op fusion
- conv/matmul/linear + binary post-op fusion

### Performance

**Common configuration:**

- Jemalloc and iomp enabled
- BS=1
- num_warmup = 300
- num_run = 500
- Average time of 1 iteration in ms is used
- time_before: no fusion
- time_after: with fusion
- Eltwise OPs selected: hardswish and abs
- Using oneDNN v2.6

**On ICX (32 cores per socket): Conv2d FP32 (in channels-last format)**

| | shape | time_(ms)_before | time_(ms)_after | Gain |
| -- | -- | -- | -- | -- |
| 1socket | Conv+abs_kernel=3_N=1_iC=64_H=56_W=56_oC=64_stride=1_pad=1_dilates=1_groups=1 | 0.112174 | 0.071106 | 36.61% |
| 1socket | Conv+hardswish_kernel=3_N=1_iC=64_H=56_W=56_oC=64_stride=1_pad=1_dilates=1_groups=1 | 0.11269 | 0.070586 | 37.36% |
| 1socket | Conv+abs_kernel=3_N=1_iC=512_H=56_W=56_oC=512_stride=2_pad=1_dilates=1_groups=32 | 0.164219 | 0.129498 | 21.14% |
| 1socket | Conv+hardswish_kernel=3_N=1_iC=512_H=56_W=56_oC=512_stride=2_pad=1_dilates=1_groups=32 | 0.169371 | 0.1277 | 24.60% |
| 1thread | Conv+abs_kernel=3_N=1_iC=64_H=56_W=56_oC=64_stride=1_pad=1_dilates=1_groups=1 | 1.994555 | 1.429813 | 28.31% |
| 1thread | Conv+hardswish_kernel=3_N=1_iC=64_H=56_W=56_oC=64_stride=1_pad=1_dilates=1_groups=1 | 1.715168 | 1.459937 | 14.88% |
| 1thread | Conv+abs_kernel=3_N=1_iC=512_H=56_W=56_oC=512_stride=2_pad=1_dilates=1_groups=32 | 2.997382 | 2.47915 | 17.29% |
| 1thread | Conv+hardswish_kernel=3_N=1_iC=512_H=56_W=56_oC=512_stride=2_pad=1_dilates=1_groups=32 | 3.044476 | 2.499366 | 17.90% |
| 4thread | Conv+abs_kernel=3_N=1_iC=64_H=56_W=56_oC=64_stride=1_pad=1_dilates=1_groups=1 | 0.405204 | 0.38117 | 5.93% |
| 4thread | Conv+hardswish_kernel=3_N=1_iC=64_H=56_W=56_oC=64_stride=1_pad=1_dilates=1_groups=1 | 0.410145 | 0.389279 | 5.09% |
| 4thread | Conv+abs_kernel=3_N=1_iC=512_H=56_W=56_oC=512_stride=2_pad=1_dilates=1_groups=32 | 0.67917 | 0.662792 | 2.41% |
| 4thread | Conv+hardswish_kernel=3_N=1_iC=512_H=56_W=56_oC=512_stride=2_pad=1_dilates=1_groups=32 | 0.682302 | 0.671226 | 1.62% |

**On CPX (28 cores per socket): Conv2d BF16 (in channels-last format)**

| | shape | time_(ms)_before | time_(ms)_after | Gain |
| -- | -- | -- | -- | -- |
| 1socket | Conv+abs_kernel=3_N=1_iC=64_H=56_W=56_oC=64_stride=1_pad=1_dilates=1_groups=1 | 0.119289 | 0.091015 | 23.70% |
| 1socket | Conv+hardswish_kernel=3_N=1_iC=64_H=56_W=56_oC=64_stride=1_pad=1_dilates=1_groups=1 | 0.144116 | 0.09339 | 35.20% |
| 1socket | Conv+abs_kernel=3_N=1_iC=512_H=56_W=56_oC=512_stride=2_pad=1_dilates=1_groups=32 | 0.209975 | 0.177111 | 15.65% |
| 1socket | Conv+hardswish_kernel=3_N=1_iC=512_H=56_W=56_oC=512_stride=2_pad=1_dilates=1_groups=32 | 0.234777 | 0.179945 | 23.36% |
| 1thread | Conv+abs_kernel=3_N=1_iC=64_H=56_W=56_oC=64_stride=1_pad=1_dilates=1_groups=1 | 1.296252 | 1.086423 | 16.19% |
| 1thread | Conv+hardswish_kernel=3_N=1_iC=64_H=56_W=56_oC=64_stride=1_pad=1_dilates=1_groups=1 | 1.364738 | 1.131289 | 17.11% |
| 1thread | Conv+abs_kernel=3_N=1_iC=512_H=56_W=56_oC=512_stride=2_pad=1_dilates=1_groups=32 | 3.99519 | 3.736147 | 6.48% |
| 1thread | Conv+hardswish_kernel=3_N=1_iC=512_H=56_W=56_oC=512_stride=2_pad=1_dilates=1_groups=32 | 4.03415 | 3.77981 | 6.30% |
| 4thread | Conv+abs_kernel=3_N=1_iC=64_H=56_W=56_oC=64_stride=1_pad=1_dilates=1_groups=1 | 0.27474 | 0.245281 | 10.72% |
| 4thread | Conv+hardswish_kernel=3_N=1_iC=64_H=56_W=56_oC=64_stride=1_pad=1_dilates=1_groups=1 | 0.28595 | 0.254748 | 10.91% |
| 4thread | Conv+abs_kernel=3_N=1_iC=512_H=56_W=56_oC=512_stride=2_pad=1_dilates=1_groups=32 | 0.847318 | 0.791453 | 6.59% |
| 4thread | Conv+hardswish_kernel=3_N=1_iC=512_H=56_W=56_oC=512_stride=2_pad=1_dilates=1_groups=32 | 0.870212 | 0.801594 | 7.89% |

**On CPX (28 cores per socket): Linear BF16**

| | shape | time_(ms)_before | time_(ms)_after | Gain |
| -- | -- | -- | -- | -- |
| 1socket | Linear+abs_N=1_iC=1024_oC=4096 | 0.043199 | 0.037603 | 12.95% |
| 1socket | Linear+hardswish_N=1_iC=1024_oC=4096 | 0.041845 | 0.038332 | 8.40% |
| 1socket | Linear+abs_N=1_iC=4096_oC=1024 | 0.048282 | 0.044281 | 8.29% |
| 1socket | Linear+hardswish_N=1_iC=4096_oC=1024 | 0.048362 | 0.044106 | 8.80% |
| 1socket | Linear+abs_N=1_iC=2048_oC=1000 | 0.036302 | 0.0344 | 5.24% |
| 1socket | Linear+hardswish_N=1_iC=2048_oC=1000 | 0.035734 | 0.035593 | 0.39% |
| 1thread | Linear+abs_N=1_iC=1024_oC=4096 | 0.365143 | 0.36279 | 0.64% |
| 1thread | Linear+hardswish_N=1_iC=1024_oC=4096 | 0.364464 | 0.363392 | 0.29% |
| 1thread | Linear+abs_N=1_iC=4096_oC=1024 | 0.384498 | 0.379902 | 1.20% |
| 1thread | Linear+hardswish_N=1_iC=4096_oC=1024 | 0.382545 | 0.381252 | 0.34% |
| 1thread | Linear+abs_N=1_iC=2048_oC=1000 | 0.213244 | 0.209999 | 1.52% |
| 1thread | Linear+hardswish_N=1_iC=2048_oC=1000 | 0.212003 | 0.208567 | 1.62% |
| 4thread | Linear+abs_N=1_iC=1024_oC=4096 | 0.126096 | 0.12157 | 3.59% |
| 4thread | Linear+hardswish_N=1_iC=1024_oC=4096 | 0.126627 | 0.121662 | 3.92% |
| 4thread | Linear+abs_N=1_iC=4096_oC=1024 | 0.132845 | 0.128921 | 2.95% |
| 4thread | Linear+hardswish_N=1_iC=4096_oC=1024 | 0.132642 | 0.12783 | 3.63% |
| 4thread | Linear+abs_N=1_iC=2048_oC=1000 | 0.079582 | 0.072584 | 8.79% |
| 4thread | Linear+hardswish_N=1_iC=2048_oC=1000 | 0.077761 | 0.071981 | 7.43% |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/82705

Approved by: https://github.com/frank-wei, https://github.com/eellison

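
The PR text does not spell out how the Gain column is derived or how the timings were collected. The rows spot-checked here are consistent with Gain = (time_before - time_after) / time_before. The following is a minimal sketch under that assumption; `average_ms` and `gain_percent` are hypothetical helper names, not part of the PR, and the timing loop only mirrors the stated num_warmup / num_run settings rather than the authors' actual harness.

```python
import time


def average_ms(fn, num_warmup=300, num_run=500):
    """Average wall-clock time of one call to fn, in milliseconds.

    Hypothetical helper: runs untimed warm-up iterations first, then
    averages over the timed iterations, matching the PR's stated
    num_warmup = 300 and num_run = 500.
    """
    for _ in range(num_warmup):
        fn()
    start = time.perf_counter()
    for _ in range(num_run):
        fn()
    elapsed_s = time.perf_counter() - start
    return elapsed_s * 1000.0 / num_run


def gain_percent(time_before, time_after):
    """Relative reduction in time, as a percentage of time_before.

    Assumed formula; it reproduces the table rows checked below.
    """
    return (time_before - time_after) / time_before * 100.0


# Spot checks against rows copied from the tables above.
print(round(gain_percent(0.112174, 0.071106), 2))  # ICX 1socket Conv+abs, iC=64
print(round(gain_percent(1.715168, 1.459937), 2))  # ICX 1thread Conv+hardswish, iC=64
print(round(gain_percent(0.035734, 0.035593), 2))  # CPX 1socket Linear+hardswish, iC=2048
```

To compare an unfused and a fused callable, one would pass each to `average_ms` and feed the two results to `gain_percent`; thread count and allocator settings would need to be fixed outside this sketch.
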
| | | |
|---|---|---|
| .. | ||
| benchmark@0d98dba29d | ||
| cpuinfo@5916273f79 | ||
| cub@d106ddb991 | ||
| cudnn_frontend@43709ab96c | ||
| eigen@3147391d94 | ||
| fbgemm@499cd22f5c | ||
| flatbuffers@d0cede9c90 | ||
| fmt@cd4af11efc | ||
| foxi@c278588e34 | ||
| FP16@4dfe081cf6 | ||
| FXdiv@b408327ac2 | ||
| gemmlowp | ||
| gloo@5b14351326 | ||
| googletest@e2239ee604 | ||
| ideep@77d662b313 | ||
| ios-cmake@8abaed637d | ||
| ittapi@5b8a7d7422 | ||
| kineto@0703c78999 | ||
| miniz-2.1.0 | ||
| nccl | ||
| neon2sse@97a126f08c | ||
| nlohmann@87cda1d664 | ||
| NNPACK@c07e3a0400 | ||
| onnx@f7ee1ac60d | ||
| onnx-tensorrt@c153211418 | ||
| pocketfft@ea778e3771 | ||
| protobuf@d1eca4e4b4 | ||
| psimd@072586a71b | ||
| pthreadpool@a134dd5d4c | ||
| pybind11@aa304c9c7d | ||
| python-enum@4cfedc426c | ||
| python-peachpy@f45429b087 | ||
| python-six@15e31431af | ||
| QNNPACK@7d2a4e9931 | ||
| sleef@e0a003ee83 | ||
| tbb@a51a90bc60 | ||
| tensorflow_cuda_bazel_build/cuda | ||
| tensorpipe@52791a2fd2 | ||
| valgrind-headers | ||
| XNNPACK@ae108ef49a | ||
| zstd@aec56a52fb | ||
| BUCK.oss | ||
| BUILD | ||
| build_bundled.py | ||
| cpuinfo.BUILD | ||
| cuda.BUILD | ||
| cudnn.BUILD | ||
| eigen.BUILD | ||
| fmt.BUILD | ||
| foxi.BUILD | ||
| generate-cpuinfo-wrappers.py | ||
| generate-xnnpack-wrappers.py | ||
| glog.buck.bzl | ||
| gloo.BUILD | ||
| ideep.BUILD | ||
| kineto.buck.bzl | ||
| kineto.BUILD | ||
| LICENSES_BUNDLED.txt | ||
| METADATA.bzl | ||
| mkl-dnn.BUILD | ||
| mkl.BUILD | ||
| mkl_headers.BUILD | ||
| onnx.BUILD | ||
| README.md | ||
| sleef.BUILD | ||
| sleef.bzl | ||
| substitution.bzl | ||
| tbb.BUILD | ||
| tbb.patch | ||
| tensorpipe.BUILD | ||
| xnnpack.buck.bzl | ||
| xnnpack_src_defs.bzl | ||
| xnnpack_wrapper_defs.bzl | ||
This folder contains vendored copies of third-party libraries that we use.