pytorch/third_party
yanbing-j 6dc8673b1b Update ideep for NNC post-op (#82705)
### Description
This PR adds NNC post-op fusion support to ideep for further NNC development. It includes:

- element-wise post-op fusion
- conv/matmul/linear + binary post-op fusion

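As a rough illustration of what post-op fusion means, the sketch below contrasts an unfused linear + element-wise op with a fused one. It is plain Python, not ideep, oneDNN, or NNC code; the helper names (`dot`, `linear_unfused`, `linear_fused_eltwise`, `linear_fused_binary`) are hypothetical and exist only for this example.

```python
# Illustrative sketch only: plain Python, not ideep, oneDNN, or NNC code.
# All function names here are hypothetical.

def hardswish(x):
    # hardswish(x) = x * relu6(x + 3) / 6
    return x * min(max(x + 3.0, 0.0), 6.0) / 6.0


def dot(x, row):
    # Plain left-to-right accumulation, shared by all variants below.
    acc = 0.0
    for xi, wi in zip(x, row):
        acc += xi * wi
    return acc


def linear_unfused(x, weight, bias, eltwise):
    # Pass 1: materialize the full linear output.
    y = [dot(x, row) + b for row, b in zip(weight, bias)]
    # Pass 2: re-read the intermediate and apply the element-wise op.
    return [eltwise(v) for v in y]


def linear_fused_eltwise(x, weight, bias, eltwise):
    # Single pass: apply the op to each output value as it is produced,
    # so no intermediate list is written and read back.
    return [eltwise(dot(x, row) + b) for row, b in zip(weight, bias)]


def linear_fused_binary(x, weight, bias, other, binary):
    # Binary post-op: combine each output value with a second operand
    # (for example an element-wise add) in the same pass.
    return [binary(dot(x, row) + b, o)
            for row, b, o in zip(weight, bias, other)]


x = [1.0, -2.0]
weight = [[0.5, 1.0], [-1.5, 0.25]]
bias = [0.1, -0.2]
fused = linear_fused_eltwise(x, weight, bias, abs)
```

Both variants compute the same values; the fused form only avoids the extra pass over the intermediate result, which is presumably where the measured gains come from.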
### Performance
**Common configuration:**
- jemalloc and iomp enabled
- Batch size (BS) = 1
- num_warmup = 300
- num_run = 500
- Reported times are the average time per iteration, in ms
- time_before: without fusion
- time_after: with fusion
- Element-wise ops tested: hardswish and abs
- oneDNN v2.6

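A minimal sketch of a timing loop that matches the configuration above (warm-up iterations discarded, then the average per-iteration time in ms). The function name `average_time_ms` is hypothetical and this is not the script used for the PR; only `num_warmup` and `num_run` are taken from the description.

```python
import time


def average_time_ms(fn, num_warmup=300, num_run=500):
    # Warm-up iterations are executed but not timed.
    for _ in range(num_warmup):
        fn()
    # Time num_run iterations and report the mean per iteration in ms.
    start = time.perf_counter()
    for _ in range(num_run):
        fn()
    elapsed_s = time.perf_counter() - start
    return elapsed_s * 1000.0 / num_run
```

In this sketch `fn` would be a zero-argument callable that wraps one forward pass of the op under test, for example the unfused or the fused variant.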
**On ICX (32 cores per socket): Conv2d FP32 (channels-last format)**

config | shape | time_before (ms) | time_after (ms) | Gain
-- | -- | -- | -- | --
1socket | Conv+abs_kernel=3_N=1_iC=64_H=56_W=56_oC=64_stride=1_pad=1_dilates=1_groups=1 | 0.112174 | 0.071106 | 36.61%
1socket | Conv+hardswish_kernel=3_N=1_iC=64_H=56_W=56_oC=64_stride=1_pad=1_dilates=1_groups=1 | 0.11269 | 0.070586 | 37.36%
1socket | Conv+abs_kernel=3_N=1_iC=512_H=56_W=56_oC=512_stride=2_pad=1_dilates=1_groups=32 | 0.164219 | 0.129498 | 21.14%
1socket | Conv+hardswish_kernel=3_N=1_iC=512_H=56_W=56_oC=512_stride=2_pad=1_dilates=1_groups=32 | 0.169371 | 0.1277 | 24.60%

config | shape | time_before (ms) | time_after (ms) | Gain
-- | -- | -- | -- | --
1thread | Conv+abs_kernel=3_N=1_iC=64_H=56_W=56_oC=64_stride=1_pad=1_dilates=1_groups=1 | 1.994555 | 1.429813 | 28.31%
1thread | Conv+hardswish_kernel=3_N=1_iC=64_H=56_W=56_oC=64_stride=1_pad=1_dilates=1_groups=1 | 1.715168 | 1.459937 | 14.88%
1thread | Conv+abs_kernel=3_N=1_iC=512_H=56_W=56_oC=512_stride=2_pad=1_dilates=1_groups=32 | 2.997382 | 2.47915 | 17.29%
1thread | Conv+hardswish_kernel=3_N=1_iC=512_H=56_W=56_oC=512_stride=2_pad=1_dilates=1_groups=32 | 3.044476 | 2.499366 | 17.90%

config | shape | time_before (ms) | time_after (ms) | Gain
-- | -- | -- | -- | --
4thread | Conv+abs_kernel=3_N=1_iC=64_H=56_W=56_oC=64_stride=1_pad=1_dilates=1_groups=1 | 0.405204 | 0.38117 | 5.93%
4thread | Conv+hardswish_kernel=3_N=1_iC=64_H=56_W=56_oC=64_stride=1_pad=1_dilates=1_groups=1 | 0.410145 | 0.389279 | 5.09%
4thread | Conv+abs_kernel=3_N=1_iC=512_H=56_W=56_oC=512_stride=2_pad=1_dilates=1_groups=32 | 0.67917 | 0.662792 | 2.41%
4thread | Conv+hardswish_kernel=3_N=1_iC=512_H=56_W=56_oC=512_stride=2_pad=1_dilates=1_groups=32 | 0.682302 | 0.671226 | 1.62%

**On CPX (28 cores per socket): Conv2d BF16 (channels-last format)**

config | shape | time_before (ms) | time_after (ms) | Gain
-- | -- | -- | -- | --
1socket | Conv+abs_kernel=3_N=1_iC=64_H=56_W=56_oC=64_stride=1_pad=1_dilates=1_groups=1 | 0.119289 | 0.091015 | 23.70%
1socket | Conv+hardswish_kernel=3_N=1_iC=64_H=56_W=56_oC=64_stride=1_pad=1_dilates=1_groups=1 | 0.144116 | 0.09339 | 35.20%
1socket | Conv+abs_kernel=3_N=1_iC=512_H=56_W=56_oC=512_stride=2_pad=1_dilates=1_groups=32 | 0.209975 | 0.177111 | 15.65%
1socket | Conv+hardswish_kernel=3_N=1_iC=512_H=56_W=56_oC=512_stride=2_pad=1_dilates=1_groups=32 | 0.234777 | 0.179945 | 23.36%

config | shape | time_before (ms) | time_after (ms) | Gain
-- | -- | -- | -- | --
1thread | Conv+abs_kernel=3_N=1_iC=64_H=56_W=56_oC=64_stride=1_pad=1_dilates=1_groups=1 | 1.296252 | 1.086423 | 16.19%
1thread | Conv+hardswish_kernel=3_N=1_iC=64_H=56_W=56_oC=64_stride=1_pad=1_dilates=1_groups=1 | 1.364738 | 1.131289 | 17.11%
1thread | Conv+abs_kernel=3_N=1_iC=512_H=56_W=56_oC=512_stride=2_pad=1_dilates=1_groups=32 | 3.99519 | 3.736147 | 6.48%
1thread | Conv+hardswish_kernel=3_N=1_iC=512_H=56_W=56_oC=512_stride=2_pad=1_dilates=1_groups=32 | 4.03415 | 3.77981 | 6.30%

config | shape | time_before (ms) | time_after (ms) | Gain
-- | -- | -- | -- | --
4thread | Conv+abs_kernel=3_N=1_iC=64_H=56_W=56_oC=64_stride=1_pad=1_dilates=1_groups=1 | 0.27474 | 0.245281 | 10.72%
4thread | Conv+hardswish_kernel=3_N=1_iC=64_H=56_W=56_oC=64_stride=1_pad=1_dilates=1_groups=1 | 0.28595 | 0.254748 | 10.91%
4thread | Conv+abs_kernel=3_N=1_iC=512_H=56_W=56_oC=512_stride=2_pad=1_dilates=1_groups=32 | 0.847318 | 0.791453 | 6.59%
4thread | Conv+hardswish_kernel=3_N=1_iC=512_H=56_W=56_oC=512_stride=2_pad=1_dilates=1_groups=32 | 0.870212 | 0.801594 | 7.89%

**On CPX (28 cores per socket): Linear BF16**

config | shape | time_before (ms) | time_after (ms) | Gain
-- | -- | -- | -- | --
1socket | Linear+abs_N=1_iC=1024_oC=4096 | 0.043199 | 0.037603 | 12.95%
1socket | Linear+hardswish_N=1_iC=1024_oC=4096 | 0.041845 | 0.038332 | 8.40%
1socket | Linear+abs_N=1_iC=4096_oC=1024 | 0.048282 | 0.044281 | 8.29%
1socket | Linear+hardswish_N=1_iC=4096_oC=1024 | 0.048362 | 0.044106 | 8.80%
1socket | Linear+abs_N=1_iC=2048_oC=1000 | 0.036302 | 0.0344 | 5.24%
1socket | Linear+hardswish_N=1_iC=2048_oC=1000 | 0.035734 | 0.035593 | 0.39%

config | shape | time_before (ms) | time_after (ms) | Gain
-- | -- | -- | -- | --
1thread | Linear+abs_N=1_iC=1024_oC=4096 | 0.365143 | 0.36279 | 0.64%
1thread | Linear+hardswish_N=1_iC=1024_oC=4096 | 0.364464 | 0.363392 | 0.29%
1thread | Linear+abs_N=1_iC=4096_oC=1024 | 0.384498 | 0.379902 | 1.20%
1thread | Linear+hardswish_N=1_iC=4096_oC=1024 | 0.382545 | 0.381252 | 0.34%
1thread | Linear+abs_N=1_iC=2048_oC=1000 | 0.213244 | 0.209999 | 1.52%
1thread | Linear+hardswish_N=1_iC=2048_oC=1000 | 0.212003 | 0.208567 | 1.62%

config | shape | time_before (ms) | time_after (ms) | Gain
-- | -- | -- | -- | --
4thread | Linear+abs_N=1_iC=1024_oC=4096 | 0.126096 | 0.12157 | 3.59%
4thread | Linear+hardswish_N=1_iC=1024_oC=4096 | 0.126627 | 0.121662 | 3.92%
4thread | Linear+abs_N=1_iC=4096_oC=1024 | 0.132845 | 0.128921 | 2.95%
4thread | Linear+hardswish_N=1_iC=4096_oC=1024 | 0.132642 | 0.12783 | 3.63%
4thread | Linear+abs_N=1_iC=2048_oC=1000 | 0.079582 | 0.072584 | 8.79%
4thread | Linear+hardswish_N=1_iC=2048_oC=1000 | 0.077761 | 0.071981 | 7.43%

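The Gain column is not defined in the PR text. For the rows checked, it is consistent with the relative time reduction `(time_before - time_after) / time_before`. The sketch below states that inferred formula; the function name `gain_percent` is hypothetical.

```python
def gain_percent(time_before, time_after):
    # Relative reduction in time, as a percentage of the unfused time.
    # Inferred from the table values, not stated in the PR description.
    return (time_before - time_after) / time_before * 100.0


# First ICX 1socket Conv+abs row: 0.112174 ms before, 0.071106 ms after.
icx_conv_abs_gain = gain_percent(0.112174, 0.071106)
# First CPX 1socket Linear+abs row: 0.043199 ms before, 0.037603 ms after.
cpx_linear_abs_gain = gain_percent(0.043199, 0.037603)
```

Rounded to two decimals, these two values match the 36.61% and 12.95% entries in the tables.
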
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82705
Approved by: https://github.com/frank-wei, https://github.com/eellison
2022-08-18 05:08:12 +00:00
benchmark@0d98dba29d
cpuinfo@5916273f79
cub@d106ddb991
cudnn_frontend@43709ab96c
eigen@3147391d94
fbgemm@499cd22f5c Upgrade fbgemm in OSS PyTorch (#82676) 2022-08-03 00:28:43 +00:00
flatbuffers@d0cede9c90
fmt@cd4af11efc
foxi@c278588e34
FP16@4dfe081cf6
FXdiv@b408327ac2
gemmlowp
gloo@5b14351326
googletest@e2239ee604
ideep@77d662b313 Update ideep for NNC post-op (#82705) 2022-08-18 05:08:12 +00:00
ios-cmake@8abaed637d
ittapi@5b8a7d7422 Enable Intel® VTune™ Profiler's Instrumentation and Tracing Technology APIs (ITT) to PyTorch (#63289) 2022-07-13 13:50:15 +00:00
kineto@0703c78999 Revert "Automated submodule update: kineto (#79925)" 2022-07-06 22:14:13 +00:00
miniz-2.1.0 Updating miniz library from version 2.0.8 -> 2.1.0 (#79636) 2022-06-22 15:02:16 +00:00
nccl Update NCCL to v2.13.4-1 (#82775) 2022-08-04 19:36:45 +00:00
neon2sse@97a126f08c
nlohmann@87cda1d664 Add nlohmann/json submodule (#80322) 2022-06-28 23:54:33 +00:00
NNPACK@c07e3a0400
onnx@f7ee1ac60d Revert "[MPS] Add test consistency from OpInfo based tests from PR 78504 (#79532)" 2022-06-30 16:37:11 +00:00
onnx-tensorrt@c153211418
pocketfft@ea778e3771
protobuf@d1eca4e4b4
psimd@072586a71b
pthreadpool@a134dd5d4c
pybind11@aa304c9c7d Revert "sym_numel (#82374)" (#82726) 2022-08-03 15:23:47 +00:00
python-enum@4cfedc426c
python-peachpy@f45429b087
python-six@15e31431af
QNNPACK@7d2a4e9931
sleef@e0a003ee83
tbb@a51a90bc60
tensorflow_cuda_bazel_build/cuda
tensorpipe@52791a2fd2
valgrind-headers
XNNPACK@ae108ef49a
zstd@aec56a52fb
BUCK.oss [pocket fft] turning on pocketfft flag (#81670) 2022-07-21 02:45:20 +00:00
BUILD
build_bundled.py create a concated LICENSE file for wheels (#81500) 2022-07-18 14:02:37 +00:00
cpuinfo.BUILD
cuda.BUILD
cudnn.BUILD
eigen.BUILD
fmt.BUILD
foxi.BUILD
generate-cpuinfo-wrappers.py
generate-xnnpack-wrappers.py [5] move XNNPACK to shared BUCK build (#80209) 2022-06-28 02:25:07 +00:00
glog.buck.bzl
gloo.BUILD
ideep.BUILD
kineto.buck.bzl Revert "[Codemod][Format buck files with arc lint] caffe2/third_party (#81441)" 2022-07-19 09:57:32 +00:00
kineto.BUILD Back out "Revert D37720837: Back out "Revert D37228314: [Profiler] Include ActivityType from Kineto"" (#81450) 2022-07-15 18:25:40 +00:00
LICENSES_BUNDLED.txt create a concated LICENSE file for wheels (#81500) 2022-07-18 14:02:37 +00:00
METADATA.bzl
mkl-dnn.BUILD
mkl.BUILD
mkl_headers.BUILD
onnx.BUILD
README.md
sleef.BUILD
sleef.bzl
substitution.bzl
tbb.BUILD
tbb.patch
tensorpipe.BUILD
xnnpack.buck.bzl [5] move XNNPACK to shared BUCK build (#80209) 2022-06-28 02:25:07 +00:00
xnnpack_src_defs.bzl [5] move XNNPACK to shared BUCK build (#80209) 2022-06-28 02:25:07 +00:00
xnnpack_wrapper_defs.bzl [5] move XNNPACK to shared BUCK build (#80209) 2022-06-28 02:25:07 +00:00

This folder contains vendored copies of third-party libraries that we use.