We did a lot of optimization work for PyTorch on Windows and made good progress, but some models still show a performance gap between PyTorch on Windows and PyTorch on Linux. Ref: https://pytorch.org/blog/performance-boost-windows/#conclusion

From the blog's conclusion, `ResNet50` is a typical case. Let's focus on `ResNet50` and collect a profiling log:

```cmd
(nightly) D:\xu_git\dnnl_cb>python test_script_resnet50.py
---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                             Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                  model_inference         3.91%     682.427ms       100.00%       17.448s       17.448s             1
                     aten::conv2d         0.18%      30.906ms        64.79%       11.305s       2.133ms          5300
                aten::convolution         0.45%      78.031ms        64.62%       11.275s       2.127ms          5300
               aten::_convolution         0.30%      51.670ms        64.17%       11.196s       2.113ms          5300
         aten::mkldnn_convolution        63.58%       11.093s        63.87%       11.145s       2.103ms          5300
                 aten::batch_norm         0.13%      23.536ms        20.10%        3.506s     661.580us          5300
     aten::_batch_norm_impl_index         0.28%      49.486ms        19.96%        3.483s     657.139us          5300
          aten::native_batch_norm        19.26%        3.360s        19.64%        3.427s     646.615us          5300
                 aten::max_pool2d         0.01%       1.038ms         5.84%        1.018s      10.181ms           100
    aten::max_pool2d_with_indices         5.83%        1.017s         5.83%        1.017s      10.171ms           100
                       aten::add_         3.38%     588.907ms         3.38%     588.907ms      85.349us          6900
                      aten::relu_         0.35%      60.358ms         1.67%     292.155ms      59.624us          4900
                 aten::clamp_min_         1.33%     231.797ms         1.33%     231.797ms      47.306us          4900
                      aten::empty         0.46%      80.195ms         0.46%      80.195ms       1.513us         53000
                     aten::linear         0.01%     927.300us         0.23%      39.353ms     393.532us           100
                      aten::addmm         0.20%      35.379ms         0.21%      37.016ms     370.155us           100
                 aten::empty_like         0.12%      20.455ms         0.17%      29.976ms       5.656us          5300
                aten::as_strided_         0.11%      18.830ms         0.11%      18.830ms       3.553us          5300
        aten::adaptive_avg_pool2d         0.00%     419.900us         0.08%      14.265ms     142.647us           100
                       aten::mean         0.01%       1.737ms         0.08%      13.845ms     138.448us           100
                        aten::sum         0.05%       8.113ms         0.05%       8.648ms      86.479us           100
                    aten::resize_         0.03%       5.182ms         0.03%       5.182ms       0.978us          5300
                       aten::div_         0.01%       1.445ms         0.02%       3.460ms      34.600us           100
                         aten::to         0.00%     337.000us         0.01%       2.015ms      20.154us           100
                   aten::_to_copy         0.01%     977.500us         0.01%       1.678ms      16.784us           100
                      aten::copy_         0.01%       1.474ms         0.01%       1.474ms       7.371us           200
                          aten::t         0.00%     775.900us         0.01%       1.410ms      14.104us           100
                    aten::flatten         0.00%     420.900us         0.01%       1.311ms      13.106us           100
                       aten::view         0.01%     889.700us         0.01%     889.700us       8.897us           100
                  aten::transpose         0.00%     410.700us         0.00%     634.500us       6.345us           100
                     aten::expand         0.00%     496.800us         0.00%     566.800us       5.668us           100
                      aten::fill_         0.00%     534.800us         0.00%     534.800us       5.348us           100
                 aten::as_strided         0.00%     293.800us         0.00%     293.800us       1.469us           200
              aten::empty_strided         0.00%     241.700us         0.00%     241.700us       2.417us           100
               aten::resolve_conj         0.00%      54.800us         0.00%      54.800us       0.274us           200
---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 17.448s

Execution time: 20.02380895614624
```

The kernel consuming the most CPU time is `aten::mkldnn_convolution`, which is dispatched to `MKLDNN`. We had already optimized memory allocation by integrating mimalloc into PyTorch's C10 module. That gave PyTorch on Windows a large boost, but it does not cover the intermediate temporary memory allocated inside `MKL` and `MKLDNN`. So there is still potential to improve PyTorch Windows performance by optimizing how `MKL` and `MKLDNN` allocate their intermediate temporary memory.
So I discussed this with the Intel MKL team and learned that there is a way to register a high-performance memory allocation API with MKL, which helps MKL improve its memory performance. Please check the online document: https://www.intel.com/content/www/us/en/docs/onemkl/developer-guide-windows/2023-0/redefining-memory-functions.html

This PR optimizes MKL memory allocation performance on Windows by registering mi_malloc with MKL.

PR changes:
1. Add the cmake option `USE_MIMALLOC_ON_MKL`; it is a sub-option of `USE_MIMALLOC`.
2. Wrap and export the mi_malloc APIs in C10 when `USE_MIMALLOC_ON_MKL` is `ON`.
3. Add MklAllocationHelp.cpp to register the allocation APIs with MKL when `USE_MIMALLOC_ON_MKL` is `ON` (see the sketch after this list).

For `oneDNN`, the same idea is still being tracked in this proposal: https://github.com/oneapi-src/oneDNN/issues/1898

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138419
Approved by: https://github.com/jgong5, https://github.com/ezyang
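For context, the Intel document linked above describes redefining MKL's memory functions by assigning the `i_malloc`, `i_calloc`, `i_realloc`, and `i_free` function pointers declared in MKL's `i_malloc.h` before the first MKL call that allocates memory. Below is a minimal sketch of what such a registration helper could look like; it is not the PR's actual MklAllocationHelp.cpp, and the include path for the C10 wrapper header is an assumption. The `c10_mi_*` wrappers it uses are declared in the header shown further down.

```cpp
// Sketch only: route MKL's internal allocations through the C10 mimalloc
// wrappers by assigning MKL's redefinable memory-function pointers.
#ifdef USE_MIMALLOC_ON_MKL

#include <c10/core/alloc_cpu.h> // assumed location of the wrapper declarations
#include <i_malloc.h>           // MKL's redefinable memory-function pointers

namespace {

// A static object whose constructor runs at library load time, before any
// MKL routine has a chance to allocate its intermediate temporary memory.
struct RegisterMimallocWithMkl {
  RegisterMimallocWithMkl() {
    i_malloc = c10::mi_malloc_wrapper::c10_mi_malloc;
    i_calloc = c10::mi_malloc_wrapper::c10_mi_calloc;
    i_realloc = c10::mi_malloc_wrapper::c10_mi_realloc;
    i_free = c10::mi_malloc_wrapper::c10_mi_free;
  }
};

static RegisterMimallocWithMkl g_register_mimalloc_with_mkl;

} // namespace

#endif // USE_MIMALLOC_ON_MKL
```

Registering at static-initialization time matters because, per the Intel document, the pointers must be reassigned before MKL functions are called; allocations made before the assignment still go through the default allocator.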
22 lines | 532 B | C++
```cpp
#pragma once

#include <c10/macros/Export.h>

#include <cstddef>

namespace c10 {

C10_API void* alloc_cpu(size_t nbytes);
C10_API void free_cpu(void* data);

#ifdef USE_MIMALLOC_ON_MKL
namespace mi_malloc_wrapper {
C10_API void* c10_mi_malloc(size_t size);
C10_API void* c10_mi_calloc(size_t count, size_t size);
C10_API void* c10_mi_realloc(void* p, size_t newsize);
C10_API void* c10_mi_malloc_aligned(size_t size, size_t alignment);
C10_API void c10_mi_free(void* p);
} // namespace mi_malloc_wrapper
#endif

} // namespace c10
```
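And a minimal sketch, assuming mimalloc's standard C API from `<mimalloc.h>`, of how the exported wrappers declared above could forward to mimalloc (the PR's actual implementation file is not reproduced here; the include path is again an assumption):

```cpp
// Sketch only: forward the C10 wrapper functions to mimalloc.
#ifdef USE_MIMALLOC_ON_MKL

#include <c10/core/alloc_cpu.h> // assumed location of the header above
#include <mimalloc.h>

namespace c10 {
namespace mi_malloc_wrapper {

void* c10_mi_malloc(size_t size) {
  return mi_malloc(size);
}

void* c10_mi_calloc(size_t count, size_t size) {
  return mi_calloc(count, size);
}

void* c10_mi_realloc(void* p, size_t newsize) {
  return mi_realloc(p, newsize);
}

void* c10_mi_malloc_aligned(size_t size, size_t alignment) {
  return mi_malloc_aligned(size, alignment);
}

void c10_mi_free(void* p) {
  mi_free(p);
}

} // namespace mi_malloc_wrapper
} // namespace c10

#endif // USE_MIMALLOC_ON_MKL
```

Keeping the wrappers inside C10 (rather than exposing mimalloc symbols directly) lets the MKL registration code depend only on the exported `c10_mi_*` entry points, so the mimalloc dependency stays encapsulated behind the `USE_MIMALLOC_ON_MKL` option.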