torch.utils.bottleneck
======================

.. automodule:: torch.utils.bottleneck
.. currentmodule:: torch.utils.bottleneck

`torch.utils.bottleneck` is a tool that can be used as an initial step for
debugging bottlenecks in your program. It summarizes runs of your script with
the Python profiler and PyTorch's autograd profiler.

Run it on the command line with

::

    python -m torch.utils.bottleneck /path/to/source/script.py [args]

where [args] are any number of arguments to `script.py`, or run
``python -m torch.utils.bottleneck -h`` for more usage instructions.
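As a concrete example, suppose ``script.py`` (a hypothetical file name) contains a small training step like the sketch below; ``bottleneck`` then profiles the whole run:

```python
# script.py -- a hypothetical, minimal target for bottleneck profiling.
import torch

model = torch.nn.Linear(128, 10)
inp = torch.randn(64, 128)

# A short, finite loop: bottleneck requires the script to terminate.
for _ in range(100):
    model.zero_grad()
    out = model(inp)
    loss = out.sum()
    loss.backward()
```

It would then be profiled end-to-end with ``python -m torch.utils.bottleneck script.py``.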

.. warning::
    Because your script will be profiled, please ensure that it exits in a
    finite amount of time.

.. warning::
    Due to the asynchronous nature of CUDA kernels, when running against
    CUDA code, the cProfile output and CPU-mode autograd profilers may
    not show correct timings: the reported CPU time covers only the time
    spent launching the kernels and does not include the time the kernels
    spent executing on the GPU, unless the operation synchronizes.
    Ops that do synchronize appear to be extremely expensive under regular
    CPU-mode profilers. In cases where the timings are incorrect, the
    CUDA-mode autograd profiler may be helpful.
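The launch-versus-execute gap described in the warning above can be observed directly with wall-clock timing. This sketch is illustrative only and is guarded so it is a no-op on machines without CUDA:

```python
import time
import torch

if torch.cuda.is_available():
    x = torch.randn(4096, 4096, device="cuda")
    t0 = time.perf_counter()
    y = x @ x                     # returns as soon as the kernel is launched
    launch = time.perf_counter() - t0
    torch.cuda.synchronize()      # block until the kernel actually finishes
    total = time.perf_counter() - t0
    print(f"launch: {launch:.6f}s  with sync: {total:.6f}s")
```

On a GPU machine the launch time is typically far smaller than the synchronized total, which is exactly why a CPU-side profiler under-reports asynchronous kernels.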

.. note::
    To decide which (CPU-only-mode or CUDA-mode) autograd profiler output to
    look at, you should first check if your script is CPU-bound
    ("CPU total time is much greater than CUDA total time").
    If it is CPU-bound, looking at the results of the CPU-mode autograd
    profiler will help. If, on the other hand, your script spends most of its
    time executing on the GPU, then it makes sense to start
    looking for responsible CUDA operators in the output of the CUDA-mode
    autograd profiler.

    Of course the reality is much more complicated, and your script might not
    be in one of those two extremes depending on the part of the model you're
    evaluating. If the profiler outputs don't help, you could try looking at
    the result of :func:`torch.autograd.profiler.emit_nvtx()` with ``nvprof``.
    However, please take into account that the NVTX overhead is very high and
    often gives a heavily skewed timeline. Similarly, Intel VTune Profiler
    helps to further analyze performance on Intel platforms with
    :func:`torch.autograd.profiler.emit_itt()`.
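To make the CPU-bound check above concrete, here is a minimal sketch using the autograd profiler directly. ``use_cuda`` is the legacy flag of :func:`torch.autograd.profiler.profile`; when CUDA is absent this simply falls back to CPU-only profiling:

```python
import torch
from torch.autograd import profiler

x = torch.randn(512, 512)
use_cuda = torch.cuda.is_available()
if use_cuda:
    x = x.cuda()

# Profile a few iterations; with use_cuda=True, CUDA kernel times are
# recorded as well, so the table shows both CPU and CUDA totals.
with profiler.profile(use_cuda=use_cuda) as prof:
    for _ in range(10):
        y = x @ x

# Compare the "CPU total" and "CUDA total" columns to decide which
# profiler mode deserves a closer look.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
```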

.. warning::
    If you are profiling CUDA code, the first profiler that ``bottleneck`` runs
    (cProfile) will include the CUDA startup time (CUDA buffer allocation cost)
    in its time reporting. This should not matter if your bottlenecks result
    in code much slower than the CUDA startup time.

For more complicated uses of the profilers (like in a multi-GPU case),
please see https://docs.python.org/3/library/profile.html
or :func:`torch.autograd.profiler.profile()` for more information.
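A minimal sketch of the :func:`torch.autograd.profiler.emit_nvtx` path referenced in the note above. The NVTX ranges only become visible when the script is run under ``nvprof`` or Nsight Systems, and the body is guarded because it requires CUDA:

```python
import torch

if torch.cuda.is_available():
    x = torch.randn(64, 64, device="cuda", requires_grad=True)
    # Every autograd op inside this context emits an NVTX range that
    # nvprof / Nsight Systems can display on its timeline.
    with torch.autograd.profiler.emit_nvtx():
        y = (x @ x).sum()
        y.backward()
```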