pytorch/test/cpp/profiler/containers.cpp
Taylor Robie 0b1f3bd158 [Profiler] Prefer TSC to wall clock when available (#73855)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73855

Calling the clock is one of the most expensive parts of profiling. We can reduce the profiling overhead by using `rdtsc` instead. The tradeoff is that we have to measure the relationship between the two clocks and convert (shift and scale) TSC readings back to wall clock time.
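
The "shift and scale" conversion can be illustrated with a minimal sketch. Everything here (`Sample`, `TickConverter`, `calibrate`) is a hypothetical name chosen for illustration and is not the profiler's actual API. It assumes two paired readings of the wall clock and a raw tick counter are available, and that the counter runs at a constant rate between them.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical types for illustration only; not the PyTorch profiler API.
struct Sample {
  int64_t wall_ns;  // wall clock reading, in nanoseconds
  uint64_t ticks;   // raw counter reading taken at (nearly) the same moment
};

struct TickConverter {
  int64_t wall_origin_ns;  // shift: wall clock time at the reference sample
  uint64_t tick_origin;    // shift: counter value at the reference sample
  double ns_per_tick;      // scale: nanoseconds per counter tick

  // Map a raw counter value to an estimated wall clock time in nanoseconds.
  int64_t operator()(uint64_t ticks) const {
    const int64_t elapsed_ticks =
        static_cast<int64_t>(ticks) - static_cast<int64_t>(tick_origin);
    const double elapsed_ns = static_cast<double>(elapsed_ticks) * ns_per_tick;
    return wall_origin_ns + static_cast<int64_t>(std::llround(elapsed_ns));
  }
};

// Derive the shift and scale from two paired samples. Assumes the counter
// advanced between the samples and ticks at a constant rate.
inline TickConverter calibrate(const Sample& first, const Sample& last) {
  assert(last.ticks > first.ticks);
  const double scale = static_cast<double>(last.wall_ns - first.wall_ns) /
      static_cast<double>(last.ticks - first.ticks);
  return TickConverter{first.wall_ns, first.ticks, scale};
}
```

With samples `{1000, 500}` and `{2000, 3000}`, the scale is 0.4 ns per tick, so a counter value of 1750 maps to 1500 ns. The cost of conversion is paid once, after profiling, rather than on every clock call.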

Test Plan: I added a cpp unit test with *very* aggressive anti-flake measures. I also ran the overhead benchmark (9 replicates) with `--stressTestKineto` (0.94 -> 0.89 us) and `--stressTestKineto --kinetoProfileMemory` (1.27 -> 1.17 us)

Reviewed By: chaekit

Differential Revision: D34231071

fbshipit-source-id: e3b3dd7580d93bcc783e87c7f2fc726cb74f4df8
(cherry picked from commit e8be9f8160793c6ee35d5af02bca3e01703e377d)
2022-03-13 18:29:06 +00:00


#include <algorithm>
#include <cmath>
#include <utility>
#include <vector>
#include <gtest/gtest.h>
#include <c10/util/irange.h>
#include <torch/csrc/profiler/containers.h>
#include <torch/csrc/profiler/util.h>
TEST(ProfilerTest, AppendOnlyList) {
  const int n = 4096;
  torch::profiler::impl::AppendOnlyList<int, 1024> list;
  for (const auto i : c10::irange(n)) {
    list.emplace_back(i);
    ASSERT_EQ(list.size(), i + 1);
  }
  int expected = 0;
  for (const auto i : list) {
    ASSERT_EQ(i, expected++);
  }
  ASSERT_EQ(expected, n);
  list.clear();
  ASSERT_EQ(list.size(), 0);
}
TEST(ProfilerTest, AppendOnlyList_ref) {
  const int n = 512;
  torch::profiler::impl::AppendOnlyList<std::pair<int, int>, 64> list;
  std::vector<std::pair<int, int>*> refs;
  for (const auto _ : c10::irange(n)) {
    refs.push_back(list.emplace_back());
  }
  for (const auto i : c10::irange(n)) {
    *refs.at(i) = {i, 0};
  }
  int expected = 0;
  for (const auto& i : list) {
    ASSERT_EQ(i.first, expected++);
  }
}
// Test that we can convert TSC measurements back to wall clock time.
TEST(ProfilerTest, clock_converter) {
  const int n = 10001;
  torch::profiler::impl::ApproximateClockToUnixTimeConverter converter;
  std::vector<torch::profiler::impl::ApproximateClockToUnixTimeConverter::UnixAndApproximateTimePair> pairs;
  for (const auto i : c10::irange(n)) {
    pairs.push_back(torch::profiler::impl::ApproximateClockToUnixTimeConverter::measurePair());
  }
  auto count_to_ns = converter.makeConverter();
  std::vector<int64_t> deltas;
  for (const auto& i : pairs) {
    deltas.push_back(i.t_ - count_to_ns(i.approx_t_));
  }
  std::sort(deltas.begin(), deltas.end());
  // In general it's not a good idea to put clocks in unit tests as it leads
  // to flakiness. We mitigate this by:
  //   1) Testing the clock itself. While the time to complete a task may
  //      vary, two clocks measuring the same time should be much more
  //      consistent.
  //   2) Only testing the interquartile range. Context switches between
  //      calls to the two timers do occur and can result in hundreds of
  //      nanoseconds of noise, but such switches are only a few percent
  //      of cases.
  //   3) We're willing to accept a somewhat large bias which can emerge from
  //      differences in the cost of calling each clock.
  EXPECT_LT(std::abs(deltas[n / 2]), 200);
  EXPECT_LT(deltas[n * 3 / 4] - deltas[n / 4], 50);
}
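
The anti-flake reasoning in the comments above relies on the median and interquartile range being insensitive to a small number of outliers. A minimal standalone sketch of that idea follows. `RobustStats` and `robust_stats` are hypothetical helpers, not part of the test or the profiler. It uses the same index choices as the test (`n / 2`, `n / 4`, `n * 3 / 4`) and assumes a non-empty input.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper type for illustration only.
struct RobustStats {
  int64_t median;  // element at index n / 2 of the sorted data
  int64_t iqr;     // spread between the n * 3 / 4 and n / 4 elements
};

// Takes the vector by value so the caller's data is left unsorted.
inline RobustStats robust_stats(std::vector<int64_t> values) {
  assert(!values.empty());
  std::sort(values.begin(), values.end());
  const std::size_t n = values.size();
  return RobustStats{values[n / 2], values[n * 3 / 4] - values[n / 4]};
}
```

For the input `{5, 3, 100000, 1, 4, 2, 8, 6, 7}`, the single large outlier (a stand-in for a context switch between the two timer calls) leaves the median at 5 and the interquartile range at 4, whereas a mean or a max-minus-min check would be dominated by it.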