2020-04-10 19:03:54 +00:00
|
|
|
#include <c10/core/DispatchKey.h>
|
Reland: "free up dispatch key space (in C++)" (#74963)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74963
This is a re-land of D35192346 (https://github.com/pytorch/pytorch/commit/9872a06d77582e91e834103db75f774ca75f7fff) and D35192317 (https://github.com/pytorch/pytorch/commit/a9216cde6cc57f94586ea71a75a35aaabee720ff), which together are a diff that changes the internal representation of `DispatchKeySet` in pytorch core to free up the number of dispatch keys that we have available. See a more detailed description of the design in the original PR: https://github.com/pytorch/pytorch/pull/69633.
The original PR broke Milan workflows, which use a pytorch mobile build, and manifested as a memory corruption bug inside of `liboacrmerged.so`.
**Background: Existing Mobile Optimization**
Pytorch mobile builds have an existing optimization (here https://github.com/pytorch/pytorch/blob/cc23725e89713138aa1c81ce5fb4a8dbcd440ccf/c10/core/DispatchKey.h#L382 and here https://github.com/pytorch/pytorch/blob/cc23725e89713138aa1c81ce5fb4a8dbcd440ccf/aten/src/ATen/core/dispatch/OperatorEntry.h#L214), which works as follows:
Every operator in pytorch has a "dispatch table" of function pointers, corresponding to all of the (up to 64) different kernels that we might dispatch to when we run an operator in pytorch (autograd, cpu, cuda, complex number support, etc).
In mobile builds, the size of that table is shrunk from 64 to 8 to save a bunch of space, because mobile doesn't end up using the functionality associated with most dispatch keys.
The dispatcher also has a notion of "fallback kernels", which are kernels that you can register to a particular dispatch key, but should be able to work for "any operator". The array of fallback kernels is defined here: https://github.com/pytorch/pytorch/blob/cc23725e89713138aa1c81ce5fb4a8dbcd440ccf/aten/src/ATen/core/dispatch/Dispatcher.h#L294.
The mobile-optimization currently does **not** extend to this array (it wouldn't be that useful anyway because there is only one array of fallback kernels globally - vs. there is a separate dispatch table of function pointers per operator). So the per-operator tables on mobile are size 8, while the fallback table is size 64.
**The Bug**
This PR actually makes it difficult to enable that optimization separately for the per-operator arrays vs. the fallback array, and incidentally shrunk the size of the fallback array from 64 to 8 for mobile (that happened on this line: https://github.com/pytorch/pytorch/pull/69633/files#diff-f735cd7aa68f15b624100cbc4bb3b5ea76ffc7c9d3bec3b0ccabaa09609e5319R294).
That isn't a problem by itself (since mobile doesn't actually use any of the fallbacks that can no longer be stored). However, pytorch core will still register all of those fallback kernels on startup in mobile builds, even if they aren't used. When we tried to register one of those fallbacks on startup, it would try to dump the kernel somewhere in memory past the bounds of the (now smaller) array inside of the `Dispatcher` object, `backendFallbackKernels_`.
**Why didn't this problem show up in OSS CI? Why didn't it break other internal mobile workflows aside from Milan?**
Ideally, this failure would show up as part of the OSS signal on GitHub, since we already have mobile OSS builds. Given that it was another memory corruption issue that only affected Milan (subset of mobile), I'm not sure what's specific about Milan's builds that caused it only to manifest there. dreiss I wonder if there's another flavor of mobile builds we could run in OSS CI that could potentially help catch this?
**The debugging experience was pretty difficult**
Debugging the Milan-specific failure was made difficult by the following:
(1) lack of CI
- the original Milan failure didn't surface on my original diff, because the Milan job(s) that failed weren't triggered to run on pytorch changes. There's probably a balance to strike here, since those jobs will only be useful if they aren't flaky, and if they can produce reliable failure logs for debugging.
(2) It's difficult to get a repro.
- my work laptop doesn't have the right specs to run the Milan development workflow (not enough disk space)
- There is an existing OnDemand workflow for Milan, but it appears to be relatively new, and after a bunch of help from MarcioPorto, we ran into issues forwarding the log output from Milan tests on the emulator back to the terminal (see the original discussion here: https://fb.workplace.com/groups/OnDemandFRL/permalink/1424937774645433/)
(3) Lack of stack-traces.
- Most Milan failures didn't include actionable stack traces. phding generously helped me debug by running my suggested patches locally, and reporting back if there were any failures. The failing test didn't include a stack trace though (just the line where the crash appeared), so I ended up making some educated guesses about what the issue was based on the area of the crash.
ghstack-source-id: 152688542
Test Plan: Confirmed with phding that the broken Milan workflow from the previous version of this diff is now passing.
Reviewed By: phding, albanD
Differential Revision: D35222806
fbshipit-source-id: 0ad115a0f768bc8ea5d4c203b2990254c7092d30
(cherry picked from commit 002b91966f11fd55ab3fa3801b636fa39a6dd12c)
2022-03-31 21:45:53 +00:00
|
|
|
#include <c10/core/DispatchKeySet.h>
|
2020-01-15 19:12:17 +00:00
|
|
|
|
[PrivateUse1] Support parseDispatchKey with modified PrivateUse1 (#144325)
PyTorch now support many private1 backend names like `AutogradPrivateUse1` or `QuantizedPrivateUse1`, not mentioned the original `PrivateUse1` backend.
However, users that implement `PrivateUse1` funtionalities would modified the backend name by calling `torch.utils.rename_privateuse1_backend("my_backend")`, in that case, all `PrivateUse1` backend string would not be found when we call other functions related to it. For example, we utilize `torch.library` to register some customize functions to our new backend, we would use "my_backend" as the backend name instead of "PrivateUse1", in which the error will be throw:
```
could not parse dispatch key 'my_backend'
```
So, this PR changed the function `c10::DispatchKey parseDispatchKey(const std::string& k)`, it would double check if the `PrivateUse1` has been modified, and if so, we would change `k` to adapt new backend name then find it again.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144325
Approved by: https://github.com/albanD
2025-01-14 21:21:28 +00:00
|
|
|
#include <regex>
|
2021-09-22 16:14:01 +00:00
|
|
|
#include <unordered_map>
|
|
|
|
|
|
2020-01-15 19:12:17 +00:00
|
|
|
namespace c10 {
|
|
|
|
|
|
Reland: "free up dispatch key space (in C++)" (#74963)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74963
This is a re-land of D35192346 (https://github.com/pytorch/pytorch/commit/9872a06d77582e91e834103db75f774ca75f7fff) and D35192317 (https://github.com/pytorch/pytorch/commit/a9216cde6cc57f94586ea71a75a35aaabee720ff), which together are a diff that changes the internal representation of `DispatchKeySet` in pytorch core to free up the number of dispatch keys that we have available. See a more detailed description of the design in the original PR: https://github.com/pytorch/pytorch/pull/69633.
The original PR broke Milan workflows, which use a pytorch mobile build, and manifested as a memory corruption bug inside of `liboacrmerged.so`.
**Background: Existing Mobile Optimization**
Pytorch mobile builds have an existing optimization (here https://github.com/pytorch/pytorch/blob/cc23725e89713138aa1c81ce5fb4a8dbcd440ccf/c10/core/DispatchKey.h#L382 and here https://github.com/pytorch/pytorch/blob/cc23725e89713138aa1c81ce5fb4a8dbcd440ccf/aten/src/ATen/core/dispatch/OperatorEntry.h#L214), which works as follows:
Every operator in pytorch has a "dispatch table" of function pointers, corresponding to all of the (up to 64) different kernels that we might dispatch to when we run an operator in pytorch (autograd, cpu, cuda, complex number support, etc).
In mobile builds, the size of that table is shrunk from 64 to 8 to save a bunch of space, because mobile doesn't end up using the functionality associated with most dispatch keys.
The dispatcher also has a notion of "fallback kernels", which are kernels that you can register to a particular dispatch key, but should be able to work for "any operator". The array of fallback kernels is defined here: https://github.com/pytorch/pytorch/blob/cc23725e89713138aa1c81ce5fb4a8dbcd440ccf/aten/src/ATen/core/dispatch/Dispatcher.h#L294.
The mobile-optimization currently does **not** extend to this array (it wouldn't be that useful anyway because there is only one array of fallback kernels globally - vs. there is a separate dispatch table of function pointers per operator). So the per-operator tables on mobile are size 8, while the fallback table is size 64.
**The Bug**
This PR actually makes it difficult to enable that optimization separately for the per-operator arrays vs. the fallback array, and incidentally shrunk the size of the fallback array from 64 to 8 for mobile (that happened on this line: https://github.com/pytorch/pytorch/pull/69633/files#diff-f735cd7aa68f15b624100cbc4bb3b5ea76ffc7c9d3bec3b0ccabaa09609e5319R294).
That isn't a problem by itself (since mobile doesn't actually use any of the fallbacks that can no longer be stored). However, pytorch core will still register all of those fallback kernels on startup in mobile builds, even if they aren't used. When we tried to register one of those fallbacks on startup, it would try to dump the kernel somewhere in memory past the bounds of the (now smaller) array inside of the `Dispatcher` object, `backendFallbackKernels_`.
**Why didn't this problem show up in OSS CI? Why didn't it break other internal mobile workflows aside from Milan?**
Ideally, this failure would show up as part of the OSS signal on GitHub, since we already have mobile OSS builds. Given that it was another memory corruption issue that only affected Milan (subset of mobile), I'm not sure what's specific about Milan's builds that caused it only to manifest there. dreiss I wonder if there's another flavor of mobile builds we could run in OSS CI that could potentially help catch this?
**The debugging experience was pretty difficult**
Debugging the Milan-specific failure was made difficult by the following:
(1) lack of CI
- the original Milan failure didn't surface on my original diff, because the Milan job(s) that failed weren't triggered to run on pytorch changes. There's probably a balance to strike here, since those jobs will only be useful if they aren't flaky, and if they can produce reliable failure logs for debugging.
(2) It's difficult to get a repro.
- my work laptop doesn't have the right specs to run the Milan development workflow (not enough disk space)
- There is an existing OnDemand workflow for Milan, but it appears to be relatively new, and after a bunch of help from MarcioPorto, we ran into issues forwarding the log output from Milan tests on the emulator back to the terminal (see the original discussion here: https://fb.workplace.com/groups/OnDemandFRL/permalink/1424937774645433/)
(3) Lack of stack-traces.
- Most Milan failures didn't include actionable stack traces. phding generously helped me debug by running my suggested patches locally, and reporting back if there were any failures. The failing test didn't include a stack trace though (just the line where the crash appeared), so I ended up making some educated guesses about what the issue was based on the area of the crash.
ghstack-source-id: 152688542
Test Plan: Confirmed with phding that the broken Milan workflow from the previous version of this diff is now passing.
Reviewed By: phding, albanD
Differential Revision: D35222806
fbshipit-source-id: 0ad115a0f768bc8ea5d4c203b2990254c7092d30
(cherry picked from commit 002b91966f11fd55ab3fa3801b636fa39a6dd12c)
2022-03-31 21:45:53 +00:00
|
|
|
const char* toString(BackendComponent t) {
|
|
|
|
|
switch (t) {
|
|
|
|
|
case BackendComponent::CPUBit:
|
|
|
|
|
return "CPUBit";
|
|
|
|
|
case BackendComponent::CUDABit:
|
|
|
|
|
return "CUDABit";
|
|
|
|
|
case BackendComponent::HIPBit:
|
|
|
|
|
return "HIPBit";
|
|
|
|
|
case BackendComponent::XLABit:
|
|
|
|
|
return "XLABit";
|
|
|
|
|
case BackendComponent::LazyBit:
|
|
|
|
|
return "LazyBit";
|
2022-05-31 14:26:04 +00:00
|
|
|
case BackendComponent::MetaBit:
|
|
|
|
|
return "MetaBit";
|
Reland: "free up dispatch key space (in C++)" (#74963)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74963
This is a re-land of D35192346 (https://github.com/pytorch/pytorch/commit/9872a06d77582e91e834103db75f774ca75f7fff) and D35192317 (https://github.com/pytorch/pytorch/commit/a9216cde6cc57f94586ea71a75a35aaabee720ff), which together are a diff that changes the internal representation of `DispatchKeySet` in pytorch core to free up the number of dispatch keys that we have available. See a more detailed description of the design in the original PR: https://github.com/pytorch/pytorch/pull/69633.
The original PR broke Milan workflows, which use a pytorch mobile build, and manifested as a memory corruption bug inside of `liboacrmerged.so`.
**Background: Existing Mobile Optimization**
Pytorch mobile builds have an existing optimization (here https://github.com/pytorch/pytorch/blob/cc23725e89713138aa1c81ce5fb4a8dbcd440ccf/c10/core/DispatchKey.h#L382 and here https://github.com/pytorch/pytorch/blob/cc23725e89713138aa1c81ce5fb4a8dbcd440ccf/aten/src/ATen/core/dispatch/OperatorEntry.h#L214), which works as follows:
Every operator in pytorch has a "dispatch table" of function pointers, corresponding to all of the (up to 64) different kernels that we might dispatch to when we run an operator in pytorch (autograd, cpu, cuda, complex number support, etc).
In mobile builds, the size of that table is shrunk from 64 to 8 to save a bunch of space, because mobile doesn't end up using the functionality associated with most dispatch keys.
The dispatcher also has a notion of "fallback kernels", which are kernels that you can register to a particular dispatch key, but should be able to work for "any operator". The array of fallback kernels is defined here: https://github.com/pytorch/pytorch/blob/cc23725e89713138aa1c81ce5fb4a8dbcd440ccf/aten/src/ATen/core/dispatch/Dispatcher.h#L294.
The mobile-optimization currently does **not** extend to this array (it wouldn't be that useful anyway because there is only one array of fallback kernels globally - vs. there is a separate dispatch table of function pointers per operator). So the per-operator tables on mobile are size 8, while the fallback table is size 64.
**The Bug**
This PR actually makes it difficult to enable that optimization separately for the per-operator arrays vs. the fallback array, and incidentally shrunk the size of the fallback array from 64 to 8 for mobile (that happened on this line: https://github.com/pytorch/pytorch/pull/69633/files#diff-f735cd7aa68f15b624100cbc4bb3b5ea76ffc7c9d3bec3b0ccabaa09609e5319R294).
That isn't a problem by itself (since mobile doesn't actually use any of the fallbacks that can no longer be stored). However, pytorch core will still register all of those fallback kernels on startup in mobile builds, even if they aren't used. When we tried to register one of those fallbacks on startup, it would try to dump the kernel somewhere in memory past the bounds of the (now smaller) array inside of the `Dispatcher` object, `backendFallbackKernels_`.
**Why didn't this problem show up in OSS CI? Why didn't it break other internal mobile workflows aside from Milan?**
Ideally, this failure would show up as part of the OSS signal on GitHub, since we already have mobile OSS builds. Given that it was another memory corruption issue that only affected Milan (subset of mobile), I'm not sure what's specific about Milan's builds that caused it only to manifest there. dreiss I wonder if there's another flavor of mobile builds we could run in OSS CI that could potentially help catch this?
**The debugging experience was pretty difficult**
Debugging the Milan-specific failure was made difficult by the following:
(1) lack of CI
- the original Milan failure didn't surface on my original diff, because the Milan job(s) that failed weren't triggered to run on pytorch changes. There's probably a balance to strike here, since those jobs will only be useful if they aren't flaky, and if they can produce reliable failure logs for debugging.
(2) It's difficult to get a repro.
- my work laptop doesn't have the right specs to run the Milan development workflow (not enough disk space)
- There is an existing OnDemand workflow for Milan, but it appears to be relatively new, and after a bunch of help from MarcioPorto, we ran into issues forwarding the log output from Milan tests on the emulator back to the terminal (see the original discussion here: https://fb.workplace.com/groups/OnDemandFRL/permalink/1424937774645433/)
(3) Lack of stack-traces.
- Most Milan failures didn't include actionable stack traces. phding generously helped me debug by running my suggested patches locally, and reporting back if there were any failures. The failing test didn't include a stack trace though (just the line where the crash appeared), so I ended up making some educated guesses about what the issue was based on the area of the crash.
ghstack-source-id: 152688542
Test Plan: Confirmed with phding that the broken Milan workflow from the previous version of this diff is now passing.
Reviewed By: phding, albanD
Differential Revision: D35222806
fbshipit-source-id: 0ad115a0f768bc8ea5d4c203b2990254c7092d30
(cherry picked from commit 002b91966f11fd55ab3fa3801b636fa39a6dd12c)
2022-03-31 21:45:53 +00:00
|
|
|
case BackendComponent::XPUBit:
|
|
|
|
|
return "XPUBit";
|
2022-04-07 17:18:45 +00:00
|
|
|
case BackendComponent::IPUBit:
|
|
|
|
|
return "IPUBit";
|
2022-04-27 19:21:57 +00:00
|
|
|
case BackendComponent::MPSBit:
|
|
|
|
|
return "MPSBit";
|
Reland: "free up dispatch key space (in C++)" (#74963)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74963
This is a re-land of D35192346 (https://github.com/pytorch/pytorch/commit/9872a06d77582e91e834103db75f774ca75f7fff) and D35192317 (https://github.com/pytorch/pytorch/commit/a9216cde6cc57f94586ea71a75a35aaabee720ff), which together are a diff that changes the internal representation of `DispatchKeySet` in pytorch core to free up the number of dispatch keys that we have available. See a more detailed description of the design in the original PR: https://github.com/pytorch/pytorch/pull/69633.
The original PR broke Milan workflows, which use a pytorch mobile build, and manifested as a memory corruption bug inside of `liboacrmerged.so`.
**Background: Existing Mobile Optimization**
Pytorch mobile builds have an existing optimization (here https://github.com/pytorch/pytorch/blob/cc23725e89713138aa1c81ce5fb4a8dbcd440ccf/c10/core/DispatchKey.h#L382 and here https://github.com/pytorch/pytorch/blob/cc23725e89713138aa1c81ce5fb4a8dbcd440ccf/aten/src/ATen/core/dispatch/OperatorEntry.h#L214), which works as follows:
Every operator in pytorch has a "dispatch table" of function pointers, corresponding to all of the (up to 64) different kernels that we might dispatch to when we run an operator in pytorch (autograd, cpu, cuda, complex number support, etc).
In mobile builds, the size of that table is shrunk from 64 to 8 to save a bunch of space, because mobile doesn't end up using the functionality associated with most dispatch keys.
The dispatcher also has a notion of "fallback kernels", which are kernels that you can register to a particular dispatch key, but should be able to work for "any operator". The array of fallback kernels is defined here: https://github.com/pytorch/pytorch/blob/cc23725e89713138aa1c81ce5fb4a8dbcd440ccf/aten/src/ATen/core/dispatch/Dispatcher.h#L294.
The mobile-optimization currently does **not** extend to this array (it wouldn't be that useful anyway because there is only one array of fallback kernels globally - vs. there is a separate dispatch table of function pointers per operator). So the per-operator tables on mobile are size 8, while the fallback table is size 64.
**The Bug**
This PR actually makes it difficult to enable that optimization separately for the per-operator arrays vs. the fallback array, and incidentally shrunk the size of the fallback array from 64 to 8 for mobile (that happened on this line: https://github.com/pytorch/pytorch/pull/69633/files#diff-f735cd7aa68f15b624100cbc4bb3b5ea76ffc7c9d3bec3b0ccabaa09609e5319R294).
That isn't a problem by itself (since mobile doesn't actually use any of the fallbacks that can no longer be stored). However, pytorch core will still register all of those fallback kernels on startup in mobile builds, even if they aren't used. When we tried to register one of those fallbacks on startup, it would try to dump the kernel somewhere in memory past the bounds of the (now smaller) array inside of the `Dispatcher` object, `backendFallbackKernels_`.
**Why didn't this problem show up in OSS CI? Why didn't it break other internal mobile workflows aside from Milan?**
Ideally, this failure would show up as part of the OSS signal on GitHub, since we already have mobile OSS builds. Given that it was another memory corruption issue that only affected Milan (subset of mobile), I'm not sure what's specific about Milan's builds that caused it only to manifest there. dreiss I wonder if there's another flavor of mobile builds we could run in OSS CI that could potentially help catch this?
**The debugging experience was pretty difficult**
Debugging the Milan-specific failure was made difficult by the following:
(1) lack of CI
- the original Milan failure didn't surface on my original diff, because the Milan job(s) that failed weren't triggered to run on pytorch changes. There's probably a balance to strike here, since those jobs will only be useful if they aren't flaky, and if they can produce reliable failure logs for debugging.
(2) It's difficult to get a repro.
- my work laptop doesn't have the right specs to run the Milan development workflow (not enough disk space)
- There is an existing OnDemand workflow for Milan, but it appears to be relatively new, and after a bunch of help from MarcioPorto, we ran into issues forwarding the log output from Milan tests on the emulator back to the terminal (see the original discussion here: https://fb.workplace.com/groups/OnDemandFRL/permalink/1424937774645433/)
(3) Lack of stack-traces.
- Most Milan failures didn't include actionable stack traces. phding generously helped me debug by running my suggested patches locally, and reporting back if there were any failures. The failing test didn't include a stack trace though (just the line where the crash appeared), so I ended up making some educated guesses about what the issue was based on the area of the crash.
ghstack-source-id: 152688542
Test Plan: Confirmed with phding that the broken Milan workflow from the previous version of this diff is now passing.
Reviewed By: phding, albanD
Differential Revision: D35222806
fbshipit-source-id: 0ad115a0f768bc8ea5d4c203b2990254c7092d30
(cherry picked from commit 002b91966f11fd55ab3fa3801b636fa39a6dd12c)
2022-03-31 21:45:53 +00:00
|
|
|
case BackendComponent::HPUBit:
|
|
|
|
|
return "HPUBit";
|
|
|
|
|
case BackendComponent::VEBit:
|
|
|
|
|
return "VEBit";
|
2023-01-16 12:20:23 +00:00
|
|
|
case BackendComponent::MTIABit:
|
|
|
|
|
return "MTIA";
|
Reland: "free up dispatch key space (in C++)" (#74963)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74963
This is a re-land of D35192346 (https://github.com/pytorch/pytorch/commit/9872a06d77582e91e834103db75f774ca75f7fff) and D35192317 (https://github.com/pytorch/pytorch/commit/a9216cde6cc57f94586ea71a75a35aaabee720ff), which together are a diff that changes the internal representation of `DispatchKeySet` in pytorch core to free up the number of dispatch keys that we have available. See a more detailed description of the design in the original PR: https://github.com/pytorch/pytorch/pull/69633.
The original PR broke Milan workflows, which use a pytorch mobile build, and manifested as a memory corruption bug inside of `liboacrmerged.so`.
**Background: Existing Mobile Optimization**
Pytorch mobile builds have an existing optimization (here https://github.com/pytorch/pytorch/blob/cc23725e89713138aa1c81ce5fb4a8dbcd440ccf/c10/core/DispatchKey.h#L382 and here https://github.com/pytorch/pytorch/blob/cc23725e89713138aa1c81ce5fb4a8dbcd440ccf/aten/src/ATen/core/dispatch/OperatorEntry.h#L214), which works as follows:
Every operator in pytorch has a "dispatch table" of function pointers, corresponding to all of the (up to 64) different kernels that we might dispatch to when we run an operator in pytorch (autograd, cpu, cuda, complex number support, etc).
In mobile builds, the size of that table is shrunk from 64 to 8 to save a bunch of space, because mobile doesn't end up using the functionality associated with most dispatch keys.
The dispatcher also has a notion of "fallback kernels", which are kernels that you can register to a particular dispatch key, but should be able to work for "any operator". The array of fallback kernels is defined here: https://github.com/pytorch/pytorch/blob/cc23725e89713138aa1c81ce5fb4a8dbcd440ccf/aten/src/ATen/core/dispatch/Dispatcher.h#L294.
The mobile-optimization currently does **not** extend to this array (it wouldn't be that useful anyway because there is only one array of fallback kernels globally - vs. there is a separate dispatch table of function pointers per operator). So the per-operator tables on mobile are size 8, while the fallback table is size 64.
**The Bug**
This PR actually makes it difficult to enable that optimization separately for the per-operator arrays vs. the fallback array, and incidentally shrunk the size of the fallback array from 64 to 8 for mobile (that happened on this line: https://github.com/pytorch/pytorch/pull/69633/files#diff-f735cd7aa68f15b624100cbc4bb3b5ea76ffc7c9d3bec3b0ccabaa09609e5319R294).
That isn't a problem by itself (since mobile doesn't actually use any of the fallbacks that can no longer be stored). However, pytorch core will still register all of those fallback kernels on startup in mobile builds, even if they aren't used. When we tried to register one of those fallbacks on startup, it would try to dump the kernel somewhere in memory past the bounds of the (now smaller) array inside of the `Dispatcher` object, `backendFallbackKernels_`.
**Why didn't this problem show up in OSS CI? Why didn't it break other internal mobile workflows aside from Milan?**
Ideally, this failure would show up as part of the OSS signal on GitHub, since we already have mobile OSS builds. Given that it was another memory corruption issue that only affected Milan (subset of mobile), I'm not sure what's specific about Milan's builds that caused it only to manifest there. dreiss I wonder if there's another flavor of mobile builds we could run in OSS CI that could potentially help catch this?
**The debugging experience was pretty difficult**
Debugging the Milan-specific failure was made difficult by the following:
(1) lack of CI
- the original Milan failure didn't surface on my original diff, because the Milan job(s) that failed weren't triggered to run on pytorch changes. There's probably a balance to strike here, since those jobs will only be useful if they aren't flaky, and if they can produce reliable failure logs for debugging.
(2) It's difficult to get a repro.
- my work laptop doesn't have the right specs to run the Milan development workflow (not enough disk space)
- There is an existing OnDemand workflow for Milan, but it appears to be relatively new, and after a bunch of help from MarcioPorto, we ran into issues forwarding the log output from Milan tests on the emulator back to the terminal (see the original discussion here: https://fb.workplace.com/groups/OnDemandFRL/permalink/1424937774645433/)
(3) Lack of stack-traces.
- Most Milan failures didn't include actionable stack traces. phding generously helped me debug by running my suggested patches locally, and reporting back if there were any failures. The failing test didn't include a stack trace though (just the line where the crash appeared), so I ended up making some educated guesses about what the issue was based on the area of the crash.
ghstack-source-id: 152688542
Test Plan: Confirmed with phding that the broken Milan workflow from the previous version of this diff is now passing.
Reviewed By: phding, albanD
Differential Revision: D35222806
fbshipit-source-id: 0ad115a0f768bc8ea5d4c203b2990254c7092d30
(cherry picked from commit 002b91966f11fd55ab3fa3801b636fa39a6dd12c)
2022-03-31 21:45:53 +00:00
|
|
|
case BackendComponent::PrivateUse1Bit:
|
|
|
|
|
return "PrivateUse1Bit";
|
|
|
|
|
case BackendComponent::PrivateUse2Bit:
|
|
|
|
|
return "PrivateUse2Bit";
|
|
|
|
|
case BackendComponent::PrivateUse3Bit:
|
|
|
|
|
return "PrivateUse3Bit";
|
|
|
|
|
case BackendComponent::InvalidBit:
|
|
|
|
|
return "InvalidBit";
|
|
|
|
|
default:
|
|
|
|
|
return "UNKNOWN_BACKEND_BIT";
|
|
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
|
2022-08-01 18:02:10 +00:00
|
|
|
BackendComponent toBackendComponent(DeviceType device_type) {
|
|
|
|
|
switch (device_type) {
|
|
|
|
|
#define DO_CASE(device, _) \
|
|
|
|
|
case DeviceType::device: { \
|
|
|
|
|
return toBackendComponent(DispatchKey::device); \
|
|
|
|
|
}
|
|
|
|
|
C10_FORALL_BACKEND_DEVICE_TYPES(DO_CASE, unused)
|
|
|
|
|
#undef DO_CASE
|
|
|
|
|
default:
|
|
|
|
|
return BackendComponent::InvalidBit;
|
|
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
|
2020-01-15 19:12:17 +00:00
|
|
|
const char* toString(DispatchKey t) {
|
|
|
|
|
switch (t) {
|
2020-01-30 19:45:30 +00:00
|
|
|
case DispatchKey::Undefined:
|
|
|
|
|
return "Undefined";
|
2022-07-20 18:10:50 +00:00
|
|
|
|
|
|
|
|
case DispatchKey::Dense:
|
|
|
|
|
return "Dense";
|
2020-06-03 14:26:12 +00:00
|
|
|
case DispatchKey::FPGA:
|
|
|
|
|
return "FPGA";
|
2024-04-23 00:33:20 +00:00
|
|
|
case DispatchKey::MAIA:
|
|
|
|
|
return "MAIA";
|
2022-07-20 18:10:50 +00:00
|
|
|
case DispatchKey::Vulkan:
|
|
|
|
|
return "Vulkan";
|
|
|
|
|
case DispatchKey::Metal:
|
|
|
|
|
return "Metal";
|
|
|
|
|
|
2021-07-27 05:59:10 +00:00
|
|
|
case DispatchKey::Lazy:
|
|
|
|
|
return "Lazy";
|
2022-04-27 19:21:57 +00:00
|
|
|
case DispatchKey::MPS:
|
|
|
|
|
return "MPS";
|
2021-05-12 20:03:55 +00:00
|
|
|
case DispatchKey::HPU:
|
|
|
|
|
return "HPU";
|
2023-01-16 12:20:23 +00:00
|
|
|
case DispatchKey::MTIA:
|
|
|
|
|
return "MTIA";
|
2020-08-08 05:37:37 +00:00
|
|
|
|
2022-07-20 18:10:50 +00:00
|
|
|
case DispatchKey::Quantized:
|
|
|
|
|
return "Quantized";
|
2024-11-18 05:09:56 +00:00
|
|
|
case DispatchKey::QuantizedPrivateUse1:
|
|
|
|
|
return "QuantizedPrivateUse1";
|
2020-08-08 05:37:37 +00:00
|
|
|
case DispatchKey::CustomRNGKeyId:
|
|
|
|
|
return "CustomRNGKeyId";
|
|
|
|
|
case DispatchKey::MkldnnCPU:
|
|
|
|
|
return "MkldnnCPU";
|
2022-07-20 18:10:50 +00:00
|
|
|
|
|
|
|
|
case DispatchKey::Sparse:
|
|
|
|
|
return "Sparse";
|
2024-03-01 08:06:21 +00:00
|
|
|
|
|
|
|
|
case DispatchKey::SparseCsr:
|
|
|
|
|
return "SparseCsr";
|
2020-08-08 05:37:37 +00:00
|
|
|
|
2020-11-03 05:34:01 +00:00
|
|
|
case DispatchKey::NestedTensor:
|
|
|
|
|
return "NestedTensor";
|
2022-07-20 18:10:50 +00:00
|
|
|
|
|
|
|
|
case DispatchKey::BackendSelect:
|
|
|
|
|
return "BackendSelect";
|
2020-11-03 05:34:01 +00:00
|
|
|
|
Dispatch to Python via __torch_dispatch__ (#59760)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59760
See https://github.com/pytorch/pytorch/issues/59049
There are some moving parts to this PR, I'll structure this explanation so the straightforward parts go first, and then the less straightforward parts.
**The actual dispatch to Python.** The core logic of dispatch to Python lives in `concrete_dispatch_fn` in `torch/csrc/autograd/python_variable.cpp`. It takes the input IValue stack, scans all the arguments for Tensor arguments, and defers most of the heavy lifting to `handle_torch_function_no_python_arg_parser` which actually does all of the logic for calling out to torch dispatch (in particular, this function handles multiple dispatch situations for you). Because we have a different function name than regular `__torch_function__` handling, `handle_torch_function_no_python_arg_parser` is generalized to accept a magic method name to look for when testing if Tensors have custom handling or not. Unlike `__torch_function__`, by default there is no `__torch_dispatch__` on Tensor classes.
**Maintaining the Python dispatch key.** In order to get to the dispatch to Python logic, we must tag Tensors with the `__torch_dispatch__` magic method with the newly added Python dispatch key (separated from PythonFuncTorch to allow for a transitional period while they migrate to this mechanism). We expose a new private property `_is_python_dispatch` that assists in debugging if a Tensor is participating in Python dispatch or not. We apply the Python dispatch key the first time a PyObject for a Tensor is constructed (THPVariable_NewWithVar), testing if `__torch_dispatch__` exists with then newly added `check_has_torch_dispatch`.
**Shallow copy and detach.** For the simple examples tested in this PR, most creations of Tensor route through the dispatcher. The exception to this is `shallow_copy_and_detach`, which bypasses the dispatcher and is used when saving tensors for backwards. When a Tensor is Python dispatch, we override the behavior of `shallow_copy_and_detach` to instead directly call into `__torch_dispatch__` to perform a `detach` operation (in the same way it would be invoked if you called `detach` directly). Because this Python call is triggered directly from c10::TensorImpl, it must be indirected through `PyInterpreter::detach`, which is the general mechanism for dynamic dispatching to the Python interpreter associated with a TensorImpl.
**torchdeploy compatibility.** The dispatch to Python logic cannot be directly registered to the dispatcher as it is compiled in the Python library, which will get loaded multiple times per torchdeploy interpreter. Thus, we must employ a two phase process. First, we register a fallback inside a non-Python library (aten/src/ATen/core/PythonFallbackKernel.cpp). Its job is to determine the appropriate PyInterpreter to handle the Python dispatch by going through all of the arguments and finding the first argument that has a PyObject/PyInterpreter. With this PyInterpreter, it makes another dynamic dispatch via "dispatch" which will go to the correct torchdeploy interpreter to handle dispatching to actual Python.
**Testing.** We provide a simple example of a LoggingTensor for testing, which can be used to generate TorchScript-like traces to observe what operations are being called when a Tensor is invoked. Although a LoggingTensor would be better implemented via an is-a relationship rather than a has-a relationship (as is done in the test), we've done it this way to show that arbitrarily complex compositions of tensors inside a tensor work properly.
**Known limitations.**
* We haven't adjusted any operator code, so some patterns may not work (as they lose the Python subclass in an unrecoverable way)
* `__torch_function__` must be explicitly disabled with `_disabled_torch_function_impl` otherwise things don't work quite correctly (in particular, what is being disabled is default subclass preservation behavior.)
* We don't ever populate kwargs, even when an argument is kwarg-only
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Differential Revision:
D29017912
D29017912
Test Plan: Imported from OSS
Reviewed By: bdhirsh
Pulled By: ezyang
fbshipit-source-id: a67714d9e541d09203a8cfc85345b8967db86238
2021-06-25 18:49:20 +00:00
|
|
|
case DispatchKey::Python:
|
|
|
|
|
return "Python";
|
|
|
|
|
|
2022-07-20 18:10:50 +00:00
|
|
|
case DispatchKey::Fake:
|
|
|
|
|
return "Fake";
|
|
|
|
|
case DispatchKey::FuncTorchDynamicLayerBackMode:
|
|
|
|
|
return "FuncTorchDynamicLayerBackMode";
|
|
|
|
|
|
|
|
|
|
case DispatchKey::Functionalize:
|
|
|
|
|
return "Functionalize";
|
|
|
|
|
|
|
|
|
|
case DispatchKey::Named:
|
|
|
|
|
return "Named";
|
2020-08-08 05:37:37 +00:00
|
|
|
|
Conjugate View (#54987)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54987
Based off of ezyang (https://github.com/pytorch/pytorch/pull/44799) and bdhirsh (https://github.com/pytorch/pytorch/pull/43702) 's prototype:
Here's a summary of the changes in this PR:
This PR adds a new dispatch key called Conjugate. This enables us to make conjugate operation a view and leverage the specialized library functions that fast path with the hermitian operation (conj + transpose).
1. Conjugate operation will now return a view with conj bit (1) for complex tensors and returns self for non-complex tensors as before. This also means `torch.view_as_real` will no longer be a view on conjugated complex tensors and is hence disabled. To fill the gap, we have added `torch.view_as_real_physical` which would return the real tensor agnostic of the conjugate bit on the input complex tensor. The information about conjugation on the old tensor can be obtained by calling `.is_conj()` on the new tensor.
2. NEW API:
a) `.conj()` -- now returning a view.
b) `.conj_physical()` -- does the physical conjugate operation. If the conj bit for input was set, you'd get `self.clone()`, else you'll get a new tensor with conjugated value in its memory.
c) `.conj_physical_()`, and `out=` variant
d) `.resolve_conj()` -- materializes the conjugation. returns self if the conj bit is unset, else returns a new tensor with conjugated values and conj bit set to 0.
e) `.resolve_conj_()` in-place version of (d)
f) `view_as_real_physical` -- as described in (1), it's functionally same as `view_as_real`, just that it doesn't error out on conjugated tensors.
g) `view_as_real` -- existing function, but now errors out on conjugated tensors.
3. Conjugate Fallback
a) Vast majority of PyTorch functions would currently use this fallback when they are called on a conjugated tensor.
b) This fallback is well equipped to handle the following cases:
- functional operation e.g., `torch.sin(input)`
- Mutable inputs and in-place operations e.g., `tensor.add_(2)`
- out-of-place operation e.g., `torch.sin(input, out=out)`
- Tensorlist input args
- NOTE: Meta tensors don't work with conjugate fallback.
4. Autograd
a) `resolve_conj()` is an identity function w.r.t. autograd
b) Everything else works as expected.
5. Testing:
a) All method_tests run with conjugate view tensors.
b) OpInfo tests that run with conjugate views
- test_variant_consistency_eager/jit
- gradcheck, gradgradcheck
- test_conj_views (that only run for `torch.cfloat` dtype)
NOTE: functions like `empty_like`, `zero_like`, `randn_like`, `clone` don't propagate the conjugate bit.
Follow up work:
1. conjugate view RFC
2. Add neg bit to re-enable view operation on conjugated tensors
3. Update linalg functions to call into specialized functions that fast path with the hermitian operation.
Test Plan: Imported from OSS
Reviewed By: VitalyFedyunin
Differential Revision: D28227315
Pulled By: anjali411
fbshipit-source-id: acab9402b9d6a970c6d512809b627a290c8def5f
2021-06-04 21:11:23 +00:00
|
|
|
case DispatchKey::Conjugate:
|
|
|
|
|
return "Conjugate";
|
2022-07-20 18:10:50 +00:00
|
|
|
case DispatchKey::Negative:
|
|
|
|
|
return "Negative";
|
|
|
|
|
case DispatchKey::ZeroTensor:
|
|
|
|
|
return "ZeroTensor";
|
2020-08-08 05:37:37 +00:00
|
|
|
|
2021-05-02 05:55:12 +00:00
|
|
|
case DispatchKey::ADInplaceOrView:
|
|
|
|
|
return "ADInplaceOrView";
|
2021-03-15 21:39:12 +00:00
|
|
|
|
2020-09-01 16:03:45 +00:00
|
|
|
case DispatchKey::AutogradOther:
|
|
|
|
|
return "AutogradOther";
|
2022-07-20 18:10:50 +00:00
|
|
|
case DispatchKey::AutogradFunctionality:
|
|
|
|
|
return "AutogradFunctionality";
|
Reland: "free up dispatch key space (in C++)" (#74963)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74963
This is a re-land of D35192346 (https://github.com/pytorch/pytorch/commit/9872a06d77582e91e834103db75f774ca75f7fff) and D35192317 (https://github.com/pytorch/pytorch/commit/a9216cde6cc57f94586ea71a75a35aaabee720ff), which together are a diff that changes the internal representation of `DispatchKeySet` in pytorch core to free up the number of dispatch keys that we have available. See a more detailed description of the design in the original PR: https://github.com/pytorch/pytorch/pull/69633.
The original PR broke Milan workflows, which use a pytorch mobile build, and manifested as a memory corruption bug inside of `liboacrmerged.so`.
**Background: Existing Mobile Optimization**
Pytorch mobile builds have an existing optimization (here https://github.com/pytorch/pytorch/blob/cc23725e89713138aa1c81ce5fb4a8dbcd440ccf/c10/core/DispatchKey.h#L382 and here https://github.com/pytorch/pytorch/blob/cc23725e89713138aa1c81ce5fb4a8dbcd440ccf/aten/src/ATen/core/dispatch/OperatorEntry.h#L214), which works as follows:
Every operator in pytorch has a "dispatch table" of function pointers, corresponding to all of the (up to 64) different kernels that we might dispatch to when we run an operator in pytorch (autograd, cpu, cuda, complex number support, etc).
In mobile builds, the size of that table is shrunk from 64 to 8 to save a bunch of space, because mobile doesn't end up using the functionality associated with most dispatch keys.
The dispatcher also has a notion of "fallback kernels", which are kernels that you can register to a particular dispatch key, but should be able to work for "any operator". The array of fallback kernels is defined here: https://github.com/pytorch/pytorch/blob/cc23725e89713138aa1c81ce5fb4a8dbcd440ccf/aten/src/ATen/core/dispatch/Dispatcher.h#L294.
The mobile-optimization currently does **not** extend to this array (it wouldn't be that useful anyway because there is only one array of fallback kernels globally - vs. there is a separate dispatch table of function pointers per operator). So the per-operator tables on mobile are size 8, while the fallback table is size 64.
**The Bug**
This PR actually makes it difficult to enable that optimization separately for the per-operator arrays vs. the fallback array, and incidentally shrunk the size of the fallback array from 64 to 8 for mobile (that happened on this line: https://github.com/pytorch/pytorch/pull/69633/files#diff-f735cd7aa68f15b624100cbc4bb3b5ea76ffc7c9d3bec3b0ccabaa09609e5319R294).
That isn't a problem by itself (since mobile doesn't actually use any of the fallbacks that can no longer be stored). However, pytorch core will still register all of those fallback kernels on startup in mobile builds, even if they aren't used. When we tried to register one of those fallbacks on startup, it would try to dump the kernel somewhere in memory past the bounds of the (now smaller) array inside of the `Dispatcher` object, `backendFallbackKernels_`.
**Why didn't this problem show up in OSS CI? Why didn't it break other internal mobile workflows aside from Milan?**
Ideally, this failure would show up as part of the OSS signal on GitHub, since we already have mobile OSS builds. Given that it was another memory corruption issue that only affected Milan (subset of mobile), I'm not sure what's specific about Milan's builds that caused it only to manifest there. dreiss I wonder if there's another flavor of mobile builds we could run in OSS CI that could potentially help catch this?
**The debugging experience was pretty difficult**
Debugging the Milan-specific failure was made difficult by the following:
(1) lack of CI
- the original Milan failure didn't surface on my original diff, because the Milan job(s) that failed weren't triggered to run on pytorch changes. There's probably a balance to strike here, since those jobs will only be useful if they aren't flaky, and if they can produce reliable failure logs for debugging.
(2) It's difficult to get a repro.
- my work laptop doesn't have the right specs to run the Milan development workflow (not enough disk space)
- There is an existing OnDemand workflow for Milan, but it appears to be relatively new, and after a bunch of help from MarcioPorto, we ran into issues forwarding the log output from Milan tests on the emulator back to the terminal (see the original discussion here: https://fb.workplace.com/groups/OnDemandFRL/permalink/1424937774645433/)
(3) Lack of stack-traces.
- Most Milan failures didn't include actionable stack traces. phding generously helped me debug by running my suggested patches locally, and reporting back if there were any failures. The failing test didn't include a stack trace though (just the line where the crash appeared), so I ended up making some educated guesses about what the issue was based on the area of the crash.
ghstack-source-id: 152688542
Test Plan: Confirmed with phding that the broken Milan workflow from the previous version of this diff is now passing.
Reviewed By: phding, albanD
Differential Revision: D35222806
fbshipit-source-id: 0ad115a0f768bc8ea5d4c203b2990254c7092d30
(cherry picked from commit 002b91966f11fd55ab3fa3801b636fa39a6dd12c)
2022-03-31 21:45:53 +00:00
|
|
|
case DispatchKey::AutogradNestedTensor:
|
|
|
|
|
return "AutogradNestedTensor";
|
2021-12-08 18:34:08 +00:00
|
|
|
|
[pytorch] move tracing logic to a separate dispatch backend (#38467)
Summary:
This PR moves tracing logic out of the generated VariableType kernels, to associate it with a new dedicated dispatch key Tracer.
It also toggles the dispatch key set at various places to keep the semantics unchanged - see the inline [Tracing Mode Switches] note.
Sample generated code:
```
Tensor & __ilshift___Tensor(Tensor & self, const Tensor & other) {
#if !defined(PYTORCH_DISABLE_TRACING)
torch::jit::Node* node = nullptr;
std::shared_ptr<jit::tracer::TracingState> tracer_state;
if (jit::tracer::isTracing()) {
tracer_state = jit::tracer::getTracingState();
at::Symbol op_name;
op_name = jit::Symbol::fromQualString("aten::__ilshift__");
node = tracer_state->graph->create(op_name, /*num_outputs=*/0);
jit::tracer::recordSourceLocation(node);
jit::tracer::addInputs(node, "self", self);
jit::tracer::addInputs(node, "other", other);
tracer_state->graph->insertNode(node);
jit::tracer::setTracingState(nullptr);
}
#endif
static auto op = c10::Dispatcher::singleton().findSchemaOrThrow("aten::__ilshift__", "Tensor");
c10::Dispatcher::singleton().redispatch<Tensor &, Tensor &, const Tensor &>(op, c10::DispatchKey::Tracer, self, other);
#if !defined(PYTORCH_DISABLE_TRACING)
if (tracer_state) {
jit::tracer::setTracingState(std::move(tracer_state));
jit::tracer::addOutput(node, self);
}
#endif
return self;
}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38467
ghstack-source-id: 105215150
Test Plan: CI
Differential Revision: D21570684
fbshipit-source-id: 1a96761830307f9a934f38bfb9fe8b5b1763e0e0
2020-06-04 08:49:27 +00:00
|
|
|
case DispatchKey::Tracer:
|
|
|
|
|
return "Tracer";
|
2020-08-08 05:37:37 +00:00
|
|
|
|
2022-01-12 20:05:12 +00:00
|
|
|
case DispatchKey::AutocastCPU:
|
|
|
|
|
return "AutocastCPU";
|
2025-01-25 03:24:57 +00:00
|
|
|
case DispatchKey::AutocastMTIA:
|
|
|
|
|
return "AutocastMTIA";
|
2022-04-19 21:18:23 +00:00
|
|
|
case DispatchKey::AutocastXPU:
|
|
|
|
|
return "AutocastXPU";
|
2023-06-22 15:38:37 +00:00
|
|
|
case DispatchKey::AutocastIPU:
|
|
|
|
|
return "AutocastIPU";
|
2022-10-12 19:37:13 +00:00
|
|
|
case DispatchKey::AutocastHPU:
|
|
|
|
|
return "AutocastHPU";
|
2022-07-20 18:10:50 +00:00
|
|
|
case DispatchKey::AutocastCUDA:
|
|
|
|
|
return "AutocastCUDA";
|
2023-06-23 19:46:42 +00:00
|
|
|
case DispatchKey::AutocastXLA:
|
|
|
|
|
return "AutocastXLA";
|
2023-03-20 20:27:35 +00:00
|
|
|
case DispatchKey::AutocastPrivateUse1:
|
|
|
|
|
return "AutocastPrivateUse1";
|
2024-09-05 23:23:15 +00:00
|
|
|
case DispatchKey::AutocastMPS:
|
|
|
|
|
return "AutocastMPS";
|
2022-07-20 18:10:50 +00:00
|
|
|
|
|
|
|
|
case DispatchKey::FuncTorchBatched:
|
|
|
|
|
return "FuncTorchBatched";
|
2023-09-06 17:49:14 +00:00
|
|
|
case DispatchKey::BatchedNestedTensor:
|
|
|
|
|
return "BatchedNestedTensor";
|
2022-07-20 18:10:50 +00:00
|
|
|
case DispatchKey::FuncTorchVmapMode:
|
|
|
|
|
return "FuncTorchVmapMode";
|
2022-04-19 21:18:23 +00:00
|
|
|
|
2020-08-08 05:37:37 +00:00
|
|
|
case DispatchKey::Batched:
|
|
|
|
|
return "Batched";
|
2020-08-11 14:17:41 +00:00
|
|
|
case DispatchKey::VmapMode:
|
|
|
|
|
return "VmapMode";
|
|
|
|
|
|
2022-07-20 18:10:50 +00:00
|
|
|
case DispatchKey::FuncTorchGradWrapper:
|
|
|
|
|
return "FuncTorchGradWrapper";
|
2020-08-08 05:37:37 +00:00
|
|
|
|
2022-07-20 18:10:50 +00:00
|
|
|
case DispatchKey::DeferredInit:
|
|
|
|
|
return "DeferredInit";
|
|
|
|
|
case DispatchKey::PythonTLSSnapshot:
|
|
|
|
|
return "PythonTLSSnapshot";
|
2020-08-08 05:37:37 +00:00
|
|
|
|
2021-04-27 15:59:41 +00:00
|
|
|
// Note [Out-of-tree vmap+grad prototype]
|
|
|
|
|
// The following keys are used in the implementation of the out-of-tree
|
|
|
|
|
// composable functions transforms (vmap+grad) prototype that lives at
|
|
|
|
|
// https://github.com/zou3519/functorch
|
|
|
|
|
// We plan on eventually upstreaming the prototype into core, at which
|
|
|
|
|
// point it will have a different design that should use fewer keys.
|
|
|
|
|
case DispatchKey::FuncTorchDynamicLayerFrontMode:
|
|
|
|
|
return "FuncTorchDynamicLayerFrontMode";
|
|
|
|
|
|
2022-07-20 18:10:50 +00:00
|
|
|
case DispatchKey::TESTING_ONLY_GenericWrapper:
|
|
|
|
|
return "TESTING_ONLY_GenericWrapper";
|
2022-04-27 18:48:44 +00:00
|
|
|
|
2022-07-20 18:10:50 +00:00
|
|
|
case DispatchKey::TESTING_ONLY_GenericMode:
|
|
|
|
|
return "TESTING_ONLY_GenericMode";
|
|
|
|
|
|
2023-06-21 00:43:00 +00:00
|
|
|
case DispatchKey::PreDispatch:
|
|
|
|
|
return "PreDispatch";
|
|
|
|
|
|
2022-09-15 00:43:36 +00:00
|
|
|
case DispatchKey::PythonDispatcher:
|
|
|
|
|
return "PythonDispatcher";
|
|
|
|
|
|
2022-07-20 18:10:50 +00:00
|
|
|
// Aliases
|
|
|
|
|
|
|
|
|
|
case DispatchKey::Autograd:
|
|
|
|
|
return "Autograd";
|
|
|
|
|
case DispatchKey::CompositeImplicitAutograd:
|
|
|
|
|
return "CompositeImplicitAutograd";
|
2022-09-01 20:01:39 +00:00
|
|
|
case DispatchKey::CompositeImplicitAutogradNestedTensor:
|
|
|
|
|
return "CompositeImplicitAutogradNestedTensor";
|
2022-07-20 18:10:50 +00:00
|
|
|
case DispatchKey::CompositeExplicitAutograd:
|
|
|
|
|
return "CompositeExplicitAutograd";
|
|
|
|
|
case DispatchKey::CompositeExplicitAutogradNonFunctional:
|
|
|
|
|
return "CompositeExplicitAutogradNonFunctional";
|
2022-12-02 04:38:25 +00:00
|
|
|
case DispatchKey::FuncTorchBatchedDecomposition:
|
|
|
|
|
return "FuncTorchBatchedDecomposition";
|
2022-07-20 18:10:50 +00:00
|
|
|
|
|
|
|
|
// Per-backend dispatch keys
|
Reland: "free up dispatch key space (in C++)" (#74963)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74963
This is a re-land of D35192346 (https://github.com/pytorch/pytorch/commit/9872a06d77582e91e834103db75f774ca75f7fff) and D35192317 (https://github.com/pytorch/pytorch/commit/a9216cde6cc57f94586ea71a75a35aaabee720ff), which together are a diff that changes the internal representation of `DispatchKeySet` in pytorch core to free up the number of dispatch keys that we have available. See a more detailed description of the design in the original PR: https://github.com/pytorch/pytorch/pull/69633.
The original PR broke Milan workflows, which use a pytorch mobile build, and manifested as a memory corruption bug inside of `liboacrmerged.so`.
**Background: Existing Mobile Optimization**
Pytorch mobile builds have an existing optimization (here https://github.com/pytorch/pytorch/blob/cc23725e89713138aa1c81ce5fb4a8dbcd440ccf/c10/core/DispatchKey.h#L382 and here https://github.com/pytorch/pytorch/blob/cc23725e89713138aa1c81ce5fb4a8dbcd440ccf/aten/src/ATen/core/dispatch/OperatorEntry.h#L214), which works as follows:
Every operator in pytorch has a "dispatch table" of function pointers, corresponding to all of the (up to 64) different kernels that we might dispatch to when we run an operator in pytorch (autograd, cpu, cuda, complex number support, etc).
In mobile builds, the size of that table is shrunk from 64 to 8 to save a bunch of space, because mobile doesn't end up using the functionality associated with most dispatch keys.
The dispatcher also has a notion of "fallback kernels", which are kernels that you can register to a particular dispatch key, but should be able to work for "any operator". The array of fallback kernels is defined here: https://github.com/pytorch/pytorch/blob/cc23725e89713138aa1c81ce5fb4a8dbcd440ccf/aten/src/ATen/core/dispatch/Dispatcher.h#L294.
The mobile-optimization currently does **not** extend to this array (it wouldn't be that useful anyway because there is only one array of fallback kernels globally - vs. there is a separate dispatch table of function pointers per operator). So the per-operator tables on mobile are size 8, while the fallback table is size 64.
**The Bug**
This PR actually makes it difficult to enable that optimization separately for the per-operator arrays vs. the fallback array, and incidentally shrunk the size of the fallback array from 64 to 8 for mobile (that happened on this line: https://github.com/pytorch/pytorch/pull/69633/files#diff-f735cd7aa68f15b624100cbc4bb3b5ea76ffc7c9d3bec3b0ccabaa09609e5319R294).
That isn't a problem by itself (since mobile doesn't actually use any of the fallbacks that can no longer be stored). However, pytorch core will still register all of those fallback kernels on startup in mobile builds, even if they aren't used. When we tried to register one of those fallbacks on startup, it would try to dump the kernel somewhere in memory past the bounds of the (now smaller) array inside of the `Dispatcher` object, `backendFallbackKernels_`.
**Why didn't this problem show up in OSS CI? Why didn't it break other internal mobile workflows aside from Milan?**
Ideally, this failure would show up as part of the OSS signal on GitHub, since we already have mobile OSS builds. Given that it was another memory corruption issue that only affected Milan (subset of mobile), I'm not sure what's specific about Milan's builds that caused it only to manifest there. dreiss I wonder if there's another flavor of mobile builds we could run in OSS CI that could potentially help catch this?
**The debugging experience was pretty difficult**
Debugging the Milan-specific failure was made difficult by the following:
(1) lack of CI
- the original Milan failure didn't surface on my original diff, because the Milan job(s) that failed weren't triggered to run on pytorch changes. There's probably a balance to strike here, since those jobs will only be useful if they aren't flaky, and if they can produce reliable failure logs for debugging.
(2) It's difficult to get a repro.
- my work laptop doesn't have the right specs to run the Milan development workflow (not enough disk space)
- There is an existing OnDemand workflow for Milan, but it appears to be relatively new, and after a bunch of help from MarcioPorto, we ran into issues forwarding the log output from Milan tests on the emulator back to the terminal (see the original discussion here: https://fb.workplace.com/groups/OnDemandFRL/permalink/1424937774645433/)
(3) Lack of stack-traces.
- Most Milan failures didn't include actionable stack traces. phding generously helped me debug by running my suggested patches locally, and reporting back if there were any failures. The failing test didn't include a stack trace though (just the line where the crash appeared), so I ended up making some educated guesses about what the issue was based on the area of the crash.
ghstack-source-id: 152688542
Test Plan: Confirmed with phding that the broken Milan workflow from the previous version of this diff is now passing.
Reviewed By: phding, albanD
Differential Revision: D35222806
fbshipit-source-id: 0ad115a0f768bc8ea5d4c203b2990254c7092d30
(cherry picked from commit 002b91966f11fd55ab3fa3801b636fa39a6dd12c)
2022-03-31 21:45:53 +00:00
|
|
|
|
2020-01-15 19:12:17 +00:00
|
|
|
default:
|
2022-07-20 18:10:50 +00:00
|
|
|
auto bc = toBackendComponent(t);
|
|
|
|
|
auto fk = toFunctionalityKey(t);
|
|
|
|
|
|
|
|
|
|
switch (fk) {
|
|
|
|
|
#define ENTRY(backend, functionality) \
|
|
|
|
|
case BackendComponent::backend##Bit: \
|
|
|
|
|
return #functionality #backend;
|
|
|
|
|
|
|
|
|
|
#define FORALL_BC(dkname, prefix) \
|
|
|
|
|
case DispatchKey::dkname: \
|
|
|
|
|
switch (bc) { \
|
|
|
|
|
C10_FORALL_BACKEND_COMPONENTS(ENTRY, prefix) \
|
|
|
|
|
default: \
|
2022-10-13 20:19:16 +00:00
|
|
|
return #prefix "Undefined"; \
|
2022-07-20 18:10:50 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
|
|
C10_FORALL_FUNCTIONALITY_KEYS(FORALL_BC)
|
|
|
|
|
|
|
|
|
|
default:
|
|
|
|
|
switch (bc) {
|
|
|
|
|
C10_FORALL_BACKEND_COMPONENTS(ENTRY, Unknown)
|
|
|
|
|
default:
|
|
|
|
|
return "UnknownUnknown";
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
#undef FORALL_BC
|
|
|
|
|
#undef ENTRY
|
|
|
|
|
}
|
2020-01-15 19:12:17 +00:00
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
std::ostream& operator<<(std::ostream& str, DispatchKey rhs) {
|
|
|
|
|
return str << toString(rhs);
|
|
|
|
|
}
|
Reland: "free up dispatch key space (in C++)" (#74963)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74963
This is a re-land of D35192346 (https://github.com/pytorch/pytorch/commit/9872a06d77582e91e834103db75f774ca75f7fff) and D35192317 (https://github.com/pytorch/pytorch/commit/a9216cde6cc57f94586ea71a75a35aaabee720ff), which together are a diff that changes the internal representation of `DispatchKeySet` in pytorch core to free up the number of dispatch keys that we have available. See a more detailed description of the design in the original PR: https://github.com/pytorch/pytorch/pull/69633.
The original PR broke Milan workflows, which use a pytorch mobile build, and manifested as a memory corruption bug inside of `liboacrmerged.so`.
**Background: Existing Mobile Optimization**
Pytorch mobile builds have an existing optimization (here https://github.com/pytorch/pytorch/blob/cc23725e89713138aa1c81ce5fb4a8dbcd440ccf/c10/core/DispatchKey.h#L382 and here https://github.com/pytorch/pytorch/blob/cc23725e89713138aa1c81ce5fb4a8dbcd440ccf/aten/src/ATen/core/dispatch/OperatorEntry.h#L214), which works as follows:
Every operator in pytorch has a "dispatch table" of function pointers, corresponding to all of the (up to 64) different kernels that we might dispatch to when we run an operator in pytorch (autograd, cpu, cuda, complex number support, etc).
In mobile builds, the size of that table is shrunk from 64 to 8 to save a bunch of space, because mobile doesn't end up using the functionality associated with most dispatch keys.
The dispatcher also has a notion of "fallback kernels", which are kernels that you can register to a particular dispatch key, but should be able to work for "any operator". The array of fallback kernels is defined here: https://github.com/pytorch/pytorch/blob/cc23725e89713138aa1c81ce5fb4a8dbcd440ccf/aten/src/ATen/core/dispatch/Dispatcher.h#L294.
The mobile-optimization currently does **not** extend to this array (it wouldn't be that useful anyway because there is only one array of fallback kernels globally - vs. there is a separate dispatch table of function pointers per operator). So the per-operator tables on mobile are size 8, while the fallback table is size 64.
**The Bug**
This PR actually makes it difficult to enable that optimization separately for the per-operator arrays vs. the fallback array, and incidentally shrunk the size of the fallback array from 64 to 8 for mobile (that happened on this line: https://github.com/pytorch/pytorch/pull/69633/files#diff-f735cd7aa68f15b624100cbc4bb3b5ea76ffc7c9d3bec3b0ccabaa09609e5319R294).
That isn't a problem by itself (since mobile doesn't actually use any of the fallbacks that can no longer be stored). However, pytorch core will still register all of those fallback kernels on startup in mobile builds, even if they aren't used. When we tried to register one of those fallbacks on startup, it would try to dump the kernel somewhere in memory past the bounds of the (now smaller) array inside of the `Dispatcher` object, `backendFallbackKernels_`.
**Why didn't this problem show up in OSS CI? Why didn't it break other internal mobile workflows aside from Milan?**
Ideally, this failure would show up as part of the OSS signal on GitHub, since we already have mobile OSS builds. Given that it was another memory corruption issue that only affected Milan (subset of mobile), I'm not sure what's specific about Milan's builds that caused it only to manifest there. dreiss I wonder if there's another flavor of mobile builds we could run in OSS CI that could potentially help catch this?
**The debugging experience was pretty difficult**
Debugging the Milan-specific failure was made difficult by the following:
(1) lack of CI
- the original Milan failure didn't surface on my original diff, because the Milan job(s) that failed weren't triggered to run on pytorch changes. There's probably a balance to strike here, since those jobs will only be useful if they aren't flaky, and if they can produce reliable failure logs for debugging.
(2) It's difficult to get a repro.
- my work laptop doesn't have the right specs to run the Milan development workflow (not enough disk space)
- There is an existing OnDemand workflow for Milan, but it appears to be relatively new, and after a bunch of help from MarcioPorto, we ran into issues forwarding the log output from Milan tests on the emulator back to the terminal (see the original discussion here: https://fb.workplace.com/groups/OnDemandFRL/permalink/1424937774645433/)
(3) Lack of stack-traces.
- Most Milan failures didn't include actionable stack traces. phding generously helped me debug by running my suggested patches locally, and reporting back if there were any failures. The failing test didn't include a stack trace though (just the line where the crash appeared), so I ended up making some educated guesses about what the issue was based on the area of the crash.
ghstack-source-id: 152688542
Test Plan: Confirmed with phding that the broken Milan workflow from the previous version of this diff is now passing.
Reviewed By: phding, albanD
Differential Revision: D35222806
fbshipit-source-id: 0ad115a0f768bc8ea5d4c203b2990254c7092d30
(cherry picked from commit 002b91966f11fd55ab3fa3801b636fa39a6dd12c)
2022-03-31 21:45:53 +00:00
|
|
|
std::ostream& operator<<(std::ostream& str, BackendComponent rhs) {
|
|
|
|
|
return str << toString(rhs);
|
|
|
|
|
}
|
2020-01-15 19:12:17 +00:00
|
|
|
|
Reland: "free up dispatch key space (in C++)" (#74963)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74963
This is a re-land of D35192346 (https://github.com/pytorch/pytorch/commit/9872a06d77582e91e834103db75f774ca75f7fff) and D35192317 (https://github.com/pytorch/pytorch/commit/a9216cde6cc57f94586ea71a75a35aaabee720ff), which together are a diff that changes the internal representation of `DispatchKeySet` in pytorch core to free up the number of dispatch keys that we have available. See a more detailed description of the design in the original PR: https://github.com/pytorch/pytorch/pull/69633.
The original PR broke Milan workflows, which use a pytorch mobile build, and manifested as a memory corruption bug inside of `liboacrmerged.so`.
**Background: Existing Mobile Optimization**
Pytorch mobile builds have an existing optimization (here https://github.com/pytorch/pytorch/blob/cc23725e89713138aa1c81ce5fb4a8dbcd440ccf/c10/core/DispatchKey.h#L382 and here https://github.com/pytorch/pytorch/blob/cc23725e89713138aa1c81ce5fb4a8dbcd440ccf/aten/src/ATen/core/dispatch/OperatorEntry.h#L214), which works as follows:
Every operator in pytorch has a "dispatch table" of function pointers, corresponding to all of the (up to 64) different kernels that we might dispatch to when we run an operator in pytorch (autograd, cpu, cuda, complex number support, etc).
In mobile builds, the size of that table is shrunk from 64 to 8 to save a bunch of space, because mobile doesn't end up using the functionality associated with most dispatch keys.
The dispatcher also has a notion of "fallback kernels", which are kernels that you can register to a particular dispatch key, but should be able to work for "any operator". The array of fallback kernels is defined here: https://github.com/pytorch/pytorch/blob/cc23725e89713138aa1c81ce5fb4a8dbcd440ccf/aten/src/ATen/core/dispatch/Dispatcher.h#L294.
The mobile-optimization currently does **not** extend to this array (it wouldn't be that useful anyway because there is only one array of fallback kernels globally - vs. there is a separate dispatch table of function pointers per operator). So the per-operator tables on mobile are size 8, while the fallback table is size 64.
**The Bug**
This PR actually makes it difficult to enable that optimization separately for the per-operator arrays vs. the fallback array, and incidentally shrunk the size of the fallback array from 64 to 8 for mobile (that happened on this line: https://github.com/pytorch/pytorch/pull/69633/files#diff-f735cd7aa68f15b624100cbc4bb3b5ea76ffc7c9d3bec3b0ccabaa09609e5319R294).
That isn't a problem by itself (since mobile doesn't actually use any of the fallbacks that can no longer be stored). However, pytorch core will still register all of those fallback kernels on startup in mobile builds, even if they aren't used. When we tried to register one of those fallbacks on startup, it would try to dump the kernel somewhere in memory past the bounds of the (now smaller) array inside of the `Dispatcher` object, `backendFallbackKernels_`.
**Why didn't this problem show up in OSS CI? Why didn't it break other internal mobile workflows aside from Milan?**
Ideally, this failure would show up as part of the OSS signal on GitHub, since we already have mobile OSS builds. Given that it was another memory corruption issue that only affected Milan (subset of mobile), I'm not sure what's specific about Milan's builds that caused it only to manifest there. dreiss I wonder if there's another flavor of mobile builds we could run in OSS CI that could potentially help catch this?
**The debugging experience was pretty difficult**
Debugging the Milan-specific failure was made difficult by the following:
(1) lack of CI
- the original Milan failure didn't surface on my original diff, because the Milan job(s) that failed weren't triggered to run on pytorch changes. There's probably a balance to strike here, since those jobs will only be useful if they aren't flaky, and if they can produce reliable failure logs for debugging.
(2) It's difficult to get a repro.
- my work laptop doesn't have the right specs to run the Milan development workflow (not enough disk space)
- There is an existing OnDemand workflow for Milan, but it appears to be relatively new, and after a bunch of help from MarcioPorto, we ran into issues forwarding the log output from Milan tests on the emulator back to the terminal (see the original discussion here: https://fb.workplace.com/groups/OnDemandFRL/permalink/1424937774645433/)
(3) Lack of stack-traces.
- Most Milan failures didn't include actionable stack traces. phding generously helped me debug by running my suggested patches locally, and reporting back if there were any failures. The failing test didn't include a stack trace though (just the line where the crash appeared), so I ended up making some educated guesses about what the issue was based on the area of the crash.
ghstack-source-id: 152688542
Test Plan: Confirmed with phding that the broken Milan workflow from the previous version of this diff is now passing.
Reviewed By: phding, albanD
Differential Revision: D35222806
fbshipit-source-id: 0ad115a0f768bc8ea5d4c203b2990254c7092d30
(cherry picked from commit 002b91966f11fd55ab3fa3801b636fa39a6dd12c)
2022-03-31 21:45:53 +00:00
|
|
|
DispatchKey getAutogradKeyFromBackend(BackendComponent k) {
|
|
|
|
|
// We want this to return an autograd key. We're relying on the fact that
|
|
|
|
|
// getAutogradRelatedKeySetFromBackend returns an autograd key +
|
|
|
|
|
// ADInplaceOrView, and autograd has higher precedence. The core mapping from
|
|
|
|
|
// backend -> autograd key lives in `getAutogradRelatedKeySetFromBackend`
|
|
|
|
|
// instead of here for performance. `getAutogradRelatedKeySetFromBackend` is a
|
|
|
|
|
// hotpath function, and we want to make sure that it doesn't have to
|
|
|
|
|
// construct any DispatchKeySets at runtime.
|
|
|
|
|
return getAutogradRelatedKeySetFromBackend(k).highestPriorityTypeId();
|
2020-09-01 16:03:45 +00:00
|
|
|
}
|
|
|
|
|
|
2021-09-22 16:14:01 +00:00
|
|
|
c10::DispatchKey parseDispatchKey(const std::string& k) {
|
|
|
|
|
static std::unordered_map<std::string, c10::DispatchKey> key_map = {
|
|
|
|
|
{"Undefined", c10::DispatchKey::Undefined},
|
Reland: "free up dispatch key space (in C++)" (#74963)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74963
This is a re-land of D35192346 (https://github.com/pytorch/pytorch/commit/9872a06d77582e91e834103db75f774ca75f7fff) and D35192317 (https://github.com/pytorch/pytorch/commit/a9216cde6cc57f94586ea71a75a35aaabee720ff), which together are a diff that changes the internal representation of `DispatchKeySet` in pytorch core to free up the number of dispatch keys that we have available. See a more detailed description of the design in the original PR: https://github.com/pytorch/pytorch/pull/69633.
The original PR broke Milan workflows, which use a pytorch mobile build, and manifested as a memory corruption bug inside of `liboacrmerged.so`.
**Background: Existing Mobile Optimization**
Pytorch mobile builds have an existing optimization (here https://github.com/pytorch/pytorch/blob/cc23725e89713138aa1c81ce5fb4a8dbcd440ccf/c10/core/DispatchKey.h#L382 and here https://github.com/pytorch/pytorch/blob/cc23725e89713138aa1c81ce5fb4a8dbcd440ccf/aten/src/ATen/core/dispatch/OperatorEntry.h#L214), which works as follows:
Every operator in pytorch has a "dispatch table" of function pointers, corresponding to all of the (up to 64) different kernels that we might dispatch to when we run an operator in pytorch (autograd, cpu, cuda, complex number support, etc).
In mobile builds, the size of that table is shrunk from 64 to 8 to save a bunch of space, because mobile doesn't end up using the functionality associated with most dispatch keys.
The dispatcher also has a notion of "fallback kernels", which are kernels that you can register to a particular dispatch key, but should be able to work for "any operator". The array of fallback kernels is defined here: https://github.com/pytorch/pytorch/blob/cc23725e89713138aa1c81ce5fb4a8dbcd440ccf/aten/src/ATen/core/dispatch/Dispatcher.h#L294.
The mobile-optimization currently does **not** extend to this array (it wouldn't be that useful anyway because there is only one array of fallback kernels globally - vs. there is a separate dispatch table of function pointers per operator). So the per-operator tables on mobile are size 8, while the fallback table is size 64.
**The Bug**
This PR actually makes it difficult to enable that optimization separately for the per-operator arrays vs. the fallback array, and incidentally shrunk the size of the fallback array from 64 to 8 for mobile (that happened on this line: https://github.com/pytorch/pytorch/pull/69633/files#diff-f735cd7aa68f15b624100cbc4bb3b5ea76ffc7c9d3bec3b0ccabaa09609e5319R294).
That isn't a problem by itself (since mobile doesn't actually use any of the fallbacks that can no longer be stored). However, pytorch core will still register all of those fallback kernels on startup in mobile builds, even if they aren't used. When we tried to register one of those fallbacks on startup, it would try to dump the kernel somewhere in memory past the bounds of the (now smaller) array inside of the `Dispatcher` object, `backendFallbackKernels_`.
**Why didn't this problem show up in OSS CI? Why didn't it break other internal mobile workflows aside from Milan?**
Ideally, this failure would show up as part of the OSS signal on GitHub, since we already have mobile OSS builds. Given that it was another memory corruption issue that only affected Milan (subset of mobile), I'm not sure what's specific about Milan's builds that caused it only to manifest there. dreiss I wonder if there's another flavor of mobile builds we could run in OSS CI that could potentially help catch this?
**The debugging experience was pretty difficult**
Debugging the Milan-specific failure was made difficult by the following:
(1) lack of CI
- the original Milan failure didn't surface on my original diff, because the Milan job(s) that failed weren't triggered to run on pytorch changes. There's probably a balance to strike here, since those jobs will only be useful if they aren't flaky, and if they can produce reliable failure logs for debugging.
(2) It's difficult to get a repro.
- my work laptop doesn't have the right specs to run the Milan development workflow (not enough disk space)
- There is an existing OnDemand workflow for Milan, but it appears to be relatively new, and after a bunch of help from MarcioPorto, we ran into issues forwarding the log output from Milan tests on the emulator back to the terminal (see the original discussion here: https://fb.workplace.com/groups/OnDemandFRL/permalink/1424937774645433/)
(3) Lack of stack-traces.
- Most Milan failures didn't include actionable stack traces. phding generously helped me debug by running my suggested patches locally, and reporting back if there were any failures. The failing test didn't include a stack trace though (just the line where the crash appeared), so I ended up making some educated guesses about what the issue was based on the area of the crash.
ghstack-source-id: 152688542
Test Plan: Confirmed with phding that the broken Milan workflow from the previous version of this diff is now passing.
Reviewed By: phding, albanD
Differential Revision: D35222806
fbshipit-source-id: 0ad115a0f768bc8ea5d4c203b2990254c7092d30
(cherry picked from commit 002b91966f11fd55ab3fa3801b636fa39a6dd12c)
2022-03-31 21:45:53 +00:00
|
|
|
{"Dense", c10::DispatchKey::Dense},
|
2021-09-22 16:14:01 +00:00
|
|
|
{"FPGA", c10::DispatchKey::FPGA},
|
2024-04-23 00:33:20 +00:00
|
|
|
{"MAIA", c10::DispatchKey::MAIA},
|
2022-05-11 17:19:45 +00:00
|
|
|
{"MPS", c10::DispatchKey::MPS},
|
2021-09-22 16:14:01 +00:00
|
|
|
{"Vulkan", c10::DispatchKey::Vulkan},
|
|
|
|
|
{"Metal", c10::DispatchKey::Metal},
|
|
|
|
|
{"VE", c10::DispatchKey::VE},
|
|
|
|
|
{"Meta", c10::DispatchKey::Meta},
|
Reland: "free up dispatch key space (in C++)" (#74963)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74963
This is a re-land of D35192346 (https://github.com/pytorch/pytorch/commit/9872a06d77582e91e834103db75f774ca75f7fff) and D35192317 (https://github.com/pytorch/pytorch/commit/a9216cde6cc57f94586ea71a75a35aaabee720ff), which together are a diff that changes the internal representation of `DispatchKeySet` in pytorch core to free up the number of dispatch keys that we have available. See a more detailed description of the design in the original PR: https://github.com/pytorch/pytorch/pull/69633.
The original PR broke Milan workflows, which use a pytorch mobile build, and manifested as a memory corruption bug inside of `liboacrmerged.so`.
**Background: Existing Mobile Optimization**
Pytorch mobile builds have an existing optimization (here https://github.com/pytorch/pytorch/blob/cc23725e89713138aa1c81ce5fb4a8dbcd440ccf/c10/core/DispatchKey.h#L382 and here https://github.com/pytorch/pytorch/blob/cc23725e89713138aa1c81ce5fb4a8dbcd440ccf/aten/src/ATen/core/dispatch/OperatorEntry.h#L214), which works as follows:
Every operator in pytorch has a "dispatch table" of function pointers, corresponding to all of the (up to 64) different kernels that we might dispatch to when we run an operator in pytorch (autograd, cpu, cuda, complex number support, etc).
In mobile builds, the size of that table is shrunk from 64 to 8 to save a bunch of space, because mobile doesn't end up using the functionality associated with most dispatch keys.
The dispatcher also has a notion of "fallback kernels", which are kernels that you can register to a particular dispatch key, but should be able to work for "any operator". The array of fallback kernels is defined here: https://github.com/pytorch/pytorch/blob/cc23725e89713138aa1c81ce5fb4a8dbcd440ccf/aten/src/ATen/core/dispatch/Dispatcher.h#L294.
The mobile-optimization currently does **not** extend to this array (it wouldn't be that useful anyway because there is only one array of fallback kernels globally - vs. there is a separate dispatch table of function pointers per operator). So the per-operator tables on mobile are size 8, while the fallback table is size 64.
**The Bug**
This PR actually makes it difficult to enable that optimization separately for the per-operator arrays vs. the fallback array, and incidentally shrunk the size of the fallback array from 64 to 8 for mobile (that happened on this line: https://github.com/pytorch/pytorch/pull/69633/files#diff-f735cd7aa68f15b624100cbc4bb3b5ea76ffc7c9d3bec3b0ccabaa09609e5319R294).
That isn't a problem by itself (since mobile doesn't actually use any of the fallbacks that can no longer be stored). However, pytorch core will still register all of those fallback kernels on startup in mobile builds, even if they aren't used. When we tried to register one of those fallbacks on startup, it would try to dump the kernel somewhere in memory past the bounds of the (now smaller) array inside of the `Dispatcher` object, `backendFallbackKernels_`.
**Why didn't this problem show up in OSS CI? Why didn't it break other internal mobile workflows aside from Milan?**
Ideally, this failure would show up as part of the OSS signal on GitHub, since we already have mobile OSS builds. Given that it was another memory corruption issue that only affected Milan (subset of mobile), I'm not sure what's specific about Milan's builds that caused it only to manifest there. dreiss I wonder if there's another flavor of mobile builds we could run in OSS CI that could potentially help catch this?
**The debugging experience was pretty difficult**
Debugging the Milan-specific failure was made difficult by the following:
(1) lack of CI
- the original Milan failure didn't surface on my original diff, because the Milan job(s) that failed weren't triggered to run on pytorch changes. There's probably a balance to strike here, since those jobs will only be useful if they aren't flaky, and if they can produce reliable failure logs for debugging.
(2) It's difficult to get a repro.
- my work laptop doesn't have the right specs to run the Milan development workflow (not enough disk space)
- There is an existing OnDemand workflow for Milan, but it appears to be relatively new, and after a bunch of help from MarcioPorto, we ran into issues forwarding the log output from Milan tests on the emulator back to the terminal (see the original discussion here: https://fb.workplace.com/groups/OnDemandFRL/permalink/1424937774645433/)
(3) Lack of stack-traces.
- Most Milan failures didn't include actionable stack traces. phding generously helped me debug by running my suggested patches locally, and reporting back if there were any failures. The failing test didn't include a stack trace though (just the line where the crash appeared), so I ended up making some educated guesses about what the issue was based on the area of the crash.
ghstack-source-id: 152688542
Test Plan: Confirmed with phding that the broken Milan workflow from the previous version of this diff is now passing.
Reviewed By: phding, albanD
Differential Revision: D35222806
fbshipit-source-id: 0ad115a0f768bc8ea5d4c203b2990254c7092d30
(cherry picked from commit 002b91966f11fd55ab3fa3801b636fa39a6dd12c)
2022-03-31 21:45:53 +00:00
|
|
|
{"Quantized", c10::DispatchKey::Quantized},
|
2021-09-22 16:14:01 +00:00
|
|
|
{"CustomRNGKeyId", c10::DispatchKey::CustomRNGKeyId},
|
|
|
|
|
{"MkldnnCPU", c10::DispatchKey::MkldnnCPU},
|
Reland: "free up dispatch key space (in C++)" (#74963)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74963
This is a re-land of D35192346 (https://github.com/pytorch/pytorch/commit/9872a06d77582e91e834103db75f774ca75f7fff) and D35192317 (https://github.com/pytorch/pytorch/commit/a9216cde6cc57f94586ea71a75a35aaabee720ff), which together are a diff that changes the internal representation of `DispatchKeySet` in pytorch core to free up the number of dispatch keys that we have available. See a more detailed description of the design in the original PR: https://github.com/pytorch/pytorch/pull/69633.
The original PR broke Milan workflows, which use a pytorch mobile build, and manifested as a memory corruption bug inside of `liboacrmerged.so`.
**Background: Existing Mobile Optimization**
Pytorch mobile builds have an existing optimization (here https://github.com/pytorch/pytorch/blob/cc23725e89713138aa1c81ce5fb4a8dbcd440ccf/c10/core/DispatchKey.h#L382 and here https://github.com/pytorch/pytorch/blob/cc23725e89713138aa1c81ce5fb4a8dbcd440ccf/aten/src/ATen/core/dispatch/OperatorEntry.h#L214), which works as follows:
Every operator in pytorch has a "dispatch table" of function pointers, corresponding to all of the (up to 64) different kernels that we might dispatch to when we run an operator in pytorch (autograd, cpu, cuda, complex number support, etc).
In mobile builds, the size of that table is shrunk from 64 to 8 to save a bunch of space, because mobile doesn't end up using the functionality associated with most dispatch keys.
The dispatcher also has a notion of "fallback kernels", which are kernels that you can register to a particular dispatch key, but should be able to work for "any operator". The array of fallback kernels is defined here: https://github.com/pytorch/pytorch/blob/cc23725e89713138aa1c81ce5fb4a8dbcd440ccf/aten/src/ATen/core/dispatch/Dispatcher.h#L294.
The mobile-optimization currently does **not** extend to this array (it wouldn't be that useful anyway because there is only one array of fallback kernels globally - vs. there is a separate dispatch table of function pointers per operator). So the per-operator tables on mobile are size 8, while the fallback table is size 64.
**The Bug**
This PR actually makes it difficult to enable that optimization separately for the per-operator arrays vs. the fallback array, and incidentally shrunk the size of the fallback array from 64 to 8 for mobile (that happened on this line: https://github.com/pytorch/pytorch/pull/69633/files#diff-f735cd7aa68f15b624100cbc4bb3b5ea76ffc7c9d3bec3b0ccabaa09609e5319R294).
That isn't a problem by itself (since mobile doesn't actually use any of the fallbacks that can no longer be stored). However, pytorch core will still register all of those fallback kernels on startup in mobile builds, even if they aren't used. When we tried to register one of those fallbacks on startup, it would try to dump the kernel somewhere in memory past the bounds of the (now smaller) array inside of the `Dispatcher` object, `backendFallbackKernels_`.
**Why didn't this problem show up in OSS CI? Why didn't it break other internal mobile workflows aside from Milan?**
Ideally, this failure would show up as part of the OSS signal on GitHub, since we already have mobile OSS builds. Given that it was another memory corruption issue that only affected Milan (subset of mobile), I'm not sure what's specific about Milan's builds that caused it only to manifest there. dreiss I wonder if there's another flavor of mobile builds we could run in OSS CI that could potentially help catch this?
**The debugging experience was pretty difficult**
Debugging the Milan-specific failure was made difficult by the following:
(1) lack of CI
- the original Milan failure didn't surface on my original diff, because the Milan job(s) that failed weren't triggered to run on pytorch changes. There's probably a balance to strike here, since those jobs will only be useful if they aren't flaky, and if they can produce reliable failure logs for debugging.
(2) It's difficult to get a repro.
- my work laptop doesn't have the right specs to run the Milan development workflow (not enough disk space)
- There is an existing OnDemand workflow for Milan, but it appears to be relatively new, and after a bunch of help from MarcioPorto, we ran into issues forwarding the log output from Milan tests on the emulator back to the terminal (see the original discussion here: https://fb.workplace.com/groups/OnDemandFRL/permalink/1424937774645433/)
(3) Lack of stack-traces.
- Most Milan failures didn't include actionable stack traces. phding generously helped me debug by running my suggested patches locally, and reporting back if there were any failures. The failing test didn't include a stack trace though (just the line where the crash appeared), so I ended up making some educated guesses about what the issue was based on the area of the crash.
ghstack-source-id: 152688542
Test Plan: Confirmed with phding that the broken Milan workflow from the previous version of this diff is now passing.
Reviewed By: phding, albanD
Differential Revision: D35222806
fbshipit-source-id: 0ad115a0f768bc8ea5d4c203b2990254c7092d30
(cherry picked from commit 002b91966f11fd55ab3fa3801b636fa39a6dd12c)
2022-03-31 21:45:53 +00:00
|
|
|
{"Sparse", c10::DispatchKey::Sparse},
|
2024-03-01 08:06:21 +00:00
|
|
|
{"SparseCsr", c10::DispatchKey::SparseCsr},
|
2021-09-22 16:14:01 +00:00
|
|
|
{"BackendSelect", c10::DispatchKey::BackendSelect},
|
|
|
|
|
{"Python", c10::DispatchKey::Python},
|
2022-02-15 18:54:32 +00:00
|
|
|
{"PythonTLSSnapshot", c10::DispatchKey::PythonTLSSnapshot},
|
2022-04-27 18:48:44 +00:00
|
|
|
{"Fake", c10::DispatchKey::Fake},
|
2021-09-22 16:14:01 +00:00
|
|
|
{"Named", c10::DispatchKey::Named},
|
|
|
|
|
{"Conjugate", c10::DispatchKey::Conjugate},
|
|
|
|
|
{"Negative", c10::DispatchKey::Negative},
|
2021-12-08 18:34:08 +00:00
|
|
|
{"ZeroTensor", c10::DispatchKey::ZeroTensor},
|
2021-09-22 16:14:01 +00:00
|
|
|
{"FuncTorchDynamicLayerBackMode",
|
|
|
|
|
c10::DispatchKey::FuncTorchDynamicLayerBackMode},
|
New calling convention for Python dispatcher (#85133)
Instead of calling into the Python dispatcher for EVERY dispatcher
call, we now have a two step process. First, we
getattr(op: OpOverload, dispatch_key) to "load" the handler for the
function. This can either be a conventional function (in which
case we will call it, in the same way the old Python dispatcher
worked), or it can be a DispatchKey, in which case we will directly
call that DispatchKey in C++, bypassing marshalling between Python
and C++ entirely. OpOverload.__getattr__ is carefully written so
that it will cache the
A further optimization would be to define __slots__ on OpOverload,
and ensuring that the DispatchKey strings are interned.
The resulting Python dispatcher is less flexible: after the first
lookup, the handler is cached and we won't recompute it. Furthermore,
by default, dispatches will not go into Python, and so you won't
get stack frames for the Python dispatcher by default. But we get
a huge performance improvement: on the following microbenchmark
we go from 2.5s to 1.9s.
```
import time
import torch
from functorch import make_fx
def f(x):
for i in range(1000):
x = x * x
return x
begin = time.time()
res = make_fx(f, tracing_mode="symbolic")(torch.randn(10, 20))
print(time.time()-begin)
```
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85133
Approved by: https://github.com/wconstab
2022-09-16 17:23:01 +00:00
|
|
|
{"Functionalize", c10::DispatchKey::Functionalize},
|
2021-09-22 16:14:01 +00:00
|
|
|
{"ADInplaceOrView", c10::DispatchKey::ADInplaceOrView},
|
|
|
|
|
{"AutogradOther", c10::DispatchKey::AutogradOther},
|
Reland: "free up dispatch key space (in C++)" (#74963)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74963
This is a re-land of D35192346 (https://github.com/pytorch/pytorch/commit/9872a06d77582e91e834103db75f774ca75f7fff) and D35192317 (https://github.com/pytorch/pytorch/commit/a9216cde6cc57f94586ea71a75a35aaabee720ff), which together are a diff that changes the internal representation of `DispatchKeySet` in pytorch core to free up the number of dispatch keys that we have available. See a more detailed description of the design in the original PR: https://github.com/pytorch/pytorch/pull/69633.
The original PR broke Milan workflows, which use a pytorch mobile build, and manifested as a memory corruption bug inside of `liboacrmerged.so`.
**Background: Existing Mobile Optimization**
Pytorch mobile builds have an existing optimization (here https://github.com/pytorch/pytorch/blob/cc23725e89713138aa1c81ce5fb4a8dbcd440ccf/c10/core/DispatchKey.h#L382 and here https://github.com/pytorch/pytorch/blob/cc23725e89713138aa1c81ce5fb4a8dbcd440ccf/aten/src/ATen/core/dispatch/OperatorEntry.h#L214), which works as follows:
Every operator in pytorch has a "dispatch table" of function pointers, corresponding to all of the (up to 64) different kernels that we might dispatch to when we run an operator in pytorch (autograd, cpu, cuda, complex number support, etc).
In mobile builds, the size of that table is shrunk from 64 to 8 to save a bunch of space, because mobile doesn't end up using the functionality associated with most dispatch keys.
The dispatcher also has a notion of "fallback kernels", which are kernels that you can register to a particular dispatch key, but should be able to work for "any operator". The array of fallback kernels is defined here: https://github.com/pytorch/pytorch/blob/cc23725e89713138aa1c81ce5fb4a8dbcd440ccf/aten/src/ATen/core/dispatch/Dispatcher.h#L294.
The mobile-optimization currently does **not** extend to this array (it wouldn't be that useful anyway because there is only one array of fallback kernels globally - vs. there is a separate dispatch table of function pointers per operator). So the per-operator tables on mobile are size 8, while the fallback table is size 64.
**The Bug**
This PR actually makes it difficult to enable that optimization separately for the per-operator arrays vs. the fallback array, and incidentally shrunk the size of the fallback array from 64 to 8 for mobile (that happened on this line: https://github.com/pytorch/pytorch/pull/69633/files#diff-f735cd7aa68f15b624100cbc4bb3b5ea76ffc7c9d3bec3b0ccabaa09609e5319R294).
That isn't a problem by itself (since mobile doesn't actually use any of the fallbacks that can no longer be stored). However, pytorch core will still register all of those fallback kernels on startup in mobile builds, even if they aren't used. When we tried to register one of those fallbacks on startup, it would try to dump the kernel somewhere in memory past the bounds of the (now smaller) array inside of the `Dispatcher` object, `backendFallbackKernels_`.
**Why didn't this problem show up in OSS CI? Why didn't it break other internal mobile workflows aside from Milan?**
Ideally, this failure would show up as part of the OSS signal on GitHub, since we already have mobile OSS builds. Given that it was another memory corruption issue that only affected Milan (subset of mobile), I'm not sure what's specific about Milan's builds that caused it only to manifest there. dreiss I wonder if there's another flavor of mobile builds we could run in OSS CI that could potentially help catch this?
**The debugging experience was pretty difficult**
Debugging the Milan-specific failure was made difficult by the following:
(1) lack of CI
- the original Milan failure didn't surface on my original diff, because the Milan job(s) that failed weren't triggered to run on pytorch changes. There's probably a balance to strike here, since those jobs will only be useful if they aren't flaky, and if they can produce reliable failure logs for debugging.
(2) It's difficult to get a repro.
- my work laptop doesn't have the right specs to run the Milan development workflow (not enough disk space)
- There is an existing OnDemand workflow for Milan, but it appears to be relatively new, and after a bunch of help from MarcioPorto, we ran into issues forwarding the log output from Milan tests on the emulator back to the terminal (see the original discussion here: https://fb.workplace.com/groups/OnDemandFRL/permalink/1424937774645433/)
(3) Lack of stack-traces.
- Most Milan failures didn't include actionable stack traces. phding generously helped me debug by running my suggested patches locally, and reporting back if there were any failures. The failing test didn't include a stack trace though (just the line where the crash appeared), so I ended up making some educated guesses about what the issue was based on the area of the crash.
ghstack-source-id: 152688542
Test Plan: Confirmed with phding that the broken Milan workflow from the previous version of this diff is now passing.
Reviewed By: phding, albanD
Differential Revision: D35222806
fbshipit-source-id: 0ad115a0f768bc8ea5d4c203b2990254c7092d30
(cherry picked from commit 002b91966f11fd55ab3fa3801b636fa39a6dd12c)
2022-03-31 21:45:53 +00:00
|
|
|
{"AutogradFunctionality", c10::DispatchKey::AutogradFunctionality},
|
2021-09-22 16:14:01 +00:00
|
|
|
{"AutogradNestedTensor", c10::DispatchKey::AutogradNestedTensor},
|
|
|
|
|
{"Tracer", c10::DispatchKey::Tracer},
|
|
|
|
|
{"AutocastCPU", c10::DispatchKey::AutocastCPU},
|
2025-01-25 03:24:57 +00:00
|
|
|
{"AutocastMTIA", c10::DispatchKey::AutocastMTIA},
|
2022-04-19 21:18:23 +00:00
|
|
|
{"AutocastXPU", c10::DispatchKey::AutocastXPU},
|
2023-06-22 15:38:37 +00:00
|
|
|
{"AutocastIPU", c10::DispatchKey::AutocastIPU},
|
2022-10-12 19:37:13 +00:00
|
|
|
{"AutocastHPU", c10::DispatchKey::AutocastHPU},
|
2021-09-22 16:14:01 +00:00
|
|
|
{"AutocastCUDA", c10::DispatchKey::AutocastCUDA},
|
2023-06-23 19:46:42 +00:00
|
|
|
{"AutocastXLA", c10::DispatchKey::AutocastXLA},
|
2023-03-20 20:27:35 +00:00
|
|
|
{"AutocastPrivateUse1", c10::DispatchKey::AutocastPrivateUse1},
|
2024-09-05 23:23:15 +00:00
|
|
|
{"AutocastMPS", c10::DispatchKey::AutocastMPS},
|
2021-09-22 16:14:01 +00:00
|
|
|
{"FuncTorchBatched", c10::DispatchKey::FuncTorchBatched},
|
2023-09-06 17:49:14 +00:00
|
|
|
{"BatchedNestedTensor", c10::DispatchKey::BatchedNestedTensor},
|
2021-09-22 16:14:01 +00:00
|
|
|
{"FuncTorchVmapMode", c10::DispatchKey::FuncTorchVmapMode},
|
|
|
|
|
{"Batched", c10::DispatchKey::Batched},
|
|
|
|
|
{"VmapMode", c10::DispatchKey::VmapMode},
|
2022-04-27 18:48:44 +00:00
|
|
|
{"DeferredInit", c10::DispatchKey::DeferredInit},
|
2021-09-22 16:14:01 +00:00
|
|
|
{"FuncTorchGradWrapper", c10::DispatchKey::FuncTorchGradWrapper},
|
|
|
|
|
{"FuncTorchDynamicLayerFrontMode",
|
|
|
|
|
c10::DispatchKey::FuncTorchDynamicLayerFrontMode},
|
|
|
|
|
{"TESTING_ONLY_GenericWrapper",
|
|
|
|
|
c10::DispatchKey::TESTING_ONLY_GenericWrapper},
|
|
|
|
|
{"TESTING_ONLY_GenericMode", c10::DispatchKey::TESTING_ONLY_GenericMode},
|
2022-09-15 00:43:36 +00:00
|
|
|
{"PythonDispatcher", c10::DispatchKey::PythonDispatcher},
|
2023-06-21 00:43:00 +00:00
|
|
|
{"PreDispatch", c10::DispatchKey::PreDispatch},
|
Reland: "free up dispatch key space (in C++)" (#74963)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74963
This is a re-land of D35192346 (https://github.com/pytorch/pytorch/commit/9872a06d77582e91e834103db75f774ca75f7fff) and D35192317 (https://github.com/pytorch/pytorch/commit/a9216cde6cc57f94586ea71a75a35aaabee720ff), which together are a diff that changes the internal representation of `DispatchKeySet` in pytorch core to free up the number of dispatch keys that we have available. See a more detailed description of the design in the original PR: https://github.com/pytorch/pytorch/pull/69633.
The original PR broke Milan workflows, which use a pytorch mobile build, and manifested as a memory corruption bug inside of `liboacrmerged.so`.
**Background: Existing Mobile Optimization**
Pytorch mobile builds have an existing optimization (here https://github.com/pytorch/pytorch/blob/cc23725e89713138aa1c81ce5fb4a8dbcd440ccf/c10/core/DispatchKey.h#L382 and here https://github.com/pytorch/pytorch/blob/cc23725e89713138aa1c81ce5fb4a8dbcd440ccf/aten/src/ATen/core/dispatch/OperatorEntry.h#L214), which works as follows:
Every operator in pytorch has a "dispatch table" of function pointers, corresponding to all of the (up to 64) different kernels that we might dispatch to when we run an operator in pytorch (autograd, cpu, cuda, complex number support, etc).
In mobile builds, the size of that table is shrunk from 64 to 8 to save a bunch of space, because mobile doesn't end up using the functionality associated with most dispatch keys.
The dispatcher also has a notion of "fallback kernels", which are kernels that you can register to a particular dispatch key, but should be able to work for "any operator". The array of fallback kernels is defined here: https://github.com/pytorch/pytorch/blob/cc23725e89713138aa1c81ce5fb4a8dbcd440ccf/aten/src/ATen/core/dispatch/Dispatcher.h#L294.
The mobile-optimization currently does **not** extend to this array (it wouldn't be that useful anyway because there is only one array of fallback kernels globally - vs. there is a separate dispatch table of function pointers per operator). So the per-operator tables on mobile are size 8, while the fallback table is size 64.
**The Bug**
This PR actually makes it difficult to enable that optimization separately for the per-operator arrays vs. the fallback array, and incidentally shrunk the size of the fallback array from 64 to 8 for mobile (that happened on this line: https://github.com/pytorch/pytorch/pull/69633/files#diff-f735cd7aa68f15b624100cbc4bb3b5ea76ffc7c9d3bec3b0ccabaa09609e5319R294).
That isn't a problem by itself (since mobile doesn't actually use any of the fallbacks that can no longer be stored). However, pytorch core will still register all of those fallback kernels on startup in mobile builds, even if they aren't used. When we tried to register one of those fallbacks on startup, it would try to dump the kernel somewhere in memory past the bounds of the (now smaller) array inside of the `Dispatcher` object, `backendFallbackKernels_`.
**Why didn't this problem show up in OSS CI? Why didn't it break other internal mobile workflows aside from Milan?**
Ideally, this failure would show up as part of the OSS signal on GitHub, since we already have mobile OSS builds. Given that it was another memory corruption issue that only affected Milan (subset of mobile), I'm not sure what's specific about Milan's builds that caused it only to manifest there. dreiss I wonder if there's another flavor of mobile builds we could run in OSS CI that could potentially help catch this?
**The debugging experience was pretty difficult**
Debugging the Milan-specific failure was made difficult by the following:
(1) lack of CI
- the original Milan failure didn't surface on my original diff, because the Milan job(s) that failed weren't triggered to run on pytorch changes. There's probably a balance to strike here, since those jobs will only be useful if they aren't flaky, and if they can produce reliable failure logs for debugging.
(2) It's difficult to get a repro.
- my work laptop doesn't have the right specs to run the Milan development workflow (not enough disk space)
- There is an existing OnDemand workflow for Milan, but it appears to be relatively new, and after a bunch of help from MarcioPorto, we ran into issues forwarding the log output from Milan tests on the emulator back to the terminal (see the original discussion here: https://fb.workplace.com/groups/OnDemandFRL/permalink/1424937774645433/)
(3) Lack of stack-traces.
- Most Milan failures didn't include actionable stack traces. phding generously helped me debug by running my suggested patches locally, and reporting back if there were any failures. The failing test didn't include a stack trace though (just the line where the crash appeared), so I ended up making some educated guesses about what the issue was based on the area of the crash.
ghstack-source-id: 152688542
Test Plan: Confirmed with phding that the broken Milan workflow from the previous version of this diff is now passing.
Reviewed By: phding, albanD
Differential Revision: D35222806
fbshipit-source-id: 0ad115a0f768bc8ea5d4c203b2990254c7092d30
(cherry picked from commit 002b91966f11fd55ab3fa3801b636fa39a6dd12c)
2022-03-31 21:45:53 +00:00
|
|
|
|
|
|
|
|
{"CPU", c10::DispatchKey::CPU},
|
|
|
|
|
{"CUDA", c10::DispatchKey::CUDA},
|
|
|
|
|
{"HIP", c10::DispatchKey::HIP},
|
|
|
|
|
{"XLA", c10::DispatchKey::XLA},
|
2022-04-27 19:21:57 +00:00
|
|
|
{"MPS", c10::DispatchKey::MPS},
|
Reland: "free up dispatch key space (in C++)" (#74963)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74963
This is a re-land of D35192346 (https://github.com/pytorch/pytorch/commit/9872a06d77582e91e834103db75f774ca75f7fff) and D35192317 (https://github.com/pytorch/pytorch/commit/a9216cde6cc57f94586ea71a75a35aaabee720ff), which together are a diff that changes the internal representation of `DispatchKeySet` in pytorch core to free up the number of dispatch keys that we have available. See a more detailed description of the design in the original PR: https://github.com/pytorch/pytorch/pull/69633.
The original PR broke Milan workflows, which use a pytorch mobile build, and manifested as a memory corruption bug inside of `liboacrmerged.so`.
**Background: Existing Mobile Optimization**
Pytorch mobile builds have an existing optimization (here https://github.com/pytorch/pytorch/blob/cc23725e89713138aa1c81ce5fb4a8dbcd440ccf/c10/core/DispatchKey.h#L382 and here https://github.com/pytorch/pytorch/blob/cc23725e89713138aa1c81ce5fb4a8dbcd440ccf/aten/src/ATen/core/dispatch/OperatorEntry.h#L214), which works as follows:
Every operator in pytorch has a "dispatch table" of function pointers, corresponding to all of the (up to 64) different kernels that we might dispatch to when we run an operator in pytorch (autograd, cpu, cuda, complex number support, etc).
In mobile builds, the size of that table is shrunk from 64 to 8 to save a bunch of space, because mobile doesn't end up using the functionality associated with most dispatch keys.
The dispatcher also has a notion of "fallback kernels", which are kernels that you can register to a particular dispatch key, but should be able to work for "any operator". The array of fallback kernels is defined here: https://github.com/pytorch/pytorch/blob/cc23725e89713138aa1c81ce5fb4a8dbcd440ccf/aten/src/ATen/core/dispatch/Dispatcher.h#L294.
The mobile-optimization currently does **not** extend to this array (it wouldn't be that useful anyway because there is only one array of fallback kernels globally - vs. there is a separate dispatch table of function pointers per operator). So the per-operator tables on mobile are size 8, while the fallback table is size 64.
**The Bug**
This PR actually makes it difficult to enable that optimization separately for the per-operator arrays vs. the fallback array, and incidentally shrunk the size of the fallback array from 64 to 8 for mobile (that happened on this line: https://github.com/pytorch/pytorch/pull/69633/files#diff-f735cd7aa68f15b624100cbc4bb3b5ea76ffc7c9d3bec3b0ccabaa09609e5319R294).
That isn't a problem by itself (since mobile doesn't actually use any of the fallbacks that can no longer be stored). However, pytorch core will still register all of those fallback kernels on startup in mobile builds, even if they aren't used. When we tried to register one of those fallbacks on startup, it would try to dump the kernel somewhere in memory past the bounds of the (now smaller) array inside of the `Dispatcher` object, `backendFallbackKernels_`.
**Why didn't this problem show up in OSS CI? Why didn't it break other internal mobile workflows aside from Milan?**
Ideally, this failure would show up as part of the OSS signal on GitHub, since we already have mobile OSS builds. Given that it was another memory corruption issue that only affected Milan (subset of mobile), I'm not sure what's specific about Milan's builds that caused it only to manifest there. dreiss I wonder if there's another flavor of mobile builds we could run in OSS CI that could potentially help catch this?
**The debugging experience was pretty difficult**
Debugging the Milan-specific failure was made difficult by the following:
(1) lack of CI
- the original Milan failure didn't surface on my original diff, because the Milan job(s) that failed weren't triggered to run on pytorch changes. There's probably a balance to strike here, since those jobs will only be useful if they aren't flaky, and if they can produce reliable failure logs for debugging.
(2) It's difficult to get a repro.
- my work laptop doesn't have the right specs to run the Milan development workflow (not enough disk space)
- There is an existing OnDemand workflow for Milan, but it appears to be relatively new, and after a bunch of help from MarcioPorto, we ran into issues forwarding the log output from Milan tests on the emulator back to the terminal (see the original discussion here: https://fb.workplace.com/groups/OnDemandFRL/permalink/1424937774645433/)
(3) Lack of stack-traces.
- Most Milan failures didn't include actionable stack traces. phding generously helped me debug by running my suggested patches locally, and reporting back if there were any failures. The failing test didn't include a stack trace though (just the line where the crash appeared), so I ended up making some educated guesses about what the issue was based on the area of the crash.
ghstack-source-id: 152688542
Test Plan: Confirmed with phding that the broken Milan workflow from the previous version of this diff is now passing.
Reviewed By: phding, albanD
Differential Revision: D35222806
fbshipit-source-id: 0ad115a0f768bc8ea5d4c203b2990254c7092d30
(cherry picked from commit 002b91966f11fd55ab3fa3801b636fa39a6dd12c)
2022-03-31 21:45:53 +00:00
|
|
|
{"XPU", c10::DispatchKey::XPU},
|
2022-04-07 17:18:45 +00:00
|
|
|
{"IPU", c10::DispatchKey::IPU},
|
Reland: "free up dispatch key space (in C++)" (#74963)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74963
This is a re-land of D35192346 (https://github.com/pytorch/pytorch/commit/9872a06d77582e91e834103db75f774ca75f7fff) and D35192317 (https://github.com/pytorch/pytorch/commit/a9216cde6cc57f94586ea71a75a35aaabee720ff), which together are a diff that changes the internal representation of `DispatchKeySet` in pytorch core to free up the number of dispatch keys that we have available. See a more detailed description of the design in the original PR: https://github.com/pytorch/pytorch/pull/69633.
The original PR broke Milan workflows, which use a pytorch mobile build, and manifested as a memory corruption bug inside of `liboacrmerged.so`.
**Background: Existing Mobile Optimization**
Pytorch mobile builds have an existing optimization (here https://github.com/pytorch/pytorch/blob/cc23725e89713138aa1c81ce5fb4a8dbcd440ccf/c10/core/DispatchKey.h#L382 and here https://github.com/pytorch/pytorch/blob/cc23725e89713138aa1c81ce5fb4a8dbcd440ccf/aten/src/ATen/core/dispatch/OperatorEntry.h#L214), which works as follows:
Every operator in pytorch has a "dispatch table" of function pointers, corresponding to all of the (up to 64) different kernels that we might dispatch to when we run an operator in pytorch (autograd, cpu, cuda, complex number support, etc).
In mobile builds, the size of that table is shrunk from 64 to 8 to save a bunch of space, because mobile doesn't end up using the functionality associated with most dispatch keys.
The dispatcher also has a notion of "fallback kernels", which are kernels that you can register to a particular dispatch key, but should be able to work for "any operator". The array of fallback kernels is defined here: https://github.com/pytorch/pytorch/blob/cc23725e89713138aa1c81ce5fb4a8dbcd440ccf/aten/src/ATen/core/dispatch/Dispatcher.h#L294.
The mobile-optimization currently does **not** extend to this array (it wouldn't be that useful anyway because there is only one array of fallback kernels globally - vs. there is a separate dispatch table of function pointers per operator). So the per-operator tables on mobile are size 8, while the fallback table is size 64.
**The Bug**
This PR actually makes it difficult to enable that optimization separately for the per-operator arrays vs. the fallback array, and incidentally shrunk the size of the fallback array from 64 to 8 for mobile (that happened on this line: https://github.com/pytorch/pytorch/pull/69633/files#diff-f735cd7aa68f15b624100cbc4bb3b5ea76ffc7c9d3bec3b0ccabaa09609e5319R294).
That isn't a problem by itself (since mobile doesn't actually use any of the fallbacks that can no longer be stored). However, pytorch core will still register all of those fallback kernels on startup in mobile builds, even if they aren't used. When we tried to register one of those fallbacks on startup, it would try to dump the kernel somewhere in memory past the bounds of the (now smaller) array inside of the `Dispatcher` object, `backendFallbackKernels_`.
**Why didn't this problem show up in OSS CI? Why didn't it break other internal mobile workflows aside from Milan?**
Ideally, this failure would show up as part of the OSS signal on GitHub, since we already have mobile OSS builds. Given that it was another memory corruption issue that only affected Milan (subset of mobile), I'm not sure what's specific about Milan's builds that caused it only to manifest there. dreiss I wonder if there's another flavor of mobile builds we could run in OSS CI that could potentially help catch this?
**The debugging experience was pretty difficult**
Debugging the Milan-specific failure was made difficult by the following:
(1) lack of CI
- the original Milan failure didn't surface on my original diff, because the Milan job(s) that failed weren't triggered to run on pytorch changes. There's probably a balance to strike here, since those jobs will only be useful if they aren't flaky, and if they can produce reliable failure logs for debugging.
(2) It's difficult to get a repro.
- my work laptop doesn't have the right specs to run the Milan development workflow (not enough disk space)
- There is an existing OnDemand workflow for Milan, but it appears to be relatively new, and after a bunch of help from MarcioPorto, we ran into issues forwarding the log output from Milan tests on the emulator back to the terminal (see the original discussion here: https://fb.workplace.com/groups/OnDemandFRL/permalink/1424937774645433/)
(3) Lack of stack-traces.
- Most Milan failures didn't include actionable stack traces. phding generously helped me debug by running my suggested patches locally, and reporting back if there were any failures. The failing test didn't include a stack trace though (just the line where the crash appeared), so I ended up making some educated guesses about what the issue was based on the area of the crash.
ghstack-source-id: 152688542
Test Plan: Confirmed with phding that the broken Milan workflow from the previous version of this diff is now passing.
Reviewed By: phding, albanD
Differential Revision: D35222806
fbshipit-source-id: 0ad115a0f768bc8ea5d4c203b2990254c7092d30
(cherry picked from commit 002b91966f11fd55ab3fa3801b636fa39a6dd12c)
2022-03-31 21:45:53 +00:00
|
|
|
{"HPU", c10::DispatchKey::HPU},
|
|
|
|
|
{"Lazy", c10::DispatchKey::Lazy},
|
2023-01-16 12:20:23 +00:00
|
|
|
{"MTIA", c10::DispatchKey::MTIA},
|
Reland: "free up dispatch key space (in C++)" (#74963)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74963
This is a re-land of D35192346 (https://github.com/pytorch/pytorch/commit/9872a06d77582e91e834103db75f774ca75f7fff) and D35192317 (https://github.com/pytorch/pytorch/commit/a9216cde6cc57f94586ea71a75a35aaabee720ff), which together are a diff that changes the internal representation of `DispatchKeySet` in pytorch core to free up the number of dispatch keys that we have available. See a more detailed description of the design in the original PR: https://github.com/pytorch/pytorch/pull/69633.
The original PR broke Milan workflows, which use a pytorch mobile build, and manifested as a memory corruption bug inside of `liboacrmerged.so`.
**Background: Existing Mobile Optimization**
Pytorch mobile builds have an existing optimization (here https://github.com/pytorch/pytorch/blob/cc23725e89713138aa1c81ce5fb4a8dbcd440ccf/c10/core/DispatchKey.h#L382 and here https://github.com/pytorch/pytorch/blob/cc23725e89713138aa1c81ce5fb4a8dbcd440ccf/aten/src/ATen/core/dispatch/OperatorEntry.h#L214), which works as follows:
Every operator in pytorch has a "dispatch table" of function pointers, corresponding to all of the (up to 64) different kernels that we might dispatch to when we run an operator in pytorch (autograd, cpu, cuda, complex number support, etc).
In mobile builds, the size of that table is shrunk from 64 to 8 to save a bunch of space, because mobile doesn't end up using the functionality associated with most dispatch keys.
The dispatcher also has a notion of "fallback kernels", which are kernels that you can register to a particular dispatch key, but should be able to work for "any operator". The array of fallback kernels is defined here: https://github.com/pytorch/pytorch/blob/cc23725e89713138aa1c81ce5fb4a8dbcd440ccf/aten/src/ATen/core/dispatch/Dispatcher.h#L294.
The mobile-optimization currently does **not** extend to this array (it wouldn't be that useful anyway because there is only one array of fallback kernels globally - vs. there is a separate dispatch table of function pointers per operator). So the per-operator tables on mobile are size 8, while the fallback table is size 64.
**The Bug**
This PR actually makes it difficult to enable that optimization separately for the per-operator arrays vs. the fallback array, and incidentally shrunk the size of the fallback array from 64 to 8 for mobile (that happened on this line: https://github.com/pytorch/pytorch/pull/69633/files#diff-f735cd7aa68f15b624100cbc4bb3b5ea76ffc7c9d3bec3b0ccabaa09609e5319R294).
That isn't a problem by itself (since mobile doesn't actually use any of the fallbacks that can no longer be stored). However, pytorch core will still register all of those fallback kernels on startup in mobile builds, even if they aren't used. When we tried to register one of those fallbacks on startup, it would try to dump the kernel somewhere in memory past the bounds of the (now smaller) array inside of the `Dispatcher` object, `backendFallbackKernels_`.
**Why didn't this problem show up in OSS CI? Why didn't it break other internal mobile workflows aside from Milan?**
Ideally, this failure would show up as part of the OSS signal on GitHub, since we already have mobile OSS builds. Given that it was another memory corruption issue that only affected Milan (subset of mobile), I'm not sure what's specific about Milan's builds that caused it only to manifest there. dreiss I wonder if there's another flavor of mobile builds we could run in OSS CI that could potentially help catch this?
**The debugging experience was pretty difficult**
Debugging the Milan-specific failure was made difficult by the following:
(1) lack of CI
- the original Milan failure didn't surface on my original diff, because the Milan job(s) that failed weren't triggered to run on pytorch changes. There's probably a balance to strike here, since those jobs will only be useful if they aren't flaky, and if they can produce reliable failure logs for debugging.
(2) It's difficult to get a repro.
- my work laptop doesn't have the right specs to run the Milan development workflow (not enough disk space)
- There is an existing OnDemand workflow for Milan, but it appears to be relatively new, and after a bunch of help from MarcioPorto, we ran into issues forwarding the log output from Milan tests on the emulator back to the terminal (see the original discussion here: https://fb.workplace.com/groups/OnDemandFRL/permalink/1424937774645433/)
(3) Lack of stack-traces.
- Most Milan failures didn't include actionable stack traces. phding generously helped me debug by running my suggested patches locally, and reporting back if there were any failures. The failing test didn't include a stack trace though (just the line where the crash appeared), so I ended up making some educated guesses about what the issue was based on the area of the crash.
ghstack-source-id: 152688542
Test Plan: Confirmed with phding that the broken Milan workflow from the previous version of this diff is now passing.
Reviewed By: phding, albanD
Differential Revision: D35222806
fbshipit-source-id: 0ad115a0f768bc8ea5d4c203b2990254c7092d30
(cherry picked from commit 002b91966f11fd55ab3fa3801b636fa39a6dd12c)
2022-03-31 21:45:53 +00:00
|
|
|
{"NestedTensor", c10::DispatchKey::NestedTensor},
|
2022-04-19 18:00:44 +00:00
|
|
|
{"NestedTensorCPU", c10::DispatchKey::NestedTensorCPU},
|
|
|
|
|
{"NestedTensorCUDA", c10::DispatchKey::NestedTensorCUDA},
|
2024-11-14 18:54:39 +00:00
|
|
|
{"NestedTensorXPU", c10::DispatchKey::NestedTensorXPU},
|
2023-02-07 06:03:01 +00:00
|
|
|
{"NestedTensorMeta", c10::DispatchKey::NestedTensorMeta},
|
2023-06-02 22:35:46 +00:00
|
|
|
{"NestedTensorPrivateUse1", c10::DispatchKey::NestedTensorPrivateUse1},
|
Reland: "free up dispatch key space (in C++)" (#74963)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74963
This is a re-land of D35192346 (https://github.com/pytorch/pytorch/commit/9872a06d77582e91e834103db75f774ca75f7fff) and D35192317 (https://github.com/pytorch/pytorch/commit/a9216cde6cc57f94586ea71a75a35aaabee720ff), which together are a diff that changes the internal representation of `DispatchKeySet` in pytorch core to free up the number of dispatch keys that we have available. See a more detailed description of the design in the original PR: https://github.com/pytorch/pytorch/pull/69633.
The original PR broke Milan workflows, which use a pytorch mobile build, and manifested as a memory corruption bug inside of `liboacrmerged.so`.
**Background: Existing Mobile Optimization**
Pytorch mobile builds have an existing optimization (here https://github.com/pytorch/pytorch/blob/cc23725e89713138aa1c81ce5fb4a8dbcd440ccf/c10/core/DispatchKey.h#L382 and here https://github.com/pytorch/pytorch/blob/cc23725e89713138aa1c81ce5fb4a8dbcd440ccf/aten/src/ATen/core/dispatch/OperatorEntry.h#L214), which works as follows:
Every operator in pytorch has a "dispatch table" of function pointers, corresponding to all of the (up to 64) different kernels that we might dispatch to when we run an operator in pytorch (autograd, cpu, cuda, complex number support, etc).
In mobile builds, the size of that table is shrunk from 64 to 8 to save a bunch of space, because mobile doesn't end up using the functionality associated with most dispatch keys.
The dispatcher also has a notion of "fallback kernels", which are kernels that you can register to a particular dispatch key, but should be able to work for "any operator". The array of fallback kernels is defined here: https://github.com/pytorch/pytorch/blob/cc23725e89713138aa1c81ce5fb4a8dbcd440ccf/aten/src/ATen/core/dispatch/Dispatcher.h#L294.
The mobile-optimization currently does **not** extend to this array (it wouldn't be that useful anyway because there is only one array of fallback kernels globally - vs. there is a separate dispatch table of function pointers per operator). So the per-operator tables on mobile are size 8, while the fallback table is size 64.
**The Bug**
This PR actually makes it difficult to enable that optimization separately for the per-operator arrays vs. the fallback array, and incidentally shrunk the size of the fallback array from 64 to 8 for mobile (that happened on this line: https://github.com/pytorch/pytorch/pull/69633/files#diff-f735cd7aa68f15b624100cbc4bb3b5ea76ffc7c9d3bec3b0ccabaa09609e5319R294).
That isn't a problem by itself (since mobile doesn't actually use any of the fallbacks that can no longer be stored). However, pytorch core will still register all of those fallback kernels on startup in mobile builds, even if they aren't used. When we tried to register one of those fallbacks on startup, it would try to dump the kernel somewhere in memory past the bounds of the (now smaller) array inside of the `Dispatcher` object, `backendFallbackKernels_`.
**Why didn't this problem show up in OSS CI? Why didn't it break other internal mobile workflows aside from Milan?**
Ideally, this failure would show up as part of the OSS signal on GitHub, since we already have mobile OSS builds. Given that it was another memory corruption issue that only affected Milan (subset of mobile), I'm not sure what's specific about Milan's builds that caused it only to manifest there. dreiss I wonder if there's another flavor of mobile builds we could run in OSS CI that could potentially help catch this?
**The debugging experience was pretty difficult**
Debugging the Milan-specific failure was made difficult by the following:
(1) lack of CI
- the original Milan failure didn't surface on my original diff, because the Milan job(s) that failed weren't triggered to run on pytorch changes. There's probably a balance to strike here, since those jobs will only be useful if they aren't flaky, and if they can produce reliable failure logs for debugging.
(2) It's difficult to get a repro.
- my work laptop doesn't have the right specs to run the Milan development workflow (not enough disk space)
- There is an existing OnDemand workflow for Milan, but it appears to be relatively new, and after a bunch of help from MarcioPorto, we ran into issues forwarding the log output from Milan tests on the emulator back to the terminal (see the original discussion here: https://fb.workplace.com/groups/OnDemandFRL/permalink/1424937774645433/)
(3) Lack of stack-traces.
- Most Milan failures didn't include actionable stack traces. phding generously helped me debug by running my suggested patches locally, and reporting back if there were any failures. The failing test didn't include a stack trace though (just the line where the crash appeared), so I ended up making some educated guesses about what the issue was based on the area of the crash.
ghstack-source-id: 152688542
Test Plan: Confirmed with phding that the broken Milan workflow from the previous version of this diff is now passing.
Reviewed By: phding, albanD
Differential Revision: D35222806
fbshipit-source-id: 0ad115a0f768bc8ea5d4c203b2990254c7092d30
(cherry picked from commit 002b91966f11fd55ab3fa3801b636fa39a6dd12c)
2022-03-31 21:45:53 +00:00
|
|
|
{"PrivateUse1", c10::DispatchKey::PrivateUse1},
|
|
|
|
|
{"PrivateUse2", c10::DispatchKey::PrivateUse2},
|
|
|
|
|
{"PrivateUse3", c10::DispatchKey::PrivateUse3},
|
|
|
|
|
|
|
|
|
|
{"QuantizedCPU", c10::DispatchKey::QuantizedCPU},
|
|
|
|
|
{"QuantizedCUDA", c10::DispatchKey::QuantizedCUDA},
|
|
|
|
|
{"QuantizedXPU", c10::DispatchKey::QuantizedXPU},
|
2023-06-02 22:35:46 +00:00
|
|
|
{"QuantizedPrivateUse1", c10::DispatchKey::QuantizedPrivateUse1},
|
Reland: "free up dispatch key space (in C++)" (#74963)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74963
This is a re-land of D35192346 (https://github.com/pytorch/pytorch/commit/9872a06d77582e91e834103db75f774ca75f7fff) and D35192317 (https://github.com/pytorch/pytorch/commit/a9216cde6cc57f94586ea71a75a35aaabee720ff), which together are a diff that changes the internal representation of `DispatchKeySet` in pytorch core to free up the number of dispatch keys that we have available. See a more detailed description of the design in the original PR: https://github.com/pytorch/pytorch/pull/69633.
The original PR broke Milan workflows, which use a pytorch mobile build, and manifested as a memory corruption bug inside of `liboacrmerged.so`.
**Background: Existing Mobile Optimization**
Pytorch mobile builds have an existing optimization (here https://github.com/pytorch/pytorch/blob/cc23725e89713138aa1c81ce5fb4a8dbcd440ccf/c10/core/DispatchKey.h#L382 and here https://github.com/pytorch/pytorch/blob/cc23725e89713138aa1c81ce5fb4a8dbcd440ccf/aten/src/ATen/core/dispatch/OperatorEntry.h#L214), which works as follows:
Every operator in pytorch has a "dispatch table" of function pointers, corresponding to all of the (up to 64) different kernels that we might dispatch to when we run an operator in pytorch (autograd, cpu, cuda, complex number support, etc).
In mobile builds, the size of that table is shrunk from 64 to 8 to save a bunch of space, because mobile doesn't end up using the functionality associated with most dispatch keys.
The dispatcher also has a notion of "fallback kernels", which are kernels that you can register to a particular dispatch key, but should be able to work for "any operator". The array of fallback kernels is defined here: https://github.com/pytorch/pytorch/blob/cc23725e89713138aa1c81ce5fb4a8dbcd440ccf/aten/src/ATen/core/dispatch/Dispatcher.h#L294.
The mobile-optimization currently does **not** extend to this array (it wouldn't be that useful anyway because there is only one array of fallback kernels globally - vs. there is a separate dispatch table of function pointers per operator). So the per-operator tables on mobile are size 8, while the fallback table is size 64.
**The Bug**
This PR actually makes it difficult to enable that optimization separately for the per-operator arrays vs. the fallback array, and incidentally shrunk the size of the fallback array from 64 to 8 for mobile (that happened on this line: https://github.com/pytorch/pytorch/pull/69633/files#diff-f735cd7aa68f15b624100cbc4bb3b5ea76ffc7c9d3bec3b0ccabaa09609e5319R294).
That isn't a problem by itself (since mobile doesn't actually use any of the fallbacks that can no longer be stored). However, pytorch core will still register all of those fallback kernels on startup in mobile builds, even if they aren't used. When we tried to register one of those fallbacks on startup, it would try to dump the kernel somewhere in memory past the bounds of the (now smaller) array inside of the `Dispatcher` object, `backendFallbackKernels_`.
**Why didn't this problem show up in OSS CI? Why didn't it break other internal mobile workflows aside from Milan?**
Ideally, this failure would show up as part of the OSS signal on GitHub, since we already have mobile OSS builds. Given that it was another memory corruption issue that only affected Milan (subset of mobile), I'm not sure what's specific about Milan's builds that caused it only to manifest there. dreiss I wonder if there's another flavor of mobile builds we could run in OSS CI that could potentially help catch this?
**The debugging experience was pretty difficult**
Debugging the Milan-specific failure was made difficult by the following:
(1) lack of CI
- the original Milan failure didn't surface on my original diff, because the Milan job(s) that failed weren't triggered to run on pytorch changes. There's probably a balance to strike here, since those jobs will only be useful if they aren't flaky, and if they can produce reliable failure logs for debugging.
(2) It's difficult to get a repro.
- my work laptop doesn't have the right specs to run the Milan development workflow (not enough disk space)
- There is an existing OnDemand workflow for Milan, but it appears to be relatively new, and after a bunch of help from MarcioPorto, we ran into issues forwarding the log output from Milan tests on the emulator back to the terminal (see the original discussion here: https://fb.workplace.com/groups/OnDemandFRL/permalink/1424937774645433/)
(3) Lack of stack-traces.
- Most Milan failures didn't include actionable stack traces. phding generously helped me debug by running my suggested patches locally, and reporting back if there were any failures. The failing test didn't include a stack trace though (just the line where the crash appeared), so I ended up making some educated guesses about what the issue was based on the area of the crash.
ghstack-source-id: 152688542
Test Plan: Confirmed with phding that the broken Milan workflow from the previous version of this diff is now passing.
Reviewed By: phding, albanD
Differential Revision: D35222806
fbshipit-source-id: 0ad115a0f768bc8ea5d4c203b2990254c7092d30
(cherry picked from commit 002b91966f11fd55ab3fa3801b636fa39a6dd12c)
2022-03-31 21:45:53 +00:00
|
|
|
|
|
|
|
|
{"SparseCPU", c10::DispatchKey::SparseCPU},
|
|
|
|
|
{"SparseCUDA", c10::DispatchKey::SparseCUDA},
|
|
|
|
|
{"SparseHIP", c10::DispatchKey::SparseHIP},
|
|
|
|
|
{"SparseXPU", c10::DispatchKey::SparseXPU},
|
|
|
|
|
{"SparseVE", c10::DispatchKey::SparseVE},
|
2022-11-07 19:13:07 +00:00
|
|
|
{"SparseMeta", c10::DispatchKey::SparseMeta},
|
2023-06-02 22:35:46 +00:00
|
|
|
{"SparsePrivateUse1", c10::DispatchKey::SparsePrivateUse1},
|
Reland: "free up dispatch key space (in C++)" (#74963)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74963
This is a re-land of D35192346 (https://github.com/pytorch/pytorch/commit/9872a06d77582e91e834103db75f774ca75f7fff) and D35192317 (https://github.com/pytorch/pytorch/commit/a9216cde6cc57f94586ea71a75a35aaabee720ff), which together are a diff that changes the internal representation of `DispatchKeySet` in pytorch core to free up the number of dispatch keys that we have available. See a more detailed description of the design in the original PR: https://github.com/pytorch/pytorch/pull/69633.
The original PR broke Milan workflows, which use a pytorch mobile build, and manifested as a memory corruption bug inside of `liboacrmerged.so`.
**Background: Existing Mobile Optimization**
Pytorch mobile builds have an existing optimization (here https://github.com/pytorch/pytorch/blob/cc23725e89713138aa1c81ce5fb4a8dbcd440ccf/c10/core/DispatchKey.h#L382 and here https://github.com/pytorch/pytorch/blob/cc23725e89713138aa1c81ce5fb4a8dbcd440ccf/aten/src/ATen/core/dispatch/OperatorEntry.h#L214), which works as follows:
Every operator in pytorch has a "dispatch table" of function pointers, corresponding to all of the (up to 64) different kernels that we might dispatch to when we run an operator in pytorch (autograd, cpu, cuda, complex number support, etc).
In mobile builds, the size of that table is shrunk from 64 to 8 to save a bunch of space, because mobile doesn't end up using the functionality associated with most dispatch keys.
The dispatcher also has a notion of "fallback kernels", which are kernels that you can register to a particular dispatch key, but should be able to work for "any operator". The array of fallback kernels is defined here: https://github.com/pytorch/pytorch/blob/cc23725e89713138aa1c81ce5fb4a8dbcd440ccf/aten/src/ATen/core/dispatch/Dispatcher.h#L294.
The mobile-optimization currently does **not** extend to this array (it wouldn't be that useful anyway because there is only one array of fallback kernels globally - vs. there is a separate dispatch table of function pointers per operator). So the per-operator tables on mobile are size 8, while the fallback table is size 64.
**The Bug**
This PR actually makes it difficult to enable that optimization separately for the per-operator arrays vs. the fallback array, and incidentally shrunk the size of the fallback array from 64 to 8 for mobile (that happened on this line: https://github.com/pytorch/pytorch/pull/69633/files#diff-f735cd7aa68f15b624100cbc4bb3b5ea76ffc7c9d3bec3b0ccabaa09609e5319R294).
That isn't a problem by itself (since mobile doesn't actually use any of the fallbacks that can no longer be stored). However, pytorch core will still register all of those fallback kernels on startup in mobile builds, even if they aren't used. When we tried to register one of those fallbacks on startup, it would try to dump the kernel somewhere in memory past the bounds of the (now smaller) array inside of the `Dispatcher` object, `backendFallbackKernels_`.
**Why didn't this problem show up in OSS CI? Why didn't it break other internal mobile workflows aside from Milan?**
Ideally, this failure would show up as part of the OSS signal on GitHub, since we already have mobile OSS builds. Given that it was another memory corruption issue that only affected Milan (subset of mobile), I'm not sure what's specific about Milan's builds that caused it only to manifest there. dreiss I wonder if there's another flavor of mobile builds we could run in OSS CI that could potentially help catch this?
**The debugging experience was pretty difficult**
Debugging the Milan-specific failure was made difficult by the following:
(1) lack of CI
- the original Milan failure didn't surface on my original diff, because the Milan job(s) that failed weren't triggered to run on pytorch changes. There's probably a balance to strike here, since those jobs will only be useful if they aren't flaky, and if they can produce reliable failure logs for debugging.
(2) It's difficult to get a repro.
- my work laptop doesn't have the right specs to run the Milan development workflow (not enough disk space)
- There is an existing OnDemand workflow for Milan, but it appears to be relatively new, and after a bunch of help from MarcioPorto, we ran into issues forwarding the log output from Milan tests on the emulator back to the terminal (see the original discussion here: https://fb.workplace.com/groups/OnDemandFRL/permalink/1424937774645433/)
(3) Lack of stack-traces.
- Most Milan failures didn't include actionable stack traces. phding generously helped me debug by running my suggested patches locally, and reporting back if there were any failures. The failing test didn't include a stack trace though (just the line where the crash appeared), so I ended up making some educated guesses about what the issue was based on the area of the crash.
ghstack-source-id: 152688542
Test Plan: Confirmed with phding that the broken Milan workflow from the previous version of this diff is now passing.
Reviewed By: phding, albanD
Differential Revision: D35222806
fbshipit-source-id: 0ad115a0f768bc8ea5d4c203b2990254c7092d30
(cherry picked from commit 002b91966f11fd55ab3fa3801b636fa39a6dd12c)
2022-03-31 21:45:53 +00:00
|
|
|
|
2024-03-01 08:06:21 +00:00
|
|
|
{"SparseCsrCPU", c10::DispatchKey::SparseCsrCPU},
|
|
|
|
|
{"SparseCsrCUDA", c10::DispatchKey::SparseCsrCUDA},
|
|
|
|
|
{"SparseCsrHIP", c10::DispatchKey::SparseCsrHIP},
|
|
|
|
|
{"SparseCsrXPU", c10::DispatchKey::SparseCsrXPU},
|
|
|
|
|
{"SparseCsrVE", c10::DispatchKey::SparseCsrVE},
|
|
|
|
|
{"SparseCsrMeta", c10::DispatchKey::SparseCsrMeta},
|
|
|
|
|
{"SparseCsrPrivateUse1", c10::DispatchKey::SparseCsrPrivateUse1},
|
|
|
|
|
|
Reland: "free up dispatch key space (in C++)" (#74963)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74963
This is a re-land of D35192346 (https://github.com/pytorch/pytorch/commit/9872a06d77582e91e834103db75f774ca75f7fff) and D35192317 (https://github.com/pytorch/pytorch/commit/a9216cde6cc57f94586ea71a75a35aaabee720ff), which together are a diff that changes the internal representation of `DispatchKeySet` in pytorch core to free up the number of dispatch keys that we have available. See a more detailed description of the design in the original PR: https://github.com/pytorch/pytorch/pull/69633.
The original PR broke Milan workflows, which use a pytorch mobile build, and manifested as a memory corruption bug inside of `liboacrmerged.so`.
**Background: Existing Mobile Optimization**
Pytorch mobile builds have an existing optimization (here https://github.com/pytorch/pytorch/blob/cc23725e89713138aa1c81ce5fb4a8dbcd440ccf/c10/core/DispatchKey.h#L382 and here https://github.com/pytorch/pytorch/blob/cc23725e89713138aa1c81ce5fb4a8dbcd440ccf/aten/src/ATen/core/dispatch/OperatorEntry.h#L214), which works as follows:
Every operator in pytorch has a "dispatch table" of function pointers, corresponding to all of the (up to 64) different kernels that we might dispatch to when we run an operator in pytorch (autograd, cpu, cuda, complex number support, etc).
In mobile builds, the size of that table is shrunk from 64 to 8 to save a bunch of space, because mobile doesn't end up using the functionality associated with most dispatch keys.
The dispatcher also has a notion of "fallback kernels", which are kernels that you can register to a particular dispatch key, but should be able to work for "any operator". The array of fallback kernels is defined here: https://github.com/pytorch/pytorch/blob/cc23725e89713138aa1c81ce5fb4a8dbcd440ccf/aten/src/ATen/core/dispatch/Dispatcher.h#L294.
The mobile-optimization currently does **not** extend to this array (it wouldn't be that useful anyway because there is only one array of fallback kernels globally - vs. there is a separate dispatch table of function pointers per operator). So the per-operator tables on mobile are size 8, while the fallback table is size 64.
**The Bug**
This PR actually makes it difficult to enable that optimization separately for the per-operator arrays vs. the fallback array, and incidentally shrunk the size of the fallback array from 64 to 8 for mobile (that happened on this line: https://github.com/pytorch/pytorch/pull/69633/files#diff-f735cd7aa68f15b624100cbc4bb3b5ea76ffc7c9d3bec3b0ccabaa09609e5319R294).
That isn't a problem by itself (since mobile doesn't actually use any of the fallbacks that can no longer be stored). However, pytorch core will still register all of those fallback kernels on startup in mobile builds, even if they aren't used. When we tried to register one of those fallbacks on startup, it would try to dump the kernel somewhere in memory past the bounds of the (now smaller) array inside of the `Dispatcher` object, `backendFallbackKernels_`.
**Why didn't this problem show up in OSS CI? Why didn't it break other internal mobile workflows aside from Milan?**
Ideally, this failure would show up as part of the OSS signal on GitHub, since we already have mobile OSS builds. Given that it was another memory corruption issue that only affected Milan (subset of mobile), I'm not sure what's specific about Milan's builds that caused it only to manifest there. dreiss I wonder if there's another flavor of mobile builds we could run in OSS CI that could potentially help catch this?
**The debugging experience was pretty difficult**
Debugging the Milan-specific failure was made difficult by the following:
(1) lack of CI
- the original Milan failure didn't surface on my original diff, because the Milan job(s) that failed weren't triggered to run on pytorch changes. There's probably a balance to strike here, since those jobs will only be useful if they aren't flaky, and if they can produce reliable failure logs for debugging.
(2) It's difficult to get a repro.
- my work laptop doesn't have the right specs to run the Milan development workflow (not enough disk space)
- There is an existing OnDemand workflow for Milan, but it appears to be relatively new, and after a bunch of help from MarcioPorto, we ran into issues forwarding the log output from Milan tests on the emulator back to the terminal (see the original discussion here: https://fb.workplace.com/groups/OnDemandFRL/permalink/1424937774645433/)
(3) Lack of stack-traces.
- Most Milan failures didn't include actionable stack traces. phding generously helped me debug by running my suggested patches locally, and reporting back if there were any failures. The failing test didn't include a stack trace though (just the line where the crash appeared), so I ended up making some educated guesses about what the issue was based on the area of the crash.
ghstack-source-id: 152688542
Test Plan: Confirmed with phding that the broken Milan workflow from the previous version of this diff is now passing.
Reviewed By: phding, albanD
Differential Revision: D35222806
fbshipit-source-id: 0ad115a0f768bc8ea5d4c203b2990254c7092d30
(cherry picked from commit 002b91966f11fd55ab3fa3801b636fa39a6dd12c)
2022-03-31 21:45:53 +00:00
|
|
|
{"AutogradCPU", c10::DispatchKey::AutogradCPU},
|
|
|
|
|
{"AutogradCUDA", c10::DispatchKey::AutogradCUDA},
|
|
|
|
|
{"AutogradXLA", c10::DispatchKey::AutogradXLA},
|
|
|
|
|
{"AutogradLazy", c10::DispatchKey::AutogradLazy},
|
2022-05-31 14:26:04 +00:00
|
|
|
{"AutogradMeta", c10::DispatchKey::AutogradMeta},
|
2022-04-07 17:18:45 +00:00
|
|
|
{"AutogradIPU", c10::DispatchKey::AutogradIPU},
|
Reland: "free up dispatch key space (in C++)" (#74963)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74963
This is a re-land of D35192346 (https://github.com/pytorch/pytorch/commit/9872a06d77582e91e834103db75f774ca75f7fff) and D35192317 (https://github.com/pytorch/pytorch/commit/a9216cde6cc57f94586ea71a75a35aaabee720ff), which together are a diff that changes the internal representation of `DispatchKeySet` in pytorch core to free up the number of dispatch keys that we have available. See a more detailed description of the design in the original PR: https://github.com/pytorch/pytorch/pull/69633.
The original PR broke Milan workflows, which use a pytorch mobile build, and manifested as a memory corruption bug inside of `liboacrmerged.so`.
**Background: Existing Mobile Optimization**
Pytorch mobile builds have an existing optimization (here https://github.com/pytorch/pytorch/blob/cc23725e89713138aa1c81ce5fb4a8dbcd440ccf/c10/core/DispatchKey.h#L382 and here https://github.com/pytorch/pytorch/blob/cc23725e89713138aa1c81ce5fb4a8dbcd440ccf/aten/src/ATen/core/dispatch/OperatorEntry.h#L214), which works as follows:
Every operator in pytorch has a "dispatch table" of function pointers, corresponding to all of the (up to 64) different kernels that we might dispatch to when we run an operator in pytorch (autograd, cpu, cuda, complex number support, etc).
In mobile builds, the size of that table is shrunk from 64 to 8 to save a bunch of space, because mobile doesn't end up using the functionality associated with most dispatch keys.
The dispatcher also has a notion of "fallback kernels", which are kernels that you can register to a particular dispatch key, but should be able to work for "any operator". The array of fallback kernels is defined here: https://github.com/pytorch/pytorch/blob/cc23725e89713138aa1c81ce5fb4a8dbcd440ccf/aten/src/ATen/core/dispatch/Dispatcher.h#L294.
The mobile-optimization currently does **not** extend to this array (it wouldn't be that useful anyway because there is only one array of fallback kernels globally - vs. there is a separate dispatch table of function pointers per operator). So the per-operator tables on mobile are size 8, while the fallback table is size 64.
**The Bug**
This PR actually makes it difficult to enable that optimization separately for the per-operator arrays vs. the fallback array, and incidentally shrunk the size of the fallback array from 64 to 8 for mobile (that happened on this line: https://github.com/pytorch/pytorch/pull/69633/files#diff-f735cd7aa68f15b624100cbc4bb3b5ea76ffc7c9d3bec3b0ccabaa09609e5319R294).
That isn't a problem by itself (since mobile doesn't actually use any of the fallbacks that can no longer be stored). However, pytorch core will still register all of those fallback kernels on startup in mobile builds, even if they aren't used. When we tried to register one of those fallbacks on startup, it would try to dump the kernel somewhere in memory past the bounds of the (now smaller) array inside of the `Dispatcher` object, `backendFallbackKernels_`.
**Why didn't this problem show up in OSS CI? Why didn't it break other internal mobile workflows aside from Milan?**
Ideally, this failure would show up as part of the OSS signal on GitHub, since we already have mobile OSS builds. Given that it was another memory corruption issue that only affected Milan (subset of mobile), I'm not sure what's specific about Milan's builds that caused it only to manifest there. dreiss I wonder if there's another flavor of mobile builds we could run in OSS CI that could potentially help catch this?
**The debugging experience was pretty difficult**
Debugging the Milan-specific failure was made difficult by the following:
(1) lack of CI
- the original Milan failure didn't surface on my original diff, because the Milan job(s) that failed weren't triggered to run on pytorch changes. There's probably a balance to strike here, since those jobs will only be useful if they aren't flaky, and if they can produce reliable failure logs for debugging.
(2) It's difficult to get a repro.
- my work laptop doesn't have the right specs to run the Milan development workflow (not enough disk space)
- There is an existing OnDemand workflow for Milan, but it appears to be relatively new, and after a bunch of help from MarcioPorto, we ran into issues forwarding the log output from Milan tests on the emulator back to the terminal (see the original discussion here: https://fb.workplace.com/groups/OnDemandFRL/permalink/1424937774645433/)
(3) Lack of stack-traces.
- Most Milan failures didn't include actionable stack traces. phding generously helped me debug by running my suggested patches locally, and reporting back if there were any failures. The failing test didn't include a stack trace though (just the line where the crash appeared), so I ended up making some educated guesses about what the issue was based on the area of the crash.
ghstack-source-id: 152688542
Test Plan: Confirmed with phding that the broken Milan workflow from the previous version of this diff is now passing.
Reviewed By: phding, albanD
Differential Revision: D35222806
fbshipit-source-id: 0ad115a0f768bc8ea5d4c203b2990254c7092d30
(cherry picked from commit 002b91966f11fd55ab3fa3801b636fa39a6dd12c)
2022-03-31 21:45:53 +00:00
|
|
|
{"AutogradXPU", c10::DispatchKey::AutogradXPU},
|
2022-04-27 19:21:57 +00:00
|
|
|
{"AutogradMPS", c10::DispatchKey::AutogradMPS},
|
Reland: "free up dispatch key space (in C++)" (#74963)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74963
This is a re-land of D35192346 (https://github.com/pytorch/pytorch/commit/9872a06d77582e91e834103db75f774ca75f7fff) and D35192317 (https://github.com/pytorch/pytorch/commit/a9216cde6cc57f94586ea71a75a35aaabee720ff), which together are a diff that changes the internal representation of `DispatchKeySet` in pytorch core to free up the number of dispatch keys that we have available. See a more detailed description of the design in the original PR: https://github.com/pytorch/pytorch/pull/69633.
The original PR broke Milan workflows, which use a pytorch mobile build, and manifested as a memory corruption bug inside of `liboacrmerged.so`.
**Background: Existing Mobile Optimization**
Pytorch mobile builds have an existing optimization (here https://github.com/pytorch/pytorch/blob/cc23725e89713138aa1c81ce5fb4a8dbcd440ccf/c10/core/DispatchKey.h#L382 and here https://github.com/pytorch/pytorch/blob/cc23725e89713138aa1c81ce5fb4a8dbcd440ccf/aten/src/ATen/core/dispatch/OperatorEntry.h#L214), which works as follows:
Every operator in pytorch has a "dispatch table" of function pointers, corresponding to all of the (up to 64) different kernels that we might dispatch to when we run an operator in pytorch (autograd, cpu, cuda, complex number support, etc).
In mobile builds, the size of that table is shrunk from 64 to 8 to save a bunch of space, because mobile doesn't end up using the functionality associated with most dispatch keys.
The dispatcher also has a notion of "fallback kernels", which are kernels that you can register to a particular dispatch key, but should be able to work for "any operator". The array of fallback kernels is defined here: https://github.com/pytorch/pytorch/blob/cc23725e89713138aa1c81ce5fb4a8dbcd440ccf/aten/src/ATen/core/dispatch/Dispatcher.h#L294.
The mobile-optimization currently does **not** extend to this array (it wouldn't be that useful anyway because there is only one array of fallback kernels globally - vs. there is a separate dispatch table of function pointers per operator). So the per-operator tables on mobile are size 8, while the fallback table is size 64.
**The Bug**
This PR actually makes it difficult to enable that optimization separately for the per-operator arrays vs. the fallback array, and incidentally shrunk the size of the fallback array from 64 to 8 for mobile (that happened on this line: https://github.com/pytorch/pytorch/pull/69633/files#diff-f735cd7aa68f15b624100cbc4bb3b5ea76ffc7c9d3bec3b0ccabaa09609e5319R294).
That isn't a problem by itself (since mobile doesn't actually use any of the fallbacks that can no longer be stored). However, pytorch core will still register all of those fallback kernels on startup in mobile builds, even if they aren't used. When we tried to register one of those fallbacks on startup, it would try to dump the kernel somewhere in memory past the bounds of the (now smaller) array inside of the `Dispatcher` object, `backendFallbackKernels_`.
**Why didn't this problem show up in OSS CI? Why didn't it break other internal mobile workflows aside from Milan?**
Ideally, this failure would show up as part of the OSS signal on GitHub, since we already have mobile OSS builds. Given that it was another memory corruption issue that only affected Milan (subset of mobile), I'm not sure what's specific about Milan's builds that caused it only to manifest there. dreiss I wonder if there's another flavor of mobile builds we could run in OSS CI that could potentially help catch this?
**The debugging experience was pretty difficult**
Debugging the Milan-specific failure was made difficult by the following:
(1) lack of CI
- the original Milan failure didn't surface on my original diff, because the Milan job(s) that failed weren't triggered to run on pytorch changes. There's probably a balance to strike here, since those jobs will only be useful if they aren't flaky, and if they can produce reliable failure logs for debugging.
(2) It's difficult to get a repro.
- my work laptop doesn't have the right specs to run the Milan development workflow (not enough disk space)
- There is an existing OnDemand workflow for Milan, but it appears to be relatively new, and after a bunch of help from MarcioPorto, we ran into issues forwarding the log output from Milan tests on the emulator back to the terminal (see the original discussion here: https://fb.workplace.com/groups/OnDemandFRL/permalink/1424937774645433/)
(3) Lack of stack-traces.
- Most Milan failures didn't include actionable stack traces. phding generously helped me debug by running my suggested patches locally, and reporting back if there were any failures. The failing test didn't include a stack trace though (just the line where the crash appeared), so I ended up making some educated guesses about what the issue was based on the area of the crash.
ghstack-source-id: 152688542
Test Plan: Confirmed with phding that the broken Milan workflow from the previous version of this diff is now passing.
Reviewed By: phding, albanD
Differential Revision: D35222806
fbshipit-source-id: 0ad115a0f768bc8ea5d4c203b2990254c7092d30
(cherry picked from commit 002b91966f11fd55ab3fa3801b636fa39a6dd12c)
2022-03-31 21:45:53 +00:00
|
|
|
{"AutogradHPU", c10::DispatchKey::AutogradHPU},
|
|
|
|
|
{"AutogradPrivateUse1", c10::DispatchKey::AutogradPrivateUse1},
|
|
|
|
|
{"AutogradPrivateUse2", c10::DispatchKey::AutogradPrivateUse2},
|
|
|
|
|
{"AutogradPrivateUse3", c10::DispatchKey::AutogradPrivateUse3},
|
|
|
|
|
|
2021-09-22 16:14:01 +00:00
|
|
|
{"Autograd", c10::DispatchKey::Autograd},
|
|
|
|
|
{"CompositeImplicitAutograd",
|
|
|
|
|
c10::DispatchKey::CompositeImplicitAutograd},
|
2022-09-01 20:01:39 +00:00
|
|
|
{"CompositeImplicitAutogradNestedTensor",
|
|
|
|
|
c10::DispatchKey::CompositeImplicitAutogradNestedTensor},
|
2021-09-22 16:14:01 +00:00
|
|
|
{"CompositeExplicitAutograd",
|
|
|
|
|
c10::DispatchKey::CompositeExplicitAutograd},
|
2022-06-15 15:34:00 +00:00
|
|
|
{"CompositeExplicitAutogradNonFunctional",
|
|
|
|
|
c10::DispatchKey::CompositeExplicitAutogradNonFunctional},
|
2022-12-02 04:38:25 +00:00
|
|
|
{"FuncTorchBatchedDecomposition",
|
|
|
|
|
c10::DispatchKey::FuncTorchBatchedDecomposition},
|
2021-09-22 16:14:01 +00:00
|
|
|
};
|
|
|
|
|
auto it = key_map.find(k);
|
[PrivateUse1] Support parseDispatchKey with modified PrivateUse1 (#144325)
PyTorch now support many private1 backend names like `AutogradPrivateUse1` or `QuantizedPrivateUse1`, not mentioned the original `PrivateUse1` backend.
However, users that implement `PrivateUse1` funtionalities would modified the backend name by calling `torch.utils.rename_privateuse1_backend("my_backend")`, in that case, all `PrivateUse1` backend string would not be found when we call other functions related to it. For example, we utilize `torch.library` to register some customize functions to our new backend, we would use "my_backend" as the backend name instead of "PrivateUse1", in which the error will be throw:
```
could not parse dispatch key 'my_backend'
```
So, this PR changed the function `c10::DispatchKey parseDispatchKey(const std::string& k)`, it would double check if the `PrivateUse1` has been modified, and if so, we would change `k` to adapt new backend name then find it again.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144325
Approved by: https://github.com/albanD
2025-01-14 21:21:28 +00:00
|
|
|
if (it == key_map.end() && c10::get_privateuse1_backend() != "PrivateUse1") {
|
|
|
|
|
std::string pu1_backend_name = c10::get_privateuse1_backend();
|
|
|
|
|
std::transform(
|
|
|
|
|
pu1_backend_name.begin(),
|
|
|
|
|
pu1_backend_name.end(),
|
|
|
|
|
pu1_backend_name.begin(),
|
|
|
|
|
::toupper);
|
|
|
|
|
std::string processed_k =
|
|
|
|
|
std::regex_replace(k, std::regex(pu1_backend_name), "PrivateUse1");
|
|
|
|
|
it = key_map.find(processed_k);
|
|
|
|
|
}
|
2021-09-22 16:14:01 +00:00
|
|
|
TORCH_CHECK(it != key_map.end(), "could not parse dispatch key: ", k);
|
|
|
|
|
return it->second;
|
|
|
|
|
}
|
|
|
|
|
|
2020-01-15 19:12:17 +00:00
|
|
|
} // namespace c10
|