pytorch/test/cpp_extensions/open_registration_extension
Jane Xu be27dbf2b8 Enable CPP/CUDAExtension with py_limited_api for python agnosticism (#138088)
This was being tested with ao (torchao), but this PR now adds a real test as well.

## What does this PR do?

We want to allow custom PyTorch extensions to build one wheel for multiple Python versions, in other words, to achieve python agnosticism. It turns out setuptools/Python already provide a way to do this! Namely, if the user promises to use only the Python limited API in their extension, they can pass `py_limited_api` to their Extension class and to the bdist_wheel command (with a minimum Python version) in order to build one wheel that will suffice across multiple Python versions.
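
Concretely, a setup.py following this recipe might look like the sketch below. The package name, source file, and minimum-version tag are hypothetical examples, not part of this PR:

```python
# Hypothetical setup.py for a py_limited_api extension (names are examples).
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CppExtension

setup(
    name="my_extension",
    ext_modules=[
        CppExtension(
            "my_extension._C",
            ["csrc/ops.cpp"],
            # Promise to use only the Python limited API in csrc/ops.cpp:
            py_limited_api=True,
        )
    ],
    cmdclass={"build_ext": BuildExtension},
    # Tag the wheel as abi3 so one wheel covers CPython >= 3.9:
    options={"bdist_wheel": {"py_limited_api": "cp39"}},
)
```

With this in place, `python setup.py bdist_wheel` produces a single `abi3`-tagged wheel rather than one wheel per Python version.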

Sounds lovely! Why don't people already do this with PyTorch? Well, two things. This workflow is hardly documented (even searching specifically for "python agnostic" does not reveal many answers), so I'd expect that people simply don't know about it. But even if they did, _PyTorch_ custom extensions would still not work, because we always link torch_python, which does not abide by py_limited_api rules.

So this is where this PR comes in! We respect it when the user specifies py_limited_api and skip linking torch_python under that condition, allowing users to opt into the functionality just described.

## How do I know this PR works?

I manually tested my silly little ultra_norm locally (with `import python_agnostic`) and wrote a test case for the extension showing that:
- torch_python doesn't show up in the ldd tree
- no `Py`- symbols show up

It may be a little confusing that our test case is actually python-free (more clean than python-agnostic), but it is sufficient (though not necessary) for showing that this change works.
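
The two checks can be expressed as a small shell helper; the `.so` path in the comment is hypothetical, so adjust it to wherever your build lands:

```shell
# Sketch of the two checks the test performs, as a reusable helper.
check_so() {
    local so="$1"
    # 1) torch_python must not appear among the linked libraries.
    if ldd "$so" | grep -q 'libtorch_python'; then
        echo "FAIL: $so links libtorch_python"; return 1
    fi
    # 2) no Py-prefixed symbols should show up in the dynamic symbol table.
    if nm -D "$so" 2>/dev/null | grep -q ' Py'; then
        echo "FAIL: $so references Py symbols"; return 1
    fi
    echo "OK: $so"
}

# e.g. check_so python_agnostic/_C.so  (hypothetical path)
```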

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138088
Approved by: https://github.com/ezyang, https://github.com/albanD
2024-12-11 18:22:55 +00:00
| File | Latest commit | Date |
|---|---|---|
| pytorch_openreg | OpenReg: Fix releasing tensor issue when exiting process (#140936) | 2024-11-22 13:50:35 +00:00 |
| test | Openreg: Add RNG Generator (#138449) | 2024-11-20 09:27:55 +00:00 |
| README.md | openreg add pin_memory (#135339) | 2024-10-09 00:07:59 +00:00 |
| setup.py | Enable CPP/CUDAExtension with py_limited_api for python agnosticism (#138088) | 2024-12-11 18:22:55 +00:00 |

This folder contains a self-contained example of a PyTorch out-of-tree backend leveraging the "PrivateUse1" backend from core.

How to use

Install as a standalone package with `python setup.py develop` (or `install`) from this folder. You can run the tests via `python test/test_openreg.py`.

Design principles

For simplicity, anything that can be implemented from Python is done so. A real implementation will most likely want to call these different APIs from C++ directly.

The current version sends everything back to Python and contains enough implementation to run a basic model, perform host/device transfers, and print tensors.

The codebase is split as follows:

  • pytorch_openreg/__init__.py imports torch to get core state initialized, imports ._aten_impl to register our aten op implementations with torch, imports .C to load our C++ extension that registers more ops, the allocator, and hooks, and finally renames the PrivateUse1 backend and registers our Python-side module.
  • pytorch_openreg/_aten_impl.py does two main things: it uses the _register_same_name() function to register hooks from C++ (like getDevice, getStream, etc.) and send them to our device daemon, and it defines a new torch.Library that registers a fallback to be called whenever a backend kernel for PrivateUse1 is needed. That fallback contains the logic to handle all kinds of native functions: computing the output metadata, allocating the output, and only then calling into the device daemon to perform the computation.
  • pytorch_openreg/_device_daemon.py contains the Allocator (responsible for allocating memory on the device side, as int8 buffers, and for recreating nice-looking Tensors on the device side so that aten ops can run there) and run_op, the logic that runs on the device side to perform compute (for simplicity of coverage, we re-build full-blown Tensors here and call aten ops on them). It also contains the Daemon responsible for the device worker process and for sending data back and forth.
  • pytorch_openreg/_meta_parser.py mainly contains utilities to send objects over the wire from the user process to the device process. The main class there is OpenRegTensorMeta, which contains all the metadata sent to the device, which should be enough for it to populate the output Tensor.
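
As a rough illustration of the allocator-plus-daemon split described above, here is a simplified stdlib-only sketch. All names are hypothetical, and where the real _device_daemon.py uses a separate device worker process, this sketch uses a thread for brevity:

```python
# Simplified sketch of the device-daemon idea: the "device" holds raw
# int8-style byte buffers, and the user side only sends requests plus
# metadata. Illustrative only; not the actual _device_daemon.py code.
import queue
import threading

def _daemon_loop(requests, replies):
    buffers = {}          # "device memory": handle -> bytearray
    next_handle = 0
    while True:
        cmd, *args = requests.get()
        if cmd == "malloc":               # allocate an int8 buffer
            (nbytes,) = args
            buffers[next_handle] = bytearray(nbytes)
            replies.put(next_handle)
            next_handle += 1
        elif cmd == "copy_in":            # host -> device transfer
            handle, data = args
            buffers[handle][: len(data)] = data
            replies.put(None)
        elif cmd == "copy_out":           # device -> host transfer
            (handle,) = args
            replies.put(bytes(buffers[handle]))
        elif cmd == "exit":
            replies.put(None)
            return

class Daemon:
    """User-side handle that sends commands to the device worker."""
    def __init__(self):
        self._req, self._rep = queue.Queue(), queue.Queue()
        self._worker = threading.Thread(
            target=_daemon_loop, args=(self._req, self._rep), daemon=True
        )
        self._worker.start()

    def exec(self, cmd, *args):
        self._req.put((cmd, *args))
        return self._rep.get()

d = Daemon()
h = d.exec("malloc", 4)
d.exec("copy_in", h, b"\x01\x02\x03\x04")
assert d.exec("copy_out", h) == b"\x01\x02\x03\x04"  # round trip
d.exec("exit")
```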

Next steps

Currently, the autograd test is disabled because it's missing the getStream implementation. The main next steps would be to:

  • Split the daemon into a proper user-process driver vs device-process executor. The main goal would be to better mimic which information is held on the user-process side and when we actually communicate with the device. In particular, the current device or stream should be user-process information.
  • Add a Stream/Event system, most likely by having multiple request queues that go from the driver to the device.
  • Add RNG Generator.
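
One way the "multiple request queues" idea could look, sketched with plain threads. This is purely illustrative and all names are hypothetical, not the planned design:

```python
# Hypothetical sketch of per-stream request queues: each stream gets its
# own FIFO worker, so ops on different streams may interleave while ops
# on the same stream stay strictly ordered.
import queue
import threading

class Stream:
    def __init__(self):
        self._q = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def _run(self):
        while True:
            op = self._q.get()
            if op is None:
                return
            op()                     # execute the queued device op

    def enqueue(self, op):
        self._q.put(op)

    def synchronize(self):
        done = threading.Event()
        self._q.put(done.set)        # an "event record" marker op
        done.wait()                  # block until the queue reaches it

s = Stream()
results = []
s.enqueue(lambda: results.append(1))
s.enqueue(lambda: results.append(2))
s.synchronize()
assert results == [1, 2]             # same-stream ops stay ordered
```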

Longer term:

  • Replace the current open_registration_extension.cpp test in PyTorch CI with this.
  • Build this module in the CI environment and enable Device-generic tests on this device.