Packaging projects with GPU code

Modern Graphics Processing Units (GPUs) can be used, in addition to their original purpose (rendering graphics), for high-performance numerical computing. They are particularly important for deep learning, but also widely used for data science and traditional scientific computing and image processing application.

GPUs from NVIDIA using the CUDA programming language are dominant in deep learning and scientific computing as of today. With both AMD and Intel releasing GPUs and other programming languages for them (ROCm, SYCL, OpenCL), the landscape may become more diverse in the future. In addition, Google provides Tensor Processing Units access in Google Cloud Platform, and a host of startups are developing custom accelerator hardware for high-performance computing applications.

Prominent projects which rely on GPUs and are either Python-only or widely used from Python include TensorFlow, PyTorch, CuPy, JAX, RAPIDS, MXNet, XGBoost, Numba, OpenCV, Horovod and PyMC.

Packaging such projects for PyPI has been, and still is, quite challenging.

Current state

As of mid-2023, PyPI and Python packaging tools are completely unaware of GPUs, and of CUDA. There is no way to mark a package as needing a GPU in sdist or wheel metadata, or as containing GPU-specific code (CUDA or otherwise). A GPU is hardware that may or may not be present in a machine that a Python package is being installed on - pip and other installers are unaware of this. If wheels contain CUDA code, they may require a specific version of the CUDA Toolkit (CTK) to be installed. Again, installers do not know this and there is no way to express this dependency. The same will be true for ROCm and other types of GPU hardware and languages.

NVIDIA has made steps towards better support for CUDA on PyPI. Various library components of the CTK have been packaged as wheels and are now distributed on PyPI, such as nvidia-cublas-cu11, although special care is needed to consume them due to lack of symlinks in wheels. Python wrappers around CUDA runtime and driver APIs have been consolidated into CUDA Python (website, PyPI package), but this package assumes that the CUDA driver and NVRTC are already installed since it only provides Python bindings to the APIs (and no bindings for CUDA libraries are provided as of yet). Many other projects remain hosted on NVIDIA's PyPI Index, which also includes rebuilds of TensorFlow and other packages.

A single CUDA version supports a reasonable range of GPU architectures. New CUDA versions get released regularly, and because they come with bug fixes, improved performance, or new functionalities, it may be necessary or desirable to build new wheels for that CUDA version. If only the supported CUDA version is different between two wheels, the wheel tags and filename will be identical. Hence it is not possible to upload more than one of those wheels under the same package name.

Historically, this required projects to produce packages specific to CUDA minor versions. Projects would either support only one CUDA version on PyPI, or create and self-host different packages. PyTorch and TensorFlow do the former, with TensorFlow supporting only a single CUDA version, and PyTorch providing more wheels for other CUDA versions and a CPU-only version in a separate wheelhouse (see pytorch.org/get-started). CuPy provides a number of packages: cupy-cuda102, cupy-cuda110, cupy-cuda111, cupy-rocm-4-3, cupy-rocm-5-0. This works, but adds maintenance overhead to project developers and consumes more storage and network bandwidth for PyPI.org. Moreover, it also prevents downstream projects from properly declaring the dependency unless they also follow a similar multi-package approach. As of CUDA 11, CUDA promises binary compatibility across minor versions, which allows building packages compatible across an entire CUDA major version. For example, CuPy now leverages this to produce wheels like cupy-cuda11x and cupy-cuda12x that work for any CUDA 11.x or CUDA 12.x version, respectively, that a user has installed. However, libraries that package PTX code cannot take advantage of this process yet (see below).

GPU packages tend to result in very large wheels. This is mainly because compiled GPU libraries must support a number of architectures, leading to large binary sizes. These effects are compounded by the requirements imposed by the manylinux standard for Linux wheels, which results in many large libraries being bundled into a single wheel (see Native dependencies for details). This is true in particular for deep learning packages because they link in cuDNN. For example, recent manylinux2014 wheels for TensorFlow are 588 MB (wheels for 2.11.0 ), and for PyTorch those are 890 MB (wheels for 1.13.0 ). The problems around and causes of GPU wheel sizes were discussed in depth in this Packaging thread on Discourse.

So far we have only discussed individual projects containing GPU code. Those projects are the most fundamental libraries in larger stacks of packages (perhaps even whole ecosystems). Hence, other projects will want to declare a dependency on them. This is currently quite difficult, because of the implicit coupling through a shared CUDA version. If a project like PyTorch releases a new version and bumps the default CUDA version used in the torch wheels, then any downstream package which also contains CUDA code will break unless it has an exact == pin on the older torch version, and then releases a new version of its own for the new CUDA version. Such synchronized releases are hard to do. If there were a way to declare a dependency on CUDA version (e.g., through a metapackage on PyPI), that strong coupling between packages would not be necessary.

Other package managers typically do have support for CUDA:

Conda: provides a virtual __cuda package used to determine the CUDA versions supported by the local driver (which must be installed outside of conda). Up through CUDA 11, the CTK is supported via the cudatoolkit conda-forge package. For CUDA 12 and above, the CTK is split into various packages by component (e.g. cudart, CCCL, nvcc, etc.). The cuda-version metapackage provides a means for selecting to select the appropriate version of different CTK components for a specific CUDA minor version.
Spack: supports building with or without CUDA, and allows specifying supported GPU architectures: docs. CUDA itself can be specified as externally provided, and is recommended to be installed directly from NVIDIA: docs. ROCm is supported in a similar fashion,
Ubuntu: provides one CUDA version per Ubuntu release: nvidia-cuda-toolkit package,
Arch Linux: provides one CUDA version: cuda package.

Those package managers typically also provide CUDA-related development tools, and build all the most popular deep learning and numerical computing packages for the CUDA version they ship.

Problems

The problems around GPU packages include:

User-friendliness:

Installs depend on a specific CUDA or ROCm version, and pip does not know about this. Hence installs may succeed, followed by errors at runtime,
CUDA or ROCm must be installed through another package manager or a direct download from the vendor. And the other package manager upgrading CUDA or ROCm may silently break the installed Python package,
Wheels may have to come from a separate wheelhouse, requiring install commands like python -m pip install torch --extra-index-url https://download.pytorch.org/whl/cu116 which are easy to get wrong,
The very large download sizes are problematic for users on slow network connections or plans with a maximum amount of bandwidth usage for a given month (pip potentially downloading multiple wheels because of backtracking in the resolver is extra painful here).

Maintainer effort:

Keeping wheel sizes below either the 1 GB hard limit or the current PyPI file size or total project size limits can be a lot of work (or even impossible),
Hosting your own wheelhouse to support multiple CUDA or ROCm versions is a lot of work,
Depending on another GPU package is difficult, and likely requires a == pin,
A dependency on CUDA, ROCm, or a specific version of them cannot be expressed in metadata, hence maintaining build environments is more error-prone than it has to be.

For PyPI itself:

The large amount of space and bandwidth consumed by GPU packages. pypi.org/stats shows under "top projects by total package size" that many of the largest package are GPU ones, and that together they consume a significant fraction (estimated at ~20% for the ones listed in the top 100) of the total size for all of PyPI.

History

Support for GPUs and CUDA has been discussed on and off on distutils-sig and the Packaging Discourse:

Environment markers for GPU/CUDA availability thread on distutils-sig (2018),
The next manylinux specification thread on Discourse (2019), with a specific comment about presence/absence of GPU hardware and CUDA libraries being out of scope,
What to do about GPUs? (and the built distributions that support them) on a Packaging thread on Discourse (2021),
Selecting variant wheels according to a semi-static specification thread on Discourse (2024),
Implementation variants: rehashing and refocusing thread on Discourse (2024).

The ultimate thread has eventually lead into the Wheel Variants proposal that is being developed as part of the WheelNext initiative.

Relevant resources

How to use cuQuantum wheels

Potential solutions or mitigations

Potential solutions on the PyPI side include:

add specific wheel tags or metadata for the most popular libraries,
make an environment marker or selector package approach work,
improve interoperability with other package managers, in order to be able to declare a dependency on a CUDA or ROCm version as externally provided,

A potential solution is the Wheel Variants proposal that permits publishing multiple variants of the same wheel and selecting between them using a plugin-based system. For example, it makes it possible to publish variants built for different GPU runtimes and use plugins to select a variant matching the driver used on the system, or fall back to a lean CPU-only variant.

Additional Notes on CUDA Compatibility

Compatibility across CUDA versions is a common problem with numerous pitfalls to be aware of. There are three primary components of CUDA:

The CUDA Toolkit (CTK): This component includes the CUDA runtime library (libcudart.so) along with a range of other libraries and tools including math libraries like cuBLAS. libcudart.so and a few other core headers and libraries are the bare minimum required to compile and link CUDA code.
The User-Mode Driver (UMD): This is the libcuda.so library. This library is required to load and run CUDA code.
The Kernel-Mode Driver (KMD): The nvidia.ko file. This constitutes what is typically considered a "driver" in common parlance when referring to other peripherals connected to a computer.

For most compatibility at the level of source code, the KMD can be ignored providing that it meets CUDA's requirement. For the rest of this section, therefore, the "driver" will always be referring to the UMD.

CUDA drivers have always promised binary compatibility: any code that runs with some driver version X is installed will always run correctly with some newer driver version Y>X. As briefly discussed above, though, as of CUDA 11 the CUDA toolkit makes a number of additional compatibility guarantees.

The first is what is typically termed Minor Version Compatibility (MVC) . MVC promises that code built using any version of the CUDA runtime will work on any driver within the same major family. This behavior is useful because it is often easier for users to upgrade their CUDA runtime than it is to upgrade the driver, especially on shared machines. For instance, the CTK may be installed using conda, while the driver library cannot be. An example of leveraging MVC would be compiling code against the CUDA 11.5 runtime library and then running it on a system with a CUDA 11.2 driver installed. MVC allows distributors of Python packages to only require that users have a minimum required driver installed rather than needing a more exact match as in prior versions of CUDA.

Beyond CTK/driver compatibility, CUDA 11 also added increased support for compatibility between versions of the CTK itself. CUDA 11 promised forward binary compatibility across minor versions: code compiled with 11.x will also work on 11.y>11.x (for which the driver compatibility guaranteed by MVC was a prerequisite). This binary compatibility also means that binaries are backwards compatible, within certain limitations. In particular, binaries compiled with a newer CTK will run on an older CTK, so long as no features are used that require the newer CTK. If your code uses features that require a newer CTK (or a newer driver, in MVC contexts), then you must include suitable runtime checks in your code (using e.g. cudaDriverGetVersion) to ensure compatibility. The combination of these CTK compatibility promises and MVC is most often termed CUDA Enhanced Compatibility (CEC).

The above compatibility guarantees refer to binary compatibility for CUDA code compiled down to NVIDIA's machine instructions (SASS). However, there are other important situations that are not covered by these guarantees

NVRTC

NVRTC did not start supporting MVC until CUDA 11.2. Therefore, code that uses NVRTC for JIT compilation must have been compiled with a CUDA version >= 11.2. Moreover, NVRTC only works for a single translation unit that requires no linking because linking is not possible without nvJitLink (see below).

PTX

CEC does not apply to PTX. PTX is an instruction set that the CUDA driver can JIT-compile to SASS. The standard CUDA compilation pipeline includes the translation of CUDA source code into PTX from which SASS is generated, but for various reasons projects may choose to include PTX code in their final libraries to be JIT-compiled at runtime instead (one reason is because PTX can be compiled for the architecture on the target system at runtime instead of having to precompile it for a subset of supported architectures at compile-time). However, since MVC does not cover JIT-compiled PTX code, PTX generated using a particular CTK may not work with an older driver. This fact has two consequences:

Libraries that package PTX code will not benefit from MVC.
Libraries that leverage any sort of JIT-compilation pipeline that generates PTX code will also not support MVC. This can lead to particularly surprising behaviors, such as when a user has a newer CTK than the installed driver and then uses numba.cuda to compile a Python function since Numba compiles CUDA kernels to PTX as part of its pipeline.

Prior to CUDA 12, CUDA itself provides no general solutions to this problem, although in some cases there are tools that may help (for instance, Numba supports MVC in CUDA 11 starting with numba 0.57). CUDA 12 introduced the nvJitLink library as the long-term solution to this problem. nvJitLink may be leveraged to compile PTX and link the resulting executables in a minor version compatible manner. The pynvjitlink package is a Python wrapper for nvjitlink that can be used to enable enhanced compatibility for numba's JIT-compiled kernels.