Packaging projects with GPU code
Modern Graphics Processing Units (GPUs) can be used, in addition to their original purpose (rendering graphics), for high-performance numerical computing. They are particularly important for deep learning, but are also widely used for data science, traditional scientific computing, and image processing applications.
As of today, NVIDIA GPUs programmed with the CUDA programming language are dominant in deep learning and scientific computing. With both AMD and Intel releasing GPUs and other programming languages for them (ROCm, SYCL, OpenCL), the landscape may become more diverse in the future. In addition, Google provides access to Tensor Processing Units (TPUs) on Google Cloud Platform, and a host of startups are developing custom accelerator hardware for high-performance computing applications.
Prominent projects which rely on GPUs and are either Python-only or widely used from Python include TensorFlow, PyTorch, CuPy, JAX, RAPIDS, MXNet, XGBoost, Numba, OpenCV, Horovod and PyMC.
Packaging such projects for PyPI has been, and still is, quite challenging.
Current state
As of mid-2023, PyPI and Python packaging tools are completely unaware of
GPUs, and of CUDA. There is no way to mark a package as needing a GPU in sdist
or wheel metadata, or as containing GPU-specific code (CUDA or otherwise). A
GPU is hardware that may or may not be present in a machine that a Python
package is being installed on - pip
and other installers are unaware of this.
If wheels contain CUDA code, they may require a specific version of the CUDA
Toolkit (CTK) to be installed. Again, installers do not know this and there is
no way to express this dependency. The same will be true for ROCm and other
types of GPU hardware and languages.
NVIDIA has taken steps towards better support for CUDA on PyPI. Various library components of the CTK have been packaged as wheels and are now distributed on PyPI, such as `nvidia-cublas-cu11`, although special care is needed to consume them due to the lack of symlinks in wheels. Python wrappers around the CUDA runtime and driver APIs have been consolidated into CUDA Python (website, PyPI package), but this package assumes that the CUDA driver and NVRTC are already installed, since it only provides Python bindings to the APIs (and no bindings for the CUDA libraries are provided as of yet). Many other projects remain hosted on NVIDIA's PyPI index, which also includes rebuilds of TensorFlow and other packages.
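To make the symlink point concrete, below is a minimal sketch of how a consumer of the `nvidia-cublas-cu11` wheel might locate and load the bundled cuBLAS library. The directory layout assumed here (`site-packages/nvidia/cublas/lib/`) reflects the wheel contents at the time of writing and may change.

```python
# Hedged sketch: locating the cuBLAS shared library shipped inside the
# nvidia-cublas-cu11 wheel. Because wheels cannot contain symlinks, the usual
# libcublas.so -> libcublas.so.11 symlink chain is absent, so consumers have to
# find and load the bundled, fully-versioned file themselves.
import ctypes
import glob
import os
import sysconfig

lib_dir = os.path.join(sysconfig.get_paths()["platlib"], "nvidia", "cublas", "lib")
candidates = sorted(glob.glob(os.path.join(lib_dir, "libcublas.so*")))
print("found:", candidates)

if candidates:
    # Load the versioned library directly, since no unversioned symlink exists.
    cublas = ctypes.CDLL(candidates[0])
```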
A single CUDA version supports a reasonable range of GPU architectures. New CUDA versions get released regularly, and because they come with bug fixes, improved performance, or new functionality, it may be necessary or desirable to build new wheels for a new CUDA version. If only the supported CUDA version differs between two wheels, the wheel tags and filenames will be identical. Hence it is not possible to upload more than one of those wheels under the same package name.
Historically, this required projects to produce packages specific to CUDA minor versions. Projects would either support only one CUDA version on PyPI, or create and self-host different packages. PyTorch and TensorFlow do the former, with TensorFlow supporting only a single CUDA version, and PyTorch providing additional wheels for other CUDA versions and a CPU-only version in a separate wheelhouse (see pytorch.org/get-started). CuPy provides a number of packages: `cupy-cuda102`, `cupy-cuda110`, `cupy-cuda111`, `cupy-rocm-4-3`, `cupy-rocm-5-0`. This works, but adds maintenance overhead for project developers and consumes more storage and network bandwidth on PyPI.org. Moreover, it also prevents downstream projects from properly declaring the dependency unless they also follow a similar multi-package approach (a sketch of what that looks like is shown below).

As of CUDA 11, CUDA promises binary compatibility across minor versions, which allows building packages compatible across an entire CUDA major version. For example, CuPy now leverages this to produce wheels like `cupy-cuda11x` and `cupy-cuda12x` that work with any CUDA 11.x or CUDA 12.x version, respectively, that a user has installed. However, libraries that package PTX code cannot take advantage of this yet (see below).
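As a sketch of what the multi-package approach means for a downstream project, a hypothetical package could expose one optional extra per CUDA flavour, leaving the choice to the user because there is no way to express "whichever CuPy matches the locally installed CUDA" in metadata. The project name and extras below are purely illustrative.

```python
# Hypothetical setup.py for a downstream project depending on CuPy.
# Each extra mirrors one CuPy flavour; users must pick the right one manually,
# e.g. `pip install my-gpu-library[cuda12]`.
from setuptools import setup

setup(
    name="my-gpu-library",  # illustrative name
    version="0.1.0",
    extras_require={
        "cuda11": ["cupy-cuda11x"],
        "cuda12": ["cupy-cuda12x"],
        "rocm-5-0": ["cupy-rocm-5-0"],
    },
)
```

Nothing prevents a user from installing the wrong extra (or none at all); the mismatch only surfaces at runtime.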
GPU packages tend to result in very large wheels. This is mainly because compiled GPU libraries must support a number of architectures, leading to large binary sizes. These effects are compounded by the requirements imposed by the manylinux standard for Linux wheels, which results in many large libraries being bundled into a single wheel (see Native dependencies for details). This is true in particular for deep learning packages, because they link in cuDNN. For example, recent manylinux2014 wheels for TensorFlow are 588 MB (wheels for 2.11.0), and for PyTorch those are 890 MB (wheels for 1.13.0). The problems around and causes of GPU wheel sizes were discussed in depth in this Packaging thread on Discourse.
So far we have only discussed individual projects containing GPU code. Those projects are the most fundamental libraries in larger stacks of packages (perhaps even whole ecosystems). Hence, other projects will want to declare a dependency on them. This is currently quite difficult, because of the implicit coupling through a shared CUDA version. If a project like PyTorch releases a new version and bumps the default CUDA version used in the `torch` wheels, then any downstream package which also contains CUDA code will break unless it has an exact `==` pin on the older `torch` version, and then releases a new version of its own for the new CUDA version. Such synchronized releases are hard to do. If there were a way to declare a dependency on a CUDA version (e.g., through a metapackage on PyPI), that strong coupling between packages would not be necessary.
Other package managers typically do have support for CUDA:
- Conda: provides a virtual `__cuda` package used to determine the CUDA versions supported by the local driver (which must be installed outside of conda). Up through CUDA 11, the CTK is supported via the `cudatoolkit` conda-forge package. For CUDA 12 and above, the CTK is split into various packages by component (e.g. cudart, CCCL, nvcc, etc.). The `cuda-version` metapackage provides a means for selecting the appropriate versions of the different CTK components for a specific CUDA minor version,
- Spack: supports building with or without CUDA, and allows specifying supported GPU architectures: docs. CUDA itself can be specified as externally provided, and is recommended to be installed directly from NVIDIA: docs. ROCm is supported in a similar fashion,
- Ubuntu: provides one CUDA version per Ubuntu release: `nvidia-cuda-toolkit` package,
- Arch Linux: provides one CUDA version: `cuda` package.
Those package managers typically also provide CUDA-related development tools, and build all the most popular deep learning and numerical computing packages for the CUDA version they ship.
Problems
The problems around GPU packages include:
User-friendliness:
- Installs depend on a specific CUDA or ROCm version, and `pip` does not know about this. Hence installs may succeed, followed by errors at runtime (a minimal defensive check for this failure mode is sketched after these lists),
- CUDA or ROCm must be installed through another package manager or a direct download from the vendor. And the other package manager upgrading CUDA or ROCm may silently break the installed Python package,
- Wheels may have to come from a separate wheelhouse, requiring install commands like `python -m pip install torch --extra-index-url https://download.pytorch.org/whl/cu116` which are easy to get wrong,
- The very large download sizes are problematic for users on slow network connections or plans with a maximum amount of bandwidth usage for a given month (`pip` potentially downloading multiple wheels because of backtracking in the resolver is extra painful here).
Maintainer effort:
- Keeping wheel sizes below either the 1 GB hard limit or the current PyPI file size or total project size limits can be a lot of work (or even impossible),
- Hosting your own wheelhouse to support multiple CUDA or ROCm versions is a lot of work,
- Depending on another GPU package is difficult, and likely requires a `==` pin,
- A dependency on CUDA, ROCm, or a specific version of them cannot be expressed in metadata, hence maintaining build environments is more error-prone than it has to be.
For PyPI itself:
- The large amount of space and bandwidth consumed by GPU packages. pypi.org/stats shows under "top projects by total package size" that many of the largest packages are GPU ones, and that together they consume a significant fraction (estimated at ~20% for the ones listed in the top 100) of the total size for all of PyPI.
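Returning to the first user-friendliness problem above: since installers cannot verify that a driver is present, packages sometimes perform a defensive check themselves. The following is a minimal sketch of such a check for Linux systems; the error message and the decision to raise `ImportError` are illustrative choices, not an established convention.

```python
# Minimal sketch of an import-time driver check a GPU package might perform,
# assuming a Linux system where the NVIDIA driver exposes libcuda.so.1.
import ctypes

def cuda_driver_available() -> bool:
    try:
        ctypes.CDLL("libcuda.so.1")  # the user-mode driver library
        return True
    except OSError:
        return False

if not cuda_driver_available():
    raise ImportError(
        "No CUDA driver (libcuda.so.1) found; install the NVIDIA driver "
        "or use a CPU-only build of this package."
    )
```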
History
Support for GPUs and CUDA has been discussed on and off on distutils-sig and the Packaging Discourse:
- Environment markers for GPU/CUDA availability thread on distutils-sig (2018),
- The next manylinux specification thread on Discourse (2019), with a specific comment about presence/absence of GPU hardware and CUDA libraries being out of scope,
- What to do about GPUs? (and the built distributions that support them), a Packaging thread on Discourse (2021).
None of the suggested ideas in those threads gained traction, mostly due to a combination of the complexity of the problem, difficulty of implementing support in packaging tools, and lack of people to work on a solution.
Relevant resources
Potential solutions or mitigations
Potential solutions on the PyPI side include:
- add specific wheel tags or metadata for the most popular libraries,
- make an environment marker or selector package approach work,
- improve interoperability with other package managers, in order to be able to declare a dependency on a CUDA or ROCm version as externally provided.
Additional Notes on CUDA Compatibility
Compatibility across CUDA versions is a common problem with numerous pitfalls to be aware of. There are three primary components of CUDA:
- The CUDA Toolkit (CTK): This component includes the CUDA runtime library (libcudart.so) along with a range of other libraries and tools including math libraries like cuBLAS. libcudart.so and a few other core headers and libraries are the bare minimum required to compile and link CUDA code.
- The User-Mode Driver (UMD): This is the libcuda.so library. This library is required to load and run CUDA code.
- The Kernel-Mode Driver (KMD): The nvidia.ko file. This constitutes what is typically considered a "driver" in common parlance when referring to other peripherals connected to a computer.
For most compatibility considerations the KMD can be ignored, provided that it meets CUDA's minimum requirements. For the rest of this section, therefore, "driver" always refers to the UMD.
CUDA drivers have always promised binary compatibility: any code that runs with a given driver version X will also run correctly with any newer driver version Y > X. As briefly discussed above, though, as of CUDA 11 the CUDA Toolkit makes a number of additional compatibility guarantees.
The first is what is typically termed Minor Version Compatibility (MVC). MVC promises that code built using any version of the CUDA runtime will work on any driver within the same major family. This behavior is useful because it is often easier for users to upgrade their CUDA runtime than it is to upgrade the driver, especially on shared machines. For instance, the CTK may be installed using conda, while the driver library cannot be. An example of leveraging MVC would be compiling code against the CUDA 11.5 runtime library and then running it on a system with a CUDA 11.2 driver installed. MVC allows distributors of Python packages to only require that users have a minimum required driver installed, rather than needing a more exact match as in prior versions of CUDA.
Beyond CTK/driver compatibility, CUDA 11 also added increased support for compatibility between versions of the CTK itself. CUDA 11 promised forward binary compatibility across minor versions: code compiled with 11.x will also work on 11.y > 11.x (for which the driver compatibility guaranteed by MVC is a prerequisite). This binary compatibility also means that binaries are backwards compatible, within certain limitations. In particular, binaries compiled with a newer CTK will run on an older CTK, so long as no features are used that require the newer CTK. If your code uses features that require a newer CTK (or a newer driver, in MVC contexts), then you must include suitable runtime checks in your code (using e.g. `cudaDriverGetVersion`) to ensure compatibility, as sketched below. The combination of these CTK compatibility promises and MVC is most often termed CUDA Enhanced Compatibility (CEC).
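A minimal sketch of such a runtime check, assuming the cuda-python bindings are installed, is shown below; the 11.8 threshold is an arbitrary illustrative value.

```python
# Sketch: comparing the versions reported by the CUDA runtime and the driver
# before relying on a feature that needs a newer component. Versions are
# encoded as 1000 * major + 10 * minor (e.g. 11020 for CUDA 11.2).
from cuda import cudart

err, driver_version = cudart.cudaDriverGetVersion()
err, runtime_version = cudart.cudaRuntimeGetVersion()
print(f"driver: {driver_version}, runtime: {runtime_version}")

# Under MVC the runtime may be newer than the driver, so features introduced
# after the driver's version must be guarded explicitly.
if driver_version < 11080:  # illustrative threshold: CUDA 11.8
    print("Driver predates CUDA 11.8; taking a conservative code path")
```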
The above compatibility guarantees refer to binary compatibility for CUDA code compiled down to NVIDIA's machine instructions (SASS). However, there are other important situations that are not covered by these guarantees.
NVRTC
NVRTC did not start supporting MVC until CUDA 11.2. Therefore, code that uses NVRTC for JIT compilation must have been compiled with a CUDA version >= 11.2. Moreover, NVRTC only works for a single translation unit that requires no linking because linking is not possible without nvJitLink (see below).
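As a small sketch of how a package might guard against this at runtime, the NVRTC version can be queried through the cuda-python bindings (assuming cuda-python and an NVRTC shared library are installed):

```python
# Sketch: verifying that the available NVRTC is new enough (>= 11.2) to take
# part in minor version compatibility, using the cuda-python nvrtc bindings.
from cuda import nvrtc

err, major, minor = nvrtc.nvrtcVersion()
if err != nvrtc.nvrtcResult.NVRTC_SUCCESS:
    raise RuntimeError(f"nvrtcVersion failed: {err}")
if (major, minor) < (11, 2):
    raise RuntimeError(f"NVRTC {major}.{minor} predates MVC support (needs >= 11.2)")
```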
PTX
CEC does not apply to PTX. PTX is an instruction set that the CUDA driver can JIT-compile to SASS. The standard CUDA compilation pipeline includes the translation of CUDA source code into PTX, from which SASS is generated, but for various reasons projects may choose to include PTX code in their final libraries and JIT-compile it at runtime instead (one reason is that PTX can be compiled for the architecture of the target system at runtime, instead of having to be precompiled for a subset of supported architectures at compile time). However, since MVC does not cover JIT-compiled PTX code, PTX generated using a particular CTK may not work with an older driver. This fact has two consequences:
- Libraries that package PTX code will not benefit from MVC.
- Libraries that leverage any sort of JIT-compilation pipeline that generates PTX code will also not support MVC. This can lead to particularly surprising behaviors, such as when a user has a newer CTK than the installed driver and then uses numba.cuda to compile a Python function, since Numba compiles CUDA kernels to PTX as part of its pipeline.
Prior to CUDA 12, CUDA itself provided no general solution to this problem, although in some cases there are tools that may help (for instance, Numba supports MVC in CUDA 11 starting with numba 0.57).
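A small sketch of what opting in to Numba's MVC support looks like is shown below. The environment variable name follows Numba's documentation at the time of writing, and additional NVIDIA-provided support packages must be installed for it to take effect; treat the details as assumptions.

```python
# Hedged sketch: opting in to Numba's CUDA minor version compatibility support
# (available from numba 0.57). The flag must be set before numba.cuda is used.
import os
os.environ["NUMBA_CUDA_ENABLE_MINOR_VERSION_COMPATIBILITY"] = "1"

import numpy as np
from numba import cuda

@cuda.jit
def add_one(x):
    i = cuda.grid(1)
    if i < x.size:
        x[i] += 1.0

arr = np.zeros(8, dtype=np.float32)
# Launching the kernel makes Numba generate PTX and JIT-compile it; without MVC
# this step can fail when the CTK is newer than the installed driver.
add_one[1, 8](arr)
print(arr)
```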
CUDA 12 introduced the nvJitLink library as the long-term solution to this problem. nvJitLink may be leveraged to compile PTX and link the resulting executables in a minor version compatible manner. The `pynvjitlink` package is a Python wrapper for nvJitLink that can be used to enable enhanced compatibility for Numba's JIT-compiled kernels.
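A minimal sketch of enabling this for Numba, following the pynvjitlink documentation at the time of writing (the exact patching API should be treated as an assumption):

```python
# Hedged sketch: patching Numba's linker so that PTX is linked through
# nvJitLink, restoring minor version compatibility for Numba's JIT pipeline
# on CUDA 12.
from pynvjitlink import patch
patch.patch_numba_linker()  # call before compiling any kernels

from numba import cuda

@cuda.jit
def noop():
    pass

noop[1, 1]()  # this kernel is now linked via nvJitLink
```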