Distributing a package containing SIMD code

Single Instruction, Multiple Data (SIMD) instructions are CPU-specific instructions that can yield significant performance gains compared to regular, portable C/C++ code. Each popular modern CPU architecture has its own SIMD instruction sets (for example SSE and AVX on x86-64, NEON on aarch64, and VSX on ppc64le).

Using SIMD instructions in a Python package is quite difficult, because there is no way to specify, in either metadata or wheel tags, what CPU features are needed on the target machine in order to use a given wheel.

What does code containing SIMD instructions look like?

This code fragment shows how to use a single SSE2 instruction on an x86-64 CPU. It defines a mul function which multiplies two vectors of double-precision floating-point numbers:

#include <immintrin.h>
__m128d mul(__m128d a, __m128d b)
{
    return _mm_mul_pd(a, b);
}

If the CPU supports the instruction and the code is compiled with the needed compiler flag (-msse2), the mul function will work and will be faster than using regular multiplication in C/C++.
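
To make this concrete, here is a small hypothetical driver for the mul function above; _mm_set_pd packs two doubles into one 128-bit vector (highest element first), and _mm_storeu_pd writes a vector back to memory:

#include <stdio.h>
#include <immintrin.h>

__m128d mul(__m128d a, __m128d b)
{
    return _mm_mul_pd(a, b);
}

int main(void)
{
    __m128d a = _mm_set_pd(3.0, 4.0);   /* vector [4.0, 3.0] */
    __m128d b = _mm_set_pd(2.0, 0.5);   /* vector [0.5, 2.0] */
    double out[2];
    _mm_storeu_pd(out, mul(a, b));      /* both products computed by one instruction */
    printf("%f %f\n", out[0], out[1]);  /* prints 2.000000 6.000000 */
    return 0;
}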

As a more real-world example, here is a code fragment from NumPy's sin implementation for 32-bit float data. When the build supports the required SIMD features (the NPY_SIMD_F32 and NPY_SIMD_FMA3 macros), a vectorized path is compiled in; otherwise the code falls back to a scalar loop:

#if NPY_SIMD_F32 && NPY_SIMD_FMA3
    if (is_mem_overlap(src, steps[0], dst, steps[1], len) ||
        !npyv_loadable_stride_f32(ssrc) || !npyv_storable_stride_f32(sdst)
    ) {
        for (; len > 0; --len, src += ssrc, dst += sdst) {
            simd_sincos_f32(src, 1, dst, 1, 1, SIMD_COMPUTE_SIN);
        }
    } else {
        simd_sincos_f32(src, ssrc, dst, sdst, len, SIMD_COMPUTE_SIN);
    }
#else
    for (; len > 0; --len, src += ssrc, dst += sdst) {
        const float src0 = *src;
        *dst = npy_sinf(src0);
    }
#endif

How important is the use of SIMD code?

Code with SIMD instructions is typically a lot more difficult to read and maintain than regular C or C++ code. However, the speedups can be large, so the implementation effort and the maintenance burden may be worth it. For basic and heavily used functionality like element-wise math functions (abs, sqrt, multiply, etc.), typical gains are in the 1.5x-10x range, and sometimes even >10x. Here are a few benchmark results:

  • OpenCV color conversion functionality, ~25x faster on ARM CPUs with NEON: opencv#19883
  • NumPy's absolute, reciprocal, sqrt, square functions, for SSE/AVX2 (x86-64), NEON (aarch64/arm64), and VSX (ppc64le): numpy#16247
  • PyTorch softmax, min and max 3x-4x faster for bfloat16 with AVX2/AVX512 on x86-64: pytorch#55202, and up to 2x-10x with uint8 for +, >>, min: pytorch#89284
  • Using AVX2 instead of SSE in SciPy's 2-D Fourier transforms: scipy#16984

It is safe to say that performance gains that large, for single-threaded execution in libraries that are so widely used, are extremely important.

Current state

As of December 2022, there is no support on PyPI, in the wheel spec, or in any widely used packaging tool for binaries containing SIMD instructions, nor is there a plan to implement such support. The only relevant metadata is the "platform compatibility tag" in a wheel name, first defined in PEP 425 and now maintained under PyPA specifications in the Python Packaging User Guide. A platform tag identifies a CPU family, for example x86_64 for 64-bit x86 CPUs and aarch64 for 64-bit ARM CPUs.
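
For example, everything a tool can learn from a wheel filename like numpy-1.24.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl is "CPython 3.11, 64-bit x86 Linux, glibc 2.17 or newer"; there is no way for the filename to express that the binaries inside require, say, AVX2.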

Projects that want to distribute wheels containing SIMD instructions have effectively three choices:

  1. Make a single choice of SIMD instructions to include.
  2. Build extension modules with multiple SIMD flavors inside, detect CPU capabilities at runtime, and then dynamically choose the optimal binary code.
  3. Create separate packages on PyPI with a different package name but the same import name, containing wheels built with newer instructions, and let users install those alternative packages manually.

Choice (1) implicitly defines which CPUs the package supports. Given that unsupported instructions result in very obscure errors, this means targeting SIMD instruction sets that are at least 10 years old (sometimes more). Choice (2) results in improved performance, because newer SIMD instructions can be used; however, this comes at the cost of a large amount of code complexity and larger wheel sizes.
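
To make the machinery of choice (2) concrete, here is a minimal sketch of the runtime dispatch pattern, assuming GCC or Clang on x86-64 (the function names are made up for this example; real projects use considerably more elaborate versions of the same idea):

#include <stddef.h>
#include <immintrin.h>

/* AVX2 variant. The target attribute makes the compiler emit AVX2 code
   for this one function, while the rest of the file is built for the
   baseline; a build system can achieve the same with per-file flags. */
__attribute__((target("avx2")))
static void add_f32_avx2(const float *a, const float *b, float *out, size_t n)
{
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {        /* 8 floats per 256-bit register */
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(out + i, _mm256_add_ps(va, vb));
    }
    for (; i < n; i++)                  /* scalar remainder */
        out[i] = a[i] + b[i];
}

/* Portable fallback, works on any CPU of this family. */
static void add_f32_scalar(const float *a, const float *b, float *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}

typedef void (*add_f32_fn)(const float *, const float *, float *, size_t);

/* Query the CPU at runtime and return the best variant. */
add_f32_fn resolve_add_f32(void)
{
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2"))
        return add_f32_avx2;
    return add_f32_scalar;
}

An extension module would typically run such a resolver once at import time and cache the resulting function pointers; multiplied across many functions and instruction sets, this is exactly the code complexity and binary size growth described above.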

In practice, only the largest and most widely used projects are able to make choice (2). And they indeed do so - TensorFlow, PyTorch, OpenCV, NumPy, and MXNet all have their own machinery and methods to work with SIMD instructions.

There are not many examples of choice (3). The ones that do exist, e.g. Pillow-SIMD and Intel(R) Extension for scikit-learn, tend to be forks by a third party rather than packages created by the original development team.

Distributing binaries with SIMD instructions is not something many other packaging systems have an answer for; exceptions are Spack and Julia's Pkg.jl¹. Spack has built-in capabilities through archspec for installing optimized binaries. This will even be surfaced in its resolver; individual package entries will contain a tag like -skylake_avx512 (microarchitecture + highest supported instruction set). The archspec paper is worth reading for a thorough discussion of the design aspects of integrating support for SIMD instructions, and of dealing with CPU compatibility in a packaging system in a more granular fashion. Pkg.jl can serve binaries for Julia packages optimized for the user's CPU architecture - see for example finufft_jll.jl and the binaries listed in its README (e.g. Windows x86_64 {cxxstring_abi=cxx11, march=avx} and the march=avx2 and march=avx512 variants).

Problems

Writing SIMD instructions is a specialized skill; however, it can be effective to do so in only a few performance hotspots of the code. So it would often be worthwhile, if it weren't for the problems around distributing wheels on PyPI. To illustrate how prohibitively expensive the dynamic dispatch solution is in terms of developer time: NumPy only gained support for it in 2020, and SciPy still does not have it (it chooses SSE3 instructions, first released in 2005, as the most recent instructions that are allowed to be used on x86).

A less sophisticated method employed in the wild is to compile the project twice (e.g. once with SSE3 and once with AVX2), import the latter by default, and fall back to the former. Obviously this doubles the binary size.

The "distribute separate wheels under a different package name (choice 3 above) is so user-unfriendly, and also fairly labor-intensive, that we cannot think of a single open source project that does this on PyPI.

The "choose a baseline and compile only for that" (choice 1 above) is the easiest choice that still allows using some SIMD instructions - and the difference between some (e.g. up to SSE3) and none at all can still be very large in terms of performance gain. However, this still leaves some users with old or nonstandard CPUs out in the cold, and it forces package authors to come up with a method for choosing that maximum feature set. The rule of thumb that NumPy and SciPy came up with is: if the number of users with incompatible CPUs stays below 0.5% (as determined by some publicly available data from browser and gaming vendors), then it's okay to use a particular feature. This is not ideal, but tends to lead to few complaints in practice.

History

Distributing packages containing SIMD code on PyPI came up a number of times on the distutils-sig mailing list, as well as more recently on Discourse.

Even before wheels existed, NumPy and SciPy were already distributing .exe Windows installers in three SIMD flavors (no SIMD, up to SSE2, and up to SSE3; see for example pypi.org/project/numpy/1.5.1/#files).

Relevant resources

Links to key issues, forum discussions, PEPs, blog posts, etc.

Potential solutions or mitigations

There are a few potential solutions on the Python packaging side, but none of them look particularly promising:

  • New wheel tags for specific microarchitectures are a blunt instrument, and there are too many microarchitectures to consider for this to work well.
  • Having packaging tools use a library like archspec is very likely too complicated.
  • The selector packages idea seemed promising at first, but seems to have fallen out of favor by now.

The most likely path forward to improve the current situation is to make it easier to share and reuse infrastructure for CPU feature detection and runtime dispatch. With archspec and pytorch/cpuinfo, two solid libraries are available for feature detection. The NumPy and Meson projects are planning to collaborate to make the "multiple compilation for different CPU capabilities" part available as a build system feature. If the runtime dispatch part could be implemented as a standalone, vendorable component, it would perhaps become easier for other projects to go this route.
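
As a small illustration of the feature-detection half, here is a sketch using pytorch/cpuinfo's C API (assuming the library is installed and linked, e.g. with -lcpuinfo; the cpuinfo_has_* predicates simply return false on architectures where a feature does not apply):

#include <stdio.h>
#include <cpuinfo.h>

int main(void)
{
    /* One-time initialization; returns false if detection failed. */
    if (!cpuinfo_initialize()) {
        fprintf(stderr, "cpuinfo initialization failed\n");
        return 1;
    }
    /* Query individual CPU capabilities at runtime. */
    printf("x86 AVX2:     %d\n", cpuinfo_has_x86_avx2());
    printf("x86 AVX-512F: %d\n", cpuinfo_has_x86_avx512f());
    printf("ARM NEON:     %d\n", cpuinfo_has_arm_neon());
    return 0;
}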


  1. The capabilities of Julia's package manager aren't directly relevant to Python users; however, it's still instructive to see a language-specific package manager that is SIMD-aware.