Basic code compilation concepts
What does this have to do with Python?
Python as a glue language is essentially exposed to the sum of all problems that other languages have with their binary distribution, and this is a really vast field (even just for C/C++).
Needless to say, adequately summarizing decades of work and developments that have led us to where we are now is not easy. If you find errors or things to improve, please open an issue or PR!
In order to get a computer to execute a given unit of work (say, application `X` calling a function `f` from a previously compiled library `libfoo`), a lot of preconditions have to be met:
- The function needs to have been compiled into instructions that the computer understands, resulting in a so-called symbol.
- The symbol for `f` needs to be named (or "mangled") in a consistent manner between the compilation of the current code (for `X`) and the compilation of the library (`libfoo`, which contains the symbol for `f`).
- The symbol for `f` needs to be discoverable from within the currently running process; assuming `libfoo` is available on the machine where we are compiling, this is ensured by the linker.
- Variables passed to the function need to be loaded into the right CPU registers; this is highly dependent on the calling convention of a given CPU, or rather, CPU family.
- The code (in `X`) calling a given symbol (e.g. for `f`) needs to be excruciatingly compatible with the actual implementation of that symbol (in `libfoo`).
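To make this more concrete, here is a minimal sketch of the situation (illustrative only; `extern "C"` is used so that the symbol name stays simply `f`):

```cpp
// foo.h (hypothetical): the public interface of libfoo;
// extern "C" keeps the symbol name unmangled, i.e. simply `f`
extern "C" int f(int x);

// x.cpp: application X. The compiler only ever sees the declaration above;
// it is the linker's job to resolve this call to the symbol `f` provided by
// libfoo -- at build time for static linking, at load time for shared linking.
int main() { return f(42); }
```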
It's not useful for the average programmer to consider this level of detail when trying to get work done, but it is unfortunately unavoidable when considering the realities of packaging and distributing software as pre-compiled binary artifacts.
Symbols, mangling and linkers
Symbols
For any given function `f`
you might have written in a language that needs to
be compiled (C, C++, ...), you can consider a symbol as a translation of your
function to a level that can actually be executed by your computer.
So, for example, a simple square function
```c
int square(int num) {
    return num * num;
}
```
will be translated to the following assembly on x86-64 hardware:
```asm
square:
        push    rbp
        mov     rbp, rsp
        mov     DWORD PTR [rbp-4], edi
        mov     eax, DWORD PTR [rbp-4]
        imul    eax, eax
        pop     rbp
        ret
```
Symbol name mangling
Note as well that, in C, the symbol `square`
has exactly the same name as the
function - there is a 1:1 relationship, or in other words, there is no "mangling"
of the symbol name. This is because C has no concept of overloading functions
(i.e. having different functions of the same name but different signatures).
It also means that you can never change anything about a C symbol that has been distributed to consumers in binary form. We return to this in the background content about ABI.
In C++, the same function name can have several different signatures; for example, the template

```cpp
template<class T>
int square(T num) {
    return num * num;
}
```
can be used to call `square` with all kinds of integers, floats, etc. The compiler
will keep track of which flavor of the function has been used in the program
and generate the respective symbols for each one of them. In order to
distinguish these symbols, they get names that look like gibberish, but really
simply bake the types of their input arguments into the identifier, so that we
can ensure the right symbol gets called when we actually execute the function.
For more details on the most widespread convention for this, see here.
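As a rough illustration (shown for the Itanium C++ ABI used by GCC and Clang; MSVC uses a different scheme), the mangled names bake the template arguments and parameter types into the symbol, and can be inspected with `nm` and decoded with `c++filt`:

```cpp
template<class T>
int square(T num) {
    return num * num;
}

// C linkage for comparison: the symbol keeps the plain name `square_c`
extern "C" int square_c(int num) { return num * num; }

int main() {
    square(2);     // instantiates int square<int>(int)       -> symbol _Z6squareIiEiT_
    square(2.0);   // instantiates int square<double>(double) -> symbol _Z6squareIdEiT_
}
```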
Linkers
When building an executable or a library, any code that references functions from third-party libraries needs to be resolved to the respective symbols, and those need to be found and copied into the executable, or alternatively, loaded at runtime.
The tool for this job is called the linker, and its work is hopefully invisible, at least until things break (symbols not found, etc.). For an in-depth introduction to linkers, see here. Note that the field has evolved a lot since then, and new-generation linkers have appeared (see e.g. mold), but the basic operations have remained essentially unchanged.
One crucial aspect of the way things have historically grown is that there is no name-spacing of these symbols in any way. All symbols are fully global, and this brings a lot of constraints. The linker will search for a given symbol within different paths on the system, in order, but this is obviously very fragile if the same symbol appears in several libraries on the linker path.
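The following sketch (with hypothetical libraries `liba` and `libb`) illustrates that lack of name-spacing: if two libraries on the linker path define the same symbol, which definition ends up being used is determined purely by search order, not by anything in the source code.

```cpp
// liba.cpp -- compiled into liba.so (or liba.a)
extern "C" int answer() { return 1; }

// libb.cpp -- compiled into libb.so (or libb.a)
extern "C" int answer() { return 2; }

// main.cpp -- whether this prints 1 or 2 depends on the order in which the
// linker searches the two libraries (e.g. `-la -lb` vs. `-lb -la`)
#include <cstdio>

extern "C" int answer();

int main() { std::printf("%d\n", answer()); }
```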
Key take-aways
- Functions are compiled into symbols.
- Symbol names are 1:1 with function names in C, mangled according to their signature in C++.
- These symbols share a global name space.
- Symbols are picked by the linker in order of precedence on the path.
Shared vs. static libraries
From a high-level point of view, libraries are collections of symbols, and can come in two different flavors: static and dynamic. In very crude terms, static libraries are only useful at compile time (i.e. compiling an app `X` against a static library `libfoo` will copy the symbols from `libfoo` required by `X` into the final artifact), whereas for dynamic libraries, the symbols will only be referenced, and then loaded from the dynamic library at runtime.
As a rough overview:
|                     | Static `libfoo`                                                    | Dynamic `libfoo`                              |
|---------------------|--------------------------------------------------------------------|-----------------------------------------------|
| Compiling app `X`   | Copies required symbols from `libfoo` into final artifact for `X`  | References required symbols from `libfoo`     |
| Running app `X`     | -                                                                  | Needs to load required symbols from `libfoo`  |
| Storage footprint   | O(#libs) for libraries using a given symbol                        | Symbol only saved once                        |
| Binary Interface    | Self-contained                                                     | Sensitive to ABI breaks                       |
The trade-offs involved here are very painful. Either we duplicate symbols for every artifact that uses them (and the bloat can be extreme, e.g. statically linking the MSVC runtime[^1] can increase storage / memory footprint by 100MB), or we subject ourselves to the constraints of ABI stability and of finding the right symbols at runtime. For a further deep-dive into how Microsoft evolves its UCRT (Universal C Runtime), see here.
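To make "loading the required symbols at runtime" more tangible, here is a minimal unix sketch. Normally the dynamic loader does this implicitly at program start for every shared library the executable was linked against; `dlopen`/`dlsym` expose the same machinery explicitly (the library name `libfoo.so` and the signature of `square` are assumptions for illustration):

```cpp
#include <dlfcn.h>
#include <cstdio>

int main() {
    // load the shared library into the current process
    void* handle = dlopen("libfoo.so", RTLD_NOW);
    if (!handle) { std::fprintf(stderr, "%s\n", dlerror()); return 1; }

    // look up the symbol by name and cast it to the signature we expect;
    // nothing verifies that this matches the actual implementation -- this is
    // exactly where ABI mismatches turn into crashes or silent corruption
    auto square = reinterpret_cast<int (*)(int)>(dlsym(handle, "square"));
    if (!square) { std::fprintf(stderr, "%s\n", dlerror()); return 1; }

    std::printf("%d\n", square(7));
    dlclose(handle);
    return 0;
}
```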
In general, large parts of the wider computing ecosystem have found the
duplication inherent in static libraries unacceptable, and would rather deal
with the constraints of ABI stability. For example, standard functions like `printf` (or standard mathematical functions, or ...) are used so pervasively
that copying those symbols into every produced binary would be extremely
wasteful.
It's worth noting that on Windows, symbols in shared libraries work quite differently than on unix. For various reasons, symbols on Windows have to be explicitly marked `__declspec(dllexport)` & `__declspec(dllimport)` in the
source with a macro that switches appropriately. This is quite an intrusive
change for libraries aiming to be cross-platform, and the reason that several
libraries developed primarily on unix do not support shared builds on Windows.
Even using workarounds like `CMAKE_WINDOWS_EXPORT_ALL_SYMBOLS` still doesn't cover all necessary symbols (e.g. global static data members).
Using the latter may also have a performance impact, as it keeps the compiler from inlining calls to symbols that have been exported.
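A common way to handle this (sketched below; the macro names `FOO_EXPORT` and `FOO_BUILDING_DLL` are made up for illustration) is a small export macro in the library's headers that expands differently depending on the platform and on whether the library itself or one of its consumers is being compiled:

```cpp
#if defined(_WIN32)
  #if defined(FOO_BUILDING_DLL)
    #define FOO_EXPORT __declspec(dllexport)   // defined while building libfoo itself
  #else
    #define FOO_EXPORT __declspec(dllimport)   // seen by consumers of libfoo
  #endif
#else
  // on unix this can stay empty, or expand to
  // __attribute__((visibility("default"))) when building with -fvisibility=hidden
  #define FOO_EXPORT
#endif

// every public symbol has to be annotated explicitly
FOO_EXPORT int square(int num);
```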
Key take-aways
- Static libraries are only useful at compile-time; they cause duplication of symbols, but are much less susceptible to the intricacies of ABIs.
- Shared libraries need to be available both at compile time and at runtime; they solve the symbol duplication problem, but are extremely susceptible to ABI breaks.
- Due to the global symbol namespace, there can only be one version / build of a shared library per environment (unless there is explicit versioning for the symbols or libraries themselves, or for C++, explicitly different inline namespaces are used).
Foreign-Function Interface (FFI)
For any given task, a function may already exist in a library written in a language other than the one at hand. In order to avoid rewriting (and maintaining) that functionality, it's desirable to have a way to call such functions from a "foreign" language. A common example is the BLAS/LAPACK routines, which are written in Fortran but provide a C interface in addition to the Fortran one.
In addition to the considerations above, this requires ensuring that types are translated correctly between the calling language and the language the function is implemented in.
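As a sketch of what this looks like at the lowest level, here is a call from C++ into the Fortran BLAS routine `ddot`. Fortran passes all arguments by reference, and most Fortran compilers append a trailing underscore to the symbol name, hence the declaration below (in practice one would use the CBLAS interface instead, and this would need to be linked against a BLAS implementation):

```cpp
#include <cstdio>

// hand-written declaration of the Fortran symbol
extern "C" double ddot_(const int* n, const double* x, const int* incx,
                        const double* y, const int* incy);

int main() {
    const int n = 3, inc = 1;
    const double x[] = {1.0, 2.0, 3.0};
    const double y[] = {4.0, 5.0, 6.0};
    // dot product computed by the Fortran routine: 1*4 + 2*5 + 3*6 = 32
    std::printf("%f\n", ddot_(&n, x, &inc, y, &inc));
    return 0;
}
```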
Transpilation
Many Python projects do not deal with C/C++ directly, but use transpilers like Cython or Pythran that generate C or C++ code, respectively, from (quasi-)Python source code. Aside from almost always being exposed to the NumPy C API & ABI, these modules compile into shared libraries themselves, with all the caveats mentioned above that this implies. However, few projects expose their cythonized functions as a C API, so there are generally fewer concerns about ABI stability in this scenario.
Cross-compilation
Many projects use public CI services, which might not offer some of the more exotic or emerging architectures a project wants to support. This means that publishing binary artifacts for such architectures can often only be achieved with cross-compilation, i.e. compiling on one architecture (e.g., x86-64), for another (e.g., aarch64).
A recent example where this was necessary at scale was the introduction of a
new processor architecture for Apple's M1 notebooks, for which (almost) no CI
agents were available. Distributors of packages for `linux-aarch64`, `linux-ppc64le` or `windows-arm64` are often in similar situations.
The difficulty in cross-compilation is that it requires additional attention to, and separation of (both conceptually and in metadata), the requirements for the build environment (e.g. the build tools that are executed on the x86-64 agent) and those for the host environment (i.e. the libraries that need to match the target architecture).
Additionally, many build procedures assume they can execute arbitrary code (e.g. code generation) on the same architecture as the host, which is not a given in this case and may need to be worked around.
Performance optimization
Compiler flags
TODO: Optimization levels; inlining functions; problems with -Ofast
For `-Ofast`, see in particular: https://github.com/conda-forge/conda-forge.github.io/issues/1824
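One frequently cited problem, sketched below: `-Ofast` implies `-ffast-math`, which (among other things) lets the compiler assume that NaNs and infinities never occur, so code relying on IEEE semantics can silently change behaviour:

```cpp
#include <cmath>
#include <cstdio>

// relies on the IEEE rule that NaN compares unequal to itself;
// under -ffast-math the compiler may fold this to `false` unconditionally
bool is_nan(double x) { return x != x; }

int main() {
    double bad = std::sqrt(-1.0);        // produces a NaN
    std::printf("%d\n", is_nan(bad));    // prints 1 normally; may print 0 with -Ofast
    return 0;
}
```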
Link-Time Optimization (LTO)
In the search for speed, it's possible to do a whole-program analysis after everything has been compiled, and let the compiler identify which functions can be inlined. This is generally out of scope for Python packaging, because it is too involved. However, CPython release builds make use of it.
Profile-Guided Optimization (PGO)
If a program is instrumented (compiled with tracking capabilities) to profile common usage patterns, it is possible to optimize the layout of the final artifact to ensure that hot paths get preferred in branch prediction. This is generally out of scope for Python packaging, because it is too involved. However, CPython release builds make use of it.
Binary Optimization and Layout Tool (BOLT)
Further optimization of the produced binary artifacts can be achieved by arranging their layout to avoid cache misses under profiled behavior. The respective tool is based on LLVM, is still under heavy development, and is not suitable for all use cases. It is generally out of scope for Python packaging, because it is too involved. However, CPython added experimental support as of 3.12.
[^1]: This is an extreme example, but it is used in the wild, see e.g. protobuf. Note that Microsoft itself discourages this pattern, but still supports it:

    > We strongly recommend against static linking of the Visual C++ libraries, for both performance and serviceability reasons, but we recognize that there are some use cases that require static libraries and we will continue to support the static libraries for those reasons.