On Environment/Package Management in Python

I'm involved in a few open source projects and I often help users address their environment / installation issues. A large number of these issues essentially come down to incorrectly / accidentally mixing multiple different python environments together. This post lists a few common pitfalls / misconceptions around that.

The Problems

Unfortunately, people often have multiple python environments and multiple installations of python packages, e.g.:

1. The OS's package manager can install python and some python packages.
• Example: /usr/bin/python, /usr/lib/python3.7/*
2. Users can use pip install or setup.py to install new packages. Running them with --user will install packages to a location different from default.
• Example: /usr/local/lib/python3.7/*, $HOME/.local/lib/python3.7/*
3. Installing packages with pip install etc. inside a virtualenv puts them in a location under that virtualenv.
• Example: $HOME/my_venv/bin/python, $HOME/my_venv/lib/python3.7/*
4. Anaconda users install python packages to Anaconda's own location.
• Example: $HOME/anaconda3/bin/python, $HOME/anaconda3/lib/python3.7/*

All of the above are very common ways to install a library. As a result, many python developers' machines have multiple environments, and this can cause a ton of problems.

Be careful of multiple installations of the same package

For the reasons listed above, you could have multiple installations of the same package on your system. It often causes very confusing issues if you think you're using one installation but are actually using a different one. Examples of such issues include:

1. You install the desired version of a package but still see complaints about a wrong package version, or run into bugs that only exist in the wrong version.
2. You build & install a package with your custom changes, but they have no effect.
3. You attempt to fix a bug by changing the source code, but you're in fact running another installation of the package, so the bug never appears to be fixed.

Be very careful about such issues, and remember to verify that you are actually using the library you installed. When in doubt, try the following methods:

1. Use import lib; print(lib.__version__) to know the version of the library you're using. However, not all packages have the __version__ attribute. It could also be VERSION, etc.
2. Use import lib; print(lib.__file__) to know the location of the library you're using. This method should work for most packages.
3. Use strace -fe file python -c 'import lib; do_something_with_lib()' to see every file used by the command. This tells you everything needed to figure out whether you have the issue of multiple installations.
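The first two checks can be wrapped into a small helper. This is only a sketch: describe_module and find_all_copies are names made up for illustration, and "json" is a stand-in for whatever library you are debugging. The second function asks Python's standard path finder about each sys.path entry separately, which gives a rough hint of how many copies of a top-level module are reachable.

```python
import importlib
import sys
from importlib.machinery import PathFinder


def describe_module(name):
    """Import `name` and report which copy this interpreter actually picked up."""
    module = importlib.import_module(name)
    # Not every package defines __version__; some use VERSION, some nothing at all.
    version = getattr(module, "__version__", getattr(module, "VERSION", None))
    return {
        "version": version,
        "file": getattr(module, "__file__", None),
        "python": sys.executable,
    }


def find_all_copies(name):
    """List the file of every copy of top-level module `name` found on sys.path."""
    origins = []
    for entry in sys.path:
        # Search one sys.path entry at a time instead of stopping at the first hit.
        spec = PathFinder.find_spec(name, [entry])
        if spec is not None and spec.origin is not None and spec.origin not in origins:
            origins.append(spec.origin)
    return origins


if __name__ == "__main__":
    # "json" is only a placeholder; replace it with the library in question.
    print(describe_module("json"))
    print(find_all_copies("json"))
```

If find_all_copies returns more than one path, the first one is normally the copy that import uses, and the others are candidates for the "wrong installation" problems described above.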
pip freeze or conda list can show you the wrong version

The version you see in these two commands may not match what you're actually using, because there could be multiple versions of the same library on the system, installed by pip, conda, or other methods. Neither pip nor conda is able to know about all of them. To tell precisely which version of a library you're using, follow the suggestions above.

Avoid using setup.py install

Usually, a package installed in this way is not managed by any system: no command can tell you it is installed; no command can uninstall it for you. A pip uninstall for such packages may complain that it "cannot determine which files belong to it", or it may just do nothing. You often need to manually remove files to really uninstall it.

The result is that, when you need to install a different version of it some day in the future using other methods (e.g. pip or conda), it either fails, or succeeds but gives you a system with multiple installations.

Prefer python -m pip over pip

There could be multiple python binaries on your system (e.g., from the system, a venv, or anaconda). However, pip is just a python script: depending on how its shebang line is written, some versions of pip pick the python executable from your $PATH, while others have a hard-coded absolute path to the python executable they will use.

As a result, when you run pip install directly, there is no guarantee which python it will use or where the library will be installed.

In an environment with more than one python, always use python -m pip or /some/python -m pip, instead of the pip command directly.
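The same rule applies when a script needs to call pip: go through a specific interpreter rather than whatever pip happens to be on $PATH. A minimal sketch, where pip_command and run_pip are hypothetical helper names and some_package is a placeholder:

```python
import subprocess
import sys


def pip_command(*args, python=sys.executable):
    """Build a pip command line bound to one specific interpreter."""
    return [python, "-m", "pip", *args]


def run_pip(*args, python=sys.executable):
    """Run pip through the chosen interpreter and return what it printed."""
    result = subprocess.run(
        pip_command(*args, python=python),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Only prints the command; call run_pip(...) to actually execute it.
    print(pip_command("install", "some_package"))
```

Calling run_pip("--version") is a quick sanity check: pip reports where it is installed and which python it runs under.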

You don't need root permission for most installations

You only need root permission when the library directly interacts with hardware, e.g., you need root permission to install the nvidia driver.

You do not need root permission to, e.g., install a different version of Python, GCC, or CUDA (though a newer CUDA sometimes requires a newer driver). But doing these without root permission certainly requires some extra knowledge.

Avoid mixing binaries built from different sources

Python itself is a binary that depends on some other binary libraries. Each python package may also contain binaries or depend on other binary libraries. Mixing binaries built from different sources (e.g. your system package manager vs. anaconda) together (i.e. into a single process) has potential binary compatibility issues.

Such issues can happen when you want to use libA and libB together, but they are built against different versions of another library libC, or built with different C++ compilers. (C compilers, however, should produce binary-compatible code across compiler versions.)

Ideally you might expect some mechanism to avoid such conflicts. There is indeed a complicated set of symbol visibility & compiler ABI rules, but most libraries do not follow them correctly. The result of such incompatibility issues is often a segfault or other mysterious errors.
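One rough way to see which binaries ended up in a single process is to list the shared objects mapped into the running interpreter. The sketch below is Linux-only and assumes /proc/self/maps is readable; loaded_shared_libraries is a name made up for illustration.

```python
import os


def loaded_shared_libraries(maps_path="/proc/self/maps"):
    """Return sorted paths of shared objects mapped into this process (Linux only)."""
    if not os.path.exists(maps_path):
        return []  # not Linux, or /proc is unavailable
    libs = set()
    with open(maps_path) as f:
        for line in f:
            # Columns: address, perms, offset, dev, inode, then an optional pathname.
            fields = line.split(None, 5)
            if len(fields) == 6:
                path = fields[5].strip()
                name = os.path.basename(path)
                if name.endswith(".so") or ".so." in name:
                    libs.add(path)
    return sorted(libs)


if __name__ == "__main__":
    # Import the libraries you suspect first, then inspect where their binaries come from.
    for path in loaded_shared_libraries():
        print(path)
```

If the output shows, say, a libstdc++ from one prefix while an extension module loaded from another prefix was built against a different one, that is a hint (not proof) that binaries from different sources are being mixed.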

In reality, here is how packages are built:

• Your OS's package manager (apt/yum/pacman, etc) installs many binaries and libs. They are built with the exact system packages they depend on, using the exact compiler installed by the package manager. They are all built in a nice uniform environment that will not have any compatibility issues: all these packages can be mixed together.

• When you pip install a package, there are two possibilities:

• Source distribution: this command compiles source code, using whatever compiler & dependency libraries it finds. So its compatibility will depend on which compiler & libraries it finds. Typically this is controlled by standard environment variables such as $CC and $LIBRARY_PATH, but it varies among packages.

• Binary wheel distribution: this command downloads a pre-built binary. This means that you need to confirm the binary is built in an environment that's compatible with other packages you're using.

Lots of binary packages on pypi contain the word "manylinux": it means the package is built such that it's supposed to be compatible with most linux environments. Typically, using a manylinux package should not lead to compatibility issues, although there are exceptions (e.g., some packages incorrectly mark themselves as manylinux). Also, a manylinux package may have suboptimal performance due to the compatibility requirements: such packages are often built with old compiler versions and target old instruction sets.

For other packages without the "manylinux" signature, you can only wish for good luck. They usually work fine but could stop working any day. There are a number of github issues in different projects about "import libA causes import libB to crash". Typically these involve giant projects, such as OpenCV, TensorFlow, PyTorch.
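The "manylinux" word lives in the platform tag, which is the last dash-separated field of a wheel's filename. A small sketch of reading it, using only string handling; the function names and the example_pkg filenames below are hypothetical:

```python
def wheel_platform_tags(filename):
    """Return the platform tags encoded in a wheel filename."""
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel filename: %r" % filename)
    # Layout: name-version[-build]-python-abi-platform.whl, so the platform tag is last.
    platform = filename[: -len(".whl")].split("-")[-1]
    # Several platform tags may be joined with "." into one field.
    return platform.split(".")


def is_manylinux(filename):
    """True if any platform tag of the wheel is a manylinux tag."""
    return any(tag.startswith("manylinux") for tag in wheel_platform_tags(filename))


if __name__ == "__main__":
    print(is_manylinux("example_pkg-1.0-cp37-cp37m-manylinux1_x86_64.whl"))
    print(is_manylinux("example_pkg-1.0-cp37-cp37m-linux_x86_64.whl"))
```

This only tells you what the wheel claims about itself; as noted above, the claim can be wrong.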

• When you conda install a package that contains binaries, it's always pre-built. The official packages are built in anaconda's standard environment, and all the runtime dependencies in that standard environment are also packaged and distributed by anaconda. Anaconda provides an (almost) full runtime environment, including essential libs such as libstdc++ and libgcc. This means that the conda world is just like your OS's package manager: if you use conda to install all libraries (and their dependencies), they are always compatible with each other.

That sounds nice, until you want to build a package by yourself. Anaconda provides a full runtime environment, but usually not the build-time environment. Normally you'll still be building the package using your system's compiler & libraries (or those defined by your envvars).

As long as you use python from conda, you'll almost always run inside conda's runtime environment, using libstdc++, libjpeg, etc from anaconda/lib. It's then possible that the package you build is not compatible with conda's runtime environment.
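Before building a package yourself, it can help to print which runtime you are in and which compiler settings a build is likely to pick up. The sketch below only gathers hints: build_and_runtime_summary is a made-up name, the conda check relies on conda environments having a conda-meta directory under their prefix, and whether a given package's build honors $CC, $CXX or $LIBRARY_PATH varies.

```python
import os
import platform
import sys
import sysconfig


def build_and_runtime_summary():
    """Collect hints about the runtime and build-time environment of this interpreter."""
    return {
        # Which interpreter is running, and where its environment lives.
        "executable": sys.executable,
        "prefix": sys.prefix,
        # Conda environments keep their package records in <prefix>/conda-meta.
        "looks_like_conda": os.path.isdir(os.path.join(sys.prefix, "conda-meta")),
        # Compiler that built the interpreter itself.
        "python_built_with": platform.python_compiler(),
        # Compiler recorded in Python's build configuration (may be None on some platforms).
        "configured_cc": sysconfig.get_config_var("CC"),
        # Overrides that many, but not all, build scripts respect.
        "env_CC": os.environ.get("CC"),
        "env_CXX": os.environ.get("CXX"),
        "env_LIBRARY_PATH": os.environ.get("LIBRARY_PATH"),
    }


if __name__ == "__main__":
    for key, value in build_and_runtime_summary().items():
        print(f"{key}: {value}")
```

If looks_like_conda is true while the compiler that will be used is the system one, the build and the runtime come from different worlds, which is the situation described above.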

I've frequently seen such failures, e.g.:

• Build a package using the system's gcc. It then cannot run inside conda's runtime, since that runtime is built with an old version of gcc.
• conda install cudatoolkit=10.1 pytorch gives you a working pytorch in a cuda10.1 runtime. It works fine until you build a custom cuda extension: the extension will use nvcc from your system, which may not be 10.1.

That's why I personally avoid conda and use the system's python whenever possible.