On Environment/Package Management in Python

Python's package management is a mess. I'm involved in a few open source projects and I often help users address their environment & installation issues. A large number of these environment issues essentially come down to incorrectly / accidentally mixing multiple different python environments together. This post lists a few common pitfalls and misconceptions around them.

The Problems

Unfortunately, people often have multiple python binaries and multiple installations of python packages, e.g.:

  1. The OS's package manager can install python and some python packages.
    • Example: /usr/bin/python, /usr/lib/python3.7/*
  2. Users can use setup.py or system's pip install to install new packages to different locations.
    • Example: /usr/local/lib/python3.7/*, $HOME/.local/lib/python3.7/*
  3. Using pip install, etc., under a virtualenv can install to a location under the virtualenv.
    • Example: $HOME/my_venv/bin/python, $HOME/my_venv/lib/python3.7/*
  4. Anaconda users will install python packages to Anaconda's own location.
    • Example: $HOME/anaconda3/bin/python, $HOME/anaconda3/lib/python3.7/*

All of the above are very common ways to install a library. As a result, many python developers' machines have multiple environments, and a ton of problems can arise from this.

Be careful of multiple installations of the same package

For the reasons above, you could have multiple installations of the same package in your system. This often causes very confusing issues if you think you're using one installation, but are actually using a different one. Examples of such issues include:

  1. You install a package of the desired version but still see complaints about a wrong package version, or run into bugs that only exist in the wrong version
  2. You build & install a package with your custom changes, but the changes don't take effect
  3. You attempt to fix a bug by changing the source code, but you're in fact running another installation of the package, so the bug never appears to be fixed

When such issues appear, remember to verify which library you're actually using, and where it lives. When in doubt, try the following methods:

  1. Use import lib; print(lib.__version__) to check the version of the library you're using. However, not all packages have the __version__ attribute; it could also be named VERSION, etc.

  2. Use import lib; print(lib.__file__) to check the location of the library you're using. This method works for most packages.

  3. Use strace -fe file python -c 'import lib; do_something_with_lib()' to see every file accessed by the command. This tells you everything you need to figure out whether you have a multiple-installation issue.

I have the following command line alias to help me check libraries:

pylibinfo () {
  python -c "import $1 as X; print(X.__file__, end=' '); print(X.__version__)"
}
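
If a package lacks __version__, a pure-python alternative is to query the installed metadata directly. Below is a minimal sketch, assuming Python 3.8+ (where importlib.metadata entered the standard library); note it assumes the distribution name pip knows equals the import name, which is not always true (e.g. Pillow vs. PIL):

import importlib
import importlib.metadata
import sys

name = sys.argv[1]  # e.g. "numpy"; assumes import name == distribution name
mod = importlib.import_module(name)
print(mod.__file__)                       # where the import actually resolves
print(importlib.metadata.version(name))  # version recorded in package metadata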

Don't use pip list or conda list to check package version

The version you see in these two commands may not match what you're actually using, because there can be multiple versions of the same library in the system, installed by pip, by conda, or by other methods. Neither pip nor conda knows about all of them.

To tell precisely which version of a library you're using, follow the suggestions above.
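
For example, here is a minimal sketch that asks the interpreter itself rather than pip or conda (numpy is just an illustrative choice):

import sys

import numpy  # illustrative: substitute the library you care about

print("interpreter:", sys.executable)  # which python binary is running
print("module file:", numpy.__file__)  # the copy this interpreter imports
print("version:", numpy.__version__)   # compare with what pip/conda list said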

Don't use setup.py install to install packages

Usually, a package installed in this way is not managed by any system: no command can tell you it is installed; no command can uninstall it for you. A pip uninstall for such packages may complain that it "cannot determine which files belong to it", or it may just do nothing. You often need to manually remove files to really uninstall it.

The result is that, when you need to install a different version of it some day in the future using other methods (e.g. pip or conda), the installation either fails, or succeeds but gives you a system with multiple installations.
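
One heuristic to spot such unmanaged installations, sketched with importlib.metadata (Python 3.8+; "somelib" is a hypothetical name): when the installer recorded no file list, pip has nothing to work with.

import importlib.metadata

try:
    dist = importlib.metadata.distribution("somelib")  # hypothetical name
except importlib.metadata.PackageNotFoundError:
    print("no metadata at all: pip doesn't know this package exists")
else:
    if dist.files is None:  # installer recorded no file list (no RECORD)
        print("metadata exists but no file list: expect manual cleanup")
    else:
        print(f"{len(dist.files)} files on record: pip can uninstall cleanly")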

Prefer python -m pip over pip

There could be multiple python binaries in your system (e.g., from the system, a venv, anaconda). However, pip is just a python script: depending on how its shebang line is written, some versions of pip pick the python executable from your $PATH, while others have a hard-coded absolute path to the python executable they will use.

As a result, when you run pip install directly, it's not immediately clear which python it will use, let alone where the library will be installed.

On an environment with more than one python, always use python -m pip or /some/python -m pip, instead of the pip command directly.
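
A minimal sketch to see the ambiguity for yourself:

import shutil
import sys

print("this python:", sys.executable)      # the interpreter actually running
print("bare `pip`:", shutil.which("pip"))  # the script a plain `pip` would run

If the two point into different prefixes, a bare pip install would modify a different environment than the one this python imports from.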

Do pip uninstall multiple times

If you want to uninstall something, uninstall it multiple times until it converges. pip can install one package multiple times into different locations (e.g., one inside a virtualenv/conda environment + one in $HOME/.local).
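
A sketch of this loop in python itself ("somelib" is a hypothetical name; the iteration guard exists because some copies, e.g. on $PYTHONPATH, are outside pip's control):

import importlib
import importlib.util
import subprocess
import sys

for _ in range(5):  # guard against copies pip cannot remove
    importlib.invalidate_caches()
    if importlib.util.find_spec("somelib") is None:
        print("gone")
        break
    subprocess.run([sys.executable, "-m", "pip", "uninstall", "-y", "somelib"])
else:
    print("giving up: a copy may remain outside this pip's control")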

Use python -c 'import lib' to confirm uninstallation

Not everything can be uninstalled with a simple pip uninstall or conda uninstall. Examples are:

  1. Libraries you installed to a different prefix with a different pip.
  2. Libraries installed by the distro or libraries that are installed with setup.py install.
  3. Libraries in your PYTHONPATH.
  4. import lib may be provided by multiple alternative packages. For example, the tf-nightly and tensorflow packages both provide import tensorflow. It's easy to forget that you've installed both.

As a result, always try import lib to confirm after you uninstall something. If you're surprised by a successful import, use the methods in this article to tell where the library lives.
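
To hunt down surviving copies, here is a rough sketch ("somelib" is hypothetical; it ignores .pth files, namespace packages, and compiled extensions, so treat it as a starting point):

import pathlib
import sys

for entry in sys.path:
    base = pathlib.Path(entry or ".")  # '' means the current directory
    for candidate in (base / "somelib", base / "somelib.py"):
        if candidate.exists():
            print(candidate)  # a location that could satisfy `import somelib`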

Be careful when declaring dependencies on large packages

Large, complicated dependencies such as OpenCV, PyTorch, and TensorFlow can often be installed in many different ways, and only some of them are valid for a given environment. Such dependencies should NOT be declared in setup.py / requirements.txt to be installed automatically. To avoid an invalid installation or multiple installations, the choice of how to install these dependencies should be left to users.

Unfortunately, 10k+ projects declare opencv-python as a dependency. As a result, their users will automatically install and use the desktop version opencv-python, instead of:

  1. the contrib version opencv-contrib-python, with more features
  2. the headless version opencv-python-headless, with fewer features and fewer compatibility issues
  3. Linux distro's own package, with fewer compatibility issues

In fact, opencv-python has published suggestions on how to select the right package. "Automatic" selection is simply wrong. Similarly, a project that declares a dependency on PyTorch may automatically install a build with a mismatched CUDA version.
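
One common workaround, sketched below: declare the heavy dependency as an optional extra rather than a hard requirement, so users choose the variant explicitly (a setup.py fragment; the names are illustrative):

from setuptools import setup

setup(
    name="myproject",     # illustrative project name
    install_requires=[],  # do NOT put opencv-python here
    extras_require={
        # users opt in explicitly with: pip install myproject[cv]
        "cv": ["opencv-python"],
    },
)

Users who prefer opencv-contrib-python, opencv-python-headless, or the distro's package simply install that themselves and skip the extra.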

Be careful when using a library in the root of its source

You can sometimes have a python library installed already, but you also have its raw source code somewhere in your system. This is another potential case of multiple installation.

If you execute import libA in the source directory, python may find a local directory called libA which contains the source code, and use this source code, rather than the libA that's actually installed in a different location.

In addition to the common confusions that can arise from multiple installations, such a situation often causes errors, because the raw source tree is usually not a valid installation by itself. In many libraries, the source code differs from what actually gets installed after you run pip install. The most common example is that compiled extensions do not exist in the source code. As a result, using a python library from its source directory often leads to errors.
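
A minimal check for this, assuming a hypothetical package libA: python puts the current (or script) directory at the front of sys.path, so a local ./libA directory wins over the installed copy.

import os

import libA  # hypothetical: replace with the library in question

# assumes libA is a package directory, so __file__ is .../libA/__init__.py
loaded = os.path.dirname(os.path.abspath(libA.__file__))
if loaded == os.path.abspath("libA"):
    print("importing the source tree, not the installed copy")
else:
    print("importing from:", loaded)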

The issue is so common that some libraries try to detect it and educate users about it (e.g., numpy here and tensorflow here).

The situations where it is OK to use a source directory include:

  • Simple libraries whose source code is identical to what gets installed
  • Libraries that can be, and have been, installed locally inside the source directory, usually with pip install --editable.

Never use sudo to install python packages

Never use sudo pip install or sudo python setup.py install, unless it's a throwaway virtual system (e.g. a docker container) that you don't intend to keep for long. Because:

  1. It is yet another installation. For example, you can have one version installed with root and one without, causing more trouble.
  2. When you later do installations the right way (without root), this old package cannot be automatically upgraded.
  3. It affects all users, causing the "multiple installation" problem for them as well.

pip install --user can install libraries without root permission (to $HOME/.local on Linux). This option is sometimes the default in recent versions of pip. Alternatively, use venv if stronger isolation is needed; venv is now officially part of Python 3.
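
To see where each flavor of installation lands for the interpreter you're running, a small sketch (Python 3.3+):

import site
import sys

print("user site:", site.getusersitepackages())  # `pip install --user` target
print("system site:", site.getsitepackages())    # system-wide targets
print("inside a venv:", sys.prefix != sys.base_prefix)  # venvs redirect installs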

You don't need root permission for most installations

You only need root permission when the library directly interacts with hardware, e.g., you need root permission to install the NVIDIA driver.

You do not need root permission to, e.g., install a different version of Python, GCC, or CUDA (though a newer CUDA sometimes requires a newer driver). But doing these without root permission certainly requires some extra knowledge.

Avoid mixing binaries built from different sources

Python itself is a binary that depends on other binary libraries. Each python package may also contain binaries or depend on other binary libraries. Mixing binaries built from different sources (e.g. your system package manager vs. anaconda) together (i.e. in a single process) creates potential binary compatibility issues.

Such issues can happen when you want to use libA and libB together, but they are built against different versions of another library libC, or built with different C++ compilers. (C compilers, however, generally produce binary-compatible code across compiler versions.)

Ideally you might expect some mechanism to avoid such conflicts. There is indeed a complicated set of symbol visibility & compiler ABI rules, but most libraries do not follow them correctly. The result of such an incompatibility is often a segfault or another mysterious error.
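
On Linux, one way to see a conflict in the making is to inspect which copies of a shared library actually got mapped into the process. A sketch (the imports are hypothetical placeholders; libstdc++ is just one library worth checking):

# import the binary-heavy libraries you suspect first, e.g.:
# import cv2, torch   # hypothetical combination

seen = set()
with open("/proc/self/maps") as f:  # Linux-only
    for line in f:
        path = line.split()[-1]
        if "libstdc++" in path:
            seen.add(path)
print("\n".join(sorted(seen)) or "libstdc++ not loaded")

Seeing two copies, e.g. one under /usr/lib and one under anaconda3/lib, in a single process is a red flag.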

In reality, here is how packages are built:

  • Your OS's package manager (apt/yum/pacman, etc) installs many binaries and libs. They are built with the exact system packages they depend on, using the exact compiler installed by the package manager. They are all built in a nice uniform environment that will not have any compatibility issues: all these packages can be mixed together.

  • When you pip install a package, there are two possibilities:

    • Source distribution: pip will compile the source code, using whatever compiler & dependency libraries it finds. So its compatibility depends on which compiler & libraries it finds. Typically this is controlled by standard environment variables such as $CC, $LIBRARY_PATH, but it varies among packages.

    • Binary wheel distribution: pip will download a pre-built binary. This means that you need to confirm the binary is built in an environment that's compatible with other packages you're using.

      Lots of binary packages on pypi contain the word "manylinux": it means the package is built so that it's supposed to be compatible with most linux environments. Typically, using a manylinux package should not lead to compatibility issues, unless it is incorrectly marked as manylinux (e.g. this decord wheel is not correctly labeled). In addition, a manylinux package may have suboptimal performance due to the compatibility requirements: they are often built with old compilers and an old instruction set.

      For other packages without the "manylinux" mark, you can only wish for good luck. They usually work fine but could stop working any day. There are a number of github issues in different projects about "import libA causes import libB to crash". Typically these involve giant projects, such as OpenCV, TensorFlow, PyTorch.

  • When you conda install a package that contains binaries, it's always pre-built. The official packages are built in anaconda's standard environment, and all the runtime dependencies in that standard environment are also packaged and distributed by anaconda. Anaconda provides an (almost) full runtime environment, including essential libs such as libstdc++ and libgcc. This means the conda world is just like your OS's package manager: if you use conda to install all libraries (and their dependencies), they are always compatible with each other.

    That sounds nice, until you want to build a package by yourself. Anaconda provides a full runtime environment, but usually not the build-time environment. Normally you'll still be building the package using your system's compiler & libraries (or those defined by your envvars).

    As long as you use python from conda, you'll almost always run inside conda's runtime environment, using libstdc++, libjpeg, etc. from anaconda/lib. It's then possible that the package you build is not compatible with conda's runtime environment.

    I've frequently seen such failures, e.g.:

    • Build a package using the system's gcc. Then it cannot run inside conda's runtime, since the runtime is built with an old version of gcc.
    • conda install cudatoolkit=10.1 pytorch gives you a working pytorch with the cuda 10.1 runtime. It works fine until you build a custom cuda extension: the extension will use nvcc from your system, which may not be 10.1.

    That's why I personally avoid conda and use system's python (or pyenv) whenever possible.
