What ML teams should know about Python dependencies
Dependency management is the act of managing all the external bits and pieces that your project needs. When it works, you don’t even know it’s there. On the flip side, you will know if it fails.
In machine learning, Python dependency management is essential to the success of your model and your overall project. If your Python application relies on third-party frameworks to function, proper dependency management helps ML teams with security, sustainability, and consistency.
Indeed, teams that manage their direct dependencies properly are more likely to produce a high-quality ML application. In many ways, good dependency management is what allows teams to take their ML systems to the next level.
In this article, we are going to cover the important things that data and ML teams need to know about Python dependencies and dependency management.
The basics of Python dependency management
Before we start, we need to make one thing clear: simply installing and upgrading Python packages doesn’t count as dependency management. The process is more involved than that. It requires you to document the environment your project needs and to make it easy for others to reproduce it.
While you could jot installation instructions down on a piece of paper or write them in your source code files as comments and technically say that you’ve carried out dependency management, this isn’t recommended.
Instead, we recommend that you decouple important dependency information, such as installation instructions, from the code and record it in a standardized format. This enables version pinning and makes deterministic installation easier. There are many ways to go about this, but the classic way is through ‘pip’ and a ‘requirements.txt’ file.
What is a package in Python?
Before we dive into this, though, let’s cover the very thing that sits at the heart of Python dependency management: the package.
In Python, a package is a collection of modules, and a module is everything that’s defined in a single Python file, such as classes and functions. Matplotlib is a package, for example, whereas the print() function is not.
The purpose of a package is to be an easily distributable, reusable, and versioned collection of modules with well-defined dependencies on other packages. Packages are more common than you might think, and you’re likely to be using one whenever you use import.
The most common method for installing a package is from the Python Package Index (PyPI) using the pip install command. Packages should always be installed in a virtual environment, which acts as a sort of protective bubble around a project.
If, for example, you wanted to install the pandas library by calling pip install pandas, you would need to do this inside the virtual environment; doing it outside would install the package globally on the machine.
This is a very bad move because, naturally, packages update over time, and different projects use different versions of packages. Installing a package globally could therefore cause problems because a single global installation cannot serve two different projects relying on different package versions.
If you are used to using Maven or npm, both of which install packages into your project directory, you might find this confusing. But we can’t stress this enough: Things will become very frustrating if you have two different projects that need to use a different version of the same library. So, always install packages from within a virtual environment.
A virtual environment is made up of its own copy of Python, along with tools and installed packages. Creating a virtual environment for each project isolates the dependencies of different projects from one another. Once you have made a virtual environment for your project, you can install all that project’s dependencies into it (i.e., inside the protective bubble) instead of into your global Python environment. This allows you to install a different version of a package, such as requests, into each virtual environment, eliminating any conflicts.
Let’s look at how to do this in practice.
First, go to your project root directory and create a new virtual environment.
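As a minimal sketch, assuming Python 3 is available on your PATH as python3 and that you want the environment in a directory called venv (the directory name is an arbitrary choice for this example), the step could look like this:

```shell
# Run from the project root. "venv" is just an illustrative directory name.
python3 -m venv venv
```

This relies on the venv module from the Python standard library, so on most installations nothing extra needs to be installed first.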
This creates your bubble. To go inside it, activate the environment from your terminal.
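On Linux or macOS, activation might look like the sketch below. It assumes a POSIX-style shell and an environment directory named venv (an assumption made for this example); on Windows the activation script lives under venv\Scripts instead.

```shell
# Create ./venv only if it is not there yet, so the snippet runs on its own.
[ -d venv ] || python3 -m venv venv

# Activate it: this puts venv/bin first on PATH for the current shell session.
# In bash or zsh, "source venv/bin/activate" is equivalent.
. venv/bin/activate

# The python found on PATH should now be the one inside ./venv.
command -v python
```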
Once inside the bubble, your terminal prompt should show the virtual environment name, typically in parentheses at the start of the line.
Now that you’re inside your protective bubble, you can safely install packages. Any pip install commands that you run will only have an effect inside this bubble, and any code that you run will similarly only use packages inside.
If you list the installed packages, you should see a brief list of default packages, such as pip itself. This is because the listing doesn’t cover all the Python packages installed on your machine, only the packages inside your virtual environment.
To leave your virtual environment, call the deactivate command.
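Putting the last few steps together, a session might look like the sketch below. It assumes a POSIX-style shell and an environment directory named venv (an illustrative name); the exact package list that pip prints depends on your Python version.

```shell
# Create ./venv only if it is not there yet, so the snippet runs on its own.
[ -d venv ] || python3 -m venv venv
. venv/bin/activate

# Lists only what is installed inside ./venv, not the machine-wide packages.
pip list

# Leave the environment; PATH and the prompt go back to what they were.
deactivate
```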
Alternative method: Build virtual environments easily with tox
When you are using virtual environments for your projects, you will want an easy, automated way to build new virtual environments and install all the dependencies from your requirements.txt file. An automated way of doing this is also very useful for quickly and easily rebuilding broken virtual environments.
Tox is a Python tool for managing virtual environments. It enables you to quickly build virtual environments and automate additional build steps such as running unit tests and generating documentation. When you run tox, it automatically builds a new virtual environment, installs all the dependencies, and runs the unit tests. This removes a great deal of friction from the setup process.
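To make this concrete, a minimal tox.ini might look like the sketch below. It assumes your dependencies are listed in requirements.txt and that your tests run with pytest; the environment name py3 and the use of pytest are illustrative choices, and option names can differ between tox versions, so check them against the documentation for the version you use.

```shell
# Write an illustrative tox.ini to the project root.
cat > tox.ini <<'EOF'
[tox]
envlist = py3
skipsdist = true

[testenv]
deps =
    -rrequirements.txt
    pytest
commands =
    pytest
EOF
```

With a file like this in place, running tox in the project root builds the environment, installs the listed dependencies, and runs the test command.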
The tox documentation is a good place to get started.
If you are a machine learning researcher, also take a look at some of the other available tools, such as Conda (used to install, run, and update packages and their dependencies) and Poetry, which has the added benefit of an intuitive command-line interface.
What is version pinning in Python?
Whenever you call pip install to pull a new package into your project, you should consider that this will create a new dependency for your project, and this is something that you need to document. But how?
The easiest way to do this is to write new libraries and their version numbers down in a requirements.txt file. This is a format that pip understands, and it lets you install multiple packages in one go.
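For illustration, a small requirements.txt with pinned versions could be written like this. The packages and version numbers are placeholders chosen for the example, not recommendations:

```shell
# Write a pinned requirements.txt (one "package==version" entry per line).
cat > requirements.txt <<'EOF'
numpy==1.26.4
pandas==2.2.2
EOF

# Show what pip would be asked to install.
cat requirements.txt
```

With the virtual environment active, pip install -r requirements.txt then installs every listed package in one go.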
By doing this, you’ll already be ahead of most other ML teams that don’t document their dependencies properly. To go one step further, however, you can make the installation more deterministic with pip-compile and a requirements.in file.
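Here, requirements.in is the file you maintain by hand, and requirements.txt is auto-generated from it by the pip-compile command, which is part of the pip-tools package.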
You should only put your direct dependencies in your requirements.in file. The pip-compile command will then pin every library, including the dependencies of your dependencies, in the requirements.txt file, providing all the information needed for a deterministic installation.
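A sketch of that workflow, assuming pip-tools is installed in the active environment (pip install pip-tools) and using pandas as a stand-in for your own direct dependency:

```shell
# requirements.in: hand-written, direct dependencies only.
cat > requirements.in <<'EOF'
pandas
EOF

# Auto-generate requirements.txt with every package pinned.
# Skipped with a hint if pip-tools is not installed.
if command -v pip-compile >/dev/null 2>&1; then
    pip-compile requirements.in
else
    echo "pip-compile not found; install it with: pip install pip-tools"
fi
```

The generated requirements.txt lists pandas together with the packages it pulls in, each pinned with ==, and notes in a comment which dependency brought each one in.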
Pinning the Python version
Pinning the Python version can be difficult because there’s no straightforward way to pin the version of Python itself. While you could make a Python package out of your project and define the Python version in setup.py or setup.cfg with python_requires (for example, python_requires=">=3.9" in setup.py), that’s overengineered for a simple data science project.
If you really want to pin to a specific Python, you could do something along the lines of this:
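One lightweight possibility, offered as a sketch and not a definitive approach, is a small check that stops early when the interpreter is older than expected. The file name check_python.py and the 3.9 minimum (borrowed from the python_requires example above) are assumptions for illustration:

```shell
# Write a tiny guard script; the file name is hypothetical.
cat > check_python.py <<'EOF'
import sys

REQUIRED = (3, 9)  # illustrative minimum version

if sys.version_info < REQUIRED:
    sys.exit("Python %d.%d or newer is required" % REQUIRED)
print("Python version OK")
EOF

# Run it before anything else, e.g. as the first step of a setup script.
python3 check_python.py
```

To insist on one exact minor version instead of a minimum, the comparison could be changed to sys.version_info[:2] != REQUIRED.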
Conclusion
In this article, we covered the basics of what ML teams should know about Python dependencies. In essence, it’s important for ML teams not to skip dependency management. While it might be convenient to ignore it in the short term, you will be thankful for having documented your dependencies if some unforeseen event knocks out your machine and you need to rebuild the environment.
As for the debate over pinning versions versus not pinning them, we are of the opinion that pinning is always better. Version pinning prevents packages from moving forward and updating themselves before your project is ready, thus avoiding potential conflicts.
Looking ahead, when your project matures and moves to the cloud and into production, it becomes important to consider pinning the entire environment, not just Python. This is where Docker containers come into play; they let you pin not only the Python version but also anything else inside the operating system. You can think of Docker containers as virtual environments on a much bigger scale.