Just like human beings require a specific environment to survive and thrive, so too does software.
To function in areas outside of the environment our bodies are designed to work in, such as the deepest depths of the sea or high up in the sky where the atmosphere is thin, we require specialist “containers” like submarines and spacesuits. Without them, we would simply die.
Similarly, to function in environments that a piece of software isn’t designed for, it too needs a container that can isolate it from everything else that exists on the same system. This is exactly what ‘Docker’ was designed for, and we’re going to cover everything that you need to know about this infinitely useful containerization platform in this article, including how you can use it to benefit your own workflows.
Before we get started, however, we need to make sure that we are all on the same page.
Docker is a software tool for creating and deploying isolated environments (containers, which are lighter-weight than full virtual machines) for running applications together with their dependencies. There are a few terms that you need to be familiar with before we dive into the fundamentals: the Dockerfile, the Docker image, and the Docker container.
There are many advantages to using Docker in data science and machine learning projects, including reproducibility, portability, consistency, and easier collaboration.
By this point, you might be thinking, "Why should I care?". Well, keep in mind that more and more systems rely on Docker as the trend of ML containerization continues to grow. Getting to grips with it now will make you a better ML engineer in the long term, helping you turn your ML projects into applications and deploy models into production.
Now that you have got an idea of what Docker is, let’s go through the process of how to create Docker containers. This can be achieved by following a three-step flow: write a Dockerfile, build it into a Docker image, and run that image as a container.
First things first, we need a set of instructions. That’s because Docker is instruction-based, not requirement-based; you need to describe the how rather than the what. To do this, we create a text file and name it ‘Dockerfile’.
The FROM command describes a base environment, eliminating the need to start from scratch. If you don’t have a base image, you can find a whole load of them on Docker Hub or in other public container registries. The RUN command is an instruction to change the environment.
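As a sketch, here is what a minimal Dockerfile of the kind described above might look like (the base image version and library choices are illustrative):

```dockerfile
# Start from a public base image pulled from Docker Hub
FROM python:3.9

# Each RUN instruction changes the environment; here we
# install Python libraries one by one
RUN pip install pandas
RUN pip install scikit-learn
```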
Although the example we’re sharing installs Python libraries one by one, this isn’t how it should be done. Best practice says that you should utilize requirements.txt, which defines the Python dependencies. You can learn more about this in our previous blog post, What ML teams should know about Python dependencies.
The COPY command copies a file from your local disk, e.g., the requirements.txt file, into the image. The RUN command then installs all the Python dependencies that are defined in requirements.txt.
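Putting the COPY and RUN steps together, the best-practice version might be sketched like this (the base image is again an assumption):

```dockerfile
FROM python:3.9

# Copy the dependency list from the local disk into the image
COPY requirements.txt .

# Install all the Python dependencies defined in requirements.txt
RUN pip install -r requirements.txt
```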
Now that we’ve got our Dockerfile, we can compile it into a binary artifact known as a Docker image. The reason for compiling the Dockerfile is simple: it makes starting the environment fast and reproducible. If we skipped this step, we would compromise standardization and reproducibility, leading to the problems we touched on earlier.
To compile your Dockerfile, use the build command:
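Using the image name and tag discussed next, the build command looks like this (run from the directory containing the Dockerfile, which the trailing dot refers to):

```shell
docker build -t myimage:1.0 .
```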
This command builds an image on your machine. The -t parameter defines the image name, in this case, ‘myimage’, and gives it a tag, ‘1.0’. You can list all the images by running the command:
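The listing command is:

```shell
docker images
```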
These images, comparable to “snapshots” in traditional virtual machine settings, capture the state of the environment at a certain point in time. The key thing about Docker images is that they’re immutable: they cannot be changed, only deleted.
This is critical in the Docker world because once you’ve set up your Docker virtual machine and have created an image, you can be certain that the image will always function, making it simple to experiment with new features.
As we explained earlier, containers are what protect your application from other applications that exist on the same machine. The instructions can either be embedded into the image or provided before starting the container. To do the latter, run the command:
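For example, passing an echo instruction to the container at start time (the message text is illustrative):

```shell
docker run myimage:1.0 echo "Hello world"
```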
This command starts the container, runs an echo command, and then shuts the container down. We now have a reproducible method for executing our code in any environment that supports Docker: no matter what machine is being used, as long as Docker is available, the code will work. This level of standardization and reproducibility is critical in data science, where each project has several dependencies.
Containers will close themselves down when they’ve executed their instructions. That said, they can run for a long time, and you can control this by starting a long background command, for example:
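A sketch of such a long-running command, where sleep infinity is just an illustrative stand-in for real work:

```shell
docker run -d myimage:1.0 sleep infinity
```

The -d flag runs the container in the background (detached mode), so your terminal stays free while the container keeps running.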
By running the command docker container list, you’ll be able to see whether it’s running. To stop a container, take the container ID from the table and call the command docker stop <ID>. This will stop it but keep its state. Alternatively, to completely terminate the container, call the command docker rm -f <ID>.
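The lifecycle commands described above, in order (with <ID> standing for the container ID taken from the listing):

```shell
docker container list   # shows running containers and their IDs
docker stop <ID>        # stops the container but keeps its state
docker rm -f <ID>       # terminates the container completely
```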
In our recent blog post, What ML teams should know about Python dependencies, we talked about Python virtual environments and how these can be used to form a protective ‘bubble’ between different Python projects in the same local development environment.
Although it may sound like Docker solves the same problem, it doesn’t; it solves a similar problem but on a different layer. While a Python virtual environment creates an isolation layer for everything Python-related, Docker isolates the full software stack. The use cases for Python virtual environments and Docker are therefore different.
As a general rule—and this is something to keep in mind—virtual environments are ideal for developing applications on your local machine, whereas Docker containers are built for collaborative production environments in the cloud.
The machine learning space moves very fast. New research is constantly being implemented into APIs and open source frameworks. When things evolve this rapidly, keeping up with the latest developments and maintaining quality, consistency, and reliability can be a seemingly insurmountable challenge.
As you have hopefully learned from this article, one way to address this challenge is to move to containerized ML development by leveraging tools like Docker. Given that this enables ML teams to increase portability, achieve greater efficiency, operate more consistently, and develop better applications, it just makes sense, especially in situations where there are multiple engineers working on a single project.
Speaking of engineers, Docker enables them to track the different versions of a container image, check who built a version and with what, and roll back to previous versions if necessary. Furthermore, an ML application can continue running even if one of its services is being updated or repaired, or goes down.