Everything you need to know about Docker for data science and ML

Ran Romano
Co-founder & CPO at Qwak
June 14, 2022

Just like human beings require a specific environment to survive and thrive, so too does software. 

To function in areas outside of the environment our bodies are designed to work in, such as the deepest depths of the sea or high up in the sky where the atmosphere is thin, we require specialist “containers” like submarines and spacesuits. Without them, we would simply die. 

Similarly, to function in environments that a piece of software isn’t designed for, it too needs a container that can isolate it from everything else that exists on the same system. This is exactly what ‘Docker’ was designed for, and we’re going to cover everything that you need to know about this infinitely useful containerization platform in this article, including how you can use it to benefit your own workflows.  

What is Docker?

Before we get started, however, we need to make sure that we are all on the same page. 

Docker is a software tool for the creation and deployment of isolated environments (conceptually similar to lightweight virtual machines) for running applications with their dependencies. There are a few terms that you need to be familiar with before we dive into the fundamentals:

  • Docker Container—A single instance of the live, running application. 
  • Dockerfile—A text file with a list of commands to call when creating a Docker Image.
  • Docker Image—A blueprint for creating containers. All containers created from the same image are exactly alike. 

There are many advantages to using Docker in data science and machine learning projects. These include: 

  • Standardization—The main advantage of using Docker is standardization. This means that you can define the parameters of a container once and run it wherever Docker is installed. This in turn provides two more benefits: reproducibility and portability. 
  • Reproducibility—With Docker, everyone has the same operating system and the same versions of tools. This helps to avoid the problem of an application working on one machine but not another. If it works on one machine, it will work on them all. 
  • Portability—Portability makes it easy to move from local development to a cluster. In addition, if you’re working on open-source data projects then portability makes it easy for collaborators to bypass setup. 
  • Deployment—Docker makes it easier to deploy ML models on the fly. Need to provide external stakeholders with a status update? That’s fine: simply put your model into an API container and deploy it via Kubernetes—job done. (OK—we have simplified this somewhat, but the point that we’re making is that Docker makes it relatively straightforward to go from iteration in a workflow to deployment in a container.)
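
To make that last point concrete, here is a minimal, hypothetical sketch of what packaging a model behind an API in a container might look like. The file names (app.py, requirements.txt) and the choice of FastAPI with uvicorn are assumptions made purely for illustration, not a prescribed setup:

# Dockerfile for a hypothetical model-serving API

FROM python:3.9
COPY requirements.txt /tmp
RUN pip install -r /tmp/requirements.txt          # e.g. fastapi, uvicorn, and your ML framework
COPY app.py /app/app.py                           # app.py loads the model and exposes a /predict endpoint
WORKDIR /app
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8080"]

Don’t worry if the syntax is unfamiliar; each command is explained in the next section. Once built, the resulting image can be run locally with docker run or handed to an orchestrator such as Kubernetes.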

By this point, you might be thinking, “Why should I care?”. Well, just keep in mind that many more systems are beginning to rely on Docker as the trend of ML containerization continues to grow, and getting to grips with it now will make you a better ML engineer in the long term, helping you to turn your ML projects into applications and deploy models into production. 

How can I create a Docker container? 

Now that you have got an idea of what Docker is, let’s go through the process of how to create Docker containers. This can be achieved by following a three-step flow:

  1. Dockerfile—Instructions for compiling an image.
  2. Docker Image—The compiled artifact.
  3. Docker Container—An executed instance of the image. 

1. Dockerfile

First things first, we need a set of instructions. That’s because Docker is instruction-based, not requirement-based; you need to describe the how rather than the what. To do this, we create a text file and name it ‘Dockerfile’. 

# Dockerfile

FROM python:3.9
RUN pip install tensorflow==2.7.0
RUN pip install pandas==1.3.3

The FROM command describes a base environment, eliminating the need to start from scratch. If you don’t have a base image of your own, you can find a whole load of ready-made ones on Docker Hub and in other public container registries. The RUN command is an instruction to change the environment. 

Although the example we’re sharing installs Python libraries one by one, this isn’t how it should be done. Best practice says that you should utilize requirements.txt, which defines the Python dependencies. You can learn more about this in our previous blog post, What ML teams should know about Python dependencies.

# Dockerfile with requirements.txt

FROM python:3.9
COPY requirements.txt /tmp
RUN pip install -r /tmp/requirements.txt

The COPY command copies a file from your local disk, e.g., the requirements.txt file, into the image. The RUN command then installs all the Python dependencies that are defined in requirements.txt. 
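
For completeness, a requirements.txt that matches the first example above would simply pin the same versions:

# requirements.txt

tensorflow==2.7.0
pandas==1.3.3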

2. Docker Image

Now that we’ve got our Dockerfile, we can compile it into a binary artifact that’s known as a Docker Image. The reason for compiling the Dockerfile is simple: building the environment once, up front, makes starting containers fast and keeps the result reproducible. If we rebuilt the environment from scratch every time we ran our code, we would compromise standardization and reproducibility, leading to the problems we touched on earlier.

To compile your Dockerfile, use the build command:

docker build . -t myimage:1.0

This command builds an image on your machine. The -t parameter defines the image name, in this case, ‘myimage’, and gives it a tag, ‘1.0’. You can list all the images by running the command:

docker image list
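
The output lists each image together with its repository, tag, ID, creation time, and size. For the image built above it will look roughly like this (the ID, timestamp, and size are purely illustrative):

REPOSITORY   TAG   IMAGE ID       CREATED          SIZE
myimage      1.0   3a4b5c6d7e8f   2 minutes ago    2.1GB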

These images are the Docker equivalent of the “snapshots” you may know from other virtual machine tools: each one captures the full state of an environment at a certain point in time. The key thing about Docker images is that they’re immutable; they cannot be changed, only deleted. 

This is critical in the Docker world because once you’ve defined your environment and built an image, you can be certain that the image will always behave the same way, making it safe to experiment with new features. 
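
In practice, experimenting simply means editing the Dockerfile and building a new image under a new tag; the old image stays untouched and you can always fall back to it. A minimal sketch:

# edit the Dockerfile, then build a second, independent image
docker build . -t myimage:1.1

# myimage:1.0 still exists, unchanged, alongside the new version
docker image list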

3. Docker Container

As we explained earlier, containers are what protect your application from other applications that exist on the same machine. The instructions a container runs can either be embedded into the image or passed on the command line when the container starts. To do the latter, run the command:

docker run myimage:1.0 echo "Hello world"

This command starts the container, runs the echo command, and then shuts it down. We now have a reproducible method for executing our code in any environment that supports Docker: no matter what machine is being used, as long as Docker is available, the code will work. This level of standardization and reproducibility is critical in data science, where every project has numerous dependencies. 
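
In a data science workflow, the container usually needs access to your code and data as well. One common pattern is to mount a local directory into the container at run time; in this sketch, train.py is an assumed script name rather than anything created earlier in this article:

# mount the current project folder into the container and run a training script inside it
docker run -v "$(pwd):/work" -w /work myimage:1.0 python train.py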

Containers shut themselves down once they’ve executed their instructions. That said, they can also run for a long time; you can see this by starting a long-running command in the background, for example:

docker run myimage:1.0 sleep 500000000000 &

By running the command docker container list, you’ll be able to see whether it’s running. To stop a container, take the container ID from the table and call the command docker stop <ID>. This will stop it but keep its state. Alternatively, to terminate the container completely, call the command docker rm -f <ID>. 
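
Putting the whole lifecycle together, a typical inspect, stop, and remove sequence looks like this (replace <ID> with the container ID reported by the list command):

# list running containers and note the CONTAINER ID
docker container list

# stop the container but keep its state
docker stop <ID>

# or remove it entirely
docker rm -f <ID>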

Docker vs Python virtual environments

In our recent blog post, What ML teams should know about Python dependencies, we talked about Python virtual environments and how these can be used to form a protective ‘bubble’ between different Python projects in the same local development environment. 

Although it may sound like Docker solves the same problem, it doesn’t; it solves a similar problem, but on a different layer. A Python virtual environment isolates only the Python-related parts of a project (the interpreter and its packages), whereas Docker isolates the full software stack, from the operating system upwards. The use cases for Python virtual environments and Docker are therefore different. 

As a general rule—and this is something to keep in mind—virtual environments are ideal for developing applications on your local machine, whereas Docker containers are built for collaborative production environments in the cloud. 
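
To make the distinction concrete, here is roughly how the two workflows compare for the same project; the script name train.py is assumed purely for illustration:

# Python virtual environment: isolates only the Python layer
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
python train.py

# Docker: isolates the full stack (OS, system libraries, Python, packages)
docker build . -t myimage:1.0
docker run myimage:1.0 python train.py   # assumes train.py was copied into the image or mounted in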

Containerized ML just makes sense

The machine learning space moves very fast. New research is constantly being implemented into APIs and open source frameworks. When things evolve this rapidly, keeping up with the latest developments and maintaining quality, consistency, and reliability can be a seemingly insurmountable challenge. 

As you have hopefully learned from this article, one way to address this challenge is to move to containerized ML development by leveraging tools like Docker. Given that this enables ML teams to increase portability, achieve greater efficiency, operate more consistently, and develop better applications, it just makes sense, especially in situations where there are multiple engineers working on a single project.

Speaking of engineers, Docker enables them to track the different versions of a container image, check who built a version and with what, and roll back to previous versions if necessary. Furthermore, ML applications can continue running even if one of their services is being updated, repaired, or is down. 

Chat with us to see the platform live and discover how we can help simplify your ML journey.
