What data teams need to know about Git for data science
If you’re a data professional, you’ve probably already heard about Git, and chances are that somebody has told you that it’s only for software developers, and as a data professional, it’s not something you should care about.
If you’re a software engineer who has ‘switched’ to the field of data science, this is a topic you’ll be all too familiar with. And if you’re an aspiring data scientist who comes from a different background, it’s something that you should be interested in.
In this article, we’re going to attempt to dispel the myth that Git is a tool only for software developers by showing you how important it is for data scientists and teams to use it in their workflows.
First things first… what is Git?
It’s often the case that lots of different developers work on the same project. Over time, it’s very plausible that larger projects could have hundreds or thousands of people who have left their own stamp on them. This is where Git comes in.
“Git is a distributed version control system for tracking changes in source code during software development,” according to Wikipedia. In fact, Git is the most widely used modern version control system in the world today. It is the most recognized and popular approach to contributing to a project in a distributed and collaborative manner.
Without a version control system, having that many contributors would invariably lead to chaos and confusion. Resolving conflicts, for example, would be close to impossible because nobody would have a record of who changed what, which makes it very hard to merge everyone’s work into a single central source of truth. Git, and services built on top of it such as GitHub, help developers overcome this problem.
What is the difference between Git and GitHub, you ask? That’s simple. Git is the underlying technology for tracking and merging changes in source code while GitHub is a web platform built on top of the Git technology to make tracking easier. It offers additional features such as user management, automation, and pull requests.
Basic Git terminology and commands
Now that you’ve had an introduction to what Git is, let’s take a look at some of the common terminology and commands that you’ll come across while using it.
Common Git terminology
- Repository—A database of all the branches and commits of a single project.
- Branch—Alternative state or line of development for a repository.
- Merge—Merging two (or more) branches into a single branch, single truth.
- Clone—Creating a local copy of the remote repository.
- Origin—Default name for the remote repository that the local clone was made from.
- Main/Master—Name for the root branch, which is the central source of truth.
- Stage—Choosing which files will be part of the new commit.
- Commit—A saved snapshot of staged changes made to the file(s) in the repository.
- Push—Pushing means sending your changes to the remote repository.
- Pull—Pulling means getting everybody else’s changes from the remote repository into your local repository.
- Pull Request—Mechanism for reviewing and approving your changes before merging to main/master.
Common Git commands
- git init—Create a new repository on your local machine.
- git clone—Begin working on an existing remote repository.
- git add—Stage the file(s) that should be part of the next commit.
- git status—Show files that you have changed.
- git commit—Save a snapshot of the chosen file(s).
- git push—Send your saved snapshots into the remote repository.
- git pull—Fetch recent commits from the remote repository and merge them into your current local branch.
- git branch—Create or delete branches.
- git checkout—Switch branches or undo changes made to local files.
- git merge—Merge branches to form a single truth.
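To see how a few of these commands fit together, here is a minimal sketch of a stage, commit, and push cycle driven from Python. The function names, the file name, and the branch name are hypothetical and chosen for illustration; the git subcommands themselves are the ones listed above. It assumes git is installed and that the script runs inside a cloned repository.

```python
import subprocess
from typing import Callable, List, Sequence


def build_commit_and_push_commands(
    paths: Sequence[str], message: str, branch: str, remote: str = "origin"
) -> List[List[str]]:
    """Return the git commands for one stage -> commit -> push cycle, in order."""
    if not paths:
        raise ValueError("pass at least one path to stage")
    return [
        ["git", "add", "--", *paths],      # stage only the named files
        ["git", "commit", "-m", message],  # save a snapshot of the staged changes
        ["git", "push", remote, branch],   # send the new commit to the remote
    ]


def commit_and_push(
    paths: Sequence[str],
    message: str,
    branch: str,
    remote: str = "origin",
    run: Callable = subprocess.run,  # injectable so the sequence can be dry-run
) -> None:
    """Run the cycle, stopping at the first command that fails."""
    for command in build_commit_and_push_commands(paths, message, branch, remote):
        run(command, check=True)
```

Calling commit_and_push(["train.py"], "Fix label encoding", "main") would stage one file, commit it, and push it to origin. Naming the files explicitly, rather than staging everything, is a deliberate choice here: it makes it harder to sweep stray files into a commit.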
Why should data teams use Git?
Given all of the above, you might think the claim we opened with is correct and that Git really is just for software developers. That isn’t true, though. Git is growing in importance in data science, and contributing to open source projects through it can help data science teams work much more cohesively with development teams.
In addition, Git is particularly useful for data teams in organizations that follow agile software development frameworks where distributed version control helps to make development workflows faster, more efficient, and easily adaptable to changes.
Let’s say you’re a data scientist on a team where you and a colleague are both working on the same function for building a machine learning model. If you make some changes to the function and push them to the remote repository, where they’re merged into the master branch, your model becomes a new version. For the purpose of this example, let’s call it version 2.7.
Later, your colleague also changes the same function, starting from version 2.7, and those changes are merged into the master branch too. Now the model becomes version 2.8. If at any point your team learns that version 2.8 has a bug, they can roll back to version 2.7. That’s the beauty of Git and version control; it makes development so much easier.
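The idea behind this example can be sketched with a toy model in Python. This is not how Git stores data internally, and Git does not assign version numbers such as 2.7 on its own (commits are identified by hashes, and version labels are usually attached as tags). The class, file name, and function bodies below are hypothetical; the point is only that every commit is a full snapshot you can return to.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToyHistory:
    """Toy stand-in for a branch: an ordered list of labelled snapshots."""

    labels: List[str] = field(default_factory=list)
    snapshots: List[Dict[str, str]] = field(default_factory=list)

    def commit(self, label: str, files: Dict[str, str]) -> None:
        """Record a snapshot of the given files under a label."""
        self.labels.append(label)
        self.snapshots.append(dict(files))  # copy, so later edits can't alter history

    def checkout(self, label: str) -> Dict[str, str]:
        """Return a copy of the files exactly as they were at that label."""
        return dict(self.snapshots[self.labels.index(label)])


history = ToyHistory()
history.commit("2.7", {"features.py": "def scale(x): return x / 10"})
history.commit("2.8", {"features.py": "def scale(x): return x / 0"})  # the buggy change

# Version 2.8 turned out to be broken, so go back to the 2.7 snapshot.
restored = history.checkout("2.7")
```

Nothing is lost by rolling back: the 2.8 snapshot stays in the history, so the team can still inspect it to find out what went wrong.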
Thankfully, data teams don’t need to be experts in Git to use it. They simply need to understand the Git workflow and find ways to implement it in their day-to-day tasks.
Basic rules for using Git
If you are going to try and get to grips with Git, it’s important to keep some basic rules in mind.
Don’t push datasets
Git is a version control system designed for software developers. Its features are built for source code and closely related content such as configuration, dependency lists, and documentation. It was not designed for training data, so treat it as a tool for code only.
In software development, it has long been thought that code reigns supreme and everything else exists to serve it. In data science, however, there’s a duality between data and code: it doesn’t make sense for code to depend on data, nor for data to depend on code, so it’s a good idea to decouple them. This is where the code-centric school of thought fails. If datasets are simply checked into Git alongside the code, the code repository becomes the core point of truth for the whole project, and that is a role Git should never play in a data science project.
There are extensions such as Git LFS (Large File Storage) that keep large files outside the repository and store only lightweight pointers to them in Git. Although they serve a purpose and solve some of the technical limitations, such as size and speed, they don’t solve the core problem of the code-centric mindset.
Keep in mind that no matter how careful you are, there’ll always be datasets floating around in your local directory, and it’s easy to stage and commit them by accident. So stay alert or, better, list your data directories and file types in .gitignore so that Git skips them.
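One possible safeguard is a small check that you run before staging anything. The sketch below walks a project directory and reports files that look like datasets, either by extension or by size. The function name, the extension list, and the 5 MB threshold are all assumptions made for illustration and should be tuned to your project.

```python
from pathlib import Path
from typing import Iterable, List, Union

# Assumptions for illustration: adjust both to match your own project.
DATA_SUFFIXES = {".csv", ".parquet", ".feather", ".h5", ".pkl"}
MAX_BYTES = 5 * 1024 * 1024  # flag anything larger than 5 MB


def find_probable_datasets(
    root: Union[str, Path],
    suffixes: Iterable[str] = DATA_SUFFIXES,
    max_bytes: int = MAX_BYTES,
) -> List[Path]:
    """Return files under root that look like data rather than code."""
    root = Path(root)
    suffixes = {s.lower() for s in suffixes}
    hits = []
    for path in sorted(root.rglob("*")):
        if ".git" in path.relative_to(root).parts:
            continue  # skip Git's own bookkeeping directory
        if not path.is_file():
            continue
        if path.suffix.lower() in suffixes or path.stat().st_size > max_bytes:
            hits.append(path)
    return hits
```

Anything this function returns is a candidate for your .gitignore file. The check is deliberately crude: it will flag a large notebook as readily as a real dataset, so treat the result as a prompt to look, not as a verdict.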
Don’t push sensitive information
Although this sounds obvious, it’s something that needs to be said. You should never commit usernames, passwords, keys, API tokens, or anything else that’s sensitive to a Git repository. Remember that even private repositories can be accessed from multiple accounts and cloned to local machines, so a single compromised account or laptop could give an attacker everything they need to wreak real havoc. Private repositories can also be made public in the future.
It’s good practice to decouple secrets from your code and pass them in through the environment instead. In a Python project, for instance, you can keep environment variables in a .env file (loaded with a library such as python-dotenv) and list .env in the .gitignore file so that it never gets committed or pushed to the remote Git repository.
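As a minimal sketch of the code side of this, the helper below reads a secret from the environment using only the standard library and fails loudly if it is missing. The function name is hypothetical, and how the variable gets into the environment (a .env loader, your shell, or a deployment system) is left open.

```python
import os


def get_required_secret(name: str) -> str:
    """Read a secret from the environment instead of hard-coding it in source."""
    value = os.environ.get(name)
    if not value:
        # Fail early with a clear message rather than running with a missing secret.
        raise RuntimeError(f"environment variable {name} is not set")
    return value
```

A script would call something like get_required_secret("DB_PASSWORD") at startup, where DB_PASSWORD is an illustrative variable name. Because the value never appears in the source file, there is nothing sensitive for Git to commit.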
Don’t force your pushes
You might find that Git rejects a push to the remote repository, most often because the remote branch contains commits that your local branch doesn’t have yet. Git aborts the push when this happens. Although you can force it through with -f or --force, it’s not something you should do, because forcing overwrites the remote branch and can discard other people’s commits.
Although the --force option does serve a useful purpose in some situations, overriding a push that Git has rejected is not one of them. Instead, read the error message that Git gives you and use it to solve the problem (often by pulling the latest changes first) before attempting to push again.
Create smaller commits with clear descriptions
It’s easy for new users of Git to fall into the trap of making huge commits with unclear (or no) descriptions. A good rule of thumb is the opposite: any single commit should do only one thing. Fix one problem per commit rather than three, four, or five.
The reason for this is simple: it makes version control easier. When you make smaller, clearer commits, other people can understand what has happened in the past and accurately pinpoint where things went wrong. This becomes much harder if you’ve fixed 10 bugs in one commit and labelled it with an unclear description; such a commit has little to no value for developers who look at it later. At the same time, you don’t need to write essay-length descriptions. Keep them brief and to the point.
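If a team wants to nudge itself towards clearer descriptions, a simple check on the commit message can help. The sketch below is one way to do that, under stated assumptions: the 72-character limit and the list of vague subjects are common conventions picked for illustration, not rules that Git enforces, and the function name is hypothetical.

```python
from typing import List

# Assumption: subjects this generic say nothing about what the commit does.
VAGUE_SUBJECTS = {"fix", "fixes", "update", "updates", "changes", "wip", "misc"}


def check_commit_message(message: str, max_subject_length: int = 72) -> List[str]:
    """Return a list of problems found in a commit message (empty means OK)."""
    problems = []
    lines = message.splitlines()
    subject = lines[0].strip() if lines else ""

    if not subject:
        problems.append("subject line is empty")
    elif len(subject) > max_subject_length:
        problems.append(f"subject line is longer than {max_subject_length} characters")

    if subject.lower().rstrip(".") in VAGUE_SUBJECTS:
        problems.append("subject line does not say what changed")

    # A blank second line separates the short subject from any longer body.
    if len(lines) > 1 and lines[1].strip():
        problems.append("second line should be blank")

    return problems
```

A message such as "Fix off-by-one error in train/test split" passes with no problems, while a bare "fix" is flagged as uninformative. A check like this cannot tell whether a commit really does only one thing, so it complements code review instead of replacing it.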
Don’t be afraid of Git
Data teams are increasingly becoming involved in the research & development of machine learning models; gone are the days when data professionals only worked with data, full stop.
This means that data science and engineering domains are colliding more than ever before, and as such it’s important for data teams to learn the fundamentals of engineering best practices, and vice-versa.
Because of this and despite the myths that Git is only for engineers, data teams need to become familiar with the basics of it in order to work effectively with engineering teams. Luckily, as we’ve highlighted, this isn’t as difficult as some teams may fear.
If you found this interesting and would like to read more of our insights into the world of data, AI, and machine learning, check out the Qwak blog!