SPCE0038: Machine Learning with Big Data

Introduction

In this second lecture we will introduce a tool called Data Version Control, often shortened to DVC, which can be used to automate some of the tasks involved in accessing data and managing the steps in data pipelines that we encountered in the previous lecture.

Importantly, as suggested by the name, DVC also provides a systematic way of keeping track of different versions of models and datasets. Data science workflows are often very iterative and involve a lot of going back and forth, training models with different hyperparameter values and integrating new data as it becomes available, so having an approach for systematically organizing our models and data becomes really important. It can help avoid common issues such as not being able to identify the particular model parameters that gave the best performance on some metric, or not being able to reproduce exactly all the steps that were used to produce a model output.

For the current lecture we will cover:

Why using data version control specifically, and version control systems more generally, is a good idea.
How we can use DVC to access and track data files and change between different tracked versions of files.

This lecture is part of a series on Data Version Control (DVC), a way of systematically keeping track of different versions of models and datasets.

This first lecture in the series will cover:

Why using DVC is a good idea.
How to track files and move between versions.

Learning outcomes

Recognize the need for version control when working with code and data files.
Explain how version control systems like Git track changes.
Compare the use of Git and DVC for working with data files.
Apply DVC to track changes to data files.

Why do we need version control?

Before looking at data version control specifically, we will give a brief introduction to the idea of version control systems in a more general setting. Let’s first consider why we need to use version control in the first place.

You may have seen this cartoon from PhD Comics by Jorge Cham before. It shows very well why version control can be helpful in a range of contexts, through a situation many of you may have experienced yourself: editing a document in response to feedback from, for example, a supervisor or peer. A common ad-hoc approach in such settings is to use a naming scheme to track the different versions of the document. While this may work for a couple of sets of changes, beyond that it becomes increasingly difficult to stay consistent when tracking versions manually like this. It is also typically difficult to manage diverging sets of revisions, for example when multiple people edit a file separately and their edits then need to be merged back together.

What is version control

Version control provides an alternative systematic approach to this problem.

Changes to a set of files are tracked by creating distinct checkpoints or commits as we work.
Importantly the relationship between the commits is recorded allowing us to move backwards and forward through the history of files.
As well as recording linear histories, version control systems will typically allow us to create a diverging history, where two different sets of changes are made to one set of files, resulting in a branching history. Conversely, we can also merge changes from parallel lines of work back together, with systematic approaches for dealing with conflicting changes between files.
In addition to giving a systematic way of tracking changes, version control systems also make it easier to work collaboratively with other people, typically allowing us to share the entire history of a project and have multiple people all making changes to the same set of files.

There are several different version control systems available, with Git, Subversion and Mercurial all in current use. We will concentrate on Git here, as it is currently the most popular system, though many of the concepts are shared with other systems.

Instead of having multiple copies or working on a shared version:

track changes in distinct stages (commits) as you work,
move backwards and forwards in history,
explore different alternatives (branches),
share entire history with others.

Different systems: Git, Subversion, Mercurial, …

Tracking changes

We start our work by committing the state of our code or data. Each commit we create is given a unique identifier:

Diagram of a single commit, represented as a labelled circle

Tracking changes

As we work, we make more commits:

Diagram of two commits with a link from the first to the second
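
The two steps pictured above can be tried out with Git in a throwaway directory. This is only a minimal sketch: the file name notes.txt, the commit messages and the author identity are made up for illustration, and the identifiers printed will be different on every machine.

```shell
# Work in a throwaway repository so nothing else on the system is touched.
repo=$(mktemp -d)
cd "$repo"
git init -q
# Illustrative identity, stored only in this repository's configuration.
git config user.name "Example User"
git config user.email "user@example.com"

echo "first draft" > notes.txt
git add notes.txt
git commit -q -m "Add first draft"
first=$(git rev-parse HEAD)        # unique identifier of the first commit

echo "second draft" >> notes.txt
git commit -q -am "Extend draft"
second=$(git rev-parse HEAD)       # unique identifier of the second commit

# The second commit records the first as its parent.
parent=$(git rev-parse "$second~")
git log --oneline                  # lists both commits, newest first
```

The link drawn between the two circles in the diagram corresponds to the parent recorded in the second commit.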

Tracking changes

Sometimes we make mistakes:

Diagram of three linked commits, where the third is highlighted as wrong

Tracking changes

After realising the error, we can go back and fix it, replacing it with a new commit:

Diagram of three commits, where the previous mistake has been replaced with a new, fixed commit

Tracking changes

Often, we want to try out different approaches before we decide on what’s best:

Diagram of a commit history with two branches

Tracking changes

This results in a non-linear history. If we want, we can also merge the two branches:

Diagram of a commit history where two branches split off and are then rejoined
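
A branching and merging history like the one pictured can be reproduced with a few Git commands. The sketch below assumes a reasonably recent Git (the -b option of git init needs version 2.28 or newer); the branch name approach-a, the file names and the author identity are hypothetical.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q -b main                # -b needs Git 2.28 or newer
git config user.name "Example User"
git config user.email "user@example.com"

echo "shared start" > base.txt
git add base.txt
git commit -q -m "Common starting point"

# Try one approach on a separate branch ...
git switch -q -c approach-a
echo "approach A" > a.txt
git add a.txt
git commit -q -m "Try approach A"

# ... and another on main, so that the two histories diverge.
git switch -q main
echo "approach B" > b.txt
git add b.txt
git commit -q -m "Try approach B"

# Merge the branches back together: the merge commit has two parents.
git merge -q --no-edit approach-a
fields=$(git rev-list --parents -n 1 HEAD | wc -w)   # merge id plus its parents
git log --oneline --graph
```

Because the two branches touch different files, the merge here completes without conflicts; when both branches change the same lines, Git stops and asks us to resolve the conflict by hand.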

Why data version control?

Now that we understand better what standard version control systems do, we can get back to the specific topic of interest, that is data version control. Data science projects and workflows have similar needs in this regard to other situations where we might use version control. We will regularly make mistakes or revisit our assumptions, and so want to go back to earlier versions of files. Often we also have new data becoming available over time, meaning we want to record updates to our models. As we mentioned previously, most data workflows are also exploratory and iterative in nature, as we want to investigate the performance of different variants of models, for example different parameter values or choices of model class or algorithm, or to add or remove stages from our overall data pipeline, such as pre-processing steps.

Similar principles apply to data workflows as to code:

Mistakes happen!
New data appearing.
Try variants of model (e.g. algorithm or its parameters) or data pipeline (e.g. preprocessing).

Why data version control?

While Git is most commonly applied to tracking changes in the source code for software projects, it can actually be used to track changes to any set of files, and so can equally be applied to data and model artefacts generated during a data science project as well as the associated code.

However, while Git can be used to track such files, there are some pitfalls which suggest a more bespoke solution such as DVC can be helpful.

One key issue is that Git is primarily targeted at text-based files, and its approach to storing commits and displaying changes across files does not scale well to large files, particularly those in binary formats such as images and videos, where even a perceptually small change can lead to significant bit-level differences between files. In contrast, DVC has an efficient system for tracking large files such as datasets.

Being generic in nature, Git also does not have built-in support for data-science-specific abstractions such as data pipelines, parameters and performance metrics. These all have first-class support in DVC, which provides useful functionality for automating pipelines and ensuring reproducibility.

DVC also offers tight integration with a variety of remote data providers, such as cloud storage services like the Amazon Web Services Simple Storage Service (S3).

While DVC offers several advantages over using Git directly for data workflows, a key point is that it builds on top of Git rather than replacing it, and so can be naturally used in conjunction with Git on a project, allowing for example code and data artefacts to be version controlled in the same repository.

Git is not only for source code files. However, a dedicated data-focused solution is more attractive:

Git does not handle very large files efficiently.
Thinking in terms of data workflows offers new useful functionality, e.g. reproducibility, metrics.
Better integration with remote data providers, e.g. Amazon Web Services S3.
Can still use Git under the hood, keeping code and data versioned simultaneously.

Getting started with DVC

Now that we have seen why we would want to use data version control, we will run through the basic workflow for using DVC and Git to track changes in a new project.

Like Git, DVC is a command-line tool and runs across a wide range of platforms. The installation instructions in the DVC documentation detail how to get set up on a range of systems.

We will walk through some of the key DVC commands, based on an official DVC tutorial which you can revisit in full after the lectures if you want to learn more.

To ensure the commands we run here don’t affect any other files on your system, we will first create a new directory called dvc-example to run the tutorial commands in, and change this to be the current working directory. In a bash shell we can use the mkdir and cd commands to do this.

DVC is a command-line application that runs on any platform. Follow the installation instructions to get it on your computer.

To follow along, first create a new directory and switch it to be the current working directory by running

mkdir dvc-example
cd dvc-example

in a terminal.

This walkthrough is based on the official tutorial.

Initializing project

First, initialize your directory as a DVC (and Git) repository to allow tracking changes:

git init
dvc init

Initialized empty Git repository in /tmp/Rtmp0fNv3c/dvc-example-1f9a2886eca9/.git/
Initialized DVC repository.

You can now commit the changes to git.

+---------------------------------------------------------------------+
|                                                                     |
|        DVC has enabled anonymous aggregate usage analytics.         |
|     Read the analytics documentation (and how to opt-out) here:     |
|             <https://dvc.org/doc/user-guide/analytics>              |
|                                                                     |
+---------------------------------------------------------------------+

What's next?
------------
- Check out the documentation: <https://dvc.org/doc>
- Get help and share ideas: <https://dvc.org/chat>
- Star us on GitHub: <https://github.com/treeverse/dvc>

Initializing project

After the above, DVC creates some new files and gives a hint about what to run: You can now commit the changes to git.

git commit -m "Initial setup"

[main (root-commit) 4266533] Initial setup
 3 files changed, 6 insertions(+)
 create mode 100644 .dvc/.gitignore
 create mode 100644 .dvc/config
 create mode 100644 .dvcignore

Downloading data

Now that we have both Git and DVC initialized, we are ready to download a sample data file using DVC. DVC has several different commands available for accessing external data files, but we will start here with one of the simplest, the dvc get command, which provides an easy way of downloading files from a remote Git or DVC repository.

The dvc get command accepts two positional arguments: the first is a URL to the remote Git repository, and the second is a path to a file or directory within this remote repository which we wish to download. Optionally, we can also specify the path to write the downloaded file or directory to in the local repository using the -o option. Here we use dvc get to download an XML data file from a tutorial Git repository hosted by the developers of DVC on GitHub, writing the data to a new directory called data under the current directory.

Download a sample data file by running

dvc get https://github.com/iterative/dataset-registry \
    get-started/data.xml -o data/data.xml

This should create a directory called data in your new directory, with a file called data.xml inside it.

tree

.
└── data
    └── data.xml

2 directories, 1 file

Initializing tracking

We are not tracking any files yet. Let’s tell DVC to track the dataset we downloaded:

dvc add data/data.xml


To track the changes with git, run:

    git add data/data.xml.dvc data/.gitignore

To enable auto staging, run:

    dvc config core.autostage true

As we saw previously, the output from the DVC command gives us some hints about what to do next. DVC has created two additional files that it uses for tracking the dataset, and it instructs us to stage these new files with Git and commit them. Again an ignore file has been created, which in this case tells Git to ignore the original XML data file; this is what we want here, as we don’t want to track changes to this large data file directly with Git. Instead, DVC has created a ‘proxy’ file with the same path and name as the original data file but with an additional .dvc extension. This proxy file contains some basic metadata about the associated file, specifically its size, path and a hash of its contents. With high probability the hash will be different for two files with differing contents, and so it can be used to identify whether changes have been made to the file since the last commit.
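
To see why a hash of the contents is enough to detect a change, we can compute one by hand for a small made-up file. The sketch below uses the md5sum utility from GNU coreutils; the file name sample.xml and its contents are hypothetical, and the choice of hash function is an assumption made for illustration rather than a statement about DVC's internals.

```shell
work=$(mktemp -d)
cd "$work"

printf '<rows><row id="1"/></rows>\n' > sample.xml
before=$(md5sum sample.xml | cut -d ' ' -f 1)   # hash of the original contents

# Simulate an edit by appending one more line to the file.
printf '<!-- extra -->\n' >> sample.xml
after=$(md5sum sample.xml | cut -d ' ' -f 1)    # hash of the edited contents

# Comparing the two short hashes tells us the file changed,
# without comparing the file contents themselves.
if [ "$before" != "$after" ]; then
    echo "sample.xml has changed since its hash was recorded"
fi
```

Only the short hash needs to be stored and compared, which is what keeps the proxy file small however large the data file is.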

As before, DVC creates some internal files and tells us what to commit with Git.

Initializing tracking

Run the command it suggests, and then commit:

git add data/data.xml.dvc data/.gitignore
git commit -m "Add initial version of dataset"

[main b5fc654] Add initial version of dataset
 2 files changed, 6 insertions(+)
 create mode 100644 data/.gitignore
 create mode 100644 data/data.xml.dvc

Initializing tracking

Note that this is different from the usual Git workflow.

Normally, we would be adding the data file itself (data.xml).

Instead, we are adding a smaller “proxy” file (data.xml.dvc).

This file is much smaller, and DVC knows it represents the original dataset.

Initializing tracking

To verify the size difference, run

ls -lh data

total 14M
-rw-r--r-- 1 runner runner 14M Mar 13 15:00 data.xml
-rw-r--r-- 1 runner runner  92 Mar 13 15:00 data.xml.dvc

The original data takes up 14MB, while the proxy file is only ~100 bytes long.

Making changes

During the course of our work, the dataset may change - intentionally or by accident. For simplicity, we will simulate a change by appending a copy of the dataset to itself, doubling its size:

cp data/data.xml temp.xml  # create a temporary copy
cat temp.xml >> data/data.xml  # append the copy to the original
rm temp.xml  # remove the copy

We can check the size of the file with

ls -lh data

total 28M
-rw-r--r-- 1 runner runner 28M Mar 13 15:00 data.xml
-rw-r--r-- 1 runner runner  92 Mar 13 15:00 data.xml.dvc

to verify it has doubled.

Making changes

To register the changes with Git and DVC, we run similar commands to before:

dvc add data/data.xml
git add data/data.xml.dvc  # as suggested by dvc
git commit -m "Double size of dataset"


To track the changes with git, run:

    git add data/data.xml.dvc

To enable auto staging, run:

    dvc config core.autostage true
[main 4417236] Double size of dataset
 1 file changed, 2 insertions(+), 2 deletions(-)

Switching versions

Switching to another version happens in two stages.

First, we switch with Git. In Git, HEAD refers to the current commit, and a ~ suffix indicates the parent of a commit, so HEAD~ is the commit immediately before the current one.

git checkout HEAD~

Note: switching to 'HEAD~'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:

  git switch -c <new-branch-name>

Or undo this operation with:

  git switch -

Turn off this advice by setting config variable advice.detachedHead to false

HEAD is now at b5fc654 Add initial version of dataset
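
The HEAD and ~ notation used above can be explored safely in a scratch repository. This is a minimal sketch with a made-up file data.txt and made-up commit messages; it uses Git only, so no DVC-tracked data is involved.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

# Build a history of three commits.
for n in 1 2 3; do
    echo "version $n" > data.txt
    git add data.txt
    git commit -q -m "Version $n"
done

current=$(git log -1 --format=%s HEAD)        # message of the current commit
parent=$(git log -1 --format=%s HEAD~)        # message of its parent
grandparent=$(git log -1 --format=%s HEAD~2)  # message of the parent's parent

# Checking out a commit rather than a branch detaches HEAD, and the
# tracked file returns to the state recorded in that commit.
git checkout -q HEAD~
contents=$(cat data.txt)
echo "$current / $parent / $grandparent / $contents"
```

For a file tracked directly by Git, the checkout alone restores the older contents; for a file tracked by DVC, Git only restores the small proxy file, which is why a second step is needed.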

Switching versions

Then we “synchronise” the files under DVC with

dvc checkout

M       data/data.xml

This finds the version of the data that was recorded when that commit was made, and restores it in the working directory.

Verify that the version changed with

ls -lh data

total 14M
-rw-r--r-- 1 runner runner 14M Mar 13 15:00 data.xml
-rw-r--r-- 1 runner runner  92 Mar 13 15:00 data.xml.dvc

Notice data.xml is back to its original size.
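
One way to picture how a version can be restored from a small proxy file is a store of file copies named by their content hash. The sketch below is a deliberately simplified model under that assumption: the cache directory, the save_version helper and the file names are all hypothetical, and DVC's actual cache layout and file-linking strategy are more involved.

```shell
work=$(mktemp -d)
cd "$work"
mkdir cache

# Hypothetical helper: store a copy of a file in the cache under its
# content hash, and print the hash so it can be kept as a pointer.
save_version() {
    digest=$(md5sum "$1" | cut -d ' ' -f 1)
    cp "$1" "cache/$digest"
    echo "$digest"
}

echo "original rows" > data.txt
v1=$(save_version data.txt)        # pointer to the first version

echo "extra rows" >> data.txt
v2=$(save_version data.txt)        # pointer to the second version

# "Checking out" the first version copies its cached contents back.
cp "cache/$v1" data.txt
restored=$(cat data.txt)
echo "restored: $restored"
```

In this picture Git only ever needs to version the short pointers, while the bulky contents live in the separate store.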

Switching versions

Go back to the newest version (doubled data) with

git checkout main
dvc checkout

Previous HEAD position was b5fc654 Add initial version of dataset
Switched to branch 'main'
M       data/data.xml

and check that the file is back to the doubled size:

ls -lh data

total 28M
-rw-r--r-- 1 runner runner 28M Mar 13 15:00 data.xml
-rw-r--r-- 1 runner runner  92 Mar 13 15:00 data.xml.dvc

Summary

In this lecture we have covered the basic usage of Git and DVC to track and revert changes to a file.

Building on this, in the next lecture we will see how DVC can be used to track models and entire machine learning workflows.

Lecture 23: Data version control I
