Lecture 23: Data version control I

Centre for Advanced Research Computing

Introduction

This lecture is part of a series on Data Version Control (DVC), a way of systematically keeping track of different versions of models and datasets.

This first lecture in the series will cover:

  • Why using DVC is a good idea.
  • How to track files and move between versions.

Learning outcomes

  • Recognize the need for version control when working with code and data files.
  • Explain how version control systems like Git track changes.
  • Compare the use of Git and DVC for working with data files.
  • Apply DVC to track changes to data files.

Why do we need version control?

From PHD Comics

What is version control?

Instead of having multiple copies or working on a shared version:

  • track changes in distinct stages (commits) as you work,
  • move backwards and forwards in history,
  • explore different alternatives (branches),
  • share entire history with others.

Different systems: Git, Subversion, Mercurial, …

Tracking changes

We start our work by committing the state of our code or data. Each commit we create is given a unique identifier:

Diagram of a single commit, represented as a labelled circle

Tracking changes

As we work, we make more commits:

Diagram of two commits with a link from the first to the second
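
The two diagrams above can be reproduced with Git in a throwaway directory. This is a minimal sketch: the file name, commit messages and user identity are made up for illustration, and the identifiers Git prints will differ on every run.

```shell
set -eu
# Work in a temporary directory so nothing existing is touched.
repo=$(mktemp -d)
cd "$repo"
git init -q
# Hypothetical identity, set for this repository only.
git config user.name "Example User"
git config user.email "user@example.com"

# First commit.
echo "first draft" > notes.txt
git add notes.txt
git commit -q -m "Add notes"

# Second commit, linked to the first as its parent.
echo "second draft" >> notes.txt
git commit -q -am "Extend notes"

# One line per commit, newest first; the leading hash is the identifier.
git log --oneline
commit_count=$(git rev-list --count HEAD)
echo "commits: $commit_count"
```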

Tracking changes

Sometimes we make mistakes:

Diagram of three linked commits, where the third is highlighted as wrong

Tracking changes

After realising the error, we can go back and fix it, replacing the faulty commit with a new one:

Diagram of three commits, where the previous mistake has been replaced with a new, fixed commit
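
One possible way to do this in Git is sketched below. It assumes the mistaken commit has not been shared with anyone, because `git reset --hard` discards it along with any uncommitted changes; `git revert` is a gentler alternative that keeps the mistake in the history. The file name and messages are hypothetical.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
# Hypothetical identity, set for this repository only.
git config user.name "Example User"
git config user.email "user@example.com"

echo "step 1" > work.txt
git add work.txt
git commit -q -m "Step 1"
echo "step 2" >> work.txt
git commit -q -am "Step 2"
good_commit=$(git rev-parse HEAD)

# The third commit contains a mistake.
echo "a mistake" >> work.txt
git commit -q -am "Step 3 (wrong)"

# Move the branch back to the parent of the mistaken commit...
git reset -q --hard HEAD~
# ...and commit the corrected work in its place.
echo "step 3" >> work.txt
git commit -q -am "Step 3 (fixed)"

git log --oneline
parent_of_fix=$(git rev-parse HEAD~)
```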

Tracking changes

Often, we want to try out different approaches before we decide on what’s best:

Diagram of a commit history with two branches

Tracking changes

This results in a non-linear history. If we want, we can also merge the two branches:

Diagram of a commit history where two branches split off and are then rejoined
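
The branching and merging shown above can be tried out with Git as follows. This is a sketch under simple assumptions: the branch name `approach-a` and the files are invented, and the two branches touch different files so the merge cannot conflict.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
# Hypothetical identity, set for this repository only.
git config user.name "Example User"
git config user.email "user@example.com"
# Remember the default branch name (main or master, depending on the Git setup).
base=$(git symbolic-ref --short HEAD)

echo "shared start" > base.txt
git add base.txt
git commit -q -m "Shared starting point"

# Try one approach on a separate branch.
git checkout -q -b approach-a
echo "idea A" > a.txt
git add a.txt
git commit -q -m "Try approach A"

# Meanwhile, continue on the original branch.
git checkout -q "$base"
echo "idea B" > b.txt
git add b.txt
git commit -q -m "Try approach B"

# Rejoin the two lines of work with a merge commit.
git merge -q --no-edit approach-a
git log --oneline --graph

# A merge commit has two parents: the output line holds the commit plus its parents.
parent_count=$(( $(git rev-list --parents -n 1 HEAD | wc -w) - 1 ))
echo "parents of merge commit: $parent_count"
```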

Why data version control?

Similar principles apply to data workflows as to code:

  • Mistakes happen!
  • New data appears.
  • We try variants of a model (e.g. the algorithm or its parameters) or of the data pipeline (e.g. preprocessing).

Why data version control?

Git is not only for source code files. However, a dedicated data-focused solution is more attractive:

  • Git does not handle very large files efficiently.
  • Thinking in terms of data workflows offers new useful functionality, e.g. reproducibility, metrics.
  • Better integration with remote data providers, e.g. Amazon Web Services S3.
  • DVC can still use Git under the hood, keeping code and data versioned simultaneously.

Getting started with DVC

DVC is a command-line application that runs on any platform. Follow the installation instructions to get it on your computer.

To follow along, first create a new directory and make it the current working directory by running

mkdir dvc-example
cd dvc-example

in a terminal.

This walkthrough is based on the official tutorial.

Initializing project

First, initialize your directory as a DVC (and Git) repository to allow tracking changes:

git init
dvc init
Initialized empty Git repository in /tmp/Rtmp0fNv3c/dvc-example-1f9a2886eca9/.git/
Initialized DVC repository.

You can now commit the changes to git.

+---------------------------------------------------------------------+
|                                                                     |
|        DVC has enabled anonymous aggregate usage analytics.         |
|     Read the analytics documentation (and how to opt-out) here:     |
|             <https://dvc.org/doc/user-guide/analytics>              |
|                                                                     |
+---------------------------------------------------------------------+

What's next?
------------
- Check out the documentation: <https://dvc.org/doc>
- Get help and share ideas: <https://dvc.org/chat>
- Star us on GitHub: <https://github.com/treeverse/dvc>

Initializing project

After the above, DVC creates some new files and gives a hint about what to do next: "You can now commit the changes to git."

git commit -m "Initial setup"
[main (root-commit) 4266533] Initial setup
 3 files changed, 6 insertions(+)
 create mode 100644 .dvc/.gitignore
 create mode 100644 .dvc/config
 create mode 100644 .dvcignore

Downloading data

Download a sample data file by running

dvc get https://github.com/iterative/dataset-registry \
    get-started/data.xml -o data/data.xml

This should create a directory called data in your new directory, with a file called data.xml inside it.

tree
.
└── data
    └── data.xml

2 directories, 1 file

Initializing tracking

We are not tracking any files yet. Let’s tell DVC to track the dataset we downloaded:

dvc add data/data.xml

To track the changes with git, run:

    git add data/data.xml.dvc data/.gitignore

To enable auto staging, run:

    dvc config core.autostage true

As before, DVC creates some internal files and tells us what to commit with Git.

Initializing tracking

Run the command it suggests, and then commit:

git add data/data.xml.dvc data/.gitignore
git commit -m "Add initial version of dataset"
[main b5fc654] Add initial version of dataset
 2 files changed, 6 insertions(+)
 create mode 100644 data/.gitignore
 create mode 100644 data/data.xml.dvc

Initializing tracking

Note that this is different from the usual Git workflow.

Normally, we would be adding the data file itself (data.xml).

Instead, we are adding a smaller “proxy” file (data.xml.dvc).

This file is much smaller, and DVC knows it represents the original dataset.
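
To get a feel for why such a proxy can stay tiny, the sketch below builds a stand-in pointer file by hand. It assumes, as a simplification, that the proxy only needs a content hash, a size and a path; the field layout, file names and generated dataset are illustrative and are not claimed to match DVC's exact format. It also assumes GNU `md5sum` is available.

```shell
set -eu
work=$(mktemp -d)
cd "$work"

# Stand-in dataset: 100000 lines of generated text.
seq 1 100000 > dataset.txt

# Record the content hash and size in a small pointer file.
# The field names here are illustrative, not DVC's exact format.
hash=$(md5sum dataset.txt | cut -d ' ' -f 1)
size=$(( $(wc -c < dataset.txt) ))
printf 'md5: %s\nsize: %s\npath: dataset.txt\n' "$hash" "$size" > dataset.txt.pointer

cat dataset.txt.pointer
pointer_size=$(( $(wc -c < dataset.txt.pointer) ))
echo "data: $size bytes, pointer: $pointer_size bytes"
```

However large the dataset grows, a pointer of this kind stays roughly the same size, which is what makes it cheap to commit with Git.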

Initializing tracking

To verify the size difference, run

ls -lh data
total 14M
-rw-r--r-- 1 runner runner 14M Mar 13 15:00 data.xml
-rw-r--r-- 1 runner runner  92 Mar 13 15:00 data.xml.dvc

The original data takes up 14 MB, while the proxy file is only 92 bytes long.

Making changes

During the course of our work, the dataset may change, intentionally or by accident. For simplicity, we will simulate a change by appending a copy of the dataset to itself:

cp data/data.xml temp.xml  # create a temporary copy
cat temp.xml >> data/data.xml  # append the copy to the original
rm temp.xml  # remove the copy

We can check the size of the file with

ls -lh data
total 28M
-rw-r--r-- 1 runner runner 28M Mar 13 15:00 data.xml
-rw-r--r-- 1 runner runner  92 Mar 13 15:00 data.xml.dvc

to verify it has doubled.

Making changes

To register the changes with Git and DVC, we run similar commands to before:

dvc add data/data.xml
git add data/data.xml.dvc  # as suggested by dvc
git commit -m "Double size of dataset"

To track the changes with git, run:

    git add data/data.xml.dvc

To enable auto staging, run:

    dvc config core.autostage true
[main 4417236] Double size of dataset
 1 file changed, 2 insertions(+), 2 deletions(-)

Switching versions

Switching to another version happens in two stages.

First, we switch with Git. In Git, HEAD refers to the current commit, and a ~ suffix indicates the parent of a commit.

git checkout HEAD~
Note: switching to 'HEAD~'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:

  git switch -c <new-branch-name>

Or undo this operation with:

  git switch -

Turn off this advice by setting config variable advice.detachedHead to false

HEAD is now at b5fc654 Add initial version of dataset
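
The meaning of HEAD and HEAD~ can be checked in isolation with a small throwaway repository. This is a sketch with a made-up file and commit messages; the hashes printed will differ from run to run.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
# Hypothetical identity, set for this repository only.
git config user.name "Example User"
git config user.email "user@example.com"

echo "v1" > file.txt
git add file.txt
git commit -q -m "First version"
first=$(git rev-parse HEAD)

echo "v2" > file.txt
git commit -q -am "Second version"

# HEAD is the current commit; HEAD~ is its parent.
parent=$(git rev-parse HEAD~)
echo "HEAD:  $(git rev-parse --short HEAD)"
echo "HEAD~: $(git rev-parse --short HEAD~)"

# Checking out HEAD~ restores the tracked file as it was one commit ago.
git checkout -q HEAD~
content=$(cat file.txt)
echo "$content"
```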

Switching versions

Then we “synchronise” the files under DVC with

dvc checkout
M       data/data.xml

This finds the version of the data that was recorded when that commit was made and checks it out.

Verify that the version changed with

ls -lh data
total 14M
-rw-r--r-- 1 runner runner 14M Mar 13 15:00 data.xml
-rw-r--r-- 1 runner runner  92 Mar 13 15:00 data.xml.dvc

Notice data.xml is back to its original size.
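
One way to picture this step is as a lookup in a store of file contents keyed by their hash. The sketch below is a simplified model of that idea, not DVC's actual implementation: the `cache` directory, the `store` helper and the file names are all hypothetical, and GNU `md5sum` is assumed to be available.

```shell
set -eu
work=$(mktemp -d)
cd "$work"
mkdir cache

# Save a copy of a file in the cache under its content hash and print the hash.
store() {
    hash=$(md5sum "$1" | cut -d ' ' -f 1)
    cp "$1" "cache/$hash"
    echo "$hash"
}

echo "original data" > data.txt
v1=$(store data.txt)   # the hash a first commit would record

echo "more data" >> data.txt
v2=$(store data.txt)   # the hash a second commit would record

# "Checking out" the first version: copy the cached content back into place.
cp "cache/$v1" data.txt
restored=$(cat data.txt)
echo "$restored"
```

In this model, switching the Git commit changes which hash the small proxy file records, and the checkout step then fetches the matching content.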

Switching versions

Go back to the newest version (doubled data) with

git checkout main
dvc checkout
Previous HEAD position was b5fc654 Add initial version of dataset
Switched to branch 'main'
M       data/data.xml

and we see that data.xml is back to the doubled size:

ls -lh data
total 28M
-rw-r--r-- 1 runner runner 28M Mar 13 15:00 data.xml
-rw-r--r-- 1 runner runner  92 Mar 13 15:00 data.xml.dvc

Summary

In this lecture we have covered the basic usage of Git and DVC to track and revert changes to a file.

Building on this, in the next lecture we will see how DVC can be used to track models and entire machine learning workflows.