Centre for Advanced Research Computing
This lecture is part of a series on Data Version Control (DVC), a way of systematically keeping track of different versions of models and datasets.
This first lecture in the series will cover:

Instead of having multiple copies or working on a shared version:
Different systems: Git, Subversion, Mercurial, …
We start our work by committing the state of our code or data. Each commit we create is given a unique identifier:
As we work, we make more commits:
Sometimes we make mistakes:
After realising the error, we can go back and fix it, replacing it with a new commit:
Often, we want to try out different approaches before we decide on what’s best:
This results in a non-linear history. If we want, we can also merge the two branches:
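The commit, branch and merge steps described above can be sketched with plain Git in a throwaway repository. The file names, commit messages, the branch name `experiment` and the placeholder identity are illustrative assumptions, not part of the lecture.

```shell
repo=$(mktemp -d)                      # throwaway repository
cd "$repo"
git init -q
git config user.name "Demo User"       # placeholder identity for this repo only
git config user.email "demo@example.com"

# First commit on the default branch.
echo "step 1" > notes.txt
git add notes.txt
git commit -q -m "First commit"
base=$(git branch --show-current)      # remember the default branch name

# Try a different approach on its own branch.
git switch -q -c experiment
echo "alternative idea" > idea.txt
git add idea.txt
git commit -q -m "Try an alternative approach"

# Meanwhile, continue on the original branch: history is now non-linear.
git switch -q "$base"
echo "step 2" >> notes.txt
git commit -q -am "Continue main line of work"

# Merge the two branches back together.
git merge -q --no-edit experiment
git --no-pager log --oneline --graph   # each commit is listed with its identifier
```

The graph printed at the end should show the two lines of work diverging and then joining in a merge commit.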
Similar principles apply to data workflows as to code:
Git is not only for source code files. However, a dedicated data-focused solution is more attractive:
DVC is a command-line application that runs on any platform. Follow the installation instructions to get it on your computer.
To follow along, open a terminal, create a new directory and make it the current working directory.
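A minimal sketch of this step follows. The directory name `dvc-example` is an assumption (the output further down suggests a similarly named directory); any name works.

```shell
mkdir -p dvc-example   # create the project directory (no error if it already exists)
cd dvc-example         # make it the current working directory
pwd                    # print the current directory to confirm
```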
This walkthrough is based on the official tutorial.
First, initialize your directory as a DVC (and Git) repository to allow tracking changes:
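A sketch of the commands for this step, assuming both `git` and `dvc` are installed and on your `PATH`. They should print messages like the ones that follow.

```shell
# Run inside the new, empty project directory.
git init    # create the Git repository first; DVC builds on top of it
dvc init    # add DVC's own files (.dvc/ and .dvcignore)
```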
Initialized empty Git repository in /tmp/Rtmp0fNv3c/dvc-example-1f9a2886eca9/.git/
Initialized DVC repository.
You can now commit the changes to git.
+---------------------------------------------------------------------+
| |
| DVC has enabled anonymous aggregate usage analytics. |
| Read the analytics documentation (and how to opt-out) here: |
| <https://dvc.org/doc/user-guide/analytics> |
| |
+---------------------------------------------------------------------+
What's next?
------------
- Check out the documentation: <https://dvc.org/doc>
- Get help and share ideas: <https://dvc.org/chat>
- Star us on GitHub: <https://github.com/treeverse/dvc>
After running the above, DVC creates some new files and suggests the next step: “You can now commit the changes to git.”
[main (root-commit) 4266533] Initial setup
3 files changed, 6 insertions(+)
create mode 100644 .dvc/.gitignore
create mode 100644 .dvc/config
create mode 100644 .dvcignore
Download a sample data file by running
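One way to do this, assuming the same sample file and dataset registry as the official DVC "Get Started" tutorial (the URL and paths below are taken from that tutorial, not from this lecture, and need network access):

```shell
# Download get-started/data.xml from the registry repository into data/data.xml.
dvc get https://github.com/iterative/dataset-registry \
        get-started/data.xml -o data/data.xml
```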
This should create a directory called data in your new directory, with a file called data.xml inside it.
.
└── data
    └── data.xml

2 directories, 1 file
We are not tracking any files yet. Let’s tell DVC to track the dataset we downloaded:
To track the changes with git, run:
git add data/data.xml.dvc data/.gitignore
To enable auto staging, run:
dvc config core.autostage true
As before, DVC creates some internal files and tells us what to commit with Git.
Run the command it suggests, and then commit:
[main b5fc654] Add initial version of dataset
2 files changed, 6 insertions(+)
create mode 100644 data/.gitignore
create mode 100644 data/data.xml.dvc
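The tracking steps above can be tried end to end in a throwaway repository. This is a sketch under assumptions: it uses a tiny stand-in file instead of the downloaded dataset, a placeholder Git identity, and it skips itself if `dvc` is not installed. The `git add` line is the one DVC suggests in its output; the commit messages match the ones shown above.

```shell
demo=$(mktemp -d)   # scratch directory, separate from the walkthrough repository
(
  set -e
  command -v dvc >/dev/null 2>&1 || { echo "dvc not found, skipping" >&2; exit 0; }
  cd "$demo"
  git init -q
  git config user.name "Demo User"           # placeholder identity for this scratch repo
  git config user.email "demo@example.com"
  dvc init -q
  git commit -q -m "Initial setup"           # dvc init has already staged its files

  mkdir data
  printf '<rows>\n  <row Id="1"/>\n</rows>\n' > data/data.xml   # tiny stand-in dataset

  dvc add data/data.xml                      # creates data/data.xml.dvc and data/.gitignore
  git add data/data.xml.dvc data/.gitignore  # the command DVC suggests
  git commit -q -m "Add initial version of dataset"

  git ls-files                               # lists the .dvc proxy, not data/data.xml itself
  cat data/data.xml.dvc                      # small text file describing the dataset
)
```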
Note that this is different from the usual Git workflow.
Normally, we would be adding the data file itself (data.xml).
Instead, we are adding a smaller “proxy” file (data.xml.dvc).
DVC treats this file as a stand-in for the original dataset, which is itself excluded from Git through data/.gitignore.
To verify the size difference, list the contents of the data directory with human-readable sizes (for example, with ls -lh data):
total 14M
-rw-r--r-- 1 runner runner 14M Mar 13 15:00 data.xml
-rw-r--r-- 1 runner runner 92 Mar 13 15:00 data.xml.dvc
The original data takes up 14 MB, while the proxy file is only 92 bytes long.
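To see why a proxy file can stay so small however large the data is, here is a sketch of the idea without DVC. It records only a hash, a size and a path for a stand-in file. The `.pointer` extension and the field layout are hypothetical and only resemble what DVC writes, and `md5sum` is assumed to be available (GNU coreutils).

```shell
work=$(mktemp -d)                                   # scratch directory
head -c 1048576 /dev/zero > "$work/data.bin"        # 1 MiB stand-in dataset

hash=$(md5sum "$work/data.bin" | cut -d ' ' -f 1)   # content hash (32 hex characters)
size=$(( $(wc -c < "$work/data.bin") ))             # size in bytes

# Hypothetical pointer file: a few short lines, whatever the size of the data.
printf 'md5: %s\nsize: %s\npath: data.bin\n' "$hash" "$size" > "$work/data.bin.pointer"

ls -lh "$work"                                      # compare the two file sizes
cat "$work/data.bin.pointer"
```

Because the pointer holds a fixed-length hash rather than the content, it changes whenever the data changes, and that small change is what Git ends up tracking.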
During the course of our work, the dataset may change, intentionally or by accident. For simplicity, we will simulate a change by appending a copy of the dataset to itself, doubling its size:
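One way to make such a change is sketched below. The temporary copy and the small stand-in file (created only when data/data.xml does not exist yet) are assumptions added so the snippet runs on its own.

```shell
file="data/data.xml"
# Assumption: create a tiny stand-in if the sample file has not been downloaded.
[ -f "$file" ] || { mkdir -p data && printf '<row Id="1"/>\n' > "$file"; }

before=$(( $(wc -c < "$file") ))   # size in bytes before the change

copy=$(mktemp)                     # temporary copy of the current contents
cp "$file" "$copy"
cat "$copy" >> "$file"             # append the copy, doubling the file
rm "$copy"

after=$(( $(wc -c < "$file") ))    # size in bytes after the change
echo "$before -> $after bytes"
```

Appending from a separate copy avoids reading from and writing to the same file at once.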
Listing the directory again confirms that the file has doubled in size:
total 28M
-rw-r--r-- 1 runner runner 28M Mar 13 15:00 data.xml
-rw-r--r-- 1 runner runner 92 Mar 13 15:00 data.xml.dvc
To register the changes with Git and DVC, we run similar commands to before:
To track the changes with git, run:
git add data/data.xml.dvc
To enable auto staging, run:
dvc config core.autostage true
[main 4417236] Double size of dataset
1 file changed, 2 insertions(+), 2 deletions(-)
Switching to another version happens in two stages.
First, we switch with Git. In Git, HEAD refers to the current commit, and a ~ suffix denotes the parent of a commit, so HEAD~ is the commit just before the current one.
Note: switching to 'HEAD~'.
You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.
If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:
git switch -c <new-branch-name>
Or undo this operation with:
git switch -
Turn off this advice by setting config variable advice.detachedHead to false
HEAD is now at b5fc654 Add initial version of dataset
Then we “synchronise” the files under DVC with dvc checkout:
M data/data.xml
This will find the version of the data when that commit was made, and check it out.
Verify that the version changed by listing the directory once more:
total 14M
-rw-r--r-- 1 runner runner 14M Mar 13 15:00 data.xml
-rw-r--r-- 1 runner runner 92 Mar 13 15:00 data.xml.dvc
Notice data.xml is back to its original size.
Go back to the newest version (the doubled data) by switching back to the main branch with Git and synchronising with DVC again:
Previous HEAD position was b5fc654 Add initial version of dataset
Switched to branch 'main'
M data/data.xml
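The whole round trip can be reproduced in a throwaway repository. This sketch uses a 12-byte stand-in file and a placeholder Git identity, and it skips itself if `dvc` is not installed. The detached-HEAD note shown earlier is consistent with `git checkout HEAD~`, but the exact commands used in the lecture are an assumption.

```shell
demo=$(mktemp -d)   # scratch directory, separate from the walkthrough repository
(
  set -e
  command -v dvc >/dev/null 2>&1 || { echo "dvc not found, skipping" >&2; exit 0; }
  cd "$demo"
  git init -q
  git config user.name "Demo User"       # placeholder identity for this scratch repo
  git config user.email "demo@example.com"
  dvc init -q
  git commit -q -m "Initial setup"

  # Version 1: a 12-byte stand-in dataset.
  mkdir data
  printf 'one version\n' > data/data.xml
  dvc add -q data/data.xml
  git add data/data.xml.dvc data/.gitignore
  git commit -q -m "Add initial version of dataset"

  # Version 2: the same content repeated, as in the walkthrough.
  copy=$(mktemp)
  cp data/data.xml "$copy"
  cat "$copy" >> data/data.xml
  rm "$copy"
  dvc add -q data/data.xml
  git add data/data.xml.dvc
  git commit -q -m "Double size of dataset"

  # Stage 1: Git moves the proxy file back one commit.
  git checkout -q HEAD~
  # Stage 2: DVC makes data/data.xml match the proxy file.
  dvc checkout
  wc -c < data/data.xml     # size of the first version

  # Return to the branch tip and synchronise again.
  git switch -q -
  dvc checkout
  wc -c < data/data.xml     # size of the doubled version
)
```

Git only ever moves the small proxy file; the data file itself changes when `dvc checkout` restores the matching version from DVC's cache.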
In this lecture we have covered the basic usage of Git and DVC to track and revert changes to a file.
Building on this, in the next lecture we will see how DVC can be used to track models and entire machine learning workflows.