Data version control II
Introduction
In this lecture we will continue to explore data version control. Specifically, we will look at some more advanced features of DVC, namely:
- How we can access and track files from remote data sources such as cloud storage services.
- How we use DVC to define the stages of a data pipeline in a way that makes it easily reproducible by both ourselves and others.
Learning outcomes
- Apply different approaches for adding external files to a DVC repository.
- Understand how pipelines are represented in DVC.
- Define a DVC pipeline via a YAML file or commands.
- Explain how DVC reproduces the outputs of pipelines.
- Use parameters and metrics to control and compare the outputs of DVC pipelines.
- Compare DVC to other workflow tools such as MLflow.
Adding external files
In the last lecture we saw how to use the dvc add command to track files that are already in our local repository. In practice, however, we will often want to include data files that are already available in an external repository or remote storage service.
There are several different commands available in DVC for working with external files:
- We have already seen dvc get, which allows us to download a file from a remote DVC or Git repository.
- dvc import allows us to reuse a file from a remote repository, both downloading the file and setting up tracking, including recording the remote source of the file so that we can check in future whether it has been updated there.
- There are also variants of these two commands, dvc get-url and dvc import-url, that allow us to work with more general, non-Git-based remote storage services, such as files accessible directly using the hypertext transfer protocol (HTTP) or on cloud storage services.
The get and import commands are similar and easy to confuse, but the key difference is that import also sets up tracking from the remote repository, so if a data file we need is already under a remote repository, import should generally be preferred. If the file is subsequently updated in the remote repository, we can bring those changes in locally using the dvc update command.
The [get-url](https://dvc.org/doc/command-reference/get-url) and import-url variants give us the flexibility of working with non-Git-based storage, with their built-in support for a range of protocols and providers and an extensible interface for working with additional, unsupported providers.
Creating pipelines
While the ability to track potentially large datasets with DVC is useful, its power becomes more apparent in its support for working with data pipelines. Pipelines are a first class construct in DVC and let us think in terms of data flows and the dependencies between files rather than the individual files themselves.
DVC allows us, in a very generic way, to:
- define the stages of a data pipeline,
- infer the graph of dependencies between the pipeline stages,
- reproducibly run a whole pipeline, or if we make changes to only some dependencies only the parts of the pipeline that need to be rerun, which can be a major efficiency gain for computationally intensive pipelines.
Pipeline example
We will consider a toy example here of a data pipeline for performing a logistic regression on a comma separated value formatted dataset. You can find a toy implementation of it on GitHub. We will assume the covariates in the dataset are high-dimensional, and so we perform an initial dimensionality reduction step using principal component analysis to produce reduced dimensionality features. The reduced dimensionality features and associated target labels are then used to train a logistic regression classification model, before the model is serialized to a file in pickle format for sharing.
We start with a labelled set of samples in a file samples.csv.
The first step is to run reduce_dim.py. This reads the inputs, performs principal component analysis (PCA) to reduce their dimensionality, and outputs the result to a file reduced.csv.
Running log_reg.py then trains a logistic regression classifier on the reduced dataset, serializes the trained model, and outputs it in pickle format to a file classifier.pkl.
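To make the dimension reduction stage concrete, a minimal sketch of what reduce_dim.py might do is shown below: it implements PCA via a singular value decomposition, keeping just enough components to explain a target fraction of the variance. The function name, and the use of NumPy rather than scikit-learn, are assumptions for illustration, not the repository's actual code.

```python
# Illustrative sketch of the dimension reduction step (assumed names, not
# the actual repository code).
import numpy as np


def reduce_dimensionality(X, total_var=0.9):
    """Project X onto the smallest number of principal components whose
    cumulative explained-variance ratio reaches total_var."""
    X_centred = X - X.mean(axis=0)
    # SVD of the centred data: the rows of Vt are the principal axes
    U, S, Vt = np.linalg.svd(X_centred, full_matrices=False)
    explained_ratio = (S ** 2) / (S ** 2).sum()
    # Smallest k with cumulative explained variance >= total_var
    n_components = int(np.searchsorted(np.cumsum(explained_ratio), total_var)) + 1
    return X_centred @ Vt[:n_components].T
```

In the full script this would sit alongside code to read samples.csv, load total_var from params.yaml and write the projected features out to reduced.csv.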
Representing a pipeline
In DVC a pipeline is defined as a collection of steps or stages, each with a set of input dependencies and one or more outputs. The dependencies between stages are inferred from shared files: a stage whose inputs include another stage's outputs depends on that stage.
Pipeline stages can additionally have parameters that control the behaviour of the stage; these typically represent variables whose values we will want to vary to explore their effect on performance.
dvc.yaml files
As we build up the definition of a pipeline, DVC stores the description in a structured format in a dvc.yaml file. The toy example we just introduced might have a dvc.yaml file with something like the following content.
stages:
  pca:
    cmd: python reduce_dim.py
    deps:
      - reduce_dim.py
      - samples.csv
    params:
      - total_var
    outs:
      - reduced.csv
  classification:
    cmd: python log_reg.py
    deps:
      - log_reg.py
      - reduced.csv
    outs:
      - classifier.pkl

The format is relatively human-readable: under the stages key we have a set of pipeline stages, keyed by their stage names. Each pipeline stage specifies the command to run, a list of file dependencies for the stage, a list of outputs for the stage and optionally any parameters that control the stage.
An important point is that the code used to run a stage should be considered a dependency of the stage, just like its data inputs: if the code changes, we want to know to re-run the stage.
DVC parameters
The actual parameter values are stored in a dedicated file, by default a YAML file params.yaml though it is possible to use different formats such as JSON.
Here we have just one parameter, which controls the total variance to be explained by the chosen number of principal components in the dimension reduction stage.
total_var: 0.9

Creating steps
While we can directly create a dvc.yaml file specifying the stages of a pipeline, in practice we will often add the steps sequentially as we build up a workflow, and DVC has commands available to add pipeline stages from the command line.
In particular, the dvc stage add command allows us to specify the properties of a stage to be added, with command options specifying the stage name, dependencies, parameters, outputs and the command to run the stage.
dvc stage add \
-n pca \
-d reduce_dim.py -d samples.csv \
-p total_var \
-o reduced.csv \
python reduce_dim.py

The properties of the stage are given in the command options:
- -n: the name of the stage,
- -d: its dependencies,
- -p: its parameters,
- -o: its outputs,
- finally, the command to run the stage.
Reproducing a pipeline
Once we have defined a pipeline in DVC we can use the dvc repro command to run the pipeline stages.
Commonly it will be the case that we do not need to run all the stages in a pipeline. For instance, if we have changed only the log_reg.py script used to fit the model since we last ran the pipeline, then the output of the preceding dimension reduction stage is unaffected by this change and that stage does not need to be re-run.
Conversely, if we update the input samples.csv data file then the initial dimension reduction stage needs to be re-run, as well as all stages that depend directly or indirectly on its outputs.
Importantly, DVC automates this process of considering the dependencies between pipeline stages, running only the stages that are affected by changes when we execute dvc repro, which for larger and more complex pipelines can be a significant efficiency saving.
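To make the idea concrete, here is a toy sketch, emphatically not DVC's actual implementation, of the kind of bookkeeping involved: hash each stage's dependencies, compare against the hashes recorded at the last run, and mark as stale any stage with a changed dependency or any stage fed by the outputs of a stale stage.

```python
# Toy sketch of repro-style change detection (NOT DVC's real implementation).
import hashlib


def file_hash(path):
    """Content hash of a file (DVC itself records MD5 hashes of files)."""
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()


def stale_stages(stages, recorded_hashes):
    """Return the names of stages that need re-running.

    `stages` maps stage name -> {"deps": [...], "outs": [...]} and is
    assumed to be listed in topological (pipeline) order.
    """
    stale = set()
    outputs_of_stale = set()
    for name, stage in stages.items():
        # rerun if a dependency's content changed since the last run...
        hash_changed = any(
            file_hash(dep) != recorded_hashes.get(dep) for dep in stage["deps"]
        )
        # ...or if a dependency is produced by a stage that is itself stale
        upstream_stale = any(dep in outputs_of_stale for dep in stage["deps"])
        if hash_changed or upstream_stale:
            stale.add(name)
            outputs_of_stale.update(stage["outs"])
    return stale
```

Running this over the toy pipeline, changing only samples.csv would mark both stages stale, while changing only log_reg.py would mark just the classification stage.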
Metrics and outputs
In addition to its support for data pipelines, DVC also offers the ability to compare and track the performance of models as changes are made to them, by defining the performance metrics of interest.
For our toy logistic regression example, we might have an additional stage which scores our trained classifier model, computing its precision and the area under the receiver operating characteristic (ROC) curve, and outputting the computed metrics to a file scores.json.
{ "precision": 0.63, "roc_auc": 0.85 }

We can indicate that a stage produces performance metrics using the metrics key in the dvc.yaml file, or using the -m option when specifying a stage with dvc stage add.
A pipeline with this additional evaluation stage would now additionally produce the scores.json file when run with dvc repro.
stages:
  (...as above...)
  evaluation:
    cmd: python evaluate.py
    deps:
      - evaluate.py
      - classifier.pkl
    outs:
      - scores.json
    metrics:
      - scores.json

Importantly, unlike standard stage outputs, DVC allows us to use the metrics to compare runs of the pipeline, for example across different parameter values or versions.
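As an illustration, an evaluate.py script along these lines might compute the two metrics from scratch and write them to scores.json. The function names and the from-scratch metric implementations here are assumptions for the sketch, not the toy repository's actual code.

```python
# Illustrative sketch of an evaluation stage writing a JSON metrics file.
import json


def precision(y_true, y_score, threshold=0.5):
    """Fraction of predicted positives (score >= threshold) that are truly positive."""
    tp = sum(1 for t, s in zip(y_true, y_score) if s >= threshold and t == 1)
    fp = sum(1 for t, s in zip(y_true, y_score) if s >= threshold and t == 0)
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0


def roc_auc(y_true, y_score):
    """Probability a random positive outscores a random negative (ties count 0.5).

    Assumes both classes are present in y_true.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def evaluate(y_true, y_score, path="scores.json"):
    """Score the model and write the metrics file tracked by DVC."""
    scores = {"precision": precision(y_true, y_score),
              "roc_auc": roc_auc(y_true, y_score)}
    with open(path, "w") as f:
        json.dump(scores, f)
    return scores
```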
For example, if we change the total_var parameter in the params.yaml file, DVC will identify that the parameter has been changed. We can see the change by, for example, running dvc params diff.
Path Param Old New
params.yaml total_var 0.9 0.95
If we re-run the pipeline we can also compare the effect on the performance metrics using the dvc metrics diff command which will output the changes in the performance metrics.
Path Metric Old New Change
scores.json precision 0.63 0.65 0.02
scores.json roc_auc 0.85 0.91 0.06
This allows us to iteratively try out different parameter values and easily see the effect they have on performance metrics; importantly, it also tracks these changes so that we can go back to promising parameter values later.
Other DVC features
That concludes our brief tour of DVC. We have only covered a subset of the features DVC offers here; for example, it additionally offers
- Support for automatically producing plots of metrics.
- An experiment management system for systematically tracking the results of different pipeline runs and presenting them cleanly.
- The ability to push the repository to remote hosts.
- A programmatic Python API that allows us to directly use DVC within our Python code.
MLflow
While we have concentrated on DVC here, there is a range of other MLOps tools offering overlapping features. One package of note is MLflow, which also offers the ability to track and reproduce machine learning workflows. Unlike DVC it is typically used directly within the code defining the pipeline, and is more specific to Python workflows, offering tight integrations with a number of frameworks such as TensorFlow and PyTorch. Compared to DVC it offers some additional features, such as a visual web-based interface for live monitoring of the progress of experiments.
Summary
In this lecture we have explored:
- DVC’s support for loading and tracking files from remote repositories and storage services.
- How we can use DVC to programmatically define the stages of a data pipeline and allow us to easily reproduce pipeline outputs.
- DVC’s support for comparing how pipelines perform on different metrics as we vary parameter values.