Lecture 24: Data version control II

Centre for Advanced Research Computing

Introduction

This lecture is part of a series on Data Version Control (DVC), a tool for systematically keeping track of different versions of models and datasets.

This second lecture in the series will cover:

  • Including files from external sources.
  • Automation: creating and rerunning pipelines.

Learning outcomes

  • Apply different approaches for adding external files to a DVC repository.
  • Understand how pipelines are represented in DVC.
  • Define a DVC pipeline via a YAML file or commands.
  • Explain how DVC reproduces the outputs of pipelines.
  • Use parameters and metrics to control and compare the outputs of DVC pipelines.
  • Compare DVC to other workflow tools such as MLflow.

Adding external files

We have seen how we can track our own files with dvc add. But what if we want to include data or other files that are already available?

DVC offers different options:

  • dvc get to download a file from a DVC or Git repository.
  • dvc import to reuse (download and track) a file from a DVC or Git repository.
  • dvc get-url and dvc import-url to download or reuse a file from general remote storage.

Adding external files

The get and import commands are similar.

The difference is that import also links back to the original file and tracks its history.

Therefore, if the original repository is updated later on, we can pull the changes into our copy (see dvc update).

The get-url and import-url variants are useful when the original data is not already in a repository. They can handle different storage protocols and providers.

Creating pipelines

One of the core benefits of using a data-focused version control system is that we can structure our work around data flows, not individual files.

With DVC, we can

  • specify each stage of a pipeline,
  • infer the connections between them,
  • run a whole pipeline or only the parts required.
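The idea behind inferring connections can be pictured as building a dependency graph: a stage that consumes another stage's output must run after it. A minimal, hypothetical sketch in Python (this is an illustration of the concept, not DVC's actual implementation):

```python
# Toy illustration of how a pipeline tool can order stages from their
# declared dependencies and outputs. Stage names and files match the
# example used later in this lecture.
stages = {
    "pca": {"deps": ["samples.csv"], "outs": ["reduced.csv"]},
    "classification": {"deps": ["reduced.csv"], "outs": ["classifier.pkl"]},
}

def run_order(stages):
    """Return stage names in an order that respects dependencies."""
    produced_by = {out: name for name, s in stages.items() for out in s["outs"]}
    order, seen = [], set()

    def visit(name):
        if name in seen:
            return
        seen.add(name)
        for dep in stages[name]["deps"]:
            if dep in produced_by:  # this dependency is another stage's output
                visit(produced_by[dep])
        order.append(name)

    for name in stages:
        visit(name)
    return order

print(run_order(stages))  # ['pca', 'classification']
```

Because reduced.csv is an output of pca and a dependency of classification, the two stages are linked, and pca is always scheduled first.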

Pipeline example

Let’s consider an example data pipeline; you can find a toy implementation of it on GitHub.

[Figure: a data pipeline in which a set of samples undergoes PCA and then logistic regression; each step is a box, with arrows linking them.]

We start with a labelled set of samples in a file samples.csv.

The first step is to run reduce_dim.py. This reads the input, performs principal component analysis (PCA) to reduce its dimensionality, and writes the result to a file reduced.csv.

Running log_reg.py then trains a logistic regression classifier on the reduced dataset, serializes the trained model, and writes it in Pickle format to a file classifier.pkl.
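The scripts themselves are ordinary Python. A heavily simplified stand-in for reduce_dim.py might look like the following; note that it uses a fixed column selection rather than real PCA, so that it needs no external libraries:

```python
# Simplified stand-in for reduce_dim.py: it reads samples.csv, keeps the
# first n_components feature columns plus the label column, and writes
# reduced.csv. A real implementation would run PCA (e.g. with
# scikit-learn) instead of this fixed selection.
import csv

def reduce_dim(in_path="samples.csv", out_path="reduced.csv", n_components=2):
    with open(in_path, newline="") as f:
        rows = list(csv.reader(f))
    # Keep the first n_components feature columns and the final label column.
    keep = list(range(n_components)) + [len(rows[0]) - 1]
    with open(out_path, "w", newline="") as f:
        csv.writer(f).writerows([row[i] for i in keep] for row in rows)

# Tiny input file so the sketch is self-contained.
with open("samples.csv", "w") as f:
    f.write("x1,x2,x3,label\n0.1,0.2,0.3,a\n0.4,0.5,0.6,b\n")

reduce_dim()
```

The important point for DVC is only the script's interface: it reads samples.csv and writes reduced.csv, which is exactly what the pipeline definition below records.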

Representing a pipeline

DVC sees pipelines as collections of steps. Each step has some inputs or dependencies and produces outputs. The different stages are linked to each other through these.

Furthermore, a pipeline can have one or more parameters that control its stages and customize their behaviour.

dvc.yaml files

DVC defines pipelines in a structured file dvc.yaml written in the YAML format. The previous example could be defined by:

stages:
  pca:
    cmd: python reduce_dim.py
    deps:
      - reduce_dim.py
      - samples.csv
    params:
      - total_var
    outs:
      - reduced.csv 
  classification:
    cmd: python log_reg.py
    deps:
      - log_reg.py
      - reduced.csv
    outs:
      - classifier.pkl

dvc.yaml files

Every step has its own section, appropriately named (pca, classification). Each section specification has:

  • the command to run that step (cmd),
  • a list of dependencies (deps) which feed into the step,
  • a list of outputs (outs) that the step produces,
  • a list of parameters (params) that control that step.

Note that the dependencies include the code itself, as well as any input files!

DVC parameters

The parameters are stored in a separate file. By default this is called params.yaml, but other file names and formats are also supported.

In our example, the file needs to contain one parameter (the total variance explained by the chosen PCA dimensions):

total_var: 0.9

Creating steps

Rather than write the above dvc.yaml file all at once, we can add the steps in order and let DVC create the file.

To do this, we need to tell it how to execute each step, by using the dvc stage add command at the terminal:

dvc stage add \
    -n pca \
    -d reduce_dim.py -d samples.csv \
    -p total_var \
    -o reduced.csv \
    python reduce_dim.py

Creating steps

The properties of the stage are given in the command options:

  • -n: name of the step,
  • -d: dependencies,
  • -p: parameters,
  • -o: outputs,
  • finally, the command to run.

Reproducing a pipeline

We can run the pipeline stages using the command dvc repro.

Often, running all stages will be redundant. For example, if we have only made changes to the log_reg.py file since the last time we ran the pipeline, then the previous step is unaffected and does not need to be rerun.

However, if we had updated the samples file or modified the parameter controlling the PCA step, then that step would need to be rerun, as well as all steps downstream of it.

By tracking changes to files and using the pipeline structure, DVC can infer which steps have changed and only run those.
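The mechanism can be sketched in a few lines: hash every dependency and compare against the hashes recorded after the previous run (DVC keeps these in dvc.lock). This is a simplification of the idea, not DVC's actual code:

```python
# Toy illustration of how dvc repro can decide which stages to rerun.
# Real DVC also reruns every stage downstream of a changed one; this
# sketch only flags the directly affected stages.
import hashlib

def file_hash(path):
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

def stages_to_rerun(stages, lock):
    """Names of stages whose dependencies changed since `lock` was recorded."""
    return [
        name
        for name, stage in stages.items()
        if any(lock.get(name, {}).get(dep) != file_hash(dep)
               for dep in stage["deps"])
    ]
```

If no dependency hash differs from the recorded one, nothing is rerun; touching log_reg.py alone would flag only the classification stage.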

Metrics and outputs

DVC also allows us to compare the performance of our models as we make changes to them.

We do this by declaring metrics as part of a pipeline.

In our example, let’s assume we have an additional script evaluate.py, which evaluates our trained classifier and stores the results in a file scores.json.

If we compute two performance metrics, the precision and the area under the ROC curve, the results may look like:

{ "precision": 0.63, "roc_auc": 0.85 }
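A hypothetical evaluate.py could produce such a file with nothing more than the json module. The metric values here are placeholders; a real script would compute them from the trained classifier and held-out data:

```python
# Hypothetical sketch of evaluate.py: compute metrics for the trained
# model and write them to scores.json for DVC to track.
import json

def write_scores(precision, roc_auc, path="scores.json"):
    with open(path, "w") as f:
        json.dump({"precision": precision, "roc_auc": roc_auc}, f)

# In a real script these values would come from evaluating classifier.pkl
# on held-out data; here they are placeholders.
write_scores(0.63, 0.85)
```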

Metrics and outputs

We can record this by expanding the pipeline description with an extra step, like this:

stages:
  (...as above...)
  evaluation:
    cmd: python evaluate.py
    deps:
      - evaluate.py
      - classifier.pkl
    outs:
      - scores.json
metrics:
  - scores.json

Metrics and outputs

Running the whole pipeline with dvc repro will now also produce the file with the scores.

Let’s make a change to the parameters file, e.g. to increase total_var to 0.95.

DVC will notice this and can show us the difference if we run dvc params diff:

Path         Param          Old    New
params.yaml  total_var      0.9    0.95

Metrics and outputs

We can also easily inspect what effect this has on the performance metrics.

If we run the pipeline again, then dvc metrics diff will show us how the metrics have changed:

Path         Metric     Old     New      Change
scores.json  precision  0.63    0.65     0.02
scores.json  roc_auc    0.85    0.91     0.06

In two commands, we can rerun a whole series of steps and inspect the results; in this case, we see that increasing the variance retained by PCA has improved performance.
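The comparison that dvc metrics diff performs amounts to reading the old and new scores and reporting the per-metric change; as a rough sketch:

```python
# Toy version of the comparison behind dvc metrics diff: report the
# change for every metric present in both score dictionaries.
def metrics_diff(old, new):
    return {k: round(new[k] - old[k], 4) for k in old if k in new}

old = {"precision": 0.63, "roc_auc": 0.85}
new = {"precision": 0.65, "roc_auc": 0.91}
print(metrics_diff(old, new))  # {'precision': 0.02, 'roc_auc': 0.06}
```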

Other DVC features

  • Automated plots for metrics.
  • Experiments collect different runs and present them cleanly.
  • Pushing to remote storage, for backups and sharing.
  • Python API lets us call DVC functionality from a program (e.g. add a new file).

MLflow

MLflow is a tool for tracking and rerunning machine learning workflows.

It is similar to DVC, but it is controlled programmatically rather than from the command line, and it is primarily focused on Python workflows. It provides:

  • Tracking of code, data files, parameter configurations, environment, results.
  • (Re)running code locally or remotely.
  • Visual interface for monitoring progress.
  • Integrations with specific toolchains and frameworks, e.g. TensorFlow.

Summary

  • DVC supports including data and files from other projects, from a variety of sources.
  • Combining steps in pipelines makes it simpler to rerun analyses without manual intervention.
  • Apart from files, DVC can also track metrics and automatically present the effect of changes.