Centre for Advanced Research Computing
This lecture is part of a series on Data Version Control (DVC), a way of systematically keeping track of different versions of models and datasets.
This second lecture in the series covers reusing external data, defining data pipelines, and tracking parameters and metrics.
We have seen how we can track our own files with dvc add. But what if we want to include data or other files that are already available?
DVC offers different options:
- dvc get to download a file from a DVC or Git repository.
- dvc import to reuse (download and track) a file from a DVC or Git repository.
- dvc get-url and dvc import-url to download or reuse a file from general remote storage.

The difference is that the import variants also link to the original file and track its history.
Therefore, if an update is made to the original repository later on, we can get the changes to our copy (see dvc update).
The get-url and import-url variants are useful when the original data is not already in a repository. They can handle different storage protocols and providers.
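As a sketch, the commands look like this; the repository URL and file paths are placeholders for illustration, not a real dataset:

```shell
# download a one-off copy (no tracking)
dvc get https://github.com/example/data-registry data/samples.csv

# download and track, keeping a link to the source so `dvc update` works later
dvc import https://github.com/example/data-registry data/samples.csv

# for data that is not in a DVC or Git repository
dvc import-url https://example.com/data/samples.csv samples.csv
```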
One of the core benefits of using a data-focused version control system is that we can structure our work around data flows, not individual files.
With DVC, we can describe our workflow as a pipeline: a sequence of stages, each consuming some inputs and producing some outputs.
Let’s consider an example data pipeline; you can find a toy implementation of it on GitHub.
Start with a labelled set of samples in a file samples.csv.
The first step is to run reduce_dim.py. This reads the inputs, performs principal component analysis (PCA) to reduce their dimensionality, and outputs to a file reduced.csv.
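As an illustration, reduce_dim.py might look something like the sketch below. Using numpy's SVD for PCA (rather than scikit-learn), assuming samples.csv has no header row, and taking the label from the last column are all assumptions for this sketch, not the lecture's actual implementation.

```python
"""Minimal sketch of reduce_dim.py: PCA via numpy SVD (illustrative only)."""
import csv

import numpy as np

def reduce_dim(in_path="samples.csv", out_path="reduced.csv", total_var=0.9):
    # total_var would come from params.yaml in the real pipeline
    with open(in_path) as f:
        rows = list(csv.reader(f))
    X = np.array([[float(v) for v in r[:-1]] for r in rows])
    labels = [r[-1] for r in rows]

    # centre the data, then PCA via singular value decomposition
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

    # keep the smallest number of components explaining >= total_var of variance
    explained = (S ** 2) / (S ** 2).sum()
    k = int(np.searchsorted(np.cumsum(explained), total_var)) + 1
    reduced = Xc @ Vt[:k].T

    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        for row, label in zip(reduced, labels):
            writer.writerow([*row, label])

# a real script would call reduce_dim() under `if __name__ == "__main__":`
```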
Running log_reg.py then trains a logistic regression classifier on the reduced dataset, serializes the trained model and outputs it in Pickle format to a file classifier.pkl.
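Similarly, a sketch of what log_reg.py might do, using plain numpy gradient descent instead of scikit-learn; the binary labels and the file layout are assumptions:

```python
"""Minimal sketch of log_reg.py: logistic regression by gradient descent."""
import csv
import pickle

import numpy as np

def train(in_path="reduced.csv", out_path="classifier.pkl",
          lr=0.1, n_iter=500):
    with open(in_path) as f:
        rows = list(csv.reader(f))
    X = np.array([[float(v) for v in r[:-1]] for r in rows])
    classes = sorted({r[-1] for r in rows})
    y = np.array([classes.index(r[-1]) for r in rows], dtype=float)  # 0/1

    # add a bias column and fit weights by gradient descent on the log-loss
    Xb = np.hstack([np.ones((len(X), 1)), X])
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))   # predicted probabilities
        w -= lr * Xb.T @ (p - y) / len(y)   # gradient step

    # serialize the trained model (weights plus class names) with pickle
    with open(out_path, "wb") as f:
        pickle.dump({"weights": w, "classes": classes}, f)
    return w

# a real script would call train() under `if __name__ == "__main__":`
```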
DVC sees pipelines as collections of stages. Each stage has some inputs or dependencies and produces outputs, and stages are linked to one another through these.
Furthermore, a pipeline can have one or more parameters that control its stages and customize their behaviour.
DVC defines pipelines in a structured file, dvc.yaml, written in the YAML format. The previous example could be defined by:
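A sketch of what that dvc.yaml might contain, assuming each script is run with python; the exact layout in the lecture's repository may differ:

```yaml
stages:
  pca:
    cmd: python reduce_dim.py
    deps:
      - reduce_dim.py
      - samples.csv
    params:
      - total_var
    outs:
      - reduced.csv
  classification:
    cmd: python log_reg.py
    deps:
      - log_reg.py
      - reduced.csv
    outs:
      - classifier.pkl
```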
Every stage has its own section, appropriately named (pca, classification). Each section specifies:
- the command to run (cmd),
- the dependencies (deps) which feed into the stage,
- the outputs (outs) that the stage produces,
- the parameters (params) that control that stage.

Note that the dependencies include the code itself, as well as any input files!
The parameters are stored in a separate file. By default this is called params.yaml but other formats are allowed.
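For our example, params.yaml could be as simple as a single entry (the parameter name and value are the ones used later in the lecture):

```yaml
# fraction of the total variance the PCA step should retain
total_var: 0.9
```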
Rather than write the above dvc.yaml file all at once, we can add the steps in order and let DVC create the file.
To do this, we need to tell it how to execute each step, by using the dvc stage add command at the terminal:
The properties of the stage are given in the command options:
- -n: name of the stage,
- -d: dependencies,
- -p: parameters,
- -o: outputs.

We can run the pipeline stages using the command dvc repro.
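For the example pipeline, the two stages might be added like this (run from a terminal in an initialized DVC repository; file and stage names follow the example above):

```shell
dvc stage add -n pca \
    -d reduce_dim.py -d samples.csv \
    -p total_var \
    -o reduced.csv \
    python reduce_dim.py

dvc stage add -n classification \
    -d log_reg.py -d reduced.csv \
    -o classifier.pkl \
    python log_reg.py

# execute the pipeline
dvc repro
```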
Often, running all stages will be redundant. For example, if we have only made changes to the log_reg.py file since the last time we ran the pipeline, then the previous step is unaffected and does not need to be rerun.
However, if we had updated the samples file or modified the parameter controlling the PCA step, then that step would need to be rerun, as well as all steps downstream of it.
By tracking changes to files and using the pipeline structure, DVC can infer which steps have changed and only run those.
DVC also allows us to compare the performance of our models as we make changes to them.
We do this by declaring metrics as part of a pipeline.
In our example, let’s assume we have an additional script evaluate.py which evaluates our trained classifier and stores the results in a file scores.json.
We can record this by expanding the pipeline description with an extra step, like this:
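A sketch of the extra stage, marking scores.json as a metrics file so DVC can compare it across runs; cache: false keeps the small JSON file in Git rather than in the DVC cache:

```yaml
  # added under the existing stages: section
  evaluate:
    cmd: python evaluate.py
    deps:
      - evaluate.py
      - classifier.pkl
    metrics:
      - scores.json:
          cache: false
```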
Running the whole pipeline with dvc repro will now also produce the file with the scores.
Let’s make a change to the parameters file, e.g. to increase total_var to 0.95.
DVC will notice this and can show us the difference if we run dvc params diff:
```
Path         Param      Old   New
params.yaml  total_var  0.9   0.95
```
We can also easily inspect what effect this has on the performance metrics.
If we run the pipeline again, then dvc metrics diff will show us how the metrics have changed:
```
Path         Metric     Old   New   Change
scores.json  precision  0.63  0.65  0.02
scores.json  roc_auc    0.85  0.91  0.06
```
In two commands, we can run a whole series of steps and inspect the results; in this case, we see that retaining more variance in the PCA step has improved performance.
MLflow is a tool for tracking and rerunning machine learning workflows.
It is similar to DVC but is controlled programmatically rather than from the command line, and is primarily focused on Python workflows. It provides: