DVC: Data Version Control

Let's learn how to use tools for data versioning.

We already know the most famous versioning tool: git!

Question! 1

Do you think we can use git to version our parquets and CSVs? Explain whether or not this is a good idea.

Answer!

It is possible, but it is not the best approach. Git is primarily designed for versioning source code and other text-based files. It does not provide efficient storage or meaningful diffs for the large, frequently changing binary files used in ML.

An Alternative

In this class we will explore DVC, an open-source version control system for Data Science and ML projects. DVC provides a git-like experience to organize your data, models, and experiments.

From the DVC creators

DVC is a tool for data science that takes advantage of existing software engineering toolset.

It helps ML teams manage large datasets, make projects reproducible, and collaborate better.

Create repository

Question! 2

Create a private repository to be used in the experiment and clone it on your machine.

DVC also works with a public repository! Using a private one is only a recommendation.

Important!

Access the repository folder and work from there!

Install dvc

Question! 3

Create (or activate) a virtual environment to be used in class.

Question! 4

After activating the environment, we install dvc with:

$ pip install -U pip
$ pip install "dvc[s3]"

Use dvc

Let's initialize dvc in our repository.

Question! 5

Make sure you are at the root of the repository and run:

$ dvc init

Then, we can use dvc to download a dataset.

Question! 6

To download a dataset, run:

$ dvc get-url https://mlops-material.s3.us-east-2.amazonaws.com/data_v0.csv  data/data.csv

Tip! 1

You could have used any tool (curl, wget) to download this file.

If this file were in a git repository, you could use dvc get instead of dvc get-url.

Question! 7

Then we configure dvc to track the data/data.csv file with:

$ dvc add data/data.csv

Question! 8

And commit changes with:

$ git add data/data.csv.dvc data/.gitignore
$ git commit -m "Add data to project"
$ git push

Question! 9

Go to github.com and open your repository. Browse the file structure and look for DVC-related files that we don't usually have in our repositories. Can you find the data.csv file?
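The data/data.csv.dvc file that we committed is a small text file that git tracks in place of the data itself. Among other fields, it records an MD5 hash of the file's contents. The sketch below shows one way such a content hash can be computed; file_md5 is a hypothetical helper written for this class, not DVC's own code, and the sample bytes are made up.

```python
import hashlib
import tempfile
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # Reading in chunks keeps memory use flat even for large datasets.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demo on throwaway files so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp) / "data_a.csv"
    second = Path(tmp) / "data_b.csv"
    first.write_bytes(b"id,value\n1,10\n")
    second.write_bytes(b"id,value\n1,11\n")

    hash_first = file_md5(first)
    hash_second = file_md5(second)
    print(hash_first)
    print(hash_second)
```

Identical contents always give the same hash and, in practice, different contents give different hashes. That is why a tiny pointer file is enough to identify one exact version of a large dataset.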

Remotes

In DVC, remotes refer to the storage locations where you can store and retrieve your data, models, and other artifacts.

For this we can use, among other alternatives, a local folder or an S3 bucket.

First we will use a local folder.

Question! 10

Create a folder named dvcstore anywhere outside the repository folder and take note of its path:

$ mkdir /home/user/dvcstore

Question! 11

Let's configure DVC to use this folder as remote storage:

Attention!

Change /home/user/dvcstore to the path of the previously created folder!

$ dvc remote add -d myremote /home/user/dvcstore

Question! 12

To upload data to the storage, run:

Attention!

Out of curiosity, check that your storage folder is empty before running!

$ dvc push

Attention!

Check that your storage folder is not empty after running!
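If you look inside the storage folder, you will see that the files are not named data.csv: they are stored under names derived from their content hash. The sketch below imitates that idea with plain Python. It is only an illustration under assumptions: push_to_store is a hypothetical helper, and the two-character prefix layout is a simplification (the real layout depends on the DVC version).

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def push_to_store(src, store):
    """Copy src into store under a name derived from its content hash."""
    # Reads the whole file at once, which is fine for a small demo file.
    digest = hashlib.md5(Path(src).read_bytes()).hexdigest()
    target = Path(store) / digest[:2] / digest[2:]
    if not target.exists():  # identical content is stored only once
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(src, target)
    return target


with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp) / "dvcstore"
    store.mkdir()
    data = Path(tmp) / "data.csv"
    data.write_bytes(b"id,value\n1,10\n")

    empty_before = not any(store.iterdir())
    stored_path = push_to_store(data, store)
    stored_again = push_to_store(data, store)  # same content, same name
    files_in_store = [p for p in store.rglob("*") if p.is_file()]
    print(empty_before, len(files_in_store))
```

Because the name comes from the content, pushing the same data twice does not duplicate it, while a new version of the file lands under a new name next to the old one.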

Testing the remote

To check whether DVC is actually tracking the file, let's simulate deleting and then restoring data/data.csv.

Question! 13

To do this, run:

$ rm -rf .dvc/cache
$ rm -f data/data.csv

Question! 14

Check that the file has indeed been deleted:

$ ls -la data/

Question! 15

To restore the file we can do:

$ dvc pull

Question! 16

Check that the file has indeed been restored:

$ ls -la data/
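Listing the folder only shows that a file is back, not that it is the right one. As a rough extra check, we could compare the file's hash with the one recorded in its .dvc pointer. The sketch below assumes that the pointer stores a plain MD5 of the file's bytes in an "md5:" field, which to our knowledge holds for recent DVC releases; recorded_md5 and matches_pointer are hypothetical helpers, and the pointer used in the demo is a hand-written stand-in.

```python
import hashlib
import re
import tempfile
from pathlib import Path


def recorded_md5(dvc_file):
    """Extract the first 32-character hex MD5 found in a .dvc pointer file."""
    match = re.search(r"md5:\s*([0-9a-f]{32})", Path(dvc_file).read_text())
    return match.group(1) if match else None


def matches_pointer(data_file, dvc_file):
    """True if the data file's MD5 equals the hash recorded in the pointer."""
    actual = hashlib.md5(Path(data_file).read_bytes()).hexdigest()
    return actual == recorded_md5(dvc_file)


with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "data.csv"
    data.write_bytes(b"id,value\n1,10\n")

    # Hand-written stand-in for data/data.csv.dvc; real files hold more fields.
    pointer = Path(tmp) / "data.csv.dvc"
    digest = hashlib.md5(data.read_bytes()).hexdigest()
    pointer.write_text(f"outs:\n- md5: {digest}\n  path: data.csv\n")

    intact = matches_pointer(data, pointer)
    data.write_bytes(b"id,value\n1,99\n")  # simulate a modified file
    changed = matches_pointer(data, pointer)
    print(intact, changed)
```

In day-to-day work, dvc status does this kind of comparison for us and reports whether the workspace differs from what the .dvc files describe.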

Checkout to version

When developing with git, it is common to check out a specific version of the software in order to explore it.

With DVC, we add the ability to version the data as well, without necessarily using the git repository as its storage.

Question! 17

Let's ensure that all changes so far have been committed.

$ git add .
$ git commit -m "version 0"
$ git push

To make it easier to understand, let's create a tag:

Tip! 2

A git tag is a named reference to a specific commit in a Git repository.

Question! 18

To create the tag v0.0.0, run:

$ git tag -a v0.0.0 -m "Release version 0.0.0"

Let's pretend that the data scientist has identified the need to add new features to improve the model's performance.

Question! 19

A new file has been prepared by the professor and can be downloaded with:

$ dvc get-url --force https://mlops-material.s3.us-east-2.amazonaws.com/data_v1.csv  data/data.csv

Attention!

After downloading, open the file data/data.csv and see that it has more columns than the previous version!
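Instead of comparing the files by eye, we could list the new columns with a few lines of Python. This is a minimal sketch using only the standard library; the column names and rows below are made-up stand-ins, not the real contents of data_v0.csv and data_v1.csv.

```python
import csv
import io


def header_of(csv_text):
    """Return the column names from the first row of CSV text."""
    return next(csv.reader(io.StringIO(csv_text)))


def added_columns(old_header, new_header):
    """Columns present in the new header but not in the old one, in order."""
    old = set(old_header)
    return [col for col in new_header if col not in old]


# Hypothetical stand-ins for the two dataset versions.
v0_text = "id,age\n1,30\n"
v1_text = "id,age,income\n1,30,5000\n"

new_cols = added_columns(header_of(v0_text), header_of(v1_text))
print(new_cols)  # ['income']
```

To try it on the real data, keep a copy of the old file before downloading the new one and pass the text of each file to header_of.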

Attention!

Create a new src/train.py file to simulate adding some source code:

# simulate training. No need to add source code!

Question! 20

Commit the changes, both in git and dvc:

$ dvc commit data/data.csv
$ dvc push

$ git add .
$ git commit -m "version 1"
$ git push

Question! 21

Create a new v0.0.1 tag with:

$ git tag -a v0.0.1 -m "Release version 0.0.1"

Now we can switch between versions, checking out both git and dvc. This way, both the source code and the data are versioned!

Attention!

Keep the file data/data.csv open in VSCode and split your screen. This way you will be able to observe the modifications in the file as soon as the checkout occurs in DVC!

Question! 22

To switch to version v0.0.0 do:

$ git checkout v0.0.0
$ dvc checkout

Attention!

Check if the data/data.csv file has been restored to the previous version.

Question! 23

To switch to version v0.0.1 do:

$ git checkout v0.0.1
$ dvc checkout

Attention!

Repeat these last two steps a few times and check both the repository and the data being changed!
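Since git checkout and dvc checkout always go together here, they could be wrapped in a small helper. The sketch below assumes that git and dvc are available on the PATH and that it is called from the root of the repository. switch_version is a hypothetical name, and the run parameter exists only so that the function can be tried without touching a real repository.

```python
import subprocess


def switch_version(ref, run=subprocess.run):
    """Check out a git ref, then sync the DVC-tracked data to match it."""
    commands = [
        ["git", "checkout", ref],  # restores the code and the .dvc pointers
        ["dvc", "checkout"],       # restores the data the pointers refer to
    ]
    for cmd in commands:
        # check=True stops at the first failing command.
        run(cmd, check=True)
    return commands


# Dry run with a fake runner that only records what would be executed.
recorded = []
switch_version("v0.0.0", run=lambda cmd, check: recorded.append(cmd))
print(recorded)
```

Inside the repository, calling switch_version("v0.0.1") without the run argument would execute the two commands for real. The order matters: git must restore the .dvc file first, so that dvc checkout knows which version of the data to bring back.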

Important

It is not mandatory to use git tag. You could check out a commit directly.

We use the tag just to standardize and have a named commit!