DVC: Data Version Control

Let's learn how to use tools for data versioning.

We already know the most famous versioning tool: git!

Question! 1

Do you think we can use git to version our parquet and CSV files? Explain whether or not this is a good idea.

Answer!

It is possible, but it is not the best approach. Git is primarily designed for versioning source code and other text-based files. It does not provide efficient storage or meaningful diffs for the large, frequently changing binary files used in ML.
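
To make the storage concern concrete, here is a minimal sketch that commits two versions of a file of about 5 MB to a throwaway Git repository and prints how large the .git folder becomes. The file name, the size, the demo identity, and the temporary directory are assumptions chosen for illustration; random bytes stand in for a binary dataset such as a parquet file.

```shell
# Sketch: every committed version of a large binary file stays in Git history.
repo_dir="$(mktemp -d)"
cd "$repo_dir"
git init -q .

# Hypothetical identity so the commits work on a machine without Git config.
git config user.name "Demo User"
git config user.email "demo@example.com"

# Version 0: about 5 MB of random bytes (stands in for a binary dataset).
head -c 5000000 /dev/urandom > data.bin
git add data.bin
git commit -q -m "data version 0"

# Version 1: the same file name with completely different content.
head -c 5000000 /dev/urandom > data.bin
git add data.bin
git commit -q -m "data version 1"

# Size of the Git folder in kilobytes: it now holds both versions.
git_size_kb="$(du -sk .git | cut -f1)"
echo ".git size after two commits: ${git_size_kb} KB"
```

Both versions remain in the repository history, so every clone has to download all of them, even if only the latest one is needed.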

An Alternative

In this class we will explore DVC, an open-source version control system for Data Science and ML projects. DVC provides a git-like experience to organize your data, models, and experiments.

From the DVC creators

DVC is a tool for data science that takes advantage of existing software engineering toolset.

It helps ML teams manage large datasets, make projects reproducible, and collaborate better.

Create repository

Question! 2

Create a private repository to be used in the experiment and clone it on your machine.

DVC also works with a public repository. Using a private one is only a recommendation.

Important!

Access the repository folder and work from there!

Install dvc

Question! 3

Create (or activate) a virtual environment to be used in class.

Question! 4

After activating the environment, we install dvc with:

    pip install -U pip
    pip install "dvc[s3]"

Use dvc

Let's initialize dvc in our repository.

Question! 5

Make sure you are at the root of the repository and run:

    dvc init

Then, we can use dvc to download a dataset.

Question! 6

To download a dataset, run:

    dvc get-url https://mlops-material.s3.us-east-2.amazonaws.com/data_v0.csv data/data.csv

Tip! 1

You could have used any tool (curl, wget) to download this file.

If this file were in a Git repository, you could use dvc get instead of dvc get-url.

Question! 7

Then we configure dvc to track the data/data.csv file with:

    dvc add data/data.csv

Question! 8

And commit changes with:

    git add data/data.csv.dvc data/.gitignore
    git commit -m "Add data to project"
    git push

Question! 9

Go to github.com and open your repository. Browse the file structure and identify the DVC-related files that we don't usually see in our repositories. Can you find the data.csv file?

Remotes

In DVC, remotes refer to the storage locations where you can store and retrieve your data, models, and other artifacts.

For this we can use, among other alternatives, a local folder or an S3 bucket.

First we will use a local folder.

Question! 10

Create a folder dvcstore anywhere outside the repository folder and save its path:

    mkdir /home/user/dvcstore

Question! 11

Let's configure DVC to use this folder as remote storage:

Attention!

Change /home/user/dvcstore to the path of the previously created folder!

    dvc remote add -d myremote /home/user/dvcstore

Question! 12

To upload data to the storage, run:

Attention!

Out of curiosity, check that your storage folder is empty before running!

    dvc push

Attention!

Check that your storage folder is not empty after running!
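
The files that appear in the storage folder are not named data.csv, because DVC stores file contents under names derived from a checksum of the content. The following is a minimal sketch of that idea using md5sum, not DVC's actual code. The sample file, the store directory, and the two-character prefix layout are assumptions for illustration, and the exact layout inside a real remote depends on the DVC version.

```shell
# Sketch of content-addressed storage (illustrative, not DVC's implementation).
work_dir="$(mktemp -d)"
store_dir="$work_dir/store"          # hypothetical stand-in for a remote folder
printf 'id,value\n1,10\n2,20\n' > "$work_dir/sample.csv"

# Compute the MD5 checksum of the file content (md5sum is part of GNU coreutils).
checksum="$(md5sum "$work_dir/sample.csv" | cut -d' ' -f1)"

# Split the checksum into a 2-character directory and a 30-character file name.
prefix="$(printf '%s' "$checksum" | cut -c1-2)"
rest="$(printf '%s' "$checksum" | cut -c3-)"

# Store the content under a path derived from the checksum.
mkdir -p "$store_dir/$prefix"
cp "$work_dir/sample.csv" "$store_dir/$prefix/$rest"

stored_path="$store_dir/$prefix/$rest"
echo "stored as: $stored_path"
```

Because the name depends only on the content, identical files are stored once, and a changed file gets a new name instead of overwriting the old version.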

Testing the remote

To check that DVC is actually tracking the file, let's simulate deleting and then restoring the data/data.csv file.

Question! 13

To do this, run:

    rm -rf .dvc/cache
    rm -f data/data.csv

Question! 14

Check that the file has indeed been deleted.

Question! 15

To restore the file we can do:

    dvc pull

Question! 16

Check that the file has indeed been restored.
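
One way to confirm the restoration, if you repeat the experiment, is to compare checksums taken before the deletion and after the restore. The sketch below assumes md5sum is available. The helper name file_checksum and the temporary demo files are hypothetical, and the copy steps only simulate what the remote and the restore would do.

```shell
# Hypothetical helper: print the MD5 checksum of a file (requires md5sum).
file_checksum() {
    md5sum "$1" | cut -d' ' -f1
}

# Self-contained demo with a temporary file standing in for data/data.csv.
demo_dir="$(mktemp -d)"
printf 'a,b\n1,2\n' > "$demo_dir/data.csv"
cp "$demo_dir/data.csv" "$demo_dir/backup.csv"   # simulates the copy kept in the remote

before="$(file_checksum "$demo_dir/data.csv")"   # record before deleting
rm -f "$demo_dir/data.csv"                       # simulate the deletion
cp "$demo_dir/backup.csv" "$demo_dir/data.csv"   # simulate the restore
after="$(file_checksum "$demo_dir/data.csv")"    # record after restoring

if [ "$before" = "$after" ]; then
    echo "restored file matches the original"
else
    echo "restored file differs from the original"
fi
```

In the class repository, you would call file_checksum data/data.csv once before deleting the file and once after restoring it, then compare the two values.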

Checkout to version

When developing with git, it is common to check out a specific version of the software to explore it.

With DVC, we gain the ability to version the data as well, without using the Git repository itself as storage.

Question! 17

Let's ensure that all changes so far have been committed.

    git add .
    git commit -m "version 0"
    git push

To make it easier to understand, let's create a tag:

Tip! 2

A git tag is a named reference to a specific commit in a Git repository.
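
As a small illustration of this tip, the sketch below creates an annotated tag in a throwaway repository and shows that the tag name and HEAD resolve to the same commit hash. The temporary directory, the demo identity, the file name, and the tag name demo-tag are assumptions made for the example.

```shell
# Sketch: an annotated tag is a name that resolves to one specific commit.
tag_demo_dir="$(mktemp -d)"
cd "$tag_demo_dir"
git init -q .
git config user.name "Demo User"        # hypothetical identity for the demo
git config user.email "demo@example.com"

echo "first version" > notes.txt
git add notes.txt
git commit -q -m "first commit"

# Create an annotated tag on the current commit.
git tag -a demo-tag -m "Demo tag"

# Resolve both names to commit hashes; they point at the same commit.
head_commit="$(git rev-parse HEAD)"
tag_commit="$(git rev-parse 'demo-tag^{commit}')"
echo "HEAD:     $head_commit"
echo "demo-tag: $tag_commit"
```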

Question! 18

To create the tag v0.0.0, run:

    git tag -a v0.0.0 -m "Release version 0.0.0"

Let's simulate that the data scientist identified the need to add new features to improve the model's performance.

Question! 19

A new file has been prepared by the professor and can be downloaded with:

    dvc get-url --force https://mlops-material.s3.us-east-2.amazonaws.com/data_v1.csv data/data.csv

Attention!

After downloading, open the file data/data.csv and see that it has more columns than the previous version!
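
Instead of opening the file, you could also count the columns from the terminal. The sketch below assumes a comma-delimited file with no quoted commas in the header. The helper name count_columns and the two sample files, with made-up column names, are hypothetical and only make the example self-contained.

```shell
# Hypothetical helper: count the comma-separated fields in a CSV header line.
# Assumes a comma delimiter and no quoted commas inside column names.
count_columns() {
    head -n 1 "$1" | awk -F',' '{ print NF }'
}

# Two small sample files standing in for the old and new dataset versions.
csv_dir="$(mktemp -d)"
printf 'id,age,income\n1,30,1000\n' > "$csv_dir/old.csv"
printf 'id,age,income,city,score\n1,30,1000,X,0.5\n' > "$csv_dir/new.csv"

old_cols="$(count_columns "$csv_dir/old.csv")"
new_cols="$(count_columns "$csv_dir/new.csv")"
echo "old: $old_cols columns, new: $new_cols columns"
```

In the class repository, running count_columns data/data.csv before and after the download would show the difference between the two versions.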

Attention!

Create a new src/train.py file to simulate adding some source code:

# simulate training. No need to add source code!

Question! 20

Commit the changes, both in git and dvc:

    dvc commit data/data.csv
    dvc push
    git add .
    git commit -m "version 1"
    git push

Question! 21

Create a new v0.0.1 tag with:

    git tag -a v0.0.1 -m "Release version 0.0.1"

Now we can switch between versions, checking out both git and dvc. This way, both the source code and the data are versioned!

Attention!

Keep the file data/data.csv open in VSCode and split your screen. This way you will be able to observe the modifications in the file as soon as the checkout occurs in DVC!

Question! 22

To switch to version v0.0.0 do:

    git checkout v0.0.0
    dvc checkout

Attention!

Check if the data/data.csv file has been restored to the previous version.

Question! 23

To switch to version v0.0.1 do:

    git checkout v0.0.1
    dvc checkout

Attention!

Repeat these last two steps a few times and check both the repository and the data being changed!

Important

It is not mandatory to use git tag. You could check out a commit directly.

We use the tag just to standardize and have a named commit!
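
To illustrate checking out a commit without a tag, here is a minimal sketch in a throwaway repository. The temporary directory, the demo identity, and the file train.txt are assumptions for the example; in a DVC project, the Git checkout would be followed by dvc checkout so that the data matches the commit.

```shell
# Sketch: check out an earlier commit by its hash instead of a tag.
hash_demo_dir="$(mktemp -d)"
cd "$hash_demo_dir"
git init -q .
git config user.name "Demo User"        # hypothetical identity for the demo
git config user.email "demo@example.com"

echo "version 0" > train.txt
git add train.txt
git commit -q -m "version 0"
first_commit="$(git rev-parse HEAD)"    # remember the hash of the first commit

echo "version 1" > train.txt
git add train.txt
git commit -q -m "version 1"

# Check out the first commit directly (Git enters a detached HEAD state).
git checkout -q "$first_commit"
# In a DVC project you would now also run: dvc checkout

content="$(cat train.txt)"
echo "train.txt now contains: $content"
```

You can find commit hashes with git log --oneline. A tag simply saves you from having to look them up.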