
DVC + S3

DVC can store tracked data in a remote that points to an S3 bucket instead of a local folder.

Question! 1

Point out at least one advantage of using S3 as storage instead of a local folder.

Answer!

  • It facilitates collaboration between data scientists since information is centralized.

  • S3 scales with the data, whereas a local folder is limited by the disk's storage capacity.

  • S3 is also durable and secure, with data replication capabilities.

Practicing!

Create another repository and repeat the procedures from the previous handout.

Attention!

Make sure you have an AWS account and that the mlops profile is configured and set as default.

Visit the Environment Setup section if you need help.
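
A quick way to confirm the setup before continuing is sketched below. It assumes the profile is named mlops, as in the commands of this handout. Exporting AWS_PROFILE is one possible way to make it the default for the current shell session, not necessarily the one used in the Environment Setup section.

```shell
# Make the mlops profile the default for this shell session.
export AWS_PROFILE=mlops

# The guard keeps the snippet harmless on machines without the AWS CLI.
if command -v aws >/dev/null 2>&1; then
    # Show which credentials and region the AWS CLI will pick up.
    aws configure list || echo "Profile $AWS_PROFILE is not configured yet"
    # Ask AWS which identity these credentials belong to.
    aws sts get-caller-identity || echo "Could not authenticate with profile $AWS_PROFILE"
else
    echo "AWS CLI not found; see the Environment Setup section"
fi
```

If the second command prints an account ID and a user ARN, the credentials are valid.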

Some important steps:

Question! 2

Initialize DVC, then download and track the data:

$ dvc init
$ dvc get-url https://mlops-material.s3.us-east-2.amazonaws.com/data_v0.csv data/data.csv
$ dvc add data/data.csv
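
To see what `dvc add` produced, the files can be inspected as in the sketch below. It assumes the commands above were run from the repository root. The `DVC_FILE` variable is only an illustrative name.

```shell
# Pointer file that DVC creates next to the tracked data.
DVC_FILE="data/data.csv.dvc"

if [ -f "$DVC_FILE" ]; then
    # Small text file describing the tracked data (hash, size, path).
    cat "$DVC_FILE"
    # DVC adds the real file to this .gitignore so Git does not track it.
    cat data/.gitignore
    # The pointer file and the .gitignore should show up as untracked.
    git status --short
else
    echo "$DVC_FILE not found; run 'dvc add data/data.csv' first"
fi
```

Git versions only the small pointer file, while the CSV itself stays under DVC control.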

Question! 3

Git commit:

$ git add data/data.csv.dvc data/.gitignore
$ git commit -m "Add data to project"
$ git push

Question! 4

Create a bucket on S3 named after the pattern mlops-dvc-INSPERUSERNAME:

$ aws s3 mb s3://mlops-dvc-INSPERUSERNAME --profile mlops --region us-east-2
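
S3 bucket names must be lowercase, so a username with capital letters needs to be converted first. The sketch below builds the bucket name from a variable. `JohnD` is a hypothetical username used only for illustration.

```shell
# Hypothetical username; replace with your own.
INSPER_USERNAME="JohnD"

# Lowercase the username, since S3 rejects bucket names with uppercase letters.
BUCKET="mlops-dvc-$(printf '%s' "$INSPER_USERNAME" | tr '[:upper:]' '[:lower:]')"

echo "$BUCKET"   # prints mlops-dvc-johnd
```

The resulting value can then be used in place of mlops-dvc-INSPERUSERNAME in the commands of this handout.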

Question! 5

Configure S3 storage:

Attention!

Replace INSPERUSERNAME in the bucket name with your own username.

$ dvc remote add myremote s3://mlops-dvc-INSPERUSERNAME
$ dvc remote modify myremote profile mlops
$ dvc remote default myremote
$ dvc push
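
The remote settings are written to `.dvc/config`, which Git tracks. A possible follow-up, sketched below, is to review that file and commit it so collaborators get the same remote. The commit message is only a suggestion, and the snippet assumes it runs from the repository root.

```shell
# Name of the remote configured above.
REMOTE_NAME="myremote"

if command -v dvc >/dev/null 2>&1 && [ -d .dvc ]; then
    # Confirm the remote is registered and points to the expected bucket.
    dvc remote list | grep "$REMOTE_NAME" || echo "Remote $REMOTE_NAME not found"
    # Review the stored settings (remote URL, profile, default remote).
    cat .dvc/config
    # Version the configuration together with the project.
    git add .dvc/config
    git commit -m "Configure S3 remote for DVC" || echo "Nothing new to commit"
else
    echo "Run this inside a repository initialized with 'dvc init'"
fi
```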

Question! 6

Check the contents of the bucket and ensure that the data was actually stored!
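
One way to do this check from the terminal is sketched below, assuming the bucket name and the mlops profile used earlier. DVC stores files under names derived from their content hash, so expect hash-named objects rather than data.csv.

```shell
# Placeholder from the handout; replace INSPERUSERNAME with your username.
BUCKET="mlops-dvc-INSPERUSERNAME"

if command -v aws >/dev/null 2>&1; then
    # List every object stored in the bucket.
    aws s3 ls "s3://${BUCKET}" --recursive --profile mlops \
        || echo "Could not list s3://${BUCKET}"
fi

if command -v dvc >/dev/null 2>&1 && [ -d .dvc ]; then
    # Compare the local cache with the default remote.
    dvc status --cloud || echo "Could not compare the cache with the remote"
fi
```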

Attention!

After finishing the class, delete the bucket you created!
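
The bucket can be removed from the terminal as in the sketch below. The `CONFIRM_DELETE` variable is a hypothetical safety switch added for this example, not an AWS CLI feature. The `--force` flag empties the bucket before deleting it.

```shell
# Placeholder from the handout; replace INSPERUSERNAME with your username.
BUCKET="mlops-dvc-INSPERUSERNAME"

# Hypothetical safety switch: nothing is deleted unless it is set to "yes".
CONFIRM_DELETE="${CONFIRM_DELETE:-no}"

if [ "$CONFIRM_DELETE" = "yes" ] && command -v aws >/dev/null 2>&1; then
    # Remove all objects and then the bucket itself.
    aws s3 rb "s3://${BUCKET}" --force --profile mlops \
        || echo "Could not delete s3://${BUCKET}"
else
    echo "Skipping deletion; run with CONFIRM_DELETE=yes to remove s3://${BUCKET}"
fi
```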
