
DVC + S3

DVC can use an S3 bucket as its remote storage.

Question! 1

Point out at least one advantage of using S3 as storage instead of a local folder.

Answer!

  • It facilitates collaboration between data scientists since information is centralized.

  • S3 scales with the amount of data, whereas a local folder is limited by the capacity of the disk.

  • S3 is also durable and secure, with data replication capabilities.

Practicing!

Create a new repository and repeat the steps from the previous handout.

Attention!

Make sure you have an AWS account and that the MLops profile is configured and set as the default.

Visit the Set profile section if you need help.
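
Before running the steps below, it can help to confirm which AWS identity the CLI will actually use. The following is a minimal sketch, assuming the AWS CLI is installed; the function name `check_aws_profile` is hypothetical and not part of the course material.

```shell
# Hypothetical helper: show which credentials the AWS CLI resolves.
check_aws_profile() {
    # Lists the active profile, region, and where each value comes from
    # (environment variable, config file, ...).
    aws configure list

    # Asks AWS which account and user the credentials belong to;
    # fails if the credentials are missing or invalid.
    aws sts get-caller-identity
}
```

Call `check_aws_profile` with no arguments. If the account shown is not the one you expect, revisit the Set profile section before continuing.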

Some important steps:

Question! 2

Initialize DVC, then download and track the data:

$ dvc init
$ dvc get-url https://mlops-material.s3.us-east-2.amazonaws.com/data_v0.csv data/data.csv
$ dvc add data/data.csv
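
To see what `dvc add` changed, you can check for the two files it is expected to create: a pointer file next to the data and a `.gitignore` entry for the data itself. The helper below is only a sketch; `check_dvc_tracked` is a hypothetical name, and it assumes the ignore entry has the form `/data.csv`, which is what recent DVC versions write.

```shell
# Hypothetical helper: report whether a file looks DVC-tracked.
# Usage: check_dvc_tracked data/data.csv
check_dvc_tracked() {
    target="$1"
    dir=$(dirname "$target")
    base=$(basename "$target")

    # `dvc add` writes a small pointer file next to the data file.
    if [ ! -f "$target.dvc" ]; then
        echo "missing pointer file: $target.dvc"
        return 1
    fi

    # `dvc add` also lists the data file in a .gitignore in the same folder,
    # so Git versions only the pointer, not the data itself.
    # Assumption: the entry is written as "/<file name>".
    if ! grep -qxF "/$base" "$dir/.gitignore" 2>/dev/null; then
        echo "not ignored by Git: $target"
        return 1
    fi

    echo "tracked by DVC: $target"
}
```

Running `check_dvc_tracked data/data.csv` from the repository root should report the file as tracked once `dvc add` has finished.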

Question! 3

Commit the DVC files to Git and push:

$ git add data/data.csv.dvc data/.gitignore
$ git commit -m "Add data to project"
$ git push

Question! 4

Create an S3 bucket whose name follows the pattern mlops-dvc-INSPERUSERNAME

Check HERE if you need help with bucket creation.
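
If you prefer the command line to the console, the bucket can also be created with the AWS CLI. The sketch below is one possible way to do it, not the course's official procedure: `make_bucket_name` and `create_bucket` are hypothetical helpers, and the lowercase conversion is added here because S3 bucket names may not contain uppercase letters.

```shell
# Hypothetical helper: derive the bucket name from a username and check it
# against the basic S3 naming rules before calling AWS.
make_bucket_name() {
    if [ -z "$1" ]; then
        echo "usage: make_bucket_name USERNAME" >&2
        return 1
    fi

    # S3 bucket names must be lowercase, so normalize the username first.
    bucket="mlops-dvc-$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')"

    # Names may be at most 63 characters long.
    if [ "${#bucket}" -gt 63 ]; then
        echo "bucket name too long: $bucket" >&2
        return 1
    fi

    # Only lowercase letters, digits, dots, and hyphens are allowed,
    # and the name must end with a letter or digit.
    case "$bucket" in
        *[!a-z0-9.-]* | *[!a-z0-9])
            echo "invalid bucket name: $bucket" >&2
            return 1
            ;;
    esac

    printf '%s\n' "$bucket"
}

# Hypothetical helper: create the bucket in the default region of the
# active AWS profile.
create_bucket() {
    bucket_name=$(make_bucket_name "$1") || return 1
    aws s3 mb "s3://$bucket_name"
}
```

For example, `create_bucket INSPERUSERNAME` runs `aws s3 mb s3://mlops-dvc-insperusername`. Bucket names are unique across all AWS accounts, so the call fails if someone else already owns that name.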

Question! 5

Configure the S3 bucket as the DVC remote and push the data:

Attention!

Replace mlops-dvc-INSPERUSERNAME with the name of your own bucket.

$ dvc remote add myremote s3://mlops-dvc-INSPERUSERNAME
$ dvc remote default myremote
$ dvc push
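
The remote settings are written to `.dvc/config`, which Git tracks, so they only reach collaborators once that file is committed. A possible follow-up is sketched below; the function name `share_remote_config` and the commit message are illustrative assumptions.

```shell
# Hypothetical helper: confirm the remote setup and share it through Git.
share_remote_config() {
    # List all configured remotes with their URLs.
    dvc remote list

    # Compare the local cache with the default remote; after a successful
    # `dvc push` nothing should be reported as missing from the remote.
    dvc status --cloud

    # `dvc remote add` and `dvc remote default` write to .dvc/config.
    # Committing that file lets collaborators use the same remote.
    git add .dvc/config
    git commit -m "Configure S3 remote for DVC"
}
```

Run `share_remote_config` from the repository root, then `git push` as in the earlier step.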

Question! 6

Check the contents of the bucket and ensure that the data was actually stored!
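
One way to check from the terminal is to list the bucket with the AWS CLI. Do not expect to find a file called data.csv: DVC stores each file under a key derived from its MD5 hash, the same value recorded in the `md5` field of data/data.csv.dvc (recent DVC versions place these keys under a `files/md5/` prefix, though the exact layout depends on the version). The helper name `list_bucket` below is hypothetical.

```shell
# Hypothetical helper: list every object in a bucket with a size summary.
# Usage: list_bucket mlops-dvc-INSPERUSERNAME
list_bucket() {
    aws s3 ls "s3://$1" --recursive --human-readable --summarize
}
```

Compare the listed key with the hash in the pointer file (`cat data/data.csv.dvc`) to confirm that the object in the bucket is your data.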

Attention!

After finishing the class, delete the bucket you created!
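
The bucket can be removed from the console or, as sketched here, with the AWS CLI. The helper name `delete_bucket` is hypothetical. The operation cannot be undone, so double-check the bucket name first.

```shell
# Hypothetical helper: delete a bucket together with its contents.
# Usage: delete_bucket mlops-dvc-INSPERUSERNAME
delete_bucket() {
    # Refuse to run without an explicit bucket name.
    if [ -z "$1" ]; then
        echo "usage: delete_bucket BUCKET_NAME" >&2
        return 1
    fi

    # --force first deletes all objects, then removes the bucket itself.
    aws s3 rb "s3://$1" --force
}
```

This sketch assumes a bucket without versioning enabled, which is the default for newly created buckets.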
