DVC + S3

DVC can store tracked data in a remote that points to an S3 bucket.

Question! 1

Name at least one advantage of using S3 for storage instead of a local folder.

Answer!

  • It facilitates collaboration between data scientists since information is centralized.

  • S3 scales with the data, whereas a local folder is limited by the disk's storage capacity.

  • S3 is also durable and secure, with data replication capabilities.

Practicing!

Create a new repository and repeat the procedure from the previous handout.

Some important steps:

Question! 2

Init DVC, download and track data:

$ dvc init
$ dvc get-url https://mlops-material.s3.us-east-2.amazonaws.com/data_v0.csv data/data.csv
$ dvc add data/data.csv

Question! 3

Git commit:

$ git add data/data.csv.dvc data/.gitignore
$ git commit -m "Add data to project"
$ git push

Question! 4

Create an S3 bucket whose name follows the pattern mlops-dvc-INSPERUSERNAME

Check HERE if you need help with bucket creation.
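
If you prefer the command line over the console, the bucket can also be created with the AWS CLI. The following is a minimal sketch, assuming the AWS CLI is installed and configured with your credentials. The `bucket_name` helper and the `us-east-2` region are illustrative assumptions, not part of the handout. S3 bucket names cannot contain uppercase letters, so the helper lowercases the username.

```shell
# Hypothetical helper: builds the bucket name from a username.
# S3 bucket names may not contain uppercase letters, so the name is lowercased.
bucket_name() {
    printf 'mlops-dvc-%s\n' "$1" | tr '[:upper:]' '[:lower:]'
}

# Placeholder: replace INSPERUSERNAME with your own username.
BUCKET="$(bucket_name "INSPERUSERNAME")"
# Assumed region; use whichever region your account is allowed to use.
REGION="us-east-2"

if command -v aws >/dev/null 2>&1; then
    # "mb" (make bucket) fails if the name is already taken by any AWS account.
    aws s3 mb "s3://${BUCKET}" --region "${REGION}" \
        || echo "Could not create ${BUCKET}; check your credentials and the name."
else
    echo "AWS CLI not found; create ${BUCKET} in the S3 console instead."
fi
```

If the bucket is created this way, use the same lowercased name in the `dvc remote add` step.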

Question! 5

Configure S3 storage:

Attention!

Replace mlops-dvc-INSPERUSERNAME in the command below with the name of your bucket.

$ dvc remote add myremote s3://mlops-dvc-INSPERUSERNAME
$ dvc remote default myremote
$ dvc push
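
Before moving on, it can help to confirm that the remote was registered and that the push reached it. The following is a minimal sketch, assuming it runs from the repository root. The `check_remote` function name is hypothetical. The `dvc remote` commands write to `.dvc/config`, so committing that file lets collaborators reuse the same remote.

```shell
# Hypothetical helper: report the state of a DVC remote (default name: myremote).
check_remote() {
    remote="${1:-myremote}"
    if ! command -v dvc >/dev/null 2>&1 || [ ! -d .dvc ]; then
        echo "Run this from the root of a DVC repository with dvc installed."
        return 1
    fi
    # List the configured remotes and their URLs.
    dvc remote list
    # Report whether the local cache and the remote are in sync.
    dvc status --cloud --remote "${remote}"
}

check_remote myremote || true

# The remote settings live in .dvc/config; commit them so others can reuse the remote.
if [ -f .dvc/config ] && command -v git >/dev/null 2>&1; then
    git add .dvc/config
    git commit -m "Configure S3 remote" || echo "Nothing new to commit."
fi
```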

Question! 6

Check the contents of the bucket and ensure that the data was actually stored!
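
When checking the bucket, note that DVC does not upload the file under its original name: each object is stored under a key derived from the file's MD5 hash. The sketch below assumes DVC 3.x, which is expected to use a `files/md5/<first two hash characters>/<remaining characters>` layout (older 2.x releases are expected to omit the `files/md5/` prefix). `dvc_object_key` is a hypothetical helper, and the AWS CLI is assumed to be configured.

```shell
# Hypothetical helper: derive the expected object key from an MD5 hash,
# assuming the DVC 3.x layout files/md5/<2 chars>/<remaining chars>.
dvc_object_key() {
    md5="$1"
    printf 'files/md5/%s/%s\n' \
        "$(printf '%s' "$md5" | cut -c1-2)" \
        "$(printf '%s' "$md5" | cut -c3-)"
}

# Placeholder: replace with the name of your bucket.
BUCKET="mlops-dvc-INSPERUSERNAME"

if [ -f data/data.csv.dvc ]; then
    # The .dvc file records the hash on a line of the form "- md5: <hash>".
    file_hash="$(awk '/md5:/ {print $NF; exit}' data/data.csv.dvc)"
    echo "Expected key: $(dvc_object_key "$file_hash")"
fi

if command -v aws >/dev/null 2>&1; then
    # List every object in the bucket with its timestamp and size.
    aws s3 ls "s3://${BUCKET}" --recursive \
        || echo "Could not list ${BUCKET}; check your credentials and the name."
fi
```

If the listing contains the expected key, the data was stored in the bucket.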

Attention!

After finishing the class, delete the bucket you created!
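
The bucket can also be deleted from the command line. The following is a minimal sketch, assuming the AWS CLI is configured. The `delete_bucket` helper and the `CONFIRM_DELETE` variable are hypothetical safeguards so that nothing is removed by accident. `aws s3 rb --force` deletes every object in the bucket before removing the bucket itself.

```shell
# Hypothetical helper: delete a bucket only when CONFIRM_DELETE=yes is set;
# otherwise print the command that would run.
delete_bucket() {
    bucket="$1"
    if [ "${CONFIRM_DELETE:-no}" != "yes" ]; then
        echo "Dry run: aws s3 rb s3://${bucket} --force"
        return 0
    fi
    # --force empties the bucket first; "rb" alone refuses non-empty buckets.
    aws s3 rb "s3://${bucket}" --force
}

# Placeholder: replace with the name of your bucket.
delete_bucket "mlops-dvc-INSPERUSERNAME"
```

Run the block with `CONFIRM_DELETE=yes` set in the environment to perform the deletion instead of the dry run.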

References