Amazon S3
Introduction
Amazon S3 (Simple Storage Service) is a cloud storage service offered by Amazon Web Services that provides object storage through a web services interface.
By using Amazon S3, users can store data with elastic scalability: almost any amount of data, accessible from anywhere.
It is designed for 99.999999999% (11 nines) durability and stores objects in a building-block storage infrastructure, optimized for:
- Availability
- Scalability
- Durability
- Performance
Reading Objects
Public
Let's read a public object from S3.
But first, install the dependencies with:
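Assuming the packages used in this class are boto3 (the AWS SDK for Python) and python-dotenv (used later to load credentials from a .env file):

pip install boto3 python-dotenv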
Tip! 1
Remember to activate the environment!
Now we can read an object. For this example, we will use a text file from a public bucket.
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Disable authentication
s3 = boto3.client(
    "s3",
    config=Config(signature_version=UNSIGNED),
)

# Bucket name
bucket = "atd-insper"

# Public file at bucket
key = "aula04/alice_wonderland.txt"

response = s3.get_object(
    Bucket=bucket,
    Key=key,
)

content = response["Body"].read().decode("utf-8")
print(f"File Content:\n{content}")
Understanding the Key Parameter
The Key is essentially the file path within the S3 bucket that uniquely identifies the object you want to access. In the context of S3, the bucket is like the container or the root directory, and the Key represents the specific path to the file within that bucket.
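For example, in the code above the key aula04/alice_wonderland.txt places the object under the aula04/ prefix. There is no real folder: the prefix is just part of the object's name. A minimal sketch of listing every object under that prefix, assuming the public bucket also allows anonymous listing:

import boto3
from botocore import UNSIGNED
from botocore.config import Config

s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

# List the objects whose keys start with the "aula04/" prefix
response = s3.list_objects_v2(Bucket="atd-insper", Prefix="aula04/")
for obj in response.get("Contents", []):
    print(obj["Key"])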
Question 1
Authentication
To gain access to private files, it will be necessary to pass our authentication information to boto3. We have already seen that it is not a good idea to leave this information directly in the code.
Tip! 2
Remember the importance of not hard-coding access credentials and passwords in the source code.
So let's create a .env file:
AWS_ACCESS_KEY_ID="XXXXXXXXXXXXXXXXXXXXX"
AWS_SECRET_ACCESS_KEY="xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
AWS_BUCKET_NAME="some-bucket-name"
Info!
Ask the professor where the authentication information is!
Example code:
import boto3
import os
from dotenv import load_dotenv

# Load credentials from the .env file
load_dotenv()

s3 = boto3.client(
    "s3",
    aws_access_key_id=os.getenv("AWS_ACCESS_KEY_ID"),
    aws_secret_access_key=os.getenv("AWS_SECRET_ACCESS_KEY"),
)

obj = s3.get_object(
    Bucket=os.getenv("AWS_BUCKET_NAME"),
    Key="welcome.txt",
)

file_content = obj["Body"].read().decode("utf-8")
print(f"File Content:\n{file_content}")
Question 2
Writing Objects
Creating folders
In AWS S3, the concept of a "folder" doesn't exist in the traditional sense you would find in a file system on your computer. However, S3 does allow you to organize your objects in a way that mimics a folder structure, and these are often referred to as "folders" for convenience. Under the hood, such a "folder" is just a zero-byte object whose key ends with a "/".
Let's create a folder inside the bucket.
Attention!
For better organization, each student should create a folder with their Insper username and make changes only there!
Example code:
import boto3
import os
from dotenv import load_dotenv
from pprint import pprint

load_dotenv()

s3 = boto3.client(
    "s3",
    aws_access_key_id=os.getenv("AWS_ACCESS_KEY_ID"),
    aws_secret_access_key=os.getenv("AWS_SECRET_ACCESS_KEY"),
)

# Create folder - CHANGE "aantonio/" TO YOUR INSPER USERNAME, keep the "/"
res = s3.put_object(
    Bucket=os.getenv("AWS_BUCKET_NAME"),
    Key="aantonio/",
)

print("Answer:")
pprint(res)
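If the call succeeded, the response printed above should contain an HTTPStatusCode of 200 inside ResponseMetadata; a minimal programmatic check could be:

# Confirm the folder was created successfully
assert res["ResponseMetadata"]["HTTPStatusCode"] == 200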
Question 3
Uploading a file
Let's upload a file to S3. But first, let's create a text file to be used in the upload.
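For example, a minimal way to create such a file from Python (the file name hello.txt matches the upload example below; its content is just an illustration):

# Create a small text file to upload
with open("hello.txt", "w") as f:
    f.write("Hello, S3!\n")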
Question 4
Example code for upload:
import boto3
import os
from dotenv import load_dotenv

load_dotenv()

s3 = boto3.client(
    "s3",
    aws_access_key_id=os.getenv("AWS_ACCESS_KEY_ID"),
    aws_secret_access_key=os.getenv("AWS_SECRET_ACCESS_KEY"),
)

s3.upload_file(
    "hello.txt",                   # Local filepath
    os.getenv("AWS_BUCKET_NAME"),  # Bucket name
    "aantonio/hello.txt",          # Key (path on bucket)
)
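To check that the upload worked, one option is to list the objects under your prefix. A minimal sketch, reusing the authenticated client from the example above (again, replace aantonio/ with your Insper username):

# List the objects under your folder to confirm the upload
response = s3.list_objects_v2(
    Bucket=os.getenv("AWS_BUCKET_NAME"),
    Prefix="aantonio/",
)
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])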
Question 5
Question 6
Refactoring
In the previous class (SQL) we made a version of the project that uses a PostgreSQL server as the data source.
Question 7
But the model pickles were always stored only locally, in the models folder.
Question 8
Answer
Yes. We can make good use of S3 for storing machine learning model pickles.
We can configure folders and object names for clean organization of model versions and metadata.
When a heavily requested model is in production, it will likely use computational resources from several machines, and model training will not necessarily take place in that same environment. With S3, we can manage this in a centralized way, using the bucket as a model store.
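As a sketch of this idea, assuming the same .env credentials as before (the key layout models/<model-name>/<version>/model.pickle and the placeholder model are illustrative, not a required structure):

import io
import os
import pickle

import boto3
from dotenv import load_dotenv

load_dotenv()

s3 = boto3.client(
    "s3",
    aws_access_key_id=os.getenv("AWS_ACCESS_KEY_ID"),
    aws_secret_access_key=os.getenv("AWS_SECRET_ACCESS_KEY"),
)

bucket = os.getenv("AWS_BUCKET_NAME")

# Hypothetical versioned key for one model
key = "models/sales-forecast/v1/model.pickle"

# Serialize the trained model in memory and upload it
model = {"coef": [1.2, 3.4]}  # placeholder for a real trained model
s3.upload_fileobj(io.BytesIO(pickle.dumps(model)), bucket, key)

# Later, possibly on another machine: download and deserialize
obj = s3.get_object(Bucket=bucket, Key=key)
model = pickle.loads(obj["Body"].read())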
Question 9
Question 10