Amazon S3

Introduction

Amazon S3 (Simple Storage Service) is a cloud storage service offered by Amazon Web Services that provides object storage through a web services interface.

By using Amazon S3, users can store data with elastic scalability (almost any amount of data, accessible from anywhere).

It is designed for 99.999999999% (11 nines) of durability and stores objects in a building-block storage infrastructure, optimized for:

  • Availability
  • Scalability
  • Durability
  • Performance

Reading Objects

Public

Let's read a public object from S3.

But first, install the dependencies with:

Tip! 1

Remember to activate the environment!

$ pip install boto3


Now we can read an object. For this example, we will use a text file stored in a public bucket.

import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Anonymous client (unsigned requests): no credentials needed for public objects
s3 = boto3.client(
    "s3",
    config=Config(signature_version=UNSIGNED),
)

# Bucket name
bucket = "atd-insper"

# Key of a public file in the bucket
key = "aula04/alice_wonderland.txt"

response = s3.get_object(
    Bucket=bucket,
    Key=key,
)

content = response["Body"].read().decode("utf-8")

print(f"File Content:\n{content}")

Understanding the Key Parameter

The Key is essentially the file path within the S3 bucket that uniquely identifies the object you want to access.

In the context of S3, the bucket is like the container or the root directory, and the Key represents the specific path to the file within that bucket.

Question 1

As a result, you should see a short excerpt from the book Alice in Wonderland!
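
To see that the Key really is just a path inside the bucket, you can also list the objects that share a prefix. Below is a minimal sketch using the same anonymous client; note that it only works if the bucket policy allows public listing, otherwise the call fails with an AccessDenied error.

import boto3
from botocore import UNSIGNED
from botocore.config import Config

s3 = boto3.client(
    "s3",
    config=Config(signature_version=UNSIGNED),
)

# List the keys that start with the "aula04/" prefix
response = s3.list_objects_v2(
    Bucket="atd-insper",
    Prefix="aula04/",
)

for obj in response.get("Contents", []):
    print(obj["Key"])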

Authentication

To gain access to private files, it will be necessary to pass our authentication information to boto3. We have already seen that it is not a good idea to leave this information directly in the code.

Tip! 2

Remember the importance of not hard-coding access keys and passwords in the source code.

So let's create a .env file:

AWS_ACCESS_KEY_ID="XXXXXXXXXXXXXXXXXXXXX"
AWS_SECRET_ACCESS_KEY="xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
AWS_BUCKET_NAME="some-bucket-name"
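
The examples below load these variables with the python-dotenv package, so install it as well. Also make sure the .env file itself is never committed to version control (add it to your .gitignore).

$ pip install python-dotenv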

Info!

Ask the professor where the authentication information is!

Example code:

import boto3
import os
from dotenv import load_dotenv

load_dotenv()

s3 = boto3.client(
    "s3",
    aws_access_key_id=os.getenv("AWS_ACCESS_KEY_ID"),
    aws_secret_access_key=os.getenv("AWS_SECRET_ACCESS_KEY"),
)

obj = s3.get_object(
    Bucket=os.getenv("AWS_BUCKET_NAME"),
    Key="welcome.txt",
)

file_content = obj["Body"].read().decode("utf-8")

print(f"File Content:\n{file_content}")

Question 2

As a result, you should see a short text and no Exceptions!!!
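
If you get a botocore ClientError instead (for example NoSuchKey or AccessDenied), double-check the Key and the credentials in your .env. The sketch below shows one possible way to catch and inspect such errors; it assumes the same bucket and welcome.txt key from the example above.

import boto3
import os
from botocore.exceptions import ClientError
from dotenv import load_dotenv

load_dotenv()

s3 = boto3.client(
    "s3",
    aws_access_key_id=os.getenv("AWS_ACCESS_KEY_ID"),
    aws_secret_access_key=os.getenv("AWS_SECRET_ACCESS_KEY"),
)

try:
    obj = s3.get_object(
        Bucket=os.getenv("AWS_BUCKET_NAME"),
        Key="welcome.txt",
    )
    print(obj["Body"].read().decode("utf-8"))
except ClientError as err:
    # Error code such as "NoSuchKey" (wrong Key) or "AccessDenied" (bad credentials)
    print("Request failed:", err.response["Error"]["Code"])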

Writing Objects

Creating folders

In AWS S3, the concept of a "folder" doesn't exist in the traditional sense as you would find in a file system like on your computer. However, S3 does allow you to organize your objects in a way that mimics a folder structure, and these are often referred to as "folders" for convenience.

Let's create a folder inside the bucket.

Attention!

For better organization, each student should create a folder with their Insper username and make changes only there!

Example code:

import boto3
import os
from dotenv import load_dotenv
from pprint import pprint

load_dotenv()

s3 = boto3.client(
    "s3",
    aws_access_key_id=os.getenv("AWS_ACCESS_KEY_ID"),
    aws_secret_access_key=os.getenv("AWS_SECRET_ACCESS_KEY"),
)

# Create folder - CHANGE "aantonio/" TO YOUR INSPER USERNAME, keep the "/"
res = s3.put_object(
    Bucket=os.getenv("AWS_BUCKET_NAME"),
    Key="aantonio/",
)

print("Answer:")
pprint(res)

Question 3

Using the example code above as reference, create a folder whose name should be your Insper username!
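
To confirm that the folder object was created, you can ask S3 for its metadata. This is just a sketch: it reuses the credentials from the .env and the hypothetical folder name aantonio/, which you should replace with your own.

import boto3
import os
from dotenv import load_dotenv

load_dotenv()

s3 = boto3.client(
    "s3",
    aws_access_key_id=os.getenv("AWS_ACCESS_KEY_ID"),
    aws_secret_access_key=os.getenv("AWS_SECRET_ACCESS_KEY"),
)

# head_object raises a ClientError (404) if the key does not exist
res = s3.head_object(
    Bucket=os.getenv("AWS_BUCKET_NAME"),
    Key="aantonio/",  # CHANGE to your Insper username, keep the "/"
)

print("Folder object exists! Size:", res["ContentLength"])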

Uploading a file

Let's upload a file to S3. But first, let's create a text file to be used in the upload.

Question 4

Create a file hello.txt on your local computer.

Write any text in this file. For example:

Now I own the S3!

Example code for upload:

import boto3
import os
from dotenv import load_dotenv

load_dotenv()

s3 = boto3.client(
    "s3",
    aws_access_key_id=os.getenv("AWS_ACCESS_KEY_ID"),
    aws_secret_access_key=os.getenv("AWS_SECRET_ACCESS_KEY"),
)

s3.upload_file(
    "hello.txt",  # Local Filepath
    os.getenv("AWS_BUCKET_NAME"),  # Bucket name
    "aantonio/hello.txt",  # Key (path on bucket)
)

Question 5

Upload any txt file to your folder inside the bucket.

Question 6

Read the file to check that it is working!
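
You can read it back with get_object, as before, or download the object to a local file. A minimal sketch of the second option, assuming the hypothetical key aantonio/hello.txt used above:

import boto3
import os
from dotenv import load_dotenv

load_dotenv()

s3 = boto3.client(
    "s3",
    aws_access_key_id=os.getenv("AWS_ACCESS_KEY_ID"),
    aws_secret_access_key=os.getenv("AWS_SECRET_ACCESS_KEY"),
)

# Download the object to a local file
s3.download_file(
    os.getenv("AWS_BUCKET_NAME"),  # Bucket name
    "aantonio/hello.txt",  # Key (path on bucket)
    "hello_from_s3.txt",  # Local file path
)

with open("hello_from_s3.txt", encoding="utf-8") as f:
    print(f.read())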

Refactoring

In the previous class (SQL) we made a version of the project that uses a PostgreSQL server as the data source.

Question 7

Do you consider that S3 could serve as a data source for model training/prediction? Explain.

Answer

Yes. Raw or pre-processed data, in Parquet or CSV formats for example, could be stored in S3 and retrieved by the model, whether for training or for batch prediction.

This choice will also depend on the architecture used by the company and model requirements identified in the planning phase.
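
As an illustration only (the key data/train.csv below is hypothetical), a training script could read a CSV straight from S3 into pandas without touching the local disk:

import boto3
import io
import os
import pandas as pd
from dotenv import load_dotenv

load_dotenv()

s3 = boto3.client(
    "s3",
    aws_access_key_id=os.getenv("AWS_ACCESS_KEY_ID"),
    aws_secret_access_key=os.getenv("AWS_SECRET_ACCESS_KEY"),
)

# Read a (hypothetical) CSV object into a DataFrame
obj = s3.get_object(
    Bucket=os.getenv("AWS_BUCKET_NAME"),
    Key="data/train.csv",
)
df = pd.read_csv(io.BytesIO(obj["Body"].read()))

print(df.head())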

So far, however, the model pickles have only been stored locally in the models folder.

Question 8

Do you consider that S3 could serve as storage for model pickles? Explain.

Answer

Yes. We can make good use of S3 for storing machine learning model pickles.

We can configure folders and object names for clean organization of model versions and metadata.

When a heavily requested model is in production, it will likely run on several machines, and training will not necessarily take place in that same environment. With S3 we can manage the models in a centralized way, using the bucket as a model store.

Question 9

Change one of your previous projects so that training exports the model's pickle to the folder you created in S3 under your Insper username.

You can use the deployment project we did with FastAPI or any of the batch versions.
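
One possible approach (the file name and folder below are just an example) is to keep the local export and add an upload step at the end of the training script:

import boto3
import os
from dotenv import load_dotenv

load_dotenv()

s3 = boto3.client(
    "s3",
    aws_access_key_id=os.getenv("AWS_ACCESS_KEY_ID"),
    aws_secret_access_key=os.getenv("AWS_SECRET_ACCESS_KEY"),
)

# Upload the pickle exported by training to your folder in the bucket
s3.upload_file(
    "models/model.pkl",  # Local pickle produced by training
    os.getenv("AWS_BUCKET_NAME"),  # Bucket name
    "aantonio/models/model.pkl",  # Key - CHANGE "aantonio" to your username
)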

Question 10

In the project chosen in the previous step, change it so that the model is fetched from S3.
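
A minimal sketch of the fetching side, assuming the same hypothetical key used in the upload step; instead of opening the local pickle, the application reads it from S3 and deserializes it in memory:

import boto3
import os
import pickle
from dotenv import load_dotenv

load_dotenv()

s3 = boto3.client(
    "s3",
    aws_access_key_id=os.getenv("AWS_ACCESS_KEY_ID"),
    aws_secret_access_key=os.getenv("AWS_SECRET_ACCESS_KEY"),
)

# Fetch the pickle from S3 and deserialize it
obj = s3.get_object(
    Bucket=os.getenv("AWS_BUCKET_NAME"),
    Key="aantonio/models/model.pkl",  # CHANGE "aantonio" to your username
)
model = pickle.loads(obj["Body"].read())

# "model" is now ready for use, e.g. model.predict(X)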