Amazon S3
Introduction
Amazon S3 (Simple Storage Service) is a cloud storage service offered by Amazon Web Services that provides object storage through a web services interface.
By using Amazon S3, users can store data with elastic scalability: almost any amount of data, accessible from anywhere.
It is designed for 99.999999999% (11 nines) durability and stores objects in a building-block storage infrastructure, optimized for:
- Availability
- Scalability
- Durability
- Performance
Reading Objects
Public
Let's read a public object from S3.
But first, install the dependencies with:
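Assuming the packages used in this class are boto3 (the AWS SDK for Python) and python-dotenv (used later to load credentials from a .env file):

pip install boto3 python-dotenv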
Tip! 1
Remember to activate the environment!
Now we can read an object. For this example, we will use a text file from a public bucket.
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Disable authentication
s3 = boto3.client(
    "s3",
    config=Config(signature_version=UNSIGNED),
)

# Bucket name
bucket = "atd-insper"

# Public file at bucket
key = "aula04/alice_wonderland.txt"

response = s3.get_object(
    Bucket=bucket,
    Key=key,
)

content = response["Body"].read().decode("utf-8")
print(f"File Content:\n{content}")
Understanding the Key Parameter
The Key is essentially the file path within the S3 bucket that uniquely identifies the object you want to access. In the context of S3, the bucket is like the container or the root directory, and the Key represents the specific path to the file within that bucket.
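For example, in the code above the key aula04/alice_wonderland.txt places the object under the aula04/ prefix. There is no real folder: the prefix is just part of the object's name. A minimal sketch of listing every object under that prefix, assuming the public bucket also allows anonymous listing:

import boto3
from botocore import UNSIGNED
from botocore.config import Config

s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

# List the objects whose keys start with the "aula04/" prefix
response = s3.list_objects_v2(Bucket="atd-insper", Prefix="aula04/")
for obj in response.get("Contents", []):
    print(obj["Key"])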
Question 1
Authentication
To gain access to private files, it will be necessary to pass our authentication information to boto3. We have already seen that it is not a good idea to leave this information directly in the code.
Tip! 2
Remember the importance of not hard-coding access credentials and passwords in the source code.
So let's create a .env file:
AWS_ACCESS_KEY_ID="XXXXXXXXXXXXXXXXXXXXX"
AWS_SECRET_ACCESS_KEY="xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
AWS_BUCKET_NAME="some-bucket-name"
Info!
Ask the professor where the authentication information is!
Example code:
import boto3
import os
from dotenv import load_dotenv

# Load credentials from the .env file
load_dotenv()

s3 = boto3.client(
    "s3",
    aws_access_key_id=os.getenv("AWS_ACCESS_KEY_ID"),
    aws_secret_access_key=os.getenv("AWS_SECRET_ACCESS_KEY"),
)

obj = s3.get_object(
    Bucket=os.getenv("AWS_BUCKET_NAME"),
    Key="welcome.txt",
)

file_content = obj["Body"].read().decode("utf-8")
print(f"File Content:\n{file_content}")
Question 2
Writing Objects
Creating folders
In AWS S3, the concept of a "folder" doesn't exist in the traditional sense you would find in a file system on your computer. However, S3 does allow you to organize your objects in a way that mimics a folder structure, and these are often referred to as "folders" for convenience. Under the hood, such a "folder" is just a zero-byte object whose key ends with a "/".
Let's create a folder inside the bucket.
Attention!
For better organization, each student should create a folder with their Insper username and make changes only there!
Example code:
import boto3
import os
from dotenv import load_dotenv
from pprint import pprint

load_dotenv()

s3 = boto3.client(
    "s3",
    aws_access_key_id=os.getenv("AWS_ACCESS_KEY_ID"),
    aws_secret_access_key=os.getenv("AWS_SECRET_ACCESS_KEY"),
)

# Create folder - CHANGE "aantonio/" TO YOUR INSPER USERNAME, keep the "/"
res = s3.put_object(
    Bucket=os.getenv("AWS_BUCKET_NAME"),
    Key="aantonio/",
)

print("Answer:")
pprint(res)
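If the call succeeded, the response printed above should contain an HTTPStatusCode of 200 inside ResponseMetadata; a minimal programmatic check could be:

# Confirm the folder was created successfully
assert res["ResponseMetadata"]["HTTPStatusCode"] == 200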
Question 3
Uploading a file
Let's upload a file to S3. But first, let's create a text file to be used in the upload.
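For example, a minimal way to create such a file from Python (the file name hello.txt matches the upload example below; its content is just an illustration):

# Create a small text file to upload
with open("hello.txt", "w") as f:
    f.write("Hello, S3!\n")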
Question 4
Example code for upload:
import boto3
import os
from dotenv import load_dotenv

load_dotenv()

s3 = boto3.client(
    "s3",
    aws_access_key_id=os.getenv("AWS_ACCESS_KEY_ID"),
    aws_secret_access_key=os.getenv("AWS_SECRET_ACCESS_KEY"),
)

s3.upload_file(
    "hello.txt",                   # Local filepath
    os.getenv("AWS_BUCKET_NAME"),  # Bucket name
    "aantonio/hello.txt",          # Key (path on bucket)
)
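To check that the upload worked, one option is to list the objects under your prefix. A minimal sketch, reusing the authenticated client from the example above (again, replace aantonio/ with your Insper username):

# List the objects under your folder to confirm the upload
response = s3.list_objects_v2(
    Bucket=os.getenv("AWS_BUCKET_NAME"),
    Prefix="aantonio/",
)
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])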
Question 5
Question 6
Refactoring
In the previous class (SQL) we made a version of the project that uses a PostgreSQL server as the data source.
Question 7
But the model pickles were always stored only locally, in the models folder.
Question 8
Answer
Yes. We can make good use of S3 for storing machine learning model pickles.
We can configure folders and object names for clean organization of model versions and metadata.
When a heavily requested model is in production, it will likely use computational resources from several machines, and model training will not necessarily take place in that same environment. With S3, we can manage this in a centralized way, using the bucket as a model store.
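As a sketch of this idea, assuming the same .env credentials as before (the key layout models/<model-name>/<version>/model.pickle and the placeholder model are illustrative, not a required structure):

import io
import os
import pickle

import boto3
from dotenv import load_dotenv

load_dotenv()

s3 = boto3.client(
    "s3",
    aws_access_key_id=os.getenv("AWS_ACCESS_KEY_ID"),
    aws_secret_access_key=os.getenv("AWS_SECRET_ACCESS_KEY"),
)

bucket = os.getenv("AWS_BUCKET_NAME")

# Hypothetical versioned key for one model
key = "models/sales-forecast/v1/model.pickle"

# Serialize the trained model in memory and upload it
model = {"coef": [1.2, 3.4]}  # placeholder for a real trained model
s3.upload_fileobj(io.BytesIO(pickle.dumps(model)), bucket, key)

# Later, possibly on another machine: download and deserialize
obj = s3.get_object(Bucket=bucket, Key=key)
model = pickle.loads(obj["Body"].read())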
Question 9
Question 10