03 - Batch prediction: Part 1

Practicing

Create folder

Use cookiecutter and your template, created in class 01, to create a new folder/project for today's class. You can use any name for the folder/project (I chose p03-batch).

Attention

If you didn't do this part of the activity (APS 01 part 2) and do not have a template, create the folders manually.

$ cookiecutter https://github.com/macielcalebe/template-ds-maciel.git --checkout main
You've downloaded /home/calebe/.cookiecutters/template-ds-maciel before. Is it okay to delete and re-download it? [y/n] (y): y
[1/3] directory_name (project-name): p03-batch
[2/3] author_name (Your Name): Maciel
[3/3] compatible_python_versions (^3.8):


Let's check if the folders were created correctly.

$ cd p03-batch/
$ ls
data  models  notebooks  README.md  src


Important

This is not a repository (yet). We just created the folder structure of an ML project!

Introduction

In this task, we are going to create a model that forecasts sales for six stores of a company. The model will make predictions in batch; for now, it will be called manually from the terminal.

Task 1: Generating training data

Below is a script that generates the data.

Copy this Python script to src/get_data.py:

import sys
import os
import itertools
import random

import numpy as np
import pandas as pd


class Config:
    stores = {
        5000: {
            "avg_n": 100,
            "avg_price": 350.0,
            "std": 10.0,
            "boost_weekday": [6, 7],
            "boost_months": [5, 12],
        },
        5001: {
            "avg_n": 10,
            "avg_price": 500.0,
            "std": 20.0,
            "boost_weekday": [7],
            "boost_months": [5, 12],
        },
        5002: {
            "avg_n": 25,
            "avg_price": 400.0,
            "std": 10.0,
            "boost_weekday": [7],
            "boost_months": [4, 10, 12],
        },
        5003: {
            "avg_n": 200,
            "avg_price": 220.0,
            "std": 12.0,
            "boost_weekday": [1, 3, 7],
            "boost_months": [],
        },
        5004: {
            "avg_n": 140,
            "avg_price": 415.0,
            "std": 17.0,
            "boost_weekday": [4, 6, 7],
            "boost_months": [4, 10, 12],
        },
        5005: {
            "avg_n": 50,
            "avg_price": 890.0,
            "std": 15.0,
            "boost_weekday": [6, 7],
            "boost_months": [5, 12],
        },
    }
    product_ids = np.random.randint(1000, 3000, size=30)


def generate_day_sales(store_id, date):
    # simulate one day of individual sales for a store
    config = Config.stores[store_id]
    month = date.month
    n_sales = np.random.poisson(lam=config["avg_n"])

    # boost_weekday uses ISO numbering (Monday=1 ... Sunday=7),
    # so it must be compared against isoweekday(), not weekday() (which is 0-6)
    if date.isoweekday() in config["boost_weekday"]:
        n_sales = int(n_sales * random.uniform(1.6, 1.7))

    if month in config["boost_months"]:
        n_sales = int(n_sales * random.uniform(1.45, 1.50))

    stores = np.full(n_sales, store_id)
    products = np.random.choice(Config.product_ids, size=n_sales)
    prices = np.random.normal(
        loc=config["avg_price"], scale=config["std"], size=n_sales
    )
    dates = np.full(n_sales, date.strftime("%Y-%m-%d"))
    client_ids = np.random.randint(100000, 400000, size=n_sales)

    return pd.DataFrame(
        {
            "store_id": stores,
            "date": dates,
            "client_id": client_ids,
            "product_id": products,
            "price": prices,
        }
    )


def generate_predict_register(store_id, date):
    return pd.DataFrame(
        {
            "store_id": [store_id],
            "year": [date.year],
            "month": [date.month],
            "day": [date.day],
            "weekday": [date.weekday()],
        }
    )


def generate_data(year_from, month_from, day_from, year_to, month_to, day_to, type_):
    dates = pd.date_range(
        start=f"{year_from}-{month_from:02d}-{day_from:02d}",
        end=f"{year_to}-{month_to:02d}-{day_to:02d}",
    )
    store_ids = list(Config.stores.keys())
    combinations = itertools.product(store_ids, dates)

    dfs = []
    for store_id, date in combinations:
        if type_ == "train":
            dfs.append(generate_day_sales(store_id, date))
        else:
            dfs.append(generate_predict_register(store_id, date))

    return pd.concat(dfs, ignore_index=True)


if __name__ == "__main__":
    print("Simulate data ingestion!")

    out_type = sys.argv[-1]

    if len(sys.argv) != 8 or out_type not in ["train", "predict"]:
        print("USAGE: python get_data.py <year_from> <month_from> <day_from> <year_to> <month_to> <day_to> <train/predict>")
    else:
        date_args = [int(x) for x in sys.argv[1:-1]]
        df = generate_data(*date_args, out_type)

        # name the output file after the end date, e.g. train-2023-08-01.csv
        st_date = "-".join(sys.argv[4:-1])
        extension = "csv" if out_type == "train" else "parquet"
        file_name = f"{out_type}-{st_date}.{extension}"

        file_path = os.path.join("../data/", file_name)
        print(f"Saving to {file_path} file...")

        if out_type == "train":
            df.to_csv(file_path, index=False)
        else:
            df.to_parquet(file_path, index=False)

Let's use this script to simulate ingesting the data that the model will need whenever it must:

  • be trained
  • make predictions

Question 1

Copy the script source code, place it in src/get_data.py, and run it from the src folder:

$ python3 get_data.py help
Simulate data ingestion!
USAGE: python get_data.py <year_from> <month_from> <day_from> <year_to> <month_to> <day_to> <train/predict>


Our training data will cover the period from the beginning of 2022 until the first day of August 2023.

$ python3 get_data.py 2022 01 01 2023 08 01 train
Simulate data ingestion!
Saving to ../data/train-2023-08-01.csv file...


This will create a train-2023-08-01.csv file in the data folder containing the sales data for each of the company's stores.

Each line in this file represents a sale made to a customer.

Task 2: Processing the training file

The model will predict the total amount sold per store per day.

So we need to process the input data to change its granularity. The expected result is a DataFrame where each line represents the total sales of a store in one day:

      store_id  total_sales  year  month  day  weekday
0         5000      62895.6  2023      1    1        6
1         5000      42351.1  2023      1    2        0
2         5000      37377.4  2023      1    3        1
3         5000      31385.5  2023      1    4        2
...
1636      5005      46246.3  2023      9   29        4
1637      5005      43698.2  2023      9   30        5

Notice the weekday feature. It represents the day of the week, going from 0 (Monday) to 6 (Sunday).

Question 2

Which of the variables is our target variable?

Answer

total_sales, that's what we are predicting!

Question 3

Construct a Python script that performs the required granularity change and adds the weekday feature.

Save the result in the ../data folder (relative to src) using the Parquet file format.
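
One possible sketch, assuming the script runs from src and receives the raw CSV path as a command-line argument (the file name preprocess.py is my choice, not mandated):

import sys

import pandas as pd

if __name__ == "__main__":
    # e.g. python3 preprocess.py ../data/train-2023-08-01.csv
    input_path = sys.argv[1]
    df = pd.read_csv(input_path, parse_dates=["date"])

    # change granularity: one row per store per day, summing individual sale prices
    daily = (
        df.groupby(["store_id", "date"], as_index=False)["price"]
        .sum()
        .rename(columns={"price": "total_sales"})
    )

    # derive calendar features, including weekday (0 = Monday ... 6 = Sunday)
    daily["year"] = daily["date"].dt.year
    daily["month"] = daily["date"].dt.month
    daily["day"] = daily["date"].dt.day
    daily["weekday"] = daily["date"].dt.weekday

    output_path = input_path.replace(".csv", ".parquet")
    print(f"Saving to {output_path} file...")
    daily.drop(columns=["date"]).to_parquet(output_path, index=False)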

Task 3: Training the model

Train a model using RandomForestRegressor or any other regressor of your preference.

from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators=100, random_state=195)
model.fit(X_train, y_train)

Question 4

Construct a Python script src/train.py that receives the path of the preprocessed training file as an argument:

$ python3 train.py ../data/train-2023-08-01.parquet
Training model!
Saving to ../models/model-2023-08-01.pickle file...
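
A minimal sketch of what src/train.py might look like, assuming the features produced in Task 2 (deriving the model file name from the input file name is one choice among many):

import os
import pickle
import sys

import pandas as pd
from sklearn.ensemble import RandomForestRegressor

if __name__ == "__main__":
    print("Training model!")

    # e.g. python3 train.py ../data/train-2023-08-01.parquet
    train_path = sys.argv[1]
    df = pd.read_parquet(train_path)

    X_train = df[["store_id", "year", "month", "day", "weekday"]]
    y_train = df["total_sales"]

    model = RandomForestRegressor(n_estimators=100, random_state=195)
    model.fit(X_train, y_train)

    # reuse the date embedded in the input file name for the model file name
    date_part = os.path.basename(train_path).replace("train-", "").replace(".parquet", "")
    model_path = f"../models/model-{date_part}.pickle"
    print(f"Saving to {model_path} file...")

    with open(model_path, "wb") as f:
        pickle.dump(model, f)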

Task 4: Simulating prediction data

Now that the model is trained, we can use it to make predictions on future dates, giving us an estimate of the revenue each store will generate.

Let's use our script to simulate ingesting the data for prediction. You can imagine that some system task would generate a file containing the batch of data the model must use to make predictions. The script will simulate this task.

Then, when it is time for the model to make predictions, it will read the rows from this file and generate total-sales predictions.

Question 5

Call the Python script src/get_data.py to generate the prediction data:

$ python3 get_data.py 2023 08 02 2023 08 03 predict
Simulate data ingestion!
Saving to ../data/predict-2023-08-03.parquet file...


Since the prediction is performed per store and per day, each line of this file contains one store x day combination:

    store_id  year  month  day  weekday
0       5000  2023      8    2        2
1       5000  2023      8    3        3
2       5001  2023      8    2        2
3       5001  2023      8    3        3
4       5002  2023      8    2        2
5       5002  2023      8    3        3
6       5003  2023      8    2        2
7       5003  2023      8    3        3
8       5004  2023      8    2        2
9       5004  2023      8    3        3
10      5005  2023      8    2        2
11      5005  2023      8    3        3

Task 5: Making predictions

Question 6

Construct a Python script src/predict.py which takes two arguments (the model path and the file with the data to predict), and then:

  • Reads the model from ../models
  • Makes a prediction for each row of the argument file
  • Saves the predictions to a new file

$ python3 predict.py ../models/model-2023-08-01.pickle ../data/predict-2023-08-03.parquet
Making predictions!
Saving to ../data/predict-done-2023-08-03.parquet file...
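
Here is one possible sketch of src/predict.py, under the same assumptions as the scripts above (in particular, deriving the output name by turning predict- into predict-done- is my choice):

import pickle
import sys

import pandas as pd

if __name__ == "__main__":
    print("Making predictions!")

    # e.g. python3 predict.py ../models/model-2023-08-01.pickle ../data/predict-2023-08-03.parquet
    model_path, data_path = sys.argv[1], sys.argv[2]

    with open(model_path, "rb") as f:
        model = pickle.load(f)

    df = pd.read_parquet(data_path)

    # one total-sales prediction per store x day row in the batch
    df["total_sales_pred"] = model.predict(
        df[["store_id", "year", "month", "day", "weekday"]]
    )

    output_path = data_path.replace("predict-", "predict-done-")
    print(f"Saving to {output_path} file...")
    df.to_parquet(output_path, index=False)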


Done! Whenever there is a new file to be predicted, just call the predict.py script, specifying which model to use and the path of the data file. The model will then read this batch of information and perform the predictions.

Extra questions!

Categorical variables

Question 7

Are there categorical variables in the training data? If yes, which ones?

Should we use OneHotEncoder? Explain.

Question 8

Create scenarios and Python code to evaluate the previous question.
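
One way to set up such an experiment is to train the same regressor twice, once with store_id fed in as a plain number and once with it one-hot encoded, and compare the errors. A sketch (the file name and the MAE metric are my choices):

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

df = pd.read_parquet("../data/train-2023-08-01.parquet")
X = df[["store_id", "year", "month", "day", "weekday"]]
y = df["total_sales"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=195)

# scenario 1: store_id treated as an ordinary numeric feature
raw = RandomForestRegressor(n_estimators=100, random_state=195)
raw.fit(X_train, y_train)

# scenario 2: store_id one-hot encoded, remaining columns passed through
ohe = make_pipeline(
    ColumnTransformer(
        [("store", OneHotEncoder(handle_unknown="ignore"), ["store_id"])],
        remainder="passthrough",
    ),
    RandomForestRegressor(n_estimators=100, random_state=195),
)
ohe.fit(X_train, y_train)

for name, m in [("raw store_id", raw), ("one-hot store_id", ohe)]:
    print(name, mean_absolute_error(y_test, m.predict(X_test)))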

OOT validation

How do we check whether a model is good? Although we haven't yet focused much on the details of model construction, this is an important topic worth discussing!

Info

We say that a model is good if it is performing well in unseen data.

It is common to use train_test_split to generate X_test and y_test. This is Out-of-sample (OOS) validation, where the data are split randomly.

Out-of-time (OOT) validation refers to evaluating the performance of a trained model on data that falls outside the time period used for training. This concept is particularly relevant when the data is time-dependent or exhibits temporal patterns (like our sales data).

OOT Example!

If your training data run from January to July, train on January to May and hold out June and July to check model performance!
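
Adapting this example to our date range (2022-01-01 to 2023-08-01), a time-based split might look like this sketch, where the last two full months are held out:

import pandas as pd

df = pd.read_parquet("../data/train-2023-08-01.parquet")

# everything up to May 2023 is used for training...
train = df[(df["year"] < 2023) | ((df["year"] == 2023) & (df["month"] <= 5))]
# ...and June/July 2023 are held out to evaluate the model out of time
oot = df[(df["year"] == 2023) & (df["month"].isin([6, 7]))]

features = ["store_id", "year", "month", "day", "weekday"]
X_train, y_train = train[features], train["total_sales"]
X_oot, y_oot = oot[features], oot["total_sales"]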

Tip!

Once you decide the model is good enough, you can retrain it on the whole dataset (January to July) and deploy this new version of the model!

Question 9

Explain when it is a good idea to use OOT validation and what the consequences would be if you don't!