
Practicing with DVC Pipelines

Let's put DVC pipelines into practice.

Create repository

Question! 1

Create a private repository to be used in the experiment and clone it on your machine.

If you create a public repository, DVC will also work! Using a private repository is just a recommendation.
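
For reference, cloning could look like this (the repository name dvc-pipelines and the HTTPS URL are placeholders for whatever you created):

$ git clone https://github.com/<your-user>/dvc-pipelines.git
$ cd dvc-pipelines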

Important!

Access the repository folder and work from there!

Use dvc

Let's initialize DVC in our repository.

Question! 2

Make sure you are at the root of the repository and run:

$ dvc init
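
Running dvc init creates the .dvc/ directory and a .dvcignore file at the root of the repository. A typical follow-up is to commit them right away:

$ git add .dvc .dvcignore
$ git commit -m "Initialize DVC"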

Basic structure

Question! 3

Create the basic folder structure for the class.

.
├── data
│   └── bank.csv
├── models
├── results
└── src
    ├── preproc.py
    └── train.py
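
One way to create the folders from the repository root (the Python files and the CSV are added in the following steps):

$ mkdir -p data models results src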

The CSV is the same one used in the first class and can also be found HERE!

We will discuss the remaining files next!

preproc.py

This file will be responsible for performing small transformations on the data to make it suitable for training. As output, it saves a Parquet file in the data folder.

Its content is:

import pandas as pd


def preprocess():
    df = pd.read_csv("data/bank.csv")

    # Convert the column to category and map the values
    dep_mapping = {"yes": 1, "no": 0}
    df["deposit"] = df["deposit"].astype("category").map(dep_mapping)

    df = df.drop(
        labels=[
            "default",
            "contact",
            "day",
            "month",
            "pdays",
            "previous",
            "loan",
            "poutcome",
        ],
        axis=1,
    )

    return df


def export_data(df):
    df.to_parquet("data/bank_preproc.parquet")


def main():
    df = preprocess()
    export_data(df)


if __name__ == "__main__":
    main()

Question! 4

Create the src/preproc.py file.

Ensure you understand its content. Call the professor if necessary.

train.py

This file takes the output generated by preproc.py and uses it to fit two artifacts: a one-hot encoder and a RandomForestClassifier model.

As output, this file generates a CSV file with the model's performance metrics on the test data and an image of the confusion matrix. Both results are saved in the results folder.

Additionally, the file saves pickles of the encoder and the model in the models folder.

Its content is:

# Data
import pandas as pd

# Export
import pickle

# Plot
import matplotlib.pyplot as plt
import seaborn as sns

# Modeling
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    confusion_matrix,
    accuracy_score,
    precision_score,
    recall_score,
)
from sklearn.model_selection import train_test_split
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder


def load_data():
    df = pd.read_parquet("data/bank_preproc.parquet")
    return df


def split(df):
    X = df.drop("deposit", axis=1)
    y = df["deposit"]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=1912
    )

    return X_train, X_test, y_train, y_test


def train_ohe(X_train):
    cat_cols = ["job", "marital", "education", "housing"]
    one_hot_enc = make_column_transformer(
        (OneHotEncoder(handle_unknown="ignore", drop="first"), cat_cols),
        remainder="passthrough",
    )

    one_hot_enc.fit(X_train)

    return one_hot_enc


def train(X_train, y_train):
    model = RandomForestClassifier(n_estimators=50, max_depth=5)
    model.fit(X_train, y_train)
    return model


def export_model(model, file_path):
    with open(file_path, "wb") as f:
        pickle.dump(model, f)


def export_results(model, X_test, y_test, y_pred):
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, average="weighted")
    recall = recall_score(y_test, y_pred, average="weighted")

    # Create a DataFrame with the evaluation metrics
    results_df = pd.DataFrame(
        {"Accuracy": [accuracy], "Precision": [precision], "Recall": [recall]}
    )

    results_df.to_csv("results/model_test_metrics.csv", index=False)


def export_confusion_matrix(model, y_test, y_pred):
    cm = confusion_matrix(y_test, y_pred)

    # Create a pandas DataFrame for the confusion matrix
    cm_df = pd.DataFrame(cm, index=model.classes_, columns=model.classes_)

    # Generate the confusion matrix plot
    plt.figure(figsize=(10, 7))
    sns.heatmap(cm_df, annot=True, fmt="d", cmap="Blues")  # fmt="d" shows counts as integers
    plt.title("Confusion Matrix")
    plt.xlabel("Predicted Labels")
    plt.ylabel("True Labels")
    plt.savefig("results/confusion_matrix.png")
    plt.close()


def main():
    df = load_data()
    X_train, X_test, y_train, y_test = split(df)
    ohe = train_ohe(X_train)
    X_train = ohe.transform(X_train)
    X_test = ohe.transform(X_test)

    model = train(X_train, y_train)
    y_pred = model.predict(X_test)

    export_results(model, X_test, y_test, y_pred)
    export_confusion_matrix(model, y_test, y_pred)
    export_model(ohe, "models/ohe.pickle")
    export_model(model, "models/model.pickle")


if __name__ == "__main__":
    main()

Question! 5

Create the src/train.py file.

Ensure you understand its content. Call the professor if necessary.
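
Later, after train.py has been run, the saved pickles can be loaded back to score new data. Below is a minimal sketch of how that might look (this script and the sample client are illustrative, not part of the class material; the columns must match the training data):

import pickle

import pandas as pd

# Load the fitted one-hot encoder and the trained classifier
with open("models/ohe.pickle", "rb") as f:
    ohe = pickle.load(f)
with open("models/model.pickle", "rb") as f:
    model = pickle.load(f)

# A single hypothetical client, with the same columns as the training data
new_client = pd.DataFrame(
    [
        {
            "age": 41,
            "job": "technician",
            "marital": "married",
            "education": "secondary",
            "balance": 1200,
            "housing": "yes",
            "duration": 300,
            "campaign": 2,
        }
    ]
)

# Apply the same encoding used in training, then predict
print(model.predict(ohe.transform(new_client)))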

Checkpoint

Let's check that everything is working. To do this, we'll run both scripts and inspect the results they produce.

Question! 6

From the root folder, run:

$ python src/preproc.py

Question! 7

Check if the data/bank_preproc.parquet file was created.
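
To also peek at its contents, one option (a quick sanity check, not required by the exercise) is:

$ python -c "import pandas as pd; print(pd.read_parquet('data/bank_preproc.parquet').head())"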

Question! 8

From the root folder, run:

$ python src/train.py

Question! 9

Check that the following files were created:

  • models/model.pickle
  • models/ohe.pickle
  • results/confusion_matrix.png
  • results/model_test_metrics.csv

Question! 10

Open the file results/model_test_metrics.csv in VSCode or any spreadsheet program and check its contents!
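
Alternatively, a quick look from the terminal works too:

$ python -c "import pandas as pd; print(pd.read_csv('results/model_test_metrics.csv'))"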

Attention!

Delete the files generated by the runs:

  • models/model.pickle
  • models/ohe.pickle
  • results/confusion_matrix.png
  • results/model_test_metrics.csv
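
In a Unix-like shell, one way to remove exactly these files is:

$ rm models/model.pickle models/ohe.pickle results/confusion_matrix.png results/model_test_metrics.csv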

Create Pipeline

We verified that we can run each step separately. However, we still don't have a defined pipeline. Let's fix this!

We discussed that the pipeline will be made up of one or more stages.

We will create the preprocessing stage. But first, let's discuss this stage:

Question! 11

What are the dependencies for src/preproc.py to run successfully?

Ignore Python library dependencies and think about the required inputs (files).

Answer!

It depends on data/bank.csv and the existence of the Python file src/preproc.py itself.

Question! 12

What are the results or outputs generated by src/preproc.py?

Answer!

The file data/bank_preproc.parquet is generated as output.

Question! 13

What is the command to execute src/preproc.py?

Answer!

python src/preproc.py

Information on dependencies, outputs, and how to execute the file must be provided when creating a stage. Let's finally create the stage for src/preproc.py!

Question! 14

To do this, run in the terminal:

$ dvc stage add --name preproc \
            --deps src/preproc.py \
            --deps data/bank.csv \
            --outs data/bank_preproc.parquet \
            python src/preproc.py

Important!

Notice that we provided a name for the stage, its dependencies, its outputs, and the command to execute it, all relative to the root directory of the repository.

This will create, at the root of the repository, a YAML file dvc.yaml containing the information that defines the pipeline:

stages:
  preproc:
    cmd: python src/preproc.py
    deps:
    - data/bank.csv
    - src/preproc.py
    outs:
    - data/bank_preproc.parquet

Tip! 1

Pipeline stages can be defined either by editing this dvc.yaml file directly or by using terminal commands such as dvc stage add.

Let's run the pipeline and check the results produced.

Attention!

Before running, ensure that the files generated by the previous manual execution (before creating the pipeline) have been deleted.

Question! 15

From the root folder, run:

$ dvc repro

Running stage 'preproc':                                                                                                  
> python src/preproc.py
Generating lock file 'dvc.lock'                                                                                           
Updating lock file 'dvc.lock'

To track the changes with git, run:

        git add dvc.lock

To enable auto staging, run:

        dvc config core.autostage true
Use `dvc push` to send your updates to remote storage.

Question! 16

Use dvc repro to rerun the pipeline. What happens?

Answer!

DVC can see that none of the dependencies changed, so there is no need to run the pipeline again. This will be especially important for avoiding unnecessary reruns of time-consuming tasks.

$ dvc repro

Stage 'preproc' didn't change, skipping                                                                                   
Data and pipelines are up to date.
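
DVC decides what to rerun by comparing file hashes recorded in dvc.lock against the current state of each dependency and output. The exact hashes will differ on your machine and the exact fields depend on the DVC version, but the lock file looks roughly like this:

schema: '2.0'
stages:
  preproc:
    cmd: python src/preproc.py
    deps:
    - path: data/bank.csv
      md5: <hash of bank.csv>
      size: <size in bytes>
    - path: src/preproc.py
      md5: <hash of preproc.py>
      size: <size in bytes>
    outs:
    - path: data/bank_preproc.parquet
      md5: <hash of the parquet file>
      size: <size in bytes>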

Train Stage

Question! 17

What are the dependencies for src/train.py to run successfully?

Ignore Python library dependencies and think about the required inputs (files).

Answer!

It depends on data/bank_preproc.parquet and the existence of the Python file src/train.py itself.

Question! 18

What are the results or outputs generated by src/train.py?

Answer!

The files:

  • models/model.pickle
  • models/ohe.pickle
  • results/confusion_matrix.png
  • results/model_test_metrics.csv

Question! 19

What is the command to execute src/train.py?

Answer!

python src/train.py

Question! 20

Write the command to create the train stage:

Answer!

$ dvc stage add --name train \
        --deps src/train.py \
        --deps data/bank_preproc.parquet \
        --outs models/model.pickle \
        --outs models/ohe.pickle \
        --outs results/confusion_matrix.png \
        --outs results/model_test_metrics.csv \
        python src/train.py

Question! 21

Create the train stage and open the dvc.yaml file to check its new contents.
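
For reference, after adding the train stage, dvc.yaml should look roughly like this:

stages:
  preproc:
    cmd: python src/preproc.py
    deps:
    - data/bank.csv
    - src/preproc.py
    outs:
    - data/bank_preproc.parquet
  train:
    cmd: python src/train.py
    deps:
    - data/bank_preproc.parquet
    - src/train.py
    outs:
    - models/model.pickle
    - models/ohe.pickle
    - results/confusion_matrix.png
    - results/model_test_metrics.csv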

Question! 22

Run the pipeline and check the generated files.

Question! 23

Rerun the pipeline. What happens?

Question! 24

In the train function of the train.py file, edit the RandomForestClassifier parameters. Change n_estimators from 50 to 100.
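
That is, the line in the train function becomes:

model = RandomForestClassifier(n_estimators=100, max_depth=5)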

Rerun the pipeline. What happens?

Answer!

DVC can see that nothing affecting the preproc stage has changed. However, the train stage was affected by the change in the train.py file and needs to run again.

$ dvc repro

Stage 'preproc' didn't change, skipping                                                                                   
Running stage 'train':                                                                                                    
> python src/train.py
Updating lock file 'dvc.lock'                                                                                             

To track the changes with git, run:

        git add dvc.lock

To enable auto staging, run:

        dvc config core.autostage true
Use `dvc push` to send your updates to remote storage.

Question! 25

Let's now change something in preproc.py. In the preprocess function, add the campaign column to the list of columns to be removed (df.drop).
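
That is, the drop call in preprocess becomes:

    df = df.drop(
        labels=[
            "default",
            "contact",
            "day",
            "month",
            "pdays",
            "previous",
            "loan",
            "poutcome",
            "campaign",
        ],
        axis=1,
    )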

Rerun the pipeline. What happens?

Answer!

DVC realizes that the changes affect the preproc stage, so it needs to run again.

The train stage itself has not changed, but it depends on the output of preproc, so train is also executed.

Tip! 2

To view the DAG in the terminal, run:

$ dvc dag
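
The output should look something like this (exact rendering may vary between DVC versions):

+---------+
| preproc |
+---------+
      *
      *
      *
 +-------+
 | train |
 +-------+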

Try it yourself!

Combining DVC's data versioning features with its pipeline management capabilities gives data scientists a powerful tool: experiments become easy to run, and their results are reproducible and auditable.

That's all for today!