
Practicing with DVC Pipelines

Let's put DVC pipelines into practice.

Create repository

Question! 1

Create a private repository to be used in the experiment and clone it on your machine.

If you create a public repository, DVC will also work! Using a private repository is just a recommendation.
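
For reference, cloning could look like this (the repository name dvc-pipelines and the HTTPS URL are placeholders for whatever you created):

$ git clone https://github.com/<your-user>/dvc-pipelines.git
$ cd dvc-pipelines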

Important!

Access the repository folder and work from there!

Use dvc

Let's initialize DVC in our repository.

Question! 2

Make sure you are at the root of the repository and run:

$ dvc init
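
Running dvc init creates the .dvc/ directory and a .dvcignore file at the root of the repository. A typical follow-up is to commit them right away:

$ git add .dvc .dvcignore
$ git commit -m "Initialize DVC"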

Basic structure

Question! 3

Create the basic folder structure for the class.

.
├── data
│   └── bank.csv
├── models
├── results
└── src
    ├── preproc.py
    └── train.py
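
One way to create the folders from the repository root (the Python files and the CSV are added in the following steps):

$ mkdir -p data models results src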

The CSV is the same one used in the first class and can also be found HERE!

We will discuss the remaining files next!

preproc.py

This file will be responsible for performing small transformations on the data to make it suitable for training. As output, it saves a Parquet file in the data folder.

Its content is:

import pandas as pd


def preprocess():
    df = pd.read_csv("data/bank.csv")

    # Convert the column to category and map the values
    dep_mapping = {"yes": 1, "no": 0}
    df["deposit"] = df["deposit"].astype("category").map(dep_mapping)

    df = df.drop(
        labels=[
            "default",
            "contact",
            "day",
            "month",
            "pdays",
            "previous",
            "loan",
            "poutcome",
        ],
        axis=1,
    )

    return df


def export_data(df):
    df.to_parquet("data/bank_preproc.parquet")


def main():
    df = preprocess()
    export_data(df)


if __name__ == "__main__":
    main()

Question! 4

Create the src/preproc.py file.

Ensure you understand its content. Call the professor if necessary.

train.py

This file takes the output generated by preproc.py and uses it to fit two artifacts: a one-hot encoder and a RandomForestClassifier model.

As output, this file generates a CSV file with the model's performance metrics on the test data and an image of the confusion matrix. Both results are saved in the results folder.

Additionally, the file saves pickles of the encoder and the model in the models folder.

Its content is:

# Data
import pandas as pd

# Export
import pickle

# Plot
import matplotlib.pyplot as plt
import seaborn as sns

# Modeling
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    confusion_matrix,
    accuracy_score,
    precision_score,
    recall_score,
)
from sklearn.model_selection import train_test_split
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder


def load_data():
    df = pd.read_parquet("data/bank_preproc.parquet")
    return df


def split(df):
    X = df.drop("deposit", axis=1)
    y = df["deposit"]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=1912
    )

    return X_train, X_test, y_train, y_test


def train_ohe(X_train):
    cat_cols = ["job", "marital", "education", "housing"]
    one_hot_enc = make_column_transformer(
        (OneHotEncoder(handle_unknown="ignore", drop="first"), cat_cols),
        remainder="passthrough",
    )

    one_hot_enc.fit(X_train)

    return one_hot_enc


def train(X_train, y_train):
    model = RandomForestClassifier(n_estimators=50, max_depth=5)
    model.fit(X_train, y_train)
    return model


def export_model(model, file_path):
    with open(file_path, "wb") as f:
        pickle.dump(model, f)


def export_results(model, X_test, y_test, y_pred):
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, average="weighted")
    recall = recall_score(y_test, y_pred, average="weighted")

    # Create a DataFrame with the evaluation metrics
    results_df = pd.DataFrame(
        {"Accuracy": [accuracy], "Precision": [precision], "Recall": [recall]}
    )

    results_df.to_csv("results/model_test_metrics.csv", index=False)


def export_confusion_matrix(model, y_test, y_pred):
    cm = confusion_matrix(y_test, y_pred)

    # Create a pandas DataFrame for the confusion matrix
    cm_df = pd.DataFrame(cm, index=model.classes_, columns=model.classes_)

    # Generate the confusion matrix plot
    plt.figure(figsize=(10, 7))
    sns.heatmap(cm_df, annot=True, fmt="d", cmap="Blues")  # fmt="d" shows counts as integers
    plt.title("Confusion Matrix")
    plt.xlabel("Predicted Labels")
    plt.ylabel("True Labels")
    plt.savefig("results/confusion_matrix.png")
    plt.close()


def main():
    df = load_data()
    X_train, X_test, y_train, y_test = split(df)
    ohe = train_ohe(X_train)
    X_train = ohe.transform(X_train)
    X_test = ohe.transform(X_test)

    model = train(X_train, y_train)
    y_pred = model.predict(X_test)

    export_results(model, X_test, y_test, y_pred)
    export_confusion_matrix(model, y_test, y_pred)
    export_model(ohe, "models/ohe.pickle")
    export_model(model, "models/model.pickle")


if __name__ == "__main__":
    main()

Question! 5

Create the src/train.py file.

Ensure you understand its content. Call the professor if necessary.
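
Later, after train.py has been run, the saved pickles can be loaded back to score new data. Below is a minimal sketch of how that might look (this script and the sample client are illustrative, not part of the class material; the columns must match the training data):

import pickle

import pandas as pd

# Load the fitted one-hot encoder and the trained classifier
with open("models/ohe.pickle", "rb") as f:
    ohe = pickle.load(f)
with open("models/model.pickle", "rb") as f:
    model = pickle.load(f)

# A single hypothetical client, with the same columns as the training data
new_client = pd.DataFrame(
    [
        {
            "age": 41,
            "job": "technician",
            "marital": "married",
            "education": "secondary",
            "balance": 1200,
            "housing": "yes",
            "duration": 300,
            "campaign": 2,
        }
    ]
)

# Apply the same encoding used in training, then predict
print(model.predict(ohe.transform(new_client)))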

Checkpoint

Let's check that everything is working. To do this, we'll run both scripts and inspect the results they produce.

Question! 6

From the root folder, run:

$ python src/preproc.py

Question! 7

Check if the data/bank_preproc.parquet file was created.
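
To also peek at its contents, one option (a quick sanity check, not required by the exercise) is:

$ python -c "import pandas as pd; print(pd.read_parquet('data/bank_preproc.parquet').head())"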

Question! 8

From the root folder, run:

$ python src/train.py

Question! 9

Check that the following files were created:

  • models/model.pickle
  • models/ohe.pickle
  • results/confusion_matrix.png
  • results/model_test_metrics.csv

Question! 10

Open the file results/model_test_metrics.csv in VSCode or any spreadsheet program and check its contents!
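
Alternatively, a quick look from the terminal works too:

$ python -c "import pandas as pd; print(pd.read_csv('results/model_test_metrics.csv'))"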

Attention!

Delete the files generated by the runs:

  • models/model.pickle
  • models/ohe.pickle
  • results/confusion_matrix.png
  • results/model_test_metrics.csv
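
In a Unix-like shell, one way to remove exactly these files is:

$ rm models/model.pickle models/ohe.pickle results/confusion_matrix.png results/model_test_metrics.csv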

Create Pipeline

We verified that we can run each step separately. However, we still don't have a defined pipeline. Let's fix this!

We discussed that the pipeline will be made up of one or more stages.

We will create the preprocessing stage. But first, let's discuss this stage:

Question! 11

What are the dependencies for src/preproc.py to run successfully?

Ignore Python library dependencies and think about the required inputs (files).

Answer!

It depends on data/bank.csv and the existence of the Python file src/preproc.py itself.

Question! 12

What are the results or outputs generated by src/preproc.py?

Answer!

The file data/bank_preproc.parquet is generated as output.

Question! 13

What is the command to execute src/preproc.py?

Answer!

python src/preproc.py

Information on dependencies, outputs, and how to execute the file must be provided when creating a stage. Let's finally create the stage for src/preproc.py!

Question! 14

To do this, run in the terminal:

$ dvc stage add --name preproc \
            --deps src/preproc.py \
            --deps data/bank.csv \
            --outs data/bank_preproc.parquet \
            python src/preproc.py

Important!

Notice that we provided a name for the stage, its dependencies, its outputs, and the command to execute it, all relative to the root directory of the repository.

This will create, at the root of the repository, a YAML file dvc.yaml containing the information that defines the pipeline:

stages:
  preproc:
    cmd: python src/preproc.py
    deps:
    - data/bank.csv
    - src/preproc.py
    outs:
    - data/bank_preproc.parquet

Tip! 1

Pipeline stages can be defined either by editing this dvc.yaml file directly or by using terminal commands such as dvc stage add.

Let's run the pipeline and check the results produced.

Attention!

Before running, ensure that the files generated by the previous manual execution (before creating the pipeline) have been deleted.

Question! 15

From the root folder, run:

$ dvc repro

Running stage 'preproc':                                                                                                  
> python src/preproc.py
Generating lock file 'dvc.lock'                                                                                           
Updating lock file 'dvc.lock'

To track the changes with git, run:

        git add dvc.lock

To enable auto staging, run:

        dvc config core.autostage true
Use `dvc push` to send your updates to remote storage.

Question! 16

Use dvc repro to rerun the pipeline. What happens?

Answer!

DVC can see that none of the dependencies changed, so there is no need to run the pipeline again. This will be especially important for avoiding unnecessary reruns of time-consuming tasks.

$ dvc repro

Stage 'preproc' didn't change, skipping                                                                                   
Data and pipelines are up to date.
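
DVC decides what to rerun by comparing file hashes recorded in dvc.lock against the current state of each dependency and output. The exact hashes will differ on your machine and the exact fields depend on the DVC version, but the lock file looks roughly like this:

schema: '2.0'
stages:
  preproc:
    cmd: python src/preproc.py
    deps:
    - path: data/bank.csv
      md5: <hash of bank.csv>
      size: <size in bytes>
    - path: src/preproc.py
      md5: <hash of preproc.py>
      size: <size in bytes>
    outs:
    - path: data/bank_preproc.parquet
      md5: <hash of the parquet file>
      size: <size in bytes>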

Train Stage

Question! 17

What are the dependencies for src/train.py to run successfully?

Ignore Python library dependencies and think about the required inputs (files).

Answer!

It depends on data/bank_preproc.parquet and the existence of the Python file src/train.py itself.

Question! 18

What are the results or outputs generated by src/train.py?

Answer!

The files:

  • models/model.pickle
  • models/ohe.pickle
  • results/confusion_matrix.png
  • results/model_test_metrics.csv

Question! 19

What is the command to execute src/train.py?

Answer!

python src/train.py

Question! 20

Write the command to create the train stage:

Answer!

$ dvc stage add --name train \
        --deps src/train.py \
        --deps data/bank_preproc.parquet \
        --outs models/model.pickle \
        --outs models/ohe.pickle \
        --outs results/confusion_matrix.png \
        --outs results/model_test_metrics.csv \
        python src/train.py

Question! 21

Create the train stage and open the dvc.yaml file to check its new contents.
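
For reference, after adding the train stage, dvc.yaml should look roughly like this:

stages:
  preproc:
    cmd: python src/preproc.py
    deps:
    - data/bank.csv
    - src/preproc.py
    outs:
    - data/bank_preproc.parquet
  train:
    cmd: python src/train.py
    deps:
    - data/bank_preproc.parquet
    - src/train.py
    outs:
    - models/model.pickle
    - models/ohe.pickle
    - results/confusion_matrix.png
    - results/model_test_metrics.csv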

Question! 22

Run the pipeline and check the generated files.

Question! 23

Rerun the pipeline. What happens?

Question! 24

In the train function of the train.py file, edit the RandomForestClassifier parameters. Change n_estimators from 50 to 100.
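
That is, the line in the train function becomes:

model = RandomForestClassifier(n_estimators=100, max_depth=5)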

Rerun the pipeline. What happens?

Answer!

DVC can see that nothing affecting the preproc stage has changed. However, the train stage was affected by the change in the train.py file and needs to run again.

$ dvc repro

Stage 'preproc' didn't change, skipping                                                                                   
Running stage 'train':                                                                                                    
> python src/train.py
Updating lock file 'dvc.lock'                                                                                             

To track the changes with git, run:

        git add dvc.lock

To enable auto staging, run:

        dvc config core.autostage true
Use `dvc push` to send your updates to remote storage.

Question! 25

Let's now change something in preproc.py. In the preprocess function, add the campaign column to the list of columns to be removed (df.drop).
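
That is, the drop call in preprocess becomes:

    df = df.drop(
        labels=[
            "default",
            "contact",
            "day",
            "month",
            "pdays",
            "previous",
            "loan",
            "poutcome",
            "campaign",
        ],
        axis=1,
    )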

Rerun the pipeline. What happens?

Answer!

DVC realizes that the changes affect the preproc stage, so it needs to run again.

The train stage itself has not changed, but it depends on the output of preproc, so train is also executed.

Tip! 2

To view the DAG in the terminal, run:

$ dvc dag
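
The output should look something like this (exact rendering may vary between DVC versions):

+---------+
| preproc |
+---------+
      *
      *
      *
 +-------+
 | train |
 +-------+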

Try it yourself!

Combining DVC's data versioning features with its pipeline management capabilities gives data scientists a powerful tool: experiments become easy to run, and their results are reproducible and auditable.

That's all for today!