Practicing with DVC Pipelines
Let's see how DVC pipelines work in practice.
Create repository
Question! 1
Important!
Access the repository folder and work from there!
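A minimal sketch of this step, assuming you start from a fresh local Git repository (the folder name dvc-pipelines is purely illustrative):

mkdir dvc-pipelines
cd dvc-pipelines
git init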
Use DVC
Let's initialize DVC in our repository.
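A sketch of the initialization, using DVC's standard command from the repository root (the commit message is illustrative):

dvc init
git commit -m "Initialize DVC"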
Question! 2
Basic structure
Question! 3
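Judging from the paths used throughout this lesson, the project needs data/, src/, models/ and results/ folders. One possible way to create them:

mkdir -p data src models results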
preproc.py
This file is responsible for performing small transformations on the data to make it suitable for training. As output, it saves a file in Parquet format in the data folder.
Its content is:
import pandas as pd


def preprocess():
    # Load the raw data
    df = pd.read_csv("data/bank.csv")
    # Convert the target column to category and map the values to 0/1
    dep_mapping = {"yes": 1, "no": 0}
    df["deposit"] = df["deposit"].astype("category").map(dep_mapping)
    # Drop columns that will not be used for training
    df = df.drop(
        labels=[
            "default",
            "contact",
            "day",
            "month",
            "pdays",
            "previous",
            "loan",
            "poutcome",
        ],
        axis=1,
    )
    return df


def export_data(df):
    # Save the preprocessed data in Parquet format
    df.to_parquet("data/bank_preproc.parquet")


def main():
    df = preprocess()
    export_data(df)


if __name__ == "__main__":
    main()
Question! 4
train.py
This file takes the outputs generated by preproc.py and uses them to build two models: a one-hot encoder and a classifier based on RandomForestClassifier. As output, it generates a CSV file with the model's performance metrics on the test data and an image of the confusion matrix; both are saved in the results folder. Additionally, it saves the pickled models in the models folder.
Its content is:
# Data
import pandas as pd

# Export
import pickle

# Plot
import matplotlib.pyplot as plt
import seaborn as sns

# Modeling
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    confusion_matrix,
    accuracy_score,
    precision_score,
    recall_score,
)
from sklearn.model_selection import train_test_split
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder


def load_data():
    df = pd.read_parquet("data/bank_preproc.parquet")
    return df


def split(df):
    # Separate features and target, then hold out 30% for testing
    X = df.drop("deposit", axis=1)
    y = df["deposit"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=1912
    )
    return X_train, X_test, y_train, y_test


def train_ohe(X_train):
    # One-hot encode the categorical columns, passing the rest through
    cat_cols = ["job", "marital", "education", "housing"]
    one_hot_enc = make_column_transformer(
        (OneHotEncoder(handle_unknown="ignore", drop="first"), cat_cols),
        remainder="passthrough",
    )
    one_hot_enc.fit(X_train)
    return one_hot_enc


def train(X_train, y_train):
    model = RandomForestClassifier(n_estimators=50, max_depth=5)
    model.fit(X_train, y_train)
    return model


def export_model(model, file_path):
    with open(file_path, "wb") as f:
        pickle.dump(model, f)


def export_results(model, X_test, y_test, y_pred):
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, average="weighted")
    recall = recall_score(y_test, y_pred, average="weighted")
    # Create a DataFrame with the evaluation metrics
    results_df = pd.DataFrame(
        {"Accuracy": [accuracy], "Precision": [precision], "Recall": [recall]}
    )
    results_df.to_csv("results/model_test_metrics.csv", index=False)


def export_confusion_matrix(model, y_test, y_pred):
    cm = confusion_matrix(y_test, y_pred)
    # Create a pandas DataFrame for the confusion matrix
    cm_df = pd.DataFrame(cm, index=model.classes_, columns=model.classes_)
    # Generate the confusion matrix plot
    plt.figure(figsize=(10, 7))
    sns.heatmap(cm_df, annot=True, cmap="Blues")
    plt.title("Confusion Matrix")
    plt.xlabel("Predicted Labels")
    plt.ylabel("True Labels")
    plt.savefig("results/confusion_matrix.png")
    plt.close()


def main():
    df = load_data()
    X_train, X_test, y_train, y_test = split(df)
    # Fit the encoder on the training data and apply it to both splits
    ohe = train_ohe(X_train)
    X_train = ohe.transform(X_train)
    X_test = ohe.transform(X_test)
    model = train(X_train, y_train)
    y_pred = model.predict(X_test)
    export_results(model, X_test, y_test, y_pred)
    export_confusion_matrix(model, y_test, y_pred)
    export_model(ohe, "models/ohe.pickle")
    export_model(model, "models/model.pickle")


if __name__ == "__main__":
    main()
Question! 5
Checkpoint
Is everything working? To check, let's run both files and inspect the results they produce.
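Both scripts can be run directly with Python, preproc.py first, since train.py consumes its output:

python src/preproc.py
python src/train.py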
Question! 6
Question! 7
Question! 8
Question! 9
Question! 10
Attention!
Delete the files generated by the runs:
models/model.pickle
models/ohe.pickle
results/confusion_matrix.png
results/model_test_metrics.csv
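On a Unix-like shell, one way to remove them all at once:

rm models/model.pickle models/ohe.pickle results/confusion_matrix.png results/model_test_metrics.csv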
Create Pipeline
We have verified that we can run each step separately, but we do not yet have a defined pipeline. Let's fix that!
We discussed that a pipeline is made up of one or more stages.
We will create the preprocessing stage. But first, let's discuss this stage:
Question! 11
Answer!
It depends on data/bank.csv and on the Python file src/preproc.py itself.
Question! 12
Answer!
The file data/bank_preproc.parquet is generated as output.
Question! 13
Answer!
python src/preproc.py
Information about the dependencies, the outputs, and how to execute the file must be provided when defining the stage. Let's finally create the stage for src/preproc.py!
Question! 14
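One possible invocation, using DVC's standard dvc stage add flags (-n names the stage, -d declares each dependency, -o declares each output):

dvc stage add -n preproc \
  -d data/bank.csv -d src/preproc.py \
  -o data/bank_preproc.parquet \
  python src/preproc.py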
This will create, at the root of the repository, a YAML file, dvc.yaml, containing the information that defines the pipeline:

stages:
  preproc:
    cmd: python src/preproc.py
    deps:
      - data/bank.csv
      - src/preproc.py
    outs:
      - data/bank_preproc.parquet
Tip! 1
Pipeline stages can be defined either by editing this dvc.yaml file directly or with terminal commands such as dvc stage add.
Let's run the pipeline and check the results produced.
Attention!
Before running, ensure that the files generated by the previous manual execution (before creating the pipeline) have been deleted.
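Running the pipeline uses DVC's standard reproduction command:

dvc repro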
Question! 15
Question! 16
Answer!
DVC sees that none of the dependencies have changed, so there is no need to run the pipeline again. This is especially important for avoiding unnecessary re-execution of time-consuming tasks.
Train Stage
Question! 17
Answer!
It depends on data/bank_preproc.parquet and on the Python file src/train.py itself.
Question! 18
Answer!
The files:
models/model.pickle
models/ohe.pickle
results/confusion_matrix.png
results/model_test_metrics.csv
Question! 19
Answer!
python src/train.py
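By analogy with the preproc stage, a possible dvc stage add command for this stage (the flags mirror the dependencies and outputs listed above):

dvc stage add -n train \
  -d data/bank_preproc.parquet -d src/train.py \
  -o models/model.pickle -o models/ohe.pickle \
  -o results/confusion_matrix.png -o results/model_test_metrics.csv \
  python src/train.py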
Question! 20
Question! 21
Question! 22
Question! 23
Question! 24
Answer!
DVC can see that nothing affecting the preproc stage has changed. However, the train stage was modified, due to the change in the train.py file, and needs to be run again.
Question! 25
Answer!
DVC recognizes that the changes affect the preproc stage, which therefore needs to run again. The train stage itself has not changed, but since it depends on the output of preproc, train must also be executed.
Combining DVC's data versioning features with its pipeline management capabilities gives data scientists a powerful tool: experiments are easy to run, and results can be reproduced and audited.
That's all for today!