Evaluate Model Degradation
Ground Truth Evaluation
We have already discussed the idea of monitoring model performance in production using metrics that could, for example, compare performance on the test set (held out during training) with performance on the current data in production.
This concept is known as Ground Truth Evaluation. It demands waiting for the occurrence of a labeled event. For example, in the case of a recommendation engine, it would determine whether a customer clicked on or purchased one of the recommended products.
Once the new ground truth data is gathered, the next step is to evaluate the model's performance on it and compare the result against the metrics registered during the training phase.
Decision making!
If the disparity between the model performance and the registered metrics exceeds a certain threshold, it signifies that the model has become outdated.
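This decision rule can be sketched as a simple threshold check. The function name and the 0.05 threshold below are illustrative assumptions, not part of any specific monitoring library:

```python
# Hypothetical sketch of the Ground Truth Evaluation decision rule:
# compare the metric registered at training time against the same
# metric recomputed on newly labeled production data.

def is_model_outdated(registered_metric: float,
                      production_metric: float,
                      threshold: float = 0.05) -> bool:
    """Flag the model as outdated when the performance drop
    exceeds the chosen threshold (assumed value: 0.05)."""
    return (registered_metric - production_metric) > threshold

# Example: accuracy was 0.92 at training time, 0.84 on fresh ground truth.
print(is_model_outdated(0.92, 0.84))  # True -> time to retrain
```

In practice the threshold is set per use case, balancing retraining cost against tolerance for degraded predictions.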
Question 1
Answer
Retrain the model on current data!
Question 2
Answer
The goal is to refine the model, not radically change it.
Question 3
Question 4
Answer
In some scenarios the events or outcomes of interest lie in the future or have not yet occurred, so it becomes impractical to obtain ground truth data for evaluation.
Think, for example, of fraud detection models. Sometimes it can take months for a credit card payment claim to be created.
Another example: a model that predicts the occurrence of a disease in the next ten years requires a wait of up to ten years!
Tip 1
When the maturation time of the target variable is long, Ground Truth Evaluation may become impractical.
Question 5
Answer
To address this challenge, one possible solution is to implement random labeling, which involves establishing ground truth for a subset of the dataset chosen at random. This approach provides a broader representation of the data and a more comprehensive evaluation of the model's performance.
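Random labeling can be sketched as sampling a fixed fraction of production records and sending only those for manual labeling. The function name, fraction, and seed below are illustrative assumptions:

```python
import random

def sample_for_labeling(record_ids, fraction=0.01, seed=42):
    """Pick a random subset of production record IDs to send for
    manual ground-truth labeling (illustrative sketch)."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    k = max(1, int(len(record_ids) * fraction))
    return rng.sample(record_ids, k)

ids = list(range(10_000))
subset = sample_for_labeling(ids, fraction=0.01)
print(len(subset))  # 100 records out of 10,000
```

Because the subset is chosen uniformly at random, the metrics computed on it are an unbiased (if noisier) estimate of performance on the full production stream.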
Data Drift
Considering the challenges and limitations associated with Ground Truth Evaluation, a more practical alternative could be the use of data drift or input drift detection.
Rather than relying solely on ground truth labels, input drift detection focuses on identifying changes in the input data itself, without requiring explicit knowledge of the true outcomes.
This approach offers a more feasible and efficient means of monitoring and maintaining machine learning models.
Tip 2
We will compare the distribution of features in the data being predicted versus the distribution of features in the data used in training.
Our goal is to check whether the distribution of the input data changes over time.