(Day 252) Designing effective ML monitoring with EvidentlyAI

Ivan Ivanov · September 9, 2024

Hello :) Today is Day 252!

A quick summary of today:

  • covered the first 3 sections of Module 4

Tomorrow I have an exam, so my study time for the blog was a bit limited today. Nevertheless, I got to cover half of Module 4 from EvidentlyAI’s course on ML monitoring.

4.1. Logging for ML monitoring

What makes a good ML monitoring system? It consists of three key components:

  • Instrumentation to ensure collection and computation of useful metrics for analyzing model behavior and resolving issues.

  • Alerting to define unexpected model behavior through metrics and thresholds, and to design an action policy.

  • Debugging to provide engineers with context to understand model issues for faster resolution.

Logging and instrumentation

(The course slides walk through the logging setup in three steps.)
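To make the instrumentation part concrete, here is a minimal sketch (my own, not the course's setup) of logging each prediction as a JSON line so metrics can be computed over the log later; the file path and field names are assumptions.

```python
import json
import time
import uuid
from pathlib import Path

LOG_PATH = Path("prediction_logs.jsonl")  # assumed log location

def log_prediction(features: dict, prediction, model_version: str) -> None:
    """Append one prediction event as a JSON line for later monitoring."""
    record = {
        "id": str(uuid.uuid4()),         # lets us join ground truth later
        "ts": time.time(),               # timestamp for time-based slicing
        "model_version": model_version,  # to compare model versions
        "features": features,            # inputs the model received
        "prediction": prediction,        # model output
    }
    with LOG_PATH.open("a") as f:
        f.write(json.dumps(record) + "\n")

# usage
log_prediction({"age": 42, "plan": "pro"}, prediction=0.87, model_version="v3")
```

With something like this in place, the alerting and debugging components have raw material to work with.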

4.2. How to prioritize ML monitoring metrics


TL;DR on metric prioritization (with a quick evidently sketch after the list):


  1. Service health

  2. Model performance

  3. Data quality and data integrity

  4. Data and concept drift
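To make this a bit more concrete, here is a minimal sketch of computing the data quality and drift groups with evidently's Report and metric presets. The imports follow the API as of roughly v0.4 (newer releases changed them), the CSV names are placeholders, and service health metrics (latency, error rate, uptime) would normally come from standard ops tooling rather than evidently.

```python
import pandas as pd
from evidently.report import Report
from evidently.metric_preset import DataQualityPreset, DataDriftPreset

# Reference = data the model was trained/validated on; current = recent production data.
reference = pd.read_csv("reference.csv")  # placeholder file names
current = pd.read_csv("current.csv")

# Data quality/integrity and data drift checks in a single report.
# Model performance presets (classification/regression) can be added
# once ground-truth labels are available.
report = Report(metrics=[DataQualityPreset(), DataDriftPreset()])
report.run(reference_data=reference, current_data=current)

report.save_html("monitoring_report.html")  # human-readable view for debugging
metrics = report.as_dict()                  # machine-readable output for alerting
```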

Comprehensive monitoring

Depending on the problem statement and model usage scenario, we can introduce more comprehensive monitoring metrics:

  • Performance by segment. This is especially useful if we deal with a diverse audience or complex object structures and want to monitor each segment separately (a quick sketch follows this list).

  • Model bias and fairness. These metrics are crucial for sensitive domain areas like healthcare.

  • Outliers. Monitoring for outliers is vital when individual errors are costly.

  • Explainability. Explainability is important when users need to understand model decisions/outputs.
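For performance by segment, the simplest version is just grouping the scored data before computing the metric. A quick sketch, where the file name and the segment/y_true/y_pred columns are assumptions:

```python
import pandas as pd
from sklearn.metrics import mean_absolute_error

# Predictions joined with ground truth; the schema is assumed for illustration.
df = pd.read_csv("scored_predictions.csv")  # columns: segment, y_true, y_pred

# Compute the same metric per segment so a regression in one group
# is not hidden by a healthy overall average.
per_segment = (
    df.groupby("segment")
      .apply(lambda g: mean_absolute_error(g["y_true"], g["y_pred"]))
      .rename("mae")
)
print(per_segment.sort_values(ascending=False))
```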

4.3. When to retrain machine learning models

Model retraining strategies

  1. On Schedule


Pro-tip for scheduled retraining: use historical data to estimate the rate of model decay and the volume of new data required for effective retraining. For example, take a training set from the historical data and train a model on it. Then start experimenting: apply this model to new batches of data at a fixed time step (daily, weekly, monthly) to measure how it performs on newer data and to pinpoint when its quality starts to degrade. Important note: we need labels to do this. If feedback or ground truth is not available yet, it makes sense to send data for labeling before starting the experiment, as sketched below.
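Here is a rough sketch of that backtesting idea, assuming a labeled historical dataset with a timestamp column; the file name, columns, dates, and model are all placeholders rather than the course's code.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

df = pd.read_csv("historical_labeled_data.csv", parse_dates=["ts"])  # assumed schema
features, target = ["f1", "f2", "f3"], "label"                       # assumed columns

# Train once on the oldest slice of history.
train = df[df["ts"] < "2024-01-01"]
model = RandomForestClassifier(random_state=42).fit(train[features], train[target])

# Replay the following months one by one to see when quality starts to degrade.
for month, batch in df[df["ts"] >= "2024-01-01"].groupby(pd.Grouper(key="ts", freq="M")):
    if batch[target].nunique() < 2:
        continue  # skip empty or single-class months
    score = roc_auc_score(batch[target], model.predict_proba(batch[features])[:, 1])
    print(f"{month:%Y-%m}: ROC AUC = {score:.3f}")
```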


  2. Trigger-based retraining

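A minimal sketch of one such trigger, assuming evidently's TestSuite API (imports as of roughly v0.4; newer releases changed them) and a hypothetical retrain() hook into the training pipeline:

```python
import pandas as pd
from evidently.test_suite import TestSuite
from evidently.test_preset import DataDriftTestPreset

def retrain() -> None:
    # Hypothetical hook: in practice this would kick off the training pipeline.
    print("Drift detected, triggering retraining job")

reference = pd.read_csv("reference.csv")    # placeholder file names
current = pd.read_csv("current_batch.csv")

# Run per-column drift tests for the current batch against the reference data.
suite = TestSuite(tests=[DataDriftTestPreset()])
suite.run(reference_data=reference, current_data=current)

# The summary layout below matches the versions I have used; check yours.
if not suite.as_dict()["summary"]["all_passed"]:
    retrain()
```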

Model retraining tradeoffs


Thinking through the retraining decision

Be pragmatic. Develop a strategy considering available actions, service properties, resources, model criticality, and the cost of errors. Here is an example of decision-making logic you could follow:
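(A sketch of my own, not the course's exact flow; the inputs, thresholds, and actions are illustrative assumptions.)

```python
def decide_next_action(labels_available: bool,
                       performance_drop: float,
                       drift_detected: bool,
                       retrain_cost_acceptable: bool) -> str:
    """Toy retraining decision logic; thresholds and actions are illustrative."""
    if labels_available and performance_drop > 0.05:
        # Quality has measurably degraded: retrain if we can afford it,
        # otherwise fall back (simpler model, rules, or human review).
        return "retrain" if retrain_cost_acceptable else "fallback"
    if not labels_available and drift_detected:
        # No ground truth yet, but inputs look different: investigate first.
        return "investigate / send data for labeling"
    return "keep current model"

print(decide_next_action(True, 0.08, drift_detected=False, retrain_cost_acceptable=True))
```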



That is all for today!

See you tomorrow :)