(Day 247) Continuing with AI monitoring using EvidentlyAI

Ivan Ivanov · September 4, 2024

Hello :) Today is Day 247!

A quick summary of today:

  • monitoring unstructured data (text) using evidently

Today I continued with EvidentlyAI’s course on monitoring.

Evidently AI: Module 3: ML monitoring for unstructured data

3.1. Introduction to NLP and LLM monitoring

NLP and LLM models are widely used for both predictive and generative tasks.

Predictive applications include:

  • Classification tasks, e.g., spam detection, classification of support tickets, evaluating text sentiment, etc.

  • Search and ranking tasks, common in e-commerce site search, content recommendations, etc.

  • Information extraction, e.g., pulling structured fields (names, dates, etc.) out of free text.

Generative applications cover such tasks as translation, summarization, and text generation, e.g., conversational interfaces, article generation, code generation, etc.

What can go wrong with NLP and LLMs?

There are standard issues that apply to both unstructured and tabular data:

  • Technical errors such as wrong encoding or data processing bugs.

  • Data shifts like new topics or unexpected usage scenarios.

However, there are also issues specific to working with unstructured data. For example:

  • Model attacks, e.g., prompt injection and adversarial usage.

  • Model behavior shift. For example, if we rely on third-party models, we can face changes in their properties due to retraining.

  • Hallucinations arise when the model starts to generate factually incorrect or unrelated answers.

ML monitoring metrics for NLP and LLMs

3.2. Monitoring data drift on raw text data

Challenges of monitoring raw text data

Handling raw text data is more complex than dealing with structured tabular data. With structured data, we can usually define what “good” or “expected” data looks like, e.g., particular feature distributions or statistical values can signal data quality. For unstructured data, there is no equally straightforward way to define quality or extract such a signal from raw text.

When it comes to data drift detection, we can use two strategies that work directly on raw text data: domain classifier and topic modeling.

Domain classifier

The domain classifier method, also called model-based drift detection, trains a classifier to predict whether a given text belongs to the reference or the current dataset. If the model can confidently tell which dataset a text sample came from, the two datasets are probably sufficiently different.
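
To make this concrete, here is a minimal sketch of the idea using scikit-learn (this is an illustration of the technique, not Evidently's internal implementation; reference_texts and current_texts are assumed to be lists of strings):

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def domain_classifier_drift_score(reference_texts, current_texts):
    # Label each text by origin: 0 = reference, 1 = current.
    texts = list(reference_texts) + list(current_texts)
    labels = np.array([0] * len(reference_texts) + [1] * len(current_texts))

    # Turn raw text into features a classifier can use.
    features = TfidfVectorizer(max_features=5000).fit_transform(texts)

    # Cross-validated ROC AUC: ~0.5 means the datasets are indistinguishable;
    # values close to 1.0 mean the classifier separates them easily, i.e., drift.
    scores = cross_val_score(LogisticRegression(max_iter=1000), features, labels, cv=3, scoring="roc_auc")
    return scores.mean()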

Topic modeling

Another strategy for evaluating raw data quality is topic modeling. The goal here is to categorize text into interpretable topic clusters, so instead of a binary classification model, we use a clustering model.

How it works (a rough sketch follows below):

  • Apply the clustering model to new batches of data.

  • Monitor the size and share of different topics over time.

  • Changes in topics can indicate data drift.

Using this method can be challenging due to difficulties in building a good clustering model:

  • There is no single “correct” cluster structure to aim for.

  • Typically, it requires manual tuning to build accurate and interpretable clusters.
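
Putting the steps above together, here is a rough sketch of topic-share monitoring with scikit-learn's NMF (illustrative only; reference_texts and current_texts are assumed lists of strings, and a real setup would need the manual tuning mentioned above):

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

N_TOPICS = 5

# Fit the vectorizer and topic model on reference data only.
vectorizer = TfidfVectorizer(max_features=5000, stop_words="english")
ref_matrix = vectorizer.fit_transform(reference_texts)
topic_model = NMF(n_components=N_TOPICS, random_state=42).fit(ref_matrix)

def topic_shares(texts):
    # Assign each text to its dominant topic and return the share per topic.
    weights = topic_model.transform(vectorizer.transform(texts))
    dominant_topic = weights.argmax(axis=1)
    return np.bincount(dominant_topic, minlength=N_TOPICS) / len(texts)

# Large changes in topic shares between batches can indicate drift.
print(topic_shares(reference_texts))
print(topic_shares(current_texts))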

3.3. Monitoring text data quality and data drift with descriptors

What is a text descriptor? A text descriptor is any feature we can derive or calculate from raw text; it transforms unstructured data (text) into structured data (numeric or categorical values).

Depending on your goal and problem statement, there are various groups of descriptors we can calculate (a toy example follows the list below):

  • Text data quality. Example descriptors are text length, the share of out-of-vocabulary words, the share of non-letter characters, regular expressions match, etc.

  • Text contents. For instance, we can monitor the presence of trigger words (e.g., mentions of specific brands or competitors) or track text sentiment. Collaborating with product managers or business teams is recommended to determine meaningful text descriptors for your problem statement and domain area.
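
As a toy example (plain pandas, not Evidently's API), a few quality descriptors are easy to compute by hand, assuming a DataFrame df with a Review_Text column:

import pandas as pd

df["text_length"] = df["Review_Text"].str.len()
df["word_count"] = df["Review_Text"].str.split().str.len()
# Share of characters that are neither letters nor whitespace.
df["non_letter_share"] = (
    df["Review_Text"].str.count(r"[^a-zA-Z\s]") / df["text_length"].clip(lower=1)
)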

Monitoring text descriptors

Once you have a structured representation of the unstructured data – in our case, text descriptors – you can use monitoring techniques applied to structured data (a descriptor-based report appears in the practical section below). For example, you can:

  • Measure descriptor distribution drift.

  • Run rule-based checks, such as min-max ranges for text length, the expected share of non-letter symbols, or responses that match a regular expression.

  • Track any statistics calculable for tabular data, e.g., correlation changes between descriptors and the model target.

3.4. Monitoring embeddings drift

Embedding drift detection methods

Since embeddings are numerical values, we can use many methods to monitor embedding distribution drift.

Distance metrics

Each object in the dataset, represented as an embedding, is a numerical vector, and distance metrics measure how far apart these vectors are. For example, you can use Euclidean distance or cosine distance (which reflects the angle between vectors). We can detect dataset shifts by measuring the distance between the centroids of the reference and current data.
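
A toy numpy illustration of the centroid idea (names are illustrative; reference_emb and current_emb are assumed arrays of shape (n_samples, n_dims)):

import numpy as np
from numpy.linalg import norm

def centroid_distances(reference_emb, current_emb):
    # Average each dataset into a single centroid vector.
    ref_center = reference_emb.mean(axis=0)
    cur_center = current_emb.mean(axis=0)
    # Euclidean distance between the centroids.
    euclidean = norm(ref_center - cur_center)
    # Cosine distance: 1 - cosine similarity of the two centroids.
    cosine = 1 - (ref_center @ cur_center) / (norm(ref_center) * norm(cur_center))
    return euclidean, cosine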

Model-based drift detection

This approach uses embeddings to build a domain classifier that distinguishes between reference and current data. The idea is the same as model-based drift detection on raw text: you estimate how well a model can tell the two datasets apart. With embeddings, however, there is a limitation: because individual embedding components are not interpretable, you cannot inspect the strongest features or most distinctive objects to trace the root cause of the drift.

Share of drifted components

You can also monitor the share of drifted components. This approach treats each embedding component (dimension) independently and applies drift detection methods for numerical values to each one. The per-component results are then aggregated into the number, or share, of drifted components.
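
An illustrative sketch of this aggregation using per-component two-sample KS tests from scipy (not Evidently's implementation; the 0.05 threshold is an arbitrary choice):

import numpy as np
from scipy import stats

def share_of_drifted_components(reference_emb, current_emb, p_threshold=0.05):
    n_components = reference_emb.shape[1]
    drifted = 0
    for i in range(n_components):
        # Two-sample Kolmogorov-Smirnov test per embedding dimension.
        _, p_value = stats.ks_2samp(reference_emb[:, i], current_emb[:, i])
        if p_value < p_threshold:
            drifted += 1
    return drifted / n_components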

Practical

Setup

import pandas as pd
import numpy as np

from sklearn import datasets, ensemble, model_selection

# Core Evidently building blocks
from evidently import ColumnMapping
from evidently.report import Report

# Preset with text-specific metrics
from evidently.metric_preset import TextOverviewPreset

# Individual column- and dataset-level metrics
from evidently.metrics import ColumnSummaryMetric
from evidently.metrics import ColumnCorrelationsMetric
from evidently.metrics import ColumnDistributionMetric
from evidently.metrics import ColumnDriftMetric
from evidently.metrics import ColumnValueRangeMetric
from evidently.metrics import DatasetCorrelationsMetric
from evidently.metrics import DatasetDriftMetric
from evidently.metrics import DataDriftTable
from evidently.metrics import EmbeddingsDriftMetric

# Embedding drift detection methods (model-based, distance, ratio, MMD)
from evidently.metrics.data_drift.embedding_drift_methods import model, distance, ratio, mmd

# Text descriptors for turning raw text into numerical features
from evidently.descriptors import TextLength, TriggerWordsPresence, OOV, NonLetterCharacterPercentage
from evidently.descriptors import SentenceCount, WordCount, Sentiment, RegExp

Import data

import nltk
# Resources used by the text descriptors (e.g., OOV and Sentiment)
nltk.download('words')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('vader_lexicon')

# Women's e-commerce clothing reviews dataset from OpenML.
reviews_data = datasets.fetch_openml(name='Womens-E-Commerce-Clothing-Reviews', version=2, as_frame='auto')
reviews = reviews_data.frame

reviews['prediction'] = reviews['Rating']
# Reference: well-rated reviews; current: poorly rated ones. The split is
# artificial, so the two samples are guaranteed to differ (visible drift).
reviews_ref = reviews[reviews.Rating > 3].sample(n=5000, replace=True, ignore_index=True, random_state=42)
reviews_cur = reviews[reviews.Rating < 3].sample(n=5000, replace=True, ignore_index=True, random_state=42)

column_mapping = ColumnMapping(
    numerical_features=['Age', 'Positive_Feedback_Count'],
    categorical_features=['Division_Name', 'Department_Name', 'Class_Name'],
    text_features=['Review_Text', 'Title']
)

Text data overview report

text_overview_report = Report(metrics=[
    TextOverviewPreset(column_name="Review_Text")
])

text_overview_report.run(reference_data=reviews_ref, current_data=reviews_cur, column_mapping=column_mapping)
text_overview_report.show(mode='inline')

(Sample graphs from the report.)

Text data drift with descriptors

table_column_metrics_report = Report(metrics=[
    DatasetDriftMetric(columns=["Age", "Review_Text"]),
    DataDriftTable(columns=["Age", "Review_Text"]),
])

table_column_metrics_report.run(reference_data=reviews_ref, current_data=reviews_cur, column_mapping=column_mapping)
table_column_metrics_report.show(mode='inline')

Each column in the drift table has a dropdown with extra detail.
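
The descriptors imported in the setup can also be wrapped into individual column metrics. A sketch based on Evidently's legacy descriptor API (the .for_column usage may differ between versions, so treat this as an assumption to check against the docs):

descriptor_report = Report(metrics=[
    # Summary statistics of review length, computed as a descriptor.
    ColumnSummaryMetric(column_name=TextLength().for_column("Review_Text")),
    # Drift in sentiment scores between reference and current reviews.
    ColumnDriftMetric(column_name=Sentiment().for_column("Review_Text")),
])

descriptor_report.run(reference_data=reviews_ref, current_data=reviews_cur, column_mapping=column_mapping)
descriptor_report.show(mode='inline')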

Embedding drift detection

# Use the LFW faces dataset's pixel values as stand-in "embeddings".
embeddings_data = datasets.fetch_lfw_people()
embeddings_data = pd.DataFrame(embeddings_data['data'])
embeddings_data.columns = ['col_' + str(x) for x in embeddings_data.columns]

# Keep a manageable subset: 5100 rows, 31 columns.
embeddings_data = embeddings_data.iloc[:5100, :31]

# Create an artificially shifted copy: zero out the first 5 components
# in the second half of the rows.
embeddings_data_shifted = embeddings_data.copy()
embeddings_data_shifted.iloc[2500:5000, :5] = 0

column_mapping = ColumnMapping(
    embeddings={'small_subset': embeddings_data.columns[:10], 'big_subset': embeddings_data.columns[10:29]},
    target=embeddings_data.columns[30]
)

report = Report(metrics=[
    EmbeddingsDriftMetric('small_subset')
])

report.run(reference_data = embeddings_data[:2500], current_data = embeddings_data[2500:5000], 
           column_mapping = column_mapping)
report.show(mode='inline')

Use a specific drift detection method:

report = Report(metrics=[
    EmbeddingsDriftMetric('small_subset',
                          drift_method=ratio(
                              component_stattest='wasserstein',  # statistical test applied to each component
                              component_stattest_threshold=0.1,  # drift threshold for a single component
                              threshold=0.2,                     # share of drifted components that flags overall drift
                              pca_components=None,               # optionally reduce dimensionality with PCA first
                          ))
])

report.run(reference_data = embeddings_data_shifted[:2500], current_data = embeddings_data_shifted[2500:5000],  
           column_mapping = column_mapping)
report.show(mode='inline')
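
The distance and mmd helpers imported in the setup can be plugged in the same way (model is the model-based method, which Evidently uses by default). A minimal sketch assuming their default parameters:

report = Report(metrics=[
    # Maximum mean discrepancy between the two embedding distributions.
    EmbeddingsDriftMetric('small_subset', drift_method=mmd()),
    # Distance between dataset centroids.
    EmbeddingsDriftMetric('big_subset', drift_method=distance()),
])

report.run(reference_data=embeddings_data_shifted[:2500],
           current_data=embeddings_data_shifted[2500:5000],
           column_mapping=column_mapping)
report.show(mode='inline')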

This course is amazing. AI monitoring is crucial, and I am glad I found such a great course to build this skill. There are 3 more modules to cover (hopefully over the next few days).


That is all for today!

See you tomorrow :)