(Day 287) Calibrating classification models

Ivan Ivanov · October 14, 2024

Hello :) Today is Day 287!

A quick summary of today:

  • finished MLE by Andriy Burkov
  • streamed for 2 hours (and learned a lot about model calibration)
  • finished the LLM fine-tuning (for now)

MLE by Andriy Burkov - Chapter 9 - Model serving, Monitoring, and Maintenance


Properties of the model serving runtime

The model serving runtime is the environment in which the model is applied to the input data. The runtime’s properties are dictated by the model deployment pattern.

  • security and correctness: the runtime is responsible for authenticating the user’s identity and authorising their requests. Things to check:
    • whether a specific user has authorised access to the models they want to run
    • whether the names and the values of params passed correspond to the model’s requirements
    • whether those params and their values are currently available to the user
  • ease of deployment: the model runtime should allow easy updates without disrupting the entire application. For a web service on a physical server, updating the model involves replacing the model file and restarting the service. For virtual machines or containers, updating requires stopping old instances and starting new ones with the updated model image. In model streaming-based applications, the new model version is streamed, making the application stateful. Once updated, the application contains the new model and related components like feature extractors, with modern stream-processing engines supporting these stateful updates


  • guarantees of model validity: an effective runtime will automatically make sure the model it’s running is valid (along with the feature extractor and any other components). The model should not be served in prod if either of the two conditions below holds (a minimal check is sketched after this list):
    • at least one of the end2end test examples was not scored correctly, or
    • the value of the metric, calculated on the confidence test set examples, is not within the acceptable range
  • ease of recovery: the runtime allows easy fallback to previous versions in case of errors
  • avoidance of training/serving skew: avoid using two different codebases (i.e. one for training and one for serving). If two codebases are unavoidable, make sure to log the feature values generated in the prod env
  • avoidance of hidden feedback loops (using one model’s output as the input for another)
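
A minimal sketch of the validity check mentioned above, under my own assumptions: a scikit-learn-style classifier, accuracy as the metric, and an invented acceptable range. The book describes the two conditions, not this particular code.

```python
from sklearn.metrics import accuracy_score

def model_is_valid(model, e2e_examples, conf_X, conf_y, acceptable_range=(0.90, 1.0)):
    """Return True only if the model may be served in prod."""
    # condition 1: every end-to-end test example must be scored correctly
    for x, expected in e2e_examples:
        if model.predict([x])[0] != expected:
            return False
    # condition 2: the metric on the confidence test set must stay in the acceptable range
    metric = accuracy_score(conf_y, model.predict(conf_X))
    low, high = acceptable_range
    return low <= metric <= high
```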

Modes of model serving

Serving in batch mode

  • batch mode is used when a model processes large amounts of input data, such as all users of a product or all incoming events
  • it is more resource-efficient compared to on-demand mode and tolerates some latency
  • a typical batch size ranges from 100 to 1,000 feature vectors, often using powers of two (e.g. 32, 64, 128)
  • the output is usually saved in a database for future use (see the sketch after this list)
  • examples of batch mode applications:
    • generating weekly music recommendations
    • classifying incoming comments as spam or not
    • extracting named entities from documents for search engines
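
As a rough illustration of the batch flow above, here is a toy scoring loop that chunks feature vectors into fixed-size batches and writes the scores into a local SQLite table. The model interface, table name, and batch size are placeholders I picked, not anything prescribed by the book.

```python
import sqlite3
import numpy as np

def serve_in_batches(model, feature_vectors, db_path="predictions.db", batch_size=256):
    """Score feature vectors in fixed-size batches and persist the results."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS predictions (row_id INTEGER, score REAL)")
    for start in range(0, len(feature_vectors), batch_size):
        batch = np.asarray(feature_vectors[start:start + batch_size])
        scores = model.predict_proba(batch)[:, 1]  # probability of the positive class
        rows = [(start + i, float(s)) for i, s in enumerate(scores)]
        conn.executemany("INSERT INTO predictions VALUES (?, ?)", rows)
    conn.commit()
    conn.close()
```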

Serving on demand to a human

  • the six steps for serving a model on demand to a human (sketched in code after this list):
    1. validate the request
    2. gather the context
    3. transform context into model input
    4. apply the model and get output
    5. ensure the output makes sense
    6. present the output to the user
  • context refers to the user’s situation, which is provided either explicitly (as in music recommendations) or implicitly (as in auto-replies)
  • a good context provides enough data for feature extraction, logging, and debugging
  • examples of good context for various problems:
    • device malfunctioning: vibration, noise levels, firmware, etc
    • emergency room: age, vital signs, medical history, etc
    • credit risk: salary, debts, past credit payments, etc
    • advertisement display: webpage title, user position, browser info, etc
  • the feature extractor transforms the context into model input, often as a separate object in the pipeline
  • confidence scores are critical before serving the result, and low confidence can trigger prompts for user verification
  • prediction confidence can be combined with error cost estimates to guide decision-making and manage risk
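
Here is how those six steps might line up in code. Everything below (the request shape, the 0.6 confidence threshold, the helper objects) is an assumption for illustration; the book describes the steps, not this implementation.

```python
def serve_on_demand(request, model, feature_extractor):
    # 1. validate the request
    if not request.get("user_id"):
        return {"error": "invalid request: missing user_id"}
    # 2. gather the context (the user's situation, provided explicitly or implicitly)
    context = request.get("context", {})
    # 3. transform the context into model input
    features = feature_extractor.transform([context])
    # 4. apply the model and get the output
    proba = model.predict_proba(features)[0]
    label, confidence = int(proba.argmax()), float(proba.max())
    # 5. ensure the output makes sense; low confidence triggers user verification
    if confidence < 0.6:  # illustrative threshold
        return {"prediction": None, "ask_user_to_confirm": True}
    # 6. present the output to the user
    return {"prediction": label, "confidence": confidence}
```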

Serving on demand to a machine

Machines are often served through streaming, as their data requirements are fixed and predetermined. A well-designed streaming application topology efficiently utilizes available resources. Demand can fluctuate, with high demand during the day and low at night. Autoscaling in the cloud helps manage this but isn’t fast enough for sudden spikes.

To handle these spikes, message brokers like RabbitMQ or Apache Kafka are used. On-demand requests are placed in an input queue. The model reads input data in batches from the queue, processes them, and writes predictions to the output queue. A separate process reads the predictions from the output queue and delivers them to users. This setup helps manage demand spikes and improves resource efficiency.
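
A toy version of that topology, using Python’s standard-library queue.Queue as a stand-in for a real broker such as RabbitMQ or Kafka; the request/response dictionaries and the batch size are made up for the sketch.

```python
import queue

input_queue = queue.Queue()   # requests arriving from clients
output_queue = queue.Queue()  # predictions to be delivered back to users

def model_worker(model, batch_size=64):
    """Read requests in batches from the input queue, score them,
    and write the predictions to the output queue."""
    while True:
        batch = [input_queue.get()]  # block until at least one request arrives
        while len(batch) < batch_size and not input_queue.empty():
            batch.append(input_queue.get())
        features = [req["features"] for req in batch]
        predictions = model.predict(features)
        for req, pred in zip(batch, predictions):
            output_queue.put({"request_id": req["request_id"], "prediction": pred})
```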


Model serving in the real world

  • be ready for errors
  • dealing with errors


  • be ready for, and dealing with, change
  • be ready for, and dealing with, human nature
    • avoid confusion
    • manage expectations
    • gain trust
    • manage user fatigue (and how the model affects the UX)
    • beware of the creep factor (when the user perceives the model’s predictive capacity as too high, they feel uncomfortable, especially when a prediction concerns their very private details. Make sure the system doesn’t feel like “big brother” and doesn’t take on too much responsibility)

Model monitoring

(just a reminder to myself, and anyone reading - EvidentlyAI offers an amazing and 100% free course on AI monitoring)

Monitoring makes sure that:

  • the model is served correctly
  • the performance of the model remains within acceptable limits

What can go wrong?

Monitoring models in production is crucial to catch potential issues early, such as:

  • degraded performance: new training data might worsen model accuracy
  • data drift: changes in live data without model updates can lead to outdated performance
  • feature extraction issues: updates to feature extraction code that aren’t reflected in the model can lead to unpredictable results
  • resource changes: loss or modification of resources needed for feature generation may impact model output
  • adversarial attacks: models, especially in e-commerce and media, can be targeted by malicious actors aiming to exploit weaknesses or manipulate behavior

Additional training data can introduce biases or inconsistencies if not labeled correctly. Furthermore, dual-use concerns arise when models are repurposed for harmful applications, such as voice impersonation for fraud or surveillance in weapon systems.

What and how to monitor?

Effective monitoring is essential to ensure the model performs well against a regularly updated confidence test set, avoiding distribution shifts. Regular testing on end-to-end examples is also necessary. Key metrics to monitor include accuracy, precision, recall, and particularly prediction bias. A well-calibrated model should show a similar distribution of predicted and observed classes; deviations indicate potential issues that need investigation.
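
Since calibration was today’s theme: one concrete way to look at prediction bias is scikit-learn’s calibration_curve, which compares the mean predicted probability with the observed fraction of positives per bin. The report format below is my own; y_true and y_prob are assumed to be binary labels and predicted probabilities.

```python
import numpy as np
from sklearn.calibration import calibration_curve

def prediction_bias_report(y_true, y_prob, n_bins=10):
    # fraction of observed positives vs. mean predicted probability in each bin;
    # a well-calibrated model keeps these two close to each other
    frac_positives, mean_predicted = calibration_curve(y_true, y_prob, n_bins=n_bins)
    # overall prediction bias: average predicted probability minus observed positive rate
    bias = float(np.mean(y_prob) - np.mean(y_true))
    return {"prediction_bias": bias, "per_bin": list(zip(mean_predicted, frac_positives))}
```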

Monitoring should also track changes in data sources, such as abandoned database columns or changes in definitions/formatting. Statistical tests like the Chi-square and Kolmogorov–Smirnov tests can identify significant shifts in feature value distributions. Alerts should be sent to stakeholders upon detection of these shifts.
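
For example, a two-sample Kolmogorov–Smirnov check on a single numeric feature with SciPy; the 0.01 significance level is just a placeholder for deciding when to alert.

```python
from scipy.stats import ks_2samp

def feature_drifted(reference_values, live_values, alpha=0.01):
    """Compare training-time and live values of one feature."""
    statistic, p_value = ks_2samp(reference_values, live_values)
    return p_value < alpha  # True => the distributions differ enough to alert stakeholders
```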

Also, monitoring should include:

  • numerical stability: alert if NaNs or infinite values are encountered
  • computational performance: detect and warn about regressions and suspicious usage fluctuations, such as significant changes (e.g. 30% hourly or 15% weekly) in model serving numbers
  • prediction statistics: monitor minimal and maximal prediction values, as well as the median, mean, standard deviation, latency, memory, and CPU usage (a combined check is sketched after this list)
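
A minimal pass over one batch of predictions covering the numerical-stability and prediction-statistics checks above (the alert is just a print here; in a real setup it would go to the monitoring system).

```python
import numpy as np

def monitor_predictions(predictions, latency_seconds):
    preds = np.asarray(predictions, dtype=float)
    # numerical stability: alert if NaNs or infinite values are encountered
    if not np.all(np.isfinite(preds)):
        print("ALERT: NaN or infinite prediction values encountered")
    # prediction statistics: min, max, median, mean, standard deviation, latency
    return {
        "min": float(preds.min()),
        "max": float(preds.max()),
        "median": float(np.median(preds)),
        "mean": float(preds.mean()),
        "std": float(preds.std()),
        "latency_s": latency_seconds,
    }
```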

For recommender systems, monitoring click-through rates (CTR) is important - a decline indicates the need for model updates.

A balance is needed between alerting on every minor metric change and causing alert fatigue. Stakeholders can be allowed to set their own alert thresholds for non-critical cases.

Logging monitoring events is crucial for traceability. Monitoring tools should visualize performance trends and allow metric computation on data slices, enabling targeted analysis of model degradation.

Finally, logging additional data can assist in identifying problem sources and improving existing or future models, even if it cannot be analyzed in real-time.

What to log?

It is important to log enough information to reproduce any erratic system behavior during a future analysis. If the model is served to a front-end user, such as a website visitor or a mobile application user, it’s worth saving the user’s context at the moment of model serving.

It is also useful to include the model input, that is, the features extracted from the context, and the time it took to generate those features.

Also:

  • the model’s output and the time it took to generate it
  • the new context of the user (once they have observed the model’s output)
  • the user’s reaction to the output (whether they clicked, where, and how long it took); these fields come together in the sketch after this list
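
Put together, one log record per served request might look like the sketch below; the field names are mine, not the book’s.

```python
import json
import time

def log_serving_event(user_context, features, output, feature_seconds, model_seconds,
                      user_reaction=None):
    record = {
        "timestamp": time.time(),
        "context": user_context,           # the user's situation at serving time
        "model_input": features,           # features extracted from the context
        "feature_extraction_seconds": feature_seconds,
        "model_output": output,
        "model_seconds": model_seconds,
        "user_reaction": user_reaction,    # e.g. click position and time-to-click, filled in later
    }
    print(json.dumps(record, default=str))  # in practice: ship to the logging pipeline
    return record
```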

Monitoring for abuse

Monitoring for abuse is crucial as some users may exploit your model for their business needs, sending an excessive number of requests or attempting to reverse-engineer training data. Here are some strategies to prevent such abuse:

  • charge users: implement a pay-per-request model
  • introduce delays: create longer response times for requests
  • user blocking: block users who exhibit suspicious behavior

Additionally, attackers might manipulate the model by submitting data that benefits them, potentially degrading overall model quality. To mitigate this risk:

  • data validation: only trust user data if corroborated by multiple users
  • reputation scoring: assign reputation scores to users and be cautious with low-reputation data
  • behavior classification: classify user behavior as normal or abnormal, and reject data from those exhibiting abnormal patterns

To stay ahead of potential attacks, regularly update your models and incorporate new data and features to detect fraudulent activities effectively.

Model maintenance

Most production models must be regularly updated. The rate depends on several factors:

  • how often it makes errors and how critical they are
  • how “fresh” the model should be, so as to be useful
  • how fast new training data becomes available
  • how much time it takes to retrain a model
  • how costly it is to deploy the model
  • how much a model update contributes to the product and the achievement of user goals

When to update?

When first deployed, a model may not be perfect and will require updates to correct critical errors. Over time, models can stabilize, needing fewer updates, but some models, like recommender systems, must always remain ‘fresh’ and updated frequently. The frequency of updates depends on business needs, user demands, and the speed of new data availability. In some cases, such as with machine translation models, updates can happen less frequently. But when new data streams in rapidly, labeling delays may affect update timing.

Complex models, especially those requiring hyperparameter tuning, can take significant time to retrain, potentially lasting days or weeks. Using parallelizable algorithms and GPUs can help speed up the process. In cases where updates are costly, such as in healthcare, less frequent updates might be necessary. Sometimes, even a small model improvement can provide substantial business outcomes, especially when user disturbance is low and deployment costs are manageable.

How to update?

Ideally, new model versions should be deployed without interrupting the system. This is often achieved in virtualized or containerized infrastructures by replacing the image of a virtual machine or container. The deployment process typically involves versioning data, code, and models, ensuring that when a new model is trained, it is tested before replacing the old model in production. Techniques like A/B testing or multi-armed bandit algorithms can be used to determine when to replace a model.
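
As a toy example of the multi-armed bandit idea for choosing between the old and new model version: an epsilon-greedy router, where the reward could be a click or any other success signal. All names and numbers are illustrative, not from the book.

```python
import random

class EpsilonGreedyRouter:
    """Route traffic between model versions, favouring the one with better feedback."""
    def __init__(self, models, epsilon=0.1):
        self.models = models                           # e.g. {"old": old_model, "new": new_model}
        self.epsilon = epsilon
        self.rewards = {name: 0.0 for name in models}
        self.counts = {name: 0 for name in models}

    def pick_version(self):
        # explore with probability epsilon, otherwise exploit the best average reward so far
        if random.random() < self.epsilon:
            return random.choice(list(self.models))
        return max(self.models, key=lambda n: self.rewards[n] / max(self.counts[n], 1))

    def record_feedback(self, name, reward):
        self.counts[name] += 1
        self.rewards[name] += reward
```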

Monitoring for distribution shifts is important, and when detected, human labelers may validate the changes. In streaming or on-demand model scenarios, updates are triggered when new data or state changes. A message broker architecture, involving a human in the loop for labeling, can facilitate model serving and updating. Continuous integration workflows are useful, ensuring that models are retrained as soon as new data becomes available. In these workflows, storing hyperparameters, labeler identities, and model versions is essential for traceability.

Before deploying new models, ensure the system has adequate storage and RAM, as newer models may be larger and slower. Monitor whether the new model introduces more costly errors or disproportionately affects certain user groups. If a validation fails, avoid deploying the model and roll it back if issues arise post-deployment.

Finally, in systems with model cascading, ensure that updating one model does not negatively affect others in the sequence. If models are interdependent, all models in the cascade must be updated accordingly.

This book is awesome!

Fine-tuning gpt-4o-mini and submitting all the results to my professor


Here is a link to the sheet

Streamed

Part 1

Then my internet stopped for a minute so I had to reset my stream, and

Part 2

On stream I also tested and learned more about calibration, and I ended up putting it all in a Kaggle notebook.


That is all for today!

See you tomorrow :)