(Day 223) Finishing up Introducing MLOps

Ivan Ivanov · August 11, 2024

Hello :) Today is Day 223!

A quick summary of today:

  • read the last few chapters of Introducing MLOps
  • read chapter 4 of the fundamentals of data eng book
  • created a short demo video and submitted the KB AI competition project

Introducing MLOps Chapter 4: Developing Models

The first step in the MLOps cycle


In Theory

  • Representation: A machine learning model is an approximation of reality, representing certain aspects of a real-world process.
  • Foundation: Models are built using statistical theory and are crafted by algorithms from training data.
  • Generalization: The ability to predict unseen cases based on patterns learned from training data is known as generalization capacity.

In Practice

  • Model Components: A model comprises parameters that transform input data into predictions. This includes mathematical functions and derived data.
  • Application Example: Predicting house prices based on factors like area, bedrooms, and market conditions.
  • Outputs: Outputs can vary from single numbers to probabilities with confidence intervals, or even recommendations.

Required Components

  • Training Data: Labeled data used to train the model.
  • Performance Metric: The metric the model optimizes (e.g., accuracy, precision).
  • ML Algorithm: The method used to build the model, with different algorithms suited to different tasks.
  • Hyperparameters: Configurations for algorithms that influence how models learn.
  • Evaluation Dataset: Data separate from the training set to test the model’s generalization.
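
To make these components concrete, here is a minimal sketch (mine, not the book's) that wires all five together with scikit-learn on synthetic data:

```python
# Minimal sketch of the required components, using scikit-learn on synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Training data (labeled) plus a held-out evaluation set to check generalization.
X, y = make_classification(n_samples=1_000, n_features=10, random_state=42)
X_train, X_eval, y_train, y_eval = train_test_split(X, y, test_size=0.2, random_state=42)

# ML algorithm and the hyperparameters that influence how it learns.
model = RandomForestClassifier(n_estimators=200, max_depth=5, random_state=42)
model.fit(X_train, y_train)

# Performance metric computed on data the model has never seen.
print("eval accuracy:", accuracy_score(y_eval, model.predict(X_eval)))
```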

MLOps Considerations by Algorithm Type

  • Linear Models: Susceptible to overfitting.
  • Tree-based Models: Can be unstable; some, like Random Forests, are hard to interpret.
  • Deep Learning: Highly complex and resource-intensive, with low interpretability.

Computing Power

  • Role in ML: The evolution of ML has been tied to the availability of computing power, impacting algorithm complexity and usage.

Data Exploration

  • Initial Analysis: Involves understanding data through documentation, statistics, distribution analysis, and comparison to other models or datasets.

Feature Engineering and Selection

  • Purpose: Transform raw data into features that better represent the problem to the model.
  • Techniques: Includes derivatives, enrichment, encoding, and combinations of features.
  • Impact: The number and quality of features directly affect model performance and MLOps strategy.
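
As a toy illustration of these techniques (the housing columns below are made up, not from the book), a few lines of pandas can derive, encode, and combine features:

```python
import pandas as pd

# Hypothetical raw housing data; column names are illustrative only.
df = pd.DataFrame({
    "area_m2": [50, 120, 80],
    "bedrooms": [1, 3, 2],
    "city": ["Seoul", "Busan", "Seoul"],
    "sale_date": pd.to_datetime(["2024-01-05", "2024-03-20", "2024-06-30"]),
})

# Derivation: a new feature computed from existing ones.
df["area_per_bedroom"] = df["area_m2"] / df["bedrooms"]
# Encoding: turn a categorical variable into model-friendly columns.
df = pd.get_dummies(df, columns=["city"])
# Another derivation, this time from a date field.
df["sale_month"] = df["sale_date"].dt.month
print(df)
```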

Experimentation

  • Process: Iterative testing and adjustment of model components, hyperparameters, and features to optimize performance.
  • Bias/Variance Trade-off: Balancing between underfitting (high bias) and overfitting (high variance).
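
Here is roughly what one experimentation loop can look like in code: a hedged sketch using scikit-learn's GridSearchCV, where the regularization strength C is the knob that trades bias against variance (the dataset and values are illustrative):

```python
# Sketch of iterative experimentation: search hyperparameters with cross-validation.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Small C = strong regularization (risk of underfitting / high bias);
# large C = weak regularization (risk of overfitting / high variance).
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0, 100.0]},
    scoring="accuracy",
    cv=5,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```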

Evaluating and Comparing Models

  • Usefulness Over Perfection: Models should be judged on their practical utility rather than aiming for perfection.
  • Evaluation Metrics: Metrics should be chosen carefully to reflect the problem at hand, with methods like cross-validation ensuring robustness.
  • Model Behavior: Beyond metrics, understanding how models react to different inputs and across subpopulations is critical.
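
A small sketch of that last point: score subpopulations separately instead of relying on one global number (the data and the "region" column are dummy placeholders):

```python
# Sketch: look beyond one global metric by scoring subpopulations separately.
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score

# Dummy evaluation frame: true labels, model predictions, and a grouping attribute.
rng = np.random.default_rng(0)
eval_df = pd.DataFrame({
    "y_true": rng.integers(0, 2, 1000),
    "y_pred": rng.integers(0, 2, 1000),
    "region": rng.choice(["north", "south", "east"], 1000),
})

# Per-subpopulation accuracy; large gaps between groups deserve a closer look.
for region, group in eval_df.groupby("region"):
    print(region, round(accuracy_score(group["y_true"], group["y_pred"]), 3))
```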

Introducing MLOps Chapter 5: Preparing for Production


Runtime Environments in MLOps

  • Diverse Production Environments: Models need to be adaptable to various environments like custom services, data platforms, or Kubernetes. Ideally, a model runs in production as it did in development, but this is often challenging.

  • Adaptation Challenges: Moving from development to production may require anything from minimal changes to complete reimplementation, potentially delaying deployment by months or years.

  • Tooling: Early consideration of production formats (e.g., PMML, ONNX) is crucial. Tools should be integrated during development to avoid blocking the validation process (see the ONNX export sketch after this list).

  • Performance Optimization: Conversion for performance (e.g., Python to C++) might be necessary, especially for models running on low-power devices. Techniques like quantization, pruning, and distillation can reduce model size and improve speed.

  • Data Access: Models may need real-time data access in production. Proper setup of data connections and validation in a production-like environment is critical to avoid malfunctions.

  • Final Thoughts: Training is computationally intensive, but over a model's lifetime inference usually consumes more resources in total. Simplifying models for production can reduce costs and environmental impact, offering a viable alternative to complex models.
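
Picking up the tooling point above, this is what an export to ONNX might look like. A sketch only: it assumes skl2onnx and onnxruntime are installed, and the exact converter options depend on the model type:

```python
# Sketch: export a scikit-learn model to ONNX so the production runtime is
# decoupled from the Python training environment (assumes skl2onnx + onnxruntime).
import numpy as np
import onnxruntime as ort
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=200, n_features=4, random_state=0)
model = LinearRegression().fit(X, y)

# Declare the input signature and convert.
onnx_model = convert_sklearn(model, initial_types=[("input", FloatTensorType([None, 4]))])
with open("model.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

# Score with the ONNX runtime instead of scikit-learn.
session = ort.InferenceSession("model.onnx")
print(session.run(None, {"input": X[:1].astype(np.float32)})[0])
```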

Model Risk Evaluation

Purpose of Model Validation

  • Models are imperfect and can have bugs, leading to real-world consequences.
  • Validation is essential to anticipate and minimize risks in production.
  • Organizations must consider risks like malfunctions, attacks, financial, legal, and reputational damage.

Key Questions Before Production

  • Worst-case scenario for model behavior?
  • Potential for data or logic extraction?
  • Financial, business, legal, safety, and reputational risks?

Importance of Risk Awareness

  • High-risk applications require rigorous validation.
  • Machine learning introduces unique risks, needing careful management.

Origins of ML Model Risk

  • Risks arise from:
    • Design, training, or evaluation bugs.
    • Runtime framework bugs or incompatibilities.
    • Low-quality training data.
    • Differences between training and production data.
    • Misuse or misinterpretation.
    • Adversarial attacks.
    • Legal and reputational issues.
  • Risks are amplified by:
    • Broad model use.
    • Rapidly changing environments.
    • Complex model interactions.

Quality Assurance for Machine Learning

  • Maturity: QA in traditional software is mature; for ML, it’s still evolving and challenging to scale in MLOps.
  • Continuous QA: QA should occur throughout the model development process, not just at the final stage.
  • Validation: Those responsible for validation need ML expertise and authority to address risks and halt production if necessary.
  • Documentation: QA also involves documenting and validating models against organizational guidelines, including data origin and security.

Key Testing Considerations

  • Test Data: Test scenarios should include both realistic and problematic data (e.g., extreme values).
  • Metrics: Collect both statistical and computational metrics; tests should fail if key assumptions are not met.
  • Stability: Model stability (small input changes should cause small output changes) is crucial; simpler models often exhibit better stability.
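
One possible shape for such a stability check (the noise scale and tolerance are arbitrary, purely for illustration):

```python
# Sketch of a stability check: tiny input perturbations should rarely flip predictions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

rng = np.random.default_rng(0)
X_perturbed = X + rng.normal(scale=0.01, size=X.shape)  # ~1% noise on the inputs

flip_rate = np.mean(model.predict(X) != model.predict(X_perturbed))
print(f"flip rate under small perturbations: {flip_rate:.1%}")
# Arbitrary tolerance; in a real test pipeline a failure here would block the model.
assert flip_rate < 0.05, f"unstable: {flip_rate:.1%} of predictions flipped"
```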

Subpopulation Analysis and Model Fairness

  • Fairness: Assess fairness by analyzing subpopulations, particularly for sensitive variables.
  • Impact: Lack of fairness evaluation can lead to business, regulatory, and reputational risks.

Reproducibility and Auditability

  • Reproducibility: In MLOps, reproducibility involves rerunning experiments with the same data, code, and environment.
  • Auditability: Requires access to full model history, including documentation, artifacts, test results, and logs.

Machine Learning Security

  • Adversarial Attacks: Small input modifications can cause drastic prediction changes in models, especially in deep learning.
  • Training Data Poisoning: Manipulating training data can control model behavior (e.g., Microsoft’s 2016 Twitter bot incident).
  • Governance: Security must be embedded from the start; feedback mechanisms and governance structures are crucial.

Model Risk Mitigation

  • Deployment: Control deployment with progressive rollouts; monitor for behavior changes.
  • Changing Environments: Fast-changing inputs can lead to significant risks; models may need to operate in a degraded mode during remediation.
  • Interactions Between Models: Complexity grows with the number of models and their interactions; design independent processing chains to reduce risk.

Model Misbehavior

  • Input/Output Monitoring: Implement real-time checks on feature values and outputs to detect misbehavior.
  • Anomaly Detection: Use anomaly detection and certainty scores to validate predictions, especially in high-risk models.
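
A rough sketch of such guardrails (the feature names, ranges, and confidence threshold are all hypothetical):

```python
# Sketch of real-time guardrails: validate incoming feature values against expected
# ranges and flag low-certainty predictions. Names, ranges, and thresholds are hypothetical.
EXPECTED_RANGES = {"area_m2": (10.0, 1000.0), "bedrooms": (0, 10)}
MIN_CONFIDENCE = 0.6

def monitor_request(features: dict, predicted_proba: float) -> list[str]:
    """Return warnings to send to the logging/alerting system for one scoring request."""
    warnings = []
    for name, (lo, hi) in EXPECTED_RANGES.items():
        value = features.get(name)
        if value is None or not (lo <= value <= hi):
            warnings.append(f"{name}={value!r} outside expected range [{lo}, {hi}]")
    if predicted_proba < MIN_CONFIDENCE:
        warnings.append(f"low-certainty prediction: {predicted_proba:.2f}")
    return warnings

# Example request with an out-of-range value and a borderline prediction.
print(monitor_request({"area_m2": 5.0, "bedrooms": 2}, predicted_proba=0.55))
```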

Introducing MLOps Chapter 6: Deploying to Production


CI/CD Pipelines Overview

CI/CD (Continuous Integration/Delivery):

  • CI: Integrate and test code frequently, catching bugs early.
  • CD: Automate deployment of tested code to production, reducing manual intervention.

Key Components:

  1. Centralized Version Control: Git for code; additional tools for data versioning.
  2. Artifacts: Bundled code, model, data, environment, and tests ready for deployment.
  3. Testing Pipeline: Automated tests to ensure model performance and compliance.
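
For the testing pipeline, one common pattern is a pytest-style gate that fails CI when a candidate model drops below an agreed metric floor. The artifact paths and threshold below are hypothetical:

```python
# Sketch of a CI gate, run with pytest: fail the pipeline if the candidate model
# scores below the acceptance threshold on the held-out set. Paths are hypothetical.
import joblib
import pandas as pd
from sklearn.metrics import accuracy_score

MIN_ACCURACY = 0.85  # illustrative acceptance threshold

def test_candidate_model_meets_accuracy_floor():
    model = joblib.load("artifacts/candidate_model.joblib")  # hypothetical model artifact
    eval_df = pd.read_csv("artifacts/holdout.csv")           # hypothetical holdout data
    preds = model.predict(eval_df.drop(columns=["label"]))
    assert accuracy_score(eval_df["label"], preds) >= MIN_ACCURACY
```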

Deployment Strategies

  • Integration: Merging code changes and testing.
  • Delivery: Ready-to-deploy, validated models.
  • Deployment: Running models on target infrastructure.
  • Release: Controlling when the new model handles production workload.

Deployment Approaches

  • Batch Scoring: Process entire datasets (e.g., daily jobs).
  • Real-time Scoring: Handle individual requests (e.g., user interactions).
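
For the real-time case, the scoring service can be as small as this sketch (assuming Flask and a model artifact saved with joblib; this is not the book's code):

```python
# Minimal sketch of a real-time scoring service.
import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("model.joblib")  # hypothetical artifact from the training pipeline

@app.route("/predict", methods=["POST"])
def predict():
    # Expecting a JSON body like {"features": [[5.1, 3.5, 1.4, 0.2]]}
    features = request.get_json()["features"]
    return jsonify({"prediction": model.predict(features).tolist()})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```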

Model Deployment Strategies

  • Blue-Green: Run new and old versions in parallel, switch to new once stable.
  • Canary Release: Gradually shift traffic to a new version, monitoring performance.
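
One simple way to implement the canary split is deterministic hash-based routing, so each user always lands on the same version; the traffic share and model names below are made up:

```python
# Sketch of canary routing: send ~10% of users to the new model, deterministically.
import hashlib

CANARY_SHARE = 0.10

def pick_model_version(user_id: str) -> str:
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # map the user to [0, 1]
    return "model_v2_canary" if bucket < CANARY_SHARE else "model_v1_stable"

# Quick check of the resulting split.
users = [f"user-{i}" for i in range(1000)]
canary_hits = sum(pick_model_version(u) == "model_v2_canary" for u in users)
print(f"{canary_hits / len(users):.1%} of traffic routed to the canary")
```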

Maintenance

  • Resource Monitoring: Track CPU, memory, etc.
  • Health Checks: Ensure the model is online and responsive.
  • ML Metrics Monitoring: Track model accuracy and relevance over time.

Containerization

  • Docker: Packages applications with all dependencies, ensuring consistency.
  • Kubernetes: Orchestrates containers, manages scaling, updates, and failures.

Scaling Challenges

  • Horizontal Scaling: Add more machines to handle increased loads.
  • Elastic Systems: Automatically adjust resources based on demand.

Batch Processing

  • Spark with Kubernetes: Distributes computation across multiple nodes.

Managing Multiple Models

  • Standardized CI/CD Pipelines: Ensures consistency and efficiency across deployments.
  • Tools: E.g., Spinnaker for managing complex deployment scenarios.

Logging Requirements

  • Centralized logging for tracking model performance and identifying issues across multiple servers and model versions.

Introducing MLOps Chapter 7: Monitoring and Feedback Loop


Key Factors (in deciding how often to retrain):

  • Domain: Fields like cybersecurity need frequent updates; stable domains, less so.
  • Cost: Evaluate if retraining costs justify performance gains.
  • Model Performance: Retrain based on data sufficiency and performance monitoring.

Organizational Boundaries:

  • Upper Bound: At least once a year for skill retention and toolchain readiness.
  • Lower Bound: Likely not more than once a day, depending on feedback speed.

Online Learning:

  • Costly but necessary for streaming data scenarios.
  • Requires rigorous testing due to iterative learning.

Drift Detection:

  • Input Drift: Monitor feature values to detect performance changes.
  • Techniques: Use univariate statistical tests or domain classifiers to detect drift.
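
For the univariate route, a two-sample Kolmogorov-Smirnov test from SciPy is a common choice. A sketch with synthetic feature values and an illustrative significance threshold:

```python
# Sketch of univariate input-drift detection with a two-sample KS test.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_feature = rng.normal(loc=0.0, size=5000)  # feature values seen at training time
prod_feature = rng.normal(loc=0.3, size=5000)   # recent production values (shifted on purpose)

stat, p_value = ks_2samp(train_feature, prod_feature)
# Illustrative rule: raise an alert when the two distributions differ significantly.
if p_value < 0.01:
    print(f"drift detected: KS statistic={stat:.3f}, p={p_value:.2e}")
```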

Feedback Loop:

  • Components: Logging system, model evaluation store, and online model comparison.
  • Methods:
    • Champion/Challenger: Shadow testing for model comparison.
    • A/B Testing: Used when direct comparison isn’t possible, e.g., ad engines.

Conclusion:

  • Retraining is essential, with frequency determined by the domain, cost, and model performance. Continuous monitoring and a robust feedback loop are critical.

Introducing MLOps Chapter 8: Model Governance

Model governance aims to ensure that the business delivers on its responsibilities to all stakeholders, from shareholders and employees to the public and national governments. These responsibilities include financial, legal, and ethical obligations, all underpinned by the desire for fairness.

Financial Model Risk Management Regulation

In finance, model risk is the risk of incurring losses when the models used for making decisions about tradable assets prove to be inaccurate. These models, such as the Black–Scholes model, existed long before the arrival of ML.

The UK Prudential Regulation Authority’s (PRA) regulation, for example, defines four principles for good MRM:

Model definition

  • Define a model and record such models in inventory.

Risk governance

  • Establish model risk governance framework, policies, procedures, and controls.

Life cycle management

  • Create robust model development, implementation, and usage processes.

Effective challenge

  • Undertake appropriate model validation and independent review.

Key elements of responsible AI

  1. Data
  2. Bias
  3. Inclusiveness
  4. Model Management at Scale
  5. Governance

A template for MLOps governance

  1. Understand and Classify the Analytics Use Cases
  2. Establish an Ethical Position
  3. Establish Responsibilities


  4. Determine Governance Policies. Example measures for each governance consideration:
    • Reproducibility and Traceability: Full VM and data snapshot for rapid model re-instantiation, ability to re-create the environment and retrain with a data sample, or record only metrics of deployed models?
    • Audit and Documentation: Full log of all changes during development including experiments and reasons for choices, automated documentation of the deployed model only, or no documentation at all?
    • Human-in-the-Loop Sign-Off: Multiple sign-offs for every environment move (development, QA, preproduction, production).
    • Preproduction Verification: Hand-code the model for verification, automated test pipeline with extensive unit and end-to-end test cases, or automated checks on database, software version, and standards?
    • Transparency and Explainability: Manually coded decision tree for maximum explainability, use of regression algorithms’ explainability tools like Shapley values, or acceptance of opaque algorithms?
    • Bias and Harm Testing: “Red team” adversarial manual testing with multiple tools, or automated bias checking on specific subpopulations.
    • Production Deployment Modes: Containerized deployment to a scalable, high-availability multi-node configuration with automated stress/load testing, or a single production server?
    • Production Monitoring: Real-time alerting, dynamic model balancing, automated nightly retraining and redeployment; weekly input drift monitoring with manual retraining; or basic infrastructure alerts?
    • Data Quality and Compliance: PII considerations including anonymization, documented and reviewed column-level lineage, and automated data quality checks for anomalies.

  5. Integrate Policies into the MLOps Process. Example activities and governance considerations for each process step:
    • Business Scoping: Record objectives, define KPIs, and record sign-off (for internal governance considerations).
    • Ideation:
      • Data discovery: data quality and regulatory compliance constraints.
      • Algorithm choice: impacted by explainability requirements.
    • Development:
      • Data preparation: consider PII compliance, separation of legal regional scopes, avoid input bias.
      • Model development: consider model reproducibility and auditability.
      • Model testing and verification: bias and harm testing, explainability.
    • Preproduction:
      • Verify performance/bias with production data.
      • Production-ready testing: verify scalability.
    • Deployment:
      • Deployment strategy: driven by the level of operationalization.
      • Deployment verification tests.
      • Use of shadow challenger or A/B test techniques for in-production verification.
    • Monitoring and Feedback:
      • Performance metrics and alerting.
      • Prediction log analysis for input drift with alerting.

  6. Select the Tools for Centralized Governance Management
  7. Engage and Educate
  8. Monitor and Refine

Introducing MLOps Chapter 9: MLOps in Practice: Consumer Credit Risk Management

Background: The Business Use Case

When a consumer applies for a loan, the credit institution must decide whether to grant it, often using scores to estimate the likelihood of repayment. The process involves:

  • Prescreen Stage: A preliminary score with minimal features to quickly filter out some applications.
  • Underwriting Stage: A detailed score based on comprehensive information for a more accurate decision.
  • Post-Underwriting: Scores to assess the risk of loans in the portfolio.

Historical Context:

  • FICO Score: Used since 1995 in the U.S. for credit risk assessment.
  • Regulations: Models, whether rule-based, statistical, or machine learning, must comply with strict regulations.

Credit Decision Information:

  • Data Availability: Includes customer’s credit history and repayment behavior.
  • Outcome Definition: “Bad” outcomes (non-repayment) are often determined before the final repayment data is available.

Model Development

  • Traditional Methods: Relied on expert rules and logistic regression.
  • Modern Techniques: Gradient boosting algorithms like XGBoost offer better model-building but come with challenges like the black-box effect.
  • Validation: Methods like Shapley values are used to build trust and validate models.

Model Bias Considerations

  • Selection Bias: Models used for loan rejection must avoid cherry-picking and ensure accurate predictions across all applicant segments.
  • Probability Calibration: Models must produce accurate probabilities, not just binary outcomes, which often requires calibration techniques.
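
A sketch of what calibration can look like with scikit-learn (isotonic here; Platt/sigmoid scaling is the other usual option; the dataset is synthetic and imbalanced on purpose):

```python
# Sketch: wrap a boosted classifier in a calibrator so predict_proba behaves
# like a real default probability rather than an uncalibrated score.
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=3000, weights=[0.9, 0.1], random_state=0)

calibrated = CalibratedClassifierCV(
    GradientBoostingClassifier(random_state=0), method="isotonic", cv=5
).fit(X, y)
print(calibrated.predict_proba(X[:3]))
```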

Validation and Monitoring

  • Validation: Includes performance testing on out-of-sample data and analysis of subpopulations.
  • Monitoring: Both data and performance drift must be assessed. Data drift involves metrics like Kolmogorov-Smirnov and stability indices, while performance drift is monitored with metrics like AUC.
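
A stability index takes only a few lines. Below is one common formulation of the population stability index (PSI); the binning scheme and the 0.1/0.25 rules of thumb are conventions, not something from the book:

```python
# Sketch of a population stability index (PSI) between a baseline and a recent sample.
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """PSI over quantile bins of the baseline; a small epsilon avoids log(0)."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    e_frac = np.histogram(expected, bins=edges)[0] / len(expected) + 1e-6
    a_frac = np.histogram(np.clip(actual, edges[0], edges[-1]), bins=edges)[0] / len(actual) + 1e-6
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

rng = np.random.default_rng(0)
baseline = rng.normal(size=10_000)          # e.g. score distribution at validation time
recent = rng.normal(loc=0.2, size=10_000)   # e.g. scores on recent applications
# Rule of thumb: PSI < 0.1 stable, 0.1 to 0.25 worth watching, > 0.25 significant shift.
print(f"PSI = {psi(baseline, recent):.3f}")
```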

Prepare for Production

  • Documentation: Detailed records of data, model assumptions, validation methods, and monitoring approaches are essential.
  • Validation Process: Independent reviews and potential model rebuilds ensure accuracy and understanding.

Deploy to Production

  • Environment: Production environments are typically separate and may use different technical stacks.
  • Deployment: Ranges from simple operations to complex integrations. Automation of monitoring depends on the frequency and criticality of the model.

Fundamentals of Data Engineering Chapter 4: Choosing Technologies Across the Data Engineering Lifecycle

Team Size and Capabilities

The first thing you need to assess is your team’s size and its capabilities with technology. Are you on a small team (perhaps a team of one) of people who are expected to wear many hats, or is the team large enough that people work in specialized roles? Will a handful of people be responsible for multiple stages of the data engineering lifecycle, or do people cover particular niches? Your team’s size will influence the types of technologies you adopt.

Speed to Market

In technology, speed to market wins. This means choosing the right technologies that help you deliver features and data faster while maintaining high-quality standards and security. It also means working in a tight feedback loop of launching, learning, iterating, and making improvements.

Interoperability

Rarely will you use only one technology or system. When choosing a technology or system, you’ll need to ensure that it interacts and operates with other technologies. Interoperability describes how various technologies or systems connect, exchange information, and interact.

Today Versus the Future: Immutable Versus Transitory Technologies


Given the rapid pace of tooling and best-practice changes, we suggest evaluating tools every two years. Whenever possible, find the immutable technologies along the data engineering lifecycle, and use those as your base. Build transitory tools around the immutables.

Location

  • on-premise
  • cloud
  • hybrid
  • multi-cloud

Build Versus Buy

Build versus buy comes back to knowing your competitive advantage and where it makes sense to invest resources toward customization. In general, we favor open-source (OSS) software and Commercial OSS by default, which frees you to focus on improving those areas where these options are insufficient. Focus on a few areas where building something will add significant value or reduce friction substantially.

Monolith Versus Modular

While monoliths are attractive because of ease of understanding and reduced complexity, this comes at a high cost. The cost is the potential loss of flexibility, opportunity cost, and high-friction development cycles.

Serverless Versus Servers

Key factors to consider:

  • workload size and complexity
  • execution frequency and duration
  • requests and networking
  • language
  • runtime limitations
  • cost

Kubernetes for the Absolute Beginners - Hands-On

This is a video course made up of short lectures. Tbh I clicked on it by mistake but decided to watch a bit. Turns out it is quite interesting. Below are some of the interesting bits I learned.


What is a Pod?

  • Pod: The smallest Kubernetes object that encapsulates one or more containers.
  • Function: Encapsulates a container to deploy an application on a set of worker nodes.

Scaling

  • Scaling Up: Create new pods for additional instances of the application, not more containers within the same pod.
  • Scaling Down: Delete existing pods to reduce the number of instances.
  • Capacity: If more capacity is needed, add new nodes to the cluster and deploy additional pods there.

Multiple Containers in a Pod

  • Single Container: A pod usually has a 1:1 relationship with containers.
  • Multiple Containers: Possible within a single pod for specific use cases, such as having a helper container alongside the main application container.
    • Communication: Containers in the same pod share the same network and storage.
    • Lifecycle: Helper containers are created and destroyed with the main container.

Comparing with Docker Containers

  • Without Kubernetes: Manual management of networking, volumes, and state for multiple containers.
  • With Kubernetes: Kubernetes manages these aspects automatically within pods, simplifying deployment and management.

Pod Deployment

  • Command: kubectl run deploys a Docker container by creating a pod.
    • Image: Specify the image using the --image parameter (e.g., Nginx from Docker Hub).
  • Viewing Pods: Use kubectl get pods to see the list and status of pods.

Replicas

The ReplicaSet controller manages the desired number of pod replicas, ensuring that the specified number of identical pods are running at all times. This helps balance load and provides redundancy in case of pod failures.

Finally, we submitted the KB project

At the last second I added a small video demo of the project. Instead of speaking, I added subtitles. Here is the youtube link <- it is in Korean only.

They do not mention when they will announce the people invited to the finals, which are in-person on the 28th of Aug, so we have to just wait and see ~

That is all for today!

See you tomorrow :)