(Day 349) Finished the Effective ML teams book

Ivan Ivanov · December 15, 2024

Hello :) Today is Day 349!

A quick summary of today:

  • Chapter 3 - Effective Dependency Management: Principles and Tools
  • Chapter 4 - Effective Dependency Management in Practice
  • Chapter 5 - Automated Testing: Move Fast Without Breaking Things
  • Chapter 6 - Automated Testing: ML Model Tests
  • Chapter 7 - Supercharging Your Code Editor with Simple Techniques
  • Chapter 8 - Refactoring and Technical Debt Management
  • Chapter 9 - MLOps and Continuous Delivery for ML (CD4ML)
  • Chapter 10/11 - Building Blocks of Effective ML Teams/Organisations

Effective Machine Learning Teams - Chapter 3 - Effective Dependency Management: Principles and Tools

What if our code could run everywhere, every time?

image

A Better Way: Check Out and Go

Joining a team, running a command like ./go.sh, and being done - this is known as ‘check out and go’. By adopting this practice, we get the following benefits:

  • faster onboarding
  • saving time and cognitive resources
  • experiment-operation symmetry
  • reproducibility and repeatability

Principles for Effective Dependency Management

image

With good dependency management practices, we can create consistent, production-like, and reproducible environments from the get-go

Tools for Dependency Management

In a nutshell, to effectively manage the dependencies of a given project, we need to:

  • specify OS-level dependencies as code (by using Docker)
  • specify application-level dependencies as code (by using Poetry)

Docker crash course

The second part of the chapter went a bit more in-depth on Docker basics, and also covered a library called batect, which seems to have been archived in 2023 and is no longer maintained

image


Effective Machine Learning Teams - Chapter 4 - Effective Dependency Management in Practice

In Context: ML Development Workflow

A good starting point would be:

  • a dev image
  • prod API image
  • deployment image - model training
  • deployment image — model web service

Here are common ML development tasks and their associated images; however, the specific slicing and differentiation of images will vary depending on the project’s needs

image

Secure Dependency Management

Remove unnecessary dependencies

Unnecessary deps:

  • enlarge the attack surface area of your project and make it more vulnerable to malicious attackers
  • increase the time needed to build, publish, and pull your images
  • make the project confusing and hard to maintain when stray dependencies are installed but never used

As a rule of thumb, we should:

  • start with base images that are as small as possible
  • remove dependencies that are not used from pyproject.toml
  • exclude development dependencies from the production image
  • automate checks for vulnerabilities

Effective Machine Learning Teams - Chapter 5 - Automated Testing: Move Fast Without Breaking Things

Automated Tests: The Foundation for Iterating Quickly and Reliably

Automated tests are the essential foundation for products that are easy to maintain and evolve. Without automated tests, changes become error-prone, tedious, and stressful.

image

Key benefits:

  • fast feedback
  • reduced risk of production defects
  • living documentation and self-testing code
  • reduced cognitive load
  • refactoring
  • regulatory compliance
  • improved flow, productivity, and satisfaction

If Automated Testing Is So Important, Why Aren’t We Doing It?

  • we think writing automated tests slows us down
  • we have CI/CD (but the content matters)
  • we just don’t know how to test ML systems

Building Blocks for a Comprehensive Test Strategy for ML Systems

Identifying Components For Testing

image

The ML Systems Test Pyramid

image

Characteristics of a Good Test and Pitfalls to Avoid

  • tests should be independent and idempotent
  • tests should fail fast and fail loudly
  • tests should check behavior, not implementation
# bad example: implementation-focused test
def test_train_model_returns_a_trained_model(mock):
   mock.patch("train.training_helpers._fit_model", 
              return_value=RandomForestClassifier())

   model = train_model()

   assert isinstance(model, RandomForestClassifier)

# good example: behavior-focused test
def test_train_model_returns_a_trained_model():
   test_data = load_test_data()
   model = train_model(test_data)
   valid_predictions = [0, 1, 2]

   prediction = model.predict(test_data[0])

   assert prediction in valid_predictions
  • tests should be runnable in your development environment
  • tests must be part of feature development
  • tests let us ‘catch bugs once’
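
When a defect does slip through to production, we can ‘catch it once’ by reproducing it as an automated regression test before fixing it, so it can never silently return. A minimal sketch of this idea, assuming a hypothetical clean_income helper in a hypothetical preprocessing module:

import math

from preprocessing import clean_income  # hypothetical helper, assumed for illustration


def test_clean_income_handles_negative_values_seen_in_production():
    # arrange: the kind of input that triggered the production defect
    raw_income = -1

    # act
    result = clean_income(raw_income)

    # assert: the fix should clamp invalid values rather than propagate NaN
    assert not math.isnan(result)
    assert result >= 0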

Structure of a Test

  • a meaningful test name
  • structure of arrange, act, assert (AAA)
  • specific and holistic assertions
# good example
def test_convert_keys_to_snake_case_for_keys_with_spaces_or_punctuation(): 
   # arrange (provide input conditions) 
   user_details = {"Job Description": "Wizard",
                   "Work_Address": "Hogwarts Castle",
                   "Current-title": "Headmaster"}

   # act
   result = convert_keys_to_snake_case(user_details)

   # assert (verify output conditions)
   assert result == {
       "job_description": "Wizard",
       "work_address": "Hogwarts Castle",
       "current_title": "Headmaster"
    }

# bad example
def test_convert_keys_to_snake_case(): 
    result = convert_keys_to_snake_case({"Job Description": "Wizard",
                                         "Work Address": "Hogwarts Castle",
                                         "Current_title": "Headmaster"})
    assert result["job_description"] == "Wizard" 
    assert result["work_address"] is not None 

Software Tests

  • unit tests
# unit testing a function that transforms a dataframe
import pandas as pd
from pandas.testing import assert_frame_equal

def test_normalize_columns_returns_a_dataframe_with_values_between_0_and_1():
   loan_applications = pd.DataFrame({"income": [10, 100, 10000]})

   result = normalize_columns(loan_applications)

   expected = pd.DataFrame({"income": [0, 0.009, 1]})
   assert_frame_equal(expected, result)
  • training smoke tests
def test_training_smoke_test():
    data = pd.read_csv("/code/data/train.csv", encoding="utf-8", 
                       low_memory=False)
    # group by target column to ensure we train on data 
    # that contains all target labels:
    test_data = data.groupby("DEFAULT") \
                    .apply(lambda df: df.head(10)) \
                    .reset_index(drop=True)

    pipeline = train_model(test_data)

    predictions = pipeline.predict(test_data)
    valid_predictions = {0, 1}
    assert valid_predictions.issubset(predictions)
  • API tests
from fastapi.testclient import TestClient
from precisely import (assert_that, all_of, is_mapping,
                       greater_than_or_equal_to, less_than_or_equal_to)

from api.app import app


client = TestClient(app)


def test_root():
   response = client.get("/")

   assert response.status_code == 200
   assert response.json() == {"message": "hello world"} 

def test_predict_should_return_a_prediction_when_given_a_valid_payload():

   valid_request_payload = {
       "Changed_Credit_Limit": 0,
       "Annual_Income": 0,
       "Monthly_Inhand_Salary": 0,
       "Age": 0,
   } 

   response = client.post("/predict/",
                          headers={"Content-Type": "application/json"},
                          json=valid_request_payload) 

   assert response.status_code == 200
   assert_that(response.json(),
               is_mapping({"prediction": all_of(greater_than_or_equal_to(0),
                                                less_than_or_equal_to(4)),
                           "request": valid_request_payload}))

It is a recommended practice to assert on the whole ‘elephant’ (the final object), rather than making separate assertions for each item in it

  • post-deployment tests
import requests


class TestPostDeployment:
   endpoint_url = "https://my-model-api.example.com" 

   def test_root(self): 
       response = requests.get(self.endpoint_url)

       assert response.status_code == 200

   def test_predict_should_return_a_prediction_when_given_a_valid_payload(self):
       valid_request_payload = {
           "Changed_Credit_Limit": 0,
           "Annual_Income": 0,
           "Monthly_Inhand_Salary": 0,
           "Age": 0,
        }

       response = requests.post(f"{self.endpoint_url}/predict/", 
                                headers={"Content-Type": "application/json"},
                                json=valid_request_payload
                                ) 

       assert response.status_code == 200

Effective Machine Learning Teams - Chapter 6 - Automated Testing: ML Model Tests

Model Tests

The Necessity of Model Tests

Model tests are essential for ensuring both speed and quality in the ML delivery cycle. Without them, teams face a false choice between maintaining production speed and ensuring model quality, often leading to skipped checks and defects detected too late. Automated model tests address silent failures, reduce manual testing, and prevent production issues, allowing ML practitioners to focus on higher-value tasks. Comprehensive testing builds confidence in model performance and enhances efficiency, ultimately saving time and improving outcomes for users.

Challenges of Testing ML Models

Four challenges—slow tests, high-volume and high-dimensional data, the need for visual exploration, and unclear definitions of ‘good enough’—are ones that ML practitioners tend to work through and live with.

image

Fitness Functions for ML Models

Fitness functions are objective, executable functions that can be used to summarize, as a single figure of merit, how close a given design solution is to achieving its set aims. Fitness functions bridge the gap between automated tests (precise) and ML models (fuzzy).

For example, we can define fitness functions that measure the quality or toxicity of our code. If the code gets too convoluted or violates certain code quality rules, the ‘code quality’ fitness function fails and gives us feedback that a code change has degraded our system beyond our specified tolerance level.

image

Example fitness functions for an ML model

  • metrics tests: the model is fit for production if a given evaluation metric measured using a holdout set is above a specified threshold
  • model fairness tests: the model is fit for production if a given evaluation metric for each key segment—e.g., country—is within X% of each other
  • model API latency tests: the model is fit for production if it can handle N concurrent requests within t seconds
  • model size tests: the model artifact must be below a certain size (e.g., so that we can deploy it easily to embedded or mobile devices); see the sketch after this list
  • training duration tests
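
For example, a model size fitness function can be a one-line assertion in the test suite. Here is a minimal sketch, assuming a hypothetical artifact path and an assumed 50 MB deployment limit:

import os

MAX_MODEL_SIZE_BYTES = 50 * 1024 * 1024  # assumed deployment limit of 50 MB
MODEL_ARTIFACT_PATH = "model/artifacts/model.joblib"  # hypothetical artifact path


def test_model_artifact_is_small_enough_to_deploy():
    # fitness function: fail fast if the artifact outgrows its target environment
    artifact_size_in_bytes = os.path.getsize(MODEL_ARTIFACT_PATH)

    assert artifact_size_in_bytes <= MAX_MODEL_SIZE_BYTES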

Model Metrics Tests (Global and Stratified)

In these tests, we calculate model evaluation metrics that measure the aggregate correctness of the model on a held-out validation dataset and test if they meet our expected threshold of what is considered good enough for the model to be released to production. We compute these metrics at the global level and also at the level of important segments of our data (i.e., stratified level).

Example test:

class TestMetrics:
   recall_threshold = 0.65

   def test_global_recall_score_should_be_above_specified_threshold(self):
       # load trained model
       pipeline = load_model()

       # load test data 
       data = pd.read_csv("./data/train.csv", encoding="utf-8", low_memory=False)
       y = data["DEFAULT"]
       X = data.drop("DEFAULT", axis=1)
       X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=10)

       # get predictions
       y_pred = pipeline.predict(X_test)

       # calculate metric
       recall = recall_score(y_test, y_pred, average="weighted")

       # assert on metric
       print(f"global recall score: {recall}") 
       assert recall >= self.recall_threshold

Example of a stratified test:

class TestMetrics:
   recall_threshold = 0.65

   def test_stratified_recall_score_should_be_above_specified_threshold(self):
       pipeline = load_model()

       data = pd.read_csv("./data/train.csv", encoding="utf-8", low_memory=False)
       strata_col_name = "OCCUPATION_TYPE" 
       stratas = data["OCCUPATION_TYPE"].unique().tolist()
       y = data["DEFAULT"]
       X = data.drop("DEFAULT", axis=1)
       X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=10)

       # get predictions and metric for each strata 
       recall_scores = []
       for strata in stratas:
           X_test_stratified = X_test[X_test[strata_col_name] == strata]
           y_test_stratified = y_test[y_test.index.isin(X_test_stratified.index)]
           y_pred_stratified = pipeline.predict(X_test_stratified)

           # calculate metric
           recall_for_single_strata = recall_score(y_test_stratified, 
                                                   y_pred_stratified, 
                                                   average="weighted")
           print(f"{strata}: recall score: {recall_for_single_strata}")

           recall_scores.append(recall_for_single_strata)

       assert all(recall > self.recall_threshold for recall in recall_scores)

Behavioral Tests

Behavioral tests complement metrics tests by allowing us to enumerate specific—and potentially out-of-sample—scenarios under which to test our model. Behavioral testing (also known as black-box testing) has its roots in software engineering and is concerned with testing various capabilities of a system by ‘validating the input-output behavior, without any knowledge of the internal structure’.

General approach for defining behavioral tests in ML:

  1. define or generate test samples. You can start with just one or two examples and use data-generation techniques to scale to more examples, if that provides value
  2. use these test samples to generate predictions from a trained model
  3. verify that the model’s behavior matches your expectations

Example tests:

  • invariance tests, where we apply label-preserving perturbations to the input data and expect the model’s prediction to remain the same (see the sketch after this list)
  • directional expectation tests, which are like invariance tests, except that we expect the prediction to change in a specific direction when we perturb the input data
  • minimum functionality tests, which contain a set of simple examples and corresponding labels used to verify the model’s behavior in a particular scenario. These tests are similar to unit tests in software engineering and are useful for creating small, targeted testing datasets.
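
To make the first of these concrete, here is a minimal sketch of an invariance test, reusing the chapter’s load_model() helper and the loan-default features from the API example; the reduced feature set and the choice of a one-unit salary change as a label-preserving perturbation are assumptions for illustration:

import pandas as pd


def test_prediction_is_invariant_to_a_negligible_salary_perturbation():
    pipeline = load_model()
    loan_application = pd.DataFrame([{
        "Changed_Credit_Limit": 0,
        "Annual_Income": 50000,
        "Monthly_Inhand_Salary": 4000,
        "Age": 35,
    }])
    # label-preserving perturbation: a one-unit salary change should not flip the prediction
    perturbed_application = loan_application.assign(
        Monthly_Inhand_Salary=loan_application["Monthly_Inhand_Salary"] + 1)

    original_prediction = pipeline.predict(loan_application)
    perturbed_prediction = pipeline.predict(perturbed_application)

    assert (original_prediction == perturbed_prediction).all()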

Testing LLMs: Why and How

  • manual exploratory tests
  • example-based tests
  • benchmark tests
  • property-based tests (see the sketch below)
  • LLM-as-a-judge tests
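
As one illustration of property-based tests, we can assert properties that should hold for any output, without pinning exact wording. A minimal sketch using the hypothesis library, assuming a hypothetical summarize() wrapper around the LLM call:

from hypothesis import given, settings, strategies as st

from llm_app import summarize  # hypothetical wrapper around the LLM call


@settings(max_examples=10, deadline=None)  # keep the suite fast; LLM calls are slow
@given(document=st.text(min_size=200, max_size=2000))
def test_summary_is_non_empty_and_shorter_than_the_input(document):
    summary = summarize(document)

    # properties that should hold regardless of the exact wording of the summary
    assert len(summary) > 0
    assert len(summary) < len(document)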

Essential Complementary Practices for Model Tests

Error Analysis and Visualization

image

Learn from Production by Closing the Data Collection Loop

  • minimize training-serving skew
  • use production-like data for testing (and use synthetic data where necessary)
  • close the data collection loop

When closing data collection loops, we need to identify and mitigate the risk of runaway feedback loops: the phenomenon where a model learns biases and, through its effect on the real world, perpetuates those biases both in the real world and in the data it will be trained on in the future, creating a vicious cycle.

image

The rate at which we can improve our models depends on the rate of information flow in the data collection loop.

Open-Closed Test Design

The open-closed design principle states that software entities (classes, modules, functions, and so on) should be open for extension, but closed for modification. This is a simple but powerful design principle that helps us write extensible code and minimize the amount of bespoke customization that we need to add for each new piece of functionality. By exposing the test data source and model as a configurable parameter in our test, we can rerun model evaluation tests on any given model, against any given dataset, at any time
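
A minimal sketch of this idea, assuming the model and dataset locations are supplied via environment variables (the variable names and a load_model() variant that accepts a path are assumptions for illustration):

import os

import pandas as pd
from sklearn.metrics import recall_score

# open for extension: point the same test at any model artifact and any evaluation dataset
MODEL_PATH = os.environ.get("MODEL_PATH", "model/artifacts/model.joblib")
TEST_DATA_PATH = os.environ.get("TEST_DATA_PATH", "./data/train.csv")


class TestMetricsForAnyModelAndDataset:
    recall_threshold = 0.65

    def test_recall_score_should_be_above_specified_threshold(self):
        # closed for modification: the test body stays the same across models and datasets
        pipeline = load_model(MODEL_PATH)  # assumed variant of load_model that takes a path
        data = pd.read_csv(TEST_DATA_PATH, encoding="utf-8", low_memory=False)
        y = data["DEFAULT"]
        X = data.drop("DEFAULT", axis=1)

        recall = recall_score(y, pipeline.predict(X), average="weighted")

        assert recall >= self.recall_threshold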

image

Exploratory Testing

When you are stuck writing automated model tests—which can often be the case, especially in the early phases of an ML project—exploratory testing can help you discover what good looks like. When you are not sure what to test in an exploratory test, feedback or complaints from customers and domain specialists are a valuable starting point. While you may have a reflexive aversion to customer complaints, they are nevertheless valuable signals, and they point to a gap in our ML delivery process that needs to be looked into. When an exploratory test shows signs of being repeated or repeatable, you can formulate it as an automated test and reap the benefits of quality, flow, cognitive load, and satisfaction.

Means to Improve the Model

Tests do not improve a product by themselves; the improvement happens when you fix the issues they uncover: a model test fails, you undertake error analysis to understand how the model arrived at an incorrect outcome, and you then identify potential options for improving the model.

We can approach model improvement from two angles

  • data-centric approach: we can leverage the data collection loop from earlier to create training data that is more representative and of a better quality. We can also consider various feature engineering approaches, such as creating balanced datasets or feature scaling
  • model-centric approach: no, this is not just tuning hyperparameters. We can explore alternative model architectures, ensembles, or even decompose the ML problem into narrower subproblems that are easier to solve

There are times where teams may try both approaches.

Designing to Minimize the Cost of Failures

Given that ML models are bound to make wrong predictions some of the time, you need to design the product in a way that reduces risk (risk = likelihood x impact) of a wrong prediction, especially when the stakes are high. Not all errors are created equal, and some mistakes are more costly than others. Once we know the cost, we can apply some cost-sensitive learning techniques:

  • addressing data imbalances
  • highlighting costly mistakes
  • evaluating and training against cost-penalization metrics (see the sketch after this list)
  • inference-time cost consideration
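
As a small illustration of cost-sensitive training, scikit-learn’s class_weight can make the costly mistake weigh more during fitting; the 10:1 cost ratio below is an assumption for illustration:

from sklearn.ensemble import RandomForestClassifier

# assume misclassifying a default (class 1) is 10x more costly than misclassifying a non-default (class 0)
cost_sensitive_model = RandomForestClassifier(class_weight={0: 1, 1: 10}, random_state=10)

# fitting proceeds as usual; errors on the costly class are penalized more heavily
# cost_sensitive_model.fit(X_train, y_train)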

Monitoring in Production

  • application monitoring
  • data monitoring
  • model metrics monitoring

Bringing It All Together

image


Effective Machine Learning Teams - Chapter 7 - Supercharging Your Code Editor with Simple Techniques

This chapter was just commands and how to set up PyCharm/VS Code 😆 ~ nothing extraordinary, but I guess it’s good for someone who has not used an IDE before


Effective Machine Learning Teams - Chapter 8 - Refactoring and Technical Debt Management

Technical Debt: The Sand in Our Gears

Refactoring requires automated tests as a safety harness and benefits greatly from good software design to manage complexities in ML systems. A high-level understanding of components and their collaboration helps create maintainable, readable solutions, avoiding architectural issues like tight coupling. The C4 model can aid in visualizing software architecture.

Key principles for software design include abstraction, which simplifies complex details into clear interfaces, and maintaining loosely coupled components to enhance flexibility and maintainability. Readable and maintainable code accelerates iteration and experimentation.

Refactoring guidelines:

  1. refactoring definition: restructure code without altering behavior using small, incremental changes, keeping the system functional at all times
  2. Two Hats Rule: separate refactoring from adding new functionality to reduce cognitive load and maintain stability
  3. Scout Rule: leave the codebase cleaner than you found it, addressing minor issues or flagging larger ones for future fixes
  4. leverage IDEs: use tools for efficient tasks like renaming variables or extracting functions
  5. avoid premature abstraction: favor simplicity initially and refine design later to avoid sunk-cost fallacies and overly complex systems

The ultimate goal is a codebase that is organized, functional, and adaptable, avoiding the chaos of poorly designed ‘big ball of mud’ systems.

How to Refactor a Notebook (or a Problematic Codebase)

The refactoring cycle:

image

There was an exercise to refactor a notebook and turn it into an executable Python file - here

Technical Debt Management in the Real World

Technical Debt Management Techniques

  • make debt visible

image

A technical debt wall helps to make debt visible and helps the team pay down the most important debt as part of ongoing delivery

  • the 80/20 rule

For each story card, focus 80% of the time on delivering features and completing the story, and spend 20% of the time paying off some technical debt, so that the team can continue to deliver features at a sustainable and predictable pace.

  • make it cheap and safe
  • demonstrate value of paying off technical debt

Effective Machine Learning Teams - Chapter 9 - MLOps and Continuous Delivery for ML (CD4ML)

CD4ML practices help teams ensure that changes to their software and ML models are continuously tested and monitored for quality. Any changes in code, data, or ML models that satisfy comprehensive quality checks can be confidently deployed to production at any time

MLOps: Strengths and Missing Puzzle Pieces

image

CD4ML complements MLOps well by adding a set of principles and practices that help ML practitioners shorten feedback loops and improve reliability of ML systems

MLOps 101

A typical MLOps architecture:

image

Key components:

  • scalable training infrastructure
  • CI/CD pipelines (with tests)
  • deployment automation
  • ‘as code’ everything
  • artifact stores
  • experiment tracking
  • feature stores or feature platforms (with data versioning)
  • monitoring in production
  • scalable data-labeling mechanisms

Hints That We Missed Something

  • CI/CD pipelines with no tests
  • infrequent model deployments to production or preproduction environments
  • data in production goes to waste
  • X is another team’s responsibility

Continuous Delivery for Machine Learning

Benefits of CD4ML

  • shorter cycle times (between an idea and shipping it to customers in production)
  • lower defect rates
  • faster recovery times (should something go wrong in production)

Continuous Delivery Principles

  • Principle 1: Build quality into the product (test automation, shift left on security)
  • Principle 2: Work in small batches (practice pair programming, use version control for all production artifacts, implement CI, apply trunk-based development - the practice of committing code changes to the main branch)
  • Principle 3: Automation—computers perform repetitive tasks, people solve problems (automate dev env setup, automate deployments (minimally to a prod env), monitoring in production)
  • Principle 4: Relentlessly pursue continuous improvement
  • Principle 5: Everyone is responsible

Building Blocks of CD4ML: Creating a Production-Ready ML System

image

How CD4ML Supports ML Governance and Responsible AI

As ML capabilities grow, so do risks like bias amplification and societal harm, necessitating ML governance and Responsible AI frameworks to ensure ethical and effective use. Responsible AI integrates principles, policies, tools, and processes to guide the design and deployment of AI systems, balancing societal good and business impact.

CD4ML (Continuous Delivery for Machine Learning) operationalizes these principles by:

  1. enabling iterative improvements: facilitating small, incremental changes to detect and resolve issues early, reducing harm and ensuring timely fixes
  2. maximizing model lifetime value: supporting MLOps to improve model quality, manage risks, and maximize value through automated monitoring and timely updates
  3. defining and enforcing policy-as-code: automating compliance with Responsible AI policies, such as consent validation and bias testing, reducing manual effort while ensuring auditability
  4. encouraging tough questions: fostering a culture where team members can question practices and explore potential harms or failure modes collaboratively

Effective Machine Learning Teams - Chapter 10 - Building Blocks of Effective ML Teams

Common Challenges Faced by ML Teams

  • unsurfaced conflicting approaches (e.g., one person using TensorFlow and another PyTorch)
  • intra-team silos of specialization
  • lack of shared purpose
  • no sense of progress
  • lack of skill diversity
  • no common ground
  • unsafe to fail, or even to ask

Effective Team Internals

Trust as the Foundational Building Block

image

Trust as the Foundation

Patrick Lencioni’s The Five Dysfunctions of a Team underscores trust as the essential building block of team success. Without trust, teams struggle with productive conflict, commitment, accountability, and achieving results. Building trust is not a one-time activity; it requires sustained effort, leadership advocacy, and awareness of group dynamics. Challenges like diverse backgrounds, high-stakes work, and imposter syndrome often exacerbate issues in trust-building, particularly in ML teams.

Vulnerability and Leadership

Brené Brown’s Daring Greatly complements Lencioni by emphasizing vulnerability as critical to trust-building. Leaders can model vulnerability by admitting uncertainties and encouraging openness. Thoughtful team-building activities and empathy foster connections, lowering barriers to vulnerability.

Takeaways:

  • reflect on management styles in your work context
  • identify your team’s developmental stage and discuss ways to move forward
  • use the Belbin framework to recognize team strengths, default roles, and opportunities to adopt unfulfilled roles for better collaboration

Communication

Communication is a key building block in teams, and something that challenges us all at times. In ML teams, where there are many different specializations, communication can be particularly hard. Are you referring to a product ‘feature’ or a data ‘feature’? When someone says ‘average’, do they mean the ‘mean’? And recall that ‘recall’ and ‘precision’ in ML may matter more than ‘accuracy’

Communication is another skill we can improve with deliberate attention, guided by frameworks. A typical team in the tech industry has many noisy communication channels to choose from, raising the question of which channel(s) to use for communications. A typical team also has its ups and downs, and trivial and critical matters to raise and resolve, and it’s important that communication works well in all of these scenarios—especially the difficult conversations.

image

Diverse Membership

image

Purposeful, Shared Progress

Progress in meaningful work is the most significant factor influencing the inner work life of individuals, followed by the presence of supportive systems (catalysts) and the absence of blockers (inhibitors)

Improving Flow with Engineering Effectiveness

ML teams share similarities with software engineering teams but differ in their reliance on extensive discovery activities and complex data supply chains. Modern data engineering practices aim to streamline these processes using software-defined solutions. Research on Developer Experience (DevEx) highlights sociotechnical factors impacting productivity, categorizing them into three dimensions:

  • feedback loops: fast feedback reduces delays and frustrations, enabling quicker iteration. Techniques like automated testing, pair programming, and clear user stories improve feedback cycles. For long-running tasks, minimizing active involvement (duty cycle) and using automation enhances efficiency
  • cognitive load: high cognitive load stems from both essential and incidental complexities, such as messy codebases or unclear documentation. Reducing cognitive load involves eliminating avoidable obstacles, streamlining workflows, and providing user-friendly tools for repetitive tasks
  • flow state: achieving flow state—characterized by deep focus and enjoyment—boosts productivity and innovation. Protecting flow involves minimizing interruptions, maintaining clear goals, and fostering autonomy while aligning individual efforts with team objectives

Effective Machine Learning Teams - Chapter 11 - Effective ML Organizations

Common Challenges Faced by ML Organizations

  • centralized team has too many responsibilities
  • inter-team dependencies: handoffs and waiting
  • duplication of effort in multiple cross-functional teams

Effective Organizations as Teams of Teams

The Role of Value-Driven Portfolio Management

Value-driven portfolio management emphasizes aligning an organization’s portfolio of work with business value and strategy to enhance team effectiveness. Drawing from concepts in Team Topologies and EDGE: Value-Driven Digital Transformation, this approach advocates for self-sufficient, cross-functional teams organized around delivering value. Tools like the Lean Value Tree help structure portfolios dynamically, focusing on strategic goals and evolving execution plans.

Team Topologies Model

image

Team Topologies for ML Teams

  • stream-aligned team: ML product team

image

image

  • complicated subsystem team: ML domain team

image

image

  • platform team: ML and data platform team

image

  • enabling team: Specialists in some aspect of ML product development

image

  • combining and evolving topologies

image

  • an example topology

image

Intentional Leadership

  • create structures and systems for effective teams
  • engage stakeholders and coordinate organizational resources
  • cultivate psychological safety
  • champion continuous improvement
  • embrace failure as a learning opportunity
  • build the culture you wish you had
  • encourage teams to play at work

I guess it is understandable that there are many references to management frameworks in these latter chapters and it seems it would be valuable for managers/team builders to understand the concepts.


The Kafka/Flink lectures/labs from Zach Wilson’s lab were not released today but I expect them tomorrow ^^

That is all for today!

See you tomorrow :)