(Day 315) Finally started reading Chip Huyen's Designing ML systems

Ivan Ivanov · November 11, 2024

Hello :) Today is Day 315!

A quick summary of today:

Chapter 1 - Overview of ML systems

When people hear "ML systems" they often think only of the underlying algorithms. However, the algorithm is just a small piece of an ML system in production. A complete ML system also includes the business requirements that drive the project, the interface for user and developer interactions, the data stack, and the processes for developing, monitoring, and updating models. Additionally, it involves the infrastructure that enables the delivery of these components. The figure below shows the components of an ML system and indicates the chapters of the book that cover them.

[Image: the components of an ML system and the chapters of the book that cover them]

When to use ML

Machine learning is an approach to learn complex patterns from existing data and use these patterns to make predictions on unseen data.

Survey results on the use of ML:

[Image: survey results on the use of ML]

Understanding ML systems

ML in research vs production

|  | Research | Production |
| --- | --- | --- |
| Requirements | State-of-the-art model performance on benchmark datasets | Different stakeholders have different requirements |
| Computational priority | Fast training, high throughput | Fast inference, low latency |
| Data | Static | Constantly shifting |
| Fairness | Often not a focus | Must be considered |
| Interpretability | Often not a focus | Must be considered |

ML systems vs Traditional software

  • in SWE there’s an underlying assumption that code and data are separate and engineers want to keep things modular and separate
  • in contrast, ML systems are part code, part data, and part artifacts created from the two
  • in SWE, we focus on testing and versioning the code
  • with ML, we have to test and version data as well (and that’s hard)
  • ML models are getting bigger and bigger, requiring more and more space
  • running ML models on edge devices is another challenge
  • prod monitoring and debugging is also nontrivial

Chapter 2 - Introduction to ML systems design

ML systems design takes a system approach to MLOps, which means that we’ll consider an ML system holistically to ensure that all the components—the business requirements, the data stack, infrastructure, deployment, monitoring, etc.—and their stakeholders can work together to satisfy the specified objectives and requirements.

Before an ML system is developed we must understand why it is needed. If it's built for a business, then there must be some business objectives behind it, which will need to be translated into ML objectives.

Once these objectives are defined and everyone is on board, we’ll need to set some requirements to guide the development of the system. The book considers 4 requirements:

  • reliability
  • scalability
  • maintainability
  • adaptability

Business and ML objectives

Data scientists might be interested in improving the performance of their model - e.g. increasing accuracy from 94% to 94.2% - and they might spend a lot of resources to achieve that.

However, most companies don’t care about this 0.2% unless it moves some business metric. We need to pay more attention to the business objectives. The ultimate goal of any project in a business is to increase profits - either directly or indirectly. So, for an ML project to succeed within a business, it’s crucial to tie the performance of an ML system to the overall business performance.

When evaluating ML solutions through the business lens, it’s important to be realistic about the expected returns. Due to all the hype surrounding ML, generated both by the media and by practitioners with a vested interest in ML adoption, some companies might have the notion that ML can magically transform their businesses overnight.

Returns on investment in ML depend a lot on the maturity stage of adoption. The longer we’ve adopted ML, the more efficiently our pipelines will run, the faster the dev cycle will be, the less engineering time we’ll need, and the lower the cloud bills will be.

The figure below is based on a survey of how long it takes a company to bring a model to production. The conclusion was that this time is roughly proportional to how long the company has been using ML.

[Image: time needed to bring a model to production vs. how long the company has used ML]

Requirements for ML systems

We can’t say that we’ve successfully built an ML system without knowing what requirements the system has to satisfy. The specified requirements for an ML system vary from use case to use case. However, most systems should have these four characteristics: reliability, scalability, maintainability, and adaptability.

Reliability

The system should continue to perform the correct function at the desired level of performance even if problems arise (hardware/software faults/human error)

‘Correctness’ might be difficult to determine because we cannot know whether a prediction is wrong if we don’t have the ground truth. With traditional software systems we might get a warning, a system crash, a runtime error, or a 404, but ML systems can fail silently, and end users might not even know it. This topic is explored more in chapter 8.

Scalability

An ML system can grow in complexity, in traffic volume, or in ML model count. Whichever way it grows, we should be ready and have reasonable ways of dealing with that growth. Again, this is discussed in more detail later in the book.

Maintainability

Many people will work on an ML system, so it's important to structure the workloads and set up the infrastructure in such a way that different contributors can work using tools they are comfortable with.

  • code should be documented
  • code, data, and artifacts should be versioned
  • models should be sufficiently reproducible
  • when a problem occurs, different contributors should be able to work together to identify the problem and implement a solution without finger-pointing

Adaptability

To adapt to shifting data distributions and business requirements, the system should have some capacity for both discovering aspects for performance improvement and allowing updates without service interruption.

Iterative process

Developing an ML system is an iterative and, in most cases, never-ending process. Once a system is put into production, it’ll need to be continually monitored and updated.

Below is an oversimplified representation of what the iterative process for developing ML systems in prod looks like, from the perspective of a DS/MLE:

[Image: the iterative process for developing ML systems in production]

A brief overview of each step:

Step 1: Project scoping

A project starts with scoping: laying out goals, objectives, and constraints. Stakeholders should be identified and involved. Resources should be estimated and allocated.

Step 2: Data engineering

The vast majority of ML models today learn from data, so developing ML models starts with engineering data. Garbage in, garbage out, as the saying goes.

Step 3: ML model development

With the initial set of training data, we’ll need to extract features and develop initial models leveraging these features. This is the stage that requires the most ML knowledge and is most often covered in ML courses.

Step 4: Deployment

After a model is developed, it needs to be made accessible to users.

Step 5: Monitoring and continual learning

Once in production, models need to be monitored for performance decay and maintained to be adaptive to changing environments and changing requirements.

Step 6: Business analysis

Model performance needs to be evaluated against business goals and analysed to generate business insights. These insights can then be used to eliminate unproductive projects or scope out new projects. This step is closely related to the first step.

Framing ML problems

If a stakeholder one day sees that our company’s customer support is slower than a competitor’s ML-powered customer support, they will come to us (as the MLE) and ask us to use ML to improve ours as well. Slow customer support is a problem, but it’s not an ML problem. An ML problem is defined by inputs, outputs, and the objective function that guides the learning process - and these are not so obvious to our stakeholder, so it’s our job, as MLEs, to figure them out. Upon investigation, we discover that the bottleneck in responding to customer requests lies in routing them to the right department among four. We can alleviate this bottleneck by developing an ML model to predict which department a request should go to - this just became a multiclass classification problem (a minimal sketch follows the list below).

  • the input is the customer request
  • the output is the department the request should go to
  • the objective function is to minimize the difference between the predicted department and the actual department
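
A minimal sketch of this framing in Python, assuming requests have already been embedded as feature vectors; the department names, feature dimension, and classifier choice are all hypothetical stand-ins:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

departments = ["billing", "technical", "sales", "other"]  # hypothetical routing targets

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 16))    # stand-in for embedded customer requests
y_train = rng.integers(0, 4, size=200)  # stand-in for historically routed departments

# Multiclass classifier: input = request features, output = department index.
# Training minimizes cross-entropy between predicted and actual departments.
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

new_request = rng.normal(size=(1, 16))
print(departments[clf.predict(new_request)[0]])
```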

There might be multiple ways to frame a problem, and changing the way we frame the problem might make the problem significantly easier or harder.

Mind vs Data

Progress in the last decade shows that the success of an ML system depends largely on the data it was trained on. Instead of focusing on improving ML algorithms, most companies focus on managing and improving their data.

In theory, you can both pursue architectural designs and leverage large data and computation, but spending time on one often takes time away from the other.

Many people in ML today are in the data-over-mind camp. Richard Sutton, a professor of computing science at the University of Alberta and a distinguished research scientist at DeepMind, wrote a great blog post in which he claimed that researchers who chose to pursue intelligent designs over methods that leverage computation will eventually learn a bitter lesson: “The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin.… Seeking an improvement that makes a difference in the shorter term, researchers seek to leverage their human knowledge of the domain, but the only thing that matters in the long run is the leveraging of computation.”

When asked how Google Search was doing so well, Peter Norvig, Google’s director of search quality, emphasized the importance of having a large amount of data over intelligent algorithms in their success: “We don’t have better algorithms. We just have more data.”

The debate isn’t about whether finite data is necessary, but whether it’s sufficient. The term finite here is important, because if we had infinite data, it might be possible for us to look up the answer. Having a lot of data is different from having infinite data.

The data science hierarchy of needs:

[Image: the data science hierarchy of needs]

Regardless of which camp will prove to be right eventually, no one can deny that data is essential, for now. Both research and industry trends in recent decades show that the success of ML relies more and more on the quality and quantity of data.

Chapter 3 - Data engineering fundamentals

Data sources

  • user input data
  • system-generated data
  • internal databases
  • 3rd-party data

Data formats

  • JSON
  • CSV
  • Parquet
  • Avro
  • Protobuf
  • Pickle

Data models

  • relational model
  • NoSQL
    • document model
    • graph model
  • structured vs. unstructured data
| Structured data | Unstructured data |
| --- | --- |
| Schema clearly defined | Data doesn’t have to follow a schema |
| Easy to search and analyze | Fast arrival |
| Can only handle data with a specific schema | Can handle data from any source |
| Schema changes will cause a lot of trouble | No need to worry about schema changes (yet), as the worry is shifted to the downstream applications that use this data |
| Stored in data warehouses | Stored in data lakes |

Data storage engines and processing

Transactional databases

Designed to process online transactions and satisfy low-latency, high-availability requirements.

ACID:

  • atomicity

To guarantee that all the steps in a transaction are completed successfully as a group. If any step in the transaction fails, all other steps must also fail. For example, if a user’s payment fails, you don’t want to still assign a driver to that user (a minimal code sketch of this follows the ACID list).

  • consistency

To guarantee that all the transactions coming through must follow predefined rules. For example, a transaction must be made by a valid user.

  • isolation

To guarantee that two transactions happen at the same time as if they were isolated. Two users accessing the same data won’t change it at the same time. For example, you don’t want two users to book the same driver at the same time.

  • durability

To guarantee that once a transaction has been committed, it will remain committed even in the case of a system failure. For example, after you’ve ordered a ride and your phone dies, you still want your ride to come.
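
As a concrete illustration of atomicity, here is a minimal sketch using Python's built-in sqlite3; the ride-hailing tables are hypothetical. If any statement inside the transaction raises, the whole group is rolled back:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (user_id INTEGER, amount REAL)")
conn.execute("CREATE TABLE assignments (user_id INTEGER, driver_id INTEGER)")

try:
    with conn:  # opens a transaction: commit on success, rollback on any exception
        conn.execute("INSERT INTO payments VALUES (?, ?)", (1, 9.99))
        # If assigning a driver fails here, the payment insert is undone as well,
        # so we never charge a user without giving them a driver.
        conn.execute("INSERT INTO assignments VALUES (?, ?)", (1, 42))
except sqlite3.Error as exc:
    print("transaction rolled back:", exc)
```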

However, transactional databases don’t necessarily need to be ACID, and some developers find ACID to be too restrictive. According to Martin Kleppmann, “systems that do not meet the ACID criteria are sometimes called BASE, which stands for Basically Available, Soft state, and Eventual consistency. This is even more vague than the definition of ACID.”

OLAP (online analytical processing) vs OLTP (online transaction processing)

ETL

[Image: overview of the ETL (extract, transform, load) process]

Modes of Dataflow

Most of the time in production there are multiple processes running, and the question becomes: how do we pass data between different processes that don’t share memory?

When data is passed from one process to another, we say the data flows from one process to another, which gives us a dataflow. There are 3 main modes:

  • data passing through databases
  • data passing through services using requests, such as those provided by REST and RPC APIs (e.g. POST/GET requests) - see the sketch after this list
  • data passing through a real-time transport like Apache Kafka and Amazon Kinesis
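
A minimal sketch of the request-driven mode, assuming a hypothetical prediction service listening at localhost:8000 that accepts JSON over HTTP POST (using the third-party requests library):

```python
import requests  # pip install requests

# No shared memory: the data flows between processes as a serialized JSON payload.
resp = requests.post(
    "http://localhost:8000/predict",  # hypothetical service endpoint
    json={"customer_request": "My invoice is wrong"},
    timeout=5,
)
resp.raise_for_status()
print(resp.json())  # e.g. {"department": "billing"}
```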

Chapter 4 - Training data

Data is messy, complex, unpredictable, and potentially treacherous. If not handled properly, it can easily sink your entire ML operation. But this is precisely the reason why data scientists and ML engineers should learn how to handle data well, saving us time and headache down the road.

Sampling

Nonprobability sampling

As the name suggests, this is when the selection of data isn’t based on any probability criteria. Here are some of the criteria for nonprobability sampling:

  • convenience sampling: samples of data are selected based on their availability; this method is popular because it’s convenient
  • snowball sampling: future samples are selected based on existing samples; for instance, to scrape legitimate Twitter accounts without having access to Twitter databases, we start with a small number of accounts, then scrape all the accounts they follow, and so on
  • judgement sampling: experts decide what samples to include
  • quota sampling: samples are selected based on quotas for certain slices of data without any randomisation;

The samples selected by nonprobability criteria are not representative of the real-world data and are therefore riddled with selection biases, so it’s a bad idea to use them for ML. In most cases, though, data for ML models is still driven by convenience.

Nonprobability sampling can be a quick and easy way to gather initial data to get a project running, but for reliable models, we might need to use probability-based sampling methods.

Simple random sampling

In its simplest form, we sample data after giving each sample a uniform chance of being picked. The advantage of this method is that it’s easy to implement; however, rare categories of the data might not appear in the selection.
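
In Python this is a one-liner with the standard library:

```python
import random

population = list(range(1_000_000))
sample = random.sample(population, k=1_000)  # every element has an equal chance
```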

Stratified sampling

With stratified sampling, we first divide the population into the groups (strata) we care about and then sample from each group separately, so rare groups are guaranteed to be represented. One drawback of this sampling method is that it isn’t always possible, such as when it’s impossible to divide all samples into groups. This is especially challenging when one sample might belong to multiple groups, as in the case of multilabel tasks. For instance, a sample can be both class A and class B.
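
In practice, a stratified split is often done with scikit-learn's stratify argument; a sketch with synthetic, imbalanced labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1000).reshape(-1, 1)
y = np.array([0] * 900 + [1] * 100)  # 90/10 class ratio

# stratify=y preserves the 90/10 ratio in both splits, so the rare class
# is guaranteed to show up in the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(np.bincount(y_test))  # [180  20]
```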

Weighted sampling

This method allows us to leverage domain expertise. For example, if we know that a certain subpopulation of data, such as more recent data, is more valuable to the model and want it to have a higher chance of being selected, we can give it a higher weight.
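
A minimal sketch using random.choices, where the recency weights are made up for illustration:

```python
import random

samples = ["2022_review", "2023_review", "2024_review"]
weights = [1, 2, 4]  # hypothetical: more recent data is deemed more valuable

# Sampling with replacement: "2024_review" is 4x as likely as "2022_review"
picked = random.choices(samples, weights=weights, k=10)
print(picked)
```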

Reservoir sampling

This is especially useful when dealing with streaming data. It is useful when the dataset size is unknown or too large to fit in memory.

The algorithm ensures that every element in the stream has an equal probability of being selected by maintaining a fixed-size reservoir of k elements. The steps are:

  1. fill the reservoir with the first k elements
  2. for the n-th incoming element, generate a random integer i between 1 and n
  3. if i is within the range of the reservoir (1 to k), replace the i-th element in the reservoir with the n-th element; otherwise, ignore the new element

This method guarantees that each element has an equal chance of being included in the sample, even if the algorithm stops partway through the data stream.
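
A direct translation of those three steps into Python (this is the classic Algorithm R):

```python
import random

def reservoir_sample(stream, k):
    """Uniform random sample of k items from a stream of unknown length."""
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            reservoir.append(item)       # step 1: fill the reservoir
        else:
            i = random.randint(1, n)     # step 2: random integer in [1, n]
            if i <= k:                   # step 3: within the reservoir range?
                reservoir[i - 1] = item  # replace the i-th element
    return reservoir

print(reservoir_sample(range(1_000_000), k=10))
```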

Importance sampling

Importance sampling is one of the most important sampling methods, not just in ML. It allows us to sample from a distribution when we only have access to another distribution.

Imagine we have to sample x from a distribution P(x), but P(x) is really expensive, slow, or infeasible to sample from. However, there is a distribution Q(x) that is a lot easier to sample from, so we sample x from Q(x) instead and weigh this sample by P(x)/Q(x). Q(x) is called the proposal distribution or the importance distribution. Q(x) can be any distribution as long as Q(x) > 0 whenever P(x) ≠ 0.
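
A minimal sketch of estimating an expectation under P while only sampling from Q; the particular P, Q, and f below are arbitrary choices for illustration:

```python
import numpy as np
from scipy.stats import norm

# Target P = N(3, 1), assumed hard to sample from for the sake of the example;
# proposal Q = N(0, 2), which we can sample from easily. Q > 0 everywhere,
# so the support condition Q(x) > 0 wherever P(x) != 0 holds.
p = norm(loc=3, scale=1)
q = norm(loc=0, scale=2)

x = q.rvs(size=100_000, random_state=0)  # sample from Q instead of P
w = p.pdf(x) / q.pdf(x)                  # importance weights P(x)/Q(x)

estimate = np.mean(w * x**2)             # approximates E_P[x^2]
print(estimate)                          # true value: 1 + 3^2 = 10
```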

Labeling

Hand labels

  • expensive, especially if subject matter expertise is required
  • poses a threat to data privacy
  • slow

Natural labels

  • canonical example of tasks with natural labels is recommender systems

How companies obtain labels

[Image: how companies obtain labels]

Handling the lack of labels

| Method | How | Ground truths required? |
| --- | --- | --- |
| Weak supervision | Leverages (often noisy) heuristics to generate labels | No, but a small number of labels are recommended to guide the development of heuristics |
| Semi-supervision | Leverages structural assumptions to generate labels | Yes, a small number of initial labels as seeds to generate more labels |
| Transfer learning | Leverages models pretrained on another task for your new task | No for zero-shot learning; yes for fine-tuning, though the number of ground truths required is often much smaller than what would be needed to train the model from scratch |
| Active learning | Labels data samples that are most useful to your model | Yes |

Programmatic labeling

| Hand labeling | Programmatic labeling |
| --- | --- |
| Expensive: especially when subject matter expertise is required | Cost saving: expertise can be versioned, shared, and reused across an organization |
| Lack of privacy: need to ship data to human annotators | Privacy: create LFs (labeling functions) using a cleared data subsample, then apply the LFs to other data without looking at individual samples |
| Slow: time required scales linearly with the number of labels needed | Fast: easily scale from 1K to 1M samples |
| Nonadaptive: every change requires relabeling the data | Adaptive: when changes happen, just reapply the LFs! |
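
A sketch of what LFs might look like in plain Python, without committing to any particular framework; the spam heuristics below are made up:

```python
ABSTAIN, HAM, SPAM = -1, 0, 1

# Each labeling function encodes one noisy heuristic and may abstain.
def lf_contains_link(text):
    return SPAM if "http://" in text or "https://" in text else ABSTAIN

def lf_all_caps(text):
    return SPAM if text.isupper() else ABSTAIN

def lf_short_reply(text):
    return HAM if len(text.split()) < 4 else ABSTAIN

def majority_vote(text, lfs):
    """Combine noisy LF votes into a single (still noisy) label."""
    votes = [vote for lf in lfs if (vote := lf(text)) != ABSTAIN]
    return max(set(votes), key=votes.count) if votes else ABSTAIN

lfs = [lf_contains_link, lf_all_caps, lf_short_reply]
print(majority_vote("Click now to win http://spam.example today", lfs))  # 1 (SPAM)
```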

Class imbalance

  • there’s insufficient signal for the model to learn to detect the minority classes
  • it’s easier for the model to get stuck in a nonoptimal solution by exploiting a simple heuristic instead of learning anything useful about the underlying pattern of the data
  • leads to asymmetric costs of error
  • can be caused by biases during the sampling process
  • another cause can be due to labeling errors

Need to make sure to use the appropriate evaluation metrics in the presence of imbalanced data.

We can use resampling methods:

[Image: resampling methods for imbalanced data]
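
A sketch of the two basic data-level ideas (randomly undersampling the majority class and randomly oversampling the minority class) in plain NumPy; the 95/5 class ratio is made up:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = np.array([0] * 950 + [1] * 50)  # hypothetical 95/5 imbalance

maj = np.where(y == 0)[0]
mino = np.where(y == 1)[0]

# Undersampling: keep only as many majority samples as there are minority samples
keep = rng.choice(maj, size=len(mino), replace=False)
X_under = np.concatenate([X[keep], X[mino]])
y_under = np.concatenate([y[keep], y[mino]])

# Oversampling: duplicate minority samples (with replacement) up to majority size
dup = rng.choice(mino, size=len(maj), replace=True)
X_over = np.concatenate([X[maj], X[dup]])
y_over = np.concatenate([y[maj], y[dup]])
```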

Algorithm-level methods address class imbalance by adjusting the learning algorithm rather than modifying the training data. These methods often alter the loss function to make the model more sensitive to minority classes.

  • cost-sensitive learning: this approach adjusts the individual loss by applying a cost matrix, which specifies different misclassification costs for each class. The cost matrix enables prioritizing misclassification penalties based on class importance, though it requires manual definition for each task

  • class-balanced loss: to prevent models from favoring majority classes, class-balanced loss assigns weights inversely proportional to class frequencies, making rare classes more influential in the learning process. Advanced versions, like those based on effective sample numbers, further refine these weights.

  • focal loss: focal loss increases focus on harder-to-classify examples by boosting the loss weight for low-confidence predictions, encouraging the model to improve on challenging samples rather than easy ones (a minimal sketch follows this list)
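
A sketch of binary focal loss in plain NumPy, following the common formulation FL(p_t) = -(1 - p_t)^gamma * log(p_t); gamma and the toy predictions are arbitrary:

```python
import numpy as np

def focal_loss(y_true, p_pred, gamma=2.0, eps=1e-7):
    """Binary focal loss: down-weights easy (high-confidence) examples."""
    p = np.clip(p_pred, eps, 1 - eps)
    p_t = np.where(y_true == 1, p, 1 - p)  # probability assigned to the true class
    return -np.mean((1 - p_t) ** gamma * np.log(p_t))

y = np.array([1, 1, 0, 0])
easy = np.array([0.95, 0.90, 0.10, 0.05])  # confident, correct predictions
hard = np.array([0.55, 0.60, 0.45, 0.40])  # low-confidence predictions

print(focal_loss(y, easy))  # small: easy examples contribute almost nothing
print(focal_loss(y, hard))  # much larger: hard examples dominate the loss
```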

Data augmentation

Three common approaches:

  • simple label-preserving transformations
  • perturbation (see the sketch after this list)
  • data synthesis
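
A minimal sketch of perturbation: adding small Gaussian noise to the inputs while keeping the labels unchanged; the noise scale is an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 32))      # stand-in for training features
y = rng.integers(0, 2, size=100)

noise = rng.normal(scale=0.05, size=X.shape)  # small, label-preserving perturbation
X_aug = np.concatenate([X, X + noise])        # original + perturbed copies
y_aug = np.concatenate([y, y])                # labels stay the same
```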

Company reviewer project

Today my advisor sent me an email with a link where someone can use the fine-tuned company reviewer chatbot. A reminder - this is a project where, based on 50 online reviews, an LLM writes a review for that company across 12 dimensions (e.g. salary, work-life balance, etc.). The data is created by our client and I am using it to fine-tune various models. The best one so far is gpt-4o-mini, so I helped my professor understand how to use it so that the customer company can potentially use it as well. It is not final, but for now we have a Streamlit UI option (which can only be used with an API key from the organisation, but the UI itself is public), using the model through the OpenAI playground, and also creating a GPT (which I have not done yet as I do not have ChatGPT Plus). Later this week I will be going to the lab again to continue trying out different open-source model configurations and evaluating them.


That is all for today!

See you tomorrow :)