Hello :) Today is Day 280!
A quick summary of today:
- started the MLE book by Andriy Burkov
- streamed and read some of scikit-learn’s docs
MLE by Andriy Burkov
In the foreword, written by Cassie Kozyrkov - Chief Decision Scientist at Google - I found out about Making Friends with Machine Learning, a course by Google, and I will mark it to check out later.
She also compliments this book a lot as one of the very few ‘applied’ machine learning books, compared to the horde of research-oriented ML books out there, and she finishes the foreword with:
If you intend to use machine learning to solve business problems at scale, I’m delighted you got your hands on this book. Enjoy!
Chapter 1: Introduction
Notation and definitions
- scalar
- vector
- matrix
- capital sigma
- Euclidean norm and distance
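Since the Euclidean norm and distance come up throughout the book, here they are written out for my own reference (standard definitions, not copied from the book’s notation section):

```latex
% Euclidean norm of a D-dimensional vector x, and the Euclidean distance between x and z
\[
\lVert \mathbf{x} \rVert = \sqrt{\sum_{j=1}^{D} x_j^{2}},
\qquad
d(\mathbf{x}, \mathbf{z}) = \lVert \mathbf{x} - \mathbf{z} \rVert = \sqrt{\sum_{j=1}^{D} (x_j - z_j)^{2}}
\]
```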
What is ML
Machine learning is a subfield of computer science that is concerned with building algorithms that, to be useful, rely on a collection of examples of some phenomenon. These examples can come from nature, be handcrafted by humans, or generated by another algorithm.
- supervised learning (features, targets, regression, classification)
- unsupervised learning (unlabeled examples, clustering, dimensionality reduction, outlier detection)
- semi-supervised learning (both labeled and unlabeled examples)
- reinforcement learning
Data and ML terminology
- data used directly and indirectly
- raw and tidy data
- training and holdout sets
- baseline models
- ML pipelines
- parameters vs hyperparameters
- classification vs regression
- model-based (e.g. SVM, whose learned parameters are applied to new data) vs instance-based learning (e.g. KNN, which uses the nearby neighbourhood of the input to make a prediction) - see the sketch after this list
- shallow vs deep learning
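To make the model-based vs instance-based distinction concrete, here is a minimal scikit-learn sketch of my own (not from the book): the SVM compresses the training data into learned parameters, while KNN keeps the training data around and predicts from the neighbours of each new input.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# model-based: training produces parameters (support vectors, coefficients)
model_based = SVC().fit(X, y)

# instance-based: training mostly memorises the data; prediction looks up neighbours
instance_based = KNeighborsClassifier(n_neighbors=5).fit(X, y)

print(model_based.predict(X[:3]))
print(instance_based.predict(X[:3]))
```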
When to use ML
- when the problem is too complex for coding
- when the problem is constantly changing
- when it is a perception problem (e.g. speech, image, or video recognition)
- when it is an unstudied phenomenon
- when the problem has a simple objective
- when it is cost-effective
When not to use ML
- every action of the system or a decision made by it must be explainable
- every change in the system’s behavior compared to its past behavior in a similar situation must be explainable
- the cost of an error made by the system is too high
- you want to get to the market as fast as possible
- getting the right data is too hard or impossible
- you can solve the problem using traditional software development at a lower cost
- a simple heuristic would work reasonably well
- the phenomenon has too many outcomes while you cannot get a sufficient amount of examples to represent them (like in video games or word processing software)
- you build a system that will not have to be improved frequently over time
- you can manually fill an exhaustive lookup table by providing the expected output for any input (that is, the number of possible input values is not too large, or getting outputs is fast and cheap)
What is MLE
Machine Learning Engineering (MLE) focuses on building, deploying, and maintaining machine learning models in production environments. It involves data collection, preprocessing, feature engineering, model training, optimization, and ensuring model stability and scalability. MLEs work to integrate machine learning models into systems, handling issues like performance degradation and system failures over time.
Thanks to DataTalksClub’s mlops-zoomcamp I managed to get a very good insight into the world of MLE.
ML project life cycle
Chapter 2 - Before the project starts
- the project must be prioritised as there could be many items in the backlog, so we need to figure out which ones are more important for an MVP (or something like that)
- the project must have a well-defined goal so that resources can be allocated appropriately
Prioritisation of a ML project
The key considerations are impact and cost
- impact is high when 1) ML can replace a complex part of an engineering project, or 2) there’s a great benefit in getting inexpensive (but probably imperfect) predictions
- three factors highly influence the cost: the difficulty of the problem; the cost of data; and the need for accuracy
- in terms of the problem’s difficulty, the main considerations are: whether an implemented algorithm or a software library capable of solving the problem is available; and whether significant computation power is needed to build the model or to run it in the production environment
- in terms of cost of data, the main considerations are: can data be generated automatically; what is the cost of manual annotation of the data; and how many examples are needed
- in terms of accuracy: how costly is each wrong prediction; and what is the lowest accuracy level below which the model becomes impractical
Estimating complexity of a ML project
There is no set standard, but just a comparison with other projects.
The unknowns
- whether the required quality is attainable in practice
- how much data you will need to reach the required quality
- what features and how many features are necessary so that the model can learn and generalise sufficiently
- how large the model should be
- how long it will take to train one model and how many experiments will be required to reach the desired level of performance
Simplifying the problem
One way to make a more educated guess is to simplify the problem and solve a simpler problem first.
Nonlinear progress
The prediction error typically decreases rapidly at first, but progress slows down over time. Sometimes, there may be no improvement, prompting you to add new features that rely on external databases or knowledge sources. While developing these features or labeling more data (possibly through outsourcing), the model’s performance may stagnate.
Due to this nonlinear progress, it’s important to ensure that the product owner or client understands the limitations and risks involved. Keep detailed logs of every activity and track the time spent, which will aid in both reporting and estimating the complexity of future projects.
Defining the goal of a ML project
The goal of a machine learning project is to build a model that solves, or helps solve, a business problem.
What a model can do
- automate (for example, by taking action on the user’s behalf or by starting or stopping a specific activity on a server)
- alert or prompt (for example, by asking the user if an action should be taken or by asking a system administrator if the traffic seems suspicious)
- organize, by presenting a set of items in an order that might be useful for a user (for example, by sorting pictures or documents in the order of similarity to a query or according to the user’s preferences)
- annotate (for instance, by adding contextual annotations to displayed information, or by highlighting, in a text, phrases relevant to the user’s task)
- extract (for example, by detecting smaller pieces of relevant information in a larger input, such as named entities in the text: proper names, companies, or locations)
- recommend (for example, by detecting and showing to a user highly relevant items in a large collection, based on the item’s content or the user’s reactions to past recommendations)
- classify (for example, by dispatching input examples into one, or several, of a predefined set of distinctly-named groups)
- quantify (for example, by assigning a number, such as a price, to an object, such as a house)
- synthesize (for example, by generating new text, image, sound, or another object similar to the objects in a collection)
- answer an explicit question (for example, “Does this text describe that image?” or “Are these two images similar?”)
- transform its input (for example, by reducing its dimensionality for visualization purposes, paraphrasing a long text as a short abstract, translating a sentence into another language, or augmenting an image by applying a filter to it)
- detect a novelty or an anomaly
Properties of a successful model
- it respects the input and output specifications and the performance requirements
- it benefits the organisation (measured via cost reduction, increased sales or profit)
- it helps the user (measured via productivity, engagement, and sentiment)
- it is scientifically rigorous
Structuring a ML team
Two cultures
- Collaborative specialization:
- data analysts and software engineers work closely together
- engineers are not required to have deep machine learning expertise, but they need to understand the terminology
- advocates argue that each team member should excel in their specific area. Data analysts should master machine learning techniques, while engineers focus on efficient, maintainable code
- Full-stack expertise:
- every engineer in the team possesses both machine learning and software engineering skills
- supporters of this approach claim that scientists often prioritize accuracy over practicality, leading to solutions that may not be viable in production. Additionally, scientists may not write well-structured, efficient code, creating challenges for engineers to rewrite it for production
Each culture has its pros and cons, with one emphasizing deep specialization and the other encouraging a hybrid skill set.
Members of a ML team
A machine learning team may consist of various experts, including:
- Data engineers:
- responsible for ETL (Extract, Transform, Load) processes and creating automated data pipelines
- design the structure and integration of data from various sources
- provide fast access to data through APIs or queries for data analysts and consumers
- typically do not need machine learning knowledge
- in large companies, data engineers usually work in separate teams from machine learning engineers
- Data labeling experts:
- handle labeling of data according to specifications from analysts, build labeling tools, manage labelers, and validate data quality
- in large companies, labeling teams may include local and outsourced labelers, as well as engineers responsible for tool development
- collaboration with domain experts is encouraged for better feature engineering, aligning model predictions with business needs
- DevOps engineers:
- collaborate with machine learning engineers to automate model deployment, monitoring, and maintenance
- in smaller companies, a DevOps engineer may be part of the machine learning team, while in larger organizations, they work in a broader DevOps team
- some companies have introduced the MLOps role, dedicated to deploying and upgrading machine learning models and managing data pipelines involving these models
Why ML projects fail
- lack of experienced talent
- lack of support by the leadership
- lack of data infrastructure
- data labeling challenges
- siloed organisations and lack of collaboration
- technically infeasible projects
- lack of alignment between tech and business teams
Chapter 3 - Data collection and preparation
Before any machine learning activity can start, the analyst must collect and prepare the data. The data available to the analyst is not always “right” and is not always in a form that a machine learning algorithm can use.
Questions about the data
- is the data accessible
- is the data sizeable (enough for our project)
- is the data useable
- is the data understandable
- is the data reliable
Common problems with data
- high cost
- bad quality
- noise
- bias (selection bias, self-selection bias, omitted variable bias, sponsorship or funding bias, sampling bias, prejudice or stereotype bias, systematic value distortion, experimenter bias, labeling bias)
- low predictive power
- outdated examples
- outliers
- data leakage
Ways to avoid bias
Avoiding bias in data is challenging but crucial for building fair models. Here are several strategies:
- Question the data:
- investigate who created the data, why, and how. Examine research methods to ensure they don’t introduce bias
- Avoid selection bias:
- systematically question the choice of data sources. Using only current customer data for predictions, for example, can lead to overly optimistic results
- Manage self-selection bias:
- keep surveys short and offer incentives for quality responses. Pre-select respondents to reduce bias, such as using expert references instead of asking entrepreneurs directly if they are successful
- Minimize omitted variable bias:
- use all available features, even unnecessary ones, and rely on regularization to determine their importance. If a key feature is missing, consider using a proxy
- Reduce sponsorship bias:
- investigate the incentives of data sources, especially in areas like tobacco or pharmaceuticals where bias may be introduced by sponsors
- Avoid sampling bias:
- research real-world data proportions and ensure similar proportions in your training data
- Control prejudice or stereotype bias:
- ensure balanced data representation by adjusting for under- or over-represented groups during training
- Address systematic value distortion:
- use multiple measuring devices or trained humans to reduce bias in measurements
- Prevent experimenter bias:
- let multiple people validate survey questions and opt for open-ended questions. If using multiple-choice, include an “Other” option
- Avoid labeling bias:
- use multiple labelers to label the same examples and compare their decisions. Investigate frequent document skipping by labelers
- Human involvement:
- keep humans involved in data gathering and preparation, as models trained on biased data will produce biased results
What is good data
- good data is informative: it contains enough information that can be used for modeling
- good data has good coverage of what you want to do with the model
- good data reflects real inputs
- good data is as unbiased as possible
- good data is not a result of a feedback loop (it is not the result of the model itself)
- good data has consistent labels
- good data is big enough
Dealing with interaction data
Interaction data is the data you can collect from user interactions with the system our model supports.
Good interaction data contains information on 3 aspects:
- context of interaction
- action of the user in that context
- outcome of interaction
Causes of data leakage
- target is a function of a feature
- a feature hides the target
- features from the future
Data partitioning
To obtain good partitions (train, validation, test) of a dataset, partitioning has to satisfy several conditions:
- Split was applied to raw data
- Data was randomised before the split (except in time-series cases)
- Validation and test sets follow the same distribution
- Leakage during the split was avoided
Group leakage - imagine you have magnetic resonance images of the brains of multiple patients. Each image is labeled with a certain brain disease, and the same patient may be represented by several images taken at different times. If you apply the partitioning technique discussed above (shuffle, then split), images of the same patient might appear in both the training and holdout data.
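A minimal sketch of a group-aware split that avoids this, using scikit-learn’s GroupShuffleSplit on made-up data (the patient_ids array is hypothetical): every image of a given patient ends up on the same side of the split.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

X = np.random.rand(12, 5)                 # 12 images, 5 features each
y = np.random.randint(0, 2, size=12)      # disease label per image
patient_ids = np.repeat([1, 2, 3, 4], 3)  # 4 patients, 3 images each

splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=patient_ids))

# no patient appears in both sets, so this prints an empty set
print(set(patient_ids[train_idx]) & set(patient_ids[test_idx]))
```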
Dealing with missing attributes
- removing the examples with missing attributes from the dataset (if the data is big enough)
- using a learning algorithm that can deal with missing attribute values (such as decision trees)
- using data imputation
If we use the mean or something similar for data imputation, data leakage might appear; however, this type of leakage is not as significant as the other types mentioned above.
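A small sketch of leakage-free mean imputation with scikit-learn (toy data): the means are learned from the training split only and then reused on the holdout split.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

X = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, np.nan], [5.0, 6.0], [7.0, 8.0]])
X_train, X_test = train_test_split(X, test_size=0.4, random_state=0)

imputer = SimpleImputer(strategy="mean")
X_train_filled = imputer.fit_transform(X_train)  # compute the means on the training data only
X_test_filled = imputer.transform(X_test)        # reuse those means on the holdout data
```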
Data augmentation
- images
- text
Dealing with imbalanced data
Oversampling
- Copying examples:
- multiple copies of minority class examples are created, effectively increasing their weight
- Creating synthetic examples:
- synthetic examples are generated by combining feature values from multiple minority class examples. Two well-known algorithms for this purpose are:
- Synthetic Minority Oversampling Technique (SMOTE): for a given minority class example, it selects a number of nearest neighbors. A new synthetic example is generated by taking the original example and adding a fraction of the difference between it and one of its randomly chosen neighbors
- Adaptive Synthetic Sampling Method (ADASYN): similar to SMOTE, but the number of synthetic examples created for each original example is proportional to the number of neighbors that are not from the minority class. This approach generates more synthetic examples in areas where minority class examples are sparse
Undersampling
- The undersampling can be done randomly - the examples to remove from the majority class can be chosen at random
- Property-Based selection - examples can be selected for removal based on specific properties. One such property involves Tomek links. A Tomek link exists between two examples from different classes if no other example is closer to either of them than they are to each other. Closeness can be defined using metrics like cosine similarity or Euclidean distance. Removing examples based on Tomek links can help establish a clear margin between classes.
- Cluster-Based - first, decide how many examples you want to retain in the majority class after undersampling. This number is denoted as k. Then, run a centroid-based clustering algorithm on the majority class examples, using k as the desired number of clusters. After clustering, replace all examples in the majority class with the k centroids. An example of a centroid-based clustering algorithm is k-means.
Hybrid strategies, combining over- and undersampling, exist as well.
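For reference, a rough sketch of both strategies using the third-party imbalanced-learn package (my own example, assuming the package is installed):

```python
from collections import Counter

from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification

# toy dataset with a 90/10 class imbalance
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
print("original:", Counter(y))

# oversampling: synthesize new minority examples
X_over, y_over = SMOTE(random_state=0).fit_resample(X, y)
print("after SMOTE:", Counter(y_over))

# undersampling: randomly drop majority examples
X_under, y_under = RandomUnderSampler(random_state=0).fit_resample(X, y)
print("after undersampling:", Counter(y_under))
```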
Data sampling strategies
When we have a large dataset, it’s not always practical or necessary to work with all of it. Instead, we can draw a smaller sample that contains enough information for learning.
There are two main strategies: probability sampling and nonprobability sampling
- In probability sampling, all examples have a chance to be selected. These techniques involve randomness
- Nonprobability sampling is not random. To build a sample, it follows a fixed deterministic sequence of heuristic actions. This means that some examples don’t have a chance of being selected, no matter how many samples you build.
Simple random sampling
It is the most straightforward method. Here, each example from the entire dataset is chosen purely by chance; each example has an equal chance of being selected.
Systematic (or interval) sampling
We create a list containing all examples. From that list, we randomly select the first example x_start from the first k elements. Then we select every k-th item on the list, starting from x_start. We choose a value of k that gives a sample of the desired size.
An advantage of this method over simple random sampling is that it draws examples from the whole range of values. However, systematic sampling is inappropriate if the list of examples has periodicity or repetitive patterns, in which case the obtained sample can exhibit a bias. On the other hand, if the list of examples is randomised, then systematic sampling often results in a better sample than simple random sampling.
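A tiny sketch of how I understand interval sampling (my own code, not from the book):

```python
import random

def systematic_sample(examples, k, seed=None):
    """Pick a random start among the first k examples, then take every k-th one."""
    rng = random.Random(seed)
    x_start = rng.randrange(k)
    return examples[x_start::k]

data = list(range(100))
print(systematic_sample(data, k=10, seed=0))  # ~10 evenly spaced examples
```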
Stratified sampling
If we know about the existence of several groups (e.g. gender, location, or age) in our data, we should have examples from each of those groups in our sample. In stratified sampling, we first divide the dataset into groups (called strata) and then randomly select examples from each stratum, like in simple random sampling. The number of examples to select from each stratum is proportional to the size of the stratum.
If we don’t know how to define the strata, we can use clustering algorithms, where we need to decide on the number of clusters.
Stratified sampling is the slowest of the three methods due to the additional overhead of working with several independent strata. However, its potential benefit of producing a less biased sample typically outweighs its drawbacks.
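A small sketch of proportional stratified sampling, (ab)using scikit-learn’s train_test_split with the stratify argument; the region column and the 70/30 proportions are made up.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "region": ["north"] * 700 + ["south"] * 300,
    "value": range(1000),
})

# keep 10% of the data while preserving the 70/30 region proportions
sample, _ = train_test_split(df, train_size=0.1, stratify=df["region"], random_state=42)
print(sample["region"].value_counts())  # roughly 70 north, 30 south
```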
Storing Data
- data formats (CSV, XML, JSON, PARQUET, etc.)
- data storage levels (filesystems, object storage, database, data lake)
- data versioning:
- level 0: data is unversioned
- level 1: data is versioned as a snapshot at training time
- level 2: both data and code are versioned as one asset
- level 3: using or building a specialised data versioning solution
- docs and metadata: what the data means, how it was collected/created, the partitioning details, preprocessing steps, whether any data was excluded, the format used to store the data, the types of attributes/features, and the number of examples
Data manipulation best practices
Reproducibility
Reproducibility is crucial in all aspects of data collection and preparation. To ensure reproducibility, avoid manual data transformations and reliance on ad hoc tools, such as regular expressions or quick commands in text editors and command line shells.
Data collection and transformation typically involve multiple stages, including:
- downloading data from web APIs or databases
- replacing multiword expressions with unique tokens
- removing stop-words and noise
- cropping and unblurring images
- imputation of missing values
Each stage should be implemented as a software script, such as in Python or R, detailing inputs and outputs. This approach helps you track all changes made to the data. If an issue arises during any stage, you can simply fix the script and rerun the entire data processing pipeline from the beginning.
Manual interventions, on the other hand, can be challenging to reproduce, especially when working with updated datasets or larger volumes of data. Automating the process ensures scalability and consistency in data handling.
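As a tiny illustration, this is the kind of rerunnable stage script I have in mind (the file paths and cleaning steps are made up):

```python
import pandas as pd

RAW_PATH = "data/raw/reviews.csv"      # hypothetical input produced by the download stage
CLEAN_PATH = "data/clean/reviews.csv"  # output consumed by the next stage

def prepare(raw_path: str, clean_path: str) -> None:
    df = pd.read_csv(raw_path)
    df = df.dropna(subset=["text"])                  # drop rows with no text
    df["text"] = df["text"].str.lower().str.strip()  # normalise the text column
    df.to_csv(clean_path, index=False)

if __name__ == "__main__":
    prepare(RAW_PATH, CLEAN_PATH)  # rerunning the script reproduces the output exactly
```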
Data first, Algorithm second
Focus most of your effort and time on getting more data of wide variety and high quality, instead of trying to squeeze the maximum out of a learning algorithm.
Data augmentation, when implemented well, will most likely contribute more to the quality of the model than the search for the best hyperparameter values or model architecture.
Streamed for 1.5hr today
I read a few different posts on scikit-learn’s documentation because they are very well-written and have good examples.
That is all for today!
See you tomorrow :)