(Day 281) Read chapter 4 of Andriy Burkov's MLE book

Ivan Ivanov · October 8, 2024

Hello :) Today is Day 281!

A quick summary of today:

  • chapter 4 of Andriy Burkov’s MLE book

Chapter 4 - Feature Engineering

After data collection and preparation, feature engineering is the second most important activity in machine learning.


Feature engineering is a process of first conceptually and then programmatically transforming a raw example into a feature vector.

How to engineer features

Feature engineering in text

There are two common techniques: one-hot encoding and bag-of-words.
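
A minimal sketch of both, assuming scikit-learn and a toy corpus (the data below is made up for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder

# Bag-of-words: each document becomes a vector of token counts
docs = ["spam offer inside", "please see the attached agenda"]
bow = CountVectorizer().fit_transform(docs).toarray()

# One-hot encoding: each categorical value becomes its own binary column
colours = [["red"], ["green"], ["red"]]
ohe = OneHotEncoder().fit_transform(colours).toarray()

print(bow)
print(ohe)
```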

Categorical features to numbers

OHE is not the only method. There are also:

  • mean encoding (also called bin counting or feature calibration)

  • odds ratio and log-odds ratio

The odds ratio (OR) is used to quantify the association between two events, such as a feature and a binary label. In binary classification, it measures how the presence of a feature (e.g., a word in an email) is associated with the likelihood of a certain outcome (e.g., spam).

The odds ratio compares the odds of an event happening in the presence of some condition to the odds of the event happening in the absence of that condition.

Key concept:

  • Odds of an event = P(A happens)/P(A doesn’t happen)

To calculate the odds ratio for two events, A (the feature being present, e.g., the word ‘infected’) and B (the positive class, e.g., spam):

OR = Odds of B when A is present / Odds of B when A is absent

|                                 | Spam (B) | Not Spam (¬B) | Total |
|---------------------------------|----------|---------------|-------|
| Contains “infected” (A)         | 145      | 8             | 153   |
| Doesn’t contain “infected” (¬A) | 346      | 2909          | 3255  |
| Total                           | 491      | 2917          | 3408  |
  • Odds of spam when ‘infected’ is present = 145/8 = 18.125
  • Odds of spam when ‘infected’ is absent = 346/2909 = 0.119

Now, calculate the odds ratio:

OR = (145/8)/(346/2909) = 152.4

An odds ratio of 152.4 means that the odds of being spam are about 152 times higher for emails containing the word “infected” than for emails that don’t contain it.

Log-Odds Ratio

The log-odds ratio is the logarithm of the odds ratio, and it helps in managing extremely large or small values, which can arise when the odds ratio is very high or very low. It is easier to work with log values in machine learning models.

Here’s the formula for the log-odds ratio:

log(OR) = log(145) − log(8) − log(346) + log(2909) ≈ 2.2 (using base-10 logarithms)

This log-odds ratio value of 2.2 can be used to replace the categorical value ‘infected’ in the model as a numeric feature.
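
A quick sketch of these calculations in Python, using the counts from the table above (the base-10 logarithm reproduces the 2.2 value):

```python
import math

# Counts from the contingency table above
spam_with, not_spam_with = 145, 8            # emails containing "infected"
spam_without, not_spam_without = 346, 2909   # emails not containing "infected"

odds_present = spam_with / not_spam_with         # 18.125
odds_absent = spam_without / not_spam_without    # ~0.119
odds_ratio = odds_present / odds_absent          # ~152.4
log_odds = math.log10(odds_ratio)                # ~2.2 (base-10 logarithm)

print(round(odds_ratio, 1), round(log_odds, 1))
```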

  • sine-cosine transformation

When categorical features are cyclical, integer encoding does not work well. For example, convert Monday through Sunday to the integers 1 through 7: the difference between Sunday and Saturday is 1, while the difference between Monday and Sunday is −6. Intuitively, though, that difference should also be 1, because Monday is just one day past Sunday.

Instead, we can use the sine-cosine transformation, which converts a cyclical feature into two synthetic features. If p is the feature of interest, we can replace it with the following two values:

p_sin = sin(2πp / max(p)),  p_cos = cos(2πp / max(p))

where max(p) is the length of the cycle (7 for the days of the week).

The table below shows the days of the week converted using this transformation.

| p | p_sin | p_cos |
|---|-------|-------|
| 1 | 0.78  | 0.62  |
| 2 | 0.97  | -0.22 |
| 3 | 0.43  | -0.90 |
| 4 | -0.43 | -0.90 |
| 5 | -0.97 | -0.22 |
| 6 | -0.78 | 0.62  |
| 7 | 0.00  | 1.00  |

[figure: the seven (p_sin, p_cos) points plotted in two dimensions, forming a circle]

If we visualise the points from the table, we can clearly see the cyclical nature.

In our data, we can replace Monday with [0.78, 0.62], and similarly for the rest of the week. This representation might work better than simple ordinal encoding.
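
A minimal sketch of this transformation, assuming NumPy and a weekly cycle of 7 values (it reproduces the table above):

```python
import numpy as np

def sin_cos_encode(p, period=7):
    """Encode a cyclical value p in {1, ..., period} as two features."""
    angle = 2 * np.pi * p / period
    return np.sin(angle), np.cos(angle)

for day in range(1, 8):
    p_sin, p_cos = sin_cos_encode(day)
    print(day, round(p_sin, 2), round(p_cos, 2))
```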

Feature hashing

Feature hashing converts text data, or categorical attributes with many values, into a feature vector of arbitrary dimensionality.

BoW and OHE handle text and categorical features well, but they might add too many dimensions to the dataset.

The hashing trick is a method to manage large datasets by converting categorical attributes (or document tokens) into feature vectors of a fixed dimensionality. The process works as follows:

  1. Decide on dimensionality: choose the size of the feature vectors (like 5)
  2. Hash function: apply a hash function to each value (e.g., words), converting them into non-negative integers
  3. Modulo operation: take each hashed value modulo the chosen dimensionality to get its index in the feature vector

Commonly used hash functions are MurmurHash3, Jenkins, CityHash, and MD5.
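
A minimal sketch of steps 1–3, using MD5 as the hash function (the token list and the dimensionality of 5 are arbitrary choices for illustration):

```python
import hashlib

def hash_trick(tokens, dim=5):
    """Hashing trick: map a list of tokens into a fixed-size count vector."""
    vec = [0] * dim
    for token in tokens:
        # Step 2: hash the token to a non-negative integer
        h = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16)
        # Step 3: modulo the dimensionality to pick an index
        vec[h % dim] += 1
    return vec

print(hash_trick(["love", "is", "a", "doing", "word"]))
```

In practice, scikit-learn’s FeatureHasher and HashingVectorizer implement the same trick (using MurmurHash3).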

Topic modelling

Topic modeling is a family of techniques that uses unlabeled data, typically in the form of natural language text documents. The model learns to represent a document as a vector of topics. For example, in a collection of news articles, the five major topics could be sports, politics, entertainment, finance, and technology. Then, each document could be represented as a five-dimensional feature vector, one dimension per topic:

[0.04, 0.5, 0.1, 0.3, 0.06]

Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA) are examples of topic modelling algorithms.
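
A minimal LDA sketch with scikit-learn on a tiny made-up corpus; each row of the output is a topic-proportion vector like the one above:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the team won the match in extra time",
    "parliament passed the new budget bill today",
    "the studio announced a sequel to the hit film",
]

counts = CountVectorizer(stop_words="english").fit_transform(docs)
lda = LatentDirichletAllocation(n_components=3, random_state=0)
doc_topics = lda.fit_transform(counts)   # shape: (n_documents, n_topics)
print(doc_topics.round(2))
```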

Properties of good features

  • high predictive power
  • fast computability
  • reliability
  • uncorrelatedness

Feature selection

If the feature vector is very wide (contains thousands or millions of features), the training time can become prohibitively long. Furthermore, the overall size of the training data can become too large to fit in the RAM of a conventional server.

Cutting the long tail

Typically, if a feature contains information (e.g., a non-zero value) for only a handful of examples, it can be removed from the feature vector. In bag-of-words, you can plot the distribution of token counts and then cut off the so-called long tail.

The threshold that defines the long tail is somewhat subjective. You can treat it as a hyperparameter of your problem and discover the optimal value experimentally.
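
One simple way to cut the tail for bag-of-words, assuming scikit-learn: the min_df parameter of CountVectorizer plays the role of the threshold hyperparameter.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the server crashed", "the server rebooted", "quorum lost"]

# Tokens appearing in fewer than min_df documents are dropped (the long tail)
vectorizer = CountVectorizer(min_df=2)
X = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())   # only 'server' and 'the' survive
```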


Boruta

This is apparently popular on Kaggle.

Boruta iteratively trains random forest models and runs statistical tests to identify features as important or unimportant.

One useful property of random forests is their built-in capability to estimate the importance of each feature.

This importance estimation works in two stages. First, the model classifies all training examples from the original training set. Each decision tree in the random forest votes only on the classification of examples that weren’t used to build that tree. After a tree is tested, the number of correct predictions is recorded for that tree.

At the second stage, the values of a certain feature are randomly permuted across examples, and the tests are repeated. The number of correct predictions is once again recorded for each tree. The importance of the feature for a single tree is then computed as the difference in the number of correct classifications between the original and permuted settings, divided by the number of examples. To obtain the feature importance score, the per-tree importance measures are averaged. While not strictly necessary, it’s convenient to use z-scores instead of the raw importance scores.

The underlying idea of Boruta is simple: we first extend the list of features by adding a randomized copy of each original feature, and then build a classifier based on this extended dataset. To assess the importance of an original feature, we compare it to all randomized features. Only features for which the importance is higher than that of the randomized features — and statistically significant — are considered truly important.

It is important to note that Boruta is a heuristic, and there are no theoretical guarantees for its performance. To be confident that Boruta isn’t doing harm, run it multiple times and check that the feature selection is stable.
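
A simplified, single-round sketch of the shadow-feature idea (real Boruta repeats this many times and adds statistical tests; ready-made implementations such as the BorutaPy package exist):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def one_boruta_round(X, y, seed=0):
    """Compare each real feature's importance to the best shuffled ('shadow') copy."""
    rng = np.random.default_rng(seed)
    X_shadow = np.apply_along_axis(rng.permutation, 0, X)  # permute each column
    X_extended = np.hstack([X, X_shadow])

    forest = RandomForestClassifier(n_estimators=200, random_state=seed)
    forest.fit(X_extended, y)

    n = X.shape[1]
    real_importance = forest.feature_importances_[:n]
    shadow_best = forest.feature_importances_[n:].max()
    return real_importance > shadow_best   # features that beat all shadows this round

# Toy usage: only the first two features carry signal
X = np.random.default_rng(1).normal(size=(300, 6))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
print(one_boruta_round(X, y))
```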

L1-regularization

L1 regularization (Lasso) in particular allows the algorithm to learn to ignore features: it penalizes the model for being too complex. Since it drives some feature weights to exactly zero, it produces a sparse model and implicitly performs feature selection.
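
A small illustration with scikit-learn’s Lasso on synthetic data where only two features matter; most learned coefficients end up exactly zero:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)  # only 2 informative features

lasso = Lasso(alpha=0.1).fit(X, y)
print(np.round(lasso.coef_, 2))   # coefficients of irrelevant features are zeroed out
```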

Task-specific feature selection

When working with text, removing stop words is common: articles (a, the), prepositions, pronouns, etc.

Sometimes it might even be worth replacing rare (infrequent) words with a special token like RARE_WORD if we want to reduce the resulting dimensionality.
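
A tiny sketch of the RARE_WORD replacement, using a made-up frequency threshold:

```python
from collections import Counter

docs = [["the", "server", "crashed"], ["the", "server", "rebooted"], ["quorum", "lost"]]
counts = Counter(token for doc in docs for token in doc)

MIN_COUNT = 2  # hypothetical threshold; tune it for your corpus
docs_reduced = [
    [token if counts[token] >= MIN_COUNT else "RARE_WORD" for token in doc]
    for doc in docs
]
print(docs_reduced)
```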

Synthesizing features

Feature discretisation

Binning, also known as bucketing, is a popular technique that transforms a numerical feature into a categorical one by replacing numerical values in a given range with a single constant categorical value.

There are three typical approaches to binning:

  • uniform binning
  • k-means-based binning
  • quantile-based binning

After binning, the bins must be converted back into numerical form, for example with a technique like OHE.
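
The three approaches map directly onto scikit-learn’s KBinsDiscretizer strategies, and encode='onehot-dense' handles the final OHE step (the toy data below is made up):

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.default_rng(0)
ages = rng.integers(18, 90, size=(100, 1)).astype(float)

for strategy in ["uniform", "kmeans", "quantile"]:
    binner = KBinsDiscretizer(n_bins=5, encode="onehot-dense", strategy=strategy)
    binned = binner.fit_transform(ages)   # 5 one-hot columns per example
    print(strategy, binned.shape)
```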

Synthesizing features from the data

One technique commonly used to synthesize one or more additional features is clustering (by using k-means for example)
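
A short sketch with k-means and a hypothetical choice of 4 clusters; the cluster id (or the distances to the centroids) becomes a new synthetic feature:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
cluster_id = kmeans.labels_.reshape(-1, 1)   # categorical synthetic feature
distances = kmeans.transform(X)              # numeric synthetic features: distance to each centroid

X_augmented = np.hstack([X, distances])
print(X_augmented.shape)
```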

Synthesizing features from other features

Neural networks are known for their ability to learn complex features by combining simple features in non-obvious ways. They combine simple features by letting their values undergo several levels of nested nonlinear transformations. If you have data in abundance, you can train a deep multilayer perceptron model that will learn to cleverly combine the basic unitary features it receives as input.

In practice, the most common way to obtain new features from existing ones is to apply a simple transformation to one feature or a pair of features. Three typical simple transformations for a numerical feature j in example i are 1) discretizing the feature, 2) squaring it, and 3) computing the sample mean and standard deviation of feature j among the k nearest neighbors of example i, found using a metric such as Euclidean distance or cosine similarity.
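
A sketch of the squaring and the k-nearest-neighbour statistics, assuming Euclidean neighbours with k=5 on synthetic data:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))   # 100 examples, 4 numerical features
j, k = 2, 5                     # feature of interest and neighbourhood size

nn = NearestNeighbors(n_neighbors=k + 1, metric="euclidean").fit(X)
_, idx = nn.kneighbors(X)       # column 0 is each example itself
neighbours = idx[:, 1:]

knn_mean = X[neighbours, j].mean(axis=1)   # sample mean of feature j among the neighbours
knn_std = X[neighbours, j].std(axis=1)     # sample standard deviation among the neighbours

X_new = np.column_stack([X, X[:, j] ** 2, knn_mean, knn_std])
print(X_new.shape)
```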

Learning features from data

Learning features from data is especially effective when we can get access to large collections of relevant labeled or unlabeled data, such as text corpora or collections of images from the Web.

  • word embeddings
  • document embeddings

How to choose embedding dimension?

d = fourth root of the number of categories (e.g., 10,000 categories gives d ≈ 10)

A more principled approach to choose the embedding dimensionality is to treat it as a hyperparameter tuned on a downstream task.

Dimensionality reduction

  • PCA
  • UMAP
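
A minimal PCA sketch with scikit-learn (UMAP offers a similar fit_transform interface but lives in the separate umap-learn package):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 50))           # 500 examples, 50 features

pca = PCA(n_components=10)               # keep the 10 strongest components
X_reduced = pca.fit_transform(X)
print(X_reduced.shape, round(float(pca.explained_variance_ratio_.sum()), 2))
```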

Scaling features

Feature scaling means bringing all your features to the same, or very similar, ranges of values or distributions. Multiple experiments have demonstrated that a learning algorithm applied to scaled features might produce a better model. While there’s no guarantee that scaling will have a positive impact on the quality of your model, it’s considered a best practice.

Also, scaling reduces the risk of numerical overflow, the problem that computers have when working with very small or very big numbers.

  • normalisation: x′ = (x − min(x)) / (max(x) − min(x)), which maps values into the range [0, 1]
  • standardisation (z-score): x′ = (x − μ) / σ, where μ is the sample mean and σ the standard deviation

Data leakage in feature engineering

Possible Problems

Data leakage can occur if we perform feature engineering on the entire dataset before splitting it into training, validation, and test sets. This results in:

  • the training set benefiting from knowledge of the holdout sets, leading to overly optimistic model performance
  • for text data, if bag-of-words is used before splitting the data, the model may learn features (tokens) present only in the holdout sets, inflating its performance

Solutions

To prevent data leakage:

  • first split the dataset into training and holdout sets
  • perform feature engineering, such as scaling, binning, or mean encoding, only on the training data (see the sketch below)
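
A sketch of the leakage-safe order of operations, using standardisation as the example transformation:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(1000, 3))

# 1) Split first
X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

# 2) Fit the transformer on the training part only
scaler = StandardScaler().fit(X_train)

# 3) The holdout set is only transformed, never fitted on
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
```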

Storing and documenting features

It’s advised to design a schema file that provides a description of the features’ expected properties.

A schema file is a document that describes features. This file is machine-readable, versioned, and updated each time someone makes significant updates to features. It should contain:

  • names of features
  • for each feature:
    • its type
    • the fraction of examples that are expected to have that feature present
    • min and max values
    • sample mean and variance
    • whether it allows zeros
    • whether it allows undefined values
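
As a hypothetical illustration (the field names and values here are invented, not prescribed by the book), one schema entry could look like this:

```python
# Hypothetical schema entry for a single feature; values are illustrative only
feature_schema = {
    "name": "average_session_length",
    "type": "float",
    "expected_presence": 0.95,    # fraction of examples expected to have the feature
    "min_value": 0.0,
    "max_value": 3600.0,
    "sample_mean": 240.0,
    "sample_variance": 900.0,
    "allows_zeros": True,
    "allows_undefined": False,
}
```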

Feature store

It is a central repository used to store, document, reuse, and share features across teams in large organizations. It helps address several challenges:

  • feature reuse: same features are often re-implemented by different teams
  • varying feature definitions: different teams define features differently, often without accessible documentation
  • computationally intensive features: some real-time models require complex features that are computationally expensive, making them hard to use in real-time applications
  • inconsistency between training and serving: features might differ between training (historical data) and serving (real-time data), leading to inconsistent results
  • feature expiration: it’s unclear when features need to be recomputed, leading to unnecessary recalculations

Key components:

  • feature name: a unique identifier for each feature (e.g. ‘average_session_length’)
  • description: a natural language explanation of the feature (e.g. ‘the average length of a user’s session’)
  • metadata: information such as the feature’s purpose, input/output types, caching duration, and whether it is available for online or offline use
  • feature definition: versioned code that computes the feature’s value

Benefits:

  • online and offline availability: features can be read from cache or computed in real-time, supporting both online and offline use
  • versioning: feature values are versioned, allowing models to be retrained with the same feature values as previous versions
  • reproducibility: previous feature values are retained and timestamped, enabling consistent model building

Use Cases:

  • online feature store: updated in real-time with live data (e.g. ‘restaurant’s average meal preparation time over the last hour’)
  • offline feature store: updated in batch mode with historical data (e.g. ‘restaurant’s average meal preparation time over the last seven days’)

Feature stores sync between online and offline modes, ensuring consistency across real-time and batch applications.


Feature engineering best practices

  • start by generating many simple features
  • reuse legacy systems
  • use IDs as features when needed (e.g., a location ID)
  • reduce cardinality (number of distinct values) when possible (i.e. group similar values, group the long tail, remove the feature)
  • use features based on counts with caution
  • re-do feature selection if necessary
  • test the code carefully
  • keep code, model, and data in sync
  • isolate feature extraction code so that individual feature code can be updated without affecting other features, pipelines, or the model
  • serialise together model and feature extractor
  • log the values of features

I was on a uni trip today so I’m glad I managed to finish this chapter while riding on the bus.

That is all for today!

See you tomorrow :)