(Day 314) Streamed + new video + more sklearn docs

Ivan Ivanov · November 10, 2024

Hello :) Today is Day 314!

A quick summary of today:

  • covariance estimation
  • novelty and outlier detection
  • made a basic video on my notebook on multicollinearity
  • streamed

Covariance estimation

Many statistical problems require the estimation of a population’s covariance matrix, which can be seen as an estimation of the data set’s scatter plot shape. Most of the time, such an estimation has to be done on a sample whose properties (size, structure, homogeneity) have a large influence on the estimation’s quality. The sklearn.covariance package provides tools for accurately estimating a population’s covariance matrix under various settings. We assume that the observations are independent and identically distributed (i.i.d.).

Empirical covariance

The covariance matrix is known to be well approximated by the classical maximum likelihood estimator (or empirical covariance), provided the number of observations is large enough compared to the number of features. The maximum likelihood estimator (MLE) of a sample is an asymptotically unbiased estimator of the corresponding population’s covariance matrix.

When using EmpiricalCovariance’s fit method we need to be careful, as the results depend on whether the data is centered. For this there is an assume_centered parameter: if assume_centered=False, the test set is supposed to have the same mean vector as the training set. If not, both should be centered by the user, and assume_centered=True should be used.
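As a minimal sketch (on made-up Gaussian toy data, not any particular dataset), fitting an EmpiricalCovariance estimator looks like this:

```python
import numpy as np
from sklearn.covariance import EmpiricalCovariance

rng = np.random.RandomState(0)
# Toy data: 500 samples, 3 correlated features (hypothetical example data)
X = rng.multivariate_normal(
    mean=[0.0, 0.0, 0.0],
    cov=[[1.0, 0.6, 0.2], [0.6, 1.0, 0.3], [0.2, 0.3, 1.0]],
    size=500,
)

# assume_centered=False (the default): the estimator centers the data itself
emp_cov = EmpiricalCovariance(assume_centered=False).fit(X)
print(emp_cov.covariance_)   # estimated covariance matrix
print(emp_cov.location_)     # estimated mean vector
```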

Shrunk covariance

The shrunk covariance approach is a transformation of the empirical covariance matrix that mitigates issues with inversion and eigenvalue estimation, commonly encountered with the Maximum Likelihood Estimator (MLE). This transformation, applied in sklearn via a shrinkage coefficient, reduces the ratio between the smallest and largest eigenvalues, effectively regularizing the covariance matrix by offsetting the eigenvalues to manage the bias-variance trade-off.

Basic shrinkage

A convex transformation of the MLE (mixing it with a scaled identity matrix) acts as a penalty, making the covariance estimate more stable.
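A quick sketch of what that transformation means in practice, using ShrunkCovariance on toy data with a hand-picked (purely illustrative) shrinkage value, and checking it against the written-out formula:

```python
import numpy as np
from sklearn.covariance import ShrunkCovariance, empirical_covariance

rng = np.random.RandomState(0)
X = rng.randn(50, 20)  # few samples relative to features, where the MLE is poorly conditioned

# Fixed shrinkage coefficient (an illustrative value; it is a user-chosen hyperparameter)
shrunk = ShrunkCovariance(shrinkage=0.2).fit(X)

# The same transformation written out: (1 - a) * Sigma_hat + a * (trace(Sigma_hat) / p) * Id
emp = empirical_covariance(X)
a, p = 0.2, X.shape[1]
manual = (1 - a) * emp + a * np.trace(emp) / p * np.eye(p)
print(np.allclose(shrunk.covariance_, manual))  # should print True
```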

Ledoit-Wolf Shrinkage

This method calculates an optimal shrinkage coefficient to minimize the Mean Squared Error (MSE) of the covariance estimate. Notably, when the number of samples greatly exceeds the number of features and the population covariance matrix is isotropic, the Ledoit-Wolf shrinkage approaches 1, yielding an estimate that is a multiple of the identity matrix.

Oracle Approximating Shrinkage (OAS)

For Gaussian-distributed data, the OAS method achieves lower MSE than Ledoit-Wolf’s estimator.
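A small comparison sketch on made-up Gaussian data: both estimators pick their shrinkage coefficient from the data, so no hyperparameter needs to be set by hand.

```python
import numpy as np
from sklearn.covariance import LedoitWolf, OAS

rng = np.random.RandomState(42)
# Gaussian toy data with an isotropic population covariance (hypothetical setup)
X = rng.multivariate_normal(mean=np.zeros(10), cov=np.eye(10), size=40)

lw = LedoitWolf().fit(X)
oas = OAS().fit(X)

# The shrinkage coefficient chosen by each method
print("Ledoit-Wolf shrinkage:", lw.shrinkage_)
print("OAS shrinkage:        ", oas.shrinkage_)
```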

Sparse inverse covariance

The matrix inverse of the covariance matrix, often called the precision matrix, is proportional to the partial correlation matrix. It gives the partial independence relationship. In other words, if two features are independent conditionally on the others, the corresponding coefficient in the precision matrix will be zero. This is why it makes sense to estimate a sparse precision matrix: the estimation of the covariance matrix is better conditioned by learning independence relations from the data. This is known as covariance selection.

In cases where n_samples is on the order of n_features or smaller, sparse inverse covariance estimation tends to work better than the shrunk covariance estimators. However, in the opposite situation, or for very correlated data, it can be numerically unstable. In addition, unlike shrinkage estimators, sparse estimators are able to recover off-diagonal structure.

In the stock market structure example, a sparse inverse covariance estimator (GraphicalLassoCV) is used to compute the similarities between stocks, and this similarity matrix is then passed to affinity_propagation (a clustering algorithm).
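A rough sketch of that pipeline on stand-in data (the real example uses historical stock return variations; here two hidden "sectors" drive eight made-up series):

```python
import numpy as np
from sklearn.covariance import GraphicalLassoCV
from sklearn.cluster import affinity_propagation

rng = np.random.RandomState(0)
# Stand-in for daily return variations: two latent "sectors" driving 8 series
n_days = 200
sector_a, sector_b = rng.randn(n_days, 1), rng.randn(n_days, 1)
X = np.hstack([
    sector_a + 0.3 * rng.randn(n_days, 4),   # 4 series tied to sector A
    sector_b + 0.3 * rng.randn(n_days, 4),   # 4 series tied to sector B
])
X /= X.std(axis=0)  # the stock example standardizes the series first

# Sparse inverse covariance, with the regularization strength picked by cross-validation
edge_model = GraphicalLassoCV().fit(X)

# Cluster on the learned covariance, as the stock market structure example does
_, labels = affinity_propagation(edge_model.covariance_, random_state=0)
print("cluster labels:", labels)  # series 0-3 and 4-7 should land in separate clusters
```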

Robust covariance estimation

The two main approaches above are sensitive to outliers. In such cases we should use a robust covariance estimator. Such algorithms can in addition be used for outlier detection, since they discard or downweight outlying observations.
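The robust estimator in sklearn.covariance is MinCovDet (Minimum Covariance Determinant). A small sketch on toy data with a few injected outliers, compared against the non-robust MLE, which gets pulled towards the contamination:

```python
import numpy as np
from sklearn.covariance import MinCovDet, EmpiricalCovariance

rng = np.random.RandomState(0)
X = rng.multivariate_normal([0, 0], [[1.0, 0.7], [0.7, 1.0]], size=200)
# Contaminate the sample with a few gross outliers (hypothetical contamination)
X[:10] = rng.uniform(low=8, high=12, size=(10, 2))

robust = MinCovDet(random_state=0).fit(X)   # robust (MCD) estimate
classic = EmpiricalCovariance().fit(X)      # non-robust MLE, distorted by the outliers

print("MCD estimate:\n", robust.covariance_)
print("MLE estimate:\n", classic.covariance_)
```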

Novelty and Outlier Detection

Overview of outlier detection methods

image

Local Outlier Factor (LOF) does not show a decision boundary in black, as it has no predict method that can be applied to new data when it is used for outlier detection.

ensemble.IsolationForest and neighbors.LocalOutlierFactor perform reasonably well on the data sets considered here. The svm.OneClassSVM is known to be sensitive to outliers and thus does not perform very well for outlier detection. That being said, outlier detection in high-dimension, or without any assumptions on the distribution of the inlying data is very challenging. svm.OneClassSVM may still be used with outlier detection but requires fine-tuning of its hyperparameter nu to handle outliers and prevent overfitting. linear_model.SGDOneClassSVM provides an implementation of a linear One-Class SVM with a linear complexity in the number of samples. This implementation is here used with a kernel approximation technique to obtain results similar to svm.OneClassSVM which uses a Gaussian kernel by default. Finally, covariance.EllipticEnvelope assumes the data is Gaussian and learns an ellipse.

Novelty detection

Consider a data set of n observations from the same distribution described by p features. Consider now that we add one more observation to that data set. Is the new observation so different from the others that we can doubt it is regular? (i.e. does it come from the same distribution?) Or on the contrary, is it so similar to the others that we cannot distinguish it from the original observations? This is the question addressed by the novelty detection tools and methods.

The One-Class SVM is used for this purpose and requires the choice of a kernel and a scalar parameter to define a frontier. The RBF kernel is usually chosen, although there exists no exact formula or algorithm to set its bandwidth parameter; it is the default in the scikit-learn implementation. The nu parameter, also known as the margin of the One-Class SVM, corresponds to the probability of finding a new, but regular, observation outside the frontier.

image
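A minimal novelty-detection sketch with OneClassSVM, trained on made-up "regular" data only (the gamma and nu values are just illustrative):

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
# Train on "regular" observations only (novelty detection assumes a clean training set)
X_train = 0.3 * rng.randn(100, 2)
X_new = np.array([[0.1, 0.0], [4.0, 4.0]])  # one regular-looking point, one far-away point

# The RBF kernel is the default; nu bounds the fraction of regular points outside the frontier
clf = OneClassSVM(kernel="rbf", gamma=0.1, nu=0.1).fit(X_train)
print(clf.predict(X_new))            # +1 = regular, -1 = novelty
print(clf.decision_function(X_new))  # signed distance to the learned frontier
```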

Scaling up

An online linear version of the One-Class SVM is implemented in sklearn. It scales linearly with the number of samples and can be used with a kernel approximation to approximate the solution of a kernelized svm.OneClassSVM, whose complexity is at best quadratic in the number of samples.
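A sketch of that setup, combining SGDOneClassSVM with a Nystroem kernel approximation in a pipeline (toy data and illustrative hyperparameters):

```python
import numpy as np
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import SGDOneClassSVM
from sklearn.pipeline import make_pipeline

rng = np.random.RandomState(42)
X_train = 0.3 * rng.randn(500, 2)

# Nystroem approximates the RBF kernel, so the linear SGD one-class SVM behaves
# similarly to the kernelized svm.OneClassSVM while scaling linearly in n_samples
clf = make_pipeline(
    Nystroem(gamma=0.1, n_components=100, random_state=42),
    SGDOneClassSVM(nu=0.1, random_state=42),
)
clf.fit(X_train)
print(clf.predict([[0.0, 0.0], [5.0, 5.0]]))  # +1 = inlier, -1 = outlier
```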

Outlier detection

Outlier detection is similar to novelty detection in the sense that the goal is to separate a core of regular observations from some polluting ones, called outliers. Yet, in the case of outlier detection, we don’t have a clean data set representing the population of regular observations that can be used to train any tool.

Fitting an elliptic envelope

One common way of performing outlier detection is to assume that the regular data come from a known distribution (e.g. data are Gaussian distributed). From this assumption, we generally try to define the “shape” of the data, and can define outlying observations as observations which stand far enough from the fit shape.

image
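A minimal sketch with EllipticEnvelope on toy data where a few points are deliberately injected far from the Gaussian bulk (the contamination value is just an assumption about the data):

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

rng = np.random.RandomState(0)
# Mostly Gaussian data with a handful of injected outliers (hypothetical contamination)
X = np.vstack([
    rng.multivariate_normal([0, 0], [[1.0, 0.5], [0.5, 1.0]], size=190),
    rng.uniform(low=-8, high=8, size=(10, 2)),
])

# contamination is the expected proportion of outliers in the data
detector = EllipticEnvelope(contamination=0.05, random_state=0).fit(X)
labels = detector.predict(X)   # +1 = inlier, -1 = outlier
print("flagged outliers:", np.sum(labels == -1))
```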

Isolation forest

Isolation Forest is a method for detecting outliers in high-dimensional data by isolating observations. It works by building a forest of randomly partitioned trees, where each sample is ‘isolated’ by splitting features at random values. The path length from the root to a leaf node in each tree represents how isolated a sample is. Anomalies tend to have shorter path lengths on average, as they are easier to isolate. The implementation uses an ensemble of ExtraTreeRegressor trees, with a maximum tree depth proportional to the logarithm of the sample size.

image
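A minimal sketch on toy data (the contamination level is an assumption made up for the example):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X = np.vstack([
    0.3 * rng.randn(200, 2),                    # dense cluster of regular points
    rng.uniform(low=-4, high=4, size=(10, 2)),  # scattered anomalies
])

clf = IsolationForest(n_estimators=100, contamination=0.05, random_state=0).fit(X)
print(clf.predict(X[:5]))        # +1 = inlier, -1 = outlier
print(clf.score_samples(X[:5]))  # lower score = shorter average path length = more anomalous
```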

Local Outlier Factor

This is another efficient method for outlier detection on high-dimensional datasets. It computes a score (called the local outlier factor) reflecting the degree of abnormality of the observations. It measures the local density deviation of a given data point with respect to its neighbors. The idea is to detect the samples that have a substantially lower density than their neighbors.

In practice the local density is obtained from the k-nearest neighbors. The LOF score of an observation is equal to the ratio of the average local density of its k-nearest neighbors, and its own local density: a normal instance is expected to have a local density similar to that of its neighbors, while abnormal data are expected to have much smaller local density.

The strength of the LOF algorithm is that it takes both local and global properties of datasets into consideration: it can perform well even in datasets where abnormal samples have different underlying densities. The question is not, how isolated the sample is, but how isolated it is with respect to the surrounding neighborhood.

image
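A small sketch of LOF used for outlier detection on toy data with two clusters of different densities plus some scattered points:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(0)
X = np.vstack([
    0.3 * rng.randn(100, 2) + 2,                # dense cluster
    1.5 * rng.randn(100, 2) - 2,                # sparser cluster with a different density
    rng.uniform(low=-6, high=6, size=(10, 2)),  # scattered outliers
])

# Default mode (novelty=False): fit_predict labels the training data itself
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
labels = lof.fit_predict(X)   # +1 = inlier, -1 = outlier
print("flagged:", np.sum(labels == -1))
print("LOF scores (more negative = more abnormal):", lof.negative_outlier_factor_[:5])
```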

Novelty detection with LOF

To use that, we need to set LocalOutlierFactor(novelty=True). When novelty is set to True, we must only use predict, decision_function and score_samples on new unseen data and not on the training samples, as this would lead to wrong results.

image
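A minimal sketch of the novelty-detection mode, fitted on clean toy data and queried only on new points:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(0)
X_train = 0.3 * rng.randn(100, 2)            # clean training data only
X_new = np.array([[0.1, -0.1], [4.0, 4.0]])  # unseen observations

# novelty=True enables predict / decision_function / score_samples on new data;
# they must not be called on the training samples themselves
lof = LocalOutlierFactor(n_neighbors=20, novelty=True).fit(X_train)
print(lof.predict(X_new))            # +1 = regular, -1 = novelty
print(lof.decision_function(X_new))  # negative values indicate novelties
```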

The full story behind multicollinearity

The original notebook and post were made on day 121. But given that I am doing YouTube streams these days and have a few viewers, I decided to make a video of the notebook. Nothing special, it was pretty much me reciting/reading my notebook, but I hope it reaches a larger audience and that I get feedback in case something is incorrect.

Streamed

Covered more of the sklearn docs’ user guide.

  • density estimation
  • cross validation
  • hyperparam tuning

That is all for today!

See you tomorrow :)