(Day 312) Reading more of sklearn's Supervised learning doc

Ivan Ivanov · November 8, 2024

theory traditional-machine-learning statistics

Hello :) Today is Day 312!

A quick summary of today:

LDA and QDA
kernel ridge regression
SVMs
SGD
nearest neighbours

Today, I continued reading down the ‘User Guide’ section of the scikit-learn docs.

I could not stream today, so I reviewed the below pages by myself and kind of summarised each of them. While some of the parts are not really new for me, I still think it’s valuable to review the pages as there are little details that are interesting to me. And I encourage anyone to checkout sklearn’s docs ^^

Linear and Quadratic Discriminant Analysis

LDA and QDA are two classic classifiers with a linear and quadratic decision surface, respectively. Here are 3 examples on 3 different datasets, and as seen the LDA can only learn linear boundaries, while the QDA can learn quadratic and is more flexible.

Dimensionality reduction using LDA

LDA can be used for supervised dim reduction, by projecting the input to a linear subspace consisting of the directions which maximise the separation between classes. The output dim is less than the number of classes so this is in general considered strong dim reduction, and only makes sense in multiclass settings.

Here is an example between dim reduction via LDA and PCA to a 2d projection on the infamous iris dataset (consisting of 3 kind of flowers, and 4 features)

For LDA, the defaul n_components param is None, but here is the full explanation from the docs:

n_components: int, default=None

Number of components (<= min(n_classes - 1, n_features)) for dimensionality reduction. If None, will be set to min(n_classes - 1, n_features). This parameter only affects the transform method.

So, with the iris dataset using just LDA, the n_components becomes min(n_classes - 1, n_features) = 2

If we use PCA with n_components=2 as well, we get:

QDA relation with GaussianNB

If in the QDA model we assume that the covar matrices of individual inputs are diagonal, then the inputs are assumed to be conditionally independent in each class which matches the Gaussian Naive Bayes classifier.

Shrinkage and covar estimator

Shrinkage is a regularisation method that improves covar matrix estimation, especially when there are few samples relative to features. This technique is used in LDA to enhance classifier generalisation. Setting the shrinkage parameter to ‘auto’ enables optimal shrinkage calculation via Ledoit and Wolf’s method, applicable only with lsqr or eigen solvers.

The shrinkage parameter can also be set manually between 0 (no shrinkage) and 1 (full shrinkage), balancing between empirical covariance and a diagonal variance matrix. For normally distributed data, the Oracle Approximating Shrinkage (OAS) estimator often yields lower mean squared error and can improve LDA classification accuracy over Ledoit and Wolf’s approach.

Estimation algorithms

Using LDA and QDA requires computing the log-posterior which depends on the class priors, the class means, and the covar matrix.

The ‘svd’ solver is the default solver used for LDA, and it is the only available solver for QDA. It can perform both classification and transform (for LDA). As it does not rely on the calculation of the covar matrix, the ‘svd’ solver may be preferable in situations where the number of features is large. The ‘svd’ solver cannot be used with shrinkage.

The ‘lsqr’ solver is an efficient algorithm that only works for classification. It needs to explicitly compute the covar matrix, and supports shrinkage and custom covariance estimators.

The ‘eigen’ solver is based on the optimisation of the between class scatter to within class scatter ratio. It can be used for both classification and transformation, and supports shrinkage. However, the ‘eigen’ solver needs to compute the covar matrix, so it might not be suitable for situations with a high number of features.

Kernel ridge regression

KRR combines Ridge regression and classification (linear least squares with l2 norm reg) with the kernel trick. It learns a linear function in the space induced by the respective kernel and the data. For non-linear kernels, this corresponds to a non-linear function in the original space.

The form of the model learned by KRR is identical to SVR, but the loss functions are different. KRR uses squared error loss, while SVR uses epsilon-insensitive loss, both combined with l2 reg.

Note: https://library.fiveable.me/key-terms/predictive-analytics-in-business/epsilon-insensitive-loss-function

The epsilon-insensitive loss function allows for a certain degree of error within the epsilon range, which simplifies the training process by ignoring small discrepancies.
By focusing on larger errors, this loss function aids in improving model performance on real-world datasets where noise and outliers are common.
The choice of epsilon can significantly affect the complexity of the model; smaller epsilon values may lead to more complex models, while larger values promote simpler models.
The implementation of an epsilon-insensitive loss function is a core feature that differentiates support vector regression from traditional regression methods.
Utilizing this loss function can enhance computational efficiency by reducing the number of support vectors needed to define the model.

In contrast to SVR, fitting KRR can be done in closed-form and is typically faster for medium-sized datasets. On the other hand, the learned model is non-sparse and thus slower than SVR, which learns a sparse model for epsilon > 0, at prediction-time.

The below image compares KRR and SVR on an artificial dataset with a sin target function and strong noise. The learned functions are very similar, however, fitting KRR is 7x faster, but prediction of 100,000 target values is >3x faster with SVR (since it has learned a sparse model using only approx. 1/3 of the 100 training datapoints as support vectors)

SVR’s sparsity depends on C and epsilon.

C: float, default=1.0
Regularization parameter. The strength of the regularization is inversely proportional to C. Must be strictly positive. The penalty is a squared l2.

epsilon: float, default=0.1
Epsilon in the epsilon-SVR model. It specifies the epsilon-tube within which no penalty is associated in the training loss function with points predicted within a distance epsilon from the actual value. Must be non-negative.

They even say:

The fit time complexity is more than quadratic with the number of samples which makes it hard to scale to datasets with more than a couple of 10000 samples. For large datasets consider using LinearSVR or SGDRegressor instead, possibly after a Nystroem transformer or other Kernel Approximation.

Support Vector Machines

SVMs are a set of supervised learning methods used for classification, regression and outlier detection

Advantages:

effective in high dim spaces
still effective in cases where number of dims is greater than the number of samples
uses a subset of training points in the decision function (called support vectors), so it is memory efficient
versatile: different kernel functions can be specified for the decision function. Common kernels are provided, but there is also the option for custom ones

Disadvantages:

if the number of features is much greater than the number of samples, avoid overfitting in choosing kernel functions and regularisation term is crucial
SVMs do not directly provide probability estimates, these are calculated using an expensive 5-fold CV

Sklearn’s SVMs support both dense and sparse sample vectors as input. However, to use an SVM to make preds on sparse data, it must have been fit on such data.

Classification

In sklearn, SVC, NuSVC and LinearSVC are classes capable of performing binary and multi-class classification.

SVC and NuSVC are similar methods, but accept slightly different sets of parameters and have different mathematical formulations. On the other hand, LinearSVC is another (faster) implementation of Support Vector Classification for the case of a linear kernel. It also lacks some of the attributes of SVC and NuSVC, like support_ (which gives the indices of support vectors). LinearSVC uses squared_hinge loss and due to its implementation in liblinear it also regularizes the intercept, if considered. This effect can however be reduced by carefully fine tuning its intercept_scaling parameter, which allows the intercept term to have a different regularisation behavior compared to the other features. The classification results and score can therefore differ from the other two classifiers.

Multi-class classification

SVC and NuSVC implement the ‘one-versus-one’ approach for multi-class classification. In total, n_classes * (n_classes - 1) / 2 classifiers are constructed and each one trains data from two classes. To provide a consistent interface with other classifiers, the decision_function_shape option allows to monotonically transform the results of the “one-versus-one” classifiers to a “one-vs-rest” decision function of shape (n_samples, n_classes), which is the default setting of the parameter (default=’ovr’).

Here is some extra info about the decision_function_shape param:

decision_function_shape: {‘ovo’, ‘ovr’}, default=’ovr’

Whether to return a one-vs-rest (‘ovr’) decision function of shape (n_samples, n_classes) as all other classifiers, or the original one-vs-one (‘ovo’) decision function of libsvm which has shape (n_samples, n_classes * (n_classes - 1) / 2). However, note that internally, one-vs-one (‘ovo’) is always used as a multi-class strategy to train models; an ovr matrix is only constructed from the ovo matrix. The parameter is ignored for binary classification.

(Ref to Scores and probabilities

To learn more about the decision_function method of SVC and NuSVC)

Imbalanced problems

When we want to give more importance to certain classes or samples, we can use the class_weight and sample_weight params.

SVC (but not NuSVC) implements the parameter class_weight in the fit method. It’s a dictionary of the form {class_label : value}, where value is a floating point number > 0 that sets the parameter C of class class_label to C * value. The figure below illustrates the decision boundary of an unbalanced problem, with and without weight correction.

sample_weight, sets the parameter C for the i-th example to C * sample_weight[i], which will encourage the classifier to get these samples right. The figure below illustrates the effect of sample weighting on the decision boundary. The size of the circles is proportional to the sample weights

Regression

The model produced by SVC depends only on a subset of the training data, because the cost function for building it doesn’t care about training points that lie beyond the margin

Similarly, SVR ignores samples whose prediction is close to their target

There are three different implementations of Support Vector Regression: SVR, NuSVR and LinearSVR. LinearSVR provides a faster implementation than SVR but only considers the linear kernel, while NuSVR implements a slightly different formulation than SVR and LinearSVR. Due to its implementation in liblinear LinearSVR also regularizes the intercept, if considered. This effect can however be reduced by carefully fine tuning its intercept_scaling parameter, which allows the intercept term to have a different regularization behavior compared to the other features. LinearSVR’s classification results and score can therefore differ from the other two.

Outlier detection

OneClassSVM is used for unsupervised outlier detection.

This example which I remember reading on stream uses the OCSVM for outlier detection compared to 2 other methods:

OCSVM does not assume any parametric form of the data distribution and can model the complex shape better than the other two. There is also a link to this, but I will check that one later as it’s part of the docs’ 2nd ‘chapter’.

Complexity

The core of an SVM is a quadratic programming proglem. The QP solver used by the libsvm-based implementation scales between O(n_features x n_samples^2) and O(n_features x n_samples^3).

For the linear case, the algorithm used in LinearSVC by the liblinear implementation is much more efficient that its libsvm-based SVC couterpart and can scale almost linearly to millions of samples and/or features (hence for linear cases the LinearSVC was suggested above)

Tips on practical use

avoiding data copy

For SVC, SVR, NuSVC and NuSVR, if the data passed to certain methods is not C-ordered contiguous and double precision, it will be copied before calling the underlying C implementation. Whether a given np array is C-contigious can be checked by inspecting its flags attribute

For LinearSVC (and LogisticRegression for that matter), any input passed as a numpy array will be copied and converted to the liblinear internal sparse data representation. If we want to fit a large-scale linear classifier without copying a dense numpy C-contiguous double precision array as input, using SGDClassifier is suggested

kernel cache size

For SVC, SVR, NuSVC and NuSVR, the size of the kernel cache has a strong impact on run times for larger problems. If we have enough RAM available, they recommended to set cache_size to a higher value than the default of 200MB, such as 500 or 1000

setting C

LinearSVC and LinearSVR are less sensitive to C when it becomes large, and prediction results stop improving after a certain threshold. Meanwhile, larger C values will take more time to train, sometimes up to 10 times longer

scale your data

SVMs are not scale invariant! For example, scale each attribute on the input vector X to [0,1] or [-1,+1], or standardize it to have mean 0 and variance 1. Note that the same scaling must be applied to the test vector to obtain meaningful results.

the parameter nu in NuSVC, OneClassSVM and NuSVR approximates the fraction of training errors and support vectors
in SVC, if the data is imbalanced set class_weight=’balanced’ and/or try different penalty parameters C
randomness of the underlying implementations

The underlying implementations of SVC and NuSVC use a random number generator only to shuffle the data for probability estimation (when probability is set to True). This randomness can be controlled with the random_state parameter. If probability is set to False these estimators are not random and random_state has no effect on the results. The underlying OneClassSVM implementation is similar to the ones of SVC and NuSVC. As no probability estimation is provided for OneClassSVM, it is not random. The underlying LinearSVC implementation uses a random number generator to select features when fitting the model with a dual coordinate descent (i.e. when dual is set to True). It is thus not uncommon to have slightly different results for the same input data. If that happens, try with a smaller tol parameter. This randomness can also be controlled with the random_state parameter. When dual is set to False the underlying implementation of LinearSVC is not random and random_state has no effect on the results.

Kernel functions

Params of the RBF kernel

C trades off misclassification of training examples against simplicity of the decision surface
gamma defines how much influence a single training example has. The larger gamma is, the closer other examples must be to be affected

Proper choice of C and gamma is critical, and the docs advise to use GridSearchCV with C and gamma spaced exponentially far apart to choose good values.

(pic source: RBF SVM parameters)

Stochastic Gradient Descent

Stochastic Gradient Descent (SGD) is a simple yet very efficient approach to fitting linear classifiers and regressors under convex loss functions such as (linear) Support Vector Machines and Logistic Regression. Even though SGD has been around in the machine learning community for a long time, it has received a considerable amount of attention just recently in the context of large-scale learning.

SGD has seen much success in large-scale and sparse ML problems often encountered in text classification and NLP. Given that the data is sparse, the classifiers in this module easily scale to problems with 100,000 training samples and more than 100,000 features.

SGD is an optimization technique and doesn’t correspond to a specific family of machine learning models.

Often, an instance of SGDClassifier or SGDRegressor will have an equivalent estimator in the scikit-learn API, potentially using a different optimization technique. For example, using SGDClassifier(loss=’log_loss’) results in logistic regression. Similarly, SGDRegressor(loss=’squared_error’, penalty=’l2’) and Ridge solve the same optimization problem, via different means.

Advantages:

efficiency
ease of implementation (lots of opportunities for code tuning)

Disadvantages:

it requires a number of hyperparams such as the regularisation param and the number of iterations
it is sensitive to feature scaling

Warning

Make sure you permute (shuffle) your training data before fitting the model or use shuffle=True to shuffle after each iteration (used by default). Also, ideally, features should be standardized using e.g. make_pipeline(StandardScaler(), SGDClassifier())

Classification

The class SGDClassifier implements a plain SGD learning routine which supports different loss functions and penalties for classification.

If we train it using the hinge loss, it is equivalent to a linear SVM.

The loss param

‘hinge’ gives a linear SVM
‘log_loss’ gives logistic regression, a probabilistic classifier
‘modified_huber’ is another smooth loss that brings tolerance to outliers as well as probability estimates
‘squared_hinge’ is like hinge but is quadratically penalized
‘perceptron’ is the linear loss used by the perceptron algorithm.

There are other losses - ‘squared_error’, ‘huber’, ‘epsilon_insensitive’ and ‘squared_epsilon_insensitive’ are designed for regression but can be useful in classification as well

Penalty

penalty: {‘l2’, ‘l1’, ‘elasticnet’, None}, default=’l2’
The penalty (aka regularization term) to be used. Defaults to ‘l2’ which is the standard regularizer for linear SVM models. ‘l1’ and ‘elasticnet’ might bring sparsity to the model (feature selection) not achievable with ‘l2’. No penalty is added when set to None.

SGDClassifier supports multi-class classification by combining multiple binary classifiers in a one-vs-all (OVA) scheme. For each of the classes, a binary classifier is learned that discriminates between that and all other classes. At testing time, we compute the confidence score (i.e. the signed distances to the hyperplane) for each classifier and choose the class with the highest confidence

The below shows the OVA on the iris dataset. The dashed lines show the three OVA classifiers and the background colours show the decision surface induced by each calssifier

(Using log_loss and modified_huber enable the usage of predict_proba so they are more suitible for OVA classification)

The SGDClassifier supports both class_weight and sample_weight which we saw earlier in the SVM part as well.

Averaged SGD

This can be enabled by passing average=True. ASGD performs the same updates as the normal SGD, but instead of using the last value of the coefs (the values of the last update), the coef_ attribute is set to the average value of the coefs across all updates. The same is done for the intercept_ attribute as well. When using ASGD the learning rate can be larger and even constant, leading on some datasets to a speed up in training time.

Regression

The class SGDRegressor implements a plain stochastic gradient descent learning routine which supports different loss functions and penalties to fit linear regression models. SGDRegressor is well suited for regression problems with a large number of training samples (> 10.000), for other problems we recommend Ridge, Lasso, or ElasticNet.

Loss

loss=”squared_error”: Ordinary least squares
loss=”huber”: Huber loss for robust regression
loss=”epsilon_insensitive”: linear Support Vector Regression

The Huber and epsilon-insensitive loss functions can be used for robust regression. The width of the insensitive region has to be specified via the parameter epsilon. This parameter depends on the scale of the target variables

Penalty

penalty: {‘l2’, ‘l1’, ‘elasticnet’, None}, default=’l2’

The penalty (aka regularization term) to be used. Defaults to ‘l2’ which is the standard regularizer for linear SVM models. ‘l1’ and ‘elasticnet’ might bring sparsity to the model (feature selection) not achievable with ‘l2’. No penalty is added when set to None.

Online One-Class SVM

The class sklearn.linear_model.SGDOneClassSVM implements an online linear version of the One-Class SVM using a stochastic gradient descent. Combined with kernel approximation techniques, sklearn.linear_model.SGDOneClassSVM can be used to approximate the solution of a kernelized One-Class SVM, implemented in sklearn.svm.OneClassSVM, with a linear complexity in the number of samples. Note that the complexity of a kernelized One-Class SVM is at best quadratic in the number of samples. sklearn.linear_model.SGDOneClassSVM is thus well suited for datasets with a large number of training samples (> 10,000) for which the SGD variant can be several orders of magnitude faster.

Stochastic Gradient Descent for sparse data

Sparse and dense formats are supported, but the sparse implementation produces slightly different results from the dense implementation.

Complexity

SGD’s efficiency is basically linear in the # of training examples. If X is a matrix of size (n, p), training has a cost of O(k x n x p_), where k is the # of iterations, and p_ is the avg # of non-zero attributes per sample

Stopping criterion

with early_stopping=True, the input data is split into a training set and a validation set. The model is then fitted on the training set, and the stopping criterion is based on the prediction score (using the score method) computed on the validation set. The size of the validation set can be changed with the parameter validation_fraction
with early_stopping=False, the model is fitted on the entire input data and the stopping criterion is based on the objective function computed on the training data

In both cases, the criterion is evaluated once by epoch, and the algorithm stops when the criterion does not improve n_iter_no_change times in a row. The improvement is evaluated with absolute tolerance tol, and the algorithm stops in any case after a maximum number of iteration max_iter.

Tips on practical use

SGD is sensitive to feature scaling, so scaling to [0,1] or [-1,+1] is recommended
finding a good regularisation term alpha is best done using automatic hyperparam search (GridSearchCV or RandomizedSearchCV, usually in the range 10.0**-np.arange(1,7))
the docs suggest that SGD converges after observing approximately 10^6 training samples - thus, a reasonable first guess for the number of iterations is max_iter = np.ceil(10**6 / n), where n is the size of the training set
if SGD is applied to features extracted using PCA, it is often wise to scale the feature values by some constant c such that the avg L2 norm of the train data equals 1

Learning rate param

‘constant’: eta = eta0
‘optimal’: eta = 1.0 / (alpha * (t + t0)) where t0 is chosen by a heuristic proposed by Leon Bottou.
‘invscaling’: eta = eta0 / pow(t, power_t)
‘adaptive’: eta = eta0, as long as the training keeps decreasing. Each time n_iter_no_change consecutive epochs fail to decrease the training loss by tol or fail to increase validation score by tol if early_stopping is True, the current learning rate is divided by 5.

For SGDRegressor the default is invscaling, for SGDClassifier the default is optimal.

eta is the initial learning rate which can be set as well. Its default is 0.01.

power_t is the exponent for inverse scaling learning rate.

t is the time step

Nearest Neighbors

sklearn.neighbors provides functionality for unsupervised and supervised neighbors-based learning methods. Unsupervised nearest neighbors is the foundation of many other learning methods, notably manifold learning and spectral clustering. Supervised neighbors-based learning comes in two flavors: classification for data with discrete labels, and regression for data with continuous labels.

The models can handle both dense and sparse input matrices. For dense matrices, a large number of possible distance metrics are supported. For sparse matrices, arbitrary Minkowski metrics (results in the standard Euclidean distance if p=2, if p=1 -> the Manhattan distance is used, other p results in minkowski_distance(l_p) being used) are supported for searches.

Unsupervised nearest neighbours

NearestNeighbors implements these models, and it acts as a uniform interface to 3 different algorithms: BallTree, KDTree, and a brute-force algorithm based on routines in sklearn.metrics.pairwise. The hoice is controlled using the algorithm param, which is auto, ball_tree, kd_tree, or brute, and auto determines the best approach from the train data.

Nearest neighbours classification

scikit-learn implements two different nearest neighbors classifiers: KNeighborsClassifier implements learning based on the k nearest neighbors of each query point, where k is an integer value specified by the user. RadiusNeighborsClassifier implements learning based on the number of neighbors within a fixed radius r of each training point, where r is a floating-point value specified by the user.

The radious classifier can be an option where the data is not uniformly sampled, and with a specific r points in sparser n’hoods use fewer nearest neighbours for the classification. For high-dim, this becomes less effective due to the curse of dimensionality.

Weight param

The basic nearest neighbors classification uses uniform weights: that is, the value assigned to a query point is computed from a simple majority vote of the nearest neighbors. Under some circumstances, it is better to weight the neighbors such that nearer neighbors contribute more to the fit.

weights: {'uniform', 'distance'} (or an user-defined callable)

Nearest neighbours regression

scikit-learn implements two different neighbors regressors: KNeighborsRegressor implements learning based on the k nearest neighbors of each query point, where k is an integer value specified by the user. RadiusNeighborsRegressor implements learning based on the neighbors within a fixed radius r of the query point, where r is a floating-point value specified by the user.

Again there is the weights param which can be uniform or distance (or an user-defined callable)

Nearest neighbour algorithms

Brute force

The most naive neighbor search implementation involves the brute-force computation of distances between all pairs of points in the dataset - this approach is O(DN^2), D - dimensions, and N samples. This can be god for small datasets, however it can quickly become inefficient as N grows.

K-D Tree

To improve upon the brute force, structures that attempt to reduce the required number of distance calculations started to appear. For instance, if A is far from B, and B is close to C, then the model doesn’t need to calculate the distance b/e A and C and can assume they are not close. This can reduce the computational cost to O(DNlog(N)) or better.

The KD tree (K-dimensional tree) is a binary tree data structure that partitions space along data axes, creating nested regions for data points. It enables fast nearest-neighbour searches in low-dim spaces D < 20 with an O(log(N)) complexity for distance calculations. However, as dims increase, KD trees become inefficient due to the curse of dimensionality.

Ball Tree

To address the inefficiencies of the kd_tree algorithm in higher dims, the ball tree structure was developed. This algorithm paritions the data in a series of nesting hyper-spheres, and while the tree construction is morecostly than the KD tree, it results in a data structure which can be very efficient on highly structured data, even in high dimensions.

A ball tree recursively divides the data into nodes defined by a centroid C and radius r, such that each point in the node lies within the hyper-sphere defined by r and C. Each node contains points within a hyper-sphere around its centroid. By using the triangle inequality (|x+y| <= |x| + |y|), the ball tree reduces the number of candidate points for a neighbor search, as it allows quick elimination of nodes that can’t contain nearby points. This approach can outperform a KD-tree in higher dimensions, though its efficiency depends on the data structure

There is a section on the choice of nearest neighbour algorithm which I will bookmark and remember to check out with some examples on stream. I can also try out different distance metrics, and just play around with all the different parameters in detail.

Nearest centroid classifier

The NearestCentroid classifier is a simple algorithm that represents each class by the centroid of its members. In effect, this makes it similar to the label updating phase of the KMeans algorithm. It also has no parameters to choose, making it a good baseline classifier. It does, however, suffer on non-convex classes, as well as when classes have drastically different variances, as equal variance in all dimensions is assumed

Nearest neighbours transformer

Many scikit-learn estimators rely on nearest neighbors: Several classifiers and regressors such as KNeighborsClassifier and KNeighborsRegressor, but also some clustering methods such as DBSCAN and SpectralClustering, and some manifold embeddings such as TSNE and Isomap. All these estimators can compute internally the nearest neighbors, but most of them also accept precomputed nearest neighbors sparse graph, as given by kneighbors_graph and radius_neighbors_graph. With mode='connectivity', these functions return a binary adjacency sparse graph as required, for instance, in SpectralClustering. Whereas with mode='distance', they return a distance sparse graph as required, for instance, in DBSCAN. To include these functions in a scikit-learn pipeline, one can also use the corresponding classes KNeighborsTransformer and RadiusNeighborsTransformer.

The benefits of this sparse graph API are:

the precomputed graph can be re-used multiple times
precomputing the graph can give finer control on the nearest neighbour estimation
the precomputation can be done using custom estimators, such as ANN

That is all for today!

See you tomorrow :)