(Day 313) More of sklearn's 'user guide'

Ivan Ivanov · November 9, 2024

Hello :) Today is Day 313!

A quick summary of today:

  • gaussian processes
  • cross decomposition
  • naive Bayes
  • decision trees
  • semi-supervised learning
  • manifold learning
  • clustering
  • biclustering

Gaussian Processes

Gaussian Processes (GP) are a nonparametric supervised learning method used to solve regression and probabilistic classification problems.

Advantages:

  • the prediction interpolates the observations (at least for regular kernels)
  • the prediction is probabilistic (Gaussian) so that one can compute empirical confidence intervals and decide based on those if one should refit the prediction in some region of interest
  • versatile: different kernels can be specified

Disadvantages:

  • sklearn’s implementation is not sparse - it uses the whole feature space to perform predictions
  • they lose efficiency in high dim spaces - namely when the # of features exceeds a few dozen

Gaussian Process Regression (GPR)

The GaussianProcessRegressor implements Gaussian processes (GP) for regression purposes. For this, the prior of the GP needs to be specified. GP will combine this prior and the likelihood function based on training samples. It provides a probabilistic approach to prediction by returning the mean and standard deviation of the predictive distribution.
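
A minimal sketch of this (not from the docs, on made-up 1-D data): fit a GaussianProcessRegressor with an RBF kernel and ask for both the mean and the standard deviation of the prediction.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(40, 1))                   # illustrative training inputs
y = np.sin(X).ravel() + rng.normal(0, 0.1, size=40)   # noisy targets

# alpha adds noise on the diagonal of the kernel matrix (see the note below)
gpr = GaussianProcessRegressor(kernel=1.0 * RBF(length_scale=1.0), alpha=0.1**2)
gpr.fit(X, y)

X_test = np.linspace(0, 5, 100).reshape(-1, 1)
mean, std = gpr.predict(X_test, return_std=True)      # probabilistic output
```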

image

alpha

Noise level can be specified by setting alpha. An alternative is to use a WhiteKernel component in the kernel, which can estimate the global noise level from the data.
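
A quick sketch of the WhiteKernel alternative (the starting values are illustrative): the noise level becomes a tunable kernel hyperparameter instead of a fixed alpha.

```python
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# noise_level starts at 1.0 here and is optimised during fitting
kernel = 1.0 * RBF(length_scale=1.0) + WhiteKernel(noise_level=1.0)
gpr_noise = GaussianProcessRegressor(kernel=kernel)   # alpha left at its small default
```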

normalize_y

Whether or not to normalize the target values y by removing the mean and scaling to unit-variance. This is recommended for cases where zero-mean, unit-variance priors are used. Note that, in this implementation, the normalisation is reversed before the GP predictions are reported.

GPR also:

  • allows prediction without prior fitting (based on the GP prior)
  • provides an additional method sample_y(X), which evaluates samples drawn from the GPR (prior or posterior) at given inputs
  • exposes a method log_marginal_likelihood(theta), which can be used externally for other ways of selecting hyperparameters, for instance, via Markov chain Monte Carlo
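
A sketch of those extra methods, assuming the `gpr` and `X_test` objects from the earlier snippet are still around:

```python
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

samples = gpr.sample_y(X_test, n_samples=3, random_state=0)   # draws from the posterior
lml = gpr.log_marginal_likelihood(gpr.kernel_.theta)          # at the fitted hyperparameters

# predicting with an unfitted estimator falls back to the GP prior
prior_mean, prior_std = GaussianProcessRegressor(kernel=1.0 * RBF()).predict(X_test, return_std=True)
```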

Gaussian Process Classification (GPC)

The GaussianProcessClassifier implements Gaussian processes (GP) for classification purposes, more specifically for probabilistic classification, where test predictions take the form of class probabilities. GaussianProcessClassifier places a GP prior on a latent function f, which is then squashed through a link function to obtain the probabilistic classification. The latent function f is a so-called nuisance function, whose values are not observed and are not relevant by themselves. Its purpose is to allow a convenient formulation of the model, and f is removed (integrated out) during prediction. GaussianProcessClassifier implements the logistic link function, for which the integral cannot be computed analytically but is easily approximated in the binary case.

In contrast to GPR, the posterior of the function f is not Gaussian since Gaussian likelihood is inappropriate for discrete class labels. Instead, a non-Gaussian likelihood corresponding to the logistic link function (logit) is used.

GPC supports multi-class classification by performing one-vs-rest or one-vs-one based training and prediction. In one-versus-rest, one binary Gaussian process classifier is fitted for each class, which is trained to separate this class from the rest. In “one_vs_one”, one binary Gaussian process classifier is fitted for each pair of classes, which is trained to separate these two classes.
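
A minimal sketch of multi-class GPC on iris; note that, as far as I know, predict_proba is only supported with the one-vs-rest strategy.

```python
from sklearn.datasets import load_iris
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

X, y = load_iris(return_X_y=True)
gpc = GaussianProcessClassifier(kernel=1.0 * RBF(1.0), multi_class="one_vs_rest")
gpc.fit(X, y)
proba = gpc.predict_proba(X[:5])   # one column of class probabilities per class
```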

Kernels for GPs

I went on to check out this YouTube video, and the explanation is great - it cleared up a lot of confusion about GPs and also about kernels.

Constant Kernel

This kernel scales the magnitude of the other factor (kernel) and depends on a constant value

WhiteKernel

The main use-case of this is as part of a sum-kernel where it explains the noise-component of the signal. Tuning its parameter noise_level corresponds to estimating the noise-level.

Kernel operators

Sum - takes two kernels and combines them via

k_sum(X,Y) = k1(X,Y) + k2(X,Y)

Product - takes two kernels and combines them via

k_product(X,Y) = k1(X,Y) x k2(X,Y)

Exponentiation - takes one base kernel and a scalar param p and combines them via

k_exp(X,Y) = k(X,Y)^p
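
In sklearn these operators map onto the +, * and ** operators on kernel objects; a small sketch:

```python
from sklearn.gaussian_process.kernels import RBF, ConstantKernel, WhiteKernel

k_sum = RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)   # Sum kernel
k_product = ConstantKernel(2.0) * RBF(length_scale=1.0)        # Product kernel
k_exp = RBF(length_scale=1.0) ** 2                             # Exponentiation kernel
```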

Radial basis function kernel

It is also known as the squared exponential kernel. It is parametrized by a length-scale parameter (l > 0), which can either be a scalar or a vector with the same dimension as the input x.

image

d(·, ·) is the Euclidean distance. This kernel is infinitely differentiable, which implies that GPs with this kernel as covariance function have mean square derivatives of all orders, and thus are very smooth.

image

Matern kernel

image

Rational quadratic kernel

image

Exp-Sine-Squared kernel

image

As mentioned in the video I linked, creating an effective kernel (usually a combination of kernels) and finding the best hyperparameters can be difficult, so it requires trial and error.

Cross decomposition

The cross decomposition module contains supervised estimators for dimensionality reduction and regression, belonging to the “Partial Least Squares” family.

Cross decomposition algorithms find the fundamental relations between two matrices (X and Y). They are latent variable approaches to modeling the covariance structures in these two spaces. They will try to find the multidimensional direction in the X space that explains the maximum multidimensional variance direction in the Y space. In other words, PLS projects both X and Y into a lower-dimensional subspace such that the covariance between transformed(X) and transformed(Y) is maximal.

PLS has similarities with Principal Component Regression (PCR), but in PCR the dimensionality reduction step (PCA) is unsupervised.

Apart from Canonical Correlation Analysis (CCA), the PLS estimators are suited when the matrix of predictors has more variables than observations, and when there is multicollinearity among the features. By contrast, standard linear regression would fail in these cases unless it is regularized.

This module contains the classes PLSRegression, PLSCanonical, CCA, and PLSSVD.

This example shows a comparison between the 4 methods.
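
Not the linked comparison, but a minimal PLSRegression sketch on made-up data, projecting X and Y onto 2 latent components:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
Y = X[:, :2] @ rng.normal(size=(2, 3)) + rng.normal(scale=0.1, size=(100, 3))

pls = PLSRegression(n_components=2).fit(X, Y)
X_scores, Y_scores = pls.transform(X, Y)   # latent-space projections of both matrices
Y_pred = pls.predict(X)
```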

Naive Bayes

Naive Bayes methods are a set of supervised learning algorithms based on applying Bayes’ theorem with the “naive” assumption of conditional independence between every pair of features given the value of the class variable.

Despite the naive assumption, naive Bayes classifiers have worked quite well in many real-world scenarios (document classification and spam filtering). They require a small amount of training data to estimate the necessary parameters. Furthermore, naive Bayes learners and classifiers can be extremely fast compared to more sophisticated methods.

On the other hand, naive Bayes classifiers are known to be bad estimators, so the output from their predict_proba should not be trusted.

Gaussian Naive Bayes

GaussianNB implements the Gaussian Naive Bayes algorithm for classification, where the likelihood of the features is assumed to be Gaussian and the parameters sigma_y and mu_y are estimated using maximum likelihood.
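
A minimal GaussianNB sketch on iris (the split and scoring choices are just illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

gnb = GaussianNB().fit(X_train, y_train)
print(gnb.score(X_test, y_test))   # held-out accuracy
```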

Multinomial Naive Bayes

MultinomialNB - The multinomial Naive Bayes classifier is suitable for classification with discrete features (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work.

Complement Naive Bayes

ComplementNB implements the complement naive Bayes (CNB) algorithm. CNB is an adaptation of the standard multinomial naive Bayes (MNB) algorithm that is particularly suited for imbalanced data sets. Specifically, CNB uses statistics from the complement of each class to compute the model’s weights. The inventors of CNB show empirically that the parameter estimates for CNB are more stable than those for MNB. Further, CNB regularly outperforms MNB (often by a considerable margin) on text classification tasks.

Bernoulli Naive Bayes

BernoulliNB implements the naive Bayes training and classification algorithms for data that is distributed according to multivariate Bernoulli distributions; i.e., there may be multiple features but each one is assumed to be a binary-valued (Bernoulli, boolean) variable. Therefore, this class requires samples to be represented as binary-valued feature vectors; if handed any other kind of data, a BernoulliNB instance may binarize its input

The decision rule here differs from the multinomial NB’s rule in that it explicitly penalises the non-occurrence of a feature i that is an indicator for class y, where the multinomial variant would simply ignore a non-occurring feature.

For text classification, word occurrence vectors (not word count vectors) may be used to train BernoulliNB, and it might even perform better on some datasets, especially those with shorter documents.

Categorical Naive Bayes

CategoricalNB is a Naive Bayes classifier for categorical data. It estimates a categorical distribution for each feature conditioned on the class label. The data must be encoded (like using OrdinalEncoder) so each feature’s categories are represented as integers.
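
A sketch of that encoding step on made-up categorical data:

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder

X_raw = np.array([["red", "small"], ["blue", "large"],
                  ["red", "large"], ["blue", "small"]])
y = np.array([0, 1, 1, 0])

enc = OrdinalEncoder()
X = enc.fit_transform(X_raw)        # each feature's categories become integers
clf = CategoricalNB().fit(X, y)
clf.predict(enc.transform([["red", "large"]]))
```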

Out-of-core naive Bayes model fitting

We can use partial_fit in cases where the full training set might not fit in memory.
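
A sketch of the out-of-core pattern; `stream_chunks` is a hypothetical generator yielding (X, y) batches, and all classes have to be declared up front:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB()
classes = np.array([0, 1])   # every class that can appear must be known in advance

for X_chunk, y_chunk in stream_chunks():   # hypothetical generator of (X, y) batches
    clf.partial_fit(X_chunk, y_chunk, classes=classes)
```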

Decision Trees

Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. A tree can be seen as a piecewise constant approximation.
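
Before the pros and cons, a minimal sketch (on iris, with illustrative settings): fit a shallow tree and print the learned rules as text.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(iris.data, iris.target)
print(export_text(clf, feature_names=iris.feature_names))   # the decision rules as text
```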

Advantages:

  • simple to understand and interpret
  • requires little data preparation; other techniques often require data normalisation, dummy variables to be created and blank values to be removed; some tree algorithms support missing values
  • the cost of using the tree is logarithmic in the # of data points used to train the tree
  • able to handle both numerical and categorical data; however, the sklearn implementation does not support categorical variables for now
  • able to handle multi-output problems
  • uses a white box model - if a given situation is observable in a model, the explanation for the condition is easily explained by boolean logic
  • possible to validate a model using statistical tests; that makes it possible to account for the reliability of the model
  • performs well even if its assumptions are somewhat violated by the true model from which the data were generated

Disadvantages:

  • decision tree learners can create over-complex trees that do not generalise the data well
  • can be unstable because small variations in the data might result in a completely different tree being generated (fixed through ensemble methods)
  • predictions of decision trees are neither smooth nor continuous, but piecewise constant approximations - they are not good at extrapolation
  • the problem of learning an optimal decision tree is known to be NP-complete
  • there are concepts that are hard to learn because decision trees do not express them easily - such as XOR, parity or multiplexer problems
  • learners create biased trees if some classes dominate, so it’s recommended to balance the data prior to fitting

Complexity

In general, the run time cost to construct a balanced binary tree is O(n_samples x n_features x log(n_samples)) and the query time is O(log(n_samples)). Although the tree construction algorithm attempts to generate balanced trees, they will not always be balanced. Assuming the subtrees remain approximately balanced, the cost at each node consists of searching through O(n_features) to find the feature that offers the largest reduction in the impurity criterion. This has a cost of O(n_samples x n_features x log(n_samples)) at each node, leading to a total cost over the entire tree of O(n_samples^2 x n_features x log(n_samples)).

Tips on practical use

  • decision trees tend to overfit on data with a large number of features. Getting the right ratio of samples to features is important since a tree with few samples in high dims is very likely to overfit
  • consider performing dim reduction beforehand to give the tree a better chance of finding features that are discriminative
  • understanding the decision tree structure will help in gaining more insight into how the decision tree makes predictions, which is important for understanding the important features in the data
  • visualise the tree as you are training, use max_depth=3 to start with and get a feel of how the tree is fitting, and then increase the depth accordingly
  • the number of samples required to populate the tree doubles for each additional level the tree grows to. Use max_depth to control the size of the tree to prevent overfitting
  • use min_samples_split or min_samples_leaf to ensure that multiple samples inform every decision in the tree by controlling which splits will be considered. A very small number will usually mean the tree will overfit, and a large number will prevent the tree from learning the data. Try min_samples_leaf=5 as an initial value. If the sample size varies greatly, a float number can be used as a percentage in these two parameters. While min_samples_split can create arbitrarily small leaves, min_samples_leaf guarantees that each leaf has a minimum size, avoiding low-variance, over-fit leaf nodes in regression problems. For classification with few classes, min_samples_leaf=1 is often the best choice. Note that min_samples_split considers samples directly and independently of sample_weight, if provided. Consider min_weight_fraction_leaf or min_impurity_decrease if accounting for sample weights is required at splits
  • balance the data before training to prevent the tree from being biased towards the dominant classes. This can be done by sampling an equal number of samples from each class, or preferably by normalising the sum of the sample_weights for each class to the same value
  • if the samples are weighted, it will be easier to optimize the tree structure using a weight-based pre-pruning criterion such as min_weight_fraction_leaf, which ensures that leaf nodes contain at least a fraction of the overall sum of the sample weights
  • all decision trees use np.float32 arrays internally. If training data is not in this format, a copy of the dataset will be made
  • if the input matrix X is very sparse, it is recommended to convert to sparse csc_matrix before calling fit and sparse csr_matrix before calling predict. Training time can be orders of magnitude faster for a sparse matrix input compared to a dense matrix when features have zero values in most of the samples

Missing values support

DecisionTreeClassifier and DecisionTreeRegressor have built-in support for missing values when splitter=’best’ and criterion is ‘gini’, ‘entropy’, or ‘log_loss’, for classification or ‘squared_error’, ‘friedman_mse’, or ‘poisson’ for regression. For each potential threshold on the non-missing data, the splitter will evaluate the split with all the missing values going to the left node or the right node.

Decisions are made as follows:

  • by default when predicting, the samples with missing values are classified with the class used in the split found during training
  • if the criterion evaluation is the same for both nodes, then the tie for missing value at predict time is broken by going to the right node. The splitter also checks the split where all the missing values go to one child and non-missing values go to the other
  • if no missing values are seen during training for a given feature, then during prediction missing values are mapped to the child with the most samples
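
A tiny sketch of the missing-value support, assuming a recent scikit-learn (roughly 1.3+) and made-up data:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.array([[0.0], [1.0], [np.nan], [2.0], [np.nan], [3.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = DecisionTreeClassifier(random_state=0).fit(X, y)   # splitter='best' by default
clf.predict(np.array([[np.nan]]))                        # missing values handled at predict time too
```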

Semi-supervised learning

Semi-supervised learning is a situation in which some of the samples in your training data are not labeled. The semi-supervised estimators in sklearn.semi_supervised are able to make use of this additional unlabeled data to better capture the shape of the underlying data distribution and generalize better to new samples. These algorithms can perform well when we have a very small amount of labeled points and a large amount of unlabeled points.

It’s important to assign an identifier to unlabeled points: this module treats -1 as unlabeled.

Semi-supervised learning algorithms make use of at least one of the following assumptions:

  • continuity/smoothness assumption

Points that are close to each other are more likely to share a label

  • cluster assumption

The data tend to form discrete clusters, and points in the same cluster are more likely to share a label

  • manifold assumption

The data lie approximately on a manifold of much lower dimension than the input space

Self Training

This self-training implementation is based on Yarowsky’s algorithm where a given supervised classifier can function as a semi-supervised classifier, allowing it to learn from unlabeled data.

SelfTrainingClassifier can be called with any classifier that implements predict_proba, passed as the parameter base_estimator. In each iteration, the base classifier predicts labels for the unlabeled samples and adds a subset of these labels to the labeled dataset.

The choice of this subset is determined by the selection criterion. This selection can be done using a threshold on the prediction probabilities, or by choosing the k_best samples according to the prediction probabilities.

The labels used for the final fit as well as the iteration in which each sample was labeled are available as attributes. The optional max_iter parameter specifies how many times the loop is executed at most. The max_iter parameter may be set to None, causing the algorithm to iterate until all samples have labels or no new samples are selected in that iteration.

When using the self-training classifier, the calibration of the classifier is important.
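
A sketch of the self-training loop on iris, hiding most labels with -1 (the wrapped SVC and the 0.9 threshold are just illustrative choices):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(42)
y_partial = y.copy()
y_partial[rng.random(len(y)) < 0.7] = -1     # hide ~70% of the labels

# the wrapped classifier must implement predict_proba
self_training = SelfTrainingClassifier(SVC(probability=True, gamma="auto"), threshold=0.9)
self_training.fit(X, y_partial)
```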

Label Propagation

Label propagation denotes a few variations of semi-supervised graph inference algorithms.

A few features available in this model:

  • used for classification tasks
  • kernel methods to project data into alternate dimensional spaces

sklearn provides two label propagation models: LabelPropagation and LabelSpreading. Both work by constructing a similarity graph over all items in the input dataset.

LabelPropagation uses hard clamping of input labels, meaning it strictly retains the original label distribution. The clamping factor can be adjusted, allowing for some flexibility in label distribution updates. The algorithm uses the raw similarity matrix without modifications.

LabelSpreading minimises a loss function with regularization, making it more robust to noise. It iterates on a modified graph and normalizes edge weights using the normalized graph Laplacian, a method also used in Spectral clustering.

Both models offer two kernels:

  • rbf: produces a fully connected graph, represented by a large, dense matrix. While more accurate, it can result in longer running times due to memory usage and matrix calculations
  • knn: produces a sparse matrix, reducing memory usage and running times significantly
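
A sketch of LabelSpreading with the knn kernel, again marking unlabeled points with -1:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.semi_supervised import LabelSpreading

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
y_partial = y.copy()
y_partial[rng.random(len(y)) < 0.7] = -1     # unlabeled points

model = LabelSpreading(kernel="knn", n_neighbors=7).fit(X, y_partial)
inferred = model.transduction_               # labels inferred for every training point
```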

Manifold learning

Manifold learning is an approach to non-linear dimensionality reduction. Algorithms for this task are based on the idea that the dimensionality of many data sets is only artificially high.

Manifold Learning can be thought of as an attempt to generalize linear frameworks like PCA to be sensitive to non-linear structure in data. Though supervised variants exist, the typical manifold learning problem is unsupervised: it learns the high-dimensional structure of the data from the data itself, without the use of predetermined classifications.

Below are sklearn’s manifold learning implementations; the pictures show each method applied to the same example dataset.

image

Isomap

One of the earliest approaches to manifold learning is the Isomap algorithm, short for Isometric Mapping. Isomap can be viewed as an extension of Multi-dimensional Scaling (MDS) or Kernel PCA. Isomap seeks a lower-dimensional embedding which maintains geodesic distances between all points.

image

Locally Linear Embedding

LLE seeks a lower-dimensional projection of the data which preserves distances within local neighborhoods. It can be thought of as a series of local Principal Component Analyses which are globally compared to find the best non-linear embedding.

Modified LLE

One well-known issue with LLE is the regularization problem. When the number of neighbors is greater than the number of input dimensions, the matrix defining each local neighborhood is rank-deficient. One method to address the regularization problem is to use multiple weight vectors in each neighborhood. This is the essence of modified locally linear embedding (MLLE).

image

Hessian Eigenmapping

This is another method of solving the regularization problem of LLE. It revolves around a hessian-based quadratic form at each neighborhood which is used to recover the locally linear structure. Though other implementations note its poor scaling with data size, sklearn implements some algorithmic improvements which make its cost comparable to that of other LLE variants for small output dimension.

image

Spectral Embedding

This is an approach to calculating a non-linear embedding. Scikit-learn implements Laplacian Eigenmaps, which finds a low dimensional representation of the data using a spectral decomposition of the graph Laplacian. The graph generated can be considered as a discrete approximation of the low dimensional manifold in the high dimensional space. Minimization of a cost function based on the graph ensures that points close to each other on the manifold are mapped close to each other in the low dimensional space, preserving local distances

Local Tangent Space Alignment

Though not technically a variant of LLE, Local tangent space alignment (LTSA) is algorithmically similar enough to LLE that it can be put in this category. Rather than focusing on preserving neighborhood distances as in LLE, LTSA seeks to characterize the local geometry at each neighborhood via its tangent space, and performs a global optimization to align these local tangent spaces to learn the embedding.

Multi-dimensional Scaling

MDS seeks a low-dimensional representation of the data in which the distances respect well the distances in the original high-dimensional space. In general, it is a technique used for analyzing similarity or dissimilarity data. It attempts to model similarity or dissimilarity data as distances in a geometric space. The data can be ratings of similarity between objects, interaction frequencies of molecules, or trade indices between countries.

image

t-distributed Stochastic Neighbour Embedding (t-SNE)

t-SNE converts data affinities to probabilities, using Gaussian distributions in the original space and Student’s t-distributions in the embedded space. This approach makes t-SNE sensitive to local structures, with advantages such as revealing structure at multiple scales, uncovering multiple manifolds or clusters, and reducing point crowding at the center. Unlike Isomap or LLE, which focus on unfolding a single manifold, t-SNE emphasizes local data structure and can group samples based on local similarities, which is useful for visualizing datasets with multiple clusters.

The algorithm minimizes the Kullback-Leibler (KL) divergence between joint probabilities using gradient descent. Since the KL divergence is not convex, multiple restarts with different initializations may be needed to avoid local minima.
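
A minimal t-SNE sketch, embedding the digits data into 2-D (the perplexity and init values are just common choices):

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, _ = load_digits(return_X_y=True)
X_2d = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(X)
```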

image

Practical tips

  • ensure all features are on the same scale to improve algorithm performance, as manifold learning relies on nearest-neighbor search
  • the reconstruction error can guide the selection of the optimal output dimension. It decreases as n_components increases, until it matches the dimensionality of the data
  • noisy data can distort the manifold, connecting parts of it that would otherwise be separate. Manifold learning on noisy or incomplete data remains an active research topic
  • identical points or disjoint data groups can lead to singular weight matrices, causing issues with solvers like arpack. Use solver='dense' to handle singular matrices, though it may be slower with large datasets. Alternatively, resolve the issue by increasing n_neighbors for disjoint data or removing identical points

Clustering

Each clustering algorithm comes in two variants: a class, that implements the fit method to learn the clusters on train data, and a function, that, given train data, returns an array of integer labels corresponding to the different clusters. For the class, the labels over the training data can be found in the labels_ attribute.

| Method name | Parameters | Scalability | Usecase | Geometry (metric used) |
| --- | --- | --- | --- | --- |
| K-Means | number of clusters | Very large n_samples, medium n_clusters with MiniBatch code | General-purpose, even cluster size, flat geometry, not too many clusters, inductive | Distances between points |
| Affinity propagation | damping, sample preference | Not scalable with n_samples | Many clusters, uneven cluster size, non-flat geometry, inductive | Graph distance (e.g. nearest-neighbor graph) |
| Mean-shift | bandwidth | Not scalable with n_samples | Many clusters, uneven cluster size, non-flat geometry, inductive | Distances between points |
| Spectral clustering | number of clusters | Medium n_samples, small n_clusters | Few clusters, even cluster size, non-flat geometry, transductive | Graph distance (e.g. nearest-neighbor graph) |
| Ward hierarchical clustering | number of clusters or distance threshold | Large n_samples and n_clusters | Many clusters, possibly connectivity constraints, transductive | Distances between points |
| Agglomerative clustering | number of clusters or distance threshold, linkage type, distance | Large n_samples and n_clusters | Many clusters, possibly connectivity constraints, non-Euclidean distances, transductive | Any pairwise distance |
| DBSCAN | neighborhood size | Very large n_samples, medium n_clusters | Non-flat geometry, uneven cluster sizes, outlier removal, transductive | Distances between nearest points |
| HDBSCAN | minimum cluster membership, minimum point neighbors | Large n_samples, medium n_clusters | Non-flat geometry, uneven cluster sizes, outlier removal, transductive, hierarchical, variable cluster density | Distances between nearest points |
| OPTICS | minimum cluster membership | Very large n_samples, large n_clusters | Non-flat geometry, uneven cluster sizes, variable cluster density, outlier removal, transductive | Distances between points |
| Gaussian mixtures | many | Not scalable | Flat geometry, good for density estimation, inductive | Mahalanobis distances to centers |
| BIRCH | branching factor, threshold, optional global clusterer | Large n_clusters and n_samples | Large dataset, outlier removal, data reduction, inductive | Euclidean distance between points |
| Bisecting K-Means | number of clusters | Very large n_samples, medium n_clusters | General-purpose, even cluster size, flat geometry, no empty clusters, inductive, hierarchical | Distances between points |

K-means

The KMeans algorithm clusters data by trying to separate samples in n groups of equal variance, minimizing a criterion known as the inertia or within-cluster sum-of-squares (see below). This algorithm requires the number of clusters to be specified. It scales well to large numbers of samples and has been used across a large range of application areas in many different fields.

The K-means algorithm aims to choose centroids that minimise the inertia, or within-cluster sum-of-squares criterion

Inertia can be recognized as a measure of how internally coherent clusters are. It suffers from various drawbacks:

  • it makes the assumption that clusters are convex and isotropic, which is not always the case. It responds poorly to elongated clusters, or manifolds with irregular shapes
  • it’s not a normalised metric; lower is better and 0 is optimal. In high-dim spaces, Euclidean distances tend to become inflated, so running a dim reduction alg such as PCA prior to k-means can alleviate this problem and speed up computations

Given enough time, K-means will always converge, however this may be to a local minimum. This is highly dependent on the initialization of the centroids. As a result, the computation is often done several times, with different initializations of the centroids. One method to help address this issue is the k-means++ initialization scheme, which has been implemented in scikit-learn. This initializes the centroids to be (generally) distant from each other, leading to probably better results than random initialization, as shown in the reference.

There is also a mini-batch kmeans, which converges faster than KMeans, but the quality of the results is reduced. In practice this difference in quality can be quite small, as shown in the example and cited reference.
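
A sketch comparing the two on synthetic blobs (sizes and parameters are illustrative; n_init="auto" assumes a fairly recent sklearn):

```python
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=10_000, centers=5, random_state=0)

km = KMeans(n_clusters=5, n_init="auto", random_state=0).fit(X)   # k-means++ init by default
mbk = MiniBatchKMeans(n_clusters=5, batch_size=1024, n_init="auto", random_state=0).fit(X)
print(km.inertia_, mbk.inertia_)   # mini-batch inertia is usually slightly worse
```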

image

Affinity Propagation

AffinityPropagation creates clusters by sending messages between pairs of samples until convergence. A dataset is then described using a small number of exemplars, which are identified as those most representative of other samples. The messages sent between pairs represent the suitability for one sample to be the exemplar of the other, which is updated in response to the values from other pairs. This updating happens iteratively until convergence, at which point the final exemplars are chosen, and hence the final clustering is given.

image

Affinity Propagation can be interesting as it chooses the number of clusters based on the data provided. For this purpose, the two important parameters are the preference, which controls how many exemplars are used, and the damping factor which damps the responsibility and availability messages to avoid numerical oscillations when updating these messages.

The main disadvantage is the algorithm’s complexity, as it has a time complexity of O(N^2 x T), where N is # of samples and T is # of iterations until convergence; and the memory complexity is O(N^2) if a dense similarity matrix is used, but reducible if a sparse one is used. Therefore, AP is most appropriate for small to medium datasets.

This example shows AP applied to group stocks that behave similarly according to their stock price movement:

image

Mean Shift

MeanShift clustering aims to discover blobs in a smooth density of samples. It is a centroid based algorithm, which works by updating candidates for centroids to be the mean of the points within a given region. These candidates are then filtered in a post-processing stage to eliminate near-duplicates to form the final set of centroids.

It is not highly scalable, as it requires multiple nearest neighbor searches during its execution. It is guaranteed to converge, however it will stop iterating when the change in centroids is small.

Spectral clustering

SpectralClustering performs a low-dimension embedding of the affinity matrix between samples, followed by clustering, e.g., by KMeans, of the components of the eigenvectors in the low dimensional space. It is especially computationally efficient if the affinity matrix is sparse and the amg solver is used for the eigenvalue problem. It works well for a small number of clusters, but is not advised for many clusters.

Hierarchical clustering

Hierarchical clustering is a general family of clustering algorithms that build nested clusters by merging or splitting them successively. This hierarchy of clusters is represented as a tree (or dendrogram). The root of the tree is the unique cluster that gathers all the samples, the leaves being the clusters with only one sample

The AgglomerativeClustering object performs a hierarchical clustering using a bottom up approach: each observation starts in its own cluster, and clusters are successively merged together. The linkage criteria determines the metric used for the merge strategy:

  • ‘ward’ minimises the sum of squared differences within all clusters; similar to the k-means objective function but tackled with an agglomerative hierarchical approach
  • ‘maximum’ or ‘complete linkage’ minimises the maximum distance between observations of pairs of clusters
  • ‘average linkage’ minimises the average of the distances between all observations of pairs of clusters
  • ‘single linkage’ minimises the distance between the closest observations of pairs of clusters

image

An interesting aspect of AgglomerativeClustering is that connectivity constraints can be added to this algorithm through a connectivity matrix that defines for each sample the neighboring samples following a given structure of the data. These constraints are useful to impose a certain local structure, but they also make the algorithm faster, especially when the number of samples is high.

Bisecting K-means

This is a variant of K-Means using divisive hierarchical clustering, where clusters are iteratively split into two until reaching the target number of clusters. It’s more efficient than standard K-Means, especially with a high cluster count, as it operates on smaller data subsets per split. While it lacks the “k-means++” initialization, it achieves comparable results with lower computational costs.

This method avoids empty clusters and provides two strategies for splitting: selecting the cluster with the most points (faster and yields balanced clusters) or the cluster with the highest inertia (higher accuracy but costlier).

This example shows a comparison between bisecting k-means and k-means. While the regular K-Means algorithm tends to create non-related clusters, clusters from Bisecting K-Means are well ordered and create quite a visible hierarchy.

image

DBSCAN

The DBSCAN algorithm views clusters as areas of high density separated by areas of low density. Due to this rather generic view, clusters found by DBSCAN can be any shape, as opposed to k-means which assumes that clusters are convex shaped. The central component to the DBSCAN is the concept of core samples, which are samples that are in areas of high density. A cluster is therefore a set of core samples, each close to each other (measured by some distance measure) and a set of non-core samples that are close to a core sample (but are not themselves core samples). There are two parameters to the algorithm, min_samples and eps, which define formally what we mean when we say dense. Higher min_samples or lower eps indicate higher density necessary to form a cluster.

Any core sample is part of a cluster, by definition. Any sample that is not a core sample, and is at least eps in distance from any core sample, is considered an outlier by the algorithm. While the parameter min_samples primarily controls how tolerant the algorithm is towards noise (on noisy and large data sets it may be desirable to increase this parameter), the parameter eps is crucial to choose appropriately for the data set and distance function and usually cannot be left at the default value. It controls the local neighborhood of the points. When chosen too small, most data will not be clustered at all (and labeled as -1 for “noise”). When chosen too large, it causes close clusters to be merged into one cluster, and eventually the entire data set to be returned as a single cluster. Some heuristics for choosing this parameter have been discussed in the literature, for example based on a knee in the nearest neighbor distances plot (as discussed in the references below).
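
A minimal DBSCAN sketch on the two-moons data; the eps and min_samples values below are illustrative:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=500, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)
labels = db.labels_                                        # -1 marks outliers (noise)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
```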

image

Big circles - core samples

Smaller circles - non-core samples

Black points - outliers

  • DBSCAN is deterministic, producing the same clusters if data is provided in the same order
  • Results can vary with different data orders:
    • core samples always form the same clusters, but cluster labels depend on the encounter order
    • non-core samples may be assigned differently if they are close to core samples in different clusters
  • a non-core sample near two core samples from different clusters joins the cluster encountered first in data order
  • uses ball trees and kd-trees for efficient neighborhood searches, avoiding full distance matrix calculations
  • supports custom metrics for distance calculations

HDBSCAN

DBSCAN assumes that the clustering criterion (i.e. density requirement) is globally homogeneous, so it might struggle to capture clusters with different densities. HDBSCAN alleviates this assumption and explores all possible density scales by building an alternative representation of the clustering problem.

Here’s how it works:

  • it calculates a core distance for each point based on a set number of nearest neighbours (min_samples)
  • points are connected based on their mutual reachability distance (distance with adjustments for density). This graph helps find clusters by linking close points while considering density variations
  • it builds a hierarchical cluster structure by iteratively removing high-distance edges, revealing clusters at various densities
  • unlike DBSCAN, HDBSCAN doesn’t require a fixed density threshold (epsilon). It only requires min_samples and optionally min_cluster_size, making it adaptable to clusters of different densities

OPTICS

The OPTICS (Ordering Points To Identify the Clustering Structure) algorithm is similar to DBSCAN but generalises it by allowing a range of eps values rather than a fixed one.

Unlike DBSCAN, OPTICS builds a reachability graph that records each sample’s reachability distance and position in cluster order. This reachability information enables flexible cluster detection at varying densities within the same dataset. By setting max_eps to infinity, OPTICS can perform DBSCAN-style clustering repeatedly for any eps value. The reachability plot, generated from the reachability distances, can be analyzed to extract clusters by identifying steep slopes, with xi controlling the slope sensitivity. Additional options include hierarchical clustering views via reachability-plot dendrograms, accessible through the cluster_hierarchy_ parameter.

Comparison with DBSCAN

The results from OPTICS cluster_optics_dbscan method and DBSCAN are very similar, but not always identical; specifically, labeling of periphery and noise points. This is in part because the first samples of each dense area processed by OPTICS have a large reachability value while being close to other points in their area, and will thus sometimes be marked as noise rather than periphery. This affects adjacent points when they are considered as candidates for being marked as either periphery or noise. Note that for any single value of eps, DBSCAN will tend to have a shorter run time than OPTICS; however, for repeated runs at varying eps values, a single run of OPTICS may require less cumulative runtime than DBSCAN. It is also important to note that OPTICS’ output is close to DBSCAN’s only if eps and max_eps are close.

BIRCH

The BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) algorithm compresses data into a Clustering Feature Tree (CFT), where each node contains subclusters (CF Subclusters) with summary statistics like sample count, linear sum, squared sum, centroid, and squared norm. This structure enables efficient clustering without needing to store all data in memory. Key parameters include the threshold (maximum distance for adding points to subclusters) and branching factor (limit on subclusters per node). BIRCH can act as a data reduction method, creating subclusters for further processing by a global clusterer, defined by the n_clusters parameter.

Clustering performance evaluation

  • rand index
  • mutual information based scores
  • homogeneity, completeness and V-measure
  • fowlkes-mallows scores
  • silhouette coefficient
  • calinski-harabasz index
  • davies-bouldin index
  • contingency matrix
  • pair confusion matrix

Biclustering

Biclustering algorithms simultaneously cluster rows and columns of a data matrix. These clusters of rows and columns are known as biclusters. Each determines a submatrix of the original data matrix with some desired properties.

Algorithms differ in how they define biclusters. Some of the common types include:

  • constant values, constant rows, or constant columns
  • unusually high or low values
  • submatrices with low variance
  • correlated rows or columns

Algorithms also differ in how rows and columns may be assigned to biclusters, which leads to different bicluster structures. Block diagonal or checkerboard structures occur when rows and columns are divided into partitions.

If each row and each column belongs to exactly one bicluster, then rearranging the rows and columns of the data matrix reveals the biclusters on the diagonal.

Spectral Co-Clustering

The SpectralCoclustering algorithm finds biclusters with values higher than those in the corresponding other rows and columns. Each row and each column belongs to exactly one bicluster, so rearranging the rows and columns to make partitions contiguous reveals these high values along the diagonal

Here is an example:

image

Spectral Biclustering

The SpectralBiclustering algorithm assumes that the input data matrix has a hidden checkerboard structure. The rows and columns of a matrix with this structure may be partitioned so that the entries of any bicluster in the Cartesian product of row clusters and column clusters are approximately constant. For instance, if there are two row partitions and three column partitions, each row will belong to three biclusters, and each column will belong to two biclusters. The algorithm partitions the rows and columns of a matrix so that a corresponding blockwise-constant checkerboard matrix provides a good approximation to the original matrix.

Biclustering evaluation

In sklearn, only consensus_score is available. The minimum score is 0 and occurs when all pairs of biclusters are totally dissimilar; the maximum score is 1 and occurs when both sets are identical.


The above order is not exactly the same as in the docs, as I skipped some pages that I already read recently on stream.

That is all for today!

See you tomorrow :)