Hello :) Today is Day 236!
A quick summary of today:
- reading about fundamental unsupervised learning algorithms (Chapter 8 from the Machine Learning Algorithms in Depth book)
- final final final video versions for our ppt
Applications of unsupervised learning range from clustering customer segments in e-commerce to extracting features from image data. The author says the algorithms in this chapter were selected for their mathematical depth and usefulness in real-world applications.
Dirichlet process K-means
The Dirichlet process K-means is a Bayesian nonparametric extension of the K-means algorithm, in which the number of clusters grows with data.
Key Concepts:
- Dynamic Clustering: New clusters are created when a data point is sufficiently far from existing cluster centroids. This flexibility allows the algorithm to adapt to the data, automatically increasing the number of clusters as needed.
- Objective Function: DP-means optimizes a K-means-like objective that includes a penalty term for the number of clusters (written out right after this list), converging to a local optimum.
- Reassignment Step: Similar to K-means, data points are reassigned to the nearest cluster, or a new cluster is initiated if the squared Euclidean distance to all existing centroids exceeds a threshold, λ.
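Written out (this is the standard DP-means penalized objective, and it is what the fit method below accumulates at each iteration):

    Σ_k Σ_{x in cluster k} ||x − μ_k||² + λ·K

so opening a new cluster only pays off when it reduces the total squared distance by more than λ.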
Performance Metrics:
- Normalized Mutual Information (NMI): Measures the similarity between the true label assignments (if available) and the computed cluster assignments. NMI ranges between 0 and 1, with higher values indicating more similar assignments.
- Variation of Information (VI): Quantifies the difference between label assignments. VI decreases as the overlap between predicted and true labels increases.
- Adjusted Rand Index (ARI): Computes the similarity between two clusterings by considering all pairs of samples, approaching 1 when the predicted clusters closely match the true clusters. (A quick scikit-learn check of these metrics is sketched right after this list.)
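As a quick sanity check of the three metrics (my own snippet, not from the book): NMI and ARI are available directly in scikit-learn, and VI can be derived from the entropies and the mutual information of the two labelings.

import numpy as np
from scipy.stats import entropy
from sklearn import metrics

# toy labelings: y_true are ground truth labels, y_pred a clustering of the same points
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 1, 2, 2, 2])

nmi = metrics.normalized_mutual_info_score(y_true, y_pred)
ari = metrics.adjusted_rand_score(y_true, y_pred)

# Variation of Information: VI = H(true) + H(pred) - 2 * I(true, pred), all in nats
mi = metrics.mutual_info_score(y_true, y_pred)
vi = entropy(np.bincount(y_true)) + entropy(np.bincount(y_pred)) - 2.0 * mi

print("NMI: %.3f  ARI: %.3f  VI: %.3f" % (nmi, ari, vi))

Now the DP-means implementation itself: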
import numpy as np
import matplotlib.pyplot as plt
import time
from sklearn import metrics
from sklearn.datasets import load_iris
np.random.seed(42)
class dpmeans:

    def __init__(self, X):
        # Initialize parameters for DP-means
        self.K = 1
        self.K_init = 4
        self.d = X.shape[1]
        self.z = np.mod(np.random.permutation(X.shape[0]), self.K)  # cluster labels in [0, ..., K-1]
        self.mu = np.random.standard_normal((self.K, self.d))
        self.sigma = 1
        self.nk = np.zeros(self.K)
        self.pik = np.ones(self.K) / self.K

        # init mu to the data mean
        self.mu = np.array([np.mean(X, 0)])

        # init lambda
        self.Lambda = self.kpp_init(X, self.K_init)

        self.max_iter = 100
        self.obj = np.zeros(self.max_iter)
        self.em_time = np.zeros(self.max_iter)
    def kpp_init(self, X, k):
        # k-means++ init: lambda is the max distance to the k++ means
        [n, d] = np.shape(X)
        mu = np.zeros((k, d))
        dist = np.inf * np.ones(n)

        mu[0, :] = X[int(np.random.rand()*n - 1), :]
        for i in range(1, k):
            D = X - np.tile(mu[i-1, :], (n, 1))
            dist = np.minimum(dist, np.sum(D*D, 1))
            idx = np.where(np.random.rand() < np.cumsum(dist / float(sum(dist))))
            mu[i, :] = X[idx[0][0], :]
        Lambda = np.max(dist)

        print("Lambda: ", Lambda)
        return Lambda
    def fit(self, X):
        obj_tol = 1e-3
        max_iter = self.max_iter
        [n, d] = np.shape(X)

        obj = np.zeros(max_iter)
        em_time = np.zeros(max_iter)
        print('running dpmeans...')

        for iter in range(max_iter):
            tic = time.time()
            dist = np.zeros((n, self.K))

            # assignment step
            for kk in range(self.K):
                Xm = X - np.tile(self.mu[kk, :], (n, 1))
                dist[:, kk] = np.sum(Xm*Xm, 1)

            # update labels
            dmin = np.min(dist, 1)
            self.z = np.argmin(dist, 1)
            idx = np.where(dmin > self.Lambda)

            if (np.size(idx) > 0):
                self.K = self.K + 1
                self.z[idx[0]] = self.K - 1  # cluster labels in [0, ..., K-1]
                self.mu = np.vstack([self.mu, np.mean(X[idx[0], :], 0)])
                Xm = X - np.tile(self.mu[self.K-1, :], (n, 1))
                dist = np.hstack([dist, np.array([np.sum(Xm*Xm, 1)]).T])

            # update step
            self.nk = np.zeros(self.K)
            for kk in range(self.K):
                self.nk[kk] = self.z.tolist().count(kk)
                idx = np.where(self.z == kk)
                self.mu[kk, :] = np.mean(X[idx[0], :], 0)

            self.pik = self.nk / float(np.sum(self.nk))

            # compute objective
            for kk in range(self.K):
                idx = np.where(self.z == kk)
                obj[iter] = obj[iter] + np.sum(dist[idx[0], kk], 0)
            obj[iter] = obj[iter] + self.Lambda * self.K

            # check convergence
            if (iter > 0 and np.abs(obj[iter] - obj[iter-1]) < obj_tol*obj[iter]):
                print('converged in %d iterations\n' % iter)
                break

            em_time[iter] = time.time() - tic
        # end for

        self.obj = obj
        self.em_time = em_time
        return self.z, obj, em_time
    def compute_nmi(self, z1, z2):
        # compute normalized mutual information
        n = np.size(z1)
        k1 = np.size(np.unique(z1))
        k2 = np.size(np.unique(z2))

        nk1 = np.zeros((k1, 1))
        nk2 = np.zeros((k2, 1))

        for kk in range(k1):
            nk1[kk] = np.sum(z1 == kk)
        for kk in range(k2):
            nk2[kk] = np.sum(z2 == kk)

        pk1 = nk1 / float(np.sum(nk1))
        pk2 = nk2 / float(np.sum(nk2))

        nk12 = np.zeros((k1, k2))
        for ii in range(k1):
            for jj in range(k2):
                nk12[ii, jj] = np.sum((z1 == ii)*(z2 == jj))
        pk12 = nk12 / float(n)

        Hx = -np.sum(pk1 * np.log(pk1 + np.finfo(float).eps))
        Hy = -np.sum(pk2 * np.log(pk2 + np.finfo(float).eps))
        Hxy = -np.sum(pk12 * np.log(pk12 + np.finfo(float).eps))

        MI = Hx + Hy - Hxy
        nmi = MI / float(0.5*(Hx + Hy))
        return nmi
    def generate_plots(self, X):
        plt.close('all')

        plt.figure(0)
        for kk in range(self.K):
            plt.scatter(X[self.z == kk, 0], X[self.z == kk, 1],
                        s=100, marker='o', color=np.random.rand(3,), label=str(kk))
        # end for
        plt.xlabel('X1')
        plt.ylabel('X2')
        plt.legend()
        plt.title('DP-means clusters')
        plt.grid(True)
        plt.show()

        plt.figure(1)
        plt.plot(self.obj)
        plt.title('DP-means objective function')
        plt.xlabel('iterations')
        plt.ylabel('penalized l2 squared distance')
        plt.grid(True)
        plt.show()
if __name__ == "__main__":
iris = load_iris()
X = iris.data
y = iris.target
dp = dpmeans(X)
labels, obj, em_time = dp.fit(X)
dp.generate_plots(X)
nmi = dp.compute_nmi(y,labels)
ari = metrics.adjusted_rand_score(y,labels)
print("NMI: %.4f" % nmi)
print("ARI: %.4f" % ari)
NMI: 0.7582
ARI: 0.7302
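For comparison (my own quick addition, not from the book), ordinary K-means with K fixed to 3 on the same Iris data:

from sklearn.cluster import KMeans
from sklearn import metrics
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
# fixed number of clusters, unlike DP-means which grows K as needed
km_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
print("K-means ARI: %.4f" % metrics.adjusted_rand_score(y, km_labels))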
Gaussian mixture models
I meet them again. This is actually the third time: first in the papers I read on predicting taxi demand, where a GMM is put on top of a regular model so that the output is a distribution; second, when I watched the probabl YouTube channel; and now here.
Mixture models are commonly used to model complex density distributions. For example, you may be interested in discovering patterns in census data consisting of information about the person’s age, income, occupation, and other dimensions. If we plot the resulting data in high-dimensional space, we’ll likely discover nonuniform density characterized by groups or clusters of data points. We can model each cluster using a base probability distribution. Mixture models consist of a convex combination of K base models. In the case of Gaussian mixture models, the base distribution is Gaussian and can be written as follows.
p(x) = Σ_{k=1..K} π_k N(x | μ_k, Σ_k), where the π_k are the mixing proportions and they sum to 1.
Expectation maximization (EM) algorithm
EM is an iterative algorithm that alternates between inferring the missing values given the current parameters (E step) and optimizing the parameters given the filled-in data (M step).
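For a GMM specifically, both steps have closed forms (standard results, matching the estep and mstep methods in the code below). The E step computes responsibilities

    r_nk = π_k N(x_n | μ_k, Σ_k) / Σ_j π_j N(x_n | μ_j, Σ_j)

and the M step re-estimates the parameters with N_k = Σ_n r_nk:

    π_k = N_k / N,   μ_k = (1/N_k) Σ_n r_nk x_n,   Σ_k = (1/N_k) Σ_n r_nk (x_n − μ_k)(x_n − μ_k)^T

The code below implements exactly these updates, with the E step done in log space for numerical stability.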
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
from sklearn.cluster import KMeans
from scipy.stats import multivariate_normal
from scipy.special import logsumexp
from scipy import linalg
np.random.seed(3)
class GMM:

    def __init__(self, n=1e3, d=2, K=4):
        self.n = int(n)  # number of data points
        self.d = d       # data dimension
        self.K = K       # number of clusters

        self.X = np.zeros((self.n, self.d))
        self.mu = np.zeros((self.d, self.K))
        self.sigma = np.zeros((self.d, self.d, self.K))
        self.pik = np.ones(self.K) / K
    def generate_data(self):
        # GMM generative model
        alpha0 = np.ones(self.K)
        pi = np.random.dirichlet(alpha0)

        # ground truth mu and sigma
        mu0 = np.random.randint(0, 10, size=(self.d, self.K)) - 5*np.ones((self.d, self.K))
        V0 = np.zeros((self.d, self.d, self.K))
        for k in range(self.K):
            eigen_mean = 0
            Q = np.random.normal(loc=0, scale=1, size=(self.d, self.d))
            D = np.diag(abs(eigen_mean + np.random.normal(loc=0, scale=1, size=self.d)))
            # note: the element-wise product with the diagonal D yields a random
            # diagonal covariance, which is why the ground truth covariances printed
            # below are diagonal
            V0[:,:,k] = abs(np.transpose(Q)*D*Q)

        # sample data
        for i in range(self.n):
            z = np.random.multinomial(1, pi)
            k = np.nonzero(z)[0][0]
            self.X[i,:] = np.random.multivariate_normal(mean=mu0[:,k], cov=V0[:,:,k], size=1)

        plt.figure()
        plt.scatter(self.X[:,0], self.X[:,1], color='b', alpha=0.5)
        plt.title("Ground Truth Data"); plt.xlabel("X1"); plt.ylabel("X2")
        plt.show()

        return mu0, V0
    def gmm_em(self):
        # init mu with k-means
        kmeans = KMeans(n_clusters=self.K, random_state=42).fit(self.X)
        self.mu = np.transpose(kmeans.cluster_centers_)

        # init sigma
        for k in range(self.K):
            self.sigma[:,:,k] = np.eye(self.d)

        # EM algorithm
        max_iter = 10
        tol = 1e-5
        obj = np.zeros(max_iter)

        for iter in range(max_iter):
            print("EM iter ", iter)

            # E-step
            resp, llh = self.estep()

            # M-step
            self.mstep(resp)

            # check convergence
            obj[iter] = llh
            if (iter > 1 and obj[iter] - obj[iter-1] < tol*abs(obj[iter])):
                break
            # end if
        # end for

        plt.figure()
        plt.plot(obj)
        plt.title('EM-GMM objective'); plt.xlabel("iter"); plt.ylabel("log-likelihood")
        plt.show()
    def estep(self):
        log_r = np.zeros((self.n, self.K))
        for k in range(self.K):
            log_r[:,k] = multivariate_normal.logpdf(self.X, mean=self.mu[:,k], cov=self.sigma[:,:,k])
        # end for
        log_r = log_r + np.log(self.pik)
        L = logsumexp(log_r, axis=1)
        llh = np.sum(L) / self.n  # log likelihood
        log_r = log_r - L.reshape(-1, 1)  # normalize
        resp = np.exp(log_r)
        return resp, llh
    def mstep(self, resp):
        nk = np.sum(resp, axis=0)
        self.pik = nk / self.n
        sqrt_resp = np.sqrt(resp)
        for k in range(self.K):
            # update mu
            rx = np.multiply(resp[:,k].reshape(-1,1), self.X)
            self.mu[:,k] = np.sum(rx, axis=0) / nk[k]

            # update sigma
            Xm = self.X - self.mu[:,k]
            Xm = np.multiply(sqrt_resp[:,k].reshape(-1,1), Xm)
            self.sigma[:,:,k] = np.maximum(0, np.dot(np.transpose(Xm), Xm) / nk[k] + 1e-5 * np.eye(self.d))
        # end for
if __name__ == '__main__':

    gmm = GMM()
    mu0, V0 = gmm.generate_data()
    gmm.gmm_em()

    for k in range(mu0.shape[1]):
        print("cluster ", k)
        print("-----------")
        print("ground truth means:")
        print(mu0[:,k])
        print("ground truth covariance:")
        print(V0[:,:,k])
    # end for

    for k in range(mu0.shape[1]):
        print("cluster ", k)
        print("-----------")
        print("GMM-EM means:")
        print(gmm.mu[:,k])
        print("GMM-EM covariance:")
        print(gmm.sigma[:,:,k])
    # end for

    plt.figure()
    ax = plt.axes()
    plt.scatter(gmm.X[:,0], gmm.X[:,1], color='b', alpha=0.5)
    for k in range(mu0.shape[1]):
        v, w = linalg.eigh(gmm.sigma[:,:,k])
        v = 2.0 * np.sqrt(2.0) * np.sqrt(v)
        u = w[0] / linalg.norm(w[0])

        # plot an ellipse to show the Gaussian component
        angle = np.arctan(u[1] / u[0])
        angle = 180.0 * angle / np.pi  # convert to degrees
        ell = mpl.patches.Ellipse(gmm.mu[:,k], v[0], v[1], angle=180.0 + angle, color='r', alpha=0.5)
        ax.add_patch(ell)

        # plot cluster centroids
        plt.scatter(gmm.mu[0,k], gmm.mu[1,k], s=80, marker='x', color='k', alpha=1)

    plt.title("Gaussian Mixture Model"); plt.xlabel("X1"); plt.ylabel("X2")
    plt.show()
EM objective over the iterations (figure), followed by the printed ground truth vs. GMM-EM estimates:
cluster 0
-----------
ground truth means:
[-2. 2.]
ground truth covariance:
[[0.00169216 0. ]
[0. 1.33783795]]
cluster 1
-----------
ground truth means:
[4. 1.]
ground truth covariance:
[[2.45922460e-03 0.00000000e+00]
[0.00000000e+00 2.63330463e+00]]
cluster 2
-----------
ground truth means:
[ 4. -5.]
ground truth covariance:
[[1.43773758 0. ]
[0. 0.03995228]]
cluster 3
-----------
ground truth means:
[ 0. -1.]
ground truth covariance:
[[0.29126579 0. ]
[0. 0.10456419]]
cluster 0
-----------
GMM-EM means:
[3.99752939 0.92575035]
GMM-EM covariance:
[[2.38145097e-03 3.76559287e-03]
[3.76559287e-03 2.63147414e+00]]
cluster 1
-----------
GMM-EM means:
[ 0.05492661 -0.99472004]
GMM-EM covariance:
[[0.2790043 0. ]
[0. 0.10012724]]
cluster 2
-----------
GMM-EM means:
[ 3.89443404 -4.99587958]
GMM-EM covariance:
[[1.78361804 0.00829965]
[0.00829965 0.04524815]]
cluster 3
-----------
GMM-EM means:
[-2.00245926 2.05932057]
GMM-EM covariance:
[[0.00141642 0. ]
[0. 1.3039054 ]]
Inferred Gaussian mixture overlaid on the original data:
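As a cross-check of my own (not from the book), scikit-learn's GaussianMixture fitted to the same generated data in the same session should recover similar means and covariances:

from sklearn.mixture import GaussianMixture

# gmm.X is the data generated by the script above
gm = GaussianMixture(n_components=4, covariance_type='full', random_state=0).fit(gmm.X)
print(gm.means_)        # estimated means, one row per component
print(gm.covariances_)  # full covariance matrix per component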
Dimensionality reduction
PCA
We would like to project our data vector x ∈ R^D onto a lower-dimensional vector z ∈ R^L with L < D such that the variance of the projected data is maximized. Maximizing the variance of the projected data is the core principle of PCA: it preserves as much of the data's spread, and hence its distinguishing structure, as possible in the lower-dimensional space.
The chapter gives pseudo-code for PCA; in short: center the data, compute the covariance matrix of the centered data, take its top-L eigenvectors, and project the data onto them.
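A minimal NumPy sketch of those steps (my own code, not the book's pseudocode):

import numpy as np
from sklearn.datasets import load_iris

def pca(X, L):
    Xc = X - X.mean(axis=0)                        # center the data
    C = np.cov(Xc, rowvar=False)                   # D x D sample covariance
    eigvals, eigvecs = np.linalg.eigh(C)           # eigenvalues in ascending order
    W = eigvecs[:, np.argsort(eigvals)[::-1][:L]]  # top-L eigenvectors (principal directions)
    return Xc @ W, W                               # projected data and projection matrix

# example: project the 4-D Iris data down to 2-D
Z, W = pca(load_iris().data, L=2)
print(Z.shape)  # (150, 2)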
t-SNE manifold learning on images
Images are high-dimensional objects that live on manifolds. A manifold is a topological space that locally resembles Euclidean space, and by modeling image spaces as manifolds we can study their geometric properties. We can visualize high-dimensional objects with the help of an embedding. Here two embeddings are considered, t-SNE and Isomap, applied to the digits dataset (8x8 MNIST-style images from scikit-learn).
import numpy as np
import matplotlib.pyplot as plt
from time import time
from sklearn import manifold
from sklearn.datasets import load_digits
from sklearn.neighbors import KDTree
def plot_digits(X):
    n_img_per_row = np.amin((20, int(np.sqrt(X.shape[0]))))
    img = np.zeros((10 * n_img_per_row, 10 * n_img_per_row))
    for i in range(n_img_per_row):
        ix = 10 * i + 1
        for j in range(n_img_per_row):
            iy = 10 * j + 1
            img[ix:ix + 8, iy:iy + 8] = X[i * n_img_per_row + j].reshape((8, 8))

    plt.figure()
    plt.imshow(img, cmap=plt.cm.binary)
    plt.xticks([])
    plt.yticks([])
    plt.title('A selection from the 64-dimensional digits dataset')
def mnist_manifold():
    digits = load_digits()
    X = digits.data
    y = digits.target

    num_classes = np.unique(y).shape[0]
    plot_digits(X)

    # TSNE
    # Barnes-Hut: O(d N log N) where d is dim and N is the number of samples
    # Exact: O(d N^2)
    t0 = time()
    tsne = manifold.TSNE(n_components=2, init='pca', method='barnes_hut', verbose=1)
    X_tsne = tsne.fit_transform(X)
    t1 = time()
    print('t-SNE: %.2f sec' % (t1 - t0))
    tsne.get_params()

    plt.figure()
    for k in range(num_classes):
        plt.plot(X_tsne[y == k, 0], X_tsne[y == k, 1], 'o')
    plt.title('t-SNE embedding of digits dataset')
    plt.xlabel('X1')
    plt.ylabel('X2')
    axes = plt.gca()
    axes.set_xlim([X_tsne[:, 0].min() - 1, X_tsne[:, 0].max() + 1])
    axes.set_ylim([X_tsne[:, 1].min() - 1, X_tsne[:, 1].max() + 1])
    plt.show()

    # ISOMAP
    # 1. Nearest neighbors search: O(d log k N log N)
    # 2. Shortest path graph search: O(N^2 (k + log N))
    # 3. Partial eigenvalue decomposition: O(d N^2)
    t0 = time()
    isomap = manifold.Isomap(n_neighbors=5, n_components=2)
    X_isomap = isomap.fit_transform(X)
    t1 = time()
    print('Isomap: %.2f sec' % (t1 - t0))
    isomap.get_params()

    plt.figure()
    for k in range(num_classes):
        plt.plot(X_isomap[y == k, 0], X_isomap[y == k, 1], 'o', label=str(k), linewidth=2)
    plt.title('Isomap embedding of the digits dataset')
    plt.xlabel('X1')
    plt.ylabel('X2')
    plt.show()

    # Use a KD-tree to find the k-nearest neighbors of a query point in the Isomap space
    kdt = KDTree(X_isomap)
    Q = np.array([[-160, -30], [-102, 14]])
    kdt_dist, kdt_idx = kdt.query(Q, k=20)
    plot_digits(X[kdt_idx.ravel(), :])

if __name__ == "__main__":
    mnist_manifold()
The above shows a t-SNE embedding in 2D with 10 clusters, each corresponding to a digit from 0 to 9. In other words, without using labels, the t-SNE embedding lets us discover 10 clusters in the two-dimensional space. Moreover, we expect adjacent clusters of digits to look similar to each other. The image on the right-hand side of figure 8.7 shows sample digits from two adjacent clusters (digit 0 and digit 6). We can visualize adjacent clusters by constructing a KD tree to find the k-nearest neighbors of a query point.
It’s important to be aware of some of the pitfalls of t-SNE. For example, the results may vary with the perplexity hyperparameter, the algorithm doesn’t always produce similar outputs on successive runs, and there are additional hyperparameters controlling the optimization process. Moreover, cluster sizes (as measured by standard deviation) and the distances between clusters might not mean anything. This high flexibility makes t-SNE results tricky to interpret; a small perplexity experiment is sketched below.
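To see the perplexity sensitivity concretely, here is a small experiment of my own (not from the book): embed the digits data with a few perplexity values and compare the layouts side by side.

import matplotlib.pyplot as plt
from sklearn import manifold
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, perp in zip(axes, [5, 30, 100]):
    # same data, same init, different perplexity
    emb = manifold.TSNE(n_components=2, perplexity=perp, init='pca',
                        random_state=0).fit_transform(X)
    ax.scatter(emb[:, 0], emb[:, 1], c=y, s=5, cmap='tab10')
    ax.set_title('perplexity = %d' % perp)
plt.show()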
KB project ppt
Today I went to the lab and met with my project partner; we talked about the ppt and made the final touches to it. I will share it tomorrow.
Here is the final final final video. We added some audio to make it nicer.
And here is a video showing how to re-train a model using Mage.
That is all for today!
See you tomorrow :)