Hello :) Today is Day 236!
A quick summary of today:
- reading about fundamental unsupervised learning algorithms (Chapter 8 from the Machine Learning Algorithms in Depth book)
- final final final video versions for our ppt
Applications of unsupervised learning range from clustering customer segments in e-commerce to extracting features from image data. The author says the algorithms in this chapter were selected for their mathematical depth and usefulness in real-world applications.
Dirichlet process K-means
The Dirichlet process K-means is a Bayesian nonparametric extension of the K-means algorithm, in which the number of clusters grows with data.
Key Concepts:
- Dynamic Clustering: New clusters are created when a data point is sufficiently far from existing cluster centroids. This flexibility allows the algorithm to adapt to the data, automatically increasing the number of clusters as needed.
- Objective Function: DP-means optimizes a K-means-like objective that includes a penalty term for the number of clusters (written out right after this list), converging to a local optimum.
- Reassignment Step: Similar to K-means, data points are reassigned to the nearest cluster, or a new cluster is initiated if the squared Euclidean distance to all existing centroids exceeds a threshold, λ.
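Written out (this is the standard DP-means penalized objective, and it is what the fit method below accumulates at each iteration):

    Σ_k Σ_{x in cluster k} ||x − μ_k||² + λ·K

so opening a new cluster only pays off when it reduces the total squared distance by more than λ.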
Performance Metrics:
- Normalized Mutual Information (NMI): Measures the similarity between the true label assignments (if available) and the computed cluster assignments. NMI ranges between 0 and 1, with higher values indicating more similar assignments.
- Variation of Information (VI): Quantifies the difference between label assignments. VI decreases as the overlap between predicted and true labels increases.
- Adjusted Rand Index (ARI): Computes the similarity between two clusterings by considering all pairs of samples, approaching 1 when the predicted clusters closely match the true clusters. (A quick scikit-learn check of these metrics is sketched right after this list.)
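As a quick sanity check of the three metrics (my own snippet, not from the book): NMI and ARI are available directly in scikit-learn, and VI can be derived from the entropies and the mutual information of the two labelings.

import numpy as np
from scipy.stats import entropy
from sklearn import metrics

# toy labelings: y_true are ground truth labels, y_pred a clustering of the same points
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 1, 2, 2, 2])

nmi = metrics.normalized_mutual_info_score(y_true, y_pred)
ari = metrics.adjusted_rand_score(y_true, y_pred)

# Variation of Information: VI = H(true) + H(pred) - 2 * I(true, pred), all in nats
mi = metrics.mutual_info_score(y_true, y_pred)
vi = entropy(np.bincount(y_true)) + entropy(np.bincount(y_pred)) - 2.0 * mi

print("NMI: %.3f  ARI: %.3f  VI: %.3f" % (nmi, ari, vi))

Now the DP-means implementation itself: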
import numpy as np
import matplotlib.pyplot as plt
import time
from sklearn import metrics
from sklearn.datasets import load_iris
np.random.seed(42)
class dpmeans:

    def __init__(self, X):
        # Initialize parameters for DP-means
        self.K = 1
        self.K_init = 4
        self.d = X.shape[1]
        self.z = np.mod(np.random.permutation(X.shape[0]), self.K)  # cluster labels in [0, ..., K-1]
        self.mu = np.random.standard_normal((self.K, self.d))
        self.sigma = 1
        self.nk = np.zeros(self.K)
        self.pik = np.ones(self.K) / self.K

        # init mu to the data mean
        self.mu = np.array([np.mean(X, 0)])

        # init lambda
        self.Lambda = self.kpp_init(X, self.K_init)

        self.max_iter = 100
        self.obj = np.zeros(self.max_iter)
        self.em_time = np.zeros(self.max_iter)
    def kpp_init(self, X, k):
        # k-means++ init: lambda is the max distance to the k++ means
        [n, d] = np.shape(X)
        mu = np.zeros((k, d))
        dist = np.inf * np.ones(n)

        mu[0, :] = X[int(np.random.rand()*n - 1), :]
        for i in range(1, k):
            D = X - np.tile(mu[i-1, :], (n, 1))
            dist = np.minimum(dist, np.sum(D*D, 1))
            idx = np.where(np.random.rand() < np.cumsum(dist / float(sum(dist))))
            mu[i, :] = X[idx[0][0], :]
        Lambda = np.max(dist)

        print("Lambda: ", Lambda)
        return Lambda
    def fit(self, X):
        obj_tol = 1e-3
        max_iter = self.max_iter
        [n, d] = np.shape(X)

        obj = np.zeros(max_iter)
        em_time = np.zeros(max_iter)
        print('running dpmeans...')

        for iter in range(max_iter):
            tic = time.time()
            dist = np.zeros((n, self.K))

            # assignment step
            for kk in range(self.K):
                Xm = X - np.tile(self.mu[kk, :], (n, 1))
                dist[:, kk] = np.sum(Xm*Xm, 1)

            # update labels
            dmin = np.min(dist, 1)
            self.z = np.argmin(dist, 1)
            idx = np.where(dmin > self.Lambda)

            if (np.size(idx) > 0):
                self.K = self.K + 1
                self.z[idx[0]] = self.K - 1  # cluster labels in [0, ..., K-1]
                self.mu = np.vstack([self.mu, np.mean(X[idx[0], :], 0)])
                Xm = X - np.tile(self.mu[self.K-1, :], (n, 1))
                dist = np.hstack([dist, np.array([np.sum(Xm*Xm, 1)]).T])

            # update step
            self.nk = np.zeros(self.K)
            for kk in range(self.K):
                self.nk[kk] = self.z.tolist().count(kk)
                idx = np.where(self.z == kk)
                self.mu[kk, :] = np.mean(X[idx[0], :], 0)

            self.pik = self.nk / float(np.sum(self.nk))

            # compute objective
            for kk in range(self.K):
                idx = np.where(self.z == kk)
                obj[iter] = obj[iter] + np.sum(dist[idx[0], kk], 0)
            obj[iter] = obj[iter] + self.Lambda * self.K

            # check convergence
            if (iter > 0 and np.abs(obj[iter] - obj[iter-1]) < obj_tol*obj[iter]):
                print('converged in %d iterations\n' % iter)
                break

            em_time[iter] = time.time() - tic
        # end for

        self.obj = obj
        self.em_time = em_time
        return self.z, obj, em_time
    def compute_nmi(self, z1, z2):
        # compute normalized mutual information
        n = np.size(z1)
        k1 = np.size(np.unique(z1))
        k2 = np.size(np.unique(z2))

        nk1 = np.zeros((k1, 1))
        nk2 = np.zeros((k2, 1))

        for kk in range(k1):
            nk1[kk] = np.sum(z1 == kk)
        for kk in range(k2):
            nk2[kk] = np.sum(z2 == kk)

        pk1 = nk1 / float(np.sum(nk1))
        pk2 = nk2 / float(np.sum(nk2))

        nk12 = np.zeros((k1, k2))
        for ii in range(k1):
            for jj in range(k2):
                nk12[ii, jj] = np.sum((z1 == ii)*(z2 == jj))
        pk12 = nk12 / float(n)

        Hx = -np.sum(pk1 * np.log(pk1 + np.finfo(float).eps))
        Hy = -np.sum(pk2 * np.log(pk2 + np.finfo(float).eps))
        Hxy = -np.sum(pk12 * np.log(pk12 + np.finfo(float).eps))

        MI = Hx + Hy - Hxy
        nmi = MI / float(0.5*(Hx + Hy))
        return nmi
    def generate_plots(self, X):
        plt.close('all')

        plt.figure(0)
        for kk in range(self.K):
            plt.scatter(X[self.z == kk, 0], X[self.z == kk, 1],
                        s=100, marker='o', color=np.random.rand(3,), label=str(kk))
        # end for
        plt.xlabel('X1')
        plt.ylabel('X2')
        plt.legend()
        plt.title('DP-means clusters')
        plt.grid(True)
        plt.show()

        plt.figure(1)
        plt.plot(self.obj)
        plt.title('DP-means objective function')
        plt.xlabel('iterations')
        plt.ylabel('penalized l2 squared distance')
        plt.grid(True)
        plt.show()
if __name__ == "__main__":
iris = load_iris()
X = iris.data
y = iris.target
dp = dpmeans(X)
labels, obj, em_time = dp.fit(X)
dp.generate_plots(X)
nmi = dp.compute_nmi(y,labels)
ari = metrics.adjusted_rand_score(y,labels)
print("NMI: %.4f" % nmi)
print("ARI: %.4f" % ari)
NMI: 0.7582
ARI: 0.7302
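For comparison (my own quick addition, not from the book), ordinary K-means with K fixed to 3 on the same Iris data:

from sklearn.cluster import KMeans
from sklearn import metrics
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
# fixed number of clusters, unlike DP-means which grows K as needed
km_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
print("K-means ARI: %.4f" % metrics.adjusted_rand_score(y, km_labels))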
Gaussian mixture models
I meet them again. This is actually the third time: first in the papers I read on predicting taxi demand, where a GMM is put on top of a regular model so that the output is a distribution; second, when I watched the probabl YouTube channel; and now here.
Mixture models are commonly used to model complex density distributions. For example, you may be interested in discovering patterns in census data consisting of information about the person’s age, income, occupation, and other dimensions. If we plot the resulting data in high-dimensional space, we’ll likely discover nonuniform density characterized by groups or clusters of data points. We can model each cluster using a base probability distribution. Mixture models consist of a convex combination of K base models. In the case of Gaussian mixture models, the base distribution is Gaussian and can be written as follows.
p(x) = Σ_{k=1..K} π_k N(x | μ_k, Σ_k), where the π_k are the mixing proportions and they sum to 1.
Expectation maximization (EM) algorithm
EM is an iterative algorithm that alternates between inferring the missing values given the current parameters (E step) and optimizing the parameters given the filled-in data (M step).
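For a GMM specifically, both steps have closed forms (standard results, matching the estep and mstep methods in the code below). The E step computes responsibilities

    r_nk = π_k N(x_n | μ_k, Σ_k) / Σ_j π_j N(x_n | μ_j, Σ_j)

and the M step re-estimates the parameters with N_k = Σ_n r_nk:

    π_k = N_k / N,   μ_k = (1/N_k) Σ_n r_nk x_n,   Σ_k = (1/N_k) Σ_n r_nk (x_n − μ_k)(x_n − μ_k)^T

The code below implements exactly these updates, with the E step done in log space for numerical stability.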
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
from sklearn.cluster import KMeans
from scipy.stats import multivariate_normal
from scipy.special import logsumexp
from scipy import linalg
np.random.seed(3)
class GMM:

    def __init__(self, n=1e3, d=2, K=4):
        self.n = int(n)  # number of data points
        self.d = d       # data dimension
        self.K = K       # number of clusters

        self.X = np.zeros((self.n, self.d))
        self.mu = np.zeros((self.d, self.K))
        self.sigma = np.zeros((self.d, self.d, self.K))
        self.pik = np.ones(self.K) / K
    def generate_data(self):
        # GMM generative model
        alpha0 = np.ones(self.K)
        pi = np.random.dirichlet(alpha0)

        # ground truth mu and sigma
        mu0 = np.random.randint(0, 10, size=(self.d, self.K)) - 5*np.ones((self.d, self.K))
        V0 = np.zeros((self.d, self.d, self.K))
        for k in range(self.K):
            eigen_mean = 0
            Q = np.random.normal(loc=0, scale=1, size=(self.d, self.d))
            D = np.diag(abs(eigen_mean + np.random.normal(loc=0, scale=1, size=self.d)))
            # note: the element-wise product with the diagonal D yields a random
            # diagonal covariance, which is why the ground truth covariances printed
            # below are diagonal
            V0[:,:,k] = abs(np.transpose(Q)*D*Q)

        # sample data
        for i in range(self.n):
            z = np.random.multinomial(1, pi)
            k = np.nonzero(z)[0][0]
            self.X[i,:] = np.random.multivariate_normal(mean=mu0[:,k], cov=V0[:,:,k], size=1)

        plt.figure()
        plt.scatter(self.X[:,0], self.X[:,1], color='b', alpha=0.5)
        plt.title("Ground Truth Data"); plt.xlabel("X1"); plt.ylabel("X2")
        plt.show()

        return mu0, V0
    def gmm_em(self):
        # init mu with k-means
        kmeans = KMeans(n_clusters=self.K, random_state=42).fit(self.X)
        self.mu = np.transpose(kmeans.cluster_centers_)

        # init sigma
        for k in range(self.K):
            self.sigma[:,:,k] = np.eye(self.d)

        # EM algorithm
        max_iter = 10
        tol = 1e-5
        obj = np.zeros(max_iter)

        for iter in range(max_iter):
            print("EM iter ", iter)

            # E-step
            resp, llh = self.estep()

            # M-step
            self.mstep(resp)

            # check convergence
            obj[iter] = llh
            if (iter > 1 and obj[iter] - obj[iter-1] < tol*abs(obj[iter])):
                break
            # end if
        # end for

        plt.figure()
        plt.plot(obj)
        plt.title('EM-GMM objective'); plt.xlabel("iter"); plt.ylabel("log-likelihood")
        plt.show()
    def estep(self):
        log_r = np.zeros((self.n, self.K))
        for k in range(self.K):
            log_r[:,k] = multivariate_normal.logpdf(self.X, mean=self.mu[:,k], cov=self.sigma[:,:,k])
        # end for
        log_r = log_r + np.log(self.pik)
        L = logsumexp(log_r, axis=1)
        llh = np.sum(L) / self.n  # log likelihood
        log_r = log_r - L.reshape(-1, 1)  # normalize
        resp = np.exp(log_r)
        return resp, llh
    def mstep(self, resp):
        nk = np.sum(resp, axis=0)
        self.pik = nk / self.n
        sqrt_resp = np.sqrt(resp)
        for k in range(self.K):
            # update mu
            rx = np.multiply(resp[:,k].reshape(-1,1), self.X)
            self.mu[:,k] = np.sum(rx, axis=0) / nk[k]

            # update sigma
            Xm = self.X - self.mu[:,k]
            Xm = np.multiply(sqrt_resp[:,k].reshape(-1,1), Xm)
            self.sigma[:,:,k] = np.maximum(0, np.dot(np.transpose(Xm), Xm) / nk[k] + 1e-5 * np.eye(self.d))
        # end for
if __name__ == '__main__':

    gmm = GMM()
    mu0, V0 = gmm.generate_data()
    gmm.gmm_em()

    for k in range(mu0.shape[1]):
        print("cluster ", k)
        print("-----------")
        print("ground truth means:")
        print(mu0[:,k])
        print("ground truth covariance:")
        print(V0[:,:,k])
    # end for

    for k in range(mu0.shape[1]):
        print("cluster ", k)
        print("-----------")
        print("GMM-EM means:")
        print(gmm.mu[:,k])
        print("GMM-EM covariance:")
        print(gmm.sigma[:,:,k])
    # end for

    plt.figure()
    ax = plt.axes()
    plt.scatter(gmm.X[:,0], gmm.X[:,1], color='b', alpha=0.5)
    for k in range(mu0.shape[1]):
        v, w = linalg.eigh(gmm.sigma[:,:,k])
        v = 2.0 * np.sqrt(2.0) * np.sqrt(v)
        u = w[0] / linalg.norm(w[0])

        # plot an ellipse to show the Gaussian component
        angle = np.arctan(u[1] / u[0])
        angle = 180.0 * angle / np.pi  # convert to degrees
        ell = mpl.patches.Ellipse(gmm.mu[:,k], v[0], v[1], angle=180.0 + angle, color='r', alpha=0.5)
        ax.add_patch(ell)

        # plot cluster centroids
        plt.scatter(gmm.mu[0,k], gmm.mu[1,k], s=80, marker='x', color='k', alpha=1)

    plt.title("Gaussian Mixture Model"); plt.xlabel("X1"); plt.ylabel("X2")
    plt.show()
EM objective over the iterations (figure), followed by the printed ground truth vs. GMM-EM estimates:
cluster 0
-----------
ground truth means:
[-2. 2.]
ground truth covariance:
[[0.00169216 0. ]
[0. 1.33783795]]
cluster 1
-----------
ground truth means:
[4. 1.]
ground truth covariance:
[[2.45922460e-03 0.00000000e+00]
[0.00000000e+00 2.63330463e+00]]
cluster 2
-----------
ground truth means:
[ 4. -5.]
ground truth covariance:
[[1.43773758 0. ]
[0. 0.03995228]]
cluster 3
-----------
ground truth means:
[ 0. -1.]
ground truth covariance:
[[0.29126579 0. ]
[0. 0.10456419]]
cluster 0
-----------
GMM-EM means:
[3.99752939 0.92575035]
GMM-EM covariance:
[[2.38145097e-03 3.76559287e-03]
[3.76559287e-03 2.63147414e+00]]
cluster 1
-----------
GMM-EM means:
[ 0.05492661 -0.99472004]
GMM-EM covariance:
[[0.2790043 0. ]
[0. 0.10012724]]
cluster 2
-----------
GMM-EM means:
[ 3.89443404 -4.99587958]
GMM-EM covariance:
[[1.78361804 0.00829965]
[0.00829965 0.04524815]]
cluster 3
-----------
GMM-EM means:
[-2.00245926 2.05932057]
GMM-EM covariance:
[[0.00141642 0. ]
[0. 1.3039054 ]]
Inferred Gaussian mixture overlaid on the original data:
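As a cross-check of my own (not from the book), scikit-learn's GaussianMixture fitted to the same generated data in the same session should recover similar means and covariances:

from sklearn.mixture import GaussianMixture

# gmm.X is the data generated by the script above
gm = GaussianMixture(n_components=4, covariance_type='full', random_state=0).fit(gmm.X)
print(gm.means_)        # estimated means, one row per component
print(gm.covariances_)  # full covariance matrix per component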
Dimensionality reduction
PCA
We would like to project our data vector x ∈ R^D onto a lower-dimensional vector z ∈ R^L with L < D such that the variance of the projected data is maximized. Maximizing the variance of the projected data is the core principle of PCA: it preserves as much of the data's spread, and hence its distinguishing structure, as possible in the lower-dimensional space.
The chapter gives pseudo-code for PCA; in short: center the data, compute the covariance matrix of the centered data, take its top-L eigenvectors, and project the data onto them.
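A minimal NumPy sketch of those steps (my own code, not the book's pseudocode):

import numpy as np
from sklearn.datasets import load_iris

def pca(X, L):
    Xc = X - X.mean(axis=0)                        # center the data
    C = np.cov(Xc, rowvar=False)                   # D x D sample covariance
    eigvals, eigvecs = np.linalg.eigh(C)           # eigenvalues in ascending order
    W = eigvecs[:, np.argsort(eigvals)[::-1][:L]]  # top-L eigenvectors (principal directions)
    return Xc @ W, W                               # projected data and projection matrix

# example: project the 4-D Iris data down to 2-D
Z, W = pca(load_iris().data, L=2)
print(Z.shape)  # (150, 2)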
t-SNE manifold learning on images
Images are high-dimensional objects that live on manifolds. A manifold is a topological space that locally resembles Euclidean space, and by modeling image spaces as manifolds we can study their geometric properties. We can visualize high-dimensional objects with the help of an embedding. Here two embeddings are considered, t-SNE and Isomap, applied to the digits dataset (8x8 MNIST-style images from scikit-learn).
import numpy as np
import matplotlib.pyplot as plt
from time import time
from sklearn import manifold
from sklearn.datasets import load_digits
from sklearn.neighbors import KDTree
def plot_digits(X):
    n_img_per_row = np.amin((20, int(np.sqrt(X.shape[0]))))
    img = np.zeros((10 * n_img_per_row, 10 * n_img_per_row))
    for i in range(n_img_per_row):
        ix = 10 * i + 1
        for j in range(n_img_per_row):
            iy = 10 * j + 1
            img[ix:ix + 8, iy:iy + 8] = X[i * n_img_per_row + j].reshape((8, 8))

    plt.figure()
    plt.imshow(img, cmap=plt.cm.binary)
    plt.xticks([])
    plt.yticks([])
    plt.title('A selection from the 64-dimensional digits dataset')
def mnist_manifold():
    digits = load_digits()
    X = digits.data
    y = digits.target

    num_classes = np.unique(y).shape[0]
    plot_digits(X)

    # TSNE
    # Barnes-Hut: O(d N log N) where d is dim and N is the number of samples
    # Exact: O(d N^2)
    t0 = time()
    tsne = manifold.TSNE(n_components=2, init='pca', method='barnes_hut', verbose=1)
    X_tsne = tsne.fit_transform(X)
    t1 = time()
    print('t-SNE: %.2f sec' % (t1 - t0))
    tsne.get_params()

    plt.figure()
    for k in range(num_classes):
        plt.plot(X_tsne[y == k, 0], X_tsne[y == k, 1], 'o')
    plt.title('t-SNE embedding of digits dataset')
    plt.xlabel('X1')
    plt.ylabel('X2')
    axes = plt.gca()
    axes.set_xlim([X_tsne[:, 0].min() - 1, X_tsne[:, 0].max() + 1])
    axes.set_ylim([X_tsne[:, 1].min() - 1, X_tsne[:, 1].max() + 1])
    plt.show()

    # ISOMAP
    # 1. Nearest neighbors search: O(d log k N log N)
    # 2. Shortest path graph search: O(N^2 (k + log N))
    # 3. Partial eigenvalue decomposition: O(d N^2)
    t0 = time()
    isomap = manifold.Isomap(n_neighbors=5, n_components=2)
    X_isomap = isomap.fit_transform(X)
    t1 = time()
    print('Isomap: %.2f sec' % (t1 - t0))
    isomap.get_params()

    plt.figure()
    for k in range(num_classes):
        plt.plot(X_isomap[y == k, 0], X_isomap[y == k, 1], 'o', label=str(k), linewidth=2)
    plt.title('Isomap embedding of the digits dataset')
    plt.xlabel('X1')
    plt.ylabel('X2')
    plt.show()

    # Use a KD-tree to find the k-nearest neighbors of a query point in the Isomap space
    kdt = KDTree(X_isomap)
    Q = np.array([[-160, -30], [-102, 14]])
    kdt_dist, kdt_idx = kdt.query(Q, k=20)
    plot_digits(X[kdt_idx.ravel(), :])

if __name__ == "__main__":
    mnist_manifold()
The above shows a t-SNE embedding in 2D with 10 clusters, each corresponding to a digit from 0 to 9. In other words, without using labels, the t-SNE embedding lets us discover 10 clusters in the two-dimensional space. Moreover, we expect adjacent clusters of digits to look similar to each other. The image on the right-hand side of figure 8.7 shows sample digits from two adjacent clusters (digit 0 and digit 6). We can visualize adjacent clusters by constructing a KD tree to find the k-nearest neighbors of a query point.
It’s important to be aware of some of the pitfalls of t-SNE. For example, the results may vary with the perplexity hyperparameter, the algorithm doesn’t always produce similar outputs on successive runs, and there are additional hyperparameters controlling the optimization process. Moreover, cluster sizes (as measured by standard deviation) and the distances between clusters might not mean anything. This high flexibility makes t-SNE results tricky to interpret; a small perplexity experiment is sketched below.
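To see the perplexity sensitivity concretely, here is a small experiment of my own (not from the book): embed the digits data with a few perplexity values and compare the layouts side by side.

import matplotlib.pyplot as plt
from sklearn import manifold
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, perp in zip(axes, [5, 30, 100]):
    # same data, same init, different perplexity
    emb = manifold.TSNE(n_components=2, perplexity=perp, init='pca',
                        random_state=0).fit_transform(X)
    ax.scatter(emb[:, 0], emb[:, 1], c=y, s=5, cmap='tab10')
    ax.set_title('perplexity = %d' % perp)
plt.show()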
KB project ppt
Today I went to the lab and met with my project partner; we talked about the ppt and made the final touches to it. I will share it tomorrow.
Here is the final final final video. We added some audio to make it nicer.
And here is a video showing how to re-train a model using Mage.
That is all for today!
See you tomorrow :)