Hello :) Today is Day 231!
A quick summary of today:
- the results from the KB competition are out
- watching the latest probabl videos
- reading chapter 7 from the Machine Learning Algorithms in Depth book
The results from the 6th Kukmin Bank Future Finance AI competition are out!
And we are in the top 10 and advancing to the finals. Here is the link to the announcement (it is in Korean only)
And the advancing teams are:
The order is alphabetical, and my labmate's and my project is the 10th one. They show only the first name from each team, which in our case is my labmate's name. Here is a translated table:
| No. | Team Leader | Topic |
|---|---|---|
| 1 | Kang Hee-won | Overseas Financial Report Generation Platform REPOGEN.ai |
| 2 | Kim Bong-seok | KB:OT |
| 3 | Park Jae-jae | SUPPLY LINK AI |
| 4 | Son Hye-seon | Fund Customer Asset Management through KFRI Development |
| 5 | Oh Seung-min | Counselor Support Service Utilizing LLM (Large Language Model) |
| 6 | Lee Chang-mook | Voice Phishing Prevention AI through Counselor Voice Recognition |
| 7 | Jang Hye-young | I-Diary: Helping Retail Investors with Their Investments |
| 8 | Cho Myung-geun | KB-HI |
| 9 | Choi Yong-woo | AI Content Marketer ‘Kard-Toon’ |
| 10 | Choi Jae-hyuk (this is us!) | Fraud Detection through Real-time Graph DB Analysis |
The finals are on the 28th at Ewha University in Seoul, so we have to prepare well!
🥳🥳🥳
UMAP vs PCA
Both PCA and UMAP aim to reduce the dimensionality of our data - to go from X_wide to X_thin. But what is the difference?
To show the difference, we will use the dataset in the pic. Yes, it looks like a Swiss roll.
We will reduce the dimensionality from 3D to 2D and see what is preserved. Before we continue, one of the main differences is that PCA keeps the global structure (and distances) intact, while UMAP is more interested in preserving the local structure (and distances).
PCA result
We can see that the global structure (the Swiss roll) is preserved. However, if we examine this result closely, we may notice that for some points, searching for their nearest neighbors might not yield points of the same color. This is because, in some cases, points of a different color are closer to each other in the PCA projection. PCA looks at the overall variance and total information in the data. While it captures the major directions of variance, it does not prioritize local clusters. PCA does not specifically focus on preserving local neighborhood relationships; instead, it maximizes the global variance.
UMAP result
The notion of global distance is less significant here, and we see that all the points that were close together locally remain close in this 2D space as well. UMAP is effective at preserving local structures and keeping clusters intact from the original data. It focuses on maintaining local neighborhood relationships, ensuring that points that were neighbors in the high-dimensional space stay close in the lower-dimensional projection.
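As a rough sketch of this comparison (my own code, not the one from the probabl video), the Swiss roll experiment can be reproduced with scikit-learn's make_swiss_roll and the umap-learn package; the n_neighbors and random_state values are my own choices:

```python
# Sketch: project a 3D Swiss roll to 2D with PCA and UMAP and compare the results.
# Assumes the umap-learn package is installed (pip install umap-learn).
import matplotlib.pyplot as plt
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import PCA
import umap

# X is (n_samples, 3); t is the position along the roll, used only for coloring
X, t = make_swiss_roll(n_samples=2000, noise=0.05, random_state=42)

X_pca = PCA(n_components=2).fit_transform(X)              # keeps global variance/structure
X_umap = umap.UMAP(n_components=2, n_neighbors=15,
                   random_state=42).fit_transform(X)      # keeps local neighborhoods

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(X_pca[:, 0], X_pca[:, 1], c=t, s=5)
ax1.set_title('PCA (global structure)')
ax2.scatter(X_umap[:, 0], X_umap[:, 1], c=t, s=5)
ax2.set_title('UMAP (local structure)')
plt.show()
```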
PCA and UMAP on text
Vincent (the instructor) also showed an example where text is embedded and then reduced to 2 dimensions using PCA and UMAP, comparing the two. I recommend watching the video, as all the little details cannot be captured in pictures/text.
UMAP
The text is embedded, and using the search box we can find similar text. We can also see the difference in how the ‘blob’ looks in UMAP vs PCA (the PCA picture is below).
When we search for ‘call of duty’ (the text dataset is related to Tesco), the top-left cluster in UMAP lights up, meaning that the text most similar to the query is located there.
PCA
This is the ‘call of duty’ cluster when the text embeddings are reduced to 2D using PCA.
We can clearly see that PCA looks a bit more ‘structured’, more like a circle, and just by examining the shapes of the two projections we cannot tell where clusters exist. If we look at the UMAP picture again, we can see certain parts that look like individual islands where particular clusters may exist. In UMAP, clusters in the original data can remain intact.
Mixture Density Networks
Traditional regression models, including linear regression or even neural networks, typically assume that the target variable y is a deterministic function of the input variable x, or at best has a simple noise model such as Gaussian noise. This means they often output a single value (or a mean and variance for a simple Gaussian noise model), which is insufficient for capturing multimodal distributions. For example, if y can take multiple distinct values for a given x (e.g., a bimodal distribution of outcomes), a traditional neural network would not capture this properly.
MDNs overcome this limitation by modeling the conditional probability distribution p(y | x) as a mixture of several distributions (typically Gaussians), where the parameters of these distributions are functions of the input x.
Instead of outputting a single value, the network outputs the parameters of a mixture distribution (written out in full after this list):
- Mixture coefficients: 𝜋 (sum to 1, representing the weight of each Gaussian)
- Means for each Gaussian component
- Variances for each Gaussian component
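Putting these together, the conditional density the MDN parameterizes with K mixture components is:

$$p(y \mid x) = \sum_{k=1}^{K} \pi_k(x)\, \mathcal{N}\big(y \mid \mu_k(x), \sigma_k^2(x)\big), \qquad \sum_{k=1}^{K} \pi_k(x) = 1$$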
Sample architecture:
And in code:
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt
class MixtureDensityNetwork(nn.Module):
def __init__(self, input_dim, hidden_dim, output_dim, n_gaussians):
super(MixtureDensityNetwork, self).__init__()
self.input_dim = input_dim
self.hidden_dim = hidden_dim
self.output_dim = output_dim
self.n_gaussians = n_gaussians
self.hidden_layer = nn.Linear(input_dim, hidden_dim)
self.pi_layer = nn.Linear(hidden_dim, n_gaussians)
self.mu_layer = nn.Linear(hidden_dim, n_gaussians * output_dim)
self.sigma_layer = nn.Linear(hidden_dim, n_gaussians * output_dim)
def forward(self, x):
h = F.softplus(self.hidden_layer(x))
pi = F.softmax(self.pi_layer(h), dim=1)
mu = self.mu_layer(h).view(-1, self.n_gaussians, self.output_dim)
sigma = F.softplus(self.sigma_layer(h)).view(-1, self.n_gaussians, self.output_dim)
return pi, mu, sigma
def mdn_loss(pi, mu, sigma, target):
normal = torch.distributions.Normal(mu, sigma)
log_prob = normal.log_prob(target.unsqueeze(1).expand_as(mu))
weighted_logprob = log_prob + torch.log(pi.unsqueeze(-1))
return -torch.logsumexp(weighted_logprob, dim=1).mean()
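To tie the pieces together, below is my own minimal training-and-sampling sketch around the MixtureDensityNetwork and mdn_loss defined above; the toy data, optimizer, and hyperparameters are assumptions, not from the stream:

```python
import torch

# Hypothetical toy setup: 1D inputs/outputs and 5 mixture components.
model = MixtureDensityNetwork(input_dim=1, hidden_dim=32, output_dim=1, n_gaussians=5)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Classic "inverted sine" toy data: for a given x there are several plausible y values,
# so p(y | x) is multimodal and a single-output regressor would struggle here.
y = torch.rand(1024, 1) * 10.0
x = torch.sin(0.75 * y) + 0.1 * y + 0.2 * torch.randn_like(y)

for step in range(3000):
    pi, mu, sigma = model(x)
    loss = mdn_loss(pi, mu, sigma, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# To generate predictions, sample a component per input and then draw from that Gaussian.
with torch.no_grad():
    pi, mu, sigma = model(x)
    k = torch.multinomial(pi, num_samples=1)            # (batch, 1) chosen component per point
    mu_k = torch.gather(mu.squeeze(-1), 1, k)           # (batch, 1) mean of chosen component
    sigma_k = torch.gather(sigma.squeeze(-1), 1, k)     # (batch, 1) std of chosen component
    y_sampled = torch.normal(mu_k, sigma_k)
```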
Using a toy dataset, we can see an example output when using an MDN:
We can also see what is happening internally in the MDN:
In the latest stream, Vincent actually talks about a side project - scikit-mdn - where he developed an MDN using PyTorch and then wrapped it with scikit-learn components to give a scikit-learn-style user experience. In this livestream he mostly talked about the implementation and how it works.
Interview with Vincent
I searched his name on YouTube, and it turns out that he did an interview with Alexey from DataTalksClub. Link
Apparently he is writing a book related to expectations vs reality in the data world, and I am looking forward to it.
From the interview, I found out about Vincent’s platform called calmcode. It has many videos about different tools that help the Python dev process <- so I need to check this out as well.
Machine Learning Algorithms in Depth - Chapter 7: Selected supervised learning algorithms
Markov models: PageRank and HMM
- Markov models have the Markov property: conditioned on the present state, the future states are independent of the past. In other words, the present state serves as a sufficient statistic for the next state
- PageRank is the stationary distribution of the Markov chain described by the transition matrix, which is recurrent and aperiodic. At scale, PageRank is computed using power iterations
import numpy as np
from numpy.linalg import norm
np.random.seed(42)
class page_rank():
def __init__(self):
self.max_iter = 100
self.tolerance = 1e-5
def power_iteration(self, A):
n = np.shape(A)[0]
v = np.random.rand(n)
converged = False
iter = 0
while (not converged) and (iter < self.max_iter):
old_v = v
v = np.dot(A, v)
v = v / norm(v)
lambd = np.dot(v, np.dot(A, v))
converged = norm(v - old_v) < self.tolerance
iter += 1
#end while
return lambd, v
if __name__ == "__main__":
#construct a symmetric real matrix
X = np.random.rand(10,5)
A = np.dot(X.T, X)
pr = page_rank()
lambd, v = pr.power_iteration(A)
print(lambd)
print(v)
#compare against np.linalg implementation
eigval, eigvec = np.linalg.eig(A)
idx = np.argsort(np.abs(eigval))[::-1]
top_lambd = eigval[idx][0]
top_v = eigvec[:, idx[0]] #top eigenvector is the column for the largest eigenvalue
assert np.allclose(lambd, top_lambd, 1e-3)
assert np.allclose(np.abs(v), np.abs(top_v), 1e-3) #eigenvectors are only defined up to sign
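The example above runs power iteration on a random symmetric matrix rather than an actual link matrix. As a complementary sketch (my own, not from the book), here is the same idea applied to a small made-up column-stochastic link matrix with a damping factor, which is closer to how PageRank is usually stated:

```python
import numpy as np

# Hypothetical 4-page web: column j holds the outgoing-link probabilities of page j
links = np.array([
    [0,   0,   1, 0.5],
    [1/3, 0,   0, 0  ],
    [1/3, 0.5, 0, 0.5],
    [1/3, 0.5, 0, 0  ],
])
d = 0.85                                          # damping factor
n = links.shape[0]
G = d * links + (1 - d) / n * np.ones((n, n))     # "Google matrix", still column-stochastic

r = np.ones(n) / n                                # start from the uniform distribution
for _ in range(100):
    r_next = G @ r                                # one power-iteration step
    if np.linalg.norm(r_next - r, 1) < 1e-10:
        break
    r = r_next
print(r / r.sum())                                # stationary distribution = PageRank scores
```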
- Hidden Markov models model time series data and consist of a latent state Markov chain, a transition matrix between the latent states, and an emission matrix that models the observed data emitted from each state. Inference is carried out with the forward–backward algorithm (the E-step of EM when learning the parameters), while the most likely latent sequence is decoded with the Viterbi algorithm
import numpy as np
from scipy.sparse import coo_matrix
import matplotlib.pyplot as plt
np.random.seed(42)
class HMM():
def __init__(self, d=3, k=2, n=10000):
self.d = d #dimension of data
self.k = k #dimension of latent state
self.n = n #number of data points
self.A = np.zeros((k,k)) #transition matrix
self.E = np.zeros((k,d)) #emission matrix
self.s = np.zeros(k) #initial state vector
self.x = np.zeros(self.n) #emitted observations
def normalize_mat(self, X, dim=1):
z = np.sum(X, axis=dim)
Xnorm = X/z.reshape(-1,1)
return Xnorm
def normalize_vec(self, v):
z = sum(v)
u = v / z
return u, z
def init_hmm(self):
#initialize matrices at random
self.A = self.normalize_mat(np.random.rand(self.k,self.k))
self.E = self.normalize_mat(np.random.rand(self.k,self.d))
self.s, _ = self.normalize_vec(np.random.rand(self.k))
#generate markov observations
z = np.random.choice(self.k, size=1, p=self.s)
self.x[0] = np.random.choice(self.d, size=1, p=self.E[z,:].ravel())
for i in range(1, self.n):
z = np.random.choice(self.k, size=1, p=self.A[z,:].ravel())
self.x[i] = np.random.choice(self.d, size=1, p=self.E[z,:].ravel())
#end for
def forward_backward(self):
#construct sparse matrix X of emission indicators
data = np.ones(self.n)
row = self.x.astype(int) #observation indices must be integers for coo_matrix
col = np.arange(self.n)
X = coo_matrix((data, (row, col)), shape=(self.d, self.n))
M = self.E * X
At = np.transpose(self.A)
c = np.zeros(self.n) #normalization constants
alpha = np.zeros((self.k, self.n)) #alpha = p(z_t = j | x_{1:T})
alpha[:,0], c[0] = self.normalize_vec(self.s * M[:,0])
for t in range(1, self.n):
alpha[:,t], c[t] = self.normalize_vec(np.dot(At, alpha[:,t-1]) * M[:,t])
#end for
beta = np.ones((self.k, self.n))
for t in range(self.n-2, -1, -1): #include t=0 so beta (and gamma) cover the full sequence
beta[:,t] = np.dot(self.A, beta[:,t+1] * M[:,t+1])/c[t+1]
#end for
gamma = alpha * beta
return gamma, alpha, beta, c
def viterbi(self):
#construct sparse matrix X of emission indicators
data = np.ones(self.n)
row = self.x.astype(int) #observation indices must be integers for coo_matrix
col = np.arange(self.n)
X = coo_matrix((data, (row, col)), shape=(self.d, self.n))
#log scale for numerical stability
s = np.log(self.s)
A = np.log(self.A)
M = np.log(self.E * X)
Z = np.zeros((self.k, self.n))
Z[:,0] = np.arange(self.k)
v = s + M[:,0]
for t in range(1, self.n):
Av = A + v.reshape(-1,1)
v = np.max(Av, axis=0)
idx = np.argmax(Av, axis=0)
v = v.reshape(-1,1) + M[:,t].reshape(-1,1)
Z = Z[idx,:]
Z[:,t] = np.arange(self.k)
#end for
llh = np.max(v)
idx = np.argmax(v)
z = Z[idx,:]
return z, llh
if __name__ == "__main__":
hmm = HMM()
hmm.init_hmm()
gamma, alpha, beta, c = hmm.forward_backward()
z, llh = hmm.viterbi()
print("Viterbi log-likelihood:", llh)
print("first 20 decoded states:", z[:20])
Imbalanced Learning
Most classification algorithms perform optimally only when the number of samples in each class is roughly the same. Highly skewed datasets, where the minority class is outnumbered by one or more classes, commonly occur in fraud detection, medical diagnosis, and computational biology. One way of addressing this issue is by resampling the dataset to offset the imbalance and arrive at a more robust and accurate decision boundary. Resampling techniques can be broadly divided into four categories: undersampling the majority class, oversampling the minority class, combining over- and undersampling, and creating an ensemble of balanced datasets.
- Random Undersampling: This method randomly removes data points from the majority class to balance the dataset
- Cluster Centroids: This approach uses the K-means algorithm to replace a cluster of samples with the centroid of the cluster, with the number of clusters determined by the desired level of undersampling
- Tomek Links: This method removes overlapping examples between classes to clean up the decision boundary. A Tomek link is a pair of instances from different classes that are the closest to each other. By removing these pairs, the method helps in eliminating noise and refining class boundaries
- One-Sided Selection (OSS): OSS selects a subset of the majority class and combines it with all the minority class examples. The resulting set is then refined by removing any majority class Tomek links, leading to a more balanced and cleaner dataset for training
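Apart from Tomek links (implemented in the next block), the other undersampling strategies listed above are available in imbalanced-learn with the same fit_resample interface. A minimal sketch, assuming a made-up toy dataset and default settings:

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler, ClusterCentroids, OneSidedSelection

# toy imbalanced dataset (the sizes and weights here are arbitrary)
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
print('original:', Counter(y))

for sampler in (RandomUnderSampler(random_state=42),
                ClusterCentroids(random_state=42),
                OneSidedSelection(random_state=42)):
    X_res, y_res = sampler.fit_resample(X, y)
    print(type(sampler).__name__, Counter(y_res))
```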
Tomek Links algorithm:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle
from imblearn.under_sampling import TomekLinks
rng = np.random.RandomState(42)
def main():
#generate data
n_samples_1 = 500
n_samples_2 = 50
X_syn = np.r_[1.5 * rng.randn(n_samples_1, 2), 0.5 * rng.randn(n_samples_2, 2) + [2, 2]]
y_syn = np.array([0] * (n_samples_1) + [1] * (n_samples_2))
X_syn, y_syn = shuffle(X_syn, y_syn)
X_syn_train, X_syn_test, y_syn_train, y_syn_test = train_test_split(X_syn, y_syn)
# remove Tomek links
tl = TomekLinks(sampling_strategy='auto')
X_resampled, y_resampled = tl.fit_resample(X_syn, y_syn)
idx_resampled = tl.sample_indices_
idx_samples_removed = np.setdiff1d(np.arange(X_syn.shape[0]),idx_resampled)
#generate plots
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
idx_class_0 = y_resampled == 0
plt.scatter(X_resampled[idx_class_0, 0], X_resampled[idx_class_0, 1], alpha=.8, marker = "o", label='Class #0')
plt.scatter(X_resampled[~idx_class_0, 0], X_resampled[~idx_class_0, 1], alpha=.8, marker = "s", label='Class #1')
plt.scatter(X_syn[idx_samples_removed, 0], X_syn[idx_samples_removed, 1], alpha=.8, marker = "v", label='Removed samples')
plt.title('Undersampling: Tomek links')
plt.legend()
plt.show()
if __name__ == "__main__":
main()
Oversampling strategies
- Random Oversampling: This method adds data points to the minority class uniformly at random to balance the dataset
- Synthetic Minority Oversampling Technique (SMOTE): SMOTE generates synthetic examples by identifying K-nearest neighbors in the feature space and creating new data points along the line segments between the selected neighbors. The new data points are created by taking the difference between a sample and its nearest neighbor, multiplying it by a random number between 0 and 1, and adding this to the original sample
- Adaptive Synthetic Sampling (ADASYN): Similar to SMOTE, but it generates more synthetic samples for minority examples that are harder to learn, i.e., those whose neighborhoods contain more majority-class points (a short sketch follows after the SMOTE example below)
SMOTE algorithm:
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from imblearn.over_sampling import SMOTE
def plot_resampling(ax, X, y, title):
c0 = ax.scatter(X[y == 0, 0], X[y == 0, 1], label="Class #0", marker="o", alpha=0.5)
c1 = ax.scatter(X[y == 1, 0], X[y == 1, 1], label="Class #1", marker="s", alpha=0.5)
ax.set_title(title)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.get_xaxis().tick_bottom()
ax.get_yaxis().tick_left()
ax.spines['left'].set_position(('outward', 10))
ax.spines['bottom'].set_position(('outward', 10))
ax.set_xlim([-6, 8])
ax.set_ylim([-6, 6])
return c0, c1
def main():
# generate the dataset
X, y = make_classification(n_classes=2, class_sep=2, weights=[0.3, 0.7],
n_informative=3, n_redundant=1, flip_y=0,
n_features=20, n_clusters_per_class=1,
n_samples=80, random_state=10)
# fit PCA for visualization
pca = PCA(n_components=2)
X_vis = pca.fit_transform(X)
# apply regular SMOTE
method = SMOTE()
X_res, y_res = method.fit_resample(X, y)
X_res_vis = pca.transform(X_res)
# generate plots
f, (ax1, ax2) = plt.subplots(1, 2)
c0, c1 = plot_resampling(ax1, X_vis, y, 'Original')
plot_resampling(ax2, X_res_vis, y_res, 'SMOTE')
ax1.legend((c0, c1), ('Class #0', 'Class #1'))
plt.tight_layout()
plt.show()
if __name__ == "__main__":
main()
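ADASYN was listed above without code; a minimal sketch using imbalanced-learn (the dataset parameters are my own, not from the book):

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import ADASYN

# toy imbalanced dataset (arbitrary sizes)
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
print('original:', Counter(y))

# ADASYN concentrates synthetic generation on minority samples that are harder to learn
# (those whose nearest neighbors contain more majority-class points)
X_res, y_res = ADASYN(random_state=42).fit_resample(X, y)
print('resampled:', Counter(y_res))
```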
Active Learning
- Active Learning is a machine learning approach where the algorithm actively selects the most informative data samples to train on. This strategy is designed to maximize learning efficiency, allowing the model to achieve high performance with fewer labeled training examples. This is particularly useful in scenarios where obtaining labeled data is expensive or time-consuming. There are two primary query strategies in active learning:
- Uncertainty Sampling: The model selects data points for which it is least confident in its predictions. These are typically the samples near the decision boundary, where the model’s predictions have high uncertainty. By focusing on these uncertain samples, the model can learn more effectively and improve its performance more rapidly (a minimal sketch follows after this list).
- Query by Committee (QBC): In this strategy, a committee of models with differing hypotheses is used. The models are trained on the same dataset but with different initializations or algorithms. The data points on which the committee members disagree the most are selected for labeling. This disagreement, or diversity in predictions, helps to identify samples that could provide the most learning benefit when labeled
- The Bias–Variance Tradeoff is a fundamental concept in machine learning that deals with the tradeoff between two types of errors that a model can make:
- Bias: This is the error due to overly simplistic assumptions in the model. High bias can cause the model to miss relevant relations between features and target outputs, leading to underfitting
- Variance: This is the error due to the model’s sensitivity to small fluctuations in the training data. High variance can cause the model to capture noise in the training data as if it were part of the true signal, leading to overfitting
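Coming back to uncertainty sampling, here is a minimal sketch of one pool-based query round; the dataset, model, and batch size are my own assumptions, not from the book:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Hypothetical pool-based setup: a tiny labeled seed set and a large unlabeled pool.
X, y = make_classification(n_samples=2000, n_features=10, random_state=42)
labeled_idx = np.concatenate([np.where(y == 0)[0][:10], np.where(y == 1)[0][:10]])
pool_idx = np.setdiff1d(np.arange(len(X)), labeled_idx)

model = LogisticRegression(max_iter=1000).fit(X[labeled_idx], y[labeled_idx])

# Least-confidence uncertainty: 1 minus the probability of the predicted class.
proba = model.predict_proba(X[pool_idx])
uncertainty = 1.0 - proba.max(axis=1)

# Query the 10 most uncertain pool points for labeling in the next round.
query_idx = pool_idx[np.argsort(uncertainty)[-10:]]
print("indices to send to the annotator:", query_idx)
```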
Model selection: Hyperparameter tuning
In model selection, we often want to follow Occam's razor and choose the simplest model that explains the data well. For hyperparameter optimization, in addition to grid search and random search, Bayesian optimization uses an active-learning-like approach that reduces uncertainty while balancing exploration and exploitation.
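For reference, the grid search and random search baselines mentioned here map directly to scikit-learn's GridSearchCV and RandomizedSearchCV; a minimal sketch with arbitrary parameter ranges of my own choosing:

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=42)

# grid search: exhaustive over a small, hand-picked set of values
grid = GridSearchCV(SVC(), {'C': [0.1, 1, 10], 'gamma': [1e-3, 1e-2, 1e-1]}, cv=3)
grid.fit(X, y)

# random search: samples parameter settings from distributions instead of a fixed grid
rand = RandomizedSearchCV(SVC(),
                          {'C': loguniform(1e-2, 1e2), 'gamma': loguniform(1e-4, 1e0)},
                          n_iter=20, cv=3, random_state=42)
rand.fit(X, y)
print(grid.best_params_, rand.best_params_)
```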
Bayesian optimization
Instead of randomly exploring the parameter space, Bayesian optimization adopts an active learning approach to select continuous parameter values that reduce uncertainty while balancing exploration and exploitation. This method leverages Gaussian Processes (GPs) to model the algorithm’s generalization performance, providing a Bayesian framework for optimization.
Bayesian optimization operates under the assumption that the performance function can be sampled from a GP, maintaining a posterior distribution as observations are made. The optimization process involves strategies like Expected Improvement (EI) and the Gaussian Process Upper Confidence Bound (UCB), which are effective in minimizing the number of evaluations required to find the global optimum of complex, multimodal functions.
Unlike methods that rely on local gradient and Hessian approximations, Bayesian optimization utilizes all available information from previous evaluations. This approach is particularly useful for optimizing non-convex functions with relatively few evaluations, making it ideal for expensive processes like hyperparameter tuning in deep neural networks. The tradeoff is that more computation is required to determine the next point to evaluate, but this leads to a more efficient and automated optimization process.
Algorithm:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier as RFC
from sklearn.svm import SVC
from bayes_opt import BayesianOptimization
np.random.seed(42)
# Load data set and target values
data, target = make_classification(
n_samples=1000,
n_features=45,
n_informative=12,
n_redundant=7
)
target = target.ravel()
def svccv(gamma):
val = cross_val_score(
SVC(gamma=gamma, random_state=0),
data, target, scoring='f1', cv=2
).mean()
return val
def rfccv(n_estimators, max_depth):
val = cross_val_score(
RFC(n_estimators=int(n_estimators),
max_depth=int(max_depth),
random_state=0
),
data, target, scoring='f1', cv=2
).mean()
return val
if __name__ == "__main__":
gp_params = {"alpha": 1e-5}
#SVM
svcBO = BayesianOptimization(svccv,
{'gamma': (0.00001, 0.1)})
svcBO.maximize(init_points=3, n_iter=4, **gp_params)
#Random Forest
rfcBO = BayesianOptimization(
rfccv,
{'n_estimators': (10, 300),
'max_depth': (2, 10)
}
)
rfcBO.maximize(init_points=4, n_iter=4, **gp_params)
print('Final Results')
print(svcBO.max)
print(rfcBO.max)
The F1 score was used as the performance objective function for a classification task. The figure on the left shows the Bayesian optimization of the F1 score as a function of the gamma parameter of the SVM RBF kernel K(x, x') = \exp(-\gamma \|x - x'\|^2), where gamma is the precision, equal to the inverse variance. We can see that after only seven iterations, we have discovered the gamma parameter that gives the maximum F1 score. The peak of the Expected Improvement (EI) utility function at the bottom indicates which experiment to perform next.
The figure on the right shows the Bayesian optimization of the F1 score as a function of maximum depth and the number of estimators of a Random Forest classifier. From the heatmap, we can observe that the maximum F1 score is achieved for 158 estimators with a depth equal to 10.
Ensemble methods
Ensemble methods are meta-algorithms that combine several ML techniques into one predictive model to decrease variance (bagging), decrease bias (boosting), or improve predictions (stacking).
Ensemble methods can be divided into two groups: sequential ensemble methods, where the base learners are generated sequentially (e.g., AdaBoost), and parallel ensemble methods, where the base learners are generated in parallel (e.g., random forest).
To ensure ensemble methods are more accurate than any of their individual members, the base learners must be as accurate and diverse as possible.
Bagging
Bagging stands for bootstrap aggregation. One way to reduce the variance of an estimate is to average together multiple estimates. For example, we can train M different decision trees f_m on different subsets of the data (chosen randomly with replacement) and compute the ensemble average of their outputs. Bagging uses bootstrap sampling to obtain the data subsets for training the base learners. For aggregating the outputs of base learners, bagging uses voting for classification and averaging for regression.
import itertools
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from mlxtend.plotting import plot_learning_curves
from mlxtend.plotting import plot_decision_regions
def main():
iris = datasets.load_iris()
X, y = iris.data[:, 0:2], iris.target
clf1 = DecisionTreeClassifier(criterion='entropy', max_depth=None)
clf2 = KNeighborsClassifier(n_neighbors=1)
bagging1 = BaggingClassifier(base_estimator=clf1, n_estimators=10, max_samples=0.8, max_features=0.8)
bagging2 = BaggingClassifier(base_estimator=clf2, n_estimators=10, max_samples=0.8, max_features=0.8)
label = ['Decision Tree', 'K-NN', 'Bagging Tree', 'Bagging K-NN']
clf_list = [clf1, clf2, bagging1, bagging2]
fig = plt.figure(figsize=(10, 8))
gs = gridspec.GridSpec(2, 2)
grid = itertools.product([0,1],repeat=2)
for clf, label, grd in zip(clf_list, label, grid):
scores = cross_val_score(clf, X, y, cv=3, scoring='accuracy')
print("Accuracy: %.2f (+/- %.2f) [%s]" %(scores.mean(), scores.std(), label))
clf.fit(X, y)
ax = plt.subplot(gs[grd[0], grd[1]])
fig = plot_decision_regions(X=X, y=y, clf=clf, legend=2)
plt.title(label)
plt.show()
#plt.savefig('./figures/bagging_ensemble.png')
#plot learning curves
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
plt.figure()
plot_learning_curves(X_train, y_train, X_test, y_test, bagging1, print_model=False, style='ggplot')
plt.show()
#plt.savefig('./figures/bagging_ensemble_learning_curve.png')
#Ensemble Size
num_est = list(map(int, np.linspace(1,100,20)))
bg_clf_cv_mean = []
bg_clf_cv_std = []
for n_est in num_est:
print("num_est: ", n_est)
bg_clf = BaggingClassifier(base_estimator=clf1, n_estimators=n_est, max_samples=0.8, max_features=0.8)
scores = cross_val_score(bg_clf, X, y, cv=3, scoring='accuracy')
bg_clf_cv_mean.append(scores.mean())
bg_clf_cv_std.append(scores.std())
plt.figure()
(_, caps, _) = plt.errorbar(num_est, bg_clf_cv_mean, yerr=bg_clf_cv_std, c='blue', fmt='-o', capsize=5)
for cap in caps:
cap.set_markeredgewidth(1)
plt.ylabel('Accuracy'); plt.xlabel('Ensemble Size'); plt.title('Bagging Tree Ensemble');
plt.show()
#plt.savefig('./figures/bagging_ensemble_size.png')
if __name__ == "__main__":
main()
The figure shows the decision boundaries of the decision tree and KNN classifiers, along with their bagging ensembles, applied to the iris dataset. The decision tree shows axis-parallel boundaries, while the k=1 nearest neighbor classifier fits closely to the data points. The bagging ensembles were trained using 10 base estimators with 0.8 subsampling of the training data and 0.8 subsampling of the features. The decision tree bagging ensemble achieved higher accuracy than the KNN bagging ensemble because KNN is less sensitive to perturbations of the training samples; such models are therefore called stable learners. Combining stable learners is less advantageous, since the ensemble does not help improve generalization performance. Notice also that the decision boundary of the bagging KNN looks similar to the decision boundary of the bagging tree as a result of voting.

The figure also shows how the test accuracy improves with the size of the ensemble. Based on the cross-validation results, the accuracy increases until approximately 20 base estimators and then plateaus. Thus, adding base estimators beyond 20 only increases computational complexity without accuracy gains for the iris dataset.

Finally, the learning curves for the bagging tree ensemble show an average error of 0.15 on the training data and a U-shaped error curve for the test data. The smallest gap between training and test errors occurs at around 80% of the training set size.
A commonly used class of ensemble algorithms is forests of randomized trees. In a random forest, each tree in the ensemble is built from a sample drawn with replacement (i.e., a bootstrap sample) from the training set. In addition, instead of using all the features, a random subset of features is selected, further randomizing the tree. As a result, the bias of the forest increases slightly, but due to the averaging of less correlated trees, its variance decreases, resulting in an overall better model.
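The random feature subsets described here correspond to the max_features argument in scikit-learn; a minimal sketch (my own, not from the book) contrasting a plain bagged tree ensemble with a random forest on the same two iris features used above:

```python
from sklearn import datasets
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

iris = datasets.load_iris()
X, y = iris.data[:, 0:2], iris.target

# bagging: bootstrap samples only; random forest: bootstrap samples + a random
# subset of features considered at each split (max_features)
bagged_trees = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=0)
forest = RandomForestClassifier(n_estimators=100, max_features='sqrt', random_state=0)

for name, clf in [('bagging', bagged_trees), ('random forest', forest)]:
    scores = cross_val_score(clf, X, y, cv=3, scoring='accuracy')
    print(f'{name}: {scores.mean():.2f} (+/- {scores.std():.2f})')
```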
Boosting
Boosting refers to a family of algorithms that can convert weak learners to strong learners. The main principle of boosting is to fit a sequence of weak learners (i.e., models that are only slightly better than random guessing, such as small decision trees) to weighted versions of the data, where more weight is given to examples that were misclassified by earlier rounds. The predictions are then combined through a weighted majority vote (classification) or a weighted sum (regression) to produce the final prediction. The principal difference between boosting and the committee methods, such as bagging, is that base learners are trained in sequence on a weighted version of the data.
AdaBoost algorithm:
import itertools
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from mlxtend.plotting import plot_learning_curves
from mlxtend.plotting import plot_decision_regions
def main():
iris = datasets.load_iris()
X, y = iris.data[:, 0:2], iris.target
#XOR dataset
#X = np.random.randn(200, 2)
#y = np.array(map(int,np.logical_xor(X[:, 0] > 0, X[:, 1] > 0)))
clf = DecisionTreeClassifier(criterion='entropy', max_depth=1)
num_est = [1, 2, 3, 10]
label = ['AdaBoost (n_est=1)', 'AdaBoost (n_est=2)', 'AdaBoost (n_est=3)', 'AdaBoost (n_est=10)']
fig = plt.figure(figsize=(10, 8))
gs = gridspec.GridSpec(2, 2)
grid = itertools.product([0,1],repeat=2)
for n_est, label, grd in zip(num_est, label, grid):
boosting = AdaBoostClassifier(base_estimator=clf, n_estimators=n_est)
boosting.fit(X, y)
ax = plt.subplot(gs[grd[0], grd[1]])
fig = plot_decision_regions(X=X, y=y, clf=boosting, legend=2)
plt.title(label)
plt.show()
#plt.savefig('./figures/boosting_ensemble.png')
#plot learning curves
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
boosting = AdaBoostClassifier(base_estimator=clf, n_estimators=10)
plt.figure()
plot_learning_curves(X_train, y_train, X_test, y_test, boosting, print_model=False, style='ggplot')
plt.show()
#plt.savefig('./figures/boosting_ensemble_learning_curve.png')
#Ensemble Size
num_est = list(map(int, np.linspace(1,100,20)))
bg_clf_cv_mean = []
bg_clf_cv_std = []
for n_est in num_est:
print("num_est: ", n_est)
ada_clf = AdaBoostClassifier(base_estimator=clf, n_estimators=n_est)
scores = cross_val_score(ada_clf, X, y, cv=3, scoring='accuracy')
bg_clf_cv_mean.append(scores.mean())
bg_clf_cv_std.append(scores.std())
plt.figure()
(_, caps, _) = plt.errorbar(num_est, bg_clf_cv_mean, yerr=bg_clf_cv_std, c='blue', fmt='-o', capsize=5)
for cap in caps:
cap.set_markeredgewidth(1)
plt.ylabel('Accuracy'); plt.xlabel('Ensemble Size'); plt.title('AdaBoost Ensemble');
plt.show()
#plt.savefig('./figures/boosting_ensemble_size.png')
if __name__ == "__main__":
main()
The boosting ensemble is illustrated in figure 7.16. Each base learner consists of a decision tree with depth 1, thus classifying the data based on a feature threshold that partitions the space into two regions separated by a linear decision surface parallel to one of the axes. The figure also shows how the test accuracy improves with the size of the ensemble and the learning curves for training and testing data. Gradient tree boosting is a generalization of boosting to arbitrary differentiable loss functions. It can be used for both regression and classification problems. Gradient boosting builds the model in a sequential way.
At each stage, the decision tree h_m(x) is chosen to minimize a loss function L given the current model F_{m-1}(x), and the model is updated as F_m(x) = F_{m-1}(x) + h_m(x).
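Gradient tree boosting is not implemented in the chapter code above; a minimal sketch using scikit-learn's GradientBoostingClassifier on the same two iris features (hyperparameters are arbitrary):

```python
from sklearn import datasets
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

iris = datasets.load_iris()
X, y = iris.data[:, 0:2], iris.target

# each stage fits a shallow regression tree to the negative gradient of the loss of the
# current ensemble F_{m-1}(x), then adds it scaled by the learning rate (shrinkage)
gbc = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=1, random_state=0)
scores = cross_val_score(gbc, X, y, cv=3, scoring='accuracy')
print(f'Gradient boosting accuracy: {scores.mean():.2f} (+/- {scores.std():.2f})')
```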
Stacking
Stacking is an ensemble learning technique that combines multiple classification or regression models via a meta-classifier or meta-regressor. The base-level models are trained on the complete training set, and then the meta-model is trained on the outputs of the base-level models as features. The base level often consists of different learning algorithms; therefore, stacking ensembles are often heterogeneous.
import itertools
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from mlxtend.classifier import StackingClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from mlxtend.plotting import plot_learning_curves
from mlxtend.plotting import plot_decision_regions
def main():
iris = datasets.load_iris()
X, y = iris.data[:, 1:3], iris.target
clf1 = KNeighborsClassifier(n_neighbors=1)
clf2 = RandomForestClassifier(random_state=1)
clf3 = GaussianNB()
lr = LogisticRegression()
sclf = StackingClassifier(classifiers=[clf1, clf2, clf3],
meta_classifier=lr)
label = ['KNN', 'Random Forest', 'Naive Bayes', 'Stacking Classifier']
clf_list = [clf1, clf2, clf3, sclf]
fig = plt.figure(figsize=(10,8))
gs = gridspec.GridSpec(2, 2)
grid = itertools.product([0,1],repeat=2)
clf_cv_mean = []
clf_cv_std = []
for clf, label, grd in zip(clf_list, label, grid):
scores = cross_val_score(clf, X, y, cv=3, scoring='accuracy')
print("Accuracy: %.2f (+/- %.2f) [%s]" %(scores.mean(), scores.std(), label))
clf_cv_mean.append(scores.mean())
clf_cv_std.append(scores.std())
clf.fit(X, y)
ax = plt.subplot(gs[grd[0], grd[1]])
fig = plot_decision_regions(X=X, y=y, clf=clf)
plt.title(label)
plt.show()
#plot classifier accuracy
plt.figure()
(_, caps, _) = plt.errorbar(range(4), clf_cv_mean, yerr=clf_cv_std, c='blue', fmt='-o', capsize=5)
for cap in caps:
cap.set_markeredgewidth(1)
plt.xticks(range(4), ['KNN', 'RF', 'NB', 'Stacking'], rotation='vertical')
plt.ylabel('Accuracy'); plt.xlabel('Classifier'); plt.title('Stacking Ensemble');
plt.show()
#plt.savefig('./figures/stacking_ensemble_size.png')
#plot learning curves
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
plt.figure()
plot_learning_curves(X_train, y_train, X_test, y_test, sclf, print_model=False, style='ggplot')
plt.show()
#plt.savefig('./figures/stacking_ensemble_learning_curve.png')
if __name__ == "__main__":
main()
The stacking example above consists of KNN, Random Forest, and naive Bayes base classifiers whose predictions are combined by logistic regression as a meta-classifier. We can see the blending of decision boundaries achieved by the stacking classifier. The figure also shows that stacking achieves higher accuracy than the individual classifiers and, based on the learning curves, shows no signs of overfitting.
That is all for today!
See you tomorrow :)