Hello :) Today is Day 228!
A quick summary of today:
- read a bit more of the ML Algorithms in depth book
- made promotional material and a LinkedIn page for my potential society - Not Google Devs
As a fan of the ‘basics’ and testing my knowledge, I decided to continue with the ML Algs in Depth book today.
Machine Learning Algorithms in Depth Chapter 4: Software implementation
Why the chapter name? The book provides implementations for some of the algorithms explained below.
1. Data Structures: Storing and Organizing Data
A data structure is a systematic way of organizing and storing data so that it can be accessed and modified efficiently. Different data structures are suited for different types of applications, and they can be categorized into three main types:
- Linear Data Structures: These store data in a linear or sequential order. Examples include arrays, linked lists, stacks, and queues. In these structures, elements are arranged one after another in memory.
- Nonlinear Data Structures: These store data in a hierarchical or interconnected way rather than in a straight line. Examples include trees and graphs. Nonlinear structures are used to represent complex relationships among data.
- Probabilistic Data Structures: These use randomization or probabilistic techniques to store and retrieve data efficiently, often with a trade-off between accuracy and performance. Examples include Bloom filters and skip lists (see the sketch below).
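To make the probabilistic category concrete, here is a minimal Bloom filter sketch (my own toy example, not from the book; the bit-array size and number of hash functions are arbitrary choices):

import hashlib

class BloomFilter:
    """Toy Bloom filter: fast membership tests with a small false-positive rate."""
    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [0] * size

    def _positions(self, item):
        # derive num_hashes bit positions from salted md5 digests
        for salt in range(self.num_hashes):
            digest = hashlib.md5(f"{salt}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # False means definitely absent; True means probably present
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("gradient descent")
print("gradient descent" in bf)      # True
print("simulated annealing" in bf)   # almost certainly False

A lookup can return a false positive but never a false negative, which is exactly the accuracy/performance trade-off mentioned above.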
2. Algorithmic Paradigms
An algorithmic paradigm is a general approach or strategy to solve a class of problems. The chapter discusses four key paradigms:
- Complete Search: Also known as brute force, this method explores the entire search space to find a solution. It’s exhaustive and guarantees finding the correct solution if one exists. Pruning techniques can be applied to cut off parts of the search space that don’t need to be explored, improving efficiency.
- Greedy Algorithm: This approach makes a series of choices, each of which looks best at the moment (locally optimum), with the hope that these choices lead to a globally optimum solution. However, greedy algorithms don’t always produce the best solution for every problem.
- Divide and Conquer: This technique divides a problem into smaller, independent subproblems, solves each subproblem, and then combines their solutions to solve the original problem. Classic examples include merge sort and quicksort algorithms.
- Dynamic Programming (DP): This approach is similar to divide and conquer but is used when subproblems overlap. DP solves each subproblem once, stores the solution in a table, and reuses it when needed. This avoids redundant calculations and optimizes performance.
3. Complete Search (Brute Force)
Complete search or brute force involves exploring every possible option in the search space to find the correct solution. For example, if you were trying to crack a password, a brute force approach would involve trying every possible combination until the correct one is found. Pruning refers to eliminating parts of the search space that cannot possibly contain the solution, which improves the efficiency of the search.
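As a toy illustration (mine, not the book's), here is a brute-force subset-sum search where pruning skips any item that is already larger than the remaining target:

def subset_sum_exists(nums, target):
    """Return True if some subset of nums (non-negative ints) sums to target."""
    nums = sorted(nums, reverse=True)  # try large items first so pruning kicks in early

    def search(i, remaining):
        if remaining == 0:
            return True
        if i == len(nums):
            return False
        # pruning: never explore a branch that takes an item bigger than what is still needed
        if nums[i] <= remaining and search(i + 1, remaining - nums[i]):
            return True
        return search(i + 1, remaining)  # branch that excludes nums[i]

    return search(0, target)

print(subset_sum_exists([3, 34, 4, 12, 5, 2], 9))   # True (4 + 5)
print(subset_sum_exists([3, 34, 4, 12, 5, 2], 31))  # False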
4. Greedy Algorithm
A greedy algorithm makes decisions step-by-step, each time choosing the option that looks the best at that moment. The hope is that by making a series of locally optimal choices, you will arrive at a globally optimal solution. However, greedy algorithms do not always guarantee the best solution, and they work best for problems where a locally optimal choice leads to a globally optimal solution.
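A classic case where the greedy choice is provably optimal is interval scheduling: always pick the activity that finishes earliest. A minimal sketch (mine, not the book's):

def max_non_overlapping(intervals):
    """Greedy interval scheduling: repeatedly pick the interval that finishes earliest."""
    selected = []
    last_end = float("-inf")
    for start, end in sorted(intervals, key=lambda iv: iv[1]):  # sort by finish time
        if start >= last_end:  # compatible with everything chosen so far
            selected.append((start, end))
            last_end = end
    return selected

print(max_non_overlapping([(1, 4), (3, 5), (0, 6), (5, 7), (3, 9), (8, 10)]))
# [(1, 4), (5, 7), (8, 10)]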
5. Divide and Conquer
Divide and conquer is a strategy that breaks down a problem into smaller, more manageable subproblems that are easier to solve. Each subproblem is solved independently, and then their solutions are combined to form a solution to the original problem. This method is effective for problems like sorting (e.g., merge sort) or finding the closest pair of points in a set.
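Merge sort, mentioned above, is the textbook example; here is a minimal sketch of the divide, conquer, and combine steps:

def merge_sort(arr):
    """Divide: split in half; conquer: sort each half; combine: merge the sorted halves."""
    if len(arr) <= 1:
        return arr
    mid = len(arr) // 2
    left, right = merge_sort(arr[:mid]), merge_sort(arr[mid:])

    # merge the two sorted halves
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    return merged + left[i:] + right[j:]

print(merge_sort([5, 2, 9, 1, 7, 3]))  # [1, 2, 3, 5, 7, 9]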
6. Dynamic Programming (DP)
Dynamic programming is a method used when a problem can be broken down into overlapping subproblems. DP solves each subproblem once, stores the results in a table (often called a DP table), and reuses these results to construct the solution to the original problem. This approach is especially useful for optimization problems, like finding the shortest path in a graph.
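The overlapping-subproblem idea is easiest to see with Fibonacci numbers: naive recursion recomputes the same values exponentially many times, while DP fills a table once. A minimal sketch (mine, not the book's):

def fib_dp(n):
    """Bottom-up DP: each subproblem fib(i) is solved once and stored in a table."""
    if n < 2:
        return n
    table = [0] * (n + 1)  # the "DP table"
    table[1] = 1
    for i in range(2, n + 1):
        table[i] = table[i - 1] + table[i - 2]  # reuse stored subproblem solutions
    return table[n]

print(fib_dp(50))  # 12586269025, computed in O(n) instead of O(2^n)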
7. Mastery Through Competitive Programming
Mastering algorithms and software implementation often requires extensive practice, which can be effectively achieved through competitive programming. This involves solving algorithmic problems under time constraints, which sharpens problem-solving skills and deepens understanding of various algorithms and data structures. Appendix A provides resources for getting started with competitive programming.
8. Machine Learning Mastery
To achieve mastery in machine learning, it’s crucial to have a strong grasp of the underlying mathematical and algorithmic principles. This includes knowledge of linear algebra, probability, statistics, and optimization. Appendix A suggests texts that can help deepen your understanding and broaden your expertise in these fundamental areas.
9. Staying Current in Machine Learning
The field of machine learning is evolving rapidly, with new techniques and algorithms being developed continuously. To stay up-to-date, it’s important to regularly read the latest research papers presented at major conferences like NeurIPS, ICML, and CVPR. These papers provide insights into the cutting-edge developments in the field.
Machine Learning Algorithms in Depth Chapter 5: Classification
This chapter (and I suspect all the following ones will be similar) is pretty cool because it implements each algorithm from scratch.
- The goal of a classification algorithm is to learn a mapping from inputs x to outputs y, where y is a discrete quantity.
Perceptron is a classification algorithm that updates the decision boundary until there are no more classification mistakes.
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

class perceptron:
    def __init__(self, num_epochs, dim):
        self.num_epochs = num_epochs
        self.theta0 = 0
        self.theta = np.zeros(dim)

    def fit(self, X_train, y_train):
        n = X_train.shape[0]
        dim = X_train.shape[1]

        k = 1
        for epoch in range(self.num_epochs):
            for i in range(n):
                # sample a random training point (randint's upper bound is exclusive)
                idx = randint.rvs(0, n, size=1)[0]
                # update only when the point is misclassified (hinge-loss condition)
                if (y_train[idx] * (np.dot(self.theta, X_train[idx,:]) + self.theta0) <= 0):
                    # decaying learning rate
                    eta = pow(k+1, -1)
                    k += 1
                    #print("eta: ", eta)
                    # update theta
                    self.theta = self.theta + eta * y_train[idx] * X_train[idx, :]
                    self.theta0 = self.theta0 + eta * y_train[idx]
                #end if
                print("epoch: ", epoch)
                print("theta: ", self.theta)
                print("theta0: ", self.theta0)
            #end for
        #end for

    def predict(self, X_test):
        n = X_test.shape[0]
        y_pred = np.zeros(n)
        for idx in range(n):
            y_pred[idx] = np.sign(np.dot(self.theta, X_test[idx,:]) + self.theta0)
        #end for
        return y_pred

if __name__ == "__main__":
    # load dataset (first two iris classes give a binary problem)
    iris = load_iris()
    X = iris.data[:100,:]
    y = 2*iris.target[:100] - 1  # map to {+1,-1} labels

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # perceptron (binary) classifier
    clf = perceptron(num_epochs=5, dim=X.shape[1])
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)

    cmt = confusion_matrix(y_test, y_pred)
    acc = np.trace(cmt)/np.sum(np.sum(cmt))
    print("perceptron accuracy: ", acc)

    # generate plots
    plt.figure()
    sns.heatmap(cmt, annot=True, fmt="d")
    plt.title("Confusion Matrix"); plt.xlabel("predicted"); plt.ylabel("actual")
    #plt.savefig("./figures/perceptron_acc.png")
    plt.show()
Output:
SVM is a max-margin classifier. The training data points that lie on the margin boundaries become support vectors.
import cvxopt
import numpy as np

from sklearn.svm import SVC  # for comparison only
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def rbf_kernel(gamma, **kwargs):
    def f(x1, x2):
        distance = np.linalg.norm(x1 - x2) ** 2
        return np.exp(-gamma * distance)
    return f

class SupportVectorMachine(object):
    def __init__(self, C=1, kernel=rbf_kernel, power=4, gamma=None, coef=4):
        self.C = C
        self.kernel = kernel
        self.power = power
        self.gamma = gamma
        self.coef = coef
        self.lagr_multipliers = None
        self.support_vectors = None
        self.support_vector_labels = None
        self.intercept = None

    def fit(self, X, y):
        n_samples, n_features = np.shape(X)

        # Set gamma to 1/n_features by default
        if not self.gamma:
            self.gamma = 1 / n_features

        # Initialize kernel method with parameters
        self.kernel = self.kernel(
            power=self.power,
            gamma=self.gamma,
            coef=self.coef)

        # Calculate kernel matrix
        kernel_matrix = np.zeros((n_samples, n_samples))
        for i in range(n_samples):
            for j in range(n_samples):
                kernel_matrix[i, j] = self.kernel(X[i], X[j])

        # Define the quadratic optimization problem
        P = cvxopt.matrix(np.outer(y, y) * kernel_matrix, tc='d')
        q = cvxopt.matrix(np.ones(n_samples) * -1)
        A = cvxopt.matrix(y, (1, n_samples), tc='d')
        b = cvxopt.matrix(0, tc='d')

        if not self.C:  # if C is empty (hard-margin SVM)
            G = cvxopt.matrix(np.identity(n_samples) * -1)
            h = cvxopt.matrix(np.zeros(n_samples))
        else:  # soft-margin SVM: 0 <= alpha_i <= C
            G_max = np.identity(n_samples) * -1
            G_min = np.identity(n_samples)
            G = cvxopt.matrix(np.vstack((G_max, G_min)))
            h_max = cvxopt.matrix(np.zeros(n_samples))
            h_min = cvxopt.matrix(np.ones(n_samples) * self.C)
            h = cvxopt.matrix(np.vstack((h_max, h_min)))

        # Solve the quadratic optimization problem using cvxopt
        minimization = cvxopt.solvers.qp(P, q, G, h, A, b)

        # Lagrange multipliers
        lagr_mult = np.ravel(minimization['x'])

        # Extract support vectors
        # Get indexes of non-zero lagr. multipliers
        idx = lagr_mult > 1e-11
        # Get the corresponding lagr. multipliers
        self.lagr_multipliers = lagr_mult[idx]
        # Get the samples that will act as support vectors
        self.support_vectors = X[idx]
        # Get the corresponding labels
        self.support_vector_labels = y[idx]

        # Calculate intercept with first support vector
        self.intercept = self.support_vector_labels[0]
        for i in range(len(self.lagr_multipliers)):
            self.intercept -= self.lagr_multipliers[i] * self.support_vector_labels[
                i] * self.kernel(self.support_vectors[i], self.support_vectors[0])

    def predict(self, X):
        y_pred = []
        # Iterate through list of samples and make predictions
        for sample in X:
            prediction = 0
            # Determine the label of the sample by the support vectors
            for i in range(len(self.lagr_multipliers)):
                prediction += self.lagr_multipliers[i] * self.support_vector_labels[
                    i] * self.kernel(self.support_vectors[i], sample)
            prediction += self.intercept
            y_pred.append(np.sign(prediction))
        return np.array(y_pred)

def main():
    # load dataset (first two iris classes give a binary problem)
    iris = load_iris()
    X = iris.data[:100,:]
    y = 2*iris.target[:100] - 1  # map to {+1,-1} labels

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)

    clf = SupportVectorMachine(kernel=rbf_kernel, gamma=1)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print("Accuracy (scratch):", accuracy)

    clf_sklearn = SVC(gamma='auto')
    clf_sklearn.fit(X_train, y_train)
    y_pred2 = clf_sklearn.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred2)
    print("Accuracy (sklearn):", accuracy)

if __name__ == "__main__":
    main()
Comparing from scratch and out-of-the-box SVM:
Logistic regression is a classification algorithm that models the class posterior probability using the sigmoid function (the softmax in the multiclass case).
import numpy as np
import matplotlib.pyplot as plt

def generate_data():
    n = 1000
    mu1 = np.array([1,1])
    mu2 = np.array([-1,-1])
    pik = np.array([0.4,0.6])

    X = np.zeros((n,2))
    y = np.zeros((n,1))

    for i in range(n):
        u = np.random.rand()
        idx = np.where(u < np.cumsum(pik))[0]
        if (len(idx)==1):
            X[i,:] = np.random.randn(1,2) + mu1
            y[i] = 1
        else:
            X[i,:] = np.random.randn(1,2) + mu2
            y[i] = -1
    return X, y

class sgdlr:
    def __init__(self):
        self.num_iter = 100
        self.lmbda = 1e-9

        self.tau0 = 10
        self.kappa = 1
        self.eta = np.zeros(self.num_iter)

        self.batch_size = 200

    def fit(self, X, y):
        # random init
        theta = np.random.randn(X.shape[1],1)

        # learning rate schedule
        for i in range(self.num_iter):
            self.eta[i] = (self.tau0+i)**(-self.kappa)

        # divide data in batches
        batch_data, batch_labels = self.make_batches(X,y,self.batch_size)
        num_batches = batch_data.shape[0]
        num_updates = 0

        J_hist = np.zeros((self.num_iter * num_batches,1))
        t_hist = np.zeros((self.num_iter * num_batches,1))

        for itr in range(self.num_iter):
            for b in range(num_batches):
                Xb = batch_data[b]
                yb = batch_labels[b]

                J_cost, J_grad = self.lr_objective(theta, Xb, yb, self.lmbda)
                theta = theta - self.eta[itr]*(num_batches*J_grad)

                J_hist[num_updates] = J_cost
                t_hist[num_updates] = np.linalg.norm(theta,2)
                num_updates = num_updates + 1
            print("iteration %d, cost: %f" %(itr, J_cost))

        y_pred = 2*(self.sigmoid(X.dot(theta)) > 0.5) - 1
        y_err = np.size(np.where(y_pred - y)[0])/float(y.shape[0])
        print("classification error:", y_err)

        self.generate_plots(X, J_hist, t_hist, theta)
        return theta

    def make_batches(self, X, y, batch_size):
        n = X.shape[0]
        d = X.shape[1]
        num_batches = int(np.ceil(n/batch_size))

        groups = np.tile(range(num_batches),batch_size)
        batch_data=np.zeros((num_batches,batch_size,d))
        batch_labels=np.zeros((num_batches,batch_size,1))

        for i in range(num_batches):
            batch_data[i,:,:] = X[groups==i,:]
            batch_labels[i,:] = y[groups==i]

        return batch_data, batch_labels

    def lr_objective(self, theta, X, y, lmbda):
        n = y.shape[0]
        y01 = (y+1)/2.0

        # compute the objective
        mu = self.sigmoid(X.dot(theta))

        # bound away from 0 and 1
        eps = np.finfo(float).eps
        mu = np.maximum(mu,eps)
        mu = np.minimum(mu,1-eps)

        # compute cost
        cost = -(1/n)*np.sum(y01*np.log(mu)+(1-y01)*np.log(1-mu))+np.sum(lmbda*theta*theta)

        # compute the gradient of the lr objective
        grad = X.T.dot(mu-y01) + 2*lmbda*theta

        # compute the Hessian of the lr objective
        #H = X.T.dot(np.diag(np.diag( mu*(1-mu) ))).dot(X) + 2*lmbda*np.eye(np.size(theta))

        return cost, grad

    def sigmoid(self, a):
        return 1/(1+np.exp(-a))

    def generate_plots(self, X, J_hist, t_hist, theta):
        plt.figure()
        plt.plot(J_hist)
        plt.title("logistic regression")
        plt.xlabel('iterations')
        plt.ylabel('cost')
        #plt.savefig('./figures/lrsgd_loss.png')
        plt.show()

        plt.figure()
        plt.plot(t_hist)
        plt.title("logistic regression")
        plt.xlabel('iterations')
        plt.ylabel('theta l2 norm')
        #plt.savefig('./figures/lrsgd_theta_norm.png')
        plt.show()

        plt.figure()
        plt.plot(self.eta)
        plt.title("logistic regression")
        plt.xlabel('iterations')
        plt.ylabel('learning rate')
        #plt.savefig('./figures/lrsgd_learning_rate.png')
        plt.show()

        plt.figure()
        x1 = np.linspace(np.min(X[:,0])-1,np.max(X[:,0])+1,10)
        plt.scatter(X[:,0], X[:,1])
        plt.plot(x1, -(theta[0]/theta[1])*x1)
        plt.title('logistic regression')
        plt.grid(True)
        plt.xlabel('X1')
        plt.ylabel('X2')
        #plt.savefig('./figures/lrsgd_clf.png')
        plt.show()

if __name__ == "__main__":
    X, y = generate_data()
    sgd = sgdlr()
    theta = sgd.fit(X,y)
The naive Bayes algorithm assumes features are conditionally independent, given the class label. It is commonly used in document classification.
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from time import time
import nltk
nltk.download('stopwords', quiet=True)  # ensure the stopword corpus is available before use
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer

from sklearn.metrics import accuracy_score
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

sns.set_style("whitegrid")

tokenizer = RegexpTokenizer(r'\w+')
stop_words = set(stopwords.words('english'))
stop_words.update(['s','t','m','1','2'])

class naive_bayes:
    def __init__(self, K, D):
        self.K = K  # number of classes
        self.D = D  # dictionary size
        self.pi = np.ones(K)  # class priors
        self.theta = np.ones((self.D, self.K))  # bernoulli parameters

    def fit(self, X_train, y_train):
        num_docs = X_train.shape[0]
        for doc in range(num_docs):
            label = y_train[doc]
            self.pi[label] += 1
            for word in range(self.D):
                if (X_train[doc][word] > 0):
                    self.theta[word][label] += 1
                #end if
            #end for
        #end for

        # normalize pi and theta
        self.pi = self.pi/np.sum(self.pi)
        self.theta = self.theta/np.sum(self.theta, axis=0)

    def predict(self, X_test):
        num_docs = X_test.shape[0]
        logp = np.zeros((num_docs,self.K))
        for doc in range(num_docs):
            for kk in range(self.K):
                logp[doc][kk] = np.log(self.pi[kk])
                for word in range(self.D):
                    if (X_test[doc][word] > 0):
                        logp[doc][kk] += np.log(self.theta[word][kk])
                    else:
                        logp[doc][kk] += np.log(1-self.theta[word][kk])
                    #end if
                #end for
            #end for
        #end for
        return np.argmax(logp, axis=1)

if __name__ == "__main__":
    # load data
    print("loading 20 newsgroups dataset...")
    tic = time()
    classes = ['sci.space', 'comp.graphics', 'rec.autos', 'rec.sport.hockey']
    dataset = fetch_20newsgroups(shuffle=True, random_state=0, remove=('headers','footers','quotes'), categories=classes)
    X_train, X_test, y_train, y_test = train_test_split(dataset.data, dataset.target, test_size=0.5, random_state=0)
    toc = time()
    print("elapsed time: %.4f sec" %(toc - tic))
    print("number of training docs: ", len(X_train))
    print("number of test docs: ", len(X_test))

    print("vectorizing input data...")
    cnt_vec = CountVectorizer(tokenizer=tokenizer.tokenize, analyzer='word', ngram_range=(1,1), max_df=0.8, min_df=2, max_features=1000, stop_words=stop_words)
    cnt_vec.fit(X_train)
    toc = time()
    print("elapsed time: %.2f sec" %(toc - tic))
    vocab = cnt_vec.vocabulary_
    idx2word = {val: key for (key, val) in vocab.items()}
    print("vocab size: ", len(vocab))

    X_train_vec = cnt_vec.transform(X_train).toarray()
    X_test_vec = cnt_vec.transform(X_test).toarray()

    print("naive bayes model MLE inference...")
    K = len(set(y_train))  # number of classes
    D = len(vocab)  # dictionary size
    nb_clf = naive_bayes(K, D)
    nb_clf.fit(X_train_vec, y_train)

    print("naive bayes prediction...")
    y_pred = nb_clf.predict(X_test_vec)
    nb_clf_acc = accuracy_score(y_test, y_pred)
    print("test set accuracy: ", nb_clf_acc)
Test acc: ~82%
The CART decision tree is a greedy, recursive algorithm that finds the optimum feature splits by minimizing an objective function, such as the Gini index.
import numpy as np
import matplotlib.pyplot as plt

from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

class TreeNode():
    def __init__(self, gini, num_samples, num_samples_class, class_label):
        self.gini = gini  # gini cost
        self.num_samples = num_samples  # size of node
        self.num_samples_class = num_samples_class  # number of node pts with label k
        self.class_label = class_label  # predicted class label
        self.feature_index = 0  # idx of feature to split on
        self.threshold = 0  # best threshold to split on
        self.left = None  # left subtree pointer
        self.right = None  # right subtree pointer

class DecisionTreeClassifier():
    def __init__(self, max_depth=None):
        self.max_depth = max_depth

    def best_split(self, X_train, y_train):
        m = y_train.size
        if (m <= 1):
            return None, None

        # number of points of class k
        mk = [np.sum(y_train == k) for k in range(self.num_classes)]

        # gini of current node
        best_gini = 1.0 - sum((n / m) ** 2 for n in mk)
        best_idx, best_thr = None, None

        # iterate over all features
        for idx in range(self.num_features):
            # sort the node's data along the selected feature
            thresholds, classes = zip(*sorted(zip(X_train[:, idx], y_train)))

            num_left = [0]*self.num_classes
            num_right = mk.copy()

            # iterate over all possible split positions
            for i in range(1, m):
                k = classes[i-1]
                num_left[k] += 1
                num_right[k] -= 1

                gini_left = 1.0 - sum(
                    (num_left[x] / i) ** 2 for x in range(self.num_classes)
                )
                gini_right = 1.0 - sum(
                    (num_right[x] / (m - i)) ** 2 for x in range(self.num_classes)
                )
                gini = (i * gini_left + (m - i) * gini_right) / m

                # check that we don't try to split two pts with identical values
                if thresholds[i] == thresholds[i - 1]:
                    continue

                if (gini < best_gini):
                    best_gini = gini
                    best_idx = idx
                    best_thr = (thresholds[i] + thresholds[i - 1]) / 2  # midpoint
                #end if
            #end for
        #end for
        return best_idx, best_thr

    def gini(self, y_train):
        m = y_train.size
        return 1.0 - sum((np.sum(y_train == k) / m) ** 2 for k in range(self.num_classes))

    def fit(self, X_train, y_train):
        self.num_classes = len(set(y_train))
        self.num_features = X_train.shape[1]
        self.tree = self.grow_tree(X_train, y_train)

    def grow_tree(self, X_train, y_train, depth=0):
        num_samples_class = [np.sum(y_train == k) for k in range(self.num_classes)]
        class_label = np.argmax(num_samples_class)

        node = TreeNode(
            gini=self.gini(y_train),
            num_samples=y_train.size,
            num_samples_class=num_samples_class,
            class_label=class_label,
        )

        # split recursively until maximum depth is reached
        if depth < self.max_depth:
            idx, thr = self.best_split(X_train, y_train)
            if idx is not None:
                indices_left = X_train[:, idx] < thr
                X_left, y_left = X_train[indices_left], y_train[indices_left]
                X_right, y_right = X_train[~indices_left], y_train[~indices_left]
                node.feature_index = idx
                node.threshold = thr
                node.left = self.grow_tree(X_left, y_left, depth + 1)
                node.right = self.grow_tree(X_right, y_right, depth + 1)
        return node

    def predict(self, X_test):
        return [self.predict_helper(x_test) for x_test in X_test]

    def predict_helper(self, x_test):
        node = self.tree
        while node.left:
            if x_test[node.feature_index] < node.threshold:
                node = node.left
            else:
                node = node.right
        return node.class_label

if __name__ == "__main__":
    # load data
    iris = load_iris()
    X = iris.data[:, [2,3]]
    y = iris.target

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    print("decision tree classifier...")
    tree_clf = DecisionTreeClassifier(max_depth=3)
    tree_clf.fit(X_train, y_train)

    print("prediction...")
    y_pred = tree_clf.predict(X_test)
    tree_clf_acc = accuracy_score(y_test, y_pred)
    print("test set accuracy: ", tree_clf_acc)
Test acc: ~80%
Machine Learning Algorithms in Depth Chapter 6: Regression
Bayesian Linear Regression
- In Bayesian linear regression, defined by y(x) = w^T x + ε, we assume the noise term ε is a zero-mean Gaussian random variable
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.datasets import fetch_california_housing

class ridge_reg():
    def __init__(self, n_iter=20, learning_rate=1e-3, lmbda=0.1):
        self.n_iter = n_iter
        self.learning_rate = learning_rate
        self.lmbda = lmbda

    def fit(self, X, y):
        # insert const 1 for bias term
        X = np.insert(X, 0, 1, axis=1)
        self.loss = []
        self.w = np.random.rand(X.shape[1])

        for i in range(self.n_iter):
            y_pred = X.dot(self.w)
            mse = np.mean(0.5*(y - y_pred)**2 + 0.5*self.lmbda*self.w.T.dot(self.w))
            self.loss.append(mse)
            print(" %d iter, mse: %.4f" %(i, mse))

            # compute gradient of NLL(w) wrt w
            grad_w = - (y - y_pred).dot(X) + self.lmbda*self.w
            # update the weights
            self.w -= self.learning_rate * grad_w

    def predict(self, X):
        # insert const 1 for bias term
        X = np.insert(X, 0, 1, axis=1)
        y_pred = X.dot(self.w)
        return y_pred

if __name__ == "__main__":
    X, y = fetch_california_housing(return_X_y=True)
    X_reg = X[:,2].reshape(-1,1)  # avg number of rooms
    X_std = (X_reg - X_reg.mean())/X.std()  # standard scaling
    y_std = (y - y.mean())/y.std()  # standard scaling

    X_std = X_std[:200,:]
    y_std = y_std[:200]

    rr = ridge_reg()
    rr.fit(X_std, y_std)
    y_pred = rr.predict(X_std)
    print(rr.w)

    plt.figure()
    plt.plot(rr.loss)
    plt.xlabel('Epoch')
    plt.ylabel('Loss')
    plt.tight_layout()
    plt.show()

    plt.figure()
    plt.scatter(X_std, y_std)
    plt.plot(np.linspace(-1,1), rr.w[1]*np.linspace(-1,1)+rr.w[0], c='red')
    plt.xlim([-0.01,0.01])
    plt.xlabel("scaled avg num of rooms")
    plt.ylabel("scaled house price")
    plt.show()
- The Bayesian linear regression successfully captures the trend of increasing house prices with the average number of rooms
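The code above finds a regularized point estimate with gradient descent. For the fully Bayesian treatment, the posterior over the weights is available in closed form; here is a small sketch of my own (not from the book), assuming a zero-mean Gaussian prior with precision alpha and Gaussian observation noise with precision beta (both values here are hypothetical):

import numpy as np

def blr_posterior(X, y, alpha=1.0, beta=25.0):
    """Conjugate Bayesian linear regression: posterior mean and covariance of the weights."""
    Phi = np.column_stack([np.ones(len(X)), X])            # design matrix with a bias column
    S_N_inv = alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi
    S_N = np.linalg.inv(S_N_inv)                           # posterior covariance
    m_N = beta * S_N @ Phi.T @ y                           # posterior mean
    return m_N, S_N

# toy 1D data: y = 0.5x + noise with std 0.2 (so beta = 1/0.2^2 = 25)
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 20)
y = 0.5 * x + rng.normal(0, 0.2, 20)
m_N, S_N = blr_posterior(x, y)
print("posterior mean of [bias, slope]:", m_N)

The posterior mean m_N coincides with the ridge solution, while S_N quantifies the remaining uncertainty in the weights.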
Hierarchical Bayesian regression
- Hierarchical models enable the sharing of features among groups. A hierarchical model assumes there is a common distribution from which individual parameters are sampled and, therefore, captures similarities between groups
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import pymc3 as pm

def main():
    # load data
    data = pd.read_csv('./data/radon.txt')
    county_names = data.county.unique()
    county_idx = data['county_code'].values

    with pm.Model() as hierarchical_model:
        # Hyperpriors
        mu_a = pm.Normal('mu_alpha', mu=0., sd=100**2)
        sigma_a = pm.Uniform('sigma_alpha', lower=0, upper=100)
        mu_b = pm.Normal('mu_beta', mu=0., sd=100**2)
        sigma_b = pm.Uniform('sigma_beta', lower=0, upper=100)

        # Intercept for each county, distributed around group mean mu_a
        a = pm.Normal('alpha', mu=mu_a, sd=sigma_a, shape=len(data.county.unique()))
        # Slope for each county, distributed around group mean mu_b
        b = pm.Normal('beta', mu=mu_b, sd=sigma_b, shape=len(data.county.unique()))

        # Model error
        eps = pm.Uniform('eps', lower=0, upper=100)

        # Expected value
        radon_est = a[county_idx] + b[county_idx] * data.floor.values

        # Data likelihood
        y_like = pm.Normal('y_like', mu=radon_est, sd=eps, observed=data.log_radon)

    with hierarchical_model:
        # Use ADVI for initialization
        mu, sds, elbo = pm.variational.advi(n=100000)
        step = pm.NUTS(scaling=hierarchical_model.dict_to_array(sds)**2, is_cov=True)
        hierarchical_trace = pm.sample(5000, step, start=mu)

    pm.traceplot(hierarchical_trace[500:])
    plt.show()

if __name__ == "__main__":
    main()
KNN regression
- K-nearest neighbors (KNN) regression is a nonparametric model in which, for a given query data point q, we find its K nearest neighbors in the training set and compute the average of their response variable y
import numpy as np
import matplotlib.pyplot as plt

from sklearn import datasets
from sklearn.model_selection import train_test_split

np.random.seed(42)

class KNN():
    def __init__(self, K):
        self.K = K

    def euclidean_distance(self, x1, x2):
        dist = 0
        for i in range(len(x1)):
            dist += np.power((x1[i] - x2[i]), 2)
        return np.sqrt(dist)

    def knn_search(self, X_train, y_train, Q):
        y_pred = np.empty(Q.shape[0])
        for i, query in enumerate(Q):
            # get K nearest neighbors to query point
            idx = np.argsort([self.euclidean_distance(query, x) for x in X_train])[:self.K]
            # extract the labels of the K nearest training points
            knn_labels = np.array([y_train[i] for i in idx])
            # label query sample as the average of knn_labels
            y_pred[i] = np.mean(knn_labels)
        return y_pred

if __name__ == "__main__":
    plt.close('all')

    # iris dataset
    iris = datasets.load_iris()
    X = iris.data[:,:2]
    y = iris.target

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    K = 4
    knn = KNN(K)
    y_pred = knn.knn_search(X_train, y_train, X_test)

    plt.figure(1)
    plt.scatter(X_train[:,0], X_train[:,1], s=100, marker='x', color='r', label='data')
    plt.scatter(X_test[:,0], X_test[:,1], s=100, marker='o', color='b', label='query')
    plt.title('K Nearest Neighbors (K=%d)'% K)
    plt.legend()
    plt.xlabel('X1')
    plt.ylabel('X2')
    plt.grid(True)
    plt.show()
Finding the exact nearest neighbors in high-dimensional space is often computationally intractable, so approximate methods are used instead. There are two classes of approximate methods: those that partition the space into regions, such as the k-d tree implemented in the Fast Library for Approximate Nearest Neighbors (FLANN), and hashing-based methods, such as locality-sensitive hashing (LSH).
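As a quick illustration of the space-partitioning family (my own sketch, not from the book), scipy's cKDTree can replace the brute-force neighbor search above; its eps parameter trades exactness for speed:

import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 2))
y_train = X_train[:, 0] + 0.1 * rng.normal(size=1000)  # toy response variable
X_query = rng.normal(size=(5, 2))

tree = cKDTree(X_train)
# eps > 0 allows approximate neighbors in exchange for a faster query
dist, idx = tree.query(X_query, k=4, eps=0.1)
y_pred = y_train[idx].mean(axis=1)  # KNN regression: average the neighbors' responses
print(y_pred)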
Gaussian process regression
- Gaussian processes (GPs) define a prior over functions that can be updated to a posterior once we have observed data. A GP assumes the function is defined at a finite and arbitrarily chosen set of points x_1, …, x_n, such that p(f(x_1), …, f(x_n)) is jointly Gaussian with mean μ(x) and covariance Σ(x), where Σ_ij = κ(x_i, x_j) and κ is a positive definite kernel function.
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import cdist

np.random.seed(42)

class GPreg:
    def __init__(self, X_train, y_train, X_test):
        self.L = 1.0
        self.keps = 1e-8

        self.muFn = self.mean_func(X_test)
        self.Kfn = self.kernel_func(X_test, X_test) + 1e-15*np.eye(np.size(X_test))

        self.X_train = X_train
        self.y_train = y_train
        self.X_test = X_test

    def mean_func(self, x):
        muFn = np.zeros(len(x)).reshape(-1,1)
        return muFn

    def kernel_func(self, x, z):
        sq_dist = cdist(x/self.L, z/self.L, 'euclidean')**2
        Kfn = 1.0 * np.exp(-sq_dist/2)
        return Kfn

    def compute_posterior(self):
        K = self.kernel_func(self.X_train, self.X_train)  # K
        Ks = self.kernel_func(self.X_train, self.X_test)  # K_*
        Kss = self.kernel_func(self.X_test, self.X_test) + self.keps*np.eye(np.size(self.X_test))  # K_**
        Ki = np.linalg.inv(K)  # O(Ntrain^3)

        postMu = self.mean_func(self.X_test) + np.dot(np.transpose(Ks), np.dot(Ki, (self.y_train - self.mean_func(self.X_train))))
        postCov = Kss - np.dot(np.transpose(Ks), np.dot(Ki, Ks))

        self.muFn = postMu
        self.Kfn = postCov

        return None

    def generate_plots(self, X, num_samples=3):
        plt.figure()
        for i in range(num_samples):
            fs = self.gauss_sample(1)
            plt.plot(X, fs, '-k')
            #plt.plot(self.X_train, self.y_train, 'xk')

        mu = self.muFn.ravel()
        S2 = np.diag(self.Kfn)
        plt.fill(np.concatenate([X, X[::-1]]), np.concatenate([mu - 2*np.sqrt(S2), (mu + 2*np.sqrt(S2))[::-1]]), alpha=0.2, fc='b')
        plt.show()

    def gauss_sample(self, n):
        # returns n samples from a multivariate Gaussian distribution
        # S = AZ + mu
        A = np.linalg.cholesky(self.Kfn)
        Z = np.random.normal(loc=0, scale=1, size=(len(self.muFn),n))
        S = np.dot(A,Z) + self.muFn
        return S

def main():
    # generate noise-less training data
    X_train = np.array([-4, -3, -2, -1, 1])
    X_train = X_train.reshape(-1,1)
    y_train = np.sin(X_train)

    # generate test data
    X_test = np.linspace(-5, 5, 50)
    X_test = X_test.reshape(-1,1)

    gp = GPreg(X_train, y_train, X_test)
    gp.generate_plots(X_test,3)  # samples from GP prior
    gp.compute_posterior()
    gp.generate_plots(X_test,3)  # samples from GP posterior

if __name__ == "__main__":
    main()
The above shows three functions drawn at random from a GP prior (first) and GP posterior (second) after observing five data points in the case of noise-free observations. The shaded area corresponds to two times the standard deviation around the mean (95% confidence region). We can see that the model perfectly interpolates the training data and that the predictive uncertainty increases as we move further away from the observed data.
Not Google Devs Society
Yep. I will try to start my own society related to all things data.
The name?
I know there are official Google Developer Student Clubs; however, the deadline for this year has passed, and applications will only open again for the 2025-26 academic year. By that time I will have graduated.
So the name describes the society members exactly: Not Google Devs. I even ‘designed’ a logo.
I also created flyers that I will put up around the uni. Below is the translated content.
2024-2nd semester looking for:
- Vice President
- Event planner
- General members
What?
- We are the Not Google Devs society
- Learn and practice Data Engineering, Data Science, AI, Cloud, and more
- Attend bi-weekly sessions
- Work on projects alone and with others
Who is this for:
- someone who loves new challenges
- someone who loves to code
- someone who can speak English or is studying English
Currently planned schedule
- MLOps
- Data engineering
- What is HuggingFace
- RAG apps
- Hackathon
- Collab with Uni of Essex’s CS society
Then some words from myself and contact info.
Why do I want to make a society?
Being part of a society was one of the most memorable experiences of my undergraduate years. I had fun, but I also got to learn soft skills that I could later add to my CV. With this society, I not only want to share knowledge from my work and learning experience, but also create an atmosphere of collaboration and learning. I hope that whoever joins the society can learn a lot, create projects, work in a team, and later add these experiences to their CV - just like I did. As for the meetings, I think online or in-person would both be fine (depending on the number of people). The idea is to have presentation sessions where a member or I present a project, a new tool, or something interesting that people want to share or might find beneficial.
I am also looking for help - a VP and an Event Planner, so hopefully I get some volunteers ^^
I also made a LinkedIn page for the society so that people can add it on LinkedIn. I need to print out the promotional material and hang it around the school in the coming days. I also created a GitHub repo that we can potentially use. For now, it just has the promotional info.
For completeness, here is the poster I have at the moment:
That is all for today!
See you tomorrow :)