Hello :) Today is Day 237!
A quick summary of today:
- reading chapter 9 from the Machine Learning Algorithms in Depth book
- re-watching previous years’ KB Future Finance AI competitions
Latent Dirichlet Allocation (LDA)
Latent Dirichlet Allocation (LDA) is a topic model used to analyze discrete data, such as text documents. It represents each document as a mixture of topics, where each topic is a distribution over words. The goal of LDA is to learn shared topic distributions and topic proportions for each document.
Bag of Words Model
LDA assumes a bag of words model, meaning that words are treated as exchangeable, and sentence structure is ignored. Each document is represented as a vector of word counts, and the entire corpus is summarized in a term-document matrix.
Matrix Factorization
LDA can be viewed as a nonnegative matrix factorization problem. The term-document matrix is factorized into a product of topics and topic proportions:
- A = WH
- W: Topics (distribution over words)
- H: Topic proportions
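As a rough sketch of this factorization view (LDA itself is fit via the inference described below), scikit-learn's NMF can factor a tiny made-up corpus; note that scikit-learn factorizes the document-term matrix, so W and H come out transposed relative to the term-document convention above. The corpus and number of topics are assumptions for illustration only.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import NMF

    # toy corpus (made up for illustration)
    docs = [
        "stocks bonds portfolio returns",
        "genes rna sequencing reads",
        "portfolio risk returns volatility",
        "rna reads coverage genome",
    ]

    # document-term count matrix A (documents x words)
    vectorizer = CountVectorizer()
    A = vectorizer.fit_transform(docs)

    # factorize A ~ WH with K=2 topics
    nmf = NMF(n_components=2, init="nndsvda", random_state=0)
    W = nmf.fit_transform(A)   # per-document topic proportions (documents x topics)
    H = nmf.components_        # topic distributions over words (topics x words)

    print(W.round(2))
    print(H.round(2))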
TF-IDF and Smoothing
Term frequency–inverse document frequency (tf-idf) is often used to adjust word counts, reducing the influence of words that frequently occur across documents. This method helps in identifying discriminative words and improving model performance. For n-grams, different smoothing techniques like Laplace smoothing are used to handle sparse data.
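For instance, the reweighting can be sketched with scikit-learn's TfidfVectorizer on a few made-up sentences; words that occur in most documents (such as "the") receive a low idf weight.

    from sklearn.feature_extraction.text import TfidfVectorizer

    # toy sentences (made up for illustration)
    docs = [
        "the market rose and the market fell",
        "the genome has many reads",
        "the market and the genome",
    ]

    tfidf = TfidfVectorizer()
    X = tfidf.fit_transform(docs)  # tf-idf weighted document-term matrix

    # inspect the learned idf weights per vocabulary word
    for word, idx in sorted(tfidf.vocabulary_.items()):
        print(f"{word:>8s}  idf={tfidf.idf_[idx]:.2f}")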
Generative Model
In LDA, each word in a document is associated with a topic label, and each document is associated with topic proportions (θd). Topics (βk) are shared across all documents, and hyperparameters α and η capture prior knowledge about topic proportions and topics.
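As a sketch of that generative story (not the book's code), a tiny synthetic corpus can be sampled with numpy; the corpus sizes and the values of α and η below are arbitrary choices for illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    V, K, D, Nd = 50, 3, 4, 20   # vocabulary size, topics, documents, words per doc (assumed)
    alpha, eta = 0.1, 0.01       # Dirichlet hyperparameters (assumed)

    beta = rng.dirichlet(eta * np.ones(V), size=K)     # topics beta_k: distributions over words
    for d in range(D):
        theta_d = rng.dirichlet(alpha * np.ones(K))    # topic proportions for document d
        z = rng.choice(K, size=Nd, p=theta_d)          # topic label for each word
        words = [rng.choice(V, p=beta[k]) for k in z]  # each word drawn from its topic
        print(f"doc {d}: topic counts {np.bincount(z, minlength=K)}, first words {words[:5]}")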
Inference and Variational Bayes
The key inference problem in LDA is computing the posterior distribution of latent variables for a given document. This can be approximated using a variational distribution that is optimized to maximize the evidence lower bound (ELBO). The variational parameter updates can be done in an online setting, allowing for topic analysis on large datasets or streaming data.
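As a practical sketch, scikit-learn's LatentDirichletAllocation exposes an online variational Bayes scheme through partial_fit; the synthetic mini-batches and parameter values below are assumptions for illustration.

    import numpy as np
    from sklearn.decomposition import LatentDirichletAllocation

    rng = np.random.default_rng(0)
    # synthetic word-count matrices standing in for mini-batches of documents
    batches = [rng.poisson(1.0, size=(64, 200)) for _ in range(5)]

    lda = LatentDirichletAllocation(n_components=10, learning_method="online",
                                    learning_offset=10.0, random_state=0)
    for X_batch in batches:
        lda.partial_fit(X_batch)           # online update of the variational parameters

    topic_word = lda.components_           # unnormalized topic-word distributions
    doc_topic = lda.transform(batches[0])  # inferred topic proportions for one batch
    print(topic_word.shape, doc_topic.shape)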
Density estimators
The goal of density estimation is to model the probability density of data.
Kernel density estimator
An alternative approach to a K-component mixture model is the kernel density estimator (KDE), which allocates one cluster center per data point. KDEs are an application of kernel smoothing to probability density estimation, using kernels as weights.
The advantage of KDEs over parametric models, such as density mixtures, is that no model fitting is required (apart from fine-tuning the bandwidth parameter h) and there is no need to pick the number of mixture components K. The disadvantage is that the model takes a lot of memory to store and time to evaluate. In other words, KDE is suitable when an accurate density estimate is required for a relatively small dataset (small number of points N).
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(14)

class KDE():
    def __init__(self):
        #Histogram and Gaussian Kernel Estimator used to
        #analyze RNA-seq data for flux estimation of a T7 promoter
        self.G = 1e9    #length of genome in base pairs (bp)
        self.C = 1e3    #number of unique molecules
        self.L = 100    #length of a read, bp
        self.N = 1e6    #number of reads, L bp long
        self.M = 1e4    #number of unique read sequences, bp
        self.LN = 1000  #total length of assembled / mapped RNA-seq reads
        self.FDR = 0.05 #false discovery rate

        #uniform sampling (Poisson model)
        self.lmbda = (self.N * self.L) / self.G              #expected number of bases covered
        self.C_est = self.M / (1 - np.exp(-self.lmbda))      #library size estimate
        self.C_cvrg = self.G - self.G * np.exp(-self.lmbda)  #base coverage
        self.N_gaps = self.N * np.exp(-self.lmbda)           #number of gaps (uncovered bases)

        #gamma prior sampling (negative binomial model)
        #X = "number of failures before rth success"
        self.k = 0.5                                  #dispersion parameter (fit to data)
        self.p = self.lmbda / (self.lmbda + 1/self.k) #success probability
        self.r = 1/self.k                             #number of successes

        #RNAP binding data (RNA-seq)
        self.data = np.random.negative_binomial(self.r, self.p, size=self.LN)

    def histogram(self):
        self.bin_delta = 1  #smoothing parameter
        self.bin_range = np.arange(1, np.max(self.data), self.bin_delta)
        self.bin_counts, _ = np.histogram(self.data, bins=self.bin_range)

        #histogram density estimation
        #P = integral_R p(x) dx, where X is in R^3
        #p(x) = K/(NxV), where K=number of points in region R
        #N=total number of points, V=volume of region R
        rnap_density_est = self.bin_counts / (sum(self.bin_counts) * self.bin_delta)
        return rnap_density_est

    def kernel(self):
        #Gaussian kernel density estimator with smoothing parameter h
        #sum of N Gaussians centered at each data point, parameterized by a common std dev h
        x_dim = 1  #dimension of x
        h = 10     #standard deviation
        rnap_density_support = np.arange(np.max(self.data))
        rnap_density_est = 0
        for i in range(np.sum(self.bin_counts)):
            rnap_density_est += (1/(2*np.pi*h**2)**(x_dim/2.0))*np.exp(-(rnap_density_support - self.data[i])**2 / (2.0*h**2))
        #end for
        rnap_density_est = rnap_density_est / np.sum(rnap_density_est)
        return rnap_density_est

if __name__ == "__main__":
    kde = KDE()
    est1 = kde.histogram()
    est2 = kde.kernel()

    plt.figure()
    plt.plot(est1, '-b', label='histogram')
    plt.plot(est2, '--r', label='gaussian kernel')
    plt.title("RNA-seq density estimate based on negative binomial model")
    plt.xlabel("read length, [base pairs]"); plt.ylabel("density"); plt.legend()
    plt.show()
Tangent portfolio optimization
The objective of mean-variance analysis is to maximize the expected return of a portfolio for a given level of risk, as measured by the standard deviation of past returns. By varying the mixing proportions of each asset, we can achieve different risk–return tradeoffs.
Example:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.neighbors import KDTree
from pandas.plotting import scatter_matrix
from scipy.spatial import ConvexHull
import pandas_datareader.data as web
from datetime import datetime
import pytz

STOCKS = ['SPY','LQD','TIP','GLD','MSFT']

np.random.seed(42)

if __name__ == "__main__":
    plt.close("all")

    #load data
    #year, month, day, hour, minute, second, microsecond
    start = datetime(2012, 1, 1, 0, 0, 0, 0, pytz.utc)
    end = datetime(2017, 1, 1, 0, 0, 0, 0, pytz.utc)

    data = pd.DataFrame()
    series = []
    for ticker in STOCKS:
        price = web.DataReader(ticker, 'stooq', start, end)
        series.append(price['Close'])

    data = pd.concat(series, axis=1)
    data.columns = STOCKS
    data = data.dropna()

    #plot data correlations
    scatter_matrix(data, alpha=0.2, diagonal='kde')
    plt.show()

    #get current portfolio
    cash = 10000
    num_assets = np.size(STOCKS)
    cur_value = (1e4-5e3)*np.random.rand(num_assets,1) + 5e3
    tot_value = np.sum(cur_value)
    weights = cur_value.ravel()/float(tot_value)

    #compute portfolio risk
    Sigma = data.cov().values
    Corr = data.corr().values
    volatility = np.sqrt(np.dot(weights.T, np.dot(Sigma, weights)))

    plt.figure()
    plt.title('Correlation Matrix')
    plt.imshow(Corr, cmap='gray')
    plt.xticks(range(len(STOCKS)), data.columns)
    plt.yticks(range(len(STOCKS)), data.columns)
    plt.colorbar()
    plt.show()

    #generate random portfolio weights
    num_trials = 1000
    W = np.random.rand(num_trials, np.size(weights))
    W = W/np.sum(W, axis=1).reshape(num_trials,1)  #normalize

    pv = np.zeros(num_trials)  #portfolio value w'v
    ps = np.zeros(num_trials)  #portfolio sigma: sqrt(w'Sw)

    avg_price = data.mean().values
    adj_price = avg_price

    for i in range(num_trials):
        pv[i] = np.sum(adj_price * W[i,:])
        ps[i] = np.sqrt(np.dot(W[i,:].T, np.dot(Sigma, W[i,:])))

    points = np.vstack((ps, pv)).T
    hull = ConvexHull(points)

    plt.figure()
    plt.scatter(ps, pv, marker='o', color='b', linewidth=3.0, label='tangent portfolio')
    plt.scatter(volatility, np.sum(adj_price * weights), marker='s', color='r', linewidth=3.0, label='current')
    plt.plot(points[hull.vertices,0], points[hull.vertices,1], linewidth=2.0)
    plt.title('expected return vs volatility')
    plt.ylabel('expected price')
    plt.xlabel('portfolio std dev')
    plt.legend()
    plt.grid(True)
    plt.show()

    #query for nearest neighbor portfolio
    knn = 5
    kdt = KDTree(points)
    query_point = np.array([2, 115]).reshape(1,-1)
    kdt_dist, kdt_idx = kdt.query(query_point, k=knn)
    print("top-%d closest to query portfolios:" % knn)
    print("values: ", pv[kdt_idx.ravel()])
    print("sigmas: ", ps[kdt_idx.ravel()])
top-5 closest to query portfolios:
values: [114.63340211 115.57074795 115.03231305 115.35426485 114.426947 ]
sigmas: [3.9943071 4.06987543 4.36175633 4.45779421 4.44304117]
Structure learning
Structure learning involves discovering a graph's structure from relational data. The goal is to evaluate the probability of a graph G = (V, E) given the observed data D. One challenge is the sheer number of possible graphs: a graph with V vertices has V(V-1)/2 possible edges, giving on the order of 2^{V(V-1)/2} candidate structures. Since structure learning for general graphs is NP-hard, approximate methods are often used.
Approximate Methods
Two common approaches for structure learning are:
- Chow-Liu algorithm for tree-based graphs
- Inverse covariance estimation for general graphs
Chow-Liu Algorithm
The Chow-Liu algorithm focuses on tree structures. It uses a joint probability model for a tree T, where:
- p(x_t) is a node marginal.
- p(x_s, x_t) is an edge marginal.
To derive the Chow-Liu algorithm, the likelihood is expressed in terms of the mutual information I(x_s, x_t | θ) between pairs of nodes. The tree topology that maximizes the log likelihood is computed via the maximum weight spanning tree, where the edge weights are the pairwise mutual information values. The maximum spanning tree can be calculated using either Prim's or Kruskal's algorithm, both of which run in O(E log V) time.
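A rough sketch of this procedure on synthetic binary data: pairwise mutual information from sklearn is used as edge weights, and the maximum weight spanning tree is obtained by negating the weights and running scipy's minimum spanning tree. The chain-structured data and sizes are made up for illustration.

    import numpy as np
    from sklearn.metrics import mutual_info_score
    from scipy.sparse.csgraph import minimum_spanning_tree

    rng = np.random.default_rng(0)
    # synthetic binary data with a chain dependency x0 -> x1 -> x2 -> x3
    N = 2000
    X = np.zeros((N, 4), dtype=int)
    X[:, 0] = rng.integers(0, 2, N)
    for j in range(1, 4):
        flip = rng.random(N) < 0.1
        X[:, j] = np.where(flip, 1 - X[:, j - 1], X[:, j - 1])

    # pairwise mutual information I(x_s, x_t) as edge weights (upper triangle)
    V = X.shape[1]
    MI = np.zeros((V, V))
    for s in range(V):
        for t in range(s + 1, V):
            MI[s, t] = mutual_info_score(X[:, s], X[:, t])

    # maximum weight spanning tree = minimum spanning tree of the negated weights
    mst = minimum_spanning_tree(-MI)
    edges = np.transpose(np.nonzero(mst.toarray()))
    print("Chow-Liu tree edges:", [tuple(e) for e in edges])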
Inverse covariance estimation
Identifying stock clusters helps one discover similar companies, which can be useful for comparative analysis or a pairs trading strategy. We can find such clusters by estimating the inverse covariance (precision) matrix and using it to construct a graph network of dependencies, exploiting the fact that zeros in the precision matrix correspond to the absence of edges in the constructed graph.
import numpy as np
import pandas as pd
from scipy import linalg
from datetime import datetime
import pytz
from sklearn.datasets import make_sparse_spd_matrix
from sklearn.covariance import GraphicalLassoCV, ledoit_wolf
from sklearn.preprocessing import StandardScaler
from sklearn import cluster, manifold
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.collections import LineCollection
import pandas_datareader.data as web

np.random.seed(42)

def main():
    #generate data (synthetic)
    #num_samples = 60
    #num_features = 20
    #prec = make_sparse_spd_matrix(num_features, alpha=0.95, smallest_coef=0.4, largest_coef=0.7)
    #cov = linalg.inv(prec)
    #X = np.random.multivariate_normal(np.zeros(num_features), cov, size=num_samples)
    #X = StandardScaler().fit_transform(X)

    #generate data (actual)
    STOCKS = {
        'SPY': 'S&P500',
        'LQD': 'Bond_Corp',
        'TIP': 'Bond_Treas',
        'GLD': 'Gold',
        'MSFT': 'Microsoft',
        'XOM': 'Exxon',
        'AMZN': 'Amazon',
        'BAC': 'BofA',
        'NVS': 'Novartis'}

    symbols, names = np.array(list(STOCKS.items())).T

    #load data
    #year, month, day, hour, minute, second, microsecond
    start = datetime(2015, 1, 1, 0, 0, 0, 0, pytz.utc)
    end = datetime(2017, 1, 1, 0, 0, 0, 0, pytz.utc)

    qopen, qclose = [], []
    data_close, data_open = pd.DataFrame(), pd.DataFrame()
    for ticker in symbols:
        price = web.DataReader(ticker, 'stooq', start, end)
        qopen.append(price['Open'])
        qclose.append(price['Close'])

    data_open = pd.concat(qopen, axis=1)
    data_open.columns = symbols
    data_close = pd.concat(qclose, axis=1)
    data_close.columns = symbols

    #per-day variation in price for each symbol
    variation = data_close - data_open
    variation = variation.dropna()

    X = variation.values
    X /= X.std(axis=0)  #standardize to use correlations rather than covariance

    #estimate inverse covariance
    graph = GraphicalLassoCV()
    graph.fit(X)

    gl_cov = graph.covariance_
    gl_prec = graph.precision_
    gl_alphas = graph.cv_results_['alphas']
    gl_scores = graph.cv_results_['mean_test_score']

    plt.figure()
    sns.heatmap(gl_prec, xticklabels=names, yticklabels=names)
    plt.xticks(rotation=45)
    plt.yticks(rotation=45)
    plt.tight_layout()
    plt.show()

    plt.figure()
    plt.plot(gl_alphas, gl_scores, marker='o', color='b', lw=2.0, label='GraphLassoCV')
    plt.title("Graph Lasso Alpha Selection")
    plt.xlabel("alpha")
    plt.ylabel("score")
    plt.legend()
    plt.show()

    #cluster using affinity propagation
    _, labels = cluster.affinity_propagation(gl_cov)
    num_labels = np.max(labels)

    for i in range(num_labels+1):
        print("Cluster %i: %s" %((i+1), ', '.join(names[labels==i])))

    #find a low-dimensional embedding for visualization
    node_model = manifold.LocallyLinearEmbedding(n_components=2, n_neighbors=6, eigen_solver='dense')
    embedding = node_model.fit_transform(X.T).T

    #generate plots
    plt.figure()
    plt.clf()
    ax = plt.axes([0.,0.,1.,1.])
    plt.axis('off')

    partial_corr = gl_prec
    d = 1 / np.sqrt(np.diag(partial_corr))
    non_zero = (np.abs(np.triu(partial_corr, k=1)) > 0.02)  #connectivity matrix

    #plot the nodes
    plt.scatter(embedding[0], embedding[1], s=100*d**2, c=labels, cmap=plt.cm.Spectral)

    #plot the edges
    start_idx, end_idx = np.where(non_zero)
    segments = [[embedding[:,start], embedding[:,stop]] for start, stop in zip(start_idx, end_idx)]
    values = np.abs(partial_corr[non_zero])
    lc = LineCollection(segments, zorder=0, cmap=plt.cm.hot_r, norm=plt.Normalize(0, 0.7*values.max()))
    lc.set_array(values)
    lc.set_linewidths(2*values)
    ax.add_collection(lc)

    #plot the labels
    for index, (name, label, (x,y)) in enumerate(zip(names, labels, embedding.T)):
        plt.text(x, y, name, size=12)

    plt.show()

if __name__ == "__main__":
    main()
Graph lasso estimated precision matrix:
Stock clusters:
The edge values in the precision matrix greater than a threshold correspond to connected components from which we can compute stock clusters, as visualized on the right-hand side of the figure.
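As a toy illustration of that thresholding step (separate from the GraphicalLassoCV output above, with a made-up precision matrix and ticker names), the connected components can be read off with scipy:

    import numpy as np
    from scipy.sparse.csgraph import connected_components

    # made-up 5x5 precision matrix for illustration only
    prec = np.array([
        [1.00, 0.30, 0.00, 0.00, 0.00],
        [0.30, 1.00, 0.25, 0.00, 0.00],
        [0.00, 0.25, 1.00, 0.00, 0.00],
        [0.00, 0.00, 0.00, 1.00, 0.40],
        [0.00, 0.00, 0.00, 0.40, 1.00]])
    names = np.array(['A', 'B', 'C', 'D', 'E'])

    # threshold the off-diagonal entries to get a connectivity (adjacency) matrix
    adjacency = (np.abs(np.triu(prec, k=1)) > 0.02).astype(int)
    n_clusters, labels = connected_components(adjacency, directed=False)
    for i in range(n_clusters):
        print("Cluster %i: %s" % (i + 1, ', '.join(names[labels == i])))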
Simulated annealing
Simulated annealing is a heuristic search method that allows occasional transitions to less favorable states in order to escape local optima. We can formulate it as an energy minimization problem with a temperature parameter T, which controls how the energy landscape is explored and lets us select moves that are unfavorable in the short term but can lead to better long-term optima.
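A minimal sketch of the idea on a toy one-dimensional energy function; the energy function, proposal step size, and geometric cooling schedule below are arbitrary choices for illustration.

    import numpy as np

    rng = np.random.default_rng(0)

    def energy(x):
        # toy multimodal energy landscape with several local minima
        return 0.1 * x**2 + np.sin(3 * x)

    x = 4.0          # initial state
    T = 2.0          # initial temperature
    cooling = 0.99   # geometric cooling schedule
    for step in range(2000):
        x_new = x + rng.normal(scale=0.5)      # propose a random move
        dE = energy(x_new) - energy(x)
        # always accept downhill moves; accept uphill moves with prob exp(-dE/T)
        if dE < 0 or rng.random() < np.exp(-dE / T):
            x = x_new
        T *= cooling                           # gradually lower the temperature

    print(f"final state x={x:.3f}, energy={energy(x):.3f}")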
Genetic algorithms
Genetic algorithms (GA) are inspired by and modeled after evolutionary biology. They consist of a population of (randomly initialized) individual genomes that are evaluated for their fitness with respect to a target. In genetic algorithms, two individuals combine and cross over their genome to produce an offspring. This crossover step is followed by a mutation by which individual bases can change according to a mutation rate. The resulting offspring are added to the population and scored according to their fitness level.
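A minimal sketch of these steps, evolving a random bit-string population toward an all-ones target; the population size, mutation rate, and fitness function are illustrative choices, not anything from the book.

    import numpy as np

    rng = np.random.default_rng(0)
    L, pop_size, mut_rate, n_gens = 32, 50, 0.01, 200

    def fitness(genome):
        return genome.sum()  # target: all-ones bit string

    pop = rng.integers(0, 2, size=(pop_size, L))  # random initial population
    for gen in range(n_gens):
        scores = np.array([fitness(g) for g in pop])
        # select parent pairs proportionally to fitness
        probs = scores / scores.sum()
        parents = pop[rng.choice(pop_size, size=(pop_size, 2), p=probs)]
        # single-point crossover of each parent pair
        cut = rng.integers(1, L, size=pop_size)
        children = np.array([np.concatenate([p[0][:c], p[1][c:]])
                             for p, c in zip(parents, cut)])
        # mutation: flip individual bases with probability mut_rate
        flips = rng.random(children.shape) < mut_rate
        pop = np.where(flips, 1 - children, children)

    print("best fitness:", max(fitness(g) for g in pop), "of", L)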
Final ppt for the KB competition
Today we finally submitted our presentation for the KB Future Finance AI competition - here is a link to it on my google drive. The competition will be held on Tuesday, but there is a rehearsal on Monday evening; we are not sure what will happen there, but I will write about it tomorrow. Tomorrow (Monday) we will travel to Seoul mid-day.
I am getting some nerves haha. Today I rewatched the competitions from previous years to see how the winners are picked and the points they get. Teams are graded on the spot, right after the ppt, and there are two grades out of 10 - one from a professional panel from KB and one from the audience. The competition is all in Korean, but I will share the link for completeness - here
The winners from 2023 got the following points (professional judges + audience = total points):
1. quantChat - 9.3 and 7.1 = 16.4
2. Kbeep voice phishing - 9 and 7.3 = 16.3
3. Ke ke ke - 8.2 and 8.0 = 16.2
4. Kb Athena - 7.8 and 7.4 = 15.2
5. KB DocSearch - 7.8 and 6.3 = 14.1
6. Katering - 7 and 7.3 = 14.3
Writer - 7.3 and 6.0 = 13.3
Money laundering prevention - 7.6 and 6.8 = 14.4
Stock fund - 7.8 and 5.9 = 13.7
KB-Dialect - 7.6 and 6.8 = 14.4
Only the top 6 teams get an award, which is why 4 teams are listed without a ranking.
That is all for today!
See you tomorrow :)