Hello :) Today is Day 237!
A quick summary of today:
- reading chapter 9 from the Machine Learning Algorithms in Depth book
- re-watching previous years’ KB Future Finance AI competitions
Latent Dirichlet Allocation (LDA)
Latent Dirichlet Allocation (LDA) is a topic model used to analyze discrete data, such as text documents. It represents each document as a mixture of topics, where each topic is a distribution over words. The goal of LDA is to learn shared topic distributions and topic proportions for each document.
Bag of Words Model
LDA assumes a bag of words model, meaning that words are treated as exchangeable, and sentence structure is ignored. Each document is represented as a vector of word counts, and the entire corpus is summarized in a term-document matrix.
Matrix Factorization
LDA can be viewed as a nonnegative matrix factorization problem. The term-document matrix is factorized into a product of topics and topic proportions:
- A = WH
- W: Topics (distribution over words)
- H: Topic proportions
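As a rough sketch of this factorization view (LDA itself is fit via the inference described below), scikit-learn's NMF can factor a tiny made-up corpus; note that scikit-learn factorizes the document-term matrix, so W and H come out transposed relative to the term-document convention above. The corpus and number of topics are assumptions for illustration only.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import NMF

    # toy corpus (made up for illustration)
    docs = [
        "stocks bonds portfolio returns",
        "genes rna sequencing reads",
        "portfolio risk returns volatility",
        "rna reads coverage genome",
    ]

    # document-term count matrix A (documents x words)
    vectorizer = CountVectorizer()
    A = vectorizer.fit_transform(docs)

    # factorize A ~ WH with K=2 topics
    nmf = NMF(n_components=2, init="nndsvda", random_state=0)
    W = nmf.fit_transform(A)   # per-document topic proportions (documents x topics)
    H = nmf.components_        # topic distributions over words (topics x words)

    print(W.round(2))
    print(H.round(2))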
TF-IDF and Smoothing
Term frequency–inverse document frequency (tf-idf) is often used to adjust word counts, reducing the influence of words that frequently occur across documents. This method helps in identifying discriminative words and improving model performance. For n-grams, different smoothing techniques like Laplace smoothing are used to handle sparse data.
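For instance, the reweighting can be sketched with scikit-learn's TfidfVectorizer on a few made-up sentences; words that occur in most documents (such as "the") receive a low idf weight.

    from sklearn.feature_extraction.text import TfidfVectorizer

    # toy sentences (made up for illustration)
    docs = [
        "the market rose and the market fell",
        "the genome has many reads",
        "the market and the genome",
    ]

    tfidf = TfidfVectorizer()
    X = tfidf.fit_transform(docs)  # tf-idf weighted document-term matrix

    # inspect the learned idf weights per vocabulary word
    for word, idx in sorted(tfidf.vocabulary_.items()):
        print(f"{word:>8s}  idf={tfidf.idf_[idx]:.2f}")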
Generative Model
In LDA, each word in a document is associated with a topic label, and each document is associated with topic proportions (θd). Topics (βk) are shared across all documents, and hyperparameters α and η capture prior knowledge about topic proportions and topics.
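As a sketch of that generative story (not the book's code), a tiny synthetic corpus can be sampled with numpy; the corpus sizes and the values of α and η below are arbitrary choices for illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    V, K, D, Nd = 50, 3, 4, 20   # vocabulary size, topics, documents, words per doc (assumed)
    alpha, eta = 0.1, 0.01       # Dirichlet hyperparameters (assumed)

    beta = rng.dirichlet(eta * np.ones(V), size=K)     # topics beta_k: distributions over words
    for d in range(D):
        theta_d = rng.dirichlet(alpha * np.ones(K))    # topic proportions for document d
        z = rng.choice(K, size=Nd, p=theta_d)          # topic label for each word
        words = [rng.choice(V, p=beta[k]) for k in z]  # each word drawn from its topic
        print(f"doc {d}: topic counts {np.bincount(z, minlength=K)}, first words {words[:5]}")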
Inference and Variational Bayes
The key inference problem in LDA is computing the posterior distribution of latent variables for a given document. This can be approximated using a variational distribution that is optimized to maximize the evidence lower bound (ELBO). The variational parameter updates can be done in an online setting, allowing for topic analysis on large datasets or streaming data.
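As a practical sketch, scikit-learn's LatentDirichletAllocation exposes an online variational Bayes scheme through partial_fit; the synthetic mini-batches and parameter values below are assumptions for illustration.

    import numpy as np
    from sklearn.decomposition import LatentDirichletAllocation

    rng = np.random.default_rng(0)
    # synthetic word-count matrices standing in for mini-batches of documents
    batches = [rng.poisson(1.0, size=(64, 200)) for _ in range(5)]

    lda = LatentDirichletAllocation(n_components=10, learning_method="online",
                                    learning_offset=10.0, random_state=0)
    for X_batch in batches:
        lda.partial_fit(X_batch)           # online update of the variational parameters

    topic_word = lda.components_           # unnormalized topic-word distributions
    doc_topic = lda.transform(batches[0])  # inferred topic proportions for one batch
    print(topic_word.shape, doc_topic.shape)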
Density estimators
The goal of density estimation is to model the probability density of data.
Kernel density estimator
An alternative approach to a K-component mixture model is the kernel density estimator (KDE), which allocates one cluster center per data point. KDEs are an application of kernel smoothing to probability density estimation, using kernels as weights.
The advantage of KDEs over parametric models, such as density mixtures, is that no model fitting is required (apart from fine-tuning the bandwidth parameter h) and there is no need to pick the number of mixture components K. The disadvantage is that the model takes a lot of memory to store and time to evaluate. In other words, KDE is suitable when an accurate density estimate is required for a relatively small dataset (small number of points N).
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(14)

class KDE():
    def __init__(self):
        #Histogram and Gaussian Kernel Estimator used to
        #analyze RNA-seq data for flux estimation of a T7 promoter
        self.G = 1e9    #length of genome in base pairs (bp)
        self.C = 1e3    #number of unique molecules
        self.L = 100    #length of a read, bp
        self.N = 1e6    #number of reads, L bp long
        self.M = 1e4    #number of unique read sequences, bp
        self.LN = 1000  #total length of assembled / mapped RNA-seq reads
        self.FDR = 0.05 #false discovery rate

        #uniform sampling (Poisson model)
        self.lmbda = (self.N * self.L) / self.G              #expected number of bases covered
        self.C_est = self.M / (1 - np.exp(-self.lmbda))      #library size estimate
        self.C_cvrg = self.G - self.G * np.exp(-self.lmbda)  #base coverage
        self.N_gaps = self.N * np.exp(-self.lmbda)           #number of gaps (uncovered bases)

        #gamma prior sampling (negative binomial model)
        #X = "number of failures before rth success"
        self.k = 0.5                                  #dispersion parameter (fit to data)
        self.p = self.lmbda / (self.lmbda + 1/self.k) #success probability
        self.r = 1/self.k                             #number of successes

        #RNAP binding data (RNA-seq)
        self.data = np.random.negative_binomial(self.r, self.p, size=self.LN)

    def histogram(self):
        self.bin_delta = 1  #smoothing parameter
        self.bin_range = np.arange(1, np.max(self.data), self.bin_delta)
        self.bin_counts, _ = np.histogram(self.data, bins=self.bin_range)

        #histogram density estimation
        #P = integral_R p(x) dx, where X is in R^3
        #p(x) = K/(NxV), where K=number of points in region R
        #N=total number of points, V=volume of region R
        rnap_density_est = self.bin_counts / (sum(self.bin_counts) * self.bin_delta)
        return rnap_density_est

    def kernel(self):
        #Gaussian kernel density estimator with smoothing parameter h
        #sum of N Gaussians centered at each data point, parameterized by a common std dev h
        x_dim = 1  #dimension of x
        h = 10     #standard deviation
        rnap_density_support = np.arange(np.max(self.data))
        rnap_density_est = 0
        for i in range(np.sum(self.bin_counts)):
            rnap_density_est += (1/(2*np.pi*h**2)**(x_dim/2.0))*np.exp(-(rnap_density_support - self.data[i])**2 / (2.0*h**2))
        #end for
        rnap_density_est = rnap_density_est / np.sum(rnap_density_est)
        return rnap_density_est

if __name__ == "__main__":
    kde = KDE()
    est1 = kde.histogram()
    est2 = kde.kernel()

    plt.figure()
    plt.plot(est1, '-b', label='histogram')
    plt.plot(est2, '--r', label='gaussian kernel')
    plt.title("RNA-seq density estimate based on negative binomial model")
    plt.xlabel("read length, [base pairs]"); plt.ylabel("density"); plt.legend()
    plt.show()
Tangent portfolio optimization
The objective of mean-variance analysis is to maximize the expected return of a portfolio for a given level of risk, as measured by the standard deviation of past returns. By varying the mixing proportions of each asset, we can achieve different risk–return tradeoffs.
Example:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.neighbors import KDTree
from pandas.plotting import scatter_matrix
from scipy.spatial import ConvexHull
import pandas_datareader.data as web
from datetime import datetime
import pytz

STOCKS = ['SPY','LQD','TIP','GLD','MSFT']

np.random.seed(42)

if __name__ == "__main__":
    plt.close("all")

    #load data
    #year, month, day, hour, minute, second, microsecond
    start = datetime(2012, 1, 1, 0, 0, 0, 0, pytz.utc)
    end = datetime(2017, 1, 1, 0, 0, 0, 0, pytz.utc)

    data = pd.DataFrame()
    series = []
    for ticker in STOCKS:
        price = web.DataReader(ticker, 'stooq', start, end)
        series.append(price['Close'])

    data = pd.concat(series, axis=1)
    data.columns = STOCKS
    data = data.dropna()

    #plot data correlations
    scatter_matrix(data, alpha=0.2, diagonal='kde')
    plt.show()

    #get current portfolio
    cash = 10000
    num_assets = np.size(STOCKS)
    cur_value = (1e4-5e3)*np.random.rand(num_assets,1) + 5e3
    tot_value = np.sum(cur_value)
    weights = cur_value.ravel()/float(tot_value)

    #compute portfolio risk
    Sigma = data.cov().values
    Corr = data.corr().values
    volatility = np.sqrt(np.dot(weights.T, np.dot(Sigma, weights)))

    plt.figure()
    plt.title('Correlation Matrix')
    plt.imshow(Corr, cmap='gray')
    plt.xticks(range(len(STOCKS)), data.columns)
    plt.yticks(range(len(STOCKS)), data.columns)
    plt.colorbar()
    plt.show()

    #generate random portfolio weights
    num_trials = 1000
    W = np.random.rand(num_trials, np.size(weights))
    W = W/np.sum(W, axis=1).reshape(num_trials,1)  #normalize

    pv = np.zeros(num_trials)  #portfolio value w'v
    ps = np.zeros(num_trials)  #portfolio sigma: sqrt(w'Sw)

    avg_price = data.mean().values
    adj_price = avg_price

    for i in range(num_trials):
        pv[i] = np.sum(adj_price * W[i,:])
        ps[i] = np.sqrt(np.dot(W[i,:].T, np.dot(Sigma, W[i,:])))

    points = np.vstack((ps, pv)).T
    hull = ConvexHull(points)

    plt.figure()
    plt.scatter(ps, pv, marker='o', color='b', linewidth=3.0, label='tangent portfolio')
    plt.scatter(volatility, np.sum(adj_price * weights), marker='s', color='r', linewidth=3.0, label='current')
    plt.plot(points[hull.vertices,0], points[hull.vertices,1], linewidth=2.0)
    plt.title('expected return vs volatility')
    plt.ylabel('expected price')
    plt.xlabel('portfolio std dev')
    plt.legend()
    plt.grid(True)
    plt.show()

    #query for nearest neighbor portfolio
    knn = 5
    kdt = KDTree(points)
    query_point = np.array([2, 115]).reshape(1,-1)
    kdt_dist, kdt_idx = kdt.query(query_point, k=knn)
    print("top-%d closest to query portfolios:" % knn)
    print("values: ", pv[kdt_idx.ravel()])
    print("sigmas: ", ps[kdt_idx.ravel()])
top-5 closest to query portfolios:
values: [114.63340211 115.57074795 115.03231305 115.35426485 114.426947 ]
sigmas: [3.9943071 4.06987543 4.36175633 4.45779421 4.44304117]
Structure learning
Structure learning involves discovering a graph's structure from relational data. The goal is to evaluate the probability of a graph G = (V, E) given the observed data D. One challenge is the sheer number of possible graphs: a graph with V vertices has V(V-1)/2 possible edges, giving on the order of 2^{V(V-1)/2} candidate structures. Since structure learning for general graphs is NP-hard, approximate methods are often used.
Approximate Methods
Two common approaches for structure learning are:
- Chow-Liu algorithm for tree-based graphs
- Inverse covariance estimation for general graphs
Chow-Liu Algorithm
The Chow-Liu algorithm focuses on tree structures. It uses a joint probability model for a tree T, where:
- p(x_t) is a node marginal.
- p(x_s, x_t) is an edge marginal.
To derive the Chow-Liu algorithm, the likelihood is expressed in terms of the mutual information I(x_s, x_t | θ) between pairs of nodes. The tree topology that maximizes the log likelihood is computed via the maximum weight spanning tree, where the edge weights are the pairwise mutual information values. The maximum spanning tree can be calculated using either Prim's or Kruskal's algorithm, both of which run in O(E log V) time.
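A rough sketch of this procedure on synthetic binary data: pairwise mutual information from sklearn is used as edge weights, and the maximum weight spanning tree is obtained by negating the weights and running scipy's minimum spanning tree. The chain-structured data and sizes are made up for illustration.

    import numpy as np
    from sklearn.metrics import mutual_info_score
    from scipy.sparse.csgraph import minimum_spanning_tree

    rng = np.random.default_rng(0)
    # synthetic binary data with a chain dependency x0 -> x1 -> x2 -> x3
    N = 2000
    X = np.zeros((N, 4), dtype=int)
    X[:, 0] = rng.integers(0, 2, N)
    for j in range(1, 4):
        flip = rng.random(N) < 0.1
        X[:, j] = np.where(flip, 1 - X[:, j - 1], X[:, j - 1])

    # pairwise mutual information I(x_s, x_t) as edge weights (upper triangle)
    V = X.shape[1]
    MI = np.zeros((V, V))
    for s in range(V):
        for t in range(s + 1, V):
            MI[s, t] = mutual_info_score(X[:, s], X[:, t])

    # maximum weight spanning tree = minimum spanning tree of the negated weights
    mst = minimum_spanning_tree(-MI)
    edges = np.transpose(np.nonzero(mst.toarray()))
    print("Chow-Liu tree edges:", [tuple(e) for e in edges])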
Inverse covariance estimation
Identifying stock clusters helps one discover similar companies, which can be useful for comparative analysis or a pairs trading strategy. We can find such clusters by estimating the inverse covariance (precision) matrix and using it to construct a graph network of dependencies, exploiting the fact that zeros in the precision matrix correspond to the absence of edges in the constructed graph.
import numpy as np
import pandas as pd
from scipy import linalg
from datetime import datetime
import pytz
from sklearn.datasets import make_sparse_spd_matrix
from sklearn.covariance import GraphicalLassoCV, ledoit_wolf
from sklearn.preprocessing import StandardScaler
from sklearn import cluster, manifold
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.collections import LineCollection
import pandas_datareader.data as web

np.random.seed(42)

def main():
    #generate data (synthetic)
    #num_samples = 60
    #num_features = 20
    #prec = make_sparse_spd_matrix(num_features, alpha=0.95, smallest_coef=0.4, largest_coef=0.7)
    #cov = linalg.inv(prec)
    #X = np.random.multivariate_normal(np.zeros(num_features), cov, size=num_samples)
    #X = StandardScaler().fit_transform(X)

    #generate data (actual)
    STOCKS = {
        'SPY': 'S&P500',
        'LQD': 'Bond_Corp',
        'TIP': 'Bond_Treas',
        'GLD': 'Gold',
        'MSFT': 'Microsoft',
        'XOM': 'Exxon',
        'AMZN': 'Amazon',
        'BAC': 'BofA',
        'NVS': 'Novartis'}

    symbols, names = np.array(list(STOCKS.items())).T

    #load data
    #year, month, day, hour, minute, second, microsecond
    start = datetime(2015, 1, 1, 0, 0, 0, 0, pytz.utc)
    end = datetime(2017, 1, 1, 0, 0, 0, 0, pytz.utc)

    qopen, qclose = [], []
    data_close, data_open = pd.DataFrame(), pd.DataFrame()
    for ticker in symbols:
        price = web.DataReader(ticker, 'stooq', start, end)
        qopen.append(price['Open'])
        qclose.append(price['Close'])

    data_open = pd.concat(qopen, axis=1)
    data_open.columns = symbols
    data_close = pd.concat(qclose, axis=1)
    data_close.columns = symbols

    #per-day variation in price for each symbol
    variation = data_close - data_open
    variation = variation.dropna()

    X = variation.values
    X /= X.std(axis=0)  #standardize to use correlations rather than covariance

    #estimate inverse covariance
    graph = GraphicalLassoCV()
    graph.fit(X)

    gl_cov = graph.covariance_
    gl_prec = graph.precision_
    gl_alphas = graph.cv_results_['alphas']
    gl_scores = graph.cv_results_['mean_test_score']

    plt.figure()
    sns.heatmap(gl_prec, xticklabels=names, yticklabels=names)
    plt.xticks(rotation=45)
    plt.yticks(rotation=45)
    plt.tight_layout()
    plt.show()

    plt.figure()
    plt.plot(gl_alphas, gl_scores, marker='o', color='b', lw=2.0, label='GraphLassoCV')
    plt.title("Graph Lasso Alpha Selection")
    plt.xlabel("alpha")
    plt.ylabel("score")
    plt.legend()
    plt.show()

    #cluster using affinity propagation
    _, labels = cluster.affinity_propagation(gl_cov)
    num_labels = np.max(labels)

    for i in range(num_labels+1):
        print("Cluster %i: %s" %((i+1), ', '.join(names[labels==i])))

    #find a low-dimensional embedding for visualization
    node_model = manifold.LocallyLinearEmbedding(n_components=2, n_neighbors=6, eigen_solver='dense')
    embedding = node_model.fit_transform(X.T).T

    #generate plots
    plt.figure()
    plt.clf()
    ax = plt.axes([0.,0.,1.,1.])
    plt.axis('off')

    partial_corr = gl_prec
    d = 1 / np.sqrt(np.diag(partial_corr))
    non_zero = (np.abs(np.triu(partial_corr, k=1)) > 0.02)  #connectivity matrix

    #plot the nodes
    plt.scatter(embedding[0], embedding[1], s=100*d**2, c=labels, cmap=plt.cm.Spectral)

    #plot the edges
    start_idx, end_idx = np.where(non_zero)
    segments = [[embedding[:,start], embedding[:,stop]] for start, stop in zip(start_idx, end_idx)]
    values = np.abs(partial_corr[non_zero])
    lc = LineCollection(segments, zorder=0, cmap=plt.cm.hot_r, norm=plt.Normalize(0, 0.7*values.max()))
    lc.set_array(values)
    lc.set_linewidths(2*values)
    ax.add_collection(lc)

    #plot the labels
    for index, (name, label, (x,y)) in enumerate(zip(names, labels, embedding.T)):
        plt.text(x, y, name, size=12)

    plt.show()

if __name__ == "__main__":
    main()
Graph lasso estimated precision matrix:
Stock clusters:
The edge values in the precision matrix greater than a threshold correspond to connected components from which we can compute stock clusters, as visualized on the right-hand side of the figure.
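As a toy illustration of that thresholding step (separate from the GraphicalLassoCV output above, with a made-up precision matrix and ticker names), the connected components can be read off with scipy:

    import numpy as np
    from scipy.sparse.csgraph import connected_components

    # made-up 5x5 precision matrix for illustration only
    prec = np.array([
        [1.00, 0.30, 0.00, 0.00, 0.00],
        [0.30, 1.00, 0.25, 0.00, 0.00],
        [0.00, 0.25, 1.00, 0.00, 0.00],
        [0.00, 0.00, 0.00, 1.00, 0.40],
        [0.00, 0.00, 0.00, 0.40, 1.00]])
    names = np.array(['A', 'B', 'C', 'D', 'E'])

    # threshold the off-diagonal entries to get a connectivity (adjacency) matrix
    adjacency = (np.abs(np.triu(prec, k=1)) > 0.02).astype(int)
    n_clusters, labels = connected_components(adjacency, directed=False)
    for i in range(n_clusters):
        print("Cluster %i: %s" % (i + 1, ', '.join(names[labels == i])))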
Simulated annealing
Simulated annealing is a heuristic search method that allows occasional transitions to less favorable states in order to escape local optima. We can formulate it as an energy minimization problem with a temperature parameter T, which controls how the energy landscape is explored and lets us select moves that are unfavorable in the short term but can lead to better long-term optima.
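A minimal sketch of the idea on a toy one-dimensional energy function; the energy function, proposal step size, and geometric cooling schedule below are arbitrary choices for illustration.

    import numpy as np

    rng = np.random.default_rng(0)

    def energy(x):
        # toy multimodal energy landscape with several local minima
        return 0.1 * x**2 + np.sin(3 * x)

    x = 4.0          # initial state
    T = 2.0          # initial temperature
    cooling = 0.99   # geometric cooling schedule
    for step in range(2000):
        x_new = x + rng.normal(scale=0.5)      # propose a random move
        dE = energy(x_new) - energy(x)
        # always accept downhill moves; accept uphill moves with prob exp(-dE/T)
        if dE < 0 or rng.random() < np.exp(-dE / T):
            x = x_new
        T *= cooling                           # gradually lower the temperature

    print(f"final state x={x:.3f}, energy={energy(x):.3f}")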
Genetic algorithms
Genetic algorithms (GA) are inspired by and modeled after evolutionary biology. They consist of a population of (randomly initialized) individual genomes that are evaluated for their fitness with respect to a target. In genetic algorithms, two individuals combine and cross over their genome to produce an offspring. This crossover step is followed by a mutation by which individual bases can change according to a mutation rate. The resulting offspring are added to the population and scored according to their fitness level.
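A minimal sketch of these steps, evolving a random bit-string population toward an all-ones target; the population size, mutation rate, and fitness function are illustrative choices, not anything from the book.

    import numpy as np

    rng = np.random.default_rng(0)
    L, pop_size, mut_rate, n_gens = 32, 50, 0.01, 200

    def fitness(genome):
        return genome.sum()  # target: all-ones bit string

    pop = rng.integers(0, 2, size=(pop_size, L))  # random initial population
    for gen in range(n_gens):
        scores = np.array([fitness(g) for g in pop])
        # select parent pairs proportionally to fitness
        probs = scores / scores.sum()
        parents = pop[rng.choice(pop_size, size=(pop_size, 2), p=probs)]
        # single-point crossover of each parent pair
        cut = rng.integers(1, L, size=pop_size)
        children = np.array([np.concatenate([p[0][:c], p[1][c:]])
                             for p, c in zip(parents, cut)])
        # mutation: flip individual bases with probability mut_rate
        flips = rng.random(children.shape) < mut_rate
        pop = np.where(flips, 1 - children, children)

    print("best fitness:", max(fitness(g) for g in pop), "of", L)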
Final ppt for the KB competition
Today we finally submitted our presentation for the KB Future Finance AI competition - here is a link to it on my google drive. The competition will be held on Tuesday, but there is a rehearsal on Monday evening; we are not sure what will happen there, but I will write about it tomorrow. Tomorrow (Monday) we will travel to Seoul mid-day.
I am getting some nerves haha. Today I rewatched the competitions from previous years to see how the winners are picked and the points they get. Teams are graded on the spot, right after the ppt, and there are two grades out of 10 - one from a professional panel from KB and one from the audience. The competition is all in Korean, but I will share the link for completeness - here
The winners from 2023 got the following points (professional judges + audience = total points):
1. quantChat - 9.3 and 7.1 = 16.4
2. Kbeep voice phishing - 9 and 7.3 = 16.3
3. Ke ke ke - 8.2 and 8.0 = 16.2
4. Kb Athena - 7.8 and 7.4 = 15.2
5. KB DocSearch - 7.8 and 6.3 = 14.1
6. Katering - 7 and 7.3 = 14.3
Writer - 7.3 and 6.0 = 13.3
Money laundering prevention - 7.6 and 6.8 = 14.4
Stock fund - 7.8 and 5.9 = 13.7
KB-Dialect - 7.6 and 6.8 = 14.4
Only the top 6 teams get an award, which is why 4 teams are listed without a ranking.
That is all for today!
See you tomorrow :)