Hello :) Today is Day 275!
A quick summary of today:
- showed my mini paper to my lab professor
- read a bit of Deep Learning by Bishop&Bishop
- found many new books to read
- next steps in the movie review PEFT LLM project
- streamed again
Firstly about the paper
I talked about it on stream, but the tl;dr is: the paper had been living in my repo for 2-3 months. I read a lot about taxi demand matrices and how to predict them, but when I finished writing the literature review and, briefly, the rest of the ‘paper’, I kind of lost hope that it would turn into a good paper, so I gave up on submitting it and on talking more about it with my lab professor. After those 2-3 months, my SO finally convinced me to send it so that I can get help and continue writing it.
Movie review project update
For this project I had LoRA fine-tuned llama3 and a t5 model. Of the two, llama3 performed much better; the code uses the unsloth library, which specialises in faster fine-tuning. Going forward, as per my professor’s suggestion, I will select a few models along with a few base prompts and compare performance across all the combinations (a rough sketch of that loop is below). A reminder: this movie review project is a stand-in for the actual project idea, which I cannot share at the moment but which is similar to this one.
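As a rough sketch of what that comparison could look like (the model list, prompts, and the evaluate() helper are placeholders I made up, not the actual project code):

```python
from itertools import product
import random

# Placeholder candidate models and base prompts (made up for illustration).
models = ["llama3-8b", "t5-base"]
prompts = [
    "Classify the sentiment of this movie review:",
    "Is the following movie review positive or negative?",
]

def evaluate(model_name: str, prompt: str) -> float:
    """Placeholder: fine-tune / run `model_name` with `prompt` and return an accuracy score."""
    return random.random()  # stand-in for the real evaluation

# Run every model x prompt combination and rank the results.
results = {combo: evaluate(*combo) for combo in product(models, prompts)}
for (model_name, prompt), score in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model_name} | {prompt} -> {score:.3f}")
```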
DL by Bishop&Bishop
I talked about it on stream, but unfortunately I did not take notes, and as I was in the lab I did not want to set up my phone to film myself reading, since I did not want to film my labmates. I read chapter 5 (single-layer networks for classification) and chapter 6 (deep neural networks), and then skimmed through some of the other chapters.
New books
Thanks to Bishop&Bishop I was motivated to go and find more books to read. But I wanted to read more foundational knowledge books - math, stats, probability (again). I ended up finding a lot:
- Before Machine Learning - a 3-book series on linear algebra, calculus, and stats & probability
- The Hundred-Page Machine Learning Book and Machine Learning Engineering by Andriy Burkov
Not foundational knowledge, but seem interesting:
- Machine Learning for High-Risk Applications
- Machine Learning for Financial Risk Management with Python
After stream I rested for a bit, and read the 1st book of the Before Machine Learning series (the one on linear algebra).
It is very beginner friendly and written in a casual way.
Here is the link to it on O’Reilly.
Chapter 3
What Is a Vector?
A vector is essentially a list of numbers where the order of the elements matters, often used in machine learning to represent data points like height and weight. For instance, a two-dimensional vector could represent a student as (height, weight). Each vector has both direction and magnitude, with its origin typically at (0,0) in a coordinate system.
Vectors can have multiple dimensions; for example, a three-dimensional vector might be represented as (x, y, z). Understanding vectors involves visualizing their components: the starting point (origin), direction (based on coordinates), and magnitude (the length of the vector).
Vectors can describe real-world phenomena like wind by representing direction and speed. They also allow for mathematical operations, such as addition, which can represent paths or movements. Overall, vectors are foundational elements in mathematics and have wide applications in various fields.
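A tiny numpy sketch of the idea (the height/weight numbers are made up):

```python
import numpy as np

# A two-dimensional vector as an ordered list of numbers, e.g. a (height, weight) pair.
student = np.array([170.0, 65.0])

# Magnitude (length) of the vector, measured from the origin (0, 0).
print(np.linalg.norm(student))  # ~182.0
```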
Vector Addition
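Adding two vectors simply adds the matching entries, which geometrically means placing one vector at the tip of the other. A quick example:

```python
import numpy as np

v = np.array([2.0, 3.0])
w = np.array([1.0, -1.0])

# Element-wise addition: walk along v, then along w.
print(v + w)  # [3. 2.]
```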
Scalar Multiplication
λ ∈ ℝ
A scalar is any real number, such as π, which can be used to scale vectors. The symbol ∈ means “belongs to,” and ℝ represents the set of real numbers. When multiplying a vector by a scalar (λ), there are four possible outcomes based on λ’s value. If λ is greater than 1, the vector stretches while maintaining direction. If 0 < λ < 1, the vector shrinks without changing direction. If λ is less than -1, the vector stretches but reverses direction. If -1 < λ < 0, the vector shrinks and reverses direction. Multiplying a vector by a scalar therefore alters its length and, in some cases, its direction. Multiplying a vector by another vector has not been covered yet - that comes next with the dot product.
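A quick sketch of the four cases:

```python
import numpy as np

v = np.array([2.0, 1.0])

print(2.0 * v)   # [ 4.   2. ]  stretched, same direction     (λ > 1)
print(0.5 * v)   # [ 1.   0.5]  shrunk, same direction        (0 < λ < 1)
print(-2.0 * v)  # [-4.  -2. ]  stretched, reversed direction (λ < -1)
print(-0.5 * v)  # [-1.  -0.5]  shrunk, reversed direction    (-1 < λ < 0)
```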
The Dot Product
The dot product is a method of multiplying two vectors that results in a scalar value. It is calculated by multiplying corresponding elements of two vectors and summing the results. For example, given two vectors, v = (2, 3) and w = (2, 1), their dot product is 2 * 2 + 3 * 1 = 7.
In linear algebra, the dot product has a geometric interpretation. It measures how much one vector aligns with another by projecting one vector onto the other and multiplying their magnitudes. The angle θ between the vectors determines this relationship: v · w = ‖v‖ ‖w‖ cos(θ).
If two vectors point in the same direction, the dot product is positive; if they are perpendicular, the dot product is zero; if they point in opposite directions, it is negative.
In machine learning, the dot product is used to measure similarity. For example, when recommending movies based on user preferences, you can compare movie vectors by calculating the dot product or using cosine similarity, which focuses on vector direction rather than magnitude. Cosine similarity is preferred when the magnitude is not crucial, as in text analysis or machine learning tasks.
Both the dot product and cosine similarity are valuable tools for comparing and manipulating vectors, with their applications depending on whether vector length or direction is more relevant.
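A small sketch checking the numbers above and comparing them with cosine similarity (the “movie preference” vectors at the end are made up for illustration):

```python
import numpy as np

v = np.array([2.0, 3.0])
w = np.array([2.0, 1.0])

# Element-wise multiply and sum: 2*2 + 3*1 = 7
print(np.dot(v, w))  # 7.0

def cosine_similarity(a, b):
    # Dot product normalised by the magnitudes: depends only on direction, not length.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Made-up preference vectors, e.g. (action, romance, comedy) scores for a user and a movie.
user = np.array([5.0, 1.0, 3.0])
movie = np.array([4.0, 0.0, 2.0])
print(cosine_similarity(user, movie))  # ~0.98 -> very similar direction, likely a good match
```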
The Vector Space
A vector space is a structured set of vectors that follows specific rules (axioms) for operations like vector addition and scalar multiplication. These operations must produce results that remain within the vector space. For any collection of vectors to qualify as a vector space, it must adhere to these axioms, which ensure that addition and scalar multiplication behave as expected.
The basis of a vector space is a set of linearly independent vectors that can represent all vectors within that space through a combination of scalar multiples. Changing the basis provides a new perspective on the vectors, useful in different contexts such as machine learning, where altering the basis can reveal patterns or simplify computations. For example, changing the basis for data like house features can help better understand trends such as the total number of rooms or the balance between bedrooms and bathrooms.
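A toy sketch of the house-features idea (the numbers and the particular basis are my own illustration, not the book's): describing a house given as (bedrooms, bathrooms) in a basis of “total rooms” and “bedroom vs bathroom balance” directions.

```python
import numpy as np

# One house described as (bedrooms, bathrooms) in the standard basis.
house = np.array([3.0, 2.0])

# Columns of B are the new basis vectors:
# (1, 1) ~ "total rooms" direction, (1, -1) ~ "bedroom vs bathroom balance" direction.
B = np.array([[1.0, 1.0],
              [1.0, -1.0]])

# Coordinates of the same house in the new basis: solve B @ coords = house.
coords = np.linalg.solve(B, house)
print(coords)  # [2.5 0.5] -> 2.5 * (1, 1) + 0.5 * (1, -1) recovers (3, 2)
```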
Chapter 4: Matrices
Maybe it was because I was exhausted, but the topic is math and there were so many over-the-top casual explanations that I was confused as to why they were there. I am not sure if this makes sense, but in its attempt to explain things very simply, the book added maybe a bit too many abstractions and deviations from the main point, so at some points I needed to re-read 😆 even though I know the material.
Linear Transformations
L : ℝⁿ → ℝᵐ
It must satisfy:
L(v + u) = L(v) + L(u)
L(c ⋅ v) = c ⋅ L(v)
where v and u are vectors that belong to ℝⁿ, and c is a scalar that is a real number.
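Any matrix defines such a map via L(v) = Av; a quick numeric check of the two properties (matrix and vectors picked arbitrarily):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [0.0, 3.0],
              [4.0, -1.0]])  # defines L : R^2 -> R^3 via L(v) = A @ v

v = np.array([1.0, 2.0])
u = np.array([-3.0, 0.5])
c = 2.5

# L(v + u) == L(v) + L(u)
print(np.allclose(A @ (v + u), A @ v + A @ u))  # True
# L(c * v) == c * L(v)
print(np.allclose(A @ (c * v), c * (A @ v)))    # True
```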
The Eigen ‘Stuff’
The book tries, but 3Blue1Brown’s explanation is the best by far.
To be honest, while reading this part I remembered how great the 3Blue1Brown YouTube channel is and how nicely it explains everything.
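For completeness, a tiny numpy example of computing eigenvalues and eigenvectors (toy matrix of my own choosing):

```python
import numpy as np

A = np.array([[2.0, 0.0],
              [1.0, 3.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)  # columns of `eigenvectors` are the eigenvectors
print(eigenvalues)  # eigenvalues 2 and 3

# Check the defining property A @ x = λ * x for the first pair.
x, lam = eigenvectors[:, 0], eigenvalues[0]
print(np.allclose(A @ x, lam * x))  # True
```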
Matrix Decomposition
The singular value decomposition (SVD) factors a matrix as A = UΣVᵀ, where:
- U represents a rotation (it preserves the angles and lengths of vectors)
- Σ is the scaling matrix (a diagonal matrix containing the singular values, which represent scaling factors)
- Vᵀ is another rotation (the transpose of an orthogonal matrix V, which can also be interpreted as a rotation or reflection; its rows form a basis for the row space of the original matrix)
Applied to a unit circle, this shows the matrix acting as a rotation (Vᵀ), then a scaling (Σ), then another rotation (U), turning the circle into an ellipse.
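In numpy (again a toy matrix of my own choosing):

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 3.0]])

U, S, Vt = np.linalg.svd(A)  # S holds the singular values (the diagonal of Σ)
print(S)                     # [4. 2.]

# Reassemble A from the rotation - scaling - rotation factors.
print(np.allclose(U @ np.diag(S) @ Vt, A))  # True
```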
Principal Component Analysis
Principal component analysis creates a new set of axes called the principal axes; projecting the data onto them gives the so-called principal components. These are linear combinations of the original features with some very useful properties: the components are uncorrelated, and the first few capture most of the variance in the data, which makes PCA a good technique for reducing the dimensionality of complex data sets.
Start with some data:
Normalise it using the mean and standard deviation of each feature:
If we treat the normalised table as a matrix A, multiply Aᵀ by A, and divide by N (8, the number of samples), we get the covariance matrix:
Next we need to find the eigenvectors and eigenvalues of this covariance matrix. That can be done through eigendecomposition.
Why?
- Eigenvectors define the principal components (the directions of max variance):
- eigenvectors represent the directions in which the data varies the most
- when you perform PCA, you aim to transform your data into a new coordinate system where the axes (called principal components) correspond to these directions of maximum variance
- the eigenvectors of the covariance matrix (or the correlation matrix) point in these directions of maximum variance
- these eigenvectors become the principal components in PCA, where each principal component is a linear combination of the original features, and they are orthogonal to each other
- Eigenvalues represent the magnitude of variance:
- eigenvalues correspond to the amount of variance (or the ‘importance’) along each principal component (eigenvector)
- larger eigenvalues indicate that the corresponding eigenvector (principal component) captures a larger amount of the data’s variance
- by sorting the eigenvalues in descending order, we can rank the principal components by how much variance they explain. This helps determine which principal components are most important for dimensionality reduction
P is a matrix whose columns are the eigenvectors, and this is where we find our principal axes. It happens to be already sorted by eigenvalue magnitude. In the matrix Σ, on the other hand, we have all the eigenvalues. These represent what can be called the explainability of the variance: how much of the variance present in the data is “captured” by each component.
The sum of the eigenvalues is 5 (the number of features, since the data was standardised), and by dividing each of them by 5 we find the percentage of variance that each individual eigenvalue explains.
The first two components explain ~76% of the variance, so we can use those two as our principal components to reduce our dimensions.
So now we just need to multiply our normalised 8x5 data matrix by this 5x2 matrix (the first two eigenvectors) to get the resulting dimension-reduced feature set.
These components are also called latent or hidden variables; relationships that are hidden in the data and are the result of linear combinations.
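To tie the steps together, here is a minimal end-to-end sketch of the procedure on made-up data (a random 8x5 matrix instead of the book's table, so the numbers will not match the ones above):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 5))                    # stand-in for the 8x5 data table

# 1. Normalise each feature using its mean and standard deviation.
X = (A - A.mean(axis=0)) / A.std(axis=0)

# 2. Covariance (correlation) matrix: X^T X / N.
N = X.shape[0]
C = X.T @ X / N

# 3. Eigendecomposition of the covariance matrix (eigh, since C is symmetric).
eigenvalues, eigenvectors = np.linalg.eigh(C)

# 4. Sort by eigenvalue magnitude, descending.
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# 5. Fraction of variance explained by each component
#    (the eigenvalues sum to 5, the number of standardised features).
print(eigenvalues / eigenvalues.sum())

# 6. Keep the first two principal axes and project: (8x5) @ (5x2) -> (8x2).
P = eigenvectors[:, :2]
reduced = X @ P
print(reduced.shape)                           # (8, 2)
```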
That is all for today!
See you tomorrow :)