(Day 133) Gathering data for the Scottish dataset project + Factor analysis + Grokking ML + MLxFundamentals Day 4 (2)

Ivan Ivanov · May 13, 2024

Hello :) Today is Day 133!

A quick summary of today:

  • started putting together some audio clips + transcription for the Scottish dataset project
  • saw how eigenvalues play a role in factor analysis
  • started reading Grokking Machine Learning on Manning.com
  • finished MLx Fundamentals’ final session

Firstly, about the audio data

My Scottish partner for this project has recorded various phrases in Glaswegian in the past and uploaded them to YouTube. Today I worked through 4 of the 10 videos.

To cut the clips I ended up using an app called VideoPad. Even though it is a paid app, it lets me cut an audio track into smaller pieces and save each piece as a new file.

This is a sample audio waveform of one of the 4 videos

image

What I did was make a short clip around each expression (I am not sure what these waveforms are formally called when there is speech). From the video above, for example, I ended up with 36 clips, and uploaded them all, along with the audio clips from the other 3 videos, to our project’s drive. In total we have 4.75 minutes of audio so far.
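The same kind of clipping can also be scripted. Here is a minimal sketch using only Python’s standard-library `wave` module (the file names, the generated tone standing in for a real recording, and the cut points are all my own hypothetical examples, not part of the project):

```python
import math
import struct
import wave

def write_tone(path, seconds=5, rate=16000, freq=440):
    """Generate a mono 16-bit WAV tone to stand in for a real recording."""
    with wave.open(path, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(rate)
        frames = b"".join(
            struct.pack("<h", int(20000 * math.sin(2 * math.pi * freq * i / rate)))
            for i in range(seconds * rate)
        )
        w.writeframes(frames)

def cut_clip(src, dst, start_s, end_s):
    """Save the [start_s, end_s) slice of src as a new WAV file."""
    with wave.open(src, "rb") as r:
        rate = r.getframerate()
        r.setpos(int(start_s * rate))
        frames = r.readframes(int((end_s - start_s) * rate))
        with wave.open(dst, "wb") as w:
            w.setparams(r.getparams())  # nframes is fixed up on close
            w.writeframes(frames)

write_tone("recording.wav")
cut_clip("recording.wav", "clip_01.wav", 1.0, 2.5)  # a 1.5 s clip
```

For many expressions per video, the cut points could live in a list of `(start, end)` pairs and a loop would emit `clip_01.wav`, `clip_02.wav`, and so on.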

Secondly, today I read about eigenvalues’ role in factor analysis

In my statistics class at uni we covered factor analysis, and at the end of the chapter I saw the word eigenvalues. I am glad, because once again I get to see their real-world impact (after my dive into multicollinearity).

Firstly, about factor analysis: here are the results using unstandardized variables

image

And after standardization

image

Why standardize?

  • result interpretability
  • helps with linearity
  • treats variables equally

Where are the eigenvalues? They are the “overall” values under each factor’s loading column: 1.981 and 1.008. Each is the sum of the squares of the 3 loadings above it.
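That relationship is easy to check by hand. A tiny sketch with hypothetical loadings (the actual loading values are in the table above, which I haven’t reproduced here):

```python
# Hypothetical loading column for one factor (3 variables, as in the table above).
loadings_f1 = [0.81, 0.79, 0.75]

# The eigenvalue reported under a factor's column is the sum of its squared loadings.
eigenvalue_f1 = sum(l ** 2 for l in loadings_f1)
print(round(eigenvalue_f1, 3))  # → 1.843
```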

image

To interpret this, take the 1st row under F1: 0.00089222, meaning F1 accounts for about 0.089% of the variance in Y1 (Finance), while F2 accounts for 99.90%. In total, the 2-factor space accounts for 99.99% of the variance in Y1, Finance.
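The numbers in that row add up as expected; the row total is the communality of Y1, i.e. how much of its variance the two-factor space captures:

```python
# Shares of Y1 (Finance) variance explained, read off the first row of the table.
f1_share = 0.00089222   # factor F1, about 0.089%
f2_share = 0.9990       # factor F2, 99.90%

# The communality: total variance of Y1 captured by the two-factor space.
communality = f1_share + f2_share
print(f"{communality:.2%}")  # → 99.99%
```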

The eigenvalues can also help determine how many factors to keep (e.g. via scree plots). I love it when I see the math I studied for ML being used in practice, and where it is used.
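One common rule of thumb alongside the scree plot is the Kaiser criterion: keep factors whose eigenvalue exceeds 1, since such a factor explains more variance than a single standardized variable. A sketch using the two eigenvalues above plus a third that I’m inferring (with 3 standardized variables the eigenvalues sum to 3, so the remainder is roughly 0.011):

```python
# Eigenvalues from the standardized solution above; the third is inferred,
# since for 3 standardized variables the eigenvalues sum to 3.
eigenvalues = [1.981, 1.008, 0.011]

# Kaiser criterion: retain factors with eigenvalue > 1.
kept = [e for e in eigenvalues if e > 1]
print(len(kept))  # → 2 factors retained
```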

Thirdly, about Grokking ML

I decided to subscribe to manning.com, and the 1st book I picked was Grokking ML (as it is one of the most recommended and popular ones). Today I managed to read the 1st 4 chapters (what is ML, types of ML, linear regression, optimization), and I can definitely see why it is popular with beginners. I am excited to keep reading.
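Since the linear regression and optimization chapters go together, here is a toy fit of a line by gradient descent. This is my own sketch, not the book’s code, and the data is made up:

```python
# Toy 1-D linear regression fit by gradient descent on mean squared error.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]  # generated by y = 2x + 1

w, b, lr = 0.0, 0.0, 0.05
for _ in range(2000):
    # Gradients of MSE with respect to the slope w and intercept b.
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / len(xs)
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / len(xs)
    w -= lr * grad_w
    b -= lr * grad_b

print(round(w, 2), round(b, 2))  # → 2.0 1.0
```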

Finally, the last session of MLx Fundamentals was delivered by Wenhan Han, a PhD candidate at TU Eindhoven.

It was about loading and using an LLM and a diffusion model.

Some interesting bits from both parts are:

Question to an LLM:

How many kinds of human beings are there in the history?

We saw the top answers from the model.

image

Various prompt strategies:

Zero-shot

image

Few-shot

image

Chain-of-thought

image

Then, how to finetune an LLM with ‘unsloth’

image

We add LoRA adapters

image
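The point of LoRA adapters is that they make finetuning cheap: instead of updating a full d × k weight matrix, LoRA learns a low-rank update B·A with B of shape d × r and A of shape r × k. A quick back-of-the-envelope with hypothetical dimensions (the layer size and rank here are my own illustration, not the session’s config):

```python
# Why LoRA adapters are cheap to train: the low-rank update B @ A replaces
# a full update of the d x k weight matrix.
d, k, r = 4096, 4096, 16  # hypothetical layer size and LoRA rank

full_params = d * k        # parameters touched by full finetuning
lora_params = r * (d + k)  # parameters in the two adapter matrices B and A

print(full_params, lora_params, round(full_params / lora_params))  # → 128x fewer
```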

Prepare the data

image

The data was from Hugging Face

image

Train the model

image

And then inference

image

For finetuning, I need to try unsloth by myself. Part of the tutorial required an OpenAI API key, which was unfortunate because I do not have one, so I just watched that part.

As for diffusion models

image

I saw that the library ‘diffusers’ was used, which I did not know about before.

And I played around with it a bit; a picture I tried (and failed) to get was ‘a horse riding an astronaut’.
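The diffusers pipelines need large model downloads, so as a dependency-free note on the idea behind them, here is a toy sketch of the forward (noising) process that diffusion models learn to reverse. The schedule values are a common linear choice, but the whole example is my own illustration:

```python
import math
import random

# Toy forward diffusion on a scalar "image" x0 over T steps:
#   x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise
# so the signal fades toward pure noise; the model learns to undo this.
T = 100
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]  # linear schedule

alpha_bars = []
prod = 1.0
for beta in betas:
    prod *= 1.0 - beta
    alpha_bars.append(prod)  # cumulative product of (1 - beta)

random.seed(0)
x0 = 1.0
x_T = math.sqrt(alpha_bars[-1]) * x0 + math.sqrt(1 - alpha_bars[-1]) * random.gauss(0, 1)
print(round(alpha_bars[-1], 3))  # fraction of signal variance left after T steps
```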

That is all for today!

See you tomorrow :)