Hello :) Today is Day 199!
A quick summary of today:
- saw that all the chapters from the book Build an LLM from scratch have been published, so I decided to continue with it (after a few months of waiting)
I like that even though we are in chapter 5 (out of 7), the author still reminds us, the learners, of the process of going from input text to LLM-generated text, making sure we are all still on the same page.
The goal of this chapter is to train the model, because at the moment, when we try to generate some text, we get gibberish.
At the moment the untrained model is given "Every effort moves you", and it continues this with "rentingetic wasnم refres RexMeCHicular stren".
By the way, the current (untrained) model has the following config:
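Roughly, it should be the book's GPT_CONFIG_124M, with the context length shortened to keep training cheap (a sketch from memory, so the exact values may differ):

```python
# Assumed values, recalled from the book's GPT_CONFIG_124M -- a sketch, not copied from this post.
GPT_CONFIG_124M = {
    "vocab_size": 50257,    # BPE vocabulary of the GPT-2 tokenizer (tiktoken "gpt2")
    "context_length": 256,  # shortened context window used for training in chapter 5
    "emb_dim": 768,         # token/positional embedding dimension
    "n_heads": 12,          # attention heads per transformer block
    "n_layers": 12,         # number of transformer blocks
    "drop_rate": 0.1,       # dropout rate
    "qkv_bias": False,      # no bias terms in the query/key/value projections
}
```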
How can the model learn? Its weights need to be updated so that it starts to predict the target tokens. Here comes good ol' backpropagation, and it requires a loss function which calculates the difference between the desired and actual output (i.e. how far off the model is).
Perplexity is a metric used alongside cross-entropy loss to evaluate model performance in tasks like language modeling. It reflects the model's uncertainty in predicting the next token, with lower perplexity indicating better performance. Perplexity is calculated as torch.exp(loss). For example, a perplexity of 48,725 means the model is roughly as uncertain as if it had to choose among about 48,725 possible tokens. This makes perplexity a more interpretable measure than the raw loss, as it signifies the effective vocabulary size over which the model is uncertain.
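A minimal sketch of the loss/perplexity calculation, assuming a model that returns logits of shape [batch, seq_len, vocab_size] (the function name calc_loss_batch is a placeholder here and may not match the book's code exactly):

```python
import torch

def calc_loss_batch(input_batch, target_batch, model, device):
    # Move the batch to the training device (CPU or GPU)
    input_batch = input_batch.to(device)
    target_batch = target_batch.to(device)
    logits = model(input_batch)  # shape: [batch, seq_len, vocab_size]
    # Flatten batch and sequence dimensions so every token position counts as one prediction
    loss = torch.nn.functional.cross_entropy(
        logits.flatten(0, 1),   # [batch * seq_len, vocab_size]
        target_batch.flatten()  # [batch * seq_len]
    )
    return loss

# Perplexity is just the exponential of the cross-entropy loss:
# perplexity = torch.exp(loss)
```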
This is an interesting fact from the book: “The cost of pretraining LLMs To put the scale of our project into perspective, consider the training of the 7 billion parameter Llama 2 model, a relatively popular openly available LLM. This model required 184,320 GPU hours on expensive A100 GPUs, processing 2 trillion tokens. At the time of writing, running an 8xA100 cloud server on AWS costs around $30 per hour. A rough estimate puts the total training cost of such an LLM at around $690,000 (calculated as 184,320 hours divided by 8, then multiplied by $30).” - wow
Starting to train ~ we can also watch the model's output during training. We give it "Every effort moves you".
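The core of the training loop is roughly the following sketch (the book's version also evaluates on a validation set and prints sample generations along the way; train_simple and the AdamW hyperparameters below are assumptions, not the book's exact code):

```python
import torch

def train_simple(model, train_loader, optimizer, device, num_epochs):
    model.train()
    for epoch in range(num_epochs):
        for input_batch, target_batch in train_loader:
            optimizer.zero_grad()  # reset gradients from the previous batch
            loss = calc_loss_batch(input_batch, target_batch, model, device)
            loss.backward()        # backpropagation: compute gradients of the loss
            optimizer.step()       # update the model weights
        print(f"Epoch {epoch + 1}: last batch train loss {loss.item():.3f}")

# Example usage (assumed hyperparameters):
# optimizer = torch.optim.AdamW(model.parameters(), lr=4e-4, weight_decay=0.1)
# train_simple(model, train_loader, optimizer, device, num_epochs=10)
```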
And by the last (10th) epoch it starts to look like normal text.
A reminder that the text used is “The Verdict” short story by Edith Wharton. The train loss starts at 10.058 and converges to 0.502.
It seems the model is overfitting after the 2nd epoch, which is not that surprising given the small dataset.
Sampling strategies
Temperature - the lower the temperature, the 'less creative' the model is (low temperature sharpens the probability distribution, high temperature flattens it)
Top-k - sampling from only the k tokens with the highest logits when generating (see the sketch after this list)
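Putting both together, a next-token sampler might look something like this (my own sketch, not the book's generate function):

```python
import torch

def sample_next_token(logits, temperature=1.0, top_k=None):
    # logits: 1-D tensor of scores for the next token, shape [vocab_size]
    if top_k is not None:
        # Keep only the k highest logits; everything else gets -inf,
        # so it receives zero probability after the softmax
        top_logits, _ = torch.topk(logits, top_k)
        logits = torch.where(
            logits < top_logits[-1],
            torch.full_like(logits, float("-inf")),
            logits,
        )
    if temperature > 0.0:
        # Lower temperature -> sharper distribution -> "less creative" output
        probs = torch.softmax(logits / temperature, dim=-1)
        return torch.multinomial(probs, num_samples=1)
    # A temperature of 0 means greedy decoding: always take the most likely token
    return torch.argmax(logits, dim=-1, keepdim=True)
```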
On another note ~ I added new clips to the Glaswegian dataset.
We are now at ~1,000 clips and a total of 86 minutes of audio. Here is the link to the dataset on Hugging Face.
Another note ~ a lab mate told me about a banking AI competition.
The bank is Kookmin Bank (국민은행 in Korean). The competition started yesterday and the submission deadline is the 11th of August. We would have to develop some kind of project related to helping customers/workers using AI. We will discuss it more in the coming days/weeks. I am excited for it. Here is a link to the competition (it is in Korean).
That is all for today!
See you tomorrow :)