(Day 55) Learning about tokenization in LLMs

Ivan Ivanov · February 25, 2024

Hello :) Today is Day 55!

A quick summary of today:

  • continuing on the Andrej Karpathy streak from the last few days, today I watched his latest video about tokenization in LLMs
  • I ran yesterday’s MLP code (with the manual backprop) on a Pokemon name dataset and posted it on Kaggle for easier access

So… tokenization in language models.

Using this nice website we can see an example below.

image image

Using the above picture as reference, tokenization is how we turn a piece of text into a representation made of numbers. In the above pic, each coloured piece is a different token from GPT-2’s vocabulary, and on the right is each token’s number. There are different ways to turn a piece of text into numbers, and one of the common ones is UTF-8 encoding, which has 256 base byte values.

image

As seen in the example pic, using UTF-8 we can encode this piece of text. In addition to UTF-8 there are also UTF-16 and UTF-32, but apparently those encodings ‘overshoot’ and are too much for our task. As we can see in the pic below, they add a bunch of 0 bytes, which is unnecessary.

image
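Just to see this concretely, here is a tiny Python check (the example string is my own, made up for illustration):

```python
text = "hello 안녕하세요"  # made-up example mixing ASCII and non-ASCII characters

for encoding in ["utf-8", "utf-16", "utf-32"]:
    raw = text.encode(encoding)
    # UTF-8 stays compact; UTF-16/UTF-32 pad most characters with zero bytes
    print(encoding, len(raw), list(raw))
```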

Ok, we can use UTF-8, but using it directly will result in very long sequences of bytes, and transformers only have a limited attention/context window (for computational reasons), so we want to squash these long sequences into something more appropriate. We don’t want to feed the raw bytes, so instead we can use byte pair encoding.

(Apparently there is research looking into tokenization-free approaches that work on raw bytes directly, without needing to squash them, but a proper method is yet to be found.)

image

So, how do we do byte pair encoding (BPE)?

image

Take a piece of text, encode it with UTF-8, and map the result into a list of integers so we get proper numbers (otherwise it’s a raw bytes object).
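In Python that step is just a couple of lines (a sketch with a made-up string):

```python
text = "Hello world"          # made-up example
raw = text.encode("utf-8")    # a bytes object
tokens = list(map(int, raw))  # plain integers in the range 0..255
print(tokens)                 # [72, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100]
```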

And then

image

pair up the resulting tokens and count their occurrences. For example, in this case the byte pair (101, 32) occurred 20 times.

image

And in that text, 101 and 32 refer to the letter ‘e’ and a whitespace, so “e ” has occurred the most (20 times). The BPE is done through a merge function which takes in a list of tokens, a pair, and the index (new token id) to replace that pair with.

image

The example is: merge([5, 6, 6, 7, 9, 1], (6, 7), 99) -> [5, 6, 99, 9, 1], i.e. replace the pair (6, 7) with 99.
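Here is a sketch of what those two helpers look like (following the lecture; details may differ from the exact code):

```python
def get_stats(ids):
    # count how often each consecutive pair of token ids occurs
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge(ids, pair, idx):
    # replace every consecutive occurrence of `pair` in `ids` with the new token id `idx`
    newids = []
    i = 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            newids.append(idx)
            i += 2  # skip both elements of the merged pair
        else:
            newids.append(ids[i])
            i += 1
    return newids

print(merge([5, 6, 6, 7, 9, 1], (6, 7), 99))  # -> [5, 6, 99, 9, 1]
```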

We can apply the merge function iteratively, and the number of times we do it (find the most common pair, replace it, repeat) is a hyperparameter that is up to us. The more merges we do, the larger our vocabulary gets and the shorter the resulting sequence becomes, and we want to find the ‘sweet spot’.
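A rough sketch of that training loop, reusing the get_stats and merge helpers from above (the text and the num_merges value are made up for illustration):

```python
# assumes get_stats and merge from the sketch above are already defined
text = "tokenization is how we turn a piece of text into numbers"  # made-up training text
tokens = list(text.encode("utf-8"))   # raw UTF-8 bytes as ints in 0..255
num_merges = 10                       # hyperparameter: more merges -> bigger vocab, shorter sequence

ids = list(tokens)
merges = {}                           # (int, int) -> int, the learned merge rules
for i in range(num_merges):
    stats = get_stats(ids)
    if not stats:                     # nothing left to merge
        break
    pair = max(stats, key=stats.get)  # most frequent pair
    idx = 256 + i                     # next unused token id
    ids = merge(ids, pair, idx)
    merges[pair] = idx

print(f"before: {len(tokens)} tokens, after: {len(ids)} tokens")
```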

Using a longer piece of text, the next picture shows doing the merging multiple times. And towards the end, token 256, which is the 1st new token, is actually part of a common pair itself.

image

There is also the compression ratio, which measures by how much we reduced the length of our text’s byte sequence.

image
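Continuing the sketch above, the compression ratio is just the ratio of the two lengths:

```python
# using `tokens` (raw bytes) and `ids` (after merging) from the training sketch above
print(f"compression ratio: {len(tokens) / len(ids):.2f}X")
```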

In the GPT-2 paper, they mention that the same word sometimes gets tokenized in multiple different ways when it appears next to punctuation or other special symbols.

image

So what they do is use a regex pattern, shown below.

image

To avoid these unnecessary merges, as I wrote in the comments, the text is first run through the regex, each resulting chunk is tokenized individually, and then everything is concatenated back together.
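If I remember the lecture correctly, the GPT-2 splitting pattern looks like this (it needs the third-party regex package because of the \p{...} character classes):

```python
import regex as re  # the third-party 'regex' package, not the stdlib 're'

# GPT-2's split pattern: contractions, letters, numbers, other punctuation, whitespace
gpt2_pat = re.compile(r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")

chunks = re.findall(gpt2_pat, "Hello've world123 how's are you!!!?")
print(chunks)  # each chunk gets BPE-encoded separately, then the token lists are concatenated
```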

Special tokens

Special tokens are used to add structure to the token stream. Popular ones are:

<|endoftext|>, <eos>, <pad>, <bos>, <eol>, <math>, <doc>, <reward>

For example, <|endoftext|> tells the model that the text that comes after it has no relation to the text before it. But if we pass these tokens to ChatGPT, they can trigger weird behaviour.

Actually, I found this last token, <reward>, in this presentation from Andrej Karpathy at Microsoft’s open dev day.

image

I passed it to chatgpt and:

image

It does not see it, and it also does not react to any of the other special tokens. In other, more complicated prompts it will ignore them or cause other weird behaviour. (Which is interesting, and kind of funny - can ChatGPT be ‘broken’ by some complicated prompt using these special tokens?)
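For reference, this is also roughly how the tiktoken library treats special tokens on the encoding side (a small sketch; by default encode() refuses text that contains a special token unless you explicitly allow it):

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")
ids = enc.encode("hello <|endoftext|>", allowed_special={"<|endoftext|>"})
print(ids)  # without allowed_special, encode() raises an error for the special token
```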

After the lecture finished, I went to my personal colab and started playing around with the tokenization on random text. I also did one of the tests in Andrej Karpathy’s GitHub repo, which runs the Wikipedia BPE example and checks whether we get the correct output representation.

image

And indeed we get the right answer.
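For reference, the Wikipedia example (“aaabdaaabac” with 3 merges, giving “XdXac” where Z=aa, Y=ab, X=ZY) can be reproduced with the helpers sketched earlier; with their tie-breaking, the expected ids are the ones in the comment:

```python
# assumes get_stats and merge from the earlier sketch are defined
text = "aaabdaaabac"
ids = list(text.encode("utf-8"))   # [97, 97, 97, 98, 100, 97, 97, 97, 98, 97, 99]

for i in range(3):                 # the Wikipedia example uses exactly 3 merges
    stats = get_stats(ids)
    pair = max(stats, key=stats.get)
    ids = merge(ids, pair, 256 + i)

print(ids)                         # expected: [258, 100, 258, 97, 99], i.e. X d X a c
```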

Andrej Karpathy also gave some questions to which tokenization was the answer, and I took note of his answers.

  • Why can’t LLM spell words? Tokenization.
    • Some tokens are very long (e.g. .DefaultCellStyle is a single token), so the model cannot see the individual characters inside them and gets confused.
  • Why can’t LLM do super simple string processing tasks like reversing a string? Tokenization.
    • Same as above
  • Why is LLM worse at non-English languages (e.g. Japanese)? Tokenization.
    • Lack of training data, and the tokenizer is also not trained well on languages other than English (안녕하세요, ‘hello’ in Korean, is 3 tokens, while Hello is 1 token).
  • Why is LLM bad at simple arithmetic? Tokenization.
    • Numbers are tokenized in a fairly arbitrary way. For example, given 3215, the tokenizer might see 3 and 215 in one place, and 32 and 15 in another (see the small snippet after this list).
  • Why did GPT-2 have more than necessary trouble coding in Python? Partly Tokenization.
    • Spacing: the GPT-2 tokenizer encoded each space of Python indentation as a separate token, which wasted a lot of context.
  • What is this weird warning I get about a “trailing whitespace”? Tokenization.
    • The model has rarely seen a whitespace on its own (spaces usually belong to the following token), so ending the prompt with a trailing whitespace after the text and “:” puts it in an unusual state and triggers the warning during text completion.
  • Why the LLM break if I ask it about “SolidGoldMagikarp”? Tokenization.
    • https://www.lesswrong.com/posts/aPeJE8bSo6rAFoLqg/solidgoldmagikarp-plus-prompt-generation
  • Why should I prefer to use YAML over JSON with LLMs? Tokenization.
    • YAML is more token-efficient than JSON for the same data.
  • What is the real root of suffering? Tokenization. xD
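Related to the arithmetic point above, a quick way to inspect how a tokenizer splits numbers (the example strings are made up; the exact splits depend on the tokenizer):

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")
for s in ["3215", " 3215", "1000000", "12345678"]:
    ids = enc.encode(s)
    print(s, ids, [enc.decode([t]) for t in ids])  # how each number gets chopped into tokens
```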

Now, for the pokemon name generator based on the manual backprop code. I did not change much from the original code; the main change, as I mentioned yesterday, is the notes I added while doing the backprop so that I can fully understand what is happening. Today I decided to adapt it to a new dataset (Andrej used human names, I decided to use pokemon names) and also play around with the model’s hyperparameters to see how it reacts.

The full notebook (which is not different from the google colab link I shared yesterday) is here.

But the most interesting part is the generated names:

  • libico abesawing cabezllish
  • kirzons argola drapiad
  • xutdi swirlixs toldan
  • simiroty ledics bruxishoinpaul
  • ferrorua nidoran♂ hicmy
  • hippoono typcanth skorudi
  • koffing charmer roserire
  • tornu venonat girdreec
  • derno herdeer beynode
  • snobbull eevile wlabin
  • scazarill slowperno savimzar
  • cramsampardos electakron

Not bad haha. This is using a context length of 5, an embedding size of 10, and a hidden layer size of 128. (full info in the kaggle link)

That is all for today!

See you tomorrow :)