(Day 70) Testing my backprop knowledge

Ivan Ivanov · March 11, 2024

applying-knowledge nlp

Hello :) Today is Day 70!

Today, I saw someone on X mentioning backprop, so I decided to go back to my ‘backprop ninja’ code and redo it, both to practice and to make sure I really know the code.

I first did Andrej Karpathy’s ‘become a backprop ninja’ tutorial on Day 54, but I have not looked at it in depth since then, and I was not satisfied with how I did it. I learned a lot about how backprop works, but the code did not feel like my own: I wrote the manual backprop myself, but the rest was pretty much copied from Andrej’s code. Today I wanted to fix that and fully make the code ‘my own’, which would also give me a chance to test my knowledge.

First, I gave the variables proper names to make the code more readable. For instance,

original code:

updated:

The full new version is on this GitHub repo.

Besides renaming variables, I wanted to add something else, so I added a new 3rd layer. The new structure is Layer1 - BatchNorm1 - Layer2 - BatchNorm2 - OutputLayer (before it was Layer1 - BatchNorm1 - OutputLayer).
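To make the new stack easier to picture, here is a small sketch that just lists the shape of each stage's parameters and output. The sizes (batch of 32, 30 inputs, 64 hidden units, 27 output classes) and the helper name `trace_shapes` are made up for illustration and are not taken from the repo.

```python
def trace_shapes(batch_size, n_in, hidden_sizes, n_out):
    """Return (name, param_shape, output_shape) for each stage of the stack."""
    stages = []
    fan_in = n_in
    for idx, n_hidden in enumerate(hidden_sizes, start=1):
        # Linear layer: weights are (fan_in, n_hidden), output is (batch, n_hidden).
        stages.append((f"Layer{idx}", (fan_in, n_hidden), (batch_size, n_hidden)))
        # BatchNorm keeps the shape; gain and bias hold one value per feature.
        stages.append((f"BatchNorm{idx}", (1, n_hidden), (batch_size, n_hidden)))
        fan_in = n_hidden
    stages.append(("OutputLayer", (fan_in, n_out), (batch_size, n_out)))
    return stages


# Hypothetical sizes, chosen only for this example.
stages = trace_shapes(batch_size=32, n_in=30, hidden_sizes=[64, 64], n_out=27)
for name, param_shape, out_shape in stages:
    print(f"{name:12s} params {param_shape}  output {out_shape}")
```

In the backward pass, every gradient has the same shape as the tensor it belongs to, so a table like this is a quick reference when a derivative comes out with the wrong dimensions.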

Adding the 3rd layer required me to update the forward and backward pass. It took some time to get the dimensions of the derivatives right, but I got there. After this I started training, and this is where the long wait and the hyperparameter changes began. The original code (2 layers, without much change) managed to reach ~1.5 loss on train and valid.

Mine (now with the 3rd layer) kept fluctuating too much while slowly going down, which is expected given I was working with batches. However, using the same number of epochs for both models, the 2-layer one always reached that ~1.5 loss, while mine fluctuated and most times only reached ~2.0. I know that my modified version of the model is more complex, so it needs different hyperparameters. I mainly tried learning rate adjustments, but the best I could get was ~2.0. I need to run it for longer if I want to see a better result.
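One generic way to test a hand-written backward pass is to compare it against a numerical gradient from finite differences. The sketch below does that for a toy layer (h = tanh(W x), loss = sum of h squared) in plain Python. The toy layer and all function names are assumptions for illustration; this is not the check used in the repo.

```python
import math
import random


def forward(W, x):
    """Toy layer: h = tanh(W @ x), loss = sum(h ** 2). W is n_out x n_in."""
    pre = [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]
    h = [math.tanh(p) for p in pre]
    loss = sum(v * v for v in h)
    return loss, h


def manual_backward(W, x):
    """Hand-derived dloss/dW, with the same shape as W."""
    _, h = forward(W, x)
    # dloss/dh = 2h, dh/dpre = 1 - h^2, dpre_i/dW_ij = x_j
    dpre = [2.0 * v * (1.0 - v * v) for v in h]
    return [[d * x_j for x_j in x] for d in dpre]


def numerical_grad(W, x, eps=1e-5):
    """Central finite differences, nudging one weight at a time."""
    grad = [[0.0] * len(W[0]) for _ in W]
    for i in range(len(W)):
        for j in range(len(W[0])):
            original = W[i][j]
            W[i][j] = original + eps
            loss_plus, _ = forward(W, x)
            W[i][j] = original - eps
            loss_minus, _ = forward(W, x)
            W[i][j] = original  # restore the weight
            grad[i][j] = (loss_plus - loss_minus) / (2 * eps)
    return grad


def max_abs_diff(a, b):
    """Largest element-wise difference between two same-shaped matrices."""
    return max(abs(u - v) for ra, rb in zip(a, b) for u, v in zip(ra, rb))


random.seed(0)
W = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(2)]
x = [random.uniform(-1, 1) for _ in range(3)]
diff = max_abs_diff(manual_backward(W, x), numerical_grad(W, x))
print(f"max abs difference: {diff:.2e}")
```

If the two gradients differ by much more than rounding error, the manual derivative (or its shape) is probably wrong. The same idea extends to each new layer, one tensor at a time.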

After some going back and forth with the learning rate, I added the eval and generation part of the model (updated with more readable variable names), because I wanted to see what kind of names I was getting with a ~2.0 loss.

(1st pic is from the 3-layer model; 2nd pic is from the 2-layer model; both are on GitHub, in different commits)

These were the names I was getting. I knew something was up, because the simpler model (even at loss ~2.5) managed to generate somewhat better names. The 3-layer model generated some 2-letter names, some 1-letter names, and sometimes just an EOS ‘.’. So I started looking into the model, checking whether my implementation of the new batch norm and 3rd layer, and their backward pass, was correct. It was a long back and forth between updating hyperparameters and looking at the train and valid loss and the generated names.

Tomorrow, I will remove my extra batch norm and 3rd layer, make sure the model still works as normal (like the original from Andrej Karpathy), and then start building it up again. Because this is a more complex model, another thing I might try is adding something like dropout, or some kind of lr scheduler (on some runs I tried an lr like 0.01 if epoch < # else 0.001, but it did not improve anything). The model needs more time to train. Tomorrow I will do that haha
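For reference, the two-step learning rate idea above can be written as a tiny helper. This is only a sketch: the name `step_lr` and the cutoff of 100 epochs in the usage lines are placeholders I made up, not the values from my runs.

```python
def step_lr(epoch, switch_epoch, lr_before=0.01, lr_after=0.001):
    """Use lr_before until switch_epoch, then drop to lr_after."""
    return lr_before if epoch < switch_epoch else lr_after


# Hypothetical cutoff, only to show the behaviour around the switch.
for epoch in (0, 99, 100, 150):
    print(epoch, step_lr(epoch, switch_epoch=100))
```

Inside a training loop, the returned value would be used for that epoch's parameter update in place of a fixed learning rate.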

Again, the fully updated code is on the GitHub repo above. I shall go to bed now ~ Tomorrow we start fresh again!

That is all for today!

See you tomorrow :)