(Day 27) Improving deep learning models

Ivan Ivanov · January 28, 2024

Hello :) Today is Day 27!

A quick summary of today:

First of all, I finished what I started yesterday:

I made a simple deep learning model from scratch.

  1. Initialize the params

  2. Loop: do forward prop, compute the cost, do backward prop, update the params

I don’t know if people do this in practice, but I think it helped me to understand the neural network model more deeply.
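
To make those steps concrete, here is a minimal runnable sketch of that loop for a small 2-layer network (tanh hidden layer, sigmoid output). The architecture and hyperparameters are illustrative assumptions, not the exact model I built:

    import numpy as np

    # Minimal 2-layer network trained with plain gradient descent, following the
    # same steps: init params -> forward prop -> cost -> backward prop -> update.
    def train(X, Y, n_h=4, learning_rate=0.1, num_iterations=1000, seed=0):
        rng = np.random.default_rng(seed)
        n_x, n_y = X.shape[0], Y.shape[0]
        m = X.shape[1]

        # 1. Initialize parameters (small random weights, zero biases)
        W1 = rng.standard_normal((n_h, n_x)) * 0.01
        b1 = np.zeros((n_h, 1))
        W2 = rng.standard_normal((n_y, n_h)) * 0.01
        b2 = np.zeros((n_y, 1))

        for i in range(num_iterations):
            # 2a. Forward prop (tanh hidden layer, sigmoid output)
            Z1 = W1 @ X + b1
            A1 = np.tanh(Z1)
            Z2 = W2 @ A1 + b2
            A2 = 1.0 / (1.0 + np.exp(-Z2))

            # 2b. Compute the cross-entropy cost (for monitoring training)
            cost = -np.mean(Y * np.log(A2 + 1e-8) + (1 - Y) * np.log(1 - A2 + 1e-8))
            if i % 100 == 0:
                print(f"iteration {i}: cost {cost:.4f}")

            # 2c. Backward prop
            dZ2 = A2 - Y
            dW2 = dZ2 @ A1.T / m
            db2 = dZ2.sum(axis=1, keepdims=True) / m
            dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)
            dW1 = dZ1 @ X.T / m
            db1 = dZ1.sum(axis=1, keepdims=True) / m

            # 2d. Update parameters with gradient descent
            W1 -= learning_rate * dW1
            b1 -= learning_rate * db1
            W2 -= learning_rate * dW2
            b2 -= learning_rate * db2

        return {"W1": W1, "b1": b1, "W2": W2, "b2": b2}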

Now, on to the Improving Deep Neural Networks: Hyperparameter Tuning, Regularization and Optimization course.

It introduced a basic recipe for ML for diagnosing and addressing high bias and high variance.


There are many methods to address bias or variance; one of them is to normalize the input data.


After normalizing, the cost contours are more even, so gradient descent finds the minimum much more easily.
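
A small sketch of what I mean (the helper name normalize_inputs is mine; it assumes the course convention of X having shape (features, examples)):

    import numpy as np

    # Normalize inputs to zero mean and unit variance, using training-set statistics.
    def normalize_inputs(X, eps=1e-8):
        mu = X.mean(axis=1, keepdims=True)
        sigma = X.std(axis=1, keepdims=True)
        return (X - mu) / (sigma + eps), mu, sigma

    # The same mu and sigma from the training set should be reused on test data:
    # X_test_norm = (X_test - mu) / (sigma + eps)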

We can also use different initializations for the weights and biases of the layers:

  1. Zero initialization: W and b are all set to 0

  2. Large random initialization: W is set to large random values, b is 0

  3. There is also He initialization:

    # He init: scale W by sqrt(2 / fan_in); biases start at zero
    parameters['W' + str(l)] = np.random.randn(layers_dims[l], layers_dims[l-1]) * np.sqrt(2. / layers_dims[l-1])
    parameters['b' + str(l)] = np.zeros((layers_dims[l], 1))


Different initializations lead to different results. Random initialization is good for breaking symmetry, making sure different hidden units learn different things.
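
Pulling the three schemes into one sketch (the function shape, the method argument, and the *10 scale for the "large" case are my own assumptions, not the assignment's exact code):

    import numpy as np

    # The three initialization schemes above, assuming layers_dims = [n_x, ..., n_y].
    def initialize_parameters(layers_dims, method="he", seed=0):
        rng = np.random.default_rng(seed)
        parameters = {}
        for l in range(1, len(layers_dims)):
            shape = (layers_dims[l], layers_dims[l - 1])
            if method == "zeros":            # 1. all zeros: fails to break symmetry
                W = np.zeros(shape)
            elif method == "large_random":   # 2. large random values: slows or derails learning
                W = rng.standard_normal(shape) * 10
            else:                            # 3. He init: scale by sqrt(2 / fan_in)
                W = rng.standard_normal(shape) * np.sqrt(2.0 / layers_dims[l - 1])
            parameters['W' + str(l)] = W
            parameters['b' + str(l)] = np.zeros((layers_dims[l], 1))
        return parameters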

Next, regularization. Three setups were compared (L2 and dropout are sketched in code below the list):

  1. A model without any regularization

  2. Using L2 regularization (an L2-norm penalty on the weights)

  3. Using a dropout layer
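
Here is a rough sketch of the two techniques (simplified helpers of my own, not the assignment functions). L2 adds a penalty of lambda/(2m) times the sum of squared weights to the cost, and the dropout here is the inverted-dropout version from the course:

    import numpy as np

    # L2 regularization: add (lambda / 2m) * sum of squared weights to the cost.
    def l2_penalty(parameters, lambd, m):
        weights = [v for k, v in parameters.items() if k.startswith('W')]
        return (lambd / (2 * m)) * sum(np.sum(np.square(W)) for W in weights)

    # Inverted dropout on an activation matrix A: zero each unit with probability
    # (1 - keep_prob), then rescale so expected activations stay the same.
    def dropout_forward(A, keep_prob=0.8, rng=None):
        if rng is None:
            rng = np.random.default_rng()
        D = (rng.random(A.shape) < keep_prob).astype(A.dtype)
        return (A * D) / keep_prob, D   # keep the mask D to reuse it in backprop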

Next, optimization

The Adam optimizer has three hyperparameters that affect how it behaves (a sketch of one update step follows the list):

  • Beta1: typically set close to 1 (but less than 1). It controls the exponential decay rate for the first moment estimates (the mean of the gradients). A common default value for beta1 is 0.9.
  • Beta2: another exponential decay rate parameter, also typically set close to 1. It controls the decay rate for the second moment estimates (the uncentered variance of the gradients). A common default value for beta2 is 0.999.
  • Epsilon: a small constant added to the denominator to prevent division by zero and improve numerical stability. It ensures the optimizer's calculations don't blow up when the denominator approaches zero. A typical value for epsilon is around 1e-8.
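
To keep the roles of these three straight for myself, here is a sketch of a single Adam update for one parameter array (my own simplified helper, not the course's exact function):

    import numpy as np

    # One Adam update. m and v are the running first and second moment estimates,
    # t is the (1-based) step count used for bias correction.
    def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        m = beta1 * m + (1 - beta1) * grad          # first moment: mean of gradients
        v = beta2 * v + (1 - beta2) * grad ** 2     # second moment: uncentered variance
        m_hat = m / (1 - beta1 ** t)                # bias-corrected estimates
        v_hat = v / (1 - beta2 ** t)
        param = param - lr * m_hat / (np.sqrt(v_hat) + eps)  # epsilon keeps this stable
        return param, m, v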

I was also shown a few methods for adjusting the learning rate during training. Among these, one was presented as the most effective (I sketch a common decay schedule below).
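
For example, here is one common decay schedule; I am not certain this is the specific one the course singled out as most effective:

    # A common learning rate decay schedule: alpha = alpha0 / (1 + decay_rate * epoch)
    def decayed_learning_rate(lr0, epoch, decay_rate=1.0):
        return lr0 / (1 + decay_rate * epoch)

    # Example: lr0 = 0.2 gives 0.2, 0.1, 0.067, 0.05, ... over epochs 0, 1, 2, 3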

Another technique, which also has a slight regularization effect, is batch normalization.


It is a method that normalizes the pre-activation values (z) in a neural network, allowing for more stable training. It uses two learnable parameters, gamma and beta, to adjust the mean and standard deviation of the normalized values.
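
A minimal sketch of the forward computation (training-time batch statistics only; I leave out the running averages used at test time):

    import numpy as np

    # Batch norm on pre-activations Z of shape (n_units, m_examples): normalize each
    # unit over the batch, then rescale/shift with the learned gamma and beta.
    def batch_norm_forward(Z, gamma, beta, eps=1e-8):
        mu = Z.mean(axis=1, keepdims=True)
        var = Z.var(axis=1, keepdims=True)
        Z_norm = (Z - mu) / np.sqrt(var + eps)
        return gamma * Z_norm + beta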



That is all for today!

See you tomorrow :)

Original post in Korean