(Day 31) Natural Language Processing and KCSE day 2

Ivan Ivanov · February 1, 2024

deep-learning nlp reading-research

Hello :) Today is Day 31!

A quick summary of today:

covered sequence models from DeepLearning.AI’s DL course
KCSE day 2

Firstly, about the NLP part of the deep learning specialization

Today’s part was about natural language processing models. The RNN model can process natural language well, but the Gated Recurring Unit (GRU) model can do it more effectively. The RNN model does not remember long-distance relationships well, so the GRU model overcomes its shortcomings. In addition, the RNN model also has the problem of vanishing gradients, so the GRU can overcome it. Vanishing gradient is a problem in which the gradient becomes too small and disappears if the distance is long when doing backprop.

And, I learned a little about BRNN. BRNN is a neural network structure that considers both past and future information by processing input sequential data in both directions. Through this, it is possible to more effectively grasp a long range of dependencies and patterns and is usefully used for sequential data processing such as natural language processing and speech recognition. In general, BRNN is composed of two RNN components, each of which provides a comprehensive input expression by processing sequences in forward and reverse directions.

Next ~ word embeddings. Whereas other methods may use indexing of words, or characters when making text generation models, we can use analogies and word vectors. With the first methods, we might end up with 50,000 classes (if we have 50,000 unique words), instead of that, we can use analogies to group words together.

analogy with gender/royalty/age/food, and each value repesents how close it is to that ‘group’ - like -1 and 1 for male and female, and king (-0.95) - queen (0.97), because the difference in the words comes from the gender, whereas apple/orange are 0, because it is not related to gender at all

Word embeddings encode semantic similarity between words. Words that are semantically similar tend to have similar vector representations. For example, in a well-trained word embedding model, the vectors for “king” and “queen” are expected to be closer to each other than the vectors for “king” and “car”. For example, man to woman is the same as king to {x}. man is -1, woman is 1> (-1) - (1) = -2 > so we look what value with king, gives a value very close to -2. To calculate this similarity, we can use ‘cosine similarity’.

Going back to softmax being used for classification in NLP models (i.e. text generation).

If there are too many classes, the computational cost may be very high, and since the probabilities for all classes are calculated, the computational cost may be high and the learning speed may be slow.

So the hierarchical softmax method was created. Like a tree, if we have 10,000 classes, it tasks, is the class in the first or second 5000, then it splits by 2, and tasks again

You can still use softmax, but you should think of it as a binary classification problem

Give pairs of words to the model, for example, 1 correct pair (orange + juice) and 4 incorrect pairs (orange king, orange book, etc), and the labels.

As for KCSE 2024 day 2

link to paper

The paper introduces SINVAD - a method for creating images to test DNNs with images that very closely resample the actual image. For example, the presentor showed images of numbers, 7 and 9, where both numbers could be seen in the image (a confusing image even for humans), and the point was to create images that try to deceive the machine.

I think this is especially memorable. It was kind of fun “Enhanced Transformer Variant Autoencoder for Generating Various Appropriate Facial Responses”. What got stuck in my head was that the authors want to create natural face videos that listen to a call, conversation. When we communicate, one person talks, the other listens, and the listener usually does a little smile, or nod, other small gestures with their face, and the authors looked into making a model that generates such faces, when given an input ‘talker’, so the result of the model is the ‘listender’. In the ppt, they showed some funny show videos of the generated face, but I did not record it, and neither it is in the paper above.

Tomorrow is the last day and I’ll present. I’ll put the presentation materials in tomorrow’s article. I prepared in Korean, but I’ll do it in English to present a longer content a little faster tomorrow. Still, it seems like the participants speak English , so I hope they present it without any problems. If there is anything I don’t understand well, I’ll explain it in Korean.

That is all for today!

See you tomorrow :)

Original post in Korean