Hello :) Today is Day 76!
A quick summary of today:
- Finished sections 6 and 7 about multilingual retrieval-based LMs and retrieval-based LMs’ challenges and opportunities (ACL 2023)
- Covered lecture 11 of CMU 11-711 Advanced NLP - Distillation, Quantization, and Pruning
Section 6:
Section 7:
Lecture 11: Distillation, Quantization, and Pruning
Problem: The best models for NLP tasks are massive. So how can we deploy NLP systems cheaply, efficiently, and equitably without sacrificing too much performance?
Answer: Model compression
- Quantization - keep the model the same but reduce the number of bits
- Pruning - remove parts of the model while retaining performance
- Distillation - train a smaller model to imitate the larger model
Quantization - no parameter values are changed; they are just stored at reduced precision, up to k bits
Amongst other methods, we can use post-training quantization
We can binarize the parameters and activations
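To make post-training quantization concrete, here is a minimal sketch of symmetric linear quantization of a weight tensor to k bits with numpy. The function names and the single per-tensor scale are my own simplifications; real schemes typically use per-channel scales and zero points.

```python
import numpy as np

def quantize(weights, k=8):
    """Toy post-training quantization: map float weights to k-bit integers
    with a single scale for the whole tensor. Assumes k <= 8 so the
    result fits in int8."""
    qmax = 2 ** (k - 1) - 1                      # e.g. 127 for k=8
    scale = np.abs(weights).max() / qmax         # one scale per tensor
    q = np.clip(np.round(weights / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the integer codes."""
    return q.astype(np.float32) * scale

w = np.array([0.31, -1.24, 0.05, 0.88], dtype=np.float32)
q, s = quantize(w, k=8)
w_hat = dequantize(q, s)   # close to w, but each value stored in 8 bits
```

The rounding error per weight is at most half the scale, which is why 8-bit quantization usually costs very little accuracy.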
Pruning - a number of parameters are set to zero, the rest are unchanged
There is Magnitude pruning (Han et al., 2015; See et al., 2016):
set to 0 a percentage of the parameters with the smallest magnitude
With machine translation, researchers found that we can remove almost half the params with almost no effect on downstream tasks. This is a type of unstructured pruning, because we remove params throughout the model, anywhere we see fit.
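A minimal sketch of unstructured magnitude pruning with numpy (function name and the threshold-via-`np.partition` trick are my own choices):

```python
import numpy as np

def magnitude_prune(weights, sparsity=0.5):
    """Zero out the `sparsity` fraction of weights with the smallest
    absolute value, leaving the rest unchanged."""
    flat = np.abs(weights).flatten()
    k = int(sparsity * flat.size)
    if k == 0:
        return weights.copy()
    # k-th smallest magnitude acts as the pruning threshold
    threshold = np.partition(flat, k - 1)[k - 1]
    mask = np.abs(weights) > threshold
    return weights * mask

w = np.array([[0.9, -0.02], [0.4, -0.75]])
pruned = magnitude_prune(w, sparsity=0.5)   # the smaller half becomes 0
```

The resulting matrix is sparse, but as noted below, that only helps if the hardware can exploit the zeros.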
Problems with unstructured pruning:
we can make the model's weight matrices sparse, but if our hardware cannot take advantage of this sparsity, we don't gain anything, and exploiting sparsity in hardware is currently an active research area.
A more immediately useful idea is structured pruning (Xia et al., 2022).
Instead of picking individual params from the model, we remove entire components/layers from the model. For example,
we can remove almost half the attention heads of a transformer with a negligible impact on performance.
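As a sketch of why structured pruning plays nicely with hardware: removing a head just deletes a contiguous block of a weight matrix, leaving a smaller dense matrix. The layout below (heads own consecutive rows of the output projection) is a hypothetical simplification for illustration:

```python
import numpy as np

def prune_heads(W_o, num_heads, heads_to_drop):
    """Structured pruning sketch: remove whole attention heads by deleting
    their rows from the output projection W_o of shape
    (num_heads * head_dim, d_model). Head i owns rows
    [i*head_dim, (i+1)*head_dim)."""
    head_dim = W_o.shape[0] // num_heads
    keep = [i for i in range(num_heads) if i not in heads_to_drop]
    rows = np.concatenate(
        [np.arange(i * head_dim, (i + 1) * head_dim) for i in keep]
    )
    return W_o[rows]

W_o = np.random.randn(8 * 64, 512)               # 8 heads, head_dim=64
W_small = prune_heads(W_o, num_heads=8, heads_to_drop={1, 5, 6, 7})
# W_small is a dense (4*64, 512) matrix: half the heads removed outright
```

Because the result is still a dense matrix, ordinary GPU kernels get the speedup for free, unlike the scattered zeros of unstructured pruning.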
Distillation - train one model (the ‘student’) to replicate the behaviour of another large(r) model (the ‘teacher’); ~all params are changed
Train the teacher
Generate soft targets - instead of using the actual (hard) labels for training, the teacher outputs soft targets, which are probability distributions over the classes
Train the student using the soft targets, and it learns to mimic the behaviour of the teacher by minimizing the difference between its predictions and the soft targets
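The student's objective can be sketched as a cross-entropy between the teacher's and student's softened output distributions. A minimal numpy version, assuming a temperature parameter T as in Hinton-style distillation (in practice this term is mixed with the ordinary hard-label loss):

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-scaled softmax; higher T gives softer distributions."""
    z = logits / T
    z = z - z.max()            # numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """Cross-entropy of the student's softened predictions against the
    teacher's soft targets. Minimized when the student matches the teacher."""
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    return -np.sum(p_teacher * np.log(p_student + 1e-12))

teacher = np.array([4.0, 1.0, 0.2])
student = np.array([3.5, 1.5, 0.1])
loss = distillation_loss(student, teacher)  # shrinks as student mimics teacher
```

The soft targets carry more signal than hard labels: the teacher's relative probabilities over wrong classes tell the student how classes relate.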
Example is DistilBERT (Sanh et al., 2019)
That is all for today!
See you tomorrow :)