Hello :) Today is Day 76!
A quick summary of today:
- Finished sections 6 and 7 about multilingual retrieval-based LMs and retrieval-based LMs’ challenges and opportunities (ACL 2023)
- Covered lecture 11 of CMU 11-711 Advanced NLP - Distillation, Quantization, and Pruning
Section 6:
Section 7:
Lecture 11: Distillation, Quantization, and Pruning
Problem: The best models for NLP tasks are massive. So how can we deploy NLP systems cheaply, efficiently, and equitably without sacrificing too much performance?
Answer: Model compression
- Quantization - keep the model the same but reduce the number of bits
- Pruning - remove parts of the model while retaining performance
- Distillation - train a smaller model to imitate the larger model
Quantization - no parameter values are changed; they are just stored at reduced precision, up to k bits
Amongst other methods, we can use post-training quantization
We can binarize the parameters and activations
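To make post-training quantization concrete, here is a minimal sketch of symmetric linear quantization of a weight tensor to k bits with numpy. The function names and the single per-tensor scale are my own simplifications; real schemes typically use per-channel scales and zero points.

```python
import numpy as np

def quantize(weights, k=8):
    """Toy post-training quantization: map float weights to k-bit integers
    with a single scale for the whole tensor. Assumes k <= 8 so the
    result fits in int8."""
    qmax = 2 ** (k - 1) - 1                      # e.g. 127 for k=8
    scale = np.abs(weights).max() / qmax         # one scale per tensor
    q = np.clip(np.round(weights / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the integer codes."""
    return q.astype(np.float32) * scale

w = np.array([0.31, -1.24, 0.05, 0.88], dtype=np.float32)
q, s = quantize(w, k=8)
w_hat = dequantize(q, s)   # close to w, but each value stored in 8 bits
```

The rounding error per weight is at most half the scale, which is why 8-bit quantization usually costs very little accuracy.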
Pruning - a number of parameters are set to zero, the rest are unchanged
There is Magnitude pruning (Han et al., 2015; See et al., 2016):
set to 0 a percentage of the parameters with the smallest magnitude
With machine translation, researchers found that we can remove almost half the params with almost no effect on downstream tasks. This is a type of unstructured pruning, because we remove params throughout the model, anywhere we see fit.
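A minimal sketch of unstructured magnitude pruning with numpy (function name and the threshold-via-`np.partition` trick are my own choices):

```python
import numpy as np

def magnitude_prune(weights, sparsity=0.5):
    """Zero out the `sparsity` fraction of weights with the smallest
    absolute value, leaving the rest unchanged."""
    flat = np.abs(weights).flatten()
    k = int(sparsity * flat.size)
    if k == 0:
        return weights.copy()
    # k-th smallest magnitude acts as the pruning threshold
    threshold = np.partition(flat, k - 1)[k - 1]
    mask = np.abs(weights) > threshold
    return weights * mask

w = np.array([[0.9, -0.02], [0.4, -0.75]])
pruned = magnitude_prune(w, sparsity=0.5)   # the smaller half becomes 0
```

The resulting matrix is sparse, but as noted below, that only helps if the hardware can exploit the zeros.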
Problems with unstructured pruning:
we can make the model's weight matrices sparse, but if our hardware cannot take advantage of this sparsity, we don't gain anything, and exploiting sparsity in hardware is currently an active research area.
A more immediately useful idea is structured pruning (Xia et al., 2022).
Instead of picking individual params from the model, we remove entire components/layers from the model. For example,
we can remove almost half the attention heads of a transformer with a negligible impact on performance.
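As a sketch of why structured pruning plays nicely with hardware: removing a head just deletes a contiguous block of a weight matrix, leaving a smaller dense matrix. The layout below (heads own consecutive rows of the output projection) is a hypothetical simplification for illustration:

```python
import numpy as np

def prune_heads(W_o, num_heads, heads_to_drop):
    """Structured pruning sketch: remove whole attention heads by deleting
    their rows from the output projection W_o of shape
    (num_heads * head_dim, d_model). Head i owns rows
    [i*head_dim, (i+1)*head_dim)."""
    head_dim = W_o.shape[0] // num_heads
    keep = [i for i in range(num_heads) if i not in heads_to_drop]
    rows = np.concatenate(
        [np.arange(i * head_dim, (i + 1) * head_dim) for i in keep]
    )
    return W_o[rows]

W_o = np.random.randn(8 * 64, 512)               # 8 heads, head_dim=64
W_small = prune_heads(W_o, num_heads=8, heads_to_drop={1, 5, 6, 7})
# W_small is a dense (4*64, 512) matrix: half the heads removed outright
```

Because the result is still a dense matrix, ordinary GPU kernels get the speedup for free, unlike the scattered zeros of unstructured pruning.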
Distillation - train one model (the ‘student’) to replicate the behaviour of another large(r) model (the ‘teacher’); ~all params are changed
Train the teacher
Generate soft targets - instead of using the actual (hard) labels for training, the teacher outputs soft targets, which are probability distributions over the classes
Train the student using the soft targets, and it learns to mimic the behaviour of the teacher by minimizing the difference between its predictions and the soft targets
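The student's objective can be sketched as a cross-entropy between the teacher's and student's softened output distributions. A minimal numpy version, assuming a temperature parameter T as in Hinton-style distillation (in practice this term is mixed with the ordinary hard-label loss):

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-scaled softmax; higher T gives softer distributions."""
    z = logits / T
    z = z - z.max()            # numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """Cross-entropy of the student's softened predictions against the
    teacher's soft targets. Minimized when the student matches the teacher."""
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    return -np.sum(p_teacher * np.log(p_student + 1e-12))

teacher = np.array([4.0, 1.0, 0.2])
student = np.array([3.5, 1.5, 0.1])
loss = distillation_loss(student, teacher)  # shrinks as student mimics teacher
```

The soft targets carry more signal than hard labels: the teacher's relative probabilities over wrong classes tell the student how classes relate.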
Example is DistilBERT (Sanh et al., 2019)
That is all for today!
See you tomorrow :)