Teaching is Hard: How to Train Small Models and Outperform Large Counterparts | by Salvatore Raieli | Nov, 2023


|MODEL DISTILLATION|AI|LARGE LANGUAGE MODELS|

Distilling the knowledge of a large model is complex, but a new method shows incredible performance

Photo by JESHOOTS.COM on Unsplash

Large language models (LLMs) and few-shot learning have shown that we can use these models for unseen tasks. However, these capabilities come at a cost: a huge number of parameters. This also means you need specialized infrastructure, which restricts state-of-the-art LLMs to only a few companies and research teams.

  • Do we really need a unique model for each task?
  • Would it be possible to create specialized models that could replace them for specific applications?
  • How can we build a small model that competes with giant LLMs on specific applications? Do we necessarily need a lot of data?

In this article, I answer these questions.
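Before diving in, it helps to ground the term in the title: knowledge distillation classically means training a small student model to imitate the output distribution of a large teacher. Below is a minimal sketch of the standard soft-label distillation loss in PyTorch; the temperature `T` and mixing weight `alpha` are illustrative hyperparameters, not values taken from the method discussed later in the article.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Classic soft-label knowledge distillation loss (Hinton et al., 2015).

    Mixes the KL divergence between the temperature-softened teacher and
    student distributions with the usual cross-entropy on hard labels.
    """
    # Soft targets: KL(teacher || student) on temperature-scaled logits,
    # scaled by T^2 so gradient magnitudes stay comparable across temperatures.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: standard cross-entropy with the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```

The student only needs the teacher's logits at training time, which is why distillation is attractive when the teacher is too large to deploy.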

“Education is the key to success in life, and teachers make a lasting impact in the lives of their students.” –Solomon Ortiz

Photo by Fauzan Saari on Unsplash

The art of teaching is the art of assisting discovery. — Mark Van Doren

Large language models (LLMs) have shown revolutionary capabilities. For example, researchers have been surprised by emergent behaviors such as in-context learning. This has led to an increase in the scale of models, with larger and larger models being trained in search of new capabilities that appear only beyond a certain number of parameters.


