## Understand essential techniques behind BERT architecture choices for producing a compact and efficient model

In recent years, the evolution of large language models has skyrocketed. BERT became one of the most popular and efficient models, making it possible to solve a wide range of NLP tasks with high accuracy. After BERT, a number of other models appeared that demonstrated outstanding results as well.

An easily observed trend is that **over time, large language models (LLMs) tend to become more complex, with exponential growth in the number of parameters and the amount of data they are trained on**. Research in deep learning has shown that such scaling usually leads to better results. Unfortunately, it also creates problems, and scalability has become the main obstacle to training, storing and using LLMs effectively.

As a consequence, new LLMs have recently been developed to tackle scalability issues. In this article, we will discuss ALBERT, which was published in 2020 with the objective of significantly reducing the number of BERT parameters.

To understand the underlying mechanisms in ALBERT, we are going to refer to its official paper. For the most part, ALBERT inherits its architecture from BERT. There are three principal differences in the model's architecture, which are addressed and explained below.

Training and fine-tuning procedures in ALBERT are analogous to those in BERT. Like BERT, ALBERT is pretrained on English Wikipedia (2500M words) and BookCorpus (800M words).

When an input sequence is tokenized, each of the tokens is then mapped to one of the vocabulary embeddings. These embeddings are used for the input to BERT.

Let *V* be the vocabulary size (the total number of possible embeddings) and *H* be the embedding dimensionality. For each of the *V* embeddings, we need to store *H* values, which results in a *V x H* embedding matrix. In practice, this matrix is usually huge and requires a lot of memory to store. A more serious problem is that the elements of the embedding matrix are usually trainable, so the model needs a lot of resources to learn appropriate values for them.

For instance, let us take the BERT base model: it has a vocabulary of 30K tokens, each represented by a 768-component embedding. In total, this results in 23M weights to be stored and trained. For larger models, this number is even larger.

This problem can be avoided by using matrix factorization. The original vocabulary matrix *V x H* can be decomposed into a pair of smaller matrices of sizes *V x E* and *E x H*.

As a consequence, instead of using *O(V x H)* parameters, decomposition results in only *O(V x E + E x H)* weights. Obviously, this method is effective when *H >> E*.
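The savings can be checked with a few lines of arithmetic. The sketch below is only an illustration: the helper name `embedding_params` is hypothetical, *V* = 30,000 and *H* = 768 are the BERT base figures quoted above, and *E* = 128 is the embedding size this article attributes to ALBERT models.

```python
def embedding_params(vocab_size, hidden_size, embed_size=None):
    """Count embedding weights, with or without factorization."""
    if embed_size is None:
        # A single V x H matrix, as in BERT.
        return vocab_size * hidden_size
    # A V x E lookup matrix plus an E x H projection matrix.
    return vocab_size * embed_size + embed_size * hidden_size


V, H, E = 30_000, 768, 128  # illustrative sizes taken from the text

full = embedding_params(V, H)
factorized = embedding_params(V, H, E)

print(full)                         # 23040000
print(factorized)                   # 3938304
print(round(full / factorized, 2))  # 5.85
```

With these numbers the embedding part shrinks by a factor of almost six, and the gap grows as *H* gets larger while *E* stays fixed.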

**Another great aspect of matrix factorization is the fact that it does not change the lookup process for obtaining token embeddings**: each row of the left decomposed matrix *V x E* maps a token to its corresponding embedding in the same simple way as it was in the original matrix *V x H*. This way, the dimensionality of embeddings decreases from *H* to *E*.

Nevertheless, with decomposed matrices, the mapped embeddings must then be projected into BERT's hidden space to obtain the model input: this is done by multiplying the corresponding row of the left matrix (*V x E*) by the right matrix (*E x H*).
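The two steps (lookup, then projection) can be sketched in plain Python. This is a toy illustration under stated assumptions, not ALBERT's actual implementation: the function names, the tiny matrix sizes and the random initialization are all hypothetical.

```python
import random


def make_matrix(rows, cols, rng):
    """Create a rows x cols matrix of random weights (toy initialization)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def embed_tokens(token_ids, lookup, projection):
    """Look up E-dimensional embeddings, then project them to H dimensions."""
    hidden_size = len(projection[0])
    outputs = []
    for token_id in token_ids:
        small = lookup[token_id]  # one row of the V x E matrix, length E
        # Row vector (1 x E) times projection matrix (E x H) gives 1 x H.
        projected = [
            sum(small[e] * projection[e][h] for e in range(len(small)))
            for h in range(hidden_size)
        ]
        outputs.append(projected)
    return outputs


rng = random.Random(0)
V, E, H = 10, 4, 8  # toy sizes, far smaller than any real model
lookup = make_matrix(V, E, rng)       # V x E
projection = make_matrix(E, H, rng)   # E x H

hidden_inputs = embed_tokens([3, 7, 3], lookup, projection)
print(len(hidden_inputs), len(hidden_inputs[0]))  # 3 8
```

The lookup itself is unchanged from the unfactorized case; the only extra cost is one small matrix multiplication per token.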

One way to reduce the number of model parameters is to share them: several parts of the model reuse the same weight values. For the most part, this simply reduces the memory required to store the weights. However, **standard procedures like the forward pass (inference) and backpropagation still have to be executed for every layer that uses the shared parameters**.

Weight sharing works best when the shared weights are located in different but structurally similar blocks of the model. Placing them in similar blocks makes it more likely that most of the calculations for the shared parameters during forward propagation or backpropagation will be the same, which gives more opportunities for designing an efficient computation framework.

This idea is implemented in ALBERT, which consists of a stack of Transformer blocks with the same structure, making parameter sharing more efficient. In fact, there are several ways to share parameters across Transformer layers:

- share only attention parameters;
- share only feed-forward network (FFN) parameters;
- share all parameters (used in ALBERT).

In general, it is possible to divide all Transformer layers into N groups of size M each, where the layers within a group share their parameters. Researchers found that the smaller the group size M is, the better the results are. However, decreasing the group size M leads to a significant increase in the total number of parameters.
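The trade-off between group size and parameter count can be sketched as follows. The helper names, the assumption that groups consist of consecutive layers, and the figure of 7 million parameters per layer are all hypothetical and chosen only for illustration.

```python
def build_layer_to_group(num_layers, group_size):
    """Map each layer index to the parameter group it reads weights from.

    Assumption: groups are formed from consecutive layers.
    """
    if num_layers % group_size != 0:
        raise ValueError("num_layers must be divisible by group_size")
    return [layer // group_size for layer in range(num_layers)]


def stored_layer_params(num_layers, group_size, params_per_layer):
    """Unique (stored) encoder parameters: one parameter set per group."""
    num_groups = num_layers // group_size
    return num_groups * params_per_layer


NUM_LAYERS = 12
PARAMS_PER_LAYER = 7_000_000  # hypothetical size of one Transformer layer

for group_size in (12, 6, 3, 1):
    mapping = build_layer_to_group(NUM_LAYERS, group_size)
    total = stored_layer_params(NUM_LAYERS, group_size, PARAMS_PER_LAYER)
    print(group_size, max(mapping) + 1, total)
# 12 1 7000000
# 6 2 14000000
# 3 4 28000000
# 1 12 84000000
```

A group size of 12 corresponds to sharing everything across all 12 layers, while a group size of 1 means no sharing at all, so the stored parameter count grows as M shrinks.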

BERT is pretrained on two objectives: masked language modeling (MLM) and next sentence prediction (NSP). In general, MLM was designed to improve BERT's ability to gain linguistic knowledge, and the goal of NSP was to improve BERT's performance on particular downstream tasks.

Nevertheless, multiple studies showed that it might be beneficial to get rid of the NSP objective, mainly because it is too simple compared to MLM. Following this idea, the ALBERT researchers also decided to remove the NSP task and replace it with sentence order prediction (SOP), whose goal is to predict whether two sentences appear in the correct or the inverse order.

As for the training dataset, all positive pairs of input sentences are taken consecutively from the same text passage (the same method as in BERT). Negative pairs are built in the same way, except that the two sentences are given in inverse order.
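A minimal sketch of how such pairs could be constructed is shown below. It is an illustration rather than ALBERT's real data pipeline: the function name, the label convention (1 for original order, 0 for swapped order) and the 50/50 choice between positive and negative examples are assumptions.

```python
import random


def make_sop_examples(segments, rng):
    """Build (first, second, label) examples from consecutive segments.

    label 1 means the original order, label 0 means the swapped order.
    Assumption: each consecutive pair is swapped with probability 0.5.
    """
    examples = []
    for first, second in zip(segments, segments[1:]):
        if rng.random() < 0.5:
            examples.append((first, second, 1))  # positive: correct order
        else:
            examples.append((second, first, 0))  # negative: inverse order
    return examples


passage = [
    "ALBERT reduces the number of parameters.",
    "It factorizes the embedding matrix.",
    "It also shares weights across layers.",
]

for first, second, label in make_sop_examples(passage, random.Random(0)):
    print(label, "|", first, "|", second)
```

Because both sentences always come from the same passage, the model cannot solve the task by detecting a topic change and has to rely on the order itself.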

It was shown that models trained with the NSP objective cannot accurately solve SOP tasks, while models trained with the SOP objective perform well on NSP problems. These experiments suggest that ALBERT is better adapted to solving various downstream tasks than BERT.

The detailed comparison between BERT and ALBERT is illustrated in the diagram below.

Here are the most interesting observations:

- With only about 70% of the parameters of BERT large, the xxlarge version of ALBERT achieves better performance on downstream tasks.
- ALBERT large achieves performance comparable to BERT large and is 1.7x faster thanks to the massive compression of the parameter count.
- All ALBERT models have an embedding size of 128. The ablation studies in the paper found this value to work best when parameters are shared across layers. Without parameter sharing, increasing the embedding size, for example up to 768, improves metrics, but by no more than 1% in absolute terms, which is a small gain given the increased complexity of the model.
- Though ALBERT xxlarge processes a single iteration of data 3.3x slower than BERT large, experiments showed that when both models are trained for the same amount of time, ALBERT xxlarge demonstrates considerably better average performance on benchmarks than BERT large (88.7% vs 87.2%).
- Experiments showed that ALBERT models with wide hidden sizes (≥ 1024) do not benefit a lot from an increase in the number of layers. That is one of the reasons why the number of layers was reduced from 24 in ALBERT large to 12 in the xxlarge version.
- A similar phenomenon occurs when the hidden-layer size is increased: values larger than 4096 degrade model performance.

At first sight, ALBERT seems preferable to the original BERT models, as it outperforms them on downstream tasks. Nevertheless, ALBERT requires much more computation because of its larger structure. A good example of this issue is ALBERT xxlarge, which has 235M parameters and 12 encoder layers. The majority of these 235M weights belong to a single Transformer block, and they are then shared by each of the 12 layers. Therefore, during training or inference, the computation effectively has to run through more than 2 billion parameters!
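The "more than 2 billion" figure can be reproduced with a rough back-of-the-envelope estimate. The sketch below rests on several assumptions that are not stated for ALBERT xxlarge in this article: a vocabulary of 30K tokens (the BERT base figure), a hidden size of 4096, and the simplification that every weight outside the factorized embeddings belongs to the shared Transformer block.

```python
TOTAL_STORED = 235_000_000   # ALBERT xxlarge parameter count quoted above
NUM_LAYERS = 12              # number of encoder layers quoted above
V, E, H = 30_000, 128, 4096  # assumed vocabulary, embedding and hidden sizes

# Factorized embeddings: V x E lookup matrix plus E x H projection.
embedding = V * E + E * H

# Rough simplification: everything else sits in the single shared block.
shared_block = TOTAL_STORED - embedding

# Each of the 12 layers runs through the shared block once.
effective = embedding + NUM_LAYERS * shared_block

print(embedding)                  # 4364288
print(shared_block)               # 230635712
print(round(effective / 1e9, 2))  # 2.77
```

Under these assumptions, the model stores about 235M weights but performs roughly as much computation as an unshared model with around 2.8 billion parameters.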

For these reasons, ALBERT is better suited to problems where speed can be traded off for higher accuracy. Ultimately, the NLP domain never stands still and is constantly progressing towards new optimisation techniques. It is very likely that ALBERT's speed will be improved in the near future. The paper's authors have already mentioned methods like **sparse attention** and **block attention** as potential ways to accelerate ALBERT.

*All images unless otherwise noted are by the author*
