Neural Attention Mechanisms. The original attention we needed. | by Frankie Cancino | Feb, 2024



You may have heard of Generative AI, ChatGPT, and/or Transformers. These advancements in AI came from decades of research building upon each other. One piece of research that was fundamental to these advancements was the Neural Attention function, introduced by Bahdanau et al., 2014.

Prior to the work on Neural Attention, the state-of-the-art methods for Natural Language Processing (NLP) used Recurrent Neural Networks (RNNs) to encode text into a fixed-length vector and then decode that fixed-length vector into different text. A typical example of this is translation between languages. The original text in one language (say, English) is encoded into a fixed-length vector, then decoded from that vector into a different language (such as French). This is called an encoder-decoder approach.
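The encoder-decoder setup described above can be sketched with Keras recurrent layers. This is a minimal illustration, not the original paper's model: the use of GRU layers and the vocabulary, embedding, and hidden sizes are all illustrative assumptions.

```python
import tensorflow as tf

vocab, emb, hid = 100, 16, 32  # hypothetical sizes for illustration

# Encoder: compress the entire source sentence into one fixed-length state vector
src = tf.keras.Input(shape=(None,), dtype=tf.int32)
_, enc_state = tf.keras.layers.GRU(hid, return_state=True)(
    tf.keras.layers.Embedding(vocab, emb)(src))

# Decoder: generate target tokens conditioned only on that single vector
tgt = tf.keras.Input(shape=(None,), dtype=tf.int32)
dec_out = tf.keras.layers.GRU(hid, return_sequences=True)(
    tf.keras.layers.Embedding(vocab, emb)(tgt), initial_state=enc_state)
logits = tf.keras.layers.Dense(vocab)(dec_out)

model = tf.keras.Model([src, tgt], logits)
```

Note how the decoder sees nothing of the source except `enc_state` — that single fixed-length vector is the bottleneck that attention was designed to remove.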

Neural Attention doesn’t require a function or model to encode the entire input text into a single vector. Instead, the idea is to focus on the words or inputs most relevant for predicting the next word. For example, when translating each new word, a model using neural attention draws on the important information from the input and on the words already translated. This lets models handle longer inputs more effectively, since the “attention” is selective and the model does not have to cram the entire input into one fixed-length representation.

To calculate a basic form of attention, we need three inputs: Queries (Q), Keys (K), and Values (V). These inputs vary by use case and by the data a model is trained on, but you can think of them this way:

  • Queries (Q): the input text currently being processed
  • Keys (K): positions that may or may not contain relevant information
  • Values (V): the potentially relevant information used to generate the output

For this example in Python, we use TensorFlow and its einsum and softmax functions. einsum performs matrix/vector multiplication over arbitrary multi-dimensional matrices or tensors, with the dimensions specified by a short subscript string. softmax normalizes a vector of raw scores into a probability distribution that sums to 1, which is what turns the query-key similarity scores into attention weights.
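A minimal sketch of this basic attention calculation, assuming 2-D query, key, and value matrices and simple dot-product scoring (the function name and shapes are illustrative, not from the original):

```python
import tensorflow as tf

def attention(queries, keys, values):
    # scores[i, j]: dot-product similarity between query i and key j
    # 'ij,kj->ik' multiplies queries (n_q, d) by keys transposed (d, n_k)
    scores = tf.einsum('ij,kj->ik', queries, keys)
    # softmax over the key axis turns each row of scores into
    # attention weights that sum to 1
    weights = tf.nn.softmax(scores, axis=-1)
    # output: weighted sum of the values, one row per query
    return tf.einsum('ik,kj->ij', weights, values)
```

Each output row is a blend of the value rows, weighted by how strongly the corresponding query matched each key — this is the selective focus described above.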
