BERT Encoder Layer
The Encoder is the main building block of the BERT model architecture. As input, the Encoder takes text embeddings produced by the Embedding Layer, and as output, it returns modified embeddings of the same shape.
Because the input and output of the Encoder have the same shape, it is possible to chain multiple Encoders together such that the output of one becomes the input to the next. In fact, the original BERT model architecture includes 12 Encoders chained together in this fashion.
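As a quick preview, here is a minimal sketch of that chaining. It assumes the model and tokenizer objects loaded in the first code sample below, and passes the same default call arguments used throughout this article:
hidden_states = model.bert.embeddings(input_ids=tokenizer(["a short example"])["input_ids"])
for encoder_layer in model.bert.encoder.layer:  # 12 encoder layers in bert-base
    hidden_states = encoder_layer(
        hidden_states=hidden_states,
        attention_mask=[0.0] * hidden_states.shape[1],  # additive mask of zeros: no tokens masked
        head_mask=None,
        encoder_hidden_states=None,
        encoder_attention_mask=None,
        past_key_value=None,
        output_attentions=False,
    )[0]
# the final hidden states have the same shape as the input embeddings
print(hidden_states.shape)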
Adding Context
As mentioned, the BERT Encoder takes embeddings as input and produces embeddings as output. So, how are the embeddings produced by the Encoder different from those produced by the Embedding Layer?
For each input token, the Embedding Layer produces an embedding that represents that specific token, as well as its position in the input sequence of text. The individual word and its position, however, are not enough to understand a word’s meaning in a string of text.
For example, imagine you are given a sequence of 9 words, where the 5th word is “fire.” Can you confidently say what the meaning of “fire” is in this piece of text?
____ ____ ____ ____ fire ____ ____ ____ ____
Obviously, no. Depending on the context (that is, the words before and after), the word “fire” could have one of many different meanings:
He is going to fire one of his employees
There was a huge fire raging through the forest
I learned how to fire a gun last year
In each of the three sentences above, the BERT Embedding Layer produces the same embedding for the word “fire.”
import tensorflow as tf
import transformers
model = transformers.TFBertModel.from_pretrained("bert-base-uncased")
tokenizer = transformers.TFBertTokenizer.from_pretrained("bert-base-uncased")
sentence_1 = tokenizer(["He is going to fire one of his employees"])
embeddings_1 = model.bert.embeddings(input_ids=sentence_1["input_ids"])
# embedding for "fire" is at index [0, 5] (the [CLS] token occupies index 0)
print(f"Sentence 1 embedding: {embeddings_1[0, 5]}")
sentence_2 = tokenizer(["There was a huge fire raging through the forest"])
embeddings_2 = model.bert.embeddings(input_ids=sentence_2["input_ids"])
print(f"Sentence 2 embedding: {embeddings_2[0, 5]}")
sentence_3 = tokenizer(["I learned how to fire a gun last year"])
embeddings_3 = model.bert.embeddings(input_ids=sentence_3["input_ids"])
print(f"Sentence 3 embedding: {embeddings_3[0, 5]}")
Sentence 1 embedding: [ 0.804566 -0.00322175 -0.5201206 ...
-0.11026919 0.05673839 -0.59228456 ]
Sentence 2 embedding: [ 0.804566 -0.00322175 -0.5201206 ...
-0.11026919 0.05673839 -0.59228456 ]
Sentence 3 embedding: [ 0.804566 -0.00322175 -0.5201206 ...
-0.11026919 0.05673839 -0.59228456]
The BERT Encoder Layer, on the other hand, produces a contextualized embedding for each token by encoding information not just about the token itself, but also about the other tokens in the text.
After the three example sentences are passed through the first of BERT’s encoder layers, the embeddings that represent the word “fire” in each of the sentences are no longer identical:
# default call arguments for the encoder layer; the attention mask is an
# additive mask of zeros, meaning none of the 11 tokens is masked out
encoder_call_args = {
    "attention_mask": [0.0]*11,
    "head_mask": None,
    "encoder_hidden_states": None,
    "encoder_attention_mask": None,
    "past_key_value": None,
    "output_attentions": False
}
# retrieve the first of BERT's 12 encoders
encoder_layer = model.bert.encoder.layer[0]
encoder_embeddings_1 = encoder_layer(hidden_states=embeddings_1, **encoder_call_args)[0]
print(f"Sentence 1 embedding: {encoder_embeddings_1[0, 5]}")
encoder_embeddings_2 = encoder_layer(hidden_states=embeddings_2, **encoder_call_args)[0]
print(f"Sentence 2 embedding: {encoder_embeddings_2[0, 5]}")
encoder_embeddings_3 = encoder_layer(hidden_states=embeddings_3, **encoder_call_args)[0]
print(f"Sentence 3 embedding: {encoder_embeddings_3[0, 5]}")
Sentence 1 embedding: [ 1.1353475 0.2934172 -0.94880563 ... -0.35108835 0.3613562
-0.4717724 ]
Sentence 2 embedding: [ 1.6853468 0.60971344 -0.2993545 ... -0.17775661 -0.2377944
-0.32776663]
Sentence 3 embedding: [ 0.6518909 0.16804317 -1.0398874 ... -0.36539418 -0.4513352
-0.26341832]
Comparing Contextual Embeddings
The similarity of contextual embeddings produced by the BERT encoder can be computed using cosine similarity. This allows us to determine the similarity of meaning between two words in their respective contexts.
For example, the cosine similarity between the word “fire” in the first and second sentences is 0.79:
def cosine_similarity(a, b):
    result = (tf.reduce_sum(tf.multiply(a, b))) / (tf.norm(a, ord=2) * tf.norm(b, ord=2))
    return result.numpy()

print("Similarity between 1 and 2: "
      f"{cosine_similarity(encoder_embeddings_2[0, 5], encoder_embeddings_1[0, 5])}"
)
Similarity between 1 and 2: 0.7944693565368652
In the second sentence, “There was a huge fire raging through the forest”, the word “fire” refers specifically to a forest fire. Let’s create a different sentence that also refers to forest fires, and compare the resulting contextual embedding.
sentence_4 = tokenizer(["I knew the forest fire was not far off"])
embeddings_4 = model.bert.embeddings(input_ids=sentence_4["input_ids"])
encoder_embeddings_4 = encoder_layer(hidden_states=embeddings_4, **encoder_call_args)[0]
print("Similarity between 2 and 4: "
f"{cosine_similarity(encoder_embeddings_2[0, 5], encoder_embeddings_4[0, 5])}"
)
Similarity between 2 and 4: 0.9660714864730835
As you can see, the similarity between the word “fire” in sentences 2 and 4 is much higher than the similarity between sentences 1 and 2.
Demo: Contextual Embeddings
Try encoding two sentences below, then touch any token on the left to view the cosine similarity of its embedding with each of the tokens on the right.
Note: This demo uses a smaller version of BERT, so the contextual embeddings produced will be slightly different than those produced by the original BERT model.
Transformers
The BERT Encoder is based on the Transformer architecture proposed in the paper Attention Is All You Need. Each encoder is made up of two parts: a multi-head self-attention layer and a feed-forward neural network.
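In the transformers implementation used throughout this article, both parts are visible as attributes of each encoder layer. Here is a quick inspection sketch, assuming the model object loaded earlier:
encoder_layer = model.bert.encoder.layer[0]
# the multi-head self-attention layer
print(encoder_layer.attention)
# the two dense layers of the feed-forward network
print(encoder_layer.intermediate)
print(encoder_layer.bert_output)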
Attention
In order for the embedding representation of a token to become contextualized, some information about that token’s context (i.e. about the other tokens in the input sequence) must be incorporated into its embedding. This is the responsibility of the Attention mechanism in the Encoder.
To intuitively understand the concept of attention, consider the sentence from earlier:
There was a huge fire raging through the forest
When we were only given the word fire, it was impossible to discern the meaning of this word within the context of the sentence. However, by seeing the entire sentence it is clear that the word fire refers to a forest fire.
But what exactly about the sentence makes this meaning clear? And do we actually need to see the entire sentence to reach this conclusion?
In most cases, it is not necessary to see the entire sentence to determine the meaning of a particular word. For example, given two additional words (“raging” and “forest”), the meaning of fire in the text becomes fairly clear:
____ ____ ____ ____ fire raging ____ ____ forest
If, however, you were given the words “was” and “the”, then the meaning of fire still remains opaque:
____ was ____ ____ fire ____ ____ the ____
This is all to say that, when ascertaining the meaning of the word fire in this sentence, we pay more attention to certain words (i.e. “raging” and “forest”) than to others (i.e. “was” and “the”).
To generalize this idea, we can use the following intuitive definition of attention:
Given a sentence S and two words A and B within the sentence, the Attention value between A and B defines the degree to which B helps to explain the meaning of A.
For example, when the sentence “There was a huge fire raging through the forest” is passed through the Encoder’s attention mechanism, an Attention value is computed for each pair of words in the sentence. The chart below shows the Attention values of each word relative to the word fire.
Demo: Attention Scores
Computing Attention
Now that we have an understanding of what the Attention values represent, we can look at how these values are computed by the Encoder. The figure below demonstrates how Attention values are computed given a set of word embeddings:
Let’s break down the process step-by-step.
- Imagine that two embeddings representing the words "hello" and "world" are passed to the Encoder. First, each embedding is passed separately through two fully-connected layers called the Query Layer and Key Layer, producing a modified embedding for each word.
- The embeddings produced by the Query Layer are referred to as query vectors, and the embeddings produced by the Key Layer are referred to as key vectors. Notice that, for each word in the input, there is now both a query vector and a key vector.
- Matrix multiplication is performed between the query vectors and the transpose of the key vectors, producing raw attention scores as an \( N \times N \) matrix, where \(N\) is the number of words in the input sequence, in this case, 2.
- Next, the raw attention scores are scaled down to avoid producing very large attention scores. Specifically, the scores are multiplied by \( \frac{1}{\sqrt{d_{k}}} \), where \(d_{k}\) is the number of elements in each key vector.
- Finally, a Softmax operation is performed on the scaled attention scores to produce a probability distribution over the input sequence.
The following code sample shows how to compute the attention scores using the TFBertModel class. This code example is purely for demonstration; in any real application, the output_attentions option should be used to compute attention scores with the TFBertModel class.
import math
encoder_attention_layer = model.bert.encoder.layer[0].attention.self_attention
# tokenize the input string and compute embeddings
tokens = tokenizer(["hello world"])
embeddings = model.bert.embeddings(tokens['input_ids'])
query_layer = encoder_attention_layer.query
key_layer = encoder_attention_layer.key
# compute the query and key vectors. only the query and key
# vectors for the first attention head are selected here. See the section
# on "Multi-Head Attention" below for details
query_vectors = query_layer(embeddings)[:, :, :encoder_attention_layer.attention_head_size]
key_vectors = key_layer(embeddings)[:, :, :encoder_attention_layer.attention_head_size]
raw_attention_scores = tf.matmul(query_vectors, key_vectors, transpose_b=True)
scaled_attention_scores = raw_attention_scores / math.sqrt(key_vectors.shape[-1])
attention_scores = tf.nn.softmax(scaled_attention_scores, -1)
print(f"""Attention scores for 'hello world':
{attention_scores}
""")
Attention scores for 'hello world':
[[[0.14938147 0.11455351 0.09194218 0.6441229 ]
[0.17278576 0.21243298 0.37015226 0.24462911]
[0.16546023 0.35253984 0.20158045 0.28041947]
[0.3114629 0.18462852 0.14926663 0.35464197]]]
Value Vectors
The attention scores are essentially weights representing the degree of relatedness between each pair of words in the input sequence. In the attention scores matrix, the value at index \( [i, j] \) represents the strength of the relationship between the \(i^{th}\) and \(j^{th}\) tokens.
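For example, continuing from the attention_scores computed above for "hello world" (which is tokenized as [CLS], hello, world, [SEP]), individual scores can be read directly out of this matrix:
# how much "hello" (token 1) attends to "world" (token 2), and vice versa.
# note that attention is not symmetric: the two scores are generally different
print(f"hello -> world: {attention_scores[0, 1, 2]}")
print(f"world -> hello: {attention_scores[0, 2, 1]}")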
With these weights computed, the next step is to apply them to the original sequence to add context to the embeddings used to represent each word. Similar to the Query Layer and Key Layer from earlier, the attention mechanism relies on a third fully-connected layer called the Value Layer.
The original input embeddings are passed through the Value Layer to produce value vectors. The attention scores are then applied to the value vectors through matrix multiplication:
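The following sketch continues from the query and key code above, again keeping only the slice belonging to the first attention head. It mirrors, rather than reproduces, the library's internal code path:
value_layer = encoder_attention_layer.value
# compute value vectors for the first attention head only
value_vectors = value_layer(embeddings)[:, :, :encoder_attention_layer.attention_head_size]
# weight the value vectors by the attention scores:
# shapes (1, 4, 4) x (1, 4, 64) -> (1, 4, 64)
head_output = tf.matmul(attention_scores, value_vectors)
print(f"Attention head output shape: {head_output.shape}")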
Multi-Head Attention
The entire attention process described so far – namely producing query and key vectors, computing attention scores, then multiplying these attention scores by the value vectors – is everything that happens within a module called an Attention Head. The BERT Encoder uses a Multi-Head Attention mechanism, meaning that more than one attention head can be executed within a single Encoder.
The original BERT model architecture includes 12 attention heads in each Encoder layer. To better illustrate multi-head attention, however, consider an Encoder layer with just 2 attention heads.
The diagram below shows a high-level picture of multi-head attention with 2 attention heads. The abbreviations Q, K, and V are used in the diagram to indicate the Query, Key, and Value layers, respectively.
Note: The shapes of the inputs and intermediate outputs are included to illustrate how multi-head attention produces outputs in the same shape as the input. The values above assume an input length of 2 tokens and an embedding dimension (i.e. hidden size) of 768.
Notice that the input embeddings each have 768 elements, while the query, key, and value vectors computed in the attention heads have only 384 values each. The value 384 in this example is called the Attention Head Size, and is equal to:
\( \text{Attention Head Size} = \frac{\text{Hidden Size}}{\text{Number of Attention Heads}} = \frac{768}{2} = 384 \)
This ensures that, after concatenating the results of each attention head, the final output has the same shape as the input embeddings.
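Assuming the same model object from earlier, the attention head size of the actual bert-base model can be computed from its configuration. With 12 heads instead of the 2 used in this illustration, the head size is 64 rather than 384:
config = model.config
print(f"Hidden size: {config.hidden_size}")                         # 768
print(f"Number of attention heads: {config.num_attention_heads}")   # 12
print(f"Attention head size: {config.hidden_size // config.num_attention_heads}")  # 64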
The use of multiple attention heads helps the BERT Encoder model different kinds of relationships between words in a given sentence. Consider the sentence:
The dog chased the cat up a tree
The attention scores relative to the word cat are shown below. This time, the attention scores of 2 different heads are shown on separate charts:
Attention Head 1
Attention Head 2
In this example, the first attention head focuses mostly on the words “dog” and “chased”, while the second head focuses more on the words “up”, “a”, and “tree” at the end of the sentence.
Feed Forward Neural Network
The last step in the Encoder layer is a feed-forward neural network consisting of two dense layers separated by a GELU Activation Function.
The embeddings produced by the multi-head attention mechanism are passed through these layers in a “position-wise” fashion. In other words, the embedding of each token in the input sequence is passed independently through the same two dense layers.
As an example, consider the embeddings produced by the multi-head attention mechanism for the input string "hello world":
# tokenize the input string and compute embeddings
tokens = tokenizer(["hello world"])
embeddings = model.bert.embeddings(tokens['input_ids'])
# get the first encoder layer
encoder = model.bert.encoder.layer[0]
# set defaults for the required arguments
encoder_call_args = {
    "attention_mask": [0.0]*len(tokens['input_ids'][0]),
    "head_mask": None,
    "encoder_hidden_states": None,
    "encoder_attention_mask": None,
    "past_key_value": None,
    "output_attentions": False
}
# get the multi head attention operation
multi_head_attention = encoder.attention
# execute multi head attention on the input embeddings
attention_embeddings = multi_head_attention(embeddings, **encoder_call_args)[0]
# get the output embeddings for "hello" and "world"
hello_embedding = attention_embeddings[:, 1:2, :]
world_embedding = attention_embeddings[:, 2:3, :]
The two dense layers in the feed-forward network can be accessed using the intermediate and bert_output attributes of the encoder layer. The embeddings for each token are passed independently through each dense layer:
feed_forward_layer_1 = encoder.intermediate
feed_forward_layer_2 = encoder.bert_output

def feed_forward_network(attention_output):
    intermediate_output = feed_forward_layer_1(attention_output)
    # the second layer adds the attention output back in (residual connection)
    # and applies layer normalization
    return feed_forward_layer_2(intermediate_output, input_tensor=attention_output)

hello_output = feed_forward_network(hello_embedding)
world_output = feed_forward_network(world_embedding)
print(f"Output shape for token 'hello': {hello_output.shape}")
print(f"Output shape for token 'world': {world_output.shape}")
Output shape for token 'hello': (1, 1, 768)
Output shape for token 'world': (1, 1, 768)
Summary
At the end, we are left with something that looks like what we started with: an embedding with 768 values for each input token. As we have seen, however, the embeddings produced by the Encoder layer are not just representations of individual words, but rather representations of words within a particular context.
These contextualized embeddings can then be passed to another Encoder layer for further refinement, or be used for a variety of different Natural Language Processing tasks like named entity recognition, question answering, or sentiment analysis.
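For reference, all of these steps (the Embedding Layer plus all 12 Encoder layers) are performed in a single end-to-end forward pass when the model is called directly. This minimal sketch assumes the model and tokenizer objects loaded at the start of the article:
tokens = tokenizer(["There was a huge fire raging through the forest"])
outputs = model(tokens)
# last_hidden_state holds the fully contextualized embedding of every token
# after all 12 encoder layers: shape (batch size, sequence length, 768)
print(outputs.last_hidden_state.shape)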
Next time, we will look at how BERT can be used to solve some of these different NLP problems.