BERT Tokenization

This page is about tokenization: the process of breaking a piece of text down into smaller units called tokens and assigning a numerical value to each token.

The code examples use the TFBertTokenizer class from the open-source Hugging Face Transformers library, which maintains implementations of several popular model architectures.

The example below shows what happens when TFBertTokenizer is used to tokenize the string "hello world":

import transformers

# Load the tokenizer for the bert-base-uncased model, disabling the extra
# outputs so that only input_ids are returned.
tokenizer = transformers.TFBertTokenizer.from_pretrained(
    "bert-base-uncased",
    return_attention_mask=False,
    return_token_type_ids=False
)
tokenizer(['hello world'])
{'input_ids': <tf.Tensor: shape=(1, 4), dtype=int64, numpy=array([[ 101, 7592, 2088,  102]])>}

The tokenizer outputs a dictionary with a single key, input_ids, and a value that is a tensor of 4 integers. These integer values are based on the input string, "hello world", and are selected using a vocabulary stored within the tokenizer.

The vocabulary of the TFBertTokenizer class is downloaded from the HuggingFace Model Hub by calling the from_pretrained() method and passing the name of the model (in this case, bert-base-uncased). Specifically, this downloads the vocabulary file at https://huggingface.co/bert-base-uncased/blob/main/vocab.txt.
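If you want to inspect that file directly, one option is to download it yourself; a minimal sketch, assuming the huggingface_hub package is installed:

from huggingface_hub import hf_hub_download

# Download vocab.txt from the bert-base-uncased repository (cached locally).
vocab_path = hf_hub_download(repo_id="bert-base-uncased", filename="vocab.txt")

with open(vocab_path, encoding="utf-8") as f:
    vocab = f.read().splitlines()

print(len(vocab))   # 30522 tokens, one per line
print(vocab[7592])  # hello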

Vocabulary

The tokenizer’s vocabulary is the list of every token that it is capable of tokenizing. For now, you can think of a token as being analogous to a word. In the example, the input string consists of the tokens hello and world.

To produce the IDs as shown in the example, the tokenizer looks up the index of each input token in the vocabulary. The vocabulary of a TFBertTokenizer object can be accessed using the vocab_list attribute.

print(f"Index of 'hello': {tokenizer.vocab_list.index('hello')}")
print(f"Index of 'world': {tokenizer.vocab_list.index('world')}")
Index of 'hello': 7592
Index of 'world': 2088

The token hello is at index 7,592 in the vocabulary, and world is at index 2,088. This accounts for the 2nd and 3rd values from the output of the example (i.e. [[ 101, 7592, 2088, 102]]), but what about the values 101 and 102?

IDs 101 and 102 correspond to special tokens that mark the beginning and end of an input sequence, respectively. These tokens are represented by the string values [CLS] and [SEP] and are inserted automatically into the tokenizer output.

print(f"Token with id 101: {tokenizer.vocab_list[101]}")
print(f"Token with id 102: {tokenizer.vocab_list[102]}")
Token with id 101: [CLS]
Token with id 102: [SEP]

All in all, the result of tokenizing the string "hello world" is 4 different IDs:

  • 1 for the special start token [CLS]
  • 1 for the word hello
  • 1 for the word world
  • 1 for the special end token [SEP]
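The same four IDs can be mapped back to their tokens by indexing into the vocabulary:

print([tokenizer.vocab_list[i] for i in [101, 7592, 2088, 102]])
# ['[CLS]', 'hello', 'world', '[SEP]']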

The BERT Tokenizer’s vocabulary contains 30,522 unique tokens. Try searching for a few words to see what is present in the vocabulary.
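For example, using the tokenizer object created above (a quick sketch; the chosen words are arbitrary):

# Look up a handful of words; in-vocabulary words print their index.
for word in ["hello", "world", "cat", "computer"]:
    if word in tokenizer.vocab_list:
        print(f"'{word}' is at index {tokenizer.vocab_list.index(word)}")
    else:
        print(f"'{word}' is not in the vocabulary")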


Handling Punctuation

In the example above, the tokenization process was as simple as splitting the input string on whitespace, and then mapping each word to its corresponding index in the vocabulary.

In the real world, however, text is rarely as simple as "hello world". Oftentimes it contains punctuation (e.g. !, ?, ;), whitespace characters such as tabs (\t) and newlines (\n), or non-ASCII characters like emojis 🫤.

To handle these cases, the BERT Tokenizer performs a few preprocessing operations.

First, all whitespace characters such as tabs and newlines are converted into single spaces. This means that the following input strings all result in the same input IDs after tokenization:

# original input string
print(tokenizer(['hello world']))

# input string with tab (\t) character
print(tokenizer(['hello	world']))

# input string with newline (\n) character
print(tokenizer(['''
    hello
    world
''']))
{'input_ids': <tf.Tensor: shape=(1, 4), dtype=int64, numpy=array([[ 101, 7592, 2088,  102]])>}
{'input_ids': <tf.Tensor: shape=(1, 4), dtype=int64, numpy=array([[ 101, 7592, 2088,  102]])>}
{'input_ids': <tf.Tensor: shape=(1, 4), dtype=int64, numpy=array([[ 101, 7592, 2088,  102]])>}

Next, whitespace is added before and after every punctuation character. This allows punctuation characters to be treated as separate input tokens, apart from the words that they are connected with in the input string.

For example, the string "hello, world!" is split into the following 6 tokens:

[CLS] hello , world ! [SEP]

Which correspond to the following indices in the tokenizer’s vocabulary:

101 7592 1010 2088 999 102

Example:

print(tokenizer(['hello, world!']))
{'input_ids': <tf.Tensor: shape=(1, 6), dtype=int64, numpy=array([[ 101, 7592, 1010, 2088,  999,  102]])>}
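To build intuition for the whitespace-insertion step, here is a rough approximation using a regular expression. This is only an illustration of the idea, not the library's actual preprocessing code:

import re

def naive_punct_split(text):
    # Surround every punctuation character with spaces, then split on whitespace.
    spaced = re.sub(r"([^\w\s])", r" \1 ", text)
    return spaced.split()

print(naive_punct_split("hello, world!"))
# ['hello', ',', 'world', '!']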

Out-of-vocabulary tokens

The BERT Tokenizer’s vocabulary contains a limited set of unique tokens, which means that it is possible to come across a token that is not present in the vocabulary. To handle such cases, the vocabulary contains a special token, [UNK], which is used to represent any “out-of-vocabulary” input token.

print(tokenizer(['hello world 👋']))
print("Token with id 100: {tokenizer.vocab_list[100]}")
{'input_ids': <tf.Tensor: shape=(1, 5), dtype=int64, numpy=array([[ 101, 7592, 2088,  100,  102]])>}
Token with id 100: [UNK]

Subword Tokenization

One way that the BERT tokenizer is able to effectively handle a wide variety of input strings with a limited vocabulary is by using a subword tokenization technique called WordPiece.

This technique allows certain out-of-vocabulary words to be represented as multiple in-vocabulary “sub-words”, rather than as the [UNK] token.

For example, consider the words "clock" and "clockwork". In the tokenizer’s vocabulary, "clock" is at index 5,119, but "clockwork" is not present at all.

print(f"Index of 'clock': {tokenizer.vocab_list.index('clock')}")
print(f"Index of 'clockwork': {tokenizer.vocab_list.index('clockwork')}")
Index of 'clock': 5119
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-12-87a92b407035> in <cell line: 2>()
      1 print(f"Index of 'clock': {tokenizer.vocab_list.index('clock')}")
----> 2 print(f"Index of 'clockwork': {tokenizer.vocab_list.index('clockwork')}")

ValueError: 'clockwork' is not in list

When tokenizing the string "clockwork", instead of converting it into the [UNK] token, the tokenizer converts this single word into two separate tokens: clock and ##work.

print(f"Tokenization of 'clockwork': {tokenizer(['clockwork'])}")
print(f"Token with id 5119: {tokenizer.vocab_list[5119]}")
print(f"Token with id 6198: {tokenizer.vocab_list[6198]}")
Tokenization of 'clockwork': {'input_ids': <tf.Tensor: shape=(1, 4), dtype=int64, numpy=array([[ 101, 5119, 6198,  102]])>}
Token with id 5119: clock
Token with id 6198: ##work

The ## characters are used by the tokenizer as a suffix indicator to distinguish between “work” as a suffix (as in “clockwork”, “classwork”, etc.), and “work” as a standalone word.
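Both forms exist as separate entries in the vocabulary, and they map to different IDs:

print(f"Index of 'work':   {tokenizer.vocab_list.index('work')}")   # standalone word
print(f"Index of '##work': {tokenizer.vocab_list.index('##work')}") # subword continuation (6198, as above)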

Sometimes there are multiple valid ways to split a word into subword tokens. For example, the word "metalworking" could be split in either of the following ways:

metal ##work ##ing

metal ##working

The BERT tokenizer resolves this greedily: starting from the beginning of the word, it repeatedly takes the longest piece that exists in the vocabulary, which means that the string "metalworking" is split into the tokens metal and ##working.

print(f"Tokenization of 'metalworking': {tokenizer(['metalworking'])}")
print(f"Token with id 3384: {tokenizer.vocab_list[3384]}")
print(f"Token with id 21398: {tokenizer.vocab_list[21398]}")
Tokenization of 'metalworking': {'input_ids': <tf.Tensor: shape=(1, 4), dtype=int64, numpy=array([[  101,  3384, 21398,   102]])>}
Token with id 3384: metal
Token with id 21398: ##working
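Conceptually, this splitting step works as a greedy longest-match-first loop: the longest piece found in the vocabulary is taken, and the process repeats on whatever remains of the word. The function below is a simplified sketch of that idea (not the library's actual implementation), reusing the vocabulary of the tokenizer object from earlier:

def wordpiece_split(word, vocab, unk="[UNK]"):
    # Greedy longest-match-first subword splitting (simplified sketch).
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Find the longest piece, starting at `start`, that is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces use the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no valid split exists: fall back to the unknown token
        tokens.append(piece)
        start = end
    return tokens

print(wordpiece_split("metalworking", set(tokenizer.vocab_list)))
# expected: ['metal', '##working']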

Similarly, the word "artwork" could be represented by the tokens art and ##work. However, because the vocabulary contains the token artwork, this single token is used instead.

print(f"Tokenization of 'artwork': {tokenizer(['artwork'])}")
print(f"Token with id 8266: {tokenizer.vocab_list[8266]}")
Tokenization of 'artwork': {'input_ids': <tf.Tensor: shape=(1, 3), dtype=int64, numpy=array([[ 101, 8266,  102]])>}
Token with id 8266: artwork

WordPiece Tokenization

Why is it that the word “artwork” is present in the vocabulary, but “clockwork” is not? Although the process used to create the BERT tokenizer’s vocabulary deserves a separate page altogether, it is worth noting here that the vocabulary was created using an algorithm that attempts to select the most frequently repeated words and subwords across a large corpus of text.

More details about this algorithm can be found in the HuggingFace Course on WordPiece Tokenization.

Demo: BERT Tokenizer

Try out the BERT tokenizer in your browser by entering a string below:
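If you are following along in a Python session instead, a small helper like the one below (a sketch built on the tokenizer object defined earlier) gives roughly the same experience:

def show_tokens(text):
    # Tokenize the input string and print each ID next to its token.
    ids = tokenizer([text])["input_ids"].numpy()[0]
    for i in ids:
        print(f"{i:>6}  {tokenizer.vocab_list[i]}")

show_tokens("hello, world!")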

Summary

The goal of this guide is to understand the entire journey that a string of text takes when passed to a BERT model. This page described the first stage of this journey, tokenization, where the string is split into a list of tokens, which are then assigned numerical values using the tokenizer’s vocabulary.

Next, these numerical values are passed to the BERT Embeddings Layer.
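As a preview of that next stage, the dictionary produced by the tokenizer can be passed directly to a TF BERT model; a minimal sketch (the pretrained weights are downloaded on first use):

model = transformers.TFBertModel.from_pretrained("bert-base-uncased")

# The tokenizer output feeds straight into the model; the embeddings layer is
# the first thing the input_ids pass through.
outputs = model(tokenizer(["hello world"]))
print(outputs.last_hidden_state.shape)  # (1, 4, 768): one vector per token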

References

Google Research BERT repository: https://github.com/google-research/bert

TensorFlow.js implementation of the BERT Tokenizer: https://github.com/tensorflow/tfjs-models