A friend of mine asked me to explain how one goes from words to tokens, and eventually to a representation of text that machine learning algorithms can understand. Since HuggingFace replaced their Python implementation with a Rust one, I thought making a wasm experiment would be fun: all the examples below load the full tokenizer model locally.
How do you teach a machine to read? It’s a hard question. Intuitive concepts don’t translate into equations or into programmable logic; try explaining an idea without referring to common-sense knowledge. We instead devise puzzle-crafting algorithms to coerce the training process into yielding meaningful representations (models are stubborn; if a puzzle is too easy, they’ll cheat their way out). We call these puzzles tasks, and they are what the machine spends its time on while sifting through piles of text data, often comprised of massive crawls of the internet.
Nobody has labeled anything: the sheer amount of text is too large for anyone to read in a lifetime, let alone annotate. Deriving a training objective from the data itself is called self-supervised learning.
Here are two of these puzzles:
Language Modeling: Give the algorithm one token at a time, and ask it to predict the next one.
I am on [?]
I am on my [?]
I am on my way [?]
Masked Language Modeling: Aiming to capture more context, take random tokens out of the sentence and ask the model to predict them.
I am on my [MASK] to the grocery [MASK].
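If you’re curious what a model trained on that second puzzle looks like in practice, here’s a minimal sketch using the transformers library’s fill-mask pipeline (the model choice and the single-mask sentence are just illustrative):

```python
# A minimal sketch of masked language modeling in practice,
# using the Hugging Face transformers library (pip install transformers).
from transformers import pipeline

# distilbert-base-uncased was trained with the [MASK] puzzle described above.
fill_mask = pipeline("fill-mask", model="distilbert-base-uncased")

# Ask the model to fill in the blank; it returns its best guesses,
# each with a probability score.
for prediction in fill_mask("I am on my [MASK] to the grocery store."):
    print(prediction["token_str"], round(prediction["score"], 3))
```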
Notice how I keep using the word token instead of word. Since the model is going to learn a representation for each token, words should be divided into their constituents.
You may be tempted to split on whitespace and punctuation, but you’d quickly run into unforeseen issues. (Side note: there are grammars based on regular expressions to do that kind of tokenizing; they are usually language-specific, with tons of hacks.)
For starters, inflection should not matter: “hand” and “hands” should be very similar, and the same goes for contractions like “can’t”. Other languages have a lot more rules and suffixes, not to mention agglutination, where entire noun phrases are written as a single word (e.g. “Beispielsätze”, “example sentences” in German).
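To make those issues concrete, here’s a quick sketch of the naive approach (the regular expression is only an illustration, not one of those real tokenizer grammars):

```python
import re

# Naive tokenization: grab runs of word characters, peel off punctuation.
def naive_tokenize(text):
    return re.findall(r"\w+|[^\w\s]", text)

print(naive_tokenize("I can't hand you my hands."))
# ['I', 'can', "'", 't', 'hand', 'you', 'my', 'hands', '.']
# "can't" gets shredded, "hand" and "hands" become unrelated tokens,
# and an agglutinated word like "Beispielsätze" stays one giant token.
```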
Morphology is another important part of language: if you read “tokenizerologist”, you can guess what it means, even though it’s not a real word.
I’m going to cover a few techniques by which machines figure out what makes up a token and what doesn’t. This is meant as a non-exhaustive overview; if you’re looking for a detailed, in-depth explanation, Cathal Horan has you covered.
The first of these, byte pair encoding (BPE), comes from an old data compression technique. The algorithm looks for the most frequent pair of bytes and replaces it with a value not present in the document; the process is repeated until no unused values are left, or until the max vocabulary size is hit (you don’t want millions of possible tokens; neither your model nor your memory would like that).
It’s a pretty crude means of compression, but with some adjustments, it can be an efficient way of discovering word structure.
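As a rough illustration of the idea (a toy sketch, not the implementation behind any real tokenizer), the core loop looks something like this:

```python
from collections import Counter

def toy_bpe(text, num_merges=10):
    """Greedily merge the most frequent adjacent pair, num_merges times."""
    tokens = list(text)          # start from individual characters/bytes
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (left, right), count = pairs.most_common(1)[0]
        if count < 2:
            break                # nothing worth merging anymore
        merges.append(left + right)
        # Replace every occurrence of the pair with the merged token.
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens, merges

tokens, merges = toy_bpe("the theme of the thesis", num_merges=5)
print(merges)   # discovered subwords, e.g. 'th', 'the', ...
print(tokens)
```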
Example using roberta-base’s tokenizer:
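If you’re not reading the interactive version, here’s roughly what the demo does, sketched with the transformers library:

```python
from transformers import AutoTokenizer

# roberta-base ships a byte-level BPE tokenizer.
tokenizer = AutoTokenizer.from_pretrained("roberta-base")

text = "I am on my way to the grocery store."
print(tokenizer.tokenize(text))
# Tokens prefixed with 'Ġ' were preceded by a space in the original text.

print(tokenizer(text)["input_ids"])
# The ids are wrapped with the special <s> ... </s> tokens described below.
```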
Special tokens (these are specific to how each model is trained):
<s> for the beginning of a sentence
</s> for the end of a sentence
Ġ to indicate that a space was prefixing the token

Directly operating on bytes means BPE has poor handling of multi-byte characters. We can also argue that being the most frequent pair doesn’t necessarily mean two tokens should be merged.
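You can see that quirk for yourself; in this sketch the emoji comes back as byte-level pieces rather than as a readable character:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

# The emoji is three UTF-8 bytes, so the byte-level vocabulary shows it
# as mojibake-looking pieces instead of the character itself.
print(tokenizer.tokenize("I love ✨"))
```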
WordPiece is an incremental improvement over BPE: instead of blindly merging the most frequent pair, you consider how frequent the new token created by the merge would be relative to its parts. For example, “o” and “f” appear in plenty of other words, but they appear together as “of” so often, compared to their individual counts, that merging them into a single token is worth it.
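In other words, a candidate merge is scored by how frequent the merged token is relative to its parts; here’s a back-of-the-envelope sketch with made-up counts:

```python
# Toy counts for illustration only (not from a real corpus).
counts = {"o": 5000, "f": 3000, "of": 2000, "q": 40, "u": 4000, "qu": 39}

def wordpiece_score(left, right):
    # Frequency of the merged token, normalized by the frequencies of its parts.
    return counts[left + right] / (counts[left] * counts[right])

print(wordpiece_score("o", "f"))   # of: ~1.3e-04
print(wordpiece_score("q", "u"))   # qu: ~2.4e-04 -> even more worth merging
```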
Example using distilbert-base-multilingual-cased’s tokenizer:
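Again, roughly what the interactive demo does:

```python
from transformers import AutoTokenizer

# distilbert-base-multilingual-cased uses a WordPiece vocabulary
# shared across many languages.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-multilingual-cased")

print(tokenizer.tokenize("Beispielsätze are example sentences"))
# Pieces starting with '##' continue the previous word
# rather than starting a new one.
```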
Special tokens:
[CLS] for the beginning of the classification example
[SEP] for separating multiple examples in the training process
[UNK] when the given token is not part of the vocabulary
## to denote a token continuing an existing word

Working strictly by frequency of use has its advantages: it’s easy to understand what the model is doing. But there are better estimators for the probability of a token. Remember language modeling? By trying to predict the next token, the model actually computes a probability for every possible token in its vocabulary. That makes it a good candidate for deciding what to merge next, and it’s part of what the next technique brings to the table.
SentencePiece mixes together different techniques, including language modeling, in order to get better priors. It’s a lot more complicated to train, since it relies on another model yielding probability distributions to support merging tokens. Fortunately, it’s also much better.
Since it also deals directly with Unicode code points, it supports emojis correctly ✨
Example using albert-base-v2’s tokenizer:
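And the same sketch for the albert demo:

```python
from transformers import AutoTokenizer

# albert-base-v2 uses a SentencePiece vocabulary.
tokenizer = AutoTokenizer.from_pretrained("albert-base-v2")

print(tokenizer.tokenize("I am on my way to the grocery store."))
# Pieces starting with '▁' mark the beginning of a new word.
```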
Special tokens:
[CLS] for the beginning of the classification example
[SEP] for separating multiple examples in the training process
▁ to indicate a new word

BPE and WordPiece are amongst the easiest to train: you can grab the tokenizers repo and get started in a few lines of (Rust-backed) Python, or just use a pre-trained one 😎.
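For instance, here’s what training a brand new BPE tokenizer looks like with the tokenizers library (the corpus file name is just a placeholder for your own data):

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Build an untrained BPE tokenizer and split the raw text on whitespace first.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

trainer = BpeTrainer(
    vocab_size=30000,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)
# "my_corpus.txt" stands in for whatever text files you have lying around.
tokenizer.train(files=["my_corpus.txt"], trainer=trainer)
tokenizer.save("my-bpe-tokenizer.json")

print(tokenizer.encode("I am on my way to the grocery store.").tokens)
```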
If you were wondering where these [CLS] and [SEP] tokens came from, there are other self-supervised tasks. One of them is next sentence prediction: it requires passing multiple sentences to the model, which then guesses whether the two sentences appear consecutively or not. [CLS] is always the first token, and [SEP] separates multiple sentences within the same task.
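To see where they end up, you can encode a pair of sentences; a quick sketch with a BERT-style tokenizer:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-multilingual-cased")

# Passing two sentences yields: [CLS] first sentence [SEP] second sentence [SEP]
encoded = tokenizer("I am on my way.", "The grocery store is closed.")
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
```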