The embeddings produced by BERT
Nowadays, NLP (Natural Language Processing) and machine learning are ubiquitous in document analysis: they support us by giving auto-completion suggestions when searching Google, analyze our writing to correct our grammar, or even summarize whole news articles. Probably everyone has used a service or tool that relies on them, even though it might not seem like it on the surface. But because of their complexity, it is not easy to compare the underlying models directly and get some intuition for why they work the way they do, beyond comparing their evaluation scores. Can we find a way around this limitation and get a deeper look at their internal representations? In this article, we will explore the embedding space of two popular language models and examine the information encoded in them. For this, we will first introduce typical NLP tasks together with the language models used to solve them. Then, using two commonly used models (GPT-2 and BERT), we show small interactive visualizations of their embedding space and how linguistic features are encoded differently inside them. During this journey, we will stumble across surprising insights and reveal how the embedding spaces of models trained on different tasks might be (unexpectedly) different! Finally, we close out our blog post by discussing the comparability of such models and potential impacts on downstream workflows.
Natural Language Tasks
The areas where NLP is used are as varied as the use of text. Naturally, several use-cases have stabilized over the last decade and evolved into specific tasks. Machine learning models are trained using these different tasks as objectives — therefore, they are also usually evaluated according to their specific task. One objective might be to automatically summarize a text into shorter paragraphs, while retaining information. Or we might want to automatically generate information in a chatbot setting, as an answer to a question. There also exist models that are evaluated across multiple tasks or domains, and are able to generalize better than other models. In this article we will focus on two very fundamental tasks in computational linguistics: Natural Language Generation (NLG) and Natural Language Understanding (NLU).
Natural Language Generation
In natural language generation, the goal is to generate new text automatically. For that, the model might be provided with some words to frame the initial context (seeded), or start blank — without any context (un-seeded). From there, it will predict words that have the highest likelihood of following the input sequence and which, hopefully, make sense considering the context that was given before. The model outputs one word at a time, following the given input sequence; to generate larger parts of text, the previous output is appended to the previous input, forming the new input sequence for the prediction of the next word.
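The generation loop described above can be sketched in a few lines. This is only a toy illustration and not how GPT-2 works internally: the "model" here is a hypothetical bigram counter that looks only at the last word, whereas a real language model conditions on the whole input sequence. The function names and the tiny corpus are made up for this example.

```python
from collections import Counter, defaultdict


def train_bigram_model(corpus):
    """Count, for every word, which words follow it in the toy corpus."""
    followers = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.split()
        for current, following in zip(words, words[1:]):
            followers[current][following] += 1
    return followers


def generate(followers, seed, max_new_words=5):
    """Greedy autoregressive generation: predict one word, append it, repeat."""
    sequence = list(seed)
    for _ in range(max_new_words):
        # Simplification: only the last word is used as context.
        candidates = followers.get(sequence[-1])
        if not candidates:
            break  # the toy model has never seen a continuation for this word
        next_word = candidates.most_common(1)[0][0]
        # The predicted word becomes part of the input for the next step.
        sequence.append(next_word)
    return sequence


corpus = ["the cat sat on the mat", "the cat ate the fish"]
model = train_bigram_model(corpus)
generated = generate(model, seed=["the"], max_new_words=3)
print(" ".join(generated))
```

The seeded case corresponds to passing a non-empty `seed`; the important part is that each predicted word is appended to the sequence before the next prediction is made.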
Natural Language Understanding
In comparison, in natural language understanding, the goal is to get an accurate representation of the content of a word regarding its context. This usually means that we want to infer the semantic meaning or syntactic function of a word, given its context before and after. NLU, therefore, has many subtasks, ranging from semantics (Semantic Role Labelling, Word Sense Disambiguation) to more syntactic tasks (Part-of-Speech Tagging, Named Entity Recognition), or tasks that might span both semantics and syntax.
A recent breakthrough in the field of NLP has been the concept of using learned vector representations to improve results, where each vector represents an individual word. As shown by Mikolov et al.
The two components of the transformer, encoder and decoder, are central mechanisms in the architecture of models that solve each of the previously explained tasks. We will use two representative models that are based on these two components in this article, and show their differences.
In this article, we will use GPT-2
In contrast, NLU models use the encoder component of the transformer. The early front-runner of these encoder-only models was BERT
Of course, models also exist that try to solve multiple tasks at the same time, and combine encoder and decoder in one model. Other common adaptations concern how the position of tokens is encoded and at what layer this information is fed into the model. Because this topic is complex, we will not discuss it further, but if you are interested, there are follow-up blog posts here
Dataset & Pipeline
Because we want to visualize how the models' representations differ, we first need to find a dataset from which we can generate suitable embeddings. The two models were trained on different corpora, and as the corpus of GPT-2 is not freely available, we chose a different open-source corpus that was initially proposed for NLI (Natural Language Inference): the Multi-Genre Natural Language Inference (MultiNLI) corpus
The models we will look at have multiple layers — 12 to be precise. These layers are stacked on top of each other, and the output of one is given to the next layer as an input. Models might also insert additional information on all or only some layers, such as the position of a token. In the figures below, you will see 2D views of each individual layer, for both GPT-2 and BERT, using PCA.
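As a rough sketch of how such per-layer 2D views could be computed, the snippet below applies PCA (via a singular value decomposition) to the token embeddings of each layer separately. The data is a random stand-in rather than real GPT-2 or BERT activations, and the array sizes other than the 12 layers are placeholders chosen for this example.

```python
import numpy as np


def pca_2d(embeddings):
    """Project (n_tokens, hidden_dim) embeddings onto their first two principal components."""
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions, sorted by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


rng = np.random.default_rng(0)
n_layers, n_tokens, hidden_dim = 12, 200, 64  # token count and width are placeholders

# Stand-in data: random vectors instead of real model activations.
hidden_states = rng.normal(size=(n_layers, n_tokens, hidden_dim))

# One independent 2D projection per layer, as in the per-layer views.
projections = [pca_2d(layer) for layer in hidden_states]
print(len(projections), projections[0].shape)
```

Because each layer is projected independently, the axes of two layers are not directly comparable; only the shape of the point cloud within a layer is.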
As you can immediately tell, GPT-2 forms two very separate clusters. Otherwise, for the lower layers, the embedding space stays mostly the same, but makes some big changes in the last layer, where the separation decreases. We will explain what these two clusters represent in the following section, so stay tuned!
BERT does not show such a strict separation as GPT-2. Rather, the embedding space initially consists of multiple small clouds, which merge as we go up in the layers. The embedding space changes from these smaller clouds to a few big clusters, and later transforms into a wing-like shape in the last layer.
Separation of GPT-2 into two Clusters
Because of previous findings by Cai et al.
We will now give a short introduction on how to use the following Plotly figures. The layer shown in all figures is a global parameter that can be set using the sliders located at the bottom of the screen. The layer can be set separately for GPT-2 and BERT. Next to the sliders, you can also change the number of datapoints shown. More datapoints can make the visualizations laggy, so by default we only show a random subset of the 190,000 tokens. The separate views of the two clusters of GPT-2 can be switched using the tabs surrounding PCA (left and right); the default view, however, is the projection of all datapoints, without the separation into the left and right cluster. In the upper right corner of the plot, you can switch to the 3D UMAP projection. Legend items can be selected by double-clicking, which makes all other traces disappear. To add a trace to the selection, click it once. Double-clicking an already selected trace switches back to the default state, with all traces visible. To look around, click and drag the mouse, and zoom using the scroll wheel. Sometimes, text might be underlined: this text is clickable, and will change the plots to the matching configuration.
Impact of Position
As we described previously, additional information such as the position of a token can be added to the embeddings. The position of a word is an integral part of how we form a sentence, so it makes sense to make this information available to the model. Of course, there are many ways in which the position can be encoded in the model, from static sinusoidal embeddings to learned representations of positions. There is also the distinction between relative and absolute position, and whether the information is injected only at the beginning or at multiple layers. The position might also be added onto the existing embeddings or concatenated with them. Generally, GPT-2 and BERT showcase two possible patterns well: position is either the globally most important feature, or it is encoded only locally within every cluster.
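To make the difference between adding and concatenating position information concrete, here is a minimal sketch using one common formulation of static sinusoidal position vectors. It is meant as an illustration of the options listed above, not as a description of how GPT-2 or BERT encode position; the token embeddings are random placeholders and all names are chosen for this example.

```python
import numpy as np


def sinusoidal_positions(n_positions, dim):
    """Static sinusoidal position vectors; dim is assumed to be even."""
    positions = np.arange(n_positions)[:, None]
    frequencies = 1.0 / (10000 ** (np.arange(0, dim, 2) / dim))
    angles = positions * frequencies
    encoding = np.zeros((n_positions, dim))
    encoding[:, 0::2] = np.sin(angles)  # even dimensions
    encoding[:, 1::2] = np.cos(angles)  # odd dimensions
    return encoding


rng = np.random.default_rng(0)
n_tokens, dim = 5, 8
token_embeddings = rng.normal(size=(n_tokens, dim))  # placeholder embeddings
position_vectors = sinusoidal_positions(n_tokens, dim)

# Option 1: add the position onto the existing embedding (dimension unchanged).
added = token_embeddings + position_vectors

# Option 2: concatenate, keeping token and position in separate dimensions.
concatenated = np.concatenate([token_embeddings, position_vectors], axis=-1)

print(added.shape, concatenated.shape)
```

With addition, token identity and position share the same dimensions and have to be disentangled by the model; with concatenation they start out in separate subspaces at the cost of a wider vector.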
If we switch to the UMAP projection, the global context is lost for both models. This makes sense, as a good UMAP projection is especially dependent on a successful initialization
Encoding the syntactic information of words
Some interesting phenomena could already be observed in the previous visualizations. However, now that we know that for GPT-2 the position along the manifold encodes the absolute position of a token inside a sentence, the question remains: What about the rings of the Swiss roll? And what about the clouds that can be seen in the BERT embeddings?
The next NLP feature that we are going to investigate is the Part-of-Speech (POS) tag. POS tags are used in linguistics to describe the syntactic function of a word inside a sentence. Each function is denoted by its POS tag, a short abbreviation of the function that it represents. For example, nouns such as the word 'letter' have the tag 'NOUN', while pronouns such as 'he' or 'she' have the tag 'PRON'.
It is already known that models such as BERT are very well suited for POS Tagging
To conclude, we can confirm that POS tags are well represented in the embedding space of BERT, especially in the middle layers of the model. This is consistent with findings from previous works, and might explain why BERT embeddings are a good choice for POS tagging. For GPT-2, the problem is that besides the POS tag, the position of the token is more strongly encoded in the embedding. This means that if we want to use GPT-2 for POS tagging, we would have to account for the manifold in the embeddings.
Analyzing the Contextualization of Individual Tokens
An especially difficult task in NLP is sense disambiguation - to be able to tell the sense of a word given its context. Therefore, we want to investigate if we can find differences in polysemous vs non-polysemous words, and how the shape of all of the embeddings of such a word might form clusters for each sense.
If we only look at the tokens in the PCA projection, we can come to no direct conclusion of whether or not senses are encoded in the embeddings, at least not along its principal components. However, recent works
Using the previous three features, we can already get some insights into what information the models have learned and represent in their embeddings. However, for more detailed analysis, we need measures that can better show the degree of contextualization. Here, we characterize what the model learns and how it contextualizes word embeddings by examining a set of simple patterns based on the embedding similarity in the high-dimensional space.
Token-based-intra-sentence-similarity is the average cosine similarity between a token and the other tokens in the same sentence. We can observe clear differences between GPT-2 and BERT using this measure.

Throughout all layers of GPT-2, tokens at position 0 have the lowest similarity to the remaining tokens in the sentence. As mentioned earlier in this post, GPT-2 seems to emphasize this initial token and encodes very specific information that is distinct from the remaining token embeddings. Tokens at higher positions show different patterns. At lower layers (1-4), this measure is affected by the token's POS tag, and we can see ring patterns similar to those in section 3. Here, function words have a higher similarity to other tokens in the sentence than content words do. This is similar to what we can see in BERT. At higher layers (most obviously at layers 8-11), tokens at early positions have the highest similarity to other tokens in the same sentence, and the later the token's position in the sentence, the lower its similarity gets. Our assumption is the following: in GPT-2, attention is paid only to the previous tokens, and tokens in the beginning of the sentence pay most attention to the tokens shortly before them. In general, more tokens can pay attention to tokens at lower position IDs. However, the chance that attention will be paid to tokens at higher positions is small, which might explain the low similarity.

Unlike with GPT-2, we don't see any patterns related to the token's position in the sentence in BERT. Throughout all layers, function words have a higher similarity to other tokens in the sentence than content words do. In addition to function words, the [SEP] token and punctuation marks have the lowest token-based-intra-sentence-similarity in the upper layers.
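The measure itself is simple to compute. The following sketch follows the definition given above (mean cosine similarity of a token to all other tokens of the same sentence); the function name and the three hand-picked 2D vectors are assumptions for illustration, not our actual pipeline.

```python
import numpy as np


def intra_sentence_similarity(sentence_embeddings):
    """For each token, the mean cosine similarity to all other tokens in the sentence.

    Expects an array of shape (n_tokens, hidden_dim) with at least two tokens
    and no all-zero vectors.
    """
    norms = np.linalg.norm(sentence_embeddings, axis=1, keepdims=True)
    unit = sentence_embeddings / norms
    cosine = unit @ unit.T
    n_tokens = cosine.shape[0]
    # Exclude each token's similarity to itself (the diagonal, which is 1).
    return (cosine.sum(axis=1) - np.diag(cosine)) / (n_tokens - 1)


# Toy sentence with three tokens: the first two point in the same direction,
# the third is orthogonal to both.
sentence = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 1.0]])
similarities = intra_sentence_similarity(sentence)
print(similarities)
```

In this toy sentence the first two tokens each get a value of 0.5 (similarity 1 to each other, 0 to the third), while the third token gets 0.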
Token-based-self-similarity is the average cosine similarity of the same token occurring in different sentences. In both models, the self-similarity is higher for content than function words (we have made this observation in our previous work as well
Token similarity to their nearest neighbours (nearest-neighbours-are-same-tokens and cosine-similarity)
To generate further insights, we extracted ten nearest neighbours for each token using the cosine similarity in the high-dimensional space. We first explored whether the nearest neighbours are the same tokens as the target token, which might indicate a low contextualization. According to this feature, both models seem to have similar patterns. At the lower layers, most neighbours in both models are the same tokens as the target, except for some content words. In upper layers, more and more content and function words have nearest neighbours that are not necessarily the same tokens, which indicates an increase in contextualization. However, the cosine similarity between tokens and their nearest neighbours shows a difference between the two models. As shown in
In this blogpost, we have explored the embedding space of two popular language models, GPT-2 and BERT. We have shown that the models encode some information, such as positional ID, differently, and therefore the embedding space takes on a different form. As the resulting space is not necessarily Euclidean, one has to be careful when aggregating embeddings and when using the same distance measures across models. Because of the inherent manifold structure of GPT-2, datapoints of the same token are not necessarily similar in the high-dimensional space when using classic similarity measures such as cosine similarity. Therefore, it does not make sense to simply look for nearest neighbours in the high-dimensional space, as this would miss occurrences of the same token at a far-away position in the sentence.
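For reference, the kind of nearest-neighbour lookup discussed here can be sketched as follows: find the k most cosine-similar embeddings for every token occurrence, then check how many of those neighbours are the same token. The helper names and the hand-made 2D vectors are hypothetical; with real embeddings the array would hold one row per token occurrence.

```python
import numpy as np


def nearest_neighbours(embeddings, k):
    """Indices and cosine similarities of the k nearest neighbours of every row."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    cosine = unit @ unit.T
    np.fill_diagonal(cosine, -np.inf)  # a point is not its own neighbour
    indices = np.argsort(-cosine, axis=1)[:, :k]
    similarities = np.take_along_axis(cosine, indices, axis=1)
    return indices, similarities


def same_token_fraction(tokens, neighbour_indices):
    """Fraction of each occurrence's neighbours that are the same token."""
    tokens = np.asarray(tokens)
    return (tokens[neighbour_indices] == tokens[:, None]).mean(axis=1)


tokens = ["bank", "bank", "river", "river", "bank"]
embeddings = np.array(
    [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.2, 0.8]]
)

indices, similarities = nearest_neighbours(embeddings, k=1)
fractions = same_token_fraction(tokens, indices)
print(indices.ravel(), fractions)
```

The last "bank" occurrence sits next to the "river" vectors, so its nearest neighbour is a different token. If a dominant feature such as position pulls occurrences of the same token apart, this plain cosine lookup will behave the same way and miss them.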
Because language models have shown such strong performance gains on many NLP tasks, they are now used in many downstream applications. However, usually no distinction is made depending on what kind of model is used! If GPT-2 embeddings are used as a drop-in replacement for BERT embeddings without accounting for their manifold, they will not be of use in downstream tasks, simply because their structure is inherently different. We compared this for other models as well, and just to give you a brief overview of what we mean, we will show the embedding spaces of the first layer from some selected state-of-the-art language models:
As you can see, the embedding spaces are vastly different! If embeddings of these models are interchangeably used, without caring for what is actually represented in the embedding space, unwanted artifacts might arise. However, it would be incredibly difficult to find their origin, as you cannot easily get an intuition of what is encoded in an embedding without making thorough investigations of the embedding space as we did in this article. We think it is therefore especially important to raise awareness of this fact, and help people investigate what could be encoded in the embeddings, and if it makes sense to use a certain model's embedding in a task, or if it is simply unsuited because of the structure of its embedding space. If your goal is to integrate pretrained embeddings of a language model in a downstream task, you should therefore be aware of potential manifolds and clusters in the embeddings. It is very important to consider which similarity measure is suitable for the model, or choose a model that is suitable for classical measures such as cosine similarity.
To conclude, we presented two very different language models in this blogpost, GPT-2 and BERT. Although from the outside, the models might seem similar, we showed that their embeddings differ significantly in some aspects. We hope that this article facilitated a deeper look into the embedding space of the two models and helped to get a better intuition about what their embeddings encode.