BERT vs. GPT-2

The embeddings produced by BERT and GPT-2 are widely used in downstream tasks, often without questioning this usage or being aware of how the two models differ. We want to investigate this space and shed some light on the preference of BERT over GPT-2 in NLP tasks.

Introduction

Nowadays, NLP (Natural Language Processing) and machine learning are ubiquitous in document analysis: they give us auto-completion suggestions when we search Google, analyze our writing to correct our grammar, and even summarize whole news articles. Probably everyone has used a service or tool that relies on them, even though it might not seem like it on the surface. But because the underlying models are so complex, it is not easy to compare them directly and to get some intuition for why they work the way they do, beyond comparing their evaluation scores. Can we find a way around this limitation and get a deeper look at their internal representations? In this article, we will explore the embedding space of two popular language models and examine the information encoded in them. For this, we will first introduce typical NLP tasks together with the language models used to solve them. Then, using two commonly used models (GPT-2 and BERT), we show small interactive visualizations of their embedding spaces and how linguistic features are encoded differently inside them. During this journey, we will stumble across surprising insights and reveal how the embedding spaces of models trained on different tasks might be (unexpectedly) different! Finally, we close out our blog post by discussing the comparability of such models and potential impacts on downstream workflows.

1.1

Natural Language Tasks

The areas where NLP is used are as varied as the use of text. Naturally, several use-cases have stabilized over the last decade and evolved into specific tasks. Machine learning models are trained using these different tasks as objectives — therefore, they are also usually evaluated according to their specific task. One objective might be to automatically summarize a text into shorter paragraphs, while retaining information. Or we might want to automatically generate information in a chatbot setting, as an answer to a question. There also exist models that are evaluated across multiple tasks or domains, and are able to generalize better than other models. In this article we will focus on two very fundamental tasks in computational linguistics: Natural Language Generation (NLG) and Natural Language Understanding (NLU).

1.1.1

Natural Language Generation

In natural language generation, the goal is to generate new text automatically. For that, the model might be provided with some words to frame the initial context (seeded), or start blank — without any context (un-seeded). From there, it will predict words that have the highest likelihood of following the input sequence and which, hopefully, make sense considering the context that was given before. The model outputs one word at a time, following the given input sequence; to generate larger parts of text, the previous output is appended to the previous input, forming the new input sequence for the prediction of the next word.
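The generation loop described above can be sketched in a few lines. This is a minimal illustration, not how GPT-2 is implemented: the lookup table `NEXT_WORD_PROBS`, the start symbol `<s>`, and the function names are hypothetical stand-ins for a trained model, and the toy predictor only looks at the last word, whereas a real model conditions on the whole input sequence.

```python
# Hypothetical toy "model": for each word, the probabilities of the next word.
# "<s>" is an assumed start symbol used when no seed is given (un-seeded).
NEXT_WORD_PROBS = {
    "<s>": {"the": 0.8, "a": 0.2},
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "ran": 0.3},
    "sat": {"down": 0.9, "up": 0.1},
}


def predict_next(sequence):
    """Return the most likely next word, or None if the toy model has no guess."""
    last = sequence[-1] if sequence else "<s>"
    candidates = NEXT_WORD_PROBS.get(last)
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


def generate(seed, max_new_words=5):
    """Greedy generation: append each predicted word to the input and repeat."""
    sequence = list(seed)
    for _ in range(max_new_words):
        next_word = predict_next(sequence)
        if next_word is None:
            break
        # The previous output becomes part of the next input sequence.
        sequence.append(next_word)
    return sequence


print(generate(["the"]))  # seeded:    ['the', 'cat', 'sat', 'down']
print(generate([]))       # un-seeded: ['the', 'cat', 'sat', 'down']
```

Always picking the single most likely word (greedy decoding) is only one possible strategy; sampling from the predicted distribution is a common alternative.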

1.1.2

Natural Language Understanding

In comparison, in natural language understanding, the goal is to get an accurate representation of the content of a word regarding its context. This usually means that we want to infer the semantic meaning or syntactic function of a word, given its context before and after. NLU, therefore, has many subtasks, ranging from semantics (Semantic Role Labelling, Word Sense Disambiguation) to more syntactic tasks (Part-of-Speech Tagging, Named Entity Recognition), or tasks that might span both semantics and syntax.

1.2

Models

A recent breakthrough in the field of NLP has been the concept of using learned vector representations to improve results, where each vector represents an individual word. As shown by Mikolov et al., the information in these vector representations can be meaningful and can retain relationships between datapoints that exist in the data. From these simpler bag-of-words models, deep neural networks using the transformer architecture emerged. The transformer formulates the problem via an attention mechanism, where each word is represented with a learned vector embedding using attention both on itself and on the words around it. The transformer architecture was a breakthrough and allowed for improvements over the previous state of the art in many NLP tasks. Today, we have many specialized models that might differ in small ways from the initially proposed transformer architecture, but the general mechanism of attention has stayed the same. The original transformer paper proposed a model that consists of an encoder and a decoder component; specialized models may now use only one of them. For a more thorough explanation of attention and the transformer, we suggest the linked blog posts or a look at the paper itself.

The two components of the transformer, encoder and decoder, are central mechanisms in the architecture of models that solve each of the previously explained tasks. We will use two representative models that are based on these two components in this article, and show their differences.

1.2.1

GPT-2

In this article, we will use GPT-2 as a representative for decoder-style models. The GPT-2 model by OpenAI is a transformer model that is trained on the NLG task to predict the next word. It was the first model to be able to produce human-like text in selected instances. The model was trained on a huge unfiltered corpus of the web. It therefore contains different types of text, such as news, Wikipedia articles or even code snippets. The model's architecture is built by stacking multiple transformer decoder-only blocks (i.e., blocks that consist of a fully connected layer on top of masked self-attention). GPT-2 uses masked self-attention on the preceding context only, so tokens to the right do not influence the embeddings of tokens to the left. GPT-2 is the second model of the GPT series (GPT, GPT-2, GPT-3) published by OpenAI. Due to limited access to GPT-3, we focus only on GPT-2 in this article.
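The effect of masking can be shown with a small sketch of causal self-attention. It is deliberately simplified and rests on our own assumptions: the input vectors are used directly as queries, keys and values (a real transformer block applies learned projections, multiple heads and a fully connected layer), and the name `causal_self_attention` is only illustrative.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_self_attention(vectors):
    """Output i is a weighted average of vectors[0..i]; later positions are masked."""
    dim = len(vectors[0])
    outputs = []
    for i, query in enumerate(vectors):
        visible = vectors[: i + 1]  # the mask: position i never sees positions > i
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in visible
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, visible)) for d in range(dim)]
        )
    return outputs


# Two toy "sentences" that differ only in their last token.
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
changed = [[1.0, 0.0], [0.0, 1.0], [-3.0, 2.0]]

out_a = causal_self_attention(sentence)
out_b = causal_self_attention(changed)

print(out_a[:2] == out_b[:2])  # True: earlier tokens are unaffected by the change
print(out_a[2] == out_b[2])    # False: only the changed position differs
```

Changing a token on the right leaves every output to its left untouched, which is exactly the property that distinguishes this kind of model from a bidirectional one.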

1.2.2

BERT

In contrast, NLU models use the encoder component of the transformer. The early front-runner of these encoder-only models was BERT, short for Bidirectional Encoder Representations from Transformers. BERT has a different training objective compared to GPT-2. A random word of a sentence is masked, and the model is then trained to predict the masked word using the context before and after it. Therefore, the embedding of a word is influenced by the preceding as well as the following words in a sentence. BERT was also trained on a different data corpus than GPT-2 (Wikipedia + BookCorpus). BERT and its intermediate embedding representations are incorporated into many downstream NLP workflows, as they have been shown to improve performance across many different tasks.
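To make the difference between the two training objectives concrete, the following minimal sketch builds one masked training example and lists the context that each kind of model is allowed to use. The helper names `mask_token` and `visible_context` and the example sentence are our own illustrative assumptions, not part of either model's code, and real BERT training operates on subword tokens rather than whole words.

```python
def mask_token(tokens, position, mask_symbol="[MASK]"):
    """Build one masked-word example: (model input, word to predict)."""
    masked = list(tokens)
    target = masked[position]
    masked[position] = mask_symbol
    return masked, target


def visible_context(tokens, position, bidirectional):
    """Context available when predicting the token at `position`.

    bidirectional=True  -> words before and after (masked-word objective)
    bidirectional=False -> words before only (next-word objective)
    """
    left = tokens[:position]
    right = tokens[position + 1:] if bidirectional else []
    return left, right


tokens = ["the", "bank", "of", "the", "river"]

print(mask_token(tokens, 1))
# (['the', '[MASK]', 'of', 'the', 'river'], 'bank')
print(visible_context(tokens, 1, bidirectional=True))
# (['the'], ['of', 'the', 'river'])
print(visible_context(tokens, 1, bidirectional=False))
# (['the'], [])
```

In this example, only the bidirectional setting gets to see the word "river", which is the kind of context that helps to decide which sense of "bank" is meant.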

Of course, models also exist that try to solve multiple tasks at the same time and combine encoder and decoder in one model. Other common adaptations concern how the position of tokens is encoded and at what layer this information is pushed into the model. Because this topic is complex, we will not talk about it further, but if you are interested, there are follow-up blog posts here.

1.3

Dataset & Pipeline

Because we want to visualize how the models' representations differ, we first need to find a dataset from which we can generate suitable embeddings. The two models use different corpora as training data, and as the corpus of GPT-2 is not freely available, we chose a different open-source corpus that was initially suggested for NLI (Natural Language Inference): the Multi-Genre Natural Language Inference (MultiNLI) corpus. It is a crowd-sourced corpus of 433k sentence pairs annotated with textual entailment information. It is difficult to visualize such a large corpus interactively, so we had to reduce the corpus to a subset of the sentences. Because we are interested in semantics, we curated a list of polysemous words, i.e., words having multiple meanings, and reduced the dataset to only those sentences that contain them. We then fed the sentences to the models and extracted the embedding representation of each word. Internally, a word might be split into smaller chunks. This process is called tokenization, and the resulting chunks are called tokens. As GPT-2 and BERT use different tokenizers, the resulting tokens might not align perfectly, but they are mostly similar. Each datapoint in the following visualizations therefore corresponds to one token embedding. We calculated 3D projections using both UMAP and PCA. For UMAP, we experimented with several values for the n_neighbors and min_dist parameters, as well as adapting the initialization to use the PCA positions. However, UMAP was unable to preserve the global structure of the data that can be seen in PCA. Therefore, the default view in the following visualizations will always be the 3D PCA projection. This way, results are not distorted by projection artifacts. For some more local features, it might make sense to look at the UMAP projection. In these cases, it is possible to switch to the UMAP tab on the right of the figures and explore this view.
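For readers who want to see what a PCA projection does to a set of embeddings, here is a dependency-free sketch. It is an assumption-laden illustration and not the pipeline used for the figures: it finds the principal components by power iteration with deflation, the function names and the toy points are made up, and for real embeddings a library implementation (for example the PCA class in scikit-learn) would be the practical choice.

```python
import math


def center(points):
    """Subtract the per-dimension mean so the data is centred at the origin."""
    n, dim = len(points), len(points[0])
    means = [sum(p[d] for p in points) / n for d in range(dim)]
    return [[p[d] - means[d] for d in range(dim)] for p in points]


def covariance(centered):
    """Sample covariance matrix of already centred points."""
    n, dim = len(centered), len(centered[0])
    return [
        [sum(p[i] * p[j] for p in centered) / (n - 1) for j in range(dim)]
        for i in range(dim)
    ]


def mat_vec(matrix, vec):
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def top_eigenpair(matrix, iterations=200):
    """Power iteration: dominant eigenvalue and eigenvector of a symmetric matrix.

    The fixed start vector is an arbitrary choice; power iteration can fail
    if the start happens to be orthogonal to the dominant eigenvector.
    """
    dim = len(matrix)
    vec = [1.0 / (i + 1) for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in vec))
    vec = [x / norm for x in vec]
    for _ in range(iterations):
        new = mat_vec(matrix, vec)
        norm = math.sqrt(sum(x * x for x in new))
        if norm == 0.0:  # no variance left in any direction
            break
        vec = [x / norm for x in new]
    eigenvalue = sum(v * w for v, w in zip(vec, mat_vec(matrix, vec)))
    return eigenvalue, vec


def pca_project(points, n_components=3):
    """Project points onto their first n_components principal components."""
    centered = center(points)
    cov = covariance(centered)
    dim = len(cov)
    components = []
    for _ in range(n_components):
        eigenvalue, comp = top_eigenpair(cov)
        components.append(comp)
        # Deflation: remove the found direction so the next one can be extracted.
        cov = [
            [cov[i][j] - eigenvalue * comp[i] * comp[j] for j in range(dim)]
            for i in range(dim)
        ]
    return [
        [sum(p[d] * c[d] for d in range(dim)) for c in components]
        for p in centered
    ]


# Toy 3-dimensional "embeddings" that mostly vary along a single direction.
points = [
    [0.0, 0.0, 0.0],
    [1.0, 2.0, 0.0],
    [2.0, 4.0, 0.1],
    [3.0, 6.0, -0.1],
    [4.0, 8.0, 0.0],
]
for row in pca_project(points, n_components=2):
    print([round(x, 3) for x in row])
```

Because PCA is a linear projection onto the directions of largest variance, large-scale structure such as well-separated clusters stays visible, which is why it serves as the default view here.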

1.4

Layers

The models we will look at have multiple layers — 12 to be precise. These layers are stacked on top of each other, and the output of one is given to the next layer as an input. Models might also insert additional information on all or only some layers, such as the position of a token. In the figures below, you will see 2D views of each individual layer, for both GPT-2 and BERT, using PCA.

1.4.1

GPT-2

As you can immediately tell, GPT-2 forms two clearly separated clusters. Apart from that, the embedding space stays mostly the same across the lower layers and only changes substantially in the last layer, where the separation decreases. We will explain what these two clusters represent in the following section, so stay tuned!

1.4.2

BERT

BERT does not have such a strict separation as GPT-2. Rather, the embedding space initially consists of multiple small clouds. However, these merge as we go up in the layers. The embedding space changes from these smaller clouds to a few big clusters, and later transforms into a wing-like shape in the last layer.

1.5

Separation of GPT-2 into two Clusters

Because of previous findings by Cai et al., we also separated the datapoints of GPT-2 that belong to the two clusters and projected them individually. If we retain both clusters, the second cluster is almost two-dimensional. However, if we separate their datapoints and project them individually, the underlying structure is better preserved. This allows us to see the shape of each cluster more clearly. In fact, the right cluster contains a manifold similar to the Swiss roll.

We will now give a short introduction on how to use the following Plotly figures. The layer shown in all figures is a global parameter that can be set using the sliders at the bottom of the screen. The layer can be set separately for GPT-2 and BERT. Next to the sliders, you can also change the number of datapoints shown. More datapoints can make the visualizations laggy, so we only show a random subset of the 190,000 tokens by default. The separate views of the two GPT-2 clusters can be selected using the tabs surrounding PCA (left and right); the default view, however, is the projection of all datapoints, without the separation into the left and right cluster. In the upper right corner of the plot, you can switch to the 3D UMAP projection. Legend items can be selected by double-clicking, which makes all other traces disappear. To add a trace to the selection, click it once. Double-clicking an already selected trace switches back to the default state, with all traces visible. To look around, click and drag the mouse, and zoom using the scroll wheel. Sometimes, text might be underlined. This text is clickable and will change the plots to the matching configuration.

Impact of Position

As we described previously, additional information such as the position of a token can be added to the embeddings. The position of a word is integral to how we form a sentence, so it makes sense to make this information available to the model. Of course, there are many ways in which the position can be encoded in the model, from static sinusoidal embeddings to learned representations of positions. There is also the distinction between relative and absolute position, and whether the information is injected only at the beginning or at multiple layers. The position might also be added onto the existing embeddings or concatenated to them. Generally, GPT-2 and BERT showcase two possible patterns well: positional information is either the globally most important feature, or it is encoded only locally within every cluster.
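As a small illustration of these options, the sketch below computes a static sinusoidal position encoding, following the formula from the original transformer paper, and shows the two ways of combining it with a token embedding. The function names and the toy embedding are our own assumptions; the sinusoidal variant is used here only because it can be computed without any training, not because it is what GPT-2 or BERT do.

```python
import math


def sinusoidal_position(position, dim):
    """Static encoding: sine on even indices, cosine on odd indices.

    PE(pos, 2i)   = sin(pos / 10000^(2i / dim))
    PE(pos, 2i+1) = cos(pos / 10000^(2i / dim))
    """
    encoding = []
    for index in range(dim):
        exponent = (2 * (index // 2)) / dim
        angle = position / (10000 ** exponent)
        encoding.append(math.sin(angle) if index % 2 == 0 else math.cos(angle))
    return encoding


def add_position(embedding, position):
    """Option 1: add the position vector onto the embedding (same size)."""
    pos = sinusoidal_position(position, len(embedding))
    return [e + p for e, p in zip(embedding, pos)]


def concat_position(embedding, position, pos_dim):
    """Option 2: concatenate the position vector (the embedding grows)."""
    return list(embedding) + sinusoidal_position(position, pos_dim)


token_embedding = [0.5, 0.5, 0.5, 0.5]  # made-up values for illustration

print(sinusoidal_position(0, 4))           # [0.0, 1.0, 0.0, 1.0]
print(add_position(token_embedding, 0))    # [0.5, 1.5, 0.5, 1.5]
print(concat_position([0.5, 0.5], 0, 2))   # [0.5, 0.5, 0.0, 1.0]
```

When the position is added, it shares dimensions with the token information, whereas concatenation keeps the two in separate dimensions at the cost of a larger vector.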

2.1

GPT-2 separates into two clusters

2.2

BERT consists of multiple small clusters

If we switch to the UMAP projection, the global context is lost for both models. This makes sense, as a good UMAP projection is especially dependent on a successful initialization. However, this failed with our data, and UMAP fell back to a random initialization.

Encoding the syntactic information of words

Some interesting phenomena could already be observed in the previous visualizations. However, now that we know that for GPT-2 the position along the manifold encodes the absolute position of a token inside a sentence, the question remains: What about the rings of the Swiss roll? And what about the clouds that can be seen in the BERT embeddings?

The next NLP feature that we are going to investigate is the Part-of-Speech (POS) tag. POS tags are used in linguistics to describe the syntactic function of a word inside a sentence. Each function is denoted by its POS tag, a short abbreviation of the function that it represents. For example, nouns such as 'letter' have the tag 'NOUN', while pronouns such as 'he' or 'she' have the tag 'PRON'.

It is already known that models such as BERT are very well suited for POS tagging. Therefore, it is reasonable to assume that the model has learned to distinguish between different POS tags and represents this information in its embeddings. Following this hypothesis, we color the individual tokens by their POS tags. We used spaCy to calculate POS tags for all the sentences in our corpus. To look at all the datapoints with a specific POS tag, double-click this POS tag in the visualization legend. The other datapoints are then still visible as gray shadow points, to stabilize the visualization and show where the selected points lie in comparison to the overall embedding space.

3.1

GPT-2 places POS Tags along the manifold

3.2

BERT clusters represent POS tags

To conclude, we can confirm that POS tags are well represented in the embedding space of BERT, especially in the middle layers of the model. This is consistent with findings from previous works, and might explain why BERT embeddings are a good choice for POS tagging. For GPT-2, the problem is that besides the POS tag, the position of the token is more strongly encoded in the embedding. This means that if we want to use GPT-2 for POS tagging, we would have to account for the manifold in the embeddings.

Analyzing the Contextualization of Individual Tokens

An especially difficult task in NLP is sense disambiguation - to be able to tell the sense of a word given its context. Therefore, we want to investigate if we can find differences in polysemous vs non-polysemous words, and how the shape of all of the embeddings of such a word might form clusters for each sense.

4.1

GPT-2 shows clustering in UMAP

4.2

BERT shows clustering in UMAP

If we only look at the tokens in the PCA projection, we cannot come to a direct conclusion on whether or not senses are encoded in the embeddings, at least not along their principal components. However, recent works show that it is possible to disambiguate senses using BERT embeddings, if both are trained alongside each other. Therefore, the model must be able to distinguish between senses in the high-dimensional space. This is supported by looking at individual tokens in the UMAP projection, which is able to better represent local neighbourhoods at the cost of losing the overall structure of the data. Further investigations have to be made in this direction, to be able to visually show the different senses while retaining the global structure of the data.

Context Measures

Using the previous three features, we can already get some insights into what information the models have learned and represent in their embeddings. However, for more detailed analysis, we need measures that can better show the degree of contextualization. Here, we characterize what the model learns and how it contextualizes word embeddings by examining a set of simple patterns based on the embedding similarity in the high-dimensional space.

5.1

GPT-2

5.2

BERT

Token-based-intra-sentence-similarity

Token-based-intra-sentence-similarity is the average cosine similarity between a token and the other tokens in the same sentence. We can observe clear differences between GPT-2 and BERT using this measure. Throughout all layers of GPT-2, tokens at position 0 have the lowest similarity to the remaining tokens in the sentence. As mentioned earlier in this post, GPT-2 seems to emphasize this initial token and encodes very specific information in it that is distinct from the remaining token embeddings. Tokens at higher positions show different patterns. At lower layers (1-4), this measure is affected by the token's POS tag, and we can see ring patterns similar to those in section 3. Here, function words have a higher similarity to other tokens in the sentence than content words do. This is similar to what we can see in BERT. At higher layers (most obviously at layers 8-11), tokens at early positions have the highest similarity to other tokens in the same sentence, and the later a token's position in the sentence, the lower its similarity gets. Our assumption is the following: in GPT-2, attention is paid only to the previous tokens, and tokens in the beginning of the sentence pay most attention to the tokens shortly before them. In general, more tokens can pay attention to tokens at lower position IDs. However, the chance that attention will be paid to tokens at higher positions is small, which might explain the low similarity.

Unlike with GPT-2, we don't see any patterns related to the token's position in the sentence in BERT. Throughout all layers, function words have a higher similarity to other tokens in the sentence than content words do. In addition to function words, the [SEP] token and punctuation marks have the lowest token-based-intra-sentence-similarity in the upper layers.
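The measure itself is simple to compute. The following sketch follows the definition given above; the function names and the two-dimensional toy vectors are illustrative assumptions, and the code expects sentences with at least two tokens and no all-zero embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def intra_sentence_similarity(sentence_embeddings):
    """For each token: mean cosine similarity to all other tokens in the sentence."""
    scores = []
    for i, embedding in enumerate(sentence_embeddings):
        others = [e for j, e in enumerate(sentence_embeddings) if j != i]
        total = sum(cosine_similarity(embedding, other) for other in others)
        scores.append(total / len(others))
    return scores


# One toy sentence with three token embeddings (made-up values).
sentence = [[1.0, 0.0], [0.0, 1.0], [0.0, 2.0]]
print(intra_sentence_similarity(sentence))  # [0.0, 0.5, 0.5]
```

In this toy sentence, the first token points in a direction of its own and therefore gets the lowest score.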

Token-based-self-similarity

Token-based-self-similarity is the average cosine similarity of the same token occurring in different sentences. In both models, the self-similarity is higher for content words than for function words (we have made this observation in our previous work as well). An exception is tokens at position 0 in GPT-2, which have a very high similarity to themselves. This shows again that in GPT-2 the positional information has a very high impact. Although the general patterns for both models are similar, we can see some exceptions. One of them concerns the coordinating conjunction POS tag (CCONJ). GPT-2 captures it, like all other function words, with a lower self-similarity than content words. In comparison, BERT first treats these tokens as function words and, in the upper layers, models them as being more similar to adverbs (i.e., with a higher self-similarity than DET or ADP). This pattern might indicate that BERT not only learns to distinguish between function and content words, but also learns specific properties of POS tags. One assumption might be that BERT detects that CCONJ tokens link constituents without syntactically subordinating one to the other. Hence, the CCONJ tokens are not 'attached' to any specific token as, e.g., determiners are. But in order to state that confidently, further studies are needed.
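A minimal sketch of this measure, under the assumption that it is computed as the mean cosine similarity over all pairs of occurrences of the same token; the function names and the toy vectors are made up for illustration, and at least two occurrences are required.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def self_similarity(occurrences):
    """Mean cosine similarity over all pairs of embeddings of the same token."""
    pairs = list(combinations(occurrences, 2))
    return sum(cosine_similarity(a, b) for a, b in pairs) / len(pairs)


# Embeddings of one token collected from different sentences (made-up values).
context_dependent = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]  # changes with context
context_stable = [[3.0, 4.0], [6.0, 8.0]]                  # same direction everywhere

print(self_similarity(context_dependent))  # about 0.33
print(self_similarity(context_stable))     # 1.0
```

A low value means that the embedding of the token changes a lot between sentences, while a value close to 1 means that it points in nearly the same direction regardless of the context.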

Token similarity to their nearest neighbours (nearest-neighbours-are-same-tokens and cosine-similarity)

To generate further insights, we extracted the ten nearest neighbours for each token using the cosine similarity in the high-dimensional space. We first explored whether the nearest neighbours are the same tokens as the target token, which might indicate a low degree of contextualization. According to this feature, both models show similar patterns. At the lower layers, most neighbours in both models are the same tokens as the target, except for some content words. In the upper layers, more and more content and function words have nearest neighbours that are not necessarily the same tokens, which indicates an increase in contextualization. However, the cosine similarity between tokens and their nearest neighbours shows a difference between the two models. As shown in prior work, BERT and GPT-2 are extremely anisotropic. Our observations for the tokens' nearest neighbours show that GPT-2 is affected even more strongly than BERT, since the average cosine similarity between tokens and their nearest neighbours is 1 in most cases.
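Both nearest-neighbour features can be sketched with a brute-force search. The function name, the toy tokens and the toy vectors below are illustrative assumptions; comparing every token with every other token does not scale to a corpus of this size, where an approximate nearest-neighbour index would be the more realistic choice.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def neighbour_features(index, tokens, embeddings, k=10):
    """Return (share of the k nearest neighbours that are the same token,
    mean cosine similarity to those k neighbours) for the token at `index`."""
    similarities = [
        (cosine_similarity(embeddings[index], emb), j)
        for j, emb in enumerate(embeddings)
        if j != index
    ]
    similarities.sort(reverse=True)  # most similar first
    top = similarities[:k]
    same_token_share = sum(tokens[j] == tokens[index] for _, j in top) / len(top)
    mean_similarity = sum(sim for sim, _ in top) / len(top)
    return same_token_share, mean_similarity


# Toy data: four token occurrences with made-up embeddings.
tokens = ["bank", "bank", "river", "river"]
embeddings = [[1.0, 0.0], [2.0, 0.0], [0.0, 1.0], [0.0, 3.0]]

print(neighbour_features(0, tokens, embeddings, k=1))  # (1.0, 1.0)
print(neighbour_features(0, tokens, embeddings, k=2))  # (0.5, 0.5)
```

A high share of identical tokens among the neighbours hints at little contextualization, while a mean similarity near 1 for almost every token is what one would expect in a strongly anisotropic space.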

Closing Thoughts

In this blog post, we have explored the embedding space of two popular language models, GPT-2 and BERT. We have shown that the models encode some information, such as the positional ID, differently, and that the embedding space therefore takes on a different form. As the resulting space is not necessarily Euclidean, one has to be careful when aggregating embeddings and using the same distance measures across models. Because of the inherent manifold structure of GPT-2, datapoints of the same token are not necessarily similar in the high-dimensional space when classic similarity measures such as cosine similarity are used. Therefore, it does not make sense to simply look for nearest neighbours in the high-dimensional space, as this would not include the same token at a far-away position in the sentence.

Because language models have shown such a strong performance gain on many NLP tasks, they are now used in many downstream applications. However, usually no distinction is made depending on what kind of model is used! If we do not account for the manifold in the GPT-2 embeddings, the embeddings will not be of use in downstream tasks when they are used as a drop-in replacement for BERT, simply because their structure is inherently different. We compared this for other models as well, and to give you a brief overview of what we mean, we show the embedding spaces of the first layer of some selected state-of-the-art language models:

As you can see, the embedding spaces are vastly different! If embeddings of these models are interchangeably used, without caring for what is actually represented in the embedding space, unwanted artifacts might arise. However, it would be incredibly difficult to find their origin, as you cannot easily get an intuition of what is encoded in an embedding without making thorough investigations of the embedding space as we did in this article. We think it is therefore especially important to raise awareness of this fact, and help people investigate what could be encoded in the embeddings, and if it makes sense to use a certain model's embedding in a task, or if it is simply unsuited because of the structure of its embedding space. If your goal is to integrate pretrained embeddings of a language model in a downstream task, you should therefore be aware of potential manifolds and clusters in the embeddings. It is very important to consider which similarity measure is suitable for the model, or choose a model that is suitable for classical measures such as cosine similarity.

To conclude, we presented two very different language models in this blogpost, GPT-2 and BERT. Although from the outside, the models might seem similar, we showed that their embeddings differ significantly in some aspects. We hope that this article facilitated a deeper look into the embedding space of the two models and helped to get a better intuition about what their embeddings encode.