
Large language model: SBERT — sentence BERT

Author: Refrigeration plant

Learn how the Siamese BERT network accurately converts sentences into embeddings

Brief introduction

It's no secret that Transformers have made evolutionary advances in NLP. Based on the Transformer, many other machine learning models have been developed. One of them is BERT, which consists mainly of several stacked Transformer encoders. In addition to being used for a range of different problems, such as sentiment analysis or question answering, BERT has become increasingly popular for constructing word embeddings (numeric vectors that represent the semantic meaning of words).

Representing words in the form of embeddings has a huge advantage, because machine learning algorithms cannot process raw text but can operate on vectors. This allows the similarity of different words to be compared using standard metrics such as Euclidean or cosine distance.
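As a toy illustration with made-up 3-dimensional vectors (real embeddings are much higher-dimensional), cosine similarity between two embedding vectors can be computed as follows:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity: dot product of the vectors divided by the product of their norms
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional "embeddings", purely for illustrating the metric
cat = np.array([0.9, 0.1, 0.3])
kitten = np.array([0.85, 0.15, 0.25])
car = np.array([0.1, 0.9, 0.7])

print(cosine_similarity(cat, kitten))  # close to 1: semantically similar
print(cosine_similarity(cat, car))     # noticeably lower
```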

The problem is that in practice we often need to build embeddings for entire sentences rather than individual words. However, the basic BERT version only builds embeddings at the word level. Therefore, several BERT-like approaches were later developed to solve this problem, and they are discussed in this article [1]. By discussing them step by step, we will work up to the state-of-the-art model called SBERT.

BERT

First, let's review how BERT processes information. As input, it takes a [CLS] token and two sentences separated by a special [SEP] token. Depending on the model configuration, this input is processed 12 or 24 times by multi-head attention blocks. The output is then aggregated and passed to a simple regression model to obtain the final label.
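For example, assuming the Hugging Face transformers library is available, the special tokens can be seen by tokenizing a sentence pair:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Passing two sentences produces the sequence: [CLS] sentence A [SEP] sentence B [SEP]
encoding = tokenizer("How old are you?", "What is your age?")
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
# ['[CLS]', 'how', 'old', 'are', 'you', '?', '[SEP]', 'what', 'is', 'your', 'age', '?', '[SEP]']
```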


Cross-encoder architecture

You can use BERT to calculate the similarity between a pair of documents. Consider the goal of finding the most similar pair of sentences in a large collection. To solve this problem, every possible pair is fed into the BERT model. This leads to quadratic complexity at inference time. For example, processing n = 10,000 sentences requires n * (n − 1) / 2 = 49,995,000 BERT inference computations, which is not really scalable.
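A quick back-of-the-envelope check of that number:

```python
n = 10_000

# Cross-encoder: every unordered pair of sentences must go through BERT
cross_encoder_runs = n * (n - 1) // 2
print(cross_encoder_runs)  # 49995000 BERT inference computations

# If each sentence could instead be embedded once and compared afterwards,
# only n forward passes through the model would be needed
bi_encoder_runs = n
print(bi_encoder_runs)     # 10000
```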

Other methods

Analyzing the inefficiency of the cross-encoder architecture, it seems logical to precompute embeddings independently for each sentence. Afterwards, we can directly calculate the chosen distance metric on all document pairs, which is much faster than feeding a quadratic number of sentence pairs into BERT.

Unfortunately, this approach is not possible with BERT: the core problem is that every time two sentences are passed and processed simultaneously, it is difficult to obtain an embedding that independently represents only a single sentence.

Researchers tried to work around this problem by using the output embedding of the [CLS] token, hoping that it would contain enough information to represent a sentence. However, [CLS] turned out to be not useful at all for this task, since it was originally pre-trained in BERT for next sentence prediction.

Another approach is to pass a single sentence to BERT and then average the output token embeddings. However, the results obtained are even worse than simply averaged GloVe embeddings.
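For reference, such naive averaging of BERT's output token embeddings might look like the following sketch (assuming the Hugging Face transformers library and the bert-base-uncased checkpoint):

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("The cat sat on the mat.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Naive sentence embedding: average the output token embeddings (768-dimensional)
sentence_embedding = outputs.last_hidden_state.mean(dim=1)
print(sentence_embedding.shape)  # torch.Size([1, 768])
```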

Deriving independent sentence embeddings is one of the main problems with BERT. To alleviate this problem, SBERT was developed.

SBERT

SBERT introduces the Siamese network concept, meaning that two sentences are each passed independently through the same BERT model. Before discussing the SBERT architecture, let's note a subtle point about Siamese networks:

  • Most of the time in scientific papers, a Siamese network architecture is depicted with multiple models receiving that many inputs. In reality, it can be thought of as a single model with the same configuration and weights shared across multiple parallel inputs. Whenever the model weights are updated for one input, they are equally updated for all other inputs.

Going back to SBERT, after a sentence is passed through BERT, a pooling layer is applied to the BERT embeddings to obtain their lower-dimensional representation: the initial 512 vectors of 768 dimensions each are converted into a single 768-dimensional vector. For the pooling layer, the SBERT authors propose mean pooling as the default, although they also mention that a max pooling strategy can be used, or the output of the [CLS] token can simply be taken.

When both sentences are passed through the pooling layer, we have two 768-dimensional vectors u and v. Using these two vectors, the authors propose three approaches to optimizing different objectives, which are discussed below.
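A rough sketch of this Siamese bi-encoder idea, under the same assumptions as the previous sketch (Hugging Face transformers, a plain bert-base-uncased model, and simple mean pooling), which is illustrative rather than the exact SBERT implementation:

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

def embed(sentence: str) -> torch.Tensor:
    # Each sentence goes through the *same* BERT model independently (Siamese setup),
    # then mean pooling collapses the token embeddings into one 768-dimensional vector
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        token_embeddings = model(**inputs).last_hidden_state
    return token_embeddings.mean(dim=1).squeeze(0)

u = embed("A man is playing a guitar.")
v = embed("Someone is performing music.")
print(u.shape, v.shape)  # torch.Size([768]) torch.Size([768])
```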

Classification objective function

The goal of this problem is to correctly classify a given pair of sentences into one of several classes.

After generating the embeddings u and v, the researchers found it very useful to derive a third vector from these two: the element-wise absolute difference |u − v|. They also tried other feature engineering techniques, but this one showed the best results.

Finally, the three vectors u, v, and |u − v| are concatenated, multiplied by a trainable weight matrix W, and the multiplication result is fed into a softmax classifier that outputs normalized probabilities of the sentence pair belonging to each class. The cross-entropy loss function is used to update the weights of the model.
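As a minimal PyTorch sketch of this classification head (the embeddings here are random placeholders rather than real SBERT outputs):

```python
import torch
import torch.nn as nn

embedding_dim = 768
num_classes = 3  # e.g. entailment / contradiction / neutral

# Trainable weight matrix W applied to the concatenation (u, v, |u - v|)
classifier = nn.Linear(3 * embedding_dim, num_classes)
loss_fn = nn.CrossEntropyLoss()  # applies softmax internally

u = torch.randn(embedding_dim)  # placeholder sentence embeddings
v = torch.randn(embedding_dim)
features = torch.cat([u, v, torch.abs(u - v)])

logits = classifier(features.unsqueeze(0))
label = torch.tensor([0])  # e.g. "entailment"
loss = loss_fn(logits, label)
loss.backward()
```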


One of the most popular existing problems used with this objective is NLI (Natural Language Inference), where, for a given pair of sentences A and B defining a hypothesis and a premise, it is necessary to predict whether the hypothesis is true (entailment), false (contradiction), or undetermined (neutral) given the premise. For this problem, the inference process is the same as for training.

As stated in the paper, the SBERT model was originally trained on two datasets, SNLI and MultiNLI, which together contain one million sentence pairs with the corresponding labels entailment, contradiction, or neutral. The paper's authors then give details about SBERT's fine-tuning parameters:

"We fine-tune SBERT with a 3-way softmax-classifier objective function for one epoch. We used a batch size of 16, the Adam optimizer with learning rate 2e−5, and a linear learning rate warm-up over 10% of the training data."
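For illustration, a rough sketch of what fine-tuning with this objective could look like using the sentence-transformers training API (the older fit-style API; the model name, toy examples, and warm-up steps are placeholders, not the exact setup from the paper):

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, models

# Build an SBERT-style model: BERT encoder followed by a mean-pooling layer
word_embedding_model = models.Transformer("bert-base-uncased")
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# A couple of toy NLI-style examples (labels: 0=entailment, 1=contradiction, 2=neutral)
train_examples = [
    InputExample(texts=["A man is eating food.", "A man is eating."], label=0),
    InputExample(texts=["A man is eating food.", "The man is sleeping."], label=1),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# 3-way softmax classification objective over (u, v, |u - v|)
train_loss = losses.SoftmaxLoss(
    model=model,
    sentence_embedding_dimension=model.get_sentence_embedding_dimension(),
    num_labels=3,
)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=100,  # in practice roughly 10% of the total training steps
)
```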

Regression objective function

In this formulation, after obtaining the vectors u and v, the similarity score between them is computed directly with the chosen similarity metric. The predicted similarity score is compared with the true value, and the model is updated using the MSE loss function. By default, the authors choose cosine similarity as the similarity metric.
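A minimal PyTorch sketch of this regression objective, with random placeholder embeddings and an illustrative gold score:

```python
import torch
import torch.nn.functional as F

u = torch.randn(768, requires_grad=True)  # placeholder sentence embeddings
v = torch.randn(768, requires_grad=True)

# Predicted similarity: cosine similarity between u and v
predicted_score = F.cosine_similarity(u.unsqueeze(0), v.unsqueeze(0))

# Gold similarity label, e.g. a rescaled human-annotated similarity score
gold_score = torch.tensor([0.8])

loss = F.mse_loss(predicted_score, gold_score)
loss.backward()
```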


During inference, the architecture can be used in one of two ways:

  • Given a pair of sentences, a similarity score can be calculated. The inference workflow is exactly the same as training.
  • For a given sentence, its sentence embedding can be extracted (after the pooling layer is applied) for later use. This is especially useful when we are given a large collection of sentences and aim to calculate pairwise similarity scores between them. By running BERT only once per sentence, we extract all the necessary sentence embeddings. After that, we can directly calculate the chosen similarity metric between all vectors (this admittedly still requires a quadratic number of comparisons, but we avoid the quadratic number of BERT inference computations needed before).

Triplet objective function

The triplet objective introduces a triplet loss, which is calculated from three sentences usually referred to as the anchor, the positive, and the negative. It is assumed that the anchor and positive sentences are very close, while the anchor and negative sentences are very different. During training, the model evaluates how much closer the pair (anchor, positive) is than the pair (anchor, negative). Mathematically, the following loss function is minimized:

max(||s_a − s_p|| − ||s_a − s_n|| + ε, 0)

where s_a, s_p, and s_n are the embeddings of the anchor, positive, and negative sentences respectively, ||·|| is the vector norm, and ε is the margin.

The margin ε ensures that the positive sentence is at least ε closer to the anchor than the negative sentence is; otherwise, the loss is greater than 0. By default, the authors choose the Euclidean distance as the vector norm in this formula, and the parameter ε is set to 1.
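A minimal PyTorch sketch of this triplet loss with random placeholder embeddings (torch.nn.TripletMarginLoss implements the same idea):

```python
import torch

def triplet_loss(anchor, positive, negative, eps: float = 1.0):
    # Euclidean distances between the anchor and the positive / negative embeddings
    pos_dist = torch.norm(anchor - positive)
    neg_dist = torch.norm(anchor - negative)
    # Loss is zero once the positive is at least eps closer to the anchor than the negative
    return torch.relu(pos_dist - neg_dist + eps)

a = torch.randn(768)            # placeholder anchor embedding
p = a + 0.1 * torch.randn(768)  # "positive": close to the anchor
n = torch.randn(768)            # "negative": unrelated

print(triplet_loss(a, p, n))
```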

The triplet SBERT architecture differs from the previous two in that the model now accepts three input sentences in parallel (instead of two).


Code

SentenceTransformers is a state-of-the-art Python library for building sentence embeddings. It contains several pre-trained models for different tasks. Building embeddings with SentenceTransformers is simple, and an example is shown in the code snippet below.

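A minimal sketch of such usage, assuming the all-MiniLM-L6-v2 pre-trained model (any other pre-trained SentenceTransformers model works the same way):

```python
from sentence_transformers import SentenceTransformer, util

# Load a pre-trained SBERT-style model (the model name here is just an example)
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "The movie was absolutely wonderful.",
    "I really enjoyed watching this film.",
    "The weather is cold and rainy today.",
]

# Each sentence is encoded independently into a fixed-size embedding
embeddings = model.encode(sentences, convert_to_tensor=True)

# Pairwise cosine similarities between all sentence embeddings
similarities = util.cos_sim(embeddings, embeddings)
print(similarities)
```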

The built embeddings can then be used for similarity comparison. Each model is trained for a specific task, so it is always important to consult the documentation to choose the appropriate similarity metric for comparison.

Summary

We have looked at an advanced NLP model for obtaining sentence embeddings. By reducing the quadratic number of BERT inference executions to a linear one, SBERT achieves a significant speed increase while maintaining high accuracy.

To understand just how significant this difference is, it is enough to refer to the example described in the paper, in which the researchers tried to find the most similar pair among n = 10,000 sentences. On a modern V100 GPU, this process takes about 65 hours with BERT and only about 5 seconds with SBERT! This example shows that SBERT is a huge advance in NLP.

Reference

[1] Source: https://towardsdatascience.com/sbert-deb3d4aef8a4
