Embark on a journey through the evolution of artificial intelligence and the astonishing advances in natural language processing (NLP). In the blink of an eye, AI has risen and reshaped our world. Training large language models has revolutionized NLP and transformed how we interact with technology. Back in 2017, a pivotal moment arrived with the paper "Attention Is All You Need," and the groundbreaking Transformer architecture was born. This architecture now forms the cornerstone of NLP and is an essential ingredient in every large language model recipe – including the famous ChatGPT.
Imagine effortlessly generating coherent, context-rich text – that's the magic of models like GPT-3. Powering chatbots, translation, and content generation, their brilliance stems from the Transformer architecture and the interplay of pre-training and fine-tuning. This article delves into that interplay, revealing how to execute tasks with large language models by combining pre-training and fine-tuning to achieve outstanding results. Uncover the workings of these transformative technologies with us!
Learning objectives
- Learn about the different ways to structure LLM applications.
- Learn fine-tuning techniques such as feature extraction, full model fine-tuning, and adapter methods.
- Use the Hugging Face Transformers library to fine-tune an LLM on a downstream task.
Introduction
LLM stands for Large Language Model. LLMs are deep learning models designed to understand the meaning of human-like text and perform various tasks such as sentiment analysis, language modeling (next word prediction), text generation, text summarization, and many more. They are trained on large amounts of text data.
We use applications based on these LLMs every day without even realizing it. Google uses BERT (Bidirectional Encoder Representations from Transformers) for a variety of applications, such as query completion, understanding query context, producing more relevant and accurate search results, language translation, and more.
These models are built on advanced technologies such as deep learning techniques, deep neural networks, and self-attention. They are trained on large amounts of textual data to learn the patterns, structure, and semantics of language.
Since these models are trained on very large datasets, training them takes a great deal of time and resources, and it rarely makes sense to train them from scratch.
Instead, there are techniques that let us use these models directly, or adapt them, to accomplish specific tasks. Let's discuss them in detail.
1. Overview of the different methods for constructing an LLM application
We often see exciting LLM applications in our daily lives. Want to know how to structure an LLM application? Here are three ways to structure one:
- Training a large language model from scratch
- Fine-tuning a large language model
- Prompting
1.1 Training large language models from scratch
People are often confused by two terms: training and fine-tuning an LLM. The two techniques work similarly, both changing the model's parameters, but with different training objectives.
Training an LLM from scratch is also known as pre-training. Pre-training is a technique for training a large language model on vast amounts of unlabeled text. But the question is: how do we train a model on unlabeled data and then expect it to make accurate predictions? This is where the concept of "self-supervised learning" comes in. In self-supervised learning, the model masks a word and tries to predict it with the help of the surrounding words. For example, suppose we have the sentence: "I'm a data scientist."
The model can create its own labeled examples from this sentence, for instance by masking one word at a time: "I'm a [MASK] scientist" with the label "data", or "I'm a data [MASK]" with the label "scientist".
This objective is known as masked language modeling (MLM). BERT, a masked language model, uses this technique to predict masked words. We can think of MLM as a "fill in the blank" task, where the model predicts which word fits in the blank.
There are different self-supervised objectives, such as next word prediction (used by GPT-style models) and masked language modeling, but in this article we will only talk about BERT's MLM. BERT can look at both the preceding and the following words to understand the context of a sentence and predict the masked word.
So, as a high-level overview, pre-training is simply a technique through which the model learns to predict missing or next words in text.
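To make the "fill in the blank" idea concrete, here is a minimal sketch using the Hugging Face fill-mask pipeline; the model name and example sentence are illustrative choices, not part of the original walkthrough.
from transformers import pipeline

# Masked language modeling in action: BERT fills in the [MASK] token
# using both the left and the right context.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
predictions = fill_mask("I'm a [MASK] scientist.")
for p in predictions[:3]:
    print(p["token_str"], round(p["score"], 3))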
1.2 Fine-tuning large language models
Fine-tuning is about adjusting a model's parameters to make it suitable for a specific task. After a model is pre-trained, it is fine-tuned, or simply put, further trained, to perform specific tasks such as sentiment analysis, text generation, finding document similarity, and so on. We don't have to train the model again on a huge text corpus; instead, we adapt the pre-trained model to the task we want to perform. We'll discuss how to fine-tune a large language model in detail later in this article.
1.3 Prompting
Prompting is the simplest of the three techniques, but it is also a bit tricky. It involves giving the model a context (a prompt) based on which it performs a task. Think of it as teaching a child a chapter from a book in great detail, being very careful with the explanations, and then asking them to solve problems related to that chapter.
In the case of an LLM, take ChatGPT as an example: we set a context and ask the model to follow the instructions to solve the given problem.
Suppose I want ChatGPT to ask me a few interview questions about Transformers. For a better experience and more accurate output, you need to set the appropriate context and give a detailed description of the task.
Example: I'm a data scientist with two years of experience, currently preparing for an interview at such and such a company. I love problem solving, I currently work with state-of-the-art NLP models, and I am aware of the latest trends and technologies. Ask me ten very tough questions about the Transformer model that an interviewer at this company might ask based on their previous experience, and give the answers to them.
The more detailed and specific your prompt, the better the results. The most interesting part is that you can generate the prompt from the model itself and then add a personal touch or whatever information you need.
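To make prompting concrete, here is a minimal sketch that sends a context-setting prompt to a small local text-generation model through the Transformers pipeline; the model choice (gpt2), prompt text, and generation settings are illustrative assumptions, and with ChatGPT you would send the same kind of prompt through its chat interface instead.
from transformers import pipeline

# The prompt supplies the context and the instruction; the model continues from it.
generator = pipeline("text-generation", model="gpt2")
prompt = (
    "I'm a data scientist with two years of experience preparing for an interview. "
    "Ask me one tough question about the Transformer model and provide the answer."
)
result = generator(prompt, max_new_tokens=100, do_sample=True)
print(result[0]["generated_text"])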
2. Understanding different fine-tuning techniques
There are several ways to fine-tune a model, and the right approach depends on the specific problem you want to solve. Let's discuss the techniques we can use to fine-tune a model.
Traditionally, there are three ways to fine-tune an LLM.
2.1 Feature extraction
People use this technique to extract features (embeddings) from a given text. Why would we extract embeddings from text? The answer is simple: since computers cannot understand raw text, we need a numerical representation of the text that we can use to perform various tasks. Once we extract the embeddings, we can use them for tasks such as sentiment analysis, identifying document similarity, and so on. In feature extraction, we freeze the backbone layers of the model, meaning we do not update the parameters of those layers; only the parameters of the classifier layers are updated. The classifier typically consists of fully connected layers.
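As a sketch of feature extraction, the snippet below keeps BERT frozen, takes the [CLS] embedding of each sentence as a fixed feature vector, and trains a lightweight scikit-learn classifier on top; the example sentences, labels, and choice of LogisticRegression are illustrative assumptions.
import torch
from transformers import BertTokenizer, BertModel
from sklearn.linear_model import LogisticRegression

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
bert.eval()  # the backbone stays frozen; its weights are never updated

texts = ["I loved this movie!", "What a waste of time."]  # illustrative examples
labels = [1, 0]

with torch.no_grad():
    enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    # the [CLS] token embedding (first position) serves as the sentence representation
    features = bert(**enc).last_hidden_state[:, 0, :].numpy()

# only this small classifier is trained on the extracted features
clf = LogisticRegression().fit(features, labels)
print(clf.predict(features))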
2.2 Full model training
As the name suggests, in this technique we train every layer of the model on a custom dataset for a certain number of epochs. We adjust the parameters of all layers in the model based on the new custom dataset. This can improve the model's accuracy on the data and on the specific task we want to perform. However, considering that a large language model has billions of parameters, full fine-tuning is computationally expensive and takes a lot of time.
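In code, full fine-tuning simply means leaving every parameter trainable and handing them all to the optimizer. A minimal sketch, assuming the same bert-base-uncased backbone used later in this article (the learning rate is an illustrative choice):
import torch
from transformers import BertModel

bert = BertModel.from_pretrained("bert-base-uncased")
for param in bert.parameters():
    param.requires_grad = True  # nothing is frozen; every layer will be updated

optimizer = torch.optim.AdamW(bert.parameters(), lr=2e-5)
trainable = sum(p.numel() for p in bert.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable:,}")  # roughly 110M for bert-base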
2.3 Adapter-based training
Adapter-based training is a relatively new approach in which additional randomly initialized layers or modules are added to the network and then trained for a specific task. In this technique, the original parameters of the model are not disturbed; that is, the pre-trained weights are neither changed nor tuned. Only the adapter layer parameters are trained. This helps adapt the model in a computationally efficient manner.
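Here is a hand-rolled sketch of the adapter idea: the backbone is frozen and only a small bottleneck module with a residual connection (plus a classifier) is trained on top of its output. The layer sizes and placement are illustrative assumptions; libraries such as PEFT offer more complete adapter implementations.
import torch
import torch.nn as nn
from transformers import BertModel

class Adapter(nn.Module):
    def __init__(self, hidden_size=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)  # down-projection
        self.up = nn.Linear(bottleneck, hidden_size)    # up-projection
        self.act = nn.ReLU()

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))      # residual connection

bert = BertModel.from_pretrained("bert-base-uncased")
for param in bert.parameters():
    param.requires_grad = False  # the pre-trained weights stay untouched

adapter = Adapter()
classifier = nn.Linear(768, 2)
# only the adapter and classifier parameters are optimized
optimizer = torch.optim.AdamW(
    list(adapter.parameters()) + list(classifier.parameters()), lr=1e-4
)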
3. Implementation: Fine-tuning BERT on a downstream task
Now that we know the fine-tuning techniques, let's use BERT to perform sentiment analysis on IMDB movie reviews. BERT is a large, encoder-only language model built from a stack of transformer layers. Google developed it, and it has proven to perform well on a variety of tasks. BERT comes in different sizes and variants, such as BERT-base-uncased, BERT Large, RoBERTa, LegalBERT, and many more.
3.1 BERT model for sentiment analysis
We will use the BERT model to perform sentiment analysis on IMDB film reviews. For free GPU access, Google Colab is recommended. Let's start fine-tuning by loading some important libraries.
Since BERT (Bidirectional Encoder Representations from Transformers) is based on the Transformer architecture, the first step is to install the Transformers library in our environment.
!pip install transformers
Let's load some libraries that will help us load the data needed for the BERT model, tokenize the loaded data, load the model we will use for classification, perform a train-test split, read CSV files, and a few other functions.
import pandas as pd
import numpy as np
import os
from sklearn.model_selection import train_test_split
import torch
import torch.nn as nn
from transformers import BertTokenizer, BertModel
For faster computation, we should move from the CPU to the GPU (falling back to the CPU when no GPU is available):
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
The next step is to load the dataset and view the first 5 records in the dataset.
df = pd.read_csv('/content/drive/MyDrive/movie.csv')
df.head()
We will split the dataset into a training set and a validation set. You could also split the data into training, validation, and test sets, but for simplicity I am only splitting it into a training and a validation set.
x_train, x_val, y_train, y_val = train_test_split(df.text, df.label, random_state = 42, test_size = 0.2, stratify = df.label)
3.2 Import and load the BERT model
Let's load the pre-trained BERT model and its tokenizer.
# load the BERT-base pretrained model
BERT = BertModel.from_pretrained('bert-base-uncased')
# Load the BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
We will use the tokenizer to convert the text into tokens with a maximum length of 250, padding and truncating where needed.
train_tokens = tokenizer.batch_encode_plus(x_train.tolist(), max_length = 250, padding = 'max_length', truncation = True)
val_tokens = tokenizer.batch_encode_plus(x_val.tolist(), max_length = 250, padding = 'max_length', truncation = True)
The tokenizer returns a dictionary with three key-value pairs: input_ids, the token ids corresponding to the word pieces; token_type_ids, a list of integers that distinguish the different segments or parts of the input; and attention_mask, which indicates which tokens should be attended to.
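For example, encoding a single sentence (the text and max_length here are just for demonstration) shows those three keys:
sample = tokenizer("This movie was surprisingly good!", max_length = 12, padding = 'max_length', truncation = True)
print(sample.keys())             # dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])
print(sample['input_ids'])       # token ids, padded to max_length
print(sample['attention_mask'])  # 1 for real tokens, 0 for padding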
Now convert the train and validation encodings into tensors:
train_ids = torch.tensor(train_tokens['input_ids'])
train_masks = torch.tensor(train_tokens['attention_mask'])
train_label = torch.tensor(y_train.tolist())
val_ids = torch.tensor(val_tokens['input_ids'])
val_masks = torch.tensor(val_tokens['attention_mask'])
val_label = torch.tensor(y_val.tolist())
Load TensorDataset and DataLoaders to further preprocess the data and fit it to the model.
from torch.utils.data import TensorDataset, DataLoader
train_data = TensorDataset(train_ids, train_masks, train_label)
val_data = TensorDataset(val_ids, val_masks, val_label)
train_loader = DataLoader(train_data, batch_size = 32, shuffle = True)
val_loader = DataLoader(val_data, batch_size = 32, shuffle = True)
Our task is to freeze the BERT backbone, attach a classifier on top of it, and then train only those classifier layers on our custom dataset. So, let's freeze the parameters of the model:
for param in BERT.parameters():
    param.requires_grad = False
Now we need to define the layers we add on top of BERT and their forward pass. The BERT model acts as a feature extractor, and we define the classification head ourselves; the backward pass is handled automatically by PyTorch's autograd.
class Model(nn.Module):
    def __init__(self, bert):
        super(Model, self).__init__()
        self.bert = bert
        self.dropout = nn.Dropout(0.1)
        self.relu = nn.ReLU()
        self.fc1 = nn.Linear(768, 512)
        self.fc2 = nn.Linear(512, 2)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, sent_id, mask):
        # Pass the inputs through the frozen BERT backbone
        outputs = self.bert(sent_id, attention_mask=mask)
        # Use the [CLS] token representation as the sentence embedding
        cls_hs = outputs.last_hidden_state[:, 0, :]
        x = self.fc1(cls_hs)
        x = self.relu(x)
        x = self.dropout(x)
        x = self.fc2(x)
        # Output log-probabilities over the two classes
        x = self.softmax(x)
        return x
Let's move the model to the GPU
model = Model(BERT)
# push the model to GPU
model = model.to(device)
3.3 Define the optimizer
# AdamW optimizer from PyTorch (the AdamW in transformers is deprecated)
from torch.optim import AdamW
# define the optimizer
optimizer = AdamW(model.parameters(), lr = 1e-5)
So far, we have preprocessed the dataset and defined our model. Now it's time to train the model, so we need to write functions to train and evaluate it.
Training function:
def train():
    model.train()
    total_loss, total_accuracy = 0, 0
    total_preds = []
    for step, batch in enumerate(train_loader):
        # Move batch to GPU if available
        batch = [item.to(device) for item in batch]
        sent_id, mask, labels = batch
        # Clear previously calculated gradients
        optimizer.zero_grad()
        # Get model predictions (log-probabilities) for the current batch
        preds = model(sent_id, mask)
        # The model outputs log-probabilities, so use the negative log-likelihood loss
        loss_function = nn.NLLLoss()
        loss = loss_function(preds, labels)
        # Add to the total loss
        total_loss += loss.item()
        # Backward pass and gradient update
        loss.backward()
        optimizer.step()
        # Move predictions to CPU and convert to numpy array
        preds = preds.detach().cpu().numpy()
        # Append the model predictions
        total_preds.append(preds)
    # Compute the average loss
    avg_loss = total_loss / len(train_loader)
    # Concatenate the predictions
    total_preds = np.concatenate(total_preds, axis=0)
    # Return the average loss and predictions
    return avg_loss, total_preds
3.4 Evaluation function
def evaluate():
    model.eval()
    total_loss, total_accuracy = 0, 0
    total_preds = []
    for step, batch in enumerate(val_loader):
        # Move batch to GPU if available
        batch = [item.to(device) for item in batch]
        sent_id, mask, labels = batch
        # No gradients or weight updates during evaluation
        with torch.no_grad():
            # Get model predictions (log-probabilities) for the current batch
            preds = model(sent_id, mask)
            # The model outputs log-probabilities, so use the negative log-likelihood loss
            loss_function = nn.NLLLoss()
            loss = loss_function(preds, labels)
            # Add to the total loss
            total_loss += loss.item()
            # Move predictions to CPU and convert to numpy array
            preds = preds.detach().cpu().numpy()
            # Append the model predictions
            total_preds.append(preds)
    # Compute the average loss
    avg_loss = total_loss / len(val_loader)
    # Concatenate the predictions
    total_preds = np.concatenate(total_preds, axis=0)
    # Return the average loss and predictions
    return avg_loss, total_preds
We will now use these functions to train the model:
# set initial loss to infinite
best_valid_loss = float('inf')

# define the number of epochs
epochs = 5

# empty lists to store the training and validation loss of each epoch
train_losses = []
valid_losses = []

# for each epoch
for epoch in range(epochs):
    print('\n Epoch {:} / {:}'.format(epoch + 1, epochs))

    # train model
    train_loss, _ = train()

    # evaluate model
    valid_loss, _ = evaluate()

    # save the best model
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'saved_weights.pt')

    # append training and validation loss
    train_losses.append(train_loss)
    valid_losses.append(valid_loss)

    print(f'\nTraining Loss: {train_loss:.3f}')
    print(f'Validation Loss: {valid_loss:.3f}')
And that's it. You can now use the fine-tuned model to run inference on any data or text of your choice.
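As a closing sketch (the review text is an illustrative example, and the label meaning depends on how the dataset encodes sentiment), you could reload the best saved weights and classify a new review like this:
# load the best weights saved during training
model.load_state_dict(torch.load('saved_weights.pt', map_location=device))
model.eval()

text = ["This movie was an absolute delight from start to finish."]  # illustrative input
tokens = tokenizer.batch_encode_plus(text, max_length = 250, padding = 'max_length', truncation = True)
ids = torch.tensor(tokens['input_ids']).to(device)
mask = torch.tensor(tokens['attention_mask']).to(device)

with torch.no_grad():
    log_probs = model(ids, mask)  # the model returns log-probabilities
pred = torch.argmax(log_probs, dim=1)
print('Predicted label:', pred.item())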
4. Conclusion
This article explored the world of fine-tuning large language models (LLMs) and their significant impact on natural language processing (NLP). We discussed the pre-training process, in which LLMs are trained on large amounts of unlabeled text using self-supervised learning. We also drilled down into fine-tuning, which involves adapting a pre-trained model to specific tasks, and prompting, where context is provided to the model to produce relevant output. In addition, we looked at different fine-tuning techniques, such as feature extraction, full model fine-tuning, and adapter-based training. Large language models have revolutionized NLP and continue to drive advances in a wide variety of applications.
5. Frequently Asked Questions
Q1: How can a large language model (LLM) like BERT understand the meaning of text without explicit labels?
A: LLMs employ self-supervised learning techniques, such as masked language modeling, to predict masked words based on the context of the surrounding words, effectively creating labeled data from unlabeled text.
Q2: What is the purpose of fine-tuning a large language model?
A: Fine-tuning allows an LLM to adapt its parameters to specific tasks, making it suitable for sentiment analysis, text generation, or document similarity tasks. It builds on the knowledge the model acquires during pre-training.
Q3: What is the significance of prompting for LLMs?
A: Prompting involves providing context or instructions to an LLM to generate relevant output. Users can craft specific prompts to guide the model to answer questions, generate text, or perform particular tasks based on the given context.
Source: MomodelAI_https://mp.weixin.qq.com/s/bXvIRlxM28aSffKBYCgjGg