ChatGPT's past and present

Author: Artificial Intelligence Academic Discussion

This article details the key events in the development of chatbots, to help readers build a preliminary timeline of intelligent chatbot development.

Sprouting: 1960s - 1970s

During this period, the earliest prototypes of chatbots were developed. The most famous of these is ELIZA, which simulated a conversation with a psychotherapist. The main technical framework of this era was rule-based syntax parsing.

Eliza conversation

ELIZA was an early natural language processing computer program created by Joseph Weizenbaum at MIT between 1964 and 1966. Originally developed on the IBM 7094 in MAD-SLIP to explore communication between humans and machines, it uses pattern matching and substitution to simulate conversation, giving users the illusion that the program understands them, even though it has no real understanding of what either party is saying.

ELIZA had five basic technical problems to solve: identifying keywords, discovering a minimal context, choosing appropriate transformations, generating responses from those transformations (or in the absence of any keyword), and providing an editing capability for ELIZA scripts. Weizenbaum gave ELIZA no built-in contextual framework or universe of discourse; instead, ELIZA relies on a script that guides how it responds to user input.

ELIZA processes user input by first scanning the text for keywords, ranking them by importance, and placing them on a keyword stack. The input utterance is then decomposed and transformed according to the highest-ranking keyword on the stack, and the fragments are reassembled into a response that is sent back to the user as text. How ELIZA behaves depends on its active script, which can be edited or replaced to suit the desired context. ELIZA's most famous script is DOCTOR, which mimics a Rogerian psychotherapist. ELIZA also has some specialized features: for example, a "MEMORY" structure records earlier inputs, and when no keyword is found in the current input, these stored inputs can be used to construct a response.
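
The keyword-ranking and decomposition/reassembly loop can be sketched in a few lines of Python. The rules below are illustrative stand-ins, not Weizenbaum's original DOCTOR script:

```python
import random
import re

# Each rule: (rank, decomposition pattern, reassembly templates).
# Higher-ranked keywords win; a catch-all rule guarantees a reply.
RULES = [
    (10, re.compile(r".*\bmy (?P<x>.*)", re.I),
         ["Tell me more about your {x}.", "Why do you say your {x}?"]),
    (5, re.compile(r".*\bi am (?P<x>.*)", re.I),
         ["How long have you been {x}?"]),
    (0, re.compile(r".*"),
         ["Please go on.", "I see."]),
]

def respond(utterance, rng=random.Random(0)):
    # Scan rules from highest rank down; the first matching
    # decomposition is reassembled into a response.
    for _, pattern, templates in sorted(RULES, key=lambda r: -r[0]):
        m = pattern.match(utterance)
        if m:
            return rng.choice(templates).format(**m.groupdict())
    return "Please go on."
```

The catch-all rule at rank 0 plays the role of ELIZA's no-keyword fallback.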

ELIZA's success inspired many researchers to explore how natural language processing and artificial intelligence techniques could simulate human conversation and thought. This research spawned a series of ELIZA-like chatbots, such as PARRY and Jabberwacky. Although these chatbots could manage simple natural language interactions, their lack of deep understanding and reasoning ability kept them far from human-level dialogue.

Nevertheless, ELIZA's design philosophy and techniques continued to influence later natural language processing research: keyword-based text generation and question answering systems, dialogue state tracking, and dialogue strategies were all inspired by ELIZA.

In addition, ELIZA became a source of inspiration for many cultural works; for example, Samantha, the artificial intelligence character in the movie "Her", was influenced by ELIZA.

Exploration: 1980s - 2000s

During this period, the key technical frameworks for chatbots were finite state machines (FSMs) and template-matching techniques. Representative chatbots of this period include Jabberwacky and Alice.

Jabberwacky — 1986

Jabberwacky prototype

Jabberwacky was developed by British programmer Rollo Carpenter in 1986. It began as an artificial-intelligence chat program that used language processing and machine learning techniques to simulate human conversation.

Jabberwacky's distinguishing feature is its ability to learn from dialogue and adapt to a user's preferences and behavior, so its conversational ability improves over time. For example, when a user asks Jabberwacky about its hobbies, it remembers the user's responses and uses that information to generate more targeted replies.

Jabberwacky's technical principles are based on natural language processing and machine learning. Drawing on its existing dialogue library and language model, the system analyzes and matches the text entered by the user and generates corresponding replies.

Jabberwacky's system consists of two main components: front-end and back-end. The front end includes the user interface and conversational interaction, while the back end is the core technology that includes machine learning algorithms and language models.

During use, text entered by the user is passed from the front end to the back end for processing. The back end first applies natural language processing techniques such as word segmentation and part-of-speech tagging to obtain a more semantic representation of the input. This representation is then compared against the texts in the existing dialogue library to find the most similar conversation, and a reply is generated according to the rules in the library.

For questions that are not in the dialogue library or cannot be matched accurately, Jabberwacky can also learn the relationship between user inputs and responses through machine learning, continuously updating its language model to improve the quality and fluency of its conversations.
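
The retrieval-and-match step described above can be sketched as follows. The dialogue library entries and the word-overlap (Jaccard) similarity are illustrative choices, not Jabberwacky's actual algorithm:

```python
# A toy retrieval-based responder: tokenize the input, compare it to
# stored utterances by word overlap, and return the paired response.
DIALOGUE_LIBRARY = [
    ("what is your name", "I'm a chatbot, I don't really have a name."),
    ("do you like music", "I enjoy talking about music, yes."),
    ("goodbye", "Bye! Come back soon."),
]

def tokenize(text):
    return set(text.lower().split())

def jaccard(a, b):
    # Similarity = shared words / total distinct words.
    return len(a & b) / len(a | b) if a | b else 0.0

def reply(user_input):
    words = tokenize(user_input)
    best = max(DIALOGUE_LIBRARY,
               key=lambda pair: jaccard(words, tokenize(pair[0])))
    return best[1]
```

A learning system would additionally append novel (input, response) pairs to the library over time, which is the adaptation behavior the text describes.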

Alice — 1995

Alice prototype

Alice (also known as AIML Bot) is a chatbot based on AIML (Artificial Intelligence Markup Language). AIML is an XML-based markup language for defining a chatbot's dialogue logic and pattern-matching rules. Originally developed by Richard Wallace in 1995, Alice is considered one of the most influential early chatbots.

Alice's goal is to mimic natural human language and hold meaningful conversations with users. It can answer questions on topics such as weather, news, and sports, and also make small talk. Alice generates responses using AIML templates: a rule system that matches input text against predefined patterns and produces appropriate replies. Alice also uses some natural language processing (NLP) techniques, such as part-of-speech tagging and word sense disambiguation, to improve the accuracy and fluency of its conversations.
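
A minimal AIML fragment illustrates the pattern/template pairing; the categories below are invented for this article, not taken from Alice's actual template files. Each `<category>` pairs a `<pattern>` matched against user input with a `<template>` that generates the reply; `*` is a wildcard and `<star/>` echoes what the wildcard captured:

```xml
<aiml version="1.0.1">
  <category>
    <pattern>WHAT IS YOUR NAME</pattern>
    <template>My name is Alice.</template>
  </category>
  <category>
    <pattern>I LIKE *</pattern>
    <template>Why do you like <star/>?</template>
  </category>
</aiml>
```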

Alice's source code and AIML templates are open, so developers are free to modify and extend them. Alice has been integrated into multiple applications, including online chat rooms, voice assistants, and smart home systems.

Alice's technical principles are mainly based on natural language processing and machine learning techniques. Its dialogue logic mainly includes pattern matching and response. Pattern matching determines user intent by comparing user input with predefined patterns. The response part generates responses based on matching patterns and uses machine learning algorithms to classify and process user input when needed to improve the accuracy and fluency of conversations.

Alice also uses a knowledge-base approach to enhance its answers. By collecting and storing knowledge and information across many fields, it can answer more complex questions.

Predefined patterns are a text processing technique used to identify specific language patterns from user input and extract relevant information. In chatbots, predefined patterns are often used to match the topic, intent, or keyword asked by the user and answer the question without the need for complex natural language processing.

Predefined patterns can be implemented in the form of regular expressions, keyword matching, language models, and so on. Some chatbot platforms offer graphical interfaces that make it easy for users to create their own predefined patterns.

For example, if a user asks "What can you do?", a predefined pattern can match that phrase and return a relevant response such as "I can answer your questions or make small talk." The advantage of predefined patterns is their simplicity; their limitation is that they cannot recognize complex language structures and intent.
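
A minimal sketch of predefined patterns implemented as regular expressions; the patterns and canned replies below are invented for illustration:

```python
import re

# Each predefined pattern maps a keyword regex to a canned response.
PATTERNS = [
    (re.compile(r"\bwhat (would|can) you do\b", re.I),
     "I can answer your questions or make small talk."),
    (re.compile(r"\bweather\b", re.I),
     "I can't check live weather, but I can chat about it."),
]

def match_response(text, fallback="Sorry, I didn't understand that."):
    # Return the reply of the first pattern found in the input.
    for pattern, reply in PATTERNS:
        if pattern.search(text):
            return reply
    return fallback
```

The fallback reply shows the limitation noted above: anything outside the predefined patterns gets only a generic response.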

Accumulation: 2010s

During this period, advances in deep learning drove a new wave of intelligent chatbots, the best known being Microsoft's Xiaoice and Google's Neural Conversational Model (a Seq2Seq system). These bots use techniques such as recurrent neural networks (RNNs) and long short-term memory (LSTM) networks to learn the context of a conversation and generate natural language responses.

Siri — 2011

Siri

Siri is an intelligent voice assistant developed by Apple, first released in 2011. It interacts with users through technologies such as speech recognition, natural language processing, and knowledge graphs, and provides users with various services and features.

Siri's technical architecture and principles are based on a combination of natural language understanding (NLU), natural language generation (NLG), conversation management, and knowledge base.

On the NLU side, Siri uses speech recognition technology to convert the user's speech to text, and natural language processing (NLP) technology to analyze and understand that text. This includes recognizing the user's intent, identifying entities such as dates, times, and locations, and handling tasks such as semantic parsing and ambiguity resolution.

On the NLG side, Siri uses techniques such as template generation and data-driven generation to generate natural language responses based on user intent and information in the knowledge base.

In terms of conversation management, Siri uses a series of algorithms and models to manage the conversation process, including tasks such as conversation status tracking, context recognition, conversation policy, and conversation generation. This allows Siri to infer the user's intent and needs based on their previous rounds of conversation and generate a corresponding response.

In terms of its knowledge base, Siri draws on a variety of information sources, including data on the user's device, third-party data, and web searches, to provide richer answers and services. Siri can also learn and adapt to user preferences and behaviors to provide a more personalized service.
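
The NLU, state-tracking, and response steps described above can be sketched as a toy slot-filling loop. This is an illustrative sketch, not Apple's implementation; the `set_alarm` intent and slot names are invented:

```python
import re

def understand(text):
    """Crude NLU: detect a 'set_alarm' intent and extract a time slot."""
    m = re.search(r"\balarm\b.*?(\d{1,2}(:\d{2})?\s*(am|pm)?)", text, re.I)
    if m:
        return {"intent": "set_alarm", "time": m.group(1).strip()}
    return {"intent": "unknown"}

def handle_turn(state, text):
    """One dialogue turn: NLU, then state tracking, then a reply."""
    nlu = understand(text)
    if nlu["intent"] == "set_alarm":
        # Track the pending request across turns in the dialogue state.
        state["pending_time"] = nlu["time"]
        return f"Setting an alarm for {nlu['time']}. Confirm?"
    if state.get("pending_time") and text.strip().lower() in {"yes", "confirm"}:
        return f"Alarm set for {state.pop('pending_time')}."
    return "Sorry, I didn't catch that."
```

The `state` dictionary stands in for conversation status tracking: a bare "yes" is meaningful only because the previous turn left a pending request in the state.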

Messenger Bot — 2016

Messenger Bot is a chatbot on the Facebook Messenger platform that allows developers to create and deploy their own chatbots to interact with users on Messenger. Based on Facebook's bot engine, Messenger Bot uses natural language processing and artificial intelligence to interact with users via text, voice or images.

Messenger bots can implement many different functions, such as providing user support, subscribing to the latest news, purchasing goods, subscribing to services, booking restaurants, etc. Developers can use the APIs and tools provided by Facebook to build their own bots and integrate them into the Messenger platform.

The advantages of Messenger Bot include natural, easy-to-understand conversations, fast and personalized service, 24/7 uninterrupted availability, and reduced labor costs. As the Messenger platform has evolved, Messenger Bot has improved with it, becoming one of the most popular chatbot platforms.

The technical architecture of Messenger Bot is mainly divided into the following aspects:

  1. Messenger platform: Messenger is an instant messaging application for Facebook that supports voice calls, video calls, text and multimedia messaging, and is also the basic platform for Messenger Bot to operate.
  2. Bot Engine: Bot Engine is one of the core components of Messenger Bot, a machine learning framework that uses natural language processing (NLP) technology to help bots understand users' messages, identify users' intent, and provide corresponding responses based on the user's needs.
  3. Wit.ai: Wit.ai is Facebook's natural language processing platform that provides an API that allows developers to build and train their own NLP models. Messenger bots can use Wit.ai APIs to build their own NLP models and apply those models through the Bot Engine.
  4. Webhook: Webhook is another core component of Messenger Bot, which is an HTTP callback mechanism for integrating Messenger Bot with other applications. When a user interacts with a Messenger bot on the Messenger platform, Messenger passes the user's messages to the Messenger bot's webhook endpoint, which receives them, processes them, and returns the response to the user.
  5. Application programming interface (API): Messenger Bot also provides a set of APIs for interacting with the Messenger platform, such as sending messages, setting menus, getting user profiles, and more.

The technical principles of the core Bot Engine module are as follows:

  1. Training data: Bot Engine trains its natural language processing model using a large amount of corpus data, which includes information such as users' chat history, questions, and answers. By learning from this data, Bot Engine can identify the user's intent, extract key information, and respond to the user's request.
  2. Natural language understanding: The Bot Engine's natural language understanding (NLU) module converts the user's natural language input into a form the computer can process, recognizing the user's intent and extracting key information. For example, if a user enters "Please help me book a ticket from Beijing to Shanghai", Bot Engine can understand that the intent is to book a ticket, and extract two key pieces of information, "Beijing" and "Shanghai".
  3. Decision trees: Bot Engine uses decision trees to respond based on the intent entered by the user and the information extracted. A decision tree is a rule-based algorithm that makes decisions by executing corresponding rules on different nodes. For example, when the Bot Engine recognizes that the user's intention is to book a flight ticket and extracts the origin and destination information, it can generate a corresponding response based on preset rules, such as "Okay, you need to book a ticket from Beijing to Shanghai, please provide the departure time and the number of passengers".
  4. Machine learning: Bot Engine also uses machine learning techniques to improve the accuracy and efficiency of natural language processing. It can continuously optimize its models and algorithms by learning from the user's historical chat history and feedback to better understand and respond to the user's request.
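
The decision-tree step above can be sketched for the flight-booking example; this is an illustrative toy, not Facebook's Bot Engine, and the slot names (`intent`, `origin`, `destination`, `date`) are invented:

```python
# Walk a small decision tree over the slots extracted by NLU: each
# node checks which slots are filled and either asks a follow-up
# question or completes the booking.
def next_reply(slots):
    if slots.get("intent") != "book_flight":
        return "How can I help you?"
    if "origin" not in slots:
        return "Where are you flying from?"
    if "destination" not in slots:
        return "Where are you flying to?"
    if "date" not in slots:
        return f"Okay, {slots['origin']} to {slots['destination']}. What date?"
    return (f"Booking a flight from {slots['origin']} to "
            f"{slots['destination']} on {slots['date']}.")
```

Each `if` branch corresponds to a node in the tree; the order of the checks encodes the preset rules for which question to ask next.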

Google Duplex — 2018

Google Duplex is a natural language dialogue system, announced by Google in 2018, that can carry out phone conversations on a user's behalf, such as booking restaurant reservations. Its technical principles and architecture include the following aspects:

Speech recognition: Google Duplex first converts the user's speech input into text through speech recognition technology. This process requires the use of sophisticated acoustic models, language models, and machine learning algorithms to identify the user's speech input as accurately as possible.

Natural language understanding: After converting the user's speech input into text, Google Duplex uses natural language processing technology to understand the user's intent. This process applies word segmentation, part-of-speech tagging, named entity recognition, dependency parsing, and other analyses to the text to determine the user's intent and the tasks that need to be performed.

Conversation management: Google Duplex uses dialog management technology to maintain the state of conversations, manage the flow of conversations, and decide when more information needs to be requested from users. This process requires the use of techniques such as machine learning algorithms and decision trees to determine the next step in the conversation.

Natural language generation: When Google Duplex needs to provide information to the user, it uses natural language generation technology to convert computer-generated text into natural language. This process requires the use of techniques such as text generation models and language models.

Speech synthesis: Finally, Google Duplex uses speech synthesis technology to convert computer-generated text into natural language audio output to provide information to the user.

The architecture of Google Duplex can be divided into two parts: front-end and back-end. The front end includes components such as speech recognition, natural language understanding, and dialog management, and the back end includes components such as speech synthesis, text generation, and natural language processing. The front-end and back-end are connected via Google's own network for efficient communication and processing speed.
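
The five stages above can be chained as a toy pipeline. Each function below is an illustrative stub standing in for the real component, not Google's implementation:

```python
def asr(audio):
    """Speech recognition: audio in, transcript text out (stubbed)."""
    return audio["transcript"]

def nlu(text):
    """Understanding: map text to an intent."""
    return {"intent": "book_table"} if "table" in text else {"intent": "unknown"}

def dialog_manager(intent):
    """Decide the next dialogue action for the recognized intent."""
    return "ask_party_size" if intent["intent"] == "book_table" else "clarify"

def nlg(action):
    """Generation: render the chosen action as natural language."""
    return {"ask_party_size": "For how many people?",
            "clarify": "Sorry, could you repeat that?"}[action]

def tts(text):
    """Speech synthesis: text in, audio out (stubbed as a dict)."""
    return {"audio_for": text}

def duplex_turn(audio):
    # One conversational turn: ASR -> NLU -> dialog management -> NLG -> TTS.
    return tts(nlg(dialog_manager(nlu(asr(audio)))))
```

The composition in `duplex_turn` mirrors the front-end/back-end split described above: the real system replaces each stub with a trained model.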

Explosion: 2020s

During this period, technical frameworks based on pre-trained models became mainstream. These models, including BERT, GPT, and T5, can be trained on large corpora to better understand user intent and generate natural language responses. In addition, technical frameworks using generative adversarial networks (GANs) are also being applied to intelligent chatbots.

GPT-3 — 2020

GPT-3 (Generative Pre-trained Transformer 3) is a natural language processing model developed by OpenAI and, at the time of its release, one of the largest and most advanced pre-trained language models in the world. It adopts the Transformer architecture and is trained on massive corpora via unsupervised learning. It can generate high-quality natural language text and has strong capabilities in natural language generation, question answering, text summarization, machine translation, and more.

GPT-3 has many capabilities and uses: it can generate various types of natural language text, such as emails, news articles, and summaries, and it can also power intelligent customer service, assistants, programming aids, and other applications, providing more convenient and efficient services.

GPT-3's architecture and principles are based on deep neural networks. With 175 billion parameters, it can process large amounts of linguistic information and has strong contextual awareness and understanding. Through learning from massive corpora, GPT-3 automatically acquires knowledge of language rules, semantics, and grammar, producing more accurate, fluent, and natural text. GPT-3 also has certain reasoning capabilities: it can understand context and semantic information, answer questions accordingly, and serve as the basis for natural-language dialogue and question answering systems.

GPT-3 is powerful because it uses Transformer-based deep learning technology and has significantly improved training data and model scale. Here's how GPT-3 works:

  1. Transformer-based architecture: GPT-3 uses a Transformer-based neural network structure, which was proposed by Vaswani et al. in 2017. Compared with the traditional recurrent neural network (RNN) or convolutional neural network (CNN), Transformer realizes the parallel processing and global correlation of sequence information through the self-attention mechanism, which greatly improves the efficiency and accuracy of the model.
  2. Large-scale training data: GPT-3's training corpus contains hundreds of billions of tokens, allowing the model to learn complex language patterns from a large amount of real text and thereby improving its performance and generalization ability.
  3. Adaptive training technology: GPT-3 adopts adaptive learning technology in the training process, that is, each sample has a different weight, which makes the model pay more attention to difficult and error-prone samples, thereby further improving the accuracy of the model.
  4. Zero-shot learning techniques: GPT-3 has also made important advances in zero-shot learning, where models can perform tasks without specific training data. This is thanks to the large-scale pre-training mode of GPT-3, which enables the model to establish a common language model and knowledge base between training data and task-specific data.

Transformer is a neural network structure based on the self-attention mechanism, which is widely used in natural language processing tasks, such as machine translation, text classification, text generation, etc. Compared with traditional RNN and CNN models, Transformer has the advantages of parallel computing and long-distance dependency modeling capabilities.

The core idea of Transformer is the self-attention mechanism, that is, each word in a sentence can depend on other words, and their dependencies can be dynamically adjusted according to the task. Specifically, the Transformer model is divided into two parts, encoder and decoder, each of which is stacked with several identical modules. Inside each module are two sublayers: multi-headed self-attention and feed-forward neural networks.

The multi-head self-attention layer can be seen as the process of aggregating and encoding all words in the input sentence, and each word can be represented by a vector by focusing on other words. In this process, different attention heads can be weighted and fused, so that different attention mechanisms can learn different dependencies.

The feedforward neural network layer can be seen as a nonlinear transformation that maps the vectors output from the attention layer into a new high-dimensional vector space. This allows for better capture of nonlinear features in the input sentence, which improves the performance of the model.

In decoder-style language models such as GPT, training uses a causal mask in the self-attention layers so that each position can only attend to the positions before it, preventing the model from seeing future information during training. In addition, techniques such as residual connections and layer normalization are used to mitigate vanishing gradients and accelerate model convergence.
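
The self-attention computation described above can be sketched in plain Python. This single-head version with a causal mask omits the learned query/key/value projections and the multi-head split for brevity:

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def causal_self_attention(X):
    """Single-head scaled dot-product self-attention with a causal mask.

    X is a list of d-dimensional token vectors. For brevity the same
    vectors serve as queries, keys, and values (no learned projections)."""
    d = len(X[0])
    outputs = []
    for i, query in enumerate(X):
        # Causal mask: position i may only attend to positions 0..i.
        scores = [sum(q * k for q, k in zip(query, X[j])) / math.sqrt(d)
                  for j in range(i + 1)]
        weights = softmax(scores)  # attention weights sum to 1
        outputs.append([sum(weights[j] * X[j][dim] for j in range(i + 1))
                        for dim in range(d)])
    return outputs
```

Each output vector is a weighted average of the visible value vectors, which is the "aggregating and encoding" of other words that the text describes; multi-head attention runs several such maps in parallel on projected inputs.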

GPT-3 uses a variety of methods to collect large-scale training data, the main sources are:

  • Web crawlers: GPT-3 uses web crawlers to collect text data from the Internet. It collects data from hundreds of millions of web pages and uses language processing techniques to clean and standardize this data for use in model training.
  • Public Corpus: GPT-3 uses data from public corpora, which are typically maintained by universities, research institutes, and government agencies. These corpus contain large amounts of textual data, such as news articles, novels, essays, etc.
  • Human-generated datasets: GPT-3 also uses human-created datasets such as question-answer pairs, sentence breaks, etc. These datasets are often created by workers on crowdsourcing platforms and then reviewed and cleansed to ensure the quality of the data.
  • Curated corpora: according to OpenAI's GPT-3 paper, the training mix also includes the WebText2 dataset, two book corpora, and English Wikipedia, alongside filtered Common Crawl data.

In the process of model training, adaptive training technology dynamically adjusts the training strategy and hyperparameters according to the difficulty of each data sample and the performance of the model on that sample, so as to improve the generalization ability and performance of the model. The goal of adaptive training techniques is to make the model better adapt to different types and difficulty of data, reduce overfitting and underfitting problems, and improve the prediction accuracy of the model on unseen data.

Adaptive training techniques typically include the following two approaches:

  • Curriculum learning: during training, the data is divided into stages by difficulty; simple samples are trained first, and harder samples are added gradually. This helps the model avoid settling prematurely into poor local optima and improves its generalization ability.
  • Adaptive learning rates: the learning rate of each parameter is adjusted dynamically during training, so that the model converges quickly early on and the learning rate is gradually reduced as convergence approaches, preventing over-adjustment, oscillation, and instability late in training.
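
Both ideas can be sketched in a few lines. The schedule shape and the difficulty scoring below are illustrative defaults, not GPT-3's actual hyperparameters:

```python
import math

def learning_rate(step, base_lr=1e-3, warmup=100, total=1000):
    """Linear warmup for `warmup` steps, then cosine decay to zero."""
    if step < warmup:
        return base_lr * step / warmup
    progress = (step - warmup) / max(1, total - warmup)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

def curriculum(samples, difficulty):
    """Curriculum learning: order training samples from easy to hard."""
    return sorted(samples, key=difficulty)
```

In practice the difficulty function might be sequence length or current model loss on the sample; here any callable scoring function works.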

Zero-shot learning refers to the task of classifying unknown categories by using prior knowledge and meta-learning techniques without any training data. While traditional classification models can only classify classes that have appeared in their training data, zero-shot learning can expand the application scenarios of the model so that it can handle more unknown categories.

The basic idea of zero-shot learning is to associate the descriptive information of categories (such as attributes, semantic representations, etc.) with the model, so that the model can predict the corresponding labels based on the descriptive information of the categories. In this way, when the model encounters an unknown category, it only needs to enter the description information of the category into the model to get the label of the category. The advantage of this approach is that it can expand the application scenario of the model to uncharted territory, while avoiding the cost and time consumption of collecting a large amount of training data.
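
A toy version of this idea: classes are defined only by textual descriptions, and an input is assigned to the class whose description it overlaps most. The classes and descriptions below are invented for illustration; real systems use learned embeddings rather than raw word overlap:

```python
# Each class is known to the model only through a description,
# standing in for the attribute/semantic representations in the text.
CLASS_DESCRIPTIONS = {
    "animal": "a living creature such as a dog cat horse bird",
    "vehicle": "a machine for transport such as a car bus train bike",
}

def zero_shot_classify(text):
    # Pick the class whose description shares the most words with the input.
    words = set(text.lower().split())
    return max(CLASS_DESCRIPTIONS,
               key=lambda c: len(words & set(CLASS_DESCRIPTIONS[c].split())))
```

Adding a new class requires only writing its description, with no retraining, which is the key property zero-shot learning aims for.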

In the field of natural language processing, zero-shot learning techniques have been applied to tasks such as word sense disambiguation and sentence relationship judgment. For example, by learning semantic representations of words and sentences, a model can be applied to relationship judgments involving unseen words, such as determining whether "pasta" and "spaghetti" belong to the same category. Likewise, by learning the attributes and relationships of words, a model can classify the sentiment of unseen words, such as judging the emotional polarity of "unicorn".

Zero-shot learning techniques are still an open research problem, and there are still many challenges and problems to be solved. For example, how to improve the generalization ability of the model, how to solve the noise and incompleteness of the description information, and so on.

Paper Recommendation:

  1. “Attention Is All You Need” by Ashish Vaswani et al. (2017)

    This paper introduces a new neural network architecture called Transformer, which excels in translation tasks. Transformer is a widely used framework in pre-trained model-based chatbots today.

  2. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding” by Jacob Devlin et al. (2019)

    This paper introduces the BERT (Bidirectional Encoder Representations from Transformers) model, which is currently one of the most popular pre-trained models in natural language processing. The BERT model has achieved leading results in many natural language processing tasks.

  3. “Language Models are Unsupervised Multitask Learners” by Alec Radford et al. (2019)

    This paper introduces the GPT-2 (Generative Pre-trained Transformer 2) model, a large-scale language model based on Transformer. The GPT-2 model is capable of generating high-quality natural language text and has achieved leading results in multiple natural language processing tasks.

  4. “Dialogue Response Generation with Implicit Memory Net” by Xiaodong Liu et al. (2018)

    The paper introduces a model called Implicit Memory Net, which uses memory networks to maintain conversation histories and generate natural language responses. This model achieves excellent results on multiple tasks.

  5. “DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation” by Yizhe Zhang et al. (2020)

    This paper introduces DialoGPT, a large-scale dialogue generation model based on GPT-2. DialoGPT is able to generate high-quality natural language responses and has achieved leading results in multiple dialogue generation tasks.