On Large Language Models and Their Applications: Starting from ChatGPT

Liu Ting

I. Introduction

The rapid development of the Internet, the Internet of Things, and big data has placed the information space alongside the material space and the spiritual space, forming a ternary space. Artificial intelligence processes massive amounts of data in the information space, and the results in turn act on the material and spiritual spaces. With the emergence of large language models (hereinafter "large models"), machines can automatically generate data in which truth and falsehood are mixed; this both enriches the information space and pollutes it, and its impact on the ternary world is difficult to estimate.

II. Large Model Technology

Artificial intelligence has experienced four waves of enthusiasm. The third wave, around 2010, was driven by deep learning. In November 2022, OpenAI released ChatGPT, powered by the GPT-3.5 large model, marking the arrival of the fourth wave of artificial intelligence.

GPT (Generative Pre-trained Transformer) was, in GPT-3 and earlier versions, a technology known mainly within the natural language processing field; ChatGPT, based on GPT-3.5, broke out of that circle because of its astonishing human-machine dialogue ability and is now sought after by industries around the world.

GPT is a language generation model. Simply put, it predicts the next "word" from the preceding context; by repeating this step it strings words into sentences and sentences into articles, answering users' questions beyond expectation.

Why has GPT achieved such a breakthrough? The key is that by masking words in text to construct fill-in-the-blank questions with standard answers, an unlimited amount of training data is generated at no cost, and the machine acquires general linguistic intelligence in the process of learning to solve these fill-in-the-blank questions. This machine learning method differs from supervised learning in that it requires no manual labeling of data, and from unsupervised learning in that it has standard answers; enjoying the advantages of both, standard answers without labor costs, it is called self-supervised learning.
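The sketch below, a minimal illustration assuming only a naive whitespace split in place of a real subword tokenizer, shows how such training pairs with built-in standard answers fall out of raw text at no labeling cost:

```python
def make_training_pairs(text: str):
    """Turn raw text into (context, next_word) pairs with built-in answers."""
    tokens = text.split()  # simplification: real systems use subword tokenizers
    pairs = []
    for i in range(1, len(tokens)):
        pairs.append((tokens[:i], tokens[i]))  # context so far -> "standard answer"
    return pairs

# Every sentence yields as many labeled examples as it has words,
# with no human annotation at all.
for context, answer in make_training_pairs("the cat sat on the mat"):
    print(context, "->", answer)
```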

In addition, expressing semantics with low-dimensional, continuous, dense vectors instead of symbols is another important reason for the breakthrough of large models. Traditionally, words are treated as symbols, but symbols are isolated from one another, and an additional knowledge base is needed to define their relationships. Word vectors, by contrast, are computed automatically from massive text data on the distributional principle of "know a word by the company it keeps". With word vectors one can judge, for example, that the semantic distance between "table" and "bench" is smaller than that between "table" and "tomato"; building on this, sentence semantics and discourse semantics can likewise be computed.
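A toy illustration of the idea: the four-dimensional vectors below are invented for demonstration only (real word vectors are learned from corpora by methods such as word2vec or GloVe), but they show how cosine similarity turns vectors into semantic distance judgments.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity: higher means semantically closer."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Invented vectors for demonstration; real ones are learned from corpora.
vectors = {
    "table":  np.array([0.9, 0.8, 0.1, 0.0]),
    "bench":  np.array([0.8, 0.9, 0.2, 0.1]),
    "tomato": np.array([0.1, 0.0, 0.9, 0.8]),
}

print(cosine(vectors["table"], vectors["bench"]))   # high: close in meaning
print(cosine(vectors["table"], vectors["tomato"]))  # low: distant in meaning
```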

Of course, big data and high computing power are also important factors. ChatGPT has 175 billion internal parameters, comparable in scale to the number of neurons in the human brain. Quantitative change produces qualitative change, and the phenomenon of "emergence" appears. In addition, instruction fine-tuning, reinforcement learning from human feedback, and other techniques also contributed to ChatGPT's revolutionary breakthrough.

Large models answer users' questions well, sometimes better than people do. From a question-answering perspective, one can compare the large model with databases and search engines. A database stores information in a structured way (such as two-dimensional tables) and is accessed through a standard query language (such as SQL); the way users access it is unnatural. A search engine stores data (web pages, videos, etc.) in an unstructured way; users can express search requests with keywords relatively freely, but the expressiveness is still limited and unnatural, and the results are raw information that has not been processed. A large model stores all the information it has absorbed in parametric form, lets users express their information needs freely and fully in natural language sentences, understands the user's intent, and generates paragraphs and passages as answers from the complex parameter system inside the model. Because the answer is written by the machine for the user's specific question rather than retrieved, it can precisely satisfy the user's specific information need, but it also carries the risk of "hallucination".
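The contrast can be sketched as one information need expressed in the three paradigms; the table name, field names, and query strings below are hypothetical.

```python
# Database: structured storage, rigid standard query language.
sql_query = "SELECT balance FROM accounts WHERE phone = '13800000000';"

# Search engine: free-ish keywords, but the result is raw, unprocessed documents.
search_query = "phone bill balance check"

# Large model: a fully natural request; the answer is generated from the
# model's parameters rather than retrieved, so it fits the question exactly
# but carries a risk of hallucination.
prompt = "How much money do I have left on my phone account?"
```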

In general, ChatGPT is essentially a deep neural network represented by 175 billion floating-point parameters. It is a conversational AI system that realizes the emergence of linguistic intelligence for the first time, with major breakthroughs in five aspects: full online memory of massive information, conversational understanding of arbitrary tasks, chain-of-thought reasoning over complex logic, multi-role and multi-style long text generation, and instant learning and evolution of new knowledge.

III. The Influence of Large Models

Viewed by the level of language problem being solved, the development of natural language processing can be divided into four stages: form, semantics, reasoning, and pragmatics. Traditional search engines solve the problem of form matching, but the same meaning expressed in different forms requires semantic analysis. For example, in a telecom customer service scenario, "Please check my phone bill balance" (the standard question) and "How much money do I have left" (the spoken question) mean the same thing, which form matching cannot resolve. The deeper meaning of language cannot be read off the surface: the user comment "This five-star hotel has no swimming pool" carries a negative sentiment, a conclusion that depends on the knowledge that five-star hotels generally have swimming pools. The highest level of natural language processing is to understand the voice behind the text; whether "he is amazing" is sincere praise can only be determined by fully understanding the context.
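Returning to the telecom example, here is a minimal sketch of semantic rather than form matching, assuming the sentence-transformers package and the public all-MiniLM-L6-v2 checkpoint; any sentence-embedding model would serve the same illustrative purpose.

```python
# Assumes: pip install sentence-transformers (downloads the public
# all-MiniLM-L6-v2 checkpoint on first use).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

standard = "Please check my phone bill balance"
spoken = "How much money do I have left"
unrelated = "What is the weather like tomorrow"

emb = model.encode([standard, spoken, unrelated], convert_to_tensor=True)

# The spoken variant scores much closer to the standard question than the
# unrelated one, despite sharing almost no surface words: semantics, not form.
print(util.cos_sim(emb[0], emb[1]).item())
print(util.cos_sim(emb[0], emb[2]).item())
```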

The author made up a sentence, "Come to think of it, there is only one profession not threatened by ChatGPT: unemployed vagrants," and asked ChatGPT what it means. ChatGPT replied: "There is some irony in this statement, suggesting that the development of artificial intelligence may cost some people their jobs, while unemployed vagrants are unaffected." That ChatGPT can understand irony shows that, driven by large models, natural language processing is moving from the "reasoning" stage toward the "pragmatics" stage.

In terms of research paradigms, natural language processing was dominated by small-scale expert knowledge from 1950 to 1990, shallow machine learning from 1990 to 2010, deep learning from 2010 to 2017, and pre-trained language models from 2018 to 2022, before entering the era of large models in 2023. In the deep learning stage, manual feature engineering was no longer required; in the pre-trained model period, large-scale data no longer needed manual labeling; in the era of large models, the various language processing tasks are unified as generation tasks.

In the era of large models, the boundaries between many natural language processing tasks (question answering, translation, text generation, information extraction, etc.) have been broken: a single large model can handle many tasks and can deal better with new tasks it has never seen. The original "jungle" pattern of natural language processing research has abruptly evolved into a "big tree" pattern: the root of the tree is the large model; the trunk is very short, comprising a number of specific tasks; and the dense foliage, reaching into thousands of industries, is the applications of the large model.

ChatGPT will not only reshape the landscape of natural language processing research but also profoundly affect society. In March 2023, Yuval Harari, author of Sapiens: A Brief History of Humankind, told Sanlian Lifeweek: "Human culture is based on language. And because AI has cracked language, it can now start creating culture. ... Humans will begin to adapt to cultures created by non-human entities. And, since culture is humanity's 'operating system', this means that AI will be able to change the way humans think, feel and behave." Elon Musk believes that ChatGPT is scarily good and that we are not far from dangerously strong artificial intelligence. The American writer and Robust.AI founder Gary Marcus likewise said that generative AI will pose a tangible, imminent threat to the fabric of society.

IV. The Future of Large Models

Large models are not perfect, and much research remains to be done around their improvement, mainly including remedying their shortcomings, promoting their application, and exploring their mechanisms.

The shortcomings of large models lie in many aspects. (1) Insufficient factual consistency: "hallucinations" often occur. In essence, a large model does not look up information; it turns massive data into parameters and regenerates text, and facts may be fabricated in the process. (2) Insufficient logical consistency: in multi-round dialogue, the large model sometimes loses its grip on the internal logic of people and events. (3) Excessive demands on data and computing power, making training and application so costly that most research institutions and enterprises cannot afford them. Other shortcomings will not be detailed here.

To remedy these shortcomings, the following issues urgently need study in the coming period: how to improve the credibility of generated text; how to give the machine a role, so that its output matches the identity, personality, and language style of a specific character; how to make the machine understand a specific user better and provide personalized service; how to break down data silos in serious fields (such as healthcare) and train on private data; how to prune or compress models toward miniaturization for easier application (a minimal sketch of one such route follows below); how to obtain richer information from multimodal data (images, video, speech, etc.); how to improve training efficiency through data splitting and reorganization, engineering optimization, and P2P training; how to harness computing power scattered across many places through a computing power network; and so on.
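As one concrete example of the miniaturization direction, here is a minimal sketch of post-training dynamic quantization in PyTorch; the tiny stand-in model and layer sizes are invented for illustration, not taken from any real system.

```python
import torch
import torch.nn as nn

# A tiny stand-in model; the same call applies to the Linear layers
# of a real transformer.
model = nn.Sequential(nn.Linear(768, 3072), nn.ReLU(), nn.Linear(3072, 768))

# Post-training dynamic quantization: weights stored as int8 (roughly 4x
# smaller), activations quantized on the fly at inference; some accuracy loss.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 768)
print(quantized(x).shape)  # torch.Size([1, 768])
```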

How to build on general-purpose large models, combine them with specific industries and scenarios, and train industry large models on domain big data is the future focus of most research institutions and enterprises. Industries fall into two broad categories. One comprises industries with more interaction between people and data, such as education, healthcare, and finance; these offer rich scenarios requiring human participation, where large models can be put to use directly. The other comprises industries with more interaction between things and data, such as manufacturing, electric power, agriculture, and transportation; there, large models can so far only help mine and query industry knowledge, and how they can contribute to production and circulation processes remains to be explored. The performance of large models has indeed far surpassed earlier natural language processing technology, but whether it can meet users' requirements in serious scenarios (such as medicine, law, and the military), cross the last mile, truly land, and bring value to industry remains to be observed and tested in practice.

Throughout human history, inventions have been created after their principles were clarified, but the large model is an exception. Intelligence has indeed emerged, yet no one, including the inventors of ChatGPT, can accurately explain its mechanism. Exploring the intrinsic mechanism of large models through comprehensive evaluation is an important research topic for the coming period.

V. The Challenges Large Models Pose to Cognitive Security

ChatGPT can automatically answer people's questions. As people come to depend on it, its views will influence users' cognition, giving rise to cognitive security challenges.

Cognition refers to the process by which people acquire and apply knowledge, that is, the process of information processing; it is the most basic human psychological process, encompassing sensation, perception, memory, thinking, imagination, and language. Cognitive security refers to the security of people's will, beliefs, thinking, psychology, and other mental factors.

Cognitive security has many dimensions, such as knowledge, psychology, morality (values), law, and politics. Examples include erroneous knowledge or unfounded statements; cognitively distorted, extreme, or highly emotional speech; views that conflict with mainstream values or deviate from local cultural practices; user questions or machine replies that violate laws and regulations; and reactionary content that endangers national security. All of these can be regarded as harmful information and pose a threat to the cognitive security of the Chinese public.

We emphasize "building a strong strategic deterrent force system, increasing the proportion of new-domain and new-quality combat forces, accelerating the development of unmanned intelligent combat forces, and coordinating the construction and application of network information systems." Compared with traditional land, sea, air, and space warfare, cyber warfare and public opinion warfare are silent battlefields without gunsmoke, fought constantly even in peacetime. Public opinion warfare differs from cyber warfare: its battles occur not at the "external information level" but, through the information space, at the "inner consciousness level", making it more covert and penetrating. In the war of public opinion, language becomes a weapon, and today's large models, which can automatically answer questions, post comments, and write articles, automate that language weapon; the risks this brings are incalculable.

To safeguard citizens' cognitive security, it is urgent to strengthen research on technologies for automatically identifying, refuting, and correcting harmful information. A domestic large model should also self-filter its content for safety, to avoid outputting harmful information that stems from bad data in the training corpus or from the model's "hallucinations".
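A minimal sketch of such output-side self-filtering follows, with a hypothetical is_harmful placeholder standing in for what would in practice be a trained classifier combined with blocklists and human review.

```python
from typing import Callable

def is_harmful(text: str) -> bool:
    """Hypothetical placeholder: in practice a trained classification
    model, not a keyword check."""
    blocklist = {"example-harmful-phrase"}
    return any(phrase in text.lower() for phrase in blocklist)

def respond(generate: Callable[[str], str], prompt: str) -> str:
    """Screen the model's draft output before it reaches the user."""
    draft = generate(prompt)
    if is_harmful(draft):
        return "[withheld by the content safety filter]"
    return draft

# Usage with any text generator:
print(respond(lambda p: "a harmless answer to: " + p, "hello"))
```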

VI. Language Resource Construction in the Era of Large Models

Large models are built on language big data, which is not merely a collection of symbol strings but a treasure trove of human knowledge and experience. Just as a nation's genetic big data is its biological code, a nation's language big data carries its ideological, cultural, and historical codes. Real-time language big data also reflects society's current political and economic dynamics, a highly valuable asset for any country.

Truly high-quality language data cannot all be crawled freely from the Internet; much of it resides in the private databases of government departments, enterprises, and institutions, and it deserves protection as careful as that given to genetic data. At the same time, in developing state-backed general large models, high-quality language big data scattered across many holders should be pooled and used, to secure the leading position of China's own Chinese-language models in the world.

Large models, moreover, have no language boundaries. To compete for international discourse power and further expand international economic and trade exchanges, China also needs to consciously collect, preserve, and organize massive multilingual data resources, covering both major and minor world languages. The language big data of each country and ethnic group carries its culture, history, and current political and economic situation, which is essential for research in area studies and for building powerful Chinese multilingual models.

At present, efforts are under way to centrally allocate geographically scattered computing power through computing power networks. Similarly, scattered language big data could, through standard-setting and through the evaluation and pricing of language big data, be shared and traded, so that domestic language big data is fully used by domestic large model developers and language data plays its role as a new factor of production. This too deserves in-depth discussion by those concerned.

Large models can already generate highly coherent and logical discourse; their command of the laws of language itself has reached or even exceeded that of ordinary people. What large models still need to learn from people is the knowledge and experience that language carries, especially morality, culture, and values. Future language resource work will therefore consist less of annotating the lexical, syntactic, and semantic structure of language itself and more of manually annotating the information that language carries. In the development of ChatGPT, both prompt construction and reinforcement learning from human feedback (RLHF) involved full human participation, which is worth our reference.
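A minimal sketch of the preference-annotation format commonly used in RLHF-style training follows; the field names and example texts are illustrative, not drawn from any specific dataset.

```python
# One annotated example: annotators rank candidate answers, and the ranking
# itself, not any lexical or syntactic label, is the supervision signal.
preference_example = {
    "prompt": "Explain why the sky is blue to a six-year-old.",
    "chosen": (
        "Sunlight is made of many colors, and the blue part bounces "
        "around in the air the most, so the sky looks blue."
    ),
    "rejected": (
        "Rayleigh scattering attenuates intensity in proportion to the "
        "inverse fourth power of wavelength."
    ),
}

# A reward model is trained so that score(prompt, chosen) exceeds
# score(prompt, rejected); the language model is then fine-tuned
# against that reward signal.
```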

VII. Conclusion

ChatGPT's birth and astonishing question-answering performance lifted natural language processing to a level far beyond expectations, allowing people in the AI industry to glimpse the dawn of general artificial intelligence and letting thousands of industries see the major opportunities and challenges it may bring them. The emergence of ChatGPT also prompts deep reflection: what exactly is language? Where should language workers, including linguists and natural language processing researchers, direct their future efforts? Why did China not invent large models first?

Although many puzzles remain, the era of large models has arrived. We must face reality with open eyes, plan the overall layout, collaborate fully across disciplines, and, on the premise of ensuring data security and cognitive security, vigorously develop domestic large models and promote their implementation across industries, contributing to China's economic development and social prosperity.

(This article was first published in Language Strategy Research, Issue 5, 2023)
