Pushing boundaries: High-performance computing leads LLMs into the era of AGI innovation

Author: Deep learning GPU server

AGI | AIGC | Large model training | GH200

LLM | LLMs | Large Language Models | MI300

The success of ChatGPT has driven the development of the entire AIGC industry, especially in the fields of LLM (Large Language Model), NLP, high-performance computing, and deep learning. The development of LLMs will provide strong impetus for the growth of the global and Chinese AI chip and AI server markets; it is estimated that LLMs will bring about $89.12 billion and $33.82 billion in market space to the global and Chinese AI server markets, respectively.

Foreign manufacturers have a leading edge in the LLM field, but mainland Chinese LLM products are also developing rapidly. Since 2023, many manufacturers have launched self-developed general-purpose LLMs, and the application of domestic LLMs across industries and the construction of their ecosystems have made positive progress. Although there is still a gap between mainland LLMs and GPT-4, they are expected to reach or approach the level of ChatGPT in the short term.

It is worth noting that AMD launched the MI300 series accelerator last week, designed to compete with NVIDIA. The MI300 series is AMD's latest line of accelerators for AI and high-performance computing, including the MI300A and MI300X. The MI300A is an APU that integrates CPU and GPU, while the MI300X is an accelerator aimed specifically at generative AI, benchmarked against the NVIDIA H100. In terms of performance parameters, the MI300 series is comparable to or even surpasses NVIDIA's high-end accelerators, but overall it will still be difficult to shake NVIDIA's dominant position in this field in the short term.

Looking forward to the second half of the year, mainland large-model products have begun to acquire commercial capabilities. The release of AGI development policies in Beijing, Shanghai, and Shenzhen demonstrates the mainland's emphasis on and support for the development of AIGC, and will encourage other cities to release similar policies. With policy and technology reinforcing each other, the mainland AIGC industry has broad development prospects.

Nowadays, the gap between China and the most advanced LLM technology has widened further. In the one to two years after BERT appeared, China caught up quickly and also proposed some good improved models. The watershed at which the gap began to widen came after GPT-3, around mid-2020. At that time, only a few people realized that GPT-3 was not just a specific technology, but embodied a development philosophy of where LLMs should go.

Large Language Models (LLMs) have attracted wide attention in the fields of natural language processing (NLP) and artificial intelligence (AI). As a representative of LLMs, does ChatGPT bring a paradigm shift to NLP and AI? If so, what will the impact be? LLMs accumulate rich knowledge by learning from massive amounts of data. How do LLMs access this knowledge? As LLMs grow, how will they impact research and applications? In addition, In Context Learning is a mysterious technique closely related to Instruct. Do LLMs have reasoning ability? How is Chain of Thought (CoT) implemented? The following sections discuss these questions in detail.

Background and capabilities of LLMs

1. Background of LLMs

LLM (Large Language Model) refers to a language model with hundreds of billions or more parameters, trained on large amounts of text data. LLMs use the same Transformer architecture and language-modeling pre-training objectives as small models, but are far larger in model size, pre-training data, and total computation, which allows them to understand natural language better and produce higher-quality text. The capability gains of LLMs can be partially described by scaling laws, but some capabilities are only observed once the model size exceeds a certain level.
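
As a toy illustration of what a scaling law says, the sketch below assumes the power-law form L(N) = (Nc/N)^α for pre-training loss against parameter count N; the constants roughly follow those reported by Kaplan et al. (2020) and should be treated as illustrative, not authoritative.

```python
# Illustrative power-law scaling of pre-training loss with model size:
# L(N) = (N_c / N) ** alpha. Constants roughly follow Kaplan et al. (2020)
# and are for illustration only.
N_C = 8.8e13    # assumed critical parameter count
ALPHA = 0.076   # assumed scaling exponent

def predicted_loss(n_params: float) -> float:
    """Predicted pre-training loss for a model with n_params parameters."""
    return (N_C / n_params) ** ALPHA

for n in (1e8, 1e9, 1e10, 1e11):
    print(f"{n:.0e} params -> predicted loss {predicted_loss(n):.3f}")
```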

2. The emergent capabilities of LLMs

The emergent capability of LLMs refers to abilities that are absent in small models but appear in large ones; it is one of the most distinctive features separating LLMs from earlier pre-trained language models (PLMs). When scale reaches a certain level, LLM performance rises significantly above random, a new pattern closely related to the phenomenon of phase transitions in physics. In principle emergent capabilities can relate to various complex tasks, but people are more concerned with general-purpose capabilities.

Three representative emergent capabilities of LLMs are in-context learning, instruction following, and step-by-step reasoning. In-context learning lets the language model generate the expected output for a test instance simply by completing the word sequence of the input text; instruction following lets the LLM perform new tasks by understanding task instructions, without explicit examples, which improves generalization; step-by-step reasoning lets the LLM solve complex tasks and reach a final answer via a prompting mechanism that involves intermediate reasoning steps.
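
As a minimal sketch, the three capabilities correspond to three different prompt shapes; the task wording below is invented for illustration.

```python
# Hypothetical prompts illustrating the three emergent capabilities.

# 1) In-context learning: the task is conveyed purely through examples.
in_context = """great movie! -> positive
terrible plot. -> negative
what a masterpiece! ->"""

# 2) Instruction following: the task is conveyed as a natural-language command,
#    with no examples at all.
instruction = ("Classify the sentiment of this review as positive or negative: "
               "what a masterpiece!")

# 3) Step-by-step reasoning: the prompt elicits intermediate reasoning steps.
reasoning = """Q: A farm has 3 pens with 4 sheep each. 2 sheep are sold. How many remain?
A: Let's think step by step."""
```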

The paradigm shift in NLP research: from shallow semantics to deep semantic modeling

Over the past 10 years, the NLP field has arguably undergone two important research paradigm shifts.

First: from deep learning to two-stage pre-trained models

The introduction of deep learning into NLP began around 2013 and lasted until the advent of GPT-3 (around May 2020). Before the BERT and GPT models emerged, the popular techniques in NLP were deep learning models, mainly relying on improved LSTM and CNN models as feature extractors, with Sequence-to-Sequence plus Attention as the overall technical framework. However, although these methods increased model depth, they were not successful enough at solving specific tasks, mainly because task-specific training data was limited and the expressive power of LSTM/CNN feature extractors was insufficient.

It was not until the emergence of the two pre-trained models BERT and GPT that NLP saw a real technological leap, bringing a paradigm shift to the entire research field. The impact of this shift is reflected in two aspects: first, the decline and even gradual disappearance of some NLP research subfields; second, the technical methods and frameworks of different NLP subfields became increasingly unified, with the technology stack basically converging into two modes.

1. The decline or even the gradual demise of some NLP research subfields

NLP is an umbrella term for a broad research field with many specific subfields and subdirections. Analyzed carefully from the perspective of task nature, these tasks can be divided into two categories: intermediate tasks and final tasks.

1) Intermediate tasks

Typical intermediate tasks include Chinese word segmentation, part-of-speech tagging, NER, syntactic analysis, coreference resolution, semantic parsing, etc. Such tasks generally do not address actual application needs; most exist as intermediate or auxiliary stages for tasks that do. For example, there is almost never an application that needs a syntactic parser to show users the syntactic analysis tree of a sentence; users do not need to see these intermediate NLP processing results, they only care whether a specific task is done well.

2) Final tasks

The characteristic of this type of task (text classification, text similarity calculation, machine translation, text summarization, etc.) is that each subfield addresses an actual need, and the task results can generally be presented directly to users. For example, a user genuinely has the need to give you an English sentence and be told what it means in Chinese.

In principle, "intermediate tasks" should not exist at all; they exist only because the level of NLP technology was not high enough. In the early stages of technological development, it was difficult to complete a hard final task in one step with the relatively backward technology of the time. Take machine translation: it was very hard for early technology to translate well, so researchers divided and conquered, decomposing the problem into intermediate stages such as word segmentation, part-of-speech tagging, and syntactic analysis, first doing each intermediate stage well and then assembling them to complete the final task.

Since the advent of BERT/GPT, there has been no need to do these intermediate tasks: through pre-training on large amounts of data, BERT/GPT has absorbed what the intermediate tasks capture as linguistic features into the Transformer's parameters, so final tasks can be solved end-to-end directly, without explicitly modeling the intermediate processes.

2. Unification of technical routes in different research directions

Aside from "intermediate tasks", NLP tasks can be divided into two main types: natural language understanding and natural language generation. Natural language understanding tasks include classification tasks such as text classification, sentence-relation judgment, and sentiment classification, where the model determines which category the input text belongs to. Natural language generation tasks include chatbots, machine translation, text summarization, question answering, and other generation tasks, where the model generates the corresponding output text from the input text.

Since the emergence of BERT/GPT, the NLP field has trended toward technical unification: feature extractors gradually converged from LSTM/CNN to the Transformer, and most tasks adopted either the pre-training + fine-tuning mode or the zero/few-shot prompting mode. Natural language understanding tasks adopted the bidirectional language model pre-training + fine-tuning mode represented by BERT, while natural language generation tasks adopted the autoregressive language model + zero/few-shot prompting mode represented by GPT-2. The development ideas and future directions behind these two models differ, and many people underestimated the potential of the GPT mode. The autoregressive language model of the GPT mode can generate high-quality text, can be applied to many natural language generation tasks, and transfers well. In contrast, the BERT mode performs poorly on generation tasks, and its fine-tuning approach requires large amounts of annotated data, making it hard to adapt to new tasks.

Second: from pre-trained models to artificial general intelligence (AGI)

This paradigm shift covers roughly the period since the advent of GPT-3 (around June 2020) and continues to this day. ChatGPT was the key transition node, but before InstructGPT appeared, LLMs were in a transition period of this paradigm shift.

1. The "autoregressive language model + prompting" mode represented by GPT-3 dominates

In the early stages of pre-trained model development, the technical framework converged to two different paradigms, the BERT mode and the GPT mode, and people were generally more optimistic about the BERT mode; most subsequent technical improvements followed BERT's path. However, as technology continued to develop, it became clear that the largest LLMs are almost all "autoregressive language model + prompting" models in the style of GPT-3 (such as GPT-3, PaLM, GLaM, Gopher, Chinchilla, MT-NLG, LaMDA, etc.). Why? There must be an inevitability behind this, mainly for two reasons.

1) Google's T5 model formally unified the external form of natural language understanding and natural language generation tasks

In the T5 model, text classification and the regression or classification problem of judging sentence similarity (typical natural language understanding problems) share the same input/output form as generation problems: a classification problem can be converted into having the LLM generate the corresponding category as a string, so understanding and generation tasks are completely unified in representation. This shows that natural language generation tasks can subsume natural language understanding tasks in their external form, while the reverse is difficult. The advantage is that the same generative LLM can solve almost any NLP problem. If the BERT mode is used instead, the LLM still cannot handle generation tasks well.
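
A minimal sketch of this text-to-text unification: every task is rendered as a string-to-string pair. The task prefixes imitate the style used by T5, but the exact strings here should be treated as illustrative.

```python
# Every task, "understanding" or "generation" alike, becomes string -> string.
examples = [
    # classification: the target label is rendered as text
    ("cola sentence: The course is jumping well.", "unacceptable"),
    # regression: the similarity score is rendered as text
    ("stsb sentence1: The cat sat. sentence2: A cat was sitting.", "4.6"),
    # generation: translation is already string -> string
    ("translate English to German: That is good.", "Das ist gut."),
]
for source, target in examples:
    print(f"input:  {source}\ntarget: {target}\n")
```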

2) To do well with zero-shot or few-shot prompting, you must adopt the GPT mode

Studies have shown that the BERT mode outperforms the GPT mode when downstream tasks are solved by fine-tuning. However, when downstream tasks are solved by zero-shot/few-shot prompting, the GPT mode beats the BERT mode. This shows that generative models complete zero-shot/few-shot prompting tasks more easily, while the BERT mode is at a disadvantage in this setting.

So the question is: why pursue zero-shot/few-shot prompting at all? To answer this, we first need to clarify another question: what does an ideal LLM look like?

For an LLM, first, it should have strong autonomous learning ability. If all the different types of data available in the world, such as text and images, are fed into the model, it should automatically learn all the knowledge contained therein; the learning process should require no human intervention, and the learned knowledge should be flexibly applicable to solving practical problems. Since the amount of data is huge, absorbing all of this knowledge requires a very large number of parameters to store it, so such a model will necessarily be a giant model.

Second, the LLM should be able to solve problems in any NLP subfield, not just a limited domain, and should even respond to problems outside NLP. Furthermore, when using the LLM to solve a problem in a specific domain, it should accept humans' habitual forms of expression; that is, the LLM should understand human commands. This means adapting the LLM to humans, rather than adapting humans to the LLM. A classic example of humans adapting to LLMs is people racking their brains to try all kinds of prompts in an attempt to find one that solves the problem at hand well.

Why pursue zero-shot/few-shot prompting as a way to solve tasks? There are two main reasons.

1) The ideal LLM must be very large, and only a very few institutions can build such a model or change its parameters. There are thousands of potential task demanders, most of them small and medium organizations or even individuals; even if the model were open source, they could not deploy it, let alone fine-tune it to modify the parameters. Therefore, we should pursue ways for task demanders to complete tasks without modifying model parameters: prompting rather than fine-tuning should be used to complete tasks. Model makers then run LLMs as a public service, in the LLM-as-a-Service mode.

As service providers, considering the diversity of user needs, LLM makers must pursue models that can complete as many types of tasks as possible. This is a side effect of the service model, and a practical reason why super-large models will pursue AGI.

2) Zero-shot prompting, few-shot prompting, and even chain-of-thought (CoT) prompting, which promotes LLM reasoning, are all current human-LLM interface technologies. Specifically, the original intent of zero-shot prompting was to let the LLM perform a task expressed the way a human would normally phrase it; but it turned out LLMs could not understand such phrasing well and performed poorly. Continued research found that if the LLM is given a few examples of a task, with these examples standing in for the task description, performance improves over zero-shot prompting, so everyone began studying better few-shot prompting techniques.

In other words, the original hope was that the LLM could perform a task given a command in humans' habitual style; since current technology cannot achieve that, the second-best option is to use these alternative techniques to express human task needs. Following this logic, it is easy to conclude that few-shot prompting (also known as In Context Learning) is just a transitional technology. Once a task can be described naturally and the LLM can understand it, these transitional techniques will be abandoned without hesitation, for the obvious reason that describing task requirements this way does not fit human habits.

2. A new interactive interface that adapts LLM to people

ChatGPT is the technology that comes closest to the ideal LLM today; its characteristics are being both powerful and considerate toward users. ChatGPT's power mainly comes from the underlying GPT-3.5 model, not from manually annotated data. Although manually annotated data was added, it amounts only to tens of thousands of examples, and its contribution to enhancing GPT-3.5's basic capabilities is minimal.

ChatGPT's biggest contribution is that it basically realizes the interface layer of the ideal LLM, letting the LLM adapt to people's habitual command expressions rather than the reverse. This increases the LLM's ease of use and improves the user experience; it is a human-machine interface technology more in line with human expression habits. ChatGPT's technical contribution here will surely inspire subsequent LLMs to keep working on user-friendly human-machine interfaces.

3. Many NLP subfields no longer have independent research value

The paradigm shift will reshape the NLP field: many formerly independent research subfields will be absorbed into the LLM technology system and gradually disappear. Although many "intermediate tasks" no longer need to exist independently, most "final tasks" will still exist as independent domains, only now proposing new improvements under the "pre-training + fine-tuning" framework.

Studies have shown that as LLM scale increases, performance on many NLP tasks greatly improves. Therefore, many supposedly domain-unique problems are merely surface manifestations of lacking domain knowledge: given more domain data to learn from on its own, the LLM can solve these problems well. The future trend will be to pursue ever-larger LLMs that cover more and more domains by increasing the diversity of pre-training data. The focus will be on how to build the ideal LLM, rather than on solving domain-specific problems. Therefore, more and more NLP subfields will be absorbed into the LLM technology system and gradually disappear.

To determine whether independent research in a specific field should stop, either of two tests can be used. First, judge whether the LLM's performance exceeds human performance; research fields where it already does no longer need independent research. Second, compare the two modes on the task: if few-shot prompting or instruct-based methods match or exceed fine-tuning with larger domain-specific data, the field no longer needs to exist independently.

If this speculation holds, many NLP researchers face a choice of direction: continue working on domain-specific problems, or abandon that path and instead build better LLMs?

4. More research fields other than NLP will be included in the LLM technology system

The ideal LLM should be a general-purpose AI model, not confined to a single subject area. ChatGPT's advent proves the feasibility of pursuing AGI, and it is time to put aside the mental shackles of "domain disciplines". Beyond demonstrating fluent dialogue across a variety of NLP tasks, ChatGPT also has powerful coding capabilities.

LLM technology is expanding outward, and one natural direction is image processing and multimodal tasks. There is already some work attempting to integrate multimodality into LLMs so that they can function as a universal human-machine interface, such as DeepMind's Flamingo and Microsoft's "Language Models are General-Purpose Interfaces".

The benefit of applying pre-trained models to downstream tasks in the image domain is much less pronounced than in NLP, perhaps because image pre-training models still need deeper exploration to unlock the potential of image data. The integration of image processing into LLMs may therefore be slower than expected. Of course, once image pre-training models are worked through, they are likely to be integrated into large LLMs and complete terminal tasks directly, mirroring what happened in NLP.

Beyond images and multimodality, other fields will also gradually be incorporated into LLMs; this is a high-value research topic. Reflecting on the paradigm shift, the main technological advances of LLMs fall into two categories: the first concerns how LLMs absorb knowledge from data, including the impact of model-scale growth on knowledge absorption; the second concerns how people use the LLM's intrinsic capabilities to solve tasks, including In Context Learning and the Instruct mode. CoT prompting, an LLM reasoning technique, essentially also belongs to In Context Learning.

Derive massive amounts of knowledge from endless data

Current findings suggest that the Transformer is a sufficiently powerful feature extractor and does not require special improvement. So what does the pre-training process teach the Transformer? How is knowledge stored? How can incorrect knowledge be corrected? These questions are the focus of current research, and this section describes the progress in this area.

1. What LLMs learn

LLMs acquire vast knowledge by learning from massive free text. This knowledge can be roughly divided into two categories: language knowledge and world knowledge. Language knowledge includes lexical, part-of-speech, syntactic, and semantic knowledge, which helps humans or machines understand natural language. Studies show that LLMs can learn linguistic knowledge at various levels, and that this knowledge is stored in the lower and middle layers of the Transformer. World knowledge includes real events (factual knowledge) and common-sense knowledge.

Studies have shown that LLMs can absorb a large amount of world knowledge from training data. This knowledge is mainly distributed in the middle and upper layers of the Transformer, and the amount of learnable knowledge grows as model depth increases. For BERT-type language models, a corpus of only 10 to 100 million words is enough to learn linguistic knowledge such as syntax and semantics, but learning factual knowledge requires far more training data. As training data grows, pre-trained models perform better on various downstream tasks, which shows that what the incremental training data contributes is mostly world knowledge.

2. How LLMs access knowledge

LLMs are language models based on the Transformer structure that learn rich linguistic and world knowledge from massive free text. But how is a specific piece of knowledge stored in, and extracted from, the LLM? Looking at the Transformer's structure, its parameters consist of two parts: the multi-head attention (MHA) part accounts for roughly one third of the total parameters, while two thirds are concentrated in the FFN structure.

The first layer of the FFN is a wide MLP hidden layer, the Key layer; the second layer is a narrow MLP hidden layer, the Value layer. The input to the FFN is the MHA output embedding for a given word, i.e., the representation that, through self-attention, integrates the input context of the whole sentence and carries the overall information of the input sentence.

Each neuron node in the Key layer records a pair of <Key, Value> information. For example, the i-th node ki of the FFN's first hidden layer might record the knowledge <Beijing, is-capital-of, China>. The Key vector corresponding to node ki is the weight vector connecting ki to each node of the input layer; the corresponding Value vector is the weight vector connecting ki to each node of the FFN's second layer, the Value layer.

The Key vector of each neuron acts as a pattern detector, used to identify a particular language or knowledge pattern in the input. If the input contains the pattern the neuron is looking for, the inner product of the input vector with ki's key weight vector, passed through ReLU, produces a large activation at ki, meaning ki has detected the pattern. That activation is then propagated to the second FFN layer through ki's value weight vector, which is equivalent to weighting the Value vector by the activation and adding it to the output of each node of the second (Value) layer.

Viewed this way, the FFN's forward computation looks like detecting a knowledge pattern via a Key, retrieving the corresponding Value, and reflecting that Value in the FFN's second-layer output. Of course, each node of the second FFN layer aggregates information from all Key-layer nodes, so its response is a mixture, and the mixed response of all Value-layer nodes can be interpreted as probability-distribution information over the output word. The view of the FFN as a Key-Value memory may not be the final correct answer, but it is probably not far from it.
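
A toy numeric sketch of this key-value reading of the FFN, assuming a two-layer MLP with ReLU; the dimensions and random weights are invented, and real models have thousands of neurons per layer.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32  # toy sizes; real LLMs are orders of magnitude larger

W_key = rng.normal(size=(d_ff, d_model))    # row i = key vector of neuron ki
W_value = rng.normal(size=(d_model, d_ff))  # column i = value vector of neuron ki

def ffn(x: np.ndarray) -> np.ndarray:
    """FFN forward pass, read as a key-value memory lookup."""
    activations = np.maximum(W_key @ x, 0.0)  # each neuron scores its pattern (ReLU)
    return W_value @ activations              # activation-weighted sum of value vectors

x = rng.normal(size=d_model)        # stand-in for the token's attention output
scores = np.maximum(W_key @ x, 0.0)
top = np.argsort(scores)[-3:][::-1]
print("most strongly detected 'patterns':", top, scores[top])
print("FFN output:", ffn(x))
```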

3. How to correct the knowledge stored in LLMs

When using an LLM for natural language processing, you may encounter outdated or erroneous knowledge. To solve this problem, three different methods can be used to correct the knowledge stored in the LLM.

1. Revise knowledge from the source of training data

By tracing the training data from which a piece of knowledge was learned, we can locate which data sources led the LLM to learn it, delete those sources, and retrain the whole LLM, thereby removing the knowledge. However, this approach is unworkable in the common scenario of small, frequent, routine knowledge corrections.

2. Revise knowledge through fine-tuning

Training data is constructed for the new knowledge to be instilled, and the LLM is fine-tuned on it, guiding the model to remember the new knowledge and forget the old. However, there is a forgetting problem: the model may forget not only the knowledge it should forget but also knowledge it should not, degrading some downstream tasks after the correction. The cost is also quite high.

3. Directly modify the model parameters of LLM to correct knowledge

By locating where a piece of knowledge is stored, the corresponding FFN parameters can be forcibly adjusted to replace old knowledge with new. This approach must solve two key problems: first, how to locate the specific storage position of a piece of knowledge in the LLM's parameter space; second, how to modify the model parameters so the old knowledge is corrected to the new.

Understanding the process of revising LLM knowledge is helpful for a deeper understanding of the inner workings of LLM. Although the three methods have their own advantages and disadvantages, they can help correct outdated or erroneous knowledge in LLM and improve the performance of LLM in natural language processing tasks.

What happens as LLMs get bigger

In recent years, LLM scale has kept growing, and the best current LLMs exceed 100 billion (100B) parameters. For example, OpenAI's GPT-3 has 175B parameters, Google's LaMDA 137B and PaLM 540B, and DeepMind's Gopher 280B. China also has giant models, such as the Tsinghua & Zhipu GLM at 130B, Huawei's "PanGu" at 200B, Baidu's "Wenxin" at 260B, and Inspur's "Yuan 1.0" at 245B.

So the question is: what happens as LLMs continue to grow? Pre-trained models are typically used in two stages: the pre-training stage and the scenario-specific application stage. In pre-training, the optimization objective of the LLM is cross-entropy; for an autoregressive language model like GPT, this amounts to whether the LLM correctly predicts the next word. In the application stage, what matters are the evaluation metrics of the specific scenario. In general, the better an LLM's pre-training metrics, the stronger its ability to solve downstream tasks. However, this is not entirely the case.

Existing studies show that the optimization metric of the pre-training stage is positively correlated with downstream task performance, but not perfectly so. In other words, pre-training metrics alone are not enough to judge whether an LLM is good enough; the model must be fully evaluated and tested in both the pre-training stage and the application stage.

In the pre-training stage, research by OpenAI and DeepMind has shown that increasing training data and model parameters simultaneously is the optimal choice; increasing either alone is not good enough. DeepMind argues that training data volume and model parameters are equally important and should be scaled up proportionally: for example, if the total compute budget for training an LLM is increased 10x, model parameters and training data volume should each be increased about 3.3x for the best results. The Chinchilla model chose to increase training data 4x while cutting model parameters to a quarter of Gopher's, about 70B; as a result, Chinchilla's pre-training metrics and many downstream task metrics outperformed the larger Gopher. This shows that one can proportionally enlarge the training data and shrink the LLM's parameters, greatly reducing model size without reducing model quality.
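
A back-of-the-envelope sketch of this proportional-scaling rule, assuming the common approximation that training compute scales as parameters times tokens (C ≈ 6·N·D), so a 10x compute budget gives each factor √10 ≈ 3.16x; the starting point matches Chinchilla's published 70B parameters and 1.4T tokens.

```python
import math

def scale_equally(n_params: float, n_tokens: float, compute_multiplier: float):
    """Split a compute increase equally between parameters and data,
    using the approximation C ~ 6 * N * D."""
    k = math.sqrt(compute_multiplier)  # N and D each grow by sqrt of the budget
    return n_params * k, n_tokens * k

# Chinchilla's published operating point: 70B parameters, 1.4T training tokens.
n, d = 70e9, 1.4e12
for mult in (10, 100):
    n2, d2 = scale_equally(n, d, mult)
    print(f"{mult:>3}x compute -> {n2 / 1e9:.0f}B params, {d2 / 1e12:.1f}T tokens")
```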

From the perspective of how LLMs solve specific downstream tasks, performance varies with model scale by task type. For simple tasks, such as language-model perplexity, performance keeps improving as model scale increases: in the OpenAI study, when training data grew from 12B to 800B tokens, the GPT-3 model's perplexity fell from 3.15 to 1.28.

For medium-difficulty tasks, such as question answering and text classification, performance first improves and then plateaus as model size increases: in OpenAI's study, when training data grew from 12B to 800B tokens, GPT-3 improved on tasks such as LAMBADA and SuperGLUE, but the gains gradually diminished. For complex tasks, such as machine translation and semantic understanding, performance first rises and then saturates or even slightly declines: in Google's study, when model parameters grew from 1558M to 137B, the BLEU score rose from 36.8 to 37.5, but with further scale growth the BLEU score declined slightly. Therefore, when choosing LLM scale, various factors should be weighed against the difficulty and requirements of the specific task to obtain the best model performance.

The first type of task follows the scaling law of LLMs: as model scale grows, task performance gets better and better. Such tasks are usually knowledge-intensive; the more knowledge the LLM contains, the better the task performance. Studies show that larger LLMs learn more efficiently: given the same amount of training data, a larger model learns more knowledge points. Most traditional natural language understanding tasks are of this type, and their performance has improved greatly in the past two years, probably due to the growth of LLM scale.

The second type of task shows the LLM's "emergent capabilities": once model scale reaches a certain threshold, the LLM's performance on such tasks suddenly jumps. This "emergence" is the key payoff of growing LLM scale: as models get bigger, they gradually unlock new capabilities. The phenomenon is remarkable, because even if an LLM does not solve some task well today, pushing the scale further may suddenly unlock that ability one day. Such tasks generally consist of multiple steps, requiring several intermediate steps to be solved first, with logical reasoning playing an important role in the final solution. CoT prompting is a typical technique for enhancing LLM reasoning and can greatly improve performance on such tasks. Why LLMs exhibit this "emergence" phenomenon still requires further research.

Some task-performance curves show a U shape: as model scale grows, task performance first worsens, but with further scale growth it begins to improve, a U-shaped trend. Such tasks implicitly contain two different types of subtasks: real tasks and "distractor tasks". When the model is small, it cannot recognize either subtask, so its performance is close to randomly choosing answers.

When the model grows to medium scale, it mainly performs the distractor task, which hurts the real task, reflected in a decline in real-task performance. When the model grows further, the LLM can ignore the distractor task and perform the real task, and performance begins to climb. If CoT prompting is adopted, some tasks convert to scaling-law behavior, i.e., the bigger the model the better, while others convert to a U-shaped growth curve. This suggests such tasks are reasoning-type tasks whose performance changes qualitatively after adding CoT.

From In Context Learning to Instruct Understanding

Commonly mentioned interface technologies between people and LLMs include Instruct and In Context Learning. Instruct is ChatGPT's interface mode: a person describes a task in natural language, such as "translate this sentence from Chinese to English". In Context Learning means roughly the same as few-shot prompting: give the LLM a few examples as a template, then have it solve a new problem.

Although both techniques are ways of describing tasks, the thinking differs: Instruct is an abstract description, while In Context Learning teaches by concrete example. Although the names are somewhat confusing, these two are the most common interface technologies between humans and LLMs. The following focuses on Instruct and In Context Learning, without separately discussing zero-shot and few-shot.

1. The magical In Context Learning

In Context Learning is a rather magical technique. It is magical because you only need to give the LLM a few sample examples <x1, y1>, <x2, y2>, ..., <xn, yn>, then give a new input xn+1, and the LLM can successfully predict the corresponding output yn+1. This sounds somewhat similar to fine-tuning, but is actually more mysterious.
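
A minimal sketch of what such a prompt looks like when flattened into text; the arithmetic task and examples are invented for illustration.

```python
# In Context Learning: the "training set" lives entirely inside the prompt.
pairs = [("2 + 3", "5"), ("7 + 8", "15"), ("4 + 9", "13")]  # <xi, yi> demonstrations
new_x = "6 + 7"                                             # x_{n+1}

prompt = "\n".join(f"Q: {x}\nA: {y}" for x, y in pairs) + f"\nQ: {new_x}\nA:"
print(prompt)  # the LLM is expected to continue with y_{n+1}, here "13"
```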

Fine-tuning and In Context Learning both seem to provide examples to the LLM, but they differ qualitatively. Fine-tuning uses the examples as training data, correcting the LLM's parameters via backpropagation, thereby realizing learning from examples. In Context Learning only shows the examples and then asks the LLM to predict a new one, without using backpropagation to update parameters, which means it seemingly undergoes no learning process at all. Yet In Context Learning can still predict new examples correctly at a glance.

A number of studies offer different views on this question, with conflicting conclusions among them; the truth remains an unsolved mystery. Some studies argue that In Context Learning does not learn a mapping function from the examples, but instead makes predictions from the distributions of the inputs and outputs. Other studies hold that the LLM does still learn the mapping function from the examples, only implicitly.

2. The magical Instruct understanding

Instruct is a task description phrased for human convenience. On this premise, current Instruct research falls into two categories: Instruct biased toward academic research, and Instruct focused on describing real human needs.

First, consider the academically oriented Instruct work. Its core research theme is the LLM's generalization ability in understanding Instructs in multi-task scenarios: take multiple NLP tasks, each with one or more prompt templates serving as its Instruct, and fine-tune the LLM on their training data so that it learns many tasks at once.

After training, the LLM is given the Instruct of a brand-new task and asked to solve it zero-shot; whether it does well indicates whether it can generalize over Instructs. Current research shows that factors such as increasing the number of multi-task training tasks, increasing LLM size, providing CoT prompting, and increasing task diversity all effectively improve the LLM's ability to understand Instructs.

The second is Instruct oriented toward real human needs, represented by InstructGPT and ChatGPT. This approach is also multi-task-based, but its biggest difference from the academically oriented work is its orientation toward real needs: its task-description prompts are sampled from real requests submitted by large numbers of users, rather than fixing the scope of research tasks and having researchers write the prompts.

The advantages of this method are that it covers more diverse task types that better match users' real needs, and that the task prompts come from actual user requests and therefore reflect how users really phrase their needs, so the resulting LLM serves users better. The InstructGPT paper also compared the method against FLAN, an academically oriented method; the results show FLAN's effect falls far short of InstructGPT's, because FLAN involves relatively few task domains while InstructGPT uses more diverse task types that better match real user needs. Collecting real requirements from user data is therefore very important for improving LLM effectiveness.

3. The link between In Context Learning and Instruct

In Context Learning can be seen as expressing a task command through concrete examples, while Instruct is an abstract task description more in line with human habits. A natural question arises: is there a connection between the two? For example, given a few concrete examples of a task, can the LLM find the natural-language Instruct command that describes the task?

Some research work is currently exploring the connection between concrete task examples and natural-language commands, and this direction has high research value. The answer to the question is yes: LLMs can indeed do this. A recent study used GPT-3 and InstructGPT as base models, had the LLM generate a natural-language command describing a task from concrete examples, and then used that generated description to test task performance. With this technique, the generated Instructs greatly improved results, even surpassing human performance on some tasks. This suggests a mysterious inner connection between concrete task examples and natural-language commands, although we cannot yet determine the exact nature of that connection.

How to enhance LLM's reasoning ability

Many studies have shown that LLMs have strong memory, but we usually do not say a person is smart just because of strong memory; strong reasoning is usually the criterion for judging whether someone is smart. Therefore, strong reasoning ability is also essential for LLMs. Over the past year, LLM reasoning has become one of the most important and hottest research areas. Current research shows that when models are large enough, LLMs themselves possess reasoning ability: they already do well on simple reasoning problems, while complex reasoning still needs deeper study.

Research on LLM reasoning ability falls into two categories: prompt-based methods and methods that introduce program code. Prompt-based methods use well-chosen prompts or prompt examples to stimulate the reasoning ability the LLM already has; Google has done much fruitful work in this direction. Methods that introduce program code mix code with text during pre-training to further enhance the LLM's reasoning ability; this is the idea practiced by OpenAI. The two directions differ fundamentally: the code approach enhances LLM reasoning at its root by providing diverse training data, whereas the prompt approach is a technical means of getting the LLM to better display, during problem solving, the reasoning ability it already has. The two complement each other, but in the long run the root-cause approach matters more.

In summary, the prompt-based work can be roughly divided into three technical routes.

First: directly add auxiliary reasoning prompts to the question

Prompt-based methods have proven effective at enhancing LLM reasoning across many domains. The method is very simple: append an auxiliary reasoning prompt to the question. Among these, Zero-shot CoT is widely used; it stimulates the LLM's own reasoning ability by appending the prompt "Let's think step by step" to the question being asked.

Specifically, it has two stages. In the first stage, the prompt is appended to the question, and the LLM outputs its concrete reasoning process. In the second stage, that reasoning process is concatenated after the question and a further prompt is appended (e.g., "Therefore, the answer is"), at which point the LLM gives the answer. This simple operation greatly improves the LLM's effectiveness on all kinds of reasoning tasks. Why LLMs have this reasoning ability remains inconclusive; one hypothesis is that the pre-training data contains a large amount of text beginning with "Let's think step by step", and the LLM memorized such patterns during pre-training.

So when we enter this prompt, the LLM imitates those examples, reasons step by step, and gives an answer. Of course, Zero-shot CoT works worse than standard CoT, since relying on the LLM's recall of examples cannot be very precise. But both Zero-shot CoT and standard CoT illustrate the same truth: LLMs themselves have reasoning ability; we just lack good means of stimulating it.
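
A minimal sketch of the two-stage Zero-shot CoT pipeline described above, assuming a placeholder complete() function standing in for whatever LLM completion call is available (it is not a real API).

```python
def complete(prompt: str) -> str:
    """Placeholder for a call to some LLM completion endpoint."""
    raise NotImplementedError

def zero_shot_cot(question: str) -> str:
    # Stage 1: elicit the reasoning chain.
    reasoning = complete(f"Q: {question}\nA: Let's think step by step.")
    # Stage 2: concatenate the chain back in and ask for the final answer.
    answer = complete(
        f"Q: {question}\nA: Let's think step by step. {reasoning}\n"
        "Therefore, the answer is"
    )
    return answer.strip()
```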

Second: few-shot CoT (Chain of Thought) prompting

At present, prompt-based methods are the main direction of LLM reasoning research, and much work follows this idea. Several representative works in this direction have achieved remarkable results and basically represent the course of CoT's technical development.

The main idea of CoT is simple and clear: to teach the LLM to reason, give it some manually written reasoning examples that spell out, step by step, the concrete reasoning before the final answer; these manually written, detailed reasoning processes are the chain-of-thought prompts. CoT's purpose is to show the LLM that during reasoning the steps should not be too large: big problems should be turned into small ones, solved step by step, accumulating small wins into a big one. The first paper to explicitly propose the concept of CoT was "Chain of thought prompting elicits reasoning in large language models", published in January 2022. Although the method is simple, applying CoT greatly improves LLM reasoning: accuracy on the GSM8K mathematical reasoning test set rose to about 60.1%. Notably, the idea of giving detailed reasoning steps and intermediate processes was not first proposed by CoT; the earlier "scratchpad" technique adopted a similar idea.
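
A minimal sketch of a few-shot CoT prompt; the worked demonstration below is in the style of the examples in the CoT paper, and real prompts typically include several such demonstrations.

```python
cot_prompt = """Q: Roger has 5 tennis balls. He buys 2 cans of 3 balls each. How many balls does he have now?
A: Roger starts with 5 balls. 2 cans of 3 balls is 6 balls. 5 + 6 = 11. The answer is 11.

Q: A library had 120 books, lent out 45, and received 30 new ones. How many books are there now?
A:"""
# The demonstration spells out its intermediate steps, nudging the model
# to produce a step-by-step derivation before its final answer.
```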

Shortly after CoT was proposed, in March 2022, an improved technique called "Self-Consistency" appeared, raising GSM8K accuracy to 74.4%. The idea is also simple: first use CoT to give several examples of written-out reasoning, then ask the LLM to reason about the given problem; but unlike CoT, Self-Consistency has the LLM output multiple different reasoning processes and answers and selects the best answer by voting. This line of thinking teaches the LLM that a mathematical problem can have many correct solution paths, each different derivation leading to the same final answer. Simple methods often carry deep philosophical implications. Later, "On the Advance of Making Language Models Better Reasoners" built on Self-Consistency, further integrating three improvements: expanding from one prompt question to multiple prompt questions, checking the correctness of intermediate reasoning steps, and weighted voting over the answers of multiple outputs. This raised GSM8K accuracy to approximately 83%.
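
A minimal sketch of the Self-Consistency vote, again with placeholder functions: sample_completion() stands in for a temperature > 0 LLM call and extract_answer() for pulling the final answer out of a reasoning chain; both are assumptions, not real APIs.

```python
from collections import Counter

def sample_completion(prompt: str) -> str:
    """Placeholder: one stochastic (temperature > 0) LLM completion."""
    raise NotImplementedError

def extract_answer(completion: str) -> str:
    """Placeholder: pull the final answer out of a reasoning chain."""
    raise NotImplementedError

def self_consistency(cot_prompt: str, question: str, n_samples: int = 20) -> str:
    # Sample several independent reasoning chains for the same question...
    answers = [
        extract_answer(sample_completion(f"{cot_prompt}\nQ: {question}\nA:"))
        for _ in range(n_samples)
    ]
    # ...and return the answer that most chains agree on (majority vote).
    return Counter(answers).most_common(1)[0][0]
```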

Third: the divide-and-conquer approach

The core idea is to decompose a complex reasoning problem into several easily solvable subproblems, solve those, and then derive the answer to the complex question from the subproblems' answers. This line of thinking may be the authentic road to revealing the essence of the problem and ultimately solving LLM complex reasoning. Take the "least-to-most prompting" technique, which has two stages. In the first stage, from the original problem we derive the final question to ask, Final Q, then construct a prompt template filled in along the lines of "If I want to solve problem Final Q, then I need to first solve...", and let the LLM answer it, obtaining the prerequisite subproblem Sub Q. In the second stage, the LLM first answers Sub Q and gets its answer; then the original problem, Sub Q, and its answer are concatenated, and the LLM is asked the final question Final Q, at which point it gives the final answer. This embodies the idea of decomposing subproblems and gradually deriving the final answer from the subproblems' answers, in the spirit of divide-and-conquer algorithms.
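
A minimal sketch of the two-stage least-to-most flow, with the same kind of placeholder complete() call; the template wording is invented.

```python
def complete(prompt: str) -> str:
    """Placeholder for an LLM completion call."""
    raise NotImplementedError

def least_to_most(problem: str, final_q: str) -> str:
    # Stage 1: ask the model which subproblem must be solved first.
    sub_q = complete(f'{problem}\nTo solve "{final_q}", we need to first solve:')
    # Stage 2a: have the model answer the subproblem.
    sub_a = complete(f"{problem}\nQ: {sub_q}\nA:")
    # Stage 2b: append the solved subproblem, then ask the final question.
    return complete(f"{problem}\nQ: {sub_q}\nA: {sub_a}\nQ: {final_q}\nA:")
```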

Code pre-training enhances LLM reasoning capabilities

The preceding covered three mainstream methods of using prompts to stimulate the reasoning ability of LLMs. Meanwhile, an interesting and puzzling phenomenon has been observed: besides text, having program code participate in model pre-training alongside the text can significantly improve the reasoning ability of LLMs.

The paper "On the Advance of Making Language Models Better Reasoners" shows this through experimental data: pre-training on program code together with text significantly improves the LLM's reasoning ability. The experiments show that merely switching from a plain-text pre-trained model to a mixed text-and-code pre-trained model improves reasoning ability by 20 to 50 percentage points on almost all test datasets.

In addition, the study found that GPT-3, a plain-text pre-trained model, actually has considerable reasoning ability, but it must be stimulated by appropriate methods; and that adding instruct fine-tuning damages the LLM's reasoning ability while improving natural language understanding to some extent. As for why pre-training on code gives a model additional reasoning ability, there is no confirmed explanation yet. It may be because such training is essentially a multimodal alignment of two kinds of data, <text, code>, and code together with its comments contains a considerable proportion of mathematical or logical reasoning data, which helps with downstream mathematical reasoning problems. These conclusions inspire further thinking and exploration.

Reflections on LLM reasoning ability

Over the past year, techniques for stimulating LLM reasoning have advanced rapidly, but the overall feeling is that we are still some way from the real essence of the problem, which requires deeper thought and exploration. For complex reasoning problems, decomposition into simple subproblems helps, because subproblems are more likely to be answered correctly by the LLM. Inspired by "least-to-most prompting", LLM reasoning may essentially be either a graph-reasoning problem of continual interaction with the LLM, or a program-flowchart-execution problem of continual interaction with the LLM.

Suppose we could decompose a complex problem into a graph structure of subproblems or substeps, where nodes represent subproblems or substeps and edges represent dependencies between subproblems. Following the dependencies, we could guide the LLM step by step, answering the subproblems that must be answered first, until the final answer is derived. The graph might contain loop structures, i.e., substeps that need to be executed repeatedly. If we could obtain such a subproblem decomposition graph, we could effectively guide the LLM through the reasoning, as sketched below.
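
A minimal sketch of driving an LLM along such a dependency graph in topological order; the graph contents and the placeholder complete() call are invented for illustration, and this sketch ignores the loop case mentioned above.

```python
from graphlib import TopologicalSorter  # Python 3.9+

def complete(prompt: str) -> str:
    """Placeholder for an LLM completion call."""
    raise NotImplementedError

# Maps each subproblem to the subproblems it depends on.
dependencies = {
    "final answer": {"subproblem A", "subproblem B"},
    "subproblem B": {"subproblem A"},
    "subproblem A": set(),
}

answers: dict[str, str] = {}
for node in TopologicalSorter(dependencies).static_order():
    # Feed the answers of prerequisite subproblems into the next prompt.
    context = "\n".join(f"{dep}: {answers[dep]}" for dep in dependencies[node])
    answers[node] = complete(f"{context}\nNow solve: {node}")
print(answers["final answer"])
```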

Alternatively, suppose we could decompose a complex problem into subproblems or substeps and generate a structure resembling a program flowchart, with loops and conditional branches. We would then interact with the LLM while executing each substep, obtain the substep's answer, and follow the process until the final answer is output. Multimodal pre-training of this kind could enhance the LLM's ability to construct an implicit flowchart from text and execute according to it, thereby enhancing its reasoning ability.

However, how to obtain the graph structure or flowchart structure from a text description remains the difficult point. One possible idea is to implicitly learn such internal hidden structures through enhanced pre-training on text and higher-quality code. Current CoT techniques, in essence, try to infer the graph structure or program flowchart backward from the final graph node, but current methods limit the depth of this backward inference and can only derive simple graph structures, which is why their capability is limited.

LLM research trends and key directions worth studying

Here are some of the more important LLM research areas or research directions worth exploring in depth.

1. Explore the scale ceiling of LLMs

Although pushing up LLM scale may look non-technical, it is actually very important. Since BERT, the key technical breakthroughs behind GPT-3's and ChatGPT's impressive results have come primarily from the growth of LLM scale rather than from one specific technique. This means that for knowledge-intensive tasks, results on various tasks will keep improving as models grow; and for many hard reasoning-type tasks, after adding CoT prompting, performance also tends to follow the scaling law. The natural question, then, is: for these tasks, to what extent can LLM scale alone solve them?

Given LLMs' magical "emergent capabilities", what unexpected new capabilities would be unlocked by continuing to increase model size? That is also an interesting question. It is therefore necessary to keep increasing model size and see where the ceiling of model scale lies for solving various tasks. Of course, for 99.99% of practitioners there is neither the opportunity nor the capability to do this.

Doing this demands extremely high financial resources and willingness to invest, engineering capability, and technical commitment from a research institution, and none of these can be missing. A rough estimate is that no more than five institutions abroad and no more than three in China can do it, because building an ultra-large-scale LLM requires a technical team with very strong engineering implementation skills and very strong hardware and software support. It is work with real technical content.

Nevertheless, continuing to push research on model scale remains significant. Beyond probing how far the scale effect improves various tasks, one can also explore what new capabilities emerge as the model grows. Answers to these questions will deepen our understanding of the nature and behavior of LLMs and provide important reference points for future research and applications, so it is very valuable for capable institutions to keep pushing scale.

Second, enhance the complex reasoning ability of LLM

As described above regarding LLM reasoning, although LLMs have made great progress in reasoning over the last year, limitations remain. Many studies show that LLMs still do not solve complex reasoning problems well; in particular, when long strings or large numbers are involved, reasoning ability drops significantly. Strengthening the complex reasoning ability of LLMs should therefore be one of the focuses of future research.

Earlier we mentioned one direct way to enhance LLM reasoning: adding code to the pre-training mix. While practice has shown this to work, the rationale behind it still needs deeper investigation, as does the introduction of new types of data beyond code; that may be the more essential route to better reasoning, rather than the addition of code alone.
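
As an illustration of what "introducing more code" means in practice, here is a toy sampler over a hypothetical pre-training mixture; the source names and weights are invented for the example:

```python
import random

# Hypothetical sampling weights for a pre-training data mixture; the point
# is simply that the share of code is raised alongside ordinary text.
MIXTURE = {"web_text": 0.55, "books": 0.15, "wikipedia": 0.05, "code": 0.25}

rng = random.Random(0)
sources, weights = zip(*MIXTURE.items())

def next_batch_source() -> str:
    """Pick which corpus the next training batch is drawn from."""
    return rng.choices(sources, weights=weights, k=1)[0]

counts = {s: 0 for s in MIXTURE}
for _ in range(10_000):
    counts[next_batch_source()] += 1
print(counts)  # roughly proportional to MIXTURE
```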

Third, bring LLM into more research fields beyond NLP

The current ChatGPT excels at natural language processing (NLP) and programming tasks. As one of the frontier directions leading toward artificial general intelligence (AGI), combining multimodal data such as images, video, and audio with language models, and further applying AI to other fields such as scientific research and robot control, is an important path to broader applications and differentiated development. Although this direction is still in its infancy, it has high research value.

Fourth, an easier-to-use interface between humans and LLM

As discussed earlier, ChatGPT's main technical contribution lies in specific domains such as NLP and programming. Yet the current technology is clearly imperfect: there are many commands and instructions the LLM cannot understand. A very promising and novel direction is therefore to find better ways to let LLMs understand the habitual command expressions humans actually use. Exploration here will create new opportunities and offer more potential approaches for raising the technical level of LLMs.

Fifth, build highly difficult comprehensive task evaluation datasets

An excellent evaluation dataset is the foundation for continual technical progress. As LLMs scale up and task performance improves rapidly, many classic test sets quickly become too easy to expose the flaws and blind spots of current technology. Building highly challenging test sets is therefore essential for driving the advancement of LLM technology. Some new suites have emerged in the industry, such as BIG-bench and OPT-IML; they are difficult, combine the requirements of multiple task types, and better reflect the challenges facing current LLM technology.
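
A minimal shape for such an evaluation harness might look like the sketch below; the tasks dict holds toy stand-ins for entries in a suite like BIG-bench, and call_llm is a placeholder:

```python
def call_llm(prompt: str) -> str:
    """Placeholder for a real model call."""
    return "4"

tasks = {  # toy stand-ins for entries in a multi-task evaluation suite
    "arithmetic": [("2 + 2 =", "4"), ("3 * 3 =", "9")],
    "logic":      [("If all A are B and x is A, is x B? (yes/no)", "yes")],
}

# Report per-task accuracy so blind spots stay visible instead of being
# averaged away in a single aggregate score.
for name, examples in tasks.items():
    correct = sum(call_llm(q).strip() == ref for q, ref in examples)
    print(f"{name}: {correct}/{len(examples)}")
```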

Inspired by ChatGPT, test sets should reflect not only difficulty and diversity but also the needs of real users. Ideally the tasks are proposed by real users, since only an LLM built this way can truly address actual user needs. In addition, as LLMs rapidly extend their capabilities beyond NLP, we should consider in advance how to incorporate evaluation data from other fields. This will help further improve the broad adaptability of LLMs.

Sixth, high-quality data engineering

Data is the core of a pre-trained model, and pre-training is the process of acquiring knowledge from data, so more attention must go to mining, collecting, and cleaning high-quality data. Quality and quantity are the two key aspects. The T5 ablation experiments suggest that between the two, quality should be prioritized: the right path is to scale up data while ensuring its quality. Data quality covers criteria such as information content and diversity. Wikipedia, for example, is high-information, high-quality data. Increasing the diversity of data types is critical to unlocking new LLM capabilities; data from Q&A websites, for instance, directly improves an LLM's question-answering ability. Diverse data equips the LLM to solve more types of tasks, so diversity is the most critical criterion of data quality.
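
As a toy illustration of quality filtering, the heuristics below (length, lexical diversity, letter ratio) are simplified stand-ins for what real web-text cleaning pipelines do:

```python
def looks_high_quality(doc: str) -> bool:
    """Very rough document-quality heuristics for web-text cleaning."""
    words = doc.split()
    if len(words) < 50:                         # too short to be informative
        return False
    if len(set(words)) / len(words) < 0.3:      # highly repetitive text
        return False
    letters = sum(c.isalpha() for c in doc)
    return letters / len(doc) > 0.6             # mostly natural language

sample = ("Wikipedia articles tend to be informative, well edited, and "
          "diverse in vocabulary, which makes them a common example of "
          "high-quality pre-training text for large language models. ") * 2
print(looks_high_quality(sample))            # True
print(looks_high_quality("buy now " * 40))   # False: repetitive spam
```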

Regarding quantity, in principle everything publicly available on the Internet can be included in pre-training, but there are limits. One study estimating the scalability of data volume concluded that high-quality NLP data will be exhausted around 2026, low-quality NLP data between 2030 and 2050, and low-quality image data between 2030 and 2060. This suggests that either new data sources must be developed or LLMs must use data more efficiently; otherwise, the current data-driven approach to model improvement will stall or see diminishing returns. New solutions are needed to address the limits of data.

Seventh, sparsification of the ultra-large LLM Transformer

Some of the largest LLMs, such as GPT-3, PaLM, and GLaM, employ sparse structures. The main advantage of a sparse model is that it greatly reduces training and inference time: under the same compute budget, a sparse model can train 4 to 7 times faster than a dense one. Although a sparse model has a huge parameter count, for each training instance only a small fraction of the parameters participates in training and inference, selected by a routing mechanism, which is why it is faster.
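
The routing mechanism can be illustrated with a minimal top-1 mixture-of-experts layer in PyTorch. The sizes are arbitrary, and production MoE layers add load-balancing losses and capacity limits that are omitted here:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top1MoE(nn.Module):
    """Minimal top-1 routed mixture-of-experts feed-forward layer.

    A learned router sends each token to exactly one expert, so only a
    small fraction of the layer's parameters is active per token.
    """
    def __init__(self, d_model=64, d_ff=256, n_experts=4):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                          # x: (tokens, d_model)
        gate = F.softmax(self.router(x), dim=-1)   # routing probabilities
        weight, idx = gate.max(dim=-1)             # top-1 expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = idx == e                        # tokens routed to expert e
            if mask.any():
                out[mask] = weight[mask].unsqueeze(1) * expert(x[mask])
        return out

print(Top1MoE()(torch.randn(8, 64)).shape)         # torch.Size([8, 64])
```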

Future hyperscale LLMs are likely to tend toward sparsity, for two main reasons. First, studies show that even standard dense models exhibit sparse activation during training and inference: only some parameters are activated, and most do not participate. Given that, migrating to explicitly sparse models is a reasonable step. Second, model scale will keep increasing, and high training cost is the main obstacle to scaling up; sparse models can significantly cut the training cost of very large models, so their benefit grows with model size. For these reasons, future larger LLMs will likely adopt sparse designs.

That said, the reason more large models have not yet gone sparse is that sparse models suffer from unstable training and easy overfitting, making them hard to train well. Solving these problems and designing sparse models that are easier to train is therefore an important direction for future research.

What should you pay attention to when replicating ChatGPT?

To replicate an impressive LLM like ChatGPT, we need to weigh the following issues during technology selection.

First, regarding the pre-training mode, one can choose an autoregressive language model like GPT, a bidirectional language model like Bert, or a hybrid mode like T5. The analysis in this article suggests the GPT-style autoregressive model is probably the better choice; yet many domestic LLM projects appear to have chosen the Bert bidirectional or T5 hybrid mode, which may point them in the wrong direction.

Second, strong reasoning ability is an important basis for user recognition of an LLM. To achieve it, current experience says it is best to introduce a large amount of code alongside text in the pre-training stage and train the LLM on both. The analysis above also explains this.

Third, if you want the parameter count to stay modest while keeping good results, there are two options. One is to strengthen high-level feature extraction and representation, through deeper network structures or more sophisticated feature-extraction methods. The other is to combine a text retrieval model with the LLM: the retrieval model performs preliminary screening and matching, and the LLM then does further generation and inference, which can greatly reduce the LLM's parameter scale.
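
A minimal sketch of the retrieve-then-generate combination follows; word-overlap scoring stands in for a real BM25 or dense retriever, and call_llm is a placeholder for the (smaller) generative model:

```python
def call_llm(prompt: str) -> str:
    """Placeholder for a call to a (smaller) generative model."""
    return f"<answer based on: {prompt[:60]}...>"

corpus = [
    "The MI300X is an accelerator aimed at generative AI workloads.",
    "Sparse mixture-of-experts models activate few experts per token.",
    "Chain-of-thought prompting elicits step-by-step reasoning.",
]

def retrieve(query: str, k: int = 1) -> list:
    """Rank documents by word overlap with the query (toy retriever)."""
    q = set(query.lower().split())
    return sorted(corpus, key=lambda d: len(q & set(d.lower().split())),
                  reverse=True)[:k]

# The retriever narrows the context so the model need not store every fact
# in its parameters.
question = "Which accelerator targets generative AI workloads?"
context = " ".join(retrieve(question))
print(call_llm(f"Context: {context}\nQuestion: {question}"))
```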

Fourth, because training ultra-large models is so costly, few institutions can afford it, so lowering LLM training cost is very important. One effective technical choice is to sparsify the LLM's feature extractor, which effectively reduces training and inference cost. As model size grows, sparsification becomes an option that should be considered.

Fifth, ChatGPT is currently the technical solution closest to the ideal LLM, and the ideal LLM should be a nearly omnipotent general-purpose model supporting all kinds of tasks. To approach this, support for more task types can be gained by increasing the diversity of the pre-training data: the more diverse the data, the richer the task types the LLM can support. Emphasis should therefore be placed on enhancing capability through data diversity.

Sixth, an easy-to-use human-machine interface also matters greatly. The LLM must understand the true meaning of a task described in the way humans habitually express it, and task expressions should be collected from end users rather than invented or guessed by developers. ChatGPT is very inspiring in this respect; whether or not reinforcement learning is used matters less, since other alternative techniques can likely achieve similar results.

In short, replicating an impressive LLM like ChatGPT requires weighing pre-training mode, reasoning ability, model scale, training cost, data diversity, and the human-machine interface during technology selection, and choosing the most appropriate methods for the goal.

Factors required for LLM training

Training large language models poses several challenges, which can be grouped into six aspects: hardware requirements, health checks, orchestration, data processing, model scaling, and cost management. Each has a significant impact on the effectiveness and efficiency of training.

The first is hardware. The latest hardware delivers better performance, while failing to exploit it fully leads to longer training times and suboptimal results.

Blue Ocean Brain's high-performance LLM training platform uses a working fluid as the intermediate heat-transfer medium, carrying heat away from the hot zone to be cooled remotely. It supports a variety of hardware accelerators, including CPUs, GPUs, FPGAs, and dedicated AI accelerators, to meet the needs of large-scale data processing and complex computing tasks. Its distributed computing architecture efficiently handles large-scale data and complex computation, providing strong computing-power support for deep learning, high-performance computing, large-model training, and large-scale language model (LLM) research and development. It is highly flexible and scalable, can be customized to different application scenarios and needs, and allows computing tasks to be deployed and managed quickly, improving the utilization and efficiency of computing resources.

Another challenge is health checking, to ensure the hardware runs properly and to reduce interruptions. Orchestration must also be considered, so that workloads within a team do not interfere with one another while network and security configurations are maintained. Handling large-scale datasets is itself a challenge, requiring efficient storage, processing, and loading. Scaling the infrastructure and designing algorithms around its limits is another important task: these models usually do not fit on a single GPU, so you must consider how to split the model across multiple GPUs, as in the sketch below.
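
The simplest form of such a split is naive pipeline (model) parallelism, sketched below in PyTorch; real systems add micro-batching and communication overlap that are omitted here, and the layer sizes are arbitrary:

```python
import torch
import torch.nn as nn

class TwoStageModel(nn.Module):
    """Toy model parallelism: half the network per GPU, with the
    activation moved between devices in forward()."""
    def __init__(self):
        super().__init__()
        self.stage0 = nn.Sequential(nn.Linear(1024, 4096),
                                    nn.ReLU()).to("cuda:0")
        self.stage1 = nn.Linear(4096, 1024).to("cuda:1")

    def forward(self, x):
        x = self.stage0(x.to("cuda:0"))
        x = self.stage1(x.to("cuda:1"))   # hop to the second GPU
        return x

if torch.cuda.device_count() >= 2:        # requires at least two GPUs
    model = TwoStageModel()
    out = model(torch.randn(8, 1024))
    print(out.device)                      # cuda:1
```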

Finally, cost management cannot be ignored. Training large models is expensive, and the machine learning team's time should go into creating new models rather than being consumed by infrastructure.
