
Large model + complete search technology: Baichuan Intelligence's search enhancement gives enterprise customization a strong boost

Reported by the Heart of the Machine

Heart of the Machine Editorial Department

Making good use of the enterprise knowledge base is the key to a breakthrough in large model applications.

Although the craze for large models has lasted more than a year since ChatGPT's initial release, it has mostly stayed at the level of academic frontiers and technological innovation, with few cases of deep deployment delivering industrial value in concrete scenarios.

The challenges of practical implementation ultimately point in one direction: industry knowledge.

General-purpose models, which rely on pre-training over public web information and knowledge, struggle to deliver accuracy, stability, and cost-effectiveness in vertical industry scenarios. But if real-time external search is supplemented by a strong, dedicated enterprise knowledge base, the model's grasp of industry knowledge is greatly enhanced, and the results naturally improve.

This resembles the familiar "open-book exam": the stronger the brain's "memory capacity" the better, but there is ultimately an upper limit. The reference materials brought into the exam room act as an external "hard disk", so candidates no longer need to memorize intricate knowledge points and can instead spend their energy understanding the underlying logic of the material.

At the Baichuan2 Turbo series API launch held on December 19, Wang Xiaochuan, founder and CEO of Baichuan Intelligence, offered a sharper analogy: the large model is like a computer's CPU, internalizing knowledge through pre-training and generating results according to the user's prompt; the context window is like the computer's memory, holding the text currently being processed; and real-time internet information together with the complete enterprise knowledge base constitute the hard disk of the large model era.

This latest technical thinking has been built into Baichuan Intelligence's large model products.

Baichuan Intelligence has officially opened the search-enhanced Baichuan2-Turbo series API, comprising Baichuan2-Turbo-192K and Baichuan2-Turbo. The series not only supports a 192K ultra-long context window but also adds a search-enhanced knowledge base, allowing any user to upload their own text materials to build a private knowledge base and assemble more complete, efficient intelligent solutions tailored to their business needs.

At the same time, Baichuan Intelligence has upgraded the model experience on its official website, which now supports PDF upload and URL input, so ordinary users can experience the leap in general capability brought by the long context window and search enhancement directly through the site.


When large models land, both "memory" and "hard disk" are indispensable

The key to applying large models is making good use of enterprise data, and practitioners in the field know this from deep experience.

Over the past few years of digital construction, enterprises have accumulated large amounts of high-value data and experience. This constitutes their core competitiveness and determines the depth and breadth of any large model deployment.

In the past, well-resourced enterprises mostly trained large models on their own data during the pre-training stage, but this demands enormous time and computing power, plus a professional technical team. Other enterprise teams instead adopt an industry-leading base model and apply post-training (Post-Train) and supervised fine-tuning (SFT) with their own data. That compensates somewhat for the long build cycle and the lack of domain knowledge, but it still does not solve the hallucination and timeliness problems of deployed large models. Whether pre-training, post-training, or SFT, every data update requires retraining or re-fine-tuning the model; the reliability of training and the stability of the application cannot be guaranteed, and problems still surface after multiple rounds of training.

This means that the implementation of large models requires a more efficient, accurate, and real-time way of data utilization.

Recently, high hopes have been pinned on expanding the context window and introducing vector databases. Technically, the more information a context window can hold, the more the model can refer to when generating the next word, the less likely "hallucination" becomes, and the more accurate the output; this makes the technology one of the necessary conditions for deploying large models. A vector database, in turn, gives the large model "storage": compared with simply scaling up the model, an external database lets it answer user questions over a much wider dataset and, at very low cost, improves its adaptability to diverse environments and problems.

However, each method has limitations, and large models cannot rely on a single solution to break through the landing challenges.

For example, an overly long context window runs into capacity, cost, performance, and efficiency limits. First, capacity: a 128K window holds at most about 230,000 Chinese characters, a text document of only about 658KB. Next, computational cost: inference with a long-window model consumes a large number of tokens. And in terms of performance, since a model's inference speed is positively correlated with text length, long inputs degrade performance even with heavy use of caching techniques.

As for vector databases, their query and indexing operations are more complex than those of traditional relational databases, putting greater pressure on an enterprise's computing and storage resources. Moreover, the domestic vector database ecosystem is relatively weak, leaving a fairly high development threshold for small and medium-sized enterprises.

In Baichuan Intelligence's view, only by combining a long-window model with search/RAG (retrieval-augmented generation) into a complete "long-window model + search" technology stack can information be processed truly efficiently and with high quality.

On the context window front, Baichuan Intelligence launched Baichuan2-192K, then the world's longest-context-window model, on October 30; it can take in 350,000 Chinese characters at a time, an industry-leading level. At the same time, the company upgraded its vector database into a search-enhanced knowledge base, greatly strengthening the large model's access to external knowledge. Combined with the ultra-long context window, it can connect information from across the internet with the full enterprise knowledge base, thereby replacing the vast majority of personalized enterprise fine-tuning and, by the company's estimate, meeting the customization needs of 99% of enterprise knowledge bases.

The benefits for enterprises are obvious: not only are costs greatly reduced, but vertical domain knowledge accumulates more effectively, so the enterprise's proprietary knowledge base, a core asset, keeps appreciating in value.

Long-window model + search enhancement

How to improve the application potential of large models?

On the one hand, without modifying the underlying model itself, a large model can integrate its internalized knowledge with external knowledge by adding memory (a longer context window) and by search augmentation (access to real-time internet information and expert knowledge from domain knowledge bases).

On the other hand, search-enhancement technology lets the long context window play to its strengths. It allows the large model to understand user intent accurately, find the knowledge most relevant to that intent among the massive documents of the internet and professional or enterprise knowledge bases, load enough of that knowledge into the context window, and then summarize and refine the search results with the long-window model. This draws out the full capability of the context window, helps the model generate the best possible answer, and links the technical modules into a closed-loop capability network.

Combining the two approaches expands the effective capacity of the context window to a whole new level. Through long window + search enhancement, Baichuan Intelligence has raised the amount of original text a large model can draw on by two orders of magnitude beyond the 192K context window, to 50 million tokens.

The "Needle in a Haystack" test, designed by Greg Kamradt, a well-known overseas AI entrepreneur and developer, is widely regarded as the most authoritative method in the industry for testing large models' accuracy on long text.

To verify the capability of long window + search enhancement, Baichuan Intelligence sampled a 50-million-token dataset as the haystack, inserted Q&A pairs from multiple domains as needles at different positions within it, and tested pure embedding retrieval and sparse retrieval + embedding retrieval respectively.
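The test protocol just described can be sketched as follows. Everything here is a simplified stand-in: the haystack is repeated filler text, the "retriever" is a plain substring check rather than the embedding and sparse systems actually being evaluated, and the needle is a made-up fact.

```python
def insert_needle(haystack: str, needle: str, depth: float) -> str:
    """Place the needle at a fractional depth (0.0 = start, 1.0 = end)."""
    pos = int(len(haystack) * depth)
    return haystack[:pos] + " " + needle + " " + haystack[pos:]

def retrieve_answer(document: str, keyword: str) -> bool:
    """Stand-in retriever: did the system recover the needle at all?"""
    return keyword in document

haystack = "Filler sentence about nothing in particular. " * 200
needle = "The secret code is 7421."

# Score the retriever over several insertion depths, as the real test
# does across positions in the 50M-token haystack.
depths = [0.0, 0.25, 0.5, 0.75, 1.0]
hits = sum(retrieve_answer(insert_needle(haystack, needle, d), "7421")
           for d in depths)
accuracy = hits / len(depths)
```

The reported numbers (100% within 192K, 95% vs. 80% at 50M tokens) are exactly this kind of accuracy, averaged over needle positions and domains.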

For requests within 192K tokens, Baichuan Intelligence achieves 100% answer accuracy.


For documents exceeding 192K tokens, Baichuan Intelligence combined the search system to extend the test set's context length to 50 million tokens and evaluated pure vector retrieval and sparse retrieval + vector retrieval respectively.

The results show that sparse retrieval + vector retrieval achieves 95% answer accuracy, close to a perfect score even across the full 50-million-token dataset, while pure vector retrieval reaches only 80%.


Meanwhile, on three test sets, the document-understanding portion of the Bojin Large Model Challenge financial dataset, MultiFieldQA-zh, and DuReader, Baichuan Intelligence's search-enhanced knowledge base outscores GPT-3.5, GPT-4, and other industry-leading models.


Combining long windows with search is not easy, and Baichuan Intelligence meets each challenge as it comes

"Long-window model + search" can indeed break through large models' bottlenecks in hallucination, timeliness, and knowledge, but only once the problem of combining the two is solved.

Whether the two can be seamlessly integrated largely determines how well the model ultimately works.

This is especially true now, as the way users express their information needs is subtly changing; combining those needs deeply with search has tested Baichuan Intelligence on every front.

On the one hand, users no longer type questions as a word or phrase but as natural dialogue, often spanning multiple turns. On the other hand, questions are more diverse and context-dependent: the input style is more colloquial, and the questions themselves tend to be more complex.

These changes in prompts do not match traditional search logic built on keywords and short sentences. Aligning the two is the first problem to solve when combining the long-window model with search.

To understand user intent more accurately, Baichuan Intelligence first uses its self-developed large model, fine-tuned for intent understanding, to convert the user's continuous, multi-turn colloquial prompts into keywords or semantic structures better suited to traditional search engines, making the returned results more accurate and relevant.
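The rewrite step can be illustrated with a crude rule-based stand-in for the fine-tuned rewriting model: filler words are stripped from the conversation turns and the remaining terms are merged into a single search query. The stopword list and example turns are invented for illustration.

```python
# Hypothetical stand-in for the model-driven query rewrite: in the real
# system a fine-tuned LLM does this, not a stopword filter.
STOPWORDS = {"um", "so", "can", "you", "tell", "me", "about", "the",
             "what", "is", "a", "please", "i", "want", "to", "know"}

def rewrite_query(turns: list[str]) -> str:
    """Collapse multi-turn colloquial input into search-engine keywords."""
    words: list[str] = []
    for turn in turns:
        for raw in turn.lower().split():
            w = raw.strip(",.?!")
            if w and w not in STOPWORDS and w not in words:
                words.append(w)
    return " ".join(words)

turns = [
    "So, can you tell me about the Baichuan2-Turbo API?",
    "What is the context window size?",
]
query = rewrite_query(turns)
# query -> "baichuan2-turbo api context window size"
```

Note how context from the first turn (the product name) is carried into the query for the second turn, which is the point of rewriting multi-turn dialogue rather than searching each turn in isolation.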


Second, to handle the increasingly complex questions in real user scenarios, Baichuan Intelligence draws on Meta's CoVe (Chain-of-Verification) technique to split a complex prompt into multiple search-friendly queries that can be retrieved in parallel, so the large model can run a targeted knowledge-base search for each sub-query, ultimately giving more accurate and detailed answers while reducing hallucinated output. It also uses self-developed TSF (Think Step-Further) technology to infer the deeper question behind the user's input, understand intent more accurately and comprehensively, and guide the model toward more valuable answers.

Another challenge concerns the enterprise knowledge base itself. The better a search query matches the user's need, the better the large model's output. But in knowledge-base scenarios, further improving the efficiency and accuracy of knowledge acquisition demands a more powerful retrieval and recall solution.

The knowledge-base scenario has unique characteristics: user data is usually private, and a traditional vector database cannot guarantee semantic matching between user needs and the knowledge base.

To this end, Baichuan Intelligence developed its own Baichuan-Text-Embedding vector model, pre-trained on more than 1.5T tokens of high-quality Chinese data, with a self-developed loss function that removes contrastive learning's dependence on large batch sizes. The model topped C-MTEB, currently the largest and most comprehensive Chinese semantic vector evaluation benchmark, leading in the overall score and across the classification, clustering, reranking, retrieval, and text-similarity tasks.


Although vector retrieval is the mainstream way to build large-model knowledge bases today, it is clearly not enough on its own. The effectiveness of a vector database depends heavily on the coverage of its training data, and generalization drops sharply in uncovered domains, a real problem for private-data knowledge-base scenarios. At the same time, there is a length gap between user prompts and the Chinese documents in the knowledge base, and this mismatch also challenges vector retrieval.

Baichuan Intelligence therefore adds sparse retrieval and a rerank model on top of vector retrieval, forming a hybrid method that runs vector and sparse retrieval in parallel and greatly improves the recall of target documents. The data speaks for itself: this hybrid approach recalls 95% of target documents, whereas the vast majority of open-source vector models recall less than 80%.
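The hybrid scheme can be sketched as follows: a dense score and a sparse score are computed independently, the top candidates from both retrievers are pooled, and a rerank score orders the final list. All three scorers here are toy stand-ins (cosine over bag-of-words for "dense", term overlap for "sparse", and their sum for "rerank"), not the actual Baichuan models.

```python
import math
from collections import Counter

def dense_score(q: str, d: str) -> float:
    """Toy 'embedding' similarity: cosine over bag-of-words counts."""
    qc, dc = Counter(q.lower().split()), Counter(d.lower().split())
    dot = sum(qc[w] * dc[w] for w in qc)
    norm = (math.sqrt(sum(v * v for v in qc.values()))
            * math.sqrt(sum(v * v for v in dc.values())))
    return dot / norm if norm else 0.0

def sparse_score(q: str, d: str) -> float:
    """Toy sparse retrieval: fraction of query terms present in the doc."""
    terms = set(q.lower().split())
    return len(terms & set(d.lower().split())) / len(terms)

def hybrid_retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    # Pool top-k candidates from both retrievers run independently,
    # then order by a combined score standing in for the rerank model.
    pool = set(sorted(corpus, key=lambda d: dense_score(query, d))[-k:])
    pool |= set(sorted(corpus, key=lambda d: sparse_score(query, d))[-k:])
    return sorted(pool,
                  key=lambda d: dense_score(query, d) + sparse_score(query, d),
                  reverse=True)

corpus = [
    "annual leave policy for full time employees",
    "quarterly earnings report and revenue summary",
    "employees may take annual leave after probation",
]
results = hybrid_retrieve("annual leave for employees", corpus)
```

The design point is that the two retrievers fail differently: sparse retrieval catches exact terms the embedding model was never trained on, while dense retrieval catches paraphrases with no term overlap, so pooling both before reranking lifts recall.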

In addition, while answering questions, a large model can aggravate its own hallucinations when citations are inaccurate or retrieved content does not match the question.

To address this, Baichuan Intelligence pioneered Self-Critique, a large-model self-reflection technique built on top of general RAG: the model introspects over the retrieved content against the prompt, taking a second look from the perspectives of relevance and usability, and screens out the candidate content that best fits the prompt. This raises the knowledge density and breadth of the material while reducing knowledge noise in the search results.
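The second-pass filter can be sketched like this. The critic here is a keyword-overlap heuristic, a stand-in for asking the large model itself to judge each passage; the threshold and example passages are invented for illustration.

```python
def critique(prompt: str, passage: str) -> float:
    """Stand-in critic: share of prompt terms the passage supports.
    The real system asks the large model to judge relevance/usability."""
    terms = set(prompt.lower().split())
    return len(terms & set(passage.lower().split())) / len(terms)

def self_critique_filter(prompt: str, passages: list[str],
                         threshold: float = 0.5) -> list[str]:
    """Second pass over search results: drop low-relevance noise."""
    return [p for p in passages if critique(prompt, p) >= threshold]

prompt = "vacation policy details"
retrieved = [
    "the vacation policy grants 15 days",   # relevant
    "cafeteria opens at 8 am",              # noise
    "policy details are in the handbook",   # partially relevant
]
kept = self_critique_filter(prompt, retrieved)
# The cafeteria passage is dropped before the context window is filled.
```

Only the surviving passages are loaded into the context window, which is how the second look raises knowledge density and cuts noise.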


Along the "long-window model + search" technology stack, Baichuan Intelligence draws on its accumulated expertise in search, above all the industry-leading combination of vector and sparse retrieval, to solve the mismatch among large models, user prompts, and enterprise knowledge bases. This makes its search-enhanced knowledge base stand out and, like giving a tiger wings, lets large models empower vertical industry scenarios far more efficiently.

Large models land, and search enhancement opens a new stage of enterprise customization

In just one year, large models have developed beyond what anyone imagined. We hoped "industry models" would unleash productivity across thousands of industries, but industry models are constrained by professional technical talent, computing power, and other factors, and many small and medium-sized enterprises have been unable to reap the dividends of this wave.

Evidently, the step from product to deployment is even harder than the earlier step from technology to product.

From the earliest industry models built on pre-training, to enterprise-exclusive models based on post-training or SFT, to custom models developed with technologies such as long windows and vector databases, large models have been pushed ever closer to the ideal of "omniscient and omnipotent", yet broad application across vertical industry scenarios has still not materialized.

Baichuan Intelligence's "large model + search" technology stack not only improves the model's baseline performance with a long window but also uses search enhancement to connect domain knowledge and whole-network knowledge more efficiently and comprehensively, offering a low-cost path to customized large models and taking the lead toward "omniscience". There is reason to believe this will lead the large model industry into a new stage.
