
The implementation practice of large language model in data warehouse data governance


As data volumes grow and business needs become more complex, the construction and management of data warehouses has become increasingly important, and data governance is now an indispensable part of data warehouse construction. In particular, metadata and indicators describe and measure the data itself and play a crucial role in analysis and decision-making. However, because the data is large and complex, traditional metadata and indicator retrieval methods are often inefficient and cannot meet the need for fast, accurate retrieval. This article describes how to use large language model (LLM) technology to govern metadata and indicator retrieval in the data warehouse, covering the technical architecture, a detailed technical description, and the problems solved.

1. Background of practice

Among our existing tools, we have built an indicator management system, a metadata system, an IDE query platform for users, and more. These traditional platforms offer mostly instrumental support: the user enters a purposeful query, and the system retrieves and returns results. This model easily creates information islands, because the metadata system holds only metadata and the indicator management system holds only indicator information, while users usually want an integrated answer to a business question.

A large language model is a natural language processing technology based on the Transformer architecture. It learns a language model through large-scale pre-training and can then be fine-tuned on a variety of tasks for more specific applications. Large language models have strong semantic understanding and generation capabilities and can produce relevant responses to natural language input. They can act as glue across existing application systems, organically combining information from different systems before presenting it to users. We therefore decided to address the scenario above with a large language model.

2. Technical architecture

2.1 Ingesting the retrieval corpus into the warehouse


By connecting to the metadata system and the indicator management system, the content users need to consult is ingested into the data warehouse for storage and management. In the warehouse, the corpus is organized as key-value (KV) pairs of corpus phrases and their detail information, which form the initial index.

The index can be organized according to specific needs and data structures. One common approach uses the table name as the key and the table structure as the value, which makes it easy to index and query by table name and quickly locate the relevant corpus content.

Another approach uses the metric name as the key, with the metric description and its generation rule as the value. This suits scenarios where users consult by indicator: with indicator information in the index, a search by indicator name quickly finds the related corpus content.

Connecting to the metadata and metric management systems also yields additional metadata and indicator definitions. This information supplements and enriches the corpus detail, and it is stored together with the corpus phrases in the data warehouse for subsequent indexing and retrieval.
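The KV organization described above can be sketched as follows. This is a minimal illustration, not the production code: the table and metric payload shapes (`name`, `comment`, `columns`, `description`, `rule`) are hypothetical stand-ins for whatever the metadata and indicator systems actually return.

```python
# Sketch: organize corpus entries as KV pairs (phrase -> detail text),
# assuming hypothetical payloads fetched from the upstream systems.

def build_corpus_index(tables, metrics):
    """Build the initial KV index: key = corpus phrase, value = detail info."""
    index = {}
    # Variant 1: key = table name, value = serialized table structure.
    for t in tables:
        cols = ", ".join(f"{c['name']} {c['type']}" for c in t["columns"])
        index[t["name"]] = f"Table {t['name']}: {t['comment']}. Columns: {cols}"
    # Variant 2: key = metric name, value = description + generation rule.
    for m in metrics:
        index[m["name"]] = f"Metric {m['name']}: {m['description']}. Rule: {m['rule']}"
    return index

tables = [{"name": "dwd_order", "comment": "order detail",
           "columns": [{"name": "order_id", "type": "bigint"}]}]
metrics = [{"name": "GMV", "description": "gross merchandise volume",
            "rule": "sum(pay_amount) over paid orders"}]
index = build_corpus_index(tables, metrics)
print(index["GMV"])
```

Either key style (table name or metric name) lands in the same flat index, so downstream retrieval does not need to care which system an entry came from.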

2.2 Corpus vectorization


After assembling the corpus, we vectorize its content and store the vectors in the Milvus vector database. Converting the corpus into vector representations avoids the narrowing of the match range caused by exact or fuzzy plaintext matching: users phrase things in many different ways, and direct plaintext matching cannot cover all the variations.

Through vectorization, we map the corpus content into a high-dimensional vector space. Each corpus entry is represented as a vector, and the distance between vectors measures their similarity, so the corpus content closest to the user's input can be found by vector matching.

Another benefit of vectorization is that it reduces interference from modifiers. Adjectives, adverbs, and other modifiers can skew plaintext matching results, but numerical vectors reflect the semantic information of the corpus more accurately, largely independent of such modifiers.
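The matching principle can be illustrated with a self-contained toy. In production the embeddings would come from a real embedding model and be stored and searched in Milvus; here a hash-based bag-of-words vector stands in for the model so the example runs without any external service, and the corpus entries are invented for illustration.

```python
# Toy illustration of vector matching: embed corpus and query into the same
# space, then pick the corpus entry with the highest cosine similarity.
import hashlib
import math

DIM = 64  # toy embedding dimension

def embed(text):
    """Stand-in embedding: hash each token into a fixed-size, L2-normalized vector."""
    vec = [0.0] * DIM
    for token in text.lower().split():
        h = int(hashlib.md5(token.encode()).hexdigest(), 16)
        vec[h % DIM] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def cosine(a, b):
    # Vectors are already normalized, so the dot product is the cosine.
    return sum(x * y for x, y in zip(a, b))

corpus = {
    "GMV": "gross merchandise volume, sum of paid order amounts",
    "DAU": "daily active users counted by distinct device id",
}
corpus_vecs = {k: embed(f"{k} {v}") for k, v in corpus.items()}

query = embed("how is GMV gross merchandise volume calculated")
best = max(corpus_vecs, key=lambda k: cosine(query, corpus_vecs[k]))
print(best)  # the GMV entry scores highest for this query
```

A real embedding model would also match paraphrases with no token overlap at all, which is exactly the "language variation" problem plaintext matching cannot handle.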

2.3 Large language model access

During a user consultation, the dialogue content is also vectorized so that it can be matched against the indexed corpus. The corpus content most relevant to the user's inquiry is then found by vector similarity.

Users often type only a few words when asking a question, but those words tend to be focused. Vectorization encodes this key information into a vector representation; a variety of techniques can be used to convert the text into a numerical vector. The corpus content most similar to the user's input is then found by computing the similarity between vectors.

Once matches are found, the results are sorted by vector score. The score reflects the degree of match, and a higher score indicates a better match. Sorting puts the best-matching corpus content first, so users receive the most relevant answers to their inquiry.

After the matching results are determined, the corresponding corpus entries are recalled and sent, together with the user's question, to the interface provided by the large language model for assembly. The model generates a coherent, natural response from this input. By combining the matching results with the large language model, the relevant corpus content is turned into a more specific and detailed answer, giving users a more professional and accurate consulting service.
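The recall-sort-assemble step can be sketched as below. The score threshold, `top_k`, the prompt wording, and the sample hits are all hypothetical; the actual LLM call is left abstract because it depends on whichever model interface is deployed.

```python
# Sketch: rank recalled corpus entries by vector score and assemble them
# into a prompt for the large language model interface.

def assemble_prompt(question, hits, top_k=3, threshold=0.3):
    """hits: list of (score, corpus_text); higher score means a better match."""
    ranked = sorted(hits, key=lambda h: h[0], reverse=True)
    # Keep the top-k entries, dropping weak matches below the threshold.
    selected = [text for score, text in ranked[:top_k] if score >= threshold]
    context = "\n".join(f"- {t}" for t in selected)
    return (
        "Answer the user's question using only the reference material below.\n"
        f"Reference:\n{context}\n"
        f"Question: {question}\n"
    )

hits = [(0.42, "Metric GMV: sum of paid order amounts"),
        (0.18, "Table dwd_order: order detail table"),
        (0.71, "GMV generation rule: sum(pay_amount) where status = 'paid'")]
prompt = assemble_prompt("How is GMV calculated?", hits)
print(prompt)
```

The assembled prompt is what gets sent to the model's interface; grounding the model in the recalled corpus is what turns a generic chat reply into an answer tied to the warehouse's actual metadata and indicator definitions.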

2.4 Front-end application deployment


When choosing the front-end entry point, we considered the two original systems and other comprehensive platforms, but found no natural entry point or compelling reason for users to adopt them. We finally decided to integrate the capability into the IDE query platform. Users who query information with SQL often need to understand the meaning and usage of metrics and the metadata of tables; these users span groups such as data, product, operations, and algorithm teams, making this platform the entry point that best fits our use case.

By integrating the system into the IDE query platform, we give these users a unified interface and entry point for querying and understanding metric information. Users can enter SQL in the query platform and obtain details about metrics, including their definitions, calculation methods, usage examples, and table metadata. They can query and understand metrics on one platform without switching between systems, and the accurate, comprehensive indicator information returned helps them better understand and use indicators, improving work efficiency.

Overall framework diagram:


3. Application effects

By applying large language models to metadata and indicator retrieval in data warehouse governance, we solved the following problems:

  • Improved retrieval efficiency: traditional metadata and indicator retrieval requires complex query statements and cumbersome operations, and is inefficient. With large language model technology, users only need to enter a natural language question and the system quickly returns the corresponding results, greatly improving retrieval efficiency.
  • Improved search accuracy: traditional retrieval is easily affected by imprecisely worded queries, which leads to inaccurate results. A large language model's semantic understanding and reasoning capabilities better capture the user's query intent and improve the accuracy of the search results.
  • Better user experience: traditional retrieval requires a certain technical background and operational experience, which is difficult for non-experts. With a large language model, users need only enter a natural language question, without learning complex query syntax or operation steps, which greatly improves the user experience.

Summary: with the support of large language models, we can achieve smarter and more convenient metadata and metric management. The system understands natural language input and provides relevant metadata and metric information based on user needs, making data operations and data analysis more efficient and accurate and raising the level of data governance and analysis in the warehouse. We hope this article is helpful to you; thank you for reading!

About the author

Fan

■ Data Platform Department - Data Warehouse Team

■ Mainly responsible for Autohome data warehouse construction, data development, and search business integration.

Author: Fan Wen

Source: WeChat public account: Home Technology

Source: https://mp.weixin.qq.com/s/LSrYbDMT38YovyNIpkUhAg
