
The implementation practice of large language model in data warehouse data governance


As data volumes grow and business needs become more complex, the construction and management of data warehouses has become increasingly important, and data governance is now an indispensable part of data warehouse construction. In particular, metadata and indicators describe and measure the data itself and play a crucial role in analysis and decision-making. However, because the data is large and complex, traditional metadata and indicator retrieval methods are often inefficient and cannot meet the need for fast, accurate retrieval. This article describes how to use large language model (LLM) technology to govern metadata and indicator retrieval in the data warehouse, covering the technical architecture, a detailed technical description, and the problems solved.

1. Background of practice

Among our existing tools, we have built an indicator management system, a metadata system, an IDE query platform for users, and more. These traditional platforms offer mostly instrumental support: the user enters a purposeful query, and the system retrieves and returns results. This model easily creates information islands, because the metadata system holds only metadata and the indicator management system holds only indicator information, while users usually want an integrated answer to a business question.

A large language model is a natural language processing technology based on the Transformer architecture. It learns a language model through large-scale pre-training and can then be fine-tuned on a variety of tasks for more specific applications. Large language models have strong semantic understanding and generation capabilities and can produce relevant responses to natural language input. They can act as glue across existing application systems, organically combining information from different systems before presenting it to users. We therefore decided to address the scenario above with a large language model.

2. Technical architecture

2.1 Ingesting the retrieval corpus into the warehouse


By connecting to the metadata system and the indicator management system, the content users need to consult is ingested into the data warehouse for storage and management. In the warehouse, the corpus is organized as key-value (KV) pairs of corpus phrases and their detail information, which form the initial index.

The index can be organized according to specific needs and data structures. One common approach uses the table name as the key and the table structure as the value, which makes it easy to index and query by table name and quickly locate the relevant corpus content.

Another approach uses the metric name as the key, with the metric description and its generation rule as the value. This suits scenarios where users consult by indicator: with indicator information in the index, a search by indicator name quickly finds the related corpus content.

Connecting to the metadata and metric management systems also yields additional metadata and indicator definitions. This information supplements and enriches the corpus detail, and it is stored together with the corpus phrases in the data warehouse for subsequent indexing and retrieval.
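The KV organization described above can be sketched as follows. This is a minimal illustration, not the production code: the table and metric payload shapes (`name`, `comment`, `columns`, `description`, `rule`) are hypothetical stand-ins for whatever the metadata and indicator systems actually return.

```python
# Sketch: organize corpus entries as KV pairs (phrase -> detail text),
# assuming hypothetical payloads fetched from the upstream systems.

def build_corpus_index(tables, metrics):
    """Build the initial KV index: key = corpus phrase, value = detail info."""
    index = {}
    # Variant 1: key = table name, value = serialized table structure.
    for t in tables:
        cols = ", ".join(f"{c['name']} {c['type']}" for c in t["columns"])
        index[t["name"]] = f"Table {t['name']}: {t['comment']}. Columns: {cols}"
    # Variant 2: key = metric name, value = description + generation rule.
    for m in metrics:
        index[m["name"]] = f"Metric {m['name']}: {m['description']}. Rule: {m['rule']}"
    return index

tables = [{"name": "dwd_order", "comment": "order detail",
           "columns": [{"name": "order_id", "type": "bigint"}]}]
metrics = [{"name": "GMV", "description": "gross merchandise volume",
            "rule": "sum(pay_amount) over paid orders"}]
index = build_corpus_index(tables, metrics)
print(index["GMV"])
```

Either key style (table name or metric name) lands in the same flat index, so downstream retrieval does not need to care which system an entry came from.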

2.2 Corpus vectorization


After assembling the corpus, we vectorize its content and store the vectors in the Milvus vector database. Converting the corpus into vector representations avoids the narrowing of the match range caused by exact or fuzzy plaintext matching: users phrase things in many different ways, and direct plaintext matching cannot cover all the variations.

Through vectorization, we map the corpus content into a high-dimensional vector space. Each corpus entry is represented as a vector, and the distance between vectors measures their similarity, so the corpus content closest to the user's input can be found by vector matching.

Another benefit of vectorization is that it reduces interference from modifiers. Adjectives, adverbs, and other modifiers can skew plaintext matching results, but numerical vectors reflect the semantic information of the corpus more accurately, largely independent of such modifiers.
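The matching principle can be illustrated with a self-contained toy. In production the embeddings would come from a real embedding model and be stored and searched in Milvus; here a hash-based bag-of-words vector stands in for the model so the example runs without any external service, and the corpus entries are invented for illustration.

```python
# Toy illustration of vector matching: embed corpus and query into the same
# space, then pick the corpus entry with the highest cosine similarity.
import hashlib
import math

DIM = 64  # toy embedding dimension

def embed(text):
    """Stand-in embedding: hash each token into a fixed-size, L2-normalized vector."""
    vec = [0.0] * DIM
    for token in text.lower().split():
        h = int(hashlib.md5(token.encode()).hexdigest(), 16)
        vec[h % DIM] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def cosine(a, b):
    # Vectors are already normalized, so the dot product is the cosine.
    return sum(x * y for x, y in zip(a, b))

corpus = {
    "GMV": "gross merchandise volume, sum of paid order amounts",
    "DAU": "daily active users counted by distinct device id",
}
corpus_vecs = {k: embed(f"{k} {v}") for k, v in corpus.items()}

query = embed("how is GMV gross merchandise volume calculated")
best = max(corpus_vecs, key=lambda k: cosine(query, corpus_vecs[k]))
print(best)  # the GMV entry scores highest for this query
```

A real embedding model would also match paraphrases with no token overlap at all, which is exactly the "language variation" problem plaintext matching cannot handle.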

2.3 Large language model access

During a user consultation, the dialogue content is also vectorized so that it can be matched against the indexed corpus. The corpus content most relevant to the user's inquiry is then found by vector similarity.

Users often type only a few words when asking a question, but those words tend to be focused. Vectorization encodes this key information into a vector representation; a variety of techniques can be used to convert the text into a numerical vector. The corpus content most similar to the user's input is then found by computing the similarity between vectors.

Once matches are found, the results are sorted by vector score. The score reflects the degree of match, and a higher score indicates a better match. Sorting puts the best-matching corpus content first, so users receive the most relevant answers to their inquiry.

After the matching results are determined, the corresponding corpus entries are recalled and sent, together with the user's question, to the interface provided by the large language model for assembly. The model generates a coherent, natural response from this input. By combining the matching results with the large language model, the relevant corpus content is turned into a more specific and detailed answer, giving users a more professional and accurate consulting service.
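The recall-sort-assemble step can be sketched as below. The score threshold, `top_k`, the prompt wording, and the sample hits are all hypothetical; the actual LLM call is left abstract because it depends on whichever model interface is deployed.

```python
# Sketch: rank recalled corpus entries by vector score and assemble them
# into a prompt for the large language model interface.

def assemble_prompt(question, hits, top_k=3, threshold=0.3):
    """hits: list of (score, corpus_text); higher score means a better match."""
    ranked = sorted(hits, key=lambda h: h[0], reverse=True)
    # Keep the top-k entries, dropping weak matches below the threshold.
    selected = [text for score, text in ranked[:top_k] if score >= threshold]
    context = "\n".join(f"- {t}" for t in selected)
    return (
        "Answer the user's question using only the reference material below.\n"
        f"Reference:\n{context}\n"
        f"Question: {question}\n"
    )

hits = [(0.42, "Metric GMV: sum of paid order amounts"),
        (0.18, "Table dwd_order: order detail table"),
        (0.71, "GMV generation rule: sum(pay_amount) where status = 'paid'")]
prompt = assemble_prompt("How is GMV calculated?", hits)
print(prompt)
```

The assembled prompt is what gets sent to the model's interface; grounding the model in the recalled corpus is what turns a generic chat reply into an answer tied to the warehouse's actual metadata and indicator definitions.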

2.4 Front-end application deployment


When choosing the front-end entry point, we considered the two original systems and other comprehensive platforms, but found no natural entry point or compelling reason for users to adopt them. We finally decided to integrate the capability into the IDE query platform. Users who query information with SQL often need to understand the meaning and usage of metrics and the metadata of tables; these users span groups such as data, product, operations, and algorithm teams, making this platform the entry point that best fits our use case.

By integrating the system into the IDE query platform, we give these users a unified interface and entry point for querying and understanding metric information. Users can enter SQL in the query platform and obtain details about metrics, including their definitions, calculation methods, usage examples, and table metadata. They can query and understand metrics on one platform without switching between systems, and the accurate, comprehensive indicator information returned helps them better understand and use indicators, improving work efficiency.

Overall framework diagram:


3. Application effects

By applying large language models to metadata and indicator retrieval in data warehouse governance, we solved the following problems:

  • Improved retrieval efficiency: traditional metadata and indicator retrieval requires complex query statements and cumbersome operations, and is inefficient. With large language model technology, users only need to enter a natural language question and the system quickly returns the corresponding results, greatly improving retrieval efficiency.
  • Improved search accuracy: traditional retrieval is easily affected by imprecisely worded queries, which leads to inaccurate results. A large language model's semantic understanding and reasoning capabilities better capture the user's query intent and improve the accuracy of the search results.
  • Better user experience: traditional retrieval requires a certain technical background and operational experience, which is difficult for non-experts. With a large language model, users need only enter a natural language question, without learning complex query syntax or operation steps, which greatly improves the user experience.

Summary: with the support of large language models, we can achieve smarter and more convenient metadata and metric management. The system understands natural language input and provides relevant metadata and metric information based on user needs, making data operations and data analysis more efficient and accurate and raising the level of data governance and analysis in the warehouse. We hope this article is helpful to you; thank you for reading!

About the author

Fan

■ Data Platform Department - Data Warehouse Team

■ Mainly responsible for Autohome data warehouse construction, data development, and search business integration.

Author: Fan Wen

Source: WeChat public account: Home Technology

Source: https://mp.weixin.qq.com/s/LSrYbDMT38YovyNIpkUhAg
