In this paper, we propose Structure Aware DeNse ReTrievAl (SANTA), a dense vector retrieval method for structured data. SANTA designs two tasks, structured data alignment and masked entity prediction, to continue training pre-trained language models. Experimental results show that SANTA learns more accurate structured data representations by capturing the semantics of structured data, and achieves state-of-the-art results on two tasks: code search and product search.
Paper Link:
https://aclanthology.org/2023.findings-acl.734/
Open Source Code:
https://github.com/OpenMatch/SANTA
1. Research background
Structured data such as code, HTML documents, and product descriptions are ubiquitous in articles, books, and web pages. Learning the semantic information behind the text structure in order to represent structured data is essential to building a more complete retrieval system. As shown in Figure 1, structured data retrieval tasks, such as code retrieval and product retrieval, require the model to retrieve structured data based on user queries. Dense vector retrieval is a commonly used information retrieval method: it encodes the user query and the structured data into a shared vector space and returns the structured data the user needs by matching on vector similarity.
Figure 1. Example of structured data retrieval
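The matching step of dense vector retrieval can be illustrated with a minimal sketch. It assumes cosine similarity as the scoring function (the text above only says "similarity of the vector") and uses hand-written toy vectors in place of real encoder outputs; the names `cosine`, `retrieve`, and the item ids are hypothetical.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def retrieve(query_vec: list[float], item_vecs: dict[str, list[float]], top_k: int = 2) -> list[str]:
    """Rank structured items by similarity to the query and return the top_k ids."""
    scored = [(cosine(query_vec, vec), item_id) for item_id, vec in item_vecs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [item_id for _, item_id in scored[:top_k]]


if __name__ == "__main__":
    # Toy 2-d "embeddings"; a real system would obtain these from the encoder.
    items = {
        "sort_fn": [1.0, 0.0],
        "parse_fn": [0.0, 1.0],
        "merge_fn": [0.7, 0.7],
    }
    print(retrieve([0.9, 0.1], items))
```

In practice the item vectors are computed once and indexed, so only the query needs to be encoded at search time.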
However, most pre-trained language models lack structure-aware pre-training and therefore cannot provide effective vector representations for structured data retrieval. Related work proposes structure-aware pre-training methods that continue training pre-trained language models so that they become structure-aware and represent structured data better. These methods typically design a specific masking strategy and train the pre-trained language model with masked language modeling.
However, masked language modeling alone may not adequately train a pre-trained language model to represent structured data effectively. Natural alignment signals usually exist between structured and unstructured data, and structured data also contains special structural information; both provide strong support for training structured data representations. On this basis, we propose a structure-aware language model pre-training method to build a dense vector retrieval model for structured data.
2. Structure-aware language model pre-training method
Figure 2. Overview of the structure-aware pre-training method. We use two pre-training tasks: structured data alignment (SDA) and masked entity prediction (MEP).
For structured data retrieval, we propose a structure-aware dense vector retrieval method (SANTA). As shown in Figure 2, SANTA designs two pre-training tasks, Structured Data Alignment (SDA) and Masked Entity Prediction (MEP), to continue training the pre-trained language model so that it becomes more sensitive to structured data and learns better representations of it.
- Data collection and processing: We construct pre-training data pairs from the natural alignment signals that exist between structured and unstructured data, namely code paired with its description documentation and product descriptions paired with product bullet points. For code, we treat code identifiers such as variables, function names, external libraries, and methods as entities; we use BytesIO to identify entities in Python and tree_sitter in other programming languages. For products, we use NLTK to identify nouns and proper nouns that appear in both the product description and the title, and treat them as entities.
- Structured data alignment: We compute the similarity score between the encoded unstructured data and the encoded structured data, and then use contrastive learning to continue training the language model. Aligning the two modalities during training guides the language model to optimize the vector space.
Equation 1. Structured data alignment. The negative set consists of structured data sampled from in-batch negatives.
- Masked entity prediction: Since entity semantics play an important role in learning the structural semantics of the data, we use masked language modeling during pre-training to help the language model capture the structural semantic information behind the data. Specifically, we train the language model with Equation 2 to recover the masked entities from the context and its learned knowledge, so that it better understands the structural semantics of the data.
Equation 2. Masked entity prediction
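The three steps above can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the authors' implementation: entities are taken to be all non-keyword identifiers found by Python's `tokenize` module reading from a `BytesIO` buffer; masking uses T5-style sentinel strings `<extra_id_i>`, with the same sentinel for every occurrence of the same entity (an assumption); and the alignment loss is a standard in-batch contrastive loss over a precomputed similarity matrix. The function names are hypothetical.

```python
import io
import keyword
import math
import re
import tokenize


def extract_entities(code: str) -> list[str]:
    """Collect distinct identifiers (non-keyword NAME tokens) from Python source."""
    entities: list[str] = []
    readline = io.BytesIO(code.encode("utf-8")).readline
    for tok in tokenize.tokenize(readline):
        is_identifier = tok.type == tokenize.NAME and not keyword.iskeyword(tok.string)
        if is_identifier and tok.string not in entities:
            entities.append(tok.string)
    return entities


def mask_entities(code: str, entities: list[str]) -> str:
    """Replace every occurrence of entity i with the sentinel <extra_id_i>.

    Simplification: a whole-word regex is used, so matches inside string
    literals or comments are masked as well.
    """
    if not entities:
        return code
    sentinel = {name: f"<extra_id_{i}>" for i, name in enumerate(entities)}
    pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in entities) + r")\b")
    return pattern.sub(lambda m: sentinel[m.group(1)], code)


def sda_loss(scores: list[list[float]]) -> float:
    """In-batch contrastive loss over a square similarity matrix.

    scores[i][j] is the similarity between text i and structured item j.
    The diagonal holds the aligned pairs; the other entries in each row
    act as in-batch negatives.
    """
    total = 0.0
    for i, row in enumerate(scores):
        peak = max(row)  # subtract the row maximum for numerical stability
        log_denominator = peak + math.log(sum(math.exp(s - peak) for s in row))
        total += log_denominator - row[i]  # negative log-softmax of the positive
    return total / len(scores)


if __name__ == "__main__":
    snippet = "def add(a, b):\n    return a + b\n"
    found = extract_entities(snippet)
    print(found)
    print(mask_entities(snippet, found))
    print(sda_loss([[5.0, 0.0], [0.0, 5.0]]))
```

The loss shrinks as the diagonal (aligned) scores grow relative to the off-diagonal ones, which is the behavior the alignment task relies on.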
3. Experimental results
Table 1. Performance of different retrieval models on code retrieval and product retrieval tasks
As shown in Table 1, SANTA exhibits strong zero-shot capability when compared with fine-tuned models: without fine-tuning, it achieves a 6.8% improvement over fine-tuned CodeT5 on the code retrieval task. After fine-tuning, SANTA achieves improvements of about 8% and 2% over CodeT5 and T5 on the code retrieval and product search tasks, respectively. SANTA also achieves a 4.3% improvement over CodeRetriever, the state-of-the-art code retrieval model.
Table 2. Ablation experiments
As shown in Table 2, adding the MEP task to the baseline model leaves performance almost unchanged, indicating that masked language modeling alone contributes little to representation learning for structured text. In contrast, the SDA task brings a significant improvement on both structured data retrieval tasks. When the two pre-training tasks are used together, retrieval performance improves further. This suggests that the MEP task, when combined with the SDA task, helps provide a more effective vector representation of structured data.
Figure 3. Vector space display of different pre-training methods
As shown in Figure 3, the SDA task aligns structured and unstructured data well, but the vector representations of the two are mixed together. With the addition of the MEP task, the language model gains the ability to distinguish structured from unstructured text and distributes them into different regions. In summary, SDA and MEP help the language model capture the structural characteristics of the data from different aspects, leading to more accurate retrieval results.
4. Summary
Existing pre-training work neglects to design dedicated structure-aware pre-training tasks for learning representations of structured data, which leaves performance on structured data retrieval tasks unsatisfactory. In this article, we design two tasks, structured data alignment and masked entity prediction, and train the language model to learn the structural semantic information behind the data. Our experimental results show that SANTA learns more accurate structured data representations by capturing the semantics of structured data, and ultimately achieves state-of-the-art results on two tasks: code search and product search.
Authors: Li Xinze, Liu Zhenghao, et al. Source: public account [Social Media SMP]