ACL 2023 | Structure-aware language model training method for information retrieval

For structured data retrieval, this paper proposes SANTA (Structure Aware deNse ReTrievAl), a structure-aware dense retrieval method. SANTA designs two tasks, structured data alignment and masked entity prediction, to continue training pre-trained language models. Experimental results show that SANTA learns more accurate representations of structured data by capturing their semantics, and achieves state-of-the-art results on both code search and product search.

Paper Link:

https://aclanthology.org/2023.findings-acl.734/

Open Source Code:

https://github.com/OpenMatch/SANTA

1. Research background

Structured data such as code, HTML documents, and product descriptions are ubiquitous in articles, books, and web pages. Learning the semantics behind text structure in order to represent structured data is essential to building a more complete retrieval system. As shown in Figure 1, structured data retrieval tasks, such as code search and product search, require the model to retrieve structured data for a user query. Dense retrieval is a widely used information retrieval method: it encodes the user query and the structured data into the same vector space and returns the structured data whose vectors are most similar to the query vector.
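The matching step of dense retrieval described above can be sketched as follows. This is a minimal illustration, not the SANTA implementation: the vectors are hand-written stand-ins for encoder outputs, the function name rank_by_cosine is hypothetical, and cosine similarity is an assumed choice of similarity function.

```python
import numpy as np

def rank_by_cosine(query_vec, doc_vecs):
    """Return document indices sorted from most to least similar, plus the scores."""
    # Normalize so that the dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q
    order = np.argsort(-scores)  # descending similarity
    return order, scores

# Toy embeddings standing in for encoded query and structured data (hypothetical values).
query = np.array([1.0, 0.0, 1.0])
docs = np.array([
    [0.9, 0.1, 1.1],    # points in nearly the same direction as the query
    [-1.0, 0.5, 0.0],   # points away from the query
    [0.0, 1.0, 0.0],    # orthogonal to the query
])
order, scores = rank_by_cosine(query, docs)
```

In a real system the vectors would come from the language model encoder, and the ranking would typically run over an approximate nearest-neighbor index rather than a full matrix product.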

Figure 1. Examples of structured data retrieval

However, most pre-trained language models lack structure-aware pre-training and therefore do not provide effective vector representations for structured data retrieval. Related work proposes structure-aware pre-training methods that continue training a pre-trained language model so that it represents structured data better. These methods typically design a specific masking strategy and train the model with masked language modeling.

Masked language modeling alone, however, may not be enough to train a pre-trained language model to represent structured data effectively. Structured and unstructured data usually come with natural alignment signals, and structured data also carries its own structural information; both provide strong supervision for learning structured data representations. Building on this, we propose a structure-aware language model pre-training method to obtain a dense retrieval model for structured data.

2. Structure-aware language model pre-training method

Figure 2. Overview of the structure-aware pre-training method. We use two pre-training tasks: structured data alignment (SDA) and masked entity prediction (MEP).

For structured data retrieval, we propose SANTA, a structure-aware dense retrieval method. As shown in Figure 2, SANTA designs two pre-training tasks, Structured Data Alignment (SDA) and Masked Entity Prediction (MEP), to continue training the pre-trained language model so that it becomes more sensitive to structure and learns better representations of structured data.

Data collection and processing: We construct pre-training data pairs from the alignment signals that naturally exist between structured and unstructured data: code paired with its description documentation, and product descriptions paired with their bullet points. For code, we treat certain code identifiers as entities, such as variables, function names, external libraries, and methods, and use BytesIO and tree_sitter to identify entities in Python and in other programming languages, respectively. For products, we use NLTK to identify nouns and proper nouns that appear in both the product description and the title, and treat them as entities.
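As a rough illustration of identifier extraction for Python code, the sketch below feeds source text through BytesIO into the standard-library tokenize module and keeps non-keyword names. The article only names BytesIO and tree_sitter; pairing BytesIO with tokenize, the function name python_identifiers, and the filtering rule are all assumptions made for this example.

```python
import keyword
import tokenize
from io import BytesIO

def python_identifiers(code):
    """Collect identifier names (candidate entities) in order of first appearance."""
    names = []
    # tokenize.tokenize expects a readline callable that yields bytes.
    readline = BytesIO(code.encode("utf-8")).readline
    for tok in tokenize.tokenize(readline):
        # Keep NAME tokens that are not language keywords (assumed filter).
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            if tok.string not in names:
                names.append(tok.string)
    return names

snippet = "import os\ndef read_file(path):\n    return os.path.exists(path)\n"
entities = python_identifiers(snippet)
```

Here entities evaluates to ['os', 'read_file', 'path', 'exists']. A token-level pass like this cannot tell a variable from an attribute with the same name, which is one reason a parser-based tool is useful for finer distinctions.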

Structured data alignment: We encode the unstructured data and the structured data, compute the similarity score between them, and then continue training the language model with contrastive learning. Aligning the two data modalities during training guides the language model to optimize its vector space.

Equation 1. Structured data alignment. The negative set consists of structured data sampled as in-batch negatives.
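A contrastive objective with in-batch negatives can be sketched as below. This is a generic formulation under stated assumptions, not the paper's exact equation: similarity is taken to be a plain dot product, no temperature is applied, and the function name in_batch_contrastive_loss and the toy vectors are hypothetical.

```python
import numpy as np

def in_batch_contrastive_loss(text_vecs, struct_vecs):
    """Mean cross-entropy where row i of each matrix is a positive pair.

    Every other structured vector in the batch acts as a negative for text i.
    """
    scores = text_vecs @ struct_vecs.T                    # (B, B) similarity matrix
    scores = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))                   # positives sit on the diagonal

# Toy batch: perfectly aligned pairs versus deliberately mismatched pairs.
aligned = np.eye(3) * 5.0
loss_aligned = in_batch_contrastive_loss(aligned, aligned)
loss_shuffled = in_batch_contrastive_loss(aligned, aligned[[1, 2, 0]])
```

The aligned batch yields a loss close to zero, while the mismatched batch yields a large loss, which is the pressure that pulls matching text and structured data together in the vector space.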

Masked entity prediction: Because entity semantics play an important role in learning the structural semantics of data, we use masked language modeling during pre-training to help the language model capture the structural semantics behind the data. Specifically, we train the language model with Equation 2 to recover the masked entities from the context and from its learned knowledge, so that it better understands the structural semantics of the data.

Equation 2. Masked entity prediction
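The construction of a masked entity prediction example can be sketched as follows. The sentinel format <extra_id_k> follows the T5 convention and is an assumption here, as is the choice to give every mention of the same entity the same sentinel; the function name mask_entities is hypothetical.

```python
import re

def mask_entities(text, entities):
    """Replace each entity with a sentinel token and build the recovery target.

    All mentions of one entity share one sentinel (assumed convention).
    """
    masked = text
    targets = []
    for idx, entity in enumerate(entities):
        sentinel = f"<extra_id_{idx}>"
        # Word boundaries keep "path" from matching inside longer identifiers.
        pattern = r"\b" + re.escape(entity) + r"\b"
        masked, count = re.subn(pattern, sentinel, masked)
        if count:
            targets.append(f"{sentinel} {entity}")
    return masked, " ".join(targets)

code = "def read_file(path): return open(path).read()"
masked, target = mask_entities(code, ["read_file", "path"])
```

The model would receive masked as input and be trained to generate target, so it has to infer each entity from the surrounding code or text.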

3. Experimental results

Table 1. Performance of different retrieval models on code search and product search

As shown in Table 1, our model (SANTA) shows strong zero-shot ability: without fine-tuning it is comparable to fine-tuned models, and on code search it improves on fine-tuned CodeT5 by 6.8%. After fine-tuning, SANTA improves by about 8% over CodeT5 on code search and by about 2% over T5 on product search. SANTA also improves by 4.3% over CodeRetriever, the state-of-the-art code retrieval model.

Table 2. Ablation experiments

As shown in Table 2, adding the MEP task to the baseline model leaves performance almost unchanged, indicating that masked language modeling alone contributes little to representation learning for structured text. The SDA task, in contrast, brings a significant improvement on both structured data retrieval tasks. When the two pre-training tasks are used together, retrieval performance improves further. This suggests that MEP yields more effective vector representations of structured data when it is combined with SDA.

Figure 3. Visualization of the vector spaces learned with different pre-training methods

As shown in Figure 3, the SDA task aligns structured and unstructured data well, but the vector representations of the two are mixed together. With the MEP task added, the language model can distinguish structured from unstructured text and places them in different regions. In summary, SDA and MEP help the language model capture the structural characteristics of data from different aspects, which leads to more accurate retrieval results.

4. Summary

Existing pre-training work neglects to design structure-aware pre-training tasks for learning representations of structured data, which leaves performance on structured data retrieval tasks unsatisfactory. In this paper, we design two tasks, structured data alignment and masked entity prediction, and train the language model to learn the structural semantics behind the data. Our experimental results show that SANTA learns more accurate representations of structured data by capturing their semantics, and ultimately achieves state-of-the-art results on two tasks: code search and product search.

Authors: Li Xinze, Liu Zhenghao, et al. Source: public account [social media SMP]

