
When LLM Meets Database: Alibaba DAMO Academy and HKU Launch a New Text-to-SQL Benchmark

Author: Machine Heart Pro

Machine Heart column

Authors: Izuku, Lincheng, Huang Fei, Shuide

On the new benchmark BIRD, ChatGPT reaches only 40.08%, still far below the human result of 92.96%.

Background

Large language models (LLMs) offer a new path toward artificial general intelligence (AGI): through large-scale self-supervised training on massive public corpora such as the Internet and books, they acquire strong language understanding, language generation, and reasoning abilities. Applying large models to private-domain data, however, still faces challenges. Private-domain data is data owned by a specific enterprise or individual and usually contains domain-specific knowledge; combining large models with such knowledge can unlock enormous value.

By form, private-domain knowledge can be divided into unstructured and structured data. Unstructured data, such as documents, is typically handled with retrieval augmentation, and Q&A systems over it can be built quickly with tools like LangChain. Structured data, such as databases (DBs), requires the large model to interact with the database, querying and analyzing it to extract useful information. A series of products and applications have recently emerged around large models and databases, such as using LLMs to build intelligent databases, perform BI analysis, and automate table construction. Among these, text-to-SQL, that is, interacting with a database in natural language, has long been a highly anticipated direction.

In academia, past text-to-SQL benchmarks focused only on small-scale databases, on which state-of-the-art LLMs already achieve 85.3% execution accuracy. But does this mean LLMs can already serve as a natural language interface to databases?

Next-generation datasets

Recently, Alibaba, together with the University of Hong Kong and other institutions, launched BIRD (Can LLM Already Serve as A Database Interface? A BIg Bench for Large-Scale Database Grounded Text-to-SQLs), a new benchmark built on large-scale real databases. It contains 95 large-scale databases and high-quality text-SQL pairs, with data storage of up to 33.4 GB. The previous best model scores only 40.08% on BIRD, still far from the human result of 92.96%, proving that challenges remain. Beyond evaluating SQL correctness, the authors also evaluate SQL execution efficiency, expecting models to write SQL that is not only correct but also efficient.


Paper: https://arxiv.org/abs/2305.03111

Homepage: https://bird-bench.github.io

Code: https://github.com/AlibabaResearch/DAMO-ConvAI/tree/main/bird


BIRD's data, code, and leaderboard are now open source, and downloads worldwide have exceeded 10,000. At launch, BIRD sparked widespread attention and discussion on Twitter.


Comments from overseas users were also enthusiastic:


An LLM project not to be missed


Very useful checkpoints, and fertile ground for new research


AI can help you, but it can't replace you yet


My job is safe for the time being...

Method overview

New challenges

The study focuses on text-to-SQL evaluation over real databases. Popular past benchmarks, such as Spider and WikiSQL, consider only the database schema with little database content, leaving a gap between academic research and practical application. BIRD focuses on three new challenges: massive and realistic database content, external knowledge reasoning that connects natural language questions to database content, and SQL execution efficiency on large databases.


First, the databases contain massive and noisy data values. In the example on the left, computing the average salary requires converting a string stored in the database to a float value before aggregating;
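This kind of value-level noise can be illustrated with a small SQLite sketch (the table, column names, and string format here are invented for illustration, not taken from BIRD):

```python
import sqlite3

# Hypothetical table: salaries stored as formatted text, as in raw data dumps.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (name TEXT, salary TEXT)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?)",
    [("a", "US$40,000.00"), ("b", "US$60,000.00")],
)

# A naive AVG over the text column silently yields 0.0 in SQLite,
# because the strings do not parse as numbers.
naive = conn.execute("SELECT AVG(salary) FROM employee").fetchone()[0]

# Strip the currency prefix and thousands separators, then CAST to REAL.
cleaned = conn.execute(
    "SELECT AVG(CAST(REPLACE(REPLACE(salary, 'US$', ''), ',', '') AS REAL)) "
    "FROM employee"
).fetchone()[0]

print(naive, cleaned)  # 0.0 50000.0
```

A model that sees only the schema has no way to know the conversion is needed; it must inspect (or be told about) the actual values.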

In the middle example, to return the correct answer to the user, the model must first know that only accounts of type "OWNER" are eligible for loans; that is, uncovering what is hidden in huge database content sometimes requires external knowledge and reasoning;

Finally, query execution efficiency must be considered. In the example on the right, a more efficient SQL query significantly improves speed, which matters greatly to industry: users expect not only correct SQL but SQL that executes efficiently, especially on large databases.
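The efficiency point can be made concrete with a toy experiment (schema and data sizes are invented for this sketch): the same question answered before and after adding an index returns identical rows, but at very different cost on a large table.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, district TEXT)")
conn.executemany("INSERT INTO account VALUES (?, ?)",
                 [(i, f"d{i % 1000}") for i in range(200_000)])

def run(sql):
    """Execute a query and return (rows, elapsed seconds)."""
    t0 = time.perf_counter()
    rows = conn.execute(sql).fetchall()
    return rows, time.perf_counter() - t0

query = "SELECT id FROM account WHERE district = 'd42'"
rows_scan, t_scan = run(query)      # full table scan over 200k rows

conn.execute("CREATE INDEX idx_district ON account(district)")
rows_index, t_index = run(query)    # index lookup, same answer

print(f"scan: {t_scan:.4f}s  index: {t_index:.4f}s")
```

Both plans are "correct" in the execution-accuracy sense; only an efficiency-aware metric distinguishes them.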

Data annotation

During annotation, BIRD decouples question generation from SQL annotation. In addition, experts write database description files to help both question and SQL annotators better understand the databases.


1. Database collection: The authors collected and processed 80 databases from open-source data platforms such as Kaggle and the CTU Prague Relational Learning Repository. By collecting real tabular data, building ER diagrams, and setting database constraints, they also manually created 15 databases to serve as a black-box test set that current large models cannot have seen during training. BIRD's databases contain schemas and values from 37 domains, covering blockchain, sports, healthcare, gaming, and more.

2. Question collection: First, the authors hired experts to write a description file for each database, including full descriptions of the column names, descriptions of the database values, and the external knowledge needed to understand those values. Then 11 native English speakers from the United States, the United Kingdom, Canada, Singapore, and other countries, each holding at least a bachelor's degree, were recruited to write questions for BIRD.

3. SQL generation: A global annotation team of data engineers and database-course students was recruited to write SQL for BIRD. Given a database and its reference description file, an annotator writes SQL that correctly answers the question. A double-blind procedure is used, with two annotators independently annotating the same question, which minimizes errors introduced by any single annotator.

4. Quality inspection: Quality checking has two parts: validity of execution and consistency of results. Validity requires not only that the SQL executes correctly but also that its result is not null (NULL). Experts iteratively revise the question conditions until the SQL execution results are valid.

5. Difficulty classification: A text-to-SQL difficulty indicator gives researchers a reference for optimizing algorithms. The difficulty of text-to-SQL depends not only on SQL complexity but also on factors such as question difficulty, how easily the additional knowledge can be understood, and database complexity. The authors therefore asked the SQL annotators to rate difficulty during annotation and divided examples into three levels: easy, moderate, and challenging.
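The validity and consistency checks from step 4 might be sketched as follows (the helper name and details are illustrative, not the authors' actual tooling):

```python
import sqlite3
from collections import Counter

def valid_and_consistent(conn, sql_a, sql_b):
    """Check two double-blind annotations: both must execute, return a
    non-empty result containing at least one non-NULL value, and agree."""
    try:
        res_a = conn.execute(sql_a).fetchall()
        res_b = conn.execute(sql_b).fetchall()
    except sqlite3.Error:
        return False  # SQL that fails to execute is invalid

    def non_null(rows):
        return bool(rows) and any(v is not None for r in rows for v in r)

    if not (non_null(res_a) and non_null(res_b)):
        return False
    # Compare as multisets of rows so row order does not matter.
    return Counter(map(tuple, res_a)) == Counter(map(tuple, res_b))

# Toy usage: two syntactically different but equivalent annotations agree.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE loan (account_id INT, type TEXT)")
conn.executemany("INSERT INTO loan VALUES (?, ?)",
                 [(1, "OWNER"), (2, "OWNER"), (3, "DISPONENT")])
ok = valid_and_consistent(
    conn,
    "SELECT account_id FROM loan WHERE type = 'OWNER'",
    "SELECT account_id FROM loan WHERE type = 'OWNER' ORDER BY account_id DESC",
)
print(ok)  # True
```

Order-insensitive comparison matters here: two correct annotations may order rows differently unless the question itself demands an ordering.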

Statistics

1. Question type statistics: Questions fall into two categories, fundamental and reasoning. Fundamental question types cover those found in traditional text-to-SQL datasets, while reasoning question types require external knowledge to understand the values:


2. Database distribution: The authors use a sunburst diagram to show the relationship between database domains and data sizes. A larger radius means more text-SQL pairs are based on that database, and vice versa; a darker color means a larger database. For example, donor is the largest database in the benchmark at 4.5 GB.


3. SQL distribution: The authors show that BIRD's SQL is the most diverse and complex to date along four dimensions: number of tokens, number of keywords, number of n-gram types, and number of JOINs.


Evaluation metrics

1. Execution accuracy (EX): compares the execution result of the model-predicted SQL with that of the gold annotated SQL;

2. Valid efficiency score (VES): considers both the accuracy and the efficiency of the SQL, comparing the relative execution speed of the predicted SQL against the gold SQL, with running time as the main measure of efficiency.
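A simplified sketch of the two metrics (following the spirit of the paper's definitions; the exact reward shape and timing methodology are specified in the paper itself):

```python
import math
from collections import Counter

def same_result(pred_rows, gold_rows):
    # Result sets are compared as multisets of rows, ignoring order.
    return Counter(map(tuple, pred_rows)) == Counter(map(tuple, gold_rows))

def execution_accuracy(preds, golds):
    """EX: fraction of questions whose predicted SQL reproduces the gold result."""
    return sum(same_result(p, g) for p, g in zip(preds, golds)) / len(golds)

def valid_efficiency_score(preds, golds, pred_times, gold_times):
    """VES sketch: correct predictions are rewarded by sqrt(gold_time / pred_time),
    so an equally correct but slower SQL scores below 1, a faster one above 1,
    and an incorrect SQL contributes nothing."""
    total = 0.0
    for p, g, tp, tg in zip(preds, golds, pred_times, gold_times):
        if same_result(p, g):
            total += math.sqrt(tg / tp)
    return total / len(golds)

# Toy example: two questions, one answered correctly at equal speed.
preds = [[(1,), (2,)], [(9,)]]
golds = [[(2,), (1,)], [(7,)]]
print(execution_accuracy(preds, golds))                              # 0.5
print(valid_efficiency_score(preds, golds, [0.1, 0.1], [0.1, 0.1]))  # 0.5
```

Under this shape, a model is pushed to write SQL that is both correct and no slower than the reference.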

Experimental analysis

As baselines, the authors chose the best performers on previous benchmarks: the fine-tuned T5 model and the large language models (LLMs) Codex (code-davinci-002) and ChatGPT (gpt-3.5-turbo). To better understand whether multi-step reasoning can elicit the reasoning ability of large language models in a real database environment, chain-of-thought versions of the LLM baselines are also provided. The baselines are tested in two settings: one with the full schema information as input, and one that additionally provides a natural-language description (knowledge evidence), in which humans summarize their understanding of the database values involved in the question, to help the model understand the database.
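The two input settings can be illustrated with a minimal prompt builder (the exact prompt wording used in the paper differs; this is only a sketch, and the schema and evidence strings below are invented):

```python
def build_prompt(schema, question, evidence=None):
    """Zero-shot text-to-SQL prompt in the two settings described above:
    schema only, or schema plus human-written knowledge evidence."""
    parts = [f"-- SQLite database schema:\n{schema}"]
    if evidence is not None:
        parts.append(f"-- External knowledge evidence: {evidence}")
    parts.append(f"-- Question: {question}")
    parts.append("SELECT")  # nudge the model to complete a SQL query
    return "\n".join(parts)

schema = "CREATE TABLE loan (account_id INT, type TEXT, amount REAL);"
print(build_prompt(schema,
                   "What is the total loan amount of owners?",
                   evidence="Accounts eligible for loans have type = 'OWNER'."))
```

The evidence line is exactly the kind of value-level hint that, per the results below, a schema-only prompt cannot recover on its own.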


The author gives some conclusions:

1. Gains from additional knowledge: Adding knowledge evidence about database values brings a clear improvement, proving that in real database scenarios semantic parsing ability alone is not enough; understanding the database values helps answer users' questions more accurately.

2. Chain-of-thought is not always beneficial: Without the database value descriptions and in the zero-shot setting, the model's own chain-of-thought (CoT) reasoning generates answers more accurately. But when additional knowledge evidence is given, letting the LLM do CoT brings little improvement and can even hurt performance, suggesting that LLMs may suffer knowledge conflicts in this scenario. How to resolve this conflict so that the model both accepts external knowledge and benefits from its own strong multi-step reasoning will be a focus of future research.

3. Gap with humans: BIRD also provides a human baseline: the authors test annotators, in exam form, on their first encounter with the test set and use the results as the human score. Experiments show that the current best LLM is still far behind humans, proving that challenges remain. The authors also perform a detailed error analysis that points to potential directions for future research.


Conclusion

Applying LLMs in the database field will give users a smarter and more convenient database interaction experience. The arrival of BIRD will promote intelligent natural language interaction with real databases, leave room for progress in text-to-SQL technology in real database scenarios, and help researchers develop more advanced and practical database applications.
