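In Partnership with the Natural Language Processing Technical Committee: Term Release for "Generating SQL from Natural Language"

This issue's featured term: generating SQL from natural language (text-to-SQL).

Generating SQL from Natural Language (text-to-SQL)

Authors: Naihao Deng, Yulong Chen, Yue Zhang (Westlake University)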

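InfoBox:

Chinese name: 根據自然語言生成SQL

English name: text-to-SQL

Disciplines: natural language processing, machine learning, databases

Essence: learning to convert the semantic information in natural language into SQL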

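Basic introduction:

Text-to-SQL is the task of converting natural language into SQL. As shown in the figure above, when a user enters the question "What are the major cities in Kansas?", a text-to-SQL system outputs the corresponding SQL query. That query can then be run against the database to obtain the answer to the question.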
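As a minimal illustration of that last step, the sketch below runs a query of the kind a text-to-SQL system might produce for the Kansas question against a toy SQLite database. The `city` table, its columns, the made-up population numbers, and the reading of "major" as a population above 150,000 are all assumptions for this example; none of them comes from the figure or from a real dataset.

```python
import sqlite3

# Hypothetical toy database: schema and rows are illustrative assumptions,
# and the population numbers are made up.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE city (city_name TEXT, state_name TEXT, population INTEGER)"
)
conn.executemany(
    "INSERT INTO city VALUES (?, ?, ?)",
    [
        ("wichita", "kansas", 300000),
        ("kansas city", "kansas", 160000),
        ("topeka", "kansas", 120000),
        ("denver", "colorado", 500000),
    ],
)

question = "What are the major cities in Kansas?"

# SQL that a text-to-SQL system might output (written by hand here, not model output).
# "Major" is interpreted as population > 150000, which is an assumption.
predicted_sql = (
    "SELECT city_name FROM city "
    "WHERE state_name = 'kansas' AND population > 150000 "
    "ORDER BY city_name"
)


def answer(connection, sql):
    """Execute the generated SQL and return the first column of every row."""
    return [row[0] for row in connection.execute(sql).fetchall()]


major_cities = answer(conn, predicted_sql)
print(question, "->", major_cities)
```

The user never sees the SQL; the system only needs to map the question to a query that the database engine can execute.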

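Research overview:

Because text-to-SQL expresses the semantic information of natural language in SQL form, and because it helps build natural language interfaces to databases (as shown in the cover image), it has attracted researchers from both natural language processing (NLP) and databases (DB) [1, 2].

We summarize the main challenges of the text-to-SQL task as 1) encoding the meaning of the natural language, 2) decoding into SQL, and 3) translating semantics between the two forms of expression, natural language and SQL.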

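Datasets:

Early text-to-SQL datasets, such as Academic [3] and ATIS [4, 5], generally focus on a single domain; ATIS, for example, covers flight booking. To better study how models generalize across domains, researchers proposed multi-domain datasets such as WikiSQL [6] and Spider [7]. In addition, researchers have collected text-to-SQL datasets in other languages [8] and multi-turn conversational text-to-SQL datasets [9]. We summarize the text-to-SQL datasets at https://text-to-sql-survey-coling22.github.io/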

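Methods:

The methods currently popular in the NLP community cover data augmentation, model encoding, decoding, and a number of techniques from strong/weak supervision.

Because most existing models need large amounts of training data and learn the text-to-SQL mapping from that data, data augmentation tries to improve models from the data side [10, 11].

Most text-to-SQL models follow the seq2seq paradigm, which consists of an encoding stage and a decoding stage. The encoder encodes the natural language (the user's question) together with the structure of the database [12, 13]. The decoder focuses on converting the corresponding semantics into SQL efficiently and accurately while staying as faithful as possible to the original meaning of the natural language input [14, 15].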
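To make concrete what "the question plus the database structure" can look like as encoder input, here is a minimal sketch that flattens both into a single string. This is a deliberately simple serialization chosen for illustration, not the graph-based or relation-aware encodings of [12, 13]; the function name `serialize_input`, the separator format, and the example schema are all hypothetical.

```python
from typing import Dict, List


def serialize_input(question: str, schema: Dict[str, List[str]]) -> str:
    """Flatten a question and a database schema into one encoder input string.

    Each table is appended as "| table : col1 , col2 , ..." after the question.
    The separator format is an assumption made for this sketch.
    """
    parts = [question.strip()]
    for table, columns in schema.items():
        parts.append(f"| {table} : {' , '.join(columns)}")
    return " ".join(parts)


# Hypothetical schema, used only for illustration.
schema = {
    "city": ["city_name", "state_name", "population"],
    "state": ["state_name", "capital"],
}

encoder_input = serialize_input("What are the major cities in Kansas?", schema)
print(encoder_input)
```

A seq2seq model would tokenize this string and pass it to the encoder, and the decoder would then generate the SQL tokens. Flattening discards the relations between tables and columns, which is the kind of structure that schema-aware encoders try to preserve.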

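Methods from strong/weak supervision, along with other text-to-SQL methods, target different application scenarios or apply techniques that improve model performance; for example, active learning helps a model keep improving as users interact with the database [16]. We summarize these methods at https://text-to-sql-survey-coling22.github.io/.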

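Evaluation:

Evaluation metrics for text-to-SQL include naive execution accuracy [17] and exact string match [18], but the former is too "loose" (it yields false positives) and the latter too "strict" (it yields false negatives). To address this, researchers proposed comparing the smaller components of the SQL query separately (exact set match) [7], and running the SQL on artificially generated databases and comparing the results (test suite accuracy) [19]. Dataset splits include randomly assigning the questions in a dataset to train, dev, and test [3]; to study how models perform on unseen SQL queries or new domains, researchers proposed splitting by SQL query and splitting by database [7, 18]. In addition, some researchers devise splits tailored to their own research question; for example, some work on compositional generalization uses its own split [20].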
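The "too loose" and "too strict" behaviour can be reproduced on a toy example. The sketch below is only an illustration under stated assumptions: the `city` table and its rows are made up, `set_match` is a crude stand-in that compares only the conditions of a flat WHERE clause (it is not the exact set match implementation of [7]), and adding one hand-picked row merely gestures at the idea behind test suite accuracy [19] rather than reproducing that procedure.

```python
import sqlite3


def exact_string_match(pred: str, gold: str) -> bool:
    """Strict metric: the two SQL strings must be identical."""
    return pred.strip() == gold.strip()


def execution_match(conn, pred: str, gold: str) -> bool:
    """Naive execution accuracy: both queries return the same rows on one database."""
    try:
        pred_rows = conn.execute(pred).fetchall()
    except sqlite3.Error:
        return False  # a query that fails to run counts as wrong
    return sorted(pred_rows) == sorted(conn.execute(gold).fetchall())


def set_match(pred: str, gold: str) -> bool:
    """Crude component-wise comparison: same SELECT/FROM part, same set of AND conditions."""
    pred_head, _, pred_where = pred.partition(" WHERE ")
    gold_head, _, gold_where = gold.partition(" WHERE ")
    return pred_head == gold_head and (
        set(pred_where.split(" AND ")) == set(gold_where.split(" AND "))
    )


# Hypothetical database with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE city (city_name TEXT, state_name TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO city VALUES (?, ?, ?)",
    [("wichita", "kansas", 300000), ("topeka", "kansas", 120000)],
)

gold = "SELECT city_name FROM city WHERE state_name = 'kansas' AND population > 150000"
# Same meaning as gold, conditions written in a different order.
reordered = "SELECT city_name FROM city WHERE population > 150000 AND state_name = 'kansas'"
# Different meaning: the state filter is missing.
wrong = "SELECT city_name FROM city WHERE population > 150000"

# Exact string match rejects a correct query (false negative).
string_false_negative = (
    not exact_string_match(reordered, gold) and execution_match(conn, reordered, gold)
)
# Naive execution accuracy accepts a wrong query on this database (false positive).
execution_false_positive = execution_match(conn, wrong, gold)

# A row that separates the two queries exposes the wrong prediction.
conn.execute("INSERT INTO city VALUES ('denver', 'colorado', 500000)")
caught_after_new_row = not execution_match(conn, wrong, gold)

print(string_false_negative, execution_false_positive, caught_after_new_row)
print(set_match(reordered, gold), set_match(wrong, gold))
```

The component-wise check accepts the reordered query and rejects the one missing a condition, while execution-based checking only becomes reliable once the database contains rows that distinguish the two queries.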

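Discussion and outlook:

Text-to-SQL has made considerable progress in recent years in data, models, and evaluation methods alike. On Spider, a relatively complex multi-domain dataset, model performance has risen from an initial 10% to 20% to around 75% today. Nevertheless, many problems in text-to-SQL still call for study, and current research does not cover, or only rarely covers, some application scenarios. We list the following possible research directions for text-to-SQL:

1. Cross-domain text-to-SQL: how to use domain knowledge and existing datasets more efficiently, so that models can be applied quickly in new domains.

2. Different real-world application scenarios: how to handle incomplete tables; how to perform text-to-SQL when no table is provided; how to handle different kinds of user questions; how to help database administrators manage databases; multilingual text-to-SQL; building dedicated interfaces for people with disabilities.

3. Integrating text-to-SQL into broader research: building question-answering interfaces to databases; extracting knowledge from databases for dialogue tasks; exploring the intrinsic connections between SQL and other logical forms; building a general-purpose semantic parsing model that handles different meaning representations, including SQL.

4. Others, such as applying prompt learning and making systems more robust; improving existing evaluation mechanisms.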

References

[1]E. F. Codd. 1970. A relational model of data for large shared data banks. Commun. ACM, 13(6):377–387.

[2]Charles T. Hemphill, John J. Godfrey, and George R. Doddington. 1990. The ATIS spoken language systems pilot corpus. In Speech and Natural Language: Proceedings of a Workshop Held at Hidden Valley, Pennsylvania, June 24-27,1990.

[3]Fei Li and Hosagrahar V Jagadish. 2014. Constructing an interactive natural language interface for relational databases. Proceedings of the VLDB Endowment, 8(1):73–84.

[4]P. J. Price. 1990. Evaluation of spoken language systems: the ATIS domain. In Speech and Natural Language: Proceedings of a Workshop Held at Hidden Valley, Pennsylvania, June 24-27,1990.

[5]Deborah A. Dahl, Madeleine Bates, Michael Brown, William Fisher, Kate Hunicke-Smith, David Pallett, Christine Pao, Alexander Rudnicky, and Elizabeth Shriberg. 1994. Expanding the scope of the ATIS task: The ATIS-3 corpus. In Human Language Technology: Proceedings of a Workshop held at Plainsboro, New Jersey, March 8-11, 1994.

[6]Victor Zhong, Caiming Xiong, and Richard Socher. 2017. Seq2SQL: Generating structured queries from natural language using reinforcement learning. ArXiv preprint, abs/1709.00103.

[7]Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, Zilin Zhang, and Dragomir Radev. 2018. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-SQL task. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3911–3921, Brussels, Belgium. Association for Computational Linguistics.

[8]Qingkai Min, Yuefeng Shi, and Yue Zhang. 2019. A pilot study for Chinese SQL semantic parsing. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3652–3658, Hong Kong, China. Association for Computational Linguistics.

[9]Tao Yu, Rui Zhang, Michihiro Yasunaga, Yi Chern Tan, Xi Victoria Lin, Suyi Li, Heyang Er, Irene Li, Bo Pang, Tao Chen, Emily Ji, Shreya Dixit, David Proctor, Sungrok Shim, Jonathan Kraft, Vincent Zhang, Caiming Xiong, Richard Socher, and Dragomir Radev. 2019. SParC: Cross-domain semantic parsing in context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4511–4523, Florence, Italy. Association for Computational Linguistics.

[10]Victor Zhong, Mike Lewis, Sida I. Wang, and Luke Zettlemoyer. 2020. Grounded adaptation for zero-shot executable semantic parsing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6869–6882, Online. Association for Computational Linguistics.

[11]Bailin Wang, Wenpeng Yin, Xi Victoria Lin, and Caiming Xiong. 2021. Learning to synthesize data for semantic parsing. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2760–2766, Online. Association for Computational Linguistics.

[12]Ben Bogin, Jonathan Berant, and Matt Gardner. 2019. Representing schema structure with graph neural networks for text-to-SQL parsing. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4560–4565, Florence, Italy. Association for Computational Linguistics.

[13]Bailin Wang, Richard Shin, Xiaodong Liu, Oleksandr Polozov, and Matthew Richardson. 2020. RAT-SQL: Relation-aware schema encoding and linking for text-to-SQL parsers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7567–7578, Online. Association for Computational Linguistics.

[14]Jiaqi Guo, Zecheng Zhan, Yan Gao, Yan Xiao, Jian-Guang Lou, Ting Liu, and Dongmei Zhang. 2019. Towards complex text-to-SQL in cross-domain database with intermediate representation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4524–4535, Florence, Italy. Association for Computational Linguistics.

[15]Xiaojun Xu, Chang Liu, and Dawn Song. 2017. SQLNet: Generating structured queries from natural language without reinforcement learning. ArXiv preprint, abs/1711.04436.

[16]Ansong Ni, Pengcheng Yin, and Graham Neubig. 2020. Merging weak and active supervision for semantic parsing. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 8536–8543.

[17]John M Zelle and Raymond J Mooney. 1996. Learning to parse database queries using inductive logic programming. In Proceedings of the national conference on artificial intelligence, pages 1050–1055.

[18]Catherine Finegan-Dollak, Jonathan K. Kummerfeld, Li Zhang, Karthik Ramanathan, Sesh Sadasivam, Rui Zhang, and Dragomir Radev. 2018. Improving text-to-SQL evaluation methodology. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 351–360, Melbourne, Australia. Association for Computational Linguistics.

[19]Ruiqi Zhong, Tao Yu, and Dan Klein. 2020. Semantic evaluation for text-to-SQL with distilled test suites. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 396–411, Online. Association for Computational Linguistics.

[20]Peter Shaw, Ming-Wei Chang, Panupong Pasupat, and Kristina Toutanova. 2021. Compositional generalization and natural language variation: Can a semantic parsing approach handle both? In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 922–938, Online. Association for Computational Linguistics.

About the authors

Naihao Deng

Research area: semantic parsing

Yulong Chen

Research area: text summarization

Yue Zhang

Research areas: semantic parsing, machine translation, sentiment analysis, model generalization, and others

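About the Terminology Working Committee and the terminology platform:

The main responsibilities of the Computer Terminology Review Committee (Committee on Terminology) are to collect, translate, define, review, and recommend new computing terms, and to promote them on the CCF platform. This work is of great importance for clarifying the structure of the discipline, conducting scientific research, and spreading science and knowledge widely throughout society.

Building and continuously improving CCFpedia, the crowdsourced terminology platform, can effectively advance the collection, review, standardization, and dissemination of Chinese computing terms, while also helping to promote the development of standards in each field.

The new version of the CCFpedia computing terminology platform (http://term.ccf.org.cn) integrates term editing and operations with browsing and use, does away with the cumbersome cross-platform steps of the old version, and improves the look of the interface so that users can look up term information simply and conveniently. The new platform also organizes all term data as a knowledge graph, and its multi-level linking changes how terms are browsed.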

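Computer Terminology Review Working Committee

Chair:

Ting Liu (Harbin Institute of Technology)

Vice chairs:

Haofen Wang (Tongji University)

Guoliang Li (Tsinghua University)

Assistant to the chair:

Yibin Li (Shanghai Haiyizhi Information Technology Co., Ltd., 上海海乂知資訊科技有限公司)

Executive members:

Jun Ding (Shanghai Haiyizhi Information Technology Co., Ltd., 上海海乂知資訊科技有限公司)

Junyu Lin (Institute of Information Engineering, Chinese Academy of Sciences)

Yanyan Lan (Tsinghua University)

Weinan Zhang (Harbin Institute of Technology)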
