
Multilingual Stance Detection: The Catalonia Independence Corpus (cs.CL)

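Stance detection aims to determine the attitude expressed in a given text toward a specific topic or claim. Although stance detection has been well studied in recent years, most of that work focuses on English, mainly because annotated data is relatively scarce in other languages. The TW-10 Referendum Dataset, released at IberEval 2018, was an earlier effort to provide multilingual stance-annotated data in Catalan and Spanish. Unfortunately, the TW-10 Catalan subset is extremely imbalanced. This paper addresses these issues by presenting a new multilingual Twitter dataset for stance detection in Catalan and Spanish, with the aim of supporting research on stance detection in multilingual and cross-lingual settings. The dataset is annotated with stance toward a single topic: the independence of Catalonia. We also provide a semi-automatic method for annotating the dataset, based on a categorization of Twitter users. We experiment on the new corpus with a number of supervised approaches, including linear classifiers and deep learning methods. A comparison of the new corpus with the TW-1O dataset shows both the benefits and the potential of a well-balanced corpus for multilingual and cross-lingual research on stance detection. Finally, we establish new state-of-the-art results on the TW-10 dataset for both Catalan and Spanish.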

Original title: Multilingual Stance Detection: The Catalonia Independence Corpus

Original abstract: Stance detection aims to determine the attitude of a given text with respect to a specific topic or claim. While stance detection has been fairly well researched in the last years, most of the work has been focused on English. This is mainly due to the relative lack of annotated data in other languages. The TW-10 Referendum Dataset released at IberEval 2018 is a previous effort to provide multilingual stance-annotated data in Catalan and Spanish. Unfortunately, the TW-10 Catalan subset is extremely imbalanced. This paper addresses these issues by presenting a new multilingual dataset for stance detection in Twitter for the Catalan and Spanish languages, with the aim of facilitating research on stance detection in multilingual and cross-lingual settings. The dataset is annotated with stance towards one topic, namely, the independence of Catalonia. We also provide a semi-automatic method to annotate the dataset based on a categorization of Twitter users. We experiment on the new corpus with a number of supervised approaches, including linear classifiers and deep learning methods. Comparison of our new corpus with the TW-1O dataset shows both the benefits and potential of a well-balanced corpus for multilingual and cross-lingual research on stance detection. Finally, we establish new state-of-the-art results on the TW-10 dataset, both for Catalan and Spanish.

Authors: Elena Zotova, Rodrigo Agerri, Manuel Nuñez, German Rigau

Original URL: https://arxiv.org/abs/2004.00050
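
The abstract mentions a semi-automatic annotation method based on a categorization of Twitter users but gives no details. One plausible reading, offered here only as a minimal sketch and not as the authors' actual procedure, is that each tweet inherits the stance assigned to its author, while tweets from uncategorized users are set aside for manual review. The user handles, the `FAVOR`/`AGAINST` label names, and the `label_tweets` helper below are all hypothetical.

```python
from collections import Counter

# Hypothetical user categorization: user handle -> stance label.
# Handles and label names are invented for illustration only.
USER_STANCE = {
    "user_a": "FAVOR",
    "user_b": "AGAINST",
}


def label_tweets(tweets, user_stance):
    """Give each tweet its author's stance; collect unknown authors for manual review."""
    labeled, needs_review = [], []
    for tweet in tweets:
        stance = user_stance.get(tweet["user"])
        if stance is None:
            # The author has not been categorized, so a human must decide.
            needs_review.append(tweet)
        else:
            labeled.append({**tweet, "stance": stance})
    return labeled, needs_review


# Placeholder tweets; the text is not real data.
tweets = [
    {"user": "user_a", "text": "example tweet 1"},
    {"user": "user_b", "text": "example tweet 2"},
    {"user": "user_c", "text": "example tweet 3"},
]

labeled, needs_review = label_tweets(tweets, USER_STANCE)

# Counting labels per class makes any imbalance visible before training.
class_counts = Counter(t["stance"] for t in labeled)
```

Checking `class_counts` right after labeling is a cheap way to spot the kind of class imbalance the abstract reports for the TW-10 Catalan subset.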
