For Beginners | What Is Semantic Role Labeling?

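This post records my notes from studying semantic role labeling, one of the basic techniques of natural language processing. It covers the definition, common approaches, an example, and related shared-task evaluations, and closes with recommended Python tools and how to use them.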

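Definition

First, the definition of semantic role labeling on Wikipedia: Semantic role labeling, sometimes also called shallow semantic parsing, is a process in natural language processing that assigns labels to words or phrases in a sentence that indicate their semantic role in the sentence, such as that of an agent, goal, or result. It consists of the detection of the semantic arguments associated with the predicate or verb of a sentence and their classification into their specific roles.

Semantic role labeling (SRL) is a form of shallow semantic analysis. Given a sentence, the task of SRL is to find the semantic role constituents of each predicate in the sentence, including core semantic roles (such as agent and patient) and adjunct semantic roles (such as location, time, manner, and cause). Depending on the type of predicate, existing SRL work can be further divided into verbal-predicate SRL and nominal-predicate SRL.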
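To make the definition concrete, here is a minimal sketch of how one predicate with its core and adjunct roles might be represented in Python. The class names `Argument` and `PredicateFrame`, the example sentence, and the convention that adjunct labels carry an `AM-` prefix are assumptions chosen for illustration; they are not taken from any particular toolkit.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Argument:
    label: str  # role label, e.g. "A0" (agent) or "AM-TMP" (time)
    start: int  # index of the first token in the span (inclusive)
    end: int    # index of the last token in the span (inclusive)

    @property
    def is_core(self) -> bool:
        # Assumed convention: adjunct roles carry an "AM-" prefix.
        return not self.label.startswith("AM-")


@dataclass
class PredicateFrame:
    predicate_index: int  # token index of the predicate
    arguments: List[Argument] = field(default_factory=list)

    def describe(self, tokens: List[str]) -> str:
        """Render the predicate and each role span as readable text."""
        parts = [f"predicate={tokens[self.predicate_index]}"]
        for arg in self.arguments:
            kind = "core" if arg.is_core else "adjunct"
            text = " ".join(tokens[arg.start:arg.end + 1])
            parts.append(f"{arg.label}({kind})={text}")
        return "; ".join(parts)


tokens = ["Mary", "bought", "a", "book", "yesterday"]
frame = PredicateFrame(
    predicate_index=1,
    arguments=[
        Argument("A0", 0, 0),      # agent
        Argument("A1", 2, 3),      # patient
        Argument("AM-TMP", 4, 4),  # time adjunct
    ],
)
print(frame.describe(tokens))
```

A sentence with several predicates would simply hold one such frame per predicate.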

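Common Approaches

Active research directions in semantic role labeling include SRL based on constituency parse trees and SRL based on dependency parse trees. Depending on the part of speech of the predicate, the task can be further divided into verbal-predicate and nominal-predicate SRL. Although these tasks differ from one another, they share a similar labeling framework.

Current SRL implementations are usually built on syntactic parsing: for a given sentence, the system first obtains the parse, and then performs SRL on top of that parse. As a result, SRL performance depends heavily on the quality of the syntactic parse.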

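Example

Take SRL based on constituency parse trees as an example. The approach treats each constituent of the parse tree as a unit and decides whether it fills a semantic role of the given predicate:

Role pruning: apply heuristic rules to filter out constituents that cannot possibly fill a role.

Role identification: on the pruned candidates, build a binary classifier that decides whether or not each one is a semantic role of the given predicate.

Role classification: for the constituents identified as semantic roles, apply a multi-class classifier to decide which role each one plays.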
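The three stages above can be sketched as a small pipeline. This is only an illustration under stated assumptions: the `Constituent` class, the candidate list, and all three rules are hypothetical, and the hand-written rules in `is_argument` and `classify_role` merely stand in for the trained binary and multi-class classifiers a real system would use.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Constituent:
    phrase_type: str  # e.g. "NP", "VP", "PP"
    start: int        # first token index (inclusive)
    end: int          # last token index (inclusive)


def prune(candidates: List[Constituent], pred: int) -> List[Constituent]:
    """Stage 1 (pruning): drop constituents that contain the predicate itself."""
    return [c for c in candidates if not (c.start <= pred <= c.end)]


def is_argument(c: Constituent) -> bool:
    """Stage 2 (identification): toy stand-in for a trained binary classifier."""
    return c.phrase_type in {"NP", "PP"}


def classify_role(c: Constituent, pred: int) -> str:
    """Stage 3 (classification): toy stand-in for a trained multi-class classifier."""
    if c.phrase_type == "PP":
        return "AM-LOC"
    return "A0" if c.end < pred else "A1"


def label_roles(candidates: List[Constituent], pred: int) -> List[Tuple[str, int, int]]:
    kept = prune(candidates, pred)
    arguments = [c for c in kept if is_argument(c)]
    return [(classify_role(c, pred), c.start, c.end) for c in arguments]


tokens = ["The", "cat", "chased", "a", "mouse", "in", "the", "garden", "."]
predicate_index = 2  # "chased"
candidates = [
    Constituent("S", 0, 8),
    Constituent("NP", 0, 1),
    Constituent("VP", 2, 7),
    Constituent("NP", 3, 4),
    Constituent("PP", 5, 7),
    Constituent("PUNCT", 8, 8),
]
for label, start, end in label_roles(candidates, predicate_index):
    print(label, " ".join(tokens[start:end + 1]))
```

Pruning first keeps the two classifiers from having to score every node in the tree, which is the main reason the stages are separated.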

Related Evaluations

The CoNLL shared tasks in 2008 and 2009 evaluated the joint task of dependency parsing and semantic role labeling.

CoNLL 2008: https://www.clips.uantwerpen.be/conll2008/
CoNLL 2009: http://ufal.mff.cuni.cz/conll2009-st/task-description.html
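Evaluations of this kind typically report precision, recall, and F1 over labeled predicate-argument items. The sketch below computes those three numbers over sets of tuples. The tuple layout, the function name, and the gold and predicted data are assumptions made for illustration, and the official CoNLL scorers differ in their details.

```python
from typing import Set, Tuple

# (predicate index, role label, span start, span end)
Item = Tuple[int, str, int, int]


def precision_recall_f1(gold: Set[Item], predicted: Set[Item]) -> Tuple[float, float, float]:
    """Score predicted items against gold items; an item counts only on exact match."""
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


gold = {(1, "A0", 0, 0), (1, "A1", 2, 5), (4, "A1", 5, 5)}
predicted = {(1, "A0", 0, 0), (1, "A1", 2, 4), (4, "A1", 5, 5)}  # one wrong span boundary

p, r, f = precision_recall_f1(gold, predicted)
print(f"P={p:.3f} R={r:.3f} F1={f:.3f}")
```

Because an item must match in predicate, label, and span, a single wrong boundary costs both a precision error and a recall error.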

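Recommended Tools

Nlpnet

A Python library for neural-network-based natural language processing tasks. It currently provides part-of-speech tagging, semantic role labeling, and dependency parsing. The system was inspired by SENNA.

GitHub: https://github.com/erickrf/nlpnet

Pretrained models: http://nilc.icmc.usp.br/nlpnet/models.html#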

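# Install: pip install nlpnet
# Install from a mirror in mainland China: pip install nlpnet -i https://pypi.tuna.tsinghua.edu.cn/simple
# 1. nlpnet is a Python library for neural-network-based NLP tasks. It currently supports POS tagging, dependency parsing, and semantic role labeling.
# 2. Download a pretrained model first: http://nilc.icmc.usp.br/nlpnet/models.html#srl-portuguese (at the time of writing, the only pretrained SRL model is for Portuguese)
import nlpnet
# Raw string so the backslash in the Windows path is not read as an escape sequence
tagger = nlpnet.SRLTagger(r'nlpnet-model\srl-pt', language='pt')
# tag() returns one annotated object per sentence; take the first
sent = tagger.tag(u'O rato roeu a roupa do rei de Roma.')[0]
print(sent.arg_structures)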

[('roeu',
  {'A0': ['O', 'rato'],
   'A1': ['a', 'roupa', 'do', 'rei', 'de', 'Roma'],
   'V': ['roeu']})]
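The printed structure is a list of (predicate, roles) pairs, where the roles are a dictionary from label to token list. As a small sketch of how it could be flattened into one line per role, the helper below works on a literal copy of the output shown above, so it runs without nlpnet or a model; the function name `format_arg_structures` is an assumption, not part of nlpnet.

```python
def format_arg_structures(arg_structures):
    """Turn [(predicate, {label: [tokens]})] into 'predicate<TAB>label<TAB>text' lines."""
    lines = []
    for predicate, arguments in arg_structures:
        for label in sorted(arguments):
            lines.append(f"{predicate}\t{label}\t{' '.join(arguments[label])}")
    return lines


# Literal copy of the output printed above
arg_structures = [
    ('roeu',
     {'A0': ['O', 'rato'],
      'A1': ['a', 'roupa', 'do', 'rei', 'de', 'Roma'],
      'V': ['roeu']})
]

for line in format_arg_structures(arg_structures):
    print(line)
```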


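Pyltp

The Language Technology Platform (LTP) is a natural language processing toolkit built through 11 years of continuous development at the Research Center for Social Computing and Information Retrieval, Harbin Institute of Technology. It offers a rich, efficient, and accurate set of NLP components, including Chinese word segmentation, part-of-speech tagging, named entity recognition, dependency parsing, and semantic role labeling.

GitHub: https://github.com/HIT-SCIR/pyltp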

# Installing pyltp is a bit of a hassle, so here is the procedure that worked on Windows 10
# 1. First, pip install pyltp failed with: error: command 'C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\x86_amd64\cl.exe' failed with exit status 2
# Install cmake, download: https://cmake.org/download/
# Install VS2008 EXPRESS, download: https://visualstudio.microsoft.com/zh-hans/vs/express/
# 2. Then I chose to install with python setup.py install
# Download pyltp: https://github.com/hit-scir/pyltp
# Download ltp: https://github.com/hit-scir/ltp
# Unzip ltp, rename the extracted folder to ltp, and use it to overwrite the ltp folder inside the pyltp folder
# Open cmd, change into the pyltp directory, and locate setup.py
# First run: python setup.py build
# Then run: python setup.py install

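# The pretrained models must be downloaded first and then referenced by path
# Download: http://ltp.ai/download.html
# SRL needs word segmentation, POS tagging, and dependency parsing to run first
from pyltp import Segmentor
sentence = "我愛自然語言處理技術！"
seg = Segmentor()  # create the segmenter
# Raw string so the backslashes in the Windows path are not read as escape sequences
seg.load(r"pyltp-model\ltp_data_v3.4.0\cws.model")  # load the pretrained segmentation model
seg_words = seg.segment(sentence)
print(" ".join(seg_words))
seg.release()  # free resources

我 愛 自然 語言 處理 技術 ！
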
from pyltp import Postagger
pos = Postagger()
# Load the pretrained POS tagging model
pos.load(r"pyltp-model\ltp_data_v3.4.0\pos.model")
words_pos = pos.postag(seg_words)
# Print each word with its POS tag
for k, v in zip(seg_words, words_pos):
    print(k + '\t' + v)
pos.release()

我   r
愛   v
自然  n
語言  n
處理  v
技術  n
！   wp

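from pyltp import Parser
parser = Parser()
# Load the pretrained dependency parsing model
parser.load(r"pyltp-model\ltp_data_v3.4.0\parser.model")
arcs = parser.parse(seg_words, words_pos)
# arc.head is the 1-based index of the head word (0 means ROOT); arc.relation is the dependency label
print([(arc.head, arc.relation) for arc in arcs])
parser.release()

[(2, 'SBV'), (0, 'HED'), (4, 'ATT'), (5, 'FOB'), (2, 'VOB'), (5, 'VOB'), (2, 'WP')]
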
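The (head, relation) pairs are easier to read once each head index is mapped back to a word. The sketch below does that on literal copies of the segmentation and parser outputs shown above, so it runs without pyltp; the names `tokens`, `arc_pairs`, and `readable_arcs` are assumptions introduced here, and the code relies on head indices being 1-based with 0 standing for ROOT.

```python
def readable_arcs(tokens, arc_pairs):
    """Render each dependency as 'dependent --relation--> head'."""
    lines = []
    for word, (head, relation) in zip(tokens, arc_pairs):
        # Head indices are 1-based; 0 is the artificial ROOT node.
        head_word = "ROOT" if head == 0 else tokens[head - 1]
        lines.append(f"{word} --{relation}--> {head_word}")
    return lines


# Literal copies of the outputs printed above
tokens = ["我", "愛", "自然", "語言", "處理", "技術", "！"]
arc_pairs = [(2, 'SBV'), (0, 'HED'), (4, 'ATT'), (5, 'FOB'),
             (2, 'VOB'), (5, 'VOB'), (2, 'WP')]

for line in readable_arcs(tokens, arc_pairs):
    print(line)
```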
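from pyltp import SementicRoleLabeller  # the class name is spelled this way in pyltp
labeller = SementicRoleLabeller()
# Load the pretrained SRL model (pisrl_win.model is the Windows build)
labeller.load(r"pyltp-model\ltp_data_v3.4.0\pisrl_win.model")
roles = labeller.label(seg_words, words_pos, arcs)
for role in roles:
    # role.index is the predicate's word index; each argument spans words range.start..range.end
    print(role.index, "".join(
        ["%s:(%d,%d)" % (arg.name, arg.range.start, arg.range.end) for arg in role.arguments]))
labeller.release()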

1 A0:(0,0)A1:(2,5)
4 A1:(5,5)
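Each output line gives a predicate's word index followed by its role spans as (start, end) word indices, both ends inclusive. As a sketch of how to turn those indices back into text, the helper below works on literal copies of the outputs shown above, so it runs without pyltp; the names `tokens`, `frames`, and `spans_to_text` are assumptions introduced here.

```python
def spans_to_text(tokens, frames):
    """Map (predicate index, [(label, start, end)]) to (predicate word, {label: text})."""
    result = []
    for pred_index, arguments in frames:
        # Spans are inclusive on both ends, hence end + 1 in the slice.
        roles = {label: "".join(tokens[start:end + 1]) for label, start, end in arguments}
        result.append((tokens[pred_index], roles))
    return result


# Literal copies of the segmentation and SRL outputs printed above
tokens = ["我", "愛", "自然", "語言", "處理", "技術", "！"]
frames = [
    (1, [("A0", 0, 0), ("A1", 2, 5)]),
    (4, [("A1", 5, 5)]),
]

for predicate, roles in spans_to_text(tokens, frames):
    print(predicate, roles)
```

Read this way, the first frame says that for the predicate 愛, A0 is 我 and A1 is 自然語言處理技術.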


The code has been uploaded: https://github.com/yuquanle/StudyForNLP/blob/master/NLPbasic/SRL.ipynb

References:

1. Statistical Natural Language Processing (統計自然語言處理)

2. Chinese Information Processing Report 2016 (中文資訊處理報告-2016)
