Most real-world data is "unstructured" (e.g., plain txt documents) or "semi-structured" (e.g., HTML), and extracting useful information from such data requires special techniques. If all data were "structured" (e.g., XML or a relational database), no extraction step would be needed: you could retrieve whatever information you want via the metadata.
So let's discuss how to implement text information extraction with NLTK.
First, the raw text of the document is split into sentences using a sentence segmenter, and each sentence is further subdivided into words using a tokenizer. Next, each sentence is tagged with part-of-speech tags, which will prove very helpful in the next step, named entity recognition. In this step, we search for mentions of potentially interesting entities in each sentence. Finally, we use relation recognition to search for likely relations between different entities in the text.
As described, the information extraction process has four steps: tokenization, part-of-speech tagging, named entity recognition, and relation recognition. Tokenization and POS tagging were covered earlier, so let's look in detail at how named entity recognition is implemented.
Chunking
The basic technique we will use for entity recognition is chunking, which segments and labels multitoken sequences.
In other words, the most basic technique for entity recognition is chunking, which you can think of as grouping multiple tokens into phrases.
Noun Phrase Chunking
Let's take noun phrase chunking (NP-chunking) as the first example.
One of the most useful sources of information for NP-chunking is part-of-speech tags.
>>> import nltk
>>> sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
...             ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
>>> grammar = "NP: {<DT>?<JJ>*<NN>}"  # tag pattern: determiner (0 or 1), adjectives (any number), noun (1)
>>> cp = nltk.RegexpParser(grammar)
>>> result = cp.parse(sentence)
>>> print(result)
(S
  (NP the/DT little/JJ yellow/JJ dog/NN)  # NP-chunk: the little yellow dog
  barked/VBD
  at/IN
  (NP the/DT cat/NN))                     # NP-chunk: the cat
The method above uses a regular expression over POS tags (a tag pattern) to find NP-chunks.
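Since `parse` returns an `nltk.Tree`, the matched chunks can also be pulled out programmatically rather than just printed. A minimal sketch, reusing the tagged sentence from above:

```python
import nltk

sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
            ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
cp = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN>}")
tree = cp.parse(sentence)

# Walk the tree, keep only the NP subtrees, and join their leaves
# back into plain-text phrases.
phrases = [" ".join(word for word, tag in subtree.leaves())
           for subtree in tree.subtrees(filter=lambda t: t.label() == "NP")]
print(phrases)  # ['the little yellow dog', 'the cat']
```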
Here is another example: a grammar can contain multiple tag patterns and can therefore get more complex.
>>> grammar = r"""
... NP: {<DT|PP\$>?<JJ>*<NN>}  # chunk determiner/possessive, adjectives and noun
...     {<NNP>+}               # chunk sequences of proper nouns
... """
>>> cp = nltk.RegexpParser(grammar)
>>> sentence = [("Rapunzel", "NNP"), ("let", "VBD"), ("down", "RP"),
...             ("her", "PP$"), ("long", "JJ"), ("golden", "JJ"), ("hair", "NN")]
>>> print(cp.parse(sentence))
(S
  (NP Rapunzel/NNP)                        # NP-chunk: Rapunzel
  let/VBD
  down/RP
  (NP her/PP$ long/JJ golden/JJ hair/NN))  # NP-chunk: her long golden hair
The following example shows how to find matching POS combinations in a corpus:
>>> cp = nltk.RegexpParser('CHUNK: {<V.*> <TO> <V.*>}')  # find 'verb to verb' combinations
>>> brown = nltk.corpus.brown
>>> for sent in brown.tagged_sents():
...     tree = cp.parse(sent)
...     for subtree in tree.subtrees():
...         if subtree.label() == 'CHUNK': print(subtree)
...
(CHUNK combined/VBN to/TO achieve/VB)
(CHUNK continue/VB to/TO place/VB)
(CHUNK serve/VB to/TO protect/VB)
(CHUNK wanted/VBD to/TO wait/VB)
(CHUNK allowed/VBN to/TO place/VB)
(CHUNK expected/VBN to/TO become/VB)
This article is excerpted from cnblogs (博客园); originally published 2011-07-04.