Most real-world data is "unstructured" (e.g., plain txt documents) or "semi-structured" (e.g., HTML), and extracting useful information from such data requires special techniques. If all data were "structured" (e.g., XML or a relational database), no extraction step would be needed: you could retrieve whatever information you want via the metadata.
So let's discuss how to implement text information extraction with NLTK.
First, the raw text of the document is split into sentences using a sentence segmenter, and each sentence is further subdivided into words using a tokenizer. Next, each sentence is tagged with part-of-speech tags, which will prove very helpful in the next step, named entity recognition. In this step, we search for mentions of potentially interesting entities in each sentence. Finally, we use relation recognition to search for likely relations between different entities in the text.
As described, the information extraction process has four steps: tokenization, part-of-speech tagging, named entity recognition, and relation recognition. Tokenization and POS tagging were covered earlier, so let's look in detail at how named entity recognition is implemented.
Chunking
The basic technique we will use for entity recognition is chunking, which segments and labels multitoken sequences.
In other words, the most basic technique for entity recognition is chunking, which you can think of as grouping multiple tokens into phrases.
Noun Phrase Chunking
Let's take noun phrase chunking (NP-chunking) as the first example.
One of the most useful sources of information for NP-chunking is part-of-speech tags.
>>> import nltk
>>> sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
...             ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
>>> grammar = "NP: {<DT>?<JJ>*<NN>}"  # tag pattern: determiner (0 or 1), adjectives (any number), noun (1)
>>> cp = nltk.RegexpParser(grammar)
>>> result = cp.parse(sentence)
>>> print(result)
(S
  (NP the/DT little/JJ yellow/JJ dog/NN)  # NP-chunk: the little yellow dog
  barked/VBD
  at/IN
  (NP the/DT cat/NN))                     # NP-chunk: the cat
The method above uses a regular expression over POS tags (a tag pattern) to find NP-chunks.
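Since `parse` returns an `nltk.Tree`, the matched chunks can also be pulled out programmatically rather than just printed. A minimal sketch, reusing the tagged sentence from above:

```python
import nltk

sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
            ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
cp = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN>}")
tree = cp.parse(sentence)

# Walk the tree, keep only the NP subtrees, and join their leaves
# back into plain-text phrases.
phrases = [" ".join(word for word, tag in subtree.leaves())
           for subtree in tree.subtrees(filter=lambda t: t.label() == "NP")]
print(phrases)  # ['the little yellow dog', 'the cat']
```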
Here is another example: a grammar can contain multiple tag patterns and can therefore get more complex.
>>> grammar = r"""
... NP: {<DT|PP\$>?<JJ>*<NN>}  # chunk determiner/possessive, adjectives and noun
...     {<NNP>+}               # chunk sequences of proper nouns
... """
>>> cp = nltk.RegexpParser(grammar)
>>> sentence = [("Rapunzel", "NNP"), ("let", "VBD"), ("down", "RP"),
...             ("her", "PP$"), ("long", "JJ"), ("golden", "JJ"), ("hair", "NN")]
>>> print(cp.parse(sentence))
(S
  (NP Rapunzel/NNP)                        # NP-chunk: Rapunzel
  let/VBD
  down/RP
  (NP her/PP$ long/JJ golden/JJ hair/NN))  # NP-chunk: her long golden hair
The following example shows how to find matching POS combinations in a corpus:
>>> cp = nltk.RegexpParser('CHUNK: {<V.*> <TO> <V.*>}')  # find 'verb to verb' combinations
>>> brown = nltk.corpus.brown
>>> for sent in brown.tagged_sents():
...     tree = cp.parse(sent)
...     for subtree in tree.subtrees():
...         if subtree.label() == 'CHUNK': print(subtree)
...
(CHUNK combined/VBN to/TO achieve/VB)
(CHUNK continue/VB to/TO place/VB)
(CHUNK serve/VB to/TO protect/VB)
(CHUNK wanted/VBD to/TO wait/VB)
(CHUNK allowed/VBN to/TO place/VB)
(CHUNK expected/VBN to/TO become/VB)
This article is excerpted from cnblogs (博客园); originally published 2011-07-04.