Most real-world data is unstructured (e.g., ordinary txt documents) or semi-structured (e.g., HTML), and extracting useful information from it requires dedicated techniques. If all data were structured (e.g., XML or a relational database), no special extraction would be needed: the metadata would let you retrieve whatever information you want directly.
So let's discuss how to implement text information extraction with NLTK.
First, the raw text of the document is split into sentences using a sentence segmenter, and each sentence is further subdivided into words using a tokenizer. Next, each sentence is tagged with part-of-speech tags, which will prove very helpful in the next step, named entity recognition. In this step, we search for mentions of potentially interesting entities in each sentence. Finally, we use relation recognition to search for likely relations between different entities in the text.
As described here, the information extraction process consists of four steps: tokenization, part-of-speech tagging, named entity recognition, and relation recognition. Tokenization and POS tagging were covered earlier, so let's look in detail at how named entity recognition is implemented.
Chunking
The basic technique we will use for entity recognition is chunking, which segments and labels multi-token sequences.
In other words, chunking is the most basic technique for entity recognition: it groups multiple tokens into phrases.
Noun Phrase Chunking
Let's start with chunking of noun phrases, i.e., NP-chunking, as an example.
One of the most useful sources of information for NP-chunking is part-of-speech tags.
>>> sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"), ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
>>> grammar = "NP: {<DT>?<JJ>*<NN>}"  # tag pattern: an optional determiner, any number of adjectives, then one noun
>>> cp = nltk.RegexpParser(grammar)
>>> result = cp.parse(sentence)
>>> print(result)
(S
  (NP the/DT little/JJ yellow/JJ dog/NN)  # NP-chunk: the little yellow dog
  barked/VBD
  at/IN
  (NP the/DT cat/NN))  # NP-chunk: the cat
The method above uses regular expressions over tags (tag patterns) to find NP-chunks.
Here is another example; a grammar can contain multiple tag patterns and can be made more complex:
grammar = r"""
  NP: {<DT|PP\$>?<JJ>*<NN>}   # chunk determiner/possessive, adjectives and nouns
      {<NNP>+}                # chunk sequences of proper nouns
"""
cp = nltk.RegexpParser(grammar)
sentence = [("Rapunzel", "NNP"), ("let", "VBD"), ("down", "RP"),
            ("her", "PP$"), ("long", "JJ"), ("golden", "JJ"), ("hair", "NN")]
>>> print(cp.parse(sentence))
(S
  (NP Rapunzel/NNP)  # NP-chunk: Rapunzel
  let/VBD
  down/RP
  (NP her/PP$ long/JJ golden/JJ hair/NN))  # NP-chunk: her long golden hair
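To use the chunker's result downstream, you often want the chunked phrases as plain strings rather than a tree. A small helper beyond the original post (the name `chunk_phrases` is illustrative), built on the `Tree` API (`subtrees()`, `label()`, `leaves()`):

```python
import nltk

grammar = "NP: {<DT>?<JJ>*<NN>}"
cp = nltk.RegexpParser(grammar)
sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
            ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
tree = cp.parse(sentence)

def chunk_phrases(tree, label="NP"):
    """Join the words of every subtree carrying the given chunk label."""
    return [" ".join(word for word, tag in subtree.leaves())
            for subtree in tree.subtrees()
            if subtree.label() == label]

print(chunk_phrases(tree))  # → ['the little yellow dog', 'the cat']
```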
The following example shows how to find matching part-of-speech combinations in a corpus:
>>> cp = nltk.RegexpParser('CHUNK: {<V.*> <TO> <V.*>}')  # find 'verb to verb' combinations
>>> brown = nltk.corpus.brown
>>> for sent in brown.tagged_sents():
...     tree = cp.parse(sent)
...     for subtree in tree.subtrees():
...         if subtree.label() == 'CHUNK': print(subtree)
...
(CHUNK combined/VBN to/TO achieve/VB)
(CHUNK continue/VB to/TO place/VB)
(CHUNK serve/VB to/TO protect/VB)
(CHUNK wanted/VBD to/TO wait/VB)
(CHUNK allowed/VBN to/TO place/VB)
(CHUNK expected/VBN to/TO become/VB)
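Besides chunk patterns like the ones above, a `RegexpParser` grammar can also carve tokens out of a chunk with a chink pattern, written `}...{`. A sketch going beyond the original post: chunk the entire sentence first, then chink out the verbs and prepositions, which yields the same NPs as the first example:

```python
import nltk

# Chink grammar: first chunk everything, then remove (chink) sequences
# of VBD and IN tokens, leaving the noun phrases behind.
grammar = r"""
  NP:
    {<.*>+}        # chunk everything
    }<VBD|IN>{     # chink sequences of VBD and IN
"""
cp = nltk.RegexpParser(grammar)
sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
            ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
print(cp.parse(sentence))
# (S
#   (NP the/DT little/JJ yellow/JJ dog/NN)
#   barked/VBD
#   at/IN
#   (NP the/DT cat/NN))
```

Chunking and chinking are complementary: whichever is easier to state (what belongs inside, or what must stay outside) can define the same phrases.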
This article is excerpted from 博客园 (cnblogs); original publication date: 2011-07-04.