
Python English Tokenization with spaCy (vs. Python NLTK)

spaCy is a Python natural language processing toolkit that appeared in mid-2014 and bills itself as "Industrial-Strength Natural Language Processing in Python", that is, an industrial-strength Python NLP toolkit. spaCy makes heavy use of Cython to speed up its core modules, which sets it apart from the more academically oriented Python NLTK and gives it real value for industrial applications.

Installing and building spaCy is fairly straightforward. On Ubuntu, you can install it directly with pip:

sudo apt-get install build-essential python-dev git

sudo pip install -U spacy

After installation, you still need to download the model data. Taking the English models as an example, you can download everything with the "all" argument:

sudo python -m spacy.en.download all

Alternatively, you can download the models and the GloVe-trained word vectors separately:

# This downloads the models for the English tokenizer, POS tagging, syntactic parsing, and named entity recognition

python -m spacy.en.download parser

# This downloads the word vectors trained with GloVe

python -m spacy.en.download glove

The downloaded data lives under the data directory of the spaCy installation. On my Ubuntu machine, for example:

/usr/local/lib/python2.7/dist-packages/spacy/data$ du -sh *
776M	en-1.1.0
774M	en_glove_cc_300_1m_vectors-1.0.0

Inside the English model directory:

/usr/local/lib/python2.7/dist-packages/spacy/data/en-1.1.0$ du -sh *
424M	deps
8.0K	meta.json
35M	ner
12M	pos
84K	tokenizer
300M	vocab
6.3M	wordnet

You can check whether the model data installed successfully with the following command:

$ python -c "import spacy; spacy.load('en'); print('OK')"

OK

You can also run the test suite with pytest:

# First, find spaCy's installation path:

python -c "import os; import spacy; print(os.path.dirname(spacy.__file__))"

/usr/local/lib/python2.7/dist-packages/spacy

# Then install pytest:

sudo python -m pip install -U pytest

# Finally, run the tests:

python -m pytest /usr/local/lib/python2.7/dist-packages/spacy --vectors --model --slow

============================= test session starts ==============================

platform linux2 -- Python 2.7.12, pytest-3.0.4, py-1.4.31, pluggy-0.4.0

rootdir: /usr/local/lib/python2.7/dist-packages/spacy, inifile:

collected 318 items

../../usr/local/lib/python2.7/dist-packages/spacy/tests/test_matcher.py ........

../../usr/local/lib/python2.7/dist-packages/spacy/tests/matcher/test_entity_id.py ....

../../usr/local/lib/python2.7/dist-packages/spacy/tests/matcher/test_matcher_bugfixes.py .....

......

../../usr/local/lib/python2.7/dist-packages/spacy/tests/vocab/test_vocab.py .......Xx

../../usr/local/lib/python2.7/dist-packages/spacy/tests/website/test_api.py x...............

../../usr/local/lib/python2.7/dist-packages/spacy/tests/website/test_home.py ............

============== 310 passed, 5 xfailed, 3 xpassed in 53.95 seconds ===============

Now we can quickly try out spaCy's main features, using English data as the example. spaCy currently supports mainly English and German; support for other languages is being added gradually:

$ ipython

Python 2.7.12 (default, Jul 1 2016, 15:12:24)

Type "copyright", "credits" or "license" for more information.

IPython 2.4.1 -- An enhanced Interactive Python.

? -> Introduction and overview of IPython's features.

%quickref -> Quick reference.

help -> Python's own help system.

object? -> Details about 'object', use 'object??' for extra details.

In [1]: import spacy

# Load the English model data; this takes a moment

In [2]: nlp = spacy.load('en')

Word tokenization (spaCy 1.2 also added a Chinese tokenization interface based on the Jieba Chinese word segmenter; see the sketch after the English example below):

In [3]: test_doc = nlp(u"it's word tokenize test for spacy")

In [4]: print(test_doc)

it's word tokenize test for spacy

In [5]: for token in test_doc:
   ...:     print(token)
   ...:

it

's

word

tokenize

test

for

spacy
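As a side note on the Chinese interface mentioned above, here is a minimal sketch of Jieba-backed Chinese tokenization. It assumes spaCy 1.2+ with its alpha Chinese support plus the jieba package installed, and it assumes the spacy.zh.Chinese entry point exposed in that release; treat the import path and the example sentence as illustrative rather than official usage:

# -*- coding: utf-8 -*-
# Hedged sketch: assumes spaCy 1.2+ alpha Chinese support and the jieba package.
from __future__ import print_function
from spacy.zh import Chinese  # assumed entry point for the Jieba-backed tokenizer

nlp_zh = Chinese()
doc = nlp_zh(u"这是一个中文分词测试")  # "this is a Chinese word segmentation test"
for token in doc:
    print(token)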

English sentence segmentation:

In [6]: test_doc = nlp(u'Natural language processing (NLP) deals with the application of computational models to text or speech data. Application areas within NLP include automatic (machine) translation between languages; dialogue systems, which allow a human to interact with a machine using natural language; and information extraction, where the goal is to transform unstructured text into structured (database) representations that can be searched and browsed in flexible ways. NLP technologies are having a dramatic impact on the way people interact with computers, on the way people interact with each other through the use of language, and on the way people access the vast amount of linguistic data now in electronic form. From a scientific viewpoint, NLP involves fundamental questions of how to structure formal models (for example statistical models) of natural language phenomena, and of how to design algorithms that implement these models.')

In [7]: for sent in test_doc.sents:
   ...:     print(sent)
   ...:

Natural language processing (NLP) deals with the application of computational models to text or speech data.

Application areas within NLP include automatic (machine) translation between languages; dialogue systems, which allow a human to interact with a machine using natural language; and information extraction, where the goal is to transform unstructured text into structured (database) representations that can be searched and browsed in flexible ways.

NLP technologies are having a dramatic impact on the way people interact with computers, on the way people interact with each other through the use of language, and on the way people access the vast amount of linguistic data now in electronic form.

From a scientific viewpoint, NLP involves fundamental questions of how to structure formal models (for example statistical models) of natural language phenomena, and of how to design algorithms that implement these models.

Lemmatization:

In [8]: test_doc = nlp(u"you are best. it is lemmatize test for spacy. I love these books")

In [9]: for token in test_doc:
   ...:     print(token, token.lemma_, token.lemma)
   ...:

(you, u'you', 472)

(are, u'be', 488)

(best, u'good', 556)

(., u'.', 419)

(it, u'it', 473)

(is, u'be', 488)

(lemmatize, u'lemmatize', 1510296)

(test, u'test', 1351)

(for, u'for', 480)

(spacy, u'spacy', 173783)

(., u'.', 419)

(I, u'i', 570)

(love, u'love', 644)

(these, u'these', 642)

(books, u'book', 1011)

Part-of-speech tagging (POS tagging):

In [10]: for token in test_doc:
   ....:     print(token, token.pos_, token.pos)
   ....:

(you, u'PRON', 92)

(are, u'VERB', 97)

(best, u'ADJ', 82)

(., u'PUNCT', 94)

(it, u'PRON', 92)

(is, u'VERB', 97)

(lemmatize, u'ADJ', 82)

(test, u'NOUN', 89)

(for, u'ADP', 83)

(spacy, u'NOUN', 89)

(., u'PUNCT', 94)

(I, u'PRON', 92)

(love, u'VERB', 97)

(these, u'DET', 87)

(books, u'NOUN', 89)

Named entity recognition (NER):

In [11]: test_doc = nlp(u"Rami Eid is studying at Stony Brook University in New York")

In [12]: for ent in test_doc.ents:
   ....:     print(ent, ent.label_, ent.label)
   ....:

(Rami Eid, u'PERSON', 346)

(Stony Brook University, u'ORG', 349)

(New York, u'GPE', 350)

Noun phrase (noun chunk) extraction:

In [13]: test_doc = nlp(u'Natural language processing (NLP) deals with the application of computational models to text or speech data. Application areas within NLP include automatic (machine) translation between languages; dialogue systems, which allow a human to interact with a machine using natural language; and information extraction, where the goal is to transform unstructured text into structured (database) representations that can be searched and browsed in flexible ways. NLP technologies are having a dramatic impact on the way people interact with computers, on the way people interact with each other through the use of language, and on the way people access the vast amount of linguistic data now in electronic form. From a scientific viewpoint, NLP involves fundamental questions of how to structure formal models (for example statistical models) of natural language phenomena, and of how to design algorithms that implement these models.')

In [14]: for np in test_doc.noun_chunks:
   ....:     print(np)
   ....:

Natural language processing

Natural language processing (NLP) deals

the application

computational models

text

speech

data

Application areas

NLP

automatic (machine) translation

languages

dialogue systems

a human

a machine

natural language

information extraction

the goal

unstructured text

structured (database) representations

flexible ways

NLP technologies

a dramatic impact

the way

people

computers

the way

people

the use

language

the way

people

the vast amount

linguistic data

electronic form

a scientific viewpoint

NLP

fundamental questions

formal models

example

natural language phenomena

algorithms

these models

Computing the similarity of two words from their word vectors:

In [15]: test_doc = nlp(u"Apples and oranges are similar. Boots and hippos aren't.")

In [16]: apples = test_doc[0]

In [17]: print(apples)

Apples

In [18]: oranges = test_doc[2]

In [19]: print(oranges)

oranges

In [20]: boots = test_doc[6]

In [21]: print(boots)

Boots

In [22]: hippos = test_doc[8]

In [23]: print(hippos)

hippos

In [24]: apples.similarity(oranges)

Out[24]: 0.77809414836023805

In [25]: boots.similarity(hippos)

Out[25]: 0.038474555379008429
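The scores above come from the GloVe vectors downloaded earlier. To peek at the vectors themselves, a minimal sketch along these lines should work; token.has_vector, token.vector and token.vector_norm are standard token attributes, though the printed dimensionality and values depend on which vector package you installed:

# Hedged sketch: inspect the word vectors behind the similarity scores.
from __future__ import print_function
import spacy

nlp = spacy.load('en')  # assumes the GloVe vectors downloaded above are installed
doc = nlp(u"Apples and oranges are similar.")
apples = doc[0]

print(apples.has_vector)    # True if a vector was found for this token
print(apples.vector.shape)  # e.g. (300,) for the 300-dimensional GloVe vectors
print(apples.vector_norm)   # L2 norm used when computing similarity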

Of course, spaCy also includes syntactic (dependency) parsing and related functionality; see the sketch below. It is also worth noting that, starting with version 1.0, spaCy added support for deep learning tools such as TensorFlow and Keras. For details, see the sentiment analysis example in the official documentation: Hooking a deep learning model into spaCy.
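Since dependency parsing is only mentioned in passing, here is a minimal sketch of how to read the parse out of a Doc; token.dep_, token.head and token.children are the usual attributes for navigating the dependency tree, though the exact labels you see depend on the model version:

# Hedged sketch: walk the dependency parse produced by the parser model.
from __future__ import print_function
import spacy

nlp = spacy.load('en')  # same English model used throughout this post
doc = nlp(u"spaCy parses sentences into dependency trees.")
for token in doc:
    # token.dep_ is the dependency label, token.head is the governing token
    print(token.text, token.dep_, token.head.text,
          [child.text for child in token.children])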