
Natural Language Processing With Python (3)

Chapter 4 covers some Python basics:
(1) A list is typically a sequence of objects all of the same type, of arbitrary length. Mutable.
(2) A tuple is typically a collection of objects of different types, of fixed length. Immutable.
(3) A generator expression can be much faster (and more memory-efficient) than building the full list with a list comprehension.
(4) Looking up a key in a dict is much faster than searching through a list.
(5) Looking up a dict by value is so slow that we should invert the dict and search by key instead. But notice that dict keys must be immutable types, such as strings and tuples.
(6) Be careful about the difference between sort() (in place, returns None) and sorted() (returns a new list). Also notice that both take more than one argument, e.g. key and reverse.
(7) The LGB rule: names are resolved in Local, then Global, then Built-in scope.
(8) Function for checking type: isinstance(). Use an assert statement to raise an error when the type is wrong.
(9) Docstring style: a one-line summary, a more detailed explanation, a doctest example, and epytext markup.
(10) Lambda expressions and the use of higher-order functions.
(11) Multiple-argument functions (the use of *args and **kwargs).
(12) The if __name__ == '__main__' guard for unit tests, and __file__ to locate your file.
(13) A leading underscore (_x) and __all__ hide variables and functions from "from module import *", but not from "import module".
(14) Some traps: "%s %s" % "aaa", "ggg" is wrong without parentheses around the tuple of values; don't use a mutable object as the default value of a parameter, because the same object is reused across calls.
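Points (1)–(6) above can be sketched in a few lines of plain Python (the variable names are illustrative, not from the book):

```python
# (1)/(2) Lists are mutable sequences; tuples are immutable, fixed-length records.
words = ['the', 'cat', 'sat']        # list: same-typed items, arbitrary length
record = ('cat', 'NN', 3)            # tuple: mixed types, fixed length
words.append('down')                 # fine: lists are mutable
# record[0] = 'dog'                  # would raise TypeError: tuples are immutable

# (3) A generator expression avoids materializing an intermediate list.
total = sum(len(w) for w in words)   # no temporary list is built

# (4) Dict lookup by key is a hash lookup; searching a list is linear.
freq = {'the': 2, 'cat': 1, 'sat': 1}
has_cat = 'cat' in freq              # fast membership test on keys

# (5) Searching a dict by value is slow, so invert it and search by key.
#     Dict keys must be immutable (strings, tuples, ...).
inverted = {count: word for word, count in freq.items()}

# (6) sort() mutates in place and returns None; sorted() returns a new list.
nums = [3, 1, 2]
fresh = sorted(nums, reverse=True)   # nums unchanged, fresh is [3, 2, 1]
nums.sort()                          # nums is now [1, 2, 3]
```

Note that in the inverted dict, words sharing the same count collide and the last one wins, which is one reason value-to-key inversion needs care.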
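Points (8), (11), and (14) can also be shown concretely; the function names below are made up for illustration:

```python
# (11) *args collects extra positional arguments; **kwargs collects keyword ones.
def describe(word, *args, **kwargs):
    # (8) Check the type and fail loudly if it is wrong.
    assert isinstance(word, str), "word must be a string"
    return (word, args, kwargs)

result = describe('cat', 'NN', 'VB', lang='en')

# (14) Trap 1: without parentheses, "%s %s" % "aaa", "ggg" parses as
# (("%s %s" % "aaa"), "ggg") and raises TypeError, since only one value
# is supplied for two %s slots. The fix is to pass a tuple:
right = "%s %s" % ("aaa", "ggg")

# (14) Trap 2: a mutable default value is created once and shared by all calls.
def tag_bad(word, seen=[]):          # the SAME list object on every call
    seen.append(word)
    return seen

def tag_good(word, seen=None):       # the idiomatic fix
    if seen is None:
        seen = []
    seen.append(word)
    return seen

tag_bad('cat')
accumulated = tag_bad('dog')         # surprise: contains 'cat' too
safe = tag_good('dog')               # always starts from a fresh list
```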

Chapter 5 covers POS tagging. Some functions useful for tagging:
(1) nltk.pos_tag()
(2) nltk.corpus.brown.tagged_words() / tagged_sents(), using unsimplified tags
(3) nltk.defaultdict(), or collections.defaultdict in plain Python
(4) nltk.ConditionalFreqDist(); nltk.Index() is a special kind of dict
(5) nltk.RegexpTagger()
(6) nltk.UnigramTagger(), nltk.BigramTagger(), nltk.NgramTagger()
(7) tagger.evaluate()
(8) open(), pickle.dump(), close(), and pickle.load() for saving and loading a trained tagger
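The lookup-tagger idea from points (2)–(4) can be mimicked without nltk installed. The sketch below uses a toy tagged corpus standing in for brown.tagged_words() (the miniature data is an assumption for illustration; real code would use the Brown corpus):

```python
from collections import Counter, defaultdict

# Toy stand-in for nltk.corpus.brown.tagged_words().
tagged_words = [('the', 'AT'), ('cat', 'NN'), ('the', 'AT'),
                ('sat', 'VBD'), ('cat', 'NN'), ('run', 'VB')]

# Count tags per word with a defaultdict of Counters, the same role
# a ConditionalFreqDist plays in nltk.
tag_counts = defaultdict(Counter)
for word, tag in tagged_words:
    tag_counts[word][tag] += 1

# The lookup tagger is just a dict mapping each word to its most frequent tag.
model = {word: counts.most_common(1)[0][0]
         for word, counts in tag_counts.items()}

def lookup_tag(word, default='NN'):
    """Tag a word from the model, backing off to a default tag."""
    return model.get(word, default)
```

So lookup_tag('the') returns 'AT' from the model, while an unseen word falls back to the default tag.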
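Point (8) refers to persisting a trained tagger with pickle. A minimal sketch, using a plain dict in place of an NLTK tagger object (the file path is illustrative):

```python
import os
import pickle
import tempfile

# A trained model to persist; in NLTK this would be a tagger object.
model = {'the': 'AT', 'cat': 'NN'}

# Save: open a file in binary write mode, dump, close.
path = os.path.join(tempfile.gettempdir(), 'tagger.pkl')
output = open(path, 'wb')
pickle.dump(model, output)
output.close()

# Load: open in binary read mode, load, close.
infile = open(path, 'rb')
restored = pickle.load(infile)
infile.close()
```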

Some skills for NLP:
(1) some idiomatic uses of nltk.defaultdict()
(2) a lookup tagger, which uses a dict as the model of a real tagger
(3) an n-gram tagger conditions on the current word plus the tags of the n-1 preceding tokens
(4) use backoff and cutoff to solve the sparse-data problem; a typical chain: t0 = nltk.DefaultTagger(..); t1 = nltk.UnigramTagger(.., backoff=t0); t2 = nltk.BigramTagger(.., backoff=t1); t2.evaluate(..)
(5) Separate the training and testing data, e.g. 90% for training and 10% for testing. Use brown.tagged_sents() for training and testing (which includes evaluating), and some other corpus's sents() for the real application.
(6) Tag at the sentence level instead of the word level, e.g. using the sents() function.
(7) Transformation-based tagging beats n-gram tagging for two reasons: smaller model size and a better model. A useful example is the Brill tagger.
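The backoff chain and train/test evaluation in points (4) and (5) can be mimicked in plain Python; the classes below are simplified stand-ins for nltk.DefaultTagger and nltk.UnigramTagger, and the toy corpus is an assumption for illustration (real code would slice brown.tagged_sents() 90/10):

```python
from collections import Counter, defaultdict

class DefaultTagger:
    """Tag every word with the same tag: the backoff of last resort."""
    def __init__(self, tag):
        self.tag = tag
    def tag_word(self, word):
        return self.tag

class UnigramTagger:
    """Tag each word with its most frequent training tag, else back off."""
    def __init__(self, train_sents, backoff=None):
        counts = defaultdict(Counter)
        for sent in train_sents:
            for word, tag in sent:
                counts[word][tag] += 1
        self.model = {w: c.most_common(1)[0][0] for w, c in counts.items()}
        self.backoff = backoff
    def tag_word(self, word):
        if word in self.model:
            return self.model[word]
        return self.backoff.tag_word(word)
    def evaluate(self, gold_sents):
        """Fraction of gold (word, tag) pairs this tagger gets right."""
        pairs = [(w, t) for sent in gold_sents for w, t in sent]
        correct = sum(1 for w, t in pairs if self.tag_word(w) == t)
        return correct / len(pairs)

# Toy split standing in for 90% training / 10% testing.
train_sents = [[('the', 'AT'), ('cat', 'NN')],
               [('the', 'AT'), ('sat', 'VBD')]]
test_sents = [[('the', 'AT'), ('dog', 'NN')]]

t0 = DefaultTagger('NN')
t1 = UnigramTagger(train_sents, backoff=t0)
accuracy = t1.evaluate(test_sents)   # 'the' from the model, 'dog' via backoff
```

The design point mirrors NLTK's: each tagger handles what it knows and delegates the rest down the chain, so sparse training data degrades gracefully instead of failing.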