The dataset is 20_newsgroups, split 7:3 into a training set and a test set.
The overall workflow is as follows:
Every document in the dataset is represented as a TF-IDF vector: the training set's TF-IDF vectors train the model, the test set's TF-IDF vectors are then classified by it, and finally the test accuracy is computed.
Initialization
# Paths of the training set and the test set.
trainPath = "hdfs:///user/yy/20_newsgroups/train/*"
testPath = "hdfs:///user/yy/20_newsgroups/test/*"
# For classification the newsgroup topics must be encoded as numbers;
# labelsDict maps each topic to its numeric label
labelsDict = {'alt.atheism': 0, 'comp.graphics': 1, 'comp.os.ms-windows.misc': 2,
              'comp.sys.ibm.pc.hardware': 3, 'comp.sys.mac.hardware': 4, 'comp.windows.x': 5,
              'misc.forsale': 6, 'rec.autos': 7, 'rec.motorcycles': 8, 'rec.sport.baseball': 9,
              'rec.sport.hockey': 10, 'sci.crypt': 11, 'sci.electronics': 12, 'sci.med': 13,
              'sci.space': 14, 'soc.religion.christian': 15, 'talk.politics.guns': 16,
              'talk.politics.mideast': 17, 'talk.politics.misc': 18, 'talk.religion.misc': 19}
# keyTolabels maps the numeric labels back to topics, mainly for readability
keyTolabels = {0: 'alt.atheism', 1: 'comp.graphics', 2: 'comp.os.ms-windows.misc',
               3: 'comp.sys.ibm.pc.hardware', 4: 'comp.sys.mac.hardware', 5: 'comp.windows.x',
               6: 'misc.forsale', 7: 'rec.autos', 8: 'rec.motorcycles', 9: 'rec.sport.baseball',
               10: 'rec.sport.hockey', 11: 'sci.crypt', 12: 'sci.electronics', 13: 'sci.med',
               14: 'sci.space', 15: 'soc.religion.christian', 16: 'talk.politics.guns',
               17: 'talk.politics.mideast', 18: 'talk.politics.misc', 19: 'talk.religion.misc'}
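Since the numeric labels are just positions in a fixed topic list, both lookup tables can also be generated instead of typed out by hand. A minimal sketch (the list simply repeats the 20 topics above):

```python
# Build both lookup tables from one ordered list of the 20 topics,
# assigning each topic its list index as the numeric label.
topics = ['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc',
          'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x',
          'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball',
          'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med',
          'sci.space', 'soc.religion.christian', 'talk.politics.guns',
          'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']
labelsDict = {topic: i for i, topic in enumerate(topics)}
keyTolabels = {i: topic for i, topic in enumerate(topics)}
```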
Preprocessing function
This function tokenizes each document, removes stopwords, stems the tokens, and replaces words with synonyms. It needs the third-party natural language processing library nltk, which of course must be installed on every node. The basic preprocessing steps are as follows:
The synonym replacement is deliberately naive: it just takes the first lemma from a word's first synset. This can introduce ambiguity, because a word has a separate synset for each of its senses, and always picking the first synset restricts the word to its first sense only.
def tokenlize(doc):
    import nltk
    from nltk.corpus import stopwords
    from nltk.corpus import wordnet
    pattern = r'\w+'  # keep runs of word characters, i.e. split on everything else
    my_stopwords = stopwords.words('english')
    porter = nltk.PorterStemmer()
    newdoc = []
    for word in nltk.regexp_tokenize(doc, pattern):  # tokenize
        newWord = porter.stem(word.lower())  # stem
        if newWord in my_stopwords:  # drop stopwords
            continue
        tokenSynsets = wordnet.synsets(newWord)
        # synonym replacement: first lemma of the first synset, if there is one
        newdoc.append(newWord if tokenSynsets == [] else tokenSynsets[0].lemma_names()[0])
    return newdoc
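The synonym-collapsing rule used in tokenlize() can be seen in isolation with a toy synset table. The entries below are invented for illustration, not real WordNet data:

```python
# Each word maps to a list of synsets; a synset is a list of lemmas.
# tokenlize() keeps the first lemma of the first synset, so words that
# share that lemma collapse to a single token.
toy_synsets = {
    'car':  [['car', 'auto', 'automobile'], ['cable_car', 'car']],
    'auto': [['car', 'auto', 'automobile']],
}

def collapse(word):
    synsets = toy_synsets.get(word, [])
    return word if synsets == [] else synsets[0][0]
```

Here 'car' and 'auto' both collapse to 'car', while a word with no synsets passes through unchanged.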
Load the training set
trainTokens = sc.wholeTextFiles(trainPath)\
.map(lambda (fileName, doc): doc)\
.map(lambda doc: tokenlize(doc))
æ建åè¯æ å°åå¸è¡¨ï¼tfidf模å
è®ç»éåæµè¯éé½éè¦ä½¿ç¨è¿ä¸ªåå¸è¡¨ï¼å®ç大å°æ ¹æ®ä¸ååè¯çæ°éæ¥è®¾ç½®ï¼ä¸è¬å2çnæ¹ï¼å¨åææ°æ®æ¢ç´¢çæ¶åéè¦è®¡ç®ä¸ä¸ä¸ååè¯çæ°éã
from pyspark.mllib.feature import HashingTF
hasingTF = HashingTF(2 ** 18)  # a power of two larger than the number of distinct tokens
# Map every training document to a TF vector
trainTf = hasingTF.transform(trainTokens)
trainTf.cache()
# æ建IDF模åï¼è®ç»éåæµè¯éé½ç¨å®
from pyspark.mllib.feature import IDF
idf = IDF().fit(trainTf)
# Convert each training TF vector into a TF-IDF vector
trainTfidf = idf.transform(trainTf)
trainTfidf.cache()
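What HashingTF and IDF compute can be sketched in plain Python. This is a simplified stand-in, not MLlib's code: crc32 here replaces the term hashing MLlib actually performs, and the IDF formula log((m + 1) / (df + 1)) over m training documents is the one MLlib documents:

```python
import math
import zlib

NUM_FEATURES = 2 ** 18  # power-of-two dimension, as above

def term_index(term):
    # deterministic stand-in for HashingTF's term hashing
    return zlib.crc32(term.encode('utf8')) % NUM_FEATURES

def hashing_tf(tokens):
    # sparse term-frequency vector: feature index -> count
    vec = {}
    for t in tokens:
        i = term_index(t)
        vec[i] = vec.get(i, 0.0) + 1.0
    return vec

def fit_idf(tf_vectors):
    # document frequency per feature, then idf = log((m + 1) / (df + 1))
    m = len(tf_vectors)
    df = {}
    for vec in tf_vectors:
        for i in vec:
            df[i] = df.get(i, 0) + 1
    return {i: math.log((m + 1.0) / (d + 1.0)) for i, d in df.items()}

docs = [['spark', 'hdfs', 'spark'], ['spark', 'nltk']]
tfs = [hashing_tf(d) for d in docs]
idf = fit_idf(tfs)
tfidfs = [{i: tf * idf[i] for i, tf in vec.items()} for vec in tfs]
```

A term that occurs in every document ('spark' here) gets idf = log(3/3) = 0 and so contributes nothing, which is exactly the down-weighting IDF exists for.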
æ 注è®ç»é
# 为è®ç»éæ 注ï¼æ为æç»å¯ç¨çè®ç»éï¼æ¯ä¸ªæ ·æ¬é½éè¦æ¾å¨LabeledPointé
from pyspark.mllib.regression import LabeledPoint
trainLabels = sc.wholeTextFiles(trainPath)\
.map(lambda (path, doc): path.split('/')[-2])
train = trainLabels.zip(trainTfidf)\
.map(lambda (topic, vector): LabeledPoint(labelsDict[topic], vector))
train.cache()
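The label extraction relies on the HDFS path layout: wholeTextFiles yields (path, content) pairs with full file paths, and the parent directory name is the newsgroup topic. A minimal illustration (the file name 12345 is made up):

```python
# wholeTextFiles returns (path, content) pairs; the topic sits in the
# second-to-last path component: .../train/<topic>/<file>
path = "hdfs:///user/yy/20_newsgroups/train/sci.space/12345"
topic = path.split('/')[-2]
```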
Load the test set
# Load the test set and apply the same preprocessing
testTokens = sc.wholeTextFiles(testPath)\
.map(lambda (fileName, doc): doc)\
.map(lambda doc: tokenlize(doc))
Convert the test set into TF-IDF vectors
# Map every test document to a TF vector, using the same hash mapping hasingTF as the training set
from pyspark.mllib.feature import HashingTF
testTf = hasingTF.transform(testTokens)
# Convert each test TF vector into a TF-IDF vector, using the same IDF model idf as the training set
from pyspark.mllib.feature import IDF
testTfidf = idf.transform(testTf)
æ 注æµè¯é
# 为æµè¯éæ 注ï¼æ为æç»å¯ç¨ä¸æµè¯çæµè¯é
from pyspark.mllib.regression import LabeledPoint
testLabels = sc.wholeTextFiles(testPath)\
.map(lambda (path, doc): path.split('/')[-2])
test = testLabels.zip(testTfidf)\
.map(lambda (topic, vector): LabeledPoint(labelsDict[topic], vector))
testCount = test.count()
Train a naive Bayes model and compute its accuracy
from pyspark.mllib.classification import NaiveBayes
model = NaiveBayes.train(train, 1.0)  # 1.0 is the additive-smoothing parameter
# 计ç®æµè¯çåç¡®ç
predictionAndLabel = test.map(lambda p: (model.predict(p.features), p.label))
accuracy = 1.0 * predictionAndLabel.filter(lambda x: x[0] == x[1]).count() / testCount
print accuracy
0.803298634582
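The accuracy computation is just the fraction of matching (prediction, label) pairs; with made-up pairs:

```python
# four predictions, three of which match their true label
predictionAndLabel = [(0.0, 0.0), (3.0, 3.0), (7.0, 2.0), (1.0, 1.0)]
testCount = len(predictionAndLabel)
accuracy = 1.0 * sum(1 for p, l in predictionAndLabel if p == l) / testCount
```

The 1.0 * factor mirrors the Spark code above, forcing float division under Python 2.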
Train a multiclass logistic regression model and compute its accuracy
from pyspark.mllib.classification import LogisticRegressionWithLBFGS
lrModel = LogisticRegressionWithLBFGS.train(train, iterations=100, numClasses=20)
# 计ç®æµè¯çåç¡®ç
predictionAndLabel = test.map(lambda p: (lrModel.predict(p.features), p.label))
accuracy = 1.0 * predictionAndLabel.filter(lambda x: x[0] == x[1]).count() / testCount
print accuracy
0.812897120454
If you are curious, grab any newsgroup post and classify it yourself for a more direct feel of the model.
aTestText = """
Path: cantaloupe.srv.cs.cmu.edu!rochester!udel!bogus.sura.net!howland.reston.ans.net!ira.uka.de!math.fu-berlin.de!cs.tu-berlin.de!ossip
From: [email protected] (Ossip Kaehr)
Newsgroups: comp.sys.mac.hardware
Subject: SE/30 8bit card does not work with 20mb..
Date: 21 Apr 1993 23:22:22 GMT
Organization: Technical University of Berlin, Germany
Lines: 27
Message-ID: <[email protected]>
NNTP-Posting-Host: trillian.cs.tu-berlin.de
Mime-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: 8bit
Summary: HELP!
Keywords: SE/30 MODE32 System7 PDS
Hello!
I have a SE/30 and a Generation Systems 8bit PDS card for a 17"
screen.
It worked great until I upgraded from 5 to 20 mb ram.
Now with Sys7.1 and MODE32 or 32enabler it does not boot..
a tech support person said the card does not support these 32bit
fixes.
BUT: when pressing the shift key while booting (when the ext. monitor
goes black after having been grey) the system SOMETIMES boots properly!!
and then works ok with the 20mb and full graphics.
WHAT's HAPPENING???
Thanks a lot for any advice!!!
please answer by mail.
Ossip Kaehr
[email protected]
voice: +49.30.6226317
--
__ -------------------------------------------------------------- __
/_/\ Ossip Kaehr Hermannstrasse 32 D-1000 Berlin 44 Germany /\_\
\_\/ Tel. +49.30.6223910 or 6218814 EMail [email protected] \/_/
--------------------------------------------------------------
"""
testTf = hasingTF.transform(tokenlize(aTestText))  # preprocess and map to a TF vector
testTfidf = idf.transform(testTf)  # then convert to a TF-IDF vector
print keyTolabels[lrModel.predict(testTfidf)]  # predict and print the topic
'comp.sys.mac.hardware'
Summary: converting documents into TF-IDF vectors under Spark
# Build the hash table used to map all tokens
from pyspark.mllib.feature import HashingTF
hasingTF = HashingTF(2 ** 18)  # the dimension must exceed the number of distinct tokens
# Map documents to TF vectors; trainTokens and testTokens here are RDDs
trainTf = hasingTF.transform(trainTokens)
testTf = hasingTF.transform(testTokens)
# æ建IDF模åï¼è®ç»éåæµè¯éé½ç¨å®
from pyspark.mllib.feature import IDF
idf = IDF().fit(trainTf)
# Convert the TF vectors into TF-IDF vectors
trainTfidf = idf.transform(trainTf)
testTfidf = idf.transform(testTf)
Related reading
https://en.wikipedia.org/wiki/Tf%E2%80%93idf
https://en.wikipedia.org/wiki/Natural_Language_Toolkit