
Filtering Spam with Naive Bayes

Text splitting with split()

mySent='This book is the best book on Python or M.L. I have ever laid eyes upon.'
ret=mySent.split()
print(ret)
           

Output

['This', 'book', 'is', 'the', 'best', 'book', 'on', 'Python', 'or', 'M.L.', 'I', 'have', 'ever', 'laid', 'eyes', 'upon.']
           

Regular expressions can deal with the other punctuation attached to the words:

\d matches any decimal digit; equivalent to the class [0-9].
\D matches any non-digit character; equivalent to the class [^0-9].
\s matches any whitespace character; equivalent to the class [ \t\n\r\f\v].
\S matches any non-whitespace character; equivalent to the class [^ \t\n\r\f\v].
\w matches any alphanumeric character; equivalent to the class [a-zA-Z0-9_].
\W matches any non-alphanumeric character; equivalent to the class [^a-zA-Z0-9_].
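
For instance, a few of these classes in action (a small illustration added here, not part of the original listing):

import re
print(re.findall(r'\d', 'room 42'))      # ['4', '2']            -- decimal digits
print(re.split(r'\s', 'a b\tc'))         # ['a', 'b', 'c']       -- split on whitespace
print(re.findall(r'\w+', 'M.L. rocks'))  # ['M', 'L', 'rocks']   -- runs of word characters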
           
import re
regEx=re.compile('\\W')    #uppercase W: matches non-word characters
listOfTokens=regEx.split(mySent)
print(listOfTokens)
           

Output

['This', 'book', 'is', 'the', 'best', 'book', 'on', 'Python', 'or', 'M', 'L', '', 'I', 'have', 'ever', 'laid', 'eyes', 'upon', '']
           

To get rid of the empty strings, filter out the elements whose length is 0:

ret=[tok for tok in listOfTokens if len(tok)>0]
print(ret)
           

This is equivalent to the following statements:

ret=[]
for tok in listOfTokens:
    if len(tok)>0:
        ret.append(tok)
print(ret)
           

The expression before the for keyword is what gets appended. To normalize case, return everything in lowercase:

ret=[tok.lower() for tok in listOfTokens if len(tok)>0]
print(ret)
           

Output:

['this', 'book', 'is', 'the', 'best', 'book', 'on', 'python', 'or', 'm', 'l', 'i', 'have', 'ever', 'laid', 'eyes', 'upon']

The attached data set contains many email text files; take any one of them as an example:

Hello,

Since you are an owner of at least one Google Groups group that uses the customized welcome message, pages or files, we are writing to inform you that we will no longer be supporting these features starting February 2011. We made this decision so that we can focus on improving the core functionalities of Google Groups -- mailing lists and forum discussions.  Instead of these features, we encourage you to use products that are designed specifically for file storage and page creation, such as Google Docs and Google Sites.

For example, you can easily create your pages on Google Sites and share the site (http://www.google.com/support/sites/bin/answer.py?hl=en&answer=174623) with the members of your group. You can also store your files on the site by attaching files to pages (http://www.google.com/support/sites/bin/answer.py?hl=en&answer=90563) on the site. If you’re just looking for a place to upload your files so that your group members can download them, we suggest you try Google Docs. You can upload files (http://docs.google.com/support/bin/answer.py?hl=en&answer=50092) and share access with either a group (http://docs.google.com/support/bin/answer.py?hl=en&answer=66343) or an individual (http://docs.google.com/support/bin/answer.py?hl=en&answer=86152), assigning either edit or download only access to the files.

you have received this mandatory email service announcement to update you about important changes to Google Groups.
           

In the same way, all of the words in an email can be split out.

emailText = open('email/ham/6.txt').read()
listOfTokens=regEx.split(emailText)
print(listOfTokens)
           

Define a parsing function that returns the list of words whose length is greater than 2:

def textParse(bigString):    #input is big string, #output is word list
    import re
    listOfTokens = re.split(r'\W', bigString)
    return [tok.lower() for tok in listOfTokens if len(tok) > 2] 
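
A quick check of textParse on the sample sentence from above (an illustrative call, not from the original text); tokens of length 2 or less, such as 'is', 'on', 'or', 'M', 'L' and 'I', are dropped:

print(textParse(mySent))
# ['this', 'book', 'the', 'best', 'book', 'python', 'have', 'ever', 'laid', 'eyes', 'upon']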
           

Finally, here are the file-parsing function and the complete spam-filter test code:

from numpy import *

def createVocabList(dataSet):
    vocabSet = set([])  #create empty set
    #the input is a list of token lists (a 2-D structure), so push every element of every
    #document into the set (order may be shuffled), then return the result as a list
    for document in dataSet:
        vocabSet = vocabSet | set(document) #union of the two sets
    return list(vocabSet)

#count how many times each word occurs, used to build the bag-of-words vectors
def bagOfWords2VecMN(vocabList, inputSet):
    returnVec = [0]*len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1
    return returnVec


def trainNB0(trainMatrix,trainCategory):
    numTrainDocs = len(trainMatrix)                     #number of training documents
    numWords = len(trainMatrix[0])                      #number of distinct words in the vocabulary
    pAbusive = sum(trainCategory)/float(numTrainDocs)   #P(class 1) = documents labeled 1 / total documents
    #variable initialization
    p0Num = zeros(numWords); p1Num = zeros(numWords)    #count vectors initialized to [0,0,0,...]
    p0Denom = 0; p1Denom = 0                            #totals start at 0

    #to score a document we need the product of many conditional probabilities,
    #i.e. p(w0|ci)*p(w1|ci)*...*p(wN|ci); if any single factor is 0, the whole product becomes 0.
    #to reduce this effect we use Laplace smoothing: add a (usually 1) to every numerator and
    #k*a to every denominator (k = number of classes), so here every word count is initialized
    #to 1 and each denominator to 2*1 = 2
    p0Num = ones(numWords); p1Num = ones(numWords)      #change to ones()   
    p0Denom = 2.0; p1Denom = 2.0                        #change to 2.0

    #for each training document:
    #if it was labeled as class 1, every word it contains is counted in p1Num and its total
    #word count is added to p1Denom
    #if it was not labeled as class 1, the same statistics go into p0Num and p0Denom
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    #conditional probability of each word given class 1 = its count in class-1 documents (p1Num)
    #divided by the total number of class-1 words (p1Denom); likewise for class 0
    p1Vect = p1Num/p1Denom
    p0Vect = p0Num/p0Denom
    #most of these factors are very small, so their product rounds to 0, causing underflow or
    #an inaccurate result. Taking the natural log of the product (the log-likelihood) avoids
    #underflow and floating-point rounding errors, and loses nothing for classification.
    p1Vect = log(p1Num/p1Denom)          #change to log()
    p0Vect = log(p0Num/p0Denom)          #change to log()
    return p0Vect,p1Vect,pAbusive


def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    #p1 = (count of word A * log P(word A | class 1) + count of word B * log P(word B | class 1) + ...) + log P(class 1)
    #p0 = (count of word A * log P(word A | class 0) + count of word B * log P(word B | class 0) + ...) + log P(class 0)
    p1 = sum(vec2Classify * p1Vec) + log(pClass1)    #element-wise mult
    p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)
    if p1 > p0:
        return 1
    else: 
        return 0


def textParse(bigString):    #input is big string, #output is word list
    import re
    listOfTokens = re.split(r'\W', bigString)
    return [tok.lower() for tok in listOfTokens if len(tok) > 2] 

def spamTest():
    docList=[]; classList = []; fullText =[]
    for i in range(1,26):
        #read the 25 spam emails and the 25 ham emails in turn
        wordList = textParse(open('email/spam/%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(1)
        #ham/23.txt contains a registered-trademark (R) symbol, so decoding errors must be ignored when reading
        wordList = textParse(open('email/ham/%d.txt' % i,encoding='utf-8',errors='ignore').read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)
    #build the deduplicated vocabulary
    vocabList = createVocabList(docList)                #create vocabulary
    trainingSet = list(range(50)); testSet=[]           #create test set
    for i in range(10):
        #numpy includes random; random.uniform generates a random number between 0 and len(trainingSet)
        randIndex = int(random.uniform(0,len(trainingSet)))
        #randomly pick 10 distinct emails out of range(50) as the test set
        testSet.append(trainingSet[randIndex])
        del(trainingSet[randIndex])  

    trainMat=[]; trainClasses = []
    #the remaining 40 emails are used for training
    for docIndex in trainingSet:    #train the classifier (get probs) trainNB0
        trainMat.append(bagOfWords2VecMN(vocabList, docList[docIndex]))
        trainClasses.append(classList[docIndex])
    #compute the class prior and the per-word conditional probabilities
    p0V,p1V,pSpam = trainNB0(array(trainMat),array(trainClasses))

    #the 10 selected emails are used for testing
    errorCount = 0
    for docIndex in testSet:        #classify the remaining items
        wordVector = bagOfWords2VecMN(vocabList, docList[docIndex])
        #if the naive Bayes classifier's prediction differs from the actual label
        if classifyNB(array(wordVector),p0V,p1V,pSpam) != classList[docIndex]:
            errorCount += 1
            print ("classification error",docList[docIndex])
    #compute the error rate
    print ('the error rate is: ',float(errorCount)/len(testSet))
    #return vocabList,fullText


spamTest()
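
As a rough usage sketch (an illustrative addition, not part of the original code; the helper name classifyNewText and the sample message are made up here), the same building blocks can train on all 50 emails and score a single new string:

def classifyNewText(text, vocabList, p0V, p1V, pSpam):
    #vectorize the parsed text against the vocabulary and ask the classifier for a label
    wordVector = bagOfWords2VecMN(vocabList, textParse(text))
    return classifyNB(array(wordVector), p0V, p1V, pSpam)    #1 = spam, 0 = ham

docList = []; classList = []
for i in range(1, 26):
    docList.append(textParse(open('email/spam/%d.txt' % i).read())); classList.append(1)
    docList.append(textParse(open('email/ham/%d.txt' % i, encoding='utf-8', errors='ignore').read())); classList.append(0)
vocabList = createVocabList(docList)
trainMat = [bagOfWords2VecMN(vocabList, doc) for doc in docList]
p0V, p1V, pSpam = trainNB0(array(trainMat), array(classList))
print(classifyNewText('win a free prize, click here now', vocabList, p0V, p1V, pSpam))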

           

Note that ham email 23 contains a character (the ® symbol) that cannot be decoded as UTF-8:

SciFinance now automatically generates GPU-enabled pricing & risk model source code that runs up to 50-300x faster than serial code using a new NVIDIA Fermi-class Tesla 20-Series GPU.

SciFinance® is a derivatives pricing and risk model development tool that automatically generates C/C++ and GPU-enabled source code from concise, high-level model specifications. No parallel computing or CUDA programming expertise is required.

SciFinance's automatic, GPU-enabled Monte Carlo pricing model source code generation capabilities have been significantly extended in the latest release. This includes:

           

That character needs to be skipped, hence errors='ignore' when reading the file.
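
A small sketch of the decoding issue (an added illustration, assuming the ® character was stored as a single Latin-1 byte): strict UTF-8 decoding raises UnicodeDecodeError on that byte, while errors='ignore' simply drops it.

raw = b'SciFinance\xae is a derivatives pricing tool'   # '\xae' is '®' in Latin-1; invalid as UTF-8 on its own
print(raw.decode('utf-8', errors='ignore'))              # 'SciFinance is a derivatives pricing tool'
#print(raw.decode('utf-8'))                              # would raise UnicodeDecodeError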

Sample data download