Naive Bayes
1. Joint probability distribution
   P(x, y) = P(y) P(x|y); similarly, P(A∩B) = P(A) P(B) when A and B are independent. P(A∩B) is hard to compute directly, so we assume conditional independence and split it into a product of simpler terms.
2. Basic assumption: conditional independence of the features given the class.
3. Apply Bayes' theorem:
   P(y|x) = P(x, y) / P(x) = P(y) P(x|y) / Σ_i P(y_i) P(x|y_i)
   Prediction: ŷ = argmax_y P(y) P(x|y)
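A tiny numeric illustration of the decision rule (the prior and likelihood values here are made up):
prior <- c(spam = 0.2, ham = 0.8)          # P(y)
likelihood <- c(spam = 0.05, ham = 0.001)  # P(x|y) for one message x
posterior <- prior * likelihood / sum(prior * likelihood)  # Bayes' theorem
names(which.max(posterior))                # predicted class: "spam"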
Bayesian decision theory requires computing two probabilities, p1(x, y) and p2(x, y):
if p1(x, y) > p2(x, y), the instance belongs to class 1;
if p2(x, y) > p1(x, y), the instance belongs to class 2.
Laplace estimation prevents probability estimates of 0:
for each likelihood, add 1 to the numerator (each count), and add to the denominator the total number of 1s added across the numerators.
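The arithmetic in one line, with made-up counts (0 of 20 spam messages contain the word, and the feature takes 2 values):
(0 + 1) / (20 + 2)  # Laplace-smoothed estimate: 1/22 instead of 0, so no term zeroes out the product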
To use numeric features with naive Bayes, discretize them: find split points in the data's distribution and bin the values, as sketched below.
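For example, a hypothetical sketch using quantiles of the observed distribution as split points and cut() to bin:
x <- c(1.2, 2.8, 3.5, 5.0, 7.1)                   # a numeric feature
breaks <- quantile(x, probs = c(0, 1/3, 2/3, 1))  # split points from the data distribution
x_binned <- cut(x, breaks = breaks, include.lowest = TRUE)
table(x_binned)                                   # discretized feature, usable by naiveBayes()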
1. Tokenize - lowercase - remove numbers - remove stop words - strip extra whitespace and punctuation.
2. Count how often each word appears in each message.
3. Build a frequency matrix: rows are messages, columns are words; each cell records whether the word appears (e.g. Yes/No); the label can be a character string (converted to a factor). The spam-detection walkthrough below implements all three steps.
library(e1071)
sms_classifier <- naiveBayes(sms_train, sms_train_labels)  # train
sms_test_pred <- predict(sms_classifier, sms_test)         # predict
Example: iris[,-5] drops column 5 (the species label)
data(iris)
m <- naiveBayes(iris[,-5], iris[,5])
m
table(predict(m, iris), iris[,5])
Another naive Bayes package: klaR, with its NaiveBayes() function.
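A minimal sketch of the klaR alternative on the same iris data (assumes klaR is installed; its predict() returns a list with $class and $posterior):
library(klaR)
m2 <- NaiveBayes(Species ~ ., data = iris)  # formula interface
p2 <- predict(m2, iris[, -5])
table(p2$class, iris$Species)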
Naive Bayes classifiers are usually implemented in one of two ways: one based on the Bernoulli model, the other on the multinomial model.
Here we use the former. It ignores how many times a word occurs in a document and only records whether it occurs at all,
so in that sense it treats all words as equally weighted.
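The difference in one toy computation (all counts here are invented):
docs_with_word <- 3;  n_spam_docs <- 10     # word appears in 3 of 10 spam messages
word_tokens <- 7;     n_spam_tokens <- 200  # word is 7 of 200 tokens across spam messages
docs_with_word / n_spam_docs                # Bernoulli estimate of P(word|spam): 0.3
word_tokens / n_spam_tokens                 # multinomial estimate: 0.035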
Python implementation
http://blog.csdn.net/q383700092/article/details/51773364
R version (calling library functions)
http://blog.csdn.net/q383700092/article/details/51774069
Simplified MapReduce implementation
http://blog.csdn.net/q383700092/article/details/51778765
Spark version
to be added
垃圾短信識别
# read the sms data into the sms data frame
sms_raw <- read.csv("sms_spam.csv", stringsAsFactors = FALSE)
# examine the structure of the sms data: spam is labeled "spam", legitimate messages "ham"; columns are type + text
str(sms_raw)
# convert spam/ham to factor; character class labels are better handled as factors
sms_raw$type <- factor(sms_raw$type)
# examine the type variable more carefully
str(sms_raw$type)
table(sms_raw$type)
# build a corpus using the text mining (tm) package
# tm is the text mining package
library(tm)
sms_corpus <- VCorpus(VectorSource(sms_raw$text))  # build the corpus; VCorpus stores R text documents
# examine the sms corpus
print(sms_corpus)
inspect(sms_corpus[1:2])  # view the first two documents in the corpus
as.character(sms_corpus[[1]])
lapply(sms_corpus[1:2], as.character)
# clean up the corpus using tm_map(); convert letters to lowercase
#sms_corpus_clean <- tm_map(sms_corpus, content_transformer(tolower))
# show the difference between sms_corpus and corpus_clean
#as.character(sms_corpus[[1]])
#as.character(sms_corpus_clean[[1]])
sms_corpus_clean <- tm_map(sms_corpus, removeNumbers)                       # remove numbers
sms_corpus_clean <- tm_map(sms_corpus_clean, content_transformer(tolower))  # convert to lowercase
sms_corpus_clean <- tm_map(sms_corpus_clean, removeWords, stopwords())      # remove stop words
sms_corpus_clean <- tm_map(sms_corpus_clean, removePunctuation)             # remove punctuation
# tip: create a custom function to replace (rather than remove) punctuation
removePunctuation("hello...world")
replacePunctuation <- function(x) { gsub("[[:punct:]]+", " ", x) }
replacePunctuation("hello...world")
# illustration of word stemming
library(SnowballC)
wordStem(c("learn", "learned", "learning", "learns"))
sms_corpus_clean <- tm_map(sms_corpus_clean, stemDocument)
sms_corpus_clean <- tm_map(sms_corpus_clean, stripWhitespace)  # eliminate unneeded whitespace
# examine the final clean corpus
lapply(sms_corpus[1:3], as.character)
lapply(sms_corpus_clean[1:3], as.character)
# create a document-term sparse matrix
sms_dtm <- DocumentTermMatrix(sms_corpus_clean)
# alternative solution: create a document-term sparse matrix directly from the SMS corpus
sms_dtm2 <- DocumentTermMatrix(sms_corpus, control = list(
  tolower = TRUE,
  removeNumbers = TRUE,
  stopwords = TRUE,
  removePunctuation = TRUE,
  stemming = TRUE
))
# alternative solution: using a custom stop words function ensures an identical result
sms_dtm3 <- DocumentTermMatrix(sms_corpus, control = list(
  tolower = TRUE,
  removeNumbers = TRUE,
  stopwords = function(x) { removeWords(x, stopwords()) },
  removePunctuation = TRUE,
  stemming = TRUE
))
# compare the result
sms_dtm
sms_dtm2
sms_dtm3
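An observation, not from the source: sms_dtm2 can differ slightly from sms_dtm because DocumentTermMatrix applies its cleaning steps in its own fixed order; the custom stop words function behind sms_dtm3 is what restores the match. A quick check of the shared vocabulary (sketch):
length(intersect(Terms(sms_dtm), Terms(sms_dtm2)))  # how many terms the two DTMs have in common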
# creating training and test datasets
sms_dtm_train <- sms_dtm[1:4169, ]
sms_dtm_test <- sms_dtm[4170:5559, ]
# also save the labels
sms_train_labels <- sms_raw[1:4169, ]$type
sms_test_labels <- sms_raw[4170:5559, ]$type
# check that the proportion of spam is similar in the training and test sets
prop.table(table(sms_train_labels))
prop.table(table(sms_test_labels))
# word cloud visualization
library(wordcloud)
# build a word cloud directly from the corpus
wordcloud(sms_corpus_clean, min.freq = 50, random.order = FALSE)
# subset the raw data into spam and ham groups
spam <- subset(sms_raw, type == "spam")
ham <- subset(sms_raw, type == "ham")
wordcloud(spam$text, max.words = 40, scale = c(3, 0.5))
wordcloud(ham$text, max.words = 40, scale = c(3, 0.5))
# one option: drop terms absent from more than 99.9% of messages
sms_dtm_freq_train <- removeSparseTerms(sms_dtm_train, 0.999)
sms_dtm_freq_train
# indicator features for frequent words: keep words appearing in at least 5 messages to shrink the feature set
findFreqTerms(sms_dtm_train, 5)
# save frequently-appearing terms to a character vector
sms_freq_words <- findFreqTerms(sms_dtm_train, 5)
str(sms_freq_words)
# create DTMs with only the frequent terms
sms_dtm_freq_train <- sms_dtm_train[ , sms_freq_words]
sms_dtm_freq_test <- sms_dtm_test[ , sms_freq_words]
# convert counts to Yes/No indicators
convert_counts <- function(x) {
  ifelse(x > 0, "Yes", "No")
}
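A quick check on a toy vector (expected output: "No" "Yes" "Yes"):
convert_counts(c(0, 1, 3))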
# apply convert_counts() to the columns of the train/test data; MARGIN = 2 selects columns (1 would be rows)
sms_train <- apply(sms_dtm_freq_train, MARGIN = 2, convert_counts)
sms_test <- apply(sms_dtm_freq_test, MARGIN = 2, convert_counts)
## Step 3: Training a model on the data ----
library(e1071)
sms_classifier <- naiveBayes(sms_train, sms_train_labels)
## Step 4: Evaluating model performance ----
sms_test_pred <- predict(sms_classifier, sms_test)
# evaluate the predictions
library(gmodels)
CrossTable(sms_test_pred, sms_test_labels,
           prop.chisq = FALSE, prop.t = FALSE, prop.r = FALSE,
           dnn = c('predicted', 'actual'))
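CrossTable gives the full breakdown; a single overall accuracy figure can also be computed directly (sketch):
mean(sms_test_pred == sms_test_labels)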
## Step 5: Improving model performance with Laplace smoothing (laplace = 1) ----
sms_classifier2 <- naiveBayes(sms_train, sms_train_labels, laplace = 1)
sms_test_pred2 <- predict(sms_classifier2, sms_test)
CrossTable(sms_test_pred2, sms_test_labels,
           prop.chisq = FALSE, prop.t = FALSE, prop.r = FALSE,
           dnn = c('predicted', 'actual'))
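The same accuracy check for the smoothed model, for comparison against the first classifier (sketch):
mean(sms_test_pred2 == sms_test_labels)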