
[Niche?] The Naive Bayes Classification Algorithm


This article introduces an algorithm that is said to be rather niche: the naive Bayes classification algorithm.

Naive Bayes classification is based on Bayes' theorem: for a given dataset, under the assumption that features are conditionally independent, it predicts the class with the maximum posterior probability.

Concretely, naive Bayes treats each instance as a vector of n attributes, where the i-th attribute has value xi. Assuming the attributes are mutually independent given the class, the classifier scores each class as follows:

P(y|x1, …, xn) ∝ P(y) · P(x1|y) · … · P(xn|y)

Here y is the class label and x1, …, xn are the feature values of the sample. P(y) is the probability of class y in the dataset, and P(xi|y) is the probability that the i-th feature takes value xi given class y. The denominator P(x1, …, xn) is the same for every class, so it can be dropped: the predicted class is simply the y that maximizes this product.
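As a small sketch of how the formula is used, take two classes (spam and ham) and two features; all of the probabilities below are made-up numbers for illustration, not estimates from any real dataset:

```java
public class PosteriorDemo {
    public static void main(String[] args) {
        // Hypothetical values estimated from some training set:
        double pSpam = 0.4, pHam = 0.6;                 // priors P(y)
        double pX1GivenSpam = 0.8, pX2GivenSpam = 0.3;  // P(xi | spam)
        double pX1GivenHam  = 0.1, pX2GivenHam  = 0.5;  // P(xi | ham)

        // Unnormalized posteriors: P(y) * P(x1|y) * P(x2|y)
        double scoreSpam = pSpam * pX1GivenSpam * pX2GivenSpam; // 0.096
        double scoreHam  = pHam  * pX1GivenHam  * pX2GivenHam;  // 0.030

        // The class with the larger score wins; no normalization needed.
        String prediction = scoreSpam > scoreHam ? "spam" : "ham";
        System.out.println(prediction); // prints "spam"
    }
}
```

Note that the two scores do not sum to 1; dividing both by their sum would recover the true posteriors, but the argmax is the same either way.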

The advantages of naive Bayes classification are:

  1. It is a probabilistic algorithm, so its results are easy to interpret.
  2. With larger datasets its accuracy is relatively high; it can handle many features and is not sensitive to noisy data.
  3. Both training and classification are fast.

Its disadvantages are:

  1. It rests on the conditional-independence assumption, so on data with correlated features it may perform worse than other algorithms.
  2. It is sensitive to how the input data is prepared and needs fairly careful preprocessing.
  3. It performs poorly on data drawn from a distribution different from the training data.

Naive Bayes is mainly applied to text classification, spam filtering, sentiment analysis, and similar domains, where the sample sets tend to be large.

Below is a Java implementation:

import java.util.*;

public class NaiveBayesClassifier {
    // P(y): prior probability of each class
    private final Map<String, Double> classPriors = new HashMap<>();
    // P(word|y): word -> (class -> smoothed conditional probability)
    private final Map<String, Map<String, Double>> featureProbabilities = new HashMap<>();
    // Smoothed fallback probability per class, for words never seen in training
    private final Map<String, Double> unseenWordProbabilities = new HashMap<>();
    private final double smoothingFactor;

    public NaiveBayesClassifier(double smoothingFactor) {
        this.smoothingFactor = smoothingFactor;
    }

    public void train(List<String> documents, List<String> classes) {
        int documentCount = documents.size();
        Set<String> vocabulary = new HashSet<>();
        Map<String, Integer> classDocCounts = new HashMap<>();          // documents per class
        Map<String, Integer> classWordTotals = new HashMap<>();         // word occurrences per class
        Map<String, Map<String, Integer>> wordCounts = new HashMap<>(); // word -> class -> count

        // Count word occurrences in each document under that document's class.
        for (int i = 0; i < documentCount; i++) {
            String currentClass = classes.get(i);
            classDocCounts.merge(currentClass, 1, Integer::sum);
            for (String word : documents.get(i).split(" ")) {
                vocabulary.add(word);
                classWordTotals.merge(currentClass, 1, Integer::sum);
                wordCounts.computeIfAbsent(word, w -> new HashMap<>())
                          .merge(currentClass, 1, Integer::sum);
            }
        }

        // P(y) = (documents in class y) / (total documents)
        for (Map.Entry<String, Integer> entry : classDocCounts.entrySet()) {
            classPriors.put(entry.getKey(), entry.getValue() / (double) documentCount);
        }

        // Multinomial model with additive (Laplace) smoothing:
        // P(word|y) = (count(word, y) + k) / (totalWords(y) + k * |V|)
        for (String currentClass : classPriors.keySet()) {
            double denominator = classWordTotals.getOrDefault(currentClass, 0)
                    + smoothingFactor * vocabulary.size();
            unseenWordProbabilities.put(currentClass, smoothingFactor / denominator);
            for (String word : vocabulary) {
                int count = wordCounts.get(word).getOrDefault(currentClass, 0);
                featureProbabilities.computeIfAbsent(word, w -> new HashMap<>())
                        .put(currentClass, (count + smoothingFactor) / denominator);
            }
        }
    }

    public String classify(String document) {
        String[] words = document.split(" ");
        Map<String, Double> classProbabilities = new HashMap<>();
        for (String currentClass : classPriors.keySet()) {
            // Sum log probabilities instead of multiplying, to avoid underflow.
            double logProbability = Math.log(classPriors.get(currentClass));
            for (String word : words) {
                double p = featureProbabilities
                        .getOrDefault(word, Collections.<String, Double>emptyMap())
                        .getOrDefault(currentClass, unseenWordProbabilities.get(currentClass));
                logProbability += Math.log(p);
            }
            classProbabilities.put(currentClass, logProbability);
        }
        return Collections.max(classProbabilities.entrySet(), Map.Entry.comparingByValue()).getKey();
    }
}
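The classifier above sums log probabilities in classify rather than multiplying the raw probabilities. A small standalone sketch of why (the probability value and loop length are arbitrary, chosen only to illustrate the effect):

```java
public class LogSpaceDemo {
    public static void main(String[] args) {
        // Multiplying many small probabilities underflows double precision to 0.0;
        // summing their logs keeps the scores finite and comparable.
        double product = 1.0;
        double logSum = 0.0;
        for (int i = 0; i < 400; i++) {
            product *= 1e-3;            // e.g. P(word|class) for a long document
            logSum += Math.log(1e-3);
        }
        System.out.println(product);    // underflows to 0.0
        System.out.println(logSum);     // still finite (about -2763.1)
    }
}
```

Once every class score underflows to 0.0, the argmax is meaningless; in log space the comparison still works, and since log is monotonic the winning class is unchanged.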
           

Although naive Bayes has proven very effective in practice, the model parameters still need to be chosen and tuned for the specific problem. I hope this example helps you better understand naive Bayes classification and implement your own classifier.
