Machine Learning in Action: Study Notes (1)

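1. k-Nearest Neighbors (kNN)

How the algorithm works:

We start with a sample dataset (the training set) in which the class of every sample is known. When a sample of unknown class arrives, each of its features is compared with the corresponding feature of the samples in the training set, and the algorithm picks the k samples whose features are most similar (the nearest neighbors). The class that occurs most often among those k samples is assigned to the new sample.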

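Algorithm steps:

Compute the distance between the current point and every point in the labeled dataset (the similarity measure).
Sort the points by increasing distance.
Take the k points closest to the current point.
Count how often each class occurs among those k points.
Return the most frequent class as the predicted class of the current point.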
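The five steps can be traced by hand on a toy dataset. This is a minimal sketch in plain Python, assuming Euclidean distance; the four sample points, their labels, the query point, and k = 3 are made-up illustration values, not data from the book.

```python
import math
from collections import Counter

# Hypothetical training set: four 2-D points with known labels.
points = [(1.0, 1.1), (1.0, 1.0), (0.0, 0.0), (0.0, 0.1)]
labels = ['A', 'A', 'B', 'B']
query = (0.2, 0.1)   # the point to classify (illustrative)
k = 3

# Step 1: Euclidean distance from the query to every training point.
distances = [math.dist(query, p) for p in points]

# Step 2: indices of the training points, nearest first.
order = sorted(range(len(points)), key=lambda i: distances[i])

# Step 3: labels of the k nearest points.
nearest_labels = [labels[i] for i in order[:k]]

# Steps 4 and 5: count the labels and return the most frequent one.
votes = Counter(nearest_labels)
predicted = votes.most_common(1)[0][0]
print(predicted)  # B
```

The three nearest points carry the labels B, B and A, so the vote goes to B.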

Python implementation:

from numpy import tile
import operator

def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]
    # Repeat inX once per training row, then subtract the training feature matrix
    diffMat = tile(inX, (dataSetSize,1)) - dataSet
    sqDiffMat = diffMat**2
    sqDistances = sqDiffMat.sum(axis=1)  # sum along each row: one squared distance per sample
    distances = sqDistances**0.5
    sortedDistIndicies = distances.argsort()  # row indices ordered by increasing distance
    classCount={}
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel,0) + 1
    # items() works in Python 3; the Python 2-only iteritems() does not
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

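Inputs:

inX: the feature vector of the point to classify
dataSet, labels: the feature matrix of the training set and the class label of each row
k: the number of nearest neighbors used by kNN

Output:

sortedClassCount[0][0]: the class that occurs most often among the k nearest samples
sortedClassCount is a list of (class, count) pairs sorted by count in descending order; the counts themselves are accumulated in the dictionary classCount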
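A quick check of what the sorting step returns may help here. The snippet below is a sketch using a hypothetical vote tally (two votes for 'B', one for 'A') and the Python 3 spelling dict.items(); the variable names are illustrative.

```python
import operator

# Hypothetical vote tally for k = 3: two neighbours voted 'B', one voted 'A'.
class_count = {'A': 1, 'B': 2}

# Sort the (label, count) pairs by count, largest first.
sorted_class_count = sorted(class_count.items(),
                            key=operator.itemgetter(1), reverse=True)

print(sorted_class_count)        # [('B', 2), ('A', 1)]
print(sorted_class_count[0][0])  # B
```

The result is a list of tuples rather than a dictionary, which is why the winning class is read with the double index [0][0].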

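2. Decision Trees (ID3)

Principle:

At each step the algorithm decides which feature of the current dataset to split on. The split divides the data into several subsets, one for each branch of the decision node. If all the data under a branch belongs to the same class, no further splitting is needed; otherwise the splitting process is repeated on that subset (recursion).

Stopping condition: the program has used up every attribute available for splitting, or all instances under a branch share the same class (a leaf node, also called a terminating block, has been reached). In the first case the leaf takes the majority class of the remaining instances.

Principle for choosing the feature to split on: make disordered data more ordered.

One way to organize disordered data is to measure information using information theory (entropy).

Information gain: the change in information before and after splitting the dataset, that is, the reduction in entropy. The feature that yields the highest information gain is the best choice.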

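Entropy: $H = -\sum_{i=1}^{n} p_i \log_2 p_i$, where $p_i$ is the proportion of samples that belong to class $i$ and $n$ is the number of classes.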
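As a worked example, assume a hypothetical dataset of five samples, two labeled "yes" and three labeled "no" (illustrative numbers, not taken from the notes). The class proportions are 2/5 and 3/5, so:

```latex
H = -\left(\tfrac{2}{5}\log_2\tfrac{2}{5} + \tfrac{3}{5}\log_2\tfrac{3}{5}\right)
  \approx 0.4 \times 1.322 + 0.6 \times 0.737
  \approx 0.971
```

For comparison, a dataset whose samples all share one class has H = 0, and an even split between two classes has H = 1 bit.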

Note: another way to measure how disordered a dataset is is Gini impurity.
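For reference, Gini impurity is commonly defined as one minus the sum of the squared class proportions. The sketch below computes it for a list of class labels; the function name gini_impurity and the sample labels are illustrative and do not come from the book's code.

```python
from collections import Counter

def gini_impurity(class_labels):
    """Return 1 - sum(p_i ** 2) over the classes present in class_labels."""
    total = len(class_labels)
    counts = Counter(class_labels)
    return 1.0 - sum((n / total) ** 2 for n in counts.values())

mixed = gini_impurity(['yes', 'yes', 'no', 'no', 'no'])   # about 0.48
pure = gini_impurity(['yes', 'yes', 'yes'])               # 0.0
print(round(mixed, 2), pure)
```

Like entropy, the value is zero for a pure subset and grows as the classes become more mixed.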

Python code for computing the entropy of a dataset:

from math import log

def calcShannonEnt(dataSet):
    numEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet:  # count how often each class label occurs
        currentLabel = featVec[-1]  # the class label is the last column
        if currentLabel not in labelCounts.keys(): labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key])/numEntries
        shannonEnt -= prob * log(prob,2)  # log base 2
    return shannonEnt

Finding the best way to split the dataset:

def splitDataSet(dataSet, axis, value):  # split the data on a given feature and value
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            reducedFeatVec = featVec[:axis]     # chop out the axis used for splitting
            reducedFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet

def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1      # the last column holds the class labels
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0
    bestFeature = -1                        # stays -1 if no feature gives a positive gain
    for i in range(numFeatures):            # iterate over all the features
        featList = [example[i] for example in dataSet]  # all values of this feature
        uniqueVals = set(featList)          # the distinct values of this feature
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet)/float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)  # weighted entropy after the split
        infoGain = baseEntropy - newEntropy     # information gain, i.e. the reduction in entropy
        if (infoGain > bestInfoGain):           # compare this to the best gain so far
            bestInfoGain = infoGain             # if better than the current best, keep it
            bestFeature = i
    return bestFeature                          # returns an integer index
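To see the numbers behind the feature choice, the information gain of each feature can be computed on a small dataset. This is a compact, self-contained sketch rather than the book's code: the helper names entropy and info_gain and the five-row toy dataset (two binary features plus a class label in the last column) are assumptions made for illustration.

```python
from collections import Counter
from math import log2

def entropy(class_labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(class_labels)
    return -sum((n / total) * log2(n / total)
                for n in Counter(class_labels).values())

def info_gain(rows, feature_index):
    """Entropy before the split minus the weighted entropy after it."""
    base = entropy([row[-1] for row in rows])
    remainder = 0.0
    for value in set(row[feature_index] for row in rows):
        subset = [row[-1] for row in rows if row[feature_index] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

# Hypothetical toy dataset: [feature 0, feature 1, class label].
rows = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]

gains = [info_gain(rows, i) for i in range(2)]
best = max(range(2), key=lambda i: gains[i])
print([round(g, 3) for g in gains])  # [0.42, 0.171]
print(best)                          # 0
```

Splitting on feature 0 leaves one pure subset and one nearly pure subset, so it removes more entropy than feature 1 and is chosen first.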

Building the decision tree recursively

import operator

def majorityCnt(classList):  # return the class that occurs most often
    classCount={}
    for vote in classList:
        if vote not in classCount.keys(): classCount[vote] = 0
        classCount[vote] += 1
    # items() works in Python 3; the Python 2-only iteritems() does not
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

def createTree(dataSet,labels):
    classList = [example[-1] for example in dataSet]
    if classList.count(classList[0]) == len(classList):
        return classList[0]  # stop splitting when all of the classes are equal
    if len(dataSet[0]) == 1:  # stop splitting when there are no more features in dataSet
        return majorityCnt(classList)

    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel:{}}
    del(labels[bestFeat])       # note: this modifies the labels list passed in by the caller
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        subLabels = labels[:]       # copy all of labels, so trees don't mess up existing labels
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value),subLabels)
    return myTree
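createTree returns the tree as nested dictionaries: each inner node is a dict keyed by a feature name, and each leaf is a class label. A sketch of how such a tree could be used for prediction follows; the classify function, the sample tree, and the feature names are hypothetical illustrations of that nested-dict shape, not part of the code above. Because createTree deletes entries from the labels list it receives, the lookup needs its own intact list of feature names.

```python
def classify(tree, feature_names, sample):
    """Walk a nested-dict tree until a leaf (a non-dict value) is reached."""
    if not isinstance(tree, dict):
        return tree                       # leaf: this is the predicted class
    feature_name = next(iter(tree))       # the feature tested at this node
    branches = tree[feature_name]
    value = sample[feature_names.index(feature_name)]
    return classify(branches[value], feature_names, sample)

# Hypothetical tree in the nested-dict shape described above.
tree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
feature_names = ['no surfacing', 'flippers']

print(classify(tree, feature_names, [1, 1]))  # yes
print(classify(tree, feature_names, [1, 0]))  # no
print(classify(tree, feature_names, [0, 1]))  # no
```

A feature value that never appeared in the training data has no branch, so this sketch would raise a KeyError for it.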

2014-10-12

This article is licensed under CC 3.0. When reposting, please credit the source: 學而優，思則通

Reposted from: https://www.cnblogs.com/ddblog/p/4020922.html
