
Machine Learning (2) -- k-Nearest Neighbors (KNN)

KNN (k-Nearest Neighbor) is one of the simplest supervised classification methods. The K refers to the k nearest neighbors.

Put simply, it is "birds of a feather flock together": to predict which class a new sample belongs to, find the K closest points in the training set, look at which classes those k neighbors belong to, and assign the sample to the majority class among them.
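That idea can be sketched in a few lines of plain Python. The 2-D points and labels below are made up purely for illustration (they are not the iris data used later):

```python
# A minimal sketch of the KNN idea on made-up 2-D toy data.
def knn_predict(train_points, train_labels, query, k=3):
    # (squared distance, label) for every training point
    dists = [((p[0] - query[0]) ** 2 + (p[1] - query[1]) ** 2, label)
             for p, label in zip(train_points, train_labels)]
    dists.sort(key=lambda t: t[0])                 # nearest first
    nearest = [label for _, label in dists[:k]]    # classes of the k nearest
    return max(set(nearest), key=nearest.count)    # majority vote

points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.8), (4.9, 5.1)]
labels = ["A", "A", "A", "B", "B", "B"]
print(knn_predict(points, labels, (1.1, 0.9)))  # → A
```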

Advantages: both the idea and the algorithm are simple and easy to understand; unlike some other algorithms, there is no matrix algebra or probability theory involved -- the core is just computing distances.

Disadvantage: prediction is computationally expensive. Think of it as "not studying all term, then cramming the night before the exam".

Other machine learning algorithms, such as neural networks, first compute a model from the training data; to make a prediction you simply feed the new data into the model. Building the model is expensive (studying hard all term), but prediction itself is very fast.

See also: Machine Learning (1) -- A First Look at Neural Networks.

KNN, by contrast, has essentially no training phase: only when predicting a sample does it compute the distance to every training point -- hence "no studying, just cramming".

About the dataset used in this example: it is the iris dataset, containing 150 records,

divided into 3 classes: setosa, versicolor, and virginica,

with 50 records per class; each record consists of 4 feature values and one class label.

Program flow in brief: randomly select a proportion of the records (70% by default) as the training set and use the rest as the test set; finally, report the accuracy on the test set.

#-*- coding:utf-8 -*-  
data='''5.1,3.5,1.4,0.2,Iris-setosa
        4.9,3.0,1.4,0.2,Iris-setosa
        4.7,3.2,1.3,0.2,Iris-setosa
        4.6,3.1,1.5,0.2,Iris-setosa
        5.0,3.6,1.4,0.2,Iris-setosa
        5.4,3.9,1.7,0.4,Iris-setosa
        4.6,3.4,1.4,0.3,Iris-setosa
        5.0,3.4,1.5,0.2,Iris-setosa
        4.4,2.9,1.4,0.2,Iris-setosa
        4.9,3.1,1.5,0.1,Iris-setosa
        5.4,3.7,1.5,0.2,Iris-setosa
        4.8,3.4,1.6,0.2,Iris-setosa
        4.8,3.0,1.4,0.1,Iris-setosa
        4.3,3.0,1.1,0.1,Iris-setosa
        5.8,4.0,1.2,0.2,Iris-setosa
        5.7,4.4,1.5,0.4,Iris-setosa
        5.4,3.9,1.3,0.4,Iris-setosa
        5.1,3.5,1.4,0.3,Iris-setosa
        5.7,3.8,1.7,0.3,Iris-setosa
        5.1,3.8,1.5,0.3,Iris-setosa
        5.4,3.4,1.7,0.2,Iris-setosa
        5.1,3.7,1.5,0.4,Iris-setosa
        4.6,3.6,1.0,0.2,Iris-setosa
        5.1,3.3,1.7,0.5,Iris-setosa
        4.8,3.4,1.9,0.2,Iris-setosa
        5.0,3.0,1.6,0.2,Iris-setosa
        5.0,3.4,1.6,0.4,Iris-setosa
        5.2,3.5,1.5,0.2,Iris-setosa
        5.2,3.4,1.4,0.2,Iris-setosa
        4.7,3.2,1.6,0.2,Iris-setosa
        4.8,3.1,1.6,0.2,Iris-setosa
        5.4,3.4,1.5,0.4,Iris-setosa
        5.2,4.1,1.5,0.1,Iris-setosa
        5.5,4.2,1.4,0.2,Iris-setosa
        4.9,3.1,1.5,0.1,Iris-setosa
        5.0,3.2,1.2,0.2,Iris-setosa
        5.5,3.5,1.3,0.2,Iris-setosa
        4.9,3.1,1.5,0.1,Iris-setosa
        4.4,3.0,1.3,0.2,Iris-setosa
        5.1,3.4,1.5,0.2,Iris-setosa
        5.0,3.5,1.3,0.3,Iris-setosa
        4.5,2.3,1.3,0.3,Iris-setosa
        4.4,3.2,1.3,0.2,Iris-setosa
        5.0,3.5,1.6,0.6,Iris-setosa
        5.1,3.8,1.9,0.4,Iris-setosa
        4.8,3.0,1.4,0.3,Iris-setosa
        5.1,3.8,1.6,0.2,Iris-setosa
        4.6,3.2,1.4,0.2,Iris-setosa
        5.3,3.7,1.5,0.2,Iris-setosa
        5.0,3.3,1.4,0.2,Iris-setosa
        7.0,3.2,4.7,1.4,Iris-versicolor
        6.4,3.2,4.5,1.5,Iris-versicolor
        6.9,3.1,4.9,1.5,Iris-versicolor
        5.5,2.3,4.0,1.3,Iris-versicolor
        6.5,2.8,4.6,1.5,Iris-versicolor
        5.7,2.8,4.5,1.3,Iris-versicolor
        6.3,3.3,4.7,1.6,Iris-versicolor
        4.9,2.4,3.3,1.0,Iris-versicolor
        6.6,2.9,4.6,1.3,Iris-versicolor
        5.2,2.7,3.9,1.4,Iris-versicolor
        5.0,2.0,3.5,1.0,Iris-versicolor
        5.9,3.0,4.2,1.5,Iris-versicolor
        6.0,2.2,4.0,1.0,Iris-versicolor
        6.1,2.9,4.7,1.4,Iris-versicolor
        5.6,2.9,3.6,1.3,Iris-versicolor
        6.7,3.1,4.4,1.4,Iris-versicolor
        5.6,3.0,4.5,1.5,Iris-versicolor
        5.8,2.7,4.1,1.0,Iris-versicolor
        6.2,2.2,4.5,1.5,Iris-versicolor
        5.6,2.5,3.9,1.1,Iris-versicolor
        5.9,3.2,4.8,1.8,Iris-versicolor
        6.1,2.8,4.0,1.3,Iris-versicolor
        6.3,2.5,4.9,1.5,Iris-versicolor
        6.1,2.8,4.7,1.2,Iris-versicolor
        6.4,2.9,4.3,1.3,Iris-versicolor
        6.6,3.0,4.4,1.4,Iris-versicolor
        6.8,2.8,4.8,1.4,Iris-versicolor
        6.7,3.0,5.0,1.7,Iris-versicolor
        6.0,2.9,4.5,1.5,Iris-versicolor
        5.7,2.6,3.5,1.0,Iris-versicolor
        5.5,2.4,3.8,1.1,Iris-versicolor
        5.5,2.4,3.7,1.0,Iris-versicolor
        5.8,2.7,3.9,1.2,Iris-versicolor
        6.0,2.7,5.1,1.6,Iris-versicolor
        5.4,3.0,4.5,1.5,Iris-versicolor
        6.0,3.4,4.5,1.6,Iris-versicolor
        6.7,3.1,4.7,1.5,Iris-versicolor
        6.3,2.3,4.4,1.3,Iris-versicolor
        5.6,3.0,4.1,1.3,Iris-versicolor
        5.5,2.5,4.0,1.3,Iris-versicolor
        5.5,2.6,4.4,1.2,Iris-versicolor
        6.1,3.0,4.6,1.4,Iris-versicolor
        5.8,2.6,4.0,1.2,Iris-versicolor
        5.0,2.3,3.3,1.0,Iris-versicolor
        5.6,2.7,4.2,1.3,Iris-versicolor
        5.7,3.0,4.2,1.2,Iris-versicolor
        5.7,2.9,4.2,1.3,Iris-versicolor
        6.2,2.9,4.3,1.3,Iris-versicolor
        5.1,2.5,3.0,1.1,Iris-versicolor
        5.7,2.8,4.1,1.3,Iris-versicolor
        6.3,3.3,6.0,2.5,Iris-virginica
        5.8,2.7,5.1,1.9,Iris-virginica
        7.1,3.0,5.9,2.1,Iris-virginica
        6.3,2.9,5.6,1.8,Iris-virginica
        6.5,3.0,5.8,2.2,Iris-virginica
        7.6,3.0,6.6,2.1,Iris-virginica
        4.9,2.5,4.5,1.7,Iris-virginica
        7.3,2.9,6.3,1.8,Iris-virginica
        6.7,2.5,5.8,1.8,Iris-virginica
        7.2,3.6,6.1,2.5,Iris-virginica
        6.5,3.2,5.1,2.0,Iris-virginica
        6.4,2.7,5.3,1.9,Iris-virginica
        6.8,3.0,5.5,2.1,Iris-virginica
        5.7,2.5,5.0,2.0,Iris-virginica
        5.8,2.8,5.1,2.4,Iris-virginica
        6.4,3.2,5.3,2.3,Iris-virginica
        6.5,3.0,5.5,1.8,Iris-virginica
        7.7,3.8,6.7,2.2,Iris-virginica
        7.7,2.6,6.9,2.3,Iris-virginica
        6.0,2.2,5.0,1.5,Iris-virginica
        6.9,3.2,5.7,2.3,Iris-virginica
        5.6,2.8,4.9,2.0,Iris-virginica
        7.7,2.8,6.7,2.0,Iris-virginica
        6.3,2.7,4.9,1.8,Iris-virginica
        6.7,3.3,5.7,2.1,Iris-virginica
        7.2,3.2,6.0,1.8,Iris-virginica
        6.2,2.8,4.8,1.8,Iris-virginica
        6.1,3.0,4.9,1.8,Iris-virginica
        6.4,2.8,5.6,2.1,Iris-virginica
        7.2,3.0,5.8,1.6,Iris-virginica
        7.4,2.8,6.1,1.9,Iris-virginica
        7.9,3.8,6.4,2.0,Iris-virginica
        6.4,2.8,5.6,2.2,Iris-virginica
        6.3,2.8,5.1,1.5,Iris-virginica
        6.1,2.6,5.6,1.4,Iris-virginica
        7.7,3.0,6.1,2.3,Iris-virginica
        6.3,3.4,5.6,2.4,Iris-virginica
        6.4,3.1,5.5,1.8,Iris-virginica
        6.0,3.0,4.8,1.8,Iris-virginica
        6.9,3.1,5.4,2.1,Iris-virginica
        6.7,3.1,5.6,2.4,Iris-virginica
        6.9,3.1,5.1,2.3,Iris-virginica
        5.8,2.7,5.1,1.9,Iris-virginica
        6.8,3.2,5.9,2.3,Iris-virginica
        6.7,3.3,5.7,2.5,Iris-virginica
        6.7,3.0,5.2,2.3,Iris-virginica
        6.3,2.5,5.0,1.9,Iris-virginica
        6.5,3.0,5.2,2.0,Iris-virginica
        6.2,3.4,5.4,2.3,Iris-virginica
        5.9,3.0,5.1,1.8,Iris-virginica'''

import numpy as np
#Data preparation: parse the 150 records and map the class names to the numbers 1.0/2.0/3.0,
#so that every field is a float and the whole array can be processed quickly with NumPy
data=data.replace(' ','').replace("Iris-setosa","1.0").replace("Iris-versicolor","2.0").replace("Iris-virginica","3.0").split('\n')
data=list(filter(lambda x: len(x)>0,data))
data=[x.split(',') for x in data]
data=np.array(data).astype(np.float16)
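As an aside, the chained `replace` calls above work, but an explicit label-to-number mapping is arguably clearer and fails loudly on an unexpected label. A minimal alternative sketch using the standard `csv` module (the two sample rows here are just for illustration):

```python
import csv
import io

label_map = {"Iris-setosa": 1.0, "Iris-versicolor": 2.0, "Iris-virginica": 3.0}
sample = "5.1,3.5,1.4,0.2,Iris-setosa\n6.3,3.3,6.0,2.5,Iris-virginica\n"

rows = []
for row in csv.reader(io.StringIO(sample)):
    *features, label = row                        # last field is the class name
    rows.append([float(v) for v in features] + [label_map[label.strip()]])

print(rows[0])  # [5.1, 3.5, 1.4, 0.2, 1.0]
```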


def splitData(trainPercent=0.7):
    #randomly assign each record to either the training set or the test set
    train=[]
    test=[]
    for i in data:
        (train if np.random.random()<trainPercent else test).append(i)
    return train,test

#Bonus: a one-line quicksort. Not recommended in normal use -- it has some (minor) efficiency issues:
#sort=lambda arr:arr if len(arr)<=1 else (sort(list(filter(lambda x:x<=arr[0],arr[1:]))) + [arr[0]]+sort(list(filter(lambda x:x>arr[0],arr[1:]))))
#Since this example sorts (training class, distance) tuples, the comparison is adjusted slightly to compare on the distance element:
sort=lambda arr:arr if len(arr)<=1 else (sort(list(filter(lambda x:x[1]<=arr[0][1],arr[1:]))) + [arr[0]]+sort(list(filter(lambda x:x[1]>arr[0][1],arr[1:]))))
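For reference, Python's built-in `sorted` with a `key` argument does the same job and is what you would normally use: it is O(n log n) in the worst case, whereas the recursive one-liner degrades to O(n²) (and deep recursion) on already-sorted input:

```python
pairs = [(2.0, 0.31), (1.0, 0.02), (3.0, 0.12)]  # (class, distance) tuples
print(sorted(pairs, key=lambda x: x[1]))
# [(1.0, 0.02), (3.0, 0.12), (2.0, 0.31)]
```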

def knn(train,test,k=3):
    rightNumber=0
    for testItem in test:
        item=testItem.tolist() #convert each test point to a list
        distArr=[]
        for trainItem in train:
            #for every training point, store a (training class, squared distance) pair
            #note: the square root is skipped on purpose -- it is monotonic, so it
            #does not change which neighbors are nearest
            distArr.append((trainItem[-1],((testItem[0:-1]-trainItem[0:-1])**2).sum()))

        distArr=sort(distArr)[0:k] #sort by distance and keep the k nearest elements
        #at this point distArr looks like, e.g.:
        #[(1.0, 0.02023), (1.0, 0.03006), (1.0, 0.0595)]
        #i.e. the class and squared distance of each of the k nearest points

        distArr=[x[0] for x in distArr] #keep only the classes, e.g. [1.0, 1.0, 1.0]

        countType={} #count how many neighbors fall into each class
        for distItem in distArr:
            if distItem in countType:
                countType[distItem]+=1
            else:
                countType[distItem]=1

        resultType=None
        if len(countType)==1 or len(countType)==k:
            #only one class present, or every class different: take the class
            #of the nearest neighbor
            resultType=distArr[0]
        else:
            #sort the classes by their counts
            typeArr=[]
            for countItem in countType:
                typeArr.append((countItem,countType[countItem]))
            typeArr=sort(typeArr)
            resultType=typeArr[-1][0] #take the most frequent class as the result

        if resultType==item[-1]:rightNumber+=1 #compare with the true class of the test record

    return rightNumber
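The hand-rolled vote counting inside `knn` can also be expressed with `collections.Counter` from the standard library; `most_common(1)` returns the class with the highest count (on a tie, the class encountered first wins):

```python
from collections import Counter

neighbour_types = [1.0, 2.0, 1.0]  # classes of the k nearest neighbours
counts = Counter(neighbour_types)
print(counts.most_common(1)[0][0])  # → 1.0
```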

train,test=splitData()
rightNumber=knn(train,test)
print("Training records: %d, test records: %d, correctly classified: %d, accuracy: %.2f%%."%(len(train),len(test),rightNumber,rightNumber/len(test)*100))
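Finally, the per-point distance loop inside `knn` can be vectorized with NumPy broadcasting, which is typically much faster for larger training sets since it avoids a Python-level loop. A minimal sketch (the three training rows below are illustrative samples, one per class):

```python
import numpy as np

train_arr = np.array([[5.1, 3.5, 1.4, 0.2],
                      [7.0, 3.2, 4.7, 1.4],
                      [6.3, 3.3, 6.0, 2.5]])
train_labels = np.array([1.0, 2.0, 3.0])
query = np.array([5.0, 3.4, 1.5, 0.2])

# Broadcasting computes all squared distances in one step -- no Python loop.
sq_dists = ((train_arr - query) ** 2).sum(axis=1)
k = 1
nearest = train_labels[np.argsort(sq_dists)[:k]]
print(nearest)  # [1.]
```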