Machine Learning in Action: Notes 1 - KNN Handwritten Digits

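Contents

- A brief look at KNN
  - How KNN works
  - Advantages
  - Disadvantages
- Python topics covered
  - 1. np.tile() tiling
  - 2. os.listdir(): listing the files in a directory
  - 3. open(filename, 'r'): opening a file
  - 4. argsort()
  - 5. Adding entries to a Python dictionary
  - 6. Sorting a Python dictionary by value
  - 7. Confusion matrix
  - 8. recall, precision, F-Measure
- Complete KNN example code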

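A brief look at KNN

How KNN works

Compute the distance between the feature vector to be classified and every training sample, pick the k training samples with the smallest distances, and count their labels; the label that occurs most often is the prediction. The distance measure is up to you, for example Euclidean or Manhattan distance; use whichever works better on your data. The choice of k also has a large effect on the result, so tune k (hyperparameter tuning) to get the best performance.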
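To make the two distance measures concrete, here is a minimal sketch. The function names euclidean and manhattan and the two sample points are made up for illustration; they are not part of the original notes.

```python
import numpy as np

def euclidean(a, b):
    """Straight-line (L2) distance between two vectors."""
    return float(np.sqrt(np.sum((a - b) ** 2)))

def manhattan(a, b):
    """City-block (L1) distance between two vectors."""
    return float(np.sum(np.abs(a - b)))

# Two hypothetical 2-D points
a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

print(euclidean(a, b))  # 5.0
print(manhattan(a, b))  # 7.0
```

Swapping one function for the other inside a KNN routine is all it takes to compare the two measures on the same data.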

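Advantages

1. The algorithm is simple and easy to implement.

2. It can be used for classification as well as regression.

3. It is insensitive to outliers, achieves high accuracy, and makes no assumptions about the data.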
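The regression use mentioned in point 2 can be sketched as follows: instead of voting on labels, average the target values of the k nearest samples. This is only one common way to do it; the function name knn_regress and the tiny data set are assumptions made for this illustration.

```python
import numpy as np

def knn_regress(x, data, targets, k):
    """Predict a number for x as the mean target of its k nearest samples."""
    distances = np.sqrt(((data - x) ** 2).sum(axis=1))  # Euclidean distance to every row
    nearest = distances.argsort()[:k]                   # indices of the k closest rows
    return float(targets[nearest].mean())

# Hypothetical one-feature data set where the target equals the feature
data = np.array([[1.0], [2.0], [3.0], [10.0]])
targets = np.array([1.0, 2.0, 3.0, 10.0])

print(knn_regress(np.array([2.1]), data, targets, k=3))  # 2.0
```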

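Disadvantages

1. High time and space complexity; both grow with the number of "training" samples, because every prediction compares the query against all of them.

2. It does not feel much like machine learning, because there is no model-training step.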

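Python topics covered

1. np.tile() tiling

Purpose: when computing distances in KNN, repeat the current feature vector so that it matches the shape of the whole data set, which lets NumPy subtract the two arrays element-wise in one step. (NumPy broadcasting would also handle this subtraction without an explicit np.tile() call.)

Reference: https://blog.csdn.net/qq_18433441/article/details/54897250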
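A small sketch of what np.tile() does in the distance computation, next to the equivalent broadcast form. The arrays x and data are made-up values for illustration.

```python
import numpy as np

x = np.array([1, 2, 3])          # one feature vector
data = np.array([[1, 2, 3],
                 [4, 6, 8]])     # a hypothetical data set with two samples

# Repeat x once per row of data, giving an array with the same shape as data
tiled = np.tile(x, (data.shape[0], 1))
print(tiled)
# [[1 2 3]
#  [1 2 3]]

diff_tile = tiled - data         # explicit tiling, as in the KNN code
diff_broadcast = x - data        # NumPy broadcasting gives the same result
print(diff_tile)
# [[ 0  0  0]
#  [-3 -4 -5]]
```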

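2. os.listdir(): listing the files in a directory

Purpose: returns the names of all entries in a directory; useful for walking a folder and reading the data files in it.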
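A minimal sketch of walking a folder with os.listdir(). It builds a temporary folder so it can run anywhere; the file names are hypothetical, chosen so that the first character is the label, which is what the KNN code in these notes relies on.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    # Create a few hypothetical data files whose first character is the label
    for name in ("0_1.txt", "3_7.txt", "9_2.txt"):
        with open(os.path.join(folder, name), "w") as f:
            f.write("0101\n")

    # os.listdir() returns names in arbitrary order, so sort them for stable output
    names = sorted(os.listdir(folder))
    labels = [name[0] for name in names]

print(names)   # ['0_1.txt', '3_7.txt', '9_2.txt']
print(labels)  # ['0', '3', '9']
```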

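3. open(filename, 'r'): opening a file

Purpose: opens a text file for reading so you get its contents as a string, which can then be processed with methods such as split() or replace().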
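A short sketch of reading a text file and turning it into a list of integers. The file name sample.txt and its contents are made up; the with statement is used so the file is closed automatically.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "sample.txt")  # hypothetical file
    with open(path, "w") as f:
        f.write("0110\n1001\n")

    # Read the whole file as one string and drop the line breaks
    with open(path, "r") as f:
        contents = f.read().replace("\n", "")

vector = [int(ch) for ch in contents]  # one int per character
print(contents)  # 01101001
print(vector)    # [0, 1, 1, 0, 1, 0, 0, 1]
```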

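4. argsort()

Purpose: returns the indices that would sort the array in ascending order; useful for visiting only the first (smallest) or last (largest) few elements.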
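A quick sketch of argsort() on a made-up distance array, showing how to take the nearest few entries or walk the array in descending order.

```python
import numpy as np

distance = np.array([0.9, 0.1, 0.5, 0.3])  # hypothetical distances

order = distance.argsort()   # indices from smallest to largest distance
print(order)                 # [1 3 2 0]
print(order[:2])             # [1 3]  -> the two nearest samples
print(order[::-1])           # [0 2 3 1]  -> largest distance first
```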

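5. Adding entries to a Python dictionary

classCount[one_label] = classCount.get(one_label, 0) + 1

If the key one_label does not exist yet, get() returns the default 0; otherwise it returns the current value. Either way the result is incremented by 1 and stored back.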
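The same idiom in a runnable loop, counting how often each label occurs. The list of labels is made up for illustration.

```python
labels = ["3", "7", "3", "3", "7"]  # hypothetical labels of the k nearest samples

class_count = {}
for one_label in labels:
    # Missing keys start from 0, existing keys continue from their current count
    class_count[one_label] = class_count.get(one_label, 0) + 1

print(class_count)  # {'3': 3, '7': 2}
```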

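6. Sorting a Python dictionary by value

sortedClassCount = sorted(classCount.items(), key = lambda x:x[1], reverse=True)

The three arguments are the iterable to sort, the key function that selects which value to sort by, and whether to reverse the order (the default is ascending).

Sorting by key works the same way, with x[0] as the sort key.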
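A small sketch of both sorts on a made-up dictionary. Each item is a (key, value) tuple, so x[1] selects the value and x[0] selects the key.

```python
class_count = {"7": 2, "3": 5, "0": 1}  # hypothetical label counts

# Largest count first
by_value = sorted(class_count.items(), key=lambda x: x[1], reverse=True)
# Ascending by key
by_key = sorted(class_count.items(), key=lambda x: x[0])

print(by_value)        # [('3', 5), ('7', 2), ('0', 1)]
print(by_key)          # [('0', 1), ('3', 5), ('7', 2)]
print(by_value[0][0])  # 3  -> the most frequent label
```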


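7. Confusion matrix

A confusion matrix tabulates, for each true class, how many samples were predicted as each class, which makes it very helpful for analysing prediction results. The figure shows the confusion matrix for handwritten digit recognition: it lays out clearly what each of the ten digits was predicted as versus what it actually was.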

[Figure: confusion matrix of the handwritten digit predictions]

Code: compute the confusion matrix with the function provided by sklearn and plot it with seaborn

# Plot the confusion matrix
def sklearn_confusion_matrix(ytest, ymodel):
    from sklearn.metrics import confusion_matrix
    import seaborn as sns
    import matplotlib.pyplot as plt
    
    mat = confusion_matrix(ytest, ymodel)
    print(mat)
    sns.heatmap(mat, square=True, annot=True, cbar=False)
    plt.xlabel('predicted value')
    plt.ylabel('true value')

sklearn_confusion_matrix(test_labels, result)

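8. recall, precision, F-Measure

This blog post explains them very well and in great detail:

https://www.cnblogs.com/Zhi-Z/p/8728168.html

These concepts are defined for binary classification. For a multi-class problem such as ours (handwritten digit recognition) they do not apply directly, but they can still be computed per class by treating one digit as positive and all the others as negative.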
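For a multi-class problem, one common approach is to read per-class precision and recall off the confusion matrix, treating each class in turn as the positive one. The sketch below assumes the sklearn layout (rows are true classes, columns are predicted classes); the 3-class matrix is made up for illustration.

```python
import numpy as np

# Hypothetical confusion matrix: rows = true class, columns = predicted class
mat = np.array([[5, 1, 0],
                [2, 6, 2],
                [0, 1, 8]])

tp = np.diag(mat)                  # correctly predicted samples of each class
precision = tp / mat.sum(axis=0)   # divide by column sums: everything predicted as the class
recall = tp / mat.sum(axis=1)      # divide by row sums: everything that truly is the class
f1 = 2 * precision * recall / (precision + recall)

for cls in range(len(mat)):
    print("class {}: precision = {:.3f}, recall = {:.3f}, f1 = {:.3f}".format(
        cls, precision[cls], recall[cls], f1[cls]))
```

Averaging the per-class values gives a single macro-averaged score for the whole classifier.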

[Figure]

Image source: https://www.cnblogs.com/Zhi-Z/p/8728168.html

A code example

def get_precision_recall_f1(trues, predicts):
    # Count the four outcomes, with 1 as the positive class and 0 as the negative class
    TP = TN = FP = FN = 0
    for true, predict in zip(trues, predicts):
        if true == 1 and predict == 1:
            TP += 1  # true positive
        elif true == 0 and predict == 0:
            TN += 1  # true negative
        elif true == 0 and predict == 1:
            FP += 1  # false positive: predicted 1, actually 0
        elif true == 1 and predict == 0:
            FN += 1  # false negative: predicted 0, actually 1
    precision = TP / (TP + FP)
    recall = TP / (TP + FN)
    f1 = 2*precision*recall / (precision + recall)
    return precision, recall, f1

trues = [1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1]
predicts = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1]
sklearn_confusion_matrix(trues, predicts)
    
precision, recall, f1 = get_precision_recall_f1(trues, predicts)
print("precision = {}, recall = {}, f1 = {}".format(precision, recall, f1))

Output

[Figure: screenshot of the printed output]

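Complete KNN example code

I wrote the code in a Jupyter notebook and exported it directly as a .py file, so the formatting is not great. The code will not run as-is either (it expects the IPython environment and the ./digits data folder); it is for reference only.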

#!/usr/bin/env python
# coding: utf-8
# In[52]:


import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import os
get_ipython().run_line_magic('matplotlib', 'inline')


# In[53]:


def getFileArray(filename):
    # Read every digit file in the folder `filename` into one row of 1024 ints
    file_array = []
    labels = []
    train_files = os.listdir(filename)
    for each in train_files:
        labels.append(each[0])  # the first character of the file name is the label
        with open(os.path.join(filename, each), 'r') as tmp_file:
            contents = tmp_file.read().replace('\n', '')
        tmp = []
        for i in range(1024):
            tmp.append(int(contents[i]))
        file_array.append(tmp)
    return np.array(file_array), np.array(labels)


# In[54]:


digits_matrix, digit_labels = getFileArray('./digits/trainingDigits')
test_x, test_labels = getFileArray('./digits/testDigits')


# In[55]:


def knn(xtest, data, label, k): # xtest: feature vector to classify; data, label: the "training" set; k: number of nearest neighbours that vote
#     print(xtest.shape)
#     print(label.shape)
    exp_xtest = np.tile(xtest, (len(label), 1)) - data
    sq_diff = exp_xtest**2
    sum_diff = sq_diff.sum(axis=1)
    distance = sum_diff**0.5
#     print(distance)
    sort_index = distance.argsort()
    classCount = {}
    for i in range(k):
        one_label = label[sort_index[i]]
        classCount[one_label] = classCount.get(one_label, 0) + 1
    sortedClassCount = sorted(classCount.items(), key = lambda x:x[1], reverse=True)
#     print(sortedClassCount)
    return sortedClassCount[0][0]


# In[56]:


result = []
for i in range(len(test_x)):
    result.append(knn(test_x[i], digits_matrix, digit_labels, 3))


# In[57]:


print(test_labels)
print(len(test_labels))
print(len(result))


# In[51]:


# print(result)
wrong = 0
right = 0
for i in range(len(result)):
    if result[i] != test_labels[i]:
        wrong += 1
    else:
        right += 1
print("wrong = {} right = {}".format(wrong, right))
print("accurate: {}".format(right / (wrong + right)))
print("wrong rate: {}".format(wrong / (wrong + right)))


# In[70]:


# Plot the confusion matrix
def sklearn_confusion_matrix(ytest, ymodel):
    from sklearn.metrics import confusion_matrix
    import seaborn as sns
    import matplotlib.pyplot as plt
    
    mat = confusion_matrix(ytest, ymodel)
    print(mat)
    sns.heatmap(mat, square=True, annot=True, cbar=False)
    plt.xlabel('predicted value')
    plt.ylabel('true value')

sklearn_confusion_matrix(test_labels, result)


# In[68]:


dic1 = {'1': 2, '3': 5, '0': -1}
print(dic1.items())
x = [0, 3, 1]
y = lambda x:x[2]
print(y(x))


# In[92]:


def get_precision_recall_f1(trues, predicts):
    # Count the four outcomes, with 1 as the positive class and 0 as the negative class
    TP = TN = FP = FN = 0
    for true, predict in zip(trues, predicts):
        if true == 1 and predict == 1:
            TP += 1  # true positive
        elif true == 0 and predict == 0:
            TN += 1  # true negative
        elif true == 0 and predict == 1:
            FP += 1  # false positive: predicted 1, actually 0
        elif true == 1 and predict == 0:
            FN += 1  # false negative: predicted 0, actually 1
    precision = TP / (TP + FP)
    recall = TP / (TP + FN)
    f1 = 2*precision*recall / (precision + recall)
    return precision, recall, f1

trues = [1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1]
predicts = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1]
sklearn_confusion_matrix(trues, predicts)
    
precision, recall, f1 = get_precision_recall_f1(trues, predicts)
print("precision = {}, recall = {}, f1 = {}".format(precision, recall, f1))
