
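[python] [kNN] [OCR] Implementing character recognition in Python

I. Problem

OCR (optical character recognition) is one of the important applications of machine learning. A typical pipeline includes binarization, denoising, skew correction, feature extraction, character segmentation, character recognition, and post-processing. Character segmentation is usually the hardest step, and character recognition is the most critical one. Common approaches to character recognition include kNN, SVM, and CNN, of which SVM is a comparatively practical choice. This post implements the simpler kNN (k-nearest neighbors) algorithm and applies it to character recognition on the classic MNIST dataset, which contains 60,000 training samples and 10,000 test samples.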

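II. Principle

The core idea of kNN: if most of the k nearest neighbors of a sample in feature space belong to a certain class, the sample is assigned to that class as well and is assumed to share the characteristics of that class. Generally speaking, kNN is better suited than other methods to sample sets whose class regions cross or overlap heavily.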
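As a minimal illustration of this idea (a toy sketch on made-up 2-D points, not the MNIST code used later in the post), the function below labels a query point by majority vote among its k nearest neighbors. The name `knn_predict` and the sample data are assumptions introduced only for this example.

```python
import numpy as np
from collections import Counter

def knn_predict(query, samples, labels, k):
    """Return the majority label among the k samples closest to query."""
    samples = np.asarray(samples, dtype=float)
    query = np.asarray(query, dtype=float)
    # Euclidean distance from the query to every sample (one row per sample).
    dists = np.linalg.norm(samples - query, axis=1)
    # Indices of the k smallest distances.
    nearest = np.argsort(dists, kind="stable")[:k]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two clusters labeled "a" and "b".
samples = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
labels = ["a", "a", "a", "b", "b", "b"]
print(knn_predict([0.5, 0.5], samples, labels, k=3))  # a
print(knn_predict([5.5, 5.5], samples, labels, k=3))  # b
```

The classifier stores no model: every prediction is a fresh distance computation against all stored samples, which is why running time grows with the size of the training set.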

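III. Solution

① Load the data and look at a sample image to settle on a basic approach

② Define the kNN function

③ Call the kNN function

④ Compare how the training-set size and the distance metric affect accuracy and running time

First come the routine steps of loading the dataset, splitting it into training and test sets, and displaying a sample image. Readers need not spend much time on this code and can skip ahead to the next code segment.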

%matplotlib inline

import os

import numpy as np
from PIL import Image
from matplotlib import pyplot as plt

# Placeholder: replace with the folder that contains mnist.npz
DATASET_PATH = r'D:\path\to\dataset'
DATASET_FILE = os.path.join(DATASET_PATH, 'mnist.npz')

f = np.load(DATASET_FILE)
x_train, y_train = f['x_train'], f['y_train']
x_test, y_test = f['x_test'], f['y_test']

# uint8 is an unsigned 8-bit integer type with values 0 to 255
def img_show(img):
    plt.imshow(Image.fromarray(np.uint8(img)))
    plt.axis('on')  # use 'off' to hide the axes
    plt.title('image')  # figure title
    plt.show()

img = x_train[0]  # take the first training image
print(img.shape)  # (28, 28) for the usual mnist.npz; (784,) if the file stores flattened images
img = img.reshape(28, 28)  # make sure the image has its original 28 x 28 shape
img_show(img)
           

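In practice a single distance metric is enough, and the most common choice is Euclidean distance. Four metrics are defined here (Euclidean, Manhattan, Chebyshev, and Minkowski) so that they can be compared. As the results later in the post show, Euclidean distance performed best. Note that the Minkowski distance has an order parameter p, and the definition below uses p = 2, which makes it mathematically the same as the Euclidean distance.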

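def _diff(x, y):
    # Cast to float first: MNIST pixels are uint8, and subtracting
    # uint8 arrays wraps around modulo 256 instead of going negative.
    return np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
def euclidean_dist(x, y):
    return np.linalg.norm(_diff(x, y))
def manhattan_dist(x, y):
    return np.sum(np.abs(_diff(x, y)))
def chebyshev_dist(x, y):
    return np.max(np.abs(_diff(x, y)))
def minkowski_dist(x, y, p=2):
    # Minkowski distance of order p; with the default p = 2 it equals the Euclidean distance.
    return np.sum(np.abs(_diff(x, y)) ** p) ** (1.0 / p)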
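To make the four definitions concrete, here is a small worked example on two hand-picked vectors (the values are illustrative assumptions, not MNIST data). It also shows a pitfall worth knowing when working with image arrays: MNIST pixels are stored as uint8, and subtracting uint8 arrays wraps around modulo 256 rather than producing negative numbers.

```python
import numpy as np

x = np.array([0, 3, 10], dtype=float)
y = np.array([4, 0, 10], dtype=float)
d = np.abs(x - y)                    # element-wise gaps: [4, 3, 0]

print(np.sqrt(np.sum(d ** 2)))       # Euclidean: 5.0
print(np.sum(d))                     # Manhattan: 7.0
print(np.max(d))                     # Chebyshev: 4.0
print(np.sum(d ** 3) ** (1 / 3))     # Minkowski with p = 3

# Pitfall: with uint8 inputs the difference wraps around.
a = np.array([0], dtype=np.uint8)
b = np.array([4], dtype=np.uint8)
print(a - b)                                     # [252], not [-4]
print(a.astype(np.int16) - b.astype(np.int16))   # [-4]
```

Casting to a signed or floating-point type before subtracting avoids the wraparound.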
           

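Next comes the definition of the kNN function. It takes four parameters:

x is the image to be labeled, with its pixel matrix flattened into a vector;

M is the sample matrix, i.e. the training samples that serve as reference points (kNN has no separate training phase);

k is the number of neighbors considered for the target point;

dtype selects the distance metric; the options 0, 1, 2, 3 correspond, in order, to the four metrics above.

The function returns the indices in M of the k samples closest to x.

Note that both x and M are flattened before being passed in, and both must be arrays for the NumPy distance computations to work. Whatever type is passed in (in practice a list), it is therefore converted to an array first.

Also note that M[:10000] below means only the first 10,000 of the training samples passed in are used. This speeds up the computation at the cost of some accuracy, and the number can be changed as needed.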

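def KNN(x, M, k, dtype):
    # Cast to float so that differences of uint8 pixels cannot wrap around.
    x = np.asarray(x, dtype=float)
    M = np.asarray(M[:10000], dtype=float)  # adjust the training-set size here
    # dtype 0..3 selects one of the four distance functions defined above.
    dist_funcs = [euclidean_dist, manhattan_dist, chebyshev_dist, minkowski_dist]
    dist_func = dist_funcs[dtype]
    # Distance from x to every training sample.
    dist = np.array([dist_func(x, a) for a in M])
    # argsort yields the indices of the k smallest distances; unlike
    # list.index on sorted values, it never repeats an index when distances tie.
    idx = np.argsort(dist, kind='stable')[:k].tolist()
    return idx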
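The Python-level loop above calls a distance function once per training sample, which dominates the running time. As a sketch of one possible speed-up (an alternative, not the code used to produce the results in this post), the Euclidean case can be computed for all samples at once with NumPy broadcasting. The name `knn_indices_vectorized` is hypothetical.

```python
import numpy as np

def knn_indices_vectorized(x, M, k):
    """Indices of the k rows of M closest to x in Euclidean distance."""
    x = np.asarray(x, dtype=float)
    M = np.asarray(M, dtype=float)
    # Broadcasting subtracts x from every row; the sum runs over the feature axis.
    sq_dist = np.sum((M - x) ** 2, axis=1)
    # Squared distances sort in the same order as distances, so sqrt is unnecessary.
    return np.argsort(sq_dist, kind="stable")[:k].tolist()

# Tiny stand-in for flattened images: 4 samples with 3 features each.
M = np.array([[0, 0, 0], [10, 10, 10], [1, 1, 1], [9, 9, 9]], dtype=np.uint8)
print(knn_indices_vectorized([0, 0, 1], M, k=2))  # [0, 2]
```

Converting the training matrix to float once, outside the per-image call, would save further time when many test images are classified.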
           

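Define the lookup function. Its input is the vector of indices of the target point's k neighbors, and its output is the digit those k neighbors vote for.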

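from collections import Counter

def find(x_result):
    # Labels of the k neighbors (x_result holds indices into the global y_train).
    y_result = y_train[x_result]
    # Majority vote: most_common(1) returns [(label, count)].
    res0 = Counter(y_result).most_common(1)
    res = int(res0[0][0])
    print("The digit is " + str(res))
    return res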
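One detail of the vote is worth spelling out: when two labels receive the same number of votes, `Counter.most_common` returns the label that was encountered first. If the neighbor labels are listed nearest first, a tie therefore goes to the class of the nearer neighbor. The helper name `majority_label` below is an assumption used only for this demonstration.

```python
from collections import Counter

def majority_label(neighbor_labels):
    """Most common label; ties go to the label encountered first."""
    return Counter(neighbor_labels).most_common(1)[0][0]

print(majority_label([3, 5, 5, 3]))  # 3 (tie, 3 was seen first)
print(majority_label([5, 3, 3, 5]))  # 5 (tie, 5 was seen first)
print(majority_label([9, 4, 4]))     # 4 (clear majority)
```

Choosing an odd k reduces ties between two classes but cannot rule them out when ten classes are involved.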
           

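Define the validation function. A prediction counts as correct when it matches the true label; accuracy is the number of correct predictions divided by the total number checked.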

def vali(result, r_result):
    r_con = 0  # number of correct predictions
    c_con = 0  # number of predictions checked
    for i in range(len(result)):
        if result[i] == r_result[i]:
            r_con += 1
        c_con += 1
    print(str(r_con) + " results are correct; accuracy is " + str(r_con / c_con))
           

This step flattens every image in x_train into a vector.

new_x_train = []
for i in x_train:
    new_x_train.append(np.ravel(i))
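The same flattening can be done in a single call, assuming x_train is a NumPy array of shape (N, 28, 28) as in the usual mnist.npz file. The example uses a small random stand-in array so that it runs without the dataset.

```python
import numpy as np

# Stand-in for x_train: 5 fake "images" of 28 x 28 uint8 pixels.
rng = np.random.default_rng(0)
fake_x_train = rng.integers(0, 256, size=(5, 28, 28), dtype=np.uint8)

# -1 lets NumPy infer the flattened length (28 * 28 = 784).
flat = fake_x_train.reshape(len(fake_x_train), -1)
print(flat.shape)  # (5, 784)

# Row i equals np.ravel of image i, i.e. what the loop above appends.
print(np.array_equal(flat[0], np.ravel(fake_x_train[0])))  # True
```

The result is a single 2-D array rather than a list of 1-D arrays, which avoids a list-to-array conversion later.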
           

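This step calls the kNN function on the first 500 test images and prints the results.

Note that the value passed for dtype here is 3, which selects the Minkowski distance. This argument can be changed as needed.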

y_result = []
for i in x_test[:500]:
    x_result = KNN(np.ravel(i),new_x_train,11,3)
    y_result.append(find(x_result))
print(y_result)
           

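Two adjustable settings were mentioned above. Varying them one at a time gives the predicted labels (y_result) obtained with 1,000, 10,000, and 60,000 training samples (using Euclidean distance), and the predictions obtained with each of the four distance metrics on 1,000 training samples.

The results obtained above are copied by hand below so that their accuracy can be compared.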

y_r1000  = [7, 2, 1, 0, 0, 1, 4, 9, 8, 9, 0, 6, 9, 0, 1, 1, 9, 7, 3, 4, 9, 6, 6, 5, 1, 0, 7, 9, 0, 1, 3, 1, 3, 6, 7, 2, 7, 1, 1, 1, 1, 7, 9, 1, 1, 5, 1, 1, 9, 4, 6, 3, 5, 1, 0, 0, 4, 1, 9, 1, 7, 8, 9, 1, 7, 1, 7, 4, 3, 0, 7, 0, 2, 7, 1, 7, 1, 7, 1, 7, 9, 6, 2, 7, 8, 4, 7, 5, 6, 1, 3, 6, 1, 3, 1, 9, 1, 7, 6, 9, 1, 0, 5, 4, 9, 9, 2, 1, 9, 4, 8, 1, 1, 9, 1, 1, 9, 4, 7, 7, 5, 6, 7, 6, 7, 9, 0, 5, 8, 5, 6, 6, 5, 7, 8, 1, 0, 1, 6, 7, 6, 7, 3, 1, 9, 1, 8, 2, 0, 1, 9, 9, 9, 5, 1, 5, 6, 0, 3, 1, 4, 6, 5, 4, 6, 5, 4, 1, 1, 9, 9, 7, 3, 1, 1, 1, 1, 8, 1, 8, 1, 8, 1, 0, 1, 6, 2, 5, 0, 1, 1, 1, 0, 7, 0, 1, 1, 6, 9, 2, 3, 6, 1, 1, 1, 1, 9, 3, 2, 9, 7, 1, 9, 1, 9, 0, 3, 8, 7, 5, 9, 0, 2, 7, 1, 3, 8, 1, 1, 1, 5, 1, 8, 7, 7, 9, 2, 1, 4, 1, 5, 3, 8, 7, 1, 1, 0, 6, 4, 1, 9, 1, 9, 1, 7, 1, 1, 1, 2, 0, 8, 1, 7, 7, 1, 1, 0, 1, 3, 0, 3, 0, 1, 9, 9, 9, 1, 8, 2, 1, 2, 9, 1, 1, 9, 2, 6, 4, 1, 1, 8, 2, 9, 1, 0, 9, 0, 0, 2, 8, 1, 7, 1, 1, 9, 0, 1, 1, 4, 1, 3, 0, 0, 5, 1, 9, 1, 5, 0, 6, 1, 1, 9, 1, 6, 9, 6, 0, 7, 1, 1, 1, 1, 3, 3, 1, 9, 7, 0, 6, 5, 1, 1, 3, 8, 1, 0, 5, 1, 1, 1, 5, 0, 6, 1, 8, 5, 1, 1, 9, 4, 6, 7, 1, 5, 0, 6, 5, 6, 1, 7, 2, 0, 8, 8, 5, 9, 1, 1, 4, 0, 7, 3, 7, 6, 1, 6, 1, 1, 7, 2, 8, 6, 1, 7, 5, 1, 5, 4, 4, 2, 1, 1, 1, 1, 4, 5, 0, 5, 1, 7, 7, 8, 7, 9, 9, 1, 9, 2, 1, 1, 2, 9, 2, 0, 9, 9, 1, 4, 1, 1, 1, 6, 4, 9, 8, 9, 3, 7, 6, 0, 0, 3, 1, 8, 0, 6, 1, 9, 5, 3, 3, 1, 3, 9, 1, 1, 6, 9, 0, 9, 6, 6, 6, 7, 8, 8, 2, 8, 8, 8, 7, 6, 1, 8, 4, 1, 2, 1, 3, 1, 9, 7, 1, 9, 0, 8, 9, 9, 1, 6, 5, 2, 3, 7, 6, 9, 1, 0, 1]
y_r10000 = [7, 2, 1, 0, 4, 1, 9, 9, 4, 9, 0, 6, 9, 0, 1, 5, 9, 7, 3, 4, 9, 6, 6, 5, 4, 0, 7, 4, 0, 1, 3, 1, 3, 6, 7, 2, 7, 1, 1, 1, 1, 7, 4, 2, 1, 5, 1, 1, 9, 4, 6, 3, 5, 1, 6, 0, 4, 1, 9, 1, 7, 8, 1, 1, 7, 1, 6, 4, 3, 0, 7, 0, 2, 9, 1, 7, 1, 7, 9, 7, 9, 6, 2, 7, 8, 4, 7, 3, 6, 1, 3, 6, 1, 3, 1, 4, 1, 7, 6, 9, 6, 0, 5, 4, 9, 9, 2, 1, 9, 9, 8, 1, 1, 9, 7, 1, 1, 4, 9, 7, 8, 6, 1, 6, 7, 9, 0, 5, 8, 5, 6, 6, 8, 7, 8, 1, 0, 1, 6, 9, 6, 7, 3, 1, 7, 1, 8, 2, 0, 1, 9, 8, 5, 8, 1, 5, 6, 0, 3, 1, 4, 6, 5, 4, 6, 5, 4, 1, 1, 4, 9, 7, 3, 1, 2, 1, 1, 8, 1, 8, 1, 8, 1, 0, 1, 9, 2, 3, 0, 1, 1, 1, 0, 9, 0, 1, 1, 6, 9, 2, 3, 6, 1, 1, 1, 3, 9, 8, 2, 9, 7, 5, 9, 1, 9, 0, 3, 6, 5, 5, 7, 2, 2, 7, 1, 3, 8, 1, 1, 1, 3, 1, 8, 7, 1, 9, 2, 1, 4, 1, 5, 8, 8, 7, 1, 6, 0, 6, 4, 1, 9, 1, 9, 5, 7, 1, 1, 1, 2, 6, 8, 1, 7, 7, 1, 1, 8, 1, 3, 0, 3, 0, 1, 9, 9, 9, 1, 8, 2, 1, 2, 9, 1, 1, 9, 2, 6, 9, 1, 5, 9, 2, 9, 2, 0, 9, 0, 0, 2, 8, 1, 7, 1, 1, 9, 0, 2, 9, 4, 1, 3, 0, 0, 3, 1, 9, 1, 5, 3, 5, 1, 7, 9, 1, 6, 9, 6, 0, 7, 1, 1, 2, 1, 5, 3, 1, 9, 7, 8, 6, 6, 1, 1, 3, 8, 1, 0, 5, 1, 3, 1, 8, 0, 6, 1, 8, 5, 1, 9, 9, 4, 6, 7, 2, 8, 0, 6, 5, 6, 1, 7, 2, 0, 8, 8, 5, 4, 1, 1, 4, 0, 7, 3, 7, 6, 1, 6, 1, 1, 9, 2, 8, 6, 1, 7, 5, 2, 5, 4, 4, 2, 1, 3, 9, 2, 4, 5, 0, 3, 1, 7, 7, 8, 7, 9, 7, 1, 9, 2, 1, 9, 2, 9, 2, 0, 4, 9, 1, 8, 8, 1, 1, 6, 5, 9, 1, 9, 3, 7, 6, 0, 0, 3, 1, 8, 0, 6, 9, 8, 3, 3, 8, 1, 3, 9, 1, 1, 6, 8, 0, 9, 6, 6, 6, 7, 8, 8, 2, 8, 5, 8, 9, 6, 1, 8, 4, 1, 2, 6, 9, 1, 9, 7, 1, 9, 0, 8, 9, 9, 1, 0, 5, 2, 3, 7, 6, 9, 1, 8, 1]
y_r60000 = [7, 2, 1, 0, 4, 1, 9, 9, 0, 9, 0, 6, 9, 0, 1, 8, 9, 7, 3, 4, 9, 6, 6, 5, 4, 0, 7, 4, 0, 1, 3, 1, 3, 0, 7, 2, 7, 1, 2, 1, 1, 7, 4, 2, 1, 5, 1, 1, 9, 4, 6, 3, 5, 0, 6, 0, 4, 1, 9, 1, 7, 8, 4, 3, 7, 1, 6, 4, 3, 0, 7, 0, 2, 9, 1, 7, 1, 7, 9, 7, 9, 6, 2, 7, 8, 4, 7, 8, 6, 1, 3, 6, 1, 3, 1, 4, 1, 7, 6, 9, 6, 0, 5, 4, 9, 9, 2, 1, 9, 9, 8, 1, 1, 9, 1, 9, 9, 4, 9, 8, 8, 6, 7, 6, 7, 4, 0, 5, 8, 5, 6, 6, 3, 7, 8, 1, 0, 1, 6, 9, 6, 7, 3, 1, 7, 1, 8, 2, 0, 1, 9, 8, 5, 3, 1, 5, 6, 0, 3, 1, 8, 6, 5, 4, 6, 5, 4, 5, 1, 4, 9, 7, 2, 1, 2, 1, 1, 8, 1, 8, 1, 8, 1, 0, 1, 9, 2, 3, 0, 1, 1, 1, 0, 9, 0, 1, 1, 6, 4, 2, 3, 6, 1, 1, 1, 1, 9, 5, 2, 9, 4, 5, 9, 1, 9, 0, 3, 6, 5, 5, 7, 2, 2, 7, 1, 2, 8, 1, 1, 7, 3, 1, 8, 8, 7, 9, 2, 2, 4, 1, 5, 8, 8, 7, 1, 2, 0, 2, 4, 1, 9, 1, 9, 5, 7, 1, 2, 1, 2, 6, 8, 5, 7, 7, 1, 1, 8, 1, 8, 0, 3, 0, 1, 9, 9, 9, 1, 8, 2, 1, 2, 9, 1, 5, 9, 2, 6, 4, 1, 8, 9, 2, 9, 1, 0, 4, 0, 0, 2, 8, 1, 7, 1, 7, 9, 0, 2, 1, 8, 1, 3, 0, 0, 3, 1, 9, 1, 5, 2, 8, 1, 7, 9, 3, 0, 9, 2, 0, 7, 1, 1, 2, 1, 8, 3, 1, 9, 7, 8, 6, 6, 1, 1, 3, 8, 1, 0, 5, 1, 3, 1, 5, 0, 6, 1, 8, 5, 1, 8, 4, 4, 6, 8, 2, 5, 0, 6, 5, 6, 3, 7, 2, 0, 8, 8, 5, 4, 1, 1, 4, 0, 7, 3, 7, 6, 1, 6, 1, 1, 9, 2, 8, 6, 1, 9, 5, 2, 5, 4, 4, 2, 1, 3, 8, 7, 4, 5, 0, 3, 1, 7, 7, 8, 7, 9, 7, 1, 9, 2, 1, 1, 2, 9, 2, 0, 4, 9, 1, 4, 8, 1, 8, 1, 5, 9, 8, 8, 3, 7, 6, 0, 0, 3, 1, 8, 0, 6, 9, 8, 3, 3, 3, 2, 3, 9, 1, 1, 6, 8, 0, 9, 6, 6, 6, 7, 8, 8, 2, 7, 8, 8, 9, 6, 1, 8, 4, 1, 2, 1, 8, 1, 9, 7, 1, 4, 0, 8, 9, 9, 1, 0, 5, 2, 3, 7, 6, 9, 4, 0, 1]

y_r1000_eu = [7, 2, 1, 0, 0, 1, 4, 9, 8, 9, 0, 6, 9, 0, 1, 1, 9, 7, 3, 4, 9, 6, 6, 5, 1, 0, 7, 9, 0, 1, 3, 1, 3, 6, 7, 2, 7, 1, 1, 1, 1, 7, 9, 1, 1, 5, 1, 1, 9, 4, 6, 3, 5, 1, 0, 0, 4, 1, 9, 1, 7, 8, 9, 1, 7, 1, 7, 4, 3, 0, 7, 0, 2, 7, 1, 7, 1, 7, 1, 7, 9, 6, 2, 7, 8, 4, 7, 5, 6, 1, 3, 6, 1, 3, 1, 9, 1, 7, 6, 9, 1, 0, 5, 4, 9, 9, 2, 1, 9, 4, 8, 1, 1, 9, 1, 1, 9, 4, 7, 7, 5, 6, 7, 6, 7, 9, 0, 5, 8, 5, 6, 6, 5, 7, 8, 1, 0, 1, 6, 7, 6, 7, 3, 1, 9, 1, 8, 2, 0, 1, 9, 9, 9, 5, 1, 5, 6, 0, 3, 1, 4, 6, 5, 4, 6, 5, 4, 1, 1, 9, 9, 7, 3, 1, 1, 1, 1, 8, 1, 8, 1, 8, 1, 0, 1, 6, 2, 5, 0, 1, 1, 1, 0, 7, 0, 1, 1, 6, 9, 2, 3, 6, 1, 1, 1, 1, 9, 3, 2, 9, 7, 1, 9, 1, 9, 0, 3, 8, 7, 5, 9, 0, 2, 7, 1, 3, 8, 1, 1, 1, 5, 1, 8, 7, 7, 9, 2, 1, 4, 1, 5, 3, 8, 7, 1, 1, 0, 6, 4, 1, 9, 1, 9, 1, 7, 1, 1, 1, 2, 0, 8, 1, 7, 7, 1, 1, 0, 1, 3, 0, 3, 0, 1, 9, 9, 9, 1, 8, 2, 1, 2, 9, 1, 1, 9, 2, 6, 4, 1, 1, 8, 2, 9, 1, 0, 9, 0, 0, 2, 8, 1, 7, 1, 1, 9, 0, 1, 1, 4, 1, 3, 0, 0, 5, 1, 9, 1, 5, 0, 6, 1, 1, 9, 1, 6, 9, 6, 0, 7, 1, 1, 1, 1, 3, 3, 1, 9, 7, 0, 6, 5, 1, 1, 3, 8, 1, 0, 5, 1, 1, 1, 5, 0, 6, 1, 8, 5, 1, 1, 9, 4, 6, 7, 1, 5, 0, 6, 5, 6, 1, 7, 2, 0, 8, 8, 5, 9, 1, 1, 4, 0, 7, 3, 7, 6, 1, 6, 1, 1, 7, 2, 8, 6, 1, 7, 5, 1, 5, 4, 4, 2, 1, 1, 1, 1, 4, 5, 0, 5, 1, 7, 7, 8, 7, 9, 9, 1, 9, 2, 1, 1, 2, 9, 2, 0, 9, 9, 1, 4, 1, 1, 1, 6, 4, 9, 8, 9, 3, 7, 6, 0, 0, 3, 1, 8, 0, 6, 1, 9, 5, 3, 3, 1, 3, 9, 1, 1, 6, 9, 0, 9, 6, 6, 6, 7, 8, 8, 2, 8, 8, 8, 7, 6, 1, 8, 4, 1, 2, 1, 3, 1, 9, 7, 1, 9, 0, 8, 9, 9, 1, 6, 5, 2, 3, 7, 6, 9, 1, 0, 1]
y_r1000_ma = [7, 2, 1, 0, 9, 1, 4, 9, 9, 7, 0, 6, 9, 0, 1, 1, 9, 7, 3, 4, 9, 6, 6, 1, 1, 0, 7, 9, 0, 1, 3, 1, 3, 6, 7, 2, 7, 1, 1, 1, 1, 7, 9, 1, 1, 6, 1, 1, 9, 4, 6, 3, 5, 1, 0, 0, 4, 1, 9, 1, 7, 8, 1, 1, 7, 1, 1, 4, 3, 0, 7, 0, 3, 7, 1, 7, 1, 7, 1, 7, 9, 6, 2, 7, 1, 4, 7, 3, 6, 1, 3, 6, 1, 3, 1, 9, 1, 1, 6, 9, 1, 0, 5, 4, 9, 9, 2, 1, 9, 9, 1, 1, 1, 9, 1, 1, 1, 4, 7, 7, 5, 6, 1, 6, 7, 1, 0, 5, 8, 1, 6, 6, 5, 7, 8, 1, 0, 1, 6, 7, 1, 7, 3, 1, 7, 1, 9, 2, 0, 1, 9, 9, 1, 5, 1, 5, 6, 0, 3, 1, 4, 6, 5, 4, 6, 5, 4, 1, 1, 9, 9, 7, 3, 1, 1, 1, 1, 6, 1, 8, 1, 1, 1, 0, 1, 9, 2, 5, 0, 1, 1, 1, 0, 1, 0, 1, 1, 6, 9, 2, 0, 6, 1, 1, 1, 1, 9, 3, 1, 9, 7, 1, 9, 1, 9, 0, 3, 1, 7, 5, 9, 0, 2, 7, 1, 3, 8, 1, 1, 1, 5, 1, 6, 7, 1, 9, 2, 1, 4, 1, 5, 3, 8, 7, 1, 1, 0, 6, 4, 1, 9, 1, 9, 1, 7, 1, 1, 1, 2, 0, 8, 1, 7, 7, 1, 1, 0, 1, 3, 0, 3, 0, 1, 9, 9, 9, 1, 8, 2, 1, 2, 9, 1, 1, 9, 2, 6, 4, 1, 1, 9, 2, 9, 1, 0, 9, 0, 0, 2, 8, 1, 7, 1, 1, 7, 0, 1, 1, 4, 1, 1, 0, 0, 1, 1, 9, 1, 1, 0, 6, 1, 1, 9, 1, 6, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 9, 7, 5, 6, 1, 1, 1, 3, 8, 1, 0, 5, 1, 1, 1, 5, 0, 6, 1, 1, 5, 1, 1, 9, 4, 6, 7, 1, 5, 0, 1, 1, 6, 1, 7, 1, 1, 8, 1, 5, 9, 1, 1, 4, 0, 1, 3, 7, 6, 1, 6, 1, 1, 7, 2, 8, 6, 1, 7, 5, 1, 1, 4, 4, 2, 1, 1, 1, 1, 4, 5, 0, 5, 1, 7, 1, 8, 7, 9, 9, 1, 9, 1, 1, 1, 2, 9, 2, 0, 4, 9, 1, 1, 1, 1, 1, 1, 4, 9, 1, 1, 3, 7, 6, 0, 0, 3, 1, 1, 0, 6, 1, 9, 5, 3, 3, 1, 1, 9, 1, 1, 6, 9, 0, 9, 6, 6, 6, 7, 8, 8, 2, 7, 8, 1, 9, 6, 1, 8, 4, 1, 2, 1, 3, 1, 9, 7, 1, 4, 0, 7, 9, 9, 1, 6, 6, 2, 3, 7, 6, 9, 1, 0, 1]
y_r1000_ch = [4, 2, 1, 2, 9, 0, 7, 1, 4, 5, 0, 6, 6, 0, 5, 0, 0, 3, 4, 1, 5, 6, 5, 4, 4, 4, 2, 5, 3, 1, 0, 9, 2, 1, 1, 6, 5, 3, 1, 3, 1, 5, 3, 3, 9, 0, 9, 5, 5, 7, 6, 0, 0, 0, 4, 1, 2, 1, 4, 8, 5, 2, 3, 7, 4, 5, 7, 4, 5, 0, 2, 0, 0, 5, 0, 5, 3, 4, 0, 2, 9, 8, 9, 5, 4, 2, 0, 0, 5, 1, 9, 6, 1, 3, 3, 5, 9, 3, 9, 6, 1, 0, 9, 0, 1, 5, 4, 2, 5, 5, 4, 0, 2, 0, 9, 4, 4, 1, 4, 5, 0, 2, 0, 4, 3, 5, 9, 4, 6, 0, 5, 5, 5, 3, 0, 1, 6, 9, 5, 4, 6, 3, 3, 1, 3, 6, 6, 6, 0, 8, 5, 1, 4, 5, 1, 4, 5, 8, 0, 9, 5, 2, 0, 0, 5, 6, 5, 3, 1, 9, 4, 6, 3, 0, 0, 0, 2, 0, 2, 5, 0, 8, 6, 5, 0, 5, 0, 1, 6, 5, 4, 1, 2, 2, 0, 3, 1, 5, 5, 6, 5, 5, 9, 3, 1, 0, 0, 2, 2, 9, 4, 1, 5, 3, 0, 0, 9, 0, 3, 4, 7, 0, 0, 9, 8, 0, 4, 8, 1, 2, 0, 9, 5, 1, 0, 2, 5, 6, 1, 0, 5, 1, 8, 1, 2, 0, 6, 1, 9, 9, 0, 3, 6, 4, 1, 0, 0, 4, 5, 2, 2, 0, 7, 7, 1, 8, 5, 2, 4, 0, 0, 5, 1, 6, 4, 2, 0, 0, 5, 0, 0, 5, 2, 0, 5, 0, 6, 5, 5, 0, 6, 6, 4, 2, 1, 5, 0, 0, 0, 0, 1, 3, 8, 2, 4, 0, 2, 0, 1, 3, 5, 5, 6, 2, 5, 8, 1, 0, 3, 1, 0, 2, 5, 9, 5, 5, 4, 5, 6, 4, 0, 3, 1, 0, 6, 0, 1, 2, 2, 6, 6, 0, 2, 0, 1, 2, 0, 8, 3, 0, 0, 0, 1, 2, 8, 6, 7, 1, 5, 1, 1, 5, 4, 0, 0, 1, 5, 5, 6, 2, 5, 5, 0, 4, 5, 0, 0, 8, 3, 4, 4, 2, 5, 7, 5, 1, 6, 2, 2, 3, 3, 3, 6, 1, 9, 3, 1, 4, 0, 7, 6, 3, 0, 5, 9, 5, 8, 5, 1, 9, 1, 5, 5, 9, 5, 1, 3, 2, 6, 6, 8, 5, 3, 3, 2, 5, 0, 1, 9, 0, 1, 2, 3, 2, 1, 0, 0, 3, 7, 3, 3, 5, 3, 3, 9, 0, 5, 1, 3, 0, 5, 5, 6, 1, 2, 1, 5, 5, 1, 5, 0, 5, 5, 0, 1, 5, 2, 0, 5, 7, 3, 0, 9, 1, 5, 3, 1, 5, 1, 1, 3, 3, 4, 8, 0, 1, 0, 7, 0, 1, 0, 3, 5, 6, 3, 5, 2, 7, 1, 4]
y_r1000_mi = [1, 1, 1, 1, 6, 1, 5, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 5, 8, 6, 1, 1, 1, 1, 1, 1, 1, 1, 5, 1, 1, 1, 1, 6, 7, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 1, 1, 1, 1, 8, 5, 8, 8, 1, 1, 5, 1, 1, 1, 1, 8, 1, 8, 1, 1, 6, 1, 1, 1, 1, 0, 1, 1, 1, 7, 1, 1, 1, 1, 1, 1, 8, 0, 8, 1, 1, 1, 1, 1, 8, 1, 1, 1, 1, 8, 1, 1, 1, 8, 5, 1, 1, 8, 1, 1, 0, 1, 1, 1, 8, 1, 1, 5, 1, 1, 8, 6, 1, 7, 1, 1, 1, 1, 1, 1, 6, 1, 1, 8, 6, 1, 1, 1, 8, 1, 1, 1, 6, 1, 1, 1, 6, 1, 1, 1, 8, 1, 0, 1, 1, 7, 1, 5, 1, 1, 1, 1, 1, 1, 1, 6, 1, 5, 1, 1, 0, 8, 1, 1, 1, 1, 1, 8, 1, 1, 1, 1, 1, 1, 1, 1, 1, 5, 1, 1, 1, 6, 5, 1, 1, 1, 2, 5, 6, 1, 1, 8, 5, 1, 0, 1, 1, 1, 1, 1, 1, 1, 6, 1, 6, 1, 1, 1, 9, 6, 1, 1, 0, 1, 1, 5, 5, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 5, 5, 1, 1, 1, 1, 5, 1, 1, 1, 1, 8, 1, 1, 1, 1, 1, 1, 8, 1, 1, 6, 1, 1, 8, 5, 8, 1, 8, 1, 1, 1, 1, 1, 5, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 8, 1, 5, 1, 1, 6, 1, 8, 1, 6, 1, 6, 1, 1, 1, 1, 8, 1, 1, 1, 1, 1, 1, 1, 1, 1, 6, 1, 1, 5, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 8, 1, 1, 1, 1, 1, 1, 5, 1, 1, 1, 1, 1, 6, 1, 1, 1, 8, 1, 1, 1, 1, 5, 1, 5, 8, 1, 1, 1, 5, 1, 1, 8, 1, 1, 1, 1, 1, 1, 1, 1, 6, 8, 1, 1, 1, 1, 1, 1, 5, 1, 1, 4, 0, 5, 5, 5, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 6, 8, 8, 1, 1, 4, 1, 1, 1, 1, 1, 1, 5, 1, 8, 5, 1, 6, 1, 1, 1, 1, 5, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 6, 1, 1, 5, 1, 1, 1, 8, 1, 5, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 5, 1, 1, 1, 8, 6, 1, 8, 8, 1, 1, 5, 1, 6, 8, 1, 1, 5, 1, 5, 1, 1, 1, 1, 1, 1, 5, 0, 1, 1, 1, 1, 5, 5, 1, 1, 1, 1, 1, 1, 1, 1]
           

Call the validation function to compare the accuracy of the different settings.
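The call itself is not shown in the post. A sketch of what the comparison could look like follows, assuming each result list is compared with the first 500 labels of y_test. To keep the example runnable on its own, it uses a compact `accuracy` helper (a hypothetical name) and short made-up lists instead of the real data.

```python
import numpy as np

def accuracy(predicted, expected):
    """Fraction of positions where the prediction matches the true label."""
    predicted = np.asarray(predicted)
    expected = np.asarray(expected)
    return float(np.mean(predicted == expected))

# Made-up stand-ins; with the real data one would pass
# e.g. y_r1000 and y_test[:500].
true_labels = [7, 2, 1, 0, 4]
runs = {
    "run A": [7, 2, 1, 0, 4],
    "run B": [7, 2, 1, 0, 0],
}
for name, preds in runs.items():
    print(name, accuracy(preds, true_labels))
# run A 1.0
# run B 0.8
```

With the function defined earlier in the post, the equivalent call would be along the lines of vali(y_r1000, y_test[:500]), repeated for each result list.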

IV. Reflection

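[Figure: validation output, 1000 training samples, Euclidean distance]

[Figure: validation output, 10000 training samples, Euclidean distance]

[Figure: validation output, 60000 training samples, Euclidean distance]

[Figure: validation output, 1000 training samples, Euclidean distance]

[Figure: validation output, 1000 training samples, Manhattan distance]

[Figure: validation output, 1000 training samples, Chebyshev distance]

[Figure: validation output, 1000 training samples, Minkowski distance]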

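The data suggest that, in these runs, Euclidean distance is the most suitable of the four metrics for kNN character recognition. Since minkowski_dist as defined uses p = 2 and is therefore mathematically the Euclidean distance, the gap between those two recorded runs most likely comes from integer wraparound when uint8 pixel arrays are subtracted without first being cast to a wider type, not from the metrics themselves. The author also estimates that, on the author's machine, classifying all 10,000 test images against the full 60,000 training samples would take roughly 1.4 hours. For lack of time this was not verified, though such a running time would be acceptable.

The approach still has room for improvement; comments and discussion are welcome.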
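For readers who want to produce such an estimate themselves, one simple approach is to time a handful of classifications and extrapolate linearly. The sketch below assumes the per-image cost is roughly constant; `estimate_total_seconds` and the dummy workload are hypothetical stand-ins, not part of the original code.

```python
import time

def estimate_total_seconds(classify_one, samples, total):
    """Time classify_one on a few samples and extrapolate linearly to total."""
    start = time.perf_counter()
    for s in samples:
        classify_one(s)
    elapsed = time.perf_counter() - start
    return elapsed / len(samples) * total

# Dummy workload standing in for one kNN classification.
estimate = estimate_total_seconds(lambda s: sum(range(1000)), range(5), 10000)
print(estimate >= 0)  # True

# Back-of-the-envelope check of the figure quoted above:
# 1.4 hours over 10000 test images is about half a second per image.
per_image = 1.4 * 3600 / 10000
print(round(per_image, 3))  # 0.504
```

With the real code, `classify_one` could wrap a call to KNN followed by find, and `samples` could be the first few test images.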