Introduction

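Crawlers play an important role in practical Python work, and captcha recognition is one of the key problems in crawling. A captcha guards a website: if the crawler cannot get past it, nothing after it can run. This article walks through the details of captcha recognition in Python. Important notice: this is a technical discussion only and must not be used for illegal purposes; anyone who does so is liable under the law, and the author bears no responsibility.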

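Preparation: setting up the captcha-parsing environment

Installing Tesseract

Tesserocr is an OCR library for Python, but it is really a Python API wrapper around Tesseract. Its core is Tesseract, so Tesseract has to be installed before Tesserocr.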

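Download site: https://digi.bib.uni-mannheim.de/tesseract/

Choose a version:

Version 4.0.0 is used here because, at the time of writing (2020-2-28), it is the newest version supported by the corresponding Python library.

See https://github.com/simonflueckiger/tesserocr-windows_build/releases for the versions on offer; the version in parentheses is the Tesseract version that each tesserocr build supports.

During installation you can tick support for additional languages (this makes the whole process much slower):

After installation, set the environment variable: add C:\Program Files\Tesseract-OCR to Path (use your own install path).

Confirm that it is set correctly: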
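One way to confirm the setting from Python is to look the executable up on Path and ask it for its version. This is a minimal sketch, assuming the executable is named tesseract; the helper name find_tesseract is made up for illustration.

```python
import shutil
import subprocess


def find_tesseract(executable="tesseract"):
    """Return the full path of the Tesseract executable, or None if it is not on PATH."""
    return shutil.which(executable)


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("tesseract not found on PATH; check the environment variable")
    else:
        # `tesseract --version` prints the installed version and exits.
        result = subprocess.run([path, "--version"], capture_output=True, text=True)
        print(path)
        print(result.stdout or result.stderr)
```

If the script reports that nothing was found, the Path entry is probably missing or the terminal was opened before the variable was changed.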

Installing Tesserocr (Tesseract-OCR)

Install directly with pip:

 pip install tesserocr pillow

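If the installation fails, try the following:

1. Download the tesserocr wheel (.whl) file.

A .whl file is essentially a compressed archive containing .py files and compiled .pyd files.

URL: https://github.com/simonflueckiger/tesserocr-windows_build/releases

2. Check which version tags your local Python supports:

Create a file named test2.py and run it: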

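import pip
import pip._internal.pep425tags
# Newer pip releases dropped this internal module; there, run `pip debug --verbose` instead.
print(pip._internal.pep425tags.get_supported())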

Output:

[('cp37', 'cp37m', 'win_amd64'), ('cp37', 'none', 'win_amd64'), ('py3', 'none', 'win_amd64'), ('cp37', 'none', 'any'), ('cp3', 'none', 'any'), ('py37', 'none', 'any'), ('py3', 'none', 'any'), ('py36', 'none', 'any'), ('py35', 'none', 'any'), ('py34', 'none', 'any'), ('py33', 'none', 'any'), ('py32', 'none', 'any'), ('py31', 'none', 'any'), ('py30', 'none', 'any')]

This means the matching version is 'cp37', 'cp37m', 'win_amd64'.
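The interpreter and platform parts of that tag can also be approximated with the standard library alone. The sketch below is only a guess at the two most useful parts, and the function name guess_wheel_tags is made up for illustration; pip's own list remains the authoritative one.

```python
import sys
import sysconfig


def guess_wheel_tags():
    """Guess the interpreter and platform tags used in wheel file names."""
    major, minor = sys.version_info[:2]
    # CPython wheels use "cp" + version digits, e.g. cp37 for Python 3.7.
    prefix = "cp" if sys.implementation.name == "cpython" else sys.implementation.name
    interpreter = f"{prefix}{major}{minor}"
    # sysconfig reports e.g. "win-amd64"; wheel names use underscores instead.
    platform = sysconfig.get_platform().replace("-", "_").replace(".", "_")
    return interpreter, platform


print(guess_wheel_tags())
```

The two values should match the parts of the wheel file name you pick in the next step, for example cp37 and win_amd64.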

3. Find the matching build:

4. After downloading, install the .whl file with pip (use your own path):

 pip install C:\tesserocr-2.4.0-cp37-cp37m-win_amd64.whl

A first try: recognizing a simple captcha

First install the dependency:

 pip install pillow

If that fails, upgrade pip with:

 python -m pip install --upgrade pip

and then run the install command again.

Recognizing a captcha with tesseract

Find a fairly simple captcha (test.jpg):

Parse the captcha (test3.py):

import tesserocr
from PIL import Image
image = Image.open('test.jpg')
image.show()  # opens the image in a viewer for preview
print(tesserocr.image_to_text(image))

If this error appears at run time:

Failed to init API, possibly an invalid tessdata path: C:\Users\XXXXX\AppData\Local\Programs\Python\Python37\/tessdata/

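then copy the tessdata folder from the Tesseract installation directory into Python's root directory, that is, the directory shown in the error message.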
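Before copying anything, it can help to check which folder actually holds the language data. Tesseract language files are named after the language, such as eng.traineddata. The sketch below searches a list of candidate folders; the helper name find_tessdata and both example paths are assumptions for illustration.

```python
import os
import sys


def find_tessdata(candidates, lang="eng"):
    """Return the first directory that contains <lang>.traineddata, else None."""
    for folder in candidates:
        if os.path.isfile(os.path.join(folder, lang + ".traineddata")):
            return folder
    return None


if __name__ == "__main__":
    # Both paths are examples only; replace them with your own locations.
    guesses = [
        r"C:\Program Files\Tesseract-OCR\tessdata",
        os.path.join(os.path.dirname(sys.executable), "tessdata"),
    ]
    print(find_tessdata(guesses))
```

If the first folder is reported, the data exists but is not where tesserocr looks for it, which is the situation the copy step fixes.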

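Recognizing a captcha with pytesseract

The example above uses tesserocr.image_to_text(), but its recognition rate was poor, so pytesseract is recommended instead. pytesseract is also a wrapper around Tesseract-OCR, and it gave better recognition results.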

官方介紹：Python-tesseract is a wrapper for Google’s Tesseract-OCR Engine. It is also useful as a stand-alone invocation script to tesseract, as it can read all image types supported by the Pillow and Leptonica imaging libraries, including jpeg, png, gif, bmp, tiff, and others.

First install pytesseract:

 pip install pytesseract

Use pytesseract's image_to_string() method:

from PIL import Image
from pytesseract import image_to_string

result = image_to_string(Image.open("test.jpg"), lang='eng', config='--psm 10 --oem 3 -c tessedit_char_whitelist=0123456789')
print(result)

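lang is the language to recognize.

psm (page segmentation mode) is an important parameter for captcha recognition; tuning it can noticeably raise the pass rate (the range of values from the official documentation is listed below).

oem selects the OCR engine mode: 0 is the legacy engine, 1 is the LSTM neural-network engine, 2 combines both, and 3 (the value used in the official examples) picks the default based on what is available.

tessedit_char_whitelist is a whitelist that restricts the recognized result to the listed characters (in testing, the effect was limited).

psm values: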

Page segmentation modes:

0 Orientation and script detection (OSD) only.

1 Automatic page segmentation with OSD.

2 Automatic page segmentation, but no OSD, or OCR.

3 Fully automatic page segmentation, but no OSD. (Default)

4 Assume a single column of text of variable sizes.

5 Assume a single uniform block of vertically aligned text.

6 Assume a single uniform block of text.

7 Treat the image as a single text line.

8 Treat the image as a single word.

9 Treat the image as a single word in a circle.

10 Treat the image as a single character.

11 Sparse text. Find as much text as possible in no particular order.

12 Sparse text with OSD.

13 Raw line. Treat the image as a single text line, bypassing hacks that are Tesseract-specific.
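Since the config string is easy to mistype, it can be assembled by a small helper while trying out different modes. This is only a sketch: the function name build_tesseract_config and its defaults are made up for illustration, and the string it returns is what gets passed as config to image_to_string().

```python
def build_tesseract_config(psm=7, oem=3, whitelist=None):
    """Assemble the config string passed to pytesseract.image_to_string()."""
    if not 0 <= psm <= 13:
        raise ValueError("psm must be between 0 and 13")
    parts = [f"--psm {psm}", f"--oem {oem}"]
    if whitelist:
        parts.append(f"-c tessedit_char_whitelist={whitelist}")
    return " ".join(parts)


print(build_tesseract_config(psm=10, whitelist="0123456789"))
# prints: --psm 10 --oem 3 -c tessedit_char_whitelist=0123456789
```

Mode 10 treats the whole image as one character, so for a captcha with several characters on one line it may be worth comparing the results of modes 7 and 8 as well.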

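More effort: recognizing complex captchas

The captcha above is about as simple as they come; the recognition library can handle it almost out of the box. Most of the time, however, we face captchas like this:

Or this:

Or this:

Run through the library as they are, these captchas have a very low pass rate and are almost unrecognizable. This is where a new tool comes in: image processing.

Different captcha images need different processing, so the treatment has to fit the case. The first kind, for example, has a very thin border and a great many background interference lines. That calls for two operations:

1. Point-wise denoising

2. Removing the border

An image is made of pixels, which becomes obvious once the image is enlarged. Some of these pixels are the important ones that form the captcha characters, while most of the rest only interfere with recognition.

Pixels do not exist in isolation: every pixel (except those on the image edge) is surrounded by 8 others. As in the figure below, if the center pixel's RGB value differs from that of the great majority of its 8 neighbors, then, like a pimple on a face, this lone pixel breaks the overall RGB uniformity and becomes a pixel that must be removed: a noise point.

In the figure above, the pixels that make up the four letters MABC are connected, whereas the noise points are scattered randomly. This property lets us decide whether a pixel is noise.

Of course, a center pixel whose RGB value differs from all of its neighbors is a special case. What we usually see in practice looks like this:

Here some of the surrounding pixels share the center pixel's RGB value and some do not. For this case we set a value N: the threshold, used when judging noise, for the number of surrounding pixels that match the center pixel.

When the number of surrounding pixels with the same RGB value as the center pixel is less than N, the pixel is a noise point.

In the figure above, 2 surrounding pixels match the center pixel. With N set to 3, the center pixel is judged to be noise; with N set to 1, it is not. N has to be tuned case by case.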
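The effect of N can be tried out on a tiny hand-made patch. The sketch below assumes the image has already been reduced to two values (0 for dark, 1 for white) and is stored as a list of rows; the function names and the patch itself are made up for illustration.

```python
def count_same_neighbors(grid, x, y):
    """Count how many of the 8 neighbors of grid[y][x] share its value."""
    center = grid[y][x]
    same = 0
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if (dx, dy) != (0, 0) and grid[y + dy][x + dx] == center:
                same += 1
    return same


def is_noise(grid, x, y, n):
    """A pixel is noise when fewer than n of its neighbors match it."""
    return count_same_neighbors(grid, x, y) < n


# 0 = dark pixel, 1 = white pixel; the center has exactly 2 matching neighbors.
patch = [
    [0, 1, 1],
    [1, 0, 1],
    [1, 1, 0],
]
print(count_same_neighbors(patch, 1, 1))  # prints: 2
print(is_noise(patch, 1, 1, 3))           # prints: True
print(is_noise(patch, 1, 1, 1))           # prints: False
```

With 2 matching neighbors, the center pixel counts as noise for N = 3 but not for N = 1, which is the same outcome as in the description above.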

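Following this logic, every pixel is checked, and any noise point is set to white.

In practice, though, the noise may be so dense that some points slip through. So we bring in one more idea: multi-pass denoising.

That is, after every pixel has been checked once, the image is scanned again several times so that as many noise points as possible are removed.

However, repeated passes may also damage the pixels of the captcha itself, so use them with care.

Following this approach, the denoising code is shown below. (image is the image being processed, whose binarized pixels are held in the global t2val; N is the noise threshold; K is the number of denoising passes.)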

def clearNoise(image, N, K):
    # Relies on the global dict t2val, filled by twoValue() in the full listing.
    # Offsets of the 8 pixels surrounding the center
    neighbors = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
                 (0, 1), (1, -1), (1, 0), (1, 1)]
    for i in range(0, K):  # K denoising passes
        # Force two opposite corners to white
        t2val[(0, 0)] = 1
        t2val[(image.size[0] - 1, image.size[1] - 1)] = 1

        # Skip the outermost pixels: they do not have 8 neighbors
        for x in range(1, image.size[0] - 1):
            for y in range(1, image.size[1] - 1):
                L = t2val[(x, y)]
                # Count the neighbors that share the center pixel's value
                nearDots = sum(1 for dx, dy in neighbors
                               if L == t2val[(x + dx, y + dy)])

                # Fewer than N matching neighbors: noise, paint it white
                if nearDots < N:
                    t2val[(x, y)] = 1

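After processing we get this image:

As you can see, the background of the denoised image is now very “clean”. Apart from the border, this captcha is already fairly easy to recognize.

Because the border is itself a run of connected pixels, much like the captcha characters, and sits at the edge of the image, denoising cannot deal with it.

The second step removes the border. This is simpler: the pixels along the border are painted white.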

def clear_border(img_name):
    # cv_imread, cv_imwrite and path_extends come from the full listing below
    img = cv_imread(path_extends.get_absolute_path()+"\\images\\"+img_name)
    filename = path_extends.get_absolute_path()+"\\images\\" + \
        img_name.split('-')[0] + '-clearBorder.jpg'
    h, w = img.shape[:2]
    for y in range(0, w):      # y walks the columns
        for x in range(0, h):  # x walks the rows
            # Paint a 2-pixel frame on every side white
            if y < 2 or y >= w - 2:
                img[x, y] = 255
            if x < 2 or x >= h - 2:
                img[x, y] = 255

    cv_imwrite(filename, img)
    return img

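After this series of steps, the result is:

The complete code (call the image_to_text function to run recognition; the original captcha image must be placed in the images folder and named test.png):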

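# coding:utf-8
from PIL import Image, ImageDraw
from pytesseract import image_to_string
import cv2
# path_extends is the author's own helper module (not shown in this article);
# get_absolute_path() is expected to return the project's base directory.
from tools import path_extends
import numpy as np


# Binarized pixels: (x, y) -> 1 (white) or 0 (black)
t2val = {}
def twoValue(image, G):
    # Pixels brighter than the threshold G become 1 (white), the rest 0 (black)
    for y in range(0, image.size[1]):
        for x in range(0, image.size[0]):
            g = image.getpixel((x, y))
            if g > G:
                t2val[(x, y)] = 1
            else:
                t2val[(x, y)] = 0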


def clear_border(img_name):
    # Reads images\<img_name> and writes images\<prefix>-clearBorder.jpg
    img = cv_imread(path_extends.get_absolute_path()+"\\images\\"+img_name)
    filename = path_extends.get_absolute_path()+"\\images\\" + \
        img_name.split('-')[0] + '-clearBorder.jpg'
    h, w = img.shape[:2]
    for y in range(0, w):      # y walks the columns
        for x in range(0, h):  # x walks the rows
            # Paint a 2-pixel frame on every side white
            if y < 2 or y >= w - 2:
                img[x, y] = 255
            if x < 2 or x >= h - 2:
                img[x, y] = 255

    cv_imwrite(filename, img)
    return img

def clearNoise(image, N, K):
    # Relies on the global dict t2val filled by twoValue().
    # Offsets of the 8 pixels surrounding the center
    neighbors = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
                 (0, 1), (1, -1), (1, 0), (1, 1)]
    for i in range(0, K):  # K denoising passes
        # Force two opposite corners to white
        t2val[(0, 0)] = 1
        t2val[(image.size[0] - 1, image.size[1] - 1)] = 1

        # Skip the outermost pixels: they do not have 8 neighbors
        for x in range(1, image.size[0] - 1):
            for y in range(1, image.size[1] - 1):
                L = t2val[(x, y)]
                # Count the neighbors that share the center pixel's value
                nearDots = sum(1 for dx, dy in neighbors
                               if L == t2val[(x + dx, y + dy)])

                # Fewer than N matching neighbors: noise, paint it white
                if nearDots < N:
                    t2val[(x, y)] = 1

def cv_imread(filePath):
    cv_img = cv2.imdecode(np.fromfile(filePath, dtype=np.uint8), -1)
    return cv_img

def cv_imwrite(filePath, features):
    cv2.imencode('.jpg', features)[1].tofile(filePath)

def saveImage(filename, size):
    image = Image.new("1", size)
    draw = ImageDraw.Draw(image)

    for x in range(0, size[0]):
        for y in range(0, size[1]):
            draw.point((x, y), t2val[(x, y)])

    image.save(filename)
 

def image_to_text():
    # Grayscale -> binarize -> denoise -> remove border -> OCR
    image = Image.open(path_extends.get_absolute_path()+"\\images\\test.png").convert("L")
    twoValue(image, 100)
    clearNoise(image, 2, 1)
    path1 = path_extends.get_absolute_path()+"\\images\\test-clearNoise.jpg"
    saveImage(path1, image.size)
    # Must match the file saved above; clear_border then writes test-clearBorder.jpg
    clear_border("test-clearNoise.jpg")
    result = image_to_string(Image.open(
        path_extends.get_absolute_path()+"\\images\\test-clearBorder.jpg"), lang='eng', config='--psm 10 --oem 3 -c tessedit_char_whitelist=QWERTYUIOPLKHJHGFDSAZXCVBNM')

    return result

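Ultimate difficulty: let's start training on samples

The captchas above are still not the hardest to recognize. We have all seen ones like this (image from Baidu):

The text is distorted, slanted and crowded together. Even a human needs a second look at these, let alone a program. The methods above are no longer up to the job, and a new approach is needed.

Computers are faster and more precise than people, but a program can fail to recognize a letter or symbol as soon as it is slightly deformed; this literal-mindedness becomes a weakness. If we could tell the program that an m is an m, and that a distorted m is also an m, the problem would be solved.

This is where a new concept comes in: sample training.

Before training we need to collect samples, either by taking screenshots by hand or by splitting images programmatically. As a simple example, to train the digits 0~9 we first collect sample images of these 10 digits and then move on to the next step.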

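Download jTessBoxEditor:

Official download (slow): https://sourceforge.net/projects/vietocr/files/jTessBoxEditor/

Mirror in China: https://www.jb51.net/softs/676483.html#downintro2

Download the library:

Training library download: https://sourceforge.net/projects/tess4j/files/tess4j/

Prepare the samples:

Convert PNG to TIFF

Conversion site: https://cloudconvert.com/png-to-tiff

Import the training samples

Select the training images:

After the selection, another dialog asks for a directory in which to save the merged TIFF.

Name the file xl.normal.exp0.tif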
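As an alternative to the online converter and the manual merge, Pillow can write several images into one multi-page TIFF. The sketch below is an untested assumption about how that could replace the two manual steps; the function names and the default values xl and normal (taken from the file name above) are for illustration only.

```python
import os


def training_tiff_name(lang, font, index=0):
    """Build a name in the <lang>.<font>.exp<index>.tif pattern used above."""
    return f"{lang}.{font}.exp{index}.tif"


def merge_pngs_to_tiff(png_paths, out_dir, lang="xl", font="normal"):
    """Save the given PNG samples as the pages of one multi-page TIFF."""
    from PIL import Image  # imported here so the naming helper works without Pillow

    pages = [Image.open(p).convert("L") for p in png_paths]
    out_path = os.path.join(out_dir, training_tiff_name(lang, font))
    # save_all/append_images write the remaining images as extra pages.
    pages[0].save(out_path, format="TIFF", save_all=True, append_images=pages[1:])
    return out_path


print(training_tiff_name("xl", "normal"))  # prints: xl.normal.exp0.tif
```

Calling merge_pngs_to_tiff with a list of sample PNG paths and an output folder returns the path of the merged file.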

Run the following on the command line (this starts the training by generating the box file):

tesseract xl.normal.exp0.tif xl.normal.exp0 -l eng batch.nochop makebox
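The makebox step writes a text file, xl.normal.exp0.box, with one line per detected character: the symbol, then its left, bottom, right and top coordinates (measured from the bottom-left corner of the image), then the page number. A small parser makes these lines easier to inspect before correcting them in jTessBoxEditor. The names Box and parse_box_line and the sample line are made up for illustration.

```python
from typing import NamedTuple


class Box(NamedTuple):
    symbol: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line):
    """Parse one line of a Tesseract .box file into a Box."""
    symbol, *numbers = line.split()
    left, bottom, right, top, page = (int(n) for n in numbers)
    return Box(symbol, left, bottom, right, top, page)


sample = "m 12 5 30 27 0"  # made-up values for illustration
print(parse_box_line(sample))
```

A line with the wrong number of fields raises ValueError, which is a cheap way to spot a damaged box file.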

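Once sample training is finished, the next key step is splitting the captcha so that the program can compare each piece against the samples.

The splitting logic is much the same everywhere, so the code here is quoted directly from shaomine's blog post: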

#coding:utf8
import os
from PIL import Image

class pictureIdenti:

    # rownum: number of rows to cut; colnum: number of columns to cut;
    # dstpath: output directory; img_name: image file to cut
    def splitimage(self, rownum=1, colnum=4, dstpath=r"D:\work\python36_crawl\Veriycode",
                   img_name=r"D:\work\python36_crawl\Veriycode\mode_5246.png",):
        img = Image.open(img_name)
        w, h = img.size
        if rownum <= h and colnum <= w:
            print('Original image info: %sx%s, %s, %s' % (w, h, img.format, img.mode))
            print('Cutting the image, please wait...')

            s = os.path.split(img_name)
            if dstpath == '':
                dstpath = s[0]
            fn = s[1].split('.')
            basename = fn[0]
            ext = fn[-1]

            num = 1
            rowheight = h // rownum
            colwidth = w // colnum
            file_list = []
            for r in range(rownum):
                index = 0
                for c in range(colnum):
                    # (left, upper, right, lower)
                    # box = (c * colwidth, r * rowheight, (c + 1) * colwidth, (r + 1) * rowheight)
                    # Per-column width tweaks, hand-tuned for this particular captcha
                    if index<1:
                        colwid = colwidth+6
                    elif index<2:
                        colwid = colwidth + 1
                    else:
                        colwid = colwidth

                    box = (c * colwid, r * rowheight, (c + 1) * colwid, (r + 1) * rowheight)
                    newfile = os.path.join(dstpath, basename + '_' + str(num) + '.' + ext)
                    file_list.append(newfile)
                    img.crop(box).save(os.path.join(dstpath, basename + '_' + str(num) + '.' + ext), ext)
                    num = num + 1
                    index+=1
            for f in file_list:
                print(f)
            print('Done cutting, %s small images generated.' % (num - 1))
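The quoted code adjusts the width of individual columns by hand. For captchas whose characters are evenly spaced, the crop boxes can be computed without any tuning. The sketch below shows that simpler case; the function name split_boxes is made up, and each returned tuple can be passed to Pillow's Image.crop().

```python
def split_boxes(width, height, cols, rows=1):
    """Return (left, upper, right, lower) crop boxes for an even grid."""
    col_w = width // cols
    row_h = height // rows
    return [
        (c * col_w, r * row_h, (c + 1) * col_w, (r + 1) * row_h)
        for r in range(rows)
        for c in range(cols)
    ]


print(split_boxes(100, 40, 4))
# prints: [(0, 0, 25, 40), (25, 0, 50, 40), (50, 0, 75, 40), (75, 0, 100, 40)]
```

When the characters are not evenly spaced, per-column offsets like those in the quoted code are still needed.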

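The nemesis: logic captchas

Strictly speaking, a logic captcha is no longer a “code” at all but a logic test. For example (image from Baidu):

And the one we know best:

This is no longer the 1=1 matching described above; the viewer has to recognize the content, reason about it, and then enter the result. The methods above can hardly recognize these, and concrete solutions are beyond the scope of this article.

Closing remarks

Captchas guard websites and applications, and their role keeps growing. Even if you are a website administrator rather than a Python crawler researcher, you still need a solid understanding of captcha recognition, because it matters a great deal for the security of your site.

We study captcha recognition in order to strengthen network security. For anyone using crawler technology, using it safely and non-destructively is both the bottom line and a matter of self-discipline. Before crawling data, find out whether the content may be crawled, follow the rules in robots.txt, and wait as long as practical between requests rather than hammering the server with unrestrained scraping.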
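Both habits, checking robots.txt and pausing between requests, can be sketched with the standard library. The robots.txt content, the URLs and the user agent name below are made up for illustration, and no network request is made.

```python
import time
from urllib.robotparser import RobotFileParser

DELAY_SECONDS = 1  # pause between requests; tune to the site's tolerance


def make_robot_rules(robots_txt):
    """Parse robots.txt text that has already been downloaded."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


def allowed_urls(rules, urls, user_agent="example-bot"):
    """Keep only the URLs that robots.txt allows for this user agent."""
    return [u for u in urls if rules.can_fetch(user_agent, u)]


rules = make_robot_rules("User-agent: *\nDisallow: /private/\n")
allowed = allowed_urls(
    rules,
    ["https://example.com/index.html", "https://example.com/private/data.html"],
)
for url in allowed:
    print("would fetch", url)
    time.sleep(DELAY_SECONDS)  # wait instead of hammering the server
```

In a real crawler, the rules would be loaded from the target site's own robots.txt and the actual request would replace the print call.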

Partial references:

https://www.cnblogs.com/shaosks/p/9700610.html

https://blog.csdn.net/dream_people/article/details/83393134

Author: Mr.Jimmy

Source: https://www.cnblogs.com/JHelius

Contact: [email protected]

Questions and discussion are welcome; please credit the source when reposting.