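Approaches to Getting Past Image CAPTCHA Anti-Scraping Measures

Ever since crawlers have existed, the battle between crawlers and CAPTCHAs has been going on. Below are a few things I have learned from handling simple CAPTCHAs: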

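I. Login CAPTCHAs:

Many sites log users in with a username, a password, and an image CAPTCHA. Simple image CAPTCHAs can be recognized with OCR (optical character recognition), while more complex ones need more involved handling.

Step 1: Capture the CAPTCHA image and save it to a file

Method: use the webdriver screenshot feature to capture the CAPTCHA image. The code is as follows: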

import time

from PIL import Image
from selenium import webdriver
from selenium.webdriver.common.by import By


def login():
    driver = webdriver.Chrome()
    driver.maximize_window()  # maximize the window
    driver.get('login URL')  # placeholder: put the real login page URL here
    time.sleep(10)  # give the page time to load
    driver.save_screenshot('printscreen.png')
    # Locate the CAPTCHA image (find_element_by_css_selector was removed in Selenium 4)
    imgelement = driver.find_element(
        By.CSS_SELECTOR,
        "#wrap > div.sign-wrap > div.sign-form.sign-sms > form > div.form-row.row-code > img")
    location = imgelement.location  # x, y coordinates of the CAPTCHA
    size = imgelement.size  # width and height of the CAPTCHA
    rangle = (int(location['x']), int(location['y']), int(location['x'] + size['width']),
              int(location['y'] + size['height']))  # the box we need to crop
    i = Image.open("printscreen.png")  # open the screenshot
    frame4 = i.crop(rangle)  # crop the CAPTCHA region out of the screenshot
    frame4.save('save.png')  # save the CAPTCHA image for recognition
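
One caveat with this screenshot-and-crop approach: `location` and `size` are reported in CSS pixels, while on a high-DPI display the screenshot can be larger by the device pixel ratio, so the crop may miss the CAPTCHA. Below is a minimal sketch of scaling the crop box. The helper name `crop_box` and its `scale` parameter are my own assumptions for illustration, not part of the code above.

```python
def crop_box(location, size, scale=1.0):
    """Return a (left, upper, right, lower) box in screenshot pixels.

    location and size are the dicts exposed by a Selenium WebElement;
    scale is the device pixel ratio (assumed to be 1.0 on ordinary displays).
    """
    left = location['x'] * scale
    upper = location['y'] * scale
    right = (location['x'] + size['width']) * scale
    lower = (location['y'] + size['height']) * scale
    return tuple(int(round(v)) for v in (left, upper, right, lower))


if __name__ == '__main__':
    # Example values standing in for imgelement.location / imgelement.size.
    print(crop_box({'x': 100, 'y': 50}, {'width': 80, 'height': 30}))
    print(crop_box({'x': 100, 'y': 50}, {'width': 80, 'height': 30}, scale=2))
```

The ratio can be read from the browser with `driver.execute_script('return window.devicePixelRatio')`. Another option worth trying is `imgelement.screenshot('save.png')`, which asks the driver to capture just that element.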

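Step 2: Process the CAPTCHA

Method: once you have the CAPTCHA image, simple ones can be handled yourself with OCR, while complex ones can be sent to a human-powered CAPTCHA-solving platform or recognized with a deep learning model.

The CAPTCHA looks like this:

[Image: sample CAPTCHA]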

import pytesseract
from PIL import Image

img = Image.open('9952.png')
res = pytesseract.image_to_string(img)
print(res)

The output is:

C:\Python\Python36\python.exe E:/desktop_file/maimai_register/clawerImgs/tress.py
9952
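
Real CAPTCHAs are rarely as clean as this sample, and a common first step before OCR is to convert the image to grayscale and threshold it to pure black and white. Here is a minimal sketch of that idea. The helper names and the threshold of 140 are assumptions that need tuning for each site.

```python
def threshold_table(threshold=140):
    """Build a 256-entry lookup table: dark pixels -> 0, light pixels -> 255."""
    return [0 if value < threshold else 255 for value in range(256)]


def binarize(path, threshold=140):
    """Open an image, convert it to grayscale, and apply the threshold."""
    # Pillow is imported here so threshold_table() can be used without it.
    from PIL import Image

    gray = Image.open(path).convert('L')
    return gray.point(threshold_table(threshold))
```

The result can be passed straight to OCR, for example `pytesseract.image_to_string(binarize('save.png'))`. If the text sits on a single line, passing `config='--psm 7'` to `image_to_string` may also help.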

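II. CAPTCHAs during crawling:

This section deals with CAPTCHAs that are triggered by crawling too fast.

Sites usually handle the CAPTCHA through a redirect:

Browse page A --- CAPTCHA pops up --- fetch the CAPTCHA --- submit the CAPTCHA --- redirect back to page A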
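
Noticing that you have been redirected is the first part of this flow. Assuming the CAPTCHA page carries the original address in a `redirect` query parameter (as the `popUpCaptcha?redirect=...` referer in the example code suggests), a small helper can recover page A from the final URL. The function name and the `/captcha` path prefix are assumptions for illustration.

```python
from urllib.parse import parse_qs, urlsplit


def captcha_redirect_target(response_url, captcha_path='/captcha'):
    """Return the original page URL carried by a CAPTCHA page.

    Returns None if response_url is not a CAPTCHA page or has no
    'redirect' query parameter.
    """
    parts = urlsplit(response_url)
    if not parts.path.startswith(captcha_path):
        return None
    # parse_qs percent-decodes the value, so the original URL comes back intact.
    targets = parse_qs(parts.query).get('redirect')
    return targets[0] if targets else None
```

With requests, `response.url` holds the final URL after redirects, so `captcha_redirect_target(response.url)` returning a value means the crawl was interrupted by a CAPTCHA.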

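Approach:

1. Get page A's URL and request information

2. Fetch the CAPTCHA by sending a GET request

3. Save the CAPTCHA image and recognize its content with a solving tool (human-powered solving)

4. Submit the CAPTCHA answer to the server and pass verification

5. Redirect to the address of page A's URL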
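
Steps 2 to 4 usually need to be repeated, because recognition can fail or the server can reject the answer. The skeleton below sketches that loop independently of any particular site. `fetch_image`, `recognize`, and `submit` are hypothetical callables that you supply yourself.

```python
def solve_with_retries(fetch_image, recognize, submit, max_attempts=3):
    """Repeat steps 2-4 until the server accepts an answer.

    fetch_image() -> bytes, recognize(bytes) -> str ('' on failure) and
    submit(str) -> bool are hypothetical callables provided by the caller.
    Returns True once submit() succeeds, False after max_attempts tries.
    """
    for _ in range(max_attempts):
        image = fetch_image()          # step 2: download a fresh CAPTCHA
        answer = recognize(image)      # step 3: OCR or a solving service
        if answer and submit(answer):  # step 4: post the answer
            return True
    return False
```

Counting every attempt, including failed recognitions, keeps the loop bounded even when the solving service is down.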

A handy conversion site: https://curl.trillworks.com/ It turns a curl command copied from the browser into request code, and it is worth bookmarking.

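An example: the CAPTCHA handling flow for one site. Only the core code is shown. The methods belong to a crawler class and assume that os, time, and requests are imported and that self.url, self.cookies, self.logger, and self.dama2 are set elsewhere.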

    def get_capture(self):
        """Fetch the CAPTCHA image."""
        self.randomkey = str(int(1000*time.time()))
        headers = {
            'accept-encoding': 'gzip, deflate, br',
            'accept-language': 'zh-CN,zh;q=0.9',
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3409.2 Safari/537.36',
            'accept': 'image/webp,image/apng,image/*,*/*;q=0.8',
            'referer': 'https://www.zhipin.com/captcha/popUpCaptcha?redirect={}'.format(self.url),
            'authority': 'www.zhipin.com',
            'cookie': self.cookies
        }

        params = (
            ('randomKey', self.randomkey ),
        )

        response = requests.get('https://www.zhipin.com/captcha', headers=headers, params=params)
        return response.content


    def pass_capture(self, capture_res):
        """Submit the CAPTCHA answer to the server."""
        headers = {
            'authority': 'www.zhipin.com',
            'cache-control': 'max-age=0',
            'origin': 'https://www.zhipin.com',
            'upgrade-insecure-requests': '1',
            'content-type': 'application/x-www-form-urlencoded',
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3409.2 Safari/537.36',
            'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
            # 'referer': 'https://www.zhipin.com/captcha/popUpCaptcha?redirect=https%3A%2F%2Fwww.zhipin.com%2Fboss%2Fsearch%2Fgeek%2Finfo%3Fsuid%3D1fc97ddf37d119f4KGQMV4z1idI%7E%26jid%3D0%26lid%3D12659U92LQM.lookupsearchgeek.1%26expectId%3D2726533%26segs%3Djava',
            'referer': self.url,
            'accept-encoding': 'gzip, deflate, br',
            'accept-language': 'zh-CN,zh;q=0.9',
            'cookie': self.cookies
        }

        params = (
            ('redirect', self.url),)
        # capture = input("please input captcha")
        data = [
            ('randomKey', self.randomkey),
            ('captcha', capture_res),
        ]
        response = requests.post('https://www.zhipin.com/captcha/verifyCaptcha', headers=headers, params=params,
                                 data=data)
        print(response.url)
        print(response.text)
        return response

    def verify_code(self, filename):
        """Send the image to the CAPTCHA-solving platform and return the answer."""
        l = self.logger
        retry = 0
        while True:
            retry += 1
            if retry == 3:  # give up after two failed attempts
                return ''
            try:
                captcha_id, vycode = self.dama2.decode_captcha(captcha_type=3040, file_path=filename)
                return vycode
            except Exception as e:
                l.info(e)
                l.info('verify error')

    def run(self, url):
        retry = 0
        while True:
            if retry == 10:
                return False
            capture = self.get_capture()  # call the methods through self
            filename = os.getcwd() + "/captcha.png"
            with open(filename, "wb") as f:
                f.write(capture)
            capture_res = self.verify_code(filename)
            if not capture_res:
                print('Failed to get the CAPTCHA answer')
                retry += 1  # count failed recognitions too, so the loop always ends
                continue
            print('The CAPTCHA answer is: {}'.format(capture_res))
            res = self.pass_capture(capture_res)
            print("CAPTCHA submitted, response: {}".format(res.text))
            time.sleep(20)
            # Marker text from the site's CAPTCHA page (kept verbatim); if it is gone, the answer was accepted
            if '<div class="tips">為了您的賬号安全，我們需要在執行操作之前驗證您的身份，請輸入驗證碼。</div>' not in res.text:
                print("CAPTCHA cleared")
                return True
            retry += 1

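III. Stitching and reassembling complex CAPTCHAs

On some sites the CAPTCHA image appears on the page like this:

[Image: the CAPTCHA as displayed on the page]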

But opening the developer tools shows that the actual background image looks like this:

[Image: the scrambled background image]

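This background image is clearly assembled from slices, and the positioning coordinates can be found in the JS:

[Image: the slice coordinates found in the JS]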

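Sure enough, the page displays the background image slice by slice, so to restore the original image we have to reverse the process.

Here is the code outline: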

from PIL import Image

# Large tiles are 21.98 * 58 (width * height)
offset_list = [['66', '40'], ['286', '40'], ['66', '98'], ['44', '40'], ['154', '40'], ['22', '40'], ['88', '98'],
               ['198', '40'], ['198', '98'], ['264', '98'], ['308', '40'], ['176', '40'], ['0', '98'], ['132', '98'],
               ['132', '40'], ['176', '98'], ['88', '40'], ['154', '98'], ['220', '40'], ['264', '40'], ['110', '40'],
               ['242', '98'], ['286', '98'], ['0', '40'], ['242', '40'], ['44', '98'], ['220', '98'], ['22', '98'],
               ['308', '98'], ['110', '98']]

# Small tiles are 21.98 * 40 (width * height)
offset_list_small = [['264', '0'], ['154', '0'], ['44', '0'], ['242', '0'], ['110', '0'], ['176', '0'], ['88', '0']]

# Get the paste position of each tile in the new image
def convert_index_to_offset(index, size):
    if size == 1:
        if index < 15:  # the full CAPTCHA is made of 30 tiles in 2 rows of 15
            return (index * 22, 0)
        else:
            i = index - 15
            return (i * 22, 58)  # each tile is 22*58
    elif size == 2:
        return (index * 22, 116)


# Get the crop box of each tile in the source image
def convert_css_to_offset(off, size):
    # (left, upper)o ----- o
    #         |       |
    #         o ----- o(right, lower)
    if size == 1:
        return (int(off[0]), int(off[1]), int(off[0]) + 21.98, int(off[1]) + 58)
    elif size == 2:
        return (int(off[0]), int(off[1]), int(off[0]) + 21.98, int(off[1]) +40)

# 289,92;256,58;131,43;91,22
# Reassemble the image
def recombine_captcha(file_id):
    captcha = Image.new('RGB', (22 * 15, 58 * 2 + 40))  # create a blank image
    img = Image.open('./capture/capture1_{}.png'.format(file_id))  # open the original (scrambled) image
    for i, off in enumerate(offset_list):
        box = convert_css_to_offset(off, 1)  # crop box from the CSS background-position
        region = img.crop(box)  # cut the tile out
        offset = convert_index_to_offset(i, 1)  # where this tile goes in the blank image
        captcha.paste(region, offset)  # paste the tile at that position
    for i, off in enumerate(offset_list_small):
        box = convert_css_to_offset(off, 2)  # crop box from the CSS background-position
        region = img.crop(box)  # cut the tile out
        offset = convert_index_to_offset(i, 2)  # where this tile goes in the blank image
        captcha.paste(region, offset)  # paste the tile at that position
    capture_2 = Image.open('./capture/text.png')
    captcha.paste(capture_2, (154, 116))

    captcha.save('./capture/capture2_{}.png'.format(file_id))

if __name__ == '__main__':
    recombine_captcha('-5543')

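[Image: the original image from the page (left) and the reassembled image (right)]

The left image is the original from the web page and the right one is the reassembled result. That completes the image processing. The next step is to send the image to a solving platform to get the text coordinates, then submit them to the server to check whether the CAPTCHA answer is correct.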
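
The reassembly step can also be written for any grid size instead of hard-coding 2 rows of 15. The sketch below is my own generalization, not the code used above: it assumes integer `(x, y)` offsets and a single tile size, and the names `tile_plan` and `reassemble` are hypothetical.

```python
def tile_plan(offsets, tile_w, tile_h, columns):
    """Pair each source box in the scrambled image with its destination.

    offsets[i] is the (x, y) of the tile that belongs at position i,
    counted left to right, top to bottom, in a grid with `columns` columns.
    """
    plan = []
    for index, (x, y) in enumerate(offsets):
        box = (x, y, x + tile_w, y + tile_h)  # region to crop from the source
        row, col = divmod(index, columns)
        plan.append((box, (col * tile_w, row * tile_h)))
    return plan


def reassemble(scrambled, offsets, tile_w, tile_h, columns):
    """Rebuild the picture from a scrambled PIL image using tile_plan()."""
    # Pillow is imported here so tile_plan() can be used without it.
    from PIL import Image

    rows = -(-len(offsets) // columns)  # ceiling division
    restored = Image.new('RGB', (columns * tile_w, rows * tile_h))
    for box, destination in tile_plan(offsets, tile_w, tile_h, columns):
        restored.paste(scrambled.crop(box), destination)
    return restored
```

Keeping the coordinate arithmetic in `tile_plan` makes it easy to print the plan and compare it against the `background-position` values in the page before touching any image.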
