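Ever since crawlers have existed, crawlers and captchas have been at war. Here are a few notes from my experience handling simple captchas:
I. Login captchas:
Many sites require a username, a password, and an image captcha to log in. A simple image captcha can be read with OCR (optical character recognition); a more complex one takes more work.
Step 1: Capture the captcha image and save it to a file
Method: take a screenshot with webdriver and crop the captcha out of it. The code: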
import time

from PIL import Image
from selenium import webdriver
from selenium.webdriver.common.by import By


def login():
    driver = webdriver.Chrome()
    driver.maximize_window()  # maximize the window
    driver.get('login page URL')  # placeholder: put the real login URL here
    time.sleep(10)  # give the page time to load
    driver.save_screenshot('printscreen.png')
    # Locate the captcha <img> element
    imgelement = driver.find_element(
        By.CSS_SELECTOR,
        "#wrap > div.sign-wrap > div.sign-form.sign-sms > form > div.form-row.row-code > img")
    location = imgelement.location  # x, y of the captcha's top-left corner
    size = imgelement.size  # width and height of the captcha
    # Crop box as (left, upper, right, lower)
    rangle = (int(location['x']), int(location['y']),
              int(location['x'] + size['width']),
              int(location['y'] + size['height']))
    i = Image.open("printscreen.png")  # open the full screenshot
    frame4 = i.crop(rangle)  # crop the captcha region out of the screenshot
    frame4.save('save.png')  # save the captcha image for recognition
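One thing that can go wrong with the crop: on a high-DPI display the screenshot may be larger than the CSS-pixel coordinates that `location` and `size` report, so the box lands in the wrong place. Below is a minimal sketch of the box arithmetic as a standalone helper. The name `crop_box` and the `scale` parameter are my own assumptions, not part of the original code; the scale itself would have to be read from the page, for example with `driver.execute_script('return window.devicePixelRatio')`.

```python
def crop_box(location, size, scale=1.0):
    """Return (left, upper, right, lower) in screenshot pixels.

    location and size are dicts shaped like Selenium's element.location
    and element.size. scale is a hypothetical parameter: the ratio between
    screenshot pixels and CSS pixels (1.0 on a standard display).
    """
    left = int(location['x'] * scale)
    upper = int(location['y'] * scale)
    right = int((location['x'] + size['width']) * scale)
    lower = int((location['y'] + size['height']) * scale)
    return (left, upper, right, lower)
```

With `scale=1.0` this gives the same box as the code above; the result can be passed straight to `Image.crop`.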
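Step 2: Process the captcha
Method: a simple captcha image can be recognized locally; a complex one can be sent to a human-powered captcha-solving platform or recognized with a deep-learning model.
The captcha looks like this: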

import pytesseract
from PIL import Image

img = Image.open('9952.png')  # load the captcha image
res = pytesseract.image_to_string(img)  # run Tesseract OCR on it
print(res)
Output:
C:\Python\Python36\python.exe E:/desktop_file/maimai_register/clawerImgs/tress.py
9952
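OCR output usually carries trailing whitespace and sometimes a stray or misread character, so it is worth cleaning before submitting it. Here is a minimal sketch, assuming the captcha is four digits like the 9952 example; `normalize_captcha_text` and its parameters are hypothetical names for illustration.

```python
import string


def normalize_captcha_text(raw, allowed=string.digits, expected_len=4):
    """Keep only the allowed characters; return None if the length is wrong.

    allowed and expected_len are assumptions about the captcha format
    (four digits here); adjust them for the site being crawled.
    """
    cleaned = ''.join(ch for ch in raw if ch in allowed)
    if len(cleaned) != expected_len:
        return None  # treat as a failed recognition and fetch a new captcha
    return cleaned
```

A `None` result is a cheap signal to request a fresh captcha rather than submit a guess that is bound to fail. Tesseract can also be restricted to digits up front, for example with `config='--psm 7 -c tessedit_char_whitelist=0123456789'` passed to `image_to_string`, though whether that helps depends on the image.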
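II. Captchas triggered while crawling:
This section deals with the captcha a site throws up when the crawler requests pages too quickly.
Sites usually handle this with a redirect:
visit page a -> captcha pops up -> fetch the captcha -> submit the captcha -> redirect back to page a
The approach:
1. Record page a's URL and request details
2. Send a GET request to fetch the captcha
3. Save the captcha image and recognize its content with a solving tool (here, a human-powered solving platform)
4. Submit the captcha text to the server and pass verification
5. Follow the redirect back to page a's URL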
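Steps 1 and 5 depend on carrying page a's URL through the captcha flow, typically percent-encoded in a `redirect` query parameter. The sketch below shows that round trip with the standard library. The `popUpCaptcha` address is taken from the referer header in the example code further down; the function names are hypothetical, and other sites may name the parameter differently.

```python
from urllib.parse import parse_qs, quote, urlsplit

# Captcha page address as it appears in the example's referer header
CAPTCHA_PAGE = 'https://www.zhipin.com/captcha/popUpCaptcha'


def build_captcha_page_url(page_url):
    """Return the captcha page URL with page_url percent-encoded in `redirect`."""
    return '{}?redirect={}'.format(CAPTCHA_PAGE, quote(page_url, safe=''))


def extract_redirect(captcha_page_url):
    """Recover the original page URL from the `redirect` query parameter."""
    query = parse_qs(urlsplit(captcha_page_url).query)
    values = query.get('redirect')
    return values[0] if values else None
```

Encoding with `safe=''` matters because page a's own query string contains `?`, `&` and `=`, which would otherwise be read as part of the captcha page's query.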
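A handy tool: https://curl.trillworks.com/ converts a curl command copied from the browser into requests code. Bookmark it if you find it useful.
An example: the captcha flow on one site (core code only)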
import os
import time

import requests


# The functions below are methods of the crawler class (class body omitted);
# attributes such as self.cookies are set elsewhere.
def get_capture(self):
    """Fetch the captcha image."""
    # Millisecond timestamp, sent again later together with the answer
    self.randomkey = str(int(1000 * time.time()))
    headers = {
        'accept-encoding': 'gzip, deflate, br',
        'accept-language': 'zh-CN,zh;q=0.9',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3409.2 Safari/537.36',
        'accept': 'image/webp,image/apng,image/*,*/*;q=0.8',
        'referer': 'https://www.zhipin.com/captcha/popUpCaptcha?redirect={}'.format(self.url),
        'authority': 'www.zhipin.com',
        'cookie': self.cookies
    }
    params = (
        ('randomKey', self.randomkey),
    )
    response = requests.get('https://www.zhipin.com/captcha', headers=headers, params=params)
    return response.content  # raw image bytes
def pass_capture(self, capture_res):
    """Submit the captcha answer to the server."""
    headers = {
        'authority': 'www.zhipin.com',
        'cache-control': 'max-age=0',
        'origin': 'https://www.zhipin.com',
        'upgrade-insecure-requests': '1',
        'content-type': 'application/x-www-form-urlencoded',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3409.2 Safari/537.36',
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
        # 'referer': 'https://www.zhipin.com/captcha/popUpCaptcha?redirect=https%3A%2F%2Fwww.zhipin.com%2Fboss%2Fsearch%2Fgeek%2Finfo%3Fsuid%3D1fc97ddf37d119f4KGQMV4z1idI%7E%26jid%3D0%26lid%3D12659U92LQM.lookupsearchgeek.1%26expectId%3D2726533%26segs%3Djava',
        'referer': self.url,
        'accept-encoding': 'gzip, deflate, br',
        'accept-language': 'zh-CN,zh;q=0.9',
        'cookie': self.cookies
    }
    params = (
        ('redirect', self.url),
    )
    # capture = input("please input captcha")
    data = [
        ('randomKey', self.randomkey),  # same key that was used to fetch the image
        ('captcha', capture_res),
    ]
    response = requests.post('https://www.zhipin.com/captcha/verifyCaptcha', headers=headers, params=params,
                             data=data)
    print(response.url)
    print(response.text)
    return response
def verify_code(self, filename):
    """Send the image to the captcha-solving platform and return the answer."""
    l = self.logger
    retry = 0
    while True:
        retry += 1
        if retry == 3:  # give up after two failed attempts
            return ''
        try:
            captcha_id, vycode = self.dama2.decode_captcha(captcha_type=3040, file_path=filename)
            return vycode
        except Exception as e:
            l.info(e)
            l.info('verify error')
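def run(self, url):
    self.url = url  # the page to return to once the captcha is passed
    retry = 0
    while True:
        if retry == 10:
            return False
        retry += 1  # count every attempt so a failing solver cannot loop forever
        capture = self.get_capture()
        filename = os.path.join(os.getcwd(), "captcha.png")
        with open(filename, "wb") as f:
            f.write(capture)
        capture_res = self.verify_code(filename)
        if not capture_res:
            print('failed to get the captcha answer')
            continue
        print('captcha answer: {}'.format(capture_res))
        res = self.pass_capture(capture_res)
        print("captcha submitted: {}".format(res.text))
        time.sleep(20)
        # If the response no longer contains the site's captcha prompt, we are through.
        # The prompt is matched verbatim, so it is left untranslated.
        if '<div class="tips">為了您的賬号安全,我們需要在執行操作之前驗證您的身份,請輸入驗證碼。</div>' not in res.text:
            print("captcha cleared")
            return True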
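The control flow of `run` can be separated from the network calls, which makes it easy to check that the loop terminates. This is a minimal sketch rather than the author's implementation: `solve_with_retries` is a hypothetical name, and the three callables stand in for `get_capture`, `verify_code` and `pass_capture`.

```python
def solve_with_retries(fetch, recognize, submit, max_attempts=10):
    """Repeat fetch -> recognize -> submit until submit succeeds.

    fetch() returns the captcha image bytes, recognize(image) returns the
    answer text (empty on failure), and submit(answer) returns True once
    the captcha has been passed. Returns the number of attempts used, or
    None if every attempt failed.
    """
    for attempt in range(1, max_attempts + 1):
        image = fetch()
        answer = recognize(image)
        if not answer:
            continue  # recognition failed; this still counts as an attempt
        if submit(answer):
            return attempt
    return None
```

Using a bounded `for` loop instead of `while True` with a manual counter means no branch can skip the increment.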
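III. Reassembling sliced captcha images
On some sites the captcha image appears on the page like this:
But the browser's developer tools show that the underlying background image actually looks like this:
The displayed image is clearly assembled from slices of that background, so I looked in the JS and found the positioning coordinates:
Sure enough, the page shows the background image slice by slice, so restoring the original picture means reversing that slicing.
The approach in code: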
from PIL import Image

# CSS background-position offsets of the 30 large tiles; each tile is 21.98 * 58 (width * height)
offset_list = [['66', '40'], ['286', '40'], ['66', '98'], ['44', '40'], ['154', '40'], ['22', '40'], ['88', '98'],
               ['198', '40'], ['198', '98'], ['264', '98'], ['308', '40'], ['176', '40'], ['0', '98'], ['132', '98'],
               ['132', '40'], ['176', '98'], ['88', '40'], ['154', '98'], ['220', '40'], ['264', '40'], ['110', '40'],
               ['242', '98'], ['286', '98'], ['0', '40'], ['242', '40'], ['44', '98'], ['220', '98'], ['22', '98'],
               ['308', '98'], ['110', '98']]
# CSS background-position offsets of the 7 small tiles; each tile is 21.98 * 40
offset_list_small = [['264', '0'], ['154', '0'], ['44', '0'], ['242', '0'], ['110', '0'], ['176', '0'], ['88', '0']]


# Where each tile goes on the blank canvas
def convert_index_to_offset(index, size):
    if size == 1:
        if index < 15:  # the full captcha is made of 30 tiles in 2 rows of 15
            return (index * 22, 0)
        else:
            i = index - 15
            return (i * 22, 58)  # each large tile is 22 * 58
    elif size == 2:
        return (index * 22, 116)  # small tiles go in a third row below the two large rows


# Crop box of each tile in the source image
def convert_css_to_offset(off, size):
    # (left, upper)o ----- o
    #              |       |
    #              o ----- o(right, lower)
    if size == 1:
        return (int(off[0]), int(off[1]), int(off[0]) + 21.98, int(off[1]) + 58)
    elif size == 2:
        return (int(off[0]), int(off[1]), int(off[0]) + 21.98, int(off[1]) + 40)


# 289,92;256,58;131,43;91,22
# Reassemble the image
def recombine_captcha(file_id):
    captcha = Image.new('RGB', (22 * 15, 58 * 2 + 40))  # blank canvas
    img = Image.open('./capture/capture1_{}.png'.format(file_id))  # the sliced source image
    for i, off in enumerate(offset_list):
        box = convert_css_to_offset(off, 1)  # crop box from the CSS background-position
        region = img.crop(box)  # cut the tile out
        offset = convert_index_to_offset(i, 1)  # where this tile belongs on the canvas
        captcha.paste(region, offset)  # paste the tile at that position
    for i, off in enumerate(offset_list_small):
        box = convert_css_to_offset(off, 2)  # crop box from the CSS background-position
        region = img.crop(box)  # cut the tile out
        offset = convert_index_to_offset(i, 2)  # where this tile belongs on the canvas
        captcha.paste(region, offset)  # paste the tile at that position
    capture_2 = Image.open('./capture/text.png')
    captcha.paste(capture_2, (154, 116))  # paste text.png to the right of the small tiles
    captcha.save('./capture/capture2_{}.png'.format(file_id))


if __name__ == '__main__':
    recombine_captcha('-5543')
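The image on the left is the original from the page and the one on the right is the reassembled result. That completes the image processing; the next step is to send the image to a captcha-solving platform to get the coordinates of the characters, then submit them to the server to check whether the answer is correct.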
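The offset lists in the code above are hard-coded for one captcha. If the page exposes each slice's offset as a CSS `background-position` declaration, the lists could be extracted automatically. The sketch below assumes declarations of the form `background-position: -66px -40px` and assumes that the tile's position inside the image is the absolute value of each offset; both are assumptions about the page, and `parse_background_positions` is a hypothetical helper.

```python
import re

# Matches e.g. "background-position: -66px -40px" or "background-position:0 -98px"
_POSITION_RE = re.compile(
    r'background-position:\s*(-?\d+(?:\.\d+)?)(?:px)?\s+(-?\d+(?:\.\d+)?)(?:px)?')


def parse_background_positions(css_text):
    """Return a list of (x, y) integer tile offsets found in css_text.

    Assumption: the CSS shifts the background left/up by a negative amount,
    so the tile's position inside the image is the absolute value.
    """
    offsets = []
    for x, y in _POSITION_RE.findall(css_text):
        offsets.append((abs(round(float(x))), abs(round(float(y)))))
    return offsets
```

The result has the same order as the slices in the page, which is the order `enumerate(offset_list)` relies on; because `convert_css_to_offset` calls `int()` on each element, integer pairs work there as well as the string pairs used above.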