Analysis
Because Baidu Image loads its results via AJAX, a plain request to the normal page URL only gets you the images shown on the first screen, i.e. the first 30 (or maybe 60).
The analysis goes like this: scroll the page down and watch how the XHR requests change in the browser's developer tools, and you will find the JSON data you need. For example, this is the JSON request found that way:
1. JSON URL:
https://image.baidu.com/search/acjson?tn=resultjson_com&ipn=rj&ct=201326592&is=&fp=result&queryWord=hhkb&cl=&lm=-1&ie=utf-8&oe=utf-8&adpicid=&st=&z=&ic=0&word=hhkb&s=&se=&tab=&width=&height=&face=&istype=&qc=&nc=&fr=&pn=120&rn=30&gsm=78&1504602271332=
2. Parameters:
In the link above, word is the search keyword, pn is the result offset (page index × rn), and rn is the number of images per page (30 by default). The remaining parameters can be ignored for now.
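To make this concrete, a single page can be fetched with requests using just these parameters. This is a minimal sketch; the full script later in the post sends a few more fields and proper headers, and the exact parameter subset needed here is an assumption:

import requests

params = {
    'tn': 'resultjson_com', 'ipn': 'rj', 'ie': 'utf-8',
    'word': 'hhkb', 'queryWord': 'hhkb',  # search keyword
    'pn': 0,    # offset of the first result (page index * rn)
    'rn': 30,   # images per page
}
resp = requests.get('https://image.baidu.com/search/acjson', params=params)
print len(resp.json()['data'])   # number of result entries in this page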
3. JSON data
Requesting the JSON URL above returns a JSON document. Inspecting it, the data array contains the image information, and each element of data carries an objURL link, for example:
"objURL":"ippr_z2C$qAzdH3FAzdH3Ft4w2jf_z&e3B4wvx_z&e3BvgAzdH3Fu5674AzdH3Fda8n8aAzdH3Fd8AzdH3F8b8ccaacas17ir6i1lf1rn_z&e3B3r2",
replaceUrl also contains an ObjURL, but the first objURL looks obfuscated. A quick search shows the decoding is simple: the key is a character substitution with two kinds of mappings:
1. Several characters map to a single character:
'_z2C$q' => ':'
'_z&e3B' => '.'
'AzdH3F' => '/'
2. Single characters map to single characters; the full table is the in_table/out_table pair in the snippet below.
Decoding objURL this way yields the real image link:
import string

def decode_url(url):
    # Single-character substitution table (in_table[i] -> out_table[i]).
    in_table = u'0123456789abcdefghijklmnopqrstuvw'
    out_table = u'7dgjmoru140852vsnkheb963wtqplifca'
    translate_table = string.maketrans(in_table, out_table)
    # The multi-character sequences are replaced first.
    mapping = {'_z2C$q': ':', '_z&e3B': '.', 'AzdH3F': '/'}
    for k, v in mapping.items():
        url = url.replace(k, v)
    url = url.encode()
    return url.translate(translate_table)
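As a quick sanity check, the decoder can be run on the sample objURL shown earlier; the expected output is an ordinary http image link (this usage snippet is just an illustration and assumes decode_url from above is in scope):

encoded = ('ippr_z2C$qAzdH3FAzdH3Ft4w2jf_z&e3B4wvx_z&e3Bvg'
           'AzdH3Fu5674AzdH3Fda8n8aAzdH3Fd8AzdH3F8b8ccaacas17ir6i1lf1rn_z&e3B3r2')
print decode_url(encoded)   # should print a plain http://... .jpg link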
Let's get to it
With the information above we are basically ready to go. All we need is a loop: on each iteration, request one page of JSON, parse it to get the URL of every image on that page, and download the images. We use the requests library for the HTTP requests.
# coding:UTF-8
import os
import string

import requests

path = ''   # download directory, set by start()


def download(url, filename, fromHost):
    # Fetch a single image and write it into the download directory.
    # fromHost is accepted for a possible Referer header but is not used here.
    try:
        ir = requests.get(url)
        ir.raise_for_status()
        if ir.status_code == 200:
            filePathName = os.path.join(path, filename)
            with open(filePathName, 'wb') as f:
                f.write(ir.content)
            print "download %s success" % url
            return True
    except Exception as e:
        print 'download error: %s' % filename
        print e
        return False


def request(params):
    headers = {
        "Accept": "text/html,application/xhtml+xml,application/xml;",
        "Accept-Encoding": "gzip",
        "Accept-Language": "zh-CN,zh;q=0.8",
        "Referer": "http://www.baidu.com/",
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36"
    }

    def decode_url(url):
        # Decode Baidu's obfuscated objURL: multi-character mappings first,
        # then the single-character substitution table.
        in_table = u'0123456789abcdefghijklmnopqrstuvw'
        out_table = u'7dgjmoru140852vsnkheb963wtqplifca'
        translate_table = string.maketrans(in_table, out_table)
        mapping = {'_z2C$q': ':', '_z&e3B': '.', 'AzdH3F': '/'}
        for k, v in mapping.items():
            url = url.replace(k, v)
        url = url.encode()
        return url.translate(translate_table)

    try:
        url = "http://image.baidu.com/search/acjson"
        response = requests.get(url, params=params, headers=headers)
        response.raise_for_status()
        response.encoding = response.apparent_encoding
        jsons = response.json()['data']
        for item in jsons:
            # Collect candidate URLs: the decoded objURL first, then the
            # second replaceUrl entry as a fallback.
            image_urls = []
            if 'objURL' in item.keys():
                image_urls.append(decode_url(item['objURL']))
            if 'replaceUrl' in item.keys() and len(item['replaceUrl']) == 2:
                image_urls.append(item['replaceUrl'][1]['ObjURL'])
            print len(image_urls)
            for objUrl in image_urls:
                # Use the last path component, with any query string stripped,
                # as the file name.
                filename = os.path.split(objUrl)[1].split('?')[0]
                if len(filename) != 0 and filename.find('.') >= 0:
                    fromHost = item['fromURLHost']
                    print 'Downloading from %s' % objUrl
                    if download(objUrl, filename, fromHost):
                        break   # stop at the first URL that downloads successfully
    except Exception as e:
        print e
        return "get url error"


def search(keyword, minpage, maxpage):
    params = {
        'tn': 'resultjson_com',
        'word': keyword,
        'queryWord': keyword,
        'ie': 'utf-8',
        'cg': '',
        'ct': '201326592',
        'fp': 'result',
        'cl': '2',
        'lm': '-1',
        'rn': '30',
        'ipn': 'rj'
    }
    for i in range(minpage, maxpage):
        print 'Download page %d:' % i
        params['pn'] = '%d' % (i * 30)   # pn is the result offset: page index * rn
        request(params)
    print 'download end'


def start(keyword, startpage, endpage, inpath=''):
    # Default to a sub-directory named after the keyword.
    if len(inpath) == 0:
        inpath = os.curdir + '/' + keyword
    global path
    path = inpath.decode('utf-8')
    print 'download image to %s' % path
    if not os.path.exists(path):
        os.mkdir(path)
    search(keyword, startpage, endpage)
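To use it, call start() with a keyword and a page range. For example, the following (the __main__ guard and the keyword are just an illustration) grabs the first four pages of results for hhkb into ./hhkb:

if __name__ == '__main__':
    # Download result pages 0-3 (4 pages * 30 images) for the keyword "hhkb";
    # pass inpath to choose a different download directory.
    start('hhkb', 0, 4)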