
Baidu Image Crawler

Analysis

Because Baidu Images loads results via AJAX, a plain request to the normal page URL only gives you the images on the first screen, i.e. the first 30 (or is it 60?).

Concretely: drag the page's scrollbar down and watch how the XHR requests change, and you will find the JSON data you need. For example, here is what that analysis turns up:

1. JSON URL:

https://image.baidu.com/search/acjson?tn=resultjson_com&ipn=rj&ct=201326592&is=&fp=result&queryWord=hhkb&cl=&lm=-1&ie=utf-8&oe=utf-8&adpicid=&st=&z=&ic=0&word=hhkb&s=&se=&tab=&width=&height=&face=&istype=&qc=&nc=&fr=&pn=120&rn=30&gsm=78&1504602271332=

2. Parameters:

Looking at the URL above: word is the search keyword, pn is the result offset (with rn=30 per page, pn=120 asks for the fifth page), and rn is the number of images per page (default 30). The other parameters can be ignored for now.
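A quick way to sanity-check the endpoint is to request a single page yourself. Here is a minimal sketch with requests (the keyword hhkb and the offsets are just examples, and it assumes the endpoint still returns well-formed JSON):

# coding:UTF-8
import requests

# One page of results: pn is the offset and rn the page size,
# so pn=120 with rn=30 asks for the fifth page of "hhkb" results.
params = {'tn': 'resultjson_com', 'ipn': 'rj', 'fp': 'result',
          'word': 'hhkb', 'queryWord': 'hhkb',
          'ie': 'utf-8', 'oe': 'utf-8', 'pn': 120, 'rn': 30}
r = requests.get('https://image.baidu.com/search/acjson', params=params)
print len(r.json()['data'])   # number of image records returned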

3. JSON data

Requesting the JSON URL above returns a JSON document. In it, the data array holds the image records, and each element contains an objURL link, e.g.:

"objURL":"ippr_z2C$qAzdH3FAzdH3Ft4w2jf_z&e3B4wvx_z&e3BvgAzdH3Fu5674AzdH3Fda8n8aAzdH3Fd8AzdH3F8b8ccaacas17ir6i1lf1rn_z&e3B3r2",

replaceUrl also contains an ObjURL, but this first objURL is clearly obfuscated. A quick Baidu search shows the decoding is simple: the key is a character-for-character correspondence with two kinds of mappings:

1. A few multi-character sequences map to single characters:

'_z2C$q' => ':'

'_z&e3B' => '.'

'AzdH3F' => '/'

2. Single characters map to single characters; the mapping table is the in_table/out_table pair in the snippet below.

Decoding objURL with these two mappings yields the real image link:

import string

def decode_url(url):
    # Single-character substitution table; string.maketrans needs two
    # equal-length byte strings, hence plain (non-unicode) literals.
    in_table = '0123456789abcdefghijklmnopqrstuvw'
    out_table = '7dgjmoru140852vsnkheb963wtqplifca'
    translate_table = string.maketrans(in_table, out_table)
    # Multi-character sequences decode to URL punctuation.
    mapping = {'_z2C$q': ':', '_z&e3B': '.', 'AzdH3F': '/'}
    for k, v in mapping.items():
        url = url.replace(k, v)
    # str.translate() wants a byte string, so encode unicode input first.
    return url.encode('utf-8').translate(translate_table)
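As a sanity check, feeding the sample objURL from above through this function should print a plain image URL (decoded by hand here, so treat it as illustrative):

print decode_url('ippr_z2C$qAzdH3FAzdH3Ft4w2jf_z&e3B4wvx_z&e3BvgAzdH3Fu5674AzdH3Fda8n8aAzdH3Fd8AzdH3F8b8ccaacas17ir6i1lf1rn_z&e3B3r2')
# http://images.macx.cn/forum/201310/21/181550050lduhprhd9sdp3.jpg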
           

Let's Do It

With the information above we're basically ready to go. All we need is a loop: each iteration requests one page of JSON, parses it to get the URL of every image on that page, and downloads them. We'll use the requests library for the HTTP requests.

# coding:UTF-8
import os
import string
import requests

# Download directory; set by start() below.
path = ''

def download(url, filename, fromHost):
    try:
        # Send the source page's host as the Referer: some sites
        # refuse hotlinked image requests without it.
        ir = requests.get(url, headers={'Referer': fromHost})
        ir.raise_for_status()
        if ir.status_code == 200:
            filePathName = os.path.join(path, filename)
            with open(filePathName, 'wb') as f:
                f.write(ir.content)
        print 'download %s success' % url
        return True
    except Exception as e:
        print 'download error: %s' % filename
        print e
        return False

def request(params):
    headers = {"Accept": "text/html,application/xhtml+xml,application/xml;",
               "Accept-Encoding": "gzip",
               "Accept-Language": "zh-CN,zh;q=0.8",
               "Referer": "http://www.baidu.com/",
               "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36"
               }

    def decode_url(url):
        # Same decoding as above: single-character substitution plus
        # the three multi-character sequences.
        in_table = '0123456789abcdefghijklmnopqrstuvw'
        out_table = '7dgjmoru140852vsnkheb963wtqplifca'
        translate_table = string.maketrans(in_table, out_table)
        mapping = {'_z2C$q': ':', '_z&e3B': '.', 'AzdH3F': '/'}
        for k, v in mapping.items():
            url = url.replace(k, v)
        return url.encode('utf-8').translate(translate_table)

    try:
        url = "http://image.baidu.com/search/acjson"
        response = requests.get(url, params=params, headers=headers)
        response.raise_for_status()
        response.encoding = response.apparent_encoding
        items = response.json()['data']
        for item in items:
            # Collect candidate URLs for this image: the decoded objURL
            # plus, when present, the plain ObjURL from replaceUrl.
            image_urls = []
            if 'objURL' in item:
                image_urls.append(decode_url(item['objURL']))
            if 'replaceUrl' in item and len(item['replaceUrl']) == 2:
                image_urls.append(item['replaceUrl'][1]['ObjURL'])

            print len(image_urls)
            for objUrl in image_urls:
                # File name: basename of the URL with the query string
                # stripped; skip names without an extension.
                filename = os.path.split(objUrl)[1].split('?')[0]
                if len(filename) != 0 and filename.find('.') >= 0:
                    fromHost = item['fromURLHost']
                    print 'Downloading from %s' % objUrl
                    if download(objUrl, filename, fromHost):
                        break  # first successful candidate wins
    except Exception as e:
        print e
        return "get url error"

def search(keyword, minpage, maxpage):
    params = {
        'tn': 'resultjson_com',
        'word': keyword,
        'queryWord': keyword,
        'ie': 'utf-8',
        'cg': '',
        'ct': '201326592',
        'fp': 'result',
        'cl': '2',
        'lm': '-1',
        'rn': '30',
        'ipn': 'rj'
    }
    for i in range(minpage, maxpage):
        print 'Download page %d:' % i
        params['pn'] = '%d' % (i * 30)  # pn is a result offset: page * rn
        request(params)
    print 'download end'

def start(keyword, startpage, endpage, inpath=''):
    if len(inpath) == 0:
        inpath = os.curdir + '/' + keyword
    global path
    path = inpath.decode('utf-8')
    print 'download image to %s' % path
    if not os.path.exists(path):
        os.mkdir(path)
    search(keyword, startpage, endpage)
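
That's all there is to it. A minimal entry point might look like this (the keyword and page range are just examples):

if __name__ == '__main__':
    # Download the first two result pages (60 images) of "hhkb"
    # into ./hhkb next to the script.
    start('hhkb', 0, 2)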