Analyzing and Scraping Ajax Data from Dynamically Rendered Pages

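Most web pages today load their data through dynamic rendering, commonly referred to as Ajax. On such sites the data usually cannot be found in the HTML the page initially loads; you have to locate the request that loads it dynamically and fetch the data from there. This article analyzes the Toutiao (今日頭條) site: it runs a keyword search, extracts the images and titles from the results page, and finally saves them locally with a simple multithreaded download.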

1: Open the Toutiao home page at https://www.toutiao.com/

There is a search box in the upper-right corner of the page; searching for a keyword loads the results for that keyword.

2: Analyze the page

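After you click search, the page reloads and returns a results page for the keyword. The source of that page contains none of the actual result content, which tells us it is loaded via Ajax. The requests that load it can be inspected under the XHR filter of the browser's Network panel.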

The XHR entry shows the URL that loads the data, and the request headers are listed below it.

Looking at the response returned for this request, we find that all of the data is there, in JSON form.

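The actual content sits in the data field of the response, for example the title and the image_list of each result. Reading these fields gives us the data we need.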
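
To make the structure concrete, here is a minimal sketch that pulls the title and image URLs out of a response shaped like the one described above. The sample payload is invented for illustration: only the field names data, title, image_list and url come from the analysis, and extract_items is a hypothetical helper name. A real response carries many more fields.

```python
# Hypothetical, trimmed-down payload; a real response has many more fields.
sample_response = {
    'data': [
        {'title': 'Example article',
         'image_list': [{'url': 'http://example.com/a.jpg'},
                        {'url': 'http://example.com/b.jpg'}]},
        {'image_list': []},   # entries without a title are skipped
    ]
}

def extract_items(response_json):
    """Return (title, [image urls]) pairs for entries that have a title."""
    items = []
    for entry in response_json.get('data') or []:
        title = entry.get('title')
        if not title:
            continue
        urls = [img.get('url') for img in entry.get('image_list', [])]
        items.append((title, urls))
    return items

print(extract_items(sample_response))
```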

3: Analyze multi-page data

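When you scroll to the bottom of the page, the next page of data loads automatically. Comparing the parameters of the follow-up requests shows that only offset changes: it is 0 for the first request, 20 for the second, 40 for the third, and 60 for the fourth. So offset is the paging offset, and from that we can infer that count is the number of items fetched per request. We can therefore use offset to control paging, fetch the data in bulk through this endpoint, parse it, and download the images.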
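
The paging rule can be written down in a couple of lines. This is only a sketch of the arithmetic; page_offsets is a hypothetical helper, and the default count of 20 is taken from the observed requests.

```python
def page_offsets(pages, count=20):
    """Offsets for the first `pages` requests: 0, count, 2*count, ..."""
    return [page * count for page in range(pages)]

print(page_offsets(4))   # [0, 20, 40, 60]
```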

In fact, by passing the keyword into the URL as a parameter, the same approach can download Toutiao images for any keyword.
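
As a rough sketch of that idea, the keyword can be made an argument of a URL-building function. The endpoint is the one used in the program in section 4; build_search_url is a hypothetical helper, and it sends only the parameters discussed so far, whereas the real request carries several more.

```python
from urllib.parse import urlencode

BASE_URL = 'https://www.toutiao.com/api/search/content/?'

def build_search_url(keyword, offset=0, count=20):
    # Only the parameters discussed in the text; the real request sends more.
    params = {'keyword': keyword, 'offset': offset, 'count': count, 'format': 'json'}
    return BASE_URL + urlencode(params)   # urlencode percent-encodes non-ASCII keywords

print(build_search_url('香港', offset=20))
```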

4: Implementation

Downloading only the first page

import os
from hashlib import md5
from urllib.parse import quote, urlencode    # urlencode() builds the GET query string

import requests

# Header values must be Latin-1 encodable, so non-ASCII text is percent-encoded with quote().
headers = {
    'Host': 'www.toutiao.com',
    'Referer': 'https://www.toutiao.com/search/?keyword=' + quote('香港'),
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest',
    # Copied from a browser session; replace it with the cookie from your own session.
    'Cookie': 'tt_webid=6704808321848215043; WEATHER_CITY=' + quote('武漢') + '; UM_distinctid=16b77e85ac24d7-006d237c790a88-4a5568-1fa400-16b77e85ac344b; CNZZDATA1259612802=1104480020-1561079689-%7C1561079689; __tasessionId=n39mykl9q1561084844772; tt_webid=6704808321848215043; csrftoken=dcd1d465bc7441c32e8f77a18e4dcb62; s_v_web_id=bfc941bbb6430551a71c9f370afd2c71'
}

def get_page(offset):
    """Request one page of search results; return the parsed JSON, or None on failure."""
    params = {
        'aid': 24,
        'app_name': 'web_search',
        'format': 'json',
        'keyword': '香港',
        'autoload': 'true',
        'count': 20,
        'en_qc': 1,
        'from': 'search_tab',
        'pd': 'synthesis',
        'timestamp': '1561085020266',
        'offset': offset
    }
    url = 'https://www.toutiao.com/api/search/content/?' + urlencode(params)
    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        return response.json()
    return None

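def get_image(page):
    # get_page() returns None on failure, so check page itself before reading 'data'.
    if page and page.get('data'):
        for item in page.get('data'):
            title = item.get('title')
            image_list = item.get('image_list', [])
            if title:
                # Generator: yield the title together with its list of image URLs
                yield {
                    'title': title,
                    'image_list': image_list
                }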

def save_image(item):
    folder = 'E:\\toutiao\\' + item.get('title')
    os.makedirs(folder, exist_ok=True)      # create the folder (and its parent) if missing

    for img in item.get('image_list'):
        print(img.get('url'))
        response = requests.get(img.get('url'))
        if response.status_code == 200:
            # Name the file after the MD5 of its content so identical images share one name
            file_path = '{0}\\{1}.{2}'.format(folder,
                                              md5(response.content).hexdigest(),
                                              'jpg')
            if not os.path.exists(file_path):
                with open(file_path, 'wb') as f:
                    f.write(response.content)     # write the binary image data
            else:
                print('Already downloaded')
        else:
            print(response.reason)

if __name__ == '__main__':
    x=get_page(0)
    for y in get_image(x):
        save_image(y)
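
One caveat for save_image above: article titles can contain characters that Windows does not allow in folder names (such as ?, : or "), in which case creating the folder fails. The following is a minimal sketch of one way to clean a title before using it as a folder name. safe_dirname and its max_length default are assumptions made for illustration and are not part of the original program.

```python
import re

def safe_dirname(title, max_length=80):
    """Replace characters that Windows forbids in file and folder names."""
    cleaned = re.sub(r'[\\/:*?"<>|\r\n\t]', '_', title)
    # Truncate first, then drop trailing spaces and dots, which Windows also rejects.
    cleaned = cleaned[:max_length].strip(' .')
    return cleaned or 'untitled'

print(safe_dirname('What is "Ajax"? A/B test: part 1'))
```

Calling safe_dirname(item.get('title')) wherever save_image builds its folder path would apply the cleanup.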

Multithreaded download of multiple pages

Building on the program above, add a new function main and replace the earlier __main__ block with the one below:

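from multiprocessing.pool import ThreadPool   # thread-based pool; Pool would start processes instead

def main(offset):
    page = get_page(offset)
    for item in get_image(page):
        print(item)
        save_image(item)

GROUP_START = 0      # first page index (0 gives offset 0, the first page)
GROUP_END = 20       # last page index

if __name__ == '__main__':
    pool = ThreadPool()           # create a pool of worker threads
    groups = [x * 20 for x in range(GROUP_START, GROUP_END + 1)]  # one offset per page
    pool.map(main, groups)        # call main() once per offset, several at a time
    pool.close()
    pool.join()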
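
The same fan-out over offsets can also be sketched with the standard library's concurrent.futures. The worker here is a stand-in that makes no network calls, so the example runs anywhere. fetch_page_stub, run_in_threads and the max_workers value are hypothetical names and settings chosen for illustration; in the real program the worker would be main.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_page_stub(offset):
    """Stand-in for main(offset): pretends to process one page."""
    return 'page at offset {}'.format(offset)

def run_in_threads(worker, offsets, max_workers=4):
    # executor.map returns results in the same order as `offsets`.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(worker, offsets))

print(run_in_threads(fetch_page_stub, [0, 20, 40]))
```

Threads suit this job because the work is dominated by waiting on HTTP responses rather than by computation.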
