Ajax Data Scraping: Playing with Weibo in Python

2023-07-15 11:16:47

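Target page

Sina Weibo, My Homepage: scraping my own Weibo posts

Right-click the page and choose Inspect, switch to the Network tab, turn on the XHR filter (which isolates Ajax requests), and refresh the page.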

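The response body is the JSON content I want to scrape (attitudes_count is the number of likes, comments_count is the number of comments, reposts_count is the number of reposts, created_at is the publish time, and text is the post body).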
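To make those field names concrete, here is a minimal sketch that pulls them out of a response-shaped dictionary. The sample record is invented for illustration, the `data` then `list` nesting is taken from the parsing code later in this post, and the tag stripping uses the standard library's `html.parser` instead of pyquery so that the snippet runs without extra installs.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(html):
    """Return the plain text of an HTML fragment."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return ''.join(parser.parts)


# Hypothetical sample shaped like the response described above.
sample = {
    'data': {
        'list': [
            {
                'id': 1,
                'text': 'Hello <a href="/n/someone">@someone</a>',
                'created_at': 'placeholder timestamp',
                'attitudes_count': 3,
                'comments_count': 1,
                'reposts_count': 0,
            }
        ]
    }
}


def summarize(item):
    """Keep only the fields of interest, with the post body as plain text."""
    return {
        'id': item.get('id'),
        'text': strip_tags(item.get('text', '')),
        'created_at': item.get('created_at'),
        'likes': item.get('attitudes_count'),
        'comments': item.get('comments_count'),
        'reposts': item.get('reposts_count'),
    }


records = [summarize(item) for item in sample['data']['list']]
print(records)
```

The `text` field may contain HTML markup such as links, which is why it is reduced to plain text before being stored.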

Scroll down the Weibo page to load more content, and you can see new Ajax requests being sent each time.

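In the Headers panel, the request URL is https://weibo.com/ajax/statuses/mymblog?uid=123&page=1&feature=0, which carries three request parameters: uid, page, and feature.

You can also scroll down in the Headers panel and find the same parameters under Query String Parameters.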
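As a quick check that the three parameters really account for the whole URL, the sketch below rebuilds it with `urlencode` and then parses it back. The uid value 123 is the same placeholder used throughout this post.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

base_url = 'https://weibo.com/ajax/statuses/mymblog?'
params = {'uid': '123', 'page': 1, 'feature': '0'}  # 123 is a placeholder uid

# Build the request URL from the parameter dictionary.
url = base_url + urlencode(params)
print(url)

# Going the other way: recover the parameters from a copied request URL.
query = parse_qs(urlsplit(url).query)
print(query)
```

Note that `parse_qs` returns each value as a list of strings, because a query string may repeat a key.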

Code

Scrolling through the page also shows that each page holds 20 posts, so page is clearly the parameter that changes from request to request.

# Import libraries
from urllib.parse import urlencode
import requests
from pyquery import PyQuery as pq
from pymongo import MongoClient

Note that uid and cookie contain personal information, so I have replaced both with 123.

# Base URL (the first half of the request URL) and request headers
base_url = 'https://weibo.com/ajax/statuses/mymblog?'
headers = {
    'cookie':'123',
    'referer':'https://weibo.com/u/123',
    'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36',
    'x-requested-with':'XMLHttpRequest'
}
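Because the real uid and cookie are private, one option is to keep them out of the source file entirely. The following is only a sketch of that idea: the environment variable names `WEIBO_UID` and `WEIBO_COOKIE` and the helper `build_headers` are made up for this example, and the fallback value 123 is the same placeholder as above.

```python
import os


def build_headers(env=os.environ):
    """Build request headers from environment-style settings.

    WEIBO_UID and WEIBO_COOKIE are hypothetical names chosen for this sketch.
    """
    uid = env.get('WEIBO_UID', '123')
    cookie = env.get('WEIBO_COOKIE', '123')
    return {
        'cookie': cookie,
        'referer': f'https://weibo.com/u/{uid}',
        'x-requested-with': 'XMLHttpRequest',
    }


# Passing a plain dict makes the behaviour easy to see without touching the real environment.
example = build_headers({'WEIBO_UID': '42', 'WEIBO_COOKIE': 'abc'})
print(example['referer'])
```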

# Request one page of posts
def get_page(page):
    # Build the query parameters
    params = {
        'uid': '123',
        'page': page,
        'feature': '0',
    }
    url = base_url + urlencode(params)  # Assemble the full URL
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.json()  # Parse the body as JSON and return it
    except requests.ConnectionError as e:
        print('Error', e.args)

def parse_page(json):
    if json:
        # data.list holds the posts returned for this page
        items = json.get('data').get('list')
        for item in items:  # Iterate over the posts and pick out the fields to scrape
            weibo = {}
            weibo['id'] = item.get('id')
            weibo['text'] = pq(item.get('text')).text()
            weibo['attitudes'] = item.get('attitudes_count')
            weibo['comments'] = item.get('comments_count')
            weibo['reposts'] = item.get('reposts_count')
            yield weibo

# The results can be saved at this point; I store them in MongoDB
if __name__ == '__main__':
    # Connect to MongoDB and use 'weibo' as both the database and collection name
    client = MongoClient()
    db = client['weibo']
    collections = db['weibo']
    # Loop over the pages, printing and saving each post
    for page in range(1, 11):
        json = get_page(page)
        results = parse_page(json)
        for result in results:
            print(result)
            # Save the post
            if collections.insert_one(result):
                print('Save to Mongo')
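The loop above always requests pages 1 through 10. Since each page holds 20 posts, an account with fewer than 200 posts would produce some pointless requests. The sketch below stops at the first empty page instead. It assumes that an empty or missing `list` marks the end of the timeline, which I have not verified against the real endpoint, and it uses a stand-in `fake_fetch` function in place of `get_page` so that it runs offline.

```python
def iter_pages(fetch, max_pages=50):
    """Yield (page, items) until a page comes back empty or missing."""
    for page in range(1, max_pages + 1):
        payload = fetch(page)
        # Tolerate a None payload or missing keys instead of raising.
        items = ((payload or {}).get('data') or {}).get('list') or []
        if not items:
            break
        yield page, items


# Stand-in for get_page: 45 fake posts served 20 per page.
def fake_fetch(page, total=45, per_page=20):
    start = (page - 1) * per_page
    ids = range(start, min(start + per_page, total))
    return {'data': {'list': [{'id': i} for i in ids]}}


pages = [(page, len(items)) for page, items in iter_pages(fake_fetch)]
print(pages)
```

With the real scraper, `get_page` could be passed in place of `fake_fetch`; the `max_pages` cap is there so that an unexpected response shape cannot cause an endless crawl.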
