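WeChat Mini Program Scraping in Practice with Python and Packet Capture: Collecting an Adult College Entrance Exam Question Bank Mini Program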

⛳️ Scenario

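Starting with this post, we will write a series of crawlers that target WeChat mini programs. As before, the crawlers are tied together through worked cases, so that each one is useful for learning.

Before we start, prepare the tools: Fiddler, a program that can decrypt HTTPS requests, and the desktop version of WeChat.

Because WeChat adjusted its mini program architecture in May 2022, some basic environment configuration is needed before starting so that the network packets can be captured.

If Fiddler already captures the packets successfully after it starts, you can skip this step.

Locate the folder at the path below and empty it, then restart WeChat. After that, the HTTPS requests made by the mini program can be captured.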

C:\Users\Administrator\AppData\Roaming\Tencent\WeChat\XPlugin\Plugins\WMPFRuntime
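
The cleanup step above can also be scripted. The following is only a minimal sketch: `clear_folder` is a hypothetical helper name, and it simply deletes everything inside whatever folder it is given while keeping the folder itself.

```python
import shutil
from pathlib import Path


def clear_folder(folder: Path) -> int:
    """Delete everything inside `folder` but keep the folder itself.

    Returns the number of top-level entries that were removed.
    """
    removed = 0
    for entry in folder.iterdir():
        if entry.is_dir() and not entry.is_symlink():
            # Real sub-directories are removed together with their contents.
            shutil.rmtree(entry)
        else:
            # Plain files and symlinks are unlinked.
            entry.unlink()
        removed += 1
    return removed
```

Calling `clear_folder(Path(r"C:\Users\Administrator\AppData\Roaming\Tencent\WeChat\XPlugin\Plugins\WMPFRuntime"))` would empty the folder named above; the user name in the path (Administrator here) differs from machine to machine, and WeChat still has to be restarted afterwards.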

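Because scraping with Python can raise infringement issues, the key site information is masked (in the URLs below, the host name and the sign and state values appear as placeholder text).

After capturing the requests, first analyze the request URLs. The target site in this post follows the pattern below (early on, it helps to capture several different URLs):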

weixin.不能展示的作品位址.com/app/index.php?i=1063&t=0&v=1.0.0&from=wxapp&c=entry&a=wxapp&do=gettopic&m=huitong_shuati&sign=指紋位址&topic_class_id=30702&ordinal=1
weixin.不能展示的作品位址.com/app/index.php?i=1063&t=0&v=1.0.0&from=wxapp&c=entry&a=wxapp&do=gettopic&m=huitong_shuati&sign=指紋位址&topic_class_id=30702&ordinal=2

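While guessing at the parameters, I noticed the encrypted parameter sign. That looked troublesome, because there is no way to debug a mini program, and if the parameter had to be reverse-engineered we would also face the problem of unpacking the mini program. Later analysis was a relief, though: the parameter does not actually take part in any computation.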

The core parameters are:

c: controller;
a: action, which can also be called the function;
m: presumably the model;
topic_class_id: exam paper ID;
ordinal: question number.
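
To make the parameter list concrete, here is a small sketch that splits such a URL into its query parameters with the standard library and rebuilds it without the ones that turned out to be unnecessary. The host `example.invalid` and the sign value `abc123` are stand-ins, since the real values are masked in this post, and `query_params` and `drop_params` are hypothetical helper names.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit

# Stand-in URL: the host and the sign value are placeholders, not real data.
SAMPLE_URL = (
    "https://example.invalid/app/index.php?i=1063&t=0&v=1.0.0&from=wxapp"
    "&c=entry&a=wxapp&do=gettopic&m=huitong_shuati&sign=abc123"
    "&topic_class_id=30702&ordinal=1"
)


def query_params(url: str) -> dict:
    """Return the query string as a plain dict (first value per key)."""
    parsed = parse_qs(urlsplit(url).query)
    return {key: values[0] for key, values in parsed.items()}


def drop_params(url: str, names: set) -> str:
    """Rebuild `url` without the query parameters listed in `names`."""
    parts = urlsplit(url)
    kept = {k: v for k, v in query_params(url).items() if k not in names}
    return urlunsplit(parts._replace(query=urlencode(kept)))


params = query_params(SAMPLE_URL)
print(params["c"], params["a"], params["m"], params["topic_class_id"], params["ordinal"])
print(drop_params(SAMPLE_URL, {"sign", "t", "v"}))
```

Dropping sign, t and v here mirrors the shortened URLs used in the code later in this post; whether other parameters can be dropped as well has to be checked against the live site.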

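If these are the only parameters that matter, you can iterate over them in a blank Chrome window, that is, check whether changing the question number alone is enough to fetch every question.

As the question number keeps increasing, a question that does not exist eventually returns the following response.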

{"errno":0,"message":"\u9898\u7684\u6570\u636e\u83b7\u53d6\u5931\u8d25","data":{"status":0}}

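Decoding the Unicode escapes gives a message saying that the question data could not be fetched, which means every question in the paper has been retrieved.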
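
The decoding and the stop condition can be sketched as follows. The response body is copied from the text above; `collect_questions` and its `fetch` argument are hypothetical names, with `fetch(ordinal)` assumed to return the raw response text for one question number, so the loop can be tried without touching the real site.

```python
import json

# Response body copied from the text above (raw string keeps the \u escapes).
RAW = r'{"errno":0,"message":"\u9898\u7684\u6570\u636e\u83b7\u53d6\u5931\u8d25","data":{"status":0}}'

payload = json.loads(RAW)
# json.loads turns the \uXXXX escapes into real characters:
# "题的数据获取失败", roughly "failed to fetch the question data".
message = payload["message"]


def question_missing(body: str) -> bool:
    """True when the response reports data.status == 0 (no such question)."""
    return json.loads(body)["data"]["status"] == 0


def collect_questions(fetch, start: int = 1, limit: int = 1000) -> list:
    """Call fetch(ordinal) with growing ordinals until a question is missing.

    `limit` is a safety cap so a misbehaving server cannot cause an endless loop.
    """
    results = []
    for ordinal in range(start, start + limit):
        body = fetch(ordinal)
        if question_missing(body):
            break
        results.append(json.loads(body)["data"])
    return results


print(question_missing(RAW))
```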

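When a question is fetched successfully, the response contains the question stem, the options, the explanation, and the total number of questions in the paper. With this data in hand, we can move on to the coding part.

The code below is built mainly on the Python requests module.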

⛳️ Hands-on coding

In practice, you can first store the data in MySQL (saving it straight to JSON files also works) and then read it back later for further processing.
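
For the JSON-file option, a minimal sketch could look like this. The helper names `save_questions` and `load_questions` are hypothetical, and the question dictionaries are whatever the scraper collected; no particular field layout is assumed.

```python
import json
from pathlib import Path


def save_questions(questions: list, path: Path) -> None:
    """Write the collected question dicts to `path` as UTF-8 JSON."""
    with path.open("w", encoding="utf-8") as f:
        # ensure_ascii=False keeps Chinese text readable instead of \uXXXX escapes.
        json.dump(questions, f, ensure_ascii=False, indent=2)


def load_questions(path: Path) -> list:
    """Read the question dicts back for later processing."""
    with path.open(encoding="utf-8") as f:
        return json.load(f)
```

One file per exam paper keeps each file small and makes it easy to re-run a single paper if a request fails halfway through.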

Let's go over the implementation steps once more:

1. Get the question categories from the category page

The URL for this page is:

weixin.不能展示的作品位址.com/app/index.php?i=1063&t=0&v=1.0.0&from=wxapp&c=entry&a=wxapp&do=fenlei_data&state=指紋位址&m=huitong_shuati&sign=指紋位址&up_class_id=3464&openid=om76m4sMOiBzooHFFKbqcZJFysq0

2. Get the exam subject data

The URL is:

weixin.不能展示的作品位址.com/app/index.php?i=1063&t=0&v=1.0.0&from=wxapp&c=entry&a=wxapp&do=fenlei_data&state=指紋位址&m=huitong_shuati&sign=指紋位址&up_class_id=24321&openid=om76m4sMOiBzooHFFKbqcZJFysq0

3. Get the exam paper group data

https://weixin.不能展示的作品位址.com/app/index.php?i=1063&t=0&v=1.0.0&from=wxapp&c=entry&a=wxapp&do=fenlei_data&state=指紋位址&m=huitong_shuati&sign=指紋位址&up_class_id=24377&openid=om76m4sMOiBzooHFFKbqcZJFysq0

4. Get the exam papers in a group

https://weixin.不能展示的作品位址.com/app/index.php?i=1063&t=0&v=1.0.0&from=wxapp&c=entry&a=wxapp&do=fenlei_data&state=指紋位址&m=huitong_shuati&sign=指紋位址&up_class_id=30658&openid=om76m4sMOiBzooHFFKbqcZJFysq0

5. Get the questions

https://weixin.不能展示的作品位址.com/app/index.php?i=1063&t=0&v=1.0.0&from=wxapp&c=entry&a=wxapp&do=gettopic&state=指紋位址&m=huitong_shuati&sign=指紋位址&topic_class_id=30659&ordinal=1

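With these five URLs collected, they still need to be sorted and analyzed to see whether any step can be skipped.

After filtering, up_class_id stands out as the core parameter.

The rest of the work therefore revolves around up_class_id. The code for fetching the second-level category data is as follows: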

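import requests
import pymysql

# Desktop browser User-Agent sent with every request.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36"
}


def get_conn():
    """Open a connection to the local MySQL database."""
    conn = pymysql.connect(
        host='127.0.0.1',
        user='your_account',
        password='your_password',  # not in the original snippet; needed if the account has a password
        db='question_scrapy',
        port=3306,
        cursorclass=pymysql.cursors.DictCursor
    )
    return conn


def run():
    # verify=False disables TLS certificate verification for this request.
    res = requests.get(
        'https://weixin.不能展示的作品位址.com/app/index.php?i=1063&from=wxapp&c=entry&a=wxapp&do=fenlei_data&m=huitong_shuati&up_class_id=3464',
        headers=headers, timeout=5, verify=False)

    json_data = res.json()
    if json_data["errno"] == 0:
        # The category list sits under data -> '0' -> fenlei_data.
        fenlei_data = json_data["data"]['0']["fenlei_data"]
        insert_fenlei(fenlei_data)


def insert_fenlei(fenlei_data):
    """Insert one row (id, name) per category into the fenlei_data table."""
    conn = get_conn()
    cursor = conn.cursor()
    for item in fenlei_data:
        up_class_id = item["id"]
        topic_class_name = item["topic_class_name"]

        try:
            # Parameterized query: pymysql escapes the values, so quotes in a
            # category name cannot break the statement.
            cursor.execute(
                'insert into fenlei_data values(%s, %s)',
                (up_class_id, topic_class_name))
            conn.commit()
        except Exception as e:
            print(f'insert failed for {up_class_id}: {e}')
            conn.rollback()

    cursor.close()
    conn.close()


if __name__ == '__main__':
    run()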

The questions of a single paper are fetched with the following function, which reuses requests and headers from the code above:

def get_question(topic_class_id, ordinal=1):
    """Print every question of the paper identified by topic_class_id."""
    # URL template; only topic_class_id and ordinal change between requests.
    url_tpl = 'https://weixin.不能展示的作品位址.com/app/index.php?i=1063&from=wxapp&c=entry&a=wxapp&do=gettopic&m=huitong_shuati&topic_class_id={}&ordinal={}'
    res = requests.get(url_tpl.format(topic_class_id, ordinal), headers=headers, timeout=5)
    data = res.json()["data"]
    # status == 1 means the question exists; count is the paper's total number of questions.
    if data["status"] == 1:
        questions_count = int(data["count"])
        for num in range(1, questions_count + 1):
            res = requests.get(url_tpl.format(topic_class_id, num), headers=headers, timeout=5)
            print(res.json()["data"]["topic_data"][0])
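
Since every category level is requested through the same fenlei_data endpoint with a different up_class_id, the levels can in principle be walked recursively. The sketch below is built on assumptions: `fetch_children` is a hypothetical callable that returns the list found under data -> '0' -> fenlei_data for a given up_class_id, a node without children is assumed to return an empty list, and the tree used in the demo is made-up stand-in data, not real site content.

```python
from typing import Callable, Dict, Iterator, List, Tuple

# Assumed shape: fetch_children(up_class_id) -> list of dicts with "id" and "topic_class_name".
Fetch = Callable[[str], List[Dict[str, str]]]


def walk_categories(up_class_id: str, fetch_children: Fetch,
                    depth: int = 0, max_depth: int = 4) -> Iterator[Tuple[int, str, str]]:
    """Yield (depth, id, name) for every category reachable from up_class_id."""
    if depth >= max_depth:
        # Safety cap so an unexpected cycle cannot recurse forever.
        return
    for item in fetch_children(up_class_id):
        yield depth, item["id"], item["topic_class_name"]
        # Each returned id becomes the up_class_id of the next level.
        yield from walk_categories(item["id"], fetch_children, depth + 1, max_depth)


# Made-up tree for a dry run; the names are placeholders.
FAKE_TREE = {
    "3464": [{"id": "24321", "topic_class_name": "subject"}],
    "24321": [{"id": "24377", "topic_class_name": "group"}],
    "24377": [{"id": "30658", "topic_class_name": "paper list"}],
}

for depth, class_id, name in walk_categories("3464", lambda cid: FAKE_TREE.get(cid, [])):
    print("  " * depth + f"{class_id} {name}")
```

In a real run, `fetch_children` would wrap the requests.get call from run() and return the fenlei_data list; how the site actually marks the last level has to be confirmed from captured responses.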
