Scraping Weibo Post Content

Not sure if you've noticed, but Weibo has quietly rolled out a new version of its web interface.

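Compared with the old Weibo, the interface is far more comfortable: a fresh, eye-catching look, clear and well-organized grouped reading, personalized app management... (Weibo, pay me!)

So today let's use it to scrape your idol's past posts!

1. Page Analysis

My idol of choice today is Dilraba (迪麗熱巴)! Aba aba

Go to Reba's home page, open the developer tools as usual, then refresh the page.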

The request is easy to find. Its response contains the post text, like count, repost count, comment count, posting time, and so on.

You can get Reba's photos this way too, but I won't walk you through that here ^_^.

2. API Analysis

URL analysis

Page 1:

https://weibo.com/ajax/statuses/mymblog?uid=1669879400&page=1&feature=0

Keep scrolling down:

Page 2:

https://weibo.com/ajax/statuses/mymblog?uid=1669879400&page=2&feature=0

Only the page parameter changes; it is the page number.

The uid is Reba's Weibo user ID. Swap it for your own idol's uid and you'll scrape her posts instead. Got it???

OK, all set.
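The URL pattern above can be sketched as a small helper. This is only an illustration: build_url and BASE_URL are names made up here, and feature=0 is simply copied from the two URLs above without knowing what it controls.

```python
from urllib.parse import urlencode

# Endpoint taken from the two URLs shown above.
BASE_URL = "https://weibo.com/ajax/statuses/mymblog"


def build_url(uid, page, feature=0):
    """Build the post-list URL for one user and one page."""
    query = urlencode({"uid": uid, "page": page, "feature": feature})
    return f"{BASE_URL}?{query}"


if __name__ == "__main__":
    for page in (1, 2):
        print(build_url(1669879400, page))
```

Letting urlencode assemble the query string keeps the parameters in one place, so swapping in another uid is a one-argument change.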

Analyzing the returned data

It's a GET request, and the response is JSON encoded as UTF-8.

So we can simply parse the response body as JSON.
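Parsing might look roughly like the sketch below. The sample string is made up and only mimics the shape this post relies on (a data object holding a list of posts); parse_posts is a hypothetical helper name, and a real response carries many more fields.

```python
import json

# Made-up sample that only mimics the shape used in this post
# (data -> list -> one dict per post); the values are not real.
sample = '{"data": {"list": [{"text_raw": "hello", "attitudes_count": 3}]}}'


def parse_posts(raw_text):
    """Parse the response body and return the list of post dicts."""
    payload = json.loads(raw_text)
    # Fall back to an empty list if either level is missing.
    return payload.get("data", {}).get("list", [])


posts = parse_posts(sample)
print(len(posts), posts[0]["text_raw"])
```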

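3. Writing the Code

Now that we know the URL pattern and the format of the returned data, our job is to build the URLs and request the data.

Let's build the URL:

At this point you may be wondering: how do we know how many pages of posts there are?

We'll handle that with a while loop: as soon as a request comes back with no posts, we exit the loop.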

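uid = 1669879400
url_template = 'https://weibo.com/ajax/statuses/mymblog?uid={}&page={}&feature=0'
page = 1
while True:
    # Format into a new variable; overwriting the template would freeze the URL on page 1
    url = url_template.format(uid, page)
    # Request url here and break once a page comes back with no posts
    page += 1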
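The "keep going until a page is empty" idea can also be written as a small generator. This is just a sketch: iter_pages, fetch_page and max_pages are names invented here, and fake_fetch stands in for the real request so the example runs without touching Weibo.

```python
def iter_pages(fetch_page, start=1, max_pages=1000):
    """Yield (page_number, posts) until a page comes back empty.

    fetch_page is a hypothetical callable: page number in, list of posts out.
    max_pages is a safety cap so the loop always terminates.
    """
    for page in range(start, start + max_pages):
        posts = fetch_page(page)
        if not posts:
            break
        yield page, posts


# Stand-in for the real request: pretend the account has 3 pages of posts.
def fake_fetch(page):
    return [f"post {page}-{i}" for i in range(2)] if page <= 3 else []


for page, posts in iter_pages(fake_fetch):
    print(page, posts)
```

The cap is a design choice rather than something the API requires; it just guards against looping forever if the server never returns an empty page.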

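For each URL we request the data with the get method of the requests library.

For convenience, we wrap the request code in a function get_html(url), which takes a URL and returns the content of the response.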

import time

import requests


def get_html(url):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36",
        "Referer": "https://weibo.com",
        "Cookie": "your cookie",   # paste the Cookie request header copied from your browser
    }
    response = requests.get(url, headers=headers)
    time.sleep(3)    # wait 3 s between requests to avoid triggering anti-scraping measures
    return response.text

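Just replace the placeholder in the code with your own cookie.

How to get the cookie: in the developer tools, click the request we found earlier and copy the value of the Cookie field under its request headers.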
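requests also accepts cookies as a dict of names to values through its cookies= parameter. The string copied from the browser is one long "name=value; name=value" line, so if you want to pass it that way it needs splitting first. A minimal sketch, with parse_cookie_string being a helper name made up for this post:

```python
def parse_cookie_string(raw_cookie):
    """Split 'name1=value1; name2=value2' into a {name: value} dict."""
    cookies = {}
    for part in raw_cookie.split(";"):
        part = part.strip()
        if not part or "=" not in part:
            continue  # skip empty or malformed fragments
        name, value = part.split("=", 1)  # values may themselves contain '='
        cookies[name.strip()] = value.strip()
    return cookies


# The names and values below are placeholders, not real Weibo cookies.
print(parse_cookie_string("token=abc123; theme=dark"))
```

The resulting dict can then be passed as cookies= to requests.get.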

Extracting the data

import json

html = get_html(url)
responses = json.loads(html)
blogs = responses['data']['list']
for blog in blogs:
    data = {}   # a fresh dict for each post
    data['attitudes_count'] = blog['attitudes_count']   # number of likes
    data['comments_count'] = blog['comments_count']     # number of comments (counts above 1 million show as 1 million)
    data['created_at'] = blog['created_at']     # posting time
    data['reposts_count'] = blog['reposts_count']     # number of reposts (counts above 1 million show as 1 million)
    data['text_raw'] = blog['text_raw']     # plain text of the post
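The same extraction can be written more compactly, and a little more defensively, with a field list. This is only a sketch: FIELDS and extract_fields are names introduced here, and turning a missing key into None is my own choice, not something the API guarantees you will need.

```python
# The five keys this post saves, in the order used for the CSV columns.
FIELDS = ["text_raw", "created_at", "attitudes_count", "comments_count", "reposts_count"]


def extract_fields(blog):
    """Return a new dict with only the fields we save; missing keys become None."""
    return {field: blog.get(field) for field in FIELDS}


# Made-up post dict, only to show the shape.
sample_blog = {"text_raw": "hello", "attitudes_count": 3, "unused_key": "ignored"}
print(extract_fields(sample_blog))
```

Using blog.get instead of blog[...] means one post with a missing field won't abort the whole crawl with a KeyError.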

Saving the data

Define a function:

import csv


def save_data(data):
    title = ['text_raw', 'created_at', 'attitudes_count', 'comments_count', 'reposts_count']
    with open("data.csv", "a", encoding="utf-8", newline="") as fi:
        writer = csv.writer(fi)   # keep the file handle and the writer under separate names
        writer.writerow([data[i] for i in title])
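Because the file is opened in append mode, it never gets a header row. If you want one, a variant along these lines might do. It is a sketch that swaps in csv.DictWriter; save_rows and its path parameter are names made up here.

```python
import csv
import os

FIELDS = ["text_raw", "created_at", "attitudes_count", "comments_count", "reposts_count"]


def save_rows(rows, path="data.csv"):
    """Append dict rows to path, writing the header only if the file is new or empty."""
    need_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", encoding="utf-8", newline="") as f:
        # extrasaction="ignore" drops keys that are not in FIELDS;
        # keys missing from a row are written as empty cells.
        writer = csv.DictWriter(f, fieldnames=FIELDS, extrasaction="ignore")
        if need_header:
            writer.writeheader()
        writer.writerows(rows)
```

Passing a whole page of posts at once also opens the file once per page instead of once per post.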

Full code

# -*- coding:utf-8 -*-
# @time: 2021/5/20 5:20
# @Author: 南韓麥當勞
# @Environment: Python 3.7
# @file: 有情人終成眷屬.py
import requests
import csv
import time
import json


def get_html(url):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36",
        "Referer": "https://weibo.com",
        "Cookie": "your cookie",   # paste the Cookie request header copied from your browser
    }
    response = requests.get(url, headers=headers)
    time.sleep(3)   # wait 3 s between requests to avoid triggering anti-scraping measures
    return response.text


def save_data(data):
    title = ['text_raw', 'created_at', 'attitudes_count', 'comments_count', 'reposts_count']
    with open("data.csv", "a", encoding="utf-8", newline="") as fi:
        writer = csv.writer(fi)   # keep the file handle and the writer under separate names
        writer.writerow([data[i] for i in title])


if __name__ == '__main__':

    uid = 1669879400
    url_template = 'https://weibo.com/ajax/statuses/mymblog?uid={}&page={}&feature=0'
    page = 1
    while True:
        print(page)
        # Format into a new variable; overwriting the template would request page 1 forever
        url = url_template.format(uid, page)
        html = get_html(url)
        responses = json.loads(html)
        blogs = responses['data']['list']
        if len(blogs) == 0:
            break
        for blog in blogs:
            data = {}   # a fresh dict for each post
            data['attitudes_count'] = blog['attitudes_count']   # number of likes
            data['comments_count'] = blog['comments_count']     # number of comments (counts above 1 million show as 1 million)
            data['created_at'] = blog['created_at']     # posting time
            data['reposts_count'] = blog['reposts_count']     # number of reposts (counts above 1 million show as 1 million)
            data['text_raw'] = blog['text_raw']     # plain text of the post
            save_data(data)
        page += 1

Screenshot of part of the scraped data
