[Implementing a Web Crawler in Python (12)] JSON Parsing: Scraping Tencent News

Target site: Tencent News. The page looks like this:

1. Finding the `json` endpoint

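Right-click the target page and choose 'Inspect', then select the 'Network' tab and click the browser's refresh button. In the list of requests that appears in the lower-right area, select the entry whose name contains `pull_url`, and finally click the 'Preview' tab. Illustrated below: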

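Taking the current page as an example: to get the address of this `json` endpoint, click 'Headers' next to 'Preview'. The address after 'Request URL:' is the endpoint that the data is requested from, and it is the URL used in the code in the next section.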
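Before requesting the endpoint, it can help to look at what the Request URL is made of. The sketch below uses only the standard library to split the URL used in this article into its host, path and query parameters. The `callback=__jp0` parameter is presumably what makes the server wrap its answer in `__jp0(...)`, which is the usual JSONP convention; that reading is an assumption, not something the endpoint documents.

```python
from urllib.parse import urlsplit, parse_qs

# Request URL copied from the 'Headers' tab (the one used throughout this article).
url = 'https://i.match.qq.com/ninja/fragcontent?pull_urls=news_top_2018&callback=__jp0'

parts = urlsplit(url)
params = parse_qs(parts.query)  # each value is a list, because a key may repeat

print(parts.netloc)  # host that serves the data
print(parts.path)    # path of the endpoint
print(params)        # query parameters as a dict of lists
```

Seeing `pull_urls` and `callback` as separate parameters makes it easier to recognize the same endpoint if the page changes which feed it asks for.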

2. Trying to fetch the data

import requests
import json

def get_json():
    url = 'https://i.match.qq.com/ninja/fragcontent?pull_urls=news_top_2018&callback=__jp0'
    html = requests.get(url)
    print(html.text)

get_json()

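–> Output: (as you can see, the response is wrapped in `__jp0( ... )`, so it is not yet valid JSON that can be converted into a Python list; we can try trimming it into the right format)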

__jp0(
[{"title":"\u4e60\u8fd1\u5e73\u56de\u4fe1\u52c9\u52b1\u5317\u5927\u63f4\u9102\u533b\u7597\u961f90\u540e\u515a\u5458","url":"https:\/\/new.qq.com\/omn\/20200316\/20200316A0CI9E00.html","article_id":"20200316a0ci9e00","comment_id":"0","group":"0"},
{"title":"\u4e60\u8fd1\u5e73\u6c42\u662f\u6587\u7ae0\u91cc\u7684\u5173\u952e","url":"https:\/\/new.qq.com\/omn\/TWF20200\/TWF2020031604303300.html","article_id":"twf2020031604303300","comment_id":"0","group":1}
,{"title":"\u82f1\u96c4\uff0c\u5fc5\u987b\u8ba9\u4f60\u4eec\u767b\u201c\u53f0\u201d\u4eae\u76f8\uff01","url":"https:\/\/new.qq.com\/omn\/20200315\/20200315A0GHBK00.html","article_id":"20200315a0ghbk00","comment_id":"0","group":1},
{"title":"\u5b9a\u5411\u964d\u51c6\u7a33\u4fe1\u5fc3 \u5229\u7387\u4e0b\u884c\u6709\u7a7a\u95f4","url":"https:\/\/new.qq.com\/omn\/20200316\/20200316A049RE00.html","article_id":"20200316a049re00","comment_id":"0","group":2},
{"title":"\u75ab\u60c5\u5f71\u54cd\u4e0b\u7684\u4e2d\u56fd\u7ecf\u6d4e\u89c2\u5bdf","url":"https:\/\/new.qq.com\/omn\/20200316\/20200316A04HX200.html","article_id":"20200316a04hx200","comment_id":"0","group":2},
{"title":"\u5927\u533b\u5f20\u4f2f\u793c\u2014\u2014\u4ed6\u662f\u957f\u8005\uff0c\u662f\u7236\u4eb2\uff0c\u66f4\u662f\u5171\u4ea7\u515a\u5458\uff01","url":"https:\/\/new.qq.com\/omn\/20200316\/20200316A0DS7U00.html","article_id":"20200316a0ds7u00","comment_id":"0","group":"0"},
{"title":"\u4ed6\u4eec\u662f\u8bb0\u8005\uff0c\u4ed6\u4eec\u662f\u6218\u58eb","url":"https:\/\/new.qq.com\/omn\/20200316\/20200316A07XO800.html","article_id":"20200316a07xo800","comment_id":"0","group":3},
{"title":"\u6218\u201c\u75ab\u201d\u4e2d\u7684\u9752\u6625\u4e4b\u6b4c","url":"https:\/\/new.qq.com\/omn\/20200316\/20200316A07ZYG00.html","article_id":"20200316a07zyg00","comment_id":"0","group":3},
{"title":"\u3010\u6218\u201c\u75ab\u201d\u8bf4\u7406\u3011\u4ee5\u4eba\u6c11\u4e3a\u4e2d\u5fc3\uff1a\u75ab\u60c5\u9632\u63a7\u7684\u4ef7\u503c\u903b\u8f91","url":"https:\/\/new.qq.com\/rain\/a\/20200316A07IIW00","article_id":"20200316A07IIW00","comment_id":"0","group":"0"},
{"title":"\u575a\u6301\u5411\u79d1\u5b66\u8981\u7b54\u6848\u8981\u65b9\u6cd5","url":"https:\/\/new.qq.com\/omn\/20200315\/20200315A0FJZO00.html","article_id":"20200315a0fjzo00","comment_id":"0","group":4},
])
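The `\uXXXX` sequences and `\/` in the output above are ordinary JSON string escapes, so nothing extra is needed to make the titles readable: `json.loads` decodes them. A minimal sketch, using a single record copied from the output above:

```python
import json

# One record copied from the output above; the raw string keeps the backslashes intact.
raw = r'{"title":"\u4ed6\u4eec\u662f\u8bb0\u8005\uff0c\u4ed6\u4eec\u662f\u6218\u58eb","url":"https:\/\/new.qq.com\/omn\/20200316\/20200316A07XO800.html","article_id":"20200316a07xo800","comment_id":"0","group":3}'

record = json.loads(raw)
print(record['title'])  # the \uXXXX escapes come back as Chinese characters
print(record['url'])    # every "\/" comes back as a plain "/"
```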

Try transforming the data (strip the extra characters on either side of the list)

print(json.loads((html.text)[6:-2]))

–> Output: (this is simply string slicing that turns the response into valid JSON; the parsed result matches the raw data above, 13 items in both)
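The fixed offsets in `[6:-2]` only work while the callback name and the trailing characters stay exactly as they are now. As a hedged alternative, the sketch below locates the outer parentheses instead of counting characters. `strip_jsonp` is a hypothetical helper name, and `sample` is a shortened stand-in for `html.text`, not real response data.

```python
import json

def strip_jsonp(text):
    """Return the JSON payload inside a JSONP wrapper such as __jp0(...)."""
    start = text.find('(')   # the first '(' ends the callback name
    end = text.rfind(')')    # the last ')' closes the call
    if start == -1 or end == -1 or end < start:
        raise ValueError('response does not look like JSONP')
    return text[start + 1:end]

# Shortened stand-in for html.text (made-up values).
sample = '__jp0(\n[{"title":"t1","url":"u1"},{"title":"t2","url":"u2"}]\n)'

data = json.loads(strip_jsonp(sample))
print(len(data))
```

In the crawler itself this would replace the slice, for example `json.loads(strip_jsonp(html.text))`, and it keeps working if the callback name in the URL is changed.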

3. Extracting the title and URL

for i in json.loads((html.text)[6:-2]):
    # print(i)
    print('News title:\n{}\n'.format(i['title']))
    print('News link:\n{}\n----------\n'.format(i['url']))
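Printing is fine for a quick look, but returning the values makes them reusable, for instance for saving or for follow-up requests. A small sketch under that assumption: `extract_links` is a hypothetical helper, and the records below are made up to mimic the shape of the parsed response.

```python
def extract_links(records):
    """Return (title, url) pairs, skipping records that lack either key."""
    pairs = []
    for item in records:
        title = item.get('title')
        url = item.get('url')
        if title and url:
            pairs.append((title, url))
    return pairs

# Made-up records shaped like the parsed response.
records = [
    {'title': 'Example headline', 'url': 'https://example.com/a.html'},
    {'title': 'Record without a link'},
]
print(extract_links(records))
```

Using `.get` rather than `i['title']` means one incomplete record does not stop the whole loop with a `KeyError`.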

4. Full code

import requests
import json

def get_json():
    # JSONP endpoint found under 'Request URL:' in the Network panel
    url = 'https://i.match.qq.com/ninja/fragcontent?pull_urls=news_top_2018&callback=__jp0'
    html = requests.get(url)
    # print(html.text)
    # print(json.loads((html.text)[6:-2]))
    # Slice off the leading '__jp0(' and the trailing characters, then parse the JSON list
    for i in json.loads((html.text)[6:-2]):
        # print(i)
        print('News title:\n{}\n'.format(i['title']))
        print('News link:\n{}\n----------\n'.format(i['url']))

get_json()
