Scraping Baidu's ICP Filing Information with Python

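First, install the two required libraries with pip install requests and pip install bs4. (Note: lxml may not be installed on your machine. If the script fails at run time, try pip install lxml; this is the library used to parse the HTML.)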
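If lxml is missing, BeautifulSoup raises FeatureNotFound when asked for the 'lxml' parser. The sketch below shows one way to fall back to Python's built-in html.parser in that case. The helper name make_soup is an illustrative assumption, not part of the original script.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html):
    """Parse html with lxml when it is installed, else with the stdlib parser."""
    try:
        return BeautifulSoup(html, "lxml")
    except FeatureNotFound:
        # lxml is not installed; html.parser ships with Python itself.
        return BeautifulSoup(html, "html.parser")


soup = make_soup("<p>hello</p>")
print(soup.p.get_text())
```

The two parsers can build slightly different trees for malformed HTML, so installing lxml is still the simpler route if you want results that match the rest of this post.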

The IDE I use here is Spyder, but you can also run the script directly in IDLE, the IDE that ships with Python.


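The core steps of a scraper are:

1. Spoof the request headers

2. Get the address of the target site

3. Find the DOM position of the content to scrape

4. Build a loop that walks over the matched elements and extracts them (scraping the ICP filing information is simple and needs no elaborate extraction)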
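Steps 3 and 4 can be tried offline before touching the network. The sketch below runs them against a small inline HTML fragment. The markup in SAMPLE_HTML and the function name extract_pairs are hypothetical stand-ins, and the real page's structure may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a fetched page (not the real site).
SAMPLE_HTML = """
<div id="icp-table">
  <table>
    <tr><td>Organizer</td><td>Example Co.</td></tr>
    <tr><td>Type</td><td>Enterprise</td></tr>
  </table>
</div>
"""


def extract_pairs(html):
    """Return (label, value) tuples from the table inside #icp-table."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", attrs={"id": "icp-table"})
    if container is None:
        # The expected container is absent, so there is nothing to extract.
        return []
    cells = [td.get_text(strip=True) for td in container.find_all("td")]
    # Pair each even-indexed cell (label) with the odd-indexed cell after it.
    return list(zip(cells[0::2], cells[1::2]))


for label, value in extract_pairs(SAMPLE_HTML):
    print(f"{label}: {value}")
```

Using zip over the two slices drops a trailing unpaired cell instead of raising IndexError, which is convenient when a table has an odd number of cells.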


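# Complete code with explanations
import requests
from bs4 import BeautifulSoup

# Spoof the request headers so the server does not trigger its anti-scraping mechanism
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/76.0.3809.100 Safari/537.36'
}

try:
    # Address of the target site; the request sits inside try so connection errors are caught
    res = requests.get('https://icp.aizhan.com/www.baidu.com/', headers=headers)
    # BeautifulSoup parses the returned HTML text
    soup = BeautifulSoup(res.text, 'lxml')
    # Find the DOM position of the content to scrape
    div = soup.find('div', attrs={'id': 'icp-table'})
    td_list = div.find_all('td') if div else []
    # Selector copied from the browser for https://icp.aizhan.com/www.baidu.com/ :
    # #icp-table > table > tbody > tr:nth-child(3) > td:nth-child(2) > span
    # :nth-of-type(n) matches the n-th child of its element type within the parent
    icp = soup.select('#icp-table > table > tbody > tr:nth-of-type(3) > td:nth-of-type(2) > span')
    if icp:
        print(icp[0].get_text())
    # Walk the cells two at a time and print each "label:value" pair
    for i in range(0, len(td_list) - 1, 2):
        info = td_list[i].text + ":" + td_list[i + 1].text
        print(info)
        print("-" * 20)
except requests.exceptions.ConnectionError:
    # requests raises its own ConnectionError, which the built-in ConnectionError does not catch
    print("Failed to connect to the site")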
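One thing to watch when copying a selector from the browser's developer tools: browsers insert a tbody element into every table, while the lxml and html.parser backends keep the markup as the server sent it. If the raw HTML has no tbody, a selector containing "> tbody >" matches nothing. The sketch below illustrates this on a hypothetical fragment; whether the real page includes tbody in its source is not something this post verifies.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with no <tbody> in the source markup.
html = """
<div id="icp-table"><table>
  <tr><td>a</td><td><span>1</span></td></tr>
  <tr><td>b</td><td><span>2</span></td></tr>
  <tr><td>c</td><td><span>3</span></td></tr>
</table></div>
"""
soup = BeautifulSoup(html, "html.parser")

# Selector in the style copied from browser dev tools (requires a tbody element).
strict = soup.select(
    "#icp-table > table > tbody > tr:nth-of-type(3) > td:nth-of-type(2) > span"
)
# Looser selector that does not depend on tbody being present.
loose = soup.select("#icp-table tr:nth-of-type(3) > td:nth-of-type(2) > span")

print(len(strict))          # number of matches for the strict selector
print(loose[0].get_text())  # text of the span in row 3, column 2
```

If the strict selector returns an empty list on the live page, dropping the tbody step as in the looser form is a reasonable first thing to try.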
