Python Targeted Web Crawler Example (3)

2023-08-03 05:59:10

Goal: scrape the real-time hot-search ranking from Baidu.

Program design:

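Fetch the HTML text of the Baidu hot-search page
Parse the HTML text to extract each entry's rank and title
Store the extracted information in a file, one dictionary per entry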
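The parsing step can be tried offline before touching the network. The following is only a minimal sketch: it assumes the page marks rank cells as td elements with class "first" and titles as a elements with class "list-title" (the same selectors this post's code relies on). SAMPLE_HTML is a made-up fragment, not real Baidu markup, and extract_rows and the English keys "rank" and "title" are hypothetical names used for illustration.

```python
from bs4 import BeautifulSoup

# Made-up fragment imitating the assumed page structure (not copied from Baidu)
SAMPLE_HTML = """
<table>
  <tr><td class="first"><span>1</span></td>
      <td><a class="list-title" href="#">first topic</a></td></tr>
  <tr><td class="first"><span>2</span></td>
      <td><a class="list-title" href="#">second topic</a></td></tr>
</table>
"""

def extract_rows(html):
    """Return one dict per entry, pairing each rank cell with a title link."""
    soup = BeautifulSoup(html, "html.parser")
    # A string as the second argument of find_all() filters by CSS class
    ranks = soup.find_all("td", "first")
    titles = soup.find_all("a", "list-title")
    # zip() stops at the shorter list, so unmatched cells are ignored
    return [
        {"rank": str(td.find("span").string), "title": str(a.string)}
        for td, a in zip(ranks, titles)
    ]

if __name__ == "__main__":
    for row in extract_rows(SAMPLE_HTML):
        print(row)
```

If the live page uses different class names, only the two find_all() calls need to change.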

Code:

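# Scrape Baidu's real-time hot-search ranking
# Tech stack: requests + bs4
import requests
from bs4 import BeautifulSoup

def getHTML(url):
    """Return the page text, or an empty string if the request fails."""
    try:
        r = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}, timeout=10)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return ""

def parseHTML(demo, file_path):
    soup = BeautifulSoup(demo, "html.parser")
    # Rank cells are <td class="first">; titles are <a class="list-title">
    num_list = soup.find_all('td', 'first')
    title_list = soup.find_all('a', 'list-title')
    # Write as UTF-8 so the Chinese text is saved whatever the platform default is
    with open(file_path, "w", encoding="utf-8") as f:
        # zip() pairs ranks with titles and stops at the shorter list
        for num, title in zip(num_list, title_list):
            try:
                info_dict = {
                    '排名': num.find('span').string,  # key means "rank"
                    '标題': title.string,             # key means "title"
                }
            except AttributeError:
                # Skip rows whose rank cell has no <span>
                continue
            f.write(str(info_dict) + '\n')
    print("Scraping finished!")

def main():
    url = 'http://top.baidu.com/buzz?b=1&fr=20811'
    file_path = "D://百度實時熱搜排行.txt"  # "Baidu real-time hot-search ranking"
    demo = getHTML(url)
    parseHTML(demo, file_path)

if __name__ == "__main__":
    main()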
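Because parseHTML writes each record with str(info_dict), every line of the output file is a Python dict literal. As a rough sketch of how such a file could be read back, the helper below uses ast.literal_eval from the standard library. load_records is a hypothetical name, not part of the original script, and the encoding argument is an assumption: it must match whatever encoding the file was actually written with.

```python
import ast

def load_records(file_path, encoding="utf-8"):
    """Parse a file holding one dict literal per line into a list of dicts."""
    records = []
    with open(file_path, encoding=encoding) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            # literal_eval accepts only Python literals, so unlike eval()
            # it will not execute arbitrary code found in the file
            records.append(ast.literal_eval(line))
    return records
```

Writing the records with json.dumps instead of str() would make the file readable from other languages as well, at the cost of changing the output format used in this post.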

Result:

[Image: result of running the script]

Reposted from: https://www.cnblogs.com/BUPT-MrWu/p/11349130.html
