Web Scraping in Practice: Fetching the First 10 Pages of Qiushibaike

2023-04-17 03:14:18

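1. The script uses three libraries: urllib2, re, and lxml. urllib2 and re ship with Python 2; lxml has to be installed separately (for example with pip install lxml).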
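urllib2 exists only in Python 2. In Python 3 the same functionality lives in urllib.request and urllib.error. The following is a minimal sketch of how the request used by the script below (a GET with a browser-like User-Agent header) might look in Python 3. The names USER_AGENT, build_request, and fetch, and the 10-second timeout, are assumptions made for illustration and are not part of the original script.

```python
# Python 3 sketch: urllib2 was split into urllib.request and urllib.error.
import urllib.error
import urllib.request

# Same browser-like User-Agent string the original script sends.
USER_AGENT = ('Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/54.0.2840.90 Safari/537.36')


def build_request(url):
    """Hypothetical helper: build a request carrying the User-Agent header."""
    return urllib.request.Request(url, headers={'User-Agent': USER_AGENT})


def fetch(url, timeout=10):
    """Hypothetical helper: return the decoded page body, or None on failure."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
            return resp.read().decode('utf-8')
    except urllib.error.URLError as e:
        # HTTPError is a URLError subclass and additionally carries .code.
        print(getattr(e, 'code', ''), getattr(e, 'reason', ''))
        return None
```

Keeping the request construction separate from the network call makes it possible to inspect the headers without contacting the site.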

# -*- coding:utf-8 -*-
import urllib2
import re
import lxml.html as html

def get_url(url):  # Wrap one page request; returns three values: tree, post count, page text
    User_Agent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.90 Safari/537.36'
    header = {
        'User-Agent': User_Agent
    }  # Without a User-Agent header the site does not return the page data
    try:
        req = urllib2.Request(url, headers=header)
        response = urllib2.urlopen(req)
        content = response.read().decode('utf-8')
        # print content
        tree = html.fromstring(content)  # Parse the source so that XPath can be run on tree
        aa = tree.xpath('//*[@id="content-left"]//div[@class="article block untagged mb15"]')
        wenben = tree.xpath('//text()')
        wenben = "".join(wenben).strip()  # wenben: all text on the current page
        meiyetiaoshu = len(aa)  # meiyetiaoshu: number of posts on the current page

        return tree, meiyetiaoshu, wenben
    except urllib2.URLError, e:
        if hasattr(e, 'code'):
            print e.code
        if hasattr(e, 'reason'):
            print e.reason
        return None, 0, u''  # Empty result, so the caller's three-way unpacking still works



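# Print every post on one page (zuozhe = author, duanzi = joke text, haoxiao = "funny" votes)
def page_one(tree, wenben, meiyetiaoshu):
    wenben = re.sub(u'(\*|\(|\)|\~|\^)', '', wenben)  # Strip special characters from the page text
    for i in range(1, meiyetiaoshu + 1):
        zuozhe = tree.xpath('//*[@id="content-left"]//div[{}]/div[1]/a[2]/@title'.format(i))
        if not zuozhe:  # Anonymous users: fall back to the name in the h2 element
            zuozhe = tree.xpath('//*[@id="content-left"]//div[{}]/div[1]/span[2]/h2/text()'.format(i))
        print i, zuozhe[0]  # User name
        duanzi = tree.xpath('//*[@id="content-left"]//div[{}]/a/div/span/text()'.format(i))
        print duanzi[0].strip()  # The joke posted by this user
        # The name may contain special characters; strip them so they are not read as regex syntax
        sub_zuozhe = re.sub(u'(\*|\(|\)|\~|\^)', '', zuozhe[0])
        haoxiao_group = re.search(u'%s[\d\D]+?(\d+ 好笑)' % sub_zuozhe, wenben)
        if haoxiao_group:  # Skip posts for which the pattern finds no match
            # Number of people who found the joke funny; the comment count can be scraped the same way
            print haoxiao_group.group(1).strip()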


# Fetch the first 10 pages
for page in range(1, 11):
    url = 'http://www.qiushibaike.com/hot/page/%s/' % str(page)  # URLs of pages 1 through 10
    tree, meiyetiaoshu, wenben = get_url(url)
    page_one(tree=tree, wenben=wenben, meiyetiaoshu=meiyetiaoshu)
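The "好笑" lookup in page_one can be tried on its own without touching the network. The sketch below is one possible variant, not the script's exact method: instead of deleting special characters from both the user name and the page text, it passes the name through re.escape, and it captures only the digits rather than the whole "N 好笑" string. The function name funny_count and the sample text are made up for illustration.

```python
# -*- coding: utf-8 -*-
import re


def funny_count(author, page_text):
    """Hypothetical helper: return the 'N 好笑' count that follows author, or None."""
    # re.escape makes characters such as * ( ) ~ ^ in the name match literally
    # instead of being interpreted as regex syntax.
    pattern = re.escape(author) + u'[\\d\\D]+?(\\d+) 好笑'
    match = re.search(pattern, page_text)
    return int(match.group(1)) if match else None


# Invented sample text, shaped like the flattened page text the script builds.
SAMPLE_TEXT = (u'小明(^_^)\n今天出門忘了帶鑰匙。\n1024 好笑\n36 評論\n'
               u'匿名用戶\n另一個笑話。\n7 好笑')

print(funny_count(u'小明(^_^)', SAMPLE_TEXT))  # 1024
```

Because the lazy `[\d\D]+?` stops at the first "N 好笑" after the name, two users with the same name on one page would both resolve to the first match; the original approach shares this limitation.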
