python3 [Beginner Basics in Practice] Web Scraping Primer: Scraping Qiushibaike

2023-03-21 10:14:12

#encoding=utf8
import requests
from lxml import etree


class QiuShi(object):
    headers = {
        "user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36"
    }

    url = 'http://www.qiushibaike.com/text/'

    def __init__(self):
        # Column names for one scraped post
        fields = ['author', 'gender', 'age', 'content', 'funny votes', 'comments']
        # self.write = CSV('qiushi.csv', fields)
        print(fields)

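    # Build the URL of every listing page and scrape each one
    def totalUrl(self):
        # NOTE: the page range was lost from the original listing; pages 1-2 are a placeholder assumption
        urls = [self.url+'page/{}?s=4985075'.format(i) for i in range(1, 3)]
        for url in urls:
            # The last path segment is "<page>?s=...", so drop the query string
            print('Fetching page ' + url.split('/')[-1].split('?')[0])
            self.getInfo(url)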

    # Scrape the details of every post on one page
    def getInfo(self,url):
        item= {}
        html = requests.get(url,headers = self.headers).text
        data = etree.HTML(html)

        infos = data.xpath('//*[@class="article block untagged mb15"]')
        print(infos)

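        for info in infos:
            # Numeric keys were lost in the source listing; 1-6 is assumed here:
            # author, gender, age, content, funny votes, comments
            try:
                item[1] = info.xpath('div[1]/a[2]/h2/text()')[0]
                try:
                    age = info.xpath('div[1]/div[@class="articleGender womenIcon"]/text()')[0]
                    item[2] = 'female'
                    item[3] = age
                except IndexError:
                    # No female badge, so look for the male one
                    age = info.xpath('div[1]/div[@class="articleGender manIcon"]/text()')[0]
                    item[2] = 'male'
                    item[3] = age
            except IndexError:
                # Anonymous posts have no author link or gender badge
                item[1] = 'anonymous user'
                item[2] = 'unknown'
                item[3] = 'unknown'
            item[4] = info.xpath('a/div/span/text()')[0].strip()
            item[5] = info.xpath('div[2]/span[1]/i/text()')[0]
            # Query relative to this post so each row gets its own comment count
            item[6] = info.xpath('.//*[@class="qiushi_comments"]/i/text()')[0]
            row = [item[i] for i in range(1, 7)]
            # self.write.writeRow(row)
            print(row)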
            # with open('C:\\QiuShiBaiKe.csv', 'a', encoding='utf-8') as f:
            #     f.write(','.join(row) + '\n')

if __name__ == '__main__':
    qiushi = QiuShi()
    qiushi.totalUrl()
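
The listing above leaves its `CSV('qiushi.csv', ...)` call and `self.write.writeRow(row)` commented out, and the `CSV` helper itself is never shown. The block below is a minimal sketch of what such a helper might look like, built on the standard-library `csv` module. The class body is an assumption for illustration, not the author's original helper; only the class name and the `writeRow` method name are taken from the commented-out lines.

```python
import csv


class CSV(object):
    """Hypothetical stand-in for the CSV helper named in the commented-out lines."""

    def __init__(self, path, fields):
        self.path = path
        # Create (or truncate) the file and write the header row once
        with open(self.path, 'w', newline='', encoding='utf-8') as f:
            csv.writer(f).writerow(fields)

    def writeRow(self, row):
        # Append one record; csv.writer quotes commas and newlines inside fields
        with open(self.path, 'a', newline='', encoding='utf-8') as f:
            csv.writer(f).writerow(row)
```

With a helper like this in scope, uncommenting the two `self.write` lines in the class would save each scraped row to `qiushi.csv` instead of only printing it. Reopening the file for every row is simple but slow for large crawls; keeping one file handle open for the whole run would be a reasonable alternative.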
