Scrapy: Crawling Every Literary Article on a Site

2023-08-07 02:47:55

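Target site: www.rensheng5.com

Content to crawl: every article on the site

Fields: title, date, author, content

Storage: each article is saved as a .txt file named after its title

This time the site is crawled with a generic spider:

Environment: Ubuntu, Python 3.7

In the terminal, create the spider from the CrawlSpider (crawl) template

The key part is constructing the regular expression for the Rule

Project creation is covered in my other Scrapy posts, so I won't repeat it here

Straight to the main code: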

rules = (
    # follow every article link (section/id-<number>.html) and parse it
    Rule(LinkExtractor(allow=r'\w+/id-\d+\.html'), callback='parse_item', follow=True),
)
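To see what the Rule's pattern lets through, the expression can be tried against a few URLs outside Scrapy. This is only a sketch: the sample URLs are made up for illustration, not taken from the site, and is_article_link is a hypothetical helper. It relies on LinkExtractor applying its allow patterns as a search anywhere in the URL rather than as a full match.

```python
import re

# Same pattern as the Rule above, with the dot before "html" escaped
# so that it only matches a literal ".".
ARTICLE_LINK = re.compile(r'\w+/id-\d+\.html')

# Hypothetical URLs, invented for illustration only.
samples = [
    'http://www.rensheng5.com/essay/id-12345.html',  # article-style link
    'http://www.rensheng5.com/essay/',               # listing page
    'http://www.rensheng5.com/essay/id-12345xhtml',  # no literal dot
]


def is_article_link(url):
    """Return True if the pattern is found anywhere in the URL."""
    return ARTICLE_LINK.search(url) is not None


for url in samples:
    print(url, is_article_link(url))
```

With an unescaped dot the third URL would also match, because "." stands for any character.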

Parsing code:

        # article title
        item['name'] = response.xpath('//div[@class="artview"]/h1/text()').extract_first()
        # info line (time, author); keep only the part before the '點選' label
        date = response.xpath('//div[@class="artinfo"]//text()').extract()
        item['date'] = ' '.join(date).split('點選')[0].replace('\u3000', ' ').strip()
        # body text: join the paragraphs, drop full-width spaces and line breaks
        content = response.xpath('//div[@class="artbody"]//p/text()').extract()
        item['content'] = ' '.join(content).replace('\u3000', '').replace('\r\n', ' ').strip()
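The string clean-up in the parsing code can be checked without running the spider. The sketch below pulls the same chain of join, split, replace and strip into two helpers; clean_info and clean_body are hypothetical names, and the text nodes are invented samples rather than real page content.

```python
def clean_info(parts, label='點選'):
    """Join the info-line text nodes and keep what precedes the label."""
    text = ' '.join(parts)
    # '點選' is the label used in the parsing code above; the live page
    # may spell it differently, so it is passed in as a parameter.
    return text.split(label)[0].replace('\u3000', ' ').strip()


def clean_body(paragraphs):
    """Join paragraph text, dropping full-width spaces and CRLF breaks."""
    return ' '.join(paragraphs).replace('\u3000', '').replace('\r\n', ' ').strip()


# Made-up text nodes, shaped like what .extract() returns (a list of str).
info_nodes = ['2019-01-01\u3000Author: Anonymous\u3000', '點選', ': 123']
body_nodes = ['\u3000\u3000First paragraph.', '\u3000\u3000Second paragraph.\r\n']

print(clean_info(info_nodes))
print(clean_body(body_nodes))
```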

settings configuration:

Uncomment ITEM_PIPELINES
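After uncommenting, the entry in settings.py looks roughly like the fragment below. The project and class names here (rensheng5, Rensheng5Pipeline) are placeholders, since the post does not give them; keep whatever the project generator wrote into your own file. The number sets the order in which pipelines run (lower values first, conventionally in the 0 to 1000 range).

```python
# settings.py (fragment)
# "rensheng5" and "Rensheng5Pipeline" are placeholder names, not from the post.
ITEM_PIPELINES = {
    'rensheng5.pipelines.Rensheng5Pipeline': 300,
}
```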

item configuration:

Define three fields: name, date, content

pipelines configuration:

This is mainly the code that saves the data:

def process_item(self, item, spider):
    # one file per article, named after its title
    filename = item['name'] + '.txt'
    # the with block closes the file even if a write fails
    with open(filename, 'w', encoding='utf8') as f:
        f.write(item['name'] + '\n')
        f.write(item['date'] + '\n')
        f.write(item['content'])
    return item
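One thing to watch when naming files after titles: extract_first() returns None when no title is found, and a title containing a character such as "/" cannot be used as a file name directly. A minimal sketch of one way to guard against both is below; safe_filename is a hypothetical helper, not part of the original pipeline, and the set of replaced characters is an assumption that errs on the cautious side.

```python
import re


def safe_filename(title, fallback='untitled'):
    """Turn an article title into a name that is safe to pass to open()."""
    if not title:
        # No title was extracted (None or empty string).
        return fallback + '.txt'
    # Replace path separators and characters that common filesystems reject.
    cleaned = re.sub(r'[\\/:*?"<>|\r\n\t]', '_', title).strip()
    # Fall back if only whitespace was left.
    return (cleaned or fallback) + '.txt'


print(safe_filename('a/b: c?'))
print(safe_filename(None))
```

In process_item, the result would stand in for the file name built from item['name'].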

Results: (two screenshots in the original post)
