
Scraping Data with Scrapy CrawlSpider

This post walks through a worked example of the CrawlSpider spider.

Crawl target: technology news on China.com (中華網)

URL: https://tech.china.com/articles/

1. Create the project

scrapy startproject zhonghuawang
           

2. Create the CrawlSpider

cd zhonghuawang
scrapy genspider -t crawl china tech.china.com
           
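
After these two commands the generated project should look roughly like the standard Scrapy scaffolding sketched below (file names can vary slightly between Scrapy versions):

zhonghuawang/
├── scrapy.cfg
└── zhonghuawang/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        └── china.py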

3. Page analysis

List page:

Using the F12 developer tools, you will see that all of the news entries live inside the div[@class="m2left topborder"] element; each child div[@class="con_item"] holds one news item, including its link. Looking at a few of them, the pattern of the detail-page URLs becomes obvious (a simplified sketch follows).
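
As a rough illustration, the list-page markup is assumed to look something like the simplified snippet below (a sketch built from the class names above, not the site's exact HTML); the article links can then be pulled out of the con_item blocks:

from scrapy import Selector

# Simplified, hypothetical sketch of the list-page structure described above
sample_html = '''
<div class="m2left topborder">
  <div class="con_item">
    <h3><a href="https://tech.china.com/article/20180612/123456.html">Some headline</a></h3>
  </div>
  <div class="con_item">
    <h3><a href="https://tech.china.com/article/20180612/654321.html">Another headline</a></h3>
  </div>
</div>
'''

sel = Selector(text=sample_html)
# Each child con_item holds one news entry with its detail-page link
print(sel.xpath('//div[@class="con_item"]//a/@href').extract())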


With that, we can write the link-extraction rule for the detail pages. Since we also need to extract data from each detail page, the rule must have a callback to process the response:

Rule(LinkExtractor(allow=r'/article/\d+/\d+.html'), callback='parse_item', follow=False)
           
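
You can sanity-check a rule like this interactively before wiring it into the spider, for example inside `scrapy shell https://tech.china.com/articles/` (a quick, optional check):

from scrapy.linkextractors import LinkExtractor

# `response` is provided automatically inside scrapy shell
le = LinkExtractor(allow=r'/article/\d+/\d+.html')
for link in le.extract_links(response)[:5]:
    print(link.url)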

Next we need to handle pagination. Clicking "next page" a couple of times makes the URL pattern easy to spot.


The F12 developer tools confirm the same index pattern in the pagination links at the bottom of the page.


That gives us the second Rule. It only needs to extract and follow links, with no further processing, so no callback is required (when a Rule has no callback, follow defaults to True, so the list pages are crawled for more links):

Rule(LinkExtractor(allow=r'/articles/index[0-9_]+.html'))
           

The complete set of rules:

rules = (
        Rule(LinkExtractor(allow=r'/articles/index[0-9_]+.html')),
        Rule(LinkExtractor(allow=r'/article/\d+/\d+.html'), callback='parse_item', follow=False)
    )
           

Detail page:

The fields we want to extract are: title, URL, article body, publish time, source, and site name. With the fields decided, we can write items.py:

import scrapy

class ZhonghuawangItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # Title
    title = scrapy.Field()
    # URL
    url = scrapy.Field()
    # Article body
    content = scrapy.Field()
    # Publish time
    datetime = scrapy.Field()
    # Source
    source = scrapy.Field()
    # Site name
    website = scrapy.Field()
           

Next, use the F12 developer tools to find which element each field lives in, then extract them with XPath:

def parse_item(self, response):
    item = ZhonghuawangItem()
    # Title
    item['title'] = response.xpath('//h1[@id="chan_newsTitle"]/text()').extract()[0]
    # URL
    item['url'] = response.url
    # Body
    # item['content'] = response.xpath('//div[@id="chan_newsDetail"]/p/text()').extract()[0]
    item['content'] = ''.join(response.xpath('//div[@id="chan_newsDetail"]//text()').extract()).strip()
    # Publish time
    item['datetime'] = ' '.join(response.xpath('//div[@id="chan_newsInfo"]/text()').extract()[1].strip().split(' ')[:2])
    # Source
    # item['source'] = response.xpath('//div[@id="chan_newsInfo"]/text()').extract()[1].strip().split(' ')[-1][3:]
    item['source'] = response.xpath('//div[@id="chan_newsInfo"]/text()').re_first('來源:(.*)').strip()
    # Site name
    item['website'] = '中華網'
    yield item
           
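
To make the datetime and source handling above a bit easier to follow, here is the same string processing applied to a made-up sample of the text that the chan_newsInfo node is assumed to contain:

import re

# Hypothetical raw text node, roughly in the shape the page is assumed to use
raw = '  2018-06-12 10:30:00  來源:某某網  '

# Publish time: first two space-separated tokens, e.g. '2018-06-12 10:30:00'
datetime_str = ' '.join(raw.strip().split(' ')[:2])

# Source: everything after the '來源:' marker, e.g. '某某網'
source = re.search('來源:(.*)', raw).group(1).strip()

print(datetime_str, source)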

4. Saving the data

Once extraction is done, the items are written to a JSON Lines file for storage (you could just as well save them to a database), which means writing pipelines.py:

import json

class ZhonghuawangPipeline(object):
    def __init__(self):
        # One JSON object per line (JSON Lines); keep non-ASCII characters readable
        self.file = open('zhonghuawang.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        text = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.file.write(text)
        return item

    def close_spider(self, spider):
        self.file.close()
           

5. Edit settings.py

# Do not obey robots.txt
ROBOTSTXT_OBEY = False

# Disable cookies
COOKIES_ENABLED = False

# Default request headers
DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    # 'Accept-Language': 'en',
}

# Enable the item pipeline
ITEM_PIPELINES = {
    'zhonghuawang.pipelines.ZhonghuawangPipeline': 300,
}
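
Optionally, you can also slow the crawl down a little to be gentler on the site; a minimal sketch using Scrapy's standard DOWNLOAD_DELAY setting (the one-second value is an arbitrary assumption):

# Optional: wait about one second between requests (value chosen arbitrarily)
DOWNLOAD_DELAY = 1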
           

6. Run the spider

scrapy crawl china
           
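As a convenience alternative to the custom pipeline, Scrapy's built-in feed exports can write the scraped items straight to a JSON Lines file via the standard -o flag:

scrapy crawl china -o zhonghuawang.jl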

The complete code is as follows:

items.py

import scrapy


class ZhonghuawangItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # Title
    title = scrapy.Field()
    # URL
    url = scrapy.Field()
    # Article body
    content = scrapy.Field()
    # Publish time
    datetime = scrapy.Field()
    # Source
    source = scrapy.Field()
    # Site name
    website = scrapy.Field()
           

china.py

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from zhonghuawang.items import ZhonghuawangItem


class ChinaSpider(CrawlSpider):
    name = 'china'
    allowed_domains = ['tech.china.com']
    start_urls = ['http://tech.china.com/articles/']

    rules = (
        Rule(LinkExtractor(allow=r'/articles/index[0-9_]+.html')),
        Rule(LinkExtractor(allow=r'/article/\d+/\d+.html'), callback='parse_item', follow=False)
    )

    def parse_item(self, response):
        item = ZhonghuawangItem()
        # Title
        item['title'] = response.xpath('//h1[@id="chan_newsTitle"]/text()').extract()[0]
        # URL
        item['url'] = response.url
        # Body
        # item['content'] = response.xpath('//div[@id="chan_newsDetail"]/p/text()').extract()[0]
        item['content'] = ''.join(response.xpath('//div[@id="chan_newsDetail"]//text()').extract()).strip()
        # Publish time
        item['datetime'] = ' '.join(response.xpath('//div[@id="chan_newsInfo"]/text()').extract()[1].strip().split(' ')[:2])
        # Source
        # item['source'] = response.xpath('//div[@id="chan_newsInfo"]/text()').extract()[1].strip().split(' ')[-1][3:]
        item['source'] = response.xpath('//div[@id="chan_newsInfo"]/text()').re_first('來源:(.*)').strip()
        # Site name
        item['website'] = '中華網'
        yield item
           

pipelines.py

import json


class ZhonghuawangPipeline(object):
    def __init__(self):
        # One JSON object per line (JSON Lines); keep non-ASCII characters readable
        self.file = open('zhonghuawang.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        text = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.file.write(text)
        return item

    def close_spider(self, spider):
        self.file.close()
           

settings.py

# Do not obey robots.txt
ROBOTSTXT_OBEY = False

# Disable cookies
COOKIES_ENABLED = False

# Default request headers
DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    # 'Accept-Language': 'en',
}

# Enable the item pipeline
ITEM_PIPELINES = {
    'zhonghuawang.pipelines.ZhonghuawangPipeline': 300,
}
           
