
Scraping Data with Scrapy CrawlSpider

This post walks through a worked example of the CrawlSpider spider.

Crawl target: technology news on China.com (中華網)

URL: https://tech.china.com/articles/

1. Create the project

scrapy startproject zhonghuawang
           

2. Create the CrawlSpider

cd zhonghuawang
scrapy genspider -t crawl china tech.china.com
           
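
After these two commands the generated project should look roughly like the standard Scrapy scaffolding sketched below (file names can vary slightly between Scrapy versions):

zhonghuawang/
├── scrapy.cfg
└── zhonghuawang/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        └── china.py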

3. Page analysis

List page:

Using the F12 developer tools, you will see that all of the news entries live inside the div[@class="m2left topborder"] element; each child div[@class="con_item"] holds one news item, including its link. Looking at a few of them, the pattern of the detail-page URLs becomes obvious (a simplified sketch follows).
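
As a rough illustration, the list-page markup is assumed to look something like the simplified snippet below (a sketch built from the class names above, not the site's exact HTML); the article links can then be pulled out of the con_item blocks:

from scrapy import Selector

# Simplified, hypothetical sketch of the list-page structure described above
sample_html = '''
<div class="m2left topborder">
  <div class="con_item">
    <h3><a href="https://tech.china.com/article/20180612/123456.html">Some headline</a></h3>
  </div>
  <div class="con_item">
    <h3><a href="https://tech.china.com/article/20180612/654321.html">Another headline</a></h3>
  </div>
</div>
'''

sel = Selector(text=sample_html)
# Each child con_item holds one news entry with its detail-page link
print(sel.xpath('//div[@class="con_item"]//a/@href').extract())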


With that, we can write the link-extraction rule for the detail pages. Since we also need to extract data from each detail page, the rule must have a callback to process the response:

Rule(LinkExtractor(allow=r'/article/\d+/\d+.html'), callback='parse_item', follow=False)
           
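
You can sanity-check a rule like this interactively before wiring it into the spider, for example inside `scrapy shell https://tech.china.com/articles/` (a quick, optional check):

from scrapy.linkextractors import LinkExtractor

# `response` is provided automatically inside scrapy shell
le = LinkExtractor(allow=r'/article/\d+/\d+.html')
for link in le.extract_links(response)[:5]:
    print(link.url)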

Next we need to handle pagination. Clicking "next page" a couple of times makes the URL pattern easy to spot.


The F12 developer tools confirm the same index pattern in the pagination links at the bottom of the page.


That gives us the second Rule. It only needs to extract and follow links, with no further processing, so no callback is required (when a Rule has no callback, follow defaults to True, so the list pages are crawled for more links):

Rule(LinkExtractor(allow=r'/articles/index[0-9_]+.html'))
           

The complete set of rules:

rules = (
        Rule(LinkExtractor(allow=r'/articles/index[0-9_]+.html')),
        Rule(LinkExtractor(allow=r'/article/\d+/\d+.html'), callback='parse_item', follow=False)
    )
           

Detail page:

The fields we want to extract are: title, URL, article body, publish time, source, and site name. With the fields decided, we can write items.py:

import scrapy

class ZhonghuawangItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # Title
    title = scrapy.Field()
    # URL
    url = scrapy.Field()
    # Article body
    content = scrapy.Field()
    # Publish time
    datetime = scrapy.Field()
    # Source
    source = scrapy.Field()
    # Site name
    website = scrapy.Field()
           

Next, use the F12 developer tools to find which element each field lives in, then extract them with XPath:

def parse_item(self, response):
    item = ZhonghuawangItem()
    # Title
    item['title'] = response.xpath('//h1[@id="chan_newsTitle"]/text()').extract()[0]
    # URL
    item['url'] = response.url
    # Body
    # item['content'] = response.xpath('//div[@id="chan_newsDetail"]/p/text()').extract()[0]
    item['content'] = ''.join(response.xpath('//div[@id="chan_newsDetail"]//text()').extract()).strip()
    # Publish time
    item['datetime'] = ' '.join(response.xpath('//div[@id="chan_newsInfo"]/text()').extract()[1].strip().split(' ')[:2])
    # Source
    # item['source'] = response.xpath('//div[@id="chan_newsInfo"]/text()').extract()[1].strip().split(' ')[-1][3:]
    item['source'] = response.xpath('//div[@id="chan_newsInfo"]/text()').re_first('來源:(.*)').strip()
    # Site name
    item['website'] = '中華網'
    yield item
           
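
To make the datetime and source handling above a bit easier to follow, here is the same string processing applied to a made-up sample of the text that the chan_newsInfo node is assumed to contain:

import re

# Hypothetical raw text node, roughly in the shape the page is assumed to use
raw = '  2018-06-12 10:30:00  來源:某某網  '

# Publish time: first two space-separated tokens, e.g. '2018-06-12 10:30:00'
datetime_str = ' '.join(raw.strip().split(' ')[:2])

# Source: everything after the '來源:' marker, e.g. '某某網'
source = re.search('來源:(.*)', raw).group(1).strip()

print(datetime_str, source)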

4. Saving the data

Once extraction is done, the items are written to a JSON Lines file for storage (you could just as well save them to a database), which means writing pipelines.py:

import json

class ZhonghuawangPipeline(object):
    def __init__(self):
        # One JSON object per line (JSON Lines); keep non-ASCII characters readable
        self.file = open('zhonghuawang.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        text = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.file.write(text)
        return item

    def close_spider(self, spider):
        self.file.close()
           

5. Edit settings.py

# Do not obey robots.txt
ROBOTSTXT_OBEY = False

# Disable cookies
COOKIES_ENABLED = False

# Default request headers
DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    # 'Accept-Language': 'en',
}

# Enable the item pipeline
ITEM_PIPELINES = {
    'zhonghuawang.pipelines.ZhonghuawangPipeline': 300,
}
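
Optionally, you can also slow the crawl down a little to be gentler on the site; a minimal sketch using Scrapy's standard DOWNLOAD_DELAY setting (the one-second value is an arbitrary assumption):

# Optional: wait about one second between requests (value chosen arbitrarily)
DOWNLOAD_DELAY = 1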
           

6. Run the spider

scrapy crawl china
           
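As a convenience alternative to the custom pipeline, Scrapy's built-in feed exports can write the scraped items straight to a JSON Lines file via the standard -o flag:

scrapy crawl china -o zhonghuawang.jl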

The complete code is as follows:

items.py

import scrapy


class ZhonghuawangItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # Title
    title = scrapy.Field()
    # URL
    url = scrapy.Field()
    # Article body
    content = scrapy.Field()
    # Publish time
    datetime = scrapy.Field()
    # Source
    source = scrapy.Field()
    # Site name
    website = scrapy.Field()
           

china.py

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from zhonghuawang.items import ZhonghuawangItem


class ChinaSpider(CrawlSpider):
    name = 'china'
    allowed_domains = ['tech.china.com']
    start_urls = ['http://tech.china.com/articles/']

    rules = (
        Rule(LinkExtractor(allow=r'/articles/index[0-9_]+.html')),
        Rule(LinkExtractor(allow=r'/article/\d+/\d+.html'), callback='parse_item', follow=False)
    )

    def parse_item(self, response):
        item = ZhonghuawangItem()
        # Title
        item['title'] = response.xpath('//h1[@id="chan_newsTitle"]/text()').extract()[0]
        # URL
        item['url'] = response.url
        # Body
        # item['content'] = response.xpath('//div[@id="chan_newsDetail"]/p/text()').extract()[0]
        item['content'] = ''.join(response.xpath('//div[@id="chan_newsDetail"]//text()').extract()).strip()
        # Publish time
        item['datetime'] = ' '.join(response.xpath('//div[@id="chan_newsInfo"]/text()').extract()[1].strip().split(' ')[:2])
        # Source
        # item['source'] = response.xpath('//div[@id="chan_newsInfo"]/text()').extract()[1].strip().split(' ')[-1][3:]
        item['source'] = response.xpath('//div[@id="chan_newsInfo"]/text()').re_first('來源:(.*)').strip()
        # Site name
        item['website'] = '中華網'
        yield item
           

pipelines.py

import json


class ZhonghuawangPipeline(object):
    def __init__(self):
        # One JSON object per line (JSON Lines); keep non-ASCII characters readable
        self.file = open('zhonghuawang.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        text = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.file.write(text)
        return item

    def close_spider(self, spider):
        self.file.close()
           

settings.py

# Do not obey robots.txt
ROBOTSTXT_OBEY = False

# Disable cookies
COOKIES_ENABLED = False

# Default request headers
DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    # 'Accept-Language': 'en',
}

# Enable the item pipeline
ITEM_PIPELINES = {
    'zhonghuawang.pipelines.ZhonghuawangPipeline': 300,
}
           
