Scrapy Crawler Framework (Part 2): Creating a Scrapy Spider

Before creating a new Scrapy spider, it helps to understand the basic steps involved.

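1. Decide what data to scrape

We use Douban movie data as the example.

The information to scrape for each movie is:

Title: Ready Player One
Director: Steven Spielberg
Writers: Zak Penn / Ernest Cline
Cast: Tye Sheridan / Olivia Cooke / Ben Mendelsohn / Mark Rylance / Lena Waithe / more...
Genre: Action / Sci-Fi / Adventure

The items file therefore declares one field for each of these: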

#items.py

import scrapy

class DoubanItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    movie_name = scrapy.Field()
    movie_dir = scrapy.Field()
    movie_editors = scrapy.Field()
    movie_actors = scrapy.Field()
    movie_type = scrapy.Field()


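2. Extract the required information

Once the fields are decided, we can start writing the spider code.

First, create a spider file.

Run the following command on the command line (you must be inside the project folder):

scrapy genspider spidername "domain"
# spidername is the name of the spider to create; it must be unique and cannot be the same as the project name
# domain is the host of the site to crawl, i.e. its domain name, for example www.baidu.com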


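After the spider file is created, open the spiders folder inside the project and open the new file in an editor.

The file already defines start_urls, the links the spider requests when it runs.

Note that this is a list, so it can hold multiple URLs.

When the spider runs, it requests each link in start_urls in turn and passes each returned response as an argument to the parse function.

Inside parse, we extract the information we need from the page.

This example scrapes a single page (the detail page for Ready Player One). The code is as follows: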

# -*- coding: utf-8 -*-
# movieInfoSpider.py
import scrapy
# Import the DoubanItem class
from douban.items import DoubanItem

class MovieinfoSpider(scrapy.Spider):
    name = 'movieInfo'
    allowed_domains = ['movie.douban.com']
    start_urls = ['https://movie.douban.com/subject/4920389/?from=showing']

    def parse(self, response):
        # Create a DoubanItem instance
        item = DoubanItem()

        # Each field is a label followed by the matched link texts joined with '/'
        item['movie_name'] = response.xpath('//title/text()').extract()[0]
        item['movie_dir'] = 'Director:' + '/'.join(response.xpath('//div[@id="info"]/span[1]/span/a/text()').extract())
        item['movie_editors'] = 'Writers:' + '/'.join(response.xpath('//div[@id="info"]/span[2]/span/a/text()').extract())
        item['movie_actors'] = 'Cast:' + '/'.join(response.xpath('//div[@id="info"]/span[3]/span/a/text()').extract())
        item['movie_type'] = 'Genre:' + '/'.join(response.xpath('//div[@id="info"]/span[@property=

        yield item
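
The label-plus-join pattern used in parse can be tried without a network request. The following is a minimal sketch that uses the standard library's xml.etree.ElementTree instead of Scrapy's selectors, so it illustrates the idea rather than the spider itself. The SAMPLE markup is a hypothetical, simplified stand-in for the page's info block, and join_links is an illustrative helper name, not part of the original code.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the info block of the detail page.
SAMPLE = """
<html><body>
<div id="info">
  <span><span>Director</span>: <span><a>Steven Spielberg</a></span></span>
  <span><span>Writers</span>: <span><a>Zak Penn</a> / <a>Ernest Cline</a></span></span>
</div>
</body></html>
""".strip()

root = ET.fromstring(SAMPLE)


def join_links(root, span_index, label):
    """Collect the link texts under the n-th top-level span and join them.

    Mirrors the spider's '//div[@id="info"]/span[n]/span/a/text()' query,
    using ElementTree's limited XPath subset (no text() step, so .text is
    read from each matched element instead).
    """
    path = f".//div[@id='info']/span[{span_index}]/span/a"
    names = [a.text for a in root.findall(path)]
    return label + "/".join(names)


director = join_links(root, 1, "Director:")
writers = join_links(root, 2, "Writers:")
print(director)
print(writers)
```

If a query matches nothing, the join produces an empty string and the field contains only its label, which is why this pattern does not raise on missing data the way indexing with [0] does.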


After the required information has been extracted, the yield keyword hands the item over to pipelines.py for further processing.
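
To make the hand-off concrete, here is a simplified, synchronous model of the flow: a callback yields items, and each item passes through the pipeline components in order. This is only a sketch of the idea. Scrapy's real engine is asynchronous, and the names run, UppercasePipeline and CollectPipeline are hypothetical.

```python
def parse(response):
    # Stand-in for a spider callback: yields one item per response.
    yield {"movie_name": response["title"]}


class UppercasePipeline:
    def process_item(self, item, spider):
        # Modify the item, then return it so the next component receives it.
        item["movie_name"] = item["movie_name"].upper()
        return item


class CollectPipeline:
    def __init__(self):
        self.seen = []

    def process_item(self, item, spider):
        self.seen.append(item)
        return item


def run(responses, callback, pipelines):
    """Feed every yielded item through each pipeline component in order."""
    results = []
    for response in responses:
        for item in callback(response):
            for pipeline in pipelines:
                item = pipeline.process_item(item, spider=None)
            results.append(item)
    return results


collector = CollectPipeline()
output = run([{"title": "Ready Player One"}], parse,
             [UppercasePipeline(), collector])
print(output)
```

Because each component receives whatever the previous one returned, a process_item that forgets its return statement would pass None down the chain.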

3. Save the extracted information

When an item reaches pipelines.py, the pipeline's process_item method is called to handle it. Here we save the movie information to a txt file:

# -*- coding: utf-8 -*-
# pipelines.py

class DoubanPipeline(object):
    def __init__(self):
        # Open in binary mode; each line is encoded to UTF-8 before writing
        self.fo = open('info.txt', 'wb')

    def process_item(self, item, spider):
        self.fo.write((item['movie_name'] + '\n').encode('utf-8'))
        self.fo.write((item['movie_dir'] + '\n').encode('utf-8'))
        # The field is declared as movie_editors in items.py
        self.fo.write((item['movie_editors'] + '\n').encode('utf-8'))
        self.fo.write((item['movie_actors'] + '\n').encode('utf-8'))
        self.fo.write((item['movie_type'] + '\n').encode('utf-8'))

        # Return the item so that any later pipeline components receive it
        return item

    def close_spider(self, spider):
        self.fo.close()
    # __init__ and close_spider play roles similar to a constructor and destructor in C++
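
As a possible alternative, the file can be opened in the open_spider hook and written in text mode with an explicit encoding, which avoids encoding every line by hand. The sketch below is one way to do that, not the original author's code. The class name TextFilePipeline and its path parameter are hypothetical, and the hooks are called by hand with a plain dict so the example runs without Scrapy.

```python
import os
import tempfile


class TextFilePipeline:
    """Hypothetical variant of DoubanPipeline (illustrative names only)."""

    FIELDS = ("movie_name", "movie_dir", "movie_editors",
              "movie_actors", "movie_type")

    def __init__(self, path="info.txt"):
        self.path = path
        self.fo = None

    def open_spider(self, spider):
        # Called once when the spider starts: open the file in text mode.
        self.fo = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # Write one field per line, then hand the item to the next component.
        for field in self.FIELDS:
            self.fo.write(item[field] + "\n")
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes: flush and close the file.
        self.fo.close()


# Exercise the hooks by hand with a plain dict standing in for an item.
demo_path = os.path.join(tempfile.mkdtemp(), "info.txt")
pipeline = TextFilePipeline(demo_path)
pipeline.open_spider(spider=None)
demo_item = {field: field + " value" for field in TextFilePipeline.FIELDS}
returned = pipeline.process_item(demo_item, spider=None)
pipeline.close_spider(spider=None)

with open(demo_path, encoding="utf-8") as fh:
    lines = fh.read().splitlines()
```

Looping over a tuple of field names also removes the risk of mistyping a single key in one of several near-identical write calls.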


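4. Enable DoubanPipeline in settings.py

Only the relevant parts of the file are shown here:

# Obey robots.txt rules
# Whether to obey the site's robots.txt rules. The settings.py generated for a new project sets this to True; this example sets it to False.
ROBOTSTXT_OBEY = False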

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'douban.pipelines.DoubanPipeline': 300,
}

# Set request headers to mimic a browser
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
'Accept-Encoding': 'gzip, deflate, br',
'Accept-Language': 'zh-CN,zh;q=0.9',
'Cache-Control': 'max-age=0',
'Connection': 'keep-alive',
'Cookie': 'bid=uzUipzgnxdY; ll="118267"; __utmc=30149280; __utmz=30149280.1523088054.4.4.utmcsr=baidu|utmccn=(organic)|utmcmd=organic; __utmc=223695111; __utmz=223695111.1523088054.1.1.utmcsr=baidu|utmccn=(organic)|utmcmd=organic; __yadk_uid=u46EFxFlzD46PvWysMULc80N9s8k2pp4; _vwo_uuid_v2=DC94F00058615E2C6A432CB494EEB894B|64bbcc3ac402b9490e5de18ce3216c5f; _pk_ref.100001.4cf6=%5B%22%22%2C%22%22%2C1523092410%2C%22https%3A%2F%2Fwww.baidu.com%2Flink%3Furl%3DFIqLEYPF6UnylF-ja19vuuKZ51u3u5gGYJHpVJ5MRTO-oLkJ_C84HBgYi5OulPwl%26wd%3D%26eqid%3Dd260482b00005bbb000000055ac87ab2%22%5D; _pk_id.100001.4cf6=cbf515d686eadc0b.1523088053.2.1523092410.1523088087.; _pk_ses.100001.4cf6=*; __utma=30149280.1054682088.1514545233.1523088054.1523092410.5; __utmb=30149280.0.10.1523092410; __utma=223695111.979367240.1523088054.1523088054.1523092410.2; __utmb=223695111.0.10.1523092410',
'Host': 'movie.douban.com',
'Upgrade-Insecure-Requests': '1',
}
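
The integer in ITEM_PIPELINES sets the order in which components run: items pass through lower values first, and values are customarily kept in the 0 to 1000 range. The sketch below only models that ordering with a sort. CleanPipeline and ExportPipeline are hypothetical entries added for illustration, and pipeline_order is not a Scrapy function.

```python
ITEM_PIPELINES = {
    "douban.pipelines.DoubanPipeline": 300,
    # Hypothetical extra components, for illustration only.
    "douban.pipelines.CleanPipeline": 100,
    "douban.pipelines.ExportPipeline": 800,
}


def pipeline_order(setting):
    """Return the component paths sorted from lowest to highest value."""
    return [path for path, _ in sorted(setting.items(), key=lambda kv: kv[1])]


order = pipeline_order(ITEM_PIPELINES)
for position, path in enumerate(order, start=1):
    print(position, path)
```

With a single component, as in this project, the exact number does not matter; it becomes relevant once a cleaning or de-duplication step has to run before the one that writes to disk.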


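5. Run the spider

From inside the project folder, run the following command. The argument is the spider's name attribute ('movieInfo'), not the name of its file:

scrapy crawl movieInfo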


Summary: the order for building a Scrapy spider is items.py --> spiders --> pipelines.py --> settings.py

Original article: https://blog.csdn.net/qq_40695895/article/details/79842502