Distributed crawler (scrapy-redis)

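- Why native Scrapy cannot run as a distributed crawler

    - The scheduler cannot be shared between crawler processes

    - The item pipeline cannot be shared between crawler processes

- What the scrapy-redis component does

    - It provides a scheduler and a pipeline that can be shared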
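To make the point about sharing concrete, here is a minimal in-memory sketch. `SharedState` is a hypothetical stand-in for what Redis holds (one queue plus one set of already-seen URLs), and the link graph `LINKS` is made up for illustration; none of this is scrapy-redis code.

```python
from collections import deque


class SharedState:
    """Hypothetical stand-in for Redis: one request queue and one seen-set."""

    def __init__(self, start_urls):
        self.queue = deque(start_urls)
        self.seen = set(start_urls)

    def push(self, url):
        # Enqueue the URL only if no worker has scheduled it before.
        if url in self.seen:
            return False
        self.seen.add(url)
        self.queue.append(url)
        return True

    def pop(self):
        return self.queue.popleft() if self.queue else None


def crawl(state, workers, links):
    """Let the workers take turns pulling from `state`; return what each fetched."""
    fetched = {worker: [] for worker in workers}
    turn = 0
    while True:
        url = state.pop()
        if url is None:
            break
        worker = workers[turn % len(workers)]
        fetched[worker].append(url)
        for next_url in links.get(url, ()):
            state.push(next_url)
        turn += 1
    return fetched


# Made-up link graph: every page links back to pages that were already seen.
LINKS = {"p1": ["p2", "p3"], "p2": ["p3", "p1"], "p3": ["p1"]}

# Two workers, one shared queue and seen-set: the work is split.
shared = crawl(SharedState(["p1"]), ["A", "B"], LINKS)

# Two workers, each with a private queue and seen-set: the work is duplicated.
separate = [crawl(SharedState(["p1"]), [name], LINKS) for name in ("A", "B")]
```

With the shared state every page is fetched once in total; with private state each worker ends up fetching every page itself, which is the situation with plain Scrapy on several machines.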

- Steps to build a distributed crawler

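1. Install the component: pip install scrapy-redis
2. Create the project
3. Create the spider file; its base class will be RedisCrawlSpider or RedisSpider
    - scrapy genspider -t crawl xxx www.xxx.com
4. Modify the relevant attributes in the spider file:
    - Import the class: from scrapy_redis.spiders import RedisCrawlSpider
    - Change the spider's parent class to RedisCrawlSpider
    - Replace the start_urls list with redis_key = 'xxx' (the name of the Redis queue that start URLs are read from)
5. Configure the settings file:
    - Use the shareable pipeline class that ships with the component:
        ITEM_PIPELINES = {
            'scrapy_redis.pipelines.RedisPipeline': 400
            }
    - Configure the scheduler (use the shareable scheduler that ships with the component)
        # Dedup filter class: stores request fingerprints in a Redis set, so request de-duplication is persistent
        DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
        # Use scrapy-redis's own scheduler
        SCHEDULER = "scrapy_redis.scheduler.Scheduler"
        # Whether the scheduler state is persistent, i.e. whether the request queue and the fingerprint set in Redis are kept when the spider finishes. True keeps the data; otherwise it is cleared
        SCHEDULER_PERSIST = True

    - Specify the Redis server that stores the data:
        REDIS_HOST = 'IP address of the Redis server'
        REDIS_PORT = 6379

    - Edit the Redis configuration file
        - Turn off protected mode: protected-mode no
        - Comment out the bind line: #bind 127.0.0.1

    - Start Redis

6. Run the distributed program
    scrapy runspider xxx.py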
7. Push a start URL into the queue:
    In redis-cli, run lpush with the redis_key from step 4 and the start URL: lpush xxx www.xxx.com

- Example: crawling titles and authors from Chouti (dig.chouti.com)

# -*- coding: utf-8 -*-
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule
from scrapy_redis.spiders import RedisCrawlSpider

from redischoutipro.items import RedischoutiproItem


class ChoutiSpider(RedisCrawlSpider):
    name = 'chouti'

    # Name of the Redis queue that start URLs are read from (replaces start_urls)
    redis_key = "chouti"

    # Follow every pagination link and parse each page with parse_item
    rules = (
        Rule(LinkExtractor(allow=r'/all/hot/recent/\d+'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # Each entry on the page sits in a <div class="item">
        div_list = response.xpath('//div[@class="item"]')
        for div in div_list:
            title = div.xpath('./div[4]/div[1]/a/text()').extract_first()
            author = div.xpath('./div[4]/div[2]/a[4]/b/text()').extract_first()
            item = RedischoutiproItem()
            item["title"] = title
            item["author"] = author

            yield item

spiders/chouti.py
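To check which URLs the rule in the spider will follow, the allow pattern can be tried on its own with the standard `re` module. This is only a sketch: it assumes LinkExtractor applies `allow` as a regular-expression search against the absolute URL, and `rule_matches` is a hypothetical helper, not part of Scrapy. The second and third URLs are invented for the test.

```python
import re

# Same pattern as the LinkExtractor in the spider.
ALLOW = re.compile(r'/all/hot/recent/\d+')


def rule_matches(url):
    """Return True if the pattern is found anywhere in the URL."""
    return ALLOW.search(url) is not None


candidates = [
    "https://dig.chouti.com/all/hot/recent/1",
    "https://dig.chouti.com/all/hot/recent/25",
    "https://dig.chouti.com/all/new",
]
followed = [url for url in candidates if rule_matches(url)]
```

Only the two pagination URLs end up in `followed`; the last one has no page number after `/recent/` and is skipped.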

import scrapy


class RedischoutiproItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    author = scrapy.Field()

items.py

ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 400
}

DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

SCHEDULER = "scrapy_redis.scheduler.Scheduler"
SCHEDULER_PERSIST = True  # keep the request queue and fingerprint set in Redis after the spider finishes

REDIS_HOST = '127.0.0.1'

REDIS_PORT = 6379

settings.py

    - Run the spider: change into the project's spiders directory, then run:

scrapy runspider chouti.py

    - Open redis-cli and push a start URL onto the queue

lpush chouti https://dig.chouti.com/all/hot/recent/1
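The same push can be done from Python instead of redis-cli. The sketch below assumes a client object with an `lpush(name, *values)` method, which is what redis-py's `redis.Redis` offers; `seed_start_urls` is a hypothetical helper name.

```python
def seed_start_urls(client, redis_key, urls):
    """Push start URLs onto the list the spider reads from.

    `client` is assumed to behave like redis-py's redis.Redis.
    Returns whatever LPUSH reports (the list length), or None if
    there was nothing to push.
    """
    urls = [url for url in urls if url]
    if not urls:
        return None
    return client.lpush(redis_key, *urls)
```

With redis-py installed and the settings above, a call could look like `seed_start_urls(redis.Redis(host="127.0.0.1", port=6379), "chouti", ["https://dig.chouti.com/all/hot/recent/1"])`.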

    - Run keys * afterwards and you will see that these keys were created automatically:

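        - chouti:requests: the scheduler's queue of pending requests, stored in serialized form (which is why the entries look like unreadable bytes)

        - chouti:items: the scraped items are all stored here

        - chouti:dupefilter: stores the fingerprint of every request already scheduled (the result of running the request, including its URL, through a hash function) to prevent duplicate crawling; as long as Redis is not flushed, an interrupted crawl can resume where it left off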
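The fingerprint idea can be sketched with the standard library. This is a simplified illustration, not the exact fingerprint algorithm Scrapy uses: it hashes only the method, URL and body, and a plain Python set stands in for the Redis set. `request_fingerprint` and `SetDupeFilter` are hypothetical names.

```python
import hashlib


def request_fingerprint(method, url, body=b""):
    """Simplified fingerprint: SHA-1 over method, URL and body."""
    digest = hashlib.sha1()
    digest.update(method.upper().encode("utf-8"))
    digest.update(url.encode("utf-8"))
    digest.update(body)
    return digest.hexdigest()


class SetDupeFilter:
    """Dedup filter backed by a local set (Redis would hold this set instead)."""

    def __init__(self):
        self.fingerprints = set()

    def request_seen(self, method, url, body=b""):
        # Return True if an identical request was recorded before.
        fp = request_fingerprint(method, url, body)
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)
        return False


dupefilter = SetDupeFilter()
first = dupefilter.request_seen("GET", "https://dig.chouti.com/all/hot/recent/1")
again = dupefilter.request_seen("GET", "https://dig.chouti.com/all/hot/recent/1")
other = dupefilter.request_seen("GET", "https://dig.chouti.com/all/hot/recent/2")
```

Because the set lives in Redis rather than in one process, every worker consults the same fingerprints, and they survive a restart as long as the key is not deleted.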

# View the scraped data
lrange chouti:items 0 -1
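The entries returned by lrange are strings, one per item. Assuming the pipeline's default serializer is in use, which writes each item as JSON, they can be turned back into dictionaries as sketched below. `decode_items` is a hypothetical helper, and the sample entry is made up.

```python
import json


def decode_items(raw_entries):
    """Decode raw entries from the items list into Python dicts."""
    items = []
    for raw in raw_entries:
        # A Redis client typically returns bytes unless told to decode responses.
        if isinstance(raw, bytes):
            raw = raw.decode("utf-8")
        items.append(json.loads(raw))
    return items


# Made-up sample in the shape the spider's items would have.
sample = [b'{"title": "example title", "author": "example author"}']
decoded = decode_items(sample)
```

With redis-py, the raw entries could come from `client.lrange("chouti:items", 0, -1)`.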

Reposted from: https://www.cnblogs.com/lzmdbk/p/10477982.html
