First, the official GitHub address:

https://github.com/rmax/scrapy-redis

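Features

1. Distributed crawling/scraping

You can start multiple spider instances that share a single Redis queue. Best suited for broad multi-domain crawls.

2. Distributed post-processing

Scraped items are pushed into a Redis queue, which means you can start as many post-processing processes as needed, all sharing the item queue.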
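A post-processing worker of the kind described above could look roughly like this. It is only a sketch: it assumes the items were stored as JSON (the default serializer mentioned in the settings further down), and `drain_items` and `connect` are hypothetical helper names, not part of scrapy-redis. The client is expected to be a redis-py client.

```python
import json


def drain_items(client, key, handle, timeout=1):
    """Pop JSON-encoded items from a Redis list until it stays empty.

    `client` is assumed to be a redis-py client (e.g. redis.Redis);
    any object exposing blpop(keys, timeout) works.
    """
    processed = 0
    while True:
        popped = client.blpop([key], timeout=timeout)
        if popped is None:  # nothing arrived within `timeout` seconds
            return processed
        _key, raw = popped
        handle(json.loads(raw))
        processed += 1


def connect(url="redis://192.168.0.104:6379"):
    # Imported lazily so drain_items has no hard dependency on redis-py.
    # The address is an assumption; use your own Redis server.
    import redis
    return redis.Redis.from_url(url)
```

Several such workers can run at once, e.g. `drain_items(connect(), "myspider:items", print)` on each machine, since each pop removes the item from the shared list.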

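3. Scrapy plug-and-play components

Scheduler + duplication filter, item pipeline, base spiders.

Environment

Redis, Scrapy, and scrapy-redis are all you need.

scrapy-redis is a distributed crawler: multiple machines work together, and the data is stored in Redis. Make sure the Redis server accepts remote connections.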
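Before starting any spider it can help to check from each crawling machine that the Redis port is reachable at all. The sketch below only tests whether a TCP connection can be opened; it does not speak the Redis protocol, so it says nothing about authentication or about settings such as `bind` and `protected-mode` in redis.conf. The function name `can_reach` is made up for this example.

```python
import socket


def can_reach(host, port=6379, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

For example, `can_reach("192.168.0.104")` run on a slave should return True before the crawl is started.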

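A reminder to myself (not sure whether this is right): the scrapy-redis "server side" does nothing more than provide a Redis server.

So any machine that actually crawls acts as a slave, although a single machine can be both master and slave.

Let's test it by crawling the news on dushu.com. The site the official docs suggest, http://www.dmoz.org/, seems to have shut down. That doesn't matter; this post is for my own review anyway, and if any expert reading it spots a mistake, please point it out.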

https://www.dushu.com/

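1. Create the project, the same way as for a basic Scrapy project.

-> scrapy startproject myscrapy_redis
-> cd myscrapy_redis
-> scrapy genspider myspider www.dushu.com

Project created.

2. Modify settings.py. Just copy the settings from the official README into your own settings.py.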

# Enables scheduling storing requests queue in redis.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Ensure all spiders share same duplicates filter through redis.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Default requests serializer is pickle, but it can be changed to any module
# with loads and dumps functions. Note that pickle is not compatible between
# python versions.
# Caveat: In python 3.x, the serializer must return strings keys and support
# bytes as values. Because of this reason the json or msgpack module will not
# work by default. In python 2.x there is no such issue and you can use
# 'json' or 'msgpack' as serializers.
#SCHEDULER_SERIALIZER = "scrapy_redis.picklecompat"

# Don't cleanup redis queues, allows to pause/resume crawls.
#SCHEDULER_PERSIST = True

# Schedule requests using a priority queue. (default)
#SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.PriorityQueue'

# Alternative queues.
#SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.FifoQueue'
#SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.LifoQueue'

# Max idle time to prevent the spider from being closed when distributed crawling.
# This only works if queue class is SpiderQueue or SpiderStack,
# and may also block the same time when your spider start at the first time (because the queue is empty).
#SCHEDULER_IDLE_BEFORE_CLOSE = 10

# Store scraped item in redis for post-processing.
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 300
}

# The item pipeline serializes and stores the items in this redis key.
#REDIS_ITEMS_KEY = '%(spider)s:items'

# The items serializer is by default ScrapyJSONEncoder. You can use any
# importable path to a callable object.
#REDIS_ITEMS_SERIALIZER = 'json.dumps'

# Specify the host and port to use when connecting to Redis (optional).
#REDIS_HOST = 'localhost'
#REDIS_PORT = 6379

# Specify the full Redis URL for connecting (optional).
# If set, this takes precedence over the REDIS_HOST and REDIS_PORT settings.
#REDIS_URL = 'redis://user:pass@hostname:9001'

# Custom redis client parameters (i.e.: socket timeout, etc.)
#REDIS_PARAMS  = {}
# Use custom redis client class.
#REDIS_PARAMS['redis_cls'] = 'myproject.RedisClient'

# If True, it uses redis' ``SPOP`` operation. You have to use the ``SADD``
# command to add URLs to the redis queue. This could be useful if you
# want to avoid duplicates in your start urls list and the order of
# processing does not matter.
#REDIS_START_URLS_AS_SET = False

# Default start urls key for RedisSpider and RedisCrawlSpider.
#REDIS_START_URLS_KEY = '%(name)s:start_urls'

# Use other encoding than utf-8 for redis.
#REDIS_ENCODING = 'latin1'

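Only three settings are left uncommented. They replace the default scheduler and duplicate filter with the ones from scrapy-redis, and register RedisPipeline as the item pipeline.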
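As a quick sanity check on those three settings, something like the following could be run against a settings mapping. It is a hypothetical helper written for this post, not something scrapy-redis provides; the class paths are the ones from the settings shown above.

```python
REQUIRED = {
    "SCHEDULER": "scrapy_redis.scheduler.Scheduler",
    "DUPEFILTER_CLASS": "scrapy_redis.dupefilter.RFPDupeFilter",
}
PIPELINE = "scrapy_redis.pipelines.RedisPipeline"


def missing_scrapy_redis_settings(settings):
    """Return the names of the three settings not yet switched to scrapy-redis."""
    missing = [name for name, value in REQUIRED.items()
               if settings.get(name) != value]
    if PIPELINE not in settings.get("ITEM_PIPELINES", {}):
        missing.append("ITEM_PIPELINES")
    return missing
```

An empty list means the scheduler, the duplicate filter and the item pipeline all point at scrapy-redis.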

3. Modify the spider file. Compared with the spider of an ordinary Scrapy project, the class it inherits from has changed.

import scrapy
from scrapy_redis.spiders import RedisSpider


class MovieinfoSpider(RedisSpider):
    name = 'myspider'

    def parse(self, response):

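        # Extract the next-page link; with no callback given, the response is parsed by this same parse method
        next_page = response.xpath("//a[contains(text(), '下一頁')]/@href")
        print(next_page)
        if next_page:   # a next page exists
            yield scrapy.Request(response.urljoin(next_page.get()))

        # Extract the detail pages; the callback points at the detail-page parser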
        news_detail_page_list = response.xpath("//div[contains(@class, 'news-item')]/h3/a")
        for news_detail_page in news_detail_page_list:
            yield response.follow(news_detail_page, callback=self.parse_detail)

    def parse_detail(self, response):
        title = response.xpath("//h1/text()").get()
        print(title)
        yield {'title': title}

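Looking at this file, start_urls and start_requests() are gone: there is no initial task queue and no function that issues the first requests. Because this is scrapy-redis, the task queue lives on the Redis server and is read from there.

At this point, though, the settings do not yet point at the Redis server, so the spider cannot connect to it and will not start successfully.

4. Modify settings.py

There are two ways for a slave to connect to the server's Redis.

(1) REDIS_HOST is the IP address and REDIS_PORT is the port number. Uncomment them and change the address and port to your own.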

REDIS_HOST = '192.168.0.104'
REDIS_PORT = 6379

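(2) #REDIS_URL = 'redis://user:pass@hostname:9001' carries the username, password, IP and port. Uncomment this line and change it to your own, for example

REDIS_URL = 'redis://192.168.0.104:6379'

If option 2 is set, it overrides the settings from option 1.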
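The precedence between the two options can be pictured with a small function. This is an illustration of the rule only, not the code scrapy-redis itself uses; `resolve_redis_target` is a made-up name, and the localhost/6379 fallbacks are taken from the commented-out defaults in the settings above.

```python
from urllib.parse import urlparse


def resolve_redis_target(settings):
    """Return (host, port), letting REDIS_URL win over REDIS_HOST/REDIS_PORT."""
    url = settings.get("REDIS_URL")
    if url:
        parsed = urlparse(url)
        return parsed.hostname, parsed.port or 6379
    return settings.get("REDIS_HOST", "localhost"), settings.get("REDIS_PORT", 6379)


print(resolve_redis_target({
    "REDIS_HOST": "127.0.0.1",
    "REDIS_URL": "redis://192.168.0.104:6379",
}))  # → ('192.168.0.104', 6379)
```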

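These are the default keys used when the scheduler takes requests from the Redis server and when the pipeline stores data in Redis.

The task queue key is <spider name>:start_urls. For example, if the spider is named myspider, the key is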

myspider:start_urls

# Default start urls key for RedisSpider and RedisCrawlSpider.
#REDIS_START_URLS_KEY = '%(name)s:start_urls'

The item data key is <spider name>:items, e.g. myspider:items

# The item pipeline serializes and stores the items in this redis key.
#REDIS_ITEMS_KEY = '%(spider)s:items'

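Note: if nothing is changed, then when the spider runs, the start URLs are read from the Redis key myspider:start_urls. With the default REDIS_START_URLS_AS_SET = False this key is a list; it is a set only when that setting is True.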
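The two default key templates are ordinary Python %-format strings, so how they expand for a given spider name can be shown directly. The function name `default_keys` is just for this example.

```python
START_URLS_TEMPLATE = "%(name)s:start_urls"  # default REDIS_START_URLS_KEY
ITEMS_TEMPLATE = "%(spider)s:items"          # default REDIS_ITEMS_KEY


def default_keys(spider_name):
    """Expand both default key templates for one spider name."""
    return {
        "start_urls": START_URLS_TEMPLATE % {"name": spider_name},
        "items": ITEMS_TEMPLATE % {"spider": spider_name},
    }


print(default_keys("myspider"))
# → {'start_urls': 'myspider:start_urls', 'items': 'myspider:items'}
```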

5. If you don't like the default keys, you can replace them.

Option 1: set it in settings. For example, set the request queue key directly in settings

REDIS_START_URLS_KEY = 'myspider:start_urls'

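This sets the key to myspider:start_urls. I did not actually change the default; try it yourself if you want to. The item storage key is configured the same way.

Option 2: set the redis_key attribute in the spider file. For example, the line below changes the request queue key. The item storage key, it seems, can only be changed in settings.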

redis_key = 'myspider:start_urls'

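6. The project can be deployed on different machines, as long as each can connect to the server's Redis. Then start the spider.

It is started the same way as ordinary Scrapy. The difference is that once started it keeps running, watching the task queue key on the Redis server for new elements.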

-> scrapy crawl myspider


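I started a virtual machine and ran the spider there together with a copy on Windows. Both programs just keep waiting, because at the start the key does not exist at all. Connect to the Redis server with redis-cli and push an initial value; myspider:start_urls is the key.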

redis-cli->lpush myspider:start_urls https://www.dushu.com/news/
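The same seeding step can be done from Python instead of redis-cli. The sketch below assumes a redis-py client is passed in; `seed_start_urls` is a hypothetical helper, and the key default simply matches the one used in this post.

```python
def seed_start_urls(client, urls, key="myspider:start_urls"):
    """LPUSH the given URLs onto the start-URLs list; return the list length.

    `client` is assumed to be a redis-py client (e.g. redis.Redis);
    only its lpush and llen methods are used.
    """
    urls = list(urls)
    if not urls:
        # LPUSH needs at least one value, so just report the current length.
        return client.llen(key)
    return client.lpush(key, *urls)
```

With redis-py installed, this would be called as `seed_start_urls(redis.Redis(host="192.168.0.104", port=6379), ["https://www.dushu.com/news/"])`.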


7. A run can be interrupted; when it is started again, it continues fetching values without repeating any.


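Redis now holds three keys. The first is for duplicate checking: every request produces a unique fingerprint string, and when the same request turns up again and its fingerprint is already in Redis, the request is not sent.

The second holds the items that have already been stored.

The third is the queue of pending requests.

(Either the official docs or some other blog) seems to say there are four keys. I'm not sure about that; the fourth is presumably the start_urls key, which no longer exists once its list has been emptied. In any case, three keys are enough to explain deduplication, scheduling, and the item pipeline.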
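The duplicate check described above can be sketched with a Redis set and its SADD command. This is a simplified stand-in: the fingerprint here is just a SHA-1 of method plus URL, whereas Scrapy's real request fingerprint takes more of the request into account, and the key name `myspider:dupefilter` is an assumption about the default naming.

```python
import hashlib


def fingerprint(method, url):
    """Simplified request fingerprint: SHA-1 of 'METHOD url'."""
    return hashlib.sha1(f"{method.upper()} {url}".encode("utf-8")).hexdigest()


class SeenFilter:
    """Dedup filter backed by a Redis-like set (any object exposing sadd)."""

    def __init__(self, client, key="myspider:dupefilter"):
        self.client = client
        self.key = key

    def request_seen(self, method, url):
        # SADD returns 1 when the member is new and 0 when it already exists.
        added = self.client.sadd(self.key, fingerprint(method, url))
        return added == 0
```

Because the set lives in Redis rather than in one process, every spider instance consults the same record, which is why a restarted or second spider does not repeat requests.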

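8. Keep running until it stops (meaning not that the program exits, but that no new requests arrive for a long time).

If nothing went wrong, the amount of data left in Redis at the end is quite interesting.

There are 120 items, exactly 15 * 8 pages, while the dedup set has 127 entries. In other words, apart from the initial request, which produced no fingerprint, every page request produced a unique identifier: 120 detail pages plus 7 follow-up list pages.

The arithmetic should be easy enough to check.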
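The counts can be read back from Redis and compared with the arithmetic. The sketch assumes the key names `myspider:dupefilter`, `myspider:items` and `myspider:requests`, and that the request queue is the default priority queue stored as a sorted set; with another queue class the last call would differ. Both function names are made up for this example.

```python
def key_sizes(client, spider="myspider"):
    """Return the number of entries under each of the three keys."""
    return {
        "dupefilter": client.scard(f"{spider}:dupefilter"),  # set of fingerprints
        "items": client.llen(f"{spider}:items"),             # list of stored items
        "requests": client.zcard(f"{spider}:requests"),      # pending requests
    }


def expected_fingerprints(list_pages, items_per_page):
    """Detail pages plus every list page except the seeded first one."""
    return list_pages * items_per_page + (list_pages - 1)


print(expected_fingerprints(8, 15))  # → 127
```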
