Install scrapy_redis

```
pip install scrapy-redis
```
The Scrapy-Redis distributed strategy
- Master (core server): I use a Linux virtual machine running a Redis database. The master does no crawling itself; it is only responsible for URL-fingerprint deduplication, Request allocation, and storage of the scraped data.
- Slaver (spider-running side): I use Win10 to run the spider programs; while running, they submit newly generated Requests back to the Master.
How the distributed crawler runs
![](https://img.laitimes.com/img/__Qf2AjLwojIjJCLyojI0JCLiAzNfRHLGZkRGZkRfJ3bs92YsYTMfVmepNHL1ZVbiZnRykVMs1WY2R2MMBjVtJWd0ckW65UbM5WOHJWa5kHT20ESjBjUIF2X0hXZ0xCMx81dvRWYoNHLrdEZwZ1Rh5WNXp1bwNjW1ZUba9VZwlHdssmch1mclRXY39CXldWYtlWPzNXZj9mcw1ycz9WL49zZuBnL1ADO1ADN1cTM1AjNwkTMwIzLc52YucWbp5GZzNmLn9Gbi1yZtl2Lc9CX6MHc0RHaiojIsJye.png)
- First, each Slaver takes a task (Request/URL) from the Master and scrapes it; while scraping, any Request for a newly discovered task is submitted back to the Master for handling;
- The Master holds the single Redis database. It deduplicates unprocessed Requests and allocates tasks, adds accepted Requests to the pending-crawl queue, and stores the scraped data.
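The master's dedup-and-allocate step boils down to keeping a set of request fingerprints next to a pending queue. Below is a minimal in-memory sketch of that logic; `fingerprint` and `MiniScheduler` are illustrative names standing in for what Redis (a set plus a list) does in the real setup, not the scrapy-redis API:

```python
import hashlib
from collections import deque

def fingerprint(url: str) -> str:
    """Hash the URL into a fixed-size fingerprint, the way a dupefilter would."""
    return hashlib.sha1(url.encode("utf-8")).hexdigest()

class MiniScheduler:
    """Toy master: a dedup set plus a pending queue (a Redis set/list in reality)."""
    def __init__(self):
        self.seen = set()       # fingerprints of every Request ever submitted
        self.pending = deque()  # Requests waiting to be handed to a slave

    def submit(self, url: str) -> bool:
        """A slave submits a new Request; enqueue it only if it is unseen."""
        fp = fingerprint(url)
        if fp in self.seen:
            return False  # duplicate, dropped
        self.seen.add(fp)
        self.pending.append(url)
        return True

    def next_request(self):
        """A slave asks for work; hand out the oldest pending Request."""
        return self.pending.popleft() if self.pending else None
```

Submitting the same URL twice enqueues it only once, which is exactly the fingerprint-dedup guarantee described above.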
Example: crawling the Douban Top 250 with a distributed crawler
Configuration of the Redis database on the virtual machine is omitted here.
First grab the scrapy-redis examples from GitHub, then move the example-project directory inside it to wherever you want to work:
https://github.com/rmax/scrapy-redis
The scrapy-redis source ships with an example-project containing three spiders: dmoz, myspider_redis, and mycrawler_redis.
Here we use mycrawler_redis.
In mycrawler_redis there is no longer a start_urls; it is replaced by redis_key. scrapy-redis pops the key's values out of Redis and turns them into the URLs to request.
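Conceptually, an idle spider keeps polling the redis_key list and turns whatever it pops into a request. A stdlib sketch of that behavior (a deque stands in for the Redis list; `next_start_request` is an illustrative name, not scrapy-redis API):

```python
from collections import deque

# What douban:start_urls might look like inside Redis after one lpush
start_urls_key = deque(["https://movie.douban.com/top250"])

def next_start_request():
    """Pop one URL from the key, as scrapy-redis does when the spider is idle."""
    if start_urls_key:
        return {"url": start_urls_key.popleft()}  # stand-in for scrapy.Request(url)
    return None  # nothing queued: the spider stays idle and keeps waiting
```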
Here is the code.
1. Modify items.py

```python
from scrapy.item import Item, Field
from scrapy.loader import ItemLoader
from scrapy.loader.processors import MapCompose, TakeFirst, Join


class ExampleItem(Item):
    mingzi = Field()   # film title
    daoyan = Field()   # director
    riqi = Field()     # release date
    jianjie = Field()  # synopsis
    url = Field()


class ExampleLoader(ItemLoader):
    default_item_class = ExampleItem
    default_input_processor = MapCompose(lambda s: s.strip())
    default_output_processor = TakeFirst()
    description_out = Join()
```
2. Modify pipelines.py

```python
class ExamplePipeline(object):
    # Save the scraped items to a local txt file
    def __init__(self):
        self.file = open("douban.txt", "w", encoding="utf-8")

    def process_item(self, item, spider):
        self.file.write(str(item) + "\r\n")
        self.file.flush()
        print(item)
        return item

    def __del__(self):
        self.file.close()
```
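Relying on `__del__` to close the file works here but is fragile; Scrapy's conventional lifecycle hooks are `open_spider`/`close_spider`. A sketch of the same pipeline using them (plain Python; `TxtPipeline` is an illustrative name, and a dict stands in for a real Item):

```python
class TxtPipeline:
    """Same idea as ExamplePipeline, but with explicit lifecycle hooks."""
    def __init__(self, path="douban.txt"):
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts
        self.file = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        self.file.write(str(item) + "\r\n")
        self.file.flush()
        return item

    def close_spider(self, spider):
        # Called once when the spider closes, even on clean shutdown
        self.file.close()
```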
3. Modify settings.py

```python
REDIS_HOST = 'x.x.x.x'
REDIS_PORT = 6379
REDIS_URL = "redis://:password@x.x.x.x:6379"
# format: redis://:<password>@<host>:<port>
```
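For the slave to actually schedule and deduplicate through Redis, the example-project's settings.py also enables the standard scrapy-redis components (these lines ship with the example project):

```python
# Route scheduling and request deduplication through Redis instead of in-memory
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# Keep the queue and dupefilter in Redis between runs
SCHEDULER_PERSIST = True
ITEM_PIPELINES = {
    'example.pipelines.ExamplePipeline': 300,
    'scrapy_redis.pipelines.RedisPipeline': 400,
}
```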
4. Write the spider
Mydouban.py

```python
from scrapy.spiders import Rule, CrawlSpider
from scrapy.linkextractors import LinkExtractor
from scrapy_redis.spiders import RedisMixin, RedisCrawlSpider

from ..items import ExampleItem


class MyCrawler(RedisCrawlSpider):
    """Spider that reads urls from a redis queue (douban:start_urls)."""
    name = 'mydouban_redis'
    redis_key = 'douban:start_urls'

    rules = (
        # Follow all pagination links (?start=25&filter=, ...)
        Rule(LinkExtractor(allow=(r'\?start=\d+&filter=')), follow=True),
        # Parse each film's detail page
        Rule(LinkExtractor(allow=(r'movie\.douban\.com/subject/\d+/')),
             callback='parse_page', follow=False),
    )

    def set_crawler(self, crawler):
        CrawlSpider.set_crawler(self, crawler)  # default CrawlSpider setup
        RedisMixin.setup_redis(self)            # start URLs come from Redis

    # def __init__(self, *args, **kwargs):
    #     # Dynamically define the allowed domains list.
    #     domain = kwargs.pop('domain', '')
    #     self.allowed_domains = filter(None, domain.split(','))
    #     super(MyCrawler, self).__init__(*args, **kwargs)

    def getinf(self, response):
        mingzi = response.xpath('//*[@id="content"]/h1/span[1]/text()').extract()
        daoyan = response.xpath('//*[@id="info"]/span[1]/span[2]/a/text()').extract()
        riqi = response.xpath('//*[@id="info"]/span[10]/text()').extract()
        jianjie = response.xpath('//*[@id="link-report"]/span[1]/span/text()').extract()
        if not jianjie:  # some pages put the synopsis directly in the outer span
            jianjie = response.xpath('//*[@id="link-report"]/span[1]/text()').extract()
        url = response.url
        return mingzi, daoyan, riqi, jianjie, url

    def parse_page(self, response):
        mingzi, daoyan, riqi, jianjie, url = self.getinf(response)
        item = ExampleItem()
        item['url'] = url
        item['mingzi'] = mingzi
        item['daoyan'] = daoyan
        item['riqi'] = riqi
        item['jianjie'] = jianjie
        yield item
```
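The two `allow` patterns in `rules` decide which links get followed and which get parsed. They can be sanity-checked against sample Douban URLs with plain `re` (the URLs and the `classify` helper below are illustrative):

```python
import re

PAGINATION = r'\?start=\d+&filter='          # pagination links: follow
DETAIL = r'movie\.douban\.com/subject/\d+/'  # film detail pages: parse

def classify(url: str) -> str:
    """Mirror the Rule order: pagination links are followed, detail pages parsed."""
    if re.search(PAGINATION, url):
        return "follow"
    if re.search(DETAIL, url):
        return "parse_page"
    return "ignore"
```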
Run the project
- Launch the spider's py file with runspider (you can launch several instances this way); the spider(s) will sit in a ready state, waiting for work:
  `scrapy runspider mydouban_redis.py`
- In redis-cli on the Master, issue a push command. Note the key must match the spider's redis_key:
  `lpush douban:start_urls https://movie.douban.com/top250`
- The spiders fetch the URL and start crawling.
For me, `scrapy runspider mydouban_redis.py` failed here with "cannot find mydouban_redis.py", for reasons I couldn't work out; if anyone knows why, please let me know.
I got it running with `scrapy crawl <my spider's name>` instead.