
Distributed Crawling with Scrapy_redis

1. Create the project:

scrapy startproject mySpider

2. Create the spider (note the genspider -t flag):

scrapy genspider -t crawl tencent3 hr.tencent.com

3. Install the required packages, as sketched below
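The original post does not list the exact packages; a minimal setup, assuming the standard PyPI package names, would be:

pip install scrapy scrapy-redis redis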

4. Code for tencent3.py

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule
from scrapy_redis.spiders import RedisCrawlSpider


class TencentSpider(RedisCrawlSpider):
    name = 'tencent3'
    allowed_domains = ['hr.tencent.com']
    # start_urls is replaced by redis_key: the spider blocks until a start
    # URL is pushed onto this Redis list.
    # start_urls = ['https://hr.tencent.com/position.php']
    redis_key = 'tencent3:start_urls'

    rules = (
        # Follow pagination links.
        Rule(LinkExtractor(restrict_xpaths=('//tr[@class="f"]',)), follow=True),
        # Extract job-detail links and hand them to parse_item.
        Rule(LinkExtractor(restrict_xpaths=('//tr[@class="odd"]', '//tr[@class="even"]')), callback='parse_item'),
    )

    def parse_item(self, response):
        item = {}
        item['name'] = response.xpath('//td[@id="sharetitle"]/text()').extract_first()
        item['address'] = response.xpath('//tr[@class="c bottomline"]/td[1]/text()').extract_first()

        print(item)
        # yield item  # yield the item to send it through the item pipelines

'''
Startup commands:
    sudo redis-server /etc/redis/redis.conf
    redis-cli
    select 15
    LPUSH tencent3:start_urls https://hr.tencent.com/position.php
'''

5. Configure settings.py

# Use scrapy_redis's request fingerprint dedup, its Redis-backed scheduler,
# and keep the request queue in Redis between runs.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
SCHEDULER_PERSIST = True

REDIS_URL = 'redis://192.168.12.189:6379/15'
# 192.168.12.189 is the local VM's IP address

Of course, settings.py also needs the usual basic configuration, which is not covered in detail here; one addition is sketched below.
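One commonly used addition, assuming you also want scraped items collected in Redis, is scrapy_redis's bundled item pipeline (items yielded by the spider are then serialized onto the tencent3:items list):

ITEM_PIPELINES = {
    # Push each yielded item into Redis under <spider name>:items
    'scrapy_redis.pipelines.RedisPipeline': 300,
}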

6. Run the spider

scrapy crawl tencent3
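Because the scheduler queue and dupefilter live in Redis, the same command can be launched on several machines at once; a sketch, assuming each machine has a copy of the project and the same REDIS_URL:

# on machine A
scrapy crawl tencent3
# on machine B (same project, same REDIS_URL)
scrapy crawl tencent3

All instances pull requests from the shared queue and deduplicate against the shared fingerprint set, which is what makes the crawl distributed.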

7. Start Redis and push the start key/value

sudo redis-server /etc/redis/redis.conf
redis-cli
select 15
LPUSH tencent3:start_urls https://hr.tencent.com/position.php

8. Crawling succeeds
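To confirm the run worked, you can inspect the keys scrapy_redis maintains; a quick check, assuming database 15 and the RedisPipeline configured above:

redis-cli -n 15
keys tencent3:*             # e.g. tencent3:dupefilter, tencent3:requests, tencent3:items
LRANGE tencent3:items 0 4   # first few scraped items, serialized as JSON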
