import logging

from scrapy.exceptions import IgnoreRequest

logger = logging.getLogger(__name__)


class HttpError(IgnoreRequest):
    """A non-2xx response was filtered."""

    def __init__(self, response, *args, **kwargs):
        self.response = response
        super(HttpError, self).__init__(*args, **kwargs)


class HttpErrorMiddleware(object):

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings)

    def __init__(self, settings):
        # Global switches read from the project settings.
        self.handle_httpstatus_all = settings.getbool('HTTPERROR_ALLOW_ALL')
        self.handle_httpstatus_list = settings.getlist('HTTPERROR_ALLOWED_CODES')

    def process_spider_input(self, response, spider):
        if 200 <= response.status < 300:  # common case
            return
        meta = response.meta
        # Per-request overrides in Request.meta take priority.
        if 'handle_httpstatus_all' in meta:
            return
        if 'handle_httpstatus_list' in meta:
            allowed_statuses = meta['handle_httpstatus_list']
        elif self.handle_httpstatus_all:
            return
        else:
            # Fall back to the spider attribute, then the global setting.
            allowed_statuses = getattr(spider, 'handle_httpstatus_list',
                                       self.handle_httpstatus_list)
        if response.status in allowed_statuses:
            return
        raise HttpError(response, 'Ignoring non-200 response')

    def process_spider_exception(self, response, exception, spider):
        # Swallow the HttpError raised above: bump stats, log, and return
        # an empty iterable so the response is silently dropped.
        if isinstance(exception, HttpError):
            spider.crawler.stats.inc_value('httperror/response_ignored_count')
            spider.crawler.stats.inc_value(
                'httperror/response_ignored_status_count/%s' % response.status
            )
            logger.info(
                "Ignoring response %(response)r: HTTP status code is not handled or not allowed",
                {'response': response}, extra={'spider': spider},
            )
            return []
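Because process_spider_input consults response.meta before anything else, a single request can opt in to non-2xx responses without touching any global or spider-level configuration. A minimal sketch (the spider name and URL are made up for illustration):

import scrapy

class MetaDemoSpider(scrapy.Spider):
    name = 'meta_demo'  # hypothetical name

    def start_requests(self):
        # Per-request override: the middleware finds 'handle_httpstatus_list'
        # in response.meta and lets the 404 through to the callback.
        yield scrapy.Request(
            'https://example.com/missing-page',
            meta={'handle_httpstatus_list': [404]},
            callback=self.parse,
        )

    def parse(self, response):
        self.logger.info('Received status %d for %s', response.status, response.url)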
HTTPERROR_ALLOW_ALL = True
HTTPERROR_ALLOWED_CODES = [301, 404]
The first setting controls whether everything is allowed through: when it is enabled, every response is handed to the spider regardless of its status code.
The second is the list of additionally allowed status codes.
These are global settings (placed in the project's settings.py), and using them globally is not recommended.
If you would rather configure this per spider, you can set the following attributes on an individual spider class:
handle_httpstatus_all = True
handle_httpstatus_list = [404, 302]
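For example, a spider that wants to handle 404 pages itself could look like this (a sketch; the name, URL, and callback logic are illustrative):

import scrapy

class NotFoundSpider(scrapy.Spider):
    name = 'notfound_demo'  # hypothetical name
    start_urls = ['https://example.com/']
    # Picked up by the middleware via getattr(spider, 'handle_httpstatus_list', ...)
    handle_httpstatus_list = [404, 302]

    def parse(self, response):
        if response.status == 404:
            # Reaches this callback only because of the attribute above.
            yield {'url': response.url, 'status': response.status}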
Unless you know both your target site and Scrapy very well, these settings are best avoided: passing error responses through to the spider is of little use unless you actually handle them.