
downloader middleware study notes (1)

This layer can affect both requests and responses; things like proxy IPs are handled here.
	The downloader middleware is a framework of hooks into Scrapy’s request/response processing. It’s a light, low-level system for globally altering Scrapy’s requests and responses.


Activating a downloader middleware
	To activate a downloader middleware component, add it to the DOWNLOADER_MIDDLEWARES setting, which is a dict whose keys are the middleware class paths and their values are the middleware orders.

Here’s an example:
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.CustomDownloaderMiddleware': 543,
}
myproject is the project name
middlewares is a .py file under the project directory
CustomDownloaderMiddleware is a custom class defined in that .py file
543 is the order of the middleware

DOWNLOADER_MIDDLEWARES (the downloader middlewares you define yourself) and DOWNLOADER_MIDDLEWARES_BASE (the ones built into the framework) are merged together:
	The DOWNLOADER_MIDDLEWARES setting is merged with the DOWNLOADER_MIDDLEWARES_BASE setting defined in Scrapy (and not meant to be overridden) and then sorted by order to get the final sorted list of enabled middlewares: the first middleware is the one closer to the engine and the last is the one closer to the downloader. In other words, the process_request() method of each middleware will be invoked in increasing middleware order (100, 200, 300, ...) and the process_response() method of each middleware will be invoked in decreasing order.
process_request() is where you can add your proxy IP.
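As a concrete illustration of adding a proxy in process_request(), here is a minimal sketch of such a middleware. The class name and proxy addresses are made up; the one real detail is that Scrapy's HTTP download handler honours the `proxy` key in `request.meta`.

```python
import random

# Minimal sketch of a proxy-setting downloader middleware.
# The class name and proxy addresses below are hypothetical.
class RandomProxyMiddleware:
    PROXIES = [
        'http://10.0.0.1:8080',   # hypothetical proxy addresses
        'http://10.0.0.2:8080',
    ]

    def process_request(self, request, spider):
        # Scrapy's download handler routes the request through
        # whatever proxy is set in request.meta['proxy'].
        request.meta['proxy'] = random.choice(self.PROXIES)
        return None  # returning None lets Scrapy keep processing the request
```

Enable it by adding the class path to DOWNLOADER_MIDDLEWARES with a suitable order, as in the settings example above.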

The order of the middlewares matters; some of them may depend on each other.
	To decide which order to assign to your middleware see the DOWNLOADER_MIDDLEWARES_BASE setting and pick a value according to where you want to insert the middleware. The order does matter because each middleware performs a different action and your middleware could depend on some previous (or subsequent) middleware being applied.
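To make the ordering concrete, here is a sketch of a settings entry. The custom class path is the hypothetical one from the earlier example; the one factual anchor is that the built-in HttpProxyMiddleware sits at order 750 in DOWNLOADER_MIDDLEWARES_BASE, and assigning None to a built-in middleware disables it.

```python
# settings.py sketch: choose an order relative to DOWNLOADER_MIDDLEWARES_BASE.
DOWNLOADER_MIDDLEWARES = {
    # process_request() runs in increasing order, so 543 executes
    # before the built-in HttpProxyMiddleware at order 750.
    'myproject.middlewares.CustomDownloaderMiddleware': 543,
    # Setting a built-in middleware's value to None disables it.
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}
```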

class scrapy.downloadermiddlewares.DownloaderMiddleware

Your proxy is added in this method:
process_request(request, spider)
	The possible return values of this method:
	process_request() should either: return None, return a Response object, return a Request object, or raise IgnoreRequest

	If it returns None, Scrapy will call the appropriate downloader handler to process the request. I hit a problem here: if the proxy IP is unstable, the spider can run into trouble in this case.
	If it returns None, Scrapy will continue processing this request, executing all other middlewares until, finally, the appropriate downloader handler is called and the request performed (and its response downloaded).

	So the more reliable option is to return a Response object. If you have read the Architecture overview section, you will know that the response is eventually passed back to the spider for processing. For example, you can fetch the page yourself with requests.get(url, timeout=..., proxies=...) from the requests library and wrap the result in a Scrapy Response.
	If it returns a Response object, Scrapy won’t bother calling any other process_request() or process_exception() methods, or the appropriate download function; it’ll return that response. The process_response() methods of installed middleware are always called on every response.

	Here the return value is a Request object. At first I could not see the difference from returning None; the difference is that the current request is abandoned and the returned request is rescheduled from the beginning, instead of continuing through the rest of the middleware chain.
	If it returns a Request object, Scrapy will stop calling process_request methods and reschedule the returned request. Once the newly returned request is performed, the appropriate middleware chain will be called on the downloaded response.

	If it raises an IgnoreRequest exception, the process_exception() methods of installed downloader middleware will be called. If none of them handle the exception, the errback function of the request (Request.errback) is called. If no code handles the raised exception, it is ignored and not logged (unlike other exceptions).
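The branches above can be sketched in one middleware. Plain stand-in classes are used here so the snippet runs without Scrapy installed; in a real project you would return a scrapy.http.Response subclass and raise scrapy.exceptions.IgnoreRequest. The blocklist and cached URLs are hypothetical.

```python
# Stand-ins for scrapy.http.Response and scrapy.exceptions.IgnoreRequest,
# used only so this sketch is self-contained.
class StubResponse:
    def __init__(self, url, body):
        self.url, self.body = url, body

class IgnoreRequest(Exception):
    pass

class ShortCircuitMiddleware:
    BLOCKED = {'example.invalid'}           # hypothetical blocklist
    CANNED = {'http://cache.local/page'}    # hypothetical pre-cached URLs

    def process_request(self, request, spider):
        host = request.url.split('/')[2]
        if host in self.BLOCKED:
            raise IgnoreRequest(request.url)   # drop the request entirely
        if request.url in self.CANNED:
            # Returning a Response skips the download and every remaining
            # process_request(); only process_response() chains still run.
            return StubResponse(request.url, b'cached body')
        return None  # continue down the middleware chain as normal
```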

Parameters:	

    request (Request object) – the request being processed
    spider (Spider object) – the spider for which this request is intended

I have not studied this one carefully yet
process_response(request, response, spider)

    process_response() should either: return a Response object, return a Request object or raise a IgnoreRequest exception.

    If it returns a Response (it could be the same given response, or a brand-new one), that response will continue to be processed with the process_response() of the next middleware in the chain.

    If it returns a Request object, the middleware chain is halted and the returned request is rescheduled to be downloaded in the future. This is the same behavior as if a request is returned from process_request().

    If it raises an IgnoreRequest exception, the errback function of the request (Request.errback) is called. If no code handles the raised exception, it is ignored and not logged (unlike other exceptions).
    Parameters:	

        request (Request object) – the request that originated the response
        response (Response object) – the response being processed
        spider (Spider object) – the spider for which this response is intended    (the response goes back to the corresponding spider)
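A typical use of process_response() is re-queuing throttled responses. The sketch below uses stand-in classes that mirror Scrapy's Request attributes so it runs on its own; the 403 retry policy and class names are illustrative only.

```python
# Stand-in for scrapy.Request, used only to keep the sketch self-contained.
class StubRequest:
    def __init__(self, url, meta=None):
        self.url, self.meta = url, dict(meta or {})

class RetryOn403Middleware:
    def process_response(self, request, response, spider):
        if response.status == 403:
            # Returning a Request halts the middleware chain and
            # reschedules the request for a fresh download.
            retry = StubRequest(request.url, request.meta)
            retry.meta['retry_times'] = retry.meta.get('retry_times', 0) + 1
            return retry
        return response  # pass the response on to the next middleware
```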

Not studied carefully yet either
process_exception(request, exception, spider)

    Scrapy calls process_exception() when a download handler or a process_request() (from a downloader middleware) raises an exception (including an IgnoreRequest exception)

    process_exception() should return: either None, a Response object, or a Request object.

    If it returns None, Scrapy will continue processing this exception, executing any other process_exception() methods of installed middleware, until no middleware is left and the default exception handling kicks in.

    If it returns a Response object, the process_response() method chain of installed middleware is started, and Scrapy won’t bother calling any other process_exception() methods of middleware.

    If it returns a Request object, the returned request is rescheduled to be downloaded in the future. This stops the execution of process_exception() methods of the middleware the same as returning a response would.
    Parameters:	

        request (Request object) – the request that generated the exception
        exception (an Exception object) – the raised exception
        spider (Spider object) – the spider for which this request is intended
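For the unstable-proxy problem mentioned earlier, process_exception() is the natural hook: on a network error, switch proxy and return the request so it is rescheduled. Python's built-in ConnectionError stands in here for the Twisted connection failures Scrapy actually raises; the class name and proxy list are hypothetical.

```python
import random

BACKUP_PROXIES = ['http://10.0.0.3:8080', 'http://10.0.0.4:8080']  # hypothetical

class ProxyFailoverMiddleware:
    def process_exception(self, request, exception, spider):
        if isinstance(exception, ConnectionError):
            # Returning a Request stops further process_exception() calls
            # and reschedules the download; here we retry via another proxy.
            request.meta['proxy'] = random.choice(BACKUP_PROXIES)
            return request
        return None  # let other middlewares / default handling deal with it
```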