Downloading Baidu Images to Local Disk by Keyword with Scrapy's ImagesPipeline

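The Scrapy framework
I. Image downloading in Scrapy: ImagesPipeline
II. Downloading Baidu images to local disk by keyword
- 1. Building the Baidu image request and parsing the image URLs
- 2. Downloading the images to local disk with ImagesPipeline
Summary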

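The Scrapy framework

Scrapy is an asynchronous crawling framework (it is built on Twisted rather than on threads) that combines requesting, parsing, and storage in one place. For an introduction to the framework and its main components, see:

A beginner's introduction to Scrapy project structure: batch-downloading Baidu images to local disk with Python

The rest of this post uses downloading Baidu images and saving them locally as a worked example of how to crawl images with Scrapy.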

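I. Image downloading in Scrapy: ImagesPipeline

ImagesPipeline is the image download class that Scrapy provides. We can define our own pipeline that inherits from ImagesPipeline to customize how images are downloaded.

In the Scrapy source (https://github.com/scrapy/scrapy/tree/master/scrapy), the pipelines folder contains three Python files: files.py, images.py, and media.py. When we handle images with ImagesPipeline, we mainly rely on the methods defined in these three files.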


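The methods that request images and save them locally can all be found in those three files, and several methods in images.py are ones we can override. (One more detail: ImagesPipeline actually inherits from FilesPipeline; see images.py. The details will be covered in the next post.)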

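II. Downloading Baidu images to local disk by keyword

1. Building the Baidu image request and parsing the image URLs

First define the spider's name, its allowed_domains, and the search keyword.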

image_spider.py

import scrapy
import json
from baidu_crawler.items import BaiduCrawlerItem


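class ImageSpiderSpider(scrapy.Spider):
    name = 'image_spider'
    allowed_domains = ['image.baidu.com']
    key = r'貓'  # search keyword ("cat")
    # Headers used by the paged requests in parse(). The original post never
    # shows this attribute, so the value below is only an assumed placeholder.
    default_headers = {'User-Agent': 'Mozilla/5.0'}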

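Build the initial request to find out how many images Baidu makes available, in preparation for paging through the results. Request takes three arguments here: url is the address to request; callback is the function called with the response, so the result can be parsed in the parse method; and dont_filter=False, which is also the default, means the request goes through Scrapy's duplicate filter. For the full parameter list and definition of Request, see

https://github.com/scrapy/scrapy/blob/be655b855da3f5643b004e9f2d5b9161266c17f4/scrapy/http/request/__init__.py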
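
To make the effect of dont_filter concrete, here is a simplified model of a duplicate filter. It is only a sketch: the class SeenFilter and its method should_schedule are hypothetical names, and fingerprinting the URL alone is an assumption made for brevity, not a description of how Scrapy computes request fingerprints.

```python
import hashlib


class SeenFilter:
    """Toy duplicate filter: remembers a fingerprint for every URL it accepts."""

    def __init__(self):
        self._seen = set()

    def should_schedule(self, url, dont_filter=False):
        # With dont_filter=True the request bypasses the filter entirely.
        if dont_filter:
            return True
        fingerprint = hashlib.sha1(url.encode("utf-8")).hexdigest()
        if fingerprint in self._seen:
            return False  # already accepted once, drop the repeat
        self._seen.add(fingerprint)
        return True
```

Calling should_schedule twice with the same URL accepts the first call and rejects the second, unless dont_filter=True is passed.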

image_spider.py

def start_requests(self):
    url = 'http://image.baidu.com/search/acjson?tn=resultjson_com&ipn=rj&queryWord={word}&word={word}&pn=0'.format(word=self.key)
    yield scrapy.Request(url=url, callback=self.parse, dont_filter=False)
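
The URL above interpolates the keyword as-is. A minimal sketch of building the same query string with explicit percent-encoding follows; build_search_url is a hypothetical helper, and the parameter names are simply the ones that appear in the URL above.

```python
from urllib.parse import urlencode

BASE_URL = "http://image.baidu.com/search/acjson"


def build_search_url(word, pn=0):
    """Return the search URL for `word`, starting at result offset `pn`."""
    params = {
        "tn": "resultjson_com",
        "ipn": "rj",
        "queryWord": word,
        "word": word,
        "pn": pn,
    }
    # urlencode percent-encodes non-ASCII keywords as UTF-8
    return BASE_URL + "?" + urlencode(params)
```

Encoding the keyword explicitly keeps the URL valid even when the keyword contains spaces or characters such as & that would otherwise change the query string.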

Then parse the response to get the total number of images, and use that number to issue paged requests for every image that is available.

image_spider.py

def parse(self, response):
    # Number of images Baidu makes available to the user
    baidu_page_num = json.loads(response.body)['listNum']
    start_urls = [
        'http://image.baidu.com/search/acjson?tn=resultjson_com&ipn=rj&queryWord={word}&word={word}&pn={pn}'.format(
            word=self.key, pn=str(i)) for i in range(0, baidu_page_num, 30)]

    for url in start_urls:
        yield scrapy.Request(url=url, headers=self.default_headers, callback=self.parse_two, dont_filter=False)
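
The paging logic reduces to stepping the pn offset through the total in fixed increments. A small sketch of that arithmetic, where page_offsets is a hypothetical helper and the page size of 30 is taken from the range step used above:

```python
def page_offsets(total, page_size=30):
    """Return the pn offsets needed to cover `total` results."""
    return list(range(0, total, page_size))
```

For a total of 65 results this gives the offsets 0, 30, and 60, so three paged requests cover everything.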

For each of the paged responses above, parse the result, collect all the image URLs, and store them in the item defined in items.py.

image_spider.py

def parse_two(self, response):
    item = BaiduCrawlerItem()
    image_urls = []
    image_data = json.loads(response.body)['data']
    for image in image_data:
        # Skip entries without a thumbURL so a sparse entry cannot raise KeyError
        if image.get('thumbURL'):
            image_urls.append(image['thumbURL'])

    item['image_urls'] = image_urls
    yield item
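
The extraction step can be tried without running the spider. In the sketch below, extract_thumb_urls is a hypothetical helper, and the payload is invented for illustration: it only mimics the two fields the spider reads and is not a real Baidu response. The empty last entry is there to show why entries without a thumbURL are worth skipping.

```python
import json

# Made-up payload that only mimics the fields used by the spider
SAMPLE_BODY = json.dumps({
    "listNum": 2,
    "data": [
        {"thumbURL": "https://example.com/a.jpg"},
        {"thumbURL": "https://example.com/b.jpg"},
        {},
    ],
})


def extract_thumb_urls(body):
    """Collect the thumbURL of every entry in `data` that has one."""
    payload = json.loads(body)
    return [
        entry["thumbURL"]
        for entry in payload.get("data", [])
        if entry.get("thumbURL")
    ]
```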

BaiduCrawlerItem is defined as follows:

items.py

import scrapy

class BaiduCrawlerItem(scrapy.Item):
    image_urls = scrapy.Field()   # image URLs collected by the spider
    image_paths = scrapy.Field()  # local paths filled in by item_completed

2. Downloading the images to local disk with ImagesPipeline

The project's pipeline inherits from ImagesPipeline and overrides the methods involved in downloading images.

pipelines.py

from scrapy import Request
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline


class BaiduCrawlerDownloadPipeline(ImagesPipeline):
    # Request every image URL passed in from the spider; ImagesPipeline
    # downloads the responses and saves them under IMAGES_STORE
    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            yield Request(image_url)

    def item_completed(self, results, item, info):
        # results is a list of (success, file_info) tuples, one per request
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        item['image_paths'] = image_paths
        return item

    # Rename the image
    def file_path(self, request, response=None, info=None, *, item=None):
        # Name the file after the last segment of the image URL.
        # For example, for https://ss1.bdstatic.com/70cFvXSh_Q1YnxGkpoWK1HF6hhy/it/u=1340038759,2253650778&fm=26&gp=0.jpg
        # the file name is u=1340038759,2253650778&fm=26&gp=0.jpg
        # Other names work too, e.g. a hash of the image URL; by default
        # Scrapy names the file after the SHA-1 hash of the URL.
        # The returned path is relative to IMAGES_STORE, so the store
        # directory is not prepended here; request.url is used because
        # response may be None.
        return request.url.split('/')[-1]
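
The naming and filtering logic of the pipeline can be checked in isolation. The sketch below uses only the standard library; the three function names are hypothetical, and the failed entry in a results list is represented by a plain Exception as a stand-in for whatever failure object the pipeline actually passes.

```python
import hashlib
import posixpath
from urllib.parse import urlsplit


def name_from_last_segment(url):
    """Use the last path segment of the URL as the file name."""
    return posixpath.basename(urlsplit(url).path)


def name_from_url_hash(url, extension=".jpg"):
    """Use the SHA-1 hex digest of the URL as the file name."""
    return hashlib.sha1(url.encode("utf-8")).hexdigest() + extension


def successful_paths(results):
    """Keep the 'path' of every (ok, info) pair whose download succeeded."""
    return [info["path"] for ok, info in results if ok]
```

Note one design difference: urlsplit drops any query string after a ?, whereas splitting the raw URL on / keeps it in the name. For the example URL in the comments above the two approaches give the same result, because that URL has no ? in it.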

Finally, register the pipeline in the settings file.

The important settings in settings.py:

settings.py

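# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# The module prefix matches the baidu_crawler package imported in the spider
DOWNLOADER_MIDDLEWARES = {
   'baidu_crawler.middlewares.BaiduCrawlerDownloaderMiddleware': 543,
}
# Only the custom pipeline is registered: it already inherits from ImagesPipeline,
# and also enabling the stock ImagesPipeline would request and store every image again
ITEM_PIPELINES = {
   'baidu_crawler.pipelines.BaiduCrawlerDownloadPipeline': 100,
}
# Directory where images are stored
IMAGES_STORE = './data/'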
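
The integer attached to each ITEM_PIPELINES entry controls ordering: items pass through the pipelines from the lowest value to the highest. A minimal sketch of that rule, in which both dictionary entries and the helper pipeline_order are hypothetical and exist only to illustrate the sorting:

```python
# Hypothetical entries; only the ordering rule is being illustrated
ITEM_PIPELINES = {
    "myproject.pipelines.ExportPipeline": 300,
    "myproject.pipelines.DownloadPipeline": 100,
}


def pipeline_order(pipelines):
    """Return pipeline paths sorted by their integer value, lowest first."""
    return [path for path, _ in sorted(pipelines.items(), key=lambda kv: kv[1])]
```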

If you need a proxy, set it on the request in middlewares.py.

middlewares.py

from scrapy import signals


class BaiduCrawlerDownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.
        current_ip = 'your-proxy-ip:port'  # placeholder: replace with your proxy's ip:port
        current_ip = 'http://' + current_ip 
        request.meta['proxy'] = current_ip
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
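
The proxy step in process_request amounts to putting a URL with a scheme into the request's meta under the proxy key. A small sketch of that normalization follows; proxy_url is a hypothetical helper, a plain dict stands in for request.meta, and 127.0.0.1:8080 is a placeholder address.

```python
def proxy_url(address, scheme="http"):
    """Prefix `address` with a scheme unless it already has one."""
    address = address.strip()
    if "://" in address:
        return address
    return f"{scheme}://{address}"


# A plain dict stands in for request.meta in this sketch
meta = {}
meta["proxy"] = proxy_url("127.0.0.1:8080")  # placeholder address
```

Checking for an existing scheme avoids producing values such as http://http://... when the configured address already includes one.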

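Summary

1. Send requests in the spider and parse the responses.

2. Extract the image URLs from the responses.

3. Pass the image URLs to the pipeline through the item that was defined.

4. The pipeline inherits from ImagesPipeline and overrides the request and naming methods as needed.

That covers this post. In fact, the final download step in ImagesPipeline can also be customized by overriding it. Just as when we use the requests library directly, Scrapy downloads an image by opening a new local file in 'wb' mode and writing the fetched image bytes to it. The details will be covered in the next post.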
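
The "open in 'wb' mode and write" step mentioned above looks roughly like this in plain Python. This is a sketch of the general technique, not Scrapy's own code: save_image_bytes is a hypothetical helper, and the bytes written are placeholder data rather than a real image.

```python
import os
import tempfile


def save_image_bytes(data, directory, name):
    """Write `data` to directory/name in binary mode and return the full path."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, name)
    with open(path, "wb") as f:
        f.write(data)
    return path


# Round trip in a temporary directory so nothing is left behind
with tempfile.TemporaryDirectory() as tmp:
    saved = save_image_bytes(b"placeholder image bytes", tmp, "demo.bin")
    with open(saved, "rb") as f:
        roundtrip = f.read()
```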
