Downloading images with Scrapy's built-in ImagesPipeline

Example code

# settings.py
import os

# Enable Scrapy's built-in image download pipeline
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
IMAGES_URLS_FIELD = "front_image_url"    # item field that holds the image download URLs
project_dir = os.path.abspath(os.path.dirname(__file__))
IMAGES_STORE = os.path.join(project_dir, 'images')   # directory where downloaded images are stored

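Notes

The default ImagesPipeline expects the image URL field to be a list. If the field holds anything else, such as a plain string, the download fails with an error.
The default ImagesPipeline may also fail with ModuleNotFoundError: No module named 'PIL'. The Pillow library is missing; install it with pip install pillow.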
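The list requirement above can be met by normalizing the value before it is stored on the item. The following is a minimal sketch, not part of Scrapy: the helper name as_url_list is hypothetical, and it assumes the extracted value is either None, a single string, or already a list of strings.

```python
def as_url_list(value):
    """Return the extracted image URL(s) as a list, as ImagesPipeline expects.

    Hypothetical helper: None or an empty string becomes an empty list,
    a single string is wrapped in a list, and a list/tuple is copied as a list.
    """
    if value is None or value == "":
        return []
    if isinstance(value, str):
        return [value]
    return list(value)


# Example with a made-up URL: a single string is wrapped in a list.
print(as_url_list("https://example.com/cover.jpg"))
```

In a spider this might be used as item["front_image_url"] = as_url_list(extracted_value), so that the pipeline always receives a list, possibly an empty one.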

Exceptions encountered

The image URL field is empty, and some image URLs are malformed

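Solution: override ImagesPipeline

Some articles have no cover image, so there is no image URL to extract. The corresponding item field would then be None, but I set the default for a missing match to ''. Scrapy's built-in ImagesPipeline, however, assumes that the image URL field of every item holds valid, reachable image URLs, so downloading raises an exception for items that have none. The fix is to define a custom ImagesPipeline and override its get_media_requests() method. This method iterates over the image URL field of the item and generates the requests that the Scrapy engine then downloads. I added a check for '' to it.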

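from scrapy.http import Request
from scrapy.pipelines.images import ImagesPipeline


class CustomImagePipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        # Override ImagesPipeline.get_media_requests.
        # Skip empty URLs before downloading. Request raises ValueError for a URL
        # without a scheme, so retry with 'https:' and then 'http:' prepended.
        for url in item['cover_image_url']:
            if url != '':
                try:
                    yield Request(url=url)
                except ValueError:
                    try:
                        yield Request(url='https:' + url)
                    except ValueError:
                        yield Request(url='http:' + url)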
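An alternative to retrying with different prefixes is to normalize each URL before a request is built. The sketch below uses only the standard library's urllib.parse.urljoin. The function name normalize_image_url and the page_url argument (the URL of the page the image link was scraped from) are assumptions for illustration, not something the pipeline provides.

```python
from urllib.parse import urljoin


def normalize_image_url(url, page_url):
    """Resolve an image URL against the page it was found on.

    Hypothetical helper: returns None for an empty value so the caller can skip it.
    Protocol-relative ('//host/a.jpg') and relative ('/a.jpg') URLs are resolved
    against page_url; absolute URLs are returned unchanged.
    """
    if not url or not url.strip():
        return None
    return urljoin(page_url, url.strip())


# Example with made-up URLs: a protocol-relative link takes the page's scheme.
print(normalize_image_url("//img.example.com/a.jpg", "https://example.com/post/1"))
```

Inside a spider callback, response.urljoin(url) performs the same resolution against the response URL, so the normalization can also be done when the item is populated rather than in the pipeline.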

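Note

The Request yielded in get_media_requests() is imported with from scrapy.http import Request. This is the same class that spiders normally use as scrapy.Request, so either import works. The old scrapy.spider module path is deprecated and should not be relied on.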

Custom ImagesPipeline

Example code

from scrapy.pipelines.images import ImagesPipeline
from scrapy.http import Request
from scrapy.exceptions import DropItem

 
class CustomImagePipeline(ImagesPipeline):

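    def file_path(self, request, response=None, info=None, *, item=None):
        """
        Override ImagesPipeline.file_path.
        By default a downloaded image is named after a hash of its URL;
        this override keeps the original file name instead.
        (The keyword-only item argument is passed by newer Scrapy versions.)
        :return: image path, relative to IMAGES_STORE
        """
        image_guid = request.url.split('/')[-1]  # file name taken from the original URL
        return 'full/%s' % image_guid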

    def get_media_requests(self, item, info):
        """
        Override ImagesPipeline.get_media_requests.
        Before downloading, check whether each URL in the item field is '';
        empty URLs are skipped.
        """
        for url in item['cover_image_url']:
            if url != '':
                yield Request(url=url)

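    def item_completed(self, results, item, info):
        """
        Store the local paths of the downloaded images in item['image_paths'].
        :param results: download results, a list of 2-tuples (success, image_info_or_failure).
            The first element says whether the download succeeded.
            If success is True, the second element is a dict with the keys
            url (original URL), path (local storage path) and checksum;
            otherwise it holds the failure details.
        :param item: the item being processed
        :param info: the pipeline's internal media info
        :return: the item, with image_paths filled in
        """
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")  # drop the item if no image was downloaded
        item['image_paths'] = image_paths
        return item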
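To make the shape of results more concrete, here is a small standalone sketch of the filtering step that item_completed performs. The sample data is made up for illustration: the URLs, paths and checksum are placeholders, and a plain RuntimeError stands in for the failure object that Scrapy would pass for a failed download.

```python
def successful_paths(results):
    """Return the local paths of the images that downloaded successfully."""
    return [info["path"] for ok, info in results if ok]


# Made-up sample in the (success, image_info_or_failure) shape described above.
sample_results = [
    (True, {"url": "https://example.com/a.jpg",
            "path": "full/a.jpg",
            "checksum": "placeholder-checksum"}),
    (False, RuntimeError("download failed")),  # stand-in for Scrapy's failure object
]

print(successful_paths(sample_results))
```

Only the first entry contributes a path. If every entry had failed, the list would be empty, which is the case where the pipeline above raises DropItem.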
