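Python: scraping images with Scrapy. The images in this example come from a pin-up photo site. The code is commented in detail, so I will keep the prose short.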

1. Create the project

scrapy startproject beautifulgirl

2. Create your own spider file in the spiders directory


3. Define the item

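import scrapy

# Item that carries one gallery page's data from the spider to the image pipeline
class BeautifulgirlItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    image_name = scrapy.Field()    # gallery title, used later as the folder name
    image_urls = scrapy.Field()    # URLs of the images to download
    images = scrapy.Field()        # download results
    referer = scrapy.Field()       # URL of the page the images were found on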

First, inspect the target page in the browser.

Find the CSS class of the element that contains the images.


4. Write the spider

import scrapy
from beautifulgirl.items import BeautifulgirlItem

class ImgspiderSpider(scrapy.Spider):
    name = 'girl'                                  # unique spider name, used to start the crawl
    # Domains the spider may crawl. With OffsiteMiddleware enabled (the default),
    # requests to any domain outside this list are filtered out.
    allowed_domains = ['www.mm131.net']
    # Start URLs: the category links from the home page's navigation row
    start_urls = ['https://www.mm131.net/xinggan/',
                  'https://www.mm131.net/qingchun/',
                  'https://www.mm131.net/xiaohua/',
                  'https://www.mm131.net/chemo/',
                  'https://www.mm131.net/qipao/',
                  'https://www.mm131.net/mingxing/'
        ]

    def parse(self, response):
        # Callback for the category list pages.
        # :not(.page) excludes the pagination block, so this matches every gallery entry on the page.
        # (Renamed from `list`, which shadowed the built-in.)
        galleries = response.css('.main dd:not(.page)')
        for gallery in galleries:
            # extract_first() returns the first match, or None if nothing matched
            gallery_url = gallery.css('a::attr(href)').extract_first()
            if gallery_url is None:                # skip entries without a link instead of requesting 'None'
                continue
            # callback names the method that will handle the response to this request
            yield response.follow(gallery_url, callback=self.downloadImage)
        # Next list page: :nth-last-child(2) counts from the parent's last child, so this is the
        # second-to-last link. It is looked up once per page, outside the loop.
        next_page = response.css('.page-en:nth-last-child(2)::attr(href)').extract_first()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

    def downloadImage(self, response):
        # Callback for a gallery page: the response to the request yielded in parse()
        item = BeautifulgirlItem()
        item['image_name'] = response.css('.content h5::text').extract_first()        # gallery title
        # extract() serializes every matched node to a string and returns a list
        item['image_urls'] = response.css('.content-pic img::attr(src)').extract()    # image URLs
        # Keep the page URL so the pipeline can send it as the Referer header,
        # which is used to validate the image request
        item['referer'] = response.url
        yield item
        next_url = response.css('.page-ch:last-child::attr(href)').extract_first()    # next page of the gallery
        if next_url is not None:
            # Not on the last page yet, so request the next one
            yield response.follow(next_url, callback=self.downloadImage, dont_filter=True)
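
Both callbacks hand an `href` to `response.follow`, and that value may be relative, absolute, or missing altogether. The sketch below shows roughly what that resolution step amounts to, using only the standard library's `urljoin`. The helper name `resolve_link` and the example.com URLs are made up for illustration; this is not Scrapy's own implementation.

```python
from urllib.parse import urljoin


def resolve_link(page_url, href):
    """Return an absolute URL for href, or None when the link is missing."""
    if href is None:          # extract_first() returns None when nothing matched
        return None
    return urljoin(page_url, href)


# Hypothetical URLs, for illustration only
page = 'https://example.com/gallery/'
print(resolve_link(page, 'list_2.html'))                       # relative link
print(resolve_link(page, 'https://example.com/other/1.html'))  # already absolute
print(resolve_link(page, None))                                # no link found
```

Checking for `None` before building a request matters because `str(None)` is the string `'None'`, which is not a usable URL.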

5. Write the pipeline that groups the downloaded images

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import re

from scrapy.exceptions import DropItem
from scrapy.http import Request
from scrapy.pipelines.images import ImagesPipeline


class My_Pipeline(ImagesPipeline):
    # Subclass ImagesPipeline and override its hooks

    def get_media_requests(self, item, info):
        # Build one download request per image. `item` is the item yielded by the spider;
        # `info` is the pipeline's own bookkeeping object and is not used here.
        for image_url in item['image_urls']:
            # Pass the gallery name along in meta and send the page URL as the Referer header
            yield Request(image_url, meta={'name': item['image_name']},
                          headers={'referer': item['referer']})

    def file_path(self, request, response=None, info=None, *, item=None):
        # Decide where each image is stored, grouping images by gallery.
        # The keyword-only `item` argument is passed by newer Scrapy releases.
        img_name = request.url.split('/')[-1]                    # last URL segment is the file name
        name = request.meta.get('name') or 'unnamed'             # gallery name from meta; fallback if missing
        # Strip characters Windows does not allow in file names (plus parentheses and digits);
        # otherwise the folder cannot be created correctly
        name = re.sub(r'[?\\*|"<>:/()0123456789]', '', name)
        filename = u'pic/{0}/{1}'.format(name, img_name)         # one folder per gallery, easier to browse
        return filename

    def item_completed(self, results, item, info):
        # Called once all image requests for an item have finished, whether they succeeded or failed.
        # `results` is a list of (ok, info) pairs; on success the second element contains the saved path.
        # The comprehension is shorthand for:
        #     for ok, x in results:
        #         if ok:
        #             print(x['path'])
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem('Item contains no images')
        item['image_urls'] = image_paths                         # store the saved paths in the item
        return item                                              # the item still holds everything the spider scraped
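
The naming and filtering logic in the pipeline can be tried without running a crawl. The following is a minimal standalone sketch, assuming the same character class and `pic/<gallery>/<file>` layout as `file_path`, and the same `(ok, info)` pair shape as `item_completed`. The function names `build_path` and `successful_paths`, the sample title, and the sample URL are hypothetical, and a plain `Exception` stands in for the failure object that a failed download would carry.

```python
import re

# Same character class as in file_path: characters Windows rejects
# in file names, plus parentheses and digits.
UNSAFE = re.compile(r'[?\\*|"<>:/()0123456789]')


def build_path(gallery_name, image_url):
    """Return 'pic/<cleaned gallery name>/<image file name>'."""
    file_name = image_url.split('/')[-1]       # last URL segment
    folder = UNSAFE.sub('', gallery_name)      # drop unsafe characters
    return 'pic/{0}/{1}'.format(folder, file_name)


def successful_paths(results):
    """Keep the 'path' of every (ok, info) pair whose download succeeded."""
    return [info['path'] for ok, info in results if ok]


# Hypothetical inputs, for illustration only
print(build_path('Sample: Title(12)', 'https://example.com/pic/123/1.jpg'))
results = [
    (True, {'path': 'pic/Sample Title/1.jpg'}),
    (False, Exception('download failed')),     # stand-in for a failed download
]
print(successful_paths(results))
```

Note that the pattern also removes digits, so two galleries whose titles differ only by a number end up in the same folder.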

6. Finally, configure the settings file

BOT_NAME = 'beautifulgirl'

SPIDER_MODULES = ['beautifulgirl.spiders']
NEWSPIDER_MODULE = 'beautifulgirl.spiders'

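IMAGES_STORE = r'D:\office'  # directory where images are saved (raw string keeps the backslash literal)
IMAGES_EXPIRES = 90          # images downloaded within the last 90 days are not downloaded again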
ITEM_PIPELINES = {
    'beautifulgirl.pipelines.My_Pipeline': 300,
}
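
A Windows path such as the one in `IMAGES_STORE` contains backslashes, which Python treats as the start of an escape sequence in ordinary string literals. `'D:\office'` happens to survive because `\o` is not a recognised escape, although recent Python versions warn about it. The sketch below uses a made-up path, chosen because `\n` is a real escape, to show the difference a raw string makes.

```python
# Hypothetical path, chosen because "\n" is a recognised escape sequence
plain = 'D:\new_images'      # "\n" is turned into a newline character
raw = r'D:\new_images'       # raw string: the backslash stays a backslash

print(repr(plain))
print(repr(raw))
print(len(plain), len(raw))  # the raw string is one character longer
```

Writing the path with forward slashes or as a raw string avoids the problem.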

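After the crawl finishes, you can see that the files have been downloaded.

The image names are a little too sensitive, so I have blurred them out. The images themselves are decent 😜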
