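A Brief Guide to Using the Scrapy Crawler Framework

All examples in this article use Scrapy 2.2.0 and Python 3.7, with PyCharm as the development tool. The learning material comes from Bilibili.

The project code for this article is on Baidu Netdisk: https://pan.baidu.com/s/1jP6ONSD7paXkesNRppO2kw

Extraction code: 7hao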

I. Introduction to Scrapy:

1. The architecture diagram of the Scrapy framework is shown below

[Figure: Scrapy architecture diagram]

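2. What each component does

(1) Engine (Scrapy Engine)

Handles communication, signals, and data transfer among the Spider, Item Pipeline, Downloader, and Scheduler.

(2) Scheduler

Accepts the requests sent by the engine, pushes them onto a queue, and hands them back when the engine asks for the next one. Think of it as a priority queue of URLs: it decides which URL is crawled next and also filters out duplicate URLs.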
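The scheduler's two jobs, ordering and de-duplication, can be illustrated with a toy queue built from Python's standard library. This is only a sketch: the class ToyScheduler, its methods, and the /teachers URL are hypothetical and are not Scrapy's scheduler API. It compares raw URL strings, whereas Scrapy fingerprints whole requests. As with Scrapy's Request priority, a higher number is served earlier.

```python
import heapq
import itertools


class ToyScheduler:
    """Toy URL queue: highest priority first, duplicates dropped."""

    def __init__(self):
        self._heap = []                  # entries are (-priority, order, url)
        self._seen = set()               # URLs that were already enqueued
        self._order = itertools.count()  # tie-breaker keeps insertion order

    def enqueue(self, url, priority=0):
        """Queue a URL; return False if it was seen before."""
        if url in self._seen:
            return False
        self._seen.add(url)
        # heapq is a min-heap, so negate the priority to pop the largest first.
        heapq.heappush(self._heap, (-priority, next(self._order), url))
        return True

    def next_url(self):
        """Return the next URL to crawl, or None when the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


scheduler = ToyScheduler()
scheduler.enqueue("http://itcast.cn/", priority=0)
scheduler.enqueue("http://itcast.cn/teachers", priority=5)  # hypothetical URL
duplicate_accepted = scheduler.enqueue("http://itcast.cn/", priority=9)
crawl_order = [scheduler.next_url(), scheduler.next_url(), scheduler.next_url()]
```

The duplicate is rejected even though it arrives with a higher priority, and the queue returns None once both distinct URLs have been handed out.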

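(3) Downloader

Downloads page content and returns it to the Engine. The downloader is built on Twisted, an efficient asynchronous model.

(4) Spiders

Spiders are classes defined by the developer. They parse responses and either extract items or issue new requests.

(5) Item Pipelines

Process items after they have been extracted, mainly cleaning, validation, and persistence (for example, saving to a database).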

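(6) Downloader Middlewares

Downloader middlewares sit between the Scrapy engine and the downloader. They mainly process requests passing from the Engine to the Downloader, as well as responses passing from the Downloader back to the Engine.

You can use this middleware to do the following: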

① process a request just before it is sent to the Downloader (i.e. right before Scrapy sends the request to the website);

② change received response before passing it to a spider;

③ send a new Request instead of passing received response to a spider;

④ pass response to a spider without fetching a web page;

⑤ silently drop some requests.
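The first item in the list, touching a request just before it reaches the Downloader, might look like the sketch below. The class name DefaultUserAgentMiddleware, the user-agent string, and the stand-in FakeRequest are all hypothetical; in a real project the method would receive Scrapy's own Request object.

```python
class DefaultUserAgentMiddleware:
    """Hypothetical downloader middleware that fills in a User-Agent header."""

    USER_AGENT = "Mozilla/5.0 (compatible; HeimaTeacherBot)"  # made-up value

    def process_request(self, request, spider):
        # Only set the header when the request does not already carry one.
        request.headers.setdefault("User-Agent", self.USER_AGENT)
        # Returning None lets the request continue on to the Downloader.
        return None


class FakeRequest:
    """Stand-in for scrapy.Request so the sketch runs without Scrapy."""

    def __init__(self, url, headers=None):
        self.url = url
        self.headers = dict(headers or {})


middleware = DefaultUserAgentMiddleware()
plain = FakeRequest("http://itcast.cn/")
custom = FakeRequest("http://itcast.cn/", headers={"User-Agent": "my-agent"})
result = middleware.process_request(plain, spider=None)
middleware.process_request(custom, spider=None)
```

After these calls, plain carries the default header, while custom keeps the value it was created with.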

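II. Using Scrapy:

1. Installation:

pip install scrapy (or Scrapy; the case of the first letter does not matter)

2. Generate the Scrapy project directory with a command.

scrapy startproject HeimaTeacher

Here scrapy startproject is fixed, and HeimaTeacher is the project directory name, which you choose yourself.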

Running the command:

[Screenshot: running scrapy startproject]

Result after the command finishes:

[Screenshot: result of the command]

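3. Create the spider file:

In the HeimaTeacher directory, run: scrapy genspider heimateacher "itcast.cn"

Here scrapy genspider is fixed syntax. heimateacher is the spider name, which is needed later to run the crawl; it is a required argument that you choose yourself, and it can also be changed in the file after it is generated. "itcast.cn" is the domain to crawl; it is required in the command, but once the file is generated it can be changed or left out.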


4. Walkthrough of the important files in the project:

(1) heimateacher.py:

import scrapy

from ..items import HeimateacherItem


class HeimateacherSpider(scrapy.Spider):
    name = 'heimateacher'               # spider name, used by "scrapy crawl"
    allowed_domains = ['itcast.cn']     # domains the spider may crawl
    start_urls = ['http://itcast.cn/']  # URLs the crawl starts from

    def parse(self, response):
        # Each matching div holds the details of one teacher
        teacher_list = response.xpath("//div[@class='main_bot']")
        for teacher in teacher_list:
            item = HeimateacherItem()
            # extract() returns a list of strings; [0] takes the first match
            teacher_name = teacher.xpath("h2/text()").extract()[0]
            info = teacher.xpath("h2/span/text()").extract()[0]
            # The remaining fields are optional: fall back to "" when a node is missing
            spans = teacher.xpath("h3/span/text()").extract()
            des1 = spans[0] if len(spans) > 0 else ""
            des2 = spans[1] if len(spans) > 1 else ""
            result_des_texts = teacher.xpath("p/text()").extract()
            result_des = result_des_texts[0].strip() if result_des_texts else ""
            result_texts = teacher.xpath("p/span/text()").extract()
            result = result_texts[0].strip() if result_texts else ""
            item["name"] = teacher_name
            item["info"] = info
            item["des1"] = des1
            item["des2"] = des2
            item["result_des"] = result_des
            item["result"] = result
            yield item  # hand the item to the pipelines, then continue the loop

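① This file is created by running the command scrapy genspider heimateacher "itcast.cn".

② name is the spider name, which is needed to start the crawl. allowed_domains lists the domains the spider is allowed to crawl and is optional. start_urls holds the URLs to crawl; it is stored as a list, so several URLs can be crawled.

③ parse contains the logic that parses the downloaded data. There are three ways to parse the data in response: XPath, CSS selectors, and regular expressions. Which one to use is a matter of personal preference.

④ HeimateacherItem is the class from items.py that defines the fields of the scraped data.

⑤ response.xpath() returns a list of selector objects. Call .extract() to convert them to data, and use an index to pick a particular element.

⑥ Note that results must be handed back with yield rather than return: yield hands back a value just as return does, but the for loop then carries on. To use return, you would have to create a list variable outside the for loop, append each object to it inside the loop, and return the list at the end. The drawback of that approach is that it uses more memory when the data volume is large.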
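The difference between yield and return described in point ⑥ can be seen without Scrapy. In the sketch below, plain strings stand in for the teacher nodes, and the function names parse_with_yield and parse_with_return are hypothetical.

```python
def parse_with_yield(rows):
    """Hand items back one at a time; nothing is accumulated."""
    for row in rows:
        yield {"name": row}


def parse_with_return(rows):
    """Collect every item in a list first, then return the whole list."""
    items = []
    for row in rows:
        items.append({"name": row})
    return items


rows = ["teacher A", "teacher B", "teacher C"]  # made-up sample data

stream = parse_with_yield(rows)
first_item = next(stream)   # only the first item has been built so far
remaining = list(stream)    # the loop resumes where it paused

all_items = parse_with_return(rows)
```

Both versions produce the same items. The generator builds each one only when it is asked for, while the list version holds all of them in memory before returning.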

(2) items.py:

# Define here the models for your scraped items

# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy

class HeimateacherItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()

    # Fields needed for the scraped data
    name = scrapy.Field()
    level = scrapy.Field()
    info = scrapy.Field()
    des1 = scrapy.Field()
    des2 = scrapy.Field()
    result_des = scrapy.Field()
    result = scrapy.Field()

① The fields defined in this file must match the ones used in heimateacher.py; they are used when the data is saved.

(3) pipelines.py:

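import json


class HeimateacherPipeline:

    def __init__(self):
        # Open the output file once, when the pipeline is created
        self.file = open("pip_json_data.json", "wb")

    def process_item(self, item, spider):
        # Write each item as one line of JSON, keeping non-ASCII text readable
        content = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(content.encode("utf-8"))
        return item

    def close_spider(self, spider):
        # Release the file handle when the crawl ends
        self.file.close()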

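① This file processes the scraped data.

② __init__(self) sets up initial state and runs only once, for example to open the file to write to. close_spider releases resources when the crawl ends, for example by closing the file.

③ The item argument of process_item is the data handed back by parse. The pipeline receives the items one after another, as a stream, each time parse yields one.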
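To make the call order concrete, the sketch below drives a pipeline by hand in the order Scrapy calls its hooks. It is a hypothetical variant of the class above, named JsonLinesPipeline here, that opens the file in open_spider (a hook Scrapy also supports) instead of __init__. Plain dicts stand in for items, and the temporary output path is used only for this demonstration.

```python
import json
import os
import tempfile


class JsonLinesPipeline:
    """Hypothetical variant of the pipeline above: one JSON object per line."""

    def __init__(self, path="pip_json_data.json"):
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called once when the crawl starts.
        self.file = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # Called once for every item the spider yields.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item  # pass the item on to the next pipeline, if any

    def close_spider(self, spider):
        # Called once when the crawl ends.
        self.file.close()


# Drive the hooks by hand, in the order Scrapy would call them.
path = os.path.join(tempfile.mkdtemp(), "pip_json_data.json")
pipeline = JsonLinesPipeline(path)
pipeline.open_spider(spider=None)
for item in [{"name": "teacher A"}, {"name": "teacher B"}]:
    pipeline.process_item(item, spider=None)
pipeline.close_spider(spider=None)

with open(path, encoding="utf-8") as f:
    lines = [json.loads(line) for line in f]
```

Opening the file in text mode with encoding="utf-8" avoids the manual encode step; whether that is preferable to the binary-mode original is a matter of taste.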

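(4) settings.py:

This file is Scrapy's configuration file.

① Be sure to change ROBOTSTXT_OBEY = True to ROBOTSTXT_OBEY = False. This setting controls whether Scrapy obeys each site's robots.txt. While it is enabled, Scrapy skips every URL that robots.txt disallows, so much of the content on many sites cannot be crawled.

② ITEM_PIPELINES lists the classes from pipelines.py, and more than one can be configured. The number after each class sets the execution order: the smaller the number, the higher the priority and the earlier the class runs.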
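Taken together, the two settings above might look like the fragment below. The dotted path assumes the project package is named HeimaTeacher and that the pipeline class lives in its pipelines.py. The value 300 is only a conventional mid-range number, and the commented-out second entry is purely hypothetical.

```python
# settings.py (fragment)

# Do not filter requests based on each site's robots.txt.
ROBOTSTXT_OBEY = False

# Pipelines run in ascending order of their number (commonly chosen from 0 to 1000).
ITEM_PIPELINES = {
    "HeimaTeacher.pipelines.HeimateacherPipeline": 300,
    # "HeimaTeacher.pipelines.SomeLaterPipeline": 400,  # hypothetical; would run second
}
```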

(5) middlewares.py:

This file can be used to configure browser request headers and proxy IPs.
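As an illustration of the proxy case, a downloader middleware could pick a proxy for every request, roughly as sketched below. The class name RandomProxyMiddleware, the proxy addresses (taken from a documentation-only IP range), and the stand-in FakeRequest are hypothetical. In a real project the class would also have to be enabled in the DOWNLOADER_MIDDLEWARES setting.

```python
import random


class RandomProxyMiddleware:
    """Hypothetical downloader middleware that picks a proxy per request."""

    # Placeholder addresses from a documentation-only range; replace with real proxies.
    PROXIES = ["http://192.0.2.10:8080", "http://192.0.2.11:8080"]

    def process_request(self, request, spider):
        # Scrapy's built-in HttpProxyMiddleware reads the proxy from request.meta["proxy"].
        request.meta["proxy"] = random.choice(self.PROXIES)
        return None  # continue normal processing


class FakeRequest:
    """Stand-in for scrapy.Request so the sketch runs without Scrapy."""

    def __init__(self, url):
        self.url = url
        self.meta = {}


request = FakeRequest("http://itcast.cn/")
RandomProxyMiddleware().process_request(request, spider=None)
```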

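5. Running the project:

Option 1: run it from the command line. Go to the HeimaTeacher directory and run: scrapy crawl heimateacher

Option 2: create an executable file run.py, place it in the HeimaTeacher directory, and set the spider name in it. The code in run.py is as follows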

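import os
import sys

from scrapy.cmdline import execute

# Add this file's directory to the import path so the project package can be found
project_dir = os.path.dirname(os.path.abspath(__file__))
sys.path.append(project_dir)
print(project_dir)
# "heimateacher" is your own spider's name, i.e. the name attribute of the spider class described above
execute(["scrapy", "crawl", "heimateacher"])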

Running run.py produces the scraped data. The data file pip_json_data.json is written to the directory from which the project is run.
