
Scrapy_Study01

Scrapy

The crawling workflow of the scrapy framework


A brief introduction to the components of the scrapy framework

As for the components in the workflow above: they have no direct connection with one another; all data is passed between them by the scrapy engine. The engine is already implemented by the framework, while the spider and the pipeline usually have to be written by hand; for complex crawling projects, custom downloader and spider middlewares can also be written to meet more complicated business needs.


Basic usage of the scrapy framework

After installing the scrapy package, the following commands can be entered directly in a terminal:

  1. Create a scrapy project

scrapy startproject myspider

  2. Generate a spider

scrapy genspider itcast itcast.cn

  3. Extract the data

Flesh out the spider, using xpath and so on

  4. Save the data

Handle it in the pipeline

  5. Start the spider

scrapy crawl itcast

The basic workflow of using the scrapy framework

  1. Create a scrapy project; a set of py files and configuration files is generated automatically
  2. Create a spider with a custom name and (optionally) the domain to crawl
  3. Write code to flesh out the custom spider and implement the desired behaviour
  4. Use yield to hand the parsed data over to the pipeline
  5. Use the pipeline to store the data (to operate on data in a pipeline it must be enabled in settings.py; it is disabled by default — see the sketch below)
  6. A few things to note when using pipelines
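
As a minimal sketch of steps 4 to 6 (the names ItcastSpider and MyspiderPipeline and the teacher-list URL are illustrative placeholders, not taken from the original project):

# myspider/spiders/itcast.py
import scrapy

class ItcastSpider(scrapy.Spider):
    name = 'itcast'
    allowed_domains = ['itcast.cn']
    start_urls = ['http://www.itcast.cn/channel/teacher.shtml']

    def parse(self, response):
        for div in response.xpath("//div[@class='li_txt']"):
            item = {"name": div.xpath("./h3/text()").extract_first()}
            yield item  # hand the item to the engine, which passes it to the pipeline

# myspider/pipelines.py
class MyspiderPipeline:
    def process_item(self, item, spider):
        print(item)  # store or print the data here
        return item

# settings.py - the pipeline must be enabled here, otherwise process_item is never called
ITEM_PIPELINES = {
    'myspider.pipelines.MyspiderPipeline': 300,
}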

Using the logging module

In scrapy:

Set LOG_LEVEL = "WARNING" in settings

Set LOG_FILE = "./a.log" in settings  # sets the path and file name of the log file; log output is then no longer shown in the terminal

import logging and instantiate a logger; that logger can then be used in any file to write log output

In an ordinary (non-scrapy) project:

import logging

logging.basicConfig(...)  # set the style and format of the log output

Instantiate a logger: logger = logging.getLogger(__name__)

Then the logger can simply be called from any py file
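
A minimal sketch of both cases (the file names and log messages are illustrative):

# inside a scrapy project: settings.py
LOG_LEVEL = "WARNING"
LOG_FILE = "./a.log"

# inside a scrapy project: any module, e.g. a spider or pipeline
import logging
logger = logging.getLogger(__name__)
logger.warning("something worth logging")

# in an ordinary project
import logging
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(name)s %(levelname)s: %(message)s",
)
logger = logging.getLogger(__name__)
logger.info("hello from plain python")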

Implementing pagination requests in scrapy

Case study: crawling Tencent recruitment

Because the current mainstream trend is a separated front end and back end, a plain GET of the site only returns a pile of HTML tags with no data in them; the data shown on the page is fetched by JS from a back-end API and then stitched into the HTML. So instead of requesting the site address directly, use the Chrome developer tools to find the back-end API address the page calls, and request that address instead.

By comparing the query strings of the requests the site sends to the back-end API, the URL to request can be determined.

On the Tencent careers site, paging through the job listings is also done by calling the back-end API, so crawling page by page really just means requesting the same API with different query strings.

Spider code

import scrapy
import random
import json


class TencenthrSpider(scrapy.Spider):
    name = 'tencenthr'
    allowed_domains = ['tencent.com']
    start_urls = ['https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1614839354704&parentCategoryId=40001&pageIndex=1&pageSize=10&language=zh-cn&area=cn']
    # start_urls = "https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1614839354704&parentCategoryId=40001&pageIndex=1&pageSize=10&language=zh-cn&area=cn"

    def parse(self, response):
        # since this requests a back-end API, the response is JSON; take the text of the
        # response object and convert it to a dict for easier handling
        gr_list = response.text
        gr_dict = json.loads(gr_list)
        # pagination is just the pageIndex in the query string changing, so read the current index; the next request uses index + 1
        start_url = str(response.request.url)
        start_index = int(start_url.find("Index") + 6)
        mid_index = int(start_url.find("&", start_index))
        num_ = start_url[start_index:mid_index]
        # the returned JSON usually includes the total number of records; take it out here
        temp = gr_dict["Data"]["Count"]
        # define a dict to hold one record
        item = {}
        for i in range(10):
            # fill in the required fields by indexing into the dict
            item["Id"] = gr_dict["Data"]["Posts"][i]["PostId"]
            item["Name"] = gr_dict["Data"]["Posts"][i]["RecruitPostName"]
            item["Content"] = gr_dict["Data"]["Posts"][i]["Responsibility"]
            item["Url"] = "https://careers.tencent.com/jobdesc.html?postid=" + gr_dict["Data"]["Posts"][i]["PostId"]
            # 将item資料交給引擎
            yield item
        # next url
        # build the url of the next request; the timestamp in the url is just a random 13-digit number
        rand_num1 = random.randint(100000, 999999)
        rand_num2 = random.randint(1000000, 9999999)
        rand_num = str(rand_num1) + str(rand_num2)
        # work out the value of pageIndex
        nums = int(start_url[start_index:mid_index]) + 1
        if nums > int(temp)/10:
            pass
        else:
            nums = str(nums)
            next_url = 'https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=' + rand_num + '&parentCategoryId=40001&pageIndex=' + nums +'&pageSize=10&language=zh-cn&area=cn'
            # 将 下一次請求的url封裝成request對象傳遞給引擎
            yield scrapy.Request(next_url, callback=self.parse)
           

Pipeline code

import csv


class TencentPipeline:
    def process_item(self, item, spider):
        # append each collected record as one row of the csv file
        with open('./tencent_hr.csv', 'a+', encoding='utf-8') as file:
            fieldnames = ['Id', 'Name', 'Content', 'Url']
            writer = csv.DictWriter(file, fieldnames=fieldnames)
            # note: writing the header here repeats it for every item; writing it once in open_spider (shown later) is cleaner
            writer.writeheader()
            print(item)
            writer.writerow(item)
        return item
           

More on scrapy.Request

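The commonly used constructor arguments are url, callback, method, headers, cookies, meta, dont_filter and errback. As a minimal sketch inside a spider class (the url and item values are placeholders):

    def parse(self, response):
        item = {"Name": "placeholder"}
        yield scrapy.Request(
            "https://example.com/detail",       # address to request (placeholder)
            callback=self.parse_detail,         # parse function that handles the response
            meta={"item": item},                # pass data to the callback via response.meta
            dont_filter=False,                  # True skips the duplicate filter for this url
            headers={"Referer": response.url},  # extra request headers
        )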

Using scrapy items

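In short (a sketch; the Yangguang case below shows it in context): fields are declared with scrapy.Field() in a subclass of scrapy.Item, and the item is then used like a dict inside the spider.

import scrapy

class ExampleItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()

class ExampleSpider(scrapy.Spider):
    name = "example"

    def parse(self, response):
        item = ExampleItem()
        item["title"] = "some title"   # keys must be declared fields, otherwise a KeyError is raised
        yield item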

Case study: crawling complaint posts from the Sunshine (Yangguang) government affairs site

Crawling the Sunshine government affairs site: the Chrome developer tools show that the page data is rendered normally into the HTML, so crawling this site is just ordinary parsing of HTML tags.

Note, however, that the images and other details on each complaint's detail page also need to be crawled, so some care is needed when writing the parse methods in the spider.

Spider code

import scrapy
from yangguang.items import YangguangItem


class YangguanggovSpider(scrapy.Spider):
    name = 'yangguanggov'
    allowed_domains = ['sun0769.com']
    start_urls = ['http://wz.sun0769.com/political/index/politicsNewest?page=1']

    def parse(self, response):
        start_url = response.url
        # crawl and parse the data page by page
        li_list = response.xpath("/html/body/div[2]/div[3]/ul[2]")
        for li in li_list:
            # the item class defined in items.py carries the required fields
            item = YangguangItem()
            item["Id"] = str(li.xpath("./li/span[1]/text()").extract_first())
            item["State"] = str(li.xpath("./li/span[2]/text()").extract_first()).replace(" ", "").replace("\n", "")
            item["Content"] = str(li.xpath("./li/span[3]/a/text()").extract_first())
            item["Time"] = li.xpath("./li/span[5]/text()").extract_first()
            item["Link"] = "http://wz.sun0769.com" + str(li.xpath("./li/span[3]/a[1]/@href").extract_first())
            # visit the detail page of each complaint and handle it with parse_detail
            # pass the item on to parse_detail via scrapy's meta argument
            yield scrapy.Request(
                item["Link"],
                callback=self.parse_detail,
                meta={"item": item}
            )
        # request the next page
        start_url_page = int(str(start_url)[str(start_url).find("=")+1:]) + 1
        next_url = "http://wz.sun0769.com/political/index/politicsNewest?page=" + str(start_url_page)
        yield scrapy.Request(
            next_url,
            callback=self.parse
        )
    # parse the data on the detail page
    def parse_detail(self, response):
        item = response.meta["item"]
        item["Content_img"] = response.xpath("/html/body/div[3]/div[2]/div[2]/div[3]/img/@src")
        yield item
           

Items code

import scrapy

# define the required fields in the item class
class YangguangItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    Id = scrapy.Field()
    Link = scrapy.Field()
    State = scrapy.Field()
    Content = scrapy.Field()
    Time = scrapy.Field()
    Content_img = scrapy.Field()
           

Pipeline code

class YangguangPipeline:
    # simply print the collected data
    def process_item(self, item, spider):
        print(item)
        return item
           

Understanding scrapy's debug output


By reading the debug information the scrapy framework prints, you can see the order in which scrapy starts up, and when an error occurs it helps track down the problem.

Going deeper: the scrapy shell


With the scrapy shell you can try out and debug code without starting a spider; when you are not sure how an operation behaves, you can verify it in the shell first.
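
For example (the URL is just an illustration), start a shell against a page and poke at the response object:

scrapy shell "http://wz.sun0769.com/political/index/politicsNewest?page=1"

# objects such as request, response and the spider are then available in the shell:
response.status
response.url
response.xpath("//title/text()").extract_first()
fetch("http://wz.sun0769.com/political/index/politicsNewest?page=2")   # request another url in the same shell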

Going deeper: settings and pipelines

settings


An annotated tour of a scrapy project's settings file:

# Scrapy settings for yangguang project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html
# project name
BOT_NAME = 'yangguang'
# where the spider modules live
SPIDER_MODULES = ['yangguang.spiders']
# where newly generated spiders are placed
NEWSPIDER_MODULE = 'yangguang.spiders'
# log level of the output
LOG_LEVEL = 'WARNING'
# the user-agent header carried by every request
# Crawl responsibly by identifying yourself (and your website) on the user-agent
# USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.72 Safari/537.36 Edg/89.0.774.45'
# whether to obey the robots.txt protocol
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
# maximum number of concurrent requests
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
# delay between successive requests
#DOWNLOAD_DELAY = 3
# the following two are rarely used
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# whether cookies are enabled (enabled by default)
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# whether the telnet console component is enabled
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# default request headers; the user-agent should not also be placed here
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}
# enable or disable spider middlewares
# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'yangguang.middlewares.YangguangSpiderMiddleware': 543,
#}
# enable or disable downloader middlewares
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'yangguang.middlewares.YangguangDownloaderMiddleware': 543,
#}
#
# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}
# whether item pipelines are enabled
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'yangguang.pipelines.YangguangPipeline': 300,
}
# AutoThrottle (automatic rate limiting) settings
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# HTTP cache settings
# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
           

Pipelines

A pipeline does not only have the process_item method generated when the project is created; it can also define open_spider and close_spider methods, which are executed once when the spider starts and once when it finishes, respectively.

Example code:

class YangguangPipeline:
    def process_item(self, item, spider):
        print(item)
        # without the return, a pipeline with a lower priority would never receive this item
        return item

    def open_spider(self, spider):
        # executed once, when the spider is opened
        spider.test = "hello"
        # this adds an attribute to the spider; it can then be used in process_item or inside the spider itself

    def close_spider(self, spider):
        # executed once, when the spider is closed
        spider.test = ""
           

A note on MongoDB

MongoDB is operated through the third-party pymongo package.

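A minimal sketch of storing items from a pipeline with pymongo (the host, database and collection names are placeholders):

from pymongo import MongoClient


class MongoPipeline:
    def open_spider(self, spider):
        # connect once, when the spider starts
        self.client = MongoClient(host="127.0.0.1", port=27017)
        self.collection = self.client["spider_db"]["items"]

    def process_item(self, item, spider):
        # scrapy items behave like dicts, so insert a plain dict copy
        self.collection.insert_one(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()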

The crawlspider in scrapy

Command to generate a crawlspider:

scrapy genspider -t crawl spider_name domain_to_crawl

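The generated spider looks roughly like this (a sketch of the default crawl template; the class name and rule values are the template's own placeholders):

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class ExampleSpider(CrawlSpider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    rules = (
        # each Rule extracts urls from the response and decides how their responses are handled
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = {}
        return item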

Using the crawlspider

  • Create the spider: scrapy genspider -t crawl spider_name allow_domain
  • Specify start_url; the corresponding responses have urls extracted from them through the rules
  • Flesh out rules by adding Rule entries

Rule(LinkExtractor(allow=r'/web/site0/tab5240/info\d+.htm'), callback='parse_item'),

  • Things to note:

If a url address is incomplete, the crawlspider completes it automatically before making the request

Do not define a parse function; it has special work to do internally

callback: the responses of the urls extracted by the link extractor are handed to this callback

follow: whether the responses of the urls extracted by the link extractor keep being filtered through the rules

LinkExtractor link extractors:

With LinkExtractor you no longer have to pick out the urls you want and send the requests yourself; that work is handed to the extractor, which finds every url matching the rules on all crawled pages, so the crawl proceeds automatically. A brief introduction to the LinkExtractor class:

class scrapy.linkextractors.LinkExtractor(
    allow = (),
    deny = (),
    allow_domains = (),
    deny_domains = (),
    deny_extensions = None,
    restrict_xpaths = (),
    tags = ('a','area'),
    attrs = ('href',),
    canonicalize = True,
    unique = True,
    process_value = None
)
           

Main parameters:

  • allow: allowed urls. Every url matching this regular expression is extracted.
  • deny: forbidden urls. Every url matching this regular expression is skipped.
  • allow_domains: allowed domains. Only urls under the domains listed here are extracted.
  • deny_domains: forbidden domains. Urls under the domains listed here are never extracted.
  • restrict_xpaths: restricting xpaths. Filters links together with allow.

The Rule class:

The class that defines the crawler's rules. A brief introduction:

class scrapy.spiders.Rule(
    link_extractor, 
    callback = None, 
    cb_kwargs = None, 
    follow = None, 
    process_links = None, 
    process_request = None
)
           

Main parameters:

  • link_extractor: a LinkExtractor object that defines the crawling rules.
  • callback: the callback to run for urls that satisfy this rule. Because CrawlSpider uses parse as its own internal callback, do not override parse to use it as your callback.
  • follow: whether links extracted from the response under this rule should be followed further.
  • process_links: a function that receives the links obtained from link_extractor; it can be used to filter out links that should not be crawled.

Case study: crawling a joke site (xiaohua.zol.com.cn)

Analysing xiaohua.zol.com.cn shows that the page data is embedded directly in the HTML: requesting the site's domain returns HTML that already contains everything visible on the page, so the server's HTML response can be parsed directly.

Likewise, when paging through the data, the url of the next page is already embedded in the HTML, so a crawlspider makes it very convenient to extract the next-page url.

Spider code:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
import re
class XhzolSpider(CrawlSpider):
    name = 'xhzol'
    allowed_domains = ['xiaohua.zol.com.cn']
    start_urls = ['http://xiaohua.zol.com.cn/lengxiaohua/1.html']

    rules = (
        # this Rule extracts urls matching the regex from the response (incomplete urls are completed automatically);
        # callback names the function that handles the response, and follow controls whether urls extracted from
        # that response keep being requested through the rules
        Rule(LinkExtractor(allow=r'/lengxiaohua/\d+\.html'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = {}
        # item["title"] = response.xpath("/html/body/div[6]/div[1]/ul/li[1]/span/a/text()").extract_first()
        # print(re.findall("<span class='article-title'><a target='_blank' href='.*?\d+\.html'>(.*?)</a></span>", response.body.decode("gb18030"), re.S))
        # search for the joke titles with a regular expression
        for i in re.findall(r'<span class="article-title"><a target="_blank" href="/detail\d+/\d+\.html" target="_blank" rel="external nofollow" >(.*?)</a></span>', response.body.decode("gb18030"), re.S):
            item["titles"] = i
            yield item

        return item
           

Pipeline code:

class XiaohuaPipeline:
    def process_item(self, item, spider):
        print(item)
        return item
           

Simply print the items to check the result.

Case study: crawling penalty notices from the CBIRC (China Banking and Insurance Regulatory Commission) website

Analysis of the page shows that the concrete data is fetched by Ajax requests the page sends to a back-end API, which returns JSON that JS then injects into the HTML and renders. So do not request the site's domain directly; request the back-end API instead, and determine the url of the next page by comparing how the API's query string changes when paging.

Spider code:

import scrapy
import re
import json


class CbircSpider(scrapy.Spider):
    name = 'cbirc'
    allowed_domains = ['cbirc.gov.cn']
    start_urls = ['https://www.cbirc.gov.cn/']

    def parse(self, response):
        start_url = "http://www.cbirc.gov.cn/cbircweb/DocInfo/SelectDocByItemIdAndChild?itemId=4113&pageSize=18&pageIndex=1"
        yield scrapy.Request(
            start_url,
            callback=self.parse1
        )

    def parse1(self, response):

        # process the data
        json_data = response.body.decode()
        json_data = json.loads(json_data)
        for i in json_data["data"]["rows"]:
            item = {}
            item["doc_name"] = i["docSubtitle"]
            item["doc_id"] = i["docId"]
            item["doc_time"] = i["builddate"]
            item["doc_detail"] = "http://www.cbirc.gov.cn/cn/view/pages/ItemDetail.html?docId=" + str(i["docId"]) + "&itemId=4113&generaltype=" + str(i["generaltype"])
            yield item
        # pagination: work out the url of the next page
        str_url = response.request.url
        page = re.findall(r'.*?pageIndex=(\d+)', str_url, re.S)[0]
        # note: str.strip removes a set of characters rather than a suffix; it works here only because the url ends with the page number
        mid_url = str(str_url).strip(str(page))
        page = int(page) + 1
        # the requested url changes only in the page number
        if page <= 24:
            next_url = mid_url + str(page)
            yield scrapy.Request(
                next_url,
                callback=self.parse1
            )
           

Pipeline code:

import csv


class CircplusPipeline:
    def process_item(self, item, spider):
        with open('./circ_gb.csv', 'a+', encoding='gb2312') as file:
            fieldnames = ['doc_id', 'doc_name', 'doc_time', 'doc_detail']
            writer = csv.DictWriter(file, fieldnames=fieldnames)
            writer.writerow(item)
        return item

    def open_spider(self, spider):
        with open('./circ_gb.csv', 'a+', encoding='gb2312') as file:
            fieldnames = ['doc_id', 'doc_name', 'doc_time', 'doc_detail']
            writer = csv.DictWriter(file, fieldnames=fieldnames)
            writer.writeheader()
           

将資料儲存在csv檔案中

Downloader middleware

Learning to use the downloader middleware: it pre-processes the request urls the scheduler sends to the downloader, or pre-processes the responses obtained after the downloader makes the request.


There is also a process_exception method, used to handle exceptions raised while the middleware is running.

Basic use of downloader middleware


Define a custom middleware class with the three process methods and write the implementation in them. Remember to enable it in settings by registering the class.

Example code:

import random

# useful for handling different item types with a single interface
from itemadapter import is_item, ItemAdapter


class RandomUserArgentMiddleware:
    # process requests
    def process_request(self, request, spider):
        ua = random.choice(spider.settings.get("USER_ARENT_LIST"))
        # note: if USER_ARENT_LIST holds plain strings, the user agent is ua itself rather than ua[0]
        request.headers["User-Agent"] = ua[0]


class SelectRequestUserAgent:

    # process responses
    def process_response(self, request, response, spider):
        print(request.headers["User-Agent"])
        # must return a response (handed to the spider via the engine), a request (handed to the scheduler via the engine) or None
        return response


class HandleMiddlewareEcxeption:

    # process exceptions
    def process_exception(self, request, exception, spider):
        print(exception)
           

Settings code:

DOWNLOADER_MIDDLEWARES = {
    'suningbook.middlewares.RandomUserArgentMiddleware': 543,
    'suningbook.middlewares.SelectRequestUserAgent': 544,
    'suningbook.middlewares.HandleMiddlewareEcxeption': 544,
}
           

Simulated login in scrapy

Logging in by carrying cookies

In scrapy the start_urls are not filtered by allowed_domains and will always be requested. Looking at scrapy's source, requesting start_urls is done by the start_requests method, so by overriding start_requests yourself you can attach cookies and other information to the requests for start_urls and thereby simulate a login.


By overriding start_requests we attach the cookie information to our requests and thereby implement a simulated login.


Additional note:

Cookie handling is enabled by default in scrapy, so by default requests use cookies directly. Turning on COOKIES_DEBUG = True shows in detail how the cookies are passed along between requests and responses.


Case study: simulated login to Renren by carrying cookies

Override start_requests, attach the cookie information to the request, and visit a page that is only accessible after logging in to fetch its content, thereby simulating a login.

import scrapy
import re


class LoginSpider(scrapy.Spider):
    name = 'login'
    allowed_domains = ['renren.com']
    start_urls = ['http://renren.com/975252058/profile']
    # override start_requests
    def start_requests(self):
        # add the cookie information; all later requests will carry it as well
        cookies = "anonymid=klx1odv08szk4j; depovince=GW; _r01_=1; taihe_bi_sdk_uid=17f803e81753a44fe40be7ad8032071b; taihe_bi_sdk_session=089db9062fdfdbd57b2da32e92cad1c2; ick_login=666a6c12-9cd1-433b-9ad7-97f4a595768d; _de=49A204BB9E35C5367A7153C3102580586DEBB8C2103DE356; t=c433fa35a370d4d8e662f1fb4ea7c8838; societyguester=c433fa35a370d4d8e662f1fb4ea7c8838; id=975252058; xnsid=fadc519c; jebecookies=db5f9239-9800-4e50-9fc5-eaac2c445206|||||; JSESSIONID=abcb9nQkVmO0MekR6ifGx; ver=7.0; loginfrom=null; wp_fold=0"
        cookie = {i.split("=")[0]:i.split("=")[1] for i in cookies.split("; ")}
        yield scrapy.Request(
            self.start_urls[0],
            callback=self.parse,
            cookies=cookie
        )
    # print part of the profile page to verify that the simulated login succeeded
    def parse(self, response):
        print(re.findall("該使用者尚未開", response.body.decode(), re.S))
           

Simulated login by sending a POST request

Scrapy's FormRequest object can be used to send POST requests, and formdata, headers, cookies and other parameters can be set on it.

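A minimal sketch of a plain FormRequest inside a spider class (the url and form fields are placeholders):

    def parse(self, response):
        yield scrapy.FormRequest(
            "https://example.com/login",                         # address that receives the POST (placeholder)
            formdata={"username": "user", "password": "pass"},   # request body, sent as form fields
            headers={"Referer": "https://example.com/login"},
            callback=self.after_login,                           # parse function for the response
        )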

Case study: simulated login to GitHub with scrapy

Simulate logging in to GitHub: visit github.com/login, collect the form parameters, then request /session to verify the account and password, and finally complete the login.

Spider code:

import scrapy
import re
import random


class GithubSpider(scrapy.Spider):
    name = 'github'
    allowed_domains = ['github.com']
    start_urls = ['https://github.com/login']

    def parse(self, response):
        # first extract authenticity_token and commit from the login page's response; both are required for the login request
        authenticity_token = response.xpath("//*[@id='login']/div[4]/form/input[1]/@value").extract_first()
        rand_num1 = random.randint(100000, 999999)
        rand_num2 = random.randint(1000000, 9999999)
        rand_num = str(rand_num1) + str(rand_num2)
        commit = response.xpath("//*[@id='login']/div[4]/form/div/input[12]/@value").extract_first()
        form_data = dict(
            commit=commit,
            authenticity_token=authenticity_token,
            login="[email protected]",
            password="tcc062556",
            timestamp=rand_num,
            # rusted_device="",
        )
        # form_data["webauthn-support"] = ""
        # form_data["webauthn-iuvpaa-support"] = ""
        # form_data["return_to"] = ""
        # form_data["allow_signup"] = ""
        # form_data["client_id"] = ""
        # form_data["integration"] = ""
        # form_data["required_field_b292"] = ""
        headers = {
            "referer": "https://github.com/login",
            'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
            'accept-language': 'zh-CN,zh;q=0.9',
            'accept-encoding': 'gzip, deflate, br',
            'origin': 'https://github.com'
        }
        # use FormRequest.from_response to send the POST request and log in
        yield scrapy.FormRequest.from_response(
            response,
            formdata=form_data,
            headers=headers,
            callback=self.login_data
        )

    def login_data(self, response):
        # print the username to verify that the login succeeded
        print(re.findall("xiangshiersheng", response.body.decode()))
        # save the response as a local html file
        with open('./github.html', 'a+', encoding='utf-8') as f:
            f.write(response.body.decode())
           

Summary:

Three ways to simulate a login:

1. Carrying cookies

Use scrapy.Request(url, callback=, cookies={})

Fill in the cookies; they are carried when the url is requested.

2. Using FormRequest

scrapy.FormRequest(url, formdata={}, callback=)

formdata is the request body; put the form fields to submit into formdata.

3. Using from_response

scrapy.FormRequest.from_response(response, formdata={}, callback=)

from_response automatically finds the form submission address in the response (provided the form and its submission address exist).

A brief recap

How to use the crawlspider

  • Create the spider: scrapy genspider -t crawl spidername allow_domain
  • Flesh out the spider
  1. start_url
  2. Flesh out rules
    • a tuple
    • Rule(LinkExtractor, callback, follow)
      • LinkExtractor: link extractor, extracts urls
      • callback: the response of each extracted url is handled by this callback
      • follow = True: the responses of extracted urls keep being run through the Rules to extract more addresses
  3. Flesh out the callback and process the data

How to use downloader middleware

  • Define a class
  • process_request handles requests and does not need to return anything
  • process_response handles responses and must return a request or a response
  • Enable it in settings

How to simulate a login in scrapy

  • Carry cookies to log in
    • prepare a cookie dict
    • scrapy.Request(url, callback, cookies=cookies_dict)
  • scrapy.FormRequest(post_url, formdata={}, callback)
  • scrapy.FormRequest.from_response(response, formdata={}, callback)

Learning scrapy_redis

Scrapy is a general-purpose crawling framework but does not support distributed crawling by itself. Scrapy-redis provides a set of redis-based components (components only) to make distributed crawling with Scrapy easier.

The scrapy_redis crawling workflow

Compared with scrapy's workflow, scrapy-redis only adds a redis part: the scheduler reads its requests from redis, the urls the spider finds while crawling go through the scheduler for de-duplication and scheduling before being crawled, and finally the items the spider returns are stored in redis.


Scrapy-redis provides the following four components (all based on redis):

Scheduler:

Scrapy reworked python's collections.deque into its own Scrapy queue (https://github.com/scrapy/queuelib/blob/master/queuelib/queue.py), but multiple Scrapy spiders cannot share that queue of pending requests, i.e. Scrapy itself does not support distributed crawling. scrapy-redis solves this by replacing the Scrapy queue with a redis database (a redis queue): the requests to crawl live on one redis server, so multiple spiders can read from the same database.

In Scrapy the component directly tied to this "queue of pending requests" is the Scheduler: it enqueues new requests (into the Scrapy queue), pops the next request to crawl (out of the Scrapy queue), and so on. It organises the pending queue as a dict keyed by priority, for example:

{
    priority 0 : queue 0

    priority 1 : queue 1

    priority 2 : queue 2

}
           

The priority of each request then decides which queue it joins, and when dequeuing, lower priorities are popped first. Managing this fairly sophisticated dict of queues requires a whole set of methods on the Scheduler; the original Scheduler can no longer be used, so Scrapy-redis's scheduler component is used instead.

Duplication Filter:

Scrapy implements request de-duplication with a set: the fingerprints of requests already sent are put into a set, and each new request's fingerprint is checked against it. If the fingerprint is already in the set, the request has been sent before; if not, processing continues. The core of this check looks like this:

def request_seen(self, request):
    # self.fingerprints is the set of fingerprints
    fp = self.request_fingerprint(request)
    # this is the core of the duplicate check
    if fp in self.fingerprints:
        return True
    self.fingerprints.add(fp)
    if self.file:
        self.file.write(fp + os.linesep)
           

In scrapy-redis, de-duplication is done by the Duplication Filter component, which cleverly exploits the fact that a redis set holds no duplicates. The scrapy-redis scheduler receives requests from the engine, stores each request's fingerprint in a redis set to check for duplicates, and pushes the non-duplicate requests into the redis request queue.

When the engine asks for a request (one produced by the spider), the scheduler pops a request from the redis request queue according to priority and returns it to the engine, which hands it to the spider to process.

Item Pipeline:

引擎将(Spider傳回的)爬取到的Item給Item Pipeline,scrapy-redis 的Item Pipeline将爬取到的 Item 存⼊redis的 items queue。

The modified Item Pipeline makes it easy to pull items out of the items queue by key, which allows a cluster of item-processing workers.

Base Spider:

The original scrapy Spider class is no longer used; the rewritten RedisSpider inherits from both Spider and RedisMixin, where RedisMixin is the class that reads urls from redis.

When we define a Spider that inherits from RedisSpider, calling setup_redis connects to the redis database and then registers two signals:

  • one signal fires when the spider is idle: it calls spider_idle, which calls schedule_next_request to keep the spider alive and raises a DontCloseSpider exception.
  • one signal fires when an item is scraped: it calls item_scraped, which calls schedule_next_request to fetch the next request.

scrapy-redis ships with a demo project, described below.

settings.py configuration file:

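The key settings look roughly like this (a sketch of the typical scrapy_redis configuration rather than a verbatim copy of the demo; the redis address is a placeholder):

# replace scrapy's dupefilter and scheduler with the scrapy_redis versions
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# keep the request queue and the fingerprint set in redis when the spider closes
SCHEDULER_PERSIST = True

ITEM_PIPELINES = {
    'example.pipelines.ExamplePipeline': 300,
    # stores every item in the redis items queue
    'scrapy_redis.pipelines.RedisPipeline': 400,
}

# where the redis server lives
REDIS_URL = "redis://127.0.0.1:6379"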

dmoz spider code:

Compared with an ordinary crawlspider project, the main difference is in how parse handles the response.

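Roughly (a sketch of the example's dmoz crawlspider; the selectors and domain are recalled from memory and may differ slightly from the shipped demo):

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class DmozSpider(CrawlSpider):
    # an ordinary CrawlSpider; redis is wired in purely through the settings
    name = 'dmoz'
    allowed_domains = ['dmoz-odp.org']
    start_urls = ['http://www.dmoz-odp.org/']

    rules = [
        Rule(LinkExtractor(restrict_css=('.top-cat', '.sub-cat', '.cat-item')), callback='parse_directory', follow=True),
    ]

    def parse_directory(self, response):
        for div in response.css('.title-and-desc'):
            yield {
                'name': div.css('.site-title::text').extract_first(),
                'description': div.css('.site-descr::text').extract_first('').strip(),
                'link': div.css('a::attr(href)').extract_first(),
            }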

While the program is running:


Try disabling the RedisPipeline in settings and observe how the three keys in redis change.


A look at the scrapy-redis source

scrapy-redis rewrites scrapy's own request de-duplication as its dupefilter (RFPDupeFilter).

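Its core is a request_seen method backed by a redis set, roughly (an abridged sketch):

def request_seen(self, request):
    # build the request fingerprint and try to add it to the redis set
    fp = self.request_fingerprint(request)
    added = self.server.sadd(self.key, fp)
    # sadd returns 0 if the fingerprint was already in the set, i.e. the request was seen before
    return added == 0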

Compared with scrapy's pipeline, scrapy-redis simply stores the item in redis.

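The RedisPipeline serialises every item and pushes it onto a redis list, roughly (a sketch recalled from the scrapy_redis source):

from twisted.internet.threads import deferToThread


class RedisPipeline(object):
    def process_item(self, item, spider):
        # do the redis write in a thread pool so the reactor is not blocked
        return deferToThread(self._process_item, item, spider)

    def _process_item(self, item, spider):
        key = self.item_key(item, spider)   # e.g. "<spider name>:items"
        data = self.serialize(item)         # serialise the item (JSON by default)
        self.server.rpush(key, data)        # append it to the redis items list
        return item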

The scheduler provided by scrapy-redis


Key points:

When is a request object enqueued?

  • dont_filter = True: if the request is constructed with dont_filter set to True, its url will be fetched again and again (for urls whose content keeps being updated)
  • a brand-new url address is discovered and a request is constructed for it
  • the url is in start_urls: it is enqueued regardless of whether it was requested before, because the requests constructed for start_urls use dont_filter = True

scrapy-redis enqueue source:

def enqueue_request(self, request):
        if not request.dont_filter and self.df.request_seen(request):
            # dont_filter = False and self.df.request_seen(request) is True: not enqueued, the request fingerprint already exists
            # dont_filter = False and self.df.request_seen(request) is False: enqueued, because the url is brand new
            
            self.df.log(request, self.spider)
            return False
        if self.stats:
            self.stats.inc_value('scheduler/enqueued/redis', spider=self.spider)
        self.queue.push(request) # enqueue
        return True

           

The scrapy-redis de-duplication method


  • hash the request with sha1 to get its fingerprint
  • store the fingerprint in a redis set
  • when a new request arrives, build its fingerprint the same way and check whether it is already in the redis set

Generating the fingerprint:

# requires: import hashlib; from w3lib.url import canonicalize_url; from scrapy.utils.python import to_bytes
fp = hashlib.sha1()
fp.update(to_bytes(request.method))
fp.update(to_bytes(canonicalize_url(request.url)))
fp.update(request.body or b'')
return fp.hexdigest()
           

Checking whether the fingerprint is already in the redis set, and inserting it if not:

added = self.server.sadd(self.key, fp)
return added != 0
           

Exercise: crawling Baidu Tieba

Spider code:

When handling the information obtained from a correct response, regular expressions are used heavily: even when Tieba returns a proper response, the HTML elements in the page are commented out and only rendered later by JS, so xpath and similar tools cannot be used.

import scrapy
from copy import deepcopy
from tieba.pipelines import HandleChromeDriver
import re


class TiebaspiderSpider(scrapy.Spider):
    name = 'tiebaspider'
    allowed_domains = ['tieba.baidu.com']
    start_urls = ['https://tieba.baidu.com/index.html']

    def start_requests(self):
        item = {}
        # cookie1 = "BAIDUID_BFESS = 9250B568D2AF5E8D7C24501FD8947F10:FG=1; BDRCVFR[feWj1Vr5u3D]=I67x6TjHwwYf0; ZD_ENTRY=baidu; BA_HECTOR=0l2ha48g00ah842l6k1g46ooh0r; H_PS_PSSID=33518_33358_33272_31660_33595_33393_26350; delPer=0; PSINO=5; NO_UNAME=1; BIDUPSID=233AE38C1766688048F6AA80C4F0D56C; PSTM=1614821745; BAIDUID=233AE38C176668807122431B232D9927:FG=1; BDORZ=B490B5EBF6F3CD402E515D22BCDA1598"
        # cookie2 = {i.split("=")[0]: i.split("=")[1] for i in cookie1.split("; ")}
        cookies = self.parse_cookie()
        print(cookies)
        # print(cookies)
        headers = {
            'Cache-Control': 'no-cache',
            'Host': 'tieba.baidu.com',
            'Pragma': 'no-cache',
            'sec-ch-ua': '"Google Chrome";v = "89", "Chromium";v = "89", ";Not A Brand"; v = "99"',
            'sec-ch-ua-mobile': '?0',
            'Sec-Fetch-Dest': 'document',
            'Sec-Fetch-Mode': 'navigate',
            'Sec-Fetch-Site': 'none',
            'Sec-Fetch-User': '?1',
            'Upgrade-Insecure-Requests': 1
        }
        yield scrapy.Request(
            'https://tieba.baidu.com/index.html',
            cookies=cookies,
            callback=self.parse,
            headers=headers,
            meta={"item": item}
        )
        print("ok")

    def parse(self, response):
        # handle the front page
        if str(response.url).find("captcha") != -1:
            HandleChromeDriver.handle_tuxing_captcha(url=str(response.url))
        print(response.url)
        print(response.status)
        item = response.meta["item"]
        grouping_list = response.xpath("//*[@id='f-d-w']/div/div/div")
        for i in grouping_list:
            group_link = "https://tieba.baidu.com" + i.xpath("./a/@href").extract_first()
            group_name = i.xpath("./a/@title").extract_first()
            item["group_link"] = group_link
            item["group_name"] = group_name

            if group_name is not None:
                yield scrapy.Request(
                    group_link,
                    callback=self.parse_detail,
                    meta={"item": deepcopy(item)}
                )
        print("parse")

    def parse_detail(self, response):
        # handle a category (group) page
        detail_data = response.body.decode()
        if str(response.url).find("captcha") != -1:
            detail_data = HandleChromeDriver.handle_tuxing_captcha(url=str(response.url))
        print(response.url)
        print(response.status)
        detail_list_link = re.findall(
            '<div class="ba_info.*?">.*?<a rel="noopener" target="_blank" href="(.*?)" target="_blank" rel="external nofollow"  target="_blank" rel="external nofollow"  class="ba_href clearfix">',
            detail_data, re.S)
        print(detail_list_link)
        detail_list_name = re.findall(
            '<div class="ba_content">.*?<p class="ba_name">(.*?)</p>', detail_data, re.S)
        item = response.meta["item"]
        for i in range(len(detail_list_link)):
            detail_link = "https://tieba.baidu.com" + detail_list_link[i]
            detail_name = detail_list_name[i]
            item["detail_link"] = detail_link
            item["detail_name"] = detail_name

            yield scrapy.Request(
                detail_link,
                callback=self.parse_data,
                meta={"item": deepcopy(item)}
            )
        start_parse_url = response.url[:str(response.url).find("pn=") + 3]
        start_parse_body = response.body.decode()
        last_parse_page = re.findall('下一頁&gt;</a>.*?<a href=".*?pn=(\d+)" target="_blank" rel="external nofollow"  target="_blank" rel="external nofollow"  class="last">.*?</a>', start_parse_body, re.S)[0]
        page_parse_num = re.findall('<a href=".*?" target="_blank" rel="external nofollow"  target="_blank" rel="external nofollow" >(\d+)</a>', start_parse_body, re.S)[0]
        page_parse_num = int(page_parse_num) + 1
        end_parse_url = start_parse_url + str(page_parse_num)
        if page_parse_num <= int(last_parse_page):
            yield scrapy.Request(
                end_parse_url,
                callback=self.parse_detail,
                meta={"item": deepcopy(item)}
            )

        print("parse_detail")

    def parse_data(self, response):
        body_data = response.body.decode()
        if str(response.url).find("captcha") != -1:
            body_data = HandleChromeDriver.handle_tuxing_captcha(url=str(response.url))
        print(response.url)
        print(response.status)
        # print(response.body.decode())
        data_name = re.findall('<a rel="noreferrer" href=".*?" target="_blank" rel="external nofollow"  target="_blank" rel="external nofollow"  title=".*?" target="_blank" class="j_th_tit ">(.*?)</a>',
                               body_data, re.S)
        # print(data_name)
        data_link = re.findall(
            '<div class="t_con cleafix">.*?<a rel="noreferrer" href="(.*?)" target="_blank" rel="external nofollow"  target="_blank" rel="external nofollow"  title=".*?" target="_blank" class="j_th_tit ">.*?</a>',
            body_data, re.S)
        # print(data_link)
        # data_list = response.xpath('//*[@id="thread_list"]/li//div[@class="threadlist_title pull_left j_th_tit "]/a')
        # print(data_list.extract_first())
        item = response.meta["item"]
        for i in range(len(data_link)):
            item["data_link"] = "https://tieba.baidu.com" + data_link[i]
            item["data_name"] = data_name[i]
            yield item
        temp_url_find = str(response.url).find("pn=")
        if temp_url_find == -1:
            start_detail_url = response.url + "&ie=utf-8&pn="
        else:
            start_detail_url = str(response.url)[:temp_url_find + 3]
        start_detail_body = response.body.decode()
        last_detail_page = re.findall('下一頁&gt;</a>.*?<a href=".*?pn=(\d+)" target="_blank" rel="external nofollow"  target="_blank" rel="external nofollow"  class="last pagination-item " >尾頁</a>', start_detail_body, re.S)[0]
        page_detail_num = re.findall('<span class="pagination-current pagination-item ">(.*?)</span>', start_detail_body, re.S)[0]
        page_detail_num = int(page_detail_num) * 50
        end_detail_url = start_detail_url + str(page_detail_num)
        print(end_detail_url)
        if page_detail_num <= int(last_detail_page):
            yield scrapy.Request(
                end_detail_url,
                callback=self.parse_data,
                meta={"item": deepcopy(item)}
            )
        print("parse_data")

    def parse_data1(self, response):
        pass

    def parse_cookie(self):
        lis = []
        lst_end = {}
        lis_link = ["BAIDUID", "PSTM", "BIDUPSID", "__yjs_duid", "BDORZ", "BDUSS", "BAIDUID_BFESS", "H_PS_PSSID",
                    "bdshare_firstime", "BDUSS_BFESS", "NO_UNAME", "tb_as_data", "STOKEN", "st_data",
                    "Hm_lvt_287705c8d9e2073d13275b18dbd746dc", "Hm_lvt_98b9d8c2fd6608d564bf2ac2ae642948", "st_key_id",
                    "ab_sr", "st_sign"]
        with open("./cookie.txt", "r+", encoding="utf-8") as f:
            s = f.read()
            t = s.strip("[").strip("]").replace("'", "")
        while True:
            num = t.find("}, ")
            if num != -1:
                lis.append({i.split(": ")[0]: i.split(": ")[1] for i in t[:num].strip("{").split(", ")})
                t = t.replace(t[:num + 3], "")
            else:
                break
        cookie1 = "BAIDUID_BFESS = 9250B568D2AF5E8D7C24501FD8947F10:FG=1; BDRCVFR[feWj1Vr5u3D] = I67x6TjHwwYf0; ZD_ENTRY = baidu; BA_HECTOR = 0l2ha48g00ah842l6k1g46ooh0r; H_PS_PSSID = 33518_33358_33272_31660_33595_33393_26350; delPe r= 0; PSINO = 5; NO_UNAME = 1; BIDUPSID = 233AE38C1766688048F6AA80C4F0D56C; PSTM = 1614821745; BAIDUID = 233AE38C176668807122431B232D9927:FG=1; BDORZ = B490B5EBF6F3CD402E515D22BCDA1598"
        cookie2 = {i.split(" = ")[0]: i.split(" = ")[-1] for i in cookie1.split("; ")}
        for i in lis_link:
            for j in lis:
                if j["name"] == i:
                    lst_end[i] = j["value"]
            for z in cookie2:
                if i == z:
                    lst_end[i] = cookie2[i]
        return lst_end

           

Pipeline code:

This mainly covers saving the data to a csv file, plus a helper class with two static methods: one performs an automatic login to Tieba so that a complete and valid cookie can be captured and carried on later requests to get correct responses; the other handles the graphical captcha Tieba shows the crawler (a captcha even humans do not pass easily...).

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
from selenium import webdriver
import time
import csv


class TiebaPipeline:

    def process_item(self, item, spider):
        with open('./tieba.csv', 'a+', encoding='utf-8') as file:
            fieldnames = ['group_link', 'group_name', 'detail_link', 'detail_name', 'data_link', 'data_name']
            writer = csv.DictWriter(file, fieldnames=fieldnames)
            writer.writerow(item)
        return item

    def open_spider(self, spider):
        with open('./tieba.csv', 'w+', encoding='utf-8') as file:
            fieldnames = ['group_link', 'group_name', 'detail_link', 'detail_name', 'data_link', 'data_name']
            writer = csv.DictWriter(file, fieldnames=fieldnames)
            writer.writeheader()
        HandleChromeDriver.handle_cookie(url="http://tieba.baidu.com/f/user/passport?jumpUrl=http://tieba.baidu.com")


class HandleChromeDriver:

    @staticmethod
    def handle_cookie(url):
        driver = webdriver.Chrome("E:\python_study\spider\data\chromedriver_win32\chromedriver.exe")
        driver.implicitly_wait(2)
        driver.get(url)
        driver.implicitly_wait(2)
        login_pwd = driver.find_element_by_xpath('//*[@id="TANGRAM__PSP_4__footerULoginBtn"]')
        login_pwd.click()
        username = driver.find_element_by_xpath('//*[@id="TANGRAM__PSP_4__userName"]')
        pwd = driver.find_element_by_xpath('//*[@id="TANGRAM__PSP_4__password"]')
        login_btn = driver.find_element_by_xpath('//*[@id="TANGRAM__PSP_4__submit"]')
        time.sleep(1)
        username.send_keys("18657589370")
        time.sleep(1)
        pwd.send_keys("tcc062556")
        time.sleep(1)
        login_btn.click()
        time.sleep(15)
        tb_cookie = str(driver.get_cookies())
        with open("./cookie.txt", "w+", encoding="utf-8") as f:
            f.write(tb_cookie)
        # print(tb_cookie)
        driver.close()

    @staticmethod
    def handle_tuxing_captcha(url):
        drivers = webdriver.Chrome("E:\python_study\spider\data\chromedriver_win32\chromedriver.exe")
        drivers.implicitly_wait(2)
        drivers.get(url)
        drivers.implicitly_wait(2)
        time.sleep(10)
        drivers.close()
        # print(tb_cookie)
        return drivers.page_source


           

Settings code:

Here we mainly set some request-header information and pause two seconds between requests.

BOT_NAME = 'tieba'

SPIDER_MODULES = ['tieba.spiders']
NEWSPIDER_MODULE = 'tieba.spiders'

LOG_LEVEL = 'WARNING'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.72 Safari/537.36 Edg/89.0.774.45'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 2
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
  'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
  'Connection': 'keep-alive',
}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'tieba.pipelines.TiebaPipeline': 300,
}