Scrapy
The crawling workflow of the Scrapy framework
An overview of the components of the Scrapy framework
The components above have no direct connection to one another; the Scrapy engine links them all together and passes the data between them. The engine is already implemented by the framework, so what you normally write by hand are the spider and the pipeline; for complex projects you can also write downloader and spider middlewares to cover more advanced needs.
Basic usage of the Scrapy framework
After installing the scrapy package, type commands directly into the terminal:
- Create a Scrapy project
scrapy startproject myspider
- Generate a spider
scrapy genspider itcast itcast.cn
- Extract the data
flesh out the spider, using XPath etc.
- Store the data
handle it in the pipeline
- Start the spider
scrapy crawl itcast
How to use the Scrapy framework, step by step
- Create a Scrapy project; a set of .py files and configuration files is generated automatically
- Create a spider with a custom name and, optionally, a restricted crawl domain
- Fill in the spider code to implement the desired behavior
- Use yield to pass the parsed data to the pipeline
- Use the pipeline to store the data (to operate on data in a pipeline you must enable it in settings.py; it is off by default)
- A few caveats when using pipelines
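As a sketch of that settings switch (the project and pipeline class names below are placeholders, not from a real project):

```python
# settings.py fragment: register pipelines under ITEM_PIPELINES to enable them.
# The number is the priority: lower values run earlier (valid range 0-1000).
ITEM_PIPELINES = {
    "myspider.pipelines.MyspiderPipeline": 300,   # runs first
    "myspider.pipelines.SaveToCsvPipeline": 400,  # runs second
}
```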
Using the logging module
In Scrapy:
set LOG_LEVEL = "WARNING" in settings
set LOG_FILE = "./a.log" in settings  # choose where the log file is saved and its name; log output then no longer appears in the terminal
import logging, then instantiate a logger to emit output from any file
In an ordinary project:
import logging
logging.basicConfig(...)  # configure the log output style and format
instantiate a logger with logger = logging.getLogger(__name__)
call the logger from any .py file
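For an ordinary project, the three steps above look like this as a minimal standalone sketch (the logger name "demo" is arbitrary; in practice `__name__` is the usual choice):

```python
import logging

# Configure the output format and threshold once, at program start-up
logging.basicConfig(
    level=logging.WARNING,
    format="%(asctime)s %(name)s %(levelname)s: %(message)s",
)

# Instantiate a named logger; getLogger returns the same object for the same name
logger = logging.getLogger("demo")

logger.warning("shown: WARNING is at the threshold")
logger.info("suppressed: INFO is below the WARNING threshold")
```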
Implementing pagination requests in Scrapy
Case study: crawling Tencent careers
Because the mainstream today is to separate frontend and backend, a plain GET of the site returns only a pile of HTML tags with no data in them; the data shown on the page is fetched by JavaScript from a backend API and spliced into the HTML. So instead of requesting the site address directly, use the Chrome developer tools to find the backend API the page calls, and request that address instead.
By comparing the querystrings the site sends to the backend API, work out the URL to request.
On the Tencent careers site, paging through job listings also goes through the backend API, so paginated crawling is simply a matter of requesting the same endpoint with a different querystring.
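The querystring manipulation described above can be sketched as a small standalone helper (the helper name is hypothetical and the URL is shortened for illustration):

```python
import re

def next_page_url(url):
    """Increment the pageIndex query parameter by one (illustrative helper)."""
    match = re.search(r"pageIndex=(\d+)", url)
    page = int(match.group(1))
    return re.sub(r"pageIndex=\d+", f"pageIndex={page + 1}", url)

url = "https://careers.tencent.com/tencentcareer/api/post/Query?pageIndex=1&pageSize=10"
nxt = next_page_url(url)  # same URL with pageIndex=2
```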
Spider code:
import scrapy
import random
import json


class TencenthrSpider(scrapy.Spider):
    name = 'tencenthr'
    allowed_domains = ['tencent.com']
    start_urls = ['https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1614839354704&parentCategoryId=40001&pageIndex=1&pageSize=10&language=zh-cn&area=cn']

    def parse(self, response):
        # The request targets a backend API, so the body is JSON;
        # load response.text into a dict for easier handling
        gr_dict = json.loads(response.text)
        # Pagination is driven by the pageIndex querystring parameter,
        # so locate the current index; the next request uses index + 1
        start_url = str(response.request.url)
        start_index = start_url.find("Index") + 6
        mid_index = start_url.find("&", start_index)
        # The JSON response also reports the total number of records
        temp = gr_dict["Data"]["Count"]
        item = {}
        for i in range(10):
            # Pull the required fields out of the dict
            item["Id"] = gr_dict["Data"]["Posts"][i]["PostId"]
            item["Name"] = gr_dict["Data"]["Posts"][i]["RecruitPostName"]
            item["Content"] = gr_dict["Data"]["Posts"][i]["Responsibility"]
            item["Url"] = "https://careers.tencent.com/jobdesc.html?postid=" + gr_dict["Data"]["Posts"][i]["PostId"]
            # Hand the item over to the engine
            yield item
        # Build the next URL; the timestamp parameter is just a 13-digit number
        rand_num1 = random.randint(100000, 999999)
        rand_num2 = random.randint(1000000, 9999999)
        rand_num = str(rand_num1) + str(rand_num2)
        # Work out the next pageIndex value
        nums = int(start_url[start_index:mid_index]) + 1
        if nums <= int(temp) / 10:
            next_url = ('https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=' + rand_num
                        + '&parentCategoryId=40001&pageIndex=' + str(nums)
                        + '&pageSize=10&language=zh-cn&area=cn')
            # Wrap the next URL in a Request object and hand it to the engine
            yield scrapy.Request(next_url, callback=self.parse)
Pipeline code:
import csv


class TencentPipeline:
    def process_item(self, item, spider):
        # Append each received item to a CSV file
        with open('./tencent_hr.csv', 'a+', encoding='utf-8') as file:
            fieldnames = ['Id', 'Name', 'Content', 'Url']
            writer = csv.DictWriter(file, fieldnames=fieldnames)
            # Only write the header row once, while the file is still empty
            if file.tell() == 0:
                writer.writeheader()
            writer.writerow(item)
        return item
More on scrapy.Request
Using Scrapy items
Case study: crawling complaints from the Yangguang government affairs site
The Chrome developer tools show that on the Yangguang site the page data is filled into the HTML normally, so crawling it is just ordinary HTML parsing.
Note, however, that the detail page of each complaint (its images and so on) must also be crawled, so take care with how the parse methods are written in the spider.
Spider code:
import scrapy
from yangguang.items import YangguangItem


class YangguanggovSpider(scrapy.Spider):
    name = 'yangguanggov'
    allowed_domains = ['sun0769.com']
    start_urls = ['http://wz.sun0769.com/political/index/politicsNewest?page=1']

    def parse(self, response):
        start_url = response.url
        # Crawl and parse the data page by page
        li_list = response.xpath("/html/body/div[2]/div[3]/ul[2]")
        for li in li_list:
            # The item class defined in items.py carries the data we need
            item = YangguangItem()
            item["Id"] = str(li.xpath("./li/span[1]/text()").extract_first())
            item["State"] = str(li.xpath("./li/span[2]/text()").extract_first()).replace(" ", "").replace("\n", "")
            item["Content"] = str(li.xpath("./li/span[3]/a/text()").extract_first())
            item["Time"] = li.xpath("./li/span[5]/text()").extract_first()
            item["Link"] = "http://wz.sun0769.com" + str(li.xpath("./li/span[3]/a[1]/@href").extract_first())
            # Visit each complaint's detail page and handle it in parse_detail,
            # passing the item along via scrapy's meta parameter
            yield scrapy.Request(
                item["Link"],
                callback=self.parse_detail,
                meta={"item": item}
            )
        # Request the next page
        start_url_page = int(str(start_url)[str(start_url).find("=") + 1:]) + 1
        next_url = "http://wz.sun0769.com/political/index/politicsNewest?page=" + str(start_url_page)
        yield scrapy.Request(
            next_url,
            callback=self.parse
        )

    # Parse the data on the detail page
    def parse_detail(self, response):
        item = response.meta["item"]
        item["Content_img"] = response.xpath("/html/body/div[3]/div[2]/div[2]/div[3]/img/@src")
        yield item
Items code:
import scrapy


# Define the fields the item needs on the item class
class YangguangItem(scrapy.Item):
    Id = scrapy.Field()
    Link = scrapy.Field()
    State = scrapy.Field()
    Content = scrapy.Field()
    Time = scrapy.Field()
    Content_img = scrapy.Field()
Pipeline code:
class YangguangPipeline:
    # Simply print out the collected data
    def process_item(self, item, spider):
        print(item)
        return item
Understanding Scrapy's DEBUG output
The DEBUG messages Scrapy prints show the order in which its components start up, and when an error occurs they help in tracking the problem down.
Going deeper: scrapy shell
scrapy shell lets you try out and debug code without starting a spider; when you are unsure how an operation will behave, verify it in the shell first.
Going deeper: settings and pipelines
Settings
An annotated tour of a Scrapy project's settings file:
# Scrapy settings for yangguang project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html

# Project name
BOT_NAME = 'yangguang'

# Where the spider modules live
SPIDER_MODULES = ['yangguang.spiders']
# Where newly generated spiders are placed
NEWSPIDER_MODULE = 'yangguang.spiders'

# Log level of the output
LOG_LEVEL = 'WARNING'

# The User-Agent header carried by every request
# Crawl responsibly by identifying yourself (and your website) on the user-agent
# USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.72 Safari/537.36 Edg/89.0.774.45'

# Whether to obey the robots.txt rules
ROBOTSTXT_OBEY = True

# Maximum number of simultaneous requests
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Delay between successive requests
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3

# Rarely needed:
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Whether cookie handling is enabled (on by default)
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Whether the Telnet console component is enabled
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Default request headers; the User-Agent cannot also be placed here
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'yangguang.middlewares.YangguangSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'yangguang.middlewares.YangguangDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Enable the item pipelines
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'yangguang.pipelines.YangguangPipeline': 300,
}

# AutoThrottle settings
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# HTTP caching settings
# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
Pipelines
Besides the process_item method generated when the project is created, a pipeline can also define open_spider and close_spider methods, which run once when the spider starts and once when it finishes, respectively.
Example code:
class YangguangPipeline:
    def process_item(self, item, spider):
        print(item)
        # Without this return, pipelines with a lower priority never receive the item
        return item

    def open_spider(self, spider):
        # Runs once, when the spider starts
        spider.test = "hello"
        # Adds an attribute to the spider; later code, whether in process_item
        # or in the spider itself, can use this attribute

    def close_spider(self, spider):
        # Runs once, when the spider closes
        spider.test = ""
A note on MongoDB
Use the third-party pymongo package to operate on MongoDB.
CrawlSpider in Scrapy
The command to generate a CrawlSpider:
scrapy genspider -t crawl <spider_name> <domain_to_crawl>
Using CrawlSpider
- Create the spider: scrapy genspider -t crawl <spider_name> <allow_domain>
- Specify start_url; the corresponding responses are run through rules to extract URLs
- Flesh out rules by adding Rule entries
Rule(LinkExtractor(allow=r'/web/site0/tab5240/info\d+.htm'), callback='parse_item'),
- Notes:
Incomplete URLs are completed automatically by CrawlSpider before being requested
Do not define a parse method of your own; CrawlSpider needs it for its own special purposes
callback: the responses for the URLs the link extractor finds are handed to this callback
follow: whether the responses for the extracted URLs should themselves be filtered through the rules again
LinkExtractors (link extractors):
With a LinkExtractor the programmer no longer has to pick out the wanted URLs and send the requests by hand; the extractor finds every URL on the crawled pages that satisfies its rules and follows them automatically. A brief introduction to the LinkExtractor class:
class scrapy.linkextractors.LinkExtractor(
    allow=(),
    deny=(),
    allow_domains=(),
    deny_domains=(),
    deny_extensions=None,
    restrict_xpaths=(),
    tags=('a', 'area'),
    attrs=('href',),
    canonicalize=True,
    unique=True,
    process_value=None
)
Main parameters:
- allow: allowed URLs. Every URL matching this regular expression is extracted.
- deny: forbidden URLs. Any URL matching this regular expression is skipped.
- allow_domains: allowed domains. Only URLs under the domains specified here are extracted.
- deny_domains: forbidden domains. URLs under the domains specified here are never extracted.
- restrict_xpaths: restricting XPaths; filters links together with allow.
The Rule class:
Defines a crawling rule for the spider. A brief introduction to the class:
class scrapy.spiders.Rule(
    link_extractor,
    callback=None,
    cb_kwargs=None,
    follow=None,
    process_links=None,
    process_request=None
)
Main parameters:
- link_extractor: a LinkExtractor object that defines the crawling rule.
- callback: the callback to run for URLs that satisfy this rule. Because CrawlSpider uses parse as its own internal callback, do not override parse or pass it as your callback.
- follow: whether links extracted from this rule's responses should themselves be followed.
- process_links: receives the links obtained from link_extractor, for filtering out links you do not want to crawl.
Case study: crawling a jokes site
Inspecting xiaohua.zol.com.cn shows that the data is embedded directly in the HTML: requesting the site domain returns HTML that already contains everything visible on the page, so the server's HTML response can be parsed directly.
The next-page URL is likewise embedded in the HTML, so CrawlSpider makes extracting the URL of the following page very convenient.
Spider code:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
import re


class XhzolSpider(CrawlSpider):
    name = 'xhzol'
    allowed_domains = ['xiaohua.zol.com.cn']
    start_urls = ['http://xiaohua.zol.com.cn/lengxiaohua/1.html']

    rules = (
        # Extract from the response every URL matching this regex (relative URLs
        # are completed automatically); callback names the handler for the
        # response, and follow says whether URLs extracted from that response
        # should in turn be requested
        Rule(LinkExtractor(allow=r'/lengxiaohua/\d+\.html'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = {}
        # Search the page for joke titles with a regular expression
        for i in re.findall(r'<span class="article-title"><a target="_blank" href="/detail\d+/\d+\.html"[^>]*>(.*?)</a></span>',
                            response.body.decode("gb18030"), re.S):
            item["titles"] = i
            yield item
Pipeline code:
class XiaohuaPipeline:
    def process_item(self, item, spider):
        print(item)
        return item
A simple print to inspect the results.
Case study: crawling penalty notices from the CBIRC site
Inspecting the page shows that the actual data is fetched through Ajax requests to a backend API that returns JSON, which JavaScript then injects into the HTML for rendering. So instead of requesting the site domain directly, request the backend API, and work out the next page's URL by comparing how the API querystring changes when paging.
Spider code:
import scrapy
import re
import json


class CbircSpider(scrapy.Spider):
    name = 'cbirc'
    allowed_domains = ['cbirc.gov.cn']
    start_urls = ['https://www.cbirc.gov.cn/']

    def parse(self, response):
        start_url = "http://www.cbirc.gov.cn/cbircweb/DocInfo/SelectDocByItemIdAndChild?itemId=4113&pageSize=18&pageIndex=1"
        yield scrapy.Request(
            start_url,
            callback=self.parse1
        )

    def parse1(self, response):
        # Process the JSON payload
        json_data = json.loads(response.body.decode())
        for i in json_data["data"]["rows"]:
            item = {}
            item["doc_name"] = i["docSubtitle"]
            item["doc_id"] = i["docId"]
            item["doc_time"] = i["builddate"]
            item["doc_detail"] = "http://www.cbirc.gov.cn/cn/view/pages/ItemDetail.html?docId=" + str(i["docId"]) + "&itemId=4113&generaltype=" + str(i["generaltype"])
            yield item
        # Pagination: work out the next page's URL
        str_url = str(response.request.url)
        page = re.findall(r'.*?pageIndex=(\d+)', str_url, re.S)[0]
        # Everything up to and including "pageIndex=" stays fixed
        mid_url = str_url[:str_url.rfind("pageIndex=") + len("pageIndex=")]
        page = int(page) + 1
        # The only thing that changes between requests is the page number
        if page <= 24:
            next_url = mid_url + str(page)
            yield scrapy.Request(
                next_url,
                callback=self.parse1
            )
Pipeline code:
import csv


class CircplusPipeline:
    def process_item(self, item, spider):
        with open('./circ_gb.csv', 'a+', encoding='gb2312') as file:
            fieldnames = ['doc_id', 'doc_name', 'doc_time', 'doc_detail']
            writer = csv.DictWriter(file, fieldnames=fieldnames)
            writer.writerow(item)
        return item

    def open_spider(self, spider):
        # Start a fresh file with a header row when the spider opens
        with open('./circ_gb.csv', 'w+', encoding='gb2312') as file:
            fieldnames = ['doc_id', 'doc_name', 'doc_time', 'doc_detail']
            writer = csv.DictWriter(file, fieldnames=fieldnames)
            writer.writeheader()
The data is saved to a CSV file.
Downloader middleware
Learning to use downloader middleware: it pre-processes the request URLs the scheduler sends to the downloader, and pre-processes the responses the downloader gets back.
There is also a process_exception method, for handling exceptions raised while the middleware runs.
Basic use of downloader middleware
Define a middleware class and implement the three process_* methods in it. Remember to register the class in settings to enable it.
Example code:
import random
# useful for handling different item types with a single interface
from itemadapter import is_item, ItemAdapter


class RandomUserArgentMiddleware:
    # Pre-process the request
    def process_request(self, request, spider):
        ua = random.choice(spider.settings.get("USER_ARENT_LIST"))
        request.headers["User-Agent"] = ua


class SelectRequestUserAgent:
    # Pre-process the response
    def process_response(self, request, response, spider):
        print(request.headers["User-Agent"])
        # Must return a response (handed to the spider via the engine),
        # a request (handed to the scheduler via the engine), or None
        return response


class HandleMiddlewareEcxeption:
    # Handle exceptions
    def process_exception(self, request, exception, spider):
        print(exception)
Settings code:
DOWNLOADER_MIDDLEWARES = {
    'suningbook.middlewares.RandomUserArgentMiddleware': 543,
    'suningbook.middlewares.SelectRequestUserAgent': 544,
    'suningbook.middlewares.HandleMiddlewareEcxeption': 545,
}
Simulating logins in Scrapy
Logging in with cookies in Scrapy
In Scrapy, start_urls are not filtered by allowed_domains and are always requested. Looking at the Scrapy source, requesting start_urls is handled by the start_requests method, so by overriding start_requests yourself you can attach cookie information to those requests and implement features such as simulated login.
Override start_requests to attach cookie information to our requests and simulate a login.
Extra note:
Cookie handling is enabled by default in Scrapy, so requests use cookies automatically. Set COOKIES_DEBUG = True to see in detail how cookies are passed between functions.
Case study: logging in to Renren with cookies
By overriding start_requests, attach the cookie information to the request, visit a page that requires login, and fetch its content, thereby simulating a login.
import scrapy
import re


class LoginSpider(scrapy.Spider):
    name = 'login'
    allowed_domains = ['renren.com']
    start_urls = ['http://renren.com/975252058/profile']

    # Override start_requests
    def start_requests(self):
        # Attach the cookie information; every request after this carries it
        cookies = "anonymid=klx1odv08szk4j; depovince=GW; _r01_=1; taihe_bi_sdk_uid=17f803e81753a44fe40be7ad8032071b; taihe_bi_sdk_session=089db9062fdfdbd57b2da32e92cad1c2; ick_login=666a6c12-9cd1-433b-9ad7-97f4a595768d; _de=49A204BB9E35C5367A7153C3102580586DEBB8C2103DE356; t=c433fa35a370d4d8e662f1fb4ea7c8838; societyguester=c433fa35a370d4d8e662f1fb4ea7c8838; id=975252058; xnsid=fadc519c; jebecookies=db5f9239-9800-4e50-9fc5-eaac2c445206|||||; JSESSIONID=abcb9nQkVmO0MekR6ifGx; ver=7.0; loginfrom=null; wp_fold=0"
        cookie = {i.split("=")[0]: i.split("=")[1] for i in cookies.split("; ")}
        yield scrapy.Request(
            self.start_urls[0],
            callback=self.parse,
            cookies=cookie
        )

    # Print part of the page to verify the simulated login succeeded
    def parse(self, response):
        print(re.findall("该用户尚未开", response.body.decode(), re.S))
Simulated login via a POST request
Use the FormRequest object Scrapy provides to send POST requests; it accepts formdata, headers, cookies and other parameters.
Case study: simulating a GitHub login with Scrapy
Simulate logging in to GitHub: visit github.com/login, collect the form parameters, then request /session to have the username and password verified, completing the login.
Spider code:
import scrapy
import re
import random


class GithubSpider(scrapy.Spider):
    name = 'github'
    allowed_domains = ['github.com']
    start_urls = ['https://github.com/login']

    def parse(self, response):
        # First pull authenticity_token and commit out of the login page
        # response; both are required when submitting the login
        authenticity_token = response.xpath("//*[@id='login']/div[4]/form/input[1]/@value").extract_first()
        rand_num1 = random.randint(100000, 999999)
        rand_num2 = random.randint(1000000, 9999999)
        rand_num = str(rand_num1) + str(rand_num2)
        commit = response.xpath("//*[@id='login']/div[4]/form/div/input[12]/@value").extract_first()
        form_data = dict(
            commit=commit,
            authenticity_token=authenticity_token,
            login="[email protected]",
            password="tcc062556",
            timestamp=rand_num,
            # trusted_device="",
        )
        # form_data["webauthn-support"] = ""
        # form_data["webauthn-iuvpaa-support"] = ""
        # form_data["return_to"] = ""
        # form_data["allow_signup"] = ""
        # form_data["client_id"] = ""
        # form_data["integration"] = ""
        # form_data["required_field_b292"] = ""
        headers = {
            "referer": "https://github.com/login",
            'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
            'accept-language': 'zh-CN,zh;q=0.9',
            'accept-encoding': 'gzip, deflate, br',
            'origin': 'https://github.com'
        }
        # Use FormRequest.from_response to send the POST request and log in
        yield scrapy.FormRequest.from_response(
            response,
            formdata=form_data,
            headers=headers,
            callback=self.login_data
        )

    def login_data(self, response):
        # Print the username to verify the login succeeded
        print(re.findall("xiangshiersheng", response.body.decode()))
        # Save the page locally as an HTML file
        with open('./github.html', 'a+', encoding='utf-8') as f:
            f.write(response.body.decode())
Summary:
Three ways to simulate a login:
1. Carrying cookies
Use scrapy.Request(url, callback=..., cookies={...})
Fill in the cookies; they are carried along when the URL is requested.
2. Using FormRequest
scrapy.FormRequest(url, formdata={}, callback=...)
formdata is the request body; put the form fields to submit into formdata.
3. Using from_response
scrapy.FormRequest.from_response(response, formdata={}, callback=...)
from_response automatically finds the form's submission address in the response (provided a form and its action are present).
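The cookie-string-to-dict conversion used in method 1 can be factored into a tiny helper (a hypothetical helper, not part of Scrapy — it just packages the dict comprehension used in the Renren example):

```python
def cookies_to_dict(cookie_str):
    """Turn a raw 'k1=v1; k2=v2' Cookie header string into the dict
    that scrapy.Request(cookies=...) expects."""
    return {
        # split on the first '=' only, so values containing '=' survive
        pair.split("=", 1)[0]: pair.split("=", 1)[1]
        for pair in cookie_str.split("; ")
    }

d = cookies_to_dict("sessionid=abc123; lang=en")
```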
A brief recap
How to use CrawlSpider
- Create the spider: scrapy genspider -t crawl spidername allow_domain
- Complete the spider
- start_url
- fill in rules
- a tuple
- Rule(LinkExtractor, callback, follow)
- LinkExtractor: the link extractor, extracts URLs
- callback: the extracted URL's response is handled by this callback
- follow=True: the URL's response is further run through the rules to extract addresses
- complete the callback to process the data
How to use downloader middleware
- Define a class
- process_request: handles the request; no return needed
- process_response: handles the response; must return a request or a response
- Enable it in settings
How to simulate a login in Scrapy
- Carrying cookies
- prepare a cookie dict
- scrapy.Request(url, callback=..., cookies=cookies_dict)
- scrapy.FormRequest(post_url, formdata={}, callback=...)
- scrapy.FormRequest.from_response(response, formdata={}, callback=...)
Learning scrapy_redis
Scrapy is a general-purpose crawling framework, but it does not support distributed crawling. Scrapy-redis provides a set of redis-based components (components only) to make distributed crawling with Scrapy easier.
The scrapy_redis crawl flow
Compared with Scrapy's workflow, scrapy-redis merely adds the redis part: the scheduler reads its requests out of redis, the URLs the spider discovers while crawling also pass through the scheduler for dedup and scheduling before being fetched, and finally the items the spider returns are stored in redis.
Scrapy-redis provides the following four components (based on redis)
Scheduler:
Scrapy adapted Python's collections.deque into its own Scrapy queue (https://github.com/scrapy/queuelib/blob/master/queuelib/queue.py), but multiple Scrapy spiders cannot share one pending-request queue, i.e. Scrapy by itself does not support distributed crawling. scrapy-redis solves this by swapping the Scrapy queue for a redis database (a redis queue): with the pending requests stored on one redis-server, several spiders can read from the same database.
The part of Scrapy directly concerned with the pending queue is the Scheduler, which enqueues new requests (pushes them onto the Scrapy queue) and pops the next request to crawl. It organizes the pending queue by priority as a dictionary, for example:
{
    priority 0: queue 0,
    priority 1: queue 1,
    priority 2: queue 2,
}
It then uses the request's priority to decide which queue each request is pushed to, and pops from the queue with the smallest priority value first. Managing this fairly sophisticated dictionary of queues requires the Scheduler to provide a set of methods, and the original Scheduler can no longer be used as-is, which is why scrapy-redis's scheduler component is used instead.
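As an illustration only (not Scrapy's actual implementation), the priority-to-queue idea can be sketched with plain Python structures:

```python
from collections import deque

class PriorityScheduler:
    """Toy scheduler: one FIFO queue per priority; lower value pops first."""

    def __init__(self):
        self.queues = {}  # {priority: deque of requests}

    def enqueue(self, request, priority=0):
        # Create the queue for this priority on first use, then append
        self.queues.setdefault(priority, deque()).append(request)

    def next_request(self):
        # Pop from the non-empty queue with the smallest priority value
        for prio in sorted(self.queues):
            if self.queues[prio]:
                return self.queues[prio].popleft()
        return None

sched = PriorityScheduler()
sched.enqueue("req-b", priority=1)
sched.enqueue("req-a", priority=0)  # lower value, so popped first
```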
Duplication Filter:
Scrapy deduplicates requests with a set: the fingerprint of every request already sent is kept in the set, and each new request's fingerprint is checked against it. If the fingerprint is already there, the request has been sent before; otherwise processing continues. The core of this dedup check is implemented as follows:
def request_seen(self, request):
    # self.fingerprints is the set of fingerprints
    fp = self.request_fingerprint(request)
    # This is the core dedup check
    if fp in self.fingerprints:
        return True
    self.fingerprints.add(fp)
    if self.file:
        self.file.write(fp + os.linesep)
In scrapy-redis, dedup is handled by the Duplication Filter component, which cleverly exploits the no-duplicates property of a redis set. The scrapy-redis scheduler receives requests from the engine, stores each request's fingerprint in a redis set to check for duplicates, and pushes the non-duplicate requests into the redis request queue.
When the engine asks for a request (one originated by the spider), the scheduler pops a request from the redis request queue by priority and returns it to the engine, which hands it to the spider for processing.
Item Pipeline:
The engine passes the scraped items (returned by the spider) to the Item Pipeline; scrapy-redis's Item Pipeline stores them in the redis items queue.
The modified Item Pipeline makes it easy to pull items out of the items queue by key, and thereby run a cluster of item-processing workers.
Base Spider:
Scrapy's original Spider class is no longer used; the rewritten RedisSpider inherits both Spider and RedisMixin, where RedisMixin is the class that reads URLs from redis.
When a spider we generate inherits RedisSpider, it calls the setup_redis function, which connects to the redis database and then sets up signals:
- one for when the spider is idle, which calls the spider_idle function; that in turn calls schedule_next_request to keep the spider alive, and raises a DontCloseSpider exception;
- one for when an item is scraped, which calls the item_scraped function; that in turn calls schedule_next_request to fetch the next request.
scrapy-redis ships with a demo project, as follows.
settings.py configuration file:
domz spider code:
Compared with an ordinary CrawlSpider project, the main difference lies in how parse handles the response.
While the program runs:
Try disabling the RedisPipeline in settings and watch how the three redis keys change.
Reading the scrapy-redis source
scrapy-redis rewrites Scrapy's own request dedup in its dupefilter (RFPDupeFilter).
Compared with Scrapy's pipeline, scrapy-redis simply stores the items in redis.
The scheduler scrapy-redis provides
Key points:
When does a request object get enqueued?
- when dont_filter=True: building the request with dont_filter set to True means the URL is fetched repeatedly (for URLs whose content gets updated)
- when a brand-new URL is seen and a request is built for it
- URLs in start_urls are always enqueued, whether or not they were requested before, because requests built from start_urls addresses are constructed with dont_filter=True
scrapy-redis enqueue source:
def enqueue_request(self, request):
    if not request.dont_filter and self.df.request_seen(request):
        # dont_filter=False and request_seen(...)=True: not enqueued, the fingerprint already exists
        # dont_filter=False and request_seen(...)=False: enqueued, this is a brand-new URL
        self.df.log(request, self.spider)
        return False
    if self.stats:
        self.stats.inc_value('scheduler/enqueued/redis', spider=self.spider)
    self.queue.push(request)  # enqueue
    return True
The scrapy-redis dedup method
- hash the request with sha1 to obtain a fingerprint
- store the fingerprint in a redis set
- when the next request arrives, build its fingerprint the same way and check whether it is already in the redis set
Generating the fingerprint:
fp = hashlib.sha1()
fp.update(to_bytes(request.method))
fp.update(to_bytes(canonicalize_url(request.url)))
fp.update(request.body or b'')
return fp.hexdigest()
Check whether the fingerprint already exists in the redis set, and insert it if not:
added = self.server.sadd(self.key, fp)
return added != 0
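The fingerprint fragment above relies on Scrapy's to_bytes and w3lib's canonicalize_url helpers; the same idea can be shown with only the standard library (URL canonicalization omitted, so this is a simplified sketch, not Scrapy's exact fingerprint):

```python
import hashlib

def request_fingerprint(method, url, body=b""):
    """Simplified request fingerprint: sha1 over method, URL and body.
    Scrapy additionally canonicalizes the URL first; that step is omitted."""
    fp = hashlib.sha1()
    fp.update(method.encode())
    fp.update(url.encode())
    fp.update(body or b"")
    return fp.hexdigest()

fp1 = request_fingerprint("GET", "https://example.com/?a=1")
fp2 = request_fingerprint("GET", "https://example.com/?a=1")
# fp1 == fp2: identical requests always map to the same 40-hex-char digest
```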
Exercise: crawling Baidu Tieba
Spider code:
When processing the information from a correct response, regular expressions are used heavily: even when Tieba returns a proper response, the HTML elements inside the page are commented out and only un-commented by JavaScript at render time, so XPath and similar techniques cannot be used.
import scrapy
from copy import deepcopy
from tieba.pipelines import HandleChromeDriver
import re


class TiebaspiderSpider(scrapy.Spider):
    name = 'tiebaspider'
    allowed_domains = ['tieba.baidu.com']
    start_urls = ['https://tieba.baidu.com/index.html']

    def start_requests(self):
        item = {}
        # Rebuild the cookie dict from the cookies captured by the pipeline helper
        cookies = self.parse_cookie()
        headers = {
            'Cache-Control': 'no-cache',
            'Host': 'tieba.baidu.com',
            'Pragma': 'no-cache',
            'sec-ch-ua': '"Google Chrome";v="89", "Chromium";v="89", ";Not A Brand";v="99"',
            'sec-ch-ua-mobile': '?0',
            'Sec-Fetch-Dest': 'document',
            'Sec-Fetch-Mode': 'navigate',
            'Sec-Fetch-Site': 'none',
            'Sec-Fetch-User': '?1',
            'Upgrade-Insecure-Requests': 1
        }
        yield scrapy.Request(
            'https://tieba.baidu.com/index.html',
            cookies=cookies,
            callback=self.parse,
            headers=headers,
            meta={"item": item}
        )

    def parse(self, response):
        # Handle the home page
        if str(response.url).find("captcha") != -1:
            HandleChromeDriver.handle_tuxing_captcha(url=str(response.url))
        item = response.meta["item"]
        grouping_list = response.xpath("//*[@id='f-d-w']/div/div/div")
        for i in grouping_list:
            group_link = "https://tieba.baidu.com" + i.xpath("./a/@href").extract_first()
            group_name = i.xpath("./a/@title").extract_first()
            item["group_link"] = group_link
            item["group_name"] = group_name
            if group_name is not None:
                yield scrapy.Request(
                    group_link,
                    callback=self.parse_detail,
                    meta={"item": deepcopy(item)}
                )

    def parse_detail(self, response):
        # Handle a category page
        detail_data = response.body.decode()
        if str(response.url).find("captcha") != -1:
            detail_data = HandleChromeDriver.handle_tuxing_captcha(url=str(response.url))
        detail_list_link = re.findall(
            '<div class="ba_info.*?">.*?<a rel="noopener" target="_blank" href="(.*?)" class="ba_href clearfix">',
            detail_data, re.S)
        detail_list_name = re.findall(
            '<div class="ba_content">.*?<p class="ba_name">(.*?)</p>', detail_data, re.S)
        item = response.meta["item"]
        for i in range(len(detail_list_link)):
            detail_link = "https://tieba.baidu.com" + detail_list_link[i]
            detail_name = detail_list_name[i]
            item["detail_link"] = detail_link
            item["detail_name"] = detail_name
            yield scrapy.Request(
                detail_link,
                callback=self.parse_data,
                meta={"item": deepcopy(item)}
            )
        # Pagination within a category
        start_parse_url = response.url[:str(response.url).find("pn=") + 3]
        start_parse_body = response.body.decode()
        last_parse_page = re.findall(r'下一页></a>.*?<a href=".*?pn=(\d+)" class="last">.*?</a>',
                                     start_parse_body, re.S)[0]
        page_parse_num = re.findall(r'<a href=".*?">(\d+)</a>', start_parse_body, re.S)[0]
        page_parse_num = int(page_parse_num) + 1
        end_parse_url = start_parse_url + str(page_parse_num)
        if page_parse_num <= int(last_parse_page):
            yield scrapy.Request(
                end_parse_url,
                callback=self.parse_detail,
                meta={"item": deepcopy(item)}
            )

    def parse_data(self, response):
        # Handle a forum's post list
        body_data = response.body.decode()
        if str(response.url).find("captcha") != -1:
            body_data = HandleChromeDriver.handle_tuxing_captcha(url=str(response.url))
        data_name = re.findall(
            '<a rel="noreferrer" href=".*?" title=".*?" target="_blank" class="j_th_tit ">(.*?)</a>',
            body_data, re.S)
        data_link = re.findall(
            '<div class="t_con cleafix">.*?<a rel="noreferrer" href="(.*?)" title=".*?" target="_blank" class="j_th_tit ">.*?</a>',
            body_data, re.S)
        item = response.meta["item"]
        for i in range(len(data_link)):
            item["data_link"] = "https://tieba.baidu.com" + data_link[i]
            item["data_name"] = data_name[i]
            yield item
        # Pagination within a forum (pn advances in steps of 50)
        temp_url_find = str(response.url).find("pn=")
        if temp_url_find == -1:
            start_detail_url = response.url + "&ie=utf-8&pn="
        else:
            start_detail_url = str(response.url)[:temp_url_find + 3]
        start_detail_body = response.body.decode()
        last_detail_page = re.findall(r'下一页></a>.*?<a href=".*?pn=(\d+)" class="last pagination-item " >尾页</a>',
                                      start_detail_body, re.S)[0]
        page_detail_num = re.findall('<span class="pagination-current pagination-item ">(.*?)</span>',
                                     start_detail_body, re.S)[0]
        page_detail_num = int(page_detail_num) * 50
        end_detail_url = start_detail_url + str(page_detail_num)
        if page_detail_num <= int(last_detail_page):
            yield scrapy.Request(
                end_detail_url,
                callback=self.parse_data,
                meta={"item": deepcopy(item)}
            )

    def parse_cookie(self):
        # Rebuild a usable cookie dict from the cookies selenium saved to cookie.txt
        lis = []
        lst_end = {}
        lis_link = ["BAIDUID", "PSTM", "BIDUPSID", "__yjs_duid", "BDORZ", "BDUSS", "BAIDUID_BFESS", "H_PS_PSSID",
                    "bdshare_firstime", "BDUSS_BFESS", "NO_UNAME", "tb_as_data", "STOKEN", "st_data",
                    "Hm_lvt_287705c8d9e2073d13275b18dbd746dc", "Hm_lvt_98b9d8c2fd6608d564bf2ac2ae642948", "st_key_id",
                    "ab_sr", "st_sign"]
        with open("./cookie.txt", "r+", encoding="utf-8") as f:
            s = f.read()
        t = s.strip("[").strip("]").replace("'", "")
        while True:
            num = t.find("}, ")
            if num != -1:
                lis.append({i.split(": ")[0]: i.split(": ")[1] for i in t[:num].strip("{").split(", ")})
                t = t.replace(t[:num + 3], "")
            else:
                break
        cookie1 = "BAIDUID_BFESS = 9250B568D2AF5E8D7C24501FD8947F10:FG=1; BDRCVFR[feWj1Vr5u3D] = I67x6TjHwwYf0; ZD_ENTRY = baidu; BA_HECTOR = 0l2ha48g00ah842l6k1g46ooh0r; H_PS_PSSID = 33518_33358_33272_31660_33595_33393_26350; delPer = 0; PSINO = 5; NO_UNAME = 1; BIDUPSID = 233AE38C1766688048F6AA80C4F0D56C; PSTM = 1614821745; BAIDUID = 233AE38C176668807122431B232D9927:FG=1; BDORZ = B490B5EBF6F3CD402E515D22BCDA1598"
        cookie2 = {i.split(" = ")[0]: i.split(" = ")[-1] for i in cookie1.split("; ")}
        for i in lis_link:
            for j in lis:
                if j["name"] == i:
                    lst_end[i] = j["value"]
            for z in cookie2:
                if i == z:
                    lst_end[i] = cookie2[i]
        return lst_end
Pipeline code:
This is mainly about persisting the data into a CSV file, plus a utility class with two static methods: one automates logging in to Tieba so that a complete, valid set of cookies can be captured and carried by later requests, and one handles the image captcha Tieba throws up while crawling (a captcha that is not easy even for a human...).
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
from selenium import webdriver
import time
import csv


class TiebaPipeline:
    def process_item(self, item, spider):
        with open('./tieba.csv', 'a+', encoding='utf-8') as file:
            fieldnames = ['group_link', 'group_name', 'detail_link', 'detail_name', 'data_link', 'data_name']
            writer = csv.DictWriter(file, fieldnames=fieldnames)
            writer.writerow(item)
        return item

    def open_spider(self, spider):
        with open('./tieba.csv', 'w+', encoding='utf-8') as file:
            fieldnames = ['group_link', 'group_name', 'detail_link', 'detail_name', 'data_link', 'data_name']
            writer = csv.DictWriter(file, fieldnames=fieldnames)
            writer.writeheader()
        HandleChromeDriver.handle_cookie(url="http://tieba.baidu.com/f/user/passport?jumpUrl=http://tieba.baidu.com")


class HandleChromeDriver:
    @staticmethod
    def handle_cookie(url):
        # Log in through a real browser and dump the resulting cookies to a file
        driver = webdriver.Chrome(r"E:\python_study\spider\data\chromedriver_win32\chromedriver.exe")
        driver.implicitly_wait(2)
        driver.get(url)
        driver.implicitly_wait(2)
        login_pwd = driver.find_element_by_xpath('//*[@id="TANGRAM__PSP_4__footerULoginBtn"]')
        login_pwd.click()
        username = driver.find_element_by_xpath('//*[@id="TANGRAM__PSP_4__userName"]')
        pwd = driver.find_element_by_xpath('//*[@id="TANGRAM__PSP_4__password"]')
        login_btn = driver.find_element_by_xpath('//*[@id="TANGRAM__PSP_4__submit"]')
        time.sleep(1)
        username.send_keys("18657589370")
        time.sleep(1)
        pwd.send_keys("tcc062556")
        time.sleep(1)
        login_btn.click()
        time.sleep(15)
        tb_cookie = str(driver.get_cookies())
        with open("./cookie.txt", "w+", encoding="utf-8") as f:
            f.write(tb_cookie)
        driver.close()

    @staticmethod
    def handle_tuxing_captcha(url):
        # Open the captcha page in a real browser, give a human time to solve it
        drivers = webdriver.Chrome(r"E:\python_study\spider\data\chromedriver_win32\chromedriver.exe")
        drivers.implicitly_wait(2)
        drivers.get(url)
        drivers.implicitly_wait(2)
        time.sleep(10)
        # Capture the page source before closing the browser
        page_source = drivers.page_source
        drivers.close()
        return page_source
Settings code:
This mainly sets some request header information and a two-second pause between requests.
BOT_NAME = 'tieba'
SPIDER_MODULES = ['tieba.spiders']
NEWSPIDER_MODULE = 'tieba.spiders'
LOG_LEVEL = 'WARNING'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.72 Safari/537.36 Edg/89.0.774.45'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 2
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
    'Connection': 'keep-alive',
}
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'tieba.pipelines.TiebaPipeline': 300,
}