Table of Contents
- Python Web Scraping: Scrapy Framework Usage Explained
- 1. Scrapy framework commands
- 2. Scrapy project file structure
- 2.1 sample_spider
- 2.2 items
- 2.3 middlewares
- 2.4 pipelines
- 2.5 settings
- 2.6 main
Python Web Scraping: Scrapy Framework Usage Explained
1. Scrapy framework commands
<<Global commands>> (global commands can be used anywhere)
1. scrapy startproject <project_name> # Create a new scraping project
2. scrapy genspider [-t template] <name> <domain> # Create a new spider file
3. scrapy runspider <spider_file.py> # Start a spider directly from a .py file
4. scrapy shell [url] # Open the interactive scrapy shell (handy for debugging Selectors)
5. scrapy fetch <url> # Download the page source via the scrapy downloader and print it
6. scrapy settings # View the project settings
7. scrapy version # Show the Scrapy version
8. scrapy view <url> # Download the page's document content and open it in a browser (useful for telling whether content comes from an Ajax request)
<<Project commands>> (project commands can only be used inside a project directory)
1. scrapy crawl <spider> # Start a spider
2. scrapy list # List all spiders in the project
3. scrapy check # Check the project code for errors
4. scrapy edit <spider> # Edit a spider
5. scrapy parse <url> # Fetch a URL and parse it with the spider, for debugging (see the example below)
6. scrapy bench # Run a quick speed benchmark
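For example, the parse command can step through a single callback straight from the command line; the URL and spider name below assume the example project created in the next listing:
scrapy parse http://quotes.toscrape.com/ --spider=sample_spider -c parse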
<<Project creation example>> (if a command is reported as not recognised, prefix it with "python -m")
1. scrapy startproject example # Create a project named example
2. cd example # Switch into the project directory
3. scrapy genspider sample_spider www.sample.com # Create a spider file named sample_spider with www.sample.com as its initial domain
4. scrapy crawl sample_spider # Run the sample_spider spider
5. scrapy crawl sample_spider -o sample.json # Save the results to a JSON file (csv, xml, pickle, marshal, ftp and other targets are also supported)
scrapy crawl sample_spider --nolog # Do not print the log
scrapy fetch <url> --headers # Print the response headers instead of the body
scrapy fetch <url> --no-redirect # Do not follow HTTP redirects
<<Shell debugging>>
python3 -m scrapy shell
response.selector.xpath('')
response.selector.css('')
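As a concrete illustration, a quick interactive session against the quotes site used later in this article might look like this (the selectors assume the page layout described in section 2.1):
python3 -m scrapy shell http://quotes.toscrape.com/
>>> response.status                                              # 200 if the fetch succeeded
>>> response.css('.quote .text::text').extract_first()           # text of the first quote
>>> response.xpath('//small[@class="author"]/text()').extract()  # all author names on the page
>>> fetch('http://quotes.toscrape.com/page/2/')                  # fetch another page in the same shell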
2. Scrapy project file structure
Create a scrapy project named example and add a sample_spider.py to it.
The file structure is as follows:
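This is the standard layout produced by scrapy startproject; the exact location of main.py (described in section 2.6) is a choice, here it sits next to scrapy.cfg:
example/
├── scrapy.cfg                 # deployment configuration
├── main.py                    # launcher script added by hand (section 2.6)
└── example/
    ├── __init__.py
    ├── items.py               # data structure definitions (section 2.2)
    ├── middlewares.py         # spider / downloader middlewares (section 2.3)
    ├── pipelines.py           # item pipelines for filtering and storage (section 2.4)
    ├── settings.py            # project settings (section 2.5)
    └── spiders/
        ├── __init__.py
        └── sample_spider.py   # the spider created with "scrapy genspider" (section 2.1)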
The following walks through each file in the scrapy project.
Usage notes are written as comments inside the code itself; enjoy, and if you find it useful feel free to give it a like 🙂
2.1 sample_spider
"""Spider類基礎屬性和方法
屬性: 含義:
name 爬蟲名稱,它必須是唯一的,用來啟動爬蟲
allowed_domains 允許爬取的域名,是可選配置
start_urls 起始URL清單,當沒有重寫start_requests()方法時,預設使用這個清單
custom_settings 它是一個字典,專屬與本Spider的配置,此設定會覆寫項目全局的設定,必須在初始化前被更新,必須定義成類變量
spider 它是由from_crawler()方法設定的,代表本Spider類對應的Crawler對象,可以擷取項目的全局配置資訊
settings 它是一個Settings對象,我們可以直接擷取項目的全局設定變量
方法: 含義:
start_requests() 生成初始請求,必須傳回一個可疊代對象,預設使用start_urls裡的URL和GET請求,如需使用POST需要重寫此方法
parse() 當Response沒有指定回調函數時,該方法會預設被調用,該函數必須要傳回一個包含Request或Item的可疊代對象
closed() 當Spider關閉時,該方法會被調用,可以在這裡定義釋放資源的一些操作或其他收尾操作
Request屬性:
meta 可以利用Request請求傳入參數,在Response中可以取值,是一個字典類型
cookies 可以傳入cookies資訊,是一個字典類型
dont_filter 如果使用POST,需要多次送出表單,且URL一樣,那麼就必須設定為True,防止被當成重複網頁過濾掉
"""
# -*- coding: utf-8 -*-
import scrapy
from ..items import ExampleItem
from scrapy.http import Request, FormRequest
from scrapy import Selector
__author__ = 'Evan'
class SampleSpider(scrapy.Spider):
    name = 'sample_spider'  # spider name; must be unique within the project
    allowed_domains = ['quotes.toscrape.com']  # domains the spider is allowed to crawl
    start_urls = ['http://quotes.toscrape.com/']  # start URLs
    """To change the initial requests, override start_requests(); it must return an iterable:
    def start_requests(self):
        return [Request(url=self.start_urls[0], callback=self.parse)]
    or
        yield Request(url=self.start_urls[0], callback=self.parse)
    """
    def parse(self, response):
        """
        Called by default when a Response has no callback assigned
        :param response: from the start_requests() function
        :return: must return an iterable containing Request and/or Item objects
        """
        # TODO Request attributes
        # print(response.request.url)  # URL of the Request
        # print(response.request.headers)  # headers of the Request
        # print(response.request.headers.getlist('Cookie'))  # cookies of the Request
        # TODO Response attributes
        # print(response.text)  # HTML of the Response
        # print(response.body)  # HTML of the Response as bytes
        # print(response.url)  # URL of the Response
        # print(response.headers)  # headers of the Response
        # print(response.headers.getlist('Set-Cookie'))  # cookies of the Response
        # json.loads(response.text)  # parse AJAX (JSON) data into a dict
        # TODO Using the Selector class
        # selector = Selector(response=response)  # initialise from a Response
        # selector = Selector(text=div)  # initialise from an HTML string
        # selector.xpath('//a/text()').extract()  # XPath selector; returns a list
        # selector.xpath('//a/text()').re('Name:\s(.*)')  # XPath selector + regex; returns the list of matched groups
        # selector.xpath('//a/text()').re_first('Name:\s(.*)')  # XPath selector + regex; returns the first match
        # TODO Reading the project-wide configuration from settings.py
        # print(self.settings.get('USER_AGENT'))
        quotes = response.css('.quote')  # CSS selector; returns a SelectorList
        for quote in quotes:
            # ::text        extracts the text
            # ::attr(src)   extracts the value of the src attribute
            item = ExampleItem()  # create a fresh item for each quote
            item['text'] = quote.css('.text::text').extract_first()  # first matched result
            item['author'] = quote.css('.author::text').extract_first()
            item['tags'] = quote.css('.tags .tag::text').extract()  # list of all matched results
            yield item
        next_url = response.css('.pager .next a::attr("href")').extract_first()  # URL of the next page
        url = response.urljoin(next_url)  # build an absolute URL
        yield Request(url=url, callback=self.parse)  # set the callback and crawl every page in turn
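The Request attributes listed in the docstring (meta, cookies, dont_filter) and the POST form of start_requests() are not exercised by this spider. Below is a minimal sketch of how they could be used; the login URL, form field names and values are placeholders for illustration, not part of the example site:
import scrapy
from scrapy.http import FormRequest

class LoginSampleSpider(scrapy.Spider):
    """Sketch only: shows meta, cookies, dont_filter and a POST via start_requests()."""
    name = 'login_sample'
    start_urls = ['http://quotes.toscrape.com/login']  # placeholder login page

    def start_requests(self):
        # POST a login form instead of issuing the default GET requests
        yield FormRequest(url=self.start_urls[0],
                          formdata={'username': 'evan', 'password': 'secret'},  # placeholder credentials
                          cookies={'session': 'abc123'},  # cookies sent with this request
                          meta={'retry': 0},              # extra data, read back as response.meta['retry']
                          dont_filter=True,               # the same URL may be submitted more than once
                          callback=self.after_login)

    def after_login(self, response):
        self.logger.info('login response %s, retry=%s', response.status, response.meta['retry'])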
2.2 items
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy
class ExampleItem(scrapy.Item):
"""
定義資料結構
"""
text = scrapy.Field()
author = scrapy.Field()
tags = scrapy.Field()
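An Item instance behaves much like a dict, which is how the spider in 2.1 and the pipelines in 2.4 use it. A small illustration (values are made up, and the import assumes the project package is named example as above):
from example.items import ExampleItem

item = ExampleItem(text='A quote', author='Somebody')
item['tags'] = ['life']      # fields are set and read with dict syntax
print(item.get('author'))    # -> 'Somebody'
print(dict(item))            # plain dict, handy for storage back-ends such as MongoDB
# item['unknown'] = 'x'      # would raise KeyError: only declared Fields are accepted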
2.3 middlewares
# -*- coding: utf-8 -*-
# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
import random
from scrapy import signals
class RandomUserAgentMiddleware(object):
    """
    Custom downloader middleware: picks a random User-Agent for every request
    """
    def __init__(self):
        self.user_agents = [
            # Chrome UA
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)'
            ' Chrome/73.0.3683.75 Safari/537.36',
            # IE UA
            'Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko',
            # Microsoft Edge UA
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)'
            ' Chrome/64.0.3282.140 Safari/537.36 Edge/18.17763'
        ]

    def process_request(self, request, spider):
        """
        Assign a random User-Agent header to the outgoing request
        :param request:
        :param spider:
        :return:
        """
        request.headers['User-Agent'] = random.choice(self.user_agents)
class ExampleSpiderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spider.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        """
        Called when a Response is being processed by the Spider Middleware
        :param response:
        :param spider:
        :return:
        """
        # Called for each response that goes through the spider
        # middleware and into the spider.
        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        """
        Called with the results the Spider returns after processing a Response
        :param response:
        :param result:
        :param spider:
        :return:
        """
        # Called with the results returned from the Spider, after
        # it has processed the response.
        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.
        # Should return either None or an iterable of Request, dict
        # or Item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        """
        Called with the start Requests of the Spider; works like process_spider_output(),
        except that there is no associated Response and only Requests may be yielded
        :param start_requests:
        :param spider:
        :return:
        """
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn't have a response associated.
        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
class ExampleDownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spider.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        """
        Called before each request is sent to the Downloader; this is the place to change the
        User-Agent, handle redirects, set a proxy, retry failures, set cookies, and so on
        (see the proxy sketch after this class)
        :param request:
        :param spider:
        :return: if a Request is returned, it is put back on the scheduling queue to be scheduled later
        """
        # Called for each request that goes through the downloader
        # middleware.
        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        """
        Called before the Response is handed to the Spider for parsing; the response can be modified here
        :param request:
        :param response:
        :param spider:
        :return: if a Request is returned, it is put back on the scheduling queue to be scheduled later
        """
        # Called with the response returned from the downloader.
        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        """
        Called when the Downloader or process_request() raises an exception
        :param request:
        :param exception:
        :param spider:
        :return: if a Request is returned, it is put back on the scheduling queue to be scheduled later
        """
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.
        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
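process_request() is also the usual place to plug in a proxy. A minimal sketch, meant to live in the same middlewares.py (it reuses the random import at the top; the proxy addresses are placeholders, and the class would need to be registered in DOWNLOADER_MIDDLEWARES just like the User-Agent middleware above):
class RandomProxyMiddleware(object):
    """Sketch: route every request through a randomly chosen proxy."""

    def __init__(self):
        self.proxies = ['http://127.0.0.1:8888', 'http://127.0.0.1:8889']  # placeholder proxy pool

    def process_request(self, request, spider):
        request.meta['proxy'] = random.choice(self.proxies)  # picked up by Scrapy's built-in HttpProxyMiddleware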
2.4 pipelines
# -*- coding: utf-8 -*-
import pymongo
from scrapy.exceptions import DropItem
class TextPipeline(object):
    """
    Custom pipeline: trims over-long quote text
    """
    def __init__(self):
        self.limit = 50

    def process_item(self, item, spider):
        """
        Required method; the Pipeline calls it for every item
        :param item:
        :param spider:
        :return: must return an Item (or dict) or raise a DropItem exception
        """
        if item['text']:
            if len(item['text']) > self.limit:  # truncate strings longer than 50 characters
                item['text'] = item['text'][:self.limit].rstrip() + '...'
            return item
        else:
            raise DropItem('Missing Text')  # raising this discards the item; no further pipelines process it
class MongoPipeline(object):
    """
    Custom pipeline: stores items in MongoDB
    """
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db
        self.client = None
        self.db = None

    @classmethod
    def from_crawler(cls, crawler):
        """
        Pipeline setup; through crawler every project-wide setting can be read
        :param crawler:
        :return: a class instance
        """
        # Class method that returns an instance initialised with MONGO_URI and MONGO_DB
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),  # MONGO_URI is read from settings.py
            mongo_db=crawler.settings.get('MONGO_DB')  # MONGO_DB is read from settings.py
        )

    def open_spider(self, spider):
        """
        Called when the Spider is opened
        :param spider:
        :return:
        """
        self.client = pymongo.MongoClient(self.mongo_uri)  # open the MongoDB connection
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        """
        Required method; the Pipeline calls it for every item
        :param item:
        :param spider:
        :return: must return an Item (or dict) or raise a DropItem exception
        """
        name = item.__class__.__name__  # collection name, here 'ExampleItem'
        self.db[name].update_one(dict(item), {'$set': dict(item)}, upsert=True)  # upsert to avoid duplicates
        return item

    def close_spider(self, spider):
        """
        Called when the Spider is closed
        :param spider:
        :return:
        """
        self.client.close()  # close the MongoDB connection
2.5 settings
# -*- coding: utf-8 -*-
# Scrapy settings for example project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'example'
SPIDER_MODULES = ['example.spiders']
NEWSPIDER_MODULE = 'example.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
# TODO Set the default User-Agent request header
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) ' \
'Chrome/78.0.3904.108 Safari/537.36'
# Obey robots.txt rules
# TODO Do not obey the robots.txt protocol
ROBOTSTXT_OBEY = False
# TODO Set the export encoding
FEED_EXPORT_ENCODING = 'utf-8' # keeps non-ASCII text (e.g. Chinese) readable in JSON output
# FEED_EXPORT_ENCODING = 'gb18030' # encoding for Chinese text in CSV output
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
# TODO If set to True, cookies can be added manually to each Request
COOKIES_ENABLED = True
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'example.middlewares.ExampleSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# TODO Enable the downloader middleware that sets a random User-Agent
DOWNLOADER_MIDDLEWARES = {
'example.middlewares.RandomUserAgentMiddleware': 543,
}
# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# TODO Enable the item pipelines that filter and store the data (the lower the value, the higher the priority: here 300 runs before 400)
ITEM_PIPELINES = {
'example.pipelines.TextPipeline': 300,
'example.pipelines.MongoPipeline': 400,
}
# TODO MongoDB configuration (read by MongoPipeline.from_crawler())
MONGO_URI = 'localhost'
MONGO_DB = 'example'
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
2.6 main
After all the scrapy code is written, create a main.py to launch scrapy (simply run main.py directly).
# -*- coding: utf-8 -*-
from scrapy import cmdline

# TODO Run the crawl command (pick one of the lines below; cmdline.execute() does not return,
#      because the process exits when the crawl finishes, so a second call would never run)
cmdline.execute("scrapy crawl sample_spider".split())  # run the sample_spider spider
# cmdline.execute("scrapy crawl sample_spider -o sample.json".split())  # run it and write the results to sample.json in the current directory