This post is just a set of personal notes.
Scrapy official site
Scrapy official documentation
Scrapy documentation in Chinese
My personal ScrapyDemo project repository
Python environment setup
- On Windows:
  - python: download the Python installer and run it
  - pip: easy_install pip
- On macOS:
  - python: macOS ships with Python 2.7
  - pip: easy_install pip
- On CentOS 7:
  - python: CentOS 7 ships with Python 2.7
  - pip: easy_install pip
Install Scrapy
pip install Scrapy
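To confirm the installation, you can print the installed release:

scrapy version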
Create a project
scrapy startproject <project_name>
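For reference, scrapy startproject ScrapyDemo generates a skeleton roughly like this (layout as generated by Scrapy 1.5; it can differ between versions):

ScrapyDemo/
    scrapy.cfg            # deploy configuration
    ScrapyDemo/           # the project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider and downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # spider modules go here
            __init__.py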
Create a spider
scrapy genspider <spider_name> <host_name>
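For example, running scrapy genspider demo www.gazetaesportiva.com inside the project should produce a skeleton in spiders/demo.py along these lines (the exact template varies by Scrapy version):

# -*- coding: utf-8 -*-
import scrapy


class DemoSpider(scrapy.Spider):
    name = 'demo'
    allowed_domains = ['www.gazetaesportiva.com']
    start_urls = ['http://www.gazetaesportiva.com/']

    def parse(self, response):
        pass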
Create a requirements.txt file in the project root and add the packages you need, for example:
Scrapy==1.5.0
beautifulsoup4==4.6.0
requests==2.18.4
Set up the project environment
pip install -r requirements.txt
Run a single spider
scrapy crawl <spider_name>
Run multiple spiders. Scrapy does not support launching several spiders from a single command-line invocation, so create a new Python file with the following content and run that file instead (adjust the imports and spider classes to your project):
# -*- coding: utf-8 -*-
import sys

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from ScrapyDemo.spiders.news_estadao import EstadaoSpider
from ScrapyDemo.spiders.news_gazetaesportiva import DemoSpider
from ScrapyDemo.spiders.news_megacurioso import MegacuriosoSpider

# Python 2 only: force the default encoding to UTF-8
if sys.getdefaultencoding() != 'utf-8':
    reload(sys)
    sys.setdefaultencoding('utf-8')

# One CrawlerProcess can schedule several spiders; they share the project settings
process = CrawlerProcess(get_project_settings())
process.crawl(EstadaoSpider)
process.crawl(DemoSpider)
process.crawl(MegacuriosoSpider)
process.start()  # blocks until all spiders finish
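Save the script at the project root (next to scrapy.cfg) under any name, say run_all.py (the name is arbitrary), and launch all spiders with:

python run_all.py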
Enable pipelines to process the results
- Open settings.py and add your pipeline under the ITEM_PIPELINES setting, assigning it a priority from 0 to 1000; lower numbers run first.
Write a single spider's results to a file
scrapy crawl demo -o /path/to/demo.json
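The output format is inferred from the file extension, so other feed formats work the same way, for example:

scrapy crawl demo -o /path/to/demo.csv
scrapy crawl demo -o /path/to/demo.xml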
Merging the results of several spiders:
- The multi-spider script above cannot merge the results of its spiders
- Business requirements left me with a second-best workaround
- Idea: use the commands library to run Linux commands that execute the spiders one after another and write their output to files, then read the files back and parse them into objects for merging
- Code (adjust as needed):
#!/usr/bin/env python
# encoding: utf-8
import json

import commands  # Python 2 only; use subprocess on Python 3


def test():
    result = []
    try:
        commands.getoutput("echo '' > /path/to/demo.json")  # clear the previous run's output
        commands.getoutput("scrapy crawl demo -o /path/to/demo.json")  # run the spider, writing results to the file
        result = json.loads(commands.getoutput("cat /path/to/demo.json"))  # read the results back as objects
    except:
        print "Get demo results error."
    return result
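The commands module only exists on Python 2; on Python 3 the same idea can be sketched with subprocess (same hypothetical /path/to/demo.json output path):

#!/usr/bin/env python3
# encoding: utf-8
import json
import os
import subprocess


def test():
    result = []
    out_path = '/path/to/demo.json'  # hypothetical path, as in the Python 2 version
    try:
        if os.path.exists(out_path):
            os.remove(out_path)  # -o appends, so drop the previous run's output first
        subprocess.run(['scrapy', 'crawl', 'demo', '-o', out_path], check=True)
        with open(out_path) as f:
            result = json.load(f)  # parse the crawl results back into Python objects
    except (subprocess.CalledProcessError, OSError, ValueError) as e:
        print('Get demo results error: %s' % e)
    return result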
Fixing garbled (mojibake) spider output:
- Add the following at the top of the affected Python file (Python 2 only):

import sys

if sys.getdefaultencoding() != 'utf-8':
    reload(sys)
    sys.setdefaultencoding('utf-8')
Spider examples (you can also use the GitHub link given at the top):
- Item example (items.py):
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class ScrapydemoItem(scrapy.Item):
    title = scrapy.Field()
    imageUrl = scrapy.Field()
    des = scrapy.Field()
    source = scrapy.Field()
    actionUrl = scrapy.Field()
    contentType = scrapy.Field()
    itemType = scrapy.Field()
    createTime = scrapy.Field()
    country = scrapy.Field()
    headUrl = scrapy.Field()
- Pipeline example (pipelines.py):
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import json

from ScrapyDemo.items import ScrapydemoItem


class ScrapydemoPipeline(object):
    DATA_LIST_NEWS = []

    def open_spider(self, spider):
        # Reset the buffer when the spider starts
        self.DATA_LIST_NEWS = []
        print 'Spider start.'

    def process_item(self, item, spider):
        # Collect every scraped item into the buffer
        if isinstance(item, ScrapydemoItem):
            self.DATA_LIST_NEWS.append(dict(item))
        return item

    def close_spider(self, spider):
        # Dump everything collected as JSON when the spider finishes
        print json.dumps(self.DATA_LIST_NEWS)
        print 'Spider end.'
- Spider example (demo.py):
# -*- coding: utf-8 -*-
import scrapy

from ScrapyDemo.items import ScrapydemoItem


class DemoSpider(scrapy.Spider):
    name = 'news_gazetaesportiva'
    allowed_domains = ['www.gazetaesportiva.com']
    start_urls = ['https://www.gazetaesportiva.com/noticias/']
    # Request headers that impersonate a regular browser
    headers = {
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
        'accept-language': 'zh-CN,zh;q=0.9,en-US;q=0.8,en;q=0.7',
        'cache-control': 'max-age=0',
        'upgrade-insecure-requests': '1',
        'User-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36'
    }

    def parse(self, response):
        print('Start parse.')
        # Each news entry on the page is an <article> element
        for element in response.xpath('//article'):
            title = element.xpath(".//h3[@class='entry-title no-margin']/a/text()").extract_first()
            imageUrl = [element.xpath(".//img[@class='medias-object wp-post-image']/@src").extract_first()]
            des = element.xpath(".//div[@class='entry-content space']/text()").extract_first()
            source = 'gazeta'
            actionUrl = element.xpath(".//a[@class='blog-image']/@href").extract_first()
            contentType = ''
            itemType = ''
            createTime = element.xpath(".//small[@class='updated']/text()").extract_first()
            country = 'PZ'
            headUrl = ''
            # Only yield items that have the required fields
            if title is not None and title != "" and actionUrl is not None and actionUrl != "" and imageUrl is not None and imageUrl != "":
                item = ScrapydemoItem()
                item['title'] = title
                item['imageUrl'] = imageUrl
                item['des'] = des
                item['source'] = source
                item['actionUrl'] = actionUrl
                item['contentType'] = contentType
                item['itemType'] = itemType
                item['createTime'] = createTime
                item['country'] = country
                item['headUrl'] = headUrl
                yield item
        print('End parse.')
- My notes on the code:
  - settings.py holds the shared configuration, including the pipelines that aggregate spider results; for example (the larger the number, the lower the priority; values range from 0 to 1000):
ITEM_PIPELINES = {
    'ScrapyDemo.pipelines.ScrapydemoPipeline': 300,
}
- With pipelines configured, a spider run from the command line first calls open_spider, then process_item for every parsed item, and finally close_spider when the spider finishes.
- The items file defines and describes the result objects.
- The spider files that scrapy genspider creates under spiders/ configure the pages to fetch and the request headers to spoof. Once a page is downloaded, the response is handed to the parse method, where you can extract data with XPath (see the links at the top for the details). My workflow: use Chrome to locate the target tag, copy the page source, find the tag's position, and then write the matching rule. A rule rarely matches on the first try, so iterate on it with the debugger (or the shell shown below) rather than trying to write it in one pass.
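Scrapy's built-in interactive shell is also convenient for iterating on XPath rules; for example, against the demo spider's start URL (the XPath is taken from the spider above):

scrapy shell 'https://www.gazetaesportiva.com/noticias/'
>>> response.xpath("//article//h3[@class='entry-title no-margin']/a/text()").extract_first()
>>> response.xpath("//article//a[@class='blog-image']/@href").extract_first()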
- Debugging spiders in PyCharm:
  - If some packages fail to install after opening PyCharm, you can use a virtual environment (screenshots omitted).
  - Run Scrapy under the debugger (screenshots omitted).
  - Once execution stops at a breakpoint, right-click and choose Evaluate Expression (screenshot omitted).
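Since the original screenshots are gone, here is the usual way to make a Scrapy spider debuggable in PyCharm: add a small entry script (a hypothetical main.py at the project root) and point the run/debug configuration at it; breakpoints inside the spider's parse() method are then hit normally.

# -*- coding: utf-8 -*-
# main.py - set this file as the PyCharm run/debug configuration target
from scrapy import cmdline

# Equivalent to running "scrapy crawl news_gazetaesportiva" in a shell
cmdline.execute('scrapy crawl news_gazetaesportiva'.split())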