
Scrapy (Python Crawler Framework) Beginner Notes

This article is intended only as personal notes.

Scrapy official site

Scrapy official documentation

Scrapy Chinese documentation

Personal ScrapyDemo project repository

Python environment setup

  • Windows:
    • Python: download the Python installer and run it
    • pip: easy_install pip
  • macOS:
    • Python: macOS ships with Python 2.7
    • pip: easy_install pip
  • CentOS 7:
    • Python: CentOS 7 ships with Python 2.7
    • pip: easy_install pip

Scrapy installation

pip install Scrapy
           
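Once installed, you can verify the setup from the command line:

scrapy version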

Create a project

scrapy startproject <project_name>
           
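For reference, startproject generates roughly the following layout (shown here for a project named ScrapyDemo; yours will match <project_name>):

ScrapyDemo/
    scrapy.cfg            # deploy configuration
    ScrapyDemo/           # the project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider and downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # spiders live here
            __init__.py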

Create a spider

scrapy genspider <spider_name> <host_name>
           
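For example, scrapy genspider demo www.example.com creates spiders/demo.py with roughly this skeleton (the names here are placeholders):

# -*- coding: utf-8 -*-
import scrapy


class DemoSpider(scrapy.Spider):
    name = 'demo'
    allowed_domains = ['www.example.com']
    start_urls = ['http://www.example.com/']

    def parse(self, response):
        pass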

Create a requirements.txt file in the project root and add the packages you need, for example:

Scrapy==1.5.0
beautifulsoup4==4.6.0
requests==2.18.4
           

Set up the project environment

pip install -r requirements.txt
           

Run a single spider

scrapy crawl <spider_name>
           
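To see which spider names are available in the project, use:

scrapy list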

Run multiple spiders (Scrapy does not support running several spiders from a single command-line invocation; instead, create a new Python file with the following content and run it, adjusting as needed):

# -*- coding: utf-8 -*-
import sys
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from ScrapyDemo.spiders.news_estadao import EstadaoSpider
from ScrapyDemo.spiders.news_gazetaesportiva import DemoSpider
from ScrapyDemo.spiders.news_megacurioso import MegacuriosoSpider

# Python 2 only: force the default encoding to UTF-8
if sys.getdefaultencoding() != 'utf-8':
    reload(sys)
    sys.setdefaultencoding('utf-8')

# Run all three spiders in one process, using the project settings
process = CrawlerProcess(get_project_settings())
process.crawl(EstadaoSpider)
process.crawl(DemoSpider)
process.crawl(MegacuriosoSpider)
process.start()  # blocks until every spider has finished
           

Enable pipelines to process results

  • Open settings.py and add your pipeline under ITEM_PIPELINES with a priority value from 0 to 1000; lower numbers run earlier (see the ITEM_PIPELINES example later in these notes)

Write a single spider's results to a file

scrapy crawl demo -o /path/to/demo.json
           
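The feed exporter infers the format from the file extension, so other formats work the same way:

scrapy crawl demo -o /path/to/demo.jl    # JSON lines
scrapy crawl demo -o /path/to/demo.csv   # CSV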

Merging the results of multiple spiders:

  • The multi-spider script above does not merge the results of the individual spiders
  • Because the business required it, I settled for a workaround
    • Idea: use the commands library to run the Linux commands in sequence, writing each spider's output to a file, then read the files back and parse them into objects for merged processing
    • Code (adjust as needed):

      #!/usr/bin/env python
      # encoding: utf-8
      import commands  # Python 2 only
      import json

      def test():
          result = []
          try:
              # Clear the previous run's output (scrapy -o appends to existing files)
              commands.getoutput('echo "" > /path/to/demo.json')
              # Run the spider and write its results to the file
              commands.getoutput('scrapy crawl demo -o /path/to/demo.json')
              # Read the results back and parse them into Python objects
              result = json.loads(commands.getoutput('cat /path/to/demo.json'))
          except:
              print 'Get demo results error.'
          return result
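      To actually mix the spiders' results, the same reader can loop over every output file and combine the parsed lists (a minimal sketch; the paths are placeholders):

      import commands
      import json

      def collect(paths):
          result = []
          for path in paths:
              try:
                  # Each file holds one spider's JSON output
                  result.extend(json.loads(commands.getoutput('cat ' + path)))
              except:
                  print 'Parse %s error.' % path
          return result

      items = collect(['/path/to/demo.json', '/path/to/megacurioso.json'])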

Fixing garbled text in spider results:

  • Add the following at the top of any Python file with encoding problems (Python 2 only; sys must be imported first):

    import sys

    if sys.getdefaultencoding() != 'utf-8':
        reload(sys)
        sys.setdefaultencoding('utf-8')

Spider examples (you can also use the GitHub link given at the top of this article):

  • Item example (items.py):

    # -*- coding: utf-8 -*-

    # Define here the models for your scraped items
    #
    # See documentation in:
    # https://doc.scrapy.org/en/latest/topics/items.html

    import scrapy


    class ScrapydemoItem(scrapy.Item):
        title = scrapy.Field()
        imageUrl = scrapy.Field()
        des = scrapy.Field()
        source = scrapy.Field()
        actionUrl = scrapy.Field()
        contentType = scrapy.Field()
        itemType = scrapy.Field()
        createTime = scrapy.Field()
        country = scrapy.Field()
        headUrl = scrapy.Field()

  • Pipeline example (pipelines.py):

    # -*- coding: utf-8 -*-

    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

    from ScrapyDemo.items import ScrapydemoItem
    import json


    class ScrapydemoPipeline(object):
        DATA_LIST_NEWS = []

        def open_spider(self, spider):
            # Reset the buffer each time a spider starts
            self.DATA_LIST_NEWS = []
            print 'Spider start.'

        def process_item(self, item, spider):
            if isinstance(item, ScrapydemoItem):
                self.DATA_LIST_NEWS.append(dict(item))
            return item

        def close_spider(self, spider):
            # Dump everything this spider collected as JSON
            print json.dumps(self.DATA_LIST_NEWS)
            print 'Spider end.'
  • Spider example (demo.py):

    # -*- coding: utf-8 -*-
    import scrapy
    from ScrapyDemo.items import ScrapydemoItem


    class DemoSpider(scrapy.Spider):
        name = 'news_gazetaesportiva'
        allowed_domains = ['www.gazetaesportiva.com']
        start_urls = ['https://www.gazetaesportiva.com/noticias/']
        headers = {
            'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
            'accept-language': 'zh-CN,zh;q=0.9,en-US;q=0.8,en;q=0.7',
            'cache-control': 'max-age=0',
            'upgrade-insecure-requests': '1',
            'User-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36'
        }

        def parse(self, response):
            print('Start parse.')
            for element in response.xpath('//article'):
                title = element.xpath(".//h3[@class='entry-title no-margin']/a/text()").extract_first()
                imageUrl = [element.xpath(".//img[@class='medias-object wp-post-image']/@src").extract_first()]
                des = element.xpath(".//div[@class='entry-content space']/text()").extract_first()
                source = 'gazeta'
                actionUrl = element.xpath(".//a[@class='blog-image']/@href").extract_first()
                contentType = ''
                itemType = ''
                createTime = element.xpath(".//small[@class='updated']/text()").extract_first()
                country = 'PZ'
                headUrl = ''
                # Skip entries that are missing a title, link or image
                if title and actionUrl and imageUrl[0]:
                    item = ScrapydemoItem()
                    item['title'] = title
                    item['imageUrl'] = imageUrl
                    item['des'] = des
                    item['source'] = source
                    item['actionUrl'] = actionUrl
                    item['contentType'] = contentType
                    item['itemType'] = itemType
                    item['createTime'] = createTime
                    item['country'] = country
                    item['headUrl'] = headUrl
                    yield item
            print('End parse.')
  • My understanding of the code:
    • settings.py holds the shared configuration, including the ITEM_PIPELINES setting that routes spider results into pipelines. For example (higher values mean lower priority; the range is 0-1000):

      ITEM_PIPELINES = {
          'ScrapyDemo.pipelines.ScrapydemoPipeline': 300,
      }

    • With a pipeline configured, running a spider from the command line calls open_spider first, then process_item once for every parsed result, and finally close_spider when the spider finishes
    • items.py describes the structure of the result objects
    • The spider files created by the genspider command (under spiders/) configure the pages to crawl and the request headers to spoof; fetched responses are handed to the parse method, where XPath can extract the data (see the links at the top of this article for XPath usage). My workflow: locate the target tag with Chrome's devtools, copy the page source, find the tag's position, and then write the matching rule. Since a rule rarely matches on the first attempt, iterate with the debugger, or with scrapy shell as sketched below, until the rule is right
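      For quick XPath iteration, Scrapy's interactive shell avoids a full crawl per attempt: fetch the page once and try rules at the prompt (the URL and XPath below are the ones from the demo spider):

      scrapy shell 'https://www.gazetaesportiva.com/noticias/'
      >>> response.xpath("//article//h3[@class='entry-title no-margin']/a/text()").extract_first()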
  • Debugging spiders in PyCharm:
    • If some packages refuse to install from within PyCharm, use a virtual environment for the project:
      (screenshots omitted)
    • Debug-run Scrapy through a small entry script (a minimal sketch follows):
      (screenshots omitted)
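      A common pattern, assuming a script named run.py next to scrapy.cfg (both the file name and the spider name demo are placeholders, not from the original screenshots): invoke Scrapy's command line from Python, then set breakpoints and debug this file in PyCharm:

      # -*- coding: utf-8 -*-
      # run.py: debug entry point, equivalent to running 'scrapy crawl demo'
      from scrapy import cmdline

      cmdline.execute('scrapy crawl demo'.split())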
    • When execution stops at a breakpoint, right-click and choose Evaluate Expression:
      (screenshot omitted)