
Scrapy (Python Crawler Framework) Beginner Notes

This article is intended only as personal notes.

Scrapy official site

Scrapy official documentation

Scrapy Chinese documentation

Personal ScrapyDemo project repository

Python environment setup

  • Windows:
    • Python: download the Python installer and run it
    • pip: easy_install pip
  • macOS:
    • Python: macOS ships with Python 2.7
    • pip: easy_install pip
  • CentOS 7:
    • Python: CentOS 7 ships with Python 2.7
    • pip: easy_install pip

Scrapy installation

pip install Scrapy
           
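Once installed, you can verify the setup from the command line:

scrapy version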

Create a project

scrapy startproject <project_name>
           
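For reference, startproject generates roughly the following layout (shown here for a project named ScrapyDemo; yours will match <project_name>):

ScrapyDemo/
    scrapy.cfg            # deploy configuration
    ScrapyDemo/           # the project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider and downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # spiders live here
            __init__.py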

Create a spider

scrapy genspider <spider_name> <host_name>
           
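For example, scrapy genspider demo www.example.com creates spiders/demo.py with roughly this skeleton (the names here are placeholders):

# -*- coding: utf-8 -*-
import scrapy


class DemoSpider(scrapy.Spider):
    name = 'demo'
    allowed_domains = ['www.example.com']
    start_urls = ['http://www.example.com/']

    def parse(self, response):
        pass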

Create a requirements.txt file in the project root and add the packages you need, for example:

Scrapy==1.5.0
beautifulsoup4==4.6.0
requests==2.18.4
           

Set up the project environment

pip install -r requirements.txt
           

Run a single spider

scrapy crawl <spider_name>
           
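To see which spider names are available in the project, use:

scrapy list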

Run multiple spiders (Scrapy does not support running several spiders from a single command-line invocation; instead, create a new Python file with the following content and run it, adjusting as needed):

# -*- coding: utf-8 -*-
import sys
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from ScrapyDemo.spiders.news_estadao import EstadaoSpider
from ScrapyDemo.spiders.news_gazetaesportiva import DemoSpider
from ScrapyDemo.spiders.news_megacurioso import MegacuriosoSpider

# Python 2 only: force the default encoding to UTF-8
if sys.getdefaultencoding() != 'utf-8':
    reload(sys)
    sys.setdefaultencoding('utf-8')

# Run all three spiders in one process, using the project settings
process = CrawlerProcess(get_project_settings())
process.crawl(EstadaoSpider)
process.crawl(DemoSpider)
process.crawl(MegacuriosoSpider)
process.start()  # blocks until every spider has finished
           

Enable pipelines to process results

  • Open settings.py and add your pipeline under ITEM_PIPELINES with a priority value from 0 to 1000; lower numbers run earlier (see the ITEM_PIPELINES example later in these notes)

Write a single spider's results to a file

scrapy crawl demo -o /path/to/demo.json
           
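The feed exporter infers the format from the file extension, so other formats work the same way:

scrapy crawl demo -o /path/to/demo.jl    # JSON lines
scrapy crawl demo -o /path/to/demo.csv   # CSV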

Merging the results of multiple spiders:

  • The multi-spider script above does not merge the results of the individual spiders
  • Because the business required it, I settled for a workaround
    • Idea: use the commands library to run the Linux commands in sequence, writing each spider's output to a file, then read the files back and parse them into objects for merged processing
    • Code (adjust as needed):

      #!/usr/bin/env python
      # encoding: utf-8
      import commands  # Python 2 only
      import json

      def test():
          result = []
          try:
              # Clear the previous run's output (scrapy -o appends to existing files)
              commands.getoutput('echo "" > /path/to/demo.json')
              # Run the spider and write its results to the file
              commands.getoutput('scrapy crawl demo -o /path/to/demo.json')
              # Read the results back and parse them into Python objects
              result = json.loads(commands.getoutput('cat /path/to/demo.json'))
          except:
              print 'Get demo results error.'
          return result
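      To actually mix the spiders' results, the same reader can loop over every output file and combine the parsed lists (a minimal sketch; the paths are placeholders):

      import commands
      import json

      def collect(paths):
          result = []
          for path in paths:
              try:
                  # Each file holds one spider's JSON output
                  result.extend(json.loads(commands.getoutput('cat ' + path)))
              except:
                  print 'Parse %s error.' % path
          return result

      items = collect(['/path/to/demo.json', '/path/to/megacurioso.json'])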

Fixing garbled text in spider results:

  • Add the following at the top of any Python file with encoding problems (Python 2 only; sys must be imported first):

    import sys

    if sys.getdefaultencoding() != 'utf-8':
        reload(sys)
        sys.setdefaultencoding('utf-8')

Spider examples (you can also use the GitHub link given at the top of this article):

  • Item example (items.py):

    # -*- coding: utf-8 -*-

    # Define here the models for your scraped items
    #
    # See documentation in:
    # https://doc.scrapy.org/en/latest/topics/items.html

    import scrapy


    class ScrapydemoItem(scrapy.Item):
        title = scrapy.Field()
        imageUrl = scrapy.Field()
        des = scrapy.Field()
        source = scrapy.Field()
        actionUrl = scrapy.Field()
        contentType = scrapy.Field()
        itemType = scrapy.Field()
        createTime = scrapy.Field()
        country = scrapy.Field()
        headUrl = scrapy.Field()

  • Pipeline example (pipelines.py):

    # -*- coding: utf-8 -*-

    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

    from ScrapyDemo.items import ScrapydemoItem
    import json


    class ScrapydemoPipeline(object):
        DATA_LIST_NEWS = []

        def open_spider(self, spider):
            # Reset the buffer each time a spider starts
            self.DATA_LIST_NEWS = []
            print 'Spider start.'

        def process_item(self, item, spider):
            if isinstance(item, ScrapydemoItem):
                self.DATA_LIST_NEWS.append(dict(item))
            return item

        def close_spider(self, spider):
            # Dump everything this spider collected as JSON
            print json.dumps(self.DATA_LIST_NEWS)
            print 'Spider end.'
  • Spider example (demo.py):

    # -*- coding: utf-8 -*-
    import scrapy
    from ScrapyDemo.items import ScrapydemoItem


    class DemoSpider(scrapy.Spider):
        name = 'news_gazetaesportiva'
        allowed_domains = ['www.gazetaesportiva.com']
        start_urls = ['https://www.gazetaesportiva.com/noticias/']
        headers = {
            'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
            'accept-language': 'zh-CN,zh;q=0.9,en-US;q=0.8,en;q=0.7',
            'cache-control': 'max-age=0',
            'upgrade-insecure-requests': '1',
            'User-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36'
        }

        def parse(self, response):
            print('Start parse.')
            for element in response.xpath('//article'):
                title = element.xpath(".//h3[@class='entry-title no-margin']/a/text()").extract_first()
                imageUrl = [element.xpath(".//img[@class='medias-object wp-post-image']/@src").extract_first()]
                des = element.xpath(".//div[@class='entry-content space']/text()").extract_first()
                source = 'gazeta'
                actionUrl = element.xpath(".//a[@class='blog-image']/@href").extract_first()
                contentType = ''
                itemType = ''
                createTime = element.xpath(".//small[@class='updated']/text()").extract_first()
                country = 'PZ'
                headUrl = ''
                # Skip entries that are missing a title, link or image
                if title and actionUrl and imageUrl[0]:
                    item = ScrapydemoItem()
                    item['title'] = title
                    item['imageUrl'] = imageUrl
                    item['des'] = des
                    item['source'] = source
                    item['actionUrl'] = actionUrl
                    item['contentType'] = contentType
                    item['itemType'] = itemType
                    item['createTime'] = createTime
                    item['country'] = country
                    item['headUrl'] = headUrl
                    yield item
            print('End parse.')
  • My understanding of the code:
    • settings.py holds the shared configuration, including the ITEM_PIPELINES setting that routes spider results into pipelines. For example (higher values mean lower priority; the range is 0-1000):

      ITEM_PIPELINES = {
          'ScrapyDemo.pipelines.ScrapydemoPipeline': 300,
      }

    • With a pipeline configured, running a spider from the command line calls open_spider first, then process_item once for every parsed result, and finally close_spider when the spider finishes
    • items.py describes the structure of the result objects
    • The spider files created by the genspider command (under spiders/) configure the pages to crawl and the request headers to spoof; fetched responses are handed to the parse method, where XPath can extract the data (see the links at the top of this article for XPath usage). My workflow: locate the target tag with Chrome's devtools, copy the page source, find the tag's position, and then write the matching rule. Since a rule rarely matches on the first attempt, iterate with the debugger, or with scrapy shell as sketched below, until the rule is right
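      For quick XPath iteration, Scrapy's interactive shell avoids a full crawl per attempt: fetch the page once and try rules at the prompt (the URL and XPath below are the ones from the demo spider):

      scrapy shell 'https://www.gazetaesportiva.com/noticias/'
      >>> response.xpath("//article//h3[@class='entry-title no-margin']/a/text()").extract_first()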
  • Debugging spiders in PyCharm:
    • If some packages refuse to install from within PyCharm, use a virtual environment for the project:
      (screenshots omitted)
    • Debug-run Scrapy through a small entry script (a minimal sketch follows):
      (screenshots omitted)
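      A common pattern, assuming a script named run.py next to scrapy.cfg (both the file name and the spider name demo are placeholders, not from the original screenshots): invoke Scrapy's command line from Python, then set breakpoints and debug this file in PyCharm:

      # -*- coding: utf-8 -*-
      # run.py: debug entry point, equivalent to running 'scrapy crawl demo'
      from scrapy import cmdline

      cmdline.execute('scrapy crawl demo'.split())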
    • When execution stops at a breakpoint, right-click and choose Evaluate Expression:
      (screenshot omitted)