Launching One or More Scrapy Spiders via the Core API

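Instead of the usual scrapy crawl command, you can run Scrapy from a script through its API. Scrapy is built on the Twisted asynchronous networking library, so it has to run inside the Twisted reactor. Two APIs are available for running one or more spiders: scrapy.crawler.CrawlerProcess and scrapy.crawler.CrawlerRunner.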

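The first utility for launching spiders is scrapy.crawler.CrawlerProcess. This class starts a Twisted reactor for you, configures logging, and sets up shutdown handlers. It is the class used by all Scrapy commands.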

Example that runs a single spider:

import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    # Your spider definition
    ...

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(MySpider)
process.start() # the script will block here until the crawling is finished

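Inside a Scrapy project, you can pass a spider's name and its arguments to CrawlerProcess, and use get_project_settings to obtain a Settings instance holding the project's settings.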

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())

# 'followall' is the name of one of the spiders of the project.
process.crawl('followall', domain='scrapinghub.com')
process.start() # the script will block here until the crawling is finished
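The keyword arguments given to process.crawl are passed on to the spider's constructor, much like the -a NAME=VALUE options of scrapy crawl. As a rough sketch of how a script could expose that, the helper below turns NAME=VALUE strings into keyword arguments. The names parse_spider_args and run_spider_from_cli are hypothetical and not part of Scrapy. Scrapy is imported inside the function only so that the parsing helper can be used on its own.

```python
import argparse


def parse_spider_args(pairs):
    """Convert ["name=value", ...] into a dict of keyword arguments."""
    kwargs = {}
    for pair in pairs:
        # Split on the first "=" only, so values may themselves contain "=".
        name, sep, value = pair.partition("=")
        if not sep or not name:
            raise ValueError(f"expected NAME=VALUE, got {pair!r}")
        kwargs[name] = value
    return kwargs


def run_spider_from_cli(argv=None):
    """Hypothetical entry point: run one project spider by name."""
    parser = argparse.ArgumentParser(description="Run a project spider from a script")
    parser.add_argument("spider", help="name of a spider in the current project")
    parser.add_argument("-a", dest="spider_args", action="append", default=[],
                        metavar="NAME=VALUE", help="spider argument (repeatable)")
    args = parser.parse_args(argv)

    # Imported here so that parse_spider_args stays usable without Scrapy.
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    process = CrawlerProcess(get_project_settings())
    process.crawl(args.spider, **parse_spider_args(args.spider_args))
    process.start()  # blocks until the crawl is finished
```

Called from inside a project as run_spider_from_cli(['followall', '-a', 'domain=scrapinghub.com']), this should be roughly equivalent to the process.crawl call above. Note that every value arrives in the spider as a string.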

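There is another Scrapy utility that gives finer control over the crawling process: scrapy.crawler.CrawlerRunner. This class is a thin wrapper around some simple helpers for running multiple crawlers, but it does not start or interfere with existing reactors in any way.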

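With this class, you run the reactor explicitly after scheduling your spiders. If your application already uses Twisted and you want to run Scrapy in the same reactor, use CrawlerRunner rather than CrawlerProcess.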

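Note that you also have to shut down the Twisted reactor yourself once the spider has finished, by adding callbacks to the Deferred returned by the CrawlerRunner.crawl method.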

Below is a usage example, with a callback that manually stops the reactor after MySpider has finished running.

from twisted.internet import reactor
import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider(scrapy.Spider):
    # Your spider definition
    ...

configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
runner = CrawlerRunner()

d = runner.crawl(MySpider)
d.addBoth(lambda _: reactor.stop())
reactor.run() # the script will block here until the crawling is finished

Running multiple spiders in the same process

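By default, Scrapy runs a single spider per process when you run scrapy crawl. However, Scrapy supports running multiple spiders per process through its internal API.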

Here is an example that runs multiple spiders simultaneously:

import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start() # the script will block here until all crawling jobs are finished

The same example using CrawlerRunner:

import scrapy
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

configure_logging()
runner = CrawlerRunner()
runner.crawl(MySpider1)
runner.crawl(MySpider2)
d = runner.join()
d.addBoth(lambda _: reactor.stop())

reactor.run() # the script will block here until all crawling jobs are finished

The same example, but running the spiders sequentially by chaining the Deferreds:

import scrapy
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    # Each yield waits for that crawl to finish before the next one starts.
    yield runner.crawl(MySpider1)
    yield runner.crawl(MySpider2)
    reactor.stop()

crawl()
reactor.run() # the script will block here until the last crawl call is finished
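The last two CrawlerRunner examples differ only in scheduling: with runner.join() both crawls are started before either finishes, while the chained version starts the second crawl only after the first is done. The sketch below illustrates that ordering with the standard library's asyncio rather than Twisted or Scrapy. It is only an analogy, and fake_crawl is a made-up stand-in for a real crawl.

```python
import asyncio


async def fake_crawl(name, log):
    """Made-up stand-in for a crawl: records when it starts and ends."""
    log.append(f"{name} start")
    await asyncio.sleep(0)  # give up control, as a crawl does while waiting on the network
    log.append(f"{name} end")


async def run_concurrently():
    # Analogous to scheduling both crawls and then waiting on runner.join().
    log = []
    await asyncio.gather(fake_crawl("spider1", log), fake_crawl("spider2", log))
    return log


async def run_sequentially():
    # Analogous to yielding each runner.crawl() in turn inside inlineCallbacks.
    log = []
    await fake_crawl("spider1", log)
    await fake_crawl("spider2", log)
    return log


concurrent_log = asyncio.run(run_concurrently())
sequential_log = asyncio.run(run_sequentially())
print(concurrent_log)
print(sequential_log)
```

In the concurrent run, spider2 starts before spider1 ends. In the sequential run, spider1 starts and ends before spider2 begins.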
