Instead of running Scrapy the typical way, with the scrapy crawl command, you can use the API to run it from a script. Scrapy is built on the Twisted asynchronous networking library, so it needs to run inside the Twisted reactor; two APIs let you run one or more spiders: scrapy.crawler.CrawlerProcess and scrapy.crawler.CrawlerRunner.
The first utility for launching spiders is scrapy.crawler.CrawlerProcess. This class starts the Twisted reactor for you, configures logging, and sets up shutdown handlers; it is the class used by all Scrapy commands.
Here is an example that runs a single spider:
import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    # Your spider definition
    ...

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(MySpider)
process.start() # the script will block here until the crawling is finished
You can also pass your project settings into CrawlerProcess, using get_project_settings to get a Settings instance populated with the project's settings:
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())

# 'followall' is the name of one of the spiders of the project.
process.crawl('followall', domain='scrapinghub.com')
process.start() # the script will block here until the crawling is finished
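Arguments passed to crawl() after the spider name are forwarded to the spider's __init__, and the default Spider.__init__ stores extra keyword arguments as instance attributes. As a rough sketch of how a spider like the 'followall' one above might pick up the domain argument (this definition is an assumption for illustration, not part of the project):

import scrapy

class FollowAllSpider(scrapy.Spider):
    name = 'followall'  # the name used in process.crawl('followall', ...)

    def start_requests(self):
        # domain='scrapinghub.com' passed to crawl() ends up as self.domain
        yield scrapy.Request('http://%s/' % self.domain)

    def parse(self, response):
        # Your parsing logic
        ...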
There is another Scrapy utility that provides more control over the crawling process: scrapy.crawler.CrawlerRunner. This class encapsulates some simple helpers for running multiple crawlers, but it will not start or interfere with existing reactors in any way.
With this class you run the reactor explicitly. If a crawler is already running in your process and you want to start another Scrapy crawl alongside it, it is recommended to use CrawlerRunner instead of CrawlerProcess.
Note that after the spiders finish you have to shut down the Twisted reactor yourself. This can be done by adding a callback to the deferred returned by the CrawlerRunner.crawl method.
Here is an example of its usage, with a callback that manually stops the reactor after MySpider has finished running:
from twisted.internet import reactor
import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider(scrapy.Spider):
    # Your spider definition
    ...

configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
runner = CrawlerRunner()

d = runner.crawl(MySpider)
d.addBoth(lambda _: reactor.stop())
reactor.run() # the script will block here until the crawling is finished
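The lambda passed to addBoth stops the reactor regardless of whether the crawl succeeded or failed. If you want to report errors before shutting down, one possible variant (a sketch under the same setup as above, not part of the original example) checks the result for a Twisted Failure first:

from twisted.python.failure import Failure

def stop_reactor(result):
    # addBoth delivers either the final result or a Failure on error
    if isinstance(result, Failure):
        result.printTraceback()  # report the crawl error before stopping
    reactor.stop()

d = runner.crawl(MySpider)
d.addBoth(stop_reactor)
reactor.run()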
Running multiple spiders in the same process
By default, Scrapy runs a single spider per process when you run scrapy crawl. However, Scrapy supports running multiple spiders per process using the internal API.
Here is an example that runs multiple spiders simultaneously:
import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start() # the script will block here until all crawling jobs are finished
The same example using CrawlerRunner:
import scrapy
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

configure_logging()
runner = CrawlerRunner()
runner.crawl(MySpider1)
runner.crawl(MySpider2)
d = runner.join()
d.addBoth(lambda _: reactor.stop())
reactor.run() # the script will block here until all crawling jobs are finished
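Here both crawls are started up front and run in parallel; runner.join() returns a deferred that fires only after every crawler started on the runner has finished, so a single addBoth callback is enough to stop the reactor for all of them.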
The same example, but running the spiders sequentially by chaining the deferreds:
import scrapy
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    yield runner.crawl(MySpider1)
    yield runner.crawl(MySpider2)
    reactor.stop()

crawl()
reactor.run() # the script will block here until the last crawl call is finished
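Each yield waits for the deferred returned by runner.crawl to fire before the next line runs, which is what makes the spiders crawl one after another instead of in parallel. The same pattern extends to any number of spiders; a minimal sketch, assuming the MySpider1 and MySpider2 definitions and the runner from the example above:

@defer.inlineCallbacks
def crawl_sequentially(spider_classes):
    # each crawl starts only after the previous deferred has fired
    for spider_cls in spider_classes:
        yield runner.crawl(spider_cls)
    reactor.stop()

crawl_sequentially([MySpider1, MySpider2])
reactor.run()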