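Once a Scrapy spider is written, it has to be run from the command line; being able to operate it from a web page is more convenient. Scrapyd deployment addresses exactly this: it lets you view running jobs in the browser, and it also lets you create spider jobs and cancel them.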
I. Installation
1. Install scrapyd
pip install scrapyd
2. Install scrapyd-deploy
pip install scrapyd-client
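On Windows, the file generated under c:\python27\Scripts is named scrapyd-deploy with no extension, so scrapyd-deploy cannot be run directly from the command line.
Workaround:
Create a file named scrapyd-deploy.bat under c:\python27\Scripts with the following content: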
@echo off
C:\Python27\python C:\Python27\Scripts\scrapyd-deploy %*
Add this directory to the PATH environment variable: C:\Python27\Scripts;
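II. Usage
1. Run scrapyd
First change to the root directory of the Scrapy project.
The commands in the following steps need a running scrapyd, so start it from the command line first: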
MacBook-Pro:~ usera$ scrapyd
/usr/local/bin/scrapyd:5: UserWarning: Module _markerlib was already imported from /Library/Python/2.7/site-packages/distribute-0.6.49-py2.7.egg/_markerlib/__init__.pyc, but /System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python is being added to sys.path
from pkg_resources import load_entry_point
2016-09-24 16:00:21+0800 [-] Log opened.
2016-09-24 16:00:21+0800 [-] twistd 15.5.0 (/usr/bin/python 2.7.10) starting up.
2016-09-24 16:00:21+0800 [-] reactor class: twisted.internet.selectreactor.SelectReactor.
2016-09-24 16:00:21+0800 [-] Site starting on 6800
2016-09-24 16:00:21+0800 [-] Starting factory <twisted.web.server.Site instance at 0x102a21518>
2016-09-24 16:00:21+0800 [Launcher] Scrapyd 1.1.0 started: max_proc=16, runner='scrapyd.runner'
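2. Deploy the project to scrapyd
a. Configure scrapy.cfg
In scrapy.cfg, uncomment the line "#url = http://localhost:6800/" by removing the leading "#".
Then change to the Scrapy project root directory on the command line and run: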
scrapyd-deploy <target> -p <project>
Example:
scrapyd-deploy -p MySpider
- Verify that the deployment succeeded
scrapyd-deploy -l
output:
TS http://localhost:6800/
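To make the link between scrapy.cfg and this listing concrete, here is a minimal sketch that reads deploy targets out of a config string. The sample file content is an assumption: the target name TS mirrors the output above and the project name MySpider comes from the deploy example, while the function name list_deploy_targets is made up for illustration. It is not scrapyd-client's own code.

```python
import configparser

# Hypothetical scrapy.cfg content. The target name "TS" mirrors the listing
# above; "MySpider" is the project name used in the deploy example.
SAMPLE_CFG = """
[settings]
default = MySpider.settings

[deploy:TS]
url = http://localhost:6800/
project = MySpider
"""


def list_deploy_targets(cfg_text):
    """Return {target_name: url} for each [deploy] or [deploy:NAME] section."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    targets = {}
    for section in parser.sections():
        if section == "deploy":
            # Assumption: an unnamed [deploy] section is the default target.
            name = "default"
        elif section.startswith("deploy:"):
            name = section.split(":", 1)[1]
        else:
            continue
        targets[name] = parser.get(section, "url", fallback="")
    return targets


if __name__ == "__main__":
    for name, url in list_deploy_targets(SAMPLE_CFG).items():
        print(name, url)
```

If the url line is still commented out, the target has no usable address, which is why the "#" has to be removed before deploying.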
III. Getting started
1. Start scrapyd first by running it on the command line:
MyMacBook-Pro:MySpiderProject user$ scrapyd
2. Create a spider job
curl http://localhost:6800/schedule.json -d project=myproject -d spider=spider2
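The same call can be made from Python. The sketch below uses only the standard library and builds the POST request that the curl command sends; the helper names build_api_request and call_api are hypothetical, and actually sending the request assumes scrapyd is listening on localhost:6800 with myproject already deployed.

```python
import json
from urllib.parse import urlencode, urljoin
from urllib.request import Request, urlopen


def build_api_request(base_url, endpoint, **params):
    """Build a form-encoded POST request for a scrapyd JSON endpoint."""
    data = urlencode(params).encode("utf-8")
    return Request(urljoin(base_url, endpoint), data=data, method="POST")


def call_api(request, timeout=10):
    """Send the request and decode the JSON reply (needs a running scrapyd)."""
    with urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    req = build_api_request(
        "http://localhost:6800/", "schedule.json",
        project="myproject", spider="spider2",
    )
    print(req.full_url, req.data)
    # Uncomment once scrapyd is running locally:
    # print(call_api(req))
```

The same builder can target cancel.json, which according to the scrapyd API documentation takes project and job parameters.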
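Known issue: scrapyd deploy shows 0 spiders by scrapyd-client
After deployment, some spiders in the Scrapy project do not show up, and the reported count is 0 spiders.
Fix:
Comment out the following lines in settings: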
# LOG_LEVEL = "ERROR"
# LOG_STDOUT = True
# LOG_FILE = "/tmp/spider.log"
# LOG_FORMAT = "%(asctime)s [%(name)s] %(levelname)s: %(message)s"
When LOG_STDOUT=True is set, scrapyd-deploy returns 'spiders: 0', because the output of 'scrapy list' is redirected to the log file, like this: INFO:stdout:spider-name. As a result, get_spider_list cannot parse it correctly.
3. View spider jobs
Open the following address in a browser to see the job list:
http://localhost:6800/jobs
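The job list is also available as JSON through listjobs.json, which is easier to consume from a script than the HTML page. The following is a rough sketch: the helper names and the sample response body are made up for illustration, and only the status, pending, running and finished keys are assumed from the scrapyd API documentation.

```python
import json
from urllib.parse import urlencode


def listjobs_url(base_url, project):
    """Build the listjobs.json URL for one project."""
    query = urlencode({"project": project})
    return base_url.rstrip("/") + "/listjobs.json?" + query


def summarize_jobs(payload):
    """Count jobs per state in a listjobs.json response body."""
    data = json.loads(payload)
    if data.get("status") != "ok":
        raise ValueError("scrapyd reported an error: %r" % (data,))
    states = ("pending", "running", "finished")
    return {state: len(data.get(state, [])) for state in states}


# Hypothetical response body, shaped like the documented listjobs.json reply.
SAMPLE_RESPONSE = (
    '{"status": "ok", "pending": [], '
    '"running": [{"id": "job-1", "spider": "spider2"}], "finished": []}'
)

if __name__ == "__main__":
    print(listjobs_url("http://localhost:6800/", "myproject"))
    print(summarize_jobs(SAMPLE_RESPONSE))
```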

4. Runtime configuration
Configuration file: C:\Python27\Lib\site-packages\scrapyd-1.1.0-py2.7.egg\scrapyd\default_scrapyd.conf
[scrapyd]
eggs_dir = eggs
logs_dir = logs
items_dir = items
jobs_to_keep = 50
dbs_dir = dbs
max_proc = 0
max_proc_per_cpu = 4
finished_to_keep = 100
poll_interval = 5
http_port = 6800
debug = off
runner = scrapyd.runner
application = scrapyd.app.application
launcher = scrapyd.launcher.Launcher
[services]
schedule.json = scrapyd.webservice.Schedule
cancel.json = scrapyd.webservice.Cancel
addversion.json = scrapyd.webservice.AddVersion
listprojects.json = scrapyd.webservice.ListProjects
listversions.json = scrapyd.webservice.ListVersions
listspiders.json = scrapyd.webservice.ListSpiders
delproject.json = scrapyd.webservice.DeleteProject
delversion.json = scrapyd.webservice.DeleteVersion
listjobs.json = scrapyd.webservice.ListJobs
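One setting above is easy to misread: max_proc = 0 does not mean "no processes". As far as the scrapyd documentation describes it, an unset or zero max_proc falls back to the number of CPUs multiplied by max_proc_per_cpu. The sketch below reproduces that arithmetic; the function name effective_max_proc and the trimmed sample config are made up for illustration, and the CPU count is passed in explicitly rather than detected.

```python
import configparser

# Trimmed copy of the [scrapyd] section shown above.
SAMPLE_CONF = """
[scrapyd]
max_proc = 0
max_proc_per_cpu = 4
http_port = 6800
"""


def effective_max_proc(conf_text, cpu_count):
    """Return the process limit implied by max_proc and max_proc_per_cpu."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    max_proc = parser.getint("scrapyd", "max_proc", fallback=0)
    per_cpu = parser.getint("scrapyd", "max_proc_per_cpu", fallback=4)
    # A positive max_proc is used as-is; 0 falls back to CPUs * per-CPU limit.
    return max_proc if max_proc > 0 else cpu_count * per_cpu


if __name__ == "__main__":
    print(effective_max_proc(SAMPLE_CONF, cpu_count=4))
```

With four CPUs this gives 16, which would be consistent with the max_proc=16 line in the startup log earlier, assuming that machine reported four CPUs.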
References
http://www.cnblogs.com/jinhaolin/p/5033733.html
https://scrapyd.readthedocs.io/en/latest/api.html#cancel-json