Crawler Learning (3): Getting Started with the Scrapy Framework and a Douban Movie Crawler

Knowledge Notes

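Scrapy framework: quick notes
Official documentation (Chinese translation): http://scrapychs.readthedocs.io/zh_CN/latest/intro/tutorial.html (listed first because it is required reading)
Scrapy is a fast web-scraping framework written in Python. Several libraries must be installed before Scrapy itself can be installed. (1. lxml 2. zope.interface 3. Twisted 4. pyOpenSSL 5. pywin32 6. Scrapy)
To generate a Scrapy project, run this command in cmd: scrapy startproject XXX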

As shown in the figure below:

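Libraries are imported in the project as follows:

from scrapy.spiders import CrawlSpider
from scrapy.http import Request
from scrapy.selector import Selector
from movie.items import MovieItem

The generated project structure is shown in the figure. We write our spider files in the spiders directory.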

Note that running the Scrapy framework from PyCharm requires some setup, so a main file was created; its code is shown in the figure. (This follows another blog post, see http://blog.csdn.net/ck4438707/article/details/52076220)
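
Since the figure with the main file is not reproduced here, the following is only a minimal sketch of what such a launcher commonly looks like, not necessarily the author's exact file. It assumes main.py sits next to scrapy.cfg in the project root and that the spider is named "douban" as in the example in this post. The helper names build_argv and run are hypothetical; scrapy.cmdline.execute is Scrapy's own entry point and takes an argv-style list.

```python
# main.py (assumed location: project root, next to scrapy.cfg)


def build_argv(spider_name):
    """Return the argv list equivalent to typing `scrapy crawl <spider_name>`."""
    return ["scrapy", "crawl", spider_name]


def run(spider_name="douban"):
    """Start the crawl in-process so PyCharm can run and debug it."""
    # Imported here so the helper above can be used without Scrapy installed.
    from scrapy import cmdline

    # cmdline.execute() does not return; it exits the process when the crawl ends.
    cmdline.execute(build_argv(spider_name))
```

To use it, add a call to run() at the bottom of main.py and set that file as the run target in PyCharm; the effect is the same as running `scrapy crawl douban` from the project directory.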

For what each Scrapy file does, see the official documentation; it is not repeated here.

Example (Douban Movie Top 250 crawler)

Goals:

1. url: http://movie.douban.com/top250
2. Content to scrape: movie title, info, rating, and quote
3. Save the results as a CSV file

Main program (the file under spiders):

# -*- coding: utf-8 -*-
from scrapy.spiders import CrawlSpider
from scrapy.http import Request
from scrapy.selector import Selector
from movie.items import MovieItem


class test(CrawlSpider):
    # No rules are defined, so overriding parse() is safe here;
    # scrapy.Spider would work just as well.
    name = "douban"
    redis_key = 'douban:start_urls'  # only meaningful to scrapy-redis; unused here
    start_urls = ['http://movie.douban.com/top250']
    url = 'https://movie.douban.com/top250'  # base for the relative next-page link

    def parse(self, response):
        selector = Selector(response)
        Movies = selector.xpath('//div[@class="info"]')
        for eachMovie in Movies:
            # Create a fresh item per movie so yielded items do not share state.
            item = MovieItem()
            # The title is split across several <span> elements; join them.
            title = eachMovie.xpath('div[@class="hd"]/a/span/text()').extract()
            fullTitle = ''.join(title)
            movieInfo = eachMovie.xpath('div[@class="bd"]/p/text()').extract()
            star = eachMovie.xpath('div[@class="bd"]/div[@class="star"]/span[@class="rating_num"]/text()').extract()[0]
            # Not every movie has a quote; fall back to an empty string.
            quote = eachMovie.xpath('div[@class="bd"]/p[@class="quote"]/span/text()').extract()
            if quote:
                quote = quote[0]
            else:
                quote = ''
            item['title'] = fullTitle
            item['movieInfo'] = ';'.join(movieInfo)
            item['star'] = star
            item['quote'] = quote
            yield item
        # Follow the "next page" link until the last page is reached.
        nextlink = selector.xpath('//span[@class="next"]/link/@href').extract()
        if nextlink:
            nextlink = nextlink[0]
            print(nextlink)
            yield Request(self.url + nextlink, callback=self.parse)
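
The spider builds the next-page URL by concatenating the base URL with the href it extracted, which works as long as the href is a bare query string. As a hedged alternative, the sketch below shows the same step done with the standard library's urljoin (Python 3), which also copes with absolute links. The sample href "?start=25&filter=" and the helper name next_page_url are assumptions for illustration.

```python
from urllib.parse import urljoin

BASE_URL = "https://movie.douban.com/top250"


def next_page_url(base_url, href):
    """Resolve a next-page href (relative or absolute) against the base URL."""
    return urljoin(base_url, href)


# Assumed sample value of the href attribute on the "next" link.
print(next_page_url(BASE_URL, "?start=25&filter="))
```

Inside a Scrapy callback, response.urljoin(href) performs the same resolution against the URL of the current response, so the separate url attribute on the spider would not be needed.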

items.py

from scrapy import Item, Field


class MovieItem(Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = Field()
    movieInfo = Field()
    star = Field()
    quote = Field()

settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for movie project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'movie'

SPIDER_MODULES = ['movie.spiders']
NEWSPIDER_MODULE = 'movie.spiders'
USER_AGENT='Mozilla/5.0 (Windows NT 10.0; Win64; x64)'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'movie (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'movie.middlewares.MovieSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'movie.middlewares.MyCustomDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'movie.pipelines.MoviePipeline': 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

# Export the scraped items to a CSV file
FEED_URI = u'file:///G:/movie.csv'
FEED_FORMAT = 'csv'
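
On newer Scrapy releases (2.1 and later, to my knowledge) FEED_URI and FEED_FORMAT are deprecated in favor of a single FEEDS dictionary. The fragment below is a sketch of the equivalent setting, assuming the same output path as above; the explicit encoding entry is an addition that is not in the original settings.

```python
# settings.py fragment for newer Scrapy versions (assumption: Scrapy >= 2.1)
FEEDS = {
    "file:///G:/movie.csv": {
        "format": "csv",      # same export format as FEED_FORMAT above
        "encoding": "utf-8",  # added for illustration; not in the original file
    },
}
```

Only one of the two styles should be kept in settings.py.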
