
Scraping a movie site's information with scrapy

This post uses scrapy to scrape movie information from the ashvsash movie site. Here the information is only printed, not stored in a database; with a small change you can enable an item pipeline and, using the PyMySQL or MongoDB (pymongo) library, filter the data and write it out. Note: extraction fails on some pages and needs finer tuning. The code follows.

# -*- coding: utf-8 -*-

import scrapy

# Print helper for easier inspection of the scraped fields.
def my_print(a_map):
    for item in a_map:
        print("%-15s %-50s" % (item, a_map[item]))

debug = 1

class MovicesSpider(scrapy.Spider):
    name = "movices"
    allowed_domains = ["ashvsash.com"]
    start_urls = ['http://ashvsash.com/']

    # Pull the url, title, date, view count and category out of one list entry.
    def parse_node_thumbnail_article_info(self, thumbnail, article, info):
        url = thumbnail.xpath("./a/@href").extract()
        title = article.xpath(".//a[@title]/@title").extract()
        info_date = info.xpath("./span[@class='info_date info_ico']/text()").extract()
        info_views = info.xpath("./span[@class='info_views info_ico']/text()").extract()
        info_category = info.xpath("./span[@class='info_category info_ico']/a/text()").extract()
        if debug:
            print("\nURL:", url[0])
            print("Date = ", info_date[0])
            print("Views = ", info_views[0])
            print("Category = ", info_category[0])
            print("Title = ", title[0])
        return {'url': url[0], 'date': info_date[0], 'views': info_views[0],
                'title': title[0], 'category': info_category[0]}

    # Parse one movie's detail page; the list-page fields arrive via response.meta.
    def parse_movie_detail_page(self, response):
        result = {}
        movie_info = response.meta['movie_info']
        result["URL:"] = movie_info['url']
        result["Views:"] = movie_info['views']
        result["Title:"] = movie_info['title']
        try:
            # The field labels are scraped from the page itself, so they are used
            # directly as result keys.
            key = response.xpath(r'//*[@id="post_content"]/p[2]/span[1]/text()').extract()[0]
            key += ":"
            value = response.xpath(r'//*[@id="post_content"]/p[2]/span[2]/a/text()').extract()[0]
            result[key] = value

            key = response.xpath(r'//*[@id="post_content"]/p[2]/span[3]/text()').extract()[0]
            key += ":"
            value = response.xpath(r'//*[@id="post_content"]/p[2]/span[4]/a/text()').extract()
            result[key] = value

            key = response.xpath(r'//*[@id="post_content"]/p[2]/span[6]/text()').extract()[0]
            value = response.xpath(r'//*[@id="post_content"]/p[2]/text()[6]').extract()
            result[key] = value

            key = response.xpath(r'//*[@id="post_content"]/p[2]/span[8]/text()').extract()[0]
            value = response.xpath(r'//*[@id="post_content"]/p[2]/text()[10]').extract()[0]
            result[key] = value

            print("-----------------------------------------------------------------")
            my_print(result)
            print("-----------------------------------------------------------------")
        except Exception:
            # Some pages fail to parse; simply ignore them for now.
            print("<<<<<<------------------------------------------------------------")

    def parse(self, response):
        post_container = response.xpath("//ul[@id='post_container']")
        new_urls = response.xpath(r'//div[@class="pagination"]/a/@href').extract()
        #print(new_urls)
        # Re-queue the pagination links (next, 2, 3, ...) so every list page is crawled.
        for url in new_urls:
            yield scrapy.Request(url=url, callback=self.parse)

        # Each <li> under #post_container is one movie entry on the list page.
        li = post_container.xpath(".//li")
        for item in li:
            node_thumbnail = item.xpath("./div[@class='thumbnail']")
            node_article = item.xpath("./div[@class='article']")
            node_info = item.xpath("./div[@class='info']")
            movie_info = self.parse_node_thumbnail_article_info(node_thumbnail, node_article, node_info)
            yield scrapy.Request(url=movie_info['url'], callback=self.parse_movie_detail_page,
                                 meta={'movie_info': movie_info})
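
The bare try/except in parse_movie_detail_page hides which field actually failed. A hedged alternative, not part of the original post, is to read each node with extract_first() and a default, so a missing element degrades to an empty string instead of raising IndexError; the helper name grab() below is purely illustrative.

# Sketch: safer field extraction with defaults (grab() is an illustrative helper).
def grab(selector, xpath, default=''):
    # extract_first() returns the first match, or the default instead of raising.
    return selector.xpath(xpath).extract_first(default=default)

# Possible use inside parse_movie_detail_page:
#     key = grab(response, r'//*[@id="post_content"]/p[2]/span[1]/text()')
#     if key:
#         result[key + ":"] = grab(response, r'//*[@id="post_content"]/p[2]/span[2]/a/text()')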

If you additionally define an Item and map it onto a database table, the data can easily be stored in a database. The spider was written with scrapy and Python 3.6 and tested on Windows 7. If errors occur during installation, download the relevant packages from http://www.lfd.uci.edu/~gohlke/pythonlibs/ and install them locally with pip. The most likely problem is building lxml, which needs the VS compiler; in any case, read the error messages printed during installation.
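
As a hedged sketch of that idea (assuming pymongo and a MongoDB instance on localhost; MovieItem, MongoPipeline and the database/collection names are illustrative, not from the original post), the Item and a storage pipeline could look like this:

# items.py -- one Field per key returned by the spider.
import scrapy

class MovieItem(scrapy.Item):
    url = scrapy.Field()
    date = scrapy.Field()
    views = scrapy.Field()
    title = scrapy.Field()
    category = scrapy.Field()

# pipelines.py -- write each item to MongoDB (sketch, assuming a local mongod).
import pymongo

class MongoPipeline(object):
    def open_spider(self, spider):
        self.client = pymongo.MongoClient('localhost', 27017)
        self.collection = self.client['movies']['ashvsash']

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.collection.insert_one(dict(item))
        return item

Enable the pipeline with ITEM_PIPELINES = {'yourproject.pipelines.MongoPipeline': 300} in settings.py (the project name is a placeholder), and have the spider yield MovieItem(**movie_info) instead of only printing.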

Using scrapy-redis is also straightforward: pip install scrapy-redis, derive your spider from the scrapy-redis spider class, and push URLs to Redis; a spider derived from scrapy-redis will by default pull its URLs from Redis when run. Remember to configure Redis in the settings. That gives you a distributed crawler: the master crawler collects URLs and pushes them to Redis, while the distributed workers pull the URLs and do the detailed parsing; of course, they can also push new URLs back onto the queue.
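
A minimal sketch of that setup, assuming scrapy-redis is installed and Redis runs on localhost (the spider name and the key movices:start_urls are illustrative):

# Spider side: inherit from RedisSpider instead of scrapy.Spider and take the
# start URLs from a Redis list instead of start_urls.
from scrapy_redis.spiders import RedisSpider

class DistributedMovicesSpider(RedisSpider):
    name = "movices_redis"
    redis_key = "movices:start_urls"   # the master pushes URLs onto this list

    def parse(self, response):
        # same parsing logic as in the spider above
        ...

# settings.py: route scheduling and de-duplication through Redis.
# SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# REDIS_URL = "redis://localhost:6379"

# The master then seeds the queue, for example from redis-cli:
#   lpush movices:start_urls http://ashvsash.com/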