This post uses Scrapy to crawl movie information from the ashvsash movie site. The information is only printed here, not stored in a database; with a small change you can enable a pipeline, use PyMySQL or MongoDB, and filter the data before storing it. Note: extraction fails on some pages, so the XPaths need careful tuning. Straight to the code:
# -*- coding: utf-8 -*-
import scrapy


# Helper: print every key/value pair of a dict in aligned columns.
def my_print(a_map):
    for item in a_map:
        print("%-15s %-50s" % (item, a_map[item]))


debug = 1


class MovicesSpider(scrapy.Spider):
    name = "movices"
    allowed_domains = ["ashvsash.com"]
    start_urls = ['http://ashvsash.com/']

    def parse_node_thumbnail_article_info(self, thumbnail, article, info):
        # Pull the link, title and the date/views/category fields out of one list item.
        url = thumbnail.xpath("./a/@href").extract()
        title = article.xpath(".//a[@title]/@title").extract()
        info_date = info.xpath("./span[@class='info_date info_ico']/text()").extract()
        info_views = info.xpath("./span[@class='info_views info_ico']/text()").extract()
        info_category = info.xpath("./span[@class='info_category info_ico']/a/text()").extract()
        if debug:
            print("\nurl:", url[0])
            print("date =", info_date[0])
            print("views =", info_views[0])
            print("category =", info_category[0])
            print("title =", title[0])
        return {'url': url[0], 'date': info_date[0], 'views': info_views[0],
                'title': title[0], 'category': info_category[0]}

    def parse_movie_detail_page(self, response):
        result = {}
        movie_info = response.meta['movie_info']
        result['url:'] = movie_info['url']
        result['views:'] = movie_info['views']
        result['title:'] = movie_info['title']
        try:
            # The detail page keeps label/value pairs in sibling <span> nodes
            # inside //*[@id="post_content"]/p[2]; the positions vary per page.
            key = response.xpath(r'//*[@id="post_content"]/p[2]/span[1]/text()').extract()[0] + ":"
            value = response.xpath(r'//*[@id="post_content"]/p[2]/span[2]/a/text()').extract()[0]
            result[key] = value
            key = response.xpath(r'//*[@id="post_content"]/p[2]/span[3]/text()').extract()[0] + ":"
            value = ", ".join(response.xpath(r'//*[@id="post_content"]/p[2]/span[4]/a/text()').extract())
            result[key] = value
            key = response.xpath(r'//*[@id="post_content"]/p[2]/span[6]/text()').extract()[0]
            value = ", ".join(response.xpath(r'//*[@id="post_content"]/p[2]/text()[6]').extract())
            result[key] = value
            key = response.xpath(r'//*[@id="post_content"]/p[2]/span[8]/text()').extract()[0]
            value = response.xpath(r'//*[@id="post_content"]/p[2]/text()[10]').extract()[0]
            result[key] = value
            print("-----------------------------------------------------------------")
            my_print(result)
            print("-----------------------------------------------------------------")
        except Exception:
            # Some pages use a different layout and the XPaths above fail;
            # simply ignore them for now.
            print("<<<<<<------------------------------------------------------------")

    def parse(self, response):
        post_container = response.xpath("//ul[@id='post_container']")
        new_urls = response.xpath(r'//div[@class="pagination"]/a/@href').extract()
        # Re-queue the pagination links (next, 2, 3, 4, ...).
        for url in new_urls:
            yield scrapy.Request(url=url, callback=self.parse)
        li = post_container.xpath(".//li")
        for item in li:
            node_thumbnail = item.xpath("./div[@class='thumbnail']")
            node_article = item.xpath("./div[@class='article']")
            node_info = item.xpath("./div[@class='info']")
            movie_info = self.parse_node_thumbnail_article_info(node_thumbnail, node_article, node_info)
            # Follow each movie link; pass the list-page fields along in meta.
            yield scrapy.Request(url=movie_info['url'], callback=self.parse_movie_detail_page,
                                 meta={'movie_info': movie_info})
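Most of the "some pages fail" breakage comes from the bare `[0]` indexing: when an XPath matches nothing, `extract()` returns an empty list and `[0]` raises `IndexError`. Scrapy selectors offer `extract_first()` with a `default` argument for exactly this; as a plain-Python sketch of the same idea (the helper name is my own, not from the original code):

```python
def first(values, default=""):
    """Return the first extracted value, or a default when the XPath matched nothing."""
    return values[0] if values else default

# Usage inside the spider (sketch):
#   url = first(thumbnail.xpath("./a/@href").extract())
# instead of the crash-prone:
#   url = thumbnail.xpath("./a/@href").extract()[0]
```

With this in place the `try/except` around the detail-page parsing only has to cover genuinely malformed pages rather than every missing field.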
If you also define an Item and map it onto a database table, the results can easily be persisted. The spider was tested with Scrapy on Python 3.6 under Windows 7. If an error occurs during installation, download the relevant package from
http://www.lfd.uci.edu/~gohlke/pythonlibs/ and install it locally with pip. The problem you are most likely to hit is building lxml, which needs the Visual Studio compiler; read the error messages printed during installation.
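The storage step described above can be sketched as an item pipeline. This is a minimal sketch, not the original author's code: the class name, the `movie` table and its columns are illustrative, and PyMySQL is imported lazily so the filtering part also works where it is not installed.

```python
class MoviePipeline:
    """Sketch of a Scrapy item pipeline: normalize the scraped dict and,
    when a connection is available, insert it with PyMySQL (assumed schema)."""

    def __init__(self):
        self.conn = None  # set in open_spider once PyMySQL connects

    def open_spider(self, spider):
        # Lazy import: the filter logic below stays usable without PyMySQL.
        import pymysql
        self.conn = pymysql.connect(host="localhost", user="root",
                                    password="", db="movies", charset="utf8mb4")

    def close_spider(self, spider):
        if self.conn is not None:
            self.conn.close()

    def process_item(self, item, spider):
        # Filter step: strip whitespace and drop empty string fields.
        row = {k: v.strip() for k, v in item.items()
               if isinstance(v, str) and v.strip()}
        if self.conn is not None:
            with self.conn.cursor() as cur:
                cur.execute(
                    "INSERT INTO movie (url, date, views, title, category) "
                    "VALUES (%s, %s, %s, %s, %s)",
                    (row.get("url"), row.get("date"), row.get("views"),
                     row.get("title"), row.get("category")))
            self.conn.commit()
        return row
```

Enable it in `settings.py` with something like `ITEM_PIPELINES = {'myproject.pipelines.MoviePipeline': 300}` (the module path is illustrative).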
Using scrapy-redis is also straightforward: `pip install scrapy-redis`, derive your spider from the scrapy-redis spider class, and push URLs into redis; when run, such a spider pulls its URLs from redis by default. Remember to configure redis in the settings. That gives you a distributed crawler: the master spider crawls URLs and pushes them to redis, while distributed workers pop URLs and do the detailed parsing. Workers can of course also push new URLs back into the queue.
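The scrapy-redis setup described above amounts to a few settings plus a small spider skeleton. A sketch, using the library's documented `RedisSpider` class and scheduler settings (the spider name and redis key are my own choices):

```python
# settings.py additions: let scrapy-redis schedule and de-duplicate requests
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
REDIS_URL = "redis://localhost:6379"

# spider skeleton
from scrapy_redis.spiders import RedisSpider

class MovicesRedisSpider(RedisSpider):
    name = "movices_redis"
    redis_key = "movices:start_urls"  # the list the master pushes URLs to

    def parse(self, response):
        # same parsing logic as the plain spider above
        ...
```

Seed the queue from the master (or by hand) with `redis-cli lpush movices:start_urls http://ashvsash.com/`, then start as many worker processes as you like with `scrapy crawl movices_redis`.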