Crawl plan: scrape 30 pages for each job category.
Page-count check:
Locate the pagination bar and read the "30" below it (上一頁 = previous page, 下一頁 = next page):
上一頁 | 1 | .... | 28 | 29 | 30 | 31 | 下一頁 |
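This can be checked in scrapy shell with the same layout-specific XPath the spider below uses (the li[6] position depends on this exact pagination markup):
pagenum = response.xpath("//body/div[3]/div[3]/div[2]/form/div[1]/div[1]/div[3]/ul/li[6]/a/text()").extract_first()
# pagenum is a string such as "30"; the spider compares int(pagenum) <= 30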
Locate these links to enter each job's detail-information page:
PHP工程師 |
PHP實習生應屆生均可 |
PHP軟體開發工程師 |
PHP工程師 |
PHP工程師 |
PHP工程師 |
jobs = response.css("td.zwmc>div>a")
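From that selector, the detail-page URLs come out with the ::attr(href) form used later in the spider:
jobsurl = response.css("td.zwmc>div>a::attr(href)").extract()  # list of job-detail URLs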
Parsing the response returned from myspider:start_urls:
def parse(self, response):
    1. Determine the page number.
    2. Parse the page:
        (i) extract the URLs of the jobs on the page;
        (ii) generate a Request for each and hand it to the parsejob() callback for further processing;
        (iii) extract the next page's URL and generate a Request for it.
def parsejob(self, response):
    1. Extract the job's detailed information.
Observations from the scrapy crawl log:
This is the key line: a request was issued and downloaded:
[scrapy.core.engine] DEBUG: Crawled (200) <GET http://..............>
This appears to be the response being parsed:
[scrapy.core.scraper] DEBUG: Scraped from <200 http://..............>
Situations encountered while using the spider:
Test conditions:
DOWNLOAD_DELAY = 5
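For reference, a minimal settings.py sketch for this kind of scrapy-redis setup; the scheduler and dupefilter lines are the standard scrapy-redis settings, and the exact values are assumptions:
# settings.py (sketch)
DOWNLOAD_DELAY = 5                                           # wait 5s between requests
SCHEDULER = "scrapy_redis.scheduler.Scheduler"               # redis-backed scheduler
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"   # dedup shared across workers
SCHEDULER_PERSIST = True                                     # keep the queue in redis between runs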
With no requests pending, the local spider raised a NotImplementedError.
Cause: the parse(self, response) method was commented out,
leaving only parsejob(self, response) to handle responses.
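This is expected: when a downloaded response has no explicit callback, Scrapy falls back to the spider's parse() method, and the base Spider class leaves it unimplemented, roughly:
class Spider:
    def parse(self, response):
        # Scrapy's default: subclasses must override this or pass callback=
        raise NotImplementedError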
Result:
The spider did not stop running: scrapy-redis has an idle-wait mechanism (the SCHEDULER_IDLE_BEFORE_CLOSE setting shown below) built for exactly this distributed case. The following settings comment explains it:
# Max idle time to prevent the spider from being closed when distributed crawling.
# This only works if queue class is SpiderQueue or SpiderStack,
# and may also block the same time when your spider start at the first time (because the queue is empty).
#SCHEDULER_IDLE_BEFORE_CLOSE = 10
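To enable the idle wait, uncomment that last line in settings.py:
SCHEDULER_IDLE_BEFORE_CLOSE = 10  # seconds to wait on an empty queue before closing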
Spider code:
from scrapy_redis.spiders import RedisSpider
import scrapy

class MySpider(RedisSpider):
    name = "zhilian"
    # Start URLs are popped from this redis list instead of a start_urls attribute.
    redis_key = "zhilian:start_urls"
    allowed_domains = ["jobs.zhaopin.com", "sou.zhaopin.com"]

    def parse(self, response):
        # 1. Read the current page number from the pagination bar.
        pagenum = response.xpath("//body/div[3]/div[3]/div[2]/form/div[1]/div[1]/div[3]/ul/li[6]/a/text()").extract_first()
        if pagenum is not None and int(pagenum) <= 30:
            # 2. Extract every job-detail URL on this listing page.
            jobsurl = response.css("td.zwmc>div>a::attr(href)").extract()
            for joburl in jobsurl:
                yield scrapy.Request(joburl, callback=self.parsejob)
            # 3. Follow the next-page link, if there is one.
            nextPage = response.xpath("//body/div[3]/div[3]/div[2]/form/div[1]/div[1]/div[3]/ul/li[11]/a/@href").extract_first()
            if nextPage:
                yield scrapy.Request(response.urljoin(nextPage), callback=self.parse)

    def parsejob(self, response):
        # Extract the job's detail fields (only the title for now).
        yield {
            'jobname': response.xpath("//body/div[5]/div[1]/div[1]/h1/text()").extract_first(),
        }
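Since this is a RedisSpider, the crawl only starts once a URL is pushed onto the redis_key list. A minimal sketch with redis-py; the host, port, and the search URL (elided here, as in the log excerpts above) are placeholders:
import redis

r = redis.StrictRedis(host="localhost", port=6379)      # assumes a local redis instance
r.lpush("zhilian:start_urls", "http://sou.zhaopin.com/...")  # placeholder search URL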