
Scraping job listings from Zhilian Zhaopin (智聯招聘)

Crawl plan: crawl up to 30 pages of listings for each job category.

Determining the page count:

                         Locate the pagination bar at the bottom of the results page and check the "30" in it (a short sketch of the check follows):

上一頁 1 .... 28 29 30 31 下一頁   (Prev 1 .... 28 29 30 31 Next)
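
A minimal sketch of that check, reusing the pagination XPath from the full spider further down (the exact li index is tied to Zhilian's page layout at the time):

# inside parse(): read the last visible page number from the pagination bar
pagenum = response.xpath(
    "//body/div[3]/div[3]/div[2]/form/div[1]/div[1]/div[3]/ul/li[6]/a/text()"
).extract_first()
if pagenum and int(pagenum) <= 30:
    ...  # still within the 30-page budget, keep extracting jobs and following pages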

Locate these links (the job-title anchors in the results table) to get to each job's detail page; the link text looks like:

PHP工程師
PHP實習生應屆生均可
PHP軟體開發工程師
PHP工程師
...

jobs = response.css("td.zwmc>div>a")
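
Continuing from that selector, a sketch of how the matched <a> nodes can be turned into detail-page requests inside parse() (the full spider at the end extracts the href directly with ::attr(href) instead):

for job in jobs:
    title = job.css("::text").extract_first()         # visible job title, e.g. "PHP工程師"
    joburl = job.css("::attr(href)").extract_first()  # link to the job's detail page
    if joburl:
        yield scrapy.Request(joburl, callback=self.parsejob)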

Parsing the response returned for the start URL taken from myspider:start_urls:

def parse(self, response):

             1. Check the page number
             2. Parse the page:

                    (i) extract the URLs of the jobs

                    (ii) yield a Request for each one, with parsejob() as the callback for the next step

                    (iii) extract the next-page URL and yield a Request for it

def parsejob(self, response):

            1. Extract the detailed information about the job

Observations from the scrapy log:

      This is the key line: a request was made and the page was downloaded:

      [scrapy.core.engine]    DEBUG:  Crawled (200)   <GET  http://..............>

      This one means the downloaded response was handed to the spider and parsed (an item was scraped from it):

      [scrapy.core.scraper]   DEBUG:   Scraped from  <200    http://..............>

Situations encountered while running the spider:

        Test conditions:

                       DOWNLOAD_DELAY = 5

                       With no requests pending, the local spider raised a NotImplementedError.

                       Cause: the parse(self, response) method was commented out,

                                    leaving only parsejob(self, response) to handle responses, so the default callback for the start URLs was missing.

                       Result:

                                    The spider did not stop running: scrapy-redis has an idle-wait mechanism built for this distributed setup (the SCHEDULER_IDLE_BEFORE_CLOSE setting quoted below); the comment from the scrapy-redis example settings explains it:

# Max idle time to prevent the spider from being closed when distributed crawling.
 # This only works if queue class is SpiderQueue or SpiderStack,
 # and may also block the same time when your spider start at the first time (because the queue is empty).
 #SCHEDULER_IDLE_BEFORE_CLOSE = 10
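
For reference, a minimal settings.py sketch for running this spider through scrapy-redis (the scheduler and dupefilter class paths are the standard scrapy-redis ones; the redis host/port and SCHEDULER_PERSIST choice are assumptions for a local setup):

# settings.py (sketch)
# Route requests through the scrapy-redis scheduler and duplicate filter (shared via redis).
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# Keep the request queue in redis so a crawl can be paused and resumed (assumed choice).
SCHEDULER_PERSIST = True
# The idle timeout discussed above (left commented out, as in the example settings).
#SCHEDULER_IDLE_BEFORE_CLOSE = 10
# The test condition used above.
DOWNLOAD_DELAY = 5
# Assumed local redis instance.
REDIS_HOST = "localhost"
REDIS_PORT = 6379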
           

Spider code:

from scrapy_redis.spiders import RedisSpider
import scrapy

class MySpider(RedisSpider):
    name = "zhilian"
    # Start URLs are popped from this redis list instead of a start_urls attribute.
    redis_key = "zhilian:start_urls"
    allowed_domains = ["jobs.zhaopin.com", "sou.zhaopin.com"]

    def parse(self, response):
        # Last page number shown in the pagination bar (the "30" located above).
        pagenum = response.xpath(
            "//body/div[3]/div[3]/div[2]/form/div[1]/div[1]/div[3]/ul/li[6]/a/text()"
        ).extract_first()
        if pagenum and int(pagenum) <= 30:
            # Job-title links in the results table -> detail pages.
            jobsurl = response.css("td.zwmc>div>a::attr(href)").extract()
            for joburl in jobsurl:
                yield scrapy.Request(joburl, callback=self.parsejob)
            # "下一頁" (next page) link, handled again by this same method.
            nextPage = response.xpath(
                "//body/div[3]/div[3]/div[2]/form/div[1]/div[1]/div[3]/ul/li[11]/a/@href"
            ).extract_first()
            if nextPage:
                yield scrapy.Request(nextPage, callback=self.parse)

    def parsejob(self, response):
        # Job detail page: extract the job title.
        yield {
            'jobname': response.xpath("//body/div[5]/div[1]/div[1]/h1/text()").extract_first(),
        }
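
The crawl only starts once a start URL is pushed into the zhilian:start_urls list in redis; a minimal sketch of seeding it with redis-py (the host/port and the search URL are placeholder assumptions):

import redis

r = redis.Redis(host="localhost", port=6379)
# RedisSpider pops start URLs from the list named by redis_key ("zhilian:start_urls").
r.lpush("zhilian:start_urls", "http://sou.zhaopin.com/...")  # placeholder search URL

# then run the spider as usual:
#   scrapy crawl zhilian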
           
