Scraping 51job with Scrapy and a Simulated Browser (Crawling Dynamically Rendered Pages)

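51job link

When crawling the web you meet not only static pages but also dynamic ones. Dynamic pages are rendered mainly by JavaScript, and crawlers run into them all the time.

Crawling a dynamically rendered page means simulating how a browser works, so that the source you scrape is exactly the content you see in the browser: what you can see, you can crawl.

With this approach the crawler opens a browser, loads the page, drives the browser from page to page automatically, and grabs the HTML of each loaded page. Put simply, rendering in a browser turns the job of crawling a dynamic page into the job of crawling a static one.

We can use Python's Selenium library to simulate a browser for scraping. Selenium is a tool for testing web applications. Selenium tests run directly in the browser, which follows the script to click, type, open pages, verify results, and so on, just as a real user would.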

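Installation and setup

Selenium is an automated testing tool. It can drive a browser to perform specific actions such as clicking and scrolling, and it can read the source of the page the browser is currently showing, so what you can see, you can crawl.

Install the Selenium library as follows:

pip install selenium

Once Selenium is installed, you can test it from the command line by starting the Python interpreter and entering:

import selenium

If this produces no error, Selenium was installed successfully, as the figure below shows.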

[Figure: running import selenium in the Python interpreter without errors]
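The same check can be scripted, which is handy inside a setup script. This is a minimal sketch using only the standard library; the helper name is_installed is made up for this example.

```python
import importlib.util


def is_installed(module_name):
    """Return True if the named top-level module can be found for import."""
    return importlib.util.find_spec(module_name) is not None


# Report whether Selenium is available in the current environment.
print("selenium installed:", is_installed("selenium"))
```

If this prints False, run pip install selenium in the same environment and try again.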

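Downloading and installing the browser driver

A browser driver is a separate program supplied by the browser vendor, and each browser needs its own driver. Chrome and Firefox, for example, each have a different driver.

When the driver receives a UI operation request from our automation program, it forwards the request to the browser, which carries out the operation. When the browser has finished, it returns the result to the driver, and the driver passes it back to the client library of our automation program in an HTTP response. The client library then converts the response into data objects and returns them to the program code.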
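To make that round trip concrete, here is a rough sketch of the HTTP requests a client library sends to the driver under the W3C WebDriver protocol. It only builds the requests and sends nothing. The helper names are invented for this example, the session id is a placeholder, and http://localhost:9515 assumes ChromeDriver listening on its default port.

```python
import json

DRIVER_URL = "http://localhost:9515"  # assumption: ChromeDriver on its default port


def new_session_request(browser_name="chrome"):
    """Build the request that asks the driver to start a browser session."""
    body = {"capabilities": {"alwaysMatch": {"browserName": browser_name}}}
    return "POST", DRIVER_URL + "/session", json.dumps(body)


def navigate_request(session_id, url):
    """Build the request that tells the browser in a session to open a URL."""
    endpoint = "%s/session/%s/url" % (DRIVER_URL, session_id)
    return "POST", endpoint, json.dumps({"url": url})


# "abc123" is a placeholder; a real id comes back in the new-session response.
method, endpoint, payload = navigate_request("abc123", "https://www.51job.com/")
print(method, endpoint, payload)
```

Calls such as browser.get(url) are wrappers around requests of this kind, which is why the driver has to be running and reachable before any browser command works.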

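Before downloading the Chrome driver, first find out which version of Chrome you have. Click Chrome's "Customize and control Google Chrome" button, choose "Help", then "About Google Chrome (G)", and note the browser's actual version number.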

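https://chromedriver.storage.googleapis.com/index.html is the download address for the Chrome driver. Pick the build that matches your Chrome version number and your operating system. (That index covers ChromeDriver releases up to version 114; drivers for newer Chrome versions are published through Chrome for Testing.)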

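After the download finishes, unzip it and copy ChromeDriver.exe to a directory of your choice. The code written later has to point to the directory that holds the driver.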
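As a rule of thumb, matching the driver to the browser comes down to comparing major version numbers. A small sketch of that comparison follows; the function names are illustrative and the version strings are made-up examples.

```python
def major_version(version):
    """Return the major part of a dotted version string, e.g. '114.0.5735.90' -> 114."""
    return int(version.strip().split(".")[0])


def driver_matches(chrome_version, driver_version):
    """True when the browser and the driver share the same major version."""
    return major_version(chrome_version) == major_version(driver_version)


# Example values only; substitute the versions you actually have installed.
print(driver_matches("114.0.5735.199", "114.0.5735.90"))
```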

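Declaring the browser object

Selenium supports many browsers, such as Chrome, Firefox, and Edge, as well as mobile browsers on platforms like Android and BlackBerry. It also supports the headless browser PhantomJS.

The browsers are initialized like this:

from selenium import webdriver
# path must point to the driver of the browser you choose; create only the one you need
browser = webdriver.Chrome(executable_path=path)
browser = webdriver.Firefox(executable_path=path)
browser = webdriver.Edge(executable_path=path)
browser = webdriver.PhantomJS(executable_path=path)
browser = webdriver.Safari(executable_path=path)

Here executable_path is the location of the browser driver.

The steps above initialize a browser object and assign it to browser.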
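Note that the executable_path argument belongs to Selenium 3. Selenium 4 passes the driver path through a Service object, and its later releases no longer accept executable_path. The sketch below picks the constructor style from the installed version; make_chrome and uses_service_api are names invented here, and no browser is launched until make_chrome is called.

```python
def uses_service_api(selenium_version):
    """Selenium 4 and later take the driver path through a Service object."""
    return int(selenium_version.split(".")[0]) >= 4


def make_chrome(driver_path):
    """Start Chrome with whichever constructor style the installed Selenium expects."""
    # Imported lazily so that this file still loads where Selenium is missing.
    import selenium
    from selenium import webdriver

    if uses_service_api(selenium.__version__):
        from selenium.webdriver.chrome.service import Service
        return webdriver.Chrome(service=Service(executable_path=driver_path))
    return webdriver.Chrome(executable_path=driver_path)
```

The rest of this article keeps the Selenium 3 style, so on Selenium 4 its calls would need adjusting in the same way.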

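Visiting a page

Selenium requests a page with the get() method. The syntax is:

browser.get(url)

A complete example of visiting a page:

from selenium import webdriver
path = "E:/chromedriver.exe"
browser = webdriver.Chrome(executable_path=path)  # get a Chrome driver instance
browser.get('https://www.taobao.com/')  # open Taobao
print(browser.page_source)  # print the page source
browser.close()  # close the browser

When the program runs, a Chrome window pops up and opens Taobao automatically, the source of the Taobao page is printed, and finally the browser closes.

webdriver.Chrome() obtains a Chrome driver instance. The method name after webdriver is the name of the browser, so webdriver.Firefox() gives a Firefox driver instance. The executable_path argument (E:/chromedriver.exe here) is the path to the driver. It can be omitted, but then the directory containing ChromeDriver.exe must be on the system PATH. browser.get(url) opens the given page, and browser.close() closes the browser window that Selenium opened.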
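One detail worth knowing is that close() only closes the current window, while quit() ends the whole session and stops the driver process. If an exception is raised before either call, the browser stays open. The sketch below wraps the browser in a context manager so that quit() always runs; managed_browser and fetch_source are names chosen for this example, and factory stands for any callable that returns a browser.

```python
from contextlib import contextmanager


@contextmanager
def managed_browser(factory):
    """Create a browser with factory() and always quit it afterwards."""
    browser = factory()
    try:
        yield browser
    finally:
        browser.quit()  # ends the session even if the body raised an exception


def fetch_source(factory, url):
    """Open url in a fresh browser and return the rendered HTML."""
    with managed_browser(factory) as browser:
        browser.get(url)
        return browser.page_source
```

With the setup from the example above, a call could look like fetch_source(lambda: webdriver.Chrome(executable_path=path), 'https://www.taobao.com/').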

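Common errors when using the selenium module:

Error: "Exception AttributeError: Service object has no attribute process in…". The driver probably cannot be found through the environment variables (for Firefox the driver is geckodriver). Add the directory that contains the driver to the PATH environment variable again, or give the path directly in code: webdriver.Chrome('full path to ChromeDriver')
Error: selenium.common.exceptions.WebDriverException: Message: Unsupported Marionette protocol version 2, required
The Firefox version is probably too old (Marionette is the automation protocol Firefox uses).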

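Element selectors

Before you can operate on a page, you have to select elements on it. Selenium's selection methods are the find_element_by_* family, which returns one element, and the find_elements_by_* family, which returns all matches. Each family can look up by id, name, class name, tag name, link text, partial link text, XPath, or CSS selector.

As for naming, methods that locate a single element use the word element and methods that locate several use elements. In use, a single-element method returns an element object. If several elements match, it returns the first one; if none match, it raises an exception and the program stops there. A multi-element method returns a list that you can loop over or index. When nothing matches it returns an empty list instead of raising, which suits more complex conditional logic.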
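The difference in failure behaviour suggests a small helper that never raises for a missing element. The sketch below relies on the generic find_elements(by, value) method that every driver has; first_or_none is a name invented here, and the string "css selector" is the value of Selenium's By.CSS_SELECTOR constant.

```python
CSS = "css selector"  # same value as selenium.webdriver.common.by.By.CSS_SELECTOR


def first_or_none(driver, selector):
    """Return the first element matching a CSS selector, or None if nothing matches."""
    matches = driver.find_elements(CSS, selector)  # empty list when nothing matches
    return matches[0] if matches else None
```

A caller can then write a plain if element is None: check instead of wrapping every lookup in try/except.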

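The basic usage is explained below with the Baidu home page as the example. You need to be comfortable with CSS selectors, so here is a quick review. An id selector uses #: "#u1" locates the element whose id is u1. A class selector uses ".": ".mnav" locates every element whose class is mnav. An element selector uses the tag name directly: "div" locates every div. Combined selectors mix these forms and are the kind used most often. "#u1 .pf" locates every element with class pf anywhere under the element with id u1, while "#u1>.pf" locates elements with class pf that are direct children of the element with id u1.

To confirm that the intended element was located, use the get_attribute method to read the element's attributes. Its argument can be any valid HTML attribute such as class or name, and outerHTML returns the HTML of the located element including the element itself. element1.text returns the element's text, including the text of its descendants.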
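A rough sketch of such a lookup on the Baidu home page follows. The selector "#u1" comes from the review above; describe and demo are names made up for this example, and demo expects an already created browser object.

```python
def describe(element):
    """Collect the details that confirm which element was located."""
    return {
        "class": element.get_attribute("class"),
        "outer_html": element.get_attribute("outerHTML"),
        "text": element.text,
    }


def demo(browser):
    """Print the details of the element with id 'u1' on the Baidu home page."""
    browser.get("https://www.baidu.com/")
    # "css selector" is the value of Selenium's By.CSS_SELECTOR constant.
    element1 = browser.find_element("css selector", "#u1")
    print(describe(element1))
```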

The table below compares common CSS selectors with the other kinds of selectors.

[Table: common CSS selectors compared with other selectors]

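Methods for operating on elements

Operating on an element usually means clicking it, typing a string into an input box, or reading the information it contains.

Selenium can drive the browser to perform these operations, that is, make the browser simulate the actions. The common operations and their methods are:

Typing text: use the send_keys() method

Clearing text: use the clear() method

Clicking a button: use the click() method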

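from selenium import webdriver
import time
path = "E:/chromedriver.exe"
browser = webdriver.Chrome(executable_path=path)
browser.get('https://music.163.com/')
# Get the search box (named search_box so the built-in input() is not shadowed)
search_box = browser.find_element_by_id('srch')
# Type "Andy Lao"; the search button has not been clicked yet, so no search runs
search_box.send_keys('Andy Lao')
time.sleep(1)
# Clear the search box
search_box.clear()
search_box.send_keys('劉德華')
# Get the search button
button = browser.find_element_by_name('srch')
# Click the button to run the search
button.click()
# Close the browser
browser.close()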

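The program works as follows:

1. Drive the browser to open NetEase Cloud Music;

2. Use find_element_by_id() to get the search box;

3. Use send_keys() to type: Andy Lao;

4. Wait one second, then use clear() to empty the search box;

5. Call send_keys() again to type: 劉德華;

6. Use find_element_by_name() to get the search button;

7. Call click() to run the search.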

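Action chains

Selenium can also drive the browser to perform operations that have no specific target element, such as dragging with the mouse or pressing keys. Operations of this kind are called action chains.

Selenium provides the ActionChains class for them. It handles mouse movement, mouse button actions, key presses, context menu (right-click) interaction, and so on.

click(on_element=None): click the left mouse button

click_and_hold(on_element=None): press the left mouse button without releasing it

context_click(on_element=None): click the right mouse button

double_click(on_element=None): double-click the left mouse button

drag_and_drop(source, target): drag to a target element, then release

drag_and_drop_by_offset(source, xoffset, yoffset): drag by the given offset, then release

key_down(value, element=None): press a key on the keyboard

key_up(value, element=None): release a key

move_by_offset(xoffset, yoffset): move the mouse from its current position by the given offset

move_to_element(to_element): move the mouse to an element

move_to_element_with_offset(to_element, xoffset, yoffset): move to a position at the given distance from an element (measured from its top-left corner)

perform(): execute all the actions in the chain

release(on_element=None): release the left mouse button over an element

send_keys(*keys_to_send): send keys to the element that currently has focus

send_keys_to_element(element, *keys_to_send): send keys to the given element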
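The important point about these methods is that they only queue actions; nothing happens in the browser until perform() is called. A small sketch of one queued drag gesture follows. drag_by_offsets is a name made up here, and chain is expected to be an ActionChains instance or any object with the same methods.

```python
def drag_by_offsets(chain, source, offsets):
    """Queue a press, a series of horizontal moves, and a release, then run them."""
    chain.click_and_hold(source)      # queued: press the left button on source
    for dx in offsets:
        chain.move_by_offset(dx, 0)   # queued: move dx pixels to the right
    chain.release()                   # queued: let go of the button
    chain.perform()                   # now the whole chain actually executes
```

With Selenium this could be called as drag_by_offsets(ActionChains(browser), slider, [10, 20, 30]), where slider is an element found beforehand.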

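Now for the main topic!

I use Firefox; you can choose whichever browser you like.

When crawling we sometimes run into a slider captcha, and then we need to know how far the slider has to travel. To imitate a human, drag quickly during the first phase, giving the slider a positive acceleration, and slow down as it approaches the end.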

def get_track(distance, t):
    track = []
    current = 0  # current position, starting at 0
    #mid = distance * t / (t+1)
    mid = distance * 3 / 4  # switch from speeding up to slowing down at 3/4 of the way
    #print(mid)
    v = 6.8  # initial velocity
    while current < distance:
        if current < mid:
            a = 2   # accelerate
        else:
            a = -3  # decelerate
        v0 = v
        v = v0 + a * t
        move = v0 * t + 1/2 * a * t * t  # displacement in one time step: s = v0*t + a*t^2/2
        current += move
        #print(current)
        track.append(round(move))
    return track
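Because the last step is not clipped, the steps returned by get_track can add up to more than distance: get_track(258, 2) yields steps that sum to 284. That is harmless for a slider that only has to reach the end of its bar, but not when the target position is exact. Below is a hedged variant that keeps the same speed-up-then-slow-down idea and makes the steps sum to distance exactly. The name get_track_exact and its default parameters are choices made for this sketch, and distance is assumed to be a whole number of pixels.

```python
def get_track_exact(distance, t=2, v0=6.8, accel=2.0, decel=-3.0):
    """Return integer steps that speed up, then slow down, and sum to distance."""
    track = []
    current = 0
    mid = distance * 3 / 4          # switch to deceleration after 3/4 of the way
    v = v0
    while current < distance:
        a = accel if current < mid else decel
        move = v * t + 0.5 * a * t * t   # s = v0*t + a*t^2/2
        v = max(v + a * t, 1.0)          # never let the velocity drop to zero or below
        step = max(1, round(move))       # always move at least one pixel forward
        step = min(step, distance - current)  # clip the final step to land exactly
        track.append(step)
        current += step
    return track


print(get_track_exact(258, 2))
```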

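How far should the slider travel? Here is an example.

Using the developer tools or Photoshop, measure the width and height of the round handle, say 40x30, and then the width and height of the whole slider bar, say 340x30. The handle then has to travel 340 - 40 = 300. So when computing the track with the small function above, the total distance should be about 300, neither much more nor much less.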

Operating the slider:

① Press the left mouse button and hold it

② Drag to the right

③ Release the mouse button

Complete code:

middlewares.py

# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html

from scrapy import signals
from selenium.common.exceptions import TimeoutException
from selenium.webdriver import Firefox
from selenium.webdriver.firefox.options import Options
import time
from selenium.webdriver.common.action_chains import ActionChains 
from scrapy.http import HtmlResponse

def get_track(distance,t):
    track = []
    current = 0
    #mid = distance * t / (t+1)
    mid = distance * 3 / 4
    #print(mid)
    v = 6.8
    while current < distance:
        if current < mid:
            a = 2
        else:
            a = -3
        v0 = v
        v = v0 + a * t
        move = v0 * t + 1/2 * a * t * t
        current += move
        #print(current)
        track.append(round(move))
    return track

class SeleniumMiddleware:
    
    def __init__(self):
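        # 1. Create the Firefox options
        opt = Options()

        # 2. Start Firefox through geckodriver (a visible window unless headless mode is set on opt)
        self.browser = Firefox(executable_path=r'D:\geckodriver.exe', options=opt)
        self.browser.maximize_window()  # maximize the browser window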

    @classmethod
    def from_crawler(cls, crawler):  # connect spider_closed so the browser quits with the spider
        s = cls()
        crawler.signals.connect(s.spider_closed, signal=signals.spider_closed)
        return s
        
    # Drag along the track to complete the verification
    def move_to_gap(self, slider, tracks):
        # Drag the slider to the gap. slider: the slider element; tracks: per-step offsets
        ActionChains(self.browser).click_and_hold(slider).perform()
        print(tracks)
        for x in tracks:
            print(x)
            ActionChains(self.browser).move_by_offset(x, 0).perform()  # move to the right
        ActionChains(self.browser).release().perform()  # release the mouse button
        # perform() runs the queued actions; release(on_element=None) lets go of the left button
        
    def process_request(self, request, spider):
        # Intended hook: decide whether a request needs the browser and skip it otherwise.
        # As written, every request is rendered in the browser.
        try:
            # 3. Open the requested page
            self.browser.get(request.url)

            # Handle the slider captcha on job detail pages
            if request.url.find("https://jobs.51job.com/") != -1:
                try:
                    # find_elements_* returns an empty list when nothing matches
                    sliders = self.browser.find_elements_by_xpath("//span[@id='nc_1_n1z']")
                    if sliders:
                        print("==== slider found ====")
                        self.move_to_gap(sliders[0], get_track(258, 2))  # drag the slider
                        time.sleep(10)  # give the page time to load after verification
                        print("==== slider handled ====")
                    else:
                        print("==== no slider ====")
                except Exception as e:
                    print("===" + str(e))
            else:
                print("===feeder====")
                time.sleep(2)
            return HtmlResponse(url=request.url, body=self.browser.page_source, request=request, encoding='utf-8', status=200)
        except TimeoutException:
            return HtmlResponse(url=request.url, status=500, request=request)
        
    def spider_closed(self):
        self.browser.quit()
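The comment at the top of process_request mentions skipping the browser for requests that do not need it, but the middleware above renders everything. One possible way to add that is an opt-in flag in request.meta. In the sketch below, the class name and the use_selenium key are hypothetical; returning None from process_request is Scrapy's way of saying "carry on with the normal downloader".

```python
class SelectiveSeleniumMiddleware:
    """Render a request in the browser only when the request asks for it."""

    def __init__(self, browser):
        self.browser = browser

    def process_request(self, request, spider):
        if not request.meta.get("use_selenium"):  # hypothetical flag name
            return None  # let Scrapy download this request the usual way
        # Imported lazily; only the browser path needs Scrapy's response class.
        from scrapy.http import HtmlResponse
        self.browser.get(request.url)
        return HtmlResponse(url=request.url, body=self.browser.page_source,
                            request=request, encoding="utf-8", status=200)
```

A spider would then opt in per request, for example scrapy.Request(url, meta={"use_selenium": True}).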

job.py

# -*- coding: utf-8 -*-
import scrapy
#from scrapy.utils.response import open_in_browser
import copy

class JobSpider(scrapy.Spider):
    name = 'job'
    allowed_domains = ['51job.com']
    start_urls=['https://search.51job.com/list/060000,000000,0000,00,9,99,%25E5%25A4%25A7%25E6%2595%25B0%25E6%258D%25AE,2,{i}.html' for i in range(1,2)]
   
    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url)
        #yield scrapy.Request("https://httpbin.org/ip")
    def parse(self, response):
        item = {}
        print("======")
        print(len(response.xpath("//div[@class='j_joblist']/div[@class='e']")))
        for entry in response.xpath("//div[@class='j_joblist']/div[@class='e']"):
            url = entry.xpath(".//p[@class='t']/../@href").get()
            item['url'] = url
            item['job']=entry.xpath(".//p[@class='t']/span[1]/text()").get()
            item['price'] = entry.xpath(".//span[@class='sal']/text()").get()
            item['where'] = entry.xpath(".//p[@class='info']/span[2]/text()").get().split('  |  ')[0]
            item['jingyan'] = entry.xpath(".//p[@class='info']/span[2]/text()").get().split('  |  ')[1]
            item['xueli'] = entry.xpath(".//p[@class='info']/span[2]/text()").get().split('  |  ')[2]
            item['gongsi']=entry.xpath(".//div[@class='er']/a/text()").get()
            item['daiyu']=entry.xpath(".//p[@class='tags']/@title").get()
            yield scrapy.Request(url,callback=self.parse_detail,meta={'item':copy.deepcopy(item)},dont_filter=True)
    def parse_detail(self,response):
        item = response.meta['item']
        content = response.xpath("//div[contains(@class,'job_msg')]").xpath("substring-before(.,'職能類别：')").xpath('string(.)').extract()
        # Strip all whitespace from each text fragment and concatenate the results
        desc = ""
        for i in content:
            desc += "".join(i.split())
        item['desc'] = desc
        yield item
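The parse method above splits the same string three times and indexes the result directly, so a listing with fewer than three parts raises an IndexError (and a missing node makes .get() return None). The helper below is a more forgiving sketch; split_info is a name invented here, and the sample strings are made up.

```python
def split_info(text, fields=3):
    """Split 'city | experience | education' into exactly `fields` parts.

    Missing or empty parts come back as None instead of raising IndexError.
    """
    parts = [part.strip() for part in (text or "").split("|")]
    parts = [part or None for part in parts]
    return (parts + [None] * fields)[:fields]


# Made-up sample in the same 'a  |  b  |  c' shape the spider expects.
print(split_info("Chongqing  |  3 years  |  Bachelor"))
```

In the spider the result could be unpacked as item['where'], item['jingyan'], item['xueli'] = split_info(info_text), with info_text read once from the XPath.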

Storing the data in MongoDB

If you are not sure how to do this, have a look at my earlier articles.

pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
#import pymysql

class Job51Pipeline(object):
    def process_item(self, item, spider):
        return item
       
import pymongo
from urllib import parse

class NewPipeline_mongo:
    def __init__(self, mongo_uri, mongo_db,account,passwd):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db
        self.account = account
        self.passwd = passwd
        
    @classmethod
    def from_crawler(cls, crawler):
        #print(crawler.settings.get('USERNAME'))
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI','localhost'),
            mongo_db=crawler.settings.get('MONGO_DB','cq'),
            account = crawler.settings.get('USERNAME','root'),
            passwd = crawler.settings.get('PWD','123456')
        )

    def open_spider(self, spider):
        uri = 'mongodb://%s:%s@%s:27017/?authSource=admin' % (self.account, parse.quote_plus(self.passwd),self.mongo_uri)
        #print(uri)
        self.client = pymongo.MongoClient(uri)
        self.db = self.client[self.mongo_db]
        print(self.mongo_db)
        
    def process_item(self, item, spider):
        print(item)
        collection = 'job51'
        self.db[collection].insert_one(dict(item))
        return item
    
    def close_spider(self, spider):
        self.client.close()
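The connection string built in open_spider percent-encodes the password, which matters whenever it contains characters such as @ or :. The same idea as a standalone function, which also encodes the user name, is sketched below; build_mongo_uri is a name chosen here and the credentials are placeholders.

```python
from urllib.parse import quote_plus


def build_mongo_uri(user, password, host, port=27017, auth_source="admin"):
    """Assemble a MongoDB connection URI with percent-encoded credentials."""
    return "mongodb://%s:%s@%s:%d/?authSource=%s" % (
        quote_plus(user), quote_plus(password), host, port, auth_source)


# Placeholder credentials; '@' and ':' in the password get encoded.
print(build_mongo_uri("root", "p@ss:word", "localhost"))
```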

settings.py

ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 1

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 1
DOWNLOADER_MIDDLEWARES = {
    'job51.middlewares.SeleniumMiddleware': 543,
}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'job51.pipelines.NewPipeline_mongo': 200,
}
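The pipeline reads four settings (MONGO_URI, MONGO_DB, USERNAME, PWD) that the listing above does not define, so it falls back to the defaults hard-coded in from_crawler. They could be set explicitly along these lines; every value here is a placeholder to replace with your own.

```python
# settings.py (additions); all values below are placeholders
MONGO_URI = "localhost"  # host name only; the pipeline adds the scheme and port 27017
MONGO_DB = "cq"          # database that receives the job51 collection
USERNAME = "root"        # MongoDB account
PWD = "change-me"        # MongoDB password; the pipeline percent-encodes it
```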
