Scraping Flight Data from Qunar (去哪兒網)

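This time we use webdriver to scrape flight data from Qunar. Why scrape the m-site (the mobile site) rather than the main site? On the main site, ticket prices are manipulated through CSS so that the price displayed on the page differs from the price in the HTML elements. This can be worked around, but it is tedious. On the m-site the two prices match, so we scrape the m-site. Readers who are interested can try decoding the CSS obfuscation and scraping the main site themselves.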

[Screenshot: mobile site data]

[Screenshot: main site, PC version]

Analyzing the page shows which URL the data is fetched from.

[Screenshot: the data URL]

So we only need to request that address through webdriver and change the relevant parameters to get the data.

requests_dic = {
    'depCity': from_city,    # departure city
    'arrCity': to_city,      # arrival city
    'goDate': '2019-02-27'   # date
}

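Request URL (requests_dic is URL-encoded into the query string):

from urllib.parse import urlencode

driver_url = ('https://m.flight.qunar.com/ncs/page/flightlist?'
              '%s&from=touch_index_search&child=0&baby=0&cabinType=0' % urlencode(requests_dic))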
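The URL construction above can be wrapped in a small function. This is a minimal sketch, not the author's code: the name build_flight_url and the constant BASE_URL are hypothetical, and the fixed parameters are copied from the request URL shown in this post. It uses urlencode so that Chinese city names are percent-encoded correctly.

```python
from urllib.parse import urlencode

# Base address taken from the request URL shown in the post.
BASE_URL = "https://m.flight.qunar.com/ncs/page/flightlist"


def build_flight_url(from_city, to_city, go_date):
    """Return the m-site flight list URL for one route and date (hypothetical helper)."""
    params = {
        "depCity": from_city,          # departure city
        "arrCity": to_city,            # arrival city
        "goDate": go_date,             # date as YYYY-MM-DD
        "from": "touch_index_search",  # fixed values copied from the post
        "child": 0,
        "baby": 0,
        "cabinType": 0,
    }
    return BASE_URL + "?" + urlencode(params)


print(build_flight_url("北京", "上海", "2019-02-27"))
```

The result can be passed straight to driver.get(). Building the query from a dictionary also makes it easy to loop over several routes or dates.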

Note: the webdriver needs the following options set.

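from selenium import webdriver
from selenium.webdriver.chrome.options import Options

mobile_emulation = {"deviceName": "iPhone X"}
options = Options()
# Very important: gets around webdriver detection by avoiding the site's JavaScript check for webdriver
options.add_experimental_option('excludeSwitches', ['enable-automation'])
options.add_experimental_option("mobileEmulation", mobile_emulation)
driver = webdriver.Chrome(options=options)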

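Qunar has a webdriver detection mechanism, so a plain webdriver session gets blocked by its anti-scraping measures. Enter window.navigator.webdriver in the console: a normal browser shows undefined (recent Chrome versions show false instead), while a webdriver-controlled browser shows true. Try it yourself if you are interested.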

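Checking the return value of window.navigator.webdriver in JavaScript is enough to detect automation. Anyone who knows JavaScript might think of overriding the value, for example with Object.defineProperties(navigator,{webdriver:{get:()=>undefined}}), and this does change it.

This approach still has problems. If you then move to another page in the automated browser by clicking a link or entering a URL, or open a new window, window.navigator.webdriver is true again.

Could you re-run that JavaScript through webdriver after every page opens, so that window.navigator.webdriver is undefined on every page? That does not work either. When you call driver.get(url), the browser opens the site, loads the page, and runs the site's own JavaScript. By the time you reset window.navigator.webdriver, the site already knows it is dealing with an automated browser.

The fix is to enable Chrome's experimental option excludeSwitches with the value ['enable-automation'] before starting Chromedriver.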
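To see what the site's scripts would read, the flag can be queried from Python. This is a minimal sketch: the function names webdriver_flag and looks_automated are hypothetical, and the only assumption about the driver is that it offers execute_script, as Selenium drivers do. Selenium converts a JavaScript true to Python True and undefined to None.

```python
def webdriver_flag(driver):
    """Return the value page scripts get when they read navigator.webdriver."""
    return driver.execute_script("return window.navigator.webdriver")


def looks_automated(driver):
    """Return True only when the flag is exactly true (hypothetical helper)."""
    return webdriver_flag(driver) is True
```

Calling looks_automated(driver) after driver.get(url) reports the current state of the flag. It does not tell you whether the site's own scripts already read the flag while the page was loading.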

Now the page can be scraped with webdriver. After opening it, you see the following.

[Screenshot: the loaded page]

We only need to make webdriver scroll down and click the "load more" element to get more data.

from selenium.webdriver.common.by import By

# Scroll to the bottom of the page
driver.execute_script("window.scrollTo(100, document.body.scrollHeight);")
# Locate the "load more" element, print its text, and click it
more_list = driver.find_element(By.XPATH, ".//section[@class='list-getmore']")
print(more_list.text)
more_list.click()
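The scroll-and-click step usually has to be repeated to load the whole list. The loop below is a sketch under stated assumptions, not the author's code: the function load_all_flights and its max_clicks and pause parameters are hypothetical, and it assumes the "load more" element is no longer matched once everything is loaded. The max_clicks cap guards against that assumption being wrong.

```python
import time

# XPath of the "load more" element, as used in the post.
MORE_XPATH = ".//section[@class='list-getmore']"


def load_all_flights(driver, max_clicks=20, pause=1.0):
    """Scroll down and click "load more" until it is gone or max_clicks is reached.

    Returns the number of clicks performed.
    """
    clicks = 0
    while clicks < max_clicks:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        # "xpath" is the string value of selenium's By.XPATH.
        buttons = driver.find_elements("xpath", MORE_XPATH)
        if not buttons:
            break  # nothing left to click
        buttons[0].click()
        clicks += 1
        time.sleep(pause)  # give the page time to append the new rows
    return clicks
```

find_elements is used instead of find_element because it returns an empty list when nothing matches, so the loop needs no exception handling.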

Extract the data with XPath

from selenium.webdriver.common.by import By

# text() is a helper from the author's full source; it is assumed to return the text of each matched element
from_time = text(driver.find_elements(By.XPATH, '//div[@class="from-info"]/p[1]'))
from_airport = text(driver.find_elements(By.XPATH, '//div[@class="from-info"]/p[2]'))
to_time = text(driver.find_elements(By.XPATH, '//div[@class="to-info"]/p[1]'))
to_airport = text(driver.find_elements(By.XPATH, '//div[@class="to-info"]/p[2]'))
company_main = driver.find_elements(By.XPATH, '//div[@class="company-info"]')
price = text(driver.find_elements(By.XPATH, '//p[@class="price-info"]'))
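The text() helper is not defined in this post. One plausible version is sketched below. It is an assumption about what the helper does, not the author's implementation: it collects the .text attribute of each element into a list. The example uses stand-in objects so it runs without a browser.

```python
from types import SimpleNamespace


def text(elements):
    """Return the stripped .text of each element, in order (assumed behavior)."""
    return [element.text.strip() for element in elements]


# Stand-ins for WebElement objects, which expose the same .text attribute.
fake_elements = [SimpleNamespace(text=" 07:30 "), SimpleNamespace(text="10:05")]
print(text(fake_elements))  # ['07:30', '10:05']
```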

Write the file with pandas

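import pandas as pd

# plane_list and company_list come from the author's full source; presumably they are derived from company_main
df = pd.DataFrame(
    {'from_time': from_time, 'from_airport': from_airport, 'to_time': to_time, 'to_airport': to_airport,
     'the_plane': plane_list, 'company': company_list, 'real_price_list': price})
# Append to the CSV without writing a header row or the index column
df.to_csv("qunaer.csv", header=False, mode='a', index=False)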
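pandas raises an error when the column lists differ in length, which can happen if one XPath matches fewer elements than the others. The sketch below shows the same append step with an explicit length check. It swaps in the standard library csv module instead of pandas so that it has no dependencies, and the function name append_columns is hypothetical. Unlike the snippet above, it writes a header row once, when the file is new or empty.

```python
import csv
import os


def append_columns(path, columns):
    """Append column-oriented data to a CSV file and return the number of rows written.

    columns maps a column name to a list of values; all lists must have equal length.
    """
    lengths = {len(values) for values in columns.values()}
    if len(lengths) > 1:
        raise ValueError("columns have different lengths: %s" % sorted(lengths))
    # Write the header only when the file does not exist yet or is empty.
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    rows = list(zip(*columns.values()))
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        if write_header:
            writer.writerow(columns.keys())
        writer.writerows(rows)
    return len(rows)
```

A call would look like append_columns("qunaer.csv", {"from_time": from_time, "price": price}), using the lists extracted earlier.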

Part of the results:

[Screenshot: partial results]

For the meaning of Chrome's excludeSwitches and related options, search Google.

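To get the source code, follow the WeChat official account <程式員之心> and reply <去哪兒網> in the chat. This is shared for technical study only; do not use it commercially.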

Source code: https://github.com/tanjunchen/SpiderProject/tree/master/selenium+qunaerwang
