1. Scraping the data directly from the JavaScript-loaded page

import requests
import urllib.parse
from lxml import etree

header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.81 Safari/537.36'
}


def getList(url):
    req = requests.get(url, headers=header)
    req.encoding = "utf-8"
    # Parse the HTML returned by the request
    html = etree.HTML(req.text)
    # Each <li> under this <ul> is one product entry
    items = html.xpath('//ul[@class="gl-warp clearfix"]/li')
    for i in items:
        a = i.xpath("div/div[4]/a/em/text()")      # product name (model)
        b = i.xpath("div/div[3]/strong/i/text()")  # price
        print(a, b)


if __name__ == '__main__':
    label = "手機"  # search keyword ("mobile phone")
    label = urllib.parse.quote(label)
    url = "https://search.jd.com/Search?keyword={}&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&wq={}&cid2=653&cid3=655&page={}&s=110&click=0"
    # range(1, 3, 2) yields only page=1; raise the upper bound to fetch more pages
    for index in range(1, 3, 2):
        u = url.format(label, label, index)
        getList(u)

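This scrapes the prices and models of the phones. There is a problem, though: JD.com loads its results dynamically, so a plain Python request only gets the first 30 items, and the items loaded afterward are out of reach. Solving this requires a new tool: Selenium.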
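The URL handling in the script above can be isolated into a small helper. This is only a sketch: `build_search_urls` is a hypothetical name, the query parameters are copied unchanged from the URL in the script, and it assumes (as the step of 2 in the original loop suggests) that the `page` value advances 1, 3, 5, and so on.

```python
from urllib.parse import urlencode

BASE_URL = "https://search.jd.com/Search"


def build_search_urls(keyword, page_count):
    """Return search URLs for the first `page_count` odd page values (1, 3, 5, ...)."""
    urls = []
    for n in range(page_count):
        params = {
            "keyword": keyword,
            "enc": "utf-8",
            "qrst": 1,
            "rt": 1,
            "stop": 1,
            "vt": 2,
            "wq": keyword,
            "cid2": 653,
            "cid3": 655,
            "page": 2 * n + 1,  # assumption: page values are odd numbers
            "s": 110,           # kept constant, as in the original URL
            "click": 0,
        }
        # urlencode percent-encodes the keyword, so no separate quote() call is needed
        urls.append(BASE_URL + "?" + urlencode(params))
    return urls


if __name__ == "__main__":
    for u in build_search_urls("手機", 2):
        print(u)
```

Building the query from a dict keeps the keyword encoding and the page arithmetic in one place, which makes it easier to change the number of pages later.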

2. Bringing in Selenium to scrape dynamically loaded data

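2.1. Introduction to Selenium

Selenium is an automated testing tool for websites. It supports mainstream browsers with a UI, including Chrome, Firefox, and Safari, as well as the headless browser PhantomJS.

2.2. Support for multiple operating systems

For example Windows, Linux, iOS, and Android.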

2.3. Installing Selenium

pip install selenium
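After installing, it can be useful to confirm which Selenium version is actually present, since the driver requirement in the next section is stated for Selenium 3.x. The following is a minimal sketch using only the standard library (Python 3.8+); `installed_version` is a hypothetical helper name.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("selenium:", installed_version("selenium"))
```

A result of `None` means the package is not installed in the interpreter that ran the check.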

2.4. Installing a browser driver

Selenium 3.x needs a WebDriver driver executable in order to control a browser.

Chrome driver download: chromedriver
Firefox driver download: geckodriver

3. Using Selenium

3.1. Step one: import it

from selenium import webdriver

3.2. Step two: use it

# Open Chrome; another browser such as Firefox can be used here instead
driver = webdriver.Chrome()
# The URL to open
driver.get(url)
# Scroll the page down to the bottom
driver.execute_script("window.scrollTo(0,document.body.scrollHeight);")

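At this point you may get an error like the one below (shown here for Firefox). It means the browser driver has not been downloaded or configured: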

selenium.common.exceptions.WebDriverException: Message: 'geckodriver' executable needs to be in PATH.

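Don't panic when you see this error; it is easy to fix. Put the Chrome or Firefox driver file you downloaded earlier into the Scripts directory of your Python installation. This works when that directory is on PATH; any other directory on PATH works as well.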
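One way to check the driver setup before launching a browser is to look the executables up on PATH. This is a minimal sketch using the standard library's `shutil.which`; `find_drivers` is a hypothetical helper, and the two names are the driver executables mentioned above.

```python
import shutil

DRIVER_NAMES = ("chromedriver", "geckodriver")


def find_drivers(names=DRIVER_NAMES):
    """Map each driver name to its full path on PATH, or None if it is not found."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_drivers().items():
        print(name, "->", path or "not on PATH")
```

If a driver shows up as not on PATH, the error message above is the likely result when the matching browser is started.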

Tip: when using Selenium, remember to let it rest for a moment (time.sleep(2)); otherwise the script may move on before the page has fully loaded.
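A single scroll followed by a fixed pause may not be enough if the page keeps growing. As a hedged alternative, the sketch below scrolls repeatedly until the page height stops changing. `scroll_until_stable` is a hypothetical helper; it only assumes that `driver` has the `execute_script` method used earlier, and the default pause and round limit are arbitrary.

```python
import time


def scroll_until_stable(driver, pause=2.0, max_rounds=10):
    """Scroll to the bottom until the page height stops growing.

    Returns the number of scrolls performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for round_no in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        # Give the page time to load the next batch of items
        time.sleep(pause)
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return round_no
        last_height = new_height
    return max_rounds
```

Calling `scroll_until_stable(driver)` right after `driver.get(url)` would replace the single scroll and sleep; the round limit keeps the loop from running forever on pages that never stop loading.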

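There is still one problem: the page has finished loading in the browser, but the HTML fetched with requests.get has nothing to do with it. What is needed is the driver's page_source attribute: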

source = driver.page_source

This source plays the same role as the HTML text of the response returned by requests.get(url, headers=header).

4. Finally, the complete source code

import time
import urllib.parse

from lxml import etree
from selenium import webdriver


def getList(url):
    driver = webdriver.Firefox()
    try:
        driver.get(url)
        # Scroll to the bottom so the dynamically loaded items are requested
        driver.execute_script("window.scrollTo(0,document.body.scrollHeight);")
        # Pause for 2 seconds so the page can finish loading;
        # without this pause the page source may be incomplete
        time.sleep(2)
        source = driver.page_source
    finally:
        # Close the browser even if loading fails
        driver.quit()
    html = etree.HTML(source)
    # Each <li> under this <ul> is one product entry
    items = html.xpath('//ul[@class="gl-warp clearfix"]/li')
    for i in items:
        a = i.xpath("div/div[4]/a/em/text()")      # product name (model)
        b = i.xpath("div/div[3]/strong/i/text()")  # price
        print(a, b)


if __name__ == '__main__':
    label = "手機"  # search keyword ("mobile phone")
    label = urllib.parse.quote(label)
    url = "https://search.jd.com/Search?keyword={}&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&wq={}&cid2=653&cid3=655&page={}&s=110&click=0"
    # range(1, 3, 2) yields only page=1; raise the upper bound to fetch more pages
    for index in range(1, 3, 2):
        u = url.format(label, label, index)
        print(u)
        getList(u)
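The XPath `text()` queries in the script return lists of strings, so the printed values are lists, not clean text. As a sketch of one possible post-processing step, the helper below joins the fragments and pairs each name with its price. `to_record` is a hypothetical name, and it assumes that a product name may arrive split into several fragments with stray whitespace.

```python
def to_record(name_parts, price_parts):
    """Join XPath text fragments into a {'name': ..., 'price': ...} dict."""
    # Join the fragments, then collapse runs of whitespace into single spaces
    name = " ".join("".join(name_parts).split())
    price = "".join(price_parts).strip()
    return {"name": name, "price": price}


if __name__ == "__main__":
    # Hypothetical fragments, shaped like the lists that a and b hold in the script
    print(to_record(["  Example ", "Phone\n"], ["1999.00"]))
```

Inside the loop this would be called as `to_record(a, b)`, and the resulting dicts could be collected into a list for later use.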

