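Logging in with selenium and scraping Taobao product information

Taobao: Taobao is one of the larger online retail marketplaces in the Asia-Pacific region, founded by Alibaba Group in May 2003. It is a very popular online shopping platform in China, with nearly 500 million registered users and more than 60 million regular visitors a day. More than 800 million product listings are online each day, and on average 48,000 items are sold every minute. As its scale and user base grew, Taobao evolved from a pure C2C online marketplace into a comprehensive retail marketplace that combines C2C, group buying, distribution, auctions and other e-commerce models. It is now one of the world's e-commerce trading platforms.

Taobao website: https://taobao.com/

Register / log in to Taobao: https://login.taobao.com/member/login.jhtml

Environment: Python 3.6 + Jupyter Notebook, Windows 10, Google Chrome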

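Technical challenges:

Simulating the login
Skipping the slider verification

Implementation walkthrough

1. Simulating the login

1.1 Why simulate a login?

Taobao protects its product information more tightly than JD.com does. On JD.com you can search for a product directly and scroll down to load the data you want to scrape. Taobao does not allow anonymous searches: you must be logged in before you can view products.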

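1.2 Ways to log in

Log in directly with a Taobao account

An ordinary manual login goes straight through, but when the login is driven by selenium a slider verification has to be completed.
Alipay login
Third-party login (Weibo)

A CAPTCHA has to be passed before the login succeeds.
App QR-code login

A slider verification is required here as well, but it can be skipped by routing the traffic through a proxy IP so that the visit looks like an ordinary user's.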
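With the QR-code option, the script has to pause until the code has actually been scanned. One possible approach, sketched below, is to poll the browser's URL instead of sleeping for a fixed time. The function name `wait_for_login` and the idea that the URL stops containing `login.taobao.com` after a successful login are assumptions made for illustration, not something the site guarantees.

```python
import time


def wait_for_login(get_url, login_marker="login.taobao.com", timeout=60.0,
                   poll=0.5, sleep=time.sleep, clock=time.monotonic):
    """Poll until the browser has left the login page.

    get_url is any zero-argument callable that returns the current URL,
    for example `lambda: driver.current_url` with a selenium driver.
    Returns True once the URL no longer contains login_marker (assumed to
    mean the login succeeded), or False if the timeout expires first.
    """
    deadline = clock() + timeout
    while True:
        if login_marker not in get_url():
            return True
        if clock() >= deadline:
            return False
        sleep(poll)
```

With a real driver this could be called as `wait_for_login(lambda: driver.current_url)` right after opening the login page.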

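1.3 Simulating the QR-code login

# Skip the slider verification
chrome_option = webdriver.ChromeOptions()
chrome_option.add_argument('--proxy-server=127.0.0.1:8080')  # route traffic through a proxy IP (Chrome's switch is --proxy-server)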
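Chrome options only take effect if they are built before the browser starts and are handed to the driver. The sketch below shows that ordering. The helper names `proxy_argument` and `make_driver` are hypothetical, and the address is simply the one used above; a proxy has to be listening there for pages to load.

```python
def proxy_argument(host, port):
    """Build Chrome's command-line switch for sending traffic through a proxy."""
    return "--proxy-server={}:{}".format(host, port)


def make_driver(host="127.0.0.1", port=8080):
    """Start Chrome with the proxy switch applied (hypothetical helper)."""
    # Imported here so that proxy_argument can be used without selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument(proxy_argument(host, port))
    # The options object must be passed in; creating it after the driver has no effect.
    return webdriver.Chrome(options=options)
```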

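2. Choosing which pages to scrape

2.1 Why choose a page range instead of scraping everything?

Taobao's search is fuzzy and returns repeated results: almost any keyword yields dozens of pages of products, and many of the later items duplicate earlier ones, so scraping everything adds little value to the data.
Scraping everything is also the easier thing to implement, so there is little to learn from it technically.
Fetching a selected range of pages is more convenient.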
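Because later pages tend to repeat earlier items, it can help to filter duplicates out of whatever has been collected. The sketch below assumes each product has already been stored as a dict; the key names `name` and `store` and the function name `drop_duplicates` are illustrative choices.

```python
def drop_duplicates(records, key_fields=("name", "store")):
    """Keep the first occurrence of each product, judged by key_fields."""
    seen = set()
    unique = []
    for record in records:
        key = tuple(record.get(field) for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique
```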

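2.2 Jumping to a specific page

After searching for a product, scroll to the bottom of the results page. The pager there contains the page-switching buttons, an input box for a page number and an "OK" button. Type the target page number into the input box and click "OK" to jump to that page.

Logic: find the page-number input box -> clear it -> type the number -> click "OK"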

from selenium.webdriver.common.by import By

search = driver.find_element(By.XPATH, '//*[@id="mainsrp-pager"]/div/div/div/div[2]/input')  # find the page-number input box
time.sleep(2)
search.clear()  # clear the input box
time.sleep(1)
search.send_keys(str(num))  # type the page number
driver.find_element(By.XPATH, '//*[@id="mainsrp-pager"]/div/div/div/div[2]/span[3]').click()  # click the "OK" button
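The same four steps can be wrapped in a small function so that they are reusable and can be exercised without a browser. This is only a sketch: `goto_page` is a hypothetical name, the XPaths are the ones used above, and the locator strategy is passed as the plain string "xpath", which is the value of selenium's `By.XPATH`.

```python
PAGER_INPUT = '//*[@id="mainsrp-pager"]/div/div/div/div[2]/input'
PAGER_CONFIRM = '//*[@id="mainsrp-pager"]/div/div/div/div[2]/span[3]'


def goto_page(driver, num):
    """Type num into the pager's input box and click the confirm button."""
    if num < 1:
        raise ValueError("page numbers start at 1")
    box = driver.find_element("xpath", PAGER_INPUT)  # find the input box
    box.clear()                                       # clear it
    box.send_keys(str(num))                           # type the page number
    driver.find_element("xpath", PAGER_CONFIRM).click()  # click "OK"
```

Any object that offers `find_element(by, value)` works as `driver`, which makes the logic easy to check with a stand-in object before pointing it at the real site.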

2.3 Scraping a range of pages

starPage = int(input("Enter the first page number: "))
endPage = int(input("Enter the last page number: "))

def start():
    # Visit every page from starPage to endPage, inclusive
    for num in range(starPage, endPage + 1):
        print("Preparing to scrape page %s" % num)
        spider()
        if num < endPage:
            nextPage()

def nextPage():
    print("Click next page")

def spider():
    print("Start fetching information")

if __name__ == '__main__':
    start()

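Logic output:

Run with a first page of 2 and a last page of 4, the skeleton above announces each page from 2 to 4 in turn, calls spider() once per page, and calls nextPage() after every page except the last.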
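The start() skeleton trusts whatever numbers the user types. A possible refinement is to read the total page count from the pager and clamp the requested range to it. In the sketch below the function names are hypothetical, and the pager text is assumed to contain the page count as its first number (something like "共 100 页，"); the real text may differ.

```python
import re


def parse_total_pages(pager_text):
    """Return the first integer found in the pager text, or None if there is none."""
    match = re.search(r"\d+", pager_text)
    return int(match.group()) if match else None


def clamp_page_range(start, end, total=None):
    """Return a range of valid page numbers between start and end, inclusive."""
    if start > end:
        start, end = end, start          # tolerate swapped input
    start = max(1, start)                # pages are numbered from 1
    if total is not None:
        end = min(end, total)            # do not run past the last page
    return range(start, end + 1)
```

The resulting range can be used directly in the `for num in ...` loop in place of `range(starPage, endPage + 1)`.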

3. Full source code

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from lxml import etree
import time

things = input("Enter the product you want to search for: ")

# Skip the slider verification. The options must be built before the driver
# is created and passed to it, otherwise they have no effect.
chrome_option = webdriver.ChromeOptions()
chrome_option.add_argument('--proxy-server=127.0.0.1:8080')  # use a proxy IP so the visit looks like a real user; a proxy must be listening at this address
driver = webdriver.Chrome(options=chrome_option)
driver.implicitly_wait(5)
driver.get('https://login.taobao.com/member/login.jhtml')
# Wait for the QR code to be scanned
time.sleep(10)

def scan_login():
    search = driver.find_element(By.XPATH, '//*[@id="q"]')  # the search box (id "q")
    search.send_keys(things)  # type the product keyword
    time.sleep(2)
    search.send_keys(Keys.ENTER)  # press Enter
    time.sleep(4)  # allow about 4 seconds for loading
    maxPage = driver.find_element(By.XPATH, '//*[@id="mainsrp-pager"]/div/div/div/div[1]').text  # total number of result pages
    print("Your search returned:", maxPage)

def start(starPage, endPage):  # scrape the selected range of pages
    for num in range(starPage, endPage + 1):
        print("Preparing to scrape page %s" % num)
        js = "document.documentElement.scrollTop=4950"  # scroll down so the pager is loaded
        driver.execute_script(js)
        search = driver.find_element(By.XPATH, '//*[@id="mainsrp-pager"]/div/div/div/div[2]/input')  # page-number input box
        time.sleep(4)
        search.clear()  # clear the box
        time.sleep(1)
        search.send_keys(str(num))
        time.sleep(3)
        nextPage()  # jump to page num first, so that spider() scrapes that page
        spider()

def nextPage():
    driver.find_element(By.XPATH, '//*[@id="mainsrp-pager"]/div/div/div/div[2]/span[3]').click()  # click "OK" to jump to the typed page

def spider():
    time.sleep(5)
    source = driver.page_source  # get the page source
    html = etree.HTML(source)  # parse it
    for et in html.xpath('//*[@id="mainsrp-itemlist"]/div/div/div[1]/div'):
        names = et.xpath('./div[2]/div[2]/a/text()')
        name = "".join("".join(names).split())  # join the text fragments and drop all whitespace
        price = et.xpath('./div[2]/div/div/strong/text()')
        buy = et.xpath('./div[2]/div[1]/div[2]/text()')
        store = et.xpath('./div[2]/div[3]/div[1]/a/span[2]/text()')
        location = et.xpath('./div[2]/div[3]/div[2]/text()')  # relative path, so each item gets its own location
        print(name, price, buy, store, location, '\n')

if __name__ == '__main__':
    scan_login()
    starPage = int(input("Enter the first page number: "))
    endPage = int(input("Enter the last page number: "))
    start(starPage, endPage)
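Each xpath `text()` query returns a list of strings, and a list can be empty when a tile lacks that field. A possible way to turn the five lists into one tidy record is sketched below. The helper names and the dict keys are hypothetical; the name is cleaned by removing all whitespace, as the script does.

```python
def first_text(fragments, default=""):
    """Return the first non-empty, whitespace-stripped string in fragments."""
    for fragment in fragments:
        text = fragment.strip()
        if text:
            return text
    return default


def make_record(names, price, buy, store, location):
    """Combine the lists returned by the xpath queries into a single dict."""
    return {
        "name": "".join("".join(names).split()),  # join fragments, drop whitespace
        "price": first_text(price),
        "buy": first_text(buy),
        "store": first_text(store),
        "location": first_text(location),
    }
```

Inside the loop over product tiles, `make_record(names, price, buy, store, location)` could be appended to a list instead of being printed.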

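Partial scraping results:

For every product on a page the script prints one line containing the product name, price, number of buyers, store name and location.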
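Printed results are lost once the session ends, so it may be worth writing them to a file. The sketch below uses the standard `csv` module and assumes each product has been collected into a dict with the hypothetical keys listed in `FIELDS`; `save_records` is likewise an illustrative name.

```python
import csv

FIELDS = ["name", "price", "buy", "store", "location"]


def save_records(records, path):
    """Write a list of product dicts to a CSV file and return how many rows were written."""
    # utf-8-sig adds a byte-order mark so that Excel on Windows detects the encoding.
    with open(path, "w", newline="", encoding="utf-8-sig") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(records)
    return len(records)
```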
