Scraping the Taobao model (淘女郎) page with selenium + chrome

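Take the page https://www.taobao.com/markets/mm/mmku: right-click, view the page source, and search for img. Surprisingly, no image tags turn up, so the images are presumably loaded asynchronously via ajax. That makes the page harder to scrape. There are currently two ways to handle it:

Analyze the page's network requests and call the underlying interface directly
Use selenium to simulate a browser session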
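
The "search the source for img" check described above can also be done in code. The following is only a minimal sketch using the standard library: `static_html` and `rendered_html` are made-up stand-ins (not real Taobao markup) for the raw response body and the browser-rendered DOM, and `count_img_tags` is a hypothetical helper name.

```python
from html.parser import HTMLParser


class ImgCounter(HTMLParser):
    """Counts <img> tags seen while parsing an HTML string."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names; self-closing <img/> also lands here
        if tag == "img":
            self.count += 1


def count_img_tags(html):
    """Return how many <img> tags appear in the given HTML text."""
    parser = ImgCounter()
    parser.feed(html)
    parser.close()
    return parser.count


# Hypothetical samples: what the server sends vs. what the browser ends up with
static_html = "<html><body><div id='list'></div><script src='app.js'></script></body></html>"
rendered_html = (
    "<html><body><div id='list'>"
    "<img src='https://example.com/a.jpg'><img src='https://example.com/b.jpg'/>"
    "</div></body></html>"
)

print(count_img_tags(static_html))    # 0
print(count_img_tags(rendered_html))  # 2
```

If the count is zero for the raw response but non-zero for the rendered page, the images are being injected by scripts after load, which is the situation this post deals with.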

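This post uses the second approach. It requires installing selenium and chromedriver first (be sure to get version 3.4 or later, otherwise you will hit an "element can't click" error).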

Crawling approach

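Since we are driving a real browser, asynchronous loading is no longer a problem: the img tags are rendered into the HTML we get back. All that remains is pagination, because each page holds only a few images. Use Chrome's Inspect Element to find the pagination bar and the page count (class: skip-wrap), note their class or id, locate the elements with By, simulate the clicks, and save the images. Simple and brute-force.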
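
The control flow described above (handle the first page as loaded, then click through the rest) can be sketched independently of selenium. This is an illustration under stated assumptions, not the author's code: `crawl_all_pages` and `FakeBrowser` are hypothetical names, and `get_html` / `click_page` stand in for `driver.page_source` and the pager click.

```python
def crawl_all_pages(total_pages, get_html, click_page, handle_page):
    """Process page 1 as loaded, then click through pages 2..total_pages."""
    # Page 1 is already displayed and its pager entry is not clickable,
    # so it is handled before any click happens.
    handle_page(1, get_html())
    for page in range(2, total_pages + 1):
        click_page(page)               # stand-in for locating the pager link and clicking it
        handle_page(page, get_html())  # stand-in for parsing driver.page_source


class FakeBrowser:
    """Hypothetical in-memory stand-in for a browser, for illustration only."""

    def __init__(self):
        self.current = 1

    def click_page(self, page):
        self.current = page

    def get_html(self):
        return "<div class='cons_li'>page %d</div>" % self.current


browser = FakeBrowser()
seen = []
crawl_all_pages(3, browser.get_html, browser.click_page,
                lambda page, html: seen.append((page, html)))
print([page for page, _ in seen])  # [1, 2, 3]
```

Keeping the loop separate from the browser calls makes it easy to check the page range (including the last page) without opening Chrome.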

#-*- coding:utf-8 -*-
'''Zheng 's BUG'''
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import os
class Crawl(object):
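    # Fetch the listing page and walk through every page of results
    def getMMsInfo(self):
        url = 'https://www.taobao.com/markets/mm/mmku'
        # chromedriver must be 3.4 or later, otherwise you get an "element cannot be clicked" error
        # Raw string so the backslashes in the Windows path are not treated as escapes.
        # Note: newer Selenium 4 releases dropped executable_path in favor of a Service object.
        driver = webdriver.Chrome(executable_path=r"C:\Program Files (x86)\Google\Chrome\Application\chromedriver")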
        driver.get(url)
        try:
            # Wait up to 10 seconds for the pagination bar to appear
            WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CLASS_NAME,"skip-wrap")))
            print("Pagination bar found")
            # Hand the rendered page source to BeautifulSoup
            soup = BeautifulSoup(driver.page_source, "html.parser")
            # Read the total number of pages
            pageNum = soup.find('span',class_ = "skip-wrap").find('em').text
            print("Total pages: "+pageNum)

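            print("Start crawling avatars!")
            # The first page has to be saved separately: the current page's pager entry
            # is not clickable, so it cannot be reached by clicking
            # Each model's info lives in one cons_li div
            mms = soup.find_all('div', class_="cons_li")
            # For each model, save the name and avatar
            self.saveMMS(mms)

            # Click through the remaining pages, starting from page 2
            # (start value restored from the comment above; + 1 so the last page is included)
            for i in range(2, int(pageNum) + 1):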
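                # Click the link for page i
                # Wait for the link to be present first, to avoid "element not clickable" errors
                # (the timeout value is missing in the source; 10 s is assumed, matching the first wait)
                element = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.PARTIAL_LINK_TEXT, str(i))))

                #curpage = driver.find_element_by_partial_link_text(str(i))
                print(i)
                element.click()
                # Wait for the new page to finish loading (timeout assumed, as above)
                pics = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CLASS_NAME,"skip-wrap")))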
                # Parse the rendered page
                soup = BeautifulSoup(driver.page_source,"html.parser")
                mms = soup.find_all('div',class_ = "cons_li")
                # For each model, save the name and avatar
                self.saveMMS(mms)
                print("Finished page "+str(i))
        finally:
            driver.quit()

    # Save every model entry (cons_li div) found on one page
    def saveMMS(self,mms):
        for mm in mms:
            name = mm.find('div', class_="item_name").find("p").text
            # .get("src") returns None when the attribute is missing; attrs["src"] would raise KeyError
            img = mm.find('div', class_='item_img').find('img').get("src")
            # Create the output directory if it does not exist
            dirpath = os.path.join(os.getcwd(), "美人")
            if not os.path.exists(dirpath):
                os.makedirs(dirpath)
            namepath = os.path.join(dirpath, name + ".jpg")
            self.saveImg(img, namepath)

    # Save a single image
    def saveImg(self, imageURL, fileName):
        if imageURL is None:
            return
        if 'http' not in imageURL: # skip src values that are not full URLs
            return
        # Download the image bytes
        u = requests.get(imageURL,stream = True).content

        try:
            with open(fileName,'wb') as jpg:
                jpg.write(u)
        except IOError:
            print("Failed to write image!")

    # Entry point
    def start(self):
        print("Crawling the 淘女郎 美人庫 pages and saving the images to the 美人 folder")
        self.getMMsInfo()
        print("Download complete!")

tbmm = Crawl()
tbmm.start()
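
Two details of the script above are worth hardening: saveMMS uses the model's name directly as the file name, and saveImg silently skips any src that does not contain 'http'. The following is a hedged sketch of two small helpers that are not part of the original script. `safe_filename`, `image_path`, and `absolute_src` are hypothetical names, and whether the page actually serves protocol-relative src values (starting with `//`) is an assumption.

```python
import os
import re
from urllib.parse import urljoin

PAGE_URL = "https://www.taobao.com/markets/mm/mmku"


def safe_filename(name, default="unnamed"):
    """Replace characters that Windows forbids in file names."""
    cleaned = re.sub(r'[\\/:*?"<>|]', "_", name).strip()
    return cleaned or default


def image_path(directory, name):
    """Build the .jpg path for a model name inside the output directory."""
    return os.path.join(directory, safe_filename(name) + ".jpg")


def absolute_src(src, base=PAGE_URL):
    """Resolve relative or protocol-relative src values against the page URL."""
    if not src:
        return None
    return urljoin(base, src)


print(safe_filename("a/b:c"))                       # a_b_c
print(absolute_src("//img.example.com/a.jpg"))      # https://img.example.com/a.jpg
print(absolute_src("https://img.example.com/b.jpg"))  # https://img.example.com/b.jpg
```

With helpers like these, saveMMS could pass `absolute_src(img)` to saveImg and build the target path with `image_path(dirpath, name)`, so that odd names or scheme-less URLs do not quietly drop images.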
