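Overview

Scrape the images from the Meizitu site (meizitu.com).

Preparation

Required modules

time
requests
lxml

Background knowledge

Python basics
Basics of the requests module
Basics of XPath expressions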

Results

Console output:

[Screenshot: console output]

Local files on the computer:

[Screenshot: downloaded image files]

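Building the crawler

1. Analyze the web page

Open the Meizitu site and press F12 to inspect the page with the browser's developer tools.

The URL of the first page is: https://www.meizitu.com/a/list_1_1.html

The URL of the second page is: https://www.meizitu.com/a/list_1_2.html

The URL of the third page is: https://www.meizitu.com/a/list_1_3.html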

Comparing the three URLs gives:

# Page 1: https://www.meizitu.com/a/list_1_1.html
# Page 2: https://www.meizitu.com/a/list_1_2.html
# Page 3: https://www.meizitu.com/a/list_1_3.html
# So the URL pattern is: url = "https://www.meizitu.com/a/list_1_" + str(page_index) + ".html"
# where page_index is the page number
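The URL pattern above can be wrapped in a small helper, which also makes it easy to generate several list pages at once. This is only a sketch: the names BASE_URL, build_page_url and build_page_urls are made up for illustration and do not appear in the original script.

```python
# Hypothetical helper names; only the URL pattern itself comes from the article.
BASE_URL = "https://www.meizitu.com/a/list_1_"


def build_page_url(page_index):
    """Return the list-page URL for a 1-based page number."""
    if page_index < 1:
        raise ValueError("page_index must be >= 1")
    # str() is required because page_index is an int
    return BASE_URL + str(page_index) + ".html"


def build_page_urls(first_page, last_page):
    """Return the list-page URLs for first_page..last_page (inclusive)."""
    return [build_page_url(i) for i in range(first_page, last_page + 1)]


if __name__ == "__main__":
    for url in build_page_urls(1, 3):
        print(url)
```

Looping over build_page_urls(1, n) is one way to extend the crawler beyond the first page.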

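Once the URL of each list page is known, the next step is to collect the hyperlinks of the images on that page, then follow each hyperlink to its detail page and download the images there.

[Screenshot: list page in the browser's developer tools]

These hyperlinks can be extracted with an XPath expression.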
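To show how such an extraction works without touching the network, the sketch below runs the XPath expression from the crawler against a small hand-written HTML fragment. The fragment and its example.com links are invented for illustration and only mimic the li/div/a nesting that the expression expects; the real page markup may differ.

```python
from lxml import etree

# Invented sample markup that mimics the nesting the XPath expects.
SAMPLE_LIST_HTML = """
<ul>
  <li class="wp-item"><div class="con"><div class="pic">
    <a href="https://example.com/a/1.html"><img src="thumb1.jpg"></a>
  </div></div></li>
  <li class="wp-item"><div class="con"><div class="pic">
    <a href="https://example.com/a/2.html"><img src="thumb2.jpg"></a>
  </div></div></li>
</ul>
"""

# Same expression as in the crawler code
LINK_XPATH = "//li[@class='wp-item']/div[@class='con']/div[@class='pic']/a/@href"


def extract_detail_links(html_text):
    """Return the href of every detail-page link found in html_text."""
    tree = etree.HTML(html_text)
    # Selecting an attribute with @href returns a list of strings
    return tree.xpath(LINK_XPATH)


if __name__ == "__main__":
    for link in extract_detail_links(SAMPLE_LIST_HTML):
        print(link)
```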

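Next, open one of the detail pages:

[Screenshot: detail page in the browser's developer tools]

The image titles and image links on this page can also be extracted with XPath expressions.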
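The crawler collects the src and alt attributes in two separate lists and then checks that the lists have the same length. An alternative sketch, shown below, selects the img elements themselves and reads both attributes from each one, so a title always stays paired with its own link. The sample markup and the function name extract_images are assumptions made for this example.

```python
from lxml import etree

# Invented sample markup; the third image deliberately has no alt text.
SAMPLE_DETAIL_HTML = """
<div id="picture">
  <p>
    <img src="https://example.com/img/01.jpg" alt="sample title 1">
    <img src="https://example.com/img/02.jpg" alt="sample title 2">
    <img src="https://example.com/img/03.jpg">
  </p>
</div>
"""


def extract_images(html_text):
    """Return (title, url) pairs for images inside div#picture."""
    tree = etree.HTML(html_text)
    images = []
    for img in tree.xpath("//div[@id='picture']//img"):
        src = img.get("src")
        alt = img.get("alt")
        # Skip images that lack either attribute instead of misaligning lists
        if src and alt:
            images.append((alt, src))
    return images


if __name__ == "__main__":
    for title, url in extract_images(SAMPLE_DETAIL_HTML):
        print(title, url)
```

With this approach a single image without an alt attribute is skipped on its own, rather than causing the whole page to be skipped by the length check.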

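2. Crawler code

import os
import time

import requests
from lxml import etree

# Crawler practice: scrape the images from the Meizitu site

# Page 1: https://www.meizitu.com/a/list_1_1.html
# Page 2: https://www.meizitu.com/a/list_1_2.html
# Page 3: https://www.meizitu.com/a/list_1_3.html
# So the URL pattern is: url = "https://www.meizitu.com/a/list_1_" + str(page_index) + ".html"
# where page_index is the page number

page_index = 1  # Only the first page is downloaded here; use a for loop to download more pages
# Request headers
header = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.81 Safari/537.36"
}
# Directory where the images are saved; create it if it does not exist yet
save_dir = r"C:/Users/Administrator/Pictures/images/"
os.makedirs(save_dir, exist_ok=True)
# Build the request URL
url = "https://www.meizitu.com/a/list_1_" + str(page_index) + ".html"
# Send the request and decode the HTML source of the response.
# GBK is a superset of GB2312; errors="replace" avoids a crash on stray bytes
response = requests.get(url, headers=header).content.decode("gbk", errors="replace")
# Parse the source string into an HTML element tree
html = etree.HTML(response)
# Use an XPath expression to extract the links to the image detail pages; returns a list
image_link_all_list = html.xpath("//li[@class='wp-item']/div[@class='con']/div[@class='pic']/a/@href")
# Loop over the detail-page links
for image_link in image_link_all_list:
    # Fetch the HTML source of the detail page (raw bytes, so lxml detects the encoding)
    response_image_detail = requests.get(image_link, headers=header).content
    # Parse the source into an HTML element tree
    html_image_detail = etree.HTML(response_image_detail)
    # Get the download link of every image
    image_link_detail_list = html_image_detail.xpath("//div[@id='picture']//img/@src")
    # Get the title of every image
    image_name_detail_list = html_image_detail.xpath("//div[@id='picture']//img/@alt")
    # Only continue if every image has both a link and a title
    if len(image_link_detail_list) == len(image_name_detail_list):
        # Loop over the image download links together with their titles
        for image_url, image_name in zip(image_link_detail_list, image_name_detail_list):
            # Request the data of each image
            data = requests.get(image_url, headers=header).content
            # Progress message
            print("Downloading image " + image_name + ".jpg ...")
            # Save the image to the local disk
            with open(save_dir + image_name + ".jpg", "wb") as file_object:
                # Write the data
                file_object.write(data)
            # Pause briefly between downloads
            time.sleep(0.5)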
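The script uses the alt text of each image directly as the file name. If a title contains characters that Windows does not allow in file names (such as \ / : * ? " < > |), open() will fail. A minimal sketch of a sanitizing helper follows; the names safe_filename and build_save_path, and the choice of "_" as the replacement character, are assumptions for illustration and not part of the original script.

```python
import os
import re

# Characters that Windows forbids in file names, plus common whitespace controls
INVALID_CHARS = re.compile(r'[\\/:*?"<>|\r\n\t]')


def safe_filename(title, extension=".jpg", fallback="untitled"):
    """Return a file name derived from title with invalid characters replaced."""
    cleaned = INVALID_CHARS.sub("_", title).strip(" .")
    # Fall back to a fixed name if nothing usable is left
    return (cleaned or fallback) + extension


def build_save_path(directory, title):
    """Join the target directory and the sanitized file name."""
    return os.path.join(directory, safe_filename(title))


if __name__ == "__main__":
    print(build_save_path("images", 'title: part 1/2?'))
```

In the crawler, the path passed to open() could then be built with build_save_path(save_directory, image_title) instead of plain string concatenation.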
