Python Crawler Beginner Tutorial 7-100: Scraping Fengniao Images, Part 2

1. Fengniao Images: Introduction

Today let's try something fresh: a new library, aiohttp, which we'll use to speed up our crawler.

Install the module the usual way:

pip install aiohttp

Run it and wait for the installation to finish. If you want to dig deeper, the official documentation is a must:

https://aiohttp.readthedocs.io/en/stable/

Now we can start writing code.

The page we're going to crawl this time is

http://bbs.fengniao.com/forum/forum_101_1_lastpost.html

Open the page and the page numbers are easy to spot.

It has been a while since page numbers were this easy to see.

Let's try accessing this page with aiohttp. Importing the module is nothing special; a plain import is enough.

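If we want to write an asynchronous-IO crawler with Asyncio + Aiohttp, note that every function that should run asynchronously has to be declared with async in front of it.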
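As a minimal sketch of what the async keyword changes (the function say_hello is made up for illustration and is not part of the crawler): calling an async function does not run it, it only creates a coroutine object, which then has to be awaited or handed to an event loop.

```python
import asyncio
import inspect


async def say_hello(name):
    # await hands control back to the event loop while "waiting"
    await asyncio.sleep(0)
    return f"hello, {name}"


coro = say_hello("fengniao")       # nothing has run yet: this is a coroutine object
print(inspect.iscoroutine(coro))   # True
print(asyncio.run(coro))           # hello, fengniao
```

asyncio.run is available on Python 3.7 and later; it starts an event loop, runs the coroutine to completion, and closes the loop.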

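Next, let's first try fetching the HTML source of the address above.

The code first declares a function, fetch_img_url, that takes one parameter; the value could also be hard-coded instead.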

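I won't explain the with context manager here; look it up yourself (｀・ω・´)

aiohttp.ClientSession() as session: creates a session object, which is then used to open the page. A session can perform many kinds of requests, such as post, get, and put.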

In the code, await response.text() waits for the page data to come back.

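asyncio.get_event_loop returns the event loop (it does not create a thread; the coroutines all run in a single thread), and run_until_complete runs the work in tasks until it finishes.

tasks can be a single coroutine or a list of them; a list has to be wrapped into a single awaitable first.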

import aiohttp
import asyncio


async def fetch_img_url(num):
    url = f'http://bbs.fengniao.com/forum/forum_101_{num}_lastpost.html'  # build the URL from the page number
    # or hard-code it: url = 'http://bbs.fengniao.com/forum/forum_101_1_lastpost.html'
    print(url)
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.26 Safari/537.36 Core/1.63.6726.400 QQBrowser/10.2.2265.400',
    }

    async with aiohttp.ClientSession() as session:
        # request the list page
        async with session.get(url, headers=headers) as response:
            try:
                html = await response.text()   # the page source
                print(html)

            except Exception as e:
                print("Request error")
                print(e)

# You can copy this part as-is.
# Newer Python versions deprecate get_event_loop() outside a running loop;
# there, asyncio.run(fetch_img_url(1)) does the same job.
loop = asyncio.get_event_loop()
tasks = asyncio.ensure_future(fetch_img_url(1))
results = loop.run_until_complete(tasks)

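The last part of the code above can also be written as

loop = asyncio.get_event_loop()
tasks = [fetch_img_url(1)]
# asyncio.wait no longer accepts bare coroutines on Python 3.11+, so use gather
results = loop.run_until_complete(asyncio.gather(*tasks))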
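For comparison, here is a sketch of running several coroutines concurrently without managing the loop by hand. fake_fetch is a hypothetical stand-in for fetch_img_url that sleeps instead of making an HTTP request, so the example runs offline.

```python
import asyncio


async def fake_fetch(num):
    # Stand-in for fetch_img_url: sleeps instead of requesting a page.
    await asyncio.sleep(0.01)
    return f"page {num} done"


async def main():
    # gather runs the coroutines concurrently and returns results in input order
    return await asyncio.gather(*(fake_fetch(n) for n in range(1, 4)))


results = asyncio.run(main())
print(results)  # ['page 1 done', 'page 2 done', 'page 3 done']
```

Swapping fake_fetch for the real function keeps the same structure; only the body of the coroutine changes.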

OK, if you've successfully fetched the source, you're only a tiny step away from the final goal.

Now change the code to fetch 10 pages in one batch.

Only tasks needs to change; then run it again.

tasks = [fetch_img_url(num) for num in range(1, 11)]  # range(1, 11) covers pages 1 through 10

The next series of steps is very similar to the previous post: look for the pattern.

Open any list page, for example

http://bbs.fengniao.com/forum/forum_101_4_lastpost.html

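Click an image to enter the post, then click an image inside the post to reach a slideshow page.

Click once more to enter the image viewer page.

In the source of the image viewer page we finally find all the image links. So here's the question: how do we get from the first link above to the slideshow link?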

The source in question comes from opening

http://bbs.fengniao.com/forum/pic/slide_101_10408464_89383854.html

and choosing View Source from the right-click menu.

Let's keep analyzing~~~~

ヾ(=･ω･=)o

How does
http://bbs.fengniao.com/forum/forum_101_4_lastpost.html
turn into the link below?
http://bbs.fengniao.com/forum/pic/slide_101_10408464_89383854.html

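Back to the first link: open the F12 developer tools and inspect one of the images.

The part boxed in yellow in the screenshot contains the numbers we want, so all we need to do is pull them out with regular expressions.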

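The code is between the #### markers below. Note that I used plain regex matching. While writing the expression I found that a single pass would not match everything, so I had to split it into two steps. Take a look at the details o(╥﹏╥)o

Step 1: find every image block, <div class="picList">
Step 2: extract the two numbers we want from each block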

import re  # needed in addition to aiohttp and asyncio


async def fetch_img_url(num):
    url = f'http://bbs.fengniao.com/forum/forum_101_{num}_lastpost.html'
    print(url)
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.26 Safari/537.36 Core/1.63.6726.400 QQBrowser/10.2.2265.400',
    }

    async with aiohttp.ClientSession() as session:
        # request the list page
        async with session.get(url, headers=headers) as response:
            try:
                ###############################################
                url_format = "http://bbs.fengniao.com/forum/pic/slide_101_{0}_{1}.html"
                html = await response.text()   # the page source
                # step 1: grab every <div class="picList"> block
                pattern = re.compile(r'<div class="picList">([\s\S]*?)</div>')
                first_match = pattern.findall(html)
                # step 2: pull the two numbers out of each block's link
                href_pattern = re.compile(r'href="/forum/(\d+?)_p(\d+?)\.html')
                matches = [href_pattern.search(block) for block in first_match]
                # skip blocks without a matching link instead of raising on None
                urls = [url_format.format(m.group(1), m.group(2)) for m in matches if m]
                ##############################################

            except Exception as e:
                print("Request error")
                print(e)
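To see the two-step match in isolation, here is a sketch that runs it on a made-up HTML fragment. The fragment only imitates the list page; the real markup may differ, and the second pair of ids is invented for the example.

```python
import re

# Made-up fragment imitating the list page; the real markup may differ.
sample_html = '''
<div class="picList"><a href="/forum/10408464_p89383854.html"><img src="a.jpg"></a></div>
<div class="picList"><a href="/forum/10408465_p89383999.html"><img src="b.jpg"></a></div>
'''

url_format = "http://bbs.fengniao.com/forum/pic/slide_101_{0}_{1}.html"
block_pattern = re.compile(r'<div class="picList">([\s\S]*?)</div>')
href_pattern = re.compile(r'href="/forum/(\d+?)_p(\d+?)\.html')

urls = []
for block in block_pattern.findall(sample_html):   # step 1: isolate each picList block
    match = href_pattern.search(block)             # step 2: pull out the two numbers
    if match:
        urls.append(url_format.format(match.group(1), match.group(2)))

print(urls)
```

The first number is the post id and the second is the image id, which together fill the two slots of the slideshow URL.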

With that code done, we have the URLs we wanted. Next we read the content behind each URL and match the image links we're after.

async def fetch_img_url(num):
    # copy url and headers from the code above (this step also needs import json)
    async with aiohttp.ClientSession() as session:
        # request the list page
        async with session.get(url, headers=headers) as response:
            try:
                # copy the URL-matching code from above
                ################################################################
                for img_slider in urls:
                    try:
                        async with session.get(img_slider, headers=headers) as slider:
                            slider_html = await slider.text()   # slideshow page source
                            try:
                                # non-greedy: stop at the first "];"
                                pic_list_pattern = re.compile(r'var picList = \[(.*?)\];')
                                pic_list = "[{}]".format(pic_list_pattern.search(slider_html).group(1))
                                pic_json = json.loads(pic_list)  # the image list
                                print(pic_json)
                            except Exception as e:
                                # pic_list may not exist yet here, so print the URL instead
                                print("Failed to parse picList")
                                print(img_slider)
                                print("*"*100)
                                print(e)

                    except Exception as e:
                        print("Failed to fetch the image list")
                        print(img_slider)
                        print(e)
                        continue
                ################################################################


                print("{} finished".format(url))
            except Exception as e:
                print("Request error")
                print(e)
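The picList extraction can be tried on its own with a made-up script fragment. Only the variable name picList and the downloadPic key come from the tutorial's code; the sample string and the example.com addresses are invented.

```python
import json
import re

# Made-up script fragment; the real slideshow page carries more fields per entry.
sample_js = (
    'var picList = [{"downloadPic": "http://example.com/a.jpg"}, '
    '{"downloadPic": "http://example.com/b.jpg"}];'
)

pic_list_pattern = re.compile(r'var picList = \[(.*?)\];')
match = pic_list_pattern.search(sample_js)

# Re-wrap the captured text in brackets so it parses as a JSON array.
pic_json = json.loads("[{}]".format(match.group(1))) if match else []
download_urls = [item["downloadPic"] for item in pic_json]
print(download_urls)
```

This only works while the page embeds the list as valid JSON; if the site switches to single quotes or unquoted keys, json.loads will raise and a different parser would be needed.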

The final image JSON is in hand. Last step: download the images. Ta-da~~~~ One swift burst of work later, the images are ours.

async def fetch_img_url(num):
    # see the code above
    async with aiohttp.ClientSession() as session:
        # request the list page
        async with session.get(url, headers=headers) as response:
            try:
                # see the code above
                for img_slider in urls:
                    try:
                        async with session.get(img_slider, headers=headers) as slider:
                            # see the code above
                            ##########################################################
                            for img in pic_json:
                                try:
                                    img_url = img["downloadPic"]
                                    async with session.get(img_url, headers=headers) as img_res:
                                        imgcode = await img_res.read()  # raw image bytes
                                        # the images folder must already exist
                                        with open("images/{}".format(img_url.split('/')[-1]), 'wb') as f:
                                            f.write(imgcode)  # the with block closes the file
                                except Exception as e:
                                    print("Image download error")
                                    print(e)
                                    continue
                            ###############################################################

                    except Exception as e:
                        print("Failed to fetch the image list")
                        print(img_slider)
                        print(e)
                        continue
                print("{} finished".format(url))
            except Exception as e:
                print("Request error")
                print(e)

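The images will quickly pile up in the images folder you created beforehand.

tasks can hold up to 1024 coroutines, but I suggest about 100 is plenty; with too much concurrency, the other side's server can't cope.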
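One way to keep concurrency under a chosen ceiling is asyncio.Semaphore. The sketch below is an assumption about how that could look, not the tutorial's own code: limited_fetch, MAX_CONCURRENCY, and the stats dictionary are hypothetical, and the request is replaced by a short sleep so it runs offline.

```python
import asyncio

MAX_CONCURRENCY = 100  # hypothetical cap, following the "about 100" suggestion


async def limited_fetch(num, semaphore, stats):
    # Only `limit` coroutines can be inside this block at the same time.
    async with semaphore:
        stats["running"] += 1
        stats["peak"] = max(stats["peak"], stats["running"])
        await asyncio.sleep(0.01)   # stand-in for the real request
        stats["running"] -= 1
    return num


async def main(pages, limit=MAX_CONCURRENCY):
    semaphore = asyncio.Semaphore(limit)
    stats = {"running": 0, "peak": 0}
    results = await asyncio.gather(
        *(limited_fetch(n, semaphore, stats) for n in pages)
    )
    return results, stats["peak"]


# 20 fake pages with at most 5 in flight at once
results, peak = asyncio.run(main(range(1, 21), limit=5))
print(len(results), peak)
```

In the real crawler the async with semaphore block would wrap the session.get calls, so that all pages can be queued at once while only a bounded number hit the server together.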


Once all of the above works, add a few finishing touches, such as saving to a specified folder, and you're done.
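Saving to a specified folder could be sketched as follows. The helper names build_save_path and save_image are hypothetical; the file name is taken from the last segment of the image URL, as in the download code above, and the folder is created on demand instead of having to exist in advance.

```python
import tempfile
from pathlib import Path


def build_save_path(img_url, folder="images"):
    # Create the target folder if needed and name the file after the URL's last segment.
    target_dir = Path(folder)
    target_dir.mkdir(parents=True, exist_ok=True)
    return target_dir / img_url.split("/")[-1]


def save_image(img_url, data, folder="images"):
    # data is the bytes object returned by `await img_res.read()`
    path = build_save_path(img_url, folder)
    path.write_bytes(data)
    return path


# Demo in a temporary directory with fake bytes, so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_image("http://example.com/pics/a.jpg", b"fake bytes", Path(tmp) / "fengniao")
    saved_name, saved_ok = saved.name, saved.exists()
    print(saved_name, saved_ok)  # a.jpg True
```

The write itself is still a blocking call; for image-sized files that is usually acceptable inside a coroutine, but it is a trade-off worth knowing about.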
