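Python Parallelism in Practice: Multithreading

I. Introduction

When many pages need to be crawled, a single-threaded crawler is especially slow, so running the crawler on multiple threads can improve throughput considerably. Python (CPython) is not well suited to CPU-bound multithreading because of the Global Interpreter Lock (GIL), which is well documented online. For I/O-bound work such as sending many requests, however, Python threads are still a better fit than multiprocessing, which is mainly useful for heavy, complex computation. Today we crawl chapters of the novel 《抗日之肥膽英雄》 from a novel site in two ways, threading and concurrent.futures, with a serial crawler as the baseline. The novel's URL is https://www.biquge.lol/book/9370/3118042.html.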

II. Helper functions for the crawler

get_url

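Purpose: return the URLs of the chapters to crawl. This requires working out how the URL changes from one page to the next, or whether the site loads its content asynchronously. This site follows a regular pattern. For finding URLs on asynchronously loaded sites, see the article "Python Web Crawling in Practice: Scraping Dynamic Pages through Reverse Engineering": https://blog.csdn.net/zou_gr/article/details/109625855.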
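The body of get_url is not shown in this post, so the following is only a minimal sketch of what it might look like. It assumes consecutive chapter ids and two pages per chapter (`<id>.html` followed by `<id>_2.html`), which is inferred from the "2 pages per chapter" timing label and the example URL in the introduction; check the pattern on the real site before relying on it.

```python
BASE_URL = "https://www.biquge.lol/book/9370/"  # prefix of the URL given in the introduction


def get_url(num, num1):
    """Return the page URLs for chapter ids in [num, num1).

    Assumption: each chapter is split into two pages,
    <id>.html and <id>_2.html.
    """
    urls = []
    for chapter_id in range(num, num1):
        urls.append(f"{BASE_URL}{chapter_id}.html")
        urls.append(f"{BASE_URL}{chapter_id}_2.html")
    return urls


urls = get_url(3118042, 3118044)
print(len(urls))  # two chapters, two pages each
print(urls[0])
```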

clean

Purpose: the raw string of novel text contains stray symbols; this function removes them and collapses repeated spaces.
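As an illustration, a cleaning function along these lines could look like the sketch below. Which symbols count as stray is an assumption here (HTML entities, leftover tags, full-width spaces); the actual debris depends on the site's markup.

```python
import re


def clean(text):
    """Strip layout debris from a chunk of scraped chapter text."""
    # Normalise the space-like characters commonly found in scraped HTML.
    text = text.replace("&nbsp;", " ").replace("\xa0", " ").replace("\u3000", " ")
    text = re.sub(r"<br\s*/?>", "\n", text)  # HTML line breaks become newlines
    text = re.sub(r"<[^>]+>", "", text)       # drop any remaining tags
    # Collapse whitespace inside each line and discard empty lines.
    lines = (re.sub(r"\s+", " ", line).strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


sample = "&nbsp;&nbsp;&nbsp;&nbsp;First line.<br /><br />\u3000\u3000Second   line.<p>"
print(clean(sample))
```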

get_data

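Purpose: use regular expressions to extract the chapter title and content, together with an identifier (index). Parallel crawling may return chapters out of order; the index lets us sort them back into the correct order.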
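The reordering idea can be sketched as follows. `index_from_url` is a hypothetical helper, not the post's own code, and it assumes page URLs of the form `<id>.html` and `<id>_2.html`; it derives a (chapter, page) key from each URL and sorts the results by that key.

```python
import re


def index_from_url(url):
    """Return (chapter_id, page_no) for URLs like .../3118042.html or .../3118042_2.html."""
    match = re.search(r"/(\d+)(?:_(\d+))?\.html$", url)
    if match is None:
        raise ValueError(f"unexpected URL format: {url}")
    chapter_id = int(match.group(1))
    page_no = int(match.group(2)) if match.group(2) else 1
    return chapter_id, page_no


# Results as they might arrive from worker threads: out of order.
results = [
    ("https://www.biquge.lol/book/9370/3118043.html", "chapter 2, page 1"),
    ("https://www.biquge.lol/book/9370/3118042_2.html", "chapter 1, page 2"),
    ("https://www.biquge.lol/book/9370/3118042.html", "chapter 1, page 1"),
]
results.sort(key=lambda item: index_from_url(item[0]))
ordered_text = [text for _, text in results]
print(ordered_text)
```

A tuple key avoids the pitfalls of encoding the page number as a decimal fraction, where page 10 would sort before page 2.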

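III. Serial crawling

To show how parallelism is used and what it gains, we need a baseline. With the functions above, the serial crawler is simple: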

import time

# get_url and get_data are the helper functions described earlier.


def main(num, num1):
    page_urls = get_url(num, num1)
    page = []
    for page_url in page_urls:
        try:
            title_content = get_data(page_url)
            page.append(title_content)
        except Exception:
            # Keep a placeholder so one failed page does not stop the crawl.
            page.append("")
    with open(r"抗日之肥膽英雄_串行.txt", "w", encoding="utf8") as f:
        f.write("\n".join(page))


if __name__ == "__main__":
    start = time.time()
    num = 3118042
    num1 = num + 10
    main(num, num1)
    end = time.time()
    print("Serial total time: {}".format(end - start))
    print("Serial time per chapter: {} s/chapter (2 pages)\n".format((end - start) / (num1 - num)))

Output: [screenshot of the program output]

IV. Crawling 《抗日之肥膽英雄》 with threading

import threading
import time

# get_url and get_data are the helper functions described earlier.

page = []
num = 3118042
num1 = num + 20
urls = get_url(num, num1)
print(len(urls))


def main(url):
    # Not shown in the original post; assumed here to fetch one page and store it in page.
    try:
        page.append(get_data(url))
    except Exception:
        page.append("")


if __name__ == "__main__":
    start = time.time()
    threads = []
    # Group size used when creating threads. Note that every thread is created
    # first and then all are started together, so this runs one thread per URL
    # and does not cap concurrency at 4.
    threadNum = 4
    for i in range(0, len(urls), threadNum):
        for url in urls[i:i + threadNum]:
            threads.append(threading.Thread(target=main, args=(url,)))
    for t in threads:
        t.start()
    for t in threads:  # wait for every thread to finish before the main thread continues
        t.join()
    end = time.time()
    print("threading with {} threads, total time: {}\n".format(len(threads), end - start))
    print("threading time per chapter: {} s/chapter (2 pages)\n".format((end - start) / (num1 - num)))

Output: [screenshot of the program output]
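The script above creates one thread for every URL. If the goal is to cap the crawl at four concurrent threads, one option is a fixed set of worker threads that pull URLs from a queue. The sketch below is a minimal version of that idea; `fetch` is a hypothetical stand-in for the real download and only sleeps briefly to mimic network latency.

```python
import queue
import threading
import time


def fetch(url):
    """Stand-in for the real download (hypothetical); sleeps to mimic network wait."""
    time.sleep(0.05)
    return "content of " + url


def worker(tasks, results):
    """Pull (position, url) pairs until the queue is empty."""
    while True:
        try:
            position, url = tasks.get_nowait()
        except queue.Empty:
            return
        # Each task writes to its own slot, so input order is preserved.
        results[position] = fetch(url)


def crawl(urls, thread_num=4):
    """Download all urls using at most thread_num threads at a time."""
    tasks = queue.Queue()
    for position, url in enumerate(urls):
        tasks.put((position, url))
    results = [None] * len(urls)
    threads = [threading.Thread(target=worker, args=(tasks, results))
               for _ in range(thread_num)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


demo_urls = [f"page-{i}" for i in range(10)]
pages = crawl(demo_urls, thread_num=4)
print(pages[:2])
```

Because each result is stored at the position of its URL, no separate reordering step is needed afterwards.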

V. Crawling 《抗日之肥膽英雄》 with concurrent.futures

import time
from concurrent.futures import ThreadPoolExecutor

# get_url and get_data are the helper functions described earlier.


def _add_content(url):
    """Fetch one page with get_data and return its title and content in a one-item list."""
    page = []
    try:
        title_content = get_data(url)
        page.append(title_content)
        print("Downloaded: " + url)
    except Exception:
        page.append("")
        print("Download failed: " + url)
    return page


if __name__ == "__main__":
    num = 3118042
    num1 = num + 10
    start = time.time()
    urls = get_url(num, num1)
    # A pool of 20 worker threads; map() yields results in the order of urls.
    with ThreadPoolExecutor(max_workers=20) as executor:
        con_list = list(executor.map(_add_content, urls))
    cont_list_ = [i[0] for i in con_list]
    con_str = "\n".join(cont_list_)
    with open(r"抗日之肥膽英雄.txt", "w", encoding="utf8") as f:
        f.write(con_str)
    end = time.time()
    print("concurrent with 20 threads, total time: {}\n".format(end - start))
    print("concurrent with 20 threads, time per chapter: {} s/chapter (2 pages)\n".format((end - start) / (num1 - num)))

Output: [screenshot of the program output]
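One property worth noting is that `executor.map` returns results in the order of its inputs, regardless of which task finishes first, so the joined text stays in chapter order without an extra sort. The sketch below illustrates this with a hypothetical task, `slow_echo`, in which earlier items deliberately take longer.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_echo(item):
    """Hypothetical task: earlier items sleep longer, so they finish last."""
    delay, label = item
    time.sleep(delay)
    return label


items = [(0.15, "first"), (0.10, "second"), (0.05, "third")]
with ThreadPoolExecutor(max_workers=3) as executor:
    in_order = list(executor.map(slow_echo, items))
print(in_order)  # ['first', 'second', 'third']
```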

VI. Summary

Results: [screenshot of the timing results]

1. We use the estimated time per chapter as the measure of crawl speed. Each request also sleeps for 2 s so as not to overload the site.

2. In theory, concurrent.futures is the better fit for a parallel crawler and is also simpler to write; if the extra code is not a concern, threading works too.

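3. One oddity: concurrent.futures with a pool of 20 threads was still slower than the threading run labelled as 4 threads, and trying larger or smaller pools did not help. The per-chapter time with threading was about half that of concurrent.futures. A likely explanation is that the two runs are not directly comparable. The threading script crawls 20 chapters (num + 20) and starts one thread per URL all at once, while the concurrent script crawls 10 chapters (num + 10). When every request runs concurrently and each sleeps for 2 s, the total time stays roughly constant, so dividing it by twice as many chapters roughly halves the per-chapter figure.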
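The effect of the per-chapter metric can be reproduced without touching the network. In the sketch below, `fake_request` is a hypothetical stand-in for one page request (the 2 s sleep is shortened to 0.2 s), and every task gets its own thread. The elapsed time is about the same for 10 and 20 tasks, so the time per task is roughly halved when the task count doubles.

```python
import threading
import time


def fake_request():
    """Stand-in for one page request (hypothetical); the sleep is shortened to 0.2 s."""
    time.sleep(0.2)


def seconds_per_task(task_count):
    """Start one thread per task, all at once, and return elapsed / task_count."""
    threads = [threading.Thread(target=fake_request) for _ in range(task_count)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return (time.perf_counter() - start) / task_count


per_task_10 = seconds_per_task(10)
per_task_20 = seconds_per_task(20)
print(f"10 tasks: {per_task_10:.4f} s/task")
print(f"20 tasks: {per_task_20:.4f} s/task")
```

For a fair comparison between threading and concurrent.futures, both runs would need the same chapter range and the same number of concurrent threads.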
