Scraping Memes with Multiple Threads

This example scrapes http://www.doutula.com with multiple threads to show how multithreading can be used in a crawler.

First, open the site and click Latest Memes (最新表情).

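Next, we only need to find the pattern behind the Latest Memes pages. For example, clicking through to the second page changes the URL to http://www.doutula.com/photo/list/?page=2, so the page parameter is simply the page number.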
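The URL pattern above can be sketched as a small helper. The function name `page_urls` is hypothetical and only illustrates how the page number is substituted into the URL; it does not make any network requests.

```python
# URL template taken from the article; the helper name is hypothetical.
BASE_URL = 'http://www.doutula.com/photo/list/?page={}'

def page_urls(first, last):
    """Return the listing-page URLs for pages first..last (inclusive)."""
    return [BASE_URL.format(n) for n in range(first, last + 1)]

urls = page_urls(1, 3)
for url in urls:
    print(url)
```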

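Then we need to find where the memes are stored within a page. First, here is a version of the program that does not use multiple threads: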

import requests
import os
from bs4 import BeautifulSoup

def download_img(page):
    # Fetch one listing page and save every meme image found on it.
    response = requests.get('http://www.doutula.com/photo/list/?page={}'.format(page))
    soup = BeautifulSoup(response.text, 'lxml')
    div = soup.find_all('div', class_="page-content text-center")[0]
    for img in div.find_all('img'):
        # The alt attribute holds the meme's name; data-backup holds the image URL.
        name = img.attrs.get('alt')
        img_url = img.attrs.get('data-backup')
        if not name or not img_url:
            # Skip <img> tags that lack either attribute instead of
            # reusing stale values from the previous iteration.
            continue
        try:
            r = requests.get(img_url)
            with open('圖檔1/{}.jpg'.format(name), 'wb') as f:
                f.write(r.content)
        except (requests.RequestException, OSError):
            # Skip images that fail to download or cannot be written to disk.
            continue

if __name__ == '__main__':
    if not os.path.exists('圖檔1'):
        os.mkdir('圖檔1')
    for i in range(2000):
        download_img(i+1)
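To make the extraction step easier to follow without hitting the network, here is a minimal sketch that pulls the same two attributes (`alt` and `data-backup`) out of a hand-written HTML fragment. The sample markup, the example.com URLs, and the class name `MemeImgParser` are assumptions for illustration. It uses the standard library's `html.parser` rather than BeautifulSoup so that it runs with no third-party packages, and unlike the program above it does not restrict the search to one particular `div`.

```python
from html.parser import HTMLParser

# Hand-written sample markup (an assumption, not copied from the site).
SAMPLE_HTML = '''
<div class="page-content text-center">
  <img alt="happy cat" data-backup="http://example.com/a.jpg">
  <img src="http://example.com/banner.png">
  <img alt="sad dog" data-backup="http://example.com/b.jpg">
</div>
'''

class MemeImgParser(HTMLParser):
    """Collect (alt, data-backup) pairs from <img> tags that carry both."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag != 'img':
            return
        attrs = dict(attrs)  # attrs arrives as a list of (name, value) tuples
        name = attrs.get('alt')
        url = attrs.get('data-backup')
        if name and url:  # ignore images missing either attribute
            self.found.append((name, url))

parser = MemeImgParser()
parser.feed(SAMPLE_HTML)
print(parser.found)
```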

With multiple threads, the program is structured as follows.

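The main program first puts the URL of every page to be scraped into a page queue. Each producer takes page URLs from that queue, extracts the URL of every image on the page, and adds those image URLs to a second, shared queue that exists only to hold meme URLs. The consumers then take each meme URL from this second queue and download the image to local disk.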
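The two-queue structure described above can be sketched without any network access. In this minimal example all names (`producer`, `consumer`, `run_pipeline`) are hypothetical, "parsing" a page is faked by generating two item names per page, and consumers are stopped with a `None` sentinel, which is one common way to shut down a consumer and not necessarily how the full program does it.

```python
import threading
from queue import Queue, Empty

def producer(page_queue, item_queue):
    # Take pages until none are left; each page "yields" two fake items.
    while True:
        try:
            page = page_queue.get_nowait()
        except Empty:
            break
        for k in range(2):
            item_queue.put('page{}-img{}'.format(page, k))

def consumer(item_queue, results, lock):
    # Process items until a None sentinel signals that no more will arrive.
    while True:
        item = item_queue.get()
        if item is None:
            break
        with lock:
            results.append(item)

def run_pipeline(n_pages=10, n_producers=3, n_consumers=3):
    page_queue, item_queue = Queue(), Queue()
    for page in range(1, n_pages + 1):
        page_queue.put(page)
    results, lock = [], threading.Lock()
    producers = [threading.Thread(target=producer, args=(page_queue, item_queue))
                 for _ in range(n_producers)]
    consumers = [threading.Thread(target=consumer, args=(item_queue, results, lock))
                 for _ in range(n_consumers)]
    for t in producers + consumers:
        t.start()
    for t in producers:
        t.join()                  # all items have been queued after this
    for _ in consumers:
        item_queue.put(None)      # one sentinel per consumer
    for t in consumers:
        t.join()
    return results

results = run_pipeline()
print(len(results))
```

The order of `results` varies between runs because the threads interleave, but every produced item is consumed exactly once.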

The complete code is below:

import os
import threading
from queue import Queue, Empty

import requests
from bs4 import BeautifulSoup

# Producer: both queues are passed in when the producer thread is created.
class Producer(threading.Thread):
    def __init__(self, page_queue, img_queue, *args, **kwargs):
        # *args and **kwargs are forwarded unchanged to threading.Thread
        super().__init__(*args, **kwargs)
        self.page_queue = page_queue
        self.img_queue = img_queue

    def run(self):
        # Each thread keeps taking page URLs until the page queue is empty.
        while True:
            try:
                # get_nowait() avoids blocking forever when another thread
                # takes the last URL between an empty() check and get().
                url = self.page_queue.get_nowait()
            except Empty:
                break
            try:
                self.parse_page(url)
            except (requests.RequestException, IndexError):
                # Skip pages that fail to load or lack the expected <div>.
                continue

    def parse_page(self, url):
        # Parse one listing page and queue every meme found on it.
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'lxml')
        div = soup.find_all('div', class_="page-content text-center")[0]
        for img in div.find_all('img'):
            # The alt attribute holds the meme's name; data-backup holds the image URL.
            name = img.attrs.get('alt')
            img_url = img.attrs.get('data-backup')
            if name and img_url:
                # Queue the name and URL together as a tuple.
                self.img_queue.put((name, img_url))

# Consumer: keeps taking (name, URL) tuples from img_queue and downloads them.
class Consumer(threading.Thread):
    def __init__(self, img_queue, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.img_queue = img_queue

    def run(self):
        while True:
            item = self.img_queue.get()
            if item is None:
                # A None sentinel means the producers are done; checking
                # empty() alone could exit early or block forever.
                break
            filename, img_url = item
            try:
                r = requests.get(img_url)
                with open('圖檔2/{}.jpg'.format(filename), 'wb') as f:
                    f.write(r.content)
                print('{} downloaded'.format(filename))
            except (requests.RequestException, OSError):
                # Skip images that fail to download or cannot be written to disk.
                continue

if __name__ == '__main__':
    # Define the two queues.
    page_queue = Queue(100)   # URLs of the 100 listing pages
    img_queue = Queue(1000)   # (name, URL) tuples for the images
    if not os.path.exists('圖檔2'):
        os.mkdir('圖檔2')
    for i in range(100):
        url = 'http://www.doutula.com/photo/list/?page={}'.format(i + 1)
        page_queue.put(url)   # add each page's URL
    # Create 5 producers, which request pages and extract meme URLs,
    # and 5 consumers, which download the images.
    producers = [Producer(page_queue, img_queue) for _ in range(5)]
    consumers = [Consumer(img_queue) for _ in range(5)]
    for t in producers + consumers:
        t.start()
    # Once every producer has finished, send one sentinel per consumer.
    for t in producers:
        t.join()
    for _ in consumers:
        img_queue.put(None)
