Python Web Scraping: Multiple Coroutines

Everything in this article assumes a single-core CPU; multiprocessing (for multi-core CPUs) is a topic I have yet to study.

1. When the crawl workload is small

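# Import the monkey module from gevent and patch the standard library
# before anything else is imported. monkey.patch_all() makes blocking calls
# cooperative, which is what lets the program run asynchronously.
from gevent import monkey
monkey.patch_all()
# Import gevent, time and requests.
import gevent
import time
import requests

# Record the program's start time.
start = time.time()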

url_list = [
    "https://www.kaikeba.com/",
    "https://www.csdn.net/",
    "https://www.json.cn/",
    "https://cn.bing.com/",
    "https://www.jianshu.com/",
    "http://www.techweb.com.cn/",
    "https://www.bilibili.com/",
    "https://www.huxiu.com/"
]
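# The eight sites to crawl, collected in a list.

# Define the crawler() function.
def crawler(url):
    # Fetch the site with requests.get().
    r = requests.get(url)
    # Print the URL, the time elapsed since start, and the status code.
    print(url, time.time() - start, r.status_code)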

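# Create an empty task list.
tasks_list = []

# Iterate over url_list: gevent.spawn() creates a task for each URL,
# and the task is then added to the task list.
for url in url_list:
    task = gevent.spawn(crawler, url)
    tasks_list.append(task)

# Run every task in the list and wait until all of them have finished;
# this is the point where the crawlers actually start fetching.
gevent.joinall(tasks_list)
# Record the end time and print the total time the program took.
end = time.time()
print(end - start)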
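
To see why the spawn/joinall pattern above should finish in roughly the time of the slowest request rather than the sum of all of them, here is a minimal sketch that needs no network access. It assumes gevent is installed; `fake_fetch` and the delay values are made up for illustration, and `gevent.sleep()` stands in for the waiting that `requests.get()` does.

```python
import time
import gevent

def fake_fetch(name, delay):
    # gevent.sleep() yields to other greenlets while "waiting",
    # the way a patched network call would.
    gevent.sleep(delay)
    return name

# Hypothetical jobs: four "requests" that each wait 0.2 seconds.
jobs = [("a", 0.2), ("b", 0.2), ("c", 0.2), ("d", 0.2)]

start = time.time()
tasks = [gevent.spawn(fake_fetch, name, delay) for name, delay in jobs]
gevent.joinall(tasks)
elapsed = time.time() - start

# Each finished greenlet exposes its return value as .value.
results = [task.value for task in tasks]
print(results, round(elapsed, 2))
```

Run one after another, the four waits would add up to about 0.8 seconds; run as greenlets, they overlap, so the elapsed time stays close to a single wait.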

2. When the crawl workload is large (say, 1,000 websites): use a Queue()

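# Import the monkey module from gevent and patch the standard library
# before the other imports. monkey.patch_all() makes blocking calls
# cooperative, which is what lets the program run asynchronously.
from gevent import monkey
monkey.patch_all()
import gevent
import time
import requests
# Import the Queue class from gevent's queue module.
from gevent.queue import Queue

start = time.time()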

url_list = [
    "https://www.kaikeba.com/",
    "https://www.csdn.net/",
    "https://www.json.cn/",
    "https://cn.bing.com/",
    "https://www.jianshu.com/",
    "http://www.techweb.com.cn/",
    "https://www.bilibili.com/",
    "https://www.huxiu.com/"
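]

# Create a queue object and assign it to work.
work = Queue()

# Iterate over url_list and put every URL into the queue with put_nowait().
for url in url_list:
    work.put_nowait(url)

def crawler():
    # Keep going as long as the queue is not empty.
    while not work.empty():
        # Take the next URL out of the queue with get_nowait().
        url = work.get_nowait()
        # Fetch the URL with requests.get().
        r = requests.get(url)
        # Print the URL, the remaining queue length, and the status code.
        print(url, work.qsize(), r.status_code)

# Create an empty task list.
tasks_list = []

# Spawn two tasks that each run crawler(); this is equivalent to
# creating two crawlers that work through the same queue.
for x in range(2):
    task = gevent.spawn(crawler)
    tasks_list.append(task)

# gevent.joinall() runs every task in the list and waits for them to finish;
# this is the point where the crawlers start fetching.
gevent.joinall(tasks_list)

end = time.time()
print(end - start)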
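
The queue pattern above can also be tried without any network access. The sketch below is only an illustration under assumptions: the integers stand in for URLs, `gevent.sleep()` stands in for `requests.get()`, and the `worker` function and `done` list are hypothetical names. Instead of checking `work.empty()` first, it catches the `Empty` exception that `get_nowait()` raises once the queue has been drained, a common defensive variant of the same loop.

```python
import gevent
from gevent.queue import Queue, Empty

# Fill the queue with six placeholder items (stand-ins for URLs).
work = Queue()
for item in range(6):
    work.put_nowait(item)

done = []

def worker(worker_id):
    while True:
        try:
            # Take the next item; raises Empty when nothing is left.
            item = work.get_nowait()
        except Empty:
            return
        # Simulated I/O wait; other workers run during this pause.
        gevent.sleep(0.01)
        done.append((worker_id, item))

# Two workers share one queue, like the two crawlers above.
tasks = [gevent.spawn(worker, worker_id) for worker_id in range(2)]
gevent.joinall(tasks)

processed = sorted(item for _, item in done)
print(processed)
```

The number of workers, not the number of queued items, decides how many requests are in flight at once, which is why this shape suits long URL lists.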

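3. A worked example

Goal: use multiple coroutines and a queue to crawl the Douban Books Top 250 (title, author, rating) and save the results to a CSV file.

Douban Books: https://book.douban.com/top250?start=0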
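
The `start=0` in the URL above is the offset of the first book on the page. As a rough sketch of how the offsets for the whole list could be derived, assuming 25 books per page (the page size the script in this section uses) and 250 books in total; `page_params` is a hypothetical helper name:

```python
from urllib.parse import urlencode

BASE_URL = "https://book.douban.com/top250"
PAGE_SIZE = 25   # assumed books per page
TOTAL = 250      # size of the Top 250 list

def page_params(total=TOTAL, page_size=PAGE_SIZE):
    # One params dict per page: start=0, 25, 50, ..., 225.
    return [{"start": offset} for offset in range(0, total, page_size)]

params = page_params()
second_page_url = BASE_URL + "?" + urlencode(params[1])
print(len(params), params[0], params[-1])
print(second_page_url)
```

Each dict can be passed to `requests.get()` as `params=`, which is the form the queue in this section stores.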

from gevent import monkey
monkey.patch_all()
import gevent, time, requests
from bs4 import BeautifulSoup
from gevent.queue import Queue
import csv

# Open the output file; newline='' keeps the csv module from
# writing extra blank rows on Windows.
csv_file = open('books.csv', 'w', newline='')
writer = csv.writer(csv_file)

url = 'https://book.douban.com/top250'
pageSize = 25
startPage = 0
start = time.time()

headers={'user-agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'}

work = Queue()

# Queue the query parameters for each page. Only the first 3 pages
# (75 books) are queued here; the full Top 250 would need range(10).
for i in range(3):
    params = {
        'start': startPage + i * pageSize
    }
    work.put_nowait(params)

def crawler():
    while not work.empty():
        # Take the parameters for one page and fetch that page.
        param = work.get_nowait()
        res = requests.get(url, params=param, headers=headers)
        books_html = BeautifulSoup(res.text, 'html.parser')
        # Each book on the page sits in its own <table> inside div.indent.
        indent_div = books_html.find('div', class_='indent')
        list_books = indent_div.find_all('table')
        for book in list_books:
            # Title: the link text inside div.pl2, with spaces and newlines removed.
            title_div = book.find('div', class_='pl2')
            tag_a = title_div.find('a')
            title = tag_a.text.replace(' ', '').replace('\n', '')
            # Author line: the text of p.pl.
            tag_p = book.find('p', class_='pl')
            author = tag_p.text.replace(' ', '').replace('\n', '')
            # Rating: the text of span.rating_nums.
            tag_span = book.find('span', class_='rating_nums')
            rating_nums = tag_span.text.replace(' ', '').replace('\n', '')
            print(title, author, rating_nums)
            writer.writerow([title, author, rating_nums])


# Spawn three crawlers that share the queue, then wait for all of them.
tasks_list = []
for x in range(3):
    task = gevent.spawn(crawler)
    tasks_list.append(task)
gevent.joinall(tasks_list)
end = time.time()
print(end - start)
csv_file.close()
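
The script above opens books.csv at the top and closes it on the very last line, so an exception in between would leave the file unclosed. One possible tidy-up, shown here only as a sketch, is to put the writing step in a small function that accepts any file object. `write_books` is a hypothetical helper, the header row is an addition the original script does not write, and the sample rows are placeholders rather than real scraped data.

```python
import csv
import io

def write_books(fileobj, rows):
    # Write a header row followed by one row per book.
    writer = csv.writer(fileobj)
    writer.writerow(["title", "author", "rating"])
    writer.writerows(rows)

# Placeholder rows, not real data from Douban.
sample_rows = [
    ("Title A", "Author A", "9.0"),
    ("Title B", "Author B", "8.5"),
]

# An in-memory buffer stands in for the file so the sketch has no side effects.
buffer = io.StringIO(newline="")
write_books(buffer, sample_rows)
lines = buffer.getvalue().splitlines()
print(lines)
```

With a real file, the same function could be called inside `with open('books.csv', 'w', newline='') as f:` so that the file is closed even if parsing fails partway through.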
