07 - Multithreaded Scheduling for Crawlers | 01. Data Scraping | Python

2021-11-06 19:40:16

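There are generally several ways to make a crawler run multiple threads concurrently within a single process:

reactor.callInThread: run the callable object in a separate thread.

reactor.callFromThread: cause a function to be executed by the reactor thread.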

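In other words, a function passed to reactor.callFromThread runs inside the main event loop started by reactor.run(), so it stops running once reactor.stop() is called. Another thread can even use a call such as reactor.callFromThread(reactor.stop) to actively ask the main event loop to shut down the reactor's main thread.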

callFromThread can be risky: if too many tasks are pushed onto it, they block the main event loop, and other events cannot be handled in time.

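Looking at the source of callInThread, you can see that it works in a thread pool private to the reactor:

    def callInThread(self, _callable, *args, **kwargs):
        # Create the reactor's thread pool lazily, on first use.
        if self.threadpool is None:
            self._initThreadPool()
        self.threadpool.callInThread(_callable, *args, **kwargs)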

Therefore, we can use

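There are two problems here:

1. How do you tell a thread that is running a callInThread task to exit, and how do you make sure the worker threads in the pool shut down safely?

2. If a worker thread is sent to fetch a page from some website, the unpredictability of TCP/IP means the thread may hang and not return for a long time. If every thread in the pool is used up this way and no idle thread remains, all fetching effectively stops. A thread may eventually exit because its request times out, but that is not necessarily reliable. The usual approach is code like: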

    import timeoutsocket

    timeoutsocket.setDefaultSocketTimeout(120)

which sets the socket timeout, but sometimes a thread still hangs for no apparent reason.
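On current Python versions the standard library offers the same kind of process-wide default through socket.setdefaulttimeout. The sketch below assumes that substitution is acceptable; `recv_times_out` is a hypothetical helper that connects to a local listener which never replies, to show a blocked read giving up instead of hanging.

```python
import socket


def recv_times_out(timeout=0.2):
    """Return True if recv() on a silent peer gives up after `timeout` seconds."""
    # A local listener that queues the connection but never sends anything.
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))
    server.listen(1)
    try:
        client = socket.create_connection(server.getsockname(), timeout=timeout)
        try:
            client.recv(1024)  # blocks, because nothing is ever sent
            return False
        except socket.timeout:
            return True
        finally:
            client.close()
    finally:
        server.close()


# Process-wide default for sockets created from now on (120 s, as in the text above).
socket.setdefaulttimeout(120)
```

The timeout applies to each blocking socket operation separately, not to the request as a whole, so a server that keeps trickling data can still hold a worker thread for much longer than 120 seconds. It also does not bound name resolution. Either may be behind some of the hangs described above, though that is a guess rather than a diagnosis.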

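twisted.internet.threads.deferToThread, like callInThread, uses the thread pool provided by reactor.getThreadPool() by default. It calls that pool's threadpool.callInThreadWithCallback method, so the effect is the same as reactor.callInThread. The only difference is that deferToThread returns a Deferred object, which lets you attach callbacks.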

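Example code:

    from twisted.internet import threads

    def finish_success(request):
        # Finish handling the request once parsing has completed.
        pass

    # parsedata runs in a pool thread; the callback fires back in the reactor thread.
    threads.deferToThread(parsedata, body).addCallback(lambda x: finish_success(request))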

Twisted also provides a simple shortcut, defer.succeed:

Return a Deferred that has already had '.callback(result)' called.

This is useful when you're writing synchronous code to an asynchronous interface: i.e., some code is calling you expecting a Deferred result, but you don't actually need to do anything asynchronous. Just return defer.succeed(theResult).

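Put plainly, defer.succeed exists so that a function A can return a Deferred object, which lets callers chain .addCallback(…) on A's result just as they would with a truly asynchronous call. Because that Deferred has already fired, the callback runs as soon as it is added.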
