
Python 3 urllib3 and urllib: an example of urllib3 and threading in Python


I am trying to use urllib3 with a simple thread pool to fetch several wiki pages.

The script creates one connection for every thread (I don't understand why) and then hangs forever.

Any tips, advice, or a simple example of urllib3 with threading would be appreciated.

    import threadpool
    from urllib3 import connection_from_url

    HTTP_POOL = connection_from_url(url, timeout=10.0, maxsize=10, block=True)

    def fetch(url, fields):
        kwargs = {'retries': 6}
        return HTTP_POOL.get_url(url, fields, **kwargs)

    pool = threadpool.ThreadPool(5)
    requests = threadpool.makeRequests(fetch, iterable)
    [pool.putRequest(req) for req in requests]

@Lennart's script got this error:

    http://en.wikipedia.org/wiki/2010-11_Premier_League
    http://en.wikipedia.org/wiki/List_of_MythBusters_episodes
    http://en.wikipedia.org/wiki/List_of_Top_Gear_episodes
    http://en.wikipedia.org/wiki/List_of_Unicode_characters
    Traceback (most recent call last):
      File "/usr/local/lib/python2.7/dist-packages/threadpool.py", line 156, in run
        result = request.callable(*request.args, **request.kwds)
      File "crawler.py", line 9, in fetch
        print url, conn.get_url(url)
    AttributeError: 'HTTPConnectionPool' object has no attribute 'get_url'

(The same traceback appears four times in the output, interleaved with the printed URLs.)

After adding import threadpool, import urllib3, and tpool = threadpool.ThreadPool(4), @user318904's code got this error:

    Traceback (most recent call last):
      File "crawler.py", line 21, in <module>
        tpool.map_async(fetch, urls)
    AttributeError: ThreadPool instance has no attribute 'map_async'

Solution

Here is my take: a more current solution using Python 3 and concurrent.futures.ThreadPoolExecutor.

    import urllib3
    from concurrent.futures import ThreadPoolExecutor

    urls = ['http://en.wikipedia.org/wiki/2010-11_Premier_League',
            'http://en.wikipedia.org/wiki/List_of_MythBusters_episodes',
            'http://en.wikipedia.org/wiki/List_of_Top_Gear_episodes',
            'http://en.wikipedia.org/wiki/List_of_Unicode_characters',
            ]

    def download(url, cmanager):
        # Fetch one page through the shared connection manager.
        response = cmanager.request('GET', url)
        if response and response.status == 200:
            print("+++++++++ url: " + url)
            print(response.data[:1024])  # first 1024 bytes of the body

    # One PoolManager shared by all threads, keeping up to 5 connections per host.
    connection_mgr = urllib3.PoolManager(maxsize=5)
    thread_pool = ThreadPoolExecutor(5)

    # Submit one download job per URL; the worker threads run them concurrently.
    for url in urls:
        thread_pool.submit(download, url, connection_mgr)
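The loop above discards what submit returns. If the results are needed, they can be collected from the Future objects instead. The following is only a minimal sketch: fake_download is a hypothetical stand-in for download that returns a value and makes no network request, so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_download(url):
    """Hypothetical stand-in for download(): returns a value, no network I/O."""
    return len(url)


urls = ['http://en.wikipedia.org/wiki/List_of_Top_Gear_episodes',
        'http://en.wikipedia.org/wiki/List_of_Unicode_characters']

results = {}
# Leaving the with-block calls shutdown(wait=True), so all jobs have
# finished by the time it ends.
with ThreadPoolExecutor(max_workers=2) as thread_pool:
    # Map each Future back to the URL it was submitted for.
    futures = {thread_pool.submit(fake_download, url): url for url in urls}
    for future in as_completed(futures):
        # result() returns the function's return value, or re-raises
        # any exception raised in the worker thread.
        results[futures[future]] = future.result()

print(results)
```

Calling result() is also the place where an exception raised inside a worker becomes visible; with plain submit calls and no result() it passes silently.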

Some remarks

My code is based on a similar example from the Python Cookbook by Beazley and Jones.

I particularly like the fact that you only need one standard-library module besides urllib3.

The setup is extremely simple, and if download is used only for its side effects (printing, saving to a file, etc.), no extra effort is needed to join the threads.

If you need the results instead, ThreadPoolExecutor.submit returns a Future that wraps whatever download returns.

I found it helpful to match the number of threads in the thread pool to the number of HTTPConnections in the connection pool (via maxsize). Otherwise you may see (harmless) warnings when all threads access the same server, as in the example.
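One way to keep the two sizes in sync is to derive both from a single constant. This is a sketch under assumptions, not part of the original answer: make_pools and NUM_WORKERS are hypothetical names, and block=True (the option the question also used) is an optional extra that makes a thread wait for a free connection rather than open one beyond maxsize. No request is sent here.

```python
from concurrent.futures import ThreadPoolExecutor

import urllib3

NUM_WORKERS = 5  # assumed value; the answer above also uses 5


def make_pools(num_workers):
    """Hypothetical helper: return a (PoolManager, ThreadPoolExecutor) pair
    whose sizes match."""
    # maxsize is the number of connections kept per host; with block=True
    # no more than maxsize connections are used at a time.
    connection_mgr = urllib3.PoolManager(maxsize=num_workers, block=True)
    thread_pool = ThreadPoolExecutor(max_workers=num_workers)
    return connection_mgr, thread_pool


connection_mgr, thread_pool = make_pools(NUM_WORKERS)
# ... submit download jobs here, as in the answer above ...
thread_pool.shutdown(wait=True)  # wait for any outstanding jobs to finish
```

With both sizes taken from NUM_WORKERS, changing the thread count cannot leave the connection pool too small for the threads that share it.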