05 - Access Timeout Settings

2021-11-06 19:40:18

Set an HTTP or socket access timeout so that the crawler does not spend too long fetching a single page.

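When calling the pycurl library, a timeout can be set on the handle. CONNECTTIMEOUT limits only the connection phase, in seconds; pycurl.TIMEOUT limits the whole transfer instead:

c.setopt(pycurl.CONNECTTIMEOUT, 60)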

In Python 2.6, the httplib library has the following constructor:

class HTTPConnection:
    def __init__(self, host, port=None, strict=None,
                 timeout=socket._GLOBAL_DEFAULT_TIMEOUT):
        self.timeout = timeout

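So a timeout can be passed straight to the constructor. If a timeout is given to the HTTPConnection or HTTPSConnection constructor, blocking operations (such as trying to establish a connection) time out after that many seconds. If the argument is omitted, the global default timeout is used. Note that explicitly passing None does not select the global default; it disables the timeout for that connection.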
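As a small illustration of the constructor argument described above, here is a sketch for Python 3, where httplib was renamed to http.client. The host name is a placeholder chosen for this example, not something from the original post, and the 10-second value is arbitrary.

```python
import http.client

# Hypothetical host; building the object does not open a connection yet.
conn = http.client.HTTPConnection("www.example.com", 80, timeout=10)

# The value is stored on the instance and applied when the socket is
# actually created, i.e. on connect() or on the first request().
print(conn.timeout)
```

The timeout only takes effect once a request is made; at that point a connection attempt or a read that exceeds it raises a socket timeout error.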

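In Python 2.5, the __init__ method of HTTPConnection has no timeout parameter, so the timeout has to be set through a rather well-hidden function:

httplib.socket.setdefaulttimeout(3)  # the argument is in seconds

Here httplib.socket is simply the standard socket module as imported by httplib, so this call changes the global default timeout for all new sockets.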

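Finally, if you really cannot find any function that sets a timeout in the library you are using, you can set the global socket timeout, although this is not ideal because it affects every socket created afterwards in the process: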

>>> import socket

>>> socket.setdefaulttimeout(90)
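One way to limit the side effects of the global setting is to change it only for a short stretch of code and then restore it. The following is a minimal sketch in Python 3; the helper name default_timeout is made up for this example and is not part of the socket module. It is also not thread-safe, since the default is shared by the whole process.

```python
import socket
from contextlib import contextmanager


@contextmanager
def default_timeout(seconds):
    """Temporarily change the process-wide default socket timeout."""
    previous = socket.getdefaulttimeout()  # None means "no timeout"
    socket.setdefaulttimeout(seconds)
    try:
        yield
    finally:
        # Put the old value back so unrelated code is not affected.
        socket.setdefaulttimeout(previous)


before = socket.getdefaulttimeout()

with default_timeout(90):
    s = socket.socket()
    inside = s.gettimeout()  # new sockets inherit the current default
    s.close()

after = socket.getdefaulttimeout()
```

Sockets created inside the with block get the 90-second timeout, and the previous default is back in place afterwards.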

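# Complete example (Python 2.6+): fetch a slow page with a 1-second global timeout.
import socket
from urllib2 import urlopen, URLError

slowurl = "http://www.wenxuecity.com/"

# Applies to every socket created from now on; the unit is seconds.
socket.setdefaulttimeout(1)

try:
    data = urlopen(slowurl)
    data.read()
except socket.timeout:
    # Raised directly when reading the response takes too long.
    print "there was a timeout"
except URLError as e:
    # urlopen wraps errors from the connection phase in URLError.
    if isinstance(e.reason, socket.timeout):
        print "there was a timeout"
    else:
        print "there was some other socket error"
except socket.error:
    print "there was some other socket error"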
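For comparison, here is a sketch of the same idea in Python 3, where urllib2 became urllib.request and urlopen accepts a per-call timeout, so the global socket default does not need to be touched. The function name fetch and its return convention (None on timeout) are choices made for this example only.

```python
import socket
import urllib.error
import urllib.request


def fetch(url, timeout):
    """Return the response body, or None if the request timed out."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read()
    except socket.timeout:
        # Raised directly when waiting for the response takes too long.
        return None
    except urllib.error.URLError as exc:
        # Errors from connecting or sending are wrapped in URLError;
        # the original exception is available as exc.reason.
        if isinstance(exc.reason, socket.timeout):
            return None
        raise
```

A call such as fetch(slowurl, timeout=1) then returns None when the page is too slow, and lets any other network error propagate to the caller.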
