A Web Crawler Example Built with Python and PyQuery

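I. Introduction

For a project, I wrote a crawler. It parses the search result pages for dresses on Vipshop (唯品会, vip.com); the results span more than one page, so the pages are parsed in a loop. It extracts the image URL of each dress (each item on the Vipshop search result page has two images, front and back) and then downloads the images. I had used jQuery a little before, so I chose PyQuery here. What appealed to me was its powerful selectors, which make it easy to get the tags I need; follow-up operations such as reading attributes are just as convenient.

II. Problems encountered

1. Long periods of no response (hangs)

The program uses urlopen to fetch both the pages and the images. I assumed there was a default timeout and did not set one. After finding that the program would hang, I checked the documentation.

urllib2.urlopen(url[, data[, timeout[, cafile[, capath[, cadefault[, context]]]]]])
The optional timeout parameter specifies a timeout in seconds for blocking operations like the connection attempt (if not specified, the global default timeout setting will be used). This actually only works for HTTP, HTTPS and FTP connections.

As stated above, the function has an optional timeout parameter; if it is not set, the global default timeout is used.

socket.setdefaulttimeout(timeout)
Set the default timeout in seconds (float) for new socket objects. A value of None indicates that new socket objects have no timeout. When the socket module is first imported, the default is None.

As stated above, if the global default timeout is never set, it is None, meaning sockets never time out. That is why the program hangs. Setting a default timeout solves the problem.

2. Network exceptions that abort the program

I ran into exceptions such as "connection reset by peer". Some of them are caused by the server, others by the network. Since I do not need every image to be downloaded, I simply catch these exceptions without any special handling so that the program can keep running.

3. Some tags are created dynamically by JavaScript, so parsing the static page does not yield them

As mentioned above, each item on the Vipshop search result page has a front and a back image. The catch is that the URL of the back image is only assigned after that image has been displayed once (on mouse hover). I did not solve this problem; I worked around it. I noticed that the front and back image URLs differ by a single character, so the back URL is easy to derive from the front one, as shown here.

http://a.vpimg2.com/upload/merchandise/318350/SOULINE-SL4132130342-5.jpg
http://a.vpimg2.com/upload/merchandise/318350/SOULINE-SL4132130342-7.jpg

Replacing one character of a string is simple in Python, because indexing (including negative indexing) and slicing are both convenient.

4. open(file) raises an exception if the path does not exist

When a file is opened for writing, a missing file is created automatically, but a missing directory raises an exception. os.makedirs(dir) can create the path; here I created it by hand, since there is only one folder.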
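
The workaround in point 3 and the directory creation in point 4 can be sketched as follows. This is a minimal illustration, not the code the crawler itself uses: the helper names back_image_url and ensure_dir are hypothetical, and the sketch assumes, as in the two URLs quoted above, that the front image URL ends in a single digit followed by ".jpg".

```python
import os


def back_image_url(front_url):
    """Derive the back image URL from the front one (hypothetical helper)."""
    # front_url[:-5] is everything before the digit; front_url[-4:] is ".jpg".
    # Strings are immutable, so a new string is built from the two slices.
    return front_url[:-5] + '7' + front_url[-4:]


def ensure_dir(path):
    """Create the directory and any missing parents (hypothetical helper)."""
    if not os.path.isdir(path):
        os.makedirs(path)


front = 'http://a.vpimg2.com/upload/merchandise/318350/SOULINE-SL4132130342-5.jpg'
back = back_image_url(front)
print(back)
# prints http://a.vpimg2.com/upload/merchandise/318350/SOULINE-SL4132130342-7.jpg
```

Calling ensure_dir('dress') once before the first download would replace the manual step of creating the folder by hand.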

III. Source code

#coding=utf-8
import urllib2
import httplib
import socket
from pyquery import PyQuery as pq

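def download_img(img_url, img_count):
	# Save the image at img_url as dress/<img_count>.jpg.
	# Returns 1 on success and 0 on any failure, so the caller can keep going.
	img_name = str(img_count) + '.jpg'
	try:
		request = urllib2.Request(img_url)
		# Send a browser-like User-Agent with the request.
		request.add_header('User-Agent', 'Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5')
		img_stream = urllib2.urlopen(request)
		img_data = img_stream.read()
		# The dress/ directory must already exist; open() does not create it.
		with open('dress/'+img_name, 'wb') as img_file:
			img_file.write(img_data)
		return 1
	except Exception, e:
		# Network errors are only printed; the crawler skips this image.
		print e
		return 0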

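# Make httplib send its requests as HTTP/1.0.
httplib.HTTPConnection._http_vsn = 10
httplib.HTTPConnection._http_vsn_str = 'HTTP/1.0'

# Global default timeout (10 seconds) for new sockets, so urlopen cannot block forever.
socket.setdefaulttimeout(10)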

page_url_prefix = 'http://search.vip.com/search?searchkw=连衣裙&cate=&os=1&page='
img_count = 1

for page_count in xrange(1,50):
	page_url = page_url_prefix + str(page_count)
	print page_url

	try:
		page_html_file = urllib2.urlopen(page_url)
		page_src_string = page_html_file.read()
		d = pq(page_src_string)
	except Exception, e:
		print e
		continue

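	# Each img.J_first_pic element holds the front image of one dress.
	for img_url in [i.attr('src') for i in d('img.J_first_pic').items()]:
		# Skip elements without a src attribute; slicing None below would raise.
		if not img_url:
			continue

		print img_url
		if download_img(img_url, img_count) == 1:
			img_count = img_count + 1

		# The back image URL differs only in the digit before ".jpg" (5 -> 7).
		img_url = img_url[:-5] + '7' + img_url[-4:]

		print img_url
		if download_img(img_url, img_count) == 1:
			img_count = img_count + 1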
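
For readers on Python 3, where urllib2 was split into urllib.request and urllib.error, problems 1 and 2 could be handled roughly as shown here. This is a sketch under assumptions, not a port of the listing above: the function name fetch_bytes, the opener parameter (added only so the function can be exercised without network access), the placeholder User-Agent and the 10-second default are all illustrative choices.

```python
import http.client
import urllib.request

USER_AGENT = 'Mozilla/5.0'  # placeholder value, chosen for illustration


def fetch_bytes(url, timeout=10, opener=urllib.request.urlopen):
    """Return the response body as bytes, or None if the request fails.

    The timeout is passed explicitly on every call, so the request cannot
    block indefinitely even if no global socket default has been set.
    """
    request = urllib.request.Request(url, headers={'User-Agent': USER_AGENT})
    try:
        response = opener(request, timeout=timeout)
        try:
            return response.read()
        finally:
            response.close()
    except (OSError, http.client.HTTPException) as e:
        # In Python 3, URLError, socket.timeout and ConnectionResetError
        # ("connection reset by peer") are all subclasses of OSError.
        # The error is printed and the item skipped, as in the crawler above.
        print('skipped %s: %s' % (url, e))
        return None
```

A caller would treat a None result as "skip this image and continue", which mirrors the 0/1 return value of download_img.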
