Python Crawler Basics: urllib2 and urllib

2023-08-05 14:51:20

#### The author is learning web crawling under Python 2.7, so all code here targets Python 2.7

import urllib2  # import the modules
import urllib
html = urllib2.urlopen('http://www.jikexueyuan.com')
html.read()

The few lines above are enough to download the HTML page of Jikexueyuan (极客学院). Now let's take a closer look at the urllib2 module:

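# urlopen()
>>> urllib2.urlopen(url, data, timeout) # url: the address to open; data: url-encoded parameters to send (optional); timeout: seconds to wait (optional)
Whether data is passed decides if the URL is requested with GET or POST: without data the request is a GET, with data it is a POST.
>>> value = {'username':'root','password':123456}
>>> param = urllib.urlencode(value)
>>> print param  # the key order may vary, because dicts are unordered in Python 2.7
username=root&password=123456
>>> html = urllib2.urlopen('http://www.ccut.edu.cn?%s' % param) # GET request: parameters go in the query string
>>> html = urllib2.urlopen('http://www.ccut.edu.cn', param) # POST request: parameters go in the request body
>>>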
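
The difference between the two calls above is only where the encoded parameters end up. The following is a minimal sketch that builds both forms without touching the network. The helper names build_get_url and build_post_body are made up for illustration, the parameters are sorted only so the output is predictable, and the Python 3 import fallback is an assumption added so the snippet also runs outside Python 2.7.

```python
try:                     # Python 2.7, as used in this post
    from urllib import urlencode
except ImportError:      # Python 3 fallback (an assumption, not part of the original post)
    from urllib.parse import urlencode


def build_get_url(base_url, params):
    """Append the url-encoded params to base_url as a query string (hypothetical helper)."""
    query = urlencode(sorted(params.items()))  # sorted only to get a stable order
    separator = '&' if '?' in base_url else '?'
    return base_url + separator + query


def build_post_body(params):
    """Return the url-encoded params as bytes, ready to be passed as urlopen's data argument."""
    return urlencode(sorted(params.items())).encode('ascii')


value = {'username': 'root', 'password': 123456}
get_url = build_get_url('http://www.ccut.edu.cn', value)
post_body = build_post_body(value)

print(get_url)    # the parameters are visible in the URL
print(post_body)  # the same parameters, to be sent in the request body
```

Passing get_url alone to urlopen() would issue a GET, while passing the base URL together with post_body would issue a POST.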

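# urllib2.Request() can set request headers such as User-Agent, which helps get past simple anti-crawler checks
>>> user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
>>> headers = { 'User-Agent' : user_agent }
>>> request = urllib2.Request(url, param, headers) # url and param are the same as above
>>> response = urllib2.urlopen(request)
>>> response.read() # done: the request was sent with our own User-Agent instead of the default one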
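
To check what a Request object will actually send before opening it, its method and headers can be inspected offline. This is a small sketch; make_request is a hypothetical helper, and the Python 3 import fallback is an assumption added so the snippet runs on both versions.

```python
try:                     # Python 2.7, as used in this post
    from urllib2 import Request
except ImportError:      # Python 3 fallback (an assumption, not part of the original post)
    from urllib.request import Request

USER_AGENT = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'


def make_request(url, user_agent, data=None):
    """Build a Request carrying a custom User-Agent header (hypothetical helper)."""
    headers = {'User-Agent': user_agent}
    return Request(url, data, headers)


get_req = make_request('http://www.ccut.edu.cn', USER_AGENT)
post_req = make_request('http://www.ccut.edu.cn', USER_AGENT, data=b'username=root')

print(get_req.get_method())              # no data attached, so this is a GET
print(post_req.get_method())             # data attached, so this is a POST
# Request capitalizes header names internally, hence 'User-agent' for the lookup
print(get_req.get_header('User-agent'))
```

Nothing is sent until the Request is handed to urlopen(), so this is a cheap way to confirm the headers are what you intended.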

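Proxy settings: some sites track how many times a given IP visits within a period of time and block the IP if the count gets too high. To work around this, you can route your requests through proxy servers and switch to a different proxy every so often.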

enable_proxy = True
proxy_handler = urllib2.ProxyHandler({"http" : 'http://some-proxy.com:8080'})
null_proxy_handler = urllib2.ProxyHandler({})
if enable_proxy:
    opener = urllib2.build_opener(proxy_handler)
else:
    opener = urllib2.build_opener(null_proxy_handler)
urllib2.install_opener(opener)
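
Switching proxies every so often can be sketched by cycling through a list and building a fresh opener for each proxy. The proxy addresses and the next_opener helper below are hypothetical placeholders, and the Python 3 import fallback is an assumption; no request is sent by this snippet.

```python
import itertools

try:                     # Python 2.7, as used in this post
    import urllib2 as request_mod
except ImportError:      # Python 3 fallback (an assumption, not part of the original post)
    import urllib.request as request_mod

# Hypothetical proxy addresses; replace them with proxies you are allowed to use.
PROXIES = [
    'http://proxy-a.example.com:8080',
    'http://proxy-b.example.com:8080',
]

proxy_cycle = itertools.cycle(PROXIES)


def next_opener():
    """Return the next proxy in the rotation and an opener that routes HTTP through it."""
    proxy = next(proxy_cycle)
    handler = request_mod.ProxyHandler({'http': proxy})
    return proxy, request_mod.build_opener(handler)


first_proxy, first_opener = next_opener()
second_proxy, second_opener = next_opener()
third_proxy, third_opener = next_opener()   # wraps around to the first proxy again

print(first_proxy)
print(second_proxy)
print(third_proxy)
```

Calling opener.open(url) on the returned opener, rather than install_opener(), keeps the choice of proxy local to each request instead of changing it globally.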

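Simulating a login:

# Many pages show the content we want to scrape only after logging in; we can simulate the login and keep the cookie:
import cookielib
url = 'http://www.ccut.edu.cn'
cookj = cookielib.CookieJar()  # holds the cookies the server sends back
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookj))
urllib2.install_opener(opener)  # from now on urllib2.urlopen() uses this opener
response = urllib2.urlopen(url)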
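
The snippet above only opens a page with a cookie-aware opener. An actual login would usually POST the credentials through that same opener, so that the session cookie lands in the jar and is sent with later requests. The sketch below assumes a form with fields named username and password and a made-up login URL; both are hypothetical, as is the Python 3 import fallback, and the login call is left commented out so nothing is sent.

```python
try:                     # Python 2.7, as used in this post
    import cookielib
    import urllib2 as request_mod
    from urllib import urlencode
except ImportError:      # Python 3 fallback (an assumption, not part of the original post)
    import http.cookiejar as cookielib
    import urllib.request as request_mod
    from urllib.parse import urlencode


def make_session():
    """Return a cookie jar and an opener that stores received cookies in it."""
    jar = cookielib.CookieJar()
    opener = request_mod.build_opener(request_mod.HTTPCookieProcessor(jar))
    return jar, opener


def login(opener, login_url, username, password):
    """POST the credentials; the form field names here are hypothetical."""
    body = urlencode({'username': username, 'password': password}).encode('ascii')
    return opener.open(login_url, body, timeout=10)


jar, opener = make_session()
print(len(jar))  # the jar is empty until a response sets a cookie

# Hypothetical login URL, not taken from the post; uncomment to try against a real form:
# response = login(opener, 'http://www.ccut.edu.cn/login', 'root', '123456')
# for cookie in jar:
#     print(cookie.name)
```

After a successful login, further calls to opener.open() reuse the cookies in jar automatically.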

For more details, see this article: http://cuiqingcai.com/954.html
