Learning Web Crawling, Part 1: Using the urllib Library

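I had picked up a little web crawling before, but with the job market getting tougher I decided to study it systematically to stay competitive. I am working through Cui Qingcai's book and writing these posts as study notes for later review. Corrections and additions from more experienced readers are very welcome, and much appreciated!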

Web crawling: using the urllib library

import urllib.request

response=urllib.request.urlopen("https://www.python.org")
#print(response.read().decode("utf-8")) # returns the page content
print(type(response)) # the response is an HTTPResponse object
print('-----------------')
print(response.status) # status code of the response; 200 means the request succeeded
print('-----------------')
print(response.getheaders()) # all response headers
print('-----------------')
print(response.getheader('Server')) # look up the "Server" response header; the value is nginx,
                                    # meaning the server is built on nginx

<class 'http.client.HTTPResponse'>
-----------------
200
-----------------
[('Server', 'nginx'), ('Content-Type', 'text/html; charset=utf-8'), ('X-Frame-Options', 'SAMEORIGIN'), ('x-xss-protection', '1; mode=block'), ('X-Clacks-Overhead', 'GNU Terry Pratchett'), ('Via', '1.1 varnish'), ('Content-Length', '49192'), ('Accept-Ranges', 'bytes'), ('Date', 'Sat, 27 Oct 2018 07:52:03 GMT'), ('Via', '1.1 varnish'), ('Age', '539'), ('Connection', 'close'), ('X-Served-By', 'cache-iad2137-IAD, cache-tyo19936-TYO'), ('X-Cache', 'HIT, HIT'), ('X-Cache-Hits', '3, 1120'), ('X-Timer', 'S1540626724.880357,VS0,VE0'), ('Vary', 'Cookie'), ('Strict-Transport-Security', 'max-age=63072000; includeSubDomains')]
-----------------
nginx

The data parameter

import urllib.parse
import urllib.request

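# The data parameter is optional; when it is supplied, the request is sent as a POST
# Here we pass one parameter, {'word':'hello'}: urlencode() from urllib.parse turns the dict into a string,
# and bytes() then converts it to bytes; the second argument specifies the encoding
data=bytes(urllib.parse.urlencode({'word':"hello"}),encoding="utf-8")
response=urllib.request.urlopen("http://httpbin.org/post",data=data)
print(response.read()) # returns the body of the response to the POST request

# The parameter we passed shows up under "form", which shows that we simulated a form submission and sent the data via POST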

b'{\n  "args": {}, \n  "data": "", \n  "files": {}, \n  "form": {\n    "word": "hello"\n  }, \n  "headers": {\n    "Accept-Encoding": "identity", \n    "Connection": "close", \n    "Content-Length": "10", \n    "Content-Type": "application/x-www-form-urlencoded", \n    "Host": "httpbin.org", \n    "User-Agent": "Python-urllib/3.6"\n  }, \n  "json": null, \n  "origin": "171.210.147.2", \n  "url": "http://httpbin.org/post"\n}\n'
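The encoding step described above can be checked offline. The following is a minimal sketch that builds the same request body without sending anything; the variable names are only illustrative. The body comes out 10 bytes long, which matches the Content-Length of 10 that httpbin echoed back in the output above.

```python
from urllib.parse import urlencode

# Same form field as in the example above
fields = {'word': 'hello'}

# urlencode() produces a str; urlopen() needs bytes for its data argument
encoded = urlencode(fields)
body = encoded.encode('utf-8')  # equivalent to bytes(encoded, encoding='utf-8')

print(encoded)    # word=hello
print(body)       # b'word=hello'
print(len(body))  # 10
```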

The timeout parameter

import urllib.request

# The timeout parameter sets a timeout in seconds; if no response arrives within that time, an exception is raised.
response=urllib.request.urlopen("http://httpbin.org/get",timeout=1)
print(response.read())

b'{\n  "args": {}, \n  "headers": {\n    "Accept-Encoding": "identity", \n    "Connection": "close", \n    "Host": "httpbin.org", \n    "User-Agent": "Python-urllib/3.6"\n  }, \n  "origin": "171.210.147.2", \n  "url": "http://httpbin.org/get"\n}\n'

import urllib.request
import socket
import urllib.error
# Exception handling: when the request times out, catch the URLError; if its reason is a socket.timeout, print TIME OUT
try:
    response=urllib.request.urlopen("http://httpbin.org/get",timeout=0.1)
except urllib.error.URLError as e:
    if isinstance(e.reason,socket.timeout):
        print("TIME OUT")

TIME OUT
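The isinstance() check can be tried without depending on network speed. Below is a rough sketch in which the errors are constructed by hand rather than raised by a real request; the helper name is_timeout is my own and not part of urllib.

```python
import socket
import urllib.error

def is_timeout(exc):
    """Return True if exc is a URLError whose reason is a socket timeout."""
    return (isinstance(exc, urllib.error.URLError)
            and isinstance(exc.reason, socket.timeout))

# Simulated errors, built by hand instead of coming from a real request
timed_out = urllib.error.URLError(socket.timeout("timed out"))
refused = urllib.error.URLError(ConnectionRefusedError("connection refused"))

print(is_timeout(timed_out))  # True
print(is_timeout(refused))    # False
```

Wrapping the check in a small function keeps the except block short when several requests need the same handling.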

urllib.request.Request()

from urllib import request,parse

url="http://httpbin.org/post"

# Build the request headers
headers={
    'User-Agent':'Mozilla/4.0(compatible;MSIE 5.5;Windows NT)', # override User-Agent to pose as a browser (this string identifies as Internet Explorer)
    'Host':'httpbin.org' # domain name of the server being requested
}
form={ # form fields to submit
    'name':'Germey'
}

data=bytes(parse.urlencode(form),encoding='utf-8')
# Build the Request object
req=request.Request(url=url,data=data,headers=headers,method="POST") # method="POST" sets the request method

# Headers can also be added with add_header()
#req=request.Request(url=url,data=data,method="POST")
#req.add_header('User-Agent','Mozilla/4.0(compatible;MSIE 5.5;Windows NT)')

response=request.urlopen(req)
print(response.read().decode("utf-8"))

{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "name": "Germey"
  }, 
  "headers": {
    "Accept-Encoding": "identity", 
    "Connection": "close", 
    "Content-Length": "11", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "httpbin.org", 
    "User-Agent": "Mozilla/4.0(compatible;MSIE 5.5;Windows NT)"
  }, 
  "json": null, 
  "origin": "182.144.185.154", 
  "url": "http://httpbin.org/post"
}

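# Request parameters
req=urllib.request.Request(url,                 # required
                           data=None,           # if supplied, it must be bytes
                           headers={},          # request headers, typically used to set User-Agent and pose as a browser
                           origin_req_host=None,# host name or IP address of the requester
                           unverifiable=False,  # whether the request is unverifiable
                           method=None          # the HTTP method to use
                          )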
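A Request object can be inspected before it is sent, which is a handy way to see what these parameters do. This is a small sketch reusing the URL and User-Agent string from the example above; nothing is sent over the network.

```python
from urllib import parse, request

data = parse.urlencode({'name': 'Germey'}).encode('utf-8')
req = request.Request(
    'http://httpbin.org/post',
    data=data,
    headers={'User-Agent': 'Mozilla/4.0(compatible;MSIE 5.5;Windows NT)'},
)

# No method was given, so it is inferred: POST when data is present, GET otherwise
print(req.get_method())              # POST
print(req.host)                      # httpbin.org
# Header names are stored capitalized, so look them up as 'User-agent'
print(req.get_header('User-agent'))
print(req.data)                      # b'name=Germey'
```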

Advanced usage

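The BaseHandler class in the urllib.request module

BaseHandler is the parent class of all other Handlers. Handlers deal with things such as login authentication, proxies, and cookies. The commonly used classes are:

HTTPDefaultErrorHandler: handles HTTP error responses; every error is raised as an HTTPError exception

HTTPRedirectHandler: handles redirects

HTTPCookieProcessor: handles Cookies

ProxyHandler: sets proxies; by default no proxy is set

HTTPPasswordMgr: manages passwords; it maintains a table of user names and passwords (strictly speaking a helper used by the authentication handlers rather than a Handler itself)

HTTPBasicAuthHandler: manages authentication; if opening a link requires authentication, this handler can take care of it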

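The OpenerDirector class in the urllib.request module

An OpenerDirector (an "opener") makes more advanced requests possible. Its open() method returns the same type of response as urlopen(); the difference is that we can build an opener from Handlers of our own choosing.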
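To make the relationship between Handlers and openers concrete, here is a minimal sketch that only constructs an opener and does not send any request. The particular choice of handlers is just for illustration.

```python
import http.cookiejar
from urllib.request import (HTTPCookieProcessor, OpenerDirector, ProxyHandler,
                            build_opener)

# Two handlers: one for cookies, one with an empty mapping (no proxies)
cookie_handler = HTTPCookieProcessor(http.cookiejar.CookieJar())
proxy_handler = ProxyHandler({})

# build_opener() returns an OpenerDirector that chains the given handlers
# together with the default ones
opener = build_opener(cookie_handler, proxy_handler)

print(isinstance(opener, OpenerDirector))  # True
# opener.open(url) would now send a request through these handlers
```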

Authentication

# When the page to crawl pops up a login prompt as soon as it is opened, the following approach can be used
from urllib.request import HTTPPasswordMgrWithDefaultRealm,HTTPBasicAuthHandler,build_opener
from urllib.error import URLError

username="username"
password="password"
url="http://localhost:5000"

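p=HTTPPasswordMgrWithDefaultRealm() # create a password manager
p.add_password(None,url,username,password) # register the url, user name, and password
auth_handler=HTTPBasicAuthHandler(p) # pass in the password manager to get a handler for authentication
opener=build_opener(auth_handler) # build an opener from the handler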

try:
    result=opener.open(url) 
    html=result.read().decode("utf-8")
except URLError as e:
    print(e.reason)

[WinError 10061] No connection could be made because the target machine actively refused it.

Proxies

from urllib.request import ProxyHandler,build_opener,urlopen
from urllib.error import URLError

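# Assumes a proxy has been set up locally and is running on port 9743
# ProxyHandler takes a dict: the keys are protocol types and the values are proxy URLs; several proxies can be added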
proxy_handler=ProxyHandler({
    'http':'http://127.0.0.1:9743',
    'https':'https://127.0.0.1:9743'
})
opener=build_opener(proxy_handler)
try:
    response=opener.open('http://www.baidu.com')
    print(response.read().decode("utf-8"))
except URLError as e:
    print(e.reason)

[WinError 10061] No connection could be made because the target machine actively refused it.

Cookies

import http.cookiejar,urllib.request

cookie=http.cookiejar.CookieJar() # create a CookieJar object
handler=urllib.request.HTTPCookieProcessor(cookie) # build a handler with HTTPCookieProcessor
opener=urllib.request.build_opener(handler)
response=opener.open("http://www.baidu.com")
for item in cookie:
    print(item.name+"="+item.value) 
print(type(cookie))

BAIDUID=BCBC178D1854AF1C0D3FAAE603CE19E1:FG=1
BIDUPSID=BCBC178D1854AF1C0D3FAAE603CE19E1
H_PS_PSSID=1453_21098_27400_26350
PSTM=1540797225
delPer=0
BDSVRTM=0
BD_HOME=0
<class 'http.cookiejar.CookieJar'>

Saving cookies to a file

# Method 1:
import http.cookiejar,urllib.request

filename="cookie.txt"
# MozillaCookieJar is a subclass of CookieJar that handles reading and writing cookie files;
# it saves cookies in the cookie format used by Mozilla-type browsers
cookie=http.cookiejar.MozillaCookieJar(filename)
handler=urllib.request.HTTPCookieProcessor(cookie)
opener=urllib.request.build_opener(handler)
response=opener.open("http://www.baidu.com")
cookie.save(ignore_discard=True,ignore_expires=True)
          # ignore_discard: save cookies even if they are set to be discarded
          # ignore_expires: save cookies even if they have already expired

# Method 2:
import http.cookiejar,urllib.request

filename="cookie_2.txt"
# LWPCookieJar saves cookies in the libwww-perl (LWP) format
cookie=http.cookiejar.LWPCookieJar(filename)
handler=urllib.request.HTTPCookieProcessor(cookie)
opener=urllib.request.build_opener(handler)
response=opener.open("http://www.baidu.com")
cookie.save(ignore_discard=True,ignore_expires=True)

Reading cookies back from a file and using them

import http.cookiejar,urllib.request

cookie=http.cookiejar.LWPCookieJar()
cookie.load("cookie_2.txt",ignore_discard=True,ignore_expires=True) # load() reads the local cookie file
handler=urllib.request.HTTPCookieProcessor(cookie)
opener=urllib.request.build_opener(handler)
response=opener.open("http://www.baidu.com")
#print(response.read().decode("utf-8"))
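The save and load steps can also be tried without any network access. The sketch below is only an illustration: the cookie is made by hand, and its name, value, and domain are invented. It writes the cookie in LWP format to a temporary file and reads it back.

```python
import http.cookiejar
import os
import tempfile

# Hand-made cookie for illustration; name, value and domain are made up
cookie = http.cookiejar.Cookie(
    version=0, name='session', value='abc123',
    port=None, port_specified=False,
    domain='example.com', domain_specified=True, domain_initial_dot=False,
    path='/', path_specified=True,
    secure=False, expires=None, discard=True,
    comment=None, comment_url=None, rest={},
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'cookie_demo.txt')

    jar = http.cookiejar.LWPCookieJar(path)
    jar.set_cookie(cookie)
    # A session cookie (discard=True) is only written when ignore_discard=True
    jar.save(ignore_discard=True, ignore_expires=True)

    loaded = http.cookiejar.LWPCookieJar()
    loaded.load(path, ignore_discard=True, ignore_expires=True)
    names = [(c.name, c.value) for c in loaded]

print(names)  # [('session', 'abc123')]
```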

Exception handling

URLError

URLError comes from the error module. Any error raised by the request module can be handled by catching this class.

from urllib import request,error

try:
    response=request.urlopen("https://cuiqingcai.com/intex.htm")
except error.URLError as e:
    print(e.reason) # the reason attribute gives the cause of the error

Not Found

HTTPError

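HTTPError is a subclass of URLError dedicated to HTTP request errors, such as a failed authentication. It has three attributes:

code: the HTTP status code, for example 404 for a page that does not exist or 500 for an internal server error

reason: the cause of the error, as in the parent class

headers: the response headers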

from urllib import request,error

try:
    response=request.urlopen("https://cuiqingcai.com/intex.htm")
except error.HTTPError as e:
    print(e.reason,e.code,e.headers,sep='\n')

Not Found
404
Server: nginx/1.10.3 (Ubuntu)
Date: Mon, 29 Oct 2018 13:29:20 GMT
Content-Type: text/html; charset=UTF-8
Transfer-Encoding: chunked
Connection: close
Vary: Cookie
Expires: Wed, 11 Jan 1984 05:00:00 GMT
Cache-Control: no-cache, must-revalidate, max-age=0
Link: <https://cuiqingcai.com/wp-json/>; rel="https://api.w.org/"

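Because HTTPError is a subclass of URLError, it is better to catch the subclass first and the parent class second.

Execution order:

try succeeds -> else

try fails -> first except does not match -> next except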

from urllib import request,error

try:
    response=request.urlopen("https://cuiqingcai.com/intex.htm")
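except error.HTTPError as e: # catch the subclass first; if it does not match, fall through to the next clause
    print(e.code,e.reason,e.headers,sep="\n")
except error.URLError as e: # then catch the parent class and print the reason
    print(e.reason)
else: # else handles the normal case
    print("Request Successful")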

404
Not Found
Server: nginx/1.10.3 (Ubuntu)
Date: Mon, 29 Oct 2018 14:08:24 GMT
Content-Type: text/html; charset=UTF-8
Transfer-Encoding: chunked
Connection: close
Vary: Cookie
Expires: Wed, 11 Jan 1984 05:00:00 GMT
Cache-Control: no-cache, must-revalidate, max-age=0
Link: <https://cuiqingcai.com/wp-json/>; rel="https://api.w.org/"
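The three paths through try/except/else can be exercised without a network connection. In this sketch the exceptions are constructed by hand, and the helper name describe and the example URL are my own; it only shows which clause ends up handling each case.

```python
from email.message import Message
from urllib import error

def describe(exc):
    """Raise exc (if any) and report which clause handled it."""
    try:
        if exc is not None:
            raise exc
    except error.HTTPError as e:   # subclass first
        return "HTTPError {} {}".format(e.code, e.reason)
    except error.URLError as e:    # then the parent class
        return "URLError {}".format(e.reason)
    else:                          # no exception at all
        return "OK"

# Errors built by hand for illustration; no request is sent
http_404 = error.HTTPError("http://example.com/missing", 404, "Not Found", Message(), None)
no_route = error.URLError("unreachable")

print(describe(http_404))  # HTTPError 404 Not Found
print(describe(no_route))  # URLError unreachable
print(describe(None))      # OK
```

If the two except clauses were swapped, the URLError clause would also catch every HTTPError, and the more specific branch would never run.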

The reason attribute is not always a string; it can also be an object.

from urllib import request,error
import socket

try:
    response=request.urlopen("http://www.baidu.com",timeout=0.01)
except error.URLError as e:
    print(type(e.reason))
    if isinstance(e.reason,socket.timeout): # use isinstance() to check the type for a more precise diagnosis
        print("TIME OUT")

<class 'socket.timeout'>
TIME OUT

Parsing URLs

urlparse()

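Parses a URL and returns a ParseResult object with six parts:

1.scheme: the protocol

2.netloc: the domain name

3.path: the path

4.params: the parameters

5.query: the query string, typically used in GET URLs

6.fragment: the anchor, used to jump directly to a position within the page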

from urllib.parse import urlparse

url="http://www.baidu.com/index.html;user?id=5#comment"
result=urlparse(url)
print(type(result),result,sep="\n")

<class 'urllib.parse.ParseResult'>
ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')

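Parameters:

urlstring: required; the URL to parse

scheme: sets a default protocol. If the URL contains no scheme, this value is used as the default. If the URL does contain a scheme, the parsed scheme is returned even when this argument is set.

allow_fragments: whether the fragment is recognized. If set to False, the fragment is not split out; it is parsed as part of the path, params, or query instead.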
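The effect of the scheme and allow_fragments arguments is easiest to see side by side. A short sketch using the same baidu URL as the other examples:

```python
from urllib.parse import urlparse

# No scheme in the URL, so the scheme argument is used as the default.
# Without a leading '//' the host name ends up in path, not in netloc.
r1 = urlparse("www.baidu.com/index.html;user?id=5#comment", scheme="https")
print(r1.scheme, repr(r1.netloc), r1.path)  # https '' www.baidu.com/index.html

# The URL already has a scheme, so the scheme argument is ignored
r2 = urlparse("http://www.baidu.com/index.html;user?id=5#comment", scheme="https")
print(r2.scheme)  # http

# With allow_fragments=False the '#comment' part stays attached to the query
r3 = urlparse("http://www.baidu.com/index.html;user?id=5#comment", allow_fragments=False)
print(r3.query, repr(r3.fragment))  # id=5#comment ''
```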

from urllib.parse import urlparse

result=urlparse("http://www.baidu.com/index.html;user?id=5#comment")
# ParseResult is a tuple, so its parts can be read either by index or by attribute name
print(result.scheme,result[1],result.netloc,result[2],sep="\n")

http
www.baidu.com
www.baidu.com
/index.html

urlunparse()

Builds a URL. It accepts an iterable, which must have exactly 6 elements.

from urllib.parse import urlunparse

url=["http","www.baidu.com","index.html","user","id=5","comment"]
result=urlunparse(url)
print(result)

http://www.baidu.com/index.html;user?id=5#comment

urlsplit()

Similar to urlparse(), except that params is not parsed separately; it stays merged into path.

from urllib.parse import urlsplit

url="http://www.baidu.com/index.html;user?id=5#comment"
result=urlsplit(url)
print(result)

SplitResult(scheme='http', netloc='www.baidu.com', path='/index.html;user', query='id=5', fragment='comment')

urlunsplit()

Similar to urlunparse(); the difference is that the iterable must have exactly 5 elements.

from urllib.parse import urlunsplit

url=["http","www.baidu.com","index.htmluser","id=5","comment"]
result=urlunsplit(url)
print(result)

http://www.baidu.com/index.htmluser?id=5#comment
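Since the parse and unparse functions are inverses of each other, a quick round trip is a useful sanity check. The sketch below also uses the fact that the results are named tuples to change a single component; the replacement value id=6 is just an example.

```python
from urllib.parse import urlparse, urlsplit, urlunparse, urlunsplit

url = "http://www.baidu.com/index.html;user?id=5#comment"

# Splitting and re-joining gives back the original URL
print(urlunparse(urlparse(url)) == url)  # True
print(urlunsplit(urlsplit(url)) == url)  # True

# The results are named tuples, so _replace() swaps one component
# and geturl() rebuilds the string
changed = urlsplit(url)._replace(query="id=6").geturl()
print(changed)  # http://www.baidu.com/index.html;user?id=6#comment
```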

urljoin()

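Another way to build a URL. The first argument is base_url (the base link) and the second is the new link. The base link contributes only its scheme, netloc, and path: where the new link already has one of these components, the new link's value is used; where it lacks one, the base link's value fills it in.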

from urllib.parse import urljoin

print(urljoin("http://www.baidu","www.taoist.com/index.htmluser"))
print(urljoin("http://www.baidu.com","index.htmluser"))

http://www.baidu/www.taoist.com/index.htmluser
http://www.baidu.com/index.htmluser
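The first result above may look surprising: "www.taoist.com/index.htmluser" has no scheme, so urljoin() treats it as a relative path rather than a host name. A few more cases, sketched with an illustrative base URL of my own choosing, show how the form of the new link decides what is taken from the base:

```python
from urllib.parse import urljoin

base = "http://www.baidu.com/docs/index.html"

# A full URL replaces the base entirely
print(urljoin(base, "https://cuiqingcai.com/FAQ.html"))  # https://cuiqingcai.com/FAQ.html
# A relative path replaces the last segment of the base path
print(urljoin(base, "FAQ.html"))    # http://www.baidu.com/docs/FAQ.html
# A path starting with '/' replaces the whole base path
print(urljoin(base, "/FAQ.html"))   # http://www.baidu.com/FAQ.html
# A link starting with '//' keeps only the base scheme
print(urljoin(base, "//www.taoist.com/index.html"))  # http://www.taoist.com/index.html
```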

urlencode()

Builds GET request parameters: declare a dict holding the parameters, then call urlencode() to serialize it into a GET query string.

from urllib.parse import urlencode

params={
    'name':'taoist',
    'age':'20'
}

params=urlencode(params)
base_url="http://www.baidu.com?"
url=base_url+params
print(url)

http://www.baidu.com?name=taoist&age=20

parse_qs()

Deserialization: given a string of GET request parameters, parse_qs() turns it back into a dict.

from urllib.parse import parse_qs

query='name=germey&age=22'
print(parse_qs(query))

{'name': ['germey'], 'age': ['22']}

parse_qsl()

Converts the parameters into a list of tuples.

from urllib.parse import parse_qsl

query='name=taoist&age=20'
print(parse_qsl(query)) 
# The result is a list of tuples; in each tuple the first item is the parameter name and the second is its value

[('name', 'taoist'), ('age', '20')]
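urlencode(), parse_qs(), and parse_qsl() fit together as a round trip. The sketch below uses a query string with a repeated parameter (the tag values are made up) to show why parse_qs() returns lists and how doseq=True reverses that.

```python
from urllib.parse import parse_qs, parse_qsl, urlencode

query = "name=taoist&age=20&tag=a&tag=b"

pairs = parse_qsl(query)   # list of (name, value) tuples, duplicates kept in order
as_dict = parse_qs(query)  # dict of name -> list of values

print(as_dict["tag"])      # ['a', 'b']

# urlencode() accepts a list of pairs, which restores the original string
print(urlencode(pairs) == query)                # True
# For a dict of lists, doseq=True expands each list into repeated parameters
print(urlencode(as_dict, doseq=True) == query)  # True
```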

quote()

Converts content into URL-encoded form. A URL that carries Chinese parameters may end up garbled; quote() can be used to encode them first.

from urllib.parse import quote

keyword="机器"
url='https://www.baidu.com/s?wd='+quote(keyword)
print(url)

https://www.baidu.com/s?wd=%E6%9C%BA%E5%99%A8

unquote()

Decodes a URL-encoded string.

from urllib.parse import unquote

url="https://www.baidu.com/s?wd=%E6%9C%BA%E5%99%A8"
print(unquote(url))

https://www.baidu.com/s?wd=机器
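Two details of quote() are worth a quick look: which characters it leaves alone, and how spaces are encoded. The sketch below also uses quote_plus() and unquote_plus(), which the notes above do not cover; the sample strings are only illustrative.

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

print(quote("机器"))          # %E6%9C%BA%E5%99%A8
# '/' is left alone by default; pass safe='' to encode it too
print(quote("a/b"))           # a/b
print(quote("a/b", safe=""))  # a%2Fb
# quote() encodes a space as %20, quote_plus() as '+'
print(quote("a b"))           # a%20b
print(quote_plus("a b"))      # a+b

# Decoding reverses the encoding
print(unquote(quote("机器 学习")) == "机器 学习")            # True
print(unquote_plus(quote_plus("机器 学习")) == "机器 学习")  # True
```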

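Analyzing the Robots protocol

The Robots protocol, formally the Robots Exclusion Protocol, tells crawlers which pages may be fetched and which may not. It usually takes the form of a text file named robots.txt placed in the root directory of the site. When a search crawler visits a site, it first checks whether robots.txt exists in the root directory. If it does, the crawler restricts itself to the scope defined there; otherwise it visits every page it can reach.

Sample:

User-agent: *

Disallow: /

Allow: /public/

User-agent: the name of the crawler the rules apply to; * means the rules apply to every crawler

Disallow: the directories that must not be crawled; / means no page may be fetched

Allow: the pages that may be fetched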

robotparser

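urllib's module for parsing the robots protocol. It provides a class, RobotFileParser, whose commonly used methods are:

set_url(): sets the URL of the robots.txt file.

read(): fetches the robots.txt file and analyzes it. This method (or parse()) must be called; otherwise no rules are loaded and can_fetch() returns False for every URL.

parse(): parses robots.txt content. It takes lines from a robots.txt file and interprets them according to the robots.txt syntax.

can_fetch(): takes two arguments, the User-agent and the URL to fetch, and returns True or False depending on whether that crawler may fetch the page.

mtime(): returns the time robots.txt was last fetched and analyzed.

modified(): sets the time robots.txt was last fetched and analyzed to the current time.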

from urllib.robotparser import RobotFileParser

rp=RobotFileParser() # create a RobotFileParser object
rp.set_url("http://www.jianshu.com/robots.txt") # pass in the URL of robots.txt
rp.read() # fetch and parse the rules
# Check against the rules whether the page may be fetched; returns a boolean
print(rp.can_fetch('*',"http://www.jianshu.com/search?q=python&page=1&type=collections"))

False

from urllib.robotparser import RobotFileParser
from urllib import request

url="http://www.jianshu.com/robots.txt"
headers={'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
req=request.Request(url=url,headers=headers)

rp=RobotFileParser()
rp.parse(request.urlopen(req).read().decode("utf-8").split("\n")) # parse the rules line by line
print(rp.can_fetch('*',"http://www.jianshu.com/search?q=python&page=1&type=collections"))

False

# See http://www.robotstxt.org/wc/norobots.html for documentation on how to use the robots.txt file
#
# To ban all spiders from the entire site uncomment the next two lines:
User-agent: *
Disallow: /search
Disallow: /convos/
Disallow: /notes/
Disallow: /admin/
Disallow: /adm/
Disallow: /p/0826cf4692f9
Disallow: /p/d8b31d20a867
Disallow: /collections/*/recommended_authors
Disallow: /trial/*
Disallow: /keyword_notes
Disallow: /stats-2017/*

User-agent: trendkite-akashic-crawler
Request-rate: 1/2 # load 1 page per 2 seconds
Crawl-delay: 60

User-agent: YisouSpider
Request-rate: 1/10 # load 1 page per 10 seconds
Crawl-delay: 60

User-agent: Cliqzbot
Disallow: /

User-agent: Googlebot
Request-rate: 2/1 # load 2 page per 1 seconds
Crawl-delay: 10
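Because parse() accepts plain lines, the rules can be tested offline as well. The following sketch feeds a few rules copied from the listing above into RobotFileParser; the article path /p/some-article is a made-up example, not a real page.

```python
from urllib.robotparser import RobotFileParser

# A few rules copied from the robots.txt listing above
rules = """\
User-agent: *
Disallow: /search
Disallow: /notes/

User-agent: Cliqzbot
Disallow: /
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())  # parse() takes an iterable of lines

# /search is disallowed for every crawler
print(rp.can_fetch("*", "http://www.jianshu.com/search?q=python"))  # False
# A path that matches no Disallow rule is allowed
print(rp.can_fetch("*", "http://www.jianshu.com/p/some-article"))   # True
# Cliqzbot is banned from the whole site
print(rp.can_fetch("Cliqzbot", "http://www.jianshu.com/p/some-article"))  # False
```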

Reference: Cui Qingcai, 《Python3网络爬虫开发实战》 (Python 3 Web Crawler Development in Practice)