Questions:
1. urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None): the meaning of each parameter (see the annotated sketch after this list)
2. urlparse(urlstring[, scheme[, allow_fragments]])
3. JS rendering
The attachment contains the author's annotations and explanations of some of the content and methods.
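For question 1, a minimal annotated sketch of the urlopen() parameters, based on the standard-library documentation (the URL and timeout value are just illustrative):
import ssl
import urllib.request

response = urllib.request.urlopen(
    'https://www.python.org',  # url: a URL string or a Request object
    data=None,       # request body as bytes; supplying it turns the request into a POST
    timeout=10,      # seconds to wait for the connection before raising an error
    # cafile=None / capath=None: CA certificate file / directory for HTTPS
    #   (deprecated in favor of passing an ssl context)
    # cadefault=False: ignored
    context=ssl.create_default_context(),  # ssl.SSLContext controlling HTTPS validation
)
print(response.status)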
Brief description of the approach: when fetching a target resource, some sites have anti-crawling mechanisms. You can pose as a normal user to get around them: construct the identifying parts of the request headers (User-Agent, data, Referer, status), carry cookies, switch proxies, and so on.
headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)', 'Referer': 'http://www.zhihu.com/articles'}
A request is made up of:
the URL
the request method (GET, HEAD, PUT, DELETE, POST, OPTIONS)
the request headers
the request body
req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
This is equivalent to (and preferable to)
response = urllib.request.urlopen('url', timeout=...)
(a combined runnable sketch follows below)
Saving cookies to a file lets you avoid logging in repeatedly.
Timeout settings (timeout) and error handling (the urllib.error exception-handling module) matter mainly to crawler authors.
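Putting the pieces above together, a minimal runnable sketch (the header values and URL are the placeholders used throughout these notes):
from urllib import request, parse

url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
    'Referer': 'http://www.zhihu.com/articles',
}
data = bytes(parse.urlencode({'word': 'hello'}), encoding='utf-8')
req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req, timeout=10)
print(response.status)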
urllib in detail
urllib is Python's built-in HTTP request library, consisting of the four submodules below (a combined sketch follows the list):
urllib.request: the request module
urllib.error: the exception-handling module
urllib.parse: the URL-parsing module
urllib.robotparser: the robots.txt-parsing module
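A minimal sketch that touches each of the four submodules (the target URLs are placeholders):
from urllib import request, parse, error, robotparser

# urllib.parse: build a query string onto a base URL
url = 'http://www.baidu.com/s?' + parse.urlencode({'wd': 'python'})

# urllib.robotparser: check robots.txt before fetching
rp = robotparser.RobotFileParser()
rp.set_url('http://www.baidu.com/robots.txt')
rp.read()
print(rp.can_fetch('*', url))

# urllib.request + urllib.error: fetch and handle failures
try:
    response = request.urlopen(url, timeout=5)
    print(response.status)
except error.URLError as e:
    print(e.reason)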
Changes from Python 2
In Python 2 you would write, for example, urllib2.urlopen('url');
in Python 3 this becomes urllib.request.urlopen('url').
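A common pattern for code that must run on both versions is the guarded import (a sketch, not from the original notes):
try:
    # Python 3
    from urllib.request import urlopen
except ImportError:
    # Python 2
    from urllib2 import urlopen

response = urlopen('http://www.baidu.com')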
Usage of urlopen
urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)
Example 1: a GET request
import urllib.request
response = urllib.request.urlopen('http://www.baidu.com')
print(response.read().decode('utf-8'))
Example 2: a POST request
import urllib.parse
import urllib.request
data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf-8')
response = urllib.request.urlopen('http://httpbin.org/post', data=data)
print(response.read())
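httpbin.org echoes the submitted form fields back as JSON, so you can check what was actually sent (a small sketch):
import json
import urllib.parse
import urllib.request

data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf-8')
response = urllib.request.urlopen('http://httpbin.org/post', data=data)
body = json.loads(response.read().decode('utf-8'))
print(body['form'])  # expected to contain {'word': 'hello'}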
Example 3: setting a timeout
response = urllib.request.urlopen('http://baidu.com', timeout=1)
Example 4: a timeout error
import socket
import urllib.request
import urllib.error
try:
    response = urllib.request.urlopen('http://www.baidu.com', timeout=0.1)
except urllib.error.URLError as e:
    if isinstance(e.reason, socket.timeout):
        print('TimeOut')
The response
Response type:
import urllib.request
response = urllib.request.urlopen('url')
print(type(response))
Status code and response headers:
response = urllib.request.urlopen('url')
print(response.status)
print(response.getheaders())
print(response.getheader('Server'))
Example 2: reading the response body
response = urllib.request.urlopen('https://python.org')
print(response.read().decode('utf-8'))
Request: for sending more complex requests, such as ones with custom headers
Example 1:
import urllib.request
req = urllib.request.Request('https://python.org')
response = urllib.request.urlopen(req)
Example 2:
from urllib import request, parse
url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'xxxxxxx',
    'Host': 'httpbin.org'
}
params = {
    'name': 'feng'
}
data = bytes(parse.urlencode(params), encoding='utf-8')  # encode the submitted parameters
req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
Example 3: adding a header after constructing the Request
url = 'http://httpbin.org/post'
params = {'name': 'feng'}
data = bytes(parse.urlencode(params), encoding='utf-8')
req = request.Request(url=url, data=data, method='POST')
req.add_header('User-Agent', 'xxxxxxx')
response = request.urlopen(req)
Handler usage: proxies
import urllib.request
proxy_handler = urllib.request.ProxyHandler({
    'http': 'http://127.0.0.1:9743',
    'https': 'https://127.0.0.1:9743'
})
opener = urllib.request.build_opener(proxy_handler)
response = opener.open('http://www.baidu.com')
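If you want every subsequent urlopen() call to go through the proxy instead of calling opener.open() each time, the opener can be installed globally (a sketch using the same placeholder proxy address):
import urllib.request

proxy_handler = urllib.request.ProxyHandler({'http': 'http://127.0.0.1:9743'})
opener = urllib.request.build_opener(proxy_handler)
urllib.request.install_opener(opener)  # urlopen() now routes through the proxy
response = urllib.request.urlopen('http://www.baidu.com')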
Using cookies
Example 1: capturing cookies
import http.cookiejar, urllib.request
# declare a CookieJar instance to hold the cookies
cookie = http.cookiejar.CookieJar()
# use HTTPCookieProcessor to build a cookie handler
handler = urllib.request.HTTPCookieProcessor(cookie)
# build an opener from the handler
opener = urllib.request.build_opener(handler)
response = opener.open('url')
for item in cookie:
    print(item.name + '=' + item.value)
Example 2: saving cookies to a txt file in Mozilla (Firefox) format
import http.cookiejar, urllib.request
filename = 'cookie.txt'
cookie = http.cookiejar.MozillaCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('url')
cookie.save(ignore_discard=True, ignore_expires=True)
The official documentation explains:
ignore_discard: save even cookies set to be discarded.
ignore_expires: save even cookies that have expired. The file is overwritten if it already exists.
In other words, ignore_discard saves cookies even when they are marked to be discarded, and ignore_expires writes out cookies that have already expired; in either case an existing file is overwritten. Here we set both to True. After running, the cookies are saved to cookie.txt, and we can inspect its contents.
Example 3: another save format, LWP 2.0
filename = 'cookie.txt'
cookie = http.cookiejar.LWPCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('url')
cookie.save(ignore_discard=True, ignore_expires=True)
Example 4: loading a cookie file
cookie = http.cookiejar.LWPCookieJar()
# read the cookie contents from the file into the jar
cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('url')
Exception handling
from urllib import request, error
try:
    response = request.urlopen('http://aaaaaaa.com/sss.html')
except error.URLError as e:
    print(e.reason)

HTTPError is a subclass of URLError, so catch it first:
from urllib import request, error
try:
    response = request.urlopen('http://aaalckjvz.com/aaa.html')
except error.HTTPError as e:
    print(e.reason, e.code, e.headers, sep='\n')
except error.URLError as e:
    print(e.reason)
else:
    print('Request Successfully')

e.reason is not always a string; it can itself be an exception instance:
import socket
import urllib.request
import urllib.error
try:
    response = urllib.request.urlopen('https://www.baidu.com', timeout=0.01)
except urllib.error.URLError as e:
    print(type(e.reason))
    if isinstance(e.reason, socket.timeout):
        print('Time Out')
URL parsing
urlparse and urlunparse
from urllib.parse import urlparse
result = urlparse('url')
print(type(result), result)
Its parameters (a concrete sketch follows this list):
result = urlparse('url', scheme='https')  # scheme supplies a default protocol when the URL itself omits the 'http://' prefix
result = urlparse('url', scheme='http')
result = urlparse('url', allow_fragments=False)  # if the URL has a query string, the fragment is folded into the query
result = urlparse('url', allow_fragments=False)  # if the URL has no query string, the fragment is folded into the path
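A concrete sketch of these two parameters against a sample URL (the URL is illustrative):
from urllib.parse import urlparse

# scheme= only supplies a default; an explicit scheme in the URL would win
result = urlparse('www.baidu.com/index.html;user?id=5#comment', scheme='https')
print(result.scheme)    # 'https', taken from the default

# allow_fragments=False: '#comment' is folded into the query string
result = urlparse('http://www.baidu.com/index.html?id=5#comment', allow_fragments=False)
print(result.query)     # 'id=5#comment'
print(result.fragment)  # '' (empty)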
urlunparse
Used to assemble a URL from its six components.
Example:
from urllib.parse import urlunparse
data = ['http', 'www.baidu.com', 'index.html', 'user', 'a=1', 'comment']
print(urlunparse(data))  # http://www.baidu.com/index.html;user?a=1#comment
urljoin: another way to join, or combine, URLs
If the second argument is an absolute URL, its scheme and host take precedence and override those of the base URL.
from urllib.parse import urljoin
print(urljoin('http://www.baidu.com', 'FAQ.html'))  # http://www.baidu.com/FAQ.html
print(urljoin('http://www.badiu.com', 'https://www.baidu.com/FAQ.html'))  # https://www.baidu.com/FAQ.html
print(urljoin('http://www.baidu.com/about.html', 'http://www.baidu.com/FAQ.html'))  # http://www.baidu.com/FAQ.html
print(urljoin('www.baidu.com#comment', '?category=2'))  # www.baidu.com?category=2
urlencode: yet another way to build a URL; it converts a dict into GET request parameters
from urllib.parse import urlencode
params = {
    'name': 'feng',
    'age': 18
}
base_url = 'http://www.baidu.com?'
url = base_url + urlencode(params)
print(url)  # http://www.baidu.com?name=feng&age=18