Questions:
1. urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None): the meaning of each parameter (see the annotated sketch after this list)
2. urlparse(urlstring[, scheme[, allow_fragments]])
3. JS rendering
The attachment contains the author's annotations and explanations of some of the content and methods.
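For question 1, a minimal annotated sketch of the urlopen() parameters, based on the standard-library documentation (the URL and timeout value are just illustrative):
import ssl
import urllib.request

response = urllib.request.urlopen(
    'https://www.python.org',  # url: a URL string or a Request object
    data=None,       # request body as bytes; supplying it turns the request into a POST
    timeout=10,      # seconds to wait for the connection before raising an error
    # cafile=None / capath=None: CA certificate file / directory for HTTPS
    #   (deprecated in favor of passing an ssl context)
    # cadefault=False: ignored
    context=ssl.create_default_context(),  # ssl.SSLContext controlling HTTPS validation
)
print(response.status)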
Brief description of the approach: when fetching a target resource, some sites have anti-crawling mechanisms. You can pose as a normal user to get around them: construct the identifying parts of the request headers (User-Agent, data, Referer, status), carry cookies, switch proxies, and so on.
headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)', 'Referer': 'http://www.zhihu.com/articles'}
A request is made up of:
the URL
the request method (GET, HEAD, PUT, DELETE, POST, OPTIONS)
the request headers
the request body
req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
This is equivalent to (and preferable to)
response = urllib.request.urlopen('url', timeout=...)
(a combined runnable sketch follows below)
Saving cookies to a file lets you avoid logging in repeatedly.
Timeout settings (timeout) and error handling (the urllib.error exception-handling module) matter mainly to crawler authors.
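Putting the pieces above together, a minimal runnable sketch (the header values and URL are the placeholders used throughout these notes):
from urllib import request, parse

url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
    'Referer': 'http://www.zhihu.com/articles',
}
data = bytes(parse.urlencode({'word': 'hello'}), encoding='utf-8')
req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req, timeout=10)
print(response.status)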
urllib in detail
urllib is Python's built-in HTTP request library, consisting of the four submodules below (a combined sketch follows the list):
urllib.request: the request module
urllib.error: the exception-handling module
urllib.parse: the URL-parsing module
urllib.robotparser: the robots.txt-parsing module
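A minimal sketch that touches each of the four submodules (the target URLs are placeholders):
from urllib import request, parse, error, robotparser

# urllib.parse: build a query string onto a base URL
url = 'http://www.baidu.com/s?' + parse.urlencode({'wd': 'python'})

# urllib.robotparser: check robots.txt before fetching
rp = robotparser.RobotFileParser()
rp.set_url('http://www.baidu.com/robots.txt')
rp.read()
print(rp.can_fetch('*', url))

# urllib.request + urllib.error: fetch and handle failures
try:
    response = request.urlopen(url, timeout=5)
    print(response.status)
except error.URLError as e:
    print(e.reason)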
Changes from Python 2
In Python 2 you would write, for example, urllib2.urlopen('url');
in Python 3 this becomes urllib.request.urlopen('url').
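A common pattern for code that must run on both versions is the guarded import (a sketch, not from the original notes):
try:
    # Python 3
    from urllib.request import urlopen
except ImportError:
    # Python 2
    from urllib2 import urlopen

response = urlopen('http://www.baidu.com')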
Usage of urlopen
urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)
Example 1: a GET request
import urllib.request
response = urllib.request.urlopen('http://www.baidu.com')
print(response.read().decode('utf-8'))
Example 2: a POST request
import urllib.parse
import urllib.request
data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf-8')
response = urllib.request.urlopen('http://httpbin.org/post', data=data)
print(response.read())
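httpbin.org echoes the submitted form fields back as JSON, so you can check what was actually sent (a small sketch):
import json
import urllib.parse
import urllib.request

data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf-8')
response = urllib.request.urlopen('http://httpbin.org/post', data=data)
body = json.loads(response.read().decode('utf-8'))
print(body['form'])  # expected to contain {'word': 'hello'}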
Example 3: setting a timeout
response = urllib.request.urlopen('http://baidu.com', timeout=1)
Example 4: a timeout error
import socket
import urllib.request
import urllib.error
try:
    response = urllib.request.urlopen('http://www.baidu.com', timeout=0.1)
except urllib.error.URLError as e:
    if isinstance(e.reason, socket.timeout):
        print('TimeOut')
The response
Response type:
import urllib.request
response = urllib.request.urlopen('url')
print(type(response))
Status code and response headers:
response = urllib.request.urlopen('url')
print(response.status)
print(response.getheaders())
print(response.getheader('Server'))
Example 2: reading the response body
response = urllib.request.urlopen('https://python.org')
print(response.read().decode('utf-8'))
Request: for sending more complex requests, such as ones with custom headers
Example 1:
import urllib.request
req = urllib.request.Request('https://python.org')
response = urllib.request.urlopen(req)
Example 2:
from urllib import request, parse
url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'xxxxxxx',
    'Host': 'httpbin.org'
}
params = {
    'name': 'feng'
}
data = bytes(parse.urlencode(params), encoding='utf-8')  # encode the submitted parameters
req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
Example 3: adding a header after constructing the Request
url = 'http://httpbin.org/post'
params = {'name': 'feng'}
data = bytes(parse.urlencode(params), encoding='utf-8')
req = request.Request(url=url, data=data, method='POST')
req.add_header('User-Agent', 'xxxxxxx')
response = request.urlopen(req)
Handler usage: proxies
import urllib.request
proxy_handler = urllib.request.ProxyHandler({
    'http': 'http://127.0.0.1:9743',
    'https': 'https://127.0.0.1:9743'
})
opener = urllib.request.build_opener(proxy_handler)
response = opener.open('http://www.baidu.com')
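If you want every subsequent urlopen() call to go through the proxy instead of calling opener.open() each time, the opener can be installed globally (a sketch using the same placeholder proxy address):
import urllib.request

proxy_handler = urllib.request.ProxyHandler({'http': 'http://127.0.0.1:9743'})
opener = urllib.request.build_opener(proxy_handler)
urllib.request.install_opener(opener)  # urlopen() now routes through the proxy
response = urllib.request.urlopen('http://www.baidu.com')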
Using cookies
Example 1: capturing cookies
import http.cookiejar, urllib.request
# declare a CookieJar instance to hold the cookies
cookie = http.cookiejar.CookieJar()
# use HTTPCookieProcessor to build a cookie handler
handler = urllib.request.HTTPCookieProcessor(cookie)
# build an opener from the handler
opener = urllib.request.build_opener(handler)
response = opener.open('url')
for item in cookie:
    print(item.name + '=' + item.value)
Example 2: saving cookies to a txt file in Mozilla (Firefox) format
import http.cookiejar, urllib.request
filename = 'cookie.txt'
cookie = http.cookiejar.MozillaCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('url')
cookie.save(ignore_discard=True, ignore_expires=True)
The official documentation explains:
ignore_discard: save even cookies set to be discarded.
ignore_expires: save even cookies that have expired. The file is overwritten if it already exists.
In other words, ignore_discard saves cookies even when they are marked to be discarded, and ignore_expires writes out cookies that have already expired; in either case an existing file is overwritten. Here we set both to True. After running, the cookies are saved to cookie.txt, and we can inspect its contents.
Example 3: another save format, LWP 2.0
filename = 'cookie.txt'
cookie = http.cookiejar.LWPCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('url')
cookie.save(ignore_discard=True, ignore_expires=True)
Example 4: loading a cookie file
cookie = http.cookiejar.LWPCookieJar()
# read the cookie contents from the file into the jar
cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('url')
Exception handling
from urllib import request, error
try:
    response = request.urlopen('http://aaaaaaa.com/sss.html')
except error.URLError as e:
    print(e.reason)

HTTPError is a subclass of URLError, so catch it first:
from urllib import request, error
try:
    response = request.urlopen('http://aaalckjvz.com/aaa.html')
except error.HTTPError as e:
    print(e.reason, e.code, e.headers, sep='\n')
except error.URLError as e:
    print(e.reason)
else:
    print('Request Successfully')

e.reason is not always a string; it can itself be an exception instance:
import socket
import urllib.request
import urllib.error
try:
    response = urllib.request.urlopen('https://www.baidu.com', timeout=0.01)
except urllib.error.URLError as e:
    print(type(e.reason))
    if isinstance(e.reason, socket.timeout):
        print('Time Out')
URL parsing
urlparse and urlunparse
from urllib.parse import urlparse
result = urlparse('url')
print(type(result), result)
Its parameters (a concrete sketch follows this list):
result = urlparse('url', scheme='https')  # scheme supplies a default protocol when the URL itself omits the 'http://' prefix
result = urlparse('url', scheme='http')
result = urlparse('url', allow_fragments=False)  # if the URL has a query string, the fragment is folded into the query
result = urlparse('url', allow_fragments=False)  # if the URL has no query string, the fragment is folded into the path
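A concrete sketch of these two parameters against a sample URL (the URL is illustrative):
from urllib.parse import urlparse

# scheme= only supplies a default; an explicit scheme in the URL would win
result = urlparse('www.baidu.com/index.html;user?id=5#comment', scheme='https')
print(result.scheme)    # 'https', taken from the default

# allow_fragments=False: '#comment' is folded into the query string
result = urlparse('http://www.baidu.com/index.html?id=5#comment', allow_fragments=False)
print(result.query)     # 'id=5#comment'
print(result.fragment)  # '' (empty)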
urlunparse
Used to assemble a URL from its six components.
Example:
from urllib.parse import urlunparse
data = ['http', 'www.baidu.com', 'index.html', 'user', 'a=1', 'comment']
print(urlunparse(data))  # http://www.baidu.com/index.html;user?a=1#comment
urljoin: another way to join, or combine, URLs
If the second argument is an absolute URL, its scheme and host take precedence and override those of the base URL.
from urllib.parse import urljoin
print(urljoin('http://www.baidu.com', 'FAQ.html'))  # http://www.baidu.com/FAQ.html
print(urljoin('http://www.badiu.com', 'https://www.baidu.com/FAQ.html'))  # https://www.baidu.com/FAQ.html
print(urljoin('http://www.baidu.com/about.html', 'http://www.baidu.com/FAQ.html'))  # http://www.baidu.com/FAQ.html
print(urljoin('www.baidu.com#comment', '?category=2'))  # www.baidu.com?category=2
urlencode: yet another way to build a URL; it converts a dict into GET request parameters
from urllib.parse import urlencode
params = {
    'name': 'feng',
    'age': 18
}
base_url = 'http://www.baidu.com?'
url = base_url + urlencode(params)
print(url)  # http://www.baidu.com?name=feng&age=18