Python Web Crawler: the urllib Module, Timeout Settings, and Simulating HTTP GET and POST Requests

Python Web Crawler

1. urllib basics

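urlretrieve("URL", "local file path"): downloads the resource at the URL and saves it to the given path
urlcleanup(): removes the temporary files that earlier urlretrieve() calls may have left behind
urlopen(): opens a URL and returns a response object from which the page content can be read
info(): returns the response headers, that is, meta-information about the fetched page
getcode(): returns the HTTP status code of the response; 200 means the request succeeded
geturl(): returns the URL that was actually retrieved, which can differ from the requested one after a redirect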

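Example:

import urllib.request

sock = urllib.request.urlopen("https://www.baidu.com/")
data = sock.read().decode("utf-8")
# Save a copy of the page to a local file
urllib.request.urlretrieve("https://www.baidu.com/", "D:\\python\\1.html")
# Remove temporary files left behind by urlretrieve()
urllib.request.urlcleanup()
print(sock.info())      # response headers
print(sock.getcode())   # HTTP status code, e.g. 200
print(sock.geturl())    # URL actually retrieved
sock.close()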
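
The same calls can be tried without touching a real site. The sketch below is only an illustration under stated assumptions: it writes a hypothetical page.html into a temporary directory and opens it through a file:// URL, so the file names and page content are made up. For file:// URLs there is no HTTP status, so getcode() is left out here.

```python
import tempfile
import urllib.request
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical local page standing in for a real website.
    page = Path(tmp) / "page.html"
    page.write_text("<html><body>hello</body></html>", encoding="utf-8")
    url = page.as_uri()  # file:///.../page.html

    # The with-statement closes the response even if read() raises.
    with urllib.request.urlopen(url) as resp:
        text = resp.read().decode("utf-8")
        content_type = resp.info().get_content_type()
        final_url = resp.geturl()

    # urlretrieve() copies the resource to the given path and returns (path, headers).
    copy_path, _headers = urllib.request.urlretrieve(url, str(Path(tmp) / "copy.html"))
    copied = Path(copy_path).read_text(encoding="utf-8")
    urllib.request.urlcleanup()

print(content_type, len(text), copied == text)
```

Using the response as a context manager avoids having to remember the explicit close() call.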

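2. Timeout settings

Note: some sites respond quickly and others slowly. Setting a timeout caps how long each request may wait, so a slow site cannot stall the crawler and the overall crawl becomes more efficient.

Example: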

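import urllib.request

# Syntax: pass a timeout argument to urlopen(); the value is the number of seconds to wait for a response.
# Note: 0.1 s is very short, so this call raises an exception if the site does not answer in time.
sock = urllib.request.urlopen("https://www.baidu.com/", timeout=0.1)

# A timeout is usually combined with exception handling
for i in range(1, 100):
    try:
        data = urllib.request.urlopen("https://www.baidu.com/", timeout=0.1).read().decode("utf-8")
        print(len(data))
    except Exception as err:
        # Any failure lands here, not only timeouts, so print the actual error
        print("Request failed or timed out:", err)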
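
Catching the specific exceptions makes it easier to tell a timeout from other failures. The following is a minimal sketch, not a definitive implementation: fetch_with_timeout is a hypothetical helper name, and the demo uses a local socket that accepts connections but never answers (an assumption made so the example runs without network access).

```python
import socket
import urllib.error
import urllib.request


def fetch_with_timeout(url, timeout=3.0, retries=2):
    """Return the decoded page, or None if every attempt fails or times out."""
    for attempt in range(1, retries + 1):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.read().decode("utf-8")
        except socket.timeout:
            # Raised directly when the connection is made but the reply is too slow.
            print("attempt %d: timed out after %s s" % (attempt, timeout))
        except urllib.error.URLError as err:
            # Connection-level failures (DNS, refused, connect timeout) arrive wrapped in URLError.
            print("attempt %d: request failed: %s" % (attempt, err.reason))
    return None


# Demo: a local socket that listens but never replies, so every attempt fails.
silent = socket.socket()
silent.bind(("127.0.0.1", 0))   # port 0 lets the OS pick a free port
silent.listen(5)
url = "http://127.0.0.1:%d/" % silent.getsockname()[1]
result = fetch_with_timeout(url, timeout=0.2, retries=2)
silent.close()
print(result)
```

For a real site, a timeout of a few seconds is usually a more realistic starting point than 0.1.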

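3. Simulating HTTP requests: the GET and POST methods

urllib.request.urlopen() sends a GET request by default. To send a POST request, build a urllib.request.Request() object and pass the form data through its data argument (the method argument can also be set to "POST" explicitly).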
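
How the method is chosen can be checked without sending anything, because a Request object reports it through get_method(). A small sketch, using the practice URL and form fields that appear later in the article:

```python
import urllib.parse
import urllib.request

url = "http://www.iqianyue.com/mypost"  # practice URL used later in the article
body = urllib.parse.urlencode({"name": "asdasd", "pass": "1123"}).encode("utf-8")

get_req = urllib.request.Request(url)                              # no data: GET
post_req = urllib.request.Request(url, data=body)                  # data present: POST
explicit = urllib.request.Request(url, data=body, method="POST")   # method stated explicitly

# Nothing is sent here; the objects are only inspected.
print(get_req.get_method(), post_req.get_method(), explicit.get_method())
print(post_req.data)
```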

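(1) Crawling pages automatically with GET requests

Note: this shows how to crawl multiple result pages from a search engine.

First, analyze the URLs of the pages to be crawled. The 360 search engine (so.com) is used for the experiment below:

① Search for python with the 360 search engine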

② Switch between result pages and look for what the URLs of the pages have in common

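③ I put the URLs of the first three pages side by side. It is easy to see that the part in the red box is the same in all of them. In q=python, python is what we searched for, called the keyword, and the 1 in pn=1 is the page number.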

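④ Knowing what pn=1 means, we can change the value after pn= to switch pages and so visit the first 10 result pages. The information we want to crawl here is the list of related searches at the bottom of each page.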

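⑤ Right-click on the page to be crawled, choose "View page source", then press Ctrl + F to open the search box and search for the content you want to extract.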

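⑥ Search for a few more items and a pattern emerges. They all look like data-source="2">python官方文檔</a> (the link text means "official Python documentation"), except that the number is 2, 3, or 4. So the regular expression can be data-source="[234]">(.*?)</a>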
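
The pattern can be tried on a small string before running it against the live page. In the sketch below, the sample fragment is hypothetical; only the data-source pattern and the two link texts come from the article.

```python
import re

# Hypothetical fragment imitating the page source.
sample = (
    '<a href="#" data-source="2">python官方文檔</a>'
    '<a href="#" data-source="3">python教學</a>'
    '<a href="#" data-source="9">not a related search</a>'
)

# [234] accepts exactly one of the digits 2, 3, 4; (.*?) lazily captures the link text.
pattern = re.compile(r'data-source="[234]">(.*?)</a>')
matches = pattern.findall(sample)
print(matches)
```

The lazy quantifier matters: a greedy (.*) would run to the last </a> in the string and swallow several links at once.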

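⑦ Write the code, as follows:

# A simple crawler that automatically fetches the related searches
import re
import urllib.parse
import urllib.request

keywd = "python"
# A URL may only contain ASCII characters, so a keyword with Chinese (or other non-ASCII) characters must be percent-encoded
keywd = urllib.parse.quote(keywd)        # encode the keyword; optional when it is plain ASCII
pat = re.compile('data-source="[234]">(.*?)</a>')
for i in range(1, 11):
    url = "https://www.so.com/s?q=" + keywd + "&pn=" + str(i)
    data = urllib.request.urlopen(url).read().decode("utf-8")
    for x in pat.findall(data):
        print(x)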

# The full output is long, so only a small part is shown here (the suggestions are in Chinese):
Python
python官方文檔
python教學
python學習
python官網
python發音
Python官網
python是什麼
python下載下傳
python例子練手
python3
python能做什麼
Python
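
The query string assembled by hand in the crawler above can also be built with urllib.parse.urlencode, which percent-encodes every value in one step. A minimal sketch, assuming the q and pn parameters observed in the so.com URLs; build_search_url is a hypothetical helper name.

```python
import urllib.parse


def build_search_url(keyword, page):
    """Build a so.com result-page URL from a keyword and a page number."""
    query = urllib.parse.urlencode({"q": keyword, "pn": page})
    return "https://www.so.com/s?" + query


urls = [build_search_url("python", page) for page in range(1, 4)]
for u in urls:
    print(u)

# A non-ASCII keyword is percent-encoded as UTF-8 automatically.
chinese_url = build_search_url("爬蟲", 1)
print(chinese_url)
```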

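(2) Crawling pages automatically with POST requests

Note: some pages only show their content after the user fills in and submits some information. In that case, we use the POST method to crawl the page automatically.

There is a page that offers practice with POST requests (the code below uses http://www.iqianyue.com/mypost).

For this kind of page, pay particular attention to the name attributes of the form's input fields when reading its source code (the form markup appears in the result further down).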

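As the source shows, the values of the name attributes are the keys of the submitted POST data. So we can simulate the POST request in code and obtain the information.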
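
Finding the name attributes can itself be automated. The sketch below is one possible approach, not the method used in this article: it feeds the practice page's form markup (copied from the result shown further down) to the standard library's HTMLParser; InputNameCollector is a hypothetical class name.

```python
from html.parser import HTMLParser


class InputNameCollector(HTMLParser):
    """Collect the non-empty name attributes of <input> tags."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        if tag == "input":
            name = dict(attrs).get("name")
            if name:  # skip the submit button, whose name is empty
                self.names.append(name)


FORM_HTML = """<form action="" method="post">
name:<input name="name" type="text" /><br>
passwd:<input name="pass" type="text" /><br>
<input name="" type="submit" value="submit" />
</form>"""

collector = InputNameCollector()
collector.feed(FORM_HTML)
print(collector.names)
```

The collected names are the keys to use in the dict passed to urlencode().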

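The code is as follows:

import urllib.request
import urllib.parse     # needed for urlencode()

post_url = 'http://www.iqianyue.com/mypost'
# The keys must match the name attributes of the form fields; the body must be bytes
post_data = urllib.parse.urlencode({'name':'asdasd', 'pass':'1123'}).encode("utf-8")
# Supplying data makes this a POST request; method='POST' states it explicitly
req = urllib.request.Request(post_url, data=post_data, method='POST')
data = urllib.request.urlopen(req).read().decode('utf-8')
print(data)

# Result: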
<html>
<head>
<title>Post Test Page</title>
</head>

<body>
<form action="" method="post">
name:<input name="name" type="text" /><br>
passwd:<input name="pass" type="text" /><br>
<input name="" type="submit" value="submit" />
<br />
you input name is:asdasd<br>you input passwd is:1123</body>
</html>

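Summary:

GET method: before crawling, build the URL from the keyword and the page number, then fetch the page
POST method: before crawling, put the form fields in a dict, URL-encode it into the request body, send it with Request(), then read the information from the response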
