Python Crawlers: Writing the Simplest Web Crawler

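I have recently become very interested in Python crawlers, so I am sharing my learning path here. Suggestions are welcome; let's exchange ideas and improve together.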

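1. Development tools

I use Sublime Text 3. It is short but capable (a description men may not be fond of), and that is what won me over. I recommend it, although if your machine is well specced, PyCharm may suit you better.

For setting up a Python development environment in Sublime Text 3, I recommend this blog post: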

[Setting up a Python development environment in Sublime](http://www.cnblogs.com/codefish/p/4806849.html)

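2. Introduction to crawlers

A crawler, as the name suggests, crawls like a bug across the great web that is the internet, which lets us fetch whatever we are after.

Since we are crawling the internet, we need to understand the URL: formally the "Uniform Resource Locator", informally a "link". It has three main parts:

(1) Protocol: for example the http we commonly see in web addresses.

(2) Domain name or IP address: a domain name such as www.baidu.com, or the IP address that the domain name resolves to.

(3) Path: a directory, a file, and so on.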
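To see the three parts in practice, here is a small sketch that uses urllib.parse.urlsplit from the standard library. The function name describe_url and the sample URL are made up for illustration.

```python
from urllib.parse import urlsplit


def describe_url(url):
    """Split a URL into the three parts described above."""
    parts = urlsplit(url)
    return {
        "protocol": parts.scheme,  # e.g. "http"
        "host": parts.hostname,    # domain name or IP address
        "path": parts.path,        # directory or file on the server
    }


# Sample URL chosen for illustration
print(describe_url("http://www.baidu.com/index.html"))
# {'protocol': 'http', 'host': 'www.baidu.com', 'path': '/index.html'}
```

A real URL can carry more pieces (port, query string, fragment); urlsplit exposes those as well, but the three above are enough for this article.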

3. Writing the simplest crawler with urllib

(1) About urllib

| Module | Description |
| --- | --- |
| urllib.error | Exception classes raised by urllib.request. |
| urllib.parse | Parse URLs into or assemble them from components. |
| urllib.request | Extensible library for opening URLs. |
| urllib.response | Response classes used by urllib. |
| urllib.robotparser | Load a robots.txt file and answer questions about fetchability of other URLs. |
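Of these modules, urllib.robotparser is probably the least familiar, so here is a minimal sketch of how it might be used. To keep it runnable without a network connection, it parses a hypothetical robots.txt held in a string; the file content, the user agent name "my-crawler", and the example.com URLs are all assumptions made for illustration.

```python
from urllib import robotparser

# Hypothetical robots.txt content, used instead of a real site's file
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt):
    """Parse robots.txt text that has already been downloaded."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)
# "my-crawler" is a made-up user agent name
print(parser.can_fetch("my-crawler", "http://example.com/index.html"))         # True
print(parser.can_fetch("my-crawler", "http://example.com/private/data.html"))  # False
```

Against a live site you would normally call set_url() and read() on the parser instead of passing the lines in yourself.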

(2) Writing the simplest crawler

The Baidu home page is clean and simple, which makes it a good first target.

The crawler code:

    from urllib import request


    def visit_baidu():
        url = "http://www.baidu.com"
        # Open the URL
        response = request.urlopen(url)
        # Read the response body (bytes)
        html = response.read()
        # Decode the bytes as UTF-8
        html = html.decode("utf-8")
        print(html)


    if __name__ == '__main__':
        visit_baidu()

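Running it prints the HTML source of the Baidu home page.

You can right-click a blank area of the Baidu home page and choose Inspect to compare the page source with the program's output.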
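The version above hard-codes UTF-8 and never closes the response. A slightly more defensive variant might look like the sketch below. The function name fetch_html, the 10-second timeout, and the UTF-8 fallback are my assumptions rather than anything urllib requires.

```python
from urllib import request


def fetch_html(url, timeout=10, fallback_charset="utf-8"):
    """Download url and decode the body with the charset the server declares."""
    # The with-block closes the response even if reading fails
    with request.urlopen(url, timeout=timeout) as response:
        raw = response.read()
        # get_content_charset() returns None when no charset is declared
        charset = response.headers.get_content_charset() or fallback_charset
    # errors="replace" avoids a crash on stray bytes, at the cost of fidelity
    return raw.decode(charset, errors="replace")


# Usage (needs network access):
# print(fetch_html("http://www.baidu.com"))
```

The timeout matters more than it looks: without one, a server that never answers can leave the crawler hanging indefinitely.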

The request module can also create a Request object, which can then be opened with urlopen.

The code:

    from urllib import request


    def visit_baidu():
        # Create a Request object
        req = request.Request('http://www.baidu.com')
        # Open the Request object
        response = request.urlopen(req)
        # Read and decode the response
        html = response.read()
        html = html.decode('utf-8')
        print(html)


    visit_baidu()

The output is the same as before.
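One reason to build a Request object yourself is that it can carry extra information, such as request headers. The sketch below only constructs the object and inspects it, so it runs offline. The helper name build_request and the "my-crawler/0.1" User-Agent string are invented for this example.

```python
from urllib import request


def build_request(url, user_agent="my-crawler/0.1"):
    """Create a Request that carries a custom User-Agent header."""
    # "my-crawler/0.1" is a made-up identifier; choose your own
    return request.Request(url, headers={"User-Agent": user_agent})


req = build_request("http://www.baidu.com")
print(req.full_url)                  # http://www.baidu.com
print(req.get_method())              # GET
print(req.get_header("User-agent"))  # my-crawler/0.1 (names are stored capitalized)
```

Passing req to request.urlopen() would then send the request with that header attached.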

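(3) Error handling

Errors are handled through the urllib.error module. The two main exceptions are URLError and HTTPError. HTTPError is a subclass of URLError, so an HTTPError can also be caught by an except clause for URLError.

An HTTPError carries the HTTP status code in its code attribute.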
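The subclass relationship is easy to check without touching the network. In the sketch below the HTTPError is built by hand purely for illustration (in real code urlopen raises it for you), and the helper name caught_by_urlerror and the example.com URL are made up.

```python
from urllib import error

# HTTPError inherits from URLError
print(issubclass(error.HTTPError, error.URLError))  # True


def caught_by_urlerror(exc):
    """Raise exc and report whether an except URLError clause catches it."""
    try:
        raise exc
    except error.URLError:
        return True
    except Exception:
        return False


# Hand-built error for illustration only: (url, code, msg, hdrs, fp)
sample = error.HTTPError("http://example.com/zzz", 404, "Not Found", None, None)
print(sample.code)                 # 404
print(caught_by_urlerror(sample))  # True
```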

The code for handling HTTPError:

    from urllib import error, request


    def err():
        url = "https://segmentfault.com/zzz"
        req = request.Request(url)
        try:
            response = request.urlopen(req)
            html = response.read().decode("utf-8")
            print(html)
        except error.HTTPError as e:
            # e.code is the HTTP status code, e.g. 404
            print(e.code)


    err()

Running it prints 404, the HTTP status code of the error. You can look up the details of the individual status codes yourself.
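If you would rather not search for every code, the standard library's http.HTTPStatus enum already knows the standard reason phrases. A small sketch follows; the function name explain_status and the "Unknown status" fallback text are my own choices.

```python
from http import HTTPStatus


def explain_status(code):
    """Return the standard reason phrase for an HTTP status code."""
    try:
        return HTTPStatus(code).phrase
    except ValueError:
        # Raised for codes the standard library does not define
        return "Unknown status"


print(explain_status(404))  # Not Found
print(explain_status(200))  # OK
```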

A URLError carries the cause of the failure in its reason attribute.

The code for handling URLError:

    # Same pattern as err() above; only the URL and the except clause change
    url = "https://segmentf.com/"
    try:
        request.urlopen(url)
    except error.URLError as e:
        print(e.reason)
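One detail worth knowing is that reason is not always a string: it can also be another exception object, such as an OSError from the socket layer. The sketch below normalizes both cases into text. The helper name reason_text is hypothetical, and the two errors are built by hand so the example runs offline.

```python
from urllib import error


def reason_text(exc):
    """Return a printable description of why a URLError was raised."""
    reason = exc.reason
    # reason may be a plain string or another exception such as an OSError
    if isinstance(reason, BaseException):
        return f"{type(reason).__name__}: {reason}"
    return str(reason)


# Hand-built errors, for illustration only (no network access)
print(reason_text(error.URLError("unknown url type")))
# unknown url type
print(reason_text(error.URLError(OSError("connection refused"))))
# OSError: connection refused
```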

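Since the point is to handle errors, it is best to handle both exceptions in the code; the more specific the handling, the clearer the result. Note that HTTPError is a subclass of URLError, so the except clause for HTTPError must come before the one for URLError. Otherwise the URLError clause catches everything, and a 404, for example, is printed as Not Found.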

    # Method 1: catch HTTPError first, then URLError
    print(e.reason)

You can change the URL to see how the different errors are reported.
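Putting the two handlers together, a minimal sketch of the ordering described above could look like this. It is not the author's original listing; the function name try_open, the injectable opener parameter (there so the example can be tried without a network), and the returned message formats are all assumptions.

```python
from urllib import error, request


def try_open(url, opener=request.urlopen):
    """Open url and describe the outcome; opener is injectable for testing."""
    try:
        with opener(url, timeout=10) as response:
            return f"ok: {len(response.read())} bytes"
    except error.HTTPError as e:
        # Checked first because HTTPError is a subclass of URLError
        return f"HTTP error: {e.code}"
    except error.URLError as e:
        return f"URL error: {e.reason}"


# Usage (needs network access):
# for url in ("http://www.baidu.com", "https://segmentfault.com/zzz"):
#     print(url, "->", try_open(url))
```

If the two except clauses were swapped, the HTTPError branch would never run, because the URLError clause would match first.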

Author: xiaomi

Source: 51cto
