Notes on parsing web pages with lxml

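After fetching a web page with Python's urllib2 library (Python 2 only; urllib.request in Python 3), use lxml to extract data from the fetched page.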

1. Import the package: from lxml import etree

2. page = etree.HTML(html) or page = etree.HTML(html.decode('utf-8'))

3. Run xpath on the Element object (page) to filter nodes. It returns a list (for element queries, the items are Elements too)

Example:

<html>
  <head>
    <meta name="content-type" content="text/html; charset=utf-8" />
    <title>示例</title>
  </head>
  <body>
    <h1 class="cl1">測試内容一</h1>
    <p style="font-size: 200%">測試内容二</p>
    測試内容三
    <p>測試内容四</p>
    <a href="http://www.baidu.com/" target="_blank">百度</a>
    <a href="http://www.google.com" target="_blank">谷歌</a>
    <a href="http://www.ali.com" target="_blank">阿裡</a>
    <a href="http://game.tencent.com" target="_blank"><img src='www.baidu.com'/>騰訊</a>
    <a href="http://game.sina.com" target="_blank"><img src='www.baidu.com'/>新浪</a>
    <a href="http://www.huawei.com" target="_blank"><img src='www.baidu.com'/>華為</a>
    <a href="http://www.xiaomi.com" target="_blank"><img src='www.baidu.com'/>小米</a>

  </body>
</html>

Parsing the HTML

from lxml import etree
page = etree.HTML(html.decode('utf-8'))
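The parsing step above can be tried without a network fetch. This is a minimal sketch for Python 3: the variable raw is a hypothetical stand-in for the bytes a download would return, and the tiny document is a cut-down version of the sample page, not the real thing.

```python
from lxml import etree

# Hypothetical stand-in for the bytes returned by a network fetch.
raw = '<html><head><title>示例</title></head><body><h1 class="cl1">測試内容一</h1></body></html>'.encode('utf-8')

# Decode explicitly so the parser receives text instead of guessing the encoding.
page = etree.HTML(raw.decode('utf-8'))

root_tag = page.tag                         # etree.HTML returns the root <html> element
title_text = page.xpath('//title/text()')   # text() queries return a list of strings

print(root_tag, title_text)
```

Decoding first matters mostly when the page does not declare its encoding; if it does, passing the bytes straight to etree.HTML usually works as well.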

Getting tags

# a tags
tags = page.xpath(u'/html/body/a')
print(tags)
# every a directly under body, which is directly under html
# result: [<Element a at 0x34b1f08>, ...]

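/html is the root of the whole page

/html/body/a selects every <a> tag that is a direct child of the page's <body> tag

//a selects every a tag in the document. In this example it gives the same result as above (all a tags sit directly under body and nowhere else)

/descendant::a is equivalent to //a. The descendant:: axis stands for any number of intermediate levels, which is what the abbreviated form "//" expresses

/html/body/*/a selects every a tag at the second level under body, whatever its parent tag is. '*' matches any element name

Tags inside head are reached through head instead of body, for example //html/head/* or //html/head/title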
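The path forms listed above can be compared side by side. The following is only a sketch for Python 3 on a shortened version of the sample page. The <div> wrapper around the third link is an assumption added here so that /html/body/*/a has something to match; the original sample has no such wrapper.

```python
from lxml import etree

# Shortened sample page; the <div> wrapper is an assumption added for this sketch.
html = """
<html>
  <head><title>示例</title></head>
  <body>
    <h1 class="cl1">測試内容一</h1>
    <a href="http://www.baidu.com/">百度</a>
    <a href="http://www.google.com">谷歌</a>
    <div><a href="http://www.ali.com">阿裡</a></div>
  </body>
</html>
"""
page = etree.HTML(html)

direct = page.xpath('/html/body/a')          # a elements that are children of body
anywhere = page.xpath('//a')                 # a elements at any depth
descendants = page.xpath('/descendant::a')   # same nodes as //a
nested = page.xpath('/html/body/*/a')        # a elements one level below body's children
titles = page.xpath('/html/head/title')      # elements inside head

print(len(direct), len(anywhere), len(descendants), len(nested))
print(titles[0].text)
```

With the wrapper in place, /html/body/a and //a no longer return the same list, which makes the difference between a child step and a descendant step visible.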

Getting node (tag) attributes:

for taga in tags:
    print(taga.attrib)
    # all attributes: {'target': '_blank', 'href': 'http://www.ali.com'}
    print(taga.get('href'))
    # a single attribute: http://www.ali.com
    print(taga.text)
    # text: 阿裡
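A self-contained version of the attribute access above, as a sketch for Python 3. It uses a single link from the sample page; the lookup of a title attribute is added here only to show what happens when an attribute is absent.

```python
from lxml import etree

html = '<html><body><a href="http://www.ali.com" target="_blank">阿裡</a></body></html>'
page = etree.HTML(html)

taga = page.xpath('/html/body/a')[0]

attrs = dict(taga.attrib)           # copy all attributes into a plain dict
href = taga.get('href')             # one attribute by name
missing = taga.get('title')         # absent attribute gives None
fallback = taga.get('title', '')    # or a default of your choice
text = taga.text                    # text directly inside the tag

print(attrs, href, missing, repr(fallback), text)
```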

Filtering tags by attribute

# Go straight to <h1 class="cl1">測試内容一</h1>
hs = page.xpath("//h1[@class='cl1']")
for h in hs:
    print(h.values())
    print(h.text)
    # Output:
    # ['cl1']
    # 測試内容一
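Attribute predicates come in a few common shapes. The sketch below (Python 3, shortened sample page) shows an exact match, a test for mere presence of an attribute, and a substring match with the XPath 1.0 contains() function. The last two are not in the notes above and are included as closely related variants.

```python
from lxml import etree

html = """<html><body>
<h1 class="cl1">測試内容一</h1>
<p style="font-size: 200%">測試内容二</p>
<p>測試内容四</p>
<a href="http://www.baidu.com/" target="_blank">百度</a>
<a href="http://game.sina.com" target="_blank">新浪</a>
</body></html>"""
page = etree.HTML(html)

headings = page.xpath("//h1[@class='cl1']")                # attribute equals a value
styled = page.xpath("//p[@style]")                         # attribute is present at all
game_links = page.xpath("//a[contains(@href, 'game.')]")   # attribute contains a substring

print(headings[0].values(), headings[0].text)
print([p.text for p in styled])
print([a.get('href') for a in game_links])
```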

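The attribute can be @name, @id, @value, @src, @href...

If there is no attribute to filter on, you can also use text() (the content inside the tag) and position() (the node's position)

Example:

a[position()=2] selects the second a node and can be abbreviated to a[2]

Pay attention to the order of the numeric position and the filter condition

/html/body/a[5][@name='hello'] takes the fifth a tag under body and keeps it only if its name is hello; otherwise the result is empty

/html/body/a[@name='hello'][5] takes the fifth a tag under body among those whose name is hello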
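The effect of predicate order is easiest to see on a small list. This sketch (Python 3) uses a hypothetical four-link page made up for the purpose, with index 2 instead of 5 to keep it short. Note that absolute paths start at /html, since html is the root element of the parsed page.

```python
from lxml import etree

# Hypothetical page: four links, two of them named "hello".
html = """<html><body>
<a name="x" href="#1">one</a>
<a name="hello" href="#2">two</a>
<a name="hello" href="#3">three</a>
<a name="x" href="#4">four</a>
</body></html>"""
page = etree.HTML(html)

second = page.xpath('/html/body/a[position()=2]')
second_short = page.xpath('/html/body/a[2]')          # abbreviated form of the line above

# Position first: pick the 2nd a, then keep it only if name is hello.
pos_then_attr = page.xpath("/html/body/a[2][@name='hello']")
# Attribute first: among the a tags named hello, pick the 2nd.
attr_then_pos = page.xpath("/html/body/a[@name='hello'][2]")
# The 4th a is not named hello, so nothing is left.
empty = page.xpath("/html/body/a[4][@name='hello']")

# text() filters on the tag's own text content.
by_text = page.xpath("/html/body/a[text()='four']")

print(second[0].get('href'), pos_then_attr[0].get('href'),
      attr_then_pos[0].get('href'), empty, by_text[0].get('href'))
```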

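preceding-sibling:: and following-sibling::

The preceding-sibling:: axis selects the nodes at the same level that come before the current node (add [1] for only the nearest one)

The following-sibling:: axis selects the nodes at the same level that come after the current node (add [1] for only the nearest one)

Example

//body//following-sibling::a  the a tags that follow, at the same level, any node under body

//body/h1/preceding-sibling::*  every tag at the same level as h1 that comes before it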
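A short sketch of the two sibling axes (Python 3). The page is a hypothetical mix of h1, p and a tags under body, chosen so that each query returns something different. The [1] predicate is added here to show how to get only the nearest sibling.

```python
from lxml import etree

# Hypothetical page: five sibling elements under body.
html = """<html><body>
<h1 class="cl1">heading</h1>
<p>first</p>
<a href="#1">one</a>
<p>second</p>
<a href="#2">two</a>
</body></html>"""
page = etree.HTML(html)

# Every sibling element before the second p, returned in document order.
before_second_p = page.xpath('//body/p[2]/preceding-sibling::*')
# Every a that comes after h1 at the same level.
after_h1_links = page.xpath('//body/h1/following-sibling::a')
# [1] counts outward from the current node, so this is the nearest sibling only.
next_after_h1 = page.xpath('//body/h1/following-sibling::*[1]')
prev_before_second_p = page.xpath('//body/p[2]/preceding-sibling::*[1]')

print([e.tag for e in before_second_p])
print([a.get('href') for a in after_h1_links])
print(next_after_h1[0].text, prev_before_second_p[0].get('href'))
```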

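Getting special content with tail

<a href="http://game.tencent.com" target="_blank"><img src='www.baidu.com'/>騰訊</a>

The text '騰訊' sits between <img/> and </a>, so taga.text does not return it (it is None here). The text belongs to the <img> child as its tail, so read it from that child, for example taga[0].tail.strip()

tail is the text that follows an element's closing tag up to the next tag; here that is the content between <img/> and </a>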
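The sketch below (Python 3) runs the Tencent link through lxml to show where the text ends up. In lxml's tree model the trailing text is stored on the child <img> element as its tail, so that is where the sketch reads it from. itertext() and the XPath string() function are shown as two alternatives that collect all text inside the link without caring which node holds it.

```python
from lxml import etree

html = """<html><body><a href="http://game.tencent.com" target="_blank"><img src='www.baidu.com'/>騰訊</a></body></html>"""
page = etree.HTML(html)

taga = page.xpath('//a')[0]
img = taga[0]                                  # first child of the link

a_text = taga.text                             # None: no text before the first child
img_tail = img.tail.strip()                    # text right after <img/>
all_text = ''.join(taga.itertext()).strip()    # every text fragment inside the link
xpath_string = taga.xpath('string()').strip()  # same idea, done in XPath

print(a_text, img_tail, all_text, xpath_string)
```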

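If the content of script and style tags gets in the way of parsing, or the page is very irregular, you can use the lxml.html.clean module (recent lxml releases ship it as the separate lxml_html_clean package). It provides a Cleaner class for cleaning up HTML pages. It can remove embedded or script content, special tags, CSS styles, comments and more.

from lxml.html.clean import Cleaner

cleaner = Cleaner(style=True, scripts=True, page_structure=False, safe_attrs_only=False)

print(cleaner.clean_html(html))

Note: setting page_structure and safe_attrs_only to False keeps the page intact. Otherwise the Cleaner also strips the structural tags of your HTML and the attributes inside tags. Use the Cleaner class with great care; it is easy to remove more than you meant to.

To ignore case:

page = etree.HTML(html)
keyword_tag = page.xpath("//meta[translate(@name,'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz')='keywords']")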
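A runnable sketch of the case-insensitive match (Python 3). The two meta tags are hypothetical and exist only to give the query mixed-case names to work on. XPath 1.0 has no lower-case function, so translate() is used to map each upper-case letter to its lower-case counterpart before comparing.

```python
import string

from lxml import etree

# Hypothetical head section with mixed-case name attributes.
html = """<html><head>
<meta name="Keywords" content="lxml, xpath"/>
<meta name="DESCRIPTION" content="notes"/>
</head><body></body></html>"""
page = etree.HTML(html)

# Build the two alphabets from the standard library to avoid typing mistakes.
query = "//meta[translate(@name, '{0}', '{1}')='keywords']".format(
    string.ascii_uppercase, string.ascii_lowercase)

keyword_tag = page.xpath(query)                    # matches name="Keywords"
exact = page.xpath("//meta[@name='keywords']")     # plain comparison is case-sensitive

print([m.get('content') for m in keyword_tag], exact)
```

Only ASCII letters are folded this way; attribute values with other scripts or accented letters would need their own pairs in the two translate() arguments.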
