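Study notes
The lxml module
- About lxml
The lxml parsing module lets you use XPath expressions to match content in an HTML string.
- Installing the lxml parsing library
Open cmd and enter the following command to install it: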
pip install lxml
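A quick way to confirm that the installation worked is to import the library and print its version. This is only a minimal check; the version tuple printed depends on what pip installed.

```python
from lxml import etree

# LXML_VERSION is a tuple of integers describing the installed lxml release.
print(etree.LXML_VERSION)
```

If the import fails with ModuleNotFoundError, pip most likely installed lxml into a different Python environment than the one running the script.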
- Syntax
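from lxml import etree
# html is the page source as a string, obtained for example with:
# html = requests.get(url, headers = headers).content.decode('utf-8')
# Create the parse object
parse_html = etree.HTML(html)
# Call xpath on the parse object
r_list = parse_html.xpath('xpath expression')
# An expression that selects nodes always returns a list (empty if nothing matches);
# XPath functions such as count() or string() return a number or a string instead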
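The return types of xpath() can be checked on a small document. The sketch below uses a made-up snippet (the `sample` string and its class values are assumptions for illustration, not part of these notes) and shows that node-selecting expressions give a list, while XPath functions give a scalar.

```python
from lxml import etree

# Hypothetical sample markup, used only for this illustration.
sample = "<ul><li class='a'>one</li><li class='b'>two</li></ul>"
doc = etree.HTML(sample)

# etree.HTML() wraps a fragment in <html><body>, so the root element is 'html'.
root_tag = doc.tag

# Node-selecting expressions return a list, which is empty when nothing matches.
items = doc.xpath('//li/text()')
missing = doc.xpath('//table')

# XPath functions return scalars rather than lists.
count = doc.xpath('count(//li)')                # float
first_text = doc.xpath('string(//li)')          # string value of the first li
has_b = doc.xpath('boolean(//li[@class="b"])')  # bool

print(root_tag, items, missing, count, first_text, has_b)
```

Because a node-selecting expression that matches nothing returns an empty list rather than raising an error, it is worth checking the length of the result before indexing into it.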
- An example
For the HTML document below, we use XPath to get all li node objects, the class attribute values of all name nodes, and the text content of all food nodes:
<ol>
<li class="Ra01">
<name class = 'Bunny01'>小黃</name>
<age>8</age>
<food>胡蘿蔔</food>
</li>
<li class="Ra01">
<name class = 'Bunny02'>大白</name>
<age>9</age>
<food>白菜</food>
</li>
<li class="Ra02">
<name class = 'Bunny03'>奧尼爾</name>
<age>20</age>
<food>提草</food>
</li>
<li class="Ra03">
<name class = 'Bunny03'>王子</name>
<age>30</age>
<food>進口提草</food>
</li>
</ol>
The complete script:
# -*- coding: utf-8 -*-
from lxml import etree
html = \
"""
<ol>
<li class="Ra01">
<name class = 'Bunny01'>小黃</name>
<age>8</age>
<food>胡蘿蔔</food>
</li>
<li class="Ra01">
<name class = 'Bunny02'>大白</name>
<age>9</age>
<food>白菜</food>
</li>
<li class="Ra02">
<name class = 'Bunny03'>奧尼爾</name>
<age>20</age>
<food>提草</food>
</li>
<li class="Ra03">
<name class = 'Bunny03'>王子</name>
<age>30</age>
<food>進口提草</food>
</li>
</ol>
"""
parse_html = etree.HTML(html)
# Get all li node objects
li_list = parse_html.xpath('//ol/li')
print(li_list)
print('-'*20)
# Get the class attribute value of every name node
name_list = parse_html.xpath('//ol/li/name/@class')
print(name_list)
print('-'*20)
# Get the text content of every food node
food_list = parse_html.xpath('//ol/li/food/text()')
print(food_list)
Output (the element addresses vary from run to run):
[<Element li at 0xad2d7371c8>, <Element li at 0xad2d737448>, <Element li at 0xad2d737288>, <Element li at 0xad2d737488>]
--------------------
['Bunny01', 'Bunny02', 'Bunny03', 'Bunny03']
--------------------
['胡蘿蔔', '白菜', '提草', '進口提草']
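The li elements in the first result are themselves objects that accept further xpath() calls. As a possible next step, the sketch below loops over them with relative expressions to build one record per li, and uses a predicate to filter by class. It works on a shortened copy of the document above (two of the four entries); the dictionary keys and the `records` and `ra01_names` variable names are assumptions made for this illustration.

```python
from lxml import etree

# Shortened copy of the document above (first and third entries only).
html = """
<ol>
<li class="Ra01">
<name class='Bunny01'>小黃</name>
<age>8</age>
<food>胡蘿蔔</food>
</li>
<li class="Ra02">
<name class='Bunny03'>奧尼爾</name>
<age>20</age>
<food>提草</food>
</li>
</ol>
"""
parse_html = etree.HTML(html)

records = []
for li in parse_html.xpath('//ol/li'):
    # A leading './' makes the expression relative to this li element.
    records.append({
        'group': li.get('class'),
        'name': li.xpath('./name/text()')[0],
        'age': int(li.xpath('./age/text()')[0]),
        'food': li.xpath('./food/text()')[0],
    })

# A predicate in square brackets keeps only li elements with the given class.
ra01_names = parse_html.xpath('//li[@class="Ra01"]/name/text()')

print(records)
print(ra01_names)
```

Starting the inner expressions with './' matters: an expression beginning with '//' would search the whole document again instead of only the current li.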