
Web Scraping with Python (Part 5): The lxml Module

Study notes

The lxml module

  • About lxml

The lxml parsing module can use XPath expressions to match content in an HTML string.

  • Installing the lxml parsing library

Open cmd and enter the following command to install it:

pip install lxml      
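
To confirm that the installation worked, a quick check along these lines can be run in Python. It is only a sketch; the version printed depends on whichever release pip installed.

```python
from lxml import etree

# LXML_VERSION is a tuple of integers describing the installed lxml release
version = etree.LXML_VERSION
print("lxml version:", ".".join(str(part) for part in version))
```

If the import fails with ModuleNotFoundError, pip most likely installed lxml into a different Python environment than the one running the script.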
  • Syntax
from lxml import etree

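# html is the page source as a string, for example:
# html = requests.get(url, headers=headers).content.decode('utf-8')
# Create the parse object
parse_html = etree.HTML(html)
# Call xpath on the parse object
r_list = parse_html.xpath('xpath expression')
# Expressions that select nodes, attributes, or text always return a list
# (empty when nothing matches); functions such as count() return a number instead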
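
The return type of xpath() is worth checking before indexing into the result. The sketch below uses a small made-up HTML string (not taken from these notes) to show a matching query, a query that matches nothing, and an XPath function call.

```python
from lxml import etree

# Hypothetical sample markup, used only for illustration
html = "<ul><li>a</li><li>b</li></ul>"
parse_html = etree.HTML(html)

# Selecting text nodes returns a list of strings
items = parse_html.xpath('//li/text()')

# A query with no matches returns an empty list, not None
missing = parse_html.xpath('//table')

# An XPath function such as count() returns a float rather than a list
total = parse_html.xpath('count(//li)')

print(items, missing, total)
```

Because a query with no matches yields an empty list, code that reads r_list[0] should first check that the list is not empty.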
  • An example

For the HTML document below, we use XPath to get all li node objects, the class attribute values of all name nodes, and the text content of all food nodes:

<ol>
      <li class="Ra01">
        <name class = 'Bunny01'>小黃</name>
        <age>8</age>
        <food>胡蘿蔔</food>
      </li>
      <li class="Ra01">
        <name class = 'Bunny02'>大白</name>
        <age>9</age>
        <food>白菜</food>
      </li>
      <li class="Ra02">
        <name class = 'Bunny03'>奧尼爾</name>
        <age>20</age>
        <food>提草</food>
      </li>
      <li class="Ra03">
        <name class = 'Bunny03'>王子</name>
        <age>30</age>
        <food>進口提草</food>
      </li>

  </ol>      
# -*- coding: utf-8 -*-

from lxml import etree

html = \
"""
    <ol>
        <li class="Ra01">
            <name class = 'Bunny01'>小黃</name>
            <age>8</age>
            <food>胡蘿蔔</food>
        </li>
        <li class="Ra01">
            <name class = 'Bunny02'>大白</name>
            <age>9</age>
            <food>白菜</food>
        </li>
        <li class="Ra02">
            <name class = 'Bunny03'>奧尼爾</name>
            <age>20</age>
            <food>提草</food>
        </li>
        <li class="Ra03">
            <name class = 'Bunny03'>王子</name>
            <age>30</age>
            <food>進口提草</food>
        </li>

    </ol>
"""

parse_html = etree.HTML(html)
# Get all li node objects
li_list = parse_html.xpath('//ol/li')
print(li_list)
print('-'*20)

# Get the class attribute value of every name node
name_list = parse_html.xpath('//ol/li/name/@class')
print(name_list)
print('-'*20)

# Get the text content of every food node
food_list = parse_html.xpath('//ol/li/food/text()')
print(food_list)      

Running the script prints the following (the element addresses differ from run to run):

[<Element li at 0xad2d7371c8>, <Element li at 0xad2d737448>, <Element li at 0xad2d737288>, <Element li at 0xad2d737488>]
--------------------
['Bunny01', 'Bunny02', 'Bunny03', 'Bunny03']
--------------------
['胡蘿蔔', '白菜', '提草', '進口提草']
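
The li elements returned by the first query are themselves parse objects, so xpath can be called on each one to keep a rabbit's name, age, and food together. The following is a minimal sketch of that idea on a shortened copy of the document above; the rabbits list and its dictionary keys are names chosen here for illustration.

```python
from lxml import etree

# Shortened copy of the example document (two entries only)
html = """
<ol>
    <li class="Ra01">
        <name class="Bunny01">小黃</name>
        <age>8</age>
        <food>胡蘿蔔</food>
    </li>
    <li class="Ra02">
        <name class="Bunny03">奧尼爾</name>
        <age>20</age>
        <food>提草</food>
    </li>
</ol>
"""

parse_html = etree.HTML(html)

rabbits = []
for li in parse_html.xpath('//ol/li'):
    # A leading "./" makes the expression relative to the current li element
    name = li.xpath('./name/text()')
    age = li.xpath('./age/text()')
    food = li.xpath('./food/text()')
    rabbits.append({
        'group': li.get('class'),  # attribute of the li element itself
        'name': name[0] if name else None,
        'age': int(age[0]) if age else None,
        'food': food[0] if food else None,
    })

for rabbit in rabbits:
    print(rabbit)
```

Querying relative to each li avoids having to line up three separate flat lists by index, which can go wrong if one entry is missing a field.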