
Using etree.HTML() and etree.tostring() in lxml

etree.HTML(): parses HTML text into an object that can be queried with XPath, automatically repairing malformed markup along the way.

etree.tostring(): serializes the repaired result; by default the return type is bytes.
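As a small sketch of the return type, the snippet below parses a tiny fragment and serializes it twice. The fragment and the variable names `raw` and `as_text` are made up for illustration. Passing `encoding="unicode"` is an alternative to calling `decode()` afterwards, since it makes `tostring()` return a `str` directly.

```python
from lxml import etree

# Hypothetical one-line fragment; the <p> tag is deliberately left unclosed.
root = etree.HTML("<p>hello")

raw = etree.tostring(root)                          # default: bytes
as_text = etree.tostring(root, encoding="unicode")  # str, no decode() needed

print(type(raw).__name__)      # bytes
print(type(as_text).__name__)  # str
print(as_text)                 # <html><body><p>hello</p></body></html>
```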

The following code shows both in use:

from lxml import etree

# Sample HTML; the last <li> is intentionally left unclosed.
text = '''
<div>
    <ul>
         <li class="item-0"><a href="link1.html" target="_blank" rel="external nofollow">first item</a></li>
         <li class="item-1"><a href="link2.html" target="_blank" rel="external nofollow">second item</a></li>
         <li class="item-inactive"><a href="link3.html" target="_blank" rel="external nofollow">third item</a></li>
         <li class="item-1"><a href="link4.html" target="_blank" rel="external nofollow">fourth item</a></li>
         <li class="item-0"><a href="link5.html" target="_blank" rel="external nofollow">fifth item</a>
     </ul>
 </div>
'''
html = etree.HTML(text)         # parse and repair the HTML
result = etree.tostring(html)   # serialize the repaired tree (bytes)
print(result.decode('utf-8'))   # decode the bytes to str for printing
           

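Here we first import the etree module from the lxml library, then declare a piece of HTML text and pass it to etree.HTML() to parse it, which gives us an object that can be queried with XPath. Note that the last li node in the HTML text is not closed, but etree.HTML() repairs the HTML text automatically.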
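To make "can be queried with XPath" concrete, here is a minimal sketch on a shortened, hypothetical version of the list. The XPath expressions and the names `hrefs` and `labels` are illustrative choices, not taken from the original example.

```python
from lxml import etree

# Shortened, hypothetical list; the second <li> is left unclosed.
text = '''
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a>
     </ul>
 </div>
'''
html = etree.HTML(text)

# The returned object is the root element of the repaired tree.
print(html.tag)  # html

# xpath() returns a list of matches for the expression.
hrefs = html.xpath('//li/a/@href')
labels = html.xpath('//li[@class="item-1"]/a/text()')

print(hrefs)   # ['link1.html', 'link2.html']
print(labels)  # ['second item']
```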

Calling the tostring() method outputs the repaired HTML code, but the result is of type bytes. Here the decode() method converts it to str. The result is as follows:

<html><body><div>
    <ul>
         <li class="item-0"><a href="link1.html" target="_blank" rel="external nofollow">first item</a></li>
         <li class="item-1"><a href="link2.html" target="_blank" rel="external nofollow">second item</a></li>
         <li class="item-inactive"><a href="link3.html" target="_blank" rel="external nofollow">third item</a></li>
         <li class="item-1"><a href="link4.html" target="_blank" rel="external nofollow">fourth item</a></li>
         <li class="item-0"><a href="link5.html" target="_blank" rel="external nofollow">fifth item</a>
     </li></ul>
 </div>
</body></html>
           

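As you can see, after processing, the closing tag of the li node has been filled in, and the body and html nodes have been added automatically.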
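The same repairs can also be checked by walking the tree instead of reading the printed output. This is a rough sketch on a hypothetical two-item list; the names `broken`, `body`, `ul`, and `items` are assumptions made for the example.

```python
from lxml import etree

# Hypothetical input: the second <li> is never closed, and there is
# no <html> or <body> wrapper.
broken = "<ul><li>one</li><li>two</ul>"
root = etree.HTML(broken)

body = root.find("body")   # wrapper node added by the parser
ul = body.find("ul")
items = [li.text for li in ul.findall("li")]

print(root.tag, body.tag, ul.tag)  # html body ul
print(items)                       # ['one', 'two']

# Serializing shows the closing </li> that was filled in.
print(etree.tostring(root, encoding="unicode"))
# <html><body><ul><li>one</li><li>two</li></ul></body></html>
```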

Reference: 崔慶才 (Cui Qingcai), Python3網絡爬蟲開發實戰 (Python 3 Web Crawler Development in Practice)