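etree.HTML(): parses an HTML string into an element tree that can be queried with XPath, automatically repairing malformed markup along the way.
etree.tostring(): serializes the repaired tree; the return type is bytes.
See the following code: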
from lxml import etree
# Sample HTML; the last <li> is deliberately left unclosed.
text = '''
<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</ul>
</div>
'''
# Parse the text; lxml repairs the markup while building the tree.
html = etree.HTML(text)
# Serialize the repaired tree; tostring() returns bytes.
result = etree.tostring(html)
# Decode the bytes to str before printing.
print(result.decode('utf-8'))
Here we first import the etree module from the lxml library, then declare a piece of HTML text and pass it to etree.HTML(), which returns the root element of the parsed tree, an object that can be queried with XPath. Note that the last li node in the HTML text is not closed; etree.HTML() repairs this automatically while parsing.
Calling tostring() outputs the repaired HTML, but the result is of type bytes, so decode() is used to convert it to str. The result is as follows:
<html><body><div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</li></ul>
</div>
</body></html>
As the output shows, the closing tag of the last li node has been filled in, and the html and body nodes have been added automatically.
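The repair can also be checked programmatically rather than by reading the printed markup. The following is a minimal sketch, assuming a shortened two-item version of the sample text (the shortening and the variable name as_str are illustrative choices, not part of the original example). It inspects the root element, runs two XPath queries against the repaired tree, and uses the encoding='unicode' option of tostring() to get a str directly.

```python
from lxml import etree

# Shortened sample (an assumption for brevity): the last <li> is left unclosed.
text = '''
<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</ul>
</div>
'''

html = etree.HTML(text)

# etree.HTML() returns the root element, which is the <html> node the parser added.
print(html.tag)                       # html
print([child.tag for child in html])  # ['body']

# XPath queries run against the repaired tree.
print(html.xpath('//li/a/text()'))    # ['first item', 'fifth item']
print(html.xpath('//li/a/@href'))     # ['link1.html', 'link5.html']

# encoding='unicode' makes tostring() return str instead of bytes,
# so no decode() call is needed.
as_str = etree.tostring(html, encoding='unicode')
print(type(as_str).__name__)          # str
```

Whether to decode the bytes or request a str up front is mostly a matter of taste; the bytes form is convenient when the result is written straight to a file opened in binary mode.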
Reference: 崔慶才 (Cui Qingcai), Python3網絡爬蟲開發實戰 (Python 3 Web Crawler Development in Practice)