
Using etree.HTML() and etree.tostring() in lxml

etree.HTML(): parses HTML text into an element tree that can be queried with XPath, automatically repairing malformed markup along the way.

etree.tostring(): serializes the repaired tree; by default the result is of type bytes.
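
As a small aside on the bytes return type, the sketch below compares the default call with the `encoding='unicode'` option of `etree.tostring()`, which returns a str directly. The `snippet` string and the variable names are made up for illustration and are not part of the original example.

```python
from lxml import etree

# Hypothetical fragment, used only for illustration; <p> is left unclosed
snippet = '<p>hello<br>world'

root = etree.HTML(snippet)

# Default serialization returns bytes
as_bytes = etree.tostring(root)

# encoding='unicode' makes tostring() return str, so no decode() call is needed
as_str = etree.tostring(root, encoding='unicode')

print(type(as_bytes).__name__)  # bytes
print(type(as_str).__name__)    # str
print(as_str)
```

Either form works for display. One difference worth knowing is that the default bytes output is ASCII-encoded, so non-ASCII characters such as Chinese text appear as numeric character references unless an encoding like `'utf-8'` or `'unicode'` is passed.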

The following code shows both in use:

from lxml import etree
# Sample HTML; the closing tag of the last <li> is missing
text = '''
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
     </ul>
 </div>
'''
html = etree.HTML(text)        # parse the text and repair the markup
result = etree.tostring(html)  # serialize the repaired tree to bytes
print(result.decode('utf-8'))  # decode to str before printing

The code first imports the etree module from lxml, then declares a piece of HTML text and passes it to etree.HTML() to initialize it, which yields an object that supports XPath queries. Note that the last li node in the HTML text is not closed; etree.HTML() repairs the markup automatically.

Calling tostring() outputs the repaired HTML, but the result is of type bytes. Here decode() converts it to str, and the result is as follows:

<html><body><div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
     </li></ul>
 </div>
</body></html>

As you can see, after processing, the unclosed li tag has been completed, and the body and html nodes have been added automatically.
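
The same repairs can also be checked programmatically rather than by reading the printed output. The sketch below uses a shortened two-item version of the sample (an assumption made to keep it brief) and inspects the tree with `tag` and `xpath()`; the variable names are illustrative only.

```python
from lxml import etree

# Shortened version of the sample; the last <li> is still left unclosed
text = '''
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
     </ul>
 </div>
'''

root = etree.HTML(text)

# etree.HTML() returns the root element, which is the added <html> node
print(root.tag)  # html

# The added <body> node is its only child here
child_tags = [child.tag for child in root]
print(child_tags)  # ['body']

# Both <li> elements sit directly under <ul>, so the unclosed one was closed
items = root.xpath('//ul/li')
print(len(items))  # 2

# XPath can then pull values out of the repaired tree
texts = root.xpath('//li/a/text()')
hrefs = root.xpath('//li/a/@href')
print(texts)  # ['first item', 'fifth item']
print(hrefs)  # ['link1.html', 'link5.html']
```

Checking the structure this way avoids depending on the exact whitespace of the serialized string.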

Reference: Cui Qingcai (崔庆才), Python3网络爬虫开发实战 (Python 3 Web Crawler Development in Practice)