Summary of problems encountered when processing web pages with XPath

<div class="name">
    <div class="title">
        <div class="price">
            <span>
                <a href="" target="_blank" rel="external nofollow">網頁欣賞</a>
            </span>
        </div>
    </div>
</div>

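When the document is nested several levels deep, the content cannot be matched directly
1. text() does not return the content as one value. It returns a list of separate text nodes, and the leading entries are whitespace-only strings such as '\n    ', so the first result looks empty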
```
data.xpath('//div[@class="name"]//text()')
           
```
2. xpath('string(.)') matches correctly: it returns the text of the element and all of its descendants joined into a single string
```
data.xpath('//div[@class="name"]')[0].xpath('string(.)')
           
```
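The difference between the two approaches can be sketched as follows. This is a minimal example that assumes lxml is installed and parses an inline copy of the sample markup (with the link attributes trimmed) instead of a file. The variable names are illustrative only.

```python
from lxml import etree

# Inline copy of the nested sample markup (assumption: same structure as the page above).
HTML = """
<div class="name">
    <div class="title">
        <div class="price">
            <span>
                <a href="">網頁欣賞</a>
            </span>
        </div>
    </div>
</div>
"""

root = etree.HTML(HTML)
div = root.xpath('//div[@class="name"]')[0]

# text() yields every text node separately, including indentation-only nodes.
text_nodes = div.xpath('.//text()')
print(repr(text_nodes[0]))

# Dropping whitespace-only nodes leaves the real content.
non_blank = [t.strip() for t in text_nodes if t.strip()]
print(non_blank)

# string(.) joins all descendant text; normalize-space(.) also collapses whitespace.
joined = div.xpath('string(.)').strip()
normalized = div.xpath('normalize-space(.)')
print(joined, normalized)
```

normalize-space(.) is often the more convenient choice when only the visible text matters, because it removes the indentation and newlines in a single step.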

Calling xpath again on a div object returned by xpath fails

Take a div object (<class 'lxml.etree._Element'>) returned by xpath and call xpath on it again to match its content: the match fails.

from lxml import etree

html = etree.parse('./test_xpath.html', etree.HTMLParser())
strings = etree.HTML(etree.tounicode(html))

# print(strings)
pp = strings.xpath('//div[@class="name"]')[0]
print(type(strings))  # <class 'lxml.etree._Element'>
print(type(pp))  # <class 'lxml.etree._Element'>
print(pp.xpath('/div[@class="title"]'))  # []

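- Why does the match fail?

- In print(pp.xpath('/div[@class="title"]'))  # [], the expression starts with '/' rather than '//'. A leading '/' is an absolute path evaluated from the document root, even when xpath is called on a child element. The root element is html, not div, so nothing is found. To search relative to pp, start the expression with './' or './/'.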
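A short sketch of how the leading characters of the expression change the result. It assumes lxml is installed and uses an inline, simplified copy of the sample markup. The variable names are illustrative only.

```python
from lxml import etree

HTML = """
<div class="name">
    <div class="title">
        <div class="price">
            <span><a href="">網頁欣賞</a></span>
        </div>
    </div>
</div>
"""

root = etree.HTML(HTML)
pp = root.xpath('//div[@class="name"]')[0]

# '/' is absolute: it starts at the document root (<html>), so this finds nothing.
absolute = pp.xpath('/div[@class="title"]')

# './' is relative to pp: it matches the direct child <div class="title">.
relative = pp.xpath('./div[@class="title"]')

# '//' is also absolute: it searches the whole document, pp itself included.
whole_doc = pp.xpath('//div')

# './/' searches only the descendants of pp.
subtree = pp.xpath('.//div')

print(absolute, len(relative), len(whole_doc), len(subtree))
```

When working from an element that was itself returned by xpath, starting expressions with '.' keeps the search scoped to that element.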

When opening an HTML file with lxml, the Chinese text may not be decoded correctly

html = etree.parse('./test_xpath.html', etree.HTMLParser())
strings = etree.tostring(html)
print(strings)

# The output looks like this:
'''
b'<!DOCTYPE html>\n<html >&#13;\n<head>&#13;\n    <meta charset="UTF-8"/>&#13;\n    <title>Title</title>&#13;\n</head>&#13;\n<body>&#13;\n<div class="name">&#13;\n    <div class="title">&#13;\n        <div class="price">&#13;\n            <span>&#13;\n                <a href="" target="_blank" rel="external nofollow">&#32593;&#39029;&#27427;&#36175;</a>&#13;\n            </span>&#13;\n        </div>&#13;\n    </div>&#13;\n</div>&#13;\n</body>&#13;\n</html>'
'''
# Decoding the bytes gives a str, but the character references remain.
html = etree.parse('./test_xpath.html', etree.HTMLParser())
strings = etree.tostring(html).decode()
print(strings)
'''
<!DOCTYPE html>
<html >&#13;
<head>&#13;
    <meta charset="UTF-8"/>&#13;
    <title>Title</title>&#13;
</head>&#13;
<body>&#13;
<div class="name">&#13;
    <div class="title">&#13;
        <div class="price">&#13;
            <span>&#13;
                <a href="" target="_blank" rel="external nofollow">&#32593;&#39029;&#27427;&#36175;</a>&#13;
            </span>&#13;
        </div>&#13;
    </div>&#13;
</div>&#13;
</body>&#13;
</html>

'''
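This is still wrong: decode() only converts the bytes to a str, and the Chinese characters are still shown as numeric character references.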

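etree.tostring() serializes to ASCII bytes by default, so every non-ASCII character is escaped as a numeric character reference.
With etree.tounicode(), the text is serialized correctly, and the result is already a str, so no decode() call is needed to get readable HTML.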

html = etree.parse('./test_xpath.html', etree.HTMLParser())
strings = etree.tounicode(html)
print(strings)


<!DOCTYPE html>
<html >&#13;
<head>&#13;
    <meta charset="UTF-8"/>&#13;
    <title>Title</title>&#13;
</head>&#13;
<body>&#13;
<div class="name">&#13;
    <div class="title">&#13;
        <div class="price">&#13;
            <span>&#13;
                <a href="" target="_blank" rel="external nofollow">網頁欣賞</a>&#13;
            </span>&#13;
        </div>&#13;
    </div>&#13;
</div>&#13;
</body>&#13;
</html>
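The same readable result can also be obtained from etree.tostring() by passing an explicit encoding. The lxml documentation marks tounicode() as deprecated in favor of tostring(..., encoding='unicode'). The sketch below assumes lxml is installed and parses an inline fragment instead of the file. The variable names are illustrative only.

```python
from lxml import etree

root = etree.HTML('<div class="name"><a href="">網頁欣賞</a></div>')

# Default: ASCII bytes, so non-ASCII characters become numeric character references.
ascii_bytes = etree.tostring(root)
print(ascii_bytes)

# encoding='unicode' returns a str with the characters intact.
as_text = etree.tostring(root, encoding='unicode')
print(as_text)

# encoding='utf-8' returns UTF-8 bytes, which decode back to the same characters.
utf8_text = etree.tostring(root, encoding='utf-8').decode('utf-8')
print(utf8_text)
```

Either of the last two forms keeps the Chinese text readable. Which one fits depends on whether the caller needs a str or bytes, for example when writing to a file opened in binary mode.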

movie_div = strings.xpath("//div[contains(@class,'doulist-item')]//text()")
'\n    '
