GitHub project: https://github.com/lei940324/toy/tree/master/筆記

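Basics

Introduction to XPath

XPath is a language for finding information in an XML document. It is used to navigate through the elements and attributes of an XML document.

XPath is a major element of the W3C XSLT standard, and both XQuery and XPointer are built on XPath expressions.

An understanding of XPath is therefore the foundation for many advanced XML applications.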

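General usage of XPath

Expression	Description
nodename	Selects all nodes with the name nodename.
/	Selects from the root node.
//	Selects nodes in the document, starting from the current node, that match the selection, no matter where they are.
.	Selects the current node.
..	Selects the parent of the current node.
@	Selects attributes.
*	Wildcard; matches any element node.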

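Examples

The table below lists some path expressions and their results:

Path expression	Result
bookstore	Selects all nodes with the name bookstore.
/bookstore	Selects the root element bookstore. Note: if a path starts with a slash ( / ), it always represents an absolute path to an element!
bookstore/book	Selects all book elements that are children of bookstore.
//book	Selects all book elements, no matter where they are in the document.
bookstore//book	Selects all book elements that are descendants of the bookstore element, no matter where they sit under bookstore.
//@lang	Selects all attributes named lang.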
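A few of the path expressions from the table can be tried with lxml. This is only a minimal sketch: the small bookstore document and the variable names are assumptions made for illustration, and the XML parser (etree.fromstring) is used so that bookstore really is the root element.

```python
from lxml import etree

# Illustrative document (assumed); bookstore is the root element.
xml = """
<bookstore>
  <book><title lang="eng">Harry Potter</title><price>29.99</price></book>
  <book><title lang="eng">Learning XML</title><price>39.95</price></book>
  <book><title lang="chinese">Tom</title><price>10</price></book>
</bookstore>
"""
root = etree.fromstring(xml)

root_tag = root.xpath('/bookstore')[0].tag                  # absolute path to the root element
child_titles = root.xpath('/bookstore/book/title/text()')   # step by step down from the root
all_books = root.xpath('//book')                            # every book, wherever it sits
all_langs = root.xpath('//@lang')                           # every lang attribute value

print(root_tag)
print(child_titles)
print(len(all_books))
print(all_langs)
```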

Note: XPath positions start at 1, whereas Python indexes start at 0.
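The difference in indexing is easy to trip over, so here is a small sketch. The two-book document is an assumption for illustration. It selects the first book once with an XPath position and once with a Python list index, and shows that position 0 matches nothing in XPath.

```python
from lxml import etree

# Illustrative document (assumed).
root = etree.fromstring(
    "<bookstore>"
    "<book><title>Harry Potter</title></book>"
    "<book><title>Learning XML</title></book>"
    "</bookstore>"
)

first_by_xpath = root.xpath('//book[1]/title/text()')   # XPath counts from 1
books = root.xpath('//book')
first_by_python = books[0].xpath('title/text()')        # Python lists count from 0
zero_in_xpath = root.xpath('//book[0]')                 # no node is ever at position 0

print(first_by_xpath, first_by_python, zero_in_xpath)
```

Strictly speaking, //book[1] means "every book that is the first book child of its parent", which is a single node here because all books share one parent.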

Selecting nodes

First, import the lxml library:

from lxml import etree

Take an arbitrary XML example:

xml = '''
<?xml version="1.0" encoding="ISO-8859-1"?>

<bookstore>

<book>
  <title lang="eng">Harry Potter</title>
  <price>29.99</price>
</book>

<book>
  <title lang="eng">Learning XML</title>
  <price>39.95</price>
</book>

<book>
  <title lang="chinese">Tom</title>
  <price>10</price>
</book>
</bookstore>
'''

selector = etree.HTML(xml)  # parse the sample; the HTML parser wraps it in html/body

print('1st book element: ')
print(selector.xpath('//book[1]//text()'))

print('\n2nd book element: ')
print(selector.xpath('//book[2]//text()'))

print('\nLast book element: ')
print(selector.xpath('//book[last()]//text()'))

print('\nSecond-to-last book element: ')
print(selector.xpath('//book[last()-1]//text()'))

print('\nFirst 2 book elements: ')
print(selector.xpath('//book[position() < 3]//text()'))

print('\nAll title elements that have an attribute named lang: ')
print(selector.xpath('//title[@lang]//text()'))

print('\nAll title elements whose lang attribute has the value chinese: ')
print(selector.xpath('//title[@lang="chinese"]//text()'))

print('\nAll title and price nodes: ')
print(selector.xpath('//title//text() | //price//text()'))

1st book element: 
['\n  ', 'Harry Potter', '\n  ', '29.99', '\n']

2nd book element: 
['\n  ', 'Learning XML', '\n  ', '39.95', '\n']

Last book element: 
['\n  ', 'Tom', '\n  ', '10', '\n']

Second-to-last book element: 
['\n  ', 'Learning XML', '\n  ', '39.95', '\n']

First 2 book elements: 
['\n  ', 'Harry Potter', '\n  ', '29.99', '\n', '\n  ', 'Learning XML', '\n  ', '39.95', '\n']

All title elements that have an attribute named lang: 
['Harry Potter', 'Learning XML', 'Tom']

All title elements whose lang attribute has the value chinese: 
['Tom']

All title and price nodes: 
['Harry Potter', '29.99', 'Learning XML', '39.95', 'Tom', '10']

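Location path expressions

A location path can be absolute or relative.

An absolute location path starts with a slash ( / ), whereas a relative one does not. In both cases the location path consists of one or more steps, each separated by a slash:

Absolute location path:

/step/step/...

Relative location path:

step/step/...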
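To make the distinction concrete, the sketch below evaluates one absolute and one relative path from the same context node. The document and variable names are assumptions for illustration. In lxml, an absolute path is resolved from the document root even when xpath() is called on a nested element, while a relative path starts at that element.

```python
from lxml import etree

# Illustrative document (assumed).
root = etree.fromstring(
    "<bookstore>"
    "<book><title>Harry Potter</title></book>"
    "<book><title>Learning XML</title></book>"
    "<book><title>Tom</title></book>"
    "</bookstore>"
)

second_book = root.xpath('/bookstore/book')[1]   # Python index 1 is the 2nd book

# Absolute path: starts at the document root, so the context node is ignored.
absolute = second_book.xpath('/bookstore/book/title/text()')

# Relative path: starts at the context node (the 2nd book).
relative = second_book.xpath('title/text()')

print(absolute)
print(relative)
```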

XPath axes

An axis defines a node-set relative to the current node.

[<Element html at 0x2ca0097e988>,
 <Element body at 0x2ca00878148>,
 <Element bookstore at 0x2ca0097ee88>,
 <Element book at 0x2ca0097e608>,
 <Element book at 0x2ca0097e788>,
 <Element book at 0x2ca0097e488>]

[<Element html at 0x2ca0097e988>,
 <Element body at 0x2ca00878148>,
 <Element bookstore at 0x2ca0097ee88>,
 <Element book at 0x2ca0097e608>,
 <Element title at 0x2ca00987f48>,
 <Element book at 0x2ca0097e788>,
 <Element title at 0x2ca00987fc8>,
 <Element book at 0x2ca0097e488>,
 <Element title at 0x2ca00995048>]

['eng', 'eng', 'chinese']

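Axis name	Result
ancestor	Selects all ancestors (parent, grandparent, etc.) of the current node.
ancestor-or-self	Selects all ancestors (parent, grandparent, etc.) of the current node and the current node itself.
attribute	Selects all attributes of the current node.
child	Selects all children of the current node.
descendant	Selects all descendants (children, grandchildren, etc.) of the current node.
descendant-or-self	Selects all descendants (children, grandchildren, etc.) of the current node and the current node itself.
following	Selects everything in the document after the closing tag of the current node.
namespace	Selects all namespace nodes of the current node.
parent	Selects the parent of the current node.
preceding	Selects all nodes that appear before the opening tag of the current node in the document, excluding its ancestors.
preceding-sibling	Selects all siblings before the current node.
self	Selects the current node.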
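The axis syntax is axisname::nodetest. The sketch below tries several axes from the table. The document and variable names are assumptions for illustration. It uses the XML parser, so there are no html/body wrapper elements among the ancestors as there are when the sample is parsed with etree.HTML.

```python
from lxml import etree

# Illustrative document (assumed).
root = etree.fromstring(
    "<bookstore>"
    '<book><title lang="eng">Harry Potter</title><price>29.99</price></book>'
    '<book><title lang="chinese">Tom</title><price>10</price></book>'
    "</bookstore>"
)

title = root.xpath('//title')[0]   # first title as the context node
price = root.xpath('//price')[0]   # first price as the context node

ancestors = [el.tag for el in title.xpath('ancestor::*')]
ancestors_or_self = [el.tag for el in title.xpath('ancestor-or-self::*')]
parent = [el.tag for el in title.xpath('parent::*')]
descendants = [el.tag for el in root.xpath('//book[1]/descendant::*')]
before_price = [el.tag for el in price.xpath('preceding-sibling::*')]
langs = root.xpath('//title/attribute::lang')

print(ancestors, ancestors_or_self, parent)
print(descendants, before_price, langs)
```

lxml returns node-sets in document order, which is why the outermost ancestor comes first.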

Functions

The starts-with function

['Tom']
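The query behind this result is not shown, so the following is only a plausible sketch. It assumes the titles carry lang values of eng, eng and chinese, and it assumes the prefix "ch". starts-with(a, b) is true when string a begins with string b.

```python
from lxml import etree

# Illustrative document (assumed lang values).
root = etree.fromstring(
    "<bookstore>"
    '<title lang="eng">Harry Potter</title>'
    '<title lang="eng">Learning XML</title>'
    '<title lang="chinese">Tom</title>'
    "</bookstore>"
)

# Titles whose lang attribute begins with "ch".
starts = root.xpath('//title[starts-with(@lang, "ch")]/text()')
print(starts)
```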

The contains function

['Harry Potter', 'Learning XML']
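Again the original query is not shown. A minimal sketch that would give such a result, assuming lang values of eng, eng and chinese and the substring "en": contains(a, b) is true when string a contains string b anywhere.

```python
from lxml import etree

# Illustrative document (assumed lang values).
root = etree.fromstring(
    "<bookstore>"
    '<title lang="eng">Harry Potter</title>'
    '<title lang="eng">Learning XML</title>'
    '<title lang="chinese">Tom</title>'
    "</bookstore>"
)

# Titles whose lang attribute contains "en" ("chinese" does not).
matched = root.xpath('//title[contains(@lang, "en")]/text()')
print(matched)
```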

Using and

# Select title nodes whose lang value contains both "en" and "g"
selector.xpath('//title[contains(@lang,"en") and contains(@lang,"g")]//text()')

['Harry Potter', 'Learning XML']

Matching part of the text content

['Learning XML']
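The query for this result is missing as well. One way to match on part of an element's text is to pass text() to contains, sketched here with an assumed substring "XML":

```python
from lxml import etree

# Illustrative document (assumed).
root = etree.fromstring(
    "<bookstore>"
    "<title>Harry Potter</title>"
    "<title>Learning XML</title>"
    "<title>Tom</title>"
    "</bookstore>"
)

# Titles whose own text contains "XML".
partial = root.xpath('//title[contains(text(), "XML")]/text()')
print(partial)
```

Note that contains(text(), ...) only looks at the first text node of each element; contains(., ...) tests the element's full string value instead.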

Using string: get the text and return it as a single string

info = selector.xpath('//title/ancestor::*')
strings = info[3].xpath('string(.)')
print('string():', strings)

texts = info[3].xpath('.//text()')
print('text():', texts)

string(): 
  Harry Potter
  29.99

text(): ['\n  ', 'Harry Potter', '\n  ', '29.99', '\n']
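If the leftover indentation and newlines are unwanted, XPath's normalize-space function is one option: it trims the string and collapses each run of whitespace into a single space. A small sketch, with the book fragment assumed for illustration:

```python
from lxml import etree

# Illustrative fragment (assumed), indented like the sample document.
book = etree.fromstring("""
<book>
  <title>Harry Potter</title>
  <price>29.99</price>
</book>
""")

raw = book.xpath('string(.)')             # keeps the newlines and indentation
clean = book.xpath('normalize-space(.)')  # trimmed, whitespace runs collapsed

print(repr(raw))
print(repr(clean))
```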

Common questions

How do you write "does not contain" in XPath?

Suppose we have the following HTML:

html = '''
<html>
    <head>
        <title>測試XPath移除功能</title>
    </head>
    <body>
        <div class="post">
            <div class="quote">無關緊要的引用内容</div>
                你好啊
                <strong>産品經理</strong>，
                <span>很高興認識你</span>
                。
        </div>
    </body>
</html>
'''

I want to extract the text 你好啊産品經理，很高興認識你 ("Hello, product manager, nice to meet you") from it.

from lxml import etree
selector = etree.fromstring(html)
selector.xpath('//div[@class="post"]//*[not(@class="quote")]/text()')

['産品經理', '很高興認識你']

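But 你好啊 is missing here, because it does not belong to any child tag.

To also pick up the text that sits directly under the div, we need /text() on the div itself, joined to the previous XPath with |: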

data = selector.xpath('//div[@class="post"]/text() | //div[@class="post"]//*[not(@class="quote")]/text()')
text = ''.join(map(lambda x: x.strip() , data))
text

'你好啊産品經理，很高興認識你。'
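One caveat: not(@class="quote") is an exact comparison, so an element whose class attribute holds several names (for example "quote small") would slip through. A hedged variant is not(contains(@class, "quote")). The markup below is an assumed stand-in for the example above, not the original page.

```python
from lxml import etree

# Assumed stand-in markup; the quote div carries two class names.
html = '''
<div class="post">
    <div class="quote small">quoted text</div>
    hello
    <strong>product manager</strong>,
    <span>nice to meet you</span>.
</div>
'''
tree = etree.HTML(html)

# Exact match: "quote small" != "quote", so the quote text leaks in.
exact = tree.xpath('//div[@class="post"]//*[not(@class="quote")]/text()')

# Substring match: anything whose class contains "quote" is excluded.
loose = tree.xpath('//div[@class="post"]//*[not(contains(@class, "quote"))]/text()')

print(exact)
print(loose)
```

The substring test would also exclude classes such as "unquoted", so pick whichever comparison matches the markup at hand.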

With tags nested inside tags, how do you extract one complete sentence?

html = '''
<div id="class3">
    我左青龍,
    <span id='tiger'>
        右白虎,
        <ul>上朱雀,
            <li>下玄武.</li>
        </ul>
        老牛在當中,
    </span>
    龍頭在胸口.
</div>
'''

selector = etree.HTML(html)
data = selector.xpath('//div[@id="class3"]')[0]

Method 1: use the string function

info = data.xpath('string(.)')   # effectively drops the extra tags nested inside the div
print(info)

我左青龍,
    
        右白虎,
        上朱雀,
            下玄武.
        
        老牛在當中,
    
    龍頭在胸口.

content2 = info.replace('\n','').replace(' ','')   # strip out newlines and spaces
print(content2)

我左青龍,右白虎,上朱雀,下玄武.老牛在當中,龍頭在胸口.

Method 2: use text()

info = data.xpath('.//text()')
info

['\n    我左青龍,\n    ',
 '\n        右白虎,\n        ',
 '上朱雀,\n            ',
 '下玄武.',
 '\n        ',
 '\n        老牛在當中,\n    ',
 '\n    龍頭在胸口.\n']

content2 = ''.join(map(lambda x: x.strip() , info))
print(content2)

我左青龍,右白虎,上朱雀,下玄武.老牛在當中,龍頭在胸口.
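A third option, not from the original write-up, is lxml's own Element.itertext(), which yields the same text pieces in document order without an XPath query. The sketch below uses short placeholder strings (assumed) in the same nested layout:

```python
from lxml import etree

# Placeholder text (assumed) in the same nested structure as the example.
html = '''
<div id="class3">
    A,
    <span id="tiger">
        B,
        <ul>C,
            <li>D.</li>
        </ul>
        E,
    </span>
    F.
</div>
'''
div = etree.HTML(html).xpath('//div[@id="class3"]')[0]

# Walk every text piece under the div, strip each one, and glue them together.
sentence = ''.join(piece.strip() for piece in div.itertext())
print(sentence)
```

Stripping each piece, rather than removing every space as in method 1, keeps any spaces that appear inside a piece, which matters for text in languages that separate words with spaces.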
