Data Parsing (XPath, BeautifulSoup, Regular Expressions, pyquery)

Some of the material in this article comes from 菜鳥教程 (Runoob).

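When learning web scraping, once you have fetched a page you need to parse its data.

There are four common ways to parse it: XPath, BeautifulSoup, regular expressions, and pyquery.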

1. XPath

Reference: the XPath tutorial on 菜鳥教程.

XPath is used here through the lxml library. Install it with: pip install lxml


<?xml version="1.0" encoding="UTF-8"?>
 
<bookstore>
 
<book>
  <title lang="eng">Harry Potter</title>
  <price>29.99</price>
</book>
 
<book>
  <title lang="eng">Learning XML</title>
  <price>39.95</price>
</book>
</book>
 
</bookstore>

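XPath uses path expressions to select nodes in an XML document. Nodes are selected by following a path, one step at a time. The most useful path expressions are listed below:

Expression	Description
nodename	Selects all child elements named nodename of the current node.
/	Selects from the root node; between steps, selects direct children.
//	Selects matching nodes anywhere below the current node, regardless of their position (descendants).
.	Selects the current node.
..	Selects the parent of the current node.
@	Selects attributes.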

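The table below lists some path expressions and their results:

Path expression	Result
bookstore	Selects all bookstore elements that are children of the current node.
/bookstore	Selects the root element bookstore. Note: a path that starts with a forward slash ( / ) always represents an absolute path to an element.
bookstore/book	Selects all book elements that are children of bookstore.
//book	Selects all book elements, wherever they are in the document.
bookstore//book	Selects all book elements that are descendants of bookstore, wherever they are below bookstore.
//@lang	Selects all attributes named lang.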
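
The path expressions above can be tried out with lxml. This is a minimal sketch, not the original tutorial's code: it embeds a copy of the bookstore sample and assumes each title carries a lang="eng" attribute so that //@lang has something to select. The variable names are illustrative.

```python
from lxml import etree

# Copy of the bookstore sample; the lang="eng" attributes are an assumption.
xml = b'''<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
<book>
  <title lang="eng">Harry Potter</title>
  <price>29.99</price>
</book>
<book>
  <title lang="eng">Learning XML</title>
  <price>39.95</price>
</book>
</bookstore>'''

# Bytes are passed because the text starts with an encoding declaration.
root = etree.fromstring(xml)

# Absolute path: text of every title under /bookstore/book
titles = root.xpath('/bookstore/book/title/text()')

# // selects book elements wherever they are in the document
all_books = root.xpath('//book')

# @ selects attributes
langs = root.xpath('//@lang')

# .. steps from a title up to its parent element
parent_tag = root.xpath('//title/..')[0].tag

print(titles, len(all_books), langs, parent_tag)
```

Each xpath() call returns a list, even when only one node matches.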

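Predicates

Predicates are used to find a specific node, or a node that contains a specific value.

Predicates are written inside square brackets.

The table below lists some path expressions with predicates and their results:

Path expression	Result
/bookstore/book[1]	Selects the first book element that is a child of bookstore.
/bookstore/book[last()]	Selects the last book element that is a child of bookstore.
/bookstore/book[last()-1]	Selects the second-to-last book element that is a child of bookstore.
/bookstore/book[position()<3]	Selects the first two book elements that are children of bookstore.
//title[@lang]	Selects all title elements that have an attribute named lang.
//title[@lang='eng']	Selects all title elements whose lang attribute has the value eng.
/bookstore/book[price>35.00]	Selects all book elements under bookstore whose price element has a value greater than 35.00.
/bookstore/book[price>35.00]//title	Selects all title elements inside the book elements under bookstore whose price element has a value greater than 35.00.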
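
The predicates in the table can be checked the same way. The sketch below assumes lxml and a two-book copy of the bookstore sample with lang="eng" on each title (an assumption added for illustration).

```python
from lxml import etree

# Two-book copy of the sample; the lang attributes are assumed.
xml = b'''<bookstore>
<book><title lang="eng">Harry Potter</title><price>29.99</price></book>
<book><title lang="eng">Learning XML</title><price>39.95</price></book>
</bookstore>'''
root = etree.fromstring(xml)

# Positions are 1-based in XPath
first = root.xpath('/bookstore/book[1]/title/text()')
last = root.xpath('/bookstore/book[last()]/title/text()')
second_to_last = root.xpath('/bookstore/book[last()-1]/title/text()')
first_two = root.xpath('/bookstore/book[position()<3]')

# Attribute tests
eng_titles = root.xpath("//title[@lang='eng']/text()")

# Value test on a child element
expensive = root.xpath('/bookstore/book[price>35.00]/title/text()')

print(first, last, second_to_last, len(first_two), eng_titles, expensive)
```

Note that book[1] is the first book, not the second: XPath positions start at 1, unlike Python list indexes.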

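Selecting unknown nodes

XPath wildcards can be used to select XML elements whose names are not known in advance.

Wildcard	Description
*	Matches any element node.
@*	Matches any attribute node.
node()	Matches a node of any type.

The table below lists some path expressions and their results:

Path expression	Result
/bookstore/*	Selects all child elements of the bookstore element.
//*	Selects all elements in the document.
//title[@*]	Selects all title elements that have at least one attribute.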
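
A short sketch of the wildcards, again assuming lxml and a copy of the bookstore sample in which each title has a lang="eng" attribute (an assumption for illustration):

```python
from lxml import etree

# Two-book copy of the sample; the lang attributes are assumed.
xml = b'''<bookstore>
<book><title lang="eng">Harry Potter</title><price>29.99</price></book>
<book><title lang="eng">Learning XML</title><price>39.95</price></book>
</bookstore>'''
root = etree.fromstring(xml)

# * matches any element: the direct children of bookstore
child_tags = [el.tag for el in root.xpath('/bookstore/*')]

# //* matches every element in the document
all_tags = [el.tag for el in root.xpath('//*')]

# [@*] keeps only elements that carry at least one attribute
titles_with_attrs = root.xpath('//title[@*]')

# @* as a step returns the attribute values themselves
attr_values = root.xpath('//title/@*')

print(child_tags, len(all_tags), len(titles_with_attrs), attr_values)
```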

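Selecting several paths

By using the "|" operator in a path expression, you can select several paths at once.

The table below lists some path expressions and their results:

Path expression	Result
//book/title | //book/price	Selects all title and price elements of book elements.
//title | //price	Selects all title and price elements in the document.
/bookstore/book/title | //price	Selects all title elements of the book elements under bookstore, plus all price elements in the document.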
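
The "|" operator can be sketched with lxml as follows. The XML is a reduced copy of the bookstore sample and the variable names are illustrative.

```python
from lxml import etree

# Reduced copy of the bookstore sample
xml = b'''<bookstore>
<book><title>Harry Potter</title><price>29.99</price></book>
<book><title>Learning XML</title><price>39.95</price></book>
</bookstore>'''
root = etree.fromstring(xml)

# One query that returns both the title and the price elements
nodes = root.xpath('//book/title | //book/price')
tags = [el.tag for el in nodes]

# The same union, built from two separate queries for comparison
separate = root.xpath('//book/title') + root.xpath('//book/price')

print(tags, len(nodes) == len(separate))
```

A union returns each node once, so a node matched by both sides is not duplicated.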

An example: scraping the book titles from the monthly-ticket ranking on Qidian (起點小說網).


import requests
from lxml import etree

url = 'https://www.qidian.com/rank/yuepiao/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'}
resp = requests.get(url, headers=headers)  # send the request
e = etree.HTML(resp.text)  # parse the HTML string into an lxml.etree._Element
names = e.xpath('//div[@class="book-mid-info"]/h2/a/text()')  # text of each title link
print(names)


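2. BeautifulSoup

Install with: pip install bs4

The parser usually chosen is the second one in the original parser table (not reproduced here); the examples in this section pass 'lxml'.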


An example:

from bs4 import BeautifulSoup

html = '''
    <html>
        <head>
            <title>Baidu</title>
        </head>
        <body>
            <h1 class="info bg" float='left'>Baidu Search</h1>
            <a href="http://www.baidu.com" target="_blank" rel="external nofollow"> Baidu</a>
            <h2><!--comment content--></h2>
        </body>
    </html>

'''

bs = BeautifulSoup(html, 'lxml')  # choose the parser and build the soup object
print(bs.title)
print(bs.h1.attrs)  # all attributes of the h1 tag as a dict

# Get a single attribute
print(bs.h1.get('class'))
print(bs.h1['class'])
print(bs.a['href'])

# Get the content
print('--------', bs.h2.string)  # .string returns the comment text inside the h2 tag
print(bs.h2.text)  # empty, because the h2 tag holds only a comment and no real text

from bs4 import BeautifulSoup

html = '''
    <title>Baidu</title>
    <div class="info" float="left">Baidu Search</div>
    <div class="info" float="right" id="gb">
        <span>Study hard and make progress every day</span>
        <a href="http://www.baidu.com" target="_blank" rel="external nofollow">Official site</a>
    </div>
    <span>Life is short, have another bowl</span>
'''
bs = BeautifulSoup(html, 'lxml')
print(bs.title, type(bs.title))
print(bs.find('div', class_='info'), type(bs.find('div', class_='info')))  # first tag that matches
print('--------------------------------------')
print(bs.find_all('div', class_='info'))  # a list of all matching tags
print('--------------------------------------')
for item in bs.find_all('div', class_='info'):
    print(item, type(item))
print('--------------------------------------')
print(bs.find_all('div', attrs={'float': 'right'}))  # match on an arbitrary attribute

print('=============== CSS selectors =======================')
print(bs.select("#gb"))  # by id
print('--------------------------------------')
print(bs.select('.info'))  # by class
print('--------------------------------------')
print(bs.select('div>span'))  # span elements that are direct children of a div
print('--------------------------------------')
print(bs.select('div.info>span'))

for item in bs.select('div.info>span'):
    print(item.text)

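3. Regular expressions

A regular expression is a special sequence of characters that makes it easy to check whether a string matches a pattern. Python's regular expression module is re. It is built in, so there is nothing to install; just import it.

Syntax:

No.	Metacharacter	Description
1	.	Matches any character except a newline (by default)
2	^	Matches the start of the string
3	$	Matches the end of the string
4	*	Matches the preceding element 0 or more times
5	+	Matches the preceding element 1 or more times
6	?	Matches the preceding element 0 or 1 time
7	{m}	Matches the preceding element exactly m times
8	{m,n}	Matches the preceding element m to n times
9	{m,n}?	Matches the preceding element m to n times, as few times as possible (non-greedy)
10	\	Escapes a special character
11	[]	A set of characters; matches any one character in the set
12	|	Alternation ("or"); for example, a|b matches a or b
13	(...)	The parenthesized expression forms a group. When the pattern contains groups, findall() returns only the group contents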
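
A few of the metacharacters above are easier to see side by side. The sketch below uses only the standard re module; the sample string and variable names are made up for illustration.

```python
import re

text = 'price: 29.99, price: 39.95'

# Greedy: .+ takes as much as it can, so one match spans the whole string
greedy = re.findall(r'p.+\d', text)

# Non-greedy: .+? stops at the first digit it can reach
lazy = re.findall(r'p.+?\d', text)

# {m}: exactly two digits in a row
pairs = re.findall(r'\d{2}', text)

# |: alternation between two literals
either = re.findall(r'29|39', text)

# Groups: findall returns only the captured parts, one tuple per match
groups = re.findall(r'price: (\d+)\.(\d+)', text)

print(greedy)
print(lazy)
print(pairs)
print(either)
print(groups)
```

Raw strings (the r prefix) keep backslashes such as \d from being interpreted by Python before the pattern reaches re.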

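Special sequences:

No.	Sequence	Description
1	\A	Matches only at the start of the string
2	\b	Matches the empty string at the start or end of a word
3	\B	Matches the empty string that is not at the start or end of a word
4	\d	Matches any decimal digit; for ASCII text, equivalent to [0-9]
5	\D	Matches any non-digit character; for ASCII text, equivalent to [^0-9]
6	\s	Matches any whitespace character; for ASCII text, equivalent to [ \t\n\r\f\v]
7	\S	Matches any non-whitespace character; for ASCII text, equivalent to [^ \t\n\r\f\v]
8	\w	Matches any digit, letter, or underscore; for ASCII text, equivalent to [a-zA-Z0-9_]
9	\W	Matches any character that is not a digit, letter, or underscore; for ASCII text, equivalent to [^a-zA-Z0-9_]
10	\Z	Matches only at the end of the string
11	[\u4e00-\u9fa5]	Chinese characters (the commonly used CJK range)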
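
The special sequences can be explored with a short script. This is a sketch using only the standard re module; the sample string is invented and mixes ASCII and Chinese text to show that \w on a str pattern also matches non-ASCII word characters.

```python
import re

s = 'Python3.8 學習 every_day'

# \d: individual digits
digits = re.findall(r'\d', s)

# \S+: runs of non-whitespace characters
chunks = re.findall(r'\S+', s)

# \b\w+\b: whole words; on str patterns \w also matches Chinese characters
words = re.findall(r'\b\w+\b', s)

# Character range for commonly used Chinese characters
chinese = re.findall(r'[\u4e00-\u9fa5]+', s)

# \A and \Z anchor to the very start and very end of the string
starts_with_python = re.search(r'\APython', s) is not None
ends_with_day = re.search(r'day\Z', s) is not None

print(digits, chunks, words, chinese, starts_with_python, ends_with_day)
```

Passing flags=re.ASCII restricts \w, \d, and \s to the ASCII sets shown in the table.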

Regex functions (the code below exercises re.match, re.search, re.findall, and re.sub):


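Run the code yourself to deepen your understanding. The .group() calls make the output easier to read; without them, the Match object itself is printed.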

import re

s = 'Istudy study Python3.8 every day'
print('---- match: matches only at the start of the string ----')
print(re.match(r'I', s).group())
print(re.match(r'\w', s).group())
print(re.match(r'.', s).group())

print('---- search: scans the string and returns the first match ----')
print(re.search(r'study', s).group())
print(re.search(r's\w', s).group())

print('---- findall: scans the string and returns every match ----')
print(re.findall(r'y', s))  # the result is a list
print(re.findall(r'Python', s))
print(re.findall(r'P\w+.\d', s))
print(re.findall(r'P.+\d', s))

print('---- sub: replaces every match ----')
print(re.sub(r'study', 'like', s))
print(re.sub(r's\w+', 'like', s))

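4. pyquery

The pyquery library is a Python implementation of jQuery. It lets you work with a parsed HTML document using jQuery syntax, and it is both easy to use and fast at parsing. Prerequisite: some familiarity with CSS selectors and jQuery.

pyquery is not part of the Python standard library and must be installed. Install with: pip install pyquery. To check the installation, run: import pyquery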