[Python3] Web Scraping Basics: Regular Expressions

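I strongly recommend the site https://regexr.com/.

Use it to see what each part of an expression means. You will get the hang of regular expressions very quickly! I never realized regex could be this simple!

OK, let's get started.

In the previous sections we worked out how to fetch a page's content, but one step is still missing: with so much messy markup mixed in with the text, how do we extract and organize the parts we want? This section introduces a very powerful tool for the job: regular expressions!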

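1. Understanding regular expressions

A regular expression is a kind of logical formula for operating on strings. Predefined special characters, and combinations of them, form a "rule string", and this "rule string" expresses a filtering logic to apply to strings.

Regular expressions are a very powerful tool for matching strings. Other programming languages have the same concept, and Python is no exception. With regular expressions, extracting what we want from the returned page content becomes easy.

The rough matching process of a regular expression is:

1. Take characters from the expression and from the text in turn and compare them.

2. If every character matches, the match succeeds; as soon as one character fails to match, the match fails.

3. If the expression contains quantifiers or boundaries, the process is slightly different.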

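2. Regular expression syntax

Below are some of the matching rules for regular expressions in Python. The image comes from CSDN.

[Image: table of Python regular expression matching rules, from CSDN]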

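3. Notes on regular expressions

(1) Greedy and non-greedy quantifiers

Regular expressions are usually used to find matching strings in text. In Python, quantifiers are greedy by default (in a few languages the default may be non-greedy): they always try to match as many characters as possible. Non-greedy quantifiers do the opposite and always try to match as few characters as possible. For example, the regular expression "ab*" searching "abbbc" finds "abbb", while the non-greedy "ab*?" finds "a".

Note: we generally use non-greedy mode for extraction.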
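To make the difference concrete, here is a minimal sketch. The first two searches are the "ab*" and "ab*?" cases from the paragraph above; the HTML snippet is a made-up example, not taken from any real page, and only hints at why non-greedy quantifiers are usually what we want when scraping.

```python
import re

text = 'abbbc'

# Greedy: b* takes as many 'b' characters as it can.
greedy = re.search(r'ab*', text).group()

# Non-greedy: b*? takes as few as it can, which here is zero.
lazy = re.search(r'ab*?', text).group()

# Hypothetical HTML fragment, used only for illustration.
html = '<b>one</b> and <b>two</b>'

# Greedy .* runs to the LAST </b>, swallowing everything in between.
greedy_tags = re.findall(r'<b>(.*)</b>', html)

# Non-greedy .*? stops at the FIRST </b> after each <b>.
lazy_tags = re.findall(r'<b>(.*?)</b>', html)

print(greedy, lazy)   # abbb a
print(greedy_tags)    # ['one</b> and <b>two']
print(lazy_tags)      # ['one', 'two']
```

With the greedy version the two tags collapse into a single oversized match, which is rarely useful for extraction.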

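(2) The backslash problem

As in most programming languages, regular expressions use "\" as the escape character, which can lead to backslash trouble. If you need to match a literal "\" in the text, the regular expression written as an ordinary string literal needs four backslashes, "\\\\". The first pair and the second pair are each turned into one backslash by the programming language, and the resulting two backslashes are then interpreted by the regular expression engine as one escaped backslash.

Python's raw strings solve this nicely. The regular expression in this example can be written as r"\\". Likewise, "\\d", which matches a digit, can be written as r"\d". With raw strings you no longer have to worry about missing a backslash, and the expressions you write are easier to read.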
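The two layers of escaping are easier to see in code. The sketch below assumes a sample text `C:\temp` (chosen purely for illustration) and checks that the four-backslash literal and the raw string are the same pattern.

```python
import re

# The text itself contains ONE backslash: C:\temp (7 characters).
text = 'C:\\temp'

# Ordinary string literal: four backslashes in source code.
plain = re.search('\\\\', text)

# Raw string literal: two backslashes in source code.
raw = re.search(r'\\', text)

# Both literals are the same two-character pattern string.
same_backslash_pattern = ('\\\\' == r'\\')
same_digit_pattern = ('\\d' == r'\d')

digits = re.findall(r'\d', 'a1b22')

print(len(text))                    # 7
print(plain.start(), raw.start())   # 2 2
print(same_backslash_pattern, same_digit_pattern)  # True True
print(digits)                       # ['1', '2', '2']
```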

4. Python's re module

Python ships with the re module, which provides support for regular expressions. The main functions are listed below.

    # Returns a pattern object
    re.compile(string[, flags])
    # The functions below are used for matching
    re.match(pattern, string[, flags])
    re.search(pattern, string[, flags])
    re.split(pattern, string[, maxsplit])
    re.findall(pattern, string[, flags])
    re.finditer(pattern, string[, flags])
    re.sub(pattern, repl, string[, count])
    re.subn(pattern, repl, string[, count])

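Before going through these functions, let's introduce the concept of a pattern. A pattern can be understood as a matching template. How do we get one? Simple: we call re.compile. For example:

    pattern = re.compile(r'hello')

We pass in a raw string, compile() turns it into a pattern object, and we then use that object for the actual matching.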

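You may also have noticed another parameter, flags. Here is what it means.

The flags parameter is the matching mode. Several values can be combined with the bitwise OR operator '|' so that they take effect together, for example re.I | re.M.

The available values are:

• re.I (full name: IGNORECASE): ignore case (the full name is given in parentheses, likewise below)
• re.M (full name: MULTILINE): multi-line mode, changes the behavior of '^' and '$' (see the figure above)
• re.S (full name: DOTALL): dot-matches-all mode, changes the behavior of '.' so that it also matches a newline
• re.L (full name: LOCALE): makes the predefined character classes \w \W \b \B \s \S depend on the current locale. In Python 3 this flag is only valid with bytes patterns.
• re.U (full name: UNICODE): makes the predefined character classes \w \W \b \B \s \S \d \D depend on the character properties defined by Unicode. In Python 3 this is already the default for str patterns.
• re.X (full name: VERBOSE): verbose mode. In this mode the regular expression can span multiple lines, whitespace is ignored, and comments can be added.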
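The flags are easiest to understand by toggling them on the same input. This is a small sketch; the sample text and the year-month pattern are invented for illustration.

```python
import re

text = 'Hello\nhello world'

# No flags: matching is case-sensitive and ^ only matches at the very start.
no_flags = re.findall(r'^hello', text)

# re.I ignores case; re.M lets ^ match at the start of every line.
both = re.findall(r'^hello', text, re.I | re.M)

# By default '.' does not match a newline; re.S changes that.
dot_default = re.search(r'Hello.hello', text)
dot_all = re.search(r'Hello.hello', text, re.S)

# re.X: whitespace in the pattern is ignored and comments are allowed.
year_month = re.compile(r'''
    (\d{4})   # year
    -         # separator
    (\d{2})   # month
''', re.X)
parts = year_month.match('2024-05').groups()

print(no_flags)             # []
print(both)                 # ['Hello', 'hello']
print(dot_default)          # None
print(repr(dot_all.group()))  # 'Hello\nhello'
print(parts)                # ('2024', '05')
```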

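The other functions mentioned above, such as re.match, are where this pattern gets used. We will go through them one by one.

Note: the flags parameter in the seven functions below likewise means the matching mode. If flags were already specified when the pattern was compiled, do not pass them again in the functions below (re rejects a nonzero flags argument alongside an already compiled pattern).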

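(1) re.match(pattern, string[, flags])

This function starts at the beginning of string (the string we want to match) and tries to match pattern, moving forward through the text. If it hits a character that cannot match, it immediately returns None. If it reaches the end of string before the pattern has been fully matched, it also returns None. Both results mean the match failed. Otherwise the pattern matches successfully, matching stops, and the rest of string is not examined. Let's look at an example.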

    __author__ = 'CQC'
    # Import the re module
    import re

    # Compile the regular expression into a Pattern object.
    # The r in front of 'hello' means "raw string".
    pattern = re.compile(r'hello')

    # Use re.match to match the text; it returns None when nothing matches
    result1 = re.match(pattern, 'hello')
    result2 = re.match(pattern, 'helloo CQC!')
    result3 = re.match(pattern, 'helo CQC!')
    result4 = re.match(pattern, 'hello CQC!')

    # If match 1 succeeded
    if result1:
        # Use the Match object to get the group information
        print(result1.group())
    else:
        print('Match 1 failed!')

    # If match 2 succeeded
    if result2:
        print(result2.group())
    else:
        print('Match 2 failed!')

    # If match 3 succeeded
    if result3:
        print(result3.group())
    else:
        print('Match 3 failed!')

    # If match 4 succeeded
    if result4:
        print(result4.group())
    else:
        print('Match 4 failed!')

Output

    hello
    hello
    Match 3 failed!
    hello

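Match analysis

1. First match: the pattern is 'hello' and the target string is also hello. They match completely from start to end, so the match succeeds.

2. Second match: the string is helloo CQC!. Matching from the start of the string, the pattern matches in full. Once the pattern is exhausted the matching stops, the remaining o CQC! is not examined, and a successful match is returned.

3. Third match: the string is helo CQC!. Matching from the start of the string, the pattern cannot continue when it reaches 'o' (it expects a second 'l'), so matching stops and None is returned.

4. Fourth match: same principle as the second match. Running into a space afterwards makes no difference.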

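We also saw result.group() being printed at the end. What does it mean? Let's go over the attributes and methods of the match object.

A Match object is the result of one match. It holds a lot of information about that match, which you can read through the attributes and methods Match provides.

Attributes:

1. string: the text used for matching.

2. re: the Pattern object used for matching.

3. pos: the index in the text where the regular expression starts searching. Its value is the same as the parameter of the same name in Pattern.match() and Pattern.search().

4. endpos: the index in the text where the regular expression stops searching. Its value is the same as the parameter of the same name in Pattern.match() and Pattern.search().

5. lastindex: the index (that is, the group number) of the last captured group. None if no group was captured.

6. lastgroup: the name of the last captured group. None if that group has no name or if no group was captured.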

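Methods:

1. group([group1, …]):

Returns the string captured by one or more groups. When several arguments are given, the result is a tuple. group1 can be a number or a name. Number 0 stands for the whole matched substring, and with no arguments the method returns group(0). A group that captured nothing returns None, and a group that captured several times returns the last captured substring.

2. groups([default]):

Returns the strings captured by all groups as a tuple. Equivalent to calling group(1, 2, … last). default is the value substituted for groups that captured nothing; it defaults to None.

3. groupdict([default]):

Returns a dictionary whose keys are the names of the named groups and whose values are the substrings those groups captured. Groups without a name are not included. default means the same as above.

4. start([group]):

Returns the start index in string of the substring captured by the given group (the index of the substring's first character). group defaults to 0.

5. end([group]):

Returns the end index in string of the substring captured by the given group (the index of the substring's last character + 1). group defaults to 0.

6. span([group]):

Returns (start(group), end(group)).

7. expand(template):

Substitutes the captured groups into template and returns the result. In template you can refer to groups with \id, \g<id>, or \g<name>. \0 is not a group reference; use \g<0> to refer to the whole match. \id and \g<id> are equivalent, but \10 is read as group 10. If you mean \1 followed by the character '0', you have to write \g<1>0.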

Let's get a feel for this with an example.

    # A simple match example
    import re

    # Match the following: word + space + word + any characters
    m = re.match(r'(\w+) (\w+)(?P<sign>.*)', 'hello world!')

    print("m.string:", m.string)
    print("m.re:", m.re)
    print("m.pos:", m.pos)
    print("m.endpos:", m.endpos)
    print("m.lastindex:", m.lastindex)
    print("m.lastgroup:", m.lastgroup)
    print("m.group():", m.group())
    print("m.group(1,2):", m.group(1, 2))
    print("m.groups():", m.groups())
    print("m.groupdict():", m.groupdict())
    print("m.start(2):", m.start(2))
    print("m.end(2):", m.end(2))
    print("m.span(2):", m.span(2))
    print(r"m.expand(r'\2 \1\3'):", m.expand(r'\2 \1\3'))

    ### output ###
    # m.string: hello world!
    # m.re: (the compiled Pattern object; its repr depends on the Python version)
    # m.pos: 0
    # m.endpos: 12
    # m.lastindex: 3
    # m.lastgroup: sign
    # m.group(): hello world!
    # m.group(1,2): ('hello', 'world')
    # m.groups(): ('hello', 'world', '!')
    # m.groupdict(): {'sign': '!'}
    # m.start(2): 6
    # m.end(2): 11
    # m.span(2): (6, 11)
    # m.expand(r'\2 \1\3'): world hello!
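Two details of the Match methods are easy to miss: the default argument of groups() for groups that captured nothing, and the \g<1>0 form in expand(). The following sketch illustrates both; the patterns and the group name `second` are made up for this example.

```python
import re

# One numbered group and one named group (the name 'second' is illustrative).
m = re.match(r'(\w+) (?P<second>\w+)', 'hello world')

# \g<1>0 means "group 1 followed by the character 0".
group1_then_zero = m.expand(r'\g<1>0')

# Named and numbered references can be mixed in one template.
swapped = m.expand(r'\g<second> \1')

# With alternation, only one of the two groups takes part in the match.
m2 = re.match(r'(a)|(b)', 'b')
raw_groups = m2.groups()       # unmatched group is None
filled_groups = m2.groups('')  # unmatched group replaced by the default

print(group1_then_zero)   # hello0
print(swapped)            # world hello
print(raw_groups)         # (None, 'b')
print(filled_groups)      # ('', 'b')
print(m2.lastindex)       # 2
```

Note that lastindex is 2 here because group 2 is the last group that captured something; it is a group number, not a position in the text.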

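(2) re.search(pattern, string[, flags])

search is very similar to match. The difference is that match() only checks whether the regular expression matches at the beginning of string, while search() scans the whole string looking for a match. match() only returns a result if the match succeeds at position 0; if the match would only succeed somewhere else, match() returns None. The object returned by search has the same methods and attributes as the object returned by match(). Let's see an example.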

    # Import the re module
    import re

    # Compile the regular expression into a Pattern object
    pattern = re.compile(r'world')
    # Use search() to find a matching substring; returns None if there is none
    # In this example match() would not succeed
    match = re.search(pattern, 'hello world!')
    if match:
        # Use the Match object to get the group information
        print(match.group())
    ### Output ###
    # world
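To see the contrast directly, here is a short sketch that runs both functions on the same input used in the example above.

```python
import re

pattern = re.compile(r'world')
text = 'hello world!'

# match() only tries position 0, where the text is 'h', so it fails.
from_start = re.match(pattern, text)

# search() scans forward and finds 'world' at index 6.
anywhere = re.search(pattern, text)

print(from_start)        # None
print(anywhere.group())  # world
print(anywhere.span())   # (6, 11)
```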

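(3) re.split(pattern, string[, maxsplit])

Splits string at every substring that matches and returns the pieces as a list. maxsplit sets the maximum number of splits; if it is not given, the string is split at every match. Let's see the example below.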

    import re

    pattern = re.compile(r'\d+')
    print(re.split(pattern, 'one1two2three3four4'))

    ### Output ###
    # ['one', 'two', 'three', 'four', '']
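The maxsplit parameter described above is not exercised by that example, so here is a minimal sketch of it on the same string. Dropping the trailing empty string with a list comprehension is just one possible cleanup, not something re does for you.

```python
import re

pattern = re.compile(r'\d+')
text = 'one1two2three3four4'

# Split at most twice; the remainder is returned unsplit as the last item.
limited = re.split(pattern, text, maxsplit=2)

# A trailing match produces an empty final string; filter it out if unwanted.
non_empty = [part for part in re.split(pattern, text) if part]

print(limited)    # ['one', 'two', 'three3four4']
print(non_empty)  # ['one', 'two', 'three', 'four']
```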

(4) re.findall(pattern, string[, flags])

Searches string and returns all matching substrings as a list. Let's see an example.

    import re

    pattern = re.compile(r'\d+')
    print(re.findall(pattern, 'one1two2three3four4'))

    ### Output ###
    # ['1', '2', '3', '4']
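One behavior worth knowing when scraping: if the pattern contains capture groups, findall returns the groups rather than the whole match. The sketch below uses a made-up HTML fragment to show this.

```python
import re

# Hypothetical HTML fragment, used only for illustration.
html = '<a href="/a">A</a><a href="/b">B</a>'

# One group: findall returns a list of that group's captures.
links = re.findall(r'href="(.*?)"', html)

# Two groups: findall returns a list of tuples, one tuple per match.
pairs = re.findall(r'<a href="(.*?)">(.*?)</a>', html)

print(links)  # ['/a', '/b']
print(pairs)  # [('/a', 'A'), ('/b', 'B')]
```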

(5) re.finditer(pattern, string[, flags])

Searches string and returns an iterator that yields each match result (a Match object) in order. Let's see the example below.

    import re

    pattern = re.compile(r'\d+')
    for m in re.finditer(pattern, 'one1two2three3four4'):
        print(m.group(), end=' ')

    ### Output ###
    # 1 2 3 4
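Because finditer yields Match objects instead of plain strings, position information is available for each hit. A minimal sketch on the same input:

```python
import re

pattern = re.compile(r'\d+')
text = 'one1two2three3four4'

# Collect each matched digit together with its (start, end) span.
hits = [(m.group(), m.span()) for m in re.finditer(pattern, text)]

print(hits)
# [('1', (3, 4)), ('2', (7, 8)), ('3', (13, 14)), ('4', (18, 19))]
```

This is the main reason to prefer finditer over findall when you need to know where in the text each match occurred.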

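(6) re.sub(pattern, repl, string[, count])

Replaces every matching substring in string with repl and returns the resulting string.

When repl is a string, it can refer to groups with \id, \g<id>, or \g<name>. \0 is not a group reference; use \g<0> for the whole match.

When repl is a function, it should accept exactly one argument (a Match object) and return the string to use as the replacement (the returned string cannot contain further group references).

count sets the maximum number of replacements; if it is not given, every match is replaced.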

    import re

    pattern = re.compile(r'(\w+) (\w+)')
    s = 'i say, hello world!'

    print(re.sub(pattern, r'\2 \1', s))

    def func(m):
        return m.group(1).title() + ' ' + m.group(2).title()

    print(re.sub(pattern, func, s))

    ### output ###
    # say i, world hello!
    # I Say, Hello World!
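The count parameter and the \g<...> reference forms are not covered by that example, so here is a small supplementary sketch. The input strings and the group names `first` and `second` are invented for illustration.

```python
import re

# count limits how many matches are replaced (left to right).
first_two = re.sub(r'\d+', '#', 'a1b22c333', count=2)

# Named groups can be referenced with \g<name> in the replacement.
swapped = re.sub(r'(?P<first>\w+) (?P<second>\w+)',
                 r'\g<second> \g<first>', 'hello world')

# \g<0> stands for the whole match.
bracketed = re.sub(r'\d+', r'[\g<0>]', 'a1b22')

# A function replacement receives the Match object and returns a string.
doubled = re.sub(r'\d+', lambda m: str(int(m.group()) * 2), 'a1b22')

print(first_two)  # a#b#c333
print(swapped)    # world hello
print(bracketed)  # a[1]b[22]
print(doubled)    # a2b44
```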

(7) re.subn(pattern, repl, string[, count])

Returns (sub(repl, string[, count]), number of replacements made).

    import re

    pattern = re.compile(r'(\w+) (\w+)')
    s = 'i say, hello world!'

    print(re.subn(pattern, r'\2 \1', s))

    def func(m):
        return m.group(1).title() + ' ' + m.group(2).title()

    print(re.subn(pattern, func, s))

    ### output ###
    # ('say i, world hello!', 2)
    # ('I Say, Hello World!', 2)

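5. Another way to use Python's re module

Above we introduced seven utility functions such as match and search, always calling them as re.match, re.search, and so on. There is another way to call them: as methods of the pattern object, pattern.match, pattern.search, and so on. Called this way, the pattern no longer has to be passed as the first argument. Use whichever style you prefer.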

Function API list (Pattern method | module-level function)

    match(string[, pos[, endpos]])    | re.match(pattern, string[, flags])
    search(string[, pos[, endpos]])   | re.search(pattern, string[, flags])
    split(string[, maxsplit])         | re.split(pattern, string[, maxsplit])
    findall(string[, pos[, endpos]])  | re.findall(pattern, string[, flags])
    finditer(string[, pos[, endpos]]) | re.finditer(pattern, string[, flags])
    sub(repl, string[, count])        | re.sub(pattern, repl, string[, count])
    subn(repl, string[, count])       | re.subn(pattern, repl, string[, count])

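There is no need to go through each call in detail. The principle is the same, only the parameters differ. Give it a try!

Keep at it, everyone. Even if this section left you a bit lost, that's fine: next we will work through some hands-on examples to help you get comfortable with regular expressions.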
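As one way to try this out, the sketch below calls the Pattern methods directly and uses the pos and endpos parameters, which only exist on the method form. The inputs are the same sample strings used earlier in this section.

```python
import re

pattern = re.compile(r'world')
text = 'hello world!'

# Same result as re.match(pattern, text): nothing matches at position 0.
at_zero = pattern.match(text)

# pos=6 makes the match attempt start at index 6, where 'world' begins.
at_six = pattern.match(text, 6)

# endpos=8 limits the search to text[:8] ('hello wo'), so nothing is found.
cut_short = pattern.search(text, 0, 8)

# findall with pos=4 skips the '1' at index 3.
digits = re.compile(r'\d+')
later_digits = digits.findall('one1two2three3four4', 4)

print(at_zero)         # None
print(at_six.group())  # world
print(cut_short)       # None
print(later_digits)    # ['2', '3', '4']
```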