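Python Regular Expression Guide

This article introduces Python's support for regular expressions, covering regular expression basics and a complete tour of Python's regular expression standard library with usage examples. It does not cover how to write efficient regular expressions or how to optimize them; see other tutorials for those topics. Note: this article was written against Python 2.4. If you run into an unfamiliar term, look it up on Baidu, Google, or Wikipedia, whatever. Please respect the author's work: when reposting, credit the author and give the address of the original >.<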

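Regular expressions are not part of Python itself. They are a powerful tool for processing strings, with their own distinct syntax and a separate processing engine. They may be slower than the built-in str methods, but they are far more capable. Because of this independence, the core syntax is essentially the same in every language that provides regular expressions; the difference lies only in how much of the syntax each language's implementation supports. There is no need to worry, though: the unsupported parts are usually the rarely used ones. If you have already used regular expressions in another language, a quick skim is enough to get started.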

[Figure: the flow of matching text with a regular expression]

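Roughly, matching works like this: characters are taken in turn from the expression and from the text and compared. If every character matches, the match succeeds; as soon as one character fails to match, the match fails. When the expression contains quantifiers or boundary matchers the process differs slightly, but it is still easy to understand. The examples in the figure below and a little practice of your own will make it clear.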

[Figure: table of the regular expression metacharacters and syntax supported by Python]

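Regular expressions are typically used to find matching strings in text. In Python, quantifiers are greedy by default (a few languages may default to non-greedy): they always try to match as many characters as possible. Non-greedy quantifiers do the opposite and try to match as few characters as possible. For example, searching "abbbc" with the regular expression "ab*" finds "abbb", whereas the non-greedy "ab*?" finds "a".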
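The greedy and non-greedy behavior described above can be checked with a short sketch. The variable names are illustrative, and the code uses single-argument print() calls so that it should run on Python 3 as well as the Python 2 this article was written for.

```python
import re

text = "abbbc"

# Greedy: b* consumes as many 'b' characters as it can.
greedy = re.search(r"ab*", text).group()

# Non-greedy: b*? consumes as few 'b' characters as it can (here, none).
lazy = re.search(r"ab*?", text).group()

print(greedy)  # abbb
print(lazy)    # a
```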

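As in most programming languages, regular expressions use "\" as the escape character, which can lead to a plague of backslashes. Suppose you need to match a literal "\" in the text: the regular expression written as an ordinary string literal needs four backslashes, "\\\\". The first two and the last two each become one backslash after string-literal escaping, and the resulting two backslashes are then interpreted by the regular expression engine as one literal backslash. Python's raw strings solve this neatly: the expression in this example can be written r"\\". Likewise, "\\d", which matches a digit, can be written r"\d". With raw strings you no longer need to worry about a missing backslash, and the expressions are easier to read.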
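As a small illustration of the backslash problem, the sketch below matches one literal backslash with and without a raw string. The sample text and variable names are assumptions made for the example.

```python
import re

text = "C:\\temp"  # the actual text is C:\temp, with a single backslash

# Ordinary string literal: four backslashes are needed to match one.
plain = re.search("\\\\", text)

# Raw string: two backslashes are enough.
raw = re.search(r"\\", text)

print(plain.span())  # (2, 3)
print(raw.span())    # (2, 3)

# The same applies to \d: "\\d" and r"\d" are the same pattern.
digit_plain = re.search("\\d", "a1").group()
digit_raw = re.search(r"\d", "a1").group()
print(digit_plain == digit_raw)  # True
```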

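Regular expressions provide several matching modes, such as case-insensitive and multi-line matching. These are covered together with re.compile(pattern[, flags]), the factory method of the Pattern class.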

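Python supports regular expressions through the re module. The usual workflow is to compile the string form of the regular expression into a Pattern instance, use that instance to process the text and obtain a match result (a Match instance), and finally use the Match instance to extract information and carry on with other work.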

<code># encoding: UTF-8</code>

<code>import re</code>

<code># Compile the regular expression into a Pattern object</code>

<code>pattern = re.compile(r'hello')</code>

<code># Match the text with the Pattern; the result is None when nothing matches</code>

<code>match = pattern.match('hello world!')</code>

<code>if match:</code>

<code>    # Use the Match object to get group information</code>

<code>    print match.group()</code>

<code># hello</code>

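re.compile(strPattern[, flag]):

This is the factory method of the Pattern class; it compiles a regular expression given as a string into a Pattern object.

The second argument, flag, is the matching mode. Values can be combined with the bitwise OR operator '|' so that they take effect together, for example re.I | re.M. You can also specify the mode inside the regex string itself: re.compile('pattern', re.I | re.M) is equivalent to re.compile('(?im)pattern').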
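The equivalence between flags passed as an argument and flags written inline can be sketched as follows. The sample text and variable names are assumptions chosen for the illustration.

```python
import re

text = "Hello\nhello world"

# Modes given as the second argument to re.compile().
by_argument = re.compile(r"^hello", re.I | re.M)

# The same modes written inline at the start of the pattern.
by_inline = re.compile(r"(?im)^hello")

result_argument = by_argument.findall(text)
result_inline = by_inline.findall(text)

print(result_argument)  # ['Hello', 'hello']
print(result_inline)    # ['Hello', 'hello']
```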

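The available values are:

re.I (re.IGNORECASE): ignore case (the full name is given in parentheses; the same applies below)

re.M (re.MULTILINE): multi-line mode; changes the behavior of '^' and '$' (see the figure above)

re.S (re.DOTALL): dot-matches-all mode; changes the behavior of '.'

re.L (re.LOCALE): makes the predefined character classes \w \W \b \B \s \S depend on the current locale

re.U (re.UNICODE): makes the predefined character classes \w \W \b \B \s \S \d \D depend on the character properties defined by Unicode

re.X (re.VERBOSE): verbose mode. In this mode the regular expression can span multiple lines, whitespace is ignored, and comments can be added. The following two regular expressions are equivalent: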

<code>a = re.compile(r"""\d +  # the integral part</code>

<code>                   \.    # the decimal point</code>

<code>                   \d *  # some fractional digits""", re.X)</code>

<code>b = re.compile(r"\d+\.\d*")</code>
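To make the effect of the multi-line and dot-matches-all modes from the list above concrete, here is a minimal sketch. The two-line sample text and the variable names are assumptions.

```python
import re

text = "first line\nsecond line"

# Without re.M, '^' only matches at the very start of the string.
starts_default = re.findall(r"^\w+", text)

# With re.M, '^' also matches right after each newline.
starts_multiline = re.findall(r"^\w+", text, re.M)

# Without re.S, '.' does not match a newline.
dot_default = re.match(r".*", text).group()

# With re.S, '.' matches the newline as well.
dot_all = re.match(r".*", text, re.S).group()

print(starts_default)    # ['first']
print(starts_multiline)  # ['first', 'second']
print(dot_default)       # first line
print(dot_all == text)   # True
```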

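The re module provides many module-level functions that do the work of regular expressions. Each can be replaced by the corresponding method of a Pattern instance; their only advantage is saving one line of re.compile() code, at the cost of not being able to reuse the compiled Pattern object. These functions are covered together with the instance methods of the Pattern class. The example above can be shortened to: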

<code>m = re.match(r'hello', 'hello world!')</code>

<code>print m.group()</code>

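The re module also provides escape(string), which returns string with an escape character inserted before regular expression metacharacters such as *, + and ?. It comes in handy when you need to match text that contains many metacharacters.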
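A short sketch of re.escape() in use follows. The literal and the sample sentence are assumptions for the illustration. The exact escaped string differs slightly between Python versions, so only the match results are shown.

```python
import re

literal = "1+1=2?"
sentence = "is 1+1=2? yes"

# Used as-is, '+' and '?' act as quantifiers, so the literal text is not found.
unescaped = re.search(literal, sentence)

# re.escape() puts a backslash before the metacharacters first.
escaped = re.search(re.escape(literal), sentence)

print(unescaped)        # None
print(escaped.group())  # 1+1=2?
```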

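A Match object is the result of one match and holds a lot of information about it. This information is available through the readable attributes and methods that Match provides.

Attributes:

string: the text used for matching.

re: the Pattern object used for matching.

pos: the index in the text at which the regular expression started searching. Its value is the same as the parameter of the same name passed to Pattern.match() and Pattern.search().

endpos: the index in the text at which the regular expression stopped searching. Its value is the same as the parameter of the same name passed to Pattern.match() and Pattern.search().

lastindex: the index (group number) of the last captured group. None if no group was captured.

lastgroup: the name of the last captured group. None if that group has no name or if no group was captured.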

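Methods:

group([group1, …]):

Returns the string(s) captured by one or more groups; when several arguments are given, the result is a tuple. group1 may be a number or a name. Number 0 stands for the whole matched substring, and with no arguments group(0) is returned. A group that captured nothing returns None; a group that captured several times returns the last substring it captured.

groups([default]):

Returns the strings captured by all groups as a tuple, equivalent to calling group(1, 2, …, last). default is the value substituted for groups that captured nothing; it defaults to None.

groupdict([default]):

Returns a dictionary whose keys are the names of the named groups and whose values are the substrings those groups captured; unnamed groups are not included. default has the same meaning as above.

start([group]):

Returns the start index in string of the substring captured by the given group (the index of its first character). group defaults to 0.

end([group]):

Returns the end index in string of the substring captured by the given group (the index of its last character + 1). group defaults to 0.

span([group]):

Returns (start(group), end(group)).

expand(template):

Substitutes the matched groups into template and returns the result. template can refer to groups with \id, \g<id> or \g<name>. \0 cannot be used to refer to the whole match; write \g<0> for that. \id and \g<id> are equivalent, but \10 is taken to mean group 10; to express \1 followed by the character '0', you must write \g<1>0.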

<code>m = re.match(r'(\w+) (\w+)(?P<sign>.*)', 'hello world!')</code>

<code>print "m.string:", m.string</code>

<code>print "m.re:", m.re</code>

<code>print "m.pos:", m.pos</code>

<code>print "m.endpos:", m.endpos</code>

<code>print "m.lastindex:", m.lastindex</code>

<code>print "m.lastgroup:", m.lastgroup</code>

<code>print "m.group(1,2):", m.group(1, 2)</code>

<code>print "m.groups():", m.groups()</code>

<code>print "m.groupdict():", m.groupdict()</code>

<code>print "m.start(2):", m.start(2)</code>

<code>print "m.end(2):", m.end(2)</code>

<code>print "m.span(2):", m.span(2)</code>

<code>print r"m.expand(r'\2 \1\3'):", m.expand(r'\2 \1\3')</code>

<code>### output ###</code>

<code># m.string: hello world!</code>

<code># m.re: (repr of the compiled Pattern object; varies from run to run)</code>

<code># m.pos: 0</code>

<code># m.endpos: 12</code>

<code># m.lastindex: 3</code>

<code># m.lastgroup: sign</code>

<code># m.group(1,2): ('hello', 'world')</code>

<code># m.groups(): ('hello', 'world', '!')</code>

<code># m.groupdict(): {'sign': '!'}</code>

<code># m.start(2): 6</code>

<code># m.end(2): 11</code>

<code># m.span(2): (6, 11)</code>

<code># m.expand(r'\2 \1\3'): world hello!</code>
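Two details of the Match methods described above are easy to get wrong: the \g<1>0 form in expand() templates, and what groups() returns for a group that captured nothing. The sketch below illustrates both; the second pattern and its input "42" are assumptions added for the illustration.

```python
import re

m = re.match(r"(\w+) (\w+)(?P<sign>.*)", "hello world!")

# \g<1>0 means "group 1 followed by the character 0"; \10 would mean group 10.
expanded = m.expand(r"\g<1>0 \g<sign>")
print(expanded)  # hello0 !

# An optional group that took no part in the match yields None,
# or the default value passed to groups().
m2 = re.match(r"(\d+)(\.\d+)?", "42")
print(m2.groups())       # ('42', None)
print(m2.groups("n/a"))  # ('42', 'n/a')
print(m2.lastindex)      # 1
```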

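A Pattern object is a compiled regular expression; the methods it provides let you match and search text.

Pattern cannot be instantiated directly; it must be constructed with re.compile().

Pattern provides several readable attributes that give information about the expression:

pattern: the expression string used at compile time.

flags: the matching mode used at compile time, as a number.

groups: the number of groups in the expression.

groupindex: a dictionary whose keys are the names of the named groups in the expression and whose values are the corresponding group numbers; unnamed groups are not included.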

<code>p = re.compile(r'(\w+) (\w+)(?P<sign>.*)', re.DOTALL)</code>

<code>print "p.pattern:", p.pattern</code>

<code>print "p.flags:", p.flags</code>

<code>print "p.groups:", p.groups</code>

<code>print "p.groupindex:", p.groupindex</code>

<code># p.pattern: (\w+) (\w+)(?P<sign>.*)</code>

<code># p.flags: 16</code>

<code># p.groups: 3</code>

<code># p.groupindex: {'sign': 3}</code>

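Instance methods [ | re module functions]:

match(string[, pos[, endpos]]) | re.match(pattern, string[, flags]):

This method tries to match pattern starting at index pos of string. If the end of pattern is reached and everything has matched, a Match object is returned. If pattern fails to match along the way, or endpos is reached before the match is complete, None is returned.

pos and endpos default to 0 and len(string) respectively. re.match() cannot take these two parameters; its flags parameter specifies the matching mode used when compiling pattern.

Note: this method does not perform a full match. If string still has characters left when pattern is exhausted, the match is still considered successful. To require a full match, add the boundary matcher '$' to the end of the expression.

For an example, see Section 2.1 (the introductory re.compile() example above).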
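The note about partial matching can be illustrated with a minimal sketch that compares a pattern with and without a trailing '$'. The pattern names are illustrative assumptions.

```python
import re

pattern_open = re.compile(r"hello")
pattern_anchored = re.compile(r"hello$")

# match() succeeds even though ' world!' is left over.
open_result = pattern_open.match("hello world!") is not None

# With '$' at the end, leftover characters make the match fail.
anchored_result = pattern_anchored.match("hello world!") is not None

# The anchored pattern still matches when the text ends right after 'hello'.
exact_result = pattern_anchored.match("hello") is not None

print(open_result)      # True
print(anchored_result)  # False
print(exact_result)     # True
```

Keep in mind that '$' also matches just before a trailing newline, so "hello\n" would still be accepted by the anchored pattern.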

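search(string[, pos[, endpos]]) | re.search(pattern, string[, flags]):

This method looks for a substring of string that matches. It tries to match pattern starting at index pos of string. If the end of pattern is reached and everything has matched, a Match object is returned. If the match fails, pos is increased by 1 and the match is tried again; if there is still no match when pos = endpos, None is returned.

pos and endpos default to 0 and len(string) respectively. re.search() cannot take these two parameters; its flags parameter specifies the matching mode used when compiling pattern.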

<code>pattern = re.compile(r'world')</code>

<code># Use search() to find a matching substring; None is returned when there is none</code>

<code># match() would not succeed in this example</code>

<code>match = pattern.search('hello world!')</code>

<code>if match:</code>

<code>    print match.group()</code>

<code># world</code>
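The pos and endpos parameters of Pattern.search() restrict where the search takes place. A small sketch, with a pattern and positions chosen purely for illustration:

```python
import re

pattern = re.compile(r"o")
text = "hello world!"

first = pattern.search(text)          # finds the 'o' in 'hello'
later = pattern.search(text, 5)       # starts looking at index 5
limited = pattern.search(text, 5, 7)  # only looks at text[5:7], which is ' w'

print(first.start())  # 4
print(later.start())  # 7
print(limited)        # None
```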

split(string[, maxsplit]) | re.split(pattern, string[, maxsplit]):

Splits string at every substring that matches and returns the pieces as a list. maxsplit specifies the maximum number of splits; if it is not given, string is split at every match.

<code>p = re.compile(r'\d+')</code>

<code>print p.split('one1two2three3four4')</code>

<code># ['one', 'two', 'three', 'four', '']</code>
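The effect of maxsplit is not shown in the example above, so here is a minimal sketch using the same input. Passing maxsplit by keyword is a choice made for this sketch.

```python
import re

p = re.compile(r"\d+")

# Split at most twice; the remainder is returned unsplit as the last item.
parts = p.split("one1two2three3four4", maxsplit=2)

print(parts)  # ['one', 'two', 'three3four4']
```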

findall(string[, pos[, endpos]]) | re.findall(pattern, string[, flags]):

Searches string and returns all matching substrings as a list.

<code>print p.findall('one1two2three3four4')</code>

<code># ['1', '2', '3', '4']</code>
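One behavior worth knowing, although the description above does not mention it, is that findall() returns the captured groups rather than the whole match when the pattern contains groups. The patterns below are assumptions chosen to show this.

```python
import re

text = "one1two2three3four4"

# No groups: the whole matches are returned.
whole = re.findall(r"\d+", text)

# One group: only what that group captured is returned.
single = re.findall(r"[a-z]+(\d)", text)

# Several groups: each item is a tuple of the captured groups.
pairs = re.findall(r"([a-z]+)(\d)", text)

print(whole)   # ['1', '2', '3', '4']
print(single)  # ['1', '2', '3', '4']
print(pairs)   # [('one', '1'), ('two', '2'), ('three', '3'), ('four', '4')]
```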

finditer(string[, pos[, endpos]]) | re.finditer(pattern, string[, flags]):

Searches string and returns an iterator that yields each match result (a Match object) in order.

<code>for m in p.finditer('one1two2three3four4'):</code>

<code>    print m.group(),</code>

<code># 1 2 3 4</code>
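Because finditer() yields Match objects, it gives access to positions as well as the matched text. A brief sketch, where the list name spans is an illustrative assumption:

```python
import re

p = re.compile(r"\d+")

# Collect each matched substring together with its (start, end) span.
spans = [(m.group(), m.span()) for m in p.finditer("one1two2three3four4")]

print(spans)  # [('1', (3, 4)), ('2', (7, 8)), ('3', (13, 14)), ('4', (18, 19))]
```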

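sub(repl, string[, count]) | re.sub(pattern, repl, string[, count]):

Replaces every matching substring of string with repl and returns the resulting string.

When repl is a string, it can refer to groups with \id, \g<id> or \g<name>. \0 cannot be used to refer to the whole match; write \g<0> for that.

When repl is a function, it should accept a single argument (a Match object) and return the replacement string (the returned string cannot itself refer to groups).

count specifies the maximum number of replacements; if it is not given, every match is replaced.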

<code>p = re.compile(r'(\w+) (\w+)')</code>

<code>s = 'i say, hello world!'</code>

<code>print p.sub(r'\2 \1', s)</code>

<code>def func(m):</code>

<code>    return m.group(1).title() + ' ' + m.group(2).title()</code>

<code>print p.sub(func, s)</code>

<code># say i, world hello!</code>

<code># I Say, Hello World!</code>
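The count parameter and the \g<name> and \g<0> references are not exercised by the example above. The sketch below covers them; the group names first and second are assumptions introduced for the illustration.

```python
import re

p = re.compile(r"(?P<first>\w+) (?P<second>\w+)")
s = "i say, hello world!"

# Replace only the first match, referring to groups by name.
swapped_once = p.sub(r"\g<second> \g<first>", s, count=1)

# \g<0> stands for the whole match.
bracketed = p.sub(r"[\g<0>]", s)

print(swapped_once)  # say i, hello world!
print(bracketed)     # [i say], [hello world]!
```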

subn(repl, string[, count]) | re.subn(pattern, repl, string[, count]):

Returns (sub(repl, string[, count]), number of replacements).

<code>print p.subn(r'\2 \1', s)</code>

<code>print p.subn(func, s)</code>

<code># ('say i, world hello!', 2)</code>

<code># ('I Say, Hello World!', 2)</code>

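That is Python's support for regular expressions. Being fluent in regular expressions is a skill every programmer needs; hardly any program these days gets by without handling strings. The author is still at the beginner stage too, so let us encourage each other ^_^

Also, no examples were given for the special constructs listed in the figure; regular expressions that use them are somewhat difficult. If you are interested, think about how to match words that do not begin with abc ^_^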
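One possible answer to the exercise, offered as a sketch rather than as the author's intended solution, uses a negative lookahead right after a word boundary. The sample text is an assumption.

```python
import re

# \b       : position at the start of a word
# (?!abc)  : fail if the next three characters are 'abc'
# \w+      : otherwise consume the whole word
pattern = re.compile(r"\b(?!abc)\w+")

text = "abcdef xyz abc zabc ab"
words = pattern.findall(text)

print(words)  # ['xyz', 'zabc', 'ab']
```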

End of article

Reposted from: http://www.cnblogs.com/huxi/archive/2010/07/04/1771073.html
