Regular Expression Study Notes

Regular expressions are useful in web crawlers and in many other applications.

1. Common matching patterns

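Pattern        Description
\w             Matches a letter, digit, or underscore
\W             Matches any character that is not a letter, digit, or underscore
\s             Matches any whitespace character; for ASCII text, equivalent to [ \t\n\r\f\v]
\S             Matches any non-whitespace character
\d             Matches any digit; for ASCII text, equivalent to [0-9]
\D             Matches any non-digit
\A             Matches at the start of the string
\Z             Matches at the end of the string (in Python's re, only at the very end; in Perl-style engines, also before a final newline)
\z             Matches at the very end of the string (Perl-style engines; older versions of Python's re reject it)
\G             Matches at the position where the previous match ended (Perl-style engines; not supported by Python's re)
\n             Matches a newline
\t             Matches a tab
^              Matches at the start of the string
$              Matches at the end of the string, or just before a newline at the very end
.              Matches any character except a newline; with the re.DOTALL flag it also matches a newline
[...]          A set of characters listed individually: [amk] matches 'a', 'm', or 'k'
[^...]         Any character not in the set: [^abc] matches any character other than a, b, or c
*              Matches 0 or more repetitions of the preceding expression (greedy)
+              Matches 1 or more repetitions of the preceding expression (greedy)
?              Matches 0 or 1 repetition of the preceding expression; placed after another quantifier (*?, +?, ??), it makes that quantifier non-greedy
{n}            Matches exactly n repetitions of the preceding expression
{n,m}          Matches n to m repetitions of the preceding expression (greedy); write it without a space after the comma
a|b            Matches a or b
( )            Matches the expression inside the parentheses and captures it as a group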
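
A few of the patterns in the table can be tried directly with Python's re module. This is only a small sketch; the sample strings are made up for illustration and do not come from the examples below.

```python
import re

# Sample strings are illustrative only.
words = re.findall(r"\w+", "user_1 logged in")     # runs of word characters
digits = re.findall(r"\d{2,4}", "7 42 2024")       # runs of 2 to 4 digits
either = re.findall(r"cat|dog", "cat, dog, bird")  # alternation
not_abc = re.findall(r"[^abc]", "abcde")           # negated character set

print(words)    # ['user_1', 'logged', 'in']
print(digits)   # ['42', '2024']
print(either)   # ['cat', 'dog']
print(not_abc)  # ['d', 'e']
```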

2. re.match()

re.match() tries to match starting at the first character of the string. If the pattern does not match at that position, it returns None.

re.match(pattern, string, flags=0)      
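
Because re.match() can return None, it is usually worth checking the result before calling .group(). A minimal sketch, using a shortened sample string chosen for illustration:

```python
import re

text = "hfhdh 8484"

at_start = re.match(r"hfhdh", text)     # the pattern occurs at index 0
not_at_start = re.match(r"8484", text)  # the pattern occurs later, so no match

print(at_start.group())  # hfhdh
print(not_at_start)      # None

# Guard before calling .group(): None has no such method.
if not_at_start is None:
    print("no match at the start of the string")
```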

Basic matching

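import re

content = "hfhdh 8484 djfjdj dkfd 8596"

# A raw string (r"...") passes backslashes such as \s and \d to the regex engine unchanged.
result = re.match(r"^hfhdh\s\d{4}\s\w.*96$", content)
print(result)          # <_sre.SRE_Match object; span=(0, 27), match='hfhdh 8484 djfjdj dkfd 8596'> (shown as <re.Match object; ...> on Python 3.7+)
print(result.group())  # the matched text: hfhdh 8484 djfjdj dkfd 8596
print(result.span())   # start and end indices of the match: (0, 27)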

Generic matching

Use .* to match any run of characters.

import re

content="hfhdh 8484 djfjdj dkfd 8596"
result=re.match("^hfhdh.*96$",content)
print(result)
print(result.group())      

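Target matching

To extract the numbers from the string, wrap parts of the pattern in parentheses. Each pair of parentheses defines a group, and the text matched by each group can be retrieved separately.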

import re

content = "hfhdh 8484 djfjdj dkfd 8596"
result = re.match(r"^hfhdh\s(\d+).*\s(\d+)$", content)
print(result)           # <_sre.SRE_Match object; span=(0, 27), match='hfhdh 8484 djfjdj dkfd 8596'>
print(result.group())   # whole match: hfhdh 8484 djfjdj dkfd 8596
print(result.group(1))  # first group: 8484
print(result.group(2))  # second group: 8596

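Greedy matching

This is the same pattern as before, except that the \s after .* has been removed, and the result changes. .* is greedy: it consumes as many characters as it can and leaves only the single digit that (\d+) needs.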

import re

content="hfhdh 8484 djfjdj dkfd 8596"
result = re.match(r"^hfhdh\s(\d+).*(\d+)$", content)
print(result.group(2))  # 6

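Non-greedy matching

To keep .* from consuming too much, append ? to it. On its own, ? means "0 or 1 of the preceding expression"; placed after a quantifier such as * or +, it makes that quantifier non-greedy, so it matches as few characters as possible.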

import re

content="hfhdh 8484 djfjdj dkfd 8596"
result = re.match(r"^hfhdh\s(\d+).*?(\d+)$", content)
print(result.group(2))#8596      
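
The difference between greedy and non-greedy quantifiers is easiest to see side by side. A minimal sketch, assuming a made-up HTML fragment as input:

```python
import re

# Illustrative input, not taken from the examples above.
html = "<b>one</b><b>two</b>"

greedy = re.findall(r"<b>(.*)</b>", html)  # .* runs on to the last </b>
lazy = re.findall(r"<b>(.*?)</b>", html)   # .*? stops at the first </b>

print(greedy)  # ['one</b><b>two']
print(lazy)    # ['one', 'two']
```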

Match flags

What if the string contains a newline? By default . does not match a newline, so pass the re.S (re.DOTALL) flag.

import re

content = "hfhdh 8484 djfjdj \n" \
          "dkfd 8596"  # the \n puts a real newline inside the string
result = re.match(r"^hfhdh\s(\d+).*?(\d+)$", content, re.S)
print(result.group(2))  # 8596
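
To see what re.S changes, the same pattern can be run with and without the flag on a string that contains an explicit newline. This is a sketch with the newline written as \n inside the sample string:

```python
import re

content = "hfhdh 8484 djfjdj \ndkfd 8596"  # note the newline in the middle
pattern = r"^hfhdh\s(\d+).*?(\d+)$"

without_flag = re.match(pattern, content)     # . cannot cross the newline
with_flag = re.match(pattern, content, re.S)  # . now matches the newline too

print(without_flag)        # None
print(with_flag.group(2))  # 8596
```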

Escaping

If the text to match contains characters that are special in regular expressions, escape them with "\".

import re

content="This book's price is $10.00"
result = re.match(r"This book's price is \$10\.00", content, re.S)
print(result)      
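
When the literal text comes from a variable, escaping by hand is error-prone. The standard library's re.escape() can add the backslashes instead. A short sketch; the variable names are chosen for illustration:

```python
import re

content = "This book's price is $10.00"
literal = "$10.00"

escaped = re.escape(literal)              # backslash-escapes $ and .
found = re.search(escaped, content)       # searches for the literal text
unescaped = re.search(literal, content)   # here $ means "end of string"

print(escaped)        # \$10\.00
print(found.group())  # $10.00
print(unescaped)      # None
```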

3. re.search()

re.search() scans the whole string and returns the first match; the match does not have to begin at the first character.

import re

content="hfhdh 8484 djfjdj dkfd 8596"
result = re.search(r".*?(\d+).*?(\d+)$", content)
print(result)
print(result.group(1))
print(result.group(2))      
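
The contrast with re.match() shows up when the pattern does not occur at the start of the string. A minimal sketch using the same sample string:

```python
import re

content = "hfhdh 8484 djfjdj dkfd 8596"

matched = re.match(r"\d+", content)    # index 0 is a letter, so no match
searched = re.search(r"\d+", content)  # finds the first run of digits anywhere

print(matched)           # None
print(searched.group())  # 8484
print(searched.span())   # (6, 10)
```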

4. re.findall()

The functions above return a single match. To get every match in the string, use findall(), which returns all results as a list.

import re

content="""<div class="main-nav">
<a href=//new.tmall.com/?abbucket=&amp;acm=lb-zebra-12803-227044.1003.8.390455&amp;aldid=74460&amp;spm=3.7396704.20000005.d1.zNATmK&amp;abtest=&amp;scm=1003.8.lb-zebra-12803-227044.ITEM_14392515722151_390455&amp;pos=1>尚天貓</a>
<a href=//miao.tmall.com/?abbucket=&amp;acm=lb-zebra-12803-227044.1003.8.390455&amp;aldid=74460&amp;spm=3.7396704.20000005.d2.zNATmK&amp;abtest=&amp;scm=1003.8.lb-zebra-12803-227044.ITEM_14392511874662_390455&amp;pos=2>喵鮮生</a>
<a href=//vip.tmall.com/vip/index.htm?abbucket=&amp;acm=lb-zebra-12803-227044.1003.8.390455&amp;aldid=74460&amp;spm=3.7396704.20000005.d3.zNATmK&amp;abtest=&amp;scm=1003.8.lb-zebra-12803-227044.ITEM_14392523417123_390455&amp;pos=3>天貓會員</a>
<a href=//3c.tmall.com/?abbucket=&amp;acm=lb-zebra-12803-227044.1003.8.390455&amp;aldid=74460&amp;spm=3.7396704.20000005.d4.zNATmK&amp;abtest=&amp;scm=1003.8.lb-zebra-12803-227044.ITEM_14392519569634_390455&amp;pos=4>電器城</a>
<a href=//chaoshi.tmall.com/?abbucket=&amp;acm=lb-zebra-12803-227044.1003.8.390455&amp;aldid=74460&amp;spm=3.7396704.20000005.d5.zNATmK&amp;abtest=&amp;scm=1003.8.lb-zebra-12803-227044.ITEM_14392500332195_390455&amp;pos=5>天貓超市</a>
<a href=//yao.tmall.com/?abbucket=&amp;acm=lb-zebra-12803-227044.1003.8.390455&amp;aldid=74460&amp;spm=3.7396704.20000005.d6.zNATmK&amp;abtest=&amp;scm=1003.8.lb-zebra-12803-227044.ITEM_14392496484706_390455&amp;pos=6>醫藥館</a>
<a href=//www.tmall.hk/?abbucket=&amp;acm=lb-zebra-12803-227044.1003.8.390455&amp;aldid=74460&amp;spm=3.7396704.20000005.d8.zNATmK&amp;abtest=&amp;scm=1003.8.lb-zebra-12803-227044.ITEM_14392508027177_390455&amp;pos=8>天貓國際</a>
<a class="last" href=//car.tmall.com/?acm=lb-zebra-12803-227044.1003.8.390455&amp;spm=3.7396704.20000007.23.zNATmK&amp;uuid=75987&amp;abtest=&amp;scm=1003.8.lb-zebra-12803-227044.ITEM_14392504179688_390455&amp;pos=1>天貓汽車</a>
</div>"""

result = re.findall('<a.*?>(.*?)</a>', content, re.S)  # with one group, findall returns that group's text for each match; with several groups, it returns tuples
print(result)#['尚天貓', '喵鮮生', '天貓會員', '電器城', '天貓超市', '醫藥館', '天貓國際', '天貓汽車']      
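
What findall() returns depends on how many groups the pattern has. The sketch below uses a short, hypothetical HTML snippet (not the page above) to show the three cases:

```python
import re

# Hypothetical snippet used only to illustrate the return shapes.
snippet = '<a href="/a">First</a> <a href="/b">Second</a>'

no_group = re.findall(r'<a.*?</a>', snippet)                   # whole matches
one_group = re.findall(r'<a.*?>(.*?)</a>', snippet)            # list of strings
two_groups = re.findall(r'<a href="(.*?)">(.*?)</a>', snippet)  # list of tuples

print(no_group)    # ['<a href="/a">First</a>', '<a href="/b">Second</a>']
print(one_group)   # ['First', 'Second']
print(two_groups)  # [('/a', 'First'), ('/b', 'Second')]
```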

5. re.sub()

Replaces every matching substring and returns the resulting string.

import re

content="hfhdh 8484 djfjdj dkfd 8596"
result = re.sub(r"(\d+)", 'hellO', content)
print(result)#hfhdh hellO djfjdj dkfd hellO      

To keep the matched text in the replacement, refer to the group with \1:

import re

content="hfhdh 8484 djfjdj dkfd 8596"
result = re.sub(r"(\d+)", r'\1hellO', content)  # note the r prefix; \1 stands for the text captured by group 1
print(result)#hfhdh 8484hellO djfjdj dkfd 8596hellO      
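
re.sub() has a few other options that are easy to overlook: the \g<1> form of a group reference, a function as the replacement, and the count argument. A brief sketch on the same sample string:

```python
import re

content = "hfhdh 8484 djfjdj dkfd 8596"

# \g<1> is an unambiguous form of \1; r"\10" would be read as group 10.
padded = re.sub(r"(\d+)", r"\g<1>0", content)

# A function receives each match object and returns the replacement text.
doubled = re.sub(r"\d+", lambda m: str(int(m.group()) * 2), content)

# count limits how many matches are replaced.
first_only = re.sub(r"\d+", "N", content, count=1)

print(padded)      # hfhdh 84840 djfjdj dkfd 85960
print(doubled)     # hfhdh 16968 djfjdj dkfd 17192
print(first_only)  # hfhdh N djfjdj dkfd 8596
```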

6. re.compile()

Compiles a regex string into a regular expression object so that the pattern can be reused.

import re

content="hfhdh 8484 djfjdj dkfd 8596"
pattern = re.compile(r'.*?(\d+).*?(\d+)')
match = pattern.match(content)  # call match() on the compiled object once and reuse the result
print(match.group(1))  # 8484
print(match.group(2))  # 8596
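
The benefit of compiling is clearer when one pattern is applied to several strings. A minimal sketch; the list of lines is invented for illustration:

```python
import re

number = re.compile(r"\d+")  # compiled once, reused below

# Illustrative inputs.
lines = ["hfhdh 8484", "no digits here", "dkfd 8596 and 12"]

found = [number.findall(line) for line in lines]
print(found)  # [['8484'], [], ['8596', '12']]
```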

7. Hands-on practice

Scrape the url, img, and author of each book from Douban Books (豆瓣讀書).

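import requests
import re

# Fetch the Douban Books home page; the pattern below depends on the page's markup at the time of writing.
content = requests.get('https://book.douban.com').text
results = re.findall(r'<li.*?class="cover".*?a\shref="(.*?)"\stitle=".*?">.*?src="(.*?)"\sclass.*?class="author">(.*?)</div>.*?/li>', content, re.S)
for result in results:
    url, img, author = result
    author = re.sub(r'\s', '', author)  # strip whitespace inside the author field
    print(url, img, author)

Output:
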
https://book.douban.com/subject/30353889/?icn=index-latestbook-subject https://img3.doubanio.com/view/subject/m/public/s32288925.jpg (挪)奧斯娜·塞厄斯塔
https://book.douban.com/subject/30431051/?icn=index-latestbook-subject https://img1.doubanio.com/view/subject/m/public/s32288967.jpg 【英】霍吉淑
https://book.douban.com/subject/33428941/?icn=index-latestbook-subject https://img1.doubanio.com/view/subject/m/public/s32311017.jpg [意]伊塔洛·斯韋沃
https://book.douban.com/subject/30319982/?icn=index-latestbook-subject https://img3.doubanio.com/view/subject/m/public/s29986124.jpg [美]巴巴拉·塔奇曼
https://book.douban.com/subject/33414749/?icn=index-latestbook-subject https://img1.doubanio.com/view/subject/m/public/s32295228.jpg [日]貴志祐介
https://book.douban.com/subject/33396548/?icn=index-latestbook-subject https://img3.doubanio.com/view/subject/m/public/s32281256.jpg [法]勒•柯布西耶
https://book.douban.com/subject/30281429/?icn=index-latestbook-subject https://img3.doubanio.com/view/subject/m/public/s30021281.jpg [荷]高羅佩
https://book.douban.com/subject/33379779/?icn=index-latestbook-subject https://img3.doubanio.com/view/subject/m/public/s32305312.jpg [法]弗雷德裡克·皮耶魯齊&nbsp;/&nbsp;[法]馬修·阿倫
https://book.douban.com/subject/33404843/?icn=index-latestbook-subject https://img3.doubanio.com/view/subject/m/public/s32289202.jpg 遠子
https://book.douban.com/subject/30432494/?icn=index-latestbook-subject https://img3.doubanio.com/view/subject/m/public/s32296675.jpg [美]瓊·狄迪恩
https://book.douban.com/subject/30466222/?icn=index-latestbook-subject https://img1.doubanio.com/view/subject/m/public/s32281237.jpg [匈]雅歌塔·克裡斯多夫
https://book.douban.com/subject/33435992/?icn=index-latestbook-subject https://img1.doubanio.com/view/subject/m/public/s32313677.jpg 葛兆光
https://book.douban.com/subject/30396696/?icn=index-latestbook-subject https://img3.doubanio.com/view/subject/m/public/s32301364.jpg [美]奧森·斯科特·卡德
https://book.douban.com/subject/33420594/?icn=index-latestbook-subject https://img3.doubanio.com/view/subject/m/public/s32317746.jpg 馮唐
https://book.douban.com/subject/33400116/?icn=index-latestbook-subject https://img3.doubanio.com/view/subject/m/public/s32310181.jpg (英)阿加莎•克裡斯蒂著
https://book.douban.com/subject/30362709/?icn=index-latestbook-subject https://img3.doubanio.com/view/subject/m/public/s32273120.jpg [美]海蓮·漢芙
https://book.douban.com/subject/32567841/?icn=index-latestbook-subject https://img1.doubanio.com/view/subject/m/public/s31459918.jpg 伊謝爾倫的風
https://book.douban.com/subject/33408138/?icn=index-latestbook-subject https://img3.doubanio.com/view/subject/m/public/s32302726.jpg [美]羅威廉(WilliamT.Rowe)
https://book.douban.com/subject/33423702/?icn=index-latestbook-subject https://img3.doubanio.com/view/subject/m/public/s32311913.jpg 夏清影
https://book.douban.com/subject/33440284/?icn=index-latestbook-subject https://img1.doubanio.com/view/subject/m/public/s32314738.jpg 菲奧娜·斯塔福德
https://book.douban.com/subject/30480992/?icn=index-latestbook-subject https://img3.doubanio.com/view/subject/m/public/s32278296.jpg [英]約翰·勒卡雷
https://book.douban.com/subject/33393524/?icn=index-latestbook-subject https://img1.doubanio.com/view/subject/m/public/s32302539.jpg 許知遠
https://book.douban.com/subject/30466204/?icn=index-latestbook-subject https://img1.doubanio.com/view/subject/m/public/s32318137.jpg [英]戴維·洛奇
https://book.douban.com/subject/30473225/?icn=index-latestbook-subject https://img3.doubanio.com/view/subject/m/public/s32304992.jpg [日]池田龜鑒
https://book.douban.com/subject/30200837/?icn=index-latestbook-subject https://img3.doubanio.com/view/subject/m/public/s30017700.jpg [日]青山七惠
https://book.douban.com/subject/30481930/?icn=index-latestbook-subject https://img3.doubanio.com/view/subject/m/public/s30019054.jpg (英)吉姆·克裡斯蒂安&nbsp;/&nbsp;于應機&nbsp;/&nbsp;李陽歡
https://book.douban.com/subject/33399902/?icn=index-latestbook-subject https://img3.doubanio.com/view/subject/m/public/s32284301.jpg 池莉
https://book.douban.com/subject/30443973/?icn=index-latestbook-subject https://img3.doubanio.com/view/subject/m/public/s32271911.jpg [葡]費爾南多·佩索阿
https://book.douban.com/subject/33370472/?icn=index-latestbook-subject https://img1.doubanio.com/view/subject/m/public/s32326689.jpg [法]羅曼·加裡
https://book.douban.com/subject/30415984/?icn=index-latestbook-subject https://img3.doubanio.com/view/subject/m/public/s32285594.jpg [美]克裡斯·克利爾菲爾德&nbsp;/&nbsp;[美]安德拉什·蒂爾克斯
https://book.douban.com/subject/33423373/?icn=index-latestbook-subject https://img1.doubanio.com/view/subject/m/public/s32323469.jpg 周恺
https://book.douban.com/subject/33381271/?icn=index-latestbook-subject https://img1.doubanio.com/view/subject/m/public/s32277009.jpg [美]戴維•戴恩
https://book.douban.com/subject/30406506/?icn=index-latestbook-subject https://img3.doubanio.com/view/subject/m/public/s32295155.jpg 練明喬
https://book.douban.com/subject/30436197/?icn=index-latestbook-subject https://img3.doubanio.com/view/subject/m/public/s32278312.jpg [英]海倫•拉塞爾
https://book.douban.com/subject/30446953/?icn=index-latestbook-subject https://img3.doubanio.com/view/subject/m/public/s32284916.jpg [美]勞倫斯·布洛克
https://book.douban.com/subject/32492398/?icn=index-latestbook-subject https://img1.doubanio.com/view/subject/m/public/s32259438.jpg [法]阿爾貝·奧古斯特·拉西内著
https://book.douban.com/subject/30464096/?icn=index-latestbook-subject https://img1.doubanio.com/view/subject/m/public/s32322848.jpg [英]格雷厄姆·格林
https://book.douban.com/subject/30475747/?icn=index-latestbook-subject https://img3.doubanio.com/view/subject/m/public/s32294085.jpg [法]米歇爾·維諾克
https://book.douban.com/subject/33411336/?icn=index-latestbook-subject https://img3.doubanio.com/view/subject/m/public/s32293152.jpg 曾铮
https://book.douban.com/subject/33387411/?icn=index-latestbook-subject https://img1.doubanio.com/view/subject/m/public/s32304678.jpg [美]拉塞爾·柯克      
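
The same extraction can be tried offline, without a network request. The sketch below runs the pattern on a hypothetical HTML fragment whose structure is only assumed to resemble what the pattern expects; the example.com URLs and author names are invented. It also decodes entities such as &nbsp; with the standard library's html.unescape(), which is one possible way to clean up the author field.

```python
import html
import re

# Hypothetical markup, loosely modelled on what the pattern above expects.
sample = """
<li>
  <div class="cover">
    <a href="https://example.com/book/1" title="Book One">
      <img src="https://example.com/img/1.jpg" class="pic">
    </a>
  </div>
  <div class="author">
    Author A
  </div>
</li>
<li>
  <div class="cover">
    <a href="https://example.com/book/2" title="Book Two">
      <img src="https://example.com/img/2.jpg" class="pic">
    </a>
  </div>
  <div class="author">
    Author B&nbsp;/&nbsp;Author C
  </div>
</li>
"""

# Same pattern as above, compiled once with re.S so . can cross newlines.
item = re.compile(
    r'<li.*?class="cover".*?a\shref="(.*?)"\stitle=".*?">'
    r'.*?src="(.*?)"\sclass.*?class="author">(.*?)</div>.*?/li>',
    re.S,
)

books = []
for url, img, author in item.findall(sample):
    # Decode entities, turn non-breaking spaces into plain spaces, trim the ends.
    author = html.unescape(author).replace("\xa0", " ").strip()
    books.append((url, img, author))

for book in books:
    print(*book)
```

Regex-based scraping like this is tied to the exact markup, so the pattern needs to be adjusted whenever the page layout changes.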

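Author: iveBoy

Source: http://www.cnblogs.com/shenjianping/

Copyright of this article is shared by the author and cnblogs (博客園). Reposting is welcome, but unless the author agrees otherwise, a link to the original article must be given on the page; otherwise the author reserves the right to pursue legal liability.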