Regular expressions are useful in web scraping and in many other applications.
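1. Common matching patterns
Pattern    Description
\w    Matches a letter, digit, or underscore
\W    Matches any character that is not a letter, digit, or underscore
\s    Matches any whitespace character; equivalent to [ \t\n\r\f\v]
\S    Matches any non-whitespace character
\d    Matches any digit; equivalent to [0-9] for ASCII text
\D    Matches any non-digit
\A    Matches the start of the string
\Z    Matches the end of the string. In Python's re it matches only at the very end; in Perl-style engines it also matches just before a trailing newline
\z    Matches only at the very end of the string (Perl-style; older Python versions do not accept it)
\G    Matches the position where the previous match ended (Perl-style; not supported by Python's re)
\n    Matches a newline
\t    Matches a tab
^    Matches the start of the string
$    Matches the end of the string
.    Matches any character except a newline; with the re.DOTALL flag it matches any character, including a newline
[...]    A set of characters: [amk] matches 'a', 'm', or 'k'
[^...]    Any character not in the set: [^abc] matches any character other than a, b, or c
*    Matches 0 or more occurrences of the preceding expression
+    Matches 1 or more occurrences of the preceding expression
?    Matches 0 or 1 occurrence of the preceding expression (greedy); placed after another quantifier, as in *? or +?, it makes that quantifier non-greedy
{n}    Matches exactly n occurrences of the preceding expression
{n,m}    Matches n to m occurrences of the preceding expression, greedily (write it without a space after the comma)
a|b    Matches a or b
( )    Matches the expression inside the parentheses and makes it a group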
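A few rows of the table can be tried out directly. The snippet below is a minimal sketch; the sample strings are made up for illustration and do not come from the article.

```python
import re

text = "user_42 paid 3 times,\ttotal: 120"

# \w+ : runs of letters, digits, and underscores
words = re.findall(r"\w+", text)
# \d+ : runs of digits
numbers = re.findall(r"\d+", text)
# [aeiou] : any single character from the set
vowels = re.findall(r"[aeiou]", "regex")
# {2,3} : two to three repetitions, as many as possible
short_runs = re.findall(r"\d{2,3}", "1 22 333 4444")
# a|b : alternation
pets = re.findall(r"cat|dog", "cat, dog, bird")

print(words)       # ['user_42', 'paid', '3', 'times', 'total', '120']
print(numbers)     # ['42', '3', '120']
print(vowels)      # ['e', 'e']
print(short_runs)  # ['22', '333', '444']
print(pets)        # ['cat', 'dog']
```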
2. re.match()
re.match() tries to match from the first character of the string. If the pattern does not match there, it returns None.
re.match(pattern, string, flags=0)
Basic matching
import re
content = "hfhdh 8484 djfjdj dkfd 8596"
result = re.match(r"^hfhdh\s\d{4}\s\w.*96$", content)
print(result)  # <re.Match object; span=(0, 27), match='hfhdh 8484 djfjdj dkfd 8596'>
print(result.group())  # the matched text: hfhdh 8484 djfjdj dkfd 8596
print(result.span())  # (start, end) positions of the match: (0, 27)
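Because re.match() only tries the start of the string, a pattern that would match later in the string still gives None. A short sketch of that failure case, with a guard before calling .group():

```python
import re

content = "hfhdh 8484 djfjdj dkfd 8596"

# The string starts with letters, so a digits-only pattern fails at position 0
result = re.match(r"\d{4}", content)
print(result)  # None

# Calling result.group() here would raise AttributeError, so check first
if result is None:
    print("no match at the start of the string")
```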
Generic matching
Use .* to match any run of characters
import re
content = "hfhdh 8484 djfjdj dkfd 8596"
result = re.match(r"^hfhdh.*96$", content)
print(result)
print(result.group())
Targeted matching
To pull the numbers out of the string, wrap parts of the pattern in parentheses. Each pair of parentheses is a group, and the text matched by each group can be retrieved separately.
import re
content = "hfhdh 8484 djfjdj dkfd 8596"
result = re.match(r"^hfhdh\s(\d+).*\s(\d+)$", content)
print(result)  # <re.Match object; span=(0, 27), match='hfhdh 8484 djfjdj dkfd 8596'>
print(result.group())  # hfhdh 8484 djfjdj dkfd 8596
print(result.group(1))  # 8484
print(result.group(2))  # 8596
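Groups can also be read all at once or by name. The sketch below reuses the same string; the group names first and last are chosen here for illustration only.

```python
import re

content = "hfhdh 8484 djfjdj dkfd 8596"

# (?P<name>...) gives a group a name in addition to its number
result = re.match(r"^hfhdh\s(?P<first>\d+).*\s(?P<last>\d+)$", content)

print(result.groups())        # ('8484', '8596')
print(result.group("first"))  # 8484
print(result.groupdict())     # {'first': '8484', 'last': '8596'}
```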
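Greedy matching
This is the same pattern as before, except that the \s after .* has been removed, and the result changes. .* is greedy: it consumes as many characters as possible and leaves only the single digit that the final (\d+) requires.
import re
content = "hfhdh 8484 djfjdj dkfd 8596"
result = re.match(r"^hfhdh\s(\d+).*(\d+)$", content)
print(result.group(2))  # 6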
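Non-greedy matching
To stop .* from consuming too much, append ? to the quantifier. .*? is non-greedy and matches as few characters as possible, so the final (\d+) captures the whole number.
import re
content = "hfhdh 8484 djfjdj dkfd 8596"
result = re.match(r"^hfhdh\s(\d+).*?(\d+)$", content)
print(result.group(2))  # 8596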
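Putting the two variants side by side makes the difference easier to see. The second half uses a made-up tag string, which is where greedy versus non-greedy usually matters in scraping.

```python
import re

content = "hfhdh 8484 djfjdj dkfd 8596"

greedy = re.match(r"^hfhdh\s(\d+).*(\d+)$", content)
lazy = re.match(r"^hfhdh\s(\d+).*?(\d+)$", content)
print(greedy.group(2))  # 6
print(lazy.group(2))    # 8596

# Same effect on tag-like text: greedy spans both tags, non-greedy stops early
tags = "<a><b>"
greedy_tags = re.findall(r"<.*>", tags)
lazy_tags = re.findall(r"<.*?>", tags)
print(greedy_tags)  # ['<a><b>']
print(lazy_tags)    # ['<a>', '<b>']
```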
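Match flags
What if the string contains a newline? By default . does not match a newline; pass the flag re.S (re.DOTALL) so that it does.
import re
# the string must contain a real newline for the flag to matter
content = "hfhdh 8484 djfjdj\ndkfd 8596"
result = re.match(r"^hfhdh\s(\d+).*?(\d+)$", content, re.S)
print(result.group(2))  # 8596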
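To see what the flag changes, the same pattern can be run with and without it on a string that contains a real newline. This is a small sketch, not part of the original example.

```python
import re

content = "hfhdh 8484 djfjdj\ndkfd 8596"
pattern = r"^hfhdh\s(\d+).*?(\d+)$"

# Without re.S, "." stops at the newline, so the pattern cannot reach the end
without_flag = re.match(pattern, content)
# With re.S (alias of re.DOTALL), "." also matches the newline
with_flag = re.match(pattern, content, re.S)

print(without_flag)        # None
print(with_flag.group(2))  # 8596
```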
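Escaping
If the text to match contains characters that have a special meaning in regular expressions, escape them with "\".
import re
content = "This book's price is $10.00"
result = re.match(r"This book's price is \$10\.00", content, re.S)
print(result)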
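When the literal text comes from a variable, re.escape() from the standard library can add the backslashes for you. A minimal sketch; the variable names are illustrative.

```python
import re

content = "This book's price is $10.00"
literal = "$10.00"

# re.escape() backslash-escapes the characters that are special in a pattern
escaped = re.search(re.escape(literal), content)
print(escaped.group())  # $10.00

# Unescaped, "$" is an end-of-string anchor, so nothing can follow it
unescaped = re.search(literal, content)
print(unescaped)  # None

# Unescaped, "." matches any character, so the pattern is looser than intended
loose = re.fullmatch("10.00", "10x00")
print(loose is not None)  # True
```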
3. re.search()
re.search() scans the whole string and returns the first match, which does not have to start at the first character.
import re
content = "hfhdh 8484 djfjdj dkfd 8596"
result = re.search(r".*?(\d+).*?(\d+)$", content)
print(result)
print(result.group(1))
print(result.group(2))
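The difference from re.match() shows up as soon as the pattern cannot match at position 0. A short comparison on the same string:

```python
import re

content = "hfhdh 8484 djfjdj dkfd 8596"

# match() only tries position 0, where there is a letter
at_start = re.match(r"\d+", content)
# search() moves along the string until the pattern fits
anywhere = re.search(r"\d+", content)

print(at_start)                          # None
print(anywhere.group(), anywhere.span()) # 8484 (6, 10)
```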
4. re.findall()
The functions above return only a single match. To get every match in the string, use findall(), which returns all results as a list.
import re
content="""<div class="main-nav">
<a href=//new.tmall.com/?abbucket=&acm=lb-zebra-12803-227044.1003.8.390455&aldid=74460&spm=3.7396704.20000005.d1.zNATmK&abtest=&scm=1003.8.lb-zebra-12803-227044.ITEM_14392515722151_390455&pos=1>尚天貓</a>
<a href=//miao.tmall.com/?abbucket=&acm=lb-zebra-12803-227044.1003.8.390455&aldid=74460&spm=3.7396704.20000005.d2.zNATmK&abtest=&scm=1003.8.lb-zebra-12803-227044.ITEM_14392511874662_390455&pos=2>喵鮮生</a>
<a href=//vip.tmall.com/vip/index.htm?abbucket=&acm=lb-zebra-12803-227044.1003.8.390455&aldid=74460&spm=3.7396704.20000005.d3.zNATmK&abtest=&scm=1003.8.lb-zebra-12803-227044.ITEM_14392523417123_390455&pos=3>天貓會員</a>
<a href=//3c.tmall.com/?abbucket=&acm=lb-zebra-12803-227044.1003.8.390455&aldid=74460&spm=3.7396704.20000005.d4.zNATmK&abtest=&scm=1003.8.lb-zebra-12803-227044.ITEM_14392519569634_390455&pos=4>電器城</a>
<a href=//chaoshi.tmall.com/?abbucket=&acm=lb-zebra-12803-227044.1003.8.390455&aldid=74460&spm=3.7396704.20000005.d5.zNATmK&abtest=&scm=1003.8.lb-zebra-12803-227044.ITEM_14392500332195_390455&pos=5>天貓超市</a>
<a href=//yao.tmall.com/?abbucket=&acm=lb-zebra-12803-227044.1003.8.390455&aldid=74460&spm=3.7396704.20000005.d6.zNATmK&abtest=&scm=1003.8.lb-zebra-12803-227044.ITEM_14392496484706_390455&pos=6>醫藥館</a>
<a href=//www.tmall.hk/?abbucket=&acm=lb-zebra-12803-227044.1003.8.390455&aldid=74460&spm=3.7396704.20000005.d8.zNATmK&abtest=&scm=1003.8.lb-zebra-12803-227044.ITEM_14392508027177_390455&pos=8>天貓國際</a>
<a class="last" href=//car.tmall.com/?acm=lb-zebra-12803-227044.1003.8.390455&spm=3.7396704.20000007.23.zNATmK&uuid=75987&abtest=&scm=1003.8.lb-zebra-12803-227044.ITEM_14392504179688_390455&pos=1>天貓汽車</a>
</div>"""
result = re.findall(r'<a.*?>(.*?)</a>', content, re.S)  # one group: each result is that group's text; several groups: each result is a tuple
print(result)#['尚天貓', '喵鮮生', '天貓會員', '電器城', '天貓超市', '醫藥館', '天貓國際', '天貓汽車']
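The shape of the list returned by findall() depends on how many groups the pattern has. The sketch below uses a small made-up HTML fragment to show the three cases.

```python
import re

html = '<a href="/one">First</a>\n<a href="/two">Second</a>'

# No groups: each item is the whole match
whole = re.findall(r'<a.*?</a>', html, re.S)
# One group: each item is the text of that group
texts = re.findall(r'<a.*?>(.*?)</a>', html, re.S)
# Two groups: each item is a tuple with one entry per group
pairs = re.findall(r'<a href="(.*?)">(.*?)</a>', html, re.S)

print(whole)  # ['<a href="/one">First</a>', '<a href="/two">Second</a>']
print(texts)  # ['First', 'Second']
print(pairs)  # [('/one', 'First'), ('/two', 'Second')]
```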
5. re.sub()
re.sub() replaces every matching substring and returns the resulting string.
import re
content = "hfhdh 8484 djfjdj dkfd 8596"
result = re.sub(r"(\d+)", 'hellO', content)
print(result)  # hfhdh hellO djfjdj dkfd hellO
To keep the matched text in the replacement, refer to the group:
import re
content = "hfhdh 8484 djfjdj dkfd 8596"
result = re.sub(r"(\d+)", r'\1hellO', content)  # use a raw string; \1 is the text captured by the first group
print(result)  # hfhdh 8484hellO djfjdj dkfd 8596hellO
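re.sub() also accepts a function as the replacement and a count argument that limits how many matches are replaced. A minimal sketch; the helper name double is made up for this example.

```python
import re

content = "hfhdh 8484 djfjdj dkfd 8596"

def double(match):
    """Return the matched number multiplied by two, as text."""
    return str(int(match.group(1)) * 2)

# The function is called once per match and its return value is substituted
doubled = re.sub(r"(\d+)", double, content)
print(doubled)  # hfhdh 16968 djfjdj dkfd 17192

# count=1 replaces only the first match
first_only = re.sub(r"\d+", "N", content, count=1)
print(first_only)  # hfhdh N djfjdj dkfd 8596
```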
6. re.compile()
re.compile() compiles a regex string into a pattern object so that the pattern can be reused.
import re
content = "hfhdh 8484 djfjdj dkfd 8596"
pattern = re.compile(r'.*?(\d+).*?(\d+)')
print(pattern.match(content).group(1))  # 8484
print(pattern.match(content).group(2))  # 8596
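A compiled pattern has the same methods as the module-level functions, and flags are passed once at compile time. The sketch below reuses one pattern over several made-up lines.

```python
import re

number = re.compile(r"\d+")
lines = ["hfhdh 8484", "djfjdj", "dkfd 8596 12"]

# The same compiled object is reused for every line
found = [number.findall(line) for line in lines]
print(found)  # [['8484'], [], ['8596', '12']]

# Flags are given to re.compile(), not to the method call
hello = re.compile(r"hello", re.I)
greeting = hello.search("Say HELLO")
print(greeting.group())  # HELLO
```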
7. Hands-on practice
Scrape the url, img, and author of each book from the Douban Books home page.
import requests
import re
content = requests.get('https://book.douban.com').text
results = re.findall(r'<li.*?class="cover".*?a\shref="(.*?)"\stitle=".*?">.*?src="(.*?)"\sclass.*?class="author">(.*?)</div>.*?/li>', content, re.S)
for result in results:
    url, img, author = result
    author = re.sub(r'\s', '', author)  # strip all whitespace inside the author text
    print(url, img, author)
Sample output:
https://book.douban.com/subject/30353889/?icn=index-latestbook-subject https://img3.doubanio.com/view/subject/m/public/s32288925.jpg (挪)奧斯娜·塞厄斯塔
https://book.douban.com/subject/30431051/?icn=index-latestbook-subject https://img1.doubanio.com/view/subject/m/public/s32288967.jpg 【英】霍吉淑
https://book.douban.com/subject/33428941/?icn=index-latestbook-subject https://img1.doubanio.com/view/subject/m/public/s32311017.jpg [意]伊塔洛·斯韋沃
https://book.douban.com/subject/30319982/?icn=index-latestbook-subject https://img3.doubanio.com/view/subject/m/public/s29986124.jpg [美]巴巴拉·塔奇曼
https://book.douban.com/subject/33414749/?icn=index-latestbook-subject https://img1.doubanio.com/view/subject/m/public/s32295228.jpg [日]貴志祐介
https://book.douban.com/subject/33396548/?icn=index-latestbook-subject https://img3.doubanio.com/view/subject/m/public/s32281256.jpg [法]勒•柯布西耶
https://book.douban.com/subject/30281429/?icn=index-latestbook-subject https://img3.doubanio.com/view/subject/m/public/s30021281.jpg [荷]高羅佩
https://book.douban.com/subject/33379779/?icn=index-latestbook-subject https://img3.doubanio.com/view/subject/m/public/s32305312.jpg [法]弗雷德裡克·皮耶魯齊 / [法]馬修·阿倫
https://book.douban.com/subject/33404843/?icn=index-latestbook-subject https://img3.doubanio.com/view/subject/m/public/s32289202.jpg 遠子
https://book.douban.com/subject/30432494/?icn=index-latestbook-subject https://img3.doubanio.com/view/subject/m/public/s32296675.jpg [美]瓊·狄迪恩
https://book.douban.com/subject/30466222/?icn=index-latestbook-subject https://img1.doubanio.com/view/subject/m/public/s32281237.jpg [匈]雅歌塔·克裡斯多夫
https://book.douban.com/subject/33435992/?icn=index-latestbook-subject https://img1.doubanio.com/view/subject/m/public/s32313677.jpg 葛兆光
https://book.douban.com/subject/30396696/?icn=index-latestbook-subject https://img3.doubanio.com/view/subject/m/public/s32301364.jpg [美]奧森·斯科特·卡德
https://book.douban.com/subject/33420594/?icn=index-latestbook-subject https://img3.doubanio.com/view/subject/m/public/s32317746.jpg 馮唐
https://book.douban.com/subject/33400116/?icn=index-latestbook-subject https://img3.doubanio.com/view/subject/m/public/s32310181.jpg (英)阿加莎•克裡斯蒂著
https://book.douban.com/subject/30362709/?icn=index-latestbook-subject https://img3.doubanio.com/view/subject/m/public/s32273120.jpg [美]海蓮·漢芙
https://book.douban.com/subject/32567841/?icn=index-latestbook-subject https://img1.doubanio.com/view/subject/m/public/s31459918.jpg 伊謝爾倫的風
https://book.douban.com/subject/33408138/?icn=index-latestbook-subject https://img3.doubanio.com/view/subject/m/public/s32302726.jpg [美]羅威廉(WilliamT.Rowe)
https://book.douban.com/subject/33423702/?icn=index-latestbook-subject https://img3.doubanio.com/view/subject/m/public/s32311913.jpg 夏清影
https://book.douban.com/subject/33440284/?icn=index-latestbook-subject https://img1.doubanio.com/view/subject/m/public/s32314738.jpg 菲奧娜·斯塔福德
https://book.douban.com/subject/30480992/?icn=index-latestbook-subject https://img3.doubanio.com/view/subject/m/public/s32278296.jpg [英]約翰·勒卡雷
https://book.douban.com/subject/33393524/?icn=index-latestbook-subject https://img1.doubanio.com/view/subject/m/public/s32302539.jpg 許知遠
https://book.douban.com/subject/30466204/?icn=index-latestbook-subject https://img1.doubanio.com/view/subject/m/public/s32318137.jpg [英]戴維·洛奇
https://book.douban.com/subject/30473225/?icn=index-latestbook-subject https://img3.doubanio.com/view/subject/m/public/s32304992.jpg [日]池田龜鑒
https://book.douban.com/subject/30200837/?icn=index-latestbook-subject https://img3.doubanio.com/view/subject/m/public/s30017700.jpg [日]青山七惠
https://book.douban.com/subject/30481930/?icn=index-latestbook-subject https://img3.doubanio.com/view/subject/m/public/s30019054.jpg (英)吉姆·克裡斯蒂安 / 于應機 / 李陽歡
https://book.douban.com/subject/33399902/?icn=index-latestbook-subject https://img3.doubanio.com/view/subject/m/public/s32284301.jpg 池莉
https://book.douban.com/subject/30443973/?icn=index-latestbook-subject https://img3.doubanio.com/view/subject/m/public/s32271911.jpg [葡]費爾南多·佩索阿
https://book.douban.com/subject/33370472/?icn=index-latestbook-subject https://img1.doubanio.com/view/subject/m/public/s32326689.jpg [法]羅曼·加裡
https://book.douban.com/subject/30415984/?icn=index-latestbook-subject https://img3.doubanio.com/view/subject/m/public/s32285594.jpg [美]克裡斯·克利爾菲爾德 / [美]安德拉什·蒂爾克斯
https://book.douban.com/subject/33423373/?icn=index-latestbook-subject https://img1.doubanio.com/view/subject/m/public/s32323469.jpg 周恺
https://book.douban.com/subject/33381271/?icn=index-latestbook-subject https://img1.doubanio.com/view/subject/m/public/s32277009.jpg [美]戴維•戴恩
https://book.douban.com/subject/30406506/?icn=index-latestbook-subject https://img3.doubanio.com/view/subject/m/public/s32295155.jpg 練明喬
https://book.douban.com/subject/30436197/?icn=index-latestbook-subject https://img3.doubanio.com/view/subject/m/public/s32278312.jpg [英]海倫•拉塞爾
https://book.douban.com/subject/30446953/?icn=index-latestbook-subject https://img3.doubanio.com/view/subject/m/public/s32284916.jpg [美]勞倫斯·布洛克
https://book.douban.com/subject/32492398/?icn=index-latestbook-subject https://img1.doubanio.com/view/subject/m/public/s32259438.jpg [法]阿爾貝·奧古斯特·拉西内著
https://book.douban.com/subject/30464096/?icn=index-latestbook-subject https://img1.doubanio.com/view/subject/m/public/s32322848.jpg [英]格雷厄姆·格林
https://book.douban.com/subject/30475747/?icn=index-latestbook-subject https://img3.doubanio.com/view/subject/m/public/s32294085.jpg [法]米歇爾·維諾克
https://book.douban.com/subject/33411336/?icn=index-latestbook-subject https://img3.doubanio.com/view/subject/m/public/s32293152.jpg 曾铮
https://book.douban.com/subject/33387411/?icn=index-latestbook-subject https://img1.doubanio.com/view/subject/m/public/s32304678.jpg [美]拉塞爾·柯克
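The same pattern can be exercised offline, which is handy because the live page changes over time. The sketch below is an assumption-heavy illustration: the HTML fragment is hypothetical, shaped only so that the pattern above fits it, and is not real Douban markup. The named groups and the variable names are also added here for readability.

```python
import re

# Hypothetical markup written for this example; not copied from book.douban.com
sample = """
<li class="">
  <div class="cover">
    <a href="https://example.com/book/1" title="Book One">
      <img src="https://example.com/img/1.jpg" class="" alt="Book One">
    </a>
  </div>
  <div class="author">
    Author A
  </div>
</li>
<li class="">
  <div class="cover">
    <a href="https://example.com/book/2" title="Book Two">
      <img src="https://example.com/img/2.jpg" class="" alt="Book Two">
    </a>
  </div>
  <div class="author">
    Author B / Author C
  </div>
</li>
"""

# Same pattern as in the article, compiled once and with named groups
book_pattern = re.compile(
    r'<li.*?class="cover".*?a\shref="(?P<url>.*?)"\stitle=".*?">'
    r'.*?src="(?P<img>.*?)"\sclass.*?class="author">(?P<author>.*?)</div>.*?/li>',
    re.S,
)

books = []
for m in book_pattern.finditer(sample):
    # Removing every whitespace character also removes spaces inside names
    author = re.sub(r"\s", "", m.group("author"))
    books.append((m.group("url"), m.group("img"), author))

for book in books:
    print(*book)
```

Note that stripping all whitespace joins words inside a name, which is also visible in the original output for names that contain spaces.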
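Author: iveBoy
Source: http://www.cnblogs.com/shenjianping/
Copyright in this article is held jointly by the author and cnblogs. Reposting is welcome, but unless the author agrees otherwise, a link to the original must be given on the article page; otherwise the author reserves the right to pursue legal liability.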