Using Python Regular Expressions (re) to Extract Blog Information from CSDN Blog Source Code

re.compile(strpattern[, flag]):

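This method is the factory method of the Pattern class. It compiles a regular expression given as a string into a Pattern object. The second argument, flag, sets the matching mode; several flags can be combined with the bitwise OR operator '|' so that they take effect together, for example re.I | re.M. You can also specify the mode inside the regex string itself: re.compile('pattern', re.I | re.M) is equivalent to re.compile('(?im)pattern').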
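To make that equivalence concrete, here is a minimal sketch written for Python 3 (the listing later in this article targets Python 2). The sample text and the variable names are invented for illustration only.

```python
import re

# Flags passed as the second argument, combined with bitwise OR
by_argument = re.compile(r"^hello", re.I | re.M)

# The same two flags written inline at the start of the pattern
inline = re.compile(r"(?im)^hello")

# Invented sample text: three lines, two of which start with "hello" in some case
sample = "first line\nHELLO world\nhello again"

matches_by_argument = by_argument.findall(sample)
matches_inline = inline.findall(sample)

print(matches_by_argument)
print(matches_inline)
```

Both patterns should find the same matches, since the inline `(?im)` prefix switches on the same IGNORECASE and MULTILINE behavior as the flag argument.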

The available flags are:


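| Flag | Full name | Description |
| --- | --- | --- |
| re.I | re.IGNORECASE | Ignore case (the second column gives the full spelling of each flag) |
| re.M | re.MULTILINE | Multi-line mode; changes the behavior of '^' and '$' |
| re.S | re.DOTALL | Dot-matches-all mode; changes the behavior of '.', so that "." matches any character, including a newline |
| re.L | re.LOCALE | Makes the predefined character classes \w \W \b \B \s \S depend on the current locale |
| re.U | re.UNICODE | Makes the predefined character classes \w \W \b \B \s \S \d \D depend on the character properties defined by Unicode |
| re.X | re.VERBOSE | Verbose mode. The regular expression may span multiple lines, whitespace is ignored, and comments may be added |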
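The effect of some of these flags can be checked with a short sketch (Python 3). The verbose/compact pair follows the example given for re.VERBOSE in the Python `re` documentation; the sample strings and variable names are invented for illustration.

```python
import re

# re.X: whitespace in the pattern is ignored and '#' starts a comment,
# so these two patterns should be equivalent
verbose = re.compile(r"""\d +  # the integral part
                         \.    # the decimal point
                         \d *  # some fractional digits""", re.X)
compact = re.compile(r"\d+\.\d*")

text = "pi is 3.14 and e is 2.718"
verbose_hits = verbose.findall(text)
compact_hits = compact.findall(text)

# re.S: without it '.' stops at a newline, with it '.' crosses line breaks
html = "<b>one\ntwo</b>"
dot_default = re.findall(r"<b>(.*?)</b>", html)
dot_all = re.findall(r"<b>(.*?)</b>", html, re.S)

# re.M: '^' matches at the start of every line instead of only the start of the string
lines = "alpha\nbeta"
caret_default = re.findall(r"^\w+", lines)
caret_multiline = re.findall(r"^\w+", lines, re.M)

print(verbose_hits, compact_hits)
print(dot_default, dot_all)
print(caret_default, caret_multiline)
```

The re.S case is the one that matters most for the scraping code in this article, because the HTML for a single blog entry is spread over several lines.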

See the code below for details.

#!coding:utf-8

import re
import sys
import urllib2


# Test matching text that contains Chinese
def testrechinese():
    reload(sys)
    sys.setdefaultencoding("utf-8")

    # This HTML is the entry for a single post, taken from the blog list page;
    # the post information is what we want to extract from it
    page = r"""<div class="list_item article_item">

</span>

<h1>

python正規表達式比對中文

</a>

</span>

</h1>

</div>

在使用python的過程中，由于需求原因，我們經常需要在文本或者網頁元素中用python正規表達式比對中文，但是我們經常所熟知的正規表達式卻隻能比對英文，而對于中文編碼卻望塵莫及，于是我大量google，幾經baidu，花了兩個多個小時測試，終于發現解決的辦法。特記錄如下字元串的角度來說，中文不如英文整齊、規範，這是不可避免的現實。本文結合網上資料以及個人經驗，以 python 語言為例，...

<span class="link_postdate">2015-01-28 19:34

<a href="/gatieme/article/details/43235791" title="閱讀次數">閱讀</a>(64)

<a href="/gatieme/article/details/43235791#comments" title="評論次數" onclick="_gaq.push(['_trackevent','function', 'onclick', 'blog_articles_pinglun'])">評論</a>(0)

</span>

</div>

</div>"""

    req = urllib2.Request("http://blog.csdn.net/gatieme/article/list/1")  # build the page request
    req.add_header("User-Agent", "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0)")

    try:
        cn = urllib2.urlopen(req)
        page = cn.read()
        unicodepage = page.decode("utf-8")
        cn.close()
    # HTTPError is a subclass of URLError, so it has to be caught first
    except urllib2.HTTPError as e:
        print 'http error:', e.code
        return
    except urllib2.URLError as e:
        print 'urlerror:', e.reason
        return

    # Match the URL (plus title, date, view count and comment count) of each post in the blog page
    rehtml = r'<span class="link_title"><a href="(.*?)">\s*(.*?)\s*</a></span>.*?<span class="link_postdate">(.*?)</span>\s*<span class="link_view" title=".*?"><a href="(.*?)" title=".*?">.*?</a>(.*?)</span>\s*<span class="link_comments" title=".*?"><a href="(.*?)#comments" title=".*?" onclick=".*?">.*?</a>(.*?)</span>'

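    #####-----------------------------------------------------------
    # [Example 1] Variant 1: match fails, the Chinese page content is not matched
    # (without re.S, '.' cannot cross the line breaks inside each entry)
    # pattern = re.compile(rehtml)
    # myitems = re.findall(pattern, unicodepage)

    # [Example 1] Variant 2: match fails, the Chinese page content is not matched
    # myitems = re.findall(rehtml, unicodepage)

    #####-----------------------------------------------------------
    # [Example 2] Variant 1: match succeeds; with re.S (DOTALL mode) the Chinese content is matched
    # How it works:
    # I.  Compile the string into a regular expression in re.S (DOTALL) mode
    # II. Use that regular expression to match the blog information in the text or HTML
    pattern = re.compile(rehtml, re.S)
    myitems = re.findall(pattern, unicodepage)

    # [Example 2] Variant 2: match succeeds; with re.S (DOTALL mode) the Chinese content is matched
    # II. Call findall directly on the compiled regular expression
    # pattern = re.compile(rehtml, re.S)
    # myitems = pattern.findall(unicodepage)

    # [Example 2] Variant 3: match succeeds; with re.S (DOTALL mode) the Chinese content is matched
    # I.  (no separate compile step)
    # II. Do not compile the regular expression; match the blog information in the text or HTML directly
    # myitems = re.findall(rehtml, unicodepage, re.S)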

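    print len(myitems)
    # print myitems

    for item in myitems:
        urltitle = item[0].replace("\n", "")
        urlview = item[3].replace("\n", "")
        urlcomments = item[5].replace("\n", "")

        # Under re.S the '.*?' wildcards can run across entry boundaries, so a match can go wrong:
        # the title of one post may be combined with the post date, view count
        # or comment count of another post.
        # For that reason the pattern deliberately captures the post URL more than once.
        # Only when the URL attached to the title, the URL attached to the view count
        # and the URL attached to the comment count all point to the same post
        # do we treat the match as successful.
        if (urltitle == urlview) and (urltitle == urlcomments):
            print "#------------------------------------------------------"
            print "URL:", item[0].replace("\n", ""),  # post URL 1 (attached to the title)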
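The listing above targets Python 2 and needs network access. As a rough offline sketch of the same idea in Python 3, the block below reuses the article's pattern on a small hand-built page. The helper `make_entry`, the entry markup, the summary text, the `onclick` value and the second post are all hypothetical, made up so that the pattern has something to match; they are not CSDN's actual markup. The sketch suggests that the match depends on re.S because each entry spans several lines, rather than on the text being Chinese.

```python
import re

# The article's pattern, split across adjacent raw strings for readability
REHTML = (
    r'<span class="link_title"><a href="(.*?)">\s*(.*?)\s*</a></span>.*?'
    r'<span class="link_postdate">(.*?)</span>\s*'
    r'<span class="link_view" title=".*?"><a href="(.*?)" title=".*?">.*?</a>(.*?)</span>\s*'
    r'<span class="link_comments" title=".*?"><a href="(.*?)#comments" title=".*?" onclick=".*?">.*?</a>(.*?)</span>'
)


def make_entry(url, title, date, views, comments):
    """Build one hypothetical list entry (assumed markup, not the real page)."""
    return (
        f'<span class="link_title"><a href="{url}">\n'
        f'{title}\n'
        f'</a></span>\n'
        f'<div>摘要文字</div>\n'
        f'<span class="link_postdate">{date}</span>\n'
        f'<span class="link_view" title="閱讀次數"><a href="{url}" title="閱讀次數">閱讀</a>({views})</span>\n'
        f'<span class="link_comments" title="評論次數"><a href="{url}#comments" title="評論次數" onclick="track()">評論</a>({comments})</span>\n'
    )


# First entry uses the values shown in the article; the second one is invented
page = (
    make_entry("/gatieme/article/details/43235791", "python正規表達式比對中文", "2015-01-28 19:34", 64, 0)
    + make_entry("/gatieme/article/details/00000000", "second post (hypothetical)", "2015-01-29 08:00", 5, 1)
)

# Without re.S the '.*?' between the title and the post date cannot cross a newline
without_dotall = re.findall(REHTML, page)
with_dotall = re.findall(REHTML, page, re.S)

# Keep only matches whose three captured URLs agree, as in the article's check
posts = []
for url, title, date, view_url, views, comment_url, comments in with_dotall:
    if url == view_url == comment_url:
        posts.append({"url": url, "title": title, "date": date,
                      "views": views, "comments": comments})

for post in posts:
    print(post["url"], post["title"], post["date"], post["views"], post["comments"])
```

Comparing the three captured URLs is a cheap sanity check: if a wildcard ever ran from one entry into the next, the URLs would disagree and the match would be dropped.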
