Matching Chinese content in web pages with Python regular expressions

2021-09-16 23:50:00

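To match Chinese text in fetched web page content, the general approach is:

1. Read the content-type field of the HTTP response header to get the page's character encoding;

2. Decode the page content to unicode using that encoding;

3. Match with a regular expression character range such as [\u2e80-\u9fff], which runs from the CJK Radicals Supplement through the CJK Unified Ideographs block;

4. Encode the matched strings as UTF-8.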
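Steps 2 to 4 can be tried offline on an in-memory byte string. The following is a minimal Python 3 sketch rather than the post's own code: the function name `extract_chinese` and the sample HTML are hypothetical, and the range `[\u2e80-\u9fff]` is an assumed choice that covers the common CJK ideographs.

```python
import re

# Assumed range: CJK Radicals Supplement (U+2E80) through the end of
# CJK Unified Ideographs (U+9FFF).
CJK_RUN = re.compile(r'[\u2e80-\u9fff]+')


def extract_chinese(raw_bytes, encoding):
    """Decode raw page bytes and return the runs of Chinese characters."""
    text = raw_bytes.decode(encoding)  # step 2: bytes -> unicode text
    return CJK_RUN.findall(text)       # step 3: regex match


# Hypothetical sample page body, encoded as GBK to mimic a non-UTF-8 site.
sample = '<title>交通 - Google Play</title><p>热门免费</p>'.encode('gbk')

for item in extract_chinese(sample, 'gbk'):
    # Step 4: encode each match as UTF-8 bytes (e.g. for writing to a file).
    utf8_bytes = item.encode('utf-8')
    print(item, utf8_bytes)
```

Note that this range also includes kana and CJK punctuation such as 。 and 、, so a narrower range like `[\u4e00-\u9fff]` may be preferable when only ideographs are wanted.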

Demo:

# coding=utf-8

import re
import urllib2

if __name__ == '__main__':
    try:
        url = 'https://play.google.com/store/apps/category/TRANSPORTATION/collection/topselling_free?start=48&num=24'
        req = urllib2.Request(url)
        res = urllib2.urlopen(req)
        # get the content encoding from the Content-Type header
        encoding = res.headers['content-type'].split('charset=')[-1]
        # read the HTTP body
        data = res.read()
        # decode the raw bytes into unicode
        data = unicode(data, encoding)
        res.close()
        # match runs of Chinese characters with a regex
        matches = re.findall(ur'[\u2e80-\u9fff]+', data)
        for item in matches:
            # encode as UTF-8 before printing
            item = item.encode('utf-8')
            print item
    except Exception, e:
        print e
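
The demo above targets Python 2 (`urllib2`, `unicode`, the `print` statement). It also assumes the content-type header always carries a `charset=` parameter; when it does not, `split('charset=')[-1]` returns the whole header value and decoding fails. A rough Python 3 adaptation is sketched below. The helper names `charset_from_content_type` and `fetch_chinese`, the UTF-8 fallback, and the character range are assumptions of this sketch, not part of the original demo.

```python
import re
import urllib.request
from email.message import Message

# Assumed range: CJK Radicals Supplement (U+2E80) through the end of
# CJK Unified Ideographs (U+9FFF).
CJK_RUN = re.compile(r'[\u2e80-\u9fff]+')


def charset_from_content_type(value, default='utf-8'):
    """Return the charset parameter of a Content-Type value, lowercased."""
    msg = Message()
    msg['Content-Type'] = value
    # get_content_charset() returns None when no charset parameter is present.
    return msg.get_content_charset() or default


def fetch_chinese(url):
    """Fetch `url` and return the runs of Chinese characters in its body."""
    with urllib.request.urlopen(url) as res:
        content_type = res.headers.get('Content-Type', 'text/html')
        encoding = charset_from_content_type(content_type)
        # errors='replace' keeps going if a few bytes do not decode cleanly.
        text = res.read().decode(encoding, errors='replace')
    return CJK_RUN.findall(text)
```

`fetch_chinese(url)` needs network access, while the header helper can be tried on its own: `charset_from_content_type('text/html')` falls back to `'utf-8'`. In Python 3 the matches are already `str`, so an explicit `.encode('utf-8')` is only needed when bytes are required.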