Matching Chinese Text in Web Pages with Python Regular Expressions

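To match the Chinese text in a fetched web page, the general approach is:

1. Read the Content-Type field of the HTTP response header to find the encoding (charset) of the page content;

2. Use that encoding to decode the page content into Unicode;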

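3. Match with the regular expression [\u2e80-\u9fff], which runs from the CJK radicals at U+2E80 to the end of the CJK Unified Ideographs block at U+9FFF;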

4. Encode the matched strings as UTF-8.
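The four steps can be tried without a network connection. The following is a minimal Python 3 sketch rather than the original demo: the header value and the GBK-encoded sample body are made up for illustration, `charset_from_content_type` and `find_chinese` are hypothetical helper names, the UTF-8 default is an assumption, and the range `[\u2e80-\u9fff]` is an assumption about which characters should count as Chinese (it also covers kana and some symbols).

```python
import re
from email.message import Message

# Assumed range: CJK radicals (U+2E80) through the end of the
# CJK Unified Ideographs block (U+9FFF).
CJK_RUN = re.compile(r'[\u2e80-\u9fff]+')


def charset_from_content_type(value, default='utf-8'):
    """Step 1: pull the charset parameter out of a Content-Type value."""
    msg = Message()
    msg['Content-Type'] = value
    # get_content_charset() returns None when no charset is given.
    return msg.get_content_charset() or default


def find_chinese(body, content_type):
    """Steps 2-3: decode the raw bytes, then collect runs of CJK characters."""
    encoding = charset_from_content_type(content_type)
    text = body.decode(encoding, errors='replace')
    return CJK_RUN.findall(text)


# Made-up sample: a GBK-encoded fragment and a matching header value.
sample_body = '<p>交通运输 apps: 地图 and 导航</p>'.encode('gbk')
sample_header = 'text/html; charset=GBK'

matches = find_chinese(sample_body, sample_header)
# Step 4: encode each match as UTF-8 bytes for output or storage.
utf8_matches = [m.encode('utf-8') for m in matches]
```

Parsing the header with `email.message.Message` instead of splitting on `charset=` is a design choice of this sketch: it copes with headers that have no charset parameter or that carry extra parameters after it.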

Demo:

    # coding=utf-8
    # Python 2 code: urllib2, unicode() and the print statement do not exist in Python 3.

    import re
    import urllib2

    if __name__ == '__main__':
        try:
            url = 'https://play.google.com/store/apps/category/TRANSPORTATION/collection/topselling_free?start=48&num=24'
            req = urllib2.Request(url)
            res = urllib2.urlopen(req)
            # get the encoding from the Content-Type header
            # (assumes the header carries a "charset=" parameter)
            encoding = res.headers['content-type'].split('charset=')[-1]
            # read the raw response body
            data = res.read()
            # decode the bytes into a unicode string
            data = unicode(data, encoding)
            res.close()
            # find every run of CJK characters
            matches = re.findall(ur'[\u2e80-\u9fff]+', data)
            for item in matches:
                # encode as UTF-8 before printing
                item = item.encode('utf-8')
                print item
        except Exception as e:
            print e
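
The demo targets Python 2. On Python 3 the same flow might look like the sketch below. This is an adaptation under stated assumptions, not the original code: `fetch_chinese` is a hypothetical name, falling back to UTF-8 when the server sends no charset is an assumption, and the range `[\u2e80-\u9fff]` is assumed to be the intended one.

```python
import re
import urllib.request

# Assumed range: U+2E80 through the end of CJK Unified Ideographs (U+9FFF).
CJK_RUN = re.compile(r'[\u2e80-\u9fff]+')


def fetch_chinese(url, fallback='utf-8'):
    """Fetch url and return every run of CJK characters in the response."""
    with urllib.request.urlopen(url) as res:
        # The headers object parses Content-Type for us;
        # get_content_charset() returns None if no charset was sent.
        encoding = res.headers.get_content_charset() or fallback
        # bytes -> str replaces the Python 2 unicode(data, encoding) call.
        text = res.read().decode(encoding, errors='replace')
    return CJK_RUN.findall(text)
```

Calling `fetch_chinese(url)` with the URL from the demo returns a list of `str` objects. Python 3 strings are already Unicode, so the final encode step is only needed when writing bytes, for example `item.encode('utf-8')` before writing to a binary file.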