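To match Chinese text in the content of a fetched web page, the general approach is:
1. Read the Content-Type field of the HTTP response header to get the page's character encoding;
2. Decode the page content to Unicode using that encoding;
3. Match with the regular expression [\u4e00-\u9fff], which covers the CJK Unified Ideographs block where common Chinese characters live;
4. Encode the matched strings as UTF-8.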
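Steps 2 to 4 can be tried offline, without a network request. The snippet below is a minimal Python 3 sketch under stated assumptions: `find_chinese` and `CHINESE_RUN` are hypothetical names, the sample bytes are made up for illustration, and treating only the CJK Unified Ideographs block (U+4E00 to U+9FFF) as "Chinese" is a simplification.

```python
import re

# Assumption: "Chinese" means the CJK Unified Ideographs block only.
CHINESE_RUN = re.compile(r"[\u4e00-\u9fff]+")


def find_chinese(raw, encoding):
    """Decode raw bytes with the given charset and return runs of Chinese characters."""
    # Undecodable bytes become U+FFFD instead of raising an error.
    text = raw.decode(encoding, errors="replace")
    return CHINESE_RUN.findall(text)


# Made-up sample standing in for a downloaded page body.
sample = "<p>交通 apps: 地圖 and 導航</p>".encode("utf-8")
print(find_chinese(sample, "utf-8"))  # ['交通', '地圖', '導航']
```

In Python 3 the matches are already Unicode strings, so the final encode to UTF-8 (step 4) is only needed when writing bytes, for example `item.encode("utf-8")` before writing to a binary file.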
Demo (Python 2):

    # coding=utf-8

    import re
    import urllib2

    if __name__ == '__main__':
        try:
            url = 'https://play.google.com/store/apps/category/TRANSPORTATION/collection/topselling_free?start=48&num=24'
            req = urllib2.Request(url)
            res = urllib2.urlopen(req)
            # read the charset from the Content-Type header;
            # fall back to utf-8 (an assumption) when none is declared
            content_type = res.headers.get('content-type', '')
            if 'charset=' in content_type:
                encoding = content_type.split('charset=')[-1]
            else:
                encoding = 'utf-8'
            # read the HTTP response body
            data = res.read()
            res.close()
            # decode the raw bytes to unicode
            data = unicode(data, encoding)
            # match runs of Chinese characters (CJK Unified Ideographs)
            matches = re.findall(ur'[\u4e00-\u9fff]+', data)
            for item in matches:
                # encode each match as utf-8 before printing
                print item.encode('utf-8')
        except Exception as e:
            print e
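Splitting the Content-Type value on `charset=` is fragile: a header with no charset, a quoted charset, or extra parameters after it all give a string that is not a usable codec name. A more tolerant parse can lean on the standard library's header parser. The following is a Python 3 sketch, not part of the demo above; `charset_from_content_type` is a hypothetical helper, and the `utf-8` default is an assumption.

```python
from email.message import Message


def charset_from_content_type(content_type, default="utf-8"):
    """Return the lowercased charset named in a Content-Type value, or a default."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns None when the header has no charset parameter.
    return msg.get_content_charset() or default


print(charset_from_content_type("text/html; charset=UTF-8"))  # utf-8
print(charset_from_content_type("text/html"))                 # utf-8 (the default)
```

With a live Python 3 response from `urllib.request.urlopen`, the same method is available directly as `res.headers.get_content_charset()`.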
