
Garbled text when scraping gb2312/gbk-encoded web pages with Python

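I wrote a web crawler to scrape pages, but when a page is encoded as gbk/gb2312 the text comes out garbled, as shown below:

After extracting the text and printing it directly, the output str looks like this: 鹿貿??????驢錄?????壟?酶_鹿貿??????驢錄???酶_鹿貿??鹿蘆?帽?鹵驢錄???酶_鹿貿????鹿蘆

This problem bothered me for a long time. Searching Baidu and Google turned up no fully workable method, and converting back and forth between encodings did not solve it either. After a lot of trial and error I finally got it working. The methods below were found by experiment, and I am sharing them so we can learn from each other. Sometimes a problem is small and easy to solve, but until you find the way in, the road to a solution can be long. In short, when you hit a problem, try several approaches; one of them will usually work.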

1. Change the encoding of the page source

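# -*- coding: utf-8 -*-
# Python 2

import urllib2

req = urllib2.Request("http://www.baidu.com/")
res = urllib2.urlopen(req)
html = res.read()   # raw bytes in the page's own encoding
res.close()

# Decode the gb2312 bytes to unicode, then re-encode as utf-8
html = unicode(html, "gb2312").encode("utf8")  # gb2312 ---> utf-8
print html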
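
For readers on Python 3, where urllib2 and unicode() no longer exist, the same conversion can be sketched without touching the network. This is only a sketch under my own assumptions: the sample markup is made up and stands in for what res.read() would return, and bytes.decode() takes the place of unicode(html, "gb2312").

```python
# Python 3 sketch (assumption: `raw` stands in for the bytes read from a gb2312 page).
raw = "<title>中文网页</title>".encode("gb2312")

text = raw.decode("gb2312")        # bytes -> str, like unicode(html, "gb2312")
utf8_bytes = text.encode("utf-8")  # str -> utf-8 bytes, like .encode("utf8")

print(text)
# Each Chinese character takes 2 bytes in gb2312 and 3 bytes in utf-8,
# so the two byte strings differ in length even though the text is the same.
print(len(raw), len(utf8_bytes))
```

The text itself never changes here; only its byte representation does.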
           

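2. Handling character-set conversion when scraping web pages with Python

Sometimes, after scraping and processing a page, we save the resulting string to a file or write it to a database, and at that point we need to specify the string's encoding. If the scraped page is encoded as gb2312 while our database uses utf-8, inserting the data without any conversion may produce garbled text (I have not tested this, and I do not know whether the database converts automatically), so we need to convert gb2312 to utf-8 by hand.

First, recall that in Python 2 the default encoding for strings is ASCII. English text is no problem, of course, but as soon as Chinese shows up it falls over.

You may remember that when printing Chinese characters in Python 2, you need to put a u in front of the string: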

print u"来搞基吗？"

Only then is the Chinese displayed properly. The u prefix makes the string that follows a unicode string, which is what allows the Chinese text to be displayed correctly.

Related to this is the unicode() function, used as follows:

str = "来搞基"
str = unicode(str, "utf-8")  # the second argument must match the real encoding of str
print str

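The difference from the u prefix is that here unicode() converts str to unicode, and the second argument must be specified correctly. The utf-8 here is the character set of my test.py script file itself; the default may be ANSI (on Windows, the system code page).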
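
To show why that second argument matters, here is a small Python 3 sketch (Python 3 is my assumption; there bytes.decode() plays the role of unicode()). The helper try_decode is a hypothetical name of my own, not a library function.

```python
def try_decode(raw, encoding):
    """Return the decoded text, or None if `raw` is not valid in `encoding`."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None

gb_bytes = "中文".encode("gb2312")     # pretend these bytes came from a gb2312 source

print(try_decode(gb_bytes, "gb2312"))  # the codec matches the data
print(try_decode(gb_bytes, "utf-8"))   # the codec does not match, so this prints None
```

Note that a wrong codec does not always raise an error. Some codecs accept any byte sequence and silently return garbage, which is exactly how garbled text ends up on screen.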

unicode is the key here. Let's continue.

Now let's fetch the Baidu homepage. Note that when you visit the Baidu homepage as a guest and view the page source, it declares charset=gb2312.

import urllib2

def main():
    f = urllib2.urlopen("http://www.baidu.com")
    str = f.read()                 # raw bytes of the page (gb2312)
    f.close()
    str = unicode(str, "gb2312")   # decode the bytes to a unicode string
    fp = open("baidu.html", "w")
    fp.write(str.encode("utf-8"))  # re-encode as utf-8 and save
    fp.close()

if __name__ == '__main__':
    main()

Explanation:

First we fetch the Baidu homepage with urllib2.urlopen(); f is the handle, and str=f.read() reads the entire source into str.

To be clear, str now holds the HTML source we fetched. Because the page's character set is gb2312, if we save it straight to a file, the file's encoding will be ANSI.

For most people that is enough, but what if I want to convert gb2312 to utf-8?

First:

    str=unicode(str,"gb2312") # gb2312 here is the actual character set of str; we now convert it to unicode

Then:

    str=str.encode("utf-8") # re-encode the unicode string as utf-8

Finally:

    Write str to the file, then open the file and check its encoding property: it is now utf-8. Change <meta charset="gb2312" to <meta charset="utf-8" and you have a utf-8 web page. After all this, we have completed a gb2312->utf-8 conversion.
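
The three steps above, including the meta tag change, can be sketched in Python 3 as follows. The markup, the temporary directory and the variable names are my own assumptions for illustration; nothing is fetched from the network.

```python
import os
import tempfile

# Stand-in for the bytes f.read() would return; the markup is made up.
raw = '<meta charset="gb2312"><title>百度一下</title>'.encode("gb2312")

text = raw.decode("gb2312")                                 # first: bytes -> str
text = text.replace('charset="gb2312"', 'charset="utf-8"')  # keep the declaration truthful
data = text.encode("utf-8")                                 # then: str -> utf-8 bytes

# finally: write in binary mode so nothing re-encodes the bytes on the way out
out_path = os.path.join(tempfile.mkdtemp(), "baidu.html")
with open(out_path, "wb") as fp:
    fp.write(data)

# Read the file back as utf-8 to check the result
with open(out_path, encoding="utf-8") as fp:
    print(fp.read())
```

Rewriting the meta tag matters because a browser that trusts a stale charset="gb2312" declaration would misread the utf-8 bytes.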

Summary:

    To recap, if you need to save a string in a specific character set, there are three steps:

    1: Use unicode(str,"original encoding") to decode str into a unicode string

    2: Use str.encode("target character set") to convert the unicode string str into the character set you want

    3: Save str to a file, write it to a database, and so on. The encoding is the one you specified, of course.
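
As a reusable form of steps 1 and 2, here is a hedged Python 3 sketch. The function name transcode and the candidate list are my own choices, not anything from a library. The fallback exists because pages that declare gb2312 sometimes contain characters that, as far as I know, exist only in the larger gbk/gb18030 sets (镕 is the usual example), and the strict gb2312 codec rejects those.

```python
def transcode(raw, candidates=("gb2312", "gbk", "gb18030"), target="utf-8"):
    """Decode `raw` with the first candidate that works, then re-encode it.

    Returns (re-encoded bytes, name of the source encoding that worked).
    """
    for name in candidates:
        try:
            text = raw.decode(name)       # step 1: bytes -> str
        except UnicodeDecodeError:
            continue                      # wrong guess, try the next candidate
        return text.encode(target), name  # step 2: str -> target bytes
    raise ValueError("none of the candidate encodings could decode the input")

plain = "中文".encode("gb2312")
rare = "朱镕基".encode("gbk")   # sample text chosen to include a rarer character

print(transcode(plain)[1])
print(transcode(rare)[0].decode("utf-8"))
```

Step 3, saving the bytes, is then a plain binary write of the first element of the returned tuple.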

3. Parsing HTML with lxml

When I used lxml.etree in a web crawler to scrape pages, a gbk/gb2312-encoded page again produced garbled text, as shown below:

After extracting the text and printing it directly, the output v looks like this:

鹿貿??????驢錄?????壟?酶_鹿貿??????驢錄???酶_鹿貿??鹿蘆?帽?鹵驢錄???酶_鹿貿????鹿蘆 and the type of this v is <type 'lxml.etree._ElementUnicodeResult'>

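How do we fix this? We can change the encoding of the source:

from lxml import etree

# response is assumed to be a requests-style response object fetched earlier
response.encoding = 'utf-8'
page = etree.HTML(response.content)
nodes_title = page.xpath("//title//text()")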

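With this, the printed nodes_title[0] displays as normal Chinese.

Note in particular: response.text is prone to encoding problems, so use response.content instead. In requests, response.content is the raw, undecoded bytes, which lets the parser work out the encoding itself, whereas response.text has already been decoded using response.encoding.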
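
One likely cause of this kind of garbling is that gbk bytes were decoded as ISO-8859-1 (latin-1), which is what requests falls back to for text responses whose headers name no charset. The Python 3 sketch below reproduces the effect with the standard library only. It does not use requests or lxml, and the sample string is my own assumption.

```python
raw = "国际原油价格".encode("gbk")   # bytes as a gbk page would serve them

garbled = raw.decode("latin-1")      # the wrong guess: every byte becomes one character
print(garbled)

# Because latin-1 maps each byte to exactly one character, the wrong decode
# can be undone by encoding back to bytes and decoding with the right codec.
repaired = garbled.encode("latin-1").decode("gbk")
print(repaired)
```

This repair only works when the wrong decode was lossless, as it is with latin-1. Once characters have been replaced by ? the original bytes are gone.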