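I wrote a web crawler to fetch pages, but when a page is encoded as gbk/gb2312 the text comes out garbled, like this:
After extracting the text and printing it directly, the resulting str looks like this: ¹óÖÝÊ¡¿¼ÊÔÐÅÏ¢Íø_¹óÖÝÊ¡¿¼ÊÔÍø_¹óÖݹ«ÎñÔ±¿¼ÊÔÍø_¹óÖÝÖй«
This problem cost me a long time. Searching Baidu and Google turned up no method that fully worked, and converting between encodings back and forth did not solve it, but I kept looking and eventually got it working. I am summarizing the problem here to share with everyone so we can learn from each other. Sometimes a problem is small and easy to solve, but if you cannot find the way in, the road to a solution gets long. In short, when you hit a problem, try several approaches; one of them will work.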
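Garbled output of this kind is what appears when gbk/gb2312 bytes are decoded with the wrong codec. Here is a minimal Python 3 sketch of the effect. It assumes the wrong codec was latin-1 (I have not confirmed which codec was actually applied), and the two-character sample string is my own choice.

```python
# Python 3 sketch: reproduce the garbling, then undo it.
original = "贵州"

# A gbk/gb2312 page sends these bytes over the wire.
raw = original.encode("gbk")

# Decoding them with the wrong codec (assumed here: latin-1) gives mojibake.
garbled = raw.decode("latin-1")      # "¹óÖÝ"

# Reverse the wrong decode, then decode with the right codec.
repaired = garbled.encode("latin-1").decode("gbk")
```

The repair works only because latin-1 maps every byte to exactly one character, so no information is lost by the wrong decode.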
1. Change the encoding of the page source
# -*- coding:utf8 -*-
import urllib2
req = urllib2.Request("http://www.baidu.com/")
res = urllib2.urlopen(req)
html = res.read()
res.close()
html = unicode(html, "gb2312").encode("utf8") #gb2312--->utf-8
print html
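One caveat with a strict "gb2312" decode: pages that declare gb2312 sometimes contain characters that exist only in the larger GBK or GB18030 sets, and then the decode raises an error. Below is a Python 3 sketch of a more tolerant conversion. The helper name to_utf8 and the fallback to gb18030 are my own assumptions, not part of the code above.

```python
def to_utf8(raw, declared="gb2312"):
    """Decode raw page bytes and re-encode them as UTF-8 (hypothetical helper)."""
    try:
        text = raw.decode(declared)
    except UnicodeDecodeError:
        # Assumption: the page really uses a superset of the declared charset.
        # GB18030 covers both GBK and GB2312; undecodable bytes become U+FFFD.
        text = raw.decode("gb18030", errors="replace")
    return text.encode("utf-8")
```

For example, to_utf8("朱镕基".encode("gbk")) still returns valid UTF-8 even though 镕 is not part of GB2312.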
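2. Handling character set conversion when fetching pages with Python
Sometimes we scrape a page and, after processing it, save the strings to a file or write them to a database. At that point we need to specify the encoding of the strings. If the scraped page is gb2312 and our database is utf-8, inserting the data without any conversion may produce garbled text (I have not tested this, so I do not know whether the database transcodes automatically), so we need to convert gb2312 to utf-8 by hand.
First, we know that the default encoding in Python 2 is ASCII. English is no problem, of course, but it falls over the moment it meets Chinese.
You may remember that to print Chinese characters in Python 2 you need to put a u in front of the string: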
print u"来搞基吗?"
Only then does the Chinese display. The u prefix turns the string that follows into a unicode object, so the Chinese text is shown correctly.
There is also a related function, unicode(), used like this:
str="来搞基"
str=unicode(str,"utf-8")
print str
This works much like u: unicode() converts str to unicode, and the second argument has to be specified correctly. Here utf-8 is the character set of my test.py script file itself; the default may be ANSI.
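For readers on Python 3, where str is already Unicode and the u prefix is optional, the counterpart of unicode(str, "utf-8") is bytes.decode("utf-8"). Here is a small sketch, using a sample string of my own, that also shows why an ASCII default fails on Chinese text:

```python
# Bytes as they would sit in a UTF-8 source file or a network response.
raw = "中文".encode("utf-8")       # 6 bytes for 2 characters

# Python 3 counterpart of Python 2's unicode(raw, "utf-8").
text = raw.decode("utf-8")

# Decoding with the ASCII codec fails on the first non-ASCII byte.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False
```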
unicode is the key here. Let's continue.
Now we fetch the Baidu home page. Note that when a visitor who is not logged in opens the Baidu home page and views the page source, it declares charset=gb2312.
import urllib2
def main():
    f=urllib2.urlopen("http://www.baidu.com")
    str=f.read()
    str=unicode(str,"gb2312")
    fp=open("baidu.html","w")
    fp.write(str.encode("utf-8"))
    fp.close()

if __name__ == '__main__' :
    main()
Explanation:
We first fetch the Baidu home page with urllib2.urlopen(); f is the handle, and str=f.read() reads the entire source into str.
To be clear, str now holds the HTML source we fetched. Because the page's character set is gb2312, saving it straight to a file gives a file whose encoding is ANSI.
For most people that is enough, but what if I want to convert the gb2312 to utf-8?
First:
    str=unicode(str,"gb2312") # gb2312 is the actual character set of str; this converts it to unicode
Then:
    str=str.encode("utf-8") # re-encode the unicode string as utf-8
Finally:
    Write str to the file and open the file to check its encoding: it is now utf-8. Change <meta charset="gb2312" to <meta charset="utf-8" and you have a utf-8 page. After all that, what we have actually done is a gb2312->utf-8 transcode.
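The meta tag change in the last step can also be done in code instead of by hand. This is a rough Python 3 sketch; the helper name declare_utf8 and the regular expression are my own assumptions, and it only handles the simple gb2312/gbk declarations discussed here.

```python
import re

# Matches charset=gb2312 / charset="gbk" inside a <meta ...> tag, ignoring case.
_META_CHARSET = re.compile(
    r"""(<meta[^>]*charset\s*=\s*["']?)(gb2312|gbk)""", re.IGNORECASE
)

def declare_utf8(html_text):
    """Rewrite the declared charset of an already-decoded page to utf-8."""
    return _META_CHARSET.sub(r"\g<1>utf-8", html_text)
```

Run it on the decoded text, before encoding to utf-8, so that the declaration and the actual bytes in the file agree.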
Summary:
    To recap, if you need to save a string in a specified character set, the steps are:
    1: Use unicode(str,"original encoding") to decode str into a unicode string
    2: Use str.encode("target character set") to convert the unicode string str into the character set you specify
    3: Save str to a file, write it to a database, and so on. By now the encoding has already been specified, hasn't it?
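The three steps above can be sketched in Python 3 as a single function. The name save_as and its parameters are hypothetical; the file is opened in binary mode so that the bytes produced in step 2 are written unchanged.

```python
import os
import tempfile

def save_as(raw, source_charset, target_charset, path):
    """Transcode raw bytes and write them to path (hypothetical helper)."""
    text = raw.decode(source_charset)        # step 1: bytes -> Unicode
    encoded = text.encode(target_charset)    # step 2: Unicode -> target bytes
    with open(path, "wb") as fp:             # step 3: save without re-encoding
        fp.write(encoded)
    return encoded
```

A call such as save_as(page_bytes, "gb2312", "utf-8", "baidu.html") would correspond to the Baidu example, with page_bytes standing in for whatever the fetch returned.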
3. Parsing HTML with lxml
I used lxml.etree in a web crawler to fetch pages, but when a page is encoded as gbk/gb2312 the text comes out garbled, like this:
After extracting the text and printing it directly, the resulting v looks like this:
¹óÖÝÊ¡¿¼ÊÔÐÅÏ¢Íø_¹óÖÝÊ¡¿¼ÊÔÍø_¹óÖݹ«ÎñÔ±¿¼ÊÔÍø_¹óÖÝÖй«
The type of this v is <type 'lxml.etree._ElementUnicodeResult'>
How can this be solved? You can change the encoding of the source: response.encoding = 'utf-8'
page = etree.HTML(response.content)
nodes_title = page.xpath("//title//text()")
Printed this way, nodes_title[0] shows up as normal Chinese.
One thing to note in particular: response.text is prone to encoding problems, so use response.content instead.
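The same idea, starting from the raw bytes and choosing the codec yourself, can be tried without any network access. The Python 3 sketch below swaps lxml for the standard library's html.parser so that it has no third-party dependency. TitleParser and title_from_bytes are hypothetical names, and the charset is assumed to be known in advance.

```python
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collects the text inside <title>...</title>."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

def title_from_bytes(content, charset):
    """Decode raw page bytes with an explicit charset, then extract the title."""
    parser = TitleParser()
    parser.feed(content.decode(charset))
    parser.close()
    return parser.title
```

Passing the page's real charset gives the readable title; passing a wrong one such as latin-1 gives the kind of garbled text shown earlier.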