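Using UTF-8 Encoding in Python to Tell Chinese from English Characters (Repost)

The small tool below can tell whether a unicode character is a Chinese character, a digit, an English letter, or something else. It also converts full-width characters to half-width and normalizes unicode strings.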

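#!/usr/bin/env python
# -*- coding:GBK -*-
"""Utilities for handling Chinese text:

Tell whether a unicode character is a Chinese character, a digit,
an English letter, or something else.
Convert full-width characters to half-width."""

__author__="internetsweeper <[email protected]>"
__date__="2007-08-04"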

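def is_chinese(uchar):
    """Return True if the unicode character is a Chinese character."""
    # U+4E00..U+9FA5 is the basic CJK Unified Ideographs range
    if uchar >= u'\u4e00' and uchar<=u'\u9fa5':
        return True
    else:
        return False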

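def is_number(uchar):
    """Return True if the unicode character is an ASCII digit."""
    if uchar >= u'\u0030' and uchar<=u'\u0039':
        return True
    else:
        return False

def is_alphabet(uchar):
    """Return True if the unicode character is an English letter."""
    if (uchar >= u'\u0041' and uchar<=u'\u005a') or (uchar >= u'\u0061' and uchar<=u'\u007a'):
        return True
    else:
        return False

def is_other(uchar):
    """Return True if the character is not a Chinese character, digit, or English letter."""
    if not (is_chinese(uchar) or is_number(uchar) or is_alphabet(uchar)):
        return True
    else:
        return False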

def B2Q(uchar):
    """Convert a half-width character to full-width."""
    inside_code=ord(uchar)
    if inside_code<0x0020 or inside_code>0x7e: # not a half-width character: return it unchanged
        return uchar
    if inside_code==0x0020: # apart from the space, the rule is: half-width = full-width - 0xfee0
        inside_code=0x3000
    else:
        inside_code+=0xfee0
    return unichr(inside_code)

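def Q2B(uchar):
    """Convert a full-width character to half-width."""
    inside_code=ord(uchar)
    if inside_code==0x3000:
        inside_code=0x0020
    else:
        inside_code-=0xfee0
    if inside_code<0x0020 or inside_code>0x7e: # result is not a half-width character: return the original
        return uchar
    return unichr(inside_code)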

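def stringQ2B(ustring):
    """Convert every full-width character in the string to half-width."""
    return "".join([Q2B(uchar) for uchar in ustring])

def uniform(ustring):
    """Normalize a string: full-width to half-width, then upper case to lower case."""
    return stringQ2B(ustring).lower()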

def string2List(ustring):
    """Split ustring at every 'other' character, keeping the runs of
    Chinese characters, letters, and digits."""
    retList=[]
    utmp=[]
    for uchar in ustring:
        if is_other(uchar):
            if len(utmp)==0:
                continue
            else:
                retList.append("".join(utmp))
                utmp=[]
        else:
            utmp.append(uchar)
    if len(utmp)!=0:
        retList.append("".join(utmp))
    return retList

if __name__=="__main__":
    #test Q2B and B2Q
    for i in range(0x0020,0x007F):
        print Q2B(B2Q(unichr(i))),B2Q(unichr(i))
    #test uniform
    ustring=u'中國人名ａ高頻Ａ'
    ustring=uniform(ustring)
    ret=string2List(ustring)
    print ret
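
The script above targets Python 2 (it relies on unichr and the print statement). As a minimal sketch, assuming Python 3, where str is already Unicode and chr replaces unichr, the same width conversion and splitting might look like the following. The names to_fullwidth, to_halfwidth, and split_runs are illustrative and not part of the original tool.

```python
# Python 3 sketch; function names are illustrative, not from the original tool.

def to_fullwidth(ch):
    """Half-width to full-width, mirroring B2Q."""
    code = ord(ch)
    if code < 0x20 or code > 0x7E:   # not a half-width ASCII character
        return ch
    if code == 0x20:                 # the space maps to the ideographic space
        return chr(0x3000)
    return chr(code + 0xFEE0)

def to_halfwidth(ch):
    """Full-width to half-width, mirroring Q2B."""
    code = ord(ch)
    if code == 0x3000:
        return " "
    code -= 0xFEE0
    if code < 0x20 or code > 0x7E:   # result is not half-width ASCII
        return ch
    return chr(code)

def is_other(ch):
    """True for anything that is not a Chinese character, ASCII digit, or ASCII letter."""
    return not ("\u4e00" <= ch <= "\u9fa5" or "0" <= ch <= "9"
                or "A" <= ch <= "Z" or "a" <= ch <= "z")

def split_runs(text):
    """Normalize width and case, then split at every 'other' character."""
    normalized = "".join(to_halfwidth(ch) for ch in text).lower()
    runs, current = [], []
    for ch in normalized:
        if is_other(ch):
            if current:
                runs.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        runs.append("".join(current))
    return runs

print(split_runs("中國人名ａ高頻Ａ, ＡＢＣ　１２３"))
```

Note that, like string2List, this splits only at "other" characters such as punctuation and spaces; a Chinese run directly followed by a letter stays in one piece.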

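1. A uniform encoding for mixed Chinese-English strings

The least laborious way to process a mixed Chinese-English string is to convert it to Unicode, so that a Chinese character and an English letter each occupy one unit of the same width in memory. Python is well suited to this job: it uses Unicode internally, and to support operations on Unicode strings it provides a built-in unicode type that reimplements every method of the string object. It also provides codecs objects, which handle decoding and encoding of strings in the various encodings.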

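For example, the following Python code converts a UTF-8 encoded mixed Chinese-English string to Unicode:

# -*- coding:utf-8 -*-
a = "我的 English 學的不好"
print type(a), len(a), a
b = unicode(a, "utf-8")
print type(b), len(b), b

The string a is UTF-8 encoded; Python's built-in unicode object converts it to the Unicode string b. Running the code prints the output below. Comparing the lengths of a and b, len(b) is clearly the sensible result: it counts characters, whereas len(a) counts UTF-8 bytes, three for each Chinese character.

<type 'str'> 27 我的 English 學的不好
<type 'unicode'> 15 我的 English 學的不好

One thing to note is that although Unicode is billed as a "unified code", it still comes in two forms: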

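UCS-2: a 16-bit code with 2^16 = 65536 code points.
UCS-4: a 32-bit code. The current rule is that the top bit of the first byte is 0, so it has 2^31 = 2147483648 code points, but only those from 0x00000000 to 0x0010FFFF are in use at present, 1114112 in total.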
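
The code point counts quoted above can be checked with plain arithmetic. The short Python 3 sketch below does that, and also shows why 16 bits are not enough for every character: a code point above U+FFFF takes two 16-bit units when encoded as UTF-16. The variable names are illustrative.

```python
# Code point counts, checked with plain arithmetic.
ucs2_code_points = 2 ** 16
ucs4_code_points = 2 ** 31
used_code_points = 0x10FFFF - 0x000000 + 1

print(ucs2_code_points)   # 65536
print(ucs4_code_points)   # 2147483648
print(used_code_points)   # 1114112

# A character beyond U+FFFF does not fit in one 16-bit unit;
# UTF-16 stores it as a surrogate pair (two 16-bit units, 4 bytes).
ch = "\U00020000"          # a code point above U+FFFF
utf16_units = len(ch.encode("utf-16-be")) // 2
print(hex(ord(ch)), utf16_units)
```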

The variable maxunicode in Python's sys module tells whether the running Python uses UCS-2 or UCS-4 for Unicode:

import sys
print sys.maxunicode

If sys.maxunicode is 1114111, it is UCS-4; if it is 65535, it is UCS-2.
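
The code in this section is Python 2. As a rough Python 3 counterpart, offered as a sketch rather than a drop-in replacement: a string literal is already Unicode there, so the byte/character distinction shows up between bytes and str instead of between str and unicode. Since Python 3.3 (PEP 393) there is also no UCS-2 or UCS-4 build distinction, and sys.maxunicode is always 1114111.

```python
import sys

# In Python 3 the literal is already a Unicode str; encoding yields UTF-8 bytes.
b = "我的 English 學的不好"
a = b.encode("utf-8")

print(type(a), len(a))   # bytes object, length counted in bytes
print(type(b), len(b))   # str object, length counted in characters

# Decoding the bytes recovers the original text.
assert a.decode("utf-8") == b

# No narrow/wide build distinction since Python 3.3.
print(sys.maxunicode)
```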

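2. Splitting mixed Chinese-English strings

Once the Chinese and English text share one encoding, splitting them apart is simple. First prepare one collector for Chinese text and one for English text; two empty string objects will do, say zh_gather and en_gather. Then prepare a list object that stores the values of zh_gather and en_gather in the order they are separated. The following Python function takes a mixed Chinese-English Unicode string and returns a list holding the Chinese and English substrings.

def split_zh_en (zh_en_str):
    # relies on is_zh () and the mark dict, both defined further below
    zh_en_group = []
    zh_gather = ""
    en_gather = ""
    zh_status = False

    for c in zh_en_str:
        if not zh_status and is_zh (c):
            # switching from English to Chinese: flush the English run
            zh_status = True
            if en_gather != "":
                zh_en_group.append ([mark["en"],en_gather])
                en_gather = ""
        elif not is_zh (c) and zh_status:
            # switching from Chinese to English: flush the Chinese run
            zh_status = False
            if zh_gather != "":
                zh_en_group.append ([mark["zh"], zh_gather])
        if zh_status:
            zh_gather += c
        else:
            en_gather += c
            zh_gather = ""

    # flush whichever run is still pending
    if en_gather != "":
        zh_en_group.append ([mark["en"],en_gather])
    elif zh_gather != "":
        zh_en_group.append ([mark["zh"],zh_gather])

    return zh_en_group

In detail, the code above identifies each character as it iterates over the mixed string zh_en_str: a Chinese character is appended to zh_gather and an English one to en_gather. zh_status records which kind of run is in progress; whenever its value flips, the Chinese or English substring gathered so far is appended to zh_en_group.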
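
The same run-splitting idea can be written more compactly with itertools.groupby, which groups consecutive items that share a key. The sketch below assumes Python 3; split_zh_en_groupby and is_zh_simple are hypothetical names, and is_zh_simple is only a placeholder test covering the basic CJK Unified Ideographs block, not the fuller is_zh () used by the article.

```python
from itertools import groupby

MARK = {"en": 1, "zh": 2}   # same tag values as the article's mark dict

def is_zh_simple(ch):
    """Placeholder test (assumption): basic CJK Unified Ideographs block only."""
    return "\u4e00" <= ch <= "\u9fff"

def split_zh_en_groupby(text, is_zh=is_zh_simple):
    """Return [[tag, substring], ...] for consecutive zh / non-zh runs."""
    groups = []
    for zh, run in groupby(text, key=is_zh):
        tag = MARK["zh"] if zh else MARK["en"]
        groups.append([tag, "".join(run)])
    return groups

print(split_zh_en_groupby("我的 English 學的不好"))
```

Because groupby starts a new group whenever the key changes, no explicit status flag or pending-run bookkeeping is needed. Spaces are not Chinese, so they stay attached to the English run, just as in the loop version.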

The conditional that decides whether a character of zh_en_str is Chinese calls a function is_zh (), implemented as follows:

def is_zh (c):
    x = ord (c)
    # Punct & Radicals
    if x >= 0x2e80 and x <= 0x33ff:
        return True
    # Fullwidth Latin Characters
    elif x >= 0xff00 and x <= 0xffef:
        return True
    # CJK Unified Ideographs
    # (as written the range starts at U+4E00, so CJK Unified Ideographs
    # Extension A, U+3400..U+4DBF, is not covered)
    elif x >= 0x4e00 and x <= 0x9fbb:
        return True
    # CJK Compatibility Ideographs
    elif x >= 0xf900 and x <= 0xfad9:
        return True
    # CJK Unified Ideographs Extension B
    elif x >= 0x20000 and x <= 0x2a6d6:
        return True
    # CJK Compatibility Supplement
    elif x >= 0x2f800 and x <= 0x2fa1d:
        return True
    else:
        return False

This code comes from the XeTeX preprocessor written by jjgod.
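
A chain of range checks like this can also be expressed as a table, which makes the covered ranges easier to read and extend. The following is a Python 3 sketch using the same ranges as is_zh () above; ZH_RANGES and is_zh_table are illustrative names, not part of jjgod's code.

```python
# (start, end, description), using the same ranges as is_zh () above.
ZH_RANGES = [
    (0x2E80, 0x33FF, "punctuation and radicals"),
    (0xFF00, 0xFFEF, "full-width Latin characters"),
    (0x4E00, 0x9FBB, "CJK Unified Ideographs"),
    (0xF900, 0xFAD9, "CJK Compatibility Ideographs"),
    (0x20000, 0x2A6D6, "CJK Unified Ideographs Extension B"),
    (0x2F800, 0x2FA1D, "CJK Compatibility Supplement"),
]

def is_zh_table(ch):
    """True if the character's code point falls in any listed range."""
    code = ord(ch)
    return any(start <= code <= end for start, end, _ in ZH_RANGES)

for ch in ["中", "a", "，", "。", "\U00020000"]:
    print(hex(ord(ch)), is_zh_table(ch))
```

Full-width punctuation such as "，" and "。" counts as Chinese under these ranges, which keeps it inside a Chinese run when splitting.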

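To make the separated Chinese and English substrings convenient to use, I tag each one when storing it in the zh_en_group list, with mark["zh"] or mark["en"]. mark is a dict object, defined as follows:

mark = {"en":1, "zh":2}

When the English or Chinese strings in zh_en_group need processing, the tag makes it quick to tell whether a string is Chinese or English, for example:

for item in zh_en_group:
    if item[0] == mark["en"]:
        pass  # do something with the English substring item[1]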
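
As a concrete illustration of acting on the tags, the Python 3 sketch below upper-cases the English runs and leaves the Chinese runs untouched. The zh_en_group list is written out by hand in the shape split_zh_en returns, and the upper-casing is only a hypothetical processing step chosen for the example.

```python
mark = {"en": 1, "zh": 2}

# A result in the shape split_zh_en returns, written out by hand for illustration.
zh_en_group = [
    [mark["zh"], "我的"],
    [mark["en"], " English "],
    [mark["zh"], "學的不好"],
]

pieces = []
for tag, text in zh_en_group:
    if tag == mark["en"]:
        pieces.append(text.upper())   # hypothetical processing step for English runs
    else:
        pieces.append(text)           # Chinese runs pass through unchanged

result = "".join(pieces)
print(result)
```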
