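Unicode defines a normalization process with four standard forms: NFC, NFD, NFKC, and NFKD. NFKC converts most full-width Chinese punctuation to the corresponding ASCII punctuation, and also converts full-width characters to their half-width equivalents. For example: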
import unicodedata

# Full-width punctuation, digits, and symbols mixed with Chinese text.
t = '中國,中文,标點符号!你好?12345@#【】+=-()'
t2 = unicodedata.normalize('NFKC', t)
'''
>>> print(t2)
中國,中文,标點符号!你好?12345@#【】+=-()
'''
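Author: 靈劍
Link: https://www.zhihu.com/question/37720196/answer/115870233
Source: Zhihu
Copyright belongs to the author. For commercial reuse, contact the author for permission; for non-commercial reuse, credit the source.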
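To see how the four forms mentioned above differ, a minimal sketch can normalize one short string with each of them and list the resulting code points. The sample string and the `forms` dictionary are illustrative choices, not part of the original answer.

```python
import unicodedata

# Illustrative sample: FULLWIDTH LATIN CAPITAL LETTER A followed by
# LATIN SMALL LETTER E WITH ACUTE (precomposed).
sample = "\uff21\u00e9"

forms = {}
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    normalized = unicodedata.normalize(form, sample)
    forms[form] = [f"U+{ord(ch):04X}" for ch in normalized]
    print(form, forms[form])
```

Only the two "K" (compatibility) forms fold the full-width letter to ASCII `A`; the two "D" forms additionally split `é` into `e` plus a combining accent.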
# Apply the same normalization to a whole file (requires `import unicodedata`).
with open('F:/src.txt', 'r', encoding='utf-8') as f:
    res = unicodedata.normalize('NFKC', f.read())
with open('F:/dst.txt', 'w', encoding='utf-8') as ff:
    ff.write(res)
The function below accepts either a string or the path to a .txt file and processes it:
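def punctuation_mend(string):
    import unicodedata
    import os
    # NFKC already folds full-width forms such as ,!?()%#@& and the digits;
    # the table additionally maps marks that NFKC leaves alone: 。【】“”‘’.
    table = {ord(f): ord(t) for f, t in zip(
        u',。!?【】()%#@&1234567890“”‘’',
        u',.!?[]()%#@&1234567890""\'\'')}
    if os.path.isfile(string):
        # The argument is an existing file path: rewrite the file in place.
        with open(string, 'r', encoding='utf-8') as f:
            res = unicodedata.normalize('NFKC', f.read())
        res = res.translate(table)
        with open(string, 'w', encoding='utf-8') as f:
            f.write(res)
    else:
        # Otherwise treat the argument as the text itself.
        res = unicodedata.normalize('NFKC', string)
        res = res.translate(table)
    return res

print(punctuation_mend('【】()%#@&“”'))  # prints []()%#@&""
punctuation_mend('F:/z.txt')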
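Because `punctuation_mend` decides between "text" and "file path" by checking whether the argument exists on disk, and overwrites the file when it does, it may be worth separating the two cases. The sketch below is one possible variant, not the original author's code: `normalize_text`, `normalize_file`, and `EXTRA_TABLE` are hypothetical names, and the demo writes to a temporary directory rather than to `F:/`.

```python
import tempfile
import unicodedata
from pathlib import Path

# Marks that NFKC leaves unchanged, mapped to ASCII (same pairs as the table above).
EXTRA_TABLE = str.maketrans("。【】“”‘’", ".[]\"\"''")


def normalize_text(text: str) -> str:
    """Apply NFKC, then map the remaining Chinese punctuation to ASCII."""
    return unicodedata.normalize("NFKC", text).translate(EXTRA_TABLE)


def normalize_file(src: str, dst: str) -> None:
    """Read src as UTF-8, normalize it, and write the result to dst."""
    text = Path(src).read_text(encoding="utf-8")
    Path(dst).write_text(normalize_text(text), encoding="utf-8")


# Demo on a throwaway file so no existing data is overwritten.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "src.txt"
    dst = Path(tmp) / "dst.txt"
    src.write_text("你好,世界!【测试】", encoding="utf-8")
    normalize_file(str(src), str(dst))
    demo_result = dst.read_text(encoding="utf-8")

print(demo_result)
```

Writing to a separate destination keeps the source file intact, and a string that happens to match a file name is never mistaken for a path.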