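Unicode defines a normalization process with four standard forms: NFC, NFD, NFKC, and NFKD. NFKC converts fullwidth characters to their halfwidth equivalents, which covers many Chinese punctuation marks (the fullwidth comma, exclamation mark, question mark, and parentheses, among others). Marks with no compatibility mapping, such as 。 and 【】, are left unchanged. For example: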
import unicodedata

t = '中国,中文,标点符号!你好?12345@#【】+=-()'
t2 = unicodedata.normalize('NFKC', t)
'''
>>> print(t2)
中国,中文,标点符号!你好?12345@#【】+=-()
'''
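Author: 灵剑
Link: https://www.zhihu.com/question/37720196/answer/115870233
Source: Zhihu
Copyright belongs to the author. For commercial reuse, contact the author for permission; for non-commercial reuse, credit the source.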
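To see how the four forms named above differ, here is a minimal sketch that runs each of them over a few sample characters. The helper name `normalize_all` and the sample characters are my own choices for illustration and are not part of the quoted answer.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

def normalize_all(text):
    """Return a dict mapping each form name to the normalized text."""
    return {form: unicodedata.normalize(form, text) for form in FORMS}

# Sample inputs written as escapes so the exact code points are unambiguous.
samples = {
    "fullwidth letter A": "\uff21",
    "e + combining acute": "e\u0301",
    "precomposed e-acute": "\u00e9",
    "fi ligature": "\ufb01",
}

for label, text in samples.items():
    # Show code points rather than glyphs, since many of these look identical.
    shown = {form: [f"U+{ord(c):04X}" for c in out]
             for form, out in normalize_all(text).items()}
    print(label, shown)
```

The C and D forms only compose or decompose canonically equivalent sequences, so the fullwidth letter and the ligature pass through them unchanged. The KC and KD forms additionally apply compatibility mappings, which is what turns fullwidth characters into halfwidth ones.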
import unicodedata

# Read the source file, normalize it, and write the result to a new file.
with open('F:/src.txt', 'r', encoding='utf-8') as f:
    res = unicodedata.normalize('NFKC', f.read())
with open('F:/dst.txt', 'w', encoding='utf-8') as ff:
    ff.write(res)
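For a large file, reading everything at once with `f.read()` may be undesirable. A possible line-by-line variant is sketched below. `normalize_file` is a hypothetical helper name, and the sketch assumes that normalizing each line separately gives the same result as normalizing the whole text, which should hold because a line feed neither decomposes nor combines with the characters next to it.

```python
import unicodedata

def normalize_file(src_path, dst_path, form="NFKC", encoding="utf-8"):
    """Normalize src_path line by line and write the result to dst_path.

    src_path and dst_path must be different files: opening the destination
    for writing truncates it immediately.
    """
    with open(src_path, "r", encoding=encoding) as src, \
         open(dst_path, "w", encoding=encoding) as dst:
        for line in src:
            # Each line keeps its trailing newline, so the layout is preserved.
            dst.write(unicodedata.normalize(form, line))
```

It would be called as `normalize_file('F:/src.txt', 'F:/dst.txt')`, matching the paths used above.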
Pass in either a string or the path to a .txt file to process it:
import os
import unicodedata

def punctuation_mend(string):
    """Normalize a string, or rewrite a text file in place if given its path."""
    # NFKC already folds fullwidth forms to ASCII; the table mainly matters
    # for what NFKC leaves alone: 。【】 and the curly quotes.
    table = {ord(f): ord(t) for f, t in zip(
        ',。!?【】()%#@&1234567890“”‘’',
        ',.!?[]()%#@&1234567890""\'\'')}
    if os.path.isfile(string):
        with open(string, 'r', encoding='utf-8') as f:
            res = unicodedata.normalize('NFKC', f.read())
        res = res.translate(table)
        with open(string, 'w', encoding='utf-8') as f:
            f.write(res)
    else:
        res = unicodedata.normalize('NFKC', string)
        res = res.translate(table)
    return res

print(punctuation_mend('【】()%#@&“”'))  # prints: []()%#@&""
punctuation_mend('F:/z.txt')  # rewrites the file in place
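The translation table is needed because NFKC leaves some Chinese punctuation untouched. A rough way to find out which marks survive in your own text is sketched below. `leftover_punctuation` is a hypothetical helper, and treating every non-ASCII character whose Unicode category starts with "P" as leftover punctuation is my assumption about what counts.

```python
import unicodedata

def leftover_punctuation(text):
    """Return the non-ASCII punctuation that is still present after NFKC."""
    normalized = unicodedata.normalize("NFKC", text)
    return sorted({c for c in normalized
                   if ord(c) > 127 and unicodedata.category(c).startswith("P")})

# Fullwidth , ! ? ( ) followed by the ideographic full stop, the lenticular
# brackets, and the curly quotes, written as escapes.
sample = "\uff0c\uff01\uff1f\uff08\uff09\u3002\u3010\u3011\u201c\u201d\u2018\u2019"
for ch in leftover_punctuation(sample):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
```

Whatever this reports is a candidate for a translation table like the one in `punctuation_mend`; characters it does not report were already handled by NFKC.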