
Converting Chinese punctuation to English punctuation in Python

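Unicode defines a normalization process with four forms in the standard: NFC, NFD, NFKC, and NFKD. NFKC converts full-width characters to their half-width equivalents, which covers most Chinese punctuation (for example , ! ? and full-width parentheses) as well as full-width digits and symbols. Punctuation that has no compatibility mapping, such as 。 and 【】, is left unchanged. For example: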

import unicodedata
# Input containing full-width punctuation, digits, and symbols
t = u'中国,中文,标点符号!你好?12345@#【】+=-()'
t2 = unicodedata.normalize('NFKC', t)
'''
>>> print(t2)
中国,中文,标点符号!你好?12345@#【】+=-()
'''

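Author: 灵剑
Link: https://www.zhihu.com/question/37720196/answer/115870233
Source: Zhihu
Copyright belongs to the author. For commercial reprints, contact the author for authorization; for non-commercial reprints, cite the source.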
           
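# Apply the same normalization to a whole file:
# read F:/src.txt, then write the normalized text to F:/dst.txt.
with open('F:/src.txt', 'r', encoding='utf-8') as f:
    res = unicodedata.normalize('NFKC', f.read())
with open('F:/dst.txt', 'w', encoding='utf-8') as ff:
    ff.write(res)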
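
The snippet above reads the whole file into memory. For large files, one possible variant is to normalize line by line. This is only a sketch: `normalize_file` is a hypothetical helper name, and it assumes the text does not rely on combining marks placed directly after a line break.

```python
import unicodedata

def normalize_file(src_path, dst_path):
    """Write an NFKC-normalized copy of src_path to dst_path, one line at a time."""
    with open(src_path, 'r', encoding='utf-8') as src, \
         open(dst_path, 'w', encoding='utf-8') as dst:
        for line in src:
            # Each line is normalized independently, so memory use stays small.
            dst.write(unicodedata.normalize('NFKC', line))
```

It would be called as `normalize_file('F:/src.txt', 'F:/dst.txt')`, using the same paths as the example above.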
           

Process either an input string or the path to a txt file:

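import os
import unicodedata

def punctuation_mend(string):
    """Normalize punctuation in a string, or in place in the file that the string names."""
    # NFKC already converts the full-width forms; the table matters for the
    # characters NFKC leaves unchanged, such as 。【】“”‘’.
    table = {ord(f): ord(t) for f, t in zip(
        ',。!?【】()%#@&1234567890“”‘’',
        ',.!?[]()%#@&1234567890""\'\'')}
    if os.path.isfile(string):
        # The argument is a path to an existing file: rewrite that file in place.
        with open(string, 'r', encoding='utf-8') as f:
            res = unicodedata.normalize('NFKC', f.read())
            res = res.translate(table)
        with open(string, 'w', encoding='utf-8') as f:
            f.write(res)
    else:
        # Otherwise treat the argument as the text itself and return the result.
        res = unicodedata.normalize('NFKC', string)
        res = res.translate(table)
        return res

print(punctuation_mend('【】()%#@&“”'))
punctuation_mend('F:/z.txt')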
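
Because the function above decides between "text" and "file path" with `os.path.isfile`, a string that happens to match an existing file name will cause that file to be rewritten. One way to avoid the ambiguity is to split the two cases into separate functions. The following is a minimal sketch, not the original author's code: `mend_text`, `mend_file`, and `EXTRA_TABLE` are hypothetical names, and the table only lists the characters that NFKC leaves unchanged.

```python
import unicodedata

# Only punctuation that NFKC does not convert needs an explicit entry.
EXTRA_TABLE = str.maketrans('。【】“”‘’', '.[]""\'\'')

def mend_text(text):
    """Apply NFKC, then map the remaining CJK punctuation to ASCII."""
    return unicodedata.normalize('NFKC', text).translate(EXTRA_TABLE)

def mend_file(path, encoding='utf-8'):
    """Rewrite the file at path in place using mend_text."""
    with open(path, 'r', encoding=encoding) as f:
        fixed = mend_text(f.read())
    with open(path, 'w', encoding=encoding) as f:
        f.write(fixed)

print(mend_text('【】()%#@&“”'))
```

With this split, the caller states the intent explicitly: `mend_text(...)` for a string, `mend_file(...)` for a path.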