
Converting Chinese punctuation to English punctuation in Python

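Unicode defines a normalization process with four standard forms: NFC, NFD, NFKC, and NFKD. NFKC converts most full-width Chinese punctuation marks to their ASCII equivalents and also converts full-width letters and digits to the corresponding half-width characters. Punctuation that has no compatibility mapping, such as 【】, is left unchanged. For example: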

import unicodedata
# Input mixes full-width punctuation and digits with CJK-only brackets
t = u'中國,中文,标點符号!你好?12345@#【】+=-()'
t2 = unicodedata.normalize('NFKC', t)
'''
>>> print(t2)
中國,中文,标點符号!你好?12345@#【】+=-()
'''
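To make the difference between the four forms concrete, here is a minimal sketch that prints the code points each form produces. The helper name `codepoints` and the sample string are illustrative assumptions, not part of the original answer.

```python
import unicodedata

def codepoints(form, text):
    """Return the code points of `text` after normalizing with `form`."""
    return [f'U+{ord(c):04X}' for c in unicodedata.normalize(form, text)]

# Precomposed "é" (U+00E9) followed by a full-width comma (U+FF0C)
sample = '\u00e9\uff0c'
for form in ('NFC', 'NFD', 'NFKC', 'NFKD'):
    print(form, codepoints(form, sample))
```

Only the two "K" (compatibility) forms turn the full-width comma into an ASCII comma. The "D" forms additionally split "é" into a base letter plus a combining accent, which is usually not wanted here, so NFKC is the natural choice for punctuation cleanup.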

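Author: 靈劍
Link: https://www.zhihu.com/question/37720196/answer/115870233
Source: Zhihu
Copyright belongs to the author. For commercial reprints, contact the author for authorization; for non-commercial reprints, credit the source.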
           
import unicodedata

# Normalize a whole file: read F:/src.txt and write the result to F:/dst.txt
with open('F:/src.txt', 'r', encoding='utf-8') as f:
    res = unicodedata.normalize('NFKC', f.read())
    with open('F:/dst.txt', 'w', encoding='utf-8') as ff:
        ff.write(res)
           

Process either an input string or the path to a txt file

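import os
import unicodedata

def punctuation_mend(string):
    """Convert Chinese punctuation to ASCII punctuation.

    If `string` is the path of an existing file, the file is rewritten
    in place and None is returned; otherwise the converted string is returned.
    """
    # NFKC already handles the full-width forms; the table matters mainly for
    # characters NFKC leaves unchanged, such as 。【】“”‘’.
    table = {ord(f): ord(t) for f, t in zip(
        u',。!?【】()%#@&1234567890“”‘’',
        u',.!?[]()%#@&1234567890""\'\'')}
    if os.path.isfile(string):
        with open(string, 'r', encoding='utf-8') as f:
            res = unicodedata.normalize('NFKC', f.read())
            res = res.translate(table)
        # Overwrite the original file with the converted text
        with open(string, 'w', encoding='utf-8') as f:
            f.write(res)
    else:
        res = unicodedata.normalize('NFKC', string)
        res = res.translate(table)
        return res

print(punctuation_mend('【】()%#@&“”'))  # prints []()%#@&""
punctuation_mend('F:/z.txt')  # converts the file in place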
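Because the function above decides between "string" and "file" by checking whether the argument happens to be an existing path, a string that coincides with a file name would overwrite that file. One possible alternative, sketched here under the assumption that separate entry points are acceptable, splits the two cases and writes to a separate destination. The names `CJK_TO_ASCII`, `to_ascii_punct`, and `convert_file` are hypothetical, and the table is reduced to the characters that NFKC does not handle itself.

```python
import unicodedata
from pathlib import Path

# Only characters that NFKC leaves untouched need an explicit mapping.
CJK_TO_ASCII = str.maketrans('。【】“”‘’', '.[]""\'\'')

def to_ascii_punct(text):
    """Normalize with NFKC, then map the remaining CJK punctuation."""
    return unicodedata.normalize('NFKC', text).translate(CJK_TO_ASCII)

def convert_file(src, dst, encoding='utf-8'):
    """Read `src`, convert its punctuation, and write the result to `dst`."""
    text = Path(src).read_text(encoding=encoding)
    Path(dst).write_text(to_ascii_punct(text), encoding=encoding)

print(to_ascii_punct('【你好】,“世界”!'))  # prints [你好],"世界"!
```

Keeping the string conversion in its own function makes it easy to test, and writing to `dst` rather than back to `src` leaves the original file available if the result is not what was expected.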