These days, I am working on enhancing the localization mechanism for my project. I ran into some problems along the way, and although I've solved them, some of the issues are still tricky. So I would like to write a blog post here to explain a simple but interesting problem that comes up when writing Python scripts with Unicode characters.
HERE’S THE QUESTION:
Please type in the following script and debug it with Visual Studio.
Code Snippet:
```python
if __name__ == '__main__':
    test = '俠'
    print(test)
```
The Unicode code point for '俠' is 0x4fa0. The GBK code for '俠' is 0xcfc0.

Let's see the result first, and then we can think for a while and talk.
Situation A: Under a system with an en-US locale, the result is:

We get a '?'.
Situation B: Under a system with a zh-CN locale, the result is:
We get an error message: 'utf-8' codec can't decode byte 0xcf in position 0: invalid continuation byte.

Before we dive into the discussion, let's first get familiar with some basic concepts related to Unicode.
Unicode Programs and Non-Unicode Programs
As we all know, different countries have different languages and different characters. In the early days of programming history, Western countries built the first systems around their own characters. For example, we might only have a data type like `char` to define a character, with a length of 1 byte. As there are only a limited number of English characters (a, b, c, d, e, f, etc.), one byte is enough to hold all of them. The mapping rule is very simple; for example, 'a' is defined as 0x61. When we read a byte, we look it up in the ASCII table to find the glyph. All the APIs at that time used only `char` as the data type for a character.

Later on, Eastern countries also joined the programming world, but they have far more than 256 characters, so they had to find a way to display their languages. Each country then began to define its own rule for how to find its characters in a table. We call those rules encodings. For example, Chinese has several encodings such as Big5, GBK, and so on. Microsoft Windows then defined a concept which is now known as the code page. A code page is very much like an ASCII table; logically, it could look like this:
Char | Code |
---|---|
俠 | 0xcfc0 |
With this table, a program can easily find the related characters. But there were too many countries defining their own code pages, which became a burden for compatibility. So Unicode came in. Unicode is one big table which defines almost all the characters in the world, and each character has a unique code value. As we can see, one byte is not enough room to hold all those characters, so early Unicode specified that each character takes 2 bytes of memory, which can represent 2^16 different values. This required the program APIs to be updated. Programmers can use this suite of APIs and data structures to write programs which support the Unicode system. That's why most recent programs are called Unicode programs: they use Unicode data structures and Unicode APIs.
```c
#ifdef UNICODE
typedef wchar_t TCHAR;
#else
typedef char TCHAR;
#endif
```
```c
#ifdef UNICODE
#define SetWindowText SetWindowTextW
#else
#define SetWindowText SetWindowTextA
#endif
```
Thus, the language problem is solved. The Unicode table looks like this:
Char | Unicode |
---|---|
俠 | 0x4fa0 |
Yet Microsoft had to consider the legacy APIs and programs which use the early code pages. For the sake of compatibility, Microsoft defined its own code pages with a mapping to the Unicode characters. Such a code page looks like this:
Char | Code | Unicode |
---|---|---|
俠 | 0xcfc0 | 0x4fa0 |
With the above basic knowledge, we can divide programs into two groups: Unicode programs and non-Unicode programs.

A Unicode program uses the new data types and APIs which support the Unicode system; the glyphs shown on the screen are found through the Unicode table. For example, Microsoft Office uses C# as a programming language, and it supports Unicode. Python 3.4 supports the Unicode system, so some of its encoding APIs changed a lot compared with Python 2.

A non-Unicode program still uses the legacy data types and APIs, which find characters directly through the system's default code page; the glyphs shown on the screen are found in the code page. For example, the ancient cmd.exe of the Windows system is a non-Unicode program. So if you choose English as the default system code page, you will never see a Chinese character on the command screen.
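To make this concrete, here is a minimal sketch of the hand-off to a non-Unicode program, assuming cp437 stands in for the en-US console's code page:

```python
# A Unicode program holds text as Unicode in memory; before a non-Unicode
# program such as cmd.exe can display it, the text must be narrowed to the
# legacy code page. 'cp437' is an assumed stand-in for the en-US console.
text = '\u4fa0'  # the Unicode character 俠

try:
    data = text.encode('cp437')  # Unicode -> code page bytes
except UnicodeEncodeError as err:
    print(err)  # cp437 has no slot for U+4FA0, so the conversion fails
```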
UTF-8
There is another family of encodings called UTF-8 (and UTF-16, UTF-32). UTF-8 is a method of organizing Unicode code points as bytes. If you would like to know more about how it does so, simply search for it on Google. I won't go into the details here; I only want to stress that it is a rule for laying out Unicode, not a new character table. The advantage of UTF-8 is that it uses only one byte for narrow characters such as English letters, and thus saves a lot of space when dealing with English text.
UTF-8 | Unicode |
---|---|
0xe4 0xbe 0xa0 | 0x4fa0 |
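As a quick sanity check, here is a small sketch (run in a Unicode-aware shell) confirming the table above:

```python
# Encode U+4FA0 to UTF-8 and decode it back; the bytes match the table.
print('\u4fa0'.encode('utf-8'))  # b'\xe4\xbe\xa0'
assert b'\xe4\xbe\xa0'.decode('utf-8') == '\u4fa0'
```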
Brief Summary:
We can map a code from Unicode to UTF-8 or to any other code page. When a code can't be matched, we get garbled text (see the sketch after this summary).
If the Unicode system had been invented before the code pages, we would never have needed code pages.
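Here is a minimal sketch of that garbled-text case, decoding the same bytes with the right table and with a wrong one (run in a Unicode-aware shell):

```python
data = b'\xcf\xc0'             # the GBK bytes for 俠
print(data.decode('gbk'))      # 俠  -- decoded with the right table
print(data.decode('latin-1'))  # ÏÀ -- a wrong table: garbled, but no error
```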
Let’s Solve the Problem:
Now let's go back to our topic.
Situation A: Why Do We Get a '?'
If you use Visual Studio to create a file, the default encoding is ANSI. This means that Visual Studio will save your file with the default code page encoding: it converts the Unicode characters into code page codes and then saves the file to disk. As our code page is set to an English code page (cp437), English characters are easily found through the mapping in the code page. But the Chinese character is another story: its Unicode value is undefined in the code page, so it is replaced with a placeholder symbol, '?', before being saved into the file.

When you run the script, Python requires you to save the file first and then loads it into the parser. The original Chinese character has already been saved as the replacement code '?' in the file, so when it is loaded into memory, it is matched with the Unicode for '?'. That's why we get a '?'.

Conclusion: the '?' comes from a code that could not be matched when converting Unicode to the current code page while saving the file.
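Here is a minimal sketch of what that ANSI save effectively does, again assuming cp437:

```python
# Unmappable characters are replaced before the bytes ever reach the disk.
data = '\u4fa0'.encode('cp437', errors='replace')
print(data)                  # b'?'
print(data.decode('cp437'))  # ? -- the original character is lost for good
```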
Here comes the puzzle: how do we display the Chinese character?
Solution 1:
What if we change the code to:
```python
if __name__ == '__main__':
    test = '\u4fa0'
    print(test)
```
In this case, there are no Chinese characters in the file. What will happen then? Set a breakpoint before `print(test)`, run the script, and watch the value of `test`. The debugger shows you a '俠'. Wow, we are almost there! Step over `print(test)` and, to our surprise, Python breaks again and leaves us an error message:
Error Message: 'charmap' codec can't encode character '\u4fa0' in position 0: character maps to <undefined>

This message tells us that the Unicode character cannot be found in the current code page. The message is strange, isn't it? We actually have the correct Unicode character in memory, so why can't we print it on the screen? Recall the non-Unicode programs we talked about before. It's not caused by your script; it's the Python parser that throws the exception. When you ask cmd.exe to display the string via the print() function, you are actually calling into the cmd.exe process. Cmd.exe is a non-Unicode program, so the parameter you send to it has to be mapped through the code page first, because non-Unicode programs only use the legacy APIs. In fact, in Situation A, you will never succeed in making the command line display Chinese characters. But if you switch the display tool to the Python shell, you will get the correct result. Why? Simply because the Python shell is a Unicode program: it doesn't require you to convert your wide characters to multi-byte characters. See, Unicode programs are really cool!
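We can reproduce this 'charmap' failure directly, without print(), in a small sketch (again assuming the cp437 console code page):

```python
# print() effectively has to encode the string with the console's code
# page before cmd.exe can display it; doing the encode by hand shows the
# same error.
'\u4fa0'.encode('cp437')
# UnicodeEncodeError: 'charmap' codec can't encode character '\u4fa0'
# in position 0: character maps to <undefined>
```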
Although it's puzzling, we've answered one more question. But you may wonder whether you can display the Chinese character without changing your code.

Solution 2:

The answer is: yes! How? As we already know where the '?' comes from, it is easy to fix the problem. The current code page can't represent any Chinese characters, so what can we use instead? Remember the other encoding called UTF-8? It covers almost all the characters in Unicode, so it makes a good helper. Let's save our original code with UTF-8 encoding in Notepad. Now run it in the Python shell, and we get the correct result.

Actually, the default encoding Python uses to load scripts is UTF-8. We saved a correct UTF-8 file, so when the file's bytes are decoded from UTF-8 to Unicode, we get the correct result.
Situation B: Why Do We Get the Error Message?
I have to say that this one is really tricky, but we will solve all the puzzles in the end. Grab a cup of coffee and let's continue.
We have already set our locale to Chinese, so why do we get such an error message?

Error Message: 'utf-8' codec can't decode byte 0xcf in position 0: invalid continuation byte

Calm down and read the message carefully. According to the message, the utf-8 codec can't decode the byte 0xcf. Why utf-8? And what is 0xcf? If we can answer these two questions, we will be closer. The first one is easy: the Python parser always uses UTF-8 as the default encoding, and that is exactly how we solved the problem in Situation A. The second one is harder, because we don't know where 0xcf comes from. Open UltraEdit, type the Chinese character '俠', and use the binary view to watch the value.
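If you don't have a hex editor handy, a one-line Python sketch shows the same bytes:

```python
# Encode U+4FA0 with the zh-CN code page to see the bytes on disk.
print('\u4fa0'.encode('gbk'))  # b'\xcf\xc0'
```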
That's great! We find that 0xcfc0 is the code of '俠' in the current code page. Now we have the clue: the file has been saved with the current code page. The difference between Situation A and Situation B is that this time, a code matching the Unicode character can be found, so the character is saved into the file correctly. But the Python parser still uses UTF-8 to load the file, and there is no code 0xcfc0 in UTF-8. That's why the error is thrown. You can try the following code and you will get the same error message:
```python
# Decoding the GBK bytes as UTF-8 fails.
data = b'\xcf\xc0'
string = data.decode('utf-8')  # UnicodeDecodeError: invalid continuation byte
```
And with a small change, you will get the correct result:
```python
import sys

# Decode with the system code page instead. On this Windows/Python 3.4
# setup, getfilesystemencoding() returns the ANSI code page ('mbcs', i.e.
# GBK on a zh-CN system); on modern Python it returns 'utf-8' instead.
encoding = sys.getfilesystemencoding()
data = b'\xcf\xc0'
string = data.decode(encoding)  # '俠'
```
As we already know where the problem is, it's now easy to fix.
One way is to save the current file with UTF-8 encoding. Run the script again. Done!
The other way takes a bit of reverse thinking: since the file is saved with the current code page encoding, we can tell Python to load the file with that code page instead of UTF-8. We can write this at the first line of the file:
```python
# coding=gbk
```
Now run the script again, and you will find that the problem is solved.
Here come the puzzles:
1. Suppose the script has already been saved with UTF-8 encoding, and we use Notepad to save the same file again with ANSI encoding. Now go back to the IDE. What will happen?
Actually, the IDE will give you a warning telling you that some bytes were replaced with the Unicode substitution character while loading the file. You will then see garbled text in your editor:
```python
string = '��'
```
2. After you save the file and run the script, you get another error message: 'gbk' codec can't encode character 'xxxx' in position 0: illegal multibyte sequence.
How can this happen?
Because the IDE now uses UTF-8 to load the file, when it reads the bytes 0xcfc0 it fails to recognize them as valid UTF-8. So it automatically replaces the undefined bytes with a specific character, the Unicode replacement character U+FFFD, which does exist in the Unicode table.
When the replaced characters are printed, the wide characters must first be converted to multi-byte for the non-Unicode program cmd.exe. During that conversion, Python finds that U+FFFD can't be matched to the current code page. That's why you get the error message.
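The whole round trip can be sketched in a few lines:

```python
# The GBK bytes are decoded as UTF-8 with replacement, and the
# replacement character then has no mapping back to GBK when printed.
data = b'\xcf\xc0'                             # GBK bytes left in the file
text = data.decode('utf-8', errors='replace')
print(ascii(text))                             # '\ufffd\ufffd', i.e. '��'
text.encode('gbk')                             # UnicodeEncodeError: 'gbk' codec
                                               # can't encode character '\ufffd' ...
```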
Happy Ending:
We are good!

The conclusion is that when a program has to cooperate with a non-Unicode program, we should set the system's "Language for non-Unicode programs" correctly. If you are really interested in encoding topics, I suggest reading some more professional books. Here I have just illustrated my own understanding and how I solved the problems I met. It may not be perfectly accurate, but I have tried to show how to work things out. I hope the article makes sense.