These days, I am working on enhancing the localization mechanism for my project. I ran into some problems along the way, and although I've solved them, some of the issues are still tricky. So I would like to write a blog post here to explain a simple but interesting problem that comes up when writing Python scripts with Unicode characters.
HERE’S THE QUESTION:
Please type in the following script and debug it with Visual Studio.
Code Snippet:
```python
if __name__ == '__main__':
    test = '俠'
    print(test)
```
The Unicode code point for '俠' is 0x4fa0. The GBK code for '俠' is 0xcfc0.

Let's see the result first, and then we can think for a while and talk.
Situation A: Under a system with an en-US locale, the result is:

We get a '?'.
Situation B: Under a system with a zh-CN locale, the result is:
We get an error message: 'utf-8' codec can't decode byte 0xcf in position 0: invalid continuation byte.

Before we dive into the discussion, let's first get familiar with some basic concepts related to Unicode.
Unicode Programs and Non-Unicode Programs
As we all know, different countries have different languages and different characters. In the early days of programming history, Western countries built the first systems around their own characters. For example, we might only have a data type like `char` to define a character, with a length of 1 byte. As there are only a limited number of English characters (a, b, c, d, e, f, etc.), one byte is enough to hold all of them. The mapping rule is very simple; for example, 'a' is defined as 0x61. When we read a byte, we look it up in the ASCII table to find the glyph. All the APIs at that time used only `char` as the data type for a character.

Later on, Eastern countries also joined the programming world, but they have far more than 256 characters, so they had to find a way to display their languages. Each country then began to define its own rule for how to find its characters in a table. We call those rules encodings. For example, Chinese has several encodings such as Big5, GBK, and so on. Microsoft Windows then defined a concept which is now known as the code page. A code page is very much like an ASCII table; logically, it could look like this:
Char | Code |
---|---|
俠 | 0xcfc0 |
With this table, a program can easily find the related characters. But there were too many countries defining their own code pages, which became a burden for compatibility. So Unicode came in. Unicode is one big table which defines almost all the characters in the world, and each character has a unique code value. As we can see, one byte is not enough room to hold all those characters, so early Unicode specified that each character takes 2 bytes of memory, which can represent 2^16 different values. This required the program APIs to be updated. Programmers can use this suite of APIs and data structures to write programs which support the Unicode system. That's why most recent programs are called Unicode programs: they use Unicode data structures and Unicode APIs.
```c
#ifdef UNICODE
typedef wchar_t TCHAR;
#else
typedef char TCHAR;
#endif
```
```c
#ifdef UNICODE
#define SetWindowText SetWindowTextW
#else
#define SetWindowText SetWindowTextA
#endif
```
Thus, the language problem is solved. The Unicode table looks like this:
Char | Unicode |
---|---|
俠 | 0x4fa0 |
Yet Microsoft had to consider the legacy APIs and programs which use the early code pages. For the sake of compatibility, Microsoft defined its own code pages with a mapping to the Unicode characters. Such a code page looks like this:
Char | Code | Unicode |
---|---|---|
俠 | 0xcfc0 | 0x4fa0 |
With the above basic knowledge, we can divide programs into two groups: Unicode programs and non-Unicode programs.

A Unicode program uses the new data types and APIs which support the Unicode system; the glyphs shown on the screen are found through the Unicode table. For example, Microsoft Office uses C# as a programming language, and it supports Unicode. Python 3.4 supports the Unicode system, so some of its encoding APIs changed a lot compared with Python 2.

A non-Unicode program still uses the legacy data types and APIs, which find characters directly through the system's default code page; the glyphs shown on the screen are found in the code page. For example, the ancient cmd.exe of the Windows system is a non-Unicode program. So if you choose English as the default system code page, you will never see a Chinese character on the command screen.
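To make this concrete, here is a minimal sketch of the hand-off to a non-Unicode program, assuming cp437 stands in for the en-US console's code page:

```python
# A Unicode program holds text as Unicode in memory; before a non-Unicode
# program such as cmd.exe can display it, the text must be narrowed to the
# legacy code page. 'cp437' is an assumed stand-in for the en-US console.
text = '\u4fa0'  # the Unicode character 俠

try:
    data = text.encode('cp437')  # Unicode -> code page bytes
except UnicodeEncodeError as err:
    print(err)  # cp437 has no slot for U+4FA0, so the conversion fails
```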
UTF-8
There is another family of encodings called UTF-8 (and UTF-16, UTF-32). UTF-8 is a method of organizing Unicode code points as bytes. If you would like to know more about how it does so, simply search for it on Google. I won't go into the details here; I only want to stress that it is a rule for laying out Unicode, not a new character table. The advantage of UTF-8 is that it uses only one byte for narrow characters such as English letters, and thus saves a lot of space when dealing with English text.
UTF-8 | Unicode |
---|---|
0xe4 0xbe 0xa0 | 0x4fa0 |
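As a quick sanity check, here is a small sketch (run in a Unicode-aware shell) confirming the table above:

```python
# Encode U+4FA0 to UTF-8 and decode it back; the bytes match the table.
print('\u4fa0'.encode('utf-8'))  # b'\xe4\xbe\xa0'
assert b'\xe4\xbe\xa0'.decode('utf-8') == '\u4fa0'
```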
Brief Summary:
We can map a code from Unicode to UTF-8 or to any other code page. When a code can't be matched, we get garbled text (see the sketch after this summary).
If the Unicode system had been invented before the code pages, we would never have needed code pages.
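Here is a minimal sketch of that garbled-text case, decoding the same bytes with the right table and with a wrong one (run in a Unicode-aware shell):

```python
data = b'\xcf\xc0'             # the GBK bytes for 俠
print(data.decode('gbk'))      # 俠  -- decoded with the right table
print(data.decode('latin-1'))  # ÏÀ -- a wrong table: garbled, but no error
```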
Let’s Solve the Problem:
Now let's go back to our topic.
Situation A: Why Do We Get a '?'
If you use Visual Studio to create a file, the default encoding is ANSI. This means that Visual Studio will save your file with the default code page encoding: it converts the Unicode characters into code page codes and then saves the file to disk. As our code page is set to an English code page (cp437), English characters are easily found through the mapping in the code page. But the Chinese character is another story: its Unicode value is undefined in the code page, so it is replaced with a placeholder symbol, '?', before being saved into the file.

When you run the script, Python requires you to save the file first and then loads it into the parser. The original Chinese character has already been saved as the replacement code '?' in the file, so when it is loaded into memory, it is matched with the Unicode for '?'. That's why we get a '?'.

Conclusion: the '?' comes from a code that could not be matched when converting Unicode to the current code page while saving the file.
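Here is a minimal sketch of what that ANSI save effectively does, again assuming cp437:

```python
# Unmappable characters are replaced before the bytes ever reach the disk.
data = '\u4fa0'.encode('cp437', errors='replace')
print(data)                  # b'?'
print(data.decode('cp437'))  # ? -- the original character is lost for good
```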
Here comes the puzzle: how do we display the Chinese character?
Solution 1:
What if we change the code to:
```python
if __name__ == '__main__':
    test = '\u4fa0'
    print(test)
```
In this case, there are no Chinese characters in the file. What will happen then? Set a breakpoint before `print(test)`, run the script, and watch the value of `test`. The debugger shows you a '俠'. Wow, we are almost there! Step over `print(test)` and, to our surprise, Python breaks again and leaves us an error message:
Error Message: 'charmap' codec can't encode character '\u4fa0' in position 0: character maps to <undefined>

This message tells us that the Unicode character cannot be found in the current code page. The message is strange, isn't it? We actually have the correct Unicode character in memory, so why can't we print it on the screen? Recall the non-Unicode programs we talked about before. It's not caused by your script; it's the Python parser that throws the exception. When you ask cmd.exe to display the string via the print() function, you are actually calling into the cmd.exe process. Cmd.exe is a non-Unicode program, so the parameter you send to it has to be mapped through the code page first, because non-Unicode programs only use the legacy APIs. In fact, in Situation A, you will never succeed in making the command line display Chinese characters. But if you switch the display tool to the Python shell, you will get the correct result. Why? Simply because the Python shell is a Unicode program: it doesn't require you to convert your wide characters to multi-byte characters. See, Unicode programs are really cool!
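We can reproduce this 'charmap' failure directly, without print(), in a small sketch (again assuming the cp437 console code page):

```python
# print() effectively has to encode the string with the console's code
# page before cmd.exe can display it; doing the encode by hand shows the
# same error.
'\u4fa0'.encode('cp437')
# UnicodeEncodeError: 'charmap' codec can't encode character '\u4fa0'
# in position 0: character maps to <undefined>
```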
Although it's puzzling, we've answered one more question. But you may wonder whether you can display the Chinese character without changing your code.

Solution 2:

The answer is: yes! How? As we already know where the '?' comes from, it is easy to fix the problem. The current code page can't represent any Chinese characters, so what can we use instead? Remember the other encoding called UTF-8? It covers almost all the characters in Unicode, so it makes a good helper. Let's save our original code with UTF-8 encoding in Notepad. Now run it in the Python shell, and we get the correct result.

Actually, the default encoding Python uses to load scripts is UTF-8. We saved a correct UTF-8 file, so when the file's bytes are decoded from UTF-8 to Unicode, we get the correct result.
Situation B: Why Do We Get the Error Message?
I have to say that this one is really tricky, but we will solve all the puzzles in the end. Grab a cup of coffee and let's continue.
We have already set our locale to Chinese, so why do we get such an error message?

Error Message: 'utf-8' codec can't decode byte 0xcf in position 0: invalid continuation byte

Calm down and read the message carefully. According to the message, the utf-8 codec can't decode the byte 0xcf. Why utf-8? And what is 0xcf? If we can answer these two questions, we will be closer. The first one is easy: the Python parser always uses UTF-8 as the default encoding, and that is exactly how we solved the problem in Situation A. The second one is harder, because we don't know where 0xcf comes from. Open UltraEdit, type the Chinese character '俠', and use the binary view to watch the value.
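If you don't have a hex editor handy, a one-line Python sketch shows the same bytes:

```python
# Encode U+4FA0 with the zh-CN code page to see the bytes on disk.
print('\u4fa0'.encode('gbk'))  # b'\xcf\xc0'
```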
That's great! We find that 0xcfc0 is the code of '俠' in the current code page. Now we have the clue: the file has been saved with the current code page. The difference between Situation A and Situation B is that this time, a code matching the Unicode character can be found, so the character is saved into the file correctly. But the Python parser still uses UTF-8 to load the file, and there is no code 0xcfc0 in UTF-8. That's why the error is thrown. You can try the following code and you will get the same error message:
```python
# Decoding the GBK bytes as UTF-8 fails.
data = b'\xcf\xc0'
string = data.decode('utf-8')  # UnicodeDecodeError: invalid continuation byte
```
And with a small change, you will get the correct result:
```python
import sys

# Decode with the system code page instead. On this Windows/Python 3.4
# setup, getfilesystemencoding() returns the ANSI code page ('mbcs', i.e.
# GBK on a zh-CN system); on modern Python it returns 'utf-8' instead.
encoding = sys.getfilesystemencoding()
data = b'\xcf\xc0'
string = data.decode(encoding)  # '俠'
```
As we already know where the problem is, it's now easy to fix.
One way is to save the current file with UTF-8 encoding. Run the script again. Done!
The other way takes a bit of reverse thinking: since the file is saved with the current code page encoding, we can tell Python to load the file with that code page instead of UTF-8. We can write this at the first line of the file:
```python
# coding=gbk
```
Now run the script again, and you will find that the problem is solved.
Here come the puzzles:
1. Suppose the script has already been saved with UTF-8 encoding, and we use Notepad to save the same file again with ANSI encoding. Now go back to the IDE. What will happen?
Actually, the IDE will give you a warning telling you that some bytes were replaced with the Unicode substitution character while loading the file. You will then see garbled text in your editor:
```python
string = '��'
```
2. After you save the file and run the script, you get another error message: 'gbk' codec can't encode character 'xxxx' in position 0: illegal multibyte sequence.
How can this happen?
Because the IDE now uses UTF-8 to load the file, when it reads the bytes 0xcfc0 it fails to recognize them as valid UTF-8. So it automatically replaces the undefined bytes with a specific character, the Unicode replacement character U+FFFD, which does exist in the Unicode table.
When the replaced characters are printed, the wide characters must first be converted to multi-byte for the non-Unicode program cmd.exe. During that conversion, Python finds that U+FFFD can't be matched to the current code page. That's why you get the error message.
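The whole round trip can be sketched in a few lines:

```python
# The GBK bytes are decoded as UTF-8 with replacement, and the
# replacement character then has no mapping back to GBK when printed.
data = b'\xcf\xc0'                             # GBK bytes left in the file
text = data.decode('utf-8', errors='replace')
print(ascii(text))                             # '\ufffd\ufffd', i.e. '��'
text.encode('gbk')                             # UnicodeEncodeError: 'gbk' codec
                                               # can't encode character '\ufffd' ...
```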
Happy Ending:
We are good!

The conclusion is that when a program has to cooperate with a non-Unicode program, we should set the system's "Language for non-Unicode programs" correctly. If you are really interested in encoding topics, I suggest reading some more professional books. Here I have just illustrated my own understanding and how I solved the problems I met. It may not be perfectly accurate, but I have tried to show how to work things out. I hope the article makes sense.