What's the difference between UTF-8 and Unicode?

2023-06-24 20:40:17

If asked the question, "What is the difference between UTF-8 and Unicode?", could you confidently reply with a short and precise answer? In these days of internationalization, every developer should be able to. I suspect many of us do not understand these concepts as well as we should. If you feel you belong to this group, read this ultra-short introduction to character sets and encodings.

Actually, comparing UTF-8 and Unicode is like comparing apples and oranges:

UTF-8 is an encoding - Unicode is a character set

A character set is a list of characters with unique numbers (these numbers are sometimes referred to as "code points"). For example, in the Unicode character set, the number for A is 65, which is 41 in hexadecimal and is usually written U+0041.
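As a small illustration of the character-set idea, the sketch below uses Python's built-in ord() and chr() to move between a character and its Unicode code point. The variable name is only illustrative.

```python
# A character set maps characters to numbers (code points).
# ord() gives the code point of a character; chr() goes the other way.
code_point = ord("A")

print(code_point)              # decimal value of the code point
print(hex(code_point))         # the same value in hexadecimal
print(f"U+{code_point:04X}")   # the conventional U+XXXX notation
print(chr(0x41))               # back from the number to the character
```

Note that nothing here involves bytes yet: a code point is just a number, and how it is stored is a separate question.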

An encoding, on the other hand, is an algorithm that translates a list of numbers to binary so it can be stored on disk. For example, UTF-8 would translate the number sequence 1, 2, 3, 4 like this:

00000001 00000010 00000011 00000100

Our data is now translated into binary and can now be saved to disk.
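The same translation can be reproduced with Python's str.encode(). This is a minimal sketch: the code points 1 to 4 are taken from the example above, while the second character (U+00E9) is an extra illustration, added here to show that UTF-8 needs more than one byte once a code point is above 127.

```python
# Build a string from the code points 1, 2, 3, 4 and encode it as UTF-8.
code_points = [1, 2, 3, 4]
text = "".join(chr(cp) for cp in code_points)
encoded = text.encode("utf-8")

# Show each byte as eight binary digits.
bits = " ".join(f"{byte:08b}" for byte in encoded)
print(bits)

# Code points above 127 are encoded as two or more bytes.
multi = "\u00e9".encode("utf-8")   # U+00E9, LATIN SMALL LETTER E WITH ACUTE
multi_bits = " ".join(f"{byte:08b}" for byte in multi)
print(multi_bits)
```

For code points below 128, each number becomes exactly one byte with the same value, which is why the output looks like a plain binary rendering of 1, 2, 3, 4.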

All together now

Say an application reads the following from the disk:

01101000 01100101 01101100 01101100 01101111

The app knows this data represents a Unicode string encoded with UTF-8 and must show it as text to the user. The first step is to convert the binary data to numbers. The app uses the UTF-8 algorithm to decode the data. In this case, the decoder returns this:

104 101 108 108 111

Since the app knows this is a Unicode string, it can assume each number represents a character. We use the Unicode character set to translate each number to a corresponding character. The resulting string is "hello".
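The two steps described above (bytes to numbers, then numbers to characters) can be sketched in Python. In practice bytes.decode() performs both at once; the code points are listed separately here only to mirror the explanation.

```python
# The binary data read from disk, written as groups of bits.
bit_groups = "01101000 01100101 01101100 01101100 01101111".split()

# Turn the bit groups into a bytes object, as it would arrive from a file.
raw = bytes(int(group, 2) for group in bit_groups)

# Decode the bytes as UTF-8, which yields a Unicode string.
text = raw.decode("utf-8")

# Inspect the code point behind each character.
code_points = [ord(ch) for ch in text]

print(code_points)
print(text)
```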

Conclusion

So when somebody asks you "What is the difference between UTF-8 and Unicode?", you can now confidently give a short and precise answer:

UTF-8 and Unicode cannot be compared. UTF-8 is an encoding used to translate numbers into binary data. Unicode is a character set used to translate characters into numbers.

Article reposted from: http://stackoverflow.com/questions/3951722/whats-the-difference-between-unicode-and-utf8

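Java uses the Unicode standard character set

The Java language uses the Unicode standard character set. A Java char is 16 bits wide, so it can hold 65,536 distinct values (0 to 65535). The first 128 characters of the Unicode table are exactly the ASCII table. Every letter of every country's "alphabet" is a character in the Unicode table; for example, the Chinese character "你" is character number 20320 in the Unicode table.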
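Both claims are easy to check. The sketch below uses Python rather than Java, purely to keep the examples in this post in one language; the code points themselves do not depend on the programming language.

```python
# The code point of the Chinese character mentioned above.
ni = ord("\u4f60")        # "\u4f60" is the character 你
print(ni, hex(ni))

# The first 128 Unicode code points coincide with ASCII, so for those
# characters the ASCII bytes and the UTF-8 bytes are identical.
first_128 = "".join(chr(i) for i in range(128))
same_bytes = first_128.encode("ascii") == first_128.encode("utf-8")
print(same_bytes)

# Outside that range UTF-8 needs more bytes: this character takes three.
ni_utf8 = "\u4f60".encode("utf-8")
print(len(ni_utf8), ni_utf8.hex())
```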

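What Java calls a letter covers the "alphabets" of every language in the world. The letters Java accepts therefore include not only the usual Latin letters a, b, c and so on, but also Chinese characters, Japanese katakana and hiragana, Korean, and the scripts of many other languages.

Wikipedia:

The version of Unicode currently in practical use is UCS-2, which uses a 16-bit code space. That is, each character (char) occupies 2 bytes, so in theory at most 2^16 (65536) characters can be represented, which largely covers the needs of the various languages. In practice the current version of Unicode does not use the whole 16-bit space; large areas are reserved for special use or future extension.

Java's bytecode environment uses UTF-16 as its internal representation. UTF-16 is derived from UCS-2 and uses 16-bit code units, which is why the primitive type char in Java is 16 bits wide, with a range of Unicode 0 to Unicode 2^16-1. Unicode itself has since grown past 16 bits: code points above U+FFFF do not fit in a single char, and UTF-16 represents them as a pair of 16-bit values called a surrogate pair.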
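To make the 16-bit code unit concrete, the following sketch encodes two characters as UTF-16 and counts the code units. It is written in Python for consistency with the other examples; the helper name utf16_units is hypothetical, and U+1F600 is just an arbitrary example of a code point above U+FFFF.

```python
def utf16_units(text: str) -> list[int]:
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-be")  # big-endian, no byte order mark
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

# A character inside the 16-bit range needs a single code unit.
bmp_units = utf16_units("\u4f60")          # the character 你
print([hex(u) for u in bmp_units])

# A character above U+FFFF needs two code units (a surrogate pair).
astral_units = utf16_units("\U0001F600")   # an emoji, chosen as an example
print([hex(u) for u in astral_units])
```

Each integer in the returned list corresponds to what a single Java char would hold, so a string containing one character above U+FFFF occupies two char values.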
