定義
A character set is a set of symbols and encodings. A collation is a set of rules for comparing characters in a character set.
- 字元集,就好比一本漢語字典,裡面記錄了很多文字(Symbol),以及順序号(頁碼);
- 比較方式,是一套規則,告訴我們如何比較字典裡的文字。這套規則字典裡并沒有記錄,是根據我們具體需求來确定的,例如:可以根據頁碼比較,可以根據發音比較,等等。
CHARSET
以latin類字元集為例,我們有latin1、latin2、latin5、latin7四種字元集。
這是什麼意思呢?簡單了解,latin類比東方語系,latin1、latin2就可以類比為漢語字典、日語字典。
mysql> show character set;
+----------+-----------------------------+---------------------+--------+
| Charset | Description | Default collation | Maxlen |
+----------+-----------------------------+---------------------+--------+
| big5 | Big5 Traditional Chinese | big5_chinese_ci | 2 |
| dec8 | DEC West European | dec8_swedish_ci | 1 |
| cp850 | DOS West European | cp850_general_ci | 1 |
| hp8 | HP West European | hp8_english_ci | 1 |
| koi8r | KOI8-R Relcom Russian | koi8r_general_ci | 1 |
| latin1 | cp1252 West European | latin1_swedish_ci | 1 |
| latin2 | ISO 8859-2 Central European | latin2_general_ci | 1 |
| swe7 | 7bit Swedish | swe7_swedish_ci | 1 |
| ascii | US ASCII | ascii_general_ci | 1 |
| ujis | EUC-JP Japanese | ujis_japanese_ci | 3 |
| sjis | Shift-JIS Japanese | sjis_japanese_ci | 2 |
| hebrew | ISO 8859-8 Hebrew | hebrew_general_ci | 1 |
| tis620 | TIS620 Thai | tis620_thai_ci | 1 |
| euckr | EUC-KR Korean | euckr_korean_ci | 2 |
| koi8u | KOI8-U Ukrainian | koi8u_general_ci | 1 |
| gb2312 | GB2312 Simplified Chinese | gb2312_chinese_ci | 2 |
| greek | ISO 8859-7 Greek | greek_general_ci | 1 |
| cp1250 | Windows Central European | cp1250_general_ci | 1 |
| gbk | GBK Simplified Chinese | gbk_chinese_ci | 2 |
| latin5 | ISO 8859-9 Turkish | latin5_turkish_ci | 1 |
| armscii8 | ARMSCII-8 Armenian | armscii8_general_ci | 1 |
| utf8 | UTF-8 Unicode | utf8_general_ci | 3 |
| ucs2 | UCS-2 Unicode | ucs2_general_ci | 2 |
| cp866 | DOS Russian | cp866_general_ci | 1 |
| keybcs2 | DOS Kamenicky Czech-Slovak | keybcs2_general_ci | 1 |
| macce | Mac Central European | macce_general_ci | 1 |
| macroman | Mac West European | macroman_general_ci | 1 |
| cp852 | DOS Central European | cp852_general_ci | 1 |
| latin7 | ISO 8859-13 Baltic | latin7_general_ci | 1 |
| utf8mb4 | UTF-8 Unicode | utf8mb4_general_ci | 4 |
| cp1251 | Windows Cyrillic | cp1251_general_ci | 1 |
| utf16 | UTF-16 Unicode | utf16_general_ci | 4 |
| utf16le | UTF-16LE Unicode | utf16le_general_ci | 4 |
| cp1256 | Windows Arabic | cp1256_general_ci | 1 |
| cp1257 | Windows Baltic | cp1257_general_ci | 1 |
| utf32 | UTF-32 Unicode | utf32_general_ci | 4 |
| binary | Binary pseudo charset | binary | 1 |
| geostd8 | GEOSTD8 Georgian | geostd8_general_ci | 1 |
| cp932 | SJIS for Windows Japanese | cp932_japanese_ci | 2 |
| eucjpms | UJIS for Windows Japanese | eucjpms_japanese_ci | 3 |
+----------+-----------------------------+---------------------+--------+
rows in set ( sec)
而latin1又有好幾種比較方式,latin1_german1_ci、latin1_swedish_ci、latin1_danish_ci、latin1_german2_ci、latin1_bin等等。他們規定了latin1這本字典的不同排序方式。
mysql> SHOW COLLATION LIKE 'latin%';
+---------------------+---------+----+---------+----------+---------+
| Collation | Charset | Id | Default | Compiled | Sortlen |
+---------------------+---------+----+---------+----------+---------+
| latin1_german1_ci | latin1 | | | Yes | |
| latin1_swedish_ci | latin1 | | Yes | Yes | |
| latin1_danish_ci | latin1 | | | Yes | |
| latin1_german2_ci | latin1 | | | Yes | |
| latin1_bin | latin1 | | | Yes | |
| latin1_general_ci | latin1 | | | Yes | |
| latin1_general_cs | latin1 | | | Yes | |
| latin1_spanish_ci | latin1 | | | Yes | |
| latin2_czech_cs | latin2 | | | Yes | |
| latin2_general_ci | latin2 | | Yes | Yes | |
| latin2_hungarian_ci | latin2 | | | Yes | |
| latin2_croatian_ci | latin2 | | | Yes | |
| latin2_bin | latin2 | | | Yes | |
| latin5_turkish_ci | latin5 | | Yes | Yes | |
| latin5_bin | latin5 | | | Yes | |
| latin7_estonian_cs | latin7 | | | Yes | |
| latin7_general_ci | latin7 | | Yes | Yes | |
| latin7_general_cs | latin7 | | | Yes | |
| latin7_bin | latin7 | | | Yes | |
+---------------------+---------+----+---------+----------+---------+
rows in set ( sec)
由上面兩張表可見,Charset和Collation是一對多的關系;Charset和Default Collation是一對一的關系。
為什麼要有Default Collation呢?看下面的例子就能了解:
CREATE DATABASE db_name
[[DEFAULT] CHARACTER SET charset_name]
[[DEFAULT] COLLATE collation_name]
通常,建立Table的時候最多隻指定了CHARSET,很少指定COLLATE。這時候系統必須給一個預設的值,否則行為就是未定義了。預設值就是Default Collation。
僅僅有Charset,沒有Collation,系統會無法正确工作。就好比給你一本字典,問你“張”和“趙”,誰大?沒有答案,但如果增加一個比較規則:“按照百家姓的順序排序”,那麼就可以得出“趙” > “張”了。
按照MySQL命名規則,如果知道了一個Collation,那麼對應的Charset就能唯一确定。是以Create table的時候可以隻指定COLLATE,CHARSET部分能自動推導出來。
Collation
每個字元串都有collation,可以顯式指定:
如果沒有指定,則使用character_set_connection,collation_connection指定的值。
For the simple statement SELECT ‘string’, the string has the character set and collation defined by the character_set_connection and collation_connection system variables.
Coercible
兩個collation不同的資料比較,應該如何操作?這裡面也有一套規則指導我們報錯,或者将其中一個collation轉換成另外一個。具體的,參考MySQL文檔, 它規定了一系列的級别,然後這套規則就根據級别比較來決定如何轉換Collation。
- An explicit COLLATE clause has a coercibility of 0. (Not coercible at all.)
- The concatenation of two strings with different collations has a coercibility of 1.
- The collation of a column or a stored routine parameter or local variable has a coercibility of 2.
- A “system constant” (the string returned by functions such as USER() or VERSION()) has a coercibility of 3.
- The collation of a literal has a coercibility of 4.
- NULL or an expression that is derived from NULL has a coercibility of 5.
這些規則級别被稱為COERICIBILITY。MySQL提供了一個函數來顯示一個值的規則級别:
mysql> SELECT COERCIBILITY('A' COLLATE latin1_swedish_ci);
->
mysql> SELECT COERCIBILITY(VERSION());
->
mysql> SELECT COERCIBILITY('A');
->
字元集轉換
字元集之間一般不能互相轉換,但也有例外:一個字元集是例外一個字元集的子集時。例如Unicode和所有其他字元集之間,ASCII和GBK之間,等等。
參考文獻
http://dev.mysql.com/doc/refman/5.7/en/charset.html