天天看點

字元集及其比較方式(Character Sets and Collations)定義CHARSETCollationCoercible字元集轉換參考文獻

定義

A character set is a set of symbols and encodings. A collation is a set of rules for comparing characters in a character set.
  • 字元集,就好比一本漢語字典,裡面記錄了很多文字(Symbol),以及順序号(頁碼);
  • 比較方式,是一套規則,告訴我們如何比較字典裡的文字。這套規則字典裡并沒有記錄,是根據我們具體需求來确定的,例如:可以根據頁碼比較,可以根據發音比較,等等。

CHARSET

以latin類字元集為例,我們有latin1、latin2、latin5、latin7四種字元集。

這是什麼意思呢?簡單了解,latin類比東方語系,latin1、latin2就可以類比為漢語字典、日語字典。

mysql> show character set;
+----------+-----------------------------+---------------------+--------+
| Charset  | Description                 | Default collation   | Maxlen |
+----------+-----------------------------+---------------------+--------+
| big5     | Big5 Traditional Chinese    | big5_chinese_ci     |      2 |
| dec8     | DEC West European           | dec8_swedish_ci     |      1 |
| cp850    | DOS West European           | cp850_general_ci    |      1 |
| hp8      | HP West European            | hp8_english_ci      |      1 |
| koi8r    | KOI8-R Relcom Russian       | koi8r_general_ci    |      1 |
| latin1   | cp1252 West European        | latin1_swedish_ci   |      1 |
| latin2   | ISO 8859-2 Central European | latin2_general_ci   |      1 |
| swe7     | 7bit Swedish                | swe7_swedish_ci     |      1 |
| ascii    | US ASCII                    | ascii_general_ci    |      1 |
| ujis     | EUC-JP Japanese             | ujis_japanese_ci    |      3 |
| sjis     | Shift-JIS Japanese          | sjis_japanese_ci    |      2 |
| hebrew   | ISO 8859-8 Hebrew           | hebrew_general_ci   |      1 |
| tis620   | TIS620 Thai                 | tis620_thai_ci      |      1 |
| euckr    | EUC-KR Korean               | euckr_korean_ci     |      2 |
| koi8u    | KOI8-U Ukrainian            | koi8u_general_ci    |      1 |
| gb2312   | GB2312 Simplified Chinese   | gb2312_chinese_ci   |      2 |
| greek    | ISO 8859-7 Greek            | greek_general_ci    |      1 |
| cp1250   | Windows Central European    | cp1250_general_ci   |      1 |
| gbk      | GBK Simplified Chinese      | gbk_chinese_ci      |      2 |
| latin5   | ISO 8859-9 Turkish          | latin5_turkish_ci   |      1 |
| armscii8 | ARMSCII-8 Armenian          | armscii8_general_ci |      1 |
| utf8     | UTF-8 Unicode               | utf8_general_ci     |      3 |
| ucs2     | UCS-2 Unicode               | ucs2_general_ci     |      2 |
| cp866    | DOS Russian                 | cp866_general_ci    |      1 |
| keybcs2  | DOS Kamenicky Czech-Slovak  | keybcs2_general_ci  |      1 |
| macce    | Mac Central European        | macce_general_ci    |      1 |
| macroman | Mac West European           | macroman_general_ci |      1 |
| cp852    | DOS Central European        | cp852_general_ci    |      1 |
| latin7   | ISO 8859-13 Baltic          | latin7_general_ci   |      1 |
| utf8mb4  | UTF-8 Unicode               | utf8mb4_general_ci  |      4 |
| cp1251   | Windows Cyrillic            | cp1251_general_ci   |      1 |
| utf16    | UTF-16 Unicode              | utf16_general_ci    |      4 |
| utf16le  | UTF-16LE Unicode            | utf16le_general_ci  |      4 |
| cp1256   | Windows Arabic              | cp1256_general_ci   |      1 |
| cp1257   | Windows Baltic              | cp1257_general_ci   |      1 |
| utf32    | UTF-32 Unicode              | utf32_general_ci    |      4 |
| binary   | Binary pseudo charset       | binary              |      1 |
| geostd8  | GEOSTD8 Georgian            | geostd8_general_ci  |      1 |
| cp932    | SJIS for Windows Japanese   | cp932_japanese_ci   |      2 |
| eucjpms  | UJIS for Windows Japanese   | eucjpms_japanese_ci |      3 |
+----------+-----------------------------+---------------------+--------+
 rows in set ( sec)
           

而latin1又有好幾種比較方式,latin1_german1_ci、latin1_swedish_ci、latin1_danish_ci、latin1_german2_ci、latin1_bin等等。他們規定了latin1這本字典的不同排序方式。

mysql> SHOW COLLATION LIKE 'latin%';
+---------------------+---------+----+---------+----------+---------+
| Collation           | Charset | Id | Default | Compiled | Sortlen |
+---------------------+---------+----+---------+----------+---------+
| latin1_german1_ci   | latin1  |   |         | Yes      |        |
| latin1_swedish_ci   | latin1  |   | Yes     | Yes      |        |
| latin1_danish_ci    | latin1  |  |         | Yes      |        |
| latin1_german2_ci   | latin1  |  |         | Yes      |        |
| latin1_bin          | latin1  |  |         | Yes      |        |
| latin1_general_ci   | latin1  |  |         | Yes      |        |
| latin1_general_cs   | latin1  |  |         | Yes      |        |
| latin1_spanish_ci   | latin1  |  |         | Yes      |        |
| latin2_czech_cs     | latin2  |   |         | Yes      |        |
| latin2_general_ci   | latin2  |   | Yes     | Yes      |        |
| latin2_hungarian_ci | latin2  |  |         | Yes      |        |
| latin2_croatian_ci  | latin2  |  |         | Yes      |        |
| latin2_bin          | latin2  |  |         | Yes      |        |
| latin5_turkish_ci   | latin5  |  | Yes     | Yes      |        |
| latin5_bin          | latin5  |  |         | Yes      |        |
| latin7_estonian_cs  | latin7  |  |         | Yes      |        |
| latin7_general_ci   | latin7  |  | Yes     | Yes      |        |
| latin7_general_cs   | latin7  |  |         | Yes      |        |
| latin7_bin          | latin7  |  |         | Yes      |        |
+---------------------+---------+----+---------+----------+---------+
 rows in set ( sec)
           

由上面兩張表可見,Charset和Collation是一對多的關系;Charset和Default Collation是一對一的關系。

為什麼要有Default Collation呢?看下面的例子就能了解:

CREATE DATABASE db_name
    [[DEFAULT] CHARACTER SET charset_name]
    [[DEFAULT] COLLATE collation_name]
           

通常,建立Table的時候最多隻指定了CHARSET,很少指定COLLATE。這時候系統必須給一個預設的值,否則行為就是未定義了。預設值就是Default Collation。

僅僅有Charset,沒有Collation,系統會無法正确工作。就好比給你一本字典,問你“張”和“趙”,誰大?沒有答案,但如果增加一個比較規則:“按照百家姓的順序排序”,那麼就可以得出“趙” > “張”了。

按照MySQL命名規則,如果知道了一個Collation,那麼對應的Charset就能唯一确定。是以Create table的時候可以隻指定COLLATE,CHARSET部分能自動推導出來。

Collation

每個字元串都有collation,可以顯式指定:

如果沒有指定,則使用character_set_connection,collation_connection指定的值。

For the simple statement SELECT ‘string’, the string has the character set and collation defined by the character_set_connection and collation_connection system variables.

Coercible

兩個collation不同的資料比較,應該如何操作?這裡面也有一套規則指導我們報錯,或者将其中一個collation轉換成另外一個。具體的,參考MySQL文檔, 它規定了一系列的級别,然後這套規則就根據級别比較來決定如何轉換Collation。

  • An explicit COLLATE clause has a coercibility of 0. (Not coercible at all.)
  • The concatenation of two strings with different collations has a coercibility of 1.
  • The collation of a column or a stored routine parameter or local variable has a coercibility of 2.
  • A “system constant” (the string returned by functions such as USER() or VERSION()) has a coercibility of 3.
  • The collation of a literal has a coercibility of 4.
  • NULL or an expression that is derived from NULL has a coercibility of 5.

這些規則級别被稱為COERICIBILITY。MySQL提供了一個函數來顯示一個值的規則級别:

mysql> SELECT COERCIBILITY('A' COLLATE latin1_swedish_ci);
        -> 
mysql> SELECT COERCIBILITY(VERSION());
        -> 
mysql> SELECT COERCIBILITY('A');
        -> 
           

字元集轉換

字元集之間一般不能互相轉換,但也有例外:一個字元集是例外一個字元集的子集時。例如Unicode和所有其他字元集之間,ASCII和GBK之間,等等。

參考文獻

http://dev.mysql.com/doc/refman/5.7/en/charset.html