Manage Unicode Characters in Data Using T-SQL

In this article, I’ll provide some useful information to help you understand how to use Unicode in SQL Server and, with the help of T-SQL, address various problems that arise from Unicode characters in text data.

What is Unicode?

The American Standard Code for Information Interchange (ASCII) was one of the first widely used character encoding formats. Originally developed in the US and intended for English, ASCII can only encode 128 characters. Character encoding simply means assigning a unique number to every character being used. As an example, the letters 'A' and 'a', the digit '1' and the symbol '+' map to the numbers shown in the table:

ASCII('A')   ASCII('a')   ASCII('1')   ASCII('+')
65           97           49           43

The T-SQL statement below can help us find the character from the ASCII value and vice-versa:

SELECT CHAR(193) as Character
           

Here is the result set of ASCII value to char. With a collation that uses code page 1252, the query returns Á.

SELECT ASCII('Á') as ASCII_
           

Here is the result set of char to ASCII value. With the same code page, the query returns 193.
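
CHAR and ASCII work on single-byte character data, so their results depend on the code page of the collation in use. As a minimal sketch, the Unicode counterparts NCHAR and UNICODE (built-in SQL Server functions) perform the same round trip on Unicode code points, independent of the code page:

```sql
-- Code point to character: 193 is U+00C1 (Latin capital letter A with acute)
SELECT NCHAR(193) AS Character_;      -- returns Á

-- Character to code point: note the N prefix on the literal
SELECT UNICODE(N'Á') AS CodePoint_;   -- returns 193
```

For codes 0 to 127, the two pairs of functions agree, because Unicode keeps the ASCII assignments for those values.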

While ASCII encoding was acceptable for the most common English language characters, numbers and punctuation, it was too constraining for the rest of the world’s languages. As a result, other languages required different encoding schemes, and character definitions changed according to the language. Having encoding schemes of different lengths required programs to figure out which one to apply depending on the language being used.

Here is where international standards become critical. When the entire world practices the same character encoding scheme, every computer can display the same characters. This is where the Unicode Standard comes in.

Encoding is always related to a charset, so the encoding process encodes characters to bytes and decodes bytes to characters. There are several Unicode formats: UTF-8, UTF-16 and UTF-32.

  • UTF-8 uses 1 byte to encode an English (ASCII) character and between 1 and 4 bytes per character overall. It has no concept of byte order. Most European alphabets (Latin, Greek, Cyrillic) are encoded in two bytes or fewer per character
  • UTF-16 uses 2 bytes to encode an English character. It is widely used and takes either 2 or 4 bytes per character
  • UTF-32 uses 4 bytes to encode every character, including English ones. Its fixed width makes it best for random access by character offset into a byte array
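
The byte counts above can be observed directly in SQL Server, which stores nchar and nvarchar data as UTF-16. The following is a small sketch using the built-in DATALENGTH function, which returns the number of bytes used by a value; the emoji is only an illustrative example of a character outside the 2-byte range:

```sql
SELECT DATALENGTH('A')   AS single_byte_A,    -- 1 byte in a varchar literal
       DATALENGTH(N'A')  AS utf16_A,          -- 2 bytes in UTF-16
       DATALENGTH(N'Á')  AS utf16_A_acute,    -- 2 bytes in UTF-16
       DATALENGTH(N'😀') AS utf16_emoji;      -- 4 bytes (a surrogate pair)
```

This is why a Unicode column generally needs about twice the storage of a single-byte column for English text.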

Special characters are often problematic. When working with different source systems, it would be preferable if every system agreed on which characters are acceptable. In practice, developers often take wrong turns while trying to identify or troubleshoot an issue, only to find in the end that odd characters in the data caused the error.

Unicode data types in SQL Server

Microsoft SQL Server supports the below Unicode data types:

  • nchar
  • nvarchar
  • ntext

Unicode string literals are written with the prefix “N”, which originates from the SQL-92 standard. The nchar, nvarchar and ntext data types are used in the same way as char, varchar and text, respectively. Because Unicode supports a much broader range of characters, more space is needed to store Unicode data. The maximum declared length of nchar and nvarchar columns is therefore 4,000, not 8,000 as for char and varchar (nvarchar(max) lifts this limit). For example:

N'Mãrk sÿmónds'
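
To illustrate why the prefix matters, here is a hedged sketch. Without the N prefix, a literal is treated as single-byte character data and converted to the code page of the database collation, so characters that the code page cannot represent may be lost. The Japanese sample text is an assumption chosen for illustration, and the exact output of the first column depends on the collation in use:

```sql
SELECT '日本語'  AS without_prefix,   -- typically returns ??? under a Latin1 (code page 1252) collation
       N'日本語' AS with_prefix;      -- kept intact as Unicode data
```

The same loss can occur when a Unicode value is assigned to a char or varchar variable or column.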

All Unicode data uses the same Unicode code page. Collations do not determine the code page used for Unicode columns; they control only attributes such as comparison rules and case sensitivity.
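
As a small sketch of how a collation changes comparison rules without changing the stored characters, the query below compares the same two Unicode strings under a case-insensitive and a case-sensitive collation. The collation names are standard Windows collations that ship with SQL Server; the sample strings are illustrative:

```sql
SELECT CASE WHEN N'résumé' = N'RÉSUMÉ' COLLATE Latin1_General_CI_AS
            THEN 'equal' ELSE 'different' END AS case_insensitive,   -- equal
       CASE WHEN N'résumé' = N'RÉSUMÉ' COLLATE Latin1_General_CS_AS
            THEN 'equal' ELSE 'different' END AS case_sensitive;     -- different
```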

This T-SQL statement returns the characters for codes 193 to 200, which lie above the 128-character ASCII range and therefore depend on the code page:

SELECT CHAR(193), CHAR(194), CHAR(195), CHAR(196), CHAR(197), CHAR(198), CHAR(199), CHAR(200)
           
CHAR(193)  CHAR(194)  CHAR(195)  CHAR(196)  CHAR(197)  CHAR(198)  CHAR(199)  CHAR(200)
Á          Â          Ã          Ä          Å          Æ          Ç          È
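
The same range can also be listed one row per code, which is easier to read for longer ranges. This is a sketch that uses a table value constructor (the alias v and the column name code are illustrative) and shows the single-byte and Unicode interpretation side by side:

```sql
SELECT v.code,
       CHAR(v.code)  AS single_byte_char,   -- depends on the collation's code page
       NCHAR(v.code) AS unicode_char        -- the Unicode code point with the same number
FROM (VALUES (193), (194), (195), (196), (197), (198), (199), (200)) AS v(code);
```

With code page 1252, both columns show the same characters for this range, because the two character sets share these assignments.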

Get a list of special characters in SQL Server

Some character sets can be represented in a single-byte encoding scheme, whereas others require multi-byte encoding. The T-SQL function below returns each special character in a string together with its position:

Function:

CREATE FUNCTION [dbo].[Find_Unicode]
(
    @in_string nvarchar(max)
)
RETURNS @unicode_char TABLE(id INT IDENTITY(1,1), Char_ NVARCHAR(4), position BIGINT)
AS
BEGIN
    DECLARE @character nvarchar(1)
    DECLARE @index int

    -- Walk the string one character at a time (positions are 1-based)
    SET @index = 1
    WHILE @index <= LEN(@in_string)
    BEGIN
        SET @character = SUBSTRING(@in_string, @index, 1)
        -- Record anything outside codes 32-127, except 10 (line feed) and 11 (vertical tab)
        IF((UNICODE(@character) NOT BETWEEN 32 AND 127) AND UNICODE(@character) NOT IN (10,11))
        BEGIN
            INSERT INTO @unicode_char(Char_, position)
            VALUES(@character, @index)
        END
        SET @index = @index + 1
    END
    RETURN
END
GO
           

Execution:

SELECT * 
FROM [Find_Unicode](N'Mãrk sÿmónds')
           

Here is the result set:

id  Char_  position
1   ã      2
2   ÿ      7
3   ó      9

Remove special characters from string in SQL Server

In the code below, we define logic to remove special characters from a string. The function keeps characters with codes 32 to 127, along with codes 10 and 11. This range covers the basic ASCII characters, including capital letters from 65 to 90 and lowercase letters from 97 to 122. Each character is mapped to its code in T-SQL using the UNICODE function. The “RemoveNonASCII” function removes all other characters from the string and returns what is left:

CREATE FUNCTION [dbo].[RemoveNonASCII]
(
    @in_string nvarchar(max)
)
RETURNS nvarchar(MAX)
AS
BEGIN

    -- Accumulates the characters that are kept
    DECLARE @Result nvarchar(MAX)
    SET @Result = ''

    DECLARE @character nvarchar(1)
    DECLARE @index int

    -- Walk the string one character at a time (positions are 1-based)
    SET @index = 1
    WHILE @index <= LEN(@in_string)
    BEGIN
        SET @character = SUBSTRING(@in_string, @index, 1)

        -- Keep codes 32-127, plus 10 (line feed) and 11 (vertical tab); drop everything else
        IF (UNICODE(@character) BETWEEN 32 AND 127) OR UNICODE(@character) IN (10,11)
            SET @Result = @Result + @character
        SET @index = @index + 1
    END

    RETURN @Result
END
           

Execution:

SELECT dbo.[RemoveNonASCII](N'Mãrk sÿmónds')
           
Here is the result set:

Mrk smnds

These SQL functions can be very useful if you’re working with large international character sets.
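
As a usage sketch, the two functions can be combined to audit a column and preview the cleaned values. The table variable @customers and its columns are hypothetical and exist only for this example. The query assumes that dbo.Find_Unicode and dbo.RemoveNonASCII have been created as shown above:

```sql
-- Hypothetical sample data for illustration
DECLARE @customers TABLE (customer_id INT, customer_name NVARCHAR(50));
INSERT INTO @customers (customer_id, customer_name)
VALUES (1, N'Mark Symonds'),
       (2, N'Mãrk sÿmónds');

-- One row per special character found, with its position and code point,
-- next to the cleaned version of the whole name
SELECT c.customer_id,
       c.customer_name,
       u.Char_,
       u.position,
       UNICODE(u.Char_) AS code_point,
       dbo.RemoveNonASCII(c.customer_name) AS cleaned_name
FROM @customers AS c
CROSS APPLY dbo.Find_Unicode(c.customer_name) AS u
ORDER BY c.customer_id, u.position;
```

Rows without special characters, such as customer 1 here, produce no output because CROSS APPLY returns only rows for which the function yields results. Both functions loop character by character, so on large tables it may be worth testing performance before running them over every row.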

Translated from: https://www.sqlshack.com/manage-unicode-characters-in-data-using-t-sql/