Character Set Research: Converting Between Character Sets

Author: 朱金燦

The previous article introduced multibyte character sets and the Unicode character set. This one covers how to convert between the two.

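First, a word on Microsoft's attitude toward Unicode. In Windows development, the Unicode character set is called the wide-character set and the multibyte character set is called the narrow-character set. Microsoft supports Unicode strongly, as several facts show: Windows 2000 and later are built on Unicode internally; Windows CE is itself a Unicode operating system and does not support the ANSI versions of the Windows API functions at all; and a newly created VC project defaults to the Unicode character set (UTF-16). So the question is: as a C++ programmer, should you use the Unicode character set?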

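Why use the Unicode character set? The first reason is runtime efficiency: the Windows kernel itself works with Unicode characters, so any non-Unicode string passed in has to be converted to Unicode first (the book 《Windows核心程式設計》 explains this in detail). The second is easier data exchange across languages: for example, if you enter a Chinese path on an English-language operating system using non-Unicode text and the Chinese character set is not installed, the path comes out garbled.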

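Why not use the Unicode character set? Because legacy is strong: many cross-platform third-party libraries are built around multibyte strings. Habit matters as well: even when developing on Windows, the function everyone knows for computing a string's length is strlen, and few people reach for the wide-character version, wcslen. For details, see my earlier article:

"Unicode character set: to use it or not?" (《unicode字元集，用還是不用？》)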
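To make the strlen/wcslen contrast concrete, here is a minimal sketch. The names kNarrowGb2312, kWide, narrow_length, wide_length and check_lengths are hypothetical and introduced only for illustration. It assumes the two characters U+4E2D U+6587 ("中文") stored once as GB2312 bytes and once as a wide string, both written as escapes so the result does not depend on how the source file is saved.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <cwchar>

// "中文" (U+4E2D U+6587), written as escapes so the result does not
// depend on the encoding of this source file.
static const char    kNarrowGb2312[] = "\xD6\xD0\xCE\xC4";  // GB2312 bytes, 2 per character
static const wchar_t kWide[]         = L"\u4E2D\u6587";     // one wchar_t per character

// strlen counts bytes up to the terminating NUL.
static std::size_t narrow_length() { return std::strlen(kNarrowGb2312); }

// wcslen counts wchar_t units up to the terminating L'\0'.
static std::size_t wide_length() { return std::wcslen(kWide); }

// Same text, different lengths: bytes versus wide-character units.
static void check_lengths()
{
    assert(narrow_length() == 4);
    assert(wide_length() == 2);
}
```

Neither function counts "characters" in general: strlen counts bytes, and wcslen counts wchar_t units, which on Windows are UTF-16 code units, so a character outside the Basic Multilingual Plane would count as two.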

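Finally, converting between the multibyte character set and the Unicode character set. There are two approaches. The first is the cross-platform iconv library. The sample below converts between GB2312 and UTF-8: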

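#include <stdio.h>
#include <stdlib.h>
#include <string.h>  // memset, strlen
#include <string>
using namespace std;

#include <iconv.h>   // character encoding conversion library

#define OUTLEN 255   // output buffer size (long enough for a file path)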

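// Convert inbuf from from_charset to to_charset.
// Returns 0 on success, -1 on failure.
int code_convert(const char *from_charset, const char *to_charset, char *inbuf, size_t inlen, char *outbuf, size_t outlen)
{
    if (outlen == 0)
        return -1;
    memset(outbuf, 0, outlen);

    // iconv_open reports failure as (iconv_t)-1, not 0
    iconv_t cd = iconv_open(to_charset, from_charset);
    if (cd == (iconv_t)-1)
        return -1;

    // Reserve the last byte so the output always stays NUL-terminated
    size_t outleft = outlen - 1;
    size_t rc = iconv(cd, &inbuf, &inlen, &outbuf, &outleft);
    iconv_close(cd);  // release the descriptor on failure as well
    return rc == (size_t)-1 ? -1 : 0;
}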
// Convert Unicode text (encoded as UTF-8) to GB2312
int u2g(char *inbuf, size_t inlen, char *outbuf, size_t outlen)
{
    return code_convert("utf-8", "gb2312", inbuf, inlen, outbuf, outlen);
}
// Convert GB2312 text to Unicode (encoded as UTF-8)
int g2u(char *inbuf, size_t inlen, char *outbuf, size_t outlen)
{
    return code_convert("gb2312", "utf-8", inbuf, inlen, outbuf, outlen);
}

// Callback for executing an SQL statement (not used by main below)
static int _sql_callback(void* pUsed, int argc, char** argv, char** ppszColName)
{
    for (int i = 0; i < argc; i++)
    {
        printf("%s = %s\n", ppszColName[i], argv[i] == 0 ? "NULL" : argv[i]);
    }
    return 0;
}

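int main()
{
    // This literal holds GB2312 bytes only if the source file itself is saved in that encoding
    char in_gb2312[] = "D://控制點庫//GCPDB.3sdb";

    char out[OUTLEN];

    // Convert GB2312 to Unicode (UTF-8)
    if (g2u(in_gb2312, strlen(in_gb2312), out, OUTLEN) != 0)
    {
        printf("conversion failed\n");
        return 1;
    }
    printf("gb2312-->unicode out=%s \n", out);
    return 0;
}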
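The fixed 255-byte output buffer above is fine for short paths, but longer input would overflow it. As a hedged sketch of one way around that, the hypothetical helper convert_encoding below (not part of the original article) calls iconv in a loop and appends each converted chunk to a std::string. It assumes a glibc-style iconv whose input argument is char **, and it skips the final flush call that stateful encodings would need.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <stdexcept>
#include <string>

#include <iconv.h>

// Hypothetical helper: convert `input` from encoding `from` to encoding `to`,
// growing the result as needed instead of relying on a fixed-size buffer.
std::string convert_encoding(const std::string& input, const char* from, const char* to)
{
    iconv_t cd = iconv_open(to, from);
    if (cd == (iconv_t)-1)
        throw std::runtime_error("iconv_open failed: unsupported conversion");

    std::string result;
    char chunk[256];
    // iconv takes a non-const char** here; the input bytes are not modified.
    char* in = const_cast<char*>(input.data());
    std::size_t in_left = input.size();

    while (in_left > 0) {
        char* out = chunk;
        std::size_t out_left = sizeof(chunk);
        std::size_t rc = iconv(cd, &in, &in_left, &out, &out_left);
        int err = errno;  // save errno before any other call can change it
        assert(out_left <= sizeof(chunk));
        result.append(chunk, sizeof(chunk) - out_left);
        // E2BIG only means the chunk is full: keep looping. Anything else is fatal.
        if (rc == (std::size_t)-1 && err != E2BIG) {
            iconv_close(cd);
            throw std::runtime_error("iconv failed: invalid or incomplete input");
        }
    }
    iconv_close(cd);
    return result;
}
```

With this helper, the g2u call in main could be written as convert_encoding(in_gb2312, "gb2312", "utf-8"), provided the local iconv supports GB2312. Throwing on failure is a design choice of this sketch; returning an error code, as code_convert does, would work equally well.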

The second approach is to use the Windows API. Sample code follows:

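#include <windows.h>
#include <string>

// Convert a string in the system multibyte code page to UTF-8.
// Returns an empty string if either conversion step fails.
std::string MbcsToUtf8(const char* pszMbcs)
{
    std::string str;
    // Use the same code page that the file APIs use
    int codepage = AreFileApisANSI() ? CP_ACP : CP_OEMCP;

    // Step 1: multibyte -> UTF-16. With -1 as the input length,
    // the returned size includes the terminating NUL.
    int len = MultiByteToWideChar(codepage, 0, pszMbcs, -1, NULL, 0);
    if (len == 0)
        return str;
    WCHAR* pwchar = new WCHAR[len];
    len = MultiByteToWideChar(codepage, 0, pszMbcs, -1, pwchar, len);
    if (len != 0)
    {
        // Step 2: UTF-16 -> UTF-8
        len = WideCharToMultiByte(CP_UTF8, 0, pwchar, -1, NULL, 0, NULL, NULL);
        if (len != 0)
        {
            CHAR* pchar = new CHAR[len];
            len = WideCharToMultiByte(CP_UTF8, 0, pwchar, -1, pchar, len, NULL, NULL);
            if (len != 0)
                str = pchar;
            delete[] pchar;  // memory from new[] must be released with delete[]
        }
    }
    delete[] pwchar;  // released on every path, including conversion failure
    return str;
}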
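To show what the second step of MbcsToUtf8 (UTF-16 to UTF-8) does to the bytes, here is a portable, hand-written sketch. It is not how WideCharToMultiByte is implemented; the function name utf16_to_utf8 is hypothetical, and replacing an unpaired surrogate with U+FFFD is an assumption of this sketch, one common policy among several.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical illustration: encode UTF-16 code units as UTF-8 bytes.
std::string utf16_to_utf8(const std::u16string& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size() &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // A high surrogate followed by a low surrogate forms one code point.
            char32_t low = in[i + 1];
            cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
            ++i;
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD;  // unpaired surrogate: substitute U+FFFD (assumed policy)
        }
        assert(cp <= 0x10FFFF);

        if (cp < 0x80) {                 // 1 byte: 0xxxxxxx
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {         // 2 bytes: 110xxxxx 10xxxxxx
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {       // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                         // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

For example, the single UTF-16 unit 0x4E2D ("中") becomes the three bytes E4 B8 AD, which is why a Chinese string typically grows from two bytes per character in UTF-16 to three in UTF-8.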
