A Brief Analysis of Converting Between String and byte[] in Java

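Conversions between strings and byte arrays come up constantly in Java, and there is already plenty of analysis and code online. This article analyzes and summarizes the principles and implementation of the regular conversion between byte[] and String, and of the conversion between hexadecimal Strings and byte[].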

1. String to byte[]

Let us first look at the usual way to convert a String to a byte[]:

    public static byte[] strToByteArray(String str) {
        if (str == null) {
            return null;
        }
        // Encodes with the platform default charset.
        byte[] byteArray = str.getBytes();
        return byteArray;
    }

This simply calls the getBytes() method of String. Reading the JDK source shows that it eventually calls the following method of String. (The listings marked "JDK source code" in this article appear to come from the String implementation in Android's libcore; other JDKs implement the same behavior with different code.)

    /**
     * JDK source code
     */
    public byte[] getBytes(Charset charset) {
        String canonicalCharsetName = charset.name();
        if (canonicalCharsetName.equals("UTF-8")) {
            return Charsets.toUtf8Bytes(value, offset, count);
        } else if (canonicalCharsetName.equals("ISO-8859-1")) {
            return Charsets.toIsoLatin1Bytes(value, offset, count);
        } else if (canonicalCharsetName.equals("US-ASCII")) {
            return Charsets.toAsciiBytes(value, offset, count);
        } else if (canonicalCharsetName.equals("UTF-16BE")) {
            return Charsets.toBigEndianUtf16Bytes(value, offset, count);
        } else {
            CharBuffer chars = CharBuffer.wrap(this.value, this.offset, this.count);
            ByteBuffer buffer = charset.encode(chars.asReadOnlyBuffer());
            byte[] bytes = new byte[buffer.limit()];
            buffer.get(bytes);
            return bytes;
        }
    }

The code above simply encodes the string with the given charset. When the no-argument getBytes() is called, the default charset is used, which is obtained as follows:

    /**
     * JDK source code
     */
    private static Charset getDefaultCharset() {
        String encoding = System.getProperty("file.encoding", "UTF-8");
        try {
            return Charset.forName(encoding);
        } catch (UnsupportedCharsetException e) {
            return Charset.forName("UTF-8");
        }
    }

About the default charset, the Java API documentation says:

    The default charset is determined during virtual-machine startup and typically depends upon the locale and charset of the underlying operating system.

The code above likewise shows that the default charset is determined by the "file.encoding" system property. In my tests, the default charset was "GBK" on Simplified Chinese Windows and "UTF-8" on Android.

2. byte[] to String

Next, the usual way to convert a byte[] to a String:

    public static String byteArrayToStr(byte[] byteArray) {
        if (byteArray == null) {
            return null;
        }
        // Decodes with the platform default charset.
        String str = new String(byteArray);
        return str;
    }

This is just one of the String constructors. The String source shows that every constructor taking a byte[] eventually calls the constructor listed below. Note that a Java String holds Unicode text (a char array of UTF-16 code units), so getBytes() above converts from Unicode to a byte array in the specified encoding, while the Charset here is the encoding used to read the byte array.

    /**
     * JDK source code
     */
    public String(byte[] data, int offset, int byteCount, Charset charset) {
        if ((offset | byteCount) < 0 || byteCount > data.length - offset) {
            throw failedBoundsCheck(data.length, offset, byteCount);
        }

        // We inline UTF-8, ISO-8859-1, and US-ASCII decoders for speed and because
        // 'count' and 'value' are final.
        String canonicalCharsetName = charset.name();
        if (canonicalCharsetName.equals("UTF-8")) {
            byte[] d = data;
            char[] v = new char[byteCount];

            int idx = offset;
            int last = offset + byteCount;
            int s = 0;
            outer:
            while (idx < last) {
                byte b0 = d[idx++];
                if ((b0 & 0x80) == 0) {
                    // 0xxxxxxx
                    // Range:  U-00000000 - U-0000007F
                    int val = b0 & 0xff;
                    v[s++] = (char) val;
                } else if (((b0 & 0xe0) == 0xc0) || ((b0 & 0xf0) == 0xe0) ||
                        ((b0 & 0xf8) == 0xf0) || ((b0 & 0xfc) == 0xf8) ||
                        ((b0 & 0xfe) == 0xfc)) {
                    int utfCount = 1;
                    if ((b0 & 0xf0) == 0xe0) utfCount = 2;
                    else if ((b0 & 0xf8) == 0xf0) utfCount = 3;
                    else if ((b0 & 0xfc) == 0xf8) utfCount = 4;
                    else if ((b0 & 0xfe) == 0xfc) utfCount = 5;

                    // 110xxxxx (10xxxxxx)+
                    // Range:  U-00000080 - U-000007FF (count == 1)
                    // Range:  U-00000800 - U-0000FFFF (count == 2)
                    // Range:  U-00010000 - U-001FFFFF (count == 3)
                    // Range:  U-00200000 - U-03FFFFFF (count == 4)
                    // Range:  U-04000000 - U-7FFFFFFF (count == 5)

                    if (idx + utfCount > last) {
                        v[s++] = REPLACEMENT_CHAR;
                        continue;
                    }

                    // Extract usable bits from b0
                    int val = b0 & (0x1f >> (utfCount - 1));
                    for (int i = 0; i < utfCount; ++i) {
                        byte b = d[idx++];
                        if ((b & 0xc0) != 0x80) {
                            v[s++] = REPLACEMENT_CHAR;
                            idx--; // Put the input char back
                            continue outer;
                        }
                        // Push new bits in from the right side
                        val <<= 6;
                        val |= b & 0x3f;
                    }

                    // Note: Java allows overlong char
                    // specifications To disallow, check that val
                    // is greater than or equal to the minimum
                    // value for each count:
                    //
                    // count    min value
                    // -----   ----------
                    //   1           0x80
                    //   2          0x800
                    //   3        0x10000
                    //   4       0x200000
                    //   5      0x4000000

                    // Allow surrogate values (0xD800 - 0xDFFF) to
                    // be specified using 3-byte UTF values only
                    if ((utfCount != 2) && (val >= 0xD800) && (val <= 0xDFFF)) {
                        v[s++] = REPLACEMENT_CHAR;
                        continue;
                    }

                    // Reject chars greater than the Unicode maximum of U+10FFFF.
                    if (val > 0x10FFFF) {
                        v[s++] = REPLACEMENT_CHAR;
                        continue;
                    }

                    // Encode chars from U+10000 up as surrogate pairs
                    if (val < 0x10000) {
                        v[s++] = (char) val;
                    } else {
                        int x = val & 0xffff;
                        int u = (val >> 16) & 0x1f;
                        int w = (u - 1) & 0xffff;
                        int hi = 0xd800 | (w << 6) | (x >> 10);
                        int lo = 0xdc00 | (x & 0x3ff);
                        v[s++] = (char) hi;
                        v[s++] = (char) lo;
                    }
                } else {
                    // Illegal values 0x8*, 0x9*, 0xa*, 0xb*, 0xfd-0xff
                    v[s++] = REPLACEMENT_CHAR;
                }
            }

            if (s == byteCount) {
                // We guessed right, so we can use our temporary array as-is.
                this.offset = 0;
                this.value = v;
                this.count = s;
            } else {
                // Our temporary array was too big, so reallocate and copy.
                this.offset = 0;
                this.value = new char[s];
                this.count = s;
                System.arraycopy(v, 0, value, 0, s);
            }
        } else if (canonicalCharsetName.equals("ISO-8859-1")) {
            this.offset = 0;
            this.value = new char[byteCount];
            this.count = byteCount;
            Charsets.isoLatin1BytesToChars(data, offset, byteCount, value);
        } else if (canonicalCharsetName.equals("US-ASCII")) {
            this.offset = 0;
            this.value = new char[byteCount];
            this.count = byteCount;
            Charsets.asciiBytesToChars(data, offset, byteCount, value);
        } else {
            CharBuffer cb = charset.decode(ByteBuffer.wrap(data, offset, byteCount));
            this.offset = 0;
            this.count = cb.length();
            if (count > 0) {
                // We could use cb.array() directly, but that would mean we'd have to trust
                // the CharsetDecoder doesn't hang on to the CharBuffer and mutate it later,
                // which would break String's immutability guarantee. It would also tend to
                // mean that we'd be wasting memory because CharsetDecoder doesn't trim the
                // array. So we copy.
                this.value = new char[count];
                System.arraycopy(cb.array(), 0, value, 0, count);
            } else {
                this.value = EmptyArray.CHAR;
            }
        }
    }

The conversion itself is fairly involved, but in essence it reads one or more elements of the byte array according to the given Charset and converts them into char values (a char is itself a UTF-16 code unit), because the core of String is the char array it maintains internally. Interested readers should study the rules of the individual encodings first, since they are a prerequisite for following the conversion in detail.

3. byte[] to hexadecimal String

A hexadecimal String is a string whose characters are all hexadecimal digits. A byte has eight bits and can be written as two hexadecimal digits, so each element of the byte array becomes two hex chars, and the resulting hex string is twice as long as the byte array. Here is the code:

    public static String byteArrayToHexStr(byte[] byteArray) {
        if (byteArray == null) {
            return null;
        }
        char[] hexArray = "0123456789ABCDEF".toCharArray();
        char[] hexChars = new char[byteArray.length * 2];
        for (int j = 0; j < byteArray.length; j++) {
            // Mask to 0..255 so the shift and lookup see an unsigned value.
            int v = byteArray[j] & 0xFF;
            hexChars[j * 2] = hexArray[v >>> 4];     // high nibble
            hexChars[j * 2 + 1] = hexArray[v & 0x0F]; // low nibble
        }
        return new String(hexChars);
    }

The byte value is ANDed with 0xFF so that it becomes a non-negative int before the unsigned right shift (the >>> operator only operates on 32-bit and 64-bit values). Converting the byte to int directly would go wrong: Java stores integers in two's complement, and widening a byte to int sign-extends it. For example, a byte with value -1 is stored as 11111111 and becomes 11111111111111111111111111111111 as an int; using that value directly gives a wrong result (for -1, v >>> 4 would be 0x0FFFFFFF, far outside the 16-entry table). The literal 0xFF is an int, that is 0x000000FF, so ANDing a byte with it first promotes the byte to int and then clears the high 24 bits of the result, which makes the subsequent operations correct.

4. Hexadecimal String to byte[]

There is not much to add: this is the inverse of the byte[] to hexadecimal String conversion. An odd-length input cannot be mapped to whole bytes, so it is rejected instead of silently dropping the last digit:

    public static byte[] hexStrToByteArray(String str) {
        if (str == null) {
            return null;
        }
        if (str.length() == 0) {
            return new byte[0];
        }
        if (str.length() % 2 != 0) {
            throw new IllegalArgumentException("Hex string must have an even length");
        }
        byte[] byteArray = new byte[str.length() / 2];
        for (int i = 0; i < byteArray.length; i++) {
            // Each pair of hex digits encodes one byte.
            String subStr = str.substring(2 * i, 2 * i + 2);
            byteArray[i] = (byte) Integer.parseInt(subStr, 16);
        }
        return byteArray;
    }

All the code in this article can be viewed and downloaded from my GitHub home page.
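As a minimal sketch tying the four conversions together, the example below chains text to bytes, bytes to a hex string, and back again. It is not the article's code: the class name ConversionRoundTrip and its methods toHex and fromHex are hypothetical, and it assumes an explicit charset (StandardCharsets.UTF_8) instead of the platform default, so the result does not depend on "file.encoding". It also prints the sign-extension effect that the 0xFF mask avoids, and shows that decoding with a different charset than the one used for encoding does not restore the text.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper (not from the article) chaining the four conversions. */
final class ConversionRoundTrip {

    private static final char[] HEX_DIGITS = "0123456789ABCDEF".toCharArray();

    /** byte[] to hex string: two uppercase hex digits per byte. */
    static String toHex(byte[] bytes) {
        char[] out = new char[bytes.length * 2];
        for (int i = 0; i < bytes.length; i++) {
            int v = bytes[i] & 0xFF;              // clear the sign-extended high bits
            out[2 * i] = HEX_DIGITS[v >>> 4];     // high nibble
            out[2 * i + 1] = HEX_DIGITS[v & 0x0F]; // low nibble
        }
        return new String(out);
    }

    /** Hex string to byte[]; rejects odd lengths and non-hex characters. */
    static byte[] fromHex(String hex) {
        if (hex.length() % 2 != 0) {
            throw new IllegalArgumentException("odd-length hex string");
        }
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            int hi = Character.digit(hex.charAt(2 * i), 16);
            int lo = Character.digit(hex.charAt(2 * i + 1), 16);
            if (hi < 0 || lo < 0) {
                throw new IllegalArgumentException("non-hex character near index " + 2 * i);
            }
            out[i] = (byte) ((hi << 4) | lo);
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "\u4e2d\u6587"; // two CJK characters, written as escapes
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);

        String hex = toHex(utf8);
        String back = new String(fromHex(hex), StandardCharsets.UTF_8);
        System.out.println(hex);               // E4B8ADE69687
        System.out.println(text.equals(back)); // true

        byte b = (byte) 0xFF;
        System.out.println((int) b);  // -1  (sign-extended)
        System.out.println(b & 0xFF); // 255 (masked)

        // Decoding with a different charset than the one used to encode
        // does not give the original text back.
        String wrong = new String(utf8, StandardCharsets.ISO_8859_1);
        System.out.println(text.equals(wrong)); // false
    }
}
```

Passing the charset explicitly on both sides is a design choice of this sketch: the same bytes then decode to the same text on Windows, Android, or anywhere else, whatever the default charset happens to be.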

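This article is also published on my personal technical blog, which is updated in sync; you are welcome to follow it. Please credit the source when reposting. If you find any errors in the article, corrections and discussion are welcome.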
