How to Detect the Character Encoding of a Text File in Java [Improved UTF-8 Detection]

I. Understanding character encodings:

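1. A Java String is stored internally as UTF-16 code units. What varies is the JVM's default charset, which is used whenever bytes are converted to or from text without an explicit encoding. It depends on the platform and the JDK version (UTF-8 by default since JDK 18) and can be obtained with: Charset.defaultCharset();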
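
As a quick check, the sketch below prints the JVM's default charset and shows that passing an explicit charset makes the result independent of it. The class name DefaultCharsetDemo and the helper encodeUtf8 are made up for this illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DefaultCharsetDemo {
    /** Encodes text with an explicit charset, so the result does not depend on the platform default. */
    static byte[] encodeUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Platform dependent: the value differs between systems and JDK versions.
        System.out.println("Default charset: " + Charset.defaultCharset());
        // "\u56DE" is the character 回; in UTF-8 it always takes 3 bytes.
        System.out.println("UTF-8 byte count: " + encodeUtf8("\u56DE").length);
    }
}
```

Passing the charset explicitly, as encodeUtf8 does, is usually the safer habit, because code that relies on the default can behave differently from one machine to another.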

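2. On Windows, the traditional default encoding for text files is ANSI, which on Chinese-language Windows means GBK. For example, a text document created with older versions of Notepad is saved as ANSI by default (recent Windows 10 and 11 releases default to UTF-8 instead).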

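3. A text document can be saved in one of four encodings: ANSI, Unicode (UTF-16 little-endian), Unicode big endian (UTF-16 big-endian), and UTF-8. "Unicode" here is Notepad's name for UTF-16.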

4. When reading a txt file we often do not know its encoding in advance, so the program has to detect the encoding at run time.

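ANSI: no signature; on Chinese-language systems it means GBK or GB2312
UTF-8: no signature without a BOM; with a BOM the first three bytes are 0xEF 0xBB 0xBF. (The value 0xE5 0x9B 0x9E that this article checks for is simply the UTF-8 encoding of the character 回, so it identifies only files that begin with that character.)
UTF-16 big-endian (Notepad "Unicode big endian"): the first two bytes are 0xFE 0xFF
UTF-16 little-endian (Notepad "Unicode"): the first two bytes are 0xFF 0xFE

For example, a Notepad "Unicode" document starts with 0xFF 0xFE, so a program can read the first few bytes of the file and compare them.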
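
To see where these byte values come from, the sketch below encodes the byte-order mark U+FEFF and the character 回 (U+56DE) and prints the resulting bytes in hex. The class name SignatureBytesDemo and the helper toHex are made up for this illustration.

```java
import java.nio.charset.StandardCharsets;

class SignatureBytesDemo {
    /** Formats bytes as space-separated upper-case hex, e.g. "EF BB BF". */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF)); // mask to treat the byte as unsigned
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String bom = "\uFEFF";  // the byte-order mark character
        String hui = "\u56DE";  // the character 回
        System.out.println(toHex(bom.getBytes(StandardCharsets.UTF_8)));    // EF BB BF
        System.out.println(toHex(bom.getBytes(StandardCharsets.UTF_16BE))); // FE FF
        System.out.println(toHex(bom.getBytes(StandardCharsets.UTF_16LE))); // FF FE
        System.out.println(toHex(hui.getBytes(StandardCharsets.UTF_8)));    // E5 9B 9E
    }
}
```

The first three lines of output are the BOM signatures. The last line is ordinary text content, which is why it cannot serve as a general marker for UTF-8 files.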

5. Mapping between Java charset names and text-file encodings:

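Java charset name	Text encoding	Byte signature
GBK	ANSI	None
UTF-8	UTF-8, in two variants: UTF-8 and UTF-8 with BOM	Without BOM: none (the code below tests for 0xE5 0x9B 0x9E, the character 回); with BOM: the first three bytes are 0xEF 0xBB 0xBF
UTF-16	UTF-16 big-endian (Notepad "Unicode big endian", UCS-2 Big Endian)	The first two bytes are 0xFE 0xFF
UTF-16	UTF-16 little-endian (Notepad "Unicode", UCS-2 Little Endian)	The first two bytes are 0xFF 0xFE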

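If the charset used to read a text file does not match the one it was written in, the result is garbled text, so the correct charset must be set when reading. When a file starts with a BOM, its encoding is recorded in the file header: the program first inspects those bytes to determine the encoding and then reads the file with it, which avoids garbling. Files without a BOM (ANSI, or UTF-8 without BOM) carry no such marker, so their encoding can only be guessed.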
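
The effect of a mismatch can be reproduced in memory, without any file. The sketch below encodes the character 回 as UTF-8 and decodes the bytes once with the right charset and once with a wrong one. ISO-8859-1 stands in for the wrong charset only because every JVM is guaranteed to support it; the class name MojibakeDemo and the helper recode are made up for this illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    /** Encodes text with one charset and decodes the resulting bytes with another. */
    static String recode(String text, Charset writeAs, Charset readAs) {
        return new String(text.getBytes(writeAs), readAs);
    }

    public static void main(String[] args) {
        String original = "\u56DE"; // the character 回
        String right = recode(original, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        String wrong = recode(original, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        System.out.println(right.equals(original)); // true
        System.out.println(wrong.equals(original)); // false
        System.out.println(wrong.length());         // 3: each byte became its own character
    }
}
```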

II. An example:

Given a text file: test.txt

Test code:

/**
 * File: CharsetCodeTest.java
 * Description: test of text-file character-encoding detection
 */

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;

public class CharsetCodeTest {
    public static void main(String[] args) throws Exception {
        String filePath = "test.txt";
        String content = readTxt(filePath);
        System.out.println(content);
    }

    /** Reads the whole file as text, using the charset guessed from its first bytes. */
    public static String readTxt(String path) {
        StringBuilder content = new StringBuilder();
        try {
            String fileCharsetName = getFileCharsetName(path);
            System.out.println("File encoding: " + fileCharsetName);

            // try-with-resources closes the reader even if reading fails
            try (BufferedReader br = new BufferedReader(
                    new InputStreamReader(new FileInputStream(path), fileCharsetName))) {
                String str;
                boolean isFirst = true;
                while (null != (str = br.readLine())) {
                    if (isFirst) {
                        isFirst = false;
                        // Java's UTF-8 decoder keeps the BOM as the character U+FEFF; drop it
                        if (str.startsWith("\uFEFF"))
                            str = str.substring(1);
                    } else {
                        content.append(System.lineSeparator());
                    }
                    content.append(str);
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
            System.err.println("Failed to read file: " + path);
        }
        return content.toString();
    }

    /** Guesses the charset name from the first three bytes of the file. */
    public static String getFileCharsetName(String fileName) throws IOException {
        byte[] head = new byte[3];
        try (InputStream inputStream = new FileInputStream(fileName)) {
            inputStream.read(head); // positions not read stay 0 for very short files
        }

        String charsetName = "GBK"; // or GB2312, i.e. ANSI on Chinese-language Windows
        if (head[0] == -1 && head[1] == -2) // 0xFF 0xFE: UTF-16 little-endian BOM
            charsetName = "UTF-16"; // Java's UTF-16 decoder reads the BOM to pick the byte order
        else if (head[0] == -2 && head[1] == -1) // 0xFE 0xFF: UTF-16 big-endian BOM
            charsetName = "UTF-16";
        else if (head[0] == -27 && head[1] == -101 && head[2] == -98) // 0xE5 0x9B 0x9E
            charsetName = "UTF-8"; // UTF-8 without BOM; matches only files that start with 回
        else if (head[0] == -17 && head[1] == -69 && head[2] == -65) // 0xEF 0xBB 0xBF
            charsetName = "UTF-8"; // UTF-8 with BOM

        return charsetName;
    }
}
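
The 0xE5 0x9B 0x9E test in getFileCharsetName only recognises BOM-less UTF-8 files that begin with the character 回. One possible alternative, sketched below as a heuristic and not as the method used in this article, is to check whether the file's bytes are well-formed UTF-8 by decoding them strictly. The class name Utf8Validator and the method isValidUtf8 are made up for this illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class Utf8Validator {
    /** Returns true if the bytes are well-formed UTF-8 (pure ASCII also passes). */
    static boolean isValidUtf8(byte[] data) {
        // REPORT makes the decoder throw instead of silently substituting U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] utf8 = "\u56DE".getBytes(StandardCharsets.UTF_8); // E5 9B 9E
        byte[] notUtf8 = {(byte) 0xBB, (byte) 0xD8};             // not well-formed UTF-8
        System.out.println(isValidUtf8(utf8));    // true
        System.out.println(isValidUtf8(notUtf8)); // false
    }
}
```

In getFileCharsetName, such a check could replace the fixed three-byte comparison, for example by reading the file with Files.readAllBytes and returning "UTF-8" when isValidUtf8 succeeds and "GBK" otherwise. It remains a guess: short GBK byte sequences can occasionally happen to be valid UTF-8, and pure ASCII files pass as well, although they decode identically either way.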

Output:

[Screenshot of the console output; the image is not preserved.]
