Handling the BOM Header of UTF-8 Files in Java

BOM stands for Byte Order Mark.

Basic Concepts

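UCS defines a character called "ZERO WIDTH NO-BREAK SPACE" whose code point is FEFF. FFFE, by contrast, is not a valid character in UCS, so it should never appear in real data.
The UCS specification recommends transmitting the character "ZERO WIDTH NO-BREAK SPACE" before the byte stream itself.
If the receiver gets FEFF, the byte stream is big-endian; if it gets FFFE, the byte stream is little-endian. This is why the character "ZERO WIDTH NO-BREAK SPACE" is also called the BOM.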
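The detection rule above can be sketched in Java as follows. This is a minimal illustration that assumes 16-bit code units, so only the first two bytes are inspected; the class name and `detectByteOrder` are hypothetical names chosen for this example.

```java
import java.nio.ByteOrder;

public class Utf16BomByteOrder {

    // Hypothetical helper: returns the byte order signalled by a leading
    // two-byte BOM, or null when the data does not start with one.
    static ByteOrder detectByteOrder(byte[] head) {
        if (head == null || head.length < 2) {
            return null;
        }
        int b0 = head[0] & 0xFF;
        int b1 = head[1] & 0xFF;
        if (b0 == 0xFE && b1 == 0xFF) {
            return ByteOrder.BIG_ENDIAN;    // bytes arrive as FE FF
        }
        if (b0 == 0xFF && b1 == 0xFE) {
            return ByteOrder.LITTLE_ENDIAN; // bytes arrive as FF FE
        }
        return null;                        // no BOM found
    }

    public static void main(String[] args) {
        byte[] bigEndian = {(byte) 0xFE, (byte) 0xFF, 0x00, 0x41};
        byte[] littleEndian = {(byte) 0xFF, (byte) 0xFE, 0x41, 0x00};
        System.out.println(detectByteOrder(bigEndian));
        System.out.println(detectByteOrder(littleEndian));
    }
}
```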

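UTF-8 does not need a BOM to indicate byte order, but a BOM can be used to signal the encoding. The UTF-8 encoding of the character "ZERO WIDTH NO-BREAK SPACE" is EF BB BF, so if a receiver gets a byte stream that starts with EF BB BF, it knows the stream is UTF-8 encoded.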
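To see these byte sequences directly, the short sketch below encodes U+FEFF with the JDK's standard charsets. `BomBytesDemo` and `toHex` are names made up for this illustration.

```java
import java.nio.charset.StandardCharsets;

public class BomBytesDemo {

    // Formats bytes as space-separated uppercase hex, e.g. "EF BB BF".
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String bom = "\uFEFF"; // ZERO WIDTH NO-BREAK SPACE
        System.out.println("UTF-8:    " + toHex(bom.getBytes(StandardCharsets.UTF_8)));
        System.out.println("UTF-16BE: " + toHex(bom.getBytes(StandardCharsets.UTF_16BE)));
        System.out.println("UTF-16LE: " + toHex(bom.getBytes(StandardCharsets.UTF_16LE)));
    }
}
```

Running it prints EF BB BF for UTF-8, FE FF for UTF-16BE, and FF FE for UTF-16LE.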

The BOM header is not required for UTF-8, and I recommend leaving it out to avoid possible compatibility problems.
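One such compatibility problem can be reproduced in Java itself: the built-in UTF-8 decoder does not discard the BOM, so it survives as a leading U+FEFF character in the decoded string and breaks naive comparisons. The sketch below shows this; `stripLeadingBom` is a hypothetical helper written for this example, not a JDK method.

```java
import java.nio.charset.StandardCharsets;

public class BomSurvivesDecoding {

    // Hypothetical helper: drops a single leading U+FEFF, if present.
    static String stripLeadingBom(String text) {
        if (text != null && !text.isEmpty() && text.charAt(0) == '\uFEFF') {
            return text.substring(1);
        }
        return text;
    }

    public static void main(String[] args) {
        // EF BB BF followed by the ASCII text "id"
        byte[] raw = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'i', 'd'};
        String decoded = new String(raw, StandardCharsets.UTF_8);

        System.out.println(decoded.length());                      // 3, not 2
        System.out.println(decoded.equals("id"));                  // false
        System.out.println(stripLeadingBom(decoded).equals("id")); // true
    }
}
```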

Let's look at how to handle the UTF-8 BOM header in Java.

Adding a BOM to a UTF-8 File

import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class AddBomToUtf8File {

  public static void main(String[] args) throws IOException {

      Path path = Paths.get("/home/file.txt");
      writeBomFile(path, "billy");

  }

  private static void writeBomFile(Path path, String content) {
        // Java 8 default UTF-8
        try (BufferedWriter bw = Files.newBufferedWriter(path)) {
            bw.write("\ufeff");
            bw.write(content);
            bw.newLine();
            bw.write(content);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

}

Before Java 8, you can use the following approach instead:

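  private static void writeBomFile2(Path path, String content) {
      // Name the charset explicitly instead of relying on a default
      try (BufferedWriter bw = new BufferedWriter(
              new OutputStreamWriter(
                      new FileOutputStream(path.toFile())
                      , StandardCharsets.UTF_8))) {
          bw.write("\ufeff");
          bw.write(content);
          bw.newLine();
          bw.write(content);
      } catch (IOException e) {
          e.printStackTrace();
      }
  }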

Alternatively, use PrintWriter together with OutputStreamWriter:

  private static void writeBomFile3(Path path, String content) {
      try (PrintWriter pw = new PrintWriter(
              new OutputStreamWriter(
                      new FileOutputStream(path.toFile()), StandardCharsets.UTF_8))) {
          //pw.write("\ufeff");
          pw.write(0xfeff); // alternative: write the BOM as a single char value
          pw.write(content);
          pw.write(System.lineSeparator());
          pw.write(content);

      } catch (IOException e) {
          e.printStackTrace();
      }
  }

Or write the raw bytes directly:

private static void writeBomFile4(Path path, String content) {
      try (FileOutputStream fos = new FileOutputStream(path.toFile())) {

          byte[] BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

          fos.write(BOM);
          fos.write(content.getBytes(StandardCharsets.UTF_8));
          fos.write(System.lineSeparator().getBytes(StandardCharsets.UTF_8));
          fos.write(content.getBytes(StandardCharsets.UTF_8));

      } catch (IOException e) {
          e.printStackTrace();
      }
  }

Checking Whether a File Contains a BOM Header

import org.apache.commons.codec.binary.Hex;

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class CheckBom {

  public static void main(String[] args) throws IOException {

      Path path = Paths.get("/home/file.txt");

      if(isContainBOM(path)){
          System.out.println("Found BOM!");
      }else{
          System.out.println("No BOM.");
      }

  }

  private static boolean isContainBOM(Path path) throws IOException {

      if(Files.notExists(path)){
          throw new IllegalArgumentException("Path: " + path + " does not exist!");
      }

      boolean result = false;

      byte[] bom = new byte[3];
      try(InputStream is = new FileInputStream(path.toFile())){

          // read first 3 bytes of a file.
          is.read(bom);

          // BOM encoded as ef bb bf
          String content = new String(Hex.encodeHex(bom));
          if ("efbbbf".equalsIgnoreCase(content)) {
              result = true;
          }

      }

      return result;
  }

}

The code above requires one dependency:

<dependency>
      <groupId>commons-codec</groupId>
      <artifactId>commons-codec</artifactId>
      <version>1.14</version>
  </dependency>
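If adding commons-codec only for this check feels heavy, the three bytes can also be compared directly. The following is a sketch of that alternative using only the JDK; the class name and `startsWithUtf8Bom` are hypothetical, and the read loop is there because a single `read` call is not guaranteed to fill the buffer.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class BomCheckNoDeps {

    private static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    // Hypothetical helper: true when the data begins with EF BB BF.
    static boolean startsWithUtf8Bom(byte[] head) {
        if (head == null || head.length < UTF8_BOM.length) {
            return false;
        }
        for (int i = 0; i < UTF8_BOM.length; i++) {
            if (head[i] != UTF8_BOM[i]) {
                return false;
            }
        }
        return true;
    }

    // Reads up to three bytes from the file and checks them for the BOM.
    static boolean isContainBOM(Path path) throws IOException {
        byte[] head = new byte[UTF8_BOM.length];
        int total = 0;
        try (InputStream is = Files.newInputStream(path)) {
            while (total < head.length) {
                int n = is.read(head, total, head.length - total);
                if (n < 0) {
                    break; // file is shorter than three bytes
                }
                total += n;
            }
        }
        return total == head.length && startsWithUtf8Bom(head);
    }

    public static void main(String[] args) throws IOException {
        // A temporary file stands in for a real input file.
        Path tmp = Files.createTempFile("bom-check", ".txt");
        try {
            Files.write(tmp, new byte[]{(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'});
            System.out.println(isContainBOM(tmp) ? "Found BOM!" : "No BOM.");
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```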

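Removing the BOM Header from a UTF-8 File

In general, I recommend not using a BOM at all; if it is handled incorrectly it can produce garbled text that is painful to track down.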

import org.apache.commons.codec.binary.Hex;

import java.io.BufferedWriter;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class RemoveBomFromUtf8File {

  public static void main(String[] args) throws IOException {

      Path path = Paths.get("/home/file.txt");
      writeBomFile(path, "billy");
      removeBom(path);

  }

  private static void writeBomFile(Path path, String content) {
      // Java 8 default UTF-8
      try (BufferedWriter bw = Files.newBufferedWriter(path)) {
          bw.write("\ufeff");
          bw.write(content);
          bw.newLine();
          bw.write(content);
      } catch (IOException e) {
          e.printStackTrace();
      }
  }

  private static boolean isContainBOM(Path path) throws IOException {

      if (Files.notExists(path)) {
          throw new IllegalArgumentException("Path: " + path + " does not exist!");
      }

      boolean result = false;

      byte[] bom = new byte[3];
      try (InputStream is = new FileInputStream(path.toFile())) {

          // read 3 bytes of a file.
          is.read(bom);

          // BOM encoded as ef bb bf
          String content = new String(Hex.encodeHex(bom));
          if ("efbbbf".equalsIgnoreCase(content)) {
              result = true;
          }

      }

      return result;
  }

  private static void removeBom(Path path) throws IOException {

      if (isContainBOM(path)) {

          byte[] bytes = Files.readAllBytes(path);

          ByteBuffer bb = ByteBuffer.wrap(bytes);

          System.out.println("Found BOM!");

          byte[] bom = new byte[3];
          // get the first 3 bytes
          bb.get(bom, 0, bom.length);

          // remaining
          byte[] contentAfterFirst3Bytes = new byte[bytes.length - 3];
          bb.get(contentAfterFirst3Bytes, 0, contentAfterFirst3Bytes.length);

          System.out.println("Remove the first 3 bytes, and overwrite the file!");

          // overwrite the file at the same path
          Files.write(path, contentAfterFirst3Bytes);

      } else {
          System.out.println("This file doesn't contain a UTF-8 BOM!");
      }

  }

}
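The version above loads the entire file into memory. For large files, a streaming variant may be preferable. The sketch below is one possible approach, not the method used above: it peeks at the first three bytes through a `PushbackInputStream`, pushes them back when they are not a BOM, copies the rest to a temporary file, and then replaces the original. All class and method names here are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class StreamingBomRemover {

    // Returns a stream positioned just after a leading UTF-8 BOM, if any.
    static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pin = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = 0;
        while (n < head.length) {
            int r = pin.read(head, n, head.length - n);
            if (r < 0) {
                break; // fewer than three bytes available
            }
            n += r;
        }
        boolean hasBom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        if (!hasBom && n > 0) {
            pin.unread(head, 0, n); // not a BOM: give the bytes back
        }
        return pin;
    }

    // Streams the file into a sibling temp file without the BOM, then swaps it in.
    static void removeBom(Path path) throws IOException {
        Path tmp = Files.createTempFile(path.toAbsolutePath().getParent(), "nobom", ".tmp");
        try (InputStream raw = Files.newInputStream(path);
             InputStream in = skipUtf8Bom(raw)) {
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        }
        Files.move(tmp, path, StandardCopyOption.REPLACE_EXISTING);
    }

    // Demo helper: writes content to a temp file, removes the BOM, returns the result.
    static byte[] roundTrip(byte[] content) {
        try {
            Path p = Files.createTempFile("bom-demo", ".txt");
            try {
                Files.write(p, content);
                removeBom(p);
                return Files.readAllBytes(p);
            } finally {
                Files.deleteIfExists(p);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        System.out.println(roundTrip(withBom).length); // 2
    }
}
```

A file without a BOM passes through unchanged, which makes `removeBom` safe to call without a separate check first.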

Copying a UTF-8 File and Adding a BOM

import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class CopyAndAddBomToXmlFile {

    public static void main(String[] args) {
        Path src = Paths.get("src/main/resources/staff.xml");
        Path dest = Paths.get("src/main/resources/staff-bom.xml");
        writeBomFile(src, dest);
    }

    private static void writeBomFile(Path src, Path dest) {

        try (FileOutputStream fos = new FileOutputStream(dest.toFile())) {

            byte[] BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};
            // add BOM
            fos.write(BOM);

            // BOM + src to fos
            Files.copy(src, fos);

        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

