文章目錄
- 無損壓縮算法理論基礎
-
- 資訊熵
- 熵編碼
- 字典編碼
- 綜合通用無損壓縮算法
- 相關常見名詞說明
- java對幾種常見算法實作
-
- Snappy
- deflate算法
- Gzip算法
- huffman算法
- Lz4算法
- Lzo算法
- 使用方式
無損壓縮算法理論基礎
資訊熵
資訊熵是一個數學上頗為抽象的概念,在這裡不妨把資訊熵了解成某種特定資訊的出現機率(離散随機事件的出現機率)。一個系統越是有序,資訊熵就越低;反之,一個系統越是混亂,資訊熵就越高。資訊熵也可以說是系統有序化程度的一個度量。
- 熵編碼:根據消息中每個符号出現的機率,然後通過某種映射用更短的符号替代原來的符号,核心在于提高符号的
- 字典編碼:提取資訊中的重複部分作為字典,然後通過字典和某種映射替代這些重複的部分,核心在于替代重複
熵編碼
- 赫夫曼編碼:根據消息中符号出現的頻率建構出霍夫曼樹,實作頻率高的符号編碼短,然後根據霍夫曼樹得到新的符号替代原來的符号
- 算術編碼:根據消息中符号出現的機率算出整個消息(符号串)的機率,一個滿足(0.0 ≤ n < 1.0)的小數 n ,這個小數n就代表了這個消息
- 區間編碼:根據消息中符号出現的機率把符号串映射到大區間數值中的一段小區間(多個符号多次細分區),用小區間邊緣的數值的唯一字首就可以代表了這個區間對應的消息(效果其實和算術編碼相同)
字典編碼
- RLE(Run-length Encoding)遊程編碼: 個人把他看作一種比較直覺樸素的字典編碼,具體算法就是把字元串中重複出現的多個字元替換為重複次數外加這個字元
- MTF(Move-to-front transform): 通過護“recently used symbols”最近通路過的字元棧表,作為一個動态字典,在編碼消息時,用字元在棧表中的索引序号替代,同時調整棧表中該字元到棧頂,根據“空間局部性”原理可以實作資料壓縮
- LZ77與LZ78: 典型的字典編碼,較早出現并流行的兩種通用壓縮算法。LZ77:通過滑動視窗”slidingwindow”實作動态字典,用前面出現過的字元串作為字典通過映射(與前一個字元串的距離和字元串長度)替代後面重複出現的字元串;LZ78:提前解析輸入資料,生成一個靜态字典
- LZSS: 衍生于LZ77,能檢測到一個替換是否真的減小了檔案大小,以及一些别的優化
- LZW: 衍生于LZ78,優化了字典編碼存儲,但由于專利限制了發展,在GIF中被使用
綜合通用無損壓縮算法
- deflate:先用LZ77(或 LZSS)算法預處理,然後用霍夫曼編碼對壓縮後的 literal、length、distance 編碼優化,如今最流行的通用壓縮算法之一
- bzip2:涉及多種算法,主要流程包括先使用 Run-length Encoding 遊程編碼對原始資料進行處理,然後通過 Burrows-Wheeler Transform 轉換(可逆的處理一段輸入資料使得相同字元連續出現的次數最大化),再用 Move-to-front transform 轉換,然後再次使用Run-length Encoding遊程編碼處理,接下來還會進行霍夫曼編碼以及一系列相關處理,較為複雜,速率劣于DEFLATE但壓縮率更高
- LZMA:實作了LZ77修改版以位(bit)而非位元組(byte)為單元級别的操作,并通過馬可夫鍊實作字典索引,速率和壓縮率優于bzip2,另有多線程優化的版本LZMA2
- Brotli: 基于LZ77算法的一個現代變體,使用了霍夫曼編碼和二階上下文模組化,使用了預定義的120千位元組字典包含超過13000個常用單詞、短語和其他子字元串,預定義的算法可以提升較小檔案的壓縮密度。總體速率接近于DEFLATE且壓縮率接近于LZMA
相關常見名詞說明
- RAR: 商業軟體WinRAR提供的壓縮檔案格式,壓縮算法實作帶專利(可能衍生自LZSS)
- Zip: 一種規範開放的壓縮檔案容器,被多種壓縮軟體實作,相容多種壓縮算法主要為DEFLATE
- GZip: gnu/Linux下的檔案壓縮軟體,提供gz壓縮格式,壓縮算法基于DEFLATE
- 7-Zip: 開源跨平台壓縮軟體,提供7z壓縮格式,壓縮算法主要為Bzip2以及LZMA
java對幾種常見算法實作
Snappy
Google開發的一個非常流行的壓縮算法,基于LZ77的思路編寫的快速資料壓縮與解壓縮
nappy是在谷歌内部生産環境中被許多項目使用的壓縮庫,包括BigTable,MapReduce和RPC等。谷歌表示算法庫針對性能做了調整,而不是針對壓縮比或與其他類似工具的相容性。在Intel酷睿i7處理器上,其單核處理資料流的能力達到250M/s-500M/s。Snappy同時針對64位x86處理器進行了優化,在英特爾酷睿i7處理器單一核心實作了至少250MB/s的壓縮性能和500MB/ s的解壓縮性能。Snappy對于純文字的壓縮率為1.5-1.7,對于HTML是2-4,當然了對于JPEG、PNG和其他已經壓縮過的資料壓縮率為1.0。谷歌強勁吹捧Snappy的魯棒性,稱其是“即使面對損壞或惡意輸入也不會崩潰的設計”,并且在谷歌的生産環境中經過了PB級資料壓縮的考驗而穩定的。
依賴:
<dependency>
<groupId>org.xerial.snappy</groupId>
<artifactId>snappy-java</artifactId>
<version>1.1.7.5</version>
</dependency>
Snappy java實作源碼:
package com.demo.rpc.compress;
import java.io.IOException;
import org.xerial.snappy.Snappy;
/**
* @author: weijie
* @Date: 2020/9/24 14:31
* @Description:Google開發的一個非常流行的壓縮算法,基于LZ77的思路編寫的快速資料壓縮與解壓縮
*
* LZ77算法:如果檔案中有兩塊内容相同的話,那麼隻要知道前一塊的位置和大小,我們就可以确定後一塊的内容
* 是以我們可以用(兩者之間的距離,相同内容的長度)這樣一對資訊,來替換後一對内容。由于(兩者之間的距離,相同
* 内容的長度)這一對資訊的大小,小于被替換内容的大小,是以檔案得到壓縮。
*
* @url: https://blog.csdn.net/zj57356498318/article/details/108248602
*
*
*
*/
public class SnappyCompressor implements Compressor {
public byte[] compress(byte[] array) throws IOException {
if (array == null) {
return null;
}
return Snappy.compress(array);
}
public byte[] unCompress(byte[] array) throws IOException {
if (array == null) {
return null;
}
return Snappy.uncompress(array);
}
}
deflate算法
package com.demo.rpc.compress;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;
public class DeflateCompress {
//deflate解壓縮
public static String unCompress(String inputString){
byte[] bytes = Base64.getDecoder().decode(inputString);
if(bytes == null || bytes.length == 0){
return null;
}
ByteArrayOutputStream out = new ByteArrayOutputStream();
ByteArrayInputStream in = new ByteArrayInputStream(bytes);
try{
InflaterInputStream inflater = new InflaterInputStream(in);
byte[] buffer = new byte[256];
int n;
while((n = inflater.read(buffer)) >= 0){
out.write(buffer, 0, n);
}
return out.toString("utf-8");
}catch (Exception e){
throw new RuntimeException("DeflaterUnCompressError", e);
}
}
public static byte[] compress(byte[] bytes){
ByteArrayOutputStream out = new ByteArrayOutputStream();
DeflaterOutputStream deflaterOutputStream = new DeflaterOutputStream(out);
try {
deflaterOutputStream.write(bytes);
deflaterOutputStream.close();
} catch (IOException e) {
e.printStackTrace();
}
return out.toByteArray();
}
public static byte[] unCompress(byte[] bytes){
ByteArrayOutputStream out = new ByteArrayOutputStream();
ByteArrayInputStream in = new ByteArrayInputStream(bytes);
try {
InflaterInputStream inflater = new InflaterInputStream(in);
byte[] buffer = new byte[256];
int n;
while((n = inflater.read(buffer)) >= 0){
out.write(buffer, 0, n);
}
} catch (IOException e) {
e.printStackTrace();
}
return out.toByteArray();
}
//deflate壓縮
public static String compress(String original){
if(original == null || original.length() == 0){
return null;
}
ByteArrayOutputStream out = new ByteArrayOutputStream();
DeflaterOutputStream deflater ;
try{
deflater = new DeflaterOutputStream(out);
deflater.write(original.getBytes(StandardCharsets.UTF_8));
deflater.close();
return Base64.getEncoder().encodeToString(out.toByteArray());
}catch (Exception e){
throw new RuntimeException("DeflaterCompressError", e);
}
}
}
Gzip算法
package com.demo.rpc.compress;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.util.Base64;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
public class GzipCompress {
private static final String GZIP_ENCODE_UTF_8 = "UTF-8";
//GZip解壓縮
public static String gzipUnCompress(String inputString){
byte[] decode = Base64.getDecoder().decode(inputString);
return unCompress(decode, GZIP_ENCODE_UTF_8);
}
public static String unCompress(byte[] bytes, String encoding){
if(bytes == null || bytes.length == 0){
return null;
}
ByteArrayOutputStream out = new ByteArrayOutputStream();
ByteArrayInputStream in = new ByteArrayInputStream(bytes);
try{
GZIPInputStream ungzip = new GZIPInputStream(in);
byte[] buffer = new byte[256];
int n;
while((n = ungzip.read(buffer)) >= 0){
out.write(buffer, 0, n);
}
return out.toString(encoding);
}catch (Exception e){
throw new RuntimeException("GzipUnCompressError", e);
}
}
//Gzip壓縮
public static String gzipCompress(String original){
return Base64.getEncoder().encodeToString(compress(original, GZIP_ENCODE_UTF_8));
}
public static byte[] compress(String str, String encoding){
if(str == null || str.length() == 0){
return null;
}
ByteArrayOutputStream out = new ByteArrayOutputStream();
GZIPOutputStream gzip ;
try{
gzip = new GZIPOutputStream(out);
gzip.write(str.getBytes(encoding));
gzip.close();
}catch (Exception e){
throw new RuntimeException("GzipCompressError", e);
}
return out.toByteArray();
}
}
huffman算法
package com.demo.rpc.compress;
import java.io.*;
import java.util.*;
/**
* @Date: 2020/9/24 15:14
* @url:https://blog.csdn.net/qq_41966475/article/details/108550909?utm_medium=distribute.pc_relevant.none-task-blog-title-5&spm=1001.2101.3001.4242
*/
public class HuffmanCompress {
//資料的解壓
public byte[] unCompress(Map<Byte, String> huffmanCodes, byte[] huffmanBytes) {
StringBuilder stringBuilder = new StringBuilder();
for (int i = 0; i < huffmanBytes.length; i++) {
byte b = huffmanBytes[i];
boolean flag = (i == huffmanBytes.length - 1);
stringBuilder.append(byteToBitString(!flag, b));
}
System.out.print(stringBuilder);
System.out.println();
Map<String, Byte> map = new HashMap<>();
for (Map.Entry<Byte, String> entry : huffmanCodes.entrySet()) {
map.put(entry.getValue(), entry.getKey());
}
List<Byte> list = new ArrayList<>();
for (int i = 0; i < stringBuilder.length(); ) {
int count = 1;
boolean flag = true;
Byte b = null;
while (flag) {
String key = stringBuilder.substring(i, i + count);
b = map.get(key);
if (b == null) {
count++;
} else {
flag = false;
}
}
list.add(b);
i += count;
}
byte[] b = new byte[list.size()];
for (int i = 0; i < b.length; i++) {
b[i] = list.get(i);
}
return b;
}
//把壓縮的byte數組中的十進制數轉化為2進制數
private String byteToBitString(boolean flag, byte b) {
int temp = b;
if (flag) {
temp |= 256;
}
String str = Integer.toBinaryString(temp);
if (flag) {
return str.substring(str.length() - 8);
} else {
return str;
}
}
//封裝壓縮操作
public byte[] compress(Map<Byte, String> huffmanCodes , byte[] bytes) {
List<Node> nodes = getNodes(bytes);
Node root = creatHuffmanTree(nodes);
getCodes(huffmanCodes, root);
byte[] huffmanCodeBytes = zip(bytes, huffmanCodes);
return huffmanCodeBytes;
}
/**
* @param bytes 原始的字元串對應的數組
* @param huffmanCodes 生成的哈夫曼樹編碼map
* @return 傳回哈夫曼編碼處理後的byte[]
*/
private byte[] zip(byte[] bytes, Map<Byte, String> huffmanCodes) {
StringBuilder builder = new StringBuilder();
for (byte b : bytes) {
builder.append(huffmanCodes.get(b));
}
int len;
if (builder.length() % 8 == 0) {
len = builder.length() / 8;
} else {
len = builder.length() / 8 + 1;
}
byte[] huffmanCodeBytes = new byte[len];
int index = 0;
for (int i = 0; i < builder.length(); i = i + 8) {
String strByte;
if (i + 8 > builder.length()) {
strByte = builder.substring(i);
} else {
strByte = builder.substring(i, i + 8);
}
huffmanCodeBytes[index] = (byte) Integer.parseInt(strByte, 2);
index++;
}
return huffmanCodeBytes;
}
// Map<Byte, String> huffmanCodes = new HashMap<>();
//
// StringBuilder stringBuilder = new StringBuilder();
private Map<Byte, String> getCodes(Map<Byte, String> huffmanCodes, Node root) {
if (root == null) {
return null;
}
getCodes(huffmanCodes, root.left, "0", new StringBuilder());
getCodes(huffmanCodes, root.right, "1", new StringBuilder());
return huffmanCodes;
}
/**
* 将傳入的node節點的所有葉子節點哈夫曼編碼得到,并放入到huffmanCode集合中
*
* @param node 傳入節點
* @param code 路徑,左0右1
* @param stringBuilder 用于拼接路徑
*/
private void getCodes(Map<Byte, String> huffmanCodes, Node node, String code, StringBuilder stringBuilder) {
StringBuilder builder = new StringBuilder(stringBuilder);
builder.append(code);
if (node != null) {
if (node.data == null) {
getCodes(huffmanCodes, node.left, "0", builder);
getCodes(huffmanCodes, node.right, "1", builder);
} else {
huffmanCodes.put(node.data, builder.toString());
}
}
}
/**
* @param bytes 接收位元組數組
* @return 傳回的就算List
*/
private List<Node> getNodes(byte[] bytes) {
List<Node> nodes = new ArrayList<>();
Map<Byte, Integer> counts = new HashMap<>();
for (Byte b : bytes) {
Integer count = counts.get(b);
if (count == null) {
counts.put(b, 1);
} else {
counts.put(b, count + 1);
}
}
for (Map.Entry<Byte, Integer> entry : counts.entrySet()) {
nodes.add(new Node(entry.getKey(), entry.getValue()));
}
return nodes;
}
//通過List建立哈夫曼樹
private Node creatHuffmanTree(List<Node> nodes) {
while (nodes.size() > 1) {
Collections.sort(nodes);
Node leftNode = nodes.get(0);
Node rightNode = nodes.get(1);
Node parent = new Node(null, leftNode.weight + rightNode.weight);
parent.left = leftNode;
parent.right = rightNode;
nodes.remove(leftNode);
nodes.remove(rightNode);
nodes.add(parent);
}
return nodes.get(0);
}
}
//建立節點
class Node implements Comparable<Node> {
Byte data;
int weight;
Node left;
Node right;
public Node(Byte data, int weight) {
this.data = data;
this.weight = weight;
}
@Override
public int compareTo(Node o) {
return this.weight - o.weight;
}
@Override
public String toString() {
return "Node{" +
"data=" + data +
", weight=" + weight +
'}';
}
}
Lz4算法
依賴:
<dependency>
<groupId>org.lz4</groupId>
<artifactId>lz4-java</artifactId>
<version>1.7.1</version>
</dependency>
Lz4算法java實作源碼:
package com.demo.rpc.compress;
import net.jpountz.lz4.LZ4BlockInputStream;
import net.jpountz.lz4.LZ4BlockOutputStream;
import net.jpountz.lz4.LZ4Compressor;
import net.jpountz.lz4.LZ4Factory;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
public class Lz4Compress {
//lz4解壓縮
public static String unCompress(String str){
byte[] decode = Base64.getDecoder().decode(str.getBytes());
ByteArrayOutputStream baos = new ByteArrayOutputStream();
try{
LZ4BlockInputStream lzis = new LZ4BlockInputStream(
new ByteArrayInputStream(decode));
int count;
byte[] buffer = new byte[2048];
while ((count = lzis.read(buffer)) != -1) {
baos.write(buffer, 0, count);
}
lzis.close();
return baos.toString("utf-8");
}catch (Exception e){
throw new RuntimeException("lz4UnCompressError", e);
}
}
public static byte[] unCompress(byte[] bytes){
ByteArrayOutputStream baos = new ByteArrayOutputStream();
try{
LZ4BlockInputStream lzis = new LZ4BlockInputStream(
new ByteArrayInputStream(bytes));
int count;
byte[] buffer = new byte[2048];
while ((count = lzis.read(buffer)) != -1) {
baos.write(buffer, 0, count);
}
lzis.close();
return baos.toByteArray();
}catch (Exception e){
throw new RuntimeException("lz4UnCompressError", e);
}
}
//lz4壓縮
public static String compress(String str){
LZ4Factory factory = LZ4Factory.fastestInstance();
ByteArrayOutputStream byteOutput = new ByteArrayOutputStream();
LZ4Compressor compressor = factory.fastCompressor();
try{
LZ4BlockOutputStream compressedOutput = new LZ4BlockOutputStream(
byteOutput, 2048, compressor);
compressedOutput.write(str.getBytes(StandardCharsets.UTF_8));
compressedOutput.close();
return Base64.getEncoder().encodeToString(byteOutput.toByteArray());
}catch (Exception e){
throw new RuntimeException("lz4CompressError", e);
}
}
public static byte[] compress(byte[] bytes){
LZ4Factory factory = LZ4Factory.fastestInstance();
ByteArrayOutputStream byteOutput = new ByteArrayOutputStream();
LZ4Compressor compressor = factory.fastCompressor();
try{
LZ4BlockOutputStream compressedOutput = new LZ4BlockOutputStream(
byteOutput, 2048, compressor);
compressedOutput.write(bytes);
compressedOutput.close();
return byteOutput.toByteArray();
}catch (Exception e){
throw new RuntimeException("lz4CompressError", e);
}
}
}
Lzo算法
依賴:
<dependency>
<groupId>org.anarres.lzo</groupId>
<artifactId>lzo-core</artifactId>
<version>1.0.6</version>
</dependency>
Lzo算法java實作源碼:
package com.demo.rpc.compress;
import org.anarres.lzo.*;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
public class LzoCompress {
//lzo解壓縮
public static String unCompress(String str){
LzoDecompressor decompressor = LzoLibrary.getInstance()
.newDecompressor(LzoAlgorithm.LZO1X, null);
try{
ByteArrayOutputStream os = new ByteArrayOutputStream();
ByteArrayInputStream is = new ByteArrayInputStream(Base64.getDecoder().decode(str.getBytes(StandardCharsets.UTF_8)));
LzoInputStream lis = new LzoInputStream(is, decompressor);
int count;
byte[] buffer = new byte[256];
while((count = lis.read(buffer)) != -1){
os.write(buffer, 0, count);
}
return os.toString();
}catch (Exception e){
throw new RuntimeException("lzoUnCompressError", e);
}
}
public static byte[] unCompress(byte[] bytes){
LzoDecompressor decompressor = LzoLibrary.getInstance()
.newDecompressor(LzoAlgorithm.LZO1X, null);
try{
ByteArrayOutputStream os = new ByteArrayOutputStream();
ByteArrayInputStream is = new ByteArrayInputStream(bytes);
LzoInputStream lis = new LzoInputStream(is, decompressor);
int count;
byte[] buffer = new byte[256];
while((count = lis.read(buffer)) != -1){
os.write(buffer, 0, count);
}
return os.toByteArray();
}catch (Exception e){
throw new RuntimeException("lzoUnCompressError", e);
}
}
public static byte[] compress(byte[] bytes){
LzoCompressor compressor = LzoLibrary.getInstance().newCompressor(
LzoAlgorithm.LZO1X, null);
ByteArrayOutputStream os = new ByteArrayOutputStream();
LzoOutputStream louts = new LzoOutputStream(os, compressor);
try{
louts.write(bytes);
louts.close();
return os.toByteArray();
}catch (Exception e){
throw new RuntimeException("LzoCompressError", e);
}
}
public static String compress(String str){
LzoCompressor compressor = LzoLibrary.getInstance().newCompressor(
LzoAlgorithm.LZO1X, null);
ByteArrayOutputStream os = new ByteArrayOutputStream();
LzoOutputStream louts = new LzoOutputStream(os, compressor);
try{
louts.write(str.getBytes(StandardCharsets.UTF_8));
louts.close();
return Base64.getEncoder().encodeToString(os.toByteArray());
}catch (Exception e){
throw new RuntimeException("LzoCompressError", e);
}
}
}
使用方式
package com.demo.rpc.compress;
import org.junit.Test;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
public class CompressorTest {
String str = "http://www.baidu.com https://fanyi.baidu.com/ http://www.baidu.com ";
@Test
public void snappyCompress() throws IOException {
SnappyCompressor snappyCompressor = new SnappyCompressor();
byte[] compressed = snappyCompressor.compress(str.getBytes());
System.out.println("壓縮前數組大小: " + str.getBytes().length);
System.out.println("壓縮後數組大小:" + compressed.length);
byte[] unCompressed = snappyCompressor.unCompress(compressed);
System.out.println("原字元串:" + new String(unCompressed));
}
@Test
public void gzipCompress(){
String encode = "utf-8";
byte[] compressed = GzipCompress.compress(str, encode);
String unCompressed = GzipCompress.unCompress(compressed, encode);
System.out.println("壓縮前數組大小:" + str.getBytes().length);
System.out.println("壓縮後數組大小:" + compressed.length);
System.out.println("原字元串:" + new String(unCompressed));
}
@Test
public void deflateCompress(){
byte[] compressed = DeflateCompress.compress(str.getBytes());
byte[] unCompressed = DeflateCompress.unCompress(compressed);
System.out.println("壓縮前數組大小:" + str.getBytes().length);
System.out.println("壓縮後數組大小:" + compressed.length);
System.out.println("原字元串:" + new String(unCompressed));
}
@Test
public void huffmanCompress(){
HuffmanCompress huffmanCompress = new HuffmanCompress();
Map<Byte, String> huffmanCodec = new HashMap<>();
byte[] compressed = huffmanCompress.compress(huffmanCodec, str.getBytes());
byte[] unCompressed = huffmanCompress.unCompress(huffmanCodec, compressed);
System.out.println("壓縮前數組大小:" + str.getBytes().length);
System.out.println("壓縮後數組大小:" + compressed.length);
System.out.println("原字元串:" + new String(unCompressed));
}
@Test
public void lzoCompress(){
byte[] compressed = LzoCompress.compress(str.getBytes());
byte[] unCompressed = LzoCompress.unCompress(compressed);
System.out.println("壓縮前數組大小:" + str.getBytes().length);
System.out.println("壓縮後數組大小:" + compressed.length);
System.out.println("原字元串:" + new String(unCompressed));
}
@Test
public void lz4Compress(){
byte[] compressed = Lz4Compress.compress(str.getBytes());
byte[] unCompressed = Lz4Compress.unCompress(compressed);
System.out.println("壓縮前數組大小:" + str.getBytes().length);
System.out.println("壓縮後數組大小:" + compressed.length);
System.out.println("原字元串:" + new String(unCompressed));
}
}