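Contents
- Theoretical foundations of lossless compression
  - Information entropy
  - Entropy coding
  - Dictionary coding
  - General-purpose lossless compression algorithms
  - Glossary of common terms
- Java implementations of several common algorithms
  - Snappy
  - Deflate
  - Gzip
  - Huffman
  - LZ4
  - LZO
  - Usage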
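Theoretical foundations of lossless compression
Information entropy
Information entropy is a fairly abstract mathematical concept. For this article, think of it as a measure of uncertainty derived from how probable each piece of information (each outcome of a discrete random event) is. The more ordered a system is, the lower its entropy; the more chaotic it is, the higher its entropy. Entropy can therefore be read as a measure of how ordered a system is.
- Entropy coding: uses the probability of each symbol in the message to map symbols to new codes, giving shorter codes to more probable symbols. The core idea is to spend fewer bits per symbol on average.
- Dictionary coding: extracts the repeated parts of the message into a dictionary and replaces each repetition with a reference into that dictionary. The core idea is to replace repetition.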
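To make the idea of entropy concrete, here is a minimal sketch that computes the Shannon entropy of a byte array, H = -Σ p_i · log2(p_i), in bits per byte. The class name `EntropyDemo` and the sample strings are made up for illustration; they are not part of the original article.

```java
import java.nio.charset.StandardCharsets;

class EntropyDemo {
    // Shannon entropy in bits per byte: H = -sum(p_i * log2(p_i))
    static double entropy(byte[] data) {
        if (data.length == 0) {
            return 0.0;
        }
        int[] counts = new int[256];
        for (byte b : data) {
            counts[b & 0xFF]++;
        }
        double h = 0.0;
        for (int c : counts) {
            if (c == 0) {
                continue; // symbols that never occur contribute nothing
            }
            double p = (double) c / data.length;
            h -= p * (Math.log(p) / Math.log(2));
        }
        return h;
    }

    public static void main(String[] args) {
        // One symbol only: fully ordered, lowest entropy
        System.out.println(entropy("aaaaaaaa".getBytes(StandardCharsets.US_ASCII)));
        // Two equally likely symbols: about 1 bit per byte
        System.out.println(entropy("abababab".getBytes(StandardCharsets.US_ASCII)));
        // Four equally likely symbols: about 2 bits per byte
        System.out.println(entropy("abcdabcd".getBytes(StandardCharsets.US_ASCII)));
    }
}
```

The lower the result, the more an entropy coder can hope to save compared with the fixed 8 bits per byte of the raw input.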
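Entropy coding
- Huffman coding: builds a Huffman tree from the frequency of each symbol in the message so that more frequent symbols get shorter codes, then replaces each original symbol with the code read off the tree.
- Arithmetic coding: uses the symbol probabilities to narrow the interval [0.0, 1.0) once per symbol. The final subinterval has a width equal to the probability of the whole message (the symbol string), and any fraction n inside it (0.0 ≤ n < 1.0) represents the message.
- Range coding: uses the symbol probabilities to map the symbol string onto a small subrange of a large integer range, subdividing the range once per symbol. A unique prefix of a value at the edge of that subrange is enough to represent the message. In effect it is equivalent to arithmetic coding.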
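The interval narrowing behind arithmetic coding can be sketched in a few lines. This is only an illustration under simplifying assumptions: it uses `double` arithmetic, which is precise enough for short messages only, whereas practical coders use integer arithmetic with renormalization. The class name `ArithmeticIntervalDemo` and the probability table are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class ArithmeticIntervalDemo {
    // Returns {low, high}: any number in [low, high) identifies the message
    // under the given probability model (iteration order of the map fixes the symbol order).
    static double[] encode(String message, Map<Character, Double> probs) {
        double low = 0.0;
        double high = 1.0;
        for (char c : message.toCharArray()) {
            double width = high - low;
            double cumulative = 0.0;
            boolean found = false;
            for (Map.Entry<Character, Double> e : probs.entrySet()) {
                if (e.getKey() == c) {
                    // Shrink the interval to the slice that belongs to this symbol
                    high = low + width * (cumulative + e.getValue());
                    low = low + width * cumulative;
                    found = true;
                    break;
                }
                cumulative += e.getValue();
            }
            if (!found) {
                throw new IllegalArgumentException("Symbol not in model: " + c);
            }
        }
        return new double[] {low, high};
    }

    public static void main(String[] args) {
        Map<Character, Double> probs = new LinkedHashMap<>();
        probs.put('a', 0.5);
        probs.put('b', 0.25);
        probs.put('c', 0.25);
        double[] interval = encode("ab", probs);
        // The width of the interval equals P(a) * P(b) = 0.125
        System.out.println(interval[0] + " " + interval[1]); // prints "0.25 0.375"
    }
}
```

The more probable the message, the wider the final interval, and the fewer digits are needed to name a number inside it.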
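Dictionary coding
- RLE (run-length encoding): I think of it as an intuitive, bare-bones form of dictionary coding. It replaces each run of a repeated character with the repeat count followed by that character.
- MTF (move-to-front transform): maintains a list of recently used symbols that acts as a dynamic dictionary. Each symbol of the message is encoded as its index in the list, and the symbol is then moved to the front of the list. Because of locality, recently seen symbols get small indices, which makes the output easier to compress.
- LZ77 and LZ78: the classic dictionary coders, two early and widely adopted general-purpose compression algorithms. LZ77 uses a sliding window as a dynamic dictionary: a string that already appeared earlier is replaced by a (distance back to the earlier occurrence, match length) pair. LZ78 instead builds an explicit dictionary of phrases as it parses the input and encodes the data as references into that dictionary.
- LZSS: derived from LZ77. It checks whether a replacement actually makes the output smaller, and adds a few other optimizations.
- LZW: derived from LZ78, with a more compact way of storing dictionary codes. Patent restrictions held back its adoption; it is used in GIF.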
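The two simplest transforms in the list above can be sketched as follows. This is a toy version under stated assumptions: the RLE text format (decimal count followed by the character) only works for inputs that contain no digits, and the MTF table is limited to the lowercase letters 'a' to 'z'. The class name `SimpleTransforms` is made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

class SimpleTransforms {
    // Run-length encoding: "aaabcc" -> "3a1b2c" (assumes the input contains no digits)
    static String rleEncode(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int j = i;
            while (j < s.length() && s.charAt(j) == s.charAt(i)) {
                j++;
            }
            out.append(j - i).append(s.charAt(i));
            i = j;
        }
        return out.toString();
    }

    // Inverse of rleEncode: read a decimal count, then repeat the following character
    static String rleDecode(String s) {
        StringBuilder out = new StringBuilder();
        int count = 0;
        for (char c : s.toCharArray()) {
            if (Character.isDigit(c)) {
                count = count * 10 + (c - '0');
            } else {
                for (int k = 0; k < count; k++) {
                    out.append(c);
                }
                count = 0;
            }
        }
        return out.toString();
    }

    // Move-to-front over 'a'..'z': emit the symbol's current index, then move it to the front
    static List<Integer> mtfEncode(String s) {
        List<Character> table = new ArrayList<>();
        for (char c = 'a'; c <= 'z'; c++) {
            table.add(c);
        }
        List<Integer> out = new ArrayList<>();
        for (char c : s.toCharArray()) {
            int idx = table.indexOf(c);
            if (idx < 0) {
                throw new IllegalArgumentException("Only 'a'..'z' supported: " + c);
            }
            out.add(idx);
            table.remove(idx);
            table.add(0, c);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(rleEncode("aaabcc"));   // prints "3a1b2c"
        System.out.println(mtfEncode("bananaaa")); // prints "[1, 1, 13, 1, 1, 1, 0, 0]"
    }
}
```

Note how MTF turns the repetitive tail of the input into small numbers and zeros. That skew is what a later run-length or entropy coding stage exploits.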
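General-purpose lossless compression algorithms
- DEFLATE: first runs LZ77 (or LZSS) matching, then Huffman-codes the resulting literals, lengths, and distances. It is one of the most widely used general-purpose compression algorithms today.
- bzip2: combines several algorithms. The main pipeline first applies run-length encoding to the raw data, then the Burrows-Wheeler transform (a reversible rearrangement of a block of input that maximizes how often identical characters appear consecutively), then the move-to-front transform, then run-length encoding again, and finally Huffman coding plus a series of related steps. It is fairly complex; it is slower than DEFLATE but achieves a higher compression ratio.
- LZMA: implements a modified LZ77 that works at the bit level rather than the byte level and uses Markov chains to model its dictionary references. It typically compresses better than bzip2 and decompresses faster. LZMA2 is a version with multithreading optimizations.
- Brotli: a modern variant of LZ77 that uses Huffman coding and second-order context modeling. It ships with a predefined dictionary of about 120 KB containing more than 13,000 common words, phrases, and other substrings; this predefined dictionary improves compression density for small files. Overall, its speed is close to DEFLATE and its compression ratio is close to LZMA.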
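All of the algorithms above except bzip2 start from LZ77-style matching, so a toy version of that stage may help. The sketch below is not how DEFLATE or any real library implements it: it uses a brute-force search instead of hash chains, and the class name `Lz77Sketch` and the `window` and `minMatch` parameters are assumptions made for illustration. Following the LZSS idea, it only emits a back-reference when the match is at least `minMatch` characters long.

```java
import java.util.ArrayList;
import java.util.List;

class Lz77Sketch {
    // A token is either a literal (length == 0) or a back-reference (distance, length)
    static final class Token {
        final int distance;
        final int length;
        final char literal;

        Token(int distance, int length, char literal) {
            this.distance = distance;
            this.length = length;
            this.literal = literal;
        }
    }

    static List<Token> encode(String s, int window, int minMatch) {
        List<Token> out = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            int bestLen = 0;
            int bestDist = 0;
            // Brute-force search of the sliding window for the longest match
            for (int j = Math.max(0, i - window); j < i; j++) {
                int len = 0;
                // A match may run past position i (overlap), which encodes runs compactly
                while (i + len < s.length() && s.charAt(j + len) == s.charAt(i + len)) {
                    len++;
                }
                if (len > bestLen) {
                    bestLen = len;
                    bestDist = i - j;
                }
            }
            if (bestLen >= minMatch) {
                out.add(new Token(bestDist, bestLen, '\0'));
                i += bestLen;
            } else {
                out.add(new Token(0, 0, s.charAt(i)));
                i++;
            }
        }
        return out;
    }

    static String decode(List<Token> tokens) {
        StringBuilder sb = new StringBuilder();
        for (Token t : tokens) {
            if (t.length == 0) {
                sb.append(t.literal);
            } else {
                // Copy one character at a time so overlapping references work
                int start = sb.length() - t.distance;
                for (int k = 0; k < t.length; k++) {
                    sb.append(sb.charAt(start + k));
                }
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "http://www.baidu.com https://fanyi.baidu.com/ http://www.baidu.com ";
        List<Token> tokens = encode(s, 64, 3);
        System.out.println(s.length() + " characters -> " + tokens.size() + " tokens");
        System.out.println(decode(tokens).equals(s));
    }
}
```

In DEFLATE, the literals, lengths, and distances produced by this kind of stage are then Huffman-coded.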
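Glossary of common terms
- RAR: the archive format of the commercial software WinRAR. The compression algorithm is proprietary (possibly derived from LZSS).
- Zip: an openly specified archive container implemented by many compression tools. It supports several compression methods, chiefly DEFLATE.
- GZip: the file compression program of GNU/Linux. It provides the gz format, whose compression is based on DEFLATE.
- 7-Zip: open-source, cross-platform compression software that provides the 7z archive format. Its main compression methods are LZMA (and LZMA2) and bzip2.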
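Java implementations of several common algorithms
Snappy
Snappy is a very popular compression algorithm developed by Google: fast compression and decompression built on the ideas of LZ77.
Snappy is a compression library used by many projects in Google's internal production environment, including BigTable, MapReduce, and RPC. Google says the library is tuned for performance rather than for compression ratio or compatibility with similar tools. It is optimized for 64-bit x86 processors: on a single core of an Intel Core i7 it compresses at 250 MB/s or more and decompresses at 500 MB/s or more. Its compression ratio is 1.5-1.7 for plain text and 2-4 for HTML, and of course 1.0 for JPEG, PNG, and other already-compressed data. Google strongly promotes Snappy's robustness, describing it as "designed not to crash in the face of corrupted or malicious input", and says it has proven stable while compressing petabytes of data in Google's production environment.
Dependency:
<dependency>
    <groupId>org.xerial.snappy</groupId>
    <artifactId>snappy-java</artifactId>
    <version>1.1.7.5</version>
</dependency>
Snappy Java implementation: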
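package com.demo.rpc.compress;

import java.io.IOException;
import org.xerial.snappy.Snappy;

/**
 * @author: weijie
 * @Date: 2020/9/24 14:31
 * @Description: A very popular compression algorithm developed by Google; fast compression
 * and decompression built on the ideas of LZ77.
 *
 * LZ77: if a file contains two identical blocks, then knowing the position and size of the
 * earlier block is enough to determine the content of the later one. The later block can
 * therefore be replaced by a (distance between the two, length of the identical content) pair.
 * Because this pair is smaller than the content it replaces, the file gets compressed.
 *
 * @url: https://blog.csdn.net/zj57356498318/article/details/108248602
 */
// Compressor is an interface from the same project; its source is not shown in this article.
public class SnappyCompressor implements Compressor {

    public byte[] compress(byte[] array) throws IOException {
        if (array == null) {
            return null;
        }
        return Snappy.compress(array);
    }

    public byte[] unCompress(byte[] array) throws IOException {
        if (array == null) {
            return null;
        }
        return Snappy.uncompress(array);
    }
}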
Deflate
package com.demo.rpc.compress;

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

public class DeflateCompress {

    // Deflate decompression: Base64 text in, UTF-8 string out
    public static String unCompress(String inputString) {
        byte[] bytes = Base64.getDecoder().decode(inputString);
        if (bytes == null || bytes.length == 0) {
            return null;
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // try-with-resources closes the stream and releases its internal Inflater
        try (InflaterInputStream inflater = new InflaterInputStream(new ByteArrayInputStream(bytes))) {
            byte[] buffer = new byte[256];
            int n;
            while ((n = inflater.read(buffer)) >= 0) {
                out.write(buffer, 0, n);
            }
            return out.toString("utf-8");
        } catch (Exception e) {
            throw new RuntimeException("DeflaterUnCompressError", e);
        }
    }

    // Deflate compression: raw bytes in, compressed bytes out
    public static byte[] compress(byte[] bytes) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // Closing the stream finishes the deflate stream before the bytes are read back
        try (DeflaterOutputStream deflaterOutputStream = new DeflaterOutputStream(out)) {
            deflaterOutputStream.write(bytes);
        } catch (IOException e) {
            // Fail loudly instead of silently returning truncated output
            throw new RuntimeException("DeflaterCompressError", e);
        }
        return out.toByteArray();
    }

    // Deflate decompression: compressed bytes in, raw bytes out
    public static byte[] unCompress(byte[] bytes) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (InflaterInputStream inflater = new InflaterInputStream(new ByteArrayInputStream(bytes))) {
            byte[] buffer = new byte[256];
            int n;
            while ((n = inflater.read(buffer)) >= 0) {
                out.write(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new RuntimeException("DeflaterUnCompressError", e);
        }
        return out.toByteArray();
    }

    // Deflate compression: string in, Base64 text out
    public static String compress(String original) {
        if (original == null || original.length() == 0) {
            return null;
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (DeflaterOutputStream deflater = new DeflaterOutputStream(out)) {
            deflater.write(original.getBytes(StandardCharsets.UTF_8));
        } catch (Exception e) {
            throw new RuntimeException("DeflaterCompressError", e);
        }
        return Base64.getEncoder().encodeToString(out.toByteArray());
    }
}
Gzip
package com.demo.rpc.compress;

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.util.Base64;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipCompress {

    private static final String GZIP_ENCODE_UTF_8 = "UTF-8";

    // Gzip decompression: Base64 text in, UTF-8 string out
    public static String gzipUnCompress(String inputString) {
        byte[] decode = Base64.getDecoder().decode(inputString);
        return unCompress(decode, GZIP_ENCODE_UTF_8);
    }

    public static String unCompress(byte[] bytes, String encoding) {
        if (bytes == null || bytes.length == 0) {
            return null;
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // try-with-resources closes the stream and releases its internal Inflater
        try (GZIPInputStream ungzip = new GZIPInputStream(new ByteArrayInputStream(bytes))) {
            byte[] buffer = new byte[256];
            int n;
            while ((n = ungzip.read(buffer)) >= 0) {
                out.write(buffer, 0, n);
            }
            return out.toString(encoding);
        } catch (Exception e) {
            throw new RuntimeException("GzipUnCompressError", e);
        }
    }

    // Gzip compression: string in, Base64 text out
    public static String gzipCompress(String original) {
        return Base64.getEncoder().encodeToString(compress(original, GZIP_ENCODE_UTF_8));
    }

    public static byte[] compress(String str, String encoding) {
        if (str == null || str.length() == 0) {
            return null;
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // Closing the stream writes the gzip trailer before the bytes are read back
        try (GZIPOutputStream gzip = new GZIPOutputStream(out)) {
            gzip.write(str.getBytes(encoding));
        } catch (Exception e) {
            throw new RuntimeException("GzipCompressError", e);
        }
        return out.toByteArray();
    }
}
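Since the gz format is DEFLATE data wrapped in a small header and trailer, the framing can be checked directly on the output of `GZIPOutputStream`. The sketch below is an illustration only; the class name `GzipFramingDemo` and its helper methods are hypothetical and independent of the `GzipCompress` class above.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class GzipFramingDemo {
    static byte[] gzip(byte[] input) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(input);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    static byte[] gunzip(byte[] compressed) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buffer = new byte[1024];
            int n;
            while ((n = gz.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // True if the data starts with the gzip magic bytes 0x1f 0x8b and compression method 8 (DEFLATE)
    static boolean looksLikeGzip(byte[] data) {
        return data.length >= 3
                && (data[0] & 0xFF) == 0x1f
                && (data[1] & 0xFF) == 0x8b
                && data[2] == 8;
    }

    public static void main(String[] args) {
        byte[] input = "http://www.baidu.com http://www.baidu.com ".getBytes(StandardCharsets.UTF_8);
        byte[] compressed = gzip(input);
        System.out.println(looksLikeGzip(compressed)); // prints "true"
        System.out.println(new String(gunzip(compressed), StandardCharsets.UTF_8));
    }
}
```

The header and trailer add a fixed overhead, so for very short inputs gzip output can be larger than the input.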
Huffman
package com.demo.rpc.compress;

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * @Date: 2020/9/24 15:14
 * @url:https://blog.csdn.net/qq_41966475/article/details/108550909?utm_medium=distribute.pc_relevant.none-task-blog-title-5&spm=1001.2101.3001.4242
 */
public class HuffmanCompress {

    // Decompression. Layout: data bytes followed by one trailer byte that stores
    // how many bits of the last data byte are valid (1-8).
    public byte[] unCompress(Map<Byte, String> huffmanCodes, byte[] huffmanBytes) {
        if (huffmanBytes == null || huffmanBytes.length < 2) {
            return new byte[0];
        }
        int dataLen = huffmanBytes.length - 1;
        int lastBits = huffmanBytes[dataLen];
        StringBuilder stringBuilder = new StringBuilder();
        for (int i = 0; i < dataLen; i++) {
            String bits = byteToBitString(huffmanBytes[i]);
            // Only the low lastBits bits of the final data byte carry code bits
            stringBuilder.append(i == dataLen - 1 ? bits.substring(8 - lastBits) : bits);
        }
        // Invert the code table: bit string -> original byte
        Map<String, Byte> map = new HashMap<>();
        for (Map.Entry<Byte, String> entry : huffmanCodes.entrySet()) {
            map.put(entry.getValue(), entry.getKey());
        }
        List<Byte> list = new ArrayList<>();
        for (int i = 0; i < stringBuilder.length(); ) {
            int count = 1;
            Byte b = null;
            // Grow the candidate code one bit at a time until it matches a table entry
            while (b == null && i + count <= stringBuilder.length()) {
                b = map.get(stringBuilder.substring(i, i + count));
                if (b == null) {
                    count++;
                }
            }
            if (b == null) {
                throw new IllegalArgumentException("Corrupt Huffman data");
            }
            list.add(b);
            i += count;
        }
        byte[] result = new byte[list.size()];
        for (int i = 0; i < result.length; i++) {
            result[i] = list.get(i);
        }
        return result;
    }

    // Convert one byte to its 8-character binary string, keeping leading zeros
    private String byteToBitString(byte b) {
        // Setting bit 8 forces a 9-digit result; dropping the first digit leaves exactly 8 bits
        return Integer.toBinaryString((b & 0xFF) | 0x100).substring(1);
    }

    // Compression entry point: fills huffmanCodes with the code table and returns the packed bytes
    public byte[] compress(Map<Byte, String> huffmanCodes, byte[] bytes) {
        if (bytes == null || bytes.length == 0) {
            return new byte[0];
        }
        List<Node> nodes = getNodes(bytes);
        Node root = creatHuffmanTree(nodes);
        if (root.data != null) {
            // Only one distinct byte: the root is a leaf, so give it a one-bit code
            huffmanCodes.put(root.data, "0");
        } else {
            getCodes(huffmanCodes, root);
        }
        return zip(bytes, huffmanCodes);
    }

    /**
     * @param bytes        the bytes of the original string
     * @param huffmanCodes the generated Huffman code table
     * @return the Huffman-coded bytes plus a trailer byte holding the valid bit count of the last data byte
     */
    private byte[] zip(byte[] bytes, Map<Byte, String> huffmanCodes) {
        StringBuilder builder = new StringBuilder();
        for (byte b : bytes) {
            builder.append(huffmanCodes.get(b));
        }
        int len = (builder.length() + 7) / 8;
        byte[] huffmanCodeBytes = new byte[len + 1];
        int index = 0;
        for (int i = 0; i < builder.length(); i = i + 8) {
            String strByte = builder.substring(i, Math.min(i + 8, builder.length()));
            huffmanCodeBytes[index] = (byte) Integer.parseInt(strByte, 2);
            index++;
        }
        // Without this count, leading zeros of a short final chunk would be lost on decode
        int remainder = builder.length() % 8;
        huffmanCodeBytes[len] = (byte) (remainder == 0 ? 8 : remainder);
        return huffmanCodeBytes;
    }
    private Map<Byte, String> getCodes(Map<Byte, String> huffmanCodes, Node root) {
        if (root == null) {
            return null;
        }
        getCodes(huffmanCodes, root.left, "0", new StringBuilder());
        getCodes(huffmanCodes, root.right, "1", new StringBuilder());
        return huffmanCodes;
    }

    /**
     * Computes the Huffman code of every leaf under the given node and stores it in huffmanCodes.
     *
     * @param node          the current node
     * @param code          the edge taken to reach it: "0" for left, "1" for right
     * @param stringBuilder the path accumulated so far
     */
    private void getCodes(Map<Byte, String> huffmanCodes, Node node, String code, StringBuilder stringBuilder) {
        StringBuilder builder = new StringBuilder(stringBuilder);
        builder.append(code);
        if (node != null) {
            if (node.data == null) {
                // Internal node: keep descending
                getCodes(huffmanCodes, node.left, "0", builder);
                getCodes(huffmanCodes, node.right, "1", builder);
            } else {
                // Leaf: the accumulated path is this byte's code
                huffmanCodes.put(node.data, builder.toString());
            }
        }
    }

    /**
     * @param bytes the input byte array
     * @return one leaf Node per distinct byte, weighted by its number of occurrences
     */
    private List<Node> getNodes(byte[] bytes) {
        List<Node> nodes = new ArrayList<>();
        Map<Byte, Integer> counts = new HashMap<>();
        for (byte b : bytes) {
            counts.merge(b, 1, Integer::sum);
        }
        for (Map.Entry<Byte, Integer> entry : counts.entrySet()) {
            nodes.add(new Node(entry.getKey(), entry.getValue()));
        }
        return nodes;
    }

    // Build the Huffman tree from the list of leaf nodes
    private Node creatHuffmanTree(List<Node> nodes) {
        while (nodes.size() > 1) {
            // Sort ascending by weight, then merge the two lightest nodes
            Collections.sort(nodes);
            Node leftNode = nodes.get(0);
            Node rightNode = nodes.get(1);
            Node parent = new Node(null, leftNode.weight + rightNode.weight);
            parent.left = leftNode;
            parent.right = rightNode;
            nodes.remove(leftNode);
            nodes.remove(rightNode);
            nodes.add(parent);
        }
        return nodes.get(0);
    }
}

// Tree node: leaves carry a byte in data, internal nodes have data == null
class Node implements Comparable<Node> {
    Byte data;
    int weight;
    Node left;
    Node right;

    public Node(Byte data, int weight) {
        this.data = data;
        this.weight = weight;
    }

    @Override
    public int compareTo(Node o) {
        return Integer.compare(this.weight, o.weight);
    }

    @Override
    public String toString() {
        return "Node{" +
                "data=" + data +
                ", weight=" + weight +
                '}';
    }
}
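The implementation above re-sorts the whole node list before every merge. A common alternative is a priority queue, which always hands back the two lightest nodes without a full sort. The sketch below shows that variant and computes only the code length of each character; it is a separate illustration, not a drop-in replacement for `HuffmanCompress`, and the class name `HuffmanLengthsSketch` is made up.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

class HuffmanLengthsSketch {
    private static final class TreeNode {
        final int weight;
        final Character symbol; // null for internal nodes
        final TreeNode left;
        final TreeNode right;

        TreeNode(int weight, Character symbol, TreeNode left, TreeNode right) {
            this.weight = weight;
            this.symbol = symbol;
            this.left = left;
            this.right = right;
        }
    }

    // Returns the Huffman code length (in bits) of every character in the text
    static Map<Character, Integer> codeLengths(String text) {
        Map<Character, Integer> freq = new HashMap<>();
        for (char c : text.toCharArray()) {
            freq.merge(c, 1, Integer::sum);
        }
        PriorityQueue<TreeNode> queue =
                new PriorityQueue<>((a, b) -> Integer.compare(a.weight, b.weight));
        for (Map.Entry<Character, Integer> e : freq.entrySet()) {
            queue.add(new TreeNode(e.getValue(), e.getKey(), null, null));
        }
        Map<Character, Integer> lengths = new HashMap<>();
        if (queue.isEmpty()) {
            return lengths;
        }
        if (queue.size() == 1) {
            // A single distinct symbol still needs one bit per occurrence
            lengths.put(queue.peek().symbol, 1);
            return lengths;
        }
        while (queue.size() > 1) {
            // Merge the two lightest subtrees under a new parent
            TreeNode a = queue.poll();
            TreeNode b = queue.poll();
            queue.add(new TreeNode(a.weight + b.weight, null, a, b));
        }
        collect(queue.poll(), 0, lengths);
        return lengths;
    }

    // The depth of a leaf equals the length of its code
    private static void collect(TreeNode node, int depth, Map<Character, Integer> out) {
        if (node.symbol != null) {
            out.put(node.symbol, depth);
            return;
        }
        collect(node.left, depth + 1, out);
        collect(node.right, depth + 1, out);
    }

    public static void main(String[] args) {
        // 'a' occurs 5 times, 'b' twice, 'c' and 'd' once each
        System.out.println(codeLengths("aaaaabbcd"));
    }
}
```

For this input the most frequent character gets a 1-bit code and the two rarest get 3-bit codes, so the 9 characters need 15 bits instead of 72.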
LZ4
Dependency:
<dependency>
    <groupId>org.lz4</groupId>
    <artifactId>lz4-java</artifactId>
    <version>1.7.1</version>
</dependency>
LZ4 Java implementation:
package com.demo.rpc.compress;

import net.jpountz.lz4.LZ4BlockInputStream;
import net.jpountz.lz4.LZ4BlockOutputStream;
import net.jpountz.lz4.LZ4Compressor;
import net.jpountz.lz4.LZ4Factory;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class Lz4Compress {

    // LZ4 decompression: Base64 text in, UTF-8 string out
    public static String unCompress(String str) {
        byte[] decode = Base64.getDecoder().decode(str);
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        try (LZ4BlockInputStream lzis = new LZ4BlockInputStream(new ByteArrayInputStream(decode))) {
            int count;
            byte[] buffer = new byte[2048];
            while ((count = lzis.read(buffer)) != -1) {
                baos.write(buffer, 0, count);
            }
            return baos.toString("utf-8");
        } catch (Exception e) {
            throw new RuntimeException("lz4UnCompressError", e);
        }
    }

    // LZ4 decompression: compressed bytes in, raw bytes out
    public static byte[] unCompress(byte[] bytes) {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        try (LZ4BlockInputStream lzis = new LZ4BlockInputStream(new ByteArrayInputStream(bytes))) {
            int count;
            byte[] buffer = new byte[2048];
            while ((count = lzis.read(buffer)) != -1) {
                baos.write(buffer, 0, count);
            }
            return baos.toByteArray();
        } catch (Exception e) {
            throw new RuntimeException("lz4UnCompressError", e);
        }
    }

    // LZ4 compression: string in, Base64 text out
    public static String compress(String str) {
        LZ4Factory factory = LZ4Factory.fastestInstance();
        ByteArrayOutputStream byteOutput = new ByteArrayOutputStream();
        LZ4Compressor compressor = factory.fastCompressor();
        // Closing the stream flushes the last block before the bytes are read back
        try (LZ4BlockOutputStream compressedOutput = new LZ4BlockOutputStream(byteOutput, 2048, compressor)) {
            compressedOutput.write(str.getBytes(StandardCharsets.UTF_8));
        } catch (Exception e) {
            throw new RuntimeException("lz4CompressError", e);
        }
        return Base64.getEncoder().encodeToString(byteOutput.toByteArray());
    }

    // LZ4 compression: raw bytes in, compressed bytes out
    public static byte[] compress(byte[] bytes) {
        LZ4Factory factory = LZ4Factory.fastestInstance();
        ByteArrayOutputStream byteOutput = new ByteArrayOutputStream();
        LZ4Compressor compressor = factory.fastCompressor();
        try (LZ4BlockOutputStream compressedOutput = new LZ4BlockOutputStream(byteOutput, 2048, compressor)) {
            compressedOutput.write(bytes);
        } catch (Exception e) {
            throw new RuntimeException("lz4CompressError", e);
        }
        return byteOutput.toByteArray();
    }
}
LZO
Dependency:
<dependency>
    <groupId>org.anarres.lzo</groupId>
    <artifactId>lzo-core</artifactId>
    <version>1.0.6</version>
</dependency>
LZO Java implementation:
package com.demo.rpc.compress;

import org.anarres.lzo.*;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class LzoCompress {

    // LZO decompression: Base64 text in, UTF-8 string out
    public static String unCompress(String str) {
        LzoDecompressor decompressor = LzoLibrary.getInstance()
                .newDecompressor(LzoAlgorithm.LZO1X, null);
        ByteArrayOutputStream os = new ByteArrayOutputStream();
        ByteArrayInputStream is = new ByteArrayInputStream(Base64.getDecoder().decode(str));
        try (LzoInputStream lis = new LzoInputStream(is, decompressor)) {
            int count;
            byte[] buffer = new byte[256];
            while ((count = lis.read(buffer)) != -1) {
                os.write(buffer, 0, count);
            }
            // Decode as UTF-8 explicitly to match compress(String), not the platform default
            return os.toString("utf-8");
        } catch (Exception e) {
            throw new RuntimeException("lzoUnCompressError", e);
        }
    }

    // LZO decompression: compressed bytes in, raw bytes out
    public static byte[] unCompress(byte[] bytes) {
        LzoDecompressor decompressor = LzoLibrary.getInstance()
                .newDecompressor(LzoAlgorithm.LZO1X, null);
        ByteArrayOutputStream os = new ByteArrayOutputStream();
        try (LzoInputStream lis = new LzoInputStream(new ByteArrayInputStream(bytes), decompressor)) {
            int count;
            byte[] buffer = new byte[256];
            while ((count = lis.read(buffer)) != -1) {
                os.write(buffer, 0, count);
            }
            return os.toByteArray();
        } catch (Exception e) {
            throw new RuntimeException("lzoUnCompressError", e);
        }
    }

    // LZO compression: raw bytes in, compressed bytes out
    public static byte[] compress(byte[] bytes) {
        LzoCompressor compressor = LzoLibrary.getInstance().newCompressor(
                LzoAlgorithm.LZO1X, null);
        ByteArrayOutputStream os = new ByteArrayOutputStream();
        // Closing the stream flushes the last block before the bytes are read back
        try (LzoOutputStream louts = new LzoOutputStream(os, compressor)) {
            louts.write(bytes);
        } catch (Exception e) {
            throw new RuntimeException("LzoCompressError", e);
        }
        return os.toByteArray();
    }

    // LZO compression: string in, Base64 text out
    public static String compress(String str) {
        LzoCompressor compressor = LzoLibrary.getInstance().newCompressor(
                LzoAlgorithm.LZO1X, null);
        ByteArrayOutputStream os = new ByteArrayOutputStream();
        try (LzoOutputStream louts = new LzoOutputStream(os, compressor)) {
            louts.write(str.getBytes(StandardCharsets.UTF_8));
        } catch (Exception e) {
            throw new RuntimeException("LzoCompressError", e);
        }
        return Base64.getEncoder().encodeToString(os.toByteArray());
    }
}
Usage
package com.demo.rpc.compress;

import org.junit.Test;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class CompressorTest {

    String str = "http://www.baidu.com https://fanyi.baidu.com/ http://www.baidu.com ";
    // Encode once with an explicit charset instead of relying on the platform default
    byte[] input = str.getBytes(StandardCharsets.UTF_8);

    @Test
    public void snappyCompress() throws IOException {
        SnappyCompressor snappyCompressor = new SnappyCompressor();
        byte[] compressed = snappyCompressor.compress(input);
        System.out.println("Size before compression: " + input.length);
        System.out.println("Size after compression: " + compressed.length);
        byte[] unCompressed = snappyCompressor.unCompress(compressed);
        System.out.println("Decompressed string: " + new String(unCompressed, StandardCharsets.UTF_8));
    }

    @Test
    public void gzipCompress() {
        String encode = "utf-8";
        byte[] compressed = GzipCompress.compress(str, encode);
        String unCompressed = GzipCompress.unCompress(compressed, encode);
        System.out.println("Size before compression: " + input.length);
        System.out.println("Size after compression: " + compressed.length);
        System.out.println("Decompressed string: " + unCompressed);
    }

    @Test
    public void deflateCompress() {
        byte[] compressed = DeflateCompress.compress(input);
        byte[] unCompressed = DeflateCompress.unCompress(compressed);
        System.out.println("Size before compression: " + input.length);
        System.out.println("Size after compression: " + compressed.length);
        System.out.println("Decompressed string: " + new String(unCompressed, StandardCharsets.UTF_8));
    }

    @Test
    public void huffmanCompress() {
        HuffmanCompress huffmanCompress = new HuffmanCompress();
        // The code table is filled by compress() and must be passed to unCompress()
        Map<Byte, String> huffmanCodec = new HashMap<>();
        byte[] compressed = huffmanCompress.compress(huffmanCodec, input);
        byte[] unCompressed = huffmanCompress.unCompress(huffmanCodec, compressed);
        System.out.println("Size before compression: " + input.length);
        System.out.println("Size after compression: " + compressed.length);
        System.out.println("Decompressed string: " + new String(unCompressed, StandardCharsets.UTF_8));
    }

    @Test
    public void lzoCompress() {
        byte[] compressed = LzoCompress.compress(input);
        byte[] unCompressed = LzoCompress.unCompress(compressed);
        System.out.println("Size before compression: " + input.length);
        System.out.println("Size after compression: " + compressed.length);
        System.out.println("Decompressed string: " + new String(unCompressed, StandardCharsets.UTF_8));
    }

    @Test
    public void lz4Compress() {
        byte[] compressed = Lz4Compress.compress(input);
        byte[] unCompressed = Lz4Compress.unCompress(compressed);
        System.out.println("Size before compression: " + input.length);
        System.out.println("Size after compression: " + compressed.length);
        System.out.println("Decompressed string: " + new String(unCompressed, StandardCharsets.UTF_8));
    }
}