【小項目】用Huffman樹實作檔案壓縮并解壓

一、前言

如果你學習資料結構，就一定會學到Huffman樹，而Huffman編碼實際上上就是zip壓縮的核心部分，是以，如果已經學習了Huffman樹，為何不嘗試寫一個壓縮程式出來呢？

如果你沒有學習Huffman樹，那咱們就一起先學習一下Huffman樹吧。

二、Huffman樹壓縮檔案

定義：Huffman樹，又稱為最優二叉樹，是權重路徑長度最短的二叉樹。

建立：

【小項目】用Huffman樹實作檔案壓縮并解壓

這樣建立的樹，保證所有資料成員都在葉子節點上，且數越小，離根節點越遠，越大，離根節點越近，那麼這樣的特點應用于壓縮中是很關鍵的，我們可以讓出現次數少的字元編碼長一些，次數多的字元編碼短一些。接下來我們看看壓縮的步驟吧~

1>統計要壓縮的檔案中字元出現的次數。

周遊一遍檔案，将字元出現的次數統計在一個結構體數組裡，數組裡包含字元，字元出現的次數，對該字元的編碼。

2>用得到的數組建構一個Huffman樹。

因為每次要取最小值，是以這裡考慮建立一個小堆。

3>得到Huffman編碼

怎麼得到呢？向右為1，向左為0，就是這麼簡單，我畫圖示意一下：

【小項目】用Huffman樹實作檔案壓縮并解壓

原本用一個char表示的字元，現在隻占了幾個位，這就是為什麼能将檔案壓縮。

4>向壓縮檔案裡寫入Huffman編碼。

寫入的時候，滿8個位寫進去，如果最後不足8個位，先補齊，解壓的時候要注意，解壓到源檔案字元數的時候停止即可。源檔案的總字元數可以在第一次周遊統計出現的字元個數時統計，還有一種方法就是，仔細觀察Huffman樹就知道，它的根節點的大小，其實就是所有葉子節點相加的和。是以，根節點的大小就是源檔案裡所有字元出現的總次數。

至此，壓縮就結束了。

但是，怎麼解壓縮呢？解壓縮至少也得已知這樣的一顆樹才行啊，是以，我們在壓縮完成後要建立一個配置檔案。

5>建立配置檔案

配置檔案裡要存儲源檔案字元及出現的次數。有了這樣的配置檔案，就可以再次建構Huffman樹！

三、解壓縮

1>讀取配置檔案，重新建構Huffman樹

2>讀取壓縮檔案

由壓縮時的原理可知，此時讀到1指針向右移動，0向左移，到葉子節點停下，将字元還原。不停的循環，直到檔案結束或者總字元數變為0.這裡就能展現出，Huffman壓縮是一種無損的壓縮，如果代碼沒有問題，它會原原本本的還原源檔案。

解壓到這裡成功。可以先使用小檔案測試，若沒有問題則找個大點的檔案，還有各類格式的檔案都拿來壓一壓測一下。

四、我遇到的問題

1>編譯時不通過，一大堆的錯誤，我找了半天！最後發現是一個很簡單的問題，我的Huffman樹使用的是C++模闆實作的，模闆不能分離編譯，而我在壓縮時建立Huffman樹是在另一個檔案中進行的，是以編譯不通過。

解決方法：.h字尾改成.hpp，重新包一下頭檔案ok。

2>檔案的打開方式。這裡打開檔案一定要用二進制形式，"wb","rb".因為二進制打開和文本打開其實是有差別的。文本方式打開，會對‘\n’進行特殊處理，那如果這個字元本身就是'\n'.這就會出現問題，是以使用二進制打開，特點：不進行任何處理，是什麼就是什麼。

3>壓縮後解壓縮的圖檔打不開，經過我反複查找，終于發現是配置檔案裡對‘\0’的處理問題，我在寫配置檔案起初是用一個string把字元和它出現的次數連接配接起來放進去。比如：a,3 這樣帶來的問題是 \0,200 寫的時候是以c字元串的形式寫的，是以遇見'\0'就終止了，那麼在解壓縮的時候就會出問題。

解決方法：先把字元放進去，再把剩下的建構成string對象放進去。

五、源碼

1>Huffman樹

#pragma once
#include<iostream>
#include"Heap.h"
#include"Press.h"
using namespace std;

template<class T>
struct HuffmanTreeNode
{
	typedef HuffmanTreeNode<T> Node;
	T _weight;
	Node* _left;
	Node* _right;
	Node* _parent;
	HuffmanTreeNode(const T& w)
		:_weight(w),
		_left(NULL),
		_right(NULL),
		_parent(NULL)
	{}
};

template<class T>
class HuffmanTree
{
public:
	typedef HuffmanTreeNode<T> Node;
	HuffmanTree()
		:_root(NULL)
	{}
	~HuffmanTree()
	{
		_destory(_root);
	}
	Node* GetRoot()
	{
		return _root;
	}
	template<class T>
	struct Less
	{
		bool operator()(const T& left, const T&right)
		{
			return left->_weight < right->_weight;
		}
	};
	HuffmanTree(T* a, int size,T invalid)   //建構Huffman樹
	{
		
		Heap<Node* , Less<Node*>> hp;   //建小堆
		for (int i = 0; i<size; i++)
		{
			if (a[i] != invalid)
			{
				Node* tmp = new Node(a[i]);
				hp.Push(tmp);
			}
				
		}
		while (hp.Size()>1)
		{		
			Node* left = hp.Top();
			hp.Pop();

			Node* right = hp.Top();
			hp.Pop();

			Node* parent = new Node(left->_weight + right->_weight);
			hp.Push(parent);

			parent->_left = left;
			parent->_right = right;
			left->_parent = parent;
			right->_parent = parent;
		}
		_root = hp.Top();
	}
protected:
	void _destory(Node* root)
	{
		if (NULL == root)
			return;
		_destory(root->_left);
		_destory(root->_right);
		
		delete root;
	}
private:
	Node* _root;
};

2>壓縮和解壓縮

#pragma once

#pragma warning(disable:4996)
#include<cassert>
#include<Windows.h>
#include<string>
#include<iostream>
#include"Huffman.hpp"
typedef long long type;
struct weight    //權值裡應該包含字元出現的次數以及對應的字元和Huffman編碼
{
	unsigned char _ch;
	type _count;
	string _code;

	weight(type count = 0)
		: _ch(0)
		,_count(count)
		, _code("")
	{}
	weight operator+(const weight& w)
	{
		type tmp = _count + w._count;
		return weight(tmp);
	}
	bool operator<(const weight& w)
	{
		return _count < w._count;
	}
	bool operator!=(const weight& w)
	{
		return !(_count == w._count);
	}

};

class HuffmanPress
{
public:
	HuffmanPress()
	{
		for (int i = 0; i < 256; i++)
		{
			_infos[i]._ch = (unsigned char)i;
		}
	}

	bool FilePress(const char* filename)
	{
		//統計出每個字元出現的次數。
		FILE* fOut = fopen(filename, "rb");   
		assert(fOut);
		int ch = fgetc(fOut);
		type charcount = 0;  //統計出字元出現的總次數
		while (ch != EOF)
		{
			if (feof(fOut))
				break;
			_infos[ch]._count++;
			ch = fgetc(fOut);
			charcount++;
		}
		weight invalid(0);
		HuffmanTree<weight> hf(_infos, 256,invalid);    //用得到的權重數組建構一個Huffman樹
		HuffmanTreeNode<weight>* root = hf.GetRoot();
		//得到Huffman編碼
		string code;
		_GetCodeR(root, code);
		//開始壓縮,建立壓縮後的檔案
		string CompressFilename = filename;
		CompressFilename += ".huffman";
		FILE* fIn = fopen(CompressFilename.c_str(), "wb");
		assert(fIn);
		//統計完次數使得檔案指針指向了最後，是以需要使指針指向檔案頭
		fseek(fOut, 0, SEEK_SET);
		//向壓縮檔案裡寫入Huffman編碼
		int pos = 0;
		char value = 0;
		int ch1 = fgetc(fOut);
		while (ch1 != EOF)
		{
			if (feof(fOut))
				break;
			string& code = _infos[ch1]._code;
			for (size_t i = 0; i < code.size(); i++)
			{
				value <<= 1;
				if (code[i] == '1')  //得到二進制的1
				{
					value |= 1;
				}
				if (++pos == 8)   //滿8位寫入檔案
				{
					fputc(value, fIn);
					value = 0;
					pos = 0;
				}
			}
			
			ch1 = fgetc(fOut);
		}
		if (pos)    //最後的編碼不滿足一個位元組
		{
			value =value<<(8 - pos);
			fputc(value, fIn);
		}
		//将字元和字元出現的次數寫進配置檔案，檔案解壓時會用到
		string ConfigFilename = filename;
		ConfigFilename += ".config";
		FILE* fConfig = fopen(ConfigFilename.c_str(), "wb");
		assert(fConfig);
		char countStr[20];  //字元出現的次數
		//先把所有字元出現的總次數寫進配置檔案，為防止超過int範圍，charcount使用的是long long 是以要分兩步寫入
		itoa(charcount >> 32, countStr, 10);  //轉換高位
		fputs(countStr, fConfig);  //寫入
		fputc('\n', fConfig);
		itoa(charcount & 0Xffffffff, countStr, 10); //轉換低位
		fputs(countStr, fConfig);  //寫入
		fputc('\n', fConfig);
		for (int i = 0; i < 256; i++)
		{
			string put;
			if (_infos[i]!=invalid)
			{
				fputc(_infos[i]._ch,fConfig);//必須先把ch放進去，如果把ch作為string的字元最後轉換為C的字元，會導緻'\0'沒有處理
				put.push_back(',');
				itoa(_infos[i]._count, countStr, 10);
				put += countStr;
				fputs(put.c_str(), fConfig);
				fputc('\n', fConfig);
			}
		}
		fclose(fOut);
		fclose(fIn);
		fclose(fConfig);
		return true;
	}
	bool FileUncompress(char* filename)  //這裡給的是壓縮檔案名
	{
		//1.讀取配置檔案
		string ConfigFilename = filename;
		int count = ConfigFilename.rfind('.');
		ConfigFilename = ConfigFilename.substr(0, count);
		string UnCompressname = ConfigFilename + ".unpress"; 
		FILE* fUnCompress = fopen(UnCompressname.c_str(), "wb"); //建立解壓縮檔案
		ConfigFilename += ".config";
		FILE* fconfig = fopen(ConfigFilename.c_str(),"rb");
		assert(fconfig);
		assert(fUnCompress);
		FILE* fpress = fopen(filename, "rb");  //打開壓縮好的檔案
		assert(fpress);

		type charcount = 0; //先讀出字元出現的總次數
		string line;
		_ReadLine(fconfig,line);
		charcount = atoi(line.c_str());
		charcount <<= 32;   
		line.clear();
		_ReadLine(fconfig, line);
		charcount += atoi(line.c_str());
		line.clear();
		while (_ReadLine(fconfig,line))  //檔案結束時feof會傳回0
		{
			if (!line.empty())
			{
				char ch = line[0];
				_infos[(unsigned char)ch]._count = atoi(line.substr(2).c_str());
				line.clear();
			}
			else  //若讀到一個空行，對應的字元為換行符
			{
				line += '\n';
			}
		}
		//2.再次建構Huffman樹
		weight invalid(0);
		HuffmanTree<weight> hf(_infos, 256, invalid);    //用得到的權重數組建構一個Huffman樹
		HuffmanTreeNode<weight>* root = hf.GetRoot();
		HuffmanTreeNode<weight>* cur = root;
		char ch = fgetc(fpress);
		int pos = 8;
		while (1)
		{
			--pos;
			if ((ch >> pos) & 1)
			{
				cur = cur->_right;
			}
			else
			{
				cur = cur->_left;
			}

			if (cur->_left == NULL&&cur->_right == NULL)
			{
				fputc(cur->_weight._ch, fUnCompress);
				cur = root;   //再次從根節點周遊
				charcount--;
			}
			if (pos == 0)
			{
				ch = fgetc(fpress);
				pos = 8;
			}
			if (charcount == 0)  //不讀取壓縮時為了湊夠一個位元組而加進去的比特位
				break;   
		}

		fclose(fconfig);
		fclose(fpress);
		fclose(fUnCompress);
		return true;
	}

protected:
	bool _ReadLine(FILE* filename,string& line)
	{
		assert(filename);
		if (feof(filename))
			return false;
		unsigned char ch = fgetc(filename);

		while (ch != '\n')
		{
			line += ch;
			ch = fgetc(filename);

			if (feof(filename))
				//break;
				return false;
		}
		return true;
	}

	void _GetCodeR(HuffmanTreeNode<weight>* root, string code)
	{
		if (NULL == root)
			return;
		if (root->_left == NULL&& root->_right == NULL)
		{
			_infos[root->_weight._ch]._code = code;
		
		}
		_GetCodeR(root->_left, code + '0');
		_GetCodeR(root->_right, code + '1');

	}
private:
	weight _infos[256];
};

void TestCompress()
{
	HuffmanPress hft;
	int begin = GetTickCount();
//	hft.FilePress("test1.txt");
	//hft.FilePress("git.txt");
//	hft.FilePress("1.jpg");
//	hft.FilePress("8.pdf");
	//hft.FilePress("Input.BIG");
	hft.FilePress("listen.mp3");
	int end = GetTickCount();
	cout << end-begin << endl;
}

void TestUnCompress()
{
	HuffmanPress hf;
	int begin = GetTickCount();
//	hf.FileUncompress("test1.txt.huffman");
//	hf.FileUncompress("1.jpg.huffman");
//	hf.FileUncompress("git.txt.huffman");
//	hf.FileUncompress("8.pdf.huffman");
	//hf.FileUncompress("Input.BIG.huffman");
	hf.FileUncompress("listen.mp3.huffman");
	int end = GetTickCount();
	cout << end - begin << endl;
}

最後，檔案大了之後怎麼對比兩個檔案是否一緻呢？我用的是beyond Compare這個軟體，很友善，能對比各種類型的檔案。

【小項目】用Huffman樹實作檔案壓縮并解壓

一、前言

二、Huffman樹壓縮檔案

三、解壓縮

四、我遇到的問題

五、源碼

繼續閱讀

2021-08-13c++——類之操作符重載

[轉]ISUP信令的IAM消息詳細内容

swmm與lisflood-fp源碼如何一起編譯 CMake指令

Windows下VS開發環境環境安裝工程項目設定關于Debug和Release的提示

一文看懂字元串的加減乘除

C++ 第十五周報告1--《冒泡法排序》

C++實作簡單順序表

筆試面試題目：滑動視窗(二)

交通/城市相關的公開的資料集上學時整理的Xie et al., 2020收集的

資料結構與算法（27）——排序（二）

C經典書籍筆記——C陷阱與缺陷②(文法陷阱之優先級)一、錯誤案列二、優先級規律

GitHub連夜封殺！這份阿裡 10W 字内部 Java 字面試手冊到底有多強？

線性表之順序表的實作

C++判斷素數、求最大公約數代碼判斷一個數是否為素數求兩個數的最大公約數

SequoiaDB巨杉資料庫C++驅動概述

Linux裝置模型（中）之上層容器