什麼是霍夫曼編碼 (Huffman Coding)
是一種用于無損資料壓縮的權編碼算法。由美國計算機科學家David Albert Huffman在1952年發明。
霍夫曼編碼使用變長編碼表澤源符号(如一個字母)進行編碼,其中變長編碼表是通過一種評估來源符号出現幾率的方法得到的,出現幾率高的字母使用較短的編碼,反之出現幾率低的則使用較長的編碼,這便使編碼之後的字元串的平均長度、期望值降低,進而達到無損壓縮資料的目的。
Huffman Coding的作用是什麼?
用于資料壓縮與解壓。
我們知道英文和數字各占1個位元組,中文占1個字元,也是就是2個位元組;
- utf-8編碼中,中文字元占了3個位元組,英文占1個位元組;
- utf-16編碼中,中文字元占了3個位元組,英文占2個位元組;
- utf-32編碼中,所有字元均占4個位元組;
我們再重溫下位元組:
位元組是一種資料量的機關,一個位元組等于8位(8 bit) bit,一個二進制資料0或1,是一bit。
所有的資料所占空間都可以用位元組資料來衡量;例如Java中:
- 一個字元(char)占2個位元組,
- 一個short占2個位元組,
- 一個int占4個位元組,
- 一個float占4個位元組,
- 一個long或double占8個位元組。
代碼的實作
Code Tree, Left Traversal has a value of 0, Right Traversal has a value of 1.
Coal: reduce the code tree,
- step1: take the 2 chars with the lowest frequency
- step2: make a 2 leaf node tree from them, the root node value is a sum of 2 leaves node's frequency
- step3: take the next lowest frequency char, and add it to the tree
Let us understand the algorithm with an example.
package _Algorithm.HuffmanCode
import java.util.*
class HuffmanCoding {
//recursive function to paint the huffman-code through the tree traversal
private fun printCode(root: HuffmanNode?, s: String) {
if (root?.left == null && root?.right == null && Character.isLetter(root?.c!!)) {
println("${root?.c}:$s")
return
}
//if we go left than add "0" to the node
//if we go right than add "1" to the node
printCode(root.left, s + "0")
printCode(root.right, s + "1")
}
fun test() {
val n = 6
val charArray = charArrayOf('a', 'b', 'c', 'd', 'e', 'f')
val charfreq = intArrayOf(5, 9, 12, 13, 16, 45)
val priorityQueue = PriorityQueue<HuffmanNode>(n, MyComparator())
for (i in 0 until n) {
val node = HuffmanNode()
node.c = charArray[i]
node.data = charfreq[i]
node.left = null
node.right = null
priorityQueue.add(node)
}
//create root node
var root: HuffmanNode? = null
while (priorityQueue.size > 1) {
//first min extract
val x = priorityQueue.poll()
//second min extract
val y = priorityQueue.poll()
// to the sum of the frequency of the two nodes
// assigning values to the f node.
val f = HuffmanNode()
f.data = x.data + y.data
f.c = '-'
f.left = x
f.right = y
//make the f node as the root
root = f
priorityQueue.add(f)
}
printCode(root, "")
}
}
class MyComparator : Comparator<HuffmanNode> {
override fun compare(o1: HuffmanNode?, o2: HuffmanNode?): Int {
return o1?.data!! - o2?.data!!
}
}
列印結果
f:0
c:100
d:101
a:1100
b:1101
e:111
壓縮結果
從資料:
val charArray = charArrayOf('a', 'b', 'c', 'd', 'e', 'f')
val charfreq = intArrayOf(5, 9, 12, 13, 16, 45)
我們得出結果:
Finding number of bits without using Huffman:
Total number of characters = sum of frequencies = 100;
1byte = 8bits, so total number of bit = 100*8 = 800;
Using Huffman Encoding result is :
f:0 //code length is 1
c:100 //code length is 3
d:101
a:1100
b:1101
e:111
so total number of bits =
freq(f) * code_length(f) + freq(c) * code_length(c) + freq(d) * code_length(d) + freq(a) * code length(a) +
freq(b) * code_length(b) + freq(e) * code_length(e) =
45*1 + 12*3 + 13*3 + 5*4 + 9*4 + 16*3 = 224
Bits saved: 800-224 = 576.