CE100 Algorithms and Programming II¶
Week-9 (Huffman Coding)¶
Spring Semester, 2021-2022¶
Huffman Coding¶
Outline¶
- Heap Data Structure (Review Week-4)
- Heap Sort (Review Week-4)
- Huffman Coding
Huffman Codes¶
Huffman Codes for Compression¶
- Widely used and very effective for data compression
- Savings of 20% - 90% typical (depending on the characteristics of the data)
- In summary: Huffman’s greedy algorithm uses a table of frequencies of character occurrences to build up an optimal way of representing each character as a binary string.
Binary String Representation - Example¶
- Consider a data file with:
- 100K characters
- Each character is one of \(\{a, b, c, d, e, f\}\)
- Frequency of each character in the file:
- \(\overbrace{a}^{45K},\overbrace{b}^{13K},\overbrace{c}^{12K},\overbrace{d}^{16K},\overbrace{e}^{9K},\overbrace{f}^{5K}\)
- Binary character code: Each character is represented by a unique binary string.
- Intuition:
- Frequent characters \(\Leftrightarrow\) shorter codewords
- Infrequent characters \(\Leftrightarrow\) longer codewords
Binary String Representation - Example¶
- How many total bits needed for fixed-length codewords? \(100K \times 3 = 300K \ bits\)
- How many total bits needed for variable-length(1) codewords? \(45K \times 1 + 13K \times 3 + 12K \times 3 + 16K \times 3 + 9K \times 4 + 5K \times 4 = 224K\)
- How many total bits needed for variable-length(2) codewords? \(45K \times 1 + 13K \times 2 + 12K \times 3 + 16K \times 4 + 9K \times 5 + 5K \times 5 = 241K\)
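The totals above can be recomputed directly from the frequency table; a minimal Python sketch (frequencies in thousands of characters, codeword lengths as in the example):

```python
# Character frequencies in thousands (from the table above).
freq_k = {"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}

# Codeword lengths (bits) under each scheme.
fixed_len = {ch: 3 for ch in freq_k}                          # fixed-length code
var1_len  = {"a": 1, "b": 3, "c": 3, "d": 3, "e": 4, "f": 4}  # variable-length(1)
var2_len  = {"a": 1, "b": 2, "c": 3, "d": 4, "e": 5, "f": 5}  # variable-length(2)

def total_kbits(lengths):
    """Total file size in K bits: sum over characters of frequency x codeword length."""
    return sum(freq_k[ch] * lengths[ch] for ch in freq_k)

print(total_kbits(fixed_len))  # 300
print(total_kbits(var1_len))   # 224
print(total_kbits(var2_len))   # 241
```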
Prefix Codes¶
- Prefix codes: No codeword is also a prefix of some other codeword
- Example: \(a = 0\), \(b = 101\), \(c = 100\), \(d = 111\), \(e = 1101\), \(f = 1100\)
- It can be shown that:
- Optimal data compression is achievable with a prefix code
- In other words, optimality is not lost due to prefix-code restriction.
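Whether a set of codewords satisfies the prefix restriction can be checked mechanically; a minimal sketch, using the example codewords above:

```python
def is_prefix_code(codewords):
    """Return True if no codeword is a prefix of any other codeword."""
    words = sorted(codewords)  # a prefix sorts directly before a word it prefixes
    return all(not nxt.startswith(cur) for cur, nxt in zip(words, words[1:]))

# Codewords of the example variable-length code.
variable_code = {"a": "0", "b": "101", "c": "100", "d": "111", "e": "1101", "f": "1100"}

print(is_prefix_code(variable_code.values()))  # True
print(is_prefix_code(["0", "01", "11"]))       # False: "0" is a prefix of "01"
```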
Prefix Codes: Encoding¶
- Encoding: Concatenate the codewords representing each character of the file
- Example: Encode file "abc" using the codewords above
- \(abc \Rightarrow 0.101.100 \Rightarrow 0101100\)
- Note: "." denotes the concatenation operation. It is just for illustration purposes, and does not exist in the encoded string.
Prefix Codes: Decoding¶
- Decoding is quite simple with a prefix code
- The first codeword in an encoded file is unambiguous
- because no codeword is a prefix of any other
- Decoding algorithm:
- Identify the initial codeword
- Translate it back to the original character
- Remove it from the encoded file
- Repeat the decoding process on the remainder of the encoded file.
Prefix Codes: Decoding - Example¶
- Example: Decode encoded file \(001011101\)
- \(001011101\)
- \(0.01011101\)
- \(0.0.1011101\)
- \(0.0.101.1101\)
- \(aabe\)
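The decoding procedure translates directly into code; a minimal sketch that repeatedly identifies the initial codeword and strips it from the front of the encoded string:

```python
variable_code = {"a": "0", "b": "101", "c": "100", "d": "111", "e": "1101", "f": "1100"}

def decode(bits, code):
    """Repeatedly identify the initial codeword, emit its character, and continue."""
    inverse = {cw: ch for ch, cw in code.items()}
    decoded, buffer = [], ""
    for bit in bits:
        buffer += bit
        if buffer in inverse:          # unambiguous: no codeword is a prefix of another
            decoded.append(inverse[buffer])
            buffer = ""
    if buffer:
        raise ValueError("encoded file ends in the middle of a codeword")
    return "".join(decoded)

print(decode("001011101", variable_code))  # aabe
```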
Prefix Codes¶
- Convenient representation for the prefix code:
- a binary tree whose leaves are the given characters
- Binary codeword for a character is the path from the root to that character in the binary tree
- "\(0\)" means "go to the left child"
- "\(1\)" means "go to the right child"
Binary Tree Representation of Prefix Codes¶
- Weight of an internal node: sum of weights of the leaves in its subtree
- The binary tree corresponding to the fixed-length code
Binary Tree Representation of Prefix Codes¶
- Weight of an internal node: sum of weights of the leaves in its subtree
- The binary tree corresponding to the optimal variable-length code
- An optimal code for a file is always represented by a full binary tree
Full Binary Tree Representation of Prefix Codes¶
- Consider an FBT corresponding to an optimal prefix code
- It has \(|C|\) leaves (external nodes)
- one for each letter of the alphabet, where \(C\) is the alphabet from which the characters are drawn
- Lemma: An FBT with \(|C|\) external nodes has exactly \(|C|-1\) internal nodes
Full Binary Tree Representation of Prefix Codes¶
- Consider an \(FBT\) \(T\), corresponding to a prefix code.
- Notation:
- \(f( c )\): frequency of character \(c\) in the file
- \(d_T( c )\): depth of \(c\)'s leaf in the \(FBT\) \(T\)
- \(B(T)\): the number of bits required to encode the file
- What is the length of the codeword for \(c\)?
- \(d_T( c )\), same as the depth of \(c\) in \(T\)
- How to compute \(B(T)\), cost of tree \(T\)?
- \(B(T) = \sum \limits_{c \in C}^{} f( c )d_T( c )\)
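A direct transcription of this formula, reusing the example frequencies and codewords (the codeword length of \(c\) equals \(d_T( c )\)):

```python
# Frequencies (in thousands) and codewords of the optimal variable-length code.
freq_k = {"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}
variable_code = {"a": "0", "b": "101", "c": "100", "d": "111", "e": "1101", "f": "1100"}

def cost(freq, code):
    """B(T): sum over characters of f(c) * d_T(c); the leaf depth equals the codeword length."""
    return sum(freq[ch] * len(code[ch]) for ch in freq)

print(cost(freq_k, variable_code))  # 224
```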
Cost Computation - Example¶
$$ B(T) = \sum \limits_{c \in C}^{} f( c )d_T( c ) $$
$$ \begin{align*} B(T) =& (45 \times 1) + (12 \times 3) + \\ & (13 \times 3) + (16 \times 3) + \\ & (5 \times 4) + (9 \times 4) \\ =& 224 \end{align*} $$
Prefix Codes¶
- Lemma: Let each internal node \(i\) be labeled with the sum \(w(i)\) of the weights of the leaves in its subtree
- Then
$$ B(T) = \sum \limits_{c \in C}^{} f( c )d_T( c ) = \sum \limits_{i \in I_T}^{} w(i) $$
- where \(I_T\) is the set of internal nodes of \(T\)
- Proof: Consider a leaf node \(c\) with frequency \(f( c )\) and depth \(d_T( c )\)
- Then, \(f( c )\) appears in the weights of \(d_T( c )\) internal nodes
- along the path from \(c\) to the root
- Hence, \(f( c )\) appears \(d_T( c )\) times in the above summation
Cost Computation - Example¶
$$ B(T) = \sum \limits_{i \in I_T}^{} w(i) $$
$$ \begin{align*} B(T) =& 100 + 55 + \\ & 25 + 30 + 14 \\ =& 224 \end{align*} $$
Constructing a Huffman Code¶
- Problem Formulation: For a given character set \(C\), construct an optimal prefix code with the minimum total cost
- Huffman invented a greedy algorithm that constructs an optimal prefix code, called a Huffman code
- The greedy algorithm
- builds the FBT corresponding to the optimal code in a bottom-up manner
- begins with a set of \(|C|\) leaves
- performs a sequence of \(|C|-1\) "merges" to create the final tree
Constructing a Huffman Code¶
- A priority queue \(Q\), keyed on \(f\), is used to identify the two least-frequent objects to merge
- The result of the merger of two objects is a new object
- inserted into the priority queue according to its frequency
- which is the sum of the frequencies of the two objects merged
Constructing a Huffman Code¶
- Priority queue is implemented as a binary heap
- Initiation of \(Q\) (\(\text{BUILD-HEAP}\)): \(O(n)\) time
- \(\text{EXTRACT-MIN}\) & \(\text{INSERT}\) take \(O(\lg n)\) time on \(Q\) with \(n\) objects
- Hence, the \(|C|-1\) merges take \(O(n \lg n)\) time in total
Constructing a Huffman Code¶
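A minimal runnable sketch of the construction (not the course's official pseudocode), using Python's heapq module as the min-priority queue; the nested-tuple tree layout matches the earlier sketch, and the counter is only a tie-breaker so that objects with equal frequencies are never compared by their payloads:

```python
import heapq
from itertools import count

def huffman_tree(freq):
    """Greedy construction: repeatedly merge the two least-frequent objects (|C|-1 merges)."""
    tick = count()   # tie-breaker so equal frequencies never force comparison of the payloads
    heap = [(f, next(tick), ch) for ch, f in freq.items()]
    heapq.heapify(heap)                          # BUILD-HEAP: O(n)
    for _ in range(len(freq) - 1):               # |C| - 1 merges
        f1, _, left = heapq.heappop(heap)        # EXTRACT-MIN: O(lg n)
        f2, _, right = heapq.heappop(heap)       # EXTRACT-MIN: O(lg n)
        heapq.heappush(heap, (f1 + f2, next(tick), (left, right)))   # INSERT: O(lg n)
    return heap[0][2]                            # the remaining object is the root

freq_k = {"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}
print(huffman_tree(freq_k))
# ('a', (('c', 'b'), (('f', 'e'), 'd')))
```

Passing the returned tree to the codewords sketch from the prefix-code section reproduces the example's optimal code table.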
Constructing a Huffman Code - Example¶
- Start with one leaf node for each character
- The \(2\) nodes with the least frequencies: \(f\) & \(e\)
- Merge \(f\) & \(e\) and create an internal node
- Set the internal node frequency to \(5 + 9 = 14\)
Constructing a Huffman Code - Example¶
- The \(2\) nodes with the least frequencies: \(b\) & \(c\)
- Merge \(b\) & \(c\) and create an internal node with frequency \(12 + 13 = 25\)
- Continuing greedily: merge the \(14\) node with \(d\) (frequency \(30\)), then the \(25\) and \(30\) nodes (frequency \(55\)), and finally \(a\) with the \(55\) node, which becomes the root with frequency \(100\)
Correctness Proof of Huffman’s Algorithm¶
- We need to prove:
- The greedy choice property
- The optimal substructure property
- What is the greedy step in Huffman’s algorithm?
- Merging the two characters with the lowest frequencies
- We will first prove the greedy choice property
Greedy Choice Property¶
- Lemma 1: Let \(x \& y\) be two characters in \(C\) having the lowest frequencies.
- Then, \(\exists\) an optimal prefix code for \(C\) in which the codewords for \(x \& y\) have the same length and differ only in the last bit
- Note: If \(x \& y\) are merged in Huffman’s algorithm, their codewords are guaranteed to have the same length and they will differ only in the last bit.
- Lemma 1 states that there exists an optimal solution where this is the case.
Greedy Choice Property - Proof¶
- Outline of the proof:
- Start with an arbitrary optimal solution
- Convert it to an optimal solution that satisfies the greedy choice property
- Proof: Let \(T\) be an arbitrary optimal solution where:
- \(b \& c\) are the sibling leaves with the max depth
- \(x \& y\) are the characters with the lowest frequencies
Greedy Choice Property - Proof¶
- Reminder:
- \(b \& c\) are the nodes with max depth
- \(x \& y\) are the nodes with min freq.
- Without loss of generality, assume:
- \(f( x ) \leq f( y )\)
- \(f( b ) \leq f( c )\)
- Then, it must be the case that:
- \(f( x ) \leq f( b )\)
- \(f( y ) \leq f( c )\)
Greedy Choice Property - Proof¶
- \(T \Rightarrow T'\): exchange the positions of the leaves \(b \& x\)
- \(T' \Rightarrow T''\): exchange the positions of the leaves \(c \& y\)
Greedy Choice Property - Proof¶
- Reminder: Cost of tree \(T'\): $$ B(T') = \sum \limits_{c \in C}^{} f( c )d_{T'}( c ) $$
- How does \(B(T')\) compare to \(B(T)\)?
- Reminder: \(f(x) \leq f(b)\)
- \(d_{T'}(x)=d_T(b)\) and \(d_{T'}(b)=d_T(x)\)
Greedy Choice Property - Proof¶
- Reminder: \(f( x ) \leq f( b )\)
- \(d_{T'}( x )=d_T( b )\) and \(d_{T'}( b )=d_T(x)\)
- The difference in cost between \(T\) and \(T'\):
Greedy Choice Property - Proof¶
$$ \begin{align*} B(T) - B(T') &= \sum \limits_{c \in C}^{} f( c )d_{T}( c ) - \sum \limits_{c \in C}^{} f( c )d_{T'}( c ) \\ &= f[ x ]d_{T}( x ) + f[ b ]d_{T}( b ) - f[ x ]d_{T}( b ) - f[ b ]d_{T}( x ) \\ &= (f[ b ] - f[ x ])(d_{T}( b ) - d_{T}( x )) \end{align*} $$
- Since \(f [ b ] - f [ x ] \geq 0\) and \(d_T( b ) \geq d_T( x )\)
- we have \(B(T) - B(T') \geq 0\), i.e., \(B(T') \leq B(T)\)
- Since \(T\) is optimal, \(B(T') = B(T)\); in other words, \(T'\) is also optimal
Greedy Choice Property - Proof¶
- We can similarly show that
- \(B(T')-B(T'') \geq 0 \Rightarrow B(T'') \leq B(T')\)
- which implies \(B(T'') \leq B(T)\)
- Since \(T\) is optimal \(\Rightarrow B(T'') = B(T) \Rightarrow T''\) is also optimal
- Note: \(T''\) contains our greedy choice:
- Characters \(x \& y\) appear as sibling leaves of max-depth in \(T''\)
- Hence, the proof for the greedy choice property is complete
Greedy-Choice Property of Determining an Optimal Code¶
- Lemma 1 implies that
- process of building an optimal tree
- by mergers can begin with the greedy choice of merging
- those two characters with the lowest frequency
- We have already proved that \(B(T)=\sum \limits_{i \in I_T}w(i)\), that is,
- the total cost of the tree constructed
- is the sum of the costs of its mergers (internal nodes)
- Of all possible mergers at each step, Huffman chooses the one that incurs the least cost
Optimal Substructure Property¶
- Consider an optimal solution \(T\) for alphabet \(C\). Let \(x\) and \(y\) be any two sibling leaf nodes in \(T\). Let \(z\) be the parent node of \(x\) and \(y\) in \(T\).
- Consider the subtree \(T'\) where \(T' = T - \{x, y\}\).
- Here, consider \(z\) as a new character, where
- \(f[z] = f[x] + f[y]\)
- Optimal substructure property: \(T'\) must be optimal for the alphabet \(C'\), where \(C' = C - \{x, y\} \cup \{z\}\)
Optimal Substructure Property - Proof¶
Reminder: $$ B(T) = \sum \limits_{c \in C}^{} f[c]d_{T}( c ) $$
Try to express \(B(T)\) in terms of \(B(T')\).
Note: All characters in \(C'\) have the same depth in \(T\) and \(T'\).
Optimal Substructure Property - Proof¶
Reminder: $$ B(T) = \sum \limits_{c \in C}^{} f[ c ]d_{T}( c ) $$
- For every character \(c \in C - \{x, y\}\): \(d_T( c ) = d_{T'}( c )\), so these terms are identical in \(B(T)\) and \(B(T')\)
- For the leaves \(x\) and \(y\): \(d_T( x ) = d_T( y ) = d_{T'}( z ) + 1\)
- Hence \(f[ x ]d_T( x ) + f[ y ]d_T( y ) = (f[ x ] + f[ y ])(d_{T'}( z ) + 1) = f[ z ]d_{T'}( z ) + f[ x ] + f[ y ]\)
- Therefore: \(B(T) = B(T') + f[ x ] + f[ y ]\)
Optimal Substructure Property - Proof¶
- We want to prove that \(T'\) is optimal for
- \(C' = C – \{x, y\} \cup \{z\}\)
- Assume by contradiction that there exists another solution for \(C'\) with smaller cost than \(T'\). Call this solution \(R'\):
- \(B(R') < B(T')\)
- Let us construct another prefix tree \(R\) by adding \(x \& y\) as children of \(z\) in \(R'\)
Optimal Substructure Property - Proof¶
- Let us construct another prefix tree \(R\) by adding \(x \& y\) as children of \(z\) in \(R'\).
- We have:
- \(B(R) = B(R') + f[ x ] + f[ y ]\)
- In the beginning, we assumed that:
- \(B(R') < B(T')\)
- So, we have:
- \(B(R) < B(T') + f[ x ] + f[ y ] = B(T)\)
- Contradiction, since \(T\) was assumed to be optimal for \(C\). Proof complete.
Greedy Algorithm for Huffman Coding - Summary¶
- For the greedy algorithm, we have proven that:
- The greedy choice property holds.
- The optimal substructure property holds.
- So, the greedy algorithm is optimal.
\(-End-Of-Week-9-Course-Module-\)