CE100 Algorithms and Programming II


Week-9 (Huffman Coding)

Spring Semester, 2021-2022



Huffman Coding

Outline

  • Heap Data Structure (Review Week-4)
  • Heap Sort (Review Week-4)
  • Huffman Coding

Huffman Codes


Huffman Codes for Compression

  • Widely used and very effective for data compression

  • Savings of 20% - 90% typical

    • (depending on the characteristics of the data)
  • In summary: Huffman’s greedy algorithm uses a table of frequencies of character occurrences to build up an optimal way of representing each character as a binary string.


Binary String Representation - Example

  • Consider a data file with:

    • 100K characters
    • Each character is one of $\{a, b, c, d, e, f\}$
  • Frequency of each character in the file:

    • frequency: $\overbrace{a}^{45K}, \overbrace{b}^{13K}, \overbrace{c}^{12K}, \overbrace{d}^{16K}, \overbrace{e}^{9K}, \overbrace{f}^{5K}$
  • Binary character code: Each character is represented by a unique binary string.

  • Intuition:

    • Frequent characters $\Leftrightarrow$ shorter codewords
    • Infrequent characters $\Leftrightarrow$ longer codewords

Binary String Representation - Example

$$\begin{array}{ccccccc}
\text{characters} & a & b & c & d & e & f \\
\text{frequency} & 45K & 13K & 12K & 16K & 9K & 5K \\
\text{fixed-length} & 000 & 001 & 010 & 011 & 100 & 101 \\
\text{variable-length(1)} & 0 & 101 & 100 & 111 & 1101 & 1100 \\
\text{variable-length(2)} & 0 & 10 & 110 & 1110 & 11110 & 11111
\end{array}$$

  • How many total bits needed for fixed-length codewords?
    $100K \times 3 = 300K$ bits
  • How many total bits needed for variable-length(1) codewords?
    $45K \times 1 + 13K \times 3 + 12K \times 3 + 16K \times 3 + 9K \times 4 + 5K \times 4 = 224K$ bits
  • How many total bits needed for variable-length(2) codewords?
    $45K \times 1 + 13K \times 2 + 12K \times 3 + 16K \times 4 + 9K \times 5 + 5K \times 5 = 241K$ bits
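These totals can be checked with a few lines of code. Below is a minimal C++ sketch (not part of the original slides); the frequency and codeword-length arrays are copied from the table above.

```cpp
#include <cstdio>

int main() {
    // Frequencies in thousands of characters (a, b, c, d, e, f), from the table above.
    int freq[]     = {45, 13, 12, 16, 9, 5};
    // Codeword lengths in bits for each scheme in the table.
    int fixedLen[] = {3, 3, 3, 3, 3, 3};
    int var1Len[]  = {1, 3, 3, 3, 4, 4};
    int var2Len[]  = {1, 2, 3, 4, 5, 5};

    long fixedTotal = 0, var1Total = 0, var2Total = 0;
    for (int i = 0; i < 6; ++i) {
        fixedTotal += (long)freq[i] * fixedLen[i];
        var1Total  += (long)freq[i] * var1Len[i];
        var2Total  += (long)freq[i] * var2Len[i];
    }
    // Prints: fixed: 300K, variable(1): 224K, variable(2): 241K
    std::printf("fixed: %ldK, variable(1): %ldK, variable(2): %ldK\n",
                fixedTotal, var1Total, var2Total);
    return 0;
}
```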

Prefix Codes

  • Prefix codes: No codeword is also a prefix of some other codeword
  • Example:

$$\begin{array}{ccccccc}
\text{characters} & a & b & c & d & e & f \\
\text{codeword} & 0 & 101 & 100 & 111 & 1101 & 1100
\end{array}$$

  • It can be shown that:
    • Optimal data compression is achievable with a prefix code
  • In other words, optimality is not lost due to prefix-code restriction.
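As an illustration (not from the original slides), the prefix property is easy to test mechanically: a set of codewords is a prefix code exactly when no codeword is the start of another. A minimal C++ sketch using the example codewords:

```cpp
#include <iostream>
#include <string>
#include <vector>

// Returns true if no codeword is a prefix of another codeword.
bool isPrefixCode(const std::vector<std::string>& codes) {
    for (size_t i = 0; i < codes.size(); ++i)
        for (size_t j = 0; j < codes.size(); ++j)
            // codes[j] starts with codes[i], so codes[i] is a prefix of codes[j].
            if (i != j && codes[j].compare(0, codes[i].size(), codes[i]) == 0)
                return false;
    return true;
}

int main() {
    // Codewords for a..f from the example table.
    std::vector<std::string> codes = {"0", "101", "100", "111", "1101", "1100"};
    std::cout << std::boolalpha << isPrefixCode(codes) << "\n";  // prints true
}
```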

Prefix Codes: Encoding

$$\begin{array}{ccccccc}
\text{characters} & a & b & c & d & e & f \\
\text{codeword} & 0 & 101 & 100 & 111 & 1101 & 1100
\end{array}$$

  • Encoding: Concatenate the codewords representing each character of the file

  • Example: Encode file "abc" using the codewords above

    • $abc \Rightarrow 0.101.100 \Rightarrow 0101100$
  • Note: "." denotes the concatenation operation. It is just for illustration purposes, and does not exist in the encoded string.


Prefix Codes: Decoding

  • Decoding is quite simple with a prefix code
  • The first codeword in an encoded file is unambiguous
    • because no codeword is a prefix of any other
  • Decoding algorithm:
    • Identify the initial codeword
    • Translate it back to the original character
    • Remove it from the encoded file
    • Repeat the decoding process on the remainder of the encoded file.
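The decoding loop above can be sketched as follows (illustrative C++, compile with C++17). Because no codeword is a prefix of another, exactly one codeword matches at each position, so the scan never has to backtrack.

```cpp
#include <iostream>
#include <string>
#include <utility>
#include <vector>

// Decode a bit string by repeatedly stripping the unique matching codeword.
std::string decode(const std::string& bits,
                   const std::vector<std::pair<char, std::string>>& code) {
    std::string out;
    size_t pos = 0;
    while (pos < bits.size()) {
        bool matched = false;
        for (const auto& [ch, word] : code) {
            if (bits.compare(pos, word.size(), word) == 0) {
                out += ch;           // translate the codeword back to its character
                pos += word.size();  // remove it and continue with the remainder
                matched = true;
                break;
            }
        }
        if (!matched) return "<invalid input>";
    }
    return out;
}

int main() {
    // Codeword table from the example above.
    std::vector<std::pair<char, std::string>> code = {
        {'a', "0"}, {'b', "101"}, {'c', "100"},
        {'d', "111"}, {'e', "1101"}, {'f', "1100"}};
    std::cout << decode("001011101", code) << "\n";  // prints aabe
}
```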

Prefix Codes: Decoding - Example

$$\begin{array}{ccccccc}
\text{characters} & a & b & c & d & e & f \\
\text{codeword} & 0 & 101 & 100 & 111 & 1101 & 1100
\end{array}$$

  • Example: Decode the encoded file $001011101$
    • $001011101$
    • $0.01011101$
    • $0.0.1011101$
    • $0.0.101.1101$
    • $aabe$

Prefix Codes

  • Convenient representation for the prefix code:

    • a binary tree whose leaves are the given characters
  • Binary codeword for a character is the path from the
    root to that character in the binary tree

  • "00" means "go to the left child"

  • "11" means "go to the right child"


Binary Tree Representation of Prefix Codes

  • Weight of an internal node: sum of weights of the leaves in its subtree
  • The binary tree corresponding to the fixed-length code



Binary Tree Representation of Prefix Codes

  • Weight of an internal node: sum of weights of the leaves in its subtree

  • The binary tree corresponding to the optimal variable-length code

  • An optimal code for a file is always represented by a full binary tree


Full Binary Tree Representation of Prefix Codes

  • Consider an FBT corresponding to an optimal prefix code

  • It has $|C|$ leaves (external nodes)

  • One for each letter of the alphabet, where $C$ is the alphabet from which the characters are drawn

  • Lemma: An FBT with $|C|$ external nodes has exactly $|C| - 1$ internal nodes


Full Binary Tree Representation of Prefix Codes

  • Consider an FBT $T$ corresponding to a prefix code.
  • Notation:
    • $f(c)$: frequency of character $c$ in the file
    • $d_T(c)$: depth of $c$'s leaf in the FBT $T$
    • $B(T)$: the number of bits required to encode the file
  • What is the length of the codeword for $c$?
    • $d_T(c)$, same as the depth of $c$ in $T$
  • How to compute $B(T)$, the cost of tree $T$?
    • $B(T) = \sum\limits_{c \in C} f(c)\, d_T(c)$

Cost Computation - Example

$$B(T) = \sum\limits_{c \in C} f(c)\, d_T(c)$$

$$\begin{align*}
B(T) &= (45 \times 1) + (12 \times 3) + (13 \times 3) + (16 \times 3) + (5 \times 4) + (9 \times 4) \\
&= 224
\end{align*}$$
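The same computation in a minimal C++ sketch (not from the slides); the frequencies and leaf depths are those of the optimal variable-length tree above:

```cpp
#include <cstdio>

// B(T) = sum over c in C of f(c) * d_T(c), evaluated for the running example.
int main() {
    int f[]  = {45, 13, 12, 16, 9, 5};   // frequencies (in thousands): a, b, c, d, e, f
    int dT[] = {1, 3, 3, 3, 4, 4};       // leaf depths in the optimal variable-length tree
    long B = 0;
    for (int i = 0; i < 6; ++i) B += (long)f[i] * dT[i];
    std::printf("B(T) = %ldK\n", B);     // prints B(T) = 224K
    return 0;
}
```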


Prefix Codes

  • Lemma: Let each internal node $i$ be labeled with $w(i)$, the sum of the weights of the leaves in its subtree.

  • Then

$$B(T) = \sum\limits_{c \in C} f(c)\, d_T(c) = \sum\limits_{i \in I_T} w(i)$$

  • where $I_T$ is the set of internal nodes of $T$

  • Proof: Consider a leaf node $c$ with frequency $f(c)$ and depth $d_T(c)$

    • Then, $f(c)$ appears in the weights of $d_T(c)$ internal nodes
    • along the path from $c$ to the root
    • Hence, $f(c)$ appears $d_T(c)$ times in the above summation

Cost Computation - Example

$$B(T) = \sum\limits_{i \in I_T} w(i)$$

$$\begin{align*}
B(T) &= 100 + 55 + 25 + 30 + 14 \\
&= 224
\end{align*}$$


Constructing a Huffman Code

  • Problem Formulation: For a given character set $C$, construct an optimal prefix code with the minimum total cost

  • Huffman invented a greedy algorithm that constructs an optimal prefix code called a Huffman code

  • The greedy algorithm

    • builds the FBT corresponding to the optimal code in a bottom-up manner
    • begins with a set of $|C|$ leaves
    • performs a sequence of $|C| - 1$ "merges" to create the final tree

Constructing a Huffman Code

  • A priority queue $Q$, keyed on $f$, is used
    to identify the two least-frequent objects to merge

  • The result of the merger of two objects is a new object

    • inserted into the priority queue according to its frequency
    • which is the sum of the frequencies of the two objects merged

Constructing a Huffman Code

  • Priority queue is implemented as a binary heap
  • Initialization of $Q$ ($\text{BUILD-HEAP}$): $O(n)$ time
  • $\text{EXTRACT-MIN}$ & $\text{INSERT}$ each take $O(\lg n)$ time on $Q$ with $n$ objects
  • Hence the $n - 1$ merges cost $O(n \lg n)$ in total, so Huffman's algorithm runs in $O(n \lg n)$ time

Constructing a Huffman Code

$$\begin{align*}
& \text{HUFFMAN}(C) \\
& \quad n \leftarrow |C| \\
& \quad Q \leftarrow \text{BUILD-HEAP}(C) \\
& \quad \mathbf{for} \ i \leftarrow 1 \ \mathbf{to} \ n-1 \ \mathbf{do} \\
& \qquad z \leftarrow \text{ALLOCATE-NODE}() \\
& \qquad x \leftarrow left[z] \leftarrow \text{EXTRACT-MIN}(Q) \\
& \qquad y \leftarrow right[z] \leftarrow \text{EXTRACT-MIN}(Q) \\
& \qquad f[z] \leftarrow f[x] + f[y] \\
& \qquad \text{INSERT}(Q, z) \\
& \quad \mathbf{return} \ \text{EXTRACT-MIN}(Q) \quad \lhd \ \text{one object left in } Q
\end{align*}$$
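The pseudocode translates almost line by line into code. Below is a minimal, illustrative C++ sketch (not part of the original slides) that builds the Huffman tree for the running example, with `std::priority_queue` playing the role of $Q$; the `Node` struct and function names are my own. Ties between equal frequencies may be broken differently than in the figures, so individual codewords can differ from the slides, but the codeword lengths and the total cost $B(T)$ come out the same.

```cpp
#include <cstddef>
#include <iostream>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// A node of the code tree: a leaf stores a character; an internal node stores
// the sum of the frequencies of the leaves in its subtree (its weight).
struct Node {
    long freq;
    char ch;                 // meaningful only for leaves
    Node* left = nullptr;
    Node* right = nullptr;
};

// Comparator so that the priority queue extracts the smallest frequency first.
struct ByFreq {
    bool operator()(const Node* a, const Node* b) const { return a->freq > b->freq; }
};

// HUFFMAN(C): performs |C| - 1 merges of the two least-frequent objects in Q.
Node* huffman(const std::vector<std::pair<char, long>>& C) {
    std::priority_queue<Node*, std::vector<Node*>, ByFreq> Q;   // BUILD-HEAP
    for (const auto& cf : C) Q.push(new Node{cf.second, cf.first});
    for (std::size_t i = 1; i < C.size(); ++i) {
        Node* x = Q.top(); Q.pop();                              // EXTRACT-MIN
        Node* y = Q.top(); Q.pop();                              // EXTRACT-MIN
        Q.push(new Node{x->freq + y->freq, '\0', x, y});         // f[z] = f[x] + f[y]
    }
    return Q.top();                                              // one object left in Q
}

// Left edge appends '0', right edge appends '1'. (Nodes are intentionally
// leaked in this short sketch.)
void printCodes(const Node* t, const std::string& prefix) {
    if (!t->left && !t->right) {
        std::cout << t->ch << ": " << prefix << "\n";
        return;
    }
    printCodes(t->left, prefix + "0");
    printCodes(t->right, prefix + "1");
}

int main() {
    // Frequencies (in thousands) from the running example.
    std::vector<std::pair<char, long>> C = {
        {'a', 45}, {'b', 13}, {'c', 12}, {'d', 16}, {'e', 9}, {'f', 5}};
    printCodes(huffman(C), "");
}
```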


Constructing a Huffman Code - Example

  • Start with one leaf node for each character
  • The 2 nodes with the least frequencies: $f$ & $e$
  • Merge $f$ & $e$ and create an internal node
  • Set the internal node frequency to $5 + 9 = 14$



Constructing a Huffman Code - Example

  • The 2 nodes with the least frequencies: $b$ & $c$



Constructing a Huffman Code - Example



Constructing a Huffman Code - Example


Constructing a Huffman Code - Example


Constructing a Huffman Code - Example


Correctness Proof of Huffman’s Algorithm

  • We need to prove:

    • The greedy choice property
    • The optimal substructure property
  • What is the greedy step in Huffman’s algorithm?

    • Merging the two characters with the lowest frequencies
  • We will first prove the greedy choice property


Greedy Choice Property

  • Lemma 1: Let $x$ & $y$ be two characters in $C$ having the lowest frequencies.
  • Then, $\exists$ an optimal prefix code for $C$ in which the codewords for $x$ & $y$ have the same length and differ only in the last bit
  • Note: If $x$ & $y$ are merged in Huffman's algorithm, their codewords are guaranteed to have the same length and they will differ only in the last bit.
    • Lemma 1 states that there exists an optimal solution where this is the case.

Greedy Choice Property - Proof

  • Outline of the proof:

    • Start with an arbitrary optimal solution
    • Convert it to an optimal solution that satisfies the greedy choice property.
  • Proof: Let $T$ be an arbitrary optimal solution where:

    • $b$ & $c$ are the sibling leaves with the maximum depth
    • $x$ & $y$ are the characters with the lowest frequencies

Greedy Choice Property - Proof

  • Reminder:
    • $b$ & $c$ are the nodes with max depth
    • $x$ & $y$ are the nodes with min freq.
  • Without loss of generality, assume:
    • $f(x) \leq f(y)$
    • $f(b) \leq f(c)$
  • Then, it must be the case that:
    • $f(x) \leq f(b)$
    • $f(y) \leq f(c)$

Greedy Choice Property - Proof

  • $T \Rightarrow T'$: exchange the positions of the leaves $b$ & $x$
  • $T' \Rightarrow T''$: exchange the positions of the leaves $c$ & $y$



Greedy Choice Property - Proof



Greedy Choice Property - Proof

  • Reminder: Cost of tree $T'$

$$B(T') = \sum\limits_{c \in C} f(c)\, d_{T'}(c)$$

  • How does $B(T')$ compare to $B(T)$?
  • Reminder: $f(x) \leq f(b)$
    • $d_{T'}(x) = d_T(b)$ and $d_{T'}(b) = d_T(x)$

Greedy Choice Property - Proof

  • Reminder: $f(x) \leq f(b)$

    • $d_{T'}(x) = d_T(b)$ and $d_{T'}(b) = d_T(x)$
  • The difference in cost between $T$ and $T'$:

$$\begin{align*}
B(T) - B(T') &= \sum\limits_{c \in C} f(c)\, d_{T}(c) - \sum\limits_{c \in C} f(c)\, d_{T'}(c) \\
&= f[x] d_{T}(x) + f[b] d_{T}(b) - f[x] d_{T'}(x) - f[b] d_{T'}(b) \\
&= f[x] d_{T}(x) + f[b] d_{T}(b) - f[x] d_{T}(b) - f[b] d_{T}(x) \\
&= f[b](d_{T}(b) - d_{T}(x)) - f[x](d_{T}(b) - d_{T}(x)) \\
&= (f[b] - f[x])(d_{T}(b) - d_{T}(x))
\end{align*}$$


Greedy Choice Property - Proof

$$B(T) - B(T') = (f[b] - f[x])(d_{T}(b) - d_{T}(x))$$

  • Since $f[b] - f[x] \geq 0$ and $d_T(b) - d_T(x) \geq 0$
    • therefore $B(T) - B(T') \geq 0$, i.e., $B(T') \leq B(T)$
  • In other words, $T'$ is also optimal

Greedy Choice Property - Proof



Greedy Choice Property - Proof

  • We can similarly show that

  • $B(T') - B(T'') \geq 0 \Rightarrow B(T'') \leq B(T')$

    • which implies $B(T'') \leq B(T)$
  • Since $T$ is optimal $\Rightarrow B(T'') = B(T) \Rightarrow T''$ is also optimal

  • Note: $T''$ contains our greedy choice:

    • Characters $x$ & $y$ appear as sibling leaves of maximum depth in $T''$
  • Hence, the proof for the greedy choice property is complete


Greedy-Choice Property of Determining an Optimal Code

  • Lemma 1 implies that the process of building an optimal tree by mergers can begin with the greedy choice of merging the two characters with the lowest frequencies.

  • We have already proved that $B(T) = \sum\limits_{i \in I_T} w(i)$, that is,

    • the total cost of the tree constructed is the sum of the costs of its mergers (internal nodes)
  • At each step, Huffman's algorithm chooses the merger that incurs the least cost


Optimal Substructure Property

  • Consider an optimal solution $T$ for alphabet $C$. Let $x$ and $y$ be any two sibling leaf nodes in $T$. Let $z$ be the parent node of $x$ and $y$ in $T$.
  • Consider the subtree $T'$ where $T' = T - \{x, y\}$.
    • Here, consider $z$ as a new character, where
      • $f[z] = f[x] + f[y]$
  • Optimal substructure property: $T'$ must be optimal for the alphabet $C'$,
    where $C' = C - \{x, y\} \cup \{z\}$

Optimal Substructure Property - Proof

Reminder:

$$B(T) = \sum\limits_{c \in C} f[c]\, d_{T}(c)$$

Try to express $B(T)$ in terms of $B(T')$.

Note: All characters in $C'$ have the same depth in $T$ and $T'$.

$$B(T) = B(T') - cost(z) + cost(x) + cost(y)$$


Optimal Substructure Property - Proof

Reminder:

$$B(T) = \sum\limits_{c \in C} f[c]\, d_{T}(c)$$

$$\begin{align*}
B(T) &= B(T') - cost(z) + cost(x) + cost(y) \\
&= B(T') - f[z] \cdot d_T(z) + f[x] \cdot d_T(x) + f[y] \cdot d_T(y) \\
&= B(T') - f[z] \cdot d_T(z) + (f[x] + f[y])(d_T(z) + 1) \\
&= B(T') - f[z] \cdot d_T(z) + f[z](d_T(z) + 1) \\
&= B(T') + f[z]
\end{align*}$$

since $x$ and $y$ are children of $z$ in $T$:

$$\begin{align*}
d_T(x) &= d_T(z) + 1 \\
d_T(y) &= d_T(z) + 1
\end{align*}$$

Hence:

$$B(T) = B(T') + f[x] + f[y]$$
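As a quick sanity check with the running example: if $x = f$ and $y = e$ (frequencies $5K$ and $9K$), then $B(T) = B(T') + 5 + 9 = B(T') + 14$, i.e., undoing that one merge reduces the cost by exactly the weight of the internal node it created.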


Optimal Substructure Property - Proof

  • We want to prove that $T'$ is optimal for
    • $C' = C - \{x, y\} \cup \{z\}$
  • Assume by contradiction that there exists another solution for $C'$ with smaller cost than $T'$. Call this solution $R'$:
  • $B(R') < B(T')$
  • Let us construct another prefix tree $R$ by adding $x$ & $y$ as children of $z$ in $R'$

$$B(T) = B(T') + f[x] + f[y]$$


Optimal Substructure Property - Proof

  • Let us construct another prefix tree $R$ by adding $x$ & $y$ as children of $z$ in $R'$.
  • We have:
    • $B(R) = B(R') + f[x] + f[y]$
  • In the beginning, we assumed that:
    • $B(R') < B(T')$
  • So, we have:
    • $B(R) < B(T') + f[x] + f[y] = B(T)$

So $B(R) < B(T)$, which contradicts the optimality of $T$. Contradiction! The proof is complete.


Greedy Algorithm for Huffman Coding - Summary

  • For the greedy algorithm, we have proven that:
    • The greedy choice property holds.
    • The optimal substructure property holds.
  • So, the greedy algorithm is optimal.


End of Week-9 Course Module
