Ana içeriğe geç

Week-9 (Huffman Coding)

CE100 Algorithms and Programming II

Week-9 (Huffman Coding)

Spring Semester, 2021-2022

Download DOC-PDF, DOC-DOCX, SLIDE, PPTX


Huffman Coding

Outline

  • Heap Data Structure (Review Week-4)
  • Heap Sort (Review Week-4)
  • Huffman Coding

Huffman Codes


Huffman Codes for Compression

  • Widely used and very effective for data compression
  • Savings of 20% - 90% typical
  • (depending on the characteristics of the data)

  • In summary: Huffman’s greedy algorithm uses a table of frequencies of character occurrences to build up an optimal way of representing each character as a binary string.


Binary String Representation - Example

  • Consider a data file with:
  • 100K characters
  • Each character is one of {a,b,c,d,e,f}
  • Frequency of each character in the file:
  • frequency: a45K,b13K,c12K,d16K,e9K,f5K

  • Binary character code: Each character is represented by a unique binary string.

  • Intuition:
  • Frequent characters shorter codewords
  • Infrequent characters longer codewords

Binary String Representation - Example

charactersabcdeffrequency45K13K12K16K9K5Kfixed-length000001010011100101variable-length(1)010110011111011100variable-length(2)01011011101111011111
  • How many total bits needed for fixed-length codewords? 100K×3=300K bits
  • How many total bits needed for variable-length(1) codewords? 45K×1+13K×3+12K×3+16K×3+9K×4+5K×4=224K
  • How many total bits needed for variable-length(2) codewords? 45K×1+13K×2+12K×3+16K×4+9K×5+5K×5=241K

Prefix Codes

  • Prefix codes: No codeword is also a prefix of some other codeword
  • Example:
charactersabcdefcodeword010110011111011100
  • It can be shown that:
  • Optimal data compression is achievable with a prefix code
  • In other words, optimality is not lost due to prefix-code restriction.

Prefix Codes: Encoding

charactersabcdefcodeword010110011111011100
  • Encoding: Concatenate the codewords representing each character of the file

  • Example: Encode file "abc" using the codewords above

  • abc0.101.1000101100
  • Note: "." denotes the concatenation operation. It is just for illustration purposes, and does not exist in the encoded string.

Prefix Codes: Decoding

  • Decoding is quite simple with a prefix code
  • The first codeword in an encoded file is unambiguous
  • because no codeword is a prefix of any other
  • Decoding algorithm:
  • Identify the initial codeword
  • Translate it back to the original character
  • Remove it from the encoded file
  • Repeat the decoding process on the remainder of the encoded file.

Prefix Codes: Decoding - Example

charactersabcdefcodeword010110011111011100
  • Example: Decode encoded file 001011101
  • 001011101
  • 0.01011101
  • 0.0.1011101
  • 0.0.101.1101
  • 0.0.101.1101
  • aabe

Prefix Codes

  • Convenient representation for the prefix code:
  • a binary tree whose leaves are the given characters

  • Binary codeword for a character is the path from the root to that character in the binary tree

  • "0" means "go to the left child"

  • "1" means "go to the right child"

Binary Tree Representation of Prefix Codes

  • Weight of an internal node: sum of weights of the leaves in its subtree
  • The binary tree corresponding to the fixed-length code

center h:450px


Binary Tree Representation of Prefix Codes

  • Weight of an internal node: sum of weights of the leaves in its subtree
  • The binary tree corresponding to the optimal variable-length code

  • An optimal code for a file is always represented by a full binary tree

bg right:40% h:600px


Full Binary Tree Representation of Prefix Codes

  • Consider an FBT corresponding to an optimal prefix code

  • It has |C| leaves (external nodes)

  • One for each letter of the alphabet where C is the alphabet from which the characters are drawn

  • Lemma: An FBT with |C| external nodes has exactly |C|1 internal nodes


Full Binary Tree Representation of Prefix Codes

  • Consider an FBT T, corresponding to a prefix code.
  • Notation:
  • f(c): frequency of character c in the file
  • dT(c): depth of c's leaf in the FBT T
  • B(T): the number of bits required to encode the file
  • What is the length of the codeword for c?
  • dT(c), same as the depth of c in T
  • How to compute B(T), cost of tree T?
  • B(T)=cCf(c)dT(c)

Cost Computation - Example

B(T)=cCf(c)dT(c)

B(T)=(45×1)+(12×3)+ (13×3)+(16×3)+ (5×4)+(9×4) =224

bg right:50% h:500px


Prefix Codes

  • Lemma: Let each internal node i is labeled with the sum of the weight w(i) of the leaves in its subtree

  • Then

B(T)=cCf(c)dT(c)=iITw(i)

  • where IT is the set of internal nodes of T

  • Proof: Consider a leaf node c with f(c) & dT(c)

  • Then, f(c) appears in the weights of dT(c) internal node
  • along the path from c to the root
  • Hence, f(c) appears dT(c) times in the above summation

Cost Computation - Example

B(T)=iITw(i)

B(T)=100+55+ 25+30+14 =224

bg right:50% h:500px


Constructing a Huffman Code

  • Problem Formulation: For a given character set C, construct an optimal prefix code with the minimum total cost

  • Huffman invented a greedy algorithm that constructs an optimal prefix code called a Huffman code

  • The greedy algorithm

  • builds the FBT corresponding to the optimal code in a bottom-up manner
  • begins with a set of |C| leaves
  • performs a sequence of |C|1 "merges" to create the final tree

Constructing a Huffman Code

  • A priority queue Q, keyed on f, is used to identify the two least-frequent objects to merge

  • The result of the merger of two objects is a new object

  • inserted into the priority queue according to its frequency
  • which is the sum of the frequencies of the two objects merged

Constructing a Huffman Code

  • Priority queue is implemented as a binary heap
  • Initiation of Q (BUILD-HEAP): O(n) time
  • EXTRACT-MIN & INSERT take O(lgn) time on Q with n objects

Constructing a Huffman Code

HUFFMAN(c)n|C|QBUILD-HEAP(c)for i1 to n1 dozALLOCATE-NODE()xleft[z]EXTRACT-MIN(Q)yright[z]EXTRACT-MIN(Q)f[z]f[x]f[y]INSERT(Q,z)return EXTRACT-MIN(Q)one object left in Q

Constructing a Huffman Code - Example

  • Start with one leaf node for each character
  • The 2 nodes with the least frequencies: f&e
  • Merge f&e and create an internal node
  • Set the internal node frequency to 5+9=14

center h:300px


Constructing a Huffman Code - Example

  • The 2 nodes with least frequencies: b&c

center h:400px


Constructing a Huffman Code - Example

center h:350px


Constructing a Huffman Code - Example

bg right:50% h:350px


Constructing a Huffman Code - Example

bg right:50% h:600px


Constructing a Huffman Code - Example

bg right:50% h:600px


Correctness Proof of Huffman’s Algorithm

  • We need to prove:
  • The greedy choice property
  • The optimal substructure property

  • What is the greedy step in Huffman’s algorithm?

  • Merging the two characters with the lowest frequencies

  • We will first prove the greedy choice property


Greedy Choice Property

  • Lemma 1: Let x&y be two characters in C having the lowest frequencies.
  • Then, an optimal prefix code for C in which the codewords for x&y have the same length and differ only in the last bit
  • Note: If x&y are merged in Huffman’s algorithm, their codewords are guaranteed to have the same length and they will differ only in the last bit.
  • Lemma 1 states that there exists an optimal solution where this is the case.

Greedy Choice Property - Proof

  • Outline of the proof:
  • Start with an arbitrary optimal solution
  • Convert it to an optimal solution that satisfies the greedy choice property.

  • Proof: Let T be an arbitrary optimal solution where:

    • b&c are the sibling leaves with the max depth
    • x&y are the characters with the lowest frequencies

Greedy Choice Property - Proof

bg right:50% h:600px

  • Reminder:
  • b&c are the nodes with max depth
  • x&y are the nodes with min freq.
  • Without loss of generality, assume:
  • f(x)f(y)
  • f(b)f(c)
  • Then, it must be the case that:
  • f(x)f(b)
  • f(y)f(c)

Greedy Choice Property - Proof

  • TT: exchange the positions of the leaves b&x
  • TT: exchange the positions of the leaves c&y

center h:400px


Greedy Choice Property - Proof

center h:500px


Greedy Choice Property - Proof

  • Reminder: Cost of tree T B(T)=cCf(c)dT(c)

  • How does B(T) compare to B(T)?

  • Reminder: f(x)f(b)
  • dT(x)=dT(b) and dT(b)=dT(x)

bg right:50% h:500px


Greedy Choice Property - Proof

  • Reminder: f(x)f(b)
  • dT(x)=dT(b) and dT(b)=dT(x)

  • The difference in cost between T and T:

B(T)B(T)=cCf(c)dT(c)cCf(c)dT(c)=f[x]dT(x)+f[b]dT(b)f[x]dT(x)f[b]dT(b)=f[x]dT(x)+f[b]dT(b)f[x]dT(x)f[b]dT(b)=f[b](dT(b)+dT(x))f[x](dT(b)dT(x))=(f[b]f[x])(dT(b)+dT(x))

Greedy Choice Property - Proof

B(T)B(T)=(f[b]f[x])(dT(b)+dT(x))

  • Since f[b]f[x]0 and dT(b)dT(x)
  • therefore B(T)B(T)
  • In other words, T is also optimal

Greedy Choice Property - Proof

center h:450px


Greedy Choice Property - Proof

  • We can similarly show that
  • B(T)B(T)0B(T)B(T)
  • which implies B(T)B(T)
  • Since T is optimal B(T)=B(T)T is also optimal

  • Note: T contains our greedy choice:

  • Characters x&y appear as sibling leaves of max-depth in T
  • Hence, the proof for the greedy choice property is complete

Greedy-Choice Property of Determining an Optimal Code

  • Lemma 1 implies that
  • process of building an optimal tree
  • by mergers can begin with the greedy choice of merging
  • those two characters with the lowest frequency

  • We have already proved that B(T)=iITw(i) , that is,

  • the total cost of the tree constructed
  • is the sum of the costs of its mergers (internal nodes) of all possible mergers
  • At each step Huffman chooses the merger that incurs the least cost

Optimal Substructure Property

  • Consider an optimal solution T for alphabet C. Let x and y be any two sibling leaf nodes in T. Let z be the parent node of x and y in T.
  • Consider the subtree T where T=T{x,y}.
  • Here, consider z as a new character, where
    • f[z]=f[x]+f[y]
  • Optimal substructure property: T must be optimal for the alphabet C, where C=C{x,y}{z}

bg right:30% h:500px


Optimal Substructure Property - Proof

Reminder: B(T)=cCf[c]dT(c)

Try to express B(T) in terms of B(T).

Note: All characters in C have the same depth in T and T.

B(T)=B(T)cost(z)+cost(x)+cost(y)

bg right:30% h:500px


Optimal Substructure Property - Proof

Reminder: B(T)=cCf[c]dT(c)

B(T)=B(T)cost(z)+cost(x)+cost(y)=B(T)f[z].dT(z)+f[x].dT(x)+f[y].dT(y)=B(T)f[z].dT(z)+(f[x]+f[y])(dT(z)+1)=B(T)f[z].dT(z)+f[z](dT(z)+1)=B(T)f[z]
dT(x)=dT(z)+1dT(y)=dT(z)+1
B(T)=B(T)+f[x]+f[y]

bg right:30% h:500px


Optimal Substructure Property - Proof

  • We want to prove that T is optimal for
  • C=C{x,y}{z}
  • Assume by contradiction that that there exists another solution for C with smaller cost than T. Call this solution R:
  • B(R)<B(T)
  • Let us construct another prefix tree R by adding x&y as children of z in R
B(T)=B(T)+f[x]+f[y]

bg right:30% h:500px


Optimal Substructure Property - Proof

  • Let us construct another prefix tree R by adding x&y as children of z in R.
  • We have:
  • B(R)=B(R)+f[x]+f[y]
  • In the beginning, we assumed that:
  • B(Rʹ)<B(T)
  • So, we have:
  • B(R)<B(T)+f[x]+f[y]=B(T)

Contradiction! Proof complete

bg right:30% h:500px


Greedy Algorithm for Huffman Coding - Summary

  • For the greedy algorithm, we have proven that:
  • The greedy choice property holds.
  • The optimal substructure property holds.
  • So, the greedy algorithm is optimal.

References


EndOfWeek9CourseModule


Son Güncelleme: 13 Mayıs 2022
Back to top