 Introduction to Computer Science 2 Lecture 7: Extended binary trees

Prof. Neeraj Suri Brahim Ayari

In advance: Search in binary trees
Binary trees can be regarded as decision trees: each node represents a decision, and its outgoing edges represent the possible outcomes. Searching in such a tree means following a path from the root to a leaf.
[Figure: decision tree with the tests A < 2, B < 5 and C > 7 in its nodes, edges labelled TRUE and FALSE, and leaves X, X2, X3, 3X.]

Extended binary trees
Replace the NULL pointers with special (external) nodes. A binary tree to which external nodes have been added is called an extended binary tree. The data can be stored either in the internal or in the external nodes. The length of the path from the root to a node measures the cost of searching for that node.

External and internal path length
The cost of searching in an extended binary tree with n internal nodes depends on the following parameters:
External path length: the sum of the path lengths from the root to all external nodes $S_i$ ($1 \le i \le n+1$): $Ext_n = \sum_{i=1}^{n+1} \mathrm{depth}(S_i)$
Internal path length: the sum of the path lengths from the root to all internal nodes $K_i$ ($1 \le i \le n$): $Int_n = \sum_{i=1}^{n} \mathrm{depth}(K_i)$
The two are related by $Ext_n = Int_n + 2n$ (proof by induction). Consequently, extended binary trees with a minimal external path length have a minimal internal path length too.

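Example n = 7
External path length: $Ext_n = 25$. Internal path length: $Int_n = 11$.
Check: $Ext_n = Int_n + 2n = 11 + 2 \cdot 7 = 25$.
[Figure: extended binary tree with n = 7 internal nodes; the nodes below the root are labelled with their depths: 1 1 2 2 2 2 3 3 3 3 3 3 4 4.]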
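The two path lengths can be computed with one recursive traversal. The following is a minimal sketch, not taken from the lecture: the tuple representation, the function name `path_lengths` and the concrete tree shape `tree7` are assumptions chosen so that the totals match the example (n = 7, Int = 11, Ext = 25).

```python
# Hypothetical representation (not from the slides): an internal node is a
# pair (left, right); an external node is None, i.e. the former NULL pointer.

def path_lengths(tree, depth=0):
    """Return (Int, Ext, n) for the subtree `tree` whose root is at `depth`."""
    if tree is None:                      # external node: counts for Ext only
        return 0, depth, 0
    left, right = tree
    int_l, ext_l, n_l = path_lengths(left, depth + 1)
    int_r, ext_r, n_r = path_lengths(right, depth + 1)
    return depth + int_l + int_r, ext_l + ext_r, 1 + n_l + n_r

# One possible shape with n = 7 (assumed; the figure's exact shape is unknown).
leaf = (None, None)
tree7 = (((leaf, None), leaf), (leaf, None))

int_n, ext_n, n = path_lengths(tree7)
print(int_n, ext_n, n)                    # 11 25 7
assert ext_n == int_n + 2 * n             # Ext_n = Int_n + 2n
```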

Minimal and maximal length
For a given n, a balanced tree has the minimal internal path length. Example: in a complete tree of height h (counted in levels, so $n = 2^h - 1$), the internal path length is $Int_n = \sum_{i=1}^{h-1} i \cdot 2^i$.
The internal path length becomes maximal if the tree degenerates to a linear list: $Int_n = \sum_{i=1}^{n-1} i = n(n-1)/2$.
Example: h = 4, n = 15: Int = 34, Ext = 16 · 4 = 64.
For comparison: a list with n = 15 nodes has Int = 105 and Ext = 105 + 30 = 135.
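The numbers of this example can be checked directly from the two sums. This is a small sketch; the function names are illustrative and not part of the lecture.

```python
def int_complete(h):
    """Internal path length of a complete tree with h levels (n = 2**h - 1)."""
    return sum(i * 2**i for i in range(1, h))    # 2**i nodes at depth i

def int_list(n):
    """Internal path length of a tree degenerated to a list of n nodes."""
    return n * (n - 1) // 2                      # depths 0, 1, ..., n-1

def ext_from_int(int_len, n):
    """External path length via Ext_n = Int_n + 2n."""
    return int_len + 2 * n

h = 4
n = 2**h - 1                                     # 15
print(int_complete(h), ext_from_int(int_complete(h), n))   # 34 64
print(int_list(n), ext_from_int(int_list(n), n))           # 105 135
```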

Weighted binary trees
Often weights $q_i$ ($1 \le i \le n+1$) are assigned to the external nodes. The weighted external path length is defined as $Ext_w = \sum_{i=1}^{n+1} \mathrm{depth}(S_i) \cdot q_i$.
For weighted binary trees the properties of minimal and maximal path lengths no longer apply. Determining the minimal weighted external path length is an important practical problem.
Example with the weights 3, 8, 15, 25: the balanced tree has $Ext_w = 102$, whereas the tree degenerated to a linear list has $Ext_w = 88$ (less than 102, although it is a linear list).
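The two values 102 and 88 can be recomputed from the definition. In this sketch the tree representation (numbers as external nodes, pairs as internal nodes) and the two tree shapes are assumptions that are consistent with the totals given on the slide.

```python
def ext_w(tree, depth=0):
    """Weighted external path length: sum of depth * weight over external nodes."""
    if isinstance(tree, tuple):                  # internal node (left, right)
        left, right = tree
        return ext_w(left, depth + 1) + ext_w(right, depth + 1)
    return depth * tree                          # external node holding weight q_i

balanced = ((3, 8), (15, 25))                    # every weight at depth 2
linear = (25, (15, (8, 3)))                      # heavy weights close to the root

print(ext_w(balanced))                           # 102
print(ext_w(linear))                             # 88
```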

Application example: optimal codes
To convert a text file efficiently into a bit string, there are two alternatives:
Fixed-length coding: every character is coded with the same number of bits (e.g., ASCII).
Variable-length coding: some characters are coded with fewer bits than others.
Example of fixed-length coding: a 3-bit code for the alphabet A, B, C, D: A = 001, B = 010, C = 011, D = 100.
The message ABBAABCDADA is converted to 001 010 010 001 001 010 011 100 001 100 001 (length 33 bits). With a 2-bit code the same message can be coded with only 22 bits.
To decode the message, split it into groups of 3 bits (respectively 2 bits) and look up each group in a table that maps codes to characters.
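Fixed-length encoding and table-based decoding can be sketched as follows. The 3-bit table is the one from the slide; the 2-bit table `CODE2` is an assumption, since the slide only states that a 2-bit code exists.

```python
CODE3 = {"A": "001", "B": "010", "C": "011", "D": "100"}   # from the slide
CODE2 = {"A": "00", "B": "01", "C": "10", "D": "11"}       # assumed 2-bit table

def encode_fixed(message, table):
    """Concatenate the code word of every character."""
    return "".join(table[ch] for ch in message)

def decode_fixed(bits, table, width):
    """Split into groups of `width` bits and look each group up in the table."""
    inverse = {code: ch for ch, code in table.items()}
    return "".join(inverse[bits[i:i + width]] for i in range(0, len(bits), width))

message = "ABBAABCDADA"
print(len(encode_fixed(message, CODE3)))         # 33
print(len(encode_fixed(message, CODE2)))         # 22
assert decode_fixed(encode_fixed(message, CODE3), CODE3, 3) == message
```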

Application example: optimal codes (2)
Idea: more frequently used characters are coded with fewer bits.
    Character  Frequency  Code
    A          5          0
    B          3          10
    C          1          111
    D          2          110
Message: ABBAABCDADA. Coding: 01010001011111001100. Length: 20 bits!
Variable-length coding can reduce the memory needed to store the file. How can such a code be found, and why is the decoding unique?

Application example: optimal codes (3)
The frequencies and the code can be represented as a weighted binary tree.
First of all, decoding. Given a bit string, use the successive bits to traverse the tree starting from the root (0 = left, 1 = right). When you arrive at an external node, output the character stored there and start again at the root.
Example:
1. Bit = 0: external node, A
2. Bit = 1: from the root to the right
3. Bit = 0: left, external node, B
4. Bit = 1: from the root to the right
5. Bit = 1: right ...
[Figure: weighted binary tree with labelled edges and the external nodes A (5), B (3), D (2), C (1).]
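The traversal described above can be sketched in a few lines. The nested-tuple tree `TREE` is an assumed encoding of the figure (0 = left, 1 = right, A directly below the root, then B, then D and C), and the function name `decode` is illustrative.

```python
# Assumed encoding of the code tree: an internal node is (left, right),
# an external node is the character itself.
TREE = ("A", ("B", ("D", "C")))

def decode(bits, root):
    """Walk from the root, one bit per edge; emit a character at each external node."""
    out = []
    node = root
    for bit in bits:
        node = node[0] if bit == "0" else node[1]
        if isinstance(node, str):                # external node reached
            out.append(node)
            node = root                          # restart at the root
    if node is not root:
        raise ValueError("bit string ends in the middle of a code word")
    return "".join(out)

print(decode("01010001011111001100", TREE))      # ABBAABCDADA
```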

Correctness condition
Observation: in a variable-length code, the code of one character must not be a prefix of the code of any other character. If the code is represented as an extended binary tree, uniqueness is guaranteed (exactly one character per external node). If the frequencies of the characters in the original text are taken as the weights of the external nodes, then a tree with minimal weighted external path length yields an optimal code. How is a tree with minimal external path length generated?
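The prefix condition can be tested mechanically. The sketch below (function name assumed) sorts the code words; if any word is a prefix of another, it is also a prefix of its immediate successor in sorted order, so comparing neighbours is sufficient.

```python
def is_prefix_free(codes):
    """True if no code word is a prefix of another one."""
    words = sorted(codes)
    # After sorting, a word that is a prefix of any other word is
    # also a prefix of the word directly following it.
    return not any(b.startswith(a) for a, b in zip(words, words[1:]))

print(is_prefix_free(["0", "10", "111", "110"]))   # True
print(is_prefix_free(["0", "01", "11"]))           # False: "0" is a prefix of "01"
```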

Huffman Code
Idea: characters are weighted and sorted according to their frequency. This also works independently of a particular text, e.g., for English (characters with relative weights):
E 1231, T 959, A 805, O 794, N 719, I 718, S 659, R 603, H 514, L 403, D 365, C 320, U 310, P 229, F 228, M 225, W 203, Y 188, B 162, G 161, V 93, K 52, Q 20, X, J 10, Z 9
A binary tree with minimal external path length is constructed as follows:
1. Each character is represented by a tree of its own, consisting of a single external node with the character's weight.
2. The two trees with the smallest weights are merged into a new tree.
3. The root of the new tree is marked with the sum of the weights of the original roots.
4. Continue with step 2 until only one tree remains.

Example 1: Huffman
Alphabet and frequencies: E 29, T 10, N 9, I 5, S 4
Step 1: (4, 5, 9, 10, 29): merge 4 and 5, new weight: 4 + 5 = 9
Step 2: (9, 9, 10, 29): merge 9 and 9, new weight: 9 + 9 = 18

Example 1: Huffman (2)
Step 3: (10, 18, 29): merge 10 and 18, new weight: 10 + 18 = 28
Step 4: (28, 29): merge 28 and 29, new weight: 28 + 29 = 57. Finished!
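The sequence of weight lists in Example 1 can be reproduced with a deliberately naive sketch that re-sorts the list after every merge. It tracks only the root weights, not the trees themselves; the function name `huffman_steps` is an assumption.

```python
def huffman_steps(weights):
    """Return the list of root weights before each merge and after the last one."""
    forest = sorted(weights)
    steps = [list(forest)]
    while len(forest) > 1:
        merged = forest[0] + forest[1]           # merge the two smallest weights
        forest = sorted(forest[2:] + [merged])
        steps.append(list(forest))
    return steps

for step in huffman_steps([29, 10, 9, 5, 4]):
    print(step)
# [4, 5, 9, 10, 29]
# [9, 9, 10, 29]
# [10, 18, 29]
# [28, 29]
# [57]
```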

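Resulting tree
    Character  Code  Weight
    E          1     29
    T          00    10
    N          011   9
    I          0101  5
    S          0100  4
$Ext_w = 29 \cdot 1 + 10 \cdot 2 + 9 \cdot 3 + 5 \cdot 4 + 4 \cdot 4 = 112$
Using this code, for example: TENNIS = 00101101101010100, SET = 0100100, NET = 011100. Decoding works as described before.
[Figure: resulting Huffman tree with root weight 57, further internal weights 28, 18 and 9, and the external nodes E, T, N, S, I.]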
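The code table can be applied and checked with a short sketch. The dictionaries restate the table of this slide; the names `CODE`, `WEIGHT` and `encode` are illustrative.

```python
CODE = {"E": "1", "T": "00", "N": "011", "I": "0101", "S": "0100"}
WEIGHT = {"E": 29, "T": 10, "N": 9, "I": 5, "S": 4}

def encode(message, code):
    """Concatenate the variable-length code words of all characters."""
    return "".join(code[ch] for ch in message)

# Weighted external path length: code length (= depth) times weight.
ext_w = sum(len(CODE[ch]) * WEIGHT[ch] for ch in CODE)

print(ext_w)                                     # 112
print(encode("TENNIS", CODE))                    # 00101101101010100
print(encode("SET", CODE))                       # 0100100
print(encode("NET", CODE))                       # 011100
```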

Some remarks
The resulting tree is not regular. Regular trees are not always optimal. Example: for the weights 4, 5, 9, 10, 29 the best nearly complete tree has $Ext_w = 123$.
For the message ABBAABCDADA, 20 bits is optimal (see previous slides).

Example 2: Huffman
    Character  p (%)  Code
    A          25     00
    B          4      1110
    C          13     100
    D          7      110
    E          35     01
    F          11     101
    G          2      11110
    H          3      11111
Average number of bits without Huffman: 3 (because $2^3 = 8$).
Average number of bits using the Huffman code: (25·2 + 4·4 + 13·3 + 7·3 + 35·2 + 11·3 + 2·5 + 3·5) / 100 = 2.54.
There are other “valid” solutions! But the average number of bits is the same for all of them (equal to that of the Huffman code).
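The average code length is the sum of probability times code length. The sketch below recomputes it from the table of this slide; integer percentages are used to avoid rounding questions, and the variable names are assumptions.

```python
# (probability in percent, code word) per character, as given in the table
TABLE = {
    "A": (25, "00"),   "B": (4, "1110"),  "C": (13, "100"),   "D": (7, "110"),
    "E": (35, "01"),   "F": (11, "101"),  "G": (2, "11110"),  "H": (3, "11111"),
}

total_percent = sum(p for p, _code in TABLE.values())
weighted_bits = sum(p * len(code) for p, code in TABLE.values())

print(total_percent)                    # 100
print(weighted_bits)                    # 254, i.e. 2.54 bits per character on average
```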

Analysis
/* Algorithm Huffman */
for (int i = 1; i <= n-1; i++) {
    p1 = smallest element in list L
    remove p1 from L
    p2 = smallest element in L
    remove p2 from L
    create node p
    add p1 and p2 as left and right subtrees to p
    weight p = weight p1 + weight p2
    insert p into L
}
The run-time behavior depends in particular on the implementation of the list L:
the time required to find the node with the smallest weight, and
the time required to insert a new node.
“Naive” implementations give $O(n^2)$; “smarter” ones result in $O(n \log_2 n)$.
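One way to obtain the "smarter" bound is to keep the list L in a binary heap, so that both removing the smallest element and inserting a new node cost O(log n). The following is a sketch under that assumption, using Python's `heapq`; the function name, the tie-breaking counter and the tuple representation of trees are illustrative choices, and at least one character is assumed.

```python
import heapq
from itertools import count

def huffman_code_lengths(weights):
    """weights: dict character -> weight. Returns dict character -> code length."""
    tiebreak = count()                           # keeps heap entries comparable
    heap = [(w, next(tiebreak), ch) for ch, w in weights.items()]
    heapq.heapify(heap)
    for _merge in range(len(weights) - 1):       # n - 1 merges, as in the loop above
        w1, _, p1 = heapq.heappop(heap)          # smallest element
        w2, _, p2 = heapq.heappop(heap)          # second smallest element
        heapq.heappush(heap, (w1 + w2, next(tiebreak), (p1, p2)))
    lengths = {}
    stack = [(heap[0][2], 0)]                    # (subtree, depth)
    while stack:
        node, depth = stack.pop()
        if isinstance(node, tuple):              # internal node
            stack.append((node[0], depth + 1))
            stack.append((node[1], depth + 1))
        else:                                    # external node: a character
            lengths[node] = depth
    return lengths

print(huffman_code_lengths({"E": 29, "T": 10, "N": 9, "I": 5, "S": 4}))
```

The code lengths equal the depths of the external nodes, so multiplying them by the weights gives the weighted external path length of the constructed tree.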

Optimality
Observation: The weight of a node K in the Huffman tree is equal to the sum of the weights of the external nodes in the subtree having K as root. Consequently, the weighted external path length of a tree equals the sum of the weights of its internal nodes (in Example 1: 57 + 28 + 18 + 9 = 112).
Theorem: A Huffman tree is an extended binary tree with minimal weighted external path length $Ext_w$.
Proof outline (by induction on n, the number of characters in the alphabet): The statement to prove is A(n) = “A Huffman tree with n external nodes has minimal external path length Extw”. Consider first n = 2: prove A(2) = “A Huffman tree with 2 external nodes has minimal external path length”.

Optimality (2)
Proof:
n = 2: Two characters with weights $q_1$ and $q_2$ can only result in a tree with $Ext_w = q_1 + q_2$. This is minimal, because there are no other trees.
Induction hypothesis: A(i) is true for all $i \le k$.
To prove: A(k+1) is true.
[Figure: root V with the subtrees T1 and T2.]

Optimality (3)
Proof (continued): Consider a Huffman tree T with k+1 nodes. This tree has a root V and two subtrees T1 and T2, which have the weights $q_1$ and $q_2$ respectively, so V has the weight $q_1 + q_2$. From the construction method we can deduce that the weights $q_i$ of all internal nodes $n_i$ of T1 and T2 satisfy $q_i \le \min(q_1, q_2)$. Therefore $q_1 + q_2 > q_i$ for these weights. So if V is exchanged with any node in T1 or T2, the resulting tree will have a greater weight. Exchanging nodes within T1 and T2 does not help either, because T1 and T2 are already optimal (both are trees with k nodes or fewer, so the induction hypothesis holds for them). So T is an optimal tree with k+1 nodes.
[Figure: root V with weight q1 + q2 and the subtrees T1 (weight q1) and T2 (weight q2).]
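For small alphabets the theorem can be checked by brute force. The sketch below is not part of the proof: it tries every possible order of pairwise merges (not only the two smallest trees) and relies on the fact that the weighted external path length of a tree equals the sum of the weights of its internal nodes, i.e. the sum of all merge results. The function name `min_ext_w` is an assumption, and the search is exponential, so it is only meant for a handful of weights.

```python
from functools import lru_cache
from itertools import combinations

@lru_cache(maxsize=None)
def min_ext_w(forest):
    """Minimal Ext_w over all binary trees for a sorted tuple of root weights."""
    if len(forest) == 1:
        return 0
    best = None
    for i, j in combinations(range(len(forest)), 2):
        merged = forest[i] + forest[j]           # weight of the new internal node
        rest = [w for k, w in enumerate(forest) if k not in (i, j)]
        cost = merged + min_ext_w(tuple(sorted(rest + [merged])))
        if best is None or cost < best:
            best = cost
    return best

print(min_ext_w((4, 5, 9, 10, 29)))              # 112, the value of the Huffman tree
print(min_ext_w((3, 8, 15, 25)))                 # 88
```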

Huffman Code: Applications
Fax machine

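Huffman: Other applications
ZIP coding (at least a similar technique).
In principle: most coding techniques with data reduction (lossless compression).
NOT Huffman by itself: the lossy part of compression techniques like JPEG, MP3, MPEG, … (Huffman coding is lossless; several of these formats use it only as a final entropy-coding stage).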
