
# Greedy Algorithms Amihood Amir Bar-Ilan University.


Greedy Algorithms Amihood Amir Bar-Ilan University

Idea: the simplest type of strategy:

1. Take a step that makes the problem smaller.
2. Iterate.

Difficulty: prove that this leads to an optimal solution. This is not always the case!

Example: Centerstring Problem. Input: $k$ strings $s_1,\dots,s_k$ of length $\ell$ over alphabet $\Sigma$, and a distance $d$. Find: a string $s^*$ such that $\max_{i=1,\dots,k} \mathrm{Ham}(s^*, s_i) \le d$.

Our Problem: given $k$ strings of length $\ell$, find a string whose maximum distance from them is smallest.

Example:

- $s_1$: 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0
- $s_2$: 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0
- $s_3$: 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0
- $s^*$: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

The Hamming distance of the consensus $s^*$ from any string: 4.

Suggestion: a greedy strategy, column majority?

- 0 1 1 1 0 0 0 0 1 0 1 1 0 0 0 0 1 1 0 0 1 1 1 0 1 0 1 0
- 0 1 0 1 0 1 0 0 1 0 1 0 1 0 0 0 1 1 0 0 1 0 1 0 1 0 1 0
- 0 1 1 0 0 0 1 0 1 0 1 1 0 0 1 0 0 1 0 0 1 0 1 0 0 1 1 0

Column majority:

- 0 1 1 1 0 0 0 0 1 0 1 1 0 0 0 0 1 1 0 0 1 0 1 0 1 0 1 0

Problem: this works if we want to minimize the average distance, but not if we want to minimize the maximum!

Why?

- 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0
- 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0
- 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0
- 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1
- 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

Column majority:

- 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Hamming distance of the majority string from the last string: 16.

But, for the same five strings:

- 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0
- 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0
- 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0
- 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1
- 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

A better center string:

- 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0

Hamming distance of this string from any input string: 8.
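The gap between the two choices can be checked mechanically. The following is a minimal Python sketch, assuming the five input strings are the four shifted blocks of ones plus the all-ones string of length 16. The helper names `hamming`, `column_majority`, and `max_distance` are illustrative and do not come from the slides.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions where two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def column_majority(strings: list[str]) -> str:
    """Greedy consensus: the most common bit in every column (ties go to '0')."""
    k = len(strings)
    return "".join(
        "1" if 2 * sum(s[i] == "1" for s in strings) > k else "0"
        for i in range(len(strings[0]))
    )


def max_distance(center: str, strings: list[str]) -> int:
    """Largest Hamming distance from the center to any input string."""
    return max(hamming(center, s) for s in strings)


# Assumed inputs: four shifted blocks of ones and the all-ones string.
strings = [
    "1111000000000000",
    "0000111100000000",
    "0000000011110000",
    "0000000000001111",
    "1111111111111111",
]

greedy = column_majority(strings)   # all zeros
better = "1100110011001100"         # candidate center string

print(greedy, max_distance(greedy, strings))
print(better, max_distance(better, strings))
```

The majority string still has the smaller total distance (32 against 40), which is why the greedy choice is fine for the average but not for the maximum.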

Example (that works): Huffman code. Computer data encoding: how do we represent data in binary? Historical solution: fixed-length codes. Encode every symbol by a unique binary string of a fixed length. Examples: ASCII (7-bit code), EBCDIC (8-bit code), …

American Standard Code for Information Interchange

ASCII Example: AABCAA 1000001 1000001 1000010 1000011 1000001 1000001

Total space usage in bits: assume an ℓ-bit fixed-length code. A file of n characters needs nℓ bits.

Variable Length codes. Idea: in order to save space, use fewer bits for frequent characters and more bits for rare characters. Example: suppose an alphabet of 3 symbols, { A, B, C }, and a file of 1,000,000 characters. A fixed-length code needs 2 bits per character, for a total of 2,000,000 bits.

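Variable Length codes - example

Suppose the frequency distribution of the characters is:

| Character | Frequency |
|---|---|
| A | 999,000 |
| B | 500 |
| C | 500 |

Encode:

| Character | Code |
|---|---|
| A | 0 |
| B | 10 |
| C | 11 |

Note that the code of A is of length 1, and the codes for B and C are of length 2.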

Total space usage in bits:

- Fixed code: 1,000,000 × 2 = 2,000,000
- Variable code: 999,000 × 1 + 500 × 2 + 500 × 2 = 1,001,000

A savings of almost 50%.
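The arithmetic can be reproduced in a few lines. This is a small Python sketch using the frequencies and codes of the three-symbol example. The variable names are illustrative only.

```python
# Frequencies and codes of the three-symbol example.
freq = {"A": 999_000, "B": 500, "C": 500}
code = {"A": "0", "B": "10", "C": "11"}

# A fixed-length code for 3 symbols needs 2 bits per character.
fixed_bits = sum(freq.values()) * 2

# A variable-length code pays len(code[s]) bits for every occurrence of s.
variable_bits = sum(freq[s] * len(code[s]) for s in freq)

savings = 1 - variable_bits / fixed_bits
print(fixed_bits, variable_bits, f"{savings:.2%}")
```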

How do we decode? In the fixed-length code, we know where every character starts, since they all have the same number of bits. Example: A = 00, B = 01, C = 10. The bit string 000000010110101001100100001010 decodes to A A A B B C C C B C B A A C C.

How do we decode? In the variable-length code, we use a prefix code, in which no codeword is a prefix of another. Example: A = 0, B = 10, C = 11. None of these codewords is a prefix of another.

How do we decode? Example: A = 0, B = 10, C = 11. So, for the string A A A B B C C C B C B A A C C, the encoding is 0001010111111101110001111.

Prefix Code. Example: A = 0, B = 10, C = 11. Decoding the string 0001010111111101110001111 gives AAABBCCCBCBAACC.
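Decoding a prefix code needs no separators: read bits until the buffer matches a codeword, emit the symbol, and start over. The sketch below illustrates this in Python for the code A = 0, B = 10, C = 11. The function names `encode` and `decode` are assumptions made for illustration.

```python
CODE = {"A": "0", "B": "10", "C": "11"}


def encode(text: str, code: dict[str, str]) -> str:
    """Concatenate the codewords of the characters in text."""
    return "".join(code[ch] for ch in text)


def decode(bits: str, code: dict[str, str]) -> str:
    """Decode a bit string, assuming code is prefix-free.

    Because no codeword is a prefix of another, the first codeword
    that matches the buffered bits is the only possible one.
    """
    inverse = {word: sym for sym, word in code.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    if current:
        raise ValueError("trailing bits do not form a codeword")
    return "".join(out)


message = "AAABBCCCBCBAACC"
bits = encode(message, CODE)
print(bits)
print(decode(bits, CODE))
```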

Desiderata: construct a variable-length code for a given file with the following properties:

1. Prefix code.
2. Using the shortest possible codes.
3. Efficient.
4. As close to entropy as possible.

Idea: consider a binary tree, with 0 meaning a left turn and 1 meaning a right turn.

[Figure: binary tree with leaves A, B, C, D; left edges are labeled 0 and right edges are labeled 1.]

Idea: consider the paths from the root to each of the leaves A, B, C, D:

- A: 0
- B: 10
- C: 110
- D: 111

[Figure: binary tree with leaves A, B, C, D; left edges are labeled 0 and right edges are labeled 1.]

Observe:

1. This is a prefix code, since each of the leaves has a path ending in it, without continuation.
2. If the tree is full then we are not “wasting” bits.
3. If we make sure that the more frequent symbols are closer to the root then they will have a shorter code.
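Reading codes off a tree can be sketched as a short recursion. The Python example below assumes a tree is either a leaf (a string) or a pair `(left, right)`. This representation and the helper names `codes_from_tree` and `is_prefix_free` are chosen here for illustration and are not from the slides.

```python
def codes_from_tree(tree, prefix: str = "") -> dict[str, str]:
    """Map each leaf to its root-to-leaf path: 0 = left turn, 1 = right turn."""
    if isinstance(tree, str):          # a leaf holds a symbol
        return {tree: prefix}
    left, right = tree                 # an internal node has two children
    codes = codes_from_tree(left, prefix + "0")
    codes.update(codes_from_tree(right, prefix + "1"))
    return codes


def is_prefix_free(codes: dict[str, str]) -> bool:
    """Check that no codeword is a prefix of another.

    After sorting, a codeword that is a prefix of any other codeword
    is also a prefix of its immediate successor.
    """
    words = sorted(codes.values())
    return all(not b.startswith(a) for a, b in zip(words, words[1:]))


# The tree from the slide: A hangs off the root, D is deepest.
tree = ("A", ("B", ("C", "D")))
codes = codes_from_tree(tree)
print(codes)
print(is_prefix_free(codes))
```

Since symbols sit only at leaves, no path to a symbol continues on to another symbol, which is the first observation above.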

Greedy Algorithm:

1. Consider all (symbol, frequency) pairs.
2. Choose the two lowest frequencies, and make them siblings, with their parent having the combined frequency.
3. Iterate.

Greedy Algorithm Example: alphabet A, B, C, D, E, F. Frequency table:

| Symbol | Frequency |
|---|---|
| A | 10 |
| B | 20 |
| C | 30 |
| D | 40 |
| E | 50 |
| F | 60 |

Total file length: 210.

Algorithm Run:

1. Start with six single-node trees: A 10, B 20, C 30, D 40, E 50, F 60.
2. Merge A (10) and B (20) into X (30). Roots: X 30, C 30, D 40, E 50, F 60.
3. Merge X (30) and C (30) into Y (60). Roots: D 40, E 50, F 60, Y 60.
4. Merge D (40) and E (50) into Z (90). Roots: F 60, Y 60, Z 90.
5. Merge Y (60) and F (60) into W (120). Roots: Z 90, W 120.
6. Merge Z (90) and W (120) into the root V (210). Label the two edges below every internal node 0 and 1.

The Huffman encoding:

| Symbol | Code |
|---|---|
| A | 1000 |
| B | 1001 |
| C | 101 |
| D | 00 |
| E | 01 |
| F | 11 |

File size: 10×4 + 20×4 + 30×3 + 40×2 + 50×2 + 60×2 = 40 + 80 + 90 + 80 + 100 + 120 = 510 bits.

Note the savings: the Huffman code required 510 bits for the file. A fixed-length code needs 3 bits per character for 6 characters; the file has 210 characters, for a total of 630 bits.

Note also: for a uniform character distribution, the Huffman encoding will be equal to the fixed-length encoding (exactly so when the number of symbols is a power of 2). Why? Assignment.

Formally, the algorithm:

1. Initialize trees of a single node each.
2. Keep the roots of all subtrees in a priority queue.
3. Iterate until only one tree is left: merge the two smallest-frequency subtrees into a single subtree with two children, and insert it into the priority queue.

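Algorithm time: each priority queue operation (e.g. on a heap) takes O(log n). Initially there are n subtrees, and each iteration leaves one subtree fewer, so there are n − 1 iterations, each with a constant number of queue operations. Total: O(n log n) time.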
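The priority-queue formulation can be sketched with Python's standard `heapq` module. This is one possible implementation under stated assumptions, not the lecturer's code: symbols are strings, a tree is either a symbol or a pair of subtrees, the input is non-empty, and a counter breaks ties so that the heap never has to compare two trees. The name `huffman_codes` is illustrative.

```python
import heapq
from itertools import count


def huffman_codes(freq: dict[str, int]) -> dict[str, str]:
    """Build a Huffman code for a non-empty frequency table."""
    tie = count()  # unique tie-breaker; trees themselves are never compared
    heap = [(f, next(tie), sym) for sym, f in freq.items()]
    heapq.heapify(heap)

    # Each iteration removes two roots and inserts one: n - 1 iterations.
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tie), (t1, t2)))

    _, _, root = heap[0]
    codes: dict[str, str] = {}

    def walk(tree, prefix: str) -> None:
        if isinstance(tree, str):
            codes[tree] = prefix or "0"  # a one-symbol alphabet still needs a bit
        else:
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")

    walk(root, "")
    return codes


freq = {"A": 10, "B": 20, "C": 30, "D": 40, "E": 50, "F": 60}
codes = huffman_codes(freq)
total_bits = sum(freq[s] * len(codes[s]) for s in freq)
print(codes)
print(total_bits)
```

The exact bit patterns depend on which child is labeled 0 and on tie-breaking, so they may differ from the slide's table, but the code lengths and the 510-bit total are the same.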

Algorithm correctness: we need to prove two things for greedy algorithms:

- Greedy Choice Property: the choice of a local optimum is indeed part of a global optimum.
- Optimal Substructure Property: when we recurse on the remaining problem and combine its solution with the local optimum of the greedy choice, we get a global optimum.

Centerstring Algorithm correctness: Greedy Choice Property: the choice of the majority at a column turns out not necessarily to be part of a global optimum. Optimal Substructure Property: a global optimum means that the overall maximum distance, including the first greedy choice, is smallest.

Example: 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ----------------------------------------- 1 For the optimum the second index needs to be 0, but if we ignore the first index, a global optimum may be 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

Huffman Algorithm correctness: we need to prove two things. Greedy Choice Property: there exists a minimum-cost prefix tree in which the two smallest-frequency characters are siblings at the end of a longest path from the root. This means that the greedy choice does not hurt finding the optimum.

Algorithm correctness: Optimal Substructure Property: once we combine the two least frequent elements into one to produce a smaller problem, an optimal solution to the smaller problem yields an optimal solution to the original problem when the two elements are added back.

Algorithm correctness: there exists a minimum-cost tree in which the two minimum-frequency elements are siblings at the end of a longest path. Suppose some optimal tree does not have this property. Its longest path still ends in two sibling leaves. Let a, b be the elements with the smallest frequencies and x, y the sibling elements at the end of the longest path.

Algorithm correctness: [Figure: code tree CT with the siblings x and y at depth $d_y$ and the element a at depth $d_a$.] We know about depth and frequency: $d_a \le d_y$ and $f_a \le f_y$.

Algorithm correctness: we also know about the code tree CT that $\sum_{\sigma} f_\sigma d_\sigma$ is the smallest possible. Now exchange a and y.

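Algorithm correctness: in CT' the elements a and y have exchanged places, so a is at depth $d_y$ and y is at depth $d_a$. Since $d_a \le d_y$ and $f_a \le f_y$, we have $(f_y - f_a)(d_y - d_a) \ge 0$, that is, $f_a d_a + f_y d_y \ge f_y d_a + f_a d_y$. Therefore

$$\mathrm{cost}(CT) = \sum_{\sigma} f_\sigma d_\sigma = \sum_{\sigma \ne a,y} f_\sigma d_\sigma + f_a d_a + f_y d_y \ge \sum_{\sigma \ne a,y} f_\sigma d_\sigma + f_y d_a + f_a d_y = \mathrm{cost}(CT').$$

Since CT is optimal, CT' is optimal as well.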

Algorithm correctness: [Figure: code tree with the siblings x and a at depth $d_x$ and the element b at depth $d_b$.] Now do the same thing for b and x.

Algorithm correctness: after exchanging b and x we get an optimal code tree CT'' in which a and b are siblings at the end of a longest path.

Algorithm correctness: Optimal substructure property: let a, b be the symbols with the smallest frequencies. Delete the characters a and b, and add a new symbol x whose frequency is $f_x = f_a + f_b$. Find the optimal code tree CT for the reduced alphabet. Then CT', obtained from CT by giving the leaf x the two children a and b, is an optimal tree for the original alphabet.

Algorithm correctness: [Figure: CT contains the leaf x; CT' replaces x with an internal node whose children are a and b, where $f_x = f_a + f_b$.]

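Algorithm correctness: let $d_\sigma$ denote depths in CT and $d'_\sigma$ depths in CT'. Then $d'_\sigma = d_\sigma$ for $\sigma \ne a,b$ and $d'_a = d'_b = d_x + 1$, so

$$\begin{aligned}
\mathrm{cost}(CT') &= \sum_{\sigma} f_\sigma d'_\sigma \\
&= \sum_{\sigma \ne a,b} f_\sigma d'_\sigma + f_a d'_a + f_b d'_b \\
&= \sum_{\sigma \ne a,b} f_\sigma d'_\sigma + f_a (d_x + 1) + f_b (d_x + 1) \\
&= \sum_{\sigma \ne a,b} f_\sigma d'_\sigma + (f_a + f_b)(d_x + 1) \\
&= \sum_{\sigma \ne a,b} f_\sigma d_\sigma + f_x d_x + f_x \\
&= \mathrm{cost}(CT) + f_x .
\end{aligned}$$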

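Algorithm correctness: [Figure: CT contains the leaf x; CT' replaces x with an internal node whose children are a and b, where $f_x = f_a + f_b$.] In short: cost(CT) + $f_x$ = cost(CT').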
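One consequence of this identity is easy to check numerically: every merge of two subtrees into a node of weight $f_x$ adds exactly $f_x$ to the cost, and a single-node tree costs 0, so the cost of the final tree is the sum of all merged weights. The Python sketch below illustrates this on the earlier six-symbol example. The function name `huffman_cost` is an assumption made for illustration.

```python
import heapq


def huffman_cost(freqs: list[int]) -> int:
    """Total encoded length of a Huffman code, without building the tree.

    Applies cost(CT') = cost(CT) + f_x once per merge: each merged
    weight f_x is added to the running total.
    """
    heap = list(freqs)
    heapq.heapify(heap)
    total = 0
    while len(heap) > 1:
        merged = heapq.heappop(heap) + heapq.heappop(heap)
        total += merged
        heapq.heappush(heap, merged)
    return total


# Merged weights for the example are 30, 60, 90, 120, 210.
print(huffman_cost([10, 20, 30, 40, 50, 60]))
```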

Algorithm correctness: assume CT' is not optimal. By the previous lemma there is a tree CT'' that is optimal, and where a and b are siblings. So cost(CT'') < cost(CT').

Algorithm correctness: consider CT''', obtained from CT'' by replacing the siblings a and b with a single leaf x of frequency $f_x = f_a + f_b$. By a similar argument: cost(CT''') + $f_x$ = cost(CT'').

Algorithm correctness: we get cost(CT''') = cost(CT'') − $f_x$ < cost(CT') − $f_x$ = cost(CT), and this contradicts the minimality of cost(CT).

Algorithm correctness: Entropy: We leave for a compression course.
