Data Compression and Huffman’s Algorithm 15-211 Fundamental Data Structures and Algorithms Ananda Gunawardena 22 February 2005.


1 Data Compression and Huffman’s Algorithm 15-211 Fundamental Data Structures and Algorithms Ananda Gunawardena 22 February 2005

2 Data Compression

3 Data compression
- Is one of the fundamental technologies of the Internet.
- Is necessary for faster data transmission.
- Useful even locally, to keep files smaller or to back up data.

4 Data compression
- Types of compression
  - Lossless – encodes the original information exactly.
  - Lossy – approximates the original information.
- Uses of compression
  - Images over the web: JPEG
  - Music: MP3
  - General-purpose: ZIP, GZIP, JAR, …

5 Lossy vs. Lossless
- If lossless compression represents exactly the same data, in a compressed form, why use lossy at all?
- Maybe you can get excellent compression without too much loss of information?
- Let’s look at an example…

6 Compare two images One image is 400K the other is 1100K. Which is which?

7 So where is the difference?

8 Another Example - SVD
- Singular Value Decomposition
  - A hybrid algorithm, which compresses data to a certain “rank”.
  - Lower ranks are lossy, highest rank is lossless.
  - But making this algorithm lossless actually doubles the size of the file!
- In what kinds of situations might SVD be useful?

9 SVD uses
- Suppose we send a robot to explore the moon. It doesn’t know what information is useful to us.
- We can ask it to first send us a small rank and then, if we’re interested, we can ask for larger ranks.
- Ultimately we get all the information, but only if we really want/need it.

10 Another Example - SVD
- Okay, so the robot on the moon seems a bit contrived.
- But what about surfing the web on your handheld?
- There is so much nonsense on the web, we clearly don’t want to download everything since bandwidth is at a premium.

11 Another Example - SVD
[Figure: the same image shown at Rank 1, Rank 8, Rank 16, and as the original; sizes given on the slide: 2231 bytes, 4549 bytes]

12 What can we conclude?
- There is definitely a trade-off.
- Lossless may not perform so well, but it retains 100% of the information.
- Lossy can perform extremely well, but is the compression worth the loss of information?
- So how do we decide which one to use?

13 Some Considerations
- What types of files would you use a lossless algorithm on?
- What types of files would you use a lossy algorithm on?

14 Some Considerations
- What types of files would you use a lossless algorithm on?
  - Discrete data (text files)
- What types of files would you use a lossy algorithm on?
  - Analogue data (images, music)
  - Files where you can get away with an approximation to the data.

15 Question #1
- Is there a lossless compression algorithm that can compress any file?

16 Answer
- Absolutely not!
- Why not?
  - Hint: How many binary strings are there of length N?
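
The hint can be followed with a short counting sketch. The function names below are illustrative and not from the slides. The point is that there are 2^N strings of length exactly N but only 2^N - 1 strings that are strictly shorter, so a lossless compressor cannot map every length-N input to a distinct shorter output.

```python
def count_strings(n):
    """Number of binary strings of length exactly n."""
    return 2 ** n


def count_shorter_strings(n):
    """Number of binary strings of length 0, 1, ..., n - 1."""
    return sum(2 ** k for k in range(n))


n = 8
inputs = count_strings(n)             # candidate files of length n
outputs = count_shorter_strings(n)    # possible strictly shorter files
print(inputs, outputs)                # 256 255
# One output too few: at least two inputs would have to share a compressed
# form, so decompression could not recover both.
```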

17 Question #2
- Is there a best possible way to compress files?
- Is there an algorithm that always produces the smallest compressed file possible?
No!

18 No optimal compression
- Suppose you wish to compress the first 10,000 digits of Pi.
- In case they slipped your mind…

19 Pi

20 How about a program?
long a[35014],b,c=35014,d,e,f=1e4,g,h;
main(){
for(;b=c-=14;h=printf("%04ld",e+d/f))
for(e=d%=f;g=--b*2;d/=g)
d=d*b+f*(h?a[b]:f/5),
a[b]=d%--g;
}

21 pitiny.c
- This C program is just 143 characters long!
- And it “decompresses” into the first 10,000 digits of Pi.
long a[35014],b,c=35014,d,e,f=1e4,g,h;
main(){for(;b=c-=14;h=printf("%04ld",e+d/f))
for(e=d%=f;g=--b*2;d/=g)
d=d*b+f*(h?a[b]:f/5),
a[b]=d%--g;}

22 Program size complexity
- There is an interesting idea here:
  - Find the shortest program that computes a certain output.
  - A very important idea in theoretical computer science. It can be used to define incompressible data (no shorter program will produce these data).
- But useless for data compression.
  - There is no algorithm that can find such programs.

23 Challenge
- Can you come up with the shortest Java program that computes the first 10,000 digits of Pi and writes them to the screen?
- Why does pitiny.c work?

24 Getting close
- In practice, the best we can hope for is a program that does good compression in interesting cases.
  - Text files
  - Numerical data
  - Voice
  - Music
  - Images
  - Video
  - …

25 How does compression work?
- Lossy algorithms are generally mathematically based. They work by applying transforms.
  - E.g. JPEG – discrete cosine transform
- Lossy algorithms attempt to approximate the original data.
- Lossless algorithms cannot do that since they need to maintain the original data.
- So what can they do?

26 How does compression work?
- They need to analyze the file and take advantage of certain properties it might have.
- Or its structure.
- We’ll look at two important lossless compression methods
  - Huffman compression
  - LZW compression

27 Interlude: Bit-level Representation of Data

28 Bits and Bytes
- All data is stored on a computer as a sequence of 0’s and 1’s, called bits.
- This is a very natural way to represent data, for the following reason:
  - A computer cannot, in general, infer 10 different values from the intensity of a signal.
  - It can however infer 2 different values very easily, i.e. whether the signal is high or low.

29 Bits and Bytes
- The problem: If we use sequences of just 0’s and 1’s instead of 0…9 to represent data, regardless of the convenience, aren’t we using a lot more space?
- To address this issue, let’s consider a specific question…

30 Bits and Bytes
- Suppose you had a text file (say, the complete works of Shakespeare) and you know that it has 32 different symbols and a total of 100,000 characters.
- How much space would be needed to represent this in base 10?
- How about base 2?

31 Bits and Bytes
- But… $\log_2 32 = \log_2 10 \cdot \log_{10} 32$
- So we are only a constant factor off.
- In big-Oh terms, how much more space is needed by the base 2 representation?
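
As a rough worked example of the question above, the sketch below assumes each of the 32 symbols gets its own fixed-width group of digits (an assumption for illustration; the slides do not fix a scheme) and checks the change-of-base identity numerically.

```python
import math

symbols = 32          # alphabet size from the slide
characters = 100_000  # file length from the slide

# Fixed-width groups: enough digits to tell 32 symbols apart.
bits_per_symbol = math.ceil(math.log2(symbols))      # 5 binary digits
digits_per_symbol = math.ceil(math.log10(symbols))   # 2 decimal digits

total_bits = bits_per_symbol * characters        # 500000
total_digits = digits_per_symbol * characters    # 200000

# Change of base: log2(32) = log2(10) * log10(32), so the two lengths
# differ only by the constant factor log2(10), roughly 3.32.
identity_holds = math.isclose(math.log2(32), math.log2(10) * math.log10(32))
print(total_bits, total_digits, identity_holds)
```

Because the factor does not depend on the file length, both representations grow linearly in the number of characters.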

32 Bits and Bytes
- Okay, so we’ve established that it’s easiest to store data as a sequence of 0’s and 1’s, but how does that help us?
- In particular, how do I take a text file and store it on the computer?
- To do this we need to invent a code.

33 Codes

34
- We can think of data as a large sequence of bits which can be partitioned into smaller meaningful sequences.
- A code then is simply a mapping from sequences of bits to characters (or something meaningful).
- For example, the ASCII system is a code. It maps single bytes (8 bits) to unique characters.

35 Encoding strings
- Suppose we have 5 characters, {a,…,e}.
- We can use the following 3-bit code:

  Symbol    a    b    c    d    e
  Codeword  000  001  010  011  100

- We can then encode “badcae” as: 001 000 011 010 000 100
- Really an 18 bit string: 001000011010000100
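
The 3-bit code can be tried out with a small sketch. The names `FIXED_CODE`, `encode_fixed`, and `decode_fixed` are illustrative, not from the slides; decoding works by cutting the bit string into 3-bit chunks.

```python
# The fixed-length code from the slide: every symbol gets exactly 3 bits.
FIXED_CODE = {"a": "000", "b": "001", "c": "010", "d": "011", "e": "100"}


def encode_fixed(text, code=FIXED_CODE):
    """Concatenate the codeword of each character."""
    return "".join(code[ch] for ch in text)


def decode_fixed(bits, code=FIXED_CODE, width=3):
    """Split the bit string into chunks of `width` bits and look each one up."""
    inverse = {word: sym for sym, word in code.items()}
    return "".join(inverse[bits[i:i + width]] for i in range(0, len(bits), width))


encoded = encode_fixed("badcae")
print(encoded, len(encoded))   # 001000011010000100 18
print(decode_fixed(encoded))   # badcae
```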

36 Codes
- You can think of a code as a function mapping characters to bit strings.
- We would like it to be a bijection.
- What if it is many-to-one?
- What if it is one-to-many?
- In the one-to-many case nothing spectacularly bad happens, but it is a pain to use the code.

37 Codes
- A codeword is simply a binary string and a code is a set of codewords and their meanings.
- Must each codeword in a code necessarily have the same length? I.e. is every code a fixed length code? (E.g., Morse code - not binary)

38 Variable Length Codewords
- Suppose we have 5 characters, {a,…,e}.

  Symbol    a   b    c   d   e
  Codeword  10  001  00  01  11

- What does 001001 code for?
  - 001 001 -> bb
  - 00 10 01 -> cad
- Need to be careful!
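
The ambiguity can be confirmed by brute force. This is only a sketch: `all_parses` is a hypothetical helper that tries every codeword that matches the front of the bit string and recurses on the rest.

```python
# The variable-length code from the slide; note that c = 00 is a prefix of b = 001.
AMBIGUOUS_CODE = {"a": "10", "b": "001", "c": "00", "d": "01", "e": "11"}


def all_parses(bits, code=AMBIGUOUS_CODE):
    """Return every string of symbols whose encoding is exactly `bits`."""
    if not bits:
        return [""]
    results = []
    for sym, word in code.items():
        if bits.startswith(word):
            # Consume this codeword, then parse the remainder in every possible way.
            results += [sym + rest for rest in all_parses(bits[len(word):], code)]
    return results


print(sorted(all_parses("001001")))   # ['bb', 'cad']
```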

39 Prefix Free Codes
- A prefix free code is one where no codeword is a prefix of another codeword.
- Simply parse left to right.
- We can now construct codes whose codewords are of varying lengths. Known as variable length codes.
- Let’s see how they help us…
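
The definition translates directly into a check. The sketch below uses a hypothetical helper `is_prefix_free` that compares every ordered pair of codewords; that is quadratic, but fine for small alphabets.

```python
from itertools import permutations


def is_prefix_free(codewords):
    """True if no codeword is a prefix of a different codeword in the list."""
    return not any(b.startswith(a) for a, b in permutations(codewords, 2))


# Slide 38's code fails, because 00 is a prefix of 001.
print(is_prefix_free(["10", "001", "00", "01", "11"]))    # False
# Changing c to 000 (the variable-length code used on the later slides) repairs it.
print(is_prefix_free(["10", "001", "000", "01", "11"]))   # True
```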

40 Encoding strings

  Symbol                a    b    c    d    e
  Fixed-length code     000  001  010  011  100
  Variable-length code  10   001  000  01   11

- We can encode “badcae” with the fixed-length code as: 001 000 011 010 000 100
- Really an 18 bit string: 001000011010000100

41 Encoding strings

  Symbol                a    b    c    d    e
  Fixed-length code     000  001  010  011  100
  Variable-length code  10   001  000  01   11

- We can encode “badcae” with the variable-length code as: 001 10 01 000 10 11
- Really a 14 bit string: 00110010001011
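
A minimal sketch of encoding and left-to-right decoding with the variable-length code follows. The function names are illustrative. The decoder relies on the code being prefix free: as soon as the bits read so far form a codeword, that codeword can be emitted without looking ahead.

```python
# The prefix free variable-length code from the slide.
VARIABLE_CODE = {"a": "10", "b": "001", "c": "000", "d": "01", "e": "11"}


def encode(text, code=VARIABLE_CODE):
    """Concatenate the codeword of each character."""
    return "".join(code[ch] for ch in text)


def decode(bits, code=VARIABLE_CODE):
    """Scan left to right, emitting a symbol whenever a full codeword has been read."""
    inverse = {word: sym for sym, word in code.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    if current:
        raise ValueError("trailing bits do not form a codeword")
    return "".join(out)


encoded = encode("badcae")
print(encoded, len(encoded))   # 00110010001011 14
print(decode(encoded))         # badcae
```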

42 A basis for data compression
- Usually, we know only the size of the alphabet.
- But what if we also know how often each character appears in a file?
- In other words, what if we know the distribution of the character frequencies?

43 Encoding strings
- Variable-length codes
  - Exploit statistics of symbols.
  - More frequently occurring symbols encoded using fewer bits.
- What makes a good variable-length code?
- It should be prefix free!

  Symbol                          a    b    c    d    e    Total
  Frequency                       50   25   15   40   75   205 chars
  Fixed-length code               000  001  010  011  100  615 bits
  Variable-length code (optimal)  10   001  000  01   11   450 bits

44 A basis for data compression
- If we know the character frequencies, we can take advantage of them to encode more frequently occurring characters with fewer bits.

45 Huffman’s Algorithm

46 Tree representation
- Represent prefix free codes as full binary trees.
- Full: every node
  - Is a leaf, or
  - Has exactly 2 children.
- The encoding is then a (unique) path from the root to a leaf.
[Figure: binary tree with edges labeled 0 and 1 and leaves c, a, b, d, giving the code a=1, b=001, c=000, d=01]

47 Why a full binary tree?
- A node with no sibling can be moved up 1 level, improving the code.
- An optimal code for a string can always be represented by a full binary tree.
[Figure: two trees over the leaves c, a, b, d]

48 Encoding cost
- Alphabet: C
- Symbol: c
- Symbol frequency: f(c)
- Depth in tree T: d(c) (d(c) is also the number of bits to encode c)
- Encoding cost: $K = \sum_{c \in C} f(c)\, d(c)$
- Q: How to construct a full binary tree that minimizes K?
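
The encoding cost is the total number of bits: each symbol's frequency times the length of its codeword (its depth in the tree), summed over the alphabet. As a quick check, the sketch below recomputes the totals from the frequency table on slide 43; the names `FREQ`, `FIXED`, `VARIABLE`, and `encoding_cost` are illustrative.

```python
# Frequencies and codes from the slide-43 table.
FREQ = {"a": 50, "b": 25, "c": 15, "d": 40, "e": 75}
FIXED = {"a": "000", "b": "001", "c": "010", "d": "011", "e": "100"}
VARIABLE = {"a": "10", "b": "001", "c": "000", "d": "01", "e": "11"}


def encoding_cost(freq, code):
    """Sum of f(c) * d(c), where d(c) is the codeword length of symbol c."""
    return sum(freq[c] * len(code[c]) for c in freq)


print(encoding_cost(FREQ, FIXED))      # 615
print(encoding_cost(FREQ, VARIABLE))   # 450
```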

49 Huffman’s Algorithm
- Huffman’s algorithm gives you an optimal prefix free code by constructing an appropriate tree.
- Data structure used: a priority queue.
- insert(element, priority) inserts an element with a given priority into the queue.
- deleteMin() removes and returns the element with least priority.

50 Huffman’s Algorithm
1. Compute f(c) for every symbol c ∈ C
2. insert(c, f(c)) into priority queue Q for every c ∈ C
3. for i = 1 to |C| - 1 (i.e., while Q holds more than one tree)
4.     z = new TreeNode()
5.     x = z.left = Q.deleteMin()
6.     y = z.right = Q.deleteMin()
7.     f(z) = f(x) + f(y)
8.     Q.insert(z, f(z))
9. return Q.deleteMin()
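
One possible rendering of these steps in Python is sketched below, using the standard `heapq` module as the priority queue. The function name `huffman_code`, the nested-tuple tree representation, and the tie-breaking counter are choices made for this sketch, not part of the slides. It assumes symbols are strings and that at least one symbol is given.

```python
import heapq
from itertools import count


def huffman_code(freqs):
    """Build a prefix free code from a {symbol: frequency} mapping.

    Leaves are symbols (strings); internal nodes are (left, right) tuples.
    """
    tiebreak = count()  # keeps heap entries comparable when frequencies tie
    heap = [(f, next(tiebreak), sym) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        fx, _, x = heapq.heappop(heap)   # least frequent tree
        fy, _, y = heapq.heappop(heap)   # second least frequent tree
        heapq.heappush(heap, (fx + fy, next(tiebreak), (x, y)))
    _, _, root = heap[0]

    codes = {}

    def walk(node, prefix):
        if isinstance(node, tuple):      # internal node: 0 goes left, 1 goes right
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                            # leaf: record the path from the root
            codes[node] = prefix or "0"  # a one-symbol alphabet still needs 1 bit

    walk(root, "")
    return codes


freqs = {"a": 50, "b": 25, "c": 15, "d": 40, "e": 75}
codes = huffman_code(freqs)
total_bits = sum(freqs[s] * len(codes[s]) for s in freqs)
print(codes, total_bits)
```

On the slide-43 frequencies this reaches the same 450-bit total as the table. The exact codewords may differ from the table, since swapping the 0 and 1 labels on a tree's branches does not change codeword lengths.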

51 Example


57 Huffman’s Algorithm
- Huffman’s algorithm is a greedy algorithm that constructs an optimal prefix free code for a given piece of data.
- Does it really generate an optimal prefix free code?
- Yes, but the proof is beyond the scope of today’s lecture.

58 Greedy Algorithms
- At every step a greedy algorithm makes a locally optimal decision, hoping that these decisions add up to a global optimum.
- This strategy works surprisingly well for many problems.
- Some examples:
  - Huffman’s algorithm for data compression.
  - Kruskal’s algorithm for computing minimum spanning trees in graphs.

59 Huffman’s Algorithm
- Why is it greedy?
- Because at each iteration of the loop, it picks the two “optimal” (lowest-frequency) trees in the priority queue to combine under a new node, without considering the implications from a global standpoint.

60 Back to Bits and Bytes
- Notice that Huffman’s algorithm, in the setting we studied it, can only compress files of characters since it needs to know what the alphabet is in order to count the frequencies.
- Do we need to modify the algorithm in order to compress arbitrary files?
- Take a minute to think about this.

61 Bits and Bytes
- No, we don’t!
- Suppose we have a file F to compress. We can treat F as a stream of bits.
- So we read the first byte and consider it in the context of our predefined alphabet. ASCII in this case.
- Implicitly, we then end up treating every file as a text file.
- Is that a good idea? What about images?

62 Bits and Bytes
- It doesn’t matter!
- So long as we reproduce the original bit sequence after decompression.
- We can treat the file as containing just the characters {a,b,c,d} if we want; it won’t affect the correctness of our algorithm.
- It will, however, affect the performance.
- Why?

63 Huffman compression
- Huffman trees provide a straightforward method for file compression.
  1. Read the file and compute frequencies.
  2. Use frequencies to build Huffman codes.
  3. Encode file using the codes.
  4. Write the codes (or tree) and encoded file into the output file. Sometimes students find this to be tricky…
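
Part of what makes the last step tricky is that an encoded bit string rarely ends on a byte boundary. The sketch below shows one way to handle that. The layout (a leading byte holding the number of padding bits) is an assumption made for illustration, not a format given in the slides, and writing the code table itself is left out.

```python
def pack_bits(bits):
    """Pack a string of '0'/'1' characters into bytes.

    Assumed layout: first byte = number of padding bits (0-7), followed by
    the payload, padded with zeros on the right to a whole number of bytes.
    """
    pad = (-len(bits)) % 8
    padded = bits + "0" * pad
    body = bytes(int(padded[i:i + 8], 2) for i in range(0, len(padded), 8))
    return bytes([pad]) + body


def unpack_bits(data):
    """Invert pack_bits: expand the bytes to bits and drop the padding."""
    pad, body = data[0], data[1:]
    bits = "".join(f"{byte:08b}" for byte in body)
    return bits[:len(bits) - pad]


packed = pack_bits("00110010001011")   # the 14-bit encoding of "badcae"
print(list(packed))                    # [2, 50, 44]
print(unpack_bits(packed))             # 00110010001011
```

Without the padding count, the decoder could mistake the two filler zeros for the start of another codeword.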

64 Variations
- Reading the file twice is a pain.
  - Once to compute frequencies, and again to do the compression.
- It is possible to build an adaptive Huffman tree that adjusts itself as more data becomes available.

65 Beating Huffman
- How about doing better than Huffman!
- Impossible!
  - Huffman’s algorithm gives the optimal prefix code!
- Right.
  - But who says we have to use a prefix code?

66 Example
- Suppose we have a file containing
  - abcdabcdabcdabcdabcdabcd… abcdabcd
- This could be expressed very compactly as
  - abcd^1000
- We will discuss this concept next.

