1 Data Compression and Huffman’s Algorithm 15-211 Fundamental Data Structures and Algorithms Ananda Gunawardena 22 February 2005

2 Data Compression

3 Data compression
- One of the fundamental technologies of the Internet.
- Necessary for faster data transmission.
- Useful even locally, to keep files smaller or to back up data.

4 Data compression
- Types of compression
  - Lossless – encodes the original information exactly.
  - Lossy – approximates the original information.
- Uses of compression
  - Images over the web: JPEG
  - Music: MP3
  - General-purpose: ZIP, GZIP, JAR, …

5 Lossy vs. Lossless
- If lossless compression represents exactly the same data in a compressed form, why use lossy at all?
- Because you may be able to get excellent compression without losing too much information.
- Let’s look at an example…

6 Compare two images. One image is 400K, the other is 1100K. Which is which?

7 So where is the difference?

8 Another Example - SVD
- Singular Value Decomposition
  - A hybrid algorithm, which compresses data to a chosen “rank”.
  - Lower ranks are lossy; the highest rank is lossless.
  - But making this algorithm lossless actually doubles the size of the file!
- In what kinds of situations might SVD be useful?

9 SVD uses
- Suppose we send a robot to explore the moon. It doesn’t know what information is useful to us.
- We can ask it to first send us a small rank and then, if we’re interested, ask for larger ranks.
- Ultimately we get all the information, but only if we really want or need it.

10 Another Example - SVD
- Okay, so the robot on the moon seems a bit contrived.
- But what about surfing the web on your handheld?
- There is so much nonsense on the web that we clearly don’t want to download everything, since bandwidth is at a premium.

11 Another Example - SVD
- [Image panels labeled Rank 1, Rank 8, Rank 16, Original; the sizes shown include 2231 bytes and 4549 bytes.]

12 What can we conclude?
- There is definitely a trade-off.
- Lossless may not perform so well, but it retains 100% of the information.
- Lossy can perform extremely well, but is the compression worth the loss of information?
- So how do we decide which one to use?

13 Some Considerations
- What types of files would you use a lossless algorithm on?
- What types of files would you use a lossy algorithm on?

14 Some Considerations
- What types of files would you use a lossless algorithm on?
  - Discrete data (text files)
- What types of files would you use a lossy algorithm on?
  - Analogue data (images, music)
  - Files where you can get away with an approximation to the data.

15 Question #1
- Is there a lossless compression algorithm that can compress any file?

16 Answer
- Absolutely not!
- Why not?
  - Hint: How many binary strings are there of length N?
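
A brief sketch, not from the original slides, of the counting argument the hint points to:

    \[
    \#\{\text{bit strings of length } N\} = 2^N,
    \qquad
    \#\{\text{bit strings of length} < N\} = \sum_{k=0}^{N-1} 2^k = 2^N - 1 .
    \]

Since there are fewer shorter strings than strings of length N, no lossless (injective) compressor can map every length-N file to a strictly shorter one; by the pigeonhole principle, at least one file cannot shrink.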

17 Question #2
- Is there a best possible way to compress files?
- Is there an algorithm that always produces the smallest compressed file possible? No!

18 No optimal compression
- Suppose you wish to compress the first 10,000 digits of Pi.
- In case they slipped your mind…

19 Pi
31415926535897932384626433832795028841971693993751058209749445923078164062862089986280348253421170679…
[The slide lists the first 10,000 digits of Pi; the remainder is omitted here.]

20 How about a program?

    long a[35014],b,c=35014,d,e,f=1e4,g,h;
    main(){for(;b=c-=14;h=printf("%04ld",e+d/f))
      for(e=d%=f;g=--b*2;d/=g)
        d=d*b+f*(h?a[b]:f/5),a[b]=d%--g;}

21 pitiny.c
- This C program is just 143 characters long!
- And it “decompresses” into the first 10,000 digits of Pi.

    long a[35014],b,c=35014,d,e,f=1e4,g,h;
    main(){for(;b=c-=14;h=printf("%04ld",e+d/f))
      for(e=d%=f;g=--b*2;d/=g)
        d=d*b+f*(h?a[b]:f/5),a[b]=d%--g;}

22 Program size complexity
- There is an interesting idea here:
  - Find the shortest program that computes a certain output.
  - A very important idea in theoretical computer science. It can be used to define incompressible data (no shorter program will produce the data).
- But useless for data compression:
  - There is no algorithm that can find such programs.

23 Challenge
- Can you come up with the shortest Java program that computes the first 10,000 digits of Pi and writes them to the screen?
- Why does pitiny.c work?

24 Getting close
- In practice, the best we can hope for is a program that does good compression in interesting cases.
  - Text files
  - Numerical data
  - Voice
  - Music
  - Images
  - Video
  - …

25 How does compression work?
- Lossy algorithms are generally mathematically based; they work by applying transforms.
  - E.g., JPEG uses the discrete cosine transform.
- Lossy algorithms attempt to approximate the original data.
- Lossless algorithms cannot do that, since they need to preserve the original data exactly.
- So what can they do?

26 How does compression work?
- They need to analyze the file and take advantage of properties it might have, or of its structure.
- We’ll look at two important lossless compression methods:
  - Huffman compression
  - LZW compression

27 Interlude: Bit-level Representation of Data

28 Bits and Bytes
- All data is stored on a computer as a sequence of 0’s and 1’s, called bits.
- This is a very natural way to represent data, for the following reason:
  - A computer cannot, in general, reliably distinguish 10 different values of a signal’s intensity.
  - It can, however, distinguish 2 different values very easily, i.e. whether the signal is high or low.

29 Bits and Bytes
- The problem: If we use sequences of just 0’s and 1’s instead of 0…9 to represent data, regardless of the convenience, aren’t we using a lot more space?
- To address this issue, let’s consider a specific question…

30 Bits and Bytes
- Suppose you had a text file (say, the complete works of Shakespeare) and you know that it has 32 different symbols and a total of 100,000 characters.
- How much space would we need to represent this in base 10?
- How about base 2?

31 Bits and Bytes
- But… log2(32) = log2(10) * log10(32)
- So we are only a constant factor off.
- In big-Oh terms, how much more space is needed by the base 2 representation?
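
A worked version of the estimate, assuming exactly 32 distinct symbols and 100,000 characters (my arithmetic, not from the slides):

    \[
    \text{base 2: } 100{,}000 \times \lceil \log_2 32 \rceil = 100{,}000 \times 5 = 500{,}000 \text{ bits},
    \]
    \[
    \text{base 10: } 100{,}000 \times \lceil \log_{10} 32 \rceil = 100{,}000 \times 2 = 200{,}000 \text{ digits}.
    \]

The ratio of symbol counts is bounded by log2(10) ≈ 3.32, a constant, so the base-2 representation needs only Θ(1) times more symbols.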

32 Bits and Bytes
- Okay, so we’ve established that it’s easiest to store data as a sequence of 0’s and 1’s, but how does that help us?
- In particular, how do I take a text file and store it on the computer?
- To do this we need to invent a code.

33 Codes

34
- We can think of data as a large sequence of bits which can be partitioned into smaller, meaningful sequences.
- A code, then, is simply a mapping from sequences of bits to characters (or to something else meaningful).
- For example, the ASCII system is a code: it maps single bytes (8 bits) to unique characters.
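
A tiny illustration (mine, not part of the lecture) of ASCII viewed as a code, printing the 8-bit codeword for each character of a short string:

    public class AsciiCode {
        public static void main(String[] args) {
            for (char c : "bad".toCharArray()) {
                // Pad the binary form of the character's code point to 8 bits.
                String bits = String.format("%8s", Integer.toBinaryString(c))
                                    .replace(' ', '0');
                System.out.println(c + " -> " + bits);
            }
        }
    }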

35 Encoding strings
- Suppose we have 5 characters, {a,…,e}.
- We can use the following 3-bit code:

    Symbol:    a    b    c    d    e
    Codeword:  000  001  010  011  100

- We can then encode “badcae” as: 001 000 011 010 000 100
- Really an 18-bit string: 001000011010000100
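
A small sketch, my own rather than the course's code, that encodes a string with the 3-bit fixed-length code from the table above:

    import java.util.Map;

    public class FixedLengthEncode {
        // The 3-bit code from the slide: a=000, b=001, c=010, d=011, e=100.
        static final Map<Character, String> CODE =
                Map.of('a', "000", 'b', "001", 'c', "010", 'd', "011", 'e', "100");

        static String encode(String text) {
            StringBuilder bits = new StringBuilder();
            for (char c : text.toCharArray()) {
                bits.append(CODE.get(c));   // look up each symbol's codeword
            }
            return bits.toString();
        }

        public static void main(String[] args) {
            System.out.println(encode("badcae"));  // prints 001000011010000100
        }
    }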

36 Codes
- You can think of a code as a function mapping characters to bit strings.
- We would like it to be a bijection.
- What if it is many-to-one?
- What if it is one-to-many?
- In the one-to-many case nothing spectacularly bad happens, but it is a pain to use the code.

37 Codes
- A codeword is simply a binary string, and a code is a set of codewords and their meanings.
- Must each codeword in a code necessarily have the same length? I.e., is every code a fixed-length code? (Counterexample: Morse code, though it is not binary.)

38 Variable Length Codewords
- Suppose we have 5 characters, {a,…,e}, with the code:

    Symbol:    a   b    c   d   e
    Codeword:  10  001  00  01  11

- What does 001001 code for?
  - 001 001 -> bb
  - 00 10 01 -> cad
- Need to be careful!

39 Prefix Free Codes
- A prefix-free code is one where no codeword is a prefix of another codeword.
- Decoding is then unambiguous: simply parse left to right.
- We can now construct codes whose codewords are of varying lengths, known as variable-length codes.
- Let’s see how they help us…

40 Encoding strings
- With the fixed-length code we encode “badcae” as: 001 000 011 010 000 100
- Really an 18-bit string: 001000011010000100

    Symbol:                a    b    c    d    e
    Fixed-length code:     000  001  010  011  100
    Variable-length code:  10   001  000  01   11

41 Encoding strings

    Symbol:                a    b    c    d    e
    Fixed-length code:     000  001  010  011  100
    Variable-length code:  10   001  000  01   11

- With the variable-length code we can encode “badcae” as: 001 10 01 000 10 11
- Really a 14-bit string: 00110010001011
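
A companion decoding sketch (again my own illustration, not the course's code): because the variable-length code above is prefix free, one left-to-right pass recovers the symbols unambiguously.

    import java.util.Map;

    public class PrefixFreeDecode {
        // The variable-length code from the slide: a=10, b=001, c=000, d=01, e=11.
        static final Map<String, Character> CODE =
                Map.of("10", 'a', "001", 'b', "000", 'c', "01", 'd', "11", 'e');

        static String decode(String bits) {
            StringBuilder out = new StringBuilder();
            StringBuilder buffer = new StringBuilder();
            for (char bit : bits.toCharArray()) {
                buffer.append(bit);
                Character symbol = CODE.get(buffer.toString());
                if (symbol != null) {          // a full codeword has been read
                    out.append(symbol);
                    buffer.setLength(0);       // start the next codeword
                }
            }
            return out.toString();
        }

        public static void main(String[] args) {
            System.out.println(decode("00110010001011"));  // prints badcae
        }
    }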

42 A basis for data compression
- Usually, we know only the size of the alphabet.
- But what if we also know how often each character appears in a file?
- In other words, what if we know the distribution of the character frequencies?

43 Encoding strings
- Variable-length codes
  - Exploit statistics of symbols.
  - More frequently occurring symbols are encoded using fewer bits.
- What makes a good variable-length code?
- It should be prefix free!

    Symbol:                          a    b    c    d    e    Total
    Frequency:                       50   25   15   40   75   205 chars
    Fixed-length code:               000  001  010  011  100  615 bits
    Variable-length code (optimal):  10   001  000  01   11   450 bits
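
Spelling out the totals in the table (my arithmetic, using the frequencies shown):

    \[
    \text{fixed-length: } 3 \times (50 + 25 + 15 + 40 + 75) = 3 \times 205 = 615 \text{ bits},
    \]
    \[
    \text{variable-length: } 50\cdot 2 + 25\cdot 3 + 15\cdot 3 + 40\cdot 2 + 75\cdot 2
      = 100 + 75 + 45 + 80 + 150 = 450 \text{ bits}.
    \]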

44 A basis for data compression
- If we know the character frequencies, we can take advantage of them by encoding more frequently occurring characters with fewer bits.

45 Huffman’s Algorithm

46 Tree representation
- Represent prefix-free codes as full binary trees.
- Full: every node
  - is a leaf, or
  - has exactly 2 children.
- The encoding is then a (unique) path from the root to a leaf.
- [Figure: a code tree with leaves c, a, b, d and edges labeled 0 and 1, giving a=1, b=001, c=000, d=01.]
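
A small sketch (my own illustration; the Node fields are assumptions, not the course's starter code) of how the codewords fall out of such a tree: walk from the root, appending 0 for a left edge and 1 for a right edge, and record the accumulated path at each leaf.

    import java.util.Map;
    import java.util.TreeMap;

    public class CodeTree {
        // Hypothetical minimal node type: a leaf stores a symbol,
        // an internal node stores its two children.
        static class Node {
            Character symbol;       // non-null only at leaves
            Node left, right;
            Node(Character symbol) { this.symbol = symbol; }
            Node(Node left, Node right) { this.left = left; this.right = right; }
        }

        // Collect symbol -> codeword by traversing root-to-leaf paths.
        static void collect(Node node, String path, Map<Character, String> codes) {
            if (node.symbol != null) {          // leaf: the path is the codeword
                codes.put(node.symbol, path);
                return;
            }
            collect(node.left, path + "0", codes);
            collect(node.right, path + "1", codes);
        }

        public static void main(String[] args) {
            // The tree from the slide: a=1, b=001, c=000, d=01.
            Node root = new Node(
                    new Node(new Node(new Node('c'), new Node('b')), new Node('d')),
                    new Node('a'));
            Map<Character, String> codes = new TreeMap<>();
            collect(root, "", codes);
            System.out.println(codes);  // prints {a=1, b=001, c=000, d=01}
        }
    }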

47 Why a full binary tree?
- A node with no sibling can be moved up one level, improving the code.
- An optimal code for a string can always be represented by a full binary tree.
- [Figure: the code tree before and after the improvement.]

48 Encoding cost
- Alphabet: C; symbol: c; symbol frequency: f(c); depth in tree T: d(c). (d(c) is also the number of bits needed to encode c.)
- Encoding cost: K = the sum of f(c) * d(c) over all c in C.
- Q: How do we construct a full binary tree that minimizes K?

49 Huffman’s Algorithm
- Huffman’s algorithm gives you an optimal prefix-free code by constructing an appropriate tree.
- Data structure used: a priority queue.
  - insert(element, priority) inserts an element with a given priority into the queue.
  - deleteMin() removes and returns the element with least priority.

50 Huffman’s Algorithm
1. Compute f(c) for every symbol c ∈ C
2. insert(c, f(c)) into priority queue Q
3. for i = 1 to |C| - 1 (i.e., until only one tree remains in Q)
4.     z = new TreeNode()
5.     x = z.left = Q.deleteMin()
6.     y = z.right = Q.deleteMin()
7.     f(z) = f(x) + f(y)
8.     Q.insert(z, f(z))
9. return Q.deleteMin()
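
A compact Java sketch of this loop (my own reading of the pseudocode, using java.util.PriorityQueue; the Node class and its field names are assumptions, not the course's starter code):

    import java.util.Map;
    import java.util.PriorityQueue;

    public class HuffmanBuild {
        static class Node {
            final Character symbol;   // null for internal nodes
            final long freq;          // f(c), or the sum of the children's frequencies
            final Node left, right;
            Node(Character symbol, long freq, Node left, Node right) {
                this.symbol = symbol; this.freq = freq;
                this.left = left; this.right = right;
            }
        }

        // Repeatedly merge the two least-frequent trees until one tree remains.
        static Node build(Map<Character, Long> frequencies) {
            PriorityQueue<Node> q =
                    new PriorityQueue<>((u, v) -> Long.compare(u.freq, v.freq));
            for (Map.Entry<Character, Long> e : frequencies.entrySet()) {
                q.add(new Node(e.getKey(), e.getValue(), null, null));
            }
            while (q.size() > 1) {
                Node x = q.poll();                             // deleteMin()
                Node y = q.poll();                             // deleteMin()
                q.add(new Node(null, x.freq + y.freq, x, y));  // f(z) = f(x) + f(y)
            }
            return q.poll();                                   // root of the code tree
        }

        public static void main(String[] args) {
            // Frequencies from the earlier table.
            Map<Character, Long> f =
                    Map.of('a', 50L, 'b', 25L, 'c', 15L, 'd', 40L, 'e', 75L);
            Node root = build(f);
            System.out.println("total frequency at the root = " + root.freq); // 205
        }
    }

The codewords are then read off the returned tree by a root-to-leaf walk, as in the earlier tree-traversal sketch.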

51 Example

52-56 [Figure-only slides: the example is worked step by step in figures that are not included in this transcript.]

57 Huffman’s Algorithm
- Huffman’s algorithm is a greedy algorithm that constructs an optimal prefix-free code for a given piece of data.
- Does it really generate an optimal prefix-free code?
- Yes, but the proof is beyond the scope of today’s lecture.

58 Greedy Algorithms
- At every step, a greedy algorithm makes a locally optimal decision, hoping that the decisions add up to a global optimum.
- This strategy works surprisingly well for a lot of problems.
- Some examples: Huffman’s algorithm for data compression; Kruskal’s algorithm for computing minimum spanning trees in graphs.

59 Huffman’s Algorithm
- Why is it greedy?
- Because at each iteration of the loop, it picks the two locally “best” (lowest-frequency) trees in the priority queue to merge into a new node, without considering the implications from a global standpoint.

60 Back to Bits and Bytes
- Notice that Huffman’s algorithm, in the setting in which we studied it, can only compress files of characters, since it needs to know what the alphabet is in order to count the frequencies.
- Do we need to modify the algorithm in order to compress arbitrary files?
- Take a minute to think about this.

61 Bits and Bytes
- No, we don’t!
- Suppose we have a file F to compress. We can treat F as a stream of bits.
- So we read the file byte by byte and consider each byte in the context of our predefined alphabet, ASCII in this case.
- Implicitly, we then end up treating every file as a text file.
- Is that a good idea? What about images?
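
A sketch of what “treat the file as a stream of bytes” looks like in practice (my own illustration; "input.bin" is a placeholder file name): count frequencies over the 256 possible byte values, whatever the file actually contains.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class ByteFrequencies {
        public static void main(String[] args) throws IOException {
            // Placeholder path; any file works, text or not.
            byte[] data = Files.readAllBytes(Paths.get("input.bin"));

            long[] freq = new long[256];          // one counter per byte value
            for (byte b : data) {
                freq[b & 0xFF]++;                 // mask to get an index in 0..255
            }

            for (int value = 0; value < 256; value++) {
                if (freq[value] > 0) {
                    System.out.printf("byte 0x%02X : %d%n", value, freq[value]);
                }
            }
        }
    }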

62 Bits and Bytes
- It doesn’t matter, so long as we reproduce the original bit sequence after decompression.
- We can treat the file as containing just the characters {a,b,c,d} if we want; it won’t affect the correctness of our algorithm.
- It will, however, affect the compression performance.
- Why?

63 Huffman compression
- Huffman trees provide a straightforward method for file compression:
  1. Read the file and compute frequencies.
  2. Use the frequencies to build the Huffman code.
  3. Encode the file using the code.
  4. Write the code (or tree) and the encoded data into the output file. Sometimes students find this step tricky… (see the bit-packing sketch below).
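
Step 4 is where the bit-level bookkeeping lives. Here is a minimal sketch (my own, not the course's API) of packing a string of '0'/'1' characters into whole bytes for output:

    import java.io.ByteArrayOutputStream;

    public class BitPacker {
        // Pack a string of '0'/'1' characters into bytes, most significant bit first.
        static byte[] pack(String bits) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            int current = 0, count = 0;
            for (char bit : bits.toCharArray()) {
                current = (current << 1) | (bit - '0');
                count++;
                if (count == 8) {                 // a full byte is ready
                    out.write(current);
                    current = 0;
                    count = 0;
                }
            }
            if (count > 0) {                      // pad the last partial byte with zeros
                out.write(current << (8 - count));
            }
            return out.toByteArray();
        }

        public static void main(String[] args) {
            byte[] packed = pack("00110010001011");        // the 14-bit example string
            System.out.println(packed.length + " bytes");  // prints 2 bytes
        }
    }

A real compressor also needs to record how many bits are valid (or reserve a pseudo-EOF codeword) so the decoder knows where the padding in the last byte begins.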

64 Variations
- Reading the file twice is a pain.
  - Once to compute frequencies, and again to do the compression.
- It is possible to build an adaptive Huffman tree that adjusts itself as more data becomes available.

65 Beating Huffman
- How about doing better than Huffman!
- Impossible!
  - Huffman’s algorithm gives the optimal prefix code!
- Right.
  - But who says we have to use a prefix code?

66 Example
- Suppose we have a file containing
  - abcdabcdabcdabcdabcdabcd… abcdabcd
- This could be expressed very compactly as
  - abcd^1000
- We will discuss this concept next.

