
1 UNIT II TEXT COMPRESSION

2 Outline: Compression techniques – Run-length coding – Huffman coding – Adaptive Huffman coding – Arithmetic coding – Shannon–Fano coding – Dictionary techniques – LZW family algorithms.

3 Introduction Compression is the process of coding that effectively reduces the total number of bits needed to represent certain information. Fig. 1: A general data compression scheme.

4 Data compression implies sending or storing a smaller number of bits. Although many methods are used for this purpose, in general these methods can be divided into two broad categories: lossless and lossy methods. Figure: Data compression methods.

5 Introduction If the compression and decompression processes induce no information loss, then the compression scheme is lossless; otherwise, it is lossy. The compression ratio is defined as B0 / B1, where B0 is the number of bits before compression and B1 is the number of bits after compression. In general, we would like any codec (encoder/decoder scheme) to have a compression ratio much larger than 1.0. The higher the compression ratio, the better the lossless compression scheme, as long as it is computationally feasible.
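Written out in the slide's own B0/B1 notation (this is the standard definition):

    \text{compression ratio} = \frac{B_0}{B_1}

For example, a 40,000-bit input that is losslessly reduced to 10,000 bits has a compression ratio of 40,000 / 10,000 = 4.0.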

6 Basics of Information Theory What is entropy? Entropy is a measure of the number of specific ways in which a system may be arranged, commonly understood as a measure of the disorder of a system. As an example, if the information source S is a gray-level digital image, each s_i is a gray-level intensity ranging from 0 to 2^k − 1, where k is the number of bits used to represent each pixel in an uncompressed image. We need to find the entropy of this image, which gives the average number of bits per pixel needed to represent the image after lossless compression.
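The entropy referred to here is the standard Shannon entropy (the slide's own equation did not survive the transcript, so the usual definition is supplied):

    \eta = H(S) = \sum_{i=1}^{n} p_i \log_2 \frac{1}{p_i} = -\sum_{i=1}^{n} p_i \log_2 p_i

where p_i is the probability that symbol s_i occurs in S. For the gray-level image, if all 2^k levels were equally likely, every p_i = 2^{-k} and the entropy evaluates to k bits per pixel; any non-uniform histogram gives an entropy below k, and that gap is the room available for lossless compression.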

7 Run-Length Coding RLC is one of the simplest forms of data compression. The basic idea is that if the information source has the property that symbols tend to form continuous groups (runs), then such a symbol and the length of its run can be coded together. Consider a screen containing plain black text on a solid white background. There will be many long runs of white pixels in the blank space, and many short runs of black pixels within the text. Let us take a hypothetical single scan line, with B representing a black pixel and W representing white: WWWWWBWWWWBBBWWWWWWBWWW. Applying the run-length encoding (RLE) data compression algorithm to this scan line gives: 5W1B4W3B6W1B3W. The run-length code represents the original 23 characters in only 14.
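A minimal Python sketch (not from the slides) that reproduces the scan-line example above:

    def rle_encode(s):
        # Collapse each run of identical characters into "<count><char>".
        out = []
        i = 0
        while i < len(s):
            j = i
            while j < len(s) and s[j] == s[i]:
                j += 1
            out.append(str(j - i) + s[i])
            i = j
        return "".join(out)

    print(rle_encode("WWWWWBWWWWBBBWWWWWWBWWW"))
    # -> 5W1B4W3B6W1B3W  (14 characters for the 23-character input)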

8 Variable-Length Coding Variable-length coding (VLC) is one of the best-known entropy coding methods. Here, we will study the Shannon–Fano algorithm, Huffman coding, and adaptive Huffman coding.

9 Shannon–Fano Algorithm To illustrate the algorithm, let us suppose the symbols to be coded are the characters in the word HELLO. The frequency counts of the symbols are: H:1, E:1, L:2, O:1. The encoding steps of the Shannon–Fano algorithm can be presented in the following top-down manner: 1. Sort the symbols according to the frequency count of their occurrences. 2. Recursively divide the symbols into two parts, each with approximately the same total count, until each part contains only one symbol.

10 Shannon–Fano Algorithm A natural way of implementing the above procedure is to build a binary tree. As a convention, let us assign bit 0 to the left branches and bit 1 to the right branches. Initially, the symbols are sorted as LHEO. As Fig. 7.3 shows, the first division yields two parts: L with a count of 2, denoted L:(2); and H, E and O with a total count of 3, denoted H, E, O:(3). The second division yields H:(1) and E, O:(2). The last division is E:(1) and O:(1).
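A compact Python sketch of this top-down procedure (an illustration, not the slides' code). Note that when two split points are equally balanced, either choice is valid, which is why a second tree for HELLO appears a couple of slides later:

    def shannon_fano(symbols):
        # symbols: list of (symbol, count) pairs, sorted by descending count
        if len(symbols) == 1:
            return {symbols[0][0]: ""}
        total = sum(c for _, c in symbols)
        best_diff, split = total, 1
        for i in range(1, len(symbols)):
            left = sum(c for _, c in symbols[:i])
            diff = abs(total - 2 * left)   # how far this split is from an even count
            if diff < best_diff:
                best_diff, split = diff, i
        codes = {s: "0" + c for s, c in shannon_fano(symbols[:split]).items()}
        codes.update({s: "1" + c for s, c in shannon_fano(symbols[split:]).items()})
        return codes

    print(shannon_fano([("L", 2), ("H", 1), ("E", 1), ("O", 1)]))
    # -> {'L': '0', 'H': '10', 'E': '110', 'O': '111'}, matching Table 7.1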

11 Shannon–Fano Algorithm Fig. 7.3: Coding tree for HELLO by Shannon–Fano.

12 Table 7.1: Result of performing Shannon–Fano on HELLO

Symbol   Count   log2(1/p_i)   Code   # of bits used
L        2       1.32          0      2
H        1       2.32          10     2
E        1       2.32          110    3
O        1       2.32          111    3
TOTAL number of bits: 10

13 Fig. 7.4: Another coding tree for HELLO by Shannon–Fano.

14 Another result of performing Shannon–Fano on HELLO (see Fig. 7.4)

Symbol   Count   log2(1/p_i)   Code   # of bits used
L        2       1.32          00     4
H        1       2.32          01     2
E        1       2.32          10     2
O        1       2.32          11     2
TOTAL number of bits: 10

15 Shannon–Fano Algorithm: Analysis The Shannon–Fano algorithm delivers satisfactory coding results for data compression, but it was soon outperformed and overtaken by the Huffman coding method. However, the Huffman algorithm requires prior statistical knowledge about the information source, and such information is often not available. This is particularly true in multimedia applications, where future data is unknown before its arrival, as for example in live (or streaming) audio and video. Even when the statistics are available, the transmission of the symbol table could represent heavy overhead. The solution is to use adaptive Huffman coding, in which statistics are gathered and updated dynamically as the data stream arrives.

16 LOSSLESS COMPRESSION In lossless data compression, the integrity of the data is preserved. The original data and the data after compression and decompression are exactly the same because, in these methods, the compression and decompression algorithms are exact inverses of each other: no part of the data is lost in the process. Redundant data is removed in compression and added back during decompression. Lossless compression methods are normally used when we cannot afford to lose any data.

17 Run-length encoding Run-length encoding is probably the simplest method of compression. It can be used to compress data made of any combination of symbols. It does not need to know the frequency of occurrence of symbols, and it can be very efficient when the data is represented as 0s and 1s. The general idea behind this method is to replace consecutive repeating occurrences of a symbol by one occurrence of the symbol followed by the number of occurrences. The method can be even more efficient if the data uses only two symbols (for example 0 and 1) in its bit pattern and one symbol is more frequent than the other.

18 Figure: Run-length encoding example.

19 Figure: Run-length encoding for two symbols.
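As a rough sketch of the two-symbol idea (this is one common textbook variant and is only assumed to resemble the figure above; the field width and the handling of over-long runs may differ), only the lengths of the runs of 0s are transmitted, each as a fixed-width binary count:

    def rle_binary(bits, width=4):
        # Encode only the lengths of the runs of 0s, each as a fixed-width count.
        # (Real schemes also need a rule for runs longer than the field can hold,
        #  which is omitted in this sketch.)
        runs, run = [], 0
        for b in bits:
            if b == "0":
                run += 1
            else:            # a 1 terminates the current run of 0s
                runs.append(run)
                run = 0
        runs.append(run)     # trailing run of 0s (possibly of length zero)
        return " ".join(format(r, "0{}b".format(width)) for r in runs)

    print(rle_binary("00000000000001000011000000001"))
    # runs of 0s are 13, 4, 0, 8, 0  ->  1101 0100 0000 1000 0000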

20 Huffman coding Huffman coding assigns shorter codes to symbols that occur more frequently and longer codes to those that occur less frequently. For example, imagine we have a text file that uses only five characters (A, B, C, D, E). Before we can assign bit patterns to each character, we assign each character a weight based on its frequency of use. In this example, assume that the frequency of the characters is as shown in Table 15.1.

21 Figure: Huffman coding.

22 A character's code is found by starting at the root and following the branches that lead to that character. The code itself is the bit value of each branch on the path, taken in sequence. Figure: Final tree and code.
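A short Python sketch of the standard greedy construction behind such a tree: repeatedly merge the two lowest-weight nodes, prepending 0 for one branch and 1 for the other. The A–E frequencies below are illustrative stand-ins, since Table 15.1 itself is not reproduced in this transcript:

    import heapq

    def huffman_codes(weights):
        # weights: dict symbol -> frequency. Returns dict symbol -> bit string.
        heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(weights.items())]
        heapq.heapify(heap)
        i = len(heap)
        while len(heap) > 1:
            w1, _, c1 = heapq.heappop(heap)   # lowest-weight subtree
            w2, _, c2 = heapq.heappop(heap)   # second lowest
            merged = {s: "0" + c for s, c in c1.items()}
            merged.update({s: "1" + c for s, c in c2.items()})
            heapq.heappush(heap, (w1 + w2, i, merged))
            i += 1
        return heap[0][2]

    # Hypothetical frequencies for A..E (not necessarily the Table 15.1 values)
    print(huffman_codes({"A": 17, "B": 12, "C": 12, "D": 27, "E": 32}))
    # one valid result: A=00, B=010, C=011, D=10, E=11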

23 Encoding Let us see how to encode text using the codes for our five characters. Figure 15.6 (Huffman encoding) shows the original and the encoded text.

24 Decoding The recipient has a very easy job in decoding the data it receives. The figure (Huffman decoding) shows how decoding takes place.

25 Adaptive Huffman Coding Extended Huffman coding (grouping symbols together) is covered in the book. In adaptive Huffman coding, statistics are gathered and updated dynamically as the data stream arrives.

ENCODER:
    Initial_code();
    while not EOF {
        get(c);
        encode(c);
        update_tree(c);
    }

DECODER:
    Initial_code();
    while not EOF {
        decode(c);
        output(c);
        update_tree(c);
    }

26 Adaptive Coding Motivations: The previous algorithms (both Shannon–Fano and Huffman) require statistical knowledge of the source, which is often not available (e.g., live audio or video). Even when it is available, transmitting it can be a heavy overhead. Higher-order models incur even more overhead: for example, a 255-entry probability table would be required for an order-0 model, and an order-1 model would require 255 such probability tables (an order-1 model considers the probabilities of occurrences of pairs of symbols). The solution is to use adaptive algorithms. Adaptive Huffman coding is one such mechanism that we will study; the idea of adaptiveness is also applicable to other compression algorithms.

27 Adaptive Coding

ENCODER:
    Initialize_model();
    do {
        c = getc(input);
        encode(c, output);
        update_model(c);
    } while (c != eof);

DECODER:
    Initialize_model();
    while ((c = decode(input)) != eof) {
        putc(c, output);
        update_model(c);
    }

The key is that both the encoder and the decoder use exactly the same Initialize_model and update_model routines.

28 The Sibling Property Node numbers are assigned in such a way that: 1. A node with a higher weight will have a higher node number. 2. A parent node will always have a higher node number than its children. In a nutshell, the sibling property requires that the nodes (internal and leaf) are arranged in order of increasing weights. The update procedure swaps nodes that violate the sibling property. The identification of nodes in violation of the sibling property is achieved by using the notion of a block: all nodes that have the same weight are said to belong to one block.
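To make the property concrete, here is a small illustrative Python checker (not part of the original slides) over a flat list of nodes, each recorded with its node number, weight, and parent's node number:

    def satisfies_sibling_property(nodes):
        # nodes: list of dicts {"num": int, "weight": int, "parent": int or None}
        by_num = sorted(nodes, key=lambda n: n["num"])
        weights = [n["weight"] for n in by_num]
        # 1. Increasing node number must mean non-decreasing weight.
        if any(a > b for a, b in zip(weights, weights[1:])):
            return False
        # 2. Every parent must carry a higher node number than its children.
        return all(n["parent"] is None or n["parent"] > n["num"] for n in nodes)

    # Tiny example: a root (#3) with two leaves (#1 and #2)
    print(satisfies_sibling_property([
        {"num": 1, "weight": 1, "parent": 3},
        {"num": 2, "weight": 1, "parent": 3},
        {"num": 3, "weight": 2, "parent": None},
    ]))   # -> True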

29 Flowchart of the update procedure [Flowchart: on the first appearance of a symbol, the NYT node gives birth to a new NYT node and an external node for the symbol; the weights of the new external node and the old NYT node are incremented and node numbers adjusted. Otherwise, go to the symbol's external node. Then repeatedly: if the node's number is not the maximum in its block, switch it with the highest-numbered node in the block; increment the node weight; if this is the root node, stop, otherwise go to the parent node and repeat.] The Huffman tree is initialized with a single node, known as the Not-Yet-Transmitted (NYT) or escape code. This code is sent every time a new character, one not yet in the tree, is encountered, followed by the ASCII encoding of the character. This allows the decompressor to distinguish between a code and a new character. The procedure also creates a new node for the character and a new NYT node from the old NYT node. The root node will have the highest node number because it has the highest weight.

30 Example Counts (number of occurrences): B:2, C:2, D:2, E:10. Figure: the initial Huffman tree is just the NYT node (#0); a second tree shows an example Huffman tree after some symbols have been processed, in accordance with the sibling property.

31 Example Counts: A:1, B:2, C:2, D:2, E:10. Figure: the Huffman tree after the first appearance of symbol A; the old NYT node gives birth to a new NYT node (#0) and an external node for A (#1, W=1).

32 Increment Counts: A:1+1, B:2, C:2, D:2, E:10. Figure: an increment in the count for A propagates up to the root; every weight on the path from A's node to the root increases by one.

33 Swapping Counts: A:2+1, B:2, C:2, D:2, E:10. Figure: another increment in the count for A results in a swap; nodes 1 and 5 are exchanged so that the sibling property holds, and the weights along the path to the root are updated.

34 Swapping (contd.) Counts: A:3+1, B:2, C:2, D:2, E:10. Figure: another increment in the count for A simply propagates up to the root; no swap is needed.

35 Swapping (contd.) Counts: A:4+1, B:2, C:2, D:2, E:10. Figure: another increment in the count for A causes a swap of a sub-tree; nodes 5 and 6 are exchanged.

36 Swapping (contd.) Counts: A:4+1, B:2, C:2, D:2, E:10. Figure: further swapping is needed to fix the tree; nodes 8 and 9 are exchanged.

37 Swapping (contd.) Counts: A:5, B:2, C:2, D:2, E:10. Figure: the final tree after the swaps, with A's node at weight 5 and the root at weight 21.

38 Arithmetic Coding Arithmetic coding is based on the concept of interval subdivision. In arithmetic coding, a source ensemble is represented by an interval between 0 and 1 on the real number line. Each symbol of the ensemble narrows this interval. As the interval becomes smaller, the number of bits needed to specify it grows. Arithmetic coding assumes an explicit probabilistic model of the source. It uses the probabilities of the source messages to successively narrow the interval used to represent the ensemble. A high-probability message narrows the interval less than a low-probability message, so that high-probability messages contribute fewer bits to the coded ensemble.

39 Arithmetic Coding: Description In the following discussion, we will use M as the size of the alphabet of the data source, N[x] as symbol x's probability, and Q[x] as symbol x's cumulative probability (i.e., Q[i] = N[0] + N[1] + ... + N[i]). Assuming we know the probabilities of each symbol of the data source, we can allocate to each symbol an interval with width proportional to its probability, such that the intervals do not overlap. This can be done by using the cumulative probabilities as the two ends of each interval: the interval for symbol x runs from Q[x-1] to Q[x], and symbol x is said to own the range [Q[x-1], Q[x]).

40 Arithmetic Coding: Encoder We begin with the interval [0, 1) and subdivide it iteratively. For each symbol read, the current interval is divided according to the probabilities of the alphabet, and the sub-interval corresponding to that symbol is picked as the interval to be subdivided next. The procedure continues until all symbols in the message have been processed. Since the symbols' intervals do not overlap, each possible message is assigned a unique interval. We can represent the message with the interval's two ends [L, H); in fact, taking any single value in the interval as the encoded code is enough, and usually the left end L is selected.

41 Arithmetic Coding Algorithm

    L = 0.0; H = 1.0;
    while ((x = getc(input)) != EOF) {
        R = H - L;
        H = L + R * Q[x];
        L = L + R * Q[x-1];
    }
    output(L);

R is the interval range, and L and H are the two ends of the current code interval; they are initialized to 0 and 1, respectively. x is the new symbol to be encoded.

42 Arithmetic Coding: Encoder example

Symbol x   Probability N[x]   [Q[x-1], Q[x])
A          0.4                [0.0, 0.4)
B          0.3                [0.4, 0.7)
C          0.2                [0.7, 0.9)
D          0.1                [0.9, 1.0)

Encoding the string BCAB, the interval narrows step by step: start [0, 1); after B, [0.4, 0.7); after C, [0.61, 0.67); after A, [0.61, 0.634); after B, [0.6196, 0.6268). Code sent: 0.6196 (the lower end).

43 Decoding Algorithm When decoding, the code value v is located within the current code interval to find the symbol x such that Q[x-1] <= v < Q[x]. The procedure iterates until all symbols are decoded.

    v = input_code();
    for (;;) {
        x = find_symbol_straddling_this_range(v);
        putc(x);
        R = Q[x] - Q[x-1];
        v = (v - Q[x-1]) / R;
    }

Decoding 0.6196:

v        Output x   Q[x-1]   Q[x]   R
0.6196   B          0.4      0.7    0.3
0.732    C          0.7      0.9    0.2
0.16     A          0.0      0.4    0.4
0.4      B          0.4      0.7    0.3
0.0      (stop after four symbols: BCAB)
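Below is a compact runnable Python sketch of both loops (an illustration, not the slides' code). Exact rational arithmetic via fractions.Fraction is used so the BCAB example round-trips without floating-point rounding surprises, and the decoder is given the message length explicitly, anticipating the EOF issue discussed on the next slide; a production coder would use integer renormalization instead.

    from fractions import Fraction as F

    # Cumulative bounds per symbol: x -> (Q[x-1], Q[x]) for the example alphabet
    RANGES = {"A": (F(0), F(2, 5)), "B": (F(2, 5), F(7, 10)),
              "C": (F(7, 10), F(9, 10)), "D": (F(9, 10), F(1))}

    def arith_encode(message):
        low, high = F(0), F(1)
        for x in message:
            q_lo, q_hi = RANGES[x]
            r = high - low                    # width of the current interval
            low, high = low + r * q_lo, low + r * q_hi
        return low                            # any value in [low, high) would do

    def arith_decode(code, length):
        out, v = [], code
        for _ in range(length):
            for x, (q_lo, q_hi) in RANGES.items():
                if q_lo <= v < q_hi:          # symbol whose range straddles v
                    out.append(x)
                    v = (v - q_lo) / (q_hi - q_lo)
                    break
        return "".join(out)

    code = arith_encode("BCAB")
    print(float(code))                        # -> 0.6196
    print(arith_decode(code, 4))              # -> BCAB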

44 Arithmetic Coding: Issues The zero-frequency problem: each symbol's predicted probability must not be zero, or its interval will have zero width and interval renormalization will fail. This is called the zero-frequency problem. Models that adapt online (for example by decaying old counts) may run into it. The EOF problem: assume we pick the lower end of the interval as the encoded code. Two messages may then yield the same code if one is identical to the other except for a trailing run of the alphabet's first symbol (first in the table, not first in the sequence). For example, BCAB, BCABA, BCABAA and BCABAAA all have the same lower interval end but different upper ends (try it). The simplest solution is to let the decoder know the length of the encoded message; this works if the message size is fixed or can be transmitted up front. However, it is not feasible if the data size is not known beforehand, such as live broadcast data, or if transmitting it is too costly, such as tapes whose size is unknown at the beginning. Another solution is to introduce a special EOF symbol into the alphabet. This symbol takes a small interval and is used only at the end of the message; when the decoder detects the EOF symbol, it knows the end of the message has been reached.

45 Dictionary-Based Compression The compression algorithms we have studied so far use a statistical model to encode single symbols into bit strings that use fewer bits. Dictionary-based algorithms do not encode single symbols as variable-length bit strings; they encode variable-length strings of symbols as single tokens. The tokens form an index into a phrase dictionary; if the tokens are smaller than the phrases they replace, compression occurs. Dictionary-based compression is easy to understand because it uses a strategy programmers are familiar with: using indexes into a database to retrieve information from large amounts of storage (think of telephone numbers or postal codes).

46 Dictionary-Based Compression: Example Consider the Random House Dictionary of the English Language, Second Edition, Unabridged. Using this dictionary, the string "A good example of how dictionary based compression works" can be coded as: 1/1 822/3 674/4 1343/60 928/75 550/32 173/46 421/2. Coding: the dictionary is used as a simple lookup table, and each word is coded as x/y, where x gives the page in the dictionary and y gives the number of the word on that page. The dictionary has 2,200 pages with fewer than 256 entries per page, so x requires 12 bits and y requires 8 bits, i.e., 20 bits (2.5 bytes) per word. Using ASCII coding, the above string requires 48 bytes, whereas our encoding requires only 20 bytes (8 words at 2.5 bytes each): better than 50% compression.

47 Adaptive Dictionary-Based Compression Here the dictionary is built adaptively. This is necessary when the source data is not plain text, say audio or video data, and it is better tailored to the specific source. The original methods are due to Ziv and Lempel in 1977 (LZ77) and 1978 (LZ78); Terry Welch improved the scheme in 1984 (LZW compression), which is used in UNIX compress and in GIF. LZ77 is a sliding-window technique in which the dictionary consists of a set of fixed-length phrases found in a window into the previously processed text. LZ78 does not use fixed-length phrases from a window into the text; instead, it builds phrases up one symbol at a time, adding a new symbol to an existing phrase when a match occurs.

48 LZW Algorithm Preliminaries: A dictionary indexed by codes is used. The dictionary is assumed to be initialized with 256 entries (indexed with ASCII codes 0 through 255) representing the ASCII table. The compression algorithm assumes that the input is a file or buffer and the output is a file or a communication channel; conversely, the decompression algorithm assumes that the input is a file or a communication channel and the output is a file or buffer. Figure: file/buffer -> Compression -> compressed file / communication channel -> Decompression -> file/buffer.

49 LZW Algorithm LZW Compression:

    set w = NIL
    loop
        read a character k
        if wk exists in the dictionary
            w = wk
        else
            output the code for w
            add wk to the dictionary
            w = k
    endloop

The program reads one character at a time. If the extended string wk is in the dictionary, it becomes the new working string w and we wait for the next character (this also happens on the first character, since w is initially empty). If wk is not in the dictionary (for instance, when the second character arrives), the algorithm adds wk to the dictionary and sends over the wire (or writes to a file) the code already assigned to w, without the new character. It then sets the working string w to the new character.

50 Input string: ^WED^WE^WEE^WEB^WET

w     k     Output   Index   Symbol
NIL   ^
^     W     ^        256     ^W
W     E     W        257     WE
E     D     E        258     ED
D     ^     D        259     D^
^     W
^W    E     256      260     ^WE
E     ^     E        261     E^
^     W
^W    E
^WE   E     260      262     ^WEE
E     ^
E^    W     261      263     E^W
W     E
WE    B     257      264     WEB
B     ^     B        265     B^
^     W
^W    E
^WE   T     260      266     ^WET
T     EOF   T
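A direct Python translation of the pseudocode above (a sketch: it returns a list of integer codes, whereas a real implementation would pack them into fixed-width bit fields):

    def lzw_compress(text):
        # Dictionary pre-loaded with single characters (a stand-in for the 256-entry ASCII table).
        dictionary = {chr(i): i for i in range(256)}
        next_code = 256
        w, output = "", []
        for k in text:
            if w + k in dictionary:
                w = w + k                      # keep growing the current string
            else:
                output.append(dictionary[w])   # emit the code for the longest match
                dictionary[w + k] = next_code  # add the new string to the dictionary
                next_code += 1
                w = k
        if w:
            output.append(dictionary[w])       # flush the final string
        return output

    print(lzw_compress("^WED^WE^WEE^WEB^WET"))
    # -> [94, 87, 69, 68, 256, 69, 260, 261, 257, 66, 260, 84]
    #    i.e. ^ W E D 256 E 260 261 257 B 260 T, matching the trace above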

51 LZW Algorithm LZW Decompression:

    read a fixed-length token k (code or char)
    output k
    w = k
    loop
        read a fixed-length token k
        entry = dictionary entry for k
        output entry
        add w + first char of entry to the dictionary
        w = entry
    endloop

The nice thing is that the decompressor builds its own dictionary on its side that matches the compressor's exactly, so only the codes need to be sent.

52 Example of LZW decompression. Input (to decode): the code stream ^ W E D 256 E 260 261 257 B 260 T produced above.

w     k     Output   Index   Symbol
      ^     ^
^     W     W        256     ^W
W     E     E        257     WE
E     D     D        258     ED
D     256   ^W       259     D^
^W    E     E        260     ^WE
E     260   ^WE      261     E^
^WE   261   E^       262     ^WEE
E^    257   WE       263     E^W
WE    B     B        264     WEB
B     260   ^WE      265     B^
^WE   T     T        266     ^WET
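And a matching Python sketch of the decompressor. One detail worth adding beyond the slide's pseudocode: a general LZW decoder can receive a code that is not yet in its dictionary (the well-known corner case), in which case the entry is w plus the first character of w; it does not arise in this example but is included for completeness.

    def lzw_decompress(codes):
        dictionary = {i: chr(i) for i in range(256)}
        next_code = 256
        w = dictionary[codes[0]]
        output = [w]
        for k in codes[1:]:
            if k in dictionary:
                entry = dictionary[k]
            else:                      # code not yet in the dictionary: the
                entry = w + w[0]       # classic LZW special case
            output.append(entry)
            dictionary[next_code] = w + entry[0]   # add w + first char of entry
            next_code += 1
            w = entry
        return "".join(output)

    print(lzw_decompress([94, 87, 69, 68, 256, 69, 260, 261, 257, 66, 260, 84]))
    # -> ^WED^WE^WEE^WEB^WET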

53 LZW Algorithm: Discussion Where is the compression? The original string to decode is ^WED^WE^WEE^WEB^WET; the coded token stream is ^ W E D 256 E 260 261 257 B 260 T. Plain ASCII coding of the string: 19 x 8 bits = 152 bits. LZW coding of the string: 12 x 9 bits = 108 bits (7 single characters and 5 codes, each sent as a 9-bit token). Why 9 bits? An ASCII character has a value ranging from 0 to 255, all tokens have a fixed length, and there has to be a distinction in representation between an ASCII character and a code (assigned to strings of length 2 or more), so codes can only take values 256 and above. In a 9-bit token, a leading 0 marks an ASCII character (0 to 255) and a leading 1 marks a code (256 to 511).

54 LZW Algorithm: Discussion (continued) With 9 bits we can only have a maximum of 256 codes for strings of length 2 or above (the first 256 values are reserved for ASCII characters). The original LZW uses a dictionary with 4K entries and a fixed symbol/code length of 12 bits: values 0 to 255 are the ASCII characters and values 256 to 4095 are the codes, so with 12 bits we can have a maximum of 2^12 - 256 = 3,840 codes.

55 Practical implementations of the LZW algorithm follow one of two approaches. (1) Flush the dictionary periodically, so no codes are wasted. (2) Grow the length of the codes as the algorithm proceeds: start with 9-bit codes; once we run out of codes, increase the length to 10 bits; when we run out of 10-bit codes, increase to 11 bits, and so on. The second approach is more efficient.

9-bit codes:   0: ASCII (0-255),  1: codes 256-511
10-bit codes:  00: ASCII,  01: codes 256-511,  10: codes 512-767,  11: codes 768-1023
11-bit codes:  000: ASCII,  001: codes 256-511,  010: codes 512-767,  011: codes 768-1023,  100: codes 1024-1279,  101: codes 1280-1535,  110: codes 1536-1791,  111: codes 1792-2047

Each time the current width runs out of codes, the width is increased by one bit.
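That growth rule can be captured in a couple of lines (an illustrative sketch): the code width is simply the number of bits needed for the largest code handed out so far, never below the starting width of 9.

    def code_width(next_code, minimum=9):
        # Bits needed to represent codes 0..next_code-1, never below the starting width.
        return max(minimum, (next_code - 1).bit_length())

    for n in (256, 511, 512, 1023, 1024, 2048):
        print(n, "->", code_width(n + 1), "bits")   # width right after assigning code n
    # 256 and 511 -> 9 bits, 512 and 1023 -> 10 bits, 1024 -> 11 bits, 2048 -> 12 bits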

