CSE Lectures 22 – Huffman codes


1 CSE 30331 Lectures 22 – Huffman codes
Binary files Bit operations Using trees, bit ops and binary files Huffman compression

2 File Structure A text file contains ASCII characters with a newline sequence separating lines. A binary file consists of data objects that vary from a single character (byte) to more complex structures that include integers, floating point values, programmer-generated class objects, and arrays. Each data object in a file is a record.

3 Direct File Access The functions seekg() and seekp() reposition the read and write position, respectively. They take an offset argument indicating the number of bytes from the beginning (beg), ending (end), or current position (cur) in the file. The functions tellg() and tellp() return the current read and write position.

4 Reading & writing To read from a binary file
Use read(char *p, int num);
This reads num bytes of data from the file, beginning at the current read position in the file.
Example:
//read record 5 (records are numbered from 0) from the file
accountType acct;
int n = 5;
ifstream infile;
infile.open("accounts.dat", ios::in | ios::binary);
infile.seekg(n*sizeof(accountType), ios::beg);
infile.read((char *)&acct, sizeof(accountType));

5 Reading & writing To write to a binary file
Use write(char *p, int num);
This writes num bytes of data to the file, beginning at the current write position in the file.
Example:
//write record 5 (records are numbered from 0) to the file
//ios::in is included so the existing records are kept; ios::out alone truncates the file
accountType acct; //assumed to be filled in before the write
int n = 5;
ofstream outfile;
outfile.open("accounts.dat", ios::in | ios::out | ios::binary);
outfile.seekp(n*sizeof(accountType), ios::beg);
outfile.write((char *)&acct, sizeof(accountType));
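The read and write slides can be combined into a small round trip. The following is a minimal sketch, not the course's code: the fields of accountType are not shown on the slides, so the two int members here are assumptions, as are the helper names createFile, writeRecord, and readRecord.

```cpp
#include <cassert>
#include <fstream>
#include <ios>
#include <string>

// Hypothetical fixed-size record; the slides do not show accountType's fields.
struct accountType {
    int id;
    int cents;
};

// Byte offset of record n (records are numbered from 0).
std::streamoff recordOffset(int n) {
    assert(n >= 0);
    return static_cast<std::streamoff>(n) *
           static_cast<std::streamoff>(sizeof(accountType));
}

// Create (or truncate) a file holding `count` zeroed records.
bool createFile(const std::string& path, int count) {
    std::ofstream out(path, std::ios::out | std::ios::binary | std::ios::trunc);
    const accountType blank{0, 0};
    for (int i = 0; i < count; ++i)
        out.write(reinterpret_cast<const char*>(&blank), sizeof(accountType));
    return static_cast<bool>(out);
}

// Overwrite record n in an existing file without disturbing the others.
bool writeRecord(const std::string& path, int n, const accountType& acct) {
    // in | out keeps the current contents; out alone would truncate the file.
    std::fstream f(path, std::ios::in | std::ios::out | std::ios::binary);
    if (!f) return false;
    f.seekp(recordOffset(n), std::ios::beg);
    f.write(reinterpret_cast<const char*>(&acct), sizeof(accountType));
    return static_cast<bool>(f);
}

// Read record n; returns false if the file or the record does not exist.
bool readRecord(const std::string& path, int n, accountType& acct) {
    std::ifstream f(path, std::ios::in | std::ios::binary);
    if (!f) return false;
    f.seekg(recordOffset(n), std::ios::beg);
    f.read(reinterpret_cast<char*>(&acct), sizeof(accountType));
    return static_cast<bool>(f);
}
```

Calling createFile("accounts.dat", 8), then writeRecord("accounts.dat", 5, acct), then readRecord("accounts.dat", 5, copy) should leave copy equal to acct while records 0 to 4 stay zeroed. The raw bytes are written in the host's layout, so a file produced this way is only portable between machines with the same int size and byte order.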

6 Bit operations (a reminder)
Bitwise ops
And ( & ) 0101 & 0110 -> 0100
Or ( | ) 0101 | 0110 -> 0111
Xor ( ^ ) 0101 ^ 0110 -> 0011
Not ( ~ ) ~0101 -> 1010
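The four results on this slide can be checked directly in C++. This is a small sketch that assumes the second operand is 0110, the only value consistent with the AND, OR, and XOR results shown; the helper low4 and the function checkSlideExamples are illustrative names, not part of the lecture code.

```cpp
#include <cassert>
#include <cstdint>

// Keep only the low four bits so results line up with the 4-bit slide values.
// (~ on a small integer is computed on a promoted int, so the upper bits of
// the result are all 1s until they are masked off.)
constexpr std::uint8_t low4(unsigned v) {
    return static_cast<std::uint8_t>(v & 0x0Fu);
}

constexpr std::uint8_t kA = 0b0101;
constexpr std::uint8_t kB = 0b0110;  // assumed second operand

// Returns true after confirming each slide result.
bool checkSlideExamples() {
    assert(low4(kA & kB) == 0b0100);  // And
    assert(low4(kA | kB) == 0b0111);  // Or
    assert(low4(kA ^ kB) == 0b0011);  // Xor
    assert(low4(~kA) == 0b1010);      // Not
    return true;
}
```

Binary literals such as 0b0101 require C++14 or later.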

7 Implementing a bitVector Class
bitMask() returns an unsigned character value containing a 1 in the bit position representing i.
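A minimal sketch of how bitMask() might look, assuming (as the summary slides state) that bit 0 is the left-most bit of the sequence. The class name BitVectorSketch and its layout are hypothetical; the course's bitVector class is not shown in this transcript.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Mask with a single 1 at the position of bit i inside its byte.
// Bit 0 is the left-most (most significant) bit of the byte.
inline unsigned char bitMask(std::size_t i) {
    return static_cast<unsigned char>(0x80u >> (i % 8));
}

// Hypothetical stand-in for the lecture's bitVector class.
class BitVectorSketch {
public:
    explicit BitVectorSketch(std::size_t numBits)
        : size_(numBits), bytes_((numBits + 7) / 8, 0) {}

    void set(std::size_t i) {
        assert(i < size_);
        bytes_[i / 8] |= bitMask(i);
    }
    void clear(std::size_t i) {
        assert(i < size_);
        bytes_[i / 8] &= static_cast<unsigned char>(~bitMask(i));
    }
    bool bit(std::size_t i) const {
        assert(i < size_);
        return (bytes_[i / 8] & bitMask(i)) != 0;
    }
    std::size_t size() const { return size_; }

private:
    std::size_t size_;                  // number of bits
    std::vector<unsigned char> bytes_;  // packed storage, 8 bits per byte
};
```

Bit i lives in byte i / 8, and i % 8 selects its position within that byte, so bitMask(0) is 10000000 and bitMask(7) is 00000001.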

8 Lossless Compression Data compression loses no information
Original data can be recovered exactly from the compressed data Normally applied to "discrete" data, such as text, word processing files, computer applications, and so forth

9 Lossy Compression Loses some information during compression and the data cannot be recovered exactly Shrinks the data further than lossless compression techniques Sound files often use this type of compression

10 Huffman Compression A lossless compression technique
Counts occurrences of eight-bit characters in the data
Uses the counts to construct variable-length codes, shorter for more frequently occurring characters
No code is a prefix of another code, so each code can be recognized unambiguously
The encoding (compression) process creates an “optimal” binary tree representing these prefix codes
Uses a “greedy approach”
Makes use of data on hand to choose the best option
Example: Dijkstra’s algorithm is a greedy approach
Typically achieves compression ratios of around 1.8 (about 45% reduction) on text; not as good on binary data

11 Example Huffman Tree
Each internal node contains the sum of its children’s frequencies
The edge to a left child is a 0 bit and the edge to a right child is a 1 bit
Leaves contain the original letters and their frequencies

              57
            /    \
          21      36
         /  \    /  \
      c:8   13 e:20  a:16
           /  \
        d:6    7
              / \
           f:3   b:4

Codes: a 11, b 0111, c 00, d 010, e 10, f 0110

12 Building Huffman Code Trees
Read the file and determine the frequency of each letter
Store nodes (letters and frequencies) in a minimum priority queue
  Probably implemented as a heap, with ordering based on frequencies
Loop until only one node is left in the queue
  Remove the two smallest-valued nodes from the queue
  Make them the two children of a new root node whose value equals their sum
  Add the new node to the queue
Result is a tree rooted at the last node remaining in the queue
No code is a prefix of another code
  Codes are derived for each letter (leaf node) by traversing links in the tree from root to leaf
  Left is a 0 bit in the code; right is a 1 bit in the code
The length of each code is the depth of its leaf in the tree
So: shortest codes for the most frequently occurring data values, longest codes for the least frequently occurring data values
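The loop described above can be sketched with std::priority_queue. This is an illustrative version under stated assumptions, not the course's hCompress class: HNode, buildHuffmanTree, and huffmanCodes are hypothetical names, and the first node removed is arbitrarily made the left child. Because of that choice, the 0/1 labels of sibling nodes may differ from the slides (here a and e swap codes), but every code length matches.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <queue>
#include <string>
#include <utility>
#include <vector>

struct HNode {
    char ch;    // meaningful only for leaves
    int freq;   // frequency (leaf) or sum of children (internal)
    int left;   // index of left child, -1 for a leaf
    int right;  // index of right child, -1 for a leaf
};

// Builds the tree as a vector of nodes; the root is the last element.
std::vector<HNode> buildHuffmanTree(const std::map<char, int>& freqs) {
    assert(!freqs.empty());
    std::vector<HNode> nodes;
    using Item = std::pair<int, int>;  // (frequency, node index)
    std::priority_queue<Item, std::vector<Item>, std::greater<Item>> pq;

    for (const auto& kv : freqs) {
        nodes.push_back({kv.first, kv.second, -1, -1});
        pq.push({kv.second, static_cast<int>(nodes.size()) - 1});
    }
    while (pq.size() > 1) {
        Item a = pq.top(); pq.pop();  // smallest
        Item b = pq.top(); pq.pop();  // second smallest
        nodes.push_back({'\0', a.first + b.first, a.second, b.second});
        pq.push({a.first + b.first, static_cast<int>(nodes.size()) - 1});
    }
    return nodes;
}

// Depth-first walk: left edge appends '0', right edge appends '1'.
void collectCodes(const std::vector<HNode>& nodes, int id,
                  const std::string& prefix,
                  std::map<char, std::string>& codes) {
    const HNode& n = nodes[id];
    if (n.left == -1) {  // leaf
        codes[n.ch] = prefix;
        return;
    }
    collectCodes(nodes, n.left, prefix + '0', codes);
    collectCodes(nodes, n.right, prefix + '1', codes);
}

// Note: a file with a single distinct character gets an empty code here.
std::map<char, std::string> huffmanCodes(const std::map<char, int>& freqs) {
    const std::vector<HNode> nodes = buildHuffmanTree(freqs);
    std::map<char, std::string> codes;
    collectCodes(nodes, static_cast<int>(nodes.size()) - 1, "", codes);
    return codes;
}
```

With the example frequencies (a:16, b:4, c:8, d:6, e:20, f:3), the tree has 11 nodes, the root holds 57, and the code lengths are 2, 4, 2, 3, 2, 4 for a through f.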

13 Huffman tree The Huffman code tree is optimal in this sense
All internal nodes have two children, so there are no unused prefixes.
So, the number of shorter codes is the maximum number possible given the frequencies in the data.
The size of the compressed data, in bits, is $\text{cost} = \sum_{ch} f(ch) \cdot d(ch)$, where f(ch) is the frequency of ch and d(ch) is the number of bits in its code.
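As a worked instance of this sum, take the frequencies and code lengths from the example tree (a:16 with 2 bits, b:4 with 4 bits, c:8 with 2 bits, d:6 with 3 bits, e:20 with 2 bits, f:3 with 4 bits):

```latex
\begin{aligned}
\sum_{ch} f(ch) \cdot d(ch)
  &= 16 \cdot 2 + 4 \cdot 4 + 8 \cdot 2 + 6 \cdot 3 + 20 \cdot 2 + 3 \cdot 4 \\
  &= 32 + 16 + 16 + 18 + 40 + 12 \\
  &= 134 \text{ bits}
\end{aligned}
```

The same 57 characters stored at 8 bits each take 57 * 8 = 456 bits. This count covers only the encoded data; the tree that must be stored alongside it is not included.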

14 Building a Huffman Tree

15 Building a Huffman Tree (after first pass)
(f:3) and (b:4) were lowest frequency nodes, so they were joined to a parent (7), which was then added back to the queue

16 Building a Huffman Tree (after second pass)
(d:6) and (7) were the lowest-frequency nodes, so they were joined to a parent (13), which was then added back to the queue
Priority queue now holds: c:8, 13, a:16, e:20

17 Building a Huffman Tree (after third pass)
(c:8) and (13) were the lowest-frequency nodes, so they were joined to a parent (21), which was then added back to the queue
Priority queue now holds: a:16, e:20, 21

18 Building a Huffman Tree (after fourth pass)
(e:20) and (a:16) were the lowest-frequency nodes, so they were joined to a parent (36), which was then added back to the queue
Priority queue now holds: 21, 36

19 Building a Huffman Tree (after last pass)
(21) and (36) were the lowest-frequency nodes, so they were joined to a parent (57), which was then added back to the queue
Priority queue now holds only 57, the root of the finished tree

20 The Huffman tree in memory
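ID  ch   freq  pID  left  right  code
0   A    16    9    -1    -1     11
1   B    4     6    -1    -1     0111
2   C    8     8    -1    -1     00
3   D    6     7    -1    -1     010
4   E    20    9    -1    -1     10
5   F    3     6    -1    -1     0110
6   Int  7     7    5     1
7   Int  13    8    3     6
8   Int  21    10   2     7
9   Int  36    10   4     0
10  Int  57    -    8     9
Node 10 is the root and has no parent.
Sample compression: “face” = 0110 11 00 10
# of bits: 4*8 = 32 uncompressed vs 4+2+2+2 = 10 compressed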
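The "face" sample can be reproduced by concatenating the codes from the table and packing the bits into bytes. This is a sketch only: exampleCodes, encodeBits, and packBits are hypothetical helpers, and padding the final byte with 0 bits is an assumption about how leftover bits are handled.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Code table copied from the example tree on the slides.
const std::map<char, std::string>& exampleCodes() {
    static const std::map<char, std::string> codes = {
        {'a', "11"},  {'b', "0111"}, {'c', "00"},
        {'d', "010"}, {'e', "10"},   {'f', "0110"}};
    return codes;
}

// Concatenate the code of each character as a string of '0' and '1'.
// Throws std::out_of_range if a character has no code.
std::string encodeBits(const std::string& text,
                       const std::map<char, std::string>& codes) {
    std::string bits;
    for (char ch : text) bits += codes.at(ch);
    return bits;
}

// Pack the bit string into bytes, bit 0 left-most in each byte.
// The unused tail of the last byte is left as 0 bits (an assumption).
std::vector<unsigned char> packBits(const std::string& bits) {
    std::vector<unsigned char> bytes((bits.size() + 7) / 8, 0);
    for (std::size_t i = 0; i < bits.size(); ++i) {
        assert(bits[i] == '0' || bits[i] == '1');
        if (bits[i] == '1')
            bytes[i / 8] |= static_cast<unsigned char>(0x80u >> (i % 8));
    }
    return bytes;
}
```

encodeBits("face", exampleCodes()) gives the 10-bit string 0110110010, which packBits stores in two bytes, 0x6C and 0x80. Because the last byte is padded, a decoder also needs the bit count to know where the real data ends.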

21 The Huffman tree in file
(Same node table as on the previous slide: ID, ch, freq, pID, left, right, code.)
Only the gray fields (ch, left, right) are written to store the tree in the compressed file.
The tree can then be rebuilt from the ch values and the left and right child indices read from the file.
The last node is the root, and codes can be rediscovered as bits are read from the file and the tree is followed from root to leaf.

22 Format of compressed file
There are four parts
1. Size of tree
2. The tree: vector of (ch, leftID, rightID) data
3. Size of compressed data
4. The compressed data
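One way the four parts could be laid out on disk is sketched below. The slides do not give field widths or say whether the data size is in bits or bytes, so everything concrete here is an assumption: a 16-bit node count, 16-bit child indices, a 32-bit count of compressed bits, host byte order, and the names DiskNode, CompressedImage, writeImage, and readImage.

```cpp
#include <cassert>
#include <cstdint>
#include <ios>
#include <istream>
#include <ostream>
#include <sstream>
#include <vector>

struct DiskNode {
    char ch;
    std::int16_t left;   // -1 for a leaf
    std::int16_t right;  // -1 for a leaf
};

struct CompressedImage {
    std::vector<DiskNode> tree;       // root is the last node
    std::uint32_t bitCount = 0;       // number of meaningful bits in data
    std::vector<unsigned char> data;  // packed bits, last byte zero-padded
};

template <typename T>
void putRaw(std::ostream& out, const T& v) {
    out.write(reinterpret_cast<const char*>(&v), sizeof(T));
}

template <typename T>
bool getRaw(std::istream& in, T& v) {
    in.read(reinterpret_cast<char*>(&v), sizeof(T));
    return static_cast<bool>(in);
}

void writeImage(std::ostream& out, const CompressedImage& img) {
    assert(img.tree.size() <= 0xFFFFu);
    putRaw(out, static_cast<std::uint16_t>(img.tree.size()));  // 1. size of tree
    for (const DiskNode& n : img.tree) {                       // 2. the tree
        putRaw(out, n.ch);
        putRaw(out, n.left);
        putRaw(out, n.right);
    }
    putRaw(out, img.bitCount);                                 // 3. size of data
    out.write(reinterpret_cast<const char*>(img.data.data()),  // 4. the data
              static_cast<std::streamsize>(img.data.size()));
}

bool readImage(std::istream& in, CompressedImage& img) {
    std::uint16_t n = 0;
    if (!getRaw(in, n)) return false;
    img.tree.resize(n);
    for (DiskNode& node : img.tree) {
        if (!getRaw(in, node.ch) || !getRaw(in, node.left) ||
            !getRaw(in, node.right))
            return false;
    }
    if (!getRaw(in, img.bitCount)) return false;
    img.data.resize((img.bitCount + 7) / 8);
    in.read(reinterpret_cast<char*>(img.data.data()),
            static_cast<std::streamsize>(img.data.size()));
    return static_cast<bool>(in);
}
```

Fields are written one at a time rather than as whole structs so that compiler padding never reaches the file. The functions take generic streams, so they work with an fstream opened with ios::binary or with a std::stringstream for experimenting in memory.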

23 Uncompressing tree in file
ID  ch   left  right  code
0   A    -1    -1     11
1   B    -1    -1     0111
2   C    -1    -1     00
3   D    -1    -1     010
4   E    -1    -1     10
5   F    -1    -1     0110
6   Int  5     1
7   Int  3     6
8   Int  2     7
9   Int  4     0
10  Int  8     9
Read size of tree
Read tree from file into vector or array
Read size of compressed data
Start at root (the last node in the vector)
For each bit (b) read
  If (b==0) move to left child
  If (b==1) move to right child
  If now at a leaf, append leaf’s letter to uncompressed data, and return to root
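The decoding steps above can be sketched as follows. The node numbering is an assumption reconstructed from the example tree (leaves a to f at indices 0 to 5, internal nodes in the order they were created, root last), and FileNode, exampleTree, and decodeBits are hypothetical names. Bits are passed as a string of '0' and '1' to keep the sketch short.

```cpp
#include <cassert>
#include <string>
#include <vector>

// One stored tree entry: the (ch, left, right) fields kept in the file.
struct FileNode {
    char ch;    // letter for a leaf, unused for internal nodes
    int left;   // index of left child, -1 for a leaf
    int right;  // index of right child, -1 for a leaf
};

// The example tree, with the assumed numbering described above.
std::vector<FileNode> exampleTree() {
    return {
        {'a', -1, -1}, {'b', -1, -1}, {'c', -1, -1},
        {'d', -1, -1}, {'e', -1, -1}, {'f', -1, -1},
        {'\0', 5, 1},   // 6:  f + b  = 7
        {'\0', 3, 6},   // 7:  d + 7  = 13
        {'\0', 2, 7},   // 8:  c + 13 = 21
        {'\0', 4, 0},   // 9:  e + a  = 36
        {'\0', 8, 9}};  // 10: root   = 57
}

// Follow one edge per bit; emit a letter and restart at the root on a leaf.
// Assumes the tree has at least two leaves, so the root is never a leaf.
std::string decodeBits(const std::vector<FileNode>& tree,
                       const std::string& bits) {
    assert(!tree.empty());
    const int root = static_cast<int>(tree.size()) - 1;
    int cur = root;
    std::string out;
    for (char b : bits) {
        cur = (b == '0') ? tree[cur].left : tree[cur].right;
        assert(cur >= 0);
        if (tree[cur].left == -1) {  // reached a leaf
            out += tree[cur].ch;
            cur = root;
        }
    }
    return out;
}
```

decodeBits(exampleTree(), "0110110010") walks root, 8, 7, 6, 5 for the first four bits, emits f, and continues from the root to produce "face".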

24 Uncompressing “face”
(Node table as on the previous slide; decoding starts at the root, node 10.)
Bit data: 0110110010
Bit  node  letter
0    8
1    7
1    6
0    5     ‘f’
1    9
1    0     ‘a’
0    8
0    2     ‘c’
1    9
0    4     ‘e’

25 Summary Binary File A sequence of 8-bit characters without the requirement that a character be printable and with no concern for a newline sequence that terminates lines. Often organized as a sequence of records: record 0, record 1, record 2, ..., record n-1. Used for both input and output; the C++ header <fstream> contains the operations to support these types of files. The open() function must use the attribute ios::binary.

26 Summary Binary File (Cont…)
For direct access to a file record, use seekg() (for reading) or seekp() (for writing), which move the file pointer to a file record. Each accepts an argument that specifies motion from the beginning of the file (ios::beg), from the current position of the file pointer (ios::cur), or from the end of the file (ios::end). Use the read() function to input a sequence of bytes from the file into a block of memory, and the write() function to output a block of memory to a binary file.

27 Summary Bit Manipulation Operators
| (OR), & (AND), ^ (XOR), ~ (NOT), << (shift left), and >> (shift right) are used to perform operations on specific bits within a character or integer value. The class bitVector uses operator overloading to treat a sequence of bits as an array, with bit 0 the left-most bit of the sequence. bit(), set(), and clear() allow access to specific bits. The class has I/O operations for binary files and the stream operator << that outputs a bit vector as an ASCII sequence of 0 and 1 values.

28 Summary File Compression Algorithm
Encodes a file as a sequence of characters that consumes less disk space than the original file. Two types of compression algorithms: 1) lossless compression: restores the original file exactly. Approach: count the frequency of occurrence of each character in the file and assign a prefix bit code to each character. Compressed size: the sum of the products of each bit-code length and the frequency of occurrence of the corresponding character.

29 Summary File Compression Algorithm (Cont…)
2) lossy compression: loses some information during compression, and the data cannot be recovered exactly. Normally used with sound and video files. The Huffman compression algorithm is a lossless process that builds optimal prefix codes by constructing a tree with the most frequently occurring characters (shorter bit codes) as leaves close to the root, and the less frequently occurring characters (longer bit codes) as leaves farther from the root.

30 Summary File Compression Algorithm (Cont…)
If the file contains n distinct characters, the tree-building loop concludes after n-1 iterations, having built a Huffman tree containing n-1 internal nodes. Implementation requires the use of a minimum priority queue (heap), bit operations, and binary files. The use of the bitVector class simplifies the construction of the classes hCompress and hDecompress, which perform Huffman compression and decompression. Works better with text files; they tend to have fewer unique characters than binary files.

