Mastering File Compression Part #1

Presentation on theme: "Mastering File Compression Part #1"— Presentation transcript:

1 Mastering File Compression Part #1
Theory and Context. Topics: What is file compression? Lossy and lossless compression. Algorithms for data compression. *Please note: If you are delivering the OCR GCSE, this guide is for teachers only. Please do not share pseudocode or solutions with your students.

2 First, see if you can solve this puzzle!
Refer to the index of words on the left: 1. Such 2. Computer 3. Is 4. Science 5. Teaching 6. Fun. What does 5,2,4,3,1,6 say?! Answer: 5. Teaching 2. Computer 4. Science 3. Is 1. Such 6. Fun, which reads "Teaching Computer Science is such fun".
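The puzzle can be checked with a few lines of Python. This is only a sketch: the names `words`, `codes` and `decode` are made up for illustration, and the word list is the one shown on the slide.

```python
# Index of words from the slide: position 1 is "Such", position 6 is "Fun".
words = ["Such", "Computer", "Is", "Science", "Teaching", "Fun"]
codes = [5, 2, 4, 3, 1, 6]

def decode(codes, words):
    # The slide numbers the words from 1, so subtract 1 for Python's 0-based lists.
    return " ".join(words[c - 1] for c in codes)

message = decode(codes, words)
print(message)
```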

3 Compare the space taken up by both
5,2,4,3,1,6 (with the word list Such, Computer, Is, Science, Teaching, Fun) VS Teaching Computer Science is such fun. At first glance you may not see the benefit of this sort of ‘compression’ (storing words as a sequence of numbers), but imagine an entire book with the same sentence repeated, and you’ll appreciate where we’re going with this!

4 Space is precious!
Data compression is all about finding a way to save space. There are 3 main principles: find repeating patterns in a file; create a dictionary of those repeating patterns; replace each pattern with a reference to its dictionary entry. Reference: 5,2,4,3,1,6. Dictionary: the list that we refer to for compression and extraction. VS the full text, in which the sentence "Teaching Computer Science is such fun." is repeated over and over until it fills the slide.
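The three principles can be sketched in Python as follows. This is a minimal illustration, not a real compressor: it assumes words are separated by single spaces, and the function names `compress` and `decompress` are hypothetical.

```python
def compress(text):
    """Return (dictionary, codes) for a text whose words are separated by single spaces."""
    dictionary = []   # unique words, in order of first appearance
    index_of = {}     # word -> 1-based position in `dictionary`
    codes = []        # the text rewritten as references into the dictionary
    for word in text.split():
        if word not in index_of:
            dictionary.append(word)
            index_of[word] = len(dictionary)
        codes.append(index_of[word])
    return dictionary, codes

def decompress(dictionary, codes):
    # Look every reference up again and rebuild the text.
    return " ".join(dictionary[c - 1] for c in codes)

sentence = "Teaching Computer Science is such fun."
text = " ".join([sentence] * 40)   # the same sentence repeated, as on the slide

dictionary, codes = compress(text)
restored = decompress(dictionary, codes)
print(len(dictionary), "dictionary entries,", len(codes), "references")
```

However long the repeated text gets, the dictionary stays at six entries. How much space is really saved depends on how the references are stored, which this sketch does not model.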

5 So what is File Compression?
It’s highly likely that you’ve come across file compression in some form or other. Have you ever tried to ‘compress’ a JPEG image? (Compression, in this instance, simply means reducing the file size, which has obvious advantages.) Note the picture on the right: the image quality decreases, but the file size is also reduced.

6 Lossy vs Lossless: LOSSY
These are two rather peculiar words, I know. Their meanings are simple, however, and it is useful to know them before moving on with compression. In lossy compression the exact original data is not retained after compression. It is called "lossy" because a picture, for example, can be saved into smaller and smaller files, but each time the image is degraded: the structure is still visible, but the details are lost. This means that when the file is recreated it is not identical to the original.
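To see why a lossy file cannot be restored exactly, here is a toy sketch in Python. It is not how JPEG works; it simply rounds numbers into coarse buckets. The sample values, the `STEP` size and the function names are all made up for illustration.

```python
samples = [12, 13, 15, 200, 203, 198, 57, 61]  # hypothetical brightness values (0-255)

STEP = 16  # a larger step means fewer distinct values to store, but more detail lost

def lossy_compress(values, step=STEP):
    # Keep only which "bucket" each value falls into.
    return [v // step for v in values]

def lossy_restore(buckets, step=STEP):
    # Best guess on the way back: the middle of each bucket.
    return [b * step + step // 2 for b in buckets]

compressed = lossy_compress(samples)
restored = lossy_restore(compressed)
print(samples)
print(restored)  # close to the original, but not identical
```

The restored values keep the overall structure (dark, bright, medium) while the fine differences between neighbouring values are gone for good.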

7 Lossy vs Lossless: LOSSLESS
In lossless data compression the algorithms allow the original data to be perfectly reconstructed from the compressed data. With lossless compression, every bit of data that was originally in the file remains after the file is uncompressed. All information is restored. This is generally the technique of choice for text or spreadsheet files, where losing words or financial data could pose a problem. The Graphics Interchange Format (GIF) is an image format used on the Web that provides lossless compression.
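A quick way to see the lossless round trip is Python's standard `zlib` module. This is just a sketch to show that the restored data matches the original exactly; the sample text is reused from the earlier slides.

```python
import zlib

# The same sentence repeated many times compresses very well.
original = ("Teaching Computer Science is such fun. " * 40).encode("utf-8")

compressed = zlib.compress(original)
restored = zlib.decompress(compressed)

print(len(original), "bytes before,", len(compressed), "bytes after")
print("Identical after decompression:", restored == original)
```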

8 Quick recap – fill in the blanks!
Lossless and lossy compression are terms that describe whether or not, in the compression of a file, all original data can be recovered when the file is uncompressed. With lossless compression, every single bit of data that was originally in the file remains after the file is uncompressed. All of the information is completely restored. On the other hand, lossy compression reduces a file by permanently eliminating certain information, especially redundant information. When the file is uncompressed, only a part of the original information is still there (although the user may not notice it).

9 Animation to illustrate the difference
[Animation: LOSSY vs LOSSLESS, each shown in three stages: Original, Compressed, Restored]

10 Lossy compression: Uses
Lossy compression is generally used for video and sound, where a certain amount of information loss will not be detected by most users. JPEG, commonly used for photographs and other complex still images on the Web, is an image format that uses lossy compression. Using JPEG compression, the creator can decide how much loss to introduce and make a trade-off between file size and image quality.

11 Lossless compression: Uses
Lossless data compression is used in many applications. For example, it is used in the ZIP file format and in the GNU tool gzip. Lossless compression is used in cases where it is important that the original and the decompressed data be identical. Typical examples are executable programs, text documents, and source code. Some image file formats, like PNG or GIF, use only lossless compression, while others like TIFF and MNG may use either lossless or lossy methods. Lossless audio formats are most often used for archiving or production purposes, while smaller lossy audio files are typically used on portable players and in other cases where storage space is limited or exact replication of the audio is unnecessary. Source of image:

12 If you’re interested: The Hutter Prize
If you get really into this, you may want to have a look at the HUTTER PRIZE. The goal of the Hutter Prize is to encourage research in artificial intelligence (AI). The organizers further believe that compressing natural language text is a hard AI problem, equivalent to passing the Turing test. Thus, progress toward one goal represents progress toward the other.[4] They argue that predicting which characters are most likely to occur next in a text sequence requires vast real-world knowledge. A text compressor must solve the same problem in order to assign the shortest codes to the most likely text sequences. The Hutter Prize is a cash prize funded by Marcus Hutter which rewards data compression improvements on a specific 100 MB English text file. Specifically, the prize awards 500 euros for each one percent improvement (with 50,000 euros total funding)[1] in the compressed size of the file enwik8, which is the smaller of two files used in the Large Text Compression Benchmark; enwik8 is the first 100,000,000 characters of a specific version of English Wikipedia.[2] The ongoing competition is organized by Hutter, Matt Mahoney, and Jim Bowery.
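The idea of giving the shortest codes to the most likely symbols can be illustrated with a small Python sketch. It assumes probabilities are estimated from simple character counts in one short sample sentence, which is a far cruder model than a serious text compressor would use; the variable names are made up for illustration.

```python
import math
from collections import Counter

text = "we shall fight on the beaches"
counts = Counter(text)
total = len(text)

# For a symbol with probability p, the ideal code length is -log2(p) bits:
# the more likely the symbol, the shorter its code.
bits = {ch: -math.log2(n / total) for ch, n in counts.items()}

for ch, n in counts.most_common():
    print(repr(ch), "appears", n, "times, ideal code length", round(bits[ch], 2), "bits")
```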

13 So, we’ve established that
Compression is a pretty important thing to be able to do in Computer Science. There are prizes out there for effective algorithms and for being able to code compression programs. BUT HOW DO WE GO ABOUT CODING A COMPRESSION PROGRAM?!

14 Coding a compression program
The JPEG standards are mathematically rather complex, but all compression uses certain underlying basic principles. You can understand these principles by looking at how a text file would appear if logically similar techniques are applied to it. (The dictionary technique explored next loses no data, so strictly speaking it is lossless rather than lossy.) Image source: Let’s explore what this would look like and how it works …

15 A compression program
Analyse the text on the right carefully. It is from a very famous wartime speech by Winston Churchill. You’ll notice he likes to repeat things! Can you spot some words that are repeated? For example: We, Shall, the, fight. If we needed to create a DICTIONARY, it would be important to use the words that are repeated often enough to make it worthwhile.
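Spotting the repeated words can be automated. The sketch below counts words in a short excerpt from the speech. The exact text shown on the slide is not preserved in this transcript, so the excerpt is an assumption and is used for illustration only.

```python
import re
from collections import Counter

# Assumed excerpt; the slide's own text may be longer or different.
excerpt = (
    "we shall fight on the beaches, we shall fight on the landing grounds, "
    "we shall fight in the fields and in the streets, we shall fight in the hills; "
    "we shall never surrender"
)

# Split into lowercase words, ignoring punctuation.
words = re.findall(r"[a-z]+", excerpt.lower())
counts = Counter(words)

# Only words that appear more than once are worth a dictionary entry.
repeated = {word: n for word, n in counts.items() if n > 1}
print(repeated)
```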

16 Creating a Dictionary
To make a long story short, upon analysis, we would be able to add just three phrases to the dictionary: "We shall", "Fight" and "On the".
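Here is a minimal Python sketch of compressing with these three phrases. The codes `#1`, `#2` and `#3` are hypothetical (the codes used on the slides are not preserved in this transcript), the excerpt is an assumed lowercase sample from the speech, and the approach only works if `#` never appears in the original text.

```python
# Hypothetical codes for the three dictionary phrases.
phrases = {"#1": "we shall", "#2": "fight", "#3": "on the"}

def compress(text, phrases):
    # Replace every occurrence of each phrase with its short code.
    for code, phrase in phrases.items():
        text = text.replace(phrase, code)
    return text

def decompress(text, phrases):
    # Put the phrases back in place of their codes.
    for code, phrase in phrases.items():
        text = text.replace(code, phrase)
    return text

excerpt = ("we shall fight on the beaches, we shall fight on the landing grounds, "
           "we shall never surrender")

packed = compress(excerpt, phrases)
print(packed)
print(len(excerpt), "characters before,", len(packed), "after (dictionary not counted)")
```

Because every code maps back to exactly one phrase, decompressing `packed` gives back the original excerpt, so this kind of substitution is lossless.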

17 Note: we are looking for repetitions in the text
Applying these rules would result in considerable savings, as there are many repetitions of certain phrases. [Table, contents not preserved in this transcript. Columns: Code, Repeated Phrase, Count, Size, Notes]

19 Compression and restoration
The file that is output after compression consists of the reduced file with the tags inserted, plus the dictionary of phrases and the codes that now represent them. The figures shown in brackets are the equivalents for the Complex Lossless compression. In the example we just looked at, we could have compressed it as follows: the original file consisted of 391 characters, of which 259 have been taken out, leaving 132 characters. Then we created the dictionary, which had 12 entries (and hence 12 code tags) and 114 characters for the phrases, giving a total for the dictionary of 126 characters. The whole file is stated as 285 characters long; since 132 + 126 = 258, the remaining 27 characters are presumably the code tags inserted into the reduced text, although the slide does not say so. Compared with the original file of 391 characters, this is compressed to 73% of its original size.
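The figures above can be checked with a few lines of Python. All the numbers come from the slide; the variable names are made up, and the reading of the leftover characters as inserted code tags is an assumption rather than something the slide states.

```python
original_chars = 391
removed_chars = 259
remaining_chars = original_chars - removed_chars   # text left after removing phrases

dictionary_chars = 12 + 114                        # 12 code tags + 114 phrase characters
stated_total = 285                                 # total file size given on the slide

# Characters in the stated total not covered by the two figures above
# (assumed here to be the code tags inserted into the reduced text).
leftover_chars = stated_total - (remaining_chars + dictionary_chars)

percent_of_original = round(stated_total / original_chars * 100)

print("Remaining text:", remaining_chars)
print("Dictionary:", dictionary_chars)
print("Leftover:", leftover_chars)
print("Compressed size:", percent_of_original, "% of the original")
```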

