Introduction to Digital Libraries: Digital Data. Presentation transcript:
Do you still have a copy of your first email? Can you still compile and run the first program you ever wrote? If Hurricane Isabel had destroyed your computer, how much information would you have lost? Digital information http://en.wikipedia.org/wiki/Rosetta_Stone http://www.rosettaproject.org/about-us/disk/concept
Storage of text: image vs. ASCII Document image – Digital image of page; words represented as patterns of pixels – Not searchable as text – Optical character recognition to convert to ASCII (may be error prone) ASCII – Searchable as text; words represented as ASCII codes
"Benign Neglect"
Hardcopy items:
– benefit from "benign neglect"
– have well-understood methods; e.g.: book->open, book->turnPage
Softcopy items:
– frequent use leads to migration & replication
– are only understood in specialized, fragile contexts
Listing shown on the slide:
000100 IDENTIFICATION DIVISION.
000200 PROGRAM-ID. HELLOWORLD.
000300
000400*
000500 ENVIRONMENT DIVISION.
000600 CONFIGURATION SECTION.
000700 SOURCE-COMPUTER. RM-COBOL.
000800 OBJECT-COMPUTER. RM-COBOL.
000900
001000 DATA DIVISION.
001100 FILE SECTION.
001200
100000 PROCEDURE DIVISION.
100100
100200 MAIN-LOGIC SECTION.
100300 BEGIN.
100400     DISPLAY " " LINE 1 POSITION 1 ERASE EOS.
100500     DISPLAY "Hello world!" LINE 15 POSITION 10.
100600     STOP RUN.
100700 MAIN-LOGIC-EXIT.
100800     EXIT.
CCS – Offices Traditional OCR - Output THE AMERICAN MISSIONARY. Vo.. XXXII JANUARY, 1878 No. 1 American Missionary Association 1877 - 1888 [body text shown on the slide as rows of x placeholders]
CCS – Offices information available Title page Title of series Volume number Issue number Motto Date
Oxford University Library Services > 660 staff 40 libraries Budget > £25m (€37m) Total bookstock: 11 million items 156 miles (250 km) of shelving, including repository space
Text Compression Data compression is important to storage systems because it allows more bytes to be packed into a given storage medium than when the data is uncompressed. Compromises: –Encode-decode time –Random access to text?
Why Compress? To reduce the volume of data to be transmitted (text, images, …) To reduce the bandwidth required for transmission and to reduce storage requirements (speech, audio, video)
Text Compression Common methods – Symbol-wise methods Estimate probabilities of symbols, code one at a time, shorter codes for high probabilities (Morse) E.g. Huffman coding – Dictionary methods Replace words and fragments with dictionary entries (Braille) E.g. Ziv-Lempel compression May be static or dynamic
Huffman coding Developed in 1950s, widely used Static code, variable length Based on frequency of occurrence of letters (from English or from body of text) Method: – Sort by falling probabilities; link 2 symbols with least probabilities, label with sum; repeat till you reach a single symbol with probability of 1 – Code down tree to generate symbols
Huffman coding builds a binary tree from the letter frequencies in the message. – The binary symbols for each character are read directly from the tree. Symbols with the highest frequencies end up at the top of the tree, and result in the shortest codes. 7A.2 Statistical Coding Huffman coding
Huffman code tree [Figure: binary code tree with leaves a, b, c, d, e, f, g; each branch is labeled 0 or 1]
The process of building the tree begins by counting the occurrences of each symbol in the text to be encoded. HIGGLETY PIGGLETY POP THE DOG HAS EATEN THE MOP THE PIGS IN A HURRY THE CATS IN A FLURRY HIGGLETY PIGGLETY POP
Next, place the letters and their frequencies into a forest of trees that each have two nodes: one for the letter, and one for its frequency.
We start building the tree by joining the nodes having the two lowest frequencies.
And then we again join the nodes with the two lowest frequencies.
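The tree-building steps above (count symbols, repeatedly join the two lowest-frequency nodes, read the codes off the tree) can be sketched in a few lines. This is a minimal illustration, not the slides' own implementation: the function names are hypothetical, ties between equal frequencies are broken arbitrarily, and the resulting bit patterns may therefore differ from any tree drawn on the slides while having the same total length.

```python
import heapq
from collections import Counter


def huffman_codes(text):
    """Build a Huffman code table {symbol: bit string} for the given text."""
    if not text:
        return {}
    freq = Counter(text)
    # Heap entries are (weight, tiebreak, tree); a tree is either a symbol
    # or a (left, right) pair. The unique tiebreak avoids comparing trees.
    heap = [(w, i, sym) for i, (sym, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    if len(heap) == 1:
        return {heap[0][2]: "0"}  # a single symbol still needs one bit
    counter = len(heap)
    while len(heap) > 1:
        # Join the two nodes with the lowest frequencies; label with the sum.
        w1, _, t1 = heapq.heappop(heap)
        w2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, counter, (t1, t2)))
        counter += 1

    codes = {}

    def walk(tree, prefix):
        # Code down the tree: 0 for the left branch, 1 for the right branch.
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            codes[tree] = prefix

    walk(heap[0][2], "")
    return codes


def encode(text, codes):
    return "".join(codes[ch] for ch in text)


def decode(bits, codes):
    inverse = {code: sym for sym, code in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:  # prefix-free, so the first match is the symbol
            out.append(inverse[current])
            current = ""
    return "".join(out)


rhyme = "HIGGLETY PIGGLETY POP"
table = huffman_codes(rhyme)
bits = encode(rhyme, table)
print(len(bits), "bits instead of", 8 * len(rhyme))
```

Because the code is prefix-free, the decoder needs no separators between symbols.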
Example [Figure: Huffman tree with leaves t, e, s, n, a, l, o and node weights 261, 106, 155, 53, 90, 31; each branch is labeled 0 or 1]
Ziv-Lempel Compression Adaptive coding For repeat occurrences of text segments, pointer back to first occurrence Higher compression than Huffman coding Also used for image compression
Ziv-Lempel compression Based on triples <a, b, c>, where – a = how far back to segment – b = no of characters in segment – c = new character to end segment E.g. – <0, 0, z>: first occurrence of z – <17, 5, r>: go back 17 characters, repeat 5 characters, end in r
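Decoding such triples takes only a short loop. The sketch below is illustrative rather than a reference implementation. It assumes each triple is ordered (distance back, number of characters, new character) and that a first occurrence is written with distance 0 and length 0; the function name is hypothetical.

```python
def lz77_decode(triples):
    """Rebuild text from (distance_back, length, next_char) triples."""
    out = []
    for back, length, ch in triples:
        start = len(out) - back
        # Copy one character at a time so a segment may overlap
        # the text that is being produced by this same triple.
        for i in range(length):
            out.append(out[start + i])
        out.append(ch)
    return "".join(out)


print(lz77_decode([(0, 0, "a"), (0, 0, "b"), (2, 2, "c")]))  # ababc
print(lz77_decode([(0, 0, "a"), (1, 3, "b")]))               # aaaab
```

The second call shows why copying is done character by character: the segment is longer than the distance back, so it repeats text it has only just written.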
Ziv-Lempel - Example abbababbbaabaa abbbba babaa baa
Example Encode (i.e., compress) the string ABBCBCABABCAABCAAB The compressed message is: (0,A)(0,B)(2,C)(3,A)(2,A)(4,A)(6,B) Note: The above is just a representation, the commas and parentheses are not transmitted; we will discuss the actual form of the compressed message later on in slide 12.
Example 1. A is not in the Dictionary; insert it 2. B is not in the Dictionary; insert it 3. B is in the Dictionary. BC is not in the Dictionary; insert it. 4. B is in the Dictionary. BC is in the Dictionary. BCA is not in the Dictionary; insert it. 5. B is in the Dictionary. BA is not in the Dictionary; insert it. 6. B is in the Dictionary. BC is in the Dictionary. BCA is in the Dictionary. BCAA is not in the Dictionary; insert it. 7. B is in the Dictionary. BC is in the Dictionary. BCA is in the Dictionary. BCAA is in the Dictionary. BCAAB is not in the Dictionary; insert it.
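The dictionary steps above correspond to the LZ78 variant of Ziv-Lempel coding. A minimal encoder sketch follows; the function name is hypothetical, dictionary indices start at 1, and index 0 stands for the empty phrase. A leftover phrase at the end of the input is emitted with an empty character.

```python
def lz78_encode(text):
    """Return a list of (dictionary_index, next_char) pairs."""
    dictionary = {}  # phrase -> index, starting at 1
    pairs = []
    phrase = ""
    for ch in text:
        if phrase + ch in dictionary:
            phrase += ch  # keep extending while the phrase is known
        else:
            # Emit the index of the longest known phrase plus the new character,
            # then insert the extended phrase into the dictionary.
            pairs.append((dictionary.get(phrase, 0), ch))
            dictionary[phrase + ch] = len(dictionary) + 1
            phrase = ""
    if phrase:
        # Input ended inside a known phrase: emit its index alone.
        pairs.append((dictionary[phrase], ""))
    return pairs


print(lz78_encode("ABBCBCABABCAABCAAB"))
# [(0, 'A'), (0, 'B'), (2, 'C'), (3, 'A'), (2, 'A'), (4, 'A'), (6, 'B')]
```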
Example Encode (i.e., compress) the string BABAABRRRA. The compressed message is: (0,B)(0,A)(1,A)(2,B)(0,R)(5,R)(2, )
Example 1. B is not in the Dictionary; insert it 2. A is not in the Dictionary; insert it 3. B is in the Dictionary. BA is not in the Dictionary; insert it. 4. A is in the Dictionary. AB is not in the Dictionary; insert it. 5. R is not in the Dictionary; insert it. 6. R is in the Dictionary. RR is not in the Dictionary; insert it. 7. A is in the Dictionary and it is the last input character; output a pair containing its index: (2, )
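Decompression rebuilds the same dictionary from the pairs alone, so the dictionary never has to be transmitted. A sketch under the same assumptions as the example above (index 0 is the empty phrase; a final pair may carry an empty character); the function name is hypothetical.

```python
def lz78_decode(pairs):
    """Rebuild text from (dictionary_index, next_char) pairs."""
    phrases = [""]  # index 0 is the empty phrase
    out = []
    for index, ch in pairs:
        phrase = phrases[index] + ch
        phrases.append(phrase)  # same insertion order as the encoder
        out.append(phrase)
    return "".join(out)


pairs = [(0, "B"), (0, "A"), (1, "A"), (2, "B"), (0, "R"), (5, "R"), (2, "")]
print(lz78_decode(pairs))  # BABAABRRRA
```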
Pros and Cons of Different Algorithms
                      Arithmetic   Character Huffman   Word Huffman   Ziv-Lempel
Compression ratio     very good    poor                very good      good
Compression speed     slow         fast                fast           very fast
Decompression speed   slow         fast                very fast      very fast
Memory space          low          low                 high           moderate
Pattern matching      no           yes                 yes            yes
Random access         no           yes                 yes            no
vector graphics A vector graphic is a set of instructions on how to draw the shapes that make up an image. Unlike raster images, vector graphics are resolution-independent. On a device with small pixels, they look better than on a device with large pixels.
Vector Images Vector – composed of paths – Coordinates – With color SVG coding – uses mathematical relationships between points and the paths connecting them to describe an image Used for – Fonts – Drawings – Charts – Maps
Raster Images Raster Images also known as Bitmap Image – A grid of individual pixels – Each pixel can be a different color or shade – Our focus today Used for – Continuous tone images
Raster images Raster images are rectangular grids of pixels. Each pixel is a small rectangle that has a certain color. Since the pixels are small, the illusion of a non-pixelated image is created. For a fixed number of pixels, the smaller the pixels, the smaller the displayed image.
Original materials Photographs Reflective – Prints Film – Negative – Positive Requirements – Color fidelity – Contrast – Detail rendering
Original Materials Text Can be black and white or have color Usually bound volumes rather than loose pages Requirements – Usually needs to be readable – Often has additional processing like OCR
Original Materials Artifacts 3-D objects can’t be scanned Digital Photography creates the image – “Studio” space – Lighting – Moving equipment and personnel Requirements – Depth – Color – Detail
The Five Big Factors Resolution Bit Depth Color Compression Format
Resolution Often referred to as “dpi” or “ppi” – Dots per inch – Pixels per inch RATIO of number of pixels captured per inch of original photo size – 8x10 print scanned at 300ppi = 2400 x 3000 pixels “Spatial resolution” refers to pixel dimensions of image, e.g., 3000 x 2400 pixels
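The pixel dimensions in the example above follow from multiplying each side of the original by the scanning resolution. A small sketch of that arithmetic (the function name is hypothetical):

```python
def pixel_dimensions(width_in, height_in, ppi):
    """Return (width_px, height_px) for an original scanned at `ppi` pixels per inch."""
    return round(width_in * ppi), round(height_in * ppi)


print(pixel_dimensions(8, 10, 300))  # (2400, 3000)
```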
bit depth The bit depth is the amount of information stored for each pixel about its color. The higher the bit depth, the more colors can be represented.
Bit Depth Refers to number of bits (binary digits, places for zeroes and ones) devoted to storing color information about each pixel – 1 bit (1) = 2^1 = 2 shades (black & white) – 2 bit (01) = 2^2 = 4 shades – 4 bit (0010) = 2^4 = 16 shades – 8 bit (11010001) = 2^8 = 256 shades
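The relationship between bit depth and the number of shades is a power of two. The sketch below shows it, along with one simple way to reduce an 8-bit value to a lower bit depth by keeping only its most significant bits. Both function names are hypothetical, and dropping low bits is just one possible reduction method, chosen here as an assumption for illustration.

```python
def shades(bits):
    """Number of distinct values that `bits` binary digits can represent."""
    return 2 ** bits


def reduce_depth(value_8bit, bits):
    """Keep only the top `bits` bits of an 8-bit pixel value (1 <= bits <= 8)."""
    return value_8bit >> (8 - bits)


print([shades(b) for b in (1, 2, 4, 8)])  # [2, 4, 16, 256]
print(reduce_depth(0b11010001, 4))        # 13, i.e. binary 1101
```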
Bit Depth 1 bit (black & white); 2 bit (4 colors); 4 bit (16 colors); 8 bit (256 colors)
Color RGB – Scanners and cameras generally have sensors for Red, Green, and Blue – Each of these “channels” is stored separately in the digital file – 8 bits for each channel = 24 bit color CMYK (Cyan, Magenta, Yellow and Black) is used for high-end “pre-press” printing purposes
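Pixel dimensions, channel count, and bits per channel together determine the uncompressed size of an image. A rough sketch of that arithmetic, ignoring file headers and any row padding a real format may add (the function name is hypothetical):

```python
def uncompressed_bytes(width_px, height_px, channels=3, bits_per_channel=8):
    """Approximate raw image size in bytes, without headers or padding."""
    total_bits = width_px * height_px * channels * bits_per_channel
    return (total_bits + 7) // 8  # round up to whole bytes


# An 8x10 inch print scanned at 300 ppi in 24-bit RGB:
print(uncompressed_bytes(2400, 3000))  # 21600000 bytes, about 21.6 MB
```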
Compression Lossy compression reduces file size by discarding data. It cannot be reversed; the discarded data is lost. Irrelevancy reduction – Removes data that will not affect perception Redundancy reduction – Removes duplicate data JPEG compression – Discrete Cosine Transform (DCT) converts pixel values into frequency coefficients – Quantization rounds those coefficients (losing data) – The quality slider governs how much rounding occurs
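The transform-then-round idea can be shown on a single row of eight pixel values. This is a simplified sketch and not the JPEG codec: real JPEG applies a two-dimensional DCT to 8x8 blocks, uses per-frequency quantization tables, and adds entropy coding. The one-dimensional transform, the single step size, and all function names here are assumptions made for illustration.

```python
import math


def dct(block):
    """Orthonormal one-dimensional DCT-II of a list of samples."""
    n = len(block)
    out = []
    for k in range(n):
        total = sum(x * math.cos(math.pi * (i + 0.5) * k / n)
                    for i, x in enumerate(block))
        scale = math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
        out.append(scale * total)
    return out


def idct(coeffs):
    """Inverse of dct() above (orthonormal DCT-III)."""
    n = len(coeffs)
    out = []
    for i in range(n):
        total = coeffs[0] * math.sqrt(1 / n)
        total += sum(coeffs[k] * math.sqrt(2 / n)
                     * math.cos(math.pi * (i + 0.5) * k / n)
                     for k in range(1, n))
        out.append(total)
    return out


def quantize(coeffs, step):
    """Round each coefficient to a multiple of `step`; this is the lossy part."""
    return [round(c / step) for c in coeffs]


def dequantize(levels, step):
    return [level * step for level in levels]


row = [52, 55, 61, 66, 70, 61, 64, 73]
levels = quantize(dct(row), step=10)
restored = idct(dequantize(levels, step=10))
print(levels)
print([round(v) for v in restored])
```

A larger step produces more zero coefficients, which compress well, at the cost of a larger difference between `row` and `restored`. This is the trade-off a quality slider controls.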
Compression Full sized image, enlarged 8x and 16x Without Compression With maximum JPEG Compression
Wavelet Compression Treats the image as a signal or wave, not a series of numbers or a picture The data is transformed into a continuous wave centered on zero Calculates the distance of the peaks and dips from zero and takes the average between adjacent points Repeats the averaging
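The repeated averaging of adjacent points can be illustrated with the Haar wavelet, the simplest case. The slide does not name a particular wavelet, so the choice of Haar is an assumption, as are the function names and the requirement that the signal length be a power of two. Alongside each average, the sketch keeps half the difference of the pair so that the transform can be reversed.

```python
def haar_step(signal):
    """Average adjacent pairs; also keep half their difference as detail."""
    pairs = list(zip(signal[0::2], signal[1::2]))
    averages = [(a + b) / 2 for a, b in pairs]
    details = [(a - b) / 2 for a, b in pairs]
    return averages, details


def haar_transform(signal):
    """Repeat the averaging until a single overall average remains."""
    details_by_level = []
    current = list(signal)
    while len(current) > 1:
        current, details = haar_step(current)
        details_by_level.append(details)
    return current[0], details_by_level


def haar_inverse(average, details_by_level):
    """Undo haar_transform by re-expanding each level."""
    current = [average]
    for details in reversed(details_by_level):
        expanded = []
        for avg, d in zip(current, details):
            expanded.extend([avg + d, avg - d])
        current = expanded
    return current


average, details = haar_transform([9, 7, 3, 5])
print(average, details)                # 6.0 [[1.0, -1.0], [2.0]]
print(haar_inverse(average, details))  # [9.0, 7.0, 3.0, 5.0]
```

Compression comes from the detail values: in smooth regions they are close to zero and can be rounded away or stored very compactly.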
Wavelet vs. JPEG compression Wavelet compression – file size: 1861 bytes – compression ratio: 105.6 JPEG compression – file size: 1895 bytes – compression ratio: 103.8 Source: “About Wavelet Compression”. http://www.barrt.ru/parshukov/about.htm
TIFF TIFF stands for Tagged Image File Format. It is a standard file format used for archival purposes; in fact, it is the de facto standard in the archival community. It supports 24-bit, i.e. “full-color”, images.
origin It was originally created by Microsoft and a software company called Aldus. Aldus held the copyright and released the first complete specification in 1986. The aim was to create a standard format for the desktop scanners of the 1980s.
status This company was acquired by Adobe, Inc. They now hold the copyright. Thus this is a proprietary format. Use of the format requires no license fees. The last major update was in 1992.
tagged… The TIFF file stores its information in fields called tags. These store things like – image dimensions – copyright information The format allows for proprietary tags you can create yourself.
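The tag structure can be seen by reading the start of a TIFF file directly. The file opens with an 8-byte header (byte order, the number 42, and the offset of the first image file directory), followed by 12-byte tag entries. The sketch below builds a tiny in-memory example and lists its tags. It is illustrative only: the function names are hypothetical, it reads just the first directory, and it returns the raw 4-byte value field without following offsets to longer values such as copyright strings.

```python
import struct

# A few well-known tag numbers from the TIFF specification.
TAG_NAMES = {256: "ImageWidth", 257: "ImageLength", 33432: "Copyright"}


def read_tiff_tags(data):
    """Return {tag name or number: (field_type, count, raw_value)} for the first IFD."""
    endian = "<" if data[:2] == b"II" else ">"  # II = little-endian, MM = big-endian
    magic, ifd_offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a TIFF file")
    (entry_count,) = struct.unpack(endian + "H", data[ifd_offset:ifd_offset + 2])
    tags = {}
    for i in range(entry_count):
        start = ifd_offset + 2 + 12 * i
        tag, field_type, count, value = struct.unpack(
            endian + "HHII", data[start:start + 12])
        tags[TAG_NAMES.get(tag, tag)] = (field_type, count, value)
    return tags


def build_sample_tiff():
    """Build a minimal little-endian header and directory (no pixel data)."""
    header = struct.pack("<2sHI", b"II", 42, 8)
    entries = [
        struct.pack("<HHII", 256, 4, 1, 2400),  # ImageWidth, type LONG
        struct.pack("<HHII", 257, 4, 1, 3000),  # ImageLength, type LONG
    ]
    ifd = struct.pack("<H", len(entries)) + b"".join(entries) + struct.pack("<I", 0)
    return header + ifd


print(read_tiff_tags(build_sample_tiff()))
# {'ImageWidth': (4, 1, 2400), 'ImageLength': (4, 1, 3000)}
```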
requirements of baseline TIFF Multiple images may be in the same file. Support for two compression schemes – CCITT Group 3 1-Dimensional Modified Huffman RLE – PackBits compression - a form of run-length encoding Support for – bilevel – grayscale – palette-color – RGB full-color
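PackBits is simple enough to decode in a few lines. Each control byte, read as a signed value n, either announces n + 1 literal bytes (n from 0 to 127) or says that the next byte is repeated 1 - n times (n from -1 to -127); -128 is skipped. A minimal decoder sketch, with a hypothetical function name and no handling of truncated input:

```python
def packbits_decode(data):
    """Decode a PackBits run-length encoded byte string."""
    out = bytearray()
    i = 0
    while i < len(data):
        n = data[i]  # control byte, read as unsigned 0..255
        i += 1
        if n < 128:
            # Literal run: copy the next n + 1 bytes unchanged.
            out += data[i:i + n + 1]
            i += n + 1
        elif n > 128:
            # Repeat run: signed value is n - 256, so the count is 257 - n.
            out += bytes([data[i]]) * (257 - n)
            i += 1
        # n == 128 (signed -128) is a no-op.
    return bytes(out)


packed = bytes([0xFE, 0xAA, 0x02, 0x80, 0x00, 0x2A])
print(packbits_decode(packed).hex())  # aaaaaa80002a
```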
problems Adobe also owns PSD, the format for its Photoshop application. They have neglected TIFF. – no tags to specify relationship between pages. – no standards for vector graphics and text drawings. File size is limited to 4 GB.
GIF Developed in 1987 for CompuServe screens Uses an indexed color scheme insufficient for current color technology GIF does not store scaling resolution – Good for screen display – Good for graphics – Bad for printing Uses LZW compression Patent issues and not used currently
birth of PNG In 1993, Unisys had financial problems. It negotiated with CompuServe to jointly collect royalties for the use of LZW in GIF-manipulating software. This was announced on 28 December 1994. An informal group around Thomas Boutell started work on a free alternative to GIF.
features |1| non-patented and completely lossless compression that is better than the compression in GIF, but only by 5%-20% Multiple cyclic redundancy checks so that file integrity can be checked without viewing It has a magic signature that can detect the most common types of file corruption.
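Both integrity features can be checked without decoding any pixels. A PNG file starts with a fixed 8-byte signature, and every chunk ends with a CRC-32 computed over its type and data. The sketch below verifies only this structure. The function names are hypothetical, and the sample stream uses an all-zero IHDR payload as a stand-in, so it is not a displayable image.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def make_chunk(chunk_type, data):
    """Build one chunk: length, type, data, CRC-32 over type + data."""
    crc = zlib.crc32(chunk_type + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + chunk_type + data + struct.pack(">I", crc)


def check_png(stream):
    """Return True if the signature and every chunk CRC are intact."""
    if stream[:8] != PNG_SIGNATURE:
        return False
    pos = 8
    while pos < len(stream):
        if pos + 8 > len(stream):
            return False
        (length,) = struct.unpack(">I", stream[pos:pos + 4])
        chunk_type = stream[pos + 4:pos + 8]
        data = stream[pos + 8:pos + 8 + length]
        crc_bytes = stream[pos + 8 + length:pos + 12 + length]
        if len(data) != length or len(crc_bytes) != 4:
            return False  # chunk is truncated
        (stored_crc,) = struct.unpack(">I", crc_bytes)
        if zlib.crc32(chunk_type + data) & 0xFFFFFFFF != stored_crc:
            return False
        pos += 12 + length
    return True


good = PNG_SIGNATURE + make_chunk(b"IHDR", bytes(13)) + make_chunk(b"IEND", b"")
damaged = bytearray(good)
damaged[20] ^= 0x01  # flip one bit inside the IHDR data
print(check_png(good), check_png(bytes(damaged)))  # True False
print(check_png(good.replace(b"\r\n", b"\n")))     # False
```

The last call shows the signature at work: it contains a carriage return and line feed pair, so a transfer that rewrites line endings is detected immediately.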
features |2| two-dimensional interlacing scheme 1-, 2-, 4- and 8-bit palette support (like GIF) 1-, 2-, 4-, 8- and 16-bit grayscale support 24- and 48-bit truecolor support full alpha transparency in 8- and 16-bit modes, not just simple on-off transparency like GIF
JPEG Compression: Basics Human vision is insensitive to high spatial frequencies JPEG takes advantage of this by compressing high frequencies more coarsely and storing the image as frequency data JPEG is a “lossy” compression scheme. Losslessly compressed image, ~150KB; JPEG compressed, ~14KB
JPEG There are two standards, the original JPEG and JPEG 2000. We need to worry about this because JPEG 2000 has an option for lossless manipulation of the image. JPEG does not have this. We will assume JPEG is always a lossy format.
Image Technical Metadata MIX – Metadata for Still Images in XML Developed by – The Library of Congress' Network Development and MARC Standards Office – NISO Technical Metadata for Digital Still Images Standards Committee http://www.loc.gov/standards/mix/instances/mix_test.xml http://www.niso.org/standards/resources/Z39_87_trial_use.pdf