Informatics I101 February 25, 2003 John C. Paolillo, Instructor
Electronic Text
ASCII — American Standard Code for Information Interchange
EBCDIC (IBM mainframes, not standard)
Extended ASCII (8-bit, not standard)
–DOS Extended ASCII
–Windows Extended ASCII
–Macintosh Extended ASCII
UNICODE (standard-in-progress)
ASCII
The alphabet letter "A" is stored as its ASCII code (65); on output, that code is displayed as the screen representation A.
The ASCII Code
0–31: control characters — NUL SOH STX ETX EOT ENQ ACK BEL BS HT LF VT FF CR SO SI DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC FS GS RS US
32–47: blank ! " # $ % & ' ( ) * + , - . /
48–63: 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
64–79: @ A B C D E F G H I J K L M N O
80–95: P Q R S T U V W X Y Z [ \ ] ^ _
96–111: ` a b c d e f g h i j k l m n o
112–127: p q r s t u v w x y z { | } ~ DEL
An Example Text
T h i s _ i s _ a n _ e x a m p l e
Note that each ASCII character corresponds to a number, including spaces, carriage returns, etc. Everything must be represented somehow; otherwise the computer couldn't do anything with it.
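As a concrete illustration (a Python sketch, not part of the original slides), each character of the example text can be mapped to its ASCII code:

    # Python sketch: map each character of the example text to its ASCII code.
    text = "This is an example"
    codes = [ord(ch) for ch in text]
    print(codes)
    # [84, 104, 105, 115, 32, 105, 115, 32, 97, 110, 32, 101, 120, 97, 109, 112, 108, 101]
    # Note that the spaces are represented too, by the code 32.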
Representation in Memory
a n _ e x a m p l e — the text is stored as a sequence of bytes, one ASCII code per character (spaces shown as _).
Features of ASCII
7-bit fixed-length code
–all codes have the same number of bits
Sorting: A precedes B, B precedes C, etc.
Caps + 32 = lower case (A + space = a, since the code for space is 32)
Word divisions, etc. must be parsed
ASCII is very widespread and almost universally supported.
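A small Python sketch (added for illustration) of the sorting and caps + 32 properties listed above:

    # Sorting follows code order: 'A' (65) < 'B' (66) < 'C' (67) ...
    print(ord('A'), ord('B'), ord('C'))      # 65 66 67
    # Adding 32 (the code for space) to an upper-case code gives the lower-case letter.
    print(chr(ord('A') + ord(' ')))          # a   (65 + 32 = 97)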
Variable-Length Codes
Some symbols (e.g. letters) have shorter codes than others
–e.g. Morse code: e = dot, j = dot-dash-dash-dash
–use the frequency of symbols to assign code lengths
Why? Space efficiency
–compression tools such as gzip and zip use variable-length codes (based on words)
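To make the space-efficiency point concrete, here is an illustrative Python sketch with made-up symbol frequencies, comparing a 2-bit fixed-length code with a variable-length code for the same four symbols:

    # Made-up frequencies for four symbols over a 100-symbol text (illustrative only).
    freq = {'a': 50, 'b': 25, 'c': 15, 'd': 10}

    fixed_bits = sum(freq.values()) * 2                      # 2 bits/symbol covers 4 symbols
    lengths = {'a': 1, 'b': 2, 'c': 3, 'd': 3}               # shorter codes for frequent symbols
    variable_bits = sum(freq[s] * lengths[s] for s in freq)

    print(fixed_bits, variable_bits)                         # 200 vs. 175 bits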
Requirements
The starting and ending points of symbols must be clear.
(Simplistic) example: four symbols must be encoded as 0, 10, 110, 1110
–all symbols end with a zero
–any zero ends a symbol
–any one continues a symbol
Average number of bits per symbol = 2
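A minimal decoding sketch in Python for the codes above, using the rule that a zero ends a symbol (the symbol names w, x, y, z are hypothetical):

    # Decoding sketch: a 0 always ends a symbol, a 1 continues it.
    # Assumed codes: 0, 10, 110, 1110.
    codebook = {'0': 'w', '10': 'x', '110': 'y', '1110': 'z'}

    def decode(bits):
        symbols, current = [], ''
        for b in bits:
            current += b
            if b == '0':                      # a zero ends the current symbol
                symbols.append(codebook[current])
                current = ''
        return symbols

    print(decode('0101100'))                  # ['w', 'x', 'y', 'w']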
Example
12 symbols
–digits 0–9
–decimal point and space (end of number)
Efficient Coding
Huffman coding (used by gzip)
1. count the number of times each symbol occurs
2. start with the two least frequent symbols
  a) combine them using a tree
  b) put 0 on one branch, 1 on the other
  c) combine their counts and treat them as a single symbol
3. continue combining in the same way until every symbol is assigned a place in the tree
4. read the codes from the top of the tree down to each symbol
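A compact Python sketch of these steps (a generic Huffman builder for illustration, not gzip's actual implementation):

    import heapq
    from collections import Counter

    def huffman_codes(text):
        # 1. count the number of times each symbol occurs
        counts = Counter(text)
        # Each heap entry is (count, tiebreaker, partial code table for that subtree).
        heap = [(n, i, {sym: ''}) for i, (sym, n) in enumerate(counts.items())]
        heapq.heapify(heap)
        next_id = len(heap)
        # 2./3. repeatedly combine the two least frequent (sub)trees
        while len(heap) > 1:
            n0, _, t0 = heapq.heappop(heap)
            n1, _, t1 = heapq.heappop(heap)
            # put 0 on one branch, 1 on the other, and combine the counts
            merged = {s: '0' + c for s, c in t0.items()}
            merged.update({s: '1' + c for s, c in t1.items()})
            heapq.heappush(heap, (n0 + n1, next_id, merged))
            next_id += 1
        # 4. the surviving table holds each symbol's code, read from the top of the tree down
        return heap[0][2]

    print(huffman_codes("this is an example"))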
Information Theory
Mathematical theory of communication
–How many bits are needed for an efficient variable-length encoding?
–How much information is in a chunk of data?
–How can the capacity of an information medium be measured?
Probabilistic model of information
–"noisy channel" model
–less frequent ≈ more surprising ≈ more informative
Information is measured using the notion of entropy.
Noisy Channel
Source → (noisy channel) → Destination
We measure the probability of each possible path (correct reception and errors).
Entropy
The entropy of a symbol is calculated from its probability of occurrence.
Number of bits required for a symbol s: h(s) = –log2 p(s)
Average entropy: H(p) = –Σ p(i) log2 p(i)
Related to variance
Measured in bits (log base 2)
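A short Python sketch of these formulas:

    import math

    def entropy(probs):
        # Average entropy H(p) = -sum(p_i * log2 p_i), in bits.
        return -sum(p * math.log2(p) for p in probs if p > 0)

    print(-math.log2(0.5))                        # 1.0 bit: a 50/50 symbol carries one bit
    print(entropy([0.5, 0.25, 0.125, 0.125]))     # 1.75 bits on average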
Base 2 Logarithms
2^(log2 x) = x; e.g. log2 2 = 1, log2 4 = 2, log2 8 = 3, etc.
Often we round up to the nearest power of two (equivalently, round log2 up to a whole number = the minimum number of bits).
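For example, the minimum whole number of bits for n symbols is log2 n rounded up (Python sketch):

    import math

    # Minimum whole number of bits needed to distinguish n symbols.
    for n in (2, 12, 128, 200):
        print(n, math.ceil(math.log2(n)))     # 2 -> 1, 12 -> 4, 128 -> 7, 200 -> 8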
Unicode
Administered by the Unicode Consortium
Assigns a unique code to every written symbol (21-bit code space: 2,097,152 codes)
–UTF-32: four-byte fixed-length code
–UTF-16: two- to four-byte variable-length code
–UTF-8: one- to four-byte variable-length code: ASCII block (one byte) + basic multilingual plane (2–3 bytes) + supplementary planes (4 bytes)
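A quick Python sketch showing how many bytes UTF-8 uses for characters from different ranges (the example characters are arbitrary):

    # How many bytes UTF-8 uses for characters from different ranges.
    for ch in ('A', 'é', '€', '𝄞'):           # ASCII, Latin-1, BMP, supplementary plane
        encoded = ch.encode('utf-8')
        print(ch, len(encoded), 'byte(s)')    # 1, 2, 3, and 4 bytes respectively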