Informatics I101 February 25, 2003 John C. Paolillo, Instructor.

1 Informatics I101 February 25, 2003 John C. Paolillo, Instructor

2 Electronic Text
ASCII: American Standard Code for Information Interchange
EBCDIC (IBM mainframes, not standard)
Extended ASCII (8-bit, not standard)
–DOS Extended ASCII
–Windows Extended ASCII
–Macintosh Extended ASCII
Unicode (16-bit in its original design, standard-in-progress; slide 16 gives the later 21-bit code space)

3 ASCII
The bit pattern 01000001 means the alphabet letter "A", which is displayed on screen as the glyph A.

4 The ASCII Code
(columns 0-7 = high three bits; rows 0-F = low four bits, in hexadecimal)
     0    1    2      3  4  5  6  7
0    NUL  DLE  blank  0  @  P  `  p
1    SOH  DC1  !      1  A  Q  a  q
2    STX  DC2  "      2  B  R  b  r
3    ETX  DC3  #      3  C  S  c  s
4    EOT  DC4  $      4  D  T  d  t
5    ENQ  NAK  %      5  E  U  e  u
6    ACK  SYN  &      6  F  V  f  v
7    BEL  ETB  '      7  G  W  g  w
8    BS   CAN  (      8  H  X  h  x
9    HT   EM   )      9  I  Y  i  y
A    LF   SUB  *      :  J  Z  j  z
B    VT   ESC  +      ;  K  [  k  {
C    FF   FS   ,      <  L  \  l  |
D    CR   GS   -      =  M  ]  m  }
E    SO   RS   .      >  N  ^  n  ~
F    SI   US   /      ?  O  _  o  DEL

5 An Example Text
Text: This is an example
Characters (_ = space): T h i s _ i s _ a n _ e x a m p l e
Codes: 84 104 105 115 32 105 115 32 97 110 32 101 120 97 109 112 108 101
Note that each ASCII character corresponds to a number, including spaces, carriage returns, etc. Everything must be represented somehow, otherwise the computer couldn’t do anything with it.
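
The character-to-number mapping on this slide can be reproduced with a short Python sketch. The variable names are illustrative only; `ord` and `chr` are Python built-ins that convert between a character and its code.

```python
# Convert the example text to its ASCII codes and back again.
text = "This is an example"

# ord() gives the numeric code of each character, spaces included.
codes = [ord(ch) for ch in text]

# The same codes written as 8-bit binary strings, as they would sit in memory.
bits = [format(code, "08b") for code in codes]

# chr() reverses the mapping, recovering the original text.
decoded = "".join(chr(code) for code in codes)

print(codes)
print(bits[0])   # bit pattern of "T"
print(decoded)
```

Nothing is lost in the round trip: every character, including each space, has its own code.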

6 Representation in Memory
(_ = space; each memory address holds one character code)
Address   Code  Character
01101010  32    _
01101001  101   e
01101000  108   l
01100111  112   p
01100110  109   m
01100101  97    a
01100100  120   x
01100011  101   e
01100010  32    _
01100001  110   n
01100000  97    a

7 Features of ASCII
7-bit fixed-length code
–all codes have the same number of bits
Sorting: A precedes B, B precedes C, etc.
Caps + 32 = lower case (A + space = a)
Word divisions, etc. must be parsed
ASCII is very widespread and almost universally supported.
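
Two of these features, the +32 offset between cases and sorting by code value, can be checked directly. This is a minimal sketch; `to_lower_ascii` is a hypothetical helper written for illustration, not a library function.

```python
def to_lower_ascii(ch):
    """Lower-case one ASCII capital letter by adding 32 to its code."""
    if "A" <= ch <= "Z":
        return chr(ord(ch) + 32)   # 32 is also the code of the space character
    return ch                      # everything else is left unchanged

lowered = "".join(to_lower_ascii(ch) for ch in "ASCII Code")

# Sorting compares code values, so every capital precedes every lower-case letter.
ordered = sorted(["b", "A", "a", "B"])

print(lowered)
print(ordered)
```

Because sorting uses raw code values, "B" (66) sorts before "a" (97), which is why plain ASCII order differs from dictionary order.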

8 Variable-Length Codes
Some symbols (e.g. letters) have shorter codes than others
–E.g. Morse code: e = dot, j = dot-dash-dash-dash
–Use the frequency of symbols to assign code lengths
Why? Space efficiency
–compression tools such as gzip and zip use variable-length codes (based on repeated strings of text, such as words)

9 Requirements
Starting and ending points of symbols must be clear
(Simplistic) example: four symbols must be encoded, using the codes 0, 10, 110 and 1110
–All symbols end with a zero
–Any zero ends a symbol
–Any one continues a symbol
Average number of bits per symbol = 2 (the exact value depends on how often each symbol occurs; it would be 2.5 if all four were equally likely)
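
The three rules on this slide are enough to decode a bit stream with no separators. Below is a minimal sketch; the symbol names A to D are placeholders, since the slide's original symbols are not recoverable from the transcript.

```python
# Hypothetical symbol names; only the four code words come from the slide.
CODE = {"A": "0", "B": "10", "C": "110", "D": "1110"}
REVERSE = {bits: symbol for symbol, bits in CODE.items()}

def encode(symbols):
    """Concatenate the code words with no separators."""
    return "".join(CODE[s] for s in symbols)

def decode(bits):
    """Any one continues a symbol; any zero ends it."""
    symbols, current = [], ""
    for bit in bits:
        current += bit
        if bit == "0":
            symbols.append(REVERSE[current])
            current = ""
    if current:
        raise ValueError("bit stream ended in the middle of a symbol")
    return symbols

encoded = encode("ABCD")
print(encoded)
print(decode(encoded))

# Average code length if all four symbols are equally likely.
average_bits = sum(len(bits) for bits in CODE.values()) / len(CODE)
print(average_bits)
```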

10 Example
12 symbols
–digits 0-9
–decimal point and space (end of number)
[Figure: code tree with branches labeled 0 and 1 and the 12 symbols at its leaves]
Symbol     Code
0          00
1          010
2          0110
3          01110
4          011110
5          011111
6          10
7          110
8          1110
9          11110
_ (space)  111110
.          111111
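
As a sketch of how this 12-symbol table could be used, the code below encodes a number followed by the end-of-number space and decodes it again. The function names are illustrative; the decoder relies only on the fact that no code word is a prefix of another.

```python
# Code table from the slide; " " is the end-of-number space (shown as _).
CODES = {
    "0": "00", "1": "010", "2": "0110", "3": "01110",
    "4": "011110", "5": "011111", "6": "10", "7": "110",
    "8": "1110", "9": "11110", " ": "111110", ".": "111111",
}
DECODE = {bits: symbol for symbol, bits in CODES.items()}

def encode(text):
    return "".join(CODES[ch] for ch in text)

def decode(bits):
    """Read bits until they match a code word, emit the symbol, start over."""
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in DECODE:
            out.append(DECODE[current])
            current = ""
    if current:
        raise ValueError("leftover bits do not form a complete code word")
    return "".join(out)

encoded = encode("6.7 ")
print(encoded)
print(repr(decode(encoded)))
```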

11 Efficient Coding
Huffman coding (gzip)
1. Count the number of times each symbol occurs
2. Start with the two least frequent symbols
 a) combine them using a tree
 b) put 0 on one branch, 1 on the other
 c) combine counts and treat as a single symbol
3. Continue combining in the same way until every symbol is assigned a place in the tree
4. Read the codes from the top of the tree down to each symbol
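
The four steps above can be sketched in Python as follows. This is one possible arrangement, assuming a priority queue (`heapq` from the standard library) to find the two least frequent entries; `huffman_codes` is a name chosen for illustration, and the exact bit patterns depend on how ties are broken.

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Return a dict mapping each symbol in text to its Huffman code word."""
    counts = Counter(text)                      # step 1: count each symbol
    if not counts:
        return {}
    if len(counts) == 1:                        # a lone symbol still needs one bit
        return {next(iter(counts)): "0"}

    # Heap entries are (count, tiebreak, tree); a tree is either a symbol
    # or a (left, right) pair. The unique tiebreak keeps trees from being compared.
    heap = [(n, i, sym) for i, (sym, n) in enumerate(counts.items())]
    heapq.heapify(heap)
    next_id = len(heap)

    while len(heap) > 1:                        # steps 2 and 3
        n1, _, t1 = heapq.heappop(heap)         # the two least frequent entries
        n2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (n1 + n2, next_id, (t1, t2)))  # combined count
        next_id += 1

    codes = {}
    def walk(tree, prefix):                     # step 4: read codes top-down
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")         # 0 on one branch
            walk(tree[1], prefix + "1")         # 1 on the other
        else:
            codes[tree] = prefix
    walk(heap[0][2], "")
    return codes

codes = huffman_codes("aaaabbc")
print(codes)
```

In this example the most frequent symbol, "a", receives a one-bit code and the two rarer symbols receive two-bit codes.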

12 Information Theory
Mathematical theory of communication
–How many bits in an efficient variable-length encoding?
–How much information is in a chunk of data?
–How can the capacity of an information medium be measured?
Probabilistic model of information
–“Noisy channel” model
–less frequent ≈ more surprising ≈ more informative
Measures information using the notion of entropy

13 Noisy Channel
[Figure: the Source sends 1010 through the channel; the Destination receives 1010]
We measure the probability of each possible path (correct reception and errors)

14 Entropy
Entropy of a symbol is calculated from its probability of occurrence
Number of bits required: $h_s = -\log_2 p_s$
Average entropy: $H(p) = -\sum_i p_i \log_2 p_i$
Related to variance
Measured in bits ($\log_2$)
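
A short numeric sketch of these two formulas, using Python's `math.log2`. The function names `surprisal` and `entropy` are illustrative, and the probabilities are made-up examples rather than values from the lecture.

```python
import math

def surprisal(p):
    """Bits needed for one symbol of probability p: h = -log2(p)."""
    return -math.log2(p)

def entropy(probs):
    """Average bits per symbol: H = -sum(p_i * log2(p_i))."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A fair coin: each outcome carries 1 bit.
fair_coin = entropy([0.5, 0.5])

# A skewed four-symbol source: rarer symbols need more bits each,
# but the average falls below the 2 bits a fixed-length code would use.
skewed = entropy([0.5, 0.25, 0.125, 0.125])

print(surprisal(0.25), fair_coin, skewed)
```

For the skewed source the entropy is 1.75 bits per symbol, which is the average length a well-matched variable-length code can approach.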

15 Base 2 Logarithms
$2^{\log_2 x} = x$; e.g. $\log_2 2 = 1$, $\log_2 4 = 2$, $\log_2 8 = 3$, etc.
Often we round x up to the nearest power of two before taking the logarithm (= minimum number of bits)
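
The rounding rule can be sketched with integer arithmetic, which avoids floating-point error near exact powers of two. `bits_needed` is a hypothetical helper; `int.bit_length` is a built-in Python method.

```python
def bits_needed(n):
    """Minimum number of bits that can distinguish n different values."""
    if n < 1:
        raise ValueError("need at least one value")
    # (n - 1).bit_length() equals log2 of n rounded up to a whole number.
    return (n - 1).bit_length()

for n in (2, 8, 9, 12, 128):
    print(n, bits_needed(n))
```

For instance, the 12 symbols of the digit code on slide 10 would need 4 bits each in a fixed-length code, and the 128 ASCII codes need 7.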

16 Unicode
Administered by the Unicode Consortium
Assigns a unique code to every written symbol (21 bits: 2,097,152 possible values, of which code points U+0000 to U+10FFFF are used)
–UTF-32: four-byte fixed-length code
–UTF-16: two- or four-byte variable-length code
–UTF-8: one- to four-byte variable-length code
In UTF-8: ASCII block (one byte) + rest of the basic multilingual plane (2-3 bytes) + supplementary planes (4 bytes)
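
The byte counts for the three encodings can be observed with Python's built-in `str.encode`. The sample characters are chosen for illustration; the "-be" codec variants are used so that no byte-order mark is added to the count.

```python
# One sample character from each UTF-8 length class.
samples = {
    "A": "ASCII block",
    "\u00e9": "Latin-1 supplement (e with acute accent)",
    "\u20ac": "basic multilingual plane (euro sign)",
    "\U0001F600": "supplementary plane (emoji)",
}

sizes = {}
for ch, description in samples.items():
    sizes[ch] = (
        len(ch.encode("utf-8")),      # 1 to 4 bytes
        len(ch.encode("utf-16-be")),  # 2 or 4 bytes
        len(ch.encode("utf-32-be")),  # always 4 bytes
    )
    print(f"U+{ord(ch):04X} {description}: {sizes[ch]}")
```

UTF-8 is the only one of the three in which plain ASCII text keeps its original one-byte-per-character form.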
