Character representation in the computers Home Assignment 1 Assigned. Deadline 2016 January 24th, Sunday.


1 Character representation in the computers
Home Assignment 1 assigned. Deadline: Sunday, January 24th, 2016.

2 Machine level representation of data: character representation
Bits can represent anything!
Characters?
- 26 letters → 5 bits (2^5 = 32)
- upper/lower case + punctuation → 7 bits (in 8) (“ASCII”)
- standard code to cover all the world’s languages → 8, 16, 32 bits (“Unicode”)
Logical values? 0 → False, 1 → True
Colors? Ex: Red (100), Green (010), Blue (001)
Locations / addresses? Commands?
MEMORIZE: N bits ↔ at most 2^N things
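The rule "N bits, at most 2^N things" can be turned around to ask how many bits a given number of things needs. The following is a minimal sketch; the function name `bits_needed` is made up for this illustration and does not come from the slides.

```python
def bits_needed(n_things: int) -> int:
    """Smallest number of bits b such that 2**b >= n_things."""
    if n_things < 1:
        raise ValueError("need at least one thing to represent")
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 1, without floating point.
    return (n_things - 1).bit_length()


for n in (4, 26, 32, 128, 256):
    b = bits_needed(n)
    print(f"{n:>3} things -> {b} bits ({2 ** b} patterns available)")
```

For 26 letters this gives 5 bits, and for 128 patterns it gives 7 bits, matching the figures on the slides.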

3 Character representation in computers and devices
ASCII → ANSI → MultiByte → Unicode
ASCII: American Standard Code for Information Interchange. This was the de facto worldwide standard for the code numbers used by computers to represent all the upper- and lower-case Latin letters, numbers, punctuation, etc.
How many bits do we need to represent the following?
- 4 letters (A, B, C, D): 2 bits, 2^2 = 4 patterns
- 26 English uppercase letters: 5 bits, 2^5 = 32 patterns
- 26 English lowercase letters: 5 bits, 2^5 = 32 patterns
- Decimal digits and special signs: 5 bits, 2^5 = 32 patterns
- Special control characters: 5 bits, 2^5 = 32 patterns
How many bits do we need for 128 patterns?

4 The content of the ASCII table
Contains 128 characters. ASCII needs only 7 bits for character representation (“A” – 100 0001₂).
The first printable character is SP (space) and corresponds to the bit pattern 010 0000₂ – 0x20.
The characters A and B correspond to: A – 100 0001₂ – 0x41, B – 100 0010₂ – 0x42.
Rows: 16's place. Columns: 1's place.
    0   1   2   3   4   5   6   7   8   9   A   B   C   D   E   F
0   NUL SOH STX ETX EOT ENQ ACK BEL BS  HT  LF  VT  FF  CR  SO  SI
1   DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM  SUB ESC FS  GS  RS  US
2   SP  !   "   #   $   %   &   '   (   )   *   +   ,   -   .   /
3   0   1   2   3   4   5   6   7   8   9   :   ;   <   =   >   ?
4   @   A   B   C   D   E   F   G   H   I   J   K   L   M   N   O
5   P   Q   R   S   T   U   V   W   X   Y   Z   [   \   ]   ^   _
6   `   a   b   c   d   e   f   g   h   i   j   k   l   m   n   o
7   p   q   r   s   t   u   v   w   x   y   z   {   |   }   ~   DEL
Find the ASCII code of “z”: z – 111 1010₂ – 0x7A
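The table lookup on this slide can be checked with a few lines of Python. This is only a sketch: `ascii_info` is a hypothetical helper name, and it relies on Python's built-in `ord`, which returns the ASCII code for any ASCII character.

```python
def ascii_info(ch: str):
    """Return (code, 7-bit pattern, 16's-place digit, 1's-place digit) for one ASCII character."""
    code = ord(ch)
    if code > 0x7F:
        raise ValueError("not an ASCII character")
    bits = format(code, "07b")       # 7-bit binary pattern, zero padded
    row, col = code >> 4, code & 0xF  # 16's place (table row) and 1's place (table column)
    return code, bits, row, col


for ch in (" ", "A", "B", "z"):
    code, bits, row, col = ascii_info(ch)
    print(f"{ch!r}: 0x{code:02X}  {bits}  row {row:X}, column {col:X}")
```

For “z” the helper reports row 7, column A, which is where the character sits in the chart.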

5 ASCII full chart
(The chart is the same 16 × 8 ASCII table shown on slide 4.)
Control characters
Control characters are not shown or printed on devices; they control the devices. The same control character can have a different meaning on different devices.
0x0A – Line Feed, 0x0D – Carriage Return
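A short sketch of how the control characters can be picked out of the 128 codes. It assumes the usual convention that the control codes are 0x00 to 0x1F plus DEL (0x7F); the helper name `is_ascii_control` is made up for this example.

```python
def is_ascii_control(code: int) -> bool:
    """True for the ASCII control codes: rows 0 and 1 of the chart, plus DEL."""
    return code < 0x20 or code == 0x7F


LF = chr(0x0A)  # Line Feed, written "\n" in Python source
CR = chr(0x0D)  # Carriage Return, written "\r" in Python source

controls = [c for c in range(128) if is_ascii_control(c)]
print(repr(LF), repr(CR))
print(len(controls), "control codes,", 128 - len(controls), "printable codes (including space)")
```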

6 ASCII code: advantages, disadvantages
(The ASCII chart from slide 4 is shown again.) Why is this an advantage?
The history of ASCII since 1967 is mostly a history of attempts to overcome its limitations and make it more applicable to languages other than American English. How?

7 ASCII code extension. Character sets.
7 bits: ASCII, 128 chars
8 bits:
- ASCII (128 chars) + extra 128 chars = ANSI, 256 chars
- ASCII (128 chars) + extra 128 chars = IBM, 256 chars
- ASCII (128 chars) + Armenian = other, 256 chars

8 Code pages
1 byte – 256 characters
- ASCII + extra 128 chars = ANSI, 256 chars
- ASCII + extra 128 chars = IBM, 256 chars
- ASCII + Armenian 1 = Armenian code page #1
- ASCII + Armenian 2 = Armenian code page #2
- ASCII + Cyrillic 1 = Cyrillic code page #1
ANSI and similar code pages need 8 bits for character representation:
(“A” – 0100 0001₂) (“Ա” – 0xB2 – 1011 0010₂) (“Б” – 0xE2 – 1110 0010₂)
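To see why code pages cause trouble, the same two bytes can be decoded with several 8-bit code pages. The slides do not name specific code pages, so the codecs below (`koi8_r`, `cp1251`, `cp866`, `latin-1`, all part of Python's standard codecs) are stand-ins chosen for this sketch; `decode_with` is a hypothetical helper.

```python
RAW = bytes([0x41, 0xE2])  # one byte from the ASCII half, one from the upper half


def decode_with(codec: str) -> str:
    """Interpret RAW according to one single-byte code page."""
    return RAW.decode(codec)


for codec in ("koi8_r", "cp1251", "cp866", "latin-1"):
    print(f"{codec:>8}: {decode_with(codec)!r}")
```

The first byte comes out as “A” every time, because all of these pages share the ASCII half. The second byte turns into a different character on each page, so a document is readable only if the reader knows which code page the writer used.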

9 Double (or multi) byte character sets
A DBCS (double-byte character set) starts off with 256 codes. Like any well-behaved code page, the first 128 of these codes are ASCII. However, some of the codes in the higher 128 are always followed by a second byte. The two bytes together (called a lead byte and a trail byte) define a single character, usually a complex ideograph.
Diagram: ASCII (128 chars) | Chinese code page (lead byte) | Chinese code page (trail byte)

10 Double (or multi) byte character sets
Advantage: DBCS makes it possible to create code pages for languages that have more than 256 letters or signs.
Disadvantage: documents created with different code pages, even for the same language (Cyrillic or Armenian), are still not compatible.
The problem with a double-byte character set is not that characters are represented by 2 bytes. The problem is that some characters (in particular, the ASCII characters) are represented by 1 byte. This creates odd programming problems. For example, the number of characters in a character string cannot be determined by the byte size of the string.
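The length problem can be illustrated with Python's `gbk` codec, used here as one example of a Chinese double-byte code page (the slides do not say which one they have in mind). The helper `count_dbcs_chars` is hypothetical, and its lead-byte test (any byte of 0x81 or above starts a two-byte character) is an assumption that fits GBK, not a general rule for every DBCS.

```python
text = "A\u4e2dB"             # "A", one Chinese ideograph (U+4E2D), "B"
encoded = text.encode("gbk")  # ASCII letters take 1 byte, the ideograph takes 2


def count_dbcs_chars(data: bytes) -> int:
    """Count characters by walking the bytes and skipping the trail byte after each lead byte."""
    count = 0
    i = 0
    while i < len(data):
        # Assumption for GBK: a byte >= 0x81 is a lead byte followed by one trail byte.
        i += 2 if data[i] >= 0x81 else 1
        count += 1
    return count


print(len(text), "characters,", len(encoded), "bytes")
print("characters found by scanning:", count_dbcs_chars(encoded))
```

The byte size (4) does not give the character count (3); the program has to scan the string from the start to find out where each character begins.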

11 Unicode
- The best thing about Unicode is that there's only one character set. There's simply no ambiguity.
- The representation in bits is large enough to accommodate all the languages and signs.
- Flavors: UTF-8, UTF-16, UTF-32.
Code point – the symbol’s identifier.

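12 Unicode UTF-8 (like multibyte)
Code point – the symbol’s identifier.
Bits  Last code point  Byte 1    Byte 2    Byte 3    Byte 4    Byte 5    Byte 6
 7    U+007F           0xxxxxxx
11    U+07FF           110xxxxx  10xxxxxx
16    U+FFFF           1110xxxx  10xxxxxx  10xxxxxx
21    U+1FFFFF         11110xxx  10xxxxxx  10xxxxxx  10xxxxxx
26    U+3FFFFFF        111110xx  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx
31    U+7FFFFFFF       1111110x  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx
Note: the current UTF-8 standard (RFC 3629) stops at U+10FFFF, so only the 1- to 4-byte forms are used today.
The Armenian character set uses the code points 0x0530 through 0x058F.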
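The bit layout in the UTF-8 table can be followed by hand in code. The sketch below covers only the 1- to 4-byte rows (code points up to U+10FFFF, which is an assumption of this example rather than something the slide states); `utf8_encode` is a made-up name, and the result is compared against Python's own encoder.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point with the 1- to 4-byte UTF-8 patterns."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable code point")
    if cp <= 0x7F:      # 7 bits:  0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:     # 11 bits: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:    # 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


for cp in (0x41, 0x0531, 0x20AC, 0x1F600):
    mine = utf8_encode(cp)
    print(f"U+{cp:04X}: {mine.hex(' ')}  matches built-in: {mine == chr(cp).encode('utf-8')}")
```

U+0531 (the Armenian capital letter Ayb) falls in the 11-bit row and therefore takes two bytes, as does every code point in the Armenian range 0x0530 to 0x058F.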

13 Unicode UTF-16
The Unicode code space is divided into seventeen PLANES of 2^16 (65,536) code points each.
The code points in each plane have the hexadecimal values xx0000 to xxFFFF, where xx is a hex value from 00 to 10.
1st plane: code points U+0000 to U+D7FF and U+E000 to U+FFFF – Basic Multilingual Plane – most frequently used characters.
Code points U+D800 to U+DFFF – extensions (reserved as surrogates, which UTF-16 uses in pairs to encode code points in the other planes).
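A minimal sketch of the plane arithmetic and of how UTF-16 uses the reserved range U+D800 to U+DFFF to reach planes 1 through 16. The function names `plane_of` and `utf16_units` are invented for this example; the surrogate-pair formula is the standard UTF-16 one.

```python
def plane_of(cp: int) -> int:
    """Plane number: the 'xx' part of xx0000..xxFFFF."""
    return cp >> 16


def utf16_units(cp: int) -> list:
    """Return the 16-bit code units UTF-16 uses for one code point."""
    if 0xD800 <= cp <= 0xDFFF or cp > 0x10FFFF or cp < 0:
        raise ValueError("not an encodable code point")
    if cp < 0x10000:          # Basic Multilingual Plane: one 16-bit unit
        return [cp]
    v = cp - 0x10000          # 20 bits remain for planes 1..16
    return [0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)]  # high and low surrogate


for cp in (0x0531, 0xFFFF, 0x1F600, 0x10FFFF):
    units = " ".join(f"{u:04X}" for u in utf16_units(cp))
    print(f"U+{cp:04X}: plane {plane_of(cp):2d}, UTF-16 units {units}")
```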

14 Unicode
Advantages:
- Supports all languages and signs with a single code page.
- UTF-8 is backward compatible with ASCII.
- The UTF-16 Basic Multilingual Plane (16 bits, 2 bytes per character) and UTF-32 (4 bytes per character) make it possible to work with characters as with regular text, because every character has the same size.

15 Unicode
Disadvantages:
- With fixed 2- or 4-byte characters, UTF-16 and UTF-32 strings occupy twice or four times as much memory as ASCII strings.
- UTF-8 has variable length. This is an advantage, because text takes less space than in the fixed-width Unicode encodings, and a disadvantage for programming. For example, the number of characters in a character string cannot be determined by the byte size of the string.
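The memory trade-off can be measured directly. This sketch uses Python's `utf-8`, `utf-16-le` and `utf-32-le` codecs (the `-le` variants are chosen so that no byte-order mark is added to the count); `encoded_sizes` is a hypothetical helper, and the Armenian sample word is only an illustration.

```python
def encoded_sizes(text: str) -> dict:
    """Byte size of the same text in the three Unicode encoding forms."""
    return {codec: len(text.encode(codec))
            for codec in ("utf-8", "utf-16-le", "utf-32-le")}


english = "Hello"
armenian = "\u0532\u0561\u0580\u0587"  # four Armenian letters, all in U+0530..U+058F

for sample in (english, armenian):
    print(len(sample), "characters:", encoded_sizes(sample))
```

For the ASCII text, UTF-16 and UTF-32 take two and four times the space of UTF-8. For the Armenian text, UTF-8 needs two bytes per character, so its byte size no longer equals the character count.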

