Machine level representation of data Character representation

Bits can represent anything!
Characters?
  26 letters → 5 bits (2^5 = 32)
  upper/lower case + punctuation → 7 bits (in 8) ("ASCII")
  standard code to cover all the world's languages → 8, 16, 32 bits ("Unicode")
Logical values? 0 → False, 1 → True
Colors? Ex: Red (100), Green (010), Blue (001)
Locations / addresses? Commands?
MEMORIZE: N bits → at most 2^N things

Characters representation in computers and devices
ASCII → ANSI → MultiByte → Unicode
ASCII: American Standard Code for Information Interchange. This was the de facto worldwide standard for the code numbers that computers use to represent the upper- and lower-case Latin letters, digits, punctuation, etc.
How many bits do we need to represent 5 letters? How many for each group below?
  4 letters (A, B, C, D): 2 bits, 2^2 = 4 patterns
  26 English uppercase letters: 5 bits, 2^5 = 32 patterns
  26 English lowercase letters: 5 bits, 2^5 = 32 patterns
  Decimal digits and special signs: 5 bits, 2^5 = 32 patterns
  Special control characters: 5 bits, 2^5 = 32 patterns
How many bits do we need to represent all 128 patterns? Four groups of 32 give 4 × 32 = 128 = 2^7, so 7 bits.
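The counting rule on this slide (N bits give at most 2^N patterns) can be turned around to ask how many bits a given number of symbols needs. The sketch below is only an illustration; the helper name `bits_needed` is made up for this example and does not come from the slides.

```python
def bits_needed(count: int) -> int:
    """Smallest N such that 2**N >= count (hypothetical helper)."""
    if count < 1:
        raise ValueError("count must be at least 1")
    # (count - 1).bit_length() is the number of bits needed to write
    # the largest index, count - 1, in binary.
    return (count - 1).bit_length()


if __name__ == "__main__":
    for count in (4, 5, 26, 128, 256):
        n = bits_needed(count)
        print(f"{count:>3} symbols -> {n} bits (2^{n} = {2 ** n} patterns)")
```

For 5 letters this gives 3 bits, since 2 bits cover only 4 patterns.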

The Content of ASCII table
Contains 128 characters. ASCII needs only 7 bits per character ("A" = 0x41 = 1000001).
The first printable character is SP (space) and corresponds to the bit pattern 0x20.
The characters A and B correspond to: A = 0x41, B = 0x42.
Find the ASCII code of "z": z = 0x7A.

Rows: 16's place. Columns: 1's place.

      0    1    2    3    4    5    6    7    8    9    A    B    C    D    E    F
0     NUL  SOH  STX  ETX  EOT  ENQ  ACK  BEL  BS   HT   LF   VT   FF   CR   SO   SI
1     DLE  DC1  DC2  DC3  DC4  NAK  SYN  ETB  CAN  EM   SUB  ESC  FS   GS   RS   US
2     SP   !    "    #    $    %    &    '    (    )    *    +    ,    -    .    /
3     0    1    2    3    4    5    6    7    8    9    :    ;    <    =    >    ?
4     @    A    B    C    D    E    F    G    H    I    J    K    L    M    N    O
5     P    Q    R    S    T    U    V    W    X    Y    Z    [    \    ]    ^    _
6     `    a    b    c    d    e    f    g    h    i    j    k    l    m    n    o
7     p    q    r    s    t    u    v    w    x    y    z    {    |    }    ~    DEL
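The codes quoted on this slide can be checked with Python's built-in `ord` and `chr`. This is a small sketch for verification only; the helper name `describe` is an assumption made for the example.

```python
def describe(ch: str) -> str:
    """Show a character with its hex code and 7-bit pattern (hypothetical helper)."""
    code = ord(ch)
    if code > 0x7F:
        raise ValueError(f"{ch!r} is not an ASCII character")
    return f"{ch!r}: 0x{code:02X} = {code:07b}"


if __name__ == "__main__":
    for ch in (" ", "A", "B", "z"):
        print(describe(ch))
    # chr goes the other way: from code to character.
    print(chr(0x7A))  # z
```

The row and column of the chart are simply the two hex digits of the code: "z" sits in row 7, column A.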

Control characters
Control characters are not displayed or printed; they control the device. The same control character can have a different meaning on different devices. They occupy codes 0x00 to 0x1F (the first two rows of the ASCII chart) plus DEL (0x7F).
0x0A: Line Feed (LF), 0x0D: Carriage Return (CR)
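As a rough illustration of the control range, the sketch below picks out the control characters hidden in an ordinary string. The helper `is_control` is a hypothetical name, and the rule it applies (codes below 0x20, plus 0x7F) is the standard ASCII layout.

```python
def is_control(ch: str) -> bool:
    """True for ASCII control characters: 0x00-0x1F and DEL (0x7F)."""
    code = ord(ch)
    return code < 0x20 or code == 0x7F


if __name__ == "__main__":
    # Python's escapes "\n" and "\r" are exactly LF (0x0A) and CR (0x0D).
    assert "\n" == chr(0x0A) and "\r" == chr(0x0D)

    text = "line one\r\nline two\tend"
    found = [f"0x{ord(c):02X}" for c in text if is_control(c)]
    print(found)  # ['0x0D', '0x0A', '0x09']

    # 32 codes in the first two rows plus DEL.
    print(sum(is_control(chr(i)) for i in range(128)))  # 33
```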

ASCII code Advantages, Disadvantages
(Figure: the ASCII chart with highlighted regions. Callouts: "Why is this an advantage?", "How?", "How?")
The history of ASCII since 1967 is mostly a history of attempts to overcome its limitations and make it more applicable to languages other than American English.

ASCII code Extension. Character sets.
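With 8 bits there are 256 codes: the 128 ASCII characters (7 bits) plus 128 extra characters that depend on the character set.
  ANSI chars:  ASCII (128 chars) + extra 128 chars
  IBM chars:   ASCII (128 chars) + extra 128 chars
  Other:       ASCII (128 chars) + another alphabet, e.g. Armenian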

1 byte: 256 characters. Code pages
ANSI and similar code pages need 8 bits per character. "A" keeps its ASCII code in every code page, while the codes above 0x7F depend on the code page, e.g. "Ա" (0xB…) in an Armenian code page and "Б" (0xE…) in a Cyrillic one.
  ANSI chars:             ASCII + extra 128 chars
  IBM chars:              ASCII + extra 128 chars
  Armenian code page #1:  ASCII + Armenian 1
  Armenian code page #2:  ASCII + Armenian 2
  Cyrillic code page #1:  ASCII + Cyrillic 1
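The effect of code pages can be seen by decoding one and the same byte under several of them. The sketch below uses codec names that ship with Python (`cp1252`, `cp1251`, `cp437`); mapping them to the slide's "ANSI chars" and "IBM chars" is an assumption, and a Cyrillic page stands in for the Armenian ones. The helper name `decode_byte` is hypothetical.

```python
def decode_byte(value: int, codec: str) -> str:
    """Decode a single byte under the given code page (hypothetical helper)."""
    return bytes([value]).decode(codec)


if __name__ == "__main__":
    codecs = ("cp1252", "cp1251", "cp437")

    # The lower half (ASCII) is the same everywhere.
    for codec in codecs:
        print(codec, "0x41 ->", decode_byte(0x41, codec))

    # The upper half depends entirely on the code page.
    for codec in codecs:
        print(codec, "0xC1 ->", decode_byte(0xC1, codec))
```

A document saved under one code page and opened under another therefore shows the wrong letters for every byte above 0x7F.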

Double (or multi) byte character sets
(Figure: a Chinese code page, shown as a lead-byte table and a trail-byte table next to the 128 ASCII chars)
A DBCS starts off with 256 codes. Like any well-behaved code page, the first 128 of these codes are ASCII. However, some of the codes in the higher 128 are always followed by a second byte. The two bytes together (called a lead byte and a trail byte) define a single character, usually a complex ideograph.

Double (or multi) byte character sets
Advantage: a DBCS makes it possible to create code pages for languages that have more than 256 letters or signs.
Disadvantage: documents created with different code pages, even for the same language (Cyrillic or Armenian), are still not compatible.
The problem with a double-byte character set is not that characters are represented by 2 bytes. The problem is that some characters (in particular, the ASCII characters) are represented by 1 byte. This creates odd programming problems. For example, the number of characters in a string cannot be determined from the byte size of the string.
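The counting problem described above can be reproduced with Python's `gbk` codec, used here as one example of a double-byte Chinese code page. The scanner `count_dbcs_chars` is a hypothetical, simplified helper: it assumes well-formed input in which any byte of 0x81 or above at a character boundary is a lead byte followed by exactly one trail byte.

```python
def count_dbcs_chars(data: bytes) -> int:
    """Count characters in DBCS data by walking lead bytes (simplified sketch)."""
    count = 0
    i = 0
    while i < len(data):
        # Assumption: bytes >= 0x81 are lead bytes and take one trail byte;
        # everything below is a single-byte (ASCII) character.
        i += 2 if data[i] >= 0x81 else 1
        count += 1
    return count


if __name__ == "__main__":
    text = "A中B文"
    data = text.encode("gbk")
    print(len(text), "characters,", len(data), "bytes")  # 4 characters, 6 bytes
    print(count_dbcs_chars(data))  # 4
```

The byte length alone (6) says nothing about the character count; the string has to be scanned from the start.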

Unicode
The best thing about Unicode is that there is only one character set, so there is no ambiguity. Every symbol has a single identifier, its code point, and the code space is large enough to accommodate all languages and signs.
Flavors (encodings): UTF-8, UTF-16, UTF-32.

Unicode UTF-8 (like multibyte)
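Code point: the symbol's identifier.
Leading byte: the first byte of a multi-byte sequence. Its header bits tell the length: the number of leading 1s equals the number of bytes in the sequence.
Continuation byte: every following byte of the sequence; it always starts with 10.
Single byte: the header bit is always 0 (plain ASCII).

Bits  Last code point  Byte 1    Byte 2    Byte 3    Byte 4    Byte 5    Byte 6
 7    U+007F           0xxxxxxx
11    U+07FF           110xxxxx  10xxxxxx
16    U+FFFF           1110xxxx  10xxxxxx  10xxxxxx
21    U+1FFFFF         11110xxx  10xxxxxx  10xxxxxx  10xxxxxx
26    U+3FFFFFF        111110xx  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx
31    U+7FFFFFFF       1111110x  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx

The 5- and 6-byte rows belong to the original UTF-8 design; the current standard (RFC 3629) stops at 4 bytes and U+10FFFF.
The Armenian character set uses the code points U+0530 through U+058F.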
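To make the header and continuation bits concrete, here is a minimal hand-written encoder for the 1- to 4-byte rows of the table, checked against Python's own `str.encode`. It is a teaching sketch, not a replacement for the built-in codec; the function name `utf8_encode` is an assumption, and the sketch does not reject surrogate code points.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point following the UTF-8 bit layout (sketch, 1-4 bytes)."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("code point out of range")
    if code_point <= 0x7F:        # 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:       # 110xxxxx 10xxxxxx
        return bytes([0b11000000 | (code_point >> 6),
                      0b10000000 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:      # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0b11100000 | (code_point >> 12),
                      0b10000000 | ((code_point >> 6) & 0x3F),
                      0b10000000 | (code_point & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0b11110000 | (code_point >> 18),
                  0b10000000 | ((code_point >> 12) & 0x3F),
                  0b10000000 | ((code_point >> 6) & 0x3F),
                  0b10000000 | (code_point & 0x3F)])


if __name__ == "__main__":
    for ch in ("A", "\u0531", "\u4e2d", "\U0001F600"):
        encoded = utf8_encode(ord(ch))
        assert encoded == ch.encode("utf-8")
        bits = " ".join(f"{b:08b}" for b in encoded)
        print(f"U+{ord(ch):04X} -> {bits}")
```

For the Armenian letter U+0531 the result is the two bytes 11010100 10110001 (0xD4 0xB1): two leading 1s in the first byte, and a continuation byte starting with 10.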

Unicode UTF-16
The Unicode code space is divided into seventeen planes of 2^16 (65,536) code points each. The code points in each plane have the hexadecimal values xx0000 to xxFFFF, where xx is a hex value from 00 to 10.
1st plane, code points U+0000 to U+D7FF and U+E000 to U+FFFF: the Basic Multilingual Plane (BMP), which holds the most frequently used characters. Each of them is stored in one 16-bit unit.
Code points U+D800 to U+DFFF: extensions (surrogates). UTF-16 uses them in pairs to encode the code points of the other 16 planes.
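The plane structure and the role of the U+D800 to U+DFFF range can be sketched in a few lines. The helper names `plane_of` and `utf16_units` are made up for this example; the arithmetic is the standard UTF-16 surrogate-pair rule, and the result is compared with Python's `utf-16-be` codec.

```python
def plane_of(code_point: int) -> int:
    """Plane number 0..16: the 'xx' part of xxFFFF."""
    return code_point >> 16


def utf16_units(code_point: int) -> list[int]:
    """Return the 16-bit units UTF-16 uses for one code point (sketch)."""
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points are not characters")
    if code_point < 0x10000:
        return [code_point]                 # BMP: one unit
    offset = code_point - 0x10000           # 20 bits remain
    high = 0xD800 + (offset >> 10)          # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)         # bottom 10 bits
    return [high, low]


if __name__ == "__main__":
    for cp in (0x0531, 0x1F600):
        units = utf16_units(cp)
        as_bytes = b"".join(u.to_bytes(2, "big") for u in units)
        assert as_bytes == chr(cp).encode("utf-16-be")
        print(f"U+{cp:04X}: plane {plane_of(cp)},",
              " ".join(f"{u:04X}" for u in units))
```

U+0531 stays a single unit, while U+1F600 (plane 1) becomes the pair D83D DE00.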

Unicode Advantage:
Supports all languages and signs with a single character set.
UTF-8 is backward compatible with ASCII.
UTF-16 within the Basic Multilingual Plane (16 bits, 2 bytes per character) and UTF-32 (4 bytes per character) are fixed width, so strings can be handled like regular single-byte text.

Unicode Disadvantage:
Fixed-width 2-byte UTF-16 and 4-byte UTF-32 strings occupy two or four times as much memory as ASCII strings.
UTF-8 has variable length. This is an advantage in size compared with the fixed-width encodings, and a disadvantage in programming: for example, the number of characters in a string cannot be determined from its size in bytes.
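The size trade-off can be measured directly. The sketch below compares byte counts for the same text in the three encodings; the helper name `encoded_sizes` is hypothetical, and the little-endian codec variants are chosen only so that no byte-order mark is added to the count.

```python
def encoded_sizes(text: str) -> dict[str, int]:
    """Byte length of text in UTF-8, UTF-16 and UTF-32 (no byte-order mark)."""
    return {enc: len(text.encode(enc))
            for enc in ("utf-8", "utf-16-le", "utf-32-le")}


if __name__ == "__main__":
    for text in ("Hello", "\u0532\u0561\u0580\u0587", "A\u4e2d\U0001F600"):
        print(f"{len(text)} characters:", encoded_sizes(text))
```

For "Hello" the sizes are 5, 10 and 20 bytes. For the three-character string at the end, UTF-8 and UTF-16 both take 8 bytes, so neither byte count reveals the number of characters; only UTF-32 is always 4 bytes per code point.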
