Presentation on theme: "ITR3 lecture 1: bits, bytes and characters Thomas Krichel 2002-09-10."— Presentation transcript:
ITR3 lecture 1: bits, bytes and characters Thomas Krichel
Structure Bits Bytes Character sets
Literature Norton, New Inside the PC, chapter 4; http://www.danbbs.dk/~erikoest/bb_terms.htm; http://wwwinfo.cern.ch/asdoc/WWW/publications/ictp99/ictp99N2705.html; http://www.cl.cam.ac.uk/~mgk25/unicode.html
Information Information is best understood as what it takes to answer a question. The simplest question has a yes or no answer, so a bit, which distinguishes between two alternatives, is the natural measure of information. The term was first used by John Tukey, as a contraction of "binary digit".
Usage of bits Computers are sometimes classified by –the number of bits they can process at one time (register size) or by –the number of bits they use to represent addresses (address size). These two values are not always the same. Larger registers make a computer faster, because it handles more data per operation; more address bits let a machine address more memory and so support larger programs. Graphics are also often described by the number of bits used to represent each dot.
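As a small illustration of why the number of bits matters, the sketch below computes how many different values a field of a given width can hold. The function name `distinct_values` is made up for this example; the same arithmetic applies to address sizes and to bits per dot in graphics.

```python
def distinct_values(bits):
    """Number of different values that a field of the given width can hold."""
    return 2 ** bits


# With n address bits a machine can distinguish 2**n memory locations;
# with n bits per dot an image can use 2**n different colours.
for bits in (8, 16, 32):
    print(f"{bits} bits give {distinct_values(bits)} distinct values")
```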
Many bits The first chips processed 8 bits at a time, and it became customary to refer to a group of 8 bits as a byte. Larger units are –Kilobyte is 2 to the power 10 bytes –Megabyte is 2 to the power 20 bytes –Gigabyte is 2 to the power 30 bytes –Terabyte is 2 to the power 40 bytes. The prefixes come from ancient Greek words for "thousand", "large", "giant", and "monster", respectively. The prefix kilo dates back to the French Revolution; the others were added later.
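The powers of two above can be checked with a few lines of Python. This is only a sketch; the table name `UNITS` and the helper `to_bytes` are invented for the example.

```python
# Binary multiples of a byte, using the powers of two given on the slide.
UNITS = {
    "kilobyte": 2 ** 10,
    "megabyte": 2 ** 20,
    "gigabyte": 2 ** 30,
    "terabyte": 2 ** 40,
}


def to_bytes(amount, unit):
    """Convert an amount expressed in the given unit into bytes."""
    return amount * UNITS[unit]


for name, size in UNITS.items():
    print(f"1 {name} = {size} bytes")
```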
More than a monster In 1975, the General Conference on Weights and Measures (CGPM), based at Sèvres near Paris, agreed to add peta- (P) and exa- (E). A petabyte is 2 to the power 50 bytes and an exabyte is 2 to the power 60 bytes. Nowadays they are followed by zettabyte (2 to the power 70) and yottabyte (2 to the power 80).
Hex numbers Each hex digit can encode 16 values, written 0 to 9, then A B C D E F, where A is 10 and F is 15. Here, hex numbers are prefixed with 0x. Each byte can be represented with two hex digits, from 0x00 to 0xFF. Use the Microsoft calculator in scientific mode to convert.
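For those without the calculator at hand, the conversion can also be sketched in Python. The helper names `byte_to_hex` and `hex_to_int` are made up for this example; `int(text, 16)` and the `02X` format code are standard Python.

```python
def byte_to_hex(value):
    """Return a byte value (0 to 255) as two hex digits prefixed with 0x."""
    if not 0 <= value <= 255:
        raise ValueError("a byte holds values from 0 to 255")
    return f"0x{value:02X}"


def hex_to_int(text):
    """Convert a hex string, with or without the 0x prefix, to an integer."""
    return int(text, 16)


print(byte_to_hex(169))
print(hex_to_int("0xF"))
```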
Representing characters Computers don't understand text, they only understand numbers. For computers to be able to process text, there must be a correspondence between numbers and text characters. Such a correspondence is called a coded character set. Important examples are –ASCII –ISO 8859-1 –cp1252
ASCII American Standard Code for Information Interchange. It is a 7-bit character set; there is no such thing as 8-bit ASCII. It has 95 printable symbols and 33 control characters (0-31, 127). http://web.cs.mun.ca/~michael/c/ascii-table.html has a list.
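The counts on this slide can be verified by walking over all 128 code positions. A minimal sketch, with variable names chosen only for this example:

```python
# A 7-bit set has 2**7 = 128 code positions, numbered 0 to 127.
all_codes = range(2 ** 7)

# Control characters sit at positions 0 to 31 and at 127.
control = [c for c in all_codes if c < 32 or c == 127]

# Everything else, from the space (32) to the tilde (126), is printable.
printable = [c for c in all_codes if 32 <= c <= 126]

print(len(control), "control characters")
print(len(printable), "printable symbols")
print("".join(chr(c) for c in printable))
```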
ASCII control codes –ACK (^F) is used to acknowledge receipt of a message –NAK (^U) is used to signal non-receipt –CR (^M) is the carriage return –LF (^J) is the linefeed –FF (^L) is the form feed (new page) –BS (^H) is the backspace –DEL (ALT-127) is delete –ESC (^[) is escape. Different programs use them in different ways, a big pain in the a…
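The ^ notation follows a simple rule: flip bit 0x40 of the control code and you get the letter or symbol to press together with the control key. The sketch below assumes that convention; the function name `caret_notation` is made up, and writing DEL as ^? is a common convention rather than something stated on the slide.

```python
def caret_notation(code):
    """Return the ^X form of an ASCII control code (0 to 31, or 127)."""
    if not (0 <= code <= 31 or code == 127):
        raise ValueError("not an ASCII control code")
    # Flipping bit 0x40 maps 6 (ACK) to 0x46 'F', 27 (ESC) to 0x5B '[', etc.
    return "^" + chr(code ^ 0x40)


CODES = {"ACK": 6, "NAK": 21, "CR": 13, "LF": 10, "FF": 12, "BS": 8, "ESC": 27}
for name, code in CODES.items():
    print(name, code, caret_notation(code))
```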
ISO 8859-1 PCs work with 8-bit bytes, so manufacturers were free to fill the other 128 code positions (128 to 255) as they liked. A standard set was defined by the ISO; it extends ASCII with characters that are used by the western European languages. It is the default character set of HTML. Positions 128 to 159 are not used for printable characters. Cp1252 fills these with graphic characters.
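The difference between the two sets shows up when the same bytes are decoded both ways. A minimal sketch using Python's built-in `latin-1` and `cp1252` codecs; the sample bytes are chosen only for illustration.

```python
# 0x80 lies in the range 128 to 159; 0xE9 lies above it.
raw = bytes([0x80, 0xE9])

as_latin1 = raw.decode("latin-1")
as_cp1252 = raw.decode("cp1252")

# Above 159 the two sets agree: 0xE9 is e with acute accent in both.
print(as_latin1[1] == as_cp1252[1])

# At 0x80, Latin-1 has no graphic character, while cp1252 has the euro sign.
print(hex(ord(as_latin1[0])), hex(ord(as_cp1252[0])))
```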
Three concepts for characters –Abstract character repertoire: the set of characters to be encoded, e.g., some alphabet or symbol set –Coded character set: a mapping from an abstract character repertoire to a set of non-negative integers –Character encoding scheme: a mapping from a coded character set to a serialized sequence of bytes
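The three concepts can be seen side by side for a single character. This is a sketch that uses Python's own string model as the coded character set; the choice of character and of encodings is arbitrary.

```python
# 1. Repertoire: the abstract character "e with acute accent".
char = "\u00e9"

# 2. Coded character set: the character is mapped to a non-negative integer.
code_point = ord(char)

# 3. Encoding scheme: the integer is serialized as bytes, in several ways.
encodings = {name: char.encode(name) for name in ("latin-1", "utf-8", "utf-16-be")}

print(hex(code_point))
for name, data in encodings.items():
    print(name, data.hex())
```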
ISO 10646 Defines the Universal Character Set (UCS). The UCS contains the characters required to write practically all known languages, even the likes of Gurmukhi, Oriya, Telugu, Bopomofo, Runic. There are proposals for more, like Hieroglyphs and Tengwar. Note that there are about 6800 known languages.
UCS organization ISO 10646 formally defines a 31-bit character set. Code positions are represented as 32 bits, i.e. 4 bytes, or 8 hex digits. The canonical form of ISO 10646 uses a four-dimensional coding space. With 31 bits, it consists of 128 groups (0x00 to 0x7F). Each group consists of 256 planes, with each plane containing 256 rows, each having 256 cells.
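Since each of the four dimensions corresponds to one byte of the 4-byte form, a code position can be taken apart with shifts and masks. A minimal sketch; the function name `split_ucs4` is invented for this example.

```python
def split_ucs4(code_point):
    """Split a 31-bit UCS code position into (group, plane, row, cell)."""
    if not 0 <= code_point < 2 ** 31:
        raise ValueError("UCS code positions use at most 31 bits")
    group = (code_point >> 24) & 0x7F  # top 7 bits
    plane = (code_point >> 16) & 0xFF
    row = (code_point >> 8) & 0xFF
    cell = code_point & 0xFF
    return group, plane, row, cell


# A character with group 0 and plane 0 lies in the first plane of the first group.
print(split_ucs4(0x2260))
```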
UCS organization The first plane (Plane 0x00) of Group 0x00 is called the Basic Multilingual Plane (BMP). It has been fixed since first publication. The subsequent 223 planes (0x01 to 0xDF) of Group 0x00, as well as planes 0x00 to 0xFF in Groups 0x01 to 0x5F, are reserved for further standardization. The last 32 planes (0xE0 to 0xFF) of Group 0x00, as well as all code positions of the 32 groups 0x60 to 0x7F, are reserved for private use.
Relationship with legacy sets Let U+ followed by four hex digits denote a character in the BMP. The UCS characters U+0000 to U+007F are identical to those in ASCII. The range U+0000 to U+00FF is identical to ISO 8859-1 (Latin-1).
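This relationship can be checked mechanically: decoding each legacy byte should give the UCS character with the same number. The sketch relies on Python's built-in `ascii` and `latin-1` codecs, where `chr(n)` gives the character at code position n.

```python
# Every ASCII byte decodes to the UCS character with the same number.
ascii_matches = all(
    bytes([b]).decode("ascii") == chr(b) for b in range(0x80)
)

# The same holds for all 256 Latin-1 byte values.
latin1_matches = all(
    bytes([b]).decode("latin-1") == chr(b) for b in range(0x100)
)

print(ascii_matches, latin1_matches)
```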
Types of characters in UCS Letters –Base characters –Ideographic characters –Combining characters Digits Extenders
Unicode The Unicode Consortium is an industry consortium. The Unicode Standard, published by the Unicode Consortium, corresponds to the BMP of ISO 10646. All characters are at the same positions and have the same names in both standards. In addition, the Unicode Standard defines much more of the semantics associated with some of the characters. There is a free online book. Now recall the difference between character code and encoding…
Encodings for UCS The two most obvious encodings store each character of a Unicode text as either 2 or 4 bytes. The official terms for these encodings are UCS-2 and UCS-4 respectively. Of course, UCS-2 can only represent plane 0, the BMP. Unless otherwise specified, the most significant byte comes first in these (big-endian convention). A Latin-1 file can be transformed into a UCS-2 file by simply inserting a 0x00 byte in front of every byte. If we want to have a UCS-4 file, we have to insert three 0x00 bytes before every Latin-1 byte instead.
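The zero-padding trick is short enough to write out. In this sketch the function names are made up, and Python's `utf-16-be` and `utf-32-be` codecs are used as a cross-check, on the assumption that they agree with big-endian UCS-2 and UCS-4 for characters in the Latin-1 range.

```python
def latin1_to_ucs2(data):
    """Insert one 0x00 byte in front of every Latin-1 byte."""
    return b"".join(b"\x00" + bytes([b]) for b in data)


def latin1_to_ucs4(data):
    """Insert three 0x00 bytes in front of every Latin-1 byte."""
    return b"".join(b"\x00\x00\x00" + bytes([b]) for b in data)


latin1_file = "A\u00e9".encode("latin-1")  # two bytes: 0x41 0xE9
print(latin1_to_ucs2(latin1_file).hex())
print(latin1_to_ucs4(latin1_file).hex())
```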
The byte order mark In order to allow the automatic detection of the byte order, Microsoft wants every Unicode file to start with the character U+FEFF (ZERO WIDTH NO-BREAK SPACE), also known as the Byte-Order Mark. Its byte-swapped equivalent U+FFFE is not a valid Unicode character, so it helps to distinguish the big-endian and little-endian variants unambiguously. This causes a lot of trouble on other systems.
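A reader of a 2-byte file can use the mark roughly as follows. This is a sketch that assumes 2-byte (UCS-2) data only; the function name `detect_byte_order` is invented, while the `codecs.BOM_UTF16_BE` and `codecs.BOM_UTF16_LE` constants come from Python's standard library.

```python
import codecs


def detect_byte_order(data):
    """Guess the byte order of 2-byte Unicode data from its first two bytes."""
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        return "big-endian"
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE
        return "little-endian"
    return "unknown"  # no mark: the reader has to fall back on a convention


print(detect_byte_order(b"\xfe\xff\x00A"))
print(detect_byte_order(b"\xff\xfeA\x00"))
```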
UTF-8 encoding This is a variable-length encoding of UCS. It is the one used by default in XML. It has a number of desirable properties. UCS characters U+0000 to U+007F (ASCII) are encoded simply as bytes 0x00 to 0x7F. This means that files and strings which contain only ASCII characters have the same encoding under both ASCII and UTF-8. All UCS characters higher than U+007F are encoded as a sequence of several bytes, each of which has the most significant bit set. Therefore, no ASCII byte (0x00-0x7F) can appear as part of the encoding of any other character.
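Both properties can be observed with Python's built-in `utf-8` codec. A minimal sketch; the sample strings and the helper `non_ascii_bytes` are made up for the example.

```python
ascii_text = "bits and bytes"
mixed_text = "caf\u00e9 \u2260 cafe"

# Property 1: pure ASCII text gives the same bytes under ASCII and UTF-8.
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")


def non_ascii_bytes(text):
    """Collect the UTF-8 bytes that encode the non-ASCII characters of text."""
    collected = []
    for ch in text:
        if ord(ch) > 0x7F:
            collected.extend(ch.encode("utf-8"))
    return collected


# Property 2: every such byte has its most significant bit set.
high_bit_set = all(b & 0x80 for b in non_ascii_bytes(mixed_text))

print(same_bytes, high_bit_set)
```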
UTF-8 encoding The first byte of a multibyte sequence that represents a non-ASCII character is always in the range 0xC0 to 0xFD, and it indicates how many bytes follow for this character. All further bytes in a multibyte sequence are in the range 0x80 to 0xBF. This allows easy resynchronization and makes the encoding stateless and robust against missing bytes. UTF-8 encoded characters may theoretically be up to six bytes long; however, BMP characters are at most three bytes long. The sorting order of big-endian UCS-4 byte strings is preserved. The bytes 0xFE and 0xFF are never used in the UTF-8 encoding.
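The byte ranges above are enough to find character boundaries without decoding anything. The sketch below follows the six-byte scheme described on this slide; the function names `sequence_length`, `is_continuation` and `resync` are made up for the example.

```python
def sequence_length(first_byte):
    """Return how many bytes the character starting with first_byte occupies."""
    if first_byte <= 0x7F:
        return 1  # plain ASCII
    if 0xC0 <= first_byte <= 0xDF:
        return 2
    if 0xE0 <= first_byte <= 0xEF:
        return 3
    if 0xF0 <= first_byte <= 0xF7:
        return 4
    if 0xF8 <= first_byte <= 0xFB:
        return 5
    if 0xFC <= first_byte <= 0xFD:
        return 6
    raise ValueError("not a valid first byte")


def is_continuation(byte):
    """True for the second and later bytes of a multibyte sequence."""
    return 0x80 <= byte <= 0xBF


def resync(data, pos):
    """Skip forward from pos to the start of the next character."""
    while pos < len(data) and is_continuation(data[pos]):
        pos += 1
    return pos
```

Resynchronization after damage then amounts to skipping continuation bytes until the next first byte is found.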
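UTF encoding table
Bits  Hex Min   Hex Max   Byte Sequence in Binary
7     00000000  0000007F  0???????
11    00000080  000007FF  110????? 10??????
16    00000800  0000FFFF  1110???? 10?????? 10??????
21    00010000  001FFFFF  11110??? 10?????? 10?????? 10??????
26    00200000  03FFFFFF  111110?? 10?????? 10?????? etc
31    04000000  7FFFFFFF  1111110? 10?????? etc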
Examples The Unicode character U+00A9 = 1010 1001 (copyright sign) is encoded in UTF-8 as 11000010 10101001 = 0xC2 0xA9. The Unicode character U+2260 = 0010 0010 0110 0000 (not equal to) is encoded as 11100010 10001001 10100000 = 0xE2 0x89 0xA0. No need to learn how to do these conversions.
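For the curious, the two examples above can be reproduced by a short routine that fills the payload bits into the byte patterns, six bits per continuation byte. This is a sketch of the six-byte scheme described in this lecture, not a replacement for a library encoder; the function name `encode_utf8` and the table `RANGES` are invented for the example.

```python
# (number of bytes, highest code position, fixed bits of the first byte)
RANGES = (
    (2, 0x7FF, 0xC0),
    (3, 0xFFFF, 0xE0),
    (4, 0x1FFFFF, 0xF0),
    (5, 0x3FFFFFF, 0xF8),
    (6, 0x7FFFFFFF, 0xFC),
)


def encode_utf8(code_point):
    """Encode one UCS code position as UTF-8 bytes."""
    if code_point < 0:
        raise ValueError("code positions are non-negative")
    if code_point <= 0x7F:
        return bytes([code_point])  # ASCII is stored unchanged
    for length, limit, lead in RANGES:
        if code_point <= limit:
            tail = []
            # Peel off 6 bits at a time for the continuation bytes (10??????).
            for _ in range(length - 1):
                tail.append(0x80 | (code_point & 0x3F))
                code_point >>= 6
            # The remaining high bits go into the first byte.
            return bytes([lead | code_point] + tail[::-1])
    raise ValueError("code position needs more than 31 bits")


print(encode_utf8(0x00A9).hex())
print(encode_utf8(0x2260).hex())
```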