Presentation on theme: "Lis508 lecture 1: bits, bytes and characters Thomas Krichel 2003-09-30."— Presentation transcript:
lis508 lecture 1: bits, bytes and characters Thomas Krichel 2003-09-30
Structure Numbers –Bits –Bytes Character sets –Coded character set –Character encoding
Literature, no need to read… Norton, New Inside the PC, chapter 4 http://www.danbbs.dk/~erikoest/bb_terms.htm http://wwwinfo.cern.ch/asdoc/WWW/publications/ictp99/ictp99N2705.html http://www.cl.cam.ac.uk/~mgk25/unicode.html
Information Information is best understood as what it takes to answer a question. The simplest question has a yes or no answer. Therefore a bit, which records one such answer, is the natural measure of information. The term was first used by John Tukey in 1946. It is a contraction of "binary digit".
Usage of bits Computers are sometimes classified by the number of bits they can process at one time. "32 bit processor" Graphics are also often described by the number of bits used to represent each dot.
bits and bytes A bit can take the values 0 or 1, thus it can describe 2 possibilities. Two bits can take the values 00, 01, 10, 11, thus they can describe 2×2 = 4 possibilities. n bits can encode 2 power n possibilities. Early chips commonly processed 8 bits at a time. It became customary to refer to a group of 8 bits as a byte. A byte can encode 2 power 8 (= 256) possibilities. We can use binary numbers just as decimal numbers.
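The counting argument on this slide can be checked with a minimal sketch. The function name `bit_patterns` and the variable names are illustrative assumptions, not part of the lecture.

```python
from itertools import product


def bit_patterns(n):
    """Return every pattern that n bits can take, as strings of 0s and 1s."""
    return ["".join(bits) for bits in product("01", repeat=n)]


# Two bits give 2 x 2 = 4 patterns.
two_bits = bit_patterns(2)

# A byte (8 bits) gives 2 power 8 patterns.
byte_values = 2 ** 8

# A binary numeral is read like a decimal one, but in powers of 2:
# 1010 means 1*8 + 0*4 + 1*2 + 0*1.
value = int("1010", 2)

# The reverse direction: write a number as 8 binary digits.
as_binary = format(10, "08b")
```

Here `two_bits` holds 00, 01, 10 and 11, and `value` is ten, which illustrates that binary numerals work like decimal ones with base 2 in place of base 10.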
application of bytes IP (Internet Protocol) numbers are used as the addresses of computers on the Internet. In IP version 4 (the one that is most commonly used), each IP number has 4 bytes. It is represented as x.x.x.x where x is a number between 0 and 255 (why?). How many computers can there be on the Internet at any one time?
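One way to explore the two questions on this slide is the sketch below, which uses Python's standard `ipaddress` module. The address 192.0.2.17 is an assumed example taken from a range reserved for documentation. The final figure is only an upper bound on the number of distinct addresses, not a count of real computers.

```python
import ipaddress

# Example address from a documentation-only range (an assumption for illustration).
address = ipaddress.IPv4Address("192.0.2.17")

# The address is stored as exactly 4 bytes.
raw = address.packed
octets = list(raw)

# Each x in x.x.x.x is one byte, so its largest value is 2 power 8 minus 1.
largest_byte = 2 ** 8 - 1

# 4 bytes are 32 bits, so there are at most 2 power 32 distinct addresses.
address_space = 2 ** 32
```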
Many bytes Larger units are –Kilobyte is 2 power 10 bytes (=1024 bytes) –Megabyte is 2 power 20 bytes –Gigabyte is 2 power 30 bytes –Terabyte is 2 power 40 bytes The prefixes come from ancient Greek words for "thousand", "large", "giant", and "monster", respectively. The prefix kilo dates back to the French revolution; the others were added to the metric system later.
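The sizes of these units can be worked out with a short sketch. The dictionary `UNIT_EXPONENTS` and the function `unit_size` are illustrative names, and the powers of 2 are the ones given on the slide.

```python
# Exponent of 2 for each unit, as listed on the slide.
UNIT_EXPONENTS = {
    "kilobyte": 10,
    "megabyte": 20,
    "gigabyte": 30,
    "terabyte": 40,
}


def unit_size(name):
    """Return the number of bytes in the named unit."""
    return 2 ** UNIT_EXPONENTS[name]


# Each unit is 1024 times the previous one.
ratio = unit_size("megabyte") // unit_size("kilobyte")
```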
Hex numbers A byte is often represented by two hex digits. Each hex digit can encode 16 values, written 0 to 9, then A B C D E F. A is 10 and F is 15. Hex numbers are conventionally prefixed with 0x. Use the Microsoft calculator in scientific mode to convert.
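As an alternative to a calculator, the conversions can be sketched in Python, which happens to use the same 0x prefix. The variable names are illustrative assumptions.

```python
# A hex literal with the conventional 0x prefix.
byte = 0xF5

# Split the byte into its two hex digits: the high one counts sixteens.
high, low = divmod(byte, 16)

# Decimal to hex, and hex to decimal.
two_digits = format(byte, "02X")
largest = hex(255)
f_value = int("F", 16)
```

The byte F5 is 15 sixteens plus 5, so `high` is 15 and `low` is 5.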
application of hex numbers Media Access Control (MAC) addresses identify the hardware that allows access to computer networks. They are 6-byte numbers, each byte written as 2 hex digits, e.g. 00:60:08:F5:20:A9. Hex numbers also appear as the character numbers that you see when you are inserting a special symbol in Microsoft software, e.g. PowerPoint.
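The MAC address from this slide can be taken apart with a minimal sketch. The variable names are illustrative, and the colon-separated form is assumed to be the only input format.

```python
mac = "00:60:08:F5:20:A9"

# Each colon-separated part is one byte written as two hex digits.
raw = bytes(int(part, 16) for part in mac.split(":"))

# The same six bytes as decimal numbers.
decimal_values = list(raw)

# Writing the bytes back as two hex digits each restores the original text.
round_trip = ":".join(format(b, "02X") for b in raw)
```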
Characters Much of the information processed by computers is in the form of characters. A character only makes sense for a human user of a minimum cultural level. A character is not a glyph. –ligatures
Information in a computer file A file is a piece of data stored on a computer. Any file contains a sequence of 0s and 1s, like 1010100101010011110101010101… For a computer to make sense of a file, it has to know what type of file it is.
executable files Files that are executable are files that make the computer do something. For example the file starts a program, say PowerPoint. An executable on one computer may not run on another. Non-executable files hold data that is used by an executable file. We will call them data files. Example: PowerPoint slides file.
text files Many data files contain textual data. Textual data is a sequence of characters. A character is an elementary symbol that has some meaning –alphabet letter –hieroglyph Example: email file Text files can be read by many computer programs.
non-text files Examples of non-text files are –graphics files –movie files –sound files Non-text files are not very important in library settings –there is no way to organize information retrieval for non-text files. They have to be retrieved using a textual surrogate. –traditional library materials are textual We will talk about this later.
Representing characters Computers don't understand text, they only understand numbers. For computers to be able to treat text, there must be a correspondence between numbers and text characters. Such a correspondence is called a character set. Examples of characters are –a –c –ë
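The correspondence between characters and numbers can be seen with Python's built-in `ord` and `chr`. Note the assumption: Python uses Unicode numbers, which is one particular character set, so this is a sketch of the idea rather than of every character set.

```python
# Character to number. The escape \u00eb stands for the character ë.
number_a = ord("a")
number_c = ord("c")
number_e_diaeresis = ord("\u00eb")

# Number to character: the correspondence works in both directions.
back_to_a = chr(97)

# A text is then simply a sequence of numbers.
text_as_numbers = [ord(ch) for ch in "abc"]
```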
Legacy character sets In the early days, computers were a lot less powerful than they are today. They could only deal with the characters that are most commonly used. Such sets are –ASCII –ISO-8859-1 –cp1252
ASCII American Standard Code for Information Interchange 7-bit character set. There is no such thing as 8-bit ASCII. 95 printable symbols. 33 control characters (0-31, 127). http://www.ccmr.cornell.edu/helpful_data/ascii2.html has a list up to 127.
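The counts on this slide can be reproduced with a short sketch. The variable names are illustrative. The space character (32) is counted among the printable symbols here, which is the assumption needed to arrive at 95.

```python
# 7 bits give 2 power 7 positions, numbered 0 to 127.
positions = range(2 ** 7)

# Control characters are 0-31 and 127; everything else is printable.
control = [n for n in positions if n < 32 or n == 127]
printable = [n for n in positions if n not in control]

# Plain English text fits in ASCII; an accented letter does not.
fits = "lis508".isascii()
does_not_fit = "\u00eb".isascii()
```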
some ASCII control characters CR (13, ^M) is the carriage return LF (10, ^J) is the linefeed FF (12, ^L) is the form feed (new page) BS (8, ^H) is the backspace DEL (127, ALT-127) is delete ESC (27, ^[) escape
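The caret notation on this slide follows a simple pattern: the letter after ^ is the control code with 64 added (or, for DEL, removed), which is the same as flipping one bit. The sketch below assumes that convention; the function name `caret` is illustrative. It produces ^? for DEL, a common notation that the slide does not list.

```python
def caret(code):
    """Return the caret notation for an ASCII control code."""
    # Flipping the bit worth 64 maps 13 to 77 ('M'), 127 to 63 ('?'), etc.
    return "^" + chr(code ^ 0x40)


# The control characters from the slide, by name and number.
CONTROLS = {"CR": 13, "LF": 10, "FF": 12, "BS": 8, "ESC": 27, "DEL": 127}
notation = {name: caret(code) for name, code in CONTROLS.items()}

# Python's escape sequences denote the same numbers.
escapes = [ord("\r"), ord("\n"), ord("\f"), ord("\b"), ord("\x1b")]
```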
ISO-8859-1 ISO-8859-1, aka ISO-latin-1, extends ASCII with characters that are commonly used by the western European languages. It is the default character set of HTML. Positions 128 to 159 are not used for graphic characters. cp1252 fills these with graphic characters. It is a Microsoft character set.
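The difference between the two sets in positions 128 to 159 can be observed with Python's built-in codecs `latin-1` and `cp1252`. This is a sketch of how those codecs behave, under the assumption that they follow the respective character sets; Python's `latin-1` codec maps positions 128 to 159 to invisible control codes rather than rejecting them.

```python
# An accented letter takes a single byte in ISO-8859-1: position 233 (hex E9).
e_acute = "\u00e9".encode("latin-1")

# Byte 128 (hex 80) is a graphic character in cp1252: the euro sign.
euro = b"\x80".decode("cp1252")

# Byte 147 (hex 93) is a left curly quotation mark in cp1252.
curly_quote = b"\x93".decode("cp1252")

# In latin-1 the same byte 128 is only a control code with no visible shape.
control = b"\x80".decode("latin-1")

# Outside 128-159 the two sets agree, e.g. for byte 233.
same = b"\xe9".decode("latin-1") == b"\xe9".decode("cp1252")
```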
This is not enough There are around 6800 different languages in the world. Some of these languages use character sets that are not finite, i.e. folks can make up new characters out of existing ones! Setting up a character set for all languages is almost impossible.
ISO 10646-1 Defines the Universal Character Set (UCS). UCS contains the characters required to write many known languages, even the likes of Oriya, Telugu, Bopomofo, Runic. ISO 10646 formally defines a 31-bit character set. Characters are represented as 32 bits, i.e. 4 bytes, or 8 hex digits. Not finished yet.
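The 4-byte representation can be sketched with Python's `utf-32-be` codec, which is assumed here to stand in for the 32-bit form described on the slide. The code points chosen for Runic and Bopomofo are illustrative picks from those scripts.

```python
import unicodedata

# A 31-bit set has room for 2 power 31 positions.
positions = 2 ** 31

# Every character takes 4 bytes, even the letter A (position 65, hex 41).
four_bytes = "A".encode("utf-32-be")

# The same position written as 8 hex digits.
eight_hex_digits = format(ord("A"), "08X")

# Characters from scripts named on the slide have their own positions.
runic_name = unicodedata.name("\u16a0")
bopomofo_name = unicodedata.name("\u3105")
```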
Unicode ISO is an international federation of national standards bodies. Slow and bureaucratic. Industry has come together to work on Unicode, originally a 2-byte character set. With some minor exceptions, the Unicode characters are the same as the first 65536 characters in UCS. Much better documented standard.
Unicode and legacy sets The first 128 characters are identical to those in ASCII The next 128 characters are identical to ISO 8859-1 (Latin-1). Unicode is well documented and the Unicode book can be downloaded from the Internet. A must-have for the serious digital librarian.
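The overlap with the legacy sets can be verified with a brief sketch. It relies on the fact that Python's `chr` returns the Unicode character for a given number, and on the built-in `ascii` and `latin-1` codecs; the variable names are illustrative.

```python
# Positions 0-127: decoding a byte as ASCII gives the Unicode character
# with the same number.
ascii_matches = all(
    bytes([n]).decode("ascii") == chr(n) for n in range(128)
)

# Positions 0-255: the same holds for ISO 8859-1 (Latin-1).
latin1_matches = all(
    bytes([n]).decode("latin-1") == chr(n) for n in range(256)
)

# Example: the letter with number 235 is the same in both.
example = bytes([235]).decode("latin-1")
```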
Politics… Does it make sense to use Unicode rather than, say, ISO-latin-1? Many commercial pieces of software have data files that contain character data interspersed with non-character data. Is that good?
http://openlib.org/home/krichel Thank you for your attention!