Computer Structure, Recitation 2: Character Representation in Hardware


1 Computer Structure, Recitation 2: Character Representation in Hardware

2 A programmer who doesn’t care about character encoding is not much better than a medical doctor who doesn’t believe in germs. (Joel Spolsky)

3 Introduction (תמר שרוט, נועם חזון)
Computers are considered "number crunchers", but humans work with characters.
Character data isn't just alphabetic characters; it also includes numeric characters, punctuation, spaces, etc. Most keys on the central part of the keyboard (except Shift and Caps Lock) produce characters.
Everything a computer represents is stored as binary sequences, so we use standard encodings (agreed binary sequences) to represent characters.

4 Introduction (2)
The two's complement method is used to represent integers because it has nice mathematical properties; in particular, addition and subtraction work the same way for positive and negative values.
There are no such properties for character data, so assigning binary codes to characters is somewhat arbitrary.
The most common character representation is ASCII, which stands for American Standard Code for Information Interchange. The ASCII code defines which character is represented by each binary sequence.

5 The ASCII code

6 The ASCII code (2)
There are two reasons to use ASCII:
- It is a way to represent characters.
- It is an accepted standard.
A different bit pattern is used for each character that needs to be represented.
A nice property: the lowercase letters are contiguous, and so are the uppercase letters and the digits.
Applications:
- ‘a’ < ‘b’; ‘A’ < ‘B’; ‘0’ < ‘1’.
- ‘a’ - ‘A’ = ‘b’ - ‘B’ = ... = ‘z’ - ‘Z’ = 32.
- ‘1’ - ‘0’ = 1 - 0.
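The contiguity property can be checked directly. The following is a minimal Python sketch; the helper names `to_upper` and `digit_value` are made up for illustration and are not part of the slides.

```python
def to_upper(ch: str) -> str:
    """Upper-case one ASCII lowercase letter using the fixed offset 'a' - 'A'."""
    if 'a' <= ch <= 'z':
        return chr(ord(ch) - (ord('a') - ord('A')))
    return ch  # anything else is returned unchanged


def digit_value(ch: str) -> int:
    """Numeric value of one ASCII digit character, e.g. '7' -> 7."""
    if not '0' <= ch <= '9':
        raise ValueError(f"not an ASCII digit: {ch!r}")
    return ord(ch) - ord('0')


print(ord('a') - ord('A'))  # 32
print(to_upper('q'))        # Q
print(digit_value('7'))     # 7
```

Because the letters and digits sit in consecutive codes, both conversions are a single subtraction rather than a table lookup.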

7 The ASCII code (3)
Note:
- ‘a’ ≠ ‘A’.
- 0 ≠ ‘0’ (‘0’ = 48).
- Codes 0 through 31 are generally not printable (they are control characters that affect how text is processed). Code 32 is the space character.
- There are 128 (= 2^7) ASCII characters.
- The eighth bit was often used as a parity bit to detect transmission errors.
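As an illustration of the parity bit, here is a small Python sketch. It assumes even parity (the total number of 1 bits in the byte is even); the slides do not say which parity convention is meant, and the function names are hypothetical.

```python
def add_even_parity(code: int) -> int:
    """Return the 8-bit value for a 7-bit ASCII code with an even-parity bit on top."""
    if not 0 <= code < 128:
        raise ValueError("ASCII codes are 7 bits wide (0..127)")
    parity = bin(code).count("1") % 2  # 1 if the 7 data bits hold an odd number of ones
    return code | (parity << 7)


def check_even_parity(byte: int) -> bool:
    """True if the received byte still has an even number of 1 bits."""
    return bin(byte).count("1") % 2 == 0


sent = add_even_parity(ord('C'))          # 'C' = 0b1000011 has three 1 bits
print(hex(sent))                          # 0xc3
print(check_even_parity(sent))            # True
print(check_even_parity(sent ^ 0b1))      # False: a single flipped bit is detected
```

A single parity bit detects any odd number of flipped bits, but it cannot tell which bit changed and it misses an even number of flips.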

8 ASCII’s disadvantage
The greatest disadvantage: it is biased toward the English character set.
Missing:
- Mathematical symbols.
- Letters of other European languages (as well as Hebrew).
Solution: use the 8th bit as well (Extended ASCII). This extends the set to 256 characters, which is plenty for most alphabet-based languages.

9 Extended ASCII
Problems:
- Not enough for Asian languages whose writing systems have thousands of characters.
- Can’t mix more than one language: the same code is é in one code page and ג in another, which garbles email sent from France to Israel and vice versa.
Code pages: different character encodings that are identical only in the first 128 codes (the ASCII part).
This works reasonably well in small networks where every machine uses the same code page.
Problem: the Internet!
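The é / ג clash can be reproduced with Python's built-in codecs. This sketch assumes the two code pages in question are the DOS code pages 437 (US/Western) and 862 (Hebrew); the slides do not name them, and `decode_with` is a hypothetical helper.

```python
def decode_with(raw: bytes, code_page: str) -> str:
    """Interpret the same raw bytes under a given code page."""
    return raw.decode(code_page)


raw = bytes([0x82])                      # one byte with value 130
print(decode_with(raw, "cp437"))         # é
print(decode_with(raw, "cp862"))         # ג  (U+05D2, Hebrew letter gimel)

# The lower half (codes 0..127) is plain ASCII in both code pages.
ascii_part = bytes(range(128))
print(decode_with(ascii_part, "cp437") == decode_with(ascii_part, "cp862"))  # True
```

The byte itself never changes; only the table used to interpret it does, which is why the receiver must know which code page the sender used.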

10 Unicode
An effort to create a single character set that includes every reasonable writing system.
The original encoding uses 2 bytes to represent a character:
- 1st byte plus an empty (zero) 2nd byte: used to represent the ASCII characters.
- 1st and 2nd bytes together: used to represent other characters.
Disadvantages of UCS-2 (2-byte Universal Character Set; UTF-16 extends it with surrogate pairs for characters that do not fit in 2 bytes):
- Endianness.
- Doubles the file size.
- Not compatible with old files.
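The size doubling is easy to observe in Python. Python has no codec named UCS-2, so this sketch uses `utf-16-be`, which produces the same bytes as big-endian UCS-2 for every character that fits in 2 bytes; that substitution is an assumption of the example.

```python
text = "Hello"

ascii_bytes = text.encode("ascii")
two_byte = text.encode("utf-16-be")   # same bytes as big-endian UCS-2 for these characters

print(len(ascii_bytes))               # 5
print(len(two_byte))                  # 10
print(two_byte)                       # b'\x00H\x00e\x00l\x00l\x00o'

# A non-ASCII character uses both bytes: Hebrew gimel is code point U+05D2.
print("\u05d2".encode("utf-16-be"))   # b'\x05\xd2'
```

Every ASCII character carries one zero byte, so a pure English file doubles in size and is no longer byte-for-byte readable by ASCII-only tools.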

11 Endianness
Now that characters are stored in more than one byte, the byte order (big-endian / little-endian) matters! It causes problems when transferring files between different computers.
Solution: the “Unicode Byte Order Mark”, 0xFEFF (in 16-bit Unicode).
- Always place the mark at the beginning of the character stream.
- When the input starts with 0xFFFE, the programmer knows she must swap the two bytes of every 16-bit unit.
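A reader can use the byte order mark as sketched below. The function name `decode_utf16_with_bom` is hypothetical; `codecs.BOM_UTF16_BE` and `codecs.BOM_UTF16_LE` are the standard-library constants for the two possible byte sequences of the mark.

```python
import codecs


def decode_utf16_with_bom(data: bytes) -> str:
    """Pick the byte order from the leading byte order mark, then decode the rest."""
    if data.startswith(codecs.BOM_UTF16_BE):    # bytes FE FF: big-endian stream
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):    # bytes FF FE: little-endian stream
        return data[2:].decode("utf-16-le")
    raise ValueError("no byte order mark; the byte order is unknown")


little = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")
big = codecs.BOM_UTF16_BE + "hi".encode("utf-16-be")
print(little)                           # b'\xff\xfeh\x00i\x00'
print(big)                              # b'\xfe\xff\x00h\x00i'
print(decode_utf16_with_bom(little))    # hi
print(decode_utf16_with_bom(big))       # hi
```

Python's plain `"utf-16"` codec performs the same check internally; the explicit version is shown only to make the mechanism visible.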

12 Unicode: cons
Yet:
- Not every Unicode string has a byte order mark at the beginning.
- Pure English files double in size for no reason.
- Old files must be converted.
For these reasons Unicode was largely ignored for several years (until 1992).
Solution: UTF-8 (8-bit Unicode Transformation Format).

13 UTF-8
This is a variable-length character encoding.
Every code point from 0 to 127 (ASCII’s original codes) is stored in a single byte.
Code points 128 and above are stored using 2 to 4 bytes, depending on the code point (the original design allowed up to 6 bytes; the current standard stops at 4).
Outcomes:
- Pure English files are identical to ASCII files: no needless doubling and no need to convert old files.
- The extra bytes enable a richer character set.
- Characters with lower code points (ASCII first) use shorter encodings.
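The variable length can be seen by encoding a few sample characters. This is a short Python sketch; the sample characters and the helper `utf8_length` are chosen here for illustration.

```python
def utf8_length(ch: str) -> int:
    """Number of bytes UTF-8 needs for a single character."""
    return len(ch.encode("utf-8"))


print(utf8_length("A"))            # 1  (U+0041, ASCII)
print(utf8_length("\u00e9"))       # 2  (U+00E9, e with acute accent)
print(utf8_length("\u05d2"))       # 2  (U+05D2, Hebrew gimel)
print(utf8_length("\u20ac"))       # 3  (U+20AC, euro sign)
print(utf8_length("\U0001F600"))   # 4  (U+1F600, an emoji outside the 2-byte range)

# Pure ASCII text is byte-for-byte identical in ASCII and UTF-8.
print("Hello".encode("utf-8") == "Hello".encode("ascii"))  # True
```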

14 UTF-8: how does it work?
If we have an ASCII character:
- It is placed in one byte whose MSB is zero.
Otherwise, we need more than one byte:
- The first byte tells us how many bytes are used to encode the character: it starts (from the MSB) with a sequence of ones followed by a single zero, and the number of ones is the number of bytes used.
- Each additional byte starts with the bits 10.
- The rest of the bits are used to encode the character.
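The bit layout described above can be written out by hand. The following Python sketch is for illustration only: the function `utf8_encode` is hypothetical, it handles sequences of at most 4 bytes, and it does not reject the surrogate range the way the built-in encoder does.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point as UTF-8 by building the bit patterns manually."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point out of range")
    if code_point < 0x80:
        return bytes([code_point])            # 0xxxxxxx: plain ASCII, MSB is zero
    if code_point < 0x800:
        lead, extra = 0b11000000, 1           # 110xxxxx + 1 continuation byte
    elif code_point < 0x10000:
        lead, extra = 0b11100000, 2           # 1110xxxx + 2 continuation bytes
    else:
        lead, extra = 0b11110000, 3           # 11110xxx + 3 continuation bytes
    # The first byte holds the length prefix and the highest payload bits.
    out = [lead | (code_point >> (6 * extra))]
    # Each continuation byte is 10xxxxxx and carries the next 6 payload bits.
    for i in range(extra - 1, -1, -1):
        out.append(0b10000000 | ((code_point >> (6 * i)) & 0b00111111))
    return bytes(out)


print(utf8_encode(0x05D2))                               # b'\xd7\x92'
print(utf8_encode(0x05D2) == "\u05d2".encode("utf-8"))   # True
```

For U+05D2 the payload bits are 10111 010010, giving 110·10111 = 0xD7 followed by 10·010010 = 0x92.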

15 Other encodings
There are hundreds of different encodings. UTF-7, UTF-8, UTF-16 and UTF-32 can represent every Unicode code point, which makes them the most reliable choice when working with languages other than English.
When passing a sequence of characters (strings, files, etc.) one must state which encoding is used. Otherwise:
- Gibberish.
- Question marks.
- Wrong representation of some characters.
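Both failure modes can be reproduced in a few lines of Python. This sketch assumes a Hebrew sample word (shalom, written with escapes) and assumes the receiver wrongly guesses Latin-1 or ASCII; neither choice comes from the slides.

```python
word = "\u05e9\u05dc\u05d5\u05dd"          # the Hebrew word "shalom", 4 characters
data = word.encode("utf-8")                 # 8 bytes on the wire

# Gibberish: the receiver decodes UTF-8 bytes with the wrong encoding.
gibberish = data.decode("latin-1")
print(len(gibberish))                       # 8 characters instead of 4
print(gibberish == word)                    # False

# Question marks: the target encoding cannot represent the characters at all.
question_marks = word.encode("ascii", errors="replace")
print(question_marks)                       # b'????'

# Correct result once the encoding is stated.
print(data.decode("utf-8") == word)         # True
```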

16 Standards
E-mail:
- Content-Type: text/plain; charset="UTF-8"
Web page: tag
- <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> …

17 Libraries for managing encodings
There are many libraries that support different character encodings. E.g.:
- iconv (or a more stable implementation: libiconv); mostly Unix.
- codecs module (Python).
- “International Components for Unicode” (ICU); there are libraries for C/C++ and Java.
- UTF8-CPP (C++).
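As one example of such a library, Python's `codecs` module can convert between encodings in the style of iconv. The helper `transcode` below is hypothetical; `codecs.decode`, `codecs.encode` and `codecs.lookup` are standard-library functions.

```python
import codecs


def transcode(data: bytes, source: str, target: str) -> bytes:
    """Re-encode bytes from one named encoding to another, iconv style."""
    text = codecs.decode(data, source)     # bytes -> str, using the source encoding
    return codecs.encode(text, target)     # str -> bytes, using the target encoding


# Byte 0x82 is Hebrew gimel in code page 862; in UTF-8 it takes two bytes.
print(transcode(b"\x82", "cp862", "utf-8"))   # b'\xd7\x92'

# Encoding names are normalized, so common aliases resolve to the same codec.
print(codecs.lookup("UTF8").name)             # utf-8
```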

