CS 502 Computing Methods for Digital Libraries
Cornell University – Computer Science
Herbert Van de Sompel

Lecture 8: Character encodings - UNICODE
Herbert Van de Sompel, herbertv@cs.cornell.edu

Slide 2: Problem
The richness of text:
- elements: letters, scripts, symbols
- structure: words, sentences, paragraphs, headings, tables
- appearance: fonts, layout, design, materials
- special: mathematics, music
Digital libraries must represent every variant!

Slide 3: Characters
Distinguish between:
- the abstract character: the smallest structural element in written language that has semantic value. It refers to abstract meaning and/or shape rather than a specific shape. Example: "A"
- the glyph: a specific representation of a character. Example: A A A a
- a font: a collection of glyphs used for the visual depiction of characters; a font is often associated with a set of parameters (size, posture, weight, …) set to certain values

Slide 4: Characters
Distinguish between:
- an abstract character repertoire: an unordered set of characters that are used together. Example: the abstract character "a" is part of the ASCII character repertoire.
- a coded character set: an ordering and mapping of an abstract character repertoire onto a set of non-negative integers. Those integers are called code points; characters that have a code point are encoded characters. Example: in ASCII, 0x61 is the code point for "a".
- a character encoding form: a mapping from code points to units stored on computers (bytes), either fixed or variable width. Example: 1100001 is the 7-bit ASCII encoding of "a".
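The three layers on this slide can be traced for the single character "a". The following is a minimal sketch using only Python built-ins; the variable names are illustrative and not part of the lecture.

```python
# Illustrative names only; none of these identifiers come from the slides.

# Abstract character repertoire: an unordered set of characters.
repertoire = {"a", "b", "c"}

# Coded character set: map the character to a non-negative integer.
code_point = ord("a")

# Character encoding form: map the code point to stored units (bytes).
encoded = "a".encode("ascii")

# The same code point written out as a 7-bit pattern.
seven_bit = format(code_point, "07b")

print("a" in repertoire, hex(code_point), encoded, seven_bit)
```

The bit pattern comes straight from the code point: 0x61 is 97 decimal, which is 1100001 in binary.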

Slide 5: Evolution of character encodings
- ASCII: 7 bit => code points 0-127
- ISO 646: language-specific variations of ASCII (some code points are allowed to represent different characters):

  Code point   ISO 646-IRV [ASCII]   ISO 646-DK
  5B           [                     Æ
  5D           ]                     Å

- ISO/IEC 8859: series of 8-bit coded character sets
  - basis is ISO 646-IRV
  - different Parts; each Part defines different characters for code points 0x80 to 0xFF
  - Part 1: Latin alphabet 1 / Part 8: Latin/Hebrew alphabet
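To see why the ISO/IEC 8859 Parts are mutually incompatible above 0x7F, one can decode the same byte under two Parts. This is a small sketch that relies on Python's bundled `iso8859_1` and `iso8859_8` codecs; the choice of byte 0xE0 is ours, not the lecture's.

```python
# One stored byte, interpreted under two different Parts of ISO/IEC 8859.
raw = bytes([0xE0])

as_part1 = raw.decode("iso8859_1")  # Part 1: Latin alphabet 1
as_part8 = raw.decode("iso8859_8")  # Part 8: Latin/Hebrew

print(f"Part 1: U+{ord(as_part1):04X}  Part 8: U+{ord(as_part8):04X}")

# Below 0x80 every Part agrees with ASCII.
ascii_byte = bytes([0x5B])
same_in_both = ascii_byte.decode("iso8859_1") == ascii_byte.decode("iso8859_8")
print(same_in_both)
```

The byte is identical in both cases; only the coded character set chosen by the reader decides whether it means a Latin or a Hebrew letter.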

Slide 6: ASCII family
[Figure: a code-point axis from 0 to 255 with marks at 0, 32, 127 and 255; labeled ranges: printable ASCII, standard (7-bit) ASCII, extended (8-bit) ASCII]

Slide 7: Evolution of character encodings
- ISO/IEC 2022:1994: defines methods to switch among various 7- or 8-bit coded character sets (escape sequences)
- Vendor-specific character sets: Windows code pages == variation on ISO/IEC 8859 - Part 1
- Web:
  - Initially "European Wide Web": ISO/IEC 8859 - Part 1
  - Then language-specific encodings selected in the browser
  - HTML 4.0: Unicode
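Both ideas on this slide can be observed with codecs that ship with Python. The sketch below assumes `cp1252` as a representative Windows code page and `iso2022_jp` as a representative ISO/IEC 2022 based encoding; neither is named in the lecture.

```python
# A Windows code page as a variation on ISO/IEC 8859 Part 1:
# most bytes agree, but some in the 0x80 to 0x9F range do not.
agree = bytes([0xE9])
differ = bytes([0x80])

same_char = agree.decode("cp1252") == agree.decode("iso8859_1")
cp1252_char = differ.decode("cp1252")      # a printable character
latin1_char = differ.decode("iso8859_1")   # a C1 control code

print(same_char, f"U+{ord(cp1252_char):04X}", f"U+{ord(latin1_char):04X}")

# ISO/IEC 2022 style switching: an escape sequence (starting with the
# ESC byte 0x1B) announces the change of coded character set.
text = "a\u3042"  # Latin "a" followed by HIRAGANA LETTER A
switched = text.encode("iso2022_jp")
print(switched)
```

The encoded byte string contains ESC sequences around the Japanese character, which is the switching mechanism the first bullet describes.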

Slide 8: Chaos
- Different coded character sets => software must be localized
- Global communication
- Global commerce / data exchange
Solution: UNICODE, a universal character encoding specification

Slide 9: UNICODE Basic Multilingual Plane
- 16-bit codes that represent distinct characters => about 65,000 characters (2^16 = 65,536 code points)
- expansion possible to about 1,000,000 code elements
- includes:
  - characters of major written languages
  - punctuation marks, diacritics, math symbols, tech symbols, arrows, dingbats, …
  - modifying diacritics
  - private use code points
  - 8000 unused
- organized by scripts, not languages
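The slide does not say how the expansion to roughly a million code elements works. In UTF-16 it is done with surrogate pairs: two reserved 16-bit ranges of 1,024 values each combine to address 1,024 x 1,024 additional code points. The helper below is a hypothetical sketch of that arithmetic, not code from the lecture.

```python
def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a code point above U+FFFF into a UTF-16 surrogate pair.

    Hypothetical helper for illustration only.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


extra_code_points = 1024 * 1024
high, low = to_surrogate_pair(0x1F600)
builtin = chr(0x1F600).encode("utf-16-be")

print(extra_code_points, hex(high), hex(low), builtin.hex())
```

The pair computed by hand matches the bytes produced by Python's own UTF-16 encoder.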

Slide 10: UNICODE code elements
- UNICODE code elements (abstract characters): a fundamental element for text processing
- code elements are assigned:
  - code point: a unique numeric value [U+0000 => U+FFFF], e.g. U+0061
  - name: a unique name, e.g. LATIN SMALL LETTER A
- code elements have a character encoding form: UTF-8 0x61 …
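The code point, name and UTF-8 form listed for "a" can be looked up with Python's standard `unicodedata` module. This is a short sketch; the variable names are illustrative.

```python
import unicodedata

ch = "a"

label = f"U+{ord(ch):04X}"     # code point in the usual U+ notation
name = unicodedata.name(ch)    # the character's unique Unicode name
utf8 = ch.encode("utf-8")      # its UTF-8 encoding form

print(label, name, utf8.hex())
```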

Slide 11: Text processing
[Figure: the key "T" on the keyboard is mapped by system software to U+0054; the text processor holds it in memory as 01010100; display software maps U+0054 back to the glyph "T". UNICODE legend: U+0054 T, U+0075 u]

Slide 12: UNICODE design principles
- 16-bit codes that represent distinct characters => 65,000 characters
- representation of characters, not glyphs
- characters have properties (directionality, numeric, case, combining class, …)
- characters are stored in logical order (reading order)
- organized by scripts, not languages
- dynamic composition of characters out of basic characters: @ + ° => @ °
- equivalence sequences for characters that have both a precomposed and a dynamic representation: ã == a ~
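Character properties and the equivalence between precomposed and dynamically composed forms can be inspected with Python's `unicodedata` module. The sketch below uses normalization forms NFC and NFD to move between the two representations of ã; normalization is not named on the slide and is brought in here as the standard way to test equivalence.

```python
import unicodedata

precomposed = "\u00e3"   # LATIN SMALL LETTER A WITH TILDE
dynamic = "a\u0303"      # "a" followed by COMBINING TILDE

# The two sequences differ as stored code points...
raw_equal = precomposed == dynamic

# ...but are equivalent once normalized to a common form.
composed = unicodedata.normalize("NFC", dynamic)
decomposed = unicodedata.normalize("NFD", precomposed)

# A few of the properties listed on the slide.
combining_class = unicodedata.combining("\u0303")
direction = unicodedata.bidirectional("\u05d0")   # HEBREW LETTER ALEF
numeric_value = unicodedata.numeric("\u00bd")     # VULGAR FRACTION ONE HALF
category = unicodedata.category("A")

print(raw_equal, composed == precomposed, decomposed == dynamic)
print(combining_class, direction, numeric_value, category)
```

Comparing text without normalizing first treats the two spellings of ã as different strings, which is why equivalence sequences matter for searching and sorting.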

Slide 13: UNICODE UTF-8 encoding
How to encode a 16-bit character in 8-bit words?
UTF-8:
- Also for 32-bit UNICODE
- Variable length: 1 to 4 bytes

  From   To     bit pattern        byte 1     byte 2    byte 3
  0000   007F   000000000xxxxxxx   0xxxxxxx
  0080   07FF   00000yyyyyxxxxxx   110yyyyy   10xxx…
  0800   FFFF   zzzzyyyyyyxxxxxx   1110zzzz   10yyy…    10xx…
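The three rows of the table translate directly into bit shifts and masks. The function below is a hypothetical sketch covering only the 16-bit range shown on the slide (one to three bytes); it does not reject the surrogate range U+D800 to U+DFFF, which a production encoder would.

```python
def utf8_encode_bmp(code_point: int) -> bytes:
    """Encode one 16-bit code point as UTF-8, following the slide's table.

    Hypothetical helper for illustration; real code should use str.encode.
    """
    if code_point < 0 or code_point > 0xFFFF:
        raise ValueError("only U+0000 to U+FFFF is handled here")
    if code_point <= 0x7F:
        # 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:
        # 110yyyyy 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    # 1110zzzz 10yyyyyy 10xxxxxx
    return bytes([0xE0 | (code_point >> 12),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


# One sample per row of the table: "a", "ã", and the euro sign.
samples = [0x0061, 0x00E3, 0x20AC]
results = {cp: utf8_encode_bmp(cp) for cp in samples}
for cp, encoded in results.items():
    print(f"U+{cp:04X} -> {encoded.hex()}")
```

Each continuation byte carries six payload bits behind the fixed prefix 10, which is why the shifts step by 6.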

Slide 14: UNICODE & Web
- Integral part of HTML and XML
- Supported by all major OS
- RFC 2277 (IETF Policy on Character Sets and Languages) recommendation: all protocols must support UNICODE, specifically by being able to use UTF-8

Slide 15: UNICODE & HTML & browsers
- character set of HTML == Unicode
- Unicode character in HTML:
  - U+0061 == a; in HTML: &#x61; == &#97;
  - U+003D == =; in HTML: &#x3D; == &#61; == =
- Since the same Unicode character can be used in different languages, the glyph that is rendered by the browser can depend on the language of the text
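HTML can name any Unicode character by its code point through a numeric character reference, written in hexadecimal (`&#x...;`) or decimal (`&#...;`). As a small sketch, Python's standard `html.unescape` resolves such references the way a browser would; the helper `char_refs` is hypothetical.

```python
import html


def char_refs(ch: str) -> tuple[str, str]:
    """Return the hex and decimal character references for one character.

    Hypothetical helper for illustration only.
    """
    return f"&#x{ord(ch):X};", f"&#{ord(ch)};"


hex_a, dec_a = char_refs("a")
hex_eq, dec_eq = char_refs("=")

decoded = [html.unescape(ref) for ref in (hex_a, dec_a, hex_eq, dec_eq)]
print(hex_a, dec_a, hex_eq, dec_eq, decoded)
```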

Slide 16: Readings
The Unicode Standard: a technical introduction
http://www.unicode.org/unicode/standard/principles.html
