CS 502 Computing Methods for Digital Libraries. Cornell University – Computer Science. Herbert Van de Sompel

Presentation transcript:

1 CS 502 Computing Methods for Digital Libraries. Cornell University – Computer Science. Herbert Van de Sompel. Lecture 8: character encodings - UNICODE

2 Problem
The richness of text:
- elements: letters, scripts, symbols
- structure: words, sentences, paragraphs, headings, tables
- appearance: fonts, layout, design, materials
- special: mathematics, music
Digital libraries must represent every variant!

3 Characters
Distinguish between:
- the abstract character: the smallest structural element in written language that has semantic value. It refers to abstract meaning and/or shape rather than to a specific shape, e.g. "A".
- the glyph: a specific representation of a character, e.g. A A A (the same character drawn in different typefaces).
- a font: a collection of glyphs used for the visual depiction of characters; a font is often associated with a set of parameters (size, posture, weight, …) set to certain values.

4 Characters
Distinguish between:
- an abstract character repertoire: an unordered set of characters that are used together. Example: the abstract character a is part of the ASCII character repertoire.
- a coded character set: an ordering and mapping of an abstract character repertoire onto a set of non-negative integers. Those integers are called code points; characters that have a code point are encoded characters. Example: in ASCII, 0x61 is the code point for a.
- a character encoding form: a mapping from code points to the units stored on computers (bytes), either fixed or variable width. Example: 1100001 is the 7-bit ASCII encoding of a.
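The three layers on this slide can be traced in a few lines of Python. This is a minimal sketch; the variable names are illustrative and do not come from the slides.

```python
# Coded character set: the abstract character "a" is mapped to a code point.
code_point = ord("a")

# Character encoding form: the code point is serialized to stored units (bytes).
encoded = "a".encode("ascii")

# The 7-bit pattern carried by that single byte.
bit_pattern = format(code_point, "07b")

print(hex(code_point))  # 0x61
print(encoded)          # b'a'
print(bit_pattern)      # 1100001
```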

5 Evolution of character encodings
- ASCII: 7 bit => 128 code points
- ISO 646: language-specific variations of ASCII (some code points are allowed to represent different characters):

  Code point   ISO 646-IRV [ASCII]   ISO 646-DK
  5B           [                     Æ
  5D           ]                     Å

- ISO/IEC 8859: a series of 8-bit coded character sets
  - the basis is ISO 646-IRV
  - published in different Parts; each Part defines different characters for code points 0x80 to 0xFF
  - Part 1: Latin alphabet 1 / Part 8: Latin/Hebrew alphabet
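The consequence of having several Parts is that the same byte means different characters depending on which Part is assumed. A small sketch using Python's built-in `iso8859_1` and `iso8859_8` codecs; the byte 0xE0 is chosen here only as an illustration.

```python
# One byte from the upper half of the 8-bit range.
raw = bytes([0xE0])

as_part1 = raw.decode("iso8859_1")  # Part 1: Latin alphabet 1
as_part8 = raw.decode("iso8859_8")  # Part 8: Latin/Hebrew alphabet

print(repr(as_part1), hex(ord(as_part1)))  # 'à' 0xe0
print(repr(as_part8), hex(ord(as_part8)))  # 'א' 0x5d0

# The lower half is shared by all Parts, so plain ASCII text is unaffected.
shared = bytes([0x61])
same_in_both = shared.decode("iso8859_1") == shared.decode("iso8859_8")
```

Without knowing which Part was used to write the byte, a reader cannot tell whether it is a Latin letter or a Hebrew one.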

6 ASCII family
[Figure: ranges of printable ASCII, standard (7-bit) ASCII, and extended (8-bit) ASCII; the only surviving axis label is 32.]

7 Evolution of character encodings
- ISO/IEC 2022:1994: defines methods (escape sequences) to switch among various 7-bit or 8-bit coded character sets
- Vendor-specific character sets: Windows code pages == variation on ISO/IEC 8859 Part 1
- Web:
  - initially a "European Wide Web": ISO/IEC 8859 Part 1
  - then language-specific encodings, selected in the browser
  - HTML 4.0: Unicode
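Both points can be illustrated with codecs that ship with Python. This sketch assumes Windows code page 1252 as the example of a vendor variation and ISO-2022-JP as one example of an ISO 2022 based encoding; neither is named on the slide.

```python
# Vendor variation: byte 0x80 is a printable character in Windows code page
# 1252 but an invisible control character in ISO/IEC 8859 Part 1.
raw = bytes([0x80])
in_cp1252 = raw.decode("cp1252")
in_latin1 = raw.decode("iso8859_1")

print(repr(in_cp1252))  # '€'
print(repr(in_latin1))  # '\x80'

# ISO 2022 style switching: escape sequences (starting with byte 0x1B) switch
# between coded character sets in the middle of a byte stream.
text = "a\u3042a"  # Latin a, Hiragana a, Latin a
switched = text.encode("iso2022_jp")
has_escape = b"\x1b" in switched
round_trip = switched.decode("iso2022_jp")
```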

8 Chaos
- Different coded character sets => software must be localized
- Global communication
- Global commerce / data exchange
Solution: UNICODE, a universal character encoding specification

9 UNICODE Basic Multilingual Plane
- 16-bit codes that represent distinct characters => about 65,000 characters (2^16 = 65,536 code points)
- expansion possible to about 1,000,000 code elements
- includes:
  - characters of major written languages
  - punctuation marks, diacritics, math symbols, tech symbols, arrows, dingbats, …
  - modifying diacritics
  - private use code points
- 8000 unused
- organized by scripts, not languages
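The two sizes on this slide can be checked with simple arithmetic. The sketch below assumes the surrogate-pair mechanism of UTF-16 (two 16-bit units, each drawn from a block of 1,024 reserved values) as the expansion route; the slide does not say how the expansion works.

```python
import sys

# Basic Multilingual Plane: everything addressable with one 16-bit code.
bmp_size = 2 ** 16                      # 65,536

# Expansion: pairs of 16-bit surrogates, 1,024 x 1,024 combinations.
supplementary = 0x400 * 0x400           # 1,048,576

total_code_points = bmp_size + supplementary
largest_code_point = total_code_points - 1

# Python exposes the largest code point it supports.
matches_python = sys.maxunicode == largest_code_point

# A character outside the BMP needs two 16-bit units in UTF-16.
clef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF
utf16_units = len(clef.encode("utf-16-be")) // 2
```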

10 UNICODE code elements
UNICODE code elements (abstract characters): a fundamental element for text processing.
Code elements are assigned:
- a code point: a unique numeric value [U+0000 => U+FFFF], e.g. U+0061
- a name: a unique name, e.g. LATIN SMALL LETTER A
Code elements have a character encoding form: UTF-8 0x61 …
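Python's standard `unicodedata` module exposes these assignments, so the example on the slide can be reproduced directly. A minimal sketch with illustrative variable names:

```python
import unicodedata

ch = "a"

code_point = f"U+{ord(ch):04X}"  # the unique numeric value
name = unicodedata.name(ch)      # the unique name
utf8 = ch.encode("utf-8")        # one character encoding form

print(code_point)  # U+0061
print(name)        # LATIN SMALL LETTER A
print(utf8.hex())  # 61
```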

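11 Text processing
[Figure: the path of a character through a UNICODE system. The key T on the keyboard is turned by system software into U+0054; the text processor holds U+0054 in memory; display software renders U+0054 as the glyph T. Likewise, U+0075 is rendered as u.]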

12 UNICODE design principles
- 16-bit codes that represent distinct characters => about 65,000 characters
- representation of characters, not glyphs
- characters have properties (directionality, numeric, case, combining class, …)
- characters are stored in logical order (reading order)
- organized by scripts, not languages
- dynamic composition of characters out of a basic character plus combining marks (e.g. a base letter + °)
- equivalence sequences for characters that have both a precomposed and a dynamic representation: ã == a + ~
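Character properties and the equivalence between precomposed and dynamically composed forms can both be inspected with Python's `unicodedata` module. The sketch below uses Unicode normalization (NFC and NFD) to show the equivalence; normalization forms are not mentioned on the slide and are brought in here only as a convenient way to demonstrate it.

```python
import unicodedata

precomposed = "\u00e3"   # ã as a single code point
decomposed = "a\u0303"   # a followed by COMBINING TILDE

# The two sequences differ as stored data but are canonically equivalent.
differ_as_stored = precomposed != decomposed
composes = unicodedata.normalize("NFC", decomposed) == precomposed
decomposes = unicodedata.normalize("NFD", precomposed) == decomposed

# A few of the properties attached to characters.
properties = {
    "category of a": unicodedata.category("a"),                  # lowercase letter
    "combining class of tilde": unicodedata.combining("\u0303"),
    "directionality of alef": unicodedata.bidirectional("\u05d0"),
    "numeric value of 5": unicodedata.numeric("5"),
}
```

Software that compares or searches text usually has to normalize first, otherwise the two spellings of ã will not match.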

13 UNICODE UTF-8 encoding
How to encode a 16-bit character in 8-bit words?
UTF-8:
- also for 32-bit UNICODE
- variable length: 1 to 4 bytes

  Unicode
  From   To     bit pattern        byte 1     byte 2    byte 3
  0000   007F   000000000xxxxxxx   0xxxxxxx
  0080   07FF   00000yyyyyxxxxxx   110yyyyy   10xxx…
  0800   FFFF   zzzzyyyyyyxxxxxx   1110zzzz   10yyy…    10xx…
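The bit-pattern table translates into a short encoder. This is a hand-written sketch for illustration (the function name `utf8_encode` is hypothetical); it also includes the 4-byte form for code points above U+FFFF, which the table on the slide does not show, and checks itself against Python's built-in encoder.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point as UTF-8 by following the bit patterns."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {code_point:#x}")
    if code_point <= 0x7F:
        # 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:
        # 110yyyyy 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:
        # 1110zzzz 10yyyyyy 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 4-byte form (not in the slide's table): 11110www 10zzzzzz 10yyyyyy 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


# One sample character per length, compared with the built-in encoder.
samples = ["a", "\u00e3", "\u20ac", "\U0001D11E"]
agrees = all(utf8_encode(ord(ch)) == ch.encode("utf-8") for ch in samples)
lengths = [len(utf8_encode(ord(ch))) for ch in samples]
```

Because the leading byte announces the length and every continuation byte starts with `10`, a decoder can always find character boundaries, and plain ASCII text is already valid UTF-8.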

14 UNICODE & Web
- Integral part of HTML and XML
- Supported by all major OS
- RFC 2277 recommendation: all protocols must support UNICODE

15 UNICODE & HTML & browsers
- character set of HTML == Unicode
- Unicode character in HTML:
  - U+0061 == a; in HTML: &#x61; == a
  - U+003D == =; in HTML: &#x3D; == &#61; == =
- Since the same Unicode character can be used in different languages, the glyph that the browser renders can depend on the language of the text.
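Numeric character references are how an HTML document names a Unicode code point regardless of the encoding the file is stored in. A minimal sketch using `html.unescape` from Python's standard library; the helper `to_hex_reference` is hypothetical and added for illustration.

```python
import html

# Hexadecimal and decimal references to U+0061 resolve to the same character.
a_from_hex = html.unescape("&#x61;")
a_from_dec = html.unescape("&#97;")

# The same holds for U+003D.
eq_from_hex = html.unescape("&#x3D;")
eq_from_dec = html.unescape("&#61;")


def to_hex_reference(ch: str) -> str:
    """Return the hexadecimal numeric character reference for one character."""
    return f"&#x{ord(ch):X};"


reference = to_hex_reference("a")
```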

16 Readings
The Unicode Standard: a technical introduction