Introduction to Human Language Technologies Tomaž Erjavec Karl-Franzens-Universität Graz Tomaž Erjavec Lecture: Character sets 11.1.2008.

Slides:



Advertisements
Similar presentations
Bits and the "Why" of Bytes: Representing Information Digitally
Advertisements

Chapter 8_2 Bits and the "Why" of Bytes: Representing Information Digitally.
מבנה מחשב תרגול 2 ייצוג תווים בחומרה. A programmer that doesn’t care about characters encoding in not much better than a medical doctor who doesn’t believe.
Data Representation Kieran Mathieson. Outline Digital constraints Data types Integer Real Character Boolean Memory address.
Characters and Strings. Characters In Java, a char is a primitive type that can hold one single character A character can be: –A letter or digit –A punctuation.
Data Representation in Computers
Data Representation (in computer system) Computer Fundamental CIM2460 Bavy LI.
Lecture 3 1 ISO/IEC and Unicode It is a coded character set(codeset) –Designed for text processing and exchange Features: –Universal: characters.
CIS 234: Character Codes Dr. Ralph D. Westfall April, 2011.
2.1.4 BINARY ASCII CHARACTER SETS A451: COMPUTER SYSTEMS AND PROGRAMMING.
Unicode, character sets, and a a little history. Historical Perspective First came EBCIDIC (6 Bits?) Then in the early 1960s came ASCII – Most computers.
CHARACTERS Data Representation. Using binary to represent characters Computers can only process binary numbers (1’s and 0’s) so a system was developed.
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky Veronika.
ENCODING AND DECODING Experiencing one (or more) bytes out of your A’s.
Introduction to Computing Using Python Chapter 6  Encoding of String Characters  Randomness and Random Sampling.
Sophia Antipolis, September 2006 Multilinguality, localization and internationalization Miruna Bădescu Finsiel Romania.
Localizing OpenClinica Hiroaki Honshuku: SQA 1. © What is Character Encoding?  Morse Code (1840) → Latin Alphabet  ASCII (1963)  The American Standard.
Globalisation & Computer Systems week 5 1. Localisation presentations 2.Character representation and UNICODE UNICODE design principles UNICODE character.
Encoding and fonts Edward Garrett Software Developer, ELAR.
1 herbert van de sompel CS 502 Computing Methods for Digital Libraries Cornell University – Computer Science Herbert Van de Sompel
Representing text Each of different symbol on the text (alphabet letter) is assigned a unique bit patterns the text is then representing as.
Digital Multimedia, 2nd edition Nigel Chapman & Jenny Chapman Chapter 10 This presentation © 2004, MacAvon Media Productions Characters & Fonts.
Agenda Data Representation – Characters Encoding Schemes ASCII
Lecture 2 Character Codes and Low-Structure Text Document Formats.
Computer System Basics 1 Number Systems & Text Representation Computer Forensics BACS 371.
1 CS 502: Computing Methods for Digital Libraries Lecture 4 Text.
B.Sc. Multimedia ComputingMedia Technologies Character Representation & Font Technology.
Data Representation S2. This unit covers how the computer represents- Numbers Text Graphics Control.
IBM Globalization Center of Competency © 2006 IBM Corporation IUC 29, Burlingame, CAMarch 2006 Automatic Character Set Recognition Eric Mader, IBM Andy.
INFOCODING BASICS & EXAMPLES OF CURRENT USE Introduction to Computer Science Using Ruby (c) 2010 Gideon Frieder.
Text and Graphics September 26, Unit 3.
Character Encoding, F onts. Overview Why do character encoding and fonts matter to linguists? How can you identify problems? Why do these problems arise?
Computer System Basics 1 Number Systems & Text Representation Computer Forensics BACS 371.
Globalisation & Computer systems Week 5/6 Character representation ACII and code pages UNICODE.
Anlab ( ) Kim, Yangjung Characters & Fonts.
Representing Characters in a computer Pressing a key on the computer a code is generated that the computer can convert into a symbol for displaying or.
The character data type char. Character type char is used to represent alpha-numerical information (characters) inside the computer uses 2 bytes of memory.
Representation of Characters
Data Representation. What is data? Data is information that has been translated into a form that is more convenient to process As information take different.
1 Problem Solving using Computers “Data....Representation, and Storage.
© 2001, Penn State University Encoding on the Internet Elizabeth J. Pyatt CETS.
Characters CS240.
ASCII AND EBCDIC CODES By : madam aisha.
Representing Characters in a Computer System Representation of Data in Computer Systems.
Information Coding Schemes Group Member : Yvonne Tiffany Jurifah bt Junaidi Clara Jane George.
2. Data Formats. Introduction Examples pp Real World Data Computer Data Input device Dear Mom: Keyboard … Digital camera …
Objectives  Explain the basic Unicode concepts in plain language  Install SILConverters 4.0  Install the converters for your branch  Convert several.
Basics of Unicode (base upon a presentation by NRSI, SIL International)
1.4 Representation of data in computer systems Character.
Lecture Coding Schemes. Representing Data English language uses 26 symbols to represent an idea Different sets of bit patterns have been designed to represent.
DATA REPRESENTATION - TEXT
Binary Representation in Text
Binary Representation in Text
Unit 2.6 Data Representation Lesson 2 ‒ Characters
Characters & Fonts Digital Multimedia, 2nd edition
Data Transfer ASCII FILES.
Representing Information as bit patterns
Data Encoding Characters.
TOPICS Information Representation Characters and Images
Data Representation ASCII.
Representing Characters
Characters & Fonts Digital Multimedia, 2nd edition
Presenting information as bit patterns
COMS 161 Introduction to Computing
COMS 161 Introduction to Computing
INFOCODING BASICS & EXAMPLES OF CURRENT USE
Learning Intention I will learn how computers store text.
ASCII LP1.
Chapter 3 - Binary Numbering System
ASCII and Unicode.
Presentation transcript:

Introduction to Human Language Technologies Tomaž Erjavec Karl-Franzens-Universität Graz Tomaž Erjavec Lecture: Character sets

Overview 1. Basic concepts 2. ASCII 3. 8-bit character sets 4. Unicode 5. Python

Computer coding of characters Computers store data as (binary) numbers Computers store data as (binary) numbers There is no a priori relationship between these numbers and characters (of an alphabet) There is no a priori relationship between these numbers and characters (of an alphabet) If there are no conventions for mapping numbers to characters, or there are too many conventions --> chaos If there are no conventions for mapping numbers to characters, or there are too many conventions --> chaos Standards and quasi standards: ASCII, ISO 8859, (Windows, Mac), Unicode Standards and quasi standards: ASCII, ISO 8859, (Windows, Mac), Unicode

Basic concepts I. a character a character  an abstract concept (An „A“ is something like a Platonic entity: it is the idea of an „A“ and not the „A“ itself)  of itself a character does not have a mapping to a number of a concrete visual representation  so, characters are usu. defined descriptively, e.g. „Greek small letter alpha“; the graphical representation is given only as an examplar, „α“

Basic concepts II. character repertoire or coded character set character repertoire or coded character set  a set of characters  each character is associated with a number (a character code)  identical characters can belong to different characters sets if they are logicaly distinct, e.g. capital letter A in the Latin alphabet, in the Cyrillic alphabet, capital alpha in Greek character code (codepoint) character code (codepoint)  a 1-1 relation between the character from a character set and a number e.g. A = 26, B = 27,...

Basic Concepts III. character encoding character encoding  an algorithm, which translates the character code into a concerete digital encoding, in bytes byte / octet byte / octet  the minimal unit that is processed by a computer  typically 8 bits (0/1) : 0-255

Basic concepts IV glyph glyph  the graphical representation of a character  a character can have several glyphs: A, A, A  sometimes one glyph can have several characters, e.g. the glyph “P” corresponds to the Latin letter P, the Cyrillic letter Er or Greek Rho PErRhoPErRho font font  the graphical representations of a set of characters for some character repertoire (coded character set): A, B, C, Č, D, …

ASCII American Standard Code for Information Interchange (1950') American Standard Code for Information Interchange (1950') a 7-bit character set: range from a 7-bit character set: range from control codes and formatting: Escape, Line Feed, Tab, Space, control codes and formatting: Escape, Line Feed, Tab, Space, – punctuation etc., numbers, lc and English letters : – punctuation etc., numbers, lc and English letters : ! " # $ % & ' ( ) * +, -. / : ; A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ \ ] ^ _ ` a b c d e f g h i j k l m n o p q r s t u v w x y z { | } ~ Oxymoron: ‘8-bit ASCII’ Oxymoron: ‘8-bit ASCII’

ASCII II. Advantages: Advantages:  no chaos: one character - one codepoint (number)  trivial character encoding algorithm: one codepoint - one byte Weakness: Weakness:  does not support non-English characters

8-bit character sets I. In ASCII one bit in byte was left unused In ASCII one bit in byte was left unused so, ½ numbers ( ) not assigned characters so, ½ numbers ( ) not assigned characters The need for extra characters: The need for extra characters:  in the 80‘ s many new character sets appeared  ASCII always a subset  make use of the 8 th bit in a byte ISO publishes character sets for families of European languages – the ISO 8859 familily of standards ISO publishes character sets for families of European languages – the ISO 8859 familily of standards ISO (ISO Latin 1) – Western European languages ISO (ISO Latin 1) – Western European languages ¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬ ­ ® ¯ ° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿ À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß à á â ã ä å æ ç è é ê ë ì í î ï ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ ¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬ ­ ® ¯ ° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿ À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß à á â ã ä å æ ç è é ê ë ì í î ï ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ

8-bit character sets II. for Slovene and other Central and Eastern European languages - anarchy: for Slovene and other Central and Eastern European languages - anarchy:  ISO (ISO Latin 2)  Windows-1250 (grrr!)  others: Apple, IBM  …

8-bit character sets III. Advantages: Advantages:  can write characters of national language alphabets (e.g. German, Slovene, Bulgarian, Greek)  simplicity: one character still codes to one byte Weakness:: Weakness::  chaos because of the large number of character sets for many languages  multilingal texts cannot be written in the same character set  no provisions for Far-eastern languages or for more sophisticated characters  the file does a priory contain information in which character set it is written in: © Global publishing ~ Ž Global publishing  --> there is no such thing as “plain text”!!

Unicode I. If we want to extend the character set, the only solution is to code one character in several bytes If we want to extend the character set, the only solution is to code one character in several bytes 1991 – Unicode Consortium: – Unicode Consortium: ISO Unicode ISO Unicode  defines the universal character set  defines 30 alphabets covering several hundred languages, cca characterov  …CJK, Arabic, Sanskrt,…  historical alphabets, punctuation, math symbols, diacritics, …  A character definition in Unicode: „LATIN CAPITAL LETTER A WITH ACUTE“

Unicode definitions for IPA

Unicode II. 1 character ≠ 1 bye, what now? 1 character ≠ 1 bye, what now? for Unicode, several character encodings exist: for Unicode, several character encodings exist:  UTF-32  1 character – 4 bytes  UTF-16  if BMP character (Basic Multilingual Plane) 1 character – 2 bytes  otherwise 1 character – 4 bytes  UTF-8  varying length: 1-6 bytes for character  if character in ASCII then one byte (compatibility)  most European characters code in two bytes

Unicode III. diactrics exists as zero width characters (combining diacritical marks) diactrics exists as zero width characters (combining diacritical marks) e.g. a + ̂ + ̤ = â̤ e.g. a + ̂ + ̤ = â̤ but problems with displaying complex combinations, but problems with displaying complex combinations, e.g. a + ̂ + ˚ = å̂ e.g. a + ̂ + ˚ = å̂

Back to ASCII ASCII is sometimes still the only safe encoding:  how to keyboard complex characters  how to transfer text ( , www) Re-coding to ASCII: - MIME standard - MIME standard WWW - Unidoce character entities, e.g. &# 353; ( = &# x160;) = š WWW - Unidoce character entities, e.g. &# 353; ( = &# x160;) = šš

Conversion between character sets Linux: Linux: iconv –f windows-1250 –t utf8 text-win > text-utf8 iconv –f windows-1250 –t utf8 text-win > text-utf8 Windows: Windows:  charmap  MS Word / Save as

Python Python documentation: Unicode Strings Python documentation: Unicode Strings ASCII string: 'Hello World !' ASCII string: 'Hello World !' Unicode string: u'Hello World !' Unicode string: u'Hello World !' Use of Unicode codepoint: u'Hello\u0020World !' Use of Unicode codepoint: u'Hello\u0020World !' >>> print u'Toma\u017E Erjavec‘ Tomaž Erjavec >>> print u'Toma\u017E Erjavec‘ Tomaž Erjavec

Coding and decoding Converting Unicode strings into 8 bit encodings and back is done with CODECs >>> u"Toma\u017e Erjavec".encode('utf-8') 'Toma\xc5\xbe Erjavec‘ >>> 'Toma\xc5\xbe Erjavec'.decode('utf-8') u'Toma\u017e Erjavec‘

Use of other character sets >>> u"Toma\u017E Erjavec".encode('iso ') 'Toma\xbe Erjavec' >>> u"Toma\u017E Erjavec".encode('iso ') Traceback (most recent call last): UnicodeEncodeError: 'latin-1' codec can't encode character u'\u017e' in position 4: ordinal not in range(256)

References Well written intro: de.html Well written intro: de.html de.html de.html Good intro to character sets: Good intro to character sets: Official Unicode site: Official Unicode site: Python Unicode Objects: Python Unicode Objects: