1/25 Writing Character sets Unicode Input methods.

2/25 Character sets What’s the problem? –Computer should handle your language’s writing system in a natural way –“Handle” means input and output (and some other things, e.g. sorting) –“Natural” means what you are used to, for both the input method and the output (it should look right) English is straightforward (why?), but other languages are not Distinguish: storage and handling of text within the computer vs. input/output
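As a rough illustration of why sorting counts as part of "handling" a writing system, the Python sketch below compares plain code-point order with an order based on base letters. The helper name sort_key and the word list are made up for this example, and stripping accents is only a crude stand-in for real language-specific collation rules.

```python
import unicodedata


def sort_key(word: str) -> str:
    """Crude sort key (an assumption for illustration): drop accents, ignore case."""
    decomposed = unicodedata.normalize("NFD", word)
    # Combining marks have a non-zero combining class; drop them.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


words = ["zebra", "\u00c4pfel", "apple"]  # "\u00c4" is the character Ä

by_code_point = sorted(words)                 # raw numeric order of the codes
by_base_letter = sorted(words, key=sort_key)  # closer to what a reader expects

print(by_code_point)
print(by_base_letter)
```

In code-point order the word starting with Ä lands after "zebra", because its code is numerically larger than that of any unaccented Latin letter.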

3/25 Why the fuss? Typing characters on a computer may appear deceptively simple: you press a key labelled “A”, and the character “A” appears on the screen. Well, you actually get uppercase “A” or lowercase “a” depending on whether you used the shift key or not, but that’s common knowledge. You also expect “A” to be included in a disk file when you save what you are typing, you expect “A” to appear on paper if you print your text, and you expect “A” to be sent if you send your product by e-mail or something like that. And you expect the recipient to see an “A”. No big deal, but does the same happen for “Ä”? Or “ ” –Depends on keyboard settings, display settings, and degree of standardization Adapted from:

4/25 Character sets Size of character set has to do with storage as bits and bytes: n bits give 2^n distinct codes –Early character codes had only 32 or 64 positions (5 or 6 bits) – upper case “English” plus numerals and a few other symbols –ASCII is a 7-bit code with space for 128 characters (95 printable, the rest control codes) –Most alphabetic writing systems can each be covered by 128 characters Internal storage is independent of i/o Leads to need for standardization of encoding
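The link between code width and character-set size can be checked with a few lines of Python. This is a minimal sketch; the function name code_positions is chosen here for illustration only.

```python
def code_positions(bits: int) -> int:
    """Number of distinct codes that fit in the given number of bits."""
    return 2 ** bits


for bits in (5, 6, 7, 8):
    print(f"{bits} bits -> {code_positions(bits)} codes")

# "A" is stored as the number 65, which fits in the 7 bits of ASCII.
code_for_a = ord("A")
seven_bit_pattern = format(code_for_a, "07b")
print(code_for_a, seven_bit_pattern)

# "Ä" has no ASCII code; an 8-bit set such as ISO 8859-1 (Latin-1)
# stores it in one byte from the upper half of the table.
latin1_byte = "\u00c4".encode("latin-1")
print(latin1_byte)
```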

5/25 Writing systems Alphabetic –Many languages use Roman alphabet –Often with diacritics (accents): many are common to lots of languages, but some are quite unusual, and some languages use multiple diacritics –There are other alphabetic writing systems –Conventionally, a range of other symbols (numerals, currency signs, fractions, math symbols) are included Syllabic Ideographic

6/25 Accented characters Input method –Individual key –Key combination –Menu Must be available in all fonts

7/25 Characters and glyphs A single character might have a variety of appearances (glyphs) depending on size, font, etc. –a a a a a a a a a a a –A a à å α are all different characters –Appearance is a matter of rendering In some writing systems, the same character is rendered differently depending on its context
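One way to see that A, a, à, å and α are different characters rather than different glyphs is to print their code points and standard names. The sketch below uses Python's unicodedata module; the variable names are illustrative.

```python
import unicodedata

# A, a, à, å and Greek alpha, written with escapes to avoid font issues.
characters = "Aa\u00e0\u00e5\u03b1"

code_points = [ord(ch) for ch in characters]
for ch in characters:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
```

Changing the font or size of any one of these changes only its glyph; the code point stays the same.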

8/25 Output text direction Note mixed LR and RL in Arabic, and orientation of Roman script in Chinese

9/25 Unicode Problem of many (competing) standards, especially for Arabic, CJK and Indian scripts Industry-agreed standard aiming to cover “all” the world’s writing systems “Unicode consists of a repertoire of about 100,000 characters, a set of code charts for visual reference, an encoding methodology and set of standard character encodings, an enumeration of character properties such as upper and lower case, a set of reference data computer files, and a number of related items, such as character properties, rules for text normalization, decomposition, collation, rendering and bidirectional display order” (Wikipedia)
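The "set of standard character encodings" mentioned in the quotation includes UTF-8, UTF-16 and UTF-32. As a small sketch, the Python snippet below counts how many bytes each encoding needs for a few sample characters; the helper name encoded_lengths and the choice of samples are assumptions made for this example.

```python
def encoded_lengths(ch: str) -> tuple:
    """Byte counts of one character in UTF-8, UTF-16 and UTF-32 (no byte-order mark)."""
    return (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-le")),
        len(ch.encode("utf-32-le")),
    )


samples = {
    "Latin A": "A",
    "A with diaeresis": "\u00c4",
    "CJK ideograph U+4E2D": "\u4e2d",
    "musical G clef U+1D11E": "\U0001d11e",
}

for label, ch in samples.items():
    print(label, encoded_lengths(ch))
```

The same character always has the same code point; only the number of bytes used to store it depends on the encoding.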

10/25 Unicode – some issues 30+ writing systems encoded, but many more still to do Non-alphabetic symbols should be included (e.g. music notation, currency symbols) Should invented alphabets (e.g. Klingon, Tolkien) and/or ancient systems (hieroglyphics, Mayan) be included?

11/25 Unicode – some issues Ready-made vs composite characters, e.g. é = e+´; Hangul and Chinese/Japanese characters made up of identifiable components Ligatures: many writing systems have special forms for character combinations –Is this a matter of representation or rendering? –Some disputed characters: ligature or separate character? (e.g. Dutch ij) Unicode also defines ordering conventions, not always uncontroversial
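The ready-made vs composite distinction can be tried out with Unicode normalization. The sketch below uses Python's unicodedata.normalize; the sample characters (é, the Hangul syllable U+D55C, and the Dutch ij ligature U+0133) are chosen here for illustration.

```python
import unicodedata

precomposed = "\u00e9"   # é as one ready-made character
composite = "e\u0301"    # e followed by a combining acute accent

# The two strings look the same when rendered but are stored differently.
same_when_stored = precomposed == composite

# NFC composes, NFD decomposes.
composed = unicodedata.normalize("NFC", composite)
decomposed = unicodedata.normalize("NFD", precomposed)

# A Hangul syllable decomposes into its component jamo.
hangul_parts = unicodedata.normalize("NFD", "\ud55c")

# The ij ligature character maps to plain "ij" only under the
# compatibility forms (NFKC/NFKD); NFC leaves it untouched.
ij_compat = unicodedata.normalize("NFKC", "\u0133")
ij_canonical = unicodedata.normalize("NFC", "\u0133")

print(same_when_stored, composed == precomposed, decomposed == composite)
print([f"U+{ord(ch):04X}" for ch in hangul_parts])
print(ij_compat, ij_canonical == "\u0133")
```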

12/25 Input methods Typing –Keyboard layout –Key combinations –Inputting ideographs Handwriting pad OCR

13/25 Typing We are used to a conventional keyboard, which has (roughly) one key-stroke per character We quickly learn key-stroke combinations (e.g. for capitals, accented characters) Fluent typists rely on the key layout being familiar

14/25

15/25 Typing Recent emergence of MSN on telephones has required input using just ten keys –Shows that software can map key-stroke combination to appropriate character sequence For some users, bilingual keyboards are commonplace
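A minimal sketch of how software might map ten-key presses to characters, assuming the common "multi-tap" scheme in which repeated presses of one key cycle through its letters. The KEYPAD table follows the usual telephone letter layout; the function name and the space-separated input format are assumptions for this example.

```python
# Usual letter assignment on a telephone keypad; "0" is used for a space here.
KEYPAD = {
    "2": "abc", "3": "def", "4": "ghi", "5": "jkl",
    "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz", "0": " ",
}


def decode_multitap(presses: str) -> str:
    """Decode groups of repeated key presses, e.g. "44 33" -> "he".

    Groups are separated by spaces (standing in for the short pause a
    user leaves between two letters that share a key).
    """
    letters = []
    for group in presses.split():
        if len(set(group)) != 1:
            raise ValueError(f"group {group!r} mixes different keys")
        options = KEYPAD[group[0]]
        # Pressing a key more times than it has letters wraps around.
        letters.append(options[(len(group) - 1) % len(options)])
    return "".join(letters)


print(decode_multitap("44 33 555 555 666"))
```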

16/25 Non-alphabetic writing systems Syllabic system may require multiple key-strokes per character Ideographic system (Chinese, Japanese) typically has input based on pronunciation, plus conversion to character, which may require contextual analysis Alternate method: composition by radical + stroke count
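Pronunciation-based input can be sketched as a two-step lookup: each typed syllable has several candidate characters, and context decides between them. The toy Python example below is only an illustration: the tiny CANDIDATES and PHRASES tables are invented for this sketch, and "context" is reduced to preferring a known two-syllable word over single-character guesses.

```python
# Toy data (assumption): romanized syllable -> candidate characters,
# with the first candidate treated as the default choice.
CANDIDATES = {
    "zhong": ["中", "重", "种"],
    "guo": ["国", "果", "过"],
    "ma": ["吗", "马", "妈"],
}

# Toy data (assumption): known two-syllable words.
PHRASES = {("zhong", "guo"): "中国"}


def convert(syllables: list) -> str:
    """Convert romanized syllables to characters, preferring known words."""
    output = []
    i = 0
    while i < len(syllables):
        pair = tuple(syllables[i:i + 2])
        if pair in PHRASES:
            # The neighbouring syllable disambiguates both characters at once.
            output.append(PHRASES[pair])
            i += 2
        else:
            # No context available: fall back to the default candidate.
            output.append(CANDIDATES[syllables[i]][0])
            i += 1
    return "".join(output)


print(convert(["zhong", "guo"]))
print(convert(["guo", "ma"]))
```

A real input method would also show the remaining candidates in a menu so the user can override the default.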

17/25 Graphic input Using stylus, e.g. on PDA Also using finger on touchpad on laptop Depends on recognizing stroke direction and order –Shorthand method invented –Recent systems recognize conventional letter shapes... –... in all their varieties

18/25 Graphic input Also found for Chinese/Japanese Important to get stroke order correct

19/25 OCR Optical character recognition “Scanning” Essentially a pattern recognition task: how similar is a given image to the expected image –Divide image into regions –Measure blackness of each region –Compare resulting matrix with template
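The three steps listed above (divide into regions, measure blackness, compare with a template) can be sketched in plain Python. Everything concrete here is an assumption for illustration: the 4x4 images drawn with "#" for black pixels, the 2x2 grid of regions, the squared-difference distance, and the function names.

```python
def blackness(image, rows, cols):
    """Fraction of black pixels in each cell of a rows x cols grid."""
    region_h = len(image) // rows
    region_w = len(image[0]) // cols
    matrix = []
    for r in range(rows):
        row = []
        for c in range(cols):
            black = sum(
                image[y][x] == "#"
                for y in range(r * region_h, (r + 1) * region_h)
                for x in range(c * region_w, (c + 1) * region_w)
            )
            row.append(black / (region_h * region_w))
        matrix.append(row)
    return matrix


def distance(a, b):
    """Sum of squared differences between two blackness matrices."""
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def recognise(image, templates, rows=2, cols=2):
    """Return the name of the template whose matrix is closest to the image's."""
    features = blackness(image, rows, cols)
    return min(
        templates,
        key=lambda name: distance(features, blackness(templates[name], rows, cols)),
    )


TEMPLATES = {
    "I": [".##.", ".##.", ".##.", ".##."],
    "L": ["#...", "#...", "#...", "####"],
    "T": ["####", ".##.", ".##.", ".##."],
}

# An "L" with one stray black pixel, as a scanner might produce.
noisy_l = ["#...", "#...", "#..#", "####"]
print(recognise(noisy_l, TEMPLATES))
```

With so few regions, characters with similar ink distribution become hard to tell apart, which is one reason a finer grid or additional features would be needed in practice.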

20/25 OCR Originally developed with special OCR font which maximized the differences between characters For Latin scripts, works very well with almost any font Can include orientation detection Errors are predictable and could be eradicated with more sophisticated (linguistic) processing, but is it worth it?

21/25 OCR for handwriting Neat printed handwriting not much harder than some fonts Joined-up cursive handwriting still a research problem Related problem of handwriting recognition – a bit like speech understanding and voice recognition

22/25 OCR for other scripts Correspondingly more difficult, depending on –Complexity of writing system in general –Complexity and similarity of individual characters

23/25 Not always easy Handwriting is even harder

24/25 Need for OCR Input of (all sorts of) texts for various purposes –Rapid input to save (re)typing –For further processing –For study Two typical (hard) cases –Study of ancient manuscripts –Intelligence gathered in Iraq

25/25