Unicode. Mark Davis, Unicode Consortium President and IBM Chief SW Globalization Architect. 2003-09-24.

Presentation transcript:

Unicode
Mark Davis
Unicode Consortium President
IBM Chief SW Globalization Architect

Universal Character Encoding …
- Unique number for every character
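As a small illustration of the "unique number for every character" idea, the following Python sketch prints the code point and standard name of a few characters using the standard-library unicodedata module. The helper name describe and the sample characters are chosen here for illustration and are not part of the slides.

```python
import unicodedata


def describe(ch: str) -> str:
    """Return 'U+XXXX NAME' for a single character (illustrative helper)."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


# Sample characters from several scripts, written as escapes to avoid
# any ambiguity about composed vs. decomposed forms.
samples = [
    "\u0041",  # Latin A
    "\u017D",  # Latin Z with caron
    "\u0428",  # Cyrillic Sha
    "\u0394",  # Greek Delta
    "\u0634",  # Arabic Sheen
    "\u0915",  # Devanagari Ka
]

for ch in samples:
    print(describe(ch))
```

Each character maps to exactly one number, regardless of platform, program, or language.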

Unifies all Languages
- 96 thousand characters, so far
- All characters accessible at the same time, in the same document: A, Ž, Ш, Δ, ش, …

Lingua Franca for Computers
- Developed & supported by industry leaders:
  - Apple, HP, IBM, JustSystem, Microsoft, Oracle, SAP, Sun, Sybase, Unisys, …
- Required by modern standards:
  - XML, HTML, Java, ECMAScript (JavaScript), LDAP, CORBA 3.0, WML, Perl, etc.
- Implemented in:
  - All modern operating systems, browsers, and other products

International Domain Names
- Approved, Unicode-based
- Examples: (not captured in this transcript)
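To make the Unicode basis of international domain names concrete, here is a minimal sketch using Python's built-in "idna" codec, which implements the 2003 IDNA specification (later revisions of IDNA differ in some details). The domain bücher.example and the helper names to_ascii and to_unicode are illustrative assumptions, not examples taken from the slides.

```python
def to_ascii(domain: str) -> str:
    """Convert a Unicode domain name to its ASCII-compatible form."""
    return domain.encode("idna").decode("ascii")


def to_unicode(domain: str) -> str:
    """Convert an ASCII-compatible domain name back to Unicode."""
    return domain.encode("ascii").decode("idna")


name = "b\u00fccher.example"  # "bücher.example"
ascii_form = to_ascii(name)
print(ascii_form)              # xn--bcher-kva.example
print(to_unicode(ascii_form))  # bücher.example
```

The ASCII form is what travels through existing DNS infrastructure, while users see the Unicode form.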

Standard Resources
- Online Standard
- Technical Reports
- FAQs
- General Information
- Discussion Forums, Conferences

Programming Resources
- System APIs:
  - Windows, Java, Unix, Oracle, DB2, Sybase, Mac, Linux, …
- Languages:
  - Java, JavaScript, C#, Perl 5.6.0, C, C++, SQL, …
- Cross-platform libraries:
  - ICU, Rosette, …

Stability
- Developers / other standards need absolute stability
- Characters are never moved or deleted
- Ordering of characters is by collation, not binary order. See UTS #10: Unicode Collation Algorithm
- Characters may be deprecated (discouraged).
- Characters never change names
- Annotations are used to clarify usage
- See Unicode Policies
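The point that ordering comes from collation rather than binary order can be seen in a short Python sketch. Sorting by code point puts all uppercase ASCII letters before lowercase ones and accented letters after "z". The rough_key function below is only a crude stand-in (it strips accents and folds case); it is an assumption for illustration and is not the Unicode Collation Algorithm. Real applications would use a UCA implementation such as the one in ICU.

```python
import unicodedata

words = ["zebra", "Apple", "\u00e9clair", "apple", "Zebra"]

# Binary (code point) order: 'A'..'Z' < 'a'..'z' < accented letters.
binary_order = sorted(words)


def rough_key(s: str):
    """Crude collation-like key: drop combining marks, fold case.

    NOT the Unicode Collation Algorithm; illustration only.
    """
    decomposed = unicodedata.normalize("NFD", s)
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    # The original string breaks ties so the result is deterministic.
    return (base.casefold(), s)


approx_order = sorted(words, key=rough_key)

print(binary_order)  # ['Apple', 'Zebra', 'apple', 'zebra', 'éclair']
print(approx_order)  # ['Apple', 'apple', 'éclair', 'Zebra', 'zebra']
```

Because ordering is defined separately from code point values, characters never need to be moved to "fix" a sort order.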

Indic Support in Unicode
- ISCII the basis for characters and allocation
- Consortium actively engaged with Indian Government, which is a member
- Welcomes addition of missing characters (e.g. Vedic), clarifications or corrections of usage

Structural Similarities with ISCII
- Within script, layout and contents nearly identical
- Independent + dependent vowels
- Halant model for representing conjuncts
  - conjuncts / half-forms not directly encoded
  - represented by sequences instead
- Phonetic sequence – order in syllables
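The halant model can be illustrated with a short Python sketch: a Devanagari conjunct is stored as a sequence of consonant, halant (named VIRAMA in Unicode), and consonant, and the rendering system forms the conjunct shape. The conjunct chosen and the helper name char_names are illustrative assumptions.

```python
import unicodedata


def char_names(text: str) -> list[str]:
    """Return the Unicode character name of each code point in text."""
    return [unicodedata.name(ch) for ch in text]


# KA + VIRAMA (halant) + SSA: displayed as a single conjunct by a
# capable font and layout engine, but stored as three code points.
conjunct = "\u0915\u094D\u0937"

print(len(conjunct))  # 3
for ch in conjunct:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

No separate code point exists for the conjunct itself; the sequence is the encoding.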

Structural Differences with ISCII
- Unicode is stateless:
  - No shifting to get different scripts
  - Each character has a unique number
- Unicode is uniform:
  - No extension bytes necessary
  - All characters coded in the same space
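Statelessness means the script of a character can be read off its code point alone, with no preceding shift code. The Python sketch below looks up a character in a table of Unicode block ranges for nine Indic scripts and also reports its offset within the block, which shows the parallel, ISCII-derived layout (the letter KA sits at the same offset in each block). The table name INDIC_BLOCKS and the function script_of are illustrative assumptions.

```python
from typing import Optional, Tuple

# Unicode block ranges for nine Indic scripts (start, end, script name).
INDIC_BLOCKS = [
    (0x0900, 0x097F, "Devanagari"),
    (0x0980, 0x09FF, "Bengali"),
    (0x0A00, 0x0A7F, "Gurmukhi"),
    (0x0A80, 0x0AFF, "Gujarati"),
    (0x0B00, 0x0B7F, "Oriya"),
    (0x0B80, 0x0BFF, "Tamil"),
    (0x0C00, 0x0C7F, "Telugu"),
    (0x0C80, 0x0CFF, "Kannada"),
    (0x0D00, 0x0D7F, "Malayalam"),
]


def script_of(ch: str) -> Optional[Tuple[str, int]]:
    """Return (script name, offset within block), or None if not listed."""
    cp = ord(ch)
    for start, end, name in INDIC_BLOCKS:
        if start <= cp <= end:
            return name, cp - start
    return None


# The letter KA in three scripts: different numbers, same block offset.
for ka in ("\u0915", "\u0995", "\u0B95"):
    print(f"U+{ord(ka):04X}", script_of(ka))
```

Text mixing several scripts therefore needs no mode switches: each code point identifies its own script.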

Additional Characters
- Indian Government is developing proposals for:
  - Additions of missing characters:
    - Vedic
    - Individual characters for certain scripts
  - Annotations and Descriptions

Global Applications now support languages of India
- Companies supporting Indic with Unicode
- OpenType fonts
- Font support for Indic
- Microsoft Windows
- Java (IBM contributed ICU Indic Layout)
- Linux
- …

Benefits for India
- All documents, anywhere in the world, can have Indic text
- Allows seamless multilingual documents in India
  - including scriptures and minority languages
- Opens up software export market, beyond English
- Connects India to the world

How India Can Contribute
- Effective Communication with the Unicode Consortium
- Provide Resources for Development
- Descriptions of Usage
- Descriptions of Character Shaping
- Transliteration Tables from Script to Script
- Collation Information
- OpenType fonts
- …

What Developers Can Do
- Interwork with existing ISCII systems
- Move to Unicode for future developments
  - Java, Windows, Linux, …

The Future
- The world is moving rapidly to Unicode
- Unicode makes India open to the world
  - The world comes to you, and
  - You go to the world
- You can help

Q & A

Backup Slides

Multiple Forms
- UTF-8: maximal compatibility with 8-bit systems
- UTF-16: good storage, interoperability with Windows/Java
- UTF-32: simplest processing
- Fast, lossless conversion
- See Forms of Unicode
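The three encoding forms and their lossless conversion can be sketched in Python using the standard codecs. The sample string and the helper name encoded_sizes are illustrative assumptions; the little-endian codec variants are used so that no byte order mark is added to the byte counts.

```python
ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")


def encoded_sizes(text: str) -> dict[str, int]:
    """Return the number of bytes text occupies in each encoding form."""
    return {enc: len(text.encode(enc)) for enc in ENCODINGS}


# An ASCII letter, Devanagari KA, and a supplementary-plane character.
sample = "A\u0915\U00010400"

print(encoded_sizes(sample))
# {'utf-8': 8, 'utf-16-le': 8, 'utf-32-le': 12}

# Conversion between forms is lossless: every round trip returns the
# original string.
for enc in ENCODINGS:
    assert sample.encode(enc).decode(enc) == sample
```

UTF-8 uses 1 to 4 bytes per character, UTF-16 uses 2 or 4, and UTF-32 always uses 4, which is why it is the simplest to process but the largest for most text.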