Sophia Antipolis, September 2006 Multilinguality, localization and internationalization Miruna Bădescu Finsiel Romania.


Unicode, encodings and character sets

How it all started…
- Until recently, most computers used character sets with a maximum of 256 characters (ANSI):
  - The first 128 (ASCII):
    - numbers
    - letters a-z and A-Z
    - punctuation marks
  - The second set of 128 varies:
    - In the English-speaking world it contains:
      - more punctuation marks
      - currency symbols (e.g. £)
      - accented letters (á, é, ñ, ç, ô)
    - In places like Egypt, Greece and Russia it contains characters taken from the corresponding alphabet: Arabic, Greek, Cyrillic

Code, encoding
- Character code: a sequence of bits that a computer uses to represent a character
- Encoding: the rule describing how a sequence of bytes is transformed into characters

Problem
- These encoding systems also conflict with one another. Two encodings:
  - can use the same number for two different characters
  - can use different numbers for the same character
- Data can become incomprehensible when transferred from one place to another
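
Both kinds of conflict can be reproduced with a few lines of Python. This is only an illustrative sketch: the helper names `decode_with` and `encode_with` are made up for this example, and the code pages (Latin-1, Windows-1251, ISO 8859-7, CP437) were picked as typical legacy encodings, not taken from the slides.

```python
import unicodedata


def decode_with(raw, codecs):
    """Decode the same bytes with several codecs; return {codec: text}."""
    return {codec: raw.decode(codec) for codec in codecs}


def encode_with(char, codecs):
    """Encode the same character with several codecs; return {codec: bytes}."""
    return {codec: char.encode(codec) for codec in codecs}


# Same number, different characters: byte 0xE9 in three legacy code pages.
for codec, text in decode_with(b"\xe9", ["latin-1", "cp1251", "iso8859_7"]).items():
    print(f"{codec:10} 0xE9 -> U+{ord(text):04X} {unicodedata.name(text)}")

# Same character, different numbers: U+00E9 (e with acute) in two code pages.
for codec, raw in encode_with("\u00e9", ["latin-1", "cp437"]).items():
    print(f"{codec:10} U+00E9 -> 0x{raw.hex().upper()}")
```

The single byte 0xE9 is read as a Latin, a Cyrillic or a Greek letter depending on the code page assumed by the receiver, which is exactly how text becomes unreadable in transit.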

Solution
- Move to a system that assigns a unique number to each character in each language of the world
- The Unicode standard provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language
- Unicode (as defined by the Unicode Consortium) has become a universal standard; its character repertoire is kept synchronized with ISO/IEC 10646, which describes the 'Universal Multiple-Octet Coded Character Set' (UCS)

Unicode
- The Unicode repertoire can be encoded in more than one way: UTF-8, UTF-16, UTF-32
- UTF-8 encodes:
  - ASCII characters on 1 byte
  - other characters on 2 to 4 bytes (the original design allowed up to 6 bytes; RFC 3629 limits UTF-8 to 4)
- Incorporating Unicode into client-server or multi-tiered applications and websites offers significant cost savings over the use of legacy character sets
- It enables a single software product or a single website to be targeted across multiple platforms, languages and countries without re-engineering
- It allows data to be transported through many different systems without corruption
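
To make the size differences concrete, the sketch below encodes four sample code points in each of the three forms and reports the number of bytes. The helper name `encoded_lengths` and the sample characters are assumptions made for this illustration. Python's built-in codecs follow the current definition of UTF-8, so no code point takes more than 4 bytes.

```python
def encoded_lengths(char):
    """Return the byte length of one character in UTF-8, UTF-16 and UTF-32."""
    # The "-le" variants are used so that no byte order mark is counted.
    return {enc: len(char.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}


samples = {
    "LATIN CAPITAL LETTER A": "A",
    "LATIN SMALL LETTER E WITH ACUTE": "\u00e9",
    "EURO SIGN": "\u20ac",
    "MUSICAL SYMBOL G CLEF": "\U0001D11E",
}

for name, char in samples.items():
    print(f"U+{ord(char):04X} {name}: {encoded_lengths(char)}")
```

ASCII text stays at one byte per character in UTF-8, which is one reason it is a common choice for web pages, while UTF-32 always spends four bytes per code point.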

Internationalization and localization

I18n
- Internationalization (I18n): modification of an application so that it can handle multiple languages, countries, etc.:
  - Display content (web pages, files) in the end user's language
  - Display messages around the site in the user's language (e.g. "Home", "Search", error messages)
  - Input characters in the end user's language
  - Print out the correct characters
  - Handle dates, numbers and sorting of words using the rules of that language

L10n
- Localization (L10n) involves taking a product and making it linguistically and culturally appropriate to the target locale (country/region and language)
- Ways to change the language on a Web site:
  - User selection
  - Detecting the browser settings
  - Automatically, based on the user's profile
- Translation issues:
  - Identifying untranslated or outdated translations of terms and phrases
  - Different roles for translators and content managers
  - Offering an interface for the content translation
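
Detecting the browser settings usually comes down to reading the HTTP `Accept-Language` request header. The following is a minimal sketch of such a parser, not the portal's actual code: the function name `parse_accept_language` is hypothetical, and wildcard (`*`) handling and language-range matching are left out.

```python
def parse_accept_language(header):
    """Return the language tags of an Accept-Language header, best first."""
    weighted = []
    for position, part in enumerate(header.split(",")):
        piece = part.strip()
        if not piece:
            continue
        tag, _, params = piece.partition(";")
        quality = 1.0  # a tag without an explicit q-value has weight 1
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0  # treat a malformed weight as "not acceptable"
        if quality > 0:
            # Sort by descending weight, then by position in the header.
            weighted.append((-quality, position, tag.strip().lower()))
    return [tag for _, _, tag in sorted(weighted)]


print(parse_accept_language("fr-FR, fr;q=0.9, en;q=0.8, ar;q=0.5"))
print(parse_accept_language("en;q=0.3, ar"))
```

The resulting list can then be compared with the languages the site actually offers, with an explicit user selection or a stored profile setting taking precedence when present.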

Example of an XLIFF translation file coming from the translation service
- XLIFF: XML Localization Interchange File Format
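
The file shown on the original slide is not part of this transcript. As a stand-in, the sketch below embeds a small hand-written XLIFF 1.2 document (the file name, ids and phrases are invented) and reads it with Python's standard `xml.etree.ElementTree`. The helper `read_translations` is a hypothetical name; units without a `<target>` element are reported as `None`, which is one simple way to spot untranslated phrases.

```python
import xml.etree.ElementTree as ET

# Hand-written sample; NOT the file shown on the slide.
SAMPLE_XLIFF = """
<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">
  <file original="portal-messages" source-language="en"
        target-language="fr" datatype="plaintext">
    <body>
      <trans-unit id="1">
        <source>Home</source>
        <target>Accueil</target>
      </trans-unit>
      <trans-unit id="2">
        <source>Search</source>
        <target>Rechercher</target>
      </trans-unit>
      <trans-unit id="3">
        <source>Log in</source>
      </trans-unit>
    </body>
  </file>
</xliff>
"""

NS = {"x": "urn:oasis:names:tc:xliff:document:1.2"}


def read_translations(xliff_text):
    """Map (target language, source phrase) to the translation, or None."""
    root = ET.fromstring(xliff_text.strip())
    result = {}
    for file_el in root.findall("x:file", NS):
        lang = file_el.get("target-language")
        for unit in file_el.findall(".//x:trans-unit", NS):
            source = unit.findtext("x:source", default="", namespaces=NS)
            target = unit.findtext("x:target", default=None, namespaces=NS)
            result[(lang, source)] = target
    return result


for (lang, source), target in read_translations(SAMPLE_XLIFF).items():
    print(lang, repr(source), "->", repr(target))
```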

Sorting in different languages

Sorting in the same language
- Strings must be sorted according to that language's sorting rules
- Complex characters, ignorable characters and exceptional words must be considered
- Normally done in two steps:
  - primary sorting
    - uppercase and lowercase characters are equivalent
    - diacritical marks are ignored
    - ignorable characters are not considered
  - secondary sorting
    - difference between uppercase and lowercase
    - characters with diacritical marks are ranked individually
    - ignorable characters influence the sorting
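
The two steps can be approximated with a two-part sort key. This is a deliberately simplified sketch and not a language-specific collation: the set of ignorable characters is an assumption, the tie-break simply compares code points of the decomposed string, and a real application would more likely rely on a collation library that implements the Unicode Collation Algorithm. The function names are hypothetical.

```python
import unicodedata

IGNORABLE = {"-", "'"}  # assumption: characters skipped in the primary step


def primary_key(text):
    """Ignore case, diacritical marks and ignorable characters."""
    kept = "".join(c for c in text if c not in IGNORABLE)
    decomposed = unicodedata.normalize("NFD", kept)
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    return base.casefold()


def sort_key(text):
    """Primary key first; the decomposed original string breaks ties."""
    return (primary_key(text), unicodedata.normalize("NFD", text))


# "c\u00f4te" is "cote" with a circumflex on the o; "cot\u00e9" ends in e-acute.
words = ["zebra", "c\u00f4te", "coop", "Cote", "apple", "cote", "cot\u00e9", "Zoo", "co-op"]

print(sorted(words))                # plain code point order
print(sorted(words, key=sort_key))  # two-step order
```

With plain code point order, every capitalised word comes before every lowercase one and the accented forms drift away from their base word. With the two-step key, the variants of "cote" stay together, and case, accents and the hyphen only decide the order inside each group.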

Sorting in different languages
- Approaches:
  1. All strings in the same language are sorted according to that language's rules; sorting is also governed by an order among languages or among groups of languages (e.g. English, German, French = Roman group)
  2. Sort using the sorting rules that are associated with the language chosen by the end user or with the site language

SEMIDE portal and toolkit - multilinguality issues

Multilingual portal: EN, FR, AR, …

Features
- All pages are encoded in UTF-8
  - all characters of the world are supported
- Default language set at startup: English

What aspects are multilingual?
- Graphical user interface
  - translation from the administrative area
  - one by one, .po, .XLIFF
- Content
  - individual translation for each item on edit
- Glossaries and thesauri
  - translation from Zope's Management Interface
- Syndication (RDF channels)
  - depends on the selected language
- Searches
  - multiple selection by the user

Language negotiation
- When an item is not translated in the language selected by the end user, the system searches for translations in:
  1. the language from the user's browser settings
  2. the default language
- …and displays the item's id if none of these work
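
The fallback chain described above can be sketched as a short function. This is an assumption-laden illustration, not the portal's implementation: the function name `negotiate`, its parameters and the dictionary-based storage of translations are all invented for the example.

```python
def negotiate(translations, selected, browser_langs, default_lang, item_id):
    """Pick the text to display for one item.

    translations:  {language code: translated text} for the item
    selected:      language chosen by the end user
    browser_langs: languages from the browser settings, best first
    default_lang:  the portal's default language
    item_id:       shown when no translation is found at all
    """
    for lang in [selected, *browser_langs, default_lang]:
        text = translations.get(lang)
        if text:  # skip missing and empty translations
            return text
    return item_id


item = {"en": "Water resources", "fr": "Ressources en eau"}

print(negotiate(item, "fr", ["en"], "en", "water-res"))        # selected language
print(negotiate(item, "ar", ["fr", "en"], "en", "water-res"))  # browser language
print(negotiate(item, "ar", ["es"], "en", "water-res"))        # default language
print(negotiate({}, "ar", ["es"], "en", "water-res"))          # item id
```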