A Field Linguist’s Guide to Unicode Deborah Anderson Script Encoding Initiative (Universal Scripts Project) Dept. of Lings., UC-Berkeley LSA Panel: A Field.

Presentation transcript:

A Field Linguist’s Guide to Unicode
Deborah Anderson, Script Encoding Initiative (Universal Scripts Project), Dept. of Linguistics, UC Berkeley
LSA Panel: A Field Linguist’s Guide to Making Long-Lasting Texts and Databases
January 4, 2007

Working with Text Representation
“Use Unicode” (ISO/IEC 10646)
Practical issues to consider:
* Which Unicode characters?
* What about fonts?
* How about keyboards?
* Will the language be supported in off-the-shelf software?

Working with Text Representation
The goal today is to discuss the whole process of enabling a language to be used on a computer:
* identifying letters/symbols in Unicode
* fonts
* keyboards
* how to get support for the characters and scripts in software

Step 1. Identify the characters used in a language
* List all letters, symbols, digits, and marks of punctuation used in a language

One proposal for the Kazym Khanty alphabet

Step 1. Identify the characters used in a language
* List all letters, symbols, digits, and marks of punctuation used in a language
* Assign Unicode code points
* Post a plain text version on a publicly accessible website
* Circulate this list for comment
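The listing step above can be sketched in a few lines of Python. This is only an illustration, not a tool from the talk: the function name `character_inventory` and the sample string are hypothetical, and the names come from whatever version of the Unicode database ships with the Python in use.

```python
import unicodedata

def character_inventory(text):
    """Return (code point label, character, name) rows for each distinct character."""
    rows = []
    for ch in sorted(set(text)):
        if ch.isspace():
            continue  # whitespace is not part of the inventory
        # Fall back to a placeholder if this database has no name for the character.
        name = unicodedata.name(ch, "<no name in this Unicode database>")
        rows.append((f"U+{ord(ch):04X}", ch, name))
    return rows

# Hypothetical sample; replace it with real text in the language being documented.
sample = "\u025B\u014B\u025B\u0301"
for code_point, ch, name in character_inventory(sample):
    print(f"{code_point}\t{ch}\t{name}")
```

The tab-separated output is plain text, so it can be posted and circulated for comment as the slide suggests.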

Step 1. Identify the characters used in a language
Questions on which Unicode characters to use?
* Check the code charts on the Unicode website
* Check the names list and annotations
* Not in the Unicode charts? See if it is on the “Pipeline” page on the website for new characters
* Unsure? Ask on the Unicode mailing list
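As a rough first pass before consulting the charts, a script can flag characters that the local Unicode database does not know. The sketch below assumes Python's `unicodedata` module; the helper name `unassigned_or_private` is hypothetical. Python's bundled database may lag behind the current standard, so a character reported as unassigned here may simply be newer than the database, or still in the pipeline.

```python
import unicodedata

def unassigned_or_private(chars):
    """Split characters into (unassigned, private_use) lists by general category."""
    # "Cn" means not assigned in this database version; "Co" means private use.
    unassigned = [c for c in chars if unicodedata.category(c) == "Cn"]
    private_use = [c for c in chars if unicodedata.category(c) == "Co"]
    return unassigned, private_use

print("Unicode database version:", unicodedata.unidata_version)

# U+014B is an encoded letter, U+E000 is private use,
# and U+0378 is unassigned at the time of writing.
candidates = ["\u014B", "\uE000", "\u0378"]
unassigned, private_use = unassigned_or_private(candidates)
print("Unassigned:", [f"U+{ord(c):04X}" for c in unassigned])
print("Private use:", [f"U+{ord(c):04X}" for c in private_use])
```

Private-use code points are worth flagging because text that relies on them is not interchangeable without the matching font.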

Step 1. Identify the characters used in a language
Propose any missing characters for inclusion in the Unicode Standard
* TIP: Apply for funding to write a Unicode proposal or to conduct research
* TIP: Allow enough time for writing and review of the proposal
* Note: Once written, the proposal will take 2-5 years to get through the standards bodies

Step 1. Identify the characters used in a language
For languages without an orthography, consult Unicode Technical Note #19.

Step 1. Identify the characters used in a language
From Unicode Technical Note #19:
If at all possible, use an already encoded character, abiding by the following tips:
* If the script is right-to-left, select a character from a script that is right-to-left
* Avoid “presentation forms” or “letterlike characters”
* For a punctuation mark, select a character from the General Punctuation block
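The three tips above lend themselves to a quick automated check. The sketch below is one possible approximation, not a procedure given in the Technical Note: it assumes that a tagged (compatibility) decomposition is a reasonable proxy for “presentation forms” and “letterlike characters”, and the function names are hypothetical.

```python
import unicodedata

def is_right_to_left(ch):
    """True if the character has a strong right-to-left bidi class."""
    return unicodedata.bidirectional(ch) in ("R", "AL")

def is_compatibility_character(ch):
    """True if the character has a tagged decomposition such as <font> or <isolated>."""
    # Assumption: tagged decompositions approximate presentation/letterlike forms.
    return unicodedata.decomposition(ch).startswith("<")

def in_general_punctuation(ch):
    """True if the character lies in the General Punctuation block (U+2000..U+206F)."""
    return 0x2000 <= ord(ch) <= 0x206F

for ch in ["\u05D0", "\uFB50", "\u2113", "\u2026", "a"]:
    print(
        f"U+{ord(ch):04X}",
        "RTL" if is_right_to_left(ch) else "LTR/neutral",
        "compatibility" if is_compatibility_character(ch) else "ordinary",
        "general punctuation" if in_general_punctuation(ch) else "",
    )
```

A flagged character is a prompt to look at the code charts, not an automatic rejection.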

Step 2: Send locale data to CLDR project
Locales: local conventions used to create software that is tailored to a specific language and location
* Currency ($, £, etc.)
* Time/date formats, measurement systems (e.g., France: , Germany: , U.S.: 902,300)
* Sorting order
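To show why sorting order counts as locale data, the sketch below compares plain code point order with an order defined by an alphabet. The alphabet and word list are hypothetical, and the approach is a deliberately simple single-level comparison, not how CLDR collation data is actually structured or applied.

```python
# Hypothetical alphabet order; a real one would come from the language community.
ALPHABET = "abcdefghijklmn\u00F1opqrstuvwxyz"
RANK = {letter: position for position, letter in enumerate(ALPHABET)}

def collation_key(word):
    """Map a word to a list of ranks; unknown characters sort after the alphabet."""
    return [RANK.get(ch, len(ALPHABET) + ord(ch)) for ch in word.lower()]

words = ["oso", "\u00F1u", "nube", "nada"]
print("Code point order:", sorted(words))
print("Alphabet order:  ", sorted(words, key=collation_key))
```

In code point order the word beginning with ñ lands after every word beginning with an unaccented letter; with the alphabet-based key it falls between n and o, which is the kind of expectation a locale records.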

Step 2: Send locale data to CLDR project
Common Locale Data Repository (CLDR): a project hosted by Unicode that makes locale information freely available to software developers and others.

Step 2: Send locale data to CLDR project

TIP: Involve a member of the user community in submitting locale data

Step 3: Create a font
* Once a list of all the letters and symbols has been created with Unicode values, work can begin on a font
* If any characters are being proposed, wait until they are far along in the standards process
* TIP: Apply for funding to create a freely available font; costs can run $100/glyph

Step 3: Create a font
* Work with someone familiar with both the script and computer typography (especially for complex scripts)
* Use FontLab

Step 4: Rendering Engines for complex scripts need upgrades
* For new complex scripts (e.g., bidi issues, complex ligatures), upgrades to the rendering engine are often needed in order to properly draw the glyphs
* Early contact with companies (Microsoft and Adobe), the Linux community, and SIL is advised so the rendering engine can support the script properly

Examples of complex scripts: N’Ko, Javanese

Step 4: Rendering Engines for complex scripts need upgrades
* SIL’s Graphite rendering engine offers a good test environment
* Generally Apple does not require upgrades to its rendering engine
* Microsoft prioritizes which scripts are included in its next rendering engine; governmental support is helpful in making a case to MS
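A quick way to see whether a text is likely to need rendering-engine attention is to look at the properties of its characters. The sketch below is only an illustration with a hypothetical function name; it reports right-to-left characters and combining marks, which are two of the things a rendering engine must handle, and it says nothing about ligatures or reordering rules.

```python
import unicodedata

def rendering_profile(text):
    """Summarize properties of a string that a rendering engine has to handle."""
    return {
        # Strong right-to-left characters require bidirectional layout.
        "right_to_left": any(unicodedata.bidirectional(c) in ("R", "AL") for c in text),
        # Combining marks (nonspacing or spacing) must be positioned relative to a base.
        "combining_marks": sum(1 for c in text if unicodedata.category(c) in ("Mn", "Mc")),
        "code_points": len(text),
    }

# N'Ko sample: NKO LETTER A followed by NKO COMBINING SHORT HIGH TONE.
nko = "\u07CA\u07EB"
print(rendering_profile(nko))
print(rendering_profile("abc"))
```

The stored text is always in logical order; turning it into correctly ordered and positioned glyphs is the rendering engine's job, which is why script support has to exist there and not only in the font.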

Step 5: Create a Keyboard
A number of keyboard creation programs are available, including:
* Keyman (for Windows)
* Microsoft Keyboard Layout Creator (“MSKLC”)
* Ukelele (for the Mac)
* Keyboard mapping for Linux

Step 5: Create a Keyboard
* Make the keyboard layout practical and have the user community test it
* Make the keyboard layout freely available (such as on Tavultesoft’s website)
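Before committing to one of the keyboard tools, a layout idea can be prototyped as a simple mapping from typed key sequences to output characters. The sketch below is not the file format of Keyman, MSKLC, or Ukelele; the layout, the use of `;` as a modifier key, and the function name are all hypothetical.

```python
# Hypothetical layout: typed key sequences on the left, output characters on the right.
LAYOUT = {
    "n;": "\u014B",  # LATIN SMALL LETTER ENG
    "e;": "\u025B",  # LATIN SMALL LETTER OPEN E
    "o;": "\u0254",  # LATIN SMALL LETTER OPEN O
}

def apply_layout(keystrokes, layout=LAYOUT):
    """Convert keystrokes to text, preferring the longest matching key sequence."""
    longest = max(len(sequence) for sequence in layout)
    output = []
    i = 0
    while i < len(keystrokes):
        for size in range(longest, 0, -1):
            chunk = keystrokes[i:i + size]
            if chunk in layout:
                output.append(layout[chunk])
                i += size
                break
        else:
            # No sequence matched here, so pass the key through unchanged.
            output.append(keystrokes[i])
            i += 1
    return "".join(output)

print(apply_layout("n;e;n"))
```

A prototype like this makes it easy for members of the user community to try typing real words and report which key sequences feel awkward before the layout is built in a dedicated tool.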

Conclusion
* Getting support for a language on the computer can be a long process, especially for new complex scripts, but the payoff is significant.
* Patience and persistence are key.
* Avoid promising immediate access to a given language on the computer (unless all the characters are already encoded and available in widely used fonts).
* Raising funding to cover all parts of the process, from encoding to fonts, is still an issue: Balinese needs fonts, N’Ko needs rendering engine support.

Unicode website: Script Encoding Initiative: