
1 A Field Linguist’s Guide to Unicode
Deborah Anderson, Script Encoding Initiative (Universal Scripts Project), Dept. of Linguistics, UC Berkeley
LSA Panel: A Field Linguist’s Guide to Making Long-Lasting Texts and Databases
January 4, 2007

2 Working with Text Representation
“Use Unicode” (ISO/IEC 10646)

3 Working with Text Representation
“Use Unicode” (ISO/IEC 10646)
Practical issues to consider:
* which Unicode characters?
* what about fonts?
* how about keyboards?
* will the language be supported in off-the-shelf software?

4 Working with Text Representation
Goal today is to discuss the whole process of enabling a language to be used on a computer:
* identifying letters/symbols in Unicode
* fonts
* keyboards
* how to get support for the characters and scripts in software

5 Step 1. Identify the characters used in a language
* List all letters, symbols, digits, and marks of punctuation used in a language

6 One proposal for the Kazym Khanty alphabet

7 Step 1. Identify the characters used in a language
* List all letters, symbols, digits, and marks of punctuation used in a language
* Assign Unicode code points
http://www.tlg.uci.edu/quickbeta.pdf

8 Step 1. Identify the characters used in a language
* List all letters, symbols, digits, and marks of punctuation used in a language
* Assign Unicode code points
* Post a plain text version on a publicly accessible website
* Circulate this list for comment
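
The listing and code point assignment described above can be partly automated. The following is a minimal sketch, assuming Python and its standard `unicodedata` module; the function names and the sample string are hypothetical, and applying NFC normalization before counting is an assumption of this sketch, not a recommendation from the talk.

```python
import unicodedata
from collections import Counter


def character_inventory(text):
    """Return one row per distinct character: (code point, character, name, count)."""
    # Assumption of this sketch: normalize to NFC so that precomposed and
    # decomposed spellings of the same letter are counted together.
    counts = Counter(unicodedata.normalize("NFC", text))
    rows = []
    for ch in sorted(counts):
        if ch.isspace():
            continue  # whitespace is left out of the inventory in this sketch
        name = unicodedata.name(ch, "<no name in this Unicode version>")
        rows.append((f"U+{ord(ch):04X}", ch, name, counts[ch]))
    return rows


def format_inventory(rows):
    """Render the inventory as tab-separated plain text, ready to post for comment."""
    return "\n".join(f"{cp}\t{ch}\t{name}\t{n}" for cp, ch, name, n in rows)


if __name__ == "__main__":
    # Arbitrary sample characters, not real language data.
    sample = "ӆ ә ӈ a"
    print(format_inventory(character_inventory(sample)))
```

Running the script on a transcribed text sample gives a plain-text table that can be posted and circulated as is.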

9 Step 1. Identify the characters used in a language
Questions on which Unicode characters to use?
* Check the code charts on the Unicode website

10 Questions on which Unicode characters?
* Check the code charts on the Unicode website
* Check the names list and annotations

11 Step 1. Identify the characters used in a language
Questions on which Unicode characters?
* Check the code charts on the Unicode website
* Check the names list and annotations
* Not in the Unicode charts? See if it is on the “Pipeline” page on the website for new characters

12 http://www.unicode.org/alloc/Pipeline.html

13 Step 1. Identify the characters used in a language
Questions on which Unicode characters?
* Check the code charts on the Unicode website
* Check the names list and annotations
* Not in the Unicode charts? See if it is on the “Pipeline” page on the website for new characters
* Unsure? Ask on the Unicode email list
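
As a first pass before consulting the charts and the Pipeline page, a script can flag characters in a draft list that are not encoded in the Unicode version a given tool knows about. This is a rough sketch, assuming Python's standard `unicodedata` module; the function names are hypothetical, and the result depends on the Unicode version bundled with the Python build, so it does not replace checking the current charts.

```python
import unicodedata


def classify(ch):
    """Classify a character as 'encoded', 'private-use', or 'unassigned'
    according to the Unicode data bundled with this Python build."""
    category = unicodedata.category(ch)
    if category == "Co":
        return "private-use"  # Private Use Area: no standard meaning
    if category == "Cn":
        return "unassigned"   # not assigned in this version of the data
    return "encoded"


def needs_follow_up(chars):
    """Map code point labels to their status for characters that are not encoded."""
    flagged = {}
    for ch in chars:
        status = classify(ch)
        if status != "encoded":
            flagged[f"U+{ord(ch):04X}"] = status
    return flagged


if __name__ == "__main__":
    print("Unicode data version:", unicodedata.unidata_version)
    # 'a' is encoded; U+E000 is a private-use code point.
    print(needs_follow_up("a\uE000"))
```

Anything reported as private-use or unassigned is a candidate for the Pipeline check or a question to the email list.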

14 Step 1. Identify the characters used in a language
* Propose any missing characters for inclusion into the Unicode Standard

15 Step 1. Identify the characters used in a language
* Propose any missing characters for inclusion into the Unicode Standard
* TIP: Apply for funding to write a Unicode proposal or to conduct research

16 Step 1. Identify the characters used in a language
* Propose any missing characters for inclusion into the Unicode Standard
* TIP: Apply for funding to write a Unicode proposal or to conduct research
* TIP: Allow enough time for writing and review of proposal

17 Step 1. Identify the characters used in a language
* Propose any missing characters for inclusion into the Unicode Standard
* TIP: Apply for funding to write a proposal or to conduct research
* TIP: Allow enough time for writing and review of proposal
* Note: Once written, the proposal will take 2-5 years to get through standards bodies

18 Step 1. Identify the characters used in a language
* For languages without an orthography, consult Unicode Technical Note #19:
http://www.unicode.org/notes/tn19/

19 Step 1. Identify the characters used in a language
From Unicode Technical Note #19:
If at all possible, use an already encoded character, abiding by the following tips:
* If the script is right-to-left, select a character that is from a script that is right-to-left
* Avoid “presentation forms” or “letterlike characters”
* For a punctuation mark, select a character from the general punctuation block.
http://www.unicode.org/notes/tn19/
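
The first two tips can be approximated with character properties. The sketch below assumes Python's standard `unicodedata` module; `review_candidate` is a hypothetical helper, and treating any compatibility decomposition as a sign of a presentation form or letterlike character is a heuristic of this sketch (some letterlike symbols have no decomposition and would be missed). The punctuation tip is not covered.

```python
import unicodedata

RTL_CLASSES = {"R", "AL"}  # strong right-to-left bidirectional classes


def review_candidate(ch, script_is_rtl):
    """Return a list of warnings about using ch in a new orthography."""
    warnings = []
    bidi = unicodedata.bidirectional(ch)
    if script_is_rtl and bidi == "L":
        warnings.append("left-to-right character chosen for a right-to-left script")
    if not script_is_rtl and bidi in RTL_CLASSES:
        warnings.append("right-to-left character chosen for a left-to-right script")
    # Compatibility decompositions start with a tag such as <isolated> or <font>.
    if unicodedata.decomposition(ch).startswith("<"):
        warnings.append("compatibility character (possible presentation form or letterlike symbol)")
    return warnings


if __name__ == "__main__":
    candidates = [
        ("\u05D0", True),   # HEBREW LETTER ALEF for a right-to-left script
        ("\uFE8D", True),   # ARABIC LETTER ALEF ISOLATED FORM (a presentation form)
        ("\u2102", False),  # DOUBLE-STRUCK CAPITAL C (a letterlike symbol)
    ]
    for ch, rtl in candidates:
        print(f"U+{ord(ch):04X}", review_candidate(ch, rtl))
```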

20 Step 2: Send locale data to CLDR project
Locales: local conventions used to create software that is tailored to a specific language and location
* Currency ($, £, etc.)
* Time/date and number formats, measurement systems (e.g., number grouping in France: 902 300, Germany: 902.300, U.S.: 902,300)
* Sorting order
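
To make two of these conventions concrete, here is a small sketch in Python. The separator table is hand-written from the examples on the slide and the alphabet is the Spanish one, used only as an illustration; neither is read from CLDR, and all names are hypothetical. Real locale data may also specify a no-break space rather than a plain space for French.

```python
# Grouping separators taken from the slide's examples (not real CLDR data).
GROUPING = {"fr-FR": " ", "de-DE": ".", "en-US": ","}

# Illustrative alphabet order in which "ñ" sorts right after "n".
ALPHABET = "abcdefghijklmnñopqrstuvwxyz"
RANK = {ch: i for i, ch in enumerate(ALPHABET)}


def format_integer(n, locale_id):
    """Group the digits of n in threes with the locale's separator."""
    return f"{n:,}".replace(",", GROUPING[locale_id])


def sort_key(word):
    """Sort by position in ALPHABET; unknown characters sort last."""
    return [RANK.get(ch, len(ALPHABET)) for ch in word.lower()]


if __name__ == "__main__":
    for locale_id in GROUPING:
        print(locale_id, format_integer(902300, locale_id))
    words = ["oso", "ñu", "nube"]
    print(sorted(words))                # plain code point order
    print(sorted(words, key=sort_key))  # tailored order
```

Plain code point sorting places "ñ" after "z", which is why a tailored sort order is part of the locale data a language community needs to supply.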

21 Step 2: Send locale data to CLDR project
Common Locale Data Repository (CLDR): project hosted by Unicode that makes locale info freely available for software developers and others.
http://www.unicode.org/cldr/

22 Step 2: Send locale data to CLDR project

23 TIP: Involve a member of the user community to submit locale data

24 Step 3: Create a font
* Once a list of all the letters and symbols has been created with Unicode values, work can begin on a font
* If any characters are being proposed, wait until they are far along in the standards process
* Tip: Apply for funding to create a freely available font; costs can run $100/glyph

25 Step 3: Create a font
* Work with someone familiar with both the script and computer typography (especially for complex scripts)
* Use FontLab

26 Step 4: Rendering Engines for complex scripts need upgrade
* For new complex scripts (e.g., bidi issues, complex ligatures), upgrades to the rendering engine are often needed in order to properly draw the glyphs.
* Early contact with companies (Microsoft and Adobe), the Linux community, and SIL is advised so the rendering engine can support the script properly

27 Examples of Complex Scripts: N’Ko, Javanese

28 Step 4: Rendering Engines for complex scripts need upgrade
* SIL’s Graphite rendering engine offers a good test environment
* Generally Apple does not require upgrades to its rendering engine
* Microsoft prioritizes which scripts are included in its next rendering engine; governmental support is helpful in making a case to MS

29 Step 5: Create a Keyboard
There are a number of keyboard creation programs that are available, including:
* Keyman (for Windows)
* Microsoft Keyboard Layout Creator (“MKLC”)
* Ukelele (for the Mac)
* Keyboard Mapping for Linux

30 Step 5: Create a Keyboard
* Make the keyboard layout practical and have the user community test it out.
* Make the keyboard layout freely available online (such as on Tavultesoft’s website)
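
Before handing a layout to the user community, one simple check is whether every character in the inventory from Step 1 can be typed at all. The sketch below is in Python; the layout table is a hypothetical stand-in written as a plain dictionary, not the file format of Keyman, MKLC, or Ukelele.

```python
# Hypothetical layout: key sequence -> character it produces.
LAYOUT = {"a": "a", "n": "n", ";n": "ŋ", "e": "e"}


def untypable(inventory, layout):
    """Return the inventory characters that no key sequence in the layout produces."""
    produced = set("".join(layout.values()))
    return sorted(set(inventory) - produced)


if __name__ == "__main__":
    # Hypothetical inventory; any character listed in the output needs a key.
    print(untypable("anŋə", LAYOUT))
```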

31 Conclusion
* Getting support for a language on the computer can be a long process, especially for new complex scripts, but the payoff is significant.
* Patience and persistence are key.
* Avoid promising immediate access to a given language on the computer (unless all the characters are already encoded and available in widely used fonts).
* Raising funding to cover all parts of the process from encoding to fonts is still an issue: Balinese needs fonts, N’Ko needs rendering engine support.

32 Unicode website: http://www.unicode.org
Script Encoding Initiative: http://linguistics.berkeley.edu/sei

