Language and the Internet: Assessing Linguistic Bias. Measuring the Information Society, WSIS, Tunis, November 15, 2005. John C. Paolillo, Indiana University.

1 Language and the Internet: Assessing Linguistic Bias
Measuring the Information Society, WSIS, Tunis, November 15, 2005
John C. Paolillo, Indiana University

2 Overview
Sources of Linguistic Bias
Linguistic Bias: examples
  –Text Communication
  –Internet Host Names
  –Web Programming
Global Linguistic Diversity
  –Who bears the costs?
Conclusions

3 Sources of Linguistic Bias (Friedman and Nissenbaum 1997)
Pre-existing
  –originate from outside the technical system
    National, trans-national and institutional policies
    Technology companies
Technical
  –are built into the technical system itself
    Developers' language backgrounds, national origins
    Legacy standards, backward compatibility
Emergent
  –arise in specific contexts of use of a technical system
    Economics of technology industry (marketing, monopoly power, unstable markets, etc.)
    Rapid technologization

4 Text Communication
Requires an encoding and its support
  –Assign code numbers to script characters
    ASCII (American English)
    ISO-8859-1 (Western European languages)
    Unicode (most languages, but support is uneven)
  –Support means many things
    Fonts, rendering, sorting, spell-checking, etc.
Computer-Mediated Communication
  –Web pages, email, chat, etc.
  –Language use is not uniform in these modes
    Multilinguals tend to favor different languages for specific purposes
Represents both technical and emergent biases
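The encoding point on this slide can be illustrated with a minimal Python sketch. It checks which of the three encodings named above can represent a word at all. The sample words and the helper name `try_encode` are illustrative assumptions, not material from the presentation.

```python
def try_encode(text, encoding):
    """Return the encoded bytes, or None if the encoding cannot represent the text."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None

# Hypothetical sample words, one per language (chosen for illustration only).
samples = {
    "English": "language",
    "French": "français",
    "Hindi": "भाषा",
}
encodings = ("ascii", "iso-8859-1", "utf-8")

# coverage[language][encoding] is True when the word can be encoded.
coverage = {
    lang: {enc: try_encode(word, enc) is not None for enc in encodings}
    for lang, word in samples.items()
}

for lang, result in coverage.items():
    print(lang, result)
```

Being encodable is only the first layer of the "support" the slide lists; fonts, rendering, sorting and spell-checking are separate matters that this sketch does not test.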

5 Unicode Status: Examples
Language        Script    Pop.
Chinese         Chinese   1,240M
English         Roman     400M
French          Roman     81M
German          Roman     82M
Spanish         Roman     358M
Finnish         Roman     5M
Russian         Cyrillic  132M
Arabic          Arabic    247M
Hindi           Indic     213M
Sinhala         Indic     15M
S. Azerbaijani  Arabic    26M
Unicode: yes / no
Browser: good / good (late) / poor / none
Legend: Good support, Poor support, No support

6 Internet Host Names
The Domain Name System
  –Uses a 30-year-old 7-bit ASCII standard
    Now supports Punycode (an ASCII-compatible encoding of Unicode)
    Imposes a maximum name length
  –Run by ICANN under US Dept of Commerce contract
    More concerned with trademark protection
    Host/domain naming is widely abused (e.g. the .tv domain)
    Names provided by the DNS are not that useful
An example of emergent bias
  –Technical origin
  –Economic and political forces amplify and sustain it
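As a rough illustration of how a non-ASCII host name is fitted into the ASCII-only DNS, the sketch below uses Python's built-in `idna` codec, which implements the older IDNA 2003 rules on top of Punycode. The host name `bücher.example` and the helper names are assumptions for illustration. The 63-octet per-label limit comes from the DNS specification, not from the slide.

```python
MAX_LABEL_OCTETS = 63  # per-label limit in the DNS specification (RFC 1035)

def to_ascii_hostname(hostname):
    """Encode a Unicode host name into the ASCII form carried by the DNS."""
    return hostname.encode("idna").decode("ascii")

def label_lengths(ascii_hostname):
    """Map each dot-separated label to its length in octets."""
    return {label: len(label) for label in ascii_hostname.split(".")}

# Hypothetical host name, used only as an example.
ascii_name = to_ascii_hostname("bücher.example")
round_trip = ascii_name.encode("ascii").decode("idna")

print(ascii_name)
print(round_trip)
print(label_lengths(ascii_name))
```

The encoded label is longer than the original and carries an `xn--` prefix, so names in non-Roman scripts use up the length limit faster than ASCII names do.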

7 Web Programming and Unicode
Markup & web scripting languages
  –Unicode is standard
  –Browser support, fonts, etc. lag behind
  –Databases and development environments tend to lack proper Unicode support
  –End-user oriented, not programmer oriented
All of the most important technologies are Open-Source software (FLOSS)
  –User extensible/modifiable
  –Language localization of these is possible but rare

8 Linguistic Bias in Web Programming
English is the source language for most programming & markup languages
  –Keywords
  –Operator-argument order
  –Programming constructs, etc.
Programming as a linguistic act
  –Complex concepts are rendered into text
  –Different languages have different ways of doing this
Emergent language biases

9 Linguistic Properties of Programming
LISP
  –Predicates precede their arguments
    Like Arabic, Celtic, Hebrew, etc.
    (defun fact (x) (if (<= x 0) 1 (* x (fact (- x 1)))))
PostScript
  –Predicates follow their arguments
    Like Farsi, Hindi, Japanese, Tamil, Turkish, etc.
    /factorial { dup 1 gt { dup 1 sub factorial mul } if } def
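The contrast between the two orders can be sketched in Python with two small evaluators, one for predicate-first (LISP-like) token sequences and one for predicate-last (PostScript-like) sequences. The function names, the token format and the three-operator table are assumptions made for this illustration. Neither evaluator is a model of a real LISP or PostScript interpreter.

```python
import operator

# Binary operators supported by this sketch.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def eval_postfix(tokens):
    """Evaluate predicate-last tokens, e.g. '4 4 1 - *'."""
    stack = []
    for tok in tokens:
        if tok in OPS:
            right = stack.pop()
            left = stack.pop()
            stack.append(OPS[tok](left, right))
        else:
            stack.append(int(tok))
    return stack.pop()

def eval_prefix(tokens):
    """Evaluate predicate-first tokens, e.g. '* 4 - 4 1', scanning right to left."""
    stack = []
    for tok in reversed(tokens):
        if tok in OPS:
            left = stack.pop()
            right = stack.pop()
            stack.append(OPS[tok](left, right))
        else:
            stack.append(int(tok))
    return stack.pop()

# The same computation, 4 * (4 - 1), written in both orders.
prefix_result = eval_prefix("* 4 - 4 1".split())
postfix_result = eval_postfix("4 4 1 - *".split())
print(prefix_result, postfix_result)
```

Both calls compute the same value. Only the position of the operator relative to its arguments differs, which is the property the slide compares to word order in natural languages.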

10 The Linguistic Digital Divide
Language issues go beyond content
  –WSIS repeatedly re-affirms principles of
    Transparency
    Self-determination
    Open access to participation for all parties
These principles cannot be guaranteed unless speakers of different languages can manipulate all aspects of IT use in a way that is native-like
The linguistic divide has broader consequences
  –Costs are borne in
    Education: great for non-English-speaking people
    Technical development: small, in comparison
    (there is a trade-off)

11 Language Diversity Who bears the costs?

12 (source data: www.ethnologue.com)
A typical language group has around 10-50 thousand people.
80% of language groups have fewer than 100 thousand members.

13 (source data: www.ethnologue.com)
90% of the world's population belongs to a language group with at least 1 million people (416 groups).
Many languages with hundreds of millions of speakers lack adequate support.

14 (source data: www.ethnologue.com)

15 Conclusions
Linguistic Bias is manifest in many ways
  –Technical biases are sometimes overt
  –Emergent biases can be subtle
All potential sources of bias need to be examined and questioned if we are to uphold principles affirmed by WSIS
Without this effort, the linguistic digital divide will simply amplify existing disparities in wealth and power

17 Language Diversity On The Internet

18 Global Reach

19 Linguistic Diversity
Based on Entropy: $\text{Diversity} = -2 \sum_i p_i \ln p_i$
Diversity is the long-run per-individual average variance in language category (similar to log-likelihood)
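A minimal Python sketch of the entropy-based index follows. It assumes that p_i is the proportion of individuals in language category i and that the sum runs over all categories. The function name and the speaker counts are hypothetical and are not data from the presentation.

```python
import math

def linguistic_diversity(counts):
    """Compute -2 * sum(p_i * ln p_i) from per-language speaker counts."""
    total = sum(counts)
    # Categories with zero speakers contribute nothing to the sum.
    proportions = [c / total for c in counts if c > 0]
    return -2.0 * sum(p * math.log(p) for p in proportions)

# Hypothetical populations: one dominant language versus four equal ones.
skewed = linguistic_diversity([970, 10, 10, 10])
uniform = linguistic_diversity([250, 250, 250, 250])
single = linguistic_diversity([1000])

print(skewed, uniform, single)
```

Under these assumptions the index is zero when everyone shares one language and largest when speakers are spread evenly, so a population dominated by one language scores lower than an evenly divided one with the same number of languages.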

21 O'Neill, Lavoie and Bennett, 2003

22 www.isc.org/ds

24 www.isc.org/ds, ITU

26 ITU

27 www.isc.org/ds

29 www.isc.org/ds, UNPD

30 ITU, UNPD

31 ITU

32 www.isc.org/ds, ITU

