Published by Hayden Pearson
Nurturing Living Languages © C-DAC
WELCOME
Mahesh D. Kulkarni, Group Coordinator, C-DAC GIST
4th August 2006
Venue: Hotel Radisson, Noida

Indian Language Domain Name Registration: Issues and Solutions
Background
Social and economic growth is catalyzed by the presence of the Internet.
Development of the Internet has been mainly in English.
Domain names use only the 26 unaccented Latin letters, the 10 digits (0-9), the hyphen and the dot.
For the proliferation and preservation of heritage and culture, and for content creation in multiple languages, it is essential to have domain names in multilingual scripts.
Background
User enters IDN: www. (non-ASCII characters)
Application (such as a browser) converts it to ASCII Compatible Encoding (ACE)
Registry entry: xn3b7vcv67.com (ASCII characters)
xn--e2br9czb xn--m1be
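The conversion step on this slide can be tried out with a minimal sketch. It assumes Python's built-in "idna" codec (which implements the IDNA 2003 rules, i.e. NamePrep followed by Punycode) is an acceptable stand-in for what a browser does. The function names and the sample domain are illustrative only and do not come from the presentation.

```python
# Sketch only: relies on Python's built-in "idna" codec (IDNA 2003).

def to_ace(domain: str) -> str:
    """Convert a Unicode domain name to its ASCII Compatible Encoding."""
    return domain.encode("idna").decode("ascii")


def from_ace(domain: str) -> str:
    """Convert an ACE domain name back to its Unicode form."""
    return domain.encode("ascii").decode("idna")


if __name__ == "__main__":
    # Hypothetical example: the Devanagari label "भारत" under .com
    name = "\u092d\u093e\u0930\u0924.com"
    ace = to_ace(name)
    print(ace)            # ASCII only; the first label carries the "xn--" prefix
    print(from_ace(ace))  # round-trips to the original Unicode name
```

The registry only ever stores the ASCII form; the Unicode form is reconstructed by the application for display.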
Overview
India has one of the largest linguistic diversities in the world: 4 major language families, at least 35 different languages and around 2000 dialects.
Languages belong to the Indo-Aryan (ca. 74%), Dravidian (ca. 24%), Austro-Asiatic (Munda) (ca. 1.2%) or Tibeto-Burman (ca. 0.6%) families. Some of the languages of the Himalayas are still unclassified.
India has 22 scheduled languages, and English continues to be the associate additional official language.
The following scripts will be most needed: Assamese, Bangla, Devanagari, Gujarati, Gurmukhi, Kannada, Malayalam, Oriya, Tamil, Telugu, Urdu.
One script :: many languages
Devanagari: Hindi, Marathi, Konkani, Rajasthani, Sindhi, Nepali, Dogri, Santhali, etc.
Thus the Devanagari code page can support all languages using that script.
Solution: Though the contents would reveal the language used, it would be ideal if a special attribute code indicating the language were inserted.

One language :: many scripts
Konkani is written in Roman, Devanagari, Malayalam and Kannada.
Sindhi is written in Gurmukhi (Punjabi), Arabi (Perso-Arabic), Devanagari, Gujarati and also Roman.
Sindhi has adopted the Perso-Arabic script for representing the language. In the case of Konkani, Devanagari is used as the official script. Hence it is proposed that the same formula be used when attributing IDNs.
However, nothing stops a client from wanting his IDN in all the scripts. This can be catered for efficiently by providing a broad-based transliteration facility which would transliterate a name from one Indian script to another. Thus a Konkani domain name in Devanagari could be transliterated into Kannada, Malayalam and Roman.
Solution: The best solution is by way of linguistic or political consensus.

The solution: A tool for transliteration from one Indian script to another can be easily deployed. The transliterated data could be presented to the client, who could verify the transliteration. If it meets his approval, the IDN could be registered in all possible scripts.
Alternate mechanism
ACE, i.e. ASCII Compatible Encoding, is intimately tied to NamePrep (RFC 3491) and PunyCode (RFC 3492), as well as to StringPrep (RFC 3454). NamePrep prepares an IDN string, which is then passed to PunyCode and stored in ACE form as 7-bit ASCII data.
We would like to make a case for the use of ISCII-91 as a parallel code for Brahmi-based scripts. ISCII deploys the same encoding for all Brahmi-based scripts. The advantage is obvious: storage in ISCII will allow an IDN to be transliterated on the fly into any Indic script. This ensures, at the PunyCode level itself, that a name allotted in one script is also automatically allotted in another script to the same owner. It thereby does away with name squatting across Indic scripts, which would otherwise be a regular feature of IDN allocation in Indic scripts.
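The idea of one shared encoding for all Brahmi-based scripts can be illustrated with a rough sketch. It does not use ISCII itself. Instead it assumes that the parallel layout of the Unicode Indic blocks (which was derived from ISCII) is close enough to show the principle, so a character is moved to another script by shifting its offset within the block. The table, the function name and the sample word are illustrative assumptions, and a real transliterator would need per-script exception rules.

```python
import unicodedata

# Start of each Unicode Indic block; the blocks share an ISCII-derived layout.
BLOCK_START = {
    "devanagari": 0x0900,
    "bengali": 0x0980,
    "gurmukhi": 0x0A00,
    "gujarati": 0x0A80,
    "oriya": 0x0B00,
    "tamil": 0x0B80,
    "telugu": 0x0C00,
    "kannada": 0x0C80,
    "malayalam": 0x0D00,
}
BLOCK_SIZE = 0x80


def transliterate(text: str, source: str, target: str) -> str:
    """Naive script conversion by keeping the offset inside the block."""
    src, dst = BLOCK_START[source], BLOCK_START[target]
    out = []
    for ch in text:
        cp = ord(ch)
        if src <= cp < src + BLOCK_SIZE:
            candidate = chr(cp - src + dst)
            # Keep the original character if the target slot is unassigned.
            out.append(candidate if unicodedata.name(candidate, "") else ch)
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    word = "\u0915\u092e\u0932"  # "कमल" in Devanagari (hypothetical sample)
    for script in ("kannada", "malayalam", "gujarati"):
        print(script, transliterate(word, "devanagari", script))
```

With a script-neutral stored form, registering a name once could reserve the corresponding string in every script, which is the anti-squatting effect the slide argues for.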
1. IDN AND THE PROBLEM OF ALLOTTING NAMES
The IDN server which will attribute the domain names is to be automated, and hence it is of vital interest that a mechanism of checks and counter-checks be set up to ensure the highest level of security. Two major issues are at stake. They are mainly specific to Indian scripts and the complex nature of their visual rendering.

PROBLEM 1: DOUBLETS
The first is the need to ensure that doublets are avoided. Doublets are IDNs which are nearly alike, either as homophones or as close homographs. Thus spelling Maharashtra in three different ways can lead to identity confusion. Since all three spellings are different, the server would attribute all the names as valid IDNs, whereas the original client would not want his IDN to be misused.

PROBLEM 2: SECURITY ISSUES
More serious is the willful use of such tactics to perpetrate fraud by misleading a user into believing that he has logged on to a bona fide site, and thus persuading the user to divulge information such as his credit card number.

UNDERLYING THESE PROBLEMS AND ISSUES ARE THREE MAJOR POTENTIAL SECURITY HOLES
HOMOPHONES AND HOMOGRAPHS
SPELLING VARIANTS
SPELLING ERRORS
Each of these will be studied in relation to its pertinence to ensuring maximal security.
Homophones and Homographs
These are aural and visual look-alikes and, given the phonetic nature of Indian scripts, are a potential source of confusion. A typology of these has been established:
VISUAL LOOK-ALIKES
AURAL LOOK-ALIKES

Visual Look-Alikes-1
TWO LIGATURES HAVING PRACTICALLY THE SAME FORM
Devanagari: The first ligature is a half da + full dha; the second is a half dha followed by a full da. To an average reader of Hindi, the two forms look practically alike and lead to confusion.
A similar situation arises in the case of Gujarati: The first is ka + la. The second is ka + halanta + la.
Visual Look-Alikes-2
AMBIGUITIES ARISING OUT OF POSSIBLE UNICODE VARIANTS
This can best be seen in the case of nukta characters. These can be generated in two different ways. In each pair, the first is a single precomposed character, whereas the second is made up of two characters: the consonant followed by the dot or nukta character. To the naked eye the two look alike, whereas for the machine these would be two different IDNs.
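The nukta ambiguity can be demonstrated with a small sketch. It assumes that Unicode canonical normalization (NFC, via Python's standard unicodedata module) is applied to a label before it is compared with existing registrations. The helper name and the choice of the letter QA as the sample pair are illustrative.

```python
import unicodedata


def canonical(label: str) -> str:
    """Return the NFC-normalized form of a label for comparison."""
    return unicodedata.normalize("NFC", label)


precomposed = "\u0958"       # DEVANAGARI LETTER QA as one code point
decomposed = "\u0915\u093c"  # DEVANAGARI LETTER KA + DEVANAGARI SIGN NUKTA

if __name__ == "__main__":
    # The raw strings differ for the machine ...
    print(precomposed == decomposed)                        # False
    # ... but compare equal once both are normalized.
    print(canonical(precomposed) == canonical(decomposed))  # True
```

Comparing only normalized forms means the two encodings of the same visible letter cannot be registered as two different names.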
Visual Look-Alikes-3
SIMILAR LOOKING CHARACTERS WITHIN THE SAME CODE PAGE
Within a code page, two characters can look practically alike and create ambiguity. This is especially the case when the font enabled on the client machine is not of high quality; given the size of the characters (normally 10 point), this can lead to confusion. Some examples are given below:
Devanagari

Visual Look-Alikes-4
IDENTICAL CHARACTERS IN UNICODE
This is the case of the Urdu and Sindhi glyph. Character U+06A9 is the letter /keheh/ in Urdu, whereas the same symbol in Sindhi represents /kheheh/. Since both fall within the same code page, disambiguation is impossible without recourse to the language used.

Aural Look-Alikes: Homophones
Indian languages being phonetic in nature, aural representation is a major issue. The problems mainly arise from the fact that Indian languages are generally typed as they are spoken. Very often they arise from spelling variants and/or from the user's ignorance of the correct spelling of the word. A large number of sub-types of problems can emerge from such homophonic representations.
Aural Look-Alikes: Homophones-1
Confusion between the two nasal modifiers (wherever such nasal modifiers exist): Hindi, Gujarati
Confusion between two or more similar sounding consonants (normally dental vs. retroflex, sibilants and laterals): Marathi, Gujarati
Confusion arising out of short and long vowels: Tamil, Gujarati, Hindi

Aural Look-Alikes: Homophones-2
Absence or presence of a halanta. This is a source of errors even among educated speakers of the language. Proper names tend to be written at times with and at times without the halanta. Thus the name Shirke in Marathi can be written in two ways, of which the first is correct and the second is not normatively valid but could be accepted.
Confusion arising out of the use of rakar + u matra instead of the vowel form.

Aural Look-Alikes: Homophones-3
A remote source of error would be the use of the Visarga or vowel lengthener to modify an IDN. The Visarga is mainly used in Sanskrit and very rarely in neo-Indo-Aryan languages. However, an IDN with or without the Visarga could create ambiguity.
Aural Look-Alikes: Homophones-4
Insertion of a zero-width character (ZWJ/ZWNJ) within the name string: The first has no non-joiner; the second has a non-joiner. Visually both look alike and can lead to confusion.
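A registry-side check for this case could look like the sketch below. It assumes a policy in which labels are compared after removing ZWJ (U+200D) and ZWNJ (U+200C). That policy is an assumption made for illustration, since in some scripts these characters change the rendered form and a registry might prefer to reject such labels instead. The function names are hypothetical.

```python
ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER
ZWJ = "\u200d"   # ZERO WIDTH JOINER


def has_joiners(label: str) -> bool:
    """Report whether the label contains an invisible joiner character."""
    return ZWNJ in label or ZWJ in label


def strip_joiners(label: str) -> str:
    """Remove ZWJ/ZWNJ so that look-alike labels compare equal."""
    return label.replace(ZWNJ, "").replace(ZWJ, "")


if __name__ == "__main__":
    plain = "\u0915\u094d\u0937"            # ka + halanta + ssa
    with_zwnj = "\u0915\u094d\u200c\u0937"  # same, with a ZWNJ inserted
    print(plain == with_zwnj)                 # False
    print(plain == strip_joiners(with_zwnj))  # True
```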
SUB-TYPE II: SPELLING VARIANTS
This is best seen in the case of Hindi, where a nasal modifier can substitute for the corresponding half nasal consonant. The word Hindi itself can be written in either way. Obviously two IDNs based on these spelling variants should not be allowed; they must be resolved to the same norm.
A similar situation exists in Marathi in the use of the anusvara (timba) vs. the /e/ vowel modifier. The first is used in colloquial Marathi under special environments, whereas the second is the literary form. A filter which normalizes the two would have to be written.
Other languages and scripts display similar patterns.

More examples
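The Hindi case described above (nasal modifier vs. half nasal consonant) lends itself to a small normalizing filter. The sketch below is a simplification and not the filter the presenters built. It assumes the norm is the anusvara spelling and rewrites a nasal consonant + halanta into an anusvara only when the next consonant belongs to the same varga (class). The table and function name are illustrative.

```python
ANUSVARA = "\u0902"
HALANTA = "\u094d"

# Class nasal -> the stops of its own varga, before which it may be
# written as an anusvara.
VARGA = {
    "\u0919": "\u0915\u0916\u0917\u0918",  # nga before ka, kha, ga, gha
    "\u091e": "\u091a\u091b\u091c\u091d",  # nya before ca, cha, ja, jha
    "\u0923": "\u091f\u0920\u0921\u0922",  # nna before tta, ttha, dda, ddha
    "\u0928": "\u0924\u0925\u0926\u0927",  # na before ta, tha, da, dha
    "\u092e": "\u092a\u092b\u092c\u092d",  # ma before pa, pha, ba, bha
}


def normalize_nasal(label: str) -> str:
    """Rewrite half-nasal + same-class consonant as anusvara + consonant."""
    out = []
    i = 0
    while i < len(label):
        ch = label[i]
        if (
            ch in VARGA
            and i + 2 < len(label)
            and label[i + 1] == HALANTA
            and label[i + 2] in VARGA[ch]
        ):
            out.append(ANUSVARA)
            i += 2  # skip the nasal consonant and the halanta
        else:
            out.append(ch)
            i += 1
    return "".join(out)


if __name__ == "__main__":
    half_nasal = "\u0939\u093f\u0928\u094d\u0926\u0940"  # "हिन्दी"
    anusvara = "\u0939\u093f\u0902\u0926\u0940"          # "हिंदी"
    print(normalize_nasal(half_nasal) == anusvara)       # True
```

Both spellings then resolve to the same stored form, so only one of them can be registered as a distinct name.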
SUB-TYPE III: SPELLING ERRORS
These, whether conscious or unconscious, could create homographic doublets and need to be detected in order to ensure that the client does not have a spurious IDN competing with his real IDN. Misspellings of words and inversions can all lead to IDN doublets. A good example is words in Hindi which have Urdu roots and which admit spellings without halanta (Urdu norm) and with halanta (Hindi aural norm).
2. PROPOSED RECOMMENDATIONS

Proposed Recommendations
An action plan has been proposed for ensuring maximum security in the allotment of IDNs in Indian scripts. It takes the shape of recommendations arising out of discussions. The recommendations are both specific and generic in nature.

Proposed Recommendations: GENERIC STRATEGIES-1
Creation of Levels: Four levels are provided:
Level 1: Highest security
Level 2: Government bodies and institutions (banks, insurance, healthcare, etc.)
Level 3: Corporates and NGOs
Level 4: All other users

Proposed Recommendations: GENERIC STRATEGIES-2
The implementation should be tested in TESTBED mode and IDNs should be allotted in a phased manner:
Level 1 (highest security) and Level 2 (government bodies and institutions) should be permitted to register in the test bed mode. This will also have the advantage of automatically blocking all attempts by spoofers and hackers to squat on such names. Levels 1 and 2 should be automatically denied to ordinary users.
At this stage the automated software for providing variants based on visual and homophonic identities should be set in place.
Proposed Recommendations: GENERIC STRATEGIES-2 (continued)
Subsequently Level 3, i.e. corporates and NGOs, should be allowed to register. The software, which will generate all possible variants of their names as per the rules of the language, can be proposed to them. If they so desire, they can register all these variants or leave them open, after being overtly warned that leaving them open could lead to spoofing.
Level 4 can be integrated at the end.
Phased allotment of IDNs will to a large extent eradicate spoofing and phishing and ensure maximal security.
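How such variant-generating software might work can be sketched as follows. This is only an illustration under stated assumptions: the confusion classes are a tiny hypothetical sample built from the look-alike types listed earlier (short vs. long vowel signs, similar-sounding sibilants), whereas the real lists were to be compiled per language by expert committees. The names CONFUSION_CLASSES and variants are invented for this example.

```python
from itertools import product

# Hypothetical, deliberately tiny sample of confusable character classes.
CONFUSION_CLASSES = [
    "\u093f\u0940",  # Devanagari vowel signs i / ii (short vs. long)
    "\u0941\u0942",  # Devanagari vowel signs u / uu (short vs. long)
    "\u0936\u0937",  # Devanagari sha / ssa (similar-sounding sibilants)
]

# Map each character to the full class it belongs to.
ALTERNATIVES = {ch: cls for cls in CONFUSION_CLASSES for ch in cls}


def variants(label: str) -> set:
    """Return every string obtained by swapping confusable characters."""
    choices = [ALTERNATIVES.get(ch, ch) for ch in label]
    return {"".join(combo) for combo in product(*choices)}


if __name__ == "__main__":
    for v in sorted(variants("\u0926\u093f\u0928")):  # "दिन"
        print(v)
```

The number of variants grows multiplicatively with each confusable character in the name, which is one reason the slide leaves it to the registrant to decide which variants to register.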
Proposed Recommendations: SPECIFIC ISSUES
1. Two scripts (code pages) should not be mixed.
2. As far as possible, numbers (digits) should not be used unless they acquire a linguistic value, such as 365, 24/7, etc. Domain names are not like mail applications, where you can have the name followed by a digit.
3. Punctuation marks should be avoided as far as possible. These can also result in confusion, as in the case of the eyelash repha in Marathi.
4. Although under ideal circumstances correct spelling would be the norm, the first instance of a name registered, even if it is incorrect, would be deemed registered. All further variants generated by the software, including the correct one, would be reserved or permitted as per the wish of the sanctioning authority.

Proposed Recommendations: SPECIFIC ISSUES-2
5. The whole process is to be automated by means of software which will ensure to the highest degree that the security holes are not breached. Given that there would be a large number of applications, and that manual processing would either not be possible or would result in inordinate delays, automation is a prerequisite.
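Recommendation 1 (no mixing of scripts) is the kind of rule that can be automated. The sketch below is a rough heuristic, not a definitive implementation. Python's standard library does not expose the Unicode Script property, so it assumes that the first word of each character's Unicode name (for example DEVANAGARI or LATIN) is a good enough proxy for the script. Digits and hyphens are ignored. The function names are illustrative.

```python
import unicodedata


def scripts_used(label: str) -> set:
    """Collect a rough script tag for every letter or mark in the label."""
    found = set()
    for ch in label:
        # Only letters (L*) and combining marks (M*) carry a script.
        if unicodedata.category(ch)[0] in ("L", "M"):
            found.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return found


def is_single_script(label: str) -> bool:
    """True when all letters and marks come from one script."""
    return len(scripts_used(label)) <= 1


if __name__ == "__main__":
    print(is_single_script("\u092d\u093e\u0930\u0924"))  # Devanagari only
    print(is_single_script("\u092d\u093erat"))           # Devanagari + Latin
```

A registration front end could reject any label for which is_single_script returns False before the variant checks are run.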
Action Plan-1
Identification of potential zones: Potential zones for ensuring security were identified. These are:
Creation of variant lists
List of potential spelling variants
List of potential zones of error in terms of misspellings which are not trapped by the variants list

Explanatory documents and templates for each of the desired data sets were provided by C-DAC GIST to those concerned. The templates gave examples for each type of requirement, as in the sample template below:
Report-1
C-DAC, Pune has been entrusted with the creation of data for three languages: Hindi, Marathi and Urdu.
As per the agreement, expert committees have been appointed for all three languages. The experts are professors and specialists working in the publishing industry, since they have the linguistic skills and know-how to investigate and create the required data.
A translation of the three-letter extensions of the names has also been provided. To ensure across-the-board intelligibility, this is in Sanskrit.
In the slides that follow, samples of the quantum of work accomplished in each of the languages are detailed.

Report-2
Translation of IDN extensions, a sample:
1) EDU
2) GOV
3) IN
4) COM
5) ORG
6) MIL
7) RES
8) AC
9) TRAVEL
10) MOBI
11) NET
12) INT
13) MED
14) AGRI
Report-1: Marathi
In the case of Marathi, a committee headed by Shri Phadake, who has books on shuddha-lekhan to his credit, has been appointed. Work has commenced in all three areas:
Variants list
Spelling variants
Erroneous spellings
A large number of rules have been generated, along with data on spelling variants and misspellings.

Report-1: Marathi: Sample image of variants list

Report-1: Marathi: Sample image of multiple spellings and misspellings
Report-2: Hindi
A similar exercise has been carried out for Hindi. Sample files are provided below. Over 100 different rule variants have been identified.

Report-2: Hindi
Spelling variants and misspellings for Hindi: over 300 collected at present.

Report-3: Urdu
Under the able guidance of Prof. Yunus Fahmi, spelling variants, misspellings and variant lists are being created. Some sample files for the variant list and spelling variants are appended.

Report-3: Urdu
Urdu spelling variants (over 280 in number)

Report-3: Urdu
Urdu spelling variants in PASCII (over 280 in number)
List of Official languages of India
Language | Official language of | Family | Script
Assamese | Assam | Indo-Aryan | Bangla (modified)
Bengali | Tripura and West Bengal | Indo-Aryan | Bangla
Bodo | Assam | Tibeto-Burman | Devanagari, Bangla (modified)
Dogri | Jammu and Kashmir | Indo-Aryan | Devanagari, Perso-Arabic
Gujarati | Dadra and Nagar Haveli, Daman and Diu, and Gujarat | Indo-Aryan | Gujarati
Hindi | Andaman and Nicobar Islands, Bihar, Chandigarh, Chhattisgarh, Delhi, Haryana, Himachal Pradesh, Jharkhand, Madhya Pradesh, Rajasthan, Uttar Pradesh and Uttaranchal | Indo-Aryan | Devanagari
Kannada | Karnataka | Dravidian | Kannada
Kashmiri | Kashmir | Indo-Aryan | Perso-Arabic, Devanagari
Konkani | Goa | Indo-Aryan | Devanagari, Roman, Malayalam, Kannada
Maithili | Bihar | Indo-Aryan | Devanagari
Malayalam | Kerala and Lakshadweep | Dravidian | Malayalam
Manipuri | Manipur | Tibeto-Burman | Bangla, Meetei-Mayek
Marathi | Maharashtra | Indo-Aryan | Devanagari
Nepali | Sikkim | Indo-Aryan | Devanagari
Oriya | Orissa | Indo-Aryan | Oriya
Punjabi | Punjab | Indo-Aryan | Gurumukhi, Shahmukhi
Sanskrit | | Indo-Aryan | Devanagari
Santali | | Munda | Devanagari, Ol Chiki
Sindhi | | Indo-Aryan | Perso-Arabic, Devanagari, Gujarati, Roman
Tamil | Tamil Nadu and Pondicherry | Dravidian | Tamil
Telugu | Andhra Pradesh | Dravidian | Telugu
Urdu | Jammu and Kashmir | Indo-Aryan | Perso-Arabic
THANK YOU