Collation Sequences and Related Issues for Pakistani Languages


Collation Sequences and Related Issues for Pakistani Languages (پاکستانی زبانوں کی ترتیب تہجی اور متعلقہ مسائل). Sarmad Hussain (سرمد حسین), Center for Research in Urdu Language Processing, National University of Computer and Emerging Sciences.

Purpose of Presentation: briefly discuss character sets; discuss the Urdu collating sequence; propose a possible Urdu collation sequence; give an overview of collation for the other languages of Pakistan.

Urdu (اردو) alphabet chart, in chart order (columns top to bottom, right to left): ا آ ب بھ پ پھ ت تھ ٹ ٹھ ث ج جھ چ چھ ح خ د دھ ڈ ڈھ ذ ر رھ ڑ ڑھ ز ژ س ش ص ض ط ظ ع غ ف ق ک کھ گ گھ ل م مھ ں ںھ ن نھ و وھ ہ ة ء ی ے

Balochi (بلوچی) alphabet chart, in chart order (columns top to bottom, right to left): ا آ ب پ ت ٹ ث ج چ ح خ د ڈ ذ ر ڑ ز ژ س ش ص ض ط ظ ع غ ف ق ک گ ل م ن و ہ ء ی ے ۓ

Pashto (پشتو) alphabet chart, in chart order (columns top to bottom, right to left): ا ب پ ت ټ ث ج ځ چ څ ح خ د ډ ذ ر ړ ز ژ ږ س ش ښ ص ض ط ظ ع غ ف ق ک ګ ل م ن ڼ و ہ ي ې ۍ ٸ ے

Punjabi (پنجابی) alphabet chart, in chart order (columns top to bottom, right to left): ا ب بھ پھ ت تھ ٹ ٹھ ث ج جھ چ چھ ح خ د دھ ڈ ڈھ ذ ر رھ ڑ ڑھ ز ژ س ش ص ض ط ظ ع غ ف ک کھ ق گ گھ ل لھ م مھ نھ ڼ و ہ ء ی ے

Sindhi (سندھی) alphabet chart, in chart order (columns top to bottom, right to left): ا آ ب ٻ ڀ ت ٿ ٽ ٺ ث پ ج ڄ جھ ڃ چ ڇ ح خ د ڌ ڏ ڊ ڍ ذ ر ڙ ڙھ ز س ش ص ض ط ظ ع غ ف ڦ ق ڪ ک گ ڳ گھ ڱ ل لھ م مھ ن نھ ڻ ڻھ و ھ ہ ء ي

Sources. Urdu: Akhbar-e-Urdu (Special Supplement on Urdu Software; Jan-Feb. 2002), National Language Authority, Islamabad. Balochi: fax communication (Sept. 2002), Balochi Academy, Quetta. Pashto: fax communication (Sept. 2002), Pashto Academy, Peshawar. Punjabi: Punjabi Qaida (Experimental), Punjabi Adabi Board, Lahore. Sindhi: Sindhi Boli (July-Dec. 2001) and SLA Letter Circulation of Sindhi Collation (June 2002), Sindhi Language Authority, Hyderabad.

Urdu (اردو): آ ا ب پ ت ٹ ث ج چ ح خ د ڈ ذ ر ڑ ز ژ س ش ص ض ط ظ ع غ ف ق ک گ ل م ن و ہ ء ی ے (source: Urdu Qaida, Ferozsons, Lahore)

Urdu Alphabet: State of Affairs. Are the following letters of Urdu? آ أ ٶ بھ پھ تھ ... ں ة لھ مھ نھ ںھ وھ. If yes, where are they placed in the alphabet?

Sources: data from eight dictionaries of Urdu. Feroz-ul-Lughat Jame (فیروزاللغات جامع), Ferozsons, Lahore (FLJ); Standard Twentieth Century Dictionary: Urdu to English, Educational Publishing House, New Delhi, India (STCD); Farhang-e-Talaffuz (فرہنگِ تلفظ), National Language Authority, Islamabad (FT); Jadeed Urdu Lughat (جدید اردو لغت), National Language Authority, Islamabad (JUL); Urdu Lughat (اردو لغت), Urdu Dictionary Board, Karachi (UL); A Dictionary of Urdu, Classical Hindi and English, Crosby Lockwood and Son, London (1911) (UHE); Farhang-e-Asifiya (فرہنگ آصفیہ), Delhi (1918) (FA); Noor-ul-Lughat (نوراللغات), Sang-e-Meel, Lahore (NL).

Urdu Alphabet: State of Affairs. FT, JUL, UL: ا آ ب بھ پ پھ ت تھ ٹ ٹھ ث ج جھ چ چھ ح خ د دھ ڈ ڈھ ذ ر رھ ڑ ڑھ ز ژ س ش ص ض ط ظ ع غ ف ق ک کھ گ گھ ل لھ م مھ ں ںھ ن نھ و ہ ء ی ے. FLJ, NL: آ ا ب پ ت ٹ ث ج چ ح خ د ڈ ذ ر ڑ ز ژ س ش ص ض ط ظ ع غ ف ق ک گ ل م ں ن و ہ ھ ء ی ے. UHE, FA, STCD: ا ب پ ت ٹ ث ج چ ح خ د ڈ ذ ر ڑ ز ژ س ش ص ض ط ظ ع غ ف ق ک گ ل م ن و ہ

Conclusions: Urdu Character Set. There is no general agreement on the Urdu character set among dictionary publishers. The standard character set defined by the National Language Authority and the Urdu Dictionary Board is not traditional, not well publicized, and not completely adopted. The Government of Pakistan (GoP) standard for computing, UZT 1.01, implements the NLA-defined character and symbol set. UZT 1.01 will soon be fully represented in Unicode/ISO/IEC 10646.

Character Set: alphabet; harakat (aerab); other symbols.

“Familiar” Harakaat (Aerab): do zabar دً; do zer دٍ; do pesh دٌ; tashdeed دّ; noon ghunna ن; jazm دْ; zabar دَ; zer دِ; pesh دُ; khari zabar دٰ; khari zer دٖ; ulta pesh دٗ.

“Common” Other Symbols. Numbers: 0 ۰, 1 ١, 2 ٢, 3 ٣, 4, 5 ۵, 6 ٦, 7, 8 ٨, 9 ٩. Punctuation: ؟ ؛ ٬ -. Honorifics. Other symbols: ס

Current GoP Standard: UZT 1.01

Logical Sections of UZT 1.01. Alphabet (80 – 122). Aerab/diacritics/harakat (66 – 79, 123 – 126). Other characters: punctuation and arithmetic symbols (32 – 47, 58 – 65); digits (48 – 57); special symbols (160 – 176, 192 – 199). Miscellaneous: control characters (0 – 31, 127); reserved control space (128 – 159, 255); reserved expansion space (177 – 191, 200 – 207, 240 – 253); vendor area (208 – 239); toggle character (254).
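The section layout above can be expressed as a small lookup table. The following is a minimal sketch, assuming the numbers on the slide are decimal one-byte codes with inclusive ranges; the names `UZT_SECTIONS` and `uzt_section` are hypothetical and not part of the UZT 1.01 standard.

```python
# Code ranges as listed on the slide (assumed decimal, inclusive).
UZT_SECTIONS = {
    "alphabet": [(80, 122)],
    "aerab": [(66, 79), (123, 126)],
    "punctuation_arithmetic": [(32, 47), (58, 65)],
    "digits": [(48, 57)],
    "special_symbols": [(160, 176), (192, 199)],
    "control": [(0, 31), (127, 127)],
    "reserved_control": [(128, 159), (255, 255)],
    "reserved_expansion": [(177, 191), (200, 207), (240, 253)],
    "vendor": [(208, 239)],
    "toggle": [(254, 254)],
}


def uzt_section(code: int) -> str:
    """Return the logical section name for a one-byte UZT 1.01 code."""
    if not 0 <= code <= 255:
        raise ValueError("expected a one-byte code in the range 0-255")
    for name, ranges in UZT_SECTIONS.items():
        if any(low <= code <= high for low, high in ranges):
            return name
    raise LookupError(f"code {code} is not assigned to any section")


print(uzt_section(100))
print(uzt_section(254))
```

Read this way, the listed ranges cover every value from 0 to 255 exactly once, which is a quick consistency check on the slide.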

Urdu Collation Sequence: how do the following figure in? Basic letters; other letters; basic aerab; other aerab; others. Arguments should be consistent and simple.

Character vs. Phoneme. Character = written content = letters; phoneme = linguistic content. In the word “phone”: 5 characters (p h o n e), 3 phonemes (f o n).

Urdu Collating Sequence: Letters. What is the status and sequence of the following characters? ا آ أ ٶ ن ں ہ ھ ة ہ ت ی ے

ا آ variation. آ = ا ا, so آب = ا ا ب and آپ = ا ا پ. FLJ: آب، آپ، اب، ایوان. FT, JUL, UL: اب، ایوان، آب، آپ. STCD, UHE, FA, NL (under ا): آب، آپ، اب، ایوان. آ is a stylistic variation of ا ا; it adds a character to a single alif; it is not a character in the pure sense.

أ ٶ status. Not a character in ANY dictionary, including the dictionaries by the National Language Authority and the Urdu Dictionary Board. They have the same bearing on collation sequences as ء ا and ء و. Included in UZT 1.01 as per the terms of reference given by NLA. Each may be formed by a combination of ء followed by ا or و. They should be taken out of UZT 1.01 in its next version.

ن ں variation. FLJ, FT, STCD, NL, FA, UHE: ماں then مان. JUL, UL: مان then ماں. ں is a vowel modifier that nasalizes the vowel but DOES NOT add any “phonemic content”: it is not a phoneme; it is a character; it does not represent any other character or combination; it is written adjacent to ن. Lighter goes up, so ں would come before ن. ما = C V; ماں = C V; مان = C V C.

ہ ھ variation. FLJ, UHE, FA, NL (بھ not a character; ہ then ھ): باپ، بھابی، بہن، بہنگی، بھنگی، بیٹا. STCD (بھ not a character; ھ then ہ): باپ، بھابی، بہن، بھنگی، بہنگی، بیٹا. FT, JUL, UL (بھ a character): باپ، بہن، بہنگی، بیٹا، بھابی، بھنگی.

ہ ھ variation. ب = C; بھ = C; بہ = C V C. Just as ں is a vowel modifier, ھ is a consonant modifier and DOES NOT add any “phonemic content”. As with ں, ھ is not a phoneme and is written adjacent to ہ. Lighter goes up, so ھ would come before ہ.

بھ، پھ، ... status as “character”. The Urdu Dictionary Board and the National Language Authority assert that these are phonemes and that the character combination should therefore be made a character. If character combinations that are phonemes are to be promoted to characters, then for consistency the following combinations should also be made characters: یں، وں، اں. However, it is common in languages for character combinations to represent phonemes: p h → f (in English), so پ ھ → پھ (in Urdu). ھ may remain a character like ں, even if it is not a phoneme. Conclusion: بھ، پھ، ... are not characters but character combinations.

ة status as “character”. Not a character in ANY dictionary, including the dictionaries by the National Language Authority and the Urdu Dictionary Board. It is a stylistic variation of ت (e.g. STCD, NL, …): زکوة → زکوت. Not a character.

ی ے variation. FLJ, FT, JUL, UL, NL: بی بی بی بے بیابان. STCD, UHE, FA: بی بے بیابان بی بی. The middle ے or ی predicament: بیکار = بے کار; ٹیلیوژن = ٹیلی وژن.

ی ے variation. ے differs from ں because: like ا، و، ی, the character ے is a vowel (phoneme); unlike ں, ے is not a vowel modifier; ے replaces ی (بی → بے), whereas ں adds onto ا (ما → ماں). ے is placed at the end of the alphabet (based on traditional collation). It is collated as “heavier” than ی at ligature endings but “equal to” ی ligature-medially.

Role of Aerab in Sorting. Aerab are ignored in the first (primary) pass of sorting an Urdu string; only characters are considered: بِہار (= بِ ہار), بَہانہ (= بَ ہانہ), بِہائ (= بِ ہاءی). However, aerab are relevant in the second pass, when the first pass gives an exact match: بَن، بِن، بُن; سَن، سِن، سُن.
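The two passes can be sketched as a two-part sort key. This is only an illustration under stated assumptions: code point order stands in for the real letter order, the zabar, zer, pesh weights follow the order of the examples on this slide, and the function names are hypothetical.

```python
import unicodedata

ZABAR, ZER, PESH = "\u064E", "\u0650", "\u064F"
# Secondary weights in the zabar, zer, pesh order of the slide's examples.
AERAB_WEIGHT = {ZABAR: 1, ZER: 2, PESH: 3}


def primary_key(word):
    """First pass: letters only; aerab (combining marks) are skipped.
    Code point order stands in for the real letter order (an assumption)."""
    return [ord(ch) for ch in word if not unicodedata.combining(ch)]


def secondary_key(word):
    """Second pass: one weight per letter, 0 when the letter carries no aerab."""
    weights = []
    for ch in word:
        if not unicodedata.combining(ch):
            weights.append(0)
        elif weights:
            weights[-1] = AERAB_WEIGHT.get(ch, len(AERAB_WEIGHT) + 1)
    return weights


def sort_key(word):
    # Tuples compare left to right, so aerab matter only on a primary tie.
    return (primary_key(word), secondary_key(word))


words = ["ب" + PESH + "ن", "س" + ZABAR + "ن", "ب" + ZER + "ن", "ب" + ZABAR + "ن"]
ordered = sorted(words, key=sort_key)
print(ordered)
```

Because the key is a tuple, the secondary weights are consulted only when two words have identical letters, which mirrors "relevant in the second pass, when the first pass gives an exact match".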

Vocalic Aerab - Zabar, Zer, Pesh: بَہَر بَہِر بَہُر بَہ۫ر بُہ۫ر (UL); بَیر بِیَر بِیر بیر. FT, FLJ, JUL, UL: بَن بِن بُن; بِیر بیر. STCD: بَن بُن بِن; سَن سِن سُن

Vocalic Aerab - Khari Zabar. No effect on primary-level sorting: اعلا، اعلان، اعلم، اعلی; مَوسی، مُوسی. No minimal pairs were found at the secondary level, so its involvement could not be determined.

Consonantal Aerab - Tashdeed. Ignored at primary level (FT, UL, NL, …). Affects secondary-level sorting: a letter with tashdeed is “heavier”, and lighter goes up. Examples: بدی، بدّی، بدّیا; بَرانا، برّانا، بَرایا; پَتا، پَتّا، پِتا.

Ligature-Break (Half Space): Hex 41 (UZT) and Hex 200B (Unicode). Ignored at primary and secondary levels: ٹیلیوژن، ٹیلی وژن; ٹیلیفون، ٹیلی فون; بے کار، بیکار. But given each pair, which word comes first? This is a tertiary-level decision: lighter goes up, so the single word without a break comes first?

Word-Break (Normal Space): ignored at primary level? American Heritage Dictionary (2nd Collegiate ed.): black art, black bear, blackberry, black box, blacken, Black Death, black gold. So the space is ignored at primary level.

Word-Break (Normal Space). FLJ, UL: (1) بانگ, (2) بانگِ درا, (3) بانگ دینا. If sorting were done at word breaks, the order would be 1, 3, 2. So sorting ignores the word break.
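One reading of the 1, 3, 2 remark is that a word-by-word sort would fully compare the first word, aerab included, before looking at the second word. The sketch below contrasts that with ignoring the space. It assumes code point order for letters and for aerab, and all function names are hypothetical.

```python
import unicodedata


def letters(text):
    """Primary content: drop aerab (combining marks) and spaces."""
    return [ord(c) for c in text if not unicodedata.combining(c) and c != " "]


def marks(text):
    """Secondary content: the aerab in order of appearance."""
    return [ord(c) for c in text if unicodedata.combining(c)]


def key_ignoring_space(entry):
    # Whole entry compared as one string; spaces carry no weight.
    return (letters(entry), marks(entry))


def key_word_by_word(entry):
    # Each word is fully compared (letters, then aerab) before the next word.
    return [(letters(word), marks(word)) for word in entry.split(" ")]


ZER = "\u0650"
entries = ["بانگ", "بانگ" + ZER + " درا", "بانگ دینا"]  # entries 1, 2, 3

print(sorted(entries, key=key_ignoring_space))
print(sorted(entries, key=key_word_by_word))
```

Ignoring the space keeps the dictionary order 1, 2, 3, because ر precedes ی once the letters run together. Comparing word by word gives 1, 3, 2, because the zer on the first word of entry 2 makes that word heavier than the bare بانگ of entry 3.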

Conclusions: Urdu Character Set. Two levels of characters: core characters and non-core characters. آ ا ب پ ت ٹ ث ج چ ح خ د ڈ ذ ر ڑ ز ژ س ش ص ض ط ظ ع غ ف ق ک گ ل م ں ن و ھ ہ ء ی ے
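The sequence on this slide can serve directly as a primary weight table. The following is a minimal sketch, assuming one weight per character, with anything outside the table treated as ignorable at this level; the names `PROPOSED`, `RANK` and `primary_key` are hypothetical. It does not model the special handling of ے discussed earlier.

```python
import unicodedata

# Letter sequence as given on the slide; position in the list is the weight.
PROPOSED = unicodedata.normalize(
    "NFC",
    "آ ا ب پ ت ٹ ث ج چ ح خ د ڈ ذ ر ڑ ز ژ س ش ص ض ط ظ ع غ ف ق ک گ ل م ں ن و ھ ہ ء ی ے",
).split()
RANK = {letter: i for i, letter in enumerate(PROPOSED)}


def primary_key(word):
    """Map each known letter to its rank; other characters are skipped."""
    word = unicodedata.normalize("NFC", word)
    return [RANK[ch] for ch in word if ch in RANK]


print(sorted(["مان", "ماں"], key=primary_key))
print(sorted(["بہن", "بھابی", "باپ"], key=primary_key))
```

With this table ں sorts before ن and ھ before ہ, matching the "lighter goes up" arguments made on the earlier slides.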

Conclusions: Urdu Collating Sequence. A multi-level, complex problem. Pre-processing: contractions (ب ھ → بھ); insert unwritten aerab. Primary level: characters. Secondary level: aerab; others (?). Tertiary level: ligature break. Ignorable: space; secondary aerab (?); symbols (?).

What Needs to be Done for Urdu: debate and standardize the character set; develop a computational model to implement sorting, with a culturally acceptable Collation Element Table to generate sort keys; standardize and publicize this computational model for Urdu sorting.

What Needs to be Done: take national standards to international forums (Unicode/ISO); complete similar work for all other local languages of Pakistan (character set, script, collating sequence).

Relevant National and Provincial Government Organizations. National: Urdu and Regional Languages’ Software Development Forum (URLSDF), Ministry of Science and Technology (MoST), Islamabad; National Language Authority (NLA), Islamabad (Urdu); Pakistan Standards and Quality Control Authority (PSQCA), Karachi. Provincial: Balochi Academy, Quetta; Pashto Academy, Peshawar; Punjabi Adabi Board, Lahore; Sindhi Language Authority (SLA), Hyderabad.

Thank you (شکریہ)