پاکستانی زبانوں کی ترتیب تہجی اور متعلقہ مسائل

1 پاکستانی زبانوں کی ترتیب تہجی اور متعلقہ مسائل
Collation Sequences and Related Issues for Pakistani Languages
سرمد حسین (Sarmad Hussain)
Center for Research in Urdu Language Processing
National University of Computer and Emerging Sciences

2 Purpose of Presentation
Briefly discuss character sets
Discuss the Urdu collating sequence
Propose a possible Urdu collation sequence
Give an overview of collation for other languages of Pakistan

3 اردو (Urdu)
ا آ ب بھ پ پھ ت تھ ٹ ٹھ ث ج جھ چ چھ ح خ د دھ ڈ ڈھ ذ ر رھ ڑ ڑھ ز ژ س ش ص ض ط ظ ع غ ف ق ک کھ گ گھ ل م مھ ں ںھ ن نھ و وھ ہ ة ء ی ے

4 بلوچی (Balochi)
ا آ ب پ ت ٹ ث ج چ ح خ د ڈ ذ ر ڑ ز ژ س ش ص ض ط ظ ع غ ف ق ک گ ل م ن و ہ ء ی ے ۓ

5 پشتو (Pashto)
ا ب پ ت ټ ث ج ځ چ څ ح خ د ډ ذ ر ړ ز ژ ږ س ش ښ ص ض ط ظ ع غ ف ق ک ګ ل م ن ڼ و ہ ي ې ۍ ٸ ے

6 پنجابی (Punjabi)
ا ب بھ پھ ت تھ ٹ ٹھ ث ج جھ چ چھ ح خ د دھ ڈ ڈھ ذ ر رھ ڑ ڑھ ز ژ س ش ص ض ط ظ ع غ ف ک کھ ق گ گھ ل لھ م مھ نھ ڼ و ہ ء ی ے

7 سندھی (Sindhi)
ا آ ب ٻ ڀ ت ٿ ٽ ٺ ث پ ج ڄ جھ ڃ چ ڇ ح خ د ڌ ڏ ڊ ڍ ذ ر ڙ ڙھ ز س ش ص ض ط ظ ع غ ف ڦ ق ڪ ک گ ڳ گھ ڱ ل لھ م مھ ن نھ ڻ ڻھ و ھ ہ ء ي

8 Sources
Urdu: Akhbar-e-Urdu (Special Supplement on Urdu Software; Jan-Feb. 2002), National Language Authority, Islamabad
Balochi: Fax communication (Sept. 2002), Balochi Academy, Quetta
Pashto: Fax communication (Sept. 2002), Pashto Academy, Peshawar
Punjabi: Punjabi Qaida (Experimental), Punjabi Adabi Board, Lahore
Sindhi: Sindhi Boli (July-Dec. 2001) and SLA Letter Circulation of Sindhi Collation (June 2002), Sindhi Language Authority, Hyderabad

9 اردو (Urdu)
آ ا ب پ ت ٹ ث ج چ ح خ د ڈ ذ ر ڑ ز ژ س ش ص ض ط ظ ع غ ف ق ک گ ل م ن و ہ ء ی ے
Source: اردو قائدہ (Urdu Qaida), Ferozsons, Lahore

10 Urdu Alphabet: State of Affairs
Are the following letters of Urdu?
آ أ ٶ بھ پھ تھ ... ں ة لھ مھ نھ ںھ وھ
If yes, where are they placed in the alphabet?

11 Sources
Data from eight dictionaries of Urdu
FLJ: فیروزاللغات جامع (Feroz-ul-Lughat Jame), Ferozsons, Lahore
STCD: Standard Twentieth Century Dictionary: Urdu to English, Educational Publishing House, New Delhi, India
FT: فرہنگِ تلفظ (Farhang-e-Talaffuz), National Language Authority, Islamabad
JUL: جدید اردو لغت (Jadeed Urdu Lughat), National Language Authority, Islamabad
UL: اردو لغت (Urdu Lughat), Urdu Dictionary Board, Karachi
UHE: A Dictionary of Urdu, Classical Hindi and English, Crosby Lockwood and Son, London (1911)
FA: فرہنگ آصفیہ (Farhang-e-Asifiya), Delhi (1918)
NL: نوراللغات (Noor-ul-Lughat), Sang-e-Meel, Lahore

12 Urdu Alphabet: State of Affairs
FT, JUL, UL
ا آ ب بھ پ پھ ت تھ ٹ ٹھ ث ج جھ چ چھ ح خ د دھ ڈ ڈھ ذ ر رھ ڑ ڑھ ز ژ س ش ص ض ط ظ ع غ ف ق ک کھ گ گھ ل لھ م مھ ں ںھ ن نھ و ہ ء ی ے
FLJ, NL
آ ا ب پ ت ٹ ث ج چ ح خ د ڈ ذ ر ڑ ز ژ س ش ص ض ط ظ ع غ ف ق ک گ ل م ں ن و ہ ھ ء ی ے
UHE, FA, STCD
ا ب پ ت ٹ ث ج چ ح خ د ڈ ذ ر ڑ ز ژ س ش ص ض ط ظ ع غ ف ق ک گ ل م ن و ہ

13 Conclusions: Urdu Character Set
No general agreement on the Urdu character set among dictionary publishers
The standard character set defined by the National Language Authority and the Urdu Dictionary Board is
  not traditional
  not well publicized
  not completely adopted
The GoP standard for computing, UZT 1.01, implements the NLA-defined character and symbol set
UZT 1.01 will soon be fully represented in Unicode/ISO/IEC 10646

14 Character Set
Alphabet
Harakat (Aerab)
Other Symbols

15 “Familiar” Harakaat (Aerab)
Do zabar دً
Do zer دٍ
Do pesh دٌ
Tashdeed دّ
Noon ghunna ن
Jazm دْ
Zabar دَ
Zer دِ
Pesh دُ
Khari zabar دٰ
Khari zer دٖ
Ulta pesh دٗ

16 “Common” Other Symbols
Numbers: 0 ۰, 1 ۱, 2 ۲, 3 ۳, 4 ۴, 5 ۵, 6 ۶, 7 ۷, 8 ۸, 9 ۹
Punctuation: ؟ ؛ ٬ -
Honorifics
Other Symbols ס

17 Current GoP Standard: UZT 1.01

18 Logical Sections of UZT 1.01
Alphabet (80 – 122)
Aerab/diacritics/harakat (66 – 79, 123 – 126)
Other characters
  Punctuation and arithmetic symbols (32 – 47, 58 – 65)
  Digits (48 – 57)
  Special symbols (160 – 176, 192 – 199)
Miscellaneous
  Control characters (0 – 31, 127)
  Reserved control space (128 – 159, 255)
  Reserved expansion space (177 – 191, 200 – 207, 240 – 253)
  Vendor area (208 – 239)
  Toggle character (254)
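
The ranges listed for UZT 1.01 cover every value from 0 to 255 exactly once, so they can be read as a lookup table. The sketch below encodes them as given on the slide; the names `UZT_SECTIONS` and `uzt_section` are hypothetical helpers for illustration, not part of the standard.

```python
# Decimal code ranges (inclusive), copied from the slide's list of sections.
UZT_SECTIONS = {
    "alphabet": [(80, 122)],
    "aerab": [(66, 79), (123, 126)],
    "punctuation and arithmetic symbols": [(32, 47), (58, 65)],
    "digits": [(48, 57)],
    "special symbols": [(160, 176), (192, 199)],
    "control characters": [(0, 31), (127, 127)],
    "reserved control space": [(128, 159), (255, 255)],
    "reserved expansion space": [(177, 191), (200, 207), (240, 253)],
    "vendor area": [(208, 239)],
    "toggle character": [(254, 254)],
}


def uzt_section(code: int) -> str:
    """Return the logical section of a UZT 1.01 code value (hypothetical helper)."""
    if not 0 <= code <= 255:
        raise ValueError(f"the listed ranges only cover 0-255, got {code}")
    for name, ranges in UZT_SECTIONS.items():
        if any(low <= code <= high for low, high in ranges):
            return name
    raise LookupError(f"code {code} is not covered by any listed range")


# Hex 41 is the value the deck later gives for the UZT ligature break.
print(uzt_section(0x41))
```

A table like this is mainly useful as a first step in collation: it tells a sorter whether a code is a letter, an aerab, or something to be ignored.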

19 Urdu Collation Sequence
How do the following figure in?
  Basic letters
  Other letters
  Basic aerab
  Other aerab
  Others
Arguments should be consistent and simple

20 Character vs. Phoneme
Character = written content = letters
Phoneme = linguistic content
In the word “phone”:
  5 characters = p h o n e
  3 phonemes = f o n

21 Urdu Collating Sequence: Letters
What is the status and sequence of the following characters?
ا آ أ ٶ ن ں ہ ھ ة ہ ت ی ے

22 ا آ Variation
آ = ا ا
FLJ
آب (= ا ا ب) آپ (= ا ا پ) اب ایوان
FT, JUL, UL
اب ایوان آب (= ا ا ب) آپ (= ا ا پ)
STCD, UHE, FA, NL
ا آب آپ اب ایوان
آ is a stylistic variation of ا ا
  adds a character to a single alif
  not a character in the pure sense

23 أ ٶ Status
Not a character in ANY dictionary, including dictionaries by
  National Language Authority
  Urdu Dictionary Board
Has the same bearing on collation sequences as ء ا and ء و
Included in UZT 1.01 as per the terms of reference given by NLA
May be made by combining ء followed by ا or و
Should be taken out of UZT 1.01 in its next version

24 ن ں Variation
FLJ, FT, STCD, NL, FA, UHE
ماں مان
JUL, UL
مان ماں
ں is a vowel modifier which nasalizes the vowel but DOES NOT add any “phonemic content”
  not a phoneme
  is a character
  does not represent any other character or combination
  written adjacent to ن
lighter goes up! ں would come before ن
ما = C V
ماں = C V
مان = C V C

25 ہ ھ Variation
FLJ, UHE, FA, NL (بھ not a character; ہ then ھ)
باپ بھابی بہن بہنگی بھنگی بیٹا
STCD (بھ not a character; ھ then ہ)
باپ بھابی بہن بھنگی بہنگی بیٹا
FT, JUL, UL (بھ a character)
باپ بہن بہنگی بیٹا بھابی بھنگی
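
The three dictionary orders on this slide can be reproduced with two small sort keys. This is only one possible reading, stated as an assumption: in the first two orders ہ and ھ are treated as equal at the primary level and differ only as a tie-breaker, while in the third order a consonant followed by ھ is treated as one letter placed right after the plain consonant. The function names and the weighting scheme are hypothetical.

```python
HE = "\u06C1"             # ہ (gol he)
DO_CHASHMI_HE = "\u06BE"  # ھ (do-chashmi he)

# Base letter order from the FLJ/NL row of slide 12; ھ is handled separately.
ALPHABET = "\u0622" + "ابپتٹثجچحخدڈذرڑزژسشصضطظعغفقکگلمںنو" + HE + "ءیے"
RANK = {ch: i for i, ch in enumerate(ALPHABET)}


def base(ch):
    """Map ھ onto ہ so both share one primary weight."""
    return HE if ch == DO_CHASHMI_HE else ch


def key_same_primary(word, lighter):
    """ہ and ھ tie at the primary level; `lighter` wins the tie."""
    primary = [RANK[base(ch)] for ch in word]
    secondary = [0 if ch == lighter else 1
                 for ch in word if ch in (HE, DO_CHASHMI_HE)]
    return primary, secondary


def key_aspirate_as_letter(word):
    """Consonant + ھ counts as one letter sorted just after the plain consonant."""
    primary = []
    for ch in word:
        if ch == DO_CHASHMI_HE and primary:
            primary[-1] += 1                    # e.g. بھ follows all of ب
        else:
            primary.append(2 * RANK[base(ch)])  # even weights leave room for +1
    return primary


WORDS = ["باپ", "بھابی", "بہن", "بہنگی", "بھنگی", "بیٹا"]
print(sorted(WORDS, key=lambda w: key_same_primary(w, HE)))             # ہ then ھ
print(sorted(WORDS, key=lambda w: key_same_primary(w, DO_CHASHMI_HE)))  # ھ then ہ
print(sorted(WORDS, key=key_aspirate_as_letter))                        # بھ a letter
```

The first two keys differ only in which of the two characters is considered lighter, which is why those two dictionary orders differ in a single pair of words.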

26 ہ ھ Variation
Like ں, which is a vowel modifier, ھ is a consonant modifier and DOES NOT add any “phonemic content”
As with ں, ھ is
  not a phoneme
  written adjacent to ہ
lighter goes up! ھ would come before ہ
ب = C
بھ = C
بہ = C V C

27 بھ، پھ،۔۔۔ Status as “Character”
Urdu Dictionary Board and National Language Authority assert that these are phonemes, therefore the character combination should be made a character
If character combinations which are phonemes are to be promoted to characters, then, to be consistent, the following combinations should also be made characters: یں، وں، اں
However, it is common in languages for character combinations to represent phonemes: p h → f (in English), so پ ھ → پھ (in Urdu)
ھ may remain a character like ں, even if it is not a phoneme
بھ، پھ، ... are not characters but character combinations

28 ة Status as “Character”
Not a character in ANY dictionary, including dictionaries by
  National Language Authority
  Urdu Dictionary Board
Stylistic variation of ت (e.g. STCD, NL, …): زکوة → زکوت
Not a character

29 ی ے Variation
FLJ, FT, JUL, UL, NL
بی بی بی بے بیابان
STCD, UHE, FA
بی بے بیابان بی بی
Middle ے or ی predicament
بیکار = بے کار
ٹیلیوژن = ٹیلی وژن

30 ی ے Variation
ے is different from ں because
  like ا، و، ی the character ے is a vowel (phoneme)
  unlike ں, ے is not a vowel modifier
ے is different from ں because
  ے replaces ی: بی → بے
  ں adds onto ا: ما → ماں
ے is placed at the end of the alphabet (based on traditional collation)
Collated as “heavier” than ی at ligature endings but “equal to” ی ligature-medially

31 Role of Aerab in Sorting
Aerab are ignored in the first (primary) pass of sorting an Urdu string; only characters are considered
بِہار (= بِ ہار)
بَہانہ (= بَ ہانہ)
بِہائ (= بِ ہاءی)
However, aerab are relevant in the second pass, when the first pass gives an exact match
بَن بِن بُن
سَن سِن سُن
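
The two-pass idea on this slide maps naturally onto a two-part sort key. The sketch below is a minimal illustration under stated assumptions: letters are ranked by the core order proposed later in the deck (slide 38), aerab are detected as Unicode combining marks, and zabar, zer, pesh are weighted in the order used by FT, FLJ, JUL and UL. The name `sort_key` and the weight values are hypothetical.

```python
import unicodedata

# Core letter order from slide 38, NFC-normalised so that آ is one code point.
ALPHABET = unicodedata.normalize(
    "NFC", "آابپتٹثجچحخدڈذرڑزژسشصضطظعغفقکگلمںنوھہءیے")
RANK = {ch: i for i, ch in enumerate(ALPHABET)}
UNKNOWN = len(RANK)  # letters outside the table sort last (assumption)

# zabar, zer, pesh: assumed secondary weights, lighter first.
AERAB_WEIGHT = {"\u064E": 1, "\u0650": 2, "\u064F": 3}


def sort_key(word):
    """Return (primary, secondary): letters first, aerab only as a tie-breaker."""
    primary, secondary = [], []
    for ch in unicodedata.normalize("NFC", word):
        if unicodedata.combining(ch):      # an aerab on the previous letter
            if secondary:
                secondary[-1] += (AERAB_WEIGHT.get(ch, 9),)
        else:
            primary.append(RANK.get(ch, UNKNOWN))
            secondary.append(())           # a bare letter is the lightest
    return primary, secondary


zabar, zer, pesh = "\u064E", "\u0650", "\u064F"
print(sorted(["ب" + pesh + "ن", "ب" + zer + "ن", "ب" + zabar + "ن"], key=sort_key))
```

Because Python compares tuples element by element, the secondary list is consulted only when the primary lists are identical, which is the behaviour the slide describes.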

32 Vocalic Aerab - Zabar, Zer, Pesh
بَہَر بَہِر بَہُر بَہ۫ر بُہ۫ر (UL)
بَیر بِیَر بِیر بیر
FT, FLJ, JUL, UL: بَن بِن بُن
بِیر بیر
STCD: بَن بُن بِن
سَن سِن سُن

33 Vocalic Aerab – Khari Zabar
No effect on primary-level sorting
اعلا اعلان اعلم اعلی
مَوسی مُوسی
No minimal pairs were found at the secondary level, so its involvement could not be determined

34 Consonantal Aerab - Tashdeed
Ignored at the primary level (FT, UL, NL, …)
Affects secondary-level sorting
Tashdeed is “heavier”; lighter goes up
بدی بدّی بدّیا
بَرانا برّانا بَرایا
پَتا پَتّا پِتا

35 Ligature-Break (Half Space)
Hex 41 (UZT) and Hex 200B (Unicode)
Ignored at the primary and secondary levels
ٹیلیوژن ، ٹیلی وژن
ٹیلیفون ، ٹیلی فون
بے کار ، بیکار
But given each pair, which word comes first?
Tertiary-level decision
lighter goes up! Does the single word without a break come first?

36 Word-Break (Normal Space)
Ignored at primary level?
American Heritage Dictionary (2nd Collegiate ed.)
  black art
  black bear
  blackberry
  black box
  blacken
  Black Death
  black gold
Space ignored at primary level

37 Word-Break (Normal Space)
FLJ, UL
1. بانگ
2. بانگِ درا
3. بانگ دینا
If sorting were done at the word break, the order would be 1, 3, 2
So sorting ignores the word break
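
The argument on this slide can be checked with two sort keys: one that ignores breaks and one that compares entries word by word. This is a sketch under assumptions: letters are ranked by the core order from slide 38, aerab are taken to be Unicode combining marks, and the aerab tie-breaker simply uses code-point values. All function names are hypothetical.

```python
import unicodedata

ALPHABET = unicodedata.normalize(
    "NFC", "آابپتٹثجچحخدڈذرڑزژسشصضطظعغفقکگلمںنوھہءیے")
RANK = {ch: i for i, ch in enumerate(ALPHABET)}
IGNORABLE = {" ", "\u200B"}  # normal space and the ligature break from slide 35


def letters(text):
    """Primary weights: letters only, skipping aerab and breaks."""
    return [RANK.get(ch, len(RANK))
            for ch in unicodedata.normalize("NFC", text)
            if ch not in IGNORABLE and not unicodedata.combining(ch)]


def aerab(text):
    """Crude secondary weights: the aerab in order of appearance."""
    return [ord(ch) for ch in text if unicodedata.combining(ch)]


def key_ignoring_breaks(entry):
    return letters(entry), aerab(entry)


def key_word_by_word(entry):
    # Each word is fully compared (letters, then aerab) before the next word.
    return [(letters(word), aerab(word)) for word in entry.split(" ")]


ZER = "\u0650"
ENTRIES = ["بانگ", "بانگ" + ZER + " درا", "بانگ دینا"]
print(sorted(ENTRIES, key=key_ignoring_breaks))  # dictionary order 1, 2, 3
print(sorted(ENTRIES, key=key_word_by_word))     # word-break order 1, 3, 2
```

In the word-by-word key the zer on the first word of entry 2 makes that word heavier than the bare first word of entry 3, which is what pushes entry 2 to the end.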

38 Conclusions: Urdu Character Set
Two levels of characters
  Core characters
  Non-core characters
آ ا ب پ ت ٹ ث ج چ ح خ د ڈ ذ ر ڑ ز ژ س ش ص ض ط ظ ع غ ف ق ک گ ل م ں ن و ھ ہ ء ی ے

39 Conclusions: Urdu Collating Sequence
Multi-level complex problem
Pre-processing
  Contractions (ب ھ → بھ)
  Insert unwritten aerab
Primary level
  characters
Secondary level
  aerab
  Others (?)
Tertiary level
  Ligature break
Ignorable
  Space
  secondary aerab (?)
  Symbols (?)

40 What Needs to be Done for Urdu
Debate and standardize the character set
Develop a computational model to implement sorting
  Culturally acceptable
  Collation Element Table to generate sort keys
Standardize and publicize this computational model for Urdu sorting
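
As a rough illustration of what a Collation Element Table that generates sort keys could look like, the sketch below gives every character a (primary, secondary, tertiary) weight and concatenates the non-zero weights of each level into a byte string. It is a toy model under assumptions, not the proposed standard: the letter order is the core order from slide 38, the aerab weights follow zabar, zer, pesh, tashdeed, breaks are given a tertiary weight so the unbroken form sorts first, and all names and numeric weights are hypothetical.

```python
import unicodedata

ALPHABET = unicodedata.normalize(
    "NFC", "آابپتٹثجچحخدڈذرڑزژسشصضطظعغفقکگلمںنوھہءیے")
AERAB = ["\u064E", "\u0650", "\u064F", "\u0651"]  # zabar, zer, pesh, tashdeed
BREAKS = [" ", "\u200B"]                          # space, ligature break

# Collation Element Table: character -> (primary, secondary, tertiary).
# A weight of 0 means "ignorable at this level".
CET = {ch: (i + 1, 1, 1) for i, ch in enumerate(ALPHABET)}
CET.update({ch: (0, i + 2, 0) for i, ch in enumerate(AERAB)})
CET.update({ch: (0, 0, 2) for ch in BREAKS})
UNKNOWN = (255, 1, 1)  # characters outside the table sort last (assumption)


def sort_key(text: str) -> bytes:
    """Build a byte-comparable key: primary, 0, secondary, 0, tertiary."""
    elements = [CET.get(ch, UNKNOWN)
                for ch in unicodedata.normalize("NFC", text)]
    levels = [bytes(e[level] for e in elements if e[level])
              for level in range(3)]
    return b"\x00".join(levels)


print(sort_key("ماں") < sort_key("مان"))           # ں is lighter than ن
print(sort_key("ٹیلیوژن") < sort_key("ٹیلی وژن"))  # unbroken form first
```

Once keys are plain byte strings, any ordinary byte-wise comparison or database index can sort Urdu text without knowing the rules themselves, which is the practical appeal of publishing such a table.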

41 What Needs to be Done
Take national standards to international forums: Unicode/ISO
Complete similar work for all other local languages of Pakistan
  Character set
  Script
  Collating sequence

42 Relevant National and Provincial Government Organizations
National
  Urdu and Regional Languages’ Software Development Forum (URLSDF), Ministry of Science and Technology (MoST), Islamabad
  National Language Authority (NLA), Islamabad (Urdu)
  Pakistan Standards and Quality Control Authority (PSQCA), Karachi
Provincial
  Balochi Academy, Quetta
  Pashto Academy, Peshawar
  Punjabi Adabi Board, Lahore
  Sindhi Language Authority (SLA), Hyderabad

43 شکریہ (Thank you)

