
1 School of Computing, Faculty of Engineering. 17/07/09, CL 2009. Linguistically Informed and Corpus Informed Morphological Analysis of Arabic. Majdi Sawalha & Eric Atwell, School of Computing, University of Leeds, Leeds, LS2 9JT, UK

2 Outline
Introduction
Arabic Morphological Analyzers
Arabic Corpora & Lexicons
Analytical Study of Tri-literal Roots of Arabic
Specifications of the Morphological Analyzer
Morphological Features of Arabic Words and Tag Set
Evaluation and Results
Gold Standard for Evaluation
Morphochallenge 2009 Quran Gold Standard

3 Introduction
Methodologies for developing a robust Arabic morphological analyzer:
Syllable-based Morphology (SBM)
Root-Pattern Methodology
Lexeme-based Morphology
Stem-based Arabic lexicon with grammar and lexis specifications
Using tagged corpora and computer algorithms to build a morphological database of the tagged words
Roots, stems, patterns and affixes are pre-stored.
Grammar and linguistic information are encoded with the analyzers.

4 Arabic Morphological Analyzers
Buckwalter Morphological Analyzer: uses pre-stored dictionaries of words, stems and affixes constructed manually.
Khoja's Stemmer: removes the longest prefix and suffix of the word, then matches the processed word against lists of noun and verb patterns to extract the correct root of the word.
Al-Shalabi et al.: depends on mathematical calculations of weights assigned to the letters of the word. The algorithm selects the letters with lower weights as root letters.

5 Comparative Evaluation of Arabic Morphological Analyzers
Studying freely available morphological analyzers and stemmers.
Developing a gold standard for evaluation.
Results: More work is needed on the morphological analysis of Arabic. We cannot rely on such analyzers for further analysis such as part-of-speech tagging and parsing.

6 Arabic Corpora
The Quran: 78,000 tokens, 19,000 vowelized word types, 15,000 non-vowelized word types.
The Corpus of Contemporary Arabic (CCA): a Modern Standard Arabic text corpus of 1 million words.
The Penn Arabic Treebank: 734 files, 166,000 words of written Modern Standard Arabic.
The text of 15 traditional Arabic lexicons as corpora: about 11 million words and 2 million word types of both modern and classical Arabic text.

7 Arabic Lexicons
Methodologies of ordering lexical entries in the Arabic lexicons:
Al-Khalil methodology (lists the lexical entries based on the pronunciation of the letters, starting from the farthest in the mouth to the nearest)
Abi Obaid methodology (lists the lexical entries based on similarity in meaning)
Al-Jawhari methodology (lists the lexical entries based on the last letter of the word)
Al Barmaki methodology (lists the lexical entries alphabetically)

8 Arabic Lexicons
A sample of Arabic lexicon:
كتب : الكِتابُ : معروف، والجمع كُتُبٌ وكُتْبٌ. كَتَبَ الشيءَ يَكْتُبه كَتْباً وكِتاباً وكِتابةً، وكَتَّبَه : خَطَّه؛ قال أَبو النجم : أَقْبَلْتُ من عِنْدِ زيادٍ كالخَرِفْ، تَخُطُّ رِجْلايَ بخَطٍّ مُخْتَلِفْ، تُكَتِّبانِ في الطَّريقِ لامَ أَلِفْ قال : ورأَيت في بعض النسخِ تِكِتِّبانِ، بكسر التاء، وهي لغة بَهْرَاءَ، يَكْسِرون التاء، فيقولون : تِعْلَمُونَ...
k t b: [Alkitab] "the book" is well known. The plural forms are [kutubun] and [kutbun]. [kataba Alshay] means he wrote something, and [yaktubuhu] is the action of writing something. [katban], [kitaban] and [kitabatan] mean the art of writing. [kattabahu] means he drew it up. Abu Al-Najim said: I returned from Ziyad's place [after meeting him] like a senile man, my legs drawing different shapes (meaning walking in an unusual way); they wrote [tukattibani] the letters Lam Alif on the road (describing how erratically he was walking). He said: I saw in another version the word [tikittibani], with the short vowel kasrah on the first letter [taa], as used in the dialect of Bahraa [an Arab tribe]. They say: [tilamuwn] (you know).
A sample of Arabic-English Dictionary by Edward Lane

9 Analytical Study of Tri-literal Roots of Arabic
Tri-literal roots were classified into 3 main groups and 22 detailed groups.
Experiment 1: Quran words derived from tri-literal roots were analyzed (45,534 words and 1,610 tri-literal roots).
[Charts: Tri-literal roots of Quran; Quran tokens]

10 Analytical Study of Tri-literal Roots of Arabic
Experiment 2: Word types of a broad lexical resource, constructed by analyzing 15 Arabic lexicons and containing 376,167 word types, were analyzed.
[Charts: Word types of broad-lexical resource; Roots of broad-lexical resource]

11 Specifications of the Morphological Analyzers - Inputs
Input: single words or text (fully vowelized, partially vowelized, or non-vowelized)
Tokenization: Arabic word, number, currency or punctuation mark.
Processing Arabic words:
Resolving doubled letters marked with Shaddah: وَصَّى waS~aY -> وَصْصَى waSoSaY
Resolving the extension (maddah): آمَنُوا |manuwA -> ءامَنُوا AmanuwA
Only one short vowel might appear on any letter of the Arabic word.
[Table: letter-by-letter positions of ءامَنُوا AmanuwA and وَصْصَى waSoSaY]
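The two normalization steps on this slide can be sketched on Buckwalter-transliterated input. This is only an illustration under assumptions, not the authors' implementation: the function names are made up, and the expanded maddah is written as `'A` (hamza plus alef in the Buckwalter scheme), whereas the slide's transliteration shows it as AmanuwA.

```python
import re

MADDA = "|"          # Buckwalter symbol for alef with maddah
HAMZA_ALEF = "'A"    # assumed expansion: hamza followed by alef


def resolve_shaddah(word: str) -> str:
    """Rewrite each doubled letter X~ as XoX (X with sukun, then X again)."""
    return re.sub(r"(.)~", r"\1o\1", word)


def resolve_maddah(word: str) -> str:
    """Expand alef-with-maddah into an explicit hamza plus alef."""
    return word.replace(MADDA, HAMZA_ALEF)


def normalize(word: str) -> str:
    """Apply both resolutions to one transliterated word."""
    return resolve_maddah(resolve_shaddah(word))


print(normalize("waS~aY"))
print(normalize("|manuwA"))
```

Working on the transliteration keeps the sketch short, because the shaddah symbol always directly follows the letter it doubles; Arabic-script input would first need a check on the order of combining marks.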

12 Stop Words (Unambiguous Words)
A stop word has only one morphological analysis wherever it appears in the text.
About 40% of the tokens in any text are stop words.
The system contains a list of 1,368 stop words.
Personal pronouns: أنا >nA I, هي hy she
Relative pronouns: الذي Al*y who (sm), التي Alty who (sf)
Demonstrative pronouns: هذا h*A this (sm), هذه h*h this (sf)
Prepositions: في fy in, على ElY on, إلى

13 Clitics, Prefixes and Suffixes
Proclitics, prefixes, suffixes and enclitics were collected from traditional Arabic grammar books.
The lists of clitics and affixes were checked against four Arabic corpora:
The Quran
The Corpus of Contemporary Arabic (CCA)
The Penn Arabic Treebank
The text of the 15 traditional Arabic lexicons as a corpus

14 Clitics, Prefixes and Suffixes
215 Proclitics & Prefixes
Prefix | Example | TagP1 | TagP2 | TagP3
وال wAl | والـسماء wAlsmA | و w p--t | ال Al r---d-----d |
فست fst | فستـذكرون fst*krwn | ف f p--t | س s p--i | ت t r---s-nus
127 Suffixes & Enclitics
Suffix | Example | TagP1 | TagP2 | TagP3
يون ywn | الحواريون AlHwArywn | ي y r---j | ون wn r---l-mp-n?----?--- |
تموهما tmwhmA | أورثـتموها >wrvtmwhA | تم tm r---&-mps??----h--- | و w r---l-mp-n?----?--- | هما hmA r---&-ndt??----h---

15 Clitics, Prefixes and Suffixes
Words are divided into three parts of different sizes.
The first part is searched for in the proclitics & prefixes list.
The third part is searched for in the suffixes & enclitics list.
Analyzed Word: يَعْمَلُونَ yaEomaluwna
First Part | Second Part | Third Part | Prefixes & Suffixes analyses
(empty) | يعملون yEmlwn | (empty) | Candidate analysis
(empty) | يعملو yEmlw | ن n | Not accepted
ي y | عمل Eml | ون wn | Candidate analysis
يع yE | م m | لون lwn | Not accepted
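The three-part division described above can be sketched as follows. This is a minimal illustration, assuming non-vowelized Buckwalter input and tiny stand-in affix sets; the real lists hold 215 proclitics and prefixes and 127 suffixes and enclitics, and the function names here are hypothetical.

```python
def three_part_splits(word):
    """Yield every (first, second, third) split with a non-empty second part."""
    n = len(word)
    for i in range(n):                  # index where the first part ends
        for j in range(i + 1, n + 1):   # index where the second part ends
            yield word[:i], word[i:j], word[j:]


def candidate_analyses(word, prefixes, suffixes):
    """Keep splits whose first part is a known prefix and third part a known suffix."""
    return [
        (first, second, third)
        for first, second, third in three_part_splits(word)
        if first in prefixes and third in suffixes
    ]


# Toy affix sets (assumptions); the empty string stands for "no affix".
PREFIXES = {"", "y"}
SUFFIXES = {"", "wn"}

for analysis in candidate_analyses("yEmlwn", PREFIXES, SUFFIXES):
    print(analysis)
```

With these toy sets, splits such as (yE, m, lwn) are dropped because neither yE nor lwn is a listed affix.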

16 Root or Stem
The system uses a list of about 12,000 roots extracted by analyzing 15 traditional Arabic language lexicons.
The second part of the word is searched for in the root list.
Analyzed Word: يَعْمَلُونَ yaEomaluwna
First Part | Second Part | Third Part | Affixes analyses | Affixes and Root analyses
(empty) | يعملون yEmlwn | (empty) | Candidate analysis | Not accepted analysis
(empty) | يعمل yEml | ون wn | Candidate analysis | Not accepted analysis
ي y | عملون Emlwn | (empty) | Candidate analysis | Not accepted analysis
ي y | عمل Eml | ون wn | Candidate analysis | Accepted analysis
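The root check on this slide amounts to filtering the affix candidates by looking up their second part in the root list. The sketch below assumes a two-entry toy root set in place of the roughly 12,000 roots, and the function name is made up for illustration.

```python
# Toy root list (assumption); the system described uses about 12,000 roots.
ROOTS = {"Eml", "ktb"}


def accept_by_root(candidates, roots):
    """Keep only candidates whose second (middle) part is a listed root."""
    return [c for c in candidates if c[1] in roots]


# The four affix candidates for yEmlwn shown in the table above.
candidates = [
    ("", "yEmlwn", ""),
    ("", "yEml", "wn"),
    ("y", "Emlwn", ""),
    ("y", "Eml", "wn"),
]

accepted = accept_by_root(candidates, ROOTS)
print(accepted)
```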

17 Word Pattern
Different words are derived from their roots using certain patterns.
Derived words inherit the morphological features of their derivation patterns.
The system has a list of patterns extracted from traditional Arabic language grammar books: verb patterns and 985 noun patterns.
Morphological feature POS tags are assigned to each pattern in the list.
Patterns are fully vowelized.
Verb Patterns | POS Tag
فَعَلْتُ faEalotu | v-p---nsf---an?-st?-
فَعَلْنَا faEalonaA | v-p---npf---an?-st?-
فَعَلْتَ faEalota | v-p---mss---an?-st?-
Noun Patterns | POS Tag
أُفْعُلاوَى >ufoEulAwaY | nw----??-??----?qt-?
اِفْعِيلال AifoEiylAl | nw----??-??----?qt-?
فاعُولاء fAEuwlA | nw----??-??----?qt-?

18 Pattern Matching Algorithms
First algorithm: depends on the word and its root as inputs.
The root letters of the word are replaced by the letters (Fa, Ain, Lam, [Lam]) ( ف ، ع ، ل ، [ ل ] ).
Replacement of root letters is not an easy task!
Second algorithm: depends on a pre-stored list of patterns.
Searches the pattern list for patterns of the same size as the analyzed word, after removing its affixes.
Replaces the letters of the pattern corresponding to (Fa, Ain, Lam, [Lam]) ( ف ، ع ، ل ، [ ل ] ) with the letters of the word.
E.g.: The word كتب ktb matches the following patterns:
فَعْل faEol, فَعَل faEal, فَعُل faEul, فَعِل faEil, فُعْل fuEol, فُعَل fuEal, فُعُل fuEul, فُعِل fuEil, فِعْل fiEol, فِعِل fiEil
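A rough sketch of the second algorithm, written for Buckwalter-transliterated input. It compares the word with each pattern after removing the pattern's diacritics, treats f, E and l as root slots, and then fills the slots with the word's letters. The diacritic set, the slot handling and the function names are assumptions made for this illustration; the presentation does not give these details.

```python
DIACRITICS = set("aiuo~FNK")  # Buckwalter short vowels, sukun, shaddah, tanween
SLOTS = set("fEl")            # pattern letters standing for root positions


def skeleton(text):
    """Drop diacritics, leaving only the letters."""
    return "".join(ch for ch in text if ch not in DIACRITICS)


def matches(word, pattern):
    """True if the word and pattern have equal size and agree on non-slot letters."""
    w, p = skeleton(word), skeleton(pattern)
    if len(w) != len(p):
        return False
    return all(pc in SLOTS or pc == wc for wc, pc in zip(w, p))


def instantiate(word, pattern):
    """Copy the pattern's diacritics onto the word's letters."""
    letters = iter(skeleton(word))
    return "".join(ch if ch in DIACRITICS else next(letters) for ch in pattern)


def vocalizations(word, patterns):
    """Return the vowelized form produced by every matching pattern."""
    return [instantiate(word, p) for p in patterns if matches(word, p)]


PATTERNS = ["faEol", "faEal", "faEul", "faEil", "fuEol",
            "fuEal", "fuEul", "fuEil", "fiEol", "fiEil"]

print(vocalizations("ktb", PATTERNS))
print(matches("yEmlwn", "yafoEaluwna"))
```

Treating every f, E or l in a pattern as a slot is a simplification: a real pattern list would need to distinguish slot letters from literal affix letters.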

19 Word Pattern: The second algorithm (Example)
Analyzed Word: يَعْمَلُونَ yaEomaluwna
Matched Patterns | Tag
يَفْعُلُونَ yafoEuluwna | v-c---mpt--ian?-st?
يَفْعِلُونَ yafoEiluwna | v-c---mpt--ian?-st?
يَفْعَلُونَ yafoEaluwna | v-c---mpt--ian?-st?
يُفْعِلُونَ yufoEiluwna | v-c---mpt--ipn?-at?
يُفْعَلُونَ yufoEaluwna | v-c---mpt--ipn?-tt?

20 Vowelization
Helps in determining some morphological features of the words.
Analyzed Word: كتب ktb
Pattern | Vowelization
فَعْل faEol | كَتْب katob
فَعَل faEal | كَتَب katab
فَعُل faEul | كَتُب katub
فَعِل faEil | كَتِب katib
فُعْل fuEol | كُتْب kutob
فُعَل fuEal | كُتَب kutab
فُعُل fuEul | كُتُب kutub
فُعِل fuEil | كُتِب kutib
فِعْل fiEol | كِتْب kitob
فِعِل fiEil | كِتِب kitib

21 Morphological Features of Arabic Words and Tag Set
The part-of-speech tag set is designed following the traditional grammar classifications.
The tag set has 22 morphological features of Arabic words.
Each tag consists of 22 characters.
E.g. v at the first position indicates verb, and n at the second position indicates proper name. At the seventh position, m indicates masculine and f indicates feminine.
"-" is used if a feature is not applicable to the tagged word.
"?" is used if a feature applies to the word but its value is not currently available or the automatic tagger could not guess it.

22 Morphological Features of Arabic Words and Tag Set
Position | Morphological Features Categories | Arabic
1 | Main POS | أَقسام الكلام الرئيسيَّة
2 | POS of Noun | أقسام فرعيَّة (الاسم)
3 | POS of Verb | أقسام فرعيَّة (الفعل)
4 | POS of Particle | أقسام فرعيَّة (الحرف)
5 | Residuals | أقسام فرعيَّة (أخرى)
6 | Punctuations | علامات الترقيم
7 | Gender | الجنس
8 | Number | العدد
9 | Person | الشخص
10 | Morphology | الصَّرف
11 | Case and Mood | الحالة الإعرابية للاسم أو الفعل
12 | Case and Mood marks | علامة الإعراب أو البناء
13 | Definiteness | المَعْرِفَة والنَّكِرَة
14 | Voice | المَبْني لِلمَعْلُوم و المَبْني لِلمَجْهُول
15 | Emphasize | المُؤكَّد وغيرُ المُؤكَّد
16 | Transitivity | اللازم والمتعدي
17 | Humanness | العاقل وغير العاقل
18 | Variability & Conjugation | التَّصريف
19 | Augmented & Unaugmented | المجرَّد والمزيد
20 | Root letters | عَدَد أحْرُف الجَذْر
21 | Verb Internal Structure | بُنية الفعل
22 | Noun finals | أقسام الأسم تبعاً للفظ آخره
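Because each tag is positional, decoding one is a lookup of each character against the 22 categories listed above. The sketch below shows only that mechanism. The tag is made up for illustration rather than copied from the slides, and the values are left as raw characters because the full value inventory for each position is not given here.

```python
FEATURES = [
    "Main POS", "POS of Noun", "POS of Verb", "POS of Particle", "Residuals",
    "Punctuations", "Gender", "Number", "Person", "Morphology",
    "Case and Mood", "Case and Mood marks", "Definiteness", "Voice",
    "Emphasize", "Transitivity", "Humanness", "Variability & Conjugation",
    "Augmented & Unaugmented", "Root letters", "Verb Internal Structure",
    "Noun finals",
]


def decode_tag(tag):
    """Map a 22-character tag to {feature: value}.

    '-' (not applicable) is skipped; '?' (applicable but unknown) becomes None.
    """
    if len(tag) != len(FEATURES):
        raise ValueError(f"expected {len(FEATURES)} characters, got {len(tag)}")
    decoded = {}
    for name, value in zip(FEATURES, tag):
        if value == "-":
            continue
        decoded[name] = None if value == "?" else value
    return decoded


# Hypothetical tag: verb at position 1, masculine at position 7,
# number applicable but unknown at position 8.
example_tag = "v" + "-" * 5 + "m" + "?" + "-" * 14
print(decode_tag(example_tag))
```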

23 Morphological Features of Arabic Words and Tag Set
Sample of a tagged document using the morphological feature tag set:
وَوَصَّيْنَا الْإِنسَانَ بِوَالِدَيْهِ حُسْنًا
We have recommended that a person must take good care of their parents.
Word | Gloss | Tag
وَ wa | And | p--t-----a
وَصَّيْ waS~ayo | Recommended | v-p---npf--iano-at&
نَا naA | We | p--&---p-n
الْإِنسَانَ Alo

24 Evaluation and Results: Gold Standard for Evaluation
Gold standards are used to evaluate and measure the actual accuracy of automatic systems.
To construct a gold standard for evaluation, we need to determine:
The Problem Domain: evaluating morphological analyzers and part-of-speech taggers.
The Corpora: corpora of different text domains, formats and genres of both vowelized and non-vowelized Arabic text.
Two versions of the Quran text: vowelized Quran text and non-vowelized Quran text.
The Corpus of Contemporary Arabic (Al-Sulaiti & Atwell, 2006).

25 Gold Standard for Evaluation
Gold Standard Format
Includes morphological and part-of-speech information for each word of the gold standard, one word per line with fields separated by tabs.
Contains the root and the pattern information of the words.
The gold standard will be stored in flat text files using Unicode UTF-8 encoding, or in XML.
Gold Standard Size
It must be relatively large, so that it can cover most cases that morphological analyzers have to handle.
It is measured by the number of words it contains.

26 Morphochallenge 2009 Gold Standard
MorphoChallenge aims to develop an unsupervised morphological analyzer to be used for different languages, including Arabic.
A gold standard of the Quran has been constructed to evaluate morphological analyzers in the Morphochallenge 2009 competition.
Its size is 78,004 words.
It contains the full morphological analysis for each word, according to the morphological analysis of the Quran in the tagged database of the Quran developed at the University of Haifa (Dror et al., 2004).

27 Morphochallenge 2009 Quran Gold Standard
Word | Root | Pattern | Morphological analysis
Vowelized Arabic script:
بِسْمِ | سم | None | ب+Prep, سم+Noun+Triptotic+Sg+Masc+Gen,
اللّهِ | None | None | للَاه+Noun+ProperName+Gen+Def,
الرَّحْمـَنِ | رحم | فَعلَان | رَحمَان+Noun+Triptotic+Adjective+Sg+Masc+Gen+Def,
الرَّحِيمِ | رحم | فَعِيل | رَحِيم+Noun+Triptotic+Adjective+Sg+Masc+Gen+Def,
Non-vowelized Arabic script:
بسم | سم | None | ب+Prep, سم+Noun+Triptotic+Sg+Masc+Gen,
الله | None | None | للاه+Noun+ProperName+Gen+Def,
الرحمـن | رحم | فعلان | رحمان+Noun+Triptotic+Adjective+Sg+Masc+Gen+Def,
الرحيم | رحم | فعيل | رحيم+Noun+Triptotic+Adjective+Sg+Masc+Gen+Def,
Vowelized transliteration:
bisomi | sm | None | b+Prep, sm+Noun+Triptotic+Sg+Masc+Gen,
All~hi | None | None | llaah+Noun+ProperName+Gen+Def,
Alr~aHom_ani | rHm | faElaAn | raHmaan+Noun+Triptotic+Adjective+Sg+Masc+Gen+Def
Alr~aHiymi | rHm | faEiyl | raHiim+Noun+Triptotic+Adjective+Sg+Masc+Gen+Def,
Non-vowelized transliteration:
bsm | sm | None | b+Prep, sm+Noun+Triptotic+Sg+Masc+Gen,
Allh | None | None | llAh+Noun+ProperName+Gen+Def,
AlrHm_n | rHm | fElAn | rHmAn+Noun+Triptotic+Adjective+Sg+Masc+Gen+Def
AlrHym | rHm | fEyl | rHym+Noun+Triptotic+Adjective+Sg+Masc+Gen+Def,
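Entries like those on this slide could be read with a small parser. The sketch assumes each line holds four tab-separated fields (word, root, pattern, analysis), that a missing root or pattern is written as the literal string None, and that the analysis is a comma-separated list of morphemes whose features are joined by "+". The exact file layout is an assumption based on the sample, and the names `GoldEntry` and `parse_line` are hypothetical.

```python
from typing import List, NamedTuple, Optional


class GoldEntry(NamedTuple):
    word: str
    root: Optional[str]
    pattern: Optional[str]
    morphemes: List[List[str]]  # each morpheme: [form, feature, feature, ...]


def _none_if_missing(value: str) -> Optional[str]:
    """Turn the literal string 'None' into a real None."""
    return None if value == "None" else value


def parse_line(line: str) -> GoldEntry:
    """Parse one assumed tab-separated gold standard line."""
    word, root, pattern, analysis = line.rstrip("\n").split("\t")
    morphemes = [
        part.strip().split("+")
        for part in analysis.split(",")
        if part.strip()  # skip the empty piece after a trailing comma
    ]
    return GoldEntry(word, _none_if_missing(root), _none_if_missing(pattern), morphemes)


sample = "bisomi\tsm\tNone\tb+Prep, sm+Noun+Triptotic+Sg+Masc+Gen,"
entry = parse_line(sample)
print(entry.word, entry.root, entry.pattern)
print(entry.morphemes)
```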

28 Thank you! Questions?

