
1 BAMAE: Buckwalter Arabic Morphological Analyzer Enhancer Sameh Alansary Alexandria University Bibliotheca Alexandrina Sameh.alansary@bibalex.org 4th International Conference on Arabic Language Processing (CITALA’12) 2-3 May 2012 Rabat, Morocco Bibalex Arabic Morphological Enhancer

2 Overview
- Introduction.
- Issues on BAMA’s output quality.
- Preparing data sets (corpus).
- Building BAMAE.
- Evaluating BAMAE.
- Conclusion.
4th International Conference on Arabic Language Processing Rabat, Morocco May 2nd – 3rd 2012

3 INTRODUCTION

4 Previous Arabic analyzed corpora. 100,000 words. Morphological and syntactic analysis. Uses Buckwalter Arabic Morphological Analyzer (BAMA). CLARA (Corpus Linguae Arabicae). The Penn Arabic Treebank. 1 million words annotated. Building a balanced and annotated corpus. 15,000 words analyzed. Prague Arabic Dependency Treebank. POS tag set based on EAGLES recommendations. Multi-level linguistic annotations. The morphological level is based on Penn Treebank’s model. Adopts Functional Generative Description of language. Most previous Arabic analyzed corpora used BAMA, as it has been found to be the most suitable lexical resource.

5 International Corpus of Arabic (ICA)
- Planned to include 100 million words.
- Representative corpora from written MSA sources.
- Covers different sources and genres.
- Markup codes have been added.
- BAMA was chosen to analyze the ICA.
- A concatenative approach was adopted in analyzing the ICA.

6 ISSUES ON BAMA’S OUTPUT QUALITY
- Dealing with Arabic words as their English counterparts.
- Missing information.
- Wrong information.
- Wrong concatenations and segmentations.

7 Dealing with Arabic words as their English counterparts
- BAMA classifies some adverbs as prepositions or subordinating conjunctions (بين, إذا). These words were modified to be adverbs of time or place: (ADV_T) or (ADV_P).
- BAMA classifies some verbs as adverbs. These words were modified to be verbs.

8 Missing information
- BAMA does not provide all the possible solutions or tags. EX: ADJ only is given where NOUN is also needed. The tag NOUN has been added manually.
- BAMA does not provide some passives of past and present verbs (no passive form is provided).
- BAMA does not provide most imperative forms of verbs (no imperative form is provided). Such passive and imperative forms have been analyzed manually.
- BAMA does not assign number, gender or definiteness in some cases. Some such words have been dealt with manually and others have been fixed automatically.
- BAMA does not assign a type of adverb: time or place. Such words were modified to be (ADV_T) or (ADV_P).
- BAMA cannot cover all possible glosses. More glosses have been added, e.g. pregnancy and motivation/motivating.

9 Wrong information
- BAMA wrongly predicts gender, number and definiteness in some cases. Some such words have been dealt with manually and others have been fixed automatically.
- BAMA classifies some adverbs as inflectional nouns. EX: بين. Only the accusative case is true.
- BAMA does not detect lemmas correctly in some cases. Such lemmas have been fixed manually.

10 Wrong concatenations and segmentations
- It concatenates prefix, stem and suffix wrongly.
  EX: “زاهدين”: زاهد + ين dual suffix (given by BAMA); زاهد + ين plural suffix (not available).
  EX: “بأكبر”: أكبر + ب ADJ; أكبر + ب NOUN.
- It sometimes fails to segment words correctly.
  EX: “اترك”: Stem = “اترك” by BAMA; expected: Prefix = “ا” + Stem = “ترك”.
  EX: “آمر”: Prefix = “آ” + Stem = “مر” by BAMA; expected: Prefix = “أ” + Stem = “أمر”.

11 DEALING WITH BAMA’S SOLUTIONS

12 More linguistic information that BAMA does not provide has been added
Broken plural and EDAFAH features:
- EX: The words “أبخرة” and “أبواب” are BR_PL. The word “قلمي” and the word “كتب” in “كتب محمد كثيرة” are EDAFAH.
Root information:
- Detected according to the word’s lemma.
- A word may have two different lemmas and, consequently, two different roots. EX: The word “يتم” has the lemma “>atam~”, for which the root should be [tmm], and the lemma “yutom”, for which the root should be [ytm].
- An Arabic word may have one or two roots. EX: The word “سيد” has two roots: [swd/syd]. The word “محمد” has one root: [Hmd].
Stem pattern:
- Root and stem pattern are quite independent.
- Depends on the word’s lemma, stem and root.
- A word may have two stem patterns. EX: The word “مختار” has two stem patterns: “mufotaEil” and “mufotaEal”.
- Some words exhibit metathesis. EX: The word “آبار” has the root [b’r] and the pattern is “>aEofaAl” (أعفال) rather than “>afoEaAl” (أفعال).
Named entity feature:
- EX: The words “الولايات المتحدة الأمريكية” have each been assigned the feature NE.

13

14 Preparing Data Set. [Diagram: Buckwalter (many solutions) → manual disambiguation → 400,000 words, manually verified, one solution for each word → training data and testing data (60,000 words).]

15

16 [Diagram: Word → Word-based level → Context-based level → Memory-based level.]

17 1. Word-based level. Concatenation of prefix, stem and suffix + grammatical Arabic rules: each concatenation is either possible or impossible.

18 A word should not be an adjective in some cases. EX: Adjective + possessive pronouns is ruled out; noun + possessive pronouns is kept. Applying rules: 30% eliminated.
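
The word-based elimination described on this slide could be sketched roughly as follows. This is only an illustration, not BAMAE's actual code: the `Analysis` record, the rule function, the tag names (`ADJ`, `NOUN`, `POSS_PRON_3MS`) and the sample stem are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Analysis:
    """One candidate solution for a word (hypothetical structure)."""
    prefix: str
    stem: str
    stem_pos: str    # assumed tag names, e.g. "NOUN" or "ADJ"
    suffix_tag: str  # assumed tag names, "" when there is no suffix


def adjective_with_possessive(a: Analysis) -> bool:
    """Rule from the slide: an adjective should not take a possessive pronoun."""
    return a.stem_pos == "ADJ" and a.suffix_tag.startswith("POSS_PRON")


# Each rule returns True when a concatenation is "impossible".
RULES = [adjective_with_possessive]


def word_level_filter(candidates):
    """Drop candidates that violate any rule, but never drop all of them."""
    kept = [a for a in candidates if not any(rule(a) for rule in RULES)]
    return kept or list(candidates)


# Illustrative input: one stem analyzed both as ADJ and as NOUN.
candidates = [
    Analysis("", "kabiyr", "ADJ", "POSS_PRON_3MS"),
    Analysis("", "kabiyr", "NOUN", "POSS_PRON_3MS"),
]
remaining = word_level_filter(candidates)
```

Falling back to the unfiltered list when every candidate is rejected is a design choice of this sketch, so that a later level can still decide.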

19 2. Context-based level. Disambiguation is based on the context of the word, using rules extracted from the training data. Rule: after a preposition “with no suffix”, the word is a noun in the genitive case, not an adjective or a verb.

20 Context-based level. EX: لا تبك على المفقود حتى لا تفقد الموجود. Here حتى (preposition) + لا (negative particle) precede تفقد (present verb), so تفقد is a present verb in the subjunctive mood. Memory-based level: applicable only if all previous levels failed to decide the best solution.
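
A context rule of the kind shown on these slides can be pictured as a pattern over the tags of the preceding words plus a test on the candidate. The sketch below is a hedged illustration only: the tag names (`PREP`, `NEG_PART`, `IV`, `GEN`, `SUBJUNCTIVE`) and the rule table are assumptions, and the "with no suffix" condition is simplified to a bare `PREP` tag.

```python
from typing import Callable, NamedTuple


class Candidate(NamedTuple):
    pos: str      # assumed tag, e.g. "NOUN", "ADJ", "IV" (present verb)
    feature: str  # case for nouns, mood for verbs


class ContextRule(NamedTuple):
    left_context: tuple[str, ...]         # tags that must precede the word
    accept: Callable[[Candidate], bool]   # which candidates survive


# Hypothetical rule table; more specific contexts are listed first.
RULES = [
    ContextRule(("PREP", "NEG_PART"),
                lambda c: c.pos == "IV" and c.feature == "SUBJUNCTIVE"),
    ContextRule(("PREP",),
                lambda c: c.pos == "NOUN" and c.feature == "GEN"),
]


def context_level_filter(left_tags, candidates):
    """Apply the first rule whose left context matches; else keep everything."""
    for rule in RULES:
        n = len(rule.left_context)
        if tuple(left_tags[-n:]) == rule.left_context:
            kept = [c for c in candidates if rule.accept(c)]
            if kept:
                return kept
    return list(candidates)


# The verb in the slide's example, preceded by a preposition and a negative particle.
verb_candidates = [
    Candidate("IV", "INDICATIVE"),
    Candidate("IV", "SUBJUNCTIVE"),
    Candidate("IV", "JUSSIVE"),
]
chosen = context_level_filter(["PREP", "NEG_PART"], verb_candidates)
```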

21 3. Memory-based level. The morphological features of an ambiguous word, along with its context, are stored together with their occurrence frequency. EX: تعددت استخدامات التليفون المحمول. The words تعددت (past verb), استخدامات (noun) and التليفون (noun) are resolved, but المحمول has 2 tags: noun and adjective. The memory-based level compares Freq(Past Verb + Noun + Noun + Adjective) with Freq(Past Verb + Noun + Noun + Noun). After the word-based, context-based and memory-based levels, BAMAE gives one solution for each word.
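
The frequency comparison on this slide amounts to counting tag sequences in the training data and choosing the candidate tag whose sequence is seen most often. The following is a minimal sketch under that reading; the function names, the 4-tag window, the tag labels and the toy training sequences are all invented for illustration and are not figures from the presentation.

```python
from collections import Counter


def build_memory(tagged_sentences, n=4):
    """Count every run of n consecutive tags in the training data."""
    memory = Counter()
    for tags in tagged_sentences:
        for i in range(len(tags) - n + 1):
            memory[tuple(tags[i:i + n])] += 1
    return memory


def memory_level_choice(memory, left_tags, candidate_tags):
    """Pick the candidate tag whose full tag sequence is most frequent."""
    return max(candidate_tags, key=lambda t: memory[tuple(left_tags) + (t,)])


# Toy training data (hypothetical counts, only to make the example run).
training = [
    ["PV", "NOUN", "NOUN", "ADJ"],
    ["PV", "NOUN", "NOUN", "ADJ"],
    ["PV", "NOUN", "NOUN", "NOUN"],
]
memory = build_memory(training)

# The last word of the slide's example has two candidate tags.
best = memory_level_choice(memory, ["PV", "NOUN", "NOUN"], ["NOUN", "ADJ"])
```

With these toy counts the adjective sequence occurs twice and the noun sequence once, so `best` is `"ADJ"`; on a tie, `max` keeps the first candidate listed.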

22 BAMAE: A final view
- Each word has 17 pieces of linguistic information, namely: word, lemma, vocalization, gloss, prefix1, prefix2, prefix3, stem, suffix1, suffix2, gender, number, definiteness, case, Arabic stem, stem pattern and root.
- Each word is indexed with its meta information.
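
For readers who want to picture the 17-field output, one possible in-memory layout is sketched below. The class name, field spellings, string types and empty-string defaults are assumptions for illustration; the slide gives only the names of the fields, not BAMAE's actual storage format.

```python
from dataclasses import dataclass, fields


@dataclass
class BamaeRecord:
    """Hypothetical container for the 17 pieces of information per word."""
    word: str
    lemma: str = ""
    vocalization: str = ""
    gloss: str = ""
    prefix1: str = ""
    prefix2: str = ""
    prefix3: str = ""
    stem: str = ""
    suffix1: str = ""
    suffix2: str = ""
    gender: str = ""
    number: str = ""
    definiteness: str = ""
    case: str = ""
    arabic_stem: str = ""
    stem_pattern: str = ""
    root: str = ""


FIELD_NAMES = [f.name for f in fields(BamaeRecord)]

# Partial record using the root given earlier in the slides for this word.
record = BamaeRecord(word="محمد", root="Hmd")
```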

23

24 Evaluation. Testing data: 60,000 words, one solution for each word. Precision & Recall: 0.87, 0.83.
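
The slide does not say how precision and recall were counted, so the sketch below uses the standard definitions as an assumption: precision is the share of proposed solutions that are correct, and recall is the share of gold-standard solutions that were recovered. The function names are hypothetical, and reading 0.87 as precision and 0.83 as recall follows the order on the slide, which is also an assumption.

```python
def precision_recall(correct: int, proposed: int, gold: int):
    """Standard definitions (assumed, not confirmed by the slides)."""
    precision = correct / proposed if proposed else 0.0
    recall = correct / gold if gold else 0.0
    return precision, recall


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Combining the two reported figures, assuming precision = 0.87 and recall = 0.83.
combined = f1_score(0.87, 0.83)
```

Under that assumed reading, the two figures combine to an F1 of roughly 0.85.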

25 Conclusion
- Results are promising using a rule-based approach.
- Future plan:
  1. Increase the training data size.
  2. Enhance the Arabic linguistic rules for disambiguation.
  3. Adopt language modeling tools (SRILM).
- The system will be released soon on the Bibalex website: www.bibalex.org/UNL
- Bibalex Enhancer is built on top of BAMA instead of building another analyzer from scratch.

26

