1
Prague Arabic Dependency Treebank
MorphoTrees of Arabic and Their Annotation in the TrEd Environment
Otakar Smrž, Petr Pajas
Center for Computational Linguistics
Institute of Formal and Applied Linguistics
Charles University in Prague
2
MorphoTrees … TrEd … ???
MorphoTrees mean turning unorganized sets of complex morphological analyses into hierarchies
  Intuitive, decision-efficient, multi-purpose, interesting
  In general, not limited to the language, nor the system of morphology, nor the levels, nor the implementation
TrEd is a fully programmable graphical editor for tree-like graphs and an excellent suite of tools for data batch processing (local/network)
  Analytical and tectogrammatical dependency annotation
  Viewing and converting of Arabic phrase-structure trees
  Evaluating and merging of parser/tagger/human results
September 22, 2004 MorphoTrees of Arabic and Their Annotation in the TrEd Environment
3
MorphoTrees in TrEd
  Files with two types of trees
  Criteria & restrictions
  Automatic decisions
  Hiding modes
  Viewing options
  Short-cut keys & mouse
  Consistency checks
  Processing & update macros
4
Arabic … the Questions
Is there a syntactic difference between sawfa ′arā ′abā ′Aḥmada and sa′as′alu wālidahu?
Is there a morphological difference?
  The only difference is in the use of lexical units and morphs.
  The grammatical categories are unchanged, and morphology and syntax should clearly show this.
How do we find syntactic units?
How do we get back word-forms from the lexical units and tags?
How much does an improper morphological reading disturb the subsequent syntactic representation?
  Improper in tags, lemmas, diacritics, or in tokenization?
5
Reminder of the Terms
Grapheme / Phoneme
  The least units capable of distinguishing meanings
  ~ 40 letters, context-dependent forms
  28 consonants, 6 vowels
Morph
  Composition of graphemes / phonemes
  Abstract derivational forms
Morpheme
  The least unit representing some linguistic meaning
  Function of morphs
  Projection of grammatical categories
Token
  The least syntactic unit
  Bearer of a uniform vector of grammatical categories
6
Tim Buckwalter’s Morphology
PADT MorphoTrees are generated based on the information provided by the Buckwalter Arabic Morphological Analyzer
  + Updateable stem-based lexicon, finite-state model, implementation in Perl, published under the GNU GPL
  – Morphs, mapping only to Quasi-Functional Morphology
The tokenization, clustering, modeling of conditionality, …
(wabijAnibihA) [jAnib_1] wa/CONJ + bi/PREP + jAnib/NOUN + i/CASE_DEF_GEN + hA/POSS_PRON_3FS
  C            wa        CONJ                and
  P            bi        PREP                at
  N R          jAnib+i   NOUN+CASE_DEF_GEN   side of
  S----3FS2-   hA        POSS_PRON_3FS       her
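The step from the analyzer's flat morph string to the token rows shown on this slide can be sketched as follows. This is a minimal illustration, not the PADT implementation: the function names `parse_morphs` and `group_tokens` are hypothetical, and the grouping rule (a morph whose tag starts with `CASE_` attaches to the preceding morph) is an assumption inferred only from this single example.

```python
# Sketch only: parse a Buckwalter-style analysis and cluster morphs into tokens.
# The CASE_ attachment rule is an assumption made for this one example.

def parse_morphs(analysis):
    """Split 'wa/CONJ + bi/PREP' into [('wa', 'CONJ'), ('bi', 'PREP')]."""
    morphs = []
    for part in analysis.split(" + "):
        form, tag = part.rsplit("/", 1)
        morphs.append((form, tag))
    return morphs

def group_tokens(morphs, attach_prefixes=("CASE_",)):
    """Merge morphs whose tag starts with an attaching prefix into the previous token."""
    tokens = []
    for form, tag in morphs:
        if tokens and tag.startswith(attach_prefixes):
            prev_form, prev_tag = tokens[-1]
            tokens[-1] = (prev_form + "+" + form, prev_tag + "+" + tag)
        else:
            tokens.append((form, tag))
    return tokens

ANALYSIS = "wa/CONJ + bi/PREP + jAnib/NOUN + i/CASE_DEF_GEN + hA/POSS_PRON_3FS"
tokens = group_tokens(parse_morphs(ANALYSIS))

for form, tag in tokens:
    print(form, tag)
```

With the slide's analysis as input, the four printed rows match the form and tag columns of the table above; the positional tags and glosses would need information that the morph string alone does not carry.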
7
Xerox Morphological Analyzer
8
MorphoTrees Hierarchy
MorphoTrees of Arabic propose these levels
  Entity – the analyzed elements of the discourse
  Partitioning to the standard forms of the tokens
  Non-vocalized standard orthographical forms
  Lemmas/identifiers of lexical units
  Tokens – syntactic units including the form and the tag
Independence of the language / implementation
  More/different levels, inclusion of spelling variations, …
  Annotation of various tagsets, other features of tokens
Efficiency of decision-making
  Distance between analyses becomes recognized
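Turning a flat set of analyses into the levels listed on this slide can be sketched as a grouping by successive keys. The code below is an illustration under stated assumptions, not the TrEd data format: the record fields, the function names `build_hierarchy` and `count_leaves`, and the partition string are hypothetical. Only the genitive reading of jAnib_1 comes from the slides; the accusative and nominative readings are added here purely to give the tree more than one leaf.

```python
# Sketch only: group flat analysis records into a nested hierarchy.
# Levels below the entity, as on the slide: partition -> form -> lemma -> token.
LEVELS = ("partition", "form", "lemma", "token")

# Records for one token of the entity. Only the GEN reading is from the slides;
# the ACC and NOM readings are illustrative assumptions.
ANALYSES = [
    {"partition": "w b jAnb hA", "form": "jAnb", "lemma": "jAnib_1",
     "token": "jAnib+i NOUN+CASE_DEF_GEN"},
    {"partition": "w b jAnb hA", "form": "jAnb", "lemma": "jAnib_1",
     "token": "jAnib+a NOUN+CASE_DEF_ACC"},
    {"partition": "w b jAnb hA", "form": "jAnb", "lemma": "jAnib_1",
     "token": "jAnib+u NOUN+CASE_DEF_NOM"},
]

def build_hierarchy(analyses, levels=LEVELS):
    """Nest records level by level; analyses sharing a prefix share a path."""
    tree = {}
    for record in analyses:
        node = tree
        for level in levels:
            node = node.setdefault(record[level], {})
    return tree

def count_leaves(tree):
    """Count token leaves; a node with no children is a leaf."""
    if not tree:
        return 1
    return sum(count_leaves(child) for child in tree.values())

tree = build_hierarchy(ANALYSES)
print(count_leaves(tree), "readings under", list(tree))
```

Because the three readings share the same partition, form, and lemma, they differ only at the token level. This is one way to read the slide's point that the distance between analyses becomes visible in the tree.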
9
MorphoTrees Annotation
Selecting the leaves that correspond to the proper reading of the tokens constituting the entity
Quick use of keyboard and/or mouse for annotations
Restricting the tree according to the criteria/categories required by the context
Natural control over the inheritance of restrictions
Employing automatic restrictions and annotation actions, both generic and linguistic
Learning about the discriminative categories and “human tagging”
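Restriction, inheritance, and automatic decisions can be sketched in a few lines. This is a hedged illustration rather than the TrEd macros: `restrict`, `inherit`, and `auto_select` are hypothetical names, criteria are modeled as substrings of Buckwalter-style tags, and the accusative and nominative leaves are assumed readings added for contrast.

```python
# Sketch only: restrict candidate leaves by required categories and
# select automatically when exactly one leaf stays visible.

# Candidate leaves as (form, tag). Only the GEN reading is from the slides.
LEAVES = [
    ("jAnib+i", "NOUN+CASE_DEF_GEN"),
    ("jAnib+a", "NOUN+CASE_DEF_ACC"),
    ("jAnib+u", "NOUN+CASE_DEF_NOM"),
]

def restrict(leaves, required):
    """Keep the leaves whose tag contains every required category string."""
    return [leaf for leaf in leaves if all(cat in leaf[1] for cat in required)]

def inherit(parent_required, own_required):
    """A node's effective restrictions: those of its parent plus its own."""
    return tuple(parent_required) + tuple(own_required)

def auto_select(leaves, required):
    """Return the single remaining leaf, or None if a human must decide."""
    visible = restrict(leaves, required)
    return visible[0] if len(visible) == 1 else None

# A restriction set higher in the tree (NOUN) is inherited, then narrowed (GEN).
criteria = inherit(("NOUN",), ("GEN",))
print(auto_select(LEAVES, criteria))
print(auto_select(LEAVES, ("NOUN",)))
```

The first call leaves a single visible leaf, so it can be annotated without further input; the second leaves three candidates and returns `None`, which corresponds to the annotator choosing by keyboard or mouse.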
10
Discussion and Conclusion
MorphoTrees
  Important in morphological annotation and in evaluation
  PADT 1.0 provides annotated tokens
Functional Morphology
  … more in Prague Arabic Dependency Treebank: Development in Data and Tools
  Even its approximation is promising and welcome
Feature-Based Tagger trained on Penn ATB 2
  3.6% error rate in major part-of-speech (15 values)
  10.8% in the full tagset (317 evidenced combinations)
  0.8–0.6% error rate in tokenization of the input