Arabic Language Computing applied to the Quran - a PhD research project by Kais Dukes I-AIBS Institute for Artificial Intelligence and Biological Systems School of Computing University of Leeds
The Challenge: An interdisciplinary approach to understanding the Quran - (1) Quranic Studies - (2) Traditional Arabic Linguistics - (3) Computational Linguistics
(1) What is the Quran? The last in a series of 5 religious texts
Holy Book | Prophet | Text Dated
Suhuf Ibrahim (Scrolls) | Abraham | ?
The Tawrat (Torah) | Moses | 1500 BCE?
The Zabur (Psalms) | David | 1000 BCE?
The Injil (Gospel) | Jesus | 1 CE
The Quran | Muhammad (PBUH) | 610-632 CE
(1) What is the Quran? The central religious text of Islam - Classical Arabic, 1300+ years ago - All believers should learn the text; translations are interpretations - Islamic Law (legal logic) - Divine guidance & direction - Science and philosophy - Has inspired Algebra, Linguistics
(2) Traditional Arabic Linguistics Originated with Arabs studying the language of the Quran (scientific analysis for at least 1000 years, a lot older than the study of the English language): - Orthography (diacritics and vowelization) - Etymology (Semitic roots) - Morphology (derivation and inflection) - Syntax (origins of dependency grammar) - Discourse Analysis & Rhetoric - Semantics & Pragmatics
(3) Computational Linguistics - The Quran is online, for keyword search - BUT verse-by-verse translations are interpretations - Muslims should access the true Classical Arabic source
(3) Computational Linguistics Example question-answering dialog system: Question: How long should I breastfeed my child for? Answer: Mothers should suckle their offspring for two years, if the father wishes to complete the term (The Holy Quran, Verse 2:233). - How far can we go? - Is an Artificial Intelligence system realistic?
An AI approach to understanding the Quran Central Hypothesis Augmenting the text of the Quran with rich annotation will lead to a more accurate AI system. - Prepare the data by annotating the Quran. - Use the data to build an AI system for concept search and question-answering.
Annotating the Quran Challenges Orthography - Complex non-standard script Morphology (word structure) - Arabic is highly inflected, challenging to analyze Grammar - Phrase structure, dependency Semantics – Ontology of Entities and Concepts referred to by pronouns and nouns
Annotating the Quran Solutions - Computing advances have made annotation possible, to high accuracy - Leverage existing resources from Traditional Arabic Grammar - Machine-Learning annotation followed by manual verification - Community effort using online volunteers
Recent Advances: Orthography An accurate digital copy of the Quran? - Google Search for verse (68:38) on Jan 21, 2008 shows many typos - Encoding issues: missing diacritics; simplified script (not Uthmani); Windows code page 1256, not Unicode
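The encoding issues listed on this slide can be illustrated with a minimal Python sketch. It uses only the standard library; the helper names `strip_diacritics` and `can_encode` are hypothetical, and the sample word is the vowelized form that appears later in the deck as faja'alnāhumu, written in simple (non-Uthmani) script. The dagger alif (U+0670) stands in here for the extra characters Uthmani script needs.

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    """Drop combining marks (fatha, damma, sukun, ...), i.e. the 'missing diacritics' case."""
    return "".join(ch for ch in text if not unicodedata.combining(ch))

def can_encode(text: str, codec: str) -> bool:
    """Return True if every character of text exists in the given legacy code page."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

# Vowelized word in simple script (hypothetical sample input).
vowelized = ("\u0641\u064e\u062c\u064e\u0639\u064e\u0644\u0652"
             "\u0646\u064e\u0627\u0647\u064f\u0645\u064f")
bare = strip_diacritics(vowelized)

# Basic letters and vowel marks fit in Windows code page 1256 ...
fits_simple = can_encode(vowelized, "cp1256")
# ... but the dagger alif used by Uthmani script has no slot there.
fits_dagger_alif = can_encode("\u0670", "cp1256")
```

Under these assumptions, a text stored in code page 1256 can keep its basic vowel marks but cannot carry the Uthmani-specific marks, which is one reason a Unicode source matters.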
Recent Advances: Orthography Tanzil Project (http://tanzil.info) - Stable version released May 2008 - Uses Unicode XML encoding, including the special characters designed for the complex Arabic script of the Quran - Manually verified to 100% accuracy by a group of experts who have memorized the entire text of the Quran
Recent Advances: Orthography Java Quran API (http://jqurantree.org) (Dukes 2009) - Java classes for querying the Tanzil XML of the Quran - Gives authentic script on web pages
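As a rough illustration of what a Unicode XML encoding makes possible, the sketch below indexes verses by (chapter, verse). The element and attribute names (`sura`, `aya`, `index`, `text`) and the inline sample are assumptions made for this example, not a specification of the Tanzil file, and `load_verses` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Assumed layout: <quran><sura index=".."><aya index=".." text=".."/></sura></quran>
SAMPLE = """<quran>
  <sura index="1" name="الفاتحة">
    <aya index="1" text="بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ"/>
  </sura>
</quran>"""

def load_verses(xml_text: str) -> dict:
    """Map (chapter, verse) to the verse text, under the assumed layout above."""
    root = ET.fromstring(xml_text)
    verses = {}
    for sura in root.iter("sura"):
        chapter = int(sura.get("index"))
        for aya in sura.iter("aya"):
            verses[(chapter, int(aya.get("index")))] = aya.get("text")
    return verses

verses = load_verses(SAMPLE)
words = verses[(1, 1)].split()  # whitespace-separated words of verse 1:1
```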
Recent Advances: Morphology - Buckwalter Arabic Morphological Analyzer (Tim Buckwalter, 2002) - Morphological Analysis of the Quran at the University of Haifa (Shuly Wintner, 2004) - Lexeme & feature based morphological representation of Arabic (Nizar Habash, 2006)
The Haifa Corpus (2004) - Multiple analyses for each word (up to 5), e.g. rbb+fa&l+Noun+Triptotic+Masc+Sg+Pron+Dependent+1P+Sg and rbb+fa&l+Noun+Triptotic+Masc+Sg+Gen - Not manually verified; the authors report an F-measure of 86% - Non-standard annotation scheme, not familiar to traditional Arabic linguists, e.g. extracting a list of all verbs is non-trivial - Arabic text is encoded only phonetically instead of in the original Arabic script, e.g. searching for a specific root is not easy
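To make the difficulty concrete, here is a small Python sketch that reads the two analysis strings quoted on this slide. It assumes, from those two examples only, that fields are separated by `+`, that the first field is the root and the second the pattern; `parse_analysis` and `possible_tags` are hypothetical helpers, not part of the Haifa tools.

```python
def parse_analysis(analysis: str) -> dict:
    """Split one '+'-separated analysis into root, pattern and remaining features."""
    root, pattern, *features = analysis.split("+")
    return {"root": root, "pattern": pattern, "features": features}

def possible_tags(analyses: list, tagset: set) -> set:
    """Collect every part-of-speech-like feature proposed by any candidate analysis."""
    found = set()
    for analysis in analyses:
        found.update(f for f in parse_analysis(analysis)["features"] if f in tagset)
    return found

# The two candidate analyses shown on the slide for a single word.
candidates = [
    "rbb+fa&l+Noun+Triptotic+Masc+Sg+Pron+Dependent+1P+Sg",
    "rbb+fa&l+Noun+Triptotic+Masc+Sg+Gen",
]
first = parse_analysis(candidates[0])
tags = possible_tags(candidates, {"Noun", "Verb", "Pron"})
```

Because a word can carry several unverified candidates, a query such as "list all verbs" has to decide whether any or all candidates must agree, which is part of why the slide calls it non-trivial.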
The Quranic Arabic Corpus http://corpus.quran.com/ Kais Dukes, Arabic Language Computing Applied to the Quran, PhD (part-time) - Word structure: colour-coded morphological analysis - Translation: word-for-word English translations - Grammar: dependency parse following Arabic tradition - Semantics: ontology of entities and concepts - Machine Learning: annotations used for A.I. training - Impact: dozens of researchers have collaborated/cited, and a million visitors have used the website this year
The Quranic Arabic Corpus Verified Uthmani Script - Unicode Uthmani Script - Sourced from the verified Tanzil project
The Quranic Arabic Corpus Phonetics (faja'alnāhumu) - Phonetic transcription generated algorithmically - Guided by Arabic vowelized diacritics
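A toy version of diacritic-guided transcription can be sketched in Python as follows. This is not the corpus algorithm: the letter table covers only the letters of the example word, the input is simple (non-Uthmani) script, and the rule "short vowel followed by its carrier letter gives a long vowel" is a simplifying assumption.

```python
CONSONANTS = {
    "\u0641": "f", "\u062c": "j", "\u0639": "'", "\u0644": "l",
    "\u0646": "n", "\u0647": "h", "\u0645": "m",
}
SHORT_VOWELS = {"\u064e": "a", "\u064f": "u", "\u0650": "i"}
# A vowel mark followed by its carrier letter (alif, waw, ya) is read as long.
LONG_VOWELS = {
    ("\u064e", "\u0627"): "ā",
    ("\u064f", "\u0648"): "ū",
    ("\u0650", "\u064a"): "ī",
}
SUKUN = "\u0652"  # marks a consonant with no following vowel

def transcribe(word: str) -> str:
    """Transcribe a fully vowelized word, letting the diacritics choose the vowels."""
    out = []
    i = 0
    while i < len(word):
        ch = word[i]
        nxt = word[i + 1] if i + 1 < len(word) else ""
        after = word[i + 2] if i + 2 < len(word) else ""
        if (ch, nxt) in LONG_VOWELS and after not in SHORT_VOWELS:
            out.append(LONG_VOWELS[(ch, nxt)])
            i += 2
        elif ch in SHORT_VOWELS:
            out.append(SHORT_VOWELS[ch])
            i += 1
        elif ch == SUKUN:
            i += 1
        elif ch in CONSONANTS:
            out.append(CONSONANTS[ch])
            i += 1
        else:
            raise ValueError(f"letter not covered by this sketch: {ch!r}")
    return "".join(out)

word = ("\u0641\u064e\u062c\u064e\u0639\u064e\u0644\u0652"
        "\u0646\u064e\u0627\u0647\u064f\u0645\u064f")
result = transcribe(word)  # "faja'alnāhumu"
```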
The Quranic Arabic Corpus Interlinear translation - Word-for-word translation from accepted sources - Interlinear translation scheme
The Quranic Arabic Corpus Location Reference (21:70:4) - Common standard for verses (Chapter:Verse) - Extended in the QAC corpus to include word numbers and segment numbers, e.g. (21:70:4:2)
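The extended reference scheme maps naturally onto a small record type. The sketch below is one possible Python representation; the names `Location` and `parse_location` are hypothetical and not taken from the corpus software.

```python
from typing import NamedTuple, Optional

class Location(NamedTuple):
    chapter: int
    verse: int
    word: Optional[int] = None      # word number within the verse
    segment: Optional[int] = None   # segment number within the word

def parse_location(ref: str) -> Location:
    """Parse '(21:70:4:2)', '21:70:4' or '2:233' into a Location."""
    parts = ref.strip().strip("()").split(":")
    if not 2 <= len(parts) <= 4:
        raise ValueError(f"expected 2 to 4 colon-separated numbers: {ref!r}")
    return Location(*(int(p) for p in parts))

segment_ref = parse_location("(21:70:4:2)")
verse_ref = parse_location("2:233")
```

Keeping the word and segment fields optional lets the same type hold a plain (Chapter:Verse) reference and the extended form.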
The Quranic Arabic Corpus Morphological Segmentation - Division of a single word into multiple segments - Part-of-speech tag assigned to each segment - Traditional Arabic Grammar rules used for division
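A segmented word can be modelled as a list of tagged segments. In the Python sketch below, the class names are hypothetical, and the segmentation and tags shown for faja'alnāhumu are illustrative only (chosen from the part-of-speech tagset in this deck), not copied from the corpus annotation.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Segment:
    form: str   # phonetic form of the segment
    tag: str    # part-of-speech tag assigned to this segment

@dataclass
class Word:
    location: str                       # e.g. "21:70:4"
    segments: List[Segment] = field(default_factory=list)

    def form(self) -> str:
        """Rejoin the segments into the surface form of the word."""
        return "".join(s.form for s in self.segments)

    def segment_location(self, number: int) -> str:
        """Build the extended reference for the 1-based segment number."""
        if not 1 <= number <= len(self.segments):
            raise IndexError(number)
        return f"{self.location}:{number}"

# Illustrative segmentation; the tag choices are assumptions for this example.
word = Word("21:70:4", [
    Segment("fa", "CONJ"),
    Segment("ja'al", "V"),
    Segment("nā", "PRON"),
    Segment("humu", "PRON"),
])
```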
The Quranic Arabic Corpus Morphological segment features
The Quranic Arabic Corpus Arabic Grammar Summary
The Quranic Arabic Treebank Syntactic Annotation - Dependency Grammar based on إعراب (i'rāb) - Syntactico-semantic roles for each word
The Quranic Arabic Treebank Ontology of entities and concepts - linked to/from nouns and pronouns in the text
The Quranic Arabic Treebank Framework for collaboration Message Board: If you come across a word and you feel that a better analysis could be provided, you can suggest a correction online by clicking on an Arabic word (currently 5228 resolved messages; 1048 under review) Resources: Publications; Citations, Reviews, FAQs, Feedback, Data Download, Software download, Mailing list
The Quranic Arabic Treebank Users: researchers, public - Artificial Intelligence and Computational Linguistics - Arabic linguistics - Quranic and Islamic Studies - Classical literature analysis - Anyone who wants to appreciate the Quran
The Quranic Arabic Treebank What is new for Computational Linguistics? - First Treebank of Classical Arabic - Free Treebank of the Quran - First formal representation of Traditional Arabic Grammar using constituency/dependency graphs - Machine-Learning parser
The Quranic Arabic Corpus Part-of-speech Tagging
Part-of-speech Tag | Name | Arabic Name
N | Noun | اسم
PN | Proper noun | اسماء علم
PRON | Personal pronoun | ضمير
DEM | Demonstrative pronoun | اسم اشارة
REL | Relative pronoun | اسم موصول
ADJ | Adjective | صفة
V | Verb | فعل
P | Preposition | حرف جر
PART | Particle | حرف
INTG | Interrogative particle | حرف استفهام
VOC | Vocative particle | حرف نداء
NEG | Negative particle | حرف نفي
FUT | Future particle | حرف استقبال
CONJ | Conjunction | حرف عطف
NUM | Number | رقم
T | Time adverb | ظرف زمان
LOC | Location adverb | ظرف مكان
EMPH | Emphatic lām prefix | لام التوكيد
PRP | Purpose lām prefix | لام التعليل
IMPV | Imperative lām prefix | لام الامر
INL | Quranic initials | حروف مقطعة
- Part-of-speech tags adapted from Traditional Arabic Grammar, and mapped to English equivalents (not the other way around)
- These tags apply to words in the Quran, as well as to individual morphological segments in the text
Automatic Annotation: Classical Arabic Dependency Parser - Joakim Nivre (2009): dependency parsing using a shift/reduce queue/stack architecture with machine learning - Following a similar architecture, but with hand-written rules, the custom parser has an F-measure of 77.2%
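The queue/stack idea with hand-written rules can be sketched in a few lines of Python. This is a generic shift/reduce skeleton, not the custom parser described on the slide: the rule table, the arc labels (`subj`, `gen`, `pp`) and the four-tag input are all hypothetical, and only the tag names N, V and P come from the deck's tagset.

```python
from collections import deque

def parse(tags, rules):
    """Greedy shift/reduce dependency parse driven by (head_tag, dependent_tag) rules.

    Returns arcs as (head_index, dependent_index, label).
    """
    stack, queue, arcs = [], deque(range(len(tags))), []
    while True:
        if len(stack) >= 2:
            below, top = stack[-2], stack[-1]
            left = rules.get((tags[top], tags[below]))    # top governs below
            right = rules.get((tags[below], tags[top]))   # below governs top
            # Delay a right arc while the top item may still collect a dependent.
            top_may_grow = bool(queue) and (tags[top], tags[queue[0]]) in rules
            if left is not None:
                arcs.append((top, below, left))
                del stack[-2]
                continue
            if right is not None and not top_may_grow:
                arcs.append((below, top, right))
                stack.pop()
                continue
        if not queue:
            break
        stack.append(queue.popleft())   # shift the next word onto the stack
    return arcs

# Hypothetical rule table: which tag may govern which, and with what label.
RULES = {("V", "N"): "subj", ("P", "N"): "gen", ("V", "P"): "pp"}

arcs = parse(["V", "N", "P", "N"], RULES)
```

In a learned parser of the kind attributed to Nivre on the slide, a trained classifier would choose between shift and the two arc actions; here a lookup table plays that role.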
University of Leeds Postgraduate Researcher Conference 2011 Criteria for PGR Researcher of the Year 2011 - Ability to communicate research to the lay and non-specialist research audience - Impact/potential impact of the research in terms of e.g. application of findings for economic or social benefit; the significance of the contribution/potential contribution of the research to the academic subject area - Evidence of local or national publicity or public engagement
Ability to communicate research to the lay and non-specialist audience Example Feedback (319 comments) - "I would like to applaud you for your effort" - Prof Behnam Sadeghi, Stanford University - "We are big admirers of the work" - Prof Gregory Crane, Classics Dept, Tufts University - "I regularly use your work on the Qur'an and read it whenever I can." - Prof Yousuf Islam, Director, Daffodil International University - "Congratulations to all concerned on this project" - Prof Michael Arthur, VC, Leeds Uni
Impact: application of findings for economic or social benefit Over a million users already, and growing; many unforeseen social benefits, e.g.: "I work as a chaplain in correctional centers in the State of Missouri, U.S.A. Thanks for your permission to use the Quranic Arabic Corpus in these correctional centers" - Tadar Wazir
Impact: significance of the research to the academic subject area - 10 papers in research conferences & journals - 25 citations (from Google Scholar) - so far... - Positive feedback from top researchers - Only free-to-download Arabic treebank - A de-facto standard data-set for AI research
Evidence of local or national publicity or public engagement Newspapers, e.g. Muslim Post; better still: Website - world-wide public engagement!
Conclusion This is not the end. To come: 2nd half of PhD project; and more? Kais Dukes I-AIBS Institute for Artificial Intelligence and Biological Systems School of Computing University of Leeds