Presentation on theme: "I-AIBS Institute for Artificial Intelligence and Biological Systems"— Presentation transcript:
1 I-AIBS Institute for Artificial Intelligence and Biological Systems
Arabic Language Computing applied to the Quran - a PhD research project by Kais Dukes
School of Computing, University of Leeds
3 The Challenge: An interdisciplinary approach to understanding the Quran
(1) Quranic Studies
(2) Traditional Arabic Linguistics
(3) Computational Linguistics
4 (1) What is the Quran? The last in a series of 5 religious texts
Holy Book | Prophet | Text Dated
Suhuf Ibrahim (Scrolls) | Abraham | ?
The Tawrat (Torah) | Moses | 1500 BCE?
The Zabur (Psalms) | David | 1000 BCE?
The Injil (Gospel) | Jesus | 1 CE
The Quran | Muhammad (PBUH) | CE
5 (1) What is the Quran? The central religious text of Islam
- Classical Arabic, years ago
- All believers should learn the text; translations are “interpretations”
- Islamic Law (legal logic)
- Divine guidance & direction
- Science and philosophy
- Has inspired Algebra, Linguistics
6 (2) Traditional Arabic Linguistics
Originated with Arabs studying the language of the Quran (scientific analysis for at least 1000 years, far longer than the English language has been studied):
- Orthography (diacritics and vowelization)
- Etymology (Semitic roots)
- Morphology (derivation and inflection)
- Syntax (origins of dependency grammar)
- Discourse Analysis & Rhetoric
- Semantics & Pragmatics
7 (3) Computational Linguistics
- Quran is online, for keyword search
- BUT verse-by-verse translations are interpretations
- Muslims should access the “true” Classical Arabic source
8 (3) Computational Linguistics
- How far can we go?
- Is an Artificial Intelligence system realistic?
Example question-answering dialog system:
Question: How long should I breastfeed my child for?
Answer: Mothers should suckle their offspring for two years, if the father wishes to complete the term (The Holy Quran, Verse 2:233).
9 An AI approach to understanding the Quran
Central Hypothesis: Augmenting the text of the Quran with rich annotation will lead to a more accurate AI system.
- Prepare the data by annotating the Quran.
- Use the data to build an AI system for concept search and question-answering.
10 Annotating the Quran: Challenges
- Orthography - Complex non-standard script
- Morphology (word structure) - Arabic is highly inflected, challenging to analyze
- Grammar - Phrase structure, dependency
- Semantics - Ontology of Entities and Concepts referred to by pronouns and nouns
11 Annotating the Quran: Solutions
- Computing advances have made annotation possible, to high accuracy
- Leverage existing resources from Traditional Arabic Grammar
- Machine-Learning annotation followed by manual verification
- Community effort using online volunteers
12 Recent Advances: Orthography
An accurate digital copy of the Quran?
Encoding Issues:
- Missing diacritics
- Simplified script (not Uthmani)
- Windows code page 1256, not Unicode
- Google Search for verse (68:38) on Jan 21, 2008 shows many typos
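The encoding issues above can be illustrated with a short Python sketch. It uses only the standard library (`unicodedata` and the built-in `cp1256` codec). The sample word and the helper names are assumptions chosen for illustration; they are not taken from the slides.

```python
import unicodedata


def has_diacritics(text: str) -> bool:
    """True if the text carries combining marks such as Arabic vowel signs."""
    return any(unicodedata.combining(ch) for ch in text)


def strip_diacritics(text: str) -> str:
    """Drop combining marks, which is what an unvowelized copy has lost."""
    return "".join(ch for ch in text if not unicodedata.combining(ch))


def fits_cp1256(text: str) -> bool:
    """True if every character exists in Windows code page 1256."""
    try:
        text.encode("cp1256")
        return True
    except UnicodeEncodeError:
        return False


# Illustrative vowelized word: beh + kasra, seen + sukun, meem + kasra.
VOWELIZED = "\u0628\u0650\u0633\u0652\u0645\u0650"
# Superscript alef (U+0670), one of the marks used in Uthmani script.
SUPERSCRIPT_ALEF = "\u0670"

legacy_bytes = VOWELIZED.encode("cp1256")  # legacy single-byte storage
recovered = legacy_bytes.decode("cp1256")  # the right codec restores the text
garbled = legacy_bytes.decode("latin-1")   # the wrong codec yields mojibake

print("diacritics present:", has_diacritics(VOWELIZED))
print("after stripping:", has_diacritics(strip_diacritics(VOWELIZED)))
print("superscript alef fits cp1256:", fits_cp1256(SUPERSCRIPT_ALEF))
```

The point of the sketch: a legacy code page can hold the basic vowel marks but not every character of the Uthmani script, and bytes read with the wrong codec no longer match the source text.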
13 Recent Advances: Orthography
Tanzil Project (http://tanzil.info)
- Stable version released May 2008
- Uses Unicode XML encoding, including the special characters designed for the complex Arabic script of the Quran
- Manually verified to 100% accuracy by a group of experts who have memorized the entire text of the Quran
14 Recent Advances: Orthography
Java Quran API (http://jqurantree.org) (Dukes 2009)
- Java classes for querying the Tanzil XML of the Quran
- Gives authentic script on web-pages
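As a rough idea of what querying verse-level XML involves, here is a minimal Python sketch. It is not the Java Quran API. The element and attribute names (`sura`, `aya`, `index`, `text`) are an assumed layout, and the sample document uses placeholder strings rather than real verse text.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in an assumed sura/aya layout; the text values are placeholders.
SAMPLE_XML = """<quran>
  <sura index="1" name="Al-Fatiha">
    <aya index="1" text="placeholder text for 1:1"/>
    <aya index="2" text="placeholder text for 1:2"/>
  </sura>
  <sura index="2" name="Al-Baqara">
    <aya index="1" text="placeholder text for 2:1"/>
  </sura>
</quran>"""


def load_verses(xml_text: str) -> dict:
    """Index every verse by its (chapter, verse) pair."""
    root = ET.fromstring(xml_text)
    verses = {}
    for sura in root.iter("sura"):
        chapter = int(sura.get("index"))
        for aya in sura.iter("aya"):
            verses[(chapter, int(aya.get("index")))] = aya.get("text")
    return verses


verses = load_verses(SAMPLE_XML)
print(verses[(1, 2)])
```

Keying the dictionary on (chapter, verse) mirrors the usual way verses are cited.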
15 Recent Advances: Morphology
- Buckwalter Arabic Morphological Analyzer (Tim Buckwalter, 2002)
- Morphological Analysis of the Quran at the University of Haifa (Shuly Wintner, 2004)
- Lexeme & feature based morphological representation of Arabic (Nizar Habash, 2006)
16 The Haifa Corpus (2004)
- Multiple analyses for each word (up to 5), e.g.:
  rbb+fa&l+Noun+Triptotic+Masc+Sg+Pron+Dependent+1P+Sg
  rbb+fa&l+Noun+Triptotic+Masc+Sg+Gen
- Not manually verified; the authors report an F-measure of 86%
- Non-standard annotation scheme, not familiar to traditional Arabic linguists (e.g. extracting a list of all verbs is non-trivial)
- Arabic text is only encoded phonetically instead of using the original Arabic (e.g. searching for a specific root is not easy)
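To show why working with these analysis strings takes extra effort, the sketch below splits the two examples from the slide into fields. It assumes, judging only from those two examples, that the first field is the root, the second is the pattern, and the rest are features. The function names are hypothetical.

```python
def parse_analysis(analysis: str) -> dict:
    """Split one '+'-separated analysis into root, pattern and feature list."""
    root, pattern, *features = analysis.split("+")
    return {"root": root, "pattern": pattern, "features": features}


def any_analysis_has(analyses: list, feature: str) -> bool:
    """A word may have several candidate analyses; check whether any carries the feature."""
    return any(feature in parse_analysis(a)["features"] for a in analyses)


# The two candidate analyses shown on the slide for a single word.
CANDIDATES = [
    "rbb+fa&l+Noun+Triptotic+Masc+Sg+Pron+Dependent+1P+Sg",
    "rbb+fa&l+Noun+Triptotic+Masc+Sg+Gen",
]

for candidate in CANDIDATES:
    print(parse_analysis(candidate))
print("could be a verb:", any_analysis_has(CANDIDATES, "Verb"))
```

Because each word can carry up to five unverified candidates, a query such as "all verbs" has to decide how to treat words where only some candidates match.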
17 The Quranic Arabic Corpus http://corpus.quran.com/
Kais Dukes, Arabic Language Computing Applied to the Quran - PhD (part-time)
- Word structure - colour-coded morphological analysis
- Translation - word-for-word English translations
- Grammar - dependency parse following Arabic tradition
- Semantics - ontology of entities and concepts
- Machine Learning - annotations used for A.I. training
- Impact - dozens of researchers have collaborated/cited, and a million visitors have used the website this year
18 The Quranic Arabic Corpus: Verified Uthmani Script
- Unicode Uthmani Script
- Sourced from the verified Tanzil project
19 The Quranic Arabic Corpus: Phonetics (faja'alnāhumu)
- Phonetic transcription generated algorithmically
- Guided by Arabic vowelized diacritics
20 The Quranic Arabic Corpus: Interlinear translation
- Word-for-word translation from accepted sources
- Interlinear translation scheme
21 The Quranic Arabic Corpus: Location Reference (21:70:4)
- Common standard for verses (Chapter:Verse)
- Extended in the QAC to include word numbers and segment numbers, e.g. (21:70:4:2)
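The reference scheme above maps naturally onto a small record type. The following Python sketch parses references with two to four parts; the class and function names are hypothetical, not part of the corpus software.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Location:
    chapter: int
    verse: int
    word: Optional[int] = None     # present in word-level references
    segment: Optional[int] = None  # present in segment-level references


def parse_location(ref: str) -> Location:
    """Parse '(chapter:verse[:word[:segment]])' into a Location."""
    parts = ref.strip().strip("()").split(":")
    if not 2 <= len(parts) <= 4:
        raise ValueError(f"expected 2 to 4 colon-separated numbers, got {ref!r}")
    return Location(*(int(part) for part in parts))


print(parse_location("(2:233)"))
print(parse_location("(21:70:4)"))
print(parse_location("(21:70:4:2)"))
```

Leaving the word and segment fields optional lets one type cover verse-level, word-level and segment-level references.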
22 The Quranic Arabic Corpus: Morphological Segmentation
- Division of a single word into multiple segments
- Part-of-speech tag assigned to each segment
- Traditional Arabic Grammar rules used for division
23 The Quranic Arabic Corpus: Morphological segment features
24 The Quranic Arabic Corpus: Arabic Grammar Summary
25 The Quranic Arabic Treebank: Syntactic Annotation
- Dependency Grammar based on إعراب (i'rāb)
- Syntactico-semantic roles for each word
26 The Quranic Arabic Treebank
Ontology of entities and concepts linked to/from nouns and pronouns in the text
27 The Quranic Arabic Treebank: Framework for collaboration
Message Board: “If you come across a word and you feel that a better analysis could be provided, you can suggest a correction online by clicking on an Arabic word” (currently 5228 resolved messages; 1048 under review)
Resources: Publications, Citations, Reviews, FAQs, Feedback, Data Download, Software download, Mailing list
28 The Quranic Arabic Treebank: Users (researchers, public)
- Artificial Intelligence and Computational Linguistics
- Arabic linguistics
- Quranic and Islamic Studies
- Classical literature analysis
- Anyone who wants to appreciate the Quran
29 The Quranic Arabic Treebank: new Computational Linguistics?
- First Treebank of Classical Arabic
- Free Treebank of the Quran
- First formal representation of Traditional Arabic Grammar using constituency/dependency graphs
- Machine-Learning parser
30 The Quranic Arabic Corpus: Part-of-speech Tagging
- Part-of-speech tags adapted from Traditional Arabic Grammar, and mapped to English equivalents (not the other way around)
- These tags apply to words in the Quran, as well as to individual morphological segments in the text
Part-of-speech Tag | Name | Arabic Name
N | Noun | اسم
PN | Proper noun | اسماء علم
PRON | Personal pronoun | ضمير
DEM | Demonstrative pronoun | اسم اشارة
REL | Relative pronoun | اسم موصول
ADJ | Adjective | صفة
V | Verb | فعل
P | Preposition | حرف جر
PART | Particle | حرف
INTG | Interrogative particle | حرف استفهام
VOC | Vocative particle | حرف نداء
NEG | Negative particle | حرف نفي
FUT | Future particle | حرف استقبال
CONJ | Conjunction | حرف عطف
NUM | Number | رقم
T | Time adverb | ظرف زمان
LOC | Location adverb | ظرف مكان
EMPH | Emphatic lām prefix | لام التوكيد
PRP | Purpose lām prefix | لام التعليل
IMPV | Imperative lām prefix | لام الامر
INL | Quranic initials | حروف مقطعة
31 Automatic Annotation: Classical Arabic Dependency Parser
- Joakim Nivre (2009): dependency parsing using a shift/reduce queue/stack architecture with machine learning
- Following a similar architecture, but with hand-written rules, the custom parser has an F-measure of 77.2%
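A toy version of a shift/reduce parser driven by hand-written rules, together with an F-measure over dependency arcs, might look like the Python sketch below. The three rules, the tag sequences and all function names are assumptions made for illustration; they are not the rules of the parser described on the slide.

```python
def choose_action(stack, queue, tags, heads):
    """Hand-written rules over the POS tags of the top two stack items."""
    if len(stack) >= 2:
        below, top = stack[-2], stack[-1]
        pair = (tags[below], tags[top])
        if pair in (("P", "N"), ("V", "N")):
            return "RIGHT"  # the noun attaches to the preposition or verb before it
        if pair == ("V", "P") and top in heads.values():
            return "RIGHT"  # attach the preposition once it has its own dependent
        if pair == ("NEG", "V"):
            return "LEFT"   # the negative particle attaches to the following verb
    return "SHIFT" if queue else "DONE"


def parse(tags):
    """Return a dict mapping each dependent word index to its head index."""
    stack, queue, heads = [], list(range(len(tags))), {}
    while True:
        action = choose_action(stack, queue, tags, heads)
        if action == "SHIFT":
            stack.append(queue.pop(0))
        elif action == "RIGHT":
            dependent = stack.pop()
            heads[dependent] = stack[-1]
        elif action == "LEFT":
            dependent = stack.pop(-2)
            heads[dependent] = stack[-1]
        else:
            return heads


def f_measure(predicted, gold):
    """Harmonic mean of precision and recall over (dependent, head) arcs."""
    correct = len(set(predicted.items()) & set(gold.items()))
    if correct == 0:
        return 0.0
    precision = correct / len(predicted)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)


predicted = parse(["V", "N", "P", "N"])
gold = {1: 0, 2: 0, 3: 2}  # hypothetical gold-standard arcs for the toy input
print(predicted, f_measure(predicted, gold))
```

In the machine-learning variant, `choose_action` would be replaced by a trained classifier; the stack and queue mechanics stay the same.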
32 University of Leeds Postgraduate Researcher Conference 2011
Criteria for “PGR Researcher of the Year 2011”:
- Ability to communicate research to the lay and non-specialist research audience
- Impact/potential impact of the research in terms of e.g. application of findings for economic or social benefit; the significance of the contribution/potential contribution of the research to the academic subject area
- Evidence of local or national publicity or public engagement.
33 Ability to communicate research to the lay and non-specialist audience
Example Feedback (319 comments):
- “I would like to applaud you for your effort” - Prof Behnam Sadeghi, Stanford University
- “We are big admirers of the work” - Prof Gregory Crane, Classics Dept, Tufts University
- “I regularly use your work on the Qur'an and read it whenever I can.” - Prof Yousuf Islam, Director, Daffodil International University
- “Congratulations to all concerned on this project” - Prof Michael Arthur, VC, Leeds Uni
34 Impact: application of findings for economic or social benefit
Over a million users already, and growing; many unforeseen social benefits, e.g.:
“I work as a chaplain in correctional centers in the State of Missouri, U.S.A. Thanks for your permission to use the Quranic Arabic Corpus in these correctional centers” - Tadar Wazir.
35 Impact: significance of the research to the academic subject area
- 10 papers in research conferences & journals
- 25 citations (from Google Scholar) - so far...
- Positive feedback from top researchers
- Only free-to-download Arabic treebank
- A de-facto standard data-set for AI research
36 Evidence of local or national publicity or public engagement
- Newspapers, e.g. Muslim Post
- Better still: Website - world-wide public engagement!
37 Conclusion
This is not the end. To come: 2nd half of PhD project; and more?
Kais Dukes, I-AIBS Institute for Artificial Intelligence and Biological Systems, School of Computing, University of Leeds