Presentation on theme: "Corpus Linguistics for Understanding the Quran"— Presentation transcript:
1 Corpus Linguistics for Understanding the Quran
Eric Atwell, Kais Dukes, Nora Abbas, Abdul-Baquee Muhammad
I-AIBS Institute for Artificial Intelligence and Biological Systems, School of Computing, University of Leeds
2 The Challenge: An interdisciplinary approach to understanding the Quran
(1) Quranic Studies
(2) Traditional Arabic Linguistics
(3) Computational Linguistics
3 (1) What is the Quran?
The last in a series of 5 religious texts
Holy Book | Prophet | Text Dated
Suhuf Ibrahim (Scrolls) | Abraham | ?
The Tawrat (Torah) | Moses | 1500 BCE?
The Zabur (Psalms) | David | 1000 BCE?
The Injil (Gospel) | Jesus | 1 CE
The Quran | Muhammad (PBUH) | CE
4 (1) What is the Quran?
The central religious text of Islam
- Classical Arabic
- Islamic Law (legal logic)
- Divine guidance & direction
- Scientific & philosophical knowledge
- Has inspired many scientific achievements, e.g. algebra and linguistics
5 (2) Traditional Arabic Linguistics
Originated with Arabs studying the language of the Quran (detailed analysis for at least 1000 years):
- Orthography (diacritics and vowelization)
- Etymology (Semitic roots)
- Morphology (derivation and inflection)
- Syntax (origins of dependency grammar)
- Discourse Analysis & Rhetoric
- Semantics & Pragmatics
6 (3) Computational Linguistics
Where are we now? Current use of computing to analyze the Quran is mostly:
- Keyword search (useful)
- Frequency analysis (numerology?)
7 (3) Computational Linguistics
- How far can we go?
- Is an artificial intelligence system realistic?
Example question-answering dialog system:
Question: How long should I breastfeed my child for?
Answer: Mothers should suckle their offspring for two years, if the father wishes to complete the term (The Holy Quran, Verse 2:233).
8 An AI approach to understanding the Quran
Central Hypothesis: Augmenting the text of the Quran with rich annotation will lead to a more accurate AI system.
- Prepare the data by annotating the Quran.
- Use the data to build an AI system for concept search and question-answering.
9 Annotating the Quran: Challenges
- Orthography: Complex script verified in Unicode?
- Morphology: Arabic is highly inflected, which is challenging to model computationally
- Syntax: Phrase structure or dependency grammar?
- Semantics: lexical semantics, ontology, logic, lexical frames?
10 Annotating the Quran: Solutions
- Recent computational advances have made it possible to annotate the Quran to very high accuracy
- Community effort using volunteers
- Leverage existing resources from Traditional Arabic Grammar
- Automatic annotation followed by manual verification
11 Recent Advances: Orthography
Does an accurate digital copy of the Quran exist?
Encoding issues:
- Missing diacritics
- Simplified script (not Uthmani)
- Windows code page 1256, not Unicode
- Google Search for verse (68:38) on Jan 21, 2008 shows many typos
12 Recent Advances: Orthography
Tanzil Project (http://tanzil.info)
- Stable version released May 2008
- Uses Unicode XML encoding, including the special characters designed for the complex Arabic script of the Quran
- Manually verified to 100% accuracy by a group of experts who have memorized the entire text of the Quran
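As a rough illustration of what querying such an XML encoding could look like, here is a minimal Python sketch. It assumes a layout of `sura` elements containing `aya` elements with `index` and `text` attributes; that layout, the `load_verses` helper, and the placeholder sample are assumptions for illustration, not a description of the project's actual files.

```python
import xml.etree.ElementTree as ET

# Tiny stand-in document; the element and attribute names are assumed,
# and the verse texts are placeholders rather than real Quranic text.
SAMPLE_XML = """<quran>
  <sura index="1" name="sample chapter">
    <aya index="1" text="verse one text"/>
    <aya index="2" text="verse two text"/>
  </sura>
</quran>"""


def load_verses(xml_text):
    """Return a dict mapping (chapter, verse) numbers to the verse text."""
    root = ET.fromstring(xml_text)
    verses = {}
    for sura in root.iter("sura"):
        chapter = int(sura.get("index"))
        for aya in sura.iter("aya"):
            verses[(chapter, int(aya.get("index")))] = aya.get("text")
    return verses


verses = load_verses(SAMPLE_XML)
print(verses[(1, 2)])
```

Keying the dictionary by (chapter, verse) mirrors the common Chapter:Verse reference style, so a lookup reads much like a citation.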
13 Recent Advances: Orthography
Java Quran API (http://jqurantree.org), March 2009
- Java classes for querying the Tanzil XML of the Quran
- First step towards a software package for analyzing the Quran
14 Recent Advances: Morphology
- Buckwalter Arabic Morphological Analyzer (2002)
- Morphological Analysis of the Quran at the University of Haifa, Israel (2004)
- Lexeme & feature based morphological representation of Arabic (Nizar Habash, 2006)
15 The Haifa Corpus (2004)
Multiple analyses for each word (up to 5), for example:
rbb+fa&l+Noun+Triptotic+Masc+Sg+Pron+Dependent+1P+Sg
rbb+fa&l+Noun+Triptotic+Masc+Sg+Gen
- Not a manually verified corpus
- The authors report an F-measure of 86%
- Non-standard annotation scheme not familiar to traditional Arabic linguists (e.g. extracting a list of all verbs in the corpus is non-trivial)
- Arabic text is encoded only phonetically instead of using the original Arabic script
- Searching for the possible morphological analyses of a specific word is not easy
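To make the "+"-separated analysis strings above more concrete, the following Python sketch splits one analysis into fields and filters by a feature. Treating the first field as the root and the second as the pattern is an assumption made for illustration, and `parse_analysis` and `has_feature` are hypothetical helper names, not part of the Haifa tools.

```python
def parse_analysis(analysis):
    """Split one '+'-separated analysis into root, pattern and features.

    Assumption: field 1 is the root and field 2 is the pattern; every
    remaining field is treated as an unordered feature label.
    """
    root, pattern, *features = analysis.split("+")
    return {"root": root, "pattern": pattern, "features": features}


def has_feature(analysis, feature):
    """True if the given feature label occurs in the analysis."""
    return feature in parse_analysis(analysis)["features"]


# The two candidate analyses shown on the slide.
analyses = [
    "rbb+fa&l+Noun+Triptotic+Masc+Sg+Pron+Dependent+1P+Sg",
    "rbb+fa&l+Noun+Triptotic+Masc+Sg+Gen",
]

nouns = [a for a in analyses if has_feature(a, "Noun")]
print(parse_analysis(analyses[1]))
```

Even with such a helper, a word with several candidate analyses still needs disambiguation before a reliable list of, say, all verbs can be extracted.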
16 The Quranic Arabic Corpus
- Manually verified (99% accuracy)
- Popular website with very positive feedback and million(s) of visitors
1. Initial tagging using Buckwalter Analyzer
2. Paid annotator working for 3 months
3. Community of volunteers verifying against existing books of Traditional Arabic Grammar which analyse the Quran
Shows Arabic and English morphological analysis side-by-side, with phonetic transcription, search and translation.
17 The Quranic Arabic Corpus http://corpus.quran.com/
Kais Dukes: Arabic Language Computing Applied to the Quran, PhD (part-time)
An open-source online focus for linguistic research on Classical Arabic:
- morphology: each word shows colour-coded morphological analysis
- syntax: each verse shows a dependency parse following Arabic tradition
- semantics: entities and concepts are linked to an ontology
- translation: word-for-word English translations to aid understanding
- Machine Learning: annotations provide training data for a parser
- Impact on society: dozens of researchers collaborated on the analysis, and over a million visitors have used the website this year.
18 The Quranic Arabic Corpus: Part-of-speech Tagging
- Part-of-speech tags adapted from Traditional Arabic Grammar, and mapped to English equivalents (not the other way around)
- These tags apply to words in the Quran, as well as to individual morphological segments in the text
Part-of-speech Tag | Name | Arabic Name
N | Noun | اسم
PN | Proper noun | اسماء علم
PRON | Personal pronoun | ضمير
DEM | Demonstrative pronoun | اسم اشارة
REL | Relative pronoun | اسم موصول
ADJ | Adjective | صفة
V | Verb | فعل
P | Preposition | حرف جر
PART | Particle | حرف
INTG | Interrogative particle | حرف استفهام
VOC | Vocative particle | حرف نداء
NEG | Negative particle | حرف نفي
FUT | Future particle | حرف استقبال
CONJ | Conjunction | حرف عطف
NUM | Number | رقم
T | Time adverb | ظرف زمان
LOC | Location adverb | ظرف مكان
EMPH | Emphatic lām prefix | لام التوكيد
PRP | Purpose lām prefix | لام التعليل
IMPV | Imperative lām prefix | لام الامر
INL | Quranic initials | حروف مقطعة
19 The Quranic Arabic Corpus: Verified Uthmani Script
- Unicode Uthmani script
- Sourced from the verified Tanzil project
20 The Quranic Arabic Corpus: Phonetics (faja'alnāhumu)
- Phonetic transcription generated algorithmically
- Guided by Arabic vowelized diacritics
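A toy Python sketch of diacritic-guided transcription is given below. It is not the corpus's algorithm: the letter table covers only the demo word, the lengthening rule handles only alif after a fatha, and the simplified-script spelling of the word is an assumption made for illustration (the Uthmani script uses additional marks that this sketch ignores).

```python
# Consonant table restricted to the letters of the demo word.
CONSONANTS = {
    "\u0641": "f",  # fa
    "\u062C": "j",  # jim
    "\u0639": "'",  # ayn
    "\u0644": "l",  # lam
    "\u0646": "n",  # nun
    "\u0647": "h",  # ha
    "\u0645": "m",  # mim
}
SHORT_VOWELS = {"\u064E": "a", "\u064F": "u", "\u0650": "i"}  # fatha, damma, kasra
SUKUN = "\u0652"  # marks the absence of a vowel
ALIF = "\u0627"
LONG_A = "\u0101"  # a with macron


def transcribe(word):
    """Map a fully vowelized word to a Latin transcription, character by character."""
    out = []
    for ch in word:
        if ch in CONSONANTS:
            out.append(CONSONANTS[ch])
        elif ch in SHORT_VOWELS:
            out.append(SHORT_VOWELS[ch])
        elif ch == SUKUN:
            continue  # consonant carries no vowel
        elif ch == ALIF and out and out[-1] == "a":
            out[-1] = LONG_A  # alif lengthens a preceding fatha
        else:
            raise ValueError(f"character not covered by this sketch: {ch!r}")
    return "".join(out)


# Assumed simplified-script spelling of the slide's example word.
WORD = ("\u0641\u064E\u062C\u064E\u0639\u064E\u0644\u0652"
        "\u0646\u064E\u0627\u0647\u064F\u0645\u064F")
print(transcribe(WORD))
```

The sketch only works because every consonant carries an explicit diacritic; unvowelized text would leave the short vowels undetermined.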
21 The Quranic Arabic Corpus: Interlinear translation
- Word-for-word translation from accepted sources
- Interlinear translation scheme
22 The Quranic Arabic Corpus: Location Reference (21:70:4)
- Common standard for verses (Chapter:Verse)
- Extended in the QAC corpus to include word numbers and segment numbers, e.g. (21:70:4:2)
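The reference scheme can be sketched in Python as a small parser that accepts two to four colon-separated numbers. The `Location` type and `parse_location` function are hypothetical names used for illustration, not part of the corpus software.

```python
from typing import NamedTuple, Optional


class Location(NamedTuple):
    chapter: int
    verse: int
    word: Optional[int] = None     # present in (chapter:verse:word)
    segment: Optional[int] = None  # present in (chapter:verse:word:segment)


def parse_location(ref):
    """Parse references such as '(2:233)', '(21:70:4)' or '(21:70:4:2)'."""
    parts = [int(p) for p in ref.strip("()").split(":")]
    if not 2 <= len(parts) <= 4:
        raise ValueError(f"expected 2 to 4 numbers, got {len(parts)}: {ref!r}")
    return Location(*parts)


print(parse_location("(21:70:4:2)"))
```

Leaving `word` and `segment` as `None` for shorter references keeps verse-level and segment-level citations in one type.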
23 The Quranic Arabic Corpus: Morphological Segmentation
- Division of a single word into multiple segments
- Part-of-speech tag assigned to each segment
- Traditional Arabic Grammar rules used for division
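One possible way to represent a segmented word is sketched below in Python. The tag names come from the part-of-speech table in this deck, but the three-way split of the transcribed word and the tags assigned to each segment are illustrative assumptions, not the corpus's verified analysis.

```python
from dataclasses import dataclass

# Subset of the part-of-speech tags listed in the deck.
TAG_NAMES = {
    "N": "Noun",
    "V": "Verb",
    "P": "Preposition",
    "PRON": "Personal pronoun",
    "CONJ": "Conjunction",
}


@dataclass
class Segment:
    form: str  # transcribed form of the segment
    tag: str   # part-of-speech tag

    @property
    def tag_name(self):
        return TAG_NAMES[self.tag]


# Hypothetical segmentation: prefix + stem + suffix.
word = [
    Segment("fa", "CONJ"),
    Segment("ja'aln\u0101", "V"),
    Segment("humu", "PRON"),
]

surface = "".join(segment.form for segment in word)
tags = [segment.tag_name for segment in word]
print(surface, tags)
```

Because each segment keeps its own tag, a segment-level reference such as a word number plus a segment number can index directly into the list.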
24 The Quranic Arabic Corpus Morphological segment features
25 The Quranic Arabic Corpus Arabic Grammar Summary
26 The Quranic Arabic Treebank: Syntactic Annotation
- Dependency Grammar based on إعراب (i'rāb)
- Syntactico-semantic roles for each word
27 The Quranic Arabic Treebank: What’s new about this research?
- First Treebank of Classical Arabic
- Free Treebank of the Quran
- Well-defined formal representation of Traditional Arabic Grammar using hybrid constituency/dependency graphs
28 Automatic Annotation: Classical Arabic Dependency Parser
- Joakim Nivre (2009): dependency parsing using a shift/reduce queue/stack architecture with machine learning
- Following a similar architecture, but with hand-written rules, the custom parser has an F-measure of 77.2%
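The general shape of a shift/reduce parser driven by hand-written rules can be sketched in Python as follows. This is a toy arc-standard style loop over part-of-speech tags; the rule sets are invented for illustration and are neither the rules of the custom parser nor Nivre's learned model.

```python
# Hypothetical rules over part-of-speech tags.
# RIGHT_RULES: (head_tag, dependent_tag) where the head precedes the dependent.
RIGHT_RULES = {("V", "N"), ("V", "P"), ("P", "N")}
# LEFT_RULES: (dependent_tag, head_tag) where the head follows the dependent.
LEFT_RULES = {("NEG", "V")}


def parse(tags):
    """Return a head index for each word (None for the root or unattached words)."""
    heads = [None] * len(tags)
    stack, queue = [], list(range(len(tags)))
    while queue or len(stack) > 1:
        if len(stack) >= 2:
            below, top = stack[-2], stack[-1]
            pair = (tags[below], tags[top])
            if pair in LEFT_RULES:
                heads[below] = top  # left arc: lower item depends on the top
                del stack[-2]
                continue
            # Delay a right arc while the top could still take the next word.
            next_attaches = bool(queue) and (tags[top], tags[queue[0]]) in RIGHT_RULES
            if pair in RIGHT_RULES and not next_attaches:
                heads[top] = below  # right arc: top depends on the lower item
                stack.pop()
                continue
        if not queue:
            break  # no rule applies; leave the remaining items unattached
        stack.append(queue.pop(0))  # shift the next word onto the stack
    return heads


print(parse(["V", "N", "P", "N"]))
```

In a learned parser the choice between shift, left arc and right arc comes from a classifier; here the same decision is made by looking up tag pairs in fixed rule sets.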
29 Quran ‘Search for a Concept’ Tool
Nora Abbas developed the first Quran "search for a concept" tool and website, Qurany.
Noorhan Abbas. Qurany: A Tool to Search for Concepts in the Quran. MSc by Research Thesis, School of Computing, Leeds University, 2009.
30 Quran ‘Search for a Concept’ Tools
- What do the available Quran tools on the net provide?
- What is the main problem with these tools?
- What about the Recall value of their results?
- What is the main reason for these poor results?
The SearchTruth tool (Search Truth): 48%
The Holy Quran Viewer tool (Holy Quran Viewer): 34%
The University of Southern California tool (MSA-USC Qur’an Database): 49%
31 Quran ‘Search for a Concept’ Tool
What is a CONCEPT?
- NOT just a “keyword”
- An “index term” in a textbook?
32 Quran ‘Search for a Concept’ Tool
General/Abstract Concepts:
- Women’s financial status
- Main pillars of Islam
- Characteristics of Paradise
Concrete Concepts:
- Names of places (Makkah, Mecca, Meccah)
- Names of prophets, angels, etc. (Musa, Moses)
- Names of Holy Books (The Book (Bible), Bible, New Testament)
33 Quran ‘Search for a Concept’ Tool
What does my tool look like? [Screenshot of the tool with numbered callouts 1 to 6]
34 Quran ‘Search for a Concept’ Tool
Handling the Concrete Concepts
- Eight parallel English translations
- Search for one English word or a group of words in one search request
- Search for one Arabic word or a group of words in one search request
- Search for a mixed list of Arabic and English words in one search request
- Offers a list of synonyms for the English words
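A search over parallel translations with synonym expansion might look like the Python sketch below. The synonym groups reuse the spelling variants mentioned in this deck; the `expand` and `search` functions, the verse labels and the placeholder texts are assumptions for illustration and do not reproduce Qurany's implementation or any real translation.

```python
# Synonym groups built from the spelling variants named in the deck.
SYNONYMS = [
    {"mecca", "makkah", "meccah"},
    {"moses", "musa"},
]

# Placeholder data: each verse label maps to its texts in several translations.
TRANSLATIONS = {
    "verse-A": ["placeholder text naming Makkah.", "placeholder text naming the city."],
    "verse-B": ["placeholder text about Musa", "placeholder text about Moses"],
    "verse-C": ["placeholder text about something else"],
}


def expand(term):
    """Return the term together with any known synonyms, lowercased."""
    term = term.lower()
    for group in SYNONYMS:
        if term in group:
            return set(group)
    return {term}


def search(query_terms, translations):
    """Return verse labels where any translation contains any expanded term."""
    wanted = set().union(*(expand(t) for t in query_terms))
    hits = []
    for ref, texts in translations.items():
        words = {w.strip(".,;:!?()\"'").lower() for text in texts for w in text.split()}
        if wanted & words:
            hits.append(ref)
    return hits


print(search(["Mecca", "Moses"], TRANSLATIONS))
```

Matching against every translation of a verse is what lets a query for one spelling find a verse whose translators chose another.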
35 Quran ‘Search for a Concept’ Tool
General/Abstract Concepts
- The concept index is imported from the ‘Mushaf Al Tajweed’ index of topics published by Dar Al-Maarifa in Syria.
- The tool has 15 main concepts.
- The tool covers all the concepts in both Arabic and English.
- The total number of concepts covered is 1170.
For example, to represent:
- Women’s financial status
- Main pillars of Islam
- Characteristics of Paradise
36 Knowledge representation and text mining of the Qur'an Abdul-Baquee Muhammad
37 Qur'anic Applications: Text Mining the Quran
- Verse similarity: see all verses that share a certain percentage of characters with your input verse.
- Quranic Chapter Relatives: see the strongest relatives of a given Quran chapter.
- Word Cloud: see word clouds of a sura or group of suras of the Qur'an.
- Qur'an Concordance: concordance over lemma.
- Part-of-Speech Display of Sura: view a sura of the Qur'an with color-coded part-of-speech tags.
- Quranic word co-occurrence: enter a Quranic term to find its most frequent neighbors.
- N-gram Search: search phrases of up to 5 words (5-grams) that occur in the Quran with a frequency of 5 or more.
- Pronoun References: given a verse, see all pronoun references within this verse.
- List of Concepts: see a list of concepts arising from pronoun referents in the Quran.
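The n-gram search idea can be sketched in Python by counting every word sequence of up to five tokens and keeping those that reach a frequency threshold. The function names and the whitespace tokenization are assumptions for illustration, and the sample "verses" are placeholder strings rather than Quranic text.

```python
from collections import Counter


def ngram_counts(verses, max_n=5):
    """Count all word n-grams of length 1..max_n, never crossing verse boundaries."""
    counts = Counter()
    for verse in verses:
        tokens = verse.split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts


def frequent_ngrams(verses, n, min_freq=5):
    """Return n-grams of exactly length n that occur at least min_freq times."""
    counts = ngram_counts(verses, max_n=n)
    return {
        " ".join(gram): count
        for gram, count in counts.items()
        if len(gram) == n and count >= min_freq
    }


# Placeholder corpus: five identical verses plus one variant.
sample = ["a b c"] * 5 + ["a b d"]
print(frequent_ngrams(sample, 2))
```

Counting within verse boundaries is a design choice; a tool that allows phrases to span verses would instead tokenize the whole chapter as one sequence.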
38 AI for understanding the Quran
- Kais Dukes developed the first online annotated linguistic resource showing the Arabic "i'rāb" morphology and grammar for each word and verse in the Holy Quran: the Quranic Arabic Corpus, which includes word-by-word morphology, English gloss, and an ontology of Quranic concepts.
- Nora Abbas developed the first Quran "search for a concept" tool and website, Qurany.
- Abdul-Baquee Sharaf developed tools and resources for text mining the Quran, including verse similarity, lemma concordance and collocation, and for text mining the Hadeeth.