Presentation on theme: "I-AIBS Institute for Artificial Intelligence and Biological Systems"— Presentation transcript:
1 I-AIBS Institute for Artificial Intelligence and Biological Systems
Corpus resources for learning Arabic to understand the Quran
Higher Education Academy workshop on "The Role of Corpora in LSP (Language for Specific Purposes) Learning and Teaching", Parkinson B08, University of Leeds, Monday 23rd July 2012.
Eric Atwell, I-AIBS Institute for Artificial Intelligence and Biological Systems, School of Computing, University of Leeds
2 An Artificial Intelligence interdisciplinary approach to understanding the Quran
(1) Quranic Studies
(2) Traditional Arabic Linguistics
(3) Computing
3 (1) What is the Quran?
Islam: the last in a series of 5 religious texts
Holy Book | Prophet | Text Dated
Suhuf Ibrahim (Scrolls) | Abraham | ?
Tawrat (Torah) | Moses | 1500 BCE?
Zabur (Psalms) | David | 1000 BCE?
Injil (Gospel) | Jesus | 1 CE
The Quran | Muhammad (PBUH) | CE
4 (1) What is the Quran?
- The central religious text of Islam
- Classical Arabic, years ago
- All believers should learn the text; translations are “interpretations”
- Islamic Law (legal logic)
- Divine guidance & direction
- Science and philosophy
- Has inspired Algebra, Linguistics
5 (2) Traditional Arabic Linguistics
Originated with Arabs studying the language of the Quran (scientific analysis for at least 1000 years, a lot older than the English language!):
- Orthography (diacritics and vowelization)
- Etymology (Semitic roots)
- Morphology (derivation and inflection)
- Syntax (origins of dependency grammar)
- Discourse Analysis & Rhetoric
- Semantics & Pragmatics
6 (3) Computing
- The Quran is online, for keyword search
- BUT verse-by-verse translations are interpretations
- Muslims should access the “true” Classical Arabic source
7 (3) Computing - How far can we go?
- An Artificial Intelligence system which “understands” the Quran?
Example question-answering dialog system:
Question: How long should I breastfeed my child for?
Answer: Mothers should suckle their offspring for two years, if the father wishes to complete the term (The Holy Quran, Verse 2:233).
8 An Artificial Intelligence approach to understanding the Quran
Central Hypothesis: augmenting the text of the Quran with rich linguistic annotation will lead to more intelligent/accurate AI systems.
- Prepare the data by annotating the Quran.
- Use the data to build an AI system for concept search and question-answering.
9 Corpus resources for learning Arabic to understand the Quran
Augmenting the Arabic text of the Quran with rich linguistic annotation will help learners to understand Quranic Arabic.
- Annotate the Quranic Arabic Corpus.
- Teachers and learners use the annotations for deeper understanding of Quranic Arabic.
10 Straw Poll: LSP for religious texts?
- How many Muslims in the audience?
- How many read/recite Classical Arabic Quran?
- How many would like to?
- How many Jews in the audience?
- How many read/recite Classical Hebrew Tanakh?
- How many Christians in the audience?
- How many read/recite Classical Hebrew/Greek Bible?
- Have I left anyone out?
11 Annotating the Quran: Challenges
- Orthography - Complex non-standard script
- Morphology (word structure) - Arabic is highly inflected, challenging to analyze
- Grammar - Phrase structure, dependency
- Semantics - Ontology of Entities and Concepts referred to by pronouns and nouns
12 Annotating the Quran: Solutions
- Computing advances have made annotation possible, to high accuracy
- Leverage existing resources from Traditional Arabic Grammar
- Machine-Learning annotation followed by manual verification
- Community effort using online volunteers
13 Recent Advances: Orthography
An accurate digital copy of the Quran?
Encoding Issues:
- Missing diacritics
- Simplified script (not Uthmani)
- Windows code page 1256, not Unicode
- Google Search for verse (68:38) on Jan 21, 2008 shows many typos
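The "missing diacritics" issue can be checked mechanically once the text is in Unicode. The following is a minimal sketch, not the method used by any project named in these slides: it treats every Unicode nonspacing mark (category `Mn`) as a diacritic, which covers the Arabic vowel marks but is a simplifying assumption. The helper names `strip_marks`, `count_marks` and `same_letters` are hypothetical.

```python
import unicodedata


def strip_marks(text: str) -> str:
    """Return text with all Unicode nonspacing marks (category Mn) removed."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


def count_marks(text: str) -> int:
    """Count the nonspacing marks in text (vowel signs, sukun, shadda, ...)."""
    return sum(1 for ch in text if unicodedata.category(ch) == "Mn")


def same_letters(a: str, b: str) -> bool:
    """True if two copies agree once diacritics are ignored."""
    return strip_marks(a) == strip_marks(b)


# A vowelized word: beh + kasra, seen + sukun, meem + kasra.
vowelized = "\u0628\u0650\u0633\u0652\u0645\u0650"
bare = "\u0628\u0633\u0645"

print(count_marks(vowelized), count_marks(bare))
print(same_letters(vowelized, bare))
```

A copy whose mark count is zero, or far below that of a verified copy of the same verse, has probably lost its vowelization.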
14 Recent Advances: Orthography
Tanzil Project (http://tanzil.info)
- Stable version released May 2008
- Uses Unicode XML encoding, including the special characters designed for the complex Arabic script of the Quran
- Manually verified to 100% accuracy by a group of experts who have memorized the entire text of the Quran
15 Recent Advances: Orthography
Java Quran API (http://jqurantree.org) (Dukes 2009)
- Java classes for querying the Tanzil XML of the Quran
- Gives authentic script on web-pages
16 Recent Advances: Morphology
- Buckwalter Arabic Morphological Analyzer (Tim Buckwalter, 2002)
- Morphological Analysis of the Quran at the University of Haifa (Shuly Wintner, 2004)
- Lexeme & feature based morphological representation of Arabic (Nizar Habash, 2006)
17 The Haifa Corpus (2004)
- Multiple analyses for each word (up to 5), e.g.:
  rbb+fa&l+Noun+Triptotic+Masc+Sg+Pron+Dependent+1P+Sg
  rbb+fa&l+Noun+Triptotic+Masc+Sg+Gen
- Not manually verified
- Authors report an F-measure of 86%
- Non-standard annotation scheme, not familiar to Arabic linguists, e.g. extracting a list of all verbs is non-trivial
- Arabic text is only encoded phonetically, e.g. searching for a specific root is not easy
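To make the shape of these analysis strings concrete, here is a minimal sketch that splits one on `+`. It assumes, from the two examples on the slide only, that the first field is the root, the second the pattern, the third the part of speech, and the rest are features; the actual Haifa scheme may differ. The function names `parse_analysis` and `candidate_pos` are hypothetical.

```python
def parse_analysis(analysis: str) -> dict:
    """Split a '+'-delimited analysis string into labelled parts.

    Field order (root, pattern, part of speech, features) is an
    assumption read off the slide's examples, not a documented format.
    """
    fields = analysis.split("+")
    if len(fields) < 3:
        raise ValueError(f"too few fields in analysis: {analysis!r}")
    return {
        "root": fields[0],
        "pattern": fields[1],
        "pos": fields[2],
        "features": fields[3:],
    }


def candidate_pos(analyses: list) -> set:
    """Collect every part of speech proposed for one ambiguous word."""
    return {parse_analysis(a)["pos"] for a in analyses}


word_analyses = [
    "rbb+fa&l+Noun+Triptotic+Masc+Sg+Pron+Dependent+1P+Sg",
    "rbb+fa&l+Noun+Triptotic+Masc+Sg+Gen",
]
for a in word_analyses:
    print(parse_analysis(a))
print(candidate_pos(word_analyses))
```

Because each word can carry up to five unverified analyses, a query such as "all verbs" has to decide how to treat words whose candidate set mixes several parts of speech, which is one reason the slide calls such extraction non-trivial.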
18 The Quranic Arabic Corpus http://corpus.quran.com/
Kais Dukes - PhD (part-time)
- word structure - colour-coded morphological analysis
- translation - verse, word-for-word English translations
- grammar - dependency parse following Arabic tradition
- semantics - ontology of entities and concepts
- Machine Learning - annotations used for A.I. training
- Impact - dozens of researchers have collaborated/cited, and over a million visitors use the website per year
19 The Quranic Arabic Corpus: Verified Uthmani Script
- Unicode Uthmani Script
- Sourced from the verified Tanzil project
20 The Quranic Arabic Corpus: Phonetics (faja'alnāhumu)
- Phonetic transcription generated algorithmically
- Guided by Arabic vowelized diacritics
21 The Quranic Arabic Corpus: Interlinear translation
- Word-for-word translation from accepted sources
- Interlinear translation scheme
22 The Quranic Arabic Corpus: Location Reference (21:70:4)
- Common standard for verses (Chapter:Verse)
- Extended in the QAC corpus to include word numbers and segment numbers, e.g. (21:70:4:2)
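The location scheme is easy to handle in code. The sketch below parses references of the form Chapter:Verse, optionally extended with a word number and a segment number, as in the slide's examples. The `Location` type and `parse_location` function are hypothetical names for illustration, not part of the corpus software.

```python
import re
from typing import NamedTuple, Optional


class Location(NamedTuple):
    """Chapter:Verse, optionally extended with word and segment numbers."""
    chapter: int
    verse: int
    word: Optional[int] = None
    segment: Optional[int] = None


# Two to four colon-separated numbers, with optional surrounding parentheses.
_LOCATION = re.compile(r"^\(?(\d+):(\d+)(?::(\d+))?(?::(\d+))?\)?$")


def parse_location(ref: str) -> Location:
    """Parse a reference such as '2:233', '(21:70:4)' or '(21:70:4:2)'."""
    match = _LOCATION.match(ref.strip())
    if match is None:
        raise ValueError(f"not a location reference: {ref!r}")
    numbers = [int(g) if g is not None else None for g in match.groups()]
    return Location(*numbers)


print(parse_location("2:233"))
print(parse_location("(21:70:4)"))
print(parse_location("(21:70:4:2)"))
```

Keeping the word and segment fields optional lets the same type address a whole verse, a single word, or one morphological segment of a word.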
23 The Quranic Arabic Corpus: Morphological Segmentation
- Division of a single word into multiple segments
- Part-of-speech tag assigned to each segment
- Traditional Arabic Grammar rules used for division
24 The Quranic Arabic Corpus Morphological segment features
25 The Quranic Arabic Corpus Arabic Grammar Summary
26 The Quranic Arabic Corpus: Syntactic Annotation
- Dependency Grammar based on إعراب (i'rāb)
- Syntactico-semantic roles for each word
27 The Quranic Arabic Corpus Ontology of entities and concepts linked to/from nouns and pronouns in the text
28 The Quranic Arabic Corpus: Framework for collaboration
User Interaction via Message Board: “If you come across a word and you feel that a better analysis could be provided, you can suggest a correction online by clicking on an Arabic word” (5000+ resolved messages)
Resources: Publications, Citations, Reviews, FAQs, Feedback, Data Download, Software download, Mailing list
29 The Quranic Arabic Corpus: Users (researchers, public)
- Artificial Intelligence and Computational Linguistics
- Arabic linguistics
- Quranic and Islamic Studies
- Classical literature analysis
- Anyone who wants to appreciate the Quran
30 The Quranic Arabic Corpus: new Computational Linguistics
- First Treebank of Classical Arabic
- Free Treebank of the Quran
- First formal representation of Traditional Arabic Grammar using constituency/dependency graphs
- Machine-Learning parser
31 User Feedback (300+ comments)
“I would like to applaud you for your effort” - Prof Behnam Sadeghi, Stanford University
“We are big admirers of the work” - Prof Gregory Crane, Classics Dept, Tufts University
“I regularly use your work on the Qur'an and read it whenever I can.” - Prof Yousuf Islam, Director, Daffodil International University
“Congratulations to all concerned on this project” - Prof Michael Arthur, VC, Leeds Uni
32 Most users are teachers and learners of Quranic Arabic
Over a million users already, and growing; many unforeseen social benefits, e.g.: “I work as a chaplain in correctional centers in the State of Missouri, U.S.A. Thanks for your permission to use the Quranic Arabic Corpus in these correctional centers” - Tadar Wazir.
33 AI for understanding the Quran
Qurany: the first Quran "search for a concept" website
If you choose from the tree of concepts on the left hand side:
concept "Pillars of Islam"
then "The Prayers"
then "Performing the Prayers"
then "Friday Prayers"
...you get Quran verses on this topic in the upper right frame and Hadith on this topic in the lower right frame.
Nora Abbas, Qurany: A Tool to Search for Concepts in the Quran (PDF). 2009
34 "Google Qurany" html version at
Store each Quran or Hadith verse as a separate web-page, and annotate each web-page with English translations and concept-tags. Then search is enabled via Google, but "keywords" can be concept-tags and/or English words and/or Arabic words.
Google "Jesus site:http://www.comp.leeds.ac.uk/nora/html"
Google "Friday Prayers site:http://www.comp.leeds.ac.uk/nora/html"
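The two example searches rely on Google's `site:` operator to restrict results to the per-verse pages. A minimal sketch of building such a query and a matching search URL follows; the site address is the one quoted on the slide (whether it still resolves is not checked here), and the function names are hypothetical.

```python
from urllib.parse import urlencode

# Site address as quoted on the slide.
SITE = "http://www.comp.leeds.ac.uk/nora/html"


def google_site_query(keywords: str, site: str = SITE) -> str:
    """Build the query text: keywords restricted to one site."""
    return f"{keywords} site:{site}"


def google_search_url(keywords: str, site: str = SITE) -> str:
    """Build a Google search URL with the query percent-encoded."""
    return "https://www.google.com/search?" + urlencode(
        {"q": google_site_query(keywords, site)}
    )


print(google_site_query("Jesus"))
print(google_search_url("Friday Prayers"))
```

The keywords can be concept-tags, English words or Arabic words, since all three are stored on each verse's page.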
35 AI for understanding the Quran
- Tools and resources for text mining the Quran, including pronoun references, related verses, lemma concordance and collocation, and text mining the Hadeeth
Abdul-Baquee Sharaf and Eric Atwell (2012). QurAna: Corpus of the Quran annotated with Pronominal Anaphora. Proc LREC’2012, Istanbul
Abdul-Baquee Sharaf and Eric Atwell (2012). QurSim: A corpus for evaluation of relatedness in short texts. Proc LREC’2012, Istanbul
36 QurSim - 7,679 pairs of related verses, according to Ibn Kathir, respected Islamic Scholar
QurAna - 24,668 pronouns, each linked to its anaphoric referent entity or concept, and the location of the antecedent if available.
Concept list - a list of 1054 entities or concepts arising from pronoun referents in the Quran: nominal entities in a Quran ontology
37 AI for understanding the Quran
SALMA - Sawalha Atwell Leeds Morphological Analyser: morphological analysis of Quran text
Majdi Sawalha, Eric Atwell (2010). Fine-Grain Morphological Analyzer and Part-of-Speech Tagger for Arabic Text. Proc LREC’2010, Valletta, Malta
Majdi Sawalha, Eric Atwell (2010). Constructing and Using Broad-Coverage Lexical Resource for Enhancing Morphological Analysis of Arabic. Proc LREC’2010, Valletta, Malta
38 AI for understanding the Quran
Boundary-Annotated Quran - Tagged with prosodic annotation scheme from Tajwīd (recitation) mark-up in the Qur'an
Claire Brierley, Majdi Sawalha and Eric Atwell (2012). Open-Source Boundary-Annotated Corpus for Arabic Speech and Language Processing. Proc LREC’2012, Istanbul
Majdi Sawalha, Claire Brierley and Eric Atwell (2012). Predicting Phrase Breaks in Classical and Modern Standard Arabic Text. Proc LREC’2012, Istanbul
39 AI for understanding the Quran
The Quranic Arabic Corpus - the first online annotated linguistic resource which shows the Arabic "irab" morphology and grammar for each word and verse in the Holy Quran, including word-by-word morphology and English gloss, and Ontology of Quranic concepts
Kais Dukes, Eric Atwell and Nizar Habash (2011). Supervised Collaboration for Syntactic Annotation of Quranic Arabic. Language Resources and Evaluation Journal (LREJ).
Kais Dukes and Eric Atwell (2012). LAMP: A Multimodal Web Platform for Collaborative Linguistic Analysis. Proc LREC’2012, Istanbul
40 Conclusion
Augmenting the Arabic text of the Quran with rich linguistic annotation will help learners to understand Quranic Arabic.
Eric Atwell, Nora Abbas, Claire Brierley, Kais Dukes, Majdi Sawalha, Abdul-Baquee Sharaf
I-AIBS Institute for Artificial Intelligence and Biological Systems, School of Computing, University of Leeds