Data Mining and Text Analytics By Saima Rahna & Anees Mohammad Quranic Arabic Corpus
Summary ● The Quranic Arabic Corpus enables further analysis of the Quran ● It provides linguistic resources for each word and verse in the Quran, e.g. morphology and syntax ● Automated algorithms were applied to the text of the Quran.
Introduction ● Islam was born in Arabia (about 1,400 years ago) ● The key sacred texts are in Arabic ● Only a minority of Muslims can speak and understand Arabic ● A larger percentage of Muslims know English as a second or even first language ● Web and book resources use English in parallel with Arabic.
Data Mining ● Uses tools and techniques to extract information from data ● Different aspects of a single topic in the Quran can reappear in many chapters ● Frequent patterns can therefore be used to construct a subject index in which all verses on a single topic are easy to find.
Text Analytics ● Also referred to as information extraction ● The Quranic Arabic Corpus is an advantage to those who don't understand Arabic ● It can give English readers a better insight into the source ● The translation is carried out at a detailed text-analytics level
Resources & Techniques
Statistical techniques ● Statistical techniques such as keyword extraction are implemented ● They can explore semiotic relationships between sound and meaning in the Quran ● They recognise recurring patterns with a high level of accuracy
Linguistic resources ● Arabic grammar and syntax are annotated for each word in the Quran ● An online comment system lets visitors discuss and correct the data.
Algorithms ● The Quranic Arabic Corpus implements its algorithms in Java ● Search feature (searching concepts and keywords in the Holy Quran) ● Finding multi-word repetitions ● Mining frequent patterns from a graph.
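One simple way to find multi-word repetitions is to count word n-grams and keep those that occur more than once. The sketch below is in Python for brevity, not the Java used by the corpus. It is an illustration under stated assumptions and not the corpus's own method: the function name `repeated_ngrams`, its parameters, and the placeholder tokens are all hypothetical.

```python
from collections import Counter

def repeated_ngrams(tokens, n=2, min_count=2):
    """Return the n-grams (as tuples) that occur at least min_count times."""
    # Slide a window of length n over the token list.
    ngrams = zip(*(tokens[i:] for i in range(n)))
    counts = Counter(ngrams)
    return {gram: c for gram, c in counts.items() if c >= min_count}

# Placeholder tokens for illustration only, not real corpus data.
tokens = "a b c a b d a b c".split()
repeats = repeated_ngrams(tokens, n=2)
```

Raising `n` looks for longer repeated phrases; raising `min_count` keeps only the more frequent ones.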
Algorithm for indexing the Quran
When a word is encountered for the first time, it is added to the index; if it already exists there, a new location is added to its list.
For each verse V
    parse word list -> list(W)
    For each word W
        If INDEX contains W is false
            add W and W.location to INDEX
        Else
            fetch W in INDEX
            add new location to W
Filtering algorithm ● The Quranic 'quote filtering' algorithm ● The Quran is written with Arabic diacritics (symbols) ● The filtering algorithm applies 3 filtering stages after taking the input text.
Algorithm: Sub-path Mining ● Used to generate frequent patterns within the Quran corpus ● The process starts by scanning the transaction database and calculating the count for each vertex in the graph
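The first step of sub-path mining, scanning the transaction database and counting each vertex, might look like the sketch below. It assumes that each transaction is a path given as a list of vertex labels and that a vertex is counted once per transaction. The slides do not state a minimum-support threshold, so `min_support` is an added assumption; the function names and sample data are hypothetical.

```python
from collections import Counter

def vertex_counts(transactions):
    """Count, for each vertex, the number of transactions that contain it."""
    counts = Counter()
    for path in transactions:
        # set() ensures a vertex is counted once per transaction.
        counts.update(set(path))
    return counts

def frequent_vertices(transactions, min_support):
    """Keep only vertices seen in at least min_support transactions."""
    counts = vertex_counts(transactions)
    return {v for v, c in counts.items() if c >= min_support}

# Placeholder paths for illustration only, not real corpus data.
transactions = [
    ["A", "B", "C"],
    ["A", "C"],
    ["B", "C", "D"],
    ["A", "B", "C", "D"],
]
counts = vertex_counts(transactions)
frequent = frequent_vertices(transactions, min_support=3)
```

Later passes would extend the frequent vertices into longer sub-paths; only the initial counting scan is shown here.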
Conclusion ● Algorithms used ● Resources and techniques used for the implementation of the Quranic Arabic Corpus ● How data mining is applied ● How text analytics has also been applied