# Clustering Categorical Data: The Case of Quran Verses


Clustering Categorical Data: The Case of Quran Verses
Presented by Muhammad Al-Watban, IS 598

Outline
- Introduction
- Preprocessing of Quran Verses
- Similarity Measures
- Assessing Cluster Similarities
- Shortcomings of traditional clustering methods with categorical data
- ROCK: major definitions
- ROCK clustering algorithm
- ROCK example
- Conclusion and future work

Introduction
The Holy Quran covers a wide range of topics. It does not cover each topic in a single run of consecutive verses or suras, and a single verse usually deals with several subjects. Project goal: to cluster the verses of the Holy Quran based on the subjects of each verse.

Preprocessing of Quran Verses
It is necessary to preprocess the Quran text manually to capture the subjects of the verses in a tabular format. Verses in the Holy Quran can be viewed as records, and the related subjects as attributes of each record. This is demonstrated by the following table:
[Table: verses as records, with their subjects as attributes]
The data in the above table is similar to what is known as market-basket data. Here, we will call it verses-treasures data.

Similarity Measures
There are two types of attributes. Continuous attributes: the range of attribute values is continuous and ordered. This includes attributes with numeric values (e.g. salary), and also attributes whose allowed values form an ordered set with a meaningful sequence (e.g. professional ranks, disease severity levels). The similarity (or dissimilarity) between objects is computed from the distance between them. The most commonly used distance measures are the Euclidean distance and the Manhattan distance.

Similarity Measures
Categorical attributes: attributes whose underlying domain is not ordered. Examples: colors, blood type. If the attribute has only two states (namely 0 and 1), it is called binary; if it has more than two states, it is called nominal. There is no easy way to measure a distance between such objects. We can define dissimilarity based on the simple matching approach:

$$d(i,j) = \frac{p - m}{p}$$

where m is the number of matched attributes and p is the total number of attributes.
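The simple matching approach can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function name `simple_matching_dissimilarity` and the sample records are hypothetical.

```python
def simple_matching_dissimilarity(a, b):
    """Return (p - m) / p, where p is the number of attributes
    and m is the number of attributes on which a and b agree."""
    if len(a) != len(b):
        raise ValueError("records must have the same number of attributes")
    p = len(a)
    m = sum(1 for x, y in zip(a, b) if x == y)
    return (p - m) / p


# Hypothetical records: (color, blood type, binary flag)
r1 = ("red", "A", 1)
r2 = ("red", "B", 1)

# One attribute out of three differs.
print(simple_matching_dissimilarity(r1, r2))
# Identical records have dissimilarity 0.
print(simple_matching_dissimilarity(r1, r1))
```

Because only equality is tested, the measure works for nominal and binary attributes alike and never relies on an ordering of the values.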

Similarity Measures
Where does the verses-treasures data fit? Each verse can be represented by a record with Boolean attributes, where each attribute corresponds to a single subject. The attribute corresponding to a subject is T if the verse contains that subject; otherwise, it is F. As we said, Boolean attributes are a special case of categorical attributes.

Assessing Cluster Similarities
Many clustering algorithms (such as hierarchical clustering) require computing the distance between clusters (rather than between elements). There are several standard methods:
1. Single linkage: the distance D(r,s) between clusters r and s is defined as the distance between the closest pair of objects, one from each cluster: $D(r,s) = \min_{i \in r,\, j \in s} d(i,j)$

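Assessing Cluster Similarities
2. Complete linkage: the distance is defined as the distance between the farthest pair of objects, one from each cluster: $D(r,s) = \max_{i \in r,\, j \in s} d(i,j)$
3. Average linkage: the distance is defined as the average of the distances between all pairs of objects i and j, where i belongs to cluster r and j belongs to cluster s: $D(r,s) = \frac{1}{n_r n_s} \sum_{i \in r} \sum_{j \in s} d(i,j)$, with $n_r$ and $n_s$ the cluster sizes.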

Assessing Cluster Similarities
4. Centroid linkage: the distance between clusters is defined as the distance between the pair of cluster centroids: $D(r,s) = d(\bar{x}_r, \bar{x}_s)$, where $\bar{x}_r$ and $\bar{x}_s$ are the centroids of r and s.
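The four linkage methods can be compared side by side with a short Python sketch. It assumes Euclidean distance between numeric points; the function names and the two sample clusters `r` and `s` are hypothetical and only serve as an illustration.

```python
import math
from itertools import product


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def single_linkage(r, s, dist=euclidean):
    # Distance between the closest pair, one object from each cluster.
    return min(dist(a, b) for a, b in product(r, s))


def complete_linkage(r, s, dist=euclidean):
    # Distance between the farthest pair, one object from each cluster.
    return max(dist(a, b) for a, b in product(r, s))


def average_linkage(r, s, dist=euclidean):
    # Mean distance over all cross-cluster pairs.
    return sum(dist(a, b) for a, b in product(r, s)) / (len(r) * len(s))


def centroid(cluster):
    n = len(cluster)
    return tuple(sum(column) / n for column in zip(*cluster))


def centroid_linkage(r, s, dist=euclidean):
    # Distance between the two cluster centroids.
    return dist(centroid(r), centroid(s))


# Hypothetical two-dimensional clusters.
r = [(0.0, 0.0), (0.0, 2.0)]
s = [(3.0, 0.0), (5.0, 0.0)]

for name, fn in [("single", single_linkage), ("complete", complete_linkage),
                 ("average", average_linkage), ("centroid", centroid_linkage)]:
    print(name, round(fn(r, s), 3))
```

On the same pair of clusters, single linkage gives the smallest value and complete linkage the largest, with the average and centroid values in between.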

Shortcomings of traditional clustering methods with categorical data
Example: consider the following 4 market-basket transactions: T1 = {1, 2, 3, 4}, T2 = {1, 2, 4}, T3 = {3}, T4 = {4}. Converting these transactions to Boolean points, we get: P1 = (1, 1, 1, 1), P2 = (1, 1, 0, 1), P3 = (0, 0, 1, 0), P4 = (0, 0, 0, 1). Using Euclidean distance to measure the closeness between all pairs of points, we find that d(p1,p2) is the smallest distance:

$$d(p_1, p_2) = \sqrt{(1-1)^2 + (1-1)^2 + (1-0)^2 + (1-1)^2} = 1$$

Shortcomings of traditional clustering methods with categorical data
If we use the centroid-based hierarchical algorithm, we merge P1 and P2 and get a new cluster (P12) with (1, 1, 0.5, 1) as its centroid. Then, using the squared Euclidean distance, we find: d²(p12,p3) = 3.25, d²(p12,p4) = 2.25, d²(p3,p4) = 2 (that is, distances of about 1.80, 1.50, and 1.41). So we should merge P3 and P4, since the distance between them is the shortest. However, T3 and T4 do not have even a single item in common. So using distance metrics as a similarity measure for categorical data is not appropriate. The solution is ROCK.
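The arithmetic of this example can be checked with a small Python sketch. The helper names (`to_boolean_point`, `squared_distance`) are hypothetical; the transactions are the four from the example.

```python
import math

items = [1, 2, 3, 4]
transactions = {"T1": {1, 2, 3, 4}, "T2": {1, 2, 4}, "T3": {3}, "T4": {4}}


def to_boolean_point(transaction):
    # One coordinate per item: 1 if the item is in the basket, else 0.
    return tuple(1 if item in transaction else 0 for item in items)


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


p1, p2, p3, p4 = (to_boolean_point(transactions[t]) for t in ("T1", "T2", "T3", "T4"))

# Centroid of the merged cluster {P1, P2}.
p12 = tuple((a + b) / 2 for a, b in zip(p1, p2))
print(p12)

for label, a, b in [("p12,p3", p12, p3), ("p12,p4", p12, p4), ("p3,p4", p3, p4)]:
    d2 = squared_distance(a, b)
    print(label, d2, round(math.sqrt(d2), 2))

# T3 and T4 are the closest pair, yet they share no item.
print(transactions["T3"] & transactions["T4"])
```

The ordering of the three distances is the same whether or not the square root is taken, so the centroid-based algorithm would merge P3 and P4 next even though the two baskets are disjoint.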

ROCK - Major definitions
- Similarity function
- Neighbors
- Links
- Criterion function
- Goodness measure

Similarity function
Let Sim(Pi, Pj) be a similarity function that measures the closeness between points Pi and Pj. ROCK assumes that the Sim function is normalized to return a value between 0 and 1. For the Quran treasures data, a possible definition of the Sim function is based on the Jaccard coefficient:

$$\mathrm{Sim}(P_i, P_j) = \frac{|P_i \cap P_j|}{|P_i \cup P_j|}$$

Example: similarity function
Suppose two verses (P1 and P2) contain the following subjects: P1 = {judgment, faith, prayer, fair}, P2 = {fasting, faith, prayer}. Then Sim(P1,P2) = |P1 ∩ P2| / |P1 ∪ P2| = 2 / 5 = 0.40.
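As a quick check of this example, the Jaccard coefficient can be written directly with Python sets. The function name `jaccard` and the handling of two empty sets are assumptions made for this sketch.

```python
def jaccard(a, b):
    """Jaccard coefficient |a ∩ b| / |a ∪ b| of two subject sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Assumption: two empty sets are treated as having similarity 0.
        return 0.0
    return len(a & b) / len(union)


p1 = {"judgment", "faith", "prayer", "fair"}
p2 = {"fasting", "faith", "prayer"}

print(sorted(p1 & p2))  # shared subjects
print(jaccard(p1, p2))  # 0.4
```

The result is always between 0 and 1, which matches the normalization that ROCK expects of the similarity function.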


Neighbors and Links
One main problem of traditional clustering is that only local properties involving the two points themselves are considered. Neighbor: if the similarity between two points reaches a certain similarity threshold (θ), they are neighbors. Link: the link for a pair of points is the number of their common neighbors. Link therefore incorporates global information about the other points in the neighborhood of the two points. The larger the link, the higher the probability that the pair of points belongs to the same cluster.

Assume that we have three distinct points p1, p2 and p3, where neighbor(p1) = {p1, p2}, neighbor(p2) = {p1, p2, p3}, neighbor(p3) = {p3, p2}.
[Figure: neighboring graph of p1, p2 and p3]
To define the number of links between two points, say p1 and p3, we have to find the number of their common neighbors. Hence, we can define the link function between p1 and p3 to be: Link(p1,p3) = |neighbor(p1) ∩ neighbor(p3)| = |{p2}|, or Link(p1,p3) = 1.
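The link count for this three-point example is a plain set intersection. The sketch below uses the neighbor sets exactly as given, including the convention that each point is its own neighbor; the dictionary and function names are hypothetical.

```python
# Neighbor sets from the three-point example (each point is its own neighbor).
neighbor = {
    "p1": {"p1", "p2"},
    "p2": {"p1", "p2", "p3"},
    "p3": {"p3", "p2"},
}


def link(neighbor, a, b):
    """Number of common neighbors of points a and b."""
    return len(neighbor[a] & neighbor[b])


print(neighbor["p1"] & neighbor["p3"])  # {'p2'}
print(link(neighbor, "p1", "p3"))       # 1
print(link(neighbor, "p1", "p2"))       # 2
```

Note that p1 and p3 are not neighbors of each other, yet they still have one link through p2.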

Suppose we have four points P1, P2, P3, P4 and the similarity threshold (θ) is equal to 1. Then two points are neighbors if sim(Pi,Pj) >= 1. Hence, points are neighbors only of identical points (i.e. only of themselves). To find Link(P1,P2): neighbor(P1) = {P1}, neighbor(P2) = {P2}, so link(P1,P2) = |neighbor(P1) ∩ neighbor(P2)| = 0.

The following table shows the number of links (common neighbors) between the four points:
[Table: number of links between the four points for θ = 1]
We can depict the neighboring graph:
[Figure: neighboring graph for θ = 1]

Suppose we have four points P1, P2, P3, P4 and the similarity threshold (θ) is equal to 0. Then two points are neighbors if sim(Pi,Pj) >= 0. Hence, any pair of points are neighbors. To find Link(P1,P2): neighbor(P1) = {P1, P2, P3, P4}, neighbor(P2) = {P1, P2, P3, P4}, so link(P1,P2) = |neighbor(P1) ∩ neighbor(P2)| = 4.

The following table shows the number of links (common neighbors) between the four points:
[Table: number of links between the four points for θ = 0]
We can depict the neighboring graph:
[Figure: neighboring graph for θ = 0]

From the previous example, we have: neighbor(P1) = {P1, P2, P3, P4}, neighbor(P3) = {P1, P2, P3, P4}, so link(P1,P3) = |neighbor(P1) ∩ neighbor(P3)| = 4 links. We can depict these four different links (or paths) through the four different neighbors as follows:
[Figure: the four paths between P1 and P3, one through each common neighbor]


Criterion function
To get the best clusters, we have to maximize this criterion function:

$$E_l = \sum_{i=1}^{k} n_i \sum_{p_q,\, p_r \in C_i} \frac{\mathrm{link}(p_q, p_r)}{n_i^{1 + 2 f(\theta)}}$$

where $C_i$ denotes cluster i, $n_i$ is the number of points in $C_i$, k is the number of clusters, and θ is the similarity threshold. Suppose that in $C_i$ each point has roughly $n_i^{f(\theta)}$ neighbors. A suitable choice for basket data is f(θ) = (1 - θ)/(1 + θ).
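To see how the criterion function ranks two candidate clusterings, here is a minimal Python sketch. It assumes the inner sum runs over unordered pairs of distinct points in a cluster; the point names `a` to `d`, their link counts, and the threshold value 0.3 are hypothetical.

```python
from itertools import combinations


def f(theta):
    # Choice suggested for basket data.
    return (1 - theta) / (1 + theta)


def criterion(clusters, link, theta):
    """Sum over clusters of n_i * (intra-cluster links) / n_i ** (1 + 2 f(theta)).

    Assumption: each unordered pair of distinct points is counted once.
    """
    exponent = 1 + 2 * f(theta)
    total = 0.0
    for cluster in clusters:
        n = len(cluster)
        intra = sum(link[frozenset(pair)] for pair in combinations(cluster, 2))
        total += n * intra / n ** exponent
    return total


# Hypothetical link counts between four points.
link = {
    frozenset("ab"): 3, frozenset("ac"): 3, frozenset("bc"): 3,
    frozenset("ad"): 1, frozenset("bd"): 2, frozenset("cd"): 1,
}
theta = 0.3

tight = [["a", "b", "c"], ["d"]]   # keeps the strongly linked points together
split = [["a", "b"], ["c", "d"]]   # separates c from a and b

print(criterion(tight, link, theta))
print(criterion(split, link, theta))
```

The clustering that keeps the three strongly linked points together scores higher, which is the behavior the criterion is designed to reward.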

Criterion function
By maximizing this criterion function, we maximize the sum of links between intra-cluster point pairs and, at the same time, minimize the sum of links between pairs of points belonging to different clusters (i.e. inter-cluster point pairs).


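Goodness measure
Goodness function:

$$g(C_i, C_j) = \frac{\mathrm{link}[C_i, C_j]}{(n_i + n_j)^{1 + 2 f(\theta)} - n_i^{1 + 2 f(\theta)} - n_j^{1 + 2 f(\theta)}}$$

where link[Ci, Cj] is the number of cross links between clusters Ci and Cj. During clustering, we use this goodness measure in order to maximize the criterion function. The goodness measure helps to identify the best pair of clusters to merge at each step of ROCK.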
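The goodness measure can be sketched in Python as the number of cross links between two clusters divided by the expected number of cross links, following the ROCK definition with f(θ) = (1 - θ)/(1 + θ). The function names and the sample arguments are hypothetical.

```python
def f(theta):
    return (1 - theta) / (1 + theta)


def goodness(cross_links, n_i, n_j, theta):
    """Cross links between two clusters of sizes n_i and n_j,
    normalized by the expected number of cross links."""
    if not 0 <= theta < 1:
        # At theta = 1 the exponent is 1 and the denominator becomes 0.
        raise ValueError("theta must satisfy 0 <= theta < 1")
    e = 1 + 2 * f(theta)
    return cross_links / ((n_i + n_j) ** e - n_i ** e - n_j ** e)


theta = 0.3
# More cross links between same-sized clusters gives a higher goodness.
print(goodness(3, 1, 1, theta), goodness(2, 1, 1, theta))
# The same number of cross links counts for less when a cluster is larger.
print(goodness(3, 1, 1, theta), goodness(3, 2, 1, theta))
```

The normalization keeps large clusters from absorbing everything simply because they have more points, and therefore more raw links, than small ones.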

ROCK Clustering algorithm
Input: a set S of data points, the number k of clusters to be found, and the similarity threshold θ. Output: groups of clustered data. The ROCK algorithm is divided into three major parts:
1. Draw a random sample from the data set.
2. Perform a hierarchical agglomerative clustering algorithm.
3. Label the data on disk.
In our case, we do not deal with a very large data set, so we will consider the whole data set in the process of forming clusters, i.e. we skip steps 1 and 3.

ROCK Clustering algorithm
Draw a random sample from the data set: sampling is used to ensure scalability to very large data sets. The initial sample is used to form clusters, then the remaining data on disk is assigned to these clusters. In our case, we will consider the whole data set in the process of forming clusters.

ROCK Clustering algorithm
Perform a hierarchical agglomerative clustering algorithm: ROCK performs the following steps, which are common to all hierarchical agglomerative clustering algorithms but use a different definition of the similarity measure:
a. Place each data point into a separate cluster.
b. Compute the similarity measure for all pairs of clusters.
c. Merge the two clusters with the highest similarity (goodness measure).
d. Verify a stop condition. If it is not met, go to step b.

Label data on disk: finally, the remaining data points on disk are assigned to the generated clusters. This is done by selecting a random sample Li from each cluster Ci, then assigning each point p to the cluster for which it has the strongest linkage with Li. As we said, we will consider the whole data set in the process of forming clusters.

ROCK Clustering algorithm
Computation of links: using the similarity threshold θ, we can convert the similarity matrix into an adjacency matrix (A). Then we obtain a matrix indicating the number of links by calculating A × A, i.e. by multiplying the adjacency matrix A by itself.
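The link computation can be sketched in plain Python without any matrix library. The similarity values below are hypothetical and chosen so that, with a threshold of 0.5, the adjacency matrix describes three points where the first and third are each neighbors of the second but not of each other. Each point counts as its own neighbor.

```python
def to_adjacency(sim, theta):
    # A[i][j] = 1 when points i and j are neighbors (similarity >= theta).
    return [[1 if value >= theta else 0 for value in row] for row in sim]


def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


# Hypothetical similarity matrix for three points.
sim = [
    [1.0, 0.6, 0.2],
    [0.6, 1.0, 0.7],
    [0.2, 0.7, 1.0],
]

A = to_adjacency(sim, theta=0.5)
links = matmul(A, A)

for row in links:
    print(row)
```

Entry (i, j) of A × A counts the points k that are neighbors of both i and j, which is exactly the number of common neighbors. Here the entry for the first and third points is 1, because the second point is their only common neighbor.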

ROCK Example
Suppose we have four verses that contain the following subjects: P1 = {judgment, faith, prayer, fair}, P2 = {fasting, faith, prayer}, P3 = {fair, fasting, faith}, P4 = {fasting, prayer, pilgrimage}. The similarity threshold is θ = 0.3, and the number of required clusters is 2. Using the Jaccard coefficient as a similarity measure, we obtain the following similarity table:
[Table: pairwise Jaccard similarities of P1 to P4]

ROCK Example
Since the similarity threshold is equal to 0.3, we derive the adjacency table:
[Table: adjacency table of P1 to P4 for θ = 0.3]
By multiplying the adjacency table by itself, we derive the following table, which shows the number of links (or common neighbors):

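ROCK Example
We obtain the following table:
[Table: number of links between each pair of verses]
We compute the goodness measure for all adjacent points, assuming that f(θ) = (1 - θ)/(1 + θ), and obtain the following table:
[Table: goodness measure for each pair of adjacent points]
We have an equal goodness measure for merging the pairs (P1,P2), (P2,P3), and (P3,P1).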
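The values behind these tables can be recomputed with a short Python sketch. It assumes, as in the earlier neighbor examples, that every point is its own neighbor, and it uses the ROCK goodness formula for clusters of size one; the variable names are hypothetical.

```python
from itertools import combinations

verses = {
    "P1": {"judgment", "faith", "prayer", "fair"},
    "P2": {"fasting", "faith", "prayer"},
    "P3": {"fair", "fasting", "faith"},
    "P4": {"fasting", "prayer", "pilgrimage"},
}
THETA = 0.3
names = sorted(verses)


def jaccard(a, b):
    return len(a & b) / len(a | b)


# Similarity table (a point compared with itself has similarity 1).
sim = {(a, b): jaccard(verses[a], verses[b]) for a in names for b in names}

# Neighbors: similarity at or above the threshold.
neighbor = {a: {b for b in names if sim[a, b] >= THETA} for a in names}

# Links: number of common neighbors of each pair.
links = {(a, b): len(neighbor[a] & neighbor[b]) for a, b in combinations(names, 2)}

# Goodness for merging two single-point clusters.
exponent = 1 + 2 * (1 - THETA) / (1 + THETA)
goodness = {pair: count / (2 ** exponent - 2) for pair, count in links.items()}

best = max(goodness.values())
best_pairs = sorted(pair for pair, g in goodness.items() if g == best)

for pair in sorted(links):
    print(pair, round(sim[pair], 2), links[pair], round(goodness[pair], 3))
print("best pairs:", best_pairs)
```

Under these assumptions, the three pairs among P1, P2 and P3 each have 3 links and tie for the highest goodness, while every pair involving P4 scores lower.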

ROCK Example
Now we start the hierarchical algorithm by merging, say, P1 and P2. A new cluster, which we call C(P1,P2), is formed. It should be noted that some other hierarchical clustering techniques would not start the clustering process by merging P1 and P2, since Sim(P1,P2) = 0.4, which is not the highest similarity. ROCK, however, uses the number of links as the similarity measure rather than distance.

ROCK Example
After merging P1 and P2, we have only three clusters. The following table shows the number of common neighbors for these clusters:
[Table: number of links between C(P1,P2), P3 and P4]
Then we can obtain the following goodness measures for all adjacent clusters:
[Table: goodness measures between C(P1,P2), P3 and P4]

ROCK Example Since the number of required clusters is 2, then we finish the clustering algorithm by merging C(P1,P2) and P3, obtaining a new cluster C(P1,P2,P3) which contains {P1,P2,P3} leaving P4 alone in a separate cluster.
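The whole example can be replayed with a compact Python sketch of the agglomerative part of ROCK (no sampling and no labeling phase). It is an illustration under stated assumptions, not a reference implementation: every point is its own neighbor, ties are broken by taking the first best pair, and the function names are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    return len(a & b) / len(a | b)


def rock(points, k, theta):
    """Merge clusters by highest goodness until k clusters remain."""
    if not 0 <= theta < 1:
        raise ValueError("theta must satisfy 0 <= theta < 1")
    names = list(points)
    neighbor = {a: {b for b in names if jaccard(points[a], points[b]) >= theta}
                for a in names}
    link = {frozenset((a, b)): len(neighbor[a] & neighbor[b])
            for a, b in combinations(names, 2)}
    exponent = 1 + 2 * (1 - theta) / (1 + theta)

    def cross_links(ci, cj):
        return sum(link[frozenset((a, b))] for a in ci for b in cj)

    def goodness(ci, cj):
        ni, nj = len(ci), len(cj)
        expected = (ni + nj) ** exponent - ni ** exponent - nj ** exponent
        return cross_links(ci, cj) / expected

    # Step a: every point starts in its own cluster.
    clusters = [frozenset([name]) for name in names]
    while len(clusters) > k:
        # Steps b and c: find and merge the pair with the highest goodness.
        ci, cj = max(combinations(clusters, 2), key=lambda pair: goodness(*pair))
        if cross_links(ci, cj) == 0:
            break  # no linked clusters are left to merge
        clusters = [c for c in clusters if c not in (ci, cj)] + [ci | cj]
    return clusters


verses = {
    "P1": {"judgment", "faith", "prayer", "fair"},
    "P2": {"fasting", "faith", "prayer"},
    "P3": {"fair", "fasting", "faith"},
    "P4": {"fasting", "prayer", "pilgrimage"},
}

for cluster in rock(verses, k=2, theta=0.3):
    print(sorted(cluster))
```

With k = 2 and θ = 0.3 the sketch ends with P1, P2 and P3 in one cluster and P4 alone in the other, matching the result of the example.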

Conclusion and future work (1/3)
We aim to apply a clustering technique to the verses of the Holy Quran. We should first preprocess the Quran text manually to capture the subjects of the verses in a tabular format. Then we can apply a clustering algorithm that places each set of similar verses in the same group.

Conclusion and future work (2/3)
Most traditional clustering algorithms use distance-based similarity measures, which are not appropriate for clustering our categorical datasets. We will apply the general framework of the ROCK algorithm. The ROCK (RObust Clustering using linKs) algorithm is an agglomerative hierarchical clustering algorithm for categorical data. It introduces a new notion of links to measure similarity between data objects.

Conclusion and future work (3/3)
We will use the Java language to implement the ROCK clustering algorithm. During testing, we will try to form clusters of verses belonging to a single sura, and of verses belonging to many different suras. Insha Allah, we will achieve success in performing this mission.