Data Mining and Text Mining. The Standard Data Mining process.

1 Data Mining and Text Mining

2 The Standard Data Mining process

3 Text Mining: Machine learning on text data. Text data mining. Text analysis. Part of Web mining. Typical tasks include: – Text categorization (document classification) – Text clustering – Text summarization – Opinion mining – Entity/concept extraction – Information retrieval: search engines – Information extraction: question answering

4 Supervised learning algorithms: – Decision tree learning, e.g. C4.5 – Naïve Bayes (NB) – K-nearest neighbour (KNN) – Support Vector Machines (SVM) – Neural networks – Genetic algorithms – The 2007 paper "Top 10 algorithms in data mining" by X. Wu et al. includes C4.5, NB, KNN and SVM

5 Supervised Machine learning 1. Build or get a representative corpus 2. Label it 3. Define features 4. Represent documents 5. Learn and analyse 6. Go to 3 until accuracy is acceptable
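
The loop on this slide can be sketched with a very small multinomial Naïve Bayes classifier. This is only an illustration under stated assumptions: the four-document corpus, its `spam`/`ham` labels, and the helper names `features`, `train` and `predict` are all invented here and do not come from the slides. Features are plain lowercase tokens; a real experiment would iterate on step 3 (for example, by stemming).

```python
import math
from collections import Counter, defaultdict

# Steps 1-2: a tiny hand-labelled corpus (invented for illustration).
corpus = [
    ("cheap pills buy now", "spam"),
    ("buy cheap watches now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("project meeting notes attached", "ham"),
]

def features(text):
    # Step 3: features are lowercase, whitespace-separated tokens.
    return text.lower().split()

def train(labelled_docs):
    # Steps 4-5: represent each document as token counts and
    # accumulate the per-class statistics Naive Bayes needs.
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labelled_docs:
        class_counts[label] += 1
        for tok in features(text):
            word_counts[label][tok] += 1
            vocab.add(tok)
    return class_counts, word_counts, vocab

def predict(model, text):
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        # Log prior of the class.
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for tok in features(text):
            if tok not in vocab:
                continue  # ignore words never seen in training
            # Log likelihood with add-one (Laplace) smoothing.
            score += math.log(
                (word_counts[label][tok] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train(corpus)
print(predict(model, "buy cheap pills"))
print(predict(model, "monday project meeting"))
```

Step 6 would correspond to measuring accuracy on held-out labelled documents and revising `features` until the result is acceptable.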

6 Text data: First test features: stemmed words. Feature selection and generation.

7 Unsupervised Learning: Learning from unlabelled data. Clustering: partition unlabelled examples into disjoint subsets (clusters), such that: – Examples within a cluster are very similar – Examples in different clusters are very different. Discover new categories in an unsupervised manner (no sample category labels provided).

8 Clustering Example

9 Document clustering: Distance-based: k-means, Hierarchical Agglomerative Clustering (HAC), …. Word- and phrase-based. Probabilistic, topic models. Online clustering with text streams. Clustering text in networks. Semi-supervised clustering.

10 Similarity measure: There are many different ways to measure how similar two documents are, or how similar a document is to a query. The result depends heavily on the choice of terms used to represent the text documents. Bag-of-words, vector space model: – Each term is a feature – Documents are feature vectors – Weighted by TF-IDF – Similarity measure: cosine similarity
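
A minimal sketch of the vector space model described on this slide, using only the Python standard library. The weighting used here (raw term frequency times log(N/df)) is one common TF-IDF variant chosen as an assumption; the slides do not say which variant is meant. The three example documents and the function names `tfidf_vectors` and `cosine` are invented for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Turn raw strings into sparse TF-IDF vectors (dicts term -> weight)."""
    tokenised = [d.lower().split() for d in docs]  # bag of words
    n = len(tokenised)
    df = Counter()                                 # document frequency per term
    for toks in tokenised:
        df.update(set(toks))
    vectors = []
    for toks in tokenised:
        tf = Counter(toks)                         # term frequency in this doc
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

docs = [
    "the cat sat on the mat",
    "the cat ate the fish",
    "stock markets fell sharply",
]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]))  # share "the" and "cat": positive
print(cosine(vecs[0], vecs[2]))  # no shared terms: zero
```

A query can be treated as one more short document and compared against each document vector in the same way.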

11 Document Similarity measures

12 Document Similarity Measures

13 Hierarchical Clustering: Build a tree-based hierarchical taxonomy (dendrogram) from a set of unlabeled examples. Recursive application of a standard clustering algorithm can produce a hierarchical clustering. Example taxonomy: animal: vertebrate (fish, reptile, amphib., mammal); invertebrate (worm, insect, crustacean).

14 Agglomerative vs. Divisive Clustering: Agglomerative (bottom-up) methods start with each example in its own cluster and iteratively combine them to form larger and larger clusters. Divisive (partitional, top-down) methods separate all examples immediately into clusters.

15 Hierarchical clustering: Given a set of N items to be clustered, and an N×N distance (or similarity) matrix: 1. Start by assigning each item to its own cluster. 2. Find the closest (most similar) pair of clusters and merge them into a single cluster, so that you now have one fewer cluster. 3. Compute distances (similarities) between the new cluster and each of the old clusters. 4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N.
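
The four steps above can be sketched in plain Python as follows. The slide does not say how step 3 computes the distance to the new cluster, so this sketch assumes single link (the minimum of the two old distances); the function name `agglomerative` and the example matrix are invented for illustration.

```python
def agglomerative(dist):
    """dist: symmetric N x N list of lists of distances.
    Returns the merge history as (sorted member indices, merge distance)."""
    # Step 1: every item starts in its own cluster.
    clusters = {i: [i] for i in range(len(dist))}
    # Cluster-to-cluster distances, keyed by (smaller_id, larger_id).
    d = {(i, j): dist[i][j] for i in clusters for j in clusters if i < j}
    merges = []
    next_id = len(dist)
    while len(clusters) > 1:
        # Step 2: find and merge the closest pair of clusters.
        (a, b), best = min(d.items(), key=lambda kv: kv[1])
        merged = clusters.pop(a) + clusters.pop(b)
        # Step 3: distance from the new cluster to each remaining cluster.
        new_d = {}
        for c in clusters:
            dac = d[(min(a, c), max(a, c))]
            dbc = d[(min(b, c), max(b, c))]
            new_d[(c, next_id)] = min(dac, dbc)  # single link (assumption)
        # Drop every distance that involved the two merged clusters.
        d = {k: v for k, v in d.items() if a not in k and b not in k}
        d.update(new_d)
        clusters[next_id] = merged
        merges.append((sorted(merged), best))
        next_id += 1
        # Step 4: the loop repeats until one cluster remains.
    return merges

# Four points on a line at positions 0, 1, 5 and 6.
dist = [
    [0, 1, 5, 6],
    [1, 0, 4, 5],
    [5, 4, 0, 1],
    [6, 5, 1, 0],
]
merges = agglomerative(dist)
for members, distance in merges:
    print(members, distance)
```

The merge history is exactly the information a dendrogram draws: which clusters were joined, and at what distance.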

16 Iwona Białynicka-Birula - Clustering Web Search Results Agglomerative hierarchical clustering

17 Iwona Białynicka-Birula - Clustering Web Search Results Clustering result: dendrogram

18 Cluster Similarity: Assume a similarity function that determines the similarity of two instances: sim(x,y). – Cosine similarity of document vectors. How to compute the similarity of two clusters, each possibly containing multiple instances? – Single Link: Similarity of two most similar members. – Complete Link: Similarity of two least similar members. – Group Average: Average similarity between members.
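
The three options can be written as small functions over a pairwise similarity `sim(x, y)`. This is a sketch under assumptions: the toy similarity on numbers, the two example clusters, and the function names are invented here, and `group_average` averages over pairs between the two clusters, which is one reading of "average similarity between members".

```python
from itertools import product

def single_link(c1, c2, sim):
    # Similarity of the two most similar members.
    return max(sim(x, y) for x, y in product(c1, c2))

def complete_link(c1, c2, sim):
    # Similarity of the two least similar members.
    return min(sim(x, y) for x, y in product(c1, c2))

def group_average(c1, c2, sim):
    # Mean similarity over all cross-cluster pairs (assumed variant).
    pairs = list(product(c1, c2))
    return sum(sim(x, y) for x, y in pairs) / len(pairs)

def toy_sim(x, y):
    # Hypothetical similarity on numbers: 1 for identical values,
    # decreasing towards 0 as they move apart.
    return 1.0 / (1.0 + abs(x - y))

c1, c2 = [0, 1], [3, 5]
s = single_link(c1, c2, toy_sim)
c = complete_link(c1, c2, toy_sim)
g = group_average(c1, c2, toy_sim)
print(s, g, c)
```

For any pair of clusters the group average lies between the complete-link and single-link values, which is why it is often described as a compromise.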

19 Iwona Białynicka-Birula - Clustering Web Search Results AHC variants: Various ways of calculating cluster similarity: single-link (minimum distance), complete-link (maximum distance), group-average (average).

20 Single Link Agglomerative Clustering: Use maximum similarity of pairs. Can result in “straggly” (long and thin) clusters due to chaining effect. – Appropriate in some domains, such as clustering islands.

21 Single Link Example

22 Complete Link Agglomerative Clustering: Use minimum similarity of pairs. Makes more “tight,” spherical clusters that are typically preferable.

23 Complete Link Example

24 Group Average Agglomerative Clustering: Use average similarity across all pairs within the merged cluster to measure the similarity of two clusters. Compromise between single and complete link. Averaged across all ordered pairs in the merged cluster instead of unordered pairs between the two clusters (to encourage tighter final clusters).
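
Written out, the definition on this slide would look roughly as follows. The notation is an assumption: c_i and c_j stand for the two clusters being compared, and sim(x, y) is the instance-level similarity used elsewhere in the deck. The sum runs over all ordered pairs of distinct members of the merged cluster, so it is divided by the number of such pairs.

```latex
\[
\mathrm{sim}(c_i, c_j) =
  \frac{1}{|c_i \cup c_j| \, \bigl(|c_i \cup c_j| - 1\bigr)}
  \sum_{x \in c_i \cup c_j} \;
  \sum_{\substack{y \in c_i \cup c_j \\ y \neq x}}
  \mathrm{sim}(x, y)
\]
```

Because pairs inside each original cluster are included as well as pairs across the two, a merge scores highly only if the resulting cluster is tight as a whole.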

25 Strengths and weaknesses: Can find clusters of arbitrary shapes. Single link has a chaining problem. Complete link is sensitive to outliers. High computational complexity and space requirements.

26 Data Clustering K-means – Partitional clustering – Initial number of clusters k

27 K-means: 1. Place K points into the space represented by the objects that are being clustered. These points represent initial group centroids. 2. Assign each object to the group that has the closest centroid. 3. When all objects have been assigned, recalculate the positions of the K centroids. 4. Repeat Steps 2 and 3 until the centroids no longer move. This produces a separation of the objects into groups from which the metric to be minimized can be calculated.
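
A minimal sketch of these four steps in plain Python. Several details are assumptions not stated on the slide: the initial centroids are K randomly sampled data points, distance is squared Euclidean, an empty group keeps its old centroid, and `max_iter` caps the loop. The function name `kmeans` and the example points are invented for illustration.

```python
import random

def kmeans(points, k, seed=0, max_iter=100):
    """points: list of equal-length numeric tuples. Returns (centroids, groups)."""
    rng = random.Random(seed)
    # Step 1: pick K data points as the initial centroids (assumption).
    centroids = rng.sample(points, k)
    groups = []
    for _ in range(max_iter):
        # Step 2: assign each object to the group with the closest centroid.
        groups = [[] for _ in range(k)]
        for p in points:
            idx = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            groups[idx].append(p)
        # Step 3: recalculate each centroid as the mean of its group.
        new_centroids = []
        for i, g in enumerate(groups):
            if g:
                new_centroids.append(tuple(sum(c) / len(g) for c in zip(*g)))
            else:
                new_centroids.append(centroids[i])  # empty group: keep old centroid
        # Step 4: stop once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, groups

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, groups = kmeans(points, 2)
print(centroids)
print(groups)
```

On data this well separated the two groups are recovered whichever seeds are drawn; on harder data the result depends on the initial centroids, so running with several values of `seed` is a common precaution.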

28 K Means Example (K=2): Pick seeds. Reassign clusters. Compute centroids. Reassign clusters. Compute centroids. Reassign clusters. Converged!

29 Example by Andrew W. Moore

31 K-means

32 Iwona Białynicka-Birula - Clustering Web Search Results K-means clustering (k=3)

33 Strengths and Weaknesses: Only applicable to data sets where the notion of the mean is defined. Need to know the number of clusters K in advance. Sensitive to outliers. Sensitive to initial seeds. Not suitable for discovering clusters that are not hyper-ellipsoids (e.g. L shape).

