Data Mining and Text Mining. The Standard Data Mining process.


Data Mining and Text Mining

The Standard Data Mining process

Text Mining
Machine learning on text data; also called text data mining or text analysis; part of Web mining. Typical tasks include:
– Text categorization (document classification)
– Text clustering
– Text summarization
– Opinion mining
– Entity/concept extraction
– Information retrieval: search engines
– Information extraction: question answering

Supervised learning algorithms
– Decision tree learning, e.g. C4.5
– Naïve Bayes (NB)
– K-nearest neighbour (KNN)
– Support Vector Machines (SVM)
– Neural networks
– Genetic algorithms
Of these, C4.5, NB, KNN, and SVM appear in the 2007 paper “Top 10 Algorithms in Data Mining” by Xindong Wu et al.

Supervised Machine Learning
1. Build or get a representative corpus
2. Label it
3. Define features
4. Represent documents
5. Learn and analyse
6. Go to step 3 until accuracy is acceptable
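As a hedged sketch of this workflow, the fragment below trains and applies a minimal bag-of-words Naïve Bayes classifier (one of the algorithms listed earlier). The toy corpus, labels, and function names are illustrative, not from the slides:

```python
from collections import Counter, defaultdict
import math

def train_nb(docs, labels):
    """Steps 1-4: from a labelled, tokenized corpus, collect the counts NB needs."""
    class_counts = Counter(labels)            # document counts per class (prior)
    word_counts = defaultdict(Counter)        # token counts per class
    vocab = set()
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)
        vocab.update(doc)
    return class_counts, word_counts, vocab

def classify_nb(model, doc):
    """Step 5: pick the most probable class, in log space with add-one smoothing."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best, best_score = None, float("-inf")
    for c in class_counts:
        score = math.log(class_counts[c] / total_docs)   # log prior
        total = sum(word_counts[c].values())
        for w in doc:
            score += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = c, score
    return best

# Toy labelled corpus, already tokenized into bag-of-words features
docs = [["cheap", "pills", "buy"], ["meeting", "agenda", "notes"],
        ["buy", "cheap", "now"], ["project", "meeting", "today"]]
labels = ["spam", "ham", "spam", "ham"]
model = train_nb(docs, labels)
print(classify_nb(model, ["cheap", "buy"]))   # → spam
```

If the accuracy on held-out data were unacceptable, step 6 would send us back to redefining the features (e.g. stemming, feature selection) rather than to relabelling.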

Text Data
– First test features: stemmed words
– Feature selection and generation

Unsupervised Learning
Learning from unlabelled data. Clustering: partition unlabelled examples into disjoint subsets (clusters) such that:
– Examples within a cluster are very similar
– Examples in different clusters are very different
Clustering discovers new categories in an unsupervised manner (no sample category labels provided).

Clustering Example

Document Clustering
– Distance-based: k-means, Hierarchical Agglomerative Clustering (HAC), ….
– Word- and phrase-based
– Probabilistic: topic models
– Online clustering with text streams
– Clustering text in networks
– Semi-supervised clustering

Similarity Measure
There are many different ways to measure how similar two documents are, or how similar a document is to a query; the result depends strongly on the choice of terms used to represent the documents. Bag-of-words / vector space model:
– Each term is a feature
– Documents are feature vectors
– Weighted by TF-IDF
– Similarity measure: cosine similarity
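The vector-space representation above can be sketched in a few lines; this is a minimal raw-TF version of TF-IDF with sparse dict vectors (real systems usually normalize and smooth differently), and the toy documents are illustrative:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build TF-IDF weighted bag-of-words vectors for a list of token lists."""
    n = len(docs)
    df = Counter()                      # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    idf = {w: math.log(n / df[w]) for w in df}
    # raw term frequency times inverse document frequency, as sparse dicts
    return [{w: tf * idf[w] for w, tf in Counter(doc).items()} for doc in docs]

def cosine(u, v):
    """Cosine similarity between two sparse (dict) vectors."""
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = [["data", "mining", "text"], ["text", "clustering"], ["image", "processing"]]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]))   # shared term "text" → small positive similarity
print(cosine(vecs[0], vecs[2]))   # no shared terms → 0.0
```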

Document Similarity Measures

Hierarchical Clustering
Build a tree-based hierarchical taxonomy (dendrogram) from a set of unlabeled examples. Recursive application of a standard clustering algorithm can produce a hierarchical clustering.
(Example taxonomy: animal → vertebrate {fish, reptile, amphibian, mammal} and invertebrate {worm, insect, crustacean}.)

Agglomerative vs. Divisive Clustering
– Agglomerative (bottom-up) methods start with each example in its own cluster and iteratively combine them to form larger and larger clusters.
– Divisive (partitional, top-down) methods separate all examples into clusters immediately.

Hierarchical clustering
Given a set of N items to be clustered and an N×N distance (or similarity) matrix:
1. Start by assigning each item to its own cluster.
2. Find the closest (most similar) pair of clusters and merge them into a single cluster, so that you now have one less cluster.
3. Compute the distances (similarities) between the new cluster and each of the old clusters.
4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N.
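The steps above can be sketched directly; this minimal version stops at k clusters rather than building the full tree, uses single-link distance, and works on numbers with an illustrative distance function (any metric could be swapped in):

```python
def agglomerative(points, k, dist=lambda a, b: abs(a - b)):
    """Bottom-up clustering: repeatedly merge the closest pair of clusters
    until only k remain. Cluster distance is single-link (closest members)."""
    clusters = [[p] for p in points]          # step 1: each item its own cluster
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):        # step 2: find the closest pair
            for j in range(i + 1, len(clusters)):
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)        # steps 3-4: merge and repeat
    return clusters

print(agglomerative([1, 2, 10, 11, 25], 2))  # → [[1, 2, 10, 11], [25]]
```

Running the loop all the way to a single cluster, while recording each merge, would yield the dendrogram described above.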

Agglomerative hierarchical clustering (slides by Iwona Białynicka-Birula, Clustering Web Search Results)

Clustering result: dendrogram

Cluster Similarity
Assume a similarity function sim(x, y) that determines the similarity of two instances, e.g. cosine similarity of document vectors. How do we compute the similarity of two clusters, each possibly containing multiple instances?
– Single link: similarity of the two most similar members
– Complete link: similarity of the two least similar members
– Group average: average similarity between members

AHC variants
Various ways of calculating cluster similarity:
– single-link (minimum)
– complete-link (maximum)
– group-average (average)
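The three linkage criteria reduce to three small functions over cluster members; the toy similarity function below is illustrative, but any sim(x, y), such as cosine similarity over document vectors, would fit the same shape:

```python
def single_link(c1, c2, sim):
    """Similarity of the two MOST similar members (max pairwise similarity)."""
    return max(sim(a, b) for a in c1 for b in c2)

def complete_link(c1, c2, sim):
    """Similarity of the two LEAST similar members (min pairwise similarity)."""
    return min(sim(a, b) for a in c1 for b in c2)

def group_average(c1, c2, sim):
    """Average pairwise similarity between members of the two clusters."""
    return sum(sim(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

# Toy similarity on numbers: closer values are more similar
sim = lambda a, b: 1.0 / (1.0 + abs(a - b))
print(single_link([1, 2], [4, 8], sim))    # driven by the closest pair, 2 and 4
print(complete_link([1, 2], [4, 8], sim))  # driven by the farthest pair, 1 and 8
```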

Single Link Agglomerative Clustering
Use the maximum similarity of pairs: sim(ci, cj) = max over x∈ci, y∈cj of sim(x, y). Can result in “straggly” (long and thin) clusters due to the chaining effect. Appropriate in some domains, such as clustering islands.

Single Link Example

Complete Link Agglomerative Clustering
Use the minimum similarity of pairs: sim(ci, cj) = min over x∈ci, y∈cj of sim(x, y). Makes “tighter,” more spherical clusters, which are typically preferable.

Complete Link Example

Group Average Agglomerative Clustering
Use the average similarity across all pairs within the merged cluster to measure the similarity of two clusters. A compromise between single and complete link. The average is taken across all ordered pairs in the merged cluster, rather than unordered pairs between the two clusters, to encourage tighter final clusters.

Strengths and Weaknesses
– Can find clusters of arbitrary shapes
– Single link has a chaining problem
– Complete link is sensitive to outliers
– High computational complexity and space requirements

Data Clustering: K-means
– Partitional clustering
– The number of clusters, k, must be chosen in advance

K-means
1. Place K points into the space represented by the objects being clustered. These points represent the initial group centroids.
2. Assign each object to the group that has the closest centroid.
3. When all objects have been assigned, recalculate the positions of the K centroids.
4. Repeat steps 2 and 3 until the centroids no longer move. This produces a separation of the objects into groups from which the metric to be minimized can be calculated.
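The four steps above can be sketched as follows; this minimal version works on 1-D points with random initial seeds (real implementations work on vectors and often restart from several seeds):

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Lloyd's algorithm: assign to the nearest centroid, then recompute."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)             # step 1: pick K initial centroids
    groups = [[] for _ in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:                          # step 2: nearest-centroid assignment
            groups[min(range(k), key=lambda i: abs(p - centroids[i]))].append(p)
        # step 3: recompute each centroid as the mean of its group
        new = [sum(g) / len(g) if g else centroids[i] for i, g in enumerate(groups)]
        if new == centroids:                      # step 4: stop when centroids are stable
            break
        centroids = new
    return centroids, groups

centroids, groups = kmeans([1.0, 2.0, 8.0, 9.0, 10.0], k=2)
print(sorted(centroids))   # → [1.5, 9.0]
```

On this toy data the two centroids settle at the means of the two obvious groups regardless of which points are drawn as seeds; with less separated data, the result does depend on the initial seeds, which is one of the weaknesses listed below.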

K-Means Example (K=2)
(Figure: pick seeds → assign clusters → compute centroids → reassign clusters → compute centroids → reassign clusters → converged.)

Example by Andrew W. Moore

K-means

K-means clustering (k=3)

Strengths and Weaknesses
– Only applicable to data sets where the notion of a mean is defined
– Need to know the number of clusters K in advance
– Sensitive to outliers
– Sensitive to initial seeds
– Not suitable for discovering clusters that are not hyper-ellipsoids (e.g. L-shaped clusters)