V. Clustering (2007.2.10), Artificial Intelligence Lab, Lee Seung-hee. Text: Text Mining, pages 82-93.

Similar presentations
Text mining Gergely Kótyuk Laboratory of Cryptography and System Security (CrySyS) Budapest University of Technology and Economics

Hierarchical Clustering, DBSCAN The EM Algorithm
Albert Gatt Corpora and Statistical Methods Lecture 13.
Clustering Basic Concepts and Algorithms
PARTITIONAL CLUSTERING
Clustering Paolo Ferragina Dipartimento di Informatica Università di Pisa This is a mix of slides taken from several presentations, plus my touch !
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ What is Cluster Analysis? l Finding groups of objects such that the objects in a group will.
1 Latent Semantic Mapping: Dimensionality Reduction via Globally Optimal Continuous Parameter Modeling Jerome R. Bellegarda.
Comparison of information retrieval techniques: Latent semantic indexing (LSI) and Concept indexing (CI) Jasminka Dobša Faculty of organization and informatics,
Clustering II.
ETHEM ALPAYDIN © The MIT Press, Lecture Slides for.
DIMENSIONALITY REDUCTION BY RANDOM PROJECTION AND LATENT SEMANTIC INDEXING Jessica Lin and Dimitrios Gunopulos Ângelo Cardoso IST/UTL December
Principal Component Analysis
Clustering… in General In vector space, clusters are vectors found within  of a cluster vector, with different techniques for determining the cluster.
© University of Minnesota Data Mining for the Discovery of Ocean Climate Indices 1 CSci 8980: Data Mining (Fall 2002) Vipin Kumar Army High Performance.
Prénom Nom Document Analysis: Data Analysis and Clustering Prof. Rolf Ingold, University of Fribourg Master course, spring semester 2008.
Introduction to Information Retrieval Introduction to Information Retrieval Hinrich Schütze and Christina Lioma Lecture 16: Flat Clustering 1.
Clustering 10/9/2002. Idea and Applications Clustering is the process of grouping a set of physical or abstract objects into classes of similar objects.
IR Models: Latent Semantic Analysis. IR Model Taxonomy Non-Overlapping Lists Proximal Nodes Structured Models U s e r T a s k Set Theoretic Fuzzy Extended.
Unsupervised Learning and Data Mining
Dimension reduction : PCA and Clustering Christopher Workman Center for Biological Sequence Analysis DTU.
Slide 1 EE3J2 Data Mining EE3J2 Data Mining - revision Martin Russell.
What is Cluster Analysis?
Ranking by Odds Ratio A Probability Model Approach let be a Boolean random variable: document d is relevant to query q otherwise Consider document d as.
Revision (Part II) Ke Chen COMP24111 Machine Learning Revision slides are going to summarise all you have learnt from Part II, which should be helpful.
INTRODUCTION TO Machine Learning ETHEM ALPAYDIN © The MIT Press, Lecture Slides for.
“A Comparison of Document Clustering Techniques” Michael Steinbach, George Karypis and Vipin Kumar (Technical Report, CSE, UMN, 2000) Mahashweta Das
Clustering Unsupervised learning Generating “classes”
Evaluating Performance for Data Mining Techniques
Text mining.
Text Clustering.
Clustering Supervised vs. Unsupervised Learning Examples of clustering in Web IR Characteristics of clustering Clustering algorithms Cluster Labeling 1.
Document Clustering 文件分類 林頌堅 世新大學圖書資訊學系 Sung-Chien Lin Department of Library and Information Studies Shih-Hsin University.
Basic Machine Learning: Clustering CS 315 – Web Search and Data Mining 1.
1 Motivation Web query is usually two or three words long. –Prone to ambiguity –Example “keyboard” –Input device of computer –Musical instruments How can.
June 5, 2006University of Trento1 Latent Semantic Indexing for the Routing Problem Doctorate course “Web Information Retrieval” PhD Student Irina Veredina.
Text mining. The Standard Data Mining process Text Mining Machine learning on text data Text Data mining Text analysis Part of Web mining Typical tasks.
SINGULAR VALUE DECOMPOSITION (SVD)
Dimension reduction : PCA and Clustering Slides by Agnieszka Juncker and Chris Workman modified by Hanne Jarmer.
Information Retrieval Lecture 6 Introduction to Information Retrieval (Manning et al. 2007) Chapter 16 For the MSc Computer Science Programme Dell Zhang.
Pattern Classification All materials in these slides were taken from Pattern Classification (2nd ed) by R. O. Duda, P. E. Hart and D. G. Stork, John Wiley.
Information Retrieval and Organisation Chapter 16 Flat Clustering Dell Zhang Birkbeck, University of London.
CS 8751 ML & KDDData Clustering1 Clustering Unsupervised learning Generating “classes” Distance/similarity measures Agglomerative methods Divisive methods.
Machine Learning Queens College Lecture 7: Clustering.
CSCI 5417 Information Retrieval Systems Jim Martin Lecture 15 10/13/2011.
E.G.M. PetrakisText Clustering1 Clustering  “Clustering is the unsupervised classification of patterns (observations, data items or feature vectors) into.
Text Clustering Hongning Wang
1 CS 391L: Machine Learning Clustering Raymond J. Mooney University of Texas at Austin.
1 Machine Learning Lecture 9: Clustering Moshe Koppel Slides adapted from Raymond J. Mooney.
1 CSC 594 Topics in AI – Text Mining and Analytics Fall 2015/16 8. Text Clustering.
Natural Language Processing Topics in Information Retrieval August, 2002.
Information Retrieval Search Engine Technology (8) Prof. Dragomir R. Radev.
Introduction to Data Mining Clustering & Classification Reference: Tan et al: Introduction to data mining. Some slides are adopted from Tan et al.
DATA MINING: CLUSTER ANALYSIS Instructor: Dr. Chun Yu School of Statistics Jiangxi University of Finance and Economics Fall 2015.
Clustering Machine Learning Unsupervised Learning K-means Optimization objective Random initialization Determining Number of Clusters Hierarchical Clustering.
Data Mining and Text Mining. The Standard Data Mining process.
Sampath Jayarathna Cal Poly Pomona
Semi-Supervised Clustering
Data Mining K-means Algorithm
John Nicholas Owen Sarah Smith
Revision (Part II) Ke Chen
CSE572, CBS572: Data Mining by H. Liu
CS 391L: Machine Learning Clustering
Clustering Techniques
Text Categorization Berlin Chen 2003 Reference:
CSE572: Data Mining by H. Liu
Presentation transcript:

V. Clustering. Artificial Intelligence Lab, Lee Seung-hee. Text: Text Mining, pages 82-93.

Outline
• V.1 Clustering tasks in text analysis
• V.2 The general clustering problem
• V.3 Clustering algorithms
• V.4 Clustering of textual data

Clustering  Clustering An unsupervised process through which objects are classified into groups called cluster. (cf. categorization is a supervised process.) Data mining, document retrieval, image segmentation, pattern classification.

V.1 Clustering tasks in text analysis (1/2)
• Cluster hypothesis: "Relevant documents tend to be more similar to each other than to nonrelevant ones."
• If the cluster hypothesis holds for a particular document collection, then clustering the documents may help improve search effectiveness:
Improving search recall: when a query matches a document, its whole cluster can be returned.
Improving search precision: by grouping the documents into a much smaller number of groups of related documents.

V.1 Clustering tasks in text analysis (2/2)
• Scatter/gather browsing method
Purpose: to enhance the efficiency of human browsing of a document collection when a specific search query cannot be formulated.
Session 1: the document collection is scattered into a set of clusters.
Session 2: the selected clusters are gathered into a new subcollection, with which the process may be repeated.
• Query-specific clustering is also possible; hierarchical clustering is appealing for it.

V.2 The general clustering problem (1/2)
• Clustering tasks: problem representation, definition of proximity measures, the actual clustering of objects, data abstraction, evaluation.
• Problem representation
Basically an optimization problem. Goal: select the best among all possible groupings of objects, as judged by a similarity function (clustering quality function).
Feature extraction / feature selection.
In a vector space model, objects are vectors in a high-dimensional feature space, and the similarity function is based on the distance between the vectors in some metric.

V.2 The general clustering problem (2/2)
• Similarity measures
Euclidean distance
Cosine similarity (the most common measure for text)
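A minimal sketch of the two measures named above (the function names and example vectors are illustrative, not from the text):

```python
import math

def euclidean(a, b):
    # Euclidean distance between two equal-length vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine(a, b):
    # Cosine similarity: dot product divided by the product of the norms
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

For document vectors, cosine similarity ignores differences in length: `cosine([1, 2, 0], [2, 4, 0])` is 1.0 even though the Euclidean distance between the two vectors is nonzero, which is one reason it is preferred for text.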

V.3 Clustering algorithms (1/9)
• Flat clustering: a single partition of a set of objects into disjoint groups.
Hierarchical clustering: a nested series of partitions.
• Hard clustering: every object belongs to exactly one cluster.
Soft clustering: objects may belong to several clusters, with a fractional degree of membership in each.

V.3 Clustering algorithms (2/9)
• Agglomerative algorithms: begin with each object in a separate cluster and successively merge clusters until a stopping criterion is satisfied.
Divisive algorithms: begin with a single cluster containing all objects and perform splitting until a stopping criterion is satisfied.
Shuffling algorithms: iteratively redistribute objects among clusters.

V.3 Clustering algorithms (3/9)
• k-means algorithm (1/2): a hard, flat, shuffling algorithm.

V.3 Clustering algorithms (4/9)
Example of the k-means algorithm.

V.3 Clustering algorithms (5/9)
• k-means algorithm (2/2)
Simple and efficient; complexity O(kn) per iteration.
A bad initial selection of seeds can lead to a local optimum; the Buckshot algorithm addresses this suboptimality of k-means.
The ISODATA algorithm maximizes the quality function Q.
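The hard, flat, shuffling behaviour of k-means can be sketched as follows (the function name, the fixed iteration count, and the random seeding are my own choices, not from the text):

```python
import random

def kmeans(vectors, k, iters=20, seed=0):
    # Hard, flat, shuffling clustering: alternate assignment and centroid update.
    rng = random.Random(seed)
    centroids = rng.sample(vectors, k)   # initial seeds; a bad pick may yield a local optimum
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            # assign each object to its nearest centroid (squared Euclidean distance)
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centroids[c])))
            clusters[i].append(v)
        # recompute each non-empty cluster's centroid as the mean of its members
        for i, cl in enumerate(clusters):
            if cl:
                centroids[i] = [sum(xs) / len(cl) for xs in zip(*cl)]
    return clusters, centroids
```

Each pass over the n objects costs O(kn) distance computations, matching the complexity noted above.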

V.3 Clustering algorithms (6/9)
• EM-based probabilistic clustering algorithm (1/2): soft, flat, probabilistic.
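As a simplified illustration of soft, EM-based clustering, here is a two-component, one-dimensional Gaussian mixture fitted by EM (the initialisation, the variance floor, and all names are assumptions of this sketch, not from the text):

```python
import math

def em_gmm_1d(xs, iters=50):
    # Soft, flat clustering: 2-component 1-D Gaussian mixture fitted by EM.
    mu = [min(xs), max(xs)]      # crude initialisation at the extremes
    var = [1.0, 1.0]
    w = [0.5, 0.5]
    resp = []
    for _ in range(iters):
        # E-step: fractional membership (responsibility) of each component
        resp = []
        for x in xs:
            p = [w[k] * math.exp(-(x - mu[k]) ** 2 / (2 * var[k]))
                 / math.sqrt(2 * math.pi * var[k]) for k in range(2)]
            s = p[0] + p[1]
            resp.append([p[0] / s, p[1] / s])
        # M-step: responsibility-weighted re-estimation of the parameters
        for k in range(2):
            nk = sum(r[k] for r in resp)
            w[k] = nk / len(xs)
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var[k] = sum(r[k] * (x - mu[k]) ** 2
                         for r, x in zip(resp, xs)) / nk + 1e-6  # variance floor
    return mu, resp
```

Unlike k-means, each point retains a fractional degree of membership in every cluster, as the slide's "soft" label requires.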

V.3 Clustering algorithms (7/9)

V.3 Clustering algorithms (8/9)
• Hierarchical agglomerative clustering (HAC)
Single-link method
Complete-link method
Average-link method
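The three linkage methods differ only in how the distance between two clusters is defined; a minimal sketch (function names and the stopping parameter are my own):

```python
def hac(vectors, linkage="single", target=1):
    # Agglomerative clustering: start with singletons and repeatedly merge
    # the closest pair of clusters until `target` clusters remain.
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    def cluster_dist(c1, c2):
        pairwise = [dist(a, b) for a in c1 for b in c2]
        if linkage == "single":      # distance of the closest pair
            return min(pairwise)
        if linkage == "complete":    # distance of the farthest pair
            return max(pairwise)
        return sum(pairwise) / len(pairwise)  # average link

    clusters = [[v] for v in vectors]
    while len(clusters) > target:
        # find and merge the closest pair of clusters
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: cluster_dist(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] += clusters.pop(j)
    return clusters
```

Stopping at `target > 1` clusters yields one level of the nested series of partitions that hierarchical clustering produces.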

V.3 Clustering algorithms (9/9)

Other clustering algorithms
• Minimal spanning tree clustering
• Nearest-neighbor clustering
• Buckshot algorithm

V.4 Clustering of textual data (1/6)
• Representation of the text clustering problem
Objects (documents) are very complex and have rich internal structure; documents must be converted into vectors in the feature space.
Bag-of-words document representation.
Reducing the dimensionality:
Local methods: delete unimportant components from individual document vectors.
Global methods: latent semantic indexing (LSI).
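The bag-of-words conversion mentioned above can be sketched in a few lines (whitespace tokenisation and the function name are simplifying assumptions):

```python
def bag_of_words(docs):
    # Convert documents to term-frequency vectors over a shared vocabulary.
    vocab = sorted({w for d in docs for w in d.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    vectors = []
    for d in docs:
        v = [0] * len(vocab)
        for w in d.lower().split():
            v[index[w]] += 1   # count each term occurrence
        vectors.append(v)
    return vocab, vectors
```

With a realistic vocabulary these vectors are very high-dimensional and sparse, which is what motivates the dimensionality-reduction methods listed above.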

V.4 Clustering of textual data (2/6)
• Latent semantic indexing maps the N-dimensional feature space F onto a lower-dimensional subspace V.
LSI is based on applying the singular value decomposition (SVD) to the term-document matrix.

V.4 Clustering of textual data (3/6)
• Singular value decomposition (SVD): A = UDV^T
U: column-orthonormal m×r matrix
D: diagonal r×r matrix whose diagonal elements are the singular values of A
V: column-orthonormal n×r matrix
U^T U = V^T V = I
• Dimension reduction: keep only the k largest singular values.
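The SVD-based reduction can be demonstrated with NumPy (this assumes NumPy is available; the tiny term-document matrix is made up for illustration):

```python
import numpy as np

# Term-document matrix A: rows = terms, columns = documents.
# The counts here are illustrative, not taken from the text.
A = np.array([[1, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 1]], dtype=float)

# SVD: A = U @ diag(s) @ Vt, singular values in s (descending order)
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Rank-k approximation: keep only the k largest singular values.
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Documents mapped into the k-dimensional latent subspace:
docs_k = np.diag(s[:k]) @ Vt[:k, :]
```

Clustering can then be run on the k-dimensional columns of `docs_k` instead of the original high-dimensional document vectors.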

V.4 Clustering of textual data (4/6)
• Medoids: actual documents that are most similar to the centroids.
• Using Naïve Bayes mixture models with the EM clustering algorithm.

V.4 Clustering of textual data (5/6)
• Data abstraction in text clustering: generating a meaningful and concise description of each cluster.
• Methods of generating the label automatically:
The title of the medoid document.
Several words common to the cluster documents.
A distinctive noun phrase.

V.4 Clustering of textual data (6/6)
• Evaluation of text clustering: how good is the result?
Purity: assume {L1, L2, ..., Ln} are the manually labeled classes of documents and {C1, C2, ..., Cm} the clusters returned by the clustering process.
Entropy and the mutual information between classes and clusters can also be used.
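Purity is straightforward to compute: for each cluster take the size of its largest class, sum these, and divide by the total number of objects. A minimal sketch (the list-based input format is my own assumption):

```python
from collections import Counter

def purity(labels, clusters):
    # labels: the gold class of each object; clusters: the cluster id of each object.
    # Purity = (1/n) * sum over clusters of the size of the majority class.
    by_cluster = {}
    for lab, cl in zip(labels, clusters):
        by_cluster.setdefault(cl, []).append(lab)
    majority_total = sum(Counter(labs).most_common(1)[0][1]
                         for labs in by_cluster.values())
    return majority_total / len(labels)
```

Note that purity alone rewards over-splitting (one object per cluster gives purity 1.0), which is why entropy or mutual information between classes and clusters is often reported alongside it.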