Clustering for web documents
박흠


Clustering for web documents 1 박흠

Clustering for web documents 2
Contents
- Cluto
- Criterion Functions for Document Clustering: Experiments and Analysis (2002), by Ying Zhao and George Karypis, Department of Computer Science, University of Minnesota, Minneapolis, MN
- Feature selection for web documents (2004)

Clustering for web documents 3
Cluto: Clustering Toolkit
Department of Computer Science, University of Minnesota, Minneapolis
- Platforms: Linux, Sun OS 5.7, Win32
- Programs: vcluster, scluster
- CLUTO's user-callable library

Clustering for web documents 4
Cluto: What is Cluto? (1/2)
- Clustering algorithms: partitional clustering, agglomerative clustering, graph-partitioning clustering
- Clustering criterion functions: provides seven different criterion functions that can be used by both the partitional and the agglomerative clustering algorithms
- Also provides some of the more traditional local criteria (e.g., single-link, complete-link, and UPGMA) for agglomerative clustering

Clustering for web documents 5
Cluto: What is Cluto? (2/2)
- Analyzes the discovered clusters: relations between the objects assigned to each cluster, and relations between the different clusters
- Identifies the features that best describe and/or discriminate each cluster
- Shows the relationships between the clusters, objects, and features
- Operates on very large datasets, both in the number of objects and in the number of dimensions

Clustering for web documents 6
Cluto: Programs
- vcluster: operates in the objects' feature space
- scluster: operates in the objects' similarity space
Interface: vcluster [optional parameters] MatrixFile Ncluster
- MatrixFile: an n*m matrix whose rows correspond to objects and whose columns correspond to features
- Ncluster: the number of clusters

Clustering for web documents 7
Cluto: Parameters of Algorithms
- rb, rbr: the solution is computed by k-1 repeated bisections (rbr: the criterion function is also optimized)
- direct: the solution is computed by simultaneously finding all k clusters
- agglo: the agglomerative paradigm
- graph: uses a nearest-neighbor graph
- bagglo
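The repeated-bisection idea can be illustrated with a small sketch. This is not CLUTO's implementation: the helper names (`bisect`, `repeated_bisection`), the choice to always split the largest cluster, and the use of a plain spherical 2-means pass for each split are all assumptions made for illustration. Documents are assumed to be unit-length, non-negative vectors.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def unit(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def composite(docs):
    # Component-wise sum of the document vectors.
    return [sum(col) for col in zip(*docs)]

def bisect(docs, iters=10):
    """Split docs in two with a spherical 2-means pass (hypothetical helper)."""
    # Seed with the first document and the document least similar to it.
    seed_a = docs[0]
    seed_b = min(docs, key=lambda d: dot(d, seed_a))
    centers = [seed_a, seed_b]
    parts = ([], [])
    for _ in range(iters):
        parts = ([], [])
        for d in docs:
            # Assign each document to the more similar center; ties go to the first.
            side = 0 if dot(d, centers[0]) >= dot(d, centers[1]) else 1
            parts[side].append(d)
        if not parts[0] or not parts[1]:
            break  # nothing left to separate
        centers = [unit(composite(p)) for p in parts]
    return parts

def repeated_bisection(docs, k):
    """Perform k-1 bisections, always splitting the largest cluster (an assumption)."""
    clusters = [list(docs)]
    for _ in range(k - 1):
        clusters.sort(key=len)
        largest = clusters.pop()
        # Keep only non-empty halves; fewer than k clusters may result.
        clusters.extend(p for p in bisect(largest) if p)
    return clusters

docs = [unit(v) for v in ([1, 0, 0], [0.9, 0.1, 0],
                          [0, 1, 0], [0, 0.9, 0.1],
                          [0, 0, 1], [0.1, 0, 0.9])]
clusters = repeated_bisection(docs, 3)
```

In this sketch a direct method would instead search for all k clusters at once; the bisecting approach only ever solves 2-way problems.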

Clustering for web documents 8
Cluto: Parameters of the similarity function
- cos: the cosine function (default)
- corr: the correlation coefficient
- dist: the Euclidean distance; applicable when -clmethod=graph
- jacc: the extended Jaccard coefficient; applicable when -clmethod=graph

Clustering for web documents 9
Cluto: Parameters of the criterion function
i1, i2, e1, g1, g1p, h1, h2

Clustering for web documents 10
Cluto: Parameters of the criterion function
- slink: single link
- wslink: weighted single link
- clink: complete link
- wclink: weighted complete link
- upgma: UPGMA
Other parameters: cstype, fulltree, rowmodel, colmodel, showfeatures

Clustering for web documents 11

Clustering for web documents 12 Criterion Functions for Document Clustering Experiments and Analysis (2002) by Ying Zhao and George Karypis Department of Computer Science, University of Minnesota, Minneapolis, MN 55455

Clustering for web documents 13 Data Clustering A.K. JAIN Michigan State University M.N. MURTY Indian Institute of Science AND P.J. FLYNN The Ohio State University ACM Computing Surveys

Clustering for web documents 14
Introduction (1/2)
Clustering algorithms
- Agglomerative algorithms: UPGMA, single-link, complete-link, CURE, ROCK, Chameleon
- Partitional algorithms: K-means, K-medoids, Autoclass, graph-partitioning-based, spectral-partitioning-based; well suited to large datasets because they are fast
Seven criterion functions
- They measure intra-cluster similarity, inter-cluster similarity, and combinations of the two
- i1, i2, e1, g1, g1p, h1, h2

Clustering for web documents 15 Introduction(2/2) Datasets 15 different data sets

Clustering for web documents 16
Preliminaries (1/3)
Document Representation
- Uses the vector space model for each document
- d: document; tf: term frequency; tf_i: frequency of the i-th term in the document
- Term weights use idf or tf*idf; N: total number of documents
Similarity Measures
- The similarity between two documents d_i, d_j
- Cosine function: cos(d_i, d_j) = d_i^t d_j / (||d_i|| ||d_j||), where ||d|| is the length of the document vector, used to normalize it
- 1: identical; 0: nothing in common
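As a rough illustration of the representation and similarity measure on this slide, the sketch below builds tf*idf vectors and compares them with the cosine function. The idf form log(N/df) and the helper names `tfidf_vectors` and `cosine` are assumptions for illustration; the slide's own formulas are not preserved in the transcript.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, scaled to unit length."""
    n_docs = len(docs)
    # df[t] = number of documents that contain term t
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else vec)
    return vectors

def cosine(a, b):
    """Cosine of two unit-length sparse vectors, i.e. their dot product."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())

docs = [["cluster", "web", "document"],
        ["cluster", "web", "document"],
        ["protein", "gene"]]
vecs = tfidf_vectors(docs)
same = cosine(vecs[0], vecs[1])      # identical term profiles
disjoint = cosine(vecs[0], vecs[2])  # no shared terms
```

Because the vectors are scaled to unit length up front, the cosine reduces to a dot product, which is what makes the composite-vector shortcuts on the following slides possible.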

Clustering for web documents 17
Preliminaries (2/3)
- Euclidean function: if dis = 0, the documents are identical; if dis = sqrt(2) (for unit-length vectors), they have nothing in common
Definitions
- S: the set of documents
- S_1, S_2, ..., S_k: the sets of documents in each of the k clusters
- k: number of clusters
- n_1, n_2, ..., n_k: the sizes (numbers of documents) of the corresponding clusters
- A: a set of documents
- Composite vector D_A: the sum of all document vectors in A
- Centroid vector C_A: the average of the term weights of the documents in A, i.e. D_A / |A|
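A minimal sketch of the definitions above, using dense lists as document vectors. The function names are illustrative, and the documents are assumed to be unit-length with non-negative weights.

```python
import math

def composite(docs):
    """Composite vector D_A: component-wise sum of the document vectors in A."""
    return [sum(col) for col in zip(*docs)]

def centroid(docs):
    """Centroid vector C_A = D_A / |A|."""
    return [x / len(docs) for x in composite(docs)]

def euclidean(u, v):
    """Euclidean distance between two document vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

A = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0]]
D_A = composite(A)                      # [2.0, 1.0, 1.0]
C_A = centroid(A)                       # [0.5, 0.25, 0.25]
dist_identical = euclidean(A[0], A[2])  # same vector
dist_disjoint = euclidean(A[0], A[1])   # no terms in common
```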

Clustering for web documents 18
Preliminaries (3/3)
Vector Properties
- S_i, S_j: two sets of documents containing n_i and n_j documents
- D_i, D_j: the composite vectors; C_i, C_j: the centroid vectors
- The sum of the pairwise similarities between the documents in S_i and those in S_j is D_i^t D_j
- The sum of the pairwise similarities between the documents in S_i is ||D_i||^2
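The two vector properties can be checked numerically. The sketch below assumes unit-length document vectors, so that cosine similarity is a dot product; the example vectors are made up, and the within-set sum counts every ordered pair, including each document with itself.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def unit(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def composite(docs):
    return [sum(col) for col in zip(*docs)]

S_i = [unit(v) for v in ([1, 2, 0], [0, 1, 1])]
S_j = [unit(v) for v in ([3, 0, 1], [1, 1, 1], [0, 2, 5])]
D_i, D_j = composite(S_i), composite(S_j)

# Sum of cos(a, b) over every a in S_i and b in S_j.
pairwise_between = sum(dot(a, b) for a in S_i for b in S_j)
# Sum of cos(a, b) over every ordered pair in S_i (self-pairs included).
pairwise_within = sum(dot(a, b) for a in S_i for b in S_i)

between_via_composite = dot(D_i, D_j)  # D_i^t D_j
within_via_composite = dot(D_i, D_i)   # ||D_i||^2
```

The practical point is cost: the composite-vector form needs one dot product instead of n_i * n_j of them.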

Clustering for web documents 19
Criterion Functions (1/5)
Internal Criterion Functions
- I1: maximizes the sum of the average pairwise similarities between the documents assigned to each cluster; uses the cosine function. I1 is similar to the criterion of hierarchical agglomerative clustering that uses the group-average heuristic to decide which clusters to merge.
- I2: the criterion used by the vector-space variant of the K-means algorithm; uses the cosine function. C_r: centroid vector of cluster r.
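The slide's formulas are not preserved in the transcript. The sketch below uses the closed forms commonly quoted from the Zhao and Karypis report, I1 = sum_r ||D_r||^2 / n_r and I2 = sum_r ||D_r||, which should be read as an assumption here. Documents are assumed to be unit-length, and the function names are illustrative.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def composite(docs):
    return [sum(col) for col in zip(*docs)]

def i1(clusters):
    """Sum over clusters of ||D_r||^2 / n_r (to be maximized)."""
    total = 0.0
    for docs in clusters:
        D_r = composite(docs)
        total += dot(D_r, D_r) / len(docs)
    return total

def i2(clusters):
    """Sum over clusters of ||D_r|| (to be maximized)."""
    return sum(math.sqrt(dot(D, D)) for D in map(composite, clusters))

e1, e2, e3 = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]
good = [[e1, e1], [e2, e3]]  # identical documents kept together
bad = [[e1, e2], [e1, e3]]   # identical documents split apart

i1_good, i1_bad = i1(good), i1(bad)
i2_good, i2_bad = i2(good), i2(bad)
```

Both functions score the clustering that keeps the two identical documents together higher, as expected of criteria that are maximized.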

Clustering for web documents 20
Criterion Functions (2/5)
External Criterion Functions: E1, E2
- Optimize a function based on how the clusters differ from each other
- The external function is derived so that the centroid vectors of the different clusters are as orthogonal as possible
- C: the centroid vector of the entire collection; D: the composite vector of the entire collection
- 1/||D|| is constant across clustering solutions
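A sketch of E1 under the closed form usually attributed to the Zhao and Karypis report, E1 = sum_r n_r * D_r^t D / ||D_r||, with the constant 1/||D|| dropped. The formula is an assumption here because the slide's own equation did not survive; documents are assumed unit-length and the helper names are illustrative.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def composite(docs):
    return [sum(col) for col in zip(*docs)]

def e1(clusters):
    """Sum over clusters of n_r * D_r^t D / ||D_r|| (to be minimized)."""
    D = composite([d for docs in clusters for d in docs])
    total = 0.0
    for docs in clusters:
        D_r = composite(docs)
        total += len(docs) * dot(D_r, D) / math.sqrt(dot(D_r, D_r))
    return total

u1, u2, u3 = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]
good = [[u1, u1], [u2, u3]]  # identical documents kept together
bad = [[u1, u2], [u1, u3]]   # identical documents split apart

e1_good, e1_bad = e1(good), e1(bad)
```

Unlike I1 and I2, lower is better: the clustering whose cluster vectors point further away from the collection as a whole gets the smaller value.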

Clustering for web documents 21
Criterion Functions (3/5)
- An external criterion function defined with the Euclidean distance function
Hybrid Criterion Functions: H1, H2
- Maximize the similarity of the documents in each cluster while minimizing the similarity between each cluster's documents and the entire collection
- H1: combines criterion functions I1 and E1
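One way to combine an internal and an external criterion is to take their ratio. The sketch below assumes H1 = I1 / E1 and H2 = I2 / E1, the forms usually quoted from the Zhao and Karypis report, together with the closed forms I1 = sum_r ||D_r||^2 / n_r, I2 = sum_r ||D_r|| and E1 = sum_r n_r D_r^t D / ||D_r||. All of these are assumptions because the slide's equations are not preserved.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def composite(docs):
    return [sum(col) for col in zip(*docs)]

def i1(clusters):
    return sum(dot(D, D) / len(c) for c, D in ((c, composite(c)) for c in clusters))

def i2(clusters):
    return sum(math.sqrt(dot(D, D)) for D in map(composite, clusters))

def e1(clusters):
    D = composite([d for c in clusters for d in c])
    return sum(len(c) * dot(D_r, D) / math.sqrt(dot(D_r, D_r))
               for c, D_r in ((c, composite(c)) for c in clusters))

def h1(clusters):
    """Hybrid criterion I1 / E1 (to be maximized)."""
    return i1(clusters) / e1(clusters)

def h2(clusters):
    """Hybrid criterion I2 / E1 (to be maximized)."""
    return i2(clusters) / e1(clusters)

u1, u2, u3 = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]
good = [[u1, u1], [u2, u3]]
bad = [[u1, u2], [u1, u3]]

h1_good, h1_bad = h1(good), h1(bad)
h2_good, h2_bad = h2(good), h2(bad)
```

Dividing by E1 rewards clusterings that are tight internally and also well separated from the rest of the collection.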

Clustering for web documents 22
Criterion Functions (4/5)
- H2: combines criterion functions I2 and E1
Graph-Based Criterion Functions
- Another way to view the relations between documents is to use graphs
- G1: based on the pairwise similarities between the documents
- G2: based on the pairwise similarities between the documents and the terms
- S: the given collection of n documents; G_s: the similarity graph

Clustering for web documents 23 Criterion Functions (5/5) G1. G2.
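The equations on this slide are not preserved. As a hedged illustration of a graph-based criterion over document-to-document similarities, the sketch below uses the form usually quoted for G1 in the Zhao and Karypis report: the similarity cut between each cluster and the rest, divided by the cluster's internal similarity, which for unit-length documents reduces to sum_r D_r^t (D - D_r) / ||D_r||^2. G2 is not sketched.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def composite(docs):
    return [sum(col) for col in zip(*docs)]

def g1(clusters):
    """Sum over clusters of D_r^t (D - D_r) / ||D_r||^2 (to be minimized)."""
    D = composite([d for docs in clusters for d in docs])
    total = 0.0
    for docs in clusters:
        D_r = composite(docs)
        rest = [a - b for a, b in zip(D, D_r)]  # composite of all other clusters
        total += dot(D_r, rest) / dot(D_r, D_r)
    return total

u1, u2, u3 = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]
good = [[u1, u1], [u2, u3]]  # no similarity crosses the cluster boundary
bad = [[u1, u2], [u1, u3]]   # the two identical documents are separated

g1_good, g1_bad = g1(good), g1(bad)
```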

Clustering for web documents 24

Clustering for web documents 25

Clustering for web documents 26 Experimental Results Direct k-way Clustering

Clustering for web documents 27 Experimental Results

Clustering for web documents 28 Experimental Results

Clustering for web documents 29
Data Sets
- 'The Natural Science' category in the Naver directory
- 6 subcategories in the corpus
- 1,215 docs, 17,223 terms, 20 clusters, 5 features per doc, idf

Sub Category     No. of Docs.
Physics          102
Earth science    149
Biology          426
Astrology        323
Mathematics      102
Chemistry        113
Total            1,215

Clustering for web documents 30
Experimental parameters: Algorithms
- rb, rbr: the solution is computed by k-1 repeated bisections (rbr: the criterion function is also optimized)
- direct: the solution is computed by simultaneously finding all k clusters
- agglo: the agglomerative paradigm
- graph: uses a nearest-neighbor graph

Clustering for web documents 31
Experimental parameters
- Criterion functions: i1, i2, e1, g1, g1p, h1, h2, clink, slink
- Similarity function: cosine measure

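Clustering for web documents 32
Experimental results: Entropy
[Table: entropy by clustering method (columns rb, rbr, direct, agglo, graph) and criterion function (rows I1, I2, E1, G1, G1p, H1, H2, clink, slink). Surviving values: clink .761, slink .895.]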

Clustering for web documents 33 Entropy
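The slides do not preserve the definition of entropy they use. A common definition for evaluating document clusterings, assumed here, weights each cluster's class-distribution entropy by the cluster's size and normalizes it by log q, where q is the number of classes. Lower values are better. The function name and the toy labels are illustrative.

```python
import math
from collections import Counter

def clustering_entropy(clusters, n_classes):
    """Size-weighted, log(q)-normalized entropy of the class labels in each cluster.

    clusters: list of clusters, each a list of the true class labels of its documents.
    n_classes: number of classes q (must be at least 2).
    """
    n = sum(len(c) for c in clusters)
    total = 0.0
    for c in clusters:
        counts = Counter(c)
        h = -sum((m / len(c)) * math.log(m / len(c)) for m in counts.values())
        total += (len(c) / n) * h / math.log(n_classes)
    return total

# One pure cluster and one cluster split evenly between two classes.
clusters = [["physics", "physics"], ["physics", "biology"]]
entropy = clustering_entropy(clusters, n_classes=2)
```

A perfect clustering scores 0, and a clustering whose every cluster mixes all classes evenly scores 1.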

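Clustering for web documents 34
Experimental results: Purity
[Table: purity by clustering method (columns rb, rbr, direct, agglo, graph) and criterion function (rows I1, I2, E1, G1, G1p, H1, H2, clink, slink; "Cut functions" label). Surviving values: clink .458, slink .368.]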

Clustering for web documents 35 Purity
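Purity is usually defined as the fraction of documents that belong to the majority class of their cluster; that definition is assumed here because the slides do not spell it out. Higher values are better. The function name and toy labels are illustrative.

```python
from collections import Counter

def clustering_purity(clusters):
    """Fraction of documents that carry the majority class label of their cluster.

    clusters: list of clusters, each a list of the true class labels of its documents.
    """
    n = sum(len(c) for c in clusters)
    majority_total = sum(max(Counter(c).values()) for c in clusters)
    return majority_total / n

# One pure cluster and one cluster split evenly between two classes.
clusters = [["physics", "physics"], ["physics", "biology"]]
purity = clustering_purity(clusters)
```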

Clustering for web documents 36
Best results
[Table: best criterion function for each clustering method (rb, rbr, direct, agglo, graph), by entropy (entr) and purity (puri). Surviving entries: g1p, h2, h1, cut.]