Clustering Spatial Data Using Random Walks. David Harel and Yehuda Koren, KDD 2001.



Introduction
Advances in database technology have resulted in huge amounts of spatial data.
The characteristics of spatial data pose several difficulties for clustering algorithms:
– Clusters may have arbitrary shapes and non-uniform sizes.
– Different clusters may have different densities.
– Noise may interfere with the clustering process and should be identified.
Regarding complexity, the huge size of spatial databases calls for very efficient clustering algorithms.

Motivation
Partitional approach
– Minimizes the overall squared distance between each data point and the center of its cluster.
– Examples: k-means, k-medoids.
– Strengths: robustness to outliers and fast running time.
– Weaknesses: tends to produce spherically shaped clusters of similar sizes, which prevents the finding of natural clusters, and the number of clusters k must be fixed in advance.
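To make the partitional objective concrete, here is a minimal numpy sketch of Lloyd's k-means. The deterministic initialization (first k points) and the lack of empty-cluster handling are simplifications of this sketch, not features of any standard implementation:

```python
import numpy as np

# Lloyd's k-means: alternate between assigning each point to its nearest
# center and moving each center to the mean of its assigned points.
def kmeans(X, k, iters=20):
    centers = X[:k].copy()  # naive deterministic init, for the sketch only
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centers

# Two well-separated blobs; the first two points seed one center in each.
X = np.array([[0.0, 0.0], [10.0, 10.0], [0.1, 0.0],
              [0.0, 0.1], [10.0, 10.1], [10.1, 10.0]])
labels, centers = kmeans(X, 2)
```

On elongated or differently sized clusters this objective pulls the solution toward compact spherical groups, which is exactly the weakness the slide points out.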

Motivation
Hierarchical algorithms
– Create a sequence of partitions in which each partition is nested into the next partition in the sequence.
– Agglomerative clustering starts from the trivial partition of n points into n clusters of size 1 and proceeds by repeatedly merging pairs of clusters: at each step the two most similar clusters are merged, until the clustering is satisfactory.
– Single-link cluster similarity: the similarity between the most similar pair of elements, one from each cluster. It is easily fooled by outliers, merging two clusters that are connected by a narrow string of points.
– Complete-link cluster similarity: the similarity between the least similar pair of elements. It may break a relatively large cluster into smaller clusters.
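The single-link versus complete-link contrast can be demonstrated with a small numpy sketch; the point sets below are made up for illustration:

```python
import numpy as np

# Single-link distance between clusters: closest pair across the two sets.
def single_link_dist(A, B):
    return np.linalg.norm(A[:, None] - B[None, :], axis=-1).min()

# Complete-link distance between clusters: farthest pair across the two sets.
def complete_link_dist(A, B):
    return np.linalg.norm(A[:, None] - B[None, :], axis=-1).max()

A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[10.0, 0.0], [11.0, 0.0]])

# A single bridging outlier halfway between the clusters collapses the
# single-link distance, while the complete-link distance is unaffected
# (it even grows, since the farthest pair now spans the whole range).
A_bridged = np.vstack([A, [[5.5, 0.0]]])
```

Here `single_link_dist(A, B)` is 9.0, but adding the bridge point drops `single_link_dist(A_bridged, B)` to 4.5 while `complete_link_dist(A_bridged, B)` stays at 11.0; a string of such points is exactly how single-link gets fooled into merging two natural clusters.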

Approach Proposed
Present an approach to clustering spatial data, based on deterministic exploration of random walks on a weighted graph associated with the data.
The heart of the approach is to apply the method iteratively:
– Sharpen the distinction between the weights of inter-cluster edges (those that ought to separate clusters) and intra-cluster edges (those that ought to remain inside a single cluster) by decreasing the former and increasing the latter.

Basic Notions (Graph-theoretic notions)
Let G(V, E, w) be a weighted graph, which should be viewed as modeling a relation E over a set V of entities.
– The set of nodes V is {1, …, n}.
– w: E -> R+ measures the similarity between pairs of items.
Let S ⊆ V. The set of nodes that are connected to some node of S by a path with at most k edges is denoted by N^k(S).
The degree of G, denoted by deg(G), is the maximal number of edges incident to any single node of G.
The subgraph of G induced by S is denoted by G(S). The edge between i and j is denoted by <i, j>.
The probability of a transition from node i to node j is p_ij = w(i, j) / d_i, where d_i = Σ_{<i,k> ∈ E} w(i, k) is the weighted degree of node i.
A random walk is a natural stochastic process on graphs: given a graph and a start node, we select a neighbor of the current node at random and "go there", after which we continue the random walk from the newly chosen node.
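As a quick illustration, the transition probabilities p_ij = w(i, j) / d_i are obtained by row-normalizing the weight matrix; the 4-node weighted graph below is a made-up example:

```python
import numpy as np

def transition_probs(W):
    """Row-normalize a weight matrix: p_ij = w(i, j) / d_i."""
    d = W.sum(axis=1)      # weighted degrees d_i
    return W / d[:, None]

# Hypothetical symmetric weight matrix on 4 nodes.
W = np.array([[0.0, 2.0, 1.0, 0.0],
              [2.0, 0.0, 1.0, 0.0],
              [1.0, 1.0, 0.0, 1.0],
              [0.0, 0.0, 1.0, 0.0]])
M = transition_probs(W)
# Each row of M sums to 1; e.g. from node 0 the walk steps to node 1
# with probability 2/3 and to node 2 with probability 1/3.
```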

Basic Notions (Cont.)
Given a weighted graph G(V, E, w), the associated transition matrix, denoted by M_G, is the n×n matrix in which, if i and j are connected, the (i, j)'th entry is simply p_ij; otherwise it is 0.
Denote by P^k_visit(i) the vector whose j-th component is the probability that a random walk originating at i will visit node j in its k-th step.
– Thus, P^k_visit(i) is the i-th row of the matrix (M_G)^k, the k-th power of M_G.
The escape probability from a source node s to a target node t, denoted by P_escape(s, t), is defined as the probability that a random walk originating at s will reach t before returning to s.
This probability can be computed as follows: for every node v, define a value h_v satisfying h_s = 0, h_t = 1, and h_v = Σ_u p_vu · h_u for every other node v. Then the escape probability is given by P_escape(s, t) = Σ_v p_sv · h_v.
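The linear system for the escape probability can be solved directly; a small numpy sketch (the name `h_v` for the intermediate variables is a choice of this sketch):

```python
import numpy as np

def escape_probability(W, s, t):
    """P_escape(s, t): probability that a walk from s reaches t before
    returning to s. Solves h_v = sum_u p_vu * h_u with the boundary
    conditions h_s = 0 and h_t = 1."""
    n = W.shape[0]
    M = W / W.sum(axis=1)[:, None]          # transition matrix M_G
    interior = [v for v in range(n) if v not in (s, t)]
    h = np.zeros(n)
    h[t] = 1.0
    if interior:
        A = np.eye(len(interior)) - M[np.ix_(interior, interior)]
        b = M[interior, t]                  # contribution of h_t = 1
        h[interior] = np.linalg.solve(A, b)
    return M[s] @ h

# On the path 0 - 1 - 2 with unit weights, a walk from 0 always steps
# to 1, and from 1 it reaches 2 before 0 with probability 1/2.
W = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
print(escape_probability(W, 0, 2))  # 0.5
```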

Clustering using random walks (Cluster Quality)
The main idea for identifying separators is to use an iterative process of separation.
– Separation reweights edges by local considerations, in such a way that the weight of an edge connecting "intimately related" nodes is increased, while for others it is decreased.
– Sharpening pass: the edges are reweighted to sharpen the distinction between separating and non-separating edges.
– When the separating operation is iterated several times, a sort of "zero-one" phenomenon emerges, whereby the weight of an edge that should be a separator is notably diminished.

NS: Separation by neighborhood similarity
A helpful property of the vector P^k_visit(i) is that it provides the level of nearness, or intimacy, between the node i and every other node.
P^k_visit(i) generalizes the concept of weighted neighborhoods: P^1_visit(i) is exactly the weighted neighborhood of i.
P^k_visit(i) alone is not very interesting for large values of k, so we will actually be using the accumulated term Σ_{m=1..k} P^m_visit(i).
To estimate the closeness of two nodes v and u, fix some small k (say, k = 3) and compare these vectors for v and u: the smaller the difference, the greater the intimacy between u and v.

NS: Separation by neighborhood similarity (Cont.)
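One NS-style sharpening pass can be sketched as follows. The L1-based similarity used here is a stand-in chosen for this sketch; the paper defines its own similarity measure between visit vectors:

```python
import numpy as np

def ns_pass(W, k=3):
    """Reweight each edge <u, v> by how similar the accumulated k-step
    visit vectors of u and v are (L1-based similarity as a stand-in)."""
    M = W / W.sum(axis=1)[:, None]
    # Row i of M + M^2 + ... + M^k accumulates P^m_visit(i) for m = 1..k.
    P = sum(np.linalg.matrix_power(M, m) for m in range(1, k + 1))
    n = W.shape[0]
    Wnew = np.zeros_like(W)
    for u in range(n):
        for v in range(u + 1, n):
            if W[u, v] > 0:
                sim = 1.0 / (1.0 + np.abs(P[u] - P[v]).sum())
                Wnew[u, v] = Wnew[v, u] = sim
    return Wnew

# Two unit-weight triangles {0,1,2} and {3,4,5} joined by the edge <2, 3>.
W = np.zeros((6, 6))
for a, b in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    W[a, b] = W[b, a] = 1.0
Wnew = ns_pass(W)
```

After one pass the bridge edge <2, 3> ends up with a smaller weight than the intra-triangle edge <0, 1>, which is exactly the sharpening effect the slides describe.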

CE: Separation by circular escape
Consider the event in which a random walk that starts at v visits u exactly once before returning to v for the first time.
– If v and u are in different natural clusters, the probability of such an event will be low, since a random walk that visits v will likely return to v before reaching u.
– The notion is symmetric, since the event obtained by exchanging the roles of v and u has the same probability.
– The probability of this event is given by P_escape(v, u) · P_escape(u, v).
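The circular event factors by the Markov property: first the walk must escape from v to u, then from u it must reach v before revisiting u. A sketch of the resulting CE edge weight (the escape solver repeats the linear-system computation from the earlier slide):

```python
import numpy as np

def escape_probability(W, s, t):
    """Probability that a walk from s reaches t before returning to s."""
    n = W.shape[0]
    M = W / W.sum(axis=1)[:, None]
    interior = [v for v in range(n) if v not in (s, t)]
    h = np.zeros(n)
    h[t] = 1.0
    if interior:
        A = np.eye(len(interior)) - M[np.ix_(interior, interior)]
        h[interior] = np.linalg.solve(A, M[interior, t])
    return M[s] @ h

def ce_weight(W, v, u):
    """Probability that a walk from v visits u exactly once before
    returning to v: P_escape(v, u) * P_escape(u, v)."""
    return escape_probability(W, v, u) * escape_probability(W, u, v)

# Path 0 - 1 - 2 with unit weights: each escape probability is 1/2,
# so the circular-escape probability for the pair (0, 2) is 1/4.
W = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
print(ce_weight(W, 0, 2))  # 0.25
```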

Clustering by Separation
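The overall scheme can be sketched roughly as: repeat a sharpening pass, drop edges whose weight has fallen below a threshold, and report the remaining connected components as clusters. This is an interpretation of the idea, not the paper's exact algorithm; the similarity function, iteration count, and threshold below are illustrative choices:

```python
import numpy as np

def sharpen(W, k=3):
    """NS-style pass: edge weight <- similarity of accumulated k-step
    visit vectors (L1-based similarity used as a stand-in)."""
    M = W / W.sum(axis=1)[:, None]
    P = sum(np.linalg.matrix_power(M, m) for m in range(1, k + 1))
    Wnew = np.zeros_like(W)
    for u, v in zip(*np.nonzero(W)):
        Wnew[u, v] = 1.0 / (1.0 + np.abs(P[u] - P[v]).sum())
    return Wnew

def components(adj):
    """Connected components of a boolean adjacency matrix (plain DFS)."""
    n = adj.shape[0]
    seen = [False] * n
    comps = []
    for s in range(n):
        if seen[s]:
            continue
        stack, comp = [s], []
        seen[s] = True
        while stack:
            x = stack.pop()
            comp.append(x)
            for y in range(n):
                if adj[x, y] and not seen[y]:
                    seen[y] = True
                    stack.append(y)
        comps.append(sorted(comp))
    return comps

def cluster(W, iters=2, threshold=0.4):
    """Iterate sharpening, cut weak edges, read off components.
    iters and threshold are illustrative knobs, not the paper's values."""
    for _ in range(iters):
        W = sharpen(W)
    return components(W > threshold)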

Clustering Spatial Points
Dataset 1 shows the inherent capability to cluster at different resolutions at once.
Dataset 2 demonstrates the ability to separate the two left-hand-side clusters, despite the fact that the distance between these clusters is smaller than the distance between points inside the right-hand-side cluster.
Dataset 3 exhibits the capability to take into account the structural properties of the data, which are the only clue for separating the set of evenly spaced points.

Clustering Spatial Points

Conclusion
Proposes a novel approach to clustering
– Based on the deterministic analysis of random walks on a weighted graph generated from the data.
Decomposes the data into arbitrarily shaped clusters of different sizes and densities, overcoming noise and outliers that may blur the natural decomposition of the data.
Experiments show that the method is efficient.