CLUSTERING GRID-BASED METHODS Elsayed Hemayed Data Mining Course.



Outline: GRID-based Clustering Methods
- Introduction
- Grid-Based Clustering Techniques
- STING
  - What is spatial data
  - STING Overview
  - Grid Cell Hierarchy
  - Hierarchical Structure
  - Statistical Parameters
  - Query Types
  - Query Processing
  - Advantages and disadvantages

Clustering Methods
- Partitioning methods
  - K-Means
- Hierarchical methods
  - Agglomerative hierarchical clustering
  - Divisive hierarchical clustering
- Density-based methods
  - DBSCAN: a Density-Based Spatial Clustering of Applications with Noise
- Grid-based methods
  - STING: A Statistical Information Grid Approach to Spatial Data Mining
- High-dimensional data clustering
  - CLIQUE: A Dimension-Growth Subspace Clustering Method

GRID-BASED CLUSTERING METHODS
- In this approach we quantize the space into a finite number of cells that form a grid structure, and all clustering operations are performed on that grid.
- For example, assume that we have a set of records that we want to cluster with respect to two attributes. We divide the related space (a plane) into a grid structure and then find the clusters.

Example
[Figure: a grid over the plane with axes Age and Salary (10,000). Our "space" is this plane.]
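The quantization step can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function name `assign_to_grid`, the cell sizes, and the sample records are all hypothetical.

```python
import math
from collections import defaultdict


def assign_to_grid(records, cell_width, cell_height):
    """Map each (x, y) record to the integer index of the grid cell that contains it."""
    cells = defaultdict(list)
    for x, y in records:
        # Integer cell index along each axis (assumed uniform cell sizes).
        key = (math.floor(x / cell_width), math.floor(y / cell_height))
        cells[key].append((x, y))
    return dict(cells)


# Hypothetical (age, salary in units of 10,000) records.
records = [(23, 2.5), (27, 3.1), (41, 6.0), (44, 6.4), (58, 4.2)]
grid = assign_to_grid(records, cell_width=10, cell_height=2)

for cell, members in sorted(grid.items()):
    print(cell, members)
```

Once every record has a cell index, later clustering steps can work on the occupied cells rather than on the individual records.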

Grid-Based Clustering Techniques
- The following are some techniques used to perform grid-based clustering:
  - CLIQUE (CLustering In QUEst)
  - STING (STatistical INformation Grid)
  - WaveCluster

STING: A Statistical Information Grid Approach to Spatial Data Mining

What is Spatial Data?
- Spatial data may be thought of as features located on or referenced to the Earth's surface, such as roads, streams, political boundaries, schools, land use classifications, property ownership parcels, drinking water intakes, and pollution discharge sites. In short, anything that can be mapped.
- Spatial area: the area that encompasses the locations of all the spatial data is called the spatial area.

STING Overview
- STING is used for performing clustering on spatial data.
- STING uses a hierarchical, multi-resolution grid data structure to partition the spatial area.
- STING's main benefit is that it efficiently processes many common "region-oriented" queries on a set of points.
- We want to cluster the records in a spatial table in terms of location.
- The placement of a record in a grid cell is completely determined by its physical location.

Grid Cell Hierarchy
- The spatial area is divided into rectangular cells (using latitude and longitude).
- The cells form a hierarchical structure.
- This means that each cell at a higher level is further partitioned into 4 smaller cells at the next lower level.
- In other words, each cell at the ith level (except the leaves) has 4 children at level i+1.
- The union of the 4 child cells gives back the parent cell in the level above them.
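The parent/child relationship can be sketched with simple index arithmetic. The indexing scheme below is an assumption made for illustration (the slides do not specify one): the root is a single cell at level 0, and a cell is identified by `(level, row, col)`.

```python
def children(level, row, col):
    """Return the 4 cells at level + 1 that partition the cell (level, row, col)."""
    return [
        (level + 1, 2 * row + dr, 2 * col + dc)
        for dr in (0, 1)
        for dc in (0, 1)
    ]


def parent(level, row, col):
    """Return the cell one level up that contains (level, row, col)."""
    if level == 0:
        raise ValueError("the root cell has no parent")
    return (level - 1, row // 2, col // 2)


def cells_at_level(level):
    """Each level has 4 times as many cells as the one above it."""
    return 4 ** level


root = (0, 0, 0)
kids = children(*root)
print(kids)
print(cells_at_level(3))
```

Under this scheme, every child maps back to the same parent, which mirrors the statement that the union of the 4 children gives back the parent cell.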

Grid Cell Hierarchy (Cont.)
- The size of the leaf-level cells and the number of layers depend on how much granularity the user wants.
- So why do we have a hierarchical structure for cells?
- We have it in order to provide finer granularity, or higher resolution, where it is needed.

[Figure: A hierarchical structure for STING clustering]

Statistical Parameters Stored in Each Cell
- For each cell in each layer we have:
  - Attribute-independent parameter: Count, the number of records in this cell.
  - Attribute-dependent parameters (we assume that the attribute values are real numbers).

Statistical Parameters (Cont.)
- For each attribute of each cell we store the following parameters:
  - M: the mean of all values of the attribute in this cell.
  - S: the standard deviation of all values of the attribute in this cell.
  - Min: the minimum value of the attribute in this cell.
  - Max: the maximum value of the attribute in this cell.
  - Distribution: the type of distribution that the attribute values in this cell follow (e.g. normal, exponential). "None" is assigned to Distribution if the distribution is unknown.

Storing of Statistical Parameters
- Statistical information about the attributes in each grid cell, for each layer, is pre-computed and stored beforehand.
- The statistical parameters for the cells in the lowest layer are computed directly from the values present in the table.
- The statistical parameters for the cells in all other levels are computed from their respective child cells in the level below.
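The bottom-up computation can be sketched as follows. This is an illustrative sketch, not the course's code: the names `CellStats`, `leaf_stats`, and `parent_stats` are hypothetical, the standard deviation is the population form, cells are assumed to be non-empty, and the Distribution parameter is left out. The parent's mean and standard deviation follow from the usual identities for combining counts, means, and second moments.

```python
import math
from dataclasses import dataclass


@dataclass
class CellStats:
    n: int          # Count
    mean: float     # M
    std: float      # S (population standard deviation)
    minimum: float  # Min
    maximum: float  # Max


def leaf_stats(values):
    """Compute the parameters of a lowest-layer cell directly from its (non-empty) values."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    return CellStats(n, mean, math.sqrt(variance), min(values), max(values))


def parent_stats(children):
    """Combine the parameters of child cells without revisiting the raw values."""
    n = sum(c.n for c in children)
    mean = sum(c.mean * c.n for c in children) / n
    # E[x^2] of the parent is the count-weighted average of each child's E[x^2].
    second_moment = sum((c.std ** 2 + c.mean ** 2) * c.n for c in children) / n
    variance = max(second_moment - mean ** 2, 0.0)  # guard against rounding below zero
    return CellStats(
        n,
        mean,
        math.sqrt(variance),
        min(c.minimum for c in children),
        max(c.maximum for c in children),
    )


# Hypothetical attribute values falling into the 4 children of one parent cell.
groups = [[1.0, 3.0], [5.0], [7.0, 9.0, 11.0], [2.0, 4.0]]
combined = parent_stats([leaf_stats(g) for g in groups])
print(combined)
```

Because only five numbers per child are needed, an upper-level cell can be filled in without scanning the table again.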

Query Types
- An SQL-like language is used to describe queries.
- Two types of queries are common:
  - Find the regions that satisfy certain constraints.
  - Take in a region and return some attribute of that region.
- A top-down approach is used to answer spatial data queries.

Query Processing
1. Start from a pre-selected layer, typically one with a small number of cells. (The pre-selected layer does not have to be the topmost layer.)
2. For each cell in the current layer, compute the confidence interval (or estimated range of probability) reflecting the cell's relevance to the given query.
3. The confidence interval is calculated using the statistical parameters of each cell.

Query Processing (Cont.)
4. Remove irrelevant cells from further consideration.
5. When finished with the current layer, proceed to the next lower level.
6. Processing of the next lower level examines only the children of the remaining relevant cells.
7. Repeat this process until the bottom layer is reached.
8. Return the regions of relevant cells that satisfy the query.
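The top-down pruning loop can be sketched as below. This is a simplified illustration under stated assumptions, not STING's actual relevance test: instead of a confidence interval, a cell is treated as relevant when its stored count is at least `min_count`, which is a safe bound because a child's count can never exceed its parent's. The function names and the sample data are hypothetical.

```python
def build_counts(leaf_counts, depth):
    """Pre-compute per-cell counts for every level from the leaf level up to level 0."""
    levels = {depth: dict(leaf_counts)}
    for level in range(depth - 1, -1, -1):
        counts = {}
        for (row, col), count in levels[level + 1].items():
            key = (row // 2, col // 2)  # parent cell index
            counts[key] = counts.get(key, 0) + count
        levels[level] = counts
    return levels


def top_down_query(levels, depth, min_count, start_level=0):
    """Return the leaf cells holding at least min_count records, pruning level by level."""
    relevant = [
        cell for cell, count in levels[start_level].items() if count >= min_count
    ]
    for level in range(start_level, depth):
        next_relevant = []
        for row, col in relevant:
            # Only children of cells that are still relevant are examined.
            for dr in (0, 1):
                for dc in (0, 1):
                    child = (2 * row + dr, 2 * col + dc)
                    if levels[level + 1].get(child, 0) >= min_count:
                        next_relevant.append(child)
        relevant = next_relevant
    return sorted(relevant)


# Hypothetical record counts on a 4 x 4 leaf grid (depth 2); missing cells are empty.
leaf_counts = {(0, 0): 5, (0, 1): 1, (1, 1): 4, (2, 0): 3, (3, 3): 2}
levels = build_counts(leaf_counts, depth=2)
result = top_down_query(levels, depth=2, min_count=3)
print(result)
```

In this example the level-1 cell covering the lower-right quadrant is discarded early, so its four leaf cells are never inspected.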

[Figure: Different grid levels during query processing]

Sample Query Examples
- Select the maximal regions that have at least 100 houses per unit area, where at least 70% of the house prices are above $400K, with a total area of at least 100 units, with 90% confidence.
- Select the range of age of houses in those maximal regions where there are at least 100 houses per unit area and at least 70% of the houses have a price between $150K and $300K, with an area of at least 100 units, in California.

Sample Query Examples (Cont.)
- Assume that the spatial area is the map of the regions of Long Island, Brooklyn and Queens.
- Our records represent apartments located throughout that region.
- Query: "Find all the apartments for rent near Stony Brook University with a rent between $800 and $1000."
- This query depends on the parameter "near". For our example, near means within 15 miles of Stony Brook University.

Advantages and Disadvantages of STING
- ADVANTAGES:
  - Very efficient.
  - The computational complexity is O(k), where k is the number of grid cells at the lowest level. Usually k << N, where N is the number of records.
  - STING is a query-independent approach, since the statistical information exists independently of queries.
  - Incremental update.
- DISADVANTAGES:
  - All cluster boundaries are either horizontal or vertical; no diagonal boundary is selected.

Summary
- Introduction
- Grid-Based Clustering Techniques
- STING
  - What is spatial data
  - STING Overview
  - Grid Cell Hierarchy
  - Hierarchical Structure
  - Statistical Parameters
  - Query Types
  - Query Processing
  - Advantages and disadvantages