CLUSTERING GRID-BASED METHODS Elsayed Hemayed Data Mining Course.


1 CLUSTERING GRID-BASED METHODS Elsayed Hemayed Data Mining Course

2 Outline: GRID-based Clustering Methods
- Introduction
- Grid-Based Clustering Techniques
- STING
- What is spatial data
- STING Overview
- Grid Cell Hierarchy
- Hierarchical Structure
- Statistical Parameters
- Query Types
- Query Processing
- Advantages and disadvantages

3 Clustering Methods
- Partitioning methods
  - K-Means
- Hierarchical methods
  - Agglomerative hierarchical clustering
  - Divisive hierarchical clustering
- Density-based methods
  - DBSCAN: a Density-Based Spatial Clustering of Applications with Noise
- Grid-based methods
  - STING: A Statistical Information Grid Approach to Spatial Data Mining
- High-dimensional data clustering
  - CLIQUE: A Dimension-Growth Subspace Clustering Method

4 GRID-BASED CLUSTERING METHODS
- In this approach we quantize the space into a finite number of cells that form a grid structure, and all clustering operations are performed on that grid.
- For example, assume that we have a set of records and want to cluster them with respect to two attributes. We divide the related space (a plane) into a grid structure and then find the clusters.

5 Example
[Figure: a two-dimensional grid with the annotation "Our 'space' is this plane". Axes: Age (ticks 20 to 60) and Salary in units of 10,000 (ticks 0 to 8).]
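The quantization step in the example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the cell widths (10 years of age, 1 salary unit), the sample records, the density threshold of 2, and the helper name `cell_of` are all hypothetical and do not come from the slides.

```python
from collections import Counter

def cell_of(age, salary, age_min=20, age_width=10, sal_min=0, sal_width=1):
    """Map an (age, salary) record to the (column, row) index of its grid cell.

    Assumed layout: age on the column axis, salary (in units of 10,000)
    on the row axis. The origin and cell widths are illustrative defaults.
    """
    col = int((age - age_min) // age_width)
    row = int((salary - sal_min) // sal_width)
    return col, row

# Hypothetical records: (age, salary in units of 10,000)
records = [(23, 1.5), (27, 1.2), (25, 1.9), (41, 5.5), (44, 5.1), (58, 7.3)]

# Count how many records fall into each cell of the grid.
counts = Counter(cell_of(age, salary) for age, salary in records)

# Treat a cell as "dense" if it holds at least 2 records (assumed threshold).
dense = sorted(cell for cell, n in counts.items() if n >= 2)
print(dense)
```

After this step the clustering works on the cell counts rather than on the individual records, which is what makes grid-based methods fast.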

6 Grid-Based Clustering Techniques
The following are some techniques used to perform grid-based clustering:
- CLIQUE (CLustering In QUEst)
- STING (STatistical INformation Grid)
- WaveCluster

7 STING: A Statistical Information Grid Approach to Spatial Data Mining

8 What is Spatial Data?
- Spatial data may be thought of as features located on or referenced to the Earth's surface, such as roads, streams, political boundaries, schools, land use classifications, property ownership parcels, drinking water intakes, and pollution discharge sites. In short, anything that can be mapped.
- Spatial area: the area that encompasses the locations of all the spatial data is called the spatial area.

9 STING Overview
- STING is used for performing clustering on spatial data.
- STING uses a hierarchical, multi-resolution grid data structure to partition the spatial area.
- STING's main benefit is that it efficiently processes many common "region-oriented" queries on a set of points.
- We want to cluster the records in a spatial table in terms of location.
- The placement of a record in a grid cell is completely determined by its physical location.

10 Grid Cell Hierarchy
- The spatial area is divided into rectangular cells (using latitude and longitude).
- The cells form a hierarchical structure.
- This means that each cell at a higher level is further partitioned into 4 smaller cells at the next lower level.
- In other words, each cell at the i-th level (except the leaves) has 4 children at level i+1.
- The union of the 4 child cells gives back the parent cell in the level above them.

11 Grid Cell Hierarchy (Cont.)
- The size of the leaf-level cells and the number of layers depend on how much granularity the user wants.
- So why do we have a hierarchical structure for cells?
- It lets us work at a finer granularity, or higher resolution, where needed.
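The parent/child relationship described above can be sketched with simple index arithmetic. This is a minimal illustration, not STING's actual storage layout: it assumes the root is numbered level 0 and that a cell is identified by a hypothetical `(level, row, col)` triple.

```python
def children(level, row, col):
    """Return the 4 child cells of (level, row, col) at the next lower level."""
    return [(level + 1, 2 * row + dr, 2 * col + dc)
            for dr in (0, 1) for dc in (0, 1)]

def parent(level, row, col):
    """Return the parent cell; the root (assumed to be level 0) has none."""
    if level == 0:
        raise ValueError("the root cell has no parent")
    return (level - 1, row // 2, col // 2)

def cells_at(level):
    """Number of cells at a level when every cell splits into 4."""
    return 4 ** level

print(children(0, 0, 0))   # the four quadrants of the root
print(parent(2, 3, 1))     # the level-1 cell that contains cell (3, 1)
print(cells_at(3))         # cells at the third level below the root
```

With this numbering, halving a row or column index moves one level up, and the four children together cover exactly the area of their parent.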

12 A Hierarchical Structure for STING Clustering

13 Statistical Parameters Stored in Each Cell
For each cell in each layer we have:
- Attribute-independent parameter: Count, the number of records in this cell.
- Attribute-dependent parameters (we assume that the attribute values are real numbers).

14 Statistical Parameters (Cont.)
For each attribute of each cell we store the following parameters:
- M: the mean of all values of the attribute in this cell.
- S: the standard deviation of all values of the attribute in this cell.
- Min: the minimum value of the attribute in this cell.
- Max: the maximum value of the attribute in this cell.
- Distribution: the type of distribution that the attribute values in this cell follow (e.g. normal, exponential). "None" is assigned to Distribution if the distribution is unknown.

15 Storing of Statistical Parameters
- Statistical information about the attributes in each grid cell, for each layer, is pre-computed and stored beforehand.
- The statistical parameters for the cells in the lowest layer are computed directly from the values present in the table.
- The statistical parameters for the cells in all other levels are computed from their respective child cells in the level below.
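The bottom-up computation can be sketched as follows. Leaf statistics come straight from the raw values, and a parent is derived from its children alone using the standard formulas for pooling counts, means, and population standard deviations. The names `CellStats`, `leaf_stats`, and `merge` are hypothetical, the Distribution parameter is left out for brevity, and the sketch assumes at least one non-empty child.

```python
import math
from dataclasses import dataclass

@dataclass
class CellStats:
    n: int        # Count
    mean: float   # M
    std: float    # S (population standard deviation)
    lo: float     # Min
    hi: float     # Max

def leaf_stats(values):
    """Compute the parameters of a bottom-level cell from its raw values."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return CellStats(n, mean, math.sqrt(var), min(values), max(values))

def merge(children):
    """Derive a parent's parameters from its children, without raw data."""
    kids = [c for c in children if c.n > 0]
    if not kids:
        raise ValueError("at least one child must be non-empty")
    n = sum(c.n for c in kids)
    mean = sum(c.mean * c.n for c in kids) / n
    # E[x^2] of the parent, rebuilt from each child's std and mean.
    second_moment = sum((c.std ** 2 + c.mean ** 2) * c.n for c in kids) / n
    std = math.sqrt(max(second_moment - mean ** 2, 0.0))
    return CellStats(n, mean, std,
                     min(c.lo for c in kids), max(c.hi for c in kids))

# Four hypothetical child cells and the parent built from them.
kids = [leaf_stats([1, 2]), leaf_stats([3]), leaf_stats([4, 5, 6]), leaf_stats([7])]
parent = merge(kids)
print(parent)
```

Because the parent never touches the raw records, an update to one leaf only requires recomputing the cells on the path from that leaf to the root.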

16 Query Types
- An SQL-like language is used to describe queries.
- Two common types of queries:
  - Find the regions that satisfy certain constraints.
  - Take a region as input and return some attribute of that region.
A top-down approach is used to answer spatial data queries.

17 Query Processing
1. Start from a pre-selected layer, typically one with a small number of cells. (The pre-selected layer does not have to be the topmost layer.)
2. For each cell in the current layer, compute the confidence interval (or estimated range of probability) reflecting the cell's relevance to the given query.
3. The confidence interval is calculated from the statistical parameters of each cell.

18 Query Processing (Cont.)
4. Remove irrelevant cells from further consideration.
5. When finished with the current layer, proceed to the next lower level.
6. Processing of the next lower level examines only the remaining relevant cells.
7. Repeat this process until the bottom layer is reached.
8. Return the regions of relevant cells that satisfy the query.
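The top-down pruning in steps 1 to 8 can be sketched as below. Note the simplification: instead of a confidence interval, this sketch uses a plain count threshold as the relevance test, which is a swapped-in stand-in rather than STING's statistical test. It is still a safe pruning rule for this particular query, because a parent's count is the sum of its children's counts. The functions `build_counts` and `query`, the unit-square spatial area, and the sample points are all hypothetical.

```python
def build_counts(points, depth, size=1.0):
    """Build per-level cell counts; levels[0] is the root, levels[depth] the leaves."""
    side = 2 ** depth
    leaf = {}
    for x, y in points:
        row = min(int(y / size * side), side - 1)
        col = min(int(x / size * side), side - 1)
        leaf[(row, col)] = leaf.get((row, col), 0) + 1
    levels = [leaf]
    for _ in range(depth):
        upper = {}
        for (row, col), n in levels[0].items():
            key = (row // 2, col // 2)          # parent cell one level up
            upper[key] = upper.get(key, 0) + n
        levels.insert(0, upper)
    return levels

def query(levels, min_count, start_level=0):
    """Return the leaf cells holding at least min_count points, searching top-down."""
    # Steps 1-4: test the cells of the starting layer, drop the irrelevant ones.
    relevant = [cell for cell, n in levels[start_level].items() if n >= min_count]
    # Steps 5-7: descend, examining only children of the remaining cells.
    for level in range(start_level + 1, len(levels)):
        survivors = []
        for row, col in relevant:
            for dr in (0, 1):
                for dc in (0, 1):
                    child = (2 * row + dr, 2 * col + dc)
                    if levels[level].get(child, 0) >= min_count:
                        survivors.append(child)
        relevant = survivors
    # Step 8: the surviving bottom-layer cells answer the query.
    return sorted(relevant)

points = [(0.05, 0.05), (0.1, 0.1), (0.12, 0.07), (0.9, 0.9), (0.6, 0.2)]
levels = build_counts(points, depth=2)
print(query(levels, min_count=3))
```

Only the children of surviving cells are ever inspected, so large empty or sparse regions are discarded after a single test near the top of the hierarchy.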

19 Different Grid Levels during Query Processing

20 Sample Query Examples
- Select the maximal regions that have at least 100 houses per unit area, in which at least 70% of the house prices are above $400K, and with a total area of at least 100 units, with 90% confidence.
- Select the range of age of houses in those maximal regions where there are at least 100 houses per unit area and at least 70% of the houses have a price between $150K and $300K, with an area of at least 100 units, in California.

21 Sample Query Examples
- Assume that the spatial area is the map of the regions of Long Island, Brooklyn and Queens.
- Our records represent apartments located throughout this region.
- Query: "Find all the apartments for rent near Stony Brook University with a rent in the range $800 to $1000."
- This query depends on the parameter "near". In our example, near means within 15 miles of Stony Brook University.

22 Advantages and Disadvantages of STING
ADVANTAGES:
- Very efficient.
- The computational complexity is O(k), where k is the number of grid cells at the lowest level. Usually k << N, where N is the number of records.
- STING is a query-independent approach, since the statistical information exists independently of queries.
- Supports incremental updates.
DISADVANTAGES:
- All cluster boundaries are either horizontal or vertical; no diagonal boundary is selected.

23 Summary
- Introduction
- Grid-Based Clustering Techniques
- STING
- What is spatial data
- STING Overview
- Grid Cell Hierarchy
- Hierarchical Structure
- Statistical Parameters
- Query Types
- Query Processing
- Advantages and disadvantages

