Clustering Uncertain Data
Speaker: Ngai Wang Kay


Data clustering is used to discover cluster patterns in a data set: the data set may be partitioned into several groups, or clusters, such that the data within the same cluster are closer to each other, or more similar under some distance function, than they are to data in any other cluster. There are many methods for clustering data; K-means clustering is a common one.

K-means clustering gives each cluster a representative, which is the mean of the data in the cluster. As an example, consider a database of location data reported by moving vehicles in a tracking system.

Given K location points (e.g. the US White House, a school, etc.) near the data as an initial guess of the representatives of the clusters expected to exist in the data, K-means clustering assigns each vehicle to the one of the K clusters whose representative is closest to the vehicle's location in Euclidean distance.

Then the representative of each cluster is updated to the mean of the locations of the vehicles in the cluster, and each vehicle is re-assigned to the K clusters using the new representatives. This process repeats until some objective is met, e.g. no vehicle changes cluster between two successive iterations.
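
As an illustration, here is a minimal Python sketch of this K-means loop for the vehicle example; the function names are mine, and it stops when the representatives no longer move (which implies no vehicle changes cluster):

    import math

    def kmeans(points, reps, max_iters=100):
        """points: vehicle locations as tuples; reps: K initial representatives."""
        clusters = [[] for _ in reps]
        for _ in range(max_iters):
            # Assignment step: each vehicle joins the cluster whose
            # representative is nearest in Euclidean distance.
            clusters = [[] for _ in reps]
            for pt in points:
                j = min(range(len(reps)), key=lambda i: math.dist(pt, reps[i]))
                clusters[j].append(pt)
            # Update step: each non-empty cluster's representative moves to
            # the mean of its members; an empty cluster keeps its old one.
            new_reps = [tuple(sum(d) / len(pts) for d in zip(*pts)) if pts else r
                        for pts, r in zip(clusters, reps)]
            if new_reps == reps:   # nothing moved, so no vehicle would change cluster
                break
            reps = new_reps
        return reps, clusters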

After the clustering, a cluster could be empty. A non-empty cluster may carry some meaning, e.g. if its representative point is very close to the US White House, the vehicles in the cluster might be classified as spies. Note that if the vehicles are constantly moving, their actual locations may have changed by the time their reported location data is received.

In that case, the data in the database is not very accurate. Each data item will have an "uncertainty" region around it within which its corresponding vehicle's actual location lies. The uncertainty region could be arbitrary, or simply a circle centered at the reported location with a radius equal to the vehicle's maximum speed times the time elapsed since the location was reported.

The uncertainty region could also be associated with an arbitrary probability density function (pdf) giving the probability of the vehicle's actual location being at a particular point of the region. For example, part of the region may be sea, so that part may be assigned a total probability of 0.1, uniformly distributed over its points (for the case that the vehicle crashes into the sea at one of those points), while the remaining probability of 0.9 covers the points of the rest of the region.

Data with this kind of uncertainty is called uncertain data. There are only a few methods for clustering uncertain data; UK-means clustering is a common one. UK-means clustering is the same as K-means clustering except that its distance function is the "expected distance" from a data item's uncertainty region to the representative of the candidate cluster to which it would be assigned.

For a cluster representative c, an uncertainty region R with a pdf f, and the Euclidean distance function D(p, c) between any two points, the expected distance, called ed, is

    ed(R, c) = \int_{R} D(p, c) \, f(p) \, dp
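
For an arbitrary pdf this integral has no closed form and is typically estimated by Monte Carlo sampling, as noted below. Here is a minimal Python sketch under that assumption; the sampler for a uniform circular region and all function names are illustrative, not from the slides:

    import math, random

    def expected_distance(sample_point, c, n_samples=10_000):
        # Average the Euclidean distance D(p, c) over points p drawn from f.
        return sum(math.dist(sample_point(), c) for _ in range(n_samples)) / n_samples

    # Example region: a circle of uncertainty around the reported location
    # with a uniform pdf (radius = maximum speed * time elapsed since report).
    def uniform_disk_sampler(center, radius):
        def sample():
            r = radius * math.sqrt(random.random())  # sqrt gives uniform density over the area
            theta = random.uniform(0.0, 2.0 * math.pi)
            return (center[0] + r * math.cos(theta),
                    center[1] + r * math.sin(theta))
        return sample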

An uncertainty region could have an arbitrary shape, so the minimum bounding rectangle (MBR) of the region is used for R. The time complexity of UK-means clustering is O(nK) for computing the ed values between each of the n data items and each of the K candidate clusters. This is very expensive, especially for an uncertainty region with an arbitrary pdf f, which must be sampled with a large number of samples in Monte Carlo methods to compute the f values at the points.

A method called Global-minmax Pruning is developed in my research to prune some candidate clusters for a data item and so save some computations of ed values. Define lmin(R, M) to be a lower bound on the ed values computed between an uncertainty region R and an MBR M enclosing some candidate clusters' representative points, and define lmax(R, M) to be the corresponding upper bound.

Define minmax to be the minimum of the lmax values over all candidate clusters for a data item; any candidate cluster whose lmin exceeds minmax can be pruned, since its ed value cannot be the smallest. Global-minmax Pruning then works like this: a KD-tree of a given height h is built to index the representative points of the candidate clusters, with p entries in each non-leaf node. Hence there are (p^h − p) / (p − 1) non-leaf entries.
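
As a sketch of the pruning rule alone (before bringing in the KD-tree), the per-data-item step could look like this in Python; lmin and lmax are assumed to be supplied as functions over the data item's MBR and a candidate's MBR, and the names are illustrative:

    def prune_candidates(region_mbr, candidate_mbrs, lmin, lmax):
        # minmax: the smallest upper bound (lmax) among all candidate clusters.
        minmax = min(lmax(region_mbr, m) for m in candidate_mbrs)
        # Keep only candidates whose lower bound does not exceed minmax;
        # a pruned candidate can never give the smallest expected distance,
        # so its exact ed value is never computed.
        return [i for i, m in enumerate(candidate_mbrs)
                if lmin(region_mbr, m) <= minmax]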

The entries in each level, except the leaf level, are sorted in increasing order along one dimension, alternating the dimension between levels, which takes O((h − 1) K log K) time. The KD-tree is usually small and can be stored in the CPU cache, so its access time is negligible and is ignored.
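
A minimal sketch of building such a tree over the K representative points follows; for simplicity it recurses until each leaf holds at most p points rather than fixing the height h, and it stores each node's MBR so that lmin/lmax bounds can be taken against whole subtrees. The structure and names are assumptions, not the exact index used in the research:

    from typing import List, Tuple

    Point = Tuple[float, ...]

    def mbr(points: List[Point]) -> Tuple[Point, Point]:
        # Axis-aligned minimum bounding rectangle: (lower corner, upper corner).
        dims = range(len(points[0]))
        return (tuple(min(q[d] for q in points) for d in dims),
                tuple(max(q[d] for q in points) for d in dims))

    def build(points: List[Point], p: int, depth: int = 0) -> dict:
        """Assumes a non-empty list of representative points."""
        if len(points) <= p:                  # leaf: the representatives themselves
            return {"points": points, "mbr": mbr(points)}
        dim = depth % len(points[0])          # alternate the sort dimension per level
        pts = sorted(points, key=lambda q: q[dim])
        size = -(-len(pts) // p)              # ceil(len / p): entries per child
        children = [build(pts[i:i + size], p, depth + 1)
                    for i in range(0, len(pts), size)]
        return {"children": children, "mbr": mbr(points)}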

The time complexity for computing the ed values is then O(n(K − P)) if, on average, P candidate clusters are pruned out for each object.

If the KD-tree uses a capacity of p entries for its non-leaf nodes, the worst-case time complexity of the pruning process is O( (h − 1) K log(K) + n[2K + (p^h − p) / (p − 1)] ) when no non-leaf entries are pruned out. Note that this is not better than the O(nK) time complexity for computing the ed values in UK-means clustering without any pruning unless computing an ed value is at least twice as slow as computing an lmax or lmin value (which may not hold, e.g., for uncertainty regions with a uniform pdf). So another pruning method called Local-minmax Pruning is developed in my research to address that special case.

The worst-case time complexity of the pruning process is then O( (h − 1) K log(K) + n[2K − 2Q + 2(p^h − p) / (p − 1)] ) when Q leaf entries are pruned out before their nodes are visited. This is better than the O( (h − 1) K log(K) + n[2K − Q + (p^h − p) / (p − 1)] ) of Global-minmax Pruning if Q > (p^h − p) / (p − 1). It can even be better than the non-pruning method's O(nK) in the earlier special case where computing an ed value is fast relative to an lmin or lmax value, provided Q > K / 2 + (p^h − p) / (p − 1) + (h − 1) K log(K) / (2n).

But Local-minmax Pruning is not as effective at pruning as Global-minmax Pruning, and hence can be less efficient when the computation of ed values is a large overhead, since its O(n(K − P)) term increases. A simple way to compute lmin(R, M) is to use the Euclidean distance between the two nearest points of R and M; a simple way to compute lmax(R, M) is to use the Euclidean distance between the two farthest points of R and M.
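
A minimal sketch of these two bounds for axis-aligned MBRs, each given as a (lower corner, upper corner) pair of coordinate tuples; this is the standard closest-pair/farthest-pair box computation, assumed here rather than taken from the slides:

    import math

    def lmin(r_lo, r_hi, m_lo, m_hi):
        # Per-dimension gap between the two boxes (zero where they overlap);
        # the closest pair of points differs only by these gaps.
        gaps = [max(0.0, rl - mh, ml - rh)
                for rl, rh, ml, mh in zip(r_lo, r_hi, m_lo, m_hi)]
        return math.sqrt(sum(g * g for g in gaps))

    def lmax(r_lo, r_hi, m_lo, m_hi):
        # The farthest pair of points sits at opposite corners: take the
        # larger span per dimension.
        spans = [max(abs(rh - ml), abs(mh - rl))
                 for rl, rh, ml, mh in zip(r_lo, r_hi, m_lo, m_hi)]
        return math.sqrt(sum(s * s for s in spans))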