A Framework for Clustering Evolving Data Streams Charu C. Aggarwal, Jiawei Han, Jianyong Wang, Philip S. Yu Presented by: Di Yang Charudatta Wad.


Presentation transcript:

A Framework for Clustering Evolving Data Streams Charu C. Aggarwal, Jiawei Han, Jianyong Wang, Philip S. Yu Presented by: Di Yang Charudatta Wad

Outline Background of Clustering Motivation for Clustering over Streaming Data. Overall Solution Micro Clusters Pyramid Time Frame Macro Cluster Cluster Maintenance

Background of Clustering Definition of clustering: for a given set of data points, partition them into one or more groups of similar objects. "Similarity" is usually defined by a distance measure. Difference between "group by" queries and clustering.

Background of Clustering Some of the most popular clustering algorithms: K-Means, BIRCH, CURE, density-based clustering. Clustering has many applications in databases, information visualization, and data mining. What are outliers?

Motivation Challenges in a streaming environment: clustering is an expensive process; resources are constrained; streams are infinite. Is it enough to simply extend one-pass algorithms for static databases to stream processing?

Motivation Requirements of clustering for stream processing: storage of statistical summary information; an efficient update process; the ability to cluster over a specific time horizon.

Overall Solution of the Paper Divide the clustering process into two phases. Online component: periodically stores detailed summary statistics. Offline component: uses only the summary statistics to do clustering.

Micro-Clusters What is a micro-cluster? A micro-cluster is a set of individual data points that are close to each other and will be treated as a single unit in the later offline macro-clustering. [Figure: view of a micro-cluster; view of a macro-cluster]

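Micro-Clusters What to store in a micro-cluster: in the CluStream paper, a micro-cluster over d-dimensional points is the (2d + 3)-tuple (CF2^x, CF1^x, CF2^t, CF1^t, n), that is, the per-dimension sum of squares and sum of the data values, the sum of squares and sum of the timestamps, and the number of points. Key idea: Additivity Property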
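The additivity property can be illustrated with a minimal sketch. The class name `MicroCluster` and its field names are assumptions made for this example and are not taken from the slides; the point is only that two summaries merge by component-wise addition, without revisiting the raw points.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MicroCluster:
    """Hypothetical summary of a group of points (names are illustrative)."""
    n: int            # number of points
    ls: List[float]   # per-dimension sum of values (CF1^x)
    ss: List[float]   # per-dimension sum of squared values (CF2^x)
    lt: float         # sum of timestamps (CF1^t)
    st: float         # sum of squared timestamps (CF2^t)

    @classmethod
    def from_point(cls, x: List[float], t: float) -> "MicroCluster":
        # A single point is its own micro-cluster.
        return cls(1, list(x), [v * v for v in x], t, t * t)

    def merged(self, other: "MicroCluster") -> "MicroCluster":
        # Additivity: every component of the summary is simply added.
        return MicroCluster(
            self.n + other.n,
            [a + b for a, b in zip(self.ls, other.ls)],
            [a + b for a, b in zip(self.ss, other.ss)],
            self.lt + other.lt,
            self.st + other.st,
        )

    def centroid(self) -> List[float]:
        return [v / self.n for v in self.ls]


a = MicroCluster.from_point([1.0, 2.0], t=1.0)
b = MicroCluster.from_point([3.0, 4.0], t=2.0)
ab = a.merged(b)
print(ab.n, ab.centroid())
```

Because only sums are stored, the same addition (and, later, subtraction) works on summaries of any size.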

Pyramidal Time Frame When should we take a snapshot? The micro-clusters are stored at snapshots, and the snapshots follow a pyramidal pattern.

Pyramidal Time Frame Snapshots are classified into different orders, which can vary from 0 to log_α(T). A snapshot of order i is taken at intervals of α^i. For example, if T is 55 and α = 2, then we have order 0 with interval 2^0 = 1, order 1 with interval 2^1 = 2, order 2 with interval 2^2 = 4, order 3 with interval 2^3 = 8, order 4 with interval 2^4 = 16, and order 5 with interval 2^5 = 32. For a data stream, the maximum number of snapshots maintained at T time units since the beginning of the stream mining process is (α + 1) · log_α(T), because only the last α + 1 snapshots of each order are kept.
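The T = 55, α = 2 example can be worked through with a short sketch. The function name `snapshot_times` is an assumption for this illustration; it follows the rule stated on the slide (order i has interval α^i, and the last α + 1 snapshots of each order are kept) and does not remove snapshots that appear under more than one order.

```python
import math


def snapshot_times(T: int, alpha: int) -> dict:
    """Map each order i to the snapshot times still stored at time T.

    Hypothetical helper: order i snapshots fall on multiples of alpha**i,
    and only the most recent alpha + 1 of them are retained.
    """
    orders = {}
    i = 0
    while alpha ** i <= T:
        step = alpha ** i
        latest = (T // step) * step            # most recent multiple of step
        times = list(range(latest, 0, -step))  # newest first
        orders[i] = times[: alpha + 1]
        i += 1
    return orders


orders = snapshot_times(55, 2)
for order, times in orders.items():
    print(f"order {order} (interval {2 ** order}): {times}")

stored = sum(len(times) for times in orders.values())
bound = (2 + 1) * math.log(55, 2)
print(stored, "snapshots stored; bound is about", round(bound, 1))
```

The same time can appear under several orders (48 is a multiple of 4, 8 and 16), so the number of distinct snapshots is smaller than the per-order total.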

Why Pyramidal Pattern? For any user-specified time window h, at least one stored snapshot can be found within 2h units of the current time. Please note: only approximate answers!
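A small sketch of the lookup, under stated assumptions: the list `STORED` holds the distinct snapshot times that remain at T = 55 with α = 2 when the last α + 1 snapshots of each order are kept (worked out by hand for this example), and `snapshot_for_window` is a hypothetical helper that picks the latest stored snapshot at or before T − h.

```python
from bisect import bisect_right

T = 55
# Distinct snapshot times retained at T = 55 for alpha = 2 (assumed example data).
STORED = [16, 32, 40, 44, 48, 50, 52, 53, 54, 55]


def snapshot_for_window(h: int) -> int:
    """Return the latest stored snapshot time that is <= T - h."""
    idx = bisect_right(STORED, T - h)
    if idx == 0:
        raise ValueError("no stored snapshot is old enough for this window")
    return STORED[idx - 1]


for h in (1, 8, 20):
    s = snapshot_for_window(h)
    print(f"h={h}: use snapshot at {s}, which is {T - s} units back (2h = {2 * h})")
```

The snapshot found is generally older than T − h, which is why the answer for the window is approximate.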

Micro Cluster Creation It is assumed that a total of q micro-clusters are maintained at any moment by the algorithm. The initial micro-clusters are created by an offline process (k-means) at the very beginning of the data stream computation.

Online Micro Cluster Maintenance How to deal with a newly arriving point? 1. It joins one of the existing clusters, or 2. it forms a new cluster of its own. How to deal with the old clusters? 1. Delete one (based on its relevance stamp), or 2. merge the two closest clusters. A merged cluster keeps all the IDs of its components.
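The maintenance rules above can be sketched as follows. This is a simplified illustration, not the paper's exact procedure: the fixed `radius` test for joining a cluster, the use of the last update time as the relevance stamp, and the names `insert_point`, `radius` and `horizon` are all assumptions made for the example. It also assumes q is at least 2.

```python
import math


def centroid(mc):
    return [s / mc["n"] for s in mc["ls"]]


def insert_point(clusters, x, t, q, radius, horizon, next_id):
    """Process one point; returns the next unused cluster ID."""
    if clusters:
        nearest = min(clusters, key=lambda mc: math.dist(centroid(mc), x))
        if math.dist(centroid(nearest), x) <= radius:
            # Case 1: the point joins an existing cluster.
            nearest["n"] += 1
            nearest["ls"] = [s + v for s, v in zip(nearest["ls"], x)]
            nearest["last_t"] = t
            return next_id
    # Case 2: the point starts a new cluster, so make room if q is reached.
    if len(clusters) >= q:
        stale = min(clusters, key=lambda mc: mc["last_t"])
        if t - stale["last_t"] > horizon:
            clusters.remove(stale)  # delete the least recently updated cluster
        else:
            # Merge the two closest clusters; the result keeps both ID lists.
            _, i, j = min(
                (math.dist(centroid(a), centroid(b)), i, j)
                for i, a in enumerate(clusters)
                for j, b in enumerate(clusters)
                if i < j
            )
            a, b = clusters[i], clusters[j]
            a["ids"] = a["ids"] + b["ids"]
            a["n"] += b["n"]
            a["ls"] = [p + r for p, r in zip(a["ls"], b["ls"])]
            a["last_t"] = max(a["last_t"], b["last_t"])
            del clusters[j]
    clusters.append({"ids": [next_id], "n": 1, "ls": list(x), "last_t": t})
    return next_id + 1


clusters, next_id = [], 0
stream = [([0.0, 0.0], 1), ([0.1, 0.0], 2), ([10.0, 10.0], 3), ([20.0, 20.0], 4)]
for x, t in stream:
    next_id = insert_point(clusters, x, t, q=2, radius=1.0, horizon=100, next_id=next_id)
print([mc["ids"] for mc in clusters])
```

In this run the fourth point forces a merge, because no cluster is older than the horizon; with a small horizon the oldest cluster would be deleted instead.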

Macro-Cluster Creation Based on the Additivity Property of cluster feature vector

Macro-Cluster Creation The current time is T and the window size is h, meaning the user wants to find the clusters formed in (T-h, T). Approach: 1. Find the snapshot for T and get the micro-cluster set S(T). 2. Find the snapshot for T-h and get the micro-cluster set S(T-h). 3. Use S(T)-S(T-h). Specifically, suppose we have a merged cluster with ID list (C1, C2, C3) in S(T) and a cluster with ID C1 in S(T-h). Then we use CFT(C1,C2,C3)-CFT(C1)=CFT(C2,C3), because C1 was formed before T-h and thus should not contribute to the micro-clusters formed in (T-h, T).

Example S(T-h) contains C_ID: [C1] (time T-h). S(T) contains C_ID: [C1, C2, C3] (time T). Result of S(T)-S(T-h): C_ID: [C2, C3].
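The subtraction S(T) − S(T−h) can be sketched with ID lists as sets. The snapshot layout (a dict from a frozenset of IDs to a pair of point count and linear sum) and the function name `subtract_snapshots` are assumptions for this illustration; the numbers are made up.

```python
def subtract_snapshots(s_now, s_then):
    """Remove from each current cluster the older clusters it contains.

    Each snapshot maps frozenset(ids) -> (n, ls), where n is the point
    count and ls is the per-dimension sum of values (hypothetical layout).
    """
    result = {}
    for ids, (n, ls) in s_now.items():
        kept_ids = set(ids)
        ls = list(ls)
        for old_ids, (old_n, old_ls) in s_then.items():
            if old_ids <= ids:  # the old cluster was absorbed into this one
                n -= old_n
                ls = [a - b for a, b in zip(ls, old_ls)]
                kept_ids -= old_ids
        if n > 0:
            # If every ID was subtracted but points remain, keep the original IDs.
            result[frozenset(kept_ids) or ids] = (n, ls)
    return result


s_then = {frozenset({"C1"}): (4, [4.0, 8.0])}
s_now = {frozenset({"C1", "C2", "C3"}): (10, [25.0, 30.0])}
window = subtract_snapshots(s_now, s_then)
print(window)
```

The subtraction is valid only because the cluster feature components are sums, which is the additivity property at work.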

Macro-Cluster Creation Run K-means on Micro-Clusters
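One way to run k-means on micro-clusters is to treat each micro-cluster centroid as a pseudo-point weighted by its point count. The sketch below does that in plain Python; the function name `weighted_kmeans`, the fixed iteration count, and seeding with the first k centroids are simplifying assumptions for the example rather than the seeding used in the paper.

```python
import math


def weighted_kmeans(centroids, weights, k, iters=20):
    """Cluster micro-cluster centroids, weighting each by its point count."""
    centers = [list(c) for c in centroids[:k]]  # deterministic seeding (assumption)
    assign = [0] * len(centroids)
    for _ in range(iters):
        # Assignment step: each pseudo-point goes to its nearest center.
        assign = [
            min(range(k), key=lambda j: math.dist(c, centers[j]))
            for c in centroids
        ]
        # Update step: each center becomes the weighted mean of its members.
        for j in range(k):
            members = [i for i, a in enumerate(assign) if a == j]
            total = sum(weights[i] for i in members)
            if total > 0:
                dim = len(centers[j])
                centers[j] = [
                    sum(weights[i] * centroids[i][d] for i in members) / total
                    for d in range(dim)
                ]
    return centers, assign


micro_centroids = [[0.0, 0.0], [10.0, 10.0], [0.0, 1.0], [10.0, 11.0]]
micro_counts = [3, 1, 1, 3]
centers, assign = weighted_kmeans(micro_centroids, micro_counts, k=2)
print(centers, assign)
```

Weighting by count pulls each macro-cluster center toward the micro-clusters that summarize more points.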

How do you feel about this paper? My feeling: Quite fuzzy results: approximation is everywhere. Nothing new: micro-clusters, k-means, cluster feature vectors, and the pyramidal time frame are all old ideas.

Counter Example S(T-h) contains C_ID: [C1, C3] (time T-h). S(T) contains C_ID: [C1, C2, C3] (time T). Result: C_ID: [C2].

Advertisement Di and Charu's project deals with: 1. Deterministic Clusters 2. Clusters with Arbitrary Shapes 3. Real Expirations 4. Disk Version 5. Outlier Detection for Free