Data Stream Management Systems--Supporting Stream Mining Applications


1 Data Stream Management Systems--Supporting Stream Mining Applications
Carlo Zaniolo CS240B

2 Motivation for Data Stream Mining
Most interesting applications come from dynamic environments where data are collected over time, e.g., customer transactions, call records, customer click data. In these applications batch learning is no longer sufficient: algorithms should be able to incorporate new data. Some algorithms that are incremental by nature, e.g., kNN classifiers and Naïve Bayes classifiers, can be easily extended for data streams, but most algorithms need changes to support incremental induction. Algorithms should also be able to deal with non-stationary data, by – adapting in the presence of concept drift – forgetting outdated data and using the most recent state of the knowledge in the presence of significant changes (concept shift).

3 Motivation Experiments at CERN are generating an entire petabyte (1 PB = 10^6 GB) of data every second as particles fired around the Large Hadron Collider (LHC) at velocities approaching the speed of light are smashed together. “We don’t store all the data as that would be impractical. Instead, from the collisions we run, we only keep the few pieces that are of interest, the rare events that occur, which our filters spot and send on over the network,” he said. This still means CERN is storing 25 PB of data every year – the same as 1,000 years' worth of DVD-quality video – which can then be analyzed and interrogated by scientists looking for clues to the structure and make-up of the universe.

4 Cluster Analysis: objective shared by all algorithms
Finding groups of objects such that the objects within a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups. Inter-cluster distances are maximized; intra-cluster distances are minimized.

5 Cluster Analysis: Many Different Approaches and Algorithms
Partitional clustering: a division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset.
Hierarchical clustering: a set of nested clusters organized as a hierarchical tree.
Exclusive versus non-exclusive: in non-exclusive clusterings, points may belong to multiple clusters; this can represent multiple classes or ‘border’ points.
Fuzzy versus non-fuzzy: in fuzzy clustering, a point belongs to every cluster with some weight between 0 and 1, and the weights must sum to 1; probabilistic clustering has similar characteristics.
Partial versus complete: in some cases, we only want to cluster some of the data.

6 PART I: Main Static Clustering Algorithms
K-means and its variants
Hierarchical clustering
Density-based clustering

7 K-means Clustering Partitional clustering approach
Each cluster is associated with a centroid (center point). Each point is assigned to the cluster with the closest centroid. The number of clusters, K, must be specified. The basic algorithm is very simple.

8 K-means Clustering – Details
Initial centroids are often chosen randomly, so the clusters produced vary from one run to another. The centroid is (typically) the mean of the points in the cluster. ‘Closeness’ is measured by Euclidean distance, cosine similarity, correlation, etc. K-means will converge for the common similarity measures mentioned above, and most of the convergence happens in the first few iterations, so the stopping condition is often changed to ‘until relatively few points change clusters’. Complexity is O(n * K * I * d), where n = number of points, K = number of clusters, I = number of iterations, and d = number of attributes. A minimal sketch appears below.
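As a concrete illustration, here is a minimal K-means sketch in Python/NumPy, assuming Euclidean distance and random initial centroids (both choices the slide mentions); the function and variable names are ours, not from the slides.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Random seeding: pick k distinct points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment step: each point goes to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its points.
        centroids_new = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(centroids_new, centroids):
            break  # converged: the centroids no longer move
        centroids = centroids_new
    return centroids, labels
```

Each iteration costs O(n * K * d), matching the O(n * K * I * d) total above.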

9 Two different K-means Clusterings
[Figure: the same data set clustered two ways – the original points, an optimal clustering, and a sub-optimal clustering.]

10 Importance of Choosing Initial Centroids

11 Different Centroids (Seeds)

12 Limitations of k-means:
Problems with the algorithm: the result depends on the initial centroids, with no assurance of optimality, so many runs are used in practice. There has been much work on good seeding algorithms, e.g., K-means++ (sketched below). But the user must still supply K, or try many values of K to find the best, or use a series of bisecting K-means runs.
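The seeding idea mentioned above is usually called K-means++; a sketch under its standard formulation (each new seed is drawn with probability proportional to its squared distance from the nearest seed chosen so far):

```python
import numpy as np

def kmeanspp_seeds(X, k, seed=0):
    rng = np.random.default_rng(seed)
    seeds = [X[rng.integers(len(X))]]  # first seed: uniform at random
    for _ in range(k - 1):
        # Squared distance from each point to its nearest existing seed.
        d2 = np.min([np.sum((X - s) ** 2, axis=1) for s in seeds], axis=0)
        # Far-away points are proportionally more likely to be picked.
        seeds.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(seeds)
```

The resulting seeds can be passed to K-means in place of random initial centroids.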

13 Limitations of k-Means
Problems with the model: K-means has problems when clusters are of differing – sizes – densities – non-globular shapes. K-means also has problems when the data contains outliers.

14 Static Clustering Algorithms
K-means and its variants: in spite of all these problems, K-means remains the most commonly used clustering algorithm! Next: hierarchical clustering, then density-based clustering.

15 Limitations of k-Means: different sizes

16 Limitations of k-Means: different densities

17 Limitations of k-means: non-globular shapes

18 Hierarchical Clustering
Two main types of hierarchical clustering: – Agglomerative: start with the points as individual clusters; at each step, merge the closest pair of clusters until only one cluster (or k clusters) is left. – Divisive: start with one, all-inclusive cluster; at each step, split a cluster until each cluster contains a single point (or there are k clusters). Traditional hierarchical algorithms use a similarity or distance matrix and merge or split one cluster at a time, which is expensive; see the sketch below.
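A naive agglomerative sketch, assuming single linkage (the distance between two clusters is the distance between their closest members); its repeated scan of all cluster pairs is exactly the expense noted above:

```python
import numpy as np

def agglomerative(X, k):
    clusters = [[i] for i in range(len(X))]  # start: every point is a cluster
    while len(clusters) > k:
        best = (0, 1, np.inf)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the two closest members.
                d = min(np.linalg.norm(X[i] - X[j])
                        for i in clusters[a] for j in clusters[b])
                if d < best[2]:
                    best = (a, b, d)
        a, b, _ = best
        clusters[a].extend(clusters.pop(b))  # merge the closest pair
    return clusters
```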

19 Hierarchical Clustering Algorithms
Hierarchical clustering algorithms can be used to generate hierarchically structured clusters (a tree of nested clusters), or simply to partition the data into clusters.

20 Hierarchical Clustering Algorithms
Hierarchical clustering algorithms can also be used to partition the data into clusters. The CLUBS/CLUBS+ algorithm, recently developed at UCLA, uses a divisive phase followed by an agglomerative phase to build elliptical clusters around centroids. It is totally unsupervised (no seeding, no K), insensitive to noise and outliers, produces results of superior quality, and is extremely fast – so much so that it can be used for fast seeding of K-means.

21 Clustering Algorithms
K-means and its variants
Hierarchical clustering
Density-based clustering: next

22 DBSCAN DBSCAN is a density-based algorithm.
Density = number of points within a specified radius (Eps). A point is a core point if it has more than a specified number of points (MinPts) within Eps; core points are at the interior of a cluster. A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point. A noise point is any point that is neither a core point nor a border point. A sketch classifying the three point types follows.
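The three point types can be computed directly from the definitions; a sketch (Eps and MinPts assumed given; we treat “more than a specified number” as >= MinPts, a common convention):

```python
import numpy as np

def classify_points(X, eps, min_pts):
    # Pairwise distances and Eps-neighborhoods (each point is its own neighbor).
    dist = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    neighbors = dist <= eps
    core = neighbors.sum(axis=1) >= min_pts
    # Border: not core, but inside some core point's Eps-neighborhood.
    border = ~core & (neighbors & core[None, :]).any(axis=1)
    noise = ~core & ~border
    return core, border, noise
```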

23 DBSCAN: core, border & noise points

24 DBSCAN: The Algorithm Eps and MinPts
Let ClusterCount = 0. For every point p: (1) if p is not a core point, assign a null label to it [e.g., zero]; (2) if p is a core point, a new cluster is formed [with label ClusterCount := ClusterCount + 1], then find all points density-reachable from p and classify them in the cluster [reassign the zero labels, but not the others]. Repeat this process until all of the points have been visited. Since all the zero labels of border points have been reassigned in step 2, the remaining points with zero label are noise. A sketch follows.
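A compact sketch of this labeling scheme: label 0 is the provisional null label, clusters receive ids 1, 2, ..., and whatever is still 0 at the end is noise.

```python
import numpy as np

def dbscan(X, eps, min_pts):
    dist = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    nbrs = [np.flatnonzero(dist[i] <= eps) for i in range(len(X))]
    core = np.array([len(nb) >= min_pts for nb in nbrs])
    labels = np.zeros(len(X), dtype=int)  # 0 = null label
    cluster_count = 0
    for p in range(len(X)):
        if not core[p] or labels[p] != 0:
            continue  # only an unlabeled core point starts a new cluster
        cluster_count += 1
        frontier = [p]
        while frontier:  # expand the cluster via density-reachability
            q = frontier.pop()
            if labels[q] != 0:
                continue  # reassign zero labels only, never the others
            labels[q] = cluster_count
            if core[q]:  # only core points extend the chain
                frontier.extend(nbrs[q])
    return labels  # points still labeled 0 are noise
```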

25

26

27 Many stream clustering approaches: a taxonomy

28 Partitioning methods
Goal: construct a partition of a set of objects into k clusters, e.g., k-Means, k-Medoids. Two types of methods: Adaptive methods: Leader (Spath 1980), Simple single-pass k-Means (Farnstrom et al., 2000), STREAM k-Means (O’Callaghan et al., 2002). Online summarization with offline clustering methods: CluStream (Aggarwal et al., 2003).

29 Leader [Spath 1980] The simplest single‐pass partitioning algorithm
Whenever a new instance p arrives from the stream: find its closest cluster (leader) c_clos; assign p to c_clos if their distance is below the threshold d_thresh; otherwise, create a new cluster (leader) with p. + 1-pass and fast algorithm + No prior information on the number of clusters – Unstable algorithm: it depends on the order of the examples and on a correct guess of d_thresh. A sketch follows.
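A sketch of Leader in Python; d_thresh is the threshold the slide warns must be guessed correctly:

```python
import numpy as np

class Leader:
    def __init__(self, d_thresh):
        self.d_thresh = d_thresh
        self.leaders = []  # one representative point per cluster

    def insert(self, p):
        if self.leaders:
            dists = [np.linalg.norm(p - l) for l in self.leaders]
            closest = int(np.argmin(dists))
            if dists[closest] <= self.d_thresh:
                return closest  # assign p to its closest leader
        self.leaders.append(p)  # otherwise p becomes a new leader
        return len(self.leaders) - 1
```

Because each point is processed once and never revisited, the result depends on arrival order, which is the instability noted above.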

30 STREAM k-Means (O’Callaghan et al., 2002)
An extension of k-Means for streams. The iterative process of static k-Means cannot be applied to streams, so we use a buffer that fits in memory and apply k-Means locally in the buffer. The stream is processed in chunks X1, X2, …, each fitting in memory. For each chunk Xi, apply k-Means locally on Xi and retain only the k centers. Let X' be the i*k weighted centers obtained from chunks X1 … Xi, where each center is treated as a point weighted with the number of points it compresses. Apply k-Means on X' and output the k centers. A sketch follows.
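A sketch of the chunked scheme, reusing the kmeans() sketch from the K-means slide; the final pass approximates weighted k-Means by repeating each retained center in proportion to its weight:

```python
import numpy as np

def stream_kmeans(chunks, k):
    centers, weights = [], []
    for X in chunks:  # each chunk X_i fits in memory
        cs, labels = kmeans(X, k)  # cluster the chunk locally
        for j, c in enumerate(cs):
            w = int(np.sum(labels == j))  # weight = #points it compresses
            if w:
                centers.append(c)
                weights.append(w)
    # Apply k-Means on X' (the weighted centers) and output k centers.
    X_prime = np.repeat(np.array(centers), weights, axis=0)
    final_centers, _ = kmeans(X_prime, k)
    return final_centers
```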

31 CluStream [Aggarwal et al. 2003]
The stream clustering process is separated into: an online micro-cluster component that summarizes the stream locally as new data arrive over time (micro-clusters are stored on disk at snapshots in time that follow a pyramidal time frame), and an offline macro-cluster component that clusters these summaries into global clusters. Clustering is thus performed upon summaries instead of raw data.

32 CluStream: Micro-cluster Summary Structure

33 CluStream Algorithm A fixed number q of micro-clusters is maintained over time. Initialize: apply q-Means over the initial points (initPoints) and build a summary for each cluster. Online micro-cluster maintenance as a new point p arrives from the stream: find the closest micro-cluster clu for p; if p is within the max-boundary of clu, p is absorbed by clu; otherwise, a new cluster is created with p. Since the number of micro-clusters must not exceed q, delete the most obsolete micro-cluster or merge the two closest ones. Periodically store snapshots of the micro-clusters to disk, at different levels of granularity depending upon their recency. Offline macro-clustering: given a user-defined time horizon h and the number k of macro-clusters to be detected, locate the valid micro-clusters during h and apply k-Means upon them to obtain the k macro-clusters.

34 CluStream: Initialization Step
Done using an offline process in the beginning: wait for the first InitNumber points to arrive, then apply a standard k-Means algorithm to create q clusters. For each discovered cluster, assign it a unique ID and create its micro-cluster summary. Comments on the choice of q: it should be much larger than the natural number of clusters, yet much smaller than the total number of points that arrive.

35 CluStream: on-line step
A fixed number q of micro-clusters is maintained over time. Whenever a new point p arrives from the stream: compute the distance between p and each of the q maintained micro-cluster centroids, and let clu be the closest micro-cluster to p. Find the max boundary of clu, defined as a factor t of the radius of clu. If p falls within the maximum boundary of clu, p is absorbed by clu and the statistics of clu are updated (incrementality property). Else, create a new micro-cluster with p, assign it a new cluster ID, and initialize its statistics. To keep the total number of micro-clusters fixed (i.e., q), either delete the most obsolete micro-cluster, if it is safe to do so (based on how far in the past the micro-cluster last received new points), or merge the two closest ones (additivity property). When two micro-clusters are merged, a list of ids is created; this way, we can identify the component micro-clusters that comprise a merged micro-cluster. A sketch follows.
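A sketch of the online step. The micro-cluster summary is assumed to be the usual CluStream CF vector: count n, linear sum LS, and squared sum SS of the points (time statistics omitted); these sums give exactly the incrementality and additivity properties used above.

```python
import numpy as np

class MicroCluster:
    def __init__(self, p):
        self.n, self.LS, self.SS = 1, p.copy(), p * p

    def centroid(self):
        return self.LS / self.n

    def radius(self):
        # RMS deviation of the points from the centroid.
        var = self.SS / self.n - self.centroid() ** 2
        return float(np.sqrt(max(np.sum(var), 0.0)))

    def absorb(self, p):  # incrementality property
        self.n += 1; self.LS += p; self.SS += p * p

    def merge(self, other):  # additivity property
        self.n += other.n; self.LS += other.LS; self.SS += other.SS

def online_step(p, mcs, q, t=2.0):
    # Closest micro-cluster to p; its max boundary is a factor t of its radius.
    # Simplification: a singleton cluster has radius 0; a real implementation
    # would use the distance to the nearest other micro-cluster instead.
    clu = min(mcs, key=lambda m: np.linalg.norm(p - m.centroid()))
    if np.linalg.norm(p - clu.centroid()) <= t * clu.radius():
        clu.absorb(p)  # p falls within the maximum boundary
    else:
        mcs.append(MicroCluster(p))  # p starts a new micro-cluster
        if len(mcs) > q:
            # Keep the count at q: here we always merge the two closest
            # (deleting the most obsolete one is the other option).
            a, b = min(((i, j) for i in range(len(mcs))
                               for j in range(i + 1, len(mcs))),
                       key=lambda ij: np.linalg.norm(
                           mcs[ij[0]].centroid() - mcs[ij[1]].centroid()))
            mcs[a].merge(mcs.pop(b))
```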

36 CluStream: periodic microcluster storage
Micro-cluster snapshots are stored at particular times. If the current time is tc and the user wishes to find clusters based on a history of length h, then we use the subtractive property of the micro-clusters at snapshots tc and tc−h in order to find the macro-clusters in a history or time horizon of length h. How many snapshots should be stored? It is too expensive to store snapshots at every time stamp, so they are stored in a pyramidal time frame: an effective trade-off between the storage requirements and the ability to recall summary statistics from different time horizons.

37 CluStream: offline step
The offline step is applied on demand upon the q maintained micro-clusters instead of the raw data. User input: a time horizon h and the number k of macro-clusters to be detected. Find the active micro-clusters during h by exploiting the subtractivity property: suppose the current time is tc and let S(tc) be the set of micro-clusters at tc. Find the stored snapshot which occurs just before time tc−h; we can always find such a snapshot, taken at some time tc−h'. Let S(tc−h') be its set of micro-clusters. For each micro-cluster in the current set S(tc), find its list of ids; for each list of ids, find the corresponding micro-clusters in S(tc−h') and subtract their CF vectors. This ensures that the micro-clusters created before the user-specified horizon do not dominate the result of the clustering process (see the sketch below). Then apply k-Means over the active micro-clusters in h to derive the k macro-clusters. Initialization: seeds are not picked randomly, but sampled with probability proportional to the number of points in a given micro-cluster; the distance used is the centroid distance; the new seed for a given partition is the weighted centroid of the micro-clusters in that partition.
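A sketch of the subtractivity trick, using the MicroCluster class from the online-step sketch; for simplicity, micro-clusters are matched by a single id rather than by the full list-of-ids bookkeeping described above:

```python
import copy

def active_microclusters(current, old_snapshot):
    # current, old_snapshot: dicts mapping cluster id -> MicroCluster,
    # taken at times tc and tc - h' respectively.
    active = []
    for cid, mc in current.items():
        mc = copy.deepcopy(mc)
        if cid in old_snapshot:  # subtract the CF vector from tc - h'
            old = old_snapshot[cid]
            mc.n -= old.n; mc.LS -= old.LS; mc.SS -= old.SS
        if mc.n > 0:  # anything left was accumulated within the horizon
            active.append(mc)
    return active
```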

38 CluStream: summary + CluStream clusters large evolving data streams. + It views the stream as a changing process over time, rather than clustering the whole stream at a time. + It can characterize clusters over different time horizons in a changing environment. + It provides flexibility to an analyst in a real-time and changing environment. – A fixed number of micro-clusters is maintained over time. – Sensitive to outliers/noise.

39 Density-Based Data Stream Clustering
We will cover DenStream: Feng Cao, Martin Ester, Weining Qian, Aoying Zhou: “Density-Based Clustering over an Evolving Data Stream with Noise”. SDM ’06 DenStream operates on microclusters using an extension of DBSCAN: Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu: “A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise”. KDD ‘96

40 DBSCAN is a density-based algorithm.
Density = number of points within a specified radius (Eps) A point is a core point if it has more than a specified number of points (MinPts) within Eps These are points that are at the interior of a cluster A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point A noise point is any point that is not a core point or a border point.

41 DBSCAN: Core, Border, and Noise Points

42 Density-Reachable and Density-Connected (w.r.t. Eps, MinPts)
Let p be a core point; then every point in its Eps-neighborhood is said to be directly density-reachable from p. A point p is density-reachable from a core point q if there is a chain of points p1, …, pn with p1 = q and pn = p, such that each pi+1 is directly density-reachable from pi. A point p is density-connected to a point q if there is a point o such that both p and q are density-reachable from o. [Figure: examples of direct density-reachability, density-reachability, and density-connectivity.]

43 DBSCAN: The Algorithm Eps and MinPts
Let ClusterCount = 0. For every point p: (1) if p is not a core point, assign a null label to it [e.g., zero]; (2) if p is a core point, a new cluster is formed [with label ClusterCount := ClusterCount + 1], then find all points density-reachable from p and classify them in the cluster [reassign the zero labels, but not the others]. Repeat this process until all of the points have been visited. Since all the zero labels of border points have been reassigned in step 2, the remaining points with zero label are noise (see the sketch on the earlier DBSCAN slide).

44 DBSCAN Application examples: Population density, Spreading of Diseases, Trajectory tracing

45 DenStream Based on DBSCAN
Feng Cao, Martin Ester, Weining Qian, Aoying Zhou: “Density-Based Clustering over an Evolving Data Stream with Noise”. SDM ’06. Based on DBSCAN. Core-micro-cluster: CMC(w, c, r) with weight w > μ, center c, and radius r < ε. Potential (p) and outlier (o) micro-clusters. Online: merge a point into the closest p- (or o-) micro-cluster if the new radius r' < ε; promote an o-micro-cluster to p if its weight grows to w > βμ; else create a new o-micro-cluster. Offline: modified DBSCAN (on user demand). A sketch of the online step follows.
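A sketch of the online step with decayed weights, which is the key DenStream addition: each point's contribution fades as f(t) = 2^(−λ·t), so stale micro-clusters lose weight over time. The parameter names follow the slide (eps, mu, beta); the class and function names are ours.

```python
import numpy as np

class DecayingMicroCluster:
    def __init__(self, p, t, lam):
        self.lam, self.t_last = lam, t
        self.w, self.LS, self.SS = 1.0, p.copy(), p * p

    def decay(self, t):
        # All statistics fade by 2^(-lambda * elapsed time).
        f = 2.0 ** (-self.lam * (t - self.t_last))
        self.w *= f; self.LS *= f; self.SS *= f
        self.t_last = t

    def radius_if_added(self, p, t):
        # Radius the cluster would have after absorbing p at time t.
        self.decay(t)
        w, LS, SS = self.w + 1, self.LS + p, self.SS + p * p
        var = SS / w - (LS / w) ** 2
        return float(np.sqrt(max(np.sum(var), 0.0)))

def insert_point(p, t, p_clusters, o_clusters, eps, beta, mu, lam):
    for group in (p_clusters, o_clusters):  # try p-clusters first
        if group:
            mc = min(group, key=lambda m: np.linalg.norm(p - m.LS / m.w))
            if mc.radius_if_added(p, t) < eps:  # merge only if r' < eps
                mc.w += 1; mc.LS += p; mc.SS += p * p
                if group is o_clusters and mc.w > beta * mu:
                    o_clusters.remove(mc)  # promote o -> p micro-cluster
                    p_clusters.append(mc)
                return
    o_clusters.append(DecayingMicroCluster(p, t, lam))  # new o-micro-cluster
```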

46 Conclusion Much work is still needed on data stream clustering.

