Presentation on theme: "Di Yang, Elke A. Rundensteiner and Matthew O. Ward Worcester Polytechnic Institute VLDB 2009, Lyon, France 1 A Shared Execution Strategy for Multiple Pattern."— Presentation transcript:
Di Yang, Elke A. Rundensteiner and Matthew O. Ward Worcester Polytechnic Institute VLDB 2009, Lyon, France 1 A Shared Execution Strategy for Multiple Pattern Mining Requests Over Streaming Data This work is supported under NSF grants CCF-0811510, IIS- 0119276, IIS-0414380. Chandi
Motivation: data streams are everywhere 2 transaction info patterns Stock Market Are there any patterns in transactions over past hour? Battlefield position info Stock Analysts Commander Where are the main clusters formed by enemy warcraft? patterns 2
Motivation: pattern mining requests tend to be parameterized 3 Example 1: give me the stocks that dropped significantly in the most recent transactions. Example 2: give me the major clusters formed by enemy warcraft. 10%, 30% or 50% to the original price with in last 10,30, or 60 minutes. size: n war-crafts density: m war-crafts / mi 2
Motivation: best parameters settings are hard to determine I need info for any cluster sized 5 or higher Clusters formed by fighter planes need to be updated every 5 seconds I only care about the clusters that are formed by more than 20 warcraft Clusters formed by boom carriers need to be updated every 10 seconds Multiple analysts may raise multiple queries with different parameter settings. Parameter settings? I probably know. But, can I try different combinations of them? ? A single analyst may raise multiple queries with different parameter settings. 4 Problem: A lot of similar queries, yet with different parameter settings, how to answer them efficiently.
Outline 1. Motivation 2. Problem Definition & State-of-Art 3. Basic: Single Query Processing Strategies. 4. Proposed Solution: Multi-Query Sharing. 5. Experimental Study 6. Conclusion 5 5
Target Pattern Type 6 12 14 5 7 6 4 8 2 9 16 17 1 13 15 Core Object: has no less than neighbors in distance from it. Edge Object: not core object but a neighbor of a core object. Noise: not core object and not a neighbor of any core object. θ range θ cnt A Density-Based Cluster (DB-Cluster) is a maximum group of connected core objects and the edge objects attached to them Why: popular and well know, arbitrary shapes, allow unclassified (noise), deterministic…
Clustering in Sliding Windows Over Stream 54321678 W1 54321678 W2 7 Applications include: monitoring congestion (cluster) in traffic looking for intensive transaction areas (cluster) in stock trades identifying malicious attacks (cluster) in network
Problem Definition Input: a query group QG with multiple density-based clustering queries querying on the same input stream but with arbitrary parameter settings. Goal: to minimize both the average processing time and the peak memory space needed by the system. 8 Template Density-Based Clustering Query Over Sliding Windows Pattern-specific Window-specific
State-Of-Art 9 Previous work on multiple query sharing concentrates on traditional SPJ and aggregation operators [Arasu04][Hammad03][Wang06]. No existing solution for multiple query optimization for complex pattern mining requests, such as clustering, in streaming environments. Existing algorithms for single density-based clustering query over sliding windows include Incremental DBSCAN, Exact-N, Abstract-C and Extra-N [Ester98] [Yang09]. Executing algorithms above independently for each query (naive solution ) is not scalable.
Outline 1. Motivation 2. Problem Definition & State-of-Art 3. Basic: Single Query Processing Strategies. 4. Proposed Solution: Multi-Query Sharing. 5. Experimental Study 6. Conclusion 10
Density-Based Clustering on Data Streams 11 In highly dynamic streaming environments: Re-computation. Incremental cluster maintenance. Extra-N proposed a hybrid neighbor relationship (neighborship) mechanism to represent cluster structure. Maintain “Exact Neighborships” (neighbor lists) for none-core objects. Maintain “Abstract Neighborships” (cluster memberships) for core objects. A general concept of “Predicted View” is applied to efficiently update the cluster structure. —key: a compact and easy-maintainable cluster representation.
Concept of Predicted Views 12 14 5 7 6 13 11 2 9 10 1 3 8 4 15 16 Current View of W 0 window size=16, slide size=4, time=1 Predicted View of W 1 12 14 5 7 6 13 11 9 10 8 15 16 Predicted View of W 2 12 14 5 7 6 13 11 2 9 10 1 3 8 4 15 16 12 1413 11 9 10 15 16 Predicted View of W 3 12345678910111213141516 W0W0 W1W1 W2W2 W3W3 12
Update Predicted Views Current View of W 1 Predicted View of W 2 12 1413 11 9 10 15 16 Predicted View of W 3 12 14 5 7 6 13 11 9 10 8 15 16 1413 15 16 Predicted View of W 4 171819205678910111213141516 W1W1 W2W2 W3W3 W4W4 17 20 18 19 17 20 18 19 17 20 18 19 17 20 18 19 New Data Points 12 14 5 7 6 13 11 2 9 10 1 3 8 4 15 16 window size=16, slide size=4, time=1 13 Expired View of W 0
Outline 1. Motivation 2. Problem Definition & State-of-Art 3. Basic: Single Query Processing Strategies. 4. Proposed Solution: Multiple Query Sharing. 5. Experimental Study 6. Conclusion 14
Arbitrary Pattern-Specific Parameter Case -- arbitrary, fixed 16 θ range θ cnt Any relationship between the cluster sets identified by them?
“Growth Property” among DB-cluster Sets 17 Independent Cluster Structure StorageHierarchical Cluster Structure Storage Grow If any cluster Ci in Clu_Set1 is “contained” by one cluster in Clu_Set2, Clu_Set2 is a “Growth” of Clu_Set1. c6c5c4 c6c5c4
Benefits of Hierarchical Cluster Structure 18 Benefits for Memory Resources: Memory space needed by storing cluster sets identified by multiple queries in QG is independent from |QG|. Benefits for Computational Resources: Multiple cluster sets stored in the hierarchical cluster structure (which are usually similar) can be maintained incrementally, rather than independently.
Arbitrary Pattern-Specific Parameter Case -- arbitrary, fixed 19 θ range θ cnt θ θ range Growth property transitively holds among the cluster sets identified by multiple queries with arbitrary and same.
Arbitrary Pattern-Specific Parameter Case -- arbitrary, fixed Idea: Growth property transitively holds. Solution: A single integrated representation of predicted views: θ range θ cnt θ IntView_
Arbitrary Pattern-Specific Parameter Case -- Fixed, Arbitrary Growth property transitively holds among the cluster sets identified by multiple queries with arbitrary and same. θ cnt θ range θ θ cnt 22
Arbitrary Pattern-Specific Parameter Case -- arbitrary, fixed θ cnt θ range Idea: Growth property transitively holds. Solution: A single integrated representation of predicted views: θ range IntView_
25 Growth property holds,if ≥ and ≤. An “Predicted View Tree” for all queries. Arbitrary Pattern-Specific Parameter Case -- arbitrary, arbitrary θ cnt θ range θ Qi. θ range Qj. θ cnt Qi. θ cnt Qj. θ IntView_ Cluster set by Q3 Cluster set by Q5 Cluster set by Q2 Cluster set by Q1 Cluster set by Q4 grow
Arbitrary Window-Specific Parameter Case -- arbitrary win, fixed slide In this case, maintaining a single query will be sufficient to answer all. The predicted views for Qi with largest win cover all needed. 27 12345678910111213141516 W0W0 W1W1 W3W3 Q1.win=16, slide=4 Q2.win=12, slide=4 W0W0 W2W2 Shared Time Answer for Q1 at T=16Answer for Q2 at T=16 W0W0 Q3.win=8, slide=4 W0W0 Q4.win=4, slide=4 Answer for Q3 at T=16 Answer for Q4 at T=16
General Case -- arbitrary all four parameters Our proposed techniques for arbitrary pattern parameter cases (intra-window-optimization) for arbitrary window parameter cases (inter-window-optimization) are orthogonal to each other. Final integrated structure for QG. 31 θ IntView_ IntView_W IntView
Experimental Study 32 Alternative Methods: 1. Incremental DBSCAN [Ester98] 2. Incremental DBSCAN with rqs (range query search sharing) 3. Extra-N [Yang09] 4. Extra-N with rqs (range query search sharing) 5. Chandi (our solution) Real Streaming Data: 1. GMTI data recording information about moving vehicles [Mitre08]. 2. STT data recording stock transactions from NYSE [INETATS08]. Measurements: 1. Average processing time for each tuple. 2. Memory footprint. Chandi
Evaluation for Performance 33 Arbitrary Pattern Parameter Case Arbitrary Window Parameter Case
Evaluation for Performance 34 Arbitrary All Four Parameter Cases
35 Conclusion: First effort of applying multi-query optimization techniques on complex mining requests. General principles proposed, such as incremental pattern representation and meta query strategy, applicable to other pattern types. First full-sharing framework, called Chandi, for multiple density-based clustering queries over streaming windows, Experimental study shows Chandi has excellent efficiency and scalability. Future Work: Other pattern types: other cluster types, outliers, association rules… Pattern management. Visualization and interaction.