ARGUS: A Prototype Stream Anomaly Monitoring System Thesis Proposal Chun Jin Thesis Committee Jaime Carbonell (Chair) Christopher Olston Jamie Callan Phil.

Presentation transcript:

ARGUS: A Prototype Stream Anomaly Monitoring System Thesis Proposal Chun Jin Thesis Committee Jaime Carbonell (Chair) Christopher Olston Jamie Callan Phil Hayes, DYNAMiX Technologies

Chun Jin, Carnegie Mellon
Thesis Statement
- Stream Anomaly Monitoring Systems (SAMS) are an important subclass of stream applications.
- The difficulty arises from the very large data volumes and the large number of queries such a system must handle.
- Propose an approach for SAMS that implements incremental evaluation schemes with an adapted Rete algorithm on top of a traditional DBMS platform and exploits SAMS characteristics to optimize query evaluation.
- Demonstrate how the approach and its improvements can lead to a simple, fast implementation of an effective and efficient SAMS.

Outline
- Motivation
- My ARGUS Approach
- Current Work Status
  - Current System
  - Preliminary Results
- Proposed Work and Timeline

Stream Processing
- Stream processing applications
  - Network traffic analysis and router configuration
  - Internet services
  - Sensor data analysis
  - Anomaly detection
- Stream processing projects
  - STREAM, TelegraphCQ, Aurora
  - NiagaraCQ, OpenCQ, WebCQ
  - Gigascope, Tribeca
  - Tapestry, Alert, Tukwila, etc.

Stream Anomaly Monitoring Systems (SAMS)
- A SAMS monitors structured data streams for anomalies or potential hazards.
- Continuous queries may number in the thousands or tens of thousands.
- Daily stream volumes may exceed millions of records.
- A SAMS query is rarely satisfied (very-high-selectivity).

SAMS Dataflow
[Diagram; labels: Analyst, Stream Anomaly Monitoring System, Storage, Queries, Alerts, Data Streams (FedWire Money Transfers, Patient Records)]

Query Example 4
- Suppose that, for every large transaction of type code 1000, the analyst wants to check whether the money stayed in the bank or left within ten days.
- An additional sign of possible fraud is that the transactions involve at least one intermediate bank.
- The query raises an alarm whenever the receiver of a large transaction (over $1,000,000) transfers at least half of the money onward within ten days of that transaction, using an intermediate bank.

SQL Query for Example 4
FROM transaction r1, transaction r2, transaction r3
WHERE r2.type_code = 1000
  AND r3.type_code = 1000
  AND r1.type_code = 1000
  AND r1.amount > 1000000
  AND r1.rbank_aba = r2.sbank_aba
  AND r1.benef_account = r2.orig_account
  AND r2.amount > 0.5 * r1.amount
  AND r1.tran_date <= r2.tran_date
  AND r2.tran_date <= r1.tran_date + 10
  AND r2.rbank_aba = r3.sbank_aba
  AND r2.benef_account = r3.orig_account
  AND r2.amount = r3.amount
  AND r2.tran_date <= r3.tran_date
  AND r3.tran_date <= r2.tran_date + 10;
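To make the Example 4 query concrete, the following is a minimal sketch that runs it against an in-memory SQLite table. Several things here are assumptions rather than part of the slides: the SELECT list and the `tran_id` column are hypothetical, `tran_date` is stored as an integer day number so that `+ 10` means ten days, the threshold 1000000 is taken from the "over $1,000,000" wording of Example 4, and the table name is quoted only because `transaction` is an SQLite keyword. The four rows are invented toy data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: only the columns the query references.
conn.execute("""
    CREATE TABLE "transaction" (
        tran_id INTEGER PRIMARY KEY,
        type_code INTEGER, amount REAL,
        sbank_aba TEXT, rbank_aba TEXT,
        orig_account TEXT, benef_account TEXT,
        tran_date INTEGER  -- day number (assumption)
    )
""")
rows = [
    (1, 1000, 2000000, "A", "B", "acct1", "acct2", 1),  # large transfer A -> B
    (2, 1000, 1500000, "B", "C", "acct2", "acct3", 4),  # receiver forwards 75%
    (3, 1000, 1500000, "C", "D", "acct3", "acct4", 9),  # intermediate bank forwards it
    (4, 1000, 500, "A", "B", "acct5", "acct6", 2),      # unrelated small transfer
]
conn.executemany('INSERT INTO "transaction" VALUES (?, ?, ?, ?, ?, ?, ?, ?)', rows)

query = """
    SELECT r1.tran_id, r2.tran_id, r3.tran_id
    FROM "transaction" r1, "transaction" r2, "transaction" r3
    WHERE r2.type_code = 1000 AND r3.type_code = 1000 AND r1.type_code = 1000
      AND r1.amount > 1000000
      AND r1.rbank_aba = r2.sbank_aba
      AND r1.benef_account = r2.orig_account
      AND r2.amount > 0.5 * r1.amount
      AND r1.tran_date <= r2.tran_date
      AND r2.tran_date <= r1.tran_date + 10
      AND r2.rbank_aba = r3.sbank_aba
      AND r2.benef_account = r3.orig_account
      AND r2.amount = r3.amount
      AND r2.tran_date <= r3.tran_date
      AND r3.tran_date <= r2.tran_date + 10
"""
alerts = conn.execute(query).fetchall()
print(alerts)  # [(1, 2, 3)]
```

Only the chain of transactions 1, 2, and 3 satisfies every predicate, which illustrates how rarely such a query fires.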

ARGUS as a Prototype SAMS
- Implement the Adapted Rete Algorithm on top of a traditional DBMS platform.
- Rete (Forgy 1982): incremental evaluation based on materialized intermediate results.
- The SAMS assumption of very-high-selectivity queries over very-large-volume data justifies the use of Rete and necessitates some unique improvements:
  - Transitivity Inference (Ono/Lohman VLDB90; Pirahesh/Leung/Hasan ICDE97)
  - Predicate Set Evaluation and Materialization
  - Partial Rete (materialization skipping)
  - Complex Common Computation Identification for Sharing
  - Intermingled Sharing and Optimization processing

ARGUS System Architecture
[Diagram; labels: Analyst, Query, Rete Network Generator, Rete Networks, Query Table, Data Streams, Data Tables, Intermediate Tables, Scheduler, Do_queries, Stream Anomaly Monitoring, Identified Threats]

ReteGenerator Architecture
[Diagram; components: ReteGen Manager, Query Rewriter, Topology Checker, Transitivity Inference, History-based Rete Optimizer, History-based Cost Estimating, Sharing; tables: System Catalog, Topology Table, Counter Table; flow labels: SQL Queries, Check Topology, Register Rete Networks, Update Tables]

Selected ARGUS Topics
- Adapted Rete Algorithm
  - ReteGenerator translates a query into a Rete network that is wrapped as a stored procedure.
  - The procedure implements the Adapted Rete Algorithm, which accounts for the incremental evaluation.
- Transitivity Inference
- Rete Optimization
- Computation Sharing

Adapted Rete Algorithm (Selection)
- n and m are the old data sets; Δn and Δm are the new, much smaller incremental data sets.
- Selection σ: σ(n + Δn) = σ(n) + σ(Δn)
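The selection rule can be sketched in a few lines. This is only an illustration under simple assumptions: data sets are Python lists, `+` is list concatenation, and the names `select` and `incremental_select` are hypothetical, not ARGUS functions.

```python
def select(pred, rows):
    """Plain selection: keep the rows that satisfy pred."""
    return [r for r in rows if pred(r)]

def incremental_select(pred, materialized, delta):
    """Evaluate the selection on the new rows only and append the hits
    to the previously materialized result σ(n)."""
    new_hits = select(pred, delta)
    materialized.extend(new_hits)
    return new_hits

is_large = lambda amount: amount > 1000000

old = [500, 2000000, 750]       # n
delta = [3000000, 10]           # Δn
materialized = select(is_large, old)                    # σ(n), computed once
new_hits = incremental_select(is_large, materialized, delta)

print(new_hits)       # [3000000]
print(materialized)   # [2000000, 3000000]
```

The old data `n` is never rescanned; only `Δn` is filtered, and the stored result ends up equal to σ(n + Δn).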

Adapted Rete Algorithm (Join)
- Join ⋈: (n + Δn) ⋈ (m + Δm) = n ⋈ m + Δn ⋈ m + n ⋈ Δm + Δn ⋈ Δm
- n ⋈ m is the old result; the remaining three terms are the new incremental results.
- When Δn and Δm are very small compared to n and m, the time complexity of the incremental join is O(n+m).
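A minimal sketch of the join decomposition, assuming list-based relations and a nested-loop join for clarity (a real DBMS would use indexes or hash joins). The function names and toy tuples are hypothetical.

```python
def join(left, right, pred):
    """Nested-loop join, used here only for clarity."""
    return [(l, r) for l in left for r in right if pred(l, r)]

def incremental_join(n, m, dn, dm, pred):
    """New results only: Δn ⋈ m + n ⋈ Δm + Δn ⋈ Δm.
    The old result n ⋈ m is assumed to be materialized already."""
    return join(dn, m, pred) + join(n, dm, pred) + join(dn, dm, pred)

same_key = lambda l, r: l[0] == r[0]

n, m = [(1, "a"), (2, "b")], [(1, "x"), (3, "y")]   # old data
dn, dm = [(3, "c")], [(2, "z")]                     # new increments

old_result = join(n, m, same_key)                   # materialized earlier
new_result = incremental_join(n, m, dn, dm, same_key)
full_result = join(n + dn, m + dm, same_key)        # recomputed from scratch

print(sorted(old_result + new_result) == sorted(full_result))  # True
```

Each of the three incremental terms has at least one small operand, which is where the O(n+m) bound on the slide comes from when Δn and Δm are tiny.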

Incremental Evaluation in Rete (Example 4)
[Rete network diagram over DataTable r1, r2, r3; node labels:]
- Selection nodes: Type_code=1000, Amount>1000000; Type_code=1000
- Join node r1, r2: r1.rbank_aba = r2.sbank_aba, r1.benef_account = r2.orig_account, r2.amount > r1.amount*0.5, r1.tran_date <= r2.tran_date, r2.tran_date <= r1.tran_date+10
- Join node r2, r3: r2.rbank_aba = r3.sbank_aba, r2.benef_account = r3.orig_account, r2.amount = r3.amount, r2.tran_date <= r3.tran_date, r3.tran_date <= r2.tran_date+10

Complex Queries
- A continuous query may contain multiple SQL statements, and a single SQL statement may contain unions of multiple SQL terms.
- Each SQL term is mapped to a sub-Rete network.
- These sub-Rete networks are then connected to form the statement-level sub-networks.
- The statement-level sub-networks are in turn connected, based on the view references, to form the final query-level Rete network.

Transitivity Inference
- Explores the transitivity properties of comparison operators to derive hidden, highly selective selection predicates.
- Highly selective selection predicates can significantly improve performance because they may produce very small intermediate results.
- Subsequent joins can then be performed very fast on the materialized intermediate results.

Transitivity Inference Example
- Given: r1.amount > 1000000, r2.amount > r1.amount * 0.5, and r3.amount = r2.amount
- r1.amount > 1000000 is highly selective on r1.
- We can infer the highly selective predicates r2.amount > 500000 and r3.amount > 500000.
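The inference in this example can be sketched as a small bound-propagation routine. It is a simplified illustration, not the ARGUS implementation: it handles only strict lower bounds, scaled comparisons `a > k * b` with k > 0, and equalities, and the constant 1000000 is an assumption taken from the "over $1,000,000" wording of Example 4. The function name and argument layout are hypothetical.

```python
def infer_lower_bounds(known, scaled, equal):
    """known:  {column: c}      meaning column > c
    scaled: [(a, k, b), ...] meaning a > k * b, with k > 0
    equal:  [(a, b), ...]    meaning a = b
    Returns the strict lower bounds implied for every reachable column."""
    bounds = dict(known)
    # A bounded number of passes is enough for acyclic constraint sets.
    for _ in range(len(scaled) + len(equal) + 1):
        for a, k, b in scaled:
            if b in bounds:   # a > k*b and b > c  =>  a > k*c
                bounds[a] = max(bounds.get(a, float("-inf")), k * bounds[b])
        for a, b in equal:
            for x, y in ((a, b), (b, a)):
                if y in bounds:   # x = y and y > c  =>  x > c
                    bounds[x] = max(bounds.get(x, float("-inf")), bounds[y])
    return bounds

bounds = infer_lower_bounds(
    known={"r1.amount": 1000000},
    scaled=[("r2.amount", 0.5, "r1.amount")],
    equal=[("r3.amount", "r2.amount")],
)
print(bounds["r2.amount"], bounds["r3.amount"])  # 500000.0 500000.0
```

The derived bounds on r2 and r3 can then be applied as selections before the joins, so the materialized intermediate tables stay small.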

Rete Optimization
[Diagram of the History-based Rete Optimizer; labels: SQL Query, Join Graph, Active List, StructureBuilder, Join Enumerator, History-based Cost Estimator, DB, Rete network, Update Tables]

Join Graph Example
[Join graph diagram; edge labels: P(1,2), P(2,3), P(1,3), P(3,4); node label: 1,2]

History-based Cost Estimator
- Run sub-plans on historical data to estimate the costs of sub-plans on future data.
- Assume the same data distribution in the past and the future.
- Apply heuristic functions to avoid estimating extremely high-cost sub-plans.
- Justification for a history-based cost estimator:
  - Queries are compiled and optimized once and executed many times.
  - It is tolerable to spend more time on the one-time optimization.
  - Accurate cost estimates pay off as queries run more and more times.
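The idea of costing candidate plans by running them on historical data can be sketched as follows. This is a deliberately simplified stand-in: it orders two selection predicates rather than enumerating join sub-plans, and the cost measure (rows examined per step) is an assumption, since the slides do not specify one. All names and the toy history are hypothetical.

```python
from itertools import permutations

def run_cost(plan, rows):
    """Execute the plan on historical rows; cost = rows examined per step."""
    cost, current = 0, rows
    for pred in plan:
        cost += len(current)
        current = [r for r in current if pred(r)]
    return cost

def pick_plan(preds, history):
    """Choose the predicate order that was cheapest on the historical data."""
    return min(permutations(preds), key=lambda plan: run_cost(plan, history))

def is_large(row):
    return row["amount"] > 1000000

def is_type_1000(row):
    return row["type_code"] == 1000

# Toy history: half the rows have type code 1000, only two are large.
history = [
    {"type_code": 1000 if i % 2 == 0 else 2000,
     "amount": 2000000 if i < 2 else 100}
    for i in range(100)
]

best = pick_plan([is_large, is_type_1000], history)
print([p.__name__ for p in best])   # ['is_large', 'is_type_1000']
print(run_cost(best, history))      # 102
```

Running every candidate is expensive, which is why the slide pairs this estimator with heuristics that prune very costly sub-plans and with the argument that the optimization cost is paid only once.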

Computation Sharing
- Predicate Indexing
- Extended predicate set operations
- Sharing Algorithm

Predicate Indexing
- Concepts:
  - Equivalent predicates: p1 ≡ p2 iff ∀D, p1(D) = p2(D)
  - Equivalent predicate class
  - Canonical predicate form
- Predicates are converted into canonical form and stored as records in tables.
- Searching for a predicate becomes data retrieval from tables.
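A minimal sketch of predicate indexing, assuming a very simple canonical form: the column goes on the left, the constant on the right, and column names are lowercased. The slides do not define the canonical form, so this rule, the `(left, op, right)` tuple layout, and the in-memory dictionary standing in for the database table are all assumptions.

```python
FLIP = {"<": ">", ">": "<", "<=": ">=", ">=": "<=", "=": "="}

def canonical(left, op, right):
    """Rewrite 'constant op column' as 'column op' constant' and
    lowercase column names, so equivalent spellings share one key."""
    if isinstance(left, (int, float)) and isinstance(right, str):
        left, op, right = right, FLIP[op], left
    norm = lambda x: x.lower() if isinstance(x, str) else x
    return (norm(left), op, norm(right))

index = {}  # canonical predicate -> predicate id (stands in for a table)

def register(pred):
    """Look the predicate up by canonical form; assign a new id if unseen."""
    return index.setdefault(canonical(*pred), len(index))

id_a = register((1, "<", "T1.A"))   # written as 1 < T1.A
id_b = register(("t1.a", ">", 1))   # written as t1.a > 1
id_c = register(("t1.a", ">", 2))

print(id_a, id_b, id_c)  # 0 0 1
```

Because both spellings map to the same key, a lookup by canonical form finds the existing predicate, which is the precondition for sharing its computation.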

Relationship between Predicate Sets and Their Result Tuple Sets
- Predicate set: a set of conjunctive predicates.
- Its result tuple set: the set of database tuples that satisfy all predicates in the predicate set.
- For a fixed database status D, S_D maps a predicate set P to its result tuple set S_D(P): S_D: P → S_D(P)
- Predicate sets and their result tuple sets are complementary:
  - Predicates are filters of data items.
  - The more predicates, the fewer result tuples.

Extending Predicate Set Operations
- Defined on predicate sets
- Definitions are justified by the relationships among the corresponding result tuple sets
- Important for common computation identification

Semantic Subset ⊆≡
- Given two predicate sets P1 and P2, we say that P1 is a semantic subset of P2, written P1 ⊆≡ P2, if for any database status D we have S_D(P1) ⊇ S_D(P2).

Semantic Subset Example
- p1: t1.a > 1, p2: t1.a > 2
- P1 = {p1}, P2 = {p2}
- S(P1) ⊇ S(P2), so P1 ⊆≡ P2.
- Why? P2 ≡≡ {p1, p2}
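The semantic subset test in this example can be sketched for a restricted predicate class. The sketch assumes every predicate is a strict lower bound of the form `column > constant`, represented as a `(column, constant)` pair; general predicates would need a real implication checker. The function names are hypothetical.

```python
def implied_by(pset, pred):
    """True if the conjunction of pset implies pred, where every
    predicate is a (column, c) pair meaning column > c."""
    col, c = pred
    return any(col2 == col and c2 >= c for col2, c2 in pset)

def semantic_subset(p1, p2):
    """P1 ⊆≡ P2: every predicate in P1 is implied by P2, so every
    tuple in S(P2) is also in S(P1)."""
    return all(implied_by(p2, p) for p in p1)

p1 = ("t1.a", 1)   # t1.a > 1
p2 = ("t1.a", 2)   # t1.a > 2
P1, P2 = {p1}, {p2}

print(semantic_subset(P1, P2))   # True
print(semantic_subset(P2, P1))   # False
# P2 is semantically equivalent to {p1, p2}: the subset test holds both ways.
print(semantic_subset(P2, {p1, p2}) and semantic_subset({p1, p2}, P2))  # True
```

The last check mirrors the "Why?" on the slide: adding p1 to P2 changes nothing semantically, so P1 behaves like a syntactic subset of an equivalent form of P2.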

Sharing Types
[Figure: four two-table (T1, T2) join diagrams annotated with predicate sets such as P_OT1, P_OT2, P_OJ, P_FJ, P_NJ, P_NT1, P_NT2; panel labels: Non-change, Add-only, Reconstruction, Selection Add-only]

Sharing Algorithm Overview
- Non-change sharing.
- Add-only sharing.
- Optimizing the remaining query.
- Reconstruction and selection sharing.
- Constructing the remaining Rete network based on the optimized plan with possible sharing.

Current Work Status
- A preliminary system
  - Database
  - A preliminary ReteGenerator with the Adapted Rete and Transitivity Inference
  - Will be expanded to incorporate optimization, computation sharing, incremental aggregation, etc.
- A preliminary evaluation
  - A full evaluation of the complete system will be conducted in the future.

Preliminary Evaluation: Queries and Data
- 7 queries on synthesized FedWire money transfer database records.
- Two data conditions:
  - Data1: Old: first records; New: remaining records; ALERT
  - Data2: Old: first records; New: next records; NOT alert

Preliminary Results: Rete with Transitivity Inference
[Chart: execution time (s) for queries Q1 to Q7; series: Rete Data1, SQL Data1, Rete Data2, SQL Data2]

Transitivity Inference
[Two charts (Q2 and Q): execution time (s) on Data1 and Data2; series: Rete TI, Rete Non-TI, SQL Non-TI, SQL TI]

Partial Rete Generation
- Q4, assuming Transitivity Inference is not applicable
[Chart: execution time (s) on Data1 and Data2; series: Partial Rete, Rete, SQL]

Proposed Work
- System Design and Implementation
- System Evaluation

System Design and Implementation
- Rete Optimization (in progress) (05–08/2004)
- Computation Sharing (planned) (07–11/2004)
- Incremental Aggregation (planned) (12/2004–02/2005)
- Constraint Exploiting (optional) (04–05/2005)
- Transitivity Inference Enhancements (optional) (06–08/2005)
- Automatic Index Selection (optional) (09–12/2005)

System Evaluation
- Data Collection (12/2004–01/2005)
- Query Generation (12/2004–01/2005)
- Simulation and Evaluation (02–05/2005)
  - Single SQL vs. Single Rete, Multiple SQL vs. Multiple Shared Optimized Rete
  - Single Non-optimized Rete vs. Single Optimized Rete
  - Multiple Non-shared Optimized Rete vs. Multiple Shared Optimized Rete
  - Non-incremental Aggregation vs. Incremental Aggregation

Evaluation: Data Collection
- FedWire Money Transfer Transactions
  - Synthesized, 0.5M records. Plan to generate 0.5M more.
  - 23 attributes/record
- Massachusetts Medical Data
  - Real, 1.6M records (sanitized)
  - 70 attributes/record
  - In-patient admission and discharge records.
  - Expand to 10M.

Evaluation: Queries
- Now: 7 queries on FedWire, 3 queries on Medical.
- Plan to extend to queries for each domain.
- Further extend the query sets:
  - Similar predicates matching different constants
  - Join predicate sets with non-empty intersections
  - Same where_clauses but different groupby_clauses
  - Same where_clauses and groupby_clauses but different aggregation operators

Timeline
- System Design and Implementation (Required): 03/2004 – 02/2005
- System Implementation (Optional): 04/2005 – 12/2005
- Evaluation on Required Parts: 12/2004 – 05/2005
- Thesis Writing and Defense: 06/2005 – 03/2006
  - Thesis Writing: 06 – 12/2005
  - Thesis Finalizing: 01 – 03/2006
  - Defense: 02 or 03/2006

ARGUS Summary
- Implement the incremental evaluation schemes with the Adapted Rete Algorithm on top of a traditional DBMS platform.
- To deal with very-large-volume data, exploit the very-high-selectivity query property for optimization:
  - Transitivity Inference
  - Predicate Set Evaluation and Materialization
  - Partial Rete (materialization skipping)
  - Complex Common Computation Identification for Sharing
  - Intermingled Sharing and Optimization processing

Thank you! Questions and Comments?