ARGUS: Rete + DBMS = Efficient Persistent Profile Matching on Large-Volume Data Streams Chun Jin Language Technologies Institute School of Computer Science.


ARGUS: Rete + DBMS = Efficient Persistent Profile Matching on Large-Volume Data Streams Chun Jin Language Technologies Institute School of Computer Science Carnegie Mellon University

Stream Processing Model
Stream processing is becoming both demanding and prevalent.
[Diagram labels: Data Streams, Storage, Output]

Stream Databases
Stream database applications: network traffic analysis and router configuration; dynamic Internet services; sensor data analysis; anomaly detection.
Stream database projects: STREAM, TelegraphCQ, Aurora; NiagaraCQ, OpenCQ, WebCQ; Gigascope, Tribeca; Tapestry, Alert, Tukwila, etc.; ARGUS.

Stream Anomaly Monitoring Systems (SAMS)
A SAMS monitors structured data streams for anomalies or potential hazards.
Query matches may be high-urgency alerts, so prompt detection is desirable.
A SAMS query is rarely satisfied (the very-high-selectivity property).

SAMS Dataflow
[Diagram labels: Analyst, Queries, Alerts, Stream Anomaly Monitoring System, Storage, Data Streams, FedWire Money Transfers, Patient Records]

Challenges to SAMS
Persistent queries may number in the thousands or tens of thousands.
Daily stream volumes may exceed millions of records.
Prompt detection is desirable.
Queries have the very-high-selectivity property.

Proposed ARGUS Approach
Basic framework: incremental evaluation schemes (adapted Rete algorithm) on top of a traditional DBMS platform.
Rete (Forgy 1982): incremental evaluation based on materialized intermediate results.
Exploiting the very-high-selectivity query property: Transitivity Inference; Conditional Materialization.
Optimizing join order.
Computation sharing.
Related to other applications: stream databases; modern DBMS query optimization.

Query Example 4
Suppose that for every big transaction of type code 1000, the analyst wants to check whether the money stayed in the bank or left within ten days. An additional sign of possible fraud is that the transactions involve at least one intermediate bank. The query raises an alarm whenever the receiver of a large transaction (over $1,000,000) transfers at least half of the money onward within ten days of that transaction, using an intermediate bank.

SQL Query for Example 4
FROM transaction r1, transaction r2, transaction r3
WHERE r2.type_code = 1000
  AND r3.type_code = 1000
  AND r1.type_code = 1000
  AND r1.amount >
  AND r1.rbank_aba = r2.sbank_aba
  AND r1.benef_account = r2.orig_account
  AND r2.amount > 0.5 * r1.amount
  AND r1.tran_date <= r2.tran_date
  AND r2.tran_date <= r1.tran_date + 10
  AND r2.rbank_aba = r3.sbank_aba
  AND r2.benef_account = r3.orig_account
  AND r2.amount = r3.amount
  AND r2.tran_date <= r3.tran_date
  AND r3.tran_date <= r2.tran_date + 10;
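As a rough illustration of what the Example 4 query matches, the following sketch runs it against an in-memory SQLite database. Several things here are assumptions rather than part of the slides: the SELECT list and the `tran_id` column, the table name `transactions` (chosen to avoid the SQL keyword), integer day numbers for `tran_date`, the sample rows, and the threshold value, which is taken from the "over $1,000,000" wording of the example description.

```python
import sqlite3

# Assumption: threshold taken from the Example 4 description ("over $1,000,000").
BIG_AMOUNT = 1_000_000

QUERY = """
SELECT r1.tran_id, r2.tran_id, r3.tran_id   -- assumed SELECT list
FROM transactions r1, transactions r2, transactions r3
WHERE r1.type_code = 1000 AND r2.type_code = 1000 AND r3.type_code = 1000
  AND r1.amount > ?
  AND r1.rbank_aba = r2.sbank_aba
  AND r1.benef_account = r2.orig_account
  AND r2.amount > 0.5 * r1.amount
  AND r1.tran_date <= r2.tran_date
  AND r2.tran_date <= r1.tran_date + 10
  AND r2.rbank_aba = r3.sbank_aba
  AND r2.benef_account = r3.orig_account
  AND r2.amount = r3.amount
  AND r2.tran_date <= r3.tran_date
  AND r3.tran_date <= r2.tran_date + 10
"""

# Hypothetical sample data: (tran_id, type_code, amount, sbank_aba,
# rbank_aba, orig_account, benef_account, tran_date as a day number).
SAMPLE_ROWS = [
    (1, 1000, 2_000_000, "A", "B", "acct1", "acct2", 1),  # large transfer into bank B
    (2, 1000, 1_500_000, "B", "C", "acct2", "acct3", 4),  # receiver forwards more than half
    (3, 1000, 1_500_000, "C", "D", "acct3", "acct4", 6),  # intermediate bank passes it on
    (4, 1000, 500, "A", "B", "acct9", "acct8", 2),        # small unrelated transfer
]


def make_demo_db():
    """Create an in-memory database holding the sample transactions."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE transactions (
               tran_id INTEGER PRIMARY KEY, type_code INTEGER, amount INTEGER,
               sbank_aba TEXT, rbank_aba TEXT,
               orig_account TEXT, benef_account TEXT, tran_date INTEGER)"""
    )
    conn.executemany("INSERT INTO transactions VALUES (?,?,?,?,?,?,?,?)", SAMPLE_ROWS)
    return conn


def find_alerts(conn):
    """Return (r1, r2, r3) transaction-id triples that satisfy the query."""
    return conn.execute(QUERY, (BIG_AMOUNT,)).fetchall()


if __name__ == "__main__":
    print(find_alerts(make_demo_db()))  # [(1, 2, 3)]
```

Only the chain of transactions 1, 2, 3 satisfies every predicate; the small transfer never passes the amount filter on r1 and joins with nothing.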

ARGUS System Architecture
[Diagram labels: Analyst, Query, Rete Network Generator, Rete Networks, Query Table, Data Streams, Data Tables, Intermediate Tables, Do_queries, Scheduler, Stream Anomaly Monitoring, Identified Threats]

ReteGenerator Architecture
[Diagram labels: SQL Queries, System Catalog, ReteGenerator, Transitivity Inference, Sharing Module, Join Order, Conditional Materialization, Optimizer, Common Computation Identification, Predicate Indexing, Extended Predicate Set Operations, Choose what and how to share, Recording and Manipulating Network Topology, Estimating Sharing Costs]

Adapted Rete Algorithm (Selection)
n and m are the old data sets; Δn and Δm are the new, much smaller incremental data sets.
Selection σ: $\sigma(n + \Delta n) = \sigma(n) + \sigma(\Delta n)$
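The selection identity can be sketched in a few lines of Python. This is only an illustration under assumed names (`select`, `IncrementalSelection`, `push`), with plain lists standing in for the DBMS tables that ARGUS uses for materialized intermediate results.

```python
def select(rows, predicate):
    """Plain selection: keep the rows that satisfy the predicate."""
    return [row for row in rows if predicate(row)]


class IncrementalSelection:
    """Keeps sigma(n) materialized and only evaluates sigma(delta_n) on new data."""

    def __init__(self, predicate, old_rows=()):
        self.predicate = predicate
        self.materialized = select(old_rows, predicate)  # sigma(n)

    def push(self, delta_rows):
        """Process an increment; return the newly matching rows."""
        new_matches = select(delta_rows, self.predicate)  # sigma(delta_n)
        self.materialized.extend(new_matches)  # sigma(n) + sigma(delta_n)
        return new_matches


if __name__ == "__main__":
    old = [500, 2_000_000, 10]
    delta = [3_000_000, 7]
    node = IncrementalSelection(lambda amount: amount > 1_000_000, old)
    print(node.push(delta))   # [3000000]
    print(node.materialized)  # [2000000, 3000000]
```

The old rows are scanned once when the node is built; every later increment touches only the new rows.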

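Adapted Rete Algorithm (Join)
$(n + \Delta n) \bowtie (m + \Delta m) = n \bowtie m + \Delta n \bowtie m + n \bowtie \Delta m + \Delta n \bowtie \Delta m$
Old results: $n \bowtie m$. New incremental results: $\Delta n \bowtie m + n \bowtie \Delta m + \Delta n \bowtie \Delta m$.
When Δn and Δm are very small compared to n and m, the time complexity of the incremental join is O(n+m).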
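The join decomposition can be checked with a small sketch. The function names and the nested-loop join are assumptions made for illustration; they are not how ARGUS executes joins inside the DBMS.

```python
def join(left, right, predicate):
    """Nested-loop join: all (l, r) pairs that satisfy the join predicate."""
    return [(l, r) for l in left for r in right if predicate(l, r)]


def incremental_join(n, m, delta_n, delta_m, predicate):
    """New results only: (dn join m) + (n join dm) + (dn join dm).

    The old result (n join m) is assumed to be materialized already,
    so it is never recomputed here.
    """
    return (
        join(delta_n, m, predicate)
        + join(n, delta_m, predicate)
        + join(delta_n, delta_m, predicate)
    )


if __name__ == "__main__":
    same_key = lambda l, r: l[0] == r[0]
    n, m = [(1, "a"), (2, "b")], [(1, "x"), (3, "y")]
    delta_n, delta_m = [(3, "c")], [(2, "z")]

    old_results = join(n, m, same_key)
    new_results = incremental_join(n, m, delta_n, delta_m, same_key)
    full = join(n + delta_n, m + delta_m, same_key)
    print(sorted(old_results + new_results) == sorted(full))  # True
```

Each of the three incremental terms has at least one small operand, which is where the saving over recomputing the full join comes from.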

Incremental Evaluation in Rete (Example 4)
[Rete network diagram over DataTable, aliased as r1, r2, r3]
Selection nodes: Type_code=1000 and Amount> ; Type_code=1000
Join of r1 and r2: r1.rbank_aba = r2.sbank_aba; r1.benef_account = r2.orig_account; r2.amount > r1.amount*0.5; r1.tran_date <= r2.tran_date; r2.tran_date <= r1.tran_date+10
Join with r3: r2.rbank_aba = r3.sbank_aba; r2.benef_account = r3.orig_account; r2.amount = r3.amount; r2.tran_date <= r3.tran_date; r3.tran_date <= r2.tran_date+10

Complex Queries
A persistent query may contain multiple SQL statements, and a single SQL statement may contain unions of multiple SQL terms. Each SQL term is mapped to a sub-Rete network. These sub-Rete networks are then connected to form the statement-level sub-networks, which are in turn connected according to their view references to form the final query-level Rete network.

Transitivity Inference
Transitivity Inference explores the transitivity properties of comparison operators to derive hidden, highly selective selection predicates.
Highly selective selection predicates can significantly improve performance because they may produce very small intermediate results.
Subsequent joins can then be performed very quickly on the materialized intermediate results.
Ono/Lohman VLDB90, Pirahesh/Leung/Hasan ICDE97

Transitivity Inference Example
Given: r1.amount > and r2.amount > r1.amount * 0.5 and r3.amount = r2.amount
r1.amount > is very highly selective on r1.
We can infer highly selective predicates: r2.amount > ; r3.amount >
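The inference step can be written out as a short sketch. The function name and the dictionary it returns are hypothetical, and the threshold used in the usage lines is an assumption taken from the "over $1,000,000" wording of Example 4. The reasoning: if r1.amount > T and r2.amount > f * r1.amount with f > 0, then r2.amount > f * T; since r3.amount = r2.amount, the same bound holds for r3.

```python
def infer_lower_bounds(r1_threshold, fraction=0.5):
    """Derive single-table lower bounds implied by the join predicates.

    Given   r1.amount > r1_threshold
            r2.amount > fraction * r1.amount   (fraction must be positive)
            r3.amount = r2.amount
    infer   r2.amount > fraction * r1_threshold
            r3.amount > fraction * r1_threshold
    """
    if fraction <= 0:
        raise ValueError("the inference only holds for a positive fraction")
    r2_bound = fraction * r1_threshold  # transitivity through r1.amount
    r3_bound = r2_bound                 # equality predicate carries the bound over
    return {"r1.amount": r1_threshold, "r2.amount": r2_bound, "r3.amount": r3_bound}


if __name__ == "__main__":
    # Assumed threshold from the Example 4 description.
    print(infer_lower_bounds(1_000_000))
```

The inferred bounds are plain selection predicates, so they can be applied to r2 and r3 before any join is evaluated.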

Conditional Materialization
[Diagram: two join networks over r1 and r2, one labeled Unconditional Materialization, the other Conditional Materialization]
Conditional Materialization: choose whether to materialize based on cost estimates.

Preliminary Evaluation: Queries and Data
7 queries on synthesized FedWire money transfer database records.
Two data conditions:
Data1: Old: first records; New: remaining records; ALERT
Data2: Old: first records; New: next records; NOT alert

Preliminary Results: Rete with Transitivity Inference
[Chart: Execution Time (s) for Q1 to Q7; series: Rete Data1, SQL Data1, Rete Data2, SQL Data2]

Transitivity Inference
[Charts: two panels, Q2 and Q, each showing Execution Time (s) on Data1 and Data2; series: Rete TI, Rete Non-TI, SQL Non-TI, SQL TI]

Conditional Materialization
Q4, assuming Transitivity Inference is not applicable.
[Chart: Execution Time (s) on Data1 and Data2; legend text: Conditional Rete SQL]

ARGUS Summary
Adapted Rete algorithm on top of a traditional DBMS platform.
Exploits the very-high-selectivity query property for optimization: Transitivity Inference; Conditional Materialization.
Current and future work: optimizing join order; computation sharing.

Thank you! Questions and comments?