Data Stream Computation Lecture Notes in COMP 9314 modified from those by Nikos Koudas (Toronto U), Divesh Srivastava (AT&T), and S. Muthukrishnan (Rutgers)



Outline
  Week 1: General Introduction
    Data streams: what, why now, applications
    Data streams: architecture and issues
    Statistic Estimation
  Week 2: Heavy Hitters
  Week 3: Quantiles
  Week 4: Miscellaneous

Data Streams: What and Where?
  A data stream is a (potentially unbounded) sequence of tuples
  Transactional data streams: log interactions between entities
    Credit card: purchases by consumers from merchants
    Telecommunications: phone calls by callers to dialed parties
    Web: accesses by clients of resources at servers
  Measurement data streams: monitor evolution of entity states
    IP network: traffic at router interfaces
    Sensor networks: physical phenomena, road traffic
    Earth climate: temperature, moisture at weather stations

Data Streams: Why Now?
  Haven’t data feeds to databases always existed? Yes
    Modify underlying databases, data warehouses
    Complex queries are specified over stored data
  [Figure labels: DB, Data Feeds, Queries]
  Two recent developments: application- and technology-driven
    Need for sophisticated near-real time queries/analyses
    Massive data volumes of transactions and measurements

Data Streams: Real-Time Queries
  With traditional data feeds
    Simple queries (e.g., value lookup) needed in real-time
    Complex queries (e.g., trend analyses) performed offline
  Now need sophisticated near-real time queries/analyses
    AT&T: fraud detection on call detail tuple streams
    NOAA: tornado detection using weather radar data
  [Figure labels: DB, ?, Data Feeds, Queries]

Data Streams: Massive Volumes
  Now able to deploy transactional data observation points
    AT&T long-distance: ~300M call tuples/day
    AT&T IP backbone: ~50B IP flows/day
  Now able to generate automated, highly detailed measurements
    NOAA: satellite-based measurement of earth geodetics
    Sensor networks: huge number of measurement points
  [Figure labels: ?, DB, ?, Data Feeds]

IP Network Application: Hidden P2P Traffic Detection
  Business Challenge: AT&T IP customer wanted to accurately monitor peer-to-peer (P2P) traffic evolution within its network
  Previous Approach: Determine P2P traffic volumes using TCP port number found in Netflow data
  Issues: P2P traffic might not use known P2P port numbers
  Solution: Using Gigascope SQL-based DSMS
    Search for P2P related keywords within each TCP datagram
    Identified 3 times more traffic as P2P than using Netflow
  Lesson: Essential to query massive volume data streams
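The contrast between the port-based approach and the payload-keyword approach can be sketched in a few lines of Python. This is only an illustration, not Gigascope's query language: the `Packet` type, the port set, and the keyword list are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

# Illustrative values only; the slides do not list the ports or keywords used.
KNOWN_P2P_PORTS = {6881, 6346}
P2P_KEYWORDS = (b"BitTorrent protocol", b"GNUTELLA")


@dataclass
class Packet:
    """Hypothetical, simplified view of one TCP packet."""
    dst_port: int
    payload: bytes


def is_p2p_by_port(pkt: Packet) -> bool:
    """Port-based classification: misses P2P traffic on unexpected ports."""
    return pkt.dst_port in KNOWN_P2P_PORTS


def is_p2p_by_payload(pkt: Packet) -> bool:
    """Payload-based classification: looks for protocol keywords in the data."""
    return any(keyword in pkt.payload for keyword in P2P_KEYWORDS)


def count_matches(packets: Iterable[Packet],
                  predicate: Callable[[Packet], bool]) -> int:
    """Count matching packets in a single pass over the stream."""
    return sum(1 for pkt in packets if predicate(pkt))
```

A P2P packet sent to port 80 is missed by `is_p2p_by_port` but caught by `is_p2p_by_payload`, which is the kind of gap the slide describes. Payload inspection needs the full packet stream rather than flow summaries, so it costs more per tuple.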

IP Network Application: Web Client Performance Monitoring
  Business Challenge: AT&T IP customer wanted to monitor latency observed by clients to find performance problems
  Previous Approach: Measure latency at “active clients” that establish network connections with servers
  Issues: Use of “active clients” is not very representative
  Solution: Using Gigascope SQL-based DSMS
    Track TCP synchronization and acknowledgement packets
    Report round trip time statistics: latency
  Lesson: Essential to correlate multiple data streams
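One way to picture the correlation step is to match each SYN with the SYN-ACK that answers it and report the time difference. The Python sketch below assumes a hypothetical `TcpEvent` record and a single merged, time-ordered event stream; it is not how Gigascope expresses the query, and it measures only the delay between the monitoring point and the server.

```python
from typing import Dict, Iterable, Iterator, NamedTuple, Tuple


class TcpEvent(NamedTuple):
    """Hypothetical record for one observed TCP handshake packet."""
    ts: float    # capture timestamp in seconds
    src: str     # sender endpoint, e.g. "10.0.0.1:5555"
    dst: str     # receiver endpoint
    flags: str   # "SYN" or "SYN-ACK"


def handshake_rtts(events: Iterable[TcpEvent]) -> Iterator[Tuple[str, str, float]]:
    """Yield (client, server, rtt) for each SYN answered by a SYN-ACK."""
    pending: Dict[Tuple[str, str], float] = {}  # (client, server) -> SYN time
    for ev in events:
        if ev.flags == "SYN":
            # Keep the first SYN time so retransmissions do not shrink the RTT.
            pending.setdefault((ev.src, ev.dst), ev.ts)
        elif ev.flags == "SYN-ACK":
            # The SYN-ACK travels server -> client, so the key is reversed.
            start = pending.pop((ev.dst, ev.src), None)
            if start is not None:
                yield ev.dst, ev.src, ev.ts - start
```

The `pending` table grows with every unanswered SYN, so a real deployment would need to expire old entries, for example with a time window.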

Gigascope: Features and Functions
  Gigascope is a fast, flexible data stream management system
    High performance at speeds up to OC48 (2 * 2.48 Gbit/sec)
    Developed at AT&T Labs-Research
    Collaboration between database and networking research
  Current libraries include
    Traffic matrix by site or autonomous system
    Detection of hidden P2P traffic
    End-to-end TCP performance monitoring
    Detailed custom performance statistics

DSMS + DBMS: Architecture
  Data stream management system at multiple observation points
    (Voluminous) streams-in, (data reduced) streams-out
  Database management system
    Outputs of DSMS can be treated as data feeds to database
  [Figure labels: Data Streams, DSMS (two instances), Data Feeds, DB, Queries]

DSMS + DBMS: Architecture
  Data Stream Systems                                   | Database Systems
  Resource (memory, per-tuple computation) limited      | Resource (memory, disk, per-tuple computation) rich
  Reasonably complex, near real time, query processing  | Extremely sophisticated query processing, analyses
  Useful to identify what data to populate in database  | Useful to audit query results of data stream system

Databases vs Data Streams: Issues
  Database Systems             | Data Stream Systems
  Model: persistent relations  | Model: transient relations
  Relation: tuple set/bag      | Relation: tuple sequence
  Data Update: modifications   | Data Update: appends
  Query: transient             | Query: persistent
  Query Answer: exact          | Query Answer: approximate
  Query Evaluation: arbitrary  | Query Evaluation: one pass
  Query Plan: fixed            | Query Plan: adaptive

Relation: Tuple Set or Sequence?
  Traditional relation = set/bag of tuples
  Tuple sequences have been studied:
    Temporal databases [TCG+93]: multiple time orderings
    Sequence databases [SLR94]: integer “position” -> tuple
  Data stream systems:
    Ordering domains: Gigascope [CJSS03], Hancock [CFP+00]
    Position ordering: Aurora [CCC+02], STREAM [MWA+03]

Update: Modifications or Appends?
  Traditional relational updates: arbitrary data modifications
  Append-only relations have been studied:
    Tapestry [TGNO92]: emails and news articles
    Chronicle data model [JMS95]: transactional data
  Data stream systems:
    Stream-in, stream-out: Aurora, Gigascope, STREAM
    Stream-in, relation-out: Hancock

Query: Transient or Persistent?
  Traditional relational queries: one-time, transient
  Persistent/continuous queries have been studied:
    Tapestry [TGNO92]: content-based email, news filtering
    OpenCQ, NiagaraCQ [LPT99, CDTW00]: monitor web sites
    Chronicle [JMS95]: incremental view maintenance
  Data stream systems:
    Support persistent and transient queries

Query Answer: Exact or Approximate?
  Traditional relational queries: exact answer
  Approximate query answers have been studied [BDF+97]:
    Synopsis construction: histograms, sampling, sketches
    Approximating query answers: using synopsis structures
  Data stream systems:
    Approximate joins: using windows to limit scope
    Approximate aggregates: using synopsis structures
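As a concrete instance of a sampling synopsis, the Python sketch below keeps a fixed-size uniform sample of a stream using reservoir sampling (Algorithm R). The slide names sampling only in general, so the choice of this particular algorithm and the function name `reservoir_sample` are assumptions made for illustration.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int,
                     rng: Optional[random.Random] = None) -> List[T]:
    """Return a uniform random sample of up to k items in one pass."""
    rng = rng or random.Random()
    sample: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Item i (0-based) replaces a slot with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive at both ends
            if j < k:
                sample[j] = item
    return sample
```

An aggregate such as the mean can then be estimated from the sample, for example `sum(s) / len(s)`, using memory proportional to `k` rather than to the stream length. The estimate is approximate, and its error depends on `k` and on the data distribution.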

Query Evaluation: One Pass?
  Traditional relational query evaluation: arbitrary data access
  One/few pass algorithms have been studied:
    Limited memory selection/sorting [MP80]: n-pass quantiles
    Tertiary memory databases [SS96]: reordering execution
    Complex aggregates [CR96]: bounding number of passes
  Data stream systems:
    Per-element processing: single pass to reduce drops
    Block processing: multiple passes to optimize I/O cost
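Per-element, single-pass processing can be illustrated with a running mean and variance that are updated as each tuple arrives and never revisit earlier data. The sketch below uses Welford's update rule; the class name `RunningStats` is hypothetical, and the slides do not prescribe this particular aggregate.

```python
class RunningStats:
    """Single-pass mean and population variance in constant memory."""

    def __init__(self) -> None:
        self.n = 0        # number of values seen so far
        self.mean = 0.0   # running mean
        self._m2 = 0.0    # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        """Fold one stream element into the statistics."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        # Uses the mean both before and after the update (Welford's rule).
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        """Population variance of the values seen so far (0.0 if none)."""
        return self._m2 / self.n if self.n else 0.0
```

Each call to `update` does a constant amount of work per element, which is the property that lets a stream system keep up with arrivals instead of dropping tuples.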

Query Plan: Fixed or Adaptive?
  Traditional relational query plans: optimized at beginning
  Adaptive query plans have been studied:
    Query scrambling [AFTU96]: wide-area data access
    Eddies [AH00]: volatile, unpredictable environments
  Data stream systems:
    Adaptive query operators
    Adaptive plans
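A very simplified picture of an adaptive plan is a chain of filters that reorders itself as it observes the data, so that the predicate rejecting the most tuples runs first. The Python sketch below is a toy illustration under that assumption; it is not the Eddies algorithm, and the class `AdaptiveFilterChain` and its `reorder_every` parameter are hypothetical.

```python
from typing import Any, Callable, List

Predicate = Callable[[Any], bool]


class AdaptiveFilterChain:
    """Conjunction of predicates, periodically reordered by observed pass rate."""

    def __init__(self, predicates: List[Predicate], reorder_every: int = 100) -> None:
        # Each entry is [predicate, times evaluated, times passed].
        self._stats = [[p, 0, 0] for p in predicates]
        self._reorder_every = reorder_every
        self._count = 0

    def accept(self, item: Any) -> bool:
        """Return True if the item passes every predicate."""
        self._count += 1
        result = True
        for entry in self._stats:
            entry[1] += 1
            if entry[0](item):
                entry[2] += 1
            else:
                result = False
                break  # short-circuit: later predicates are not evaluated
        if self._count % self._reorder_every == 0:
            # Lowest pass rate first, so most tuples are rejected early.
            self._stats.sort(key=lambda e: e[2] / e[1] if e[1] else 1.0)
        return result

    def order(self) -> List[Predicate]:
        """Current evaluation order of the predicates."""
        return [entry[0] for entry in self._stats]
```

Because of short-circuiting, the pass rates of later predicates are measured only on tuples that survived earlier ones, so the statistics are biased. A more careful design would also weigh per-predicate cost and sample each predicate independently.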

Data Stream Query Processing: Anything New?
  Architecture
    Resource (memory, per-tuple computation) limited
    Reasonably complex, near real time, query processing
  Issues
    Model: transient relations
    Relation: tuple sequence
    Data Update: appends
    Query: persistent
    Query Answer: approximate
    Query Evaluation: one pass
    Query Plan: adaptive
  A lot of challenging problems...