Data Streams and Continuous Query Systems

CS 240B: Professor Zaniolo
Eric Sytwu, Joseph Joswig

Outline
Review of Data Streams
NiagaraCQ
TelegraphCQ
Conclusion
Bibliography

Data Sets vs. Data Streams
Data sets: infrequently changing data. Examples: employee personnel table, contact database, library system.
Data streams: data arriving continuously. Examples: stock streamer, sensor networks, weather monitoring system.

Traditional Database Query
In a traditional query, the query engine returns a subset of the data that is currently in the system.
Figure: End User / Application, Query, Results, Query Processor, Static Data Sets.

Continuous Queries
Continuous queries are persistent queries that allow users to get new results as new information enters the system.
Figure: End User / Application, Query, Continuous Results, Query Processor, Workspace, Data Streams.
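A minimal Python sketch of the contrast between the two query models. The function names and the sample ticks are illustrative assumptions, not part of NiagaraCQ or TelegraphCQ.

```python
from typing import Callable, Iterable, Iterator

def one_shot_query(table: list, predicate: Callable) -> list:
    """Traditional query: evaluated once over the data stored right now."""
    return [row for row in table if predicate(row)]

def continuous_query(stream: Iterable, predicate: Callable) -> Iterator:
    """Continuous query: stays registered and emits each matching tuple as it arrives."""
    for row in stream:
        if predicate(row):
            yield row

# Hypothetical stock ticks, in arrival order.
ticks = [
    {"symbol": "MSFT", "price": 60},
    {"symbol": "INTC", "price": 31},
    {"symbol": "MSFT", "price": 48},
]

def is_msft(row):
    return row["symbol"] == "MSFT"

# The one-shot query only sees what was stored when it ran (the first two ticks).
static_results = one_shot_query(ticks[:2], is_msft)
# The continuous query keeps producing results as later ticks arrive.
streamed_results = list(continuous_query(iter(ticks), is_msft))
```

The one-shot query would have to be reissued to see the third tick, while the continuous query picks it up without user action.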

NiagaraCQ

Goal: Allow users to obtain new results from a database without having to issue the same query repeatedly.
Develop a system that lets a large number of users register continuous queries in a high-level language such as XML-QL.

What’s wrong with previous continuous querying systems?
Previous group optimization efforts focused on finding an optimal plan for a small number of queries.
They are computationally too expensive to handle a large number of queries.
They were not designed for the web, which is constantly changing.

Benefits of NiagaraCQ
Based on group optimization.
Grouped queries can share computation.
Common execution plans of grouped queries reside in memory, saving I/O costs compared to executing each query separately.

How do we get the benefits?
Incremental group optimization:
Groups are created for existing queries according to their signatures, which represent similar structures among the queries.
Each individual query in a query group shares the results from the execution of the group plan.
When a new query is submitted, the group optimizer considers existing groups as potential optimization choices, and the new query is merged into an existing group with a matching signature.
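The grouping step can be sketched in Python as follows. This is an illustration under stated assumptions: SelectionQuery, signature, and GroupOptimizer are hypothetical names, and a signature is modeled simply as the query with its constant abstracted away.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class SelectionQuery:
    query_id: str
    source: str    # e.g. "quotes.xml"
    path: str      # e.g. "Quotes.Quote.Symbol"
    op: str        # e.g. "="
    constant: str  # e.g. "INTC"

def signature(q: SelectionQuery) -> tuple:
    """Assumed signature: the query structure with the constant removed."""
    return (q.source, q.path, q.op)

class GroupOptimizer:
    def __init__(self):
        # signature -> {constant: [ids of queries that use that constant]}
        self.groups = defaultdict(dict)

    def register(self, q: SelectionQuery) -> bool:
        """Merge q into the group with its signature; return True if a new group was created."""
        sig = signature(q)
        is_new_group = sig not in self.groups
        self.groups[sig].setdefault(q.constant, []).append(q.query_id)
        return is_new_group

optimizer = GroupOptimizer()
q1 = SelectionQuery("q1", "quotes.xml", "Quotes.Quote.Symbol", "=", "INTC")
q2 = SelectionQuery("q2", "quotes.xml", "Quotes.Quote.Symbol", "=", "MSFT")
q3 = SelectionQuery("q3", "quotes.xml", "Quotes.Quote.Price", ">", "50")
created = [optimizer.register(q) for q in (q1, q2, q3)]
```

Here q2 joins the group that q1 created because only the constant differs, while q3 has a different structure and starts its own group. Existing groups are never rebuilt when a query arrives, which is the incremental part.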

Example: XML-QL query
Expression signature: Quotes.Quote.Symbol = constant, in quotes.xml

Query Plan
Figure: two separate query plans. Each plan runs a File Scan on Quotes.xml, applies a Select (Symbol = “INTC” in one plan, Symbol = “MSFT” in the other), and feeds a trigger action (Trigger Action I, Trigger Action J).

Group Plan
Figure: group plan. A File Scan on Quotes.xml and a File Scan on the Constant Table feed a Join on Symbol = Constant value. A Split operator then distributes the join output to Trigger Action I, Trigger Action J, and so on.
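A rough Python sketch of how such a group plan could be evaluated: the shared scan is joined with the constant table, and the split step routes each match to its destination. The function name, the destination labels, the pairing of constants with trigger actions, and the sample quotes are all assumptions made for illustration.

```python
from collections import defaultdict

def run_group_plan(quotes, constant_table):
    """Join scanned quotes with the constant table on Symbol = constant, then split.

    constant_table is a list of (constant, destination) rows, one per grouped query.
    """
    # Index the constant table: constant value -> destinations waiting on it.
    destinations_by_constant = defaultdict(list)
    for constant, destination in constant_table:
        destinations_by_constant[constant].append(destination)

    # One pass over the shared scan serves every query in the group.
    outputs = defaultdict(list)
    for quote in quotes:
        for destination in destinations_by_constant.get(quote["Symbol"], []):
            outputs[destination].append(quote)  # split: route to this query's output
    return dict(outputs)

quotes = [
    {"Symbol": "INTC", "Price": 31.0},
    {"Symbol": "MSFT", "Price": 60.0},
    {"Symbol": "IBM", "Price": 110.0},
]
# Hypothetical pairing of constants with trigger actions.
constant_table = [("INTC", "trigger_action_i"), ("MSFT", "trigger_action_j")]
routed = run_group_plan(quotes, constant_table)
```

Adding another query with the same signature only adds a row to the constant table. The file is still scanned once.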

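Materialized Intermediate Files
Figure: query split with intermediate files. A Split operator writes intermediate files (File_i, File_j, ...). Each file is read by its own File Scan, which feeds the corresponding trigger action (Trigger_Act_i, Trigger_Act_j, ...).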

General Selection Predicates
Form: "Attribute op Constant"
Attribute: a path expression without wildcards
Op: "=", "<", ">", ...

Join Operators
A join signature in our approach contains the names of the two data sources and the predicates for the join.
Join queries with the same join signature are grouped together.

Processing Continuous Queries
Components: Continuous Query Manager (CQM), Event Detector (ED), Data Manager (DM), Query Engine (QE).
1. CQM adds continuous queries with file and timer information to enable ED to monitor events.
2. ED asks DM to monitor changes to files.
3. When a timer event happens, ED asks DM for the last modified time of files.
4. DM informs ED of changes to push-based data sources.
5. If file changes and timer events are satisfied, ED provides CQM with a list of firing CQs.
6. CQM invokes QE to execute firing CQs.
7. The file scan operator calls DM to retrieve selected documents.
8. DM returns only the data changes between the last fire time and the current fire time.
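Step 8 is the key to incremental evaluation. A minimal Python sketch, assuming a hypothetical in-memory change log: the class names and methods below are illustrative and do not reproduce NiagaraCQ's actual interfaces.

```python
from dataclasses import dataclass, field

@dataclass
class DataManager:
    # Hypothetical change log of (modification_time, document) pairs.
    change_log: list = field(default_factory=list)

    def record_change(self, modified_at: float, document: str) -> None:
        self.change_log.append((modified_at, document))

    def changes_between(self, last_fire_time: float, current_fire_time: float) -> list:
        """Return only documents changed after the last firing, up to the current one."""
        return [doc for t, doc in self.change_log
                if last_fire_time < t <= current_fire_time]

@dataclass
class ContinuousQuery:
    name: str
    last_fire_time: float = 0.0

    def fire(self, dm: DataManager, now: float) -> list:
        """Execute over the delta since the previous firing, then advance the fire time."""
        delta = dm.changes_between(self.last_fire_time, now)
        self.last_fire_time = now
        return delta

dm = DataManager()
cq = ContinuousQuery("msft_watch")
dm.record_change(1.0, "doc_a")
dm.record_change(2.0, "doc_b")
first = cq.fire(dm, now=2.0)
dm.record_change(3.0, "doc_c")
second = cq.fire(dm, now=4.0)
third = cq.fire(dm, now=5.0)
```

Each firing sees only what changed since the previous one, so a query never reprocesses documents it has already handled.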

Experimental Results
Performed on a Sun Ultra 6000 with 1 GB RAM, running JDK 1.2 on Solaris 2.6.

Experimental Results

Analysis of NiagaraCQ
Pros:
Scalable to a large number of queries and users.
Works with both change-based and timer-based continuous queries.
Better performance, with less I/O required to execute queries.
Cons:
No dynamic re-grouping, so groups eventually become sub-optimal.
Assumes queries have a common structure, which is not always the case.
Incremental grouping works only for select and join as of now. Eventually, aggregation may be included.

Niagara in Review
The goal was to develop an Internet-scale continuous query system using group optimization, based on the assumption that many continuous queries on the Internet will have some similarities.
Proposed a novel “incremental grouping” methodology.
Supports both timer-based and change-based queries.

TelegraphCQ

TelegraphCQ Design Overview
Focus: continuously adaptive query processing of high-volume, highly variable data streams.
Large scale
Deeply networked nature
Unpredictability of the environment
Need for close user interaction
Data constantly moving and changing

TelegraphCQ Restrictions
Data is pushed to the query processor.
Data arrival rate can be high and bursty.
Processing is on the fly: data can be stored, but real-time, one-pass analysis is important.
Ordering of data is of significant importance.

Design Goals
Scheduling and resource management for groups of queries
Support for out-of-core (non-main-memory) data
Variable adaptivity
Dynamic QoS support
Parallel cluster-based processing and distributed computation

TelegraphCQ
A complete redesign and re-implementation of the Telegraph system, with a focus on support for shared, continuous query processing over query and data streams.
The name distinguishes it from the Telegraph project’s broader focus on adaptive dataflow in general and emphasizes the challenges addressed in the new implementation.

Telegraph Module Types
Ingress and Caching: interface with external data sources.
TeSS: HTML/XML screen scraper.
TelNape: interfaces with popular P2P networks.
Local caching to hide network delays.
Query Processing: pipelined, non-blocking versions of standard relational operators such as joins, selections, projections, grouping and aggregation, and duplicate elimination.
State Modules (SteMs).
Adaptive Routing: the ability to “re-optimize” the plan on a continuous basis while a query is running.
Eddies.
Flux (Fault-tolerant, Load-balancing eXchange): an opaque dataflow module that handles buffering and reordering of streams.

Eddies
Role: continuously route tuples among a set of other modules according to a routing policy.
Intercept tuples and choose the order in which they travel between modules.
An eddy can shut down each module when the end of all of its input streams is reached and the modules have completed current processing.
Not designed as a general-purpose scheduler; no enforcement of resource management policies.
Multiple eddies run as parallel threads on queries with disjoint sets of tables and streams.

Adaptive Processing with Eddies & SteMs
SteM: a temporary repository of tuples, essentially corresponding to half of a traditional join operator.
It stores homogeneous tuples (i.e., tuples spanning the same set of tables) formed during query processing.
Supports insert (build), search (probe), and optionally delete (eviction) operations.
Two kinds of tuples can be routed to a SteM:
When a tuple t in T (a build tuple) is routed to SteM_T, t is added to the set of tuples in SteM_T.
When a tuple p ∉ T (a probe tuple) is routed to SteM_T, SteM_T returns concatenated matches for it to the Eddy. These concatenated matches are the tuples in {p} join SteM_T that satisfy all query predicates that can be evaluated on the columns in p and T.
Figure: streams S and T enter the Eddy. The Eddy sends S build and T probe tuples to SteM_S, sends T build and S probe tuples to SteM_T, and receives ST matches back.
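The build and probe operations can be sketched in Python. This is a simplified illustration: the class and function names are assumptions, the join is an equality join on one key, eviction is omitted, and the routing order is fixed (build, then probe) rather than chosen by an adaptive routing policy as a real eddy would do.

```python
from collections import defaultdict

class SteM:
    """Half of a join: stores the tuples of one table, indexed on the join key."""

    def __init__(self, key):
        self.key = key
        self.index = defaultdict(list)

    def build(self, t):
        """Insert a build tuple from this SteM's own table."""
        self.index[t[self.key]].append(t)

    def probe(self, p):
        """Return concatenated matches for a probe tuple from the other table."""
        return [{**match, **p} for match in self.index.get(p[self.key], [])]

def eddy_join(arrivals, key):
    """Route each arriving (table, tuple) pair: build into its own SteM, probe the other."""
    stems = {"S": SteM(key), "T": SteM(key)}
    other = {"S": "T", "T": "S"}
    results = []
    for table, tup in arrivals:
        stems[table].build(tup)
        results.extend(stems[other[table]].probe(tup))
    return results

# Hypothetical interleaved arrivals from streams S and T.
arrivals = [
    ("S", {"k": 1, "s": "a"}),
    ("T", {"k": 1, "t": "x"}),
    ("T", {"k": 2, "t": "y"}),
    ("S", {"k": 2, "s": "b"}),
]
matches = eddy_join(arrivals, key="k")
```

Because every tuple is built before the other side probes, each matching pair is produced exactly once, whichever stream delivers its tuple first.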

Fjords
Inter-module communications API.
Form the links between modules.
Support a mixture of push (streaming) and pull (static) operations in query plans.
Allow modules to ignore the specifics of the data source.
Support non-blocking dequeue operations.
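A small Python sketch of the idea: a push queue and a pull queue expose the same non-blocking dequeue, so the consuming module does not need to know which kind of source it reads. The class names are hypothetical and this is not the actual Fjords API. None is used as the "nothing available" signal, which assumes None is never a data item.

```python
import queue

class PushQueue:
    """Push connection: the source enqueues items as they arrive."""

    def __init__(self):
        self._q = queue.Queue()

    def push(self, item):
        self._q.put(item)

    def try_dequeue(self):
        """Non-blocking dequeue: return None at once if nothing is waiting."""
        try:
            return self._q.get_nowait()
        except queue.Empty:
            return None

class PullQueue:
    """Pull connection: items are fetched from a static source on demand."""

    def __init__(self, source):
        self._it = iter(source)

    def try_dequeue(self):
        return next(self._it, None)

def drain_available(q):
    """Consumer side: take whatever is available now, without blocking."""
    items = []
    while (item := q.try_dequeue()) is not None:
        items.append(item)
    return items

sensor = PushQueue()
sensor.push(1)
sensor.push(2)
from_push = drain_available(sensor)
from_push_again = drain_available(sensor)  # empty, but the call returns immediately
from_pull = drain_available(PullQueue([3, 4]))
```

A consumer written against try_dequeue can move on to other inputs when one stream is temporarily quiet instead of stalling.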

System Specifications
Built on the PostgreSQL platform.
Process-per-connection model.
Coded in C/C++.

Example: Landmark Query
The input windows of these queries have a fixed beginning point in the timeline and a forward-moving endpoint.
Example: “Select all the days after the hundredth trading day on which the closing price of MSFT has been greater than $50. Keep this query standing in the system for a thousand trading days.”
    SELECT closingPrice, timestamp
    FROM ClosingStockPrices
    WHERE stockSymbol = 'MSFT' AND closingPrice > 50.00
    for (t = 101; t <= 1100; t++) {
        WindowIs(ClosingStockPrices, 101, t);
    }
Sample data (stock symbol, trading day, closing price):
    MSFT 101 $60
    MSFT 102 $48
    MSFT 103 $52
    MSFT 104 $60
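To make the window semantics concrete, here is a Python sketch that evaluates the landmark query over the four sample rows. The function name and the tuple layout (symbol, day, price) are assumptions, and the loop stops at day 104 because that is where the sample data ends.

```python
def landmark_query(rows, start, end):
    """For each t in [start, end], evaluate the query over the window [start, t]."""
    for t in range(start, end + 1):
        window = [r for r in rows if start <= r[1] <= t]
        yield t, [(price, day) for sym, day, price in window
                  if sym == "MSFT" and price > 50.00]

# Sample rows from the slide: (stock symbol, trading day, closing price).
rows = [("MSFT", 101, 60), ("MSFT", 102, 48), ("MSFT", 103, 52), ("MSFT", 104, 60)]
landmark_results = dict(landmark_query(rows, start=101, end=104))
```

The window's left edge stays at day 101 while the right edge advances, so each result contains the previous one. Day 102 never qualifies because its closing price is $48.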

Example: Sliding Window Query
The input windows of these queries have forward-moving beginning and end points.
Example: “On every third trading day starting today, calculate the average closing price of MSFT for the three most recent trading days. Keep the query standing for fifty trading days.”
    SELECT AVG(closingPrice)
    FROM ClosingStockPrices
    WHERE stockSymbol = 'MSFT'
    for (t = ST; t < ST + 50; t += 3) {
        WindowIs(ClosingStockPrices, t - 2, t);
    }
Sample data (stock symbol, trading day, closing price):
    MSFT 101 $60
    MSFT 102 $48
    MSFT 103 $52
    MSFT 104 $56
    MSFT 105 $55
    MSFT 106 $58
    MSFT 107 $52
    MSFT 108 $60
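A Python sketch of the sliding window evaluation over the sample rows. The function name and parameters are assumptions. The start day ST is not given on the slide, so the usage below assumes ST = 103 and runs for six days to stay within the sample data.

```python
def sliding_average(rows, start, duration, step=3, width=3):
    """For t = start, start + step, ... < start + duration, average over [t - width + 1, t]."""
    for t in range(start, start + duration, step):
        prices = [price for sym, day, price in rows
                  if sym == "MSFT" and t - (width - 1) <= day <= t]
        if prices:  # skip windows with no data
            yield t, sum(prices) / len(prices)

# Sample rows from the slide: (stock symbol, trading day, closing price).
rows = [
    ("MSFT", 101, 60), ("MSFT", 102, 48), ("MSFT", 103, 52), ("MSFT", 104, 56),
    ("MSFT", 105, 55), ("MSFT", 106, 58), ("MSFT", 107, 52), ("MSFT", 108, 60),
]
# Assumed start day ST = 103, evaluated at t = 103 and t = 106.
sliding_results = dict(sliding_average(rows, start=103, duration=6))
```

With a step of 3 and a width of 3 the windows tile the stream without overlap: days 101 to 103, then days 104 to 106.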

Example: Temporal Band Join Query
These queries join tuples in one stream with tuples in another based on timestamp.
Example: “For the five most recent trading days starting today, select all stocks that closed higher than MSFT on a given day. Keep the query standing for twenty trading days.”
    SELECT c2.*
    FROM ClosingStockPrices AS c1, ClosingStockPrices AS c2
    WHERE c1.stockSymbol = 'MSFT' AND c2.stockSymbol != 'MSFT'
      AND c2.closingPrice > c1.closingPrice
      AND c2.timestamp = c1.timestamp
    for (t = ST; t < ST + 20; t++) {
        WindowIs(c1, t - 4, t);
        WindowIs(c2, t - 4, t);
    }
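A Python sketch of one evaluation of the band join at a given day t. The function name, the tuple layout (symbol, day, price), and the sample rows are invented for illustration. The sketch also assumes at most one MSFT row per trading day.

```python
def band_join(rows, t, width=5):
    """Stocks that closed higher than MSFT on the same day, within the window [t - width + 1, t]."""
    window = [r for r in rows if t - (width - 1) <= r[1] <= t]
    # c1 side: MSFT closing price per day (assumes one MSFT row per day).
    msft_close = {day: price for sym, day, price in window if sym == "MSFT"}
    # c2 side: every other stock, joined on equal timestamp and a higher close.
    return [(sym, day, price) for sym, day, price in window
            if sym != "MSFT" and day in msft_close and price > msft_close[day]]

# Hypothetical rows: (stock symbol, trading day, closing price).
rows = [
    ("MSFT", 1, 50), ("IBM", 1, 55), ("INTC", 1, 30),
    ("MSFT", 2, 52), ("IBM", 2, 51),
]
higher_than_msft = band_join(rows, t=2)
```

Both aliases see the same five-day window, so a stock is compared with MSFT only on days that both fall inside it.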

Pros and Cons of System
Pros:
Focus on extreme adaptability.
New code is multithreaded to boost system parallelism and enhance performance, particularly in multiprocessor scenarios.
Cons:
Code is not fully multithreaded (existing PostgreSQL code).
Queries are separated into classes for processing based on disjoint footprints.
Still in early development stages.
Issues still need to be solved.
No extensive performance analysis.

TelegraphCQ Future Work
Egress modules:
Include fault tolerance in delivery of results, e.g., in mobile networks.
Improved interface with overlay networks.
Cluster and distributed implementations:
Extension of the Flux module.
Integration with the TAG system.

Conclusion and Review
NiagaraCQ: a system that establishes scalability with a general strategy of incremental group optimization.
TelegraphCQ: a system that combines prior work in Fjords, Eddies, and PSoup in order to query streaming data on large scales.
Other data streaming solutions: Aurora, STREAM, StreamMill.

Thank You for your time!

Bibliography
J. Chen, D. DeWitt, F. Tian, Y. Wang. NiagaraCQ: A Scalable Continuous Query System for Internet Databases. In Proc. of the ACM SIGMOD Conf. on Management of Data, 2000.
Xiaoning Wang. NiagaraCQ presentation.
Chandrasekaran et al. TelegraphCQ: Continuous Dataflow Processing for an Uncertain World. UC Berkeley. CIDR Conference.
