1 Continuous Query Languages (CQL) Blocking Operators and the expressive power problem Carlo Zaniolo UCLA CSD Spring 2009.

2 CQLs for DSMS
- Most DSMS projects use SQL for continuous queries, for good reasons:
  - Many applications span data streams and DB tables.
  - A CQL based on SQL will be easier to learn and use.
  - Moreover, the fewer the differences, the better!
- But DBMS were designed for persistent data and transient queries, not for persistent queries on transient data.
- Adapting SQL and its enabling technology presents difficult research challenges.
- These combine with traditional SQL problems, such as the inability to deal with sequences, data mining (DM) tasks, and other complex query tasks: in other words, a lack of expressive power.

3 Language Problems
- Most DSMS use SQL, so queries spanning both data streams and DBs will be easier. But ...
- Even for persistent data, SQL is far from perfect. Important application areas that are poorly supported include:
  - Data mining, and we need to mine data streams.
  - Sequence queries, and data streams are infinite time series!
- SQL faces major new problems on data stream applications. (After all, it was designed for persistent data on secondary store, not for streaming data.)
  - Only nonblocking operators can be used in a DSMS: blocking operators are forbidden.
  - The distinction is not clear in DBMS, which often use blocking implementations for nonblocking operators.
  - The distinction needs to be formally characterized,
  - and so does the loss of query power it causes for CQLs.

4 Blocking Operators
- A blocking query operator is 'one that is unable to produce the first tuple of the output until it has seen the entire input' [Babcock et al. PODS02].
- But continuous queries cannot wait for the end of the stream: they must return results while the data is streaming in. Blocking operators cannot be used.
- Only non-blocking (nb) queries and operators can be used on data streams, i.e., those that return their results before they have detected the end of the input.
- Current DBMSs make heavy use of blocking computations:
  1. For operators that are intrinsically blocking.
  2. For operators that are not, i.e., they are only implemented that way.
- To exclude case 1 (and only case 1), we need a characterization of blocking and nonblocking that is independent of implementation.

5 Partial Ordering
- Let $S = [t_1, \ldots, t_n]$ be a sequence and $0 \le k \le n$.
- Then $S_k = [t_1, \ldots, t_k]$ is said to be the presequence of $S$ of length $k$, for $k > 0$.
- Also, $S_0 = [\,]$ denotes the empty sequence.
- $L \sqsubseteq S$ denotes that $L$ is a presequence of $S$.
- $\sqsubseteq$ defines a partial order: it is reflexive, antisymmetric, and transitive.
- The notion of presequence generalizes to the standard subset notion when order and duplicates are immaterial.
- The empty sequence $[\,]$ is a presequence of every other sequence.
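A minimal sketch of the presequence test, assuming sequences are modeled as Python lists; the function names `is_presequence` and `presequence` are illustrative and do not come from the slides.

```python
from typing import List, Sequence


def is_presequence(l: Sequence, s: Sequence) -> bool:
    """Return True when l is a presequence (prefix) of s."""
    return len(l) <= len(s) and list(s[:len(l)]) == list(l)


def presequence(s: Sequence, k: int) -> List:
    """Return S_k, the presequence of s of length k, for 0 <= k <= len(s)."""
    if not 0 <= k <= len(s):
        raise ValueError("k must satisfy 0 <= k <= len(s)")
    return list(s[:k])
```

The empty list passes the test against any sequence, and every sequence is a presequence of itself, which matches the reflexivity stated on the slide.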

6 Operators on Sequences: $S \rightarrow G \rightarrow G(S)$
- Operators are viewed as incremental transducers; $G(S)$ is the result of applying $G$ to the whole of $S$.
- $S_j$ denotes the input up to step $j$, and $G_j(S)$ denotes the cumulative output produced up to and including the $j$-th input tuple.
- Let $S$ be a sequence of length $n$. Then $G$ is said to be:
  - Blocking when $G_j(S) = [\,]$ for $j < n$, and $G_n(S) = G(S)$.
  - Nonblocking when $G_j(S) = G(S_j)$ for every $j \le n$.
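The two definitions can be checked on small finite inputs. The sketch below is only an illustration: it records the cumulative outputs $G_1(S), \ldots, G_n(S)$ of a selection and of a sort, and tests them against the definitions. The helper names, the even-number predicate, and the use of the end of a list to simulate the end of the input are all assumptions made for the example.

```python
from typing import Callable, List, Sequence


def select_even_trace(s: Sequence[int]) -> List[List[int]]:
    """Cumulative outputs G_1(S)..G_n(S) of a selection run as a transducer."""
    out: List[int] = []
    trace: List[List[int]] = []
    for t in s:
        if t % 2 == 0:          # example predicate, chosen arbitrarily
            out.append(t)       # emitted as soon as the tuple arrives
        trace.append(list(out))
    return trace


def sort_trace(s: Sequence[int]) -> List[List[int]]:
    """Cumulative outputs of sorting: nothing until the last tuple is seen."""
    trace: List[List[int]] = [[] for _ in s[:-1]]
    if s:
        trace.append(sorted(s))
    return trace


def is_nonblocking_on(trace: List[List[int]], g: Callable, s: Sequence[int]) -> bool:
    """Check G_j(S) = G(S_j) for every j <= n."""
    return all(trace[j - 1] == g(s[:j]) for j in range(1, len(s) + 1))


def is_blocking_on(trace: List[List[int]], g: Callable, s: Sequence[int]) -> bool:
    """Check G_j(S) = [] for j < n and G_n(S) = G(S), for nonempty S."""
    n = len(s)
    if n == 0:
        return False
    return all(trace[j - 1] == [] for j in range(1, n)) and trace[n - 1] == g(s)


def select_even(s: Sequence[int]) -> List[int]:
    """The selection applied to a whole (finite) sequence."""
    return [t for t in s if t % 2 == 0]
```

On the input `[3, 4, 1, 6]` the selection trace is `[[], [4], [4], [4, 6]]`, which satisfies the nonblocking condition, while the sort trace stays empty until the last step.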

7 employees(E#, Sal, ...)
- Traditional SQL-2 aggregates are blocking:
    select count(E#) from employees group by Sal
- SQL:2003 OLAP functions are nonblocking:
    select Sal, count(E#) over (range unbounded preceding) from employees order by Sal
- Continuous count returns, for each new tuple, the count so far. On a sequence of length $n$, at each step $j$ the count up to $j$ is returned: $count_1(S) = [1]$, $count_2(S) = [1, 2]$, ..., $count_j(S) = [1, 2, \ldots, j]$, regardless of whether $j = n$ or $j < n$.
- Traditional count, cumulative return: for each $j < n$, nothing, so $count_j(S) = [\,]$; at the end, $count_n(S) = [n]$.
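The contrast between the two counts can be sketched with Python generators. This is an illustration under the assumption that a stream is any Python iterable; the names `continuous_count`, `traditional_count`, and `first_counts` are hypothetical.

```python
import itertools
from typing import Iterable, Iterator, List


def continuous_count(stream: Iterable) -> Iterator[int]:
    """Nonblocking count: emits the count so far after every input tuple."""
    n = 0
    for _ in stream:
        n += 1
        yield n


def traditional_count(stream: Iterable) -> Iterator[int]:
    """Blocking count: emits a single value, and only after the input ends."""
    n = 0
    for _ in stream:
        n += 1
    yield n


def first_counts(k: int) -> List[int]:
    """Take k results of the continuous count over an unbounded stream."""
    return list(itertools.islice(continuous_count(itertools.count()), k))
```

`first_counts` terminates because the continuous count never waits for the end of its input; calling `traditional_count` on the same unbounded stream would never return a value.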

8 Examples
- Selection is nonblocking.
- Projection is nonblocking, even if we eliminate the resulting duplicates.
- Traditional SQL-2 aggregates are blocking (for arbitrarily ordered input).
- SQL:2003 OLAP functions are not. E.g., continuous count, sum, max, etc. (i.e., OLAP functions over an unbounded preceding window) are nonblocking.
- Intermediate cases are also possible.
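To see why projection stays nonblocking even with duplicate elimination, here is a small sketch that assumes rows are tuples and the projected column is given by position. The function name `distinct_projection` is made up for the example.

```python
from typing import Hashable, Iterable, Iterator, Sequence, Set


def distinct_projection(stream: Iterable[Sequence[Hashable]], column: int) -> Iterator[Hashable]:
    """Project one column and emit each distinct value the first time it appears."""
    seen: Set[Hashable] = set()
    for row in stream:
        value = row[column]
        if value not in seen:   # a value already emitted is never retracted
            seen.add(value)
            yield value
```

Each output depends only on the tuples seen so far, so the operator never has to wait for the end of the stream. The cost is state: the `seen` set grows with the number of distinct values.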

9 Characterization of NonBlocking (nb)
Theorem: Queries can be expressed via nonblocking computations iff they are monotonic w.r.t. the presequence ordering.
Proof:
(i) NB $G$ implies monotonic $G$. Say that $S_j \sqsubseteq S_n$. It is always true that $G_j(S_n) \sqsubseteq G_n(S_n)$. But if $G$ is NB, then $G_j(S_n) = G_j(S_j) = G(S_j)$ and $G_n(S_n) = G(S_n)$, hence $G(S_j) \sqsubseteq G(S_n)$. QED
(ii) Monotonic $G$ implies NB $G$: the incremental $G$ transducer, at step $j+1$, adds the difference between $G(S_{j+1})$ and $G(S_j)$.
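Part (ii) of the proof can be read as a construction. The sketch below follows it literally, assuming `g` maps a finite list to a list: it recomputes `g` on the whole prefix at every step, which is inefficient and only meant to mirror the argument. The names `incremental_transducer` and `positive_squares` are illustrative.

```python
from typing import Callable, Iterable, Iterator, List


def incremental_transducer(g: Callable[[List], List], stream: Iterable) -> Iterator[List]:
    """At each step, emit the difference between G(S_{j+1}) and G(S_j)."""
    seen: List = []
    previous: List = []                  # G(S_j), the output produced so far
    for t in stream:
        seen.append(t)
        current = list(g(seen))          # G(S_{j+1}), recomputed for clarity
        if current[:len(previous)] != previous:
            # The earlier output is not a presequence of the new one.
            raise ValueError("g is not monotonic w.r.t. the presequence ordering")
        yield current[len(previous):]    # only the newly added tuples
        previous = current


def positive_squares(s: List[int]) -> List[int]:
    """An example monotonic function: squares of the positive inputs."""
    return [x * x for x in s if x > 0]
```

With a monotonic `g`, concatenating the emitted differences reproduces `g` applied to the input read so far. With a nonmonotonic `g` such as `sorted`, the prefix check fails as soon as an earlier output would have to be revised.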

10 NonBlocking Iff Monotonic
- The theorem generalizes from presequences to sets, i.e., presequences where duplicates are not allowed and order is immaterial.
  - In fact, S1 is a subset of S2 iff S1 is a presequence of S2 after proper reordering and elimination of duplicates.
- NB = monotonic: e.g., selection, projection, and OLAP functions.
- Blocking = nonmonotonic: e.g., traditional aggregates.
- The results hold for operators of more than one argument:
  - Joins are monotonic (i.e., NB) in both arguments.
  - R - S is monotonic on R and antimonotonic on S: i.e., it will block on S but not on R (but it will unblock on R only after it has seen the whole of S!).
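The behavior of R - S can be sketched as follows, assuming S is finite and fully available before any tuple of R is processed. The function name `difference_over_stream` is hypothetical.

```python
from typing import Hashable, Iterable, Iterator


def difference_over_stream(r_stream: Iterable[Hashable], s_complete: Iterable[Hashable]) -> Iterator[Hashable]:
    """Compute R - S tuple by tuple on R, once all of S is known."""
    s = set(s_complete)      # blocks on S: the whole of S is consumed first
    for t in r_stream:
        if t not in s:       # nonblocking on R: each survivor is emitted at once
            yield t
```

Growing R can only add results, while growing S can only remove them. That is why a tuple of R cannot be safely emitted while S is still arriving.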

11 NB-Completeness
- A query language L can express a given set of functions on its input (DB, sequences, data streams).
- Nonmonotonic functions are intrinsically blocking, so they cannot be used on data streams.
- For continuous queries on data streams, we should disallow blocking (i.e., nonmonotonic) operators and constructs and only allow nonblocking (i.e., monotonic) operators: nb-operators for short.
- But can ALL the monotonic functions expressible by L be expressed using only its nb-operators?
- Or did we also lose some monotonic queries?
Definition: L is said to be NB-complete when, using only its NB-operators, it can express all the monotonic queries expressible in L.

12 Expressive Power and NB-Completeness
- Consider a (DB) language L. The expressive power of L is the set of functions $F$ that can be computed on the DB using its operators (or constructs).
- On data streams we are only interested in the monotonic functions $F' \subseteq F$. Also, let $O$ be the operators of L, and $O' \subseteq O$ be the subset of those operators that are monotonic.
- L is said to be NB-complete if all functions in $F'$ can be expressed using only the operators in $O'$.
- NB-completeness is a test that $O$ is as suitable for continuous queries on data streams as it is on the database.
- Say that L is not NB-complete: then there is some monotonic function that L can express on data stored in the DB but can no longer express on the same data presented as a stream (i.e., from a single read of the DB: push model vs. pull model of computation).

13 Is SQL NB-complete?
- eBay example. Auctions: a stream of bids on an item.
    bidStream(Item#, BidValue, Time)
- Items for which the sum of bids is > 100K:
    SELECT Item# FROM bidStream GROUP BY Item# HAVING SUM(BidValue) > 100000;
- This is a monotonic query: assuming bid values are positive, an item that qualifies never stops qualifying. Thus it can be expressed in a language containing suitable query operators, but not in SQL-2.
- So SQL-2 is not nb-complete because of its blocking aggregates, and it is thus ill-suited for continuous queries on data streams.
- What about RA without aggregates?
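A nonblocking evaluation of the bid query might look like the sketch below. It assumes bid values are positive (so a running sum never decreases) and that each bid arrives as an `(item, bid_value, time)` tuple. The function name `items_over_threshold` is illustrative.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Iterator, Set, Tuple


def items_over_threshold(
    bid_stream: Iterable[Tuple[Hashable, int, int]],
    threshold: int = 100_000,
) -> Iterator[Hashable]:
    """Emit an item the first time its running sum of bids exceeds the threshold."""
    totals: Dict[Hashable, int] = defaultdict(int)
    reported: Set[Hashable] = set()
    for item, bid_value, _time in bid_stream:
        totals[item] += bid_value
        if totals[item] > threshold and item not in reported:
            reported.add(item)   # report once; never retracted when bids are positive
            yield item
```

The running sum plays the role of the continuous aggregate: the answer for an item is produced as soon as it is known, without waiting for the end of the bid stream.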

14 Relational Algebra (RA)
- Set difference can produce monotonic queries: are these still expressible without set difference?
- Intersection is monotonic: $R_1 \cap R_2 = R_1 - (R_1 - R_2)$. But intersection can also be expressed as a join (product + select), so it is not lost if we disallow set difference.
- But interval coalescing and Until queries are monotonic queries that can be expressed in RA but not in nb-RA.
- Example: the temporal domain is isomorphic to the nonnegative integers. Intervals are closed on the left and open on the right:
    p(0, 3).  % 0, 1, and 2 are in p but 3 is not
    p(2, 4).  % 3 is not a hole because it is covered by this interval
    p(4, 5).  % 5 is a hole because it is not covered by any other interval
    p(6, 8).

15 Coalesce p (cp) & p Until q
- Facts for p:
    p(0, 3). p(2, 4). p(4, 5). p(6, 8).
- Resulting cp facts:
    cp(0, 3). cp(2, 4). cp(4, 5). cp(6, 8). cp(0, 4). cp(2, 5). cp(0, 5).
- cp contains the intervals from the start point of any p interval to the endpoint of any p interval, unless the endpoint of an interval in between is a hole:
    cp(I1, J2) ← p(I1, J1), p(I2, J2), J1 < J2, ¬hole(I1, J2).
    hole(I1, J2) ← p(I1, J1), p(I2, J2), p(_, K), J1 ≤ K, K < I2, ¬cep(K).
    cep(K) ← p(_, K), p(I, J), I ≤ K, K < J.
- For q(5, _), p Until q holds if cp has an interval that starts at 0 and contains 5:
    p Until q(yes) ← q(0, J).
    p Until q(yes) ← cp(0, I), q(J, _), I ≥ J.
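The rules can be checked on the sample facts with a direct translation into Python set comprehensions. This is a sketch under two assumptions: the comparisons lost in the transcript are read as `<=` and `>=`, and the cp rule is evaluated with `J1 <= J2` rather than the strict `J1 < J2` shown on the slide, since that is the reading under which the original p intervals appear among the listed cp facts. The function names are illustrative.

```python
from typing import Set, Tuple

Interval = Tuple[int, int]

P: Set[Interval] = {(0, 3), (2, 4), (4, 5), (6, 8)}


def covered_endpoints(p: Set[Interval]) -> Set[int]:
    # cep(K) <- p(_, K), p(I, J), I <= K, K < J.
    return {k for (_, k) in p for (i, j) in p if i <= k < j}


def holes(p: Set[Interval]) -> Set[Interval]:
    # hole(I1, J2) <- p(I1, J1), p(I2, J2), p(_, K), J1 <= K, K < I2, not cep(K).
    cep = covered_endpoints(p)
    return {
        (i1, j2)
        for (i1, j1) in p
        for (i2, j2) in p
        for (_, k) in p
        if j1 <= k < i2 and k not in cep
    }


def coalesce(p: Set[Interval]) -> Set[Interval]:
    # cp(I1, J2) <- p(I1, J1), p(I2, J2), J1 <= J2 (assumed), not hole(I1, J2).
    hole = holes(p)
    return {
        (i1, j2)
        for (i1, j1) in p
        for (i2, j2) in p
        if j1 <= j2 and (i1, j2) not in hole
    }


def until(p: Set[Interval], q: Set[Interval]) -> bool:
    # p Until q <- q(0, _).
    # p Until q <- cp(0, I), q(J, _), I >= J.
    cp = coalesce(p)
    starts_at_zero = any(j == 0 for (j, _) in q)
    reached = any(start == 0 and i >= j for (start, i) in cp for (j, _) in q)
    return starts_at_zero or reached
```

Under these assumptions `coalesce(P)` yields the seven cp facts listed on the slide. Both `hole` and `cep` are used under negation, which is where the nonmonotonic operators enter even though the overall query is monotonic.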

16 Relational Algebra
- Nonmonotonic (i.e., blocking) RA operators: set difference and division.
- We are left with select, project, join, and union. Can these express all FO monotonic queries?
- Some interesting temporal queries: coalesce and until.
  - They are expressible in RA (by double negation).
  - They are monotonic.
  - But they cannot be expressed in NB-RA.
Theorem: RA and SQL are not NB-complete.
SQL faces two problems: (i) the exclusion of EXCEPT/NOT EXISTS, and (ii) the exclusion of aggregates.

17 Real Applications Require REAL Power
- SQL's lack of expressive power is a major problem for database-centric applications.
- These problems are significantly more serious for data streams, since:
  - Only monotonic queries can be used.
  - Actually, not even all the monotonic ones, since SQL is not nb-complete.
  - These problems cannot be solved by embedding SQL statements in a PL program (next slide!).

18 Embedding SQL Queries in a PL
- In DB applications, SQL can be embedded in a PL (Java, C++, ...), where the PL accesses the tuples returned by SQL using a 'Get Next of Cursor' statement.
- Operations that could not be expressed in SQL can then be expressed in the PL:
  - an effective remedy for the lack of expressive power of SQL.
- But cursors are a 'pull-based' mechanism and cannot be used on data streams: the DSMS cannot hold tuples until the PL requests them!
- The DSMS can only deliver its output to the PL as a stream.
  - This might be OK for simple situations.
  - But if the core of the work has not been done yet, the PL system must do the actual DSMS work!
- Conclusion: to support applications of any complexity we must have a DSMS with real expressive power,
  - as opposed to DBMS, which are useful even with a weak QL.

19 Real Applications Require Real Power
Embedding CQL in PL programs does not work well... BUT: embedding PL programs in CQL works:
- User-Defined Functions with BLOBs:
  - Good for DBMS, but DSMS require incremental computation.
- User-Defined Aggregates (UDAs):
  - Incremental computation model.
  - Can be defined using a PL or SQL itself.
  - With natively defined UDAs, SQL becomes Turing complete,
  - and NB-complete: it can express all monotonic functions.
  - Simple syntactic characterization for NB aggregates.
  - Effective on a broad range of data-intensive applications: KDD in particular.
  - A few extensions are still needed (more later).

20 Why UDAs are Important
- We have seen how new aggregates can be defined by the initialize, iterate, terminate scheme, using SQL itself (native UDAs) or an external language (C++, Java, etc.).
- SQL with natively defined UDAs is Turing complete.
- With non-blocking UDAs, SQL becomes NB-complete: it can express all monotonic computable functions on a single stream, and also on multiple streams if we introduce a sort-merge operator.
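The initialize, iterate, terminate scheme can be sketched in an external language as below. This is a Python illustration rather than the native SQL syntax of the UDAs discussed in the slides; the class names, the `run_uda` driver, and the convention that each method returns a list of output tuples are all assumptions made for the example.

```python
from typing import Iterable, Iterator, List


class BlockingAvg:
    """Average that returns its result only in terminate (blocking)."""

    def initialize(self, x: float) -> List[float]:
        self.total, self.count = x, 1
        return []

    def iterate(self, x: float) -> List[float]:
        self.total += x
        self.count += 1
        return []

    def terminate(self) -> List[float]:
        return [self.total / self.count]


class ContinuousAvg:
    """Average that returns a result for every input tuple (nonblocking)."""

    def initialize(self, x: float) -> List[float]:
        self.total, self.count = x, 1
        return [self.total / self.count]

    def iterate(self, x: float) -> List[float]:
        self.total += x
        self.count += 1
        return [self.total / self.count]

    def terminate(self) -> List[float]:
        return []   # nothing is held back until the end of the input


def run_uda(uda, stream: Iterable[float]) -> Iterator[float]:
    """Drive a UDA over a stream, yielding whatever each phase returns."""
    first = True
    for x in stream:
        out = uda.initialize(x) if first else uda.iterate(x)
        first = False
        yield from out
    if not first:
        yield from uda.terminate()   # reached only if the input ends
```

In this sketch the nonblocking aggregate is the one whose terminate phase produces nothing, so all of its output is available while the stream is still flowing.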

21 References
- D. Carney, U. Cetintemel, M. Cherniack, C. Convey, S. Lee, G. Seidman, M. Stonebraker, N. Tatbul, and S. Zdonik. Monitoring streams - a new class of data management applications. In VLDB, Hong Kong, China, 2002.
- Yijian Bai, Hetal Thakkar, Chang Luo, Haixun Wang, and Carlo Zaniolo. A Data Stream Language and System Designed for Power and Extensibility. In Proc. of the ACM 15th Conference on Information and Knowledge Management (CIKM'06), 2006.
- Yan-Nei Law, Haixun Wang, and Carlo Zaniolo. Query Languages and Data Models for Database Sequences and Data Streams. In VLDB, pages 492-503, 2004.
- Haixun Wang and Carlo Zaniolo. ATLaS: a native extension of SQL for data mining. In Proceedings of the Third SIAM Int. Conference on Data Mining, pages 130-141, 2003.

22 'La femme fatale' in Disney's cartoons: "I am not really bad... just drawn that way!"
