ISO/IEC JTC1/SC32 WG3:URC-nnn ANSI NCITS H nnn


1 Pattern Matching in Sequences of Rows
March 2, 2007
Change Proposal (for SQL standards)
ISO/IEC JTC1/SC32 WG3:URC-nnn ANSI NCITS H nnn
Authors: Fred Zemke (Oracle), Andrew Witkowski (Oracle), Mitch Cherniak (Streambase), Latha Colby (IBM)
CS240B Notes by: Carlo Zaniolo, Computer Science Department, UCLA

2 Match_Recognize
Inspired by SQL-TS, but more verbose and with more options. For instance:
    *          0 or more matches
    +          1 or more matches
    ?          0 or 1 match
    { n }      exactly n matches
    { n, m }   between n and m (inclusive) matches
    Alternation: indicated by a vertical bar ( | ).
More ...
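The quantifiers read like regular-expression operators applied to rows instead of characters. As a rough illustration only, the Python sketch below assumes that each row has already been labelled with the pattern variable it satisfies (a simplification: in MATCH_RECOGNIZE the conditions are evaluated while matching) and checks a string of labels with the standard re module. The names is_match and PATTERN are hypothetical.

```python
import re

# Assumption: one character per row, naming the variable the row satisfies.
# The pattern A B C* D E* F+ then becomes an ordinary regular expression.
PATTERN = re.compile(r"ABC*DE*F+")


def is_match(labels: str) -> bool:
    """Return True if the whole label sequence fits A B C* D E* F+."""
    return PATTERN.fullmatch(labels) is not None


print(is_match("ABDF"))       # C* and E* match zero rows
print(is_match("ABCCDEEFF"))  # repeated C, E and F rows
print(is_match("ABCD"))       # fails: F+ needs at least one row

# { n, m } and alternation behave the same way as in regular expressions.
print(re.fullmatch(r"A{2,3}(B|C)", "AAAC") is not None)
```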

3 Example
Let Ticker (Symbol, Tstamp, Price) be a table with three columns representing historical stock prices. Symbol is a character column, Tstamp is a timestamp column (for simplicity shown as increasing integers), and Price is a numeric column. We want to partition the data by Symbol, sort it into increasing Tstamp order, and then detect the following pattern in Price: a falling price, followed by a rise in price that goes higher than the price was when the fall began. After finding such patterns, we want to report the starting time, starting price, inflection time (last time during the decline phase), low price, end time, and end price.

4 Example
SELECT a_symbol,
       a_tstamp,       /* start time */
       a_price,        /* start price */
       max_c_tstamp,   /* inflection time */
       last_c_price,   /* low price */
       max_f_tstamp,   /* end time */
       last_f_price,   /* end price */
       matchno
FROM Ticker MATCH_RECOGNIZE (
    PARTITION BY Symbol
    ORDER BY Tstamp
    MEASURES A.Symbol AS a_symbol,
             A.Tstamp AS a_tstamp,
             A.Price AS a_price,
             MAX(C.Tstamp) AS max_c_tstamp,
             LAST(C.Price) AS last_c_price,
             MAX(F.Tstamp) AS max_f_tstamp,
             LAST(F.Price) AS last_f_price,
             MATCH_NUMBER AS matchno
    ONE ROW PER MATCH
    AFTER MATCH SKIP PAST LAST ROW
    MAXIMAL MATCH
    PATTERN (A B C* D E* F+)
    DEFINE /* A defaults to True, matches any row */
           B AS (B.Price < PREV(B.Price)),
           C AS (C.Price <= PREV(C.Price)),
           D AS (D.Price > PREV(D.Price)),
           E AS (E.Price >= PREV(E.Price)),
           F AS (F.Price >= PREV(F.Price) AND F.Price > A.Price)
)
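To make the matching behaviour of the query above concrete, here is a minimal Python sketch for a single partition whose prices are already sorted by Tstamp. It is not part of the proposal: the names Match, match_at and find_matches are hypothetical, and the code hard-wires the pattern A B C* D E* F+ with ONE ROW PER MATCH, AFTER MATCH SKIP PAST LAST ROW and MAXIMAL MATCH instead of implementing a general matcher.

```python
from typing import List, NamedTuple, Optional


class Match(NamedTuple):
    start: int  # index of the A row (start of the fall)
    low: int    # index of the last falling row (last B or C row)
    end: int    # index of the last F row


def match_at(prices: List[float], i: int) -> Optional[Match]:
    """Try to match A B C* D E* F+ with row i playing the role of A."""
    n = len(prices)
    j = i + 1
    # B: the row after A must be strictly lower.
    if j >= n or not prices[j] < prices[j - 1]:
        return None
    j += 1
    # C*: keep going while the price does not increase.
    while j < n and prices[j] <= prices[j - 1]:
        j += 1
    low = j - 1
    # D: the C* loop stopped either at end of input or at a strict rise.
    if j >= n:
        return None
    j += 1
    # E* F+: extend the non-decreasing run as far as possible.
    while j < n and prices[j] >= prices[j - 1]:
        j += 1
    end = j - 1
    # F+ needs at least one row after D, and it must exceed A's price.
    if end == low + 1 or not prices[end] > prices[i]:
        return None
    return Match(i, low, end)


def find_matches(prices: List[float]) -> List[Match]:
    matches = []
    i = 0
    while i < len(prices):
        m = match_at(prices, i)
        if m is None:
            i += 1
        else:
            matches.append(m)
            i = m.end + 1  # AFTER MATCH SKIP PAST LAST ROW
    return matches


print(find_matches([10, 9, 8, 9, 11, 7, 6, 8]))
```

Because the rising run is non-decreasing, its last row carries the highest price, so checking that row against the A price is enough to decide whether any F row exists.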

5 Measures: Naming and renaming
SELECT a_symbol,
       a_tstamp,       /* start time */
       a_price,        /* start price */
       max_c_tstamp,   /* inflection time */
       last_c_price,   /* low price */
       max_f_tstamp,   /* end time */
       last_f_price,   /* end price */
       matchno
FROM Ticker MATCH_RECOGNIZE (
    PARTITION BY Symbol
    ORDER BY Tstamp
    MEASURES A.Symbol AS a_symbol,
             A.Tstamp AS a_tstamp,
             A.Price AS a_price,
             MAX(C.Tstamp) AS max_c_tstamp,
             LAST(C.Price) AS last_c_price,
             MAX(F.Tstamp) AS max_f_tstamp,
             LAST(F.Price) AS last_f_price,
             MATCH_NUMBER AS matchno
    ONE ROW PER MATCH
    AFTER MATCH SKIP PAST LAST ROW
    MAXIMAL MATCH
    PATTERN (A B C* D E* F+)
    DEFINE /* A defaults to True, matches any row */
           B AS (B.Price < PREV(B.Price)),
           C AS (C.Price <= PREV(C.Price)),
           D AS (D.Price > PREV(D.Price)),
           E AS (E.Price >= PREV(E.Price)),
           F AS (F.Price >= PREV(F.Price) AND F.Price > A.Price)
)

6
SELECT a_symbol,
       a_tstamp,       /* start time */
       a_price,        /* start price */
       max_c_tstamp,   /* inflection time */
       last_c_price,   /* low price */
       max_f_tstamp,   /* end time */
       last_f_price,   /* end price */
       matchno
FROM Ticker MATCH_RECOGNIZE (
    PARTITION BY Symbol
    ORDER BY Tstamp
    MEASURES A.Symbol AS a_symbol,
             A.Tstamp AS a_tstamp,
             A.Price AS a_price,
             MAX(C.Tstamp) AS max_c_tstamp,
             LAST(C.Price) AS last_c_price,
             MAX(F.Tstamp) AS max_f_tstamp,
             LAST(F.Price) AS last_f_price,
             MATCH_NUMBER AS matchno
    ONE ROW PER MATCH
    AFTER MATCH SKIP PAST LAST ROW
    MAXIMAL MATCH
    PATTERN (A B C* D E* F+)
    DEFINE /* A defaults to True, matches any row */
           B AS (B.Price < PREV(B.Price)),
           C AS (C.Price <= PREV(C.Price)),
           D AS (D.Price > PREV(D.Price)),
           E AS (E.Price >= PREV(E.Price)),
           F AS (F.Price >= PREV(F.Price) AND F.Price > A.Price)
)
Define the pattern and the conditions that must be satisfied in each state of the pattern. No condition on A.

7
SELECT a_symbol,
       a_tstamp,       /* start time */
       a_price,        /* start price */
       max_c_tstamp,   /* inflection time */
       last_c_price,   /* low price */
       max_f_tstamp,   /* end time */
       last_f_price,   /* end price */
       matchno
FROM Ticker MATCH_RECOGNIZE (
    PARTITION BY Symbol
    ORDER BY Tstamp
    MEASURES A.Symbol AS a_symbol,
             A.Tstamp AS a_tstamp,
             A.Price AS a_price,
             MAX(C.Tstamp) AS max_c_tstamp,
             LAST(C.Price) AS last_c_price,
             MAX(F.Tstamp) AS max_f_tstamp,
             LAST(F.Price) AS last_f_price,
             MATCH_NUMBER AS matchno
    ONE ROW PER MATCH
    AFTER MATCH SKIP PAST LAST ROW
    MAXIMAL MATCH
    PATTERN (A B C* D E* F+)
    DEFINE /* A defaults to True, matches any row */
           B AS (B.Price < PREV(B.Price)),
           C AS (C.Price <= PREV(C.Price)),
           D AS (D.Price > PREV(D.Price)),
           E AS (E.Price >= PREV(E.Price)),
           F AS (F.Price >= PREV(F.Price) AND F.Price > A.Price)
)
Options:
    { ONE ROW | ALL ROWS } PER MATCH
    { MAXIMAL | INCREMENTAL } MATCH
    AFTER MATCH SKIP { TO NEXT ROW | PAST LAST ROW | TO LAST <variable> | TO FIRST <variable> }

8 ALL ROWS PER MATCH: one row for each row in the pattern.
SELECT T.Symbol,         /* row's symbol */
       T.Tstamp,         /* row's time */
       T.Price,          /* row's price */
       T.classy,         /* row's classifier */
       T.a_tstamp,       /* start time */
       T.a_price,        /* start price */
       T.max_c_tstamp,   /* inflection time */
       T.last_c_price,   /* low price */
       T.max_f_tstamp,   /* end time */
       T.last_f_price    /* end price */
FROM Ticker MATCH_RECOGNIZE (
    PARTITION BY Symbol
    ORDER BY Tstamp
    MEASURES A.Symbol AS a_symbol,
             A.Tstamp AS a_tstamp,
             A.Price AS a_price,
             MAX(C.Tstamp) AS max_c_tstamp,
             LAST(C.Price) AS last_c_price,
             MAX(F.Tstamp) AS max_f_tstamp,
             LAST(F.Price) AS last_f_price,
             MATCH_NUMBER AS matchno,
             CLASSIFIER AS classy
    ALL ROWS PER MATCH
    AFTER MATCH SKIP PAST LAST ROW
    MAXIMAL MATCH
    PATTERN (A B C* D E* F+)
    DEFINE /* A defaults to True, matches any row */
           B AS (B.Price < PREV(B.Price)),
           C AS (C.Price <= PREV(C.Price)),
           D AS (D.Price > PREV(D.Price)),
           E AS (E.Price >= PREV(E.Price)),
           F AS (F.Price >= PREV(F.Price) AND F.Price > A.Price)
) T
In addition to the partitioning, ordering, and measure columns, we can reference other columns (via T).
CLASSIFIER is a component that may be used to declare a character result column whose content on each row is the name of the variable that the row matched.

9 Syntactic Sugar
Variables can be repeated in the pattern clause.
SUBSET: to rename a set of variables.
A portion of the pattern can be excluded (when returning all rows).
A special construct defines alternations obtained as permutations of variables.

10 Singletons and group variables
FROM Ticker MATCH_RECOGNIZE (
    PARTITION BY symbol
    ORDER BY tstamp
    MEASURES FIRST(A.tstamp) AS a_firsttime,
             LAST(D.tstamp) AS d_lasttime,
             AVG(B.price) AS b_avgprice,
             AVG(D.price) AS d_avgprice
    PATTERN ( A B+ C+ D )
    DEFINE A AS A.price > 100,
           B AS B.price > A.price,
           C AS C.price < AVG(B.price),
           D AS D.price > PREV(D.price)
)
If a variable is a singleton, then only individual columns may be referenced, not aggregates. If the variable is used in an aggregate, then the aggregate is performed over all rows that have matched the variable so far. When a variable is referenced in an aggregate in its own definition, we can either construe this as providing running aggregates with no special syntax, or continue to require special syntax to highlight that a running aggregate is meant.

11 More
ALL ROWS PER MATCH only: CLASSIFIER is used to specify the name of a character string column, called the classifier column. In each row of output, the classifier column is set to the variable name in the PATTERN that the row matched.
MATCH_NUMBER: Matches within a partition are numbered sequentially starting with 1, in the order in which they are chosen, as described in the previous section. The MATCH_NUMBER component is used to specify a column name for an extra column of output from the MATCH_RECOGNIZE construct. The extra column is an exact numeric with scale 0, and provides the match number within a partition: 1 for the first match, 2 for the second, etc.
FIRST and LAST: special aggregates for group variables.

12 Windows
SELECT sum_yprice OVER W, x_time OVER W, AVG(Y.Price)
FROM T
WINDOW W AS (
    PARTITION BY .. ORDER BY ..
    MEASURES SUM(Y.price) AS sum_yprice,
             X.time AS x_time
    (PATTERN (X Y+ Z) ...)
)

13 Some Queries
Task 1.1. Assume that you have the following temporal table: emp(Eno, Project, Tstart, Tend), denoting periods of time during which an employee has worked on a project. The closed intervals denoting these periods could overlap, so you need to coalesce them into maximal periods. Suggestion: sort all events in a sequence, then use SQL-MR to do the actual coalescing and reconstruct the original table with the intervals coalesced.
Task 1.2. Sensors have detected the locations of objects at certain times, giving items(itemNo, SensorNo, Time). Write an SQL-MR query to detect objects that are going around in a cycle, i.e., they have returned to the same location within one day. Many objects do not move fast, so a sensor might produce consecutive readings of the same object even if it is not in a cycle.
Task 1.3. Same as 1.2, but in SQL-TS.

14 Coalescing emp(Eno, Project, Tstart, Tend)
Several overlapping intervals for each employee and project.
SELECT c_Eno, c_Project, first_Tstart, max_Tend
FROM emp MATCH_RECOGNIZE (
    PARTITION BY Eno, Project
    ORDER BY Tstart
    MEASURES Z.Eno AS c_Eno,
             Z.Project AS c_Project,
             Z.Tstart AS c_Tstart,
             FIRST(Z.Tstart) AS first_Tstart,
             MAX(Z.Tend) AS max_Tend
    ONE ROW PER MATCH
    AFTER MATCH SKIP PAST LAST ROW
    MAXIMAL MATCH
    PATTERN (Z+)
    DEFINE Z AS (c_Tstart <= max_Tend)
)
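The coalescing logic behind Task 1.1 can be checked outside SQL. The Python sketch below is an illustration under assumptions, not the proposal's method: it handles the intervals of one (Eno, Project) partition, represents each interval as a (Tstart, Tend) pair of numbers, and uses a hypothetical function name, coalesce. Since the intervals are closed, an interval that starts exactly where the current one ends is merged into it.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (Tstart, Tend), closed on both sides


def coalesce(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping closed intervals into maximal periods."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):  # plays the role of ORDER BY Tstart
        if merged and start <= merged[-1][1]:
            # Overlaps the running period: extend its end if needed.
            first_start, max_end = merged[-1]
            merged[-1] = (first_start, max(max_end, end))
        else:
            # Gap found: start a new maximal period.
            merged.append((start, end))
    return merged


print(coalesce([(1, 3), (2, 5), (7, 8), (8, 9)]))
```

Each output pair corresponds to one match of Z+, with the first element playing the role of first_Tstart and the second that of max_Tend.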

15 Cycles (task 1.2)
Sensors: items(itemNo, SensorNo, Time)
SELECT T.itemNo, T.SensorNo, T.Time
FROM items MATCH_RECOGNIZE (
    PARTITION BY ItemNo
    ORDER BY Time
    MEASURES A.SensorNo AS A_SensorNo,
             Z.SensorNo AS Z_SensorNo,
             B.SensorNo AS B_SensorNo
    ONE ROW PER MATCH
    AFTER MATCH SKIP PAST LAST ROW
    PATTERN (A+ Z+ B)
    DEFINE Z AS (Z_SensorNo <> A_SensorNo),
           B AS (B_SensorNo = A_SensorNo)
) AS T
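For comparison, the cycle test of Task 1.2 can be sketched procedurally. The Python code below is only an illustration under stated assumptions: it looks at the readings of a single item, already ordered by time, with times given in hours; it measures the one-day limit from the last reading at a sensor before the item left it (the task statement does not say where the day starts); and the names has_cycle and window are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Reading = Tuple[float, str]  # (time in hours, sensor id), sorted by time


def has_cycle(readings: List[Reading], window: float = 24.0) -> bool:
    """True if the item leaves a sensor and returns to it within `window`."""
    last_seen: Dict[str, float] = {}  # latest reading time per sensor
    prev_sensor: Optional[str] = None
    for time, sensor in readings:
        # Consecutive readings at the same sensor are not a cycle, so a
        # return only counts when the previous reading was elsewhere.
        if sensor != prev_sensor and sensor in last_seen:
            if time - last_seen[sensor] <= window:
                return True
        last_seen[sensor] = time
        prev_sensor = sensor
    return False


print(has_cycle([(0, "s1"), (1, "s1"), (5, "s2"), (20, "s1")]))
print(has_cycle([(0, "s1"), (1, "s1"), (2, "s1")]))
```

Skipping consecutive readings from the same sensor mirrors the role of A+ in the pattern: slow-moving objects produce repeated readings that must not be reported as cycles.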

16 Task SQL-TS
SELECT A.symbol,
       A.tstamp,        /* start time */
       A.price,         /* start price */
       MAX(C.Tstamp),   /* inflection time */
       LAST(C.Price),   /* low price */
       MAX(F.Tstamp),   /* end time */
       LAST(F.Price)    /* end price */
FROM Ticker AS (A B C* D E* F+)
    CLUSTER BY Symbol
    SEQUENCE BY Tstamp
% ONE ROW PER MATCH
% AFTER MATCH SKIP PAST LAST ROW
% MAXIMAL MATCH
WHERE B.price < PREV(B.price)
  AND C.price <= PREV(C.price)
  AND D.Price > PREV(D.price)
  AND E.Price >= PREV(E.Price)
  AND F.price <= A.price
  F.Price >= PREV(F.price) AND F.price > A.price
The green condition of SQL-MR must now be replaced with the blue one.

17 Blocking and Non-Blocking Queries
Blocking (fully): no result is returned until the end of the input is detected, e.g., sum and count.
Blocking (partially): some results are returned only at the end, while others can be returned early, e.g., coalescing.
Non-Blocking (NB): all results are returned before the end is detected.
Claims:
    Maximal Match patterns ending with a plus or a star are not NB in general (i.e., some are but others are not).
    Patterns with a different ending are NB.
    All patterns that are not Maximal Match are also NB.
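The first claim can be illustrated with a small Python generator. This is a sketch under assumptions, not an implementation of the proposal: it handles only a one-variable pattern X+ under maximal-match semantics, the row condition is passed in as a predicate, and the name maximal_runs is hypothetical. Each result is paired with the number of rows that had been read when the match could be reported.

```python
from typing import Callable, Iterable, Iterator, List, Tuple


def maximal_runs(
    stream: Iterable[float], pred: Callable[[float], bool]
) -> Iterator[Tuple[int, List[float]]]:
    """Yield (rows_read, match) for each maximal match of X+ in the stream."""
    run: List[float] = []
    rows_read = 0
    for row in stream:
        rows_read += 1
        if pred(row):
            run.append(row)  # the match may still grow: nothing to emit yet
        elif run:
            # A failing row proves the run is maximal, so it can be emitted.
            yield rows_read, run
            run = []
    if run:
        # Only the end of the input closes this match: the blocking case.
        yield rows_read, run


for rows_read, match in maximal_runs([5, 6, 1, 7, 8], lambda p: p > 4):
    print(rows_read, match)
```

The first match is reported early, once a non-matching row arrives, whereas the last one has to wait for the end of the input. This is why a maximal-match pattern ending in a plus is only partially blocking here rather than non-blocking.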

18 Conclusions
Specs proposed by 2 DBMS vendors (Oracle & IBM) and 2 DSMS startups (Coral8 and Streambase).
Very powerful: capabilities of SQL-TS plus several new constructs of convenience, particularly in controlling output.
Optimization techniques developed for SQL-TS could also be critical here.

