# BUILDING A DATABASE SYSTEM FOR ORDER New England Database Seminars April 2002 Alberto Lerner – ENST Paris Dennis Shasha – NYU


BUILDING A DATABASE SYSTEM FOR ORDER New England Database Seminars April 2002 Alberto Lerner – ENST Paris Dennis Shasha – NYU {lerner,shasha}@cs.nyu.edu

NEDS April 2002 – Lerner and Shasha

## Agenda
- Motivation
- SQL + Order
- Transformations
- Conclusion

## Motivation
- The need for ordered data: some queries rely on order.
- Examples: moving averages, top N, rank.
- "SQL can handle it." Can it really?

## Motivation: Moving Averages
Computing a moving average is algorithmically linear.

Sales(month, total)

    SELECT t1.month + 1 AS forecastMonth,
           (t1.total + t2.total + t3.total) / 3 AS 3MonthMovingAverage
    FROM Sales AS t1, Sales AS t2, Sales AS t3
    WHERE t1.month = t2.month - 1
      AND t1.month = t3.month - 2

Can the optimizer make a 3-way (in general, n-way) join run in linear time?

Ref: Trueblood and Lovett, Data Mining and Statistical Analysis Using SQL, Apress, 2001.

## Motivation: Top N
Employee(Id, salary)

    SELECT DISTINCT count(*), t1.salary
    FROM Employee AS t1, Employee AS t2
    WHERE t1.salary <= t2.salary
    GROUP BY t1.salary
    HAVING count(*) <= N

For each t1.salary, the query counts how many elements of the cross-product have salaries at least as large as t1.salary. Will the optimizer see the essential sort-count trick?

Ref: Joe Celko, SQL for Smarties, Morgan Kaufmann, 1995.
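The sort-count trick the slide alludes to can be sketched outside SQL. The Python below is a minimal illustration, not the authors' implementation: the function names and sample salaries are hypothetical, and salaries are assumed to be distinct so that the two formulations agree.

```python
def top_n_by_pair_count(salaries, n):
    """Quadratic formulation mirroring the self-join: keep a salary when
    at most n salaries are at least as large as it."""
    kept = []
    for s in salaries:
        at_least_as_large = sum(1 for t in salaries if s <= t)
        if at_least_as_large <= n:
            kept.append(s)
    return sorted(kept, reverse=True)


def top_n_by_sort(salaries, n):
    """Sort once (n log n), then take the first n values."""
    return sorted(salaries, reverse=True)[:n]


salaries = [50, 70, 30, 90, 60]  # hypothetical data, all distinct
print(top_n_by_pair_count(salaries, 3))
print(top_n_by_sort(salaries, 3))
```

Both functions return the same salaries here, but the first compares every pair of rows while the second needs only one sort.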

## Motivation: Problems Extending SQL with Order
- Queries are hard to read.
- Cost of execution is often non-linear (would not pass a basic algorithms course).
- Few operators preserve order, so optimization is hard.


## SQL + Order: Desirable Features
- Express order-dependent predicates and clauses in a readable, clear way.
- Make optimization opportunities explicit (by getting rid of complex idioms, see above).
- Execute in linear (or n log n) time when possible.

## SQL + Order: Three Steps in the Solution
1. Give SQL a vector-oriented semantics. The database is a set of array-tables ("arrables"); variables in queries no longer refer to a single tuple at a time but to a whole column vector.
2. Provide new vector-to-vector functions that support order-based manipulations of column vectors.
3. Streaming: new data may need special treatment.

## SQL + Order: Moving Averages
Sales(month, total)

    SELECT month, avgs(8, total)
    FROM Sales
    ASSUMING ORDER month

Execution (Sales is an arrable):
1. FROM clause: enforces the order given in the ASSUMING clause.
2. SELECT clause: for each month, yields the moving average (window size 8) ending at that month. No 8-way join.

Callouts: avgs is a vector-to-vector function, order-dependent and size-preserving. ASSUMING ORDER gives the order to be used by vector-to-vector functions.
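A single-pass version of such a moving average might look like the Python sketch below. It is illustrative only: the slides do not say how avgs treats the first positions, where fewer than a full window of values exists, so averaging over the values seen so far is an assumption, as are the sample rows.

```python
from collections import deque


def avgs(window, values):
    """Size-preserving moving average: position i holds the mean of the
    window ending at i. Early positions average the values seen so far
    (an assumption; the slides do not specify this case)."""
    out = []
    recent = deque()
    running = 0.0
    for v in values:
        recent.append(v)
        running += v
        if len(recent) > window:
            running -= recent.popleft()
        out.append(running / len(recent))
    return out


# ASSUMING ORDER month: sort the rows first, then apply avgs to the column.
sales = [(3, 9), (1, 3), (2, 6), (4, 12)]  # hypothetical (month, total) rows
totals = [total for _, total in sorted(sales)]
print(avgs(3, totals))
```

Each value enters and leaves the window once, so the cost is linear in the number of rows after the sort.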

## SQL + Order: Top N
Employee(ID, salary)

    SELECT first(N, salary)
    FROM Employee
    ASSUMING ORDER Salary

first is a vector-to-vector function, order-dependent and non size-preserving.

Execution:
1. FROM clause: orders the arrable by Salary.
2. SELECT clause: applies first() to the 'salary' vector, yielding the first N values of that vector in the given order.

The top-earning IDs could be obtained with first(N, ID).

## SQL + Order: Ranking
SalesReport(salesPerson, territory, total)

    SELECT territory, salesPerson, total, rank(total)
    FROM SalesReport
    WHERE rank(total) < N

rank is a vector-to-vector function, non order-dependent and size-preserving.

Execution:
1. FROM clause: ASSUMING is NOT needed.
2. rank is applied to the 'total' vector and maps each position to an integer.
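To make "maps each position to an integer" concrete, here is a small Python sketch. The slides do not fix the numbering convention, so treating rank 0 as the largest total is an assumption, and the sample totals are hypothetical.

```python
def rank(values):
    """Return a vector of the same length in which position i holds the
    rank of values[i]. Assumption: rank 0 is the largest value."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0] * len(values)
    for position, i in enumerate(order):
        ranks[i] = position
    return ranks


totals = [200, 500, 100, 300]  # hypothetical 'total' column, in any order
print(rank(totals))
```

The output has one entry per input row (size-preserving), and each row's rank depends only on the values, not on the order in which the rows arrive.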

## SQL + Order: Vector-to-Vector Functions

| | Size-preserving | Non size-preserving |
|---|---|---|
| Order-dependent | `prev`, `next`, `$`, `[]`, `avgs(*)`, `prds(*)`, `sums(*)`, `deltas(*)`, `ratios(*)`, `reverse`, … | `drop`, `first`, `last` |
| Non order-dependent | `rank`, `tile` | `min`, `max`, `avg`, `count` |

## SQL + Order: Complex Queries: Best Spread
In a given day, what would be the maximum difference between a buying and selling point of each security?

Ticks(ID, price, tradeDate, timestamp, …)

    SELECT ID, max(price - mins(price))
    FROM Ticks
    ASSUMING ORDER timestamp
    WHERE tradeDate = '99/99/99'
    GROUP BY ID

Execution:
1. For each security, compute the running minimum vector for price and subtract it from the price vector itself; the result is a vector of spreads.
2. Note that max - min would overstate the spread, because the maximum may occur before the minimum.

Figure labels: max, min, running min, best spread.
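The running-minimum computation can be sketched in Python as follows. This is an illustration for a single security whose prices are already in timestamp order; the sample prices are hypothetical, and the price list is assumed to be non-empty.

```python
def mins(values):
    """Running minimum: position i holds the smallest value seen up to i."""
    out = []
    current = None
    for v in values:
        current = v if current is None else min(current, v)
        out.append(current)
    return out


def best_spread(prices):
    """Largest gain from buying at some tick and selling at that tick or later."""
    return max(p - m for p, m in zip(prices, mins(prices)))


prices = [10, 12, 7, 9, 8]  # hypothetical ticks in timestamp order
print(mins(prices))
print(best_spread(prices))
print(max(prices) - min(prices))
```

In this sample the highest price occurs before the lowest one, so `max(prices) - min(prices)` is larger than any spread that can actually be realized.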

## SQL + Order: Complex Queries: Crossing Averages, Part I
When does the 21-day average cross the 5-month average?

Market(ID, closePrice, tradeDate, …)
TradedStocks(ID, Exchange, …)

    INSERT INTO temp FROM
    SELECT ID, tradeDate,
           avgs(21 days, closePrice) AS a21,
           avgs(5 months, closePrice) AS a5,
           prev(avgs(21 days, closePrice)) AS pa21,
           prev(avgs(5 months, closePrice)) AS pa5
    FROM TradedStocks NATURAL JOIN Market
    ASSUMING ORDER tradeDate
    GROUP BY ID

## SQL + Order: Complex Queries: Crossing Averages, Part I (Execution)
1. FROM clause: order-preserving join.
2. GROUP BY clause: groups are defined by the value of the ID column.
3. SELECT clause: functions are applied; non-grouped columns become vector fields so that the target cardinality is met. This violates first normal form.

Figure labels: groups in ID and non-grouped column; grouped ID and non-grouped column; vector field; two columns with the same cardinality.

## SQL + Order: Complex Queries: Crossing Averages, Part II
Get the result from the non-first-normal-form relation temp.

    SELECT ID, tradeDate
    FROM flatten(temp)
    WHERE a21 > a5 AND pa21 <= pa5

Execution:
1. FROM clause: flatten transforms temp into a first normal form relation (for each row r, every vector field in r MUST have the same cardinality). It could have been placed at the end of the previous query.
2. Standard query processing after that.
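The crossing test `a21 > a5 AND pa21 <= pa5` can be illustrated with a short Python sketch for one security. The behavior of prev at the first position is not given in the slides, so using `None` there (and skipping that row) is an assumption; the dates and averages are hypothetical.

```python
def prev(values):
    """Shift a vector by one position; the first slot has no predecessor
    (represented as None here, which is an assumption)."""
    return [None] + list(values[:-1])


def crossings(dates, short_avg, long_avg):
    """Dates on which the short average rises above the long average."""
    rows = zip(dates, short_avg, long_avg, prev(short_avg), prev(long_avg))
    return [d for d, s, l, ps, pl in rows
            if ps is not None and s > l and ps <= pl]


dates = [1, 2, 3, 4, 5]            # hypothetical trade dates
a21 = [1.0, 2.0, 4.0, 3.0, 5.0]    # hypothetical short averages
a5 = [3.0, 3.0, 3.0, 3.5, 4.0]     # hypothetical long averages
print(crossings(dates, a21, a5))
```

Because prev is just a shifted copy of the column, the comparison needs one pass over the ordered rows and no self-join.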

## SQL + Order: Related Work: Research
- SEQUIN (Seshadri et al.): sequences are first-class objects. Difficult to mix tables and sequences.
- SRQL (Ramakrishnan et al.): elegant algebra and language. No work on transformations.
- SQL-TS (Sadri et al.): language for finding patterns in sequences. But not everything is a pattern!

## SQL + Order: Related Work: Products
- RISQL (Red Brick): some vector-to-vector, order-dependent, size-preserving functions. Low-hanging fruit approach to language design.
- Analysis Functions (Oracle 9i): quite complete set of vector-to-vector functions. But they can only be used in the SELECT clause, and optimization is poor (our preliminary study).
- KSQL (Kx Systems): arrable extension to SQL, but syntactically incompatible. No cost-based optimization.


## Transformations: Early Sorting + Order-Preserving Operators

    SELECT ts.ID, ts.Exchange, avgs(10, hq.ClosePrice)
    FROM TradedStocks AS ts NATURAL JOIN HistoricQuotes AS hq
    ASSUMING ORDER hq.TradeDate
    GROUP BY Id

Alternative plans:
1. Sort, then join preserving order.
2. Preserve existing order.
3. Join, then sort before grouping.
4. Join, then sort after grouping.

Figure: four plan trees built from the operators sort, op, g-by, and avgs.

Transformations Early sorting + order preserving operators

## Transformations: UDF Evaluation Order
Gene(geneId, seq)

    SELECT t1.geneId, t2.geneId, dist(t1.seq, t2.seq)
    FROM Gene AS t1, Gene AS t2
    WHERE dist(t1.seq, t2.seq) < 5
      AND posA(t1.seq, t2.seq)

posA asks whether two sequences have nucleotide A in the same position. dist gives the edit distance between two sequences.

Figure: plans (1), (2), and (3), built from the operators posA and dist.

Switch dynamically between (1) and (2) depending on the execution history.

Transformations UDFs Evaluation Order

## Transformations: Order-Preserving Joins

    select lineitem.orderid, avgs(10, lineitem.qty), lineitem.lineid
    from order, lineitem
    assuming order lineid
    where order.date > 45
      and order.date < 55
      and lineitem.orderid = order.orderid

- Basic strategy 1: restrict order based on date. Create a hash on order. Run through lineitem, performing the join and pulling out the qty.
- Basic strategy 2: arrange for lineitem.orderid to be an index into order. Then restrict order based on date, giving a bit vector. The bit vector, indexed by lineitem.orderid, gives the relevant lineitem rows. The relevant order rows are then fetched using the surviving lineitem.orderid values.

Strategy 2 is often 3-10 times faster.
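Strategy 2 can be sketched in Python under a simplifying assumption: each orderid is the array position of its row in order, so a lookup is a plain index. The function name, tuple layout, and sample rows are hypothetical and are not taken from the slides.

```python
def join_with_bit_vector(order_dates, lineitems, lo, hi):
    """order_dates[i] is the date of the order whose orderid is i.
    lineitems is a list of (lineid, orderid, qty) tuples in lineid order.
    Returns the (orderid, qty, lineid) rows whose order date lies strictly
    between lo and hi, still in lineid order."""
    # Restrict order on date once, producing one bit per order row.
    keep = [lo < d < hi for d in order_dates]
    # Index the bit vector by lineitem.orderid; lineitem is scanned in
    # its existing order, so no re-sort is needed afterwards.
    return [(orderid, qty, lineid)
            for lineid, orderid, qty in lineitems
            if keep[orderid]]


order_dates = [40, 50, 60, 47]                      # hypothetical
lineitems = [(1, 0, 5), (2, 1, 7), (3, 3, 2),
             (4, 2, 9), (5, 1, 4)]                  # hypothetical
print(join_with_bit_vector(order_dates, lineitems, 45, 55))
```

The surviving rows come out in lineid order, which is what a subsequent `avgs(10, qty)` under `assuming order lineid` would need.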

## Transformations: Building Blocks
- Order optimization: Simmen et al. '96 (push-down of sorts over joins, and combining and avoiding sorts).
- Order-preserving operators: KSQL (joins on vectors); Claussen et al. '00 (order-preserving hash-based join).
- Push-down of aggregating functions: Chaudhuri and Shim '94, Yan and Larson '94 (evaluate aggregation before joins).
- UDF evaluation: Hellerstein and Stonebraker '93 (evaluate each UDF according to its ((output/input) - 1) / cost per tuple); Porto et al. '00 (take correlation into account).
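The ((output/input) - 1) / cost per tuple metric can be illustrated with a short Python sketch that orders predicates by ascending value of that metric. The selectivities and costs below are made-up numbers for the posA and dist predicates, not measurements from the talk.

```python
def predicate_rank(selectivity, cost_per_tuple):
    """((output/input) - 1) / cost per tuple; selectivity is output/input."""
    return (selectivity - 1.0) / cost_per_tuple


def order_predicates(predicates):
    """Sort (name, selectivity, cost_per_tuple) triples by ascending rank,
    so cheap and selective predicates are evaluated first."""
    return sorted(predicates, key=lambda p: predicate_rank(p[1], p[2]))


# Hypothetical statistics: posA is cheap, dist is selective but expensive.
predicates = [("dist", 0.1, 100.0), ("posA", 0.5, 1.0)]
print([name for name, _, _ in order_predicates(predicates)])
```

With these numbers posA is evaluated first. Re-estimating the statistics from execution history could change the order, which is the kind of dynamic switching the UDF slide describes.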


## Conclusion
- The arrable-based approach to ordered databases may be scary (dependency on order, vector-to-vector functions), but it is expressive and fast.
- An SQL extension that includes order is possible and reasonably simple.
- Optimization possibilities are vast.
