
1 Adaptively Processing Remote Data and Learning Source Mappings
Zachary G. Ives, University of Pennsylvania
CIS 650 – Database & Information Systems, March 14, 2005
LSD slides courtesy of AnHai Doan

2 Administrivia
Midterm due 3/16
 5–10 pages (single-spaced, 10–12 pt)
 If you haven’t told me which topic, please do so now!

3 Today’s Trivia Question

4 Many Motivations for Adaptive Query Processing
Many domains where cost-based query optimization fails:
 Complex queries in traditional databases: estimation error grows exponentially with the number of joins [IC91]; the focus of [KD98], [M+04]
 Querying over the Internet: unpredictable access rates and delays
 Querying external data sources: limited information available about the properties of the source
Monitor real-world conditions and adapt the processing strategy in response

5 Generalizing Adaptive Query Processing
 We’ve seen a range of different adaptive techniques
 How do they fit together?
 Can we choose points between eddies and mid-query re-optimization? Can we exploit other kinds of query optimization “tricks”?

6 Popular Types of Adaptive Query Processing
Adaptive scheduling (query scrambling [UF98]; dynamic rescheduling [UF01]; pipelined hash join [UF00][RS86][I+99][HH99])
 Changes CPU scheduling to improve feedback or reduce delays
 Can’t reduce total work
Redundant computation (competitive execution [AZ96])
 Compares two or more ways of executing the query
 Needs to identify a few promising plans
Plan partitioning (INGRES [S+76]; mid-query re-optimization [KD98][I+99][M+04])
 Breaks the plan into stages; re-optimizes future stages as necessary
 Coarse granularity; breaks pipelining
Are these the only options?

7 Two More Forms of Adaptivity
Adaptive data partitioning ([AH00][R+03][DH04][I+04])
 Break the data into subsets; use a different plan for each subset
 Generalizes intra-plan reordering in an SPJGU query
 The only way to reduce overall computation with fine granularity
 The only previous implementation has been eddies [AH00][R+03][DH04]
Adaptive information passing
 Extends sideways information passing (“magic sets”) to an adaptive context, both intra- and inter-plan
 Reduces the computation and space devoted to non-productive tuples

8 Eddies Combine Adaptive Scheduling and Data Partitioning Decisions
Intuitively, each tuple gets its own query plan
 Route to the next operator based on the speed and selectivity of each operator
 Elegant and simple to implement
But performing a join creates subresults at the next level! Local, greedy choices may result in state that needs to join with all future data!
Consider the long-term effects of decisions before making them – separate CPU scheduling from plan selection

9 Focusing Purely on Adaptive Data Partitioning
Use adaptively scheduled operators to “fill CPU cycles”
Now a query optimizer problem: choose a plan that minimizes long-term cost (in CPU cycles)
To allow multiple plans, distribute union through join (and select, project, etc.). If R₁ = R₁¹ ∪ R₁² and R₂ = R₂¹ ∪ R₂², then:
R₁ ⋈ R₂ = (R₁¹ ∪ R₁²) ⋈ (R₂¹ ∪ R₂²) = (R₁¹ ⋈ R₂¹) ∪ (R₁² ⋈ R₂²) ∪ (R₁¹ ⋈ R₂²) ∪ (R₁² ⋈ R₂¹)
This generalizes to n joins and to the other SPJ + GU operators…
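To make the identity concrete, here is a minimal SQL sketch, assuming hypothetical partition tables R1_p1, R1_p2, R2_p1, R2_p2 and a join attribute k (these names are illustrative, not from the talk):

  -- R1 ⋈ R2 as the union of the four partition-pair joins,
  -- where R1 = R1_p1 ∪ R1_p2 and R2 = R2_p1 ∪ R2_p2.
  SELECT r.*, s.* FROM R1_p1 r JOIN R2_p1 s ON r.k = s.k  -- phase 0's plan
  UNION ALL
  SELECT r.*, s.* FROM R1_p2 r JOIN R2_p2 s ON r.k = s.k  -- phase 1's plan
  UNION ALL
  SELECT r.*, s.* FROM R1_p1 r JOIN R2_p2 s ON r.k = s.k  -- cross terms:
  UNION ALL                                               -- the "stitch-up"
  SELECT r.*, s.* FROM R1_p2 r JOIN R2_p1 s ON r.k = s.k;

Each branch can be computed by a different physical plan, which is exactly what lets ADP switch plans mid-flight.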

10 Adaptive Data Partitioning: Routing Data across Different Plans
[Diagram: a query R ⋈ S ⋈ T processed in phases. Phase 0 joins partitions R₀, S₀, T₀ under one plan; phase 1 joins R₁, S₁, T₁ under another; the remaining cross-partition combinations must still be computed, excluding R₀S₀T₀ and R₁S₁T₁, which the phases already produced]
Options for combining across phases:
 New results always injected into the old plan
 Old results injected into the new plan
 Wait until the end – a “stitch-up” plan based on the best statistics
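As a hedged illustration (not the system’s actual stitch-up-join operator), the exclusion can be mimicked in SQL by tagging each tuple with the phase that processed it; the tables Rp, Sp, Tp, their phase columns, and the join attributes a and b are hypothetical:

  -- Join everything, but skip combinations where all three tuples came
  -- from the same phase: those results were already computed by that
  -- phase's own plan.
  SELECT r.*, s.*, t.*
  FROM Rp r
  JOIN Sp s ON r.a = s.a
  JOIN Tp t ON s.b = t.b
  WHERE NOT (r.phase = s.phase AND s.phase = t.phase);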

11 Special Architectural Features for ADP
Monitoring and re-optimization thread runs alongside execution:
 System-R-like optimizer with aggregation support; uses the most current selectivity estimates
 Periodic monitoring and re-optimization revises selectivity estimates and recomputes expected costs
Query execution with “smart router” operators
Special support for efficient stitch-up plans:
 Uses intermediate results from previous plans (a special case of answering queries using views [H01])
 Join-over-union (“stitch-up-join”) operator that excludes certain results

12 ADP Application 1: Correcting Cost Mis-estimates
Goal: react to plans that are obviously bad
 Don’t spend cycles searching for a slightly better plan
 Try to avoid paths that are unlikely to be promising
The monitor/re-optimizer thread watches the cardinalities of subresults
 Re-estimates plan cost and compares it to the projected costs of alternatives, using several techniques and heuristics (see paper)
 In our experiments: re-estimate every 1 sec.
The “smart router” operator does the following:
 Waits for the monitor/re-optimizer to suggest a replacement plan
 Re-routes source data into the new plan
 The new plan’s output is unioned with the output of the previous plan; this is fed into any final aggregation operations

13 Correcting for Unexpected Selectivities
[Chart: experimental results; Pentium IV 3.06 GHz, Windows XP]

14 ADP Application 2: Optimizing for Order
The most general ADP approach:
 Pre-generate plans for the general case and for each “interesting order”
 The “smart router” sends each tuple to the plan whose ordering constraint that tuple satisfies
 But with multiple joins, there are MANY plans
Instead: do ADP at the operator level
 “Complementary join pair”
 Does its own stitch-up internally
 Easier to optimize for!
Can also do “partial sorting” at the router (priority queue)
[Diagram: routers over R and S, partitioned by h(R) and h(S), feed queues into a complementary merge join / hash join pair]

15 Exploiting Partial Order in the Data (1024 tuples)
[Chart: experimental results; Pentium IV 3.06 GHz, Windows XP]

16 ADP Over “Windows”: Optimizing for Aggregation
 Group-by optimization [CS94]:
  May be able to “pre-aggregate” some tuples before joining
  Why: aggregates can be applied over union
  But once we insert pre-aggregation, we’re stuck (and it’s not pipelined)
 Our solution:
  “Adjustable window pre-aggregation”
  Change the window size depending on how effectively we can aggregate
  Also allows data to propagate through the plan – better information for adaptivity, and early answers
[Diagram: two plans compared – one pre-aggregates T with SUM(T.y) GROUP BY T.x, T.joinAttrib below the join with R, then computes the SUM of the partial sums GROUP BY T.x above the join; the other simply joins T with R and computes SUM(T.y) GROUP BY T.x]
A SQL sketch of the pre-aggregation rewrite follows.
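To make the [CS94]-style rewrite concrete, here is a minimal SQL sketch under assumed schemas T(x, y, joinAttrib) and R(joinAttrib); it illustrates static pre-aggregation, not the adaptive windowed version:

  -- Pre-aggregate T on (x, joinAttrib) before the join, then sum the
  -- partial sums after the join. Correct because SUM distributes over
  -- union: each pre-aggregated row stands in for all T tuples it absorbed.
  SELECT pre.x, SUM(pre.partial_sum) AS total
  FROM (SELECT T.x, T.joinAttrib, SUM(T.y) AS partial_sum
        FROM T
        GROUP BY T.x, T.joinAttrib) pre
  JOIN R ON pre.joinAttrib = R.joinAttrib
  GROUP BY pre.x;

Adjustable-window pre-aggregation bounds how many tuples the inner GROUP BY may buffer at once, shrinking the window when incoming tuples fail to combine.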

17 Pre-Aggregation Comparison

18 The State of the Union Join and Agg
 “Useless” intermediate state is perhaps the biggest concern in ADP-based (or even plan-partitioning) approaches
 It is very easy to create large intermediate state before switching away from a plan
 Results in significant additional computation
 “The burden of history” [DH04]
 Also the major bottleneck in computing queries with correlated subqueries
 We only want to compute the parts of a subquery that will contribute to final answers
 The local-DB solution: magic sets rewritings [M+90][CR91][MP94][S+96]

19 Intuition behind Magic Sets Rewritings
 Observations:
  Computing a subquery once for every iteration of the outer query is repetitive and inefficient
  Computing the subquery in its entirety is also frequently inefficient
 So “pass in” information about specifically which tuples from the inner query might join with the outer query
  A “filter set” – generally a projection of a portion of the outer query results
  Anything that joins with the parent block must join with the filter set
  False positives are OK

20 Example Query
CREATE VIEW TotalSales(SellerID, Sales, ItemsSold) AS
SELECT SL.SellerID, SUM(salePrice) AS Sales, COUNT(*) AS ItemsSold
FROM SellerList SL, SaleItem S
WHERE SL.SellerID = S.SellerID
GROUP BY SL.SellerID

SELECT SellerID, Sales, ItemsSold
FROM TotalSales TS, Recommended REC, Ratings RAT
WHERE REC.SellerID = TS.SellerID
  AND RAT.SellerID = TS.SellerID
  AND RAT.Rating > 4
  AND ItemsSold > 50

21 Query with Magic Set [S+96]
CREATE VIEW TotalSales(SellerID, Sales, ItemsSold) AS
SELECT SL.SellerID, SUM(salePrice) AS Sales, COUNT(*) AS ItemsSold
FROM SellerList SL, SaleItem S
WHERE SL.SellerID = S.SellerID
GROUP BY SL.SellerID

SELECT SellerID, Sales, ItemsSold
FROM TotalSales TS, Recommended REC, Ratings RAT
WHERE REC.SellerID = TS.SellerID
  AND RAT.SellerID = TS.SellerID
  AND RAT.Rating > 4
  AND ItemsSold > 50
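A hedged sketch of how the [S+96]-style rewrite could look for this query; the filter-set view name Magic_TS is illustrative, not from the original slide. The filter set projects the seller IDs the outer block can actually use, so the view aggregates only those sellers:

  -- Filter set: seller IDs that survive the outer block's predicates.
  -- False positives are acceptable; the final join still applies them.
  CREATE VIEW Magic_TS(SellerID) AS
  SELECT DISTINCT REC.SellerID
  FROM Recommended REC, Ratings RAT
  WHERE RAT.SellerID = REC.SellerID AND RAT.Rating > 4;

  -- The view is restricted to the filter set before grouping.
  CREATE VIEW TotalSales(SellerID, Sales, ItemsSold) AS
  SELECT SL.SellerID, SUM(salePrice) AS Sales, COUNT(*) AS ItemsSold
  FROM SellerList SL, SaleItem S, Magic_TS M
  WHERE SL.SellerID = S.SellerID AND SL.SellerID = M.SellerID
  GROUP BY SL.SellerID;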

22 Magic in Data Integration
In data integration:
 It is difficult to determine when to do sideways information passing/magic in a cost-based way
 Magic optimization destroys some potential parallelism – we must compute the outer block first
Opportunities:
 Pipelined hash joins give us complete state for every intermediate result
 We use bushy trees
Our idea: do information passing out-of-band
 Consider a plan as if it were a relational calculus expression – every tuple must satisfy constraints
 The plan dataflow enforces this…
 … but we can also pass information across the plan, outside the normal dataflow
[Diagram: a bushy plan over A, B, and C, with information passed across the plan outside the dataflow]

23 Adaptive Information Passing
 Cost-based strategy:
 1. Execute all blocks in parallel (up to the maximum pipelineable size)
 2. Whenever a subresult is completely computed, feed it elsewhere in the query plan as a filter set
  Anywhere with a shared predicate is an eligible target
  Use our ability to estimate the remaining cost of query execution to decide whether the semijoin will improve performance
 3. Can always inject a “more precise” filter set (one that checks more predicates), or remove a filter set
  The filter set is a performance/space optimization; it is not necessary for correctness
 We use Bloom filters rather than hash tables (our VLDB 2005 submission has a detailed performance comparison)
 We also compared against a naïve strategy that generates filter sets at every operator; when complete, they are used as filters by downstream operators
A sketch of applying a filter set follows.
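As a rough illustration (the table name CompletedSubresult is hypothetical, and the system actually ships Bloom filters out-of-band rather than issuing SQL), applying a completed subresult as a filter set amounts to a semijoin on the shared predicate’s attribute:

  -- Once the subresult sharing the SellerID predicate is complete, its
  -- projection on SellerID can filter another block's scan. False
  -- positives (e.g., from a Bloom filter) are safe: the real join
  -- predicate is still checked downstream.
  SELECT *
  FROM SaleItem S
  WHERE S.SellerID IN (SELECT SellerID FROM CompletedSubresult);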

24 Tuples Created – TPC-H, 1 GB
[Chart: ~67% savings in Q2; savings also in Q5, not shown]

25 Adaptive QP in Summary
A variety of different techniques, focusing on:
 Scheduling
 Comparison & competition
 Data + plan partitioning
 Information passing
A field that is still fairly open – missing:
 Effective exploration methods
 A true theory!
  What’s possible? What kinds of queries make sense to adapt?
  Guarantees of optimality and convergence (perhaps under certain assumptions)

26 Switching from Low-Level to High-Level
 We’ve talked about:
  Query reformulation (composing queries with mappings)
  Query optimization + execution
 But how did we ever get the mappings in the first place?
  This is one of the most tedious tasks
 Answer: LSD (and not the kind that makes you high!)
 … Slides courtesy of AnHai Doan, UIUC

27 Semantic Mappings between Schemas
 Mediated & source schemas = XML DTDs
[Diagram: two DTD trees. One has house with location and contact, where contact has agent-name and agent-phone; the other has house with address, name, phone, num-baths (vs. full-baths and half-baths), and contact-info. location ↔ address is a 1-1 mapping; num-baths ↔ (full-baths, half-baths) is a non-1-1 mapping]

28 The LSD (Learning Source Descriptions) Approach
Suppose a user wants to integrate 100 data sources:
1. The user manually creates mappings for a few sources, say 3, and shows LSD these mappings
2. LSD learns from the mappings
  “Multi-strategy” learning incorporates many types of information in a general way
  Knowledge of constraints further helps
3. LSD proposes mappings for the remaining 97 sources

29 Example
Mediated schema: address, price, agent-phone, description
Schema of realestate.com: location, listed-price, phone, comments
 Sample data: location = “Miami, FL”, “Boston, MA”; listed-price = $250,000, $110,000; phone = (305) 729 0831, (617) 253 1429; comments = “Fantastic house”, “Great location”
Learned hypotheses:
 If “phone” occurs in the element name => agent-phone
 If “fantastic” & “great” occur frequently in data values => description
New source homes.com: price = $550,000, $320,000; contact-phone = (278) 345 7215, (617) 335 2315; extra-info = “Beautiful yard”, “Great beach”

30 LSD’s Multi-Strategy Learning
Use a set of base learners
 Each exploits certain types of information well
To match the schema elements of a new source:
 Apply the base learners
 Combine their predictions using a meta-learner
Meta-learner:
 Uses the training sources to measure each base learner’s accuracy
 Weighs each learner based on its accuracy

31 Base Learners
 Input
  Schema information: name, proximity, structure, ...
  Data information: value, format, ...
 Output
  Prediction weighted by a confidence score
 Examples
  Name learner: agent-name => (name, 0.7), (phone, 0.3)
  Naive Bayes learner: “Kent, WA” => (address, 0.8), (name, 0.2); “Great location” => (description, 0.9), (address, 0.1)

32 Training the Learners
From realestate.com (whose schema already maps to the mediated schema), extract training pairs:
 Name Learner: (location, address), (listed-price, price), (phone, agent-phone), (comments, description), ...
 Naive Bayes Learner: (“Miami, FL”, address), (“$250,000”, price), (“(305) 729 0831”, agent-phone), (“Fantastic house”, description), ...

33 Applying the Learners
Schema of homes.com: area, day-phone, extra-info
 For area (data: “Seattle, WA”, “Kent, WA”, “Austin, TX”): Name Learner => (address, 0.8), (description, 0.2); Naive Bayes => (address, 0.6), (description, 0.4); Meta-Learner => (address, 0.7), (description, 0.3)
 For day-phone: Meta-Learner => (agent-phone, 0.9), (description, 0.1)
 For extra-info (data: “Beautiful yard”, “Great beach”, “Close to Seattle”): Meta-Learner => (address, 0.6), (description, 0.4)

34 Domain Constraints
 Impose semantic regularities on sources
  Verified using schema or data
 Examples
  a = address & b = address ⇒ a = b
  a = house-id ⇒ a is a key
  a = agent-info & b = agent-name ⇒ b is nested in a
 Can be specified up front
  When creating the mediated schema
  Independent of any actual source schema

35 The Constraint Handler
Predictions from the Meta-Learner:
 area: (address, 0.7), (description, 0.3)
 contact-phone: (agent-phone, 0.9), (description, 0.1)
 extra-info: (address, 0.6), (description, 0.4)
Domain constraint: a = address & b = address ⇒ a = b
Scoring candidate mappings (product of prediction confidences):
 area = address, contact-phone = agent-phone, extra-info = address: 0.7 × 0.9 × 0.6 = 0.378 – rejected, violates the constraint
 area = address, contact-phone = agent-phone, extra-info = description: 0.7 × 0.9 × 0.4 = 0.252 – the best valid assignment
 area = description, contact-phone = description, extra-info = description: 0.3 × 0.1 × 0.4 = 0.012
 Can specify arbitrary constraints
 User feedback = domain constraint
  e.g., ad-id = house-id
 Extended to handle domain heuristics
  e.g., a = agent-phone & b = agent-name ⇒ a & b are usually close to each other

36 Putting It All Together: The LSD System
[Diagram: Training Phase – the mediated schema, source schemas, and data listings produce training data for the base learners L1, L2, ..., Lk; Matching Phase – the base learners’ predictions are combined by the meta-learner, then passed through the constraint handler, together with domain constraints and user feedback, to produce the mappings]
 Base learners: Name Learner, XML Learner, Naive Bayes, Whirl Learner
 Meta-learner
  Uses stacking [Ting & Witten 99, Wolpert 92]
  Returns a linear weighted combination of the base learners’ predictions
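As a hedged sketch of the stacking combination (the exact weighting scheme is in the LSD papers, not this transcript): if base learner i predicts label l for element e with confidence p_i(e, l), and the meta-learner has learned a weight w_{i,l} for learner i on label l from the training sources, the combined score is

  s(e, l) = \sum_{i} w_{i,l} \, p_i(e, l)

With assumed equal weights w_{1,address} = w_{2,address} = 0.5, this reproduces the slide-33 numbers for area: 0.5 × 0.8 + 0.5 × 0.6 = 0.7 for address.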

37 Empirical Evaluation
 Four domains
  Real Estate I & II, Course Offerings, Faculty Listings
 For each domain
  Create the mediated DTD & domain constraints
  Choose five sources
  Extract & convert data listings into XML
  Mediated DTDs: 14–66 elements; source DTDs: 13–48
 Ten runs for each experiment – in each run:
  Manually provide 1-1 mappings for 3 sources
  Ask LSD to propose mappings for the remaining 2 sources
  Accuracy = % of 1-1 mappings correctly identified

38 LSD Matching Accuracy
[Chart: average matching accuracy (%) across the four domains]
 LSD’s accuracy: 71–92%
 Best single base learner: 42–72%
 + Meta-learner: +5–22%
 + Constraint handler: +7–13%
 + XML learner: +0.8–6%

39 LSD Summary
 Applies machine learning to schema matching
  Uses multi-strategy learning
  Domain & user-specified constraints
 Probably the most flexible means of doing schema matching today in a semi-automated way
 Complementary project: CLIO (IBM Almaden) uses key and foreign-key constraints to help the user build mappings

40 Jumping Up a Level
 We’ve now seen how distributed data makes a huge difference…
 … in heterogeneity and the need for relating different kinds of attributes
  Mapping languages
  Mapping tools
  Query reformulation
 … and in query processing
  Adaptive query processing

