Presentation is loading. Please wait.

Presentation is loading. Please wait.

AutoJoin: Providing Freedom from Specifying Joins Terrence Mason Lixin Wang

Similar presentations


Presentation on theme: "AutoJoin: Providing Freedom from Specifying Joins Terrence Mason Lixin Wang"— Presentation transcript:

1 AutoJoin: Providing Freedom from Specifying Joins Terrence Mason (terrence-mason@uiowa.edu)terrence-mason@uiowa.edu Lixin Wang (lixin-wang@uiowa.edu)lixin-wang@uiowa.edu Dr. Ramon Lawrence (ramon-lawrence@uiowa.edu)ramon-lawrence@uiowa.edu Iowa Database and Emerging Application Laboratory University of Iowa 7th International Conference on Enterprise Information Systems ICEIS 2005 Miami, Florida

2 Presentation Outline Define Query Inference Define Query Inference Query Languages that require Inference Query Languages that require Inference AutoJoin Architecture AutoJoin Architecture Join Graph represent a schema Join Graph represent a schema Queries and Query Interpretations on a Join Graph Queries and Query Interpretations on a Join Graph Pre-compute maximal join trees Pre-compute maximal join trees Algorithm EMO Algorithm EMO Query time processing – Example Query time processing – Example Performance Evaluation Performance Evaluation

3 Query Inference Problem New Languages The query inference problem requires enumerating and ranking query interpretations of a query such that the query interpretation desired by the user is among the highest ranked interpretations.

4 State of the art query languages require it State of the art query languages require it Keyword Search – automatically relate keywords across relations of a schema Keyword Search – automatically relate keywords across relations of a schema Conceptual Queries – Concepts mapped to database must be related Conceptual Queries – Concepts mapped to database must be related Natural Language Queries Natural Language Queries Natural language query mapped to concepts Natural language query mapped to concepts Relate concepts as in Conceptual Queries Relate concepts as in Conceptual Queries Current approaches not scalable Current approaches not scalable Tied to specific language Tied to specific language Or conceptual model Or conceptual model Motivation for Query Inference

5 Reduces to graph problem Reduces to graph problem Connect relations (nodes) with joins (edges) Connect relations (nodes) with joins (edges) Exponential solutions for highly connected graphs (database graphs less connected) Exponential solutions for highly connected graphs (database graphs less connected) Approaches to join determination Approaches to join determination Grow all ways Grow all ways Universal Relation (Maier and Ullman, 1983) Universal Relation (Maier and Ullman, 1983) Discover (Keyword) (Hristidis and Papakonstantinou, 2002, 2003, 2004) Discover (Keyword) (Hristidis and Papakonstantinou, 2002, 2003, 2004) Shortest Paths Shortest Paths CQL Conceptual Query Language (Owei and Navathe, 2001) CQL Conceptual Query Language (Owei and Navathe, 2001) Limited Interpretations Limited Interpretations Steiner Tree (2-Trees) (Wald and Sorenson, 1984) Steiner Tree (2-Trees) (Wald and Sorenson, 1984) Limit number of joins and interpretations (Zhang et al., 1999) Limit number of joins and interpretations (Zhang et al., 1999) Query time find spanning trees of keywords Query time find spanning trees of keywords DBXplorer Keyword Search (Agrawal et al. 2002) DBXplorer Keyword Search (Agrawal et al. 2002) Motivation for Query Inference

6 Goal of AutoJoin Consistent, Scalable Inference Engine Abstract database schema from users Automatically determine joins to relate relations and attributes Consistent approach to handle ambiguity in queries Efficient algorithm to pre-compute potential joins Minimal overhead at query time Demonstrate efficiency and scalability Structured on relational model without any required conceptual models

7 Example Query on TPC-H Schema English Query: List all parts ordered by Customers in the United States. Attribute-only SQL Determine Joins with AutoJoin New formulation for Query Inference problem.

8 Table Attributes Part partkey, name, mfgr, brand, type, size, container, retailprice, comment Supplier supkey, name, address, nationkey, phone, acctbal, comment PartSupp partkey, suppkey, availqty, supplycost, comment Customer custkey, name, address, nationkey, phone, acctbal, mktsegment, comment Order orderkey, custkey, orderstatus, totalprice, orderdate, orderpriority, clerk, shippriority, comment LineItem orderkey, partkey, suppkey, linenumber, quantity, extendedprice, discount, returnflag, tax, linestatus, shipdate, commitdate, receiptdate, shipinstruct, shipmode, comment Nation nationkey, name, regionkey, comment Region regionkey, name, comment TPC-H Schema TPC-H BENCHMARK™ ( http://www.tpc.org/) List all parts ordered by Customers in the United States.

9 Attribute-only Query: Select Part.Name where Nation.Name=‘United States’; Part.Name - name attribute in Part Table Nation.Name – name attribute in Nation Table Select and where similar to SQL No From clause or joins specified Keyword Query: Part ‘United States’ Maps Part to Part relation Maps ‘United States’ to tuple in Nation relation No joins specified

10 SQL Query SQL Query Select Part.Name where Nation.Name = ‘United States’; SELECT P.name FROM part P, nation N, partsupp PS, lineitem LI, orders O, customer C WHERE N.name = ‘United States’ And P.partkey = PS.partkey And PS.partkey = LI.partkey And PS.suppkey = LI.suppkey And O.custkey = C.custkey And C.nationkey = N.nationkey And LI.orderkey = O.orderkey; Specified Joins and Tables

11 User Query Interface Inference Request Query Builder GeneratorRanker Iterator Loader XML Document AutoJoin Inference Engine Relational Database Execute Queries Interpretations AutoJoin Architecture

12 Representing Joins of a Schema Join Graph Graph representation of relational schema Nodes Relations in schema Directed Edges Foreign key constraint between relations Edges directed from N to 1 cardinality of relationships Maintain Lossless property (No spurious tuples on joins)

13 Create Join Graph TPC-H Nodes JoinedForeign key/Join Line Item to Part partkey  partkey Line Item to PartSupp partkey, suppkey  partkey, suppkey Line Item to Supplier suppkey  suppkey Line Item to Order l_orderkey  o_orderkey PartSupp to Part ps_partkey  p_partkey PartSupp to Supplier ps_suppkey  s_suppkey Supplier to Nation s_nationkey  n_nationkey Order to Customer o_custkey  c_custkey Customer to Nation c_nationkey  n_nationkey Nation to Region n_regionkey  r_regionkey Part Supp Nation Supplier Part Line Item Order Customer Region Tables as Nodes

14 Pre-compute Maximal Join Trees EMO Algorithm on Join Graph Efficiently computes all Trees Executes where previous strategy failed Direction of edges results in lossless join trees Pre-computed Executed once prior to query time Structures built for query time performance

15 Compute Lossless Joins Maximal sets of lossless joins Maximal sets of lossless joins Ambiguity inherent in the schema Ambiguity inherent in the schema Two types of ambiguity: Two types of ambiguity: Single relation that plays multiple roles Single relation that plays multiple roles Node with more than one incoming edge in join graph Node with more than one incoming edge in join graph Multiple semantic relationships between entities Multiple semantic relationships between entities Strongly connected components greater than one node Strongly connected components greater than one node

16 Creation of Maximal Join Trees Lossless Joins Efficient Algorithm EMO Determine all reachable graphs from nodes that may be a root for Maximal Set of Lossless Joins Identify all Strong Connected Components (SCC) For each SCC If SCC is single node and no incoming edges, create reachable graph from this node If SCC has multiple nodes, for each node in SCC with no incoming edges that are not part of SCC create reachable graph. For each reachable graph find all spanning trees Spanning trees represent Maximal Join Trees

17 Maximal Join Trees of TPC-H LineItem is the only root for a reachable graph. No strongly connected components Join graph is reachable graph Enumerate spanning trees on original graph Remove shortcut joins and re-compute

18 Part Supp Nation SupplierPart Line Item Order Customer Region TPC-H Join Graph

19 Part Supp Nation Supplier Part Line Item Order Customer Region TPC-H Maximal Join Trees Part Supp Nation Supplier Part Line Item Order Customer Region Part Supp Nation Supplier Part Line Item Order Customer Region Part Supp Nation Supplier Part Line Item Order Customer Region Part Supp Nation Supplier Part Line Item Order Customer Region Part Supp Nation Supplier Part Line Item Order Customer Region Part Supp Nation Supplier Part Line Item Order Customer Region Part Supp Nation Supplier Part Line Item Order Customer Region 1 8 76 2 5 3 4

20 Shortcut Joins Semantically equivalent join paths A shortcut join is a join that is semantically equivalent to a longer join path Core join path (longer) preserved in join graph Shortcut join removed for join determination Appears to be a semantically different interpretation of the query Substituted back into query No nodes on core path in query (faster) execution) TPC-H has two shortcut joins

21 Part Supp Nation SupplierPart Line Item Order Customer Region TPC-H Join Graph Remove Shortcut Joins Red – Shortcut Joins

22 Part Supp Nation Supplier Part Line Item Order Customer Region Original TPC-H Maximal Join Trees Part Supp Nation Supplier Part Line Item Order Customer Region Part Supp Nation Supplier Part Line Item Order Customer Region Part Supp Nation Supplier Part Line Item Order Customer Region Part Supp Nation Supplier Part Line Item Order Customer Region Part Supp Nation Supplier Part Line Item Order Customer Region Part Supp Nation Supplier Part Line Item Order Customer Region Part Supp Nation Supplier Part Line Item Order Customer Region 1 8 76 2 5 3 4

23 TPC-H Semantically Unique Maximal Join Trees Part Supp Nation SupplierPart Line Item Order Customer Region Part Supp Nation SupplierPart Line Item Order Customer Region 12

24 Query and Query Interpretation AutoJoin Join Graphs Query: Sub-graph of the join graph Nodes and (optionally) edges Not connected requires inference Query Interpretation: Connected sub-graph of the join graph Includes all specified nodes and edges

25 Example Query SELECT Part.Name WHERE Nation.Name = ‘United States’; Relate Part.Name to Nation.Name Part and Nation Nodes. Query of Part and Nation nodes to AutoJoin. The query is ambiguous More than one query interpretation Nation relates to Supplier and Customer Return the query with fewest joins first

26 Efficient Query Time Execution Find maximal join trees with query nodes Reverse index - relation to its set of join trees Intersect lists Build Interpretations Least common ancestor (vs. recursive prune) Pre-compute ancestor lists No lossless interpretations (no trees) Find lossy interpretation Rank interpretations by cost function maximal sets of lossless joins

27 Both Trees Contain Query Nodes Select Part.Name where Nation.Name = ‘United States’; Part Supp Nation SupplierPart Line Item Order Customer Region Part Supp Nation SupplierPart Line Item Order Customer Region 12 Red – Target Nodes

28 Query Processing Part Supp Nation SupplierPart Line Item Order Customer Region Part Supp Nation SupplierPart Line Item Order Customer Region 12 Red – Target Nodes Blue – Tree Nodes Gray – Nodes to Prune

29 Query Interpretations Part Supp Nation Part Line Item Order Customer Part Supp Nation SupplierPart 12 Select Part.Name where Customer.Nation.Name = ‘United States’; Select Part.Name where Supplier.Nation.Name = ‘United States’;

30 Unambiguous Query Select Supplier.Name where Order.Id = 73; Part Supp Nation SupplierPart Line Item Order Customer Region Part Supp Nation SupplierPart Line Item Order Customer Region 12 Red – Target Nodes

31 Query Processing Select Supplier.Name where Order.Id = 73; Red – Target Nodes Blue – Tree Nodes Gray – Nodes to Prune Part Supp Nation SupplierPart Line Item Order Customer Region Part Supp Nation SupplierPart Line Item Order Customer Region 12

32 Query Interpretations Select Supplier.Name where Order.Id = 73; Part Supp Supplier Line Item Order 12 Part Supp Supplier Line Item Order

33 The Unambiguous Query Interpretation Select Supplier.Name where Order.Id = 73; Part Supp Supplier Line Item Order

34 Additional Interpretations Lossy Joins Related through a node involved in two distinct roles Two maximal join trees contain all query nodes and have at least one node in common Union maximal join trees Common nodes provide relation for trees. Interpretation where node will have two incoming edges No longer lossless Example Customer and Supplier related through Nation in TPC-H. Cross products of Customers and Suppliers with the same nation

35 Beyond Natural Joins Theta joins Merge the two nodes related by theta join into single node and re-compute maximal objects. Expand this node for final query interpretation with theta join Tuple Variables A query interface may specify tuple variables Additional nodes and edges will be added to join graph to complete the query interpretations

36 Performance Experiments Broad Range of Schemas caBIO (NCI) 149 relations, 213 joins, and 1253 maximal join trees TPC-H Standard Database Inferred standard queries (21 specified queries) Ambiguity reduced by removing shortcut joins Tenant – 9 nodes, 50 joins, and 1286 maximal join trees

37 Peformance Results Time to generate all Maximal Join Trees Handles schemas where previous method failed Worst test 2.7 seconds Average < 1 second Reduce Ambiguity Removing shortcut joins reduces ambiguity Increased number of unambiguous query From 45% to 68% for TPC-H Benchmark Queries Minimal overhead of inference at query time Average < 1 millisecond Worst test 7.4 milliseconds

38 Compute Maximal Join Trees EMO vs. All Ways

39 Reducing Ambiguity Remove Shortcut Joins

40 Query Inference Time (Milliseconds)

41 AutoJoin Conclusions Scalable inference engine Efficiently pre-compute maximal join trees Reduced ambiguity by removing shortcut joins Overhead is minimal Complex queries can be inferred Built directly on relational model

42 Future Work Develop a query language Develop a query language Remove requirement of understanding the underlying schema Remove requirement of understanding the underlying schema Automatically determines joins Automatically determines joins End user interface based on AutoJoin End user interface based on AutoJoin Query inference for integration systems. Query inference for integration systems.

43 Query Inference (Previous) The translation of a query in a query language into an unambiguous representation of the query [Wald and Sorenson, 1984]

44 Universal Relation First model to require query inference First model to require query inference Maximal Objects (Maier and Ullman, 1983) Maximal Objects (Maier and Ullman, 1983) Lossless Join property to identify potential joins Lossless Join property to identify potential joins Grows all ways on hyper-graph Grows all ways on hyper-graph Returns a union of all query interpretations Returns a union of all query interpretations Minimum Directed Cost Steiner Tree (Wald and Sorenson, 1984) Minimum Directed Cost Steiner Tree (Wald and Sorenson, 1984) Limited to Partial 2-Trees Limited to Partial 2-Trees Returns only lowest cost query interpretation Returns only lowest cost query interpretation Generate a single interpretation Generate a single interpretation Do not meet need of new query languages Do not meet need of new query languages Limited query interpretations possible Limited query interpretations possible

45 State of the Art Query Languages Keyword Searches Keyword Searches Keywords map to either specific data, attribute names, or relation names in a database. Keywords map to either specific data, attribute names, or relation names in a database. Must identify joins to relate keywords spread across multiple relations. Must identify joins to relate keywords spread across multiple relations. Multiple approaches to identifying the top-k relationships between keywords. Multiple approaches to identifying the top-k relationships between keywords.

46 Keyword Search T op-K Relationships Discover (Hristidis and Papakonstantinou, 2002, 2003, 2004) Discover (Hristidis and Papakonstantinou, 2002, 2003, 2004) Grow all ways from a keyword Grow all ways from a keyword Limit on number of joins Limit on number of joins Creates extra graphs Creates extra graphs DBXplorer (Agrawal et al. 2002) DBXplorer (Agrawal et al. 2002) Generates spanning trees at query time Generates spanning trees at query time BANKS ( ) BANKS ( ) Graph of all tuples related by joins Graph of all tuples related by joins Must fit in memory (limited to smaller databases) Must fit in memory (limited to smaller databases)

47 State of the Art Query Languages Conceptual Query Languages or Models Conceptual Query Languages or Models Queries built with concepts that map to a database. Queries built with concepts that map to a database. Remove the burden of knowledge of the schema. Remove the burden of knowledge of the schema. Must determine joins to relate concepts in query. Must determine joins to relate concepts in query. Use conceptual model to determine joins Use conceptual model to determine joins

48 Conceptual Query Languages CQL (Owei and Navathe, 2001) CQL (Owei and Navathe, 2001) Queries may include roles or joins required for a query Queries may include roles or joins required for a query Pathfinder algorithm for completing the query Pathfinder algorithm for completing the query Based on shortest path between source and target concepts in query Based on shortest path between source and target concepts in query Semantically Constrained ER Diagram as a graph used to determine joins. Semantically Constrained ER Diagram as a graph used to determine joins. Conceptual Model (Zhang et al., 1999) Conceptual Model (Zhang et al., 1999) Semantic graph of database Semantic graph of database Search algorithm constrained by number of joins or number of interpretations Search algorithm constrained by number of joins or number of interpretations

49 State of the Art Query Languages Natural Language Queries Natural Language Queries Natural language queries map the language to concepts in a database Natural language queries map the language to concepts in a database Joins must be determined to relate concepts in database similar to Conceptual Query Languages Joins must be determined to relate concepts in database similar to Conceptual Query Languages

50 Functional Dependencies due to Primary Keys TPC-H TableFunctional Dependencies Part p_partkey  p_name, p_mfgr, p_brand, p_type, p_size, p_container, p_retailprice, p_comment Supplier s_suppkey  s_name, s_address, s_nationkey, s_phone, s_acctbal, s_comment PartSupp ps_partkey, ps_suppkey  ps_availqty, ps_supplycost, ps_comment Customer c_custkey  c_name, c_address, c_nationkey, c_phone, c_acctbal, c_mktsegment, c_comment Order o_orderkey  o_custkey, o_orderstatus, o_totalprice, o_orderdate, o_orderpriority, o_clerk, o_shippriority, o_comment LineItem l_orderkey, l_linenumber  l_partkey, l_suppkey, l_orderkey, l_quantity, l_extendedprice, l_discount, l_returnflag, l_tax, l_linestatus, l_shipdate, l_commitdate, l_receiptdate, l_shipinstruct, l_shipmode, l_comment Nation n_nationkey  n_name, n_regionkey, n_comment Region r_regionkey  r_name, r_comment Primary Keys Foreign Keys

51 Table with Foreign Key Table Referenced Functional Dependencies LineItemPart l_partkey  p_partkey LineItemSupplier l_suppkey  s_suppkey LineItemPartSupp l_partkey, l_suppkey  ps_partkey, ps_suppkey LineItemOrder l_orderkey  o_orderkey PartSuppPart ps_partkey  p_partkey PartSuppSupplier ps_suppkey  s_suppkey SupplierNation s_nationkey  n_nationkey CustomerNation c_nationkey  n_nationkey OrderCustomer o_custkey  c_custkey NationRegion n_regionkey  r_regionkey Primary Keys Foreign Keys Function Dependencies TPC-H implied by Foreign Keys

52 Part Supp Nation Supplier Part Line Item Order Customer Region TPC-H Join Graph


Download ppt "AutoJoin: Providing Freedom from Specifying Joins Terrence Mason Lixin Wang"

Similar presentations


Ads by Google