Cost-based Optimization of Graph Queries Silke Trißl Humboldt-Universität zu Berlin Knowledge Management in Bioinformatics IDAR 2007.


1 Cost-based Optimization of Graph Queries Silke Trißl Humboldt-Universität zu Berlin Knowledge Management in Bioinformatics IDAR 2007

2 Silke Trißl - Cost-based Optimization of Graph Queries - 1st PhD Workshop IDAR
Motivation – Biological Networks
[Figure: biological network from http://www.genome.jp/kegg, with the labels Name, Sequence, TYPE, Function, Location, …]
Sections: Motivation, Optimization, Nodes, Paths, Future Work, Relational algebra, Conclusion

3 Querying Networks - PQL
- Pathway Query Language (PQL) [Leser, 2005]
  - Syntax for querying graphs
  - Find subgraphs matching the query graph

    SELECT B
    FROM network
    LET node A, node B, path P
    WHERE A.name = 'Glucose'
      AND A ISA compound
      AND B ISA enzyme
      AND P.path = A[-*]B;

[Figure: query graph with node A (name = Glucose, ISA compound) connected by path P to node B (ISA enzyme)]
Find all enzymes that are directly or indirectly affected by "Glucose".

4 Node Conditions
- Nodes can contain conditions on
  - Attributes: A.name = 'Glucose'
  - TYPE (of hierarchy): A ISA compound
  - Function (of ontology): A HASFUNC ('catalysis', GO)
  - Location: A ISIN ('Human', taxonomy)
[Figure: query graph (A: name = Glucose, ISA compound; B: ISA enzyme; path P) and part of the TYPE hierarchy with the concepts root, molecule, interaction, macromolecule, compound, sugar, gene, protein, ion, mRNA, catalysis, inhibition, enzyme]

5 Path Conditions
- Paths can contain conditions on
  - Edges: P.path = A[-*]B AND P.length = 1
  - Path existence: P.path = A[-*]B
  - Path length: P.path = A[-*]B AND P.length < 10
  - Start node: P.start = A
  - Containment: P { R
[Figure: query graph (A: name = Glucose, ISA compound; B: ISA gene; path P) and a graph with nodes a and b]

6 Result of Graph Queries
- Search for matching subgraphs
  - Find node and path bindings for the query variables in the network
[Figure: the query graph (A: name = Glucose, ISA compound; B: ISA enzyme; path P) matched against the network]

7 Outline
- Motivation
- Optimize Graph Queries
  - Evaluate node conditions
  - Evaluate path conditions
- Future Work
  - Relational algebra for graph queries
- Conclusion

8 Evaluation of Node Conditions
- Node attributes
  - Select operator (σ) on Node table
- Node types, functions, and locations
  - Hierarchy operator (χ)
    - Returns the specified concept and all successor concepts
[Figure: query graph (A: name = Glucose, ISA compound; B: ISA gene; path P)]
Query plan for node A: σ name=Glucose (Node) ⋈ Node.TYPE=TYPE χ compound (TYPE)
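
The query plan for node A can be sketched in a few lines of Python. This is only an illustration, not the implementation behind the slides: the parent/child layout of the TYPE hierarchy, the rows of the Node table, and the helper names `chi`, `select`, and `join_on_type` are all assumptions made for the example.

```python
from collections import deque

# Hypothetical excerpt of the TYPE hierarchy: concept -> direct successor concepts.
TYPE_CHILDREN = {
    "molecule": ["macromolecule", "compound"],
    "compound": ["sugar", "ion"],
    "macromolecule": ["gene", "protein", "mRNA"],
    "protein": ["enzyme"],
}

# Hypothetical Node table.
NODES = [
    {"id": 1, "name": "Glucose", "TYPE": "sugar"},
    {"id": 2, "name": "Glucose", "TYPE": "gene"},
    {"id": 3, "name": "Hexokinase", "TYPE": "enzyme"},
]


def chi(concept, children=TYPE_CHILDREN):
    """Hierarchy operator: the concept itself plus all successor concepts."""
    result, queue = {concept}, deque([concept])
    while queue:
        for child in children.get(queue.popleft(), []):
            if child not in result:
                result.add(child)
                queue.append(child)
    return result


def select(rows, attribute, value):
    """Select operator: keep the rows whose attribute equals value."""
    return [row for row in rows if row[attribute] == value]


def join_on_type(rows, concepts):
    """Join Node.TYPE = TYPE against the concepts returned by chi."""
    return [row for row in rows if row["TYPE"] in concepts]


# sigma name=Glucose (Node) joined with chi compound (TYPE)
bindings_for_a = join_on_type(select(NODES, "name", "Glucose"), chi("compound"))
```

With the sample rows above, only the node with id 1 satisfies both the attribute condition and the ISA condition.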

9 How to evaluate Path conditions?
- Recursively traverse the graph
  - Arbitrary number of joins (Edge ⋈ Edge ⋈ … ⋈ …)
  - No possibility to optimize the execution
[Figure: graph with nodes a and b]
Need for new logical and physical operators

10 Path Existence Operator, Φ
- Node variables A and B
  - Set of nodes V bound to A
  - Set of nodes W bound to B
- Path variable P
  - Condition on P: path from A to B
- A Φ B returns the set of node pairs (v,w) for which paths from v ∈ V to w ∈ W in G exist.
[Figure: query graph (A: name = Glucose, ISA compound; B: ISA gene; path P)]
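
A minimal sketch of the logical behaviour of Φ, evaluated by plain breadth-first search at query time. The edge list is an assumption: it is the graph implied by the GRIPP index table shown later in the talk. The function names are hypothetical, and the sketch assumes that a path needs at least one edge.

```python
from collections import deque

# Assumed example graph as adjacency lists (inferred from the GRIPP index table).
EDGES = {
    "R": ["A"],
    "A": ["B", "C", "D"],
    "B": ["E", "F"],
    "D": ["G", "H"],
    "G": ["B"],
    "H": ["A"],
}


def reachable_from(graph, start):
    """All nodes reachable from start over one or more edges (BFS)."""
    seen, queue = set(), deque([start])
    while queue:
        for successor in graph.get(queue.popleft(), []):
            if successor not in seen:
                seen.add(successor)
                queue.append(successor)
    return seen


def phi(graph, v_set, w_set):
    """Path existence operator: pairs (v, w) with a path from v to w."""
    pairs = set()
    for v in v_set:
        reach = reachable_from(graph, v)
        pairs.update((v, w) for w in w_set if w in reach)
    return pairs


result = phi(EDGES, {"D"}, {"C", "E", "R"})
```

Here `result` contains the pairs (D, C) and (D, E); R has no incoming edges and is therefore not reachable from D.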

11 Physical Implementation of Φ
- Graph traversal at query time
  - Breadth-first or depth-first search
- Query precomputed index structure
  - Transitive closure (only for small graphs)
  - GRIPP [Trißl et al., 2007]
    - GRIPP index table, IND(G)
    - one instance for every node v in G

12 GRIPP Index Creation
- Depth-first traversal of G
- We reach a node v
  - for the first time
    - add tree instance of v to IND(G)
    - proceed traversal
  - again
    - add non-tree instance of v to IND(G)
    - do not traverse child nodes of v
[Figure: graph G with nodes R, A, B, C, D, E, F, G, H, annotated with the pre- and postorder values assigned during the traversal]
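
The traversal rule above can be sketched as a recursive depth-first search that hands out pre- and postorder values from a single counter. This is an illustrative sketch, not the published GRIPP code: the edge list and the child order are assumptions inferred from the index table in the talk, and `build_gripp_index` is a hypothetical name.

```python
# Assumed example graph; children are visited in the listed order.
EDGES = {
    "R": ["A"],
    "A": ["B", "C", "D"],
    "B": ["E", "F"],
    "D": ["G", "H"],
    "G": ["B"],
    "H": ["A"],
}


def build_gripp_index(graph, root):
    """Return IND(G) as rows (node, pre, post, instance type) in preorder."""
    index = []
    seen = set()
    counter = 0

    def visit(node):
        nonlocal counter
        pre = counter
        counter += 1
        position = len(index)
        index.append(None)  # reserve the slot so rows stay sorted by pre value
        if node in seen:
            kind = "non"    # reached again: do not traverse the children
        else:
            kind = "tree"   # reached for the first time: proceed traversal
            seen.add(node)
            for child in graph.get(node, []):
                visit(child)
        index[position] = (node, pre, counter, kind)
        counter += 1

    visit(root)
    return index


IND = build_gripp_index(EDGES, "R")
```

For large graphs the recursion would need to be replaced by an explicit stack; the recursive form is used here only because it mirrors the rule on the slide.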

13 GRIPP Index Table, IND(G)
- Is node C reachable from node D?
[Figure: graph G with nodes R, A, B, C, D, E, F, G, H and their pre- and postorder values]

GRIPP index, IND(G)
node  pre  post  inst
R     0    21    tree
A     1    20    tree
B     2    7     tree
E     3    4     tree
F     5    6     tree
C     8    9     tree
D     10   19    tree
G     11   14    tree
B     12   13    non
H     15   18    tree
A     16   17    non

14 Order Tree, O(G)
[Figure: the GRIPP index table IND(G) shown next to the order tree O(G)]
w is reachable from v iff v_pre < w_pre < v_post
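
The interval condition can be checked directly on the index table. The sketch below hard-codes the table from the slides and retrieves every instance whose preorder value lies strictly between the pre and post values of the start node's tree instance. The function names are hypothetical, and the instance type of rows whose type is missing in the transcript is assumed to be "tree".

```python
# IND(G) from the slides as rows (node, pre, post, instance type).
IND = [
    ("R", 0, 21, "tree"), ("A", 1, 20, "tree"), ("B", 2, 7, "tree"),
    ("E", 3, 4, "tree"), ("F", 5, 6, "tree"), ("C", 8, 9, "tree"),
    ("D", 10, 19, "tree"), ("G", 11, 14, "tree"), ("B", 12, 13, "non"),
    ("H", 15, 18, "tree"), ("A", 16, 17, "non"),
]


def tree_instance(index, node):
    """The unique tree instance of a node."""
    return next(row for row in index if row[0] == node and row[3] == "tree")


def reachable_instances(index, node):
    """Instances w with v_pre < w_pre < v_post for the tree instance v of node."""
    _, pre, post, _ = tree_instance(index, node)
    return [row for row in index if pre < row[1] < post]


below_d = [row[0] for row in reachable_instances(IND, "D")]
```

For node D this yields the instances of G, B, H, and A. Node C is not among them, so the order tree alone does not answer whether C is reachable from D.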

16 Query strategy – Step 1
- Retrieve the reachable instance set of start node v, called RIS(v)
  - Retrieve RIS(D)
  - Requires only a single query on IND(G)
- If C ∈ RIS(D) ⇒ return true ⇒ stop the search
- Else ⇒ proceed to Step 2

17 Query strategy – Step 2
- Search for non-tree instances in RIS(v)
  - The nodes of these instances are hop nodes
- Check every i ∈ RIS(D)
  - If i is tree instance [G and H]
    - Done
  - If i is non-tree instance [A and B]
    - i has no successors in O(G), but possibly in G
    - proceed to Step 3

18 Query strategy – Step 3
- Extend the search using hop nodes v_1, …, v_n
  - Obtain the tree instance of node B
  - Proceed to Step 1
- Repeat steps 1…3 until
  - an instance of node C is found
  - or no more hop nodes are available
Depth-first traversal of O(G) using hop nodes
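
Steps 1 to 3 can be combined into one loop over a stack of hop nodes. The following is a sketch under assumptions, not the published GRIPP implementation: the index table is copied from the slides (missing instance types assumed to be "tree"), the name `reachable` is hypothetical, and none of the pruning a real implementation might apply is shown.

```python
# IND(G) from the slides as rows (node, pre, post, instance type).
IND = [
    ("R", 0, 21, "tree"), ("A", 1, 20, "tree"), ("B", 2, 7, "tree"),
    ("E", 3, 4, "tree"), ("F", 5, 6, "tree"), ("C", 8, 9, "tree"),
    ("D", 10, 19, "tree"), ("G", 11, 14, "tree"), ("B", 12, 13, "non"),
    ("H", 15, 18, "tree"), ("A", 16, 17, "non"),
]


def reachable(index, start, target):
    """Is target reachable from start? Depth-first search over hop nodes."""
    tree = {row[0]: row for row in index if row[3] == "tree"}
    used = set()
    stack = [start]
    while stack:
        node = stack.pop()
        if node in used:
            continue
        used.add(node)
        # Step 1: retrieve RIS(node) from the tree instance of node.
        _, pre, post, _ = tree[node]
        ris = [row for row in index if pre < row[1] < post]
        if any(row[0] == target for row in ris):
            return True
        # Step 2: the nodes of non-tree instances in RIS(node) are hop nodes.
        hops = [row[0] for row in ris if row[3] == "non" and row[0] not in used]
        # Step 3: extend the search; reversed() keeps the index order on the stack.
        stack.extend(reversed(hops))
    return False


d_reaches_c = reachable(IND, "D", "C")
```

For the question from the slides, RIS(D) does not contain C, the hop node B contributes only E and F, and the hop node A finally yields an instance of C, so `d_reaches_c` is true.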

19 GRIPP – Sets of Nodes
[Figure: graph G with nodes R, A, B, C, D, E, F, G, H, and a query A -P- B in which A is bound to node D and B is bound to nodes C and E]

20 GRIPP – Sets of Nodes
- Two different strategies
  - Single node pair
    - Evaluate reachability for every node pair separately
  - Set-oriented
    - Evaluate reachability for the set in one step

21 Query GRIPP – Single Node Pair
- First evaluate reachability(D,E)
- Then reachability(D,C) separately
[Figure: traversal of the order tree for one node pair; result: true]

22 Query GRIPP – Set-oriented
- First query the order tree completely
- Then search used nodes and target nodes
  - If pre_Used < pre_Target < post_Used ⇒ true

Used nodes
node  pre  post
D     10   19
B     2    7
A     1    20

Target nodes
node  pre  post
C     8    9
E     3    4
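
A sketch of the set-oriented strategy: the order tree is traversed once from the start node to collect the used nodes, and the target nodes are then compared against them with the interval condition. The index table is copied from the slides (missing instance types assumed to be "tree"); the function name and the return shape are assumptions for illustration.

```python
# IND(G) from the slides as rows (node, pre, post, instance type).
IND = [
    ("R", 0, 21, "tree"), ("A", 1, 20, "tree"), ("B", 2, 7, "tree"),
    ("E", 3, 4, "tree"), ("F", 5, 6, "tree"), ("C", 8, 9, "tree"),
    ("D", 10, 19, "tree"), ("G", 11, 14, "tree"), ("B", 12, 13, "non"),
    ("H", 15, 18, "tree"), ("A", 16, 17, "non"),
]


def set_oriented(index, start, targets):
    """Return (used nodes, reachable targets) for one start node."""
    tree = {row[0]: row for row in index if row[3] == "tree"}
    # Phase 1: query the order tree completely, following every hop node.
    used, stack = [], [start]
    while stack:
        node = stack.pop()
        if node in used:
            continue
        used.append(node)
        _, pre, post, _ = tree[node]
        hops = [row[0] for row in index
                if pre < row[1] < post and row[3] == "non"]
        stack.extend(reversed(hops))
    # Phase 2: pre_Used < pre_Target < post_Used means the target is reachable.
    found = set()
    for used_node in used:
        _, pre, post, _ = tree[used_node]
        found.update(row[0] for row in index
                     if row[0] in targets and pre < row[1] < post)
    return used, found


used_nodes, reachable_targets = set_oriented(IND, "D", {"C", "E"})
```

With start node D the used nodes are D, B, and A, as in the table on the slide; E falls into the interval of B and C into the interval of A.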

23 Cost model
- Single node pair strategy
  - query time linear in size of target set
  - better for few target nodes
- Set-oriented strategy
  - almost constant query times
  - better for many target nodes
[Figure: average query time for both strategies and increasing size of target node set on a graph with 10,000 nodes and 20,000 edges]
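
How an optimizer might use these two observations can be sketched as a simple cost comparison. The cost constants below are invented for illustration and are not measurements from the talk; `choose_strategy` is a hypothetical name.

```python
def choose_strategy(num_targets, cost_per_pair=1.0, cost_set_oriented=25.0):
    """Pick the cheaper strategy under a toy cost model.

    Single node pair: cost grows linearly with the number of target nodes.
    Set-oriented: cost is treated as constant.
    Both constants are hypothetical placeholders, not measured values.
    """
    single_pair_cost = cost_per_pair * num_targets
    if single_pair_cost <= cost_set_oriented:
        return "single node pair"
    return "set-oriented"


few = choose_strategy(3)
many = choose_strategy(100)
```

In a real system the constants would have to be calibrated per graph and per index, which is where the cost functions named under future work come in.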

24 Outline
- Motivation
- Optimize Graph Queries
  - Evaluate node conditions
  - Evaluate path conditions
- Future Work
  - Relational algebra for graph queries
- Conclusion

25 Future Work
- Towards an algebra for graph queries
  - Define new operators
    - Logical
    - Physical
  - Determine cost functions
    - Estimate the size of result sets
  - Define rewrite rules
    - Which operations can be pushed?

26 Future Work – New Operators
- Path length operator
  - Evaluate the length of a path
- Possible solution
  - Store parts of paths
  - e.g., up to length x [Giugno & Shasha, 2002]
[Figure: graph with nodes a and b]

27 Future Work
- Cost Model
  - Assign cost models to physical operators
- Estimate the size of result sets
  - Between how many node pairs does a path exist?
    - Possibly of certain length?
- Possible solution
  - Sampling
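
One way the sampling idea could look: draw random ordered node pairs, test reachability for each, and scale the hit rate to the number of all ordered pairs. This is a hypothetical sketch of the general technique, not a method proposed in the talk. The function names, the fixed seed, and the use of plain breadth-first search for the reachability test are assumptions.

```python
import random
from collections import deque


def reaches(graph, start, target):
    """True if target is reachable from start over one or more edges (BFS)."""
    seen, queue = set(), deque([start])
    while queue:
        for successor in graph.get(queue.popleft(), []):
            if successor == target:
                return True
            if successor not in seen:
                seen.add(successor)
                queue.append(successor)
    return False


def estimate_connected_pairs(graph, nodes, sample_size, seed=0):
    """Estimate the number of ordered pairs (v, w) with a path from v to w."""
    rng = random.Random(seed)  # fixed seed keeps the estimate reproducible
    hits = 0
    for _ in range(sample_size):
        v, w = rng.choice(nodes), rng.choice(nodes)
        if reaches(graph, v, w):
            hits += 1
    return hits / sample_size * len(nodes) ** 2


cycle = {"a": ["b"], "b": ["a"]}
estimate = estimate_connected_pairs(cycle, ["a", "b"], sample_size=10)
```

In the two-node cycle every ordered pair is connected, so the estimate equals 4.0 regardless of which pairs are drawn. On real graphs the estimate carries sampling error that shrinks as the sample grows.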

28 Rewrite Query Plan
[Figure: query graph (A: name = Glucose, ISA compound; B: ISA enzyme; path P)]

    SELECT B
    FROM network
    LET node A, node B, path P
    WHERE A.name = 'Glucose'
      AND A ISA compound
      AND B ISA enzyme
      AND P.path = A[-*]B;

Query plan: π B ( (σ name=Glucose (Node) ⋈ Node.TYPE=TYPE χ compound (TYPE)) Φ (Node ⋈ Node.TYPE=TYPE χ enzyme (TYPE)) )

29 Better Plan?
[Figure: two alternative query plans for the same query, each built from the operators Node, TYPE, σ name=Glucose, χ compound, χ enzyme, ⋈, Φ, and π B]
- First plan: both type joins use Node.TYPE=TYPE and are evaluated below Φ; numbers shown: 1 18,000
- Second plan: the type join for B uses B.TYPE=TYPE; numbers shown: 2 1 20,000 2,000

30 Conclusion
- Optimize the execution of graph queries
  - Use cost-based query optimization
- Extend relational algebra
  - New operators
    - Path existence operator, Φ
    - Path length operator
  - Cost functions
    - Estimate the size of result sets
  - Rewrite rules

31 Thanks for your attention
Special thanks to my PhD supervisor Ulf Leser
Silke Trißl, Humboldt-Universität zu Berlin
Work sponsored by
IDAR 2007

32 References
- U. Leser. A query language for biological networks. Bioinformatics, 21 Suppl 2:ii33–ii39, Sep 2005.
- B. Eckman and P. G. Brown. Graph data management for molecular and cell biology. IBM J. Res. & Dev., 50(6):545–560, Nov 2006.
- F. Sohler and R. Zimmer. Identifying active transcription factors and kinases from expression data using pathway queries. Bioinformatics, 21 Suppl 2:ii115–ii122, Sep 2005.
- J. McHugh and J. Widom. Query Optimization for XML. In Proc. of the VLDB Conference, pages 315–326, 1999. Morgan Kaufmann.
- Y. Wu, J. M. Patel, and H. V. Jagadish. Structural Join Order Selection for XML Query Optimization. In Proc. of the ICDE Conference, pages 443–454, 2003. IEEE Computer Society.
- S. Trißl and U. Leser. Fast and Practical Indexing and Querying of Very Large Graphs. In Proc. of the ACM SIGMOD Conference, to appear, 2007. ACM Press.

