Presentation transcript:

1 Keyword Search over XML

2 Inexact Querying Until now, our queries have been complex patterns, represented by trees or graphs Such query languages are not appropriate for the naive user: –if XML “replaces” HTML as the web standard, users can’t be expected to write graph queries Allow Keyword Search over XML!

3 Keyword Search. A keyword search is a list of search terms. There can be different ways to define legal search terms. Examples: –label:keyword, e.g., author:Smith –keyword, e.g., :Smith –label, e.g., author: –value (without distinguishing between keywords and labels)
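The term forms listed above can be told apart with a few lines of code. The following is a minimal sketch, not part of the original slides: the names SearchTerm, parse_term and parse_query are hypothetical, and the comma-separated query syntax is assumed from the example queries on the later slides (e.g., ":XML, author:").

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SearchTerm:
    """One search term (hypothetical representation, for illustration only)."""
    label: Optional[str] = None    # element label to match, e.g. "author"
    keyword: Optional[str] = None  # text keyword to match, e.g. "Smith"
    value: Optional[str] = None    # bare term: may match a label or a keyword


def parse_term(term: str) -> SearchTerm:
    """Parse 'label:keyword', ':keyword', 'label:' or a bare value."""
    if ":" not in term:
        return SearchTerm(value=term)
    label, keyword = term.split(":", 1)
    # An empty side of the colon means "unconstrained".
    return SearchTerm(label=label or None, keyword=keyword or None)


def parse_query(query: str) -> List[SearchTerm]:
    """Split a comma-separated query into its search terms."""
    return [parse_term(t.strip()) for t in query.split(",") if t.strip()]


print(parse_query(":XML, author:"))
```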

4 Challenges (1) Determining which part of the XML document corresponds to an answer –When searching HTML, the result units are usually documents –When searching XML, a finer granularity should be returned, e.g., a subtree

5 What should be returned for the query :ACID, :Kempster ?

6 Challenges (2) Avoiding the return of non-meaningfully related elements –XML documents often contain many unrelated fragments of information. Can these information units be recognized?

7 What should be returned for the query :XML, author: ?

8 What should be returned for the query :XML, :Kempster ?

9 Challenges (3) Ranking mechanisms –How should document fragments/XML elements be ranked? Ideas?

10 In what order should the answers be returned for :ACID, author: ?

11 Defining a Search Semantics. When defining a search over XML, all previous challenges must be considered. We must decide: –what portions of a document are a search result? –should any results be filtered out since they are not meaningful? –how should ranking be performed? Typically, research focuses on one of these problems and provides simple solutions for the other problems.

12 XRANK: Ranked Keyword Search over XML Documents. Guo, Shao, Botev, Shanmugasundaram. SIGMOD 2003

13 Queries and their Semantics. Queries are keywords k_1, …, k_n, as in a search engine. Query results are portions of XML documents that contain all words. Formally: –Let v be a node in the document. To determine whether v should be returned: first, “remove” any descendants of v that contain all the keywords k_1, …, k_n. If v still contains all of k_1, …, k_n, then v should be a result of the search. –Intuition: Only return v if no more specific element can be returned. Note: Containment is via child edges, not IDREF edges.
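The definition above can be read as a recursive procedure over the document tree. Below is a minimal sketch of that reading, assuming a single document parsed with Python's xml.etree.ElementTree and a simple word-based notion of "contains"; the helper names and the small sample document are hypothetical and only loosely follow the example on the next slides.

```python
import re
import xml.etree.ElementTree as ET


def direct_words(elem):
    """Lower-cased words in the text directly inside elem (not inside its children)."""
    parts = [elem.text or ""] + [child.tail or "" for child in elem]
    return set(re.findall(r"\w+", " ".join(parts).lower()))


def xrank_results(root, keywords):
    """Return the elements that still contain all keywords after every
    descendant containing all keywords has been 'removed'."""
    wanted = {k.lower() for k in keywords}
    results = []

    def visit(elem):
        # found: keywords seen in elem once "full" subtrees are removed.
        found = direct_words(elem) & wanted
        any_full_child = False
        for child in elem:
            child_found, child_full = visit(child)
            if child_full:
                any_full_child = True   # removed: contributes nothing to elem
            else:
                found |= child_found
        if found == wanted:
            results.append(elem)
        # Second value: does this subtree contain all keywords?
        return found, (found == wanted) or any_full_child

    visit(root)
    return results


doc = ET.fromstring(
    "<workshop><title>XML and Information Retrieval</title>"
    "<paper><title>XQL and Proximal Nodes</title>"
    "<abstract>We consider the language</abstract>"
    "<section>At first sight, the XQL language looks"
    "<cite>Querying XML in Xyleme</cite></section></paper></workshop>"
)
print([e.tag for e in xrank_results(doc, ["XQL", "language"])])
```

In this sample, the section is returned because it contains both words directly, and the paper is returned as well because its title and abstract together still contain both words once the section is removed. The workshop element is not returned.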

14 [Figure: example XML document with the text “XML and Information Retrieval: A SIGIR 2000 Workshop”, “David Carmel, Yoelle Maarek, Aya Soffer”, “XQL and Proximal Nodes”, “Ricardo Baeza-Yates”, “Gonzalo Navarro”, “We consider the language …”, “At first sight, the XQL language looks …” and “Querying XML in Xyleme”.] What should be returned for the query XQL language?

15 [Figure: the same example document.] What should be returned for the query XQL language?

16 Ranking Results: Intuition Granularity of ranking –In HTML, there is a rank for each document –In XML, we want a rank for each element. Different elements in the same document may have different ranks Propose to extend ideas used for ranking HTML: –PageRank: Documents with more incoming links are more important (recursive definition) –Proximity: If the document contains the search terms close together, then the document is more important Overall Rank: combination of PageRank and proximity

17 [Figure: the example document from slide 14.] Should both papers be ranked the same?

18 Topics We discuss: –Ranking –The Index Structure –Query Processing

19 Ranking Results. Take into consideration –hyperlinks –proximity. We only discuss here ranking by the linking structure. Ranking by proximity can easily be defined (ideas?). What kinds of “links” are there in a graph of XML documents?

20 [Figure: the example document from slide 14.] Child/Parent “links”

21 [Figure: the example document from slide 14.] IDREF “links”

22 [Figure: the example document from slide 14.] XLink “links” (out of the document)

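23 Remember: PageRank. $p(v) = \frac{1-d}{N} + d \sum_{(u,v)} \frac{p(u)}{N_{out}(u)}$, where the sum ranges over the hyperlink edges (u,v) into v, N is the number of documents, N_out(u) is the number of outgoing links of u, d is the probability of following a hyperlink, and 1-d is the probability of a random jump. [Figure: a page with three outgoing hyperlink edges, each followed with probability d/3.]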

24 A Graph of XML documents. Nodes: N –each element in a document is a node. Edges: E = CE ∪ CE⁻¹ ∪ HE –CE are “containment links”, i.e., there is an edge (u,v) in CE if u is a parent of v in the XML document –HE are “hyperlinks”, i.e., there is an edge (u,v) in HE if there is an IDREF link or XLink link from u to v. We want to define ElemRank, the parallel to PageRank, but for XML elements.

25 Attempt 1 at ElemRank. There are now 4 ways to get to an element. Consider all in the formula. [Figure: node v with hyperlink edges and containment edges.]

26 Attempt 1 at ElemRank: Problem. Consider a paper with few sections and many references. The more references there are, the less important each section is. Why? [Figure: node v with hyperlink edges and containment edges.]

27 Attempt 2 at ElemRank. Consider hyperlinks and structural links separately. [Figure: node v with hyperlink edges and containment edges.]

28 Attempt 2 at ElemRank: Problem. In fact, it is better to consider parent-child links differently from child-parent links. [Figure: node v with hyperlink edges and containment edges.]

29 Actual ElemRank. Consider hyperlinks, parent links and child links separately. [Figure: node v with hyperlink edges and containment edges.]

30 Interpretation in terms of Random Walks. The element rank of e is the probability that e will be reached if we start at a random element and at each point we choose one of the following options: –with probability 1-d_1-d_2-d_3 jump to a random element in a random page –with probability d_1 follow a random hyperlink from the current element –with probability d_2 follow a random edge to a child element –with probability d_3 follow the parent edge
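The random walk described above can be simulated by power iteration. The following is a minimal sketch under two simplifying assumptions that are not taken from the slides: there is a single document, so the random jump is uniform over all elements, and when an element has no hyperlink, no child or no parent, the probability of that option is added to the random jump. The function name elem_rank and the small graph are hypothetical.

```python
def elem_rank(nodes, children, hyperlinks, d1=0.3, d2=0.3, d3=0.3, iters=100):
    """Approximate the stationary distribution of the random walk by power iteration.

    nodes: list of element ids; children: {parent: [child, ...]};
    hyperlinks: {source: [target, ...]}.
    """
    parent = {c: p for p, cs in children.items() for c in cs}
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        nxt = {v: 0.0 for v in nodes}
        jump = 0.0  # probability mass that performs a random jump
        for u in nodes:
            moves = [
                (d1, hyperlinks.get(u, [])),              # follow a random hyperlink
                (d2, children.get(u, [])),                # go to a random child
                (d3, [parent[u]] if u in parent else []), # go to the parent
            ]
            used = 0.0
            for prob, targets in moves:
                if targets:
                    used += prob
                    for v in targets:
                        nxt[v] += rank[u] * prob / len(targets)
            # Assumption: options that are unavailable at u fall back to a jump.
            jump += rank[u] * (1.0 - used)
        for v in nodes:
            nxt[v] += jump / n
        rank = nxt
    return rank


# Hypothetical four-node graph: 1 contains 2 and 3, 3 contains 4, 2 links to 4.
ranks = elem_rank([1, 2, 3, 4], {1: [2, 3], 3: [4]}, {2: [4]})
for node, value in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(node, round(value, 4))
```

Because every step only redistributes probability mass, the ranks always sum to 1. Other treatments of elements with missing edge types are possible and would give different values.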

31 ElemRank Example. Suppose that d_1 = d_2 = d_3 = 0.3. In what order will the nodes be ranked? What will be the formula for each node? [Figure: four nodes, numbered 1 to 4, connected by hyperlink edges and containment edges.]

32 Think About It. Very nice definition of ElemRank. Does it make sense? Would ElemRank give good results in the following scenarios: –IDREFs connect articles with articles that they cite –IDREFs connect managers with their departments –IDREFs connect cleaning staff with the departments in which they work –IDREFs connect countries with bordering countries (as in the CIA factbook)

33 Topics We discuss: –Ranking –The Index Structure –Query Processing

34 Indexing We now discuss the index structure Recall that we will be ranking according to ElemRank Recall that we want to return “most specific elements” How should the data be stored in an index?

35 Naive Method. Treat elements as documents: normal inverted lists. Ricardo: 0; 4; 5; 8. XQL: 0; 4; 5; 7. Problem: Space Overhead. How much space is needed in storage?

36 Naive Method. Treat elements as documents: normal inverted lists. Ricardo: 0; 4; 5; 8. XQL: 0; 4; 5; 7. Problem: Spurious Results. We can’t simply return the intersection of the lists, since if a node satisfies a query, so do all its ancestors.
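The spurious-results problem can be seen by intersecting the two lists from the slide. This is a small sketch: the two lists are copied from the slide, but the parent map is an assumption made for illustration, since the tree itself is only shown in the figure.

```python
# Inverted lists from the slide: every element whose subtree contains the word.
ricardo = [0, 4, 5, 8]
xql = [0, 4, 5, 7]

# Hypothetical tree shape (child -> parent), assumed for this illustration.
parent = {4: 0, 5: 4, 7: 5, 8: 5}


def ancestors(node):
    """All proper ancestors of node under the assumed parent map."""
    result = set()
    while node in parent:
        node = parent[node]
        result.add(node)
    return result


common = sorted(set(ricardo) & set(xql))
# Candidates that are ancestors of another candidate: the naive lists give
# no way to tell whether they are answers in their own right.
suspect = [v for v in common if any(v in ancestors(w) for w in common)]

print(common)   # elements appearing in both lists
print(suspect)  # ancestors of a more specific candidate
```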

37 Dewey Encoding of ID. Use path information to identify elements: the DeweyID. An ancestor’s ID is a prefix of its descendant’s ID. Actually (not shown), all the node IDs are prefixed by the document number. [Figure: the example document tree annotated with Dewey IDs.]

38 Dewey Inverted List (DIL) Store, for each keyword a list containing : – the id of the node containing the keyword –the rank of the node containing the keyword –the positions of the keyword in the node Rank and positions are needed to compute ranking To simplify, in the following slides, we only store lists of node ids
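A Dewey inverted list restricted to node IDs, as in the simplification above, might be built as follows. This is a sketch only: the function names are hypothetical, Dewey IDs are represented as tuples that start with the document number, children are numbered from 0, and ranks and keyword positions are left out.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict


def build_dil(root, doc_id):
    """Map each word to the sorted Dewey IDs of the elements that directly contain it."""
    dil = defaultdict(list)

    def visit(elem, dewey):
        parts = [elem.text or ""] + [child.tail or "" for child in elem]
        for word in set(re.findall(r"\w+", " ".join(parts).lower())):
            dil[word].append(dewey)
        for i, child in enumerate(elem):
            visit(child, dewey + (i,))   # child ID extends the parent's ID

    # Pre-order traversal visits IDs in lexicographic order, so lists stay sorted.
    visit(root, (doc_id,))
    return dict(dil)


def is_ancestor(a, b):
    """An ancestor's Dewey ID is a proper prefix of its descendant's ID."""
    return len(a) < len(b) and b[:len(a)] == a


sample = ET.fromstring(
    "<proceedings><paper><title>XQL basics</title>"
    "<abstract>a language</abstract></paper></proceedings>"
)
index = build_dil(sample, 47)
print(index["xql"], index["language"])
```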

39 Topics We discuss: –Ranking –The Index Structure –Query Processing

40 Query Processing. Challenges: –How do we find nodes that contain all keywords? –How do we find only the most specific node that contains all keywords? –Can this be done in a single scan of the inverted keyword lists?

41 Example: Document. 47th document in corpus. [Figure: document tree with elements proceedings, paper, title, abstract, section, subsection and a second paper; text nodes contain “… XQL …”, “… language …” and “… XQL language …”.]

42 Example: Document with IDs. 47th document in corpus. [Figure: the same tree, with a Dewey ID on each node.]

43 Example: Inverted Lists. [Figure: the same tree, with the inverted lists for XQL and language.] Lists contain IDs for nodes that directly contain the keyword. Lists are sorted.

44 Example: Inverted Lists. [Figure: the same tree and inverted lists.] We want to find nodes that should be returned. Which?

45 Algorithm: Data Structures. [Figure: the inverted lists for XQL and language; a Dewey stack whose entries hold a DeweyID component, Contains[1], Contains[2] and ContainsAll; and a result heap.]

46 Algorithm: Pseudo Code. Find smallest next entry in inverted lists. Find longest common prefix of entry and Dewey stack. Pop all non-matching values from Dewey stack. When popping: –propagate down containment information, if ContainsAll is false –if ContainsAll turns from false to true, add result to output. Add non-matching values from entry into Dewey stack. Mark containment for entry’s keyword.
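The pseudo code above can be turned into a short single-scan procedure. The following is a sketch, not the authors' implementation: it keeps only node IDs (no ranks or positions), returns a plain list instead of a result heap, and decides whether a popped node is a result by the semantics of slide 13, i.e., whether its Contains flags are all set once descendants that contain all keywords have been left out. The function name dil_query and the sample lists are hypothetical.

```python
import heapq


def dil_query(lists):
    """lists: one sorted list of Dewey IDs (tuples) per keyword.
    Returns the Dewey IDs of the result nodes, found in a single scan."""
    n = len(lists)
    # Merge the keyword lists into one stream ordered by Dewey ID.
    stream = heapq.merge(*[[(dewey, k) for dewey in lst]
                           for k, lst in enumerate(lists)])
    stack = []    # one frame per Dewey component: [component, contains, contains_all]
    results = []

    def pop():
        comp, contains, contains_all = stack.pop()
        path = tuple(frame[0] for frame in stack) + (comp,)
        if all(contains):
            results.append(path)
            contains_all = True
        if stack:
            parent = stack[-1]
            if contains_all:
                parent[2] = True   # subtree is "removed" from the parent's view
            else:
                parent[1] = [a or b for a, b in zip(parent[1], contains)]

    for dewey, k in stream:
        # Longest common prefix of the entry and the Dewey stack.
        lcp = 0
        while lcp < min(len(stack), len(dewey)) and stack[lcp][0] == dewey[lcp]:
            lcp += 1
        while len(stack) > lcp:      # pop non-matching components
            pop()
        for comp in dewey[lcp:]:     # push the remaining components of the entry
            stack.append([comp, [False] * n, False])
        stack[-1][1][k] = True       # mark containment for the entry's keyword
    while stack:
        pop()
    return results


# Hypothetical lists: title (47,0,0) has XQL, abstract (47,0,1) has language,
# section (47,0,2) has both.
xql_list = [(47, 0, 0), (47, 0, 2)]
language_list = [(47, 0, 1), (47, 0, 2)]
print(dil_query([xql_list, language_list]))
```

In the sample, the section is reported first, when it is popped, and the enclosing paper is reported afterwards because its title and abstract together contain both keywords.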

47 Example: Algorithm. [Figure: initial state of the inverted lists for XQL and language, the Dewey stack (DeweyID, Contains[1], Contains[2], ContainsAll) and the result heap.]

48 Example: Algorithm. Smallest entry is for keyword 1, XQL. lcp with Dewey stack = none. Pop (nothing). Add (all).

49 Example: Algorithm. [Figure: the data structures after this step.]

50 Example: Algorithm. Next smallest entry is for keyword 2, language. lcp with Dewey stack = … Pop non-matching entries.

51 Example: Algorithm. Next smallest entry is for keyword 2, language. lcp with Dewey stack = … Add additional entries.

52 Example: Algorithm. Next smallest entry is for keyword 2, language. lcp with Dewey stack = … [Figure: the data structures after this step.]

53 Example: Algorithm. Next smallest entry is for keyword 1, XQL. lcp with Dewey stack = … Pop non-matching entries.

54 Example: Algorithm. Next smallest entry is for keyword 1, XQL. lcp with Dewey stack = … Continue on blackboard!

55 Try It! Which nodes would be returned for the keyword search John Ben? Show the state of the data structures at the point that the first answer is printed out, for the keyword search John Ben.

56 Inexact Querying of XML

57 XML Data May be Irregular. Relational data is regular and organized. XML may be very different. –Data is incomplete: missing values of attributes in elements –Data has structural variations: relationships between elements are represented differently in different parts of the document –Data has ontology variations: different labels are used to describe nodes of the same type (Note: In some of the upcoming slides, we have labels on edges instead of on nodes.)

58 Incomplete Data. [Figure: Movie Database graph with edge labels Movie, Film, T.V. Series, Actor, Title, Name and Year, and values Dune, Star Wars, Twin Peaks, Léon, Magnolia, Kyle MacLachlan, Natalie Portman, Harrison Ford, Mark Hamill and 1977.] One movie has a year attribute; the year of another movie is missing.

59 Variations in Structure. [Figure: the same Movie Database graph.] In one place Movie is below Actor; in another, Actor is below Movie.

60 Ontology Variations. [Figure: the same Movie Database graph.] One node has a movie label; another has a film label.

61 The description of the schema is large (e.g., a DTD of XML), so it is difficult to use the schema when formulating queries. Data is contributed by many users in a variety of designs, so the query should deal with different structures of data. The structure of the database is changed frequently, so queries should be rewritten frequently. We need to allow the user to write an “approximate query” and have the query processor deal with it.

62 The Problem. In many different domains, we are given the option to query some source of information. Usually, the user only gets results if the query can be completely answered (satisfied). In many domains, this is not appropriate, e.g., –The user is not familiar with the database –The database does not contain complete information –There is a mismatch between the ontology of the user and that of the database

63 What Do Users Need? Users need a way to get interesting partial answers to their queries, especially if a complete answer does not exist. These partial answers should contain maximal information. Problem: –It is easy to define when an answer satisfies a query –Hard to say when an answer that does not satisfy a query is of interest –Hard to say which incomplete answers are better than others

64 Inexact Answers. Many different definitions have been given –For each definition, query processing algorithms have been defined. Examples: –Allow some of the nodes of the query to be unmatched –Allow edges in the query to be matched to paths in the database –Allow nodes to be matched to nodes with labels that have a similar meaning. Be careful so that answers are meaningful!

65 Tree Pattern Relaxation Amer-Yahia, Cho, Srivastava EDBT 2002

66 Tree Patterns. Queries are tree patterns, as considered in previous lessons. [Figure: tree pattern with root Book and children Collection and Editor; Editor has Name and Address below it.] A double line indicates a descendant edge.

67 Relaxed Queries. Four types of “relaxations” are allowed on the trees. Node Generalization: Assume that we know a relationship of types/super-types among labels. Allow a label to be changed to a super-type. [Figure: the pattern over Book, Collection, Editor, Name, Address, and the same pattern with Book changed to Document.]

68 Relaxed Queries. Leaf Node Deletion: Delete a leaf node (and its incoming edge) from the tree. [Figure: the pattern over Book, Collection, Editor, Name, Address, and the same pattern without Collection.]

69 Relaxed Queries. Edge Generalization: Change a parent-child edge to an ancestor-descendant edge. [Figure: the pattern before and after generalizing one edge.]

70 Relaxed Queries. Subtree Promotion: A query subtree can be promoted so that it is directly connected to its former grandparent by an ancestor-descendant edge. [Figure: the pattern before and after promoting one subtree.]
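The four relaxations can be written as small functions on a tree pattern. In this sketch, which is not taken from the slides, a pattern is a hypothetical dictionary that maps each node label to its (parent, edge type) pair, where the edge type is "c" for parent-child and "d" for ancestor-descendant; the sample pattern follows the join conditions given later on slide 78.

```python
# Tree pattern Q: node -> (parent, edge type), or None for the root.
Q = {
    "Book": None,
    "Collection": ("Book", "c"),
    "Editor": ("Book", "c"),
    "Name": ("Editor", "c"),
    "Address": ("Editor", "d"),
}


def generalize_node(q, node, supertype):
    """Node generalization: replace a label by a super-type."""
    out = {}
    for n, link in q.items():
        if link is not None and link[0] == node:
            link = (supertype, link[1])
        out[supertype if n == node else n] = link
    return out


def delete_leaf(q, node):
    """Leaf node deletion: drop a leaf together with its incoming edge."""
    is_leaf = all(link is None or link[0] != node for link in q.values())
    if q[node] is None or not is_leaf:
        raise ValueError("only non-root leaves can be deleted")
    return {n: link for n, link in q.items() if n != node}


def generalize_edge(q, node):
    """Edge generalization: parent-child edge becomes ancestor-descendant."""
    if q[node] is None:
        raise ValueError("the root has no incoming edge")
    return {**q, node: (q[node][0], "d")}


def promote_subtree(q, node):
    """Subtree promotion: attach node to its former grandparent by a 'd' edge."""
    if q[node] is None or q[q[node][0]] is None:
        raise ValueError("node has no grandparent")
    grandparent = q[q[node][0]][0]
    return {**q, node: (grandparent, "d")}


# Relaxations compose: promote Name, then generalize Book to Document.
relaxed = generalize_node(promote_subtree(Q, "Name"), "Book", "Document")
print(relaxed)
```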

71 Composing Relaxations. Relaxations can be composed. Are the following relaxations of Q? [Figure: the pattern Q over Book, Collection, Editor, Name, Address, and candidate patterns over the labels Book, Collection, Address, Name and Document.]

72 Approximate Answers and Ranking. An approximate answer to Q is an exact answer to a relaxed query derived from Q. In order to give different answers different rankings, tree patterns are weighted. Each node and edge has 2 weights: the value when exactly satisfied and the value when satisfied by a relaxation. [Figure: the pattern over Book, Collection, Editor, Name, Address with the weight pairs (7, 1), (4, 3), (2, 1), (6, 0), (5, 0), (8, 5), (6, 0), (4, 0), (3, 0) on its nodes and edges.] A fragment of a document that exactly satisfies the query will have a score of 45, the sum of the first weights: 7 + 4 + 2 + 6 + 5 + 8 + 6 + 4 + 3 = 45.
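The scoring rule can be sketched as a sum over the nodes and edges of the pattern. The weight pairs below are the ones on the slide, but which pair belongs to which node or edge is only visible in the figure, so the assignment used here is an assumption, as is the choice to let a deleted part contribute 0. The names WEIGHTS and score are hypothetical.

```python
# (weight if exactly satisfied, weight if satisfied by a relaxation).
# The pairs come from the slide; their assignment to parts is assumed.
WEIGHTS = {
    "node:Book": (7, 1),
    "node:Collection": (4, 3),
    "node:Editor": (2, 1),
    "node:Name": (6, 0),
    "node:Address": (5, 0),
    "edge:Book-Collection": (8, 5),
    "edge:Book-Editor": (6, 0),
    "edge:Editor-Name": (4, 0),
    "edge:Editor-Address": (3, 0),
}


def score(match):
    """match maps each part to 'exact' or 'relaxed'; parts left out score 0."""
    total = 0
    for part, (exact_weight, relaxed_weight) in WEIGHTS.items():
        how = match.get(part)
        if how == "exact":
            total += exact_weight
        elif how == "relaxed":
            total += relaxed_weight
    return total


all_exact = {part: "exact" for part in WEIGHTS}
print(score(all_exact))  # 45, as stated on the slide
```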

73 Example Ranking. [Figure: the weighted pattern from slide 72, and a document fragment with elements Book, Person, Name, Address and Details and the values Sam and NY.] How much would this answer score?

74 Example Ranking. [Figure: the same weighted pattern and document fragment.] How much would this answer score?

75 Problem Definition Given an XML document D, a weighted tree pattern Q and a threshold t, find all approximate answers of Q in D whose scores are ≥ t Naive strategy to solve the problem: –Find all relaxations of Q –For each relaxation, compute all exact answers –remove answers with score below t Is this a good strategy?

76 Problem Definition Given an XML document D, a weighted tree pattern Q and a threshold t, find all approximate answers of Q in D whose scores are ≥ t A better strategy to compute an answer to a relaxation of a query: –Intuition: Compute the query as a series of joins –Can use stack-merge algorithms (studied before) for computing joins –filter out intermediate results whose scores are too low

77 The Query Plan. We now show how to derive a plan for evaluating queries in this setting. First, we show how an exact plan is derived. Then, we consider how each individual relaxation can be added in. Finally, we show the complete relaxed plan.

78 Query Plan: Exact Answers. [Figure: the weighted pattern from slide 72 and its join plan over the lists Book, Collection, Editor, Name and Address, with the join conditions c(Book, Collection), c(Book, Editor), c(Editor, Name) and d(Editor, Address).] c(x,y) = y is a child of x; d(x,y) = y is a descendant of x.

79 Query Plan: Exact Answers. [Figure: the same pattern and join plan.] Remember, to compute a join, e.g., of Book and Collection, we actually find the list of Books and the list of Collections (from the index) and perform the stack-merge algorithms.

80 Adding Relaxations into Plan. Node generalization: Book relaxed to Document. [Figure: the join plan with Document in place of Book; c(Book, Collection) and c(Book, Editor) become c(Document, Collection) and c(Document, Editor), next to c(Editor, Name) and d(Editor, Address).]

81 Adding Relaxations into Plan. Edge generalization: relax the Editor-Name edge. [Figure: the join plan with conditions c(Book, Collection), c(Book, Editor), d(Editor, Address) and the relaxed Editor-Name condition.] The condition becomes: c(Editor, Name) or (not exists c(Editor, Name) and d(Editor, Name)). Written in short as: c(Editor, Name) or d(Editor, Name). We only allow relaxations when a direct child does not exist.

82 Adding Relaxations into Plan. Subtree Promotion: promote the tree rooted at Name. [Figure: the join plan with conditions c(Book, Collection), c(Book, Editor), d(Editor, Address) and the relaxed Name condition.] The condition becomes: c(Editor, Name) or (not exists c(Editor, Name) and d(Book, Name)). Written in short as: c(Editor, Name) or d(Book, Name).

83 Adding Relaxations into Plan. Leaf Node Deletion: make Address optional. [Figure: the join plan with conditions c(Book, Collection), c(Book, Editor), c(Editor, Name) and d(Editor, Address), where the join with Address is an outer join.] Outer join operator: means that we should join if possible, but not delete values that cannot join.

84 Combining All Possible Relaxations. All approximate answers can be derived from the following query plan. [Figure: plan over Document, Collection, Editor, Name and Address with the conditions c(Book, Collection) OR d(Document, Collection); c(Document, Editor) OR d(Document, Editor); c(Editor, Name) OR d(Editor, Name) OR d(Document, Name); d(Editor, Address) OR d(Document, Address). The weighted pattern from slide 72 is shown next to it.]

85 Creating “Best Answers” Want to find answers whose ranking is over the threshold t Naive solution: Create all answers. Delete answers with low ranking Algorithm Thres: Goal of the algorithm is to prune intermediate answers that cannot possibly meet the specified threshold

86 Associating Nodes with Maximal Weight. The maximal weight of a node in the evaluation plan is the largest value by which the score of an intermediate answer computed for that node can grow. [Figure: the relaxed query plan from slide 84.]

87 [Figure: the relaxed query plan from slide 84 annotated with the maximal weights (38), (39), (30), (40), (39), (41), (21), (7) and (0), next to the weighted pattern from slide 72.]

88 Algorithm Thres Relaxed query evaluation plan is computed bottom-up –Note that the joins are computed for all matching intermediate results at the same time At each step, intermediate results are computed, along with their scores If the sum of an intermediate result score with the maximal weight of the current node is less than the threshold, prune the intermediate result
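The pruning rule of Algorithm Thres can be stated in a few lines. This is a sketch of the rule only, not of the full plan evaluation: the function name prune_intermediate and the intermediate results are hypothetical, and the numbers are chosen just to illustrate the comparison against a threshold of 35.

```python
def prune_intermediate(results, max_weight, threshold):
    """Keep an intermediate result only if it can still reach the threshold.

    results: list of (answer, score) pairs computed at the current plan node;
    max_weight: largest amount by which a score can still grow from this node on.
    """
    return [(answer, s) for answer, s in results if s + max_weight >= threshold]


# Hypothetical intermediate results at a plan node whose maximal weight is 21.
candidates = [("a1", 10), ("a2", 14), ("a3", 20)]
print(prune_intermediate(candidates, max_weight=21, threshold=35))
```

Here a1 is pruned, since 10 + 21 = 31 is below 35, while a2 (exactly 35) and a3 (41) are kept, because the rule prunes only when the sum is strictly less than the threshold.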

89 Example: Threshold = 35. [Figure: a document fragment with elements Book, Editor, Name, Address and Details and the values Sam and NY; the relaxed query plan with the maximal weights from slide 87; and the weighted pattern from slide 72.] When will the answer be pruned?

90 Try It! [Figure: the weighted pattern from slide 72 and a document fragment with elements Document, Collection, Name and Address and the values Sam and NY.] 1) How much would this answer score?

91 Try It (cont). [Figure: a pattern over Book, Collection, Editor, Name, FName and LName with the weight pairs (8, 5), (7, 1), (4, 3), (2, 1), (5, 0), (6, 0), (2, 1), (2, 0), (1, 0).] 2) What will the exact plan look like? 3) What will the plan look like if all possible relaxations are added? 4) What is the maximal weight by which the score of an intermediate answer can grow, for each node?