The Selim and Rachel Benin School of Engineering and Computer Science. Keyword Proximity Search in Complex Data Graphs. Konstantin Golenberg, Benny Kimelfeld, Yehoshua Sagiv.

Overview (Benny Kimelfeld, Keyword Proximity Search in Complex Data Graphs, The Hebrew University, SIGMOD08). Keyword Proximity Search; System Overview; Algorithm for Answer Generation; Ranking Answers; Conclusions & Future Work.

Schema-Free Extraction of Data. Nowadays, users are exposed to many databases of different types (relational, XML, RDF…) and with different schemas. Traditional querying paradigms (e.g., SQL, XQuery, SPARQL) are not easy to use and, moreover, require a thorough understanding of the schema. Goal: enable users to instantly pose (inaccurate) queries without knowing the schema. The natural (and popular) option: keyword search. Problem: inherently different from standard IR.

Keyword Proximity Search (KPS). Data have varying degrees of structure: relational (w/ foreign keys), XML (w/ id-references); they are naturally represented by a graph and are usually data-centric rather than document-centric. A query is a set of keywords, with no structural constraints. The goal: extract meaningful parts of the data w.r.t. the keywords. Related work: Agrawal et al., ICDE 02; Hristidis et al., VLDB 02, 03, ICDE 03; Bhalotia et al., VLDB 05; Kacholia et al., VLDB 06; Ding et al., ICDE 07; Liu et al., SIGMOD 06; Wang et al., VLDB 06; Luo et al., SIGMOD 07; …

Example: Search in RDB. Query: Belgium, Brussels.

Cities
ID   Name       Population
22   Amsterdam  1101407
73   Brussels   951580

Organizations
ID   Name  Head Q.
135  EU    73
175  ESA   81

Countries
Code  Name         Area   Capital
NL    Netherlands  37330  22
B     Belgium      30510  73

Memberships
Country  Org.
B        135
NL       135

Example: Search in RDB (same tables). Query: Belgium, Brussels. Answer: Brussels is the capital city of Belgium.

Example: Search in RDB (same tables). Query: Belgium, Brussels. Answer: Brussels hosts the EU, and Belgium is a member.

Example: Search in XML. Query: Yannakakis, Approximation.

Query: Yannakakis, Approximation. Answer: Yannakakis wrote a paper about Approximation.

Query: Yannakakis, Approximation. Answer: Yannakakis is cited by a paper about Approximation.

Data Graphs. A data graph has structural nodes and keyword nodes. Edges and nodes may have weights; weak relationships are penalized by large weights. Each keyword has one occurrence in the data graph (a technical assumption).

Queries. Queries are sets of keywords from the data graph, e.g., Q = {Summers, Cohen, coffee}.

An Answer is a Reduced Subtree. An answer is a subtree of the data graph that contains all keywords of the query and has no redundant edges (and nodes). Three variants: directed, undirected, and strong (undirected, keywords are leaves). [Slide annotation: "This paper".]
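To make the definition concrete, here is a minimal sketch of a check for the undirected variant. It assumes that "no redundant edges (and nodes)" amounts to requiring that every leaf of the tree is a keyword node; the function name and the edge-list representation are hypothetical and not taken from the talk.

```python
from collections import defaultdict

def is_undirected_answer(edges, keywords):
    """Check that `edges` form a reduced undirected subtree containing `keywords`.

    Assumption: a subtree is reduced when every leaf is a keyword node.
    """
    keywords = set(keywords)
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    nodes = set(adj)
    if not nodes or not keywords <= nodes:
        return False  # some keyword is missing
    # Tree check, part 1: exactly |V| - 1 distinct edges.
    if len({frozenset(e) for e in edges}) != len(nodes) - 1:
        return False
    # Tree check, part 2: connected (depth-first traversal).
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    if seen != nodes:
        return False
    # Reduced: a non-keyword leaf could be removed, so it is redundant.
    return all(n in keywords for n in nodes if len(adj[n]) == 1)
```

For example, a tree that connects Belgium and Brussels through one structural node passes the check, while the same tree with an extra dangling structural node does not.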

Previous Solutions. Lack of guarantees: highly relevant answers might be missed, and/or the algorithms are inefficient. Rather simple data sets with a (very) small number of relevant answers: they considered data that are essentially collections of entities (DBLP, IMDB, Lyrics, etc.), where an answer is usually within the scope of one entity, e.g., the keywords appear in a single movie. Crucial problems are ignored: in particular, the repeated-information problem, which is especially pervasive in complex data graphs.

Contributions. A system for keyword proximity search. An algorithm for generating answers with guarantees: it does not miss (valuable) answers, it is efficient (polynomial delay), and answers are generated in a 2-approximate order by height. A ranking technique that is aware of the repeated-information problem: it gives preference to answers with low similarity to earlier ones. Experimentation over a highly cyclic data graph: the Mondial database, which has many meaningful connections among keywords.

The MONDIAL Database. Institute for Informatics, Georg-August-Universität Göttingen. http://www.dbis.informatik.uni-goettingen.de/Mondial/

Challenges. Huge no. of answers; not instantiated! It is not simple to generate all relevant answers, even if ranking is ignored. For practical ranking functions, enumerating the answers in ranked order is probably impossible; for example, finding the smallest answer is the intractable Steiner-tree problem. Redundancy / repeated information: many answers are very similar (altogether they provide a low amount of information), which is crucial in complex (highly cyclic) data graphs. We employ a two-phase architecture:

Architecture: Generator + Ranker. The user issues search(keywords) and then repeatedly asks for the next k answers. Answer Generator: generates the next M·k answers (using a simplified ranking function). Ranker: ranks all answers generated up to now (minus the printed ones) and outputs the top-k answers (relative to those that have already been printed). Simplified ranking at first [Bhalotia et al., ICDE02, VLDB05].
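The interplay of the two components can be sketched as a small loop. This is only an illustration under assumptions: `generator` stands for any iterable that produces answers in the simplified order, `rank` is a hypothetical scoring function (higher is better), and the function name `search` merely mirrors the label on the slide.

```python
from itertools import islice

def search(generator, rank, k, M):
    """Yield batches of k answers: pull M*k new candidates, then output the best k."""
    gen = iter(generator)
    pool = []  # generated but not yet printed answers
    while True:
        pool.extend(islice(gen, M * k))      # the generator produces the next M*k answers
        if not pool:
            return                           # nothing left to print
        pool.sort(key=rank, reverse=True)    # the ranker re-ranks all unprinted answers
        batch = pool[:k]
        del pool[:k]                         # printed answers leave the pool
        yield batch
```

Each request for the "next k answers" corresponds to one step of the returned iterator; answers that were generated but not printed stay in the pool and compete again in later rounds.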

Generating the Top Answers: Not Trivial! To demonstrate the difficulty of generating the good (top) answers, let's see how existing approaches operate on a simple example:

Find the Answers in this Example!

The BANKS Approach. For all nodes v (in a good order) and keyword occurrences: generate the min-height subtree emanating from v. Answers are directed subtrees. [Bhalotia et al., ICDE02, VLDB05]

The BANKS Approach. For all nodes v (in a good order) and keyword occurrences: generate the min-height subtree emanating from v. Answers are directed subtrees. What about this answer? Never generated! [Bhalotia et al., ICDE02, VLDB05]

The NUITS Approach. For all nodes v (in a good order): generate the min-weight subtree that includes v. Answers are undirected subtrees. [Ding et al., ICDE07]

The NUITS Approach. For all nodes v (in a good order): generate the min-weight subtree that includes v. Answers are undirected subtrees. This node is redundant: it is actually the previous answer! [Ding et al., ICDE07]

The NUITS Approach. For all nodes v (in a good order): generate the min-weight subtree that includes v. Answers are undirected subtrees. This node is redundant: again, the previous answer! [Ding et al., ICDE07]

The NUITS Approach. For all nodes v (in a good order): generate the min-weight subtree that includes v. Answers are undirected subtrees. What about this answer? Never generated! Severe limit on the number of generated answers (one per node)! [Ding et al., ICDE07]

The DISCOVER / DBXplorer Approach. For all possible queries Q (from the schema) in increasing size: evaluate Q over the database. All answers are generated in ranked order! Easy to implement: DBMS queries, no in-memory graph algorithms. [Hristidis et al., VLDB02,03, ICDE03] [Agrawal et al., ICDE02]

The DISCOVER / DBXplorer Approach. For all possible queries Q (from the schema) in increasing size: evaluate Q over the database. But many queries do not generate any answer at all! Inefficient: the worst case is exponential in the data. Limited ranking: by the query (rather than the answer) weight. [Hristidis et al., VLDB02,03, ICDE03] [Agrawal et al., ICDE02]

We Need Generators w/ Guarantees! All answers are generated: in particular, each of the relevant answers is produced at some point (100% recall is achievable). Controlled order of answers: for instance, increasing weight, increasing height, approximate (what is the ratio?) / heuristic order. Efficiency: the top-k answers should be generated efficiently, with a bound on the time between successive answers.

Order by Increasing Weight / Height. [Figure residue: an "If … Then …" illustration labeled "Top-k Answers".]

Approximate and Heuristic Orders. Approximate order: there is a provable bound on the extent to which the actual order can deviate from the optimal one. Heuristic order: intuitively, expected to be close to the optimal order, but there is no guarantee.

C-Approximate Order (inc. Weight / Height). [Figure residue: an "If … Then …" illustration with the factor C.] C-Approximation of the Top-k Answers [Fagin et al., PODS01].
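The condition behind a C-approximate order can be illustrated with a short check. The sketch assumes the usual reading of the term, which the slide text itself does not spell out: whenever one answer is generated before another, its height (or weight) is at most C times that of the later one. The function name is hypothetical.

```python
def is_c_approximate_order(values, c):
    """Check that values[i] <= c * values[j] holds for every i < j.

    `values` are the heights (or weights) of the answers in generation order;
    smaller is better. Assumed definition, see the lead-in.
    """
    worst_so_far = float("-inf")
    for x in values:
        # It suffices to compare x with the largest value generated earlier.
        if worst_so_far > c * x:
            return False
        worst_so_far = max(worst_so_far, x)
    return True
```

With C = 1 the check reduces to an exact non-decreasing order, which matches the "order by increasing weight / height" of the preceding slide.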

Our Approach. PODS06: enumeration by (exact / approx.) increasing weight. Problem: repeated application of Steiner-tree algorithms, which is heavy and hard to implement efficiently. Here: follow the basic approach of PODS06, but adopt the BANKS idea of using height (rather than weight) for the enumeration order. Recall: BANKS might miss highly relevant answers. Thus, we bypass Steiner trees and obtain a much faster algorithm. Our alg. has all 3 guarantees: answers are not missed, approximate order, poly. delay.

An Overview of the Algorithm. Task: enum. by (2-approx.) increasing height; method: Lawler / Yen. Task: find (a 2-approx. of) the shortest answer under constraints; this is the intricate part… Task: find the shortest answer (w/o constraints); method: backward-search (Dijkstra) iterators (~ BANKS). Types of constraints: inclusion (include edge e) and exclusion (exclude edge e).
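The Lawler / Yen method named here can be sketched generically. The code below is an illustration under assumptions, not the authors' implementation: answers are modeled as frozensets of comparable edge identifiers, and `best_under(include, exclude)` is a hypothetical solver that returns the best answer satisfying the constraints, or None if there is none. With an exact solver the output is in exact order; with an approximate one, as in the talk, only an approximate order can be expected.

```python
import heapq
from itertools import count

def lawler_partition(answer_edges, include, exclude):
    """Split the subspace (include, exclude), minus the given answer, into disjoint subspaces."""
    subspaces = []
    inc = set(include)
    for e in sorted(e for e in answer_edges if e not in include):
        # Subspace: keep all edges fixed so far, but forbid edge e.
        subspaces.append((frozenset(inc), frozenset(exclude) | {e}))
        inc.add(e)
    return subspaces

def enumerate_ranked(best_under, cost):
    """Enumerate answers by repeatedly printing the best one and partitioning its subspace."""
    tie = count()  # tie-breaker so the heap never compares answers directly
    queue = []
    first = best_under(frozenset(), frozenset())
    if first is not None:
        heapq.heappush(queue, (cost(first), next(tie), first, frozenset(), frozenset()))
    while queue:
        _, _, answer, include, exclude = heapq.heappop(queue)
        yield answer
        for inc, exc in lawler_partition(answer, include, exclude):
            candidate = best_under(inc, exc)
            if candidate is not None:
                heapq.heappush(queue, (cost(candidate), next(tie), candidate, inc, exc))
```

Every subspace excludes at least one edge of the answer it was derived from, so no answer is printed twice, and the subspaces together cover all remaining answers, so none is missed.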

Finding an Answer under Constraints. Constraint types: inclusion (include edge e) and exclusion (exclude edge e). Handling exclusion constraints is easy: simply remove the excluded edges from the graph.
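A minimal sketch of the easy case follows: excluded edges are skipped while building the graph, and a backward Dijkstra search from each keyword finds the root whose farthest keyword is nearest (minimum height). The graph encoding (a dict from directed edge to weight) and the function names are assumptions for illustration. BANKS-style iterators interleave the searches, whereas this sketch simply runs each search to completion.

```python
import heapq

def backward_distances(edges, target, excluded=frozenset()):
    """Dijkstra over reversed edges: shortest distance from each node to `target`."""
    rev = {}
    for (u, v), w in edges.items():
        if (u, v) in excluded:
            continue  # exclusion constraint: behave as if the edge were removed
        rev.setdefault(v, []).append((u, w))
    dist = {target: 0}
    heap = [(0, target)]
    while heap:
        d, v = heapq.heappop(heap)
        if d > dist.get(v, float("inf")):
            continue  # stale heap entry
        for u, w in rev.get(v, []):
            nd = d + w
            if nd < dist.get(u, float("inf")):
                dist[u] = nd
                heapq.heappush(heap, (nd, u))
    return dist

def min_height_root(edges, keywords, excluded=frozenset()):
    """Return (root, height) minimizing the largest root-to-keyword distance, or None."""
    per_kw = [backward_distances(edges, k, excluded) for k in keywords]
    best = None
    for r in per_kw[0]:
        if all(r in d for d in per_kw):          # r reaches every keyword
            h = max(d[r] for d in per_kw)
            if best is None or h < best[1]:
                best = (r, h)
    return best
```

The shortest paths from the returned root to the keywords form a min-height subtree; inclusion constraints are not handled here, since that is the intricate part the talk defers to the proceedings.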

Inclusion Constraints are the Problem. Constraint types: inclusion (include edge e) and exclusion (exclude edge e). Take the shortest subtree that contains the keywords and satisfies the constraints: but it is not an answer! It is not reduced (it has a redundant edge); moreover, it includes a previously printed answer. Sometimes, there is no answer at all!

The Correct Answer. Constraint types: inclusion (include edge e) and exclusion (exclude edge e). Technique: 1. Generate a min-height subtree (as in the wrong solution). 2. Not an answer? Modify it. It is intricate to guarantee the 2-approximation; details are in the proceedings.

Running Times. Each entry is an avg. of 4 queries.

Alg. Order vs. Weight Order. How many answers are generated in order to obtain the top-k (among 1000) according to weight? Each entry is an avg. of 4 queries.

Effective Approx. Ratio: Height. [Plots for 2 keywords and 3 keywords. Axis labels: %, k (answers). Plotted quantity: effective approx. ratio, worst / best (among first k).]

Effective Approx. Ratio: Height. [Plots for 4 keywords and 5 keywords. Axis labels: %, k (answers). Plotted quantity: effective approx. ratio, worst / best (among first k).]

Effective Approx. Ratio: Weight. [Plots for 2 keywords and 3 keywords. Axis labels: %, k (answers). Plotted quantity: effective approx. ratio, worst / best (among first k).]

Effective Approx. Ratio: Weight. [Plots for 4 keywords and 5 keywords. Axis labels: %, k (answers). Plotted quantity: effective approx. ratio, worst / best (among first k).]

The Basic Ranking Function. $\mathrm{abs\text{-}rel}(a) = \frac{1}{\mathrm{weight}(a)}$, where $\mathrm{weight}(a) = \sum_{\mathrm{node} \in a} \mathrm{weight}(\mathrm{node}) + \sum_{\mathrm{edge} \in a} \mathrm{weight}(\mathrm{edge})$.

Determining the Weight of an Edge. Many orgs enter a country: weak connection (large weight). An org. enters many countries: weak connection (large weight). Other cases labeled in the figure: strong connection (small weight); strongest!

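The Basic Ranking Function (contd). $\mathrm{abs\text{-}rel}(a) = \frac{1}{\mathrm{weight}(a)}$, where $\mathrm{weight}(a) = \sum_{\mathrm{node} \in a} \mathrm{weight}(\mathrm{node}) + \sum_{\mathrm{edge} \in a} \mathrm{weight}(\mathrm{edge})$. Node weight: $\mathrm{weight}(\mathrm{node})$ is fixed (1). Edge weight: for $\mathrm{edge} = (v_1, v_2)$ with $\mathrm{tag}(v_i) = t_i$, $\mathrm{weight}(\mathrm{edge}) = \log\bigl(1 + \alpha \cdot \mathrm{out}(v_1 \to t_2) + (1 - \alpha) \cdot \mathrm{in}(t_1 \to v_2)\bigr)$, where $\mathrm{out}(v_1 \to t_2)$ is the number of $t_2$ nodes with edges from $v_1$ and $\mathrm{in}(t_1 \to v_2)$ is the number of $t_1$ nodes with edges to $v_2$. Relevant answers but …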
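The edge-weight formula can be evaluated directly on a small graph. This sketch makes several assumptions that the slide does not state: edges are given as a list of directed pairs, `tag` maps each node to its tag, the logarithm is natural, and α defaults to 0.5. The function name is hypothetical.

```python
import math

def edge_weight(edges, tag, v1, v2, alpha=0.5):
    """weight(edge) = log(1 + alpha * out(v1 -> t2) + (1 - alpha) * in(t1 -> v2))."""
    t1, t2 = tag[v1], tag[v2]
    # out(v1 -> t2): number of t2-tagged nodes with an edge from v1
    out_count = len({y for x, y in edges if x == v1 and tag[y] == t2})
    # in(t1 -> v2): number of t1-tagged nodes with an edge to v2
    in_count = len({x for x, y in edges if y == v2 and tag[x] == t1})
    return math.log(1 + alpha * out_count + (1 - alpha) * in_count)

# Toy data in the spirit of the organization/country example:
edges = [("EU", "B"), ("EU", "NL"), ("ESA", "B")]
tag = {"EU": "org", "ESA": "org", "B": "country", "NL": "country"}
w_eu_b = edge_weight(edges, tag, "EU", "B")    # out = 2, in = 2
w_esa_b = edge_weight(edges, tag, "ESA", "B")  # out = 1, in = 2
```

An organization that enters many countries, or a country entered by many organizations, yields larger counts and hence a larger weight, which is how weak connections get penalized.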

Answers with High Similarity

Combinations of Connections. But each individual answer is relevant!

Dynamic Ranking. Candidate answers → output.
Next-Answer():
  a ← extract-top-candidate()
  print(a)
  for all candidates c and pairs of keywords k1, k2:
    if c and a connect k1 and k2 similarly, then penalize(c)
What does it mean?
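The Next-Answer step translates almost line by line into code. In this sketch the similarity test and the penalty are passed in as callbacks, because the pseudocode leaves both open at this point; all parameter names are hypothetical.

```python
def next_answer(candidates, score, connect_similarly, keyword_pairs, penalize):
    """Extract the top candidate, then penalize the remaining similar candidates."""
    a = max(candidates, key=score)   # a <- extract-top-candidate()
    candidates.remove(a)             # a is printed and leaves the candidate pool
    for c in candidates:
        for k1, k2 in keyword_pairs:
            if connect_similarly(c, a, k1, k2):
                penalize(c)          # c repeats information already shown by a
    return a
```

Because penalties accumulate between calls, a candidate that starts with a high score can be overtaken by a less similar candidate once a near-duplicate of it has been printed.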

Two Types of Similarity (illustrated for k1, k2 = Belgium, France, with a printed answer a and candidates c1, c2). The same connection (c1): penalty 1. An isomorphic connection, i.e., the same schema (c2): penalty p (≤ 1). Two options for aggregation: sum over printed answers, or max over printed answers.

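The General Ranking Function. $\mathrm{abs\text{-}rel}(c) = \frac{1}{\mathrm{weight}(c)}$. $\mathrm{rpt\text{-}inf}(c) = \sum_{\text{printed answers}} \sum_{k_1, k_2 \in \mathrm{kws}} (p \text{ or } 1)$, or alternatively $\mathrm{rpt\text{-}inf}(c) = \max_{\text{printed answers}} \sum_{k_1, k_2 \in \mathrm{kws}} (p \text{ or } 1)$. $\mathrm{score}(c) = \mathrm{abs\text{-}rel}(c) \cdot \frac{1}{1 + \varepsilon \cdot \mathrm{rpt\text{-}inf}(c)}$.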
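The general ranking function can be sketched as follows, reading the score as abs-rel(c) divided by 1 + ε · rpt-inf(c). The callback `relation(c, a, k1, k2)` is a hypothetical helper that returns "same", "iso", or None for the way candidate c and printed answer a connect the keyword pair; the default values of p and ε are illustrative only.

```python
def rpt_inf(c, printed, keyword_pairs, relation, p=0.1, mode="sum"):
    """Repeated-information penalty of candidate c w.r.t. the printed answers."""
    if mode not in ("sum", "max"):
        raise ValueError("mode must be 'sum' or 'max'")
    per_answer = []
    for a in printed:
        total = 0.0
        for k1, k2 in keyword_pairs:
            kind = relation(c, a, k1, k2)
            if kind == "same":
                total += 1.0   # the same connection: penalty 1
            elif kind == "iso":
                total += p     # isomorphic connection: penalty p
        per_answer.append(total)
    if not per_answer:
        return 0.0
    return sum(per_answer) if mode == "sum" else max(per_answer)

def score(c, weight, printed, keyword_pairs, relation, eps=1.0, p=0.1, mode="sum"):
    """score(c) = abs-rel(c) / (1 + eps * rpt-inf(c)), with abs-rel(c) = 1 / weight(c)."""
    abs_rel = 1.0 / weight(c)
    return abs_rel / (1.0 + eps * rpt_inf(c, printed, keyword_pairs, relation, p, mode))
```

With no printed answers the penalty is zero and the score equals abs-rel(c), so the first answer is always the one of minimum weight among the candidates.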

Score Loss vs. Diversity. [Plots: 5 keywords, avg. of 4 queries, top-20 answers; two configurations, Sum with p=1.0 (top) and Max with p=0.1 (bottom); horizontal axis ε; curves given as % of max.: Score (1/weight), Connections (u.t. iso.), Connections.] The bottom configuration is better than the top one: a smaller reduction of score for a similar/higher degree of diversity.

Conclusions. KPS in complex data graphs has inherent problems that are ignored in existing systems. 2-component arch.: answer generator & ranker. 1st component: enum. algorithm w/ guarantees; it is efficient, correct (no missed answers), and generates answers in a 2-approximate order by height. In the paper: ext. to OR semantics (exact order). 2nd component: dynamically ranks candidates by penalizing them for repeated information. Our experiments over Mondial suggest a tuning of the parameters that gives the best tradeoff between information gain and score loss.

Current & Future Research. Improve / optimize the answer generator: parallelism (successful); concurrent queries? Implement different answer generators, e.g., by (approx.) increasing weight [KS-PODS06]. Assessment by humans: relevancy / repeated information; methodology example: [Zhang et al., SIGIR02]. Other aspects: answer presentation.

Answer Presentation. On the Web, we instantly get the meaning of an answer (Web page) from the <title>, the URL and, possibly, a snippet of the text. In KPS, understanding the meaning of a subtree is not straightforward: we need to derive the semantics from the graphical presentation.

What's the Meaning of this Answer? A snapshot of the BANKS demo (http://www.cse.iitb.ac.in/banks/) over IMDB. Harder in XML! No division into relations (everything is an element / attribute). What information is needed to describe a node?

Answer Presentation. On the Web, we instantly understand the meaning of an answer (Web page) by reading the <title> element, the URL and, possibly, a snippet of the text. In KPS, understanding the meaning of a subtree is cumbersome, since we need to derive the semantics from the presentation. Solution (under develop.): the graphical presentation is based on restructuring answers in terms of entities, properties and relationships; apply heuristics for determining the minimal set of properties required for each entity.

Thank you! Questions?
