Keyword Search on Graph-Structured Data

Presentation transcript:

Keyword Search on Graph-Structured Data
4/26/2017
S. Sudarshan, IIT Bombay
Joint work with Soumen Chakrabarti, Gaurav Bhalotia, Charuta Nakhe, Rushi Desai, Hrishi K., Arvind Hulgeri, Bhavana Dalvi and Meghana Kshirsagar
Jan 2009

Outline
- Motivation and Graph Data Model
- Query/Answer models
  - Tree answer model
  - Proximity queries
- Graph Search Algorithms
  - Backward Expanding Search
  - Bidirectional Search
  - Search on external memory graphs
- Conclusion

Keyword Search on Semi-Structured Data
- Keyword search of documents on the Web has been enormously successful
- Much data is resident in databases
  - Organizational, government, scientific, medical data
  - Deep web
- Goal: querying of data from multiple data sources, with different data models
  - Often with no schema or a partially defined schema

Keyword Search on Structured/Semi-Structured Data
- Key differences from IR/Web search:
  - Normalization (implicit/explicit) splits related data across multiple tuples
  - To answer a keyword query we need to find a (closely) connected set of entities that together match all given keywords
  - E.g. "soumen crawling" or "soumen byron"
[Figure: graph fragment with nodes labeled paper ("Focused Crawling …", "BANKS: Keyword search…"), writes, and author ("Soumen C.", "Byron Dom", "Sudarshan")]

Graph Data Model
- Lowest common denominator across many data models
  - Relational: node = tuple, edge = foreign key
  - XML: node = element, edge = containment/idref/keyref
  - HTML: node = page, edge = hyperlink
  - Documents: node = document, edge = links inferred by data extraction
  - Knowledge representation: node = entity, edge = relationship
  - Network data: e.g. social networks, communication networks

Graph Data Model (Cont)
- Nodes can have
  - labels, e.g. relation name or XML tag
  - textual or structured (attribute-value) data
- Edges can have labels

Query/Answer Models
- Basic query model:
  - Keywords match node text/labels
  - Can extend the query model with attribute specifications and path specifications, e.g. paper(year < 2005, title:xquery)
- Alternative answer models:
  - tree connecting nodes matching query keywords
  - nodes in proximity to (near) query keywords

Tree Answer Model
- Answer: rooted, directed tree connecting keyword nodes
  - In general, a Steiner tree
- Multiple answers possible
- Answer relevance computed from answer edge score combined with answer node score
[Figure: answer tree for the query "Soumen Byron": a paper node ("Focused Crawling") with two writes nodes leading to author nodes "Soumen C." and "Byron Dom"]

Answer Ranking
- Naïve model: answers ranked by number of edges
- Problem:
  - Some tuples are connected to many other tuples, e.g. highly cited papers, popular web sites
  - Highly connected tuples create misleading shortcuts (six degrees of separation)
- Solution: use directed edges with edge weights
  - allow the answer tree to have edge u→v if the original graph has v→u, but at higher cost

Edge Weight Model 1
- Forward edge weight (edge present in data)
  - Defaults to 1, can be based on schema, e.g. "cites" link weight greater than "writes" link weight
  - Lower weight → closer connection
- Create an extra backward edge v→u for each edge u→v present in data
  - Edge weight ∝ log(1 + #edges pointing to v)
- Overall answer-tree edge score EA = 1 / (Σ edge weights)
  - Higher score → better result
[Figure: example graph with edge weights 3, 1, 1, 1]
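
The weighting above can be sketched in a few lines. This is a minimal illustration, not the BANKS implementation: the function names are hypothetical, and the choice of a base-2 logarithm scaled by the forward edge weight is an assumption (the slide only says the backward weight grows with log(1 + in-degree)).

```python
import math
from collections import defaultdict

def add_backward_edges(forward_edges):
    """forward_edges: list of (u, v, weight) for edges u->v present in the data.
    Returns {(src, dst): weight} with forward and derived backward edges."""
    indegree = defaultdict(int)
    for _, v, _ in forward_edges:
        indegree[v] += 1
    weights = {}
    for u, v, w in forward_edges:
        # Assumption: backward weight = forward weight * log2(1 + indegree(v)).
        backward = w * math.log2(1 + indegree[v])
        for edge, weight in (((u, v), w), ((v, u), backward)):
            # If the same directed edge is produced twice, keep the cheaper one.
            weights[edge] = min(weight, weights.get(edge, math.inf))
    return weights

def edge_score(tree_edges, weights):
    """EA = 1 / (sum of edge weights); a higher score means a tighter tree."""
    return 1.0 / sum(weights[e] for e in tree_edges)

# Toy data: two "writes" tuples, each referencing the paper and one author.
forward = [("w1", "paper", 1.0), ("w1", "soumen", 1.0),
           ("w2", "paper", 1.0), ("w2", "byron", 1.0)]
weights = add_backward_edges(forward)
# Answer tree rooted at the paper: two backward edges, then two forward edges.
tree = [("paper", "w1"), ("w1", "soumen"), ("paper", "w2"), ("w2", "byron")]
print(edge_score(tree, weights))
```

Because the paper node has in-degree 2, each backward edge out of it costs log2(3) rather than 1, which is how heavily referenced nodes become more expensive shortcuts.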

Edge Weight Model 2
- Probabilistic edge scoring model
- Edge traversal probability (from a given node):
  - Forward: 1/out-degree
  - Backward: 1/in-degree
  - Can be weighted by edge type
- Path weight = Π (probability of following each edge in path)
- Edge score = log(edge traversal probability)
- Answer-tree edge score EA = (harmonic) mean of path weights from root to each leaf
- Note: other edge weight models are possible
  - our search algorithms are independent of how edge weights are computed
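
As a small worked illustration of the probabilistic model, the sketch below multiplies per-edge traversal probabilities along each root-to-leaf path and then takes the harmonic mean over paths. The function names and the sample probabilities are hypothetical; only the formulas come from the slide.

```python
import math
from statistics import harmonic_mean

def path_weight(edge_probabilities):
    """Product of the traversal probabilities of the edges on one path."""
    return math.prod(edge_probabilities)

def tree_edge_score(paths):
    """Harmonic mean of the path weights, one path per leaf of the answer tree."""
    return harmonic_mean([path_weight(p) for p in paths])

# Two root-to-leaf paths: one with a single edge of probability 0.5,
# one with a single edge of probability 1.0.
score = tree_edge_score([[0.5], [1.0]])
print(score)
```

The harmonic mean is dominated by the weakest path, so a tree with one very unlikely path scores poorly even when its other paths are strong.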

Node Weight
- Node prestige based on indegree
  - More incoming edges → higher prestige
- PageRank-style transfer of prestige
  - Node weight computed using a biased random walk model
- Node weight: function of node prestige and other optional criteria such as TF/IDF
- Answer-tree node score NA = root node weight + Σ leaf node weights
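
One way to picture PageRank-style prestige is plain power iteration, sketched below. This is a generic uniform random walk, not the biased walk used in BANKS; the damping factor 0.85, the iteration count, and the function names are assumptions made for illustration.

```python
def prestige(out_edges, damping=0.85, iterations=50):
    """out_edges: {node: [successor, ...]}. Returns {node: prestige}."""
    nodes = set(out_edges) | {v for vs in out_edges.values() for v in vs}
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        nxt = {v: (1.0 - damping) / n for v in nodes}
        for u in nodes:
            # Nodes without out-edges spread their prestige uniformly.
            targets = out_edges.get(u) or list(nodes)
            share = damping * rank[u] / len(targets)
            for v in targets:
                nxt[v] += share
        rank = nxt
    return rank

def node_score(root, leaves, weight):
    """NA = root node weight + sum of leaf node weights."""
    return weight[root] + sum(weight[leaf] for leaf in leaves)

# Toy graph: "c" has two incoming edges, "a" has one, "b" has none.
ranks = prestige({"a": ["c"], "b": ["c"], "c": ["a"]})
print(ranks)
```

In this toy graph the node with the most incoming edges ends up with the highest prestige, matching the intuition on the slide.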

Overall Tree Answer Score
- Overall score of answer tree A
  - combines tree and node scores
  - for details, and recall/precision metrics, see the BANKS papers in ICDE 2002 and VLDB 2005
- Anecdotal results on DBLP Bibliography
  - "Transaction": Jim Gray's classic paper and textbook at the top because of prestige (# of citations)
  - "soumen sudarshan": several coauthored papers, links via common co-authors
  - "goldman shivakumar hector": the VLDB 98 proximity search paper, followed by citation/co-author connections

Answer Models
- Tree Answer Model
- Proximity (near query) model

Proximity Queries
- Node weight by proximity
  - author (near olap) (on DBLP)
  - faculty (near earthquake) (on IITB thesis database)
- Node prestige is higher if the node is close to multiple nodes matching the near keywords
- Example applications
  - Finding experts on a particular area
[Figure: author nodes "Widom" and "Raghu" with paper nodes "OLAP over uncertain ..", "Computing sparse cubes…", "Overview of OLAP…", "Allocation in OLAP …"]

Proximity via Spreading Activation
- Idea:
  - Each "near" keyword has activation of 1
    - Divided among nodes matching the keyword, proportional to their node prestige
  - Each node keeps fraction 1-μ of its received activation and spreads fraction μ amongst its neighbors
  - Combine activations ai received from neighbors: a = 1 - Π(1 - ai) (belief function)
- Graph may have cycles
  - Iterate till convergence
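
The following is a rough sketch of one possible reading of the spreading-activation idea, not the BANKS algorithm itself. Assumptions: activation is split equally among a node's neighbors, a node's kept portions are merged with the belief function, and spreading stops once a share drops below a small threshold `eps`. All names and the default μ = 0.5 are hypothetical.

```python
from collections import deque

def spread_activation(neighbors, prestige, matches, mu=0.5, eps=1e-4):
    """neighbors: {node: [adjacent nodes]}; matches: nodes matching the keyword.
    Returns {node: activation in [0, 1)}."""
    total = sum(prestige[v] for v in matches)
    activation = {}
    # The keyword's activation of 1 is divided in proportion to prestige.
    queue = deque((v, prestige[v] / total) for v in matches)
    while queue:
        node, received = queue.popleft()
        kept = (1.0 - mu) * received
        previous = activation.get(node, 0.0)
        # Belief combination: a = 1 - (1 - a_old) * (1 - a_new).
        activation[node] = 1.0 - (1.0 - previous) * (1.0 - kept)
        adjacent = neighbors.get(node, [])
        if adjacent:
            share = mu * received / len(adjacent)  # assumption: equal split
            if share > eps:                        # cut-off ends the iteration
                queue.extend((nbr, share) for nbr in adjacent)
    return activation

# Toy graph: "widom" wrote both matching papers, "raghu" wrote only one.
graph = {"p1": ["widom"], "p2": ["widom", "raghu"],
         "widom": ["p1", "p2"], "raghu": ["p2"]}
result = spread_activation(graph, {v: 1.0 for v in graph}, ["p1", "p2"])
print(result["widom"], result["raghu"])
```

The author connected to both matching papers accumulates more activation than the author connected to one, which is the behaviour a query such as author (near olap) relies on.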

Example Answers
- Anecdotal results on DBLP Bibliography
  - author (near recovery): Dave Lomet, C. Mohan, etc.
  - sudarshan (near change): Sudarshan Chawathe
  - sudarshan (near query): S. Sudarshan
- Queries can combine proximity scores with tree scores
  - hector sudarshan (near query) vs. hector sudarshan
  - author (near transactions) data integration
- Note: insert answers into a small buffer (heap) as they are generated

Related Work: Proximity Search
- Goldman, Shivakumar, Venkatasubramanian and Garcia-Molina [VLDB98]
  - Considers only the shortest path from each node, aggregates across nodes
  - Our version aggregates evidence from alternative paths
  - E.g. author (near "Surajit Chaudhuri")
- ObjectRank [VLDB04]
  - Similar idea to ours, precomputed

Related Work: Keyword Querying
- Keyword querying on relational databases
  - DBXplorer (Microsoft, ICDE02), Discover (UCSD, VLDB02, VLDB03)
  - Use SQL generation, not applicable to arbitrary graphs
  - ranking based only on #nodes/edges
- Keyword querying on XML: tree model
  - XRank (Cornell, SIGMOD03), proximity in XML (AT&T Research, VLDB03), Schema-Free XQuery (Michigan, VLDB04)
  - Tree model is too limited
- Keyword querying on XML: graph model
  - XKeyword (UCSD, ICDE03, VLDB03), SphereSearch (Max Planck, VLDB05)

Finding Answer Trees
- Backward Expanding Search Algorithm (Bhalotia et al., ICDE02)
- Intuition: find vertices from which a forward path exists to at least one node from each Si (Si = set of nodes matching keyword i)
- Run concurrent single-source shortest path algorithms from each node matching a keyword
  - Create an iterator for each node matching a keyword
  - Traverse the graph edges in reverse direction
  - Output the next nearest node on each get-next() call
- Do best-first search across iterators
  - Output an answer when its root has been reached from each keyword
- Answer heap to collect and output results in score order
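
A compact sketch of the backward-expansion idea follows. It is a simplification and makes several assumptions: it uses one Dijkstra frontier per keyword rather than one iterator per matching node, it interleaves all frontiers through a single priority queue, it reports only the root and total path cost rather than the full tree, and it sorts the first k roots found instead of maintaining a proper answer heap. The function name and data layout are hypothetical.

```python
import heapq

def backward_expanding_search(in_edges, keyword_nodes, k=3):
    """in_edges[v] = [(u, w), ...] for each directed edge u->v with weight w.
    keyword_nodes = [S_1, ..., S_m], the node sets matching each keyword.
    Returns up to k (total_cost, root) pairs."""
    m = len(keyword_nodes)
    dist = [dict() for _ in range(m)]  # dist[i][v]: settled cost from v to S_i
    heap = [(0.0, i, v) for i, nodes in enumerate(keyword_nodes) for v in nodes]
    heapq.heapify(heap)
    answers = []
    while heap and len(answers) < k:
        d, i, v = heapq.heappop(heap)  # best-first across all frontiers
        if v in dist[i]:
            continue
        dist[i][v] = d
        # v is an answer root once every keyword's frontier has reached it.
        if all(v in dist[j] for j in range(m)):
            answers.append((sum(dist[j][v] for j in range(m)), v))
        for u, w in in_edges.get(v, []):  # follow edges in reverse direction
            if u not in dist[i]:
                heapq.heappush(heap, (d + w, i, u))
    return sorted(answers)

# Toy graph: paper -> writes tuple (backward edge, 1.58), writes -> author (1.0).
in_edges = {"soumen": [("w1", 1.0)], "w1": [("paper", 1.58)],
            "byron": [("w2", 1.0)], "w2": [("paper", 1.58)]}
found = backward_expanding_search(in_edges, [{"soumen"}, {"byron"}])
print(found)
```

For the query "soumen byron" on this toy graph, the only node reachable backwards from both keywords is the paper, so it is the single answer root.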

Backward Expanding Search
- Query: soumen byron
[Figure: paper node "Focused Crawling" with writes nodes leading to author nodes "Soumen C." and "Byron Dom"]

Backward Exp. Search Limitations
- Too many iterators (one per keyword node)
  - Solution: single iterator per keyword (SI-Bkwd search)
    - tracks shortest path from node to keyword
  - Changes answer set slightly
    - Different justifications for the same root may be lost
    - Not a big problem in practice
- Nodes explored for different keywords can vary greatly
  - E.g. "mining" or "query" vs "knuth"
- High fan-out when traversing backwards from some nodes
- Connection with join ordering
  - Similar to traversing backwards from all relations that have selections

Bidirectional Search: Motivation

Bidirectional Search: Intuition
- First-cut solution:
  - Don't expand backward if a keyword matches many nodes
  - Instead explore forward from other keywords
- Problems
  - Doesn't deal with high fan-out during search
  - What should the cutoff for not expanding be?
- Better solution [Kacholia et al., VLDB 2005]:
  - Perform forward search from all nodes reached
  - Prioritize expansion of nodes based on
    - path weight (as in backward expanding search)
    - spreading activation, to penalize frequent keywords and bushy trees

Bidirectional Search: Example
- Query: harper divesh olap
[Figure: graph with nodes labeled OLAP, Divesh, Harper]

Bidirectional Search (1)
- Spreading activation to prioritize backward search
  - (Different from spreading activation for near queries)
  - Lower weight edges get a higher share of activation
  - Nodes prioritized by sum of activations
- Single forward iterator

Bidirectional Search (2)
- Forward search iterator
  - Forward search from all nodes reached by backward search
  - Track best forward path to each keyword
    - Initially infinite cost
    - Whenever this changes, propagate the cost change to all affected ancestors
[Figure: nodes annotated with best path costs to keywords (k1, k2): (2,∞), (2,2), (∞,∞), (∞,2)]

Bidirectional Search (3)
- On each path length update (due to backward or forward search)
  - Check if the node can reach all keywords
  - If so, add it to the output heap
- When to output nodes from the heap
  - For each keyword Ki, track Mi
    - Mi: minimum path length to Ki among all yet-to-be-explored nodes in the backward search tree
  - Edge score bounds: what is the best possible edge score of a future answer?
    - Bounds similar to NRA algorithm (Fagin)
    - Cheaper bounds (e.g. 1/Max(Mi)) or heuristics (e.g. 1/Sum(Mi)) can be used
  - Output an answer if its score is > overall score upper bound for future answers
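
The output rule can be illustrated with a short sketch. It covers only the edge-score part of the bound, using the cheaper 1/Max(Mi) bound named on the slide; node scores are ignored, and the function name, heap layout, and sample numbers are hypothetical.

```python
import heapq

def release_answers(answer_heap, frontier_minima):
    """answer_heap: heapq list of (-edge_score, root) for answers found so far.
    frontier_minima: [M_1, ..., M_m], the minimum unexplored path length per keyword.
    Pops and returns the answers that no future answer can beat."""
    worst = max(frontier_minima)
    # A future answer needs a path of length >= M_i to every keyword, so its
    # total edge weight is at least max(M_i) and its edge score at most 1/max(M_i).
    bound = 1.0 / worst if worst > 0 else float("inf")
    released = []
    while answer_heap and -answer_heap[0][0] > bound:
        negated_score, root = heapq.heappop(answer_heap)
        released.append((root, -negated_score))
    return released

pending = [(-0.5, "paper1"), (-0.2, "paper2")]
heapq.heapify(pending)
ready = release_answers(pending, [3.0, 4.0])  # bound is 1/4
print(ready, pending)
```

Here only the answer with edge score 0.5 is safe to output; the one with score 0.2 stays in the heap because a better answer could still be discovered.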

Performance
- Worst case complexity: polynomial in size of graph
- But for the typical (average) case, even linear is too expensive
  - Intuition: a typical query should access only a small part of the graph
- Studied experimentally
  - Datasets: DBLP, IMDB, US Patent
  - Queries: manually created
- Typical cases
  - < 1 second to generate an answer
  - 10K-100K nodes explored

Performance Results
- Two versions of backward search:
  - Iterator per node (MI-Bkwd) vs iterator per keyword (SI-Bkwd)
- Origin size: number of nodes matching keywords
- Very minor loss in recall
[Figure: time ratio MI/SI]

Performance Results
- SI-Bkwd versus Bidirectional search
- Bidirectional search gain increases with origin size, # keywords

Related Work (1)
- Publish as document approach
  - Gather related data into a (virtual) document and index the document (Su/Widom, IDEAS05)
- Positives
  - Avoids run-time graph search
  - Works well for a class of applications
  - E.g. bibliographic data → DBLP page per author
- Negatives
  - Not all connections can be captured
  - Duplication of data across multiple documents
  - High index space overhead

Related Work (2)
- DPBF (Ding et al., ICDE07)
  - dynamic programming technique
  - exact for top-1 answer, heuristic for top-k
- BLINKS (He et al., SIGMOD07)
  - Round-robin expansion across iterators
    - Optimal within a factor of m, with m keywords
  - Forward index: node to keyword distance
    - Used instead of searching forward
    - single-level index: impractically large space
    - bi-level index: > main memory; IO efficiency?

External Memory Graph Search
- Graph representation is quite efficient
  - Requires < 20 bytes per node/edge
- Problem: what if graph size > memory?
  - Alternative 1: virtual memory
    - thrashing
  - Alternative 2 (for relational data): SQL
    - not good for top-K answer generation across multiple SQL queries
  - Alternative 3: use a compressed graph representation to reduce IO [Dalvi et al., VLDB 2008]

Supernodes and Superedges

Multi-granular Graph
- Dumb algorithm
  - search on supernode graph
  - get k*F answers, expand their supernodes into memory, search on resultant graph
  - no guarantees on answers
- Better idea: use multi-granular graph
  - Supernode graph in memory
  - Some nodes expanded
    - Expanded nodes are part of cache
- Algorithms on multi-granular graph (coming up)

Multi-granular Graph

Expanding Nodes
- Key idea: the edge score of an answer containing a supernode is a lower bound on the actual edge score of any corresponding real answer

Iterative Search
- Iterative search on multi-granular graph
  - Repeat
    - search on the current multi-granular graph using any search algorithm, to find top results
    - expand supernodes in top results
  - Until top K answers are all pure
- Guarantees finding top-K answers
- Very good IO efficiency
- But high CPU cost due to repeated work
- Details: nodes expanded above are never evicted from the "virtual memory" cache
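
The repeat/until loop above can be written down generically. In this sketch the search and expansion steps are passed in as functions, since the slide allows any search algorithm; the toy stand-ins (`toy_search`, `toy_expand`, `MEMBERS`) are entirely hypothetical and treat each answer as a single node with a cost, where a supernode's cost is a lower bound on the costs of its members.

```python
def iterative_search(graph, search, expand, is_supernode, k):
    """Repeat search and expansion until the top-k answers contain no supernodes."""
    while True:
        answers = search(graph, k)
        impure = {n for answer in answers for n in answer if is_supernode(n)}
        if not impure:
            return answers  # all top-k answers are pure
        for supernode in impure:
            graph = expand(graph, supernode)  # bring its members into memory

# Toy stand-ins: graph maps node -> cost, lower cost is better.
MEMBERS = {"S1": {"a": 3, "b": 5}, "S2": {"c": 4, "d": 9}}

def toy_search(graph, k):
    return [(n,) for n in sorted(graph, key=graph.get)[:k]]

def toy_expand(graph, supernode):
    rest = {n: c for n, c in graph.items() if n != supernode}
    return {**rest, **MEMBERS[supernode]}

top = iterative_search({"S1": 3, "S2": 4, "x": 6}, toy_search, toy_expand,
                       lambda n: n in MEMBERS, k=2)
print(top)
```

Each round restarts the search from scratch on the refined graph, which is the repeated work the slide identifies as the source of high CPU cost.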

Incremental Search
- Idea: when a node is expanded, incrementally update the state of the search algorithm to reflect the change in the multi-granular graph
- Run the search algorithm until top K answers are all pure
- Currently implemented for backward search
  - Modifies the state of the Dijkstra shortest path algorithm used by backward search
  - One shortest path search iterator per keyword
  - SPI tree: shortest path iterator tree

Incremental Search (1)
[Figure: SPI tree for k1]

Incremental Search (2)

Incremental Search (3)

External Memory Search: Performance
- Queries

External Memory Search: Performance
- Supernode graph very effective at minimizing IO
  - Cache misses with incremental often < # nodes matching keywords
- Iterative has high CPU cost
- VM (backward search with cache as virtual memory) has high IO cost
- Incremental combines low IO cost with low CPU cost

Conclusions
- Keyword search on graphs continues to grow in importance
  - E.g. graph representation of extracted knowledge in YAGO/NAGA (Max Planck)
- Ranking is critical
  - Edge and node weights, spreading activation
- Efficient graph search is important
  - In-memory and external-memory

Ongoing/Future Work
- External memory graph search
  - Compression ratios for supernode graph for DBLP/IMDB: factor of 5 to 10
  - Ongoing work on graph clustering shows good results
- Graph search in a parallel cluster
  - Goal: search integrated WWW/Wikipedia graph
- New search algorithms
- Integration with existing applications
  - To provide more natural display of results, hiding schema details
- Authorization

BANKS References
- Gaurav Bhalotia, Arvind Hulgeri, Charuta Nakhe, Soumen Chakrabarti, S. Sudarshan. Keyword Searching and Browsing in Databases using BANKS. ICDE 2002.
- B. Aditya, Soumen Chakrabarti, Rushi Desai, Arvind Hulgeri, Hrishikesh Karambelkar, Rupesh Nasre, Parag, S. Sudarshan. User Interaction in the BANKS System (demo paper). ICDE 2003.
- Varun Kacholia, Shashank Pandit, Soumen Chakrabarti, S. Sudarshan, Rushi Desai, Hrishikesh Karambelkar. Bidirectional Expansion For Keyword Search on Graph Databases. VLDB 2005.
- Bhavana Dalvi, Meghana Kshirsagar, S. Sudarshan. Keyword Search on External Memory Data Graphs. VLDB 2008.

Thanks!

Time and Nodes Explored
[Figure: series labeled "BiDir Nodes" and "BiDir Time"]

Screenshots (1)
[Screenshot: results for author (near recovery)]

Near Queries with Multiple Keywords
- Spread activation from each keyword separately
- Then combine the activations from different keywords
  - OR: use addition or belief combination
  - AND: take product of activations
- Gives better results
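
The two combination rules are simple enough to state directly. The sketch below shows the belief-style OR and the product-style AND for one node's per-keyword activations; the function names and sample values are illustrative assumptions.

```python
import math

def combine_or(activations):
    """Belief combination: 1 - product of (1 - a_i). Stays within [0, 1]."""
    return 1.0 - math.prod(1.0 - a for a in activations)

def combine_and(activations):
    """Product of activations: high only if every keyword activates the node."""
    return math.prod(activations)

# One node's activation from two separately spread keywords.
per_keyword = [0.5, 0.5]
print(combine_or(per_keyword), combine_and(per_keyword))
```

With AND, a node that is far from any one keyword gets a score near zero no matter how close it is to the others, whereas OR rewards proximity to any keyword.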

The BANKS System
[Figure: architecture diagram with components User, BANKS (Web Server + Servlets) and Database, and links labeled HTTP and JDBC]
- Available on the web, with DBLP, IMDB and IITB ETD data: http://www.cse.iitb.ac.in/banks/
- No programming needed for customization
  - Minimal preprocessing to create indices and give weights to links
- Provides keyword search coupled with extensive browsing features
  - Schema browsing + data browsing
  - Hyperlinks are automatically added to all displayed results
  - Browsing data by grouping and creating crosstabs
  - Graphical display of data: bar charts, pie charts, etc.

BANKS Architecture
- Data resident on disk
- Graph structure of data resident in memory
  - Nodes and edges with their types/counts
  - 16×|V| + 8×|E| bytes
- Search done in memory
- Why?
  - Allows us to use interesting graph traversal based algorithms without being constrained by SQL and related performance issues
  - With current memory sizes, database graphs for most applications will fit in memory
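
To get a feel for the 16×|V| + 8×|E| figure, here is a tiny calculation. The graph size used (one million nodes, five million edges) is a made-up example, not a measurement from the talk, and the function name is hypothetical.

```python
def graph_memory_bytes(num_nodes, num_edges):
    """In-memory graph size from the slide's formula: 16*|V| + 8*|E| bytes."""
    return 16 * num_nodes + 8 * num_edges

# Hypothetical graph: 1 million nodes, 5 million edges.
size = graph_memory_bytes(1_000_000, 5_000_000)
print(f"{size / 1e6:.0f} MB")
```

A graph of that size would need 16,000,000 + 40,000,000 = 56,000,000 bytes, about 56 MB, which supports the claim that many database graphs fit in main memory.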

Probabilistic Edge Score Model (2)
- Paths from root to leaves are considered separately, even if they share edges
- More efficient search algorithms with this model (He et al., SIGMOD07)
[Figure: example values 0.5, 0.167, 1]