Keyword Searching in Relational Databases

Keyword Searching in Relational Databases Esha Palta (05329017) Kumar Gaurav Bijay (02005013)

Dilbert Strip 

Motivation Keyword search We have SQL, why keyword-querying? SQL - not appropriate for naive users So many online databases (imdb, citeseer, bseindia …) – user cannot keep track of schema for all of these

Simple Approaches Using form interfaces: require a separate form for each type of query, which is confusing; not suitable for ad-hoc queries (how many forms will you provide?). How about Google? Export data from the db to documents and do keyword querying on these: suffers from duplication overheads; Google wants all keywords in one document, but the DB is often normalized, so tables must be joined and stored as documents; there are multiple combinations of tables to join, so this is not scalable. Probably that’s why there is no Google DB Search.

Differences from Web Search Related data is split across multiple tuples due to normalization. Different keywords may match tuples from different relations. What joins are to be computed can only be decided on the fly. Need to find results containing all keywords and rank them somehow. The DBLP Bibliography Schema: Paper (PaperId, PaperName); Writes (AuthorId, PaperId); Author (AuthorId, AuthorName); Cites (Citing, Cited).

Systems for DB search BANKS (Browsing and Keyword Search) – IITB (ICDE ’02) DBXplorer – Microsoft Research (ICDE ’02) ObjectRank – IBM, UCSD, FIU (VLDB ’04) Bidirectional BANKS – IITB (VLDB ’05)

BANKS (ICDE ’02)

The BANKS system BANKS Architecture Available on the web http://www.cse.iitb.ac.in/banks Connects to database using JDBC JDBC metadata features used to provide schema browsing Preprocesses db User BANKS HTTP JDBC Web-server Database

Basic Model Database: modeled as a graph. Nodes = tuples. Edges = references between tuples: foreign keys (assumed for this talk), inclusion dependencies, etc. Edges are directed. [Figure: DBLP example. Paper tuples (PaperId:PaperName) BANKS01:Keyword Search and MO:MultiQuery Optimizn; writes tuple (AuthorId:PaperId) Charuta:BANKS01; author tuples (AuthorId) Charuta, S. Sudarshan, Prasan Roy.]
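
The graph model above can be sketched in a few lines of Python. This is only an illustration: the tuple ids, the `FOREIGN_KEYS` list, and the `build_graph` helper are hypothetical names chosen for the example, not part of BANKS.

```python
# Hypothetical toy tuples loosely mirroring the DBLP example; ids are made up.
tuples = {
    "paper:MO": {"PaperName": "MultiQuery Optimization"},
    "author:Sudarshan": {"AuthorName": "S. Sudarshan"},
    "author:Roy": {"AuthorName": "Prasan Roy"},
    "writes:1": {"AuthorId": "author:Sudarshan", "PaperId": "paper:MO"},
    "writes:2": {"AuthorId": "author:Roy", "PaperId": "paper:MO"},
}

# Assumed foreign-key columns: their values are ids of the referenced tuples.
FOREIGN_KEYS = ("AuthorId", "PaperId")


def build_graph(rows):
    """Return {node: set of referenced nodes}, one directed edge per foreign key."""
    graph = {node: set() for node in rows}
    for node, row in rows.items():
        for column in FOREIGN_KEYS:
            if column in row:
                graph[node].add(row[column])
    return graph


graph = build_graph(tuples)
```

Each Writes tuple points at one author and one paper, while Paper and Author tuples have no outgoing edges in this toy schema.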

The BANKS Answer Model Query: set of search terms {t1, t2, .., tn} For each search term ti we find set of nodes Si matching ti Eg: Query = Sudarshan Roy (t1= Sudarshan, t2 = Roy) Answer: rooted, directed tree connecting nodes matching keywords Root node has special significance, may be restricted to some relations E.g. relations representing entities, not relationships May include intermediate nodes not in any Si (Steiner Tree) Multiple answers Ranking based on proximity + prestige

Answer Example Query: sudarshan roy. [Figure: answer tree rooted at the Paper tuple “MultiQuery Optimization”, connected through two Writes tuples to the Author tuples S. Sudarshan and Prasan Roy.] We would like to find sets of (closely) connected tuples that match all given keywords.

Edge Directionality A directed tree will miss desired answers, e.g. Query = DBXplorer ObjectRank. Here BANKS cites both DBXplorer and ObjectRank. So, for each forward edge, BANKS adds a back edge. [Figure: BANKS linked through two Cites tuples to DBXplorer and ObjectRank, with edges labeled Cited and CitedBy, shown with the direction labels in both orders.]

Edge Directionality What if we ignore directionality? Some popular tuples are connected to many other tuples E.g. Students -> departments -> university Problem: A popular tuple would create misleading shortcuts between tuples E.g. every student would be closely linked with every other student via the department/university Solution: define different forward and backward edge weights Forward edges: In the direction of the foreign key reference

Edge Weight Weight of forward edge is based on the schema, e.g. citation link weights > “writes” link weights. Weight of backward edge = indegree of edges pointing to the node. [Figure: example graph with edge weights 3, 1, 1, 1.]
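
A minimal sketch of the back-edge rule, assuming forward edges are given as a dictionary of weights. The helper name `add_backward_edges` is hypothetical, and taking the cheaper weight when an edge already exists in that direction is a choice made for this sketch.

```python
def add_backward_edges(forward_edges):
    """forward_edges: {(source, target): weight} for foreign-key references.

    Returns a new dict that also holds one back edge per forward edge,
    weighted by the in-degree of the referenced (target) node.
    """
    indegree = {}
    for (_, target) in forward_edges:
        indegree[target] = indegree.get(target, 0) + 1

    weights = dict(forward_edges)
    for (source, target) in forward_edges:
        back = (target, source)
        # If an edge already exists in that direction, keep the cheaper weight.
        weights[back] = min(weights.get(back, float("inf")), indegree[target])
    return weights


# Three students referencing one department (toy data).
forward = {("s1", "dept"): 1, ("s2", "dept"): 1, ("s3", "dept"): 1}
weights = add_backward_edges(forward)
```

Going from a student to the department costs 1, but going from the department back to any student costs 3, so a popular tuple no longer acts as a cheap shortcut between its referrers.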

Edge Weight Scaling Normalize edge score Escore(e). Make edge weight scale-free by dividing edge weight by $w_{min}$. Problem: some backward edges have unduly large weights. Depress the scale by defining $\mathrm{Escore}(e) = \log(1 + w(e)/w_{min})$. Overall Escore: $E = 1 / (1 + \sum_e \mathrm{Escore}(e))$
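
The edge-score formulas can be written down directly. This sketch assumes the natural logarithm (the slide does not state the base); the function names are illustrative.

```python
import math


def edge_score(weight, w_min):
    """Log-scaled, scale-free score of one edge: log(1 + w(e) / w_min)."""
    return math.log(1 + weight / w_min)


def tree_edge_score(edge_weights, w_min):
    """Overall edge score of an answer tree: 1 / (1 + sum of edge scores)."""
    return 1.0 / (1.0 + sum(edge_score(w, w_min) for w in edge_weights))


# A tree with one cheap forward edge and one heavier back edge (toy weights).
example = tree_edge_score([1.0, 3.0], w_min=1.0)
```

A tree with no edges scores 1, and every added edge lowers the score, so smaller trees are preferred.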

Node Weight Set weight of a node = indegree of the node. As per prestige rankings, nodes with multiple pointers to them get a higher prestige, so higher node weight corresponds to higher prestige. Problem: nodes with many in-edges result in skewed answers. Subdue extreme node weights by using log(1 + indegree). Node score Nscore = average of the node weights of the root and the leaves (root-node-weight + Σ leaf-node-weights).
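
A small sketch of the node score, assuming natural logarithms and a plain arithmetic mean over the root and leaf weights. Any further normalization to the 0-1 range is left out here.

```python
import math


def node_weight(indegree):
    """Log-damped prestige of a node: log(1 + indegree)."""
    return math.log(1 + indegree)


def tree_node_score(root_indegree, leaf_indegrees):
    """Average node weight over the root and the leaves of an answer tree."""
    weights = [node_weight(root_indegree)]
    weights += [node_weight(d) for d in leaf_indegrees]
    return sum(weights) / len(weights)


# Root cited 7 times, two leaves with 1 and 3 incoming references (toy data).
example = tree_node_score(7, [1, 3])
```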

Combining Scores Combining two independent metrics: node weight and edge weight. Normalize each to 0-1. Combine using weighting factor $\lambda$. Additive: $(1-\lambda)\,\mathrm{Escore} + \lambda\,\mathrm{Nscore}$. Multiplicative: $\mathrm{Escore} \cdot \mathrm{Nscore}^{\lambda}$. Performance study to compare alternatives and to find reasonable values for $\lambda$.
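
The two combination rules, sketched under the assumption that the weighting factor is the λ = 0.2 reported later in the talk and that the multiplicative form raises the node score to the power λ (as in the later "Total Score" slide). Function names are illustrative.

```python
def additive(escore, nscore, lam=0.2):
    """Additive combination: (1 - lam) * Escore + lam * Nscore."""
    return (1 - lam) * escore + lam * nscore


def multiplicative(escore, nscore, lam=0.2):
    """Multiplicative combination: Escore * Nscore ** lam (assumed form)."""
    return escore * nscore ** lam


# Both inputs are assumed to be normalized to the range 0-1 already.
a = additive(0.5, 1.0)
m = multiplicative(0.5, 1.0)
```

With a small λ the edge score dominates and the node score only nudges the ranking, which matches the "subdue node weights but not entirely" remark.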

First Step – Symbol Table The first step is to build a symbol table. This table is in the db and is not normalized. Example (keyword: list of matching nodes): Database: {NICDE_2, NVLDB_3, …}; Search: {NBANKS1, NBANKS2, NDBXPLR, …}; Rank: {NOBJRNK, NXRANK, NSPHSRCH, …}; … The N… entries are nodes that match the keyword.
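
A symbol table of this kind is essentially an inverted index from keyword to node ids. The sketch below keeps it in memory rather than in the database; the node ids and the `build_symbol_table` helper are hypothetical.

```python
import re
from collections import defaultdict


def build_symbol_table(node_text):
    """Map each lower-cased word to the set of nodes whose text contains it."""
    table = defaultdict(set)
    for node, text in node_text.items():
        for word in re.findall(r"\w+", text.lower()):
            table[word].add(node)
    return table


# Toy node texts (made up for the example).
node_text = {
    "N1": "Keyword Search in Databases",
    "N2": "Ranked Keyword Search over XML",
    "N3": "Query Optimization",
}
symbol_table = build_symbol_table(node_text)
```

Looking up `symbol_table["search"]` then gives the set Si of nodes matching the search term.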

Searching for Best Answers Backward Expanding Search Algorithm. Assume: the graph fits in memory. Idea: find vertices from which a forward path exists to at least one node from each Si. Run concurrent single-source shortest-path algorithms, one from each node matching a keyword: create an iterator for each node matching a keyword and traverse the graph edges in reverse direction. Output a node whenever it is on the intersection of the sets of nodes reached from each keyword. Answer trees may not be generated in relevance order.

Backward Expanding Search Query: sudarshan roy. [Figure: iterators starting at the author tuples S. Sudarshan and Prasan Roy and expanding backward through writes tuples to the paper tuple “MultiQuery Optimization”.]

BANKS Query Result Example Result of “Sudarshan Roy”

Result Ordering Answers need not always be generated in relevance order. [Figure: example graph with edge weights 2, 2, 5, 2, 2, 1; labels “This tree is output” and “Better Root Missed”.]

Result Ordering (contd…) Naive solution: generate all connection trees and then sort them. This increases computation costs and greatly increases the time to generate initial results. Instead, create a small heap ordered on the relevance of the trees, and output the highest-ranked tree from the heap to the user when the heap is full. What about duplicate results? Maintain a list of generated results for duplicate detection, and discard duplicates according to relevance.
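
The small output heap can be sketched as a bounded buffer in front of the result stream. This is an assumption-laden simplification: trees are identified by a hashable id, a later duplicate is simply dropped rather than compared by relevance, and `heap_size` is an illustrative parameter.

```python
import heapq


def emit_with_buffer(trees, heap_size):
    """trees: iterable of (relevance, tree_id) in generation order.
    Yields tree ids, releasing the most relevant buffered tree whenever
    the buffer grows beyond heap_size, then draining it at the end.
    """
    heap = []     # max-heap simulated with negated relevance
    seen = set()  # ids of trees generated so far, for duplicate detection
    for relevance, tree in trees:
        if tree in seen:
            continue
        seen.add(tree)
        heapq.heappush(heap, (-relevance, tree))
        if len(heap) > heap_size:
            yield heapq.heappop(heap)[1]
    while heap:
        yield heapq.heappop(heap)[1]


generated = [(0.3, "a"), (0.9, "b"), (0.5, "c"), (0.7, "d"), (0.9, "b")]
ordered = list(emit_with_buffer(generated, heap_size=2))
```

A larger buffer gives an order closer to the true ranking at the cost of a longer wait for the first result.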

Experience and Performance BANKS provides keyword search coupled with extensive browsing facilities Schema browsing + data browsing Graphical display of data Implemented using Java + servlets Keyword search response times typically 1 to 3 seconds on DBLP database with 100,000 tuples/300,000 edges P3 600 MHz, 512 MB RAM

Anecdotes “Mohan”: returns C. Mohan at top based on prestige (number of papers written). “Transaction”: returns Jim Gray’s classic paper and textbook as top answers based on prestige (number of citations). “Sunita Seltzer”: no common papers, but both have papers with Stonebraker; the system finds this connection.

Effect of Parameters Log scaling of edge weights worked well. $(1-\lambda)E + \lambda N$ versus $E N^{\lambda}$ made little difference. Best with $\lambda = 0.2$ (subdue node weights but not entirely). [Figure: EdgeLog]

Bidirectional BANKS (VLDB ’05)

Motivation BANKS performs poorly if: a keyword matches a lot of nodes (so there are a lot of Dijkstra sources); the search hits a node with large fan-in. … Wastes time. [Figure: Sudarshan, Roy]

New Ideas – Forward Search Why only backward? Let’s search forward too. How about forward searching? [Figure: Sudarshan, Roy]

New Ideas - Activation Activation: we cannot forward search from every node, so spread activation from keyword nodes to others. Activation is like PageRank with decay. High activation ⇒ close to many keywords.

Activation Spreading The node with the highest activation is explored first. Activation is spread to neighbors (μ = 0.3). Gives low activation to neighbors of hubs.
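
One spreading step can be sketched as follows. The slide only gives μ = 0.3, so two details here are assumptions made for illustration: the passed-on activation is split equally among neighbors, and the source node's own activation is left unchanged.

```python
MU = 0.3  # fraction of a node's activation passed on to its neighbors


def spread(activation, node, neighbors):
    """Return a copy of `activation` after `node` spreads to `neighbors`.

    Assumption: the share MU * activation[node] is divided equally, and
    contributions arriving at a neighbor are added to what it already has.
    """
    updated = dict(activation)
    if not neighbors:
        return updated
    share = MU * activation[node] / len(neighbors)
    for neighbor in neighbors:
        updated[neighbor] = updated.get(neighbor, 0.0) + share
    return updated


# A hub with ten neighbors versus a node with a single neighbor.
hub = spread({"hub": 1.0}, "hub", [f"n{i}" for i in range(10)])
single = spread({"a": 1.0}, "a", ["b"])
```

Each neighbor of the hub ends up with a tenth of what the single neighbor receives, which is the sense in which neighbors of hubs get low activation.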

Modifications to Model Graph model stays the same. BANKS is concerned with search more than with how to tune parameters or define node weights / edge weights. BANKS code: tree node score N and tree edge score E (their formulas are not preserved in this transcript); total score = $E N^{\lambda}$ ($\lambda = 0.2$).

The New Algorithm Need two priority queues: Qin (do backward search from these nodes) and Qout (do forward search from these nodes). Each node n keeps 3 variables per keyword ti: sp[i], the node to go to from n on the shortest path to ti; distance[i], the length of the shortest path from n to ti; activation[i], the activation reaching n from keyword ti.

The New Algorithm – continued… Set the initial activation of the keyword nodes and add them to Qin for backward search. At each step, pick the node with maximum activation, i.e. if (Qin.getMaxActivation() > Qout.getMaxActivation()) use the node from Qin, else use the node from Qout. If the node is from Qin, do backward search and add the node itself to Qout (newly explored nodes go into Qin). If the node is from Qout, do forward search. If a node has been reached from all keywords, generate a result tree. [The answer is buffered, as results can be out of order.]
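
The queue-selection step can be sketched on its own, leaving out the actual edge expansion and the per-keyword bookkeeping. `ActivationQueue` and `next_step` are hypothetical names; the tie-breaking toward Qout follows the `else` branch of the rule above.

```python
import heapq


class ActivationQueue:
    """Priority queue that returns the node with the highest activation."""

    def __init__(self):
        self._heap = []

    def push(self, node, activation):
        heapq.heappush(self._heap, (-activation, node))

    def pop(self):
        negated, node = heapq.heappop(self._heap)
        return node, -negated

    def max_activation(self):
        return -self._heap[0][0] if self._heap else float("-inf")

    def __bool__(self):
        return bool(self._heap)


def next_step(q_in, q_out):
    """Pick the next node to expand; returns (direction, node) or None."""
    if not q_in and not q_out:
        return None
    if q_in.max_activation() > q_out.max_activation():
        node, activation = q_in.pop()
        q_out.push(node, activation)  # it becomes a forward-search candidate
        return "backward", node
    node, _ = q_out.pop()
    return "forward", node


q_in, q_out = ActivationQueue(), ActivationQueue()
q_in.push("roy", 1.0)
q_in.push("sudarshan", 0.5)
steps = [next_step(q_in, q_out) for _ in range(5)]
```

In a full implementation each step would also push newly explored neighbors into Qin or Qout with their spread activation.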

Explanation with example [Figure sequence over three slides: contents of Qin and Qout as the search proceeds from the keyword nodes for Roy and Sudarshan over nodes N1, N2, N3, N4, … N100, ending with “Result Found!”]

Generation of top-k results If we know the score of next-best answer, all buffered answers with better score can be output. Need upper bounds
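
The buffering rule can be sketched as a filter over the buffered answers. The scores and the `release` helper are illustrative; how the upper bound itself is estimated is a separate question.

```python
def release(buffered, upper_bound):
    """buffered: list of (score, answer) pairs waiting to be output.
    upper_bound: best score any not-yet-generated answer could reach.
    Returns (ready, waiting): answers safe to output now, best first,
    and the answers that must stay buffered.
    """
    ready = sorted((item for item in buffered if item[0] > upper_bound),
                   reverse=True)
    waiting = [item for item in buffered if item[0] <= upper_bound]
    return ready, waiting


buffered = [(0.9, "t1"), (0.4, "t2"), (0.7, "t3")]
ready, waiting = release(buffered, upper_bound=0.6)
```

As the search explores more of the graph the bound drops, so repeated calls release more of the buffer.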

Computation of upper bound For each keyword ti, we have explored nodes up to some length, say li. So, next-best score (approx.) = [formula not preserved in this transcript]. This is not a true upper bound, but it works quite well and is simple!

Are we losing answers? BANKS-I used many Dijkstra states; BANKS-II uses only 2: the forward and backward search states. The result is that we can now lose answers!

Answer Loss Example [Figure: two answer trees over nodes Nx and Ny and keywords K1 and K2, one labeled “This is the generated answer.” and the other “This answer is lost.”]

But we will generate this tree rooted at Nx. So, a rotated tree with the same nodes but a different root is often generated! [Figure: tree over Nx, Ny, K1, K2 rooted at Nx.]

Metrics of Performance Manually obtain the best relevant answers. Determine 2 times: time taken to produce the last relevant answer; time taken to output the last relevant answer. Search algorithms: MI-Bkwd, the original backward search with an iterator for every node matching a keyword; SI-Bkwd, backward search with a single backward iterator; Bidirec, bidirectional search. Datasets: DBLP, IMDB ~ 2 million nodes, 9 million edges; US Patent DB ~ 4 million nodes, 15 million edges.

Graph - I MI-Bkwd versus SI-Bkwd SI-Bkwd gain increases with origin size, # keywords

Graph - II SI-Bkwd versus Bidirec Bidirec gain increases with origin size, # keywords

A Critique BANKS needs a lot of memory. Need to cluster and keep parts of graph on disk. Work is in progress 

DBXplorer (ICDE ’02)

DBXplorer (Microsoft Research) Use the symbol table to determine which tables to join. Generate all possible table-join combinations. [Figure: T1, T2, T3, T4 and T5 are tables.]
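
A brute-force sketch of enumerating candidate joins, assuming the schema is an undirected graph of foreign-key links and the symbol table tells us which tables contain each keyword. This is not DBXplorer's actual enumeration: it does not prune non-minimal joins, and `SCHEMA_LINKS`, `connected`, and `candidate_joins` are names invented for the example, using the DBLP schema from earlier in the talk.

```python
from itertools import combinations

# Foreign-key links of the DBLP schema, treated as undirected for joining.
SCHEMA_LINKS = {("Writes", "Author"), ("Writes", "Paper"), ("Cites", "Paper")}


def connected(tables):
    """True if the given tables form a connected subgraph of the schema."""
    tables = set(tables)
    start = next(iter(tables))
    seen, stack = {start}, [start]
    while stack:
        current = stack.pop()
        for a, b in SCHEMA_LINKS:
            for x, y in ((a, b), (b, a)):
                if x == current and y in tables and y not in seen:
                    seen.add(y)
                    stack.append(y)
    return seen == tables


def candidate_joins(keyword_tables, all_tables, max_size=3):
    """keyword_tables: {keyword: set of tables that contain it}.
    Returns connected table sets that cover every keyword."""
    found = []
    for size in range(1, max_size + 1):
        for combo in combinations(sorted(all_tables), size):
            covers = all(set(combo) & tabs for tabs in keyword_tables.values())
            if covers and connected(combo):
                found.append(combo)
    return found


keyword_tables = {"sudarshan": {"Author"}, "optimization": {"Paper"}}
joins = candidate_joins(keyword_tables, {"Author", "Writes", "Paper", "Cites"})
```

Each returned table set would then be turned into one SQL join query whose rows contain all the keywords.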

Cool ideas in DBXplorer Symbol table need not be at tuple level. If a column has an index, a column-level symbol table is OK. Table compression: [Figure: keyword-to-column symbol table mapping keywords K1 to K5 to columns C1 and C2, shown before and after compression through an intermediate column X.]

ObjectRank (VLDB ’04)

ObjectRank (IBM, FIU, UCSD) Creates objects in database. Object definition is manual. e.g. in DBLP, author, conference and paper can be defined as objects. Heavily inspired by PageRank. Each node is given global ObjectRank just like PageRank of Google.

ObjectRank Ideas Keyword-level ObjectRank: for each keyword, precompute and save the object ranks of nodes (can optimize by defining a cut-off). Score of node n w.r.t. keyword k: $\mathrm{score}_k(n) = f(\text{Global-ObjectRank}(n), \mathrm{ObjectRank}_k(n))$. At run time, scores are combined: $\mathrm{score}_{k_1,k_2,\ldots,k_m}(n) = \mathrm{score}_{k_1}(n) \cdot \mathrm{score}_{k_2}(n) \cdots \mathrm{score}_{k_m}(n)$
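
The run-time combination step is a product over the precomputed per-keyword scores. In this sketch the per-keyword scores are made-up numbers standing in for the precomputed values, and the function f that mixes global and keyword-specific rank is assumed to have been applied already.

```python
import math


def combined_score(keyword_scores, query):
    """Product of a node's per-keyword scores over all query keywords."""
    return math.prod(keyword_scores[keyword] for keyword in query)


# Hypothetical precomputed scores score_k(n) for two nodes.
scores = {
    "paperA": {"xml": 0.4, "index": 0.5},
    "paperB": {"xml": 0.9, "index": 0.0},
}
query = ["xml", "index"]
ranking = sorted(scores, key=lambda n: combined_score(scores[n], query),
                 reverse=True)
```

Because the scores are multiplied, a node with a zero score for any one keyword drops to the bottom, however strong it is for the others.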

ObjectRank Algorithm and answers If graph is DAG or near DAG, topologically sort and spread ObjectRank in this order. Answers are single objects and not Cluster / group as in BANKS. Demo at : http://teriyaki.ucsd.edu:9099/objrank/main05_new.html
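
For an acyclic graph, a single pass in topological order is enough to propagate rank. The sketch below assumes a PageRank-style rule: a damping factor d = 0.85, a random jump shared equally by the base-set nodes matching the keyword, and each node splitting its rank equally among its out-edges. These details are illustrative choices, not taken from the slides. It needs Python 3.9 or later for `graphlib`.

```python
from graphlib import TopologicalSorter


def objectrank_dag(out_edges, base_set, d=0.85):
    """out_edges: {node: list of successors} of a DAG.
    base_set: nodes that contain the keyword.
    Returns {node: rank} computed in one topological pass.
    """
    nodes = set(out_edges) | {v for vs in out_edges.values() for v in vs}
    preds = {n: set() for n in nodes}
    for u, successors in out_edges.items():
        for v in successors:
            preds[v].add(u)

    rank = {}
    # static_order() yields every node after all of its predecessors.
    for n in TopologicalSorter(preds).static_order():
        jump = (1 - d) / len(base_set) if n in base_set else 0.0
        flow = sum(rank[u] / len(out_edges[u]) for u in preds[n])
        rank[n] = jump + d * flow
    return rank


ranks = objectrank_dag({"a": ["b", "c"], "b": ["c"]}, base_set={"a"})
```

On a graph with cycles the same update would have to be iterated until the ranks converge.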

Conclusion Studied BANKS, both versions. Covered cool ideas from DBXplorer and ObjectRank. Graph of BANKS must be made disk-resident.

References
Gaurav Bhalotia, Arvind Hulgeri, Charuta Nakhe, Soumen Chakrabarti, and S. Sudarshan. Keyword Searching and Browsing in Databases using BANKS. In International Conference on Data Engineering (ICDE), pages 1083–1096, 2002.
Varun Kacholia, Shashank Pandit, Soumen Chakrabarti, S. Sudarshan, et al. Bidirectional Expansion for Keyword Search on Graph Databases. In VLDB Conference, pages 505–516, 2005.
Sanjay Agrawal, Surajit Chaudhuri, and Gautam Das. DBXplorer: A System for Keyword-Based Search over Relational Databases. In International Conference on Data Engineering (ICDE), pages 5–22, 2002.
Andrey Balmin, Vagelis Hristidis, and Yannis Papakonstantinou. ObjectRank: Authority-Based Keyword Search in Databases. In VLDB Conference, pages 564–575, 2004.

Appendix Extra slides

Browsing Hyperlinks are provided for all primary key / foreign key attributes. Each table is displayed with a set of tools for interacting with data: projection (using drop), selection, join, group-by, sort. Template facilities to do a variety of tasks: browsing data by grouping and creating crosstabs, e.g. theses grouped by department and year; hierarchical views of data, nested XML style, even on relational data; graphical displays such as bar charts and pie charts. Templates are generic and can be applied on any data matching the assumed schema, and can be applied after applying selections. New templates can be created by the user, interactively.

Example of Browsing in BANKS

Related Work DataSpot (DTL)/Mercado Intuifind [VLDB 98]: based on a patent by Palmon (filed 1995, granted 1998); based on a hypergraph model, with an answer model similar to ours; differences: our model of backward link weights and prestige. Proximity Search [VLDB 98]: different model of proximity based on adding up support; no edge weights or prestige; different evaluation algorithm. Information units (linked Web pages) [WWW10]: no directionality, only studied in the Web context. Microsoft DBXplorer (this conference): no ranking, based on SQL generation; addresses efficient construction of text indexes. Microsoft English Query.

Extensions Summarization of output group the output tuples into sets that have same tree structure define the notion of similarity between two result trees perform restricted search Metadata queries (attribute:keyword queries) For example: author:levy match all the tuples of a relation costly Forward searching approach

Proposed Conclusions and Future Work BANKS is an integrated browsing and keyword querying system for relational databases Future work: Keyword queries on XML Disambiguating queries by selecting Nodes: G.W.Bush: “Bush Jr” or “Bush Sr” Tree structure: “coauthors” or “cites” Boolean queries Metadata queries Summarization of output