Efficient IR-Style Keyword Search over Relational Databases Vagelis Hristidis University of California, San Diego Luis Gravano Columbia University Yannis.


Efficient IR-Style Keyword Search over Relational Databases Vagelis Hristidis University of California, San Diego Luis Gravano Columbia University Yannis Papakonstantinou University of California, San Diego

Motivation
- Keyword search is the dominant information discovery method in documents
- An increasing amount of data is stored in databases
- Plain text coexists with structured data

Motivation
- Up until recently, information discovery in databases required:
  – Knowledge of schema
  – Knowledge of a query language (e.g., SQL)
  – Knowledge of the role of the keywords
- Goal: Enable IR-style keyword search over DBMSs without the above requirements

IR-Style Search over DBMSs
- IR keyword search is well developed for document search
- Modern DBMSs offer IR-style keyword search over individual text attributes
- What is the equivalent of a document in a database?

Example – Complaints Database Schema

Example - Complaints Database Data
Complaints:
  tupleId  prodId  custId  date  comments
  c1       p121    c…      …     “disk crashed after just one week of moderate use on an IBM Netvista X41”
  c2       p131    c…      …     “lower-end IBM Netvista caught fire, starting apparently with disk”
  c3       p131    c…      …     “IBM Netvista unstable with Maxtor HD”
Products:
  tupleId  prodId  manufacturer  model
  p1       p121    “Maxtor”      “D540X”
  p2       p131    “IBM”         “Netvista”
  p3       p141    “Tripplite”   “Smart 700VA”
Customers:
  tupleId  custId  name          occupation
  u1       c3232   “John Smith”  “Software Engineer”
  u2       c3131   “Jack Lucas”  “Architect”
  u3       c3143   “John Mayer”  “Student”

Example – Keyword Query [Maxtor Netvista]
The query is posed over the Complaints, Products, and Customers tables of the Complaints database shown above.

Keyword Query Semantics (definition of “document” in databases)
- Keywords are:
  – in the same tuple
  – in the same relation
  – in tuples connected through primary-foreign key relationships
- Score of result:
  – distance of keywords within a tuple
  – distance between keywords in terms of primary-foreign key connections
  – IR-style score of result tree

Example – Keyword Query [Maxtor Netvista]
Results over the Complaints, Products, and Customers tables shown above:
  (1) c3
  (2) p2 → c3
  (3) p1 → c1

Result of Keyword Query
- Result is a tree T of tuples where:
  – each edge corresponds to a primary-foreign key relationship
  – no tuple of T is redundant (minimality)
- “AND” query semantics: Every query keyword appears in T
- “OR” query semantics: Some query keywords might be missing from T
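The difference between the two semantics can be illustrated with a small sketch. The function names and the simple lowercase word tokenization below are assumptions made for illustration; they are not part of the presented system, and the sketch does not check minimality.

```python
import re

def keywords_present(tuple_texts, keywords):
    """Return the query keywords that occur in at least one tuple of the tree."""
    words = set()
    for text in tuple_texts:
        # Assumption: a keyword matches a whole lowercase alphanumeric token.
        words.update(re.findall(r"[a-z0-9]+", text.lower()))
    return {kw for kw in keywords if kw.lower() in words}

def satisfies(tuple_texts, keywords, semantics="AND"):
    """AND: every keyword appears in the tree. OR: at least one keyword appears."""
    found = keywords_present(tuple_texts, keywords)
    if semantics == "AND":
        return len(found) == len(set(keywords))
    return len(found) > 0

query = ["Maxtor", "Netvista"]
c2 = "lower-end IBM Netvista caught fire, starting apparently with disk"
c3 = "IBM Netvista unstable with Maxtor HD"
print(satisfies([c3], query, "AND"))  # c3 alone contains both keywords
print(satisfies([c2], query, "AND"))  # c2 lacks "Maxtor"
print(satisfies([c2], query, "OR"))   # one keyword is enough under OR
```

Under this sketch, complaint c2 on its own qualifies only under “OR” semantics, while c3 qualifies under both.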

Score of Result T
- Combining function: Score combines the scores of the attribute values of T
  – One reasonable choice: $\mathrm{Score}(T) = \frac{\sum_{a \in T} \mathrm{Score}(a)}{\mathrm{size}(T)}$
- Attribute value scores: Score(a) is calculated using the DBMS's IR “datablades”

Shortcomings of Prior Work
- Simplistic ranking methods (e.g., based only on the size of the connecting tree) that ignore well-studied IR ranking strategies
- No straightforward extension to improve efficiency by returning just the top-k results
- Poor handling of free-text attributes [DBXplorer, DISCOVER]

Example – Keyword Query [Maxtor Netvista]
Results over the Complaints, Products, and Customers tables shown above: (1) c3, (2) p2 → c3, (3) p1 → c1
- Score(c3) = 4/3
- Score(p2 → c3) = (1 + 4/3)/2 = 7/6
- Score(p1 → c1) = (1 + 1/3)/2 = 4/6
Tuple score columns shown on the slide: Complaints 1/3, 4/3; Products 1, 1, 0
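The arithmetic of this example can be reproduced with a short sketch of the sum-over-size combining function. The function name `combined_score` is hypothetical, and exact fractions are used only to make the comparison with the slide's values easy.

```python
from fractions import Fraction

def combined_score(attribute_scores, size):
    """Sum of the attribute-value scores of a result tree divided by its size."""
    return sum(Fraction(s) for s in attribute_scores) / size

# Attribute scores taken from the example: c3 = 4/3, c1 = 1/3, p1 = p2 = 1.
score_c3 = combined_score([Fraction(4, 3)], size=1)
score_p2_c3 = combined_score([1, Fraction(4, 3)], size=2)
score_p1_c1 = combined_score([1, Fraction(1, 3)], size=2)

print(score_c3, score_p2_c3, score_p1_c1)  # 4/3 7/6 2/3
```

The last value prints in lowest terms as 2/3, which equals the 4/6 given in the example.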

Architecture
- Keyword query: [Maxtor Netvista]
- Tuple sets:
  – Complaints^Q = [(c3,comments,1.33), (c1,comments,0.33), (c2,comments,0.33)]
  – Products^Q = [(p1,manufacturer,1), (p2,model,1)]
- Candidate networks:
  – Complaints^Q
  – Products^Q
  – Complaints^Q ← Products^Q
  – Complaints^Q ← Customers^{} → Complaints^Q
  – Complaints^Q ← Products^{} → Complaints^Q
  – ...
- Parameterized SQL queries:
  – ... SELECT * FROM Complaints^Q c, Products^Q p WHERE c.prodId = p.prodId AND c.prodId = ? AND c.custId = ?; ...
- Results: c3, p2 → c3, p1 → c1

Candidate Network Generator
- Find all trees of tuple sets (free or non-free) that may produce a result, based on DISCOVER's CN generator [VLDB 2002]
- Use a single non-free tuple set for each relation
  – allows “OR” semantics
  – fewer CNs are generated
  – extra filtering step required for “AND” semantics

Candidate Network Generator Example
For query [Maxtor Netvista]:
- CNs:
  – Complaints^Q
  – Products^Q
  – Complaints^Q ← Products^Q
  – Complaints^Q ← Customers^{} → Complaints^Q
  – Complaints^Q ← Products^{} → Complaints^Q
- Non-CNs:
  – Complaints^Q ← Customers^{} → Complaints^{}
  – Products^Q → Complaints^{} ← Products^Q

Execution Algorithms
- Users usually want only the top-k results. Hence, submitting to the DBMS a SQL query for each CN (Naïve algorithm) is inefficient.
- When queries produce at most very few results, the Naïve algorithm is efficient, since it fully exploits the DBMS.
- Monotonic combining functions: if results T, T' have the same schema and, for every attribute, $\mathrm{Score}(a_i) \le \mathrm{Score}(a'_i)$, then $\mathrm{Score}(T) \le \mathrm{Score}(T')$
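A minimal sketch of the Naïve strategy, using Python's built-in `sqlite3` module as a stand-in for the DBMS. The table names `ComplaintsQ` and `ProductsQ`, the `score` column, and the three CN queries are assumptions for illustration; the tuple ids, product ids, and scores are the ones from the running example. The sketch applies “OR” semantics and the sum-over-size combining function, with no extra filtering.

```python
import sqlite3

def naive_top_k(k):
    """Run one SQL query per candidate network, then keep the k best rows."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE ComplaintsQ (tupleId TEXT, prodId TEXT, score REAL);
        CREATE TABLE ProductsQ   (tupleId TEXT, prodId TEXT, score REAL);
        INSERT INTO ComplaintsQ VALUES
            ('c3', 'p131', 1.33), ('c1', 'p121', 0.33), ('c2', 'p131', 0.33);
        INSERT INTO ProductsQ VALUES
            ('p1', 'p121', 1.0), ('p2', 'p131', 1.0);
    """)
    cn_queries = [
        # CN of size 1 over the complaints tuple set
        "SELECT tupleId, score FROM ComplaintsQ",
        # CN of size 1 over the products tuple set
        "SELECT tupleId, score FROM ProductsQ",
        # CN of size 2 joining both tuple sets on the foreign key
        "SELECT p.tupleId || ' -> ' || c.tupleId, (p.score + c.score) / 2 "
        "FROM ComplaintsQ c JOIN ProductsQ p ON c.prodId = p.prodId",
    ]
    results = []
    for sql in cn_queries:
        # Every CN is evaluated in full, regardless of how small k is.
        results.extend(con.execute(sql).fetchall())
    con.close()
    return sorted(results, key=lambda row: row[1], reverse=True)[:k]

for result, score in naive_top_k(2):
    print(result, round(score, 3))
```

Every candidate network is fully evaluated before the top-k cut is applied, which is the source of the inefficiency noted above when many results exist.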

Sparse Algorithm: Example Execution
  CN                           results    score          MFS
  Products^Q                   p1         9              9
  Complaints^Q                 c2         7              7
  Complaints^Q ← Products^Q    c1 ← p1    (9+5)/2 = 7    (9+7)/2 = 8
Best when query produces at most a few results
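One way to read this example is as bound-based pruning: a candidate network is executed only if its bound (the MFS column) can still beat the current k-th best result. The sketch below follows that reading under stated assumptions: each CN is represented by a hypothetical `(bound, run)` pair, where `run()` stands in for executing the CN's SQL query, and CNs are visited in the order given.

```python
def sparse_top_k(cns, k):
    """cns: list of (bound, run); run() returns a list of (result, score)."""
    top = []
    for bound, run in cns:
        # Skip the CN if even its best possible result cannot beat the
        # k-th result collected so far.
        if len(top) >= k and top[k - 1][1] >= bound:
            continue
        top.extend(run())
        top.sort(key=lambda r: r[1], reverse=True)
        del top[k:]
    return top

executed = []

def make_cn(name, rows):
    # Hypothetical stand-in for a CN query; records that it was executed.
    def run():
        executed.append(name)
        return rows
    return run

cns = [
    (9, make_cn("Products^Q", [("p1", 9)])),
    (7, make_cn("Complaints^Q", [("c2", 7)])),
    (8, make_cn("Complaints^Q <- Products^Q", [("c1 <- p1", 7)])),
]
print(sparse_top_k(cns, 1), executed)
```

With k = 1 and the scores above, only the first CN needs to run, because its result (score 9) already meets or exceeds the bounds 7 and 8 of the remaining CNs.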

Single Pipelined Algorithm: Example Execution
CN: Complaints^Q ← Products^Q
- Each step: get the next tuple from the most promising non-free tuple set
- MPFS at successive steps:
  – Max[(5+9)/2, (7+6)/2] = 7
  – Max[(1+9)/2, (7+6)/2] = 6.5
  – Max[(1+9)/2, (7+1)/2] = 5
- Results queue (result, score): p1 → c1, 7
- Output: p1 → c1, then p2 → c2
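The trace above can be reproduced with a small sketch for a CN made of exactly two non-free tuple sets joined directly. This is a simplification with several assumptions: the function name, the `joins` predicate (standing in for a parameterized SQL probe), the average of two scores as combining function, and the join links and lowest-scoring tuples (p3, c3) in the sample data are all illustrative.

```python
import heapq

def single_pipelined(left, right, joins, k):
    """left, right: lists of (tuple_id, score) sorted by descending score."""
    sets = [left, right]
    pos = [0, 0]   # number of tuples retrieved so far from each tuple set
    queue = []     # max-heap (negated scores) of results found so far
    output = []

    def mpfs(i):
        # Best score any result using the next unretrieved tuple of set i could
        # reach: that tuple combined with the top tuple of the other set.
        if pos[i] >= len(sets[i]) or not sets[1 - i]:
            return float("-inf")
        return (sets[i][pos[i]][1] + sets[1 - i][0][1]) / 2

    while len(output) < k:
        bounds = [mpfs(0), mpfs(1)]
        best = max(bounds)
        # Emit queued results that no future result can beat.
        while queue and len(output) < k and -queue[0][0] >= best:
            neg_score, result = heapq.heappop(queue)
            output.append((result, -neg_score))
        if len(output) >= k or best == float("-inf"):
            break
        i = bounds.index(best)          # most promising tuple set
        tid, score = sets[i][pos[i]]
        pos[i] += 1
        # Join the new tuple with the already retrieved prefix of the other set.
        for oid, oscore in sets[1 - i][:pos[1 - i]]:
            pair = (tid, oid) if i == 0 else (oid, tid)
            if joins(*pair):
                heapq.heappush(queue, (-(score + oscore) / 2, pair))
    return output

products = [("p1", 9), ("p2", 6), ("p3", 1)]
complaints = [("c2", 7), ("c1", 5), ("c3", 1)]
links = {("p1", "c1"), ("p2", "c2")}   # assumed foreign-key matches
print(single_pipelined(products, complaints, lambda p, c: (p, c) in links, 2))
```

With these inputs the bound takes the values 7, 6.5, and 5 at the steps where results are queued or emitted, matching the MPFS values in the example, and the results come out in the order p1 → c1, then p2 → c2.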

Global Pipelined Algorithm: Example Execution
- Global MPFS = $\max_i(\mathrm{MPFS}_i)$ over all CNs $C_i$
- Best when query produces many results

Hybrid Algorithm
- Estimate the number of results.
  – For “OR” semantics, use the DBMS estimator
  – For “AND” semantics, probabilistically adjust the DBMS estimator
- If at most a few query results are expected, use the Sparse Algorithm.
- If many query results are expected, use the Global Pipelined Algorithm.
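The selection rule can be sketched as follows. The slide does not say how “few” is defined or how the “AND” adjustment is computed, so both `few_factor` (a threshold relative to k) and `and_probability` (a multiplicative correction to the DBMS estimate) are hypothetical parameters.

```python
def choose_algorithm(dbms_estimate, k, semantics="OR",
                     and_probability=1.0, few_factor=1.0):
    """Pick an execution algorithm from an estimated result count."""
    expected = dbms_estimate
    if semantics == "AND":
        # Assumption: scale the estimate by the probability that a result
        # contains every query keyword.
        expected = dbms_estimate * and_probability
    # Assumption: "many" means more than few_factor * k expected results.
    return "global_pipelined" if expected > few_factor * k else "sparse"

print(choose_algorithm(1000, 10))                 # many results expected
print(choose_algorithm(5, 10))                    # few results expected
print(choose_algorithm(1000, 10, "AND", 0.001))   # AND adjustment shrinks estimate
```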

Related Work
- DBXplorer [ICDE 2002], DISCOVER [VLDB 2002]
  – Similar three-step architecture
  – Score = 1/size(T)
  – Only AND semantics
  – No straightforward extension for efficient top-k execution
- BANKS [ICDE 2002], Goldman et al. [VLDB 1998]
  – Database viewed as graph
  – No use of schema
- Florescu et al. [WWW 2000], XQuery Full-Text
- Ilyas et al. [VLDB 2003], J* algorithm [VLDB 2001]
  – Top-k algorithms for join queries

Experiments – DBLP Dataset
- DBLP contains few citation edges. Synthetic citation edges were added so that the average number of citations is 20.
- Final dataset is 56MB.
- Experiments were run over a state-of-the-art commercial RDBMS.
- Schema legend: C: Conference, Y: Year, P: Paper, A: Author

OR Semantics: Effect of Maximum Allowed CN Size Average execution time of keyword top-10 queries

OR Semantics: Effect of Number of Objects Requested k Average execution time of keyword queries with maximum candidate-network size of 6

OR Semantics: Effect of Number of Query Keywords Average execution time of 100 top-10 queries with maximum candidate-network size of 6

Conclusions
- Extend IR-style ranking to databases.
- Exploit text-search capabilities of modern DBMSs to generate results of higher quality.
- Support both “AND” and “OR” semantics.
- Achieve substantial speedup over prior work via pipelined top-k query processing algorithms.

Questions?

Compare Algorithms wrt Result Size
- Settings: max CN size = 6, top-10, 2 keywords
- Panels: OR-semantics, AND-semantics

Ranking Functions
- Proposed algorithms support tuple monotone combining functions
- That is, if results T, T' have the same schema and, for every attribute, $\mathrm{Score}(a_i) \le \mathrm{Score}(a'_i)$, then $\mathrm{Score}(T) \le \mathrm{Score}(T')$
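The monotonicity condition can be spot-checked on sample data. The sketch below is an empirical check over a handful of score vectors, not a proof; the function names and the sample values are illustrative assumptions.

```python
from itertools import product

def avg_score(scores):
    """Sum of attribute scores divided by the size of the result."""
    return sum(scores) / len(scores)

def is_tuple_monotone(combine, samples):
    """Return False if some pair of same-length score vectors violates the
    condition: attribute-wise smaller scores must not combine to a larger one."""
    for t, u in product(samples, repeat=2):
        if len(t) == len(u) and all(a <= b for a, b in zip(t, u)):
            if combine(t) > combine(u):
                return False
    return True

samples = [(1, 1), (1, 4), (9, 5), (9, 7), (6, 7), (0, 0)]
print(is_tuple_monotone(avg_score, samples))             # no violation found
print(is_tuple_monotone(lambda s: -sum(s), samples))     # violates the condition
```

A function such as the negated sum fails the check immediately, since raising an attribute score lowers the combined score.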