Surajit Chaudhuri, Microsoft Research Gautam Das, Microsoft Research Vagelis Hristidis, Florida International University Gerhard Weikum, MPI Informatik.

Presentation transcript:

Surajit Chaudhuri, Microsoft Research Gautam Das, Microsoft Research Vagelis Hristidis, Florida International University Gerhard Weikum, MPI Informatik Presented by: Kiran Karnam

• Introduction & Motivation
• Problem Definition
• Architecture
• Ranking Function
• Implementation
• Experiments
• Conclusions & Limitations

• Many-answers problem
• Two alternative solutions:
  ◦ Query reformulation
  ◦ Automatic ranking
• Apply the probabilistic model from IR to the ranking of DB tuples

• Many-answers problem
  SELECT * FROM REALTOR_DB WHERE CITY = 'SEATTLE';

• Query reformulation
• Automatic ranking

• Specified attributes: City
• Unspecified attributes: View, School District, Boat Dock

• Global score: captures the global importance of unspecified attribute values. E.g., VIEW='WATERFRONT'.
• Conditional score: captures the strength of the dependencies (or correlations) between specified and unspecified attribute values. E.g., CITY='SEATTLE' and VIEW='WATERFRONT'.

• Rules required for the derivation
• Bayes' rule: p(a|b) = p(b|a) p(a) / p(b)
• Product rule: p(a,b|c) = p(a|c) p(b|a,c)

• Bayes' theorem relates two conditional probabilities that are the reverse of each other, p(A|B) and p(B|A).
• The probability of an event A given an event B depends not only on the relationship between A and B but also on the marginal (or "simple") probability of each event.
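The rule can be checked numerically on a small example. The sketch below uses a made-up joint distribution over two binary events (the numbers are assumptions chosen only for illustration) and confirms that p(a|b) computed directly equals p(b|a) p(a) / p(b).

```python
from fractions import Fraction

# Hypothetical joint distribution p(a, b) over two binary events.
# The numbers are invented for illustration; they sum to 1.
joint = {
    (True, True): Fraction(3, 10),
    (True, False): Fraction(1, 10),
    (False, True): Fraction(2, 10),
    (False, False): Fraction(4, 10),
}

def p_a(a):
    """Marginal probability of event a."""
    return sum(p for (ai, _), p in joint.items() if ai == a)

def p_b(b):
    """Marginal probability of event b."""
    return sum(p for (_, bi), p in joint.items() if bi == b)

def p_a_given_b(a, b):
    """Conditional probability p(a|b), computed directly from the joint."""
    return joint[(a, b)] / p_b(b)

def p_b_given_a(b, a):
    """Conditional probability p(b|a), computed directly from the joint."""
    return joint[(a, b)] / p_a(a)

# Left side: p(a|b). Right side: Bayes' rule, p(b|a) p(a) / p(b).
lhs = p_a_given_b(True, True)
rhs = p_b_given_a(True, True) * p_a(True) / p_b(True)
print(lhs, rhs)
```

Exact fractions are used so that both sides can be compared for equality without floating-point noise.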

• Document (tuple) t, query Q
• R: relevant documents
• R̄ = D − R: irrelevant documents

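• A tuple t is treated as a document
• Partition t into t(X), the specified attribute values, and t(Y), the unspecified attribute values
• For brevity, t(X) and t(Y) are written as X and Y
• Starting from the initial scoring function, derive step by step until the final ranking function is obtained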

• Given a query Q and a tuple t, the X values are assumed to be independent among themselves, and likewise the Y values, while dependencies between X and Y values are allowed
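The derivation itself is not preserved in this transcript. The following is a reconstruction sketch of how the pieces named so far could fit together, not a copy of the slides. It assumes that the irrelevant set R̄ is approximated by the whole database D, and that relevance for tuples satisfying X is estimated from a query workload W; both are assumptions stated here for the sake of the sketch.

```latex
\begin{align*}
\mathrm{Score}(t)
  &\propto \frac{p(R \mid t)}{p(\bar{R} \mid t)}
   = \frac{p(t \mid R)\, p(R)}{p(t \mid \bar{R})\, p(\bar{R})}
   \propto \frac{p(t \mid R)}{p(t \mid \bar{R})}
   && \text{Bayes' rule; } p(R)/p(\bar{R}) \text{ is the same for every } t \\
  &\approx \frac{p(X, Y \mid R)}{p(X, Y \mid D)}
   && \text{assume } \bar{R} \approx D,\ t = (X, Y) \\
  &= \frac{p(X \mid R)}{p(X \mid D)} \cdot \frac{p(Y \mid X, R)}{p(Y \mid X, D)}
   && \text{product rule} \\
  &\propto \frac{p(Y \mid X, R)}{p(Y \mid X, D)}
   && X \text{ is identical for all result tuples} \\
  &\approx \frac{p(Y \mid X, W)}{p(Y \mid X, D)}
   = \frac{p(X \mid Y, W)\, p(Y \mid W) / p(X \mid W)}{p(X \mid Y, D)\, p(Y \mid D) / p(X \mid D)}
   && \text{workload assumption, then Bayes' rule} \\
  &\propto \frac{p(Y \mid W)}{p(Y \mid D)} \cdot \frac{p(X \mid Y, W)}{p(X \mid Y, D)}
   && p(X \mid W),\ p(X \mid D) \text{ depend only on the query}
\end{align*}
```

Applying the limited-independence assumption to the last line, and again dropping factors that depend only on X, breaks each term into products over individual values x and y.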

• If many queries specify a set X of conditions together, there is a preference correlation between the attributes in X.
• Global: e.g., if many queries ask for waterfront, then p(Waterfront=TRUE) is high.
• Conditional: e.g., if many queries ask for 4-bedroom houses in good school districts, then p(Bedrooms=4 | SchoolDistrict='good') and p(SchoolDistrict='good' | Bedrooms=4) are high.

• The final ranking formula is
  Score(t) ∝ Π_{y∈Y} [ p(y|W) / p(y|D) ] · Π_{y∈Y} Π_{x∈X} [ p(x|y,W) / p(x|y,D) ]
  where:
  ◦ p(y|W) = relative frequency of unspecified attribute value y in the workload W
  ◦ p(y|D) = relative frequency of unspecified attribute value y in the database D
  ◦ p(x|y,W) = strength of the correlation between x and y in W
  ◦ p(x|y,D) = strength of the correlation between x and y in D
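A minimal sketch of how such a score could be computed, assuming the product form used in the underlying paper: the product over unspecified values y of p(y|W) / p(y|D), multiplied by the product over all pairs (x, y) of p(x|y,W) / p(x|y,D). The function name, the dictionary layout, and all probability values below are hypothetical and chosen only for illustration.

```python
from math import prod

def score(x_values, y_values, p_y_w, p_y_d, p_xy_w, p_xy_d):
    """Ranking score of one tuple, up to a factor that depends only on the query.

    x_values: specified attribute values; y_values: unspecified attribute values.
    p_y_w / p_y_d map y to p(y|W) and p(y|D).
    p_xy_w / p_xy_d map (x, y) to p(x|y,W) and p(x|y,D).
    Values taken from a tuple that exists in D have non-zero p(.|D).
    """
    # Global part: importance of each unspecified value y.
    global_part = prod(p_y_w[y] / p_y_d[y] for y in y_values)
    # Conditional part: correlation of each specified x with each unspecified y.
    conditional_part = prod(
        p_xy_w[(x, y)] / p_xy_d[(x, y)] for y in y_values for x in x_values
    )
    return global_part * conditional_part

# Made-up atomic probabilities for the query CITY='SEATTLE'.
x = ["City=Seattle"]
p_y_w = {"View=Waterfront": 0.5, "View=Street": 0.25}
p_y_d = {"View=Waterfront": 0.125, "View=Street": 0.5}
p_xy_w = {("City=Seattle", "View=Waterfront"): 0.5,
          ("City=Seattle", "View=Street"): 0.25}
p_xy_d = {("City=Seattle", "View=Waterfront"): 0.25,
          ("City=Seattle", "View=Street"): 0.25}

waterfront = score(x, ["View=Waterfront"], p_y_w, p_y_d, p_xy_w, p_xy_d)
street = score(x, ["View=Street"], p_y_w, p_y_d, p_xy_w, p_xy_d)
print(waterfront, street)
```

With these invented numbers, a waterfront home scores higher because waterfront is requested often in the workload but is rare in the database.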

• Pre-processing
  ◦ Atomic probabilities module
  ◦ Index module
• Intermediate knowledge representation layer
• Query processing
  ◦ Scan algorithm
  ◦ List merge algorithm

• The atomic probabilities module computes p(y|W), p(y|D), p(x|y,W), and p(x|y,D) for all distinct values x and y.
• These atomic probabilities are stored as database tables in the intermediate knowledge representation layer, with appropriate indexes.
• The index module then builds the conditional and global list tables.
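As an illustration of the pre-processing step, the sketch below estimates atomic probabilities as plain relative frequencies. It assumes each workload query (or database tuple) is represented as a dictionary from attribute name to value; that representation, the function name, and the sample workload are all hypothetical, and the actual system stores these quantities in database tables rather than in memory.

```python
from collections import Counter
from itertools import permutations

def atomic_probabilities(records):
    """Estimate p(v) and p(x|y) by relative frequency.

    records: list of dicts, attribute -> value. Pass workload queries to get
    p(y|W) and p(x|y,W), or database tuples to get p(y|D) and p(x|y,D).
    """
    n = len(records)
    single = Counter()  # how many records contain a given (attribute, value)
    pair = Counter()    # how many records contain both x and y
    for rec in records:
        items = set(rec.items())
        single.update(items)
        pair.update(permutations(items, 2))  # ordered pairs (x, y), x != y
    p_single = {v: c / n for v, c in single.items()}
    # p(x|y) = count(x and y together) / count(y)
    p_cond = {(x, y): c / single[y] for (x, y), c in pair.items()}
    return p_single, p_cond

# Invented workload of four queries.
workload = [
    {"City": "Seattle", "View": "Waterfront"},
    {"City": "Seattle"},
    {"View": "Waterfront", "Bedrooms": 4},
    {"City": "Bellevue", "View": "Waterfront"},
]
p_w, p_cond_w = atomic_probabilities(workload)
print(p_w[("View", "Waterfront")])
print(p_cond_w[(("City", "Seattle"), ("View", "Waterfront"))])
```

Pairs that never co-occur are simply absent from `p_cond`, which corresponds to a relative frequency of zero in the sample.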

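• Conditional lists Cx: for each attribute value x, the tuples containing x, in descending order of conditional score
• Global lists Gx: for each attribute value x, the tuples containing x, in descending order of global score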

• Scan algorithm
  ◦ Select the tuples that satisfy the query
  ◦ Scan and compute the score of each result tuple
  ◦ Return the top-K tuples
  ◦ Inefficient when the answer set contains many tuples
• Another approach
  ◦ Pre-compute the top-K tuples for all possible queries
  ◦ Still infeasible in practice
• Trade-off solution
  ◦ Pre-compute ranked lists of tuples for all possible atomic queries
  ◦ At query time, merge the ranked lists to get the top-K tuples
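The contrast between scanning and merging can be sketched as follows. The merge shown here is a generic threshold-style merge over ranked lists, used as a stand-in; it is not claimed to be the exact algorithm from the presentation. It assumes that each list holds (tuple id, score) pairs sorted by descending score, that scores are positive, and that a tuple's overall score is the product of its per-list scores. All names and data are hypothetical.

```python
import heapq
from math import prod

def scan_top_k(scores_by_tuple, k):
    """Baseline: look at the score of every tuple in the answer set."""
    return heapq.nlargest(k, ((s, t) for t, s in scores_by_tuple.items()))

def list_merge_top_k(ranked_lists, k):
    """Threshold-style merge; can stop before reading the lists to the end."""
    lookup = [dict(lst) for lst in ranked_lists]        # random access by id
    shortest = min(len(lst) for lst in ranked_lists)
    seen, top = set(), []                               # top: min-heap of size <= k
    for depth in range(shortest):
        for lst in ranked_lists:
            tid, _ = lst[depth]
            if tid in seen:
                continue
            seen.add(tid)
            # Only tuples present in every list satisfy the whole query.
            if all(tid in table for table in lookup):
                total = prod(table[tid] for table in lookup)
                if len(top) < k:
                    heapq.heappush(top, (total, tid))
                elif total > top[0][0]:
                    heapq.heapreplace(top, (total, tid))
        # No unseen tuple can score above the product of the current row.
        threshold = prod(lst[depth][1] for lst in ranked_lists)
        if len(top) == k and top[0][0] >= threshold:
            break
    return sorted(top, reverse=True)

# Invented ranked lists for one query condition.
cond = [("t1", 4.0), ("t2", 2.0), ("t3", 1.0), ("t4", 0.5)]
glob = [("t2", 3.0), ("t1", 1.0), ("t4", 1.0), ("t3", 0.5)]
all_scores = {t: dict(cond)[t] * dict(glob)[t] for t, _ in cond}

merged = list_merge_top_k([cond, glob], k=2)
scanned = scan_top_k(all_scores, k=2)
print(merged, scanned)
```

On this data the merge stops after reading two rows of each list, while the scan touches all four tuples; both return the same top-2 result.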

• Databases used
  ◦ MSN Home Advisor database
  ◦ Internet Movie Database
• Software and hardware
  ◦ Microsoft SQL Server 2000 RDBMS
  ◦ P4 2.8-GHz PC, 1 GB RAM
  ◦ C#, connected to the RDBMS through DAO

• Quality experiments
• Performance experiments

Query: select * from SeattleHomes where City='Seattle' and Bedroom=1;
• Conditional ranked condos with garages the highest
• Global failed to recognize the importance of the unspecified attribute value Garage='Y'

• User preference of rankings
  ◦ 5 new queries
  ◦ Users were given the top-5 results

• Compare 2 algorithms
  ◦ Scan algorithm
  ◦ List Merge algorithm

• Execution times of the Scan and List Merge algorithms

• A completely automated approach to the many-answers problem that leverages data and workload statistics and correlations
• Limitation: the existence of correlations between text and non-text data
• Future work
  ◦ Empty-answers problem
  ◦ Handling plain-text attributes

• Surajit Chaudhuri, Gautam Das, Vagelis Hristidis, Gerhard Weikum. Probabilistic Ranking of Database Query Results. VLDB.
• users.cs.fiu.edu/~vagelis/presentations/ProbRanking.ppt