Supporting Queries with Imprecise Constraints Ullas Nambiar Dept. of Computer Science University of California, Davis Subbarao Kambhampati Dept. of Computer.

Presentation transcript:

Supporting Queries with Imprecise Constraints
Ullas Nambiar, Dept. of Computer Science, University of California, Davis
Subbarao Kambhampati, Dept. of Computer Science, Arizona State University
18th July, AAAI-06, Boston, USA
[WebDB 2004; VLDB 2005 (d); WWW 2005 (p); ICDE 2006]

Dichotomy in Query Processing

Databases
- User knows what she wants
- User query completely expresses the need
- Answers exactly matching query constraints

IR Systems
- User has an idea of what she wants
- User query captures the need to some degree
- Answers ranked by degree of relevance

Autonomous, un-curated DB; inexperienced, impatient user population

Why Support Imprecise Queries?
- Want a ‘sedan’ priced around $7000
- A Feasible Query: Make = “Toyota”, Model = “Camry”, Price ≤ $7000
- What about the price of a Honda Accord?
- Is there a Camry for $7100?
- Solution: Support Imprecise Queries

Make    Model  Price  Year
Toyota  Camry  $7000  1999
Toyota  Camry  $7000  2001
Toyota  Camry  $6700  2000
Toyota  Camry  $6500  1998
...

Others are following …

What does Supporting Imprecise Queries Mean?

The Problem: Given a conjunctive query Q over a relation R, find a set of tuples that will be considered relevant by the user.

Ans(Q) = {x | x ∈ R, Rel(x|Q,U) > c}

Constraints
- Minimal burden on the end user
- No changes to existing database
- Domain independent

Autonomous, un-curated DB; inexperienced, impatient user population
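The definition of Ans(Q) above amounts to thresholding a relevance score. The following is a minimal sketch of that idea only; `answer_set`, `toy_rel`, the sample rows, and the threshold value are all hypothetical, and the relevance function here is a stand-in, not the one learned by AIMQ.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]

def answer_set(relation: List[Row], rel: Callable[[Row], float], c: float) -> List[Row]:
    """Return the tuples x in R with rel(x) > c, most relevant first."""
    scored = [(rel(x), x) for x in relation]
    kept = [(s, x) for s, x in scored if s > c]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [x for _, x in kept]

def toy_rel(x: Row) -> float:
    """Hypothetical relevance: how close the price is to $7000."""
    return max(0.0, 1.0 - abs(x["Price"] - 7000) / 7000)

cars = [
    {"Make": "Toyota", "Model": "Camry", "Price": 7000},
    {"Make": "Toyota", "Model": "Camry", "Price": 7100},
    {"Make": "Honda", "Model": "Accord", "Price": 12000},
]

# The $7100 Camry is returned even though it fails the precise Price <= 7000 test.
result = answer_set(cars, toy_rel, c=0.9)
```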

Assessing Relevance Function Rel(x|Q,U)
- We looked at a variety of non-intrusive relevance assessment methods
  - Basic idea is to learn the relevance function for the user population rather than for single users
- Methods
  - From the analysis of the (sample) data itself: allows us to understand the relative importance of attributes, and the similarity between the values of an attribute [ICDE 2006; WWW 2005 poster]
  - From the analysis of query logs: allows us to identify related queries, and then throw in their answers [WIDM 2003; WebDB 2004]
  - From co-click patterns: allows us to identify similarity based on user click pattern [Under Review]

Our Solution: AIMQ

The AIMQ Approach
[For the special case of empty query, we start with a relaxation that uses AFD analysis]

An Illustrative Example
- Relation: CarDB(Make, Model, Price, Year)
- Imprecise query Q :− CarDB(Model like “Camry”, Price like “10k”)
- Base query Q_pr :− CarDB(Model = “Camry”, Price = “10k”)
- Base set A_bs
  - Make = “Toyota”, Model = “Camry”, Price = “10k”, Year = “2000”
  - Make = “Toyota”, Model = “Camry”, Price = “10k”, Year = “2001”

Obtaining Extended Set
- Problem: Given base set, find tuples from database similar to tuples in base set.
- Solution:
  - Consider each tuple in base set as a selection query, e.g. Make = “Toyota”, Model = “Camry”, Price = “10k”, Year = “2000”
  - Relax each such query to obtain “similar” precise queries, e.g. Make = “Toyota”, Model = “Camry”, Price = “”, Year = “2000”
  - Execute and determine tuples having similarity above some threshold.
- Challenge: Which attribute should be relaxed first?
  - Make? Model? Price? Year?
- Solution: Relax least important attribute first.
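The relax-execute-filter loop described above can be sketched over an in-memory relation. This is a simplified illustration, not the AIMQ implementation: `relax`, `run_query`, `extended_set`, and `overlap_sim` are hypothetical names, only one attribute is relaxed at a time, and the similarity function is a plain attribute-overlap fraction chosen for brevity.

```python
def relax(base_tuple, attrs_to_relax):
    """Turn a base tuple into a selection query with some constraints dropped."""
    return {a: v for a, v in base_tuple.items() if a not in attrs_to_relax}

def run_query(relation, query):
    """Return tuples that satisfy every equality constraint in the query."""
    return [t for t in relation if all(t[a] == v for a, v in query.items())]

def overlap_sim(a, b):
    """Assumed similarity: fraction of attributes on which two tuples agree."""
    return sum(a[k] == b[k] for k in a) / len(a)

def extended_set(relation, base_tuple, relax_order, sim, threshold):
    """Relax one attribute at a time, in the given order, and keep similar tuples."""
    seen, out = set(), []
    for attr in relax_order:
        for t in run_query(relation, relax(base_tuple, {attr})):
            if t == base_tuple:
                continue  # the base tuple itself is already in the base set
            key = tuple(sorted(t.items()))
            if key in seen:
                continue
            seen.add(key)
            if sim(base_tuple, t) > threshold:
                out.append(t)
    return out

base = {"Make": "Toyota", "Model": "Camry", "Price": 10000, "Year": 2000}
relation = [
    dict(base),
    {"Make": "Toyota", "Model": "Camry", "Price": 10500, "Year": 2000},
    {"Make": "Toyota", "Model": "Camry", "Price": 10000, "Year": 2001},
    {"Make": "Honda", "Model": "Accord", "Price": 10000, "Year": 2000},
]
extended = extended_set(relation, base, ["Price", "Year", "Model", "Make"],
                        overlap_sim, threshold=0.5)
```

With this toy data, relaxing Price and then Year each surfaces one additional Camry, while the Accord is never reached by a single-attribute relaxation.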

Least Important Attribute
- Definition: An attribute whose binding value, when changed, has minimal effect on values binding other attributes.
  - Does not decide values of other attributes
  - Value may depend on other attributes
- E.g. changing/relaxing Price will usually not affect other attributes, but changing Model usually affects Price
- Dependence between attributes is useful to decide relative importance
  - Approximate Functional Dependencies & Approximate Keys
  - Approximate in the sense that they are obeyed by a large percentage (but not all) of tuples in the database
  - Can use TANE, an algorithm by Huhtala et al [1999]
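To make "obeyed by a large percentage (but not all) of tuples" concrete, the sketch below measures how far a dependency X → Y is from holding exactly: the smallest fraction of tuples that would have to be removed for it to hold. This is a brute-force illustration and not TANE itself; `afd_error` and the sample rows are hypothetical.

```python
from collections import Counter, defaultdict

def afd_error(rows, lhs, rhs):
    """Fraction of rows that must be removed so that lhs -> rhs holds exactly."""
    groups = defaultdict(Counter)
    for r in rows:
        groups[tuple(r[a] for a in lhs)][r[rhs]] += 1
    # Within each lhs group, keep only the rows carrying the most common rhs value.
    kept = sum(max(counts.values()) for counts in groups.values())
    return 1.0 - kept / len(rows)

rows = [
    {"Make": "Toyota", "Model": "Camry"},
    {"Make": "Toyota", "Model": "Camry"},
    {"Make": "Honda", "Model": "Accord"},
    {"Make": "Honda", "Model": "Accord"},
    {"Make": "Acura", "Model": "Accord"},  # a deliberately inconsistent entry
]

model_to_make = afd_error(rows, ["Model"], "Make")  # one row of five violates it
make_to_model = afd_error(rows, ["Make"], "Model")  # holds exactly on this sample
```

A dependency with a small error can be treated as approximately valid; the acceptable error level is a tuning choice not stated in the slides.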

Deciding Attribute Importance
- Mine AFDs and Approximate Keys
- Create dependence graph using AFDs
  - Strongly connected, hence a topological sort is not possible
- Using the Approximate Key with highest support, partition attributes into
  - Deciding set
  - Dependent set
  - Sort the subsets using dependence and influence weights
- Measure attribute importance as [formula not preserved in transcript]

Example: CarDB(Make, Model, Year, Price)
- Decides: Make, Year
- Depends: Model, Price
- Order: Price, Model, Year, Make
- 1-attribute: {Price, Model, Year, Make}
- 2-attribute: {(Price, Model), (Price, Year), (Price, Make), ...}
- Attribute relaxation order is all non-keys first, then keys
- Greedy multi-attribute relaxation
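Given the single-attribute order from the CarDB example, the multi-attribute relaxation sets listed above can be enumerated in the same order with `itertools.combinations`. This is only a sketch of the enumeration; `relaxation_sets` and `relaxation_schedule` are hypothetical names, and how AIMQ prunes or stops the greedy expansion is not shown in the slides.

```python
from itertools import combinations

def relaxation_sets(order, k):
    """All k-attribute sets to relax, preserving the least-important-first order."""
    return list(combinations(order, k))

def relaxation_schedule(order, max_k):
    """Single-attribute relaxations first, then pairs, and so on up to max_k."""
    schedule = []
    for k in range(1, max_k + 1):
        schedule.extend(relaxation_sets(order, k))
    return schedule

order = ["Price", "Model", "Year", "Make"]  # order taken from the slide
pairs = relaxation_sets(order, 2)
schedule = relaxation_schedule(order, 2)
```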

Tuple Similarity
- Tuples obtained after relaxation are ranked according to their similarity to the corresponding tuples in base set [formula not preserved in transcript]
  - where Wi = normalized influence weights, ∑ Wi = 1, i = 1 to |Attributes(R)|
- Value Similarity
  - Euclidean for numerical attributes, e.g. Price, Year
  - Concept Similarity for categorical, e.g. Make, Model
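The ranking formula itself did not survive in the transcript, so the sketch below is an assumption: a weighted sum of per-attribute value similarities, consistent with the stated weights Wi that sum to 1. The numeric similarity (one minus relative distance, clamped to [0, 1]), the example weights, and the exact-match categorical similarity are all illustrative stand-ins.

```python
def numeric_sim(a, b):
    """Assumed numeric similarity: 1 - relative distance, clamped to [0, 1]."""
    if a == 0:
        return 1.0 if b == 0 else 0.0
    return max(0.0, 1.0 - abs(a - b) / abs(a))

def exact_match(attr, x, y):
    """Placeholder categorical similarity; a learned concept similarity would go here."""
    return 1.0 if x == y else 0.0

def tuple_sim(base, cand, weights, cat_sim):
    """Weighted sum of value similarities; weights are expected to sum to 1."""
    total = 0.0
    for attr, w in weights.items():
        x, y = base[attr], cand[attr]
        if isinstance(x, (int, float)) and isinstance(y, (int, float)):
            total += w * numeric_sim(x, y)
        else:
            total += w * cat_sim(attr, x, y)
    return total

weights = {"Make": 0.1, "Model": 0.4, "Price": 0.3, "Year": 0.2}  # hypothetical values
base = {"Make": "Toyota", "Model": "Camry", "Price": 10000, "Year": 2000}
cand = {"Make": "Toyota", "Model": "Camry", "Price": 9000, "Year": 2000}
score = tuple_sim(base, cand, weights, exact_match)
```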

Categorical Value Similarity
- Two words are semantically similar if they have a common context (from NLP)
- Context of a value is represented as a set of bags of co-occurring values, called a Supertuple
- Value Similarity: estimated as the percentage of common {Attribute, Value} pairs
  - Measured as the Jaccard Similarity among supertuples representing the values

Supertuple for Concept Make=Toyota, ST(Q_Make=Toyota)
Model  Camry: 3, Corolla: 4, ...
Year   2000: 6, 1999: 5, 2001: 2, ...
Price  5995: 4, 6500: 3, 4000: 6

JaccardSim(A,B) = |A ∩ B| / |A ∪ B|
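A supertuple and the Jaccard comparison between two of them can be sketched with `collections.Counter`. Two things here are assumptions rather than details from the slides: the bags are compared with bag semantics (minimum counts for the intersection, maximum counts for the union), and the per-attribute scores are combined by an unweighted average. The function names and sample rows are hypothetical.

```python
from collections import Counter

def supertuple(rows, attr, value):
    """For attr=value, collect one bag of co-occurring values per other attribute."""
    st = {}
    for r in rows:
        if r[attr] != value:
            continue
        for a, v in r.items():
            if a != attr:
                st.setdefault(a, Counter())[v] += 1
    return st

def bag_jaccard(a, b):
    """Jaccard similarity over bags: sum of min counts over sum of max counts."""
    keys = set(a) | set(b)
    inter = sum(min(a[k], b[k]) for k in keys)
    union = sum(max(a[k], b[k]) for k in keys)
    return inter / union if union else 0.0

def value_sim(st1, st2):
    """Assumed combination: unweighted mean of per-attribute bag Jaccard scores."""
    attrs = set(st1) | set(st2)
    if not attrs:
        return 0.0
    total = sum(bag_jaccard(st1.get(a, Counter()), st2.get(a, Counter())) for a in attrs)
    return total / len(attrs)

rows = [
    {"Make": "Toyota", "Model": "Camry", "Year": 2000},
    {"Make": "Toyota", "Model": "Corolla", "Year": 2000},
    {"Make": "Honda", "Model": "Accord", "Year": 2000},
    {"Make": "Honda", "Model": "Civic", "Year": 2001},
]
toyota = supertuple(rows, "Make", "Toyota")
honda = supertuple(rows, "Make", "Honda")
sim = value_sim(toyota, honda)
```

In this toy data the two makes share no models, so their similarity comes entirely from the overlap in the Year bags.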

Value Similarity Graph
[Figure: similarity graph over the makes Ford, Chevrolet, Toyota, Honda, Dodge, Nissan, BMW]

Empirical Evaluation
- Goal
  - Evaluate the effectiveness of the query relaxation and similarity estimation
- Database
  - Used car database CarDB based on Yahoo Autos: CarDB(Make, Model, Year, Price, Mileage, Location, Color), populated using 100k tuples from Yahoo Autos
  - Census Database from UCI Machine Learning Repository, populated using 45k tuples
- Algorithms
  - AIMQ
    - RandomRelax: randomly picks attribute to relax
    - GuidedRelax: uses relaxation order determined using approximate keys and AFDs
  - ROCK: RObust Clustering using linKs (Guha et al, ICDE 1999)
    - Compute Neighbours and Links between every tuple
      - Neighbour: tuples similar to each other
      - Link: number of common neighbours between two tuples
    - Cluster tuples having common neighbours
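The neighbour and link notions used by the ROCK baseline can be illustrated on a handful of tuples. This sketch covers only those two definitions, not ROCK's clustering procedure; the Jaccard similarity over {attribute, value} pairs, the threshold `theta`, and the choice to not count a tuple as its own neighbour are assumptions made for the example.

```python
def item_jaccard(t1, t2):
    """Jaccard similarity between the {attribute, value} pair sets of two tuples."""
    a, b = set(t1.items()), set(t2.items())
    return len(a & b) / len(a | b)

def neighbours(rows, theta):
    """nb[i] is the set of indices j != i whose similarity to row i is >= theta."""
    n = len(rows)
    nb = [set() for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if item_jaccard(rows[i], rows[j]) >= theta:
                nb[i].add(j)
                nb[j].add(i)
    return nb

def links(nb, i, j):
    """Number of common neighbours of rows i and j."""
    return len(nb[i] & nb[j])

rows = [
    {"Make": "Toyota", "Model": "Camry", "Year": 2000},
    {"Make": "Toyota", "Model": "Camry", "Year": 2001},
    {"Make": "Toyota", "Model": "Corolla", "Year": 2000},
    {"Make": "Honda", "Model": "Accord", "Year": 2005},
]
nb = neighbours(rows, theta=0.5)
```

Rows 1 and 2 are not neighbours of each other here, yet they share row 0 as a common neighbour, which is the kind of indirect evidence that links are meant to capture.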

Efficiency of Relaxation

Random Relaxation
- Average 8 tuples extracted per relevant tuple for ε = 0.5; increases to 120 tuples for ε = 0.7
- Not resilient to change in ε

Guided Relaxation
- Average 4 tuples extracted per relevant tuple for ε = 0.5; goes up to 12 tuples for ε = 0.7
- Resilient to change in ε

Accuracy over CarDB
- 14 queries over 100K tuples
- Similarity learned using 25k sample
- Mean Reciprocal Rank (MRR) estimated as [formula not preserved in transcript]
- Overall high MRR shows high relevance of suggested answers

Handling Imprecision & Incompleteness
- Incompleteness in data
  - Databases are being populated by
    - Entry by lay people
    - Automated extraction
  - E.g. entering an “accord” without mentioning “Honda”
- Imprecision in queries
  - Queries posed by lay users who combine querying and browsing
- General Solution: “Expected Relevance Ranking” (Relevance Function, Density Function)
- Challenge: Automated & non-intrusive assessment of Relevance and Density functions

Handling Imprecision & Incompleteness