Machine Learning for Online Query Relaxation
Ion Muslea, SRI International, 333 Ravenswood, Menlo Park, CA 94025. Published in SIGKDD.
Ion Muslea is Director of Research & Product Development, Scalability and Big Data, at SDL.
What is the paper about? The failing-query problem: given a query that returns an empty answer, how can one relax the query's constraints so that it returns a non-empty set of tuples?
Motivation
Why does a query fail? Too many constraints, a database that does not have enough matching tuples, and users who often want everything to be satisfied.
Intuition: relax the failing query by relaxing its constraints. Discover implicit relationships among the various domain attributes, and use this knowledge to relax the constraints.
Intuition explained: the example query fails because of implicit relationships in the data, such as:
- laptops that have large screens (i.e., Display ≥ 1700) weigh more than three pounds;
- fast laptops with large hard disks (CPU ≥ 2.5 GHz and HDD ≥ 60 GB) cost more than $2,000.
Intuition formalized:
Step 1: Extracting domain knowledge (in the form of rules)
Step 2: Finding the "most useful" rule
Step 3: Relaxing the failing query
Intuition explained, Step 1: Extracting domain knowledge. Take a randomly chosen subset D' of the database; for each constraint in Q0 (e.g., CPU ≥ 2.5 GHz), use D' to find patterns that predict whether this constraint is satisfied.
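The paper learns these per-constraint rules with C4.5. As a rough stand-in (not the authors' code), the sketch below uses scikit-learn's decision tree on a hypothetical sample D' of the laptop data to predict whether the constraint CPU ≥ 2.5 GHz is satisfied from the other attributes.

```python
# Sketch of Step 1 (illustrative only): learn patterns that predict whether one
# constraint of the failing query, e.g. CPU >= 2.5, is satisfied.
# The paper uses C4.5; scikit-learn's CART tree is used here as a stand-in.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical random sample D' of the laptop database.
d_prime = pd.DataFrame({
    "Price":   [999, 2400, 1800, 3100, 1200],
    "CPU":     [1.8, 2.8, 2.4, 3.0, 2.0],
    "HDD":     [40, 80, 60, 100, 40],
    "Display": [1400, 1700, 1500, 1700, 1400],
    "Weight":  [4.5, 7.2, 5.9, 7.8, 4.9],
})

constraint = d_prime["CPU"] >= 2.5           # label: does the tuple satisfy the constraint?
features = d_prime.drop(columns=["CPU"])     # predict it from the *other* attributes

tree = DecisionTreeClassifier(max_depth=2).fit(features, constraint)
print(export_text(tree, feature_names=list(features.columns)))
# The tree's branches play the role of the learned rules,
# e.g. "Price > 2000 -> CPU >= 2.5 is satisfied".
```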
Intuition explained, Step 2: Finding the "most useful" rule. The learned rules are converted into existential statements. Among these, find the statement Q1 that is the most similar to the failing query Q0. How? With nearest-neighbor techniques.
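A minimal sketch of the nearest-neighbor idea, assuming a simple representation of queries as attribute/threshold pairs and a normalized distance; the attribute ranges and candidate statements below are hypothetical, not taken from the paper.

```python
# Sketch of Step 2 (illustrative only): among the candidate statements derived
# from the learned rules, pick the one "nearest" to the failing query Q0.
# Queries are represented here simply as {attribute: numeric threshold};
# the attribute ranges used for normalization are hypothetical.
ranges = {"Price": (500, 4000), "CPU": (1.0, 3.4), "HDD": (20, 120),
          "Display": (1200, 1700), "Weight": (2.0, 9.0)}

def distance(q0, q1):
    """Normalized L1 distance over the attributes the two queries share."""
    shared = set(q0) & set(q1)
    if not shared:
        return float("inf")
    total = 0.0
    for attr in shared:
        lo, hi = ranges[attr]
        total += abs(q0[attr] - q1[attr]) / (hi - lo)
    return total / len(shared)

q0 = {"Price": 1500, "CPU": 2.5, "HDD": 60, "Weight": 3.0}   # failing query Q0
candidates = [                                                # statements from the rules
    {"CPU": 2.5, "Price": 2000, "Display": 1700, "Weight": 5.5},
    {"CPU": 1.6, "Price": 900,  "HDD": 30},
]
q1 = min(candidates, key=lambda c: distance(q0, c))
print("most useful statement:", q1)
```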
Intuition explained, Step 3: Relaxing the failing query. How do we get Qr from Q1 and Q0? By dropping the original constraint on the hard disk, keeping the constraint on CPU unchanged, and setting the values in the constraints on Price, Display, and Weight to the least constraining ones. In the example's final relaxed query, the constraints on Price and HDD end up dropped out.
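The paper defines this step precisely; the sketch below only illustrates the general idea of keeping, per shared attribute, the less constraining of the two bounds, using hypothetical constraint values.

```python
# Sketch of Step 3 (illustrative only, not the paper's exact algorithm):
# build a relaxed query Qr by combining the failing query Q0 with the
# nearest statement Q1, keeping the least constraining bound per attribute.
# Constraints are modeled as {attribute: (operator, value)} with ">=" or "<=".

def relax(q0, q1):
    qr = {}
    for attr, (op, v0) in q0.items():
        if attr not in q1:
            continue                      # attribute not covered by Q1: drop it
        _, v1 = q1[attr]
        # least constraining value: lower bound for ">=", higher bound for "<="
        qr[attr] = (op, min(v0, v1) if op == ">=" else max(v0, v1))
    return qr

q0 = {"Price": ("<=", 1500), "CPU": (">=", 2.5), "HDD": (">=", 60), "Weight": ("<=", 3.0)}
q1 = {"Price": ("<=", 2000), "CPU": (">=", 2.5), "Display": (">=", 1700), "Weight": ("<=", 5.5)}
print(relax(q0, q1))
# e.g. {'Price': ('<=', 2000), 'CPU': ('>=', 2.5), 'Weight': ('<=', 5.5)}
```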
LOQR: Learning for Online Query Relaxation
Step 1: Extracting domain knowledge
Step 2: Finding the "refiner statement"
Step 3: Refining the failing conjunction
Experiments: five algorithms evaluated:
- loqr
- loqr-50 (one variant of loqr)
- loqr-90 (another variant of loqr)
- s-nn (baseline 1)
- r-nn (baseline 2)
s-nn: find the example Ex ∈ D that is the most similar to the failing conjunction Ck, use Ex to create a conjunction Ck', and use Ck' as the relaxed query. It does not learn rules at all.
r-nn: applies s-nn, but uses Ck' to relax Ck and then applies the relaxed conjunction. Like s-nn, it does not learn rules at all.
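A minimal sketch of the s-nn idea, reusing the {attribute: (operator, value)} query convention from the sketches above; the distance function, attribute ranges, and example tuples are illustrative assumptions, not the paper's definitions.

```python
# Sketch of the s-nn baseline (illustrative only): no rules are learned.
# It finds the database example Ex nearest to the failing conjunction Ck and
# turns Ex into a new conjunction Ck', which is used as the relaxed query.
# Attribute ranges, operators, and example values are hypothetical.
ranges = {"Price": (500, 4000), "CPU": (1.0, 3.4), "HDD": (20, 120), "Weight": (2.0, 9.0)}

def dist_to_example(ck, ex):
    """Normalized L1 distance between Ck's thresholds and an example's values."""
    return sum(abs(val - ex[attr]) / (ranges[attr][1] - ranges[attr][0])
               for attr, (_, val) in ck.items())

def s_nn(ck, database):
    ex = min(database, key=lambda e: dist_to_example(ck, e))
    # Ck' keeps Ck's operators but takes its thresholds from the nearest example.
    return {attr: (op, ex[attr]) for attr, (op, _) in ck.items()}

ck = {"Price": ("<=", 1500), "CPU": (">=", 2.5), "HDD": (">=", 60), "Weight": ("<=", 3.0)}
database = [
    {"Price": 1700, "CPU": 2.4, "HDD": 60, "Weight": 3.4},
    {"Price": 2900, "CPU": 3.0, "HDD": 100, "Weight": 7.5},
]
print(s_nn(ck, database))   # Ck' built from the nearest example
```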
loqr-50 vs loqr-90:
loqr-90 generates over-relaxed queries that are highly unlikely to fail but return a (relatively) large number of tuples; it allows 90% of the possible tuples.
loqr-50 creates under-relaxed queries that return fewer tuples but are more likely to fail; it allows 50% of the possible tuples.
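One way to picture the 50% vs 90% trade-off is choosing a constraint's threshold from the distribution of the attribute's values in the data. The percentile-based helper below is purely an assumed illustration of that trade-off, not the paper's mechanism.

```python
# Sketch (illustrative only): one way to realize the loqr-50 / loqr-90 idea of
# allowing roughly 50% or 90% of the possible tuples. For a ">=" constraint,
# pick the threshold so that about p% of the observed values pass it.
# The exact mechanism in the paper may differ.
def threshold_allowing(values, p):
    """Return a '>=' threshold that roughly p% of the values satisfy."""
    ordered = sorted(values, reverse=True)
    idx = max(0, min(len(ordered) - 1, int(len(ordered) * p / 100) - 1))
    return ordered[idx]

cpus = [1.6, 1.8, 2.0, 2.2, 2.4, 2.6, 2.8, 3.0, 3.2, 3.4]
print(threshold_allowing(cpus, 90))   # loose bound: ~90% of tuples pass
print(threshold_allowing(cpus, 50))   # tighter bound: ~50% of tuples pass
```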
Datasets: six different datasets.
Laptops: 1,257 laptop configurations extracted from yahoo.com, with five numeric attributes: price, CPU speed, RAM, HDD space, and weight.
The other five are from the UC Irvine repository: breast cancer Wisconsin (bcw), low resolution spectrometer (lrs), Pima Indians diabetes (pima), water treatment plant (water), and waveform data generator (wave).
The setup: given a failing query Q and a dataset D, each algorithm uses D to generate a relaxed query QR. QR is then evaluated on a test set consisting of all examples in the target database except the ones in D. The size of D is varied over 50, 100, 150, ..., 350 examples, with 100 arbitrary instances of D for each size.
Seven failing queries are used. For each query, each query-relaxation algorithm is run 100 times; the results reported here are the average of these 700 runs.
Performance measures:
Robustness: what percentage of the failing queries are successfully relaxed (i.e., they do not fail anymore)?
Coverage: what percentage of the examples in the test set satisfy the relaxed query?
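A minimal sketch of how these two measures could be computed, reusing the {attribute: (operator, value)} query convention from the sketches above.

```python
# Sketch (illustrative only): computing the two performance measures over a
# test set of examples, for relaxed queries in the same format as above.
def satisfies(example, query):
    return all(example[a] >= v if op == ">=" else example[a] <= v
               for a, (op, v) in query.items())

def robustness(relaxed_queries, test_set):
    """Fraction of relaxed queries that return at least one tuple."""
    ok = sum(any(satisfies(ex, q) for ex in test_set) for q in relaxed_queries)
    return ok / len(relaxed_queries)

def coverage(query, test_set):
    """Fraction of test examples that satisfy a single relaxed query."""
    return sum(satisfies(ex, query) for ex in test_set) / len(test_set)
```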
Results (robustness): loqr obtains by far the best results; s-nn and r-nn display extremely poor robustness; loqr-50 and loqr-90 are better than the baselines.
Results (coverage): are low-coverage results preferred? A low-coverage, non-robust algorithm is of little practical importance. On coverage, loqr is not so spectacular, while loqr-90 is excellent: the authors claim robustness levels between 69% and 98% with coverage under 5%.
Time complexity: loqr is extremely fast. Its running time depends on the size of the dataset D and on the number of attributes in the query: loqr creates a new dataset for each attribute in the query, so the more attributes the query has, the longer it takes to process.
Online vs offline learning: OFF-k, an offline variant, performs the learning step only once, independently of the query's constraints. For discrete attributes, it learns to predict each discrete value from the values of the other attributes; for continuous attributes, it discretizes the attribute's range of values in D.
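As a rough illustration of the preprocessing such an offline variant needs, the sketch below discretizes a continuous attribute into k equal-width buckets; whether OFF-k uses equal-width intervals specifically is an assumption here, not something the slides state.

```python
# Sketch (illustrative only) of discretizing a continuous attribute for an
# offline variant like OFF-k: split the attribute's observed range in D into
# k equal-width intervals and label each value by its bucket index.
def discretize(values, k):
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # map each value to a bucket index in 0..k-1 (the max value goes to k-1)
    return [min(int((v - lo) / width), k - 1) for v in values]

prices = [999, 1200, 1800, 2400, 3100]      # hypothetical Price values in D
print(discretize(prices, 2))                 # [0, 0, 0, 1, 1]
print(discretize(prices, 3))                 # [0, 0, 1, 2, 2]
```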
Two offline versions (k = 2 and k = 3) were evaluated for both loqr and loqr-90; both loqr and loqr-90 clearly outperform their offline variants.
Query-driven learning: four main scenarios:
- no constraints (offline learning)
- class-attribute constraints (LOQR)
- a set of hard constraints (a subset of constraints that must be satisfied)
- all constraints simultaneously (replacing the original values of all the attributes)
Some issues:
- Sampling: is it entirely random? Should separate datasets be created for each attribute?
- C4.5 is a greedy algorithm: is that a problem?
- Only the closest rule is used to relax the query: why not use more?
Conclusion: a novel, data-driven approach to query relaxation. loqr is a fast algorithm that successfully relaxes the vast majority of the failing queries.