Presentation on theme: "MindReader: Querying databases through multiple examples Yoshiharu Ishikawa (Nara Institute of Science and Technology, Japan) Ravishankar Subramanya (Pittsburgh Supercomputing Center) Christos Faloutsos (Carnegie Mellon University)"— Presentation transcript:


2 MindReader: Querying databases through multiple examples Yoshiharu Ishikawa (Nara Institute of Science and Technology, Japan) Ravishankar Subramanya (Pittsburgh Supercomputing Center) Christos Faloutsos (Carnegie Mellon University)

3 Outline  Background & Introduction  Query by Example  Our Approach  Relevance Feedback  What’s New in MindReader?  Proposed Method  Problem Formulation  Theorems  Experimental Results  Discussion & Conclusion

4–10 Query-by-Example: an example Searching for "mildly overweight" patients: the doctor selects examples by browsing the patient database and marks each one "very good" or "good" on a Height–Weight plot. The examples have an "oblique" correlation, so we can "guess" the implied query point q.

11–12 Query-by-Example: the question Assume that  the user gives multiple examples  the user optionally assigns scores to the examples  the samples have spatial correlation How can we "guess" the implied query?

13 Outline  Background & Introduction  Query by Example  Our Approach  Relevance Feedback  What’s New in MindReader?  Proposed Method  Problem Formulation  Theorems  Experimental Results  Discussion & Conclusion

14 Our Approach  Automatically derive distance measure from the given examples  Two important notions: 1. diagonal query: isosurfaces of queries have ellipsoid shapes 2. multiple-level scores: user can specify “goodness scores” on samples

15 Isosurfaces of Distance Functions (figure: Euclidean, weighted Euclidean, and generalized ellipsoid distance isosurfaces, each centered on the query point q)

16 Distance Function Formulas  Euclidean: D(x, q) = (x − q)^T (x − q)  Weighted Euclidean: D(x, q) = Σ_i m_i (x_i − q_i)²  Generalized ellipsoid distance: D(x, q) = (x − q)^T M (x − q)
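As a quick illustration, the three distance functions can be written as follows (a minimal NumPy sketch, not from the paper; the example vectors are made up):

```python
import numpy as np

def euclidean(x, q):
    """Plain Euclidean (squared) distance: D(x, q) = (x - q)^T (x - q)."""
    d = x - q
    return d @ d

def weighted_euclidean(x, q, m):
    """Weighted Euclidean: D(x, q) = sum_i m_i (x_i - q_i)^2 (axis-aligned ellipses)."""
    d = x - q
    return np.sum(m * d * d)

def ellipsoid(x, q, M):
    """Generalized ellipsoid distance: D(x, q) = (x - q)^T M (x - q).
    M is a symmetric positive-definite matrix; nonzero off-diagonal
    entries tilt the isosurfaces, giving the "oblique" (diagonal) queries."""
    d = x - q
    return d @ M @ d

x = np.array([2.0, 1.0])
q = np.array([0.0, 0.0])
print(euclidean(x, q))                                       # 5.0
print(weighted_euclidean(x, q, np.array([1.0, 4.0])))        # 8.0
print(ellipsoid(x, q, np.array([[1.0, 0.5], [0.5, 1.0]])))   # 7.0
```

Note that the Euclidean case is the ellipsoid case with M = I, and the weighted Euclidean case is the ellipsoid case with M = diag(m).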

17 Outline  Background & Introduction  Query by Example  Our Approach  Relevance Feedback  What’s New in MindReader?  Proposed Method  Problem Formulation  Theorems  Experimental Results  Discussion & Conclusion

18 Relevance Feedback  Popular method in IR  Query is modified based on relevance judgment from the user  Two major approaches 1. query-point movement 2. re-weighting

19–22 Relevance Feedback — Query-point Movement —  The query point is moved towards "good" examples (Rocchio's formula in IR): starting from the query point Q0, data are retrieved, the user supplies relevance judgments on them, and a new query point Q1 is computed.

23–31 Relevance Feedback — Re-weighting —  Standard Deviation Method in the MARS (UIUC) image retrieval system  Assumption: if the deviation of a feature is high, the feature is not important  For each feature i, the weight w_i = 1/σ_i is assigned, so a low-deviation ("good") feature such as f2 dominates the implied query while a high-deviation ("bad") feature such as f1 is discounted  MARS didn't provide any justification for this formula
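A minimal sketch of this re-weighting rule (the sample matrix below is made up; rows are assumed to be the "good" examples):

```python
import numpy as np

# Standard-deviation re-weighting: w_i = 1 / sigma_i per feature.
X = np.array([[1.0, 0.10],
              [2.0, 0.20],
              [3.0, 0.15],
              [4.0, 0.12]])   # feature f1 varies a lot, f2 barely

sigma = X.std(axis=0)         # per-feature standard deviation of the good samples
w = 1.0 / sigma               # low deviation -> high weight
print(w / w.sum())            # f2 dominates the weighted Euclidean distance
```

Because the weights live on the diagonal of the distance matrix, this method can only produce axis-aligned ellipses, never tilted ones.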

32 Outline  Background & Introduction  Query by Example  Our Approach  Relevance Feedback  What’s New in MindReader?  Proposed Method  Problem Formulation  Theorems  Experimental Results  Discussion & Conclusion

33 What’s New in MindReader? MindReader  does not use ad hoc heuristics (cf. Rocchio’s formula, re-weighting in MARS)  can handle multiple levels of scores  can derive generalized ellipsoid distances

34 What’s New in MindReader? MindReader can derive generalized ellipsoid distances (figure: tilted ellipsoid isosurface centered on the query point q)

35–38 Isosurfaces of Distance Functions (figure: Euclidean (Rocchio), weighted Euclidean (MARS), and generalized ellipsoid distance (MindReader) isosurfaces, each centered on the query point q)

39 Outline  Background & Introduction  Query by Example  Our Approach  Relevance Feedback  What’s New in MindReader?  Proposed Method  Problem Formulation  Theorems  Experimental Results  Discussion & Conclusion

40 Method: distance function Generalized ellipsoid distance function  D(x, q) = (x − q)^T M (x − q), or equivalently  D(x, q) = Σ_j Σ_k m_jk (x_j − q_j)(x_k − q_k)  q: query point vector  x: data point vector  M = [m_jk]: symmetric distance matrix

41 Method: definitions  N: no. of samples  n: no. of dimensions (features)  x_i: n-d sample data vectors, x_i = [x_i1, …, x_in]^T  X: N×n sample data matrix, X = [x_1, …, x_N]^T  v: N-d score vector, v = [v_1, …, v_N]

42 Method: problem formulation Given  N sample n-d vectors  multiple-level scores (optional) Estimate  the optimal distance matrix M  the optimal new query point q

43 Method: optimality  How do we measure “optimality”?  by minimizing a “penalty”  What is the “penalty”?  the score-weighted sum of distances between the query point and the sample vectors  Therefore: minimize Σ_i v_i (x_i − q)^T M (x_i − q) under the constraint det(M) = 1

44 Outline  Background & Introduction  Query by Example  Our Approach  Relevance Feedback  What’s New in MindReader?  Proposed Method  Problem Formulation  Theorems  Experimental Results  Discussion & Conclusion

45 Theorems: theorem 1  Solved with Lagrange multipliers  Theorem 1: optimal query point q = x̄ = [x̄_1, …, x̄_n]^T = X^T v / Σ_i v_i  the optimal query point is the weighted average of the sample data vectors

46 Theorems: theorems 2 & 3  Theorem 2: optimal distance matrix M = (det(C))^{1/n} C^{−1}, where C = [c_jk] is the weighted covariance matrix with c_jk = Σ_i v_i (x_ij − x̄_j)(x_ik − x̄_k)  Theorem 3: if we restrict M to a diagonal matrix, our method reduces to the standard deviation method  MindReader includes MARS!
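Theorems 1 and 2 can be sketched in a few lines of NumPy (the sample matrix and scores below are made up for illustration):

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [2.0, 3.1],
              [3.0, 3.9]])        # N = 3 samples, n = 2 features
v = np.array([1.0, 2.0, 1.0])     # goodness scores

# Theorem 1: optimal query point = score-weighted average of the samples
q = X.T @ v / v.sum()

# Theorem 2: weighted covariance around the weighted mean ...
d = X - q                          # deviations x_i - x_bar
C = (v[:, None] * d).T @ d         # c_jk = sum_i v_i (x_ij - x_bar_j)(x_ik - x_bar_k)

# ... scaled so that det(M) = 1 (here n = 2, so the exponent 1/n is 0.5)
M = np.linalg.det(C) ** 0.5 * np.linalg.inv(C)
print(round(np.linalg.det(M), 6))  # 1.0 by construction
```

The det(M) = 1 constraint fixes the "volume" of the isosurface, so the minimization determines its shape and orientation rather than collapsing it to zero.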

47 Outline  Background & Introduction  Query by Example  Our Approach  Relevance Feedback  What’s New in MindReader?  Proposed Method  Problem Formulation  Theorems  Experimental Results  Discussion & Conclusion

48 Experiments 1. Estimation of the optimal distance function  Can MindReader estimate the target distance matrix M_hidden appropriately?  Based on synthetic data  Comparison with the standard deviation method 2. Query-point movement 3. Application to real data sets  GIS data

49 Experiment 1: target data Two-dimensional normal distribution

50 Experiment 1: idea  Assume that the user has a “hidden” distance M_hidden in his mind  Simulate iterative query refinement  Q: How fast can we discover the “hidden” distance?  The query point is fixed to (0, 0)

51 Experiment 1: iteration steps 1. Make initial samples: compute k-NNs with the Euclidean distance 2. For each object x, calculate its score so that it reflects the hidden distance M_hidden 3. MindReader estimates the matrix M 4. Retrieve k-NNs with the derived matrix M 5. If the result is improved, go to step 2
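These steps can be sketched as follows (synthetic data and brute-force k-NN; the shift that makes the weights positive is my own addition, since the weighted covariance needs positive weights, and is not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(500, 2))
q = np.zeros(2)                                    # query point fixed at (0, 0)
M_hidden = np.array([[2.0, 1.0], [1.0, 2.0]])      # the user's "hidden" distance
k = 20

def knn(M):
    d = data - q
    dist = np.einsum('ij,jk,ik->i', d, M, d)       # (x - q)^T M (x - q) per point
    return np.argsort(dist)[:k]

M = np.eye(2)                                      # step 1: start with Euclidean
for _ in range(5):
    idx = knn(M)                                   # steps 1/4: retrieve k-NNs under M
    d = data[idx] - q
    dist = np.einsum('ij,jk,ik->i', d, M_hidden, d)
    s = np.exp(-dist**2 / 2)                       # step 2: scores from M_hidden
    v = np.log(s / (1 - s) + 1e-300)               # (floor avoids log(0))
    v = v - v.min() + 1e-3                         # shift scores to positive weights
    C = (v[:, None] * d).T @ d                     # step 3: weighted covariance ...
    M = np.linalg.det(C) ** 0.5 * np.linalg.inv(C) # ... and Theorem-2 estimate of M
```

Each pass re-estimates M from the currently retrieved neighbors, so the isosurfaces tilt toward the hidden ellipse over the iterations.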

52 Experiment 1: scores  Calculation of scores in terms of the “hidden” distance function: 1. calculate the distance from the query point q based on the hidden distance matrix M_hidden: d = D(x, q) 2. translate the distance value d to a score s = exp(−d²/2) (0 < s ≤ 1) and then to v = log(s / (1 − s)) (−∞ < v < ∞)

53 Experiment 1: evaluation measures  Used to check whether the query result has improved or not  CD-k measure  CD stands for “cumulative distance”  for the k-NNs retrieved with matrix M, compute each one’s actual distance with matrix M_hidden, then sum them
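The CD-k measure can be sketched like this (synthetic data; names are mine, not from the slides):

```python
import numpy as np

def cd_k(data, q, M, M_hidden, k):
    """Sum, over the k-NNs retrieved with the estimated matrix M,
    of their true distances under M_hidden (lower is better)."""
    d = data - q
    est = np.einsum('ij,jk,ik->i', d, M, d)        # distances under estimated M
    idx = np.argsort(est)[:k]                      # k-NNs under M
    dk = data[idx] - q
    return np.einsum('ij,jk,ik->i', dk, M_hidden, dk).sum()

rng = np.random.default_rng(1)
data = rng.normal(size=(200, 2))
q = np.zeros(2)
Mh = np.array([[2.0, 1.0], [1.0, 2.0]])
# Retrieving with M_hidden itself is optimal, so it can never score worse
# than retrieving with the plain Euclidean distance:
assert cd_k(data, q, Mh, Mh, 10) <= cd_k(data, q, np.eye(2), Mh, 10)
```

The best possible CD-k (the blue line on the convergence plot) is exactly the value obtained when M = M_hidden.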

54 Experiment 1: final k-NNs  Ellipse: isosurface for M_hidden  Red points: final k-NNs obtained by the standard deviation method  Green points: final k-NNs obtained by MindReader

55 Experiment 1: speed of convergence  x-axis: no. of iterations  y-axis: CD-k measure value  Red: standard deviation method  Green: MindReader  Blue: best CD-k value possible for the data set

56 Experiment 1: changes of isosurfaces After 0th and 2nd iterations

57 Experiment 1: changes of isosurfaces After 4th and 8th iterations

58 Experiment 2: query-point movement  Starts from the query point (0.5, 0.5)  MindReader converges to M_hidden within five iterations

59 Experiment 3: real data set  End-points of road segments from Montgomery County, MD  Data is normalized to [−1, 1] × [−1, 1]  The query specifies five points along route I-270  Can we estimate a good distance function?

60 Experiment 3: isosurfaces After 0th and 2nd iterations: fast convergence!

61 Discussion: efficiency  Don’t worry about speed!  ellipsoid queries are supported by spatial access methods:  Seidl & Kriegel [VLDB97]  Ankerst, Braunmüller, Kriegel, Seidl [VLDB98]  so for the derived distance we can efficiently use a spatial index

62 Conclusion MindReader automatically guesses “diagonal” queries from the given examples  supports multiple levels of scores  includes “Rocchio” and “MARS” (the standard deviation method) as special cases  problem formulation & closed-form solution  evaluation based on experiments

