Using Trees to Depict a Forest Bin Liu, H. V. Jagadish EECS, University of Michigan, Ann Arbor 1.


Motivation – Too Many Results In interactive database querying, we often get more results than we can comprehend immediately. Try searching for a popular keyword: when do you actually click past 2-3 pages of results? – 85% of users never go to the second page [1,2] 2

Why IR Solutions Do NOT Apply Sorting and ranking are standard IR techniques – Search engines show most relevant hits in the first page However, for a database query, all tuples in the query result set are equally relevant – For example, Select * from Cars where price < 13,000 – All matching results should be available to user – What to do when there are millions of results? 3

Make the First Page Count If no user preference information is available, how do we best arrange the results? – Sort by attribute? – Random selection? – Others? Show the most “representative” results – Best help users learn what is in the result set – User can decide further actions based on representatives 4

Our Proposal – MusiqLens Experience 5

Suppose a user wants a 2005 Civic but there are too many of them… 6

MusiqLens on the Car Data 7

ID   MODEL  PRICE    YEAR  MILEAGE  CONDITION
872  Civic  $12,000  2005  50,000   Good       122 more like this
901  Civic  $16,000  2005  40,000   Excellent  345 more like this
725  Civic  $18,500  2005  30,000   Excellent  86 more like this
423  Civic  $17,000  2005  42,000   Good       201 more like this
132  Civic  $9,500   2005  86,000   Fair       185 more like this
322  Civic  $14,000  2005  73,000   Good       55 more like this


Zooming in: 2005 Honda Civics ~ ID 132 9

ID   MODEL  PRICE    YEAR  MILEAGE  CONDITION
342  Civic  $9,800   2005  72,000   Good       25 more like this
768  Civic  $10,000  2005  60,000   Good       10 more like this
132  Civic  $9,500   2005  86,000   Fair       63 more like this
122  Civic  $9,500   2005  76,000   Good       5 more like this
123  Civic  $9,100   2005  81,000   Fair       40 more like this
898  Civic  $9,000   2005  69,000   Fair       42 more like this

Now Suppose User Filters by “Price < 9,500” 10

ID   MODEL  PRICE    YEAR  MILEAGE  CONDITION
342  Civic  $9,800   2005  72,000   Good       25 more like this
768  Civic  $10,000  2005  60,000   Good       10 more like this
132  Civic  $9,500   2005  86,000   Fair       63 more like this
122  Civic  $9,500   2005  76,000   Good       5 more like this
123  Civic  $9,100   2005  81,000   Fair       40 more like this
898  Civic  $9,000   2005  69,000   Fair       42 more like this

After Filtering by “Price < 9,500” 11

ID   MODEL  PRICE    YEAR  MILEAGE  CONDITION
123  Civic  $9,100   2005  81,000   Fair       40 more like this
898  Civic  $9,000   2005  69,000   Fair       42 more like this
133  Civic  $9,300   2005  87,000   Fair       33 more like this
126  Civic  $9,200   2005  89,000   Good       3 more like this
129  Civic  $8,900   2005  81,000   Fair       20 more like this
999  Civic  $9,000   2005  87,000   Fair       12 more like this

Challenges Metric challenge – What is the best set of representatives? Representative finding challenge – How to find them efficiently? Query challenge – How to efficiently adapt to user’s query operations? 12

Finding a Suitable Metric Users should be the ultimate judge – Which metric generates the representatives that I can learn the most from? User study – Use a set of candidate metrics – Users observe the representatives – Users estimate more data points in the data – The representatives that lead to the best estimates win 13

Metric Candidates Sort by attributes Uniform random sampling Density-biased sampling [3] Sort by typicality [4] K-medoids – Average – Maximum 14

Density-biased Sampling Proposed by C. R. Palmer and C. Faloutsos [3] Sample more from sparse regions, less from dense regions To counter the weakness of uniform sampling where small clusters are missed 15

Sort by Typicality 16 Proposed by Ming Hua, Jian Pei, et al. [4] Figure source: slides from Ming Hua

Metric Candidates - K-medoids A medoid of a cluster is the object whose average or maximum dissimilarity to the other objects in the cluster is smallest – Average medoid and max medoid K-medoids are k objects, one from each cluster, where each object is the medoid of its cluster Why not K-means – K-means cluster centers are generally not actual objects in the database – We must present real objects to users 17
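The two medoid definitions on this slide can be illustrated with a minimal sketch. It assumes Euclidean distance as the dissimilarity measure and uses a brute-force search; the function name `medoid` and the sample cluster are hypothetical and not taken from the talk.

```python
import math

def medoid(points, kind="avg"):
    """Return the point whose average (kind="avg") or maximum (kind="max")
    Euclidean distance to the other points in the cluster is smallest."""
    if len(points) == 1:
        return points[0]

    def cost(i):
        # Distances from candidate i to every other point in the cluster.
        dists = [math.dist(points[i], q) for j, q in enumerate(points) if j != i]
        return max(dists) if kind == "max" else sum(dists) / len(dists)

    best = min(range(len(points)), key=cost)
    return points[best]

# A small illustrative cluster with one outlier at x = 10.
cluster = [(0, 0), (1, 0), (2, 0), (3, 0), (10, 0)]
avg_medoid = medoid(cluster, "avg")
max_medoid = medoid(cluster, "max")
print(avg_medoid, max_medoid)
```

On this cluster the two definitions pick different objects, because the outlier pulls the max medoid toward it more strongly than the average medoid. Either way the result is a real object from the data, which is the property the slide contrasts with k-means centers.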

Plotting the Candidates 18 Data: Yahoo! Autos, 3922 data points. Normalized price and mileage to 0-1.

19 Plotting the Candidates - Typicality

20 Plotting the Candidates – k-medoids

User Study Procedure Users are given – 7 sets of data, generated using the 7 candidate methods – Each set consists of 8 representative points Users predict 4 more data points – That are most likely in the data set – Should not pick those already given Measure the prediction error 21

Prediction Quality Measurement 22 [Figure: data point $S_o$ with points $P_1$, $P_2$ and distances $D_1$, $D_2$] For data point $S_o$: MinDist: $D_1$, MaxDist: $D_2$, AvgDist: $(D_1 + D_2)/2$
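As a rough illustration of the three error measures, the sketch below computes them for one data point against a list of other points. It assumes that each D is the Euclidean distance from the data point to one of those points and that AvgDist generalizes to the mean of all distances; both are assumptions, since the figure that defines D1 and D2 is not available here. The function name `prediction_errors` is hypothetical.

```python
import math

def prediction_errors(data_point, others):
    """Return MinDist, MaxDist and AvgDist from data_point to the given points.

    Assumes Euclidean distance; with exactly two points this reduces to
    D1, D2 and (D1 + D2) / 2 as on the slide.
    """
    dists = [math.dist(data_point, p) for p in others]
    return {
        "MinDist": min(dists),
        "MaxDist": max(dists),
        "AvgDist": sum(dists) / len(dists),
    }

errors = prediction_errors((0, 0), [(3, 4), (6, 8)])
print(errors)
```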

Performance – AvgDist and MaxDist 23 For AvgDist: Avg-Medoid is the winner. For MaxDist: Max-Medoid is the winner.

Performance – MinDist 24 Avg-Medoid seems to be the winner

Verdict Although the result is not statistically significant for MinDist, overall Avg-Medoid is better than density-biased sampling. Based on AvgDist and MinDist: Avg-Medoid. Based on MaxDist: Max-Medoid. In this paper, we choose average k-medoids – Our algorithm can extend to max-medoids with small changes 25 Statistical Significance of Result

Challenges Metric challenge – What is the best set of representatives? Representative finding challenge – How to find them efficiently? Query challenge – How to efficiently adapt to user’s query operations? 26

Cover Tree Based Algorithm Cover Tree was proposed by Beygelzimer, Kakade, and Langford in 2006 [5] Briefly discuss Cover Tree properties Cover Tree based algorithms for computing k-medoids 27

Cover Tree Properties (1) 28 Figure modified from slides of Cover Tree authors. [Figure: levels $C_i$ and $C_{i+1}$ over points in the data (one dimension)] Nesting: for all $i$, $C_i \subset C_{i+1}$. Assume all pairwise distances are <= 1.

Cover Tree Properties (2) 29 [Figure: levels $C_i$ and $C_{i+1}$] Covering: each node in $C_i$ is within distance $1/2^i$ of its children in $C_{i+1}$. The distance from a node to any descendant is less than $1/2^{i-1}$. This value is called the “span” of the node.
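The span bound can be checked with a short derivation. It assumes the covering radius at level $i$ is $1/2^i$, the convention used for cover trees when all pairwise distances are normalized to at most 1; that radius is stated here as an assumption. A descendant $q$ of a node $p$ in $C_i$ is reached through one parent-child step per level, so by the triangle inequality:

\[
d(p, q) \;\le\; \sum_{j=i}^{\infty} \frac{1}{2^{j}} \;=\; \frac{1}{2^{i}} \cdot \frac{1}{1 - \tfrac{1}{2}} \;=\; \frac{1}{2^{i-1}}
\]

Under this assumption a node at level $i = 0$ has span 2, and each level below halves it (1, 1/2, 1/4, and so on).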

Cover Tree Properties (3) 30 Figure modified from slides of Cover Tree authors. [Figure: levels $C_i$ and $C_{i+1}$ over points in the data] Separation: nodes in $C_i$ are separated by at least $1/2^i$.
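The three properties (nesting, covering, separation) can be checked on a small example. The sketch below is a brute-force validator, not a cover tree implementation. It assumes the radius at level i is 1/2^i with level 0 at the top, and that each level is given as a Python set of points; the function name and the sample levels are hypothetical.

```python
def satisfies_cover_tree_invariants(levels, dist):
    """Check nesting, covering and separation for levels[0], levels[1], ...

    levels[i] is the set of points in C_i; the radius at level i is
    assumed to be 1 / 2**i (pairwise distances normalized to <= 1).
    """
    for i, level in enumerate(levels):
        radius = 1 / 2 ** i
        pts = list(level)
        # Separation: distinct points in C_i are at least `radius` apart.
        for a in range(len(pts)):
            for b in range(a + 1, len(pts)):
                if dist(pts[a], pts[b]) < radius:
                    return False
        if i + 1 < len(levels):
            nxt = levels[i + 1]
            # Nesting: C_i is a subset of C_{i+1}.
            if not level <= nxt:
                return False
            # Covering: every point in C_{i+1} has a parent in C_i within `radius`.
            for p in nxt:
                if not any(dist(p, q) <= radius for q in level):
                    return False
    return True

def abs_dist(a, b):
    return abs(a - b)

good = [{0.0}, {0.0, 1.0}, {0.0, 0.5, 1.0}]
bad = [{0.0}, {0.0, 0.3, 0.4}]  # 0.3 and 0.4 are closer than 1/2 at level 1
print(satisfies_cover_tree_invariants(good, abs_dist))
print(satisfies_cover_tree_invariants(bad, abs_dist))
```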

Additional Stats for Cover Tree (2D Example) 31 Density (DS): number of points in the subtree (figure examples: DS = 10, DS = 3) Centroid (CT): geometric center of points in the subtree
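One way to maintain these two statistics is a bottom-up pass over the tree. The sketch below is a simplification: it assumes leaves are the data points and internal nodes only aggregate their children, which is not how the authors' cover tree index is necessarily laid out. `TreeNode` and `annotate` are hypothetical names.

```python
from dataclasses import dataclass, field

@dataclass
class TreeNode:
    point: tuple = None                 # set on leaves only in this sketch
    children: list = field(default_factory=list)
    density: int = 0                    # DS: number of points in the subtree
    centroid: tuple = None              # CT: geometric center of those points

def annotate(node):
    """Fill in density and centroid for every node, bottom-up."""
    if not node.children:
        node.density = 1
        node.centroid = tuple(float(x) for x in node.point)
        return node
    for child in node.children:
        annotate(child)
    node.density = sum(c.density for c in node.children)
    dims = len(node.children[0].centroid)
    # The subtree centroid is the density-weighted mean of the child centroids.
    node.centroid = tuple(
        sum(c.centroid[d] * c.density for c in node.children) / node.density
        for d in range(dims)
    )
    return node

inner = TreeNode(children=[TreeNode(point=(2, 0)), TreeNode(point=(4, 0))])
root = annotate(TreeNode(children=[TreeNode(point=(0, 0)), inner]))
print(root.density, root.centroid)
```

Because each centroid is a density-weighted mean of its child centroids, a parent's statistics can be computed without revisiting the individual points below it.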

k-medoid Algorithm Outline We descend the cover tree to a level with more than k nodes Choose an initial k points as the first set of medoids (seeds) – Bad seeds can lead to local minima with a high distance cost Assign nodes to the medoids and update repeatedly until the medoids converge 32

Cover Tree Based Seeding 33 Descend the cover tree to a level with more than k nodes (denote as level m) Use the parent level (m-1) as starting point for seeds – Each node has a weight, calculated as product of span and density (the contribution of the subtree to the distance cost) – Expand nodes using a priority queue – Fetch the first k nodes from the queue as seeds

A Simple Example: k = 4 34 [Figure: cover tree with level spans 2, 1, 1/2, 1/4 and the final set of seeds marked] Priority queue on node weight (density * span): first $S_3$ (5), $S_8$ (3), $S_5$ (2); then $S_8$ (3/2), $S_5$ (1), $S_3$ (1), $S_7$ (1), $S_2$ (1/2)
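The seeding step can be sketched with a max-priority queue keyed on weight = span * density. This is one plausible reading of the slides, not the authors' exact procedure: it expands the heaviest node one at a time until at least k candidates are available, whereas the second queue state in the example lists S8 and S5 with halved weights, which suggests they were also moved down a level. The `Node` class, `pick_seeds`, and the parent-child structure of the sample tree are assumptions made for illustration.

```python
import heapq
from dataclasses import dataclass, field
from itertools import count

@dataclass
class Node:
    name: str
    span: float      # upper bound on the distance to any descendant
    density: int     # number of points in the subtree
    children: list = field(default_factory=list)

    @property
    def weight(self):
        return self.span * self.density

def pick_seeds(start_nodes, k):
    """Expand the heaviest nodes until k candidates exist, then take the top k."""
    tie = count()  # tie-breaker so the heap never compares Node objects
    heap = [(-n.weight, next(tie), n) for n in start_nodes]
    heapq.heapify(heap)
    leaves = []    # nodes that cannot be expanded further
    while heap and len(heap) + len(leaves) < k:
        _, _, node = heapq.heappop(heap)
        if node.children:
            for child in node.children:
                heapq.heappush(heap, (-child.weight, next(tie), child))
        else:
            leaves.append(node)
    candidates = leaves + [n for _, _, n in heap]
    candidates.sort(key=lambda n: n.weight, reverse=True)
    return candidates[:k]

# Level m-1 nodes with the weights from the example; child layout is assumed.
s3 = Node("S3", 1.0, 5, [Node("S3", 0.5, 2), Node("S7", 0.5, 2), Node("S2", 0.5, 1)])
s8 = Node("S8", 1.0, 3, [Node("S8", 0.5, 3)])
s5 = Node("S5", 1.0, 2, [Node("S5", 0.5, 2)])
seeds = pick_seeds([s3, s8, s5], k=4)
print(sorted(n.name for n in seeds))
```

Weighting by span times density favors subtrees that are both wide and populous, which matches the slide's description of weight as the subtree's contribution to the distance cost.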

Update Process 1. Initially, assign all nodes to the closest seed to form k clusters 2. For each cluster, calculate the geometric center – Use centroid and density information to approximate each subtree 3. Find the node that is closest to the geometric center and designate it as the new medoid 4. Repeat from step 1 until the medoids converge 35
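The four steps above can be sketched as a simple loop. This is a minimal illustration under stated assumptions: each tree node is reduced to a (centroid, density) pair, distance is Euclidean, and convergence means the medoid set stops changing. The function name `update_medoids` and the `max_iter` cap are hypothetical.

```python
import math

def update_medoids(nodes, seeds, max_iter=100):
    """nodes: list of (centroid, density) pairs; seeds: indices into nodes.

    Returns the sorted indices of the converged medoids.
    """
    medoids = list(seeds)
    for _ in range(max_iter):
        # Step 1: assign every node to its closest current medoid.
        clusters = {m: [] for m in medoids}
        for i, (c, _) in enumerate(nodes):
            nearest = min(medoids, key=lambda m: math.dist(c, nodes[m][0]))
            clusters[nearest].append(i)
        new_medoids = []
        for m, members in clusters.items():
            if not members:          # degenerate case: keep the old medoid
                new_medoids.append(m)
                continue
            # Step 2: density-weighted geometric center of the cluster.
            total = sum(nodes[i][1] for i in members)
            dims = len(nodes[members[0]][0])
            center = tuple(
                sum(nodes[i][0][d] * nodes[i][1] for i in members) / total
                for d in range(dims)
            )
            # Step 3: the member closest to the center becomes the new medoid.
            new_medoids.append(min(members, key=lambda i: math.dist(nodes[i][0], center)))
        # Step 4: stop once the medoid set no longer changes.
        if sorted(new_medoids) == sorted(medoids):
            break
        medoids = new_medoids
    return sorted(medoids)

nodes = [((0, 0), 1), ((1, 0), 1), ((2, 0), 1),
         ((10, 10), 1), ((11, 10), 1), ((12, 10), 1)]
print(update_medoids(nodes, seeds=[0, 3]))
```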

Challenges Metric challenge – What is the best set of representatives? Representative finding challenge – How to find them efficiently? Query challenge – How to efficiently adapt to user’s query operations? 36

Query Adaptation Handle user actions – Zooming – Selection (filtering) Zooming – Expand all nodes assigned to the medoid – Run k-medoid algorithm on the new set of nodes 37

Selection Effect of selection on a node – Completely invalid – Fully valid – Partially valid Estimate the validity percentage (VG) of each node Multiply the weight of each node by its VG 38
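The weight adjustment under a selection can be illustrated as follows. The slide does not say how VG is estimated, so this sketch assumes it is the fraction of a sample of the node's points that satisfy the predicate; that estimator and both function names are assumptions, not the authors' method.

```python
def validity_fraction(sample, predicate):
    """Estimate VG as the share of sampled values that pass the selection."""
    if not sample:
        return 0.0
    return sum(1 for row in sample if predicate(row)) / len(sample)

def adjusted_weight(span, density, vg):
    """Scale the node weight (span * density) by its validity fraction."""
    return span * density * vg

# Prices sampled from one node, filtered by "Price < 9,500".
prices = [9000, 9100, 9500, 9800]
vg = validity_fraction(prices, lambda price: price < 9500)
print(vg, adjusted_weight(0.5, 40, vg))
```

A completely invalid node gets VG = 0 and drops out of seeding, a fully valid node keeps its weight, and a partially valid node is discounted in proportion.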

System Architecture 39 [Figure: MusiqLens architecture. Components: Client User Interface, DBMS, Cover-tree Indexer, k-Medoid Generator, Zooming Operator, Query Operator. Flows: initial query and query results; zooming operations and query operations, each returning medoids.]

Experiments – Initial Medoid Quality Compare with R-tree based method [6] Data sets – Synthetic dataset: 2D points with Zipf distribution – Real dataset: LA data set from R-tree Portal, 130k points Measurement – Time to compute the medoids – Average distance from a data point to its medoid 40
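The second measurement, average distance from a data point to its medoid, can be sketched as below. It assumes Euclidean distance and that each point is charged to its nearest medoid; the function name is hypothetical.

```python
import math

def average_distance_to_medoid(points, medoids):
    """Mean Euclidean distance from each point to its nearest medoid."""
    return sum(min(math.dist(p, m) for m in medoids) for p in points) / len(points)

points = [(0, 0), (2, 0), (10, 0)]
medoids = [(0, 0), (10, 0)]
print(average_distance_to_medoid(points, medoids))
```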

Results on Synthetic Data 41 For various sizes of data, Cover-tree based method outperforms R-tree based method [Figure panels: Time, Distance]

Result on Real Data 42 For various k values, Cover-tree based method outperforms R-tree based method on real data

Query Adaptation 43 [Figure panels: Synthetic Data, Real Data] Compare with re-building the cover tree and running the k-medoid algorithm from scratch. Time cost of re-building is orders of magnitude higher than incremental computation.

Related Work Classic/textbook k-medoid methods – Partition Around Medoids (PAM) and Clustering LARge Applications (CLARA), L. Kaufman and P. Rousseeuw, 1990 – CLARANS, R. T. Ng and J. Han, TKDE 2002 Tree-based methods – Focusing on Representatives (FOR), M. Ester, H. Kriegel, and X. Xu, KDD 1996 – Tree-based Partition Querying (TPAQ), K. Mouratidis, D. Papadias, and S. Papadimitriou, VLDBJ 2008 44

Related Work (2) Clustering methods – For example, BIRCH, T. Zhang, R. Ramakrishnan, and M. Livny, SIGMOD 1996 Result presentation methods – Automatic result categorization, K. Chakrabarti, S. Chaudhuri, and S.-w. Hwang, SIGMOD 2004 – DataScope, T. Wu, et al., VLDB 2007 Other recent work – Finding representative set from massive data, ICDM 2005 – Generalized group by, C. Li, et al., SIGMOD 2007 – Query result diversification, E. Vee et al., ICDE 2008 45

Conclusion We proposed the MusiqLens framework for solving the many-answer problem We conducted a user study to select a metric for choosing representatives We proposed an efficient method for computing and maintaining the representatives under user actions Part of the database usability project at Univ. of Michigan – Led by Prof. H.V. Jagadish – http://www.eecs.umich.edu/db/usable/ 46

Thank you. 47 Bin Liu, binliu@umich.edu Questions?

References 48
[1] E. Agichtein, E. Brill, S. T. Dumais, and R. Ragno. Learning user interaction models for predicting web search result preferences. In SIGIR, 2006.
[2] B. J. Jansen and A. Spink. How are we searching the world wide web? A comparison of nine search engine transaction logs. Inf. Process. Manage., 42(1), 2006.
[3] C. R. Palmer and C. Faloutsos. Density biased sampling: An improved method for data mining and clustering. In SIGMOD Conference, 2000.
[4] M. Hua, J. Pei, A. W.-C. Fu, X. Lin, and H.-F. Leung. Efficiently answering top-k typicality queries on large databases. In VLDB, pages 890-901, 2007.
[5] A. Beygelzimer, S. Kakade, and J. Langford. Cover trees for nearest neighbor. In ICML, 2006.
[6] K. Mouratidis, D. Papadias, and S. Papadimitriou. Tree-based partition querying: a methodology for computing medoids in large spatial datasets. VLDB J., 17(4):923-945, 2008.
