Dynamic Faceted Search for Discovery-driven Analysis. Debabrata Dash, Jun Rao, Nimrod Megiddo, Anastasia Ailamaki, Guy Lohman. CIKM’08. Speaker: Li, Huei-Jyun.

Dynamic Faceted Search for Discovery-driven Analysis
Debabrata Dash, Jun Rao, Nimrod Megiddo, Anastasia Ailamaki, Guy Lohman
CIKM’08
Speaker: Li, Huei-Jyun
Advisor: Dr. Koh, Jia-Ling
Date: 2008/12/18

Outline
Introduction
Terminology and Problem Statement
Measure of “Interestingness”
Implementing Dynamic Faceted Search
Evaluation
Conclusion and Future Work

Introduction
Today’s faceted search systems are designed for browsing catalog data and are not directly suitable for discovery-driven exploration.
To preserve browsing consistency, the facets selected for navigation tend to be “static”.
When browsing online catalogs, the navigational facets are single-dimensional only.

Introduction
The paper proposes a dynamic faceted search system for the kind of discovery-driven analysis that is often performed in On-Line Analytical Processing (OLAP) systems.
From a potentially large search result, the system automatically and dynamically discovers a small set of facets and values that are deemed most “interesting” to a user.

Terminology and Problem Statement
Defn 1. A repository D is a collection of documents, each of which is composed of some free text and one or more <facet: value> pairs.
Given a value f in facet F, we call F: f an instance of F.
All unique values associated with a facet F form the domain of F.

Terminology and Problem Statement
Defn 2. The domains of these facets are organized into a facet hierarchy.
Each node in the hierarchy stores a <facet: value> pair.
A node <F_1: f_1> is the parent of another node <F_2: f_2> if, for each document, F_2 = f_2 implies F_1 = f_1.
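
The parent condition in Defn 2 can be checked mechanically. The following is a minimal sketch, assuming each document is a Python dict with one value per facet; the function name `is_parent` and the sample data are hypothetical and not taken from the paper.

```python
def is_parent(docs, parent, child):
    """Return True if, in every document, child facet = value implies parent facet = value."""
    (parent_facet, parent_value), (child_facet, child_value) = parent, child
    return all(
        doc.get(parent_facet) == parent_value
        for doc in docs
        if doc.get(child_facet) == child_value
    )


# Hypothetical sample repository with a country > city hierarchy.
docs = [
    {"country": "USA", "city": "San Jose"},
    {"country": "USA", "city": "Austin"},
    {"country": "France", "city": "Paris"},
]

print(is_parent(docs, ("country", "USA"), ("city", "Austin")))
print(is_parent(docs, ("country", "France"), ("city", "Austin")))
```

Note that the check is vacuously true when no document carries the child value, so a real system would probably also require the child to occur at least once.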

Terminology and Problem Statement
Defn 3. A query q on the repository is assumed to have the form “keywords && F_1 = f_1 && F_2 = f_2 …”.
The result of q, denoted by D_q, is the set of documents that contain the specified keywords and satisfy all constraints on the selected facets.

Terminology and Problem Statement
Defn 4. Given a query q, a facet summary for a facet set F_1, …, F_m is a list of tuples <f_1, …, f_m, A(f_1, …, f_m)> over D_q, where:
f_i is an instance of facet F_i;
A(f_1, …, f_m) is an aggregate over the documents in D_q that contain all these facet instances.
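
As an illustration of Defn 4, here is a minimal sketch of a facet summary with count as the aggregate. It assumes D_q is already available as a list of dicts with one value per facet; the function name `facet_summary` and the sample data are hypothetical.

```python
from collections import Counter


def facet_summary(result_docs, facets):
    """Count the documents in D_q for each combination of values of the given facets."""
    counts = Counter()
    for doc in result_docs:
        # Only documents that carry every facet in the set contribute a tuple.
        if all(facet in doc for facet in facets):
            counts[tuple(doc[facet] for facet in facets)] += 1
    # Each output tuple is (f_1, ..., f_m, count), largest count first.
    return [key + (n,) for key, n in counts.most_common()]


# Hypothetical query result D_q.
d_q = [
    {"venue": "VLDB", "year": 2005},
    {"venue": "VLDB", "year": 2005},
    {"venue": "SIGMOD", "year": 2006},
]

print(facet_summary(d_q, ["venue", "year"]))
```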

Terminology and Problem Statement
Problem Definition: Given a repository of documents with n facets, a query q, and two integers K_1 and K_2, select K_1 facet sets, and for each a facet summary with up to K_2 tuples, that are the most “interesting” to a user.

Measure of “Interestingness”
Interestingness: how surprising an actual aggregated value is, given a certain expectation.

Measure of “Interestingness”
*Setting the Expectation
For a given set of facet values f_1, …, f_m from F_1, …, F_m:
C_D(f_1, …, f_m): the number of documents in D with all those facet values
C_q(f_1, …, f_m): the number of documents in D_q with all those facet values
E[C_q(f_1, …, f_m)]: an “expected” value for C_q(f_1, …, f_m)
Ways of setting the expectation: natural, navigational, ad hoc

Measure of “Interestingness”
*Setting the Expectation
Natural:
For an individual facet instance: (uniformity assumption)
For an instance f_1, …, f_m of a facet set: (independence assumption)
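
The formulas on this slide did not survive in the transcript. The sketch below shows one common reading of the two assumptions, and it is an assumption rather than the paper's definition: under uniformity, a facet value is expected to be as frequent in D_q as in D; under independence, the joint frequency is the product of the individual frequencies. The function names are hypothetical.

```python
def expected_single(count_in_repo, repo_size, result_size):
    """Uniformity (assumed reading): E[C_q(f)] = C_D(f) * |D_q| / |D|."""
    return count_in_repo * result_size / repo_size


def expected_joint(counts_in_repo, repo_size, result_size):
    """Independence (assumed reading): E[C_q(f_1..f_m)] = |D_q| * prod(C_D(f_i) / |D|)."""
    expected = float(result_size)
    for count in counts_in_repo:
        expected *= count / repo_size
    return expected


# A value found in 100 of 1,000 documents, with a query result of 50 documents.
print(expected_single(100, 1000, 50))
# Two values found in 100 and 200 of 1,000 documents, same query result.
print(expected_joint([100, 200], 1000, 50))
```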

Measure of “Interestingness”
*Setting the Expectation
Navigational:
Ad hoc: the user can tell the system to set the expectation based on an arbitrary query q of the user’s choice.
The count for each facet value is set proportionally, based on the distribution of the result of q.

Measure of “Interestingness”
*Measuring Degree of Interestingness
Single facet instance: evaluated with respect to a scenario in which its associated count is generated by random sampling.
The smaller the probability of observing the count under random sampling, the more interesting the facet instance.

Measure of “Interestingness”
*Measuring Degree of Interestingness
p-value: Suppose that a certain facet value occurs in r out of R documents in the repository and in q out of Q documents in the output of a certain query.
Also suppose …
The interestingness of that facet value vis-à-vis the query is the probability that a random sample of size Q contains at least q documents with that facet value.
This probability follows a hypergeometric distribution, which can be approximated by a normal or Poisson distribution.
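
The tail probability described above can be computed exactly from the hypergeometric distribution. This is a minimal sketch using only the standard library; the function name `p_value` is hypothetical, and a production system would more likely use the normal or Poisson approximation for large R.

```python
from math import comb


def p_value(r, R, q, Q):
    """P[X >= q], where X counts documents with the facet value in a random sample
    of Q documents drawn without replacement from R documents, r of which have it."""
    total = comb(R, Q)
    upper = min(r, Q)
    tail = sum(comb(r, k) * comb(R - r, Q - k) for k in range(q, upper + 1))
    return tail / total


# A value that occurs in 100 of 10,000 documents and in 10 of 50 query results:
# the smaller the printed probability, the more interesting the value.
print(p_value(r=100, R=10_000, q=10, Q=50))
```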

Measure of “Interestingness”
*Measuring Degree of Interestingness
The whole facet: for each facet F, we consider the p-values of only the k most interesting values in F, replace …
The final measure: …
MaxWeight: assign 1 to w_1 and 0 to the rest
AvgWeight: assign each w_i an equal weight
HybridWeight: average the interestingness computed by MaxWeight and AvgWeight
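
The three weighting schemes can be sketched as a weighted sum over the k top-ranked values. The sketch assumes each value already has a numeric score where larger means more interesting; how the paper derives that score from the p-value is not preserved in this transcript, so the scores and function names below are hypothetical.

```python
def weighted_score(scores, weights):
    """Weighted sum of scores, ranked from most to least interesting."""
    ranked = sorted(scores, reverse=True)
    return sum(w * s for w, s in zip(weights, ranked))


def max_weight(scores):
    # w_1 = 1 and every other weight is 0.
    k = len(scores)
    return weighted_score(scores, [1.0] + [0.0] * (k - 1))


def avg_weight(scores):
    # Every w_i gets the same weight 1/k.
    k = len(scores)
    return weighted_score(scores, [1.0 / k] * k)


def hybrid_weight(scores):
    # Average of the MaxWeight and AvgWeight results.
    return (max_weight(scores) + avg_weight(scores)) / 2


scores = [4.0, 2.0, 0.0]  # hypothetical scores of the k = 3 top values of a facet
print(max_weight(scores), avg_weight(scores), hybrid_weight(scores))
```

MaxWeight favors a facet with one very surprising value, AvgWeight favors a facet whose top values are all moderately surprising, and HybridWeight sits between the two.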

Implementing Dynamic Faceted Search
Solr indexes facets without storing them.
It enumerates every facet instance from the index and intersects its posting list with D_q.
From the intersected set, it derives the count for facet value f.
It caches each posting list as a bitset:
  if the bitset is dense: a bitmap
  otherwise: a hash map of document IDs
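
The counting strategy described above can be sketched as follows. This is an illustration of the idea only, not Solr's code: it uses Python integers as bitsets for every posting list instead of switching between a bitmap and a hash-based representation, and all names and data are hypothetical.

```python
def to_bitset(doc_ids):
    """Pack a collection of document IDs into an integer used as a bitset."""
    bits = 0
    for doc_id in doc_ids:
        bits |= 1 << doc_id
    return bits


def facet_counts(posting_bitsets, result_bits):
    """Intersect every facet value's bitset with D_q and count the surviving documents."""
    counts = {}
    for value, bits in posting_bitsets.items():
        n = bin(bits & result_bits).count("1")
        if n:
            counts[value] = n
    return counts


# Hypothetical posting lists for a "venue" facet and a query result D_q.
postings = {
    "VLDB": to_bitset([0, 1, 4]),
    "SIGMOD": to_bitset([2, 3]),
    "TODS": to_bitset([5]),
}
d_q = to_bitset([1, 2, 4])

print(facet_counts(postings, d_q))
```

Every facet value costs one intersection here, whether or not it occurs in D_q, which is the cost the bitset tree later in the talk is meant to reduce.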

Implementing Dynamic Faceted Search
Improving Solr:
Solr limitation 1: a threshold has to be chosen that decides the representation of the bitset.
Solution: represent a bitset as a compressed bitmap using Word-Aligned Hybrid (WAH) code.

Implementing Dynamic Faceted Search
WAH
There are 2 types of words:
  Literal words: a verbatim representation of 31 bits
  Fill words: encode the length of a run of all 0’s or all 1’s in 30 bits
A bitmap is first broken into groups of 31 bits and then converted into a sequence of literal and fill words.
Operations on bitmaps, such as intersection, can be performed on the WAH code directly without decoding.
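
A minimal sketch of the encoding with 32-bit words follows. It assumes the usual WAH layout: the top bit distinguishes fill words (1) from literal words (0), a fill word's next bit is the fill value, and its low 30 bits count 31-bit groups. As a simplification that real implementations avoid, the last partial group is padded with zeros.

```python
GROUP = 31                    # payload bits per word
ALL_ONES = (1 << GROUP) - 1   # a group consisting only of 1s
MAX_RUN = (1 << 30) - 1       # largest run length a fill word can hold


def wah_encode(bits):
    """Encode a list of 0/1 values into a list of 32-bit WAH words."""
    padded = bits + [0] * (-len(bits) % GROUP)
    words = []
    for i in range(0, len(padded), GROUP):
        group = 0
        for b in padded[i:i + GROUP]:
            group = (group << 1) | b
        if group in (0, ALL_ONES):
            header = (1 << 31) | ((1 if group else 0) << 30)
            # Extend the previous word if it is a fill of the same value.
            same_fill = bool(words) and (words[-1] >> 30) == (header >> 30)
            if same_fill and (words[-1] & MAX_RUN) < MAX_RUN:
                words[-1] += 1
            else:
                words.append(header | 1)
        else:
            words.append(group)   # literal word: top bit stays 0
    return words


def wah_decode(words):
    """Expand WAH words back into a list of 0/1 values (a multiple of 31 long)."""
    bits = []
    for w in words:
        if w >> 31:   # fill word
            bits.extend([(w >> 30) & 1] * (GROUP * (w & MAX_RUN)))
        else:         # literal word
            bits.extend((w >> (GROUP - 1 - j)) & 1 for j in range(GROUP))
    return bits


bitmap = [0] * 62 + [1, 0, 1] + [1] * 31
encoded = wah_encode(bitmap)
print([hex(w) for w in encoded])
```

Long runs collapse into a single word, so sparse and dense bitmaps both stay compact without a representation threshold.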

Implementing Dynamic Faceted Search
Improving Solr:
Solr limitation 2: the matching document set D_q has to be intersected with the bitset of every facet instance.
Solution: reduce the number of intersections by building a directory structure, called a bitset tree, on top of the bitsets of a facet.

Implementing Dynamic Faceted Search
Building and Using a Bitset Tree
Starting with the leaf nodes, for each bitset b corresponding to a facet instance, we create an entry.
We then divide all entries into groups of size s.
For each group, we generate a leaf node holding all entries in that group.
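
The slide covers only the leaf level. The sketch below fills in the rest under stated assumptions that are not confirmed by the transcript: each node stores the union (bitwise OR) of the bitsets beneath it, upper levels are built by grouping s nodes at a time, and a subtree is skipped when its union does not intersect D_q. Bitsets are Python integers and all names are hypothetical.

```python
def build_tree(bitsets, s):
    """Build a bitset tree over a dict of facet value -> integer bitset."""
    if s < 2:
        raise ValueError("group size s must be at least 2")
    entries = list(bitsets.items())
    if not entries:
        return {"union": 0, "entries": [], "children": []}
    # Leaf level: one node per group of s (value, bitset) entries.
    nodes = []
    for i in range(0, len(entries), s):
        group = entries[i:i + s]
        union = 0
        for _, bits in group:
            union |= bits
        nodes.append({"union": union, "entries": group, "children": []})
    # Upper levels (assumed): group s nodes under a parent until one root remains.
    while len(nodes) > 1:
        parents = []
        for i in range(0, len(nodes), s):
            group = nodes[i:i + s]
            union = 0
            for node in group:
                union |= node["union"]
            parents.append({"union": union, "entries": [], "children": group})
        nodes = parents
    return nodes[0]


def count_with_tree(node, result_bits, counts=None):
    """Count facet values in D_q, pruning subtrees that cannot match."""
    counts = {} if counts is None else counts
    if (node["union"] & result_bits) == 0:
        return counts   # nothing below this node occurs in D_q
    for value, bits in node["entries"]:
        n = bin(bits & result_bits).count("1")
        if n:
            counts[value] = n
    for child in node["children"]:
        count_with_tree(child, result_bits, counts)
    return counts


bitsets = {"a": 0b00011, "b": 0b00100, "c": 0b01000, "d": 0b10000}
root = build_tree(bitsets, s=2)
print(count_with_tree(root, 0b00101))
```

In this example the leaf holding "c" and "d" is pruned after a single intersection, which is where the saving over intersecting every bitset comes from.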

Evaluation
*Setup
DBLP
Contains about 13,000 papers published in 26 venues (e.g., SIGMOD, VLDB, TODS) in the past 30 years.
Has 14 facets organized in 6 hierarchies, including author, venue, time (e.g., decade, year), location (e.g., country, city), number of authors per paper, and number of citations per paper.
The title of each paper is used as the text for keyword searches.
Used to conduct the user survey.

Evaluation
*Setup
Patent
Has about 1.8 million U.S. patents from the past 30 years.
Has 16 facets organized into 10 hierarchies.
Used for performance evaluation.

Evaluation
*Result from a User Survey
Tests were performed on 3 keyword queries.
2 are provided by the authors: “distributed”, “mining”.
Users pick the third keyword.
1 based on natural expectation
2 based on navigational expectation: 1 used the complete repository, 1 used the previous query

Evaluation
*Result from a User Survey
Our dynamic approach also received some negative feedback.
Overall, the feedback for the natural expectation is neutral.
Different ways of aggregating the degree of interestingness: HybridWeight (7) > MaxWeight (6) > AvgWeight (2)

Evaluation
*Performance Results
Environment:
  Implemented in Java
  3GHz P4 desktop machine with 1GB memory
  A single disk drive, running Linux
Versions compared:
  1. simple: inverted index
  2. Solr
  3. compressed: improves Solr with WAH code
  4. tree: improves Solr with bitset trees
  5. compressed-tree: both WAH and bitset trees on Solr

Evaluation
*Performance Results
Scaling with Data Size
Run a query that matches 25,000 docs using tree.
Break the total time into search time and summary computation time.

Conclusion and Future Work
Developed a novel dynamic faceted search system that supports OLAP-style discovery-driven analysis on a large set of structured and unstructured data.
Proposed an intuitive and effective way of measuring “interestingness”.
Proposed a novel navigational method of setting a user’s expectation.

Conclusion and Future Work
Incorporate user feedback in facet selection.
Extend the aggregates to functions other than count, e.g., sum or average on some numerical measures.
Support dynamic faceted search in a distributed environment.
