
1 Semi-Automated Creation of Facet Hierarchies Marti Hearst School of Information, UC Berkeley Joint work with Dr. Emilia Stoica

2 Marti Hearst, Taxonomy Bootcamp ‘06 Outline
- Faceted Metadata
  - Definition
  - Advantages
- Flamenco: Search Interface Design using Faceted Metadata
- Castanet: (Semi) Automated Tool for Creation of Category Systems
- Comparison to State-of-the-Art Alternatives
- Conclusions

3 Marti Hearst, Taxonomy Bootcamp ‘06 Focus: Search and Navigation of Large Collections Image Collections E-Government Sites Shopping Sites Digital Libraries

4 Problems with Site Search
- Study by Vividence in 2001 on 69 Sites
  - 70% eCommerce
  - 31% Service
  - 21% Content
  - 2% Community
- Poorly organized search results
  - Frustration and wasted time
- Poor information architecture
  - Confusion
  - Dead ends
  - "back and forthing"
  - Forced to search

5 Marti Hearst, Taxonomy Bootcamp ‘06 What we want to Achieve  Integrate browsing and searching seamlessly  Support exploration and learning  Avoid dead-ends, “pogo’ing”, and “lostness”

6 Marti Hearst, Taxonomy Bootcamp ‘06 Main Idea  Use hierarchical faceted metadata  Design the interface to:  Allow flexible navigation  Provide previews of next steps  Organize results in a meaningful way  Support both expanding and refining the search

7 The Problem With Hierarchy
- Most things can be classified in more than one way.
- Most organizational systems do not handle this well.
- Example: Animal Classification
[Figure: the same animals (otter, penguin, robin, salmon, wolf, cobra, bat, seal) sorted three different ways, by Skin Covering, by Locomotion, and by Diet, with a different grouping under each]

8 The Problem with Hierarchy
- Inflexible
  - Force the user to start with a particular category
  - What if I don’t know the animal’s diet, but the interface makes me start with that category?
- Wasteful
  - Have to repeat combinations of categories
  - Makes for extra clicking and extra coding
- Difficult to modify
  - To add a new category type, must duplicate it everywhere or change things everywhere

9 The Problem With Hierarchy
[Figure: a single tree rooted at "start" in which the same label sets (fur, scales, feathers; swim, fly, run, slither; fish, rodents, insects) are repeated under every branch, ending in leaves such as salmon, bat, robin, wolf, …]

10 Marti Hearst, Taxonomy Bootcamp ‘06 The Idea of Facets  Facets are a way of labeling data  A kind of Metadata (data about data)  Can be thought of as properties of items  Facets vs. Categories  Items are placed INTO a category system  Multiple facet labels are ASSIGNED TO items

11 Marti Hearst, Taxonomy Bootcamp ‘06 The Idea of Facets  Create INDEPENDENT categories (facets)  Each facet has labels (sometimes arranged in a hierarchy)  Assign labels from the facets to every item  Example: recipe collection Course Main Course Cooking Method Stir-fry Cuisine Thai Ingredient Bell Pepper Curry Chicken

12 Marti Hearst, Taxonomy Bootcamp ‘06 The Idea of Facets  Break out all the important concepts into their own facets  Sometimes the facets are hierarchical  Assign labels to items from any level of the hierarchy Preparation Method Fry Saute Boil Bake Broil Freeze Desserts Cakes Cookies Dairy Ice Cream Sorbet Flan Fruits Cherries Berries Blueberries Strawberries Bananas Pineapple

13 Marti Hearst, Taxonomy Bootcamp ‘06 Using Facets  Now there are multiple ways to get to each item Preparation Method Fry Saute Boil Bake Broil Freeze Desserts Cakes Cookies Dairy Ice Cream Sherbet Flan Fruits Cherries Berries Blueberries Strawberries Bananas Pineapple Fruit > Pineapple Dessert > Cake Preparation > Bake Dessert > Dairy > Sherbet Fruit > Berries > Strawberries Preparation > Freeze
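
The idea that several facet labels are assigned to each item, so that there are multiple ways to reach it, can be sketched in a few lines of Python. This is only an illustration: the item names, the `ITEMS` structure, and the `matches` and `select` helpers are assumptions made for the example, and only the label paths come from the slide.

```python
# Hypothetical items; the label paths are the ones shown on the slide.
# Each item gets one label path per facet (general label first).
ITEMS = {
    "pineapple cake": {
        "Fruit": ("Pineapple",),
        "Dessert": ("Cakes",),
        "Preparation": ("Bake",),
    },
    "strawberry sherbet": {
        "Fruit": ("Berries", "Strawberries"),
        "Dessert": ("Dairy", "Sherbet"),
        "Preparation": ("Freeze",),
    },
}


def matches(labels, facet, path):
    """True if the item's label in `facet` lies at or below `path`."""
    assigned = labels.get(facet, ())
    return assigned[:len(path)] == tuple(path)


def select(items, constraints):
    """Names of items that satisfy every facet constraint (logical AND)."""
    return sorted(
        name
        for name, labels in items.items()
        if all(matches(labels, facet, path) for facet, path in constraints.items())
    )


# The same item can be reached from any facet, at any level of its hierarchy.
print(select(ITEMS, {"Fruit": ("Berries",)}))
print(select(ITEMS, {"Dessert": ("Dairy",), "Preparation": ("Freeze",)}))
```

Because a constraint is a path prefix, selecting a higher-level label such as Fruit > Berries also returns items labeled with its subcategories.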

14 Marti Hearst, Taxonomy Bootcamp ‘06 Example: Nobel Prize Winners Collection (Before and After Facets)

15 Marti Hearst, Taxonomy Bootcamp ‘06 Only One Way to View Laureates

16 Marti Hearst, Taxonomy Bootcamp ‘06 First, Choose Prize Type

17 Next, view the list! The user must first choose an Award type (literature), then browse through the laureates in chronological order. No choice is given to, say, organize by year and then award, or by country, then decade, then award, etc.

18 Marti Hearst, Taxonomy Bootcamp ‘06 Flamenco Interface: Using Hierarchical Faceted Metadata

19 Marti Hearst, Taxonomy Bootcamp ‘06 Opening View Select literature from PRIZE facet

20 Marti Hearst, Taxonomy Bootcamp ‘06 Group results by YEAR facet

21 Marti Hearst, Taxonomy Bootcamp ‘06 Select 1920’s from YEAR facet

22 Marti Hearst, Taxonomy Bootcamp ‘06 Current query is PRIZE > literature AND YEAR: 1920’s. Now remove PRIZE > literature

23 Marti Hearst, Taxonomy Bootcamp ‘06 Now Group By YEAR > 1920’s

24 Marti Hearst, Taxonomy Bootcamp ‘06 Hierarchy Traversal: Group By YEAR > 1920’s, and drill down to 1921

25 Marti Hearst, Taxonomy Bootcamp ‘06 Select an individual item

26 Marti Hearst, Taxonomy Bootcamp ‘06 Use Endgame to expand out

27 Marti Hearst, Taxonomy Bootcamp ‘06 Use Endgame to expand out

28 Marti Hearst, Taxonomy Bootcamp ‘06 Or use “More like this” to find similar items

29 Marti Hearst, Taxonomy Bootcamp ‘06 Start a new search using keyword “California”

30 Marti Hearst, Taxonomy Bootcamp ‘06 Note that category structure remains after the keyword search

31 Marti Hearst, Taxonomy Bootcamp ‘06 The query is now a keyword ANDed with a facet subhierarchy

32 Marti Hearst, Taxonomy Bootcamp ‘06 Using Facets  The system only shows the labels that correspond to the current set of items  Start with all items and all facets  The user then selects a label within a facet  This reduces the set of items (only those that have been assigned to the subcategory label are displayed)  This also eliminates some subcategories from the view.
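
The behavior described above (start with everything, narrow by one label, and show only the labels that still apply) can be sketched as follows. The records, facet names, and the `refine` and `previews` helpers are hypothetical and are not taken from the Flamenco code.

```python
from collections import Counter

# Hypothetical records: one label per facet, placeholder names only.
RECORDS = [
    {"name": "laureate-1", "PRIZE": "literature", "YEAR": "1920s"},
    {"name": "laureate-2", "PRIZE": "literature", "YEAR": "1930s"},
    {"name": "laureate-3", "PRIZE": "physics", "YEAR": "1920s"},
    {"name": "laureate-4", "PRIZE": "peace", "YEAR": "1940s"},
]


def refine(items, facet, label):
    """Keep only the items that carry `label` in `facet`."""
    return [item for item in items if item.get(facet) == label]


def previews(items, facets):
    """For each facet, count the labels that occur in the current items.

    Labels with no matching items never appear, so every label that is
    shown leads to a non-empty result set.
    """
    return {
        facet: Counter(item[facet] for item in items if facet in item)
        for facet in facets
    }


current = refine(RECORDS, "PRIZE", "literature")
print(previews(current, ["YEAR"]))
```

After selecting PRIZE > literature, the 1940s label disappears from the YEAR preview because no remaining item carries it.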

33 Advantages of Facets
- Can’t end up with empty result sets (except with keyword search)
- Helps avoid feelings of being lost.
- Easier to explore the collection.
  - Helps users infer what kinds of things are in the collection.
  - Evokes a feeling of “browsing the shelves”
- Is preferred over standard search for collection browsing in usability studies. (Interface must be designed properly)

34 Marti Hearst, Taxonomy Bootcamp ‘06 Advantages of Facets  Seamless to add new facets and subcategories  Seamless to add new items.  Helps with “categorization wars”  Don’t have to agree exactly where to place something  Interaction can be implemented using a standard relational database.  May be easier for automatic categorization
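
As an illustration of the relational-database point, here is a minimal sketch using Python's built-in `sqlite3` module. The two-table schema, the table and column names, and the sample rows are assumptions made for this example; the slides do not describe Flamenco's actual schema.

```python
import sqlite3


def build_db():
    """Create an in-memory database with a hypothetical two-table schema."""
    db = sqlite3.connect(":memory:")
    db.executescript(
        """
        CREATE TABLE item (id INTEGER PRIMARY KEY, title TEXT);
        CREATE TABLE item_label (item_id INTEGER, facet TEXT, label TEXT);
        """
    )
    db.executemany(
        "INSERT INTO item VALUES (?, ?)",
        [(1, "recipe-1"), (2, "recipe-2"), (3, "recipe-3")],
    )
    db.executemany(
        "INSERT INTO item_label VALUES (?, ?, ?)",
        [
            (1, "Cuisine", "Thai"), (1, "Course", "Main Course"),
            (2, "Cuisine", "Thai"), (2, "Course", "Dessert"),
            (3, "Cuisine", "Italian"), (3, "Course", "Main Course"),
        ],
    )
    return db


def label_counts(db, facet, selected_facet, selected_label):
    """Count the labels of `facet` among items carrying the selected label."""
    rows = db.execute(
        """
        SELECT l.label, COUNT(DISTINCT l.item_id)
        FROM item_label AS l
        JOIN item_label AS s ON s.item_id = l.item_id
        WHERE l.facet = ? AND s.facet = ? AND s.label = ?
        GROUP BY l.label
        ORDER BY l.label
        """,
        (facet, selected_facet, selected_label),
    )
    return rows.fetchall()


db = build_db()
print(label_counts(db, "Course", "Cuisine", "Thai"))
```

Adding a new facet or a new item in this layout only means inserting rows, which fits the claim that facets and items are seamless to add.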

35 Information previews
- Use the metadata to show where to go next
  - More flexible than canned hyperlinks
  - Less complex than full search
- Help users see and return to previous steps
- Reduces mental work
  - Recognition over recall
  - Suggests alternatives
- More clicks are OK only if (J. Spool):
  - The “scent” of the target does not weaken
  - Users feel they are going towards, rather than away from, their target.

36 Marti Hearst, Taxonomy Bootcamp ‘06 Facets vs. Hierarchy  Early Flamenco studies compared allowing multiple hierarchical facets vs. just one facet.  Multiple facets was preferred and more successful.

37 Limitation of Facets
- Do not naturally capture MAIN THEMES
- Facets do not show RELATIONS explicitly
  - Which color is associated with which object?
[Photo example labeled with the colors Aquamarine, Red, Orange and the objects Door, Doorway, Wall. Photo by J. Hearst, jhearst.typepad.com]

38 Marti Hearst, Taxonomy Bootcamp ‘06 Terminology Clarification  Facets vs. Attributes  Facets are shown independently in the interface  Attributes just associated with individual items  E.g., ID number, Source, Affiliation  However, can always convert an attribute to a facet  Facets vs. Labels  Labels are the names used within facets  These are organized into subhierarchies  Synonyms  There should be alternate names for the category labels  Currently (in Flamenco) this is done with subcategories  E.g., Deer has subcategories “stag”, “fawn”, “doe”

39 Marti Hearst, Taxonomy Bootcamp ‘06 Usability Study Results

40 Marti Hearst, Taxonomy Bootcamp ‘06 Flamenco Usability Studies  Usability studies done on 3 collections:  Recipes (epicurious): 13,000 items  Architecture Images: 40,000 items  Fine Arts Images: 35,000 items  Conclusions:  Users like and are successful with the dynamic faceted hierarchical metadata, especially for browsing tasks  Very positive results, in contrast with studies on earlier iterations.

41 Marti Hearst, Taxonomy Bootcamp ‘06 Most Recent Usability Study  Participants & Collection  32 Art History Students  ~35,000 images from SF Fine Arts Museum  Study Design  Within-subjects  Each participant sees both interfaces  Balanced in terms of order and tasks  Participants assess each interface after use  Afterwards they compare them directly  Data recorded in behavior logs, server logs, paper-surveys; one or two experienced testers at each trial.  Used 9 point Likert scales.  Session took about 1.5 hours; pay was $15/hour

42 Marti Hearst, Taxonomy Bootcamp ‘06 Post-Interface Assessments All significant at p<.05 except “simple” and “overwhelming”

43 Post-Test Comparison
Which Interface Preferable For (number of participants, Baseline vs. Faceted):
- Find images of roses: Baseline 15, Faceted 16
- Find all works from a given period: Baseline 2, Faceted 30
- Find pictures by 2 artists in same media: Baseline 1, Faceted 29
Overall Assessment:
- More useful for your tasks: Baseline 4, Faceted 28
- Easiest to use: Baseline 8, Faceted 23
- Most flexible: Baseline 6, Faceted 24
- More likely to result in dead ends: Baseline 28, Faceted 3
- Helped you learn more: Baseline 1, Faceted 31
- Overall preference: Baseline 2, Faceted 29

44 How to Create Facet Hierarchies? Our Approach: Castanet

45 Marti Hearst, Taxonomy Bootcamp ‘06 Example: Recipes (3500 docs)

46 Marti Hearst, Taxonomy Bootcamp ‘06 Castanet Output (shown in Flamenco)

47 Marti Hearst, Taxonomy Bootcamp ‘06 Castanet Output (shown in Flamenco)

48 Marti Hearst, Taxonomy Bootcamp ‘06 Castanet Output (shown in Flamenco)

49 Marti Hearst, Taxonomy Bootcamp ‘06 Castanet Output (shown in Flamenco)

50 Marti Hearst, Taxonomy Bootcamp ‘06 Castanet Output (shown in Flamenco)

52 Our Approach: Leverage the structure of WordNet

53 Our Approach
- Leverage the structure of WordNet
[Figure: pipeline that takes Documents and WordNet as input, with the stages Select terms, Get hypernym paths, Build tree, Compress tree, Divide into facets]

54 1. Select Terms
- Select well-distributed terms from the collection
  - Eliminate stopwords
  - Retain only those terms with a distribution higher than a threshold (default: top 10%)
[Figure: pipeline that takes Documents and WordNet as input, with the stages Select terms, Build core tree, Augment core tree, Compress tree, Remove top-level categories]
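
A minimal sketch of this term-selection step follows. It assumes that "distribution" means the number of documents containing the term and that "top 10%" means the top tenth of the ranked vocabulary; the slide does not spell out either detail. The stopword list, the whitespace tokenizer, and the function name are also illustrative.

```python
import math
from collections import Counter

# Tiny illustrative stopword list; a real run would use a fuller one.
STOPWORDS = {"the", "a", "and", "of", "with", "in"}


def select_terms(docs, top_fraction=0.10):
    """Return the most widely distributed non-stopword terms.

    Assumption: a term's distribution is its document frequency, and the
    threshold keeps the top `top_fraction` of the ranked vocabulary.
    """
    doc_freq = Counter()
    for doc in docs:
        # Count each term once per document.
        doc_freq.update(set(doc.lower().split()) - STOPWORDS)
    keep = max(1, math.ceil(len(doc_freq) * top_fraction))
    # Rank by document frequency, breaking ties alphabetically.
    ranked = sorted(doc_freq.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:keep]]


docs = [
    "chicken curry with rice",
    "chicken soup",
    "rice pudding",
    "chicken and rice salad",
]
print(select_terms(docs, top_fraction=0.5))
```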

55 2. Build Core Tree
- Get hypernym path if term:
  - has only one sense, or
  - matches a pre-selected WordNet domain
- Adding a new term increases a count at each node on its path by the number of docs with the term.
- Build a “backbone”
  - Create paths from unambiguous terms only
  - Bias the structure towards appropriate senses of words
[Figure: hypernym paths for two terms]
  sundae: entity => substance, matter => nutriment => dessert => frozen dessert => ice cream sundae => sundae
  sherbet: entity => substance, matter => nutriment => dessert => frozen dessert => sherbet, sorbet => sherbet

56 2. Build Core Tree (cont.)
- Merge hypernym paths to build a tree
[Figure: the sundae and sherbet paths merge into one tree: entity => substance, matter => nutriment => dessert => frozen dessert, which then branches into ice cream sundae => sundae and sherbet, sorbet => sherbet]
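
The merge step and the per-node counts can be sketched with a nested-dictionary tree. The two paths are the ones shown on the slides; the document counts (5 and 3) and the helper names are made up for the illustration, and the paths are hard-coded rather than looked up in WordNet.

```python
def new_node():
    """A tree node: a document count plus named children."""
    return {"count": 0, "children": {}}


def add_path(root, path, doc_count):
    """Merge one hypernym path (most general node first) into the tree.

    Every node on the path has its count increased by the number of
    documents that contain the term.
    """
    node = root
    for name in path:
        node = node["children"].setdefault(name, new_node())
        node["count"] += doc_count
    return root


def find(root, path):
    """Follow `path` from the root and return the node it ends at."""
    node = root
    for name in path:
        node = node["children"][name]
    return node


SUNDAE = ["entity", "substance, matter", "nutriment", "dessert",
          "frozen dessert", "ice cream sundae", "sundae"]
SHERBET = ["entity", "substance, matter", "nutriment", "dessert",
           "frozen dessert", "sherbet, sorbet", "sherbet"]

tree = new_node()
add_path(tree, SUNDAE, doc_count=5)   # hypothetical: 5 docs mention "sundae"
add_path(tree, SHERBET, doc_count=3)  # hypothetical: 3 docs mention "sherbet"

frozen = find(tree, SUNDAE[:5])
print(frozen["count"], sorted(frozen["children"]))
```

Shared prefixes are stored once, so "frozen dessert" ends up with the combined count of both terms and one child per distinct continuation.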

57 3. Augment Core Tree
- Attach to the core tree the terms with more than one sense
- Favor the more common path over other alternatives

58 Augment Core Tree (cont.)
[Figure: two candidate hypernym paths for the term "date"]
  Date (p1): entity => abstraction => measure, quantity => fundamental quality => time period => calendar day (18) => date
  Date (p2): entity => substance, matter => food, nutrient => nutriment => food => edible fruit (78) => date
Choose path p2, since it has more items assigned (78 vs. 18)
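
One way to read "favor the more common path" is to compare the item counts already recorded in the core tree at each candidate's parent node, as the figure does with calendar day (18) and edible fruit (78). The sketch below makes that assumption explicit; the `choose_path` helper and the flat `node_counts` dictionary are hypothetical simplifications.

```python
def choose_path(candidate_paths, node_counts):
    """Pick the candidate path whose parent node has the most items.

    Assumption: "more common" is judged by the count stored in the core
    tree at the node directly above the ambiguous term.
    """
    return max(candidate_paths, key=lambda path: node_counts.get(path[-2], 0))


P1 = ["entity", "abstraction", "measure, quantity", "fundamental quality",
      "time period", "calendar day", "date"]
P2 = ["entity", "substance, matter", "food, nutrient", "nutriment",
      "food", "edible fruit", "date"]

# Counts taken from the slide's figure.
COUNTS = {"calendar day": 18, "edible fruit": 78}

print(choose_path([P1, P2], COUNTS)[-2])
```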

59 4. Compress Tree
- Rule 1: Eliminate a parent with fewer than k children, unless it is the root or its distribution is larger than 0.1 * max dist
[Figure: before, dessert => frozen dessert, whose children are ice cream sundae => sundae, sherbet, sorbet => sherbet, and parfait; after, dessert => frozen dessert, whose children are sundae, parfait, and sherbet]
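
Rule 1 can be sketched as a recursive pass that splices out thin intermediate parents and promotes their children. The tree layout, the counts, and the choice of `max_dist` below are invented for the example; the slide only fixes the rule itself, so treat this as one possible reading rather than the Castanet implementation.

```python
def leaf(count):
    return {"count": count, "children": {}}


def compress(node, k, max_dist):
    """Return a copy of `node` with Rule 1 applied below it.

    A non-leaf child with fewer than `k` children is removed and its
    children are attached to `node`, unless the child's count exceeds
    0.1 * max_dist. The node passed in (the root) is never removed.
    Note: promoted children with the same name would overwrite each other.
    """
    kept = {}
    for name, child in node["children"].items():
        child = compress(child, k, max_dist)
        is_parent = len(child["children"]) > 0
        too_few = len(child["children"]) < k
        frequent = child["count"] > 0.1 * max_dist
        if is_parent and too_few and not frequent:
            kept.update(child["children"])  # promote the grandchildren
        else:
            kept[name] = child
    return {"count": node["count"], "children": kept}


# Hypothetical counts on the tree from the slide, rooted at "dessert".
DESSERT = {
    "count": 30,
    "children": {
        "frozen dessert": {
            "count": 12,
            "children": {
                "ice cream sundae": {"count": 5, "children": {"sundae": leaf(5)}},
                "sherbet, sorbet": {"count": 3, "children": {"sherbet": leaf(3)}},
                "parfait": leaf(4),
            },
        }
    },
}

result = compress(DESSERT, k=2, max_dist=100)
print(sorted(result["children"]["frozen dessert"]["children"]))
```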

60 4. Compress Tree (cont.)
- Rule 2: Eliminate a child whose name appears within the parent’s name
[Figure: before, dessert => frozen dessert, whose children are sundae, parfait, and sherbet; after, dessert with children sundae, parfait, and sherbet]

61 5. Divide into Facets

62 5. Divide into Facets (Remove top levels)
- Rule 1: Eliminate very general categories (e.g., entity, abstraction). If no paths are longer than threshold t, then done.
- Rule 2: Otherwise, undo the first step, then eliminate all top levels until the maximum length of any path in the resulting hierarchy is t.
[Figure: before, a tree whose top levels are entity, substance/matter, food/nutriment, ingredient/fixings, and food stuff/food product, above the nodes sweetening (sugar, syrup) and flavorer, herb (parsley, oregano); after, only sweetening (sugar, syrup) and flavorer, herb (parsley, oregano) remain]

63 Disambiguation
- Ambiguity in:
  - Word senses
  - Paths up the hypernym tree
Sense 1 for word “tuna”:
  organism, being => plant, flora => vascular plant => succulent => cactus => tuna
Sense 2 for word “tuna” (2 paths for same sense):
  organism, being => fish => food fish => tuna
  organism, being => fish => bony fish => spiny-finned fish => percoid fish => tuna
Senses 1 and 2 give 2 paths for the same word; sense 2 alone gives 2 paths for the same sense.

64 How to Select the Right Senses and Paths?
- First: build core tree
  - (1) Create paths for words with only one sense
  - (2) Use Domains
    - WordNet has 212 Domains: medicine, mathematics, biology, chemistry, linguistics, soccer, etc.
    - Automatically scan the collection to see which domains apply
    - The user selects which of the suggested domains to use, or may add their own
    - Paths for terms that match the selected domains are added to the core tree
- Then: add remaining terms to the core tree.

65 Optional Step: Domains
- To disambiguate, use Domains
  - WordNet has 212 Domains: medicine, mathematics, biology, chemistry, linguistics, soccer, etc.
  - A better collection has been developed by Magnini 2000
    - Assigns a domain to every noun synset
- Automatically scan the collection to see which domains apply
- The user selects which of the suggested domains to use, or may add their own
- Paths for terms that match the selected domains are added to the core tree

66 Using Domains
dip glosses:
  Sense 1: A depression in an otherwise level surface
  Sense 2: The angle that a magnetic needle makes with the horizon
  Sense 3: Tasty mixture into which bite-size foods are dipped
dip hypernyms:
  Sense 1: solid => concave shape => depression
  Sense 2: shape, form => space => angle
  Sense 3: food => ingredient, fixings => flavorer
Given domain “food”, choose sense 3
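
The "dip" example can be turned into a small sketch. It simplifies the method by matching the selected domain name against each sense's hypernym path, whereas the slides describe domains as labels assigned to synsets; the `DIP_SENSES` structure and the `pick_sense` helper are hypothetical, and the glosses and paths are copied from the slide rather than looked up in WordNet.

```python
# Senses of "dip" as listed on the slide (hypernyms, most general first).
DIP_SENSES = [
    {"sense": 1,
     "gloss": "A depression in an otherwise level surface",
     "hypernyms": ["solid", "concave shape", "depression"]},
    {"sense": 2,
     "gloss": "The angle that a magnetic needle makes with the horizon",
     "hypernyms": ["shape, form", "space", "angle"]},
    {"sense": 3,
     "gloss": "Tasty mixture into which bite-size foods are dipped",
     "hypernyms": ["food", "ingredient, fixings", "flavorer"]},
]


def pick_sense(senses, domain):
    """Return the first sense whose hypernym path contains `domain`.

    Simplifying assumption: a sense "matches" a domain when the domain
    name occurs on its hypernym path. Returns None if nothing matches.
    """
    for sense in senses:
        if domain in sense["hypernyms"]:
            return sense
    return None


chosen = pick_sense(DIP_SENSES, "food")
print(chosen["sense"], chosen["gloss"])
```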

67 Castanet Evaluation

68 Marti Hearst, Taxonomy Bootcamp ‘06 Castanet Evaluation  This is a tool for information architects, so people of this type did the evaluation  We compared output on  Recipes  Biomedical journal titles  We compared to two state-of-the-art algorithms  LDA (Blei et al. 04)  Subsumption (Sanderson & Croft ’99)

69 Marti Hearst, Taxonomy Bootcamp ‘06 Subsumption Output (shown in Flamenco)

70 Marti Hearst, Taxonomy Bootcamp ‘06 Subsumption Output (shown in Flamenco)

71 Marti Hearst, Taxonomy Bootcamp ‘06 Subsumption Output (shown in Flamenco)

72 Marti Hearst, Taxonomy Bootcamp ‘06 Subsumption Output (shown in Flamenco)

73 Marti Hearst, Taxonomy Bootcamp ‘06 LDA Output (shown in Flamenco)

74 Marti Hearst, Taxonomy Bootcamp ‘06 LDA Output (shown in Flamenco)

75 Marti Hearst, Taxonomy Bootcamp ‘06 LDA Output (shown in Flamenco)

76 Marti Hearst, Taxonomy Bootcamp ‘06 Evaluation Method  Information architects assessed the category systems  For each of 2 systems’ output:  Examined and commented on top-level  Examined and commented on two sub-levels  Then comment on overall properties  Meaningful?  Systematic?  Likely to use in your work?

77 Evaluation Results
- Results on recipes collection for “Would you use this system in your work?”
  - Yes in some cases or yes definitely:
    - Pine (Castanet): 29/34
    - Oak (LDA): 0/18
    - Birch (Subsumption): 6/16
- Results on quality of categories:

78 Opportunities for Tagging
- New opportunity: Tagging, folksonomies (Flickr, del.icio.us)
  - People are creating facets in a decentralized manner
  - They are assigning multiple facets to items
  - This is done on a massive scale
  - This leads naturally to meaningful associations

79 Marti Hearst, Taxonomy Bootcamp ‘06 Conclusions  Flexible application of hierarchical faceted metadata is a proven approach for navigating large information collections.  Midway in complexity between simple hierarchies and deep knowledge representation.  Currently in use on e-commerce sites; spreading to other domains  Systems are needed to help create faceted metadata structures  Our WordNet-based algorithm, while not perfect, seems like it will be a useful tool for Information Architects.

80 Acknowledgements
- Flamenco Team: Brycen Chun, Ame Elliott, Jennifer English, Kevin Li, Rashmi Sinha, Emilia Stoica, Kirsten Swearingen, Ka-Ping Yee
- Castanet: Emilia Stoica
- Funding: This work was supported in part by NSF (IIS-9984741)

81 For more information: flamenco.berkeley.edu Thank you! Marti Hearst

