Castanet: Using WordNet to Build Facet Hierarchies Emilia Stoica and Marti Hearst School of Information, Berkeley.



Motivation
Want to assign labels from multiple hierarchies

Motivation
Hot and Sweet Chicken: 1 pepper, 2 apricots, 1 pound chicken breast, 1 Tbsp gingerroot
Facet assignments: Meat > Chicken; Vegetables > pepper; Fruit > Apricot; Flavor > gingerroot

Castanet
Carves out a structure from the hypernym (IS-A) relations within WordNet
Produces surprisingly good results for a wide range of subjects, e.g., arts, medicine, recipes, math, news, bibliographic records

WordNet Challenges
A word may have more than one sense
- Fine granularity of word sense distinctions, e.g., newspaper (#1) - a daily publication on folded sheets vs. newspaper (#3) - a physical object
- Multiple hypernym paths for the same sense, e.g., tuna #1 appears under both food fish and bony fish (likewise cactus #2)

WordNet Challenges (cont.)
The hypernym path may be quite long (e.g., sense #3 of tuna has a 14-node path)
Sparse coverage of proper names and noun phrases (not addressed)

Algorithm Goals
- Build a set of facet hierarchies
- Balance depth and breadth: avoid "skinny" paths; don't go too deep or too broad
- Choose understandable labels
- Disambiguate words (currently a word can take on only one sense)

Our Approach
Input: documents + WordNet
1. Select terms
2. Build core tree
3. Augment core tree
4. Compress tree
5. Remove top-level categories and divide into facets

1. Select Terms
Select well-distributed terms from the collection:
- Eliminate stopwords
- Retain only those terms with a distribution higher than a threshold (default: top 10%)
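The selection step above can be sketched as follows; this is a minimal illustration assuming "distribution" means document frequency, with a toy stopword list and corpus (the real system's tokenization and threshold tuning are not specified here).

```python
# Sketch of step 1 (term selection): drop stopwords and keep the
# best-distributed terms, i.e. the top fraction by document frequency.
# The stopword list and the 10%/25% cutoffs are illustrative.

from collections import Counter

STOPWORDS = {"the", "a", "of", "with", "and", "in"}

def select_terms(docs, top_fraction=0.10):
    """docs: list of token lists. Returns the best-distributed terms."""
    doc_freq = Counter()
    for tokens in docs:
        # count each term at most once per document
        doc_freq.update(set(t.lower() for t in tokens) - STOPWORDS)
    ranked = [term for term, _ in doc_freq.most_common()]
    keep = max(1, int(len(ranked) * top_fraction))
    return ranked[:keep]

docs = [
    ["Hot", "and", "Sweet", "Chicken", "with", "apricots"],
    ["Chicken", "breast", "with", "gingerroot"],
    ["Apricot", "sherbet"],
]
print(select_terms(docs, top_fraction=0.25))  # "chicken" ranks first
```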

2. Build Core Tree
Get the hypernym path for a term if it:
- has only one sense, or
- matches a pre-selected WordNet domain
Adding a new term increases a count at each node on its path by the number of documents containing the term.
Example paths:
sundae: entity > substance,matter > nutriment > dessert > frozen dessert > ice cream > sundae
sherbet: entity > substance,matter > nutriment > dessert > frozen dessert > sherbet,sorbet > sherbet
This builds a "backbone": paths are created from unambiguous terms only, biasing the structure towards appropriate senses of words.

2. Build Core Tree (cont.)
Merge the hypernym paths to build a tree:
sundae: entity > substance,matter > nutriment > dessert > frozen dessert > ice cream > sundae
sherbet: entity > substance,matter > nutriment > dessert > frozen dessert > sherbet,sorbet > sherbet
merge into:
entity > substance,matter > nutriment > dessert > frozen dessert > {ice cream > sundae; sherbet,sorbet > sherbet}
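A minimal sketch of this merging step: root-to-leaf hypernym paths are inserted into a shared tree, and each node on a path accumulates the number of documents containing the leaf term. The paths come from the slide's example; the document counts (12 and 7) are made up for illustration, and in Castanet the paths would come from WordNet rather than being hard-coded.

```python
# Sketch of the core-tree construction: merge root-to-leaf hypernym
# paths into one tree, counting document support at each node.

def make_node():
    return {"children": {}, "count": 0}

def add_path(root, path, doc_count):
    """Insert one hypernym path, incrementing each node's count
    by the number of documents containing the leaf term."""
    node = root
    for label in path:
        node = node["children"].setdefault(label, make_node())
        node["count"] += doc_count

root = make_node()
add_path(root, ["entity", "substance,matter", "nutriment", "dessert",
                "frozen dessert", "ice cream", "sundae"], doc_count=12)
add_path(root, ["entity", "substance,matter", "nutriment", "dessert",
                "frozen dessert", "sherbet,sorbet", "sherbet"], doc_count=7)

frozen = (root["children"]["entity"]["children"]["substance,matter"]
          ["children"]["nutriment"]["children"]["dessert"]
          ["children"]["frozen dessert"])
print(sorted(frozen["children"]))  # ['ice cream', 'sherbet,sorbet']
print(frozen["count"])             # 19
```

The shared prefix (entity ... frozen dessert) is stored once, which is exactly the merge shown on the slide.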

3. Augment Core Tree
Attach to the core tree the terms with more than one sense.
Favor the more common path over the other alternatives.

Augment Core Tree (cont.)
Two candidate paths for "date":
(p1) entity > abstraction > measure, quantity > fundamental quantity > time period > calendar day (18 items) > date
(p2) entity > substance,matter > food, nutrient > nutriment > food > edible fruit (78 items) > date
Choose p2, since it has more items assigned.
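The tie-breaking rule above amounts to a one-line choice; this sketch uses the slide's item counts (18 vs. 78) and abbreviated stand-ins for the two paths.

```python
# Sketch of step 3 (attaching ambiguous terms): among the candidate
# hypernym paths for a term, attach it under the path whose node
# already has the most items assigned.

def pick_path(candidate_paths):
    """candidate_paths: list of (path, items_assigned) pairs."""
    return max(candidate_paths, key=lambda pc: pc[1])[0]

p1 = ["entity", "abstraction", "time period", "calendar day"]  # 18 items
p2 = ["entity", "substance,matter", "food", "edible fruit"]    # 78 items
print(pick_path([(p1, 18), (p2, 78)]))  # the edible-fruit path wins
```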

Optional Step: Domains
To disambiguate, use domains.
- WordNet has 212 domains: medicine, mathematics, biology, chemistry, linguistics, soccer, etc.
- A better collection has been developed by Magnini (2000), which assigns a domain to every noun synset.
Automatically scan the collection to see which domains apply; the user selects which of the suggested domains to use, or may add their own.
Paths for terms that match the selected domains are added to the core tree.

Using Domains
Glosses for "dip":
- Sense 1: a depression in an otherwise level surface
- Sense 2: the angle that a magnetic needle makes with the horizon
- Sense 3: a tasty mixture into which bite-size foods are dipped
Hypernyms for "dip":
- Sense 1: solid => concave shape => depression
- Sense 2: shape, form => space => angle
- Sense 3: food => ingredient, fixings => flavorer
Given the domain "food", choose sense 3.
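A minimal sketch of this domain filter. The sense-to-domain labels are hand-assigned here to stand in for the WordNet Domains annotations (the "geography" and "physics" labels for senses 1 and 2 are invented for illustration, not taken from the resource).

```python
# Sketch of domain-based disambiguation: keep the sense whose domain
# label is among the domains selected for the collection.

DIP_SENSES = {
    1: {"gloss": "a depression in an otherwise level surface",
        "domain": "geography"},   # assumed label
    2: {"gloss": "the angle that a magnetic needle makes with the horizon",
        "domain": "physics"},     # assumed label
    3: {"gloss": "tasty mixture into which bite-size foods are dipped",
        "domain": "food"},
}

def disambiguate(senses, selected_domains):
    """Return the unique sense number matching a selected domain, else None."""
    matches = [num for num, s in senses.items()
               if s["domain"] in selected_domains]
    return matches[0] if len(matches) == 1 else None

print(disambiguate(DIP_SENSES, {"food"}))  # 3
```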

4. Compress Tree
Rule 1: Eliminate a parent with fewer than k children, unless it is the root or its distribution is larger than 0.1 * max distribution.
Example: dessert > frozen dessert > {ice cream > sundae; sherbet,sorbet > sherbet; parfait}
becomes: dessert > frozen dessert > {sundae; sherbet; parfait}

4. Compress Tree (cont.)
Rule 2: Eliminate a child whose name appears within the parent's name.
Example: dessert > frozen dessert > {sundae; parfait; sherbet}
becomes: dessert > {sundae; parfait; sherbet}
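The two compression rules can be sketched together as one recursive pass. This is a simplification: node counts (and hence Rule 1's distribution exception) are omitted, trees are plain name-to-children dicts, and the name-containment test is applied in both directions, since the slide's example removes the child "frozen dessert" under the parent "dessert".

```python
# Sketch of step 4 (tree compression).
# Rule 1: splice out a non-root node with fewer than k children,
#         promoting its children into its place.
# Rule 2: a child whose name overlaps its parent's name is redundant;
#         replace it with its own children.

def compress(tree, k=2, is_root=True):
    """tree maps node name -> dict of children. Returns a compressed copy."""
    result = {}
    for name, children in tree.items():
        children = compress(children, k, is_root=False)
        # Rule 1 (distribution test omitted in this sketch)
        if not is_root and 0 < len(children) < k:
            result.update(children)
            continue
        # Rule 2 (containment checked both ways)
        for child in [c for c in children if c in name or name in c]:
            children.update(children.pop(child))
        result[name] = children
    return result

tree = {"dessert": {"frozen dessert": {
    "ice cream": {"sundae": {}},
    "sherbet,sorbet": {"sherbet": {}},
    "parfait": {}}}}
print(compress(tree))  # {'dessert': {'sundae': {}, 'sherbet': {}, 'parfait': {}}}
```

On the slide's example, Rule 1 removes "ice cream" and "sherbet,sorbet" (one child each) and Rule 2 then removes "frozen dessert" under "dessert", matching both result trees shown above.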


5. Divide into Facets (Remove top levels)
Rule 1: Eliminate very general categories (e.g., entity, abstraction). If no paths are longer than threshold t, then done.
Rule 2: Otherwise, undo the first step, then eliminate all top levels until the maximum length of any path in the resulting hierarchy is t.
Example: entity > substance,matter > food,nutriment > food stuff,food product > ingredient,fixings > {sweetening > sugar syrup; flavorer > herb > parsley, oregano}
becomes the facets: sweetening > sugar syrup; flavorer > herb > parsley, oregano

Example: Recipes (3500 docs)

Castanet Output (shown in Flamenco)

Castanet Output

Castanet Evaluation
Castanet is a tool for information architects, so people of this type did the evaluation.
We compared output on two collections: recipes and biomedical journal titles.
We compared against two state-of-the-art algorithms: LDA (Blei et al. '04) and Subsumption (Sanderson & Croft '99).

Subsumption Output

LDA Output

Evaluation Method
Information architects assessed the category systems.
For each of the 2 systems' output, they:
- Examined and commented on the top level
- Examined and commented on two sub-levels
Then they commented on overall properties: Meaningful? Systematic? Likely to use in your work?

Evaluation (cont.)
Sample questions for top-level categories:
- Would you add/remove/rename any category?
- Did this category match your expectations?
Sample questions for a specific category:
- Would you add/move/remove any sub-categories?
- Would you promote any sub-category to top level?
General questions:
- Would you use Castanet?
- Would you use LDA?
- Would you use Subsumption?
- Would you use a list of most frequent terms?

Evaluation Results
On the recipes collection, responses to "Would you use this system in your work?" answered "yes, in some cases" or "yes, definitely":
- Castanet: 29/34
- LDA: 0/18
- Subsumption: 6/16
- Baseline: 25/34
(Chart: average response to questions about quality; 4 = "strongly agree")

Evaluation Results
(Charts: average responses for top-level categories and for 2 subcategories; 4 = no changes, 1 = change many)

Needed Improvements
- Take spelling variations and morphological variants into account
- Use verbs and adjectives, not just nouns
- Normalize noun phrases
- Allow terms to have more than one sense
- Improve the algorithm for assigning documents to categories

Opportunities for Tagging
New opportunity: tagging and folksonomies (Flickr, del.icio.us)
- People are creating facets in a decentralized manner
- They are assigning multiple facets to items
- This is done on a massive scale
- This leads naturally to meaningful associations

Conclusions
Flexible application of hierarchical faceted metadata is a proven approach for navigating large information collections.
It is midway in complexity between simple hierarchies and deep knowledge representation.
It is currently in use on e-commerce sites and is spreading to other domains.
Systems are needed to help create faceted metadata structures.
Our WordNet-based algorithm, while not perfect, seems like it will be a useful tool for information architects.

Conclusions (cont.)
Castanet builds a set of facet hierarchies by finding IS-A relations between terms using WordNet.
The method has been tested on various domains: medicine, recipes, math, news, arts, bibliographic records.
A usability study shows that Castanet is preferred to other state-of-the-art solutions, and that information architects want to use the tool in their work.

Learn More
Funding: This work was supported in part by NSF (IIS ).
For more information: Stoica, E., Hearst, M., and Richardson, M., Automating Creation of Hierarchical Faceted Metadata Structures, NAACL-HLT 2007.
See