eClassifier: Tool for Taxonomies


Scott Spangler (spangles@almaden.ibm.com)
IBM Almaden Research Center, San Jose, CA

Assertions on Taxonomy Generation
Manual methods are too labor intensive, limit scope and scale, and are not maintainable.
Canned taxonomies are a niche solution.
There are many “natural” or “right” taxonomies, even on the same collection.
Clustering, canned taxonomies and other methods are good starting points, but not enough.

Salient Features of eClassifier
Clustering algorithm independent
  Bias towards speed for interaction
Classification algorithm independent
  Evaluate multiple algorithms for a given taxonomy
  Pick the best algorithm for each level in the taxonomy
Multiple methods to seed a taxonomy: import, clustering, query based
Multiple methods for evaluating, editing and validating taxonomies
Given a taxonomy, analysis/discovery against structured and unstructured information

eClassifier Principles
Apply multiple text mining algorithms to textual data sets in a practical manner.
Provide consistently good results; the goal is not perfection.
Utilize domain expertise by giving the user control over the mining process.
Provide tools, metrics and reports to draw useful conclusions from the analysis.

The Mining Process
Create a dictionary of terms (words and phrases)
Prune the dictionary (remove irrelevant terms)
Cluster documents based on this dictionary
Examine the resulting taxonomy, modifying it based on domain expertise
Create multiple taxonomies (divide and conquer)
Do deeper analysis by creating keyword classifications, comparing taxonomies, inspecting dictionary co-occurrence, and examining recent trends
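The first three steps of the process can be sketched in a few lines of Python. This is only an illustration, not eClassifier's implementation: the slides state that the tool is clustering-algorithm independent, so the plain k-means used here is a stand-in, and the function names, the `min_df` and `max_df_ratio` thresholds, and the seeding rule are all assumptions made for the example.

```python
from collections import Counter
import math

def build_dictionary(docs, min_df=2, max_df_ratio=0.9):
    """Create a dictionary of terms, then prune rare and near-ubiquitous ones.

    min_df and max_df_ratio are hypothetical pruning thresholds.
    """
    df = Counter()
    for doc in docs:
        df.update(set(doc.lower().split()))  # document frequency, not raw counts
    limit = max_df_ratio * len(docs)
    return sorted(t for t, n in df.items() if min_df <= n <= limit)

def vectorize(doc, dictionary):
    """Represent a document as term counts over the pruned dictionary."""
    counts = Counter(doc.lower().split())
    return [counts[t] for t in dictionary]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def kmeans(vectors, k, iterations=10):
    """Plain k-means using cosine similarity, seeded with the first k vectors."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assign each document to its most similar centroid.
        labels = [max(range(k), key=lambda c: cosine(v, centroids[c]))
                  for v in vectors]
        # Recompute each centroid as the mean of its members.
        for c in range(k):
            members = [v for v, l in zip(vectors, labels) if l == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

docs = [
    "printer driver install failed",
    "password reset login failed",
    "printer driver update",
    "login password expired",
]
dictionary = build_dictionary(docs)
labels = kmeans([vectorize(d, dictionary) for d in docs], k=2)
```

The resulting `labels` give a first-cut, single-level taxonomy; the later steps (examining, editing, and comparing taxonomies) are interactive and are where domain expertise enters.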

The Class Table For viewing and understanding each level in a taxonomy

Understanding Class Metrics
Class Naming Convention
  Shortest possible name that covers the examples
  “,” => OR
  “&” => AND
  X_Y => X followed by Y
  NONE => no useful text
  Miscellaneous => no easy description
Cohesion
  A measure of similarity between documents in the same class (0 = different terms, 100 = same terms)
Distinctness
  A measure of similarity between documents in different classes (0 = very similar, 100 = very unique)
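The slides give only the scale and direction of the two metrics, not their formulas. The sketch below shows one plausible way to obtain scores with the same behavior, under two assumptions that are not from the source: cohesion is taken as the mean cosine similarity of a class's documents to the class centroid, and distinctness as one minus the cosine similarity to the nearest other class centroid, both scaled to 0-100.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def centroid(vectors):
    """Mean term vector of a class."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def cohesion(class_vectors):
    """Assumed definition: 100 * mean similarity of members to the centroid."""
    c = centroid(class_vectors)
    return 100.0 * sum(cosine(v, c) for v in class_vectors) / len(class_vectors)

def distinctness(class_vectors, other_classes):
    """Assumed definition: 100 * (1 - similarity to the nearest other class)."""
    c = centroid(class_vectors)
    nearest = max(cosine(c, centroid(o)) for o in other_classes)
    return 100.0 * (1.0 - nearest)
```

With these definitions, a class whose documents all use the same terms scores a cohesion near 100, and a class sharing no terms with any other class scores a distinctness of 100, matching the endpoints described on the slide.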

Dictionary Tool
Edit -> Dictionary Tool
Use this to edit the features on which the taxonomy is based
Delete irrelevant or ambiguous terms
Generate and edit synonyms

Dictionary Generation Files
StopWords: words excluded from the dictionary
Synonyms: different forms of the same semantic term
IncludeWords: words that always appear in the dictionary
Stock Phrases: text to be ignored in creating the dictionary
Synonyms and Stock Phrases can be automatically generated and then edited
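The role of the four files can be illustrated with a small Python sketch. The on-disk file formats are not described in the slides, so the lists are passed here as in-memory sets and dicts; the function names, the `min_df` threshold, and the order in which the lists are applied are assumptions for illustration only.

```python
from collections import Counter

def extract_terms(text, stop_words, synonyms, stock_phrases):
    """Turn raw text into dictionary candidates.

    Stock phrases are blanked out first, then each word is mapped to its
    canonical synonym, then stop words are dropped.
    """
    text = text.lower()
    for phrase in stock_phrases:
        text = text.replace(phrase.lower(), " ")
    terms = []
    for word in text.split():
        word = synonyms.get(word, word)
        if word not in stop_words:
            terms.append(word)
    return terms

def build_dictionary(docs, stop_words, synonyms, stock_phrases,
                     include_words, min_df=2):
    """Keep terms seen in at least min_df documents, plus every include word."""
    df = Counter()
    for doc in docs:
        df.update(set(extract_terms(doc, stop_words, synonyms, stock_phrases)))
    frequent = {t for t, n in df.items() if n >= min_df}
    return sorted(frequent | set(include_words))

docs = [
    "Thank you for calling. The printers are offline",
    "Thank you for calling. Printer jam again",
    "Modem reset",
]
dictionary = build_dictionary(
    docs,
    stop_words={"the", "are"},
    synonyms={"printers": "printer"},
    stock_phrases=["Thank you for calling."],
    include_words={"modem"},
)
```

Here "printer" enters the dictionary because the synonym mapping merges its two forms, and "modem" enters only because it is listed as an include word.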

Refinement of Classes
Subclass Classes
  Subdivide an existing class into multiple subclasses at the next level in the taxonomy
Merge Classes
Delete Classes
Rename Class
Undo
  Don’t be afraid to try things
Save
  .obj files contain all information eClassifier uses
  .class files contain class membership
Read

Class View
For understanding the concepts and contents of a given class
View the text
  Most typical
  Least typical
View the source Web page
View distinguishing terms
View deduced rules for classification and related documents

Keyword Searching
Edit -> Keyword Search
Search for Dictionary terms
  Use “and”, “or” and “_”
Searching within a class
Related Words
Look at Trends
Create new Classes
See where the matching documents occur via the Class Table

Document/Page Viewer
Sorting Documents
  Most typical
  Least typical
View distinguishing terms
  Representative use of important words
Moving documents
Trend Reports

Keyword Class Generation
Execute -> Classify by Keywords
Open queries (KCG files)
  One query per line
  .AND., .OR., (, )
Add, Rename, Delete queries
Prioritize: move up and down
Multiple/only one class
Ambiguous/first matching class
Run Queries
Save Queries
Run eClassifier
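A minimal Python sketch of query-based classification follows. It takes from the slide only the operators (.AND., .OR., parentheses), the idea of prioritized queries, and the choice between first-match and multiple-class assignment. Everything else is an assumption: the KCG file format is not read here, queries are given as hypothetical (class name, query) pairs in priority order, .AND. is assumed to bind tighter than .OR., and operators are assumed to be separated from terms by whitespace.

```python
import re

# Operators first, so ".AND." is not swallowed by the generic term pattern.
TOKEN = re.compile(r"\.AND\.|\.OR\.|\(|\)|[^\s()]+")

def matches(query, words):
    """Evaluate a query such as 'printer .AND. (jam .OR. offline)'
    against the set of lowercase words in a document."""
    tokens = TOKEN.findall(query)
    pos = 0

    def parse_or():
        nonlocal pos
        value = parse_and()
        while pos < len(tokens) and tokens[pos] == ".OR.":
            pos += 1
            rhs = parse_and()  # always parse, so the position advances
            value = value or rhs
        return value

    def parse_and():
        nonlocal pos
        value = parse_atom()
        while pos < len(tokens) and tokens[pos] == ".AND.":
            pos += 1
            rhs = parse_atom()
            value = value and rhs
        return value

    def parse_atom():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token == "(":
            value = parse_or()
            pos += 1  # skip the closing ")"
            return value
        return token.lower() in words

    return parse_or()

def classify(text, queries, first_match_only=True):
    """Return the class names whose queries match; queries are in priority order."""
    words = set(text.lower().split())
    hits = [name for name, query in queries if matches(query, words)]
    return hits[:1] if first_match_only else hits

queries = [
    ("printer", "printer .AND. (jam .OR. offline)"),
    ("login", "password .OR. login"),
]
```

Calling `classify("Printer offline after password reset", queries)` assigns the document to the highest-priority matching class only, while `first_match_only=False` returns every matching class, which corresponds to the "multiple/only one class" choice on the slide.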

Comparing Taxonomies
File -> Compare Taxonomies
File -> Read Structured Information
Co-occurrence counts and affinities
Trend
View documents
Transpose
Report (CSV)

Dictionary Co-occurrence
View -> Dictionary Co-occurrence
Type-ahead searching
Co-occurrence counts and affinities
Trend
View documents
Zoom in
Change Metric -> dependency
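Co-occurrence counts are straightforward to compute; the affinity measure is not defined in the slides. The Python sketch below counts how many documents contain each pair of dictionary terms and, as an assumed stand-in for affinity, reports the ratio of the observed count to the count expected if the two terms were independent. The function names and the choice of ratio are hypothetical.

```python
from collections import Counter
from itertools import combinations

def cooccurrence(docs_terms):
    """Count documents per term and per unordered term pair."""
    term_df = Counter()
    pair_df = Counter()
    for terms in docs_terms:
        unique = sorted(set(terms))          # sorted so pairs have a fixed order
        term_df.update(unique)
        pair_df.update(combinations(unique, 2))
    return term_df, pair_df

def affinity(pair, term_df, pair_df, n_docs):
    """Assumed metric: observed / expected co-occurrence under independence.

    Values above 1 mean the terms appear together more often than chance.
    """
    a, b = sorted(pair)
    expected = term_df[a] * term_df[b] / n_docs
    return pair_df[(a, b)] / expected if expected else 0.0

docs_terms = [
    ["printer", "jam"],
    ["printer", "jam"],
    ["printer", "offline"],
    ["login", "password"],
]
term_df, pair_df = cooccurrence(docs_terms)
score = affinity(("printer", "jam"), term_df, pair_df, len(docs_terms))
```

A dependency-style metric, as in the "Change Metric" option, could replace the ratio without changing the counting step.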

Advanced Features
Visualization
Subclass from Structured Information
Make Classifier
Read Template
Import Category
  Add a category from another saved taxonomy
Select Metrics
  Add other columns to the Class Table
BIW

Visualization
Look at relationships between selected classes
Discover sub-clusters
Find “borderline” examples
View/Move Documents
Navigator
Touring