eClassifier: Tool for Taxonomies


1 eClassifier: Tool for Taxonomies
Scott Spangler, IBM Almaden Research Center, San Jose, CA

2 Assertions on Taxonomy Generation
- Manual methods are too labor intensive, limit scope and scale, and are not maintainable
- Canned taxonomies are a niche solution
- There are many “natural” or “right” taxonomies, even on the same collection
- Clustering, canned taxonomies and other methods are good starting points, but not enough

3 Salient Features of eClassifier
- Clustering algorithm independent
  - bias towards speed for interaction
- Classification algorithm independent
  - evaluate multiple algorithms for given taxonomy
  - pick best algorithm for each level in taxonomy
- Multiple methods to seed taxonomy: import, clustering, query based
- Multiple methods for evaluating, editing and validating taxonomies
- Given a taxonomy, analysis/discovery against structured and unstructured information

4 eClassifier Principles
- Apply multiple text mining algorithms to textual data sets in a practical manner.
- Provide consistently good results; the goal is not perfection.
- Utilize domain expertise by giving the user control over the mining process.
- Provide tools, metrics and reports to draw useful conclusions from the analysis.

5 The Mining Process
- Create a dictionary of terms (words and phrases)
- Prune the dictionary (remove irrelevant terms)
- Cluster documents based on this dictionary
- Examine the resulting taxonomy, modifying it based on domain expertise
- Create multiple taxonomies (divide and conquer)
- Do deeper analysis by creating keyword classifications, comparing taxonomies, inspecting dictionary co-occurrence, and examining recent trends
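
The first three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not eClassifier's implementation: the slides do not say which tokenizer, pruning rule, or clustering algorithm the tool uses, so the names `build_dictionary`, `vectorize`, and `cluster`, the document-frequency threshold, and the small k-means-style loop with cosine similarity are all hypothetical choices.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumption: terms are lower-cased alphabetic words (phrases are ignored here).
    return re.findall(r"[a-z]+", text.lower())


def build_dictionary(docs, min_docs=2):
    # Create the dictionary, then prune terms that occur in fewer than min_docs documents.
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(tokenize(doc)))
    return sorted(term for term, n in doc_freq.items() if n >= min_docs)


def vectorize(doc, dictionary):
    # Represent a document as term counts over the dictionary.
    counts = Counter(tokenize(doc))
    return [float(counts[term]) for term in dictionary]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def cluster(vectors, seeds, rounds=5):
    # Hypothetical k-means-style loop: the documents at the seed indices are the
    # initial centroids; each round reassigns documents, then recomputes centroids.
    centroids = [list(vectors[i]) for i in seeds]
    labels = [0] * len(vectors)
    for _ in range(rounds):
        labels = [
            max(range(len(centroids)), key=lambda c: cosine(v, centroids[c]))
            for v in vectors
        ]
        for c in range(len(centroids)):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


docs = [
    "printer driver install failed",
    "install printer driver again",
    "password reset email",
    "reset password link email",
]
dictionary = build_dictionary(docs)
vectors = [vectorize(doc, dictionary) for doc in docs]
labels = cluster(vectors, seeds=[0, 2])
print(dictionary)
print(labels)
```

The resulting labels are only a starting taxonomy; the remaining steps on the slide (examining, editing, and comparing taxonomies) are interactive and are not modeled here.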

6 The Class Table
- For viewing and understanding each level in a taxonomy

7 Understanding Class Metrics
- Class Naming Convention
  - Shortest possible name that covers the examples
  - “,” => OR
  - “&” => AND
  - X_Y => X followed by Y
  - NONE => no useful text
  - Miscellaneous => no easy description
- Cohesion
  - A measure of similarity between documents in the same class (0 = different terms, 100 = same terms)
- Distinctness
  - A measure of how different the documents in a class are from documents in other classes (0 = very similar, 100 = unique)
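
The slide gives only the 0 to 100 scales for cohesion and distinctness, not the formulas behind them. As a rough sketch of how scores with those endpoints could be computed, the example below assumes documents are term-count vectors, takes cohesion as the average cosine similarity of a class's documents to the class centroid, and takes distinctness as one minus the similarity to the nearest other class centroid. These definitions and the function names are assumptions made for illustration.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def centroid(vectors):
    # Component-wise mean of the document vectors in one class.
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def cohesion(class_vectors):
    # Assumed definition: mean similarity to the class centroid, scaled to 0..100.
    center = centroid(class_vectors)
    total = sum(cosine(v, center) for v in class_vectors)
    return 100.0 * total / len(class_vectors)


def distinctness(class_vectors, other_classes):
    # Assumed definition: 100 when no other class centroid shares any terms,
    # 0 when some other class has an identical centroid direction.
    center = centroid(class_vectors)
    nearest = max((cosine(center, centroid(o)) for o in other_classes), default=0.0)
    return 100.0 * (1.0 - nearest)


printing = [[1.0, 0.0], [1.0, 0.0]]   # both documents use only term 0
network = [[0.0, 1.0], [0.0, 1.0]]    # both documents use only term 1
mixed = [[1.0, 0.0], [0.0, 1.0]]      # documents share no terms

print(cohesion(printing), cohesion(mixed))
print(distinctness(printing, [network]), distinctness(printing, [printing]))
```

Under these assumptions a class whose documents all use the same terms scores 100 on cohesion, and a class that shares no terms with any other class scores 100 on distinctness, matching the endpoints described on the slide.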

8 Dictionary Tool
- Edit -> Dictionary Tool
- Use this to edit the features on which the taxonomy is based
- Delete irrelevant or ambiguous terms
- Generate and edit synonyms

9 Dictionary Generation Files
- StopWords: words excluded from the dictionary
- Synonyms: different forms of the same semantic term
- IncludeWords: words that always appear in the dictionary
- Stock Phrases: text to be ignored in creating the dictionary
- Synonyms and Stock Phrases can be automatically generated and then edited
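
To make the roles of the four inputs concrete, here is a minimal sketch of dictionary generation that applies each of them. The on-disk format of these files is not described in the slides, so the sketch takes plain Python collections instead; the function name, the parameters, and the `min_docs` threshold are hypothetical.

```python
import re
from collections import Counter


def generate_dictionary(docs, stop_words, synonyms, include_words,
                        stock_phrases, min_docs=2):
    doc_freq = Counter()
    for doc in docs:
        text = doc.lower()
        # Stock phrases: text ignored when creating the dictionary.
        for phrase in stock_phrases:
            text = text.replace(phrase.lower(), " ")
        # Synonyms: map different forms onto one canonical term.
        terms = {synonyms.get(w, w) for w in re.findall(r"[a-z]+", text)}
        # Stop words: never enter the dictionary.
        doc_freq.update(terms - set(stop_words))
    frequent = {term for term, n in doc_freq.items() if n >= min_docs}
    # Include words: always present, even when they are rare.
    return sorted(frequent | set(include_words))


docs = [
    "Thank you for calling. Printers jammed",
    "Thank you for calling. Printer offline",
    "The modem is offline",
]
dictionary = generate_dictionary(
    docs,
    stop_words={"the", "is"},
    synonyms={"printers": "printer"},
    include_words={"modem"},
    stock_phrases=["thank you for calling"],
)
print(dictionary)
```

In this toy run "modem" occurs in only one document but survives because it is an include word, while the greeting never contributes terms because it is listed as a stock phrase.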

10 Refinement of Classes
- Subclass Classes
  - Subdivide an existing class into multiple subclasses at the next level in the taxonomy
- Merge Classes
- Delete Classes
- Rename Class
- Undo
  - Don’t be afraid to try things
- Save
  - .obj files contain all information eClassifier uses
  - .class files contain class membership
- Read

11 Class View
- For understanding the concepts and contents of a given class
- View the text
  - Most typical
  - Least typical
- View the source Web page
- View distinguishing terms
- View deduced rules for classification and related documents

12 Keyword Searching
- Edit -> Keyword Search
- Search for Dictionary terms
  - Use “and”, “or” and “_”
- Searching within a class
- Related Words
- Look at Trends
- Create new Classes
- See where the matching documents occur via Class Table

13 Document/Page Viewer
- Sorting Documents
  - Most typical
  - Least typical
- View distinguishing terms
  - Representative use of important words
- Moving documents
- Trend Reports

14 Keyword Class Generation
- Execute -> Classify by Keywords
- Open queries (KCG files)
  - One query per line
  - .AND., .OR., (, )
- Add, Rename, Delete queries
- Prioritize: move up and down
- Multiple/only one class
- Ambiguous/first matching class
- Run Queries
- Save Queries
- Run eClassifier
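
The query operators and the priority ordering listed above can be illustrated with a small evaluator. This is a sketch under several assumptions that the slides do not confirm: queries are supplied as hypothetical (class name, query) pairs rather than read from a real KCG file, tokens are separated by whitespace, `.AND.` binds more tightly than `.OR.`, and unmatched documents fall into a class named "Miscellaneous".

```python
import re

# Operators and parentheses first, then any other run of non-space characters.
TOKEN = re.compile(r"\.AND\.|\.OR\.|\(|\)|[^\s()]+")


def matches(query, words):
    """Evaluate one query against the set of lower-cased words in a document."""
    tokens = TOKEN.findall(query)
    pos = 0

    def parse_or():
        nonlocal pos
        value = parse_and()
        while pos < len(tokens) and tokens[pos] == ".OR.":
            pos += 1
            rhs = parse_and()  # always parse the right side, even if value is True
            value = value or rhs
        return value

    def parse_and():
        nonlocal pos
        value = parse_atom()
        while pos < len(tokens) and tokens[pos] == ".AND.":
            pos += 1
            rhs = parse_atom()
            value = value and rhs
        return value

    def parse_atom():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token == "(":
            value = parse_or()
            pos += 1  # skip the closing ")"
            return value
        return token.lower() in words

    return parse_or()


def classify(text, queries, first_match_only=True):
    # queries is ordered by priority; earlier entries win when only one class is allowed.
    words = set(re.findall(r"[a-z0-9]+", text.lower()))
    hits = [name for name, query in queries if matches(query, words)]
    if not hits:
        return ["Miscellaneous"]
    return hits[:1] if first_match_only else hits


queries = [
    ("Printing", "printer .AND. (jam .OR. offline)"),
    ("Network", "modem .OR. offline"),
]
print(classify("The printer is offline", queries))
print(classify("The printer is offline", queries, first_match_only=False))
print(classify("password reset", queries))
```

Moving a query up or down in the list changes which class wins in first-match mode, which is one way to read the "Prioritize" and "first matching class" options on the slide.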

15 Comparing Taxonomies
- File -> Compare Taxonomies
- File -> Read Structured Information
- Co-occurrence counts and affinities
- Trend
- View documents
- Transpose
- Report (CSV)
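
Comparing two taxonomies over the same documents amounts to building a cross-tabulation. The sketch below counts how often each pair of classes co-occurs and attaches a simple affinity score. The slides do not define "affinity", so the ratio of the observed count to the count expected if the two taxonomies were independent is an assumption made for illustration, as is the function name.

```python
from collections import Counter


def compare_taxonomies(labels_a, labels_b):
    """labels_a[i] and labels_b[i] are the classes of document i in each taxonomy."""
    n = len(labels_a)
    counts = Counter(zip(labels_a, labels_b))
    total_a = Counter(labels_a)
    total_b = Counter(labels_b)
    table = {}
    for (a, b), observed in counts.items():
        # Count expected for this cell if class a and class b were unrelated.
        expected = total_a[a] * total_b[b] / n
        table[(a, b)] = (observed, observed / expected)
    return table


topic = ["printer", "printer", "network", "network"]
quarter = ["Q1", "Q1", "Q1", "Q2"]   # e.g. a class derived from structured information
table = compare_taxonomies(topic, quarter)
for cell, (count, affinity) in sorted(table.items()):
    print(cell, count, round(affinity, 2))
```

A ratio above 1 marks a pair of classes that co-occur more often than chance would suggest, which is the kind of cell a user would likely drill into with "View documents".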

16 Dictionary Co-occurrence
- View -> Dictionary Co-occurrence
- Type ahead searching
- Co-occurrence counts and affinities
- Trend
- View documents
- Zoom in
- Change Metric -> dependency
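
Dictionary co-occurrence can be pictured as counting, for every pair of dictionary terms, the documents that contain both. The sketch below does that and adds one directional score, the share of documents containing one term that also contain the other. Whether this resembles the tool's "dependency" metric is not stated in the slides; the score and the function names are assumptions.

```python
from collections import Counter
from itertools import combinations


def term_cooccurrence(docs_terms):
    """docs_terms holds one set of dictionary terms per document."""
    single = Counter()
    pairs = Counter()
    for terms in docs_terms:
        single.update(terms)
        # Sort so that each unordered pair is always stored under the same key.
        pairs.update(combinations(sorted(terms), 2))
    return single, pairs


def conditional_share(single, pairs, a, b):
    # Share of the documents containing `a` that also contain `b`.
    both = pairs[tuple(sorted((a, b)))]
    return both / single[a] if single[a] else 0.0


docs_terms = [
    {"printer", "jam"},
    {"printer", "offline"},
    {"printer", "jam", "paper"},
]
single, pairs = term_cooccurrence(docs_terms)
print(pairs[("jam", "printer")])
print(conditional_share(single, pairs, "jam", "printer"))
print(conditional_share(single, pairs, "printer", "jam"))
```

The score is asymmetric: every "jam" document mentions "printer", but only some "printer" documents mention "jam".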

17 Advanced Features
- Visualization
- Subclass from Structured Information
- Make Classifier
- Read Template
- Import Category
  - Add a category from another saved taxonomy
- Select Metrics
  - Add other columns to the Class table
- BIW

18 Visualization
- Look at relationships between selected classes
- Discover sub-clusters
- Find “borderline” examples
- View/Move Documents
- Navigator
- Touring

