Definitions
  Clustering – Automatically generating groups of similar documents based on distance or proximity measures
  Categorizing – Analyzing documents and assigning them to predefined categories
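The difference between the two definitions can be sketched in a few lines of Python. This is only an illustration, not the method of any particular product: the bag-of-words vectors, the cosine similarity measure, the greedy single-pass grouping, the 0.3 threshold, and the sample documents and category descriptions are all assumptions chosen to keep the example small.

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words term counts for a lowercased, whitespace-tokenized text."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine similarity between two term-count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def cluster(docs, threshold=0.3):
    """Clustering: groups emerge from the documents themselves.
    A document joins the first cluster whose seed document is similar
    enough; otherwise it starts a new cluster. Threshold is an assumption."""
    vectors = [vectorize(d) for d in docs]
    clusters = []  # each cluster is a list of document indices
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine_similarity(vec, vectors[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters

def categorize(doc, categories):
    """Categorizing: the groups are predefined; pick the best-matching one.
    `categories` maps a category name to a short descriptive text."""
    vec = vectorize(doc)
    return max(categories,
               key=lambda name: cosine_similarity(vec, vectorize(categories[name])))

# Hypothetical sample data
docs = [
    "quarterly sales report revenue",
    "revenue forecast sales pipeline",
    "server outage incident report",
    "incident postmortem server downtime",
]
categories = {
    "Finance": "sales revenue forecast budget",
    "IT Operations": "server outage incident downtime",
}

groups = cluster(docs)
label = categorize("budget forecast for next quarter", categories)
```

With this sample data, `cluster` discovers two unnamed groups on its own, while `categorize` can only ever return one of the names supplied up front.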
The Mantra
  Knowledge is in the eye of the beholder, but reflecting end-user needs is as critical as representing texts... and it takes work!
Benefits
  Discovery: the serendipitous find
  Navigation: browsing, relationships between categories
  Analysis: previously unknown topics
Benefits
  Reuse information
  Improve information quality
  Make new connections
  Identify affinities
  Enhance full-text search
High-Level Process
  Determine user information needs
  Create initial taxonomy
  Edit, rename categories
  Create affinities
  Categorize new documents
  Test the UI
  Train the taxonomy
Determine end-user needs
  Audit information needs
  Audit content
    – Is there an existing taxonomy?
    – How clean is the metadata?
    – Look for existing descriptive fields
  Select sources
    – Map to an existing business process
    – Get functional buy-in
    – Be format-agnostic, but look for lots of text
Create initial taxonomy
  Rules
    – Coverage: wide or deep
    – Number of levels
    – Categories per document
    – Documents per category
  Manual versus automatic
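The rules above can be treated as limits that a draft taxonomy is checked against. The following is a minimal sketch under stated assumptions: the data layout (a category path mapped to a set of document ids), the function name `check_taxonomy`, and the three default limits are hypothetical and would need to be set per project.

```python
from collections import defaultdict

def check_taxonomy(assignments, max_levels=3, max_cats_per_doc=2, min_docs_per_cat=2):
    """Report violations of simple taxonomy rules.

    assignments: dict mapping a category path (tuple of labels, root first)
                 to the set of document ids placed in that category.
    The limits are illustrative defaults, not recommended values.
    """
    problems = []
    cats_per_doc = defaultdict(int)
    for path, doc_ids in assignments.items():
        name = "/".join(path)
        if len(path) > max_levels:             # rule: number of levels
            problems.append(f"too deep: {name}")
        if len(doc_ids) < min_docs_per_cat:    # rule: documents per category
            problems.append(f"too few documents: {name}")
        for doc_id in doc_ids:
            cats_per_doc[doc_id] += 1
    for doc_id, count in sorted(cats_per_doc.items()):
        if count > max_cats_per_doc:           # rule: categories per document
            problems.append(f"too many categories: {doc_id}")
    return problems

# Hypothetical draft taxonomy
draft = {
    ("Finance",): {"d1", "d2"},
    ("Finance", "Forecasts"): {"d2", "d3"},
    ("IT", "Ops", "Incidents", "2024"): {"d2"},
}
report = check_taxonomy(draft)
```

Here the four-level IT branch is flagged as both too deep and too sparse, and document `d2` is flagged for sitting in three categories. Whether to loosen a limit or restructure the taxonomy remains an editorial decision.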
Edit, rename categories
  Editing process
    – Scan the entire taxonomy
    – Spot-check the documents in the categories
    – Focus on unique terms in the labels and assign new names as you go
    – Move documents to appropriate categories when necessary
    – Merge and delete redundant categories
  Term approval process
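One way to find candidates for the "merge and delete redundant categories" step is to look for categories that hold nearly the same documents. The sketch below uses Jaccard overlap of document sets as the redundancy signal; that measure, the 0.6 threshold, and the sample categories are assumptions for illustration, and the output is a list for an editor to review rather than an automatic merge.

```python
from itertools import combinations

def jaccard(a, b):
    """Share of documents two categories have in common (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def redundant_pairs(categories, threshold=0.6):
    """Return category name pairs whose document overlap meets the threshold.

    categories: dict mapping category name -> set of document ids.
    """
    return [
        (x, y)
        for x, y in combinations(sorted(categories), 2)
        if jaccard(categories[x], categories[y]) >= threshold
    ]

# Hypothetical categories after an automatic first pass
categories_to_review = {
    "Sales Reports": {1, 2, 3, 4},
    "Revenue Reports": {2, 3, 4},
    "Incidents": {7, 8},
}
candidates = redundant_pairs(categories_to_review)
```

In this sample, "Revenue Reports" and "Sales Reports" share three of four documents, so the pair is surfaced for a merge decision and a new name.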
Categorize documents
  Manual or automatic?
  Authors, content experts, or editors?
  Legacy applications?
Test
  Test with users
    – Use their UI
  Can users find what they need?
  Any missing categories?
  Do the groups of documents make sense?
  Do the categories complement full-text search?
Issues
  Set appropriate expectations
  Control the organization of information
  Trust the system
  Human intervention vs. impartiality
  Legacy controlled vocabularies
  Tight integration with IT/Admin