Content Categorization A Road Map Julia Marshall USAID (Bridgeborn Inc.)
What Is Your Goal?
Write Down Your Goal. Samples: discover which topics are most mentioned in a set of documents and whether they change over time; create metadata records more quickly than human catalogers/indexers can.
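To make the first sample goal concrete, here is a minimal Python sketch of counting topic mentions per year. The document list, the topic list, and the function name `topic_counts_by_year` are all hypothetical, and plain substring matching stands in for whatever a real text analytics tool would do.

```python
from collections import Counter, defaultdict


def topic_counts_by_year(documents, topics):
    """Count, per year, how many documents mention each topic at least once.

    documents: iterable of (year, text) pairs (hypothetical input shape).
    topics: list of topic words or phrases to look for.
    """
    counts = defaultdict(Counter)
    for year, text in documents:
        lowered = text.lower()
        for topic in topics:
            # Naive case-insensitive substring match, for illustration only.
            if topic.lower() in lowered:
                counts[year][topic] += 1
    return counts


# Made-up sample data, not taken from any real collection.
documents = [
    (2008, "Water sanitation projects expanded in rural areas."),
    (2008, "Education and water access were the main priorities."),
    (2009, "Education funding increased for primary schools."),
]
topics = ["water", "education"]

counts = topic_counts_by_year(documents, topics)
for year in sorted(counts):
    print(year, counts[year].most_common())
```

Comparing the per-year counters shows whether a topic is rising or fading across the collection.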
Flavors of Text Analytics. Text Mining: discovering data; finding patterns within the data. Auto-Categorization: structuring data to a schema; assigning pre-determined tags.
Assess Your Resources: People; Materials to be categorized; Processes; Metadata schemas; Systems; Budget
People: How many people will you have? What is their expertise (IT people, indexers/subject matter experts, web developers/designers, project manager)? How much time will they have to devote to this project?
Materials to be categorized: How much material? What format is it in: paper, digital files, OCR’d files? What shape is it in?
Processes: Are processes already in place for categorization? If so, how is the process done? Who does it? How standardized is it?
Metadata Schemas: Does your organization have a thesaurus of topics? Personal name authority files? Organizational name authority files? Gazetteers or geographic names? A standard list of document types? A standard way of handling dates?
Systems: Will there be a system that consumes output from the SAS Content Categorization Studio? How will the system consume the SAS output? Will code be needed to pull the text of the documents through SAS, and to push the SAS output into your consuming system?
Budget: How much money can you spend?
Assess the Costs: Tools; Application; Server space/equipment; Staff time; Preparatory costs
Select a plan/tool that best fits your organization’s needs: Revisit your original goal. What do you have the resources to do? Revise your goal to fit your circumstances. Find the best tool for the job.
Strategize the Implementation: What metadata/processes to automate? What are the priorities for those processes? Which are the easiest to automate? How much time will it take? Who’s doing what?
A Brief Digression on Project Management
Manage the Management: Manage expectations. Pick a “Quick Win” piece of the project. Keep them informed at a level that they can understand.
The SAS Content Categorization Studio is Plugged in - Now What Do I Do?
Create Profiles: For each piece of metadata to auto-categorize, write a profile that tells the application which terms to assign to each document. Each term needs its own set of rules telling the application when to apply that term, and when not to.
Tips for Writing Profile Rules: Simpler is better, at first. Analyze a sample of the documents to be auto-categorized: what words show up with the term? Differentiate between “Concept” and “Context”. Document your rules and your updates as you write them.
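The idea of a rule that says when to apply a term, and when not to, can be illustrated with a small Python sketch. This is not SAS rule syntax; `TermRule`, its fields, and the sample phrases are hypothetical, and the split into "include" phrases (the concept) and "exclude" phrases (a misleading context) is only one possible reading of the Concept/Context tip.

```python
from dataclasses import dataclass, field


@dataclass
class TermRule:
    """A hypothetical stand-in for one term's rule in a profile."""
    term: str
    include: list = field(default_factory=list)  # phrases that signal the concept
    exclude: list = field(default_factory=list)  # phrases that signal the wrong context

    def applies(self, text):
        lowered = text.lower()
        has_concept = any(p.lower() in lowered for p in self.include)
        wrong_context = any(p.lower() in lowered for p in self.exclude)
        # Apply the term only when the concept appears outside a misleading context.
        return has_concept and not wrong_context


rule = TermRule(
    term="Banking",
    include=["bank", "lending"],
    exclude=["river bank", "food bank"],
)

print(rule.applies("The central bank raised lending rates."))  # True
print(rule.applies("Erosion along the river bank worsened."))  # False
```

Starting with a short include list and adding exclusions only when the sample output shows false matches keeps the rule simple at first, as the tips suggest.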
Collect Sample Sets of Documents: Need at least 3 sets (probably more): 1st set for writing the profile; 2nd set for testing; 3rd set for the final test.
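One way to build the three sets is a seeded random split, sketched below in Python. The function name, the set labels, and the equal-thirds split are assumptions; a real project might size the sets differently or sample by document type.

```python
import random


def split_sample_sets(doc_ids, seed=0):
    """Shuffle document IDs reproducibly and split them into three sets."""
    shuffled = list(doc_ids)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split can be repeated
    third = len(shuffled) // 3
    return {
        "profile_writing": shuffled[:third],
        "testing": shuffled[third:2 * third],
        "final_test": shuffled[2 * third:],  # any remainder lands in the final set
    }


sample_sets = split_sample_sets(range(9))
for name, ids in sample_sets.items():
    print(name, ids)
```

Keeping the sets disjoint matters: rules tuned on the first set should be judged on documents they have never seen.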
Run the 1st Sample Set Against Your Profile: Each document will have terms that SAS assigned to it. Each term will have a relevancy score. Rank the terms from highest to lowest relevancy score. Look at the top 5-10 terms.
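The ranking step can be sketched in a few lines of Python. The dictionary of term-to-score pairs is an assumed shape for one document's output, and the terms and scores are invented; the actual SAS output format is not shown in this presentation.

```python
def top_terms(term_scores, n=10):
    """Return the n highest-scoring (term, score) pairs, best first."""
    ranked = sorted(term_scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]


# Hypothetical relevancy scores for a single document.
scores = {
    "Agriculture": 0.82,
    "Irrigation": 0.64,
    "Development": 0.31,
    "Water supply": 0.77,
}

for term, score in top_terms(scores, n=3):
    print(f"{score:.2f}  {term}")
```

Reviewing only the top few terms per document keeps the evaluation manageable while the rules are still changing.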
Evaluate the Output: Do the top 5-10 terms make sense? Are the terms too general? What phrases in the set of documents caused SAS to pick those terms? How do you need to rewrite the rules?
Rewrite the Rules in the Profile Based on the Output
Repeat Repeat Repeat Repeat Repeat Repeat Repeat As Needed
I’ve Created the Profile and the Output Is the Way I Want It. Now What?
Integrate the Output: Design the workflow; Interface design; Connect to local systems; Train staff; More tests
Design the Workflow I: Where is the data in each step? Who is handling the data? What has to happen to move the data to the next step? Documents -> SAS Profile -> Java Code -> XML Code -> Metadata in DEC
Design the Workflow II
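As an illustration of the step that turns assigned terms into XML for a consuming system, here is a small Python sketch. The presentation's own workflow uses Java code for this step; the element names (`record`, `term`), the `score` attribute, and the function name are hypothetical, not a DEC or SAS schema.

```python
import xml.etree.ElementTree as ET


def build_metadata_record(doc_id, terms):
    """Serialize a document's assigned terms as a small XML record.

    terms: list of (term, relevancy_score) pairs (assumed input shape).
    """
    record = ET.Element("record", attrib={"id": doc_id})
    for term, score in terms:
        element = ET.SubElement(record, "term", attrib={"score": f"{score:.2f}"})
        element.text = term
    return ET.tostring(record, encoding="unicode")


# Hypothetical document ID and terms.
xml_text = build_metadata_record("doc-001", [("Kenya", 0.9), ("East Africa", 0.75)])
print(xml_text)
```

Writing down the exact record shape early helps answer the workflow questions about what has to happen to move data to the next step.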
Interface Design Sample (form fields): USAID Geographic Term(s); SAS values; SAS GeoDescriptor Run Date
Connect to Local Systems
Train Staff: IT staff; Profile managers; Output evaluators
Test the Integrated System: Gather test samples, again! Run the profile in your test environment. Does the output stay the same? Can you update the profiles? Are other users of the system able to use/update the output?
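The question "Does the output stay the same?" can be checked mechanically by comparing a saved baseline run with the run from the test environment. The Python sketch below assumes each run is available as a mapping from document ID to assigned terms; that shape, the function name, and the sample data are hypothetical.

```python
def diff_outputs(baseline, current):
    """Report, per document, which terms were added or removed between two runs."""
    changes = {}
    for doc_id in sorted(set(baseline) | set(current)):
        before = set(baseline.get(doc_id, []))
        after = set(current.get(doc_id, []))
        if before != after:
            changes[doc_id] = {
                "added": sorted(after - before),
                "removed": sorted(before - after),
            }
    return changes


# Made-up results from two runs of the same profile.
baseline_run = {"doc-001": ["Kenya", "Agriculture"], "doc-002": ["Education"]}
test_env_run = {"doc-001": ["Kenya", "Irrigation"], "doc-002": ["Education"]}

print(diff_outputs(baseline_run, test_env_run))
```

An empty result means the integrated system reproduces the earlier output; any entry points to a document worth inspecting by hand.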
If you answered yes: CELEBRATE!
Maintain the System: Documentation; Tests; Staff training; Follow-up evaluations
Lessons Learned the Hard Way: Be careful using outside data. Buy only what you need.