Content Categorization: A Road Map. Julia Marshall, USAID (Bridgeborn Inc.)

1 Content Categorization: A Road Map. Julia Marshall, USAID (Bridgeborn Inc.)


3 What Is Your Goal?

4 Write Down Your Goal. Samples: (1) Discover what topics are most mentioned in a set of documents and whether they change over time. (2) Create metadata records more quickly than we can with human catalogers/indexers.

5 Flavors of Text Analytics. Text Mining: discovering data; finding patterns within the data. Auto-Categorization: structuring data to a schema; assigning pre-determined tags.

6 Assess Your Resources: People; Materials to be categorized; Processes; Metadata schemas; Systems; Budget

7 People: How many people will you have? What is their expertise (IT people, indexers/subject matter experts, web developers/designers, project manager)? How much time will they have to devote to this project?

8 Materials to be categorized: How much material? What format is it in: paper, digital files, OCR’d files? What shape is it in?

9 Processes: Are processes already in place for categorization? If so, how is the process done? Who does the process? How standardized is the process?

10 Metadata Schemas: Does your organization have a thesaurus of topics? Personal name authority files? Organizational name authority files? Gazetteers or geographic names? A standard list of document types? A standard way of handling dates?

11 Systems: Will there be a system that consumes output from the SAS Content Categorization Studio? How will the system consume the SAS output? Will there need to be code to pull the text of the documents through SAS? Code to push the SAS output into your consuming system?

12 Budget How much money can you spend?

13 Assess the Costs: Tools; Application; Server space/equipment; Staff time; Preparatory costs

14 Select a plan/tool that best fits your organization’s needs: Revisit your original goal. What do you have the resources to do? Revise your goal to fit your circumstances. Find the best tool for the job.

15 Strategize the Implementation: What metadata/processes should be automated? What are the priorities among those processes? Which are the easiest to automate? How much time will it take? Who’s doing what?

16 A Brief Digression on Project Management

17 Manage the Management: Manage expectations. Pick a “Quick Win” piece of the project. Keep them informed at a level that they can understand.

18 The SAS Content Categorization Studio is Plugged in - Now What Do I Do?

19 Create Profiles: For each piece of metadata to auto-categorize, write a profile that tells the application which terms to assign to each document. Each term needs its own set of rules that tell the application when to apply that particular term, and when not to.

20 Tips for Writing Profile Rules: Simpler is better, at first. Analyze a sample of the documents to be auto-categorized: what words show up with the term? Differentiate between “Concept” and “Context”. Document your rules and your updates as you write them.

21 Sample Profile Build
Logging: (OR, (MIN_2, "Logging", "Selective logging", "Illegal logging", "Logging concession", "Timber extraction", "Sawmill", (SENT, "logging", "impact"), (SENT, "timber", "harvest")))
Trees: (OR, (MAXOC_50, (NOTIN, "Trees", "Teak trees"), (NOTIN, "trees", "fruit trees"), (NOTINSENT, "Trees", "Timber"), (NOTINSENT, "trees", "logging")))
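To make the Logging rule above easier to reason about, here is a minimal Python sketch of how such a nested rule could be evaluated. It is not the SAS engine. The operator meanings are assumptions made for illustration: `MIN_2` is taken to mean "at least two arguments match" and `SENT` to mean "all terms occur in the same sentence". The function names `sentences` and `evaluate` are hypothetical, and matching is a plain case-insensitive substring test.

```python
import re

def sentences(text):
    """Split text into lowercase sentences on ., ! or ? (a deliberately naive splitter)."""
    return [s.strip().lower() for s in re.split(r"[.!?]+", text) if s.strip()]

def evaluate(rule, text):
    """Evaluate a nested rule tuple against text; a plain string is a phrase match."""
    if isinstance(rule, str):
        # Assumed matching: case-insensitive substring, so overlapping phrases
        # such as "Logging" and "Illegal logging" each count as a match.
        return rule.lower() in text.lower()
    op, *args = rule
    if op == "OR":
        return any(evaluate(a, text) for a in args)
    if op == "AND":
        return all(evaluate(a, text) for a in args)
    if op.startswith("MIN_"):
        # Assumed meaning: at least N of the arguments must match.
        needed = int(op.split("_")[1])
        return sum(evaluate(a, text) for a in args) >= needed
    if op == "SENT":
        # Assumed meaning: every term appears within one sentence.
        return any(all(t.lower() in s for t in args) for s in sentences(text))
    raise ValueError(f"unsupported operator: {op}")

# The Logging rule from the slide, transcribed as nested tuples.
LOGGING_RULE = (
    "OR",
    ("MIN_2", "Logging", "Selective logging", "Illegal logging",
     "Logging concession", "Timber extraction", "Sawmill",
     ("SENT", "logging", "impact"), ("SENT", "timber", "harvest")),
)

if __name__ == "__main__":
    print(evaluate(LOGGING_RULE, "Illegal logging near the sawmill has increased."))
    print(evaluate(LOGGING_RULE, "Logging is mentioned once."))
```

Under these assumptions, a single passing mention of "logging" does not trigger the term, which is the kind of "when not to apply" behavior the profile rules are meant to encode.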

22 Collect Sample Sets of Documents: Need at least 3 sets (probably more): 1st set for writing the profile, 2nd set for testing, 3rd set for the final test.
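One way to get three non-overlapping sample sets is to shuffle the document identifiers once and deal them out in turn. This is a rough sketch only: the function name `split_sample_sets`, the `doc-NNN` identifiers, and the fixed seed are all hypothetical choices for illustration.

```python
import random

def split_sample_sets(doc_ids, n_sets=3, seed=0):
    """Shuffle document IDs reproducibly and deal them into n_sets disjoint lists."""
    ids = list(doc_ids)
    random.Random(seed).shuffle(ids)  # fixed seed so the split can be repeated
    return [ids[i::n_sets] for i in range(n_sets)]

# Hypothetical corpus of 30 document IDs.
all_docs = [f"doc-{i:03d}" for i in range(30)]
writing_set, testing_set, final_set = split_sample_sets(all_docs)

if __name__ == "__main__":
    print(len(writing_set), len(testing_set), len(final_set))
```

Keeping the sets disjoint matters because rules tuned on the writing set will tend to look better on those same documents than on unseen ones.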

23 Run the 1st Sample Set Against Your Profile: Each document will have terms that SAS assigned to it. Each term will have a relevancy score. Rank the terms from highest to lowest relevancy score. Look at the top 5-10 terms.
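The ranking step can be sketched in a few lines of Python, assuming the assigned terms and their relevancy scores have already been exported for one document as a simple term-to-score mapping. The export format, the function name `top_terms`, and the scores in `sample_output` are invented for illustration and are not real SAS output.

```python
def top_terms(assigned, n=10):
    """Return the n (term, score) pairs with the highest relevancy scores."""
    return sorted(assigned.items(), key=lambda pair: pair[1], reverse=True)[:n]

# Hypothetical scores for one document.
sample_output = {"Logging": 0.91, "Forestry": 0.74, "Trees": 0.12, "Timber": 0.55}

if __name__ == "__main__":
    for term, score in top_terms(sample_output, n=3):
        print(f"{score:.2f}  {term}")
```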

24 Evaluate the Output: Do the top 5-10 terms make sense? Are the terms too general? What phrases in the set of documents caused SAS to pick those terms? How do you need to rewrite the rules?

25 Rewrite the Rules in the Profile Based on the Output

26 Repeat Repeat Repeat Repeat Repeat Repeat Repeat As Needed

27 I’ve Created the Profile. The Output Is the Way I Want. Now What?

28 Integrate the Output: Design the workflow; Interface design; Connect to local systems; Train staff; More tests

29 Design the Workflow I: Where is the data in each step? Who is handling the data? What has to happen to move the data to the next step? [Diagram labels: Documents, SAS Profile, Java Code, XML Code, Metadata in DEC]
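The slide names Java code and XML as the steps that move categorization output toward the metadata system. As a rough illustration of that hand-off, written in Python only to keep the examples in one language, the sketch below wraps one document's assigned terms in an XML record. The element and attribute names (`record`, `term`, `relevancy`) and the function `build_record` are hypothetical; the actual schema expected by the consuming system is not given in the slides.

```python
import xml.etree.ElementTree as ET

def build_record(doc_id, terms):
    """Serialize (term, score) pairs for one document as an XML string."""
    record = ET.Element("record", attrib={"id": doc_id})
    for term, score in terms:
        # Hypothetical layout: one <term> element per assigned term.
        element = ET.SubElement(record, "term", attrib={"relevancy": f"{score:.2f}"})
        element.text = term
    return ET.tostring(record, encoding="unicode")

if __name__ == "__main__":
    print(build_record("doc-001", [("Logging", 0.91), ("Forestry", 0.74)]))
```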

30 Design the Workflow II

31 Interface Design Sample: USAID Geographic Term(s): USAID Geographic Term(s) SAS values: SAS GeoDescriptor Run Date:

32 Connect to Local Systems

33 Train Staff: IT staff; Profile managers; Output evaluators

34 Test the Integrated System: Gather test samples, again! Run the profile in your test environment. Does the output stay the same? Can you update the profiles? Are other users of the system able to use/update the output?

35 If you answered yes: CELEBRATE!
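The "Does the output stay the same?" question from slide 34 can be checked mechanically by comparing the terms assigned before integration with those assigned in the test environment. A minimal sketch follows, assuming both runs have been exported as document-ID-to-term-list mappings; that export format and the function name `diff_outputs` are assumptions for illustration.

```python
def diff_outputs(baseline, candidate):
    """Report, per document, terms that disappeared or appeared between two runs."""
    changes = {}
    for doc_id in sorted(set(baseline) | set(candidate)):
        before = set(baseline.get(doc_id, ()))
        after = set(candidate.get(doc_id, ()))
        if before != after:
            changes[doc_id] = {
                "missing": sorted(before - after),  # in baseline only
                "added": sorted(after - before),    # in candidate only
            }
    return changes

# Hypothetical outputs from the standalone run and the integrated test run.
baseline_run = {"doc-001": ["Logging", "Trees"], "doc-002": ["Forestry"]}
integrated_run = {"doc-001": ["Logging"], "doc-002": ["Forestry"]}

if __name__ == "__main__":
    print(diff_outputs(baseline_run, integrated_run))
```

An empty result means the integrated system reproduces the earlier output for every document in the sample.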

36 Maintain the System: Documentation; Tests; Staff training; Follow-up evaluations

37 Lessons Learned the Hard Way: Be careful using outside data. Buy only what you need.

38 Thank you! Email:
