Content Categorization: A Road Map. Julia Marshall, USAID (Bridgeborn Inc.)

Presentation transcript:

What Is Your Goal?

Write Down Your Goal. Samples: (1) Discover what topics are most mentioned in a set of documents and whether they change over time. (2) Create metadata records more quickly than we can with human catalogers/indexers.

Flavors of Text Analytics. Text Mining: discovering data; finding patterns within the data. Auto-Categorization: structuring data to a schema; assigning pre-determined tags.

Assess Your Resources: People; Materials to be categorized; Processes; Metadata schemas; Systems; Budget.

People: How many people will you have? What is their expertise (IT people, indexers/subject matter experts, web developers/designers, project manager)? How much time will they have to devote to this project?

Materials to be categorized: How much material? What format is it in? Paper? Digital files? OCR’d files? What shape is it in?

Processes: Are processes already in place for categorization? If so, how is the process done? Who does the process? How standardized is the process?

Metadata Schemas: Does your organization have a thesaurus of topics? Personal name authority files? Organizational name authority files? Gazetteers or geographic names? A standard list of document types? A standard way of handling dates?

Systems: Will there be a system that consumes output from the SAS Content Categorization Studio? How will the system consume the SAS output? Will there need to be code to pull the text of the documents through SAS? Code to push the SAS output into your consuming system?

Budget: How much money can you spend?

Assess the Costs: Tools; Application; Server space/equipment; Staff time; Preparatory costs.

Select a plan/tool that best fits your organization’s needs: Revisit your original goal. What do you have the resources to do? Revise your goal to fit your circumstances. Find the best tool for the job.

Strategize the Implementation: What metadata/processes to automate? What are priorities for the above processes? What are the easiest to automate? How much time will it take? Who’s doing what?

A Brief Digression on Project Management

Manage the Management: Manage expectations. Pick a “Quick Win” piece of the project. Keep them informed at a level that they can understand.

The SAS Content Categorization Studio is Plugged in - Now What Do I Do?

Create Profiles: For each piece of metadata to auto-categorize, write a profile that tells the application which terms to assign for each document. Each term will need a unique set of rules that tell the application when to apply that particular term, and when not to.

Tips for Writing Profile Rules: Simpler is better, at first. Analyze a sample of documents to be auto-categorized: what words show up with the term? Differentiate between “Concept” and “Context”. Document your rules and your updates as you write them.

Sample Profile Build
Logging: (OR, (MIN_2, “Logging”, “Selective logging”, “Illegal logging”, “Logging concession”, “Timber extraction”, “Sawmill”, (SENT, “logging”, “impact”), (SENT, “timber”, “harvest”)))
Trees: (OR, (MAXOC_50, (NOTIN, “Trees”, “Teak trees”), (NOTIN, “trees”, “fruit trees”), (NOTINSENT, “Trees”, “Timber”), (NOTINSENT, “trees”, “logging”)))
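To make the shape of the Logging rule concrete, here is a toy evaluator in Python. It is only a sketch under assumed semantics, not how SAS Content Categorization Studio evaluates rules: it assumes MIN_2 means "at least two of the arguments match", SENT means "all listed words occur in the same sentence", and plain terms match as case-insensitive substrings. The function names are hypothetical.

```python
import re

def sentences(text):
    """Split text into lowercase sentences on ., ! or ? (a crude splitter)."""
    return [s.strip().lower() for s in re.split(r"[.!?]+", text) if s.strip()]

def has_term(text, term):
    """True if the term occurs anywhere in the text, ignoring case."""
    return term.lower() in text.lower()

def sent(text, *terms):
    """Assumed SENT semantics: all terms occur within a single sentence."""
    return any(all(t.lower() in s for t in terms) for s in sentences(text))

def min_n(n, matches):
    """Assumed MIN_n semantics: at least n of the arguments match."""
    return sum(1 for m in matches if m) >= n

def logging_rule(text):
    """Toy version of the Logging rule from the sample profile."""
    matches = [
        has_term(text, "Logging"),
        has_term(text, "Selective logging"),
        has_term(text, "Illegal logging"),
        has_term(text, "Logging concession"),
        has_term(text, "Timber extraction"),
        has_term(text, "Sawmill"),
        sent(text, "logging", "impact"),
        sent(text, "timber", "harvest"),
    ]
    # The outer OR has a single argument here, so it reduces to the MIN_2 test.
    return min_n(2, matches)
```

Because substring matching lets a phrase such as "Illegal logging" also satisfy the bare "Logging" argument, this toy is more permissive than a real rule engine is likely to be; it is meant for reasoning about rule structure, not for predicting SAS output.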

Collect Sample Sets of Documents: Need at least 3 sets (probably more). 1st set for writing the profile; 2nd set for testing; 3rd set for the final test.
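One way to build the three sets without overlap is to shuffle the document identifiers once and deal them out. The sketch below assumes documents can be referred to by an ID; the set names and the fixed seed are illustrative choices, not part of the talk.

```python
import random

def split_sample_sets(doc_ids, seed=0):
    """Shuffle document IDs and deal them into three roughly equal, disjoint sets."""
    ids = list(doc_ids)
    random.Random(seed).shuffle(ids)  # fixed seed so the split is repeatable
    names = ("writing", "testing", "final_test")
    # Every third ID, starting at offset i, goes to set i.
    return {name: ids[i::3] for i, name in enumerate(names)}
```

Keeping the split repeatable matters here because the final-test set should stay unseen while the rules are being rewritten.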

Run the 1st Sample Set Against Your Profile: Each document will have terms that SAS assigned to it. Each term will have a relevancy score. Rank the terms from highest to lowest relevancy score. Look at the top 5-10 terms.
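The ranking step can be sketched as follows. The input format is an assumption: the sketch takes one document's assigned terms as a Python dict of term to relevancy score, which is not necessarily how SAS delivers its output, and the scores in the usage example are made up.

```python
def top_terms(scored_terms, k=10):
    """Return the k (term, score) pairs with the highest relevancy scores."""
    ranked = sorted(scored_terms.items(), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Hypothetical scores for one document.
example = {"Logging": 0.9, "Trees": 0.4, "Forests": 0.7}
print(top_terms(example, k=2))
```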

Evaluate the Output: Do the top 5-10 terms make sense? Are the terms too general? What phrases in the set of documents caused SAS to pick those terms? How do you need to rewrite the rules?

Rewrite the Rules in the Profile Based on the Output

Repeat Repeat Repeat Repeat Repeat Repeat Repeat As Needed

I’ve Created the Profile and the Output Is the Way I Want. Now What?

Integrate the Output: Design the workflow; Interface design; Connect to local systems; Train staff; More tests.

Design the Workflow I: Where is the data in each step? Who is handling the data? What has to happen to move the data to the next step? Flow: Documents → SAS Profile → Java Code → XML Code → Metadata in DEC.
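As an illustration of the hand-off between the categorizer output and the XML step, the sketch below wraps a document's assigned terms in a small XML record. The slide names Java as the glue code; Python is used here only for brevity. The element and attribute names (record, term, id, score) are hypothetical and do not reflect the actual schema used for the DEC.

```python
import xml.etree.ElementTree as ET

def build_metadata_xml(doc_id, terms):
    """Serialize (term, score) pairs for one document as an XML string."""
    record = ET.Element("record", attrib={"id": doc_id})
    for term, score in terms:
        # One <term> element per assigned term, with its score as an attribute.
        el = ET.SubElement(record, "term", attrib={"score": f"{score:.2f}"})
        el.text = term
    return ET.tostring(record, encoding="unicode")

# Hypothetical document ID and scores.
xml_text = build_metadata_xml("doc-001", [("Logging", 0.9), ("Forests", 0.7)])
print(xml_text)
```

Writing each step's output in a fixed, inspectable format like this makes it easier to answer the slide's questions about where the data is and what has to happen to move it on.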

Design the Workflow II

Interface Design Sample: USAID Geographic Term(s): USAID Geographic Term(s) SAS values: SAS GeoDescriptor Run Date:

Connect to Local Systems

Train Staff: IT staff; Profile managers; Output evaluators.

Test the Integrated System: Gather test samples (again!). Run the profile in your test environment. Does the output stay the same? Can you update the profiles? Are other users of the system able to use/update the output?
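The "does the output stay the same?" check can be automated by comparing term assignments from the integrated system against a saved baseline. This is a minimal sketch that assumes both runs can be loaded as dicts mapping a document ID to a list of assigned terms; the function name is hypothetical.

```python
def diff_term_assignments(baseline, current):
    """Return {doc_id: (missing_terms, extra_terms)} for documents whose terms changed."""
    changes = {}
    for doc_id in sorted(set(baseline) | set(current)):
        before = set(baseline.get(doc_id, ()))
        after = set(current.get(doc_id, ()))
        if before != after:
            # Terms lost since the baseline, then terms newly assigned.
            changes[doc_id] = (sorted(before - after), sorted(after - before))
    return changes
```

An empty result means the integrated system reproduces the baseline; any entry points to a document worth inspecting by hand.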

If you answered yes: CELEBRATE!

Maintain the System: Documentation; Tests; Staff training; Follow-up evaluations.

Lessons Learned the Hard Way: Be careful using outside data. Buy only what you need.

Thank you!