Text Analytics Workshop Development Tom Reamy Chief Knowledge Architect KAPS Group Knowledge Architecture Professional Services


1 Text Analytics Workshop Development Tom Reamy Chief Knowledge Architect KAPS Group Knowledge Architecture Professional Services http://www.kapsgroup.com

2 Agenda
• Development - Foundation
• Case Study 1 – Internet News
• Case Study 2 – Tale of two taxonomies
• Case Study 3 – Software Evaluation and Beyond
• Exercises

3 Text Analytics Development: Foundation
• Articulated Information Management Strategy (K Map)
    – Content and Structures and Metadata
    – Search, ECM, applications - and how used in Enterprise
    – Community information needs and Text Analytics Team
• POC establishes the preliminary foundation
    – Need to expand and deepen
    – Content – full range, basis for rules-training
    – Additional SMEs – content selection, refinement
• Taxonomy – starting point for categorization / suitable?
• Databases – starting point for entity catalogs

4 Knowledge Architecture Audit: Knowledge Map
[Process diagram; labels as extracted]
Project Foundation | Contextual Interviews | Information Interviews | App/Content Catalog | User Survey | Strategy Document
Meetings, work groups Overview High Level: Process Community Info behaviors of Business processes Technology and content All 4 dimensions Meetings, work groups
General Outline | Broad Context | Deep Details | Complete Picture | New Foundation

5 Taxonomy Development Process: Progressive Refinement
[Process diagram; labels as extracted]
Taxonomy Model | Information Interviews | Content Analysis | Refine | Map Community | Governance Plan
Buy/Find work groups Overview Info behaviors, Card Sorts Bottom Up Prototypes Interviews Evaluate Refine Interviews Develop, Refine
General Outline | Preliminary Taxonomy | Taxonomy 1.0 | Taxonomy 1.0-1.9 | Tax 2.0 | Taxonomy

6 Text Analytics Development: Categorization Process
• Starter Taxonomy
    – If no taxonomy, develop initial high level (see Chart)
• Analysis of taxonomy – suitable for categorization
    – Structure – not too flat, not too large
    – Orthogonal categories
• Content Selection
    – Map of all anticipated content
    – Selection of training sets – if possible
    – Automated selection of training sets – taxonomy nodes as first categorization rules – apply and get content
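The automated selection step on this slide (taxonomy nodes as first categorization rules, applied to get content) can be sketched roughly as follows. This is a minimal illustration, assuming plain substring matching of node labels; the function name `select_training_sets` and the sample documents are hypothetical and do not come from the presentation or from any particular text analytics product.

```python
from collections import defaultdict


def select_training_sets(taxonomy_nodes, documents):
    """Use each taxonomy node label as a first, naive categorization rule.

    A document becomes a training candidate for a node when the node label
    appears anywhere in the document text (case-insensitive).
    """
    selected = defaultdict(list)
    for doc_id, text in documents.items():
        lowered = text.lower()
        for node in taxonomy_nodes:
            if node.lower() in lowered:
                selected[node].append(doc_id)
    return dict(selected)


# Hypothetical node labels and documents, for illustration only.
taxonomy_nodes = ["Marine toxins", "Petrology"]
documents = {
    "d1": "New survey of marine toxins in coastal waters.",
    "d2": "Petrology notes from the drilling program.",
    "d3": "Quarterly budget review.",
}

candidates = select_training_sets(taxonomy_nodes, documents)
for node, doc_ids in candidates.items():
    print(node, doc_ids)
```

Documents that match no node (here `d3`) are simply left out; in practice they would likely be reviewed by hand or picked up in later rule-refinement rounds.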

7 Text Analytics Development: Categorization Process
• First Round of Categorization Rules
• Term building – from content – basic set of terms that appear often / important to content
• Add terms to rule, apply to broader set of content
• Repeat for more terms – get recall-precision “scores”
• Repeat, refine, repeat, refine, repeat
• Get SME feedback – formal process – scoring
• Get SME feedback – human judgments
• Test against more, new content
• Repeat until “done” – 90%?
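One way to produce the recall-precision "scores" for each refinement round is sketched below. It assumes a rule is just a list of terms, that a document matches when any term occurs in it, and that SME judgments are available as a set of relevant document IDs. The helper names and sample stories are hypothetical.

```python
def rule_hits(terms, documents):
    """Return the IDs of documents containing at least one rule term."""
    return {
        doc_id
        for doc_id, text in documents.items()
        if any(term in text.lower() for term in terms)
    }


def precision_recall(hits, relevant):
    """Precision = correct hits / all hits; recall = correct hits / all relevant."""
    true_positives = len(hits & relevant)
    precision = true_positives / len(hits) if hits else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical test collection and SME judgments for a "Healthcare" category.
documents = {
    "d1": "Hospital merger announced in Ohio",
    "d2": "New clinic opens for outpatient care",
    "d3": "Hospital staffing shortage grows",
    "d4": "Pet hospital wins design award",
}
relevant = {"d1", "d2", "d3"}

round_1 = precision_recall(rule_hits(["hospital"], documents), relevant)
round_2 = precision_recall(rule_hits(["hospital", "clinic"], documents), relevant)
print("round 1:", round_1)
print("round 2:", round_2)
```

Adding the term "clinic" in the second round raises recall without removing the false hit on `d4`; deciding whether that trade is acceptable is the kind of judgment the SME feedback steps are meant to inform.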

8 Text Analytics Development: Entity Extraction Process
• Facet Design – from KA Audit, K Map
• Find and Convert catalogs:
    – Organization – internal resources
    – People – corporate yellow pages, HR
    – Include variants
    – Scripts to convert catalogs – programming resource
• Build initial rules – follow categorization process
    – Differences – scale, “score”
    – Recall – find all entities
    – Precision – correct assignment to entity class
    – Issue – disambiguation – Ford company, person, car
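A converted catalog with variants might be applied along the following lines. This is a sketch only: the catalog layout, the function names, and the entries are assumptions made for illustration, and real products use their own catalog formats. The bare name "Ford" is deliberately left out of the catalog because, as the slide notes, it is ambiguous between company, person, and car; only unambiguous variants are listed.

```python
import re

# Hypothetical catalog: entity class -> canonical name -> known variants.
CATALOG = {
    "Organization": {"Ford Motor Company": ["Ford Motor Company", "Ford Motor Co."]},
    "People": {"Henry Ford": ["Henry Ford", "H. Ford"]},
}


def build_variant_index(catalog):
    """Flatten the catalog into a lookup: lowercased variant -> (class, canonical name)."""
    index = {}
    for entity_class, entities in catalog.items():
        for canonical, variants in entities.items():
            for variant in variants:
                index[variant.lower()] = (entity_class, canonical)
    return index


def extract_entities(text, index):
    """Return sorted (class, canonical name) pairs whose variants occur in the text."""
    lowered = text.lower()
    found = set()
    for variant, entity in index.items():
        # Require non-word characters (or text boundaries) around the variant.
        pattern = r"(?<!\w)" + re.escape(variant) + r"(?!\w)"
        if re.search(pattern, lowered):
            found.add(entity)
    return sorted(found)


index = build_variant_index(CATALOG)
entities = extract_entities("Henry Ford spoke with Ford Motor Co. engineers.", index)
print(entities)
```

Recall here depends on how complete the variant lists are, and precision on keeping ambiguous short forms out of the catalog or handling them with separate context rules.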

9 Case Study - Background
• Inxight Smart Discovery
• Multiple Taxonomies
    – Healthcare – first target
    – Travel, Media, Education, Business, Consumer Goods
• Content
    – 800+ Internet news sources
    – 5,000 stories a day
• Application – Newsletters
    – Editors using categorized results
    – Easier than full automation

10 Case Study - Approach
• Initial High Level Taxonomy
    – Auto generation – very strange – not usable
    – Editors High Level – sections of newsletters
    – Editors & Taxonomy Pros - Broad categories & refine
• Develop Categorization Rules
    – Multiple Test collections
    – Good stories, bad stories – close misses - terms
• Recall and Precision Cycles
    – Refine and test – taxonomists – many rounds
    – Review – editors – 2-3 rounds
• Repeat – about 4 weeks


18 Case Study - Issues
• Taxonomy Structure
    – Aggregate nodes vs. independent nodes
    – Children Nodes – subset – rare
• Depth of taxonomy and complexity of rules
    – Trade-off between need to update and usefulness of categories
• Multiple avenues - Facets – source – New York Times – can put into rules or make it a facet to filter results
• When to use filter or terms – experimental
• Recall more important than precision – editors’ role
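The choice between putting a source such as the New York Times into the rules and treating it as a facet can be illustrated with a small sketch. It assumes term-based categorization and keeps the source as a separate field used only for filtering; the `Story` class, `categorize` function, and sample stories are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Story:
    title: str
    source: str
    text: str


def matches_terms(story, terms):
    """True when any rule term occurs in the story text (case-insensitive)."""
    lowered = story.text.lower()
    return any(term in lowered for term in terms)


def categorize(stories, terms, source_filter=None):
    """Apply the term rule first, then optionally filter on the source facet."""
    hits = [s for s in stories if matches_terms(s, terms)]
    if source_filter is not None:
        hits = [s for s in hits if s.source == source_filter]
    return hits


# Hypothetical stories for illustration.
stories = [
    Story("Hospital merger", "New York Times", "Two hospital systems agreed to merge."),
    Story("Clinic funding", "Example Daily", "A hospital clinic received new funding."),
    Story("Travel deals", "New York Times", "Airlines cut fares for summer."),
]

all_health = categorize(stories, ["hospital"])
nyt_health = categorize(stories, ["hospital"], source_filter="New York Times")
print([s.title for s in all_health])
print([s.title for s in nyt_health])
```

Keeping the source out of the rule means the same rule serves every newsletter, and the filter can be changed without touching the rule. Whether that works better than source-specific terms is, as the slide says, something to settle by experiment.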

19 Case Study – Lessons Learned
• Combination of SME and Taxonomy pros
• Combination of Features
    – Entity extraction, terms, Boolean, filters, facts
• Training sets and find similar are weakest
    – Somewhat useful during development for terms
• No best answer – taxonomy structure, format of rules
    – Need custom development
• Plan for ongoing refinement
• This stuff actually works!

20 Enterprise Environment – Case Studies
• A Tale of Two Taxonomies
    – It was the best of times, it was the worst of times
• Basic Approach
    – Initial meetings – project planning
    – High level K map – content, people, technology
    – Contextual and Information Interviews
    – Content Analysis
    – Draft Taxonomy – validation interviews, refine
    – Integration and Governance Plans

21 Enterprise Environment – Case One – Taxonomy, 7 facets
• Taxonomy of Subjects / Disciplines:
    – Science > Marine Science > Marine microbiology > Marine toxins
• Facets:
    – Organization > Division > Group
    – Clients > Federal > EPA
    – Instruments > Environmental Testing > Ocean Analysis > Vehicle
    – Facilities > Division > Location > Building X
    – Methods > Social > Population Study
    – Materials > Compounds > Chemicals
    – Content Type – Knowledge Asset > Proposals
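The "A > B > C" paths on this slide map naturally onto a nested structure. The sketch below loads a few of the listed facet paths into nested dictionaries; the representation and the `add_path` helper are assumptions for illustration, not the format used in the project.

```python
def add_path(tree, path):
    """Insert one 'A > B > C' path into a nested-dict tree and return the tree."""
    node = tree
    for label in (part.strip() for part in path.split(">")):
        node = node.setdefault(label, {})
    return tree


# A few of the facet paths from the slide.
facet_paths = [
    "Organization > Division > Group",
    "Clients > Federal > EPA",
    "Materials > Compounds > Chemicals",
]

tree = {}
for path in facet_paths:
    add_path(tree, path)

print(sorted(tree))      # top-level facets
print(tree["Clients"])   # subtree under one facet
```

Each top-level key is an independent facet, so a document can carry one value from the subject taxonomy plus values from any number of facets.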

22 Enterprise Environment – Case One – Taxonomy, 7 facets
• Project Owner – KM department – included RM, business process
• Involvement of library - critical
• Realistic budget, flexible project plan
• Successful interviews – build on context
    – Overall information strategy – where taxonomy fits
• Good Draft taxonomy and extended refinement
    – Software, process, team – train library staff
    – Good selection and number of facets
• Final plans and hand off to client

23 Enterprise Environment – Case Two – Taxonomy, 4 facets
• Taxonomy of Subjects / Disciplines:
    – Geology > Petrology
• Facets:
    – Organization > Division > Group
    – Process > Drill a Well > File Test Plan
    – Assets > Platforms > Platform A
    – Content Type > Communication > Presentations

24 Enterprise Environment – Case Two – Taxonomy, 4 facets
• Environment Issues
    – Value of taxonomy understood, but not the complexity and scope
    – Under budget, under staffed
    – Location – not KM – tied to RM and software
        · Solution looking for the right problem
    – Importance of an internal library staff
    – Difficulty of merging internal expertise and taxonomy

25 Enterprise Environment – Case Two – Taxonomy, 4 facets
• Project Issues
    – Project mind set – not infrastructure
    – Wrong kind of project management
        · Special needs of a taxonomy project
        · Importance of integration – with team, company
    – Project plan more important than results
        · Rushing to meet deadlines doesn’t work with semantics as well as software

26 Enterprise Environment – Case Two – Taxonomy, 4 facets
• Research Issues
    – Not enough research – and wrong people
    – Interference of non-taxonomy – communication
    – Misunderstanding of research – wanted tinker toy connections
        · Interview 1 implies conclusion A
• Design Issues
    – Not enough facets
    – Wrong set of facets – business not information
    – Ill-defined facets – too complex internal structure

27 Taxonomy Development Conclusion: Risk Factors
• Political-Cultural-Semantic Environment
    – Not simple resistance - more subtle
    – Re-interpretation of specific conclusions and sequence of conclusions / Relative importance of specific recommendations
• Understanding project scope
• Access to content and people
    – Enthusiastic access
• Importance of a unified project team
    – Working communication as well as weekly meetings

28 Text Analytics Development: Case Study 3 – POC – Government Agency
• Demo of SAS – Teragram / Enterprise Content Categorization

29 Conclusion
• Enterprise Context – strategic, self knowledge
• Importance of a good foundation
    – Importance of Taxonomy Structure – mapped to use
    – POC a head start on development
• Importance of Text Analytics Vision / Strategy
    – Infrastructure resource, not a project
• Balance of expertise and local knowledge
• Importance of Usability for refinement cycles
• Difference of taxonomy and categorization
    – Concepts vs. text in documents

30 Questions? Tom Reamy tomr@kapsgroup.com KAPS Group Knowledge Architecture Professional Services http://www.kapsgroup.com

