
1  Text Analytics Evaluation
A Case Study: Amdocs
Tom Reamy, Chief Knowledge Architect
KAPS Group
http://www.kapsgroup.com
Copyright © 2011, SAS Institute Inc. All rights reserved. #analytics2011

2  Text Analytics Evaluation Case Study: Agenda
 Introduction – Text Analytics Basics
 Evaluation Process & Methodology
 Two Stages – Initial Filters & POC
 Initial Evaluation Results
 Proof of Concept – Methodology, Results
 Final Recommendation
 Sentiment Analysis and Beyond
 Conclusions

3  KAPS Group: General
 Knowledge Architecture Professional Services
 Virtual company: network of 8-10 consultants
 Partners – SAS – 2 whitepapers (semantic infrastructure)
 GAO, FDA, Amdocs – sales & development projects
 Other partners: SmartLogic, FAST, Concept Searching, etc.
 Services:
 Consulting, strategy, knowledge architecture audit
 Text analytics evaluation, development, consulting, customization
 Knowledge representation – taxonomy, ontology, prototype
 Knowledge management: collaboration, expertise, e-learning
 Applied theory – faceted taxonomies, complexity theory, natural categories

4  Text Analytics Features
 Noun phrase extraction
 Catalogs with variants; rule-based, dynamic
 Multiple types, custom classes – entities, concepts, events
 Feeds facets
 Summarization
 Customizable rules, map to different content
 Fact extraction
 Relationships of entities – people-organizations-activities
 Ontologies – triples, RDF, etc.
 Sentiment analysis
 Rules – objects and phrases

5  Introduction to Text Analytics: Text Analytics Features
 Auto-categorization
 Training sets – Bayesian, vector space
 Terms – literal strings, stemming, dictionary of related terms
 Rules – simple – position in text (title, body, URL)
 Semantic network – predefined relationships, sets of rules
 Boolean – full search syntax – AND, OR, NOT
 Advanced – DIST(#), PARAGRAPH, SENTENCE
 This is the most difficult feature to develop
 Build on a taxonomy
 Combine with extraction – e.g., fire a category if any of a list of entities co-occurs with other specified words
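
The proximity and Boolean operators above can be sketched in a few lines. Below is a minimal illustration in plain Python of a DIST-style check combined with Boolean NOT; it is a toy stand-in, not any vendor's actual rule engine, and the terms are invented for the example:

```python
import re

def tokens(text):
    """Lowercase word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def dist(text, term_a, term_b, max_gap):
    """True if term_a and term_b occur within max_gap tokens of each
    other, mimicking a DIST(#) proximity operator."""
    toks = tokens(text)
    pos_a = [i for i, t in enumerate(toks) if t == term_a]
    pos_b = [i for i, t in enumerate(toks) if t == term_b]
    return any(abs(a - b) <= max_gap for a in pos_a for b in pos_b)

def rule_matches(text):
    """Toy rule: 'cancel' within 7 tokens of 'account', AND NOT the
    word 'price' anywhere in the text."""
    return dist(text, "cancel", "account", 7) and "price" not in tokens(text)
```

For example, `rule_matches("I want to cancel my account today")` fires, while a note that also mentions "price" is excluded by the NOT clause.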

6-11  [Image-only slides – no transcript text]

12  Evaluating Text Analytics Software: Start with Self-Knowledge
 Strategic and business context
 Strategic questions – why, what value from the taxonomy/text analytics, how you will use it
 Information problems – what they are, how severe
 Formal process – KA audit: content, users, technology, business and information behaviors, applications – or an informal, application-specific initiative for a smaller organization
 Text analytics strategy/model – forms, technology, people
 Existing taxonomic resources and software
 This foundation is needed both to evaluate and to develop

13  Evaluating Text Analytics Software: Start with Self-Knowledge
 Do you need it – and if so, what blend?
 Taxonomy management – full functionality
 Multiple taxonomies, languages, authors-editors
 Technology environment – text mining, ECM, enterprise search
 Where is it embedded? Integration issues
 Publishing process – where and how metadata is added, now and in the projected future
 Can it utilize auto-categorization, entity extraction, summarization?
 Applications – text mining, BI, CI, social media, mobile?

14  Evaluation Process & Methodology: An Interdisciplinary Team
 IT – knows large software purchases and needs assessment
»But text analytics is different – semantics
»Like a construction company designing your house
 Business – understands the business needs
»But doesn't understand information
»Like a restaurant owner doing the cooking
 Library – knows information and search
»But doesn't understand the business; users are non-information experts
»Like an accountant doing financial strategy
 Team – 3 from KAPS (information), 5-8 from Amdocs (SMEs – business, technical)

15  Evaluation Process & Methodology: Amdocs Requirements / Initial Filters
 Platform – range of capabilities
 Categorization, sentiment analysis, etc.
 Technical
 APIs, Java-based, Linux runtime
 Scalability – millions of documents a day
 Import-export – XML, RDF
 Total cost of ownership
 Vendor relationship – OEM
 Usability, multiple-language support

16  Evaluation Process & Methodology: Two Phases
 Phase I – traditional software evaluation
 Filter One – ask experts: reputation, research – Gartner, etc.
»Market strength of vendor, platforms, etc.
 Filter Two – feature scorecard: minimum, must-have; filter to top 3
 Filter Three – technology filter: match to your overall scope and capabilities – a filter, not a focus
 Filter Four – in-depth demo: 3-6 vendors
 Phase II – deep POC (2 vendors) – advanced features, integration, semantics

17  Phase I – Case Study: Vendors Considered
 Attensity
 SAP – Inxight
 Clarabridge
 ClearForest
 Concept Searching
 Data Harmony / Access Innovations
 Expert Systems
 GATE (open source)
 IBM
 Lexalytics
 Multi-Tes
 Nstein
 SAS
 SchemaLogic
 SmartLogic
Categories: content management, enterprise search, sentiment analysis specialty, ontology platforms

18  Phase I – 4 Demos
 SmartLogic
 Taxonomy management, good interface
 20 types of entities, APIs, XML-HTTP
 Full platform – no sentiment analysis
 Expert Systems
 Different approach – semantic network: 400,000 words, 3,500 rules, 65 types of relationships
 Strong out of the box – 80%, no training sets
 Language concerns – no Spanish; high cost to develop new languages
 Customization – add terms and relationships, develop rules; uncertain how much effort; uses their professional linguists

19  Phase I – 4 Demos (continued)
 SAS – Content Categorization & Sentiment Analysis
 Full platform – categorization, entity extraction, sentiment – integrated
 APIs, XML, Java – ease of integration
 Strong company history, range of experience
 IBM – Classification Module, Content Analytics – two products
 Classification Module – statistical emphasis
»Once trained, it can "learn" new words
»Rapid development / depends on training sets
 Content Analytics, LanguageWare Workbench
»Full platform

20  Phase I – Findings
 SAS & IBM – full platform, OEM experience, multilingual
 Proven ability to scale, customizable components, mature tool sets
 SAS was the strongest offering
 Capabilities, experience, integrated tool sets
 IBM a good second choice
 Capabilities, experience – multiple products, both a strength and a weakness
 Single-vendor POC – demonstrates it can be done
 Ability to dive more deeply into capabilities and issues
 Stronger foundation for future development; learn the software better
 Danger of missing a better choice
 Two-vendor POC
 Balance of depth and full testing

21  Phase II – Proof of Concept (POC)
 4-6 week POC – bake-off, or short pilot
 Measurable quality of results is the essential factor
 Real-life scenarios; categorization with your content
 2-3 rounds of development, test, refine – not out-of-the-box
 Need SMEs as test evaluators – also to do an initial categorization of content
 The majority of time goes to auto-categorization
 Need to balance uniformity of results with vendor-unique capabilities – have to determine this at POC time
 Taxonomy developers – expert consultants plus internal taxonomists

22  Phase II – POC: Range of Evaluations
 Basic question – can this stuff work at all?
 Auto-categorization to an existing taxonomy – variety of content
 The essential issue is the complexity of language
 Clustering – automatic node generation
 Summarization
 Entity extraction – build a number of catalogs; design which ones based on projected needs – for example, privacy information (SSN, phone, etc.)
 Entity examples – people, organizations, methods, etc.
 The essential issues are scale and disambiguation
 Evaluate usability in action by taxonomists
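
As an illustration of the privacy-information catalogs mentioned above, here is a toy entity extractor using regular expressions. Real products use curated catalogs with variants and disambiguation; the patterns here are deliberately simplified assumptions:

```python
import re

# Hypothetical privacy-info "catalogs" expressed as regex patterns.
PRIVACY_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),          # 123-45-6789
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),  # 555-867-5309
}

def extract_entities(text):
    """Return {entity_type: [matches]} for each pattern that fires."""
    found = {}
    for name, pattern in PRIVACY_PATTERNS.items():
        hits = pattern.findall(text)
        if hits:
            found[name] = hits
    return found
```

At POC scale the interesting questions are exactly the ones the slide names: how these catalogs behave across millions of documents, and how collisions between similar-looking entities are disambiguated.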

23  Phase II – POC: Evaluation Criteria & Issues
 Basic test design – categorize a test set
 Score – by file name, human testers
 Categorization & sentiment – accuracy of 80-90%
 Effort level per accuracy level
 Quantify development time – main elements
 Comparison of two vendors – how to score?
 Combination of scores and report
 Quality of content and of the initial human categorization
 Normalize among different test evaluators
 Quality of taxonomists – experience with text analytics software and/or with the content and information needs and behaviors
 Quality of taxonomy – structure, overlapping categories

24  Phase II – POC: Risks
 CIO/CTO problem – this is not a regular software process
 Language is messy, not just complex
 30% accuracy isn't 30% done – it could be 90% done
 Variability of human categorization / expression
 Even professional writers vary – journalist examples
 Categorization is iterative, not "the program works"
 Need a realistic budget and a flexible project plan
 "Anyone can do categorization"
 Librarians often overdo it; SMEs often get lost (keywords)
 Meta-language issues – understanding the results
 Need to educate IT and business in their own language

25  Text Analytics POC Outcomes: Categorization of CSR Notes
 Content – 2,000 CSR notes categorized by humans
 Variation among human categorizers
 Recall (finding all the correct documents)
 Precision (not categorizing documents from other categories)
 Precision is harder than recall
 Two scores – raw and corrected – only raw available for IBM precision
 The first score was very low; an extra round brought it up
 Uncategorized documents – 50,000 – look at the top 10 in each category
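
The recall and precision definitions above reduce to a small calculation per category. A minimal sketch, where `human` and `machine` are sets of document ids assigned to one category by the SMEs and by the engine (the ids are made up for the example):

```python
def recall_precision(human, machine):
    """Recall: share of human-assigned docs the engine found.
    Precision: share of engine-assigned docs that were correct."""
    true_pos = len(human & machine)
    recall = true_pos / len(human) if human else 0.0
    precision = true_pos / len(machine) if machine else 0.0
    return recall, precision

human = {"d1", "d2", "d3", "d4"}
machine = {"d1", "d2", "d3", "d5"}
r, p = recall_precision(human, machine)  # r = 0.75, p = 0.75
```

The "raw vs. corrected" distinction on this slide then comes from re-scoring after fixing errors in the initial human categorization, not from changing this arithmetic.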

26  Text Analytics POC Outcomes: Categorization Results

Metric                   SAS     IBM
Recall – Motivation      92.6    90.7
Recall – Actions         93.8    88.3
Precision – Motivation   84.3    –
Precision – Actions      100     –
Uncategorized            87.5    –
Raw Precision            73      46

(Corrected precision scores were available only for SAS; IBM precision is raw only.)

27  Text Analytics POC Outcomes: Vendor Comparisons
 SAS has a much more complete set of operators – NOT, DIST, ORDDIST, START
 The IBM team was able to develop workarounds for some – more development effort
 Operators affect most other features – sentiment analysis, entity and fact extraction, summarization, etc.
 SAS has relevancy scores – can be used for precision and for applications
 Sentiment analysis – SAS has a workbench; IBM would require more development
 SAS also has statistical modeling capabilities
 Development environment & methodology
 IBM as a toolkit provides more flexibility, but it also increases development effort; it does enforce good method

28  Text Analytics POC Outcomes: Vendor Comparisons – Conclusions
 Both can do the job
 Product vs. toolkit (SAS has toolkit capabilities also)
 IBM will require more development effort
 Boolean operators – NOT, DIST, ORDDIST, START, etc.
»In rules, entity and fact extraction
 Sentiment analysis – rules, statistical
 Summarization
 Rule building is more programming than taxonomy
 IBM is harder to learn – the POC took 2x the effort for IBM
 Conclusion: buy SAS ECC and the Sentiment Analysis Workbench

29  Sentiment Analysis Development Process
 Combination of statistical modeling and categorization rules
 Start with training sets – examples of positive, negative, and neutral documents
 Develop a statistical model
 Generate domain-specific positive and negative words and phrases
 Develop a taxonomy of products & features
 Develop rules for positive and negative statements
 Test and refine
 Test and refine again
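
The "generate domain positive and negative words" step above can be sketched crudely with word counts from the labeled training sets. This is a toy stand-in for a real statistical model (no smoothing, no neutral class), with invented training examples:

```python
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

def train(pos_docs, neg_docs):
    """Count word frequencies in positive and negative training docs,
    yielding domain-specific polarity evidence."""
    pos, neg = Counter(), Counter()
    for doc in pos_docs:
        pos.update(tokenize(doc))
    for doc in neg_docs:
        neg.update(tokenize(doc))
    return pos, neg

def score(text, pos, neg):
    """Positive minus negative evidence: > 0 positive, < 0 negative."""
    return sum(pos[t] - neg[t] for t in tokenize(text))
```

In the process the slide describes, a model like this generates candidate polarity terms, which are then folded into hand-built rules over the product/feature taxonomy and refined round after round.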

30-34  [Image-only slides – no transcript text]

35  Beyond Sentiment: Behavior Prediction – Case Study: Telecom Customer Service
 Problem – distinguish customers likely to cancel from mere threats
 Analyze customer support notes
 General issues – creative spelling, second-hand reports
 Develop categorization rules
 First – distinguish cancellation calls – not simple
 Second – distinguish what is being cancelled – one line or all
 Third – distinguish real threats

36  Beyond Sentiment: Behavior Prediction – Case Study
 Basic rule:
(START_20, (AND,
  (DIST_7, "[cancel]", "[cancel-what-cust]"),
  (NOT, (DIST_10, "[cancel]", (OR, "[one-line]", "[restore]", "[if]")))))
 Example notes (spelling as in the original CSR notes):
 "customer called to say he will cancell his account if the does not stop receiving a call from the ad agency."
 "cci and is upset that he has the asl charge and wants it off or her is going to cancel his act"
 "ask about the contract expiration date as she wanted to cxl teh acct"
 Combine sophisticated rules with sentiment statistical training and predictive analytics
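
The rule above reads as: near the start of the note (START_20), a cancel term within 7 tokens of a what-is-being-cancelled term (DIST_7), and NOT within 10 tokens of a hedging term like "one line", "restore", or "if". A rough plain-Python paraphrase follows; the bracketed concept references stand for curated term lists, so the small sets here are invented stand-ins:

```python
import re

def toks(text):
    return re.findall(r"[a-z]+", text.lower())

def near(token_list, set_a, set_b, gap):
    """True if a term from set_a occurs within `gap` tokens of set_b."""
    pa = [i for i, t in enumerate(token_list) if t in set_a]
    pb = [i for i, t in enumerate(token_list) if t in set_b]
    return any(abs(a - b) <= gap for a in pa for b in pb)

# Invented stand-ins for [cancel], [cancel-what-cust], and the hedge terms.
CANCEL = {"cancel", "cancell", "cxl"}
WHAT = {"account", "acct", "act"}
HEDGE = {"line", "restore", "if"}

def real_cancel_threat(note):
    head = toks(note)[:20]  # START_20: look near the start of the note
    return near(head, CANCEL, WHAT, 7) and not near(head, CANCEL, HEDGE, 10)
```

Note how the third slide example ("cxl teh acct") only matches because the creative-spelling variants are in the term lists; that is exactly the messiness the previous slide warns about.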

37  Beyond Sentiment – Wisdom of Crowds: Crowd-Sourcing Technical Support
 Example – Android user forum
 Develop a taxonomy of products, features, and problem areas
 Develop categorization rules for posts such as:
 "I use the SDK method and it isn't to bad a all. I'll get some pics up later, I am still trying to get the time to update from fresh 1.0 to 1.1."
 Find the product & feature – from the forum structure
 Find problem areas in a response, and nearby text for the solution
 Automatic – simply expose lists of "solutions"
 Search-based application
 Human-mediated – experts scan and clean up solutions

38  Beyond Sentiment: Expertise Analysis
 Apply sentiment analysis techniques to expertise
 Expertise characterization for individuals, communities, documents, and sets of documents
 Experts prefer lower, subordinate category levels
 Novices prefer higher, superordinate levels
 The general populace prefers the basic level
 Experts' language structure is different
 Focus on procedures over content
 Develop expertise rules – sentiment and categorization
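
A speculative sketch of such an expertise rule: score a document by its balance of subordinate-level (specific) versus superordinate-level (general) vocabulary. The two term lists are invented purely for illustration; a real rule set would be built from the domain taxonomy:

```python
# Invented term-level dictionaries for the illustration.
SPECIFIC = {"orddist", "lemmatization", "disambiguation", "rdf"}
GENERAL = {"software", "system", "information", "technology"}

def expertise_score(text):
    """Score in [-1, 1]: toward +1 for subordinate-level (expert)
    vocabulary, toward -1 for superordinate-level (novice) vocabulary."""
    toks = [t.lower().strip(".,") for t in text.split()]
    specific = sum(t in SPECIFIC for t in toks)
    general = sum(t in GENERAL for t in toks)
    total = specific + general
    return (specific - general) / total if total else 0.0
```

This mirrors the sentiment pattern above (polarity evidence plus rules), just with expert/novice vocabulary levels in place of positive/negative terms.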

39  Expertise Analysis – Application Areas
 Taxonomy / ontology development and design – audience focus
 Card sorting – non-experts use superficial similarities
 Business & customer intelligence – add expertise to sentiment
 Deeper research into communities and customers
 Text mining – expertise characterization of a writer or corpus
 eCommerce – organization/presentation of information for expert vs. novice
 Expertise location – generate automatic expertise characterization based on documents
 Experiments – pronoun analysis – personality types
 Essay evaluation software – apply to expertise characterization
»Model levels of chunking, procedure words over content words

40  Text Analytics Evaluation – Conclusions
 Start with self-knowledge – text analytics is not an end in itself
 Initial evaluation – filters, not scorecards
 Weights change the output – you need self-knowledge to set good weights
 Proof of concept – essential
 Out-of-the-box behavior doesn't tell you how a product will work in the real world
 Your content and scenarios are your real world
 A good idea even if you know SAS is the answer
 Importance of operators and relevancy for a platform
 Sentiment needs full platform capabilities
 Everyone has room for improvement

41  Text Analytics – Future Directions
 Start with the 80% of significant content that is not data
 Enterprise search, content management, search-based applications
 Text analytics and text mining
 Text analytics turns text into data – build better text mining apps
 Better extraction, plus subjects / concepts
 Sentiment and beyond – behavior, expertise
 Text mining and text analytics
 Text mining enriching text analytics
 Taxonomy development
 New content structures, ensemble models
 Text analytics and predictive analytics
 More content, new content – social, interactive – CSR notes
 New sources of content/data = new & better apps
 Add learning & cognitive science, and the future is ?

42  Questions?
Tom Reamy
tomr@kapsgroup.com
KAPS Group – Knowledge Architecture Professional Services
http://www.kapsgroup.com

