© 2013 IBM Corporation Content Analytics with Enterprise Search Putting Your Content in Motion Realize the value of content to transform your business.

1 Content Analytics with Enterprise Search: Putting Your Content in Motion
Realize the value of content to transform your business

2 IBM Leadership in Search, Text Analysis and Classification
- IBM has a 50+ year history in text analysis and discovery
  - As early as 1957, IBM published pioneering research on text classification (and related topics such as text search and automatic creation of text abstracts)
- IBM invests ~$50M annually in research and development for search and text analytics
  - Over 200 people actively engaged in R&D
  - IBM holds over 200 patents in information access, with more added each year

3 Content Analytics: Going from raw information to rapid insight
Uncover business insight through a unique visual-based approach:
- Aggregate and extract from multiple sources ... to form large text-based collections from multiple internal and external sources (and types), including ECM repositories, structured data, social media and more.
- Organize, analyze and visualize ... enterprise content (and data) by identifying trends, patterns, correlations, anomalies and business context from collections.
- Search and explore to derive insight ... from collections to confirm what is suspected or uncover something new, before customizing models and integrating with other systems and processes.

4 IBM Content Analytics: A platform for rapid insight
- Multiple views for visual analysis, exploration and investigation
  - 8 unique views of content, including subdocument views
- Dynamically search and explore content for new business insight
  - Connections and Dashboard views to easily detect insights
  - Add your own custom views
- Powerful solution modeling and support for advanced classification tools for more accurate and deeper insight
  - Enhanced analytics configuration tools
- Deliver rapid insight to other systems, users and applications for a complete business view
  - Quickly generate Cognos BI reports, link between Cognos reports and ICA views
  - Deliver analysis to IBM Case Manager solutions

5 Content Analytics: A platform for rapid insight
Views shown: Document Analysis, Facets, Time Series, Deviations / Trends, Dashboard, Facet Pairs, Connections, Sentiment

6 IBM Content Analytics Approach
- Delivery of Insight to Users, Systems and Processes: Industry Solutions, Business Intelligence, Predictive Systems, ECM, Advanced Case Mgt
- Solution and Modeling Tools: IBM Content Analytics Studio, IBM Content Classification
- IBM Content Analytics: Interactive Assessment and Discovery of Business Insight (Sources, Analysis, Exploration)
- External and Internal Information Sources

7 Text Analytics is the basis for Content Analytics
- What is Text Analytics? Text Analytics (NLP*) describes a set of linguistic, statistical, and machine learning techniques that allow text to be analyzed and key information to be extracted for business integration.
- What is Content Analytics? Content Analytics (Text Analytics + Mining) refers to the text analytics process plus the ability to visually identify and explore trends, patterns, and statistically relevant facts found in various types of content spread across internal and external content sources.
* Natural Language Processing
Example text: "Not only was the pick-up line at the counter very long, but I waited 30 minutes just to talk to a rude representative who gave me a car that smelled like smoke, had stained floor mats, a dented fender, and only half a tank of gas."

8 IBM Content Analytics: How it works
- Source Information: Corporate (Contact Center, Test Data, Dealer notes, ECM, etc.) and External (NHTSA, Edmunds, Consumer Reports, MotorTrend, etc.)
- Ingestion paths shown: Content Analytics Crawlers, Content Push API, Real-time NLP REST API, RDB, IBM Master Data Mgmt
- Content Analytics UIMA Pipeline + Annotators: fine-grained control over the entities and facets that are created
- Analyzed Content (and Data): "Owner" "reports" "check engine lite" "flashes" "after refueling" ...
  - Grammatical labels shown: Noun, Verb, Noun Phrase, Prep Phrase
  - Semantic labels shown: Person, Issue, Warning, Driver action
- Extracted Concept: Component Issue: "Engine Light"; Situation: "Refueling"

9 UIMA (Wikipedia: UIMA standard)
UIMA stands for Unstructured Information Management Architecture. An OASIS standard[2] as of March 2009, UIMA is to date the only industry standard for content analytics.
UIMA is a component software architecture for the development, discovery, composition, and deployment of multi-modal analytics for the analysis of unstructured information and its integration with search technologies developed by IBM. The source code for a reference implementation of this framework has been made available on SourceForge, and later on the website of the Apache Software Foundation.
An example is a logistics analysis software system that could convert unstructured data such as repair logs and service notes into relational tables. These tables can then be used by automated tools to detect maintenance or manufacturing problems. Other examples are systems that are used in medical environments to analyze clinical notes.

10 UIMA Framework
- Unstructured Information Management Architecture
  - Analysis engines are interchangeable and reusable
  - Analysis engines pass artifacts to one another as annotations via a common data store, the CAS (Common Analysis Structure)
  - Watson used UIMA (UIMA-AS) in the Jeopardy! Challenge to beat quiz champions
Figure: Part-of-Speech, Morphological, Named Entity and Event annotators applied to an example sentence (words shown: Porsche, was, stolen, in, Queens, at, 11:30, a.m.; part-of-speech labels shown: Noun, Verb, Preposition, Num). Annotations shown: Sports Car; Crime (Name=theft); Time (timeOfDay=noon); City (cityName=New York, cityDistrict=Queens).
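The idea that analysis engines communicate only through annotations in a shared CAS can be illustrated with a small sketch. This is not the UIMA API: the `Cas` and `Annotation` classes, the two annotator functions and the `CITY_DISTRICTS` dictionary are hypothetical stand-ins, and the sentence is assembled from the words in the slide's figure (an assumption about their order).

```python
import re
from dataclasses import dataclass, field


@dataclass
class Annotation:
    type: str      # annotation type, e.g. "Token" or "City"
    begin: int     # character offset where the span starts
    end: int       # character offset just past the span
    features: dict = field(default_factory=dict)


@dataclass
class Cas:
    """Toy stand-in for a CAS: the document text plus every annotation so far."""
    text: str
    annotations: list = field(default_factory=list)

    def add(self, type, begin, end, **features):
        self.annotations.append(Annotation(type, begin, end, features))

    def select(self, type):
        return [a for a in self.annotations if a.type == type]

    def covered_text(self, ann):
        return self.text[ann.begin:ann.end]


def token_annotator(cas):
    # First engine: mark every whitespace-delimited span as a Token.
    for match in re.finditer(r"\S+", cas.text):
        cas.add("Token", match.start(), match.end())


# Hypothetical lookup table: district name -> city name.
CITY_DISTRICTS = {"Queens": "New York"}


def city_annotator(cas):
    # Second engine: reads the Tokens written by the first engine, never the
    # first engine itself, which is what makes engines interchangeable.
    for token in cas.select("Token"):
        word = cas.covered_text(token)
        if word in CITY_DISTRICTS:
            cas.add("City", token.begin, token.end,
                    cityName=CITY_DISTRICTS[word], cityDistrict=word)


def run_pipeline(text, engines):
    cas = Cas(text)
    for engine in engines:
        engine(cas)
    return cas


cas = run_pipeline("Porsche was stolen in Queens at 11:30 a.m.",
                   [token_annotator, city_annotator])
cities = cas.select("City")
```

Because each engine only reads and writes the shared structure, swapping `token_annotator` for a better tokenizer would not require changes to `city_annotator`, as long as it still produces `Token` annotations.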

11 UIMA Compliant Analysis Engines in ICA
- ICA provides a number of annotators for advanced text analysis
  - Language Identification
  - Linguistic Analysis
  - Dictionary Lookup
  - Pattern Matcher
  - Named Entity Recognition
  - Document Classification
- Custom text analysis can be added as a UIMA annotator
  - e.g. an annotator that recognizes a product number and adds additional information such as product name, release date or price
Figure (UIMA pipeline stages): Language Identification, Tokenization, Word Analytics, Multi-word Analytics, Named Entity Recognition, Classification, Custom Analytics
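The product-number example can be sketched as a pattern match followed by a dictionary lookup. This is plain Python rather than a UIMA annotator, and the `PN-1234` number format, the `PRODUCT_CATALOG` entries and the field names are all invented for illustration.

```python
import re

# Hypothetical catalog; a real annotator might load this from a database.
PRODUCT_CATALOG = {
    "PN-1001": {"name": "Example Widget", "release_date": "2012-05-01", "price": 19.99},
    "PN-2002": {"name": "Example Gadget", "release_date": "2013-01-15", "price": 49.50},
}

# Assumed product-number format: "PN-" followed by exactly four digits.
PRODUCT_NUMBER = re.compile(r"\bPN-\d{4}\b")


def annotate_products(text):
    """Return one annotation per product number found in the text."""
    annotations = []
    for match in PRODUCT_NUMBER.finditer(text):
        number = match.group()
        info = PRODUCT_CATALOG.get(number)
        annotation = {
            "type": "Product",
            "begin": match.start(),
            "end": match.end(),
            "number": number,
            "known": info is not None,  # False when the catalog has no entry
        }
        if info is not None:
            annotation.update(info)  # enrich with name, release date and price
        annotations.append(annotation)
    return annotations


sample = "Customer returned PN-1001 and asked about PN-9999."
products = annotate_products(sample)
```

Unknown numbers are still annotated, with `known` set to `False`, so that downstream analysis can count references to products missing from the catalog.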

12 (link, truncated in the transcript) c bfd/entry/overview3?lang=en

13 Text Analytics Catalog

14 Content Analytics Studio
Content Analytics Studio is an integrated development environment (IDE) for creating your own custom analysis engine.
With Content Analytics Studio, you can
- Create language and domain specific dictionaries
- Write rules to match character patterns
- Write rules to identify patterns of tokens and other annotations
- Create UIMA annotators based on these dictionaries and rules
- Annotate text documents and view the details of annotations
- Annotate collections of documents
all without needing to write code or understand the underlying technology.

15 Content Analytics Studio: View Project Resources

16 Content Analytics Studio: Sample text for building a model

17 Content Analytics Studio: UIMA Pipeline components

18 Iterative Process
Content analytics is an iterative process of:
1. Build: process documents
2. Validate: verify resource updates
3. Analyze: perform analytics to find new insights
4. Modify: update resources with new insights

19 Architecture

20 ICA V3.0 System Architecture
Diagram labels, grouped where the grouping is evident from the names:
- Crawler Framework: Custom Crawler, QuickPlace Crawler, Domino Doc Mgt Crawler, Notes Crawler, SharePoint Crawler, Exchange Server Crawler, NNTP Crawler, DB2 Crawler, JDBC DB Crawler, Content Integrator Crawler, DB2 Content Mgr Crawler, FileNet P8 Crawler, Case Mgr Crawler, Web Crawler, Seed List Crawler, Web Content Mgr Crawler, WebSphere Portal Crawler, Win FS Crawler, Unix FS Crawler, Agent for File System Crawler, Crawler Plug-in
- Importer Framework: CSV Importer
- Storage: Raw Data Store, Document Cache
- Document Processor: Parser, Document Generator, Annotator (UIMA)
- Indexer, Indexer Service
- Indexes: Search Index, Facet Count Sub Index, Taxonomy Index, Thumbnail Index
- Global Processing: Web Link Analysis, Thumbnail Generation
- Exporter: Collection Export Plug-in
- Text Analytics & Search Runtime
- Applications: Contents Miner UI, Admin UI, Enterprise Search UI, REST Application, Real-time NLP Application, SIAPI Application, Content Analytics Studio, Cognos BI Integration, Cognos BI
- Common Infrastructure: Scheduler, Logging, Control, Configuration, Monitor, Security
- Other labels: Inspector, Custom Point, Document Categorizer, Document Cluster, Term of Interest, RDB, XML, CSV

21 UIMA Compliant Analysis Engines in ICA
- Language Identification
- Lexical Analysis
  - Paragraph/Sentence Segmentation
  - Tokenization
  - Character Normalization
  - Lemmatization
  - Part-of-Speech Tagging
- Phrasal Analysis
  - Shallow Parsing
  - Named Entity Extraction
  - Phrase Recognition (Sentiment Analysis, etc.)
  - Deep Parsing
  - JJSA (Japanese only)
- Document Classification
- Custom Analysis: UIMA PEAR add-on custom analysis engine
Example (per-token alignment is not recoverable from the transcript):
- Input text: According to finance report, IBM Corp.’s EPS increased by 10.1%.
- Language Identification: English
- Sentence Segmentation and Tokenization: According, to, finance, report, IBM, Corp., ’s, EPS, increased, by, 10.1%
- Case Normalization and Lemmatization: according, corporation, increase
- Part-of-Speech Tagging (labels shown): preposition, noun (singular), adjective, noun (proper), possessive, verb (past tense), numeral
- Named Entity Extraction: IBM Corp., EPS, 10.1%
- Phrase Recognition: Positive (finance - increase)
What the analysis engines do:
- Resolve many ambiguities in text
- Recognize domain specific terms / expressions
- Deal with grammatical characteristics of each language (e.g. English, Chinese, Japanese, French, German, ...)
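The lexical analysis stages (tokenization, case normalization, lemmatization) can be illustrated on the slide's example sentence with a toy sketch. The regular expression, the `ABBREVIATIONS` list and the `LEMMAS` dictionary are assumptions made for this one sentence; they are not ICA's rules, and a production tokenizer would be dictionary- and language-driven.

```python
import re

# Assumed abbreviation list so that "Corp." keeps its trailing period.
ABBREVIATIONS = ["Corp.", "Inc."]

TOKEN = re.compile(
    "|".join(re.escape(a) for a in ABBREVIATIONS)  # known abbreviations first
    + r"|\d+(?:\.\d+)?%?"   # numbers, optionally decimal and with a percent sign
    + r"|'s"                # possessive clitic as its own token
    + r"|[A-Za-z]+"         # plain words
    + r"|[^\w\s]"           # any other single punctuation mark
)

# Tiny hand-written lemma dictionary keyed by the lowercased token.
LEMMAS = {
    "according": "according",
    "corp.": "corporation",
    "increased": "increase",
}


def tokenize(text):
    return TOKEN.findall(text)


def lemmatize(tokens):
    # Case-normalize for the lookup only; unknown tokens are left unchanged.
    return [LEMMAS.get(token.lower(), token) for token in tokens]


sentence = "According to finance report, IBM Corp.'s EPS increased by 10.1%."
tokens = tokenize(sentence)
lemmas = lemmatize(tokens)
```

Putting the abbreviation alternatives first means the tokenizer prefers `Corp.` over splitting it into `Corp` and `.`, which is one small instance of the ambiguity resolution the slide mentions.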

22 Supported Languages
Collection type: Text Analytics Collection (ICA V3.0)
- Arabic, Chinese (Traditional/Simplified), Danish, Dutch, English, French, German, Hebrew, Italian, Japanese, Portuguese, Spanish
- Arabic, Czech, Polish, Russian *1

23 Supported Data
Data Source
- Web
  - HTTP/HTTPS (RSS, Atom)
  - News groups (NNTP)
  - WebSphere Portal, Web Content Management
- Relational Database
  - DB2 family (DB2 UDB, Informix, DB2 for iSeries, DB2 for z/OS)
  - Oracle, MS SQL Server, Sybase
  - VSAM, IMS, CA-Datacom, Software AG Adabas
- Collaboration System
  - Lotus Notes/Domino databases, Quick Place, Domino.Doc
  - Lotus Quickr, Lotus Connections
  - MS Exchange
- Content Management System
  - IBM Case Manager
  - DB2 Content Manager
  - Documentum, FileNet CS, FileNet P8, Hummingbird, LiveLink Open Text, Portal Document Manager (PDM), Microsoft SharePoint
- File System
  - Unix File System
  - Windows File System
Data Format
- Plain Text
- HTML
- XML
- Office Document
  - Adobe Portable Document Format (PDF)
  - MS Rich Text Format (RTF)
  - MS Word, MS Excel, MS PowerPoint
  - Lotus Word Pro, Lotus 1-2-3, Lotus Freelance
  - Ichitaro
More than 300 formats can be supported by changing the configuration file.

24 ICA Web Application Security
- When WebSphere Application Server (WAS) is used, global security must be configured to enable login

25 Document Level Security by Security Token
- You can assign security tokens at crawl time by
  - Adding a fixed value as the security token
  - Assigning the security token based on field values (only some crawlers)
  - Attaching the token programmatically using a custom crawler plug-in
- The search application must be customized to pass the tokens that the current user holds
- The search engine returns a document only if the given tokens match the security tokens indexed with that document
Diagram: Data source, Crawler (with Plug-in), Parser, Indexer, Search Index, Search runtime
1. Assigning security tokens to documents (by plug-in, or extracted from the native data source)
2. User authentication and credential retrieval
3. Results filtering by matching security tokens with user credentials
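The token flow can be sketched in a few lines. This is a conceptual illustration, not ICA's implementation: the `IndexedDocument` class and both functions are hypothetical, and the matching rule (a document is visible when it shares at least one token with the user, and a document without tokens is never returned) is an assumption, since the slide only says that tokens must "match".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexedDocument:
    doc_id: str
    security_tokens: frozenset  # tokens stored with the document at crawl time


def assign_tokens(doc_id, fields, fixed_token=None, token_field=None):
    """Step 1: attach tokens at crawl time from a fixed value and/or a field value."""
    tokens = set()
    if fixed_token is not None:
        tokens.add(fixed_token)
    if token_field is not None and token_field in fields:
        tokens.add(fields[token_field])
    return IndexedDocument(doc_id, frozenset(tokens))


def filter_results(documents, user_tokens):
    """Step 3: keep only documents that share at least one token with the user."""
    user_tokens = frozenset(user_tokens)
    return [doc for doc in documents if doc.security_tokens & user_tokens]


index = [
    assign_tokens("doc-1", {"dept": "hr"}, token_field="dept"),
    assign_tokens("doc-2", {"dept": "finance"}, token_field="dept"),
    assign_tokens("doc-3", {}, fixed_token="all-staff"),
]

# Step 2 (authentication and credential retrieval) is assumed to have produced
# this token set for the current user.
visible = filter_results(index, {"hr", "all-staff"})
visible_ids = [doc.doc_id for doc in visible]
```

A user holding `hr` and `all-staff` sees `doc-1` and `doc-3` but not the finance document; a user with no tokens sees nothing under this rule.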

26 APIs
- Custom Search and Admin applications can be implemented with the REST API
- Language independent
- Provides all required functions for creating a search UI
  - Search navigation
  - Facet navigation
  - Search functions
    - Faceted search
    - Fetch content, thumbnails and previous document
    - List spell corrections, synonym expansions and type-ahead suggestions
    - And more...
- Provides required functions for administering search
  - Managing collections
  - Controlling and monitoring components
  - Adding documents to a collection

27 Extension Point List (IBM Software Group - ECM, IBM Content Analytics with Enterprise Search)
Component / Name | Description | Published
Crawler
  Crawler Plug-in | Allow users to modify crawled content or metadata | Published
  Web Crawler Pre-fetch Plug-in | Allow users to modify parameters of HTTP request | Published
  Web Crawler Post-parse | Allow users to modify crawled content or metadata | Published
  Archive File Handling Plug-In | Allow users to add custom archive file handling (e.g. LZH, RAR) | Published
  Crawler Framework | Allow users to easily create a new crawler for unsupported data sources | PRIVATE
Document Processing
  UIMA Annotator Plug-in | Allow users to integrate their own text analytics logic | Published
  Real-time NLP | Allow users to execute UIMA pipeline without indexing document | Published
  Thumbnail Generation Plug-in | Allow users to integrate own logic to generate thumbnail images | PRIVATE
Export
  Document Export Plug-in | Allow users to export processed documents to the outside of ICA | Published
  Deep Inspector Plug-in | Allow users to export the text mining results to the outside of ICA | Published
Text Analytics & Search Runtime
  Text Analytics & Search REST API | REST interface for Text Analytics & Search. Complies w/ SR20 | Published
  SIAPI | Search API | Published
  Fetch / IMC | Provide content fetch APIs and IMC access | Published
  Post Filtering Plug-In | Allow users to add custom impersonation logic | Published
Administration
  Admin REST API | REST interface for Administration | Published
Text Miner Application
  Custom Analytics View | Allow users to integrate their own visualization for text mining | Published

28 IBM Content Analytics: Analysis Export Capability
1. Crawled Document Export: export documents with their metadata and content as they are crawled
2. Analyzed Document Export: export documents with the results of text analytics, such as Natural Language Processing, Named Entity Extraction, classification or user-implemented logic, before indexing
3. Searched Document Export: export documents, limited by search or analysis, with original content from the index
Diagram, Content Analytics: Crawler, Data Store, Parser / Tokenizer, UIMA Annotators, Indexer, Search Index, Exporter (Plug-in)
Diagram, Content Intelligence Consumers: RDB, IBM Master Data Mgmt, ECM Solutions, InfoSphere (Import)

29 Basic Analytics and Search Concepts
- Structured Content: data that has unambiguous values and is easily processed by a computer program.
- Unstructured Content: information that is generally recorded in a natural language as free text.
- Text Analytics: a form of natural language processing that includes linguistic, statistical, and machine learning techniques for analyzing text and extracting key information.
- Collection: a set of data sources and options for crawling, parsing, indexing, and searching those data sources.
- Analytics Collection: a collection that is set up to be used for content mining.
- Search Collection: a collection that is set up to be used for a search application.
- Crawler: a software program that retrieves documents from data sources and gathers information that can be used to create search indexes.
- Annotator: a software component that performs specific linguistic analysis tasks and produces and records annotations.
- Parser: a program that interprets documents that are added to the data store. The parser extracts information from the documents and prepares them for indexing, search, and retrieval.

30 Q&A
