Content Analytics with Enterprise Search: Putting Your Content in Motion. Realize the value of content to transform your business

1 Content Analytics with Enterprise Search: Putting Your Content in Motion. Realize the value of content to transform your business

2 IBM Leadership in Search, Text Analysis and Classification
IBM has a 50+ year history in text analysis and discovery. As early as 1957, IBM published pioneering research on text classification and related topics, such as text search and automatic creation of text abstracts. IBM invests ~$50M annually in research and development for search and text analytics, with over 200 people actively engaged in R&D. IBM holds over 200 patents in information access, with more each year.

3 Content Analytics Going from raw information to rapid insight
Uncover business insight through a unique visual-based approach.
Aggregate and extract from multiple sources to form large text-based collections from multiple internal and external sources (and types), including ECM repositories, structured data, social media and more.
Organize, analyze and visualize enterprise content (and data) by identifying trends, patterns, correlations, anomalies and business context from collections.
Search and explore to derive insight from collections to confirm what is suspected or uncover something new, before customizing models and integrating with other systems and processes.
The basis for content analytics, the secret sauce if you will, is the ability to dynamically analyze content by aggregating multiple sources of different content types (whether Word docs, emails, or web pages residing in file systems, content repositories or portals) in order to rapidly surface trends, relationship patterns and anomalous associations. Exploratory, interactive, easy-to-use and feature-rich views help you search, organize and drill into your content, correlate relationships and derive a level of understanding that was previously inaccessible and unknown. With this level of analysis, your customers are better equipped to make more informed business decisions.

4 IBM Content Analytics – A platform for rapid insight
Multiple views for visual analysis, exploration and investigation: 8 unique views of content, including subdocument views; dynamically search and explore content for new business insight; Connections and Dashboard views to easily detect insights; add your own custom views.
Powerful solution modeling and support for advanced classification tools for more accurate and deeper insight: enhanced analytics configuration tools.
Deliver rapid insight to other systems, users and applications for a complete business view: quickly generate Cognos BI reports and link between Cognos reports and ICA views; deliver analysis to IBM Case Manager solutions.
High-level features of Content Analytics.

5 Content Analytics – A platform for rapid insight
All visualizations of Content Analytics: Document Analysis, Facets, Dashboard, Time Series, Sentiment, Connections, Deviations / Trends, Facet Pairs.

6 IBM Content Analytics Approach
Diagram: external and internal information sources feed IBM Content Analytics (Sources, Analysis, Exploration) for interactive assessment and discovery of business insight.
Solution and modeling tools: IBM Content Analytics Studio, IBM Content Classification.
Delivery of insight to users, systems and processes: Industry Solutions, Business Intelligence, Predictive Systems, ECM, Advanced Case Mgt.

7 What is Content Analytics?
Text Analytics is the basis for Content Analytics.
Example text: "Not only was the pick-up line at the counter very long, but I waited 30 minutes just to talk to a rude representative who gave me a car that smelled like smoke, had stained floor mats, a dented fender, and only half a tank of gas."
What is Text Analytics? Text Analytics (NLP*) describes a set of linguistic, statistical, and machine learning techniques that allow text to be analyzed and key information to be extracted for business integration.
What is Content Analytics? Content Analytics (Text Analytics + Mining) refers to the text analytics process plus the ability to visually identify and explore trends, patterns, and statistically relevant facts found in various types of content spread across internal and external content sources.
* Natural Language Processing
Until recently, technology to analyze unstructured data was immature and computing infrastructure made it difficult. Now companies are starting to use these technologies to understand the content in call center notes, emails, claim forms, etc. Point: discovery of information, not keyword search. Content analytics can be used with content from content management systems or in conjunction with other unstructured data from any other corporate system or outside source.

8 IBM Content Analytics – How it works
Diagram: source information, both corporate (contact center, test data, dealer notes, ECM, etc.) and external (NHTSA, Edmunds, Consumer Reports, MotorTrend, etc.), enters through the Content Analytics crawlers, the content push API or the real-time NLP REST API. It passes through the Content Analytics UIMA pipeline and annotators to produce analyzed content (and data). Structured data can also come from an RDB and IBM Master Data Mgmt.
Example in the diagram: the text "Owner" "reports" "check engine lite" "flashes" "after refueling" ... is tagged with parts of speech (noun, verb, noun phrase, prepositional phrase) and with facets (Person, Issue, Warning, Driver action, Component). Extracted concept: Component Issue: "Engine Light", Situation: "Refueling".
Fine-grained control over the entities and facets that are created.
High-level "how it works": Content Analytics crawls and analyzes unstructured content and breaks sentences apart to understand their concepts and meanings through parts of speech and other NLP techniques. The concepts are transformed into part-of-speech facets, and annotators can assign other concepts in the text to custom facets. This information is then viewed through the Content Analytics Miner application (multiple views). Additionally, ICAwES can import data from SPSS, Cognos and MDM for combined structured and unstructured analysis.
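The mapping from raw text to custom facets described on this slide can be illustrated with a small sketch. This is not how ICA implements it; the dictionaries, facet names and the `extract_facets` function below are hypothetical and only mimic the slide's "check engine lite" example with plain dictionary lookup.

```python
import re

# Hypothetical resources: spelling variants and a phrase-to-facet dictionary.
SPELLING_VARIANTS = {"lite": "light"}
FACET_DICTIONARY = {
    "Component Issue": {"check engine light": "Engine Light"},
    "Situation": {"refueling": "Refueling"},
}


def normalize(text: str) -> str:
    """Lower-case the text and replace known spelling variants word by word."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(SPELLING_VARIANTS.get(w, w) for w in words)


def extract_facets(text: str) -> dict:
    """Return a mapping of facet name to the canonical values found in the text."""
    normalized = normalize(text)
    facets = {}
    for facet, phrases in FACET_DICTIONARY.items():
        for phrase, canonical in phrases.items():
            # Match on word boundaries so a phrase is not found inside a longer word.
            if re.search(rf"\b{re.escape(phrase)}\b", normalized):
                facets.setdefault(facet, []).append(canonical)
    return facets


note = "Owner reports check engine lite flashes after refueling"
print(extract_facets(note))
# prints {'Component Issue': ['Engine Light'], 'Situation': ['Refueling']}
```

Normalizing before lookup is what lets the misspelled "lite" in the dealer note still land in the "Engine Light" facet value.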

9 Wikipedia: UIMA standard
UIMA stands for Unstructured Information Management Architecture. An OASIS standard as of March 2009, UIMA is to date the only industry standard for content analytics.
Developed by IBM, UIMA is a component software architecture for the development, discovery, composition, and deployment of multi-modal analytics for the analysis of unstructured information and its integration with search technologies. The source code for a reference implementation of this framework was made available on SourceForge, and later on the website of the Apache Software Foundation.
An example is a logistics analysis software system that could convert unstructured data such as repair logs and service notes into relational tables. These tables can then be used by automated tools to detect maintenance or manufacturing problems. Other examples are systems that are used in medical environments to analyze clinical notes.

10 UIMA Unstructured Information Management Architecture
Analysis engines are interchangeable and reusable.
Analysis engines pass artifacts to one another as annotations via a common data store, the CAS (Common Analysis Structure).
Watson used UIMA (UIMA-AS) in the Jeopardy! challenge to beat the quiz champions.
Diagram (UIMA framework): the words "Porsche", "was", "stolen", "in", "Queens", "at", "11:30", "a.m." are annotated in turn by a Part-of-Speech Annotator and a Morphological Annotator (Noun, Verb, Preposition, Num), a Named Entity Annotator (Sports Car; Time: timeOfDay = noon; City: cityName = New York, cityDistrict = Queens) and an Event Annotator (Crime: Name = theft).
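The idea that analysis engines communicate only through annotations in a shared CAS can be sketched in a few lines. The classes below are a toy stand-in, not the Apache UIMA API: `Cas`, `Annotation`, the three annotator functions and the one-entry gazetteer are all hypothetical, and the sketch assumes the slide's example sentence reads roughly "Porsche was stolen in Queens at 11:30 a.m."

```python
import re
from dataclasses import dataclass, field


@dataclass
class Annotation:
    type: str
    begin: int
    end: int
    features: dict = field(default_factory=dict)


@dataclass
class Cas:
    """Toy stand-in for the CAS: the text plus every annotation made so far."""
    text: str
    annotations: list = field(default_factory=list)

    def add(self, type_, begin, end, **features):
        self.annotations.append(Annotation(type_, begin, end, features))

    def select(self, type_):
        return [a for a in self.annotations if a.type == type_]

    def covered_text(self, annotation):
        return self.text[annotation.begin:annotation.end]


def token_annotator(cas):
    # Clock times such as "11:30" first, then runs of letters and dots ("a.m.").
    for m in re.finditer(r"\d{1,2}:\d{2}|[A-Za-z.]+", cas.text):
        cas.add("Token", m.start(), m.end())


CITY_DISTRICTS = {"Queens": "New York"}  # hypothetical gazetteer


def named_entity_annotator(cas):
    # Reads the Token annotations left by the previous engine.
    tokens = cas.select("Token")
    for i, tok in enumerate(tokens):
        word = cas.covered_text(tok)
        if word in CITY_DISTRICTS:
            cas.add("City", tok.begin, tok.end,
                    cityName=CITY_DISTRICTS[word], cityDistrict=word)
        elif re.fullmatch(r"\d{1,2}:\d{2}", word) and i + 1 < len(tokens):
            nxt = tokens[i + 1]
            if cas.covered_text(nxt) in ("a.m.", "p.m."):
                cas.add("Time", tok.begin, nxt.end)


def event_annotator(cas):
    for tok in cas.select("Token"):
        if cas.covered_text(tok) == "stolen":
            cas.add("Crime", tok.begin, tok.end, name="theft")


def run_pipeline(text, engines):
    cas = Cas(text)
    for engine in engines:  # each engine sees only the shared CAS
        engine(cas)
    return cas


cas = run_pipeline(
    "Porsche was stolen in Queens at 11:30 a.m.",
    [token_annotator, named_entity_annotator, event_annotator],
)
for a in cas.annotations:
    if a.type != "Token":
        print(a.type, repr(cas.covered_text(a)), a.features)
```

Because every engine takes and returns the same structure, an engine can be swapped or reordered without touching the others, which is the interchangeability the slide refers to.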

11 UIMA Compliant Analysis Engines in ICA
ICA provides a number of annotators for advanced text analysis: Language Identification, Linguistic Analysis, Dictionary Lookup, Pattern Matcher, Named Entity Recognition, Document Classification.
Custom text analysis can be added as a UIMA annotator, for example an annotator that recognizes a product number and adds additional information such as product name, release date or price.
Diagram (UIMA pipeline stages): Language Identification, Tokenization, Word Analytics, Named Entity Recognition, Multi-word Analytics, Classification, Custom Analytics.
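The product-number annotator mentioned as an example could look something like the sketch below. It is written in plain Python rather than against the UIMA annotator interfaces, and the product-number format, the catalog contents and the names `PRODUCT_NUMBER`, `PRODUCT_CATALOG` and `annotate_products` are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical product-number format: two capital letters, a hyphen, four digits.
PRODUCT_NUMBER = re.compile(r"\b[A-Z]{2}-\d{4}\b")

# Hypothetical catalog used to enrich each recognized product number.
PRODUCT_CATALOG = {
    "AB-1234": {"name": "Example Widget", "price": 19.99},
}


@dataclass(frozen=True)
class ProductAnnotation:
    begin: int
    end: int
    product_number: str
    product_name: Optional[str]
    price: Optional[float]


def annotate_products(text: str) -> list:
    """Find product numbers and attach catalog details when they are known."""
    annotations = []
    for m in PRODUCT_NUMBER.finditer(text):
        entry = PRODUCT_CATALOG.get(m.group(), {})
        annotations.append(
            ProductAnnotation(
                begin=m.start(),
                end=m.end(),
                product_number=m.group(),
                product_name=entry.get("name"),  # None when not in the catalog
                price=entry.get("price"),
            )
        )
    return annotations


text = "Customer returned AB-1234 and asked about ZZ-0000."
for annotation in annotate_products(text):
    print(annotation)
```

Keeping the character offsets on each annotation lets later pipeline stages or a viewer highlight the exact span that produced the enrichment.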

12 https://www.ibm.com/developerworks/mydeveloperworks/blogs/36db6433-2f c bfd/entry/overview3?lang=en

13 Text Analytics Catalog

14 Content Analytics Studio
Content Analytics Studio is an integrated development environment (IDE) for creating your own custom analysis engine.
With Content Analytics Studio, you can:
Create language- and domain-specific dictionaries.
Write rules to match character patterns.
Write rules to identify patterns of tokens and other annotations.
Create UIMA annotators based on these dictionaries and rules.
Annotate text documents and view the details of annotations.
Annotate collections of documents.
All without needing to write code or understand the underlying technology.

15 Content Analytics Studio
Develop your custom text analysis with tooling: build language and domain resources into a LanguageWare dictionary; develop rules to spot facts, entities and relationships; create and test UIMA annotators with a collection of documents.
Export your custom text analysis: easily generate annotators that are ready for Content Analytics.
Deploy your custom text analysis within ICA: import the newly created annotators via the Content Analytics administration console and associate them with a collection.
(Screenshot: View Project Resources)

16 Content Analytics Studio
Supports iterative development: start with a small document set; develop a small set of resources to analyze; review results; update resources and review improvements; gradually increase the size of the document set and the complexity of the resources.
Tools to monitor improvement.
(Screenshot: Sample text for building a model)

17 Content Analytics Studio
Supports iterative development: start with a small document set; develop a small set of resources to analyze; review results; update resources and review improvements; gradually increase the size of the document set and the complexity of the resources.
Tools to monitor improvement.
(Screenshot: UIMA Pipeline components)

18 Content analytics is an iterative process of:
Build: process documents.
Validate: verify resource updates.
Analyze: perform analytics to find new insights.
Modify: update resources with new insights.

19 Architecture

20 ICA V3.0 System Architecture
Architecture diagram. The components shown are:
Content Analytics Studio.
Crawler Framework: FileNet P8, Web, DB2 Content Mgr, WebSphere Portal, Content Integrator, Web Content Mgr, DB2, Seed List, JDBC DB, SharePoint, Win FS, Exchange Server, Unix FS, Agent for File System, Notes, QuickPlace, Case Mgr, Domino Doc Mgt and NNTP crawlers, plus Custom Crawler, Crawler Plug-in and Crawler Inspector.
Importer Framework: CSV Importer (CSV input).
Collection: Raw Data Store; Document Processor (Parser, Document Generator, UIMA Annotator); Indexer Service (Indexer); Taxonomy Index, Search Index, Document Cache, Facet Count Sub Index, Thumbnail Index.
Global Processing: Web Link Analysis, Thumbnail Generation, Document Categorizer, Document Cluster, Term of Interest.
Exporter: Export Plug-in, with output to XML, CSV and RDB.
Text Analytics & Search Runtime, serving the Contents Miner UI, Enterprise Search UI, REST applications, SIAPI applications and real-time NLP applications.
Cognos BI Integration: Cognos BI.
Common Infrastructure: Scheduler, Logging, Control, Configuration, Monitor, Security, Admin UI.
Legend: Custom Point.

21 UIMA Compliant Analysis Engines in ICA
Resolve many ambiguities in text. Recognize domain-specific terms and expressions. Deal with the grammatical characteristics of each language (e.g. English, Chinese, Japanese, French, German, …).
Pipeline: Language Identification; Lexical Analysis (Paragraph/Sentence Segmentation, Tokenization, Character Normalization, Lemmatization, Part-of-Speech Tagging); Phrasal Analysis (Shallow Parsing: Named Entity Extraction, Phrase Recognition (Sentiment Analysis, etc.); Deep Parsing: JJSA (Japanese only)); Document Classification; Custom Analysis (add-on UIMA PEAR custom analysis engine).
Example:
Input Text: According to finance report, IBM Corp.’s EPS increased by 10.1%.
Language Identification: English
Sentence Segmentation, Tokenization: According | to | finance | report | IBM | Corp. | ’s | EPS | increased | by | 10.1%
Case Normalization, Lemmatization: according, corporation, increase
Part-of-Speech Tagging: preposition, noun (singular), adjective, noun (proper), possessive, verb (past tense), numeral
Named Entity Extraction: IBM Corp.; IBM Corp.’s EPS; 10.1%
Phrase Recognition: Positive (finance – increase)

22 Supported Languages
Collection type: Text Analytics Collection (ICA V3.0)
Arabic, Chinese (Traditional/Simplified), Danish, Dutch, English, French, German, Hebrew, Italian, Japanese, Portuguese, Spanish, Czech, Polish, Russian *1

23 Supported Data
Data Source
Web: HTTP/HTTPS (RSS, Atom); News groups (NNTP); WebSphere Portal, Web Content Management
Relational Database: DB2 family (DB2 UDB, Informix, DB2 for iSeries, DB2 for z/OS); Oracle, MS SQL Server, Sybase; VSAM, IMS, CA-Datacom, Software AG Adabas
Collaboration System: Lotus Notes/Domino databases, QuickPlace, Domino.Doc; Lotus Quickr, Lotus Connections; MS Exchange
Content Management System: IBM Case Manager; DB2 Content Manager; Documentum, FileNet CS, FileNet P8, Hummingbird, LiveLink Open Text, Portal Document Manager (PDM), Microsoft SharePoint
File System: Unix File System; Windows File System
Data Format
Plain Text: Plain Text, HTML, XML
Office Document: Adobe Portable Document Format (PDF); MS Rich Text Format (RTF); MS Word, MS Excel, MS PowerPoint; Lotus Word Pro, Lotus 1-2-3, Lotus Freelance, Ichitaro
More than 300 formats can be supported by changing the configuration file.

24 ICA Web Application Security
When ICA runs on WebSphere Application Server (WAS), global security must be configured to enable login.

25 Document Level Security by Security Token
You can assign security tokens at crawl time by:
Adding a fixed value as the security token.
Assigning the security token based on field values (only some crawlers).
Attaching the token programmatically using a custom crawler plug-in.
The search application must be customized to pass the tokens that the current user holds.
The search engine returns a document only if the given tokens match the security tokens indexed with that document.
Diagram (data source, crawler and crawler plug-in, parser, indexer, search index, search runtime): 1. Assigning security tokens to documents, by plug-in or extracted from the native data source. 2. User authentication and credential retrieval. 3. Results filtering by matching security tokens with user credentials.
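The token flow on this slide (assign at crawl time, pass the user's tokens at query time, filter on match) can be sketched as follows. The sketch is not ICA code: `IndexedDocument`, `assign_tokens`, `filter_results` and the comma-separated "acl" field are hypothetical, and it assumes that sharing at least one token is enough for a match, which the slide does not spell out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexedDocument:
    doc_id: str
    security_tokens: frozenset


def assign_tokens(doc_id, fields, fixed_token=None, token_field=None):
    """Step 1: attach tokens at crawl time, from a fixed value and/or a field value."""
    tokens = set()
    if fixed_token:
        tokens.add(fixed_token)
    if token_field and fields.get(token_field):
        # Assumption: the field holds a comma-separated list of tokens.
        tokens.update(t.strip() for t in fields[token_field].split(",") if t.strip())
    return IndexedDocument(doc_id, frozenset(tokens))


def filter_results(results, user_tokens):
    """Step 3: keep only documents sharing at least one token with the user."""
    allowed = set(user_tokens)
    return [doc for doc in results if doc.security_tokens & allowed]


index = [
    assign_tokens("doc-1", {}, fixed_token="all-staff"),
    assign_tokens("doc-2", {"acl": "hr, legal"}, token_field="acl"),
    assign_tokens("doc-3", {"acl": "finance"}, fixed_token="audit", token_field="acl"),
]

# Step 2 (authentication and credential retrieval) is assumed to have produced this list.
user_tokens = ["all-staff", "legal"]
print([doc.doc_id for doc in filter_results(index, user_tokens)])
# prints ['doc-1', 'doc-2']
```

Under this matching rule a document indexed with no tokens is never returned, so a deployment that wants public documents would need a shared token such as the fixed "all-staff" value used here.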

26 APIs
Custom search and admin applications can be implemented with the REST API, which is language independent.
It provides all required functions for creating a search UI: search navigation; facet navigation; search functions; faceted search; fetch content, thumbnails and previous document; list spell corrections, synonym expansions and type-ahead suggestions; and more.
It provides the required functions for administering search: managing collections; controlling and monitoring components; adding documents to a collection.

27 Extension Point List
Columns: Component / Name, Description, Published
Crawler
  Crawler Plug-in: Allow users to modify crawled content or metadata
  Web Crawler Pre-fetch Plug-in: Allow users to modify parameters of HTTP request
  Web Crawler Post-parse
  Archive File Handling Plug-In: Allow users to add custom archive file handling (e.g. LZH, RAR)
  Crawler Framework: Allow users to easily create a new crawler for unsupported data sources (Published: PRIVATE)
Document Processing
  UIMA Annotator Plug-in: Allow users to integrate their own text analytics logic
  Real-time NLP: Allow users to execute the UIMA pipeline without indexing the document
  Thumbnail Generation Plug-in: Allow users to integrate their own logic to generate thumbnail images
Export
  Document Export Plug-in: Allow users to export processed documents outside of ICA
  Deep Inspector Plug-in: Allow users to export text mining results outside of ICA
Text Analytics & Search Runtime
  Text Analytics & Search REST API: REST interface for Text Analytics & Search. Complies w/ SR20
  SIAPI: Search API
  Fetch / IMC: Provide content fetch APIs and IMC access
  Post Filtering Plug-In: Allow users to add custom impersonation logic
Administration
  Admin REST API: REST interface for Administration
Text Miner Application
  Custom Analytics View: Allow users to integrate their own visualization for text mining

28 IBM Content Analytics: Analysis Export Capability
1. Crawled Document Export: export documents with their metadata and content as they are crawled.
2. Analyzed Document Export: export documents with the results of text analytics, such as natural language processing, named entity extraction, classification or user-implemented logic, before indexing.
3. Searched Document Export: export documents limited by search or analysis, with their original content, from the index.
Diagram: inside Content Analytics, the Crawler, the Parser / Tokenizer with UIMA Annotators, and the Search Index (after the Indexer) each have an export plug-in; the Exporter and Data Store deliver the exports to content intelligence consumers (IBM Master Data Mgmt, RDB, ECM Solutions, InfoSphere), and data can also be imported.

29 Basic Analytics and Search Concepts
Structured Content: data that has unambiguous values and is easily processed by a computer program.
Unstructured Content: information that is generally recorded in a natural language as free text.
Text Analytics: a form of natural language processing that includes linguistic, statistical, and machine learning techniques for analyzing text and extracting key information.
Collection: a set of data sources and options for crawling, parsing, indexing, and searching those data sources.
Analytics Collection: a collection that is set up to be used for content mining.
Search Collection: a collection that is set up to be used for a search application.
Crawler: a software program that retrieves documents from data sources and gathers information that can be used to create search indexes.
Annotator: a software component that performs specific linguistic analysis tasks and produces and records annotations.
Parser: a program that interprets documents that are added to the data store. The parser extracts information from the documents and prepares them for indexing, search, and retrieval.

30 Q&A

