The Challenges of Building Enterprise Content Taxonomies and the Role of Classification Technologies in Maintaining Their Effectiveness Reginald J. Twigg,


1 The Challenges of Building Enterprise Content Taxonomies and the Role of Classification Technologies in Maintaining Their Effectiveness
Reginald J. Twigg, Ph.D. Capture, Classification and Taxonomy, IBM ECM
This session provides an overview of the role of taxonomy and classification management in ECM, outlines IBM’s strategy for enabling classification and taxonomy management in ECM (starting with FileNet P8), and introduces the offering. It is a Classification for ECM 101: it starts at a high level with the key challenges of managing unstructured enterprise content, then moves into more detail on the CM maturity model and, finally, the offering. No other vendor can provide such a broad set of capabilities or bring together so many ECM experts. This is unique to IBM, and a source of customer value.

2 Agenda
The Challenge of Unstructured Content
Key Concepts and Terms
Taxonomy, Classification and ECM Adoption
Classification Technologies for ECM

3 The Challenge of Managing Unstructured Content

4 80% of Enterprise Data is Unstructured
Billing statements, claims images, customer correspondence, mortgage docs, contracts, signed BOLs, healthcare EOBs, marketing collateral, website content, voice authorizations, signature cards, credit enrollments, Material Safety Data Sheets, ISO 9000 docs, plant schematics, product images, spec sheets ... and much more!
There are two key categories of enterprise information: data and content. Data are managed by relational databases. The advent of the relational database ushered in a wave of enterprise applications, starting with ERP, then CRM, Supply Chain Management, etc. These business solutions leveraged the database to help organizations manage their structured, transactional business processes. However, most enterprise information, especially that used in many core business processes for services, government and other sectors, is unstructured. According to Gartner, at least 80% of enterprise information is unstructured, and many organizations report growth of over 100% per year. Because unstructured content is the substance of many critical line-of-business processes, it has become necessary to deploy new technologies for managing those processes.

5 What is Enterprise Content?
According to Forrester, unstructured content falls into 3 main categories – transactional, business and persuasive. Imaging customers are most familiar with transactional content, such as forms, faxes, reports and scanned images. Business content includes process-centric and collaborative document management, while persuasive includes digital assets and web-based content. Unstructured enterprise content is tightly coupled with business processes. Content is created in processes, changed through business processes and also drives those processes. Finding ways to manage these content forms drives the effectiveness of many core business processes in customer service, case management, document management, etc.

6 Where do I start? Organizing the explosion of unstructured content becomes critical:
We’ve got 600 GB of content from basic content services all over the enterprise. How can we get this content efficiently mapped into our ECM taxonomy?
We’ve been managing our content without classifying it for a few years now. How can our users navigate this existing content in a way that’s intuitive for our business?
The lawyers have to review 400,000 electronic documents for their case. How can we make sure they don’t waste their time?
You’re likely familiar with the statistic by now: 80% of the information in the enterprise is unstructured (as opposed to structured, database content). By bringing that content under management with ECM and cataloguing it, enterprises are working hard both to save time and cost by further leveraging the content and to manage risk and enforce compliance for that content. To meet those ends, we’ve seen our customers begin to struggle with organizing this content. Lending structure to this unstructured tangle of documents leaves enterprises thinking, “Where do I start?” Getting this content organized is emerging as a hurdle in our customers’ adoption of ECM. Some examples:
Customers find that they simply cannot classify content as they bulk ingest it into ECM. There is no one to manually map hundreds and hundreds of gigabytes of content into their standard ECM taxonomy.
It’s not just the new content; it’s also the content that you already have under management. There wasn’t a classification solution that you had the time or budget for in the past, and now you’re left with content that’s unclassified. Yes, you have it under management, but it’s still relatively unstructured because you did the bare minimum. Now you’re having trouble finding this content and giving your business intuitive access to it.
Finally, you can’t always proactively organize your content. New applications (like a legal discovery review process) might demand new ways of approaching the information. Maybe you need to prioritize your content and create workflows that make sure your lawyers are devoting the bulk of their attention to the right area.

7 Business Value of Classification for ECM
Key Business Drivers
1) ECM Taxonomy and Classification: Increase accessibility of content under management. Automated, High Scale Classification; Classify at ingestion and/or re-classify over time; Taxonomy Evolution Tools; Enhanced Accessibility; Taxonomy Proposer.
2) Compliance, Records, Legal Discovery: Increase legal discovery review effectiveness while reducing risk. Legal Discovery Prioritization and Workflow Assignment; Records Classification and Exception Handling; Storage and Retention Policy Assignment.
3) In Process Classification: Increase worker productivity and automate content related decisions. Ad Hoc Category Suggestion; Content-Based Workflow Selection; Content Based Decision Making.
4) Message Tagging, Classification and Monitoring: Reduce inquiry costs, automate message routing and increase customer satisfaction. Message and Chat Routing; Agent Response Suggestion; Supervision and Monitoring; Automatic Customer Response.
From a broad perspective, there are a variety of applications of classification, both within ECM and outside it. Classification, at its core, provides the enterprise with an automated decision-making capability. This capability can in turn drive a wide set of automatic actions: assigning categories or taxonomies, filtering, routing, suggesting answers. These automated actions can be leveraged in a variety of use cases. Let’s review some of them:
1) For the customer who needs to accurately catalog content, we’ve got a solution that automates the process of classifying content as it is brought under management or as it needs to be reclassified. Beyond this, there are other applications of classification.
2) In the compliance space, you can use classification to automate the records declaration process and save every employee the time they spend deciding what “category” their content fits into. In the eDiscovery space, you can use classification to better organize your case content. Alternatively, you can organize your content in a manner that’s unique to a particular legal case for better navigation. You could divide up the content and categorize it in a way that makes sense for assigning workloads to lawyers with different areas of expertise, or even generate a list of high-priority documents for review by your most valuable legal minds, leaving the drudgery to lower-priced knowledge workers.
3) There is an intersection with BPM, in the sense that classification can be used to identify the proper workflow for a piece of content or to make a workflow more efficient by automatically populating attributes or making strong suggestions to the executor of the workflow.
4) Finally, classification has applications in message handling, whether that means automatically responding to a customer inquiry, routing an incoming message or monitoring it for inappropriate content.
<CLICK> But in this presentation, we’re going to focus on the value of the first use case (#1) and classification’s application to the problem of taxonomies in ECM . . .

8 Ability to Structure Content with Databases
There is an inverse relationship between the growth of unstructured enterprise content and the ability to get value out of it using traditional relational databases and data-centric enterprise applications. The challenge of managing unstructured content and using it in key business processes depends upon providing ways to rationalize, bring order to, and enable users to gain access to the right content at the right time.
[Chart: Ability to Structure Content with Databases. Labels: Percent of corporate information value managed in traditional databases; Data Creation and Demand; Structured Data; Unstructured Data; Application Types: OLTP and BI (narrow scope), Compliance, Competitive Intelligence (wide scope). Source: Gartner]

9 Multiple Repositories Make Access Difficult
[Chart: number of content repositories per organization. Categories: 1 repository; 2-5 repositories; 6-10 repositories; 10-15 repositories; more than 15 repositories; don't know. Values shown: 5%, 17%, 36%, 25%, 14%, 4%. Source: “The Future of Content in the Enterprise,” Connie Moore and Robert Markham. Base: 81 North American decision-makers (multiple responses accepted)]
According to Forrester, Global 2000 organizations have multiple content repositories. Each repository has one or more content management applications built on top of it. Most notable is that a quarter of large enterprises have more than 15 content repositories with dependent applications. This can be a real barrier to setting policies and standards for the enterprise, since the organization must interpret the different organizational concepts and schemas across different content management repositories. Please note that these statistics include mainstream content management repositories (e.g., FileNet, IBM, OTEX, DCTM) and not other content management systems.

10 And Then There’s SharePoint, File Shares and . . .
Which leads us to SharePoint. Its ease of deployment has made Microsoft SharePoint the choice for many departmental collaboration and simple content management solutions. These solutions, including Quickr, Team Rooms and other departmental content management applications, are loosely organized, relying on users to classify and manage the lifecycle of their content. The challenge is to gain access to and manage this content without sacrificing productivity. Being able to classify this content is foundational to being able to manage it. Enterprises look out over that horizon, over their own organization, to find all the other content they’re determined to bring under management. But it’s not just one file system, or one SharePoint instance. These repositories, these silos, are everywhere. There could be hundreds of SharePoint repositories. There could be hundreds of file shares. Silos associated with basic content services can litter the enterprise. Maybe this content is classified. Probably not. And to make matters worse, even if there is a taxonomy for each of these content silos, and the content is all classified, it isn’t likely to match the standard information model you’ve built into your ECM architecture. Each silo can come with its own designed taxonomy, or with an organic taxonomy: a “folksonomy”.

11 Key Concepts and Terms Now we turn to classification and taxonomy management. Before diving deeper into the role of taxonomy and classification management in ECM, it is important to level-set on the key concepts and definitions. Although many of you have experience with classification (here ask how many audience members are currently working on class/tax projects), the terminology can vary. This section attempts to provide some clarity on these concepts.

12 Key Concepts
Metadata: a means of describing, locating, cataloging, and activating content as objects in a software ecosystem (literally, data about data).
Enterprise Catalog: a centralized and normalized metadata model for unstructured content for the purposes of providing consistent services across all ECM applications.
Taxonomy: a hierarchical structure of information components, any part of which can be used to classify a content item in relation to other items in the structure.
Classification: a coding of content items as members of a group for the purposes of cataloging them or associating them with a taxonomy.
Taxonomy - Paraphrased from Gartner: “A taxonomy is a classification, typically hierarchical, of information components (for example, terms, concepts, graphics and sounds) and the relationships among them. In a hierarchical structure, it reflects increasing levels of specificity the further down the hierarchy a particular element lies. Taxonomies may be used to represent membership in various domains, and can support information organization, discovery, presentation and access. The taxonomic organization and its labels serve as metadata for the content they organize.” Hype Cycle for the High Performance Workplace 2006, Rita E. Knox et al. 14 July 2006 (ID: G ), p. 34.
Classification - Classification is rooted in natural and library sciences for the purposes of identification and location. Wikipedia provides the most useful definition and analogy for ECM classification: “A library classification is a system of coding and organizing library materials (books, serials, audiovisual materials, computer files, maps, manuscripts, realia) according to their subject and allocating a call number to that information resource. Similar to classification systems used in biology, bibliographic classification systems group entities that are similar together, typically arranged in a hierarchical tree structure (assuming non-faceted system). Classification of a piece of work consists of two steps. Firstly the 'aboutness' of the material is ascertained. Next, a call number based on the classification system will be assigned to the work using the notation of the system. Classification systems in libraries generally play two roles. Firstly they facilitate subject access (See Cutter) by allowing the user to find out what works or documents the library has on a certain subject. Secondly, they provide a known location for the information source to be located (e.g. where it is shelved).”

13 Taxonomy Is . . .
Not turning animals into trophies
A system for organizing the corpus of business content
Before the builds, point out that we often get asked what taxonomy is. I have been asked if it is about stuffing animals. Hit the first build: what taxonomy is not. The second build explains what it is in simple terms. The library image also provides the organizing principles for ECM and the use of the enterprise catalog for organizing and providing access to the corpus of content.

14 Taxonomy and Classification in ECM
Classification Examples: Document Classing; Foldering
Taxonomy Examples: Enterprise Content Catalog; Industry Standard Document Taxonomies (ISO, XMI)
Methods:
Rules-Based: Applies pre-determined rules for ‘if, then’ classification of text and properties
Analytics-Based: Applies algorithms to interpret classes in order to apply classification rules to them
These are the most fundamental types of classification and taxonomy. Most ECM customers will be familiar with these concepts, so they should provide a baseline for the remaining discussion. There are 2 key methods of classification. Rules-based classification is the most common and is very useful. It allows setting pre-determined rules for classing content, then determining the appropriate actions to take on it. The ‘if, then’ logic of rules-based approaches quite literally performs classification (the ‘if this’ part classifies a content item) in the process of taking action on it. Records Crawler is a solid rules-based engine for classification and provides a useful tool for many classification tasks. Analytics-based classification provides content intelligence as a way of understanding content that cannot be easily classified through pre-defined rules. Using advanced analytical algorithms, analytics-based classification can learn a taxonomy and interpret unknown content to automate its classification. These approaches have the benefit of reducing the number of exceptions to manage, and they help classify content in file shares and collaboration solutions, where little or inconsistent metadata is common. The two approaches are complementary: rules-based classification can address most content, while analytics-based classification helps manage exceptions. Today, the IBM Classification Module (available through the OmniFind offering) provides out-of-the-box analytics classification.
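As a rough illustration of how the two methods complement each other, the Python sketch below applies ‘if, then’ rules first and falls back to a very simple word-overlap comparison against teaching documents for the exceptions. The rules, class names, teaching documents and scoring are all hypothetical stand-ins chosen for illustration; this is not how Records Crawler or the IBM Classification Module work internally.

```python
import re
from collections import Counter

# Hypothetical 'if, then' rules: each pairs a test on the document with a class.
RULES = [
    (lambda doc: doc["name"].lower().endswith(".msds.pdf"), "Material Safety Data Sheet"),
    (lambda doc: "invoice number" in doc["text"].lower(), "Billing Statement"),
]

# Hypothetical teaching documents used by the analytics-style fallback.
TEACHING_DOCS = {
    "Contract": ["this agreement is made between the parties and sets out the terms"],
    "Customer Correspondence": ["dear customer thank you for your letter regarding your account"],
}


def tokens(text):
    """Split text into lowercase alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def build_profiles(teaching_docs):
    """Count the words seen in the teaching documents of each class."""
    return {
        label: Counter(word for doc in docs for word in tokens(doc))
        for label, docs in teaching_docs.items()
    }


PROFILES = build_profiles(TEACHING_DOCS)


def classify(doc, profiles=PROFILES):
    """Return (class, method): rules first, word overlap for the exceptions."""
    for test, label in RULES:
        if test(doc):
            return label, "rule"
    words = Counter(tokens(doc["text"]))
    # Score each class by how many word occurrences it shares with the document.
    scores = {
        label: sum(min(count, profile[word]) for word, count in words.items())
        for label, profile in profiles.items()
    }
    best = max(scores, key=scores.get)
    if scores[best] == 0:
        return "Unclassified", "exception"  # left for manual review
    return best, "analytics"
```

In this sketch a document that matches a rule never reaches the fallback, which mirrors the idea that rules handle most content while the analytics step reduces the number of exceptions left for people to handle.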

15 ECM Taxonomy Illustrated
Before we go any further, let’s pause to define some terms. Taxonomy can mean different things to different people. In the spirit of “a picture is worth a thousand words”, we’ve thrown up some examples of taxonomies. Let’s build up to a definition of a taxonomy. Metadata is data about content. Think of metadata as an attribute of a piece of content or a tag associated with a piece of content. In turn, a taxonomy is a collection of the different metadata values for a particular attribute (or category). All the different values an attribute could have are collected into a metadata structure called a taxonomy. Some easy examples of taxonomies within the context of P8 are folders or document classes. But really, any other attribute that has been designed into your ECM deployment makes up a taxonomy. The act of deciding where in a particular taxonomy a document should be placed is the act of classification. And that decision, that action of classification, is the basis of the solution we’ll discuss today.
Other notes to the speaker: Some outright, formal definitions:
Metadata: A means of describing, locating, cataloging, and activating content as objects in a software ecosystem (literally, data about data)
Taxonomy: A hierarchical structure of information components, any part of which can be used to classify a content item in relation to other items in the structure
Classification: A coding of content items as members of a group for the purposes of cataloging them or associating them with a taxonomy
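To make the relationship between metadata, taxonomy and classification concrete, here is a minimal Python sketch. It models a taxonomy as a tree of allowed values for one attribute and treats classification as recording a valid path from that tree in a document's metadata. The `Node` class, the `document_class` attribute name and the category names are assumptions made for illustration; they are not the FileNet P8 object model.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One entry in a hierarchical taxonomy."""
    name: str
    children: dict = field(default_factory=dict)

    def add(self, path):
        """Create the nodes along path (a list of names) if they are missing."""
        node = self
        for part in path:
            node = node.children.setdefault(part, Node(part))
        return node

    def contains(self, path):
        """Return True if every step of path exists in the tree."""
        node = self
        for part in path:
            if part not in node.children:
                return False
            node = node.children[part]
        return True


# Hypothetical taxonomy for a single attribute, the document class.
taxonomy = Node("Document Classes")
taxonomy.add(["Finance", "Billing Statements"])
taxonomy.add(["Finance", "Contracts"])
taxonomy.add(["Operations", "Spec Sheets"])


def classify(document, path, tree=taxonomy):
    """Classification: place the document at a valid position in the taxonomy."""
    if not tree.contains(path):
        raise ValueError("not in taxonomy: " + "/".join(path))
    document["metadata"]["document_class"] = "/".join(path)
    return document
```

For example, `classify({"name": "q3.pdf", "metadata": {}}, ["Finance", "Contracts"])` records `Finance/Contracts` as the document class, while a path that is not in the tree is rejected rather than silently stored.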

16 Taxonomy, Classification and ECM Adoption
Now let’s connect the dots – how classification and taxonomy play in ECM.

17 Drive New Business Value from Content
Improve Content Access Organize Unstructured Content Content Classification Solutions Derive Business Insight

18 Business Drivers for ECM Taxonomy Management
Proliferating departmental solutions: Content Management; Collaboration (SP, Quickr, Team Rooms, Wikis)
User-based classification and high workforce turnover: Productivity declines as knowledge disappears; Legal discovery is a secondary concern
Mergers and Acquisitions: need to reconcile disparate content management practices, repositories and processes
Challenges: the growing number of departmental solutions is an obvious challenge. How do organizations balance the need for knowledge-worker productivity with establishing and maintaining corporate standards for business process and governance? High workforce turnover risks a loss of knowledge and understanding of the uses of content. Although lawsuits are an obvious example (an employee leaves and then sues the company), the loss of productivity that follows from losing knowledge about the location and uses of content is a higher, more measurable cost. Studies have suggested that the average company invests $50,000 to train new knowledge workers. Add to this the cost of recreating documents that may be ‘out there somewhere’, and turnover can add significant expense to relying on users to manage classification. M&A is another obvious challenge. Not only must different content be accessed and integrated, but this content is always born of different processes, users and work cultures. Being able to understand acquired taxonomies is essential to integrating them into the new ECM system.

19 Classification is Hard Work
Key Business Challenges
ECM Taxonomy and Classification
Most organizations face content taxonomy pain, especially as they standardize around ECM:
Mapping content to taxonomy during ingestion
Reclassifying content under management
Evolving taxonomies as new types of content emerge
Integrating folksonomies (SharePoint) into a master taxonomy
Increase accessibility of content under management:
Automated, High Scale Classification
Classify at ingestion and/or re-classify over time
Taxonomy Evolution Tools
Enhanced Accessibility
Taxonomy Proposer

20 Organization is the Root Cause
Most organizations face content taxonomy barriers, especially as they standardize around ECM:
Assigning categories en masse
Reclassifying existing content as taxonomies evolve
Merging taxonomies
Integrating the wisdom of folksonomies
In Enterprise Content Management, taxonomies ensure that content is accurately catalogued and easily accessible. Having consistent and reliable access to unstructured content is arguably the foundation for realizing the business benefits of ECM, and all subsequent content-centric enterprise applications will realize their ROI by leveraging this essential capability. Enterprise information architects see the standardization of content under a single set of rules and policies as a key driver for ECM adoption. ECM platform standardization on IBM FileNet P8 enables the management of unstructured content across the enterprise by managing metadata in a single, unified catalog. Standardization also raises a unique challenge: how to manage content mapped under widely different metadata structures (i.e. different taxonomies) spread across multiple departments, repositories, and applications. This manifests itself in some pain points that you yourself might be pondering or tackling right now.
1) The easiest, most basic use case to think of is bringing new content under management in ECM. It’s stored on a file share and you need to accurately catalog it: thousands of documents coming in ‘en masse’, requiring classification.
2) The next pain point is related. What if you’ve already brought content under management and, either because you didn’t have the facility before or because the need didn’t exist at ingestion time, the content doesn’t have the proper classification? You need to reclassify your content.
3) The business scenarios go on. You might have merged with a company that has its own ECM. How are you going to reconcile the conflicting taxonomies?
4) You’ve got taxonomies distributed through the enterprise. They don’t exactly match your standard taxonomy, but they’ve got interesting aspects to them. How are you going to normalize your standard taxonomy to take this wisdom into account?

21 Challenges and Impacts of Merging Taxonomies
Misclassification – change is constant, and master taxonomies must manage multiple custom taxonomies for each content source “Folksonomies” from departmental collaboration solutions are created by users and unmanaged by ECM standards Impact: Unreliable Metadata – Inconsistencies lose or mislabel content Process Misfires – Poor metadata triggers incorrect events and workflows Information challenges. Unreliable metadata is particularly a challenge with BCS (Basic Content Services) and local collaboration solutions (SharePoint, Quickr). These solutions rely on users to manage metadata, if it is managed at all. Without consistent metadata, traditional classification methods break down. “Folksonomies” are user-created, purpose-built taxonomies. Systems like SharePoint and Team Rooms allow users to create and use their own classifications. While folksonomies might have their work purpose, they create a challenge for applying consistent standards to their content. Classification flux is the fact that taxonomies are dynamic and constantly subject to change. For this reason classification is not a single-shot but must be performed on a regular basis to ensure that taxonomies are not out of synch. Scale is the Challenge – Automation is Essential
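One simple way to picture the translation of folksonomy tags into a master taxonomy is a normalized lookup table, with anything unrecognized set aside for review. The Python sketch below assumes a hand-maintained table; the tag names, master taxonomy paths and normalization rules are hypothetical, and real folksonomies would need far richer matching than this.

```python
def normalize(tag):
    """Lowercase a user tag and collapse hyphens, underscores and extra spaces."""
    cleaned = tag.lower().replace("_", " ").replace("-", " ")
    return " ".join(cleaned.split())


# Hypothetical translation table: normalized user tag -> master taxonomy path.
TRANSLATION = {
    "msds": "Operations/Material Safety Data Sheets",
    "safety sheets": "Operations/Material Safety Data Sheets",
    "contracts": "Legal/Contracts",
    "signed agreements": "Legal/Contracts",
}


def translate(tags):
    """Map user tags to master paths; return (mapped, unmapped)."""
    mapped, unmapped = {}, []
    for tag in tags:
        path = TRANSLATION.get(normalize(tag))
        if path is None:
            unmapped.append(tag)  # candidate for review or taxonomy evolution
        else:
            mapped[tag] = path
    return mapped, unmapped
```

Because taxonomies keep changing, a table like this would have to be re-run and revised regularly; the unmapped list is where new or drifting user vocabulary would surface.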

22 Classification Barriers to ECM Maturity
The keynote session provided an introduction to the ECM maturity model. Classification challenges can pose hurdles to moving up the maturity curve, specifically: Build 1 (ingestion): moving from a siloed to a more integrated content management environment poses the challenge of being able to interpret and classify content across multiple silos. Getting metadata into FileNet P8 from multiple content silos is a major challenge, since understanding how to classify it often requires manual effort. At this stage, automation is needed to manage the volume and types of content. Build 2 (Standardization): Establishing an ECM platform as the enterprise standard for content-centric business applications has demonstrable ROI in reducing operating costs and allowing the consistent capturing and application of policies across the organization. The hurdle to standardization is applying standards across multiple taxonomies. The inability to reconcile different taxonomies can lead to failure to standardize. Build 3 (Enforcement): Once standards are set, classification is necessary to provide ongoing enforcement of content management policies across the organizations. Content must be understood in order to apply both policies and process to it. Classification and taxonomy serve as the basis for this understanding. Compliance typically surfaces the pain of classification and enforcement, although many content management applications raise the requirement. Note: Often the challenge of taxonomy management is less about defining an enterprise taxonomy than about applying it to the different taxonomies that exist in work processes and different forms of content. In this case, taxonomy does not need to be imposed as much as it needs to interpret, understand, and classify content in different applications and sources. 
[Figure: Classification Barriers to ECM Maturity. Axis labels: Scale / Scope; Technology & Capabilities; Maturity Evolution Over Time. Scale / Scope labels: Stand-alone; LOB and Departmental; LOB Systems; Truly Enterprise Class Systems. Maturity levels: Level 1 Chaos; Level 2 Silos & Storage; Level 3 Search & Discovery; Level 4 Federation & Activation; Level 5 Federated Activation & Policy Based. Classification hurdles marked on the curve: #1 Ingestion; #2 Standardization; #3 Enforcement. Technology & Capabilities labels: Paper; Shared Drive; Multiple disparate content repositories; Federated; Unified Repository; Content/Process Fusion; Active enforcement of Compliance; Integrated Content, Process and Compliance capabilities; Available as a service. The figure also lists per-level characteristics covering usage, security, re-use, workflow and responsibility.]

23 Lessons Learned From ERP Adoption
Getting Classification Right: ‘Garbage in = garbage out’ is often used in metadata management projects to describe the problem of building a metadata model on inconsistent sources.
Driving Process on Taxonomies: ERP systems depend on 3 master taxonomies: material, vendor and customer. These taxonomies drive events, workflow definition and the development of transaction-centric business process applications.
Mastering Metadata: The ability to deploy new enterprise applications depends upon the re-usability, scalability and integrity of the metadata model.
System of Record is Required for Standardization: Establishes an enterprise standard that can be audited. Forms the foundation for building demonstrable best practices. Enforces consistency of data capture and output.
ECM has many lessons to learn from the mass adoption of ERP. SAP arguably became the market leader in this space by closely managing classification, taxonomy, metadata and process. ECM has similar challenges today. Getting Classification Right: Managing data input requires discipline. ERP systems tightly controlled the classification of data and data entry to maintain the integrity of their metadata models. Much of the pain of ERP implementation was training users to enter data in a highly structured, disciplined way. Classification is at the core of this process. ECM will have special challenges, however, since much unstructured content is produced by knowledge workers in their own processes. The challenge for ECM will be to classify in ways that maintain and enforce process and policies without interfering with productivity. Driving Process on Taxonomies: SAP built the R/3 system on 3 master taxonomies: material, vendor and customer. The relationship among these 3 taxonomies formed the basis of its event model, workflow and application stack. This provided a consistent way of applying taxonomy to improve the efficiency of business processes.
The challenge for ECM adoption is to maintain productivity while providing the benefits of master taxonomies. Mastering Metadata: Not a new concept to FileNet P8 users, mastering metadata for unstructured content ensures consistent access to, and interaction with, that content across different business functions and sources. Establishing a System of Record: First used in data management in the 1980s, a ‘system of record’ provides a basis of accountability for all of the information contained in the system. This means it can pass audit (internal, external, regulatory, industry) and serves as the basis for demonstrating policy compliance. ERP systems established themselves as the ‘system of record’ for transactional business processes – most notably General Ledger, costing, inventory, procurement, etc. The challenge for ECM is to establish a system of record for unstructured content and content-centric business processes, records management, document management and other ECM processes.

24 Customer Lessons for Mastering ECM Taxonomies
‘Master’ taxonomy of record required for Compliance Business process applications Merged master taxonomies become large and unwieldy Multiple taxonomies require integration and translation Centralized, decentralized, or hybrid? Intelligent Classification increasingly is used to manage: Taxonomy merging from multiple use cases Taxonomy/folksonomy translation from distributed content sources

25 A Look at ECM Classification Technologies
This final section introduces the strategy and offering for classification and taxonomy management for FileNet P8.

26 State of Classification Management Technologies
ECM Classification/Taxonomy is an emerging discipline Industry standard taxonomies: Focus on business function or transaction types Have not reached the enterprise level Classification best practices: Content ingestion Application development reclassification Classification software focuses on content ingestion: Electronic content ( , Office documents, free-form text) Paper content (document images) requires OCR Search is not enough – must drive value in the business process

27 Criteria For ECM Classification Management Solutions
Integrate with and support the ECM metadata model Interpret a highly-federated content ecosystem Go beyond search to catalog and manage content Build on advanced analytic technologies – rules alone are not enough Interpret content to extract meaningful (meta)data Employ multiple methods (engines) for classification Integrate teaching/learning

28 Common Platform for Electronic Content Classification
Queue Classification and Monitoring Compliance, Records, Legal Discovery Classification Platform ECM Taxonomy and Classification In Process Classification

29 IBM Classification/Taxonomy Strategy for ECM
Enterprise Services for Active Content Classification/Reclassification at Capture File Shares SharePoint, Quickr Federated Repositories Taxonomy Management for Exposing P8 taxonomies (Enterprise Manager) for classifying enterprise content Extending taxonomies as enabling services for Content-Centric BPM Applications Establish System of Record for Master Content Management IBM ECM’s approach to classification and taxonomy management is to provide these as an enterprise service. This service helps to ensure the integrity of metadata from different content sources and access to key content from applications built on the P8 platform. This service, which we call Master Content Management, is designed to provide classification and taxonomy as a foundational, universal capability of which all ECM/BPM applications can take advantage. This solution, using existing IBM products in the Discovery portfolio (specifically, the classification analytics and integration products acquired through iPhrase), is available today as part of the OmniFind offering. We are completing an integration of this analytics product with FileNet P8 to provide a solution for classifying content in file shares, collaboration solutions (such as SharePoint) and other content sources. An analytics-based engine, this classification/taxonomy management solution for P8 goes beyond keyword search to provide syntactic, semantic and contextual interpretation of content (metadata and full text) for classification. With the ability to learn from teaching documents and user actions, this analytics-based classification engine increases the quality of classification with use. Multiple algorithms are provided out of the box, as well as rapid deployment of integrations with different content sources. This product will be announced as the ‘ECM Taxonomy Server’ in May, 2007.

30 IBM Classification Module for Electronic Content
Organize your ECM content Automated classification and filtering Combines text analytics understanding with rules Acquires domain specificity from your own content Unique learning technology for adaptive classification Suggests new categories or even seeds an entirely new taxonomy Rectifies conflicting taxonomies Market proven, scalable platform. With that in mind, let's introduce to you the solution from IBM – the IBM Classification Module for IBM FileNet P8. At its core, it can automate the classification of content as it is ingested or content that is already under management. It can also filter out content that doesn’t fit the profile of managed content. It does so through a combination of text analytics-based understanding with rules. We’ll spend more time on this, but in summary, text analytics understanding of your content means that the solution reads your content – all of the text in the document or – and assigns metadata to it based on the full body of the content. It's not just looking for one magic word or following a set of policies (though it can if you need it to). It's reading the whole document and taking the context of the whole document into account. When it makes classification decisions, it's able to do so in a manner unique to your business because it trains itself from your content (especially the content already classified and under management in P8!). And as you use the software more and more, the solution learns from feedback you provide to it. What you see on the left hand side of the slide here is a snapshot of the Classification Review Tool. It lets you audit the automated actions as well as handle the exception cases. With the auditing, you’re able to monitor the automation – and as you gain confidence in the quality of the decisions being made by the solution, you can ramp down the level of auditing you execute. 
As you review and audit the automated actions, these “manual” actions provide feedback for understanding your content; the solution takes that feedback into account and in turn adapts its understanding in real time. So the very next classification request will learn from the actions that you’ve taken. And as it learns in this manner, it weighs more recent teachings more heavily than older ones, allowing the system to adapt and evolve its understanding of how to classify content as your business adapts and evolves. In terms of helping your taxonomy evolve, the solution can also review content items that don’t fit into the current taxonomy and in turn suggest new groups and new categories to augment your existing taxonomy. Furthermore, you can take the full set of content and ask it to recommend not just a new group of values, but an entirely new taxonomy. The core of the solution is built on the IBM Classification Module, a market proven product from IBM that has been in the market for over years.
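The notes above say that recent teachings count more heavily than older ones, but they do not say how the product does this. One common way to get that behaviour is exponential decay, sketched below. The class name, the `decay` parameter and the term-vote model are all assumptions made for illustration, not the IBM implementation.

```python
from collections import defaultdict


class RecencyWeightedFeedback:
    """Keeps a decayed vote tally per (term, category) pair.

    Hypothetical illustration: `decay` and every name here are assumptions,
    not part of the IBM product.
    """

    def __init__(self, decay=0.9):
        self.decay = decay  # each new teaching multiplies older votes by this factor
        self.weights = defaultdict(lambda: defaultdict(float))

    def teach(self, terms, category):
        """Record one reviewer decision: these terms belong to `category`."""
        for term in set(terms):
            votes = self.weights[term]
            for cat in votes:
                votes[cat] *= self.decay  # older evidence fades
            votes[category] += 1.0  # the newest evidence counts in full

    def score(self, terms):
        """Sum the decayed votes for each category over the document's terms."""
        totals = defaultdict(float)
        for term in set(terms):
            for cat, weight in self.weights.get(term, {}).items():
                totals[cat] += weight
        return dict(totals)
```

With a small `decay`, a single recent correction can outweigh several older decisions, which is the adaptive behaviour the notes describe; a `decay` close to 1 makes the system slower to change its mind.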

31 Understanding Content with Text Analytics
[Diagram: a categorized corpus with sample texts for categories A, B and C trains the Classification Engine (Training/Teach). An incoming document (“The strategic value of this market is paramount to IBM”) goes through Matching and comes back with a list of categories and relevancies (scores): C: 97%, B: 54%, A: 12%. Audit results flow back into the engine as Feedback.]
Here, let's dive down a little deeper into the IBM Classification Technology – the core service for interpreting different classifications and normalizing their expression in the central enterprise catalog. This service is a combination of analytics, linguistic and statistical methods and tools. Let's focus on the right hand side first: a document is sent to the classification engine and its text is “read”. In turn, a classification (or set of suggested classifications) is returned for use by the ECM system via text analysis. The content is run through natural language processing to identify the key concepts involved in the document, and in turn this concept profile is compared against the training set for statistical similarity. The content is classified as part of the category (or categories) it is most similar to. Each classification, at application runtime, is also paired with a confidence level. This confidence level is used in a variety of ways. Primarily, it is used to set a level of automation. The higher the confidence level you require of automated action, the lower the amount of your automation will be. Other applications can use this confidence level to throttle the levels of automation. Moving to the left hand side, the system itself is trained using real content, each item associated with a category in the taxonomy. Your actual business content is used to create this statistical profile of your taxonomy, and it learns from the best possible examples – your real content. 
It encompasses the “messiness” of real content and the subtleties that more rule-based approaches might miss. With actual content, we get not only the main topic of the document, but also the full context to help us differentiate similar categories. The capability of understanding the meaning of unstructured text and adapting to changing environments in real-time is what makes IBM Classification Technology unique. The technology understands not only the words used, but also the context of the language, as well as associated metadata. Unlike other technologies, IBM Classification Technology self-learns, becoming more accurate over time without requiring human adjustment. ============================= Other Notes: IBM Classification Technology currently supports language processing and identification in 11 languages, including most west European languages, Japanese, Chinese, and Korean. It can also execute classification in a generic manner for many other languages. The broader solution leverages FileNet P8 to train the system and in turn automate the assignment of classifications.
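The matching step described above (compare a document's profile against the training set, return categories with scores) can be sketched with plain term counts and cosine similarity. This is a deliberately simplified stand-in: the notes describe natural language processing and concept profiles, whereas this sketch only counts words, and every function name here is hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def train(corpus):
    """Build one term-count profile per category from (text, category) pairs."""
    profiles = {}
    for text, category in corpus:
        profiles.setdefault(category, Counter()).update(tokenize(text))
    return profiles


def cosine(a, b):
    """Cosine similarity between two term-count profiles."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def classify(text, profiles):
    """Return (category, score) pairs, best match first."""
    doc = Counter(tokenize(text))
    scores = [(cat, cosine(doc, prof)) for cat, prof in profiles.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

The ranked list plays the role of the slide's "categories list and relevancies": a caller would take the top score as the confidence level and compare it with whatever automation threshold it requires.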

32 Classification Workflow: Accelerating Content Organization
[Diagram labels: File System; Existing Unclassified Managed Content; Classifier; Automatically categorize majority of content; Classification Review Tool; Send to taxonomy proposer.]
The workflow for the ingestion use case starts with the ingestion process, where content (or all of your content) is identified for classification from its wide variety of sources. In any case, the Classifier reviews the incoming documents and attempts to automatically classify them in the ECM system: it decides where to put them, under which folder, using which document class and/or what attribute. If it feels confident about a document, it can auto-classify it into the right document class and folder. This level of confidence is configurable. If, on the other hand, it feels confident that the document doesn’t belong in the ECM system at all because it doesn’t seem to belong to any existing category in the repository, it can filter out the document: simply remove it and not insert it into P8. (This step is, of course, optional.) In all other cases, it forwards the document to the Classification Review Tool to have it manually reviewed by the user. A fourth scenario exists as well: content that it has confidently classified, but that is siphoned off to the review tool for auditing purposes. You can set the level of auditing to meet your needs. We’d typically recommend starting with a high level of auditing to build confidence that you trust the automated actions, and then turning the auditing down to lower levels as time moves on. The bulk of the content will be automatically handled in this workflow. But for the content that does flow into the review tool, there are a couple of potential outcomes: The content was properly classified and the reviewer simply confirms the automated decision. Positive feedback is generated for real-time learning. 
The content was improperly classified (or wasn’t effectively classified at all) and the user assigns the content to an existing category in P8 or creates a new category in P8 and assigns the content to that category. Again, feedback is generated and the system learns in realtime. Finally, the content can be sent for further review in the taxonomy proposer, and used for creating new categories for your taxonomy Filter out documents Basic Content Services Reference: Integration Components Classifier (Runtime Application) Classification Review (UI) Taxonomy Proposer (UI) Content Extractor (training based on P8)
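The four routes in the workflow above (auto-classify, filter out, manual review, audit sample) can be sketched as a single decision on the top confidence score. The threshold values, the parameter names and the idea of treating a very low top score as "does not belong" are assumptions for illustration; the source only says that the confidence level and the auditing level are configurable.

```python
import random
from enum import Enum


class Route(Enum):
    AUTO_CLASSIFY = "auto-classify"
    FILTER_OUT = "filter out"
    REVIEW = "manual review"
    AUDIT = "audit sample"


def route(best_score, accept_at=0.85, reject_below=0.10, audit_rate=0.2, rng=random):
    """Pick a route from the top category score (0.0 to 1.0).

    All thresholds are hypothetical defaults, not product settings.
    """
    if best_score >= accept_at:
        # Confident classifications are sampled for auditing at `audit_rate`.
        return Route.AUDIT if rng.random() < audit_rate else Route.AUTO_CLASSIFY
    if best_score < reject_below:
        # Nothing in the taxonomy resembles the document, so it is not inserted.
        return Route.FILTER_OUT
    return Route.REVIEW
```

Starting with a high `audit_rate` and lowering it over time mirrors the recommendation in the notes; raising `accept_at` trades automation for certainty.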

33 Components of the Solution for Text Classification
Classifier Automatically classifies and filters out documents Moves some documents for manual review Classification Review Tool Allows user to manually review documents Content Extractor Extracts content from the ECM system for training Taxonomy Proposer User workflow to identify and name new categories or apply existing taxonomy from P8 Features available today include: Automatic classification of source documents. This classification tool learns from the corpus of content it touches (both native P8 and external sources) to enhance the automation of classification. Review Tool for managing exceptions. The Review Tool also learns from expert interaction to provide additional automation over time. Content Extractor for evaluating content for learning purposes. Taxonomy Proposer uses workflow to enable users and the system to identify new categories/classes and name them for future reference.

34 Classification for Paper Documents
Classification of paper documents occurs in capture process Use cases for paper document classification Recognition using OCR/ICR Classification to associate to folders or doc class Separation to reduce costs and improve process

35 Three Primary Types of Images – The Document Recognition Problem
Less Advanced More Advanced Semi-Structured Structured Un-Structured Structured Structured data Known page layout Consistent formats Limited data fields Good quality documents Semi-Structured Semi-Structured data Unknown page layout Variable formats Tabular data Complex/multi-page documents Variable quality documents Unstructured Unstructured data Unknown/complex page layout Structured forms application forms benefit forms tax forms airline tickets Remittances checks paying in slips Semi-structured forms invoices Unstructured forms correspondence Processing Unstructured and Semi-Structured Forms When it comes to recognizing unstructured or semi-structured forms and other documents types, these same steps still apply, although in a different order and manner. On unstructured forms such as invoices, EOBs, and transportation documents, locating variably-formatted data is the number one priority. The presence of random differences over a series of functionally similar forms means that the forms in question cannot be processed using the traditional template-based approach, in which one software template matches each and every data field on each and every form in a presorted batch. Multiple approaches are required, approaches that create alternatives with results that are compliant with classifying similar data elements. Given the incredible computing horsepower and enormous amount of memory that reside in the average desktop PC, a variety of techniques, including ICR brute force, can be run in parallel to achieve remarkably accurate results that were impossible to achieve only a few years ago. Sometimes, morphological analysis techniques involving blob analysis, edge detection, multi-line character segmentation, and long-line detection can be used to find form objects, columns, and data fields. 
At other times, the geometrical and spatial relationships between the text data elements, such as rows or subheadings (rather than graphical objects), can locate the places where data most likely will be found. In fact, the data location process need not involve character recognition at all; the text can be treated as integral patterns of blobs. Arthur Gingrande Jr., IMERGE Consulting Steps in the Template-Based Processing of Structured Forms Just as documents must be prepared in order to be fed into a scanner by removing staples, smoothing wrinkles, positioning them for optimal registration, etc., so the image of a form document must be prepared by following these steps before it can be intelligently recognized: Document scanning—Pages of forms are scanned and converted into bit-mapped (usually TIFF) images of forms which are either compressed and stored for later batch processing, or are passed immediately in an uncompressed format to an ICR engine for recognition. Image analysis—The document image is cleaned up. Character image quality is improved, using image enhancement techniques. Background “noise” is removed from the form. Form alignment—The image is registered and deskewed by the ICR software, which automatically aligns the form by locating special symbols on the document called registration marks as guides. Form identification—The document is identified by certain predefined characteristics that the ICR software is trained to look for, so that the zones containing the fields designated for recognition can be located by a customized, predefined ICR template. Form ID attributes can include form numbers, corporate logos, or the name of the form itself imprinted somewhere on the form. Form background removal—This stage is not necessary if the document is a form that was originally printed in a colored (“drop out”) ink that is invisible to the scanner being used. 
If colored ink is not used, the form image may contain lines, boxes, fine print, and other form attributes—passive data—that tend to confuse the ICR engine. These form attributes must be extracted from the image of the form, so that only the character images—the active data—are left behind. Broken and fragmented characters are automatically repaired and restored to their original shapes. Character field location—The predefined ICR template automatically locates the fields that contain character data. The template identifies which individual fields on the form image require character recognition, and what the nature of those fields are—hand print, machine print, numeric, alphabetic, alphanumeric, etc. The template also identifies which areas are barcodes or check box recognition zones. Character segmentation—Sophisticated software routines analyze, separate, and break down the character fields into isolated characters. If the form is “ICR –friendly,” characters are segmented with the aid of graphic devices such as boxes, tick-marks, and connected boxes called “combs” that serve to force the form user to legibly separate the characters from one another. Character classification— Individual characters are classified by ICR algorithms according to their ASCII category and assigned a confidence value, which is an index of how “certain” the ICR engine “feels” about the selection it has made. Alternate character choices are ranked according to those values, so that they can be incorporated into editing procedures that improve ICR accuracy. For example, the alternate choice “1” might be used instead of the first-ranked choice “I” when contextual analysis reports that the field is all-numeric. Post-processing—The initial or “raw” recognition results are validated using edit procedures such as grammatical rules, spell-checkers, dictionaries, check-sum routines, and look-up tables. 
Ambiguous and erroneous data fields—the “rejects”— are identified and sent to data entry operators at workstations for manual correction. Manual correction of rejected character fields—The manner in which the data entry operator is presented the rejected data for correction can dramatically impact both the speed and the accuracy of the reject repair process. In particular, the data entry GUI is important because the ergonomics of data entry are what enable a given data entry operator to reach his or her maximum correction speed. What is interesting is that only one of the steps—character classification—is specifically concerned with identifying character data. The rest of the steps have to do with either preparing the imaged characters for classification or interpreting the results of character classification. With so much opportunity for error increasing at each successive step of the way, it is remarkable that ICR accuracy rates can attain (and sometimes exceed) human performance levels.
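The character classification and post-processing steps described above rank alternate character choices by confidence and let context override the first choice, as in the example of preferring “1” over “I” in an all-numeric field. A minimal sketch of that idea follows; the data layout, function names and the `min_confidence` value are assumptions, not the behaviour of any particular ICR engine.

```python
def pick_characters(candidates, numeric_only=False):
    """Choose one character per position from ranked ICR candidates.

    `candidates` is a list of positions; each position is a list of
    (character, confidence) pairs sorted best first.
    """
    chosen = []
    for ranked in candidates:
        allowed = [c for c in ranked if c[0].isdigit()] if numeric_only else ranked
        # Fall back to the top-ranked choice if no alternate fits the field type.
        char, confidence = (allowed or ranked)[0]
        chosen.append((char, confidence))
    return chosen


def needs_manual_correction(chosen, numeric_only=False, min_confidence=0.5):
    """Flag a field as a reject if any character is doubtful or of the wrong type."""
    return any(
        conf < min_confidence or (numeric_only and not ch.isdigit())
        for ch, conf in chosen
    )
```

Fields flagged by `needs_manual_correction` correspond to the “rejects” that the text sends to data entry operators.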

36 The Document Separation Problem in Image Capture
Separation of documents is a significant expense for a high-volume capture system Typical ‘structured’ recognition technologies are not applicable Manual insertion of separator sheets is the primary workaround today 50% of document preparation labor is spent sorting documents and inserting separator pages – source: TAWPI Where does one document stop and the next begin? Here? Here? Here? Here?

37 Classification Methods for Paper Content (Images)
Image Classification based on the overall layout and structure of a document Includes lines, boxes, logos and placement of text Text Classification based on detailed analysis of the text content of a page Rules-Based Classification performed by searching for specific data or keywords independent of layout Templated Classification determined by the presence of one or more marks, barcodes or items of text in pre-defined locations

38 Waterfall Approach to Classification and Separation
Two-pass system: 1st pass: Classification optimizes performance by using the fastest classification techniques first, with Advanced Text Classification as the final “catch-all”.
[Table: pages 1 to 8 against four techniques applied in order of cost. Barcode Recognition: 1 ms; Image Classification: 20 ms; Rules Based: 200 ms; Text Classification: 1000 ms. Cells mark a page as the First, Middle or Last page of Form X, Y or Z, “?” where the technique could not decide, or N/A where an earlier technique already decided.]
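The waterfall idea on this slide, trying the cheapest technique first and falling through to slower ones only for pages still undecided, can be sketched as an ordered list of classifiers. The two example techniques, the page dictionary layout and the labels are hypothetical stand-ins for barcode recognition and rules-based keyword search.

```python
def waterfall_classify(page, techniques):
    """Try classifiers in order of cost and stop at the first that answers.

    `techniques` is a list of (name, function) pairs, cheapest first; each
    function returns a label, or None when it cannot decide.
    """
    for name, technique in techniques:
        label = technique(page)
        if label is not None:
            return label, name
    return None, None


def by_barcode(page):
    """Stand-in for barcode recognition: read a pre-decoded barcode value."""
    return page.get("barcode")  # None when no barcode was read


def by_keyword(page):
    """Stand-in for rules-based classification: look for a keyword in the text."""
    if "invoice" in page.get("text", "").lower():
        return "Form Y"
    return None


TECHNIQUES = [("barcode", by_barcode), ("keyword", by_keyword)]
```

Because each page stops at the first technique that answers, the expensive text classification only runs on the pages the cheaper techniques leave marked “?”.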

39 Why Invest in Automated Classification?
Accelerate the time to value in your investment in ECM Ensure more accurate content catalogs Having consistent and reliable access to unstructured content is arguably the foundation to realizing the business benefits of ECM, and all subsequent content-centric enterprise applications will realize their ROI by leveraging this essential capability At its core, classification will accelerate your time to value when it comes to your investment in ECM. It will be easier to bring more content under management, quickly. When you do bring it under management it will be accurately catalogued and as a consequence you’ll be lowering your risk and improving your down stream efficiency More content under management and having it catalogued accurately logically leads to content that is more likely to be found, more accessible, more likely to be re-used, more likely to be further leveraged. A key driver to investing in ECM across the enterprise is making your content more accessible – and accurately cataloguing information is key to that value proposition Finally, the last benefit is the one that speaks to every organization – more time. Your knowledge workers not only will be more productive because they’re wasting less time looking for information, but also your true authoring experts – the truly valuable people in your organization – will be freed up from manually tagging and classifying the content Make your content easier to find and leverage Free up your subject matter experts

40 IBM Classification Module for IBM FileNet P8
Summary Accelerate ECM Standardization Poor content classification undermines ECM value – maximize your ECM potential and time-to-value with automated classification Automating Classification Always Pays Typical employees spend 10 hours/week searching for information – slash that time and increase productivity Classification Technologies Automate Classification to Drive Development of Best Practices IBM Classification Module for IBM FileNet P8 Automatically organizing your content by understanding it

41 Questions & Answers Contact Reggie Twigg for more information or to arrange a demonstration

