Presentation on theme: "DB2 Information Integrator Software"— Presentation transcript:
1 DB2 Information Integrator Software
Jaffa Sztejnbok, IT Specialist, Information Management, Global Technology Unit
2 Agenda
- What is Enterprise Information Integration
- Without Information Integrator
- Data Challenges
- Complementary Information Integration Approaches
- IBM DB2 Information Integration Products and Value
- IBM's Information Integrator 8.1
- Demo
3 What is Enterprise Information Integration?
- Provides access to diverse, distributed, and real-time data as if it were a single source, no matter where it resides.
- Helps businesses:
  - Shorten application development time
  - Improve productivity and application efficiency
  - Leverage existing data assets for the benefit of the business
Middleware that gives access to a variety of data residing in distributed form as if it were a single data source, whether or not it actually is one. It enables integration of structured and unstructured data, retrieval and update in real time, data transformations for business analysis, and replication or movement of data to improve performance and availability.
Enterprise Information Integration is a category of middleware that allows access to data as though it were in a single database, whether or not it is. It integrates data and content sources to provide real-time read/write access and the ability to transform data for business analysis and data interchange. When you discuss distributed access, the first things that often come to mind are performance and availability. To address these concerns, we have data placement capabilities. Caching helps when performance and availability are at issue.
5 Data Challenges
Variety, Velocity, and Volume
- New composite applications need data from multiple sources
- Consumers expect holistic, personalized, and value-added content
- Relational, XML, packaged applications, content repositories, and file systems all contain critical business information
- Increasing emphasis on current data
  - Real-time analytics
  - Business activity monitoring
- Petabytes will be the measure of available online data
- All client interactions are important (e.g. instant messages, audio records, Web traffic, …)
- Internet and intranet content
Let's focus on Information Integration and start by discussing the data challenges: a greater variety of data, requirements for more current data, and increasing volumes of data.
New composite applications need business-critical information from relational stores, XML documents, content repositories, and file systems.
Consumers expect personalized views and communications, with value-added suggestions (suggested products based on types of past purchases: books, clothing, investments, insurance, bank accounts).
There is an increasing emphasis on current data for analysis. This means bridging historical data (typically in the warehouse) and real-time data, which is required for customer responsiveness.
Business activity monitoring is required for operational efficiency. Once an event is identified as unusual, our federation capability helps present the information necessary to handle it.
All client interactions are important, and data volumes grow as we capture all the different information formats: documents, visual, audio, clickstream, messages.
These characteristics and challenges combine to make the case for distributed access. It is difficult to have all the required information in a single store or type of storage.
6 Complementary Information Integration Approaches
Consolidate data for local access
- Data warehouses
- Operational data stores
- Production applications
- Creating additional reference copies
- Typically managed by ETL (Extract, Transform, Load) or replication technologies
- Want a stable data view with the ability to control refresh intervals
- Complex, long-running joins or transformations are required
- Predictable and repeatable access
- Need high performance for applications ("bring data to application")
- High availability required; can't tolerate network outages
- Need historical or trending information
Integrated access to distributed sources (distributed access)
- Real-time data, e.g. stock quotes; extending a data warehouse with real-time data
- Data changes rapidly
- Wide heterogeneity in the data to be accessed, in relational and non-relational formats
- Data which is not practical or possible to copy, and when movement of data is small
At IBM, we believe customers' Information Integration requirements consist of distributed access (federation) but also consolidated access (replication and ETL). This chart helps show how a customer would know when to use one or the other, but often they are used together.
Example: use ETL to build a warehouse, replication to keep it automatically updated on a scheduled basis, and federation to extend it for queries that require data that it didn't make sense to put in the warehouse.
EII or distributed access approaches are indicated when:
- Access performance and load on source systems can be traded for an overall lower-cost implementation
- Currency requirements demand a fresh copy of the data
- Data is widely heterogeneous or changes rapidly
- Data security, licensing restrictions, or industry regulations restrict data movement
- Unique functions must be accessed at the data source
- Queries return small result sets among federated systems
- Large volumes of data are infrequently accessed
ETL or replication approaches are indicated when:
- Access performance or availability requirements demand centralized or local data
- Complex transformation is required to achieve semantically consistent data
- Queries are complex and multidimensional
- Currency requirements demand point-in-time consistency, e.g. close of business
Information Integration combines different technologies that are complementary. Some data needs to be consolidated for local access. This is typically accomplished by ETL or replication technologies and is usually the best method for building data warehouses, operational data stores, and the data for production applications.
Integrated access adds to consolidated data by federating it with data that it doesn't make sense to put in the consolidated store. Examples include the need for real-time data or the need to join mixed-format data.
Integrated access represents an emerging industry category, referred to as Enterprise Information Integration. While federation is not new, what is new is customer interest in this space, driven by their data challenges.
7 IBM DB2 Information Integration Products
DB2 Information Integrator
- SQL programming model
- Leverage SQL skills and tools
- Federated data server and replication server
DB2 Information Integrator for Content
- Content programming model
- Leverage CM skills and tools
- Federated data server, text mining, and workflow engine
(Diagram: data sources such as data warehouses, spreadsheets, relational databases, …; Extended Search sources; content sources such as content repositories, office reports, fax)
The DB2 Information Integrator family consists of two products. Both offer federated access to diverse and distributed data and content stores, but each presents a different programming model tailored to a different developer community.
Here you see a representative set of data sources that these offerings access. For the most part, both offerings can access all the sources shown, with a couple of exceptions.
8 DB2 Information Integrator 8.1
A Federated Data Server: query distributed data as if it were a single source
- Define integrated view across diverse and distributed data
  - Wide range of data and content sources
  - Extensible to virtually any data source
- Query as if a single source
  - Use standard SQL query and SQL expressions
  - Include text semantics in the search
  - Surface specialized functions into SQL
  - Leverage query optimization and caching
- Compose XML documents
  - Combine diverse sources
  - Validate against DTDs or schema
  - Publish results to a message queue
- Familiar DB programming model
  - Single source, relational updates
(Diagram: integrated SQL view)
DB2 II 8.1 will have both federation and replication capabilities. Let's look first at federation.
Extensive sources are available to federate. Federated capabilities include federated query, composing XML documents, and publishing to MQSeries queues.
Update in this first release is limited to relational sources, and to a single source at a time.
Sources: DB2, Oracle, SQL Server, Sybase, Teradata, OLE DB, ODBC, Excel, XML, message queues, Web services, flat files, document repositories, content repositories, LDAP directories, WWW, databases, and more.
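To make "query as if a single source" concrete, here is a minimal sketch in Python. It is only an analogy: it uses SQLite's ATTACH DATABASE to let one connection join tables from two separate databases with ordinary SQL. SQLite is not DB2 Information Integrator, and the table names, schema name `crm`, and data are all made up for illustration.

```python
import sqlite3

# One connection plays the role of the federated server.
# Assumption: SQLite ATTACH is a stand-in for federation, not DB2 II itself.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS crm")  # hypothetical second source

# First source: orders.
conn.execute("CREATE TABLE main.orders (customer_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO main.orders VALUES (?, ?)",
    [(1, 120.0), (2, 75.5), (1, 30.0)],
)

# Second source: customers, held in a separate database.
conn.execute("CREATE TABLE crm.customers (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO crm.customers VALUES (?, ?)",
    [(1, "Acme"), (2, "Globex")],
)

# A single standard SQL statement joins data from both databases.
query = """
SELECT c.name, SUM(o.amount) AS total
FROM crm.customers AS c
JOIN main.orders AS o ON o.customer_id = c.id
GROUP BY c.name
ORDER BY c.name
"""
totals = conn.execute(query).fetchall()
```

The application issues one query and never needs to know that `customers` and `orders` live in different stores, which is the property the slide describes.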
9 DB2 Information Integrator 8.1
A Replication Server: manage consolidation for performance and availability
- Distribute data among relational databases
  - DB2, Informix, Microsoft, Oracle, Sybase, Teradata
- Support flexible topologies
  - Distribution: one to many
  - Consolidation: many to one
- Match data movement modes to usage requirements
  - Table-at-a-time for warehouse loading during batch window
  - Transaction-consistent for online data
- Choose latency characteristics
  - Scheduled, interval-based, continuous
- Apply transformations in-line
  - Standard SQL expressions or stored procedure execution
(Diagram: DB2, Microsoft)
Heterogeneous replication is part of DB2 II 8.1. It is supported between DB2, Informix, Oracle, Sybase, and Microsoft, and to Teradata.
Heterogeneous replication was first available in DataJoiner, and it provides for replication to one or many targets, or from one or many sources. This support is continued in DB2 Information Integrator.
Replication is supported at the table level or the transaction-consistent level.
Replication may take place based on a schedule, time intervals, or events, or be continuous.
Transformations between source and target can be specified with SQL or accomplished via stored procedures.
10 DB2 Information Integrator for Content
- Define integrated views across diverse and distributed data
  - IBM Content Manager portfolio and other content repositories, e.g. FileNET, Lotus databases, ODBC- and JDBC-compliant relational databases, and IBM Lotus Extended Search sources (LDAP directories, WWW, databases, …)
- Search federated data
  - Search application uses the IBM Content Manager API
- Mine additional metadata from text documents
  - Identify document language
  - Extract entities like names or technical terms
  - Categorize documents based on a taxonomy
  - Group documents based on related content
  - Create a document synopsis
- Define workflows
DB2 Information Integrator for Content is a re-branding of Enterprise Information Portal. It provides federated access to CM sources as well as other content repositories and relational sources.
Content Manager customers are the primary audience for this product. The application interface and programming model for DB2 Information Integrator for Content is based on the CM object APIs.
Customers who work primarily with relational databases should look first at DB2 Information Integrator, which provides federation of relational and content sources using a SQL API.
II for Content has unique capabilities for text mining and text analysis:
- Algorithms scan text documents to determine the national language in which they were written
- Key features of a document, such as proper names or technical terms that can be used for classifying it, can be extracted automatically
- Documents can be categorized into a customer-specified taxonomy
- Documents can be grouped based on their content
- Automatic summarization is available by scanning the document for summary sentences
An integrated workflow function is available so that any data retrieved can be part of a workflow process. This workflow is based on an embedded copy of MQSeries Workflow that has been tailored for content integration.
11 DB2 Information Integrator Value
- Extend current investments
  - Work within your existing infrastructure
  - Consolidate data or access distributed data as if it were a single data source
  - Combine existing data and content assets in new ways
  - Use familiar SQL programming model and existing tools
  - Build on a standards-based, strategic integration platform
- Speed time to value for composite applications
  - Reduce hand-coding 40%-65%
  - Reduce skill requirements
  - Reduce development time by half
- Control costs
  - Reduce payroll costs
  - Reduce need to rip and replace
  - Reduce need to manage redundant data
We have talked with many IBM and non-IBM customers about the value of this product, and these are the results of those discussions. Customers have verified that these benefits are valuable.
12 Speeding Application Development
Development effort to handle:
- Unique interfaces for each data type
- Joining data from varied sources
- Aggregation and grouping
- Correlating data
II handles:
- Interfaces for each data type
- Joining data from varied sources
- Transformation
- Correlating data
Special features:
- Set processing
- Built-in database transformation functions
- Optimization
- Automatic local caching
- Data-driven triggers
(Diagram: application developer; RDBMS, non-relational data, non-traditional data, other; JDBC, XML, Web services)
- SQL is an open standard
- SQL is easily testable, independent of the application
Integrating data sources programmatically is so complex that you either 1) don't do it, 2) pay the price of moving to an integrated store, which is extremely costly and may not be justifiable, or 3) risk developing and maintaining very complex code.
13 Crystal Decisions
Vision
As a world-leading information infrastructure company, Crystal Decisions helps businesses make better decisions by bringing together their people and their information.
Challenge
Improve response time for complex queries over distributed, heterogeneous data sources.
Solution
Provides transparent, globally optimized access to heterogeneous, distributed data. Crystal Reports accesses the distributed data as if it were a single database. Response time improvement of up to 98% seen in house.
Business Value
"Users of Crystal Reports and Crystal Enterprise, with DB2 Information Integrator, can … discover new ways to meet the information needs of their organization." Janet Wood, Vice President of Business Development, Crystal Decisions.
Competitive Value
"DB2 Information Integrator provides Crystal Reports with exceptionally fast and efficient federated querying capability." Trevor Smith, Program Manager, Business Development Group, Crystal Decisions.
Crystal Decisions is an ISV that provides query and reporting software. They understand that their clients often want reports that span information across different types of sources. They have provided that in their product, but were interested in exploring our technology to determine whether our distributed optimization for heterogeneous environments provided a performance benefit.
Crystal found that our federation technology improved performance by up to 98% for queries against heterogeneous and distributed data. This underscores how hard it is to do join processing efficiently. One can expect to see improved application performance using this product compared with doing the joins in an application.
15 Federated Access to Diverse Data
This diagram captures the main points of the federation functions in DB2 Information Integrator.
- It is accessible from the web and from standard database clients (CLI, ODBC, JDBC)
- It provides a unified view over a large set of data sources
- All of the popular relational sources are supported, as you see on the right
- Along the bottom are the non-relational sources. Access to these is read-only, except for WS MQ messages, which can be written and read
  - A number of Life Sciences specific sources: Blast, Hmmer, Documentum, Entrez, BioRS
  - Text data in flat files
  - XML documents
  - Excel spreadsheets
  - WS MQ messages
- A rich set of sources is accessible through IBM Lotus Extended Search. A partial list includes:
  - Notes databases
  - MS Exchange
  - IBM Content Manager
  - Domino.Doc
  - LDAP directories
  - Web search engines, e.g. Yahoo, Google, CNN...
- Content sources (CM)
16 IBM DB2 Information Integrator Software
Information Integration: integrating diverse business information across and beyond the enterprise
- Data federation
  - Extensible read/write access across diverse data and content sources
  - Database programming model (SQL)
  - Content programming model (OO API)
- Data placement
  - Caching and replication over heterogeneous information
- Data transformation
  - SQL, XML, Web services
- Advanced search and mining
- Metadata management
- Part of a complete integration solution
  - XML publishing, consumption, and interchange
  - WebSphere business integration
  - Open platform based on industry standards
This is our vision for Information Integration. The fundamental technologies include federation, data placement, and transformation.
Federation needs to serve traditional clients, web services, messaging, and workflow.
Data is presented through one of three programming models: SQL, CM (the Content Manager object-oriented API, available with DB2 II for Content only), and, later, an XML-based programming model (XQuery). We are currently working with the standards bodies to define an open standard for XQuery.
This lets our customers use the rich semantics developed for their specific types of sources, and protects their existing investment in the SQL or CM object-oriented programming models.
When performance or availability becomes a key concern for a specific application, our solution has caching. We also have replication as a data placement alternative to enable distributed data copies.
Integration requires a robust transformation capability. We believe SQL and XML provide for extensive transformations.
At IBM, we formed the Institute for Search and Text Analysis as part of our research organization. It is devoted to advancing text search and text mining and taking that research into our products.
Metadata management is important for bringing together all the sources of data and understanding how they are linked.
XML is a key part of integration today, so we provide a comprehensive XML capability. We support the ability to both generate and consume XML documents. We have provided this in our current relational database, which lets our customers take advantage of their investment in relational skills and relational applications. The best store for XML is a relational database: DB2.
Information Integration is complementary to the WebSphere integration of people and processes. Our adherence to open standards makes us the obvious choice for information integration in any integration solution.
17 Data Federation
- Transparency: hides differences among sources
  - Appears to be one source
  - Supports a high-level query language
  - Functional compensation and passthru
- Heterogeneity: integrates data from diverse sources
  - Relational, XML, flat files, spreadsheet, messages, content repositories, Web, …
- High function
  - One query integrates data from multiple sources
  - Capabilities of sources as well
- Extensibility
  - Access wide range of data sources
  - Wrapper development toolkit
- Autonomy
  - Non-disruptive to data sources, existing applications, systems
Federation is the concept that a collection of resources can be viewed and manipulated as if they were a single resource while retaining their autonomy and integrity. Federation provides significant advantages:
- Transparency: provides one single API for the application to talk to, independent of the number of back-end sources accessed.
- Heterogeneity: relational, ODBC, flat files, XML, MQ, spreadsheet, and more (wrappers and functions).
- Extensibility and openness: federation can be extended to almost any data source (wrapper development toolkit and a development environment for functions using WebSphere Studio).
- High function: the functions of the API are available across all sources, whether the back-end data source has the function or not.
  - Compensation for missing functions. For example, a flat file source may not have sort. In this case the data is read by the federation server and the server does the sort.
  - Unique functions of the back-end source can be made available as SQL functions, if the wrapper makes them available.
- Autonomy: data sources can be federated with little or no impact on existing applications or systems.
- Performance: the different phases of the optimizer have functions specifically for distributed queries. For example, where it makes sense to take advantage of functions in a back-end source, it does so.
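The flat-file sort example above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how DB2 Information Integrator is implemented: the `Source` class, its `capabilities` set, and `federated_select` are hypothetical names invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Source:
    """Toy back-end source; `capabilities` lists what it can do natively."""
    name: str
    rows: list
    capabilities: set = field(default_factory=set)

    def scan(self, order_by=None):
        rows = list(self.rows)
        if order_by is not None:
            if "sort" not in self.capabilities:
                raise NotImplementedError(f"{self.name} cannot sort")
            rows.sort(key=lambda r: r[order_by])
        return rows


def federated_select(source, order_by):
    """Return (rows, where_sorted): sort at the source if it can, else locally."""
    if "sort" in source.capabilities:
        return source.scan(order_by=order_by), "source"
    # Functional compensation: fetch unsorted rows, sort in the federation layer.
    return sorted(source.scan(), key=lambda r: r[order_by]), "federated server"


flat_file = Source("flat_file", [{"id": 3}, {"id": 1}, {"id": 2}])
relational = Source("relational", [{"id": 3}, {"id": 1}, {"id": 2}], {"sort"})

flat_rows, flat_where = federated_select(flat_file, "id")
rel_rows, rel_where = federated_select(relational, "id")
```

The caller gets the same sorted result from both sources; only the place where the sort runs differs.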
18 Performance: Optimization of distributed queries
- Federation leverages a full database engine
  - Query processor, execution engine, catalog, client access, security, transactions
- Query processing extended for federated data
  - Pushdown analysis
    - Analyze how to decompose a user query
  - Generate an optimal query execution plan using cost estimates that include data source knowledge: database statistics, indexes, source functions, server and network capacities
  - Allows function compensation
Optimization and speed of federation is DB2 II's silver bullet. In blue are the additional items for federated optimization over DB2 optimization.
The SQL is parsed and then rewritten to perform well on the optimizer it is aimed at, using knowledge of which SQL performs better at that source. Pushdown analysis figures out how to decompose the query. Cost-based optimization then looks at the normal statistics, but also at what indexes are available, what functions each data source can provide, and the processor speed and network speed/capacity of each of the sources. Efficient, source-specific SQL is then produced for the SQL sources, along with an executable plan. The query is driven over both local and distributed data, with functional compensation where the back end lacks the functional capability.
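As a rough illustration of pushdown analysis, the sketch below splits a query's predicates into those the source can evaluate and those the federated server must compensate for, then estimates how many rows cross the network. Everything here is assumed for the example: the `Predicate` class, the operator names, and the selectivity figures are hypothetical, and the real optimizer weighs many more factors (indexes, server and network capacity).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Predicate:
    text: str           # predicate as written in the query
    operator: str       # function the source must support to evaluate it
    selectivity: float  # assumed fraction of rows that satisfy it


def pushdown_analysis(predicates, source_operators):
    """Split predicates into (pushed to source, evaluated locally)."""
    pushed = [p for p in predicates if p.operator in source_operators]
    local = [p for p in predicates if p.operator not in source_operators]
    return pushed, local


def estimated_rows_shipped(table_rows, pushed):
    """Rows the source returns after applying the pushed-down predicates."""
    rows = float(table_rows)
    for p in pushed:
        rows *= p.selectivity
    return rows


predicates = [
    Predicate("region = 'EU'", "=", 0.25),
    Predicate("SOUNDEX(name) = SOUNDEX('Smith')", "SOUNDEX", 0.5),
]
source_operators = {"=", "<", ">"}  # this toy source has no SOUNDEX

pushed, local = pushdown_analysis(predicates, source_operators)
shipped = estimated_rows_shipped(1000, pushed)
```

Here the equality predicate runs at the source, so an estimated 250 of 1000 rows are shipped, and the SOUNDEX predicate is applied afterwards by the federated server.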
19 Replication Architecture
This shows the IBM replication architecture.
DB2 capture and IMS capture are log-based captures. Informix, Sybase, Oracle, and MS SQL Server use trigger-based captures. There is no capture for Teradata.
Also depicted are external application captures. Examples are a third-party capture application written for IDMS (by International Software Products in Toronto, Canada, who have a product called DARS), or a sample program we provide called Data Difference Utility. DDU compares two load-ready DB2 load copies and places the difference in a staging table. Customers are using this for VSAM CDC.
Captures place changed data into staging tables. This provides the flexibility for each target to be different: different transformations, different columns, and different currency.
Apply then applies the changes either to DB2 or to DB2 Information Integrator's federation engine, through which writes can happen to Informix, Sybase, Oracle, MS SQL Server, and Teradata.
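The capture, staging, and apply flow can be pictured with a small Python sketch. It is a simplified model with assumed names (`staging`, `capture`, `apply_changes`) and in-memory dictionaries standing in for tables; it ignores logs, triggers, deletes, and transactions.

```python
staging = []  # stand-in for a staging table of captured changes


def capture(key, row):
    """Record one changed source row in the staging area."""
    staging.append({"key": key, "row": dict(row)})


def apply_changes(target, columns, transform=None):
    """Copy staged changes into `target`, keeping only `columns`."""
    for change in staging:
        row = {c: change["row"][c] for c in columns}
        if transform is not None:
            row = transform(row)
        target[change["key"]] = row


capture(1, {"name": "acme", "balance": 100, "region": "EU"})
capture(2, {"name": "globex", "balance": 250, "region": "US"})

warehouse = {}  # target 1: all columns, unchanged
reporting = {}  # target 2: fewer columns, name upper-cased in flight

apply_changes(warehouse, ["name", "balance", "region"])
apply_changes(
    reporting,
    ["name", "balance"],
    transform=lambda r: {**r, "name": r["name"].upper()},
)
```

Because both targets read from the same staged changes, each one can choose its own columns and transformation, which is the flexibility the staging tables are meant to provide.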
20 Heterogeneous Caching Feature
Improve query performance and availability
- Administrator defines a Materialized Query Table (MQT)
  - Precomputed or frequently used values
  - Any data from the federated system
- Application indicates ability to use cache
  - Implicit or explicit use
- Developer enables cache use
  - If enabled, reads are handled from the cache, writes passed through to the source
  - If not, reads and writes passed through to the source
- Cache refresh managed:
  - Manually
  - By replication
- Flexible caching topologies supported
Heterogeneous caching is available to improve query performance by caching information in a materialized query table. You would do this with any frequently used or precomputed values. Refresh of the MQTs can be manual or by replication.
MQTs are based on relational data, but you can use them for any federated data as long as you store it first in a relational table (such as DB2).
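The read and write rules on this slide can be modeled in a few lines of Python. This is a sketch only: `CachedTable` is a hypothetical class, plain dictionaries stand in for the source table and the MQT, and refresh is the manual variant.

```python
class CachedTable:
    """Toy cache in front of a source table (both are plain dicts here)."""

    def __init__(self, source):
        self.source = source
        self.cache = {}

    def refresh(self):
        """Manual refresh: copy the current source contents into the cache."""
        self.cache = dict(self.source)

    def read(self, key, use_cache=True):
        """Serve from the cache when enabled and present, else from the source."""
        if use_cache and key in self.cache:
            return self.cache[key]
        return self.source[key]

    def write(self, key, value):
        """Writes always pass through to the source; the cache is untouched."""
        self.source[key] = value


source = {"AAPL": 10}
table = CachedTable(source)
table.refresh()
table.write("AAPL", 12)

stale = table.read("AAPL")                   # served from the cache
fresh = table.read("AAPL", use_cache=False)  # served from the source
table.refresh()
after_refresh = table.read("AAPL")
```

The sketch also shows the trade-off behind the refresh options: between refreshes, cached reads can return older values than the source holds.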
21 Wrappers
Four important tasks:
- Data modeling
  - Map data model to relational data model (tables with rows and columns)
  - Map functions into SQL operations
- Query planning
  - Represent data source capabilities
  - Push down as much work to the data source as sensible
  - Detect missing function at source (so the engine can compensate)
  - Supply cost and cardinality information
- Connection and transaction management
- Query execution and data retrieval
  - Execute parts of a user's query for a specific data source
The wrapper technology was developed at IBM's Almaden Research Center and enables adding additional sources that will be transparent and optimized. There is an SDK to make it easier to develop custom wrappers for sources we haven't provided.
Wrappers act as partners to our optimizer, and we have architected a solution that delivers higher-performance optimization for all sources whose wrappers provide performance information.
Wrappers define how to map one data model to another.
Wrappers actually connect to the source, execute the query against it, and retrieve the information.
22 Configuration
Configuration steps:
- Wrapper: the wrapper code module itself
- Server: a specific data source, with associated attributes
- User mapping: information needed to connect to a specific server
- Nickname: a specific data set managed by a server, mapped to rows and columns in the federated server
Defined to the system via DDL commands
- GUI administration generates DDL
- Stored in the system catalog
The first step in configuring a data source is to configure the wrapper itself. One wrapper is configured for each type of data. So if you want to access two Oracle data sources, you configure one wrapper for Oracle. "Configure" in this context means using the "Create" action in the CC tool.
Next is defining to the wrapper the servers which contain the data sources. In the case of the two Oracle data sources, if they resided on two physical servers, there would be two Server entries in the Oracle wrapper.
User mapping is the next step. This specifies the mapping of the userid/password on the federated system to a userid/password on the federated source. This step is primarily for relational data sources, although user mappings may also apply to some non-relational sources.
Last is the nickname specification. A nickname is the id used by an application to reference federated data. A nickname refers to a relational table or, for non-relational data, a table-like data object. The nickname specifies the columns, and potentially the rows, of the target data that will be visible using this id.
Once this data source information is specified to the Control Center, the Center generates the DDL which defines the wrapper, server, users, and nicknames to the federated server.
The resulting metadata is stored in the federated server's system catalog, i.e. the SYSTABLE, SYSWRAPPER, SYSSERVER... tables are defined and populated with metadata.
Now, how does the DBA get the names of the federated data objects to create these registrations? There's a great "data discovery" function to help in that effort. Let's go to the next slide.
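The four configuration steps can be sketched as a small Python script that builds the kind of DDL the Control Center would generate for the two-Oracle-server case. Treat the statement shapes as an approximation of DB2 federated DDL that should be checked against the SQL reference for your release; the wrapper name `NET8`, the server, node, user, and table names, and the password placeholder are all hypothetical.

```python
def create_wrapper(name):
    # Step 1: one wrapper per type of data source.
    return f"CREATE WRAPPER {name}"


def create_server(server, server_type, version, wrapper, node):
    # Step 2: one server definition per physical data source.
    return (
        f"CREATE SERVER {server} TYPE {server_type} VERSION {version} "
        f"WRAPPER {wrapper} OPTIONS (NODE '{node}')"
    )


def create_user_mapping(local_user, server, remote_user, remote_password):
    # Step 3: map the federated userid to a userid/password at the source.
    return (
        f"CREATE USER MAPPING FOR {local_user} SERVER {server} "
        f"OPTIONS (REMOTE_AUTHID '{remote_user}', "
        f"REMOTE_PASSWORD '{remote_password}')"
    )


def create_nickname(nickname, server, remote_schema, remote_table):
    # Step 4: the id applications use to reference the remote table.
    return (
        f'CREATE NICKNAME {nickname} FOR '
        f'{server}."{remote_schema}"."{remote_table}"'
    )


# Two Oracle sources on two servers share a single Oracle wrapper.
ddl = [create_wrapper("NET8")]
for server, node in [("ORA_EAST", "east_tns"), ("ORA_WEST", "west_tns")]:
    ddl.append(create_server(server, "ORACLE", "9", "NET8", node))
    ddl.append(create_user_mapping("APPUSER", server, "scott", "change_me"))
    ddl.append(create_nickname(f"SALES.{server}_ORDERS", server, "SCOTT", "ORDERS"))
```

Printing `ddl` gives seven statements: one wrapper, then a server, user mapping, and nickname for each of the two Oracle sources, matching the order of the steps described above.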
23 Administration Tools
The Control Center is the administration tool used to register new data sources. The GUI helps the administrator with each step in the process.
Starting with the Control Center (left window), the first step is to register the wrapper by clicking the "Create Wrapper" action. This causes the top middle screen to come up. The administrator merely selects the type of data source from a selection list. The selection list shows all the wrappers that are installed, relational and non-relational.
After the type of data source is selected, a wrapper object is created in the Control Center screen. Selecting the "Create Server" action brings up the dialog box (middle, bottom) where the information identifying the server is specified. Optional settings for the server are specified in the second tab of this control (screen in bottom right). Descriptions and hints are available for the fields in these controls.
The next step is to identify the data on the servers that will be available to users. This is where Discovery provides significant help to the administrator. Let's go to the next slide to see how this works.
24 Discovery for Nicknames
- "Create Nicknames" window
  - Launches customized GUI
- Customized "Discover" GUI
  - Returns nickname definitions
You can define nicknames directly or you can have Discovery assist you. To do this directly, you need a lot of knowledge about the data source and how to define nicknames, especially for non-relational sources. The easier way is to use Discovery, which can assist by getting or showing the data objects and creating the nickname definitions that are pertinent to each object.
Here we see the Create Nicknames window. You can see the nicknames already defined and define more directly by clicking the Add button, or you can click the Discover button and launch the customized Discover GUI, which will show the objects and help you define the nicknames. The nickname definitions can then be seen in the Create Nicknames window after a refresh of that window.
- For relational sources, you can use the GUI to discover the remote tables, on which you can create the nicknames.
- For non-relational sources such as Excel spreadsheets, the GUI brings back all the spreadsheet names, on which you can create nicknames.
- For Entrez (a database of articles categorized into different tables for different types of articles), you can create a nickname for each table type.
- For Extended Search, you define a nickname for the search engines to be searched on a specific server. A nickname might point to Google, Yahoo, and Lotus Notes; a search of that nickname would then look at all three types of sources and return a list of the documents that matched the search criteria, arranged in rank order.
- For XML, which is a hierarchical view with parents and children, Discovery assists in creating not only the nickname but also definitions for views on these nicknames, to accommodate the hierarchical nature of XML. So for XML we see nicknames at the server level, but also at the object level.
Future use of this might be for user mappings and for user-defined functions (but this is not in R1).
Wrappers which support discovery: Sybase, Oracle, SQL Server, DB2, Informix, ODBC, Teradata, HMMER, Entrez, XML, Flat File, Excel, Extended Search
25 Replication Administration
- Definitions
  - Manage control definitions for replication
  - Customize names and sizes of objects
- Operations
  - Start Capture, Apply, Monitor, Analyzer, and Trace
  - Issue commands such as STOP or STATUS
- Monitoring
  - Perform static and dynamic monitoring
Replication Administration handles definitions and the management of operations: starting Capture and Apply, and monitoring for performance and for completion. Alerts are also available to guide administrators to problems, such as sites that are unavailable.
26 Application Development: Access DB2 catalogs and DB2 II federated sources
- DB2 Development Center
- WebSphere Studio
  - Database development is a key component in the new WebSphere Studio Application Developer offering.
- Microsoft Visual Studio .NET
28 Optimization of distributed queries
For more information
Articles in the IBM Systems Journal include:
- Information Integration: A research agenda
- Information Integration: A new generation of information technology
- Data integration through database federation
- XQuery: an XML query language
- XTABLES: Bridging relational technology and XML
- XML programming with SQL/XML and XQuery
- DB2 and Web Services
- Bringing together content and data management systems: Challenges and opportunities
- The integration of business intelligence and knowledge management
- Using flows in information integration
29 Summary
- Information integration is a foundation for companies to build an On Demand Operating Environment, enabling them to align their IT infrastructure to business priorities
- DB2 Information Integrator provides access to diverse, distributed, and real-time data as if it were a single source, no matter where it resides
- DB2 Information Integrator will help businesses:
  - Shorten application development time
  - Improve productivity and application efficiency
  - Rely on IBM's proven technology and support for open standards
30 The whole is worth more than its constituent parts
DB2 Information Integrator helps businesses to leverage existing data assets into knowledge for the benefit of the business
31 Don't forget to give us feedback
Presentation Code: A4