1 IBM Data Virtualization within Cloud Pak for Data July 25, 2019 Mukta Singh

2 A Modern Information Architecture
One data fabric, on a cloud-native architecture, layered as follows:
- Data Consumers: web services, mobile apps, applications, analytics tools, portals
- Governed Enterprise Data Catalog (metadata)
- Consumer Layer (interface provisioning)
- Virtualization Layer: Data Virtualization
- Connection Layer (adaptors)
- Data Sources: Hadoop data, data marts, warehouses, etc.
Impact of a modern information architecture: reduce data copies and movement, enable self-service for data, and gain the ability to connect to data sources globally. To the data consumer it is as if all the data were inside a single database.

3 Unified platform of foundational Data and AI cloud services
A unified platform of foundational Data and AI cloud services:
- Built for multi-cloud: avoid vendor lock-in and get started on your cloud journey today. Pick your cloud: private or public, on-premise or any public cloud of your choice.
- Kubernetes based: containerized, easy to manage. The #1 ranked data platform by Forrester.
- Customize and extend with add-on containerized services: Db2, Db2 ES, Mongo, Cognos, DSP, ISV, open source, custom.
- ICP for Data platform core: cloud-native data platform services with configurable installation options.
(Credit: Clay Davis)

4 Cloud Pak for Data – What's included?
An open, extensible API platform and a personalized, collaborative team platform for app developers, business partners, data engineers, data stewards, data scientists, and business users. Extensible "add-ons" come from IBM's Data and AI portfolio and third-party services (IBM, ISV, open source, custom).
The Ladder to AI:
- Collect Data (powered by the Db2 family): data virtualization, data warehousing, databases on-demand, data source ingestion, distributed processing
- Organize Data (powered by InfoSphere): discovery and search, data transformation, data cataloging, business glossary, policies, rules and privacy
- Analyze Data (powered by Watson and Cognos): AI machine learning, AI model build and deploy, AI model management, dashboards and reporting, analytical visualization
When clients start with IBM Cloud Private for Data, they get the foundational capabilities for the COLLECT, ORGANIZE and ANALYZE rungs of the AI Ladder, all integrated via the flexibility of cloud-native microservices. They can access data, create policy-based catalogs, build open source AI/ML models, visualize data with dashboards, and more. You can think of these capabilities as cloud-native microservice versions of our core offerings (like Db2, InfoSphere, DataStage, Watson Studio, and Cognos Analytics), integrated into a platform. No assembly required.
Multicloud services: logging, monitoring, metering, persistent storage, Kubernetes, security, identity and access management, Docker registry / Helm.

5 What if you could query 2 or 2,000 data systems with a single query?
What if you could tap into all of your critical data assets no matter where they physically reside?

6 IBM Data Virtualization
- Query across multiple data sources: Oracle, Db2, SQL Server, Informix, Netezza, MySQL, PostgreSQL, Big SQL, Apache Hive, HDP Hive, Cloudera Impala, and more.
- Rich application capabilities: connect to Data Virtualization with your favorite SQL apps and tools, such as RStudio, Jupyter Notebook, Cognos, Tableau, and MicroStrategy.
- Scale: query 1 or 1,000 sources at once, with more than 10x better performance for several important use cases.
- Secure: strict access controls and fully encrypted communications.
- Schema discovery and folding: automatically find and match tables across systems so you can query them as a single virtual table.
- Deeply integrated with Cloud Pak for Data: enterprise data catalog, governance and security, e.g. automatic publishing of virtualized data into the data catalog, with immediate access via Cognos and Watson Studio.
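For example, once tables from several sources are virtualized, a single SQL statement can span them. A minimal sketch, assuming hypothetical virtual tables DV.ORDERS (folded from Oracle and Db2 sources) and DV.CUSTOMERS (from a PostgreSQL source):

    -- One query, several underlying systems; DV handles the routing.
    SELECT c.REGION, SUM(o.QUANTITY) AS TOTAL_QTY
      FROM DV.ORDERS o
      JOIN DV.CUSTOMERS c ON o.CUST_ID = c.CUST_ID
     GROUP BY c.REGION;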

7 Data Virtualization: Key Use Cases
Driven by patterns needing low data latency and high flexibility with transient schemas:
- EDW prototyping and migration (mergers/acquisitions)
- Virtualization with big data (Hadoop, NoSQL, and data science)
- EDW augmentation (workload offload)
- On-demand virtual data marts, saving the cost of standing up an EDW
- Data discovery for "what if" scenarios across hybrid platforms
- Data caching of combined, frequently accessed data
- Combining MDM with IoT for systems of insight (IT/OT)
- Data integration preparation tool to complement ETL
- Master data hub extension to enrich the 360 view (e.g. multi-channel CRM)
The result: a single view across your business, real-time analytics without moving data, and fast time-to-value (TTV) with high ROI.

8 Governed Data Virtualization in Cloud Pak for Data
The ability to view, access, manipulate and analyze data without the need to know or understand its physical format or location, and without having to move or copy it. The architecture layers:
- Consumers: portals, analytics tools, web services, applications, mobile apps
- Governance Catalog (metadata): business terms, data privacy and protection rules, lineage
- Consumer Layer (interface provisioning)
- Virtualization Layer: virtualization platform, caching and optimization
- Connection Layer (adaptors)
- Data Sources: lakes, NoSQL, warehouses, files, marts, web services, cloud applications

9 Our Data Catalog Strategy
One scalable catalog, available multi-cloud; embeddable catalog microservices; open metadata; persona-specific experiences. With a single roadmap, IBM's catalog capabilities will lead the market.
At a high level, our combined catalog strategy aligns with four key tenets:
1. A single catalog of ML-based capabilities serving metadata management and data catalog use cases across public and private clouds and on premises.
2. Commoditization of base catalog capabilities that will be embedded across IBM products. Anyone buying certain IBM products will get a rich catalog for free; IBM will offer value-add features on top, at a premium, that differentiate us in the market.
3. Open metadata will provide the foundation for integration. ODPi interfaces will ensure the IBM platform is open and can be easily integrated with; proprietary metadata stores are a challenge for our clients.
4. Persona-specific UIs will ensure that each persona that needs to create, manage, govern or consume metadata can do so in an experience fit for their purpose. An IT user requires different views of metadata than a knowledge worker, and we will ensure this continues to be enforced in the experiences.

10 Accelerate with Industry Content
IBM Analytics capabilities support the shift from collection to consumption:
- Personas: data architects, data engineers, data stewards, knowledge workers.
- Stages: govern the data and activate the data, from data collection through discover, understand, prepare, and consume (data science).
- Capabilities: define business glossary, policies and rules; connect and access data; classify data; profile data; assess quality; curate and catalog; search and find relevant data; prepare data for analysis; build and train ML/DL models.
If we look across the typical data management ecosystem, we see how the capabilities we all know and love fit across the personas typically involved in the process. The left-hand side is where our clients have typically invested to date: extremely important, business-critical capabilities required to collect, govern and manage data. The right-hand side contains newer areas required to put information in the hands of the knowledge worker and allow them to do something useful with the data in an efficient manner. All capabilities operate against common metadata that flows through the organization but surface different capabilities through targeted experiences for each persona. Both sides are growing, but it's fair to say we observe much higher growth and interest in the right-hand side as clients look to embrace AI and face a growing need to monetize data. Accelerate with industry content.

11 Data Virtualization (DV) & Data Governance – with Policy Enforcement
The DV data access engine (SQL, APIs, NLQ) sits between the data lake repositories (relational, NoSQL, Hadoop, object store, flat files; CRM, IoT data, front-office and transactional systems) and the consumers (compliance reporting, discovery and exploration, self-service analytics, BI reporting and dashboards, business applications, AI, ML and optimization). The Watson Knowledge Catalog holds data asset details, policies and rules, lineage, and more. Data integration, data movement and data replication can mask or de-identify data; policy enforcement can deny access or mask/de-identify, optionally via Ranger or Guardium, with policy engine logs.
When a new DV table is created:
- The DV table is published to the catalog.
- Lineage of the DV table (which tables make up the new DV table) is generated and published to the catalog.
- Data profile/discovery/classification is triggered and the data curator is notified; if any personal data is found, a PII flag is automatically assigned to the DV table.
When a DV table is accessed:
- DV services call the policy enforcement services for any required action (mask data, deny access, etc.).

12 Key Architectural Differentiation
Hub and spoke execution model (the basis for federation and for our competitors): lacks scalability and is performance constrained.
- A query is issued against the system; a coordinator receives the request and fans the work out to edge nodes.
- Edge nodes individually perform as much work as they can based on their own data, and send their individual results back to the coordinator.
- The coordinator receives intermediary results from all edge nodes, merges them, and performs the remaining analytics.
IBM is first to market with a parallel processing model (the new computational mesh): theoretically unlimited scalability, ease of addition/removal of sources, execution pushed down into the constellation mesh.
- A query is issued against the system; a coordinator receives the request and fans the work out to edge nodes.
- Edge nodes self-organize into a constellation where each communicates with a small number of peers, and nodes collaborate to perform almost all analytics, not only analytics on their own data.
- The coordinator receives mostly finalized results from just a fraction of the nodes and completes the final work for the query result.

13 Smart Query Processing
IBM Data Virtualization reduces query time by using parallel processing, pushdown optimization and connection pooling. Example: read the total quantity sold in Hawaii, in California and in New York; total the quantity sold across the three states; order by quantity and return the top 10 products. Others perform the three reads serially; Data Virtualization reads the three states in parallel and returns results in 1/3 the time.
Using hardened technology from IBM's enterprise database team, with advancements from IBM Research, our Data Virtualization (DV) engine rapidly delivers query results from multiple data sources by leveraging advanced parallel processing and pushdown optimization. Not only is DV fast, it also removes complexity when querying information from multiple data sources: rapidly query, consolidate, sort and report information from a handful to thousands of unique data sources using one SQL statement over virtual tables (VTs). Each VT refers to one or more actual data sources and tables, and queries can combine data from a multiplicity of sources including relational databases, spreadsheets, and flat files.
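The example above maps to a single statement. A minimal sketch, assuming a hypothetical virtual table DV.SALES folded from the three state systems:

    -- One statement: each source computes its portion in parallel
    -- (pushdown), and DV merges and sorts the combined result.
    SELECT PRODUCT_ID, SUM(QUANTITY) AS TOTAL_QTY
      FROM DV.SALES
     WHERE STATE IN ('HI', 'CA', 'NY')
     GROUP BY PRODUCT_ID
     ORDER BY TOTAL_QTY DESC
     FETCH FIRST 10 ROWS ONLY;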

14 Supported Data Sources
Supported in Cloud Pak for Data: Db2 family (for HDM), Db2 for iSeries and zSeries, Db2 for z/OS, Big SQL, IIAS, PDA (Netezza), Informix, Derby, Oracle, SQL Server, MySQL, PostgreSQL, Apache Hive, HDP Hive, Cloudera Impala, Teradata*, MongoDB, Hive, Excel, CSV, Text*.
On the roadmap: Sybase, Snowflake, IBM DVM for Z, SAP HANA, MapR, Apache Drill, SAS, MariaDB, Amazon Redshift, DynamoDB and Aurora, support for any generic JDBC driver, VSAM, IMS, CICS, Adabas, CouchDB, streams/MQ, Apache Spark SQL, Apache Kafka, Cloudant, Pivotal Greenplum, Cassandra. REST APIs are planned.
*Note: manual upload of the file/driver is needed.
Client asks: Mongo Atlas service (AWS), Postgres (AWS), Oracle.

15 Details of our Design

16 So… what is Data Virtualization?
The ability to view, access, manipulate and analyze data without the need to know or understand its physical format or location.

17 Data Virtualization: The mission
Provide simplicity and rich analytics capabilities when virtualizing over large numbers of data sources, with flexible schemas and data types, automated data and result caching, and compute scaling, at industry-leading performance.

18 Going beyond with Simplicity and Scalability
Simplicity:
- Adding data sources is easy: many sources can be automatically discovered, so you only need to provide credentials.
- Discovering data in the sources is automated; there is no need to manually import metadata into Data Virtualization.
- Similar table structures are automatically matched across data sources (see the sketch below).
Scaling for the enterprise:
- Supports a large number of systems: more than 20 sources with only the DV service, more than 1,000 sources with the optional data source connectors.
- Geographically dispersed sources: multiple remote sources are able to work together for efficient analytics and processing.
- Data source connectors provide resiliency and performance across high-latency and unstable networks.
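Schema folding can be pictured as an automatically generated union over the matched tables. A minimal sketch, assuming hypothetical per-source tables with identical structure:

    -- What a folded virtual table logically looks like; DV generates the
    -- equivalent mapping automatically after matching the schemas.
    CREATE VIEW DV.SALES AS
      SELECT * FROM ORA_EAST.SALES
      UNION ALL
      SELECT * FROM DB2_WEST.SALES
      UNION ALL
      SELECT * FROM PG_CENTRAL.SALES;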

19 Giving you Confidence in the Analytics
Governance:
- Data classification and quality metrics for virtualized data from any source.
- Data provenance and lineage to show how each source contributes to the analytics.
Security:
- Data lives in the original source, with only results ever being transmitted.
- Fully encrypted communications between all components of the solution.
- Strong authentication and permission management.
- Policy enforcement and masking capabilities to come.

20 Remote Source Connector
Architecture: client applications connect to the Common SQL Engine and the administration UI within Cloud Pak for Data, which includes governance. An OPTIONAL Remote Agent can be installed across the network, adjacent to the data sources (MS SQL Server, Netezza (PDA), Oracle, Sybase, SparkSQL, S3, ...), complementing any information governance already existing at the remote data source.

21 Data Virtualization Query processing
- SQL queries are received from client applications; the Db2 SQL engine is used for parsing and optimization.
- DV virtual tables are represented as federated nicknames with additional attributes.
- Federation pushdown analysis and statement generation are used to build remote statements, both for the distributed DV processing and as optimized statements targeted at particular remote sources.
- Special Db2 optimizations exist for driving distributed aggregation and remote join filters.
- The remote query carries many annotations: distributed aggregation parameters, join key filter build-and-apply information, source parallelism, and column and type massaging.
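The federated-nickname representation builds on plain Db2 federation. A minimal sketch of that underlying mechanism, with hypothetical server, user and table names:

    -- Register a remote Oracle source and expose a table as a nickname.
    CREATE WRAPPER NET8;
    CREATE SERVER ORA_SALES TYPE ORACLE VERSION '12' WRAPPER NET8
      OPTIONS (NODE 'sales_tns');
    CREATE USER MAPPING FOR dvuser SERVER ORA_SALES
      OPTIONS (REMOTE_AUTHID 'scott', REMOTE_PASSWORD '*****');
    CREATE NICKNAME DV.ORDERS_ORA FOR ORA_SALES."SALES"."ORDERS";
    -- The nickname can now be queried like a local table; pushdown
    -- analysis decides how much of a query is shipped to the source.
    SELECT COUNT(*) FROM DV.ORDERS_ORA;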

22 DV Engine Remote Statement
- After remote statement generation, an ordinary Db2 runtime section is built. The runtime section represents Db2 processing, with individual subsections representing DPF node processing.
- It may contain one or more invocations of different remote statements; each remote statement is an independent thread executing concurrently in the distributed runtime.
- Each remote statement may contain references to one or more remote tables or nested query operations, driving further parallelism as additional threads are used for each construct.
- At the data source level, each source is accessed in parallel. A further item, for the August time frame, is to parallelize individual source access.

23 Deploying Data Virtualization
- Fully containerized Big SQL deployment in Docker and Kubernetes.
- Big SQL workers, a.k.a. Db2 DPF partitions, can be dynamically added to the service without REORG/redistribute. Even "local" data, such as cache data, is stored in HDFS rather than local Db2 tablespaces.
- Operates as the external interface for Data Virtualization: broad programming/query language support, built-in federation capabilities with integrated deployment and configuration, query optimization and plan generation, and management and configuration APIs.

24 Creating the Constellation
A small packet of Data Virtualization software, approximately 50 MB, is deployed on each data node. Data sources organize themselves automatically, dynamically and resiliently, and the query compiler builds a collaborative query plan that forces the nodes to collaborate.

25 Real system test: Growing a constellation
Video of a constellation growing to 349 nodes, with no manual configuration. The network stays compact, with between 2 and 10 links per node and latency-aware connections between nodes (which nodes connect to which others is decided by a fastest-reply strategy). The diameter of the constellation (i.e. the number of hops between the two furthest nodes) grows logarithmically, and a small diameter is ideal for communications. Actual system test performed by Emerging Technology Services, IBM Hursley, United Kingdom.

26 Autonomic Result & Data Caching (Target 2019)
A critical capability of Data Virtualization is result and data caching: avoiding the latency of remote data source queries for commonly used subsets of the data. Caches have to be fully automatic; the administrative overhead of manually defining which caches to use is too high and breaks our goal of simplicity.
Result cache design points:
- Result data is cached from the same remote query used to execute a query as part of the DV-FMP processing, most likely cached into Hadoop from the edge cluster nodes.
- Information about high-value queries to cache will be based on the Db2 package cache and remote statement execution logs.
- A query recompile may be required to make use of the cached results for MQT matching.
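The MQT-matching mechanics can be pictured with plain Db2 materialized query tables. A minimal sketch, with hypothetical names, of how a cached aggregate could be matched instead of re-querying remote sources:

    -- A deferred-refresh MQT acting as a result cache over a virtual table.
    CREATE TABLE DVCACHE.QTY_BY_PRODUCT AS
      (SELECT PRODUCT_ID, SUM(QUANTITY) AS TOTAL_QTY
         FROM DV.SALES
        GROUP BY PRODUCT_ID)
      DATA INITIALLY DEFERRED REFRESH DEFERRED ENABLE QUERY OPTIMIZATION;
    REFRESH TABLE DVCACHE.QTY_BY_PRODUCT;
    -- A later query against DV.SALES with a compatible aggregation can be
    -- rerouted by the optimizer to the cached table instead of the sources.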

27 Bringing parallelism to each data source.
Traditional processing queries the data source and then processes the large result on a single thread (executor); that single channel is a bottleneck for any large result set. Data Virtualization resolves this by querying the data source and merging results in many parallel threads (e.g. across a 64-core server), leveraging parallel read streams from the source.

28 Efficient Cross Source Joins
Problem: joining data between multiple databases (e.g. a 1 MB Oracle marketing database and a 50 GB Db2 sales database) is a common pattern, but brutally slow. With traditional processing, data from both tables is shipped over the network to the server, where the join is processed.
Solution: Data Virtualization uses early filtering techniques to dramatically reduce transmission costs. Data from the smaller (inner) table is used to build a small filter query that is applied to the larger (outer) table before data is transmitted. In many cases this is 85% effective at filtering out data that the join would discard, commonly leading to a ~10x acceleration. See the sketch below.
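A minimal sketch of the early-filtering rewrite, with hypothetical table and column names:

    -- User query joining the small Oracle table to the large Db2 table.
    SELECT s.ORDER_ID, m.CAMPAIGN
      FROM DV.SALES s
      JOIN DV.MARKETING m ON s.CUST_ID = m.CUST_ID;
    -- Instead of shipping the 50 GB table, the join keys of the small
    -- (inner) side are collected first ...
    --   SELECT DISTINCT CUST_ID FROM MARKETING;
    -- ... and folded into the remote query against the large (outer)
    -- side, so most non-matching rows never leave the source:
    --   SELECT ORDER_ID, CUST_ID FROM SALES
    --    WHERE CUST_ID IN (<keys from MARKETING>);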

29 20 databases? Traditional processing gets worse as the number of databases increases
With many data sources (Oracle, Db2, Informix and others), Data Virtualization's simultaneous processing and parallel fetch continue to scale up.

30 DV Internal Processing

31 DV Engine Generated SQL for Distributed Aggregation
Original:
    SELECT COUNT(PROMOVALUE2) FROM PROMOTION
Remote SQL:
    SELECT SUM(A0.C0) FROM (
      SELECT A1.C0 C0 FROM new com.ibm.db2j.GaianQuery(
        'SELECT COUNT(A2."PROMOVALUE2") C0
           FROM new com.ibm.db2j.GaianTable(
             ''PROMOTION'',
             ''SOURCELIST=(MYSQL10000:"POPS_node1", MYSQL10001:"POPS_node2",
                           MYSQL10002:"POPS_node3", MYSQL10003:"POPS_node4",
                           MYSQL10004:"POPS_node5")'',
             ''"PROMOKEY" INTEGER, "PROMOTYPE" INTEGER, "PROMODESC" CHAR(30),
               "PROMOVALUE" DECIMAL(5, 2), "PROMOVALUE2" DECIMAL(5, 2),
               "PROMO_COST" DECIMAL(9, 2)'' ) A2',
        'GAIAN_EXTENDER=[pAgg]', '', 'C0 DECIMAL(5, 2)') A1
    ) A0

32 Query Processing in the Constellation
(Example constellation spanning Toronto, San Jose and the United Kingdom.) Fixed execution within the constellation is impossible because of the highly dynamic nature of the network. Instead, each node simultaneously sends the relevant portions of the query both to its connected data source and to its peers in the network, then combines and processes the results as they are received. Duplicate results are avoided by a given node returning results only to the first peer that requested them. This implicitly results in balanced processing of the query through the constellation.

33 Language Translation in Data Virtualization
The broad set of data sources supported by Data Virtualization each has unique syntax variations, and a constellation is not limited to a single data source type: a logical schema is created across all connected sources. There are multiple levels of translation as we move from the applications through the constellation down to the data source:
- User application: JDBC, ODBC, R, Python, etc.
- DV service (embedded Db2): supports major language variations (SQL, Oracle SQL, Netezza, etc.).
- DV remote connector: translates from the received SQL to the data source dialect, compensating for missing functionality.
- Remote data source: Db2, Oracle, MySQL, Excel, CSV, etc.
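As an illustration of dialect translation, the same logical request ("top 10 products by quantity") renders differently per source; the table and column names here are hypothetical:

    -- Db2 / standard SQL:
    SELECT PRODUCT_ID FROM SALES
     ORDER BY QUANTITY DESC FETCH FIRST 10 ROWS ONLY;
    -- MySQL / PostgreSQL:
    --   SELECT product_id FROM sales ORDER BY quantity DESC LIMIT 10;
    -- Older Oracle versions (pre-12c):
    --   SELECT product_id FROM
    --     (SELECT product_id FROM sales ORDER BY quantity DESC)
    --    WHERE ROWNUM <= 10;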

34 Aggregation and Joins in DV
Aggregation is one of the major data reduction operations that can be distributed. DV uses the fundamental concepts of distributed aggregation, namely the partial, intermediate and final phases (see the sketch below):
- Each endpoint pushes the partial phase into the remote data source, or performs it locally in the case of flat files or non-relational sources.
- Each endpoint further performs intermediate aggregation, combining the result from the remote data source with those of adjacent nodes in the network.
- Data flowing over any link is as fully aggregated as possible at that point.
Fully distributed joins in DV are still under development and will include several aspects:
- Distributed join key filters: utilize the structure of the constellation to allow the filter to be built and used in a single traversal through the constellation, pushed into the remote data sources as basic SQL predicates.
- Application of the partial-intermediate-final techniques from aggregation to a join model: most useful when the operands of a join have too low a join key cardinality for join key filters to be effective.
- Partial aggregation pushdown through joins: a concept similar to PEA in Db2.
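A minimal sketch of the three phases for a distributed average, with hypothetical names:

    -- User query:
    --   SELECT AVG(PRICE) FROM DV.SALES;
    -- Partial phase, pushed into each remote source:
    --   SELECT SUM(PRICE) AS S, COUNT(PRICE) AS C FROM SALES;
    -- Intermediate phase, at each constellation node, combining its own
    -- source result with results arriving from adjacent nodes:
    --   SELECT SUM(S) AS S, SUM(C) AS C FROM <partials so far>;
    -- Final phase, at the coordinator:
    --   SELECT SUM(S) / SUM(C) AS AVG_PRICE FROM <intermediate results>;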

35 DV Engine – Parallel Processing DV-FMP – Constellation – Data Source
Parallel processing of query results is a key element of performance for remote systems. Goals: parallel data source queries and parallel network transmission.
Query processing:
- The query is compiled as normal through the Db2 query compiler and the remote statement(s) are generated. One component of the remote query syntax is the degree of parallelism to use when accessing the Data Virtualization constellation and the remote sources, initially expected to correspond to the number of Big SQL edge cluster nodes associated with DV.
- Each edge cluster node receives the same remote statement to execute in the subsection and invokes a thread within each DV-FMP for processing.
- The query propagates through the constellation as normal. Endpoint agents with a relevant data source connection receive the remote query from multiple paths and understand the appropriate parallelism parameter.
- Each connection accesses a unique range of the data, based on local indexes and statistics (the future may use hash partitioning or other partitioning methods). This ensures each connection path from data source to controlling DV-FMP receives an approximately equal subset of the result data, as sketched below.
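A minimal sketch of range-based parallel access with a degree of parallelism of 4; the table, column and range boundaries are hypothetical and would be derived from local indexes and statistics:

    -- Each connection reads a disjoint slice of the remote table.
    -- Connection 1:
    --   SELECT * FROM SALES WHERE ORDER_ID < 250000;
    -- Connection 2:
    --   SELECT * FROM SALES WHERE ORDER_ID >= 250000 AND ORDER_ID < 500000;
    -- Connection 3:
    --   SELECT * FROM SALES WHERE ORDER_ID >= 500000 AND ORDER_ID < 750000;
    -- Connection 4:
    --   SELECT * FROM SALES WHERE ORDER_ID >= 750000;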

36 DV Engine Multisource Distributed Join Key Filters
Performant multisource joins are a component enabling comprehensive distributed processing within the constellation. Goal: provide early filtering of result data at the source.
Query processing:
- Based on the Db2 access plan, remote statements carry annotations for building and applying distributed and remote join filters.
- The build phase happens in parallel at all sources providing components of the first argument of the join; this does not require a separate scan of the data.
- The filter is distributed asynchronously and in parallel within the constellation to endpoints holding an instance of the second argument of the join.
- The filter is incorporated into the remote query executed against the data source of the second argument of the join; that remote data source query can begin processing as soon as the filter is received.

37 Summary

38 IBM’s Unique Approach to Data Virtualization
- Parallel processing mesh providing execution performance and scalability: quickly deliver analytics results and easily evolve with new data source demands.
- Central node scalability and enterprise robustness (via an HA cluster): reliability and the ability to quickly adapt to increasing business demand.
- Governance integration and security: controlled, managed and secure access to virtual data.
- Common SQL engine with a rich set of SQL dialects and application portability: retain the use of existing applications and tools within the business.
- Rich automation and discovery of data: enable more self-service and increase productivity (while retaining access control).
- Underlying platform flexibility with Cloud Pak for Data: grow or move your analytics and virtualization platform with any environment.

39 Increase Productivity
IBM Cloud Pak for Data: shift to next-gen workloads and modernize through microservices. Shifting from monolithic to microservices brings agility and efficiency:
- Reduce costs: run your data and analytics services more efficiently, with a single product stack and better resource utilization.
- Ship faster: increase agility and respond to changing business requirements through DevOps and continuous delivery.
- Increase productivity: enable governed self-service, collaboration, and provisioning across teams on an integrated platform.
Another top use case is obvious given we've built this integrated platform on private cloud: modernize your data and analytics workloads.

40 © 2019 IBM Corporation

