© 2014 IBM Corporation ® IBM Software Group Data Analytics using MapReduce framework for DB2's Large Scale XML Data Processing George Wang Lead Software.


© 2014 IBM Corporation ® IBM Software Group Data Analytics using MapReduce framework for DB2's Large Scale XML Data Processing George Wang Lead Software Engineer, DB2 for z/OS IBM

Information Management Software
Disclaimer and Trademarks
Information contained in this material has not been submitted to any formal IBM review and is distributed on an "as is" basis without any warranty either expressed or implied. Measurement data have been obtained in a laboratory environment. Information in this presentation about IBM's future plans reflects current thinking and is subject to change at IBM's business discretion. You should not rely on such information to make business plans. The use of this information is a customer responsibility.
IBM MAY HAVE PATENTS OR PENDING PATENT APPLICATIONS COVERING SUBJECT MATTER IN THIS DOCUMENT. THE FURNISHING OF THIS DOCUMENT DOES NOT IMPLY GIVING LICENSE TO THESE PATENTS.
TRADEMARKS: THE FOLLOWING TERMS ARE TRADEMARKS OR ® REGISTERED TRADEMARKS OF THE IBM CORPORATION IN THE UNITED STATES AND/OR OTHER COUNTRIES: AIX, AS/400, DATABASE 2, DB2, e-business logo, Enterprise Storage Server, ESCON, FICON, OS/390, OS/400, ES/9000, MVS/ESA, Netfinity, RISC, RISC SYSTEM/6000, System i, System p, System x, System z, IBM, Lotus, NOTES, WebSphere, z/Architecture, z/OS, zSeries
THE FOLLOWING TERMS ARE TRADEMARKS OR REGISTERED TRADEMARKS OF THE MICROSOFT CORPORATION IN THE UNITED STATES AND/OR OTHER COUNTRIES: MICROSOFT, WINDOWS, WINDOWS NT, ODBC, WINDOWS 95, WINDOWS VISTA, WINDOWS 7
For additional information see ibm.com/legal/copytrade.phtml

Agenda
- Motivation
- Project Overview
- Architecture and Requirements
- Technical design problems
- Hardware/software constraints and solutions
- System Design and Implementation
- Performance and Benchmark showcase
- Conclusion, Recommendations and Future Work

IBM’s Big Data Portfolio
IBM views Big Data at the enterprise level, so we aren’t homing in on a single aspect such as analysis of social media or federated data.
1. Data Warehouse (Information Server, DB2 Analytics Accelerator, Netezza, etc.)
2. InfoSphere BigInsights (Hadoop etc.)
3. Stream data capture and analysis
4. Federated data discovery and analysis

IBM DB2 Analytics Accelerator

XML Database
- XML is a promising and desirable data format for storing and modeling data.
- An XML database can store data and documents without requiring a database schema.
- The XQuery scripting language, which is built on XPath expressions, allows an expression or predicate to be used to process XML data.
- Demand for manipulating XML data with XQuery is increasing.
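To make the idea of a path expression with a predicate concrete, here is a minimal sketch using Python's standard library, which supports a limited XPath subset rather than full XQuery. The document layout, element names, and the helper `names_in_state` are assumptions invented for illustration and do not come from the presentation.

```python
import xml.etree.ElementTree as ET

# Hypothetical taxpayer document; all element names are invented for illustration.
DOC = """
<taxpayers>
  <taxpayer id="1"><name>Ann</name><state>CA</state><income>52000</income></taxpayer>
  <taxpayer id="2"><name>Bob</name><state>NY</state><income>87000</income></taxpayer>
  <taxpayer id="3"><name>Cy</name><state>CA</state><income>41000</income></taxpayer>
</taxpayers>
"""

def names_in_state(xml_text, state):
    """Return the names of taxpayers whose <state> child equals `state`."""
    root = ET.fromstring(xml_text)
    # Path expression with a predicate: select <taxpayer> elements
    # that have a <state> child whose text equals the given value.
    matches = root.findall(f"./taxpayer[state='{state}']")
    return [m.findtext("name") for m in matches]

print(names_in_state(DOC, "CA"))
```

The predicate in square brackets plays the same filtering role that an XQuery `where` clause or XPath predicate would play in a full XML database.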

Use case
A DB2 client needs to query over 400 TB of taxpayer profile information, stored as XML, using XML query technology.
Requirement: interactively analyze XML data in real time.
Problems:
- No technology exists to analyze XML on HDFS.
- Offloading large-scale data is a performance problem.
- There is no backend support for a DB2 application to import data from Hadoop in XML format.
Summary: The lack of analytical query processing technology in Big Data restricts DB2 clients from using OLAP applications on XML data.

Project Overview
- Build an interface between the RDBMS and Big Data
- Allow customers to move operational data in XML from System z for integration with other data
- Enable Online Analytical Processing (OLAP) applications with XML data in DB2 using XQuery technology
- Invest business value in building a cloud-enabled framework that lets machines run data analytics on XML using XQuery support
- Explore a Big Data appliance on InfoSphere BigInsights with System z
- Meet the demand from DB2 customers by bringing new workloads to System z
- Use the gravitational pull of its transactional data control

Project Architecture
1. The user issues a SQL command that queries a DB2 table to populate BigInsights’ Hadoop.
2. BigInsights queries the XML table in the DB2 database.
3. DB2’s XML data is loaded onto Hadoop.
4. The user issues a Jaql XPath query on the XML data. The result of the query is stored on HDFS.
5a. The user runs a DB2 UDF to retrieve the result of the XPath query from HDFS back into the DB2 database.
5b. DB2 requests and stores the XPath result from HDFS.

Technical Design Problem
- BigInsights: provides data analysis capabilities over large volumes of data
- DB2 for z/OS: provides both XML and XQuery support
Plus…
- DB2 performs XML analytics slowly on the z platform
- BigInsights does not support XQuery
But… they don’t talk to each other on z/OS!

Environment Requirements
Hardware requirements:
- One z server machine for storing XML data with DB2 for z/OS
- Linux machines with RHEL 6.2 as a Hadoop server for BigInsights
  - 40 GB of disk storage
  - 8 GB of memory
  - Minimum of 4 nodes in the cluster installation
  - x86 64-bit systems
- zSystems (z/OS, z/VM and zLinux) are not supported for deployment at this time
- DB2_BigXML VLAN for Traffic Flow Survey
Software requirements:
- IBM InfoSphere BigInsights Enterprise Edition 2.0
- DB2 for z/OS 10.1
- Mozilla Firefox
- Eclipse IDE for Java™ EE
- Runtimes for Java Technology, Version

Hardware constraints and solutions
Constraints:
- The physical cluster allocated for Linux must not drive more than 80 GB of day-to-day data traffic with the z/OS network
- Network congestion must be prevented in the subnet within the intranet
Solution:
- All 4 Linux machines are clustered within a privileged internal network
- The z/OS system stays connected to the Linux machines under VLAN DB2_BigXML for persistence
- Data transfers from z/OS to Linux are kept to small workloads

Software constraints and solutions
Constraints:
- XML file sizes may not fit in a Hadoop storage block
- Hadoop allocates storage blocks with a block size of 64 MB or 128 MB
- An XML file larger than 128 MB must be split across multiple blocks
Solution:
- Assume each XML file is 64 MB or smaller
- Each node has 40 GB of space, so it can hold at most 321 XML files at a time without file splitting
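The capacity reasoning on this slide can be checked with a few lines of arithmetic. The sketch below is only a plain division under stated assumptions: binary units (1 GB = 1024 MB) and no allowance for replication or file system overhead, neither of which the slide specifies. The helper names are hypothetical, and the results need not match the slide's figure of 321.

```python
MB = 1024 ** 2
GB = 1024 ** 3

def files_per_node(node_bytes, file_bytes):
    """Whole files of a given size that fit in a node's storage (no overhead assumed)."""
    return node_bytes // file_bytes

def blocks_needed(file_bytes, block_bytes):
    """Number of storage blocks a file occupies (ceiling division)."""
    return -(-file_bytes // block_bytes)

print(files_per_node(40 * GB, 64 * MB))    # 640
print(files_per_node(40 * GB, 128 * MB))   # 320
print(blocks_needed(200 * MB, 128 * MB))   # 2, so such a file would have to be split
```

Under these assumptions a 40 GB node holds 640 files of 64 MB or 320 files of 128 MB, so the slide's limit presumably reflects overhead or a sizing choice that is not stated in the transcript.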

Design and Implementation - Connectivity
Data transfer:
- Enable DB2’s JDBC driver to connect to the BigInsights ad hoc server via the Database Import application
- The connection persists until the application commits, for both data submission and retrieval

Design and Implementation - Systematic tuning in Big Data
Distribute incoming XML files to all nodes of the Hadoop file system (HDFS).

Design and Implementation - Systematic tuning in Big Data
MapReduce:
- Kickoff spawns one mapper function per block to filter relevant information from each XML file
- Each node extracts the file, then the filtered data from the different nodes is aggregated and collected into a central repository
- The XQuery API for Java (package javax.xml.xquery) is imported in Hadoop
- Use JAQL’s XPath API for Java on the Hadoop file system
- Send query results back to DB2z
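The map, collect, and aggregate flow described above can be illustrated with a single-process sketch. This is not the BigInsights or Hadoop implementation: each input string stands in for one block, the shuffle is an in-memory dictionary, and the element names and the functions `mapper`, `reducer`, and `run_job` are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def mapper(xml_text):
    """Filter one XML file: emit a (state, income) pair for each taxpayer."""
    root = ET.fromstring(xml_text)
    for taxpayer in root.findall("./taxpayer"):
        yield taxpayer.findtext("state"), int(taxpayer.findtext("income"))

def reducer(key, values):
    """Aggregate all values collected for one key."""
    return key, sum(values)

def run_job(files):
    grouped = defaultdict(list)
    for xml_text in files:            # map phase: one mapper call per file (stand-in for a block)
        for key, value in mapper(xml_text):
            grouped[key].append(value)  # shuffle: group the filtered data by key
    # reduce phase: collect the aggregated results into one central result
    return dict(reducer(k, vs) for k, vs in grouped.items())

FILES = [
    "<taxpayers><taxpayer><state>CA</state><income>100</income></taxpayer>"
    "<taxpayer><state>NY</state><income>70</income></taxpayer></taxpayers>",
    "<taxpayers><taxpayer><state>CA</state><income>50</income></taxpayer></taxpayers>",
]
print(run_job(FILES))
```

In a real cluster the mapper calls would run on the nodes that hold each block, and only the filtered pairs would travel over the network to the reducers.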

End-User Tier
From the InfoSphere BigInsights Web Console, deploy the Database Import application to load XML documents using the JDBC driver class com.ibm.db2.jcc.DB2Driver. The user can run a dynamic SQL SELECT statement to query the table. The query result is converted to CSV format and stored at /BigXMLdirectory.
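The step of running a SELECT and writing the result as CSV can be sketched as follows. This is an illustration only: Python's built-in sqlite3 stands in for DB2 accessed through JDBC, and the table name `taxpayer_xml`, its columns, and the helper `query_to_csv` are hypothetical names not taken from the presentation.

```python
import csv
import io
import sqlite3

def query_to_csv(conn, sql):
    """Run a SELECT and return the result set as CSV text with a header row."""
    cur = conn.execute(sql)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([col[0] for col in cur.description])  # column names
    writer.writerows(cur)                                 # one CSV line per row
    return out.getvalue()

# In-memory stand-in for the DB2 table that holds XML documents.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE taxpayer_xml (id INTEGER, doc TEXT)")
conn.executemany(
    "INSERT INTO taxpayer_xml VALUES (?, ?)",
    [
        (1, "<taxpayer><name>Ann</name></taxpayer>"),
        (2, "<taxpayer><name>Bob</name></taxpayer>"),
    ],
)
csv_text = query_to_csv(conn, "SELECT id, doc FROM taxpayer_xml ORDER BY id")
print(csv_text)
```

Each XML document ends up embedded in a line of delimited plain text, which is why a later step has to recover the XML structure before it can be queried.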

Middle-Tier
To allow database driver access, create 2 DB2 database drivers. They contain the connection parameters needed to locate the target database address, along with the authentication credentials.
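As a rough picture of what such a driver definition holds, the sketch below groups the connection parameters and derives a JDBC URL in the `jdbc:db2://host:port/database` form used by the DB2 JDBC type 4 driver. The class `Db2Connection` and the host, port, database, and credential values are placeholders, not values from the presentation; only the driver class name comes from the slides.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Db2Connection:
    """Connection parameters for one database driver definition (illustrative only)."""
    host: str
    port: int
    database: str
    user: str
    password: str
    driver_class: str = "com.ibm.db2.jcc.DB2Driver"  # driver class named in the slides

    def jdbc_url(self):
        # Type 4 connectivity URL: jdbc:db2://<host>:<port>/<database>
        return f"jdbc:db2://{self.host}:{self.port}/{self.database}"

# Placeholder values; a real deployment would supply its own.
source = Db2Connection("zhost.example.com", 446, "DB2LOC", "dbuser", "secret")
print(source.jdbc_url())
```

Keeping the credentials next to the address in one immutable object mirrors the slide's description of a driver that carries both the destination and the authentication details.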

Data-Tier
XML data stored in the DB2 database is converted to plain text in a file on the Hadoop cluster. This plain text, which contains multiple XML tags in an unstructured format, is transformed by a custom application called xmlProcessing, which can be deployed from the BigInsights web console. The application reads each byte between the first start tag and the stop/end tag. The tags are assumed to be encoded as UTF-8 bytes. To rebuild the XML text file into the structured format that the XPath function can parse and query against, the tags are removed, because tags are not returned as part of the query result.
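The byte-level scan between a start tag and its end tag can be sketched as below. This is not the xmlProcessing application itself, whose code is not shown in the slides. The function `extract_documents` is hypothetical, it assumes records do not nest elements of the same name, and it keeps the delimiting tags so that each record stays parseable, which is a choice made for this illustration.

```python
def extract_documents(data, tag):
    """Return every <tag>...</tag> span found in UTF-8 encoded bytes.

    Assumes records are not nested inside elements of the same name.
    """
    start = f"<{tag}".encode("utf-8")
    end = f"</{tag}>".encode("utf-8")
    docs = []
    pos = 0
    while True:
        s = data.find(start, pos)
        if s == -1:
            break
        # Require '>' or whitespace after the name so "<taxpayers" is not
        # mistaken for "<taxpayer".
        nxt = data[s + len(start): s + len(start) + 1]
        if nxt not in (b">", b" ", b"\t", b"\n", b"\r"):
            pos = s + len(start)
            continue
        e = data.find(end, s)
        if e == -1:
            break  # unterminated record: stop scanning
        e += len(end)
        docs.append(data[s:e].decode("utf-8"))
        pos = e
    return docs

RAW = b"1,<taxpayer><name>Ann</name></taxpayer>\n2,<taxpayer><name>Bob</name></taxpayer>\n"
print(extract_documents(RAW, "taxpayer"))
```

Working on bytes rather than decoded text matches the slide's assumption that tags are treated as UTF-8 bytes, and it lets the scan run over a block without decoding the surrounding non-XML content.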

Architecture layout

Performance measurement and Benchmark
Data loading onto Hadoop:
The same data set, just under 1 GB of XML, was transferred from DB2 to the Hadoop cluster 5 times. The longest elapsed time was 20 seconds and the shortest was 14 seconds. The average CPU elapsed time was 16 seconds.

Performance measurement and Benchmark (cont.)
Analyzing data using XPath query schema:
Testing begins with a simple query that descends to a 2nd-level node tag, where the predicate tests a matching condition. The first run took 28 seconds and the second took 23 seconds. For a more complicated query, which adds a lower-level node tag evaluation to the previous matching condition, performance was even better: the first run took 24 seconds and the second run of the same query took 23 seconds of elapsed time. The average across the four runs is about 24.5 seconds of CPU elapsed time.
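The reported average can be reproduced from the four run times given on this slide. The short check below uses only those numbers; the variable names are illustrative.

```python
from statistics import mean

simple_runs = [28, 23]    # seconds, first and second run of the simple query
complex_runs = [24, 23]   # seconds, first and second run of the complicated query

print(mean(simple_runs))                  # 25.5
print(mean(complex_runs))                 # 23.5
print(mean(simple_runs + complex_runs))   # 24.5, the average reported on the slide
```

With only two runs per query, the difference between the two per-query averages is small relative to the run-to-run variation, so the comparison should be read as indicative rather than conclusive.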

Performance measurement and Benchmark (cont.)
Retrieving query result from HDFS to DB2:
Retrieving the query result requires the HDFS_READ function, which reads file contents from HDFS and returns them to a DB2 table. A SELECT from the UDF table through HDFS_READ takes about 0.06 seconds of CPU time to retrieve the data. In general, for files smaller than 10 MB, data retrieval with HDFS_READ takes less than 0.1 seconds of CPU elapsed time. Elapsed time also varies between individual executions of each application because of network traffic and variation in data node response time when processing the query.

Conclusion and Future Work
- Implemented OLAP query processing on XML data in the MapReduce framework
- Enabled DB2 XML data offloading
- Enabled offloading of XQuery queries from DB2z
- Reworked the MapReduce framework on the BigInsights ad hoc server to enable XQuery support
- Aggregated data in BigInsights can be transferred from HDFS back to DB2, forming an XML table on the z/OS mainframe

Future work
- Allow XML file separation for multi-block processing
- Allow a CDC (Changed Data Capture) schema for continuous online transactional processing (OLTP)
- Customize the query output type instead of plain text format

Thank you!
George Wang, IBM