
1 The Data Lake: A New Solution to Business Intelligence

2 Agenda
- Cas Apanowicz: an introduction
- A little history
- Traditional DW/BI
- What is a Data Lake?
- Why is it better?
- Architectural reference
- New paradigm and architectural reference
- The future of the Data Lake
- Q&A
- Appendix A

3 Cas Apanowicz
Cas is the founder and was the first CEO of Infobright, the first open-source data warehouse company, co-owned by Sun Microsystems and RBC Royal Bank of Canada. He is an accomplished IT consultant and entrepreneur and an internationally recognized IT practitioner who has served as a co-chair and speaker at international conferences. Prior to Infobright, Mr. Apanowicz founded Cognitron Technology, which specialized in developing data mining tools, many of which were used in the health care field to assist in customer care and treatment. Before Cognitron Technology, Mr. Apanowicz worked in the Research Centre at BCTel, where he developed an algorithm for measuring customer satisfaction. At the same time, he worked in the Brain Centre at UBC in Vancouver, applying ground-breaking algorithms to the interpretation of brain readings, and he offered his expertise to Vancouver General Hospital in applying new technology for the recognition of different types of epilepsy. Cas has been designing and delivering BI/DW technology solutions for over 18 years; he has created a BI/DW open-source software company and holds North American patents in this field. Throughout his career, he has held consulting roles with Fortune 500 companies across North America, including the Royal Bank of Canada, the New York Stock Exchange, the Federal Government of Canada, Honda, and many others. Cas holds a Master's degree in Mathematics from the University of Krakow. He is the author of North American patents and of several publications with renowned publishers such as Springer and Sherbrooke Hospital, and he is regularly invited by Springer to peer-review IT-related publications.

4 A Little History
- Big Data has received much attention over the past two years, with some calling it "Ugly Data".
- The challenge is dealing with the "mountains of sand": hundreds, thousands, and in some cases millions of small, medium, and large data sets that are related but unintegrated.
- IT is overtaxed and unable to integrate the vast majority of this data.
- A new class of software is needed to discover relationships between related yet unintegrated data sets.

5 Current BI in the Cloud
[Diagram: source data feeding BI and Hadoop in the cloud, through data marts and multiple analytical databases.]
Extensive processes and costs:
- Data analysis
- Data cleansing
- Entity-relationship modeling
- Dimensional modeling
- Database design & implementation
- Database population through ETL/ELT
- Downstream application linkage (metadata)
- Maintaining the processes

6 BI Reference Architecture with the Data Lake
[Architecture diagram; recoverable labels, top to bottom:]
- Access: web browser, portals, devices (e.g. mobile), web services
- Business Applications: collaboration, query & reporting, data mining, modeling, scorecard, visualization, embedded analytics
- Data Repositories: HDFS as the Data Lake with HCatalog, feeding analytical Data Marts (replacing operational data stores, data warehouse, data marts, staging areas, and separate metadata stores)
- Data Integration: extraction via Sqoop (transport/messaging); load/apply via MapReduce/Pig from HDFS as the single source; HCatalog & Pig can work with most ETL tools on the market (a Sqoop import sketch follows this slide)
- Data Sources: enterprise, unstructured, informational, external (supplier, orders, product, promotions, customer, location, invoice, ePOS, other)
- Cross-cutting layers: metadata management (HCatalog), security and data privacy, system management and administration, network connectivity, protocols & access, middleware, hardware & software platforms
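To make the Sqoop extraction path concrete, here is a minimal sketch of a Sqoop 1.x incremental import into the HDFS data lake. The connection string, credentials, table name, key column, and target directory are illustrative assumptions, not details from the deck.

    # Hypothetical example: pull new rows of an "orders" table from a
    # relational source into HDFS. All names here are illustrative.
    sqoop import \
      --connect jdbc:mysql://source-db.example.com/sales \
      --username etl_user -P \
      --table orders \
      --incremental append \
      --check-column order_id \
      --last-value 0 \
      --target-dir /data/lake/orders

Each re-run with an updated --last-value picks up only rows added since the previous import, which is the one-to-one replacement for the current database-export step compared later in the deck.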

7 BI Reference Architecture (traditional)
[The same architecture diagram without the Hadoop components: Access (web browser, portals, mobile devices, web services), Business Applications (collaboration, query & reporting, data mining, modeling, scorecard, visualization, embedded analytics), Data Repositories (operational data stores, data warehouse, data marts, staging areas, metadata), Data Integration (extraction, transformation, load/apply, synchronization, transport/messaging, information integrity, data flow and workflow), Data Sources (enterprise, unstructured, informational, external), and the cross-cutting metadata management, security and data privacy, and system management and administration layers.]

8 BI Reference Architecture: Data Integration, Current vs. Proposed
[The architecture diagram is repeated; this slide annotates the Data Integration layer, contrasting the current pipeline (source, database extract, SFTP, landing, staging, DW, DM) with the proposed one (source, Sqoop, HDFS, MapReduce/Pig, DM). Transport/messaging metadata is managed by HCatalog, a Hadoop metadata repository and management service that provides a centralized way for data processing systems to understand the structure and location of data stored within Apache Hadoop.]

Extraction
- Current BI: an extraction application transfers data, usually from relational databases, to flat files, which are then transported (e.g. by SFTP) to a landing area of the Data Warehouse and ingested into the BI/DW environment.
- Proposed BI: Sqoop, a command-line application for transferring data between relational databases and Hadoop. It supports incremental loads of a single table or a free-form SQL query, as well as saved jobs that can be run repeatedly to import updates made to the database since the last import (a saved-job sketch follows this slide). Exports can also move data from Hadoop back into a relational database.

Transformation
- Current BI: complex ETL from landing through staging and the DW into the Data Marts.
- Proposed BI: MapReduce, a framework for writing applications that process large amounts of structured and unstructured data in parallel across large clusters of machines in a reliable, fault-tolerant manner, and Pig, a platform for processing and analyzing large data sets. Pig consists of a high-level language (Pig Latin) for expressing data analysis programs, paired with the MapReduce framework for executing them. Transformations run directly from HDFS into the Data Marts (see the Pig sketch after this slide).

Synchronization
- Current BI: the ETL process takes source data from staging, transforms it using business rules, and loads it into the central DW repository. To retain information integrity, a synchronization check-and-correction mechanism has to be put in place between staging, the DW, and the downstream Data Marts.
- Proposed BI: HDFS acts as a single source of data, so there is no danger of desynchronization. Inconsistencies resulting from duplicated or inconsistent data are reconciled with the assistance of HCatalog and proper data governance.

Information Integrity
- Current BI: no special approach to data quality beyond what is embedded in the ETL processes and logic, although tools and approaches exist to implement QA & QC.
- Proposed BI: a more focused approach. With HDFS used as one big "Data Lake", QA and QC are applied at the Data Mart level, where the actual transformations occur, reducing the overall effort. QA & QC become an integral part of data governance, augmented by the use of HCatalog.
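The slide mentions Sqoop's saved jobs for repeatable incremental imports. Here is a minimal sketch; the job name, connection details, and column names are assumptions for illustration only.

    # Hypothetical example: define a reusable incremental import job once...
    sqoop job --create orders_import -- import \
      --connect jdbc:mysql://source-db.example.com/sales \
      --username etl_user -P \
      --table orders \
      --incremental append \
      --check-column order_id \
      --last-value 0 \
      --target-dir /data/lake/orders

    # ...then re-run it on a schedule. Sqoop records the last imported
    # key value in the saved job, so each execution fetches only new rows.
    sqoop job --exec orders_import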

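As a concrete illustration of the MapReduce/Pig transformation path, the following is a minimal Pig Latin sketch that feeds a data mart directly from HDFS. The file layout, field names, and the business rule are assumptions for illustration, not part of the original deck.

    -- customer_totals.pig : hypothetical transformation sketch.
    -- Run with: pig customer_totals.pig
    -- Load raw order data landed in HDFS by Sqoop (illustrative path/schema).
    orders = LOAD '/data/lake/orders' USING PigStorage(',')
             AS (order_id:long, customer_id:long, amount:double, order_date:chararray);
    -- A simple data-quality rule applied at transformation time.
    paid   = FILTER orders BY amount > 0.0;
    -- Aggregate per customer for the data mart.
    by_cst = GROUP paid BY customer_id;
    totals = FOREACH by_cst GENERATE group AS customer_id,
                                     SUM(paid.amount) AS total_spent;
    -- Write straight into a data-mart directory: no staging, no DW load.
    STORE totals INTO '/data/marts/customer_totals' USING PigStorage(',');

Pig compiles this script into MapReduce jobs, so the transformation runs in parallel across the cluster, and because HDFS is the single source there is no intermediate staging or DW copy to keep synchronized.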
9 BI Reference Architecture: Data Repositories
[Diagram: the Data Repositories layer of the reference architecture (operational data stores, data warehouse, data marts, staging areas, metadata) is replaced by HDFS managed through HCatalog.]
- Hadoop Distributed File System (HDFS): a reliable, distributed, Java-based file system that allows large volumes of data to be stored and rapidly accessed across large clusters of commodity servers.
- HCatalog: a Hadoop metadata repository and management service that provides a centralized way for data processing systems to understand the structure and location of the data stored within Apache Hadoop (a registration sketch follows this slide).
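A minimal sketch of how a data set already sitting in HDFS might be registered with HCatalog so downstream tools can discover its structure and location. The table definition, delimiter, and path are assumptions for illustration.

    # Hypothetical example: register the Sqoop-landed orders directory as an
    # HCatalog table via the hcat command-line tool. Names are illustrative.
    hcat -e "
      CREATE EXTERNAL TABLE orders (
        order_id     BIGINT,
        customer_id  BIGINT,
        amount       DOUBLE,
        order_date   STRING)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
      LOCATION '/data/lake/orders';"

Once registered, the same table is visible by name to Pig, Hive, and MapReduce alike, which is the centralized-understanding role the slide describes.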

10 BI Reference Architecture: Proposed Data Flow
[Diagram: data sources (enterprise, unstructured, informational, external: supplier, orders, product, promotions, customer, location, invoice, ePOS, other) flow through Sqoop (transport/messaging) into HDFS; MapReduce/Pig performs load/apply from that single source into the analytical Data Marts; HCatalog manages the data-repository metadata; the access and business-application layers are unchanged. HCatalog & Pig can also work with Informatica data integration (a combined HCatalog/Pig sketch follows this slide).]
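A minimal sketch of the HCatalog-and-Pig pairing this slide refers to: Pig reads and writes tables by name through HCatalog rather than by raw HDFS paths. The table names and filter are assumptions, and the loader class package differs between older and newer HCatalog releases, so it should be verified against the cluster's version.

    -- feed_mart.pig : hypothetical sketch. Run with: pig -useHCatalog feed_mart.pig
    -- Load by table name; HCatalog supplies the schema and location.
    -- (Older releases use org.apache.hcatalog.pig.HCatLoader instead.)
    orders = LOAD 'orders' USING org.apache.hive.hcatalog.pig.HCatLoader();
    recent = FILTER orders BY order_date >= '2014-01-01';
    -- The target table is assumed to already exist in HCatalog.
    STORE recent INTO 'orders_recent' USING org.apache.hive.hcatalog.pig.HCatStorer();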

11 BI Reference Architecture: Data Integration, Capability Comparison

Capability                             | Current BI                            | Proposed BI                                                | Expected Change
---------------------------------------|---------------------------------------|------------------------------------------------------------|----------------
Data sources                           | Source applications                   | Source applications                                        | No change
Extraction from source                 | DB export                             | Sqoop                                                      | One-to-one change
Transport/messaging                    | SFTP                                  | SFTP                                                       | No change
Staging-area transformations/load      | Complex ETL code                      | None required                                              | Eliminated
Extract from staging                   | Complex ETL code                      | None required                                              | Eliminated
Transformation for DW                  | Complex ETL code                      | None required                                              | Eliminated
Load to DW                             | Complex ETL, RDBMS                    | None required                                              | Eliminated
Extract from DW, transform, load to DM | Complex ETL code & process to feed DM | MapReduce/Pig                                              | Simplified transformations from HDFS to DM
Data quality, balance & controls       | Embedded ETL code                     | MapReduce/Pig in conjunction with HCatalog; can also coexist with Informatica | Yes

12 BI Reference Architecture: Data Repositories, Capability Comparison

Capability              | Current BI                                                                                         | Proposed BI                                                                                                | Expected Change
------------------------|----------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------|----------------
Operational data stores | Additional data store (currently sharing resources with BI/DW)                                     | No additional repository; BI consumption implemented through the appropriate DM                            | Elimination of the additional data store
Data warehouse          | Complex schema, expensive platform; requires complex modeling and design for any new data element  | Eliminated: all data is collected in HDFS and available for feeding all required Data Marts (DM); no schema-on-write | Eliminated
Staging areas           | Complex schema, expensive platform; requires complex design with any new data element              | Eliminated: all data is collected in HDFS and available for creation of Data Marts                         | Eliminated
Data marts              | Dimensional schema                                                                                 | Dimensional schema                                                                                          | No change

13 BI Reference Architecture: Remaining Layers, Capability Comparison

Capability | Current BI                                   | Proposed BI                                          | Expected Change
-----------|----------------------------------------------|------------------------------------------------------|----------------
Metadata   | Not implemented                              | HCatalog                                             | Simplified, due to simplified processing and a native metadata management system
Security   | Mature enterprise                            | Mature enterprise, guaranteed by the cloud provider  | Less maintenance
Analytics  | WebFOCUS, MicroStrategy, Pentaho, SSRS, etc. | Same tools                                           | No change
Access     | Web, mobile, other                           | Same channels                                        | No change

14 Business Case
The client had an internally developed BI component strategically positioned in its BI ecosystem. Cas Apanowicz of IT Horizon Corp. was retained to evaluate the solution. The Data Lake approach was recommended, resulting in total savings of $778,000 and shortening the implementation time from 6 months to 2:

Solution Component                             | Traditional/Original | Proposed DW Discovery
-----------------------------------------------|----------------------|----------------------
Implementation time                            | 6 months             | 2 months
Cost of implementation                         | $975,000             | $197,000
Number of resources involved in implementation | 17                   | 4
Estimated maintenance cost                     | $195,000             | $25,000

15 Thank You. Questions?
Contact information: Cas Apanowicz, cas@it-horizon.com, 416-882-5464

