The Data Lake: A New Solution to Business Intelligence


Agenda
- Cas Apanowicz: An Introduction
- A Little History
- Traditional DW/BI
- What Is a Data Lake?
- Why Is It Better?
- Architectural Reference
- New Paradigm and Architectural Reference
- The Future of the Data Lake
- Q&A
- Appendix A

Cas Apanowicz

Cas is the founder and was the first CEO of Infobright, the first open-source data warehouse company, co-owned by Sun Microsystems and RBC Royal Bank of Canada. He is an accomplished IT consultant and entrepreneur and an internationally recognized IT practitioner who has served as a co-chair and speaker at international conferences.

Prior to Infobright, Mr. Apanowicz founded Cognitron Technology, which specialized in developing data mining tools, many of which were used in health care to assist in customer care and treatment. Before Cognitron Technology, Mr. Apanowicz worked in the Research Centre at BC Tel, where he developed an algorithm that measured customer satisfaction. At the same time, he worked in the Brain Centre at UBC in Vancouver, applying ground-breaking algorithms to the interpretation of brain readings, and offered his expertise to Vancouver General Hospital in applying new technology to the recognition of different types of epilepsy.

Cas has been designing and delivering BI/DW technology solutions for over 18 years. He created a BI/DW open-source software company and holds North American patents in this field. Throughout his career, Cas has held consulting roles with Fortune 500 companies across North America, including the Royal Bank of Canada, the New York Stock Exchange, the Federal Government of Canada, Honda, and many others.

Cas holds a Master's degree in Mathematics from the University of Krakow. He is the author of North American patents and several publications with renowned publishers such as Springer and Sherbrooke Hospital, and is regularly invited by Springer to peer-review IT-related publications.

A Little History
- Big Data has received much attention over the past two years; some call it "Ugly Data."
- The challenge is dealing with the "mountains of sand": hundreds, thousands, and in some cases millions of small, medium, and large data sets that are related but unintegrated.
- IT is overtaxed and unable to integrate the vast majority of this data.
- A new class of software is needed to discover relationships between related yet unintegrated data sets.

Current BI in the Cloud

Extensive processes and costs (BI and Hadoop):
- Data analysis
- Data cleansing
- Entity-relationship modeling
- Dimensional modeling
- Database design and implementation
- Database population through ETL/ELT
- Downstream application linkage (metadata)
- Maintaining the processes

[Diagram: source data flowing through these processes into multiple analytical databases and data marts.]
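To make the cost of this pipeline concrete, here is a minimal, hypothetical sketch of the extract-transform-load steps listed above. It is not code from the presentation; the source records, cleansing rule, and target schema are all invented for illustration.

```python
# Minimal, hypothetical ETL sketch illustrating the stages listed above.
# Source records, cleansing rules, and target schema are invented examples.

def extract():
    # Extraction: pull raw rows from a source system (here, a hard-coded sample).
    return [
        {"customer": " Alice ", "amount": "100.50"},
        {"customer": "Bob", "amount": "not-a-number"},  # dirty record
        {"customer": "Carol", "amount": "75.00"},
    ]

def transform(rows):
    # Cleansing and conformance to the target schema.
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # data cleansing: drop unparseable records
        clean.append({"customer": row["customer"].strip(), "amount": amount})
    return clean

def load(rows, warehouse):
    # Load/apply into the warehouse (an in-memory list standing in for a DW).
    warehouse.extend(rows)
    return warehouse

warehouse = load(transform(extract()), [])
print(len(warehouse))  # 2 (the dirty record is dropped)
```

Every stage in the slide's list (cleansing, modeling, population, maintenance) adds code like this, which is the cost the Data Lake approach aims to eliminate.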

BI Reference Architecture (with Hadoop overlay)

[Diagram: the BI reference architecture with Hadoop components overlaid on the traditional layers.]
- Access: web browser, portals, devices (e.g., mobile), web services
- Business applications: collaboration, query & reporting, data mining, modeling, scorecards, visualization, embedded analytics
- Data repositories: operational data stores, data warehouse, data marts, staging areas, metadata — replaced by HDFS with analytical data marts, with HCatalog as the "Data Lake" catalog
- Data integration: extraction, transformation, load/apply, synchronization, transport/messaging, information integrity — replaced by Sqoop for extraction and MapReduce/Pig for load/apply from the single HDFS source; HCatalog and Pig can work with most ETL tools on the market
- Data sources: enterprise, unstructured, informational, external (supplier, orders, product, promotions, customer, location, invoice, ePOS, other)
- Cross-cutting: metadata management (HCatalog), security and data privacy, system management and administration, network connectivity/protocols/access, middleware, hardware & software platforms

BI Reference Architecture

[Diagram: the baseline BI reference architecture — the same layers as above, without the Hadoop overlay, shown for comparison.]

BI Reference Architecture: Component Mapping

[Diagram: the reference architecture repeated, with each current-BI integration component mapped to its proposed Hadoop counterpart.]

Extraction (current BI) – An application used to transfer data, usually from relational databases, to a flat file, which can then be transported to a landing area of a data warehouse and ingested into the BI/DW environment.

Sqoop (proposed BI) – A command-line application for transferring data between relational databases and Hadoop. It supports incremental loads of a single table or a free-form SQL query, as well as saved jobs that can be run multiple times to import updates made to a database since the last import. Exports can be used to put data from Hadoop into a relational database.

HCatalog – A Hadoop metadata repository and management service that provides a centralized way for data processing systems to understand the structure and location of the data stored within Apache Hadoop.

MapReduce – A framework for writing applications that process large amounts of structured and unstructured data in parallel, across large clusters of machines, in a reliable and fault-tolerant manner.

Pig – A platform for processing and analyzing large data sets. Pig consists of a high-level language (Pig Latin) for expressing data analysis programs, paired with the MapReduce framework for executing those programs.

Transformation – In the current BI, complex ETL moves data through landing, staging, DW, and DM layers; in the proposed BI, MapReduce/Pig transforms data directly from HDFS into the data marts.

Synchronization – In the current BI, the ETL process takes source data from staging, transforms it using business rules, and loads it into the central DW repository. To retain information integrity, a synchronization check-and-correction mechanism must be put in place. In the proposed solution, HDFS acts as a single source of data, so there is no danger of desynchronization; inconsistencies resulting from duplicated or inconsistent data are reconciled with the assistance of HCatalog and proper data governance.

Information Integrity – Currently there is no special approach to data quality other than what is embedded in the ETL processes and logic, although tools and approaches exist to implement QA & QC. In the proposed, more focused approach, HDFS serves as one big "Data Lake," and QA & QC are applied at the data mart level, where the actual transformations occur, reducing the overall effort. QA & QC become an integral part of data governance, augmented by the use of HCatalog.
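As an illustration of the map/reduce model described above, here is a minimal pure-Python sketch of a word count, the canonical MapReduce example. This is not Hadoop code: real MapReduce distributes these phases across a cluster, and the input lines here are invented.

```python
from collections import defaultdict

# Single-process sketch of the MapReduce model: a map phase emitting
# (key, value) pairs, then a shuffle/reduce phase grouping by key.

def map_phase(records):
    # Map: emit (word, 1) for every word in every record.
    for record in records:
        for word in record.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Shuffle + reduce: group values by key and sum them.
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

lines = ["big data in the lake", "data in the warehouse"]
result = reduce_phase(map_phase(lines))
print(result["data"])  # 2
print(result["lake"])  # 1
```

Pig Latin expresses this same pattern declaratively (GROUP ... / COUNT ...) and compiles it down to MapReduce jobs.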

BI Reference Architecture: Data Repositories

[Diagram: the data-repository layer of the reference architecture (operational data stores, data warehouse, data marts, staging areas, metadata) replaced by HDFS with HCatalog.]

HDFS (Hadoop Distributed File System) – A reliable, distributed, Java-based file system that allows large volumes of data to be stored and rapidly accessed across large clusters of commodity servers.

HCatalog – A Hadoop metadata repository and management service that provides a centralized way for data processing systems to understand the structure and location of the data stored within Apache Hadoop.

BI Reference Architecture: Proposed Data Flow

[Diagram: data sources feeding HDFS as the single source via Sqoop, with HCatalog as the metadata repository and MapReduce/Pig performing load/apply into the analytical data marts; HCatalog and Pig can also work with Informatica for data integration.]

BI Reference Architecture: Capability Comparison

Capability | Current BI | Proposed BI | Expected Change
Data sources | Source applications | Source applications | No change
Extraction from source | DB export | Sqoop | One-to-one change
Transport/messaging | SFTP | SFTP | No change
Staging-area transformations/load | Complex ETL code | None required | Eliminated
Extract from staging | Complex ETL code | None required | Eliminated
Transformation for DW | Complex ETL code | None required | Eliminated
Load to DW | Complex ETL, RDBMS | None required | Eliminated
Extract from DW, transform and load to DM | Complex ETL code and process to feed DM | MapReduce/Pig | Simplified transformations from HDFS to DM
Data quality, balance & controls | Embedded ETL code | MapReduce/Pig in conjunction with HCatalog; can also coexist with Informatica | Yes

BI Reference Architecture: Data Repositories Comparison

Capability | Current BI | Proposed BI | Expected Change
Operational data stores | Additional data store (currently sharing resources with BI/DW) | No additional repository; BI consumption implemented through the appropriate DM | Additional data store eliminated
Data warehouse | Complex schema on an expensive platform; requires complex modeling and design for any new data element | Eliminated: all data is collected in HDFS and available for feeding all required data marts (no schema on write) | Eliminated
Staging areas | Complex schema on an expensive platform; requires complex design for any new data element | Eliminated: all data is collected in HDFS and available for creating data marts | Eliminated
Data marts | Dimensional schema | Dimensional schema | No change
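The "no schema on write" point in the comparison above can be sketched as follows: raw records land in the lake untyped, and each data mart applies its own schema only at read time. This is a hypothetical Python illustration, not code from the presentation; the record shapes and the mart are invented.

```python
# Hypothetical sketch of schema-on-read: the "lake" stores raw, untyped
# records of varying shapes; a data mart applies its schema when reading.

data_lake = [
    {"event": "sale", "amount": "19.99", "region": "east"},
    {"event": "sale", "amount": "5.00", "region": "west"},
    {"event": "click", "page": "/home"},  # different shape, still accepted
]

def sales_mart(lake):
    # Schema applied at read time: keep only sale events, cast amount to float.
    return [
        {"amount": float(r["amount"]), "region": r["region"]}
        for r in lake
        if r.get("event") == "sale"
    ]

mart = sales_mart(data_lake)
print(len(mart))  # 2
```

Because no schema is enforced on write, adding a new data element never requires remodeling the warehouse; only the marts that consume it change.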

BI Reference Architecture: Operational Comparison

Capability | Current BI | Proposed BI | Expected Change
Metadata | Not implemented | HCatalog | Simplified, due to simpler processing and a native metadata management system
Security | Mature enterprise | Mature enterprise, guaranteed by the cloud provider | Less maintenance
Analytics | WebFOCUS, MicroStrategy, Pentaho, SSRS, etc. | Same | No change
Access | Web, mobile, other | Same | No change

Business Case

The client had an internally developed BI component strategically positioned in its BI ecosystem. Cas Apanowicz of IT Horizon Corp. was retained to evaluate the solution. The Data Lake approach was recommended, resulting in a total saving of $778,000 and shortening the implementation time from 6 months to 2:

Solution Component | Traditional/Original | Proposed DW Discovery
Implementation time | 6 months | 2 months
Cost of implementation | $975,000 | $197,000
Resources involved in implementation | 17 | 4
Estimated maintenance cost | $195,000 | $25,000
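The quoted saving can be checked against the table's own figures: the implementation-cost difference alone is $975,000 − $197,000 = $778,000, with a further $170,000 difference in estimated maintenance cost. A quick arithmetic check:

```python
# Arithmetic check of the business-case figures quoted above.
traditional_impl = 975_000
proposed_impl = 197_000
traditional_maint = 195_000
proposed_maint = 25_000

implementation_saving = traditional_impl - proposed_impl
maintenance_saving = traditional_maint - proposed_maint

print(implementation_saving)  # 778000 (the quoted total saving)
print(maintenance_saving)     # 170000 (additional ongoing saving)
```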

Thank You

Questions?

Contact information: Cas Apanowicz