Zhangxi Lin Texas Tech University

Presentation transcript:

ISQS 3358, Business Intelligence: Supplemental Notes on the Term Project. Zhangxi Lin, Texas Tech University

Projects
Students will build a Hadoop system and explore and visualize a Hadoop-based data warehouse. Students are divided into three cohorts: Cohort 1 uses Pentaho for data analysis, Cohort 2 uses Tableau for data analysis, and Cohort 3 works on a self-selected business intelligence topic. Each cohort may have 2-4 teams with no more than 12 students in total, and each team is composed of 2-4 members. Deliverables include a 15-minute team presentation and a term report of 6-10 pages.

Project contents
Each team will identify a big data topic and find the data it needs. The dataset does not have to be truly big, but it must be representative. A data warehouse built on either SQL Server or Hadoop is fine. Data analysis and visualization must be well done. The report and presentation will cover the following points:
- Business background
- Data description
- Data model design
- ETL
- Analytical results

Hadoop/Spark

Topics
1. Data warehousing. Focus: Hadoop data warehouse design. Components: HDFS, HBase, Hive, NoSQL/NewSQL, Solr
2. Publicly available big data services. Focus: tools and free resources. Components: Hortonworks, Cloudera, HaaS, EC2
3. MapReduce & data mining. Focus: efficiency of distributed data/text mining. Components: Mahout, H2O, MLlib, R, Python
4. Big data ETL. Focus: heterogeneous data processing across platforms. Components: Kettle, Flume, Sqoop, Impala
5. System management. Focus: load balancing and system efficiency. Components: Oozie, ZooKeeper, Ambari, Loom, Ganglia, Mesos
6. Application development platform. Focus: algorithms and innovative development environments. Components: Tomcat, Neo4j, Titan, GraphX, Pig, Hue
7. Tools & visualizations. Focus: features for big data visualization and data utilization. Components: Pentaho, Tableau, Qlik, Saiku, Mondrian, Gephi
8. Streaming data processing. Focus: efficiency and effectiveness of real-time data processing. Components: Spark, Storm, Kafka, Avro

Data Warehousing Methodology: Implementing a data warehouse systematically

Data Warehouse Development Methods
Data warehouse development approaches:
- Kimball model (data mart approach): data marts first, then the EDW
- Inmon model (EDW approach): the EDW first, then data marts
Which model is better? There is no one-size-fits-all strategy for data warehousing. One alternative is the hosted warehouse.

Comparison: Kimball Model vs. Inmon Model
Kimball Model
Kimball’s model follows a bottom-up approach. The Data Warehouse (DW) is provisioned from Datamarts (DM) as and when they are available or required. The Datamarts are sourced from OLTP systems, which are usually relational databases in Third Normal Form (3NF). The Data Warehouse, which is central to the model, is a de-normalized star schema. The OLAP cubes are built on this DW.
Inmon Model
Inmon’s model follows a top-down approach. The Data Warehouse (DW) is sourced from OLTP systems and is the central repository of data. The Data Warehouse in Inmon’s model is in Third Normal Form (3NF). The Datamarts (DM) are provisioned out of the Data Warehouse as and when required. Datamarts in Inmon’s model are in 3NF, and the OLAP cubes are built from them.
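To make the difference between a 3NF layout and a de-normalized star schema concrete, here is a minimal sketch using Python's built-in sqlite3 module. It is an illustration only, not part of the original slides: the table names (category, product, sale, dim_product, dim_date, fact_sales) and the sample rows are hypothetical. It loads a small normalized source, derives a star schema from it, and runs the kind of roll-up query an OLAP cube would serve.

```python
import sqlite3


def build_star_schema(conn):
    """Create a tiny 3NF source and derive a star schema from it.

    All table names and sample rows are hypothetical, for illustration only.
    """
    conn.executescript("""
        -- Normalized (3NF-style) tables standing in for an OLTP source.
        CREATE TABLE category (category_id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE product (
            product_id INTEGER PRIMARY KEY,
            name TEXT,
            category_id INTEGER REFERENCES category(category_id)
        );
        CREATE TABLE sale (
            sale_id INTEGER PRIMARY KEY,
            product_id INTEGER REFERENCES product(product_id),
            sale_date TEXT,
            quantity INTEGER,
            unit_price REAL
        );

        INSERT INTO category VALUES (1, 'Books'), (2, 'Games');
        INSERT INTO product VALUES (10, 'Atlas', 1), (11, 'Novel', 1), (20, 'Chess', 2);
        INSERT INTO sale VALUES
            (1, 10, '2016-01-15', 2, 30.0),
            (2, 11, '2016-01-20', 1, 12.0),
            (3, 20, '2016-02-03', 3, 20.0),
            (4, 10, '2016-02-10', 1, 30.0);

        -- Dimension: product and category folded into one de-normalized table.
        CREATE TABLE dim_product AS
            SELECT p.product_id, p.name AS product_name, c.name AS category_name
            FROM product p JOIN category c ON c.category_id = p.category_id;

        -- Dimension: one row per calendar date, with a month attribute.
        CREATE TABLE dim_date AS
            SELECT DISTINCT sale_date AS date_key,
                            substr(sale_date, 1, 7) AS year_month
            FROM sale;

        -- Fact table: one row per sale, keyed to the dimensions, with measures.
        CREATE TABLE fact_sales AS
            SELECT sale_id, product_id, sale_date AS date_key,
                   quantity, quantity * unit_price AS revenue
            FROM sale;
    """)


def revenue_by_category_and_month(conn):
    """Roll revenue up along two dimensions, as an OLAP cube would."""
    return conn.execute("""
        SELECT d.category_name, t.year_month, SUM(f.revenue)
        FROM fact_sales f
        JOIN dim_product d ON d.product_id = f.product_id
        JOIN dim_date t ON t.date_key = f.date_key
        GROUP BY d.category_name, t.year_month
        ORDER BY d.category_name, t.year_month
    """).fetchall()


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    build_star_schema(connection)
    for row in revenue_by_category_and_month(connection):
        print(row)
```

In the star schema the roll-up needs only one join per dimension, whereas the same question against the 3NF tables requires walking the sale, product, and category chain. That trade-off (simpler analytical queries in exchange for redundant dimension data) is one way to read the contrast between the two models.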

Strengths and Weaknesses: Scalable vs. Structural
Kimball’s model is more scalable because of its bottom-up approach: you can start small and scale up over time. The ROI is usually faster with Kimball’s model. However, because of this approach it is difficult to create reusable structures and ETL for different data marts. Inmon’s model, on the other hand, is more structured and easier to maintain, but it is rigid and takes more time to build. The significant advantage of Inmon’s model is that, because the DW is in 3NF, it is easier to build data mining models.
Both the Kimball and Inmon models agree on, and emphasize, that the DW is the central repository of data and that OLAP cubes are built from de-normalized star schemas.
In conclusion, when it comes to data modeling, it is irrelevant which camp you belong to as long as you understand why you are adopting a specific model. Sometimes it makes sense to take a hybrid approach.

General Data Warehouse Development Approaches
The most challenging aspect of data warehousing lies not in its technical difficulty, but in choosing the best approach to data warehousing for your company’s structure and culture, and dealing with the organizational and political issues that will inevitably arise during implementation. Among the different approaches to developing a data warehouse are:
- “Big bang” approach
- Incremental approach
  - Top-down incremental approach
  - Bottom-up incremental approach
ISQS 6339, Data Mgmt & BI, Zhangxi Lin

“Big Bang” Approach
- Analyze enterprise requirements
- Build enterprise data warehouse
- Report in subsets or store in data marts
Historically, IT departments attempted to deliver enterprisewide data warehouse implementations in a single project. Data warehouse development is a huge task, and it is a mistake to assume that the solution can be built all at once. The time required to develop the warehouse often means that user requirements and technologies change before the project is completed. In this approach, you perform the following:
- Analyze the entire information requirement for the organization
- Build the enterprise data warehouse to support these requirements
- Build access, as required, either directly or by subsetting to data marts

Incremental Approach to Warehouse Development
- Multiple iterations
- Shorter implementations
- Validation of each phase
[Figure: Increment 1 cycles iteratively through Strategy, Definition, Analysis, Design, Build, and Production.]
The incremental approach manages the growth of the data warehouse by developing incremental solutions that comply with the full-scale data warehouse architecture. Rather than starting by building an entire enterprisewide data warehouse as a first deliverable, start with just one or two subject areas, implement them as scalable data marts, and roll them out to your end users. Then, after observing how users are actually using the warehouse, add the next subject area or the next increment of functionality to the system. This is also an iterative process. It is this iteration that keeps the data warehouse in line with the needs of the organization.
Benefits:
- Delivers a strategic data warehouse solution through incremental development efforts
- Provides an extensible, scalable architecture
- Supports the information needs of the enterprise organization
- Quickly provides business benefit and ensures a much earlier return on investment
- Allows a data warehouse to be built one subject or application area at a time
- Allows the construction of an integrated data mart environment

Top-Down Approach
- Analyze requirements at the enterprise level
- Develop conceptual information model
- Identify and prioritize subject areas
- Complete a model of selected subject area
- Map to available data
- Perform a source system analysis
- Implement base technical architecture
- Establish metadata, extraction, and load processes for the initial subject area
- Create and populate the initial subject area data mart within the overall warehouse framework
Top-Down Incremental Approach
Advantages. This approach has the following advantages:
- Provides a relatively quick implementation and payback. Typically, the scoping, definition study, and initial implementation are scaled down so that they can be completed in six to seven months.
- Offers significantly lower risk because it avoids being as analysis heavy as the “big bang” approach
- Emphasizes high-level business needs
- Achieves synergy among subject areas. Maximum information leverage is achieved as cross-functional reporting and a single version of the truth are made possible.
Disadvantages. This approach has the following disadvantages:
- Requires an increase in up-front costs before the business sees any return on its investment
- Is difficult to define the boundaries of the scoping exercise if the business is global
- May not be suitable unless the client needs cross-functional reporting

Bottom-Up Approach
- Define the scope and coverage of the data warehouse and analyze the source systems within this scope
- Define the initial increment based on political pressure, assumed business benefit, and data volume
- Implement base technical architecture and establish metadata, extraction, and load processes as required by the increment
- Create and populate the initial subject areas within the overall warehouse framework
Bottom-Up Incremental Approach
This approach is similar to the top-down approach, but the emphasis is on the data rather than the business benefit. Here, IT is in charge of the project, either because IT wants to be in charge or because the business has deferred the project to IT.
Advantages. This approach has the following advantages:
- This is a “proof of concept” type of approach, so it is often appealing to IT.
- It is easier to get IT buy-in for this approach because it is focused on IT.
Disadvantages. This approach has the following disadvantages:
- Because the solution model is typically developed from source systems, and these source systems encapsulate the current business processes, the overall extensibility of the model will be compromised.
- IT staff is often the last to know about business changes; IT could be designing something that will be out of date before they complete its delivery.
- As the framework of definition in this approach tends to be much narrower, a significant amount of reengineering work is often required for each increment.