Zhangxi Lin Texas Tech University

ISQS 6339, Business Intelligence
Supplemental Notes on the Term Project (Spring 2017)
Zhangxi Lin, Texas Tech University

Projects
- Two data warehousing projects (70%)
  - SQL Server based
  - Hadoop based
- Big data collaborative studies (30%)
  - One presentation: 30-40 minutes, plus another 10 minutes for discussion
  - Report & references
  - Videos and demonstrations
- Total: 60 points

Term project
Teams of 3-5 students carry out a data mart development project.
- Stage 1 (10%): One-page project proposal. Due April 11
- Stage 2 (20%): Data mart implementation. Due April 20
- Stage 3 (30%): Collaborative study. Due April 27
- Stage 4 (20%): Hadoop project completed. Due May 4
- Stage 5 (20%): Final report. Due May 12
Detailed instructions: http://zlin.ba.ttu.edu/6339/Projects17.html

Merits of data warehousing projects
- A carefully developed project proposal that demonstrates an understanding of the business requirements, attractive analytic themes, and a clearly defined project goal and objectives
- A comprehensive data mart design, for example multiple fact tables, that supports the analytic themes
- Application of advanced ETL models or techniques, such as slowly changing dimensions or the use of containers
- Advanced OLAP cube design and/or optional, self-taught MDX scripting
- Rich data analysis outcomes
- A well-presented final report
- Demonstration of creative ideas and skillful data warehousing ability

Hadoop projects

Technology stack (diagram)
Components shown: Load Balancer; Oozie; Solr, SolrCloud, SolrJ, HA; NewSQL; Kafka, Storm, Impala; REST; ZK; MySQL; Nginx/HA-Proxy; Flume; Sqoop; Ganglia; Tomcat, Jetty; Avro

Big Data Presentation Topics
1. Data warehousing
   Focus: Hadoop data warehouse design
   Components: HDFS, HBase, Hive, NoSQL/NewSQL, Solr
   Team: DW1
2. Publicly available big data services
   Focus: Tools and free resources
   Components: Hortonworks, Cloudera, HaaS, EC2
   Team: DW2
3. MapReduce & data mining
   Focus: Efficiency of distributed data/text mining
   Components: Mahout, H2O, R, Python
   Team: DW3
4. Big data ETL
   Focus: Heterogeneous data processing across platforms
   Components: Kettle, Flume, Sqoop, Impala
   Team: DW4
5. System management
   Focus: Load balancing and system efficiency
   Components: Oozie, ZooKeeper, Ambari, Loom, Ganglia
   Team: DW5
6. Application development platform
   Focus: Algorithms and innovative development environments
   Components: Tomcat, Neo4J, Pig, Hue
   Team: DW6
7. Tools & visualizations
   Focus: Features for big data visualization and data utilization
   Components: Pentaho, Tableau, Saiku, Mondrian, Gephi
   Team: DW7
8. Streaming data processing
   Focus: Efficiency and effectiveness of real-time data processing
   Components: Spark, Storm, Kafka, Avro

Data Warehousing Methodology - Implementing a data warehouse systematically

Dimensional Modeling Process
- Preparation
  - Identifying roles and participants
  - Understanding the data architecture strategy
  - Setting up the modeling environment
  - Establishing naming conventions
- Data profiling and research
  - Data profiling and source system exploration
  - Interacting with source system experts
  - Identifying core business users
  - Studying existing reporting systems
- Building dimensional models
  - High-level dimensional model design
  - Identifying dimension and fact attributes
  - Developing the detailed dimensional model
  - Testing the model
  - Reviewing and validating the model

Business Dimensional Lifecycle (diagram)
Stages shown: Project Planning; Requirements Definition; Technical Architecture Design; Product Selection & Installation; Dimensional Modeling; Physical Design; ETL Design & Development; BI Application Specification; BI Application Development; Deployment; Maintenance; Growth; Project Management

Data Profiling
- Data profiling is a methodology for learning about the characteristics of the data.
- It is a hierarchical process that attempts to build an assessment of the metadata associated with a collection of data sets.
- Three levels
  - Bottom: characterizing the values associated with individual attributes
  - Middle: assessing relationships between multiple columns within a single table
  - Highest: describing relationships that exist between data attributes across different tables
- A profiling program can be run against the sandbox source system to obtain the needed information.
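The three profiling levels can be illustrated with a small Python sketch. This is only one possible reading of the slide, not the course's tooling: the function names, the `customers` and `orders` tables, and their columns are all hypothetical, and the rows stand in for data pulled from a sandbox source system.

```python
from collections import defaultdict


def profile_column(rows, column):
    """Bottom level: summarize the values of a single attribute."""
    values = [row[column] for row in rows]
    non_null = [v for v in values if v is not None]
    return {
        "rows": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
    }


def dependency_violations(rows, determinant, dependent):
    """Middle level: does `determinant` fix `dependent` within one table?

    Returns the determinant values that map to more than one dependent value.
    """
    seen = defaultdict(set)
    for row in rows:
        seen[row[determinant]].add(row[dependent])
    return {key: vals for key, vals in seen.items() if len(vals) > 1}


def orphan_keys(child_rows, child_key, parent_rows, parent_key):
    """Highest level: child key values with no matching row in the parent table."""
    parent_values = {row[parent_key] for row in parent_rows}
    return sorted(
        {
            row[child_key]
            for row in child_rows
            if row[child_key] is not None and row[child_key] not in parent_values
        }
    )


# Hypothetical sample data standing in for a sandbox source system.
customers = [
    {"customer_id": 1, "zip": "79409", "city": "Lubbock"},
    {"customer_id": 2, "zip": "79409", "city": "Lubock"},  # inconsistent spelling
    {"customer_id": 3, "zip": None, "city": "Austin"},
]
orders = [
    {"order_id": 10, "customer_id": 1, "amount": 25.0},
    {"order_id": 11, "customer_id": 4, "amount": 40.0},  # no such customer
]

if __name__ == "__main__":
    print(profile_column(customers, "zip"))
    print(dependency_violations(customers, "zip", "city"))
    print(orphan_keys(orders, "customer_id", customers, "customer_id"))
```

Each function maps to one level: column statistics, a within-table dependency check, and a cross-table key check that flags rows a fact table could not join to its dimension.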

ETL Methodology
- Develop a high-level map
- Build a sandbox source system (optional)
- Perform detailed data profiling
- Make decisions on
  - The source-to-target mapping
  - How often to load tables
  - The strategy for partitioning the relational and Analysis Services fact tables
  - The strategy for extracting data from each source system
- De-duplicate key data from each source system (optional)
- Develop a strategy for distributing dimension tables across multiple database servers (optional)

Sandbox Source System
- Sandbox
  - A protected, limited environment where applications are allowed to "play" without risking damage to the rest of the system.
  - Also a term for the R&D department at many software and computer companies. The term is half-derisive, but reflects the truth that research is a form of creative play.
- In the DW/BI context, a sandbox source system is a subset of the source database used for analytic exploration tasks.
- How to create one
  - Set up a static snapshot of the database
  - Sample the database
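A minimal sketch of the sampling approach, assuming a SQLite source purely for illustration (the course projects use SQL Server and Hadoop). The `sales` table, the `build_sandbox` helper, and the 10% fraction are hypothetical; the table name is interpolated into SQL, so it must come from trusted code, not user input.

```python
import random
import sqlite3


def make_sales_source():
    """Build a hypothetical source database with 100 sales rows."""
    source = sqlite3.connect(":memory:")
    source.execute("CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, amount REAL)")
    source.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [(i, i * 1.5) for i in range(1, 101)],
    )
    source.commit()
    return source


def build_sandbox(source, table, fraction, seed=0):
    """Copy a reproducible random sample of `table` into a new in-memory database."""
    cursor = source.execute(f"SELECT * FROM {table}")
    columns = [description[0] for description in cursor.description]
    rows = cursor.fetchall()

    # A fixed seed makes the sample repeatable between profiling runs.
    sample_size = max(1, round(len(rows) * fraction)) if rows else 0
    sample = random.Random(seed).sample(rows, sample_size)

    sandbox = sqlite3.connect(":memory:")
    # Reuse the source table definition so column names and types match.
    ddl = source.execute(
        "SELECT sql FROM sqlite_master WHERE type = 'table' AND name = ?",
        (table,),
    ).fetchone()[0]
    sandbox.execute(ddl)
    placeholders = ", ".join("?" for _ in columns)
    sandbox.executemany(f"INSERT INTO {table} VALUES ({placeholders})", sample)
    sandbox.commit()
    return sandbox


if __name__ == "__main__":
    sandbox = build_sandbox(make_sales_source(), "sales", fraction=0.1)
    print(sandbox.execute("SELECT COUNT(*) FROM sales").fetchone()[0])
```

The source is only read, never modified, which is the point of the sandbox. Sampling one table at a time ignores relationships between tables; a sample that must preserve foreign keys would need to sample the parent table first and then pull the matching child rows.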

Decision Issues in ETL System Design
- Source-to-target mapping
- Load frequency
- How much history is needed
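One way to make these three decisions explicit is to record them per table in a small declarative plan. The sketch below is an assumption about how that could look, not a format used in the course: `ColumnMapping`, `TableLoadPlan`, and every table and column name are hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Callable, Tuple


def identity(value):
    """Default transform: copy the source value unchanged."""
    return value


@dataclass(frozen=True)
class ColumnMapping:
    source_column: str
    target_column: str
    transform: Callable = identity


@dataclass(frozen=True)
class TableLoadPlan:
    source_table: str
    target_table: str
    load_every_days: int   # decision: load frequency
    history_days: int      # decision: how much history is needed
    date_column: str       # source column used to apply the history window
    columns: Tuple[ColumnMapping, ...]  # decision: source-to-target mapping

    def is_due(self, last_loaded, today):
        """True when the table has never been loaded or the interval has passed."""
        return last_loaded is None or (today - last_loaded).days >= self.load_every_days

    def apply(self, source_rows, today):
        """Map source rows to target rows, dropping rows older than the history window."""
        cutoff = today - timedelta(days=self.history_days)
        return [
            {m.target_column: m.transform(row[m.source_column]) for m in self.columns}
            for row in source_rows
            if row[self.date_column] >= cutoff
        ]


# Hypothetical plan: daily load, one year of history, amounts stored in cents at the source.
plan = TableLoadPlan(
    source_table="orders",
    target_table="fact_sales",
    load_every_days=1,
    history_days=365,
    date_column="order_dt",
    columns=(
        ColumnMapping("order_dt", "order_date"),
        ColumnMapping("cust", "customer_key"),
        ColumnMapping("amt", "sales_amount", lambda cents: cents / 100),
    ),
)

if __name__ == "__main__":
    rows = [
        {"order_dt": date(2017, 4, 1), "cust": 7, "amt": 1250},
        {"order_dt": date(2015, 1, 1), "cust": 8, "amt": 900},  # outside the window
    ]
    print(plan.apply(rows, today=date(2017, 4, 20)))
```

Keeping the plan as data rather than burying it in ETL code makes the decisions easy to review with source system experts before any package is built.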

Strategies for Extracting Data
- Extracting data from packaged source systems (self-contained data sources)
  - Their APIs may not be a good choice
  - Their add-on analytic systems may not be a good choice
- Extracting directly from the source databases
  - Strategies vary depending on the nature of the source database
- Extracting data for incremental loads
  - Depends on how the source database records changes to rows
- Extracting historical data
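A common way to handle incremental loads, when the source records changes with a last-modified timestamp, is to keep a "watermark" of the latest change already extracted. The sketch below assumes such a column exists; the `customer` table, its `updated_at` column, and the `extract_changes` helper are hypothetical, and SQLite stands in for the real source database. Sources that record changes differently (audit tables, change data capture, or no change tracking at all) need a different strategy.

```python
import sqlite3


def make_customer_source():
    """Build a hypothetical source table whose rows carry a last-modified timestamp."""
    source = sqlite3.connect(":memory:")
    source.row_factory = sqlite3.Row  # allow access to columns by name
    source.execute(
        "CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT, updated_at TEXT)"
    )
    source.executemany(
        "INSERT INTO customer VALUES (?, ?, ?)",
        [
            (1, "Ann", "2017-04-01T09:00:00"),
            (2, "Bob", "2017-04-02T10:30:00"),
        ],
    )
    source.commit()
    return source


def extract_changes(source, table, watermark):
    """Return rows changed after `watermark`, plus the watermark for the next run.

    Timestamps are ISO 8601 strings, so string comparison matches time order.
    The table name is interpolated into SQL and must come from trusted code.
    """
    rows = source.execute(
        f"SELECT * FROM {table} WHERE updated_at > ? ORDER BY updated_at",
        (watermark,),
    ).fetchall()
    new_watermark = rows[-1]["updated_at"] if rows else watermark
    return rows, new_watermark


if __name__ == "__main__":
    source = make_customer_source()

    # An empty watermark extracts everything: the initial historical load.
    rows, watermark = extract_changes(source, "customer", "")
    print(len(rows), watermark)

    # A later change in the source is picked up by the next incremental run.
    source.execute(
        "UPDATE customer SET name = 'Anne', updated_at = '2017-04-03T08:00:00' "
        "WHERE customer_id = 1"
    )
    rows, watermark = extract_changes(source, "customer", watermark)
    print([row["name"] for row in rows], watermark)
```

The same function covers both cases on the slide: an empty watermark performs the historical extract, and every later call performs an incremental one. Note that a timestamp watermark cannot see deleted rows, which is one reason the choice depends on how the source records changes.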