Data Warehousing in the age of Big Data (1)


Data Warehousing in the age of Big Data (1) 2017. 5

The 3 Vs of Big Data: volume, velocity, and variety

Additional Big Data characteristics
Ambiguity: a lack of metadata creates ambiguity in Big Data. For example, in a photograph or a graph, M and F can denote gender or can denote Monday and Friday.
Viscosity: measures the resistance (slowdown) to flow in the volume of data. Resistance can manifest in dataflows, in business rules, or as a limitation of technology. Social media monitoring is one example: many enterprises cannot see how it affects their business and resist using the data, in many cases until it is too late.
Virality: measures and describes how quickly data is shared in a people-to-people (peer) network. The rate of spread is measured over time. For example, tracking the retweets of an original tweet is a good way to follow a topic or a trend; the context of the tweet relative to the topic matters here.
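The slides do not give a formula for virality, so the following is only a rough sketch of one way "rate of spread measured in time" could be computed: shares per hour between the original post and the latest share. The function name `shares_per_hour` and the sample timestamps are assumptions made for illustration.

```python
from datetime import datetime

def shares_per_hour(original_time, share_times):
    """Average shares per hour from the original post to the latest share.

    Hypothetical metric for illustration; not a definition from the slides.
    """
    if not share_times:
        return 0.0
    elapsed_hours = (max(share_times) - original_time).total_seconds() / 3600
    if elapsed_hours <= 0:
        raise ValueError("shares must occur after the original post")
    return len(share_times) / elapsed_hours

# Made-up sample data: one original tweet and three retweets.
original = datetime(2017, 5, 1, 9, 0)
retweets = [
    datetime(2017, 5, 1, 9, 30),
    datetime(2017, 5, 1, 10, 0),
    datetime(2017, 5, 1, 11, 0),
]
rate = shares_per_hour(original, retweets)
print(rate)  # 3 shares over 2 hours -> 1.5
```

A real measure would likely also weight the context of each share, since the slide notes that the relation of the tweet to the topic matters.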

Hive
An open-source data warehousing solution built on top of Hadoop.
Architecture
Metastore: stores the system catalog and metadata about tables, columns, and partitions.
Driver: maintains session details, process handles, and statistics, and manages the life cycle of a HiveQL statement as it moves through Hive.
Query compiler: compiles HiveQL into Map and Reduce tasks.
Execution engine: processes and executes the tasks produced by the compiler in dependency order. The execution engine manages all the interactions between the compiler and Hadoop.
Thrift server: provides a Thrift interface, a JDBC/ODBC server, and a rich API to integrate Hive with other applications.
CLI and web UI: two client interfaces. The command-line interface (CLI) allows command-line execution, and the web user interface (web UI) is a management console.
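To make "executes the tasks in dependency order" concrete, here is a minimal sketch using Python's standard `graphlib` module. It is not Hive code: the task names and the `run_in_dependency_order` helper are hypothetical, and a real execution engine would submit each task to Hadoop rather than append it to a list.

```python
from graphlib import TopologicalSorter

# Hypothetical task graph: each task maps to the set of tasks it depends on.
# The names are illustrative only, not real Hive plan stages.
task_dependencies = {
    "map_scan_orders": set(),
    "map_scan_customers": set(),
    "reduce_join": {"map_scan_orders", "map_scan_customers"},
    "reduce_aggregate": {"reduce_join"},
}

def run_in_dependency_order(dependencies):
    """Return tasks in an order where each task follows all of its prerequisites."""
    executed = []
    for task in TopologicalSorter(dependencies).static_order():
        # Placeholder for real work: an engine would launch the job here.
        executed.append(task)
    return executed

order = run_in_dependency_order(task_dependencies)
print(order)
```

The two scan tasks have no prerequisites, so either may come first; the join always follows both, and the aggregation always comes last.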

Data Abstractions in Hive

Query Processing in Hive

DW Architectures: Bottom-up approach
Pros
Faster implementation of multiple manageable modules
Simple design at the datamart level
Less risk of failure
Incremental approach to building the most important or complex datamarts first
Can be deployed on a smaller infrastructure footprint
Cons
A datamart cannot see outside of its subject area of focus.
Redundant data architecture can become expensive.
Needs all requirements to be completed before the start of the project.
Difficult to manage operational workflows for complex business intelligence.

DW Architectures: Top-down approach
Pros
Provides an enterprise view of the data
Centralized architecture
Central rules and control
Refresh of data happens at one location
Extremely high performance
Can be built in multiple steps
Cons
High risk of failure
Data quality issues can stall the processing of data into the data warehouse
Expensive to maintain
Needs more scalable infrastructure

Data Warehouse 2.0
4 data layers
Huge amounts of data
Complex data types and diverse data formats (e.g., text, images, video, and sensor data)
The next-generation data warehouse is an integrated architecture of Big Data and traditional data in one heterogeneous platform.
The next-generation DW should focus on usability and scalability from the user's perspective.

DSS 2.0 architecture
Integration platform
BI + content management
BI + process management

DSS 2.0 matrix
Business Intelligence is divided into three areas.

Big Data Platform (including DW)

Big Data & DW