Big Data and The Data Warehouse. Everything is either ETL or storage, right?

Recall: Informational Needs of the Organization (from an earlier lecture)

Organizational Decision Making: What happened? (operational decision making); Why did it happen? (tactical decision making); What will happen? (strategic decision making); What should I do? (automated decision making).

[Chart: data volume growth over the years]

The Rise of Enterprise Unstructured Data: Most of the data required for informed decision-making is unstructured data. (* IDG)

[Timeline: growth of data volumes and the problem of big data]

Scaling Services: How do you address growth? Vertical ("scale up"): add more resources to an existing system running the service. Easier, but limited scale; single point of failure. Horizontal ("scale out"): run the service over multiple systems, and orchestrate communication between them. Harder, but massive scale; overhead to manage nodes.

Distributed Data: When the data volume is too large for a single system and you can no longer scale up… you scale out.

CAP Theorem of Distributed Systems: You can only have two of the following three guarantees. Consistency: all nodes see the same data at the same time. Availability: assurance that every request can be processed. Partition Tolerance: network failures are tolerated and the system continues to operate. Where common systems fall: CA – RDBMSs (MSSQL, Oracle, MySQL); CP – single-master systems (HBase, MongoDB, Accumulo, HDFS); AP – eventual-consistency systems (Dynamo, Cassandra, CouchDB).

Why Can’t You Have All Three? * A counterexample: suppose we lose communication between two nodes. We must either ignore any updates the nodes receive, sacrificing Consistency, or deny service until the connection becomes available again. If we guarantee Availability of requests despite the failure, we gain Partition Tolerance (the system still works) but lose Consistency (the nodes will get out of sync). If we guarantee Consistency of data despite the failure, we gain Partition Tolerance (again, the system works) but lose Availability (data on the nodes cannot be changed until the failure is resolved). * You can have all three, just not at the same time.
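To make the trade-off concrete, here is a small, hedged Python sketch (not from the slides; all names and behavior are invented for illustration) of a toy two-node store that must choose between consistency and availability while a partition is in progress:

```python
class ToyReplica:
    """One node in a toy two-node key-value store."""
    def __init__(self, name):
        self.name = name
        self.data = {}

def write(replicas, key, value, partitioned, prefer="consistency"):
    """Attempt a write while the link between replicas may be down."""
    if not partitioned:
        for r in replicas:              # normal case: update every replica
            r.data[key] = value
        return "ok: all replicas updated (consistent and available)"
    if prefer == "consistency":         # CP choice: refuse the write
        return "rejected: partition in progress, staying consistent"
    replicas[0].data[key] = value       # AP choice: accept locally, replicas diverge
    return "ok: accepted on one replica only (available, but now inconsistent)"

nodes = [ToyReplica("node1"), ToyReplica("node2")]
print(write(nodes, "cart", ["chips"], partitioned=False))
print(write(nodes, "cart", ["chips", "salsa"], partitioned=True, prefer="consistency"))
print(write(nodes, "cart", ["chips", "salsa"], partitioned=True, prefer="availability"))
```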

CAP: All Kinds of Database Systems. RDBMSs like Oracle, MySQL and SQL Server focus on Consistency and Availability (ACID principles), sacrificing Partition Tolerance (and thus they don’t scale well horizontally). Use cases: business data, when you don’t need to scale out. Single-master systems like MongoDB, HBase, Redis, and HDFS provide Consistency at scale, but data availability runs through a single node. Use cases: read-heavy workloads; caching, document storage, product catalogs. Eventual-consistency systems like CouchDB, Cassandra and Dynamo provide Availability at scale but do not guarantee consistency. Use cases: write-heavy, isolated activities: shopping carts, orders, tweets.

So What is Big Data? It’s more than just large data

The Three V’s of Big Data: Volume (quantity of data), Velocity (rate of change of data), Variety (kinds of data).

Other V’s of Big Data: Veracity – the uncertainty of your data. How can we be confident in the trustworthiness of our data sources? Example: matching a tweet to a customer without knowing their Twitter handle. Viability – can we predict results from the data? Can we determine which features serve as predictors? Example: discovering patterns among customer purchase habits and unfavorable weather conditions. Value – what meaning can we derive from our data? Can we use it to make good business decisions? Example: increase inventory levels of potato chips two weeks before the Super Bowl.

Examples of Big Data Applications: Clickstream – analyze website traffic to determine how to invest in site improvements. Sensor data – collect data from environmental sensors to identify foot-traffic patterns in a retail store. Geographic data – analyze online orders to establish consistency between where products are shipped versus where they are ordered. Server logs – identify potential intrusions and misconfigured firewalls. Sentiment – get a sense of the brand through social media. Unstructured – detect potential insider trading through email and phone conversations.

What is Hadoop?

Hadoop is a Suite of Technologies… Distributed Over Computers on a Network.

Fundamentally, Hadoop Does 2 Things: Distributed Storage (HDFS) and Distributed Processing (YARN). Each computer is a Node; all the nodes make up a Cluster.

Hadoop Nodes: Masters and Workers
Master Node: manages the Hadoop infrastructure. Runs one of each of these services per cluster, on a single server or many: HDFS NameNode; YARN Resource Manager, App Timeline Server, and History Server (the MapReduce 2 service on YARN). Should run on server-class hardware.
Worker Nodes: store data and perform processing over it. Each node runs the same services: an HDFS DataNode and a YARN NodeManager. Run on commodity hardware.

HDFS: based on Google’s GFS. Distributed nodes, redundancy.

HDFS At Work
Client: (1) issues a command to write the file data.csv to HDFS, e.g. $ hadoop fs -put data.csv, storing it as /users/mafudge/data.csv.
NameNode: (2) splits the file into 64 MB blocks (the size can be changed); (3) writes each block to a separate DataNode; (4) replicates each block a number of times (the default is 3); (5) keeps track of which DataNodes contain each block of the file.
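To make the block arithmetic concrete, here is a small hedged Python sketch (not part of the slides; the 300 MB file size is made up) that computes how many 64 MB blocks a file splits into and how much raw storage 3x replication consumes:

```python
import math

BLOCK_SIZE_MB = 64   # HDFS block size from the slide (configurable)
REPLICATION = 3      # default replication factor from the slide

def hdfs_footprint(file_size_mb):
    """Return (number of blocks, raw storage in MB) for one file."""
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    raw_storage_mb = file_size_mb * REPLICATION
    return blocks, raw_storage_mb

# Hypothetical example: a 300 MB data.csv
blocks, raw = hdfs_footprint(300)
print(f"300 MB file -> {blocks} blocks, ~{raw} MB of raw cluster storage")
```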

HDFS Demo: Skip-Bo cards.

YARN: The Data Operating System. Hadoop 2.0 introduces YARN (Yet Another Resource Negotiator). It orchestrates processing over the nodes, uses HDFS for storage, and runs a variety of applications.

HDFS: “Schema on Read”. Traditional RDBMS: you cannot write data without a schema (table) in the DBMS; large up-front design costs (“schema on write”). Hadoop’s HDFS: you write the data “as-is”, and the schema is applied when the data is read from HDFS as part of a program; very little up-front design cost (“schema on read”).
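A rough schema-on-read illustration in Python (not from the slides; the column layout is borrowed from the MapReduce example later in the deck): the raw CSV text is stored as-is, and column names and types are applied only when it is read:

```python
import csv
import io

# Raw data stored "as-is": no schema was declared when it was written.
RAW = """JAN,NY,3
JAN,PA,1
FEB,NJ,1
"""

def read_orders(raw_text):
    """Apply the schema only at read time: name the columns and cast the types."""
    for month, state, qty in csv.reader(io.StringIO(raw_text)):
        yield {"month": month, "state": state, "qty": int(qty)}

for row in read_orders(RAW):
    print(row)
```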

MapReduce: A programming model for large-scale distributed data processing, with foundations in functional programming (LISP). Map: apply a transformation to a data set. Shuffle: transfer output from mapper nodes to reducer nodes. Reduce: aggregate items into a single result. Combine: merge the output of the reducer nodes into a single output. In Hadoop 2.0, MapReduce programs use HDFS and YARN.

MapReduce Example: Orders for Each Month
Source file (month, state, quantity), stored as HDFS blocks: JAN,NY,3; JAN,PA,1; JAN,NJ,2; JAN,CT,4; FEB,PA,1; FEB,NJ,1; FEB,NY,2; FEB,VT,1; MAR,NJ,2; MAR,NY,1; MAR,VT,2; MAR,PA,3.
Map: emit (month, quantity) for each record: JAN,3; JAN,1; JAN,2; JAN,4; FEB,1; FEB,1; FEB,2; FEB,1; MAR,2; MAR,1; MAR,2; MAR,3.
Shuffle: group the pairs by month and move each group to a reducer.
Reduce: sum the quantities per month. Result: JAN,10; FEB,5; MAR,8.
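A sketch of how this example might look as a Python mapper and reducer (the slides show no code; the Hadoop Streaming style and the function names here are assumptions, and the same logic is normally written in Java against the MapReduce API). The script runs locally on the slide's sample data:

```python
from itertools import groupby

def mapper(lines):
    """Map: for each 'month,state,qty' record, emit (month, qty)."""
    for line in lines:
        month, _state, qty = line.strip().split(",")
        yield month, int(qty)

def reducer(pairs):
    """Reduce: sum the quantities for each month (input grouped by key)."""
    for month, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield month, sum(qty for _, qty in group)

if __name__ == "__main__":
    # Local stand-in for the HDFS blocks; on a cluster these records would be
    # split across nodes and streamed through many mapper processes.
    records = ["JAN,NY,3", "JAN,PA,1", "JAN,NJ,2", "JAN,CT,4",
               "FEB,PA,1", "FEB,NJ,1", "FEB,NY,2", "FEB,VT,1",
               "MAR,NJ,2", "MAR,NY,1", "MAR,VT,2", "MAR,PA,3"]
    for month, total in reducer(mapper(records)):
        print(month, total)   # FEB 5, JAN 10, MAR 8
```

Grouping by state instead (the next example) only changes the key the mapper emits.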

MapReduce Example: Total Orders by State
Same source file; the mapper now emits (state, quantity) for each record: NY,3; PA,1; NJ,2; CT,4; PA,1; NJ,1; NY,2; VT,1; NJ,2; NY,1; VT,2; PA,3.
Shuffle: group the pairs by state.
Reduce: sum the quantities per state. Result: CT,4; NJ,5; NY,6; PA,5; VT,3.

MapReduce Demo: a Map-Reduce example with Skip-Bo cards.

Google Data Centers – Commodity Hardware

Example: web log files.

Hadoop Tools: MapReduce is great, but there is a need for higher-level scripting, and there are other needs beyond the batch capabilities of M-R.

Pig: a platform for analyzing large data sets, performing ETL, data cleanup, etc. Write simpler MapReduce code in Pig Latin instead of Java. Steps: LOAD, TRANSFORM, STORE/DUMP.

Hive: SQL-like syntax over HDFS. Declarative, rather than procedural like Pig. Useful for ad-hoc queries of HDFS data.

Spark Blah
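The slide above is only a placeholder. Purely as an illustration (not from the original deck, and assuming a local pyspark install plus a hypothetical orders.csv with month,state,qty rows), a minimal PySpark job for the same orders-per-month aggregation might look like this:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("orders-by-month").getOrCreate()

orders = (spark.read.csv("orders.csv")        # hypothetical input file
          .toDF("month", "state", "qty")      # apply the schema on read
          .withColumn("qty", F.col("qty").cast("int")))

# Equivalent of the MapReduce example: total order quantity per month.
orders.groupBy("month").agg(F.sum("qty").alias("total_orders")).show()

spark.stop()
```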

Integrate with DW (Kimball)