Ch 4. The Evolution of Analytic Scalability

Presentation transcript:

Ch 4. The Evolution of Analytic Scalability
Taming The Big Data Tidal Wave
24 May 2012
SNU IDB Lab.
Hyewon Kim

Outline
- Introduction
- The Convergence of the Analytic and Data Environment
- Massively Parallel Processing System (MPP)
- Cloud Computing
- Grid Computing
- MapReduce
- Conclusion

Introduction
- The amount of data organizations process continues to increase
- The old methods for handling data won’t work anymore
- Important technologies that make taming the big data tidal wave possible:
  - MPP
  - The cloud
  - Grid computing
  - MapReduce

The Convergence of the Analytic and Data Environment (1/2)
Traditional Analytic Architecture
- We had to pull all data together into a separate analytics environment to do analysis
- The heavy processing occurs in the analytic environment
[Diagram: Database 1, Database 2, Database 3, and Database 4 feed an analytic server or PC]

The Convergence of the Analytic and Data Environment (2/2)
Modern In-Database Architecture
- The processing stays in the database where the data has been consolidated
- The user’s machine just submits the request
[Diagram: Database 1, Database 2, Database 3, and Database 4 are consolidated into an Enterprise Data Warehouse; the analytic server or PC submits requests to it]
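The contrast between the two architectures can be sketched with Python's built-in sqlite3 module. This is only an illustration, not part of the original slides: the in-memory database stands in for the enterprise data warehouse, and the `sales` table, its columns, and both function names are hypothetical.

```python
import sqlite3

# Hypothetical table standing in for consolidated warehouse data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 10.0), ("east", 5.0), ("west", 7.5), ("west", 2.5), ("north", 4.0)],
)

def totals_traditional(conn):
    """Traditional style: pull every row to the client, then aggregate there."""
    totals = {}
    for region, amount in conn.execute("SELECT region, amount FROM sales"):
        totals[region] = totals.get(region, 0.0) + amount
    return totals

def totals_in_database(conn):
    """In-database style: submit the request; only the summary rows come back."""
    rows = conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")
    return dict(rows)

traditional = totals_traditional(conn)
in_database = totals_in_database(conn)
```

Both functions return the same totals; the difference is where the heavy processing happens and how many rows have to leave the database.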

Massively Parallel Processing (1/3)
What is an MPP Database?
- An MPP database breaks the data into independent chunks, each with its own disk and CPU
- Shared nothing!
- A traditional database will query a one-terabyte table one row at a time (single overloaded server)
- An MPP database runs 10 simultaneous 100-gigabyte queries (multiple lightly loaded servers)
[Diagram: a one-terabyte table split into ten 100-gigabyte chunks]
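The chunking arithmetic and the shared-nothing idea can be sketched as follows. This is a toy simulation under stated assumptions: 1 TB is taken as 1000 GB, rows are assigned to nodes by key modulo the node count (a hypothetical scheme, not one the slides specify), and the "nodes" are plain Python lists scanned one after another.

```python
TABLE_GB = 1000   # one-terabyte table, taking 1 TB = 1000 GB
NODES = 10        # ten shared-nothing nodes, as in the slide

chunk_gb = TABLE_GB // NODES   # each node owns one 100 GB chunk

def partition(rows, nodes):
    """Assign each row to exactly one node by key modulo node count (assumed scheme)."""
    chunks = [[] for _ in range(nodes)]
    for key, value in rows:
        chunks[key % nodes].append((key, value))
    return chunks

def node_scan(chunk, threshold):
    """Work one node does on its own chunk: count rows above a threshold."""
    return sum(1 for _, value in chunk if value > threshold)

rows = [(k, k * 3 % 17) for k in range(1000)]        # toy stand-in for the table
chunks = partition(rows, NODES)
partial_counts = [node_scan(c, 8) for c in chunks]   # one small query per node
total = sum(partial_counts)                          # combine the partial answers
```

Because no row lives on more than one node, the per-node answers can simply be added together to give the same result as scanning the whole table.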

Massively Parallel Processing (2/3)
Concurrent Processing
- An MPP system breaks the job into pieces
- An MPP system allows the different sets of CPU and disk to run the process concurrently
[Diagram: single-threaded process vs. parallel process]
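A minimal sketch of breaking one job into pieces and running them concurrently, using Python's standard `concurrent.futures` module. The job (a sum of squares), the function names, and the worker count are assumptions chosen for illustration. Threads in CPython show the structure of the approach rather than a real speedup for CPU-bound work; an MPP system runs each piece on separate hardware.

```python
from concurrent.futures import ThreadPoolExecutor

def piece_sum(piece):
    """The work done on one piece of the job."""
    return sum(x * x for x in piece)

def single_threaded(data):
    """One process handles the whole job."""
    return piece_sum(data)

def parallel(data, workers=4):
    """Break the job into pieces, run them concurrently, combine the results."""
    size = max(1, (len(data) + workers - 1) // workers)
    pieces = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(piece_sum, pieces))

data = list(range(10_000))
serial_result = single_threaded(data)
parallel_result = parallel(data)
```

The two results are identical; only the way the work is scheduled differs.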

Massively Parallel Processing (3/3)
Others
- MPP systems build in redundancy to make recovery easy
- MPP systems have resource management tools
  - Manage the CPU and disk space
  - Query optimizer

Cloud Computing (1/2)
What is Cloud Computing?
- McKinsey and Company paper from 2009¹
  - Mask the underlying infrastructure from the user
  - Be elastic to scale on demand
  - On a pay-per-use basis
- National Institute of Standards and Technology (NIST)
  - On-demand self-service
  - Broad network access
  - Resource pooling
  - Rapid elasticity
  - Measured service
[1] McKinsey and Company, “Clearing the Air on Cloud Computing,” March 2009.

Cloud Computing (2/2)
Two Types of Cloud Environment
- Public Cloud
  - The services and infrastructure are provided off-site over the internet
  - Greatest level of efficiency in shared resources
  - Less secure and more vulnerable than private clouds
- Private Cloud
  - Infrastructure operated solely for a single organization
  - Offers the same features as a public cloud
  - Offers the greatest level of security and control
  - Necessary to purchase and own the entire cloud infrastructure

Grid Computing
- The federation of computer resources to reach a common goal
- E.g., SETI@Home (Search for Extraterrestrial Intelligence)
  - An Internet-based public volunteer computing project

MapReduce (1/3)
What is MapReduce?
- A parallel programming framework¹
- Map function
  - Processes a key/value pair to generate a set of intermediate key/value pairs
- Reduce function
  - Merges all intermediate values associated with the same intermediate key
- Library
  - Parallelization
  - Fault-tolerance
  - Data distribution
  - Load balancing
  - ……
[1] MapReduce: Simplified Data Processing on Large Clusters, OSDI 2004
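The two functions can be sketched for word counting, the example commonly used to introduce MapReduce. This is a plain-Python illustration of the function signatures only, not the API of any particular framework; the names `map_fn` and `reduce_fn` are assumptions.

```python
def map_fn(key, value):
    """Map: takes one key/value pair (document name, document text)
    and emits a list of intermediate (word, 1) pairs."""
    return [(word, 1) for word in value.lower().split()]

def reduce_fn(key, values):
    """Reduce: takes one intermediate key (a word) and every value
    emitted for it, and merges them into a single count."""
    return (key, sum(values))

intermediate = map_fn("doc1", "big data Big wave")
merged = reduce_fn("big", [1, 1])
```

Everything else (parallelization, fault tolerance, data distribution, load balancing) is the library's job; the programmer supplies only these two functions.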

MapReduce (2/3)
How MapReduce Works
Let’s assume there are 20 terabytes of data and 20 MapReduce server nodes for a project
- Distribute a terabyte to each of the 20 nodes using a simple file copy process
- Submit two programs (Map, Reduce) to the scheduler
- The map program finds the data on disk and executes the logic it contains
- The results of the map step are then passed to the reduce process to summarize and aggregate the final answers
[Diagram: Scheduler → Map → Shuffle → Reduce → Results]
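The steps above can be simulated in a single process. This is a sketch under stated assumptions: the 20 "nodes" are Python lists, the records are hypothetical (customer, amount) pairs dealt out round-robin in place of the file copy, and `run_job` plays the role of the scheduler. A real cluster would run the map and reduce steps on separate machines.

```python
from collections import defaultdict

NODES = 20

def map_fn(record):
    """Map logic: emit (customer, amount) for one local record."""
    customer, amount = record
    return [(customer, amount)]

def reduce_fn(key, values):
    """Reduce logic: summarize all amounts seen for one customer."""
    return (key, sum(values))

def run_job(node_data, map_fn, reduce_fn):
    """Stand-in scheduler: run map on each node, shuffle, then reduce."""
    # Map: each node processes only the data stored on it.
    mapped = [[pair for record in local for pair in map_fn(record)]
              for local in node_data]
    # Shuffle: bring all values for the same key together.
    groups = defaultdict(list)
    for node_output in mapped:
        for key, value in node_output:
            groups[key].append(value)
    # Reduce: aggregate each key's values into the final answer.
    return dict(reduce_fn(k, v) for k, v in groups.items())

# Hypothetical input, distributed evenly across the 20 nodes.
records = [("c%d" % (i % 5), i) for i in range(100)]
node_data = [records[i::NODES] for i in range(NODES)]
results = run_job(node_data, map_fn, reduce_fn)
```

The shuffle step is the only point where data crosses node boundaries, which is why the framework handles it rather than the two user-supplied programs.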

MapReduce (3/3)
Strengths and Weaknesses
- Good for
  - Lots of input, intermediate, and output data
  - Batch-oriented datasets (ETL: Extract, Transform, Load)
  - Cheap to get up and running because it runs on commodity hardware
- Bad for
  - Fast response time
  - Large amounts of shared data
  - CPU-intensive operations (as opposed to data-intensive)
- NOT a database!
  - No built-in security
  - No indexing, no query or process optimizer
  - No knowledge of other data that exists

Conclusion
- These technologies can integrate and work together
  - Databases running in the cloud
  - Databases including MapReduce functionality
  - MapReduce can be run against data sourced from a database
  - MapReduce can also run against data in the cloud
[Figures: Cloud Database; SQL-MapReduce; In-Database MapReduce¹; Running MapReduce in Database; Running MapReduce in Cloud²]
[1] https://blogs.oracle.com/datawarehousing/entry/in-database_map-reduce
[2] http://code.google.com/p/cloudmapreduce/ Cloud MapReduce: a MapReduce implementation on top of a cloud operating system, CCGRID 2011, IEEE Computer Society