Paula Ta-Shma, IBM Haifa Research 1 “Advanced Topics on Storage Systems” - Spring 2013, Tel-Aviv University Big Data and.


Slide 1: Big Data and Map Reduce
Paula Ta-Shma, IBM Haifa Research, Storage Systems
1/5/2013

Slide 2: Outline
• Historical Context behind Map Reduce
• What is Big Data?
• The Map Reduce Framework
• Connections with Storage Cloud

Slide 3: Historical Context
• Relational Database Management Systems (RDBMS)
  - Researched in the 70s, products in the 80s and beyond
  - Relational (tabular) data model
  - Query Language: SQL
    - Efficient Query Processing: Indexing, Query Evaluation Strategies
  - Transactions, Consistency
  - Concurrency Control
  - Security and Authorization
  - Can be implemented on top of file systems
    - Provide a higher level of abstraction and functionality than file systems
• Example Use Cases
  - Banking, Stock Trading, Personnel Management, Inventory Management, Manufacturing Data, etc.
  - The list is very long
Example query over the Accounts table (columns: Name, Balance ($); rows for Bob, Alice, Fred, Alice):
  SELECT Name FROM Accounts GROUP BY Name HAVING SUM(Balance) < 0
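The query on this slide can be tried out with a minimal sketch using Python's built-in sqlite3 module. The balance values below are hypothetical, since the transcript keeps only the names in the Accounts table, and the ORDER BY clause is an addition that makes the output order deterministic.

```python
import sqlite3


def overdrawn_customers(rows):
    """Return the names whose balances, summed over all their accounts, are negative."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Accounts (Name TEXT, Balance REAL)")
    conn.executemany("INSERT INTO Accounts VALUES (?, ?)", rows)
    # Same query as on the slide; ORDER BY is added only for stable output.
    cursor = conn.execute(
        "SELECT Name FROM Accounts "
        "GROUP BY Name HAVING SUM(Balance) < 0 "
        "ORDER BY Name"
    )
    names = [name for (name,) in cursor]
    conn.close()
    return names


# Hypothetical balances: the slide's actual values are not in the transcript.
SAMPLE_ROWS = [
    ("Bob", 100.0),
    ("Alice", -300.0),
    ("Fred", 50.0),
    ("Alice", 200.0),
]

if __name__ == "__main__":
    print(overdrawn_customers(SAMPLE_ROWS))
```

With these made-up rows, Alice's two accounts sum to a negative balance, so she is the only name returned. GROUP BY collapses both of her rows into one group before HAVING filters on the sum.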

Slide 4: Historical Context Cont.
• Business Intelligence
  - Extract value from large amounts of data
  - Banking use case example
    - Identify and actively retain and pursue profitable customers
    - Analyze the performance of sales personnel, tellers and account managers
    - etc.
  - Massive query processing to analyze data across multiple dimensions
    - Requires read access to large amounts of data
    - Typically long running queries, can interfere with transactions
  - Work on a snapshot of data
    - Deployed as physically separate Data Warehousing systems
    - Mission critical
    - Data warehousing products in early 90s

Slide 5: New Requirements in Internet Era
• Massive amounts of data
• Unstructured (e.g. text) and semi-structured data (e.g. XML)
• Analysis capabilities beyond what is possible in SQL
• LOW COST $$$
  - Hardware, Capital Expenses: Use commodity hardware, scale out instead of scale up.
  - Hardware, Operational Expenses: Make it easy to manage hardware which will fail often. Treat failure case as the norm, automatic failover.
  - Software, Capital Expenses: DBMS software is complex and expensive; transactions, concurrency control etc. are not needed for many tasks.
  - Software, Operational Expenses: Make it easy to write 'queries' on a distributed infrastructure.

Slide 6: Map Reduce
• Invented by Google
  - Inspired by functional programming languages' map and reduce functions
  - Seminal paper: Dean, Jeffrey & Ghemawat, Sanjay (OSDI 2004), "MapReduce: Simplified Data Processing on Large Clusters"
• Used at Google to completely regenerate Google's index of the World Wide Web.
  - It replaced the old ad hoc programs that updated the index and ran the various analyses.
• Uses:
  - distributed pattern-based searching, distributed sorting, web link-graph reversal, term-vector per host, web access log stats, inverted index construction, document clustering, machine learning, statistical machine translation
• Hadoop:
  - Open source implementation which matches Google's specifications
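To make the map and reduce idea concrete, here is a single-process sketch in Python. It is not Hadoop's or Google's API. The function names (map_phase, shuffle, reduce_phase, run_job) and the two sample documents are hypothetical, and a real framework would distribute each phase across many machines. The sketch shows two of the uses listed above: a word count and inverted index construction.

```python
from collections import defaultdict
from functools import reduce


def map_phase(documents, map_fn):
    """Apply map_fn to every (doc_id, text) record and yield its (key, value) pairs."""
    for doc_id, text in documents.items():
        yield from map_fn(doc_id, text)


def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reduce_fn):
    """Apply reduce_fn once per key to the list of values collected for that key."""
    return {key: reduce_fn(key, values) for key, values in groups.items()}


def run_job(documents, map_fn, reduce_fn):
    """Run map, shuffle and reduce in sequence inside one process."""
    return reduce_phase(shuffle(map_phase(documents, map_fn)), reduce_fn)


def word_count_map(doc_id, text):
    # Emit (word, 1) for every word occurrence.
    for word in text.lower().split():
        yield word, 1


def word_count_reduce(word, counts):
    # Sum the ones emitted for this word.
    return reduce(lambda a, b: a + b, counts)


def inverted_index_map(doc_id, text):
    # Emit (word, doc_id) once per distinct word in the document.
    for word in set(text.lower().split()):
        yield word, doc_id


def inverted_index_reduce(word, doc_ids):
    # The posting list for a word is the sorted list of documents containing it.
    return sorted(doc_ids)


# Hypothetical input: two tiny "documents" keyed by id.
DOCS = {
    "d1": "big data and map reduce",
    "d2": "map reduce on big clusters",
}

if __name__ == "__main__":
    print(run_job(DOCS, word_count_map, word_count_reduce))
    print(run_job(DOCS, inverted_index_map, inverted_index_reduce))
```

Only the two small user-supplied functions differ between the jobs. The grouping step in between stays the same, which is the part a framework such as Hadoop takes over.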

Slides 7-8: [Figures] Source: IBM InfoSphere BigInsights slides, by Bruce Brown

Slide 9: Map Reduce In Detail
• Map Reduce material taken from Distributed Systems Course, MapReduce lecture by Paul Krzyzanowski

Slide 10: HDFS Architecture
[Figure: HDFS architecture diagram] Source:

Slide 11: Integrating Hadoop with Object Storage
• Implement Hadoop FileSystem API
• Leave MapReduce framework unchanged
  - => no changes needed for user applications
  - => work with Hadoop based technologies
    - Hive, Pig Latin, HBase, Jaql, and others
[Diagram, top to bottom]
  Application (HBase, Jaql, …)
  Hadoop Map Reduce
    invokes
  Hadoop FileSystem API (create, open, close, read, write, seek, get block locations…)
    implemented by
  Hadoop Distributed File System (HDFS) | S3FileSystem | CDMI FileSystem
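The layering on this slide can be illustrated with a small Python sketch. It is not Hadoop's actual interface, which is a Java abstract class with many more operations. The class names LocalFileSystem and InMemoryObjectStore and the function count_lines are hypothetical stand-ins for a file-system backend, an object-storage backend, and an unchanged user job.

```python
from abc import ABC, abstractmethod
import os
import tempfile


class FileSystem(ABC):
    """Hypothetical, heavily reduced storage interface (write and read only)."""

    @abstractmethod
    def write(self, path: str, data: bytes) -> None:
        ...

    @abstractmethod
    def read(self, path: str) -> bytes:
        ...


class LocalFileSystem(FileSystem):
    """Backend that stores files under a root directory on local disk."""

    def __init__(self, root: str):
        self.root = root

    def _full(self, path: str) -> str:
        return os.path.join(self.root, path.lstrip("/"))

    def write(self, path: str, data: bytes) -> None:
        full = self._full(path)
        os.makedirs(os.path.dirname(full), exist_ok=True)
        with open(full, "wb") as handle:
            handle.write(data)

    def read(self, path: str) -> bytes:
        with open(self._full(path), "rb") as handle:
            return handle.read()


class InMemoryObjectStore(FileSystem):
    """Backend that mimics an object store: a flat mapping from key to bytes."""

    def __init__(self):
        self.objects = {}

    def write(self, path: str, data: bytes) -> None:
        self.objects[path] = bytes(data)

    def read(self, path: str) -> bytes:
        return self.objects[path]


def count_lines(fs: FileSystem, path: str) -> int:
    """A 'job' written only against the interface, so any backend can be plugged in."""
    return len(fs.read(path).splitlines())


if __name__ == "__main__":
    payload = b"x\ny\nz\n"

    store = InMemoryObjectStore()
    store.write("/in/a.txt", payload)
    print(count_lines(store, "/in/a.txt"))

    with tempfile.TemporaryDirectory() as root:
        local = LocalFileSystem(root)
        local.write("/in/a.txt", payload)
        print(count_lines(local, "/in/a.txt"))
```

Because count_lines depends only on the abstract interface, swapping the backend needs no change to the job. This is the property the slide relies on when it leaves the MapReduce framework and user applications unchanged.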

Slide 12: Amazon Elastic Map Reduce
[Figure] Source:

Slide 13: The End