Big Data I / Hadoop explained. Presented to ITS at the UoA on December 6th 2012.


The gospel according to Dilbert

What is the scale of Big Data?
There is no single agreed definition of Big Data, but examples of its scale include:
- 12 terabytes of daily tweets
- 2.8 petabytes of untapped power-utilities data
- 350 billion annual meter readings
- 5 million daily stock-market trades
- 500 million daily call-centre records
- 100s of live video feeds
- millions of daily click-stream records from web logs

65,313,993 rows of data doth not Big Data make
FACT_EFTS_SNAPSHOT, the largest table in the DSS data warehouse: 11.6GB, 65,313,993 rows
Lecture theatre recording data: 1TB

So how do we define Big Data?
Volume, velocity, variety, veracity. 1TB of data can be handled by traditional enterprise relational databases. A working definition of Big Data: data that makes the use of tools like Hadoop necessary. By that definition the UoA does not deal in Big Data, nor do most organisations in New Zealand; so why do you think consultants are pushing it?

The Big Data problem
Enterprise-scale relational databases adequately handle large amounts of data, but businesses need to analyse huge amounts of data in the search for competitive advantage. SQL joins in row-based relational databases cannot handle Big Data; Big Data changes everything. Google's solution to the Big Data problem is a disruptive technology for Big Data, but not for merely large amounts of data.

Google File System (GFS)
GFS was created to address the storage scalability problem. GFS is a distributed file system housed on clusters of cheap commodity servers and disks. Commodity servers and disks fail often, so huge data files are chunked and replicated across the file system to minimise the impact of failures.
how-the-giants-of-the-web-store-big-data/
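The chunk-and-replicate idea above can be sketched in a few lines. This is an illustrative single-machine model only, not the real GFS protocol; the chunk size, replication factor and server names are invented for the example (GFS actually used 64MB chunks).

```python
# Minimal sketch of GFS-style chunking and replication.
# CHUNK_SIZE and the server names are illustrative assumptions.

CHUNK_SIZE = 4          # bytes per chunk (tiny here; GFS used 64MB)
REPLICATION_FACTOR = 3  # each chunk is stored on this many servers

def chunk(data: bytes, size: int = CHUNK_SIZE) -> list:
    """Split a file's bytes into fixed-size chunks."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def place_replicas(chunks, servers, factor=REPLICATION_FACTOR):
    """Assign each chunk to `factor` distinct servers, round-robin."""
    placement = {}
    for idx in range(len(chunks)):
        placement[idx] = [servers[(idx + r) % len(servers)] for r in range(factor)]
    return placement

data = b"huge data file"
chunks = chunk(data)
placement = place_replicas(chunks, ["srv1", "srv2", "srv3", "srv4"])
# Losing any one server still leaves two replicas of every chunk.
```

The point of the sketch: no single disk holds the whole file, and every chunk survives the loss of a server, which is what makes cheap, failure-prone hardware usable.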

Google Bigtable
Bigtable is Google's distributed storage system for managing data, and sits on top of the Google File System. It is designed to scale to a very large size: petabytes of data across thousands of commodity servers/disks. Near-linear scalability is achieved by performing computations on the distributed servers/disks that manage and contain the data, rather than moving data to separate processing nodes. Many projects at Google store data in Bigtable, including web indexing and Google Earth.

Bigtable is column-based rather than row-based
Bigtable maps two arbitrary string values (row key and column key) plus a timestamp (hence a three-dimensional mapping) to an associated arbitrary byte array. Bigtable can be defined as a sparse (gaps between keys), distributed (across many machines/disks), multi-dimensional (maps within maps), sorted (by key rather than by value) map (each key with an associated value). Bigtable is therefore a columnar data store rather than a row-based relational database.
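That "sparse, sorted, multi-dimensional map" definition can be made concrete with a toy in-memory model: a dictionary keyed by (row key, column key, timestamp). This is a sketch of the data model only; the row and column names are illustrative, and real Bigtable distributes and sorts this map across many tablet servers.

```python
# Toy model of the Bigtable data model:
# (row key, column key, timestamp) -> byte array.
# Row/column names are invented for illustration.

table = {}  # the whole "table" is just one sparse map

def put(row: str, column: str, ts: int, value: bytes) -> None:
    table[(row, column, ts)] = value

def get(row: str, column: str) -> bytes:
    """Return the most recent value for a (row, column) pair."""
    versions = [(ts, v) for (r, c, ts), v in table.items()
                if r == row and c == column]
    return max(versions)[1]  # highest timestamp wins

put("1", "PERSONAL_DETAILS:NAME", 20110101, b"Ricky")
put("1", "PERSONAL_DETAILS:NAME", 20120101, b"Rick")  # a newer version
put("2", "PERSONAL_DETAILS:AGE", 20120331, b"20")
# Sparse: no AGE cell is ever stored for row "1" -- a missing cell
# costs nothing, unlike a NULL column in a row-based table.
```

The timestamp dimension gives versioning for free, and sparseness falls out naturally: absent cells simply have no key in the map.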

Row-based and columnar examples
Row-based example, e.g. an RDBMS table called PERSONAL_DETAILS:

  ID  NAME   AGE  INTERESTS
  1   Ricky       Soccer, Movies, Baseball
  2   Ankur  20
  3   Sam    25   Music

Columnar breakdown:

  ID  NAME      ID  AGE      ID  INTERESTS
  1   Ricky     2   20       1   Soccer
  2   Ankur     3   25       1   Movies
  3   Sam                    1   Baseball
                             3   Music

and-hbase.html

Conceptual columnar Bigtable equivalent
Primary index (ROWKEY:COLUMNKEY:TIMESTAMP), column family PERSONAL_DETAILS:

  1:PERSONAL_DETAILS:01/01/2011  NAME:Ricky  INTERESTS:Soccer  INTERESTS:Movies  INTERESTS:Baseball
  2:PERSONAL_DETAILS:31/03/2012  NAME:Ankur  AGE:20
  3:PERSONAL_DETAILS:20/10/2012  NAME:Sam    AGE:25  INTERESTS:Music

Google MapReduce
MapReduce processes massive distributed datasets by mapping data into key/value pairs and then reducing over all pairs that share the same key.
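The map/shuffle/reduce flow is easiest to see with word count, the canonical MapReduce example. The sketch below runs on a single machine purely to show the data flow; in a real cluster the map and reduce tasks run in parallel on the nodes that hold the data.

```python
# Single-machine sketch of the MapReduce data flow: word count.

from collections import defaultdict

def map_phase(document: str):
    """Map: emit a (word, 1) key/value pair for every word."""
    for word in document.split():
        yield (word, 1)

def shuffle(pairs):
    """Shuffle: group all values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine all values that share a key."""
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(shuffle(map_phase("big data big clusters big jobs")))
# counts == {"big": 3, "data": 1, "clusters": 1, "jobs": 1}
```

Because each (key, list-of-values) group is reduced independently, the reduce work can be spread across as many machines as there are distinct keys.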

Hadoop
Hadoop is Apache's free, open-source implementation of the Google File System, Google Bigtable, Google MapReduce and other software. Hadoop (written in Java) is buggy, needing strong (expensive) Java expertise to fix the code, although wrappers for underlying Hadoop function calls can be written in almost any language. Tools like HBase (an example of a NoSQL columnar data store) sit on top of HDFS (the Hadoop Distributed File System) and offer tables and a query language supporting MapReduce as well as DML such as Get/Put/Scan. Hadoop expertise is relatively scarce (expensive), especially when configuring 100s/1,000s of servers/disks, writing MapReduce jobs on a huge distributed infrastructure, and managing data in a new way.

Other utilities
Apache Pig and Apache Hive are platforms providing data summarisation, analysis and querying. Pig Latin is a procedural data-flow language for exploring large datasets; HiveQL is an SQL-like (but not SQL) language for the same purpose. Pig Latin and HiveQL commands compile into MapReduce jobs.
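To illustrate what "compile into MapReduce jobs" means, the sketch below hand-translates a HiveQL-style query, SELECT age, COUNT(*) FROM personal_details GROUP BY age, into explicit map, shuffle and reduce steps. The table, its rows and the exact translation are invented for illustration; Hive's real query planner is far more sophisticated.

```python
# Hand-translation of "SELECT age, COUNT(*) ... GROUP BY age" into
# map/shuffle/reduce steps. Table and rows are invented for the example.

from collections import defaultdict

personal_details = [
    {"id": 1, "name": "Ricky", "age": None},
    {"id": 2, "name": "Ankur", "age": 20},
    {"id": 3, "name": "Sam", "age": 25},
]

# Map: emit (age, 1) for every row with a non-NULL age.
mapped = [(row["age"], 1) for row in personal_details
          if row["age"] is not None]

# Shuffle: group by the GROUP BY column.
groups = defaultdict(list)
for age, one in mapped:
    groups[age].append(one)

# Reduce: COUNT(*) is a sum over each group.
result = {age: sum(ones) for age, ones in groups.items()}
# result == {20: 1, 25: 1}
```

The GROUP BY column becomes the MapReduce key and the aggregate function becomes the reducer, which is the general pattern Hive and Pig rely on.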