
Delip Rao

What is the typical size of data you deal with on a daily basis?

Google: processes 20 petabytes of raw data a day. That works out to about 231 GB per second! Numbers from 2008; they grow by the day.

Facebook: 200 GB per day (March 2008); 2+ TB (compressed) per day in April 2009; 4+ TB (compressed) per day in Oct 2009 – about 15 TB (uncompressed) a day (2009).

And Many More …
eBay: 50 TB/day
NYSE: 1 TB/day
CERN LHC: 15 PB/year

Storage and Analysis of Tera-scale Data: 1 of 2. Database Class, 11/17/09.

In Today’s Class We Will…
– Deal with scale from a completely different perspective
– Discuss problems with traditional approaches
– Discuss how to analyze large quantities of data
– Discuss how to store (physically) huge amounts of data in a scalable, reliable fashion
– Discuss a simple, effective approach to storing record-like data

Dealing with scale
MySQL will crawl with 500 GB of data.
“Enterprise” databases (Oracle, DB2):
– Expensive $$$$$$$$
– Do not scale well: indexing becomes painful, and aggregate operations (SELECT COUNT(*) …) are almost impossible
Distributed databases:
– Expensive $$$$$$$$$$$$$$$$$$$$$$$$
– Don’t scale well either
New approaches are required!

Large Scale Data: Do we need databases?
Traditional database design is informed by decades of research on storage and retrieval.
Complicated database systems mean more tuning:
– A whole industry of “database administrators”
– Result: increased operational expenses
Complicated indexing and transaction-processing algorithms are not needed if all we care about is analysis of the data.

Parallelize both data access and processing
Over time, processing capacity has grown much faster than:
– Disk transfer time (slow)
– Disk seek time (even slower)
Solution: process data on a cluster of nodes with independent CPUs and independent disks.

Overview
MapReduce is a design pattern that:
– Manipulates large quantities of data
– Abstracts away system-specific issues
– Encourages cleaner software engineering
– Is inspired by functional programming primitives

MapReduce by Example
Output: word frequency histogram
Input: text, read one line at a time
Single-core design: use a hash table
MapReduce:

    def mapper(line):
        for word in line.split():
            output(word, 1)

    def reducer(key, values):
        output(key, sum(values))

Word Frequency Histogram (contd)
Input text:
    the quick brown fox
    the fox ate the rabbit
    the brown rabbit

Word Frequency Histogram (contd)
Input:
    the quick brown fox
    the fox ate the rabbit
    the brown rabbit

MAPPER emits:   (the, 1), …
SHUFFLE groups: (the, (1, 1, 1, 1)), …
REDUCER emits:  (the, 4), (ate, 1), (brown, 2), (fox, 2), (quick, 1), (rabbit, 2)

WordCount review
Output: word frequency histogram
Input: text, read one line at a time
– Key: ignored; Value: a line of text

    def mapper(key, value):
        for word in value.split():
            output(word, 1)

    def reducer(key, values):
        output(key, sum(values))

WordCount: In actual code – Mapper
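The transcript drops the code image from this slide. As a stand-in, here is a sketch of a typical WordCount Mapper written against the standard org.apache.hadoop.mapreduce API (the newer API, which postdates this lecture; the class name WordCountMapper is illustrative):

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Emits (word, 1) for every word in the input line.
    public class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);  // output(word, 1)
            }
        }
    }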

WordCount: In actual code – Reducer
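Again a sketch rather than the slide’s exact code: the Reducer receives each word together with all of its 1s and emits the sum, mirroring the reducer pseudocode above.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Receives (word, [1, 1, ...]) and emits (word, count).
    public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();  // output(key, sum(values))
            }
            result.set(sum);
            context.write(key, result);
        }
    }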

WordCount: In actual code – Driver (main) method
Observe the benefits of the abstraction: the code says nothing about the underlying hardware, and reliability and job distribution are handled by the framework.
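A sketch of what the driver might look like: it wires the (hypothetical) WordCountMapper and WordCountReducer classes into a Job and submits it; input and output paths come from the command line.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(WordCountMapper.class);
            job.setCombinerClass(WordCountReducer.class);  // local pre-aggregation
            job.setReducerClass(WordCountReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            // The framework handles distribution, retries, and data locality.
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Nothing in this code mentions individual machines, failures, or scheduling; that is exactly the abstraction the slide points at.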

“Thinking” in MapReduce
Input is a sequence of key-value pairs (records).
Processing of any one record is independent of the others.
Algorithms, and sometimes the data itself, need to be recast to fit this model.
– Think of structured data (graphs!)

Example: Inverted Indexing
Say you have a large collection of documents (billions).
How do you efficiently find all documents that contain a certain word?
Database solution:
    SELECT doc_id FROM doc_table WHERE doc_text CONTAINS 'word';
Forget scalability; this is very inefficient even on one machine. Another demonstration of when not to use a DB.

Example: Inverted Indexing
A well-studied problem in the Information Retrieval community (more about this in the Spring course).
For now, we will build a simple index:
– Scan all documents in the collection
– For each word, record the documents in which it appears
You can write a few lines of Perl/Python to do this: simple, but it will take forever to finish.
What is the complexity of this code?

“Thinking” in MapReduce (contd)
Building inverted indexes
Input: a collection of documents
Output: for each word, all documents containing that word

    def mapper(filename, content):
        for word in content.split():
            output(word, filename)

    def reducer(key, values):
        output(key, unique(values))

What is the latency of this code?
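For concreteness, here is a minimal Hadoop version of the same index, under the assumption that documents are plain text files and the source filename is recovered from the input split (class names are illustrative):

    import java.io.IOException;
    import java.util.HashSet;
    import java.util.Set;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    public class InvertedIndex {
        // Emits (word, filename) for every word in every line.
        public static class IndexMapper extends Mapper<Object, Text, Text, Text> {
            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                // Recover the source document name from the input split.
                String filename = ((FileSplit) context.getInputSplit()).getPath().getName();
                for (String word : value.toString().split("\\s+")) {
                    if (!word.isEmpty()) {
                        context.write(new Text(word), new Text(filename));
                    }
                }
            }
        }

        // Emits (word, list of unique documents), i.e. unique(values).
        public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
            @Override
            public void reduce(Text key, Iterable<Text> values, Context context)
                    throws IOException, InterruptedException {
                Set<String> docs = new HashSet<>();
                for (Text v : values) {
                    docs.add(v.toString());
                }
                context.write(key, new Text(String.join(", ", docs)));
            }
        }
    }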

Suggested Exercise
Twitter has data in the following format:
Write MapReduce jobs for:
– Finding all users who tweeted about “Comic Con”
– Ranking all users by the frequency of their tweets
– Finding how the number of tweets containing “iPhone” varies with time

MapReduce vs RDBMS

                 Traditional RDBMS         MapReduce
    Data size    Gigabytes                 Petabytes
    Access       Interactive & batch       Batch
    Updates      Read & write many times   Write once, read many times
    Structure    Static schema             Dynamic schema
    Integrity    High                      Low
    Scaling      Nonlinear                 Linear

The Apache Hadoop Zoo: Pig, Chukwa, Hive, HBase, MapReduce, HDFS, ZooKeeper, Common, Avro.

Storing Large Data: HDFS
The Hadoop Distributed File System (HDFS)
A very large distributed file system (~10 PB)
Assumes commodity hardware:
– Replication
– Failure detection & recovery
Optimized for batch processing
A single namespace for the entire cluster, e.g.:
    hdfs://node-21/user/smith/job21/input01.txt
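Although the slides do not show client code, a minimal sketch of reading that file through the HDFS Java API looks like this (the path reuses the example above; in practice the namenode address comes from the cluster configuration, fs.defaultFS):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsCat {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Path path = new Path("hdfs://node-21/user/smith/job21/input01.txt");
            // The client talks to the namenode for metadata, then streams
            // blocks directly from the datanodes that hold them.
            FileSystem fs = FileSystem.get(path.toUri(), conf);
            try (FSDataInputStream in = fs.open(path);
                 BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line);
                }
            }
        }
    }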

HDFS Concepts
Blocks: the single unit of storage
Namenode (master): manages the namespace
– The filesystem namespace tree + metadata
– Maintains the file-to-block mapping
Datanodes (workers): perform block-level operations

HDFS Architecture

Storing record data
HDFS is a filesystem: an abstraction for files of raw bytes, with no structure.
A lot of real-world data occurs as tuples; hence the RDBMS. If only they were scalable…
Google’s solution: Bigtable (2004)
– A scalable, distributed, multi-dimensional sorted map
– Currently used in 100+ projects inside Google: 70+ PB of data, 30+ GB/s of I/O (Jeff Dean, LADIS ’09)

Storing record data: HBase
An open source clone of Google’s Bigtable, originally created at Powerset in 2007.
Used at Yahoo, Microsoft, Adobe, Twitter, …
A distributed, column-oriented database on top of HDFS, with real-time random read/write access.
Not relational, and does not support SQL, but works with very large datasets: billions of rows, millions of columns.

HBase: Data Model
Data is stored in labeled tables: a multi-dimensional sorted map.
A table has rows and columns; a cell is the intersection of a row and a column.
– Cells are versioned (timestamped)
– A cell contains an uninterpreted array of bytes (no type information)
Primary key: the row key, which uniquely identifies a row.
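One way to make “multi-dimensional sorted map” concrete is as nested sorted maps. This toy Java sketch (not HBase code) mimics the model, including the newest-first ordering of cell versions:

    import java.util.Collections;
    import java.util.TreeMap;

    public class SortedMapModel {
        public static void main(String[] args) {
            // row key -> column (family:qualifier) -> timestamp -> cell bytes
            TreeMap<String, TreeMap<String, TreeMap<Long, byte[]>>> table = new TreeMap<>();
            table.computeIfAbsent("row1", r -> new TreeMap<>())
                 // timestamps sorted newest-first, as HBase versions are
                 .computeIfAbsent("temperature:air", c -> new TreeMap<>(Collections.reverseOrder()))
                 .put(System.currentTimeMillis(), "22.5".getBytes());
            byte[] newest = table.get("row1").get("temperature:air").firstEntry().getValue();
            System.out.println(new String(newest));  // prints 22.5
        }
    }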

HBase: Data model (contd)
Columns are grouped into column families, e.g., temperature:air, temperature:dew_point.
Thus a column name has the form family_name:identifier.
The column families are assumed to be known a priori; however, new columns within an existing family can be added at run time.

    ROW          COLUMN FAMILIES
    location_id  temperature:air         humidity:absolute
                 temperature:dew_point   humidity:relative
                                         humidity:specific
                                         …

HBase: Data model (contd)
Tables are partitioned into regions.
– A region is a subset of the rows
– Regions are the units that get distributed across a cluster
Locking:
– Row updates are atomic
– Updating a cell locks the entire row
– Simple to implement and efficient; besides, updates are rare

HBase vs. RDBMS
HBase scale: billions of rows and millions of columns.
Traditional RDBMS:
– Fixed schema
– Good for small-to-medium-volume applications
– Scaling an RDBMS involves violating Codd’s rules and loosening ACID properties

HBase schema design case study
Store information about students, courses, and course registration.
Relationships (two one-to-many):
– A student can take multiple courses
– A course is taken by multiple students

HBase schema design case study
RDBMS solution:

    STUDENTS            REGISTRATION      COURSES
    id (primary key)    student_id        id (primary key)
    name                course_id         title
    department_id       type              faculty_id

HBase schema design case study
HBase solution:

    STUDENTS table
    ROW         COLUMN FAMILIES
    student_id  info:name             course:course_id=type
                info:department_id

    COURSES table
    ROW         COLUMN FAMILIES
    course_id   info:title            student:student_id=type
                info:faculty_id

HBase: A real example
Search engine query log:

    ROW               COLUMN FAMILIES
    request_md5_hash  query:text    cookie:id    request:user_agent
                      query:lang                 request:ip_addr
                      …                          request:timestamp

It is common practice to use a hash of the row’s contents as the key when no natural primary key exists. This is okay when the data is accessed sequentially; there is no need to “look up” individual rows.

Suggested Exercise
Write an RDBMS schema to model the user-follower network in Twitter. Now write its HBase equivalent.

Access to HBase
Via the Java API: map semantics (Put, Get, Scan, Delete), with versioning support.
Via the HBase shell:

    $ hbase shell
    ...
    hbase> create 'test', 'data'
    hbase> list
    test
    …
    hbase> put 'test', 'row1', 'data:1', 'value1'
    hbase> put 'test', 'row2', 'data:1', 'value1'
    hbase> put 'test', 'row2', 'data:2', 'value2'
    hbase> put 'test', 'row3', 'data:3', 'value3'
    hbase> scan 'test'
    …
    hbase> disable 'test'
    hbase> drop 'test'

The shell is not the preferred way to access HBase (typically the API is used), and it is not a real query language, but it is useful for “inspecting” a table.
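For comparison, a minimal sketch of the same Put/Get operations through the Java client API; this uses the modern Connection/Table interface (which postdates this lecture) and the 'test' table from the shell session above:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseExample {
        public static void main(String[] args) throws Exception {
            try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Table table = conn.getTable(TableName.valueOf("test"))) {
                // put 'test', 'row1', 'data:1', 'value1'
                Put put = new Put(Bytes.toBytes("row1"));
                put.addColumn(Bytes.toBytes("data"), Bytes.toBytes("1"), Bytes.toBytes("value1"));
                table.put(put);
                // Read the cell back (the newest version by default).
                Result result = table.get(new Get(Bytes.toBytes("row1")));
                byte[] value = result.getValue(Bytes.toBytes("data"), Bytes.toBytes("1"));
                System.out.println(Bytes.toString(value));  // prints "value1"
            }
        }
    }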

A Word on HBase Performance
The original HBase had performance issues; HBase 0.20 (the latest release) is much faster. Open source development!
Performance analysis by StumbleUpon.com:
– The website keeps over 9 billion rows in a single HBase table
– 1.2 million row reads/sec using just 19 nodes
– Scalable with more nodes
– Caching (a new feature) further improves performance

Are traditional databases really required?
Will a bank store all its data on HBase or its equivalents? Unlikely, because Hadoop:
– Has no notion of a transaction
– Has no security or access control like databases do
Fortunately, batch processing of large amounts of data does not require such guarantees.
Hot research topic: integrating databases and MapReduce (“in-database MapReduce”).

Summary
(Traditional) databases are not Swiss Army knives.
Large data problems require radically different solutions: exploit the power of parallel I/O and computation.
MapReduce is a framework for building reliable, distributed data-processing applications.
Storing large data requires a redesign from the ground up, i.e., at the filesystem level (HDFS).

Summary (contd)
HDFS: a reliable, open source, distributed file system.
HBase: a sorted, multi-dimensional map for record-oriented data.
– Not relational
– No query language other than the map semantics (Get and Put)
Using MapReduce + HBase requires a fair bit of programming experience. Next class we will study Pig and Hive: a “data analyst friendly” interface to processing large data.

Suggested Reading
Jeffrey Dean and Sanjay Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters”
DeWitt and Stonebraker, “MapReduce: A Major Step Backwards”
Chu-Carroll, “Databases are Hammers; MapReduce is a Screwdriver”
DeWitt and Stonebraker, “MapReduce II”

Suggested Reading (contd)
Hadoop Overview
Who Uses Hadoop?
HDFS Architecture
Chang et al., “Bigtable: A Distributed Storage System for Structured Data”
HBase