What is the typical size of data you deal with on a daily basis?
Processes 20 Petabytes of raw data a day Works out to 231 TB per second! Numbers from 2008, grows by the day. http://www.niallkennedy.com/blog/2008/01/google-mapreduce-stats.html
200 GB per day (March 2008) 2+ TB (compressed) per day in April 2009 4+ TB (compressed) per day in Oct 2009 – 15 TB (uncompressed) a day (2009)
And Many More … eBay: 50 TB / day NYSE: 1TB / day CERN LHC : 15 PB / Year
Storage and Analysis of Tera-scale Data : 1 of 2 415 Database Class 11/17/09
In Today’s Class We Will.. Deal with scale from a completely different perspective Discuss problems with traditional approaches Discuss how to analyze large quantities of data Discuss how to store (physically) huge amounts of data in a scalable, reliable fashion Discuss a simple effective approach to store record-like data
Dealing with scale MySQL will crawl with 500 GB of data “Enterprise” databases (Oracle, DB2) – Expensive $$$$$$$$ – Does not scale well Indexing painful Aggregate operations (SELECT COUNT(*)..) almost impossible Distributed databases – Expensive $$$$$$$$$$$$$$$$$$$$$$$$ – Doesn’t scale well either New approaches required!
Large Scale Data: Do we need databases? Traditional database design is inspired from decades of research on storage and retrieval. Complicated database systems == more tuning – A whole industry of “Database Administrators” – Result: Increased operational expenses Complicated indexing, transaction processing algorithms are not needed if all we care about are analyses from the data
Parallelize both data access and processing Over time processing capacity has increased compared to – Disk transfer time (slow) – Disk seek time (even slower) Solution: – Process data using a cluster of nodes using independent CPUs and independent disks.
Overview MapReduce is a design pattern: – Manipulate large quantities of data – Abstracts away system specific issues – Encourage cleaner software engineering – Inspired from functional programming primitives
MapReduce by Example Output: Word frequency histogram Input: Text, read one line at a time Single core design: Use a hash table MapReduce: def mapper(line): foreach word in line.split(): output(word, 1) def reducer(key, values): output(key, sum(values))
Word Frequency Histogram (contd) the quick brown fox the fox ate the rabbit the brown rabbit the quick brown fox the fox ate the rabbit the brown rabbit
Word Frequency Histogram (contd) the quick brown fox the fox ate the rabbit the brown rabbit the quick brown fox the fox ate the rabbit the brown rabbit MAPPER REDUCER SHUFFLE (the, 1) (the, (1, 1, 1, 1)) (the, 4) (ate, 1) (brown, 2) (fox, 2) (quick, 1) (rabbit, 2)
WordCount review Output: Word frequency histogram Input: Text, read one line at a time – Key: ignore, Value: Line of text def mapper(key, value): foreach word in value.split(): output(word, 1) def reducer(key, values): output(key, sum(values))
WordCount: In actual code Driver (main) method Observe the benefits of abstraction from hardware dependence, reliability and job distribution
“Thinking” in MapReduce Input is a sequence of key value pairs (records) Processing of any record is independent of the others Need to recast algorithms and sometimes data to fit to this model – Think of structured data (Graphs!)
Example: Inverted Indexing Say you have a large (billions) collection of documents How do you efficiently find all documents that contain a certain word? Database solution: SELECT doc_id FROM doc_table where doc_text CONTAINS ‘word’; Forget scalability. Very inefficient. Another demonstration of when not to use a DB
Example: Inverted Indexing Well studied problem in Information Retrieval community More about this in 600.466 course (Spring) For now, we will build a simple index – Scan all documents in the collection – For each word record the document in which it appears Can write a few lines of Perl/Python to do it – Simple but will take forever to finish What is the complexity of this code?
“Thinking” in MapReduce (contd) Building inverted indexes Input: Collection of documents Output: For each word find all documents with the word def mapper(filename, content): foreach word in content.split(): output(word, filename) def reducer(key, values): output(key, unique(values)) What is the latency of this code?
Suggested Exercise Twitter has data in following format: Write map-reduces for – Finding all users who tweeted on “Comic Con” – Ranking all users by frequency of their tweets – Finding number of tweets containing “iPhone” varies with time
MapReduce vs RDBMS Traditional RDBMSMapReduce Data sizeGigabytesPetabytes AccessInteractive & BatchBatch UpdatesRead & write many timesWrite once, read many StructureStatic schemaDynamic schema IntegrityHighLow ScalingNonlinearLinear
The Apache Hadoop Zoo PIG CHUKWA HIVE HBASE MAPREDUCE HDFS ZOOKEEPER COMMON AVRO
Storing Large Data: HDFS Hadoop File System (HDFS) Very large distributed file system (~10PB) Assumes commodity hardware – Replication – Failure detection & Recovery Optimized for batch processing Single namespace for entire cluster hdfs://node-21/user/smith/job21/input01.txt
HDFS Concepts Blocks – A single unit of storage Namenode (master) – manages namespace Filesystem namespace tree + metadata – Maintains file to block mapping Datanodes (workers) – Performs block level operations
Storing record data HDFS is a filesystem: An abstraction for files with raw bytes and no structure Lot of real world data occur as tuples – Hence RDBMS. If only they were scalable … Google’s solution: BigTable (2004) – A scalable distributed multi-dimensional sorted map Currently used in 100+ projects inside Google – 70+ PB data; 30+ GB/s IO (Jeff Dean, LADIS ’09)
Storing record data: HBase Open source clone of Google’s Bigtable Originally created in PowerSet in 2007 Used in : Yahoo, Microsoft, Adobe, Twitter, … Distributed column-oriented database on top of HDFS Real time read/write random-access Not Relational and does not support SQL But can work with very large datasets – Billions of rows, millions of columns
HBase: Data Model Data stored in labeled tables – Multi-dimensional sorted map Table has rows, columns. Cell: Intersection of a row & column – Cells are versioned (timestamped) – Contains an uninterrupted array of bytes (no type information) Primary key: A cell that uniquely identifies a row
HBase: Data model (contd) Columns are grouped into column families – Eg., temperature:air, temperature:dew_point Thus column name is family_name:identifier The column families are assumed to be known a priori However can add new columns for an existing family at run time ROWCOLUMN FAMILIES temperature:humidity:… location_idtemperature:air temperature:dew_point humidity:absolute humidity:relative humidity:specific …
HBase: Data model (contd) Tables are partitioned into regions Region: Subset of rows Regions are units that get distributed across a cluster Locking – Row updates are atomic – Updating a cell will lock entire row. – Simple to implement. Efficient. – Also, updates are rare
HBase vs. RDBMS Scale : Billions of rows and Millions of columns Traditional RDBMS: – Fixed schema – Good for small to medium volume applications – Scaling RDBMS involves violating Codd’s Rules, loosening ACID properties
HBase schema design case study Store information about students, courses, course registration Relationships (Two one-to-many) – A student can take multiple courses – A course is taken by multiple students
HBase schema design case study RDBMS solution STUDENTS id (primary key) name department_id REGISTRATION student_id course_id type COURSES id (primary key) title faculty_id
HBase schema design case study HBase solution ROWCOLUMN FAMILIES info:course: info:name info:department_id course:course_id=type ROWCOLUMN FAMILIES info:student: info:title info:faculty_id student:student_id=type
HBase: A real example Search Engine Query Log ROWCOLUMN FAMILIES querycookie:request: request_md5_hashquery:text query:lang... cookie:idrequest:user_agent request:ip_addr request:timestamp Common practice to use hash of the row of as key when no natural primary key exists This is okay when data is accessed sequentially. No need to “lookup”
Suggested Exercise Write an RDBMS schema to model User- Follower network in Twitter Now, write its HBase equivalent
Access to HBase via Java API – Map semantics: Put, Get, Scan, Delete – Versioning support via the HBase shell $ hbase shell... hbase> create 'test' 'data' hbase> list test … hbase> put 'test', 'row1', 'data:1', 'value1' hbase> put 'test', 'row2', 'data:1', 'value1' hbase> put 'test', 'row2', 'data:2', 'value2' hbase> put 'test', 'row3', 'data:3', 'value3' hbase> scan 'test' … hbase> disable 'test' hbase> drop 'test' Not a preferred way to access. Typically API is used. Not a real query language. Useful for “inspecting” a table.
A Word on HBase Performance Original HBase had performance issues HBase 2.0 (latest release) much faster – Open source development!!! Performance Analysis by StumbleUpon.com – Website uses over 9b rows in a single HBase table – 1.2m row reads/sec using just 19 nodes – Scalable with more nodes – Caching (new feature) further improves performance
Are traditional databases really required? Will a bank store all its data on HBase or its equivalents? Unlikely because Hadoop – Does not have a notion of a transaction – No security or access control like databases Fortunately batch processing large amounts of data does not require such guarantees Hot research topic: Integrate databases and MapReduce – “In database MapReduce”
Summary (Traditional) Databases are not Swiss-Army knives Large data problems require radically different solutions Exploit the power of parallel I/O and computation MapReduce as a framework for building reliable distributed data processing applications Storing large data requires redesign from the ground up, i.e. filesystem (HDFS)
Summary (contd) HDFS : A reliable open source distributed file system HBase : A sorted multi-dimensional map for record oriented data – Not Relational – No query language other than map semantics (Get and Put) Using MapReduce + HBase involves a fair bit of programming experience – Next class we will study Pig and Hive: A “data analyst friendly” interface to processing large data.
Suggested Reading Jeffrey Dean and Sanjay Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters” Dewitt and Stonebraker, Mapreduce: “A major step backwards” Chu-Carroll, “Databases are hammers; MapReduce is a screwdriver” Dewitt and Stonebraker, “Mapreduce II”
Suggested Reading (contd) Hadoop Overview http://hadoop.apache.org/common/docs/current/mapred_tutorial.html Who uses Hadoop? http://wiki.apache.org/hadoop/PoweredBy HDFS Architecture http://hadoop.apache.org/common/docs/current/hdfs_design.html Chang et. al., “Bigtable: A Distributed Storage System for Structured Data” http://labs.google.com/papers/bigtable.html Hbase http://hadoop.apache.org/hbase/