1 Hadoop & Map Reduce
Presentation by Yoni Nesher, NonSQL database Techforum
2 Hadoop & Map Reduce
Forum agenda:
• Big data problem domain
• Hadoop ecosystem
• Hadoop Distributed File System (HDFS)
• Diving into MapReduce
• MapReduce case studies
• MapReduce vs. parallel DBMS systems: comparison and analysis
3 Hadoop & Map Reduce
Main topics:
• HDFS (Hadoop Distributed File System): manages storage across a network of machines; designed for storing very large files and optimized for streaming data access patterns.
• MapReduce: a distributed data processing model and execution environment that runs on large clusters of commodity machines.
4 Introduction
• It has been said that “More data usually beats better algorithms”.
• For some problems, however sophisticated your algorithms are, they can often be beaten simply by having more data (and a less sophisticated algorithm).
• So the good news is that Big Data is here!
• The bad news is that we are struggling to store and analyze it.
5 Introduction
• A possible (and arguably the only) solution: read and write data in parallel.
• This approach introduces new problems in the data I/O domain:
• Hardware failure: as soon as you start using many pieces of hardware, the chance that one will fail is fairly high. A common way of avoiding data loss is replication.
• Combining data: most analysis tasks need to combine the data in some way; data read from one disk may need to be combined with data from any of the other disks.
• The MapReduce programming model abstracts the problem away from disk reads and writes (coming up).
6 Introduction
What is Hadoop?
• Hadoop provides a reliable shared storage and analysis system. Storage is provided by HDFS and analysis by MapReduce.
• There are other parts to Hadoop, but these capabilities are its kernel.
History:
• Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library.
• Hadoop has its origins in Apache Nutch, an open source web search engine, itself a part of the Lucene project.
• In January 2008, Hadoop was made its own top-level project at Apache.
• Hadoop users include Yahoo!, Last.fm, Facebook, and the New York Times (more examples later on).
7 The Hadoop ecosystem
• Common: a set of components and interfaces for distributed file systems and general I/O (serialization, Java RPC, persistent data structures).
• Avro: a serialization system for efficient, cross-language RPC and persistent data storage.
• MapReduce: a distributed data processing model and execution environment that runs on large clusters of commodity machines.
• HDFS: a distributed file system that runs on large clusters of commodity machines.
• Pig: a data flow language and execution environment for exploring very large datasets. Pig runs on HDFS and MapReduce clusters.
• Hive: a distributed data warehouse. Hive manages data stored in HDFS and provides a query language based on SQL (which the runtime engine translates to MapReduce jobs) for querying the data.
• HBase: a distributed, column-oriented database. HBase uses HDFS for its underlying storage, and supports both batch-style computations using MapReduce and point queries (random reads).
• ZooKeeper: a distributed, highly available coordination service. ZooKeeper provides primitives such as distributed locks that can be used for building distributed applications.
• Sqoop: a tool for efficiently moving data between relational databases and HDFS.
8 Hadoop HDFS
What is HDFS?
• A distributed filesystem: manages storage across a network of machines.
• Designed for storing very large files: files that are hundreds of megabytes, gigabytes, or terabytes in size. There are Hadoop clusters running today that store petabytes of data.
• Streaming data access patterns: optimized for write-once, read-many-times access.
• Not optimized for low-latency seek operations, lots of small files, multiple writers, or arbitrary file modifications.
9 Hadoop HDFS Highlights
• File blocks are distributed among the nodes in the cluster.
• Blocks are typically 64 MB.
• Blocks are replicated on different nodes (three times by default).
• Fault tolerance mechanisms make sure that when a node goes down, all blocks handled by this node are re-replicated to other nodes.
• If a block becomes unavailable, a copy can be read from another location in a way that is transparent to the client (more on that later).
10 Hadoop HDFS Highlights
HDFS NameNode:
• Manages the filesystem namespace and maintains the filesystem tree and the metadata for all the files and directories in the tree.
• Knows the datanodes on which all the blocks for a given file are located.
DataNodes:
• Datanodes are the workhorses of the filesystem.
• Store and retrieve blocks when they are told to (by clients or the namenode).
• Report back to the namenode periodically with lists of blocks that they are storing.
11 A taste of HDFS: Linux CLI examples
Add a file from the local FS:
hadoop fs -copyFromLocal input/docs/quangle.txt quangle.txt
Return the file to the local FS:
hadoop fs -copyToLocal quangle.txt quangle.copy.txt
Create a directory and list the current directory:
hadoop fs -mkdir books
hadoop fs -ls .
12 Hadoop HDFS
• Runs on clusters of commodity hardware: commonly available hardware that can be obtained from multiple vendors.
• The chance of node failure across the cluster is high, at least for large clusters.
• HDFS is designed to carry on working without a noticeable interruption to the user in the face of such failure.
13 HDFS concepts: Blocks
• A block is the minimum amount of data that a file system can read or write.
• Blocks in a single-disk file system are typically a few kilobytes in size.
• HDFS blocks are 64 MB by default.
• Files in HDFS are broken into block-sized chunks, which are stored as independent units.
• Unlike a file system for a single disk, a file in HDFS that is smaller than a single block does not occupy a full block’s worth of underlying storage.
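As a rough illustration of the block arithmetic above, the following Python sketch splits a file of a given size into 64 MB blocks. The function name split_into_blocks is hypothetical and not part of HDFS; the sketch only shows that the last block holds the remainder of the file rather than a full 64 MB.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the default block size quoted in the slides
MB = 1024 * 1024

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes (in bytes) of the blocks a file of file_size bytes would occupy."""
    full, rest = divmod(file_size, block_size)
    sizes = [block_size] * full
    if rest:
        sizes.append(rest)  # the final block only uses as much storage as it needs
    return sizes

# A 150 MB file becomes two full blocks plus one 22 MB block.
print([size // MB for size in split_into_blocks(150 * MB)])  # [64, 64, 22]
```

With the default replication factor of three, each of these blocks would be stored three times across the cluster.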
14 HDFS concepts: Blocks (cont.)
• Blocks are just chunks of data to be stored; file metadata such as hierarchies and permissions does not need to be stored with the blocks.
• Each block is replicated to a small number of physically separate machines (typically three).
• If a block becomes unavailable, a copy can be read from another location in a way that is transparent to the client.
15 HDFS concepts: Namenodes and Datanodes
• HDFS follows a master-worker pattern: a namenode (the master) and a number of datanodes (workers).
NameNode:
• Manages the filesystem namespace and maintains the filesystem tree and the metadata for all the files and directories in the tree.
• This information is stored persistently on the local disk.
• Knows the datanodes on which all the blocks for a given file are located.
• It does not store block locations persistently, since this information is reconstructed from the datanodes when the system starts.
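The idea that block locations can be rebuilt from datanode reports can be sketched as follows. This is a simplified illustration, not the namenode's actual data structure: the function build_block_locations, the datanode names, and the block IDs are all made up for the example.

```python
from collections import defaultdict

def build_block_locations(block_reports):
    """Rebuild the block -> datanodes map from per-datanode block reports.

    block_reports maps a datanode name to the list of block IDs it stores.
    """
    locations = defaultdict(set)
    for datanode, blocks in block_reports.items():
        for block_id in blocks:
            locations[block_id].add(datanode)
    return dict(locations)

# Hypothetical reports sent by three datanodes at startup.
reports = {
    "dn1": ["blk_1", "blk_2"],
    "dn2": ["blk_1"],
    "dn3": ["blk_1", "blk_2"],
}
locations = build_block_locations(reports)
print(sorted(locations["blk_1"]))  # ['dn1', 'dn2', 'dn3']
```

Because the map is derived entirely from the reports, nothing about block placement needs to survive a namenode restart; only the namespace and file-to-block metadata must be persistent.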
16 HDFS concepts: Namenodes and Datanodes (cont.)
DataNodes:
• Datanodes are the workhorses of the filesystem.
• Store and retrieve blocks when they are told to (by clients or the namenode).
• Report back to the namenode periodically with lists of blocks that they are storing.
Without the namenode, the filesystem cannot be used:
• If the machine running the namenode were destroyed, all the files on the filesystem would effectively be lost, because there would be no way of knowing how to reconstruct the files from the blocks on the datanodes.
17 HDFS concepts: Namenodes and Datanodes (cont.)
• For this reason, it is important to make the namenode resilient to failure.
• Possible solution: back up the files that make up the persistent state of the filesystem metadata.
• Hadoop can be configured so that the namenode writes its persistent state to multiple filesystems. These writes are synchronous and atomic.
• The usual configuration choice is to write to local disk as well as a remote NFS mount.
19 HDFS concepts: Anatomy of a file write
Overview of the three phases detailed on the following slides: create a file, write data, close the file.
20 HDFS concepts: Anatomy of a file write
• The client creates the file by calling create() on a local DistributedFileSystem object, which represents HDFS to the client (step 1).
• DistributedFileSystem makes an RPC call to the namenode to create a new file in the filesystem’s namespace, with no blocks associated with it (step 2).
• The namenode makes a record of the new file, and DistributedFileSystem returns an FSDataOutputStream for the client to start writing data to.
• As the client writes data (step 3), DFSOutputStream splits it into packets, which it writes to an internal queue called the data queue.
21 HDFS concepts: Anatomy of a file write
• The data queue is consumed by the DataStreamer, whose responsibility it is to ask the namenode to allocate new blocks by picking a list of suitable datanodes to store the replicas.
• The list of datanodes forms a pipeline. We’ll assume the replication level is three, so there are three nodes in the pipeline.
• The DataStreamer streams the packets to the first datanode in the pipeline, which stores the packet and forwards it to the second datanode in the pipeline. Similarly, the second datanode stores the packet and forwards it to the third (and last) datanode in the pipeline (step 4).
• DFSOutputStream maintains a queue of packets that are waiting to be acknowledged by datanodes, called the ack queue. A packet is removed from the ack queue only when it has been acknowledged by all the datanodes in the pipeline (step 5).
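The interplay between the data queue, the pipeline, and the ack queue can be sketched as a small single-threaded simulation. This is only a toy model under strong assumptions: the real client runs these steps concurrently over the network, every node here always succeeds, and the function write_packets and the node names are invented for the illustration.

```python
from collections import deque

def write_packets(packets, pipeline):
    """Simulate sending packets through a datanode pipeline.

    Returns (stored, pending): what each datanode stored, and the packets
    still waiting in the ack queue once everything has been sent.
    """
    data_queue = deque(packets)  # packets waiting to be sent
    ack_queue = deque()          # packets sent but not yet fully acknowledged
    stored = {node: [] for node in pipeline}

    while data_queue:
        packet = data_queue.popleft()
        ack_queue.append(packet)
        acks = 0
        for node in pipeline:    # each node stores the packet, then forwards it on
            stored[node].append(packet)
            acks += 1
        if acks == len(pipeline):  # remove only when every node has acknowledged
            ack_queue.popleft()

    return stored, list(ack_queue)

stored, pending = write_packets(["p1", "p2", "p3"], ["dn1", "dn2", "dn3"])
print(stored["dn3"], pending)  # ['p1', 'p2', 'p3'] []
```

An empty ack queue at the end corresponds to the condition the client waits for before it signals the namenode that the file is complete.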
22 HDFS concepts: Anatomy of a file write
• When the client has finished writing data, it calls close() on the stream (step 6).
• This action flushes all the remaining packets to the datanode pipeline and waits for acknowledgments before contacting the namenode to signal that the file is complete (step 7).
23 HDFS concepts: Anatomy of a file read
Overview of the three phases detailed on the following slides: open a file, read data, close the file.
24 HDFS concepts: Anatomy of a file read
• The client opens the file it wishes to read by calling open() on the DistributedFileSystem object (step 1).
• DistributedFileSystem calls the namenode, using RPC, to determine the locations of the first few blocks in the file (step 2).
• For each block, the namenode returns the addresses of the datanodes that have a copy of that block.
• The datanodes are sorted according to their proximity to the client.
• DistributedFileSystem returns an FSDataInputStream (an input stream that supports file seeks) to the client for it to read data from.
25 HDFS concepts: Anatomy of a file read
• The client then calls read() on the stream (step 3).
• DFSInputStream, which has stored the datanode addresses for the first few blocks in the file, connects to the first (closest) datanode for the first block in the file.
• Data is streamed from the datanode back to the client, which calls read() repeatedly on the stream (step 4).
• When the end of the block is reached, DFSInputStream closes the connection to the datanode, then finds the best datanode for the next block (step 5). This happens transparently to the client.
• When the client has finished reading, it calls close() on the FSDataInputStream (step 6).
26 HDFS concepts: Network topology and Hadoop
• Hadoop takes a simple approach to estimating the proximity between nodes: the network is represented as a tree, and the distance between two nodes is the sum of their distances to their closest common ancestor.
• Levels in the tree correspond to the data center, the rack, and the node that a process is running on.
• The idea is that the available bandwidth becomes progressively less for each of the following scenarios:
• Processes on the same node
• Different nodes on the same rack
• Nodes on different racks in the same data center
• Nodes in different data centers
27 HDFS concepts: Network topology and Hadoop
For example, take a node n1 on rack r1 in data center d1. This can be represented as /d1/r1/n1. Using this notation, here are the distances for the four scenarios:
• distance(/d1/r1/n1, /d1/r1/n1) = 0 (processes on the same node)
• distance(/d1/r1/n1, /d1/r1/n2) = 2 (different nodes on the same rack)
• distance(/d1/r1/n1, /d1/r2/n3) = 4 (nodes on different racks in the same data center)
• distance(/d1/r1/n1, /d2/r3/n4) = 6 (nodes in different data centers)
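The distance rule above can be written down in a few lines. The following is a minimal sketch, assuming locations are given as /datacenter/rack/node strings as in the slide; the function name distance mirrors the slide's notation and is not Hadoop's own API.

```python
def distance(a, b):
    """Tree distance between two locations written as /datacenter/rack/node."""
    path_a = a.strip("/").split("/")
    path_b = b.strip("/").split("/")
    common = 0
    for x, y in zip(path_a, path_b):
        if x != y:
            break
        common += 1  # length of the shared prefix = depth of the closest common ancestor
    # hops from each node up to the closest common ancestor
    return (len(path_a) - common) + (len(path_b) - common)

print(distance("/d1/r1/n1", "/d1/r1/n1"))  # 0
print(distance("/d1/r1/n1", "/d1/r1/n2"))  # 2
print(distance("/d1/r1/n1", "/d1/r2/n3"))  # 4
print(distance("/d1/r1/n1", "/d2/r3/n4"))  # 6
```

Sorting a block's replica locations by this distance from the client gives the "closest datanode first" ordering used in the file read path.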
28 MapReduce
What is it?
• A distributed data processing model and execution environment that runs on large clusters of commodity machines.
• Can be used with Java, Ruby, Python, C++ and more.
• Inherently parallel, thus putting very large-scale data analysis into the hands of anyone with enough machines at their disposal.
MapReduce process flow:
HDFS data -> formatting -> <key, value> collection -> map -> <key, value> collection -> MR framework processing -> <key, values> collection -> reduce -> <key, value> collection -> output -> HDFS data
29 MapReduce: Problem example: Weather Dataset
• Create a program that mines weather data.
• Weather sensors collect data every hour at many locations across the globe and gather a large volume of log data. Source: NCDC.
• The data is stored using a line-oriented ASCII format, in which each line is a record.
• Mission: calculate the maximum temperature for each year around the world.
• Problem: millions of temperature measurement records.
30 MapReduce Example: Weather Dataset
Brute-force approach in Bash (each year’s logs are compressed into a single yearXXXX.gz file):
• The script loops through the compressed year files.
• Each file is processed using awk.
• The awk script extracts two fields from the data: the air temperature and the quality code. If the quality code indicates that the reading is not suspect or erroneous, the value is compared with the maximum value seen so far, which is updated if a new maximum is found.
• The complete run for the century took 42 minutes on a single EC2 High-CPU Extra Large Instance.
31 MapReduce: Weather Dataset with MapReduce
Input formatting phase (HDFS data -> formatting -> <key, value> collection):
• The input to the MR job is the raw NCDC data.
• Input format: we use the Hadoop text formatter class. When given a directory (HDFS URL), it outputs a Hadoop <key, value> collection:
• The key is the offset of the beginning of the line from the beginning of the file.
• The value is the line text.
33 MapReduce: Map phase (<key, value> collection -> map -> <key, value> collection)
• The input to our map phase is the <offset, line_text> pairs.
• The map function pulls out the year and the air temperature, since these are the only fields we are interested in.
• The map function also drops bad records: it filters out temperatures that are missing, suspect, or erroneous.
• Map output: <year, temp> pairs.
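A map function of this kind might look like the sketch below. It is written in plain Python rather than against the Hadoop API, and it assumes a made-up, simplified record layout of year,temperature,quality; the real NCDC lines are fixed-width and need different parsing. The MISSING marker and the GOOD_QUALITY codes are likewise assumptions chosen for the illustration.

```python
MISSING = "9999"                          # assumed marker for a missing reading
GOOD_QUALITY = {"0", "1", "4", "5", "9"}  # assumed codes for trustworthy readings

def map_max_temp(offset, line):
    """Map function: yield (year, temperature) for valid records only."""
    year, temp, quality = line.strip().split(",")
    if temp != MISSING and quality in GOOD_QUALITY:
        yield year, int(temp)

# Hypothetical input; enumerate() stands in for the byte offsets used as keys.
lines = ["1950,0,1", "1950,22,1", "1950,-11,1", "1949,111,1", "1949,9999,1", "1949,78,2"]
pairs = [kv for offset, line in enumerate(lines) for kv in map_max_temp(offset, line)]
print(pairs)  # [('1950', 0), ('1950', 22), ('1950', -11), ('1949', 111)]
```

Note that the offset key is ignored entirely: the map function only needs the line text.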
34 MapReduce: MR framework processing phase (<key, value> collection -> MR framework processing -> <key, values> collection)
• The output from the map function is processed by the MR framework before being sent to the reduce function.
• This processing sorts and groups the key-value pairs by key.
• MR framework processing output: <year, temperatures> pairs.
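The sort-and-group step can be imitated on a single machine as follows. This is a sketch of the effect only, not of how the framework implements it across nodes; the function name shuffle and the sample pairs are hypothetical.

```python
from itertools import groupby
from operator import itemgetter

def shuffle(pairs):
    """Sort (key, value) pairs by key and collect the values of each key into a list."""
    ordered = sorted(pairs, key=itemgetter(0))  # stable sort keeps value order per key
    return [
        (key, [value for _, value in group])
        for key, group in groupby(ordered, key=itemgetter(0))
    ]

map_output = [("1950", 0), ("1950", 22), ("1950", -11), ("1949", 111), ("1949", 78)]
print(shuffle(map_output))  # [('1949', [111, 78]), ('1950', [0, 22, -11])]
```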
35 MapReduce: Reduce phase (<key, values> collection -> reduce -> <key, value> collection)
• The input to our reduce phase is the <year, temperatures> pairs.
• All the reduce function has to do now is iterate through the list and pick out the maximum reading.
• Reduce output: <year, max temperature> pairs.
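In plain Python, such a reduce function is nearly a one-liner. The sketch below is independent of the Hadoop API; the name reduce_max_temp and the grouped sample input are made up for the example.

```python
def reduce_max_temp(year, temperatures):
    """Reduce function: keep only the maximum reading for the year."""
    return year, max(temperatures)

# Hypothetical grouped input, as produced by the framework's sort-and-group step.
grouped = [("1949", [111, 78]), ("1950", [0, 22, -11])]
results = [reduce_max_temp(year, temps) for year, temps in grouped]
print(results)  # [('1949', 111), ('1950', 22)]
```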
36 MapReduce: Data output phase (<key, value> collection -> output -> HDFS data)
• The input to the data output class is the <year, max temperature> pairs from the reduce function.
• When using the default Hadoop output formatter, the output is written to a pre-defined directory, which contains one output file per reducer.
37 MapReduce: Process summary
Textual logs in HDFS -> formatting -> <offset, line> collection -> map -> <year, temp> collection -> MR framework processing -> <year, temp values> collection -> reduce -> <year, max temp> collection -> output -> textual result in HDFS
Question: how could this process be further optimized in the NCDC case?
40 MapReduce: Some code
Putting it all together and running:
hadoop MaxTemperature input/ncdc/sample.txt output
41 MapReduce: Going deep
Definitions:
• MR job: a unit of work that the client wants to be performed. It consists of the input data, the MapReduce program, and configuration information.
• Hadoop runs the job by dividing it into tasks, of which there are two types: map tasks and reduce tasks.
42 MapReduce
• There are two types of nodes that control the job execution process: a jobtracker and a number of tasktrackers.
• Jobtracker: coordinates all the jobs run on the system by scheduling tasks to run on tasktrackers.
• Tasktrackers: run tasks and send progress reports to the jobtracker, which keeps a record of the overall progress of each job.
43 MapReduce: Scaling out!
• Hadoop divides the input to a MapReduce job into fixed-size pieces called input splits, or just splits.
• A split normally corresponds to one or more file blocks.
• Hadoop creates one map task for each split, which runs the user-defined map function for each record in the split.
• Hadoop does its best to run the map task on a node where the input data resides in HDFS. This is called the data locality optimization.
44 MapReduce
• Map tasks write their output to the local disk, not to HDFS. Map output is intermediate output: it is processed by reduce tasks to produce the final output.
• Once the job is complete the map output can be thrown away, so storing it in HDFS, with replication, would be overkill.
45 MapReduce
• Reduce tasks don’t have the advantage of data locality: the input to a single reduce task is normally the output from many mappers.
• The sorted map outputs have to be transferred across the network to the node where the reduce task is running.
• There they are merged and then passed to the user-defined reduce function.
• The output of the reduce is normally stored in HDFS for reliability.
47 MapReduce
• When there are multiple reducers, the map tasks partition their output, each creating one partition for each reduce task.
• The framework ensures that the records for any given key are all in a single partition.
• The partitioning can be controlled by a user-defined partitioning function, but normally the default partitioner, which buckets keys using a hash function, works very well.
• What would be a good partition function in our case?
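Hash partitioning can be sketched in a few lines: take a hash of the key and reduce it modulo the number of reduce tasks. In the sketch below, zlib.crc32 is a deterministic stand-in chosen for the example (Hadoop's default partitioner works on the key's Java hash code instead), and the function name partition and the sample data are assumptions.

```python
import zlib

def partition(key, num_reducers):
    """Pick the reduce task for a key: hash of the key modulo the number of reducers."""
    h = zlib.crc32(key.encode("utf-8"))  # deterministic stand-in for the key's hash code
    return h % num_reducers

map_output = [("1950", 0), ("1950", 22), ("1949", 111), ("1949", 78), ("1951", 5)]
num_reducers = 2
partitions = {r: [] for r in range(num_reducers)}
for key, value in map_output:
    partitions[partition(key, num_reducers)].append((key, value))

# Every record for a given year lands in exactly one partition.
for reducer, records in partitions.items():
    print(reducer, records)
```

Because the partition depends only on the key, all readings for one year reach the same reducer no matter which map task produced them.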
49 MapReduce: Overall MR system flows
The four entities in a MapReduce application:
• The client, which submits the MapReduce job.
• The jobtracker, which coordinates the job run. The jobtracker is a Java application whose main class is JobTracker.
• The tasktrackers, which run the tasks that the job has been split into. Tasktrackers are Java applications whose main class is TaskTracker.
• The distributed filesystem (HDFS), which is used for sharing job files between the other entities.
50 MapReduce: Overall MR system flows
Job submission:
• The MR program calls runJob(), which creates a new JobClient instance and calls submitJob() on it (step 1).
• Having submitted the job, runJob() polls the job’s progress once a second and reports the progress to the console if it has changed since the last report.
• When the job is complete, if it was successful, the job counters are displayed. Otherwise, the error that caused the job to fail is logged to the console.
The job submission process implemented by JobClient’s submitJob() method does the following:
• Asks the jobtracker for a new job ID by calling getNewJobId() on JobTracker (step 2).
51 MapReduce
• Checks the output specification of the job. For example, if the output directory has not been specified or it already exists, the job is not submitted and an error is thrown to the MapReduce program.
• Computes the input splits for the job. If the splits cannot be computed, because the input paths don’t exist, for example, then the job is not submitted and an error is thrown to the MapReduce program.
• Copies the resources needed to run the job, including the job JAR file, the configuration file, and the computed input splits, to the jobtracker’s filesystem in a directory named after the job ID (step 3).
• Tells the jobtracker that the job is ready for execution by calling submitJob() on JobTracker (step 4).
52 MapReduce
• When the JobTracker receives a call to its submitJob() method, it puts the job into an internal queue from where the job scheduler will pick it up and initialize it.
• Initialization involves creating an object to represent the job being run, which encapsulates its tasks, and bookkeeping information to keep track of the tasks’ status and progress (step 5).
• To create the list of tasks to run, the job scheduler first retrieves the input splits computed by the JobClient from the shared filesystem (step 6).
• It then creates one map task for each split. The number of reduce tasks to create is determined by the mapred.reduce.tasks property in the JobConf, and the scheduler simply creates this number of reduce tasks to be run.
• Tasks are given IDs at this point.
53 MapReduce
• Tasktrackers run a simple loop that periodically sends heartbeat method calls to the jobtracker.
• Heartbeats tell the jobtracker that a tasktracker is alive.
• The jobtracker allocates a task to the tasktracker using the heartbeat return value (step 7).
• Before it can choose a task for the tasktracker, the jobtracker must choose a job to select the task from.
• Tasktrackers have a fixed number of slots for map tasks and for reduce tasks (the precise number depends on the number of cores and the amount of memory on the tasktracker).
• The default scheduler fills empty map task slots before reduce task slots.
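The "map slots before reduce slots" rule can be illustrated with a tiny decision function. This is only a toy model of what the jobtracker might return in a heartbeat response; the function assign_task, its parameters, and the task names are hypothetical and do not correspond to Hadoop's scheduler classes.

```python
def assign_task(free_map_slots, free_reduce_slots, pending_maps, pending_reduces):
    """Return the task handed back in a heartbeat response, or None if nothing fits."""
    if free_map_slots > 0 and pending_maps:
        return pending_maps.pop(0)     # empty map slots are filled first
    if free_reduce_slots > 0 and pending_reduces:
        return pending_reduces.pop(0)  # then reduce slots
    return None

pending_maps = ["map_0", "map_1"]
pending_reduces = ["reduce_0"]
print(assign_task(1, 1, pending_maps, pending_reduces))  # map_0
print(assign_task(0, 1, pending_maps, pending_reduces))  # reduce_0
print(assign_task(0, 0, pending_maps, pending_reduces))  # None
```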
54 MapReduce: Data locality
• For a map task, the jobtracker takes account of the tasktracker’s network location and picks a task whose input split is as close as possible to the tasktracker.
• In the optimal case, the task is data-local: it runs on the same node that the split resides on.
• Alternatively, the task may be rack-local: on the same rack, but not the same node, as the split.
• Some tasks are neither data-local nor rack-local and retrieve their data from a different rack from the one they are running on.
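The three locality levels can be expressed as a small classifier plus a "pick the closest split" helper. The sketch assumes locations are /rack/node strings and that each split lists the hosts holding its block replicas; locality, pick_split, and the sample topology are invented for the illustration and are not Hadoop's scheduling code.

```python
def locality(task_tracker, split_hosts):
    """Classify a map task placement; locations are /rack/node strings."""
    if task_tracker in split_hosts:
        return "data-local"
    rack = task_tracker.rsplit("/", 1)[0]
    if any(host.rsplit("/", 1)[0] == rack for host in split_hosts):
        return "rack-local"
    return "off-rack"

def pick_split(task_tracker, splits):
    """Choose the split closest to the tasktracker.

    splits maps a split name to the hosts holding replicas of its block.
    """
    order = {"data-local": 0, "rack-local": 1, "off-rack": 2}
    return min(splits, key=lambda name: order[locality(task_tracker, splits[name])])

splits = {"s1": ["/r2/n5"], "s2": ["/r1/n2"], "s3": ["/r1/n1", "/r2/n6"]}
print(pick_split("/r1/n1", splits))  # s3 (data-local)
print(pick_split("/r1/n3", splits))  # s2 (rack-local)
```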
55 MapReduce: Running the task
The tasktracker runs an assigned task as follows:
• First, it localizes the job JAR by copying it from the shared filesystem to the tasktracker’s filesystem.
• Second, it creates a local working directory for the task, and un-jars the contents of the JAR into this directory.
• Third, it creates an instance of TaskRunner to run the task.
• TaskRunner launches a new Java Virtual Machine (step 9) to run each task in (step 10), so that any bugs in the user-defined map and reduce functions don’t affect the tasktracker (by causing it to crash or hang, for example).
• It is possible to reuse the JVM between tasks.
• The child process communicates with its parent every few seconds to inform it of the task’s progress, until the task is complete.
56 MapReduce: Job completion
• When the jobtracker receives a notification that the last task for a job is complete, it changes the status for the job to “successful.”
• When the JobClient polls for status, it learns that the job has completed successfully, so it prints a message to tell the user and then returns from the runJob() method.
• Last, the jobtracker cleans up its working state for the job and instructs tasktrackers to do the same (so intermediate output is deleted, for example).
57 MapReduce: Back to the Weather Dataset
• The same program will run, without alteration, on a full cluster.
• This is the point of MapReduce: it scales to the size of your data and the size of your hardware.
• On a 10-node EC2 cluster running High-CPU Extra Large Instances, the program took six minutes to run.
58 MapReduce: Hadoop implementations around
eBay:
• 532-node cluster (8 * 532 cores, 5.3 PB).
• Heavy usage of Java MapReduce, Pig, Hive, HBase.
• Used for search optimization and research.
Facebook:
• Uses Hadoop to store copies of internal log and dimension data sources, and uses it as a source for reporting/analytics and machine learning.
• Current major clusters:
• 1100-machine cluster with 8800 cores and about 12 PB raw storage.
• 300-machine cluster with 2400 cores and about 3 PB raw storage.
• Each (commodity) node has 8 cores and 12 TB of storage.
59 MapReduce
LinkedIn:
• Multiple grids divided up based upon purpose.
• 120 Nehalem-based Sun x4275, with 2x4 cores, 24GB RAM, 8x1TB SATA
• 580 Westmere-based HP SL 170x, with 2x4 cores, 24GB RAM, 6x2TB SATA
• 1200 Westmere-based SuperMicro X8DTT-H, with 2x6 cores, 24GB RAM, 6x2TB SATA
• Software: CentOS 5.5 -> RHEL 6.1; Apache Hadoop patches -> Apache Hadoop patches; Pig 0.9 heavily customized; Hive, Avro, Kafka, and other bits and pieces...
• Used for discovering People You May Know and other fun facts.
Yahoo!:
• More than 100,000 CPUs in >40,000 computers running Hadoop.
• Biggest cluster: 4500 nodes (2*4cpu boxes w 4*1TB disk & 16GB RAM).
• Used to support research for Ad Systems and Web Search.
• Also used to do scaling tests to support development of Hadoop on larger clusters.
60 MapReduce and Parallel DBMS systems
• In the mid-1980s the Teradata and Gamma projects pioneered a new architectural paradigm for parallel database systems based on a cluster of commodity computer nodes.
• Those were called “shared-nothing nodes” (each with separate CPU, memory, and disks), connected only through a high-speed interconnect.
• Every parallel database system built since then essentially uses the techniques first pioneered by these two projects:
• Horizontal partitioning of relational tables: distribute the rows of a relational table across the nodes of the cluster so they can be processed in parallel.
• Partitioned execution of SQL queries: selection, aggregation, join, projection, and update queries are distributed among the nodes, and the results are sent back to a “master” node for merging.
61 MapReduce and Parallel DBMS systems
Parallel DBMS:
• Many commercial implementations are available, including Teradata, Netezza, DataAllegro (Microsoft), ParAccel, Greenplum, Aster, Vertica, and DB2.
• All run on shared-nothing clusters of nodes, with tables horizontally partitioned over them.
MapReduce:
• An attractive quality of the MR programming model is simplicity; an MR program consists of only two functions, Map and Reduce, written by a user to process key/value data pairs.
• The input data set is stored in a collection of partitions in a distributed file system deployed on each node in the cluster.
• The program is then injected into a distributed-processing framework and executed.
62 MapReduce and Parallel DBMS systems
MR – Parallel DBMS comparison:
• Filtering and transformation of individual data items (tuples in tables) can be executed by a modern parallel DBMS using SQL.
• For Map operations not easily expressed in SQL, many DBMSs support user-defined functions (UDFs); UDF extensibility provides the equivalent functionality of a Map operation.
• SQL aggregates augmented with UDFs and user-defined aggregates provide DBMS users the same MR-style reduce functionality.
• Lastly, the reshuffle that occurs between the Map and Reduce tasks in MR is equivalent to a GROUP BY operation in SQL.
• Given this, parallel DBMSs provide the same computing model as MR, with the added benefit of using a declarative language (SQL).
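The claimed equivalence between the MR reshuffle-plus-reduce and a SQL GROUP BY aggregate can be checked on a toy data set. The sketch below uses Python's built-in sqlite3 module purely as a convenient single-node stand-in; it is not a parallel DBMS, and the table name readings and the sample rows are made up for the example.

```python
import sqlite3

rows = [("1950", 0), ("1950", 22), ("1950", -11), ("1949", 111), ("1949", 78)]

# SQL version: a GROUP BY with the MAX aggregate.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (year TEXT, temp INTEGER)")
conn.executemany("INSERT INTO readings VALUES (?, ?)", rows)
sql_result = conn.execute(
    "SELECT year, MAX(temp) FROM readings GROUP BY year ORDER BY year"
).fetchall()
conn.close()

# MR-style version: group values by key, then reduce each group with max().
groups = {}
for year, temp in rows:
    groups.setdefault(year, []).append(temp)
mr_result = sorted((year, max(temps)) for year, temps in groups.items())

print(sql_result)               # [('1949', 111), ('1950', 22)]
print(sql_result == mr_result)  # True
```

The declarative query states what to compute, while the MR-style code spells out the grouping and the per-group reduction explicitly.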
63 MapReduce and Parallel DBMS systems
MR – Parallel DBMS comparison:
• As for scalability: several production databases in the multi-petabyte range are run by very large customers, operating on clusters of the order of 100 nodes.
• The people who manage these systems do not report the need for additional parallelism.
• Thus, parallel DBMSs offer great scalability over the range of nodes that customers desire.
• So why use MapReduce? Why is it used so widely?
64 MapReduce and Parallel DBMS systems
Several application classes are mentioned as possible use cases in which the MR model is a better choice than a DBMS.
ETL and “read once” data sets:
• The use of MR is characterized by the following template of five operations:
• Read logs of information from several different sources
• Parse and clean the log data
• Perform complex transformations
• Decide what attribute data to store
• Load the information into persistent storage
• These steps are analogous to the extract, transform, and load phases in ETL systems.
• The MR system is essentially “cooking” raw data into useful information that is consumed by another storage system. Hence, an MR system can be considered a general-purpose parallel ETL system.
65 MapReduce and Parallel DBMS systems
Complex analytics:
• In many data mining and data-clustering applications, the program must make multiple passes over the data.
• Such applications cannot be structured as single SQL aggregate queries.
• MR is a good candidate for such applications.
Semi-structured data:
• MR systems do not require users to define a schema for their data.
• MR-style systems easily store and process what is known as “semistructured” data.
• Such data often looks like key-value pairs, where the number of attributes present in any given record varies.
• This style of data is typical of Web traffic logs derived from different sources.
66 MapReduce and Parallel DBMS systems
Quick-and-dirty analysis:
• One disappointing aspect of many current parallel DBMSs is that they are difficult to install and configure properly, and require heavy tuning.
• The open-source MR implementation provides the best “out-of-the-box” experience: an MR system can be up and running significantly faster than the DBMSs.
• Once a DBMS is up and running properly, programmers must still write a schema for their data, then load the data set into the system.
• This process takes considerably longer in a DBMS than in an MR system.
Limited-budget operations:
• Another strength of MR systems is that most are open source projects available for free.
• DBMSs, and in particular parallel DBMSs, are expensive!
67 MapReduce and Parallel DBMS systems
Benchmark:
• The benchmark compares two parallel DBMSs to the Hadoop MR framework on a variety of tasks.
• Two database systems are used:
• Vertica, a commercial column-store relational database
• DBMS-X, a row-based database from a large commercial vendor
• All experiments run on a 100-node shared-nothing cluster at the University of Wisconsin-Madison.
Benchmark tasks: Grep task
• Representative of a large subset of the real programs written by users of MapReduce.
• For the task, each system must scan through a data set of 100-byte records looking for a three-character pattern.
• Uses a 1TB data set spread over the 100 nodes (10GB/node). The data set consists of 10 billion records of 100 bytes each.
68 MapReduce and Parallel DBMS systems Benchmark: Web log task: Conventional SQL aggregation with a GROUP BY clause on a table of user visits in a Web server log. Such data is fairly typical of Web logs, and the query is commonly used in traffic analytics. The benchmark used a 2TB data set consisting of 155 million records spread over the 100 nodes (20GB/node). Each system must calculate the total ad revenue generated for each visited IP address from the logs. Like the previous task, the records must all be read, and thus there is no indexing opportunity for the DBMSs.
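In MR terms, the GROUP BY aggregation above becomes a map step that emits one (IP, revenue) pair per visit and a reduce step that sums per key. The sketch below is a single-process illustration under assumed names: the fields `source_ip` and `ad_revenue` (revenue as whole cents) and the functions `map_visit` and `reduce_revenue` are hypothetical, not the benchmark's schema.

```python
from collections import defaultdict

def map_visit(visit):
    """Map step: emit (source IP, ad revenue) for one user-visit record."""
    yield visit["source_ip"], visit["ad_revenue"]

def reduce_revenue(pairs):
    """Reduce step: sum the revenue per IP, like SUM(...) with GROUP BY."""
    totals = defaultdict(int)
    for ip, revenue in pairs:
        totals[ip] += revenue
    return dict(totals)

# Hypothetical user-visit records.
visits = [
    {"source_ip": "10.0.0.1", "ad_revenue": 5},
    {"source_ip": "10.0.0.2", "ad_revenue": 7},
    {"source_ip": "10.0.0.1", "ad_revenue": 3},
]
pairs = [pair for visit in visits for pair in map_visit(visit)]
totals = reduce_revenue(pairs)
```

In a real cluster the framework would shuffle the pairs so that all values for one IP reach the same reducer; here a dictionary stands in for that grouping.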
69 MapReduce and Parallel DBMS systems Benchmark: Join task: A complex join operation over two tables requiring an additional aggregation and filtering operation. The user-visit data set from the previous task is joined with an additional 100GB table of PageRank values for 18 million URLs (1GB/node). The join task consists of two subtasks. In the first part of the task, each system must find the IP address that generated the most revenue within a particular date range in the user visits. Once these intermediate records are generated, the system must then calculate the average PageRank of all pages visited during this interval.
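The two subtasks can be sketched in a few lines. This is one reading of the slide's description, not the benchmark's query: the field names, the ISO date strings, and the function `join_task` are assumptions, and the sketch expects at least one visit inside the date range.

```python
from collections import defaultdict

def join_task(visits, page_rank, start, end):
    """Return (top IP, its revenue, average PageRank of pages visited in range)."""
    revenue = defaultdict(int)
    ranks = []
    for v in visits:
        if start <= v["date"] <= end:                   # filter on the date range
            revenue[v["source_ip"]] += v["ad_revenue"]  # aggregate revenue per IP
            ranks.append(page_rank[v["url"]])           # join with the PageRank table
    top_ip = max(revenue, key=revenue.get)
    return top_ip, revenue[top_ip], sum(ranks) / len(ranks)

# Hypothetical PageRank table and user visits (ISO dates compare correctly as strings).
page_rank = {"/a": 10, "/b": 20, "/c": 30}
visits = [
    {"source_ip": "10.0.0.1", "url": "/a", "date": "2000-01-16", "ad_revenue": 5},
    {"source_ip": "10.0.0.2", "url": "/b", "date": "2000-01-18", "ad_revenue": 9},
    {"source_ip": "10.0.0.1", "url": "/c", "date": "2000-01-20", "ad_revenue": 6},
    {"source_ip": "10.0.0.2", "url": "/a", "date": "2000-02-01", "ad_revenue": 50},
]
result = join_task(visits, page_rank, "2000-01-15", "2000-01-22")
```

The last visit falls outside the range, so its large revenue does not affect which IP comes out on top.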
71 MapReduce and Parallel DBMS systems Results analysis: The performance differences between Hadoop and the DBMSs can be explained by a variety of factors. These differences result from implementation choices made by the two classes of system, not from any fundamental difference in the two models: the MR processing model is independent of the underlying storage system, so data could theoretically be massaged, indexed, compressed, and carefully laid out (into a schema) on storage during a load phase, just like a DBMS. However, the goal of the study was to compare the real-life differences in performance of representative realizations of the two models.
72 MapReduce and Parallel DBMS systems Results analysis: Repetitive record parsing: One contributing factor for Hadoop’s slower performance is that the default configuration of Hadoop stores data in the accompanying distributed file system (HDFS), in the same textual format in which the data was generated. Consequently, this default storage method places the burden of parsing the fields of each record on user code. This parsing task requires each Map and Reduce task to repeatedly parse and convert string fields into the appropriate type. In DBMS systems, data resides in the schema in the converted form (conversions take place once, when loading the logs into the DB).
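The difference between parsing on every pass and parsing once at load time can be sketched as follows. The pipe-delimited record layout and all function names are assumptions for illustration only.

```python
def parse(line):
    """Convert one delimited text record into typed fields (assumed layout)."""
    ip, date, revenue = line.split("|")
    return ip, date, float(revenue)

# Hypothetical records stored as text, as in Hadoop's default configuration.
raw = ["10.0.0.1|2000-01-16|0.5", "10.0.0.2|2000-01-18|1.5"]

def total_revenue_from_text(lines):
    """MR-style with text storage: every job re-parses the same strings."""
    return sum(parse(line)[2] for line in lines)

# DBMS-style: parse once at load time, then query the typed rows repeatedly.
loaded = [parse(line) for line in raw]

def total_revenue_from_rows(rows):
    """Query over already-converted rows; no string parsing is repeated."""
    return sum(row[2] for row in rows)
```

Both functions return the same total, but only the first one pays the string-to-float conversion on every run.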
73 MapReduce and Parallel DBMS systems Results analysis: Compression: Enabling data compression in the DBMSs delivered a significant performance gain. The benchmark results show that using compression in Vertica and DBMS-X on these workloads improves performance by a factor of two to four. On the other hand, Hadoop often executed slower when compression was used on its input files; at most, compression improved performance by 15%. Why? Commercial DBMSs use carefully tuned compression algorithms to ensure that the cost of decompressing tuples does not offset the performance gains from the reduced I/O cost of reading compressed data. UNIX gzip (used by Hadoop) does not make these optimizations.
74 MapReduce and Parallel DBMS systems Results analysis: Pipelining: All parallel DBMSs operate by creating a query plan that is distributed to the appropriate nodes at execution time. Data is “pushed” (streamed) by the producer node to the consumer node, and the intermediate data is never written to disk. In MR systems, the producer writes the intermediate results to local data structures, and the consumer subsequently “pulls” the data. These data structures are often quite large, so the system must write them out to disk, introducing a potential bottleneck. Writing data structures to disk gives Hadoop a convenient way to checkpoint the output of intermediate map jobs, improving fault tolerance, but it adds significant performance overhead!
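The contrast between streaming and materializing intermediate data can be sketched in one process. This is an analogy only, not how either class of system is implemented: a Python generator stands in for the pushed stream, a temporary file stands in for the on-disk intermediate results, and all names are hypothetical.

```python
import json
import os
import tempfile

def producer(n):
    """Produce n intermediate records one at a time."""
    for i in range(n):
        yield {"key": i % 2, "value": i}

def pipelined_sum(n):
    """DBMS-style: records stream from producer to consumer; nothing is stored."""
    return sum(rec["value"] for rec in producer(n))

def materialized_sum(n):
    """MR-style: the producer writes intermediate results to a file,
    and the consumer later pulls them back from it."""
    with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as f:
        for rec in producer(n):
            f.write(json.dumps(rec) + "\n")
        path = f.name
    try:
        with open(path) as f:
            return sum(json.loads(line)["value"] for line in f)
    finally:
        os.remove(path)  # clean up the intermediate file
```

Both paths compute the same answer. The materialized path adds a write and a read, but if the consumer failed it could restart from the file instead of rerunning the producer.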
75 MapReduce and Parallel DBMS systems Results analysis: Column-oriented storage: In a column store-based database (such as Vertica), the system reads only the attributes necessary for solving the user query. This limited need for reading data represents a considerable performance advantage over traditional, row-stored databases, where the system reads all attributes off the disk. DBMS-X and Hadoop/HDFS are both essentially row stores, while Vertica is a column store, giving Vertica a significant advantage over the other two systems in the benchmark.
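A toy comparison of the two layouts, counting how many stored values each one touches to answer a single-attribute query. The in-memory lists and the names `sum_from_rows` and `sum_from_columns` are assumptions for illustration; real systems differ in many other respects.

```python
# Hypothetical table with three attributes per row.
rows = [
    {"source_ip": "10.0.0.1", "url": "/home", "ad_revenue": 5},
    {"source_ip": "10.0.0.2", "url": "/cart", "ad_revenue": 7},
]

# Column layout of the same table: one list per attribute.
columns = {name: [row[name] for row in rows] for name in rows[0]}

def sum_from_rows(rows, attr):
    """Row store: every attribute of every row is read to reach one field."""
    values_read = sum(len(row) for row in rows)
    return sum(row[attr] for row in rows), values_read

def sum_from_columns(columns, attr):
    """Column store: only the requested attribute's values are read."""
    values = columns[attr]
    return sum(values), len(values)
```

For this query the row layout touches six values and the column layout two; the gap widens as the table gains attributes the query does not need.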
76 MapReduce and Parallel DBMS systems Conclusions: Most of the architectural differences between the two systems are the result of the different focuses of the two classes of system. Parallel DBMSs excel at efficient querying of large data sets; MR-style systems excel at complex analytics and ETL tasks. MR systems are also more scalable, faster to learn and install, and much cheaper, both in software COGS and in the ability to utilize legacy machines, deployment environments, etc. The two technologies are complementary, and MR-style systems are expected to perform ETL and live directly upstream from DBMSs.