Map/Reduce in Practice
Hadoop, HBase, MongoDB, Accumulo, and related Map/Reduce-enabled data stores
How we got here
– At Google: Map/Reduce built on GFS, with BigTable layered on top for structured storage
– Open-source counterparts: Hadoop (Map/Reduce) on HDFS, with HBase providing the BigTable role
– HBase in turn inspired Accumulo; Cassandra, MongoDB, and related stores round out the family
In the beginning was the Google
Larry and Sergey had a lot of data:
– Needed fast, distributed storage for large files
– Needed location awareness
– GFS was born
Processing that data
Needed some way to process it all efficiently:
– Move the processing to the data
– Distribute the processing
– Transfer only minimal results
– Map/Reduce
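The bullets above can be sketched in a few lines of Python: a toy, single-process Map/Reduce doing word count (the canonical example), with an explicit shuffle step between the map and reduce phases. The function names and input data are illustrative, not Hadoop's API.

```python
from itertools import groupby
from operator import itemgetter

def map_fn(_, line):
    # Map phase: emit a (word, 1) pair for every word in an input line.
    for word in line.split():
        yield word.lower(), 1

def reduce_fn(word, counts):
    # Reduce phase: sum the counts for a single key.
    yield word, sum(counts)

def mapreduce(records, map_fn, reduce_fn):
    # Shuffle: sort/group the intermediate pairs by key, as the
    # framework would between the map and reduce phases.
    intermediate = sorted(
        pair for key, value in records for pair in map_fn(key, value)
    )
    results = {}
    for key, group in groupby(intermediate, key=itemgetter(0)):
        for k, v in reduce_fn(key, (v for _, v in group)):
            results[k] = v
    return results

lines = ["the cat sat", "the cat"]
print(mapreduce(enumerate(lines), map_fn, reduce_fn))
# {'cat': 2, 'sat': 1, 'the': 2}
```

In a real cluster the map and reduce calls run as tasks on different worker nodes and the shuffle moves data over the network; the point of "move processing to the data" is that only the small intermediate pairs travel, not the raw input.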
Files are good, structure is better
Map/Reduce naturally produces and operates on structured data (key => value pairs):
– Needed a way to store and access that data efficiently
– BigTable: a compressed, sparse, distributed, multidimensional store
Open, sort of
Google told the world about this great stuff:
– Dean, Jeffrey and Ghemawat, Sanjay. “MapReduce: Simplified Data Processing on Large Clusters,” OSDI'04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, December 2004.
– Chang, Fay et al. “Bigtable: A Distributed Storage System for Structured Data,” OSDI'06: Seventh Symposium on Operating System Design and Implementation, Seattle, WA, November 2006.
But they weren’t sharing the implementations
Hadoop: Map/Reduce for the masses
Open-source Apache project:
– Derived from the Google papers
– Consists of the Hadoop kernel, MapReduce, and HDFS
– Plus related projects: Hive, HBase, ZooKeeper, etc.
MapReduce Layer
Takes jobs, which are split into tasks:
– Tasks are executed on worker nodes that, ideally, store the data the task needs to process
– If that’s not possible, the task is scheduled on a worker node in the same rack as the data
– A task may be a map task or a reduce task, depending on what the JobTracker needs at the time
HDFS Layer
Consists of a NameNode, a Secondary NameNode (which checkpoints the namespace metadata), and DataNodes:
– DataNodes hold redundant copies of the data: generally two copies on one rack and a third copy on a different rack
– Exposes data-location information to the JobTracker so tasks can be distributed to workers close to the data
– Not a POSIX file system, and can’t be mounted directly
Other Storage
Hadoop is flexible about which storage system is used:
– Alternatives include Amazon S3, CloudStore, an FTP filesystem, and read-only HTTP(S) filesystems
– Of these, only HDFS and CloudStore are rack-aware
HDFS isn’t restricted to Hadoop, either: HBase and other projects use it as storage
HBase
Basically open-source BigTable:
– Non-relational, distributed, sparse, multidimensional, compressed data
– Tables can serve as input/output for MapReduce jobs run in Hadoop
– Supports Bloom filters (another idea borrowed from BigTable): a filter can tell you that a value definitely isn’t in a column, but not that it definitely is
Data Model
– Data is stored as rows with a single key, a timestamp, and multiple column families
– Data is sorted by key, but there are no other indexes
– Supports four operations: Get, Put, Scan, Delete
– Deletes don’t actually delete; they mark a row as dead (a tombstone) for a later compaction to clean up
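A minimal sketch of that data model in toy Python (not HBase's actual client API; the names `ToyStore` and `TOMBSTONE` are invented for illustration, and timestamps/column families are omitted): rows kept sorted by key, the four operations, and delete-as-tombstone with a later compaction.

```python
import bisect

TOMBSTONE = object()  # marker written by delete(); removed at compaction

class ToyStore:
    """Sorted key/value store illustrating HBase's four operations."""
    def __init__(self):
        self.keys = []     # kept sorted, since rows are sorted by key
        self.values = {}

    def put(self, key, value):
        if key not in self.values:
            bisect.insort(self.keys, key)
        self.values[key] = value

    def get(self, key):
        value = self.values.get(key)
        return None if value is TOMBSTONE else value

    def scan(self, start, stop):
        # Range scan over the sorted key space -- the only efficient
        # access path besides a point Get.
        i = bisect.bisect_left(self.keys, start)
        j = bisect.bisect_left(self.keys, stop)
        return [(k, self.values[k]) for k in self.keys[i:j]
                if self.values[k] is not TOMBSTONE]

    def delete(self, key):
        # Delete only marks the row dead; compaction removes it later.
        if key in self.values:
            self.values[key] = TOMBSTONE

    def compact(self):
        dead = [k for k, v in self.values.items() if v is TOMBSTONE]
        for k in dead:
            self.keys.remove(k)
            del self.values[k]
```

Note that after `delete()` the row still occupies space and `get()` must check for the tombstone; only `compact()` physically reclaims it, which is why HBase deletes are cheap but space isn't freed immediately.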
Digression: Bloom Filters
Maintains a bit array, like a hash table:
– Each item, when inserted into the column, is hashed with k different functions, and each resulting index bit is set to 1
– To test whether a value is in the table, hash it with the k functions and check whether all the indexed bits are 1; if one or more is 0, the value definitely isn’t there
– But there is a non-zero probability that all k bits are 1 even though the value was never inserted (a false positive)
– Insert-only: you can never remove an item, since you don’t know which of its bits were also set by other entries
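A minimal sketch of the structure described above, assuming k salted SHA-256 digests stand in for the k independent hash functions (real implementations use cheaper non-cryptographic hashes; the class name and parameters are illustrative):

```python
import hashlib

class BloomFilter:
    def __init__(self, m=1024, k=3):
        self.m = m             # number of bits in the array
        self.k = k             # number of hash functions
        self.bits = [0] * m

    def _indexes(self, item):
        # Derive k index positions by salting one hash with i.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, item):
        # Insert: set all k index bits to 1. There is no remove().
        for idx in self._indexes(item):
            self.bits[idx] = 1

    def might_contain(self, item):
        # False means definitely absent; True means only "probably present".
        return all(self.bits[idx] for idx in self._indexes(item))
```

An inserted item always tests positive; a never-inserted item usually tests negative but can collide with bits set by other entries, which is exactly the false-positive case the slide describes.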
So, why bother?
Column scans are expensive, and scanning is about the only way to find a value in a column other than the row key; a Bloom filter lets you skip a column entirely whenever the value is provably absent.
Accumulo
HBase for the NSA:
– Provides basically the same functionality as HBase, but with security
– Adds a new element to the key, the column visibility: a logical combination of security labels that must be satisfied at query time for the key/value pair to be returned
– Hence a single table can store data at various security levels, and users see only what they’re allowed to see
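The visibility check can be sketched like this (a simplification: real Accumulo parses string expressions such as `secret&(us|nato)` at query time; here the expression is pre-parsed into a disjunction of label sets, and the function name `visible` is invented for illustration):

```python
def visible(column_visibility, user_auths):
    """Simplified Accumulo-style column-visibility check.

    column_visibility is a disjunction of conjunctions, e.g.
    [{"secret", "us"}, {"nato"}] means (secret AND us) OR nato.
    The key/value pair is returned only if the user's authorizations
    satisfy at least one clause.
    """
    return any(clause <= user_auths for clause in column_visibility)

# A hypothetical cell whose visibility is (secret AND us) OR nato:
vis = [{"secret", "us"}, {"nato"}]
print(visible(vis, {"secret", "us"}))    # True
print(visible(vis, {"nato", "public"}))  # True
print(visible(vis, {"secret"}))          # False: 'us' is missing
```

Because the check happens per key/value pair at query time, rows at different classification levels can coexist in one table, as the slide says.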
Cassandra
– A lot like HBase, with the same BigTable inspiration, but also inspired by Amazon Dynamo (Amazon’s cloud key/value store)
– Also has column families (and even super columns), but allows secondary indexes
– Distribution and replication are tunable
– Writes are faster than reads, so it’s good for logging and similar workloads
Cassandra vs. HBase
Basically comes down to the CAP theorem: you have to pick two of Consistency, Availability, and Partition tolerance; you can’t have all three.
– Cassandra chooses AP: by default it provides weak consistency, though you can get stronger consistency if you can tolerate greater latency
– HBase values CP, but availability may suffer: in the event of a partition (such as a node failure), data won’t be served if it can’t be guaranteed consistent with committed operations
Conclusion
– There are a lot of options out there, and more all the time
– An RDBMS offers the most functionality, but stumbles on scalability
– Key/value stores scale, but require a different processing model
– The best option is determined by the combination of your data and your task