ETM 555: Hadoop


1 ETM 555 1 Hadoop

2 Data Explosion
- An IDC estimate put the size of the “digital universe” at 0.18 zettabytes in 2006, forecasting a tenfold growth to 1.8 zettabytes by 2011.
- The New York Stock Exchange generates about one terabyte of new trade data per day.
- Facebook hosts approximately 10 billion photos, taking up one petabyte of storage.
- The Internet Archive stores around 2 petabytes of data and is growing at a rate of 20 terabytes per month.
- The Large Hadron Collider near Geneva, Switzerland, will produce about 15 petabytes of data per year.

3 Hadoop Projects
- Common: A set of components and interfaces for distributed filesystems and general I/O (serialization, Java RPC, persistent data structures).
- Avro: A serialization system for efficient, cross-language RPC and persistent data storage.
- MapReduce: A distributed data processing model and execution environment that runs on large clusters of commodity machines.
- HDFS: A distributed filesystem that runs on large clusters of commodity machines.
- Pig: A data flow language and execution environment for exploring very large datasets. Pig runs on HDFS and MapReduce clusters.
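The MapReduce processing model described above can be illustrated with a small single-process sketch. This is not Hadoop code and does not use any Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) are hypothetical and chosen only to show the map, group-by-key, and reduce steps on a word count.

```python
from collections import defaultdict


def map_phase(record):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle, and reduce in sequence over an in-memory dataset."""
    mapped = (pair for record in records for pair in map_phase(record))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    lines = ["hadoop runs mapreduce", "mapreduce runs on hdfs"]
    print(run_job(lines))
```

On a real cluster the map and reduce functions would run in parallel on many machines over data stored in HDFS; here everything runs in one process purely for illustration.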

4 Hadoop Projects
- Hive: A distributed data warehouse. Hive manages data stored in HDFS and provides a query language based on SQL, which the runtime engine translates into MapReduce jobs, for querying the data.
- HBase: A distributed, column-oriented database. HBase uses HDFS for its underlying storage and supports both batch-style computations using MapReduce and point queries (random reads).
- ZooKeeper: A distributed, highly available coordination service. ZooKeeper provides primitives such as distributed locks that can be used for building distributed applications.
- Sqoop: A tool for efficiently moving data between relational databases and HDFS.
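To make the idea of translating an SQL-style query into a MapReduce job more concrete, here is a conceptual sketch. It assumes a hypothetical table `records(year, temperature)` and a query of the form `SELECT year, MAX(temperature) FROM records GROUP BY year`. It does not reproduce Hive's actual query planner; it only shows how a GROUP BY aggregate can be expressed as a map step that emits the grouping key and a reduce step that applies the aggregate.

```python
from collections import defaultdict

# Hypothetical rows of a table records(year, temperature); illustrative data only.
ROWS = [(1949, 111), (1949, 78), (1950, 0), (1950, 22), (1950, -11)]


def map_row(row):
    """Map step: emit the GROUP BY column as key and the aggregated column as value."""
    year, temperature = row
    return year, temperature


def reduce_group(year, temperatures):
    """Reduce step: apply the aggregate function (MAX) to one group."""
    return year, max(temperatures)


def max_temperature_by_year(rows):
    """Evaluate the aggregate query as a map step, a grouping step, and a reduce step."""
    groups = defaultdict(list)
    for year, temperature in map(map_row, rows):
        groups[year].append(temperature)
    return dict(reduce_group(year, temps) for year, temps in sorted(groups.items()))


if __name__ == "__main__":
    print(max_temperature_by_year(ROWS))
```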

5 RDBMS Compared to MapReduce
- MapReduce can be seen as a complement to an RDBMS.
- MapReduce is a good fit for problems that need to analyze the whole dataset in a batch fashion, particularly for ad hoc analysis.
- An RDBMS is good for point queries or updates, where the dataset has been indexed to deliver low-latency retrieval and update times for a relatively small amount of data.
- MapReduce suits applications where the data is written once and read many times, whereas a relational database is good for datasets that are continually updated.
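The two access patterns contrasted above can be sketched in a few lines. This is a toy in-memory analogy, not a database or Hadoop implementation: the `accounts` data and the function names are hypothetical, a Python dict stands in for an RDBMS index, and a full pass over the list stands in for a batch job that reads the whole dataset.

```python
def build_index(rows):
    """Index rows by primary key so a single row can be found without scanning."""
    return {row["id"]: row for row in rows}


def point_query(index, row_id):
    """RDBMS-style access: fetch one row through the index."""
    return index.get(row_id)


def point_update(index, row_id, balance):
    """RDBMS-style access: update one row in place through the index."""
    index[row_id]["balance"] = balance


def batch_total(rows):
    """MapReduce-style access: read every row once to answer a whole-dataset question."""
    return sum(row["balance"] for row in rows)


if __name__ == "__main__":
    accounts = [
        {"id": 1, "balance": 100},
        {"id": 2, "balance": 250},
        {"id": 3, "balance": 50},
    ]
    index = build_index(accounts)
    print(point_query(index, 2))   # touches one row
    print(batch_total(accounts))   # touches all rows
```

The point query touches a single row regardless of dataset size, while the batch total must read everything, which is the kind of work that benefits from being spread across a cluster.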

6 RDBMS Compared to MapReduce

