
Presentation on theme: "The exponential growth of data – Challenges for Google, Yahoo, Amazon & Microsoft in web search and indexing" — Presentation transcript:

1

2 The exponential growth of data – challenges for Google, Yahoo, Amazon & Microsoft in web search and indexing
– The volume of data being made publicly available increases every year; organizations' success in the future will be dictated to a large extent by their ability to extract value from other organizations' data.
– The three Vs of big data: Volume, Variety and Velocity.
Data Storage & Analysis
– The storage capacity of hard drives has increased, but access speeds have not kept up.
– A 1 TB drive is now the norm and its transfer speed is around 100 MB/s, so it takes more than two and a half hours to read all the data from a single disk; at that rate, petabytes or zettabytes of data would take far too long to read.
– The alternative: read from multiple disks in parallel.
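A quick worked check of that arithmetic, and of the gain from reading disks in parallel (the 1 TB and 100 MB/s figures come from the slide above; the count of 100 disks is only illustrative):

$$\frac{1\,\mathrm{TB}}{100\,\mathrm{MB/s}} = \frac{10^{6}\,\mathrm{MB}}{100\,\mathrm{MB/s}} = 10^{4}\,\mathrm{s} \approx 2.8\ \text{hours on a single disk}$$

$$\frac{10^{4}\,\mathrm{s}}{100\ \text{disks read in parallel}} = 100\,\mathrm{s} \approx 1.7\ \text{minutes}$$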

3 Data Storage & Analysis
– Problems in reading from and writing to multiple disks: with many pieces of hardware, failures are common, so the probability of data loss is high.
– The solution to data loss is replication (the same idea RAID relies on): keep redundant copies of the data so a copy survives a failure.
– Most analyses also need to combine data drawn from many disks, which brings its own challenges.
– What is needed is a reliable, shared storage and analysis system. Hello, Hadoop!
– Hadoop began in the Nutch project by Doug Cutting, drawing on Google's GFS and MapReduce work on distributed data storage and processing; development continued at Yahoo, and it is now the Apache Hadoop open-source framework. "Hadoop" is a made-up name.
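A minimal sketch of how that replication idea is expressed in Hadoop's HDFS (assumes a running HDFS cluster, the Hadoop client libraries on the classpath, and an illustrative file path; the factor of 3 is HDFS's usual default):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Each block of every new file will be stored on 3 different DataNodes,
        // so losing one disk or machine does not lose the data.
        conf.set("dfs.replication", "3");
        FileSystem fs = FileSystem.get(conf);

        // The replication factor can also be changed per file after it is written.
        fs.setReplication(new Path("/data/input.txt"), (short) 3);
        fs.close();
    }
}
```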

4 Hadoop vs. RDBMS vs. HPC/Grid/Volunteer computing
HADOOP
– Best fit for ad-hoc analysis; data is written once and read many times
– Handles a variety of data, petabytes of it, with batch analysis
– Dynamic schema; data modelled as key/value pairs
– Data locality; the data flow is implicit
– Shared-nothing architecture; scales out on commodity hardware
RDBMS
– Good for low-latency access to organized/structured data
– Gigabytes of data; interactive and batch queries
– Static schema; data held in table structures
– Scaling is expensive
HPC, GRID & VOLUNTEER COMPUTING
– Distribute work across a cluster, but for data-intensive applications network bandwidth becomes the bottleneck and compute nodes sit idle
– MPI (Message Passing Interface) gives flexibility, but the data flow must be programmed explicitly, which adds complexity
– Volunteer computing (e.g. SETI@home) runs on untrusted machines: volunteers donate CPU cycles, not bandwidth, and there is no data locality

5 HDFS & MapReduce
HDFS (built on top of an existing file system)
– Designed for: streaming data access patterns, very large files, commodity hardware, high throughput rather than low latency
– Not a good fit for: lots of small files, low-latency data access, multiple writers
MapReduce
– 1) MAP 2) REDUCE 3) code for the MR job 4) automatic parallelization 5) fault tolerance
– Jobs can be written in Java, Python, etc.; the housekeeping (splitting the input, scheduling, moving data, handling failures) is built in
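Below is a minimal sketch of the classic word-count job using Hadoop's MapReduce Java API (class names, input/output paths and the job name are illustrative): the mapper emits (word, 1) key/value pairs, the reducer sums them, and the driver is the "code for the MR job"; parallelization and fault tolerance are handled by the framework.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // MAP: for each input line, emit a (word, 1) pair per word.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // REDUCE: sum the counts collected for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    // CODE for the MR JOB: the driver wires mapper and reducer together.
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. /data/input
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // e.g. /data/output
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```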

6 HDFS Architecture
– HDFS block size is 64 MB–128 MB. Why is it so large? A large block keeps the time spent seeking small compared with the time spent transferring data, and keeps the amount of block metadata the NameNode must track manageable.
– NameNode: holds the filesystem namespace and the mapping of blocks to DataNodes.
– Secondary NameNode: periodically merges the namespace image with the edit log on behalf of the NameNode.
– Client: asks the NameNode for metadata, then reads and writes block data directly with the DataNodes.
– DataNodes: store the blocks and send heartbeats to the NameNode, which uses them to drive block replication and balancing.
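A minimal sketch of that read path from the client's side (assumes a reachable HDFS cluster and an illustrative file path): the client asks the NameNode for the file's block size and block locations, then would stream each block from the DataNodes listed.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();     // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/data/big-input.txt");  // illustrative path
        FileStatus status = fs.getFileStatus(file);
        System.out.println("Block size: " + status.getBlockSize() + " bytes");

        // The NameNode reports which DataNodes hold each block;
        // actual block data is then read from those DataNodes directly.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
```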

