S-1 A Novel Approach to Improving the Efficiency of Storing and Accessing Small Files on Hadoop: a Case Study by PowerPoint Files. Authors: Bo Dong, Jie Qiu, Qinghua Zheng, Xiao Zhong, Jingwei Li, Ying Li. Presenters: Yue Zhu, Linghan Zhang

S-2 Outline  Background & Motivation  Design  Experimental Evaluation  Conclusions

S-3 Hadoop Distributed File System (HDFS)  Very large distributed file system –10K nodes, 100 million files, 10 PB  Design pattern: master-slave  Two types of machines in an HDFS cluster –NameNode (the master): the heart of an HDFS file system; it maintains and manages the file system metadata, e.g., which blocks make up a file and on which DataNodes those blocks are stored. –DataNode (the slave): where HDFS stores the actual data; there are usually many of these.
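
As a concrete illustration of the NameNode's role, here is a minimal Java sketch (the cluster address and file path are illustrative, not from the paper) that asks the NameNode which blocks make up a file and which DataNodes hold them:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLookup {
    public static void main(String[] args) throws Exception {
        // connect to the cluster; the NameNode address is illustrative
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), new Configuration());
        FileStatus st = fs.getFileStatus(new Path("/user/demo/lecture.ppt"));
        // answered from the NameNode's in-memory metadata:
        // which blocks make up the file, and on which DataNodes they live
        for (BlockLocation loc : fs.getFileBlockLocations(st, 0, st.getLen())) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    loc.getOffset(), loc.getLength(), String.join(",", loc.getHosts()));
        }
    }
}
```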

S-4 HDFS – Data Organization  HDFS is a block-structured file system.  A file is split into one or more blocks, which are stored across a cluster of machines with data storage capacity.  Each block of a file is replicated across a number of machines to improve fault tolerance.
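
For instance, the per-file replication factor can be inspected and changed through the same client API; a small sketch, with an illustrative path:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path p = new Path("/user/demo/lecture.ppt");      // illustrative path
        short current = fs.getFileStatus(p).getReplication();
        fs.setReplication(p, (short) 3);                  // request 3 replicas per block
        System.out.println("replication was " + current + ", now set to 3");
    }
}
```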

S-5 Problem of Small Files in HDFS  Disk utilization –A 2.5 MB file is stored with a block size of 64 MB. It occupies only 2.5 MB of disk space, but other files cannot be written into the free space of that 64 MB block.  High memory usage –The metadata size for each file is 368 bytes in memory, including the default 3 replicas (16 GB of memory for 24 million files). –DataNodes periodically send reports with their block lists to the NameNode, which gathers the reports and stores them in memory.  High access costs –When reading a file, the HDFS client first consults the NameNode for the file metadata; this happens once for every file access. –Logically consecutive files are usually not placed sequentially at the block level, but scattered across different DataNodes.
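
The access cost is visible from the client side: every open of a small file is a separate metadata round trip to the NameNode before any data moves. A sketch, assuming an illustrative directory of small picture files:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SmallFileReads {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        byte[] buf = new byte[4096];
        // every fs.open() consults the NameNode for that one file's metadata,
        // so reading N small files costs N metadata round trips
        for (FileStatus st : fs.listStatus(new Path("/user/demo/pictures"))) {
            try (FSDataInputStream in = fs.open(st.getPath())) {
                while (in.read(buf) > 0) { /* consume the picture bytes */ }
            }
        }
    }
}
```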

S-6 A Case Study on PPT Files  A PPT file is converted into a set of preview pictures.  A PPT file and all its pictures together form a PPT courseware.

S-7 Outline  Background & Motivation  Design  Experimental Evaluation  Conclusions

S-8 The novel approach – Basic idea  MERGING small files into larger ones.  PREFETCHING to reduce the load on the NameNode and improve access efficiency. Example: file number N → 3.

S-9 The novel approach – A sharing system  An HDFS-based file sharing system can be structured as: –User interface layer: interfaces for uploading, browsing, downloading, etc. –Business process layer: file converting, file merging, file naming, web server and cache functions; communicates with HDFS through HDFS clients. –Storage layer: persistence functions, using the HDFS cluster.

S-10 The novel approach – Uploading ① The PPT file arrives at the web server; ② the PPT is converted into multiple picture series; ③ a local index file is built and the files are merged; ④ the merged file is uploaded to HDFS.
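
A minimal sketch of steps ③–④, assuming a fixed-length local index of (offset, length) pairs; the record layout here is an assumption, not the paper's exact format:

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.List;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class MergeAndUpload {
    // merge a PPT file and its pictures into one HDFS file, local index first
    static void upload(FileSystem fs, Path dst, List<File> files) throws IOException {
        // fixed-length index: one (offset, length) pair per file -- layout is illustrative
        long headerLen = (long) files.size() * 2 * Long.BYTES;
        try (FSDataOutputStream out = fs.create(dst)) {
            long offset = headerLen;
            for (File f : files) {                 // write the local index first
                out.writeLong(offset);
                out.writeLong(f.length());
                offset += f.length();
            }
            for (File f : files) {                 // then append the file contents
                try (InputStream in = new FileInputStream(f)) {
                    IOUtils.copyBytes(in, out, 4096, false);
                }
            }
        }
    }
}
```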

S-11 ② Check metadata ③ tract target DataNode ④ Split file from block, return. ② Mapping ③ metadata of the merged file ④ DataNode ⑤ Fetch target block ⑥ Split ⑦ Prefetching The novel approach – Browsing. ① Check cache for the target file ② File is in Cache, browsing over

S-12 The novel approach – Downloading  The download process applies only to PPT files: –if the file has been prefetched, it is read from the cache; –if not, the download process is almost the same as the browsing process, except that no prefetching is triggered.

S-13 The novel approach – File merging  Count the files to merge: the pictures, the PPT file, and the local index file (the index has fixed-length entries).  The sum of the lengths of all files, including the local index file, is calculated and compared with the HDFS block size: –less than the HDFS block size: the merged file fits in one block in default order, and the local index can be established directly. –exceeds the HDFS block size: the merged file is broken into blocks; two strategies are adopted for this case (see the sketch below and the two strategy slides).
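
The comparison itself is simple arithmetic; a sketch using the 64 MB block size from the experiments:

```java
public class MergePlanner {
    static final long BLOCK_SIZE = 64L * 1024 * 1024;   // HDFS block size used here

    // does the merged file (local index + all member files) fit in one block?
    static boolean fitsInOneBlock(long indexLength, long[] memberLengths) {
        long total = indexLength;
        for (long len : memberLengths) total += len;
        return total <= BLOCK_SIZE;   // true: default order, index established directly
    }
}
```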

S-14 The novel approach – Strategies for file merging: Strategy 1  Target: try to keep each picture series within one block.  Method: adjust the order of the picture series, treating each series as a whole.  Steps: 1. Calculate the prefix length: prefix length = local index file + PPT file + standard-resolution picture series. 2. Compare the prefix length with the HDFS block size. Exceeds? Y: process over. N: go to step 3. 3. Adjust the order of the remaining picture series, trying to fill the HDFS block; if no order fills it, follow the default order and the process is over.
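
A rough sketch of step 3 as a greedy first-fit over series lengths; the first-fit choice is an assumption about how the order is adjusted, not the paper's exact algorithm:

```java
import java.util.ArrayList;
import java.util.List;

public class Strategy1 {
    static final long BLOCK = 64L * 1024 * 1024;

    // greedily order picture series so each series stays inside one block;
    // prefix = local index + PPT file + standard-resolution series (step 1)
    static List<Long> orderSeries(long prefixLength, List<Long> seriesLengths) {
        List<Long> remaining = new ArrayList<>(seriesLengths);
        List<Long> ordered = new ArrayList<>();
        long used = prefixLength % BLOCK;        // space used in the current block
        while (!remaining.isEmpty()) {
            Long pick = null;
            for (Long s : remaining) {           // first series that still fits
                if (used + s <= BLOCK) { pick = s; break; }
            }
            if (pick == null) {                  // nothing fits: move to the next block
                used = 0;
                pick = remaining.get(0);         // fall back to default order
            }
            remaining.remove(pick);
            ordered.add(pick);
            used += pick;
        }
        return ordered;
    }
}
```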

S-15 The novel approach – Strategies for file merging: Strategy 2  Target: try to keep each picture series within one block (same as strategy 1).  Method: vacant domains.  Steps: 1. Check the offset of each file and whether any file crosses a block boundary; if none does, the process is over, otherwise go to step 2. 2. Adjust the file order: move the file on the boundary to the next block (leaving a vacant domain), place an index file at the start of that block, and reset the following offsets to come just after it. 3. Repeat steps 1 and 2.
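
A sketch of the boundary check in step 1; modelling the vacant domain as simply pushing the straddling file's offset to the next block boundary is an assumption:

```java
public class Strategy2 {
    static final long BLOCK = 64L * 1024 * 1024;

    // if a file would straddle a block boundary, start it at the next block
    // instead; the skipped bytes are left as a vacant domain
    static long place(long offset, long length) {
        long blockEnd = ((offset / BLOCK) + 1) * BLOCK;
        if (offset + length > blockEnd && length <= BLOCK) {
            return blockEnd;          // file begins at the next block boundary
        }
        return offset;                // file fits: keep its current offset
    }
}
```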

S-16 The novel approach – File mapping WHY? Clients request original files, but HDFS stores only the merged ones. SOLUTION 1: name mapping. Following the given naming discipline, a file name has four domains: name domain + resolution domain + serial-number domain + block-number domain. Example: A_1280_05_01.jpg SOLUTION 2: a global mapping table (more general). A record is created for each original file, kept within the NameNode and also persisted on disk.
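
Under this naming discipline a member file can be located by parsing its name alone; a sketch with the field layout inferred from the example A_1280_05_01.jpg:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NameMapping {
    // name domain + resolution domain + serial-number domain + block-number domain
    private static final Pattern NAME = Pattern.compile("(\\w+)_(\\d+)_(\\d+)_(\\d+)\\.jpg");

    public static void main(String[] args) {
        Matcher m = NAME.matcher("A_1280_05_01.jpg");
        if (m.matches()) {
            System.out.println("courseware = " + m.group(1));  // name domain: A
            System.out.println("resolution = " + m.group(2));  // 1280
            System.out.println("serial no. = " + m.group(3));  // picture 05 in its series
            System.out.println("block no.  = " + m.group(4));  // block 01 of the merged file
        }
    }
}
```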

S-17 The novel approach – Prefetching WHY? Files in a PPT courseware are correlated with each other, while HDFS provides no prefetching; prefetching also saves the effort of file mapping and of interaction with the NameNode. HOW? Two-level prefetching: 1. local index file prefetching; 2. correlated file prefetching. TARGET: prefetching records (metadata and index information). WHAT IS PREFETCHED? Files in the same PPT courseware: the pictures after the target picture in the same series, plus the PPT file, plus the pictures after the target picture in all other series. WHEN IS IT TRIGGERED? During browsing: 1. check whether the file is in the cache; Y → read it, N → 2; 2. check whether a prefetching record exists; Y → correlated file prefetching, N → 3; 3. local index file prefetching is triggered.
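
The trigger logic amounts to a three-way check; a sketch in which the cache and record types are placeholders, not the paper's data structures:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class PrefetchTrigger {
    // placeholder caches: file contents, and per-file prefetching records
    private final Map<String, byte[]> fileCache = new ConcurrentHashMap<>();
    private final Map<String, long[]> prefetchRecords = new ConcurrentHashMap<>(); // (offset, length)

    byte[] browse(String name) {
        byte[] cached = fileCache.get(name);
        if (cached != null) return cached;               // 1. prefetched already: read from cache
        long[] record = prefetchRecords.get(name);
        if (record == null) {
            record = prefetchLocalIndex(name);           // 3. no record: local index prefetching
        }
        return fetchAndPrefetchCorrelated(name, record); // 2. fetch file, prefetch correlated files
    }

    // stubs standing in for the HDFS reads; see the browsing slide for the full flow
    long[] prefetchLocalIndex(String name) { return new long[] {0L, 0L}; }
    byte[] fetchAndPrefetchCorrelated(String name, long[] record) { return new byte[0]; }
}
```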

S-18 Outline  Background & Motivation  Design  Experimental Evaluation  Conclusions

S-19 Experiments – Setup  One master node (NameNode) –IBM X3650, 8 Intel Xeon CPUs @ 2.00 GHz, 16 GB memory, 3 TB disk  Eight slave nodes (DataNodes) –IBM X3610, 8 Intel Xeon CPUs @ 2.00 GHz, 8 GB memory, 3 TB disk  Ubuntu Server 9.04  Hadoop  Java  HDFS block size: 64 MB  Nodes connected by a 1.0 Gbps Ethernet network

S-20 Memory Usage Hadoop Archives (HAR): a general small-file solution that archives small files into larger files. In the HAR configuration, all files of one PPT courseware are stored in one HAR file.

S-21 Time Efficiency MSPF: milliseconds per file access.

S-22 Conclusions  The proposed approach adopts a combination of a file merging method and a two-level prefetching mechanism to mitigate the small-file problem on HDFS.  The focus of this paper is storing small files, not processing them with a MapReduce framework such as Hadoop's.  Our project not only provides efficient I/O for small files in HDFS, but also looks at how to work with small files using MapReduce.

S-23 Questions?