Digital Library Service – An Overview
- Introduction
- System Architecture
- Components and their functionalities
- Experimental Results


Introduction
Peer-to-Peer (P2P) Information Retrieval framework
- Peers that share information
- Cumulative bandwidth
- High processing power and storage
- Absence of high-cost hardware
Three generations of P2P networks

- 1st generation: centralized database for coordinated lookup (Napster)
- 2nd generation: flooding to search every node on the network (Gnutella)
- 3rd generation: Distributed Hash Tables (Tapestry, Chord, Pastry, CAN, Kademlia); each node uses routing tables to maintain the addresses of its neighbours

In 3G P2P networks, between log N and N nodes have to be contacted to reach the destination.
Proposed method
- The target peer is contacted directly from the source peer.
- The search then runs within the target peer, which retrieves the file reference from keyword indices held in a B+ tree.

System Architecture
Two clusters: a P2P cluster and a Hadoop cluster
Hadoop cluster
- Extracts keywords for efficient searching
- Uses the MapReduce programming paradigm
P2P cluster
- Uploads files
- Services search requests

[Figure: System architecture. The Hadoop cluster has one master (MapReduce JobTracker, DFS NameNode) and slaves (each a MapReduce TaskTracker and DFS DataNode); keyword extraction links the Hadoop cluster to the P2P cluster.]

Hadoop
- Software platform to handle vast amounts of data
- Moves computation to the data rather than moving large data blocks to the computation
- Built on HDFS and the MapReduce framework
  - HDFS: NameNode and DataNode
  - MapReduce computation
    - Map: the input data set is split into fragments and each fragment is assigned to a map task, which emits (K, V) pairs
    - Reduce: merges all intermediate values associated with a key

[Figure: MapReduce data flow. Blocks of documents D1 to D3 (D1,B1 ... D3,B2) are processed by map tasks (M) that emit (K, C) pairs; a sort-and-group step per document collects the values for each key (for example K2,[C1,C4] for D1); reduce tasks (R) then emit one (K, I) pair per key. The figure shows three map tasks and two reduce tasks.]
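The map, sort-and-group, and reduce stages can be imitated in a few lines of plain Python. This is only a single-process sketch, not Hadoop code: the stop-word list, the function names, and the choice of per-document counts as the reduced value are all assumptions, since the slides do not say how keywords are selected or what the reduce output contains.

```python
import re
from collections import defaultdict

# Hypothetical stop-word list; the slides do not say how keywords are chosen.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is"}

def map_block(doc_id, block_text):
    """Map: emit one (keyword, (doc_id, 1)) pair per keyword occurrence."""
    for word in re.findall(r"[a-z]+", block_text.lower()):
        if word not in STOP_WORDS:
            yield word, (doc_id, 1)

def sort_and_group(pairs):
    """Shuffle: collect all intermediate values under their key, sorted by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))

def reduce_keyword(keyword, values):
    """Reduce: merge the values of one keyword into per-document counts."""
    counts = defaultdict(int)
    for doc_id, n in values:
        counts[doc_id] += n
    return keyword, dict(counts)

def extract_keywords(blocks):
    """Run map over every (doc_id, text) block, then group and reduce."""
    pairs = [pair for doc_id, text in blocks for pair in map_block(doc_id, text)]
    grouped = sort_and_group(pairs)
    return dict(reduce_keyword(k, v) for k, v in grouped.items())

blocks = [
    ("D1", "Hadoop stores data in HDFS"),
    ("D1", "Hadoop runs MapReduce"),
    ("D2", "Chord is a DHT"),
]
index = extract_keywords(blocks)
```

In a real cluster each call to `map_block` would run as a separate map task on the node holding that block, which is the "move computation to the data" idea from the slide.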

B+ Tree – IP and its hash
- Represents sorted data indexed by a key, for efficient insertion, retrieval and removal of records
- Inserting or searching a record requires O(log_B N) operations in the worst case, where B is the order of the tree and N is the number of nodes
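To give a feel for the O(log_B N) bound, the sketch below counts how many levels a lookup would descend in an idealized tree whose nodes are completely full, so that each level multiplies the reachable entries by B. The function name and the full-node assumption are illustrative; a real B+ tree with partly filled nodes can be somewhat deeper.

```python
def levels_needed(num_entries, order):
    """Smallest number of levels L such that order**L >= num_entries.

    Uses integer arithmetic to avoid floating-point rounding in log().
    Assumes every node is completely full (idealized case).
    """
    levels, capacity = 1, order
    while capacity < num_entries:
        capacity *= order
        levels += 1
    return levels

# One million entries: a high fan-out keeps the tree very shallow.
shallow = levels_needed(1_000_000, 100)  # order 100
deep = levels_needed(1_000_000, 2)       # binary fan-out, for comparison
```

With order 100 the lookup touches 3 levels, against 20 for a binary fan-out over the same million entries.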

DLS Components
Start-up component
- Start the Hadoop cluster
- Identify the nodes that participate in the P2P cluster
- Determine the IP hash values for the peers using SHA1 (160-bit → 40-bit)
- Form the B+ tree
- Upload the B+ trees to the other peers
- Start the web server
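The slides state that a 160-bit SHA1 digest is reduced to 40 bits but not how. A minimal sketch, assuming the reduction simply keeps the first five bytes of the digest (the name `hash40` and the truncation rule are assumptions, not the authors' stated method):

```python
import hashlib

def hash40(value: str) -> int:
    """Return a 40-bit integer derived from the SHA1 digest of `value`.

    Assumption: the 160-bit digest is truncated to its first 5 bytes.
    """
    digest = hashlib.sha1(value.encode("utf-8")).digest()  # 20 bytes = 160 bits
    return int.from_bytes(digest[:5], "big")               # 5 bytes = 40 bits

peer_hash = hash40("192.168.1.10")   # hash of a peer's IP address
keyword_hash = hash40("hadoop")      # hash of a keyword, in the same 40-bit space
```

Hashing IP addresses and keywords into the same 40-bit space is what lets a keyword be matched against peers by comparing hash values.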

DB Distribution Component
- Extract keywords using the Hadoop cluster
- Hash the keywords with SHA1 (160-bit → 40-bit)
- Find the peer whose IP hash is the relatively closest match
- Upload the file to the target peer
- Update the B+ tree (keyword to file reference) in the target peer

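[Figure: Data distribution. The Hadoop cluster takes documents Doc 1 to Doc n and produces the file name and list of keywords; the search keys are hashed, the target is identified, and the document is uploaded to the target node among the peers in the P2P network.]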
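Target identification can be sketched as a nearest-neighbour lookup over the sorted IP hashes. Here a sorted Python list searched with `bisect` stands in for the B+ tree, and "closest" is taken to mean the smallest absolute difference between hash values; the truncation in `hash40`, the helper names, and the sample IP addresses are all assumptions.

```python
import bisect
import hashlib

def hash40(value):
    """40-bit hash: first 5 bytes of the SHA1 digest (assumed truncation rule)."""
    return int.from_bytes(hashlib.sha1(value.encode("utf-8")).digest()[:5], "big")

def build_peer_index(peer_ips):
    """Sorted list of (ip_hash, ip); a stand-in for the B+ tree of IP hashes."""
    return sorted((hash40(ip), ip) for ip in peer_ips)

def closest_peer(peer_index, keyword):
    """Return the IP whose hash is nearest to the keyword's hash.

    Only the predecessor and successor of the keyword hash in sorted
    order can be nearest, so at most two entries are compared.
    """
    key = hash40(keyword)
    pos = bisect.bisect_left(peer_index, (key, ""))
    candidates = peer_index[max(pos - 1, 0):pos + 1]
    return min(candidates, key=lambda entry: abs(entry[0] - key))[1]

peers = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]
peer_index = build_peer_index(peers)
target = closest_peer(peer_index, "hadoop")  # peer that should store "hadoop"
```

Because every peer holds the same index, any peer can compute the target locally and contact it directly, without hop-by-hop routing.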

Search Component
- Process the keywords
- Find the 40-bit hash value
- Search the B+ tree in the peer to identify the target node
- Search the B+ tree in the target node to retrieve the file reference

[Figure: Search. A search request with a list of keywords arrives at PEER1; the search keys are hashed and the target node is identified using the relative difference between the hash values of the keywords and of the IP addresses in the B+ tree; the document is then searched for in the target peer, PEER2.]
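The two-step search (find the target peer, then look up the file reference inside it) can be sketched end to end with in-memory objects. Plain dictionaries stand in for the per-peer B+ trees, and combining several keywords by intersecting their results is an assumption; the slides do not specify multi-keyword semantics. All class and function names are hypothetical.

```python
import hashlib

def hash40(value):
    """40-bit hash: first 5 bytes of the SHA1 digest (assumed truncation rule)."""
    return int.from_bytes(hashlib.sha1(value.encode("utf-8")).digest()[:5], "big")

class Peer:
    def __init__(self, ip):
        self.ip = ip
        self.ip_hash = hash40(ip)
        self.index = {}  # keyword hash -> set of file references (B+ tree stand-in)

def target_peer(peers, keyword_hash):
    """Step 1: pick the peer whose IP hash is nearest to the keyword hash."""
    return min(peers, key=lambda p: abs(p.ip_hash - keyword_hash))

def publish(peers, file_ref, keywords):
    """Store the file reference under each keyword at that keyword's target peer."""
    for kw in keywords:
        h = hash40(kw.lower())
        target_peer(peers, h).index.setdefault(h, set()).add(file_ref)

def search(peers, query):
    """Step 2: look up each keyword in its target peer; intersect the results."""
    results = None
    for kw in query.lower().split():
        h = hash40(kw)
        refs = target_peer(peers, h).index.get(h, set())
        results = set(refs) if results is None else results & refs
    return results or set()

peers = [Peer(ip) for ip in ("10.0.0.1", "10.0.0.2", "10.0.0.3")]
publish(peers, "doc1.pdf", ["hadoop", "hdfs"])
publish(peers, "doc2.pdf", ["hadoop", "chord"])
hits = search(peers, "Hadoop chord")
```

Each keyword costs one direct contact with its target peer plus one local index lookup there.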

Add/Delete Peer
- Update the IP address table
- Compute the IP hash of the newly added peer
- Reconstruct the B+ tree and update it in the peers
- Relocate the appropriate files to the new peer
- Modify the metadata in the peers
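The relocation step on adding a peer can be sketched as follows: after the new IP hash joins the table, any keyword that is now nearer to the new peer than to its current holder moves across. The nearest-hash ownership rule, the `placement` dictionary, and the function names are assumptions made for illustration.

```python
import hashlib

def hash40(value):
    """40-bit hash: first 5 bytes of the SHA1 digest (assumed truncation rule)."""
    return int.from_bytes(hashlib.sha1(value.encode("utf-8")).digest()[:5], "big")

def owner(peer_ips, keyword):
    """Peer whose IP hash is nearest to the keyword hash."""
    key = hash40(keyword)
    return min(peer_ips, key=lambda ip: abs(hash40(ip) - key))

def add_peer(placement, peer_ips, new_ip):
    """Register new_ip and move to it every keyword it now owns.

    placement maps each peer IP to the set of keywords it stores.
    Returns the list of relocated keywords.
    """
    peer_ips.append(new_ip)            # update the IP address table
    placement[new_ip] = set()
    moved = []
    for ip in list(placement):
        if ip == new_ip:
            continue
        for kw in sorted(placement[ip]):   # sorted() copies, so the set can change
            if owner(peer_ips, kw) == new_ip:
                placement[ip].discard(kw)
                placement[new_ip].add(kw)
                moved.append(kw)
    return moved

peer_ips = ["10.0.0.1", "10.0.0.2"]
keywords = [f"kw{i}" for i in range(50)]
placement = {ip: set() for ip in peer_ips}
for kw in keywords:
    placement[owner(peer_ips, kw)].add(kw)
moved = add_peer(placement, peer_ips, "10.0.0.3")
```

Only keywords that move to the new peer are touched, which fits the later observation that the cost of adding a peer depends on the number of keywords.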

Experimental Results
Keyword extraction from multiple files (1 MB each)
- Observation: extraction time depends on the number of keywords

Cluster setup time
- Depends on the number of nodes

Adding a new peer
- Time depends on the number of keywords (for 1 peer)

Performance of the data distribution component
- Load time depends on the number of keywords

Performance of the search component
- Search time remains constant (9 ms), owing to the B+ tree and search distribution

Conclusion
- The P2P Information Retrieval framework uses the 3G P2P DHT approach
- B+ trees are maintained in the peers
- Hadoop is used to extract keywords from multiple files in parallel
- Search on the peers is efficient

THANK YOU