Applications of Map-Reduce Team 3 CS 4513 – D08

Distributed Grep
- Very popular example to explain how Map-Reduce works
- Demo program comes with Nutch (where Hadoop originated)

Distributed Grep
For Unix gurus: grep -Eh <regex> <inDir>/* | sort | uniq -c | sort -nr
- counts lines in all files in <inDir> that match <regex> and displays the counts in descending order
- grep -Eh 'A|C' in/* | sort | uniq -c | sort -nr
- Analyzing web server access logs to find the top requested pages that match a given pattern
Example: File 1 contains the lines C, B, B, C; File 2 contains the lines C, A; the result is 3 C, 1 A

Distributed Grep
Map function in this case:
- input is (file offset, line)
- output is either:
  1. an empty list [] (the line does not match)
  2. a key-value pair [(line, 1)] (if it matches)
Reduce function in this case:
- input is (line, [1, 1, ...])
- output is (line, n) where n is the number of 1s in the list

Distributed Grep
Map tasks (File 1):
(0, C) -> [(C, 1)]
(2, B) -> []
(4, B) -> []
(6, C) -> [(C, 1)]
Map tasks (File 2):
(0, C) -> [(C, 1)]
(2, A) -> [(A, 1)]
Reduce tasks:
(A, [1]) -> (A, 1)
(C, [1, 1, 1]) -> (C, 3)
Result: 3 C, 1 A
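The map and reduce functions above can be sketched in a few lines of Python. This is only an in-memory illustration of the data flow, not the Nutch/Hadoop demo program itself; the names grep_map, grep_reduce and run_grep are made up for this sketch, and the shuffle step is simulated with a dictionary.

```python
import re
from collections import defaultdict


def grep_map(offset, line, pattern):
    """Map: emit [(line, 1)] if the line matches, otherwise an empty list."""
    return [(line, 1)] if re.search(pattern, line) else []


def grep_reduce(line, ones):
    """Reduce: count the 1s collected for one matching line."""
    return (line, sum(ones))


def run_grep(files, pattern):
    """Simulate map, shuffle and reduce over a list of files (lists of lines)."""
    groups = defaultdict(list)  # shuffle: group map output by key
    for lines in files:
        offset = 0
        for line in lines:
            for key, value in grep_map(offset, line, pattern):
                groups[key].append(value)
            offset += len(line) + 1  # +1 for the newline character
    counts = [grep_reduce(key, values) for key, values in groups.items()]
    # Mirror `sort -nr`: highest count first.
    return sorted(counts, key=lambda kv: kv[1], reverse=True)


file1 = ["C", "B", "B", "C"]
file2 = ["C", "A"]
result = run_grep([file1, file2], r"A|C")
print(result)
```

With the two example files from the slide, the sketch reproduces the counts 3 C and 1 A.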

Large-Scale PDF Generation: The Problem
- The New York Times needed to generate PDF files for 11,000,000 articles (every article from ) in the form of images scanned from the original paper
- Each article is composed of numerous TIFF images which are scaled and glued together
- Code for generating a PDF is relatively straightforward

Large-Scale PDF Generation: Technologies Used
- Amazon Simple Storage Service (S3)
  – Scalable, inexpensive internet storage which can store and retrieve any amount of data at any time from anywhere on the web
  – Asynchronous, decentralized system which aims to reduce scaling bottlenecks and single points of failure
- Amazon Elastic Compute Cloud (EC2)
  – Virtualized computing environment designed for use with other Amazon services (especially S3)
- Hadoop
  – Open-source implementation of MapReduce

Large-Scale PDF Generation: Results
1. 4TB of scanned articles were sent to S3
2. A cluster of EC2 machines was configured to distribute the PDF generation via Hadoop
3. Using 100 EC2 instances and 24 hours, the New York Times was able to convert 4TB of scanned articles to 1.5TB of PDF documents
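A rough back-of-the-envelope calculation puts these figures in perspective. It assumes that the 24-hour run covered all 11,000,000 articles, which the slides suggest but do not state outright, and that the work was spread evenly across instances; the variable names are illustrative.

```python
# Figures taken from the slides.
articles = 11_000_000
instances = 100
hours = 24
input_tb = 4.0
output_tb = 1.5

# Assumption: one run converted every article, evenly spread over instances.
instance_hours = instances * hours
articles_per_instance_hour = articles / instance_hours
articles_per_instance_second = articles_per_instance_hour / 3600
size_ratio = output_tb / input_tb

print(f"{instance_hours} instance-hours in total")
print(f"about {articles_per_instance_hour:.0f} articles per instance per hour")
print(f"about {articles_per_instance_second:.2f} articles per instance per second")
print(f"PDF output is {size_ratio:.1%} of the scanned input size")
```

Under those assumptions each instance handles a little over one article per second, which is consistent with the claim that the per-article code is straightforward and the difficulty lies in the volume.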

Artificial Intelligence
- Compute statistics
  – Central Limit Theorem
- N voting nodes cast votes (map)
- Tally votes and take action (reduce)

Statistical analysis of current stock against historical data
- Each node (map) computes similarity and ROI
- Tally votes (reduce) to generate expected ROI and standard deviation
Photos from: stockcharts.com
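The voting scheme can be illustrated with a small Python sketch. The slides do not say how similarity is measured or how votes are weighted, so both choices here are assumptions: similarity is the inverse of the mean absolute difference between normalised price windows, and the tally is a similarity-weighted mean and standard deviation. The function names and the sample prices are hypothetical.

```python
import math


def vote_map(current, historical, later_price):
    """One voting node: compare the current price window with one historical
    window and report (similarity, ROI that followed the historical window)."""
    # Normalise each window to its first price so only the shape is compared.
    cur = [p / current[0] for p in current]
    hist = [p / historical[0] for p in historical]
    diff = sum(abs(a - b) for a, b in zip(cur, hist)) / len(cur)
    similarity = 1.0 / (1.0 + diff)  # assumed similarity measure
    roi = (later_price - historical[-1]) / historical[-1]
    return similarity, roi


def tally_reduce(votes):
    """Tally the votes: similarity-weighted expected ROI and standard deviation."""
    total = sum(s for s, _ in votes)
    mean = sum(s * r for s, r in votes) / total
    variance = sum(s * (r - mean) ** 2 for s, r in votes) / total
    return mean, math.sqrt(variance)


current_window = [10.0, 11.0, 12.0]
history = [
    ([20.0, 22.0, 24.0], 26.4),  # same shape, price later rose 10%
    ([5.0, 5.5, 6.0], 5.4),      # same shape, price later fell 10%
]
votes = [vote_map(current_window, window, later) for window, later in history]
expected_roi, roi_std = tally_reduce(votes)
print(expected_roi, roi_std)
```

Each call to vote_map is independent of the others, which is what makes the map phase easy to distribute across N nodes.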

Geographical Data
- Large data sets including road, intersection, and feature data
- Problems that Google Maps has used MapReduce to solve
  – Locating roads connected to a given intersection
  – Rendering of map tiles
  – Finding nearest feature to a given address or location

Geographical Data: Example 1
- Input: List of roads and intersections
- Map: Creates pairs of connected points (road, intersection) or (road, road)
- Sort: Sort by key
- Reduce: Get list of pairs with same key
- Output: List of all points that connect to a particular road
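Example 1 can be sketched as follows. The input format is an assumption (each intersection listed with the roads that meet there), as are the function names and the sample road names; the slides only describe the stages.

```python
from itertools import groupby


def connection_map(intersection, roads):
    """Map: for each road at an intersection, emit (road, intersection)
    and (road, other_road) for every other road meeting there."""
    pairs = []
    for road in roads:
        pairs.append((road, intersection))
        pairs.extend((road, other) for other in roads if other != road)
    return pairs


def connection_reduce(road, points):
    """Reduce: collect every point connected to one road."""
    return road, sorted(set(points))


def connections(intersections):
    pairs = [
        pair
        for name, roads in intersections.items()
        for pair in connection_map(name, roads)
    ]
    pairs.sort(key=lambda kv: kv[0])  # sort by key
    return dict(
        connection_reduce(road, [point for _, point in group])
        for road, group in groupby(pairs, key=lambda kv: kv[0])
    )


sample = {
    "I1": ["Main St", "Oak Ave"],
    "I2": ["Main St", "Elm Rd"],
}
connected = connections(sample)
print(connected["Main St"])
```

The explicit sort before groupby mirrors the sort stage on the slide: groupby only merges adjacent records, just as a reducer only sees keys that the framework has already brought together.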

Geographical Data: Example 2
- Input: Graph describing node network with all gas stations marked
- Map: Search five-mile radius of each gas station and mark distance to each node
- Sort: Sort by key
- Reduce: For each node, emit path and gas station with the shortest distance
- Output: Graph marked with the nearest gas station to each node
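A minimal sketch of Example 2, under several assumptions: the graph is an adjacency list with edge lengths in miles, the radius search is a bounded Dijkstra traversal, and only the distance and station are emitted (the path mentioned on the slide is omitted for brevity). All names and the sample graph are hypothetical.

```python
import heapq
from collections import defaultdict

RADIUS_MILES = 5.0


def station_map(graph, station, radius=RADIUS_MILES):
    """Map: bounded Dijkstra from one gas station.
    Emits (node, (distance, station)) for every node within the radius."""
    dist = {station: 0.0}
    heap = [(0.0, station)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for neighbor, length in graph.get(node, []):
            nd = d + length
            if nd <= radius and nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                heapq.heappush(heap, (nd, neighbor))
    return [(node, (d, station)) for node, d in dist.items()]


def nearest_reduce(node, candidates):
    """Reduce: keep the candidate with the shortest distance."""
    return node, min(candidates)


def nearest_stations(graph, stations):
    groups = defaultdict(list)  # stands in for the sort/shuffle by key
    for station in stations:
        for node, value in station_map(graph, station):
            groups[node].append(value)
    return dict(nearest_reduce(n, c) for n, c in sorted(groups.items()))


roads = {
    "S1": [("A", 2.0)],
    "A": [("S1", 2.0), ("B", 2.0)],
    "B": [("A", 2.0), ("S2", 1.0), ("C", 6.0)],
    "S2": [("B", 1.0)],
    "C": [("B", 6.0)],
}
nearest = nearest_stations(roads, ["S1", "S2"])
print(nearest)
```

Node C lies more than five miles from both stations, so no map task emits a record for it and it is absent from the output.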

Rackspace Log Querying: Platform
- Hadoop
- HDFS
- Lucene
- Solr
- Tomcat

Rackspace Log Querying: Statistics
- More than 50k devices
- 7 data centers
- Solr stores 800M objects
- Hadoop stores 9.6B (~6.3TB)
- Several hundred GB of log data generated each day

Rackspace Log Querying: System Evolution
- The problem: logging
- V1.0, V1.1, V2.0, V2.1, V2.2
- V3.0: MapReduce introduced

PageRank

PageRank
- Program implemented by Google to rank any type of recursive “documents” using MapReduce
- Initially developed at Stanford University by Google founders, Larry Page and Sergey Brin, in
- Led to a functional prototype named Google in
- Still provides the basis for all of Google's web search tools

PageRank
- Simulates a “random-surfer”
- Begins with pair (URL, list-of-URLs)
- Maps to (URL, (PR, list-of-URLs))
- Maps again taking above data, and for each u in list-of-URLs returns (u, PR/|list-of-URLs|), as well as (u, new-list-of-URLs)
- Reduce receives (URL, list-of-URLs), and many (URL, value) pairs and calculates (URL, (new-PR, list-of-URLs))
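One iteration of this scheme can be sketched in Python. Two points are assumptions rather than something the slides specify: the map step re-emits each page's own (URL, list-of-URLs) so the reducer can carry the link structure forward, and the new rank uses the commonly quoted damping factor of 0.85. The function names and the three-page graph are illustrative, and the sketch ignores pages without outgoing links.

```python
from collections import defaultdict

DAMPING = 0.85  # assumption: the slides do not give a damping factor


def pagerank_map(url, value):
    """Map: pass the link list through and share the rank among the links."""
    rank, links = value
    output = [(url, links)]  # keeps the graph structure for the reducer
    for target in links:
        output.append((target, rank / len(links)))
    return output


def pagerank_reduce(url, values, n_pages):
    """Reduce: sum incoming contributions and reattach the link list."""
    links = []
    total = 0.0
    for value in values:
        if isinstance(value, list):
            links = value
        else:
            total += value
    new_rank = (1 - DAMPING) / n_pages + DAMPING * total
    return url, (new_rank, links)


def iterate(state):
    groups = defaultdict(list)  # shuffle: group map output by URL
    for url, value in state.items():
        for key, emitted in pagerank_map(url, value):
            groups[key].append(emitted)
    return dict(
        pagerank_reduce(url, values, len(state)) for url, values in groups.items()
    )


graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
state = {url: (1.0 / len(graph), links) for url, links in graph.items()}
for _ in range(20):
    state = iterate(state)
ranks = {url: rank for url, (rank, _) in state.items()}
print(ranks)
```

Because the reducer's output has the same shape as the mapper's input, the job can simply be run again on its own output until the ranks stop changing.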

PageRank: Problems
- Has some bugs
  – Google Jacking
- Favors older websites
- Easy to manipulate

Statistical Machine Translation
- Used for translating between different languages
- A phrase or sentence can be translated in more than one way, so this method uses statistics from previous translations to find the best-fitting one

Statistical Machine Translation
- the quick brown fox jumps over the lazy dog
  – Each word translated individually: la rápido marrón zorro saltos más la perezoso perro
  – Complete sentence translation: el rápido zorro marrón salta sobre el perro perezoso
- Creating quality translations requires a large amount of computing power due to p(f|e)p(e)
- Need the statistics of previous translations of phrases
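The slide mentions p(f|e)p(e) without context. Read in the usual noisy-channel way, which is an interpretation and not something the slide spells out, f is the source sentence, e is a candidate translation, p(f|e) is the translation model and p(e) is the language model. The best translation is then found by Bayes' rule:

```latex
\hat{e}
  = \arg\max_{e} \, p(e \mid f)
  = \arg\max_{e} \, \frac{p(f \mid e)\, p(e)}{p(f)}
  = \arg\max_{e} \, p(f \mid e)\, p(e)
```

The denominator p(f) is dropped because it does not depend on e. The cost comes from evaluating the product over a very large space of candidate sentences e, with both factors estimated from large collections of earlier translations.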

Statistical Machine Translation: Google Translator
- When computing the previous example it would not translate "brown" and "fox" individually, but it translated the complete sentence correctly
- After providing a translation for a given sentence, it asks the user to suggest a better translation
- The information can then be added to the statistics to improve quality

Statistical Machine Translation
- Benefits
  – more natural translation
  – better use of resources
- Challenges
  – compound words
  – idioms
  – morphology
  – different word orders
  – syntax
  – out-of-vocabulary words

Map Reduce on Cell
- Peak performance rating of 256 GFLOPS at 4GHz
- However, programmers must write multi-threaded code unique to each of the SPE (Synergistic Processing Element) cores in addition to the main PPE (Power Processing Element) core
- SPE local memory is software-managed, requiring programmers to individually manage all reads and writes to and from the global memory space
- The SPEs are statically scheduled Single Instruction, Multiple Data (SIMD) cores; this requires a lot of parallelism to achieve high performance
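The 256 GFLOPS figure is consistent with a simple peak-rate calculation. The breakdown below is an assumption added for illustration, since the slides give only the total: 8 SPEs, each with 4 single-precision SIMD lanes, each lane completing one fused multiply-add (counted as 2 floating-point operations) per cycle.

```latex
8 \ \text{SPEs} \times 4 \ \text{lanes} \times 2 \ \frac{\text{FLOP}}{\text{cycle}} \times 4 \times 10^{9} \ \frac{\text{cycles}}{\text{s}}
  = 256 \times 10^{9} \ \frac{\text{FLOP}}{\text{s}}
  = 256 \ \text{GFLOPS}
```

The peak is only reached when every lane of every SPE is kept busy on every cycle, which is why the amount of available parallelism matters so much.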

Map Reduce on Cell
- Takes the effort out of writing multi-processor code for single operations that are performed on large amounts of data
- As easy to develop as single-threaded code
- Depending on input, data was processed 3x to 10x faster with Cell vs. 2.4 Core2 Duo
- However, computationally weak data went slower
- Code not fully developed; currently no support for variable-length structures (such as strings)

Map Reduce Inapplicability: Database management
- Sub-optimal implementation for DB
- Does not provide traditional DBMS features
- Lacks support for default DBMS tools

Map Reduce Inapplicability: Database implementation issues
- Lack of a schema
- No separation from application program
- No indexes
- Reliance on brute force

Map Reduce Inapplicability: Feature absence and tool incompatibility
- Transaction updates
- Changing data and maintaining data integrity
- Data mining and replication tools
- Database design and construction tools