Group 15 Swathi Gurram Prajakta Purohit

Presentation transcript:

K-means Clustering. Group 15: Swathi Gurram, Prajakta Purohit

Goal: To program K-means on Twister (iterative MapReduce) and Hadoop (MapReduce) and see how the change of framework affects the implementation time.

Survey: Twister
- Configurable, long-running (cacheable) map/reduce tasks
- Pub/sub messaging based communication and data transfers
- Efficient support for iterative MapReduce computation
- Combine phase to collect all reduce outputs
- Data access via local disks
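The iterative pattern listed for Twister (long-running map tasks over cached data, a reduce step, and a combine step that gathers the reduce outputs) can be illustrated with K-means. The following is a minimal single-process Python sketch, not the Twister API: the function names (`kmeans_map`, `kmeans_reduce`, `kmeans_combine`, `iterative_kmeans`) are hypothetical, and the in-memory `partitions` list merely stands in for data that each map task would keep cached between iterations.

```python
def closest(point, centroids):
    """Index of the centroid nearest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def kmeans_map(partition, centroids):
    """Map: for one data partition, emit {centroid index: (sum vector, count)}."""
    partial = {}
    for point in partition:
        i = closest(point, centroids)
        sums, count = partial.get(i, ([0.0] * len(point), 0))
        partial[i] = ([s + p for s, p in zip(sums, point)], count + 1)
    return partial


def kmeans_reduce(partials):
    """Reduce: merge the partial sums and counts from all map outputs."""
    merged = {}
    for partial in partials:
        for i, (sums, count) in partial.items():
            acc_sums, acc_count = merged.get(i, ([0.0] * len(sums), 0))
            merged[i] = ([a + s for a, s in zip(acc_sums, sums)], acc_count + count)
    return merged


def kmeans_combine(merged, old_centroids):
    """Combine: collect reduce outputs into the new centroid list.

    A centroid that received no points keeps its previous position.
    """
    return [
        [s / merged[i][1] for s in merged[i][0]] if i in merged else list(old)
        for i, old in enumerate(old_centroids)
    ]


def iterative_kmeans(partitions, centroids, max_iter=20, tol=1e-6):
    """Driver loop: only the centroids change between iterations."""
    for _ in range(max_iter):
        partials = [kmeans_map(part, centroids) for part in partitions]
        new_centroids = kmeans_combine(kmeans_reduce(partials), centroids)
        shift = max(
            sum((a - b) ** 2 for a, b in zip(new, old))
            for new, old in zip(new_centroids, centroids)
        )
        centroids = new_centroids
        if shift < tol:
            break
    return centroids


# Toy data: two partitions, two well-separated clusters (illustrative values).
partitions = [[(0.0, 0.0), (0.0, 1.0)], [(10.0, 10.0), (10.0, 11.0)]]
result = iterative_kmeans(partitions, [[0.0, 0.0], [10.0, 10.0]])
print(result)  # [[0.0, 0.5], [10.0, 10.5]]
```

In this sketch the partitions are passed unchanged into every iteration and only the small centroid list is updated, which is the kind of workload where caching the static input between iterations is expected to help.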

Survey: Hadoop
- A software framework that supports data-intensive distributed applications
- Uses the MapReduce programming model
- Has its own filesystem, HDFS (Hadoop Distributed File System, based on the Google File System), which is specifically tailored for dealing with large files
- Can intelligently manage the distribution of processing and your files, breaking those files down into more manageable chunks for processing
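The idea of breaking a large file into chunks can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not HDFS code: the helper `split_into_blocks` is hypothetical, and the 64 MB block size is an assumed example value (the real block size is a configuration setting).

```python
def split_into_blocks(file_size, block_size):
    """Return (offset, length) pairs that cover a file of file_size bytes."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [
        (offset, min(block_size, file_size - offset))
        for offset in range(0, file_size, block_size)
    ]


MB = 1024 * 1024
# Assumed example: a 200 MB file split into 64 MB blocks.
blocks = split_into_blocks(200 * MB, 64 * MB)
for offset, length in blocks:
    print(f"offset={offset // MB} MB, length={length // MB} MB")
```

Each block can then be stored and processed independently, with the last block holding whatever remains (8 MB in this example).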

Survey: HaLoop
- A modified version of the Hadoop MapReduce framework
- Provides caching options for loop-invariant data access
- Lets users reuse major building blocks from their applications' Hadoop implementations
- Has intra-job fault-tolerance mechanisms similar to Hadoop's
- Reduces query runtimes by a factor of 1.85 compared with Hadoop

K-means Clustering


Twister K-means

Hadoop K-means

Implementation Timeline

Week | Task | Team member
Oct 24th – Oct 31st | Understand K-means algorithm and design | Prajakta, Swathi
Nov 1st – Nov 7th | Implement K-means |
Nov 8th – Nov 21st | Implement K-means on Twister and performance analysis |
Nov 21st – Nov 28th | Optimized validation method for K-means algorithm |
Nov 29th – Dec 3rd | Implement K-means on Hadoop |
Dec 4th – Dec 5th | Performance analysis and presentation |
Dec 6th – Dec 12th | Final technical report |

Validation methods

Conclusion: The Twister framework is faster than Hadoop for iterative MapReduce applications.

References
http://salsahpc.indiana.edu
http://www.iterativemapreduce.org/samples.html
http://hadoop.apache.org/
http://en.wikipedia.org/wiki/Apache_Hadoop
http://clue.cs.washington.edu/node/14
http://code.google.com/p/haloop/
http://www.cs.washington.edu/homes/billhowe/pubs/HaLoop.pdf

Demo

Thank you