Group 15 Swathi Gurram Prajakta Purohit


K-means Clustering
Group 15: Swathi Gurram, Prajakta Purohit

Goal
Implement K-means on Twister (iterative MapReduce) and on Hadoop (MapReduce), and see how the change of framework affects the implementation time.

Survey: Twister
Configurable, long-running (cacheable) map/reduce tasks
Pub/sub messaging-based communication and data transfers
Efficient support for iterative MapReduce computation
Combine phase to collect all reduce outputs
Data access via local disks

Survey: Hadoop
A software framework that supports data-intensive distributed applications
Uses the MapReduce programming model
Has its own filesystem, HDFS (Hadoop Distributed File System, based on the Google File System), which is specifically tailored for dealing with large files
Manages the distribution of processing and of files, breaking files down into more manageable chunks for processing

Survey: HaLoop
A modified version of the Hadoop MapReduce framework
Provides caching options for loop-invariant data access
Lets users reuse major building blocks from their applications' Hadoop implementations
Has intra-job fault-tolerance mechanisms similar to Hadoop's
Reduces query runtimes by a factor of 1.85 compared with Hadoop
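The benefit of caching loop-invariant data can be illustrated with a small sketch. This is not HaLoop's API: `CountingSource`, `run_without_cache`, and `run_with_cache` are hypothetical names, and the "job" is an in-process loop. It only shows that an iterative driver which re-reads its input every iteration performs one read per iteration, while a caching driver reads once and still computes the same result.

```python
class CountingSource:
    """Hypothetical input source that counts how often it is read."""

    def __init__(self, records):
        self._records = list(records)
        self.reads = 0

    def load(self):
        self.reads += 1
        return list(self._records)


def run_without_cache(source, iterations, step, state):
    """Re-read the loop-invariant input on every iteration."""
    for _ in range(iterations):
        data = source.load()
        state = step(data, state)
    return state


def run_with_cache(source, iterations, step, state):
    """Read the loop-invariant input once and reuse it in every iteration."""
    data = source.load()
    for _ in range(iterations):
        state = step(data, state)
    return state


def step(data, state):
    """Toy iteration: move the state halfway toward the mean of the data."""
    return (state + sum(data) / len(data)) / 2


plain = CountingSource([2.0, 4.0, 6.0])
cached = CountingSource([2.0, 4.0, 6.0])
result_plain = run_without_cache(plain, 5, step, 0.0)
result_cached = run_with_cache(cached, 5, step, 0.0)
print(plain.reads, cached.reads)
```

In a real cluster the saved work is disk and network I/O rather than a counter increment, but the structure of the saving is the same.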

K-means Clustering
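As a reference point for the framework versions, here is a minimal sequential sketch of K-means (Lloyd's algorithm) in plain Python. It is an illustration, not the group's implementation: the function names, the random initialisation from the input points, and the sample data are all assumptions.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, max_iterations=100, seed=0):
    """Alternate assignment and centroid update until the centroids stop moving."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumed initialisation: k input points
    for _ in range(max_iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # An empty cluster keeps its previous centroid.
        new_centroids = [mean(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids


points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
          (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
centroids = kmeans(points, k=2)
print(sorted(centroids))
```

The assignment step is independent per point and the update step is a per-cluster average, which is what makes the algorithm a natural fit for map and reduce phases.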

Twister K-means
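One way K-means might be expressed in the iterative MapReduce style described in the survey is sketched below. It does not use the Twister API (Twister is a Java runtime); it is an in-process Python simulation, and `map_task`, `reduce_task`, `combine`, and `iterative_kmeans` are hypothetical names. The point partitions play the role of data cached in long-running map tasks, only the centroids change between iterations, and the combine step collects all reduce outputs for the driver.

```python
from collections import defaultdict


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def map_task(cached_points, centroids):
    """Emit (cluster index, (coordinate sums, count)) for one cached partition."""
    partial = {}
    for p in cached_points:
        idx = min(range(len(centroids)),
                  key=lambda i: squared_distance(p, centroids[i]))
        sums, count = partial.get(idx, ((0.0,) * len(p), 0))
        partial[idx] = (tuple(s + x for s, x in zip(sums, p)), count + 1)
    return list(partial.items())


def reduce_task(idx, values):
    """Merge the partial sums for one cluster into its new centroid."""
    total, count = None, 0
    for sums, n in values:
        total = sums if total is None else tuple(t + s for t, s in zip(total, sums))
        count += n
    return idx, tuple(t / count for t in total)


def combine(reduce_outputs, old_centroids):
    """Collect all reduce outputs; clusters with no points keep their old centroid."""
    new_centroids = list(old_centroids)
    for idx, centroid in reduce_outputs:
        new_centroids[idx] = centroid
    return new_centroids


def iterative_kmeans(partitions, centroids, max_iterations=50, tolerance=1e-9):
    """Driver loop: partitions stay in memory, only centroids are re-broadcast."""
    iteration = 0
    for iteration in range(1, max_iterations + 1):
        grouped = defaultdict(list)
        for part in partitions:
            for idx, value in map_task(part, centroids):
                grouped[idx].append(value)
        reduced = [reduce_task(idx, vals) for idx, vals in grouped.items()]
        new_centroids = combine(reduced, centroids)
        shift = max(squared_distance(a, b)
                    for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift <= tolerance:
            break
    return centroids, iteration


partitions = [[(1.0, 1.0), (1.0, 2.0), (2.0, 1.0)],
              [(9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]]
final_centroids, iterations = iterative_kmeans(
    partitions, [(1.0, 1.0), (9.0, 9.0)])
print(final_centroids, iterations)
```

Emitting partial sums and counts from each map task, rather than one record per point, keeps the data sent to the reducers proportional to the number of clusters instead of the number of points.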

Hadoop K-means

Implementation Timeline
Week | Task | Team member
Oct 24th – Oct 31st | Understand K-means algorithm and design | Prajakta, Swathi
Nov 1st – Nov 7th | Implement K-means |
Nov 8th – Nov 21st | Implement K-means on Twister and performance analysis |
Nov 21st – Nov 28th | Optimized validation method for K-means algorithm |
Nov 29th – Dec 3rd | Implement K-means on Hadoop |
Dec 4th – Dec 5th | Performance analysis and presentation |
Dec 6th – Dec 12th | Final technical report |

Validation methods

Conclusion
The Twister framework is faster than Hadoop for iterative MapReduce applications.

References
http://salsahpc.indiana.edu
http://www.iterativemapreduce.org/samples.html
http://hadoop.apache.org/
http://en.wikipedia.org/wiki/Apache_Hadoop
http://clue.cs.washington.edu/node/14
http://code.google.com/p/haloop/
http://www.cs.washington.edu/homes/billhowe/pubs/HaLoop.pdf

Demo

Thank you