MalStone: Towards a Benchmark for Analytics on Large Data Clouds

Presentation transcript:

MalStone: Towards a Benchmark for Analytics on Large Data Clouds

Collin Bennett, Robert L. Grossman, David Locke, Jonathan Seidman, Steve Vejcik
Open Data Group, 400 Lathrop Ave, Suite 90, River Forest, IL

KDD '10, July 25-28, 2010, Washington, DC, USA

OUTLINE
0. ABSTRACT
1. INTRODUCTION
2. Common Elements
3. MalStone A & B
4. MalGen
5. THREE IMPLEMENTATIONS
6. EXPERIMENTAL STUDIES
7. DISCUSSION
8. RELATED WORK
9. SUMMARY

0. ABSTRACT
• TeraSort
• MalStone
• MalGen

1. INTRODUCTION
• Data mining for clouds: HBase, Apache Pig, Hive, and ZooKeeper.
• There are no comparable benchmarks for comparing two large data clouds that support building analytic models on large datasets.
• This work introduces the MalStone benchmark and describes the implementation of MalGen, a data generator for MalStone.

2. Common Elements
• Time stamps
• Sites, e.g. Web sites, computers, network devices
• Entities, e.g. visitors, users, flows
• Log files fill disks: many, many disks
• Behavior occurs at all scales
• Want to identify phenomena at all scales
• Need to group "similar behavior"
• Need to do statistics (not just sorting)

2. Common Elements
Abstract the Problem Using Site-Entity Logs

Example                        Sites                    Entities
Measuring online advertising   Web sites                Consumers
Drive-by exploits              Web sites                Computers (identified by cookies or IP)
Compromised systems            Compromised computers    User accounts
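A site-entity log record can be pictured as a small structure holding a time stamp, a site, and an entity. The sketch below is only an illustration: the `Event` class, the `marked` flag, the `|`-delimited layout, and the `parse_event` helper are assumptions made for this example and are not taken from the slides or from MalGen's actual file format.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Event:
    """One line of a site-entity log (hypothetical layout)."""
    timestamp: datetime
    site_id: str
    entity_id: str
    marked: bool = False  # assumed flag: the entity is marked (e.g. compromised)


def parse_event(line: str) -> Event:
    """Parse an assumed 'timestamp|site|entity|flag' record."""
    ts, site, entity, flag = line.rstrip("\n").split("|")
    return Event(datetime.fromisoformat(ts), site, entity, flag == "1")


# Example: a consumer (entity) visiting a Web site (site).
example = parse_event("2010-07-25T10:15:00|site-42|entity-7|0")
```

The same shape covers all three rows of the table: only the interpretation of "site" and "entity" changes.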

3. MalStone A & B
MalStone Benchmark
• Benchmark developed by the Open Cloud Consortium for clouds supporting data-intensive computing.
• Code to generate the required synthetic data is available from code.google.com/p/malgen
• A stylized analytic computation that is easy to implement in MapReduce and its generalizations.

3. MalStone A & B
• MalStone A computes $\rho_j$ for all sites $j$ in the log files.
• MalStone B computes $\rho_{j,t}$ for all sites $j$ in the log files.

3. MalStone A & B
• Let $B_j$ be the set of all entities $e_i \in A_j$ that become marked at any time in the monitor window.

3. MalStone A & B
• $B_j$ is the set of entities $e_i \in A_j$ that become marked at any time during the monitor window.

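3. MalStone A & B
In the example, the statistic is (number of marked entities that visited the site) / (number of entities that visited the site) = 1/2.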
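The ratio above can be sketched in a few lines of Python. This is a simplified illustration, not the benchmark's reference code: the tuple layout `(week, site, entity, marked)` and the function names are assumptions, the whole log is treated as both exposure and monitor window for MalStone A, and MalStone B is approximated by recomputing the ratio over all events up to each week.

```python
from collections import defaultdict


def malstone_a(events):
    """Per-site ratio of marked visitors to all visitors.

    events: list of (week, site, entity, marked) tuples (assumed layout).
    """
    visitors = defaultdict(set)  # site -> entities that visited it
    marked = set()               # entities marked at any time in the log
    for _week, site, entity, is_marked in events:
        visitors[site].add(entity)
        if is_marked:
            marked.add(entity)
    return {site: len(ents & marked) / len(ents)
            for site, ents in visitors.items()}


def malstone_b(events):
    """Per-site, per-week ratio using all events up to that week (assumption)."""
    weeks = sorted({e[0] for e in events})
    result = {}
    for t in weeks:
        window = [e for e in events if e[0] <= t]
        for site, ratio in malstone_a(window).items():
            result[(site, t)] = ratio
    return result


log = [
    (1, "s1", "e1", False),
    (1, "s1", "e2", False),
    (2, "s2", "e1", True),   # e1 becomes marked after visiting s1
    (2, "s1", "e3", False),
    (3, "s1", "e4", True),
]
ratios_a = malstone_a(log)   # s1: 2 marked of 4 visitors
ratios_b = malstone_b(log)
```

In this toy log, site `s1` has four visitors of which two become marked, so its MalStone A value is 1/2; the MalStone B values show how that ratio grows week by week.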

4. MalGen
• Tens of millions of sites
• Hundreds of millions of entities
• Billions of events
• Most sites have few events
• Some sites have many events
• Most entities visit few sites
• Some entities visit many sites

4. MalGen
• Generates site-entity log files
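The skew described above (most sites with few events, a few sites with many) can be imitated with a heavy-tailed sampling scheme. The sketch below is not MalGen: the `generate_events` function, the 1/rank weights, and the use of an integer event index as the time stamp are all assumptions chosen only to produce a small log with a similar shape.

```python
import random
from collections import Counter


def generate_events(n_events, n_sites, n_entities, seed=0):
    """Return a list of (time, site, entity) tuples with skewed popularity."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    sites = [f"site-{i}" for i in range(n_sites)]
    entities = [f"entity-{i}" for i in range(n_entities)]
    # Assumed Zipf-like weights: rank r gets weight 1 / (r + 1).
    site_weights = [1.0 / (rank + 1) for rank in range(n_sites)]
    entity_weights = [1.0 / (rank + 1) for rank in range(n_entities)]
    chosen_sites = rng.choices(sites, weights=site_weights, k=n_events)
    chosen_entities = rng.choices(entities, weights=entity_weights, k=n_events)
    # The event index stands in for a time stamp in this sketch.
    return [(t, s, e)
            for t, (s, e) in enumerate(zip(chosen_sites, chosen_entities))]


events = generate_events(10_000, 200, 1_000)
events_per_site = Counter(site for _, site, _ in events)
```

Inspecting `events_per_site.most_common(5)` against the least common entries shows the intended imbalance between popular and rarely visited sites.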

5. THREE IMPLEMENTATIONS
• HDFS, Hadoop Streams, and Python
• Hadoop HDFS and MapReduce
• Sector and Sphere UDFs (User Defined Functions)

6. EXPERIMENTAL STUDIES

MalStone B
Sector/Sphere v      min
# Nodes              20 nodes
# Records            10 billion
Size of dataset      1 TB

Tests were run on the Open Cloud Testbed.
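The figures above imply a record size, which a short calculation makes explicit. The sketch assumes a decimal terabyte (10^12 bytes); the `records_per_second_per_node` helper is hypothetical and takes the run time as a parameter because no run time value appears in the table.

```python
RECORDS = 10 * 10**9       # 10 billion records
DATASET_BYTES = 10**12     # 1 TB, assuming decimal units
NODES = 20

bytes_per_record = DATASET_BYTES / RECORDS  # 100.0 bytes per record


def records_per_second_per_node(records, minutes, nodes):
    """Average per-node throughput for a run of the given length."""
    return records / (minutes * 60) / nodes
```

With a measured run time in minutes, `records_per_second_per_node(RECORDS, minutes, NODES)` gives a per-node rate that can be compared across middleware.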

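7. DISCUSSION
• Hadoop streams does not require mappers and reducers to be written against the Java MapReduce API; any executable that reads standard input and writes standard output can serve as a mapper or reducer.
• Python programs can therefore be invoked by Hadoop streams.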
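A streaming job of this kind reduces to two small programs that exchange tab-separated lines, with the framework sorting by key in between. The sketch below shows that shape as plain Python functions so it can be tried locally; the `timestamp|site|entity|flag` record layout and the function names are assumptions, and as a simplification an entity counts as marked for a site only if one of its records at that site carries the flag.

```python
from itertools import groupby


def mapper(lines):
    """Emit 'site<TAB>entity<TAB>flag' for each input record."""
    for line in lines:
        _timestamp, site, entity, flag = line.rstrip("\n").split("|")
        yield f"{site}\t{entity}\t{flag}"


def reducer(sorted_lines):
    """Consume key-sorted mapper output and emit 'site<TAB>ratio'."""
    rows = (line.rstrip("\n").split("\t") for line in sorted_lines)
    for site, group in groupby(rows, key=lambda row: row[0]):
        seen, marked = set(), set()
        for _site, entity, flag in group:
            seen.add(entity)
            if flag == "1":
                marked.add(entity)
        yield f"{site}\t{len(marked) / len(seen):.4f}"


# Local simulation: sorted() stands in for the framework's shuffle/sort step.
sample = ["t1|s1|e1|0", "t2|s1|e2|1", "t3|s2|e1|0"]
output = list(reducer(sorted(mapper(sample))))
```

To run under Hadoop streams, each function would be wrapped in a small script that iterates over `sys.stdin` and prints each yielded line.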

8. RELATED WORK
• In 2008, Hadoop set the TeraSort record by sorting 1 TB in 209 seconds, beating the previous record of 297 seconds. TeraSort has since been replaced by MinuteSort, which measures how much data can be sorted in about one minute.
• [MapReduce for machine learning on multicore] uses MapReduce, but does not describe a computation similar to the MalStone statistic.

9. SUMMARY
• MalGen creates large amounts of synthetic data.
• Performance depends on which cloud middleware is used for the computation.