MalStone: Towards A Benchmark for Analytics on Large Data Clouds


1 MalStone: Towards A Benchmark for Analytics on Large Data Clouds. Collin Bennett, Robert L. Grossman, David Locke, Jonathan Seidman, Steve Vejcik. Open Data Group, 400 Lathrop Ave, Suite 90, River Forest, IL 60305. KDD ’10, July 25–28, 2010, Washington, DC, USA

2 OUTLINE 0. ABSTRACT 1. INTRODUCTION 2. Common Elements 3. MalStone A & B 4. MalGen 5. THREE IMPLEMENTATIONS 6. EXPERIMENTAL STUDIES 7. DISCUSSION 8. RELATED WORK 9. SUMMARY

3 0. ABSTRACT  TeraSort  MalStone  MalGen

4 1. INTRODUCTION  Data mining for clouds: HBase, Apache Pig, Hive, and ZooKeeper.  There are no comparable benchmarks for comparing two large data clouds that support building analytic models on large datasets.  This work introduces MalStone and also describes MalGen, a data generator for MalStone.

5 2. Common Elements  Time stamps  Sites, e.g. Web sites, computers, network devices  Entities, e.g. visitors, users, flows  Log files fill disks: many, many disks  Behavior occurs at all scales  Want to identify phenomena at all scales  Need to group “similar behavior”  Need to do statistics (not just sorting)

6 2. Common Elements: Abstract the Problem Using Site-Entity Logs
Example | Sites | Entities
Measuring online advertising | Web sites | Consumers
Drive-by exploits | Web sites | Computers (identified by cookies or IP)
Compromised systems | Compromised computers | User accounts

7 3. MalStone A & B: MalStone Benchmark  Benchmark developed by the Open Cloud Consortium for clouds that support data-intensive computing.  Code to generate the required synthetic data is available from code.google.com/p/malgen  A stylized analytic computation that is easy to implement in MapReduce and its generalizations.

8 3. MalStone A & B  MalStone A computes $\rho_j$ for all sites $j$ in the log files. MalStone B computes $\rho_{j,t}$ for sites $j$ in the log files.

9 3. MalStone A & B  Let $B_j$ be the set of all entities $e_i \in A_j$ that become marked at any time in the monitor window.

10 3. MalStone A & B  $B_{j,t}$ is the set of entities $e_i \in A_{j,t}$ that become marked at any time during the monitor window.

11 3. MalStone A & B The statistic is (1 + 0 + 0)/(1 + 1 + 0) = 1/2
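The per-site ratio behind this example (marked visitors divided by all visitors of a site) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the record layout `(timestamp, site, entity, mark)` and the function name `malstone_a` are assumptions, and the sketch treats the whole log as both exposure and monitor window instead of modelling separate windows.

```python
from collections import defaultdict

def malstone_a(records):
    """Return {site: fraction of the site's visitors that are marked anywhere in the log}.

    Assumed (hypothetical) record layout: (timestamp, site, entity, mark),
    where mark is 1 if the entity is flagged as marked in that record.
    """
    visitors = defaultdict(set)   # site -> set of entities seen at that site
    marked = set()                # entities marked in any record
    for _ts, site, entity, mark in records:
        visitors[site].add(entity)
        if mark:
            marked.add(entity)
    # Ratio of marked visitors to all visitors, per site.
    return {site: len(ents & marked) / len(ents) for site, ents in visitors.items()}

# Toy log: e1 and e2 visit s1; e1 is later marked, e2 never is.
records = [
    (1, "s1", "e1", 0),
    (2, "s1", "e2", 0),
    (3, "s2", "e1", 1),
    (4, "s2", "e3", 0),
]
print(malstone_a(records))  # {'s1': 0.5, 's2': 0.5}
```

On this toy log, site s1 has two visitors of which one becomes marked, giving a ratio of 1/2. A windowed variant would filter `records` by timestamp before building the two sets.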

12 4. MalGen  Tens of millions of sites  Hundreds of millions of entities  Billions of events  Most sites have few events  Some sites have many events  Most entities visit few sites  Some entities visit many sites

13 4. MalGen  For generating site-entity log files
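A toy generator can illustrate the skew described for these logs (a few busy sites and entities, many quiet ones). The sketch below is not MalGen's actual algorithm or record format: the function name `generate_log`, its parameters, the 1/rank weighting, and the rule that visiting a "bad" site may mark an entity are all assumptions made for illustration.

```python
import random

def generate_log(n_events, n_sites, n_entities,
                 bad_fraction=0.1, mark_prob=0.5, seed=0):
    """Generate (timestamp, site, entity, mark) tuples with skewed activity.

    Hypothetical model: sites and entities are drawn with weight 1/rank,
    a random subset of sites is 'bad', and a visit to a bad site marks the
    entity with probability mark_prob for all of its later events.
    """
    rng = random.Random(seed)
    sites = list(range(n_sites))
    entities = list(range(n_entities))
    site_w = [1.0 / (r + 1) for r in sites]      # Zipf-like site popularity
    ent_w = [1.0 / (r + 1) for r in entities]    # Zipf-like entity activity
    bad_sites = set(rng.sample(sites, max(1, int(bad_fraction * n_sites))))
    site_seq = rng.choices(sites, weights=site_w, k=n_events)
    ent_seq = rng.choices(entities, weights=ent_w, k=n_events)
    compromised = set()
    log = []
    for ts, (site, entity) in enumerate(zip(site_seq, ent_seq)):
        mark = 1 if entity in compromised else 0   # state before this visit
        log.append((ts, site, entity, mark))
        if site in bad_sites and rng.random() < mark_prob:
            compromised.add(entity)                # affects later events only
    return log

log = generate_log(n_events=1000, n_sites=50, n_entities=200)
print(log[:3])
```

Fixing the seed keeps runs reproducible, which is useful when the same synthetic data must be regenerated on different systems for comparison.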

14 5. THREE IMPLEMENTATIONS  HDFS, Hadoop Streams and Python  Hadoop HDFS and MapReduce  Sector and Sphere UDFs (User Defined Functions)

15 6. EXPERIMENTAL STUDIES

16 MalStone B
Sector/Sphere v1.20: 44 min
# Nodes: 20 nodes
# Records: 10 billion
Size of dataset: 1 TB
Tests done on Open Cloud Testbed.

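17 7. DISCUSSION  Hadoop Streams does not require mappers and reducers to be written against the Java MapReduce API; any executable that reads standard input and writes standard output can be used.  Python programs can be invoked by Hadoop Streams.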
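As a rough sketch of what a streaming-style Python job for a per-site ratio might look like, the mapper and reducer below exchange tab-separated text lines, which is the convention Hadoop Streaming uses between stages. Everything else is an assumption: the `ts|site|entity|flag` input layout, the function names, and the simplification that each record's flag already states whether the visiting entity became marked. It is not the implementation benchmarked in the talk.

```python
from itertools import groupby

def mapper(lines):
    """Emit 'site<TAB>entity<TAB>flag' for each 'ts|site|entity|flag' record."""
    for line in lines:
        _ts, site, entity, flag = line.rstrip("\n").split("|")
        yield f"{site}\t{entity}\t{flag}"

def reducer(sorted_lines):
    """Consume mapper output sorted by site; emit 'site<TAB>ratio' per site."""
    rows = (line.rstrip("\n").split("\t") for line in sorted_lines)
    for site, group in groupby(rows, key=lambda r: r[0]):
        marked_by_entity = {}
        for _site, entity, flag in group:
            # An entity counts as marked if any of its records carries flag 1.
            marked_by_entity[entity] = marked_by_entity.get(entity, False) or flag == "1"
        ratio = sum(marked_by_entity.values()) / len(marked_by_entity)
        yield f"{site}\t{ratio:.4f}"

# Local simulation of map -> sort -> reduce on a toy input.
lines = ["1|s1|e1|1", "2|s1|e2|0", "3|s2|e3|0"]
for out in reducer(sorted(mapper(lines))):
    print(out)
```

In a real streaming job the two functions would live in separate scripts that read `sys.stdin` and print to standard output, with the framework performing the sort between them.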

18 8. RELATED WORK  In 2008, Hadoop on TeraSort: 297 sec. In 2009, Hadoop on TeraSort: 209 sec. More recently, TeraSort has been replaced by MinuteSort: about 1 min.  [MapReduce for machine learning on multicore] uses MapReduce but does not describe a computation similar to the MalStone statistic.

19 9. SUMMARY  MalGen creates large amounts of data.  Performance depends on which cloud middleware is used for the computation.

