Streaming Big Data with Self-Adjusting Computation. Umut A. Acar, Yan Chen (DDFP 2013). Presented 2 January 2014, SNU IDB Lab., Namyoon Kim.


2 / 13 Outline
- Introduction
- Self-Adjusting MapReduce
  - Regular MapReduce and User Interface
  - Internals
  - Implementation
- Evaluation
- Conclusion

3 / 13 Introduction
Big Data
- All the rage at the moment
- Bound to become even more important with real-time processing in the future
Requirements
- Relatively sophisticated systems support (MapReduce)
- Asymptotically linear time to examine the data
- Not too many updates (not a fit for transaction-heavy systems)

4 / 13 Problems
Live updates
- On a single CPU, stream-processing n linear-time updates on a data set of size n requires O(n²) time
Re-processing of existing data
- When input is fed line by line in the classic MapReduce wordcount example, the entire data set is re-processed after each line (regardless of the size of the next input chunk, each update takes longer than the last)
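The O(n²) cost of naive streaming can be seen in a small sketch (my own illustration, not code from the paper): re-running a linear-time wordcount after every arriving line does O(n) work per update, while an incremental counter touches only the words in the new line.

```python
# Illustrative sketch: naive full re-processing vs. incremental update
# for streaming wordcount. All names here are hypothetical.
from collections import Counter

def naive_wordcount_stream(lines):
    """Recount the entire data set after each new line: O(n) per update,
    O(n^2) over the whole stream."""
    seen = []
    counts = Counter()
    for line in lines:
        seen.append(line)
        counts = Counter(w for l in seen for w in l.split())
    return counts

def incremental_wordcount_stream(lines):
    """Update only the counts touched by the new line: O(|line|) per update."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

lines = ["a b", "b c", "a c c"]
assert naive_wordcount_stream(lines) == incremental_wordcount_stream(lines)
```

Both produce identical results; only the work per update differs, which is exactly the gap self-adjusting computation targets.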

5 / 13 Self-Adjusting MapReduce (Regular MapReduce)
Arguments:
- the type of the key-value pair (for wordcount, string * int)
- mapper: maps a word to a string * int pair
- reducer: reduces pairs with a common key to a single pair

6 / 13 Self-Adjusting MapReduce (Internals)
Modifiable (reference)
- Stores values that can change over time
- Using the primitives mod, read and write, the execution is represented as a dependency graph, enabling change propagation
- After an initial complete run, an update costs only Θ(1) (vs. Θ(n) for complete re-execution)
Level types
- Levels are type annotations; the resulting types are called level types
- Instead of explicit reference primitives, the programmer only needs to identify which data is changeable
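The mod/read/write idea can be sketched in a few lines of Python (a minimal illustration of my own, not the paper's Delta ML machinery): each read registers a dependency, and a write re-runs the dependent computations, which is change propagation in miniature.

```python
# Hypothetical sketch of a modifiable reference with change propagation.
class Mod:
    def __init__(self, value):
        self.value = value
        self.readers = []          # dependent computations to re-run on change

    def read(self, fn):
        self.readers.append(fn)    # record the dependency edge
        fn(self.value)             # run once with the current value

    def write(self, value):
        if value != self.value:    # only propagate genuine changes
            self.value = value
            for fn in self.readers:
                fn(value)          # change propagation: re-run dependents

# Example: keep total = a + b up to date as the inputs change.
a, b, total = Mod(1), Mod(2), Mod(0)
a.read(lambda _: total.write(a.value + b.value))
b.read(lambda _: total.write(a.value + b.value))
assert total.value == 3
a.write(10)                        # updates only the affected computation
assert total.value == 12
```

A real implementation also tracks when each read happened so propagation re-runs work in the right order; this sketch only shows the dependency-graph idea.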

7 / 13 MapReduce in SML with Changeables
$C marks changeable data types
1. The map function converts the input list into key-value pairs
2. The groupby function mergesorts and scans the pair lists
3. The reduce function reduces each sublist generated by groupby
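For readers unfamiliar with the SML code, the three phases can be rendered in plain (non-self-adjusting) Python; the function names are mine, but the structure mirrors the slide: map to key-value pairs, group by sorting and scanning, then reduce each per-key sublist.

```python
# Plain rendering of the three MapReduce phases described on the slide.
def map_phase(words):
    return [(w, 1) for w in words]                 # word -> (string * int) pair

def groupby_phase(pairs):
    pairs = sorted(pairs)                          # mergesort in the SML version
    groups, current_key, current = [], None, []
    for k, v in pairs:                             # scan: collect equal-key runs
        if k != current_key:
            if current:
                groups.append((current_key, current))
            current_key, current = k, []
        current.append(v)
    if current:
        groups.append((current_key, current))
    return groups

def reduce_phase(groups, reducer):
    return [(k, reducer(vs)) for k, vs in groups]  # reduce each sublist

words = "the quick the lazy the".split()
result = reduce_phase(groupby_phase(map_phase(words)), sum)
assert result == [("lazy", 1), ("quick", 1), ("the", 3)]
```

In the paper's version the input list is annotated as changeable ($C), so the same pipeline becomes incrementally updatable without changing its structure.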

8 / 13 Self-Adjusting MapReduce (Implementation)
Map
- Passes in a changeable list with stable elements (elements can be added or deleted, but not modified)
groupby
- Groups common-key pairs into a pair list
Reduce
- Each sublist returned by groupby is reduced to a changeable key-value pair
- When a new input cell is inserted, the result can be updated by re-running only the reducers operating on the keys of the new cell
- Requires no structural changes to the output
- Uses stable algorithms (preserving the order of records)
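The update strategy above can be imitated directly (an illustrative sketch, not the paper's change-propagation machinery): when a new cell arrives, only the reducers for the keys it mentions are re-run, and the rest of the output is reused unchanged.

```python
# Hypothetical sketch: on insert, re-reduce only the keys touched by
# the new input cell, leaving the rest of the output untouched.
from collections import defaultdict

class IncrementalWordCount:
    def __init__(self):
        self.groups = defaultdict(list)   # key -> list of mapped values
        self.output = {}                  # key -> reduced value

    def insert(self, line):
        touched = set()
        for w in line.split():            # map only the new cell
            self.groups[w].append(1)
            touched.add(w)
        for w in touched:                 # re-run reducers for affected keys only
            self.output[w] = sum(self.groups[w])

wc = IncrementalWordCount()
wc.insert("to be or not to be")
assert wc.output["to"] == 2 and wc.output["be"] == 2
wc.insert("to go")
assert wc.output["to"] == 3
```

Because an insert changes the value of existing key-value pairs but never the shape of the output list, no structural changes are needed, which is what makes the Θ(1)-per-update behaviour possible.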

9 / 13 Evaluation – The Input
- Data from DBpedia ( lines, words)
- Test setup: a single node with a 2 GHz Intel Xeon and 64 GB RAM

10 / 13 Evaluation – Regular MapReduce in SML
- Data filling up the fast cache

11 / 13 Evaluation – Self-Adjusting MapReduce

12 / 13 Evaluation – Results
Updating the whole data vs. updating the delta
- Non-self-adjusting MapReduce takes 12.5 hours
- Self-adjusting MapReduce takes 1.7 minutes: a ~440x speedup
Memory usage
- Self-adjusting MapReduce records the dynamic dependency graph in memory
- It uses up to 16 GB of memory, vs. 350 MB for standard MapReduce: a ~45x memory cost

13 / 13 Conclusion
Contribution
- Applies self-adjusting computation to dynamically changing big data
- Much faster than non-self-adjusting methods, albeit with a trade-off in memory usage
Future work
- Memory optimisation: reduce memory consumption
- Increase the granularity of dependency tracking
- Generalise the results to distributed settings, and beyond the MapReduce model