Big Data Engineering: Recent Performance Enhancements in JVM-based Frameworks
Mayuresh Kunjir


Modern Data Analytics Frameworks
- Process external data: great for unstructured data
- Easy to scale out: suited for cloud infrastructure
- Fast recovery: suited for commodity nodes
- JVM based: better adaptability
Examples: Hadoop, HBase, Spark, Flink

How do they stack up?
Modern data analytics frameworks (Hadoop, Spark, Flink) vs. relational and MPP databases (Netezza, Pivotal, Vertica)
Compared on: extensibility, fault tolerance, scalability, performance, expressiveness
- SQL-on-Hadoop solutions bring the power of SQL to modern data analytics
- Java vs. C?

Why can’t performance match?
1. Java objects have storage overheads
2. Garbage collection: “stop the world” pauses

JVM Memory Management
Young generation
- New objects are created here.
- Parallel GC is run frequently.
- GC “tenures” objects that are alive for a long time.
Old generation
- GC is run less frequently and takes longer.
- Uses Concurrent Mark Sweep (CMS) or G1 GC.
- Capable of compacting memory to avoid memory fragmentation.
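To see which collectors and heap pools a given JVM is actually running with, the standard java.lang.management API can be queried at runtime. The following is a minimal sketch; the class name GcInspector and its method names are made up for illustration, and the collector and pool names it reports depend on the JVM version and the GC flags in use.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not from the slides): reports GC activity of the running JVM.
final class GcInspector {

    // One line per collector: its name, how often it ran, and accumulated time in ms.
    static List<String> collectorSummaries() {
        List<String> out = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            out.add(gc.getName()
                    + ": runs=" + gc.getCollectionCount()
                    + ", totalMillis=" + gc.getCollectionTime());
        }
        return out;
    }

    // Names of the heap memory pools (young and old generation regions).
    static List<String> heapPoolNames() {
        List<String> out = new ArrayList<>();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() == MemoryType.HEAP) {
                out.add(pool.getName());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        collectorSummaries().forEach(System.out::println);
        heapPoolNames().forEach(name -> System.out.println("heap pool: " + name));
    }
}
```

Running this before and after a workload gives a rough picture of how much time went to young versus old generation collections.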

In-memory processing in conflict with GC
- Spark caches RDDs in memory between iterations.
- This implies less memory for the user program’s custom objects.
- This implies more strain on garbage collection.
Performance guideline: Keep as little data in the JVM-managed heap as possible.

Serialization schemes
Performance guideline: Rather than generic serialization schemes, build semantic and schema-specific schemes.

Enhancement #1: Custom Serialization
- Store the exact schema only once per dataset.
- Store byte streams per tuple, with offsets to instance attributes.
- Example: Project Tungsten [4]
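The idea can be sketched in a few lines of Java. This is an illustration only, not Tungsten's actual row format: the class CompactRow, its three-field schema, and the byte layout are assumptions chosen for the example. The schema is held once in a static array, and each row is a byte array with fixed-width slots plus an (offset, length) pair pointing at the variable-length field.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical row format (an assumption for illustration, not Tungsten's layout).
final class CompactRow {
    // The schema is stored once per dataset, never per row.
    static final String[] SCHEMA = {"id:int", "score:double", "name:string"};

    private static final int ID_OFFSET = 0;      // 4-byte int
    private static final int SCORE_OFFSET = 4;   // 8-byte double
    private static final int NAME_SLOT = 12;     // 4-byte offset followed by 4-byte length
    private static final int FIXED_SIZE = 20;    // variable-length data starts here

    static byte[] encode(int id, double score, String name) {
        byte[] nameBytes = name.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(FIXED_SIZE + nameBytes.length);
        buf.putInt(ID_OFFSET, id);
        buf.putDouble(SCORE_OFFSET, score);
        buf.putInt(NAME_SLOT, FIXED_SIZE);            // where the name bytes begin
        buf.putInt(NAME_SLOT + 4, nameBytes.length);  // how many bytes the name takes
        buf.position(FIXED_SIZE);
        buf.put(nameBytes);
        return buf.array();
    }

    // Each reader touches only the bytes of the attribute it needs.
    static int readId(byte[] row) {
        return ByteBuffer.wrap(row).getInt(ID_OFFSET);
    }

    static double readScore(byte[] row) {
        return ByteBuffer.wrap(row).getDouble(SCORE_OFFSET);
    }

    static String readName(byte[] row) {
        ByteBuffer buf = ByteBuffer.wrap(row);
        int offset = buf.getInt(NAME_SLOT);
        int length = buf.getInt(NAME_SLOT + 4);
        return new String(row, offset, length, StandardCharsets.UTF_8);
    }
}
```

Because every attribute sits at a known offset, a single field can be read without deserializing the whole tuple into Java objects.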

Serialization in Flink [3]
- The custom class ‘TypeInformation’ represents any data type.
- Each implementation of TypeInformation provides a custom serializer.
- Example: serializing a Tuple3 where Person is a POJO.

Enhancement #2: Custom Memory Management
- HBase is used in latency-sensitive applications, where GC pauses are harmful.
- The HBase MemStore allocates memory in chunks of 2MB, which makes GC sweeps more efficient. [2]
- MemStore-Local Allocation Buffers help avoid memory fragmentation. [2]
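A chunk-based allocator of this kind can be sketched as follows. This is a simplified illustration, not HBase's implementation: the class ChunkAllocator and its Slice type are invented names, and details such as thread safety and special handling of very large values are left out. Small values are copied into a shared 2MB array with a bump pointer, so the heap holds a few large arrays rather than many scattered small ones.

```java
// Hypothetical sketch of a MemStore-style chunk allocator (not HBase code).
final class ChunkAllocator {
    static final int CHUNK_SIZE = 2 * 1024 * 1024; // 2MB, as on the slide

    // A view of bytes that live inside a shared chunk.
    static final class Slice {
        final byte[] chunk;
        final int offset;
        final int length;

        Slice(byte[] chunk, int offset, int length) {
            this.chunk = chunk;
            this.offset = offset;
            this.length = length;
        }
    }

    private byte[] current = new byte[CHUNK_SIZE];
    private int next = 0; // bump pointer into the current chunk

    // Copies data into the current chunk, opening a new chunk when it does not fit.
    Slice copyInto(byte[] data) {
        if (data.length > CHUNK_SIZE) {
            throw new IllegalArgumentException("value larger than a chunk");
        }
        if (next + data.length > CHUNK_SIZE) {
            current = new byte[CHUNK_SIZE]; // old chunk stays alive while slices reference it
            next = 0;
        }
        System.arraycopy(data, 0, current, next, data.length);
        Slice slice = new Slice(current, next, data.length);
        next += data.length;
        return slice;
    }
}
```

When all slices of a chunk become unreachable, the whole 2MB array is freed at once, which leaves large contiguous holes instead of fine-grained fragmentation.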

Flink Managed Memory
- A pool of 32KB buffers is managed by the MemoryManager and never released to GC. [3]
- No major GC ever takes place.
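The pooling idea can be illustrated with a small sketch. It is not Flink's MemoryManager: the class SegmentPool and its methods are hypothetical, and plain byte arrays stand in for Flink's memory segments. All buffers are allocated up front and recycled, so they stay live for the lifetime of the pool and the collector never has to reclaim them.

```java
import java.util.ArrayDeque;

// Hypothetical fixed-size buffer pool (an illustration, not Flink's MemoryManager).
final class SegmentPool {
    static final int SEGMENT_SIZE = 32 * 1024; // 32KB, as on the slide

    private final ArrayDeque<byte[]> free = new ArrayDeque<>();

    // Allocates every segment once, at construction time.
    SegmentPool(int segments) {
        for (int i = 0; i < segments; i++) {
            free.push(new byte[SEGMENT_SIZE]);
        }
    }

    // Hands out a pooled segment; an exhausted pool is the caller's signal to spill to disk.
    byte[] request() {
        byte[] segment = free.poll();
        if (segment == null) {
            throw new IllegalStateException("pool exhausted");
        }
        return segment;
    }

    // Returns a segment to the pool instead of dropping the reference.
    void release(byte[] segment) {
        if (segment.length != SEGMENT_SIZE) {
            throw new IllegalArgumentException("not a pool segment");
        }
        free.push(segment);
    }

    int available() {
        return free.size();
    }
}
```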

Utilizing off-heap memory
- The java.nio package and the sun.misc.Unsafe class allow C-style memory management in Java.
- Project Tachyon supports in-memory storage for Spark RDDs in off-heap space. [1]
- Project Tungsten supports storing shuffle objects in off-heap space. [4]
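As a small illustration of the java.nio route, the sketch below keeps an array of long values in a direct ByteBuffer, whose contents live outside the garbage-collected heap. The class name OffHeapLongArray is invented for this example; it does not use sun.misc.Unsafe and is not taken from Tachyon or Tungsten.

```java
import java.nio.ByteBuffer;

// Hypothetical off-heap array of longs backed by a direct ByteBuffer.
final class OffHeapLongArray {
    private final ByteBuffer buffer;

    OffHeapLongArray(int size) {
        // allocateDirect reserves zero-filled native memory outside the Java heap.
        this.buffer = ByteBuffer.allocateDirect(Math.multiplyExact(size, Long.BYTES));
    }

    // Absolute put: writes 8 bytes at the element's byte offset.
    void set(int index, long value) {
        buffer.putLong(index * Long.BYTES, value);
    }

    // Absolute get: reads 8 bytes at the element's byte offset.
    long get(int index) {
        return buffer.getLong(index * Long.BYTES);
    }

    boolean isOffHeap() {
        return buffer.isDirect();
    }
}
```

Only the small ByteBuffer wrapper object is visible to the collector; the data itself is not scanned or moved during GC.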

Enhancement #3: Cache-Sensitive Data Structures
Example: the sort buffer in Flink [3]
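One way to make a sort cache-friendly is to sort compact, fixed-width (key, pointer) entries stored contiguously, rather than following references to full records. The sketch below is an assumption-laden illustration, not Flink's sort buffer: the class PointerSort is a made-up name, keys are plain ints, and each entry packs the key into the upper 32 bits of a long and the record index into the lower 32 bits.

```java
import java.util.Arrays;

// Hypothetical sketch: sort packed (key, pointer) entries instead of record objects.
final class PointerSort {

    // Returns record indices in ascending key order; ties keep their original order.
    static int[] sortedOrder(int[] keys) {
        long[] entries = new long[keys.length];
        for (int i = 0; i < keys.length; i++) {
            // Upper 32 bits: signed key. Lower 32 bits: index of the record.
            entries[i] = ((long) keys[i] << 32) | (i & 0xFFFFFFFFL);
        }
        // Comparisons touch only this contiguous long[]; no record is dereferenced.
        Arrays.sort(entries);
        int[] order = new int[keys.length];
        for (int i = 0; i < entries.length; i++) {
            order[i] = (int) entries[i]; // low 32 bits hold the record index
        }
        return order;
    }
}
```

With longer keys, only a prefix fits in the entry, and a tie on the prefix would require a fallback comparison of the full records; the sketch avoids that case by using keys that fit entirely.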

Case study: Sort in Spark
- Spark won the Daytona GraySort contest.
- Compared to Yahoo’s earlier record for 100TB, Spark sorted the same data 3X faster using 10X fewer machines.
- Spark also sorted 1PB on 190 machines in under 4 hours. Earlier record: 3800 machines, 16 hours.

Sort in Spark: Optimizations
- Sort-based shuffle: lower memory overhead compared to hash-based shuffle
- New network module: uses JNI and bypasses the JVM’s memory allocator
- External shuffle service: shuffle continues even during GC pauses
- TimSort: performs better than Quicksort for most real-world datasets
- Cache locality

Are we there yet?
Alexey Grishchenko [6] compared the performance of Spark, with all sun.misc.Unsafe magic enabled, against Pivotal HAWQ (written in C). The runtimes:

             Spark      Pivotal HAWQ
First run    7.33 sec   0.25 sec
Second run   2.12 sec
Third run    2.04 sec

* Tested using the query “select a, avg(b) from test group by a order by a;”
Huge gap still!

What are we doing?
- Memory management in Spark
- Understanding the impact of various memory tuning parameters on GC

References
1. Big Data Performance Engineering: Examples from Hadoop, Pig, HBase, Flink and Spark
2. Avoiding Full GCs in Apache HBase with MemStore-Local Allocation Buffers (three-part series)
3. Juggling with Bits and Bytes: Flink memory management
4. Project Tungsten: Bringing Spark Closer to Bare Metal
5. Spark the fastest open source engine for sorting a petabyte
6. Spark DataFrames are faster, aren’t they?
7. Byte Buffers and Non-Heap Memory
8. JavaOne 2013: Memory Efficient Java
9. Java Garbage Collection Basics