Big Data Engineering: Recent Performance Enhancements in JVM-based Frameworks. Mayuresh Kunjir.

Modern Data Analytics Frameworks. Process external data: great for unstructured data. Easy to scale out: suited for cloud infrastructure. Fast recovery: suited for commodity nodes. JVM based: better adaptability. Examples: Hadoop, HBase, Spark, Flink.

How do they stack up? Modern data analytics frameworks (Hadoop, Spark, Flink) vs. relational and MPP databases (Netezza, Pivotal, Vertica). Dimensions compared: extensibility, fault tolerance, scalability, performance, expressiveness. SQL-on-Hadoop solutions bring the power of SQL to modern data analytics. Java vs. C?

Why can’t performance match? 1. Java objects have storage overheads. 2. Garbage collection causes “stop the world” pauses in the JVM.

JVM Memory Management. Young generation: new objects are created here; parallel GC is run frequently; GC “tenures” objects that stay alive for a long time, moving them to the old generation. Old generation: GC is run less frequently and takes longer; it uses Concurrent Mark Sweep (CMS) or G1 GC. Old-generation collection is capable of compacting memory to avoid fragmentation.
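
One way to see the generations at work is to ask the running JVM which collectors it has and how often each has run. The following is a minimal sketch using the standard `java.lang.management` API; the class name `GcActivitySketch` and the allocation loop are illustrative only, and the collector names printed depend on the JVM and the GC flags in use.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of any framework discussed here.
final class GcActivitySketch {

    /** Returns one summary line per collector that the running JVM reports. */
    static List<String> describeCollectors() {
        List<String> lines = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            lines.add(gc.getName() + ": " + gc.getCollectionCount()
                    + " collections, " + gc.getCollectionTime() + " ms");
        }
        return lines;
    }

    public static void main(String[] args) {
        // Allocate short-lived arrays; this will usually (not always, the JIT
        // may optimize it away) give the young-generation collector some work.
        long checksum = 0;
        for (int i = 0; i < 2_000_000; i++) {
            byte[] garbage = new byte[128];
            checksum += garbage.length;
        }
        System.out.println("allocated bytes: " + checksum);
        describeCollectors().forEach(System.out::println);
    }
}
```

On a typical HotSpot JVM the output lists a young-generation collector and an old-generation collector separately, which makes it easy to check whether a workload is triggering the expensive old-generation collections.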

In-memory processing in conflict with GC. Spark caches RDDs in memory between iterations. This implies less memory for the user program’s custom objects and more strain on garbage collection. Performance guideline: keep as little data in the JVM-managed heap as possible.

Serialization schemes. Performance guideline: rather than generic serialization schemes, build semantic, schema-specific schemes.

Enhancement #1: Custom Serialization. Store the exact schema only once per dataset. Store byte streams per tuple, with offsets to instance attributes. See Project Tungsten [4].
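
The idea of storing the schema once and keeping only bytes per tuple can be sketched as follows. This is an illustration under stated assumptions, not Tungsten's actual row format: the class `RowLayoutSketch`, the three-field schema (an int id, a double score, a UTF-8 name), and the offset constants are all hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical row layout; the schema lives in these constants, once per dataset.
final class RowLayoutSketch {
    static final int ID_OFFSET = 0;        // int, 4 bytes
    static final int SCORE_OFFSET = 4;     // double, 8 bytes
    static final int NAME_LEN_OFFSET = 12; // int length of the name, 4 bytes
    static final int NAME_OFFSET = 16;     // variable-length UTF-8 bytes

    /** Encodes one tuple as raw bytes; no class or field names are written. */
    static byte[] encode(int id, double score, String name) {
        byte[] nameBytes = name.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(NAME_OFFSET + nameBytes.length);
        buf.putInt(ID_OFFSET, id);
        buf.putDouble(SCORE_OFFSET, score);
        buf.putInt(NAME_LEN_OFFSET, nameBytes.length);
        buf.position(NAME_OFFSET);
        buf.put(nameBytes);
        return buf.array();
    }

    /** Reads a single attribute by offset without deserializing the whole row. */
    static int readId(byte[] row) {
        return ByteBuffer.wrap(row).getInt(ID_OFFSET);
    }

    static double readScore(byte[] row) {
        return ByteBuffer.wrap(row).getDouble(SCORE_OFFSET);
    }

    static String readName(byte[] row) {
        int length = ByteBuffer.wrap(row).getInt(NAME_LEN_OFFSET);
        return new String(row, NAME_OFFSET, length, StandardCharsets.UTF_8);
    }
}
```

Because every attribute sits at a known offset, a query that needs only the id can call `readId` on the byte array and never create a Java object for the tuple.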

Serialization in Flink [3]. The custom class ‘TypeInformation’ represents any data type. Each implementation of TypeInformation provides a custom serializer, e.g., to serialize a Tuple3 in which one field, Person, is a POJO.

Enhancement #2: Custom Memory Management. HBase is used in latency-sensitive applications, where GC delays are hazardous. The HBase MemStore allocates memory in chunks of 2MB, which makes GC sweeps more efficient. [2] MemStore-Local Allocation Buffers help avoid memory fragmentation. [2]
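
The chunk-based allocation described above can be sketched as a simple bump-pointer allocator. This is a simplified illustration, not HBase's implementation: the class `ChunkAllocatorSketch`, its `Slice` type, and the choice to reject values larger than a chunk are assumptions made for brevity.

```java
// Hypothetical allocator that copies small values into large 2 MB chunks.
final class ChunkAllocatorSketch {
    static final int CHUNK_SIZE = 2 * 1024 * 1024; // 2 MB, the chunk size named on the slide

    /** A view into a chunk: backing array, start offset, and length. */
    static final class Slice {
        final byte[] chunk;
        final int offset;
        final int length;

        Slice(byte[] chunk, int offset, int length) {
            this.chunk = chunk;
            this.offset = offset;
            this.length = length;
        }
    }

    private byte[] chunk = new byte[CHUNK_SIZE];
    private int nextFree = 0;
    private int chunksAllocated = 1;

    /** Copies data into the current chunk, starting a fresh chunk when it does not fit. */
    Slice copyInto(byte[] data) {
        if (data.length > CHUNK_SIZE) {
            // Assumption for this sketch: oversized values are simply rejected.
            throw new IllegalArgumentException("value larger than a chunk");
        }
        if (nextFree + data.length > CHUNK_SIZE) {
            chunk = new byte[CHUNK_SIZE];
            nextFree = 0;
            chunksAllocated++;
        }
        System.arraycopy(data, 0, chunk, nextFree, data.length);
        Slice slice = new Slice(chunk, nextFree, data.length);
        nextFree += data.length;
        return slice;
    }

    int chunksAllocated() {
        return chunksAllocated;
    }
}
```

Since many small values share one large array, the heap holds a few big, equally sized objects instead of many small ones of varying size, which is what limits fragmentation when they are eventually freed together.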

Flink Managed Memory. A pool of 32KB buffers is managed by the MemoryManager and never released to GC. [3] No major GC ever takes place.
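
A pool of fixed-size buffers that are reused rather than garbage collected might look roughly like this. It is a sketch only: `SegmentPoolSketch` and its methods are hypothetical names, not Flink's MemoryManager API, and returning `null` on exhaustion stands in for whatever the real system does when memory runs out.

```java
import java.util.ArrayDeque;

// Hypothetical pool of reusable 32 KB segments, allocated once up front.
final class SegmentPoolSketch {
    static final int SEGMENT_SIZE = 32 * 1024; // 32 KB, the buffer size named on the slide

    private final ArrayDeque<byte[]> free = new ArrayDeque<>();

    SegmentPoolSketch(int segments) {
        for (int i = 0; i < segments; i++) {
            free.push(new byte[SEGMENT_SIZE]);
        }
    }

    /** Hands out a free segment, or null when the pool is exhausted. */
    byte[] request() {
        return free.poll();
    }

    /** Returns a segment to the pool so that it is reused, not collected. */
    void release(byte[] segment) {
        if (segment.length != SEGMENT_SIZE) {
            throw new IllegalArgumentException("not a pool segment");
        }
        free.push(segment);
    }

    int available() {
        return free.size();
    }
}
```

The segments are long-lived objects that the pool keeps referenced for the life of the process, so the collector never has to reclaim them; operators borrow and return them instead of allocating.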

Utilizing off-heap memory. The java.nio package and the sun.misc.Unsafe class allow C-style memory management in Java. Project Tachyon supports in-memory storage for Spark RDDs in off-heap space. [1] Project Tungsten supports storing shuffle objects in off-heap space. [4]
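
As a small illustration of the java.nio route, the sketch below stores values in a direct `ByteBuffer`, whose contents live outside the garbage-collected heap. The class `OffHeapSketch` is hypothetical; it uses only the standard `ByteBuffer.allocateDirect` API and does not show `sun.misc.Unsafe`, which is an internal class.

```java
import java.nio.ByteBuffer;

// Hypothetical example: write and read long values in off-heap memory.
final class OffHeapSketch {

    /** Stores 0..count-1 in a direct buffer and returns their sum. */
    static long sumOffHeap(int count) {
        // The buffer's backing memory is allocated outside the Java heap.
        ByteBuffer buf = ByteBuffer.allocateDirect(count * Long.BYTES);
        for (int i = 0; i < count; i++) {
            buf.putLong(i * Long.BYTES, i); // absolute write at a byte offset
        }
        long sum = 0;
        for (int i = 0; i < count; i++) {
            sum += buf.getLong(i * Long.BYTES); // absolute read, no object per value
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(sumOffHeap(1000));
    }
}
```

Only the small `ByteBuffer` handle is a heap object; the data itself is not scanned or moved by the collector, which is the property the frameworks above rely on.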

Enhancement #3: Cache-Sensitive Data Structures. Sort buffer in Flink [3].
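
The slide does not spell out the sort buffer's layout. A common cache-friendly approach, and roughly what reference [3] describes, is to sort a compact array of fixed-size (key, pointer) entries instead of the records themselves. The sketch below packs a key and a record index into one `long`; the class `SortBufferSketch` and the restriction to non-negative int keys are assumptions made to keep the example short.

```java
import java.util.Arrays;

// Hypothetical sketch: sort compact (key, index) entries, not the records.
final class SortBufferSketch {

    /** Returns record indices ordered by key; keys must be non-negative. */
    static int[] sortedOrder(int[] keys) {
        long[] entries = new long[keys.length];
        for (int i = 0; i < keys.length; i++) {
            if (keys[i] < 0) {
                throw new IllegalArgumentException("negative key");
            }
            // High 32 bits: the key. Low 32 bits: the record's index ("pointer").
            entries[i] = ((long) keys[i] << 32) | i;
        }
        // Sorting a primitive long[] touches contiguous memory only.
        Arrays.sort(entries);
        int[] order = new int[keys.length];
        for (int i = 0; i < entries.length; i++) {
            order[i] = (int) entries[i]; // keep the low 32 bits, i.e. the index
        }
        return order;
    }
}
```

For keys `{30, 10, 20, 10}` the result is `[1, 3, 2, 0]`: comparisons never follow a pointer to a record, and ties keep their original order because the index breaks them.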

Case study: Sort in Spark. Spark won the Daytona GraySort contest in 2014. Compared to Yahoo’s earlier record for 100TB, Spark sorted the same data 3X faster using 10X fewer machines. Spark also sorted 1PB on 190 machines in under 4 hours; the earlier record used 3800 machines and took 16 hours.

Sort in Spark: Optimizations. Sort-based shuffle: lower memory overhead compared to hash-based shuffle. New network module: uses JNI and bypasses the JVM’s memory allocator. External shuffle service: shuffle continues even during GC pauses. TimSort: performs better than Quicksort for most real-world datasets. Cache locality.

Are we there yet? Alexey Grishchenko [6] compared the performance of Spark, with all the sun.misc.Unsafe magic enabled, against Pivotal HAWQ (written in C). The runtimes:

              Spark       Pivotal HAWQ
First run     7.33 sec    0.25 sec
Second run    2.12 sec
Third run     2.04 sec

* Tested using the query “select a, avg(b) from test group by a order by a;”

Huge gap still!

What are we doing? Memory management in Spark. Understanding the impact of various memory tuning parameters on GC.

References
1. Big Data Performance Engineering: Examples from Hadoop, Pig, HBase, Flink and Spark
2. Avoiding Full GCs in Apache HBase with MemStore-Local Allocation Buffers (three-part series)
3. Juggling with Bits and Bytes: Flink memory management
4. Project Tungsten: Bringing Spark Closer to Bare Metal
5. Spark the fastest open source engine for sorting a petabyte
6. Spark DataFrames are faster, aren’t they?
7. Byte Buffers and Non-Heap Memory
8. JavaOne 2013: Memory Efficient Java
9. Java Garbage Collection Basics
