Running virtualized Hadoop, does it make sense?

Running virtualized Hadoop, does it make sense?
Kacper Surdy, Zbigniew Baranowski
HEPiX Spring 2016

Intro
I work for the IT databases group at CERN, where we are involved in running the Hadoop service. My responsibilities include exploring potential service evolution, providing consultancy to the users, and day-to-day administration.

Motivations for using VMs
Easy scaling of the service up or down; a unified procurement procedure; clusters with configurations optimized for individual needs; potential user separation.

Generic VM characteristics
Fast CPU and memory, with SSDs available. Storage is more constrained: we have to choose between local disks shared with other VMs running on the same hypervisor, network-attached volumes, or an external HDFS installation.

What about data locality and the emphasis on high throughput? This is the classic trade-off between shared-storage and shared-nothing architectures.

Particular, but popular, applications
For some workloads, distributed processing matters more than data locality: analytics (Spark, Impala, MapReduce), NoSQL databases (HBase), and indexing engines (Solr). Data scanning is CPU intensive: parsing data while reading/writing, plus data decoding and decompression.
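The CPU cost of scanning can be seen even in a toy example: before a full scan can evaluate any predicate, every record must be decompressed and parsed. A minimal stdlib sketch (the data, column layout, and predicate are invented for illustration):

```python
import csv, gzip, io

# Build a small gzip-compressed CSV in memory (stands in for a
# compressed block on HDFS; columns are made up for illustration).
rows = [("evt%06d" % i, i % 7, i * 0.5) for i in range(10_000)]
buf = io.BytesIO()
with gzip.open(buf, "wt", newline="") as f:
    csv.writer(f).writerows(rows)

# A full scan must decompress AND parse every record before it can
# filter -- this is the CPU work that dominates once disks are fast.
buf.seek(0)
matches = 0
with gzip.open(buf, "rt", newline="") as f:
    for rec in csv.reader(f):
        if int(rec[1]) == 3:   # predicate evaluated per record
            matches += 1
print(matches)
```

With fast SSD storage underneath, it is exactly this per-record decompress-and-parse loop that becomes the bottleneck, which matches the CPU-bound results shown later.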

Benchmarks
Sequential reads: custom queries on Atlas Event Index data using Impala, an SQL query engine for Hadoop, stressing throughput and smart data scanning. Key-value store workloads: the Yahoo! Cloud Serving Benchmark (YCSB) against HBase, a non-relational key-value data store on HDFS, with potential caching and in-memory operations.

Test clusters
VM cluster on CERN OpenStack with HDDs: 4 nodes, m1.large flavour; per node: 4 VCPUs (Nehalem-C 2.26 GHz), 8 GB RAM, 80 GB disk.
VM cluster on CERN OpenStack with SSDs: 4 nodes, m2.large flavour; per node: 4 VCPUs (Haswell 2.4 GHz), 7.3 GB RAM, 40 GB disk.
Physical production cluster: 14 nodes; per node: 16 cores (2x Ivy Bridge-EP 2.6 GHz), 64 GB RAM, 173 TiB (48x 4 TB disks).

Big Data processing case
Generating a report for a given collection: prune the partitions (terabytes of data reduced to the gigabytes of interest), fully scan the selected partition, and present the report. (Diagram shows the stream partitions: physics_ZeroBias, physics_MinBias, physics_CosmicMuons, physics_CosmicCalo, physics_Main.)
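The prune-then-scan flow above can be sketched in a few lines. This is a hypothetical illustration only: the partition layout, record fields, and the report function are invented, not the actual Atlas Event Index schema.

```python
# Stream-named partitions, as in the diagram; sizes are made up.
partitions = {
    "physics_Main":       [{"run": 1, "size_gb": 900}],
    "physics_MinBias":    [{"run": 2, "size_gb": 40}],
    "physics_CosmicCalo": [{"run": 3, "size_gb": 15}],
}

def report(stream):
    # Step 1: partition pruning -- terabytes reduced to the GBs we need.
    selected = partitions.get(stream, [])
    # Step 2: full scan of the surviving partition only.
    return sum(rec["size_gb"] for rec in selected)

print(report("physics_MinBias"))  # only the MinBias partition is scanned
```

The point of the sketch is that only step 2 touches the data, so the cost of the report is driven by the size of the pruned selection, not the full dataset.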

Sequential scanning of a dataset
Looking for events of interest in a dataset by performing a full scan. The input data set (after preselection) contains 40M records; highly compressed it occupies 12 GB, while in CSV format it occupies 74 GB.

Scanning performance results: cluster on virtual machines with HDDs

                        VMs on HDDs   VMs + remote HDFS   VMs on SSDs   Physical
  Scan time             59 sec        43 sec              9 sec         6 sec
  Scan speed            660K rows/s   916K rows/s         4.3M rows/s   6.6M rows/s
  Throughput (total)    210 MB/s      287 MB/s            1.3 GB/s      2.05 GB/s
  Throughput per node   52 MB/s       71 MB/s             343 MB/s      147 MB/s

Processing is throttled by IO performance: 50 MB/s per VM, with CPUs only 50% utilized.
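The reported figures are internally consistent, which is worth checking: 40M rows divided by the scan time reproduces the scan speed, and throughput times scan time reproduces the 12 GB compressed input volume. A quick arithmetic cross-check using the numbers from the slides:

```python
rows = 40_000_000      # records in the test dataset
compressed_gb = 12     # compressed input volume

# (scan time in s, reported rows/s, reported total MB/s) per setup
results = {
    "VMs on HDDs":       (59, 660e3, 210),
    "VMs + remote HDFS": (43, 916e3, 287),
    "VMs on SSDs":       (9,  4.3e6, 1300),
    "Physical":          (6,  6.6e6, 2050),
}

for name, (t, rate, mbps) in results.items():
    implied_rate = rows / t          # row rate implied by 40M rows / time
    implied_gb = mbps * t / 1000     # data volume implied by MB/s * time
    print(f"{name}: {implied_rate/1e6:.2f}M rows/s, ~{implied_gb:.0f} GB read")
```

Every setup reads roughly the 12 GB compressed volume, confirming that the scans operate on the compressed data and that the differences between setups come from IO and CPU limits, not from reading different amounts of data.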

Scanning performance results: VM cluster with remote HDFS attached (same results table as above)
Processing is throttled by network performance: 70 MB/s per VM, with CPUs 60% utilized.

Scanning performance results: cluster on virtual machines with SSDs (same results table as above)
Processing is throttled by CPU: all 4 VM cores are utilised, with IO throughput of 343 MB/s per VM.

Scanning performance results: the physical cluster was underutilised (same results table as above)
The test dataset was too small to make use of all the resources available: processing 76 files occupied 76 cores, one third of the total.

Column-wise analytics
Some data file formats store the data in a column-oriented way (e.g. Parquet). This reduces the amount of data to be read to the subsets of interest. (Diagram shows the properties to be scanned across the stream partitions: physics_ZeroBias, physics_MinBias, physics_CosmicMuons, physics_CosmicCalo, physics_Main.)
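The benefit of column orientation can be illustrated without Parquet itself: serialize the same records row-wise and column-wise, then compare how many bytes must be read to answer a single-column question. A toy stdlib sketch (field names and values are invented):

```python
import json

records = [{"run": i, "energy": round(i * 0.1, 1), "stream": "physics_Main"}
           for i in range(1000)]

# Row-oriented: one blob per record; scanning "energy" reads everything.
row_blobs = [json.dumps(r).encode() for r in records]
row_bytes_read = sum(len(b) for b in row_blobs)

# Column-oriented: one blob per column; we read only the column we need.
columns = {k: json.dumps([r[k] for r in records]).encode()
           for k in ("run", "energy", "stream")}
col_bytes_read = len(columns["energy"])

print(row_bytes_read, col_bytes_read)  # columnar reads far fewer bytes
```

Real columnar formats add per-column compression and statistics on top of this layout, which widens the gap further; this is what makes the reporting workloads on the next slide fast.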

Column-wise analytics: performance
Reporting workload (counting, averaging, min, max, stdev); single-column value extraction.

Recap on analytic workloads
Throughput of sequential scanning on the virtualized clusters was close to expectations, bound by the performance of the underlying hypervisor storage: typically 50 MB/s per VM with HDDs. Accessing the data over the network improved IO performance to 70 MB/s per VM and offers additional capacity. Notably high throughput (400 MB/s per VM) can be obtained when hypervisors are equipped with SSDs. Satisfying performance has been observed for reporting workloads when the input data can be pruned down to the level of megabytes.

Key-value data store case
Different from the analytics/reporting workload: only a small amount of data is written or retrieved at a time. HBase was used for the tests, being widely used and strongly linked to the Hadoop environment. It performs additional processing on data structures before writing to HDFS (sorting, compression, additional metadata) as well as caching.

Yahoo! Cloud Serving Benchmark
A popular "Big Data" benchmarking suite. Test dataset: 10 million records, 14.6 GB on HDFS. Test workloads: operations on 1 million records, with a varying ratio of read to write operations.
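The essence of a YCSB run is a driver issuing a fixed number of operations with a configurable read/write mix. A minimal stdlib sketch of that idea, with a dict standing in for the store (the real benchmark drives HBase; the function and key names here are invented):

```python
import random

def run_workload(store, n_ops, read_fraction, seed=42):
    """Issue n_ops operations, choosing read vs write per operation."""
    rng = random.Random(seed)
    reads = writes = 0
    for i in range(n_ops):
        key = f"user{rng.randrange(10_000)}"
        if rng.random() < read_fraction:
            store.get(key)            # read op (may miss)
            reads += 1
        else:
            store[key] = f"value{i}"  # write/update op
            writes += 1
    return reads, writes

store = {}
# YCSB workload A is roughly 50/50 update/read; workload B is ~95% reads.
r, w = run_workload(store, 100_000, read_fraction=0.95)
print(r, w)
```

Sweeping `read_fraction` across runs is what produces the read-heavy versus write-heavy comparison reported on the following slides.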

YCSB results – total throughput per cluster
The 4-node virtualized SSD cluster performs comparably to the 14-node HDD production one. An unusual degradation of performance was observed on read-intensive workloads.

YCSB results – throughput per node
The virtualized SSD cluster outperforms all others in the per-node comparison. The virtualized HDD cluster shows good per-node performance in write-heavy applications.

HBase: conclusions
A small virtual installation on SSDs outperforms our current production cluster, though obviously with orders of magnitude less capacity. A virtual installation on HDDs can be suitable for write-mostly workloads: its per-node performance is comparable to the physical cluster, and it can serve up to 5500 write ops/sec but only up to 1000 read ops/sec.

Summary of use cases for Hadoop on VMs
VMs on hypervisors with HDDs: development and functional testing; CPU-intensive reporting workloads on highly compressed data; NoSQL databases not needing high performance.
VMs on hypervisors with SSDs: performance testing of new components; small but high-throughput systems.

Future work
Run similar tests on network-attached Cinder volumes. Test other workloads and components: streaming data (Spark) and machine learning (Spark). Invite Hadoop users to try out the virtual Hadoop infrastructure. We plan to use this infrastructure to evaluate new processing frameworks for Hadoop: Apache Flink and Cloudera Kudu.

Acknowledgments
Many thanks to Bernd Panzer-Steindel and Arne Wiebalck for help with allocating the test hardware, and to the Hadoop Service at CERN, the Hadoop users' forum and the analytics working group. https://indico.cern.ch/category/5894/

General Observations
Out of a single VM we could get a throughput of 60 MB/s reading from a local HDD, 500 MB/s reading from a local SSD, and 80-90 MB/s reading over the network. VMs in CERN OpenStack demonstrated very stable performance.