BigData@polito - Inter-departmental Lab
Idilio Drago / Marco Mellia

Outline
Introduction to the BigData@Polito lab
The Big Data cluster
  Hardware
  Software
  Basic benchmark
How to access and use the cluster?
Examples of current usage of the cluster

BigData@Polito Lab – Why?
When the data is such that processing it becomes part of the challenge
  Volume, velocity, variety etc.
Extract some useful knowledge
  Data mining, machine learning, clustering …
Big data cluster
  Open, flexible, scalable
  Based on open-source
For experimental activities
  Research
  Teaching

Big data vs HPC
Big data
  Focus on storage
  Simple operations on large data
  Embarrassingly parallel tasks
  Divide and conquer principle
  Move code where data is located
  Scale measured in PB
HPC
  Focus on fast computing - cores, RAM, GHz, …
  Message passing etc.
  Move superfast little data to superfast CPUs
  Scale measured in TFLOPS

BigData@Polito Lab
Involved departments
  DET, DAUIN, DISMA, DIGEP
Physical cluster location
  Aula T – Ing. del Cinema
Scientific committee members
  Mellia Marco - Telecommunication Networks Group, DET
  Baralis Elena - Database and Data Mining Group, DAUIN
  Paolucci Emilio, Neirotti Paolo - DIGEP
  Mauro Gasparini, Vaccarino Francesco - DISMA
  Michiardi Pietro - Distributed Systems Group, EURECOM (France)

History

Key ideas of big data frameworks
Data locality principle
  Move algorithms to the data, not data to the algorithms
Failures are the norm, not the exception
  The framework takes care of splitting data, synchronizing tasks, and recovering from failures of a task or a server
Data-intensive workloads
  MapReduce → a batch processing framework designed to perform full reads of the input, thus avoiding random access
Horizontal scalability based on commodity servers
  e.g., doubling the number of servers ideally halves the processing time

Map Reduce – Toy example
How often a word appears in a collection of documents?
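The toy example can be sketched in plain Python to show the three MapReduce stages. This is only a single-process illustration: the function names (`map_phase`, `shuffle`, `reduce_phase`) and the sample documents are made up for this sketch, and a real framework would run the map and reduce calls in parallel on the nodes that hold the data.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key (the word)."""
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return groups


def reduce_phase(word, counts):
    """Reduce: sum the partial counts collected for one word."""
    return word, sum(counts)


# Hypothetical input collection, one string per document.
documents = ["the quick brown fox", "the lazy dog", "the fox"]

mapped = chain.from_iterable(map_phase(doc) for doc in documents)
word_counts = dict(reduce_phase(w, c) for w, c in shuffle(mapped).items())
print(word_counts)
```

Each map call only needs its own document and each reduce call only needs the values for one key, which is what makes the job embarrassingly parallel.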

BigData@Polito – The Hardware
4 Switches N3048
18 Workers DELL R720XD
  2 x Intel E5-2630v2, 6 cores
  Memory 96 GB
  12 HDs 3 TB – JBOD
  4+1 GbE Network
12 Workers SuperMicro
  1 x Intel Xeon, 6 cores
  Memory 64 / 32 GB
  5 HDs 2 TB – JBOD
  2+1 GbE Network
Workers in total
  576 logical cores (with HT)
  +2 TB RAM
  276 HDs
  768 TB of storage
  ~45 GB/s “nominal” disk read speed (dd)
3 Masters DELL R620
  2 x Intel E5-2630v2, 6 cores
  Memory 128 GB
  3 HDs 600 GB in RAID
  4+1 GbE Network
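The worker totals follow from the per-node figures, as the short check below shows. It assumes Hyper-Threading exposes two logical cores per physical core, and because the slide does not say how many SuperMicro nodes have 64 GB versus 32 GB, RAM is computed as a range. The dictionary layout and variable names are illustrative only.

```python
# Worker specs copied from the slide. RAM for the SuperMicro nodes is
# listed as "64 / 32 GB", so it is kept as a (min, max) range.
WORKERS = [
    {"name": "DELL R720XD", "nodes": 18, "cpus": 2, "cores_per_cpu": 6,
     "disks": 12, "disk_tb": 3, "ram_gb": (96, 96)},
    {"name": "SuperMicro", "nodes": 12, "cpus": 1, "cores_per_cpu": 6,
     "disks": 5, "disk_tb": 2, "ram_gb": (32, 64)},
]
THREADS_PER_CORE = 2  # assumption: Hyper-Threading, 2 threads per core

logical_cores = sum(
    w["nodes"] * w["cpus"] * w["cores_per_cpu"] * THREADS_PER_CORE
    for w in WORKERS
)
disks = sum(w["nodes"] * w["disks"] for w in WORKERS)
raw_tb = sum(w["nodes"] * w["disks"] * w["disk_tb"] for w in WORKERS)
ram_min_gb = sum(w["nodes"] * w["ram_gb"][0] for w in WORKERS)
ram_max_gb = sum(w["nodes"] * w["ram_gb"][1] for w in WORKERS)

# Nominal aggregate read speed from the slide, spread over all disks.
per_disk_mb_s = 45_000 / disks

print(f"logical cores: {logical_cores}")
print(f"disks: {disks}, raw storage: {raw_tb} TB")
print(f"RAM: between {ram_min_gb} and {ram_max_gb} GB")
print(f"nominal read speed per disk: {per_disk_mb_s:.0f} MB/s")
```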

BigData@Polito – Logic Setup
Link Aggregation w/Bonding (balance-alb)
All machines are connected to both switches in their racks
P2P communication is limited to 1 Gbps

The Software
Based on the Cloudera platform

Architecture
HDFS – Hadoop Distributed File System
YARN – Yet Another Resource Negotiator
Applications: MapReduce, Spark etc.

HDFS: What is the usable disk capacity?
Replication set to 3 – the client writes blocks to its own node first, then the other rack is used for a second and a third copy
Therefore our cluster's actual capacity is 256 TB (768 TB raw / 3 replicas)
Replicas guarantee resilience to disk failures (and we had some already)
They give flexibility in the allocation of executors
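The usable-capacity figure is simply raw storage divided by the replication factor. The sketch below uses the 768 TB and replication of 3 stated in the slides; the helper names and the 10 TB dataset are hypothetical, and the calculation ignores space reserved for non-HDFS use or temporary data.

```python
RAW_TB = 768      # raw worker storage, from the hardware slide
REPLICATION = 3   # HDFS replication factor, from this slide


def usable_capacity_tb(raw_tb, replication):
    """Logical data that fits when every block is stored `replication` times."""
    return raw_tb / replication


def raw_footprint_tb(dataset_tb, replication):
    """Raw disk space consumed by a dataset of the given logical size."""
    return dataset_tb * replication


usable_tb = usable_capacity_tb(RAW_TB, REPLICATION)
footprint_tb = raw_footprint_tb(10, REPLICATION)  # hypothetical 10 TB dataset

print(f"usable capacity: {usable_tb:.0f} TB")
print(f"a 10 TB dataset occupies {footprint_tb} TB of raw disk")
```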

YARN: How are the resources shared?
Scheduling Policy Preemption

YARN: How are the resources shared?
Dominant Resource Fairness: equalizes the “dominant share” of users
  Host: <9 CPU, 18 GB>
  Task User 1: <1 CPU, 4 GB>, dominant resource: memory
  Task User 2: <3 CPU, 1 GB>, dominant resource: CPU
Preemption occurs after 2 min
  It is normal to wait some time before a job starts running
  It is normal to see containers being killed
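The host and task sizes above can be run through a small simulation of Dominant Resource Fairness. This is a minimal sketch of the allocation idea (always serve the user with the lowest dominant share next), not YARN's actual scheduler code. The function name `drf_allocate`, the user labels, and the rule of dropping a user once their next task no longer fits are assumptions made for the illustration.

```python
from fractions import Fraction


def drf_allocate(capacity, demands):
    """Hand out tasks one at a time to the user with the lowest dominant share.

    capacity: total resources, e.g. (cpus, memory_gb)
    demands:  per-user resources needed by one task, same order as capacity
    """
    used = [0] * len(capacity)
    tasks = {user: 0 for user in demands}
    active = set(demands)

    def dominant_share(user):
        # Largest fraction of any single resource held by this user.
        return max(
            Fraction(tasks[user] * d, c)
            for d, c in zip(demands[user], capacity)
        )

    while active:
        # sorted() makes tie-breaking deterministic.
        user = min(sorted(active), key=dominant_share)
        demand = demands[user]
        if all(u + d <= c for u, d, c in zip(used, demand, capacity)):
            used = [u + d for u, d in zip(used, demand)]
            tasks[user] += 1
        else:
            # Simplification: a user whose next task does not fit is done.
            active.remove(user)

    shares = {user: dominant_share(user) for user in demands}
    return tasks, shares


capacity = (9, 18)  # <9 CPU, 18 GB>
demands = {"user1": (1, 4), "user2": (3, 1)}
tasks, shares = drf_allocate(capacity, demands)
print(tasks)
print(shares)
```

With these inputs user 1 ends up with three tasks (12 of 18 GB) and user 2 with two tasks (6 of 9 CPUs), so both hold a dominant share of 2/3.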

Spark applications

MLlib algorithms

Example – Spark execution overview
The application creates a driver process
The application gets its executor processes
It sends the code and tasks to the executors
Our current setup allows applications to have more than 500 executors (500+ threads reading and processing the data in parallel)

Raw HDFS read speed
Because of overhead, the cluster can read up to 13 GB/s (without any processing)

Terasort
Roughly, this cluster can sort 1 TB in ~10 min (mapred)


How do I request a user account?
First: Is this cluster/framework the best solution?
The cluster has an independent LDAP/Kerberos system controlling access and HDFS permissions
Contact the person responsible in your department
  DET: Marco Mellia, Maurizio Munafò, Idilio Drago, …
  DAUIN: Elena Baralis, Paolo Garza, …
  …
Fill in the form available at

How do I use the cluster?
Go to

Research
Scope: New algorithms and data science
TRANSPORT LAYER: Analysis of network traffic in real-time
APPLICATION LAYER: Analysis of OSN contents
Traffic classification, engineering
Network security (e.g., malware detection)
User and community profiling
Recommendation systems

Teaching
Computer Engineering MS current offering
  Data Mining
  Artificial Intelligence
  Big Data Management
New track on Data Science
  Data Modeling + Data Engineering + Software engineering + Data Mining & Analytics

Questions?
