- Inter-departmental Lab

1 BigData@polito - Inter-departmental Lab
Idilio Drago / Marco Mellia

2 Outline
Introduction to the BigData@Polito lab
The Big Data cluster
  Hardware
  Software
  Basic benchmark
How to access and use the cluster?
Examples of current usage of the cluster

3 BigData@Polito Lab – Why?
When the data is such that processing it becomes part of the challenge
  Volume, velocity, variety etc.
Extract some useful knowledge
  Data mining, machine learning, clustering …
Big data cluster
  Open, flexible, scalable
  Based on open-source
For experimental activities
  Research
  Teaching

4 Big data vs HPC
Big data
  Focus on storage
  Simple operations on large data
  Embarrassingly parallel tasks
  Divide and conquer principle
  Move code where data is located
  PB
HPC
  Focus on fast computing: cores, RAM, GHz, …
  Message passing etc.
  Move superfast little data to superfast CPUs
  TFLOPS

5 BigData@Polito Lab
Involved departments
  DET, DAUIN, DISMA, DIGEP
Physical cluster location
  Aula T – Ing. del Cinema
Scientific committee members
  Mellia Marco - Telecommunication Networks Group, DET
  Baralis Elena - Database and Data Mining Group, DAUIN
  Paolucci Emilio, Neirotti Paolo - DIGEP
  Mauro Gasparini, Vaccarino Francesco - DISMA
  Michiardi Pietro - Distributed Systems Group, EURECOM (France)

6 History

7 Key ideas of big data frameworks
Data locality principle
  Move algorithms to the data, not data to the algorithms
Failures are the norm, not the exception
  The framework takes care of splitting data, synchronizing tasks, recovering in case of failures of a task or a server etc.
Data intensive workloads
  MapReduce → a batch processing framework designed to perform full reads of the input, thus avoiding random access
Horizontal scalability based on commodity servers
  e.g., doubling the number of servers, halving processing time

8 Map Reduce – Toy example
How often does a word appear in a collection of documents?
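The word-count question can be sketched in plain Python to show the three MapReduce stages. This is only an illustration on a single machine: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample documents are assumptions made for this sketch and are not taken from the slides or from the Hadoop API.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key (the word)."""
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return groups


def reduce_phase(word, counts):
    """Reduce: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(documents):
    """Run map, shuffle and reduce in sequence over all documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(w, c) for w, c in shuffle(pairs).items())


# Hypothetical input: three tiny "documents".
documents = ["big data cluster", "big data lab", "data mining"]
counts = word_count(documents)
print(counts)
```

In a real framework the map calls would run on the nodes that hold each document and the shuffle would move data across the network; here everything runs in one process.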

10 BigData@Polito – The Hardware

11 BigData@Polito – The Hardware
4 Switches N3048
18 Workers DELL R720XD
  2 x Intel E5-2630v2, 6 cores
  Memory: 96 GB
  12 HDs, 3 TB – JBOD
  4+1 GbE network
12 Workers SuperMicro
  1 x Intel Xeon, 6 cores
  Memory: 64 / 32 GB
  5 HDs, 2 TB – JBOD
  2+1 GbE network
Workers in total
  576 logical cores (with HT)
  +2 TB RAM
  276 HDs
  768 TB of storage
  ~45 GB/s “nominal” disk read speed (dd)
3 Masters DELL R620
  2 x Intel E5-2630v2, 6 cores
  Memory: 128 GB
  3 HDs, 600 GB, in RAID
  4+1 GbE network
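The aggregate worker figures follow from the per-node specifications. The short calculation below reproduces them; it assumes that hyper-threading doubles the number of logical cores and that every disk in a group has the listed size. The dictionary layout and variable names are illustrative only.

```python
# Worker groups as listed on the slide.
worker_groups = [
    {"name": "DELL R720XD", "nodes": 18, "cpus": 2, "cores_per_cpu": 6,
     "disks": 12, "disk_tb": 3},
    {"name": "SuperMicro", "nodes": 12, "cpus": 1, "cores_per_cpu": 6,
     "disks": 5, "disk_tb": 2},
]

# Assumption: hyper-threading exposes two logical cores per physical core.
THREADS_PER_CORE = 2

logical_cores = sum(
    g["nodes"] * g["cpus"] * g["cores_per_cpu"] * THREADS_PER_CORE
    for g in worker_groups
)
total_disks = sum(g["nodes"] * g["disks"] for g in worker_groups)
raw_storage_tb = sum(g["nodes"] * g["disks"] * g["disk_tb"] for g in worker_groups)

print(logical_cores, total_disks, raw_storage_tb)
```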

12 BigData@Polito – Logic Setup
Link Aggregation w/Bonding (balance-alb)
All machines are connected to both switches in their racks
P2P communication is limited to 1 Gbps

14 The Software
Based on the Cloudera platform

15 Architecture
HDFS – Hadoop Distributed File System
YARN – Yet Another Resource Negotiator
Applications: MapReduce, Spark etc.

16 HDFS: What is the usable disk capacity?
Replication is set to 3: the client writes blocks to its own node first, then the other rack is used for a second and a third copy
Therefore our cluster's actual capacity is 256 TB
Replicas guarantee resilience to disk failures (and we had some already)
They give flexibility in the allocation of executors
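The usable capacity is simply the raw capacity divided by the replication factor. A minimal sketch, using the 768 TB of raw worker storage from the hardware slide and ignoring any space reserved for the operating system or temporary data (an assumption of this sketch):

```python
RAW_STORAGE_TB = 768     # raw disk space across all workers
REPLICATION_FACTOR = 3   # every HDFS block is stored three times


def usable_capacity_tb(raw_tb, replication):
    """Each block occupies `replication` times its size on disk."""
    return raw_tb / replication


usable_tb = usable_capacity_tb(RAW_STORAGE_TB, REPLICATION_FACTOR)
print(usable_tb)
```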

17 YARN: How are the resources shared?
Scheduling policy
Preemption

18 YARN: How are the resources shared?
Dominant Resource Fairness: equalizes the “dominant share” of users
  Host: <9 CPU, 18 GB>
  Task of User 1: <1 CPU, 4 GB>, dominant resource: memory
  Task of User 2: <3 CPU, 1 GB>, dominant resource: CPU
Preemption occurs after 2 min
  It is normal to wait some time before a job starts running
  It is normal to see containers being killed
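The host and task sizes above can be run through a small simulation of Dominant Resource Fairness. The sketch below is a simplified progressive-filling loop: it repeatedly gives one more task to the user with the lowest dominant share whose task still fits. It is not YARN's scheduler code, and the names `dominant_share` and `drf_allocate` are made up for this illustration.

```python
def dominant_share(allocated, capacity):
    """Largest fraction of any single resource held by a user."""
    return max(a / c for a, c in zip(allocated, capacity))


def drf_allocate(capacity, demands):
    """Hand out tasks one at a time, always favouring the lowest dominant share."""
    used = [0] * len(capacity)
    allocated = {user: [0] * len(capacity) for user in demands}
    tasks = {user: 0 for user in demands}
    while True:
        # Users ordered by their current dominant share, lowest first.
        order = sorted(
            demands, key=lambda name: dominant_share(allocated[name], capacity)
        )
        for user in order:
            demand = demands[user]
            fits = all(u + d <= c for u, d, c in zip(used, demand, capacity))
            if fits:
                used = [u + d for u, d in zip(used, demand)]
                allocated[user] = [a + d for a, d in zip(allocated[user], demand)]
                tasks[user] += 1
                break
        else:
            # No user's next task fits: the host is full.
            return tasks, allocated


capacity = (9, 18)                              # <CPU, GB> of the host
demands = {"user1": (1, 4), "user2": (3, 1)}    # per-task demand of each user
tasks, allocated = drf_allocate(capacity, demands)
print(tasks, allocated)
```

With these inputs user 1 ends up with 3 tasks (12 of 18 GB) and user 2 with 2 tasks (6 of 9 CPUs), so both hold two thirds of their dominant resource.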

19 Spark applications

20 MLlib algorithms

21 Example – Spark execution overview
The application creates a driver process
The application gets its executor processes
It sends the code and tasks to the executors
Our current setup allows applications to have more than 500 executors (500+ threads reading and processing the data in parallel)
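The driver/executor split can be illustrated with a toy analogy built on Python's standard-library thread pool. This is not Spark code and does not use the Spark API: the `driver` and `count_words` functions, the number of workers and the sample data are all assumptions chosen to mirror the three steps above on a single machine.

```python
from concurrent.futures import ThreadPoolExecutor


def count_words(partition):
    """Task run by one 'executor': count the words in its partition."""
    return sum(len(line.split()) for line in partition)


def driver(lines, num_executors=4):
    """'Driver': split the data, send tasks to executors, combine results."""
    # Split the input into one partition per executor.
    partitions = [lines[i::num_executors] for i in range(num_executors)]
    # Acquire the pool of executors and send each one a task.
    with ThreadPoolExecutor(max_workers=num_executors) as executors:
        partial_counts = list(executors.map(count_words, partitions))
    # Combine the partial results on the driver side.
    return sum(partial_counts)


# Hypothetical input: 300 short lines of three words each.
lines = ["big data cluster", "spark on yarn", "hdfs stores blocks"] * 100
total_words = driver(lines)
print(total_words)
```

On the cluster the executors are separate processes on different servers, and the tasks are scheduled close to the HDFS blocks they read.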

23 Raw HDFS read speed
Due to overhead, the cluster can read up to 13 GB/s (without any processing)

24 Terasort
Roughly, this cluster can sort 1 TB in ~10 min (mapred)
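A back-of-the-envelope calculation puts the benchmark figures side by side: the ~45 GB/s nominal disk read speed, the 13 GB/s measured HDFS read speed, and the ~10 min Terasort run. The sketch assumes 1 TB = 1000 GB and treats all figures as round numbers, so the results are rough estimates only.

```python
NOMINAL_READ_GB_S = 45    # aggregate "nominal" disk read speed
MEASURED_READ_GB_S = 13   # raw HDFS read speed, no processing
DATASET_GB = 1000         # 1 TB, decimal units assumed
SORT_MINUTES = 10         # approximate Terasort duration

# Fraction of the nominal disk bandwidth reached when reading through HDFS.
read_efficiency = MEASURED_READ_GB_S / NOMINAL_READ_GB_S

# Time needed just to scan 1 TB at the measured read speed.
scan_seconds = DATASET_GB / MEASURED_READ_GB_S

# Effective end-to-end throughput of the sort (read, shuffle, write).
sort_throughput_gb_s = DATASET_GB / (SORT_MINUTES * 60)

print(f"read efficiency: {read_efficiency:.0%}")
print(f"time to scan 1 TB: {scan_seconds:.0f} s")
print(f"sort throughput: {sort_throughput_gb_s:.2f} GB/s")
```

Sorting is much slower than a plain scan because the data must also be shuffled across the network and written back with replication.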

26 How do I request a user account?
First: Is this cluster/framework the best solution?
The cluster has an independent LDAP/Kerberos system controlling access and HDFS permissions
Contact the person responsible in your department
  DET: Marco Mellia, Maurizio Munafò, Idilio Drago, …
  DAUIN: Elena Baralis, Paolo Garza, …
Fill in the form available at

27 How do I use the cluster? Go to

29 Research
TRANSPORT LAYER: Analysis of network traffic in real-time
APPLICATION LAYER: Analysis of OSN contents
Scope: New algorithms and data science
  Traffic classification, engineering
  Network security (e.g., malware detection)
  User and community profiling
  Recommendation systems

30 Teaching
Computer Engineering MS current offering
  Data Mining
  Artificial Intelligence
  Big Data Management
New track on Data Science
  Data Modeling + Data Engineering + Software engineering + Data Mining & Analytics

31 Questions?
