Presentation on theme: "Overview of big data tools"— Presentation transcript:

1 Overview of big data tools
THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION Eurostat

2 Topics
Hadoop
Analysis tools in Hadoop
Spark
NoSQL

3 Hadoop
Open-source platform for distributed processing of large data sets
Functions:
Distribution of data and processing across machines
Management of the cluster
Simplified programming model: easy to write distributed algorithms

4 Hadoop scalability
Hadoop achieves massive scalability by exploiting a simple distribution architecture and coordination model
Huge clusters can be built from (cheap) commodity hardware: a single 1000-CPU machine would be much more expensive than 1000 single-CPU or 250 quad-core machines
The cluster can easily scale up with little or no modification to the programs

5 Hadoop Components
HDFS: Hadoop Distributed File System
Abstraction of a file system over a cluster
Stores large amounts of data by transparently spreading them over different machines
MapReduce
Simple programming model that enables parallel execution of data processing programs
Executes the work near the data it operates on
In a nutshell: HDFS places the data on the cluster and MapReduce does the processing work

6 Hadoop Principle
Hadoop is basically a middleware platform that manages a cluster of machines
The core component is a distributed file system (HDFS)
Files in HDFS are split into blocks that are scattered over the cluster
The cluster can grow indefinitely simply by adding new nodes
[Figure: "I'm one big data set" being split by Hadoop into blocks on HDFS]

7 The MapReduce Paradigm
Parallel processing paradigm: the programmer is largely unaware of the parallelism
Programs are structured into a two-phase execution: Map and Reduce
Map: data elements are classified into categories
Reduce: an algorithm is applied to all the elements of the same category
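To make the two phases concrete, here is a minimal, self-contained Python sketch of the classic word-count pattern: the map step classifies elements into categories (keys), a shuffle groups them, and the reduce step applies the same algorithm to each category. It only simulates the paradigm locally and does not use any Hadoop API; the input lines are made up.

```python
from collections import defaultdict

# Toy input: in Hadoop each line would come from a block stored on HDFS.
lines = ["big data tools", "big data on hadoop", "hadoop tools"]

# Map phase: classify each element into a category by emitting (key, value) pairs.
mapped = []
for line in lines:
    for word in line.split():
        mapped.append((word, 1))

# Shuffle phase: group all values belonging to the same category (key).
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce phase: apply the same algorithm (here, a sum) to every category.
counts = {key: sum(values) for key, values in groups.items()}

print(counts)  # e.g. {'big': 2, 'data': 2, 'tools': 2, 'on': 1, 'hadoop': 2}
```

In a real Hadoop job the map and reduce functions run in parallel on the nodes that store the data blocks, and the framework performs the shuffle between them.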

8 MapReduce and Hadoop
MapReduce is logically placed on top of HDFS

9 MapReduce and Hadoop
MapReduce works on (big) files loaded on HDFS
Each node in the cluster executes the MapReduce program in parallel, applying the map and reduce phases to the blocks it stores
Output is written back to HDFS
Scalability principle: perform the computation where the data is

10 Hadoop pros & cons
Good for:
Repetitive tasks on big data
Not good for:
Replacing an RDBMS
Complex processing requiring multiple phases and/or iterations
Processing small to medium-sized data

11 Tools for Data Analysis with Hadoop
[Figure: Pig, Hive and statistical software sit on top of the Hadoop stack (MapReduce over HDFS)]

12 Apache Pig
Tool for querying data on Hadoop clusters
Widely used in the Hadoop world: Yahoo! estimates that 50% of the Hadoop workload on its 100,000-CPU clusters is generated by Pig scripts
Allows data manipulation scripts to be written in a high-level language called Pig Latin
Interpreted language: scripts are translated into MapReduce jobs
Mainly targeted at joins and aggregations

13 Pig Example
Real example of a Pig script used at Twitter
The Java equivalent…

14 Hive
SQL interface to Hadoop
Text files in tabular format stored in HDFS can be wrapped as relational tables by Hive and then queried through standard SQL
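As an illustration of querying a Hive table from a program: the sketch below assumes the third-party PyHive client, a HiveServer2 endpoint reachable on localhost, and a hypothetical table named trade with made-up columns; none of these details come from the slides.

```python
# Minimal sketch, assuming the PyHive package and HiveServer2 on localhost:10000.
from pyhive import hive

conn = hive.Connection(host="localhost", port=10000, database="default")
cursor = conn.cursor()

# Standard SQL over files that Hive exposes as a relational table (hypothetical table "trade").
cursor.execute(
    "SELECT reporter, SUM(trade_value) AS total "
    "FROM trade GROUP BY reporter ORDER BY total DESC LIMIT 10"
)
for row in cursor.fetchall():
    print(row)

conn.close()
```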

15 Demo: Pig and Hive in the Sandbox
Example of queries and analysis of Comtrade data

16 RHadoop
Set of packages that allows integration of R with HDFS and MapReduce
Hadoop provides the storage while R brings the analysis
Just a library: not a special run-time, not a different language, not a special-purpose language
Incrementally port your code and use all R packages
Requires R installed and configured on all nodes of the cluster

17 Demo: RHadoop in the Sandbox
Example of quality analysis of Comtrade data in MapReduce, written in R

18 Spark
Most machine learning algorithms are iterative, because each iteration can improve the results
With a disk-based approach, each iteration's output is written to disk, making it slow
[Figure: Hadoop execution flow vs. Spark execution flow]
Spark approach: extract a working set, cache it, query it repeatedly (see the sketch below)
Not tied to the two-stage MapReduce paradigm
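A minimal PySpark sketch of the "extract a working set, cache it, query it repeatedly" idea described above; the HDFS path and the filtering logic are illustrative assumptions, not taken from the slides.

```python
# Minimal sketch with the PySpark RDD API; the input path is hypothetical.
from pyspark import SparkContext

sc = SparkContext(appName="working-set-demo")

# Extract a working set from a (big) file on HDFS.
lines = sc.textFile("hdfs:///data/comtrade/trades.csv")
working_set = lines.filter(lambda line: "2015" in line)

# Cache it in memory so repeated queries avoid re-reading from disk.
working_set.cache()

# Query it repeatedly: each action reuses the cached data.
print(working_set.count())
print(working_set.map(lambda line: len(line)).reduce(lambda a, b: a + b))

sc.stop()
```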

19 About Apache Spark
Initially started at UC Berkeley in 2009
Fast and general-purpose cluster computing system: 10x (on disk) to 100x (in-memory) faster than MapReduce
Most popular for running iterative machine learning algorithms
Provides high-level APIs in three programming languages (Scala, Java, Python), with support for R
Integrates with Hadoop and its ecosystem and can read existing data from HDFS

20 Spark Stack
Spark SQL: SQL and structured data processing
MLlib: machine learning algorithms
GraphX: graph processing
Spark Streaming: stream processing of live data streams
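For the Spark SQL layer, a short PySpark sketch; the JSON path, field names and query are illustrative assumptions rather than anything shown in the slides.

```python
# Minimal Spark SQL sketch; the JSON path and field names are made up.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-demo").getOrCreate()

# Load semi-structured data and expose it as a temporary SQL view.
tweets = spark.read.json("hdfs:///data/tweets.json")
tweets.createOrReplaceTempView("tweets")

# Query it with plain SQL on top of the DataFrame engine.
spark.sql(
    "SELECT lang, COUNT(*) AS n FROM tweets GROUP BY lang ORDER BY n DESC"
).show()

spark.stop()
```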

21 Spark in the Sandbox
Spark has been used in the Sandbox to compute network indicators starting from the Comtrade data
Networks of countries have been extracted from the Comtrade database for each category of products
Network indicators have been computed with the GraphX library
All the extraction and processing has been done in Spark
An alternative approach using R required a preliminary extraction and copy of the data into Hive

22 NoSQL: Definition
NoSQL databases are an approach to data management that is useful for very large sets of distributed data
The name NoSQL should not be misleading: the approach does not prohibit Structured Query Language (SQL)
Indeed, they are commonly referred to as "NotOnlySQL"

23 NoSQL: Main Features
Non-relational/schema-free: little or no pre-defined schema, in contrast to Relational Database Management Systems
Distributed and horizontally scalable: able to manage large volumes of data, also with availability guarantees
Transparent failover and recovery using mirror copies

24 Example of NoSQL Databases: Document Storage
Platforms for storing and indexing semi-structured data in JSON format
Not tied to a specific schema: different types of documents can be stored together
Products: MongoDB, Elasticsearch
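To show what schema-free JSON documents look like in practice, a small sketch using the PyMongo client; the connection string, database, collection and document fields are assumptions for illustration.

```python
# Minimal document-store sketch with PyMongo; server URL and names are made up.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
collection = client["demo"]["tweets"]

# Documents in the same collection need not share a schema.
collection.insert_one({"user": "alice", "text": "hello big data", "tags": ["bigdata"]})
collection.insert_one({"user": "bob", "retweets": 3})  # different fields, same collection

# Query by field; no table definition is required.
for doc in collection.find({"tags": "bigdata"}):
    print(doc)

client.close()
```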

25 Elasticsearch in the Sandbox
Demo of visualization of Twitter data stored in Elasticsearch with Kibana
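As a hint of how Twitter data could be loaded into Elasticsearch before being visualized in Kibana, a minimal sketch with the official Python client (8.x-style API); the host, index name and document fields are assumptions, not details from the demo.

```python
# Minimal sketch with the elasticsearch Python client (8.x API); values are made up.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Index a JSON document; Kibana can then visualize the "tweets" index.
es.index(index="tweets", document={"user": "alice", "text": "big data tools", "lang": "en"})

# Full-text search over the indexed documents.
result = es.search(index="tweets", query={"match": {"text": "big data"}})
print(result["hits"]["total"])
```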

