Overview of big data tools THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION Eurostat
Topics: Hadoop; analysis tools in Hadoop; Spark; NoSQL.
Hadoop: An open-source platform for distributed processing of large data sets. Functions: distribution of data and processing across machines; management of the cluster; a simplified programming model that makes distributed algorithms easy to write.
Hadoop scalability: Hadoop can reach massive scalability by exploiting a simple distribution architecture and coordination model. Huge clusters can be built from (cheap) commodity hardware: a 1000-CPU machine would be much more expensive than 1000 single-CPU or 250 quad-core machines. A cluster can easily scale up with little or no modification to the programs.
Hadoop components: HDFS (Hadoop Distributed File System) is an abstraction of a file system over a cluster; it stores large amounts of data by transparently spreading them across different machines. MapReduce is a simple programming model that enables parallel execution of data processing programs; it executes the work near the data. In a nutshell: HDFS places the data on the cluster and MapReduce does the processing work.
Hadoop principle: Hadoop is basically a middleware platform that manages a cluster of machines. The core component is a distributed file system (HDFS). Files in HDFS are split into blocks that are scattered over the cluster. The cluster can grow indefinitely simply by adding new nodes. [Figure: a file labelled "I'm one big data set" is passed through Hadoop and stored as blocks on HDFS.]
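The block-splitting idea can be sketched in a few lines of Python. This is only a toy model, not HDFS code: the 16-byte block size, the node names and the round-robin placement are all assumptions made for illustration, and real HDFS also replicates each block, which is left out here.

```python
from collections import defaultdict

BLOCK_SIZE = 16  # bytes; hypothetical and tiny so the result stays readable


def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Cut a file's content into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(blocks, nodes):
    """Assign block indices to nodes round-robin (an assumed, simplistic policy)."""
    placement = defaultdict(list)
    for index in range(len(blocks)):
        placement[nodes[index % len(nodes)]].append(index)
    return dict(placement)


data = b"I'm one big data set, stored on a small toy cluster."
blocks = split_into_blocks(data)

# The same blocks spread over three nodes, then over four.
placement = place_blocks(blocks, ["node1", "node2", "node3"])
bigger = place_blocks(blocks, ["node1", "node2", "node3", "node4"])
print(placement)
print(bigger)
```

Adding a node changes only where the blocks go; the file content, obtained by joining the blocks in order, stays the same.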
The MapReduce paradigm: A parallel processing paradigm in which the programmer is unaware of parallelism. Programs are structured into a two-phase execution. Map: data elements are classified into categories. Reduce: an algorithm is applied to all the elements of the same category. [Figure: Map and Reduce phases; labels x 4, x 5, x 3.]
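The two phases can be illustrated with a single-machine sketch in plain Python. The function names and the list of shapes are hypothetical; a real MapReduce framework would run the same two phases in parallel across a cluster, which this sketch does not attempt.

```python
from collections import defaultdict


def map_phase(records, classify):
    """Map: tag every data element with the category it belongs to."""
    return [(classify(record), record) for record in records]


def group_by_category(pairs):
    """Bring the elements of each category together (the step between the phases)."""
    groups = defaultdict(list)
    for category, record in pairs:
        groups[category].append(record)
    return groups


def reduce_phase(groups, algorithm):
    """Reduce: apply one algorithm to all the elements of the same category."""
    return {category: algorithm(items) for category, items in groups.items()}


shapes = ["circle", "square", "circle", "triangle", "square", "circle"]
pairs = map_phase(shapes, classify=lambda shape: shape)
counts = reduce_phase(group_by_category(pairs), algorithm=len)
print(counts)  # {'circle': 3, 'square': 2, 'triangle': 1}
```

Here the category is the shape itself and the algorithm is a count; swapping `len` for another function changes the analysis without touching the two-phase structure.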
MapReduce and Hadoop: MapReduce is logically placed on top of HDFS.
MapReduce and Hadoop: MR works on (big) files loaded on HDFS. Each node in the cluster executes the MR program in parallel, applying the map and reduce phases to the blocks it stores. Output is written to HDFS. Scalability principle: perform the computation where the data is.
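A rough way to picture the scalability principle: every node counts words only in the blocks it already holds, and only the small partial results are combined. The node names and block contents below are invented, and the per-node calls run one after another here, whereas a cluster would run them in parallel on separate machines.

```python
from collections import Counter

# Hypothetical cluster state: each node stores some blocks of one large text file.
blocks_by_node = {
    "node1": ["big data tools", "hadoop spark"],
    "node2": ["spark hadoop hadoop"],
    "node3": ["data data tools"],
}


def run_locally(blocks):
    """Count words in the blocks one node stores; no block leaves the node."""
    local = Counter()
    for block in blocks:
        local.update(block.split())
    return local


def merge(partials):
    """Combine the small per-node results into the final output."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


partials = {node: run_locally(blocks) for node, blocks in blocks_by_node.items()}
totals = merge(partials.values())
print(dict(totals))
```

Only the word counts travel between nodes; the blocks themselves stay where they are stored.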
Hadoop pros & cons: Good for repetitive tasks on large data. Not good for replacing an RDBMS, for complex processing requiring various phases and/or iterations, or for processing small to medium-size data.
Tools for data analysis with Hadoop: Pig, Hive and statistical software, on top of Hadoop (MapReduce and HDFS).
Apache Pig: A tool for querying data on Hadoop clusters, widely used in the Hadoop world. Yahoo! estimates that 50% of the Hadoop workload on its 100,000-CPU clusters is generated by Pig scripts. Pig allows data manipulation scripts to be written in a high-level language called Pig Latin. It is an interpreted language: scripts are translated into MapReduce jobs. It is mainly targeted at joins and aggregations.
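To give a feel for what "a join plus an aggregation expressed as MapReduce" can look like, here is a small Python sketch. It is not Pig Latin and not what the Pig compiler actually emits; the tagging scheme, the function names and the toy rows are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical toy inputs, not real trade figures.
countries = [("IT", "Italy"), ("FR", "France")]
flows = [("IT", 120), ("FR", 80), ("IT", 30)]


def map_tagged(rows, tag):
    """Map: emit (join key, (source tag, value)) for every input row."""
    return [(key, (tag, value)) for key, value in rows]


def reduce_join_and_sum(tagged_values):
    """Reduce: join the country name with the sum of its flow values."""
    name = next(value for tag, value in tagged_values if tag == "country")
    total = sum(value for tag, value in tagged_values if tag == "flow")
    return name, total


# Group the tagged rows from both inputs by join key.
groups = defaultdict(list)
for key, tagged in map_tagged(countries, "country") + map_tagged(flows, "flow"):
    groups[key].append(tagged)

result = dict(reduce_join_and_sum(values) for values in groups.values())
print(result)  # {'Italy': 150, 'France': 80}
```

A high-level language such as Pig Latin spares the author from writing this kind of plumbing by hand.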
Pig example: A real example of a Pig script used at Twitter. The Java equivalent…
Hive: An SQL interface to Hadoop. Text files in tabular format stored in HDFS can be wrapped as relational tables by Hive and then queried through standard SQL.
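The idea of putting a relational table over a tabular text file can be tried out locally. The sketch below uses Python's built-in `sqlite3` module as a stand-in for Hive, so it is not Hive's API: the table name, the column names and the sample rows are invented, and unlike Hive, which wraps files where they sit in HDFS, this sketch copies the rows into an in-memory database.

```python
import csv
import io
import sqlite3

# Hypothetical tab-separated text, standing in for a file stored in HDFS.
text_file = io.StringIO(
    "reporter\tpartner\tvalue\n"
    "IT\tFR\t120\n"
    "IT\tDE\t30\n"
    "FR\tDE\t80\n"
)

rows = list(csv.reader(text_file, delimiter="\t"))
records = rows[1:]  # skip the header line

# Declare a table whose columns match the columns of the text file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trade (reporter TEXT, partner TEXT, value INTEGER)")
conn.executemany("INSERT INTO trade VALUES (?, ?, ?)", records)

# Once the table exists, the data can be queried through standard SQL.
query = "SELECT reporter, SUM(value) FROM trade GROUP BY reporter ORDER BY reporter"
totals = conn.execute(query).fetchall()
print(totals)  # [('FR', 80), ('IT', 150)]
```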
Demo: Pig and Hive in the Sandbox. Example queries and analysis of Comtrade data.
RHadoop: A set of packages that allows integration of R with HDFS and MapReduce. Hadoop provides the storage while R brings the analysis. It is just a library: not a special run-time, not a different language, not a special-purpose language. Code can be ported incrementally and all R packages remain usable. It requires R to be installed and configured on all nodes in the cluster.
Demo: RHadoop in the Sandbox. Example of quality analysis of Comtrade data in MapReduce, written in R.
Spark: Most machine learning algorithms are iterative, because each iteration can improve the results. With a disk-based approach, each iteration's output is written to disk, which makes it slow. [Figure: Hadoop execution flow versus Spark execution flow.] Spark's approach: extract a working set, cache it, query it repeatedly. It is not tied to the two-stage MapReduce paradigm.
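The difference between going back to disk on every iteration and caching a working set can be mimicked with a toy example. Nothing below is Spark code: `DiskDataset` is a hypothetical class that merely counts how often the data are read, and the iterative algorithm (nudging an estimate towards the mean) is chosen only because it is short.

```python
class DiskDataset:
    """Toy stand-in for data on disk that counts how often it is read."""

    def __init__(self, values):
        self._values = list(values)
        self.reads = 0

    def load(self):
        self.reads += 1
        return list(self._values)


def iterate_mean(get_data, iterations, step=0.5):
    """Iteratively move an estimate towards the mean of the data."""
    estimate = 0.0
    for _ in range(iterations):
        data = get_data()
        gradient = sum(estimate - x for x in data) / len(data)
        estimate -= step * gradient
    return estimate


# Disk-based flow: every iteration goes back to the (simulated) disk.
disk_based = DiskDataset([1.0, 2.0, 3.0, 6.0])
result_disk = iterate_mean(disk_based.load, iterations=20)

# Cached flow: extract the working set once, then query it repeatedly.
cached_source = DiskDataset([1.0, 2.0, 3.0, 6.0])
working_set = cached_source.load()
result_cached = iterate_mean(lambda: working_set, iterations=20)

print(disk_based.reads, cached_source.reads)  # 20 1
```

Both flows compute the same estimate; only the number of trips to storage differs, which is where an in-memory approach saves time on iterative workloads.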
About Apache Spark: Initially started at UC Berkeley in 2009. A fast and general-purpose cluster computing system, claimed to be up to 10x (on disk) and 100x (in memory) faster than MapReduce. Most popular for running iterative machine learning algorithms. Provides high-level APIs in three programming languages (Scala, Java, Python), with support for R. Integrates with Hadoop and its ecosystem and can read existing data on HDFS.
Spark stack: Spark SQL for SQL and structured data processing; MLlib for machine learning algorithms; GraphX for graph processing; Spark Streaming for processing live data streams.
Spark in the Sandbox: Spark has been used in the Sandbox to compute network indicators starting from the Comtrade data. Networks of countries have been extracted from the Comtrade database for each category of products. Network indicators have been computed with the GraphX library. All the extraction and processing has been done in Spark. The alternative approach using R required a preliminary extraction and copy of the data into Hive.
NoSQL: Definition. NoSQL is an approach to data management that is useful for very large sets of distributed data. The name should not be misleading: the approach does not prohibit Structured Query Language (SQL), and indeed these databases are commonly referred to as "Not Only SQL".
NoSQL: Main Features. Non-relational and schema-free: little or no predefined schema, in contrast to relational database management systems. Distributed. Horizontally scalable: able to manage large volumes of data, also with availability guarantees. Transparent failover and recovery using mirror copies.
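Distribution and failover through mirror copies can be sketched with a toy key-value cluster. This is a hypothetical illustration, not the design of any particular NoSQL product: the `ToyCluster` class, the hash-modulo placement and the single mirror per key are all assumptions, and rebalancing when nodes are added is left out.

```python
import hashlib


class ToyCluster:
    """Hypothetical store: each key lives on a primary node and on one mirror."""

    def __init__(self, node_names):
        self.nodes = {name: {} for name in node_names}
        self.down = set()  # names of nodes that are currently unavailable

    def owners(self, key):
        """Pick the primary by hashing the key; the next node holds the mirror."""
        names = sorted(self.nodes)
        digest = int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)
        primary = digest % len(names)
        mirror = (primary + 1) % len(names)
        return names[primary], names[mirror]

    def put(self, key, value):
        for name in self.owners(key):
            self.nodes[name][key] = value

    def get(self, key):
        """Read from the primary, falling back to the mirror if it is down."""
        for name in self.owners(key):
            if name not in self.down:
                return self.nodes[name][key]
        raise RuntimeError("all copies of %r are unavailable" % key)


cluster = ToyCluster(["node1", "node2", "node3"])
cluster.put("user:42", {"name": "Ada"})  # values need no predefined schema

primary, mirror = cluster.owners("user:42")
cluster.down.add(primary)  # simulate a failure of the primary node
print(cluster.get("user:42"))
```

The caller issues the same `get` before and after the failure, which is the sense in which the failover is transparent.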
Example of NoSQL Databases: Document Storage. Platforms for storing and indexing semi-structured data in JSON format. They are not tied to a specific schema: documents of different types can be stored together. Products: MongoDB, Elasticsearch.
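A minimal sketch of the document-store idea, using only Python's standard library: JSON documents with different fields sit side by side, and a simple index makes them searchable by any field. The sample documents and the `find` helper are hypothetical, and real products such as MongoDB or Elasticsearch use far more sophisticated indexing than this.

```python
import json
from collections import defaultdict

# Hypothetical documents of different types, stored together without a schema.
raw_documents = [
    '{"type": "tweet", "user": "alice", "text": "big data tools"}',
    '{"type": "product", "code": "0901", "description": "coffee"}',
    '{"type": "tweet", "user": "bob", "text": "spark and hadoop", "lang": "en"}',
]

store = {}
index = defaultdict(lambda: defaultdict(set))  # field -> value -> document ids

for doc_id, raw in enumerate(raw_documents):
    document = json.loads(raw)
    store[doc_id] = document
    for field, value in document.items():
        index[field][value].add(doc_id)


def find(field, value):
    """Return the documents whose given field has exactly the given value."""
    return [store[doc_id] for doc_id in sorted(index[field][value])]


tweets = find("type", "tweet")
print([doc["user"] for doc in tweets])  # ['alice', 'bob']
```

Note that the third document carries an extra `lang` field and the second shares almost no fields with the others, yet all three are stored and indexed in the same way.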
Elasticsearch in the Sandbox: Demo of visualization of Twitter data stored in Elasticsearch with Kibana.