Overview of big data tools

Presentation transcript:

Overview of big data tools THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION Eurostat

Topics
- Hadoop
- Analysis tools in Hadoop
- Spark
- NoSQL

Hadoop
Open source platform for distributed processing of large data sets.
Functions:
- Distribution of data and processing across machines
- Management of the cluster
- Simplified programming model that makes it easy to write distributed algorithms

Hadoop scalability
- Hadoop can reach massive scalability by exploiting a simple distribution architecture and coordination model.
- Huge clusters can be built from (cheap) commodity hardware: a 1000-CPU machine would be much more expensive than 1000 single-CPU or 250 quad-core machines.
- A cluster can easily scale up with little or no modification to the programs.

Hadoop Components
HDFS: Hadoop Distributed File System
- Abstraction of a file system over a cluster
- Stores large amounts of data by transparently spreading it across different machines
MapReduce
- Simple programming model that enables parallel execution of data processing programs
- Executes the work on the data near the data
In a nutshell: HDFS places the data on the cluster and MapReduce does the processing work.

Hadoop Principle
- Hadoop is basically a middleware platform that manages a cluster of machines.
- The core component is a distributed file system (HDFS).
- Files in HDFS are split into blocks that are scattered over the cluster.
- The cluster can grow simply by adding new nodes.
[Figure: a file labelled "I'm one big data set" passed through Hadoop and stored as blocks on HDFS]
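
The block splitting described above can be sketched in a few lines of Python. This is only an illustration: the function names, the 64-byte block size, and the round-robin placement are assumptions made for the example. Real HDFS uses far larger blocks and also keeps several replicas of each block.

```python
from itertools import cycle


def split_into_blocks(data: bytes, block_size: int) -> list:
    """Cut a byte string into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(blocks, nodes):
    """Assign block indexes to nodes round-robin (hypothetical placement policy)."""
    placement = {node: [] for node in nodes}
    for node, (index, _block) in zip(cycle(nodes), enumerate(blocks)):
        placement[node].append(index)
    return placement


# A toy 200-byte "file"; the block size is tiny on purpose.
data = b"I'm one big data set" * 10
blocks = split_into_blocks(data, block_size=64)
placement = place_blocks(blocks, ["node1", "node2", "node3"])
print(placement)
```

Adding a node to the list changes only the placement, not the program that reads the blocks, which is the property the slide relies on.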

The MapReduce Paradigm
- Parallel processing paradigm
- The programmer is unaware of parallelism
- Programs are structured into a two-phase execution:
  Map: data elements are classified into categories
  Reduce: an algorithm is applied to all the elements of the same category
[Figure: Map and Reduce phases, with per-category results x 4, x 5, x 3]
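
A minimal single-process sketch of the two phases, using word counting as the example. The function names and the sample records are assumptions for illustration; a real framework would run the map and reduce steps on many machines and do the grouping itself.

```python
from collections import defaultdict


def map_phase(records):
    """Map: classify each data element into a category (here, the word itself)."""
    for record in records:
        for word in record.split():
            yield word.lower(), 1


def shuffle(pairs):
    """Group the mapped values by category, as the framework would do."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: apply one algorithm (a sum) to all elements of each category."""
    return {key: sum(values) for key, values in groups.items()}


records = ["big data tools", "big data", "big clusters"]
counts = reduce_phase(shuffle(map_phase(records)))
print(counts)
```

The programmer writes only the map and reduce logic; nothing in those two functions mentions parallelism.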

MapReduce and Hadoop
MapReduce is logically placed on top of HDFS.
[Figure: Hadoop stack, with MapReduce on top of HDFS]

MapReduce and Hadoop
- MR works on (big) files loaded on HDFS.
- Each node in the cluster executes the MR program in parallel, applying the map and reduce phases to the blocks it stores.
- Output is written to HDFS.
Scalability principle: perform the computation where the data is.
[Figure: four nodes, each running MR on top of its own HDFS storage]
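
The scalability principle can be illustrated with a toy simulation in which each node processes only the blocks it stores and only the small partial results are combined. The node names and block contents are made up, and the nodes run one after another here rather than in parallel.

```python
from collections import Counter

# Hypothetical layout: which text blocks each node stores locally.
node_blocks = {
    "node1": ["a b a", "c"],
    "node2": ["b b"],
    "node3": ["a c c"],
}


def run_on_node(blocks):
    """Count words in the blocks held by one node, without moving the blocks."""
    local = Counter()
    for block in blocks:
        local.update(block.split())
    return local


# Each node produces a partial result from its own data.
partials = {node: run_on_node(blocks) for node, blocks in node_blocks.items()}

# Only the partial counts travel over the network to be merged.
total = sum(partials.values(), Counter())
print(dict(total))
```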

Hadoop pros & cons
Good for:
- Repetitive tasks on big size data
Not good for:
- Replacing an RDBMS
- Complex processing requiring various phases and/or iterations
- Processing small to medium size data

Tools for Data Analysis with Hadoop
[Figure: Pig, Hive and statistical software on top of Hadoop (MapReduce and HDFS)]

Apache Pig
- Tool for querying data on Hadoop clusters
- Widely used in the Hadoop world: Yahoo! estimates that 50% of the Hadoop workload on its 100,000-CPU clusters is generated by Pig scripts
- Lets users write data manipulation scripts in a high-level language called Pig Latin
- Interpreted language: scripts are translated into MapReduce jobs
- Mainly targeted at joins and aggregations

Pig Example
- Real example of a Pig script used at Twitter
- The Java equivalent…

Hive
- SQL interface to Hadoop
- Text files in tabular format stored in HDFS can be wrapped as relational tables by Hive and then queried through SQL (HiveQL, Hive's SQL dialect)
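
The idea of treating tabular text as a relational table can be tried locally with Python's standard library. This sketch uses SQLite as a stand-in, not Hive: the table name, columns, and sample rows are invented, and SQLite copies the rows into its own storage, whereas Hive can leave the files where they are in HDFS.

```python
import csv
import io
import sqlite3

# Hypothetical tabular text file (in practice this would live in HDFS).
raw_text = "reporter,partner,value\nIT,FR,120\nIT,DE,200\nFR,DE,80\n"

# Parse the text into rows keyed by column name.
rows = list(csv.DictReader(io.StringIO(raw_text)))

# Expose the rows as a relational table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trade (reporter TEXT, partner TEXT, value INTEGER)")
conn.executemany("INSERT INTO trade VALUES (:reporter, :partner, :value)", rows)

# Query the table with plain SQL.
totals = conn.execute(
    "SELECT reporter, SUM(value) FROM trade GROUP BY reporter ORDER BY reporter"
).fetchall()
print(totals)
```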

Demo: Pig and Hive in the Sandbox
Example of queries and analysis of Comtrade data

RHadoop
- Set of packages that allows integration of R with HDFS and MapReduce
- Hadoop provides the storage while R brings the analysis
- Just a library: not a special run-time, not a different language, not a special-purpose language
- Incrementally port your code and use all packages
- Requires R installed and configured on all nodes in the cluster

Demo: RHadoop in the Sandbox
Example of quality analysis of Comtrade data in MapReduce, written in R

Spark
- Most machine learning algorithms are iterative, because each iteration can improve the results
- With a disk-based approach, each iteration's output is written to disk, which makes it slow
- Spark approach: extract a working set, cache it, query it repeatedly
- Not tied to the two-stage MapReduce paradigm
[Figure: Hadoop execution flow compared with Spark execution flow]
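
The difference between the two execution flows can be sketched without Spark at all. In this toy example, DiskDataset is a hypothetical stand-in for data on disk that counts how often it is read, and the "algorithm" is a simple iterative estimate of the mean. It shows the access pattern only and does not use the Spark API.

```python
class DiskDataset:
    """Stand-in for a data set on disk; counts how many times it is read."""

    def __init__(self, values):
        self._values = list(values)
        self.reads = 0

    def load(self):
        self.reads += 1
        return list(self._values)


def iterate_from_disk(dataset, iterations):
    """Disk-based flow: every iteration goes back to disk for its input."""
    estimate = 0.0
    for _ in range(iterations):
        values = dataset.load()
        estimate += 0.5 * (sum(values) / len(values) - estimate)
    return estimate


def iterate_with_cache(dataset, iterations):
    """Cached flow: extract the working set once, then query it repeatedly."""
    working_set = dataset.load()
    estimate = 0.0
    for _ in range(iterations):
        estimate += 0.5 * (sum(working_set) / len(working_set) - estimate)
    return estimate


on_disk = DiskDataset([1, 2, 3])
cached = DiskDataset([1, 2, 3])
result_disk = iterate_from_disk(on_disk, iterations=5)
result_cached = iterate_with_cache(cached, iterations=5)
print(on_disk.reads, cached.reads)
```

Both functions compute the same estimate; only the number of reads from the slow storage differs.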

About Apache Spark
- Initially started at UC Berkeley in 2009
- Fast and general-purpose cluster computing system
- Up to 10x (on disk) to 100x (in memory) faster than MapReduce
- Most popular for running iterative machine learning algorithms
- Provides high-level APIs in three programming languages: Scala, Java, Python
- Support for R
- Integrates with Hadoop and its ecosystem and can read existing data on HDFS

Spark Stack
- Spark SQL: SQL and structured data processing
- MLlib: machine learning algorithms
- GraphX: graph processing
- Spark Streaming: processing of live data streams

Spark in the Sandbox
- Spark has been used in the Sandbox to compute network indicators starting from the Comtrade data
- Networks of countries have been extracted from the Comtrade database for each category of products
- Network indicators have been computed with the GraphX library
- All the extraction and processing has been done in Spark
- The alternative approach using R required a preliminary extraction and copy of the data into Hive

NoSQL: Definition
- NoSQL is an approach to data management that is useful for very large sets of distributed data
- The name NoSQL should not mislead: the approach does not prohibit Structured Query Language (SQL)
- Indeed, NoSQL is commonly read as "Not Only SQL"

NoSQL: Main Features
- Non-relational/schema-free: little or no predefined schema, in contrast to Relational Database Management Systems
- Distributed
- Horizontally scalable: able to manage large volumes of data, also with availability guarantees
- Transparent failover and recovery using mirror copies
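
Horizontal distribution and failover through mirror copies can be sketched as follows. TinyCluster and shard_for are hypothetical names, and the policy (hash the key to pick a primary node, mirror on the next node) is an assumption chosen for brevity rather than the scheme of any particular product.

```python
import hashlib


def shard_for(key: str, node_count: int) -> int:
    """Pick a node for a key from a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % node_count


class TinyCluster:
    """Toy key-value cluster: every key is stored on a primary and one mirror."""

    def __init__(self, node_count: int):
        self.nodes = [dict() for _ in range(node_count)]
        self.down = set()  # indexes of nodes that are currently unavailable

    def owners(self, key: str):
        primary = shard_for(key, len(self.nodes))
        mirror = (primary + 1) % len(self.nodes)
        return primary, mirror

    def put(self, key: str, value):
        for index in self.owners(key):
            self.nodes[index][key] = value

    def get(self, key: str):
        # Try the primary first, then fall back to the mirror copy.
        for index in self.owners(key):
            if index not in self.down:
                return self.nodes[index][key]
        raise RuntimeError("all copies of this key are unavailable")


cluster = TinyCluster(node_count=3)
cluster.put("IT", 320)
primary, mirror = cluster.owners("IT")
cluster.down.add(primary)   # simulate a failure of the primary node
print(cluster.get("IT"))    # still served, from the mirror copy
```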

Example of NoSQL Databases: Document Storage
- Platforms for storing and indexing semi-structured data in JSON format
- Not tied to a specific schema: different types of documents can be stored together
- Products: MongoDB, Elasticsearch
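
A toy in-memory version of a document store shows what "not tied to a specific schema" means in practice. TinyDocumentStore is a hypothetical class written for this example and is not the API of MongoDB or Elasticsearch; it indexes only top-level scalar fields.

```python
import json
from collections import defaultdict


class TinyDocumentStore:
    """Stores JSON documents of any shape and indexes their scalar fields."""

    def __init__(self):
        self.docs = []
        self.index = defaultdict(set)  # (field, value) -> set of document ids

    def insert(self, raw_json: str) -> int:
        doc = json.loads(raw_json)
        doc_id = len(self.docs)
        self.docs.append(doc)
        for field, value in doc.items():
            if isinstance(value, (str, int, float)):
                self.index[(field, value)].add(doc_id)
        return doc_id

    def find(self, field: str, value):
        ids = sorted(self.index.get((field, value), ()))
        return [self.docs[i] for i in ids]


store = TinyDocumentStore()
# Documents with different fields live side by side.
store.insert('{"type": "tweet", "user": "alice", "text": "hello"}')
store.insert('{"type": "product", "code": 42}')
store.insert('{"type": "tweet", "user": "bob", "tags": ["a"]}')
tweets = store.find("type", "tweet")
print([doc["user"] for doc in tweets])
```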

Elasticsearch in the Sandbox
Demo of visualization of Twitter data stored in Elasticsearch with Kibana