Big Data is a Big Deal!

Our Team
- Sushant Ahuja: Project Lead, Algorithm Design Lead
- Cassio Cristovao: Technical Lead, Website Architect
- Sameep Mohta: Documentation Lead, Testing Lead

Contents
1. Project Overview: Project Background, Our Goal, Project Goals, Project Technologies
2. Apache Hadoop and Spark: Apache Hadoop, Apache Spark, Difference between Hadoop and Spark, Cluster, Multiple Reducer Problem
3. Benchmarks
4. Iteration Description: Iteration 1, Iteration 2, Iteration 3 & 4, Schedule

Section 1: Project Overview

Project Background
- Big Data Revolution: phones, tablets, laptops, computers; credit cards; transport systems
- Only about 0.5% of the data stored is actually analyzed [1]
- Smart Data: data mining and visualization of Big Data
- Recommendation Systems: Netflix, Amazon, Facebook

[1] Source: http://www.forbes.com/sites/bernardmarr/2015/09/30/big-data-20-mind-boggling-facts-everyone-must-read/#ec996f46c1d3

Our Goal (Problems We Are Solving)
- Gather research data for validation
- Run performance tests in different environments
- Predict data (data mining) through recommendation systems
- Improve knowledge and awareness of Big Data in the TCU Computer Science Department

Project Goals
- Compare the efficiency of 3 different environments:
  - Single Node (Java, Hadoop & Spark environments)
  - Cluster (Hadoop & Spark environments)
- Validate feasibility: transform structured query processing into non-structured Map/Reduce tasks
- Report performance statistics
- Turn 'Big Data' into 'Smart Data': build a recommendation system

Project Technologies
- Java Virtual Machine
- Eclipse IDE
- Apache Hadoop
- Apache Spark
- Maven in Eclipse for Hadoop and Spark
- Mahout on Hadoop systems
- MLlib on Spark systems

Section 2: Apache Hadoop and Spark

Apache Hadoop
- Born out of the need to process an avalanche of data
- Open-source software framework
- Designed around a simple write-once storage infrastructure
- MapReduce: a relatively new style of data processing, created by Google
- Has become the go-to framework for large-scale, data-intensive deployments

Apache Hadoop
- Can handle all 4 dimensions of Big Data: Volume, Velocity, Variety, and Veracity
- Hadoop's beauty: it efficiently processes huge datasets in parallel
- Handles both structured (converted) and unstructured data
- HDFS breaks up input data into blocks and stores them on compute nodes for parallel processing

HDFS Segmentation
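The HDFS segmentation diagram on this slide is not preserved in the transcript. To illustrate the same idea, here is a small sketch using Hadoop's Java FileSystem API to copy a file into HDFS, where the file is automatically split into blocks (128 MB by default in Hadoop 2.x) and distributed across the datanodes. The namenode URI is a placeholder, not the project's actual address.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsPut {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // fs.defaultFS points at the namenode; this URI is a placeholder
            conf.set("fs.defaultFS", "hdfs://manager:9000");
            FileSystem fs = FileSystem.get(conf);

            // Copy a local file into HDFS; HDFS segments it into blocks
            // and replicates the blocks across the datanodes
            Path target = new Path("/data/input.txt");
            fs.copyFromLocalFile(new Path("input.txt"), target);
            System.out.println("Block size: " + fs.getDefaultBlockSize(target));
            fs.close();
        }
    }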

Hadoop Map/Reduce
Map Phase:
- Splits the input data-set into independent chunks
- Chunks are processed in a parallel manner by map tasks
- The framework sorts the output of the map tasks
Reduce Phase:
- Takes input from the map function
- Performs a summary operation
- Output is stored in HDFS

Hadoop Map/Reduce

Code Snippet of Word Count (slide image with sample input and output; not preserved)

Code Snippet of Word Count
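The word-count code on these two slides is shown as images that are not preserved in the transcript. As a stand-in, here is a minimal sketch of the classic Hadoop word count in Java (mapper, reducer, and driver); it follows the standard Hadoop MapReduce API, not the project's exact code:

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map phase: split each line into words, emit (word, 1)
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private Text word = new Text();

            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, one);
                }
            }
        }

        // Reduce phase: sum the counts for each word
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            private IntWritable result = new IntWritable();

            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

For every word in every input line the mapper emits (word, 1); the framework sorts and groups by key, and the reducer sums the counts for each word.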

Apache Spark
- Open-source big data processing framework designed to be fast and general-purpose
- Fast, and meant to be easy to use
- Originally developed at UC Berkeley in 2009

Apache Spark
- Supports Map and Reduce functions
- Lazy evaluation
- In-memory storage and computing
- Offers APIs in Scala, Java, Python, R, and SQL
- Built-in libraries: Spark SQL, Spark Streaming, MLlib, GraphX

Apache Spark

Code Snippet of Word Count

Code Snippet of Word Count
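As with the Hadoop slides, the Spark code images are not preserved. A minimal stand-in sketch of word count against Spark's Java API (written Spark 2.x style, where flatMap expects an Iterator; the Spark 1.x releases current at the time of this project expect an Iterable instead):

    import java.util.Arrays;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    import scala.Tuple2;

    public class SparkWordCount {
        public static void main(String[] args) {
            // The master URL is supplied externally via spark-submit
            SparkConf conf = new SparkConf().setAppName("word count");
            JavaSparkContext sc = new JavaSparkContext(conf);

            // Transformations are lazy: nothing executes until an action is called
            JavaRDD<String> lines = sc.textFile(args[0]);
            JavaRDD<String> words =
                lines.flatMap(line -> Arrays.asList(line.split(" ")).iterator());
            JavaPairRDD<String, Integer> counts =
                words.mapToPair(word -> new Tuple2<>(word, 1))
                     .reduceByKey((a, b) -> a + b);

            // The action triggers the whole lineage, kept in memory where possible
            counts.saveAsTextFile(args[1]);
            sc.stop();
        }
    }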

Difference Between Hadoop and Spark
- Not mutually exclusive
- Spark does not have its own distributed file system; it can use HDFS or others
- Spark is faster because it works in memory, while Hadoop works off the hard drive
- Hadoop needs a third-party machine-learning library (Apache Mahout); Spark has its own (MLlib)
- Not really a competition, as both are non-commercial and open-source

Cluster

Hadoop and Spark Cluster
- One manager node with two worker nodes, all with the same configuration
- Namenode on the manager, Datanodes on the workers
- No data is stored on the manager

(Diagram: one manager node M connected to two worker nodes W)

Container Allocation (diagram: an input file split into blocks B1-B4, allocated to containers on Worker 1 and Worker 2)

Map Tasks

Reduce Tasks

Map/Reduce Log

Multiple Reducer Problem
- Problem: multiple output files with multiple reducers; ideally, only one output file
- Solution: Map/Reduce job chaining: Map1 -> Reduce1 -> Map2 -> Reduce2

Map/Reduce for Job-2

Job Chaining

Job Chaining (diagram: Job 1's map and reduce tasks write intermediate files temp3 and temp4; Job 2's map and reduce merge them into a single output file)
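A rough sketch of a job-chaining driver under these assumptions: FirstMapper, FirstReducer, SecondMapper, and SecondReducer are hypothetical placeholder classes, not the project's code, and the second job is forced to a single reducer so that exactly one output file is produced.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class ChainedJobs {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Path input = new Path(args[0]);
            Path temp = new Path(args[1]);   // intermediate directory between jobs
            Path output = new Path(args[2]);

            // Job 1: many reducers for throughput, producing several part files
            Job job1 = Job.getInstance(conf, "job-1");
            job1.setJarByClass(ChainedJobs.class);
            job1.setMapperClass(FirstMapper.class);     // hypothetical mapper
            job1.setReducerClass(FirstReducer.class);   // hypothetical reducer
            job1.setOutputKeyClass(Text.class);
            job1.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job1, input);
            FileOutputFormat.setOutputPath(job1, temp);
            if (!job1.waitForCompletion(true)) System.exit(1);

            // Job 2: a single reducer that merges Job 1's output into one file
            Job job2 = Job.getInstance(conf, "job-2");
            job2.setJarByClass(ChainedJobs.class);
            job2.setMapperClass(SecondMapper.class);    // hypothetical mapper
            job2.setReducerClass(SecondReducer.class);  // hypothetical reducer
            job2.setNumReduceTasks(1);                  // forces one output file
            job2.setOutputKeyClass(Text.class);
            job2.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job2, temp);
            FileOutputFormat.setOutputPath(job2, output);
            System.exit(job2.waitForCompletion(true) ? 0 : 1);
        }
    }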

Spark Job (diagram: Input 1 and Input 2 pass through Map 1 and Map 2 into a Reduce step that produces output1, with intermediate results held in memory rather than on disk)

(Diagram: the same job as a sequence of stages, with Map in Stage 1, Reduce in Stage 2, and Save in Stage 3, running from Input 1 to Output 1)

Spark Log (diagram: Input 1 and Input 2 through Map 1 and Map 2 into a Reduce producing output1)

Section 3: Benchmarks

Word Count (Single-Node): chart of running time in minutes against text file size for Java, Hadoop, and Spark, with the point where the plain Java run stops marked on the Java series

Word Count (Cluster): chart of running time in minutes against text file size for Hadoop and Spark

Matrix-Multiply (Single-Node): chart of running time in minutes against matrix size for Java, Hadoop, and Spark

Matrix-Multiply (Cluster): chart of running time in minutes against matrix size for Hadoop and Spark

Section 4: Iteration Description

Iteration 1
- Setting up 6 Linux machines: Hadoop and Spark on 2 machines as master nodes, plus 2 slave nodes each for Hadoop and Spark
- Initial software tests:
  - Word Frequency Count in all 3 environments with text files of different sizes (10 MB, 50 MB, 100 MB, 500 MB, 1 GB, 5 GB and 10 GB)
  - Large Matrix Multiplication with the following matrix sizes:

Iteration 2
- Setting up a cluster with at least 2 slave nodes
- Running Mahout on Hadoop systems and MLlib on Spark systems
- Start building a simple recommender using Apache Mahout (see the sketch below)
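For reference, a minimal sketch of a simple user-based recommender built on Mahout's Taste API. The ratings.csv file and its userID,itemID,rating layout are assumptions for illustration, not the project's dataset.

    import java.io.File;
    import java.util.List;

    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
    import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;
    import org.apache.mahout.cf.taste.recommender.Recommender;
    import org.apache.mahout.cf.taste.similarity.UserSimilarity;

    public class SimpleRecommender {
        public static void main(String[] args) throws Exception {
            // ratings.csv with userID,itemID,rating lines is a hypothetical input
            DataModel model = new FileDataModel(new File("ratings.csv"));
            UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
            UserNeighborhood neighborhood =
                new NearestNUserNeighborhood(10, similarity, model);
            Recommender recommender =
                new GenericUserBasedRecommender(model, neighborhood, similarity);

            // Top 3 recommendations for user 1
            List<RecommendedItem> items = recommender.recommend(1, 3);
            for (RecommendedItem item : items) {
                System.out.println(item);
            }
        }
    }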

Iteration 3 & 4
Iteration 3:
- Using K-means clustering on all 3 machines
- Perform unstructured text-processing search on all 3 machines
- Perform classification on both clusters
Iteration 4:
- Using K-means clustering to create recommendation systems from large datasets on all 3 machines (see the MLlib sketch below)
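A minimal sketch of K-means clustering with Spark MLlib's Java API, assuming a hypothetical input file of space-separated numeric vectors; the choices of k = 5 and 20 iterations are arbitrary, not the project's parameters.

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.mllib.clustering.KMeans;
    import org.apache.spark.mllib.clustering.KMeansModel;
    import org.apache.spark.mllib.linalg.Vector;
    import org.apache.spark.mllib.linalg.Vectors;

    public class KMeansSketch {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("kmeans");
            JavaSparkContext sc = new JavaSparkContext(conf);

            // Each line of the (hypothetical) input file is a space-separated vector
            JavaRDD<Vector> points = sc.textFile(args[0]).map(line -> {
                String[] parts = line.split(" ");
                double[] values = new double[parts.length];
                for (int i = 0; i < parts.length; i++) {
                    values[i] = Double.parseDouble(parts[i]);
                }
                return Vectors.dense(values);
            }).cache();  // cache: the algorithm iterates over the same data

            // Cluster into k = 5 groups with at most 20 iterations
            KMeansModel model = KMeans.train(points.rdd(), 5, 20);
            for (Vector center : model.clusterCenters()) {
                System.out.println(center);
            }
            sc.stop();
        }
    }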

Schedule
- Skeleton Website: 4 October, 2015
- Software Engineering Presentation: 15 December, 2015
- Iteration 1: 15 December, 2015
- Project Plan v2.0: 21 January, 2016
- Requirements Document v2.0: 21 January, 2016
- Design Document v2.0: 21 January, 2016
- Iteration 2: 2 February, 2016
- Faculty Presentation: 2 February, 2016
- Iteration 3: 3 March, 2016
- Iteration 4: 31 March, 2016
- User Manual: 5 April, 2016
- Developer Manual: 7 April, 2016
- SRS: 8 April, 2016
- NTASC Presentation: 16 April, 2016
- Complete Documentation: 26 April, 2016
- Final Presentation: 28 April, 2016