Sushant Ahuja, Cassio Cristovao, Sameep Mohta

Presentation transcript:

Big Data is a Big Deal! Sushant Ahuja, Cassio Cristovao, Sameep Mohta. Capstone Project 2015-2016, Computer Science Department, Texas Christian University. NTASC '16

Agenda: Project Overview and Goals; Apache Hadoop; Apache Spark; Initial Testing; Recommender System; Clustering; Challenges; Questions.

Project Background: Big Data Revolution (phones, tablets, laptops, computers; credit cards; transport systems). Only 0.5% of stored data is actually analyzed [1]. Smart Data (data mining and visualization of Big Data). Recommendation systems: Netflix, Amazon, Facebook. [1] Source: https://www.technologyreview.com/business-report/big-data-gets-personal/ (published May 2013).

Goals: Performance analysis of Hadoop vs Spark (speed, size of data, efficiency). Validate feasibility: predict data with recommendation systems, from 'Big Data' to 'Smart Data'.

Our Cluster: 2 clusters (Hadoop and Spark), 3 nodes each. Manager-Worker architecture: 1 Manager (M) and 2 Workers (W). 8 GB RAM; 500 GB HDD; Ubuntu 15.04.

Project Technologies: Java Virtual Machine; Eclipse IDE; Apache Hadoop; Apache Spark; Maven for Hadoop and Spark; Mahout on Hadoop systems; MLlib on Spark systems.

Apache Hadoop: open-source framework for large-scale, data-intensive deployments. MapReduce: a stream I/O style of data processing, created by Google. Map: filters the input line by line. Reduce: collects and processes the filtered data. Write-once storage infrastructure.

Apache Hadoop: 4 dimensions of Big Data (Volume, Velocity, Variety, Veracity). Handles both structured (converted) and unstructured data. HDFS breaks up the input data into blocks and stores them on compute nodes (parallel processing).
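
The block-splitting idea on this slide can be sketched in a few lines of plain Python. This is only an illustration, not HDFS code: the function names, the tiny block size, the node names, and the round-robin placement are all assumptions made for the example, and real HDFS also replicates each block.

```python
from typing import Dict, List


def split_into_blocks(data: bytes, block_size: int) -> List[bytes]:
    """Cut the input into fixed-size blocks; the last block may be shorter."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def assign_blocks(blocks: List[bytes], nodes: List[str]) -> Dict[str, List[int]]:
    """Assign block indices to nodes round-robin.

    Assumption for illustration only: real HDFS placement also handles
    replication and rack awareness, which this sketch ignores.
    """
    placement: Dict[str, List[int]] = {node: [] for node in nodes}
    for index in range(len(blocks)):
        placement[nodes[index % len(nodes)]].append(index)
    return placement


# Hypothetical input and node names, chosen to keep the example small.
blocks = split_into_blocks(b"abcdefghij", block_size=4)
placement = assign_blocks(blocks, ["worker-1", "worker-2"])
```

Each worker can then process its own blocks independently, which is what makes the parallel processing on the slide possible.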

[Figure: HDFS Segmentation]

Hadoop MapReduce. Map phase: splits the input data set into chunks; processes the blocks in parallel; sorts the output. Reduce phase: takes its input from the map phase; performs a summary operation; stores the output back in HDFS.
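
The two phases can be imitated on a single machine with a word count in plain Python. This is a sketch of the programming model only, not the Hadoop API; the function names are made up for the example, and everything runs in one process rather than on a cluster.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle_and_sort(pairs: Iterable[Tuple[str, int]]) -> List[Tuple[str, List[int]]]:
    """Group the mapped values by key and sort the keys."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: the summary operation, here a sum of the counts."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    mapped = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle_and_sort(mapped))


# Hypothetical two-line input.
counts = word_count(["big data is big", "data is a deal"])
```

In a real job, the map calls would run in parallel on the blocks held by each worker, and the grouped pairs would be sent over the network to the reducers.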

[Figure: Hadoop Map/Reduce]

Apache Spark: open-source; supports MapReduce; lazy (on-demand) evaluation; in-memory storage and computing; offers APIs in Scala, Java, Python, R, SQL; built-in libraries.
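
Lazy (on-demand) evaluation means that transformations are only recorded, and nothing is computed until a result is requested. The toy class below illustrates the idea in plain Python. It is not Spark's API: `LazyDataset` and its methods are hypothetical names for this sketch, although `map`, `filter`, and `collect` are chosen to echo the operations Spark exposes.

```python
from typing import Any, Callable, Iterable, List, Tuple


class LazyDataset:
    """Toy stand-in for a lazily evaluated dataset (illustration only)."""

    def __init__(self, source: Iterable[Any],
                 steps: Tuple[Tuple[str, Callable], ...] = ()) -> None:
        self._source = source
        self._steps = steps

    def map(self, fn: Callable[[Any], Any]) -> "LazyDataset":
        # Record the transformation; do not run it yet.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, predicate: Callable[[Any], bool]) -> "LazyDataset":
        return LazyDataset(self._source, self._steps + (("filter", predicate),))

    def collect(self) -> List[Any]:
        # The action: only now is the recorded pipeline executed.
        items = iter(self._source)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)


calls: List[int] = []


def square(x: int) -> int:
    calls.append(x)  # record that work actually happened
    return x * x


dataset = LazyDataset(range(5)).map(square).filter(lambda v: v % 2 == 0)
calls_before_collect = list(calls)  # still empty: nothing has run
result = dataset.collect()
```

Because the whole pipeline is known before it runs, an engine built this way can plan the work and keep intermediate data in memory instead of writing it out between steps.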

Apache Spark: RDD (Resilient Distributed Dataset).

Hadoop or Spark? NOT mutually exclusive. Database: HDFS or others. Rate of processing data. Third-party machine-learning library. Non-commercial, open-source.

[Chart: Word Count, Spark vs Hadoop. Axes: time (in minutes), size of the text file.]

[Chart: Matrix Multiplication, Spark vs Hadoop. Axes: time (in minutes), size of the matrices.]
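
The transcript does not say how the matrix multiplication benchmark was implemented. One common way to phrase it in map/reduce terms is sketched below in plain Python: the map step emits one partial product per output cell, and the reduce step sums the partial products that share a cell. The function names are assumptions for this example, and it runs in a single process.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Matrix = List[List[int]]
Cell = Tuple[int, int]


def map_products(a: Matrix, b: Matrix) -> Iterator[Tuple[Cell, int]]:
    """Map: for each a[i][j] and b[j][k], emit ((i, k), a[i][j] * b[j][k])."""
    for i, row in enumerate(a):
        for j, a_ij in enumerate(row):
            for k, b_jk in enumerate(b[j]):
                yield (i, k), a_ij * b_jk


def reduce_sums(pairs: Iterable[Tuple[Cell, int]]) -> Dict[Cell, int]:
    """Reduce: sum all partial products that belong to the same output cell."""
    totals: Dict[Cell, int] = defaultdict(int)
    for cell, value in pairs:
        totals[cell] += value
    return totals


def matmul(a: Matrix, b: Matrix) -> Matrix:
    if not a or not b or any(len(row) != len(b) for row in a):
        raise ValueError("inner dimensions must match")
    totals = reduce_sums(map_products(a, b))
    return [[totals[(i, k)] for k in range(len(b[0]))] for i in range(len(a))]


# Hypothetical 2x2 inputs.
product = matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```

The number of emitted pairs grows with the product of the three matrix dimensions, which is one reason this workload stresses a cluster as the matrices get larger.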

Movie Recommender: Hadoop.

Movie Recommender: Spark. Collaborative Filtering.
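
Collaborative filtering recommends items based on what users with similar histories liked. The sketch below shows one simple variant, item co-occurrence counting, in plain Python. It is not the Mahout or MLlib implementation used in the project; the function names, users, and movie titles are all made up for the example.

```python
from collections import defaultdict
from itertools import permutations
from typing import Dict, List


def cooccurrence(user_items: Dict[str, List[str]]) -> Dict[str, Dict[str, int]]:
    """Count how often each ordered pair of items appears in the same user's list."""
    counts: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for items in user_items.values():
        for first, second in permutations(set(items), 2):
            counts[first][second] += 1
    return counts


def recommend(user: str, user_items: Dict[str, List[str]], top_n: int = 2) -> List[str]:
    """Score unseen items by their co-occurrence with the user's own items."""
    counts = cooccurrence(user_items)
    seen = set(user_items[user])
    scores: Dict[str, int] = defaultdict(int)
    for item in seen:
        for other, count in counts[item].items():
            if other not in seen:
                scores[other] += count
    # Highest score first; ties broken alphabetically for a stable result.
    ranked = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
    return [item for item, _ in ranked[:top_n]]


# Hypothetical viewing histories.
histories = {
    "ann": ["Alien", "Blade Runner"],
    "bob": ["Alien", "Blade Runner", "Casablanca"],
    "cat": ["Alien", "Casablanca"],
    "dan": ["Blade Runner", "Dune"],
}
suggestions = recommend("ann", histories)
```

Counting pairs is quadratic in the length of each user's history, so the amount of intermediate data can grow quickly with the size of the input file.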

[Chart: Recommender Comparison, Spark vs Hadoop. Axes: time (in minutes), number of records in the input file.]

K-Means Clustering: basic concept, applications, and graph. [Figure: points on X and Y axes with centroids k1, k2, k3. Source: www.liacs.nl/~putten/edu/dbdm05/Lecture3.ppt]
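
The basic concept can be sketched as the standard iterative procedure (often called Lloyd's algorithm): assign each point to its nearest centroid, move each centroid to the mean of its points, and repeat until nothing changes. The plain-Python version below is a single-machine illustration, not the Mahout or MLlib code used in the project; the function names, sample points, and starting centroids are assumptions for the example.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def closest(point: Point, centroids: Sequence[Point]) -> int:
    """Index of the centroid with the smallest squared distance to the point."""
    return min(
        range(len(centroids)),
        key=lambda i: (point[0] - centroids[i][0]) ** 2
        + (point[1] - centroids[i][1]) ** 2,
    )


def kmeans(points: Sequence[Point], initial: Sequence[Point],
           max_iter: int = 100) -> Tuple[List[Point], List[int]]:
    centroids = list(initial)
    for _ in range(max_iter):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters: List[List[Point]] = [[] for _ in centroids]
        for point in points:
            clusters[closest(point, centroids)].append(point)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        updated = [
            (sum(x for x, _ in cluster) / len(cluster),
             sum(y for _, y in cluster) / len(cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if updated == centroids:
            break  # converged: no centroid moved
        centroids = updated
    return centroids, [closest(point, centroids) for point in points]


# Hypothetical data: two well-separated groups and two starting centroids.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, assignments = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
```

The assignment step is independent for every point, so it distributes naturally across workers, while the update step needs the per-cluster sums to be gathered in one place.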

[Chart: Clustering Comparison, Spark vs Hadoop. Axes: time (in minutes), number of records in the input file.]

Hadoop or Spark? (Based on a cluster of 1 Manager and 2 Workers.) Hadoop: huge datasets. Spark: computational capability. Spark: degrades on I/O swapping.

Challenges: multiple reducer problem; amount of data; co-occurrence algorithm; dumping clusters (K-Means); quality of recommendations.

Questions