INTRODUCTION TO BIGDATA & HADOOP



2 INTRODUCTION TO BIGDATA & HADOOP
Girish L, Assistant Professor, Dept. of CSE, CIT Gubbi

3 Data challenges?

4 Multiples of bytes
1000 KB = 1 MB
1000 MB = 1 GB
1000 GB = 1 TB
1000 TB = 1 PB
1000 PB = 1 EB
1000 EB = 1 ZB

5 What is the need for Big Data technology
when we already have robust, high-performing relational database management systems?

6 Big Data is Different from BI

7 RDBMS
Data is stored in a structured format: rows, columns, tuples, primary keys (PK) and foreign keys (FK)
Originally used just for transactional data analysis; later, data warehouses were used for offline data (analysis done within the enterprise)
With massive use of the Internet and social networking (Facebook, LinkedIn), data has become less structured
Data is stored on a central server

8 Data analytics questions vary from business to business

9 ‘Big Data’ is similar to ‘small data’, but bigger
Having bigger data requires different approaches: techniques, tools and architecture, with the aim of solving new problems …or old problems in a better way

10 Attributes of BIG DATA


12 Applications for Big Data Analytics
Multi-channel sales, smarter healthcare, finance, log analysis, homeland security, traffic control, telecom, search quality, manufacturing, trading analytics, fraud and risk, retail (churn, NBO)

13–15 Job Opportunities

16 HADOOP

17 Why HADOOP? Answer: big datasets!

18 What is Hadoop?
A scalable distributed system for data storage and processing
Open-source data storage and processing API
Operates on unstructured and structured data
Massively scalable, automatically parallelizable
Based on work from Google: GFS + MapReduce + BigTable
Current distributions based on open-source and vendor work: Apache Hadoop, Cloudera (CDH4 with Impala), Hortonworks, MapR, AWS, Windows Azure HDInsight

19 Hadoop Components
Storage: HDFS, self-healing high-bandwidth clustered storage
Processing: MapReduce, fault-tolerant distributed processing

20 HDFS (Hadoop Distributed File System)
Data is organized into files and directories
HDFS is a file system written in Java that sits on top of a native file system
Provides redundant storage for massive amounts of data using cheap, unreliable computers
Blocks are replicated to handle failures
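
As a small illustration of the HDFS API (not shown on the original slide), here is a minimal Java sketch that writes and then reads a file through Hadoop's FileSystem class; the class name and the /user/demo/hello.txt path are made up for the example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHello {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from core-site.xml on the classpath;
        // HDFS sits on top of the native file system of each DataNode.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/hello.txt");   // hypothetical path

        // Write: the client streams data; HDFS splits it into blocks
        // and replicates each block across DataNodes.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("hello hdfs");
        }

        // Read the file back.
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }
    }
}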

21 HDFS Data
Data is split into blocks and stored on multiple nodes in the cluster
Each block is usually 64 MB or 128 MB (configurable)
Each block is replicated multiple times (configurable)
Replicas are stored on different data nodes
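
A hedged sketch of where these settings live (not on the original slide): dfs.blocksize and dfs.replication are the Hadoop 2.x property names and may differ in other versions, and the file path used below is hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSettings {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Client-side defaults for new files (normally set cluster-wide
        // in hdfs-site.xml): 128 MB blocks, 3 replicas per block.
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024);
        conf.setInt("dfs.replication", 3);

        FileSystem fs = FileSystem.get(conf);

        // The same values can also be chosen per file at create time:
        // create(path, overwrite, bufferSize, replication, blockSize)
        try (FSDataOutputStream out = fs.create(
                new Path("/user/demo/big.dat"),   // hypothetical path
                true, 4096, (short) 3, 128L * 1024 * 1024)) {
            out.write(new byte[]{1, 2, 3});
        }
    }
}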

22 Two Kinds of Nodes
In the cluster there are two kinds of nodes: Master Nodes and Slave Nodes.

23 Master Nodes
NameNode: only 1 per cluster; metadata server and database
SecondaryNameNode: helps with some housekeeping
JobTracker: only 1 per cluster; job scheduler

24 Slave Nodes
DataNodes: 1–4000 per cluster; block data storage
TaskTrackers: task execution

25 NameNode
A single NameNode stores all metadata: filenames, locations of each block on the DataNodes, owner, group, etc.
All information is maintained in RAM for fast lookup
File system metadata size is limited by the amount of RAM available on the NameNode

26 DataNode
DataNodes store file contents as ‘blocks’ on the underlying file system
Different blocks of the same file are stored on different DataNodes
The same block is stored on three (or more) DataNodes for redundancy

27 Self-healing
DataNodes send heartbeats to the NameNode
After a period without any heartbeats, a DataNode is assumed to be lost
The NameNode determines which blocks were on the lost node, finds other DataNodes with copies of these blocks, and instructs them to copy the blocks to other nodes
Replication is actively maintained

28 Secondary NameNode
The Secondary NameNode is not a failover NameNode
It performs memory-intensive administrative functions on behalf of the NameNode
It should run on a separate machine

29 [Cluster diagram: a namenode / job submission node runs the namenode daemon and the jobtracker; each slave node runs a tasktracker and a datanode daemon on top of the Linux file system]

30 Hadoop Ecosystem

31 MapReduce

32 MapReduce
A programmable framework for pulling data in parallel out of the cluster
A method for distributing a task across multiple nodes; each node processes data stored on that node
Consists of two phases: Map and Reduce
In between Map and Reduce is the Shuffle and Sort

33 Map Reduce Key Concepts

34 MapReduce In our case: circe.rc.usf.edu

35 MapReduce Objects
[Diagram: the Master Node runs the NameNode and JobTracker; Slave Nodes 1–3 each run a TaskTracker and a DataNode]

36 MapReduce Master: “JobTracker”
Accepts MapReduce jobs submitted by users
Assigns Map and Reduce tasks to TaskTrackers
Monitors task and TaskTracker status; re-executes tasks upon failure
MapReduce Slaves: “TaskTrackers”
Run Map and Reduce tasks upon instruction from the JobTracker
Manage storage and transmission of intermediate output

37 MapReduce Example - WordCount
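
The original slide shows WordCount as an image; a minimal Java sketch of the classic Mapper and Reducer (class and variable names are illustrative, not taken from the slide) could look like this:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map phase: for each input line, emit (word, 1).
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: Shuffle & Sort groups the 1s by word; sum them.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
}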

38 Input and Output
InputFormat: TextInputFormat, KeyValueTextInputFormat, SequenceFileInputFormat
OutputFormat: TextOutputFormat, SequenceFileOutputFormat
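
As a sketch (assuming the standard org.apache.hadoop.mapreduce API), non-default formats are selected on the Job object; the FormatDemo class and the job name below are made up for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class FormatDemo {
    public static Job configure(Configuration conf) throws Exception {
        Job job = Job.getInstance(conf, "format demo");
        // Read each line as a tab-separated key/value pair
        // (the default is TextInputFormat: byte offset + whole line).
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        // Write compact binary key/value pairs
        // (the default is TextOutputFormat: "key<TAB>value" text lines).
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        return job;
    }
}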


40 Lifecycle of a MapReduce Job
Write a Map function and a Reduce function, then run the program as a MapReduce job
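
A minimal driver sketch (not from the slides) that wires the WordCount Mapper and Reducer from the earlier sketch into a Job and submits it; input and output paths are taken from the command line.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        // Map and Reduce classes from the WordCount sketch above.
        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setCombinerClass(WordCount.IntSumReducer.class);
        job.setReducerClass(WordCount.IntSumReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // args[0] = HDFS input dir, args[1] = HDFS output dir (must not exist yet).
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Submit the job and wait; the framework handles scheduling,
        // shuffle & sort, and task re-execution on failure.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The packaged jar would typically be submitted with the hadoop jar command, and the results retrieved from the HDFS output directory afterwards.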

41 Hadoop Workflow
1. Load data into HDFS
2. Develop code locally
3. Submit the MapReduce job to the Hadoop cluster (3a. go back to Step 2 as needed)
4. Retrieve data from HDFS

42 CASE STUDY 1: Environment Change Prediction to Assist Farmers Using Hadoop

43 CASE STUDY 2: Sentiment Analysis
Hadoop is frequently used to monitor what customers think of a company’s products or services
Data is loaded from social media (Twitter, Facebook, etc.)
MapReduce jobs run continuously to identify positive and negative sentiment
Why Hadoop? Social media/web data is unstructured and the amount of data is immense

44 CASE STUDY 3: Random Data Generator
Hadoop will generate 10 GB of random data within 20 seconds!

45–47 Publishing a Paper in a Journal

48 Questions? Thank you

