INTRODUCTION TO BIG DATA & HADOOP
Girish L, Assistant Professor, Dept. of CSE, CIT Gubbi
Data challenges?
Multiples of bytes:
1000 KB = 1 MB
1000 MB = 1 GB
1000 GB = 1 TB
1000 TB = 1 PB
1000 PB = 1 EB
1000 EB = 1 ZB
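The conversions above can be sketched in a few lines. This uses the decimal (SI) multiples shown on the slide; note that binary multiples of 1024 (KiB, MiB, ...) are also common in practice.

```python
# Decimal (SI) byte multiples, as on the slide: each unit is 1000x the previous.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB"]

def to_bytes(value, unit):
    """Convert a value in the given unit to bytes (decimal multiples)."""
    return value * 1000 ** UNITS.index(unit)

print(to_bytes(1, "GB"))  # 1 GB = 1,000,000,000 bytes
print(to_bytes(1, "PB"))  # 1 PB = 10**15 bytes
```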
What is the need for Big Data technology when we already have robust, high-performing relational database management systems?
Big Data is different from BI
RDBMS: data is stored in a structured format (rows, columns, tuples, primary and foreign keys). It was designed for transactional data analysis; later, data warehouses were used for offline data (analysis done within the enterprise). With the massive use of the Internet and social networking (Facebook, LinkedIn), data became less structured. In an RDBMS, data is stored on a central server.
Data analytics questions vary from business to business
'Big Data' is similar to 'small data', but bigger
Having bigger data requires different approaches: techniques, tools, and architecture, with the aim of solving new problems, or old problems in a better way
Attributes of BIG DATA
Applications for Big Data Analytics
Multi-channel sales, smarter healthcare, finance, log analysis, homeland security, traffic control, telecom, search quality, manufacturing, trading analytics, fraud and risk, retail (churn, next-best offer)
Job Opportunities
HADOOP
Why Hadoop? Answer: big datasets!
What is Hadoop?
A scalable distributed system for data storage and processing
An open-source data storage and processing API
Operates on unstructured and structured data
Massively scalable, automatically parallelizable
Based on work from Google: GFS + MapReduce + BigTable
Current distributions are based on open-source and vendor work:
Apache Hadoop
Cloudera CDH4 (with Impala)
Hortonworks
MapR
AWS
Windows Azure HDInsight
Hadoop Components
Storage: HDFS, self-healing, high-bandwidth clustered storage
Processing: MapReduce, fault-tolerant distributed processing
HDFS (Hadoop Distributed File System)
Data is organized into files and directories
HDFS is a file system written in Java
It sits on top of a native file system
Provides redundant storage for massive amounts of data
Uses cheap, unreliable computers
Blocks are replicated to handle failures
HDFS Data
Data is split into blocks and stored on multiple nodes in the cluster
Each block is usually 64 MB or 128 MB (configurable)
Each block is replicated multiple times (configurable)
Replicas are stored on different data nodes
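A quick back-of-the-envelope sketch of the block math above. The block size and replication factor mirror the configurable HDFS settings (`dfs.blocksize`, `dfs.replication`); 128 MB and 3 are common defaults.

```python
import math

def hdfs_storage(file_size_mb, block_size_mb=128, replication=3):
    """Return (number of blocks, total raw storage in MB) for one file.

    HDFS does not pad the last partial block, so raw storage is simply
    file size times the replication factor.
    """
    blocks = math.ceil(file_size_mb / block_size_mb)
    return blocks, file_size_mb * replication

blocks, raw = hdfs_storage(1000)  # a 1000 MB file
print(blocks, raw)                # 8 blocks, 3000 MB of raw storage
```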
Two Kinds of Nodes
Master nodes
Slave nodes
Master Nodes
NameNode: only 1 per cluster; metadata server and database
SecondaryNameNode: helps with some housekeeping
JobTracker: only 1 per cluster; job scheduler
Slave Nodes
DataNodes: 1-4000 per cluster; block data storage
TaskTrackers: many per cluster; task execution
NameNode
A single NameNode stores all metadata: filenames, the locations of each block on the DataNodes, owner, group, etc.
All information is maintained in RAM for fast lookup
File system metadata size is therefore limited by the amount of RAM available on the NameNode
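To see why RAM limits metadata size, here is a rough sizing sketch. It uses the widely quoted rule of thumb of roughly 150 bytes of NameNode heap per file-system object; both that figure and the way replicas are counted below are approximations for illustration, not exact Hadoop internals.

```python
# Rough NameNode heap estimate: ~150 bytes per file-system object is a
# common rule of thumb, not an exact Hadoop constant.
BYTES_PER_OBJECT = 150

def namenode_ram_gb(num_files, blocks_per_file=1, replication=3):
    # Assume each file contributes one file object plus bookkeeping
    # for each block replica (an illustrative simplification).
    objects = num_files * (1 + blocks_per_file * replication)
    return objects * BYTES_PER_OBJECT / 1e9

# 100 million single-block files at replication 3:
print(round(namenode_ram_gb(100_000_000), 1))  # ~60.0 GB of heap
```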
DataNode
DataNodes store file contents, kept as 'blocks' on the underlying file system
Different blocks of the same file are stored on different DataNodes
The same block is stored on three (or more) DataNodes for redundancy
Self-healing
DataNodes send heartbeats to the NameNode
After a period without any heartbeats, a DataNode is assumed lost
The NameNode determines which blocks were on the lost node
The NameNode finds other DataNodes with copies of those blocks and instructs them to copy the blocks to other nodes
Replication is actively maintained
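The re-replication decision above can be sketched as a small simulation. All node and block names here are illustrative, and real HDFS tracks far more state; this only mimics the logic of the slide.

```python
# Toy simulation of HDFS self-healing: when nodes miss their heartbeat
# window, find blocks that fell below the replication target and pick a
# surviving replica holder to copy from.
def blocks_to_rereplicate(block_map, live_nodes, target=3):
    """block_map: {block_id: set of nodes holding a replica}.
    Returns (block, source_node, copies_needed) tuples."""
    actions = []
    for block, holders in block_map.items():
        alive = holders & live_nodes
        missing = target - len(alive)
        if missing > 0 and alive:
            # NameNode instructs a surviving holder to copy the block elsewhere
            actions.append((block, sorted(alive)[0], missing))
    return actions

block_map = {"blk_1": {"dn1", "dn2", "dn3"}, "blk_2": {"dn3", "dn4", "dn5"}}
live = {"dn1", "dn2", "dn4", "dn5"}  # dn3 missed its heartbeats
print(blocks_to_rereplicate(block_map, live))
# [('blk_1', 'dn1', 1), ('blk_2', 'dn4', 1)]
```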
Secondary NameNode
The Secondary NameNode is not a failover NameNode
It performs memory-intensive administrative functions on behalf of the NameNode
It should run on a separate machine
29
… … … namenode job submission node namenode daemon jobtracker
tasktracker tasktracker tasktracker datanode daemon datanode daemon datanode daemon Linux file system Linux file system Linux file system … … … slave node slave node slave node
Hadoop Ecosystem
MapReduce
MapReduce
A programmable framework for pulling data in parallel out of a cluster
A method for distributing a task across multiple nodes; each node processes data stored on that node
It consists of two phases: Map and Reduce
In between Map and Reduce is the Shuffle and Sort
MapReduce Key Concepts
MapReduce (in our case: circe.rc.usf.edu)
MapReduce Objects
[Diagram: the master node runs the NameNode and JobTracker; each slave node (1-3) runs a TaskTracker and a DataNode]
MapReduce Master: the JobTracker
Accepts MapReduce jobs submitted by users
Assigns Map and Reduce tasks to TaskTrackers
Monitors task and TaskTracker status; re-executes tasks upon failure
MapReduce Slaves: the TaskTrackers
Run Map and Reduce tasks upon instruction from the JobTracker
Manage storage and transmission of intermediate output
MapReduce Example - WordCount
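As a sketch of the classic WordCount job: the real Hadoop example is usually written in Java (or run via Hadoop Streaming), but the same mapper/reducer logic can be simulated locally in Python.

```python
from collections import defaultdict

def map_phase(line):
    # Mapper: emit (word, 1) for every word in the input line
    return [(word, 1) for word in line.lower().split()]

def reduce_phase(word, counts):
    # Reducer: sum all the 1s emitted for one word
    return sum(counts)

def word_count(lines):
    groups = defaultdict(list)            # shuffle & sort: group values by key
    for line in lines:
        for word, one in map_phase(line):
            groups[word].append(one)
    return {w: reduce_phase(w, c) for w, c in groups.items()}

print(word_count(["the quick brown fox", "the lazy dog"]))
# {'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 1}
```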
Input and Output
InputFormat: TextInputFormat, KeyValueTextInputFormat, SequenceFileInputFormat
OutputFormat: TextOutputFormat, SequenceFileOutputFormat
Lifecycle of a MapReduce Job
Write a Map function and a Reduce function, then run the program as a MapReduce job
Hadoop Workflow
1. Load data into HDFS
2. Develop code locally
3. Submit the MapReduce job to the Hadoop cluster (3a. go back to Step 2 as needed)
4. Retrieve results from HDFS
CASE STUDY 1: Environment Change Prediction to Assist Farmers Using Hadoop
CASE STUDY 2: Sentiment Analysis
Hadoop is frequently used to monitor what customers think of a company's products or services
Data is loaded from social media (Twitter, Facebook, etc.)
MapReduce jobs run continuously to identify positive and negative sentiment
Why Hadoop? Social media and web data is unstructured, and the amount of data is immense
CASE STUDY 3: Random Data Generator
Hadoop can generate 10 GB of random data within 20 seconds!
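Hadoop achieves that speed by writing random data in parallel across many DataNodes at once (it ships a `randomwriter` example job for exactly this). As a local, single-machine stand-in, here is a small sketch; the file path and chunk size are illustrative.

```python
import os

def write_random_file(path, size_bytes, chunk=1 << 20):
    """Write size_bytes of random data to path in 1 MB chunks; return bytes written."""
    written = 0
    with open(path, "wb") as f:
        while written < size_bytes:
            n = min(chunk, size_bytes - written)
            f.write(os.urandom(n))   # cryptographically random bytes from the OS
            written += n
    return written

print(write_random_file("random.bin", 10 * 1024 * 1024))  # 10 MB locally
```

On one machine this is disk-bound; Hadoop's advantage is simply that every node writes its own share simultaneously.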
Publishing a Paper in a Journal
Questions? Thank you