INTRODUCTION TO BIGDATA & HADOOP



2 INTRODUCTION TO BIGDATA & HADOOP
Girish L, Assistant Professor, Dept. of CSE, CIT Gubbi

3 Data challenges?

4 Multiples of bytes
1000 KB = 1 MB
1000 MB = 1 GB
1000 GB = 1 TB
1000 TB = 1 PB
1000 PB = 1 EB
1000 EB = 1 ZB

5 What is the need for Big Data technology
when we already have robust, high-performing relational database management systems?

6 Big Data is Different from BI

7 RDBMS
Data is stored in a structured format: rows, columns, tuples, primary keys (PK) and foreign keys (FK)
Originally used just for transactional data analysis; later, data warehouses were used for offline data (analysis done within the enterprise)
With massive use of the Internet and social networking (Facebook, LinkedIn), data has become less structured
Data is stored on a central server

8 Data analytics questions vary from business to business

9 ‘Big Data’ is similar to ‘small data’, but bigger
Having bigger data requires different approaches: techniques, tools and architecture, with the aim of solving new problems …or old problems in a better way

10 Attributes of BIG DATA


12 Applications for Big Data Analytics
Multi-channel sales, smarter healthcare, finance, log analysis, homeland security, traffic control, telecom, search quality, manufacturing, trading analytics, fraud and risk, retail (churn, NBO)

13–15 Job Opportunities

16 HADOOP

17 Why HADOOP? Answer: big datasets!

18 What is Hadoop?
A scalable distributed system for data storage and processing
Open-source data storage and processing API
Operates on unstructured and structured data
Massively scalable, automatically parallelizable
Based on work from Google: GFS + MapReduce + BigTable
Current distributions based on open-source and vendor work: Apache Hadoop, Cloudera (CDH4 with Impala), Hortonworks, MapR, AWS, Windows Azure HDInsight

19 Hadoop Components
Storage: HDFS, self-healing high-bandwidth clustered storage
Processing: MapReduce, fault-tolerant distributed processing

20 HDFS (Hadoop Distributed File System)
Data is organized into files and directories
HDFS is a file system written in Java that sits on top of a native file system
Provides redundant storage for massive amounts of data using cheap, unreliable computers
Blocks are replicated to handle failures
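
As a small illustration of the HDFS API (not shown on the original slide), here is a minimal Java sketch that writes and then reads a file through Hadoop's FileSystem class; the class name and the /user/demo/hello.txt path are made up for the example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHello {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from core-site.xml on the classpath;
        // HDFS sits on top of the native file system of each DataNode.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/hello.txt");   // hypothetical path

        // Write: the client streams data; HDFS splits it into blocks
        // and replicates each block across DataNodes.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("hello hdfs");
        }

        // Read the file back.
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }
    }
}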

21 HDFS Data
Data is split into blocks and stored on multiple nodes in the cluster
Each block is usually 64 MB or 128 MB (configurable)
Each block is replicated multiple times (configurable)
Replicas are stored on different data nodes
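
A hedged sketch of where these settings live (not on the original slide): dfs.blocksize and dfs.replication are the Hadoop 2.x property names and may differ in other versions, and the file path used below is hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSettings {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Client-side defaults for new files (normally set cluster-wide
        // in hdfs-site.xml): 128 MB blocks, 3 replicas per block.
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024);
        conf.setInt("dfs.replication", 3);

        FileSystem fs = FileSystem.get(conf);

        // The same values can also be chosen per file at create time:
        // create(path, overwrite, bufferSize, replication, blockSize)
        try (FSDataOutputStream out = fs.create(
                new Path("/user/demo/big.dat"),   // hypothetical path
                true, 4096, (short) 3, 128L * 1024 * 1024)) {
            out.write(new byte[]{1, 2, 3});
        }
    }
}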

22 Two Kinds of Nodes
In the cluster there are two kinds of nodes: Master Nodes and Slave Nodes.

23 Master Nodes
NameNode: only 1 per cluster; metadata server and database
SecondaryNameNode: helps with some housekeeping
JobTracker: only 1 per cluster; job scheduler

24 Slave Nodes
DataNodes: 1–4000 per cluster; block data storage
TaskTrackers: task execution

25 NameNode
A single NameNode stores all metadata: filenames, locations of each block on the DataNodes, owner, group, etc.
All information is maintained in RAM for fast lookup
File system metadata size is limited by the amount of RAM available on the NameNode

26 DataNode
DataNodes store file contents as ‘blocks’ on the underlying file system
Different blocks of the same file are stored on different DataNodes
The same block is stored on three (or more) DataNodes for redundancy

27 Self-healing
DataNodes send heartbeats to the NameNode
After a period without any heartbeats, a DataNode is assumed to be lost
The NameNode determines which blocks were on the lost node, finds other DataNodes with copies of these blocks, and instructs them to copy the blocks to other nodes
Replication is actively maintained

28 Secondary NameNode
The Secondary NameNode is not a failover NameNode
It performs memory-intensive administrative functions on behalf of the NameNode
It should run on a separate machine

29 [Cluster diagram: a namenode / job submission node runs the namenode daemon and the jobtracker; each slave node runs a tasktracker and a datanode daemon on top of the Linux file system]

30 Hadoop Ecosystem

31 MapReduce

32 MapReduce
A programmable framework for pulling data in parallel out of the cluster
A method for distributing a task across multiple nodes; each node processes data stored on that node
Consists of two phases: Map and Reduce
In between Map and Reduce is the Shuffle and Sort

33 Map Reduce Key Concepts

34 MapReduce In our case: circe.rc.usf.edu

35 MapReduce Objects
[Diagram: the Master Node runs the NameNode and JobTracker; Slave Nodes 1–3 each run a TaskTracker and a DataNode]

36 MapReduce Master: “JobTracker”
Accepts MapReduce jobs submitted by users
Assigns Map and Reduce tasks to TaskTrackers
Monitors task and TaskTracker status; re-executes tasks upon failure
MapReduce Slaves: “TaskTrackers”
Run Map and Reduce tasks upon instruction from the JobTracker
Manage storage and transmission of intermediate output

37 MapReduce Example - WordCount
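
The original slide shows WordCount as an image; a minimal Java sketch of the classic Mapper and Reducer (class and variable names are illustrative, not taken from the slide) could look like this:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map phase: for each input line, emit (word, 1).
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: Shuffle & Sort groups the 1s by word; sum them.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
}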

38 Input and Output
InputFormat: TextInputFormat, KeyValueTextInputFormat, SequenceFileInputFormat
OutputFormat: TextOutputFormat, SequenceFileOutputFormat
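
As a sketch (assuming the standard org.apache.hadoop.mapreduce API), non-default formats are selected on the Job object; the FormatDemo class and the job name below are made up for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class FormatDemo {
    public static Job configure(Configuration conf) throws Exception {
        Job job = Job.getInstance(conf, "format demo");
        // Read each line as a tab-separated key/value pair
        // (the default is TextInputFormat: byte offset + whole line).
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        // Write compact binary key/value pairs
        // (the default is TextOutputFormat: "key<TAB>value" text lines).
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        return job;
    }
}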


40 Lifecycle of a MapReduce Job
Write a Map function and a Reduce function, then run the program as a MapReduce job
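
A minimal driver sketch (not from the slides) that wires the WordCount Mapper and Reducer from the earlier sketch into a Job and submits it; input and output paths are taken from the command line.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        // Map and Reduce classes from the WordCount sketch above.
        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setCombinerClass(WordCount.IntSumReducer.class);
        job.setReducerClass(WordCount.IntSumReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // args[0] = HDFS input dir, args[1] = HDFS output dir (must not exist yet).
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Submit the job and wait; the framework handles scheduling,
        // shuffle & sort, and task re-execution on failure.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The packaged jar would typically be submitted with the hadoop jar command, and the results retrieved from the HDFS output directory afterwards.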

41 Hadoop Workflow
1. Load data into HDFS
2. Develop code locally
3. Submit the MapReduce job to the Hadoop cluster (3a. go back to Step 2 as needed)
4. Retrieve data from HDFS

42 CASE STUDY 1: Environment Change Prediction to Assist Farmers Using Hadoop

43 CASE STUDY 2: Sentiment Analysis
Hadoop is frequently used to monitor what customers think of a company’s products or services
Data is loaded from social media (Twitter, Facebook, etc.)
MapReduce jobs run continuously to identify positive and negative sentiment
Why Hadoop? Social media/web data is unstructured and the amount of data is immense

44 CASE STUDY 3: Random Data Generator
Hadoop will generate 10 GB of random data within 20 seconds!

45–47 Publishing a Paper in a Journal

48 Questions? Thank you

