
AGENDA: Buzz word

What is BIG DATA? Big Data refers to massive, often unstructured data that is beyond the processing capabilities of traditional database management tools. Big Data can take up terabytes and petabytes of storage space, in diverse formats including text, video, sound and images. Traditional relational database management systems cannot deal with such large masses of data. Examples: user status updates on Facebook, clicks across the internet. The 3 V's of Big Data?.. Structured vs. unstructured.

Volume Volume refers to the huge amount of data being generated every minute. 90% of the data we have now was created in just the past two years. IP traffic is expected to grow to 4X its current level by 2015, with 3 billion people online by 2015. (Roughly 2.7 zettabytes of data exist today; sources such as the Large Hadron Collider experiment keep adding to it.)

Velocity Velocity refers to the SPEED at which new data is generated and moves around. It covers real-time systems, such as online banking, that need low response times. "In-memory analytics" technology is employed to deal with data in motion. (For scale: roughly 90k YouTube and 45k Google hits every second.)

Variety Variety refers to the many data types we can now use. Earlier the focus was on neat, structured data kept in the form of tables in an RDBMS. 80% of the data available now is unstructured. Data types are heterogeneous, varying from text to videos to audio to pictures, coming from portable devices, sensors and social media. How do we gain from this? (Video..)

Transform problems into possibilities: Big Data Analytics..

Big Data Analytics It is the process of examining large amounts of data of a variety of types (big data) to uncover hidden patterns, unknown correlations and other real-time insights. Uses of Big Data Analytics: Google Search recommendations, Satyamev Jayate. Future scope: reading genes to help cure deadly diseases such as cancer. Types of analytics..

Leading Technologies Relational databases fail to store and process Big Data. As a result, a new class of big data technologies has emerged and is being used in many big data analytics environments. The technologies associated with big data analytics include: Hadoop, MapReduce and NoSQL.

Hadoop Hadoop is an open source framework, generally a Java-based programming framework, for processing and storing large data sets in a distributed computing environment. Components of Hadoop: HDFS (Hadoop Distributed File System) and MapReduce.

HDFS (Hadoop Distributed File System) HDFS stores data in a DISTRIBUTED, SCALABLE and FAULT-TOLERANT way. The NameNode holds metadata about the data on the DataNodes. The DataNodes actually hold the data, in the form of blocks, and are capable of communicating with one another.
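To make the read/write path more concrete, here is a minimal, hedged sketch using the Hadoop FileSystem Java API: the client asks the NameNode for metadata (block locations), while the bytes themselves flow to and from the DataNodes. The NameNode address hdfs://namenode:9000 and the file path are placeholders, not values from this presentation.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Placeholder NameNode address; the NameNode serves only metadata.
            conf.set("fs.defaultFS", "hdfs://namenode:9000");
            FileSystem fs = FileSystem.get(conf);
            Path file = new Path("/demo/hello.txt");

            // Write: the NameNode decides which DataNodes receive the blocks
            // (with replication); the client streams the bytes to them.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.writeUTF("hello HDFS");
            }

            // Read: the NameNode returns block locations; data is read from DataNodes.
            try (FSDataInputStream in = fs.open(file)) {
                System.out.println(in.readUTF());
            }
            fs.close();
        }
    }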

Hadoop vs. SQL
Storage: in Hadoop, data is stored in the form of compressed files spread across any number of commodity servers; in SQL, data is stored in the form of tables and columns with relations between them.
Failures: Hadoop is fault tolerant: if one node fails, the system still works; in SQL, if any one node crashes an error is raised so as to maintain consistency.
Any questions???..

Benefits of Hadoop Copying the same data over all (thousands of) nodes? Doesn't it seem like a waste of space! It actually is not a waste of memory, for two reasons: If one node fails, the system still works, because data is never lost. A query is scaled out over the nodes, so it returns results faster thanks to parallel processing. E.g., count all the words in my Twitter history to check what I talk about the most: the query is split across multiple servers by some criterion (here, months), and the results are consolidated.

Map-Reduce Algorithm MapReduce is a programming model designed for processing large volumes of data in parallel by dividing the work into a set of independent tasks, just as in the previous example, where the Twitter data was processed on different hosts on the basis of months. Hadoop is the physical implementation of MapReduce. It is a combination of two Java functions: Mapper() and Reducer(). Example: to check the popularity of a text, use word count..

MR - Word count example

Mapper() and Reducer() The Mapper function processes the split files and provides input to the Reducer:

    Mapper(split_filename, file_contents):
        for each word in file_contents:
            emit(word, 1)

The Reducer function groups the values provided by the Mapper and produces the output:

    Reducer(word, values):
        sum = 0
        for each value in values:
            sum = sum + value
        emit(word, sum)

Can anyone think of any disadvantages?..
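To tie this to the "two Java functions" mentioned on the MapReduce slide, here is a hedged sketch of the classic word-count job written against the Hadoop MapReduce Java API. Class names and the input/output paths (taken from the command line) are illustrative, not part of the original slides.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Mapper: for every word in its input split, emit (word, 1).
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reducer: sum the 1s emitted for each word and emit (word, total).
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            // Driver: wires the mapper and reducer together and submits the job.
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. the tweets to count
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not exist yet
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }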

Disadvantages of Hadoop There were two major disadvantages when Hadoop was developed, which have since been turned into strengths. HDFS's dependency on a single NameNode. Solution: a Secondary NameNode is attached to the primary NameNode. MapReduce is a Java framework and did not support SQL queries. Solution: Facebook developed Hive, which allows analysts to work with SQL on the distributed data.
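Purely as an illustration of the Hive point (not taken from the slides), SQL can be run over data in Hadoop through HiveServer2's JDBC interface; the host, port, user and the tweets(word STRING) table below are hypothetical.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQueryDemo {
        public static void main(String[] args) throws Exception {
            // Hive's JDBC driver; the hive-jdbc jar must be on the classpath.
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            // Placeholder HiveServer2 endpoint and credentials.
            String url = "jdbc:hive2://hiveserver:10000/default";
            try (Connection conn = DriverManager.getConnection(url, "hive", "");
                 Statement stmt = conn.createStatement();
                 // Hive compiles this SQL into MapReduce (or Tez/Spark) jobs.
                 ResultSet rs = stmt.executeQuery(
                         "SELECT word, COUNT(*) AS cnt FROM tweets GROUP BY word ORDER BY cnt DESC LIMIT 10")) {
                while (rs.next()) {
                    System.out.println(rs.getString("word") + "\t" + rs.getLong("cnt"));
                }
            }
        }
    }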

NoSQL Not Only SQL. Non-relational database management systems. Used where no fixed schemas are required and data is scaled horizontally. Four categories of NoSQL databases: Key-value stores, Columnar databases, Graph databases, Document databases.

NoSQL Categories KEY-VALUE PAIR Keys are used to retrieve values stored as opaque data blocks, essentially a hash map. Tremendously fast. Drawback: no provision for content-based queries.
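A minimal in-memory sketch in plain Java (not any particular NoSQL product) of why this model is fast for key lookups but awkward for content-based queries, since values are just opaque blobs:

    import java.util.HashMap;
    import java.util.Map;

    public class KeyValueSketch {
        public static void main(String[] args) {
            // The store understands keys only; values are opaque byte blobs.
            Map<String, byte[]> store = new HashMap<>();
            store.put("user:42:profile", "{\"name\":\"Ada\",\"city\":\"London\"}".getBytes());

            // Fast: O(1) lookup by key, exactly like a hash map.
            System.out.println(new String(store.get("user:42:profile")));

            // Not supported directly: "find all users in London" would require
            // fetching and parsing every value, i.e. a full scan on the client side.
        }
    }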

DOCUMENT DATABASE Again a key-value store, but the value is in the form of a document. Documents do not have fixed schemas and can be nested. Queries can be based on content as well as on keys. Use cases: blogging websites.

COLUMNAR DATABASE Works on attributes rather than tuples. The key here is the column name and the value is the contiguous run of that column's values. Best for aggregation queries. Typical pattern: select (one or two columns' values) where (the same or another column's value) = some value.

GRAPH DATABASES A graph database is a collection of nodes and edges. Nodes represent data, while edges represent the links between them. The most dynamic and flexible model. BASE vs. ACID properties..

CONCLUSION Data is the new oil. Without big data analysis, companies are deaf and dumb, mere wanderers on the web... like cattle on a highway! Thank you! Keep dreaming BIG :D

References
Websites:
http://searchbusinessanalytics.techtarget.com/ (experts sound off on big data, analytics and its tools)
http://www.ibmbigdatahub.com/infographic/four-vs-big-data (Big Data and Analytics Hub: the four V's of big data)
https://bigdatauniversity.com/bdu-wp/bdu-course/hadoop-fundamentals-i-version-3/ (Hadoop Fundamentals)
Research papers:
Dean, J. and Ghemawat, S., "MapReduce: Simplified Data Processing on Large Clusters", OSDI: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, 2004.