John Lenhart

 Data stores are growing by 50% each year, and that rate of increase is accelerating [1]
 In 2010, we crossed the barrier of the zettabyte (ZB) across all online data. This year, we will produce 4 ZB of data worldwide [1]
 The type of data is also changing: over 80% of it will be unstructured data, which does not work well with relational databases [1]

 “Big data is defined as large amount of data which requires new technologies and architectures so that it becomes possible to extract value from it…”
 “Big data” is something of a misnomer: the name points only to the size of the data, paying little attention to its other defining properties

 Variety - the stored data is not all of the same type or category
 Structured data - data that is organized in a structure so that it is identifiable, e.g. SQL data
 Semi-structured data - a form of structured data that has a self-describing structure yet does not conform to the formal structure of a relational database, e.g. XML
 Unstructured data - data with no identifiable structure, e.g. an image

 Volume - the “Big” in Big data; refers to the large volume or size of the data
 At present, data holdings are measured in petabytes and are expected to grow to zettabytes in the near future
 For example, large social networking sites produce data on the order of terabytes every day, an amount that is difficult to handle using traditional systems

 Velocity - represents not only the speed at which data is incoming, but also the speed at which it is outgoing
 Traditional systems are not capable of performing analytics on data that is constantly in motion
 Variability - represents the inconsistency of the data flow
 The flow of data can be highly inconsistent, leading to periodic peaks and lows
 Daily, seasonal, and event-triggered peak data loads can be challenging to manage, especially for unstructured data [2]
 For example, a large natural disaster would spike page visits for cnn.com

 Complexity
 Represents the difficulty of linking, matching, cleansing, and transforming data from multiple sources
 Value
 Systems must not only be designed to handle Big data efficiently and effectively, but also be able to filter the most important data from all of the data collected
 This filtered data is what helps add value to a business

 Log Storage in IT Industries
 IT industries store large amounts of data as logs so that rarely occurring problems can be diagnosed and solved
 Big data analytics is used on this data to pinpoint the points of failure
 Traditional systems are not able to handle these logs because of their volume, their raw and semi-structured nature, and their high rate of change
 Sensor Data
 The massive amount of sensor data is also a big challenge for Big data
 Example
 ▪ The Large Hadron Collider (LHC) is the world’s largest and highest-energy particle accelerator. The data flow in its experiments consists of 25 to 200 petabytes of data which needs to be processed and stored

 Risk Analysis
 It is important for financial institutions to model data in order to calculate risk so that it falls under their acceptable thresholds
 A lot of potentially useful data is underutilized because of its volume; it should be integrated within the model to determine risk patterns more accurately
 Social Media
 The largest use of Big data is for social media and customer sentiment
 Keeping an eye on what customers are saying about their products gives business organizations a form of customer feedback
 The customer feedback can then be used to make decisions and add value to the business

 Privacy and Security
 The most important issue with Big data, carrying conceptual, technical, and legal significance
 When a person’s personal information is combined with external large data sets, new private facts about that person can be inferred
 Big data used by law enforcement increases the chances that certain tagged people will suffer adverse consequences without the ability to fight back, or even the knowledge that they are being discriminated against

 Data Access and Sharing of Information
 If data is to be used to make accurate decisions in time, it must be available in an accurate, complete, and timely manner
 Storage and Processing Issues
 Many companies are struggling to store the large amounts of data they are producing
 ▪ Outsourcing storage to the cloud may seem like an option, but long upload times and constant updates to the data preclude it
 Processing a large amount of data also takes a lot of time

 Hadoop - an open source project hosted by the Apache Software Foundation for managing Big data
 Hadoop consists of two main components
 The Hadoop Distributed File System (HDFS), a distributed file system that stores the data on multiple separate servers (each of which has its own processor(s))
 MapReduce, the framework that understands and assigns work to the nodes in a cluster [3]
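To make the HDFS half concrete, here is a minimal sketch of writing and reading a file through Hadoop's Java FileSystem API. The path and file contents are made-up examples, and the snippet assumes a Hadoop client whose configuration (core-site.xml / hdfs-site.xml) points at a running cluster:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsDemo {
  public static void main(String[] args) throws Exception {
    // Reads the cluster location from the Hadoop config on the classpath
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Hypothetical path; HDFS transparently splits the file into blocks
    // and stores multiple copies of each block on separate servers
    Path file = new Path("/user/demo/cities.txt");
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.writeBytes("Toronto, 20\nNew York, 22\nRome, 32\n");
    }

    // Reading back: the client fetches each block from whichever
    // server holds a replica of it
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(fs.open(file)))) {
      String line;
      while ((line = in.readLine()) != null) {
        System.out.println(line);
      }
    }
  }
}

The program never names individual servers; the same code runs unchanged whether the cluster has one node or hundreds.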

 Hadoop provides the following advantages [3]
 Data read/write performance is increased by distributing the data across the cluster, allowing each processor to do work in parallel
 It is scalable: new nodes can be added as needed without making changes to the existing system
 It is cost effective: it brings parallel computing to commodity servers
 It is flexible: it can absorb any type of data, structured or not, from any number of sources
 It is fault tolerant: it handles failures intrinsically by always storing multiple copies of the data and automatically loading a copy when a fault is detected

 How do you use Hadoop?
 The developer writes a program that conforms to the MapReduce programming model
 The developer specifies, in their program, the format of the data to be processed
 How does MapReduce work? [4]
 Each Hadoop program performs two tasks (see the worked example and code sketch below):
 ▪ Map - breaks all of the data down into key/value pairs
 ▪ Reduce - takes the output from the map step as input and combines those key/value pairs into a smaller set of key/value pairs

 MapReduce example [4]: Assume you have five files, and each file contains two columns that represent a city and the corresponding temperature recorded in that city on various measurement days
 ▪ Toronto, 20, New York, 22, Rome, 32, Toronto, 4, Rome, 33, New York, 18
 We want to find the maximum temperature for each city across all of the data files
 We create five map tasks, where each mapper works on one of the five files; the mapper goes through the data and returns the maximum temperature for each city
 For the file above, this results in: (Toronto, 20) (New York, 22) (Rome, 33)
 Let’s assume the other four mapper tasks (working on the other four files, not shown here) produced the following intermediate results:
 ▪ (Toronto, 18) (New York, 32) (Rome, 37) (Toronto, 32) (New York, 33) (Rome, 38) (Toronto, 22) (New York, 20) (Rome, 31) (Toronto, 31) (New York, 19) (Rome, 30)
 All five of these output streams are fed into the reduce tasks, which combine the input results and output a single value for each city, producing the final result set:
 ▪ (Toronto, 32) (New York, 33) (Rome, 38)
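Expressed against Hadoop's Java MapReduce API, this job might look roughly like the following sketch. The class names are illustrative, and it assumes one "City, Temperature" record per line of input:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperature {

  // Map: turn each "City, Temp" line into a (city, temp) key/value pair
  public static class TempMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] parts = value.toString().split(",");
      if (parts.length == 2) {
        context.write(new Text(parts[0].trim()),
            new IntWritable(Integer.parseInt(parts[1].trim())));
      }
    }
  }

  // Reduce: for each city, keep only the maximum temperature seen
  public static class MaxReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text city, Iterable<IntWritable> temps,
        Context context) throws IOException, InterruptedException {
      int max = Integer.MIN_VALUE;
      for (IntWritable t : temps) {
        max = Math.max(max, t.get());
      }
      context.write(city, new IntWritable(max));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "max temperature");
    job.setJarByClass(MaxTemperature.class);
    job.setMapperClass(TempMapper.class);
    // Max is associative, so the reducer doubles as a per-mapper combiner;
    // this is what produces the per-file maxima described on the slide
    job.setCombinerClass(MaxReducer.class);
    job.setReducerClass(MaxReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Hadoop would run one map task per input split (here, roughly one per file), group the intermediate pairs by city during the shuffle, and hand each city's list of temperatures to a single reduce call, yielding exactly the (Toronto, 32) (New York, 33) (Rome, 38) result above.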

 Big data: Issues, challenges, tools and Good practices
 ▪ ieeexplore.ieee.org (…ber= &tag=1#references)
 Why Every Database Must Be Broken Soon
 ▪ https://blogs.vmware.com/vfabric/2013/03/why-every-database-must-be-broken-soon.html
 Big Data: What it is and why it matters
 ▪ http://…data.html
 What is Hadoop?
 What is MapReduce?
 ▪ http://www-01.ibm.com/software/data/infosphere/hadoop/mapreduce/