Big Data-An Analysis. Big Data: A definition Big data is a collection of data sets so large and complex that it becomes difficult.

Slides:



Advertisements
Similar presentations
BigData Tools Seyyed mohammad Razavi. Outline  Introduction  Hbase  Cassandra  Spark  Acumulo  Blur  MongoDB  Hive  Giraph  Pig.
Advertisements

Big Data Management and Analytics Introduction Spring 2015 Dr. Latifur Khan 1.
Big Data and Predictive Analytics in Health Care Presented by: Mehadi Sayed President and CEO, Clinisys EMR Inc.
Big Data Workflows N AME : A SHOK P ADMARAJU C OURSE : T OPICS ON S OFTWARE E NGINEERING I NSTRUCTOR : D R. S ERGIU D ASCALU.
ETM Hadoop. ETM IDC estimate put the size of the “digital universe” at zettabytes in forecasting a tenfold growth by 2011 to.
Fraud Detection in Banking using Big Data By Madhu Malapaka For ISACA, Hyderabad Chapter Date: 14 th Dec 2014 Wilshire Software.
SQL on Hadoop. Todays agenda Introduction Hive – the first SQL approach Data ingestion and data formats Impala – MPP SQL.
Hive: A data warehouse on Hadoop Based on Facebook Team’s paperon Facebook Team’s paper 8/18/20151.
This presentation was scheduled to be delivered by Brian Mitchell, Lead Architect, Microsoft Big Data COE Follow him Contact him.
Getting Biologists off ACID Ryan Verdon 3/13/12. Outline Thesis Idea Specific database Effects of losing ACID What is a NoSQL database Types of NoSQL.
Hadoop Basics -Venkat Cherukupalli. What is Hadoop? Open Source Distributed processing Large data sets across clusters Commodity, shared-nothing servers.
Introduction to Hadoop and HDFS
Contents HADOOP INTRODUCTION AND CONCEPTUAL OVERVIEW TERMINOLOGY QUICK TOUR OF CLOUDERA MANAGER.
Big Data: A definition Big data is the realization of greater intelligence by storing, processing, and analyzing data that was previously ignored due to.
An Introduction to HDInsight June 27 th,
Data and SQL on Hadoop. Cloudera Image for hands-on Installation instruction – 2.
Big Data Analytics Large-Scale Data Management Big Data Analytics Data Science and Analytics How to manage very large amounts of data and extract value.
+ Big Data IST210 Class Lecture. + Big Data Summary by EMC Corporation ( More videos that.
Windows Azure. Azure Application platform for the public cloud. Windows Azure is an operating system You can: – build a web application that runs.
1 Melanie Alexander. Agenda Define Big Data Trends Business Value Challenges What to consider Supplier Negotiation Contract Negotiation Summary 2.
Nov 2006 Google released the paper on BigTable.
BACS 287 Big Data & NoSQL 2016 by Jones & Bartlett Learning LLC.
What we know or see What’s actually there Wikipedia : In information technology, big data is a collection of data sets so large and complex that it.
Big Data Analytics Platforms. Our Team NameApplication Viborov MichaelApache Spark Bordeynik YanivApache Storm Abu Jabal FerasHPCC Oun JosephGoogle BigQuery.
Big Data Analytics with Excel Peter Myers Bitwise Solutions.
Big Data Yuan Xue CS 292 Special topics on.
Big Data Javad Azimi May First of All… Sorry about the language  Feel free to ask any question Please share similar experiences.
Group members: Phạm Hoàng Long Nguyễn Huy Hùng Lê Minh Hiếu Phan Thị Thanh Thảo Nguyễn Đức Trí 1 BIG DATA & NoSQL Topic 1:
BIG DATA. Big Data: A definition Big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database.
B ig D ata Analysis for Page Ranking using Map/Reduce R.Renuka, R.Vidhya Priya, III B.Sc., IT, The S.F.R.College for Women, Sivakasi.
Abstract MarkLogic Database – Only Enterprise NoSQL DB Aashi Rastogi, Sanket V. Patel Department of Computer Science University of Bridgeport, Bridgeport,
Microsoft Ignite /28/2017 6:07 PM
BI 202 Data in the Cloud Creating SharePoint 2013 BI Solutions using Azure 6/20/2014 SharePoint Fest NYC.
Data Analytics (CS40003) Introduction to Data Lecture #1
Big Data & Test Automation
CNIT131 Internet Basics & Beginning HTML
SAS users meeting in Halifax
Big Data Enterprise Patterns
MapReduce Compiler RHadoop
Enhancing Data and Predictive Analytics with Azure HDInsight
An Open Source Project Commonly Used for Processing Big Data Sets
INTRODUCTION TO PIG, HIVE, HBASE and ZOOKEEPER
Zhangxi Lin, The Rawls College,
Chapter 14 Big Data Analytics and NoSQL
Big Data Technology.
Hadoopla: Microsoft and the Hadoop Ecosystem
Hadoop.
Big Data Intro.
Hadoop Clusters Tess Fulkerson.
Data Warehouse.
Central Florida Business Intelligence User Group
Ministry of Higher Education
Big Data - in Performance Engineering
Introduction to PIG, HIVE, HBASE & ZOOKEEPER
Tools for Processing Big Data Jinan Al Aridhee and Christian Bach
Introduction to Apache
Overview of big data tools
Big Data Young Lee BUS 550.
Data Warehousing in the age of Big Data (1)
Zoie Barrett and Brian Lam
Charles Tappert Seidenberg School of CSIS, Pace University
Cloud Computing for Data Analysis Pig|Hive|Hbase|Zookeeper
Dep. of Information Technology By: Raz Dara Mohammad Amin
Big Data Analysis in Digital Marketing
Big DATA.
Analysis of Structured or Semi-structured Data on a Hadoop Cluster
Copyright © JanBask Training. All rights reserved Get Started with Hadoop Hive HiveQL Languages.
Big Data.
Presentation transcript:

Big Data-An Analysis

Big Data: A definition Big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools. The challenges include capture, curation, storage, search, sharing, analysis, and visualization. The trend to larger data sets is due to the additional information derivable from analysis of a single large set of related data, as compared to separate smaller sets with the same total amount of data, allowing correlations to be found to "spot business trends, determine quality of research, prevent diseases, link legal citations, combat crime, and determine real-time roadway traffic conditions. (Wikipedia)

Big Data: A definition Put another way, big data is the realization of greater business intelligence by storing, processing, and analyzing data that was previously ignored due to the limitations of traditional data management technologies

Lots of data quintillion bytes of data are generated every day! A quintillion is Data come from many quarters. Social media sites Sensors Digital photos Business transactions Location-based data

The four dimensions of Big Data Volume: Large volumes of data Velocity: Quickly moving data Variety: structured, unstructured, images, etc. Veracity: Trust and integrity is a challenge and a must and is important for big data just as for traditional relational DBs

The four dimensions of use Aspects of the way in which users want to interact with their data… Totality: Users have an increased desire to process and analyze all available data Exploration: Users apply analytic approaches where the schema is defined in response to the nature of the query Frequency: Users have a desire to increase the rate of analysis in order to generate more accurate and timely business intelligence Dependency: Users’ need to balance investment in existing technologies and skills with the adoption of new techniques

Why Big Data and BI

Big Data Conundrum Problems: Although there is a massive spike available data, the percentage of the data that an enterprise can understand is on the decline The data that the enterprise is trying to understand is saturated with both useful signals and lots of noise

The Big Data platform Manifesto imperatives and underlying technologies

IBM’s Big Data Platform

Some concepts NoSQL (Not Only SQL): Databases that “move beyond” relational data models (i.e., no tables, limited or no use of SQL) Focus on retrieval of data and appending new data (not necessarily tables) Focus on key-value data stores that can be used to locate data objects Focus on supporting storage of large quantities of unstructured data SQL is not used for storage or retrieval of data No ACID (atomicity, consistency, isolation, durability)

NoSQL NoSQL focuses on a schema-less architecture (i.e., the data structure is not predefined) In contrast, traditional relation DBs require the schema to be defined before the database is built and populated. Data are structured Limited in scope Designed around ACID principles.

Hadoop Hadoop is a distributed file system and data processing engine that is designed to handle extremely high volumes of data in any structure. Hadoop has two components: The Hadoop distributed file system (HDFS), which supports data in structured relational form, in unstructured form, and in any form in between The MapReduce programing paradigm for managing applications on multiple distributed servers The focus is on supporting redundancy, distributed architectures, and parallel processing

Some Hadoop Related Names to Know Apache Avro: designed for communication between Hadoop nodes through data serialization Cassandra and Hbase: a non-relational database designed for use with Hadoop Hive: a query language similar to SQL (HiveQL) but compatible with Hadoop Mahout: an AI tool designed for machine learning; that is, to assist with filtering data for analysis and exploration Pig Latin: A data-flow language and execution framework for parallel computation ZooKeeper: Keeps all the parts coordinated and working together

What to do with the data

Parallels with Data Warehousing Data Warehouses Extraction Transformation Load Connector Processing User Management

Connector Framework Supports access to data by creating indexes that can be used for access to the data in its native repository (i.e., it does not manage the data, it keeps track of where it is located)

Processing Layer Two primary functions: Indexes content: data are crawled, parsed, and analyzed with the result that contents are indexed and located Processes queries Manages access to various servers hosting the indexed and searchable content

Annotated Query Language AQL is an SQL-like declarative language for performing text analysis and extraction create view PersonPhone as select P.name as person, N.number as phone from Person P, Phone PN, Sentence S where Follows(P.name. PN.number, 0, 30) and Contains(S.sentence, P.name) and Contains(S.sentence, PN.number) and ContainsRegex(/\b(phone|at)\b/, SpanBetween(P.name, PN.number));

The provenance viewer

Machine data analysis

THANK YOU