Introduction to Big Data

Slides:



Advertisements
Similar presentations
Cloud Computing Development. Shallow Introduction.
Advertisements

Introduction to Advanced Computing Platforms for Data Analysis Ruoming Jin.
R and HDInsight in Microsoft Azure
Big Data Management and Analytics Introduction Spring 2015 Dr. Latifur Khan 1.
AN INTRODUCTION TO CLOUD COMPUTING Web, as a Platform…
CS525: Special Topics in DBs Large-Scale Data Management
Evolution in Coming 10 Years: What's the Future of Network? - Evolution in Coming 10 Years: What's the Future of Network? - Big Data- Big Changes in the.
Presented by Sujit Tilak. Evolution of Client/Server Architecture Clients & Server on different computer systems Local Area Network for Server and Client.
SaaS, PaaS & TaaS By: Raza Usmani
M.A.Doman Model for enabling the delivery of computing as a SERVICE.
Big Data. What is Big Data? Analog starage vs digital. The FOUR V’s of Big Data. Who’s Generating Big Data The importance of Big Data. Optimalization.
Cloud computing Tahani aljehani.
INTRODUCTION TO CLOUD COMPUTING Cs 595 Lecture 5 2/11/2015.
An Introduction to Cloud Computing. The challenge Add new services for your users quickly and cost effectively.
This presentation was scheduled to be delivered by Brian Mitchell, Lead Architect, Microsoft Big Data COE Follow him Contact him.
Cloud Computing Source:
Introduction to Cloud Computing
Effectively Explaining the Cloud to Your Colleagues.
1 Introduction to Cloud Computing Jian Tang 01/19/2012.
A Brief Overview by Aditya Dutt March 18 th ’ Aditya Inc.
PhD course - Milan, March /09/ Some additional words about cloud computing Lionel Brunie National Institute of Applied Science (INSA) LIRIS.
Cloud computing is the use of computing resources (hardware and software) that are delivered as a service over the Internet. Cloud is the metaphor for.
Cloud Computing Kwangyun Cho v=8AXk25TUSRQ.
GIS and Cloud Computing. Flickr  Upload and manage your photos online  Share your photos with your family and friends  Post your photos everywhere.
Big Data. What is Big Data? Big Data Analytics: 11 Case Histories and Success Stories
Cloud Computing. What is Cloud Computing? Cloud computing is a model for enabling convenient, on-demand network access to a shared pool of configurable.
Web 2.0: Concepts and Applications 6 Linking Data.
Geographic Information Systems Cloud GIS. ► The use of computing resources (hardware and software) that are delivered as a service over the Internet ►
Cloud Computing and Big Data
Evolution - not revolution Server resources are shared globally instead of locally Excess capacity for peak usage can be shared Backup, security and other.
Introduction to Cloud Computing
M.A.Doman Short video intro Model for enabling the delivery of computing as a SERVICE.
© 2012 IBM Corporation IBM Security Systems 1 © 2013 IBM Corporation 1 Ecommerce Antoine Harfouche.
What is the cloud ? IT as a service Cloud allows access to services without user technical knowledge or control of supporting infrastructure Best described.
Big Data Analytics Large-Scale Data Management Big Data Analytics Data Science and Analytics How to manage very large amounts of data and extract value.
IBM Bluemix Ecosystem Development Hands on Workshop Section 1 - Overview.
1 Melanie Alexander. Agenda Define Big Data Trends Business Value Challenges What to consider Supplier Negotiation Contract Negotiation Summary 2.
Chapter 8 – Cloud Computing
What we know or see What’s actually there Wikipedia : In information technology, big data is a collection of data sets so large and complex that it.
3/12/2013Computer Engg, IIT(BHU)1 CLOUD COMPUTING-1.
Web Technologies Lecture 13 Introduction to cloud computing.
1 TCS Confidential. 2 Objective : In this session we will be able to learn:  What is Cloud Computing?  Characteristics  Cloud Flavors  Cloud Deployment.
Cloud Computing Shannon McManus Michael Weihert. What is Cloud Computing?
RANDY MODOWSKI COSC Cloud Computing. Road Map What is Cloud Computing? History of “The Cloud” Cloud Milestones How Cloud Computing is being used.
Smart Grid Big Data: Automating Analysis of Distribution Systems Steve Pascoe Manager Business Development E&O - NISC.
BIG DATA. Big Data: A definition Big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database.
CS 6027 Advanced Networking FINAL PROJECT ​. Cloud Computing KRANTHI ​ CHENNUPATI PRANEETHA VARIGONDA ​ SANGEETHA LAXMAN ​ VARUN ​ DENDUKURI.
Lecture 1 Book: Hadoop in Action by Chuck Lam Online course – “Cloud Computing Concepts” lecture notes by Indranil Gupta.
Big Data-An Analysis. Big Data: A definition Big data is a collection of data sets so large and complex that it becomes difficult.
Data Analytics (CS40003) Introduction to Data Lecture #1
Prof. Jong-Moon Chung’s Lecture Notes at Yonsei University
Lecture 6: Cloud Computing
Unit 3 Virtualization.
From BI to Big DatA.
Smart Building Solution
IOT Critical Impact on DC Design
Introduction to Big Data
BIG DATA IN ENGINEERING APPLICATIONS
An Introduction to Cloud Computing
Smart Building Solution
Hadoop Clusters Tess Fulkerson.
Introduction to Cloud Computing
CNIT131 Internet Basics & Beginning HTML
Database Design (constructed by Hanh Pham based on slides from books and websites listed under reference section) Big Data.
Emerging technologies-
Agenda Need of Cloud Computing What is Cloud Computing
Big DATA.
Consult America Technology Consulting Services
Big-Data Analytics with Azure HDInsight
Big Data.
Presentation transcript:

Introduction to Big Data

Welcome! Instructor: Ruoming Jin TA: Xinyu Chang Office: 264 MCS Building Email: jin AT cs.kent.edu Office hour: Mondays (4:30PM to 5:30PM) or by appointment TA: Xinyu Chang Email: xchang AT kent.edu Homepage: http://www.cs.kent.edu/~jin/BigData/index.html

Topics Scope: Big Data & Analytics Topics: Foundation of Data Analytics and Data Mining Hadoop/Map-Reduce Programming and Data Processing & BigTable/Hbase/Cassandra Graph Database and Graph Analytics

What’s Big Data? No single definition; here is from Wikipedia: Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization. The trend to larger data sets is due to the additional information derivable from analysis of a single large set of related data, as compared to separate smaller sets with the same total amount of data, allowing correlations to be found to "spot business trends, determine quality of research, prevent diseases, link legal citations, combat crime, and determine real-time roadway traffic conditions.”

Big Data: 3V’s

Volume (Scale) Data Volume Data volume is increasing exponentially 44x increase from 2009 2020 From 0.8 zettabytes to 35zb Data volume is increasing exponentially Exponential increase in collected/generated data

12+ TBs of tweet data every day 25+ TBs of log data every day 2+ billion people on the Web by end 2011 30 billion RFID tags today (1.3B in 2005) 4.6 billion camera phones world wide 100s of millions of GPS enabled devices sold annually 76 million smart meters in 2009… 200M by 2014 12+ TBs of tweet data every day ? TBs of data every day 25+ TBs of log data every day

CERN’s Large Hydron Collider (LHC) generates 15 PB a year Maximilien Brice, © CERN

1. The Earthscope The Earthscope is the world's largest science project. Designed to track North America's geological evolution, this observatory records data over 3.8 million square miles, amassing 67 terabytes of data. It analyzes seismic slips in the San Andreas fault, sure, but also the plume of magma underneath Yellowstone and much, much more. (http://www.msnbc.msn.com/id/44363598/ns/technology_and_science-future_of_technology/#.TmetOdQ--uI)

To extract knowledge all these types of data need to linked together Variety (Complexity) Relational Data (Tables/Transaction/Legacy Data) Text Data (Web) Semi-structured Data (XML) Graph Data Social Network, Semantic Web (RDF), … Streaming Data You can only scan the data once A single application can be generating/collecting many types of data Big Public Data (online, weather, finance, etc) To extract knowledge all these types of data need to linked together

A Single View to the Customer Social Media Banking Finance Gaming Our Known History Customer Entertain Purchase

Velocity (Speed) Data is begin generated fast and need to be processed fast Online Data Analytics Late decisions  missing opportunities Examples E-Promotions: Based on your current location, your purchase history, what you like  send promotions right now for store next to you Healthcare monitoring: sensors monitoring your activities and body  any abnormal measurements require immediate reaction

Real-time/Fast Data Social media and networks (all of us are generating data) Scientific instruments (collecting all sorts of data) Mobile devices (tracking all objects all the time) Sensor technology and networks (measuring all kinds of data) The progress and innovation is no longer hindered by the ability to collect data But, by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion

Real-Time Analytics/Decision Requirement Learning why Customers Product Recommendations that are Relevant & Compelling Learning why Customers Switch to competitors and their offers; in time to Counter Influence Behavior Friend Invitations to join a Game or Activity that expands business Customer Improving the Marketing Effectiveness of a Promotion while it is still in Play Preventing Fraud as it is Occurring & preventing more proactively

Some Make it 4V’s

Harnessing Big Data OLTP: Online Transaction Processing (DBMSs) OLAP: Online Analytical Processing (Data Warehousing) RTAP: Real-Time Analytics Processing (Big Data Architecture & technology)

The Model Has Changed… The Model of Generating/Consuming Data has Changed Old Model: Few companies are generating data, all others are consuming data New Model: all of us are generating data, and all of us are consuming data

What’s driving Big Data - Optimizations and predictive analytics - Complex statistical analysis - All types of data, and many sources - Very large datasets - More of a real-time - Ad-hoc querying and reporting - Data mining techniques - Structured data, typical sources - Small to mid-size datasets

The Evolution of Business Intelligence Interactive Business Intelligence & In-memory RDBMS QliqView, Tableau, HANA Big Data: Real Time & Single View Graph Databases Speed Scale BI Reporting OLAP & Dataware house Business Objects, SAS, Informatica, Cognos other SQL Reporting Tools Big Data: Batch Processing & Distributed Data Store Hadoop/Spark; HBase/Cassandra Scale Speed 1990’s 2000’s 2010’s

Big Data Analytics Big data is more real-time in nature than traditional DW applications Traditional DW architectures (e.g. Exadata, Teradata) are not well-suited for big data apps Shared nothing, massively parallel processing, scale out architectures are well-suited for big data apps

Big Data Technology

Cloud Computing IT resources provided as a service Compute, storage, databases, queues Clouds leverage economies of scale of commodity hardware Cheap storage, high bandwidth networks & multicore processors Geographically distributed data centers Offerings from Microsoft, Amazon, Google, …

wikipedia:Cloud Computing

Benefits Cost & management Reduced Time to deployment Scaling Economies of scale, “out-sourced” resource management Reduced Time to deployment Ease of assembly, works “out of the box” Scaling On demand provisioning, co-locate data and compute Reliability Massive, redundant, shared resources Sustainability Hardware not owned

Types of Cloud Computing Public Cloud: Computing infrastructure is hosted at the vendor’s premises. Private Cloud: Computing architecture is dedicated to the customer and is not shared with other organisations. Hybrid Cloud: Organisations host some critical, secure applications in private clouds. The not so critical applications are hosted in the public cloud Cloud bursting: the organisation uses its own infrastructure for normal usage, but cloud is used for peak loads. Community Cloud

Classification of Cloud Computing based on Service Provided Infrastructure as a service (IaaS) Offering hardware related services using the principles of cloud computing. These could include storage services (database or disk storage) or virtual servers. Amazon EC2, Amazon S3, Rackspace Cloud Servers and Flexiscale. Platform as a Service (PaaS) Offering a development platform on the cloud. Google’s Application Engine, Microsofts Azure, Salesforce.com’s force.com . Software as a service (SaaS) Including a complete software offering on the cloud. Users can access a software application hosted by the cloud vendor on pay-per-use basis. This is a well-established sector. Salesforce.coms’ offering in the online Customer Relationship Management (CRM) space, Googles gmail and Microsofts hotmail, Google docs.

Infrastructure as a Service (IaaS)

More Refined Categorization Storage-as-a-service Database-as-a-service Information-as-a-service Process-as-a-service Application-as-a-service Platform-as-a-service Integration-as-a-service Security-as-a-service Management/ Governance-as-a-service Testing-as-a-service Infrastructure-as-a-service InfoWorld Cloud Computing Deep Dive

Key Ingredients in Cloud Computing Service-Oriented Architecture (SOA) Utility Computing (on demand) Virtualization (P2P Network) SAAS (Software As A Service) PAAS (Platform AS A Service) IAAS (Infrastructure AS A Servie) Web Services in Cloud

Enabling Technology: Virtualization Hardware OS App Hypervisor Virtualized Stack Hardware Operating System App Traditional Stack

Everything as a Service Utility computing = Infrastructure as a Service (IaaS) Why buy machines when you can rent cycles? Examples: Amazon’s EC2, Rackspace Platform as a Service (PaaS) Give me nice API and take care of the maintenance, upgrades, … Example: Google App Engine Software as a Service (SaaS) Just run it for me! Example: Gmail, Salesforce

Cloud versus cloud Amazon Elastic Compute Cloud Google App Engine Microsoft Azure GoGrid AppNexus

The Obligatory Timeline Slide (Mike Culver @ AWS) Amazon.com COBOL, Edsel ARPANET Internet Web Awareness Web as a Platform Web Services, Resources Eliminated Darkness 1969 1982 1959 1996 1997 2001 2004 2006 Dot-Com Bubble Web 2.0 Web Scale Computing

AWS Elastic Compute Cloud – EC2 (IaaS) Simple Storage Service – S3 (IaaS) Elastic Block Storage – EBS (IaaS) SimpleDB (SDB) (PaaS) Simple Queue Service – SQS (PaaS) CloudFront (S3 based Content Delivery Network – PaaS) Consistent AWS Web Services API

What does Azure platform offer to developers? Your Applications … Service Bus Workflow Database Analytics Identity Contacts Access Control … Reporting … Devices … Compute Storage Manage …

Google’s AppEngine vs Amazon’s EC2 Python BigTable Other API’s VMs Flat File Storage AppEngine: Higher-level functionality (e.g., automatic scaling) More restrictive (e.g., respond to URL only) Proprietary lock-in EC2/S3: Lower-level functionality More flexible Coarser billing model June 3, 2008 Google AppEngine vs. Amazon EC2/S3

Topics Overview

Topic 1: Data Analytics & Data Mining Exploratory Data Analysis Linear Classification (Perceptron & Logistic Regression) Linear Regression C4.5 Decision Tree Apriori K-means Clustering EM Algorithm PageRank & HITS Collaborative Filtering

Topic 2: Hadoop/MapReduce Programming & Data Processing Architecture of Hadoop, HDFS, and Yarn Programming on Hadoop Basic Data Processing: Sort and Join Information Retrieval using Hadoop Data Mining using Hadoop (Kmeans+Histograms) Machine Learning on Hadoop (EM) Hive/Pig HBase and Cassandra

Topic 3: Graph Database and Graph Analytics Native Graph Database (Neo4j) Pregel/Giraph (Distributed Graph Processing Engine) Neo4j/Titan/GraphLab/GraphSQL

Textbooks No Official Textbooks References: Hadoop: The Definitive Guide, Tom White, O’Reilly Hadoop In Action, Chuck Lam, Manning Data-Intensive Text Processing with MapReduce, Jimmy Lin and Chris Dyer (www.umiacs.umd.edu/~jimmylin/MapReduce-book-final.pdf) Data Mining: Concepts and Techniques, Third Edition, by Jiawei Han et al. Many Online Tutorials and Papers

Cloud Resources Hadoop on your local machine Hadoop in a virtual machine on your local machine (Pseudo-Distributed on Ubuntu) Hadoop in the clouds with Amazon EC2

Course Prerequisite Prerequisite: Java Programming / C++ Data Structures and Algorithm Computer Architecture Basic Statistics and Probability Database and Data Mining (preferred)

This course is not for you… If you do not have a strong Java programming background This course is not about only programming (on Hadoop). Focus on “thinking at scale” and algorithm design Focus on how to manage and process Big Data! No previous experience necessary in MapReduce Parallel and distributed programming

Grade Scheme Homework 70% Project Class Participation 20% 10% Each Class will have a sign-in sheet Zero-Tolerance on plagiarism

Project Project (due April 24th) One project: Group size <= 4 students Checkpoints Proposal: title and goal (due March 1st) Outline of approach (due March 15th) Implementation and Demo (April 24th and 26th) Final Project Report (due April 29th) Each group will have a short presentation and demo (15-20 minutes) Each group will provide a five-page document on the project; the responsibility and work of each student shall be described precisely