Big Data for Relational Practitioners Len Wyatt Program Manager Microsoft Corporation DBI225.


[Diagram: reading a giant file from HDFS. The HDFS client asks the NameNode for the file; the NameNode returns the locations of the file's blocks; the DataNodes return the blocks of the file to the client.]
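The read path in this diagram can be sketched as a toy, in-memory model. This is not the Hadoop client API: the class and method names (HdfsReadSketch, getBlockLocations, readBlock), the block IDs, and the file path are all hypothetical, chosen only to mirror the labels on the slide. The point it illustrates is that the NameNode holds metadata only, while the DataNodes hold the bytes.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of the HDFS read path. All names are hypothetical, not the Hadoop API. */
public class HdfsReadSketch {

    /** Metadata for one block: its ID and the index of the DataNode holding it. */
    static final class BlockLocation {
        final String blockId;
        final int dataNodeIndex;

        BlockLocation(String blockId, int dataNodeIndex) {
            this.blockId = blockId;
            this.dataNodeIndex = dataNodeIndex;
        }
    }

    /** A DataNode stores block contents and nothing about file names. */
    static final class DataNode {
        private final Map<String, String> blocks = new HashMap<>();

        void store(String blockId, String data) { blocks.put(blockId, data); }

        String readBlock(String blockId) { return blocks.get(blockId); }
    }

    /** The NameNode stores only metadata: file name -> ordered block locations. */
    static final class NameNode {
        private final Map<String, List<BlockLocation>> files = new HashMap<>();

        void register(String file, List<BlockLocation> locations) { files.put(file, locations); }

        List<BlockLocation> getBlockLocations(String file) { return files.get(file); }
    }

    /** Client: ask the NameNode where the blocks are, then fetch each from its DataNode. */
    static String readFile(NameNode nameNode, List<DataNode> dataNodes, String file) {
        StringBuilder contents = new StringBuilder();
        for (BlockLocation loc : nameNode.getBlockLocations(file)) {
            contents.append(dataNodes.get(loc.dataNodeIndex).readBlock(loc.blockId));
        }
        return contents.toString();
    }

    /** Builds a two-DataNode cluster holding one three-block file and reads it back. */
    static String demo() {
        List<DataNode> dataNodes = Arrays.asList(new DataNode(), new DataNode());
        dataNodes.get(0).store("blk_1", "giant ");
        dataNodes.get(1).store("blk_2", "file ");
        dataNodes.get(0).store("blk_3", "contents");

        NameNode nameNode = new NameNode();
        nameNode.register("/data/giant.txt", Arrays.asList(
                new BlockLocation("blk_1", 0),
                new BlockLocation("blk_2", 1),
                new BlockLocation("blk_3", 0)));

        return readFile(nameNode, dataNodes, "/data/giant.txt");
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The sketch leaves out replication, block sizes, and failure handling, which is where most of the real system's complexity lives.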


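// WordCount mapper and reducer (Hadoop org.apache.hadoop.mapreduce API).
// Both are static nested classes of a driver class; these imports go at the top of that file.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    // Emit (word, 1) for every whitespace-separated token in the input line.
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);
        }
    }
}

public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    // Sum the counts emitted for one word and write (word, total).
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

Source: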
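To see what the framework does around these two classes, here is a minimal sketch that runs the same word-count logic in a single process with no Hadoop dependency. The class WordCountSketch and its run method are hypothetical names; the grouping step stands in for the shuffle/sort that Hadoop performs between the map and reduce phases, and the sample input is made up.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

/** Single-process imitation of map -> shuffle/sort -> reduce. Hypothetical names throughout. */
public class WordCountSketch {

    /** Map phase: emit a (word, 1) pair for every token in the line. */
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            pairs.add(new AbstractMap.SimpleEntry<>(tokenizer.nextToken(), 1));
        }
        return pairs;
    }

    /** Reduce phase: sum all the values that arrived for one key. */
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    /** Runs map over every line, groups the pairs by key, then reduces each group. */
    static Map<String, Integer> run(List<String> lines) {
        // Shuffle/sort stand-in: a TreeMap groups values by key and keeps keys sorted.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            result.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints {big=2, data=2, warehouse=1}
        System.out.println(run(Arrays.asList("big data big", "data warehouse")));
    }
}
```

In a real cluster the map calls run in parallel next to the data blocks and the grouped pairs travel over the network to the reducers; the sketch only preserves the data flow.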

Demo: A Quick-and-Dirty Data Warehouse in Hadoop

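[Diagram: systems positioned by workload (OLTP vs. DW) and by consistency model (ACID vs. BASE). Systems shown: SQL Server (appears twice), Hive, HBase, Cassandra.]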

Define schema with Hive DDL (state the structure, map it to a file):

create external table CUSTOMER (
    C_CUSTKEY     int,
    C_MKTSEGMENT  string,
    C_NATIONKEY   int,
    C_NAME        string,
    C_ADDRESS     string,
    C_PHONE       string,
    C_ACCTBAL     float,
    C_COMMENT     string
)
row format delimited fields terminated by '|'
stored as textfile
location 'asv://customer/';
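The DDL above does not load or convert any data; it only states how existing '|'-delimited text should be interpreted when it is read. A rough sketch of that schema-on-read idea in plain Java follows. The class CustomerRowSketch and the sample row are hypothetical, and the error handling is a simplification: this sketch throws on a malformed field, whereas Hive returns NULL for a value it cannot convert.

```java
/** Applies the CUSTOMER column list to one raw text line at read time. Hypothetical names. */
public class CustomerRowSketch {

    /** One CUSTOMER row, typed in the same order as the columns in the DDL. */
    static final class Customer {
        final int custKey;
        final String mktSegment;
        final int nationKey;
        final String name;
        final String address;
        final String phone;
        final float acctBal;
        final String comment;

        Customer(String[] f) {
            custKey = Integer.parseInt(f[0]);     // C_CUSTKEY int
            mktSegment = f[1];                    // C_MKTSEGMENT string
            nationKey = Integer.parseInt(f[2]);   // C_NATIONKEY int
            name = f[3];                          // C_NAME string
            address = f[4];                       // C_ADDRESS string
            phone = f[5];                         // C_PHONE string
            acctBal = Float.parseFloat(f[6]);     // C_ACCTBAL float
            comment = f[7];                       // C_COMMENT string
        }
    }

    /** Splits on the '|' delimiter named in the DDL and types the first eight fields. */
    static Customer parse(String line) {
        // The -1 limit keeps empty trailing fields; fields beyond the eighth are ignored here.
        String[] fields = line.split("\\|", -1);
        if (fields.length < 8) {
            throw new IllegalArgumentException("expected 8 fields, got " + fields.length);
        }
        return new Customer(fields);
    }

    public static void main(String[] args) {
        // Made-up sample row in the column order declared above.
        Customer c = parse("1|BUILDING|15|Customer#000000001|1 Main St|25-989-741-2988|711.56|sample comment");
        System.out.println(c.name + " is in nation " + c.nationKey);
    }
}
```

Because the structure is applied only at read time, the same file could be described by a different table definition without rewriting it.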

-- Schema: load each file and state its structure.
orders = load '/wh/orders/orders.tbl' using PigStorage('|') as (
    ORDERDATE:chararray, ORDERKEY:long, CUSTKEY:int,
    ORDERSTATUS:chararray, TOTALPRICE:double, COMMENT:chararray );

custs = load '/wh/customer/customer.tbl' using PigStorage('|') as (
    CUSTKEY:int, MKTSEGMENT:chararray, NATIONKEY:int,
    NAME:chararray, ADDRESS:chararray, PHONE:chararray );

nations = load '/wh/nation/nation.tbl' using PigStorage('|') as (
    id:int, nation:chararray, region:int );

-- Logic is here; the rest is schema.
custnat = join custs by NATIONKEY, nations by id;
ordernat = join custnat by CUSTKEY, orders by CUSTKEY;
ordersbynat = group ordernat by NATIONKEY;
sums = foreach ordersbynat generate group,
    COUNT(ordernat.TOTALPRICE), SUM(ordernat.TOTALPRICE);
dump sums;

hive> select devicemake, devicemodel, sum(querydwelltime) as a
    > from hivesampletable
    > group by devicemake, devicemodel
    > order by a;
Total MapReduce jobs = 2
Launching Job 1 out of 2
Starting Job = job_ _0003, Tracking URL =
Kill Command = c:\Apps\dist\bin\hadoop.cmd job -Dmapred.job.tracker= :9010 -kill job_ _
:29:21,382 Stage-1 map = 0%, reduce = 0%
:29:33,601 Stage-1 map = 50%, reduce = 0%
:29:37,617 Stage-1 map = 100%, reduce = 0%
:29:48,648 Stage-1 map = 100%, reduce = 33%
:29:51,664 Stage-1 map = 100%, reduce = 100%
Ended Job = job_ _0003
Launching Job 2 out of 2
Starting Job = job_ _0004, Tracking URL =
Kill Command = c:\Apps\dist\bin\hadoop.cmd job -Dmapred.job.tracker= :9010 -kill job_ _
:30:18,195 Stage-2 map = 0%, reduce = 0%
:30:30,210 Stage-2 map = 100%, reduce = 0%
:30:45,241 Stage-2 map = 100%, reduce = 33%
:30:48,257 Stage-2 map = 100%, reduce = 100%
Ended Job = job_ _0004
OK
Samsung SGH-i
LG LG-C
HTC 7 Mozart
SAMSUNG SGH-i917R
HTC PD
Apple iPhone

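[Diagram: a traditional data warehouse architecture. Components shown: OLTP DB, HR DB, Customer Mgmt. DB, and external sources; ETL into a staging area (optional); ETL into the data warehouse; data mart and OLAP cube; reports, interactive tools, and dashboards.]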

[Diagram: the same architecture built on Hadoop. Labels shown:]
External sources; OLTP in RDBMS; OLTP in HBase
Flume for file acquisition
Sqoop data interchange with relational sources
Persistent storage in HDFS
Pig transforms data in HDFS
Hive presents data as tables
DW in Hive
Oozie manages workflows
Sqoop data interchange with relational targets
Presentation DB; OLAP cube
Reports; Dashboards; Interactive tools

Demo: The Best of Both Worlds