1
Hadoopla: Microsoft and the Hadoop Ecosystem
Presented at SQL Saturday Waltham, May 19th, 2012
Jim O’Neil, Developer Evangelist, Microsoft
2
Big Data Starts with a V
Volume: there’s a lot of it; we’re hoarders
Variety: schema-schmema, it’s coming from the ‘internet of things’
Velocity: he who hesitates doesn’t get the worm
3
There’s a Tech for That
Volume: Data Warehouses; Distributed File Systems + Map-Reduce
Variety: NoSQL databases
Velocity: Complex Event Processing
4
Two Dimensions of Scale
Up Out
5
Scaling Out is Hard
[Chart: programming complexity versus number of machines (1, 2, 3, ... n)]
6
Distributed File Systems
[Diagram: one name node and four data nodes]
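The division of labor in the diagram can be sketched in a few lines: the name node holds only metadata (which blocks make up a file, and which data nodes hold each block), while the data nodes hold the blocks themselves. Everything in this sketch is an illustrative assumption: the `NameNode` constructor, the node names `dn1` to `dn4`, the tiny block size, and the round-robin placement are hypothetical and are not how HDFS actually chooses replica locations.

```javascript
// Hypothetical, simplified model of a name node's metadata.
// Not an HDFS API; placement here is plain round-robin for illustration.
function NameNode(dataNodeNames, blockSize, replication) {
    this.dataNodes = dataNodeNames;   // names of the available data nodes
    this.blockSize = blockSize;       // bytes per block (tiny, for the demo)
    this.replication = replication;   // copies kept of each block
    this.files = {};                  // file name -> list of block records
    this.nextNode = 0;                // round-robin cursor
}

// Split a file into blocks and record which data nodes hold each replica.
NameNode.prototype.addFile = function (name, sizeInBytes) {
    var blockCount = Math.ceil(sizeInBytes / this.blockSize);
    var blocks = [];
    for (var b = 0; b < blockCount; b++) {
        var replicas = [];
        for (var r = 0; r < this.replication; r++) {
            var index = (this.nextNode + r) % this.dataNodes.length;
            replicas.push(this.dataNodes[index]);
        }
        this.nextNode = (this.nextNode + 1) % this.dataNodes.length;
        blocks.push({ blockId: name + "#" + b, replicas: replicas });
    }
    this.files[name] = blocks;
    return blocks;
};

// One name node and four data nodes, as in the diagram.
var nameNode = new NameNode(["dn1", "dn2", "dn3", "dn4"], 64, 3);
var layout = nameNode.addFile("weblog.txt", 200);
console.log(JSON.stringify(layout, null, 2));
```

Because every block lives on more than one data node, losing a single machine does not lose data; a client asks the name node where the blocks are and then reads them from the data nodes directly.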
7
Map Reduce
[Diagram: job tracker, name node, three data nodes, task tracker]
8
Map Reduce
Word count example
Input: I am what I am

map

    var map = function (key, value, context) {
        // Split the line on anything that is not a letter.
        var words = value.split(/[^a-zA-Z]/);
        for (var i = 0; i < words.length; i++) {
            if (words[i] !== "") {
                // Emit (word, 1) for every non-empty token.
                context.write(words[i].toLowerCase(), 1);
            }
        }
    };

Map output: i: 1, am: 1, what: 1, i: 1, am: 1

shuffle and sort

Grouped by key: am: 1, 1; i: 1, 1; what: 1

reduce

    var reduce = function (key, values, context) {
        // Add up all the counts emitted for this word.
        var sum = 0;
        while (values.hasNext()) {
            sum += parseInt(values.next());
        }
        context.write(key, sum);
    };

Reduce output: am: 2, i: 2, what: 1
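To try the word count flow without a cluster, the following sketch re-declares the slide's map and reduce functions and drives them with a small local harness. The `runLocally` function and both context objects are hypothetical stand-ins for what the Hadoop framework supplies; they are not part of any Hadoop API and only mimic the `context.write`, `values.hasNext()`, and `values.next()` calls the slide code relies on.

```javascript
// Map and reduce functions as shown on the slide.
var map = function (key, value, context) {
    var words = value.split(/[^a-zA-Z]/);
    for (var i = 0; i < words.length; i++) {
        if (words[i] !== "") {
            context.write(words[i].toLowerCase(), 1);
        }
    }
};

var reduce = function (key, values, context) {
    var sum = 0;
    while (values.hasNext()) {
        sum += parseInt(values.next());
    }
    context.write(key, sum);
};

// Hypothetical single-process harness: map, then shuffle and sort, then reduce.
function runLocally(lines) {
    // Map phase: collect every (key, value) pair the mapper writes.
    var mapped = [];
    var mapContext = { write: function (k, v) { mapped.push([k, v]); } };
    lines.forEach(function (line, offset) { map(offset, line, mapContext); });

    // Shuffle and sort: group the values by key.
    var groups = Object.create(null);
    mapped.forEach(function (pair) {
        if (!groups[pair[0]]) { groups[pair[0]] = []; }
        groups[pair[0]].push(pair[1]);
    });

    // Reduce phase: one reduce call per key, in sorted key order.
    var results = {};
    var reduceContext = { write: function (k, v) { results[k] = v; } };
    Object.keys(groups).sort().forEach(function (k) {
        var vals = groups[k];
        var pos = 0;
        var iterator = {
            hasNext: function () { return pos < vals.length; },
            next: function () { return vals[pos++]; }
        };
        reduce(k, iterator, reduceContext);
    });
    return results;
}

var counts = runLocally(["I am what I am"]);
console.log(counts);
```

On a real cluster the map calls run in parallel on the nodes that hold the input blocks and the grouping step moves data across the network, but the contract each function sees is the same as in this harness.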
9
Enter Hadoop
Apache project (http://hadoop.apache.org)
Open source implementation of Google File System and MapReduce
Hadoop Distributed File System (HDFS)
Hadoop MapReduce
Hadoop Common
10
Hadoop History
2002: Doug Cutting develops Nutch, a web crawler
2004: Google publishes MapReduce + GFS paper
2006: Cutting joins Yahoo!
Hadoop becomes Apache Lucene subproject
Hadoop becomes top-level Apache project
Cutting joins Cloudera
2011: Hortonworks formed by Yahoo! and Benchmark Capital
2011: Hadoop reaches version (Dec. 27)
11
Adopters
Yahoo! has a 40,000 node cluster
Facebook has over 30PB of data in Hadoop
Oracle’s Big Data Appliance includes a Hadoop distribution
JP Morgan Chase uses it for fraud detection
eBay is replacing its core search technology with it
Microsoft is working with Hortonworks to distribute Hadoop on Windows, both in the cloud and on-premises
12
Hadoop on Azure
http://hadooponazure.com
Limited customer preview
Windows Server on-premises distribution to follow
13
Sign up
14
Cluster Provisioning
15
Demo
16
The Menagerie Begins
Pig: query infrastructure for Hadoop; SQL-like scripts (Pig Latin) launch map-reduce jobs
Hive: data warehouse system for Hadoop; HiveQL (SQL-like) for querying (launching map-reduce jobs)
17
More Demo
18
More Ecosystem
HBase: NoSQL database built on HDFS
Cassandra: wide-column NoSQL store
Sqoop: bridge from RDBMS to HDFS
19
And More
Flume: log aggregator to HDFS
Scribe: another log aggregator
Chukwa: log processing platform
[Flume ASCII-art banner: Distributed Log Collection.]
20
And Some More
ZooKeeper: distributed system coordinator
Oozie: workflow engine
Avro: data serialization system
Ganglia: distributed monitoring system
21
We’re Not Done Yet!
Mahout: machine learning library
Pegasus: graph mining system
CloudBurst: genome sequence mapping
22
And It’s Just One Piece of the Big Data Pie
Microsoft’s big data solution
FAMILIAR END USER TOOLS: Power View, Excel with PowerPivot, Predictive Analytics, Embedded BI
BI PLATFORM: SSAS, SSRS
Microsoft SQL Server / PDW, Connectors, Hadoop On Windows Azure, Hadoop On Windows Server
UNSTRUCTURED & STRUCTURED DATA: Sensors, Devices, Bots, Crawlers, ERP, CRM, LOB
23
I meant what I said, and I said what I meant. An elephant's faithful, one hundred percent.
Jim O’Neil, Developer Evangelist, Microsoft