Hadoop on Azure 101: What is the Big Deal?
Dennis Mulder, Solution Architect, Microsoft Corporation
Windows Azure Center of Excellence
Global Services Team: 10 Senior Cloud Architects; US, EMEA, APAC; 8 Pilots; Cloud Apps Champs
Services: Spotlight Pilots, Assessment, Architecture and Design Guidance, Design Sessions; Modern Apps, Global Scale
Engage: Design, Assess, Pilots
Contact: Dennis Mulder, Solution Architect
Four megatrends will dominate the next decade: Social, Mobility, Big data, Cloud
85 BILLION mobile apps will be downloaded in 2012
91% of organizations expect to spend on mobile devices in [...]
[...]/2 of companies expect to use internal social network apps in [...]
[...] zettabytes in 2012
>80% of new apps in 2012 will be distributed/deployed on clouds
32% of businesses are likely to invest in BI and analytics in 2012
The strategic focus in the cloud will shift in 2012 from infrastructure to application platforms
In 2012, mobile devices will outship PCs by more than 2:1 and generate more revenue than PCs for the first time
Social networking will follow not just people but also appliances, devices and products
34% of CIOs say technology as a service (cloud) will have the most profound effect on the CIO role in the future
2/3 of mobile apps developed in 2012 will integrate with analytics offerings
49% of CIOs rank BI as the top project priority for 2012
Microsoft is embracing these megatrends: Social, Mobility, Big data, Cloud
Rethinking and evolving business strategies
How will technology megatrends (Social, Mobility, Big data, Cloud) enable you to save money, drive innovation, grow your business, and attract and retain customers?
Why Big Data?
[Figure: data sources plotted by Volume, Velocity, Variety and Variability, against Storage cost per GB]
Internet of things: Audio / Video, Log Files, Text/Image, Social Sentiment, Data Market Feeds, eGov Feeds, Weather, Wikis / Blogs, Click Stream, Sensors / RFID / Devices, Spatial & GPS Coordinates
WEB 2.0: Mobile, Advertising, Collaboration, eCommerce, Digital Marketing, Search Marketing, Web Logs, Recommendations
ERP / CRM: Sales Pipeline, Payables, Payroll, Inventory, Contacts, Deal Tracking
Volume scale: Gigabytes (10^9), Terabytes (10^12), Petabytes (10^15), Exabytes (10^18)
Example Scenarios
[Diagram: Logs -> ETL -> Some Data -> Data Warehouse; Excess Data is left out]
[Diagram: Logs -> Raw Data -> “Store it All” Cluster -> Data Warehouse]
Understanding the Basics: Move the Compute to the Data
Hadoop Distributed Architecture
[Diagram labels: Server, Files, Server; RUNTIME, Code]
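The idea of moving the compute to the data can be illustrated with a toy scheduler. The sketch below is a simplified illustration, not Hadoop's actual scheduling logic: the function name `assign_tasks`, the block names, and the slot counts are all hypothetical. It prefers to run each task on a node that already stores a replica of the block, and only falls back to a remote node when no local slot is free.

```python
def assign_tasks(block_locations, slots_per_node):
    """Assign one task per block, preferring nodes that hold the block.

    block_locations: dict of block id -> list of nodes storing a replica
    slots_per_node:  dict of node -> number of free task slots
    (Both structures are hypothetical, for illustration only.)
    """
    remaining = dict(slots_per_node)  # copy so the caller's dict is untouched
    assignment = {}
    for block, replicas in block_locations.items():
        # Nodes that store this block and still have a free slot.
        local = [n for n in replicas if remaining.get(n, 0) > 0]
        if local:
            node, kind = max(local, key=lambda n: remaining[n]), "local"
        else:
            # No local slot: the data would have to travel to the code.
            candidates = [n for n, free in remaining.items() if free > 0]
            if not candidates:
                raise RuntimeError("no free task slots in the cluster")
            node, kind = max(candidates, key=lambda n: remaining[n]), "remote"
        remaining[node] -= 1
        assignment[block] = (node, kind)
    return assignment


if __name__ == "__main__":
    blocks = {
        "blk_1": ["DataNode1", "DataNode2"],
        "blk_2": ["DataNode2", "DataNode3"],
        "blk_3": ["DataNode1", "DataNode3"],
    }
    slots = {"DataNode1": 1, "DataNode2": 1, "DataNode3": 1}
    for block, (node, kind) in assign_tasks(blocks, slots).items():
        print(block, "->", node, f"({kind})")
```

With one slot per node and each block replicated on two nodes, every task in this example can run next to its data, so only the small piece of code has to move.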
MapReduce – Workflow
Map tasks
Blocks of the Sales file in HDFS are spread across DataNode1, DataNode2 and DataNode3.
Each Mapper reads (custId, zipCode, amount) records from a block and performs a Group By, writing one output bucket per reduce task.
Reduce tasks
Shuffle: each Reducer collects its bucket from every Mapper.
Sort: each Reducer sorts its input by key (the diagram shows zip codes such as 53705 and 10025).
SUM: each Reducer sums the amounts for each key.
Done!
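The workflow for the Sales example can be sketched in a few lines of plain Python. This is an in-memory simulation under stated assumptions, not Hadoop code: the record values, the number of reducers, and the helper names (`partition`, `map_task`, `reduce_task`, `run_job`) are hypothetical, and the grouping key is assumed to be zipCode, since the original amounts on the slide did not survive.

```python
from collections import defaultdict
from zlib import crc32

# Hypothetical (custId, zipCode, amount) records, one list per HDFS block.
SALES_BLOCKS = [
    [(1, "53705", 30), (2, "10025", 10), (3, "53705", 30)],
    [(4, "53705", 25), (5, "10025", 20)],
    [(6, "53705", 15), (7, "54235", 40)],
]
NUM_REDUCERS = 2


def partition(zip_code, num_reducers):
    """Pick the reduce task for a key; a stable hash keeps runs repeatable."""
    return crc32(zip_code.encode("utf-8")) % num_reducers


def map_task(block, num_reducers):
    """Group one block's amounts by zip code into one bucket per reducer."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for _cust_id, zip_code, amount in block:
        buckets[partition(zip_code, num_reducers)][zip_code].append(amount)
    return buckets


def reduce_task(incoming_buckets):
    """Merge the buckets from all mappers, sort by key, and sum per key."""
    merged = defaultdict(list)
    for bucket in incoming_buckets:
        for zip_code, amounts in bucket.items():
            merged[zip_code].extend(amounts)
    return {zip_code: sum(merged[zip_code]) for zip_code in sorted(merged)}


def run_job(blocks, num_reducers):
    """Run all map tasks, shuffle buckets to reducers, and collect totals."""
    map_outputs = [map_task(block, num_reducers) for block in blocks]
    totals = {}
    for r in range(num_reducers):
        # Shuffle: reducer r receives bucket r from every mapper.
        totals.update(reduce_task([output[r] for output in map_outputs]))
    return totals


if __name__ == "__main__":
    for zip_code, total in sorted(run_job(SALES_BLOCKS, NUM_REDUCERS).items()):
        print(zip_code, total)
```

Because every record for a given zip code is routed to the same reducer, each reducer can compute its sums independently of the others, which is what lets the job scale out across nodes.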
HDInsight
[Diagram: the HDFS API sits on top of two storage options]
DFS (1 Data Node per Worker Role) and Compute Cluster: Name Node, Data Node, …
Azure Storage (ASV): Azure Blob Storage, made up of Front end, Partition Layer and Stream Layer
HDINSIGHT / HADOOP Eco-System
Distributed Storage (HDFS), Distributed Processing (MapReduce), Query (Hive)
Legend: Red = Core Hadoop; Blue = Data processing; Purple = Microsoft integration points and value adds; Orange = Data Movement; Green = Packages
Hive, Pig, Mahout, Cascading, Scalding, Scoobi, Pegasus…
C#, F# Map/Reduce, LINQ to Hive, .NET management clients
JavaScript Map/Reduce, Browser hosted console, Node.js management clients
PowerShell, Cross Platform CLI tools
TRADITIONAL RDBMS vs. MAPREDUCE, compared on: Data Size, Access, Updates, Structure, Integrity, Scaling, DBA Ratio
Demo: Deploying and Interacting With HDInsight Service
NuGet: Hadoop SDK:
© 2012 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.