R and HDInsight in Microsoft Azure 4/12/2017 Sebastian Brandes, sbrand@microsoft.com © 2013 Microsoft Corporation. All rights reserved. Microsoft, Windows, and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.
About Sebastian Tech Evangelist, Microsoft Responsible for Azure Evangelism in Denmark Economics and Statistics background (Aarhus University) Twitter: @sebastianbk Blog: https://sebastianbrandes.com
Agenda What is Big Data (according to Microsoft)? Hadoop and The Hadoop Ecosystem – some stats Introduction to Microsoft Azure and HDInsight Provisioning a Hadoop cluster in Azure Installing R on the cluster Running MapReduce jobs using R Azure Machine Learning + R Wrapping Up
Introducing Big Data Device Explosion: >5.5 billion (70+% of global population). Social Networks: >2 billion users. Cheap Storage: $100 buys 3 million times more storage than it did 30 years ago. Ubiquitous Connection: web traffic of 130 exabytes (10^18 bytes) in 2010 and 1.6 zettabytes (10^21 bytes) in 2015. Sensor Networks: >10 billion. Faster CPUs: 10 MIPS/$ in 1980, 10M MIPS/$ in 2005.
Introducing Big Data “Big data is a collection of data sets so large and complex that it becomes awkward to work with using on-hand database management tools. Difficulties include capture, storage, search, sharing, analysis, and visualization.” – Wikipedia
Introducing Big Data Big data solutions deal with complexities of: VOLUME (Size) VARIETY (Structure) VELOCITY (Speed)
Introducing Big Data Volume scale: Gigabytes (10^9), Terabytes (10^12), Petabytes (10^15), Exabytes (10^18). Velocity and variety grow with volume. ERP / CRM: Sales Pipeline, Payables, Payroll, Inventory, Contacts, Deal Tracking. Modern Web: Mobile, Advertising, Collaboration, eCommerce, Digital Marketing, Search Marketing, Web Logs, Recommendations. Internet of Things: Audio / Video, Log Files, Text/Image, Social Sentiment, Data Market Feeds, eGov Feeds, Weather, Wikis / Blogs, Click Stream, Sensors / RFID / Devices, Spatial & GPS Coordinates.
Introducing Big Data Responding to New Questions. Social Analytics: What’s the social sentiment of my product? Live Data Feed: How do I optimize my services based on patterns of weather, traffic, etc.? Advanced Analytics: How do I better predict future outcomes?
vs.
One Person Tweeting For 100 Years: 140 characters per tweet. No metadata. Encoded at 2 bytes per character (true of UTF-16 for most characters; UTF-8 uses 1 to 4 bytes per character). Writes a tweet every 10 seconds. Tweets around the clock, 24/7. Uses maximum characters available. Is provided with Coca-Cola to keep up with tweeting.
140 chars * 2 bytes = 280 bytes
280 bytes * 3600 / 10 = 100,800 B/hr
100,800 * 24 * 365 = 883,008,000 B/yr
883,008,000 * 100 = 88,300,800,000 bytes
88,300,800,000 bytes = 82.2365 GB (with 1 GB = 1024^3 bytes)
3 billion Internet users as of March 2015
82.2365 * 3,000,000,000 = 246,709,500,000 GB = 240,927,246 TB = 235,280 PB = 230 EB
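The tweet arithmetic above can be re-checked with a short script. This is only a sketch under the slides' own assumptions (140 characters, 2 bytes per character, one tweet every 10 seconds, 365-day years, unit steps of 1024); the variable names are illustrative and not part of the talk.

```python
# Assumptions from the slides: 140 chars, 2 bytes per char, one tweet every
# 10 seconds, 365-day years, and binary unit steps (1 GB = 1024**3 bytes).
BYTES_PER_TWEET = 140 * 2
TWEETS_PER_HOUR = 3600 // 10
YEARS = 100
USERS = 3_000_000_000

bytes_per_hour = BYTES_PER_TWEET * TWEETS_PER_HOUR
bytes_per_year = bytes_per_hour * 24 * 365
bytes_per_person = bytes_per_year * YEARS

# One person, 100 years, expressed in GB.
gb_per_person = bytes_per_person / 1024**3

# All Internet users: GB -> TB -> PB -> EB is three more steps of 1024.
total_eb = gb_per_person * USERS / 1024**3

print(bytes_per_hour, bytes_per_year, bytes_per_person)
print(f"{gb_per_person:.4f} GB per person, about {total_eb:.0f} EB in total")
```

Changing `BYTES_PER_TWEET` to `140 * 1` shows how much the result depends on the encoding assumption: the total halves.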
“The Large Hadron Collider (LHC) is the world's largest and most powerful particle collider, and the largest single machine in the world, built by the European Organization for Nuclear Research (CERN) from 1998 to 2008.” – Wikipedia
300 gigabytes per second
300 GB * 3600 * 24 * 365 * 100 = 946,080,000,000 GB = 923,906,250 TB = 902,252 PB = 881 EB
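To put the two totals side by side, the following sketch recomputes the LHC figure and compares it with the tweet total of 82.2365 GB times 3 billion users from the earlier slide. The helper `to_larger_units` is a hypothetical name introduced here for illustration; unit steps of 1024 are assumed, as in the slides.

```python
GB_PER_SECOND = 300                 # LHC data rate quoted on the slide
SECONDS_PER_YEAR = 3600 * 24 * 365  # 365-day year, as in the slides
YEARS = 100


def to_larger_units(gigabytes):
    """Convert GB to TB, PB and EB using steps of 1024."""
    result = {}
    value = gigabytes
    for unit in ("TB", "PB", "EB"):
        value /= 1024
        result[unit] = value
    return result


lhc_gb = GB_PER_SECOND * SECONDS_PER_YEAR * YEARS
lhc = to_larger_units(lhc_gb)

# Tweet total from the earlier slide: 82.2365 GB per person, 3 billion users.
tweets = to_larger_units(82.2365 * 3_000_000_000)

ratio = lhc["EB"] / tweets["EB"]
print(f"LHC: {lhc['EB']:.0f} EB, tweets: {tweets['EB']:.0f} EB, ratio: {ratio:.1f}x")
```

Under these assumptions, a century of LHC output is a little under four times the volume of three billion people tweeting non-stop for the same period.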
You Tell Me What It Is!
Traditional Hadoop cluster Client(s) talk to the masters. Masters: Job Tracker, Primary Name Node, Secondary Name Node. Slaves: each node runs a Task Tracker and a Data Node and holds the data. Layers: Distributed Data Processing (MR) on top of Distributed Data Storage (HDFS).
MapReduce Architecture (v1) The Job Tracker is the single controller for dispatch and status reporting. A client submits a job to the Job Tracker, which in turn coordinates a series of tasks on the Task Trackers. Diagram: Clients send submissions to the Job Tracker; each Task Tracker runs tasks and reports status back.
YARN Architecture Client submits an App to the Resource Manager (RM). RM instantiates an App Master (AM) on a node. App Master asks RM for more containers. Containers report status to AM. Clients query AM for status of job. Nodes report node status to RM. Thus: removes the contention of having the Job Tracker do everything; becomes more resilient to RM failures; the number of active jobs is more scalable. Diagram: Clients, one Resource Manager, and several Node Managers hosting App Masters and Containers, connected by M/R Status, Submission, Node Status and Resource Request flows.
Hadoop Timeline October 2003: Google publishes GFS paper (http://research.google.com/archive/gfs.html). December 2004: Google publishes MapReduce paper (http://research.google.com/archive/mapreduce.html). 2005: Doug Cutting (Yahoo!) and Mike Cafarella (U Washington) create Hadoop. [Insert many years here.] November 2012: HDInsight is released in Technical Preview (as PaaS and IaaS). August 2013: Hadoop 1.X is stable. October 2013: HDInsight is generally available. October 2013: Hadoop 2.X is generally available. June 2014: HDInsight 3.1 (Hadoop 2.4.0) is released.
Absolute
Absolute
Relative
Rexer Analytics 2013 Data Miner Survey: Tools Ranked By Usage
Any Ideas?
5 weeks ago!
Let’s Talk About HDInsight
Introduction to RHadoop Integration between the R statistical package and Hadoop’s Distributed File System and MapReduce Computation Engine. Moves algorithm execution closer to the data. Provides access to lots of high-quality statistical libraries. Speeds work by processing in parallel.
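The MapReduce programming model that RHadoop exposes to R can be illustrated with a small in-process word count. This is a sketch in plain Python, not RHadoop code and not how Hadoop is implemented: the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) are hypothetical, and everything runs sequentially in one process purely to show the three stages.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values emitted for one key into a single count."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle and reduce sequentially over a list of lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = run_job(["big data on Azure", "R on Hadoop", "Big Data and R"])
print(counts)
```

On a real cluster the map calls run in parallel on the nodes that hold the input blocks, which is what "moves algorithm execution closer to the data" refers to; only the grouped intermediate pairs travel across the network to the reducers.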
Demo
Azure Machine Learning
Wrap Up
azure.com
Sebastian Brandes, sbrand@microsoft.com