1
R and HDInsight in Microsoft Azure
Visual Studio 4/12/2017 R and HDInsight in Microsoft Azure Sebastian Brandes, © 2013 Microsoft Corporation. All rights reserved. Microsoft, Windows, and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.
2
About Sebastian
Tech Evangelist, Microsoft
Responsible for Azure Evangelism in Denmark
Economics and Statistics background (Aarhus University)
Blog:
3
Agenda
What is Big Data (according to Microsoft)?
Hadoop and The Hadoop Ecosystem – some stats
Introduction to Microsoft Azure and HDInsight
Provisioning a Hadoop cluster in Azure
Installing R on the cluster
Running MapReduce jobs using R
Azure Machine Learning + R
Wrapping Up
4
Introducing Big Data
Device Explosion: >5.5 billion (70+% of global population)
Social Networks: >2 billion users
Cheap Storage: $100 gets you 3 million times more storage in 30 years
Ubiquitous Connection: web traffic, Exabyte (10^18) to Zettabyte (10^21)
Sensor Networks: >10 billion
Faster CPUs: MIPS/$ M MIPS/$
5
Introducing Big Data
“Big data is a collection of data sets so large and complex that it becomes awkward to work with using on-hand database management tools. Difficulties include capture, storage, search, sharing, analysis, and visualization.” – Wikipedia
6
Introducing Big Data
Big data solutions deal with complexities of:
VOLUME (Size)
VARIETY (Structure)
VELOCITY (Speed)
7
Introducing Big Data
[Diagram: data sources plotted by Volume against Velocity and Variety]
Volume scale: Gigabytes (10^9), Terabytes (10^12), Petabytes (10^15), Exabytes (10^18)
ERP / CRM: Sales Pipeline, Payables, Payroll, Inventory, Contacts, Deal Tracking
Modern Web: Mobile, Advertising, Collaboration, eCommerce, Digital Marketing, Search Marketing, Web Logs, Recommendations
Internet of Things: Audio / Video, Log Files, Text/Image, Social Sentiment, Data Market Feeds, eGov Feeds, Weather, Wikis / Blogs, Click Stream, Sensors / RFID / Devices, Spatial & GPS Coordinates
8
Introducing Big Data
Responding to New Questions
Social Analytics: What’s the social sentiment of my product?
Advanced Analytics: How do I better predict future outcomes?
Live Data Feed: How do I optimize my services based on patterns of weather, traffic, etc.?
10
vs.
11
One Person Tweeting For 100 Years
140 characters per tweet
No metadata
Encoded at 2 bytes per character (true of UTF-16 for most characters; UTF-8 uses 1 to 4 bytes per character)
Writes a tweet every 10 seconds
Tweets around the clock, 24/7
Uses maximum characters available
Is provided with Coca-Cola to keep up with tweeting
12
140 chars * 2 bytes = 280 bytes
13
280 bytes * 3600 / 10 = 100,800 B/hr
14
100,800 * 24 * 365 = 883,008,000 B/yr
15
883,008,000 * 100 = 88,300,800,000 bytes
16
88,300,800,000 bytes = 82.2365 GB (1 GB = 1024^3 bytes)
17
3 billion Internet users
As of March 2015
18
82.2365 GB * 3,000,000,000 = 246,709,500,000 GB = 240,927,246 TB = 235,280 PB = 230 EB
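The arithmetic on the tweeting slides can be checked with a short script. This is a minimal sketch using only the assumptions stated on the slides (140 characters, 2 bytes per character, one tweet every 10 seconds, 100 years, 3 billion users); the variable names are illustrative, and leap days are ignored as they are on the slides.

```python
# Assumptions taken from the slides, not measured data.
CHARS_PER_TWEET = 140
BYTES_PER_CHAR = 2
SECONDS_PER_TWEET = 10
YEARS = 100
INTERNET_USERS = 3_000_000_000  # figure quoted on the slide

bytes_per_tweet = CHARS_PER_TWEET * BYTES_PER_CHAR            # 280
bytes_per_hour = bytes_per_tweet * 3600 // SECONDS_PER_TWEET  # 100,800
bytes_per_year = bytes_per_hour * 24 * 365                    # 883,008,000
bytes_per_lifetime = bytes_per_year * YEARS                   # 88,300,800,000

# The slides use binary units: each step up (GB, TB, PB, EB) divides by 1024.
gb_per_person = bytes_per_lifetime / 1024**3
gb_all_users = gb_per_person * INTERNET_USERS
eb_all_users = gb_all_users / 1024**3

print(f"{bytes_per_lifetime:,} bytes = {gb_per_person:.4f} GB per person")
print(f"{eb_all_users:.0f} EB for {INTERNET_USERS:,} users")
```

The total GB figure printed by the script differs slightly from the slide in the last digits because the slide rounds the per-person value to four decimals before multiplying.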
20
“The Large Hadron Collider (LHC) is the world's largest and most powerful particle collider, and the largest single machine in the world, built by the European Organization for Nuclear Research (CERN) from 1998 to 2008.” – Wikipedia
21
300 gigabytes per second
22
300 GB * 3600 * 24 * 365 * 100 = 946,080,000,000 GB = 923,906,250 TB = 902,252 PB = 881 EB
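The same kind of check works for the LHC figure. The sketch below assumes the 300 GB per second rate from the slide, sustained for 100 years with no downtime; `to_binary_units` is a hypothetical helper that steps up through TB, PB and EB by factors of 1024.

```python
def to_binary_units(gigabytes):
    """Convert a GB figure to (TB, PB, EB) using factors of 1024."""
    tb = gigabytes / 1024
    pb = tb / 1024
    eb = pb / 1024
    return tb, pb, eb

GB_PER_SECOND = 300  # rate quoted on the slide
YEARS = 100          # assumed horizon, leap days ignored
seconds = 3600 * 24 * 365 * YEARS

total_gb = GB_PER_SECOND * seconds
tb, pb, eb = to_binary_units(total_gb)
print(f"{total_gb:,} GB = {tb:,.0f} TB = {pb:,.0f} PB = {eb:,.0f} EB")
```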
29
You Tell Me What It Is!
31
Traditional Hadoop cluster
Client(s)
Masters: Job Tracker (distributed data processing, MapReduce); Primary Name Node and Secondary Name Node (distributed data storage, HDFS)
Slaves: each node runs a Task Tracker (processing) alongside a Data Node (storage)
32
MapReduce Architecture (v1)
The job tracker is the single controller for dispatch and status reporting.
A client submits a job to the job tracker, which in turn coordinates a series of tasks on the task trackers.
[Diagram: Client 1 submits to the Job Tracker; each Task Tracker runs tasks and reports status back.]
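To make the tasks concrete, here is a minimal in-process sketch of the map, shuffle and reduce phases, using word counting as the example. It is not Hadoop code: `run_job` is a hypothetical stand-in for the job tracker, and everything runs in a single Python process rather than on task trackers.

```python
from collections import defaultdict

def map_task(line):
    """Map phase: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_task(key, values):
    """Reduce phase: combine all values for one key into a single result."""
    return key, sum(values)

def run_job(lines):
    """Stand-in for the job tracker: run map tasks, shuffle, then reduce tasks."""
    mapped = [pair for line in lines for pair in map_task(line)]
    grouped = shuffle(mapped)
    return dict(reduce_task(k, v) for k, v in grouped.items())

counts = run_job(["big data is big", "data moves fast"])
print(counts)
```

In a real cluster each call to `map_task` or `reduce_task` would be a separate task scheduled on a task tracker, ideally on the node that already holds the input block.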
33
YARN Architecture
Client submits an app to the Resource Manager (RM).
RM instantiates an App Master (AM) on a node.
AM asks RM for more containers.
Containers report status to AM.
Clients query AM for the status of the job.
Nodes report node status to RM.
As a result:
Removes the contention of having the Job Tracker do everything
More resilient to RM failures
Number of active jobs scales better
[Diagram: Client 1, Resource Manager, and Node Managers hosting the App Master and containers; arrows show M/R status, submission, node status, and resource requests.]
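The division of labour between the Resource Manager and the App Master can be sketched as a toy model. The classes and methods below are hypothetical and do not correspond to the YARN API; containers are never released, and status reporting is reduced to a dictionary update.

```python
class ResourceManager:
    """Toy scheduler: tracks free container slots per node."""
    def __init__(self, slots_per_node):
        self.free = dict(slots_per_node)  # node name -> free slots

    def allocate(self, count):
        """Grant up to `count` containers, returning the node for each one."""
        granted = []
        for node in self.free:
            while self.free[node] > 0 and len(granted) < count:
                self.free[node] -= 1
                granted.append(node)
        return granted

class AppMaster:
    """Per-application coordinator: requests containers and tracks task status."""
    def __init__(self, rm, tasks):
        self.rm = rm
        self.pending = list(tasks)
        self.status = {}

    def run(self):
        containers = self.rm.allocate(len(self.pending))
        for node, task in zip(containers, self.pending):
            self.status[task] = f"done on {node}"  # container reports to the AM
        self.pending = self.pending[len(containers):]
        return self.status

def submit(rm, tasks):
    """Client submits an app: the RM first places the App Master in a container."""
    if not rm.allocate(1):
        raise RuntimeError("no capacity for the App Master")
    return AppMaster(rm, tasks)

rm = ResourceManager({"node1": 2, "node2": 2})
am = submit(rm, ["map-0", "map-1", "reduce-0"])
print(am.run())  # the client asks the AM, not the RM, for job status
```

The point of the split is visible in the model: the `ResourceManager` only hands out slots, while per-job bookkeeping lives in the `AppMaster`, so many applications can run without a single component tracking every task.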
34
Hadoop Timeline
October 2003: Google publishes GFS paper
December 2004: Google publishes MapReduce paper
2005: Doug Cutting (Yahoo!) and Mike Cafarella (U Washington) create Hadoop
[Insert many years here.]
November 2012: HDInsight is released in Technical Preview (as PaaS and IaaS)
August 2013: Hadoop 1.X is stable
October 2013: HDInsight is generally available
October 2013: Hadoop 2.X is generally available
June 2014: HDInsight 3.1 (Hadoop 2.4.0) is released
35
Absolute
36
Absolute
37
Relative
38
Rexer Analytics 2013 Data Miner Survey: Tools Ranked By Usage
40
Any Ideas?
46
5 weeks ago!
48
Let’s Talk About HDInsight
49
Introduction to RHadoop
Integration between the R statistical package and Hadoop’s Distributed File System and MapReduce computation engine
Moves algorithm execution closer to the data
Provides access to lots of high-quality statistical libraries
Speeds work by processing in parallel
52
Demo
53
Azure Machine Learning
56
Wrap Up
57
azure.com
58
Sebastian Brandes, sbrand@microsoft.com