R and HDInsight in Microsoft Azure


1 R and HDInsight in Microsoft Azure
Visual Studio 4/12/2017 R and HDInsight in Microsoft Azure Sebastian Brandes, © 2013 Microsoft Corporation. All rights reserved. Microsoft, Windows, and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.

2 About Sebastian
Tech Evangelist, Microsoft
Responsible for Azure Evangelism in Denmark
Economics and Statistics background (Aarhus University)
Blog:

3 Agenda
What is Big Data (according to Microsoft)?
Hadoop and the Hadoop ecosystem: some stats
Introduction to Microsoft Azure and HDInsight
Provisioning a Hadoop cluster in Azure
Installing R on the cluster
Running MapReduce jobs using R
Azure Machine Learning + R
Wrapping Up

4 Introducing Big Data
Device Explosion: >5.5 billion (70+% of global population)
Social Networks: >2 billion users
Cheap Storage: $100 gets you 3 million times more storage in 30 years
Ubiquitous Connection: web traffic, Exabyte (10^18) Zettabyte (10^21)
Sensor Networks: >10 billion
Faster CPUs: MIPS/$ M MIPS/$

5 Introducing Big Data “Big data is a collection of data sets so large and complex that it becomes awkward to work with using on-hand database management tools. Difficulties include capture, storage, search, sharing, analysis, and visualization.” – Wikipedia

6 Introducing Big Data Big data solutions deal with complexities of:
VOLUME (Size) VARIETY (Structure) VELOCITY (Speed)

7 Introducing Big Data
[Chart: Volume plotted against Velocity - Variety. Volume scale: Gigabytes (10^9), Terabytes (10^12), Petabytes (10^15), Exabytes (10^18). Three nested regions: ERP / CRM, Modern Web, Internet of Things.]
ERP / CRM: Sales Pipeline, Payables, Payroll, Inventory, Contacts, Deal Tracking
Modern Web: Mobile, Advertising, Collaboration, eCommerce, Digital Marketing, Search Marketing, Web Logs, Recommendations
Internet of Things: Audio / Video, Log Files, Text/Image, Social Sentiment, Data Market Feeds, eGov Feeds, Weather, Wikis / Blogs, Click Stream, Sensors / RFID / Devices, Spatial & GPS Coordinates

8 Introducing Big Data Responding to New Questions
Social Analytics: What’s the social sentiment of my product?
Live Data Feed: How do I optimize my services based on patterns of weather, traffic, etc.?
Advanced Analytics: How do I better predict future outcomes?

9

10 vs.

11 One Person Tweeting For 100 Years
140 characters per tweet
No metadata
Assumed 2 bytes per character (a simplification: UTF-8 actually uses 1 to 4 bytes per character)
Writes a tweet every 10 seconds
Tweets around the clock, 24/7
Uses maximum characters available
Is provided with Coca-Cola to keep up with tweeting

12 140 chars * 2 bytes = 280 bytes

13 280 bytes * 3600 / 10 = 100,800 B/hr

14 100,800 * 24 * 365 = 883,008,000 B/yr

15 883,008,000 * 100 = 88,300,800,000 bytes

16 88,300,800,000 bytes = 82.2365 GB (with 1 GB = 1024^3 bytes)

17 3 billion Internet users
As of March 2015

18 82.2365 GB * 3,000,000,000 = 246,709,500,000 GB = 240,927,246 TB = 235,280 PB = 230 EB
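The tweeting arithmetic on slides 12 to 18 can be reproduced with a short script. This is only a sketch under the slides' own assumptions (140 characters, 2 bytes per character, one tweet every 10 seconds, 365-day years); the variable names are made up for illustration, and the unit steps use factors of 1024 because that is what the slide totals imply. The unrounded totals differ slightly from the slide figures, which start from the rounded 82.2365 GB.

```python
# Assumptions taken from the slides, not measured data.
CHARS_PER_TWEET = 140
BYTES_PER_CHAR = 2               # simplification used on the slides
SECONDS_PER_TWEET = 10
YEARS = 100
INTERNET_USERS = 3_000_000_000   # figure quoted on slide 17

bytes_per_tweet = CHARS_PER_TWEET * BYTES_PER_CHAR             # 280
bytes_per_hour = bytes_per_tweet * 3600 // SECONDS_PER_TWEET   # 100,800
bytes_per_year = bytes_per_hour * 24 * 365                     # leap days ignored
bytes_per_person = bytes_per_year * YEARS

# Binary units: each step up divides by 1024.
gb_per_person = bytes_per_person / 1024**3
gb_total = gb_per_person * INTERNET_USERS
tb_total = gb_total / 1024
pb_total = tb_total / 1024
eb_total = pb_total / 1024

print(f"{bytes_per_person:,} bytes per person")
print(f"{gb_per_person:.4f} GB per person")
print(f"{eb_total:.0f} EB for {INTERNET_USERS:,} users")
```

Changing BYTES_PER_CHAR or SECONDS_PER_TWEET shows how sensitive the final figure is to the starting assumptions.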

19

20 “The Large Hadron Collider (LHC) is the world's largest and most powerful particle collider, and the largest single machine in the world, built by the European Organization for Nuclear Research (CERN) from 1998 to 2008.” – Wikipedia

21 300 gigabytes per second

22 300 GB * 3600 * 24 * 365 * 100 = 946,080,000,000 GB = 923,906,250 TB = 902,252 PB = 881 EB
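The same kind of check works for the LHC figure. The sketch below assumes the 300 GB per second rate from slide 21 sustained for 100 years of 365 days, and steps up through the units in factors of 1024 as the slide does; the helper name scale_up is hypothetical.

```python
GB_PER_SECOND = 300   # rate quoted on slide 21
YEARS = 100

def scale_up(gb):
    """Convert a gigabyte count to TB, PB and EB using factors of 1024."""
    tb = gb / 1024
    pb = tb / 1024
    eb = pb / 1024
    return {"GB": gb, "TB": tb, "PB": pb, "EB": eb}

# Seconds per hour * hours per day * days per year * years.
gb_total = GB_PER_SECOND * 3600 * 24 * 365 * YEARS
units = scale_up(gb_total)

for name, value in units.items():
    print(f"{value:,.0f} {name}")
```

Under these assumptions the LHC total comes out at roughly four times the tweeting total from slide 18.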

23

24

25

26

27

28

29 You Tell Me What It Is!

30

31 Traditional Hadoop cluster
Client(s)
Masters: Job Tracker (distributed data processing, MR); Primary Name Node and Secondary Name Node (distributed data storage, HDFS)
Slaves: each node runs a Task Tracker (MR) and a Data Node (HDFS) holding the data

32 MapReduce Architecture (v1)
The Job Tracker is the single controller for dispatch and status reporting. A client submits a job to the Job Tracker, which in turn coordinates a series of tasks across the Task Trackers.
[Diagram: Client 1 sends a job submission to the Job Tracker; the Job Tracker dispatches tasks to several Task Trackers, which report status back.]

33 YARN Architecture
Client submits an app to the Resource Manager (RM).
RM instantiates an App Master (AM) on a node.
App Master asks RM for more containers.
Containers report status to AM.
Clients query AM for the status of the job.
Nodes report node status to RM.
This removes the contention caused by a single Job Tracker doing everything, makes the cluster more resilient to RM failures, and lets the number of active jobs scale further.
[Diagram: Client 1, the Resource Manager, and Node Managers hosting Containers and App Masters, with arrows for M/R status, job submission, node status, and resource requests.]

34 Hadoop Timeline
October 2003: Google publishes GFS paper
December 2004: Google publishes MapReduce paper
2005: Doug Cutting (Yahoo!) and Mike Cafarella (U Washington) create Hadoop
[Insert many years here.]
November 2012: HDInsight is released in Technical Preview (as PaaS and IaaS)
August 2013: Hadoop 1.X is stable
October 2013: HDInsight is generally available
October 2013: Hadoop 2.X is generally available
June 2014: HDInsight 3.1 (Hadoop 2.4.0) is released

35 Absolute

36 Absolute

37 Relative

38 Rexer Analytics 2013 Data Miner Survey: Tools Ranked By Usage

39

40 Any Ideas?

41

42

43

44

45

46 5 weeks ago!

47

48 Let’s Talk About HDInsight

49 Introduction to RHadoop
Integration between the R statistical package and Hadoop’s Distributed File System and MapReduce computation engine
Moves algorithm execution closer to the data
Provides access to lots of high-quality statistical libraries
Speeds work by processing in parallel
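To make the MapReduce model behind these points concrete, here is a minimal word-count sketch. It is plain Python running in a single process, not RHadoop or any Hadoop API, and the function names are invented for illustration. On a real cluster the map calls would run in parallel on the nodes that hold the data, and the framework would handle the shuffle.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single count."""
    return key, sum(values)

def word_count(lines):
    # Each line is mapped independently, which is what makes the
    # map step easy to distribute across many machines.
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

sample = ["big data is big", "data is data"]
counts = word_count(sample)
print(counts)
```

Because the mapper sees one line at a time and the reducer sees one key at a time, neither needs the whole data set in memory, which is the property that lets the work be spread over a cluster.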

50

51

52 Demo

53 Azure Machine Learning

54

55

56 Wrap Up

57 azure.com

58 Sebastian Brandes, sbrand@microsoft.com

