R and HDInsight in Microsoft Azure

Presentation transcript:

R and HDInsight in Microsoft Azure
Sebastian Brandes, sbrand@microsoft.com
© 2013 Microsoft Corporation. All rights reserved. Microsoft, Windows, and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.

About Sebastian
Tech Evangelist, Microsoft
Responsible for Azure Evangelism in Denmark
Economics and Statistics background (Aarhus University)
Twitter: @sebastianbk
Blog: https://sebastianbrandes.com

Agenda
What is Big Data (according to Microsoft)?
Hadoop and The Hadoop Ecosystem – some stats
Introduction to Microsoft Azure and HDInsight
Provisioning a Hadoop cluster in Azure
Installing R on the cluster
Running MapReduce jobs using R
Azure Machine Learning + R
Wrapping Up

Introducing Big Data
Device Explosion: >5.5 billion (70+% of global population)
Social Networks: >2 billion users
Cheap Storage: $100 gets you 3 million times more storage in 30 years
Ubiquitous Connection: web traffic of 130 exabytes (10^18 bytes) in 2010 and 1.6 zettabytes (10^21 bytes) in 2015
Sensor Networks: >10 billion
Faster CPUs: 10 MIPS/$ in 1980, 10M MIPS/$ in 2005

Introducing Big Data “Big data is a collection of data sets so large and complex that it becomes awkward to work with using on-hand database management tools. Difficulties include capture, storage, search, sharing, analysis, and visualization.” – Wikipedia

Introducing Big Data
Big data solutions deal with complexities of: VOLUME (Size), VARIETY (Structure), VELOCITY (Speed)

Introducing Big Data
[Chart: Volume plotted against Velocity and Variety. Volume scale: Gigabytes (10^9), Terabytes (10^12), Petabytes (10^15), Exabytes (10^18).]
ERP / CRM: Sales Pipeline, Payables, Payroll, Inventory, Contacts, Deal Tracking
Modern Web: Mobile, Advertising, Collaboration, eCommerce, Digital Marketing, Search Marketing, Web Logs, Recommendations
Internet of Things: Audio / Video, Log Files, Text/Image, Social Sentiment, Data Market Feeds, eGov Feeds, Weather, Wikis / Blogs, Click Stream, Sensors / RFID / Devices, Spatial & GPS Coordinates

Introducing Big Data
Responding to New Questions
Social Analytics: What’s the social sentiment of my product?
Advanced Analytics: How do I better predict future outcomes?
Live Data Feed: How do I optimize my services based on patterns of weather, traffic, etc.?

vs.

One Person Tweeting For 100 Years
140 characters per tweet
No metadata
Assume 2 bytes per character (a simplification: UTF-8 actually uses 1 to 4 bytes per character)
Writes a tweet every 10 seconds
Tweets around the clock, 24/7
Uses maximum characters available
Is provided with Coca-Cola to keep up with tweeting

140 chars * 2 bytes = 280 bytes

280 bytes * 3600 / 10 = 100,800 B/hr

100,800 * 24 * 365 = 883,008,000 B/yr

883,008,000 * 100 = 88,300,800,000 bytes

88,300,800,000 bytes = 82.2365 GB (with 1 GB = 1,024^3 bytes)
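The arithmetic on the preceding slides can be replayed as a short Python sketch. The constant names are illustrative, the inputs are the assumptions stated on the slides, and the year is taken as 365 days with leap days ignored, as the slides do.

```python
# Inputs taken from the slides: 140 characters per tweet, 2 bytes per
# character, one tweet every 10 seconds, around the clock for 100 years.
CHARS_PER_TWEET = 140
BYTES_PER_CHAR = 2
SECONDS_PER_TWEET = 10
YEARS = 100

bytes_per_tweet = CHARS_PER_TWEET * BYTES_PER_CHAR            # 280
bytes_per_hour = bytes_per_tweet * 3600 // SECONDS_PER_TWEET  # 100,800
bytes_per_year = bytes_per_hour * 24 * 365                    # 883,008,000
bytes_total = bytes_per_year * YEARS                          # 88,300,800,000

# The slides use binary units, so 1 GB is 1024**3 bytes here.
gigabytes_total = bytes_total / 1024**3

print(f"{bytes_total:,} bytes = {gigabytes_total:.4f} GB")
```

With decimal units (1 GB = 10^9 bytes) the same total would read as 88.3 GB, so the choice of unit matters when comparing figures.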

3 billion Internet users (as of March 2015)

82.2365 * 3,000,000,000 = 246,709,500,000 GB = 240,927,246 TB = 235,280 PB = 230 EB

“The Large Hadron Collider (LHC) is the world's largest and most powerful particle collider, and the largest single machine in the world, built by the European Organization for Nuclear Research (CERN) from 1998 to 2008.” – Wikipedia

300 gigabytes per second

300 GB * 3600 * 24 * 365 * 100 = 946,080,000,000 GB = 923,906,250 TB = 902,252 PB = 881 EB
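To compare the two estimates, here is a small Python sketch that reproduces both conversion chains. The helper name `binary_chain` is made up for illustration, and it divides by 1,024 at each step, which is what the slides appear to do.

```python
def binary_chain(gigabytes):
    """Convert a GB figure to (TB, PB, EB), dividing by 1,024 at each step."""
    terabytes = gigabytes / 1024
    petabytes = terabytes / 1024
    exabytes = petabytes / 1024
    return terabytes, petabytes, exabytes

# Large Hadron Collider: 300 GB per second, sustained for 100 years.
lhc_gb = 300 * 3600 * 24 * 365 * 100
lhc_tb, lhc_pb, lhc_eb = binary_chain(lhc_gb)

# Every Internet user tweeting nonstop: 82.2365 GB each, 3 billion users.
tweets_gb = 82.2365 * 3_000_000_000
tweets_tb, tweets_pb, tweets_eb = binary_chain(tweets_gb)

print(f"LHC:    {lhc_gb:,} GB = {lhc_eb:.0f} EB")
print(f"Tweets: {tweets_gb:,.0f} GB = {tweets_eb:.0f} EB")
print(f"LHC produces {lhc_eb / tweets_eb:.1f} times as much data")
```

Under these assumptions, a century of LHC output is several times larger than a century of every Internet user tweeting without pause.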

You Tell Me What It Is!

Traditional Hadoop cluster
Client(s)
Masters: Job Tracker for distributed data processing (MR); Primary Name Node and Secondary Name Node for distributed data storage (HDFS)
Slaves: each runs a Task Tracker and a Data Node, and holds the data

MapReduce Architecture (v1)
The job tracker is the single controller for dispatch and status reporting.
A client submits a job to the job tracker, which in turn coordinates a series of tasks on the task trackers.
[Diagram: clients, one Job Tracker, and several Task Trackers each running tasks; arrows show job submission and status.]
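A toy Python sketch of the v1 control flow may help. The class and method names below are invented for illustration and are not Hadoop APIs; the point is only that one job tracker object handles both dispatch and status collection.

```python
from collections import deque

class TaskTracker:
    """Worker that runs tasks handed to it (hypothetical toy class)."""
    def __init__(self, name):
        self.name = name

    def run(self, task_name, task):
        # A real task tracker launches map or reduce tasks;
        # here a task is just a Python callable.
        return {"tracker": self.name, "task": task_name, "result": task()}

class JobTracker:
    """Single controller: accepts jobs, dispatches tasks, collects status."""
    def __init__(self, trackers):
        self.trackers = trackers
        self.status = []

    def submit(self, tasks):
        pending = deque(tasks)
        slot = 0
        while pending:
            tracker = self.trackers[slot % len(self.trackers)]  # round-robin
            task_name, task = pending.popleft()
            self.status.append(tracker.run(task_name, task))
            slot += 1
        return self.status

job_tracker = JobTracker([TaskTracker("tt1"), TaskTracker("tt2")])
report = job_tracker.submit([
    ("count", lambda: len("hadoop")),
    ("total", lambda: sum(range(4))),
    ("upper", lambda: "r".upper()),
])
for entry in report:
    print(entry)
```

Because every submission and every status report passes through the one `JobTracker` instance, it becomes the bottleneck as the number of jobs grows.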

YARN Architecture
Client submits an app to the Resource Manager (RM).
RM instantiates an App Master (AM) on a node.
App Master asks RM for more containers.
Containers report status to AM.
Clients query AM for the status of a job.
Nodes report node status to RM.
This removes the contention of having the Job Tracker do everything, makes the cluster more resilient to RM failures, and lets the number of active jobs scale further.
[Diagram: clients, one Resource Manager, and Node Managers hosting containers and App Masters; arrows show M/R status, job submission, node status, and resource requests.]
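The division of labour in YARN can be sketched in Python as follows. This is a toy model with invented class names, not the YARN API, and it assumes one container slot per node for simplicity.

```python
class ResourceManager:
    """Hands out containers only; it does not track per-job task status."""
    def __init__(self, node_names):
        self.free_nodes = list(node_names)  # toy assumption: one slot per node

    def allocate(self, count):
        granted = self.free_nodes[:count]
        self.free_nodes = self.free_nodes[count:]
        return granted

    def submit(self, tasks):
        am_node = self.allocate(1)[0]  # RM starts the App Master on a node
        return AppMaster(am_node, tasks, self)

class AppMaster:
    """Per-application controller: requests containers and tracks status."""
    def __init__(self, node, tasks, rm):
        self.node = node
        self.status = {}
        containers = rm.allocate(len(tasks))  # AM asks RM for more containers
        # If fewer containers are granted than tasks, zip drops the extras;
        # a real App Master would wait and ask again.
        for container_node, (name, task) in zip(containers, tasks):
            self.status[name] = (container_node, task())  # container reports to AM

    def query(self, name):
        # Clients ask the App Master, not the Resource Manager, for job status.
        return self.status[name]

rm = ResourceManager(["node1", "node2", "node3"])
am = rm.submit([("square", lambda: 4 * 4), ("total", lambda: sum(range(5)))])
print(am.node, am.query("square"), am.query("total"))
```

Each application gets its own `AppMaster`, so status tracking is spread across the cluster instead of sitting on a single controller.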

Hadoop Timeline
October 2003: Google publishes GFS paper http://research.google.com/archive/gfs.html
December 2004: Google publishes MapReduce paper http://research.google.com/archive/mapreduce.html
2005: Doug Cutting (Yahoo!) and Mike Cafarella (U Washington) create Hadoop
[Insert many years here.]
November 2012: HDInsight is released in Technical Preview (as PaaS and IaaS)
August 2013: Hadoop 1.X is stable
October 2013: HDInsight is generally available
October 2013: Hadoop 2.X is generally available
June 2014: HDInsight 3.1 (Hadoop 2.4.0) is released

Absolute

Absolute

Relative

Rexer Analytics 2013 Data Miner Survey: Tools Ranked By Usage

Any Ideas?

5 weeks ago!

Let’s Talk About HDInsight

Introduction to RHadoop
Integration between the R statistical package and Hadoop’s Distributed File System and MapReduce Computation Engine
Moves algorithm execution closer to the data
Provides access to lots of high-quality statistical libraries
Speeds work by processing in parallel
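For readers new to the MapReduce model that RHadoop builds on, here is a minimal word count in plain Python. It is a single-process illustration of the map, shuffle, and reduce steps, not RHadoop's R API, and the function names are chosen for this example only.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)

lines = ["R on Hadoop", "Hadoop on Azure", "R on Azure"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```

On a cluster, the map calls run on the nodes that hold each block of input and the reduce calls run in parallel per key, which is where the speed-up comes from.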

Demo

Azure Machine Learning

Wrap Up

azure.com

Sebastian Brandes, sbrand@microsoft.com