Tech Evangelist, Microsoft. Responsible for Azure Evangelism in Denmark. Economics and Statistics background (Aarhus University). Blog: https://sebastianbrandes.com
What is Big Data (according to Microsoft)?
Hadoop and the Hadoop Ecosystem: some stats
Introduction to Microsoft Azure and HDInsight
Provisioning a Hadoop cluster in Azure
Installing R on the cluster
Running MapReduce jobs using R
Azure Machine Learning + R
Wrapping Up
$100 gets you 3 million times more storage in 30 years. [Chart: MIPS/$] >5.5 billion (70+% of global population); >2 billion users; web traffic: exabyte (10^18 bytes), zettabyte (10^21 bytes); >10 billion.
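To put the storage figure in perspective, here is a rough back-of-the-envelope sketch. It assumes smooth exponential growth over the whole period, which the slide does not claim; the variable names are illustrative only.

```python
import math

# Figures taken from the slide: 3 million times more storage per $100 in 30 years.
growth_factor = 3_000_000
years = 30

# Assumption: growth was a steady exponential over the whole period.
doublings = math.log2(growth_factor)            # how many times capacity doubled
doubling_time_years = years / doublings         # years per doubling
annual_factor = growth_factor ** (1 / years)    # year-over-year multiplier

print(f"doublings: {doublings:.1f}")
print(f"doubling time: {doubling_time_years * 12:.0f} months")
print(f"annual growth: {(annual_factor - 1) * 100:.0f}% per year")
```

Under that assumption, storage per dollar doubled a little more often than every year and a half.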
“Big data is a collection of data sets so large and complex that it becomes awkward to work with using on-hand database management tools. Difficulties include capture, storage, search, sharing, analysis, and visualization.” – Wikipedia
Volume: Gigabytes (10^9), Terabytes (10^12), Petabytes (10^15), Exabytes (10^18)
Velocity - Variety
ERP / CRM: Sales Pipeline, Payables, Payroll, Inventory, Contacts, Deal Tracking
Modern Web: Mobile, Advertising, Collaboration, eCommerce, Digital Marketing, Search Marketing, Web Logs, Recommendations
Internet of Things: Audio / Video, Log Files, Text/Image, Social Sentiment, Data Market Feeds, eGov Feeds, Weather, Wikis / Blogs, Click Stream, Sensors / RFID / Devices, Spatial & GPS Coordinates
How do I optimize my services based on patterns of weather, traffic, etc.? What’s the social sentiment of my product? How do I better predict future outcomes?
As of March 2015
“The Large Hadron Collider (LHC) is the world's largest and most powerful particle collider, and the largest single machine in the world, built by the European Organization for Nuclear Research (CERN) from 1998 to [...]” – Wikipedia
Client(s)
Masters: Job Tracker, Primary Name Node, Secondary Name Node
Slaves: several nodes, each pairing a Task Tracker with a Data Node
5 weeks ago!
Integration between the R statistical package and Hadoop's Distributed File System and MapReduce computation engine:
Moves algorithm execution closer to the data
Provides access to lots of high-quality statistical libraries
Speeds work by processing in parallel
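The map, shuffle and reduce steps behind "processing in parallel" can be sketched in a few lines. This is a minimal in-memory illustration in plain Python, not the R integration or any Hadoop API; the function names and the sample chunks are made up for the example, and each chunk simply stands in for a block of data held on a separate node.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(chunk):
    """Emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks):
    """Group all emitted values by key, as happens between map and reduce."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse the values collected for one key into a single count."""
    return key, sum(values)


def word_count(chunks):
    # Hypothetical setup: each chunk stands in for a block on a separate node,
    # and the thread pool stands in for the workers that map them concurrently.
    with ThreadPoolExecutor() as pool:
        mapped = list(pool.map(map_phase, chunks))
    groups = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in groups.items())


chunks = ["big data on Hadoop", "R on Hadoop", "big clusters"]
counts = word_count(chunks)
print(counts)
```

The map function only ever sees its own chunk, which is what lets the work run where the data lives; only the small (key, value) pairs need to move between phases.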