Published by Robin Buzzard
Modified about 1 year ago
Introduction to Windows Azure HDInsight
Jon Tupitza, Principal, Solution Architect
© 2014 JT IT Consulting, LLC

Agenda
- What is Big Data? …and how does it affect our business?
- What is Hadoop? …and how does it work?
- What is Windows Azure HDInsight? …and how does it fit into the Microsoft BI ecosystem?
- What tools are used to work with Hadoop & HDInsight?
- How do I get started using HDInsight?
What is Big Data?
Datasets that, due to their size and complexity, are difficult to store, query, and manage using existing data management tools or data processing applications.
- Volume (Size): An explosion in social media, mobile apps, digital sensors, RFID, GPS, and more has caused exponential data growth.
- Variety (Structure): Traditionally, BI has sourced structured data, but now insight must be extracted from unstructured data such as large text blobs, digital media, and sensor data.
- Velocity (Speed): Sources like social networking and sensor signals create data at a tremendous rate, making it a challenge to capture, store, and analyze that data in a timely or economical manner.
Key Trends Causing Data Explosion
- Device Explosion: 5.5 billion+ devices, with over 70% of the global population using them.
- Ubiquitous Connection: Web traffic to generate over 1.6 zettabytes of data by 2015.
- Social Networks: Over 2 billion users worldwide.
- Sensor Networks: Over 10 billion networked sensors.
- Cheap Storage: 1 MB once cost $1.00; today 1 MB costs 0.01 cent.
- Inexpensive Computing: 1980: 10 MIPS/Sec; 2005: 10M MIPS/Sec.
Big Data is Creating Big Opportunities!
- Big Data technologies are a top priority for most institutions, both corporate and government.
- Currently, 49% of CEOs and CIOs claim they are undertaking Big Data projects.
- Software is estimated to experience a 34% YOY compound growth rate: 4.6B by 2015.
- Services are estimated to experience a 39% YOY compound growth rate: 6.5B by 2015.
Uses for Big Data technologies
- Data Warehousing:
  - Operational Data: New User Registrations, Purchasing, New Product Offering
  - Data Exhaust: by-products like log files have a low density of useful data, but the data they do contain is very valuable if we can extract it at a low cost and in a timely manner.
- Devices and the Internet of Things: trillions of Internet-connected devices, GPS data, cell phone data, automotive engine performance data
- Collective Intelligence
- Social Analytics: What is the social sentiment for my product(s)?
- Live Data Feed/Search: How do I optimize my services based on weather or traffic patterns? How do I build a recommendations engine (Mahout)?
- Advanced Analytics: How do I better predict future outcomes?
What is Hadoop and How does it Work?
Implements a divide-and-conquer algorithm to achieve greater parallelism.
- Hadoop Distributed Architecture:
  - Storage Layer (HDFS)
  - Programming Layer (Map/Reduce)
- Schema on Read vs. Schema on Write:
  - Traditionally we have always brought the data to the schema and code.
  - Hadoop sends the schema and the code to the data.
  - We don't have to pay the cost or live with the limitations of moving the data: IOPS, network traffic.
[Diagram labels: Task Tracker, Name Node, Data Node, Map/Reduce Layer, HDFS Layer]
HDFS (Hadoop Distributed File System)
- Fault Tolerant:
  - Data is distributed across each Data Node in the cluster (like RAID 5).
  - Three copies of the data are stored in case of storage failures.
  - Data faults can be quickly detected and repaired due to data redundancy.
- High Throughput:
  - Favors batch over interactive operations to support streaming large datasets.
  - Data files are written once and then closed, never to be updated.
- Supports data locality:
  - HDFS facilitates moving the application code (query) to the data rather than moving the data to the application (schema on read).
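The redundancy idea on this slide can be illustrated with a small Python sketch. It is an assumption-laden toy: blocks are placed round-robin on three distinct nodes, which is not HDFS's actual placement policy, and the names `place_blocks`, `readable_blocks`, and the `dn1`..`dn4` node labels are hypothetical.

```python
REPLICATION = 3  # the slide states that three copies of the data are stored


def place_blocks(block_ids, data_nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes (simplified round-robin)."""
    if replication > len(data_nodes):
        raise ValueError("need at least as many data nodes as replicas")
    placement = {}
    for i, block in enumerate(block_ids):
        placement[block] = [
            data_nodes[(i + r) % len(data_nodes)] for r in range(replication)
        ]
    return placement


def readable_blocks(placement, failed_nodes):
    """Return the blocks that still have at least one replica on a live node."""
    return [
        block
        for block, nodes in placement.items()
        if any(node not in failed_nodes for node in nodes)
    ]


if __name__ == "__main__":
    layout = place_blocks(["b0", "b1", "b2", "b3"], ["dn1", "dn2", "dn3", "dn4"])
    # Even with two of the four nodes down, every block keeps a live replica.
    print(readable_blocks(layout, failed_nodes={"dn1", "dn2"}))
```

With four nodes and three replicas per block, any two simultaneous node failures still leave every block readable, which is the property the slide attributes to data redundancy.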
Map/Reduce
- Like the "Assembly" language of Hadoop / Big Data: very low level.
- Users can interface with Hadoop using higher-level languages like Hive and Pig.
- Schema on Read, not Schema on Write.
- Moving the Code to the Data:
  - First, store the data.
  - Second (Map function), move the programming to the data: load the code on each machine where the data already resides.
  - Third (Reduce function), collect statistics back from each of the machines.
Map/Reduce: Logical Process
Query: 1. SELECT WHERE Key=A  2. SUM Values of Each Property

Input on data node 1:
Key | Value
A | X=2, Y=3
B | X=1, Z=2
C | X=3
A | Y=1, Z=4

Input on data node 2:
Key | Value
E | Y=3
A | X=1, Z=2
A | Z=5
D | Y=2, Z=1

Map output from data node 1:
Key | Value
A | X=2, Y=3
A | Y=1, Z=4

Map output from data node 2:
Key | Value
A | X=1, Z=2
A | Z=5

Reduce output:
Key | Value
A | X=3, Y=4, Z=11

The MAP function runs on each data node, extracting data that matches the query. The REDUCE function runs on one node, combining the results from all the MAP components into the final result set.
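The logical process on this slide can be traced in plain Python. This is only a single-process sketch of the idea, not Hadoop's Map/Reduce API; the names `map_select`, `reduce_sum`, and `run_query` are hypothetical, and the two lists stand in for the two data nodes shown on the slide.

```python
from collections import Counter

# Records held by each data node, copied from the slide's example.
NODE_1 = [
    ("A", {"X": 2, "Y": 3}),
    ("B", {"X": 1, "Z": 2}),
    ("C", {"X": 3}),
    ("A", {"Y": 1, "Z": 4}),
]
NODE_2 = [
    ("E", {"Y": 3}),
    ("A", {"X": 1, "Z": 2}),
    ("A", {"Z": 5}),
    ("D", {"Y": 2, "Z": 1}),
]


def map_select(records, wanted_key):
    """Map step: runs against one node's records and keeps only matching rows."""
    return [props for key, props in records if key == wanted_key]


def reduce_sum(mapped_outputs):
    """Reduce step: sums each property across the outputs of all map steps."""
    totals = Counter()
    for node_output in mapped_outputs:
        for props in node_output:
            totals.update(props)  # adds each property's value to its running total
    return dict(totals)


def run_query(nodes, wanted_key):
    """Apply the map step per node, then combine everything in one reduce step."""
    return reduce_sum(map_select(node, wanted_key) for node in nodes)


if __name__ == "__main__":
    print(run_query([NODE_1, NODE_2], "A"))
```

Running the query for key `A` yields X=3, Y=4, Z=11, matching the reduce output shown on the slide.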
Yet Another Resource Negotiator (YARN)
- Second-generation Hadoop
- Extends capabilities beyond Map/Reduce and beyond batch processing
- Makes good on the promise of (near) real-time Big Data processing
[Diagram labels: Client, Resource Manager, Node Manager, App Master, Container]
Data Movement and Query Processing
RDBMSs require a schema to be applied when the data is written:
- The data is transformed to accommodate the schema.
- Some information hidden in the data may be lost at write time.
Hadoop/HDInsight applies a schema only when the data is read:
- The schema doesn't change the structure of the underlying data.
- The data is stored in its original (raw) format so that all hidden information is retained.
RDBMSs perform query processing in a central location:
- Data is moved from storage to a central location for processing.
- More central processing capacity is required to move data and execute the query.
Hadoop performs query processing at each storage node:
- Data doesn't need to be moved across the network for processing.
- Only a fraction of the central processing capacity is required to execute the query.
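The schema-on-read idea can be sketched in Python as follows. The raw log lines, field names, and the `read_with_schema` helper are all hypothetical and exist only for illustration; the point is that the stored text is never rewritten, and different schemas can be applied to the same raw data at read time.

```python
import csv
import io

# Raw data is stored exactly as it arrived; rows may have different shapes.
RAW_LOG = (
    "2014-03-01,alice,purchase,19.99\n"
    "2014-03-01,bob,login\n"
    "2014-03-02,alice,purchase,5.00\n"
)

# Two different schemas over the same raw data: (field name, type) pairs.
PURCHASE_SCHEMA = [("date", str), ("user", str), ("event", str), ("amount", float)]
ACTIVITY_SCHEMA = [("date", str), ("user", str)]


def read_with_schema(raw_text, schema):
    """Apply a schema while reading; fields missing from a row become None."""
    rows = []
    for fields in csv.reader(io.StringIO(raw_text)):
        row = {}
        for i, (name, cast) in enumerate(schema):
            row[name] = cast(fields[i]) if i < len(fields) else None
        rows.append(row)
    return rows


if __name__ == "__main__":
    for row in read_with_schema(RAW_LOG, PURCHASE_SCHEMA):
        print(row)
    for row in read_with_schema(RAW_LOG, ACTIVITY_SCHEMA):
        print(row)
```

A schema-on-write system would have had to decide up front what to do with the short `login` row; here that decision is deferred to whichever reader needs the data.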
Major Differences: RDBMS vs. Big Data
Feature | Relational Database | Hadoop / HDInsight
Data Types and Formats | Structured | Semi-Structured or Unstructured
Data Integrity | High: Transactional Updates | Low: Eventually Consistent
Schema | Static: Required on Write | Dynamic: Optional on Read & Write
Read and Write Pattern | Fully Repeatable Read/Write | Write Once; Repeatable Read
Storage Volume | Gigabytes to Terabytes | Terabytes, Petabytes and Beyond
Scalability | Scale Up with More Powerful HW | Scale Out with Additional Servers
Data Processing Distribution | Limited or None | Distributed Across Cluster
Economics | Expensive Hardware & Software | Commodity Hardware & Open Source Software
Solution Architecture: Big Data or RDBMS?
- Where will the source data come from? Perhaps you already have the data that contains the information you need, but you can't analyze it with your existing tools. Or is there a source of data you think will be useful, but you don't yet know how to collect it, store it, and analyze it?
- What is the format of the data? Is it highly structured, in which case you may be able to load it into your existing database or data warehouse and process it there? Or is it semi-structured or unstructured, in which case a Big Data solution that is optimized for textual discovery, categorization, and predictive analysis will be more suitable?
- What are the delivery and quality characteristics of the data? Is there a huge volume? Does it arrive as a stream or in batches? Is it of high quality, or will you need to perform some type of data cleansing and validation of the content?
- Do you want to combine the results with data from other sources? If so, do you know where this data will come from, how much it will cost if you have to purchase it, and how reliable this data is?
- Do you want to integrate with an existing BI system? Will you need to load the data into an existing database or data warehouse, or will you just analyze it and visualize the results separately?
HDInsight Data Storage
Azure Blob Storage System:
- The default storage system for HDInsight
- Enables you to persist your data even when you're not running an HDInsight cluster
- Enables you to leverage your data using HDInsight, Azure SQL Server Database & PDW
[Diagram labels: Hadoop Distributed File System (HDFS) API, Name Node, Data Node, Azure Storage (ASV), Azure Blob Storage, Front End, Partition Layer, Stream Layer]
Azure Storage Vault (ASV)
- The default file system for the HDInsight Service
- Provides scalable, persistent, sharable, highly available storage
- Fast data access for compute nodes residing in the same data center
- Addressable via: core.windows.net/path
- Requires storage key in core-site.xml: fs.azure.account.key.accountname keyvaluestring
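For readers unfamiliar with Hadoop configuration files, the storage-key setting above would appear in core-site.xml as a `<property>` element with a `<name>` and a `<value>`. The Python sketch below renders such an element using only the placeholder strings from the slide (`accountname`, `keyvaluestring`); the helper name `render_storage_key_property` is hypothetical, and the exact property name expected by a given HDInsight version is an assumption that should be checked against its documentation.

```python
import xml.etree.ElementTree as ET


def render_storage_key_property(account_name, key_value):
    """Build a Hadoop-style <property> element for the storage account key."""
    prop = ET.Element("property")
    # Property name follows the pattern shown on the slide (an assumption here).
    ET.SubElement(prop, "name").text = f"fs.azure.account.key.{account_name}"
    ET.SubElement(prop, "value").text = key_value
    return ET.tostring(prop, encoding="unicode")


if __name__ == "__main__":
    # Placeholder values taken verbatim from the slide; never print a real key.
    print(render_storage_key_property("accountname", "keyvaluestring"))
```

The resulting element belongs inside the `<configuration>` root of core-site.xml alongside the cluster's other properties.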
How do I get started with HDInsight?
1. Create a Windows Azure account (subscription)
2. Create an Azure Storage account
3. Create an Azure Blob Storage node
4. Provision an HDInsight Service cluster
5. Install Windows Azure PowerShell
6. Install Windows Azure HDInsight PowerShell
7. Set up the environment
Architectural Models
- Standalone Data Analysis & Visualization:
  - Experiment with data sources to discover if they provide useful information.
  - Handle data that can't be processed using existing systems.
- Data Transfer, Cleansing or ETL:
  - Extract and transform data before loading it into existing databases.
  - Categorize, normalize, and extract summary results to remove duplication and redundancy.
- Data Warehouse or Data Storage:
  - Create robust data repositories that are reasonably inexpensive to maintain.
  - Especially useful for storing and managing huge data volumes.
- Integrate with Existing EDW and BI Systems:
  - Integrate Big Data at different levels: EDW, OLAP, Excel PowerPivot.
  - Also, PDW enables querying HDInsight to integrate Big Data with existing dimension & fact data.
Resources
- PolyBase, David DeWitt: http://www.Microsoft.com/en-us/sqlserver/solutions-technologies/data-warehousing/polybase.aspx
- PDW Website: technologies/data-warehousing/pdw.aspx
- logs.technet.com/b/dataplatforminsider/archive/2013/04/25/insight-through-integration-sql-server parallel-data-warehouse-polybase-demo.aspx
Tools
- HDInsight clusters can be provisioned when needed and then de-provisioned without losing the data or metadata they have processed.
- Azure Storage Vault allows you to maintain that state, paying only for the storage and not the cluster(s).
- Since stream data often arrives in massive bursts, HDInsight can provide a buffer between that data generation and existing data warehouse/BI infrastructures.