Introduction to Windows Azure HDInsight

Introduction to Windows Azure HDInsight
Jon Tupitza, Principal Solution Architect
jon.tupitza@hotmail.com

Agenda
What is Big Data? …and how does it affect our business?
What is Hadoop? …and how does it work?
What is Windows Azure HDInsight? …and how does it fit into the Microsoft BI ecosystem?
What tools are used to work with Hadoop & HDInsight?
How do I get started using HDInsight?
© 2014 JT IT Consulting, LLC

What is Big Data?
Datasets that, due to their size and complexity, are difficult to store, query, and manage using existing data management tools or data processing applications.
Volume (Size): The explosion in social media, mobile apps, digital sensors, RFID, GPS, and more has caused exponential data growth.
Variety (Structure): Traditionally BI has sourced structured data, but now insight must be extracted from unstructured data like large text blobs, digital media, sensor data, etc.
Velocity (Speed): Sources like social networking and sensor signals create data at a tremendous rate, making it a challenge to capture, store, and analyze that data in a timely or economical manner.
© 2014 JT IT Consulting, LLC

Key Trends Causing the Data Explosion
Device Explosion: 5.5 billion+ devices, with over 70% of the global population using them. Consumerization of IT: 4.3 connected devices per adult.
Social Networks: Over 2 billion users worldwide; 27% using social media input.
Cheap Storage: In 1990, 1 MB cost $1.00; today, 1 MB costs .01 cent.
Ubiquitous Connection: Web traffic to generate over 1.6 zettabytes of data by 2015.
Sensor Networks: Over 10 billion networked sensors.
Inexpensive Computing: 1980: 10 MIPS/sec; 2005: 10M MIPS/sec.
Data Explosion: 10x increase every five years; 85% of data is from new data types.
© 2014 JT IT Consulting, LLC

Big Data is Creating Big Opportunities!
Big Data technologies are a top priority for most institutions, both corporate and government.
Currently, 49% of CEOs and CIOs claim they are undertaking Big Data projects.
Software is estimated to experience a 34% YOY compound growth rate: $4.6B by 2015.
Services are estimated to experience a 39% YOY compound growth rate: $6.5B by 2015.
Many organizations already use data to improve decision making through existing BI solutions that analyze data generated by business activities and applications, and create reports based on this analysis. Rather than seeking to replace traditional BI solutions, HDInsight provides a way to extend the value of your investment in BI by enabling you to incorporate a much wider variety of data sources that complement and integrate with existing data warehouses, analytical data models, and business reporting solutions.
© 2014 JT IT Consulting, LLC

Uses for Big Data Technologies
Data Warehousing:
Operational data: new user registrations, purchasing, new product offerings.
Data exhaust: by-products like log files have a low density of useful data, but the data they do contain is very valuable if we can extract it at a low cost and in a timely manner.
Devices and the Internet of Things: trillions of Internet-connected devices; GPS data; cell phone data; automotive engine performance data.
Collective Intelligence:
Social analytics: What is the social sentiment for my product(s)?
Live data feed/search: How do I optimize my services based on weather or traffic patterns? How do I build a recommendations engine (Mahout)?
Advanced analytics: How do I better predict future outcomes?
A Big Data query can be used to generate a result set that is then stored in a relational database for use in the generation of BI, or as input to another process. Big Data is also a valuable tool when you need to handle data that is arriving very quickly, and which you can process later. You can dump the data into the Big Data storage cluster in its original format, and then process it on demand using a Big Data query that extracts the required result set and stores it in a relational database, or makes it available for reporting.
© 2014 JT IT Consulting, LLC

What is Hadoop and How Does it Work?
Implements a divide-and-conquer algorithm to achieve greater parallelism.
Hadoop distributed architecture:
Storage layer (HDFS)
Programming layer (Map/Reduce)
Schema on read vs. schema on write:
Traditionally we have always brought the data to the schema and code; Hadoop sends the schema and the code to the data.
We don't have to pay the cost or live with the limitations of moving the data: IOPs, network traffic.
[Diagram: Map/Reduce layer (Task Tracker) running over the HDFS layer (Name Node, Data Nodes)]
© 2014 JT IT Consulting, LLC

HDFS (Hadoop Distributed File System)
Fault tolerant:
Data is distributed across each data node in the cluster (like RAID 5).
Three copies of the data are stored in case of storage failures.
Data faults can be quickly detected and repaired due to data redundancy.
High throughput:
Favors batch over interactive operations to support streaming large datasets.
Data files are written once and then closed, never to be updated.
Supports data locality: HDFS facilitates moving the application code (query) to the data rather than moving the data to the application (schema on read).
© 2014 JT IT Consulting, LLC
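The three-way replication described above can be sketched as a toy placement routine. This is illustrative only: real HDFS placement is rack-aware, and the function below is a hypothetical helper, not part of any Hadoop API.

```python
import itertools

def place_replicas(blocks, data_nodes, replication=3):
    """Assign each block to `replication` distinct data nodes, round-robin.

    A simplification of HDFS-style replication: as long as the replication
    factor does not exceed the node count, consecutive picks from the ring
    are guaranteed to be distinct nodes.
    """
    if replication > len(data_nodes):
        raise ValueError("not enough data nodes for the replication factor")
    ring = itertools.cycle(data_nodes)
    placement = {}
    for block in blocks:
        # Take the next `replication` nodes from the ring for this block.
        placement[block] = [next(ring) for _ in range(replication)]
    return placement

placement = place_replicas(["blk_1", "blk_2"], ["node1", "node2", "node3", "node4"])
print(placement["blk_1"])  # ['node1', 'node2', 'node3']
```

Because every block lives on three nodes, losing any single node leaves at least two readable copies, which is what lets HDFS detect and repair faults from redundancy alone.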

Map/Reduce
Like the "assembly language" of Hadoop / Big Data: very low level.
Users can interface with Hadoop using higher-level languages like Hive and Pig.
Schema on read, not schema on write.
Moving the code to the data:
First, store the data.
Second (Map function), move the programming to the data: load the code on each machine where the data already resides.
Third (Reduce function), collect statistics back from each of the machines.
© 2014 JT IT Consulting, LLC

Map/Reduce: Logical Process
Example: the cluster stores records keyed A through E, each holding property values such as X=2, Y=3.
1. SELECT WHERE Key=A (Map): each data node extracts its records with key A, e.g. {X=2, Y=3}, {X=1, Z=2}, {Z=5}, {Y=1, Z=4}.
2. SUM the values of each property (Reduce): the matching records are combined into the final result: Key A → X=3, Y=4, Z=11.
The MAP function runs on each data node, extracting data that matches the query. The output from each Map task is stored in a common buffer, sorted, and then passed to one or more Reduce tasks; depending on the configuration of the query job, there may be more than one Reduce task running. Intermediate results are stored in the buffer until the final Reduce task combines them all into the final result set.
© 2014 JT IT Consulting, LLC
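The logical process above can be simulated in a few lines of Python. This is a sketch of the Map and Reduce steps, not actual Hadoop code; the record set mirrors the slide's example data.

```python
from collections import defaultdict

# Records as (key, {property: value}) pairs, spread across the cluster.
records = [
    ("A", {"X": 2, "Y": 3}),
    ("B", {"X": 1, "Z": 2}),
    ("C", {"X": 3}),
    ("A", {"Y": 1, "Z": 4}),
    ("E", {"Y": 3}),
    ("A", {"X": 1, "Z": 2}),
    ("A", {"Z": 5}),
    ("D", {"Y": 2, "Z": 1}),
]

def map_phase(partition, wanted_key):
    """MAP: runs on each data node, emitting only records matching the query."""
    return [props for key, props in partition if key == wanted_key]

def reduce_phase(mapped_outputs):
    """REDUCE: combines the per-node Map outputs, summing each property."""
    totals = defaultdict(int)
    for props in mapped_outputs:
        for name, value in props.items():
            totals[name] += value
    return dict(totals)

# Simulate two data nodes each running MAP locally, then one final REDUCE.
node1, node2 = records[:4], records[4:]
mapped = map_phase(node1, "A") + map_phase(node2, "A")
print(reduce_phase(mapped))  # {'X': 3, 'Y': 4, 'Z': 11}
```

Note that each node only ships its (small) Map output across the network; the full record set never moves, which is the point of sending the code to the data.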

YARN (Yet Another Resource Negotiator)
Second-generation Hadoop.
Extends capabilities beyond Map/Reduce, beyond batch processing.
Makes good on the promise of (near) real-time Big Data processing.
[Diagram: clients submit work to the Resource Manager, which launches a per-application App Master; Node Managers on each machine host the application's Containers.]
© 2014 JT IT Consulting, LLC

Data Movement and Query Processing
An RDBMS requires a schema to be applied when the data is written:
The data is transformed to accommodate the schema.
Some information hidden in the data may be lost at write time.
Hadoop/HDInsight applies a schema only when the data is read:
The schema doesn't change the structure of the underlying data.
The data is stored in its original (raw) format, so all hidden information is retained.
An RDBMS performs query processing in a central location:
Data is moved from storage to a central location for processing.
More central processing capacity is required to move the data and execute the query.
Hadoop performs query processing at each storage node:
Data doesn't need to be moved across the network for processing.
Only a fraction of the central processing capacity is required to execute the query.
Big Data solutions are optimized for storing vast quantities of data using simple file formats and highly distributed storage mechanisms. Each distributed node is also capable of executing parts of the queries that extract information. Whereas a traditional database system would need to collect all the data from storage and move it to a central location for processing, with the consequent limitations of processing capacity and network latency, Big Data solutions perform the initial processing of the data at each storage node. Modern data warehouse systems typically use high-speed fiber networks, and in-memory caching and indexes, to minimize data transfer delays. In a Big Data solution, however, only the results of the distributed query processing are passed across the cluster network to the node that will assemble them into a final result set. Performance during the initial stages of the query is limited only by the speed and capacity of connectivity to the co-located disk subsystem, and this initial processing occurs in parallel across all of the cluster nodes.
© 2014 JT IT Consulting, LLC
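The schema-on-write vs. schema-on-read contrast can be illustrated with a small sketch. The helper functions and the JSON-formatted log line are hypothetical, invented here for illustration.

```python
import json

# Schema on write (RDBMS-style): the record is reshaped to fit a fixed
# schema when stored; fields the schema doesn't declare are discarded.
def store_schema_on_write(raw_line, schema=("user", "action")):
    fields = json.loads(raw_line)
    return {col: fields.get(col) for col in schema}

# Schema on read (Hadoop-style): store the raw line untouched and apply
# whatever projection each query needs at read time.
def query_schema_on_read(raw_lines, columns):
    return [{col: json.loads(line).get(col) for col in columns}
            for line in raw_lines]

raw = '{"user": "ann", "action": "login", "latency_ms": 420}'
print(store_schema_on_write(raw))                            # latency_ms is lost
print(query_schema_on_read([raw], ["user", "latency_ms"]))   # still recoverable
```

The write-time version can never answer a latency question, because that field was dropped before storage; the read-time version can, because the raw data kept everything.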

Major Differences: RDBMS vs. Big Data
Feature: Relational Database / Hadoop-HDInsight
Data types and formats: Structured / Semi-structured or unstructured
Data integrity: High (transactional updates) / Low (eventually consistent)
Schema: Static, required on write / Dynamic, optional on read and write
Read and write pattern: Fully repeatable read/write / Write once, repeatable read
Storage volume: Gigabytes to terabytes / Terabytes, petabytes, and beyond
Scalability: Scale up with more powerful hardware / Scale out with additional servers
Data processing distribution: Limited or none / Distributed across the cluster
Economics: Expensive hardware and software / Commodity hardware and open-source software
Relational databases are typically optimized for fast and efficient query processing using Structured Query Language. Big Data solutions are optimized for reliable storage of vast quantities of data. The typically unstructured nature of the data, the lack of predefined schemas, and the distributed nature of the storage in Big Data solutions often preclude major optimization for query performance. Unlike SQL queries, which can use intelligent optimization techniques to maximize query performance, Big Data queries typically require an operation similar to a table scan. Big Data queries are batch operations that may take some time to execute. It's possible to perform real-time queries, but typically you will run the query and store the results for use within your existing BI tools and analytics systems.
© 2014 JT IT Consulting, LLC

Solution Architecture: Big Data or RDBMS?
Perhaps you already have the data that contains the information you need, but you can't analyze it with your existing tools. Or is there a source of data you think will be useful, but you don't yet know how to collect it, store it, and analyze it?
Where will the source data come from? Is it highly structured, in which case you may be able to load it into your existing database or data warehouse and process it there? Or is it semi-structured or unstructured, in which case a Big Data solution that is optimized for textual discovery, categorization, and predictive analysis will be more suitable?
What are the format, delivery, and quality characteristics of the data? Is there a huge volume? Does it arrive as a stream or in batches? Is it of high quality, or will you need to perform some type of data cleansing and validation of the content?
Do you want to combine the results with data from other sources? If so, do you know where this data will come from, how much it will cost if you have to purchase it, and how reliable it is?
Do you want to integrate with an existing BI system? Will you need to load the data into an existing database or data warehouse, or will you just analyze it and visualize the results separately?
Big Data solutions are primarily suited to situations where:
You have large volumes of data to store and process.
The data is in a semi-structured or unstructured format, often as text files or binary files.
The data is not well categorized; for example, similar items are described using different terminology, such as variations in city, country, or region names, and there is no obvious key value.
The data arrives rapidly as a stream, or in large batches that cannot be processed in real time, and so must be stored efficiently for processing later as a batch operation.
The data contains a lot of redundancy or duplication.
The data cannot easily be processed into a format that suits existing database schemas without risking loss of information.
You need to execute complex batch jobs on a very large scale, so running the jobs in parallel is necessary.
You want to be able to easily scale the system up or down on demand.
You don't actually know how the data might be useful, but you suspect that it will be, either now or in the future.
© 2014 JT IT Consulting, LLC

What is Windows Azure HDInsight?
Windows Azure: Microsoft's online storage and compute services.
HDInsight: Microsoft's implementation of Apache Hadoop (Hortonworks Data Platform) as an online service.
Makes Apache Hadoop readily available to the Windows community.
Enables Windows Azure subscribers to quickly and easily provision an HDInsight cluster across Windows Azure's pool of storage and compute resources, and to quickly de-provision those clusters when they're not needed.
Allows subscribers to continuously store their data for later use.
Exposes Apache Hadoop services to the Microsoft programming ecosystem:
SQL Server: Analysis Services
PowerShell and the cross-platform command-line tools
Visual Studio: CLR (C#, F#, etc.)
JavaScript
ODBC / JDBC / REST API
Excel Self-Service BI (SSBI): Power Query, PowerPivot, Power View, and Power Map
© 2014 JT IT Consulting, LLC

HDInsight Data Storage
Azure Blob Storage: the default storage system for HDInsight.
Enables you to persist your data even when you're not running an HDInsight cluster.
Enables you to leverage your data using HDInsight, Azure SQL Database, and PDW.
[Diagram: the HDFS API (Name Node, Data Nodes) layered over Azure Blob Storage's front-end, partition, and stream layers (ASV).]
The Data Cluster: Big Data solutions use a cluster of servers to store and process the data. Each member server in the cluster is called a data node, and contains a data store and a query execution engine. The cluster is managed by a server called the name node, which has knowledge of all the cluster servers and the files stored on each one. The name node server does not store any data, but is responsible for allocating data to the other cluster members and keeping track of the state of each one by listening for heartbeat messages. To store incoming data, the name node server directs the client to the appropriate data node server. The name node also manages replication of data files across all the other cluster members, which communicate with each other to replicate the data. The data is divided into chunks, and three copies of each data file are stored across the cluster servers in order to provide resilience against failure and data loss (the chunk size and the number of replicated copies are configurable for the cluster).
© 2014 JT IT Consulting, LLC
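The name node's heartbeat-based liveness tracking described above can be modeled with a toy class. This is purely a sketch under invented names, not HDInsight's or Hadoop's implementation.

```python
class NameNode:
    """Toy model of heartbeat-based liveness tracking (illustrative only)."""

    def __init__(self, timeout_secs=30):
        self.timeout = timeout_secs
        self.last_heartbeat = {}  # data node -> time of its last heartbeat

    def heartbeat(self, node, now):
        # Each data node periodically reports in; record when it did so.
        self.last_heartbeat[node] = now

    def live_nodes(self, now):
        # A node is live if it reported within the timeout window; blocks
        # held only on dead nodes would be re-replicated elsewhere.
        return [n for n, t in self.last_heartbeat.items()
                if now - t <= self.timeout]

nn = NameNode(timeout_secs=30)
nn.heartbeat("data-node-1", now=0)
nn.heartbeat("data-node-2", now=0)
nn.heartbeat("data-node-1", now=25)
print(nn.live_nodes(now=40))  # ['data-node-1']  (node 2 missed its window)
```

The real name node reacts to a missed heartbeat by scheduling new replicas of the affected blocks, which is why the three-copy invariant survives node failures.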

Azure Storage Vault (ASV)
The default file system for the HDInsight Service.
Provides scalable, persistent, shareable, highly available storage.
Fast data access for compute nodes residing in the same data center.
Addressable via: asv[s]://<container>@<account>.blob.core.windows.net/<path>
Requires the storage key in core-site.xml:
<property>
  <name>fs.azure.account.key.accountname</name>
  <value>keyvaluestring</value>
</property>
© 2014 JT IT Consulting, LLC

The Zoo: HDInsight / Hadoop Ecosystem
[Diagram: the ecosystem stack, including HCatalog (metadata), Pegasus (graph), RHadoop (stats processing), Active Directory (security), Oozie (pipeline/workflow), Pig (scripting), Hive (query), Mahout (machine learning), HBase (NoSQL database), Map/Reduce (distributed processing), Parallel Data Warehouse (PDW) / PolyBase, Business Intelligence (Excel, Power View, SSAS), data integration (ODBC / Sqoop / REST), HDFS (distributed storage), System Center (monitoring and deployment), Flume (log-file aggregation), C# and F# (.NET), JavaScript, SQL Server (relational), Azure Storage Vault (ASV), and the world's data via the Azure Data Marketplace.]
1. The core of Hadoop: distributed storage and compute.
2. A layer of abstraction between you and the native Java Map/Reduce jobs that enables metadata cataloging with HCatalog, T-SQL-like queries using Hive, and scripting with Pig.
3. Oozie enables workflows, Flume enables the aggregation of log files, and the NoSQL database HBase enables analytics.
4. Predictive analytics with machine learning using Mahout, highly customizable analytic "R" language queries using RHadoop, and graph mining with Pegasus that enables you to map relationships in social networking data.
5. Tie it all together with the data integration capabilities of ODBC, Sqoop, and the REST services API.
6. Microsoft differentiates its Hadoop offering by integrating with PDW (PolyBase).
7. Active Directory and System Center offer tight authentication and authorization integration, along with the automation needed to provision and de-provision clusters at will.
© 2014 JT IT Consulting, LLC

Programming HDInsight
Since HDInsight is a service-based implementation, you get immediate access to the tools you need to program against HDInsight/Hadoop:
Existing ecosystem: Hive, Pig, Sqoop, Mahout, Cascading, Scalding, Scoobi, Pegasus, etc.
.NET: C# and F# Map/Reduce, LINQ to Hive, .NET management clients, etc.
JavaScript: JavaScript Map/Reduce, browser-hosted console, Node.js management clients.
DevOps/IT Pros: PowerShell, cross-platform CLI tools.
© 2014 JT IT Consulting, LLC

Microsoft's Vision
To provide insight to users by activating new types of data…
Broader access: easy installation of Hadoop on Windows; simplified programming via integration with .NET and JavaScript; integration with SQL Server data warehouses.
Enterprise ready: choice of deployment on Windows Server or Windows Azure; integration with other Windows components like Active Directory and System Center.
Insights: integration with the Microsoft BI stack: SQL Server, SharePoint, Excel & Power BI.
And contribute back to the community distribution of Hadoop.
© 2014 JT IT Consulting, LLC

How Do I Get Started with HDInsight?
Create a Windows Azure account (subscription).
Create an Azure Storage account.
Create an Azure Blob Storage container.
Provision an HDInsight Service cluster.
Set up the environment:
Install Windows Azure PowerShell.
Install Windows Azure HDInsight PowerShell.
© 2014 JT IT Consulting, LLC

Architectural Models
Standalone data analysis and visualization: experiment with data sources to discover whether they provide useful information; handle data that can't be processed using existing systems.
Data transfer, cleansing, or ETL: extract and transform data before loading it into existing databases; categorize, normalize, and extract summary results to remove duplication and redundancy.
Data warehouse or data storage: create robust data repositories that are reasonably inexpensive to maintain; especially useful for storing and managing huge data volumes.
Integrate with existing EDW and BI systems: integrate Big Data at different levels: EDW, OLAP, Excel PowerPivot. PDW also enables querying HDInsight to integrate Big Data with existing dimension and fact data.
© 2014 JT IT Consulting, LLC

Resources
http://www.windowsazure.com
http://hadoopsdk.codeplex.com
http://nuget.org/packages?q=hadoop
PolyBase (David DeWitt): http://www.microsoft.com/en-us/sqlserver/solutions-technologies/data-warehousing/polybase.aspx
PDW website: http://microsoft.com/en-us/sqlserver/solutions-technologies/data-warehousing/pdw.aspx
http://blogs.technet.com/b/dataplatforminsider/archive/2013/04/25/insight-through-integration-sql-server-2012-parallel-data-warehouse-polybase-demo.aspx
© 2014 JT IT Consulting, LLC

Tools http://azurestorageexplorer.codeplex.com © 2014 JT IT Consulting, LLC

Summary
HDInsight clusters can be provisioned when needed and then de-provisioned without losing the data or metadata they have processed. Azure Storage Vault allows you to maintain that state, paying only for the storage and not the cluster(s). Since stream data often arrives in massive bursts, HDInsight can provide a buffer between that data generation and existing data warehouse/BI infrastructures.
© 2014 JT IT Consulting, LLC