Presentation is loading. Please wait.

Presentation is loading. Please wait.

4 New Insights through Big Data New World of Big Data & DW – Yet another ‘Hype’? 5 … data warehousing has reached the most significant tipping point.

Similar presentations


Presentation on theme: "4 New Insights through Big Data New World of Big Data & DW – Yet another ‘Hype’? 5 … data warehousing has reached the most significant tipping point."— Presentation transcript:

1

2

3

4 4 New Insights through Big Data

5 New World of Big Data & DW – Yet another ‘Hype’? 5 … data warehousing has reached the most significant tipping point since its inception. The biggest, possibly most elaborate data management system in IT is changing. – Gartner, “The State of Data Warehousing in 2012” Data sources ETLData warehouseBI and analytics

6 Impact on traditional DW 6 Data sources ETLData warehouseBI and analytics Increasing data volumes 1 Near real-time insights 3 Non-relational data New data sources and types 2 Cloud-born data 4

7 Data sources Non-relational data The modern Data Warehouse

8 When to use what?

9 2014 – 5% think Hadoop will replace existing DW solution (2013: 10%) ↓ 50% decline even with the presence of Hadoop 2.0

10 → primarily includes SQL over HDFS (!)

11 Challenges for a modern data warehouse Keep existing & legacy investment Limited scalability and ability to handle new data types Acquire Big Data solution Significant training and data silos Buy new tier-one hardware appliance High acquisition and migration costs Acquire business intelligence & tools Complex with low adoption

12 12 Microsoft’s Analytics Platform Systems (APS)

13 Introducing the Microsoft Analytics Platform System The turnkey modern data warehouse appliance Next-generation performance at scale Enterprise-ready Big Data Engineered for optimal value Relational and non-relational data in a single appliance Enterprise-ready Hadoop aka HDInsight Region (HDI) Integrated querying across HDI and PDW via PolyBase Integration with Microsoft BI/Excel & 3 rd party BI tools Near real-time performance with In-Memory Columnstore Ability to scale out to accommodate growing data Removal of data warehouse bottlenecks with MPP SQL Server Concurrency that fuels rapid adoption Industry’s lowest data warehouse appliance price per terabyte Value through a single appliance solution Value with flexible hardware options using commodity hardware

14 Core APS Features MPP and In-Memory Columnstore for best-in-class performance Store data in columnar format for massive compression Load data into or out of memory for next-generation performance with up to 60% improvement in data loading speed Updateable and clustered for real-time trickle loading Up to 100x faster queries Updateable clustered columnstore vs. table with customary indexing Up to 15x more compression Columnstore index representation C1C1 C3C3 C5C5 C4C4 C2C2 C6C6 Parallel query execution Query Results

15 What’s HDInsight in the appliance? o Hadoop bits run on same HW as PDW region o Separate appliance region – interconnected via Infiniband to PDW region o Hortonworks HDP 2.0 bits on Windows Server – including YARN execution framework o PolyBase certified with HDInsight on- premise Core APS Features HDInsight Region 15

16 1.Querying all of data sources through APS/PDW o allowing to query relational & non- relational stores (on-premise) with single query interface o On-premise data sources include not only Hadoop, but also RDBMSs or document-oriented stores* 2.APS/PDW as cloud-attach for Azure o allowing to scale storage & leverage cloud compute (SQL DB, Azure HDI, Azure DocDB, Azure ML, Azure Search) o Trickle effect (data migration from on-premise to Azure) IBM DB2 Oracle SQL Server Your Apps APS with PolyBase On-premise Hadoop 1. Azure Storage Microsoft Azure Your Apps 2. Core APS Features PolyBase for the (Big) Data Continuum

17 17 Split-based query processing Best-In- Class Performance Supporting various Hadoop distributions Openness & Flexibility Integration with Microsoft Azure Hybrid Seamless with Microsoft BI & Third-Party BI tools Existing skill sets & tools Mature T-SQL language surface Preserving T-SQL Semantics

18 18

19 19 PolyBase – Building Blocks & Functionality

20 Microsoft APS w/ PolyBase APS control & data nodes Social Apps Sensor &RFID Mobile Apps Web Apps Data Scientists, BI Users, DB Admins Your Apps PowerPivot PowerView PolyBase Split-Based Query Processing External Table External Data Source External File Format Hadoop Relational DW

21 Your Apps APS with PolyBase On-premise Hadoop Indefinite Azure Storage Microsoft Azure Azure Express Route Key principle - Leverage computation power of external Hadoop clusters o Transparent pushing of computation into Hadoop on-the-fly o No map/reduce or Hadoop skills needed o Reducing data volume to be transferred Generalizing key principle in future releases o Apply for all external data sources o Azure for indefinite storage + as appliance extension by making use of Azure compute o Leveraging compute of other data sources - relational data sources or NoSQL platforms Extensible approach o Ideas to plug-in other push-down engines in future o Further performance improvements in near future

22 PolyBase Query Use Cases PolyBase query scenarios (SELECT) 1.Run T-SQL over HDFS/Hadoop data 2.Combine data from different Hadoop clusters (e.g. production & dev/test) via PolyBase 3.Join relational data with non-relational data in HDFS/Hadoop Customer Value o Ease-of-use & Improved Time-To-Insights ― Build the data lake w/o heavily investing into new resources, i.e. Java & map/reduce experts ― Leverage familiar & mature T-SQL scripts and constructs ― Seamless tool integration w/ PolyBase Hadoop Microsoft APS w/ PolyBase Your Apps T-SQL Hadoop Dev/Test

23 PolyBase Import & Export Use Cases PolyBase ETL scenarios (CTAS/CETAS) o Storing subset of Hadoop in PDW for frequent access, i.e. in SQL Server’s columnar format ( → updatable CCI) o Enabling data aging scenarios to more economic storage ( → ‘data lake’) Customer Value ― Avoids the need of maintaining a separate import or export utility ― Allows building multi-temperature DW platforms o PDW/APS acts as hot query engine processing most recent/relevant data sets o Aged data immediately accessible via external tables o No need for deleting any data anymore Hadoop Microsoft APS w/ PolyBase Your Apps T-SQL

24 PolyBase for Hybrid Use Cases (Azure Integration) APS w/ PolyBase Your Apps T-SQL Azure Storage Your Apps APS as Azure-attach o Azure as cheap storage for keeping your data forever o Mesh-up on-premise and cloud apps on your own terms o PolyBase as bridge between on-premise and Azure Customer Value o Indefinite storage and compute ― Azure as extension for your on-premise data assets o Cloud transition on your own terms ― Move only subsets of on-premise data, e.g. non-sensitive data o Leverage new Azure data services ― Reduced capex & availability of new emerging data services in Azure for on-premise focused users

25 PolyBase for Round-Trip Scenarios APS w/ PolyBase Your Apps T-SQL Azure Storage Your Apps Hadoop Dev/Test PolyBase RoundTrip Scenario (CETAS) o Processing non-relational data with relational data and exporting it to your Big Data platform of choice Customer Value ― Total freedom & flexibility o User can decide where to store the results of querying different data sets ― Simplicity o Example: 1 T-SQL statement for a)querying different Hadoop data sets, b)combining with relational data and c)storing results in Azure or in a different Hadoop cluster

26 26

27 27 Real-world customer production use cases with PolyBase

28 ‘ We love Open-Source, but we love SQL Server Analysis Services (even more)’ o SQL Server SMP & Hadoop cluster Pain Points with current solution 1.‘We are currently using a mix of SQOOP & BCP.’ 2.The integration with SQL Server should be better, simpler & as fast as possible’ !s 3.‘Take a definition of an aggregate in Hive and map it through SQL DDL’! 4.‘Loading data from Hadoop to SQL must be easier’. 5.‘Hadoop is hard to deploy and maintain'. insights

29

30 ‘Pay-as-you-drive’ - Driver Discount & Policy adjustment o SQL Server 2012 PDW & HDP 2.0/2.1 on Linux What they want to accomplish 1.‘We have increasing set of non-relational data – sensor data & web data which we want to store in Hadoop.’ 2.‘We want to get answers to our complex questions fast’. 3.‘We don’t want to learn a new tool or programming language or paradigm to get the answers’. 4.‘We know how to use Microsoft tools and services’. 5.‘We don’t want to maintain a complex infrastructure – we trust Microsoft & SQL Server to help us’.

31 Investing into Online Shopping website (‘Korea’s Amazon’) o SQL Server 2012 PDW & HDP 1.3/HDP 2.0 on Linux What they want to accomplish 1.‘We want perform complex data mining on customer purchase data – basket analysis’. 2.‘We want to understand the social media data (reviews/Twitter) – specifically around our products & stores’. 3.‘We will use Hadoop to keep all of our data ~ envisioned to be around 480 TB. PDW will be the efficient analysis engine for the hot & more recent data’. 4.‘PDW & PolyBase are much faster than Hive’. 5. ‘We’re interested in using data mining cloud services in Azure (hybrid scenarios)’

32 HDP on Linux APS/PDW EDW Analytic information (right customer targeting) Campaign Recommendation engine & personalized advertising Online Shopping Mall SSG.com (renewal) Recent/hot data stored in PDW Solution Architecture (Details) – ShinSeGae PolyBase Queries raw/cold/ warm data Complex Event Processing (Storm) Message Queues (KAFKA, Open source) Tracking Log Servers Web log data(160GB/daily) – External Tables A, B, C 1. Unstructured/semi-structured text data - External Tables D, E, F Text (Board/SNS/ Internal Text ) Weather.. 2. Company emails – External Tables G, H, I 3. Mails EIS OLAP (Tabular) DATA Mining Visualization (Silverlight) BI analyst Operational Data Store 10 GB Ethernet

33 What they want to accomplish o On-premise: 15-20 HDP Hadoop cluster + 1 Rack PDW o How they want using Azure: 1.Cost-optimal backup for Hadoop data [WASB] 2.Scale-out environment to spin-up more compute on demand at busy times 3.Platform to leverage new data services, in particular Azure HDI and Azure Machine Learning o Daily - 50 GB compressed data into Azure o Either planning to use custom API or PolyBase CETAS o Planning to use PolyBase to combine data from on-premise and Azure APS with scale-out PDW region Your Apps T-SQL Azure Storage Your Apps

34 ‘Understanding network quality’ o SQL Server 2012 PDW & Hadoop on Linux What they want to accomplish 1.‘We collect millions of network records for quality assessment and capacity planning – on a daily basis’. 2.‘Hadoop will be used for storage and ETL of these network record files’. 3.‘PDW for more operational analysis, ad-hoc analysis, operational reports’. 4.‘We are using Polybase along with Oozie-based orchestration for a seamless & automated integration’.

35 Hadoop cluster (18+ servers) APS/PDW EDW Hot operational PDW data Solution Architecture (Details) – Telcom PolyBase Queries raw/cold data (Petabyte of network logs) High-frequency Event Processing (Network logs) Capturing Network logs (>300 GB/per day) – External Tables A, B, C Network quality analysis Capacity Planning Visualization (PowerPivot) BI analyst/Planner/ Decision-maker Operational Data Store Infiniband Oozie Workflows Remote procedure calls via stored procedures to trigger PolyBase queries HCatalog Usage of Hive’s Metadata stores

36 ‘Analyzing oil drilling rigs’ o SQL Server 2012 PDW & external Hadoop What they want to accomplish 1.‘Each drilling rig is equipped with sensor devices and we want to store the incoming sensor data into Hadoop’. 2.‘We want to use PolyBase to access these sensor data’. 3.‘We want to monitor our rigs in near real-time’. 4.‘We want to conduct predictive analysis – which rigs are running poorly, which rigs show similar symptoms’.

37

38 ‘Bridging the gap between cloud & on-prem’ o SQL Server 2012 PDW & Azure HDInsight What they want to accomplish 1.‘HDInsight/Hadoop in the cloud to store and massage our raw data (XML files) generated by our web-application’. 2.‘PDW to keep the data on-prem (legal requirement) and to have an efficient query engine for analysis purposes’. 3.‘PolyBase is a great way of accessing our files in the cloud via simple T-SQL.’. 4.‘With this solution, we can allow web users to quickly ask questions while the heavy, more complex business analysis is accomplished by PDW users’.

39 Microsoft BI stack IBM Cognos Solution Architecture (Details) – Government PolyBase Queries APS/PDW EDW PDW/APS for fast query response & data processing of hot data Operational Data Store Public Internet or Azure Express Route Transforming to large text files ~ 10 GBs each (External Tables) HDI on Azure cheap data store – alternative to Hadoop on-prem solution Azure Blob Storage Web Application for Tax Filing (e-invoice) Web apps- Generating tons of smaller XML files (~7KB each) Other Web Feeds HDI tools for data transformation

40

41 ‘What are you drinking? Why is the machine down’? o SQL Server/APS with PDW & HDI region What they want to accomplish 1.‘We want a complete solution stack – we do not have Hadoop experts in-house and don’t have the money to get it’. 2.‘We want to store all raw data coming from vending machines into Hadoop’. 3.‘360 degree of all our data – structured customer data & unstructured data coming from vending machines’. 4.‘Predicate maintenance of machines’.

42 What they want to accomplish o Offering data management services to other Microsoft groups o Cookie scoring for msn.com o 3TB of data for each analysis o Added PolyBase for Reporting & PowerQuery o Previously – o Analysis solely done via HiveQL o Results are writing back in SQL Server for reporting o PowerQuery via HiveODBC driver ‘Data Management Services for Microsoft Teams ’ o APS with PDW & HDInsight region

43 Solution Architecture (Details) – Internal Microsoft Data Scientist PowerQuery/Power View/PowerMap Data scientist group 2 - Using Polybase for existing tooling (T-SQL, BI tools) performing processing of complex analytical queries & consistent management experience Full Rack PDW Polybase Queries PDW region HDI region 1 scale unit HDI region Secure Gateway & AD Integration msn.com – Log files Analyzing ~3 TB Web Traffic Microsoft servers – Log files Data scientist group 1 - using chaining of Hive queries & PowerQuery via HiveODBC Hive & PowerQuery via Hive ODBC Analytical queries via SSDT APS with PDW & HDI region System Center & AdminConsole Infiniband

44 44

45

46

47 www.microsoft.com/learning http://developer.microsoft.com http://microsoft.com/technet http://channel9.msdn.com/Events/TechEd

48

49


Download ppt "4 New Insights through Big Data New World of Big Data & DW – Yet another ‘Hype’? 5 … data warehousing has reached the most significant tipping point."

Similar presentations


Ads by Google