Megabytes Gigabytes Terabytes Petabytes Purchase detail Purchase record Payment record Purchase detail Purchase record Payment record ERP CRM.

Increasing Data Variety and Complexity (data volume: Megabytes, Gigabytes, Terabytes, Petabytes)
ERP: Purchase detail, Purchase record, Payment record
CRM: Offer details, Support Contacts, Customer Touches, Segmentation
WEB: Web logs, Offer history, A/B testing, Dynamic Pricing, Affiliate Networks, Search Marketing, Behavioral Targeting, Dynamic Funnels
BIG DATA: User Generated Content, Mobile Web, SMS/MMS, Sentiment, External Demographics, HD Video, Audio, Images, Speech to Text, Product/Service Logs, Social Interactions & Feeds, Business Data Feeds, User Click Stream, Sensors / RFID / Devices, Spatial & GPS Coordinates
Transactions + Interactions + Observations = BIG DATA

APPLICATIONS: Business Analytics, Custom Applications, Packaged Applications
DATA SYSTEM: REPOSITORIES (RDBMS, EDW, MPP)
SOURCES: Existing Sources (CRM, ERP, Clickstream, Logs); OLTP, ERP, CRM Systems; Unstructured documents, emails; Clickstream; Server logs; Sentiment, Web Data; Sensor, Machine Data; Geolocation
Source: IDC. 2.8 ZB in 2012; 85% from New Data Types; 15x Machine Data by 2020; 40 ZB by 2020

OPERATIONS TOOLS: Provision, Manage & Monitor
DEV & DATA TOOLS: Build & Test
APPLICATIONS: Business Analytics, Custom Applications, Packaged Applications
DATA SYSTEM: REPOSITORIES (RDBMS, EDW, MPP); Governance & Integration, Security, Operations, Data Access, Data Management
SOURCES: OLTP, ERP, CRM Systems; Documents, Emails; Web Logs, Click Streams; Server logs; Social Networks; Sentiment, Web Data; Machine Generated; Sensor Data; Geolocation Data

Hortonworks Data Platform (HDP)
The Only Completely Open Distribution for Apache Hadoop
Fundamentally Versatile and Comprehensive enterprise capabilities
Wholly Integrated for deep ecosystem interoperability

Hortonworks Data Platform 2.2
BATCH, INTERACTIVE & REAL-TIME DATA ACCESS: Script (Pig), SQL (Hive), Java/Scala (Cascading), Stream (Storm), Search (Solr), NoSQL (HBase), In-Memory (Spark), Others (ISV Engines); Tez, Slider
YARN: Data Operating System (Cluster Resource Management)
HDFS (Hadoop Distributed File System)
GOVERNANCE (Data Workflow, Lifecycle & Governance): Falcon, Sqoop, Flume, WebHDFS
SECURITY (Authentication, Authorization, Accounting, Data Protection): Storage: HDFS; Resources: YARN; Access: Hive, …; Pipeline: Falcon; Cluster: Knox; Cluster: Ranger
OPERATIONS: Provision, Manage & Monitor (Ambari, Zookeeper); Scheduling (Oozie)
Deployment Choice: Linux, Windows, On-Premises, Cloud

HDP certifies the most recent & stable community innovation * version numbers are targets and subject to change at time of general availability in accordance with ASF release process

SOURCES, APPLICATIONS, OPERATIONAL TOOLS, DEV & DATA TOOLS, INFRASTRUCTURE
DATA SYSTEM: HDInsight, Azure
New! Power BI

Hortonworks Data Platform 2.2
Releases: HDP 2.0 (October 2013), HDP 2.1 (April 2014), HDP 2.2 (October 2014)
Categories: Data Management, Data Access, Governance & Integration, Security, Operations
Components: Hadoop & YARN, Pig, Hive & HCatalog, HBase, Sqoop, Oozie, Zookeeper, Ambari, Storm, Flume, Knox, Phoenix, Falcon, Ranger, Spark, Tez, Slider, Solr
Version numbers (in slide order): 2.2.0, 0.12.0, 2.4.0, 0.12.1, 0.13.0, 0.96.1, 0.98.0, 0.9.1, 1.4.4, 1.3.1, 1.4.0, 1.4.4, 1.5.1, 3.3.2, 4.0.0, 3.4.5, 0.4.0, 4.0.0, 0.5.0, 0.14.0, 0.98.4, 4.2, 0.9.3, 1.2.0, 0.6.0, 1.4.5, 1.5.0, 1.7.0, 4.1.0, 0.5.0, 0.4.0, 2.6.0, 3.4.5, 0.4.0, 0.60, 4.7.2, 4.10.0, 0.5.1

Open Source Data Management
Scalable: Linearly scale to store Petabytes of data
Reliable: Redundant storage protects against node failures
Flexible: Store all types of data, apply flexible schemas for analysis and sharing
Economical: Utilize cost-efficient commodity hardware; achieve high cluster utilization

ResourceManager: Scheduler
NodeManager tasks by workload:
Batch: map 1.1, map 1.2, reduce 1.1
Interactive SQL: vertex 1.1.1, vertex 1.1.2, vertex 1.2.1, vertex 1.2.2
Real-Time: nimbus 0, nimbus 1, nimbus 2

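Traditional Database vs. Hadoop Platform
schema: Required on write (Traditional Database); Required on read (Hadoop Platform)
speed: Reads are fast (Traditional Database); Writes are fast (Hadoop Platform)
governance: Standards and structured (Traditional Database); Loosely structured (Hadoop Platform)
processing: Limited, no data processing (Traditional Database); Processing coupled with data (Hadoop Platform)
data types: Structured (Traditional Database); Multi and unstructured (Hadoop Platform)
best fit use: Interactive OLAP Analytics, Complex ACID Transactions, Operational Data Store (Traditional Database); Data Discovery, Processing unstructured data, Massive Storage/Processing (Hadoop Platform)
SCALE (storage & processing): Hadoop Platform, NoSQL, MPP Analytics, EDW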
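The "required on write" versus "required on read" row can be illustrated with a small sketch. This is not code from any database or from Hadoop: the `SCHEMA` dictionary, the function names, and the sample records are all hypothetical, chosen only to show where validation happens in each model.

```python
import json

# Hypothetical fixed schema for the "schema on write" store.
SCHEMA = {"id": int, "state": str, "price": float}

def write_with_schema(store, record):
    """Schema on write: validate first, reject records that do not fit."""
    if set(record) != set(SCHEMA):
        raise ValueError("record fields do not match the schema")
    for field, ftype in SCHEMA.items():
        if not isinstance(record[field], ftype):
            raise ValueError(f"field {field!r} has the wrong type")
    store.append(record)

def write_raw(store, line):
    """Schema on read: keep the raw text as-is, so writes stay cheap."""
    store.append(line)

def read_with_schema(store, fields):
    """Apply a schema only at read time; skip rows lacking the fields."""
    for line in store:
        row = json.loads(line)
        if all(f in row for f in fields):
            yield {f: row[f] for f in fields}
```

In this sketch the first store refuses anything that does not match up front, while the second accepts any line and lets each reader decide which fields it needs.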

All offerings co-engineered by Hortonworks and Microsoft Enjoy seamless interoperability across on-premises and cloud

DATA ACCESS: Batch (Map Reduce), Script (Pig), SQL (Hive/Tez, HCatalog), NoSQL (HBase, Accumulo), Stream (Storm), Search (Solr), Others (In-Memory Analytics, ISV engines)
YARN: Data Operating System
DATA MANAGEMENT: HDFS (Hadoop Distributed File System), nodes 1 to N

HORTONWORKS DATA PLATFORM (HDP) For Windows
Interfaces: Sqoop, Flume, RPC, REST (HTTP), C (LibHDFS)

Apache Hive Contribution… an Open Community at its finest
13 Months; 1,672 Jira Tickets Closed; 145 Developers; 44 Companies; ~390,000 Lines Of Code Added… (2x)
Stack: Business Analytics, Custom Apps; Apache Hive (SQL); Apache MapReduce, Apache Tez; Apache YARN; HDFS (Hadoop Distributed File System), nodes 1 to N

Tez
Replaces MapReduce as primitive for Hive, Pig, etc.
Tez Task: a task with pluggable Input, Processor and Output
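The idea of a task with pluggable Input, Processor and Output can be sketched in a few lines. This is only an illustration of the pattern and not the Tez API (Tez itself is a Java framework): the `Task` class, its field names, and `count_by_key` are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable

@dataclass
class Task:
    """A task assembled from three interchangeable parts (hypothetical names)."""
    read_input: Callable[[], Iterable[Any]]
    process: Callable[[Iterable[Any]], Iterable[Any]]
    write_output: Callable[[Iterable[Any]], None]

    def run(self) -> None:
        # Input feeds the processor, whose result goes to the output.
        self.write_output(self.process(self.read_input()))

def count_by_key(rows):
    """Example processor: count occurrences of each key, sorted by key."""
    counts = {}
    for key in rows:
        counts[key] = counts.get(key, 0) + 1
    return sorted(counts.items())

# Wire an in-memory input and output around the processor.
results = []
task = Task(
    read_input=lambda: ["CA", "NY", "CA"],
    process=count_by_key,
    write_output=results.extend,
)
task.run()
```

Swapping any one of the three callables (for example, reading from a file instead of a list) leaves the other two untouched, which is the point of making them pluggable.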

Hive – MR vs. Hive – Tez
Query:
SELECT a.state, COUNT(*), AVG(c.price)
FROM a
JOIN b ON (a.id = b.id)
JOIN c ON (a.itemId = c.itemId)
GROUP BY a.state
Hive – MR (M = map, R = reduce; intermediate results pass through HDFS between jobs): SELECT a.state; SELECT c.price; JOIN (a, c); SELECT b.id; JOIN (a, b); GROUP BY a.state; COUNT(*); AVG(c.price)
Hive – Tez: SELECT a.state, c.itemId; JOIN (a, c); SELECT b.id; JOIN (a, b); GROUP BY a.state; COUNT(*); AVG(c.price)
Tez avoids unneeded writes to HDFS
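To see what the slide's query computes, independent of which engine runs it, here is a minimal sketch using Python's built-in `sqlite3` module as a stand-in for Hive. The table contents are invented purely for illustration, the aggregate is written as `AVG`, and an `ORDER BY` is added only to make the output order deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER, itemId INTEGER, state TEXT);
    CREATE TABLE b (id INTEGER);
    CREATE TABLE c (itemId INTEGER, price REAL);
    -- Illustrative rows only; not data from the presentation.
    INSERT INTO a VALUES (1, 10, 'CA'), (2, 11, 'CA'), (3, 10, 'NY');
    INSERT INTO b VALUES (1), (2), (3);
    INSERT INTO c VALUES (10, 4.0), (11, 6.0);
""")

# Same joins and grouping as the slide's query.
rows = conn.execute("""
    SELECT a.state, COUNT(*), AVG(c.price)
    FROM a
    JOIN b ON (a.id = b.id)
    JOIN c ON (a.itemId = c.itemId)
    GROUP BY a.state
    ORDER BY a.state
""").fetchall()

for state, n, avg_price in rows:
    print(state, n, avg_price)
```

The two joins and the aggregation are the stages that the MR and Tez plans on the slide arrange differently; the result set is the same either way.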

SQL Compliance
Hive provides a wide array of SQL datatypes and semantics so your existing tools integrate more seamlessly with Hadoop
Hive SQL Datatypes: INT; TINYINT/SMALLINT/BIGINT; BOOLEAN; FLOAT; DOUBLE; STRING; TIMESTAMP; BINARY; DECIMAL; ARRAY, MAP, STRUCT, UNION; DATE; VARCHAR; CHAR
Hive SQL Semantics: SELECT, INSERT; GROUP BY, ORDER BY, SORT BY; JOIN on explicit join key; Inner, outer, cross and semi joins; Sub-queries in FROM clause; ROLLUP and CUBE; UNION; Windowing Functions (OVER, RANK, etc); Custom Java UDFs; Standard Aggregation (SUM, AVG, etc.); Advanced UDFs (ngram, Xpath, URL); Sub-queries for IN/NOT IN, HAVING; Expanded JOIN Syntax; INTERSECT / EXCEPT
Legend: Hive 0.11; Hive 0.12 (HDP 2.0); Hive 0.13 (HDP 2.1)

YARN: Data Operating System
NoSQL: HBase
HDFS (Permanent Data Storage), nodes 1 to N

MapReduce Indexing Job
HDFS (Hadoop Distributed File System)

YARN: Data Operating System (Cluster Resource Management)
Engines: Script (Pig), SQL (Hive), Java/Scala (Cascading), Stream (Storm), Search (Solr), NoSQL (HBase, Accumulo), In-Memory (Spark), Kafka, Others (ISV Engines)
Frameworks: Tez, Slider
HDFS (Hadoop Distributed File System)

1st Gen of Hadoop: Single Use System (Batch Apps)
HDFS (redundant, reliable storage)
MapReduce (cluster resource management & data processing)

2nd Gen of Hadoop: Multi Use Data Platform (Batch, Interactive, Online, Streaming, …)
Redundant, Reliable Storage (HDFS)
Efficient Cluster Resource Management & Shared Services (YARN)
Flexible Data Processing (Hive, Pig, others…): Batch (MapReduce), Batch & Interactive (Tez), Online Data Processing (HBase, Accumulo), Stream Processing (Storm), others…
Classic Hadoop Apps

Define sophisticated Workflows and DLM Policies
Enable audit, compliance, and data re-processing
Staged Data: Retain 5 Years
Cleansed Data: Retain 3 Years
Conformed Data: Retain 3 Years
Presented Data: Retain Last Copy Only
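The retention rules on this slide can be written down as a small lookup. The sketch below is not Falcon's policy syntax or API; the `RETENTION_YEARS` table and the `expired` function are hypothetical, and a year is approximated as 365 days.

```python
from datetime import date, timedelta

# Retention per data stage, taken from the slide.
# None stands for "Retain Last Copy Only".
RETENTION_YEARS = {
    "staged": 5,
    "cleansed": 3,
    "conformed": 3,
    "presented": None,
}

def expired(stage, created, today, is_latest):
    """Return True if a dataset copy falls outside its retention rule."""
    years = RETENTION_YEARS[stage]
    if years is None:
        # Only the most recent copy is kept.
        return not is_latest
    # Assumption: one year is treated as 365 days.
    return today - created > timedelta(days=365 * years)
```

For example, a cleansed copy created on 2010-01-01 is past its three-year window by 2014-01-01, while a staged copy of the same age is still within its five years.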

Disaster Recovery and Backup between environments
Publishing data between environments for Discovery
Site to Site; Site to Cloud

Knox Gateway (GW): a stateless reverse proxy instance deployed in DMZ
Clients: REST Client, JDBC Client, Browser
Identity Providers: Enterprise Identity Provider (LDAP/AD)
Path: Firewall, DMZ (Knox Gateway), Firewall, HDP clusters
HDP Cluster 1 and HDP Hadoop Cluster 2: Masters (JT, NN, WebHCat, Oozie, YARN, HBase, Hive); DN, TT
- Requests streamed through GW to Hadoop services after auth.
- URLs rewritten to refer to gateway
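The "URLs rewritten to refer to gateway" step can be pictured with a short sketch. It does not reproduce Knox's actual rewrite rules: the `rewrite_to_gateway` function, the host names, and the gateway path layout below are assumptions made for illustration.

```python
from urllib.parse import urlsplit, urlunsplit

def rewrite_to_gateway(internal_url, gateway_base):
    """Point an internal service URL at the gateway instead.

    The scheme and host come from the gateway, the gateway's path is
    used as a prefix, and the service path and query are preserved.
    """
    inner = urlsplit(internal_url)
    gw = urlsplit(gateway_base)
    path = gw.path.rstrip("/") + inner.path
    return urlunsplit((gw.scheme, gw.netloc, path, inner.query, inner.fragment))

# Hypothetical hosts and paths, for illustration only.
public_url = rewrite_to_gateway(
    "http://namenode.internal:50070/webhdfs/v1/tmp?op=LISTSTATUS",
    "https://gateway.example.com:8443/gateway/cluster1",
)
print(public_url)
```

With a rewrite of this kind, clients outside the firewall only ever see the gateway's address, never the internal host names.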

Ambari: Deploy, Manage, Monitor
AMBARI WEB: PROVISION, MANAGE, MONITOR
REST APIs
AMBARI SERVER: PROVISION | MANAGE | MONITOR
compute & storage

HADOOP: Storage & Process at Scale
Ambari SCOM Mgmt Pack
Ambari SCOM Server aggregates + exposes Hadoop metrics
Ambari SCOM monitors health + alerts in case of problems

www.microsoft.com/learning
http://developer.microsoft.com
http://microsoft.com/technet
http://channel9.msdn.com/Events/TechEd
