Big Data Hands-On Labs:
Tuesday 3:45pm – 4:45pm, Hotel Nikko – Peninsula
Wednesday 1:15pm – 2:15pm
Thursday 11:30am – 12:30pm
Or download: Big Data Lite Virtual Machine
Oracle Big Data Appliance for Customers and Partners Jean-Pierre Dijcks Oracle Big Data Product Management Paul Kent SAS VP Big Data
Oracle Big Data Appliance for Customers and Partners
1. Big Data Appliance Recap
2. Why You Should Consider Big Data Appliance
3. Driving Business Value with SAS on Big Data Appliance
4. Q&A
Oracle Big Data Management System Oracle Big Data SQL Oracle Database Oracle Industry Models Oracle Advanced Analytics Oracle Spatial & Graph Cloudera Hadoop Oracle NoSQL Database Oracle R Advanced Analytics for Hadoop Oracle R Distribution Oracle Database Oracle Advanced Security Oracle Advanced Analytics Oracle Spatial & Graph Oracle Big Data Connectors Oracle Data Integrator Big Data Appliance Oracle Exadata SOURCES
Recap: Big Data Appliance Overview Big Data Appliance X4-2 Sun Oracle X4-2L Servers with per server: 2 * 8 Core Intel Xeon E5 Processors 64 GB Memory 48TB Disk space Integrated Software: Oracle Linux, Oracle Java VM Oracle Big Data SQL* Cloudera Distribution of Apache Hadoop – EDH Edition Cloudera Manager Oracle R Distribution Oracle NoSQL Database * Oracle Big Data SQL is separately licensed
Recap: Standard and Modular Starter Rack is fully cabled and configured for growth with 6 servers In-Rack Expansion delivers a 6-server modular expansion block Full Rack delivers an optimal blend of capacity and expansion options Grow by adding racks – up to 18 racks without additional switches
Recap: Harness Rapid Evolution
BDA 1.0 – Jan 2012: Initial BDA, Mammoth Install
BDA 2.x – April 2013: Starter Rack, In-Rack Expansion, EM Integration
BDA 3.x – April 2014: CDH 5.0 (MR2 & YARN), AAA Security, Encryption
BDA 4.0 – Sept 2014: Big Data SQL, Node Migration
Core Design Principles for Big Data Appliance Operational Simplicity Simplify Access to ALL Data
Core Design Principles for Big Data Appliance Oracle Big Data SQL Oracle SQL on ALL your data All Native Oracle SQL Operators Smart Scan for Optimized Performance Oracle Security Govern all Data through a Single Set of Security Policies Operational Simplicity Simplify Access to ALL Data
Oracle Big Data SQL – A New Architecture Powerful, high-performance SQL on Hadoop Full Oracle SQL capabilities on Hadoop SQL query processing local to Hadoop nodes Simple data integration of Hadoop and Oracle Database Single SQL point-of-entry to access all data Scalable joins between Hadoop and RDBMS data Optimized hardware Balanced Configurations No bottlenecks Big Data SQL represents a new architecture for querying data in its natural format, wherever it lives, and – when running on Oracle Big Data Appliance and Oracle Exadata – provides a world-class Big Data Management System. Oracle Confidential – Internal/Restricted/Highly Restricted
Big Data SQL 10's of Gigabytes of Data Hadoop Cluster Oracle Database SELECT w.sess_id, c.name FROM web_logs w, customers c WHERE w.source_country = 'Brazil' AND w.cust_id = c.customer_id; Relevant SQL runs on BDA nodes (CUSTOMERS, WEB_LOGS); only the columns and rows needed to answer the query are returned. Big Data SQL's Smart Scan capability radically reduces the cost of joining data with Oracle Database as well. When a join between massive data in Hadoop and smaller data in Oracle occurs, Big Data SQL can process rows using Bloom filters. This ensures that only data from Hadoop that meets the join conditions is transmitted back to the database. As before, this can reduce the amount of data being transmitted and processed by the database by an order of magnitude or more. In this case, Oracle Database is left joining only average-sized row sets. By processing data at the source, whether it is stored in Hadoop or Oracle Database, Big Data SQL ensures the best possible use of all the compute resources in a Big Data Management System.
Big Data SQL: SQL Push Down SELECT w.sess_id, c.name FROM web_logs w, customers c WHERE w.source_country = 'Brazil' AND w.cust_id = c.customer_id; Pushed down to Hadoop: scans on unstructured data; WHERE clause evaluation; column projection; Bloom filters for better join performance; JSON parsing; data mining model evaluation. Relevant SQL runs on BDA nodes (CUSTOMERS, WEB_LOGS); only the columns and rows needed to answer the query are returned.
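The Bloom-filter pruning described above can be sketched in Python. This is a toy illustration, not Oracle's implementation: the BloomFilter class, tables, and ids are invented for the example, and real Big Data SQL builds the filter from the database-side join keys and evaluates it inside the Hadoop-side scan.

```python
# Toy sketch: a Bloom filter built from database-side join keys lets the
# Hadoop side discard non-matching rows before anything crosses the wire.
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1024, hashes=3):
        self.size = size_bits
        self.hashes = hashes
        self.bits = 0                       # bit array packed into one int
    def _positions(self, key):
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size
    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p
    def might_contain(self, key):
        # may report false positives, never false negatives
        return all(self.bits >> p & 1 for p in self._positions(key))

# "Database side": build the filter from the customer join keys.
customers = {101: "Ana", 102: "Bruno"}      # customer_id -> name
bf = BloomFilter()
for cust_id in customers:
    bf.add(cust_id)

# "Hadoop side": scan the web logs, ship back only rows that might join.
web_logs = [{"sess_id": 1, "cust_id": 101, "source_country": "Brazil"},
            {"sess_id": 2, "cust_id": 999, "source_country": "Brazil"},
            {"sess_id": 3, "cust_id": 102, "source_country": "Chile"}]
shipped = [r for r in web_logs
           if r["source_country"] == "Brazil" and bf.might_contain(r["cust_id"])]
```

Rows failing the WHERE clause or the filter never leave the scan, which is why the database ends up joining an order of magnitude less data.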
Oracle Communications Data Model Reference Architecture Data Sources Oracle Comms Apps (BSS/OSS) Oracle Comms Ntwk Products (Tekelec & Acme) Other Oracle Apps (CRM, ERP, etc.) Third Party Sources Data Management Big Data Platform (Hadoop/NoSQL) Relational Data Warehouse (OCDM) ETL/ELT Adapters Customer Experience Real-Time Adapters Operations Third Party Monetization Adapters Analytic Apps Feedback Loop To Other Apps
Core Design Principles for Big Data Appliance Operational Simplicity Simplify Access to ALL Data
Core Design Principles for Big Data Appliance No Bottlenecks Full Stack Install and Upgrades Simplified Management Cluster Growth Critical Node Migration Always Highly Available Always Secure Very Competitive Price Point Operational Simplicity Simplify Access to ALL Data
Successful Big Data Systems Grow From Cluster Install with HA to Large Clusters to Dealing with Operational Issues Day 1 12-node BDA for Production Hadoop HA and Security Set-up Ready to Load Data Full install with a single command: ./mammoth -i rck_1 A small example using the NameNodes (HA setup) to show how things change automatically on a BDA and how critical-node migration happens. RCK_1
Successful Big Data Systems Grow From Cluster Install with HA to Large Clusters to Dealing with Operational Issues Day 1 RCK_1 N Example Service: Hadoop Name Nodes
Successful Big Data Systems Grow From Cluster Install with HA to Large Clusters to Dealing with Operational Issues Day 90 Add 12 New Nodes across two Racks Cluster expansion with a single command: mammoth -e newhost1,…,newhostn RCK_1 RCK_2 N N
Successful Big Data Systems Grow From Cluster Install with HA to Large Clusters to Dealing with Operational Issues Cluster expansion with a single command: mammoth -e newhost1,…,newhostn RCK_1 RCK_2 This expansion automatically optimizes HA setup across multiple racks N Because of uniform nodes and IB networking, no data is moved N
Successful Big Data Systems Grow From Cluster Install with HA to Large Clusters to Dealing with Operational Issues Day n Critical Node Failure => Primary Name Node RCK_1 RCK_2 N N
Successful Big Data Systems Grow From Cluster Install with HA to Large Clusters to Dealing with Operational Issues Automatic Failover to other NameNode Automatic Service Request to Oracle for HW Failure RCK_1 RCK_2 N N
Successful Big Data Systems Grow From Cluster Install with HA to Large Clusters to Dealing with Operational Issues Restore HA with a single command: bdacli admin_cluster migrate N1 Reinstate the repaired node with a single command: bdacli admin_cluster reprovision N1 RCK_1 RCK_2 N N
Core Design Principles for Big Data Appliance – Operational Simplicity 30% Quicker to Deploy 21% Cheaper to Buy "Oracle Big Data Appliance is an excellent choice for customers looking to work with the full suite of Cloudera's leading Hadoop-based technology. It's more cost-effective and quicker to deploy than a DIY cluster." – Mike Olson, Cloudera founder, Chief Strategy Officer, and Chairman of the Board It's logical that an engineered system would be quicker to deploy than building your own. It's the cost claim that people don't believe. But we worked with an analyst firm, ESG, who measured list prices and did the comparison: the BDA is at least 20% cheaper than building your own comparable cluster, assuming you have the time and skills to do so. The key word here is comparable – plenty of people tell me they could build one cheaper, but it turns out to have a fraction of the storage. These are large, dense storage nodes; with cheap pizza-box servers you can fill a rack much more cheaply, but you'll have far less storage.
Big Data Initiative @ Oracle Global Support Services Real-time access to better data means better insights, which means better decisions and better business results Integrate data associated with customer telemetry, configurations, service history, diagnostics, knowledge & support information Anticipate Detect Predict Automate Delight
Core Design Principles Enable Success Operational Simplicity Simplify Access to ALL Data
There is one more thing… Business Value = Applications
Big Data Appliance powers instant Business Value Customer Experience Management Communications Data Model Cyber Security Solutions
Introducing Paul Kent - SAS
Big Data and Big Analytics – So Much more Gunpowder! Paul Kent VP BigData, SAS Research and Development
1. Change 2. Safari Pics
[CON8279] Oracle Big Data Appliance: Deep Dive and Roadmap for Customers and Partners Oracle Big Data Appliance is the premier Hadoop appliance in the market. This session describes the roadmap for customers in the areas of high-performance SQL on Hadoop and securing big data, plus overall performance improvements for Hadoop. A special focus in the session is the roadmap and benefits Oracle Big Data Appliance brings to Oracle partners. To illustrate the benefits of running on a standardized and optimized Hadoop platform, SAS presents the findings of its tests of SAS In-Memory Analytics on Oracle Big Data Appliance.
Agenda
SAS & Oracle Partnership
Family Stories: Hadoop, Oracle Engineered Systems Family, SAS Software Family
Deployment Patterns
Reflection on a stronger partnership than ever Both leaders in Big Data – jointly solving the most difficult and demanding Big Data problems Providing simplicity and agility to create flexible configurations Extensive engineering collaboration Can we answer: How does it work? How does it perform?
the tamoxifen dilemma SOURCE: http://commons.wikimedia.org/wiki/File:Tamoxifen-3D-vdW.png
Agenda
SAS & Oracle Partnership
Family Stories: Hadoop, Oracle Engineered Systems Family, SAS Software Family
Deployment Patterns
Elephant :: 3 Good Ideas !! Never forgets Is a good (hard) worker Is a Social Animal (teamwork)
Hadoop – Simplified View Controller Worker Nodes MPP (Massively Parallel) hardware running database-like software "data" is stored in parts, across multiple worker nodes "work" operates in parallel, on the different parts of the table
Idea #1 - HDFS. Never forgets! Head Node Data 1 Data 2 Data 3 Data 4… MYFILE.TXT ..block1 -> block1 ..block2 -> block2 ..block3 -> block3
Idea #1 - HDFS. Never forgets! Head Node Data 1 Data 2 Data 3 Data 4… MYFILE.TXT ..block1 -> block1 block1 copy2 ..block2 -> block2 block2 copy2 ..block3 -> block3 copy2 block3
Idea #1 - HDFS. Never forgets! Head Node Data 1 Data 2 Data 3 Data 4… MYFILE.TXT ..block1 -> block1 block1 copy2 ..block2 -> block2 block2 copy2 ..block3 -> block3 copy2 block3 X X
Redundancy Wins!
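The replication idea behind "Redundancy Wins!" can be sketched as a toy model. The node names and the replication factor of 2 are assumptions to keep the example small; real HDFS defaults to 3 replicas and uses rack-aware placement, and this is in no way the actual HDFS code.

```python
# Toy model of HDFS replication: each block lives on several data nodes,
# so losing one node loses no data, and the name node can re-replicate.
REPLICATION = 2  # HDFS defaults to 3; 2 keeps the example small

nodes = {"data1": set(), "data2": set(), "data3": set(), "data4": set()}

def place_blocks(blocks, replication=REPLICATION):
    """Round-robin placement of each block onto `replication` distinct nodes."""
    names = sorted(nodes)
    for i, blk in enumerate(blocks):
        for r in range(replication):
            nodes[names[(i + r) % len(names)]].add(blk)

def fail_node(name):
    """Simulate a node failure, then re-replicate under-replicated blocks."""
    lost = nodes.pop(name)
    for blk in lost:
        holders = [n for n, blks in nodes.items() if blk in blks]
        if holders and len(holders) < REPLICATION:
            # copy from a surviving replica to a node that lacks the block
            target = next(n for n in sorted(nodes) if blk not in nodes[n])
            nodes[target].add(blk)

place_blocks(["block1", "block2", "block3"])
fail_node("data1")
# every block still has REPLICATION copies after recovery
```

The point of the slide sequence above is exactly this invariant: after the X-marked node dies, every block is still fully replicated somewhere else.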
Idea #2 – MapReduce – Send the work to the Data We Want the Youngest Person in the Room Each Row in the audience is a data node I’ll be the coordinator From outside to center, accumulate MIN Sweep from back to front. Youngest Advances
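The audience demo above maps directly onto MapReduce: each row computes its local minimum (the map side, where the work goes to the data), and the coordinator compares only the per-row winners (the reduce side). A minimal sketch with made-up ages:

```python
# Each inner list is one "row" of the audience: the ages sitting in it.
rows = [
    [34, 29, 41, 52],
    [27, 38, 45],
    [31, 30, 26, 60, 44],
]

def map_phase(row):
    """Each row accumulates its own MIN locally -- work goes to the data."""
    return min(row)

def reduce_phase(partials):
    """The coordinator sees one value per row, never every age."""
    return min(partials)

partials = [map_phase(row) for row in rows]   # -> [29, 27, 26]
youngest = reduce_phase(partials)             # -> 26
```

The coordinator handles three numbers instead of twelve, which is the whole trick: the data never moves, only the tiny per-partition results do.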
Agenda
SAS & Oracle Partnership
Family Stories: Hadoop, Oracle Engineered Systems Family, SAS Software Family
Deployment Patterns
Recap: Standard and Modular Starter Rack is fully cabled and configured for growth with 6 servers In-Rack Expansion delivers a 6-server modular expansion block Full Rack delivers an optimal blend of capacity and expansion options Grow by adding racks – up to 18 racks without additional switches
Oracle Big Data SQL – A New Architecture Powerful, high-performance SQL on Hadoop Full Oracle SQL capabilities on Hadoop SQL query processing local to Hadoop nodes Simple data integration of Hadoop and Oracle Database Single SQL point-of-entry to access all data Scalable joins between Hadoop and RDBMS data Optimized hardware Balanced Configurations No bottlenecks Big Data SQL represents a new architecture for querying data in its natural format, wherever it lives, and – when running on Oracle Big Data Appliance and Oracle Exadata – provides a world-class Big Data Management System.
Diversity. It’s a good thing! Impala Nyala
Agenda
SAS & Oracle Partnership
Family Stories: Hadoop, Oracle Engineered Systems Family, SAS Software Family
Deployment Patterns
4 Important Things #1 Join the Family
SAS/ACCESS to Hadoop SAS SERVER HADOOP HiveQL #2 Be Familiar
SAS / Embedded Process
SAS SERVER | HADOOP
SAS/Scoring Accelerator for Hadoop
SAS/Code Accelerator for Hadoop
SAS/Data Quality Accelerator for Hadoop
SAS Data Step & DS2:

    proc ds2;
       /* thread ~ equivalent to a mapper */
       thread map_program;
          method run();
             set dbmslib.intab;
             /* program statements */
          end;
       endthread;
    run;

    /* program wrapper */
    data hdf.data_reduced;
       dcl thread map_program map_pgm;
       method run();
          set from map_pgm threads=N;
          /* reduce steps */
       end;
    enddata;
    run;
    quit;
SAS / High Performance Analytics HADOOP SAS HPA Procedures SAS SERVER #3 Use the Cluster!
SAS / High Performance Analytics
Prepare: HPDS2, HPDMDB, HPSAMPLE, HPSUMMARY
Explore / Transform: HPCORR, HPREDUCE, HPIMPUTE, HPBIN
Model: HPLOGISTIC, HPREG, HPNEURAL, HPNLIN, HPCOUNTREG, HPMIXED, HPSEVERITY, HPFOREST, HPSVM, HPDECIDE, HPQLIM, HPLSO, HPSPLIT, HPTMINE, HPTMSCORE
Which HPA procs? This is the short list; these run in Hadoop today:
HPDS2 – Parallel execution of DS2
HPDMDB – Metadata definitions and data summarization
HPSAMPLE – Sampling and data partitioning
HPSUMMARY – Summarization and descriptive statistics
HPCORR – Pearson correlation coefficients, three nonparametric measures of association, and the probabilities associated with these statistics
HPREDUCE – Unsupervised variable selection and covariance/correlation analysis
HPIMPUTE – Missing value replacement
HPBIN – Binning
HPLOGISTIC – Logistic regression and variable selection
HPREG – Linear regression and variable selection
HPLMIXED – Linear mixed models
HPNEURAL – Neural networks
HPNLIN – Nonlinear regression and maximum likelihood
HPCOUNTREG – Regression of count variables
HPMIXED – Mixed linear models
HPFOREST – Random forest
HPSVM – Support vector machine
HPDECIDE – Decision / cost
HPLSO – Lasso
HPTMINE – Text mining
HPTMSCORE – Text scoring
PROC HPREG – High-performance combination of REG and GLMSELECT. Supports classical variable selection techniques, modern variable selection techniques (LAR, LASSO), CLASS variables, GLM and reference parameterizations, and the SELECTION statement.
PROC HPNLIN – High-performance combination of NLIN and NLP/NLMIXED: classical nonlinear least squares (Levenberg-Marquardt), maximum likelihood for built-in distributions, maximum likelihood for general user-specified objective functions, boundaries, and linear equality/inequality constraints.
SAS / High Performance Analytics Controller Client Some processes are more complex than fits "nicely" inside the terms & conditions of the container. We can use the embedded process as a data acquisition channel and yet perform the mathematics elsewhere (and in the first generation, elsewhere meant other operating-system processes on the same server – preserving a symmetric, 1:1 balance between data parallelism and mathematics parallelism). 2012 – SAS High Performance appliances for Teradata, Greenplum, Oracle, and Hadoop
#1 Join the Family #2 Be Familiar #3 Use the cluster #4 Have a pretty face!
SAS Visual Analytics Interactive exploration, dashboards and reporting Auto-charting automatically picks the best graph Forecasting, scenario analysis, Decision Trees and other analytic visualizations Text analysis and content categorization Feature-rich mobile apps for iPad® and Android
SAS Visual Statistics July-2014 Interactive, visual application for statistical modeling and classification Multiple methods: Logistic, Regression, GLM, Trees, Forest, Clustering, and more… Model comparison and assessment Group-By Processing
4 Important Things (for cluster friendly software) Join the Family Be Familiar Performance Have a pretty face
Agenda
SAS & Oracle Partnership
Family Stories: Hadoop, Oracle Engineered Systems Family, SAS Software Family
Deployment Patterns
SAS Big Data on Big Data Appliance Flexible architectural options for SAS deployments Can run on Starter, Half, and Full configurations Optionally select nodes "N, N-1, N-2, …" for additional SAS services such as SAS Compute Tier, SAS MidTier Optionally select node subset "N, N-1, N-2, N-3, …" for more dedicated resources for the SAS Analytic Compute Environment by shifting Big Data Appliance roles Option to selectively add more memory on a per-node basis depending on specific workload distribution
STARTER BDA SAS HPA Root Node SAS Visual Analytics Metadata Server SAS Compute SAS Midtier … … SAS Visual Analytics, high-Performance Analytic Compute environment Co-Located with Hadoop
STARTER BDA SAS HPA Root Node SAS Visual Analytics Metadata Server SAS Compute SAS Midtier Consider: Extra Memory for nodes 5, 6? … … SAS Visual Analytics, High-Performance Analytic Compute environment Co-Located with Hadoop
FULL RACK BDA SAS HPA Root Node Metadata Server SAS Compute SAS Midtier LASR Worker 18 HDFS Data 18 LASR Worker 17 … … HDFS Data 17 … SAS Visual Analytics, high-Performance Analytic Compute environment Co-Located with Hadoop
FULL RACK BDA – Assembled in OSC, SYDNEY AUSTRALIA
FULL RACK BDA – Assembled in OSC, SYDNEY AUSTRALIA
Basic Smoke Tests Confirmed:
Interoperate with Hadoop and Map Reduce
Read and write text files to/from HDFS
Read and write tabular files to/from Hive (will confirm Oracle BIGSQL in OSC-SC)
Read and write SAS binary format files to/from HDFS
High Degree of Parallelism (DOP) reads via Map-Only jobs
SAS LASR server co-exists on/with datanodes
SAS HPA tasks scheduled on datanodes
SAS High-Performance Analytics Performance
SAS Format Data (SASHDAT)
               1107 var, 11.795 Mobs   73.744 Mobs           Ratio
               97GB (5.7GB/node)       608GB (35.7GB/node)   (6x the data)
Create         208.79 sec              2284.29 sec           11x
Scan/Count     24.60 sec               259.38 sec            10.5x
HPCORR         295.20                  1410.40               4.7x
HPCNTREG       336.79                  1547.59               4.6x
HPREDUCE (u)   236.55                  2467.76               10.4x
HPREDUCE (s)   219.50                  2037.74               9.3x
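As a sanity check, the ratios on this slide can be recomputed directly from the two timing columns (values copied from the slide; the deck rounds slightly differently in places, e.g. 11 for Create and 4.7 for HPCORR):

```python
# Recompute the ratio column: ~6x more data costs roughly 5-11x more time,
# depending on the procedure. Timings are (97GB run, 608GB run) in seconds.
timings = {
    "Create":       (208.79, 2284.29),
    "Scan/Count":   (24.60,  259.38),
    "HPCORR":       (295.20, 1410.40),
    "HPCNTREG":     (336.79, 1547.59),
    "HPREDUCE (u)": (236.55, 2467.76),
    "HPREDUCE (s)": (219.50, 2037.74),
}
ratios = {k: round(big / small, 1) for k, (small, big) in timings.items()}
# e.g. ratios["Scan/Count"] -> 10.5, ratios["HPREDUCE (s)"] -> 9.3
```

The I/O-bound steps (Create, Scan) scale close to the data growth; the compute-heavy procs (HPCORR, HPCNTREG) scale better than the data growth.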
OSC-AU FullRack BDA: 408 Threads, 600 GB dataset, 17 servers. Your Problem solved ASAP
Exadata Integration SAS Embedded Processing (EP) to Exadata, leveraging Big Data SQL SAS HPA Root Node SAS Visual Analytics Metadata Server SAS Compute SAS Midtier LASR Worker 18 … … HDFS Data 18 SAS EP Big Data SQL
SAS High-Performance Analytics Performance
SAS EP Parallel Data Feeders: DOP=1 vs. DOP=24 (flash cache)
Add(5): 1.25min / 1.5min / .5min
Add(20): 2.5min
Add(100): 13min / .6min
Add(200): 16min / ~2min / 1.25min (10x)
Table 1: Summation of 5/20/100/200 columns; Baseline: DOP=1 (no parallelism); 120M rows, 400 columns, reg_simtbl_400
SAS High-Performance Analytics Performance
SAS EP Parallel Data Feeders
               Access     Access / DBSlice   SAS HPA Using EP
Reg_sim_200    1:01:12    0:28:37            0:08:00
Reg_sim_400    1:49:11    0:55:33            0:16:05 (7x!)
Table 2: Scan times for 2 tables (200 columns, 400 columns, 120M rows); Baseline: SAS/ACCESS vs. HPA EP feeder
SAS High-Performance Analytics Performance
SAS Format Data (SASHDAT) and Oracle EXADATA
               SASHDAT                 EXADATA               SASHDAT
               1107 var, 11.795 Mobs   907 var, 79.7GB       73.744 Mobs, 608GB
               97GB (5.7GB/node)       (4.7GB/node)          (35.7GB/node)
Create         208.79 sec              931.22 sec            2284.29 sec
Scan/Count     24.60 sec               956.16 sec            259.38 sec
HPCORR         295.20                  833.24                1410.40
HPCNTREG       336.79                  756.97                1547.59
HPREDUCE (u)   236.55                  1055.11               2467.76
HPREDUCE (s)   219.50                  1051.93               2037.74
Oracle Engineered Systems Family: ExaData, ExaLogic, SuperCluster, Big Data Appliance, ZFS Storage Appliance, Virtual Compute Appliance, Database Backup, Recovery, Logging Appliance
SAS and Oracle – Working together to create customer value Joint R&D and Product Management teams in Cary and Redwood Shores Focus on driving SAS technology components to run natively in Oracle Database Joint performance engineering optimizations Template physical architectures developed based on use-cases, physically tested and benchmarked together Reduction in physical effort Overall reduction in lifecycle costs Best Practice papers SAS and Oracle engineers provide joint "Sizing and Architecture Analysis and Design"
SAS and Oracle – Better Together
Paul.Kent@sas.com | @hornpolish | paulmkent