Big Data Hands-On Labs:

Presentation on theme: "Big Data Hands-On Labs:" — Presentation transcript:

1 Big Data Hands-On Labs:
Dates and times: Tuesday 3:45pm – 4:45pm; Wednesday 1:15pm – 2:15pm; Thursday 11:30am – 12:30pm. Location: Hotel Nikko - Peninsula. Or download: Big Data Lite Virtual Machine

2 Oracle Big Data Appliance for Customers and Partners
Jean-Pierre Dijcks, Oracle Big Data Product Management; Paul Kent, SAS VP Big Data

3 Oracle Big Data Appliance for Customers and Partners
1. Big Data Appliance Recap  2. Why You Should Consider Big Data Appliance  3. Driving Business Value with SAS on Big Data Appliance  4. Q&A

4 Oracle Big Data Management System
Architecture diagram: Oracle Big Data SQL spans Big Data Appliance (Cloudera Hadoop, Oracle NoSQL Database, Oracle R Advanced Analytics for Hadoop, Oracle R Distribution) and Oracle Exadata (Oracle Database, Oracle Advanced Security, Oracle Advanced Analytics, Oracle Spatial & Graph, Oracle Industry Models), with Oracle Big Data Connectors and Oracle Data Integrator moving data between them, fed from external SOURCES.

5 Recap: Big Data Appliance Overview
Big Data Appliance X4-2: Sun Oracle X4-2L servers, each with 2 x 8-core Intel Xeon E5 processors, 64 GB memory, and 48 TB disk space. Integrated software: Oracle Linux, Oracle Java VM, Oracle Big Data SQL*, Cloudera Distribution of Apache Hadoop (EDH Edition), Cloudera Manager, Oracle R Distribution, Oracle NoSQL Database. (* Oracle Big Data SQL is separately licensed.)

6 Recap: Standard and Modular
Starter Rack is fully cabled and configured for growth, with 6 servers. In-Rack Expansion delivers a 6-server modular expansion block. Full Rack delivers the optimal blend of capacity and expansion options. Grow by adding racks: up to 18 racks without additional switches.

7 Recap: Harness Rapid Evolution
BDA 1.0 – Jan 2012: initial BDA, Mammoth install. BDA 2.x – April 2013: Starter Rack, In-Rack Expansion, EM integration. BDA 3.x – April 2014: CDH 5.0 (MR2 & YARN), AAA security, encryption. BDA 4.0 – Sept 2014: Big Data SQL, node migration.

8 Core Design Principles for Big Data Appliance
Operational Simplicity Simplify Access to ALL Data

9 Core Design Principles for Big Data Appliance
Oracle Big Data SQL: Oracle SQL on ALL your data; all native Oracle SQL operators; Smart Scan for optimized performance. Oracle Security: govern all data through a single set of security policies. Operational Simplicity. Simplify Access to ALL Data.

10 Oracle Big Data SQL – A New Architecture
Powerful, high-performance SQL on Hadoop: full Oracle SQL capabilities on Hadoop; SQL query processing local to the Hadoop nodes. Simple data integration of Hadoop and Oracle Database: a single SQL point-of-entry to access all data; scalable joins between Hadoop and RDBMS data. Optimized hardware: balanced configurations, no bottlenecks. Big Data SQL represents a new architecture for querying data in its natural format, wherever it lives, and, when running on Oracle Big Data Appliance and Oracle Exadata, provides a world-class Big Data Management System.

11 Big Data SQL 10’s of Gigabytes of Data Hadoop Cluster Oracle Database
SELECT w.sess_id, c.name FROM web_logs w, customers c WHERE w.source_country = 'Brazil' AND w.cust_id = c.customer_id; Relevant SQL runs on the BDA nodes (CUSTOMERS, WEB_LOGS); only the columns and rows needed to answer the query are returned. Big Data SQL's Smart Scan capability radically reduces the cost of joining data with Oracle Database as well. When a join between massive data in Hadoop and smaller data in Oracle occurs, Big Data SQL can process rows using Bloom filters. This ensures that only the rows from Hadoop that meet the join conditions are transmitted back to the database. As before, this can reduce the amount of data being transmitted and processed by the database by an order of magnitude or more, leaving Oracle Database to join only average-sized data sets. By processing data at the source, whether it is stored in Hadoop or in Oracle Database, Big Data SQL ensures the best possible use of all the compute resources in a Big Data Management System.

12 Big Data SQL SQL Push Down in Big Data SQL
SELECT w.sess_id, c.name FROM web_logs w, customers c WHERE w.source_country = 'Brazil' AND w.cust_id = c.customer_id; SQL push down in Big Data SQL: Hadoop scans on unstructured data; WHERE clause evaluation; column projection; Bloom filters for better join performance; JSON parsing and data mining model evaluation. Relevant SQL runs on the BDA nodes (CUSTOMERS, WEB_LOGS); only the columns and rows needed to answer the query are returned.

13 Oracle Communications Data Model
Reference Architecture (diagram). Data Sources: Oracle Comms Apps (BSS/OSS), Oracle Comms Ntwk Products (Tekelec & Acme), Other Oracle Apps (CRM, ERP, etc.), Third Party Sources. Data Management: Big Data Platform (Hadoop/NoSQL), Relational Data Warehouse (OCDM), ETL/ELT Adapters, Real-Time Adapters, Monetization Adapters. Analytic Apps: Customer Experience, Operations, Third Party. Feedback loop to other apps.

14 Core Design Principles for Big Data Appliance
Operational Simplicity Simplify Access to ALL Data

15 Core Design Principles for Big Data Appliance
No bottlenecks; full-stack install and upgrades; simplified management; cluster growth; critical node migration; always highly available; always secure; very competitive price point. Operational Simplicity. Simplify Access to ALL Data.

16 Successful Big Data Systems Grow
From Cluster Install with HA to Large Clusters to Dealing with Operational Issues. Day 1: a 12-node BDA for production, Hadoop HA and security set up, ready to load data. Full install with a single command: ./mammoth -i rck_1. This is a small example using the Name Nodes (HA setup) to show how things change automatically on a BDA and how critical node migration happens. RCK_1

17 Successful Big Data Systems Grow
From Cluster Install with HA to Large Clusters to Dealing with Operational Issues. Day 1. RCK_1. Example service: Hadoop Name Nodes (N).

18 Successful Big Data Systems Grow
From Cluster Install with HA to Large Clusters to Dealing with Operational Issues. Day 90: add 12 new nodes across two racks. Cluster expansion with a single command: mammoth -e newhost1,…,newhostn. RCK_1, RCK_2.

19 Successful Big Data Systems Grow
From Cluster Install with HA to Large Clusters to Dealing with Operational Issues. Cluster expansion with a single command: mammoth -e newhost1,…,newhostn. This expansion automatically optimizes the HA setup across multiple racks. Because of uniform nodes and InfiniBand networking, no data is moved. RCK_1, RCK_2.

20 Successful Big Data Systems Grow
From Cluster Install with HA to Large Clusters to Dealing with Operational Issues. Day n: critical node failure => primary Name Node. RCK_1, RCK_2.

21 Successful Big Data Systems Grow
From Cluster Install with HA to Large Clusters to Dealing with Operational Issues. Automatic failover to the other NameNode. Automatic Service Request to Oracle for the HW failure. RCK_1, RCK_2.

22 Successful Big Data Systems Grow
From Cluster Install with HA to Large Clusters to Dealing with Operational Issues. Restore HA with a single command: bdacli admin_cluster migrate N1. Reinstate the repaired node with a single command: bdacli admin_cluster reprovision N1. RCK_1, RCK_2.

23 Core Design Principles for Big Data Appliance
30% Quicker to Deploy. 21% Cheaper to Buy. “Oracle Big Data Appliance is an excellent choice for customers looking to work with the full suite of Cloudera’s leading Hadoop-based technology. It’s more cost-effective and quicker to deploy than a DIY cluster.” Mike Olson, Cloudera founder, Chief Strategy Officer, and Chairman of the Board. Operational Simplicity. It’s logical that an engineered system would be quicker to deploy than building your own; it’s the cheaper-to-buy claim that people don’t believe. But we worked with an analyst firm, ESG, which compared list prices and found that the BDA is at least 20% cheaper than building a comparable cluster yourself, assuming you have the time and skills to do so. The key word here is comparable: I’ve had lots of people tell me they could build one cheaper, but they turn out to have a fraction of the storage. These are large, dense storage nodes; if you fill a rack with cheap pizza-box servers you can do it much more cheaply, but you’ll have far less storage.

24 Big Data Initiative @ Oracle Global Support Services
Real-time access to better data means better insights, which means better decisions and better business results. Integrate data associated with customer telemetry, configurations, service history, diagnostics, knowledge & support information. Anticipate. Detect. Predict. Automate. Delight.

25 Core Design Principles Enable Success
Operational Simplicity Simplify Access to ALL Data

26 There is one more thing…
Business Value = Applications

27 Big Data Appliance powers instant Business Value
Customer Experience Management Communications Data Model Cyber Security Solutions

28 Introducing Paul Kent - SAS

29 Big Data and Big Analytics – So Much more Gunpowder!
Paul Kent VP BigData, SAS Research and Development

30

31 [CON8279] Oracle Big Data Appliance: Deep Dive and Roadmap for Customers and Partners
Oracle Big Data Appliance is the premier Hadoop appliance in the market. This session describes the roadmap for customers in the areas of high-performance SQL on Hadoop and securing big data, plus overall performance improvements for Hadoop. A special focus in the session is the roadmap and benefits Oracle Big Data Appliance brings to Oracle partners. To illustrate the benefits of running on a standardized and optimized Hadoop platform, SAS presents the findings of its tests of SAS In-Memory Analytics on Oracle Big Data Appliance.

32 Agenda SAS & Oracle Partnership Family Stories Deployment Patterns
Hadoop Oracle Engineered Systems Family SAS Software Family Deployment Patterns

33 Reflection on a stronger partnership than ever
Both leaders in Big Data, jointly solving the most difficult and demanding Big Data problems. Providing simplicity and agility to create flexible configurations. Extensive engineering collaboration. Can we answer: How does it work? How does it perform?

34 The tamoxifen dilemma
SOURCE: http://commons.wikimedia.org/wiki/File:Tamoxifen-3D-vdW.png

35 Agenda SAS & Oracle Partnership Family Stories Deployment Patterns
Hadoop Oracle Engineered Systems Family SAS Software Family Deployment Patterns

36

37 Elephant :: 3 Good Ideas !! Never forgets Is a good (hard) worker
Is a Social Animal (teamwork)

38 Hadoop – Simplified View
Controller and Worker Nodes: MPP (massively parallel) hardware running database-like software. “Data” is stored in parts across multiple worker nodes; “work” operates in parallel on the different parts of the table.

39 Idea #1 - HDFS. Never forgets!
Head Node Data 1 Data 2 Data 3 Data 4… MYFILE.TXT ..block1 -> block1 ..block2 -> block2 ..block3 -> block3

40 Idea #1 - HDFS. Never forgets!
Head Node Data 1 Data 2 Data 3 Data 4… MYFILE.TXT ..block1 -> block1 block1 copy2 ..block2 -> block2 block2 copy2 ..block3 -> block3 copy2 block3

41 Idea #1 - HDFS. Never forgets!
Head Node Data 1 Data 2 Data 3 Data 4… MYFILE.TXT ..block1 -> block1 block1 copy2 ..block2 -> block2 block2 copy2 ..block3 -> block3 copy2 block3 X X

42 Redundancy Wins!
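The same block-and-replica idea is what SAS programs rely on when they land files in HDFS. As a hedged illustration (not from the original deck), the sketch below uses PROC HADOOP's HDFS statement to copy a local file into the cluster; the configuration file, user name, and paths are placeholders, and the NameNode replicates the blocks across data nodes automatically, as shown above.

filename hdpcfg '/etc/hadoop/conf/core-site.xml';   /* placeholder Hadoop client config */

proc hadoop cfg=hdpcfg username='sasdemo' verbose;
   /* copy a local file into HDFS; block placement and the extra
      replicas are handled by the NameNode, not by this program */
   hdfs copyfromlocal='/local/data/myfile.txt' out='/user/sasdemo/myfile.txt';
run;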

43 Idea #2 – MapReduce – Send the work to the Data
We want the youngest person in the room. Each row in the audience is a data node; I'll be the coordinator. From outside to center, accumulate MIN. Sweep from back to front; the youngest advances.
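The deck returns to SAS procedures later, but the same "send the work to the data" pattern can be sketched here. This hedged example (not the presenters' code) uses PROC HPSUMMARY, one of the high-performance procedures listed later, to compute the minimum age in parallel: each worker node scans its own blocks and the controller combines the partial results, just like the row-by-row sweep described above. The hdplib library and audience table are illustrative placeholders, and an SAS HPA grid environment (GRIDHOST etc.) is assumed to be configured.

proc hpsummary data=hdplib.audience;      /* table distributed across the data nodes */
   var age;
   output out=work.youngest min=min_age;  /* "the youngest person in the room"       */
   performance nodes=all;                 /* run the scan on the worker nodes        */
run;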

44 Agenda SAS & Oracle Partnership Family Stories Deployment Patterns
Hadoop Oracle Engineered Systems Family SAS Software Family Deployment Patterns

45 Recap: Standard and Modular
Starter Rack is fully cabled and configured for growth, with 6 servers. In-Rack Expansion delivers a 6-server modular expansion block. Full Rack delivers the optimal blend of capacity and expansion options. Grow by adding racks: up to 18 racks without additional switches.

46 Oracle Big Data SQL – A New Architecture
Powerful, high-performance SQL on Hadoop: full Oracle SQL capabilities on Hadoop; SQL query processing local to the Hadoop nodes. Simple data integration of Hadoop and Oracle Database: a single SQL point-of-entry to access all data; scalable joins between Hadoop and RDBMS data. Optimized hardware: balanced configurations, no bottlenecks. Big Data SQL represents a new architecture for querying data in its natural format, wherever it lives, and, when running on Oracle Big Data Appliance and Oracle Exadata, provides a world-class Big Data Management System.

47 Diversity. It’s a good thing!
Impala Nyala

48 Agenda SAS & Oracle Partnership Family Stories Deployment Patterns
Hadoop Oracle Engineered Systems Family SAS Software Family Deployment Patterns

49 4 Important Things #1 Join the Family

50 SAS/ACCESS to Hadoop: SAS SERVER, HADOOP, HiveQL. #2 Be Familiar
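A hedged sketch of what "Be Familiar" means in practice: SAS/ACCESS Interface to Hadoop surfaces Hive tables as an ordinary SAS library, and eligible WHERE clauses are translated into HiveQL so the filtering runs in the cluster. The server name, port, schema, user, and table names below are illustrative placeholders.

libname hdplib hadoop server="bda1node03.example.com" port=10000
                      schema=default user=sasdemo;

proc freq data=hdplib.web_logs;
   where source_country = 'Brazil';   /* pushed down to Hive as HiveQL */
   tables source_country;
run;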

51 SAS / Embedded Process SAS SERVER SAS/Scoring Accelerator for Hadoop
SAS Data Step & DS2. SAS SERVER.

proc ds2;
   /* thread ~ equiv to a mapper */
   thread map_program;
      method run();
         set dbmslib.intab;
         /* program statements */
      end;
   endthread;
   run;

   /* program wrapper */
   data hdf.data_reduced;
      dcl thread map_program map_pgm;
      method run();
         set from map_pgm threads=N;
         /* reduce steps */
      end;
   enddata;
   run;
quit;

SAS/Scoring Accelerator for Hadoop. SAS/Code Accelerator for Hadoop. SAS/Data Quality Accelerator for Hadoop.
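The program above is the generic map/reduce-style DS2 pattern. As a hedged sketch (assuming the SAS In-Database Code Accelerator for Hadoop is licensed and hdplib is a SAS/ACCESS to Hadoop libref), adding DS2ACCEL=YES asks PROC DS2 to run the thread program inside the cluster, next to the data, instead of on the SAS server:

proc ds2 ds2accel=yes;
   thread map_program;
      method run();
         set hdplib.intab;   /* executes on the Hadoop data nodes */
         /* per-row program statements */
      end;
   endthread;
   run;

   data hdplib.data_reduced;
      dcl thread map_program map_pgm;
      method run();
         set from map_pgm;   /* gathers the thread output */
      end;
   enddata;
   run;
quit;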

52 SAS / High Performance Analytics
HADOOP SAS HPA Procedures SAS SERVER #3 Use the Cluster!

53 SAS / High Performance Analytics
Prepare, Explore/Transform, and Model procedures: HPDS2, HPDMDB, HPSAMPLE, HPSUMMARY, HPCORR, HPREDUCE, HPIMPUTE, HPBIN, HPLOGISTIC, HPREG, HPNEURAL, HPNLIN, HPCOUNTREG, HPMIXED, HPSEVERITY, HPFOREST, HPSVM, HPDECIDE, HPQLIM, HPLSO, HPSPLIT, HPTMINE, HPTMSCORE. This is the short list of HPA procs that run in Hadoop today: HPDS2 – parallel execution of DS2; HPDMDB – metadata definitions and data summarization; HPSAMPLE – sampling and data partitioning; HPSUMMARY – summarization and descriptive statistics; HPCORR – Pearson correlation coefficients, three nonparametric measures of association, and the probabilities associated with these statistics; HPREDUCE – unsupervised variable selection, covariance/correlation analysis, variable reduction; HPIMPUTE – missing value replacement; HPBIN – binning; HPLOGISTIC – logistic regression and variable selection; HPREG – linear regression and variable selection; HPNEURAL – neural networks; HPNLIN – nonlinear regression and maximum likelihood; HPCOUNTREG – regression of count variables; HPMIXED and HPLMIXED – linear mixed models; HPFOREST – random forest; HPSVM – support vector machine; HPDECIDE – decision/cost; HPLSO – lasso; HPTMINE – text mining; HPTMSCORE – text scoring. PROC HPREG is a high-performance combination of REG and GLMSELECT: it supports classical and modern variable selection techniques (LAR, LASSO), CLASS variables, GLM and reference parameterizations, and the SELECTION statement. PROC HPNLIN is a high-performance combination of NLIN and NLP/NLMIXED: classical nonlinear least squares (Levenberg-Marquardt), maximum likelihood for built-in distributions, maximum likelihood for general user-specified objective functions, and boundary and linear equality/inequality constraints.
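A hedged sketch of how one of these procedures is pointed at the cluster (the host name, install path, library, and model variables are illustrative placeholders, not the presenters' benchmark code): the GRIDHOST and GRIDINSTALLLOC settings identify the SAS High-Performance Analytics environment on the BDA, and the PERFORMANCE statement asks the procedure to run alongside the Hadoop data nodes.

option set=GRIDHOST="bda1node01.example.com";   /* HPA root node (placeholder)     */
option set=GRIDINSTALLLOC="/opt/TKGrid";        /* HPA install path (placeholder)  */

proc hplogistic data=hdplib.customers;
   class region;
   model churn_flag(event='1') = tenure_months monthly_spend region;
   performance nodes=all details;               /* distribute the work, show timings */
run;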

54 SAS / High Performance Analytics
Controller, Client. Some processes are more complex than fits “nicely” inside the terms & conditions of the container. We can use the embedded process as a data acquisition channel and perform the mathematics elsewhere (in the first generation, elsewhere meant other operating system processes on the same server, preserving a symmetric, 1:1 balance between the data parallelism and the mathematics parallelism). 2012 – SAS High Performance appliances for Teradata, Greenplum, Oracle and Hadoop.

55

56 #1 Join the Family #2 Be Familiar #3 Use the cluster #4 Have a pretty face!

57 SAS Visual Analytics Interactive exploration, dashboards and reporting
Auto-charting automatically picks the best graph. Forecasting, scenario analysis, decision trees, and other analytic visualizations. Text analysis and content categorization. Feature-rich mobile apps for iPad® and Android.

58

59 SAS Visual Statistics July-2014
Interactive, visual application for statistical modeling and classification. Multiple methods: logistic regression, GLM, trees, forests, clustering, and more. Model comparison and assessment. Group-by processing.

60

61 4 Important Things (for cluster-friendly software)
Join the Family Be Familiar Performance Have a pretty face

62 Agenda SAS & Oracle Partnership Family Stories Deployment Patterns
Hadoop Oracle Engineered Systems Family SAS Software Family Deployment Patterns

63 SAS Big Data on Big Data Appliance
Flexible architectural options for SAS deployments: can run on Starter, Half, and Full configurations. Optionally select nodes “N, N-1, N-2, …” for additional SAS services such as the SAS Compute Tier and SAS MidTier. Optionally select a node subset “N, N-1, N-2, N-3, …” to dedicate more resources to the SAS analytic compute environment by shifting Big Data Appliance roles. Option to selectively add more memory on a per-node basis depending on the specific workload distribution.

64 STARTER BDA: SAS HPA Root Node, SAS Visual Analytics Metadata Server, SAS Compute, SAS Midtier. SAS Visual Analytics / High-Performance Analytic compute environment co-located with Hadoop.

65 … … STARTER BDA Consider: Extra Memory for 5,6?
SAS HPA Root Node, SAS Visual Analytics Metadata Server, SAS Compute, SAS Midtier. Consider: extra memory for nodes 5 and 6? SAS Visual Analytics / High-Performance Analytic compute environment co-located with Hadoop.

66 FULL RACK BDA: SAS HPA Root Node, Metadata Server, SAS Compute, SAS Midtier; LASR Worker 18 / HDFS Data 18; LASR Worker 17 / HDFS Data 17. SAS Visual Analytics / High-Performance Analytic compute environment co-located with Hadoop.
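A hedged sketch of the co-located pattern this layout shows, with host names, port, and paths as illustrative placeholders: tables are written to HDFS in SASHDAT format on the same data nodes, a LASR Analytic Server is started across those nodes, and the SASHDAT table is lifted into its memory in parallel.

option set=GRIDHOST="bda1node04.example.com";   /* placeholder root node   */
option set=GRIDINSTALLLOC="/opt/TKGrid";        /* placeholder HPA install */

libname hdat sashdat path="/user/sas/hps";      /* SASHDAT files in HDFS   */

data hdat.web_logs;                             /* distribute the table    */
   set work.web_logs;                           /* across the data nodes   */
run;

proc lasr create port=10010 path="/tmp";        /* start LASR on the nodes */
   performance nodes=all;
run;

proc lasr add data=hdat.web_logs port=10010;    /* parallel load to memory */
run;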

67 Assembled in OSC, SYDNEY AUSTRALIA
FULL RACK BDA Assembled in OSC, SYDNEY AUSTRALIA

68 Assembled in OSC, SYDNEY AUSTRALIA
FULL RACK BDA Assembled in OSC, SYDNEY AUSTRALIA

69 Assembled in OSC, SYDNEY AUSTRALIA
FULL RACK BDA Assembled in OSC, SYDNEY AUSTRALIA

70 Assembled in OSC, SYDNEY AUSTRALIA
FULL RACK BDA Assembled in OSC, SYDNEY AUSTRALIA. Basic smoke tests confirmed: interoperate with Hadoop and MapReduce; read and write text files to/from HDFS; read and write tabular files to/from Hive (will confirm Oracle BIGSQL in OSC-SC); read and write SAS binary format files to/from HDFS; high degree of parallelism (DOP) reads via map-only jobs; SAS LASR server co-exists on/with data nodes; SAS HPA tasks scheduled on data nodes.
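A hedged sketch of the first of those smoke tests (writing a text file into HDFS from SAS and reading it back), assuming the FILENAME HADOOP access method; the configuration path, user, and file names are placeholders.

filename smoke hadoop '/user/sasdemo/smoke_test.txt'
               cfg='/etc/hadoop/conf/core-site.xml' user='sasdemo';

data _null_;            /* write a small text file into HDFS */
   file smoke;
   put 'hello from SAS on the BDA';
run;

data _null_;            /* read it back and echo it to the SAS log */
   infile smoke;
   input;
   put _infile_;
run;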

71 SAS High-Performance Analytics Performance SAS Format Data (SASHDAT)
SAS High-Performance Analytics Performance: SAS Format Data (SASHDAT). Data sets: 1107 var / Mobs / 97GB / 5.7GB/node vs. Mobs / 608GB / 35.7GB/node / 6x. Create: sec / sec / 11; Scan/Count: 24.60 sec / sec / 10.5; HPCORR: 295.20 / 4.7; HPCNTREG: 336.79 / 4.6; HPREDUCE (u): 236.55 / 10.4; HPREDUCE (s): 219.50 / 9.3. Table 1: Summation of 5/20/100/200 columns; Baseline: DOP=1 (no parallelism); 120M rows, 400 columns, reg_simtbl_400

72 OSC-AU FullRack BDA 408 Threads 600 GB dataset 17 servers
Your Problem solved ASAP

73

74

75

76 … … Exadata Integration: SAS Embedded Processing (EP) to Exadata
Leveraging Big Data SQL. SAS HPA Root Node, SAS Visual Analytics Metadata Server, SAS Compute, SAS Midtier, LASR Worker 18, HDFS Data 18, SAS EP, Big Data SQL.
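A hedged sketch of what "leveraging Big Data SQL" can look like from the SAS side: once Big Data SQL exposes HDFS data as an Oracle external table, SAS/ACCESS Interface to Oracle reads it like any other Oracle table, and Exadata and the BDA do the heavy scanning. The connection details and the WEB_LOGS table are illustrative placeholders.

libname exa oracle path="exadb" user=sas_user password="XXXXXXXX" schema=bda;

proc sql;
   /* exa.web_logs is assumed to be an Oracle external table that
      Big Data SQL maps onto data stored in HDFS on the BDA */
   select count(*) as brazil_sessions
   from exa.web_logs
   where source_country = 'Brazil';
quit;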

77 SAS High-Performance Analytics Performance
SAS High-Performance Analytics Performance: SAS EP Parallel Data Feeders. Columns: DOP=1, DOP=24, (flash cache). Add(5): 1.25min / 1.5min / .5min; Add(20): 2.5min; Add(100): 13min / .6min; Add(200): 16min / ~2min / 1.25min (10x). Table 1: Summation of 5/20/100/200 columns; Baseline: DOP=1 (no parallelism); 120M rows, 400 columns, reg_simtbl_400

78 SAS High-Performance Analytics Performance
SAS High-Performance Analytics Performance: SAS EP Parallel Data Feeders.
                 Access     Access / DBSlice   SAS HPA Using EP
Reg_sim_200      1:01:12    0:28:37            0:08:00
Reg_sim_400      1:49:11    0:55:33            0:16:05 (7x!)
Table 2: Scan times for 2 tables (200 columns, 400 columns, 120M rows); Baseline: SAS/ACCESS vs. HPA EP feeder

79 SAS High-Performance Analytics Performance
SAS High-Performance Analytics Performance: SAS Format Data (SASHDAT) and Oracle EXADATA. Data sets: SASHDAT: 1107 var / Mobs / 97GB / 5.7GB/node; EXADATA: 907 var / 79.7GB / 4.7GB/node; Mobs / 608GB / 35.7GB/node. Create: sec / sec / sec; Scan/Count: 24.60 sec / sec / sec; HPCORR: 295.20 / 833.24; HPCNTREG: 336.79 / 756.97; HPREDUCE (u): 236.55; HPREDUCE (s): 219.50. Table 1: Summation of 5/20/100/200 columns; Baseline: DOP=1 (no parallelism); 120M rows, 400 columns, reg_simtbl_400

80 Oracle Engineered Systems for
Exadata; Exalogic; SuperCluster; Big Data Appliance; ZFS Storage Appliance; Virtual Compute Appliance; Database Backup, Recovery, Logging Appliance

81 Working together to create customer value
SAS and Oracle, working together to create customer value: joint R&D and Product Management teams in Cary and Redwood Shores; focus on driving SAS technology components to run natively in Oracle Database; joint performance engineering optimizations; template physical architectures developed based on use cases; physically tested and benchmarked together; reduction in physical effort; overall reduction in lifecycle costs; best-practice papers; SAS and Oracle engineers provide joint "Sizing and Architecture Analysis and Design".

82 SAS and Oracle. Paul.Kent@sas.com, @hornpolish, paulmkent
Better Together

83

84

