SAS on Your Cluster Serving your Data (Analysts)

Presentation transcript:

SAS on Your Cluster Serving your Data (Analysts)
SAS is both a language and an application for doing analytics on all manner of data. Recently SAS has adapted to the Hadoop ecosystem and intends to be a good citizen among the different choices for processing large volumes of data on your cluster.

Agenda Two ways to push work to the cluster… Using SQL Using a SAS Compute Engine on the cluster Data Implications Data in SAS Format, produce/consume with other tools Data in other Formats, produce/consume with SAS HDFS versus the Enterprise DBMS

Using SQL
LIBNAME olly HADOOP SERVER=mycluster.mycompany.com USER="kent" PASS="sekrit";
PROC DATASETS LIB=OLLY; RUN;

Using SQL
[Diagram: SAS Server, Hadoop Access Method, Hadoop Cluster (Controller, Workers)]
LIBNAME olly HADOOP SERVER=hadoop.company.com USER="paul" PASS="sekrit";
PROC XYZZY DATA=olly.table; RUN;
Queries shown in the diagram: Select * From olly; Select * From olly_slice
Result returned: Potentially Big Data

Using SQL
[Diagram: SAS Server, Hadoop Access Method, Hadoop Cluster (Controller, Workers)]
LIBNAME olly HADOOP SERVER=hadoop.company.com USER="paul" PASS="sekrit";
PROC MEANS DATA=olly.table; BY GRP; RUN;
Queries shown in the diagram: Select sum(x), min(x) … From olly Group By GRP; Select sum(x), min(x) … From olly_slice Group By GRP
Result returned: Aggregate Data ONLY
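The difference between the two "Using SQL" slides (Select * versus a pushed-down Group By) can be sketched in a few lines. This is only an illustration, not how the SAS Hadoop access method is implemented: an in-memory SQLite database stands in for the cluster's SQL engine, and the table name olly and the columns grp and x are borrowed from the slide text, with made-up sample rows.

```python
import sqlite3

# In-memory SQLite stands in for the cluster's SQL engine (an assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE olly (grp TEXT, x REAL)")
rows = [("a", 1.0), ("a", 3.0), ("b", 2.0), ("b", 5.0), ("b", 8.0)]  # made-up sample data
conn.executemany("INSERT INTO olly VALUES (?, ?)", rows)

# Pattern 1: "Select * From olly". Every row travels to the client,
# which then computes the summary statistics itself.
pulled = conn.execute("SELECT grp, x FROM olly").fetchall()
client_side = {}
for grp, x in pulled:
    total, low = client_side.get(grp, (0.0, float("inf")))
    client_side[grp] = (total + x, min(low, x))

# Pattern 2: the aggregate is pushed down, so only one row per group comes back.
pushed = conn.execute(
    "SELECT grp, SUM(x), MIN(x) FROM olly GROUP BY grp ORDER BY grp"
).fetchall()
server_side = {grp: (total, low) for grp, total, low in pushed}

rows_moved_pull = len(pulled)   # number of rows transferred without pushdown
rows_moved_push = len(pushed)   # number of rows transferred with pushdown
```

Both patterns produce the same per-group sum and minimum; what changes is how many rows cross the wire, which is the point of the "Aggregate Data ONLY" label.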

Using SQL
Advantages: Same SAS syntax (people skills). Convenient. Gateway drug :)
Disadvantages: Not really taking advantage of the cluster. Potentially large datasets are still transferred to the SAS Server. Not many techniques pass through: basic summary statistics, YES; higher-order math, NO.

Agenda Two ways to push work to the cluster… Using SQL Using a SAS Compute Engine on the cluster Data Implications Data in SAS Format, produce/consume with other tools Data in other Formats, produce/consume with SAS HDFS versus the Enterprise DBMS

Hadoop 2.0 :: YARN to the rescue
Processing engines: MapReduce, Storm, Spark, Impala, Tez, SAS
Resource management: YARN, or better resource management
Storage: HDFS

2013q4? 2014?

Hadoop and her 2 beautiful things
1. I will spread your data out over many servers to keep it safe.
2. I will facilitate a new idea: that you should send the work to the data, not the other way around.

Why Do This? Because it gets the answers soooo much faster.
[Diagram labels: Client, NameNode]
Some processes are more complex than fits "nicely" inside the terms and conditions of the container. We can use the embedded process as a data acquisition channel and yet perform the mathematics elsewhere. (In the first generation, "elsewhere" meant other operating system processes on the same server, preserving a symmetric or 1:1 balance between the data parallelism and the mathematics parallelism.)
2012: SAS High Performance appliances for Teradata, Greenplum and Hadoop.

SAS Server and Appliance (General, Captains; Controller, Workers)
libname joe sashdat "/hdfs/..";
proc hpreg data=joe.class; class sex; model age = sex height weight; run;
[Diagram: Controller and Workers connected by MPI; each worker runs Math over its own HDFS blocks (BLKs)]
Copyright © 2012, SAS Institute Inc. All rights reserved.
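One way to picture "Math" running next to each set of HDFS blocks is a regression fitted from per-block summaries that are then combined. The sketch below is a generic illustration of that idea for a single predictor. It is not the algorithm PROC HPREG uses, and the function names and sample blocks are made up for the example.

```python
from functools import reduce

def block_stats(block):
    """Per-worker step: reduce one block of (x, y) rows to sufficient statistics."""
    n = len(block)
    sx = sum(x for x, _ in block)
    sy = sum(y for _, y in block)
    sxx = sum(x * x for x, _ in block)
    sxy = sum(x * y for x, y in block)
    return (n, sx, sy, sxx, sxy)

def combine(a, b):
    """Controller step: add two workers' statistics element by element."""
    return tuple(p + q for p, q in zip(a, b))

def fit(stats):
    """Ordinary least squares for y = intercept + slope * x from the combined statistics."""
    n, sx, sy, sxx, sxy = stats
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept

# Three made-up "blocks", as if each lived on a different worker; the rows follow y = 2x + 1.
blocks = [
    [(1.0, 3.0), (2.0, 5.0)],
    [(3.0, 7.0)],
    [(4.0, 9.0), (5.0, 11.0)],
]
slope, intercept = fit(reduce(combine, map(block_stats, blocks)))
```

Only five numbers per block leave each worker, so the raw rows never have to move. That is the "send the work to the data" idea from the earlier slide.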

SAS Server and Appliance (General, Captains; Controller, Workers)
libname joe sashdat "/hdfs/..";
proc hpreg data=joe.class; class sex; model age = sex height weight; run;
[Diagram: Controller and Workers connected by MPI; each worker runs TK, fed by the mappers (MAPr) of a MAP REDUCE JOB]

Single / Multi-threaded
proc logistic data=TD.mydata; class A B C; model y(event='1') = A B B*C; run;
Not aware of the distributed computing environment. Computes locally / where called. Fetches data as required. Memory still a constraint.
Massively Parallel (MPP)
proc hplogistic data=TD.mydata; class A B C; model y(event='1') = A B B*C; run;
Uses the distributed computing environment. Computes in massively distributed mode. Work is co-located with data. In-memory analytics: 40 nodes x 96 GB is almost 4 TB of memory.
Copyright © 2010, SAS Institute Inc. All rights reserved.

SAS® In-Memory Analytics
Products: SAS® High-Performance Statistics, SAS® High-Performance Econometrics, SAS® High-Performance Optimization, SAS® High-Performance Data Mining, SAS® High-Performance Text Mining, SAS® High-Performance Forecasting
Procedures: HPLOGISTIC, HPREG, HPLMIXED, HPNLMOD, HPSPLIT, HPGENSELECT, HPCOUNTREG, HPSEVERITY, HPQLIM, HPLSO, select features in OPTMILP, OPTLP and OPTMODEL, HPREDUCE, HPNEURAL, HPFOREST, HP4SCORE, HPDECIDE, HPTMINE, HPTMSCORE, HPFORECAST
Common set (HPDS2, HPDMDB, HPSAMPLE, HPSUMMARY, HPIMPUTE, HPBIN, HPCORR): a common set of HP procedures will be included in each of the individual SAS HP "Analytics" products.
New recently. More coming for Xmas!

Scalability on a 12-Core Server

Acceleration by factor 106!
Configuration: Client, 24 cores (workflow step: CPU runtime, ratio)
Explore (100K): 00:01:07:17, 4.2
Partition: 00:07:54:04, 19.5
Impute: 00:01:19:84, 7.7
Transform: 00:09:45:01, 13.2
Logistic Regression (Step): 04:09:21:61, 131.5
Total: 04:29:27:67, 106.1
Configuration: HPA Appliance, 32 x 24 = 768 cores (workflow step: CPU runtime)
Explore: 00:00:15:81
Partition: 00:00:21:52
Impute: 00:00:21:47
Transform: 00:00:44:28
Logistic Regression: 00:01:37:99
Total: 00:02:21:07
Server scaled up by a factor of 12, appliance by a factor of 32. If the neural network were included in the comparison, it would be ~19 h versus 3 min.

Acceleration by factor 322!
Configuration: Client, 24 cores (workflow step: CPU runtime, ratio)
Explore: 00:01:07:17, 4.2
Partition: 01:01:09:31, 170.5
Impute: 00:02:45:81, 7.7
Transform: 01:26:06:22, 116.7
Neural Net: 18:21:28:54, 478.9
Total: 20:52:37:05, 313
Configuration: HPA Appliance, 32 x 24 = 768 cores (workflow step: CPU runtime)
Explore: 00:00:15:81
Partition: 00:00:21:52
Impute: 00:00:21:47
Transform: 00:00:44:28
Neural Net: 00:02:17:40
Total: 00:04:00:48
Server scaled up by a factor of 12, appliance by a factor of 32. If the neural network were included in the comparison, it would be ~19 h versus 3 min.
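The Ratio column on these benchmark slides appears to be the client runtime divided by the appliance runtime. The sketch below recomputes two of the ratios from the neural-net slide under one assumption: that the four-part stamps read hours:minutes:seconds:hundredths. The helper names are made up for the example.

```python
def to_seconds(stamp):
    """Convert 'hh:mm:ss:cc' to seconds, assuming cc is hundredths of a second."""
    hours, minutes, seconds, hundredths = (int(part) for part in stamp.split(":"))
    return hours * 3600 + minutes * 60 + seconds + hundredths / 100

def speedup(client, appliance):
    """Ratio of client runtime to appliance runtime."""
    return to_seconds(client) / to_seconds(appliance)

# Figures copied from the neural-net slide: client stamp first, appliance stamp second.
total = speedup("20:52:37:05", "00:04:00:48")
partition = speedup("01:01:09:31", "00:00:21:52")

print(round(total), round(partition, 1))  # prints: 313 170.5
```

Under that reading of the stamps, the recomputed Total and Partition ratios agree with the 313 and 170.5 printed on the slide.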

Agenda Two ways to push work to the cluster… Using SQL Using a SAS Compute Engine on the cluster Data Implications Data in SAS Format, produce/consume with other tools Data in other Formats, produce/consume with SAS HDFS versus the Enterprise DBMS

Data Choices
Hadoop format: Sequence, Avro, Trevni, ORC, Parquet
SAS format: SASHDAT

Processing Choices
Columns: Hadoop format (Sequence, Avro, Trevni, ORC, Parquet) | SAS format (SASHDAT)
Rows: Process with Hadoop Tools | Process with SAS
NorthEast and SouthWest quadrants are the interoperability challenges!

Processing Choices
Columns: Hadoop format (Sequence, Avro, Trevni, ORC, Parquet) | SAS format (SASHDAT)
Process with Hadoop Tools: Hadoop format ✔✔✔
Process with SAS: SAS format ✔✔✔
NorthEast and SouthWest quadrants are the interoperability challenges!

Teach Hadoop (Pig) about SAS: Hadoop (Pig) Learns SAS Tables
register pigudf.jar, sas.lasr.hadoop.jar, sas.lasr.jar;
/* Load the data from sashdat */
B = load '/user/kent/class.sashdat' using com.sas.pigudf.sashdat.pig.SASHdatLoadFunc();
/* perform word-count */
Bgroup = group B by $0;
Bcount = foreach Bgroup generate group, COUNT(B);
dump Bcount;

Teach Hadoop (Pig) about SAS: Hadoop (Pig) Learns SAS Tables
register pigudf.jar, sas.lasr.hadoop.jar, sas.lasr.jar;
/* Load the data from a CSV in HDFS */
A = load '/user/kent/class.csv' using PigStorage(',') as (name:chararray, sex:chararray, age:int, height:double, weight:double);
/* Store the data as a SASHDAT table */
store A into '/user/kent/class' using com.sas.pigudf.sashdat.pig.SASHdatStoreFunc('bigcdh01.unx.sas.com', '/user/kent/class_bigcdh01.xml');

Processing Choices
Columns: Hadoop format (Sequence, Avro, Trevni, ORC, Parquet) | SAS format (SASHDAT)
Process with Hadoop Tools: Hadoop format ✔✔✔, SAS format ✔✔✔
Process with SAS: SAS format ✔✔✔
NorthEast and SouthWest quadrants are the interoperability challenges!

Teach Hadoop (MAP/REDUCE) about SAS: How about the other way?
/* Create HDMD file */
proc hdmd name=gridlib.people format=delimited sep=tab file_type=custom_sequence
  input_format='com.sas.hadoop.ep.inputformat.sequence.PeopleCustomSequenceInputFormat'
  data_file='people.seq';
  column name varchar(20) ctype=char;
  column sex varchar(1) ctype=char;
  column age int ctype=int32;
  column height double ctype=double;
  column weight double ctype=double;
run;

HIGH-PERFORMANCE ANALYTICS Alongside Hadoop (Symmetric)
SAS Server and Appliance (General, Captains; Controller, Workers)
libname joe hadoop … ;
proc hpreg data=joe.class; class sex; model age = sex height weight; run;
[Diagram: Controller and Workers connected by MPI; each worker runs TK, fed by the mappers (MAPr) of a MAP REDUCE JOB]

Processing Choices
Columns: Hadoop format (Sequence, Avro, Trevni, ORC, Parquet) | SAS format (SASHDAT)
Process with Hadoop Tools: Hadoop format ✔✔✔, SAS format ✔✔✔
Process with SAS: Hadoop format ✔✔✔, SAS format ✔✔✔
NorthEast and SouthWest quadrants are the interoperability challenges!

Agenda Two ways to push work to the cluster… Using SQL Using a SAS Compute Engine on the cluster Data Implications Data in SAS Format, produce/consume with other tools Data in other Formats, produce/consume with SAS HDFS versus the Enterprise DBMS

Reference architecture CLIENT GREENPLUM TERADATA ORACLE HADOOP

Hadoop vs EDW
Hadoop excels at: 10x cost/TB advantage; not-yet-structured datasets; >2000 columns, no problems; incremental growth "practical"; discovery and experimentation; variable selection; model comparison.
EDW still wins: SQL applications; pushing analytics into LOB apps; operational CRM; optimization.

Thank You! Paul.Kent @ sas.com @hornpolish paulmkent