© 2012 IBM Corporation 1 IBM PureData for Analytics Clustering three ways with Open Source R.

Presentation transcript:

Slide 1: IBM PureData for Analytics: Clustering three ways with Open Source R

Slide 2: Using R with PureData for Analytics

Four scenarios:

- Small data outside the database (single model, serial model processing): pull data down from the database; run R on a desktop or dedicated server.
- Large data inside the database (single model, serial model processing): call INZA functions from R; process data directly against DB tables.
- Many small data sets inside the database (many models, parallel model processing, e.g. bulk parallel execution): push R into the database; process data directly against DB tables.
- Small data inside the database (single model, serial model processing): push R into the database; process data directly against DB tables.

Slide 3: Using R with PureData for Analytics (continued)

The analysis looks only at the last three scenarios, those that keep the data inside the database:

- Large data, single model: call INZA functions from R.
- Many small data sets, many models: push R into the database (bulk parallel execution).
- Small data, single model: push R into the database.

Slide 4: Comparing performance for a single model in-database

Table columns:

  Number of observations | INZA wrapper from R: nzKMeans | cclust run in-database (IDB) with nzSingleModel

Each timing cell holds the user, system and elapsed times. The first row is for 500,000 observations; the four remaining rows are multiples of 1,000,000 observations. The timing values and the leading digits of those four row labels are not preserved in this transcript.

Notes:

- nzKMeans would be expected to outperform in-database cclust somewhere between 5M and 6M observations.
- Tests were run on a first-generation TwinFin.
- Variations in the performance numbers are relative, because the system was in use during testing.
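The user, system and elapsed figures on this slide are the three values that R's system.time() reports. The sketch below shows where such a triple comes from on a desktop. It is only a local illustration: it uses base R's kmeans() as a stand-in for nzKMeans and cclust, and the synthetic data set (its size and its 15 columns) is an assumption made here, not the benchmark data.

```r
# Local, base-R illustration only: stats::kmeans stands in for nzKMeans/cclust.
set.seed(1234)                              # reproducible synthetic data
x <- matrix(rnorm(2000 * 15), ncol = 15)    # 2,000 rows, 15 numeric columns (hypothetical)

# Time a single k-means fit with 5 clusters, mirroring k = 5 on the code slide.
timing <- system.time(
  fit <- kmeans(x, centers = 5, iter.max = 1000)
)

# system.time() returns a "proc_time" object. Its first three entries are
# user CPU seconds, system CPU seconds and wall-clock (elapsed) seconds.
print(timing[c("user.self", "sys.self", "elapsed")])
```

Elapsed time is the figure that reflects what a user waits for, which is why it is the one most sensitive to other load on the system.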

Slide 5: Bulk-parallel execution of cclust (10K observations for each model)

Table columns:

  Number of models | cclust run in-database (IDB) with nzBulkModel | Average time per model

Each timing cell holds the user, system and elapsed times. The table has three rows; the first is for 50 models. The other row labels and all timing values are not preserved in this transcript.

In general, these results would be significantly better than running cclust serially in a dedicated environment, simply because of R execution overhead and the additional time required for data movement and/or partitioning.

Slide 6: Clustering three ways with Open Source R and IBM PureData for Analytics

- Using the wrapper for INZA KMEANS, single model (stores the resulting model in-database):

    data.nz <- nz.data.frame("BENCHMARK_DATA")
    system.time(
      nz.clust5 <- nzKMeans(data.nz, k = 5, maxiter = 1000, distance = "euclidean",
                            id = "ID", getLabels = FALSE, randseed = 1234,
                            outtable = "admin.DATA_2_clust5d",
                            format = "kmeans", dropAfter = TRUE)
    )

- Running R in-database, single model (returns the resulting model to the client):

    system.time(
      data.cclust <- nzSingleModel(data.nz[, 2:16],
        function(df) {
          require(cclust)
          cclust(as.matrix(df), 5, iter.max = 1000, verbose = FALSE,
                 dist = "euclidean", method = "kmeans")
        },
        force = TRUE)
    )

- Running R in-database, bulk parallel model (stores the resulting models in-database, returns a list of models by INDEX):

    # ua_ct is col 6, the "index" or grouping column
    # (the upper bound after data.nz$ID< is missing from the transcript)
    system.time(
      data.cclust <- nzBulkModel(data.nz[data.nz$ID< , 2:16], 6,
        function(df) {
          require(cclust)
          cclust(as.matrix(df), 5, iter.max = 1000, verbose = FALSE,
                 dist = "euclidean", method = "kmeans")
        },
        output.name = "CCLUSTBULKMODEL", clear.existing = TRUE)
    )
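The bulk parallel variant fits one model per value of a grouping column and returns the models keyed by that value. A rough desktop analogue of that pattern, in base R only, might look like the sketch below. It is not how nzBulkModel is implemented; kmeans() stands in for cclust, it runs serially, and the data frame, the grp column and the four feature columns are all hypothetical.

```r
# Base-R sketch of the "one model per group" pattern; not the nzBulkModel implementation.
set.seed(1234)
n <- 3000
df <- data.frame(
  grp = rep(c("A", "B", "C"), each = n / 3),   # hypothetical grouping ("index") column
  matrix(rnorm(n * 4), ncol = 4)               # four hypothetical numeric features
)

# Split the feature columns by group, then fit one k-means model per piece.
pieces <- split(df[, -1], df$grp)
models <- lapply(pieces, function(d) {
  kmeans(as.matrix(d), centers = 5, iter.max = 1000)
})

names(models)            # one entry per group value
models[["A"]]$centers    # cluster centres for group "A"
```

Here lapply() visits the groups one after another; the point of running the same function in-database is that the groups can be processed in parallel, next to the data.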

Slide 7: Bulk-parallel execution of cclust: result details

Table columns:

  Number of rows | Number of models | Overall timings (user, system, elapsed) | Average elapsed per model | Rows per model

  Number of rows   Number of models
  0.5 M            50
  1 M              100
  2 M              100
  4 M              500
  5 M              500

The timing values, the average elapsed time per model and the rows-per-model figures (given in K) are not preserved in this transcript.
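The two derived columns in this table are simple ratios. The sketch below works them out under two assumptions that the slide does not confirm: that rows are spread evenly across models, and that the average elapsed time per model is the overall elapsed time divided by the number of models. The 120-second input is a made-up value used only to show the calculation, not a measured result.

```r
# Row and model counts as listed on the slide.
rows   <- c(0.5e6, 1e6, 2e6, 4e6, 5e6)
models <- c(50, 100, 100, 500, 500)

# Rows per model, ASSUMING an even split of rows across models.
rows_per_model <- rows / models
print(rows_per_model)   # values: 10000, 10000, 20000, 8000, 10000

# Average elapsed seconds per model, ASSUMING overall elapsed / number of models.
avg_elapsed <- function(elapsed_seconds, n_models) {
  elapsed_seconds / n_models
}

# Hypothetical input: 120 s overall for 50 models (not a figure from the slides).
avg_elapsed(120, 50)
```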