Published by Marilyn Mills. Modified over 7 years ago.
1
University of Arkansas Data Mining with Teradata™ Warehouse Miner
Jim Kashner CTO Data Mining
2
Copyright 2004 Teradata, a division of NCR
The Empirical Method and Decision Support …
[Diagram labels: Notion, Data, Analysis, Interpretation, Supposition, Proposition, Hypothesis, Refined Hypothesis]
All of the information in this presentation is “Jim’s opinions numbers 8 through 224” (for today).
… a framework for making decisions in the presence of uncertainty
  seeks to shed light on the validity or plausibility of notions, suppositions, propositions, and hypotheses
  is iterative and circular
  you don’t ever finish, you just stop at some point
11/30/2004 Copyright Teradata, a division of NCR
3
Teradata Warehouse Miner Technology Enablers for the Data Mining Process
The various releases of Teradata Warehouse Miner are intended to serve as very powerful technology enablers for the Data Mining Process.
But Tools Don’t Build Models, Thoughtful People Do
  When a good tool between the ears drives the data mining process, good models are built.
  When too much is asked of analytical software, the risk of spurious and invalid models rises proportionately.
But thoughtful people who build models can also be helped by having a proven and generic process to follow.
  The formal Teradata Data Mining Method is one of several good processes used to conduct successful data mining projects.
  Its foundation is the “tried and true” empirical method.
  It’s not a prescription, just a set of carefully constructed suggestions.
4
Teradata Data Mining Method
[Process diagram labels: Project Management; Knowledge Transfer; Business Issues; Architecture and Technology Preparation; Data; Analytical Modeling; Knowledge Delivery and Deployment]
Data mining is a very iterative process.
The linear process depicted above serves as a guide and identifies the chunky bits of the process.
5
Data Mining with Teradata Warehouse Miner Teradata’s Data Mining Method – Our Process
[Process diagram]
Architecture and Technology Preparation
TWM – Stats & ADS
  Business Question Identification and Qualification
  “Data Profiling”
  Data Preparation and PreProcessing
    Data Exploration
    Data Transformation
TWM – Analytics
  Analytic Modeling
    Model Construction and Evaluation
    Multivariate Statistics
    Machine Learning Algorithms
    Highly Iterative Process
TWM – Deployment
  Model Deployment
  Model Deployment and Maintenance
    Scoring & Evaluation
    Lifecycle Maintenance
Project Management -and- Knowledge Transfer
6
Data Mining and the Empirical Method
Data mining is not automated discovery of hidden patterns in your data.
Data mining is thoughtful and technology-enabled discovery of hidden patterns in your data.
Welcome to the empirical method.
7
Teradata as an Analytic Engine
Teradata is especially well-suited to perform complex aggregations and evaluations of sets according to conditional logic
  native Teradata functions expressed as SQL
  where indexes cannot reasonably be expected to exist for any particular aggregation, set evaluation, or conditional logic
Analytical modeling algorithms require an engine that can perform complex aggregations and evaluations of sets according to conditional logic
  the very good fit of Teradata as an analytic engine is rather obvious after considering what analytical modeling algorithms actually do under the hood
8
Said another way ...
Given: The following notation is used in virtually all statistical, artificial intelligence, and machine learning algorithms to denote the equations that represent and calculate data mining models:
  ∫ f(x) - which means sum
  Σ f(x) - which also means sum
  Π f(x) - which means multiply
  ∈ and ∉ - which mean “is” and “is not” an element of (set theory)
Question: What do they all have in common?
Answer: All of these are what Teradata does better than any other engine on this planet.
Note: f(x) stands for other supported functions, mathematical and otherwise, either native Teradata functions or those that can be expressed very efficiently in SQL with Teradata extensions.
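As a rough illustration of the notation above, the sketch below maps Σ, Π, and ∈/∉ onto SQL aggregation and set predicates. It uses Python's built-in sqlite3 module purely as a portable stand-in engine (an assumption; the slide is about Teradata), and the table `obs`, its columns, and the user-defined `PRODUCT` aggregate are hypothetical names introduced for the example.

```python
import sqlite3


class Product:
    """User-defined aggregate standing in for the Π (product) operator."""

    def __init__(self):
        self.value = 1.0

    def step(self, x):
        # Skip NULLs, as the built-in SQL aggregates do.
        if x is not None:
            self.value *= x

    def finalize(self):
        return self.value


def notation_demo():
    """Return (Σx, Πx, count of rows ∈ set, count of rows ∉ set)."""
    conn = sqlite3.connect(":memory:")
    conn.create_aggregate("PRODUCT", 1, Product)
    # Hypothetical table and data, for illustration only.
    conn.execute("CREATE TABLE obs (id INTEGER, x REAL, segment TEXT)")
    conn.executemany(
        "INSERT INTO obs VALUES (?, ?, ?)",
        [(1, 2.0, "a"), (2, 3.0, "b"), (3, 4.0, "c"), (4, 5.0, "a")],
    )
    row = conn.execute(
        """
        SELECT SUM(x),                                              -- Σ
               PRODUCT(x),                                          -- Π
               SUM(CASE WHEN segment IN ('a', 'b') THEN 1 ELSE 0 END),     -- ∈
               SUM(CASE WHEN segment NOT IN ('a', 'b') THEN 1 ELSE 0 END)  -- ∉
        FROM obs
        """
    ).fetchone()
    conn.close()
    return row
```

The point of the sketch is only that each symbol reduces to an aggregation or a conditional set test evaluated in one pass over the table, which is the kind of work the slide attributes to the database engine.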
9
Teradata Warehouse Miner is an ongoing experiment
TeraMiner™ Stats: June, 1999
Teradata Warehouse Miner Stats, Analytics, & Deployment: July, 2001
Teradata Warehouse Miner Stats, Analytics, Deployment, & ADS (Analytical Data Set generation): June, 2004
Additional functionality is continually added in subsequent releases to each of these components of Teradata Warehouse Miner.
Because of our success with this “experimental approach”, we continue to ask: “Why not?”
  Teradata continues to amaze us by what it can do
  our Teradata Warehouse Miner Software Engineering Team is quite amazing too
10
What is Teradata Warehouse Miner?
TWM includes a set of .NET Interfaces and a User Interface
  generates and executes Teradata-specific SQL
    ANSI SQL when possible
  instantiated by User Interface
  easily integrated into other applications (partners, custom)
  all analysis parameters, model definition, and analysis results stored in metadata
  select results or explain, or persist results in table, temporary table or view
TWM includes several types of .NET Interfaces
  Registry independent application extensions or plug-ins
  Teradata Warehouse Miner Descriptive Statistics DLL
  Teradata Warehouse Miner ADS DLL
  Teradata Warehouse Miner Data Reorganization DLL
  Teradata Warehouse Miner Analytic Algorithm & Scoring DLLs (4)
  Teradata Warehouse Miner Matrix DLL
  Teradata Warehouse Miner Statistical Test DLL
TWM includes a GUI for the desktop
  User interface to .NET Objects
  Queries Teradata Data Dictionary to aid in parameterizing functions
    directly using HELP syntax
    optionally, MDS DIM (Metadata Services Database Information Model)
  Interactive display of results – SQL, Data, Graphs, Reports
11
Teradata Warehouse Miner High Level Architecture
[Architecture diagram]
Teradata Platform: Teradata RDBMS Version 2 Release 4.1 or later
Client Platform: Windows NT, 2000, XP, .NET 2003 Server
Layers: User Interface Services; Business Services; Data Services
Components: Manager; Algorithms (COM); Algorithms (.NET); Data Access (Teradata ODBC); Metadata Access (Projects, Analyses); Teradata Metadata Services; User Interface; Visualizations
Teradata Warehouse Miner Windows Interface
  build, maintain, and execute projects
  explore and manipulate results
    tabular and graphical
  parameterize .NET APIs
.NET APIs & ADO
  .NET Interfaces (APIs) documented for developers
  ActiveX Data Objects
  DLL interface “plug-ins”
  write all API parameters and all XML results in TWM metadata
    stored in binary data type
  generate & submit SQL
  receive query results from Teradata and present them in user interface
  read model definition and results stored in TWM metadata to display XML reports and graphs
  read model definition in TWM metadata to score and evaluate
12
Teradata Warehouse Miner Data Description Functions
Univariate Statistics
  Count; Minimum, Maximum; Modes; Mean; Standard Deviation; Standard Error; Variance; Coefficient of Variation; Skewness; Kurtosis; Uncorrected Sum of Squares; Corrected Sum of Squares
  Quantiles and Ranks
    Top 10/Bottom 10 Percentiles; Deciles; Quartiles; Tertiles
    Top 5/Bottom 5 Ranked Values with Counts
Scatter Plot Analysis
  2-D and 3-D Plots of Continuous Variables
Correlation Analysis
  Quickly view pair-wise correlations among ‘n’ variables
Values Analysis (basic data quality analysis)
  Data Types; Counts; # NULL Values; # Positive Values; # Negative Values; # Zeros; # Blanks; # Unique Values
Frequency Analyses
  Frequency of Discrete Variables
  N-Way Cross-Tabulation
  Pair-wise Cross-Tabs
Histogram Analyses
  Histograms of Continuous Variables
  Options for
    Even Width
    User Defined Widths/Boundaries
    Quantile Adaptive Binning
  Overlay columns
  Statistics within bins
Overlap Analysis
  Index/Key Column Consistency
Data Explorer
  Performs basic statistical analysis on a set of tables and selected columns within any Teradata database
  Intelligent decisions about which functions to perform
    Most criteria for “Intelligent” decisions can be modified by user
  Values Analysis - every column in the set of input tables
  Univariate Statistical Analysis - every column of numeric or date type
  Frequency Analysis - every column that has no more than a threshold number of unique values
  Histogram Analysis - every numeric or date type column that has more than a threshold number of unique values
Data Visualizations
  2D & 3D Histograms
  2D & 3D Frequency Bar Charts
  Values Bar Charts & Circular Graphs
  Box and Whisker Plots
  Scatter Plots
  Integrated Data Explorer Graphics
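Several of the univariate statistics listed on this slide can be derived from a handful of SQL aggregates. The following is a minimal sketch of that idea, not the SQL that Teradata Warehouse Miner generates: sqlite3 stands in for the database, the table `t` and column `x` are hypothetical, and the sample (n - 1) variance is an assumption, since the slide does not say which form is used.

```python
import math
import sqlite3


def univariate_stats(values):
    """Derive basic univariate statistics from in-database aggregates.

    Requires at least two non-NULL values and a non-zero mean.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (x REAL)")  # hypothetical table
    conn.executemany("INSERT INTO t VALUES (?)", [(v,) for v in values])
    # One pass over the table: count, sum, uncorrected sum of squares, min, max.
    n, s, uss, lo, hi = conn.execute(
        "SELECT COUNT(x), SUM(x), SUM(x * x), MIN(x), MAX(x) FROM t"
    ).fetchone()
    conn.close()

    mean = s / n
    css = uss - s * s / n        # corrected sum of squares
    variance = css / (n - 1)     # sample variance (assumption)
    std = math.sqrt(variance)
    return {
        "count": n,
        "minimum": lo,
        "maximum": hi,
        "mean": mean,
        "uncorrected_ss": uss,
        "corrected_ss": css,
        "variance": variance,
        "std_dev": std,
        "std_error": std / math.sqrt(n),
        "coef_of_variation": std / mean,  # often reported multiplied by 100
    }
```

Only the aggregation touches the data; everything after the single SELECT is arithmetic on five numbers, which is why this kind of analysis pushes down to the database so naturally.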
13
Teradata Warehouse Miner Data Derivation and Transformation Functions
Variable Creation
  Aggregations
    Count, Average, Sum, etc.
  Windowed Aggregates/OLAP
    Rank, Quantiles, Moving Sums, etc.
  Arithmetic operators/functions
    +, -, *, /, MOD, **
    ABS, EXP, LN, LOG, SQRT, etc.
  Trigonometric & Hyperbolic functions
    COS, SIN, TAN, ACOS, etc.
    COSH, SINH, TANH, ACOSH, etc.
  CASE expressions and NULL operators
    valued and searched types
    NULLIF, COALESCE
  Comparison operators
    =, >, <, <>, <=, >=
  Logical predicates
    BETWEEN…AND…, IN (expression list), etc.
Variable Creation (cont)
  Calendar functions
    day_of_week, day_of_calendar, quarter_of_year, etc.
  String functions
    LOWER, UPPER, TRIM, ||, etc.
  Data Type conversion
  SQL predicates
    TRUE, FALSE, NULL
Variable Dimensioning
  Simple Dimensions
    Specific values
    Range of values
  Combined Dimensions
  Hierarchical Dimensions
    SysCalendar, etc.
Variable Transformation
  Bin Coding
  Design Coding
  Recoding
  Rescaling
  Derive
    Hook to Variable Creation
  Statistical Transformations
    Z-Score
    Sigmoid
  NULL Value Replacement
    Literal value
    Mean value
    Median value
    Mode
    Imputed values
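Three of the transformations listed above (NULL replacement with the mean, Z-Score, and a simple two-bin coding) can be written as plain SQL expressions. The sketch below is one possible way to do it and is not taken from the product: sqlite3 stands in for the database, the table `t`, the column names, and the 'low'/'high' bin labels are hypothetical, and the mean and standard deviation are computed from the non-NULL values before imputation (an assumption).

```python
import math
import sqlite3


def transform(values):
    """Apply mean imputation, z-scoring, and two-bin coding via SQL."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, x REAL)")
    conn.executemany("INSERT INTO t (x) VALUES (?)", [(v,) for v in values])

    # Aggregates ignore NULLs, so these describe the observed values only.
    n, s, uss = conn.execute(
        "SELECT COUNT(x), SUM(x), SUM(x * x) FROM t"
    ).fetchone()
    mean = s / n
    std = math.sqrt((uss - s * s / n) / (n - 1))  # sample std dev (assumption)

    rows = conn.execute(
        """
        SELECT id,
               COALESCE(x, :mean)                   AS x_filled,  -- NULL replacement
               (COALESCE(x, :mean) - :mean) / :std  AS z_score,   -- Z-Score
               CASE WHEN COALESCE(x, :mean) < :mean
                    THEN 'low' ELSE 'high' END      AS x_bin      -- bin coding
        FROM t
        ORDER BY id
        """,
        {"mean": mean, "std": std},
    ).fetchall()
    conn.close()
    return rows
```

For example, `transform([1.0, None, 3.0, 5.0])` fills the missing value with the mean of the three observed values, so the imputed row gets a z-score of zero.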
14
Teradata Warehouse Miner Data Reorganization, Build ADS, Matrix Functions
Data Reorganization
  Random Sample and Stratified Random
  Partitioning
  Denormalize/Pivoting
  Joining
    Inner
    Left Outer
    Right Outer
    Full Outer
Build ADS
  Create Final ADS
  Create Metadata for Refresh
Matrix Functions
  Correlation
  Covariance
  SSCP
  Corrected SSCP
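The four matrix functions named above are closely related: the SSCP (sums of squares and cross-products) matrix comes straight from SUM aggregates, and the corrected SSCP, covariance, and correlation follow from it by arithmetic. Here is a minimal two-variable sketch of those relationships, with sqlite3 standing in for the database; the table `t` and columns `x` and `y` are hypothetical, rows are assumed complete (no NULLs), and the n - 1 divisor for covariance is an assumption.

```python
import math
import sqlite3


def matrix_functions(pairs):
    """Build SSCP, corrected SSCP, covariance, and correlation for (x, y)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (x REAL, y REAL)")  # hypothetical table
    conn.executemany("INSERT INTO t VALUES (?, ?)", pairs)
    n, sx, sy, sxx, syy, sxy = conn.execute(
        """
        SELECT COUNT(*), SUM(x), SUM(y),
               SUM(x * x), SUM(y * y), SUM(x * y)
        FROM t
        """
    ).fetchone()
    conn.close()

    # SSCP: raw sums of squares and cross-products.
    sscp = [[sxx, sxy], [sxy, syy]]

    # Corrected SSCP: subtract the contribution of the means.
    cxx = sxx - sx * sx / n
    cyy = syy - sy * sy / n
    cxy = sxy - sx * sy / n
    csscp = [[cxx, cxy], [cxy, cyy]]

    # Covariance: corrected SSCP scaled by n - 1 (assumption).
    cov = [[v / (n - 1) for v in row] for row in csscp]

    # Correlation: corrected cross-product over the geometric mean of squares.
    corr = cxy / math.sqrt(cxx * cyy)
    return {"sscp": sscp, "csscp": csscp, "cov": cov, "corr": corr}
```

With more than two variables the same pattern applies to every pair of columns, so the whole matrix can still be produced from one aggregating pass.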
15
Teradata Warehouse Miner Analytical Techniques, Scoring, Visualizations (1)
Analytic Algorithms (Multivariate Statistical Techniques)
  Linear Regression
    model statistics
    variable coefficients, standard errors, confidence intervals, etc.
    incremental R²
    step-wise variable selection options
      forward & forward only
      backward & backward only
  Factor Analysis
    Principal Component Analysis
    Principal Axis Factors
    Maximum Likelihood Factors
    Orthogonal & Oblique Rotations
  Logistic Regression
    Logit Model Coefficients, Odds Ratios and Statistics
    Model Success Analysis and Lift Tables
Model Scoring
  Linear Regression
  Logistic Regression
  Factor Analysis
  SQL-based model scoring
    all scoring SQL is provided
Supporting Visualizations
  Scatter Plot
  Lift Chart
  Regression Plots
  Factor Pattern
  Scree Plot
Multivariate Diagnostics
  Extensive Collinearity Diagnostics
  Automated Identification of Constants
  Row level diagnostics, and much more…
  SQL-based model evaluation
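To give a feel for what "SQL-based model scoring" can look like, the sketch below turns a set of logistic regression coefficients into a scoring SELECT statement. It is an illustration only and does not reproduce the SQL that Teradata Warehouse Miner emits: the function names, the table `customers`, and the columns `tenure` and `balance` are hypothetical, and identifiers are interpolated unquoted, so the helper is suitable only for trusted names.

```python
import math


def logistic_scoring_sql(table, intercept, coefficients):
    """Build a SELECT that scores every row of `table` with a logit model.

    `coefficients` maps column name -> fitted coefficient.
    """
    if not coefficients:
        raise ValueError("at least one coefficient is required")
    terms = " + ".join(
        f"({coef!r} * {col})" for col, coef in coefficients.items()
    )
    logit = f"{intercept!r} + {terms}"
    # Logistic function: 1 / (1 + exp(-(b0 + b1*x1 + ...)))
    return (
        f"SELECT t.*, 1.0 / (1.0 + EXP(-({logit}))) AS score "
        f"FROM {table} t"
    )


def logistic_score(row, intercept, coefficients):
    """Reference implementation of the same formula for one row (a dict)."""
    z = intercept + sum(coef * row[col] for col, coef in coefficients.items())
    return 1.0 / (1.0 + math.exp(-z))
```

Because the score is a single arithmetic expression over the row's columns, it can be evaluated where the data lives, with no extract step; the pure-Python `logistic_score` is included only as a cross-check on the formula.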
16
Teradata Warehouse Miner Analytical Techniques, Scoring, Visualizations (2)
Analytic Algorithms (AI and Machine Learning Techniques)
  Decision Tree/Rule Induction
    gini / regression (i.e., CART)
    Entropy (i.e., C4.5 / C5.0)
    CHAID
    pruning
      gini algorithm pruning
      gain ratio algorithm pruning
      manual pruning
  Clustering
    K-Means
    Nearest Neighbor Linkage
    Expectation Maximization
      Gaussian Mixture Model
      Poisson Mixture Model
    variable importance report
  Affinity and Sequence Analyses
    Feature Rich Implementations
    Support, Confidence, Lift, z-Score
Model Scoring
  Decision Trees
  Clustering
  Affinity and Sequence Analyses
  SQL-based model scoring
    all scoring SQL is provided
Supporting Visualizations
  Graphical Tree Browser
  Interactive Pruning
  Text Rules
  Distributions
  Lift Charts
  Cluster Sizes / Distance / Measures
  Association Color Map
Model Evaluation
  truth table (confusion matrix)
  model statistics & indices
  SQL-based model evaluation
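The affinity measures named on this slide have standard textbook definitions, shown in the short sketch below for a single rule "left → right". This uses the usual definitions of support, confidence, and lift; the slide does not define its z-Score measure, so it is left out. The function name and the basket data in the usage note are hypothetical.

```python
def affinity_measures(transactions, left, right):
    """Support, confidence, and lift for the rule left -> right.

    `transactions` is a list of sets of items; `left` must occur at least once.
    """
    n = len(transactions)
    n_left = sum(1 for t in transactions if left in t)
    n_right = sum(1 for t in transactions if right in t)
    n_both = sum(1 for t in transactions if left in t and right in t)

    support = n_both / n                 # P(left and right)
    confidence = n_both / n_left         # P(right | left)
    lift = confidence / (n_right / n)    # P(right | left) / P(right)
    return {"support": support, "confidence": confidence, "lift": lift}
```

For instance, with four baskets `{bread, milk}`, `{bread, butter}`, `{milk}`, and `{bread, milk, butter}`, the rule bread → milk has support 2/4 and confidence 2/3. All three measures are ratios of counts, so in a database they reduce to COUNT aggregates over joined item rows.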
17
Teradata Warehouse Miner Statistical Tests
Binomial Tests
  Binomial
  Sign
Rank Tests
  Mann-Whitney (Kruskal-Wallis)
  Wilcoxon
  Friedman
Contingency Table Tests
  Chi-square
  Median
Parametric Tests
  F (Two Way) Unequal Sample Size
  F (N-Way) Equal Sample Size
  T
Normality/Equality Tests
  Kolmogorov-Smirnov
  Lilliefors Test
  Shapiro-Wilk
  D’Agostino & Pearson Omnibus
  Smirnov
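As one concrete instance of the contingency table tests listed above, the sketch below computes the Pearson chi-square statistic and its degrees of freedom from a table of observed counts. It is the standard textbook calculation, not the product's implementation; the function name is hypothetical, and no p-value is computed because that would need a chi-square distribution function outside the Python standard library.

```python
def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom.

    `table` is a list of rows of observed counts; every row and column
    total must be non-zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected

    dof = (len(table) - 1) * (len(col_totals) - 1)
    return stat, dof
```

The inputs are just cell counts and marginal totals, which is exactly what an N-way cross-tabulation query returns.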
18
Why Did We Build Teradata Warehouse Miner?
Integrated Data Mining Environment
[Diagram labels: Modelers Build Models; Business Deploys Models]
Other Technologies
  Inefficient Environment - Elapsed and Execution Times
  Continual Data Movement
  Data Redundancy
  Metadata Inconsistencies
  “Many Versions of The Truth”
Teradata and TWM
  Efficiently Architected Environment - MPP Performance and Scalability
  No Data Movement
  No Data Redundancy
  Shared Metadata
  “One Version of The Truth”
19
Why are Integrated Analytics Important?
Efficiency, Performance & Scalability
  Mine data in an integrated environment
  Huge data volumes – leverages the parallelism of Teradata
  Minimize data redundancy
  Eliminate proprietary data structures
  Simplify data & system management
  Better results using larger amounts of detailed data
  Eliminate potential errors during data movement & external sampling
  Integrated model building and scoring
  Reduced overall modeling time
  Many resulting elapsed and execution time improvements have been astronomical!
[Diagram labels: Analytic Data Set; Source Data; Analytic Metadata; Modelers Build Models; Business Deploys Models]
20
The Teradata Warehouse Miner Goal
Enable Entire Data Mining Process In Teradata
[Diagram labels: Teradata Data Warehouse; Data Pre-Processing; Analytic Metadata; Scored Data Set; Model Deployment; Source Data; Analytic Data Set; Analytical Modeling]
data starts and ends in the database
open to accommodate 3rd party partner tools
21
Teradata Warehouse Miner Projects and Analytic Modules
Teradata Warehouse Miner Projects contain one or more tasks
  each task is called an Analytic Module
eight categories of analytic modules
  ADS (Analytical Data Set generation)
    Variable Creation
    Variable Transformation
    Build ADS
  Analytics (Analytic Algorithms)
  Descriptive Statistics
  Matrix Functions (correlation, …)
  Miscellaneous
    free form SQL, …
  Reorganization (Structure of Data)
  Scoring (and Model Evaluation)
  Statistical Tests
Analytic Modules are the fundamental building blocks used to conduct data analysis in Teradata Warehouse Miner
22
Teradata Warehouse Miner Elements in the Primary Window
[Screenshot callouts: Project Icon; Analytic Module Icon; ODBC Connection Icon; Connection Properties Icon; Run and Stop Icons; Runtime Message Area; Data Source Status; Project Area; Analysis Set-up and Results Viewing Area; Main Menus; Main Toolbar; Open, Save, and Save All Icons]
hmmm… I wonder what else might fill this large gray area some day...
23
Teradata Warehouse Miner The 7 Steps to Results
There are 7 basic steps in the use of Teradata Warehouse Miner*
Step 1 - connect to an ODBC data source with appropriate permissions
Step 2 - create a new (or open an existing) Project
Step 3 - add at least one Analytic Module to the Project
Step 4 - set input and analytic options
  select table(s) and column(s) to be analyzed
  set Analytic Module parameters**
  set other Analytic Module options as necessary**
Step 5 - set output and results options
Step 6 - execute the Analytic Module (using the run icon)
  optionally, save the Project(s) and Analyses
Step 7 - examine, interpret, and use results of interest**
That’s it.
* Use these steps after you or a system administrator has set up an ODBC Data Source (DSN) on your PC. The DSN must point to source, result, and metadata Teradata databases for which you have appropriate permissions.
** Setting Analytic Module options, and interpreting and using results appropriately, requires expertise specific to the Analytic Module chosen.
24
Using Teradata Warehouse Miner The 7 Steps to Results An Example
25
Teradata Warehouse Miner Step 1 - connect to an ODBC data source
26
Teradata Warehouse Miner Step 2 - create a new Project
27
Teradata Warehouse Miner Step 3 - add an Analytic Module to the Project
28
Teradata Warehouse Miner Step 4 – set input and analytic options (select table and columns to be analyzed)
29
Teradata Warehouse Miner Step 4 – set input and analytic options (set Analytic Module parameters)
30
Teradata Warehouse Miner Step 4 – set input and analytic options (set other Analytic Module options as necessary)
31
Teradata Warehouse Miner Step 5 – set output and results options
**Note: This screen-shot is from a Scoring Module for the analytic algorithm module used in this example
32
Teradata Warehouse Miner Step 6 - execute the Analytic Module
33
Teradata Warehouse Miner Step 6 - execute the Analytic Module (optionally, save the Project(s) and Analyses)
34
Teradata Warehouse Miner Step 7 - examine, interpret, and use results (1)
35
Teradata Warehouse Miner Step 7 - examine, interpret, and use results (2)
36
Tips for Navigating the Teradata Warehouse Miner Interface
on-line help and user’s guide
  very extensive and thorough
  tutorials for each function
  describes many of the analytical techniques in detail
  many reference formulae are provided
  use these liberally
menus and toolbar
runtime message area
setting program options and preferences
  global
  run-time
setting up Project Directories for files on PC client
  optionally, for local HTML reports and associated graphics
37
Teradata Warehouse Miner
Demo
TWM, an enabling technology to assist in addressing qualified business questions that are well suited to the processes of decision support and data mining
(data exploration – data transformation – exploratory modeling – model building and validation – scoring and evaluation – lifecycle maintenance – …)
38
University of Arkansas Data Mining with Teradata™ Warehouse Miner
Questions and Discussion