Data Mining with Analysis Services

Data Mining with Analysis Services
Carlos Bossy Principal Consultant MCTS, MCITP BI Aabcom Solutions

My Background Experience Current Projects
25 years Software Development 15 years software companies as Programmer, Architect, Manager, Director, VP, CTO 5+ years Business Intelligence Consultant Current Projects Data Warehouse deployment for State of WA Child Welfare Data Integration/Warehouse architecture for Solar Energy company Data Warehouse model for State of OR Data Mining model for Houston sports league

Make Data Mining an integral component of your Data Architecture.
Overview Make Data Mining an integral component of your Data Architecture. Applications Reports Data Integration

Session Overview What is Data Mining
Data Mining Algorithms in SQL Server Creating a Data Mining Model Using the Model in an Application Using the Model in SSIS Awareness of Data Mining Architecture and Process

Application Architecture
Service Business Data Relational Database

Application Architecture with BI
Service Business Data Data Mining Relational Database Data Integration Data Warehouse Cube Text Warehouse Performance Management Reports Analysis Data Mining Text Mining Ad-hoc

Why Mine Data? If you want to… Data Mining is for you!
Predict the future Get rich buying stocks Win the lottery Win your Fantasy Football League Data Mining is for you!

Really - Why Mine Data? Maximize value of Data Competitive Edge
Uncover Hidden Relationships Find something unusual or unexpected Improve upon domain expert’s knowledge Manage large data sets Create Predictive Analytics platform Maximize value of Data Competitive Edge

Data Mining Defined Data Mining is the process of sorting through large amounts of data and picking out relevant information (Wikipedia). Data Mining is the extraction of hidden predictive information from large databases. The Process of Knowledge Discovery Sophisticated Statistical Model Discovery of Patterns and Relationships It is not Dredging, Snooping or an Invasion of Privacy

Today vs. Yesterday Legacy Future Manageable Volumes of Data
The power of SQL Domain Experts could grasp and analyze a complete Database Limited CPU Horsepower Finite Explosion of Data doubles every 3 years (Moore’s Law) Data Volumes can’t be comprehended by humans Uncover complex and difficult to find patterns for competitive edge Improve professional judgment of Domain Expert (small but valuable) Knowledge Discovery Converting Data to Information N  Infinity

Outcomes Predict number of runs Colorado Rockies would score in their next game using Neural Network. Input Previous 12 games runs scored Home/Away Predict Number runs next game Model Home Runs = (G7*.142)+(G1*.118)+… Away Runs = (G7*.129)+(G1*.091)+…

Non-linear Results Fitting a line to a set of data points to measure the effect of a single independent variable. Statistical Methods y = mx + b Data Mining

Outcomes – Decision Tree
Annual Income = 108, *(Savings Balance-27, )-2, *(Avg Check Size )-0.247*(Credit Card Limit-8, ) *(Age )-1, *(Over Drafts-1.000)

Outcomes – Time Series Total Gen-1 < and Flow Date < 3/2/2007 4:58:07 AM and Flow Date < 8/22/2006 5:54:22 PM and Flow Date < 8/2/2006 1:35:37 PM Total Gen = * Total Gen(-1) Total Gen-1 < and Flow Date >= 3/2/2007 4:58:07 AM and Flow Date < 5/16/2007 7:46:52 PM and Flow Date >= 3/15/ :56:15 PM Total Gen = * Total Gen(-2) * Total Gen(-1)

Applications Credit Risk Analysis Churn Analysis Customer Retention
Targeted Marketing Market Basket Analysis Sales Forecasting Stock Predictions Medical Diagnosis Bioscience Research Surveys Insurance Rate Quotes Credit Card Fraud Web Site Events Loan Applications Hiring and Recruiting Cross-Marketing Social Science Economics

Data Mining Sources Data Mining DW Table SSAS Cube OLTP Database Col 1
SQL Server Oracle MySQL Etc. OLTP Database Data Mining Table Col 1 Col 2 Col 3

Training the Model Cleanse Massage Select Training Set
Apply DM Algorithm Train Test

Using the Model Customer CustomerID CustName Mining Model
CheckingBalance SavingsBalance NumTransactions DMX Prediction Join SQL Join MonthlyBankAccount BankAccountID CheckingBalance SavingsBalance NumTransactions CustomerID Result Prediction Income

Prediction Join DMX SELECT Predict([Movie Purchases],3) as Movies From
[Customer Movie Association] NATURAL PREDICTION JOIN (SELECT 44 AS [Age], 'Female' AS [Gender], 'Rent' AS [Home Ownership], 'Divorced' AS [Marital Status], (SELECT 'Mrs. Doubtfire' AS [Movie Title] UNION SELECT 'My Big Fat Greek Wedding' AS [Movie Title] UNION SELECT 'Patriot Games' AS [Movie Title]) AS [Movie Purchases]) AS t

Market Basket Analysis
Approaches Clustering Classification Regression Market Basket Analysis

Mining Algorithms Time Series Naïve Bayes Association Clustering
Decision Trees Logistic Regression Sequence Clustering Neural Networks

Data Mining Algorithms
Analytical problem Examples Algorithms Classification: Assign cases to predefined classes Credit risk analysis Churn analysis Customer retention Decision Trees Naive Bayes Neural Nets Segmentation: Taxonomy for grouping similar cases Customer profile analysis Mailing campaign Clustering Sequence Clustering Association: Advanced counting for correlations Market basket analysis Advanced data exploration Association Time Series Forecasting: Predict the future Forecast sales Predict stock prices Time Series Prediction: Predict a value for a new case based on values for similar cases Quote insurance rates Predict customer income All Deviation analysis: Discover how a case or segment differs from others Credit card fraud detection Network infusion analysis * Andy Cheung, Microsoft

Time Series Analyzes how a variable changes over time.
Uses Autoregression + Decision Tree to build model Each time series is a single case No prediction join with test or actual cases Prediction is always the same for given time slots

Node Regression Formula
Time Series Tree All Stock(t-5) <= 22.83 Stock(t-5) > 22.83 Stock(t-1) <= 30.15 Stock(t-1) > 30.15 Node Regression Formula Stock = *Stock(t-1) + .21*Stock(t-2)

Naïve Bayes Probabilistic classifier based on Bayes’ theorem with strong (naive) independence assumptions. Simple Classification Algorithm Good starting point for better understanding of your data Uses only discrete data

Premium Svc = Yes Ring Tones = Yes International = No
Naïve Bayes Example Cell Phone Service Gender Premium Service Custom Ring Tones International Calls Female 53% 19% 56% 27% Male 47% 41% 14% 38% Premium Svc = Yes Ring Tones = Yes International = No Likelihood of Female = .53 * .19 * .56 * .73 = .0412 Likelihood of Male = .47 * .41 * .14 * .62 = .0167 P(Female) = .0412/( ) = 71.2% P(Male) = .0167/( ) = 28.8%

Association Rules Detect relationships or associations between specific values of categorical variables in large data sets. Uses only Discrete Data Rule - Attribute value conditions that occur frequently together in a given dataset {Male, IT, Star Wars}  {Star Trek} Itemset - A set of attribute values. Support - Total number of transactions. Confidence - Probability that {X}  {Y} Importance (Lift) –Interestingness. Measure of whether Correlation is positive, negative or none.

Logistic Regression Predict the probability of a discrete outcome from a set of variables that are continuous, discrete or both. Non-linear regression model that produces results between 0 and 1. Popular in health science for disease prediction. Marketing uses for dichotomous predictions (buy or not buy, renew or cancel). Same as Neural Network without the hidden layer. Probability = 1/1 + e-z where z = c + yx1 + yx2 + … If x1 = Weight and x2 = Age then Risk of heart attack = 1/1 + e-z where z = x1 - .7x2

Decision Tree Graphical representation displaying options, risks and the decision-making sequence. Most popular data mining model. East to visualize because of it’s graphical representation. Branches represent choices with associated risks, costs, results, or probabilities. Each test examines the value of a single column in the data and uses it to determine the next test to apply. The results of all tests determine which label to predict. Similar to human thought process when making a decision. Finds non-linear relationships. Supports classification, regression and association within the model.

Neural Network Classifies large and complex data sets by grouping cases together in a way loosely based on the brain. Most sophisticated algorithm but difficult to interpret. Works well with non-linear data and finds smooth non-linear relationships. Modeled as a group of interconnected nodes. No agreed upon definition. Microsoft algorithm is one of many techniques. Can build multiple models based on discrete inputs. H I O H Back to Input layer after weights adjusted for error I H O I H

Clustering Places data elements into related groups without advance knowledge of the group definitions. Good starting point for better understanding of your data. Finds the hidden variable that accurately classifies data. Data grouped into clusters have a high similarity based on the attribute values.

Sequence Clustering Discovers the properties of sequences by grouping them into clusters and assigning them to one of the clusters. Hybrid of sequence and clustering techniques. Typically used with web and event logs as data sources.

Conclusion Slow Adoption Where do you start? Science + Art
Not quite A.I. … yet! More Info and References TDWI – The Data Warehousing Institute ACM, IEEE Books: Data Mining with SQL Server 2005/8 (Wiley) Mining the Talk (IBM Press) Data Mining know it all (Morgan/Kaufman)

Data Mining with Analysis Services

Similar presentations

Presentation on theme: "Data Mining with Analysis Services"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Data Mining with Analysis Services

Similar presentations

Presentation on theme: "Data Mining with Analysis Services"— Presentation transcript:

Similar presentations

About project

Feedback