
Published by Kelly Satterthwaite. Modified over 2 years ago.

1

2
Data Mining with Analysis Services
Carlos Bossy, Principal Consultant, MCTS, MCITP BI
Aabcom Solutions

3
My Background
Experience:
25 years Software Development
15 years software companies as Programmer, Architect, Manager, Director, VP, CTO
5+ years Business Intelligence Consultant
Current Projects:
Data Warehouse deployment for State of WA Child Welfare
Data Integration/Warehouse architecture for Solar Energy company
Data Warehouse model for State of OR
Data Mining model for Houston sports league

4
Overview Make Data Mining an integral component of your Data Architecture.

5
Session Overview
What is Data Mining
Data Mining Algorithms in SQL Server
Creating a Data Mining Model
Using the Model in an Application
Using the Model in SSIS
Awareness of Data Mining Architecture and Process

6
Application Architecture
Application: Service, Business, Data
Relational Database

7
Application Architecture with BI
Application: Service, Business, Data
Relational Database
Data Integration
Data Warehouse, Cube, Text Warehouse
Performance Management, Reports, Analysis, Data Mining, Text Mining, Ad-hoc

8
Why Mine Data?
If you want to…
Predict the future
Get rich buying stocks
Win the lottery
Win your Fantasy Football League
Data Mining is for you!

9
Really - Why Mine Data?
Uncover Hidden Relationships
Find something unusual or unexpected
Improve upon domain experts' knowledge
Manage large data sets
Create Predictive Analytics platform
Maximize value of Data
Competitive Edge

10
Data Mining Defined
Data Mining is the process of sorting through large amounts of data and picking out relevant information (Wikipedia).
Data Mining is the extraction of hidden predictive information from large databases.
The Process of Knowledge Discovery
Sophisticated Statistical Model
Discovery of Patterns and Relationships
It is not Dredging, Snooping or an Invasion of Privacy

11
Today vs. Yesterday
Today (Future, N → Infinity):
Explosion of Data: doubles every 3 years (Moore's Law)
Data Volumes can't be comprehended by humans
Uncover complex and difficult to find patterns for competitive edge
Improve professional judgment of Domain Expert (small but valuable)
Knowledge Discovery
Converting Data to Information
Yesterday (Legacy, Finite):
Manageable Volumes of Data
The power of SQL
Domain Experts could grasp and analyze a complete Database
Limited CPU Horsepower

12
Outcomes
Predict the number of runs the Colorado Rockies would score in their next game using a Neural Network.
Input: runs scored in the previous 12 games, Home/Away
Predict: number of runs in the next game
Model:
HomeRuns = (G7*.142)+(G1*.118)+…
AwayRuns = (G7*.129)+(G1*.091)+…

13
Non-linear Results: Statistical Methods vs. Data Mining
Statistical Methods: fitting a line to a set of data points to measure the effect of a single independent variable.
y = mx + b
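The line fit y = mx + b can be sketched with ordinary least squares. This is a minimal illustration only; the function name `fit_line` and the sample points are assumptions made for the example and do not come from the presentation.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = m*x + b for one independent variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    b = mean_y - m * mean_x
    return m, b


# Hypothetical sample points lying exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
m, b = fit_line(xs, ys)
print(m, b)
```

A straight line like this captures only a linear effect of one variable, which is the limitation the slide contrasts with data mining models.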

14
Outcomes – Decision Tree Annual Income = 108, *(Savings Balance-27, )-2, *(Avg Check Size )-0.247*(Credit Card Limit-8, ) *(Age )- 1, *(Over Drafts-1.000)

15
Outcomes – Time Series Total Gen-1 < and Flow Date < 3/2/2007 4:58:07 AM and Flow Date < 8/22/2006 5:54:22 PM and Flow Date < 8/2/2006 1:35:37 PM Total Gen = * Total Gen(-1) Total Gen-1 = 3/2/2007 4:58:07 AM and Flow Date = 3/15/ :56:15 PM Total Gen = * Total Gen(-2) * Total Gen(-1)

16
Applications
o Credit Risk Analysis
o Churn Analysis
o Customer Retention
o Targeted Marketing
o Market Basket Analysis
o Sales Forecasting
o Stock Predictions
o Medical Diagnosis
o Bioscience Research
o Surveys
o Insurance Rate Quotes
o Credit Card Fraud
o Web Site Events
o Loan Applications
o Hiring and Recruiting
o Cross-Marketing
o Social Science
o Economics

17
Data Mining Sources
Sources feeding Data Mining: SSAS Cube, DW, OLTP Database (SQL Server, Oracle, MySQL, etc.)

18
Training the Model
Cleanse, Massage, Select Training Set, Apply DM Algorithm, Train, Test
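The cleanse and select-training-set steps can be sketched as follows. This is only a rough illustration: the helper names `cleanse` and `split_cases`, the 30% holdout, and the sample cases are assumptions, and the slide does not say how the split is made.

```python
import random


def cleanse(raw_cases):
    """Drop cases with missing values (a stand-in for real cleansing rules)."""
    return [case for case in raw_cases if case is not None]


def split_cases(cases, test_fraction=0.3, seed=42):
    """Shuffle the cases and hold out a fraction of them for testing."""
    shuffled = list(cases)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    # First n_test shuffled cases become the test set, the rest train the model.
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical raw cases; None marks a record that fails cleansing.
raw_cases = [0, 1, None, 2, 3, 4, None, 5, 6, 7, 8, 9]
cases = cleanse(raw_cases)
training_set, test_set = split_cases(cases)
print(len(training_set), len(test_set))
```

The algorithm is trained on `training_set` only; `test_set` is kept aside to check how well the trained model predicts cases it has not seen.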

19
Using the Model SQL Join DMX Prediction Join Result

20
Prediction Join DMX
SELECT Predict([Movie Purchases], 3) AS Movies
FROM [Customer Movie Association]
NATURAL PREDICTION JOIN
(SELECT 44 AS [Age],
        'Female' AS [Gender],
        'Rent' AS [Home Ownership],
        'Divorced' AS [Marital Status],
        (SELECT 'Mrs. Doubtfire' AS [Movie Title]
         UNION SELECT 'My Big Fat Greek Wedding' AS [Movie Title]
         UNION SELECT 'Patriot Games' AS [Movie Title]) AS [Movie Purchases]) AS t

21
Approaches
Clustering, Classification, Regression, Market Basket Analysis

22
Mining Algorithms
Time Series
Naïve Bayes
Association
Clustering
Decision Trees
Logistic Regression
Sequence Clustering
Neural Networks

23
Data Mining Algorithms
Analytical problem | Examples | Algorithms
Classification: Assign cases to predefined classes | Credit risk analysis, Churn analysis, Customer retention | Decision Trees, Naive Bayes, Neural Nets
Segmentation: Taxonomy for grouping similar cases | Customer profile analysis, Mailing campaign | Clustering, Sequence Clustering
Association: Advanced counting for correlations | Market basket analysis, Advanced data exploration | Decision Trees, Association
Time Series Forecasting: Predict the future | Forecast sales, Predict stock prices | Time Series
Prediction: Predict a value for a new case based on values for similar cases | Quote insurance rates, Predict customer income | All
Deviation analysis: Discover how a case or segment differs from others | Credit card fraud detection, Network intrusion analysis | All
* Andy Cheung, Microsoft

24
Time Series
Analyzes how a variable changes over time.
Uses Autoregression + Decision Tree to build model
Each time series is a single case
No prediction join with test or actual cases
Prediction is always the same for given time slots

25
Time Series Tree All Stock(t-5) <= Stock(t-5) > Stock(t-1) <= Stock(t-1) > Stock = *Stock(t-1) +.21*Stock(t-2) Node Regression Formula
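A node regression formula of the form Stock = a*Stock(t-1) + b*Stock(t-2) can be applied step by step to forecast future time slots. The sketch below is illustrative only: the coefficient 0.21 for Stock(t-2) is taken from the slide, while 0.75 for Stock(t-1) is a hypothetical placeholder because that value is missing from the transcript, and the starting prices are made up.

```python
def forecast(history, coefficients, steps):
    """Autoregressive forecast: each new value is a weighted sum of prior values.

    coefficients[0] multiplies the value at t-1, coefficients[1] the value at t-2.
    """
    series = list(history)
    for _ in range(steps):
        next_value = sum(
            c * series[-lag] for lag, c in enumerate(coefficients, start=1)
        )
        series.append(next_value)
    # Return only the newly predicted values.
    return series[len(history):]


stock = [10.0, 11.0]      # hypothetical prices at t-2 and t-1
coeffs = [0.75, 0.21]     # 0.75 is an assumed placeholder; 0.21 is from the slide
preds = forecast(stock, coeffs, steps=2)
print(preds)
```

Because the forecast depends only on the stored series and the fitted coefficients, running it again yields the same values for the same time slots.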

26
Naïve Bayes
Probabilistic classifier based on Bayes' theorem with strong (naive) independence assumptions.
Simple Classification Algorithm
Good starting point for better understanding of your data
Uses only discrete data

27
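Naïve Bayes Example: Cell Phone Service
Gender (share of customers) | Premium Service | Custom Ring Tones | International Calls
Female (53%) | 19% | 56% | 27%
Male (47%) | 41% | 14% | 38%
New case: Premium Svc = Yes, Ring Tones = Yes, International = No
For International = No, use the complement of each percentage (1 - .27 = .73 and 1 - .38 = .62).
Likelihood of Female = .53 * .19 * .56 * .73 = .0412
Likelihood of Male = .47 * .41 * .14 * .62 = .0167
P(Female) = .0412 / (.0412 + .0167) = 71.2%
P(Male) = .0167 / (.0412 + .0167) = 28.8%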
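The arithmetic in this example can be reproduced with a short sketch. The percentages come from the slide; the dictionary layout and the function name `posterior` are assumptions made for illustration. With unrounded intermediate values the result for Female comes out near 71.1% rather than the slide's 71.2%, which was computed from rounded likelihoods.

```python
# Share of customers by gender (the prior), from the slide.
priors = {"Female": 0.53, "Male": 0.47}

# Probability of "Yes" for each attribute, given gender, from the slide.
p_yes = {
    "Female": {"premium": 0.19, "ring_tones": 0.56, "international": 0.27},
    "Male": {"premium": 0.41, "ring_tones": 0.14, "international": 0.38},
}

# The new case to classify.
case = {"premium": True, "ring_tones": True, "international": False}


def posterior(priors, p_yes, case):
    """Multiply prior and per-attribute probabilities, then normalize."""
    likelihoods = {}
    for label, prior in priors.items():
        likelihood = prior
        for attribute, is_yes in case.items():
            p = p_yes[label][attribute]
            # A "No" answer uses the complement of the "Yes" probability.
            likelihood *= p if is_yes else 1.0 - p
        likelihoods[label] = likelihood
    total = sum(likelihoods.values())
    return {label: value / total for label, value in likelihoods.items()}


result = posterior(priors, p_yes, case)
print(result)
```

Multiplying the per-attribute probabilities directly is what the "naive" independence assumption permits.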

28
Association Rules
Detect relationships or associations between specific values of categorical variables in large data sets.
Uses only Discrete Data
Rule - Attribute value conditions that occur frequently together in a given dataset, e.g. {Male, IT, Star Wars} → {Star Trek}
Itemset - A set of attribute values.
Support - Total number of transactions that contain the itemset.
Confidence - Probability that {X} → {Y}, i.e. that a transaction containing X also contains Y.
Importance (Lift) - Interestingness. Measure of whether Correlation is positive, negative or none.
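Support, confidence and lift can be sketched on a toy data set. The transactions below are invented for illustration, and lift is computed here as the plain ratio of confidence to the baseline frequency of the right-hand side; that is an assumption about what the slide means by "Importance (Lift)", and the product's own importance score may be calculated differently.

```python
# Hypothetical transactions; each is a set of attribute values.
transactions = [
    {"Male", "IT", "Star Wars", "Star Trek"},
    {"Male", "IT", "Star Wars", "Star Trek"},
    {"Male", "IT", "Star Wars"},
    {"Female", "IT", "Star Trek"},
    {"Male", "Sales", "Star Wars"},
]


def support(itemset):
    """Number of transactions that contain every value in the itemset."""
    return sum(1 for t in transactions if itemset <= t)


def confidence(lhs, rhs):
    """Estimated probability that a transaction containing lhs also contains rhs."""
    return support(lhs | rhs) / support(lhs)


def lift(lhs, rhs):
    """Confidence relative to how often rhs occurs overall.

    Above 1 suggests positive correlation, below 1 negative, near 1 none.
    """
    baseline = support(rhs) / len(transactions)
    return confidence(lhs, rhs) / baseline


lhs = {"Male", "IT", "Star Wars"}
rhs = {"Star Trek"}
print(support(lhs), confidence(lhs, rhs), lift(lhs, rhs))
```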

29
Logistic Regression
Predict the probability of a discrete outcome from a set of variables that are continuous, discrete or both.
Non-linear regression model that produces results between 0 and 1.
Popular in health science for disease prediction.
Marketing uses for dichotomous predictions (buy or not buy, renew or cancel).
Same as Neural Network without the hidden layer.
Probability = 1 / (1 + e^(-z)) where z = c + y_1*x_1 + y_2*x_2 + …
If x_1 = Weight and x_2 = Age then Risk of heart attack = 1 / (1 + e^(-z)) where z = x_1 - .7x_2
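The logistic formula can be sketched in a few lines. The function names and the sample intercept, weights and inputs are hypothetical values chosen for illustration; they are not the coefficients of any fitted model from the presentation.

```python
import math


def logistic(z):
    """Map any real number z to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(intercept, weights, inputs):
    """Compute z as intercept plus the weighted sum of inputs, then squash it."""
    z = intercept + sum(w * x for w, x in zip(weights, inputs))
    return logistic(z)


# Hypothetical intercept, weights and input values.
p = predict_probability(0.5, [1.0, -0.7], [2.0, 1.0])
print(p)
```

Whatever the inputs, the output stays strictly between 0 and 1, which is why the model suits buy/not-buy or renew/cancel predictions.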

30
Decision Tree
Graphical representation displaying options, risks and the decision-making sequence.
Most popular data mining model.
Easy to visualize because of its graphical representation.
Branches represent choices with associated risks, costs, results, or probabilities.
Each test examines the value of a single column in the data and uses it to determine the next test to apply.
The results of all tests determine which label to predict.
Similar to human thought process when making a decision.
Finds non-linear relationships.
Supports classification, regression and association within the model.
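How a tree applies one single-column test after another can be sketched with a small hand-built tree. The columns, thresholds and labels below are hypothetical and only illustrate prediction; the sketch does not show how a mining algorithm learns the tree from data.

```python
# A hand-built tree: each inner node tests one column against a threshold.
tree = {
    "column": "income",
    "threshold": 50000,
    "low": {"label": "decline"},
    "high": {
        "column": "overdrafts",
        "threshold": 2,
        "low": {"label": "approve"},
        "high": {"label": "review"},
    },
}


def predict(node, case):
    """Walk from the root to a leaf; each test picks the next test to apply."""
    while "label" not in node:
        value = case[node["column"]]
        branch = "low" if value <= node["threshold"] else "high"
        node = node[branch]
    return node["label"]


print(predict(tree, {"income": 80000, "overdrafts": 1}))
```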

31
Neural Network
Classifies large and complex data sets by grouping cases together in a way loosely based on the brain.
Most sophisticated algorithm but difficult to interpret.
Works well with non-linear data and finds smooth non-linear relationships.
Modeled as a group of interconnected nodes.
No agreed upon definition. Microsoft algorithm is one of many techniques.
Can build multiple models based on discrete inputs.
Diagram: input nodes (I), hidden nodes (H), output nodes (O); back to Input layer after weights adjusted for error.

32
Clustering Places data elements into related groups without advance knowledge of the group definitions. Good starting point for better understanding of your data. Finds the hidden variable that accurately classifies data. Data grouped into clusters have a high similarity based on the attribute values.
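Grouping values without predefined group definitions can be sketched with a basic one-dimensional k-means loop. K-means is used here purely for illustration; it is an assumption and not necessarily the method the Analysis Services clustering algorithm applies, and the data and starting centers are invented.

```python
def k_means_1d(values, centers, iterations=10):
    """Assign each value to its nearest center, then move centers to group means."""
    centers = list(centers)
    groups = [[] for _ in centers]
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Keep the old center if a group ended up empty.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups


# Hypothetical data with two obvious groups and rough starting centers.
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centers, groups = k_means_1d(values, [0.0, 5.0])
print(centers, groups)
```

The groups emerge from the similarity of the values alone; no labels were supplied in advance.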

33
Sequence Clustering Discovers the properties of sequences by grouping them into clusters and assigning them to one of the clusters. Hybrid of sequence and clustering techniques. Typically used with web and event logs as data sources.

34
Demo

35
Conclusion
Slow Adoption
Where do you start?
Science + Art
Not quite A.I. … yet!
More Info and References
TDWI - The Data Warehousing Institute
ACM, IEEE
Books:
Data Mining with SQL Server 2005/8 (Wiley)
Mining the Talk (IBM Press)
Data Mining: Know It All (Morgan Kaufmann)
