Overview of SQL Server Data Mining

CSD305 Advanced Databases

Data Mining Definition
Mining: the act of excavating the earth so that ore or minerals can be extracted.
Data mining: the act of excavating data so that patterns can be extracted.
- Alternative name: knowledge discovery in databases (KDD)
- Multiple disciplines: databases, statistics, artificial intelligence
- Fast-maturing technology
- Unlimited applicability

Notes:
Business. Point-of-sale data collection (bar code scanners, RFID and smart card technology) allows up-to-the-minute data about customer purchases to be collected, along with data such as web logs from e-commerce and customer service records. This helps a business better understand the needs of its customers. Applications include customer profiling, targeted marketing, workflow management, store layout and fraud detection. Typical questions: "Who are the most profitable customers?", "What products can be cross-sold or up-sold?", "What is the revenue outlook for next year?"

Medicine, science and engineering. Data mining aids new discoveries. NASA has a series of satellites generating global observations of the land surface, oceans and atmosphere. These help to answer questions such as "What is the relationship between the frequency and intensity of ecosystem disturbances, such as droughts and hurricanes, and global warming?" and "How are land surface precipitation and temperature affected by ocean surface temperature?" Molecular biology hopes to use the genomic data being gathered to understand the structure and function of genes. This data allows the behaviour of thousands of genes to be compared under various situations, and may help isolate the genes responsible for certain diseases.

Data Mining Process
- Define a model
- Train the model (input: training data)
- Test the model (input: test data)
- Prediction using the model (input: prediction input data)
[Diagram: the mining model sits inside the Data Mining Management System (DMMS); training data, test data and prediction input data feed into it.]

Notes:
Define a model. A mining model is similar to a table in a relational database. It includes input columns, predictable columns and an associated algorithm, and it is used to store the patterns discovered.

Train the model. Training, also called processing, feeds historical data into the engine, e.g. existing customers' demographics and credit risk information. The mining algorithm then analyses the data; depending on the algorithm's efficiency, it may scan the dataset in one or more iterations to find correlations. Training is usually time consuming, and most models are trained in batches, weekly or monthly. Once training is complete, the patterns are stored and the model can be browsed using content viewers such as the decision tree viewer and the cluster viewer.

Prediction. Prediction needs a trained model and a new dataset. The engine applies the rules it found in the training step to the new dataset and assigns a prediction result to each input case. Prediction is usually quite simple and fast, and can be executed in real time. There are two types: batch and singleton. A singleton prediction uses one input case that is constructed on the fly and not persisted in the database.
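The define, train, test and predict steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how SQL Server implements them: the "algorithm" here is a deliberately trivial majority-vote lookup, and the column names (income, risk) and all data values are invented for the example.

```python
from collections import Counter, defaultdict

def train(cases, input_col, target_col):
    """Training step: for each input value, store the most common target value."""
    counts = defaultdict(Counter)
    for case in cases:
        counts[case[input_col]][case[target_col]] += 1
    # The stored "patterns" are simply a lookup table: input value -> prediction.
    return {value: c.most_common(1)[0][0] for value, c in counts.items()}

def predict(model, case, input_col, default=None):
    """Prediction step: apply the stored patterns to one input case."""
    return model.get(case[input_col], default)

def accuracy(model, cases, input_col, target_col):
    """Test step: fraction of held-out cases the model predicts correctly."""
    hits = sum(predict(model, c, input_col) == c[target_col] for c in cases)
    return hits / len(cases)

# Hypothetical historical data (training) and held-out data (test).
training_data = [
    {"income": "high", "risk": "good"},
    {"income": "high", "risk": "good"},
    {"income": "low", "risk": "bad"},
    {"income": "low", "risk": "good"},
    {"income": "low", "risk": "bad"},
]
test_data = [
    {"income": "high", "risk": "good"},
    {"income": "low", "risk": "bad"},
    {"income": "low", "risk": "good"},
]

model = train(training_data, "income", "risk")
test_accuracy = accuracy(model, test_data, "income", "risk")

# Singleton prediction: one input case built on the fly, never stored.
singleton = predict(model, {"income": "high"}, "income")
```

Batch prediction would simply call predict over every row of a new dataset; the singleton call at the end shows the real-time case.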

Data Mining Tasks
Predictive data mining predicts unknown or future values in a data set of interest. A medical practitioner trying to diagnose a disease from a patient's medical test results is performing a predictive data mining task.
Descriptive data mining finds patterns that describe the data and derives new, significant information from the available data set. A retailer trying to identify products that are purchased together is performing a descriptive data mining task.

Data Mining Problems Classification (prediction)
Is this student going to go to college?
Based on Gender, ParentIncome, ParentEncouragement, IQ, etc.
E.g., if ParentEncouragement=Yes and IQ>100, then College=Yes

Similar questions:
- Is this email spam? (spam filtering)
- How good or bad is your credit? (credit scoring)
- Recognition of hand-written letters (pen recognition)
- What is this gene like? (bioinformatics)
- Does this person behave like a terrorist? (TIA)

Notes:
Classification is one of the most popular tasks and is used for discrete target variables. It addresses business problems such as:
- Churn analysis: which customers are most likely to switch to competitors?
- Risk management: should a loan be approved for this customer?
- Ad targeting and web personalisation
TIA stands for Total Information Awareness, a US counter-terrorism information programme.
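The example rule from the slide can be written as a tiny classifier. This is a minimal sketch: the slide only states when College=Yes, so treating every other case as "No" is an assumption, and the student records are invented for illustration.

```python
def predicts_college(parent_encouragement: bool, iq: int) -> str:
    """Return the predicted class label, "Yes" or "No", for one student."""
    if parent_encouragement and iq > 100:
        return "Yes"
    # Assumption: the slide gives no rule for other cases, so default to "No".
    return "No"

# Hypothetical input cases.
students = [
    {"encouraged": True, "iq": 120},
    {"encouraged": True, "iq": 95},
    {"encouraged": False, "iq": 130},
]

predictions = [predicts_college(s["encouraged"], s["iq"]) for s in students]
```

The target takes one of a fixed set of labels, which is what makes this a classification problem rather than a regression problem.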

Decision Tree
[Figure: decision tree for Attend College. Root node: 55% Yes, 45% No. Splits shown: IQ > 100 versus IQ <= 100, and Encouragement = Encouraged versus Encouragement = Not Encouraged. Other node values shown: 79% Yes, 21% No; 35% Yes.]

Data Mining Problems Regression (prediction)
What is the age of a person?
Based on Hobby, MaritalStatus, NumberOfChildren, Income, HouseOwnership, NumberOfCars, …
E.g., if MaritalStatus=Yes, Age = 20+4*NumberOfChildren *Income+…

Similar questions:
- What is the sales amount of ice cream next month? (sales prediction)
- What is the stock price of MSFT next week? (stock prediction)
- What is the income of a customer? (marketing)
- What is the lifetime of a software bug? (bug tracking)

Notes:
Regression is predictive modelling for continuous target variables. For example, predicting whether a web user will make a purchase at an online book store is classification, because the target variable is binary valued, whereas forecasting the future price of a stock is regression, because price is a continuous-valued attribute.
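A regression model fits a numeric formula to historical cases. The sketch below fits a straight line with ordinary least squares using a single input, NumberOfChildren. The data points are invented so that the fitted line comes out as Age = 20 + 4 * NumberOfChildren; the other terms of the slide's formula are not modelled here.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical training cases: number of children and age.
children = [0, 1, 2, 3]
ages = [20, 24, 28, 32]

intercept, slope = fit_line(children, ages)

def predict_age(number_of_children):
    """Apply the fitted line to a new case; the result is a continuous value."""
    return intercept + slope * number_of_children

predicted = predict_age(5)
```

Unlike the classifier case, the prediction is a number on a continuous scale, not one of a fixed set of labels.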

Data Mining Problems Segmentation (Clustering)
Who are my Web visitors?
Identify similar groups based on demographics and visiting patterns.
E.g., daily news readers, users, shoppers, short-stayers, etc.

Similar questions:
- Identify groups of genes (bioinformatics)
- Identify groups of locations of cholera incidents in London (spatial data mining)
- Identify groups of customers of merchants such as Amazon, eBay, MSN, Walmart, etc. (target marketing)
- Identify groups of documents (text categorisation)

Notes:
Clustering finds groups of closely related observations, so that observations belonging to the same cluster are more similar to each other than to those in other clusters. It is used to group sets of related customers, to find areas of the ocean that have a significant impact on the Earth's climate, and to compress data.
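As an illustration of grouping similar observations, here is a minimal one-dimensional k-means sketch. K-means is one common clustering technique; the slides do not say which method a given product uses. The visit durations and starting centres are invented.

```python
def k_means_1d(values, centers, iterations=10):
    """Cluster numbers by repeatedly assigning each to its nearest centre."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            # Assignment step: pick the index of the closest centre.
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each centre to the mean of its members
        # (an empty cluster keeps its previous centre).
        centers = [
            sum(members) / len(members) if members else centers[i]
            for i, members in enumerate(clusters)
        ]
    return centers, clusters

# Hypothetical minutes spent on a site per visit.
visit_minutes = [1, 2, 3, 20, 22, 24]
centers, clusters = k_means_1d(visit_minutes, centers=[0, 10])
```

With this data the two groups could be read as short-stayers and long-stayers; no labels were supplied, which is what distinguishes segmentation from classification.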

Data Mining Problems Association Analysis (recommendation, market basket analysis)
What other products are purchased together with a digital camera?
Based on previous purchases (shopping cart).
E.g., if a digital camera is purchased, then flash memory, a battery and a printer are also purchased.

Similar questions:
- What products should be recommended in online stores such as Amazon.com?
- What items should be displayed together in a store?
- What genes appear together in toxic mushrooms?

Notes:
Association analysis is used to discover patterns that describe strongly associated features. The discovered patterns are usually in the form of implication rules or feature subsets. Because the search space is exponential in size, the goal is to extract the most interesting patterns efficiently. Examples include finding groups of genes that have related functionality, identifying web pages that are accessed together, and understanding relationships between different elements of the Earth's climate system.
Market basket analysis: if you buy a certain group of items, you are more (or less) likely to buy another group of items.
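How strongly two items are associated is commonly measured with support and confidence. These measure names are standard in association analysis but do not appear on the slide, and the baskets below are invented. A minimal sketch:

```python
def support(baskets, items):
    """Fraction of baskets that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in baskets) / len(baskets)

def confidence(baskets, antecedent, consequent):
    """Of the baskets containing `antecedent`, the fraction that also
    contain `consequent`. Assumes the antecedent occurs at least once."""
    both = set(antecedent) | set(consequent)
    return support(baskets, both) / support(baskets, antecedent)

# Hypothetical shopping carts.
baskets = [
    {"camera", "flash memory", "battery"},
    {"camera", "flash memory"},
    {"camera", "printer"},
    {"battery"},
    {"camera", "flash memory", "printer"},
]

camera_support = support(baskets, {"camera"})
rule_confidence = confidence(baskets, {"camera"}, {"flash memory"})
```

Here the rule "camera implies flash memory" holds in three of the four baskets that contain a camera. Real algorithms avoid checking every possible item set, because the number of candidate sets grows exponentially with the number of items.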

Data Mining Problems Anomaly detection (outlier detection)
Could this network packet be from a virus attack?
Predict the likelihood of the network packet pattern.

Similar questions:
- Are the hospital lab results normal? (adverse drug effect detection)
- Is this credit transaction fraudulent? (fraud detection)
- Does this person behave unusually, perhaps warranting a high-level security check? (TIA)

Notes:
Anomaly detection identifies observations whose characteristics are significantly different from the rest of the data. Such observations are known as anomalies or outliers. The goal is to discover real anomalies while avoiding falsely labelling normal objects as anomalous, so a good anomaly detector must have a high detection rate and a low false alarm rate. Applications include the detection of fraud, network intrusions, unusual patterns of disease and ecosystem disturbances.
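The two quality measures mentioned in the notes, detection rate and false alarm rate, can be computed for a simple detector. The sketch below flags values that lie more than a chosen number of standard deviations from the mean (a z-score test, one basic technique among many). The transaction amounts, the labels and the threshold of 2.0 are assumptions made for the example.

```python
from statistics import mean, pstdev

def flag_outliers(values, threshold=2.0):
    """Flag values whose distance from the mean exceeds `threshold`
    population standard deviations."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [False] * len(values)
    return [abs(v - mu) / sigma > threshold for v in values]

def rates(flags, truth):
    """Return (detection rate, false alarm rate) given the true labels.
    Assumes `truth` contains at least one anomaly and one normal case."""
    true_pos = sum(f and t for f, t in zip(flags, truth))
    false_pos = sum(f and not t for f, t in zip(flags, truth))
    positives = sum(truth)
    negatives = len(truth) - positives
    return true_pos / positives, false_pos / negatives

# Hypothetical credit transaction amounts; only the last one is fraudulent.
amounts = [20, 22, 19, 21, 20, 23, 18, 500]
is_fraud = [False] * 7 + [True]

flags = flag_outliers(amounts)
detection_rate, false_alarm_rate = rates(flags, is_fraud)
```

Lowering the threshold tends to raise the detection rate but also the false alarm rate, which is the trade-off the notes describe.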

Data Mining Tasks - Summary
- Classification
- Regression
- Segmentation
- Association Analysis
- Anomaly detection
- Sequence Analysis
- Time-series Analysis
- Text categorisation
- Others

Notes:
Sequence analysis finds patterns in a discrete series. For example, a DNA sequence is a long series composed of four different states: A, G, C and T. A web click sequence is a series of URLs. A customer purchase sequence might be: first buys a computer, then speakers, then a webcam. Sequence analysis is fairly new and is becoming popular because of web log analysis and DNA analysis.
A time series is a series of data points in time order. Most commonly, it is a sequence taken at successive, equally spaced points in time. Examples are the heights of ocean tides, counts of sunspots, and the daily closing value of the Dow Jones Industrial Average.
Text categorisation is the process of classifying documents into predefined categories based on their content. It is the automated assignment of natural language texts to predefined categories.
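One simple pattern in a discrete series is how often each state is followed by each other state. The sketch below counts those transitions and turns them into ratios. The purchase sequences are invented, and this is an illustration of the idea, not a specific product's algorithm.

```python
from collections import Counter, defaultdict

def transition_ratios(sequences):
    """For each state, the share of times each next state follows it."""
    counts = defaultdict(Counter)
    for seq in sequences:
        # Pair every element with its successor in the same sequence.
        for current, nxt in zip(seq, seq[1:]):
            counts[current][nxt] += 1
    return {
        state: {nxt: n / sum(followers.values()) for nxt, n in followers.items()}
        for state, followers in counts.items()
    }

# Hypothetical customer purchase sequences.
purchases = [
    ["computer", "speakers", "webcam"],
    ["computer", "speakers"],
    ["computer", "webcam"],
]

ratios = transition_ratios(purchases)
```

The same function works unchanged on web click sequences (lists of URLs) or DNA strings, since it only needs an ordered series of discrete states.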

Data Mining Algorithms
- Decision Trees
- Naïve Bayesian
- Clustering
- Sequence Clustering
- Association Rules
- Neural Network
- Time Series
- Support Vector Machines
- …

Notes:
Decision trees are the easiest to understand. The tree-like structure provides both predictions and analysis.

Naïve Bayesian looks at each attribute of an entity and determines how, on its own, it affects the attribute we are looking to predict, e.g. whether a customer is a good credit risk. It takes a single attribute of the customer, such as size of company or annual revenue, and uses the training data to determine its effect on credit risk. Problem: suppose the result states that 57% of customers with a size attribute of small are bad credit risks, and only 14% with a size attribute of large are bad credit risks. Should we never extend credit to small companies and always extend credit to large companies? What if we consider more than one attribute at a time? Are small companies with annual profits of more than 500,000 a bad credit risk? Are large companies with negative profits a good credit risk? The algorithm does not consider combinations of attributes; it simply doesn't know, hence NAÏVE.

Sequence clustering is primarily for sequence analysis but has other uses as well. It examines data to identify transitions from one state to another, e.g. web log navigation from one page to another, and determines as a ratio how many times each possible path is taken.

Neural network models the way human neurons function. It is a web of nodes that connects inputs from attribute values to a final output, and it can be used, for example, to predict product sales. It is an extremely complex algorithm whose foundations were laid by researchers decades ago (Microsoft itself was only founded in 1975); SQL Server provides an implementation that we get to use.

Support vector machines (SVMs) are supervised learning models with associated learning algorithms (machine learning algorithms).
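The "one attribute at a time" idea behind Naïve Bayes can be sketched as follows. Each attribute's effect on the target is counted separately, and the per-attribute ratios are then multiplied together as if the attributes were independent. This is a minimal illustration, not Microsoft's implementation: the company records are invented, and the add-one (Laplace) smoothing is an addition not mentioned in the notes, included so that an unseen value does not force a probability of zero.

```python
from collections import Counter, defaultdict

def train_naive_bayes(cases, target):
    """Count classes, and count each attribute value separately per class."""
    class_counts = Counter(case[target] for case in cases)
    # value_counts[attribute][label][value] = number of training cases
    value_counts = defaultdict(lambda: defaultdict(Counter))
    for case in cases:
        for attribute, value in case.items():
            if attribute != target:
                value_counts[attribute][case[target]][value] += 1
    return class_counts, value_counts

def predict_proba(model, case):
    """Multiply the prior by one independent factor per attribute."""
    class_counts, value_counts = model
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        score = n / total  # prior probability of the class
        for attribute, value in case.items():
            seen = value_counts[attribute][label]
            distinct = {
                v for per_class in value_counts[attribute].values() for v in per_class
            }
            # Add-one smoothing (an assumption of this sketch).
            score *= (seen[value] + 1) / (n + len(distinct))
        scores[label] = score
    norm = sum(scores.values())
    return {label: s / norm for label, s in scores.items()}

# Hypothetical training cases.
companies = [
    {"size": "small", "revenue": "low", "risk": "bad"},
    {"size": "small", "revenue": "low", "risk": "bad"},
    {"size": "small", "revenue": "high", "risk": "good"},
    {"size": "large", "revenue": "high", "risk": "good"},
    {"size": "large", "revenue": "high", "risk": "good"},
    {"size": "large", "revenue": "low", "risk": "bad"},
]

model = train_naive_bayes(companies, target="risk")
proba = predict_proba(model, {"size": "small", "revenue": "high"})
```

Note that the model never looks at how size and revenue behave together; it only combines their separate effects, which is exactly the limitation the notes describe.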

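Data Mining Algorithms
[Table: which algorithm suits which task, with marks for first choice and second choice. The individual cell marks are not recoverable from this transcript.]
Algorithms: Association Rules, Sequence Clustering, Decision Trees, Neural Network, Naïve Bayes, Clustering, Time Series
Tasks: Classification, Regression, Segmentation, Association Analysis, Anomaly Detection, Sequence Analysis, Time Series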

Data Mining Vendors
- SAS (Analytics)
- IBM (DB2 InfoSphere Warehouse)
- Oracle (ODM option to Oracle 11g)
- SPSS (Clementine)
- Insightful (Insightful Miner)
- KXEN (Analytic Framework)
- Prudsys (Discoverer and its family)
- Microsoft (SQL Server 2012)
- Angoss (KnowledgeServer and its family)
- DBMiner (DBMiner)
- … and many others

