A Best Practices Framework for Data Mining
13 June 2013 | Virtual Business Analytics Chapter
Mark Tabladillo, Ph.D., Data Mining Scientist
Artus Krohn-Grimberghe, Ph.D., Consultant and Assistant Professor
About MarkTab (Presenter)
- Training and consulting with http://marktab.com
- Data mining resources and blog at http://marktab.net
- Ph.D., Industrial Engineering, Georgia Tech
- Training and consulting internationally across many industries, with SAS and Microsoft
- Contributed to peer-reviewed research and legislation
- Mentoring doctoral dissertations at the accredited University of Phoenix
About Artus (Consultant)
- Assistant Professor for Analytic Information Systems and Business Intelligence
- Ph.D. in computer science
- Research: data mining for e-commerce and mobile business
Definition 1 (Informal) Data mining is the automated or semi-automated process of discovering patterns in data.
Definition 2
Data mining is a process using:
1. Exploratory Data Analysis: statistical and visual data analysis techniques. Forming a hypothesis.
2. Data Modeling & Predictions: describe data using probability distributions and machine learning algorithms (the “model”). Fitting a hypothesis.
3. Statistical Learning Theory: model selection and model evaluation.
Data Mining Visualized
- Target: the attribute we are interested in.
- Input: the data available for our predictions.
- Function f: describes the relationship between target and input.
- Regrettably, f is unknown and unknowable.
[Diagram: Input → f( ) → Target]
Data Mining Visualized
[Diagram: real world: Input → f( ) → Target, where f is unknown; data mining model: Input → h( ) → Target, where h is the hypothesis]
- Need to find a “good” h; h is your data mining “algorithm”.
- Input data has to be appropriate: select and transform as needed.
- Correct modeling of the target is crucial.
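The relationship between the unknown f and the hypothesis h can be illustrated with a small sketch. Everything here is an assumption made for illustration: the stand-in function `f`, the noise level, and the choice of a straight line for `h` are not from the slides, and in a real project f can never be written down like this.

```python
import random

def f(x):
    # Hypothetical stand-in for the real-world relationship.
    # In practice f is unknown; it appears here only to generate data.
    return 3.0 * x + 2.0

def fit_line(xs, ys):
    """Least-squares fit of a line h(x) = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

rng = random.Random(0)
inputs = [rng.uniform(0, 10) for _ in range(200)]
targets = [f(x) + rng.gauss(0, 1) for x in inputs]  # noisy observations of f

slope, intercept = fit_line(inputs, targets)

def h(x):
    """Hypothesis learned from the data; an approximation of f, not f itself."""
    return slope * x + intercept

print(f"h(x) = {slope:.2f} * x + {intercept:.2f}")
```

The model only ever sees `inputs` and `targets`; how close h comes to f depends on the data being appropriate and on the target being modeled correctly.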
Top 10 Expectations
BEST PRACTICE: LEARN FROM EXPERIENCE
Expectation Ten
Marketing: People can start data mining in 10 minutes…
More scientific: Better models come from days, weeks, or months of iterative improvement.
Expectation Nine
Marketing: Data miners can provide provably good models with little or zero knowledge of the specific industry…
More scientific: Knowing the industry and organizational goals helps orient the questions, modeling, and analysis.
Expectation Eight
Marketing: Open source software can provide quality results worthy of peer-reviewed literature…
More scientific: Commercial software with years-long service options is required for enterprise scale.
Expectation Seven
Marketing: We can learn a lot from the current data warehouses, cubes, and big data…
More scientific: We can improve our modeling by creating new data collection strategies.
Expectation Six
Marketing: People can build data mining models with little or zero data cleaning…
More scientific: Better results happen when we organize and rearrange data for best success.
Expectation Five
Marketing: Data mining can provide answers to problems…
More scientific: Most times we only get detailed insights toward larger problems, and sometimes we uncover more problems than we started with.
Expectation Four
Marketing: A little data mining knowledge can provide an organization with a competitive edge…
More scientific: The edge grows along with experience and better study of the methodology and mathematics.
Expectation Three
Marketing: Individual professionals can deliver excellent predictive analysis…
More scientific: Small teams working together can help quickly and efficiently conquer some of the most difficult analytic challenges.
Expectation Two
Marketing: Numbers speak for themselves and can influence better decision making…
More scientific: Leadership strategy helps teams deliver results in the best way given the current culture.
Expectation One
Marketing: A lot of data mining best practices and strategies can be communicated in an hour or a day…
More scientific: The best commitment is ongoing education on both data mining and machine learning technology.
Best practice: study individual attributes
- Histograms and frequencies (discrete)
- Kernel density estimates
- Cumulative distribution function
- Rank-order plots and lift charts
- Summary statistics (continuous)
- Box-and-whisker plots
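Several of the single-attribute views listed above can be computed with the Python standard library alone. The following is a minimal sketch; the function names and the sample values are hypothetical, and plotting is left out so that the block stays self-contained.

```python
import statistics
from bisect import bisect_right
from collections import Counter

def frequencies(values):
    """Frequency table for a discrete attribute."""
    return Counter(values)

def summary(values):
    """Summary statistics for a continuous attribute.

    The quartiles are the numbers a box-and-whisker plot would draw.
    """
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {
        "n": len(values),
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),
    }

def ecdf(values):
    """Empirical cumulative distribution: (value, fraction of data <= value)."""
    ordered = sorted(values)
    n = len(ordered)
    return [(v, bisect_right(ordered, v) / n) for v in sorted(set(ordered))]

def histogram(values, bins):
    """Counts per equal-width bin; the maximum value falls in the last bin."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    if width == 0:
        # All values identical: everything lands in the first bin.
        return [len(values)] + [0] * (bins - 1)
    counts = [0] * bins
    for v in values:
        counts[min(int((v - lo) / width), bins - 1)] += 1
    return counts

# Hypothetical sample attributes.
region = ["north", "south", "north", "east", "north"]
age = [23, 35, 35, 41, 52, 29, 35, 60]

print(frequencies(region))
print(summary(age))
print(ecdf(age))
print(histogram(age, bins=4))
```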
How to Choose an Algorithm
- Choosing an algorithm or series of algorithms is an art.
- One algorithm could perform different tasks.
- Be willing to experiment with algorithms and algorithm parameters.
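One way to experiment with algorithm parameters is to score each candidate on data held back from training. The sketch below is an illustration under assumptions: the k-nearest-neighbours classifier, the synthetic data, and the candidate values of `k` are all hypothetical choices, not something the slides prescribe.

```python
import random

def knn_predict(train, x, k):
    """Majority label (0 or 1) among the k training rows nearest to x."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if votes * 2 > k else 0

def accuracy(train, test, k):
    """Fraction of held-out rows that the k-NN rule labels correctly."""
    hits = sum(knn_predict(train, x, k) == label for x, label in test)
    return hits / len(test)

rng = random.Random(42)

# Synthetic data: the label is 1 when x > 5, with about 15% of labels flipped.
rows = []
for _ in range(300):
    x = rng.uniform(0, 10)
    label = int(x > 5)
    if rng.random() < 0.15:
        label = 1 - label
    rows.append((x, label))

train, holdout = rows[:200], rows[200:]

# Try several odd values of k (odd values avoid tied votes).
results = {k: accuracy(train, holdout, k) for k in (1, 5, 15, 45)}
best_k = max(results, key=results.get)

for k, score in results.items():
    print(f"k={k:>2}: holdout accuracy {score:.2f}")
print("best k on this holdout set:", best_k)
```

A single holdout split gives a noisy comparison, so differences of a few points between candidates should not be over-interpreted.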
Algorithms for Data Mining Tasks (1 of 2)
Algorithm Name | Description
Microsoft Time Series | Analyzes time-related data by using a linear decision tree. Patterns can be used to predict future values in the time series.
Microsoft Decision Trees | Makes predictions based on the relationships between columns in the dataset, and models the relationships as a tree-like series of splits on specific values. Supports the prediction of both discrete and continuous attributes.
Microsoft Linear Regression | If there is a linear dependency between the target variable and the variables being examined, finds the most efficient relationship between the target and its inputs. Supports prediction of continuous attributes.
Microsoft Clustering | Identifies relationships in a dataset that you might not logically derive through casual observation. Uses iterative techniques to group records into clusters that contain similar characteristics.
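The phrase "iterative techniques to group records into clusters" can be made concrete with a plain k-means loop. This is a generic textbook sketch on one-dimensional data, not the Microsoft Clustering implementation; the function names, starting centers, and sample values are hypothetical.

```python
def assign(values, centers):
    """Put each value into the group of its nearest center."""
    groups = [[] for _ in centers]
    for v in values:
        nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
        groups[nearest].append(v)
    return groups

def kmeans_1d(values, centers, max_iterations=20):
    """Alternate between assigning values and moving centers until stable."""
    centers = list(centers)
    for _ in range(max_iterations):
        groups = assign(values, centers)
        # Move each center to the mean of its group; keep it if the group is empty.
        updated = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
        if updated == centers:
            break  # no center moved, so the clustering has converged
        centers = updated
    return centers, assign(values, centers)

# Hypothetical attribute with two visible groups.
spend = [1, 2, 3, 10, 11, 12]
centers, groups = kmeans_1d(spend, centers=[0, 5])
print(centers)  # [2.0, 11.0]
print(groups)   # [[1, 2, 3], [10, 11, 12]]
```

The result depends on the starting centers, which is one reason the slides recommend experimenting with algorithm parameters.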
Algorithms for Data Mining Tasks (2 of 2)
Algorithm Name | Description
Microsoft Naïve Bayes | Finds the probability of the relationship between all input and predictable columns. This algorithm is useful for quickly generating mining models to discover relationships. Supports only discrete or discretized attributes. Treats all input attributes as independent.
Microsoft Logistic Regression | Analyzes the factors that contribute to an outcome, where the outcome is restricted to two values, usually the occurrence or non-occurrence of an event. Supports the prediction of both discrete and continuous attributes.
Microsoft Neural Network | Analyzes complex input data or business problems for which a significant quantity of training data is available but for which rules cannot be easily derived by using other algorithms. Can predict multiple attributes. Can be used for classification of discrete attributes and regression of continuous attributes.
Microsoft Association Rules | Builds rules that describe which items are likely to appear together in a transaction.
Microsoft Sequence Clustering | Identifies clusters of similarly ordered events in a sequence. Provides a combination of sequence analysis and clustering.
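The idea behind association rules, items that are likely to appear together in a transaction, can be sketched with the standard support and confidence measures. This is a generic illustration restricted to single-item rules, not the Microsoft Association Rules implementation; the function name, thresholds, and baskets are hypothetical.

```python
from itertools import permutations

def pair_rules(transactions, min_support=0.3, min_confidence=0.6):
    """Rules 'a -> b' between single items.

    support    = share of transactions containing both a and b
    confidence = share of transactions containing a that also contain b
    """
    baskets = [set(t) for t in transactions]
    n = len(baskets)
    items = sorted(set().union(*baskets))
    rules = []
    for a, b in permutations(items, 2):
        both = sum(1 for t in baskets if a in t and b in t)
        has_a = sum(1 for t in baskets if a in t)
        support = both / n
        confidence = both / has_a
        if support >= min_support and confidence >= min_confidence:
            rules.append((a, b, support, confidence))
    return rules

# Hypothetical market baskets.
transactions = [
    ["bread", "milk"],
    ["bread", "butter"],
    ["bread", "milk", "butter"],
    ["milk"],
]

for a, b, support, confidence in pair_rules(transactions):
    print(f"{a} -> {b}: support {support:.2f}, confidence {confidence:.2f}")
```

Rules are directional: in this sample every basket with butter also contains bread, but not every basket with bread contains butter.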
Best practice: Document your science
- Describe the business problem
- Determine how to measure success (including a baseline)
- Document what was learned during data preparation and analysis
- Justify the algorithms used during the investigation
- List the assumptions that were made
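Measuring success against a baseline can be as simple as comparing the model with the rule "always predict the most common outcome". The sketch below assumes accuracy as the success measure and uses made-up labels and predictions; the slides do not say which metric or baseline to use.

```python
from collections import Counter

def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most common training label."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return sum(1 for y in test_labels if y == majority) / len(test_labels)

def accuracy(predictions, actual):
    """Fraction of predictions that match the actual labels."""
    return sum(1 for p, y in zip(predictions, actual) if p == y) / len(actual)

def gain_over_baseline(model_accuracy, baseline_accuracy):
    """Improvement of the model over the baseline, in accuracy points."""
    return model_accuracy - baseline_accuracy

# Hypothetical labels and model output.
train_labels = ["no"] * 8 + ["yes"] * 2
test_labels = ["no", "no", "no", "yes", "yes"]
model_predictions = ["no", "no", "yes", "yes", "yes"]

baseline = majority_baseline_accuracy(train_labels, test_labels)
model = accuracy(model_predictions, test_labels)
print(f"baseline {baseline:.2f}, model {model:.2f}, "
      f"gain {gain_over_baseline(model, baseline):.2f}")
```

Recording the baseline alongside the model score makes it clear how much of the result is due to the model and how much to the class balance of the data.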
Leadership challenges
- Build on organizational communications
- Consider redoing analysis
- Find results champions
- Celebrate the results
Best practice: prepare the next cycle
- Note strengths, weaknesses, opportunities, risks
- Build consensus on model expiration dates
- Encourage and improve the process
- Create insight into new future data collection
Conclusion: Best Practices Framework
- Provide a data mining foundation
- Prepare the data
- Evaluate machine learning output
- Plan to move toward actionable decisions
Resources
- Free Win x64 Python libs: http://www.lfd.uci.edu/~gohlke/pythonlibs/
- Commercial Python: http://www.enthought.com/products/epd.php
- R Tutorial: http://www.burns-stat.com/pages/Tutor/R_inferno.pdf
- SQL Server Analysis Services Data Mining: http://technet.microsoft.com/en-us/sqlserver/cc510301.aspx
- Data Mining Portal: http://marktab.net
- Data Mining Team Portal: http://sqlserverdatamining.com
- Books: “Data Mining with SQL Server 2008”, “Data Mining for Business Intelligence”, “Practical Time Series Forecasting”