Data Mining vs. Statistics


Data Mining vs. Statistics Pavel Brusilovsky

Objectives
- Intro to Data Mining
- Data Mining vs. Statistics
- Data Mining vs. Text Mining
- Applications of Data Mining

Data Mining
- Data mining is a cutting-edge technology for analyzing diverse, multidisciplinary, and multidimensional complex data
- Data mining can identify relationships in multidimensional and heterogeneous data that cannot be identified in any other way
- Successful application of state-of-the-art data mining technology to marketing and sales is indicative of a company's analytic maturity and success
- Working definition: data mining is the process of discovering previously unknown and potentially useful hidden patterns in your data

What is the Taxonomy of Data Mining?
- Data mining taxonomy, based on application:
  - Data Mining
  - Text Mining
  - Web Mining
  - Image Mining…
- Data mining taxonomy, based on the usage of domain knowledge:
  - Verification-driven data mining: associated with traditional quantitative approaches that permit a decision maker to express and verify organizational and personal domain knowledge
  - Discovery-driven data mining: tied to knowledge discovery technology capable of automatically discovering previously unknown patterns hidden in the data
  - Combining both classes leads to a synergy that can produce meaningful and reliable results that may not be obtainable within either class of data mining on its own
- Data mining taxonomy, based on estimation paradigm:
  - Supervised learning
  - Unsupervised learning

What is the difference between “Search” and “Discover”?
Source: http://www.knowledgetechnologies.org/proceedings/presentations/treloar/nathantreloar.ppt

Example: Amazon.com purchase suggestions
- Amazon.com increased sales by 15% using purchase suggestions generated by data/text mining

Data Mining and Related Fields
- Statistics: “The model is king” (Hand)
- Data Mining: “The data is king”

Is Data Mining an extension of Statistics?
- Data Mining and Statistics: mutual fertilization with convergence
- Statistical Data Mining (graduate course, George Mason University)
- Statistical Data Mining and Knowledge Discovery (hardcover), edited by Hamparsum Bozdogan
  - An overview of Bayesian and frequentist issues that arise in multivariate statistical modeling involving data mining
- Data Mining with Stepwise Regression (Dean Foster, Wharton School):
  - Use interactions to capture non-linearities
  - Use the Bonferroni adjustment to pick the variables to include
  - Use the sandwich estimator to get robust standard errors
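The Bonferroni step in the stepwise-regression item can be illustrated with a small sketch. This is not Foster's procedure itself, only the generic adjustment: when m candidate terms are tested, a term is admitted only if its p-value is at most alpha / m. The function name, the candidate terms, and the p-values are hypothetical.

```python
def bonferroni_select(p_values, alpha=0.05):
    """Return the candidate terms whose p-value passes the Bonferroni threshold.

    p_values maps a candidate term name to its p-value. With m candidates,
    the per-test threshold is alpha / m.
    """
    m = len(p_values)
    threshold = alpha / m
    return [name for name, p in p_values.items() if p <= threshold]


# Hypothetical p-values for main effects and one interaction term.
candidate_p_values = {"x1": 0.0004, "x2": 0.03, "x1*x2": 0.009, "x3": 0.2}

selected = bonferroni_select(candidate_p_values)
print(selected)
```

With four candidates the threshold is 0.05 / 4 = 0.0125, so a term that would pass an unadjusted 0.05 test (x2 here) is left out. The adjustment matters most when the candidate pool is large, which is the usual situation once interactions are included.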

What are Data Mining Myths?
- Myth 1: Data mining automatically discovers hidden patterns in your data
- Myth 2: Data mining is designed for business analysts who are not professionals in quantitative fields
- Myth 3: Data mining findings can be easily translated into decision-maker actions
- Myth 4: Data mining encompasses decision analysis/decision support technology

What are the logical steps of Data Mining?
SEMMA methodology (SAS Enterprise Miner)
- The core process of conducting a data mining study includes the following steps (SEMMA):
  - Sample
  - Explore
  - Modify
  - Model
  - Assess
- SEMMA is a logical organization of the functional tool set of SAS Enterprise Miner for carrying out the core tasks of data mining
- SEMMA is focused on the model development aspects of data mining
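The five SEMMA steps can be walked through on a toy problem. The sketch below is only an illustration in plain Python, not SAS Enterprise Miner code; the data set (spend vs. sales), the mean imputation, and the straight-line model are all assumptions chosen to keep each step to a few lines.

```python
import random

random.seed(0)

def mean(values):
    values = list(values)
    return sum(values) / len(values)

# Hypothetical raw data: (spend, sales) pairs; every 25th spend value is missing.
raw = [(x, 2.0 * x + random.gauss(0, 1)) for x in range(1, 201)]
raw = [(None if i % 25 == 0 else x, y) for i, (x, y) in enumerate(raw)]

# Sample: draw a random subset and split it into training and validation parts.
sample = random.sample(raw, 100)
train, valid = sample[:70], sample[70:]

# Explore: simple summaries of the training inputs.
known = [x for x, _ in train if x is not None]
n_missing = len(train) - len(known)
mean_x = mean(known)

# Modify: impute missing inputs with the training mean.
def impute(rows):
    return [(mean_x if x is None else x, y) for x, y in rows]

train, valid = impute(train), impute(valid)

# Model: least-squares line fitted to the training data.
xbar = mean(x for x, _ in train)
ybar = mean(y for _, y in train)
slope = (sum((x - xbar) * (y - ybar) for x, y in train)
         / sum((x - xbar) ** 2 for x, _ in train))
intercept = ybar - slope * xbar

# Assess: mean absolute error on the held-out validation data.
mae = mean(abs(y - (intercept + slope * x)) for x, y in valid)
print(f"missing in train: {n_missing}, slope: {slope:.3f}, validation MAE: {mae:.3f}")
```

Keeping the validation rows out of the Explore, Modify, and Model steps is what makes the Assess step meaningful: the imputation value and the fitted line come from the training data only.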

CRoss-Industry Standard Process for Data Mining (CRISP-DM)
SPSS Clementine
- Six phases of CRISP-DM:
  - Business understanding
  - Data understanding
  - Data preparation
  - Modeling
  - Evaluation
  - Model deployment
- www.crisp-dm.org

Statistics vs. Data Mining: Concepts

Feature | Statistics | Data Mining
Type of problem | Well structured | Unstructured / semi-structured
Inference role | Explicit inference plays a great role in any analysis | No explicit inference
Objective of the analysis and data collection | First objective formulation, then data collection | Data rarely collected for the objective of the analysis/modeling
Size of data set | Data set is small and hopefully homogeneous | Data set is large and heterogeneous
Paradigm/approach | Theory-based (deductive) | Synergy of theory-based and heuristic-based approaches (inductive)
Signal-to-noise ratio (STNR) | STNR > 3 | 0 < STNR <= 3
Type of analysis | Confirmative | Explorative
Number of variables | Small | Large

Statistics vs. Data Mining: Regression Modeling

Feature | Statistics | Data Mining
Number of inputs | Small | Large
Type of inputs | Interval scaled and categorical with a small number of categories (percentage of categorical variables is small) | Any mixture of interval scaled, categorical, and text variables
Multicollinearity | Wide range of degrees of multicollinearity, with intolerance to multicollinearity | Severe multicollinearity is always there; tolerance to multicollinearity
Distributional assumptions, homoscedasticity, outliers, missing values | Intolerance to violations of distributional assumptions and homoscedasticity, to outliers/leverage points, and to missing values | Tolerance to violations of distributional assumptions, to outliers/leverage points, and to missing values
Type of model | Linear / non-linear / parametric / non-parametric in low-dimensional X-space (intolerance to uncharacterizable non-linearities) | Non-linear and non-parametric in high-dimensional X-space, with tolerance to uncharacterizable non-linearities

What is an unstructured problem?

Aspect | Well-structured Business Problem | Unstructured Business Problem
Definition | Can be described with a high degree of completeness | Cannot be described with a high degree of completeness
Definition | Can be solved with a high degree of certainty | Cannot be resolved with a high degree of certainty
Definition | Experts usually agree on the best method and best solution | Experts often disagree about the best method and best solution
Definition | Can be easily and uniquely translated into a quantitative counterpart | Cannot be easily and uniquely translated into a quantitative counterpart
Goal | Find the best solution | Find a reasonable solution
Complexity | Ranges from very simple to complex | Ranges from complex to very complex

What are the differences between Data/Text Mining and Statistics?
- Statistical analysis is designed to deal with structured data in order to solve structured problems:
  - Results are software and researcher independent
  - Inference reflects statistical hypothesis testing
- Data mining is designed to deal with structured data in order to solve unstructured business problems:
  - Results are software and researcher dependent (absence of implementation standards)
  - Inference reflects computational properties of the data mining algorithm at hand
- Text mining is designed to deal with unstructured data in order to solve unstructured problems:
  - Results are software and researcher dependent
  - Inference reflects computational properties and visualization capability of the text mining algorithm at hand

When is data mining technology appropriate?
Data mining technology is appropriate if:
- The business problem is unstructured
- Accurate prediction is more important than explanation
- The data include a mixture of interval, nominal, ordinal, count, and text variables, and the role and the number of non-numeric variables are essential
- Among those variables there are many irrelevant and redundant attributes
- The relationships among variables may be non-linear, with uncharacterizable non-linearities
- The data are highly heterogeneous, with a large percentage of outliers, leverage points, and missing values
- The sample size is relatively large
Important marketing and sales studies/projects have the majority of these features

Accurate prediction is more important than the explanation

What is the Breiman Uncertainty Principle?
- Accuracy × Interpretability = Breiman’s constant
- The Breiman uncertainty principle means that the higher a method’s accuracy, the lower its interpretability, and vice versa

What are great Data Mining Ideas?
Injecting randomness into the function estimation procedure
- Bagging (Breiman, 1996):
  - Apply the same unstable algorithm to different samples (drawn with replacement) of the original data
  - Different samples yield different models
  - The average of the predictions of these models might be better than the predictions from any single model
- Boosting (Friedman, Hastie, and Tibshirani, 1999):
  - Each model is based on the same original data
  - The first individual model is fit to the original data
  - For the second model, subtract the predicted value from the original target value, and use the difference as the target value to train the second model
  - For the third model, subtract the weighted average of the predictions from the original target value, and use the difference as the target value to train the third model, and so on
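Both ideas can be sketched in a few lines of plain Python. The example below is a minimal illustration, not the cited authors' algorithms as published: the base learner (a one-split regression stump), the toy sine data, the number of models, and the shrinkage factor `rate` are assumptions. The boosting loop uses the common variant in which each new model is fit to the current residuals and added with a fixed weight.

```python
import math
import random

def mean(values):
    return sum(values) / len(values)

def fit_stump(xs, ys):
    """Fit a one-split regression stump by least squares; return a predictor."""
    best = None
    for t in sorted(set(xs))[1:]:          # candidate thresholds, both sides non-empty
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        lm, rm = mean(left), mean(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    if best is None:                       # all x identical: predict the overall mean
        m = mean(ys)
        return lambda x: m
    _, t, lm, rm = best
    return lambda x: lm if x < t else rm

def bagging(xs, ys, n_models=25, rng=random):
    """Average stumps fitted to bootstrap samples (drawn with replacement)."""
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: mean([m(x) for m in models])

def boosting(xs, ys, n_models=50, rate=0.5):
    """Fit each stump to the residuals left by the models before it."""
    models = []
    residuals = list(ys)
    for _ in range(n_models):
        m = fit_stump(xs, residuals)
        models.append(m)
        residuals = [r - rate * m(x) for x, r in zip(xs, residuals)]
    return lambda x: sum(rate * m(x) for m in models)

# Hypothetical toy data: a noisy sine curve.
rng = random.Random(0)
xs = [i / 10 for i in range(100)]
ys = [math.sin(x) + rng.gauss(0, 0.1) for x in xs]

def mse(model):
    return mean([(y - model(x)) ** 2 for x, y in zip(xs, ys)])

single = fit_stump(xs, ys)
bagged = bagging(xs, ys, rng=rng)
boosted = boosting(xs, ys)
print(f"single stump MSE: {mse(single):.4f}")
print(f"bagged stumps MSE: {mse(bagged):.4f}")
print(f"boosted stumps MSE: {mse(boosted):.4f}")
```

The errors printed here are measured on the training data, so they show only how each ensemble combines its models; judging real predictive accuracy would need held-out data.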

What are the best Data Mining Conferences?
- Annual SAS Data Mining Technology Conference
  - The world’s largest data mining conference that balances theory and practice
- Annual International Conference on Knowledge Discovery and Data Mining (KDD)
  - Sponsored by the American Association for Artificial Intelligence (AAAI)
- Annual International Salford Systems Data Mining Conference
  - Focusing on solving real-world challenges
  - Business applications of CART, MARS, TreeNet, and Random Forest
  - Keynote speakers: Jerome Friedman (Stanford University) and Leo Breiman (University of California, Berkeley)

What are the best data mining tools?
- Salford Systems’ tools (CART, Random Forest, MARS, TreeNet)
- SAS Enterprise Miner/Text Miner
- SPSS Clementine
- Megaputer Intelligence PolyAnalyst

References (Data Mining)
- Randall Matignon (2007), Neural Network Modeling Using SAS Enterprise Miner, SAS® Institute Inc.
- David J. Hand (1998), Data Mining: Statistics and More? The American Statistician, Vol. 52, No. 2, May 1998. http://www.amstat.org/publications/tas/hand.pdf
- J.H. Friedman (1997), Data Mining and Statistics: What’s the Connection? Proceedings of the 29th Symposium on the Interface: Computing Science and Statistics, May 1997, Houston, Texas
- Doug Wielenga (2007), Identifying and Overcoming Common Data Mining Mistakes, SAS Global Forum Paper 073-2007
- Nathan Treloar (2002), Text Mining: Tools, Techniques, and Applications. http://www.knowledgetechnologies.org/proceedings/presentations/treloar/nathantreloar.ppt