Principles of data mining


Chapter Two: Principles of data mining

Chapter Overview: the process of data mining; approaches of data mining; categories of data mining problems; information patterns to be discovered; overview of data mining solutions; importance of evaluation; undertaking a data mining task in Weka; review of basic concepts in statistics and probability.

Data Mining Process. [Diagram of the process stages: preparing data, data mining and post-processing, with input data, patterns and output as the items passed between them. Legend: a data mining stage; flow of control from one stage to the next stage; flow of control from one stage to the previous stage; repetition of the tasks at one stage.]

Data Mining Process: Preparation. Tasks: integrating data; getting necessary data details; selecting relevant features; selecting relevant records; data cleaning; dealing with unknown data; data transformation; formatting data into a form acceptable to the mining tool. [Diagram labels for the data at each step: original data sets, collected data set, target data set, pre-processed data set, formatted data set.]

Data Mining Process: Mining. Tasks: determining data mining tasks; assigning roles to data for certain tasks; selecting data mining solution(s) for each task; setting necessary parameters for the solution; collecting result patterns. [Diagram labels: formatted data set; mining solutions Solution1 (p1, p2, …, pn), Solution2 (t1, t2, …, tr) and Solution3 (w1, w2, …, wm) with their parameter settings; patterns.]

Data Mining Process: Post-processing. Tasks: pattern evaluation; pattern selection; pattern interpretation. [Diagram labels: patterns, evaluation criteria, accept / reject, valid patterns, selection, selected patterns, pattern interpretation, knowledge learnt.]

Data Mining Process: Roles of participants in data mining. Participants include: data miners / data analysts (main participants of a DM project); domain experts (main collaborators of a DM project); decision makers (clients of a DM project). There is a risk of human bias in the discovery process. Important roles of domain experts: pattern interpretation (for usefulness); pattern evaluation (for significance); mining options (for suitable tasks, limited); advice on data pre-processing (for suitable operations, limited). The aim is to balance the strengths of human and machine.

Data Mining Approaches. Hypothesis testing approach: top-down, led by a hypothesis statement. Procedure: forming a hypothesis statement; collecting and selecting data of relevance; conducting data analysis and collecting patterns; interpreting the patterns to accept or reject the hypothesis. Discovery approach: bottom-up, without a hypothesis in mind. Procedure: collecting and preparing data of interest; conducting data analysis and discovering possible patterns; evaluating the importance and interestingness of the patterns.

Data Mining Approaches Discovery approach (cont’d) Directed discovery (supervised learning): Certain aspects of the outcome, i.e. the goal, of the discovery have been specified. The discovery is to find those patterns satisfying the goal. e.g. patterns relating to the outcome of a class variable Undirected discovery (unsupervised learning): There is no specification of the goal of the discovery. The discovery is to find those patterns of some kind of significance. e.g. associative links among some attribute values

Data Mining: Problems & Patterns. Classification: construct a classification model to determine the class of a given record. [Diagram: (a) Model development phase: an example data set is fed to a model construction method, which produces a classification model. (b) Model use phase: an unseen data record with undetermined class is fed to the classification model, which returns the data record with the determined class.]
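The two phases of classification can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slide does not name a method, so a 1-nearest-neighbour classifier is swapped in because it is the shortest to write, and the function names and the toy example data set are hypothetical.

```python
from math import dist

def build_model(examples):
    """Model development phase: for 1-nearest-neighbour the 'model'
    is simply the stored example data set (an assumption for brevity)."""
    return list(examples)

def classify(model, record):
    """Model use phase: give the unseen record the class of the
    closest stored example (Euclidean distance)."""
    _, nearest_class = min(model, key=lambda example: dist(example[0], record))
    return nearest_class

# Hypothetical example data set: (feature values, class label).
examples = [
    ((1.0, 1.0), "low"),
    ((1.5, 2.0), "low"),
    ((8.0, 8.0), "high"),
    ((9.0, 7.5), "high"),
]

model = build_model(examples)             # (a) model development phase
predicted = classify(model, (2.0, 1.5))   # (b) model use phase
print(predicted)
```

Any other model form (decision tree, rule list, neural network) fits the same two-phase shape: one step that consumes labelled examples, and one step that maps an unseen record to a class.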

Data Mining: Problems & Patterns. Various forms of classification models: instance space; neural network; decision tree; list of ordered classification rules; function (linear regression); many more …

Data Mining: Problems & Patterns. Cluster detection: measure similarity among data objects and group them into clusters accordingly. [Diagram: input data points are fed to a clustering method, which returns the cluster memberships of the data points.]
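As a rough sketch of the input and output of cluster detection, the following uses a minimal K-means (a partition-based method named later in this chapter). The assumptions are loud ones: Euclidean distance as the similarity measure, the first k points as initial centroids, a fixed number of iterations, and a hypothetical toy data set.

```python
from math import dist

def k_means(points, k, iterations=10):
    """Minimal K-means sketch: returns (memberships, centroids).
    Assumes the first k points are acceptable initial centroids."""
    centroids = list(points[:k])
    memberships = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        memberships = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, m in zip(points, memberships) if m == c]
            if members:
                centroids[c] = tuple(
                    sum(coords) / len(members) for coords in zip(*members)
                )
    return memberships, centroids

# Hypothetical input data points forming two well-separated groups.
points = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (7.5, 8.2), (0.9, 1.1), (8.1, 7.9)]
memberships, centroids = k_means(points, k=2)
print(memberships)
```

The returned list plays the role of the cluster memberships of the data points: position i holds the cluster index of point i.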

Data Mining: Problems & Patterns. Forms of clustering results: clusters of various shapes; hierarchical clustering results; ellipse-shaped clusters.

Data Mining: Problems & Patterns. Association rule mining: discover significant relationships between data objects. [Diagram: an association mining method produces rules of the form X → Y.] Various associations: between values, e.g. Apple → Coke; between categories of values, e.g. Food → Magazine; between values of attributes, e.g. Married:yes → OwnHouse:yes; over a time period, e.g. year 1: Database → year 2: Data Mining.

Data Mining: Problems & Patterns An example Classification model? Clusters? Association rules?

Data Mining Solutions: An Overview. Classification solutions: decision tree, e.g. ID3; k nearest neighbour (kNN), e.g. PEBLS; rules, e.g. Sequential Cover; Bayes' theorem, e.g. Naïve Bayes; artificial neural network. Clustering solutions: partition-based methods, e.g. K-means; hierarchical methods, e.g. agglomeration; density-based methods, e.g. DBSCAN; model-based methods, e.g. Expectation-Maximisation; graph-based methods, e.g. Chameleon.

Data Mining Solutions: An Overview. Association rule solutions: greedy methods, e.g. Apriori; graph-based methods, e.g. FP-Growth. Methods for various associations: Boolean associations; generalised associations (multi-level associations); quantitative associations (multidimensional associations); sequential associations (sequential patterns). Since one type of data mining problem can be transformed into another, some solutions for one type can also be applied to another.

Evaluation of Patterns. Importance of evaluating result patterns: a classification model must be accurate enough to be credible; clusters must genuinely exist; association rules must be strong enough to be believed; data descriptions must be general enough to cover a large part of the data set. How do we evaluate the discovered patterns?

Evaluation of Patterns Possible measures of interestingness Objective measures based on data and pattern Conciseness of pattern, e.g. minimum description length Coverage, e.g. coverage for classification rules Reliability, e.g. accuracy of a classification model Peculiarity, e.g. measures of difference from the norm Diversity, e.g. tendency of clusters Subjective measures based on domain knowledge Novelty Surprisingness Usefulness Applicability

Evaluation of Patterns. Commonly used measures. Accuracy rate or error rate for classification models: true positive, false positive, false negative (see section 6.5.1). Quality of clusters: quality of a cluster, overall quality of all clusters (see section 4.5.1). Strengths of associations: support, confidence, lift (see sections 8.1.2 and 8.6).
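The sections referenced above are not part of this transcript, so the sketch below falls back on the usual textbook definitions, which is an assumption: accuracy = (TP + TN) / all records (true negatives are needed although the slide does not list them), support = P(X and Y), confidence = P(Y | X) and lift = confidence / P(Y). The function names, counts and transactions are hypothetical.

```python
def accuracy(tp, tn, fp, fn):
    """Accuracy rate of a classification model from its confusion counts."""
    return (tp + tn) / (tp + tn + fp + fn)

def rule_strength(transactions, x, y):
    """Support, confidence and lift of the rule X -> Y over a list of
    transactions, each transaction being a set of items."""
    n = len(transactions)
    n_x = sum(1 for t in transactions if x <= t)         # contain X
    n_y = sum(1 for t in transactions if y <= t)         # contain Y
    n_xy = sum(1 for t in transactions if (x | y) <= t)  # contain both
    support = n_xy / n
    confidence = n_xy / n_x
    lift = confidence / (n_y / n)
    return support, confidence, lift

# Hypothetical confusion counts and shopping baskets.
acc = accuracy(tp=40, tn=45, fp=5, fn=10)
error_rate = 1 - acc

transactions = [
    {"Apple", "Coke"},
    {"Apple", "Coke", "Bread"},
    {"Apple"},
    {"Coke"},
    {"Bread"},
]
support, confidence, lift = rule_strength(transactions, {"Apple"}, {"Coke"})
print(acc, support, round(confidence, 3), round(lift, 3))
```

A lift above 1 suggests that X and Y occur together more often than they would if they were independent; a lift near 1 suggests the rule carries little information.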

Data Mining in Weka Explorer: the roadmap. [Screenshot roadmap with callouts (1), (2) and (3) covering the Preprocess tab page, the Classify tab page, the Cluster tab page, the Associate tab page and the Tree Visualiser window.]

Data Mining in Weka Explorer Preprocess Generate random data set Display & edit data Save data set into a file Open data set from different sources Filters for pre-processing Data summary Selected attribute summary Attribute display, selection & removal from the opened data set Visualise all attributes Selected attribute visualisation Feedback messages

Data Mining in Weka Explorer Classify (as an example) Method selection & parameter setting Test option setting Result display window Task list. Menu of options available with right click.

Data Mining in Weka Explorer Classify (as an example) Method List Selecting & Changing parameters Selecting a specific method

Data Mining in Weka Explorer Visualisation Scatter plot of data object of different classes An Example Decision Tree

Probability & Statistics: A Brief Review. Where are probability and statistics used? Patterns found from data are probabilistic in nature. Probability and statistics are used: in various measures of evaluation, e.g. the confidence measure of association rules; in the data exploration stage for better understanding, e.g. maximum, minimum, mean, variance, skewness; during the mining process to assist the discovery of patterns, e.g. information gain for decision tree induction; as a part of patterns, e.g. naïve Bayes, Gaussian mixture model; in the comparison of patterns, e.g. a classification model with significantly better accuracy.

Probability & Statistics: A Brief Review. Probability and conditional probability. Probability of an event, P(E), and its meaning when P(E) = 0, P(E) = 1 and 0 < P(E) < 1. Probabilities of multiple events: P(E and F), and P(E or F) = P(E) + P(F) – P(E and F). Mutually exclusive events: P(E and F) = 0, and therefore P(E or F) = P(E) + P(F). Conditional probability of event E given event F: P(E|F) = P(E and F)/P(F). Independent events: P(E and F) = P(E)P(F), and P(E|F) = P(E).
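These rules can be checked by direct counting on a small sample space. The sketch below assumes a fair six-sided die; the events E, F, G and H are hypothetical and chosen only to exercise each rule.

```python
from fractions import Fraction

outcomes = set(range(1, 7))  # sample space of a fair six-sided die

def p(event):
    """Probability of an event as the share of equally likely outcomes."""
    return Fraction(len(event & outcomes), len(outcomes))

E = {2, 4, 6}   # even number
F = {4, 5, 6}   # number greater than 3
G = {1, 3, 5}   # odd number: mutually exclusive with E
H = {1, 2}      # number less than 3: independent of E

# Addition rule: P(E or F) = P(E) + P(F) - P(E and F)
p_e_or_f = p(E | F)
print(p_e_or_f == p(E) + p(F) - p(E & F))

# Mutually exclusive events: P(E and G) = 0, so P(E or G) = P(E) + P(G)
print(p(E & G) == 0 and p(E | G) == p(E) + p(G))

# Conditional probability: P(E|F) = P(E and F) / P(F)
p_e_given_f = p(E & F) / p(F)
print(p_e_given_f)

# Independent events: P(E and H) = P(E)P(H), and P(E|H) = P(E)
p_e_given_h = p(E & H) / p(H)
print(p(E & H) == p(E) * p(H) and p_e_given_h == p(E))
```

Using `Fraction` keeps every probability exact, so the equalities hold without floating-point tolerance.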

Probability & Statistics: A Brief Review Probability & conditional probability (example)

Probability & Statistics: A Brief Review. Probability distribution of random variables. Discrete random variable: P(X = x). Continuous random variable: P(a ≤ X < b). [Figure: distribution plots, with regions marked 68% and 95%.]
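The 68% and 95% labels are most likely the familiar areas of a normal distribution within one and two standard deviations of the mean; the figure itself is not in the transcript, so that reading is an assumption. Under it, P(a ≤ X < b) for a continuous variable is a difference of cumulative probabilities, as this short sketch with Python's standard library shows.

```python
from statistics import NormalDist

# Standard normal variable (mean 0, standard deviation 1); assumed here
# because the slide's figure is not available in the transcript.
z = NormalDist(mu=0.0, sigma=1.0)

def interval_probability(distribution, a, b):
    """P(a <= X < b) for a continuous random variable, via its CDF."""
    return distribution.cdf(b) - distribution.cdf(a)

within_one_sd = interval_probability(z, -1.0, 1.0)
within_two_sd = interval_probability(z, -2.0, 2.0)
print(round(within_one_sd, 4), round(within_two_sd, 4))  # 0.6827 0.9545
```

For a continuous variable the probability of any single value is zero, which is why only interval probabilities are meaningful there.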

Probability & Statistics: A Brief Review Basic Statistics Sample mean, median and mode Variance and standard deviation Skewness
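The listed statistics are all available in, or easy to build on, Python's `statistics` module. The data set below is hypothetical, and the skewness formula is an assumption: the slide does not say which definition it uses, so the moment coefficient (third central moment divided by the 1.5th power of the second) is shown.

```python
import statistics

def skewness(values):
    """Moment coefficient of skewness: m3 / m2 ** 1.5, using
    population (divide-by-n) central moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

ages = [21, 23, 23, 25, 28, 30, 39]   # hypothetical sample

mean_age = statistics.mean(ages)
median_age = statistics.median(ages)
mode_age = statistics.mode(ages)
variance_age = statistics.variance(ages)   # sample variance (divides by n - 1)
sd_age = statistics.stdev(ages)            # sample standard deviation
skew_age = skewness(ages)

print(mean_age, median_age, mode_age)
print(round(variance_age, 3), round(sd_age, 3), round(skew_age, 3))
```

Here the mean exceeds the median and the skewness is positive, both signs of a longer tail towards larger values.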

Probability & Statistics: A Brief Review. Confidence interval estimate. The sample mean is only an estimate of the true mean of the data population. Central limit theorem: sample means follow a normal distribution in which the mean is the true population mean $\mu_X$ and the standard deviation is $\sigma_X/\sqrt{n}$. Based on the central limit theorem, and using the sample standard deviation $s_X$ in place of the true one, the following expression is used to estimate the interval for the true mean at a confidence level of $1 - \alpha$: $\bar{X} \pm t_{\alpha/2,\,n-1} \cdot s_X/\sqrt{n}$.

Probability & Statistics: A Brief Review. Confidence interval estimate (example). For this data set, $n = 12$, $\overline{age} = 26$ and $s_{age} = 7.324$. At a confidence level of 95%, i.e. $1 - \alpha = 0.95$ and $\alpha/2 = 0.025$, with $n - 1 = 11$, $t = 2.201$. The interval estimate is: $26 \pm 2.201 \times 7.324/\sqrt{12} = 26 \pm 4.653$. The interval is estimated as [21.347, 30.653] at a confidence level of 95%.
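The arithmetic of this example can be reproduced directly. The sketch takes the critical value t = 2.201 from the slide rather than computing it, and assumes the usual t-based interval, sample mean ± t · s / √n; the function name is hypothetical.

```python
from math import sqrt

def mean_confidence_interval(sample_mean, sample_sd, n, t_critical):
    """Interval estimate for the true mean: mean +/- t * s / sqrt(n).
    t_critical is looked up for alpha/2 and n - 1 degrees of freedom."""
    margin = t_critical * sample_sd / sqrt(n)
    return sample_mean - margin, sample_mean + margin

# Values from the slide: n = 12, mean age = 26, s = 7.324, t = 2.201.
low, high = mean_confidence_interval(26, 7.324, 12, 2.201)
print(f"[{low:.3f}, {high:.3f}]")  # prints [21.347, 30.653]
```

A larger sample or a smaller standard deviation narrows the interval; a higher confidence level widens it through a larger critical value.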

Probability & Statistics: A Brief Review. Hypothesis testing, as an introduction to statistical inference and statistical significance. Procedure: forming the null and alternative hypotheses; deciding the level of significance p; determining a test statistic and calculating its value; comparing the calculated value against the known (critical) value and deciding whether the null hypothesis should be rejected.

Probability & Statistics: A Brief Review. Hypothesis testing (example). Assuming $\mu_{age} = 25$. Hypotheses: Null: $\mu_{age} = 25$; Alternative: $\mu_{age} \neq 25$. Calculating the statistic t as: $t = (\overline{age} - \mu_{age})/(s_{age}/\sqrt{n}) = (26 - 25)/(7.324/\sqrt{12}) \approx 0.473$. This is less than t = 2.201 for p/2 = 0.025 and n – 1 = 11. Conclusion: the null hypothesis is not rejected, i.e. the difference between the sample mean and the population mean is insignificant.
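The test can be sketched as a one-sample, two-sided t-test. Two assumptions are made here because the transcript lost the formulas: the sample is the one from the confidence interval example (n = 12, sample mean 26, standard deviation 7.324), and 25 is the hypothesised population mean. The critical value 2.201 is taken from the slide; the function name is hypothetical.

```python
from math import sqrt

def one_sample_t(sample_mean, hypothesised_mean, sample_sd, n):
    """Test statistic t = (sample mean - hypothesised mean) / (s / sqrt(n))."""
    return (sample_mean - hypothesised_mean) / (sample_sd / sqrt(n))

t_value = one_sample_t(sample_mean=26, hypothesised_mean=25,
                       sample_sd=7.324, n=12)
t_critical = 2.201   # from the slide: p/2 = 0.025, n - 1 = 11

# Two-sided test: reject the null hypothesis only if |t| exceeds the critical value.
reject_null = abs(t_value) > t_critical
print(round(t_value, 3), reject_null)
```

The same decision follows from the interval estimate: a hypothesised mean of 25 lies inside [21.347, 30.653], so it is not rejected at the 95% level.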

Chapter Summary. The data mining process involves preparation of data, mining of patterns and post-processing of the patterns. Top-down and bottom-up approaches are both useful. The discovery approach can be directed or undirected. Three main streams of data mining tasks and various forms of patterns and models are introduced. Specific solutions are required for specific types of problems. The importance of evaluating patterns must be appreciated. The normal procedure for conducting data mining in Weka is explained. Some important basic concepts in probability and statistics are reviewed.

References. Read Chapter 2 of Data Mining Techniques and Applications. Useful further references: Han, J. and Kamber, M. (2006), Data Mining: Concepts and Techniques, 2nd ed., Morgan Kaufmann Publishers, Chapter 1. Berry, M. J. A. and Linoff, G. (2004), Data Mining Techniques: For Marketing, Sales and Customer Relationship Management, 2nd ed., Wiley Computer Publishing, Chapters 1–2.