Introduction to data mining
G. Marcou, Laboratoire d’infochimie, Université de Strasbourg, 4, rue Blaise Pascal, 67000 Strasbourg.


Motivation of data mining  Automatically discover useful information in large data repositories.  Extract patterns from experience.  Predict the outcome of future observations.  Learning: for a given set of tasks, as experience increases, the performance measure on that set of tasks increases.

Organisation of data  Datasets are organized as instances × attributes.  Instances (synonyms: data points, entries, samples, …)  Attributes (synonyms: factors, variables, measures, …)

Nature of data  Attributes can be:  Numeric, e.g. atom counts: O=1, Cl=4, N=6, S=3.  Nominal, e.g. molecule name: (1-methyl)(1,1,1-tributyl)azanium, tetrahexylammonium.  Continuous, e.g. molecular surface.  Categorical, e.g. phase state: solid, amorphous, liquid, gas, ionized.  Ordered, e.g. intestinal absorption: not absorbed, mildly absorbed, completely absorbed.  Ranges, e.g. spectral domains: visible, UV, IR.  Hierarchical, e.g. EC numbers: EC 1. Oxidoreductases; EC 2. Transferases; EC 3. Hydrolases; EC 4. Lyases; EC 5. Isomerases; EC 6. Ligases (EC 6.1 Forming Carbon-Oxygen Bonds, EC 6.2 Forming Carbon-Sulfur Bonds, EC 6.3 Forming Carbon-Nitrogen Bonds, EC 6.4 Forming Carbon-Carbon Bonds, EC 6.5 Forming Phosphoric Ester Bonds, EC 6.6 Forming Nitrogen-Metal Bonds).

Nature of learning  Unsupervised learning:  Clustering  Rules  Supervised learning:  Classification  Regression  Other:  Reinforcement learning  First-order logic

Concept in data mining  A concept is the target function to be learned.  A concept is learned from:  Attribute-value pairs  Relations  Sequences  Spatial data (Figure: lists of instances, Instance 1, Instance 2, Instance 3, …, drawn from two databases, DB1 and DB2.)

Machine Learning and Statistics  The statistician's point of view (deduction):  Datasets are the expression of underlying probability distributions.  Datasets validate or invalidate prior hypotheses.  The data miner's point of view (induction):  Any hypothesis compatible with the dataset is useful.  Search for all hypotheses compatible with the dataset.

Validation in Data Mining  Validation means that a model is built on a training set of data and then applied to a test set of data.  Success and failure on the test set must be estimated.  The estimate is meant to be representative of any new situation.  Every model must be validated.

Training/Test  Split the dataset into two parts:  One part is the training set.  The other is the test set.

Bootstrapping  Draw N instances with replacement from the dataset.  Create a training set with these instances.  Use the original dataset as the test set; a sketch follows.
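A minimal sketch of this scheme in Python (NumPy assumed; function and variable names are illustrative, not from the slides):

```python
import numpy as np

def bootstrap_split(X, y, rng=np.random.default_rng(0)):
    """Draw N instances with replacement as the training set;
    per the slide, the whole dataset serves as the test set."""
    n = len(X)
    idx = rng.integers(0, n, size=n)      # N draws with replacement
    return X[idx], y[idx], X, y
```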

Cross-Validation  Split the dataset into N subsets.  Use each subset in turn as the test set, while all the others form the training set (see the sketch below).
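A matching N-fold sketch (NumPy again; the initial shuffling is my assumption, the slide only specifies the split):

```python
import numpy as np

def cross_validation_folds(n_instances, n_folds, rng=np.random.default_rng(0)):
    """Yield (train_idx, test_idx) pairs: each subset is the test set once,
    while all the others form the training set."""
    folds = np.array_split(rng.permutation(n_instances), n_folds)
    for i in range(n_folds):
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train, folds[i]
```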

Scrambling  Reassign the classes to the instances at random.  Success and failure are estimated on the scrambled data.  The goal is to estimate how good a success measure can be obtained by pure chance; a sketch follows.
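A sketch of scrambling (also known as y-randomization), assuming a classifier object with scikit-learn-style fit/predict methods; that interface is my assumption, the slide names no API:

```python
import numpy as np

def scrambled_accuracy(model, X, y, n_rounds=20, rng=np.random.default_rng(0)):
    """Estimate the success measure obtainable by pure chance:
    refit the model on randomly reassigned class labels."""
    scores = []
    for _ in range(n_rounds):
        y_scrambled = rng.permutation(y)          # reassign classes at random
        model.fit(X, y_scrambled)
        scores.append(np.mean(model.predict(X) == y_scrambled))
    return float(np.mean(scores))
```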

Clustering  Search for an internal organization of the data.  Optimize relations between instances relative to an objective function.  Typical objective functions: separation, coherence, density, contiguity, conformance to a concept.

Cluster Evaluation  Essential, because any dataset can be clustered, but not every clustering is meaningful.  Evaluation can be:  Unsupervised  Supervised  Relative

Unsupervised cluster evaluation  Cohesion and separation.  Silhouette coefficient (sketched below).  Proximity matrix and cophenetic correlation.  Clustering tendency: compare the p nearest-neighbor distances (NND) between instances (ω_i) with the NND between random points (u_i); in Hopkins-statistic form, H = Σ u_i / (Σ u_i + Σ ω_i).
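As one concrete unsupervised criterion, a minimal silhouette-coefficient sketch (plain NumPy, O(n²) distances for clarity; assumes at least two clusters):

```python
import numpy as np

def silhouette(X, labels):
    """Mean of s_i = (b_i - a_i) / max(a_i, b_i): a_i is the mean distance
    to the instance's own cluster, b_i the smallest mean distance to
    another cluster."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False
        if not same.any():                 # singleton cluster: score 0
            scores.append(0.0)
            continue
        a = D[i, same].mean()
        b = min(D[i, labels == c].mean()
                for c in np.unique(labels) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))
```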

Supervised cluster evaluation  For cluster i and class j, let N_i be the number of members of cluster i, N_j the number of members of class j, and N_ij the number of members of cluster i belonging to class j.  Precision(i, j) = p_ij = N_ij / N_i  Recall(i, j) = N_ij / N_j (Figure: Precision(3, 1) and Recall(3, 1) illustrated on Cluster 3 and Class 1.)
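A sketch of these two tables (NumPy; cluster and class labels given as arrays over the same instances):

```python
import numpy as np

def cluster_precision_recall(clusters, classes):
    """precision(i, j) = N_ij / N_i and recall(i, j) = N_ij / N_j,
    with N_ij the members of cluster i belonging to class j."""
    cluster_ids, class_ids = np.unique(clusters), np.unique(classes)
    precision = np.zeros((len(cluster_ids), len(class_ids)))
    recall = np.zeros_like(precision)
    for a, i in enumerate(cluster_ids):
        for b, j in enumerate(class_ids):
            n_ij = np.sum((clusters == i) & (classes == j))
            precision[a, b] = n_ij / np.sum(clusters == i)
            recall[a, b] = n_ij / np.sum(classes == j)
    return precision, recall
```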

Relative analysis  Compare two clusterings.  Supervised cluster analysis is a special case of relative analysis: the reference clustering is the set of classes.  N_00: number of instance couples in different clusters for both clusterings.  N_11: number of instance couples in the same cluster for both clusterings.  N_01: number of instance couples in different clusters for the first clustering and in the same cluster for the second.  N_10: number of instance couples in the same cluster for the first clustering and in different clusters for the second.  Rand statistic: (N_00 + N_11) / (N_00 + N_01 + N_10 + N_11)  Jaccard statistic: N_11 / (N_01 + N_10 + N_11) Both statistics are sketched below.
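A direct sketch of the four pair counts and the two statistics (pure Python; quadratic in the number of instances):

```python
from itertools import combinations

def rand_jaccard(labels_a, labels_b):
    """Count instance couples by co-membership under two clusterings."""
    n00 = n01 = n10 = n11 = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a and same_b:
            n11 += 1
        elif same_a:          # same cluster in the first only
            n10 += 1
        elif same_b:          # same cluster in the second only
            n01 += 1
        else:
            n00 += 1
    rand = (n00 + n11) / (n00 + n01 + n10 + n11)
    jaccard = n11 / (n01 + n10 + n11)
    return rand, jaccard
```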

A simple clustering algorithm: k-means 1. Select k points as centroids. 2. Form k clusters: each point is assigned to its closest centroid. 3. Reset each centroid to the (geometric) center of its cluster. 4. Repeat from step 2 until no change is observed. 5. Repeat from step 1 until stable average clusters are obtained. A sketch of these steps is given below.
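A sketch of the algorithm (NumPy; random restarts stand in for step 5, keeping the restart with the lowest within-cluster sum of squares):

```python
import numpy as np

def k_means(X, k, n_restarts=10, rng=np.random.default_rng(0)):
    """Steps 1-4 (Lloyd's iterations) wrapped in restarts for step 5."""
    best_labels, best_cost = None, np.inf
    for _ in range(n_restarts):                                   # step 5
        centroids = X[rng.choice(len(X), size=k, replace=False)]  # step 1
        while True:
            d = np.linalg.norm(X[:, None] - centroids[None], axis=-1)
            labels = d.argmin(axis=1)                             # step 2
            new = np.array([X[labels == i].mean(axis=0) if (labels == i).any()
                            else centroids[i] for i in range(k)])  # step 3
            if np.allclose(new, centroids):                       # step 4
                break
            centroids = new
        cost = ((X - centroids[labels]) ** 2).sum()
        if cost < best_cost:
            best_labels, best_cost = labels, cost
    return best_labels
```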

Classification  Definition:  Assign one or several objects to predefined categories.  The target function maps a set of attributes x to a set of classes y.  Learning scheme:  Supervised learning  Attribute-value  Goal:  Predict the outcome of future observations.

Probabilities basics  Conditional probability: the probability of realization of event A knowing that B has occurred, P(A|B) = P(A ∩ B) / P(B).  Independence of random events: P(A ∩ B) = P(A) P(B).  The Bayes equation for independent events x_i: P(A | x_1, …, x_n) = P(A) Π_i P(x_i | A) / Π_i P(x_i).

Statistical approach to classification  Estimate the probability of an instance {x_1, x_2} being of Class 1 or of Class 2. (Figure: the distributions of Class 1 and Class 2 in the (x_1, x_2) plane.)

The Naive Bayes assumption  The probability that an instance {x_1, x_2, …} belongs to class A is difficult to estimate directly: poor statistics.  Consider the Bayes equation: P(A | x_1, x_2, …) = P(x_1, x_2, … | A) P(A) / P(x_1, x_2, …), i.e. posterior probability = likelihood × prior probability / evidence.  With the naive assumption that {x_1, x_2, …} are independent: P(x_1, x_2, … | A) = Π_i P(x_i | A).  The prior probability, the evidence and the likelihood have better estimates: good statistics.

The Naive Bayes Classifier 1. Estimate the prior probability, P(A), for each class. 2. Estimate the likelihood, P(x_i | A), of each attribute for each class. 3. For a new instance, estimate the Bayes score for each class: Score(A) = P(A) Π_i P(x_i | A). 4. Assign the instance to the class which possesses the highest score. The value of C can be optimized. A minimal implementation of these steps is sketched below.
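A minimal sketch of the four steps for nominal attributes (plain Python/NumPy; the Laplace smoothing constant alpha is my addition to avoid zero counts, not something the slide discusses):

```python
import numpy as np

def naive_bayes_train(X, y, alpha=1.0):
    """Steps 1-2: class priors P(A) and per-attribute likelihood tables
    P(x_i|A), with Laplace smoothing alpha (an addition, not on the slide)."""
    model = {}
    for a in np.unique(y):
        rows = X[y == a]
        prior = len(rows) / len(X)
        likelihoods = []
        for col in range(X.shape[1]):
            values, counts = np.unique(rows[:, col], return_counts=True)
            n_values = len(np.unique(X[:, col]))
            denom = len(rows) + alpha * n_values
            table = {v: (c + alpha) / denom for v, c in zip(values, counts)}
            likelihoods.append((table, alpha / denom))  # fallback for unseen values
        model[a] = (prior, likelihoods)
    return model

def naive_bayes_predict(model, x):
    """Steps 3-4: Bayes score P(A) * prod_i P(x_i|A); highest score wins."""
    def score(a):
        prior, likelihoods = model[a]
        s = prior
        for xi, (table, unseen) in zip(x, likelihoods):
            s *= table.get(xi, unseen)
        return s
    return max(model, key=score)
```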

Success and failure  For N instances and a given classifier, for each class i:  N_TP(i), true positives: number of instances of class i correctly classified.  N_FP(i), false positives: number of instances incorrectly assigned to class i.  N_TN(i), true negatives: number of instances of other classes correctly classified.  N_FN(i), false negatives: number of instances of class i incorrectly assigned to other classes.

Confusion Matrix  For N instances, K classes and a classifier:  N_ij, the number of instances of class i classified as class j.

            Class1   Class2   …   ClassK
  Class1    N_11     N_12     …   N_1K
  Class2    N_21     N_22     …   N_2K
  …         …        …        …   …
  ClassK    N_K1     N_K2     …   N_KK

Classification Evaluation  Global measures of success: estimated over all classes at once.  Local measures of success: estimated for each class separately. A sketch of both kinds of measure is given below.
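A sketch computing one global and two local measures from the confusion matrix (accuracy, per-class precision and recall are the usual choices; the slide only names the two families):

```python
import numpy as np

def evaluation_measures(confusion):
    """confusion[i, j] = N_ij, instances of class i classified as class j."""
    accuracy = np.trace(confusion) / confusion.sum()          # global
    recall = confusion.diagonal() / confusion.sum(axis=1)     # local, per class
    precision = confusion.diagonal() / confusion.sum(axis=0)  # local, per class
    return accuracy, precision, recall
```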

Ranking success evaluation  Receiver Operating Characteristic (ROC) curve: recall (sensitivity) plotted against 1 - specificity.  Area Under the ROC Curve (ROC AUC).
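A minimal sketch of the ROC points and AUC from binary classification scores (NumPy; assumes both classes are present and ignores score ties, which a production version would handle):

```python
import numpy as np

def roc_auc(scores, labels):
    """ROC points (1 - specificity, recall) over decreasing score
    thresholds; AUC by the trapezoidal rule."""
    order = np.argsort(-np.asarray(scores))
    labels = np.asarray(labels)[order]
    tp = np.cumsum(labels == 1)                       # true positives so far
    fp = np.cumsum(labels == 0)                       # false positives so far
    recall = np.concatenate([[0.0], tp / tp[-1]])
    fall_out = np.concatenate([[0.0], fp / fp[-1]])   # 1 - specificity
    return fall_out, recall, np.trapz(recall, fall_out)
```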

Losses and Risks  Errors on different class predictions have different costs.  What does it cost to mistakenly assign an instance of one class to another?  Cost matrix C_ij (asymmetric, zero diagonal):

            Class1   Class2   …   ClassK
  Class1    0        C_12     …   C_1K
  Class2    C_21     0        …   C_2K
  …         …        …        …   …
  ClassK    C_K1     C_K2     …   0

 Normalized expected cost.  Probability cost function. A sketch of the expected cost follows.
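A sketch of the (unnormalized) expected cost given a confusion matrix and a cost matrix; the normalization against the accept-all and reject-all extremes shown on the next slide is omitted here, so this is my reading rather than the slide's exact formula:

```python
import numpy as np

def expected_cost(confusion, costs):
    """Average cost per instance, sum_ij N_ij * C_ij / N,
    with confusion[i, j] = N_ij and costs[i, j] = C_ij (zero diagonal)."""
    return (confusion * costs).sum() / confusion.sum()
```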

Cost Curve (Figure: normalized expected cost plotted against the probability cost function; shown are the ideal classifier, a worse classifier, the accept-all and reject-all classifiers, and an actual classifier whose line is determined by its N_FP and N_TP.)

Conclusion  Data mining extracts useful information from datasets.  Clustering:  Unsupervised.  Gives information about the data.  Classification:  Supervised.  Builds models in order to predict the outcome of future observations.

Multi-Linear Regression  Fit a linear model y = ax + b (slope a, intercept b) by minimizing the Sum of Squared Errors (SSE); a least-squares sketch follows.
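A minimal least-squares sketch (NumPy; the multi-attribute case is handled by stacking a column of ones to carry the intercept b):

```python
import numpy as np

def multi_linear_regression(X, y):
    """Minimize SSE = sum((y - X.a - b)^2) via np.linalg.lstsq."""
    A = np.column_stack([X, np.ones(len(X))])   # last column carries b
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[:-1], coef[-1]                  # a (slopes), b (intercept)
```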