The Software Infrastructure for Electronic Commerce Databases and Data Mining Lecture 4: An Introduction To Data Mining (II) Johannes Gehrke


Lectures Three and Four Data preprocessing Multidimensional data analysis Data mining Association rules Classification trees Clustering

Types of Attributes Numerical: Domain is ordered and can be represented on the real line (e.g., age, income) Nominal or categorical: Domain is a finite set without any natural ordering (e.g., occupation, marital status, race) Ordinal: Domain is ordered, but absolute differences between values is unknown (e.g., preference scale, severity of an injury)

Classification Goal: Learn a function that assigns a record to one of several predefined classes.

Classification Example Example training database Two predictor attributes: Age and Car-type (Sport, Minivan and Truck) Age is ordered, Car-type is categorical attribute Class label indicates whether person bought product Dependent attribute is categorical

Regression Example Example training database Two predictor attributes: Age and Car-type (Sport, Minivan and Truck) Spent indicates how much person spent during a recent visit to the web site Dependent attribute is numerical

Types of Variables (Review) Numerical: Domain is ordered and can be represented on the real line (e.g., age, income) Nominal or categorical: Domain is a finite set without any natural ordering (e.g., occupation, marital status, race) Ordinal: Domain is ordered, but absolute differences between values is unknown (e.g., preference scale, severity of an injury)

Definitions Random variables X1, …, Xk (predictor variables) and Y (dependent variable) Xi has domain dom(Xi), Y has domain dom(Y) P is a probability distribution on dom(X1) × … × dom(Xk) × dom(Y) Training database D is a random sample from P A predictor d is a function d: dom(X1) × … × dom(Xk) → dom(Y)

Classification Problem If Y is categorical, the problem is a classification problem, and we use C instead of Y; |dom(C)| = J. C is called the class label, d is called a classifier. Let r be a record randomly drawn from P. Define the misclassification rate of d: RT(d,P) = P(d(r.X1, …, r.Xk) ≠ r.C) Problem definition: Given a dataset D that is a random sample from a probability distribution P, find a classifier d such that RT(d,P) is minimized.

Regression Problem If Y is numerical, the problem is a regression problem. Y is called the dependent variable, d is called a regression function. Let r be a record randomly drawn from P. Define the mean squared error of d: RT(d,P) = E[(r.Y - d(r.X1, …, r.Xk))²] Problem definition: Given a dataset D that is a random sample from a probability distribution P, find a regression function d such that RT(d,P) is minimized.
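The two error measures above can be estimated directly from a sample. The sketch below is illustrative only: the tiny datasets and the simple predictors are made up, not part of the lecture.

```python
# Estimating RT(d, P) from a sample: misclassification rate for a
# classifier and mean squared error for a regression function.

def misclassification_rate(d, records):
    """Fraction of records (x, c) with d(x) != c."""
    errors = sum(1 for x, c in records if d(x) != c)
    return errors / len(records)

def mean_squared_error(d, records):
    """Average of (y - d(x))^2 over the sample."""
    return sum((y - d(x)) ** 2 for x, y in records) / len(records)

# Classification: predict YES for age < 30, NO otherwise.
clf = lambda x: "YES" if x["age"] < 30 else "NO"
train = [({"age": 25}, "YES"), ({"age": 40}, "NO"), ({"age": 28}, "NO")]
print(misclassification_rate(clf, train))  # 1 of 3 wrong -> 0.333...

# Regression: constant predictor d(x) = 100.
reg = lambda x: 100.0
spend = [({"age": 25}, 120.0), ({"age": 40}, 80.0)]
print(mean_squared_error(reg, spend))  # (20^2 + (-20)^2) / 2 = 400.0
```

On the finite training sample these are only estimates of RT(d, P); the problem definition asks to minimize the error under the true distribution P.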

Goals and Requirements Goals: To produce an accurate classifier/regression function To understand the structure of the problem Requirements on the model: High accuracy Understandable by humans, interpretable Fast construction for very large training databases

Different Types of Classifiers Linear discriminant analysis (LDA) Quadratic discriminant analysis (QDA) Density estimation methods Nearest neighbor methods Logistic regression Neural networks Fuzzy set theory Decision Trees

Difficulties with LDA and QDA Multivariate normal assumption often not true Not designed for categorical variables Form of classifier in terms of linear or quadratic discriminant functions is hard to interpret

Histogram Density Estimation Curse of dimensionality Cell boundaries are discontinuities. Beyond the boundary cells, the estimate falls abruptly to zero.

Kernel Density Estimation How to choose the kernel bandwidth h? The optimal h depends on a criterion The optimal h depends on the form of the kernel The optimal h might depend on the class label The optimal h might depend on the part of the predictor space How to choose the form of the kernel?
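To make the role of the bandwidth h concrete, here is a minimal one-dimensional Gaussian kernel density estimate. The sample points and the h values are made up for illustration.

```python
import math

def kde(x, sample, h):
    """Density estimate at x: average of Gaussian kernels centered
    on the sample points, each with standard deviation h."""
    n = len(sample)
    return sum(
        math.exp(-((x - xi) / h) ** 2 / 2) / (h * math.sqrt(2 * math.pi))
        for xi in sample
    ) / n

sample = [1.0, 1.2, 3.0, 3.1, 3.2]
# A small h gives a spiky, data-hugging estimate; a large h oversmooths:
for h in (0.1, 0.5, 2.0):
    print(h, round(kde(2.0, sample, h), 4))
```

Note how the estimated density at x = 2.0 (a gap between the two clumps of points) changes with h, which is exactly the choice discussed above.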

K-Nearest Neighbor Methods Difficulties: Data must be stored; to classify a new record, all data must be available Computationally expensive in high dimensions No obvious way to choose k
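A bare-bones k-nearest-neighbor classifier illustrates the costs listed above: every training record is kept, and classifying one new record scans all of them. The data and the choice k = 3 are made up.

```python
from collections import Counter

def knn_classify(query, train, k):
    """Majority class among the k training records closest to query
    (squared Euclidean distance on the predictor tuple)."""
    dist = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    neighbors = sorted(train, key=lambda rec: dist(rec[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# (age, income) -> bought product?
train = [((25, 30000), "YES"), ((27, 32000), "YES"),
         ((45, 60000), "NO"), ((50, 58000), "NO")]
print(knn_classify((26, 31000), train, k=3))  # -> YES
```

The full scan per query is what makes the method expensive; in practice, index structures or sampling are used to mitigate it.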

Difficulties with Logistic Regression Few goodness of fit and model selection techniques Categorical predictor variables have to be transformed into dummy vectors.
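One way to perform the transformation mentioned above is one-hot (dummy) encoding: each categorical value becomes a 0-1 indicator vector. This sketch is illustrative; real pipelines often drop one category to avoid collinearity with an intercept term.

```python
def dummy_encode(values):
    """Map each categorical value in the list to a 0-1 indicator vector."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    vectors = []
    for v in values:
        vec = [0] * len(categories)
        vec[index[v]] = 1          # one position set per record
        vectors.append(vec)
    return categories, vectors

cats, vecs = dummy_encode(["Sport", "Minivan", "Truck", "Minivan"])
print(cats)  # ['Minivan', 'Sport', 'Truck']
print(vecs)  # [[0, 1, 0], [1, 0, 0], [0, 0, 1], [1, 0, 0]]
```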

Neural Networks and Fuzzy Set Theory Difficulties: Classifiers are hard to understand How to choose network topology and initial weights? Categorical predictor variables?

What are Decision Trees? [Figure: a decision tree with a root split on Age (<30 vs. >=30); the Age < 30 branch splits on Car Type (Minivan vs. Sports, Truck), with leaves labeled YES and NO. Next to the tree, the same classifier is shown as a partitioning of the Age axis (0 to 60).]

Decision Trees A decision tree T encodes d (a classifier or regression function) in form of a tree. A node t in T without children is called a leaf node. Otherwise t is called an internal node.

Internal Nodes Each internal node has an associated splitting predicate. Most common are binary predicates. Example predicates: Age <= 20 Profession in {student, teacher} 5000*Age + 3*Salary – > 0

Internal Nodes: Splitting Predicates Binary univariate splits: Numerical or ordered X: X <= c, c in dom(X) Categorical X: X in A, A subset of dom(X) Binary multivariate splits: Linear combination split on numerical variables: Σ a_i X_i <= c k-ary (k > 2) splits are analogous
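Each split form is just a boolean predicate on a record. The sketch below writes the three forms as functions of a record (a dict); the field names and the constant 200000 in the linear split are made up for illustration.

```python
# Univariate split on a numerical/ordered attribute: X <= c
age_split = lambda r: r["age"] <= 20

# Univariate split on a categorical attribute: X in A, A subset of dom(X)
prof_split = lambda r: r["profession"] in {"student", "teacher"}

# Linear combination split on numerical attributes: sum(a_i * X_i) <= c
linear_split = lambda r: 5000 * r["age"] + 3 * r["salary"] <= 200000

r = {"age": 19, "profession": "student", "salary": 15000}
print(age_split(r), prof_split(r), linear_split(r))  # True True True
```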

Leaf Nodes Consider a leaf node t Classification problem: Node t is labeled with one class label c in dom(C) Regression problem: Two choices Piecewise constant model: t is labeled with a constant y in dom(Y). Piecewise linear model: t is labeled with a linear model Y = y_t + Σ a_i X_i

Example Encoded classifier: If (age < 30 and carType = Minivan) Then YES If (age < 30 and (carType = Sports or carType = Truck)) Then NO If (age >= 30) Then NO [Figure: the corresponding decision tree.]
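The point of the slide is that the tree and a set of if-rules encode the same function; transcribed into code, the rules read:

```python
def classify(age, car_type):
    """The example classifier, rule for rule."""
    if age < 30 and car_type == "Minivan":
        return "YES"
    if age < 30 and car_type in ("Sports", "Truck"):
        return "NO"
    return "NO"  # age >= 30

print(classify(25, "Minivan"))  # YES
print(classify(25, "Truck"))    # NO
print(classify(40, "Minivan"))  # NO
```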

Choice of Classification Algorithm? Example study: (Lim, Loh, and Shih, Machine Learning 2000) 33 classification algorithms 16 (small) data sets (UC Irvine ML Repository) Each algorithm applied to each data set Experimental measurements: Classification accuracy Computational speed Classifier complexity

Classification Algorithms Tree-structure classifiers: IND, S-Plus Trees, C4.5, FACT, QUEST, CART, OC1, LMDT, CAL5, T1 Statistical methods: LDA, QDA, NN, LOG, FDA, PDA, MDA, POL Neural networks: LVQ, RBF

Experimental Details 16 primary data sets, created 16 more data sets by adding noise Converted categorical predictor variables to 0-1 dummy variables if necessary Error rates for 6 data sets estimated from supplied test sets, 10-fold cross-validation used for the other data sets
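The 10-fold cross-validation used for the remaining data sets works as follows: split the data into 10 folds, train on 9, measure error on the held-out fold, and average. The sketch below uses a made-up toy learner (`fit_majority` predicts the majority class of its training fold); any fit/error pair could be plugged in.

```python
def cross_validate(records, fit, error, folds=10):
    """Average held-out error over `folds` disjoint test folds."""
    rates = []
    for i in range(folds):
        test = records[i::folds]                     # every folds-th record
        train = [r for j, r in enumerate(records) if j % folds != i]
        model = fit(train)
        rates.append(error(model, test))
    return sum(rates) / folds

def fit_majority(train):
    labels = [c for _, c in train]
    return max(set(labels), key=labels.count)        # most frequent class

def error_rate(model, test):
    return sum(1 for _, c in test if c != model) / len(test)

data = [((i,), "A" if i % 3 else "B") for i in range(30)]
print(cross_validate(data, fit_majority, error_rate))
```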

Ranking by Mean Error Rate
Rank  Algorithm             Mean Error  Time
1     Polyclass                         hours
2     Quest Multivariate                min
3     Logistic Regression               min
6     LDA                               s
8     IND CART                          s
12    C4.5 Rules                        s
16    Quest Univariate                  s
…

Other Results The number of leaves for tree-based classifiers varied widely (median number of leaves between 5 and 32, removing some outliers) Mean misclassification rates for the top 26 algorithms are not statistically significantly different; the bottom 7 algorithms have significantly higher error rates

Decision Trees: Summary Powerful data mining model for classification (and regression) problems Easy to understand and to present to non- specialists TIPS: Even if black-box models sometimes give higher accuracy, construct a decision tree anyway Construct decision trees with different splitting variables at the root of the tree

Clustering Input: Relational database with fixed schema Output: k groups of records called clusters, such that the records within a group are more similar to each other than to records in other groups More difficult than classification (unsupervised learning: no record labels are given) Usage: Exploratory data mining Preprocessing step (e.g., outlier detection)

Clustering (Contd.) In clustering we partition a set of records into meaningful sub-classes called clusters. Cluster: a collection of data objects that are similar to one another and thus can be treated collectively as one group. Clustering helps users to detect inherent groupings and structure in a data set.

Clustering (Contd.) Example input database: Two numerical variables How many groups are there? Requirement: We need to define similarity between records

Graphical Representation

Clustering (Contd.) Output of clustering: Representative points for each cluster Labeling of each record with its cluster number Other descriptions of each cluster Important: Use the right distance function Scale or normalize all attributes (example: seconds vs. hours vs. days) Assign different weights according to the importance of each attribute
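Why "use the right distance function" matters: without scaling, an attribute measured in seconds can dominate one measured in days. The sketch below applies min-max normalization (mapping each attribute to [0, 1]) and then a weighted Euclidean distance; the data and weights are made up.

```python
def min_max_normalize(records):
    """Scale each column of a list of numeric tuples to [0, 1]."""
    cols = list(zip(*records))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [
        tuple((v - l) / (h - l) if h > l else 0.0
              for v, l, h in zip(r, lo, hi))
        for r in records
    ]

def weighted_distance(a, b, weights):
    """Euclidean distance with one weight per attribute."""
    return sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)) ** 0.5

raw = [(30, 2), (86400, 3), (43200, 10)]   # (seconds, days): wildly different scales
norm = min_max_normalize(raw)
print(norm[0], weighted_distance(norm[0], norm[1], (1.0, 1.0)))
```

After normalization both attributes contribute comparably; raising or lowering a weight then expresses its importance.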

Clustering: Summary Finding natural groups in data Common post-processing steps: Build a decision tree with the cluster label as class label Try to explain the groups using the decision tree Visualize the clusters Examine the differences between the clusters with respect to the fields of the dataset Try different numbers of clusters

Web Usage Mining Data sources: Web server log Information about the web site: Site graph Metadata about each page (type, objects shown) Object concept hierarchies Preprocessing: Detect session and user context (Cookies, user authentication, personalization)
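A common way to detect sessions in a web server log when no cookie or login identifies the session directly is an inactivity timeout: group requests by user and start a new session after a gap. The 30-minute timeout below is a conventional heuristic, and the log records are made up; both are assumptions, not part of the lecture.

```python
from datetime import datetime, timedelta

def sessionize(log, timeout=timedelta(minutes=30)):
    """log: list of (user, timestamp, url), assumed sorted by time.
    Returns {user: [session, ...]}, each session a list of urls."""
    sessions = {}
    last_seen = {}
    for user, ts, url in log:
        if user not in sessions or ts - last_seen[user] > timeout:
            sessions.setdefault(user, []).append([])   # open a new session
        sessions[user][-1].append(url)
        last_seen[user] = ts
    return sessions

t = datetime(2000, 1, 1, 12, 0)
log = [("u1", t, "/home"),
       ("u1", t + timedelta(minutes=5), "/products"),
       ("u1", t + timedelta(hours=2), "/home")]       # long gap -> new session
print(sessionize(log)["u1"])  # [['/home', '/products'], ['/home']]
```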

Web Usage Mining (Contd.) Data Mining Association Rules Sequential Patterns Classification Action Personalized pages Cross-selling Evaluation and Measurement Deploy personalized pages selectively Measure effectiveness of each implemented action

Large Case Study: Churn Telecommunications industry Try to predict churn (whether customer will switch long-distance carrier) Dataset: 5000 records (tiny dataset, but manageable here in class) 21 attributes, both numerical and categorical attributes (very few attributes) Data is already cleaned! No missing values, inconsistencies, etc. (again, for classroom purposes)

Churn Example: Dataset Columns State Account length: Number of months the customer has been with the company Area code Phone number International plan: yes/no Voice mail: yes/no Number of voice mail messages: Average number of voice messages per day Total (day, evening, night, international) minutes: Average number of minutes charged Total (day, evening, night, international) calls: Average number of calls made Total (day, evening, night, international) charge: Average amount charged per day Number of customer service calls: Number of calls made to customer support in the last six months Churned: Did the customer switch long-distance carriers in the last six months

Churn Example: Analysis We start out by getting familiar with the dataset Record viewer Statistics visualization Evidence classifier Visualizing joint distributions Visualizing geographic distribution of churn

Churn Example: Analysis (Contd.) Building and interpreting data mining models Decision trees Clustering

Evaluating Data Mining Tools

Checklist: Integration with current applications and your data management infrastructure Ease of use Automation Scalability to large datasets Number of records Number of attributes Datasets larger than main memory Support for sampling Export of models into your enterprise Stability of the company that offers the product

Integration With Data Management Proprietary storage format? Native support of major database systems: IBM DB2, Informix, Oracle, SQL Server, Sybase ODBC Support of parallel database systems Integration with your data warehouse

Cost Considerations Proprietary or commodity hardware and operating system Client and server might be different What server platforms are supported? Support staff needed Training of your staff members Online training, tutorials On-site training Books, course material

Data Mining Projects Checklist: Start with well-defined business questions Have a champion within the company Define measures of success and failure Main difficulty: No automation Understanding the business problem Selecting the relevant data Data transformation Selection of the right mining methods Interpretation

Understand the Business Problem Important questions: What is the problem that we need to solve? Are there certain aspects of the problem that are especially interesting? Do we need data mining to solve the problem? What information is actionable, and when? Are there important business rules that constrain our solution? Whom should we keep in the loop, and with whom should we discuss intermediate results? Who are the (internal) customers of the effort?

Hiring Outside Experts? Factors: One-time problem versus ongoing process Source of data Deployment of data mining models Availability and skills of your own staff

Hiring Experts Types of experts: Your software vendor Consulting companies/centers/individuals Your goal: Develop in-house expertise

The Data Mining Market Revenues for the data mining market: $8 billion (Mega Group 1/1999) Sales of data mining software (Two Crows Corporation 6/99): 1998: $50 million 1999: $75 million 2000: $120 million Hardware companies often use their data mining software as a loss-leader (examples: IBM, SGI)

Knowledge Management in General Percent of information technology executives citing the systems used in their knowledge management strategy (IW 4/1999): Relational Database 95% Text/Document Search 80% Groupware 71% Data Warehouse 65% Data Mining Tools 58% Expert Database/AI Tools 25%

Crossing the Chasm Data mining is currently trying to cross this chasm. Great opportunities, but also great perils. You gain a unique advantage by applying data mining the right way. It is not yet common knowledge how to apply data mining the right way. There are no proven recipes (yet) for making a data mining project work.

Summary Database and data mining technology is crucial for any enterprise We talked about the complete data management infrastructure DBMS technology Querying WWW/DBMS integration Data warehousing and dimensional modeling OLAP Data mining

Additional Material: Web Sites Data mining companies, jobs, courses, publications, datasets, etc: ACM Special Interest Group on Knowledge Discovery and Data Mining

Additional Material: Books U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, 1996. Michael Berry & Gordon Linoff, Data Mining Techniques for Marketing, Sales and Customer Support, John Wiley & Sons, 1997. Ian Witten and Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Oct 1999. Michael Berry & Gordon Linoff, Mastering Data Mining, John Wiley & Sons, 2000.

Additional Material: Database Systems IBM DB2: Oracle: Sybase: Informix: Microsoft: NCR Teradata:

Questions? Prediction is very difficult, especially about the future. Niels Bohr