
1 Classification

2 Classification and Prediction
Classification: predicts categorical class labels (discrete or nominal). Based on the values of the classifying attribute (the class label) in the training data, it constructs a model and uses that model to classify new data. Prediction: models continuous-valued functions, i.e., predicts unknown or missing values.

3 Classification Problem
Given a database D = {t1, t2, …, tn} and a set of classes C = {C1, …, Cm}, the classification problem is to define a mapping f: D → C where each ti is assigned to one class. Classification effectively divides D into equivalence classes. Prediction is similar, but may be viewed as having an infinite number of classes.

4 Classification Ex: Grading
If x >= 90 then grade = A. If 80 <= x < 90 then grade = B. If 70 <= x < 80 then grade = C. If 60 <= x < 70 then grade = D. If x < 60 then grade = F. (Figure: the corresponding decision tree, splitting on x at 90, 80, 70, and 60.)
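These rules translate directly into code; a minimal sketch (the function name grade is illustrative):

def grade(x):
    # Thresholds follow the rules above: A >= 90, B >= 80, C >= 70, D >= 60, else F.
    if x >= 90:
        return "A"
    elif x >= 80:
        return "B"
    elif x >= 70:
        return "C"
    elif x >= 60:
        return "D"
    else:
        return "F"

print(grade(85))  # B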

5 Classification Ex: Letter Recognition
View letters as constructed from 5 components: Letter A Letter B Letter C Letter D Letter E Letter F

6 Classification Examples
Teachers classify students' grades as A, B, C, D, or F. Identify mushrooms as poisonous or edible. Classify a river as flooding or not. Identify an individual as a certain credit risk. Speech recognition: identify the spoken word from the classified examples (rules). Pattern recognition: identify a pattern from the classified ones. Typical applications: credit approval, target marketing, medical diagnosis, and analysis of treatment effectiveness.

7 Classification: Two Major Steps. Model construction: describing a set of predetermined classes
Each tuple is assumed to belong to a predefined class, as determined by the class label attribute. Training set: the set of tuples used for model construction. The model is represented as (1) classification rules, (2) decision trees, or (3) mathematical formulae. Model usage: classifying future or unknown objects. Estimate the accuracy of the model: feed test samples with known class labels into the model and compare the model's answers with the known classes. Accuracy rate: the percentage of test set samples correctly classified by the model. The test set must be independent of the training set, otherwise over-fitting will occur. If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known. Most common techniques use decision trees, neural networks, or are based on distances or statistical methods.
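For illustration only (not part of the original slides), the two steps map onto a typical scikit-learn workflow; this assumes scikit-learn is installed and uses the iris data as a stand-in for the tenure example:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Step 1: model construction -- learn a classifier from the training set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=0)
model = DecisionTreeClassifier()
model.fit(X_train, y_train)

# Step 2: model usage -- estimate accuracy on the held-out test set,
# then classify new, unseen tuples.
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
print("prediction for one unseen tuple:", model.predict(X_test[:1]))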

8 Process 1: Model Construction
(Figure: Training Data → Classification Algorithm → Classifier (Model). The learned model here is the rule: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'.)

9 Process 2: Using the Model
(Figure: Testing Data and Unseen Data → Classifier. For the unseen tuple (Jeff, Professor, 4), the classifier answers the question: Tenured?)

10 Defining Classes: Partitioning Based; Distance Based

11 Supervised vs. Unsupervised Learning
Supervised learning (classification): supervision means the training data (observations, measurements, etc.) are accompanied by labels indicating the class of each observation; new data are classified based on the training set. Unsupervised learning (clustering): the class labels of the training data are unknown; given a set of observations or measurements, the goal is to establish the existence of classes or clusters in the data.

12 Issue (1): Data Preparation
Data cleaning: preprocess the data in order to reduce noise and handle missing values. Relevance analysis (feature selection): remove irrelevant or redundant attributes. Data transformation: generalize and/or normalize the data. Missing data: ignore it, or replace it with an assumed value.

13 Issue (2): Evaluating Classification Methods
Predictive accuracy: classification accuracy on test data; confusion matrix; OC curve. Speed and scalability: how long it takes to construct the classifier; when the classifier can be used. Robustness: handling noise and missing values. Scalability: efficiency on disk-resident databases. Interpretability: the understanding and insight the model provides; goodness measures for rules, e.g., decision tree size and compactness of classification rules.
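A small plain-Python sketch of computing accuracy and a confusion matrix from predicted vs. actual labels (the label lists below are made up):

from collections import Counter

actual    = ["yes", "no", "yes", "yes", "no", "no", "yes"]
predicted = ["yes", "no", "no",  "yes", "no", "yes", "yes"]

# Count (actual, predicted) pairs -- the confusion matrix as a dictionary.
confusion = Counter(zip(actual, predicted))
accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)

print(confusion)   # e.g. ('yes', 'yes'): 3 correctly classified "yes" tuples
print("accuracy:", accuracy)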

14 Height Example Data

15 Classification Performance
The four outcomes of a binary classifier: True Positive, False Negative, False Positive, True Negative.

16 Confusion Matrix Example
Using the height data example, with Output1 as the correct assignment and Output2 as the actual assignment.

17 Operating Characteristic Curve

18 Classification Using Decision Trees
Partitioning based: divide the search space into rectangular regions; a tuple is placed into a class based on the region within which it falls. DT approaches differ in how the tree is built (DT induction). Internal nodes are associated with an attribute, and arcs with values for that attribute. Algorithms: ID3, C4.5, CART.

19 Decision Tree Given: D = {t1, …, tn} where ti=<ti1, …, tih>
The database schema contains {A1, A2, …, Ah}. Classes C = {C1, …, Cm}. A decision (or classification) tree is a tree associated with D such that: each internal node is labeled with an attribute Ai; each arc is labeled with a predicate that can be applied to the attribute at its parent; each leaf node is labeled with a class Cj.

20 DT Induction

21 DT Splits Area (Figure: the attribute space partitioned into rectangular regions by splits on Gender (M/F) and Height.)

22 Comparing DTs (Figure: two trees for the same data, one balanced and one deep.)

23 Decision Tree Induction
Training Dataset

24 Output: A Decision Tree for “buys_computer”
(Decision tree: the root splits on age. For age <= 30, split on student: no → buys_computer = no, yes → buys_computer = yes. For age 31..40: buys_computer = yes. For age > 40, split on credit_rating: excellent → no, fair → yes. An example from Quinlan's ID3.)
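Read as code, the tree above is just nested tests on age, student, and credit_rating; a minimal sketch of that logic:

def buys_computer(age, student, credit_rating):
    # Root split on age, then on student or credit_rating, as in the tree above.
    if age == "<=30":
        return "yes" if student == "yes" else "no"
    elif age == "31..40":
        return "yes"
    else:  # age > 40
        return "yes" if credit_rating == "fair" else "no"

print(buys_computer("<=30", "yes", "fair"))  # yes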

25 Decision Tree Induction Algorithm
Basic method (a greedy algorithm): the tree is constructed in a top-down, recursive, divide-and-conquer manner. At the start, all the training examples are at the root. Only categorical attributes are handled (continuous-valued attributes are discretized in advance). Examples are partitioned recursively based on selected attributes. Attributes are selected on the basis of (1) a heuristic or (2) a statistical measure (e.g., information gain). Conditions for stopping the partitioning: all examples at a node belong to the same class; there are no attributes left to partition on, in which case the leaf's class is decided by majority voting; there are no samples left. (A compressed sketch of this procedure follows below.)
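A compressed, illustrative sketch of this greedy induction scheme (not the exact ID3 pseudocode), selecting the split attribute by information gain:

import math
from collections import Counter

def entropy(labels):
    # H = -sum p_i * log2(p_i) over the class distribution.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(examples, attr, target):
    base = entropy([ex[target] for ex in examples])
    n = len(examples)
    remainder = 0.0
    for value in set(ex[attr] for ex in examples):
        subset = [ex[target] for ex in examples if ex[attr] == value]
        remainder += len(subset) / n * entropy(subset)
    return base - remainder

def build_tree(examples, attributes, target):
    labels = [ex[target] for ex in examples]
    # Stopping conditions: pure node, or no attributes left (majority vote).
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: info_gain(examples, a, target))
    node = {best: {}}
    for value in set(ex[best] for ex in examples):
        subset = [ex for ex in examples if ex[best] == value]
        node[best][value] = build_tree(subset, [a for a in attributes if a != best], target)
    return node

# Usage (hypothetical attribute names): build_tree(rows, ["age", "student", "credit_rating"], "buys_computer")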

26 Decision Tree Induction is often based on Information Theory.

27 Information

28 DT Induction When all the marbles in the bowl are mixed up, little information is given. When the marbles in the bowl are all from one class and those in the other two classes are on either side, more information is given. Use this approach with DT Induction !

29 Information/Entropy Given probabilities p1, p2, …, ps whose sum is 1, entropy is defined as H(p1, …, ps) = −Σi pi log2(pi). Entropy measures the amount of randomness, surprise, or uncertainty. The goal in classification is no surprise, i.e., entropy = 0.

30 Entropy (Figure: plots of log(1/p) and of the binary entropy H(p, 1−p).)

31 Attribute Selection: Information Gain (ID3/C4.5)
S contains si tuples of class Ci for i = {1, …, m}. The expected information required to classify an arbitrary tuple is I(s1, …, sm) = −Σi (si/s) log2(si/s). The entropy of attribute A with values {a1, a2, …, av} is E(A) = Σj [(s1j + … + smj)/s] · I(s1j, …, smj). The information gained by branching on attribute A is Gain(A) = I(s1, …, sm) − E(A).

32 Information Gain Example
Class P: buys_computer = "yes". Class N: buys_computer = "no". I(p, n) = I(9, 5) = 0.940. Compute the entropy for age: the branch "age <= 30" covers 5 of the 14 tuples, with 2 'yes' and 3 'no', so it contributes (5/14) I(2, 3); the other age branches are handled in the same way.

33 Generating Rules from a Decision Tree. Knowledge is represented in the form of IF-THEN rules; one rule is created for each path from the root to a leaf.
Each attribute-value pair along a path forms a conjunction of conditions; the leaf holds the class prediction. Rules are easier for humans to understand. Example: IF age = "<=30" AND student = "no" THEN buys_computer = "no". IF age = "<=30" AND student = "yes" THEN buys_computer = "yes". IF age = "31…40" THEN buys_computer = "yes". IF age = ">40" AND credit_rating = "excellent" THEN buys_computer = "no". IF age = ">40" AND credit_rating = "fair" THEN buys_computer = "yes".

34 Avoiding Overfitting. Overfitting: the induced tree may overfit the training data
Too many branches, some of which may reflect anomalies due to noise or outliers, lead to poor accuracy on unseen samples. Two approaches: Prepruning: halt tree construction early; do not split a node if the split would cause the goodness measure to fall below a threshold (a suitable threshold is difficult to choose). Postpruning: remove branches from a fully grown tree, yielding a sequence of progressively pruned trees; use a set of data different from the training data to decide which pruned tree is best.

35 Approaches to Determining the Final Tree Size. Split into training (2/3) and testing (1/3) sets
Used for data sets with a large number of samples. Use cross-validation, e.g., 10-fold cross-validation: divide the data set into k subsamples, use k−1 subsamples as training data and one subsample as test data (k-fold cross-validation); suitable for data sets of moderate size. Use all the data for training, but apply a statistical test (e.g., chi-square) to estimate whether expanding or pruning a node improves the overall distribution. Use the minimum description length (MDL) principle: stop growing the tree when the encoding is minimized. (A minimal cross-validation sketch follows below.)
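A minimal sketch of the k-fold index bookkeeping described above, in plain Python (the classifier construction and accuracy measurement are left as comments):

def k_fold_indices(n, k=10):
    # Split indices 0..n-1 into k folds; each fold serves once as the test set
    # while the remaining k-1 folds form the training set.
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# Example: 10-fold cross-validation over a data set of 150 tuples.
for train_idx, test_idx in k_fold_indices(150, k=10):
    pass  # build the classifier on train_idx, measure its accuracy on test_idx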

36 Information Gain example
YES: 9, NO: 5, Total = 14. P(YES) = 9/14 = 0.64, P(NO) = 5/14 = 0.36. I(Y, N) = −(9/14 · log2(9/14) + 5/14 · log2(5/14)) = 0.94. Age (<= 30): 2 YES, 3 NO; I(Y, N) = −(2/5 · log2(2/5) + 3/5 · log2(3/5)) = 0.97. Age (31..40): 4 YES, 0 NO; I(Y, N) = 0. Age (> 40): 3 YES, 2 NO; I(Y, N) = −(3/5 · log2(3/5) + 2/5 · log2(2/5)) = 0.97. E(age) = 5/14 · I(Y,N)age<=30 + 4/14 · I(Y,N)age31..40 + 5/14 · I(Y,N)age>40 = 0.694. Gain(age) = 0.94 − 0.694 = 0.246.
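The arithmetic above can be checked with a few lines of Python, using the counts from the buys_computer example (9 yes / 5 no overall, and the three age partitions):

import math

def I(*counts):
    # Expected information I(s1, ..., sm) = -sum (si/s) log2(si/s); zero counts contribute 0.
    s = sum(counts)
    return -sum(c / s * math.log2(c / s) for c in counts if c)

i_total = I(9, 5)
e_age = 5/14 * I(2, 3) + 4/14 * I(4, 0) + 5/14 * I(3, 2)
print(round(i_total, 3))          # 0.94
print(round(e_age, 3))            # 0.694
print(round(i_total - e_age, 3))  # 0.247 (the slide's 0.246 rounds the intermediate values first)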

37 Example: Error Rate
(Figure: a decision tree applied to a test set of 84 tuples; 2/3 of them (56) go down one branch and 1/3 (28) down the other, and the leaves receive the fractions shown in the Etree formula below.) The error rate of a leaf Ei = (number of misclassified tuples at the leaf) / (number of tuples reaching the leaf), e.g., E1 = 2/(6+2). The error rate of the whole tree is the weighted sum: Etree = 2/3·(1/7·E1 + 3/7·E2 + 3/7·E3) + 1/3·(3/4·E4 + 1/4·E5).

38 Example of overfitting
A node with 11 "Yes" and 9 "No" (predict: Yes) is split into one child with 3 "Yes", 0 "No" (predict: Yes) and another with 8 "Yes", 9 "No" (predict: No). Entropy before the split = −(11/20·log2(11/20) + 9/20·log2(9/20)) = 0.993. Average entropy after the split = 17/20 · [−(8/17·log2(8/17) + 9/17·log2(9/17))] = 0.848.

39 Enhancements to the Basic Decision Tree: allow continuous-valued attributes; handle missing attribute values
Dynamically define new discrete attributes that partition a continuous attribute's values into a set of discrete intervals. Handle missing attribute values by assigning the most common value, or by assigning a probability to each of the possible values. Attribute construction: create new attributes from existing, sparsely represented ones; this reduces fragmentation, repetition, and replication.

40 DT Issues Choosing Splitting Attributes
Ordering of Splitting Attributes Splits Tree Structure Stopping Criteria Training Data Pruning

41 ID3 creates the tree using information theory concepts and tries to reduce the expected number of comparisons. ID3 chooses the split attribute with the highest information gain, Gain(A) = I(s1, …, sm) − E(A), as defined above.

42 ID3 Example (Output1) Starting state entropy:
4/15 log(15/4) + 8/15 log(15/8) + 3/15 log(15/3) = 0.4384 (logarithms are base 10 in this example). Gain using gender: Female: 3/9 log(9/3) + 6/9 log(9/6) = 0.2764; Male: 1/6 log(6/1) + 2/6 log(6/2) + 3/6 log(6/3) = 0.4392. Weighted sum: (9/15)(0.2764) + (6/15)(0.4392) = 0.3415. Gain: 0.4384 − 0.3415 = 0.0969. Gain using height: 0.4384 − (2/15)(0.301) = 0.3983. Choose height as the first splitting attribute.

43 CART, C4.5, CHAID Classification And Regression Tree
Maximize diversity(before) − diversity(after). Diversity: a node containing only a single class has low diversity; entropy/information is one diversity measure (exercise: compute I(2, 3, 9)). C4.5: post-pruning. CHAID: a tree is "pruned" by halting its construction early (pre-pruning); CHAID is restricted to categorical variables, so continuous variables must be broken into ranges or replaced with classes such as high, medium, low.

44 C4.5. ID3 favors attributes with a large number of divisions
Improved version of ID3: handles missing data and continuous data, adds pruning and rule generation, and uses the GainRatio instead of plain gain: GainRatio(A) = Gain(A) / SplitInfo(A), where SplitInfo(A) = −Σj (|Sj|/|S|) log2(|Sj|/|S|).

45 CART Create Binary Tree Uses entropy
A common form of the formula to choose split point s for node t is Φ(s/t) = 2·PL·PR·Σj |P(Cj|tL) − P(Cj|tR)|, where PL and PR are the probabilities that a tuple in the training set will be on the left or right side of the tree.

46 CART Example At the start, there are six choices for split point (right branch on equality): P(Gender)=2(6/15)(9/15)(2/15 + 4/15 + 3/15)=0.224 P(1.6) = 0 P(1.7) = 2(2/15)(13/15)(0 + 8/15 + 3/15) = 0.169 P(1.8) = 2(5/15)(10/15)(4/15 + 6/15 + 3/15) = 0.385 P(1.9) = 2(9/15)(6/15)(4/15 + 2/15 + 3/15) = 0.256 P(2.0) = 2(12/15)(3/15)(4/15 + 8/15 + 3/15) = 0.32 Split at 1.8

47 Classification in Large Databases
Classification—a classical problem extensively studied by statisticians and machine learning researchers Scalability: Classifying data sets with millions of examples and hundreds of attributes with reasonable speed Why decision tree induction in data mining? relatively faster learning speed (than other classification methods) convertible to simple and easy to understand classification rules can use SQL queries for accessing databases comparable classification accuracy with other methods

48 Scalable Decision Tree Induction in Data Mining
SLIQ (EDBT'96, Mehta et al.) builds an index for each attribute; only the class list and the current attribute list reside in memory. SPRINT (VLDB'96, J. Shafer et al.) constructs an attribute-list data structure. PUBLIC (VLDB'98, Rastogi & Shim) integrates tree splitting and tree pruning: it stops growing the tree earlier. RainForest (VLDB'98, Gehrke, Ramakrishnan & Ganti) separates the scalability aspects from the criteria that determine the quality of the tree, and builds an AVC-list (attribute, value, class label).

49 Presentation of Classification Results

50 Decision Tree in SGI/MineSet 3.0

51 Tools: Decision Trees. C4.5, the "classic" decision-tree tool, developed by J. R. Quinlan; Classification Tree in Excel; EC4.5, a more efficient version of C4.5; IND, which provides Gini and C4.5 decision trees.

52 Regression Assume data fits a predefined function
Determine the best values for the regression coefficients c0, c1, …, cn. Assume an error term: y = c0 + c1x1 + … + cnxn + e. Estimate the error using the mean squared error over the training tuples: MSE = (1/|D|) Σi (yi − ŷi)², where ŷi is the value predicted by the regression function.
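A minimal least-squares sketch using NumPy (the data points below are made up):

import numpy as np

# Made-up training data for y = c0 + c1*x + error.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Design matrix with a column of ones for the intercept c0.
X = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
mse = np.mean((y - X @ coef) ** 2)   # mean squared error on the training set
print(coef, mse)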

53 Linear Regression Poor Fit

54 Classification Using Regression
Division: Use regression function to divide area into regions. Prediction: Use regression function to predict a class membership function. Input includes desired class.

55 Division

56 Prediction

57 Classification Using Distance
Place items in the class to which they are "closest". This requires a distance between an item and a class. Classes are represented by a centroid (central value), a medoid (a representative point), or individual points. Algorithm: KNN.

58 K Nearest Neighbor (KNN):
The training set includes the class labels. Examine the K items nearest to the item to be classified. The new item is placed in the class that contains the most of these K items. Classification takes O(q) per tuple, where q is the size of the training set. (A minimal sketch follows below.)
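A minimal KNN sketch using Euclidean distance and a majority vote among the K nearest training tuples (the toy height data are illustrative):

import math
from collections import Counter

def knn_classify(query, training, k=3):
    # training: list of (feature_vector, class_label) pairs.
    dist = lambda a, b: math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    neighbors = sorted(training, key=lambda item: dist(query, item[0]))[:k]
    # Place the new item in the class most common among its k nearest neighbors.
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

train = [((1.6,), "short"), ((1.9,), "tall"), ((1.88,), "tall"),
         ((1.7,), "short"), ((1.85,), "medium")]
print(knn_classify((1.83,), train, k=3))  # "tall"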

59 KNN

60 KNN Algorithm

61 Classification Using Neural Networks
Typical NN structure for classification: one output node per class; the output value is the class membership function value. Supervised learning: for each tuple in the training set, propagate it through the NN and adjust the weights on the edges to improve future classification. Algorithms: propagation, backpropagation, gradient descent.

62 NN Issues: number of source nodes; number of hidden layers; training data; number of sinks; interconnections; weights; activation functions; learning technique; when to stop learning.

63 Decision Tree vs. Neural Network

64 Propagation Tuple Input

65 NN Propagation Algorithm

66 Example Propagation

67 NN Learning Adjust weights to perform better with the associated test data. Supervised: Use feedback from knowledge of correct classification. Unsupervised: No knowledge of correct classification needed.

68 NN Supervised Learning

69 Supervised Learning. Possible error values, assuming the output from node i is yi but should be di, include |di − yi|, (di − yi)², and the mean squared error over the output nodes. Change the weights on the arcs based on the estimated error.

70 NN Backpropagation. Propagate changes to the weights backward from the output layer to the input layer. Delta Rule: Δwij = c · xij · (dj − yj), where c is the learning rate. Gradient descent: a technique to modify the weights in the graph.
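A minimal sketch of one delta-rule update for a single linear output node (the weights, inputs, and learning rate are made-up values):

# Delta rule: delta_w_ij = c * x_ij * (d_j - y_j)
def delta_rule_step(weights, inputs, desired, c=0.1):
    # One gradient-descent step for a single linear output node.
    y = sum(w * x for w, x in zip(weights, inputs))        # current output
    return [w + c * x * (desired - y) for w, x in zip(weights, inputs)]

w = [0.2, -0.1, 0.4]
print(delta_rule_step(w, inputs=[1.0, 0.5, -1.0], desired=1.0))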

71 Backpropagation

72 Backpropagation Algorithm

73 Gradient Descent

74 Gradient Descent Algorithm

75 Output Layer Learning

76 Hidden Layer Learning

77 Types of NNs Different NN structures used for different problems.
Perceptron; Self-Organizing Feature Map; Radial Basis Function Network.

78 Perceptron Perceptron is one of the simplest NNs. No hidden layers.

79 Perceptron Example Suppose: Summation: S=3x1+2x2-6
Activation: if S>0 then 1 else 0
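This perceptron can be evaluated directly; a small sketch:

def perceptron(x1, x2):
    # Summation S = 3*x1 + 2*x2 - 6, threshold activation: 1 if S > 0 else 0.
    s = 3 * x1 + 2 * x2 - 6
    return 1 if s > 0 else 0

print(perceptron(1, 2))  # S = 1 > 0, so output 1
print(perceptron(1, 1))  # S = -1, so output 0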

80 Self Organizing Feature Map (SOFM)
Also called SOM. Competitive, unsupervised learning. Modeled on how neurons work in the brain: firing impacts the firing of nearby neurons; neurons far apart inhibit each other; neurons have specific, nonoverlapping tasks. Example: Kohonen Network.

81 Kohonen Network

82 Kohonen Network Competitive Layer – viewed as 2D grid
Similarity between a competitive node and the input: input X = <x1, …, xh>, node weights <w1i, …, whi>; similarity is defined via the dot product. The competitive node most similar to the input "wins", and the winning node's weights (as well as the surrounding nodes' weights) are increased.
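A minimal sketch of this competitive step: pick the node whose weights have the largest dot product with the input and move the winner's weights toward the input (illustrative learning rate; the neighborhood update is omitted):

def kohonen_step(weights, x, rate=0.3):
    # weights: one weight vector per competitive node; x: input vector.
    dot = lambda a, b: sum(ai * bi for ai, bi in zip(a, b))
    winner = max(range(len(weights)), key=lambda i: dot(weights[i], x))
    # Nudge only the winning node's weights toward the input (neighbors omitted here).
    weights[winner] = [w + rate * (xi - w) for w, xi in zip(weights[winner], x)]
    return winner, weights

nodes = [[0.9, 0.1], [0.2, 0.8]]
print(kohonen_step(nodes, [0.1, 0.9]))  # node 1 wins and moves toward the input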

83 Radial Basis Function Network
An RBF is a function with a Gaussian shape. RBF networks have three layers: an input layer; a hidden layer with Gaussian activation functions; and an output layer with a linear activation function.

84 Radial Basis Function Network

85 Classification Using Rules
Perform classification using IF-THEN rules. Classification rule: r = <a, c>, with antecedent a and consequent c. Rules may be generated from other techniques (DT, NN) or generated directly from the data. Algorithms: Gen, RX, 1R, PRISM.

86 Generating Rules from DTs

87 Generating Rules Example

88 Generating Rules from NNs

89 1R Algorithm
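A minimal sketch of the 1R idea: for each attribute, build a one-level rule set that predicts the majority class for each attribute value, and keep the attribute whose rules make the fewest errors on the training data (illustrative code, not the exact algorithm from the slide):

from collections import Counter, defaultdict

def one_r(examples, attributes, target):
    best_attr, best_rules, best_errors = None, None, float("inf")
    for attr in attributes:
        groups = defaultdict(list)
        for ex in examples:
            groups[ex[attr]].append(ex[target])
        # For each value, predict its majority class; count the misclassifications.
        rules = {v: Counter(labels).most_common(1)[0][0] for v, labels in groups.items()}
        errors = sum(ex[target] != rules[ex[attr]] for ex in examples)
        if errors < best_errors:
            best_attr, best_rules, best_errors = attr, rules, errors
    return best_attr, best_rules

data = [{"outlook": "sunny", "play": "no"}, {"outlook": "sunny", "play": "no"},
        {"outlook": "overcast", "play": "yes"}, {"outlook": "rain", "play": "yes"}]
print(one_r(data, ["outlook"], "play"))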

90 1R Example

91 PRISM Algorithm

92 PRISM Example

93 Decision Tree vs. Rules. A tree has an implied order in which the splitting is performed and is created by looking at all classes at once. Rules have no ordering of predicates, and only one class needs to be examined to generate its rules. Question: what are the advantages and disadvantages of decision-tree methods?

