Data Mining: A Closer Look Chapter 2. 2.1 Data Mining Strategies.

Data Mining: A Closer Look Chapter 2

2.1 Data Mining Strategies

Figure 2.1 A hierarchy of data mining strategies

Classification Learning is supervised. The dependent variable is categorical. Well-defined classes. Current rather than future behavior.

Estimation Learning is supervised. The dependent variable is numeric. Well-defined classes. Current rather than future behavior.

Prediction The emphasis is on predicting future rather than current outcomes. The output attribute may be categorical or numeric.

The Cardiology Patient Dataset

A Healthy Class Rule for the Cardiology Patient Dataset IF 169 <= Maximum Heart Rate <=202 THEN Concept Class = Healthy Rule accuracy: 85.07% Rule coverage: 34.55%

A Sick Class Rule ******************************* Rules for Class Sick 93 instances ******************************* 88.00 <= maximum heart rate <= 142.00 :rule accuracy 75.81% :rule coverage 50.54%

Explanation If maximum heart rate is low, you may be at risk of having a heart attack (for prediction) If you have a heart attack, expect your maximum heart rate to decrease (for classification) A low maximum heart rate will cause you to have a heart attack (x)

A Sick Class Rule for the Cardiology Patient Dataset IF Thal = Rev & Chest Pain Type = Asymptomatic THEN Concept Class = Sick Rule accuracy: 91.14% Rule coverage: 52.17%

Unsupervised Clustering Determine if concepts can be found in the data. Evaluate the likely performance of a supervised model. Determine a best set of input attributes for supervised learning. Detect Outliers.

Market Basket Analysis Find interesting relationships among retail products. Uses association rule algorithms.

關聯法則建立依照 Agrawal and Srikant( 1994 ) 所設計的流程，並以技術的觀點來看關聯式法則的建立，基本上可以分為下列兩個步驟： 1. 在資料庫中尋找出所有可能的多數項集合 ( Large Itemsets ) ，並且這些多數項集合的支持度 ( Support Level ) 要大於所設定的最小支持度 ( Minimal Support Level ) 。 2. 利用多數項集合以產生適當的法則。例如，假設找出的多數項集合為 X 和 Y ，則可能產生一條法則為 X→Y ，同時我們亦計算當 X 發生時也發生 Y 的機率 Support( X ∩ Y ) ∕ Support( X ) ，即所謂的信賴度 ( Confidence Level ) ；若是算出的信賴度大於所設定的最小信賴度 ( Minimal Confidence Level ) ，則此條法則就可以被確立。

最小支持度設定為 50 ％，最小信賴度為 50 ％。在 4 個交易中有 2 個交易同時出現 A 、 C( 亦即 Transaction ID 為 2000 及 1000 的交易 ) ，所以我們可計算出 AC 的支持度為 = 2 / 4 ( 50 ％ ) ；同時，在 4 個交易中有 3 個交易有出現 A ( 即 Transaction ID 為 2000 、 1000 及 4000 的交易，但又出現 C 的則有 2000 及 1000) 因此其信賴度 = 2 / 3 ( 66.6 ％ ) 。由於計算出的支持度及信賴度都大於我們所設定的最小支持度及最小信賴度，所以我們說此條法則 A→C 可以被確立。

2.2 Supervised Data Mining Techniques

The Credit Card Promotion Database

A Hypothesis for the Credit Card Promotion Database A combination of one or more of the dataset attributes differentiate Acme Credit Card Company card holders who have taken advantage of the life insurance promotion and those card holders who have chosen not to participate in the promotional offer.

A Production Rule for the Credit Card Promotion Database IF Sex = Female & 19 <=Age <= 43 THEN Life Insurance Promotion = Yes Rule Accuracy: 100.00% Rule Coverage: 66.67%

Production Rules Rule accuracy is a between-class measure. Rule coverage is a within-class measure.

Neural Networks

Figure 2.2 A multilayer fully connected neural network

See CreCardPro_forNNresult.xls

Statistical Regression Life insurance promotion = 0.5909 (credit card insurance) - 0.5455 (sex) + 0.7727 See credicard_regession.xls

Bayesian Classifier 判定為 c i 類

Simplified  naive bayeser

2.3 Association Rules

An Association Rule for the Credit Card Promotion Database IF Sex = Female & Age = over40 & Credit Card Insurance = No THEN Life Insurance Promotion = Yes

2.4 Clustering Techniques

Figure 2.3 An unsupervised cluster of the credit card database Name the cluster: negative relation Name the cluster: positive Name the cluster:?

2.5 Evaluating Performance

Evaluating Supervised Learner Models

Confusion Matrix A matrix used to summarize the results of a supervised classification. Entries along the main diagonal are correct classifications. Entries other than those on the main diagonal are classification errors.

Two-Class Error Analysis

Evaluating Numeric Output Mean absolute error Mean squared error Root mean squared error

Comparing Models by Measuring Lift

Figure 2.4 Targeted vs. mass mailing

Computing Lift

Example: population 100K, Potential customer 1K 100K mail

Total 24000 Total 20000

Unsupervised Model Evaluation Refer to sec. 7.6 1. assign each cluster as a new population. (i.e., 加一欄定各群為各類 Use supervised approach to classify the population. A good result indicates a successful clustering 2. some clustering results may be not robust (i.e., k-mean), results of evaluation may not be satisfied. 3.Apply alternative measures, i.e., between- cluster attribute-value comparison.

Data Mining: A Closer Look Chapter 2. 2.1 Data Mining Strategies.

Similar presentations

Presentation on theme: "Data Mining: A Closer Look Chapter 2. 2.1 Data Mining Strategies."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Data Mining: A Closer Look Chapter 2. 2.1 Data Mining Strategies.

Similar presentations

Presentation on theme: "Data Mining: A Closer Look Chapter 2. 2.1 Data Mining Strategies."— Presentation transcript:

Similar presentations

About project

Feedback