
1 Mining Changes of Classification by Correspondence Tracing Ke Wang Senqiang Zhou Simon Fraser University {wangk, szhoua}@cs.sfu.ca Wai Chee Ada Fu Jeffrey Xu Yu The Chinese University of Hong Kong adafu@cs.cuhk.edu.hk yu@se.cuhk.edu.hk

2 The Problem
Mining changes of classification patterns as the data changes.
What we have: the old classifier and the new data.
What we want: the changes of classification characteristics in the new data.
Example: "members with a large family → shop frequently" becomes "members with a large family → shop infrequently".

3 Targets
State the changes explicitly; simply comparing the old and new classifiers does not work.
Distinguish the important changes; otherwise users could be overwhelmed.
Focus on the changes that cause the change of classification accuracy.

4 Related Work
[GGR99]: transfers the two classifiers into a common specialization; the changes are measured by the amount of work required for this transfer.
Human users hardly measure changes in this way.
It does not address the primary goal: accuracy.

5 Related Work (Cont.)
[LHHX00]: extracts changes by requiring the new classifier to be similar to the old one, e.g., by using the same splitting attributes or the same splits during decision tree construction.
This puts a severe restriction on mining important changes.

6 Our Approach: Basic Idea
For each old rule o, trace the corresponding new rules in the new classifier through the examples they both classify.
[Diagram: examples in the new data set link old rule o to new rules n1 and n2.]
Example: n1 and n2 are the corresponding rules of o.

7 The Algorithm
Input: the old classifier and the new data.
Output: the changes of classification patterns.
Step 1: Build the new classifier.
Step 2: Identify the corresponding new rules for each old rule.
Step 3: Present the changes.

8 Identify the Corresponding Rules
Given an old rule o:
Collect the examples classified by o.
For each such example, identify the new rule n that classifies it.
Characteristic change: o ⇒ {n1, n2}.
[Diagram: old rule o linked to new rules n1 and n2 through the shared examples.]
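A minimal Python sketch of this tracing step, assuming rules are modelled as predicates over an example plus a predicted class; the Rule class and the "first matching rule wins" policy are illustrative assumptions, not the paper's implementation (the paper's rules come from C4.5).

```python
from collections import defaultdict

class Rule:
    """Illustrative rule: a boolean condition over an example plus a class label."""
    def __init__(self, name, condition, label):
        self.name = name            # e.g. "o: A4=1 -> C3"
        self.condition = condition  # function: example dict -> bool
        self.label = label          # predicted class

def classifying_rule(example, rules):
    """Return the first rule whose condition covers the example (assumed policy)."""
    for r in rules:
        if r.condition(example):
            return r
    return None  # fell through to a default rule; ignored in this sketch

def trace_correspondence(old_rules, new_rules, new_data):
    """For each old rule o, collect the new rules that classify the examples o classifies."""
    correspondence = defaultdict(set)   # old rule name -> names of corresponding new rules
    for example in new_data:
        o = classifying_rule(example, old_rules)
        n = classifying_rule(example, new_rules)
        if o is not None and n is not None:
            correspondence[o.name].add(n.name)
    return correspondence
```

The resulting map gives, for each old rule o, its characteristic change o ⇒ {n1, ..., nk}.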

9 Any Important Changes?
Given a characteristic change o ⇒ {n1, ..., nk}, which changes are more important?
Hint: users are usually interested in the changes that cause the change of classification accuracy.
Basic idea: measure the importance of changes by the estimated accuracy of the rules on future data.

10 Pessimistic Error Estimation
Consider an old rule o that classifies N examples (in the new data set), E of them wrongly.
Observed error rate: E/N.
What about the error rate in the whole population?
Given a confidence level CF, the upper bound of the error rate is U_CF(N, E). Details in [Q93].
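As an illustration, U_CF(N, E) can be computed as the upper limit of a binomial confidence interval. The sketch below uses the Beta-distribution (Clopper-Pearson style) form via SciPy, which approximately reproduces the numbers on the next slide; treat this as an assumption about the exact formula, since C4.5 [Q93] uses its own closely related binomial limit.

```python
from scipy.stats import beta

def pessimistic_error(n, e, cf=0.25):
    """Upper bound U_CF(N, E) on the true error rate of a rule that
    misclassifies e of the n examples it covers.

    Computed as the exact binomial upper confidence limit via the Beta
    distribution; an approximation of C4.5's estimate, not the paper's code.
    """
    if e >= n:
        return 1.0
    return beta.ppf(1.0 - cf, e + 1, n - e)

# Approximately reproduces the figures used on slide 12:
print(round(pessimistic_error(7, 4), 2))  # ~0.75 -> U_CF(7, 4) ≈ 75%
print(round(pessimistic_error(5, 0), 2))  # ~0.24 -> U_CF(5, 0) ≈ 24%
```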

11 Estimating Quantitative Change
Consider a rule pair <o, n_i>, where o covers (N_o, E_o) and n_i covers (N_n, E_n) examples (classified, wrongly classified).
Estimate the error rates of both rules: U_CF(N_o, E_o) and U_CF(N_n, E_n).
Calculate the decrease of error: U_CF(N_o, E_o) - U_CF(N_n, E_n).
Calculate the increase of accuracy (the quantitative change): the fraction of new examples classified by both o and n_i, times the decrease of error.

12 An Example
Consider o: A4=1 → C3, (N=7, E=4)
n1: A3=1 ∧ A4=1 → C3, (N=5, E=0), 3 of its examples classified by o
n2: A3=2 ∧ A4=1 → C2, (N=6, E=0), 4 of its examples classified by o
Assume the new data set has 18 examples and CF=25%. Consider the quantitative change of o ⇒ {n1, n2}:
The estimated error rate of o: U_CF(7, 4) = 75%
The estimated error rate of n1: U_CF(5, 0) = 24%
The decrease of error rate: 75% - 24% = 51%
The increase of accuracy δ(o, n1) = (3/18) × 51% = 8.5%
The increase of accuracy δ(o, {n1, n2}) = δ(o, n1) + δ(o, n2) = 8.5% + 12% = 20.5%
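A short Python check of these numbers, reusing the Beta-based bound from the earlier sketch (repeated here so the snippet runs on its own); the δ notation and the exact bound are the same assumptions as before, so the values match the slide only up to rounding.

```python
from scipy.stats import beta

def pessimistic_error(n, e, cf=0.25):
    # Same illustrative Clopper-Pearson style upper bound as in the earlier sketch.
    return 1.0 if e >= n else beta.ppf(1.0 - cf, e + 1, n - e)

def accuracy_increase(shared, total, n_old, e_old, n_new, e_new, cf=0.25):
    """Quantitative change delta(o, n_i): fraction of examples classified by
    both rules, times the estimated decrease of error rate."""
    drop = pessimistic_error(n_old, e_old, cf) - pessimistic_error(n_new, e_new, cf)
    return (shared / total) * drop

d1 = accuracy_increase(3, 18, 7, 4, 5, 0)  # ~0.084; slide reports 8.5% after rounding
d2 = accuracy_increase(4, 18, 7, 4, 6, 0)  # ~0.120 -> 12%
print(round(d1 + d2, 3))                   # ~0.204; slide reports 20.5%
```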

13 Types of Changes
Global changes: both the characteristic change and the quantitative change are large.
Alternative changes: the characteristic change is large but the quantitative change is small.

14 Types of Changes (Cont.)
Target changes: similar characteristics but different classes (targets).
Interval changes: shifts of boundary points due to the emergence of new cutting points.

15 Experiments
Two data sets: the German Credit Data from the UCI repository [MM96] and the IPUMS Census Data [IPUMS].
Goal: verify whether our algorithm finds the important changes it is supposed to find.

16 Methodologies
For the German Credit data: plant some changes into the original data and check whether the proposed method finds them.
For the IPUMS census data: apply the proposed method to find changes across years or across different sub-populations.
Classifiers are built with the C4.5 algorithm.
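For the "build a classifier" step, a rough stand-in sketch: scikit-learn does not implement C4.5, so this uses its CART-based decision tree with entropy instead; the file name and column names are assumptions, purely for illustration.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Stand-in for the C4.5 step [Q93]: scikit-learn's CART tree with entropy.
# "german_credit.csv" and the "Credit" class column are assumed names.
df = pd.read_csv("german_credit.csv")
X = pd.get_dummies(df.drop(columns=["Credit"]))  # one-hot encode categorical attributes
y = df["Credit"]

tree = DecisionTreeClassifier(criterion="entropy", min_samples_leaf=5)
tree.fit(X, y)
print(f"training accuracy: {tree.score(X, y):.2f}")
```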

17 Summary of the German Data
Data description: 2 classes (bad and good); 20 attributes, 13 of them categorical; 1000 examples, 700 of them "good".
Changes planted: target change, interval change, etc.

18 Planting a Target Change
Consider the examples classified by the old rule: Personal-status = single-male, Foreign = no → Credit = good (23, 0).
Change planted: if Liable-people = 1, change the class from good to bad.
12 examples changed.
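For illustration, a pandas sketch of this planting step; the file name and the readable column names follow the slide and are assumptions (the raw UCI file uses coded attribute names).

```python
import pandas as pd

# Flip the class of the examples covered by the old rule that also have
# Liable-people = 1. File and column names are assumed for illustration.
df = pd.read_csv("german_credit.csv")

covered = (df["Personal-status"] == "single-male") & (df["Foreign"] == "no")
planted = covered & (df["Liable-people"] == 1)
df.loc[planted, "Credit"] = "bad"         # target change: good -> bad

print(planted.sum(), "examples changed")  # the slide reports 12
```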

19 The Changes Found
Old rule: Personal-status = single-male, Foreign = no → Credit = good
Corresponding new rules:
Liable-people > 1, Foreign = no → Credit = good (δ = 0.54%)
Personal-status = single-male, Liable-people <= 1, Foreign = no → Credit = bad (δ = 0.48%)

20 Planting an Interval Change
Consider the examples classified by the old rule: Status = 0DM, Duration > 11, Foreign = yes → Credit = bad (164, 47).
Change planted: increase the Duration value by 6 (months) for each example classified by the old rule.
164 examples changed.
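A similar hedged pandas sketch of this planting step, under the same assumed file and column names as above.

```python
import pandas as pd

# Shift the Duration of the examples covered by the old rule by 6 months.
# File and column names are assumed for illustration.
df = pd.read_csv("german_credit.csv")

covered = (df["Status"] == "0DM") & (df["Duration"] > 11) & (df["Foreign"] == "yes")
df.loc[covered, "Duration"] += 6          # interval change: move the cut point

print(covered.sum(), "examples changed")  # the slide reports 164
```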

21 The Changes Found
Old rule: Status = 0DM, Duration > 11, Foreign = yes → Credit = bad
New rule: Status = 0DM, Duration > 16, Foreign = yes → Credit = bad (δ = 1.20%)

22 Summary of the IPUMS Data
The class attribute is "vetstat" (veteran status).
3 data sets: 1970, 1980 and 1990. Each data set contains examples for several races.
The proposed method is applied to find changes across years or across different sub-populations.

23 Interesting Changes Found
A. 1970-black vs. 1990-black:
Old rule: 35 < age ≤ 54 → Vetstat = yes
New rule: 40 < age ≤ 72, sex = male → Vetstat = yes (δ = 1.20%)
B. 1990-black vs. 1990-chinese:
Old rule: 40 < age ≤ 72, sex = male → Vetstat = yes
New rule: bplg = china, incss ≤ 5748 → Vetstat = no (δ = 4.56%)

24 Conclusion
Extracting changes from potentially very dissimilar old and new classifiers by correspondence tracing.
Ranking the importance of changes.
Presenting the changes.
Experiments on real-life data sets.

25 References
[GGR99] V. Ganti, J. Gehrke, and R. Ramakrishnan. A framework for measuring changes in data characteristics. In PODS, 1999.
[IPUMS] http://www.ipums.umn.edu/
[LHHX00] B. Liu, W. Hsu, H.S. Han, and Y. Xia. Mining changes for real-life applications. In DaWaK, 2000.

26 References (Cont.)
[MM96] C.J. Merz and P. Murphy. UCI repository of machine learning databases, 1996.
[Q93] J.R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.

