
1 Systematic Data Selection to Mine Concept-Drifting Data Streams. Wei Fan. Proceedings of the 2004 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Presented 2005/01/21 by 董原賓.

2 Outline
- Issues with data streams
- Will old data really help?
- Optimal models
- Computing optimal models
- Cross-validation decision tree ensemble
- Experiments
- Conclusion

3 Issues with data streams
Concept drift: FO_i(x) denotes the optimal model at time stamp i. There is concept drift from time stamp i-1 to time stamp i if there are inconsistencies between FO_{i-1}(x) and FO_i(x).
Data sufficiency: a dataset is considered sufficient if adding more data will not increase the generalization accuracy. Determining the sufficient amount can be formidably expensive, and even if the dataset is insufficient, we still need to train a model that best fits the changing data.
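
A minimal sketch of this drift notion: drift between consecutive time stamps shows up as disagreement between the two optimal models on the same inputs. The helper name estimate_drift and the scikit-learn-style predict interface are illustrative assumptions, not anything from the paper.

```python
import numpy as np

def estimate_drift(fo_prev, fo_curr, X):
    """Fraction of inputs on which FO_{i-1}(x) and FO_i(x) disagree;
    a nonzero rate signals inconsistency, i.e., concept drift
    (hypothetical helper illustrating the definition above)."""
    return float(np.mean(fo_prev.predict(X) != fo_curr.predict(X)))
```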

4 Will old data really help? Definitions:
S_i: the most recent data chunk.
SP: the union of all previous data chunks, SP = S_1 ∪ ... ∪ S_{i-1}.
y = f(x): the underlying true model that we aim to approximate.

5 Will old data really help? (cont.)
Examples in SP fall into three types (a sketch of this partition follows the list):
1. FO_i(x) ≠ FO_{i-1}(x): the old concept conflicts with the new one, so obviously such data will only cancel out the changing concept.
2. FO_i(x) = FO_{i-1}(x) ≠ y: both models agree, but their prediction is wrong; we cannot determine whether such data will help.
3. FO_i(x) = FO_{i-1}(x) = y: both models make the correct prediction; this is the only portion that may help.
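
A sketch of the three-way partition, assuming labeled previous chunks and models exposing a predict method (both assumptions). Note that FO_i(x) is not yet known when old data must be selected; the framework below therefore uses the new-chunk model FN_i(x) in its place (step 2).

```python
import numpy as np

def partition_old_data(fo_curr, fo_prev, X_old, y_old):
    """Boolean masks for the three types of previous examples."""
    p_curr = fo_curr.predict(X_old)                  # FO_i(x)
    p_prev = fo_prev.predict(X_old)                  # FO_{i-1}(x)
    type1 = p_curr != p_prev                         # conflicting concept
    type2 = (p_curr == p_prev) & (p_curr != y_old)   # agree but wrong
    type3 = (p_curr == p_prev) & (p_curr == y_old)   # agree and correct
    return type1, type2, type3                       # only type3 may help
```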

6 Will old data really help? (cont.)

7 Optimal models
The two main dimensions of the comparison are possible data insufficiency and concept drift. Four situations:
1. The new data is sufficient by itself and there is no concept drift: the optimal model should be the one trained from the new data alone.

8 Optimal models (cont.)
2. The new data is sufficient by itself and there is concept drift: the optimal model should still be the one trained from the new data alone.
3. The new data is insufficient by itself and there is no concept drift: if the previous data is sufficient, the optimal model should be the existing model; otherwise, train a new model from the new data plus the existing data.

9 Optimal models (cont.)
4. The new data is insufficient by itself and there is concept drift: choose the previous data chunks whose concept is consistent with the new data chunk and combine them with the new data.
In practice we will usually never know whether the data is indeed sufficient. Ideally, we should compare a few sensible choices if the training cost is affordable (summarized in the sketch below).
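
A conceptual summary of the four situations as a lookup table. Neither flag is directly observable (the slides note that sufficiency is usually unknowable), which is exactly why the framework in the next slides trains and compares candidate models instead of consulting such a table.

```python
# (new_data_sufficient, concept_drift) -> ideal training data,
# per situations 1-4 above; conceptual only, not executable logic.
IDEAL_TRAINING_DATA = {
    (True,  False): "S_i alone",
    (True,  True):  "S_i alone",
    (False, False): "existing model, or S_i plus previous data",
    (False, True):  "S_i plus consistent old examples",
}
```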

10 Computing optimal models
Definitions:
D_{i-1}: the dataset that trained the most recent optimal model FO_{i-1}(x); it is collected iteratively throughout the stream mining process.
s_{i-1}: examples selected from D_{i-1}.

11 Computing optimal models (cont.)
Steps:
1. Train a model FN_i(x) from the new data chunk S_i only.
2. Select the examples from D_{i-1} on which both the newly trained model FN_i(x) and the recent optimal model FO_{i-1}(x) make the correct prediction:
s_{i-1} = { (x, y) ∈ D_{i-1} : FN_i(x) = y ∧ FO_{i-1}(x) = y }

12 Computing optimal models (cont.)
3. Train a model FN_i^+(x) from the new data plus the examples selected in the previous step (S_i ∪ s_{i-1}).
4. Update the most recent model FO_{i-1}(x) with S_i and call the result FO_{i-1}^+(x). When updating the model, keep its structure and update only its internal statistics.

13 Computing optimal models (cont.)
5. Use cross-validation to compare the accuracy of FN_i(x), FO_{i-1}(x), FN_i^+(x) and FO_{i-1}^+(x); choose the most accurate one and name it FO_i(x).
6. D_i is the training set that produced FO_i(x): one of S_i, D_{i-1}, S_i ∪ s_{i-1}, and S_i ∪ D_{i-1}.
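
An end-to-end sketch of steps 1-6 under assumptions the slides do not make: candidate models expose fit/predict, a deepcopy-plus-partial_fit stands in for the statistics-only update of step 4, and cv_accuracy abstracts the cross-validation of step 5 (the paper's own mechanism is the decision tree ensemble described next).

```python
from copy import deepcopy
import numpy as np

def compute_optimal_model(S_X, S_y, D_prev_X, D_prev_y,
                          fo_prev, make_model, cv_accuracy):
    # Step 1: FN_i(x), trained on the new chunk S_i only.
    fn = make_model().fit(S_X, S_y)

    # Step 2: s_{i-1} = old examples both FN_i and FO_{i-1} get right.
    keep = (fn.predict(D_prev_X) == D_prev_y) & \
           (fo_prev.predict(D_prev_X) == D_prev_y)
    s_X, s_y = D_prev_X[keep], D_prev_y[keep]

    # Step 3: FN_i^+(x), trained on S_i union s_{i-1}.
    fn_plus = make_model().fit(np.vstack([S_X, s_X]),
                               np.concatenate([S_y, s_y]))

    # Step 4: FO_{i-1}^+(x); partial_fit on a copy approximates the
    # "same structure, updated statistics" rule (an assumption).
    fo_plus = deepcopy(fo_prev).partial_fit(S_X, S_y)

    # Steps 5-6: cross-validate all four candidates, keep the best.
    candidates = [fn, fo_prev, fn_plus, fo_plus]
    scores = [cv_accuracy(m) for m in candidates]
    return candidates[int(np.argmax(scores))]
```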

14 Computing optimal models (cont.)
How the framework finds the optimal model under each of the four previously discussed situations:
1. The new data is sufficient by itself and there is no concept drift: conceptually FN_i(x) should be the optimal model; FO_{i-1}(x), FN_i^+(x) and FO_{i-1}^+(x) could be close matches.

15 Computing optimal models (cont.)
2. The new data is sufficient by itself and there is concept drift: FN_i(x) should be the optimal model; FN_i^+(x) could be very similar in performance to FN_i(x).
3. The new data is insufficient by itself and there is no concept drift: the optimal model should be either FO_{i-1}(x) or FO_{i-1}^+(x).
4. The new data is insufficient by itself and there is concept drift: the optimal model should be either FN_i(x) or FN_i^+(x).

16 Cross-validation decision tree ensemble
Building the decision tree ensemble:
Step 1: Sequentially scan the complete dataset once and find all features with information gain.
Step 2: When building a tree, at each node choose one of the remaining features at random.
Step 3: A branch stops growing when no more examples pass through it.

17 Cross-validation decision tree ensemble (cont.)
Rule 1: Features without information gain are never used.
Rule 2: Each discrete feature can be used at most once on a particular decision path.
Rule 3: Each continuous feature can be chosen multiple times on the same decision path.

18 Cross-validation decision tree ensemble (cont.)
Rule 4: The splitting threshold is a random value between the minimum and maximum of that feature.
Rule 5: In the training data, each example x is assigned an initial weight of 1.0. When a missing feature value is encountered, the current weight of x is distributed across the children nodes: in proportion to the prior distribution of known values if it is given, and equally among the children otherwise.
(A sketch of these construction rules follows below.)
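
A simplified sketch of one random tree built per Steps 1-3 and Rules 1-5, restricted to continuous features and complete data for brevity; the max_depth cap and all names are illustrative assumptions.

```python
import random

class Node:
    def __init__(self):
        self.feature = None     # index of the splitting feature
        self.threshold = None   # random split point (Rule 4)
        self.children = {}      # "le"/"gt" -> Node
        self.counts = {}        # class label -> accumulated weight (Rule 5)

def build_random_tree(X, y, usable, depth=0, max_depth=10):
    """usable = features with information gain (Rule 1); continuous
    features stay usable at every depth (Rule 3)."""
    node = Node()
    for label in y:                          # Rule 5: initial weight 1.0
        node.counts[label] = node.counts.get(label, 0.0) + 1.0
    if not X or not usable or depth >= max_depth:
        return node
    f = random.choice(usable)                # Step 2: random feature
    vals = [row[f] for row in X]
    node.feature = f
    node.threshold = random.uniform(min(vals), max(vals))  # Rule 4
    splits = {"le": [i for i, v in enumerate(vals) if v <= node.threshold],
              "gt": [i for i, v in enumerate(vals) if v > node.threshold]}
    for name, idx in splits.items():
        if idx:                              # Step 3: stop empty branches
            node.children[name] = build_random_tree(
                [X[i] for i in idx], [y[i] for i in idx],
                usable, depth + 1, max_depth)
    return node
```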

19 Cross-validation decision tree ensemble (cont.)
Cross-validation: let n be the size of the training set. n-fold cross-validation leaves one example x out, uses the remaining n-1 examples to train a model, and classifies the left-out example x.

20 Cross-validation decision tree ensemble (cont.)
Computing the probability for x: assume there are two class labels, fraud and non-fraud; the probability of x being fraudulent is estimated from the class weights accumulated at the leaves that x reaches.
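
A hedged reconstruction of the probability estimate, building on the Node sketch above: fraud weight over total weight at x's leaf, averaged across trees. Because the tree structure is chosen randomly and does not depend on any single example, leaving x out reduces to subtracting x's own weight from its leaf statistics instead of retraining; both the formula and this shortcut are reconstructions, not verbatim from the slides.

```python
def leaf_for(tree, x):
    """Route x down one random tree to its leaf (complete data assumed)."""
    node = tree
    while node.feature is not None:
        branch = "le" if x[node.feature] <= node.threshold else "gt"
        if branch not in node.children:
            break
        node = node.children[branch]
    return node

def p_fraud_loo(trees, x, x_label, x_weight=1.0):
    """Leave-one-out P(fraud | x): remove x's contribution from the
    leaf counts, then average fraud weight / total weight over trees."""
    probs = []
    for tree in trees:
        counts = dict(leaf_for(tree, x).counts)
        counts[x_label] = counts.get(x_label, 0.0) - x_weight  # leave x out
        total = sum(counts.values())
        if total > 0:
            probs.append(counts.get("fraud", 0.0) / total)
    return sum(probs) / len(probs) if probs else 0.5
```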

21 Experiments (streaming data)
- Synthetic data: data with drifting concepts based on a moving hyperplane; roughly half of the examples are positive and the other half negative.
- Credit card fraud data: real-life credit card transactions for cost-sensitive mining; a one-year period with 5 million transactions; investigating a fraudulent transaction costs $90.
- Donation dataset: the well-known donation dataset that first appeared in the KDDCUP'98 competition; 95,412 records with known labels and 96,367 unknown records for testing.

22 Experiments (synthetic dataset)

23 Experiments (credit card dataset)

24 Experiments (donation dataset)

25 Experiments (accuracy of cross-validation)

26 Experiments (optimal model counts)

27 Experiments (training time)

28 Experiments (data sufficiency test)

29 Conclusion
- Using old data unselectively is like gambling: only old examples consistent with the current concept can help.
- Proposed a cross-validation-based framework to choose data and compare several sensible model choices.
- Proposed an implementation of this framework using a cross-validation decision tree ensemble.

