Xiang Zhang, Feng Pan, Wei Wang, and Andrew Nobel VLDB2008 Mining Non-Redundant High Order Correlations in Binary Data.


1 Xiang Zhang, Feng Pan, Wei Wang, and Andrew Nobel VLDB2008 Mining Non-Redundant High Order Correlations in Binary Data

2 Outline: Motivation; Properties related to NIFSs; Pruning candidates by mutual information; The algorithm; Bounds based on pair-wise correlations; Bounds based on Hamming distances; Discussion

3 Motivation Example: Suppose X, Y and Z are binary features, where X and Y are disease SNPs and Z = X XOR Y is the complex disease trait. {X,Y,Z} is strongly correlated, but there is no correlation in {X,Z}, {Y,Z} or {X,Y}. Summary: The high order correlation pattern cannot be identified by examining only the pair-wise correlations. Two aspects of the desired correlation patterns: (1) the correlation involves more than two features; (2) the correlation is non-redundant, i.e., removing any feature greatly reduces the correlation.
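The XOR example can be checked numerically. The sketch below assumes X and Y are independent and uniform (so the four (X, Y) combinations are equally likely) and uses the standard multi-information C(V) = sum of marginal entropies minus joint entropy; the helper names `entropy` and `multi_information` are illustrative, not taken from the paper.

```python
from collections import Counter
from itertools import product
from math import log2

def entropy(rows, cols):
    """Empirical joint entropy (in bits) of the selected columns."""
    counts = Counter(tuple(r[c] for c in cols) for r in rows)
    n = len(rows)
    return -sum(k / n * log2(k / n) for k in counts.values())

def multi_information(rows, cols):
    """C(V): sum of marginal entropies minus the joint entropy."""
    return sum(entropy(rows, [c]) for c in cols) - entropy(rows, cols)

# Assumption: the four (X, Y) combinations are equally likely; Z = X XOR Y.
rows = [(x, y, x ^ y) for x, y in product((0, 1), repeat=2)]

# Columns: 0 = X, 1 = Y, 2 = Z.
pairwise = {p: multi_information(rows, p) for p in [(0, 1), (0, 2), (1, 2)]}
triple = multi_information(rows, (0, 1, 2))

print(pairwise)  # every pair carries 0 bits of shared information
print(triple)    # the triple carries 1 bit
```

Every pair looks independent, yet the three features together share one full bit, which is the pattern pair-wise screening misses.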

4 (Cont.) Let $\frac{H(Y) - H(Y \mid X)}{H(Y)}$ be the relative entropy reduction of Y based on X. Consider three features X, Y and Z. The relative entropy reduction of Z given X or Y alone is small, while X and Y jointly reduce the uncertainty of Z more than they do separately. This strong correlation exists only when the three features are considered together.
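As a rough illustration of the relative entropy reduction, the sketch below computes (H(T) - H(T | G)) / H(T) for a target feature T and a set of given features G on the XOR data from the motivation slide. The function name `relative_entropy_reduction` and the uniform XOR data are assumptions made for this example, and the target is assumed to be non-constant so that H(T) > 0.

```python
from collections import Counter
from itertools import product
from math import log2

def entropy(rows, cols):
    """Empirical joint entropy (in bits) of the selected columns."""
    counts = Counter(tuple(r[c] for c in cols) for r in rows)
    n = len(rows)
    return -sum(k / n * log2(k / n) for k in counts.values())

def relative_entropy_reduction(rows, target, given):
    """(H(T) - H(T | G)) / H(T), with H(T | G) = H(T, G) - H(G)."""
    h_target = entropy(rows, [target])
    h_cond = entropy(rows, [target, *given]) - entropy(rows, list(given))
    return (h_target - h_cond) / h_target

# Columns: 0 = X, 1 = Y, 2 = Z = X XOR Y (uniform X and Y assumed).
rows = [(x, y, x ^ y) for x, y in product((0, 1), repeat=2)]

alone_x = relative_entropy_reduction(rows, 2, [0])     # Z given X only
alone_y = relative_entropy_reduction(rows, 2, [1])     # Z given Y only
jointly = relative_entropy_reduction(rows, 2, [0, 1])  # Z given X and Y

print(alone_x, alone_y, jointly)
```

Here neither X nor Y alone reduces the uncertainty of Z at all, while the two together remove it completely.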

5 (Cont.) In this paper, the authors study the problem of finding non-redundant high order correlations in binary data. NIFSs (Non-redundant Interacting Feature Subsets): the features in an NIFS together have high multi-information, and all proper subsets of an NIFS have low multi-information. The computational challenges of finding NIFSs: (1) feature combinations must be enumerated to find the feature subsets that have high correlation; (2) for each such subset, all of its subsets must be checked to make sure there is no redundancy.

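6 Definition of NIFS A subset of features V is an NIFS if the following two criteria are satisfied: (1) V is an SFS (strongly correlated feature subset); (2) every proper subset of V is a WFS (weakly correlated feature subset). Ex. … is an NIFS: … is an SFS and … are WFSs, where …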
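The two criteria can be sketched as a direct check. This assumes an SFS is a subset with multi-information C(V) >= beta and a WFS is one with C(V) <= alpha; the threshold names `alpha` and `beta`, their values, and the function `is_nifs` are hypothetical choices for illustration. The data adds a fourth feature W that copies X, to show how redundancy fails the second criterion.

```python
from collections import Counter
from itertools import combinations, product
from math import log2

def entropy(rows, cols):
    """Empirical joint entropy (in bits) of the selected columns."""
    counts = Counter(tuple(r[c] for c in cols) for r in rows)
    n = len(rows)
    return -sum(k / n * log2(k / n) for k in counts.values())

def multi_information(rows, cols):
    """C(V): sum of marginal entropies minus the joint entropy."""
    return sum(entropy(rows, [c]) for c in cols) - entropy(rows, cols)

def is_nifs(rows, cols, alpha, beta):
    """True if cols is strongly correlated and every proper subset is weak."""
    if multi_information(rows, cols) < beta:
        return False  # criterion 1 fails: not an SFS
    # Singletons always have C = 0, so start at subsets of size 2.
    for size in range(2, len(cols)):
        for sub in combinations(cols, size):
            if multi_information(rows, sub) > alpha:
                return False  # criterion 2 fails: a proper subset is not a WFS
    return True

# Columns: 0 = X, 1 = Y, 2 = Z = X XOR Y, 3 = W = copy of X (assumed data).
rows = [(x, y, x ^ y, x) for x, y in product((0, 1), repeat=2)]

print(is_nifs(rows, (0, 1, 2), alpha=0.1, beta=0.9))     # {X, Y, Z}
print(is_nifs(rows, (0, 1, 2, 3), alpha=0.1, beta=0.9))  # {X, Y, Z, W}
```

{X, Y, Z} passes both criteria, while {X, Y, Z, W} is rejected because its subset {X, W} is already strongly correlated.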

7 Properties related to NIFSs (1) Downward closure property of WFSs: if a feature subset V is a WFS, then all its subsets are WFSs. Advantage: this greatly reduces the complexity of the problem. (2) Let V be an NIFS. Any proper superset of V is not an NIFS.
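One way to see why downward closure holds is the short derivation below. It assumes the multi-information is $C(V) = \sum_{X_i \in V} H(X_i) - H(V)$ and that a WFS is defined by $C(V) \le \alpha$; the threshold symbol $\alpha$ is an assumption here, since the transcript does not preserve it.

```latex
\begin{align*}
C(V \cup \{X\}) - C(V)
  &= \Bigl[\sum_{X_i \in V} H(X_i) + H(X) - H(V \cup \{X\})\Bigr]
     - \Bigl[\sum_{X_i \in V} H(X_i) - H(V)\Bigr] \\
  &= H(X) + H(V) - H(V \cup \{X\}) \\
  &= I(X; V) \;\ge\; 0 .
\end{align*}
% Adding a feature never decreases C, so for any U \subseteq V:
%   C(U) \le C(V) \le \alpha,
% i.e. every subset of a WFS is itself a WFS.
```

The same monotonicity gives the second property: a proper superset of an NIFS V contains V as a proper subset, and V is an SFS rather than a WFS (assuming the SFS threshold lies above the WFS threshold).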

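8 Pruning candidates by mutual information If a pair of features is not a WFS, i.e., the mutual information of the two features exceeds the WFS threshold, all supersets of the pair can be safely pruned. Ex. Let …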
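A minimal sketch of this pruning step follows. For a pair of features the multi-information equals the mutual information I(X; Y) = H(X) + H(Y) - H(X, Y), so one pass over all pairs identifies those that are not WFSs. The threshold name `alpha`, its value, and the helpers `non_wfs_pairs` and `survives` are hypothetical.

```python
from collections import Counter
from itertools import combinations, product
from math import log2

def entropy(rows, cols):
    """Empirical joint entropy (in bits) of the selected columns."""
    counts = Counter(tuple(r[c] for c in cols) for r in rows)
    n = len(rows)
    return -sum(k / n * log2(k / n) for k in counts.values())

def mutual_information(rows, i, j):
    """I(X_i; X_j) = H(X_i) + H(X_j) - H(X_i, X_j)."""
    return entropy(rows, [i]) + entropy(rows, [j]) - entropy(rows, [i, j])

def non_wfs_pairs(rows, n_features, alpha):
    """Pairs whose mutual information exceeds alpha (so they are not WFSs)."""
    return {
        (i, j)
        for i, j in combinations(range(n_features), 2)
        if mutual_information(rows, i, j) > alpha
    }

def survives(candidate, bad_pairs):
    """A candidate is pruned as soon as it contains any non-WFS pair."""
    return not any(p in bad_pairs for p in combinations(sorted(candidate), 2))

# Columns: 0 = X, 1 = Y, 2 = Z = X XOR Y, 3 = W = copy of X (assumed data).
rows = [(x, y, x ^ y, x) for x, y in product((0, 1), repeat=2)]
bad = non_wfs_pairs(rows, 4, alpha=0.1)

print(bad)                       # only the pair (X, W)
print(survives((0, 1, 2), bad))  # kept as a candidate
print(survives((0, 1, 3), bad))  # pruned: contains (X, W)
```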

9 Algorithm

10 Upper and lower bounds based on pair-wise correlations $h_k^{(n)} = \frac{1}{\binom{n}{k}} \sum_{S : |S| = k} \frac{H(X(S))}{k}$ is the average entropy in bits per symbol of a randomly drawn k-element subset of $\{X_1, \dots, X_n\}$.
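The average per-symbol subset entropy is the quantity in Han's inequality, which states that it is non-increasing in k. The sketch below computes it for every k on the XOR data and then uses the k = 2 case, H(V) <= n * h_2, to derive a lower bound on C(V) from pair-wise entropies alone. Whether this matches the exact bound used in the paper is not recoverable from the transcript; the function name `average_subset_entropy` is an assumption.

```python
from collections import Counter
from itertools import combinations, product
from math import log2

def entropy(rows, cols):
    """Empirical joint entropy (in bits) of the selected columns."""
    counts = Counter(tuple(r[c] for c in cols) for r in rows)
    n = len(rows)
    return -sum(k / n * log2(k / n) for k in counts.values())

def average_subset_entropy(rows, n, k):
    """Mean of H(S) / k over all k-element subsets S of the n features."""
    subsets = list(combinations(range(n), k))
    return sum(entropy(rows, s) / k for s in subsets) / len(subsets)

# Columns: 0 = X, 1 = Y, 2 = Z = X XOR Y (uniform X and Y assumed).
rows = [(x, y, x ^ y) for x, y in product((0, 1), repeat=2)]
n = 3

h = [average_subset_entropy(rows, n, k) for k in range(1, n + 1)]
print(h)  # non-increasing in k, as Han's inequality requires

# Since h_n <= h_2, H(V) <= n * h_2, which bounds C(V) from below.
sum_marginals = sum(entropy(rows, [c]) for c in range(n))
c_true = sum_marginals - entropy(rows, range(n))
c_lower = sum_marginals - n * h[1]
print(c_lower, c_true)
```

On the XOR data the pair-wise lower bound is 0, which is valid but uninformative: the pairs carry no trace of the three-way correlation.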

11 Algorithm (Cont.) Suppose that the current candidate feature subset is V. Check whether all subsets of V of size (b-a-1) are WFSs. If …, the subtree of V can be pruned. In case 2, C(V) must be calculated and all subsets of V must be checked.

12 (Cont.) If …, there is no need to calculate C(V); the algorithm proceeds directly to the subtree of V, using the adding proposition to get upper and lower bounds on the multi-information of each direct child node of V. If …, C(V) must be calculated.
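The overall search can be illustrated with a much simplified sketch. It is not the paper's algorithm: it always computes C(V) directly instead of using the upper and lower bounds, and it assumes SFS means C(V) >= beta and WFS means C(V) <= alpha. The names `find_nifs`, `alpha`, `beta` and `max_size` are hypothetical. It walks a depth-first enumeration tree in which each node V is extended with higher-indexed features, and prunes the subtree of V as soon as V is not a WFS.

```python
from collections import Counter
from itertools import combinations, product
from math import log2

def entropy(rows, cols):
    """Empirical joint entropy (in bits) of the selected columns."""
    counts = Counter(tuple(r[c] for c in cols) for r in rows)
    n = len(rows)
    return -sum(k / n * log2(k / n) for k in counts.values())

def multi_information(rows, cols):
    """C(V): sum of marginal entropies minus the joint entropy."""
    return sum(entropy(rows, [c]) for c in cols) - entropy(rows, cols)

def find_nifs(rows, n, alpha, beta, max_size):
    """Depth-first enumeration with pruning; bounds are NOT used here."""
    found = []

    def visit(v):
        c = multi_information(rows, v)
        # By downward closure it is enough to test subsets of size |V| - 1.
        if len(v) >= 2 and c >= beta and all(
            multi_information(rows, s) <= alpha
            for s in combinations(v, len(v) - 1)
        ):
            found.append(v)
        # If V is not a WFS, no proper superset can be an NIFS: prune.
        if c > alpha or len(v) == max_size:
            return
        for j in range(v[-1] + 1, n):
            visit(v + (j,))

    for i in range(n):
        visit((i,))
    return found

# Columns: 0 = X, 1 = Y, 2 = Z = X XOR Y, 3 = W = copy of X (assumed data).
rows = [(x, y, x ^ y, x) for x, y in product((0, 1), repeat=2)]
result = find_nifs(rows, 4, alpha=0.1, beta=0.9, max_size=4)
print(result)
```

On this toy data the search reports {X, Y, Z}, {X, W} and {Y, Z, W}; the full set {X, Y, Z, W} is never visited because its ancestors in the tree are pruned.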

13 Discussion The paper uses an entropy-based correlation measure to address the problem of finding non-redundant interacting feature subsets.

14 (Cont.)

15 Let … and … are SFSs, … are WFSs. The definition requires that any proper subset of an NIFS be weakly correlated.

16 Adding proposition Where, Hamming distance

